+++ /dev/null
- Open Fabrics Enterprise Distribution (OFED)
- How To Build OFED 1.5.1
-
- March 2010
-
-
-==============================================================================
-Table of contents
-==============================================================================
-1. Overview
-2. Usage
-3. Requirements
-
-==============================================================================
-1. Overview
-==============================================================================
-The script "build.pl" is used to build the OFED package based on the
-OpenFabrics project. The package is built under the /tmp directory.
-
-See OFED_release_notes.txt for more details.
-
-==============================================================================
-2. Usage
-==============================================================================
-
-The build script for the OFED package can be downloaded from:
- git://git.openfabrics.org/~vlad/build.git
- branch: master
-
-Name: build.pl
-
-Usage: ./build.pl --version <version> [-r|--release]|[--daily] [-d|--distribution <distribution name>] [-v|--verbose]
- [-b|--builddir <build directory>]
- [-p|--packagesdir <packages directory>]
- [--pre-build <pre-build script>]
- [--skip-prebuild]
- [--post-build <post-build script>]
- [--skip-postbuild]
-
-Example:
-
- ./build.pl --version 1.5.1-rc1 -p packages-ofed
-
- This command will create a package (i.e., subtree) called OFED-1.5.1-rc1
- under /tmp/$USER/
-
-==============================================================================
-3. Requirements
-==============================================================================
-
-1. Git:
- Can be downloaded from:
- http://www.kernel.org/pub/software/scm/git
-
-2. Autotools:
-
- libtool-1.5.20 or higher
- autoconf-2.59 or higher
- automake-1.9.6 or higher
- m4-1.4.4 or higher
-
- The above tools can be downloaded from the following URLs:
-
- libtool - "http://ftp.gnu.org/gnu/libtool/libtool-1.5.20.tar.gz"
- autoconf - "http://ftp.gnu.org/gnu/autoconf/autoconf-2.59.tar.gz"
- automake - "http://ftp.gnu.org/gnu/automake/automake-1.9.6.tar.gz"
- m4 - "http://ftp.gnu.org/gnu/m4/m4-1.4.4.tar.gz"
-
-3. wget or ssh client
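-
-The minimum versions listed above can be checked with a short script. The
-following is a minimal sketch and is not part of build.pl: the helper names
-version_ge and check_tool are hypothetical, and it assumes GNU sort (for the
--V option) and that each tool prints its version on the first line of its
-"--version" output.
-
```shell
#!/bin/sh
# version_ge HAVE WANT: succeed when version HAVE is at least version WANT.
# Assumes GNU sort, whose -V option orders version strings.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_tool NAME MINIMUM: compare the installed version of NAME with MINIMUM.
# Assumes the first dotted number on the first line of "NAME --version"
# is the version.
check_tool() {
    tool=$1
    min=$2
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: not found (need $min or higher)"
        return 1
    fi
    have=$("$tool" --version 2>/dev/null | head -n 1 |
        grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1)
    if [ -n "$have" ] && version_ge "$have" "$min"; then
        echo "$tool: $have (ok, need $min or higher)"
    else
        echo "$tool: ${have:-unknown} (need $min or higher)"
        return 1
    fi
}

# Minimum versions taken from the requirements list above.
check_tool libtool 1.5.20 || true
check_tool autoconf 2.59 || true
check_tool automake 1.9.6 || true
check_tool m4 1.4.4 || true
```
-
-Each line of output names a tool, the version found (if any), and the
-minimum that the requirements call for.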
+++ /dev/null
-===============================================================================
- MLNX_EN driver for Mellanox Adapter Cards with 10GigE Support
- README for OFED 1.5.2
-
- December 2010
-===============================================================================
-
-Contents:
-=========
-1. Overview
-2. Ethernet Driver Usage and Configuration
-
-
-1. Overview
-===========
-The MLNX_EN driver is composed of the mlx4_core and mlx4_en kernel modules.
-
-The MLNX_EN driver release exposes the following capabilities:
-- Single/Dual port
-- Fibre Channel over Ethernet (FCoE)
-- Up to 16 Rx queues per port
-- 5 TX queues per port
-- Rx steering mode: Receive Core Affinity (RCA)
-- Tx arbitration mode: VLAN user-priority (off by default)
-- MSI-X or INTx
-- Adaptive interrupt moderation
-- HW Tx/Rx checksum calculation
-- Large Send Offload (i.e., TCP Segmentation Offload)
-- Large Receive Offload
-- IP reassembly offload for fragmented IP packets
-- Multi-core NAPI support
-- VLAN Tx/Rx acceleration (HW VLAN stripping/insertion)
-- HW VLAN filtering
-- HW multicast filtering
-- ifconfig up/down + mtu changes (up to 10K)
-- Ethtool support
-- Net device statistics
-
-
-2. Ethernet Driver Usage and Configuration
-==========================================
-
-- To assign an IP address to the interface run:
- #> ifconfig eth<x> <ip>
-
- where 'x' is the OS assigned interface number.
-
-- To check driver and device information run:
- #> ethtool -i eth<x>
-
- Example:
- #> ethtool -i eth2
- driver: mlx4_en (MT_0BD0110004)
- version: 1.5.2 (March 2010)
- firmware-version: 2.8.000
- bus-info: 0000:0e:00.0
-
-- To query stateless offload status run:
- #> ethtool -k eth<x>
-
-- To set stateless offload status run:
- #> ethtool -K eth<x> [rx on|off] [tx on|off] [sg on|off] [tso on|off]
-
-- To query interrupt coalescing settings run:
- #> ethtool -c eth<x>
-
-- By default, the driver uses adaptive interrupt moderation for the receive path,
- which adjusts the moderation time according to the traffic pattern.
- Adaptive moderation settings can be set by:
- #> ethtool -C eth<x> adaptive-rx on|off
-
-- To set interrupt coalescing settings run:
- #> ethtool -C eth<x> [rx-usecs N] [rx-frames N] [tx-usecs N] [tx-frames N]
-
- Note: usec settings correspond to the time to wait after the *last* packet
- sent/received before triggering an interrupt
-
-- To query pause frame settings run:
- #> ethtool -a eth<x>
-
-- To set pause frame settings run:
- #> ethtool -A eth<x> [rx on|off] [tx on|off]
-
-- To query ring size values run:
- #> ethtool -g eth<x>
-
-- To modify ring sizes run:
- #> ethtool -G eth<x> [rx <N>] [tx <N>]
-
-- To obtain additional device statistics, run:
- #> ethtool -S eth<x>
-
-- To perform a self diagnostics test, run:
- #> ethtool -t eth<x>
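-
-The read-only queries above can be collected into one report per interface.
-This is a minimal sketch: the function name ethtool_report and the DRY_RUN
-switch are hypothetical, and only the query options already listed in this
-section are used.
-
```shell
#!/bin/sh
# ethtool_report IFACE: run the read-only ethtool queries from this section.
# With DRY_RUN=1 (a switch invented for this sketch) the commands are
# printed instead of executed.
ethtool_report() {
    iface=$1
    # -i driver info, -k offloads, -c coalescing, -a pause frames,
    # -g ring sizes, -S statistics
    for flag in -i -k -c -a -g -S; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "ethtool $flag $iface"
        else
            echo "== ethtool $flag $iface =="
            ethtool "$flag" "$iface"
        fi
    done
}

# Print the commands for eth2 without touching the device.
DRY_RUN=1
ethtool_report eth2
```
-
-Set DRY_RUN=0 and run as root to execute the queries against a real
-interface.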
-
-
-The driver defaults to the following parameters:
-- Both ports are activated (i.e., a net device is created for each port)
-- The number of Rx rings for each port is the number of on-line CPUs
-- Per-core NAPI is enabled
-- LRO is enabled with 32 concurrent sessions per Rx ring
-
-Some of these values can be changed using module parameters, which are
-detailed by running:
-#> modinfo mlx4_en
-
-To set non-default values to module parameters, the following line should be
-added to /etc/modprobe.conf file:
- "options mlx4_en <param_name>=<value> <param_name>=<value> ..."
-
-Values of all parameters can be observed in /sys/module/mlx4_en/parameters/.
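-
-The current values can be listed in one pass. This is a minimal sketch: the
-function name show_module_params is hypothetical, and the optional directory
-argument is an addition of this example so that it can be tried on any
-directory of small text files.
-
```shell
#!/bin/sh
# show_module_params [DIR]: print every parameter file in DIR as name=value.
# DIR defaults to the mlx4_en sysfs directory named above; passing another
# directory is an addition of this sketch.
show_module_params() {
    dir=${1:-/sys/module/mlx4_en/parameters}
    if [ ! -d "$dir" ]; then
        echo "$dir not found (is the module loaded?)" >&2
        return 1
    fi
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        printf '%s=%s\n' "$(basename "$f")" "$(cat "$f" 2>/dev/null)"
    done
}

show_module_params || true
```
-
-Comparing this output before and after editing /etc/modprobe.conf and
-reloading the module shows whether the new options took effect.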
-
-
+++ /dev/null
-
- MPI in OFED 1.5.2 README
-
- September 2010
-
-===============================================================================
-Table of Contents
-===============================================================================
-1. Overview
-2. MVAPICH
-3. Open MPI
-4. MVAPICH2
-
-
-===============================================================================
-1. Overview
-===============================================================================
-Three MPI stacks are included in this release of the Open Fabrics Enterprise
-Distribution (OFED):
-- MVAPICH 1.2.0
-- Open MPI 1.4.2
-- MVAPICH2 1.5.1
-
-Setup, compilation and run information of MVAPICH, Open MPI and MVAPICH2 is
-provided below in sections 2, 3 and 4 respectively.
-
-1.1 Installation Note
----------------------
-In Step 2 of the main menu of install.pl, options 2, 3 and 4 can install
-one or more MPI stacks. Please refer to docs/OFED_Installation_Guide.txt
-to learn about the different options.
-
-The installation script allows each MPI to be compiled using one or
-more compilers. Users need to set, per MPI stack installed, the PATH
-and/or LD_LIBRARY_PATH so as to use the desired compiled MPI stack.
-
-1.2 MPI Tests
--------------
-OFED includes four basic tests that can be run against each MPI stack:
-bandwidth (bw), latency (lt), Intel MPI Benchmark, and Presta. The tests
-are located under: <prefix>/mpi/<compiler>/<mpi stack>/tests/,
-where <prefix> is /usr by default.
-
-1.3 Selecting Which MPI to Use: mpi-selector
---------------------------------------------
-Depending on how the OFED installer was run, multiple different MPI
-implementations may be installed on your system. The OFED installer
-will run an MPI selector tool during the installation process,
-presenting a menu-based interface to select which MPI implementation
-is set as the default for all users. This MPI selector tool can be
-re-run at any time by the administrator after the OFED installer
-completes to modify the site-wide default MPI implementation selection
-by invoking the "mpi-selector-menu" command (root access is typically
-required to change the site-wide default).
-
-The mpi-selector-menu command can also be used by non-administrative
-users to override the site-wide default MPI implementation selection
-by setting a per-user default. Specifically: unless a user runs the
-MPI selector tool to set a per-user default, their environment will be
-set up for the site-wide default MPI implementation.
-
-Note that the default MPI selection does *not* affect the shell from
-which the command was invoked (or any other shells that were already
-running when the MPI selector tool was invoked). The default
-selection is only changed for *new* shells started after the selector
-tool was invoked. It is recommended that once the default MPI
-implementation is changed via the selector tool, users should log out
-and log in again to ensure that they have a consistent view of the
-default MPI implementation. Other tools can be used to change the MPI
-environment in the current shell, such as the environment modules
-software package (which is not included in the OFED software package;
-see http://modules.sourceforge.net/ for details).
-
-Note that the site-wide default is set in a file that is typically not
-on a networked file system, and is therefore specific to the host on
-which it was run. As such, it is recommended to run the
-mpi-selector-menu command on all hosts in a cluster, picking the same
-default MPI implementation on each. It may be more convenient,
-however, to use the mpi-selector command in script-based scenarios
-(such as running on every host in a cluster); mpi-selector effects all
-the same functionality as mpi-selector-menu, but is intended for
-automated environments. See the mpi-selector(1) manual page for more
-details.
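-
-For the script-based scenario described above, the per-host commands can be
-generated from a list of hosts. This is a minimal sketch that only prints
-the commands. The function name selector_cmds and the MPI name in the usage
-note are hypothetical, and the --set, --system and --yes options are
-assumed from the mpi-selector(1) manual page, so check them on your system
-before running anything.
-
```shell
#!/bin/sh
# selector_cmds HOSTFILE MPI_NAME: print one ssh command per host that would
# set the site-wide default MPI. Nothing is executed here.
# Assumption: mpi-selector accepts --set, --system and --yes as described in
# mpi-selector(1); verify this before use.
selector_cmds() {
    hostfile=$1
    mpi=$2
    while read -r host; do
        # Skip blank lines in the host list.
        [ -n "$host" ] || continue
        echo "ssh $host mpi-selector --set $mpi --system --yes"
    done < "$hostfile"
}
```
-
-For example, "selector_cmds /root/cluster openmpi_gcc-1.4.2" prints one line
-per host in /root/cluster. Review the output, then pipe it to sh as root.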
-
-Additionally, per-user defaults are set in a file in the user's $HOME
-directory. If this directory is not on a network-shared file system
-between all hosts that will be used for MPI applications, then it also
-needs to be propagated to all relevant hosts.
-
-Note: The MPI selector tool typically sets the PATH and/or
-LD_LIBRARY_PATH for a given MPI implementation. This step can, of
-course, also be performed manually by a user or on a site-wide basis.
-The MPI selector tool simply bundles up this functionality in a
-convenient set of command line tools and menus.
-
-1.4 Updating MPI Installations
-------------------------------
-Note that all of the MPI implementations included in the OFED software
-package are the versions that were available when OFED v1.5 was
-released. They have been QA tested with this version of OFED and are
-fully supported.
-
-However, note that administrators can go to the web sites of each MPI
-implementation and download / install newer versions after OFED has
-been successfully installed. There is nothing specific about the
-OFED-included MPI software packages that prohibits installing
-newer/other MPI implementations.
-
-It should be also noted that versions of MPI released after OFED v1.5
-are not supported by OFED. But since each MPI has its own release
-schedule and QA process (each of which involves testing with the OFED
-stack), it may sometimes be desirable -- or even advisable, depending
-on how old the MPI implementations are that are included in OFED -- to
-download and install a newer version of MPI.
-
-The web sites of each MPI implementation are listed below:
-
-- Open MPI: http://www.open-mpi.org/
-- MVAPICH: http://mvapich.cse.ohio-state.edu/
-- MVAPICH2: http://mvapich.cse.ohio-state.edu/overview/mvapich2/
-
-===============================================================================
-2. MVAPICH MPI
-===============================================================================
-
-This package is version 1.2.0 of the MVAPICH software package,
-and is the officially supported MPI stack for this release of OFED.
-See http://mvapich.cse.ohio-state.edu for more details.
-
-
-2.1 Setting up for MVAPICH
---------------------------
-To launch MPI jobs, the MVAPICH installation directory needs to be included
-in PATH and LD_LIBRARY_PATH. To set them, execute one of the following
-commands:
- source <prefix>/mpi/<compiler>/<mpi stack>/bin/mpivars.sh
- -- when using sh for launching MPI jobs
- or
- source <prefix>/mpi/<compiler>/<mpi stack>/bin/mpivars.csh
- -- when using csh for launching MPI jobs
-
-
-2.2 Compiling MVAPICH Applications:
------------------------------------
-***Important note***:
-A valid Fortran compiler must be present in order to build the MVAPICH MPI
-stack and tests.
-
-The default gcc-g77 Fortran compiler is provided with all RedHat Linux
-releases. SuSE distributions earlier than SuSE Linux 9.0 do not provide
-this compiler as part of the default installation.
-
-The following compilers are supported by OFED's MVAPICH package: Gcc,
-Intel, Pathscale and PGI. The install script prompts the user to choose
-the compiler with which to build the MVAPICH RPM. Note that more
-than one compiler can be selected simultaneously, if desired.
-
-For details see:
- http://mvapich.cse.ohio-state.edu/support
-
-To review the default configuration of the installation, check the default
-configuration file: <prefix>/mpi/<compiler>/<mpi stack>/etc/mvapich.conf
-
-2.3 Running MVAPICH Applications:
----------------------------------
-Requirements:
-o At least two nodes. Example: mtlm01, mtlm02
-o Machine file: Includes the list of machines. Example: /root/cluster
-o Bidirectional rsh or ssh without a password
-
-Note: ssh will be used unless -rsh is specified. In order to use
-rsh, add to the mpirun_rsh command the parameter: -rsh
-
-*** Running OSU tests ***
-
-/usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.2.0/tests/osu_benchmarks-3.1.1/osu_bw
-/usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.2.0/tests/osu_benchmarks-3.1.1/osu_latency
-/usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.2.0/tests/osu_benchmarks-3.1.1/osu_bibw
-/usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.2.0/tests/osu_benchmarks-3.1.1/osu_bcast
-
-*** Running Intel MPI Benchmark test (Full test) ***
-
-/usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.2.0/tests/IMB-3.2/IMB-MPI1
-
-*** Running Presta test ***
-
-/usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.2.0/tests/presta-1.4.0/com -o 100
-/usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.2.0/tests/presta-1.4.0/glob -o 100
-/usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.2.0/tests/presta-1.4.0/globalop
-
-
-===============================================================================
-3. Open MPI
-===============================================================================
-
-Open MPI is a next-generation MPI implementation from the Open MPI
-Project (http://www.open-mpi.org/). Version 1.4 of Open MPI is
-included in this release, which is also available directly from the
-main Open MPI web site.
-
-A working Fortran compiler is not required to build Open MPI, but some
-of the included MPI tests are written in Fortran. These tests will
-not compile/run if Open MPI is built without Fortran support.
-
-The following compilers are supported by OFED's Open MPI package: GNU,
-Pathscale, Intel, or Portland. The install script prompts the user
-for the compiler with which to build the Open MPI RPM. Note that more
-than one compiler can be selected simultaneously, if desired.
-
-Users should check the main Open MPI web site for additional
-documentation and support. (Note: The FAQ file considers OpenFabrics
-tuning among other issues.)
-
-3.1 Setting up for Open MPI
----------------------------
-Selecting to use Open MPI via the mpi-selector-menu and mpi-selector
-tools will perform all the necessary setup for users to build and run
-Open MPI applications. If you use the MPI selector tools, you can
-skip the rest of this section.
-
-If you do not wish to use the MPI selector tools, the Open MPI team
-strongly advises users to put the Open MPI installation directory in
-their PATH and LD_LIBRARY_PATH. This can be done at the system level
-if all users are going to use Open MPI. Specifically:
-
-- add <prefix>/bin to PATH
-- add <prefix>/lib to LD_LIBRARY_PATH
-
-<prefix> is the directory where the desired Open MPI instance was
-installed ("instance" refers to the compiler used for Open MPI
-compilation at install time.).
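-
-The two additions above can be wrapped in a small shell function. This is a
-minimal sketch for Bourne-style shells: the function name use_mpi is
-hypothetical, and the example prefix assumes the default gcc build location
-used elsewhere in this file.
-
```shell
#!/bin/sh
# use_mpi PREFIX: put PREFIX/bin first in PATH and PREFIX/lib first in
# LD_LIBRARY_PATH, then export both for child processes.
use_mpi() {
    prefix=$1
    PATH="$prefix/bin${PATH:+:$PATH}"
    LD_LIBRARY_PATH="$prefix/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export PATH LD_LIBRARY_PATH
}

# Example prefix; adjust it to the instance installed on your system.
use_mpi /usr/mpi/gcc/openmpi-1.4.1
echo "PATH=$PATH"
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
```
-
-Prepending rather than appending makes the chosen instance take precedence
-over any other MPI already on the search paths.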
-
-If you are using a job scheduler to launch MPI jobs (e.g., SLURM,
-Torque), setting the PATH and LD_LIBRARY_PATH is still required, but
-it does not need to be set in your shell startup files. Procedures
-for adding these values to PATH and LD_LIBRARY_PATH are described in
-detail at:
-
- http://www.open-mpi.org/faq/?category=running
-
-3.2 Open MPI Installation Support / Updates
--------------------------------------------
-The OFED package will install Open MPI with support for TCP, shared
-memory, and the OpenFabrics network stacks. No other networks are
-supported by the OFED Open MPI installation.
-
-Open MPI supports a wide variety of run-time environments. The OFED
-installer will not include support for all of them, however (e.g.,
-Torque and PBS-based environments are not supported by the
-OFED-installed Open MPI).
-
-The ompi_info command can be used to see what support was installed;
-look for plugins for your specific environment / network / etc. If
-you do not see them, the OFED installer did not include support for
-them.
-
-As described above, administrators or users can go to the Open MPI web
-site and download / install either a newer version of Open MPI (if
-available), or the same version with different configuration options
-(e.g., support for Torque / PBS-based environments).
-
-3.3 Compiling Open MPI Applications
------------------------------------
-(copied from http://www.open-mpi.org/faq/?category=mpi-apps -- see
-this web page for more details)
-
-The Open MPI team strongly recommends that you simply use Open MPI's
-"wrapper" compilers to compile your MPI applications. That is, instead
-of using (for example) gcc to compile your program, use mpicc. Open
-MPI provides a wrapper compiler for four languages:
-
- Language Wrapper compiler name
- ------------- --------------------------------
- C mpicc
- C++ mpiCC, mpicxx, or mpic++
- (note that mpiCC will not exist
- on case-insensitive file-systems)
- Fortran 77 mpif77
- Fortran 90 mpif90
- ------------- --------------------------------
-
-Note that if no Fortran 77 or Fortran 90 compilers were found when
-Open MPI was built, Fortran 77 and 90 support will automatically be
-disabled (respectively).
-
-If you expect to compile your program as:
-
- > gcc my_mpi_application.c -lmpi -o my_mpi_application
-
-Simply use the following instead:
-
- > mpicc my_mpi_application.c -o my_mpi_application
-
-Specifically: simply adding "-lmpi" to your normal compile/link
-command line *will not work*. See
-http://www.open-mpi.org/faq/?category=mpi-apps if you cannot use the
-Open MPI wrapper compilers.
-
-Note that Open MPI's wrapper compilers do not do any actual compiling
-or linking; all they do is manipulate the command line and add in all
-the relevant compiler / linker flags and then invoke the underlying
-compiler / linker (hence, the name "wrapper" compiler). More
-specifically, if you run into a compiler or linker error, check your
-source code and/or back-end compiler -- it is usually not the fault of
-the Open MPI wrapper compiler.
-
-3.4 Running Open MPI Applications:
-----------------------------------
-Open MPI uses either the "mpirun" or "mpiexec" commands to launch
-applications. If your cluster uses a resource manager (such as
-SLURM), providing a hostfile is not necessary:
-
- > mpirun -np 4 my_mpi_application
-
-If you use rsh/ssh to launch applications, they must be set up to NOT
-prompt for a password (see http://www.open-mpi.org/faq/?category=rsh
-for more details on this topic). Moreover, you need to provide a
-hostfile containing a list of hosts to run on.
-
-Example:
-
- > cat hostfile
- host1.example.com
- host2.example.com
- host3.example.com
- host4.example.com
-
- > mpirun -np 4 -hostfile hostfile my_mpi_application
- (application runs on all 4 hosts)
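-
-The process count in the example above can be derived from the hostfile
-itself. This is a minimal sketch that only prints the command: the function
-name mpirun_cmd is hypothetical, and skipping blank lines and lines that
-start with "#" is an assumption about how the hostfile is written.
-
```shell
#!/bin/sh
# mpirun_cmd HOSTFILE APP [ARGS...]: print an mpirun command that starts one
# process per host listed in HOSTFILE. Blank lines and lines starting with
# "#" are not counted (an assumption of this sketch).
mpirun_cmd() {
    hostfile=$1
    shift
    np=$(grep -Ecv '^[[:space:]]*(#|$)' "$hostfile")
    echo "mpirun -np $np -hostfile $hostfile $*"
}
```
-
-With the four-line hostfile shown above, "mpirun_cmd hostfile
-my_mpi_application" prints the same mpirun command as the example.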
-
-In the following examples, replace <N> with the number of MPI processes to
-run, and <HOSTFILE> with the filename of a valid hostfile listing the hosts
-to run on (unless you are running under a supported resource manager,
-in which case a hostfile is unnecessary).
-
-Also note that Open MPI is highly run-time tunable. There are many
-options that can be tuned to obtain optimal performance of your MPI
-applications (see the Open MPI web site / FAQ for more information:
-http://www.open-mpi.org/faq/).
-
- - <N> is an integer indicating how many MPI processes to run (e.g., 2)
- - <HOSTFILE> is the filename of a hostfile, as described above
-
-Example 1: Running the OSU bandwidth:
-
- > cd /usr/mpi/gcc/openmpi-1.4.1/tests/osu_benchmarks-3.1.1
- > mpirun -np <N> -hostfile <HOSTFILE> osu_bw
-
-Example 2: Running the Intel MPI Benchmark benchmarks:
-
- > cd /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2
- > mpirun -np <N> -hostfile <HOSTFILE> IMB-MPI1
-
- --> Note that the version of IMB-EXT that ships in this version of
- OFED contains a bug that will cause it to immediately error
- out when run with Open MPI.
-
-Example 3: Running the Presta benchmarks:
-
- > cd /usr/mpi/gcc/openmpi-1.4.1/tests/presta-1.4.0
- > mpirun -np <N> -hostfile <HOSTFILE> com -o 100
-
-NOTE: To run Open MPI over a RoCEE (RDMAoE) network, the following MCA parameter
- is required:
- --mca btl_openib_cpc_include rdmacm
-
-
-3.5 More Open MPI Information
------------------------------
-Much, much more information is available about using and tuning Open
-MPI (to include OpenFabrics-specific tunable parameters) on the Open
-MPI web site FAQ:
-
- http://www.open-mpi.org/faq/
-
-Users who cannot find the answers that they are looking for, or are
-experiencing specific problems should consult the "how to get help" web
-page for more information:
-
- http://www.open-mpi.org/community/help/
-
-
-===============================================================================
-4. MVAPICH2 MPI
-===============================================================================
-
-MVAPICH2 is an MPI-2 implementation which includes all MPI-1 features.
-It is based on MPICH2 and MVICH. MVAPICH2 provides many features including
-fault-tolerance with checkpoint-restart, RDMA_CM support, iWARP support,
-optimized collectives, on-demand connection management, multi-core optimized
-and scalable shared memory support, and memory hook with ptmalloc2 library
-support. The ADI-3-level design of MVAPICH2 supports many features including:
-MPI-2 functionalities (one-sided, collectives and data-type), multi-threading
-and all MPI-1 functionalities. It also supports a wide range of platforms
-(architecture, OS, compilers, InfiniBand adapters and iWARP adapters). More
-information can be found on the MVAPICH2 project site:
-
-http://mvapich.cse.ohio-state.edu/overview/mvapich2/
-
-A valid Fortran compiler must be present in order to build the MVAPICH2
-MPI stack and tests. The following compilers are supported by OFED's
-MVAPICH2 MPI package: gcc, intel, pgi, and pathscale. The install script
-prompts the user to choose the compiler with which to build the MVAPICH2
-MPI RPM. Note that more than one compiler can be selected simultaneously,
-if desired.
-
-The install script prompts for various MVAPICH2 build options as detailed
-below:
-
-
-- Implementation (OFA or uDAPL) [default "OFA"]
- - OFA (IB and iWARP) Options:
- - ROMIO Support [default Y]
- - Shared Library Support [default Y]
- - Checkpoint-Restart Support [default N]
- * requires an installation of BLCR and prompts for the
- BLCR installation directory location
- - uDAPL Options:
- - ROMIO Support [default Y]
- - Shared Library Support [default Y]
- - Cluster Size [default "Small"]
- - I/O Bus [default "PCI-Express"]
- - Link Speed [default "SDR"]
- - Default DAPL Provider [default ""]
- * the default provider is determined based on detected OS
-
-For non-interactive builds where no MVAPICH2 build options are stored in
-the OFED configuration file, the default settings are:
-
-Implementation: OFA
-ROMIO Support: Y
-Shared Library Support: Y
-Checkpoint-Restart Support: N
-
-
-4.1 Setting up for MVAPICH2
----------------------------
-Selecting to use MVAPICH2 via the MPI selector tools will perform
-most of the setup necessary to build and run MPI applications with
-MVAPICH2. If one does not wish to use the MPI Selector tools, using
-the following settings should be enough:
-
- - add <prefix>/bin to PATH
-
-The <prefix> above is the directory where the desired MVAPICH2
-instance was installed ("instance" refers to the path based on
-the RPM package name, including the compiler chosen during the
-install). It is also possible to source the following files
-in order to set up the proper environment:
-
-source <prefix>/bin/mpivars.sh [for Bourne based shells]
-source <prefix>/bin/mpivars.csh [for C based shells]
-
-In addition to the user environment settings handled by the MPI selector
-tools, some other system settings might need to be modified. MVAPICH2
-requires the memlock resource limit to be modified from the default
-in /etc/security/limits.conf:
-
-* soft memlock unlimited
-
-MVAPICH2 requires bidirectional rsh or ssh without a password to work.
-The default is ssh, and in this case it will be required to add the
-following line to the /etc/init.d/sshd script before sshd is started:
-
-ulimit -l unlimited
-
-It is also possible to specify a specific size in kilobytes instead
-of unlimited if desired.
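-
-Whether the limit is in effect can be checked from a shell. This is a
-minimal sketch: the function name memlock_ok and its optional minimum (in
-kilobytes) are hypothetical, and it assumes the shell's ulimit builtin
-supports the -l option.
-
```shell
#!/bin/sh
# memlock_ok VALUE [MIN_KB]: succeed when VALUE (as printed by "ulimit -l")
# is "unlimited" or a number of at least MIN_KB kilobytes. MIN_KB defaults
# to 0 and is a threshold invented for this sketch.
memlock_ok() {
    value=$1
    min_kb=${2:-0}
    case $value in
        unlimited) return 0 ;;
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$value" -ge "$min_kb" ]
}

# Report the limit seen by the current shell.
limit=$(ulimit -l 2>/dev/null || echo unknown)
if [ "$limit" = unlimited ]; then
    echo "memlock limit: unlimited"
else
    echo "memlock limit: $limit (kilobytes, if numeric)"
fi
```
-
-Because MPI processes are started through ssh, the value that matters is
-the one reported inside an ssh session, for example: ssh node1 'ulimit -l'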
-
-The MVAPICH2 OFA build requires an /etc/mv2.conf file specifying the
-IP address of an InfiniBand HCA (IPoIB) for RDMA-CM functionality
-or the IP address of an iWARP adapter for iWARP functionality if
-either of those is desired. This is not required by default, unless
-either of the following runtime environment variables is set when
-using the OFA MVAPICH2 build:
-
-RDMA-CM
--------
-MV2_USE_RDMA_CM=1
-
-iWARP
------
-MV2_USE_IWARP_MODE=1
-
-Otherwise, the OFA build will work without an /etc/mv2.conf file using
-only the Infiniband HCA directly.
-
-The MVAPICH2 uDAPL build requires an /etc/dat.conf file specifying the
-DAPL provider information. The default DAPL provider is chosen at
-build time, with a default value of "ib0"; however, it can also be
-specified at runtime by setting the following environment variable:
-
-MV2_DEFAULT_DAPL_PROVIDER=<interface>
-
-More information about MVAPICH2 can be found in the MVAPICH2 User Guide:
-
-http://mvapich.cse.ohio-state.edu/support/
-
-
-4.2 Compiling MVAPICH2 Applications
------------------------------------
-The MVAPICH2 compiler commands for each language are:
-
-Language Compiler Command
--------- ----------------
-C mpicc
-C++ mpicxx
-Fortran 77 mpif77
-Fortran 90 mpif90
-
-The system compiler commands should not be used directly. The Fortran 90
-compiler command only exists if a Fortran 90 compiler was used during the
-build process.
-
-
-4.3 Running MVAPICH2 Applications
----------------------------------
-4.3.1 Running MVAPICH2 Applications with mpirun_rsh
----------------------------------------------------
-From release 1.2, MVAPICH2 comes with a faster and more scalable startup based
-on mpirun_rsh. To launch an MPI job with mpirun_rsh, password-less ssh needs to
-be enabled across all nodes.
-
-Note: ssh will be used by default. In order to use rsh, use the -rsh option on
-the mpirun_rsh command line. For more options, see mpirun_rsh -help or the
-MVAPICH2 user guide.
-
-*** Running 4 processes on 4 nodes ***
-
-$ cat > hostfile
-node1
-node2
-node3
-node4
-$ mpirun_rsh -np 4 -hostfile hostfile /path/to/my_mpi_app
-
-*** Running OSU tests ***
-
-/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.1.1/osu_bw
-/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.1.1/osu_latency
-/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.1.1/osu_bibw
-/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.1.1/osu_bcast
-
-*** Running Intel MPI Benchmark test (Full test) ***
-
-/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/IMB-3.2/IMB-MPI1
-
-*** Running Presta test ***
-
-/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/presta-1.4.0/com -o 100
-/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/presta-1.4.0/glob -o 100
-/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/presta-1.4.0/globalop
-
-4.3.2 Running MVAPICH2 Applications with mpd and mpiexec
---------------------------------------------------------
-Launching processes in MVAPICH2 is a two step process. First, mpdboot must
-be used to launch MPD daemons on the desired hosts. Second, the mpiexec
-command is used to launch the processes. MVAPICH2 requires bidirectional
-ssh or rsh without a password. This is specified when the MPD daemons are
-launched with the mpdboot command through the --rsh command line option.
-The default is ssh. Once the processes are finished, the MPD daemons
-should be stopped with the mpdallexit command. The following example
-shows the basic procedure:
-
-4 Processes on 4 Hosts Example:
-
-$ cat >hostsfile
-node1.example.com
-node2.example.com
-node3.example.com
-node4.example.com
-
-$ mpdboot -n 4 -f ./hostsfile
-
-$ mpiexec -n 4 ./my_mpi_application
-
-$ mpdallexit
-
-It is also possible to use the mpirun command in place of mpiexec. They are
-actually the same command in MVAPICH2; however, using mpiexec is preferred.
-
-It is possible to run more processes than hosts. In this case, multiple
-processes will run on some or all of the hosts used. The following examples
-demonstrate how to run the MPI tests. The default installation prefix and
-gcc version of MVAPICH2 are shown. In each case, it is assumed that a hosts
-file has been created in the specific directory with two hosts.
-
-OSU Tests Example:
-
-$ cd /usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.1.1
-$ mpdboot -n 2 -f ./hosts
-$ mpiexec -n 2 ./osu_bcast
-$ mpiexec -n 2 ./osu_bibw
-$ mpiexec -n 2 ./osu_bw
-$ mpiexec -n 2 ./osu_latency
-$ mpdallexit
-
-Intel MPI Benchmark Example:
-
-$ cd /usr/mpi/gcc/mvapich2-1.2p1/tests/IMB-3.2
-$ mpdboot -n 2 -f ./hosts
-$ mpiexec -n 2 ./IMB-MPI1
-$ mpdallexit
-
-Presta Benchmarks Example:
-
-$ cd /usr/mpi/gcc/mvapich2-1.2p1/tests/presta-1.4.0
-$ mpdboot -n 2 -f ./hosts
-$ mpiexec -n 2 ./com -o 100
-$ mpiexec -n 2 ./glob -o 100
-$ mpiexec -n 2 ./globalop
-$ mpdallexit
+++ /dev/null
-Mellanox Technologies - www.mellanox.com
-****************************************
-
-MSTFLINT Package - Firmware Burning and Diagnostics Tools
-
-1) Overview
- This package contains a burning tool and diagnostic tools for Mellanox
- manufactured HCA/NIC cards. It also provides access to the relevant source
- code. Please see the file LICENSE for licensing details.
- This package is based on a subset of the Mellanox Firmware Tools (MFT) package.
- For full documentation of the MFT package, please refer to the downloads page
- on the Mellanox web site.
-
- ----------------------------------------------------------------------------
- NOTE:
- This burning tool should be used only with Mellanox-manufactured
- HCA/NIC cards. Using it with cards manufactured by other vendors
- may be harmful to the cards (due to different configurations).
- Using the diagnostic tools is normally safe for all HCAs/NICs.
- ----------------------------------------------------------------------------
-
-2) Package Contents
- a) mstflint source code
- b) mflash lib
- This lib provides low level Flash access through Mellanox HCAs.
- c) mtcr lib (implemented in mtcr.h file)
- This lib enables access to HCA hardware registers.
- d) mstregdump utility
- This utility dumps hardware registers from Mellanox hardware
- for later analysis by Mellanox.
- e) mstvpd
- This utility dumps the on-card VPD.
- f) mstmcra
- This debug utility reads a single word from the device configuration space.
-
-3) Installation
- a) Build the mstflint utility. This package is built using a standard
- autotools method.
-
- Example:
- > ./configure
- > make
- > make install
-
- - Run "configure --help" for custom configuration options.
- - Typically, root privileges are required to run "make install"
-
-4) Hardware Access Device Names
- The tools in this package require a device name in the command
- line. The device name is the identifier of the target CA.
- This section describes the device name formats and the HW access flow.
-
- a) The devices can be accessed by their PCI ID as displayed by lspci
- (bus:dev.fn).
- Example:
- # List all Mellanox devices
- > /sbin/lspci -d 15b3:
- 02:00.0 Ethernet controller: Mellanox Technologies MT25448 [ConnectX EN 10GigE, PCIe 2.0 2.5GT/s] (rev a0)
-
- # Use mstflint tool to query the firmware on this device
- > mstflint -d 02:00.0 q
-
- b) When the IB driver (mlx4 or mthca) is loaded, the devices can be accessed
- by their IB device name.
- Example:
- # List the IB devices
- > ibv_devinfo | grep hca_id
- hca_id: mlx4_0
-
- # Use mstvpd tool to dump the VPD of this device
- > mstvpd mlx4_0
-
- c) PCI configuration access
- In examples a and b above, the device is accessed via PCI Memory Mapping.
- The device can also be accessed by PCI configuration cycles.
- PCI configuration access is slower and less safe than memory access --
- use it only if methods a and b above do not work.
-
- To force configuration access, use device names in the following format:
- /proc/bus/pci/<bus>/<dev.fn>
-
- Example:
- # List all Mellanox devices
- > /sbin/lspci -d 15b3:
- 02:00.0 Ethernet controller: Mellanox Technologies MT25448 [ConnectX EN 10GigE, PCIe 2.0 2.5GT/s] (rev a0)
-
- # Use mstregdump to dump HW registers, using PCI config cycles
- > mstregdump /proc/bus/pci/02/00.0 > crdump.log
-
- Note: Typically, you will need root privileges for hardware access
-
- d) Accessing a multi-function device:
-
- In some configurations, the CA device identifies as a multi-function device on PCI. For example:
- > /sbin/lspci -d 15b3:
- 07:00.0 Ethernet controller: Mellanox Technologies MT26468 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0)
- 07:00.1 Ethernet controller: Mellanox Technologies MT26468 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0)
- 07:00.2 Ethernet controller: Mellanox Technologies MT26468 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0)
- 07:00.3 Ethernet controller: Mellanox Technologies MT26468 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0)
- 07:00.4 Ethernet controller: Mellanox Technologies MT26468 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0)
- 07:00.5 Ethernet controller: Mellanox Technologies MT26468 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0)
- 07:00.6 Ethernet controller: Mellanox Technologies MT26468 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0)
- 07:00.7 Ethernet controller: Mellanox Technologies MT26468 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0)
-
- These multiple "logical" devices are actually a single physical device, so firmware update or "physical"
- diagnostics should be run only on one of the functions.
-
- When the device driver is loaded, only the primary physical function of the device can be accessed.
- In Linux that would typically be function 0. This function can be accessed using memory mapping, as
- described in subsection a) above. For example:
- > mstflint -d 07:00.0 q
-
- When the device driver is not loaded, all the functions can be accessed using configuration cycles, as
- described in subsection c) above. It is recommended to use function 0 for FW update or diagnostics. For example:
- > mstflint -d /proc/bus/pci/07/00.0 q
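The device-name rules above can be sketched as a pair of small shell helpers. This is only an illustration: the function names are hypothetical and not part of the mstflint package, and the sketch assumes the bus:dev.fn form shown by lspci in the examples above (no PCI domain prefix).

```shell
#!/bin/sh
# Hypothetical helper (not part of mstflint): convert an lspci-style PCI ID
# (bus:dev.fn) into the /proc/bus/pci/<bus>/<dev.fn> form that forces
# configuration-cycle access.
pci_id_to_conf_path() {
    id=$1
    bus=${id%%:*}      # text before the first ':' is the bus
    devfn=${id#*:}     # text after the first ':' is dev.fn
    printf '/proc/bus/pci/%s/%s\n' "$bus" "$devfn"
}

# Hypothetical helper: map any function of a device to function 0 of the
# same physical device, as recommended for FW update or diagnostics.
pci_function0() {
    printf '%s.0\n' "${1%.*}"
}

# Example: function 3 of the multi-function device listed above.
pci_id_to_conf_path "$(pci_function0 07:00.3)"
# prints /proc/bus/pci/07/00.0
```

The printed path can then be passed to a tool, for example "mstflint -d /proc/bus/pci/07/00.0 q".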
-
-5) Usage (mstflint):
- Read mstflint usage. Enter "./mstflint -h" for a short help message, or
- "./mstflint -hh" for a detailed help message.
-
- Obtaining firmware files:
- If you purchased your card from Mellanox Technologies, please use the
- Mellanox website (www.mellanox.com, under 'Firmware' downloads) to
- download the firmware for your card.
- If you purchased your card from a vendor other than Mellanox, get a
- specific firmware configuration (INI) file from your HCA card vendor and
- generate the binary image.
-
- Use mstflint to burn a device according to the burning instructions in
- "mstflint -hh" and on the firmware page of the Mellanox web site.
-
-6) Usage (mstregdump):
- An internal register dump is displayed to the standard output.
- Please store it in a file for analysis by Mellanox.
-
- Example:
- > mstregdump mthca0 > dumpfile
-
-7) Usage (mstvpd):
- A VPD dump is displayed to the standard output.
- A list of keywords to dump can be supplied after the -- flag
- to apply an output filter.
-
- Examples:
- > mstvpd mlx4_0
- ID: Hawk Dual Port
- PN: MNPH29C-XTR
- EC: X2
- SN: MT1001X00749
- V0: PCIe Gen2 x8
- V1: N/A
- YA: N/A
- RW:
-
- > mstvpd mlx4_0 -- PN ID
- PN: MNPH29C-XTR
- ID: Hawk Dual Port
-
-8) Problem Reporting:
- Please collect the following information when reporting issues:
-
- uname -a
- cat /etc/issue
- cat /proc/bus/pci/devices
- mstflint -vv
- lspci
- mstflint -d 02:00.0 v
- mstflint -d 02:00.0 q
- mstvpd 02:00.0
-
-
+++ /dev/null
- Open Fabrics Enterprise Distribution (OFED)
- Version 1.5.2
- Installation Guide
-
- May 2010
-
-==============================================================================
-Table of contents
-==============================================================================
-
- 1. Overview
- 2. Contents of the OFED Distribution
- 3. Hardware and Software Requirements
- 4. How to Download and Extract the OFED Distribution
- 5. Installing OFED Software
- 6. Building OFED RPMs
- 7. IPoIB Configuration
- 8. Using SDP
- 9. Uninstalling OFED
- 10. Upgrading OFED
- 11. Configuration
- 12. Related Documentation
-
-
-==============================================================================
-1. Overview
-==============================================================================
-
-This is the OpenFabrics Enterprise Distribution (OFED) version 1.5.2
-software package supporting InfiniBand and iWARP fabrics. It is composed
-of several software modules intended for use on a computer cluster
-constructed as an InfiniBand subnet or an iWARP network.
-
-This document describes how to install the various modules and test them in
-a Linux environment.
-
-General Notes:
- 1) The install script removes all previously installed OFED packages
- and re-installs from scratch. (Note: Configuration files will not
- be removed). You will be prompted to acknowledge the deletion of
- the old packages.
-
- 2) When installing OFED on an entire [homogeneous] cluster, a common
- strategy is to install the software on one of the cluster nodes
- (perhaps on a shared file system such as NFS). The resulting RPMs,
- created under OFED-X.X.X/RPMS directory, can then be installed on all
- nodes in the cluster using any cluster-aware tools (such as pdsh).
-
-==============================================================================
-2. OFED Package Contents
-==============================================================================
-
-The OFED Distribution package generates RPMs for installing the following:
-
- o OpenFabrics core and ULPs:
- - HCA drivers (mthca, mlx4, qib, ehca)
- - iWARP driver (cxgb3, nes)
- - core
- - Upper Layer Protocols: IPoIB, SDP, SRP Initiator and target, iSER
- Initiator and target, RDS, qlgc_vnic, uDAPL and NFS-RDMA
- o OpenFabrics utilities
- - OpenSM: InfiniBand Subnet Manager
- - Diagnostic tools
- - Performance tests
- o MPI
- - OSU MVAPICH stack supporting the InfiniBand and iWARP interface
- - Open MPI stack supporting the InfiniBand and iWARP interface
- - OSU MVAPICH2 stack supporting the InfiniBand and iWARP interface
- - MPI benchmark tests (OSU BW/LAT, Intel MPI Benchmark, Presta)
- o Extra packages
- - open-iscsi: open-iscsi initiator with iSER support
- - ib-bonding: Bonding driver for IPoIB interface
- o Sources of all software modules (under conditions mentioned in the
- modules' LICENSE files)
- o Documentation
-
-==============================================================================
-3. Hardware and Software Requirements
-==============================================================================
-
-1) Server platform with InfiniBand HCA or iWARP RNIC (see OFED Distribution
- Release Notes for details)
-
-2) Linux operating system (see OFED Distribution Release Notes for details)
-
-3) Administrator privileges on your machine(s)
-
-4) Disk Space: - For Build & Installation: 300MB
- - For Installation only: 200MB
-
-5) For the OFED Distribution to compile on your machine, some software
- packages of your operating system (OS) distribution are required. These
- are listed here.
-
-OS Distribution Required Packages
---------------- ----------------------------------
-General:
-o Common to all gcc, glib, glib-devel, glibc, glibc-devel,
- glibc-devel-32bit (to build 32-bit libraries on x86_64
- and ppc64), zlib-devel, libstdc++-devel
-o RedHat, Fedora kernel-devel, rpm-build, redhat-rpm-config
-o SLES kernel-source, rpm
-
-Note: To build 32-bit libraries on x86_64 and ppc64 platforms, the 32-bit
- glibc-devel should be installed.
-
-Specific Component Requirements:
-o Mvapich a Fortran Compiler (such as gcc-g77)
-o Mvapich2 libsysfs-devel
-o Open MPI libsysfs-devel
-o ibutils tcl-8.4, tcl-devel-8.4, tk, libstdc++-devel
-o mstflint libstdc++-devel (32-bit on ppc64), gcc-c++
-o rnfs-utils krb5-devel, krb5-libs, libevent-devel,
- nfs-utils-lib-devel, openldap-devel,
- e2fsprogs-devel (on RedHat)
- krb5-devel, libevent-devel, nfsidmap-devel,
- libopenssl-devel, libblkid-devel (on SLES11)
- krb5-devel, libevent, nfsidmap, krb5, openldap2-devel,
- cyrus-sasl-devel, e2fsprogs-devel (on SLES10)
-
-Note: The installer will warn you if you attempt to compile any of the
- above packages and do not have the prerequisites installed.
- On SLES, some of the required RPMs can be found on the SLES SDK DVD.
- E.g.: libgfortran43, kernel-source, ...
-
-*** Important Note for open-iscsi users:
- Installing iSER as part of OFED installation will also install open-iscsi.
- Before installing OFED, please uninstall any open-iscsi version that may
- be installed on your machine. Installing OFED with iSER support while
- another open-iscsi version is already installed will cause the installation
- process to fail.
-
-==============================================================================
-4. How to Download and Extract the OFED Distribution
-==============================================================================
-
-1) Download the OFED-X.X.X.tgz file to your target Linux host.
-
- If this package is to be installed on a cluster, it is recommended to
- download it to an NFS shared directory.
-
-2) Extract the package using:
-
- tar xzvf OFED-X.X.X.tgz
-
-==============================================================================
-5. Installing OFED Software
-==============================================================================
-
-1) Go to the directory into which the package was extracted:
-
- cd /..../OFED-X.X.X
-
-2) Installing the OFED package must be done as root. For a
- menu-driven first build and installation, run the installer
- script:
-
- ./install.pl
-
- Interactive menus will direct you through the install process.
-
- Note: After the installer completes, information about the OFED
- installation such as the prefix, the kernel version, and
- installation parameters can be found by running
- /etc/infiniband/info.
-
- Information on the driver version and source git trees can be found
- using the ofed_info utility.
-
-
- During the interactive installation of OFED, two files are
- generated: ofed.conf and ofed_net.conf.
- ofed.conf holds the installed software modules and configuration settings
- chosen by the user. ofed_net.conf holds the IPoIB settings chosen by the
- user.
-
- If the package is installed on a cluster-shared directory, these
- files can then be used to perform an automatic, unattended
- installation of OFED on other machines in the cluster. The
- unattended installation will use the same choices as were selected
- in the interactive installation.
-
- For an automatic installation on any host, run the following:
-
- ./OFED-X.X.X/install.pl -c <path>/ofed.conf -n <path>/ofed_net.conf
-
-3) Install script usage:
-
- Usage: ./install.pl [-c <packages config_file>|--all|--hpc|--basic]
- [-n|--net <network config_file>]
-
- -c|--config <packages config_file>. Example of the config file can
- be found under docs (ofed.conf-example)
- -n|--net <network config_file> Example of the config file can be
- found under docs (ofed_net.conf-example)
- -l|--prefix Set installation prefix.
- -p|--print-available Print available packages for current platform.
- And create corresponding ofed.conf file.
- -k|--kernel <kernel version>. Default on this system: $(uname -r)
- -s|--kernel-sources <path to the kernel sources>. Default on this
- system: /lib/modules/$(uname -r)/build
- --build32 Build 32-bit libraries. Relevant for x86_64 and
- ppc64 platforms
- --without-depcheck Skip Distro's libraries check
- -v|-vv|-vvv. Set verbosity level
- -q. Set quiet - no messages will be printed
- --force Force uninstall RPM coming with Distribution
- --builddir Change build directory. Default: /var/tmp/
-
- --all|--hpc|--basic Install all, hpc, or basic packages,
- respectively
-
-Notes:
-------
-a. It is possible to rename and/or edit the ofed.conf and ofed_net.conf files.
- Thus it is possible to change user choices (observing the original format).
- See examples of ofed.conf and ofed_net.conf under OFED-X.X.X/docs.
- Run './install.pl -p' to get ofed.conf with all available packages included.
-
-b. Important note for open-iscsi users:
- Installing iSER as part of the OFED installation will also install
- open-iscsi. Before installing OFED, please uninstall any open-iscsi version
- that may be installed on your machine. Installing OFED with iSER support
- while another open-iscsi version is already installed will cause the
- installation process to fail.
-
-
-Install Process Results:
-------------------------
-
-o The OFED package is installed under <prefix> directory. Default prefix is /usr
-o The kernel modules are installed under:
- - Infiniband subsystem:
- /lib/modules/`uname -r`/updates/kernel/drivers/infiniband/
- - open-iscsi:
- /lib/modules/`uname -r`/updates/kernel/drivers/scsi/
- - Chelsio driver:
- /lib/modules/`uname -r`/updates/kernel/drivers/net/cxgb3/
- - ConnectX driver:
- /lib/modules/`uname -r`/updates/kernel/drivers/net/mlx4/
- - RDS:
- /lib/modules/`uname -r`/updates/kernel/net/rds/
- - NFSoRDMA:
- /lib/modules/`uname -r`/updates/kernel/fs/exportfs/
- /lib/modules/`uname -r`/updates/kernel/fs/lockd/
- /lib/modules/`uname -r`/updates/kernel/fs/nfs/
- /lib/modules/`uname -r`/updates/kernel/fs/nfs_common/
- /lib/modules/`uname -r`/updates/kernel/fs/nfsd/
- /lib/modules/`uname -r`/updates/kernel/net/sunrpc/
- - Bonding module:
- /lib/modules/`uname -r`/updates/kernel/drivers/net/bonding/bonding.ko
-o The package kernel include files are placed under <prefix>/src/ofa_kernel/.
- These includes should be used when building kernel modules which use
- the OpenFabrics stack. (Note that these includes, if needed, are
- "backported" to your kernel).
-o The raw package (un-backported) source files are placed under
- <prefix>/src/ofa_kernel-x.x.x
-o The script "openibd" is installed under /etc/init.d/. This script can
- be used to load and unload the software stack.
-o The directory /etc/infiniband is created with the files "info" and
- "openib.conf". The "info" script can be used to retrieve OFED
- installation information. The "openib.conf" file contains the list of
- modules that are loaded when the "openibd" script is used.
-o The file "90-ib.rules" is installed under /etc/udev/rules.d/
-o If libibverbs-utils is installed, then ofed.sh and ofed.csh are
- installed under /etc/profile.d/. These automatically update the PATH
- environment variable with <prefix>/bin. In addition, ofed.conf is
- installed under /etc/ld.so.conf.d/ to update the dynamic linker's
- run-time search path to find the InfiniBand shared libraries.
-o The file /etc/modprobe.d/ib_ipoib.conf is updated to include the following:
- - "alias ib<n> ib_ipoib" for each ib<n> interface.
-o The file /etc/modprobe.d/ib_sdp.conf is updated to include the following:
- - "alias net-pf-27 ib_sdp" for sdp.
-o If opensm is installed, the daemon opensmd is installed under /etc/init.d/
-o All verbs tests and examples are installed under <prefix>/bin and management
- utilities under <prefix>/sbin
-o ofed_info script provides information on the OFED version and git repository.
-o If iSER is included, open-iscsi user-space files will also be installed:
- - Configuration files will be installed at /etc/iscsi
- - Startup script will be installed at:
- - RedHat: /etc/init.d/iscsi
- - SuSE: /etc/init.d/open-iscsi
- - Other tools (iscsiadm, iscsid, iscsi_discovery, iscsi-iname, iscsistart)
- will be installed under /sbin.
- - Documentation will be installed under:
- - RedHat: /usr/share/doc/iscsi-initiator-utils-<version number>
- - SuSE: /usr/share/doc/packages/open-iscsi
-o man pages will be installed under /usr/share/man/.
-
-==============================================================================
-6. Building OFED RPMs
-==============================================================================
-
-1) Go to the directory into which the package was extracted:
-
- cd /..../OFED-X.X.X
-
-2) Run install.pl as explained above
- This script also builds OFED binary RPMs under OFED-X.X.X/RPMS; the sources
- are placed in OFED-X.X.X/SRPMS/.
-
- Once the install process has completed, the user may run ./install.pl on
- other machines that have the same operating system and kernel to
- install the new RPMs.
-
-Note: Depending on your hardware, the build procedure may take 30-45
- minutes. Installation, however, is a relatively short process
- (~5 minutes). A common strategy for OFED installation on large
- homogeneous clusters is to extract the tarball on a network
- file system (such as NFS), build OFED RPMs on NFS, and then run the
- installer on each node with the RPMs that were previously built.
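The cluster strategy described in the note above can be sketched as a dry run that only prints the per-node commands. The NFS path, the node names, the use of ssh as root, and the location of the two configuration files inside the OFED directory are all assumptions made for illustration; a cluster-aware tool such as pdsh could be used instead.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the unattended install command for
# each node. All paths and host names passed in are hypothetical.
print_install_cmds() {
    ofed_dir=$1
    shift
    for node in "$@"; do
        printf 'ssh root@%s "cd %s && ./install.pl -c %s/ofed.conf -n %s/ofed_net.conf"\n' \
            "$node" "$ofed_dir" "$ofed_dir" "$ofed_dir"
    done
}

# Example with an assumed NFS mount point and two assumed node names.
print_install_cmds /nfs/OFED-X.X.X node01 node02
```

Reviewing the printed commands before running them is a cheap way to catch a wrong path or host name.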
-
-==============================================================================
-7. IP-over-IB (IPoIB) Configuration
-==============================================================================
-
-Configuring IPoIB is an optional step during the installation. During
-an interactive installation, the user may choose to insert the ifcfg-ib<n>
-files. If this option is chosen, the ifcfg-ib<n> files will be
-installed under:
-
-- RedHat: /etc/sysconfig/network-scripts/
-- SuSE: /etc/sysconfig/network/
-
-Setting IPoIB Configuration:
-----------------------------
-There is no default configuration for IPoIB interfaces.
-
-One should manually specify the full IP configuration during the
-interactive installation: IP address, network address, netmask, and
-broadcast address, or use the ofed_net.conf file.
-
-For bonding settings, please see "ipoib_release_notes.txt".
-
-For unattended installations, a configuration file can be provided
-with this information. For each IPoIB interface, the configuration file
-must use one of the following approaches:
-- Fixed values for the interface
-- Values derived from the Ethernet configuration (may be useful for
- cluster configuration)
-
-Here are some examples of ofed_net.conf:
-
-# Static settings; all values provided by this file
-IPADDR_ib0=172.16.0.4
-NETMASK_ib0=255.255.0.0
-NETWORK_ib0=172.16.0.0
-BROADCAST_ib0=172.16.255.255
-ONBOOT_ib0=1
-
-# Based on eth0; each '*' will be replaced by the script with corresponding
-# octet from eth0.
-LAN_INTERFACE_ib0=eth0
-IPADDR_ib0=172.16.'*'.'*'
-NETMASK_ib0=255.255.0.0
-NETWORK_ib0=172.16.0.0
-BROADCAST_ib0=172.16.255.255
-ONBOOT_ib0=1
-
-# Based on the first eth<n> interface that is found (for n=0,1,...);
-# each '*' will be replaced by the script with corresponding octet from eth<n>.
-LAN_INTERFACE_ib0=
-IPADDR_ib0=172.16.'*'.'*'
-NETMASK_ib0=255.255.0.0
-NETWORK_ib0=172.16.0.0
-BROADCAST_ib0=172.16.255.255
-ONBOOT_ib0=1
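The '*' substitution described in the comments above can be illustrated with a short shell function. This is a sketch of the rule only, not the installer's actual code; the function name and the sample eth0 address are assumptions.

```shell
#!/bin/sh
# Illustrative only: fill each '*' octet of an IPv4 template with the octet
# at the same position in a reference (Ethernet) address.
fill_octets() {
    template=$1   # e.g. 172.16.*.*
    ref=$2        # e.g. 10.4.3.7 (assumed address of eth0)
    result=
    i=1
    while [ "$i" -le 4 ]; do
        t=$(printf '%s' "$template" | cut -d. -f"$i")
        r=$(printf '%s' "$ref" | cut -d. -f"$i")
        if [ "$t" = '*' ]; then
            t=$r      # wildcard octet: take it from the reference address
        fi
        result=${result:+$result.}$t
        i=$((i + 1))
    done
    printf '%s\n' "$result"
}

fill_octets '172.16.*.*' '10.4.3.7'
# prints 172.16.3.7
```

With this rule, every node keeps the host part of its Ethernet address on the IPoIB subnet, so one ofed_net.conf can serve the whole cluster.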
-
-
-==============================================================================
-8. Using SDP
-==============================================================================
-
-Overview:
----------
-
-Sockets Direct Protocol (SDP) is an InfiniBand byte-stream transport protocol
-that provides TCP stream semantics. Capable of utilizing InfiniBand's advanced
-protocol offload capabilities, SDP can provide lower latency, higher
-bandwidth, and lower CPU utilization than IPoIB running some sockets-based
-applications.
-
-SDP can be used by applications and improve their performance transparently
-(that is, without any recompilation). Since SDP has the same socket semantics
-as TCP, an existing application is able to run using SDP; the difference is
-that the application's TCP socket gets replaced with an SDP socket.
-
-It is also possible to configure the driver to automatically translate TCP to
-SDP based on the source IP, the destination, or the application name (See
-below).
-
-The SDP protocol is composed of a kernel module that implements the SDP as a
-new address-family/protocol-family, and a library that is used for replacing
-the TCP address family with SDP according to a policy.
-
-libsdp.so Library:
-------------------
-
-libsdp.so is a dynamically linked library, which is used for transparent
-integration of applications with SDP. The library is preloaded, and therefore
-takes precedence over glibc for certain socket calls. Thus, it can
-transparently replace the TCP socket family with SDP socket calls.
-
-The library also implements a user-level socket switch. Using a configuration
-file, the system administrator can set up the policy that selects the type of
-socket to be used. libsdp.so also has the option to allow server sockets to
-listen on both SDP and TCP interfaces. The various configurations with SDP/TCP
-sockets are explained inside the /etc/libsdp.conf file.
-
-Configuring SDP:
-----------------
-
-To load SDP upon boot, edit the file /etc/infiniband/openib.conf and set
-"SDP_LOAD=yes".
-
-Note: For the changes to take effect, run: /etc/init.d/openibd restart
-
-SDP shares the same IP addresses and interface names as IPoIB. See IPoIB
-Configuration (chapter 7).
-
-How to Know SDP Is Working:
----------------------------
-
-Since SDP is a transparent TCP replacement, it can sometimes be difficult to
-know that it is working correctly.
-To figure out whether traffic is passing through SDP or TCP, check
-/proc/net/sdpstats and monitor which counters are running.
-
-sdpnetstat:
------------
-
-The sdpnetstat program can be used to verify both that SDP is loaded and is
-being used:
-
-host1$ sdpnetstat -S
-
-This command shows all active SDP sockets using the same format as the
-traditional netstat program. Without the '-S' option, it shows all the
-information that netstat does plus SDP data.
-
-Assuming that the SDP kernel module is loaded and is being used, then the
-output of the command will show something like the following:
-
-host1$ sdpnetstat -S
-
-Proto Recv-Q Send-Q Local Address Foreign Address
-sdp 0 0 193.168.10.144:34216 193.168.10.125:12865
-sdp 0 884720 193.168.10.144:42724 193.168.10.:filenet-rmi
-
-The example output above shows two active SDP sockets and contains details
-about the connections. If the SDP kernel module is not loaded, or it is
-loaded but is not being used, then the output of the command will be something
-like the following:
-
-host1$ sdpnetstat -S
-
-Proto Recv-Q Send-Q Local Address Foreign Address
-netstat: no support for `AF INET (tcp)' on this system.
-
-To verify whether the module is loaded or not, you can use the lsmod command.
-
-Monitoring and Troubleshooting Tools:
--------------------------------------
-
-SDP has debug support for both the user space libsdp.so library and the ib_sdp
-kernel module.
-Both can be useful to understand why a TCP socket was not redirected over SDP
-and to help find problems in the SDP implementation.
-
-User-space SDP debug is controlled by options in the libsdp.conf file. You can
-also keep a local copy of the file and point to it explicitly by setting the
-following environment variable:
-
-host1$ export LIBSDP_CONFIG_FILE=<path>/libsdp.conf
-
-To obtain extensive debug information, you can modify libsdp.conf to have the
-log directive produce maximum debug output (provide the min-level flag with
-the value 1). More details in the default libsdp.conf installed by OFED.
-A non-root user can configure libsdp.so to record function calls and return values in the file
-/tmp/libsdp.log.<pid>
-
-Kernel Space SDP Debug - The SDP kernel module can log detailed trace
-information if you enable it using the 'debug_level' variable in the sysfs
-filesystem. The following command performs this:
-
-host1$ echo 1 > /sys/module/ib_sdp/debug_level
-
-Note: Depending on the operating system distribution on your machine, you may need
-an extra level, 'parameters', in the directory structure, so you may need to direct
-the echo command to /sys/module/ib_sdp/parameters/debug_level.
-
-Turning off kernel debug is done by setting the sysfs variable to zero using
-the following command:
-
-host1$ echo 0 > /sys/module/ib_sdp/debug_level
-
-To display debug information, use the dmesg command:
-
-host1$ dmesg
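Because the location of debug_level differs between distributions, a small helper can pick whichever path exists. This is a minimal sketch: the function name is hypothetical, and the optional root argument exists only so the lookup can be tried against a scratch directory.

```shell
#!/bin/sh
# Hypothetical helper: print whichever ib_sdp debug_level path exists under
# the given sysfs root (default /sys). Returns non-zero if neither exists,
# for example when the ib_sdp module is not loaded.
sdp_debug_path() {
    root=${1:-/sys}
    for p in "$root/module/ib_sdp/parameters/debug_level" \
             "$root/module/ib_sdp/debug_level"; do
        if [ -e "$p" ]; then
            printf '%s\n' "$p"
            return 0
        fi
    done
    return 1
}

# Usage (as root), left commented out so the sketch has no side effects:
#   path=$(sdp_debug_path) && echo 1 > "$path"    # enable kernel debug
#   path=$(sdp_debug_path) && echo 0 > "$path"    # disable kernel debug
```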
-
-Environment Variables:
-----------------------
-
-For the transparent integration with SDP, the following two environment
-variables are required:
-1. LD_PRELOAD - this environment variable is used to preload libsdp.so and it
- should point to the libsdp.so library. The variable should be set by the
- system administrator to libsdp.so.
-2. LIBSDP_CONFIG_FILE - this environment variable is used to configure the
- policy for replacing TCP sockets with SDP sockets. By default it points to:
- /etc/libsdp.conf
-
-Using RDMA:
------------
-
-For smaller buffers, the overhead of preparing a user buffer to be RDMA'ed is
-too big; therefore, it is more efficient to use BCopy. Larger buffers can be
-sent using RDMA (ZCopy), which lowers CPU utilization. This mode is called
-"ZCopy combined mode". When ZCopy is used, the sendmsg syscall blocks until
-the buffer is transferred to the socket's peer, and the data is copied
-directly from the user buffer at the source side to the user buffer at the
-sink side.
-
-To set the buffer-size threshold between the two methods, use the module
-parameter sdp_zcopy_thresh. This parameter can be accessed through sysfs
-(/sys/module/ib_sdp/parameters/sdp_zcopy_thresh).
-Setting it to 0 disables ZCopy.
-
-
-==============================================================================
-9. Uninstalling OFED
-==============================================================================
-
-There are two ways to uninstall OFED:
-1) Via the installation menu.
-2) Using the script ofed_uninstall.sh. The script is part of the ofed-scripts
- package.
-
-Note: The ofed_uninstall.sh script supports the --unload-modules flag, which
- executes 'openibd stop' before removing the RPMs.
-
-==============================================================================
-10. Upgrading OFED
-==============================================================================
-
-If an old OFED version is installed, it may be upgraded by installing a
-new OFED version as described in section 5. Note that if the old OFED
-version was loaded before upgrading, you need to restart OFED or reboot
-your machine in order to start the new OFED stack.
-
-==============================================================================
-11. Configuration
-==============================================================================
-
-Most of the OFED components can be configured or reconfigured after
-the installation by modifying the relevant configuration files. The
-list of the modules that will be loaded automatically upon boot can be
-found in the /etc/infiniband/openib.conf file. Other configuration
-files include:
-- SDP configuration file: /etc/libsdp.conf
-- OpenSM configuration file: /etc/ofa/opensm.conf (for RedHat)
- /etc/sysconfig/opensm (for SuSE) - should be
- created manually if required.
-- DAPL configuration file: /etc/dat.conf
-
-See packages Release Notes for more details.
-
-Note: After the installer completes, information about the OFED
- installation such as the prefix, kernel version, and
- installation parameters can be found by running
- /etc/infiniband/info.
-
-
-==============================================================================
-12. Related Documentation
-==============================================================================
-
-OFED documentation is located in the ofed-docs RPM. After
-installation the documents are located under the directory:
-/usr/share/doc/ofed-docs-x.x.x for RedHat
-/usr/share/doc/packages/ofed-docs-x.x.x for SuSE
-
-Documents list:
-
- o README.txt
- o OFED_Installation_Guide.txt
- o MPI_README.txt
- o Examples of configuration files
- o OFED_tips.txt
- o HOWTO.build_ofed
- o All release notes and README files
-
-For more information, please visit the OpenFabrics web site:
- http://www.openfabrics.org
-
-open-iscsi documentation is located at:
-- RedHat: /usr/share/doc/iscsi-initiator-utils-<version number>
-- SuSE: /usr/share/doc/packages/open-iscsi
-
-For more information, please visit the open-iscsi web site:
- http://www.open-iscsi.org
openmpi-1.4.3-1.src.rpm
2. Added RHEL6 support
+3. Added RHEL5.6 support
===============================================================================
6. Known Issues
+++ /dev/null
- Open Fabrics Enterprise Distribution (OFED)
- Performance Tests README for OFED 1.5
-
- December 2010
-
-
-
-===============================================================================
-Table of Contents
-===============================================================================
-1. Overview
-2. Notes on Testing Methodology
-3. Test Descriptions
-4. Running Tests
-
-===============================================================================
-1. Overview
-===============================================================================
-This is a collection of tests written over uverbs intended for use as a
-performance micro-benchmark. As an example, the tests can be used for
-HW or SW tuning and/or functional testing.
-
-The collection contains a set of BW and latency benchmarks such as:
-
- * Read - ib_read_bw and ib_read_lat.
- * Write - ib_write_bw, ib_write_bw_postlist and ib_write_lat.
- * Send - ib_send_bw and ib_send_lat.
- * RDMA - rdma_bw and rdma_lat.
- * Additional benchmark: ib_clock_test.
-
-Please post results/observations/bugs/remarks to the mailing lists specified below:
- * Maintainer - idos@dev.mellanox.co.il
- * OFED mailing list - ewg@lists.openfabrics.org
- or linux-rdma@vger.kernel.org
- * http://openib.org/mailman/listinfo/openib-general
-
-===============================================================================
-2. Notes on Testing Methodology
-===============================================================================
-The benchmarks specified below are tested on the following architectures:
-- i686
-- x86_64
-- ia64
-
-- The benchmark uses the CPU cycle counter to get time stamps without context
- switch. Some CPU architectures (e.g., Intel's 80486 or older PPC) do NOT
- have such capability.
-
-- The benchmark measures round-trip time but reports half of that as one-way
- latency. Thus, it may not be sufficiently accurate for asymmetrical
- configurations.
-
-- On BW benchmarks, the BW is calculated on the send side only, as it calculates
- the BW after collecting completion from the receive side.
- If using the bidirectional flag, BW is calculated on both sides.
-
-- Min/Median/Max result is reported.
- The median (vs average) is less sensitive to extreme scores.
- Typically, the "Max" value is the first value measured.
-
-- Larger samples help only marginally. The default (1000) is sufficient.
- Note that an array of cycles_t (typically unsigned long) is allocated
- once to collect samples and again to store the difference between them.
- Large sample sizes (e.g., 1 million) might expose other problems
- with the program.
-
-- The "-H" option will dump the histogram for additional statistical analysis.
- See xgraph, ygraph, r-base (http://www.r-project.org/), pspp, or other
- statistical math programs.
-
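The reporting rules in the notes above (half of the round-trip time as one-way latency, and min/median/max instead of an average) can be sketched as follows. This is only an illustration under stated assumptions: the function name, the sample values, and the CPU frequency are hypothetical and are not taken from the benchmark sources.

```python
from statistics import median


def summarize_latency(round_trip_cycles, cpu_mhz):
    """Return (min, median, max) one-way latency in microseconds.

    round_trip_cycles: round-trip times measured in CPU cycles.
    cpu_mhz: CPU frequency in MHz, i.e. cycles per microsecond (assumed known).
    """
    # One-way latency is reported as half of the measured round trip.
    one_way_usec = [cycles / cpu_mhz / 2.0 for cycles in round_trip_cycles]
    return min(one_way_usec), median(one_way_usec), max(one_way_usec)


# Hypothetical samples: the first measurement is the slow outlier.
samples = [9000, 4000, 4200, 4100, 4000]
lo, med, hi = summarize_latency(samples, cpu_mhz=2000.0)
print(lo, med, hi)
```

With these made-up samples the single slow first measurement sets the maximum but barely moves the median, which is why the median is the reported central value.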
-===============================================================================
-3. Test Descriptions
-===============================================================================
-
-rdma_lat.c latency test with RDMA write transactions
-rdma_bw.c streaming BW test with RDMA write transactions
-
-
-The following tests are mainly useful for HW/SW benchmarking.
-They are not intended as actual usage examples.
-
-send_lat.c latency test with send transactions
-send_bw.c BW test with send transactions
-write_lat.c latency test with RDMA write transactions
-write_bw.c BW test with RDMA write transactions
-read_lat.c latency test with RDMA read transactions
-read_bw.c BW test with RDMA read transactions
-
-The executable name of each test starts with the general prefix "ib_",
-e.g., ib_write_lat, except for the RDMA tests, whose executables have
-the same name as the source file without the .c suffix.
-
-Running Tests
--------------
-
-Prerequisites:
- kernel 2.6
- ib_uverbs (kernel module) matches libibverbs
- ("match" means binary compatible, but ideally of the same SVN rev)
-
-Server: ./<test name> <options>
-Client: ./<test name> <options> <server IP address>
-
- o <server address> is IPv4 or IPv6 address. You can use the IPoIB
- address if IPoIB is configured.
- o --help lists the available <options>
-
- *** IMPORTANT NOTE: The SAME OPTIONS must be passed to both server and client.
-
-
-Common Options to all tests:
- -p, --port=<port> Listen on/connect to port <port> (default: 18515).
- -m, --mtu=<mtu> MTU size (default: 1024).
- -d, --ib-dev=<dev> Use IB device <dev> (default: first device found).
- -i, --ib-port=<port> Use port <port> of IB device (default: 1).
- -o, --out=<num_of_out> Number of outstanding reads (READ tests only).
- -q, --qp=<num_of_qps> Number of QPs to use (write_bw only).
- -c, --connection=<c> Connection type: RC, UC, or UD, according to spec.
- -g, --mcg=<num_of_qps> Number of QPs in the multicast group (SEND tests only).
- -M, --MGID=<addr> <addr> as the group MGID in format '255:1:X:X:X:X:X:X:X:X:X:X:X:X:X:X'.
- -s, --size=<size> Size of message to exchange (default: 1).
- -a, --all Run sizes from 2 to 2^23.
- -t, --tx-depth=<dep> Size of tx queue (default: 50).
- -r, --rx-depth=<dep> Make rx queue bigger than tx (default 600).
- -n, --iters=<iters> Number of exchanges (at least 100, default: 1000).
- -I, --inline_size=<size> Max size of message to be sent in inline mode.
- On BW tests the default is 1; on latency tests it is 400.
- -C, --report-cycles Report times in cpu cycle units.
- -u, --qp-timeout=<timeout> QP timeout; the timeout value is 4 usec * 2^(timeout).
- Default is 14.
- -S, --sl=<sl> SL (default 0).
- -H, --report-histogram Print out all results (Default: summary only).
- Only on latency tests.
- -x, --gid-index=<index> Test uses GID with GID index taken from the command
- line (for RDMAoE the index should be 0).
- -b, --bidirectional Measure bidirectional bandwidth (default uni).
- On BW tests only (Implicit on latency tests).
- -V, --version Display version number.
- -e, --events Sleep on CQ events (default poll).
- -N, --no peak-bw Cancel peak-bw calculation (default with peak-bw)
- -F, --CPU-freq Do not fail even if cpufreq_ondemand module is loaded.
-
- *** IMPORTANT NOTE: You need to be running a Subnet Manager on the switch or
- on one of the nodes in your fabric.
-
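As a worked example of the QP timeout formula given for the -u option (4 usec * 2^timeout), the small sketch below converts the option value into microseconds. The function name is hypothetical; only the formula and the default of 14 come from the option list above.

```python
def qp_timeout_usec(timeout):
    """QP timeout in microseconds, using the formula from the -u option text."""
    return 4 * (2 ** timeout)


default_usec = qp_timeout_usec(14)
print(default_usec)           # 65536 usec
print(default_usec / 1000.0)  # about 65.5 msec
```

So the default value of 14 corresponds to roughly 65 milliseconds, and each increment of the option doubles the timeout.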
-
+++ /dev/null
-This is a release of the QLogic VNIC driver on OFED 1.4. This driver is
-currently supported on Intel x86 32 and 64 bit machines.
-Supported OS are:
-- RHEL 4 Update 4.
-- RHEL 4 Update 5.
-- RHEL 4 Update 6.
-- SLES 10.
-- SLES 10 Service Pack 1.
-- SLES 10 Service Pack 1 Update 1.
-- SLES 10 Service Pack 2.
-- RHEL 5.
-- RHEL 5 Update 1.
-- RHEL 5 Update 2.
-- vanilla 2.6.27 kernel.
-
-The VNIC driver in conjunction with the QLogic Ethernet Virtual I/O Controller
-(EVIC) provides Ethernet interfaces on a host with IB HCA(s) without the need
-for any physical Ethernet NIC.
-
-This file describes the use of the QLogic VNIC ULP service on an OFED stack
-and covers the following points:
-
-A) Creating QLogic VNIC interfaces
-B) Discovering VEx/EVIC IOCs present on the fabric using ib_qlgc_vnic_query
-C) Starting the QLogic VNIC driver and the VNIC interfaces
-D) Assigning IP addresses etc for the QLogic VNIC interfaces
-E) Information about the QLogic VNIC interfaces
-F) Deleting a specific QLogic VNIC interface
-G) Forced Failover feature for QLogic VNIC.
-H) Infiniband Quality of Service for VNIC.
-I) QLogic VNIC Dynamic Update Daemon Tool and Hot Swap support
-J) Information about creating VLAN interfaces
-K) Information about enabling IB Multicast for QLogic VNIC interface
-L) Basic Troubleshooting
-
-A) Creating QLogic VNIC interfaces
-
-The VNIC interfaces can be created with the help of
-the configuration file which must be placed at /etc/infiniband/qlgc_vnic.cfg.
-
-Please take a look at /etc/infiniband/qlgc_vnic.cfg.sample file (available also
-as part of the documentation) to see how VNIC configuration files are written.
-You can use this configuration file as the basis for creating a VNIC configuration
-file by copying it to /etc/infiniband/qlgc_vnic.cfg. Of course you will have to
-replace the IOCGUID, IOCSTRING values etc in the sample configuration file
-with those of the EVIC IOCs present on your fabric.
-
-(For backward compatibility, if this file is missing,
-/etc/infiniband/qlogic_vnic.cfg or /etc/sysconfig/ics_inic.cfg
-will be used for configuration.)
-
-Please note that using DGID of the EVIC/VEx IOC is
-recommended as it will ensure the quickest startup of the
-VNIC service. If DGID is specified then you must also
-specify the IOCGUID. More details can be found in
-the qlgc_vnic.cfg.sample file.
-
-On a host with more than one HCA installed, VNIC interfaces can be
-configured based on HCA number and port number, or on PORTGUID.
-
-B) Discovering EVIC/VEx IOCs present on the fabric using ib_qlgc_vnic_query
-
-For writing the configuration file, you will need information
-about the EVIC/VEx IOCs present on the fabric like their IOCGUID,
-IOCSTRING etc. The ib_qlgc_vnic_query tool should be used to get this
-information.
-
-When ib_qlgc_vnic_query is executed without any options, it scans through ALL
-active IB ports on the host and obtains the detailed information about all the
-EVIC/VEx IOCs reachable through each active IB port:
-
-# ib_qlgc_vnic_query
-
-HCA No = 0, HCA = mlx4_0, Port = 1, Port GUID = 0x0002c903000010f5, State = Active
-
- IO Unit Info:
- port LID: 0008
- port GID: fe8000000000000000066a11de000070
- change ID: 0003
- max controllers: 0x02
-
-
- controller[ 1]
- GUID: 00066a01de000070
- vendor ID: 00066a
- device ID: 000030
- IO class : 2000
- ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1
- service entries: 2
- service[ 0]: 1000066a00000001 / InfiniNIC.InfiniConSys.Control:01
- service[ 1]: 1000066a00000101 / InfiniNIC.InfiniConSys.Data:01
-
- IO Unit Info:
- port LID: 0009
- port GID: fe8000000000000000066a21de000070
- change ID: 0003
- max controllers: 0x02
-
-
- controller[ 2]
- GUID: 00066a02de000070
- vendor ID: 00066a
- device ID: 000030
- IO class : 2000
- ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2
- service entries: 2
- service[ 0]: 1000066a00000002 / InfiniNIC.InfiniConSys.Control:02
- service[ 1]: 1000066a00000102 / InfiniNIC.InfiniConSys.Data:02
-
-HCA No = 0, HCA = mlx4_0, Port = 2, Port GUID = 0x0002c903000010f6, State = Active
-
- IO Unit Info:
- port LID: 0008
- port GID: fe8000000000000000066a11de000070
- change ID: 0003
- max controllers: 0x02
-
-
- controller[ 1]
- GUID: 00066a01de000070
- vendor ID: 00066a
- device ID: 000030
- IO class : 2000
- ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1
- service entries: 2
- service[ 0]: 1000066a00000001 / InfiniNIC.InfiniConSys.Control:01
- service[ 1]: 1000066a00000101 / InfiniNIC.InfiniConSys.Data:01
-
- IO Unit Info:
- port LID: 0009
- port GID: fe8000000000000000066a21de000070
- change ID: 0003
- max controllers: 0x02
-
-
- controller[ 2]
- GUID: 00066a02de000070
- vendor ID: 00066a
- device ID: 000030
- IO class : 2000
- ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2
- service entries: 2
- service[ 0]: 1000066a00000002 / InfiniNIC.InfiniConSys.Control:02
- service[ 1]: 1000066a00000102 / InfiniNIC.InfiniConSys.Data:02
-
-HCA No = 1, HCA = mlx4_1, Port = 1, Port GUID = 0x0002c90300000785, State = Down
-
- Port State is Down. Skipping search of DM nodes on this port.
-
-HCA No = 1, HCA = mlx4_1, Port = 2, Port GUID = 0x0002c90300000786, State = Active
-
- IO Unit Info:
- port LID: 0008
- port GID: fe8000000000000000066a11de000070
- change ID: 0003
- max controllers: 0x02
-
-
- controller[ 1]
- GUID: 00066a01de000070
- vendor ID: 00066a
- device ID: 000030
- IO class : 2000
- ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1
- service entries: 2
- service[ 0]: 1000066a00000001 / InfiniNIC.InfiniConSys.Control:01
- service[ 1]: 1000066a00000101 / InfiniNIC.InfiniConSys.Data:01
-
- IO Unit Info:
- port LID: 0009
- port GID: fe8000000000000000066a21de000070
- change ID: 0003
- max controllers: 0x02
-
-
- controller[ 2]
- GUID: 00066a02de000070
- vendor ID: 00066a
- device ID: 000030
- IO class : 2000
- ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2
- service entries: 2
- service[ 0]: 1000066a00000002 / InfiniNIC.InfiniConSys.Control:02
- service[ 1]: 1000066a00000102 / InfiniNIC.InfiniConSys.Data:02
-
-This output shows the network administrator the HCA/port information of the
-host along with the EVIC IOCs reachable through each IB port on the fabric.
-When ib_qlgc_vnic_query is run with the -e option, it reports the IOCGUID
-information, and with the -s option it reports the IOCSTRING information for
-the EVIC/VEx IOCs present on the fabric:
-
-# ib_qlgc_vnic_query -e
-
-HCA No = 0, HCA = mlx4_0, Port = 1, Port GUID = 0x0002c903000010f5, State = Active
-
- ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff
- ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff
-HCA No = 0, HCA = mlx4_0, Port = 2, Port GUID = 0x0002c903000010f6, State = Active
-
- ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff
- ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff
-HCA No = 1, HCA = mlx4_1, Port = 1, Port GUID = 0x0002c90300000785, State = Down
-
- Port State is Down. Skipping search of DM nodes on this port.
-
-HCA No = 1, HCA = mlx4_1, Port = 2, Port GUID = 0x0002c90300000786, State = Active
-
- ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff
- ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff
-
-# ib_qlgc_vnic_query -s
-
-HCA No = 0, HCA = mlx4_0, Port = 1, Port GUID = 0x0002c903000010f5, State = Active
-
-"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1"
-"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2"
-HCA No = 0, HCA = mlx4_0, Port = 2, Port GUID = 0x0002c903000010f6, State = Active
-
-"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1"
-"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2"
-HCA No = 1, HCA = mlx4_1, Port = 1, Port GUID = 0x0002c90300000785, State = Down
-
- Port State is Down. Skipping search of DM nodes on this port.
-
-HCA No = 1, HCA = mlx4_1, Port = 2, Port GUID = 0x0002c90300000786, State = Active
-
-"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1"
-"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2"
-
-# ib_qlgc_vnic_query -es
-
-HCA No = 0, HCA = mlx4_0, Port = 1, Port GUID = 0x0002c903000010f5, State = Active
-
- ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1"
- ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2"
-HCA No = 0, HCA = mlx4_0, Port = 2, Port GUID = 0x0002c903000010f6, State = Active
-
- ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1"
- ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2"
-HCA No = 1, HCA = mlx4_1, Port = 1, Port GUID = 0x0002c90300000785, State = Down
-
- Port State is Down. Skipping search of DM nodes on this port.
-
-HCA No = 1, HCA = mlx4_1, Port = 2, Port GUID = 0x0002c90300000786, State = Active
-
- ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1"
- ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2"
-
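When writing the configuration file it can be convenient to pull the IOCGUID, DGID and IOCSTRING values out of the query output programmatically. The following is a minimal sketch that parses lines in the format shown above; the regular expression and function name are assumptions based only on that sample output, not on any documented output contract of the tool.

```python
import re

# Matches the "-e" and "-es" line formats shown in the sample output.
# The quoted IOC string is optional because "-e" alone does not print it.
IOC_LINE_RE = re.compile(
    r'ioc_guid=(?P<ioc_guid>[0-9a-f]+),'
    r'dgid=(?P<dgid>[0-9a-f]+),'
    r'pkey=(?P<pkey>[0-9a-f]+)'
    r'(?:,"(?P<ioc_string>[^"]*)")?'
)


def parse_ioc_lines(text):
    """Return one dict per IOC line found in ib_qlgc_vnic_query output."""
    iocs = []
    for line in text.splitlines():
        match = IOC_LINE_RE.search(line)
        if match:
            iocs.append(match.groupdict())
    return iocs


sample = '''
HCA No = 0, HCA = mlx4_0, Port = 1, Port GUID = 0x0002c903000010f5, State = Active

 ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1"
 ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2"
'''

iocs = parse_ioc_lines(sample)
for ioc in iocs:
    print(ioc["ioc_guid"], ioc["dgid"], ioc["ioc_string"])
```

The IOC string is matched as a quoted field because it contains commas itself, so a plain split on "," would break it apart.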
-ib_qlgc_vnic_query can be used to discover EVIC IOCs on the fabric based on
-umad device, HCA no/Port no and PORTGUID as follows:
-
-For umad devices, it takes the name of the umad device mentioned with '-d'
-option:
-
-# ib_qlgc_vnic_query -es -d /dev/infiniband/umad0
-
-HCA No = 0, HCA = mlx4_0, Port = 1
-
- ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1"
- ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2"
-
-If the name of the HCA and its port no is known, then ib_qlgc_vnic_query can
-make use of this information to discover EVIC IOCs on the fabric. HCA name
-and port no is specified with '-C' and '-P' options respectively.
-
-# ib_qlgc_vnic_query -es -C mlx4_1 -P 2
-
- ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1"
- ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2"
-
-If the HCA name is not specified but the port number is, HCA 0 is selected
-as the default HCA to discover IOCs. If the port number is missing, Port 1
-of the specified HCA is used to discover the IOCs. If both are missing,
-the default behaviour applies and ib_qlgc_vnic_query scans all the
-IB ports on the host to discover IOCs reachable through each one of them.
-
-PORTGUID information about the IB ports on given host can be obtained using
-the option '-L':
-
-# ib_qlgc_vnic_query -L
-
-0,mlx4_0,1,0x0002c903000010f5
-0,mlx4_0,2,0x0002c903000010f6
-1,mlx4_1,1,0x0002c90300000785
-1,mlx4_1,2,0x0002c90300000786
-
-This lists the configurable parameters of the IB ports present on the
-host in the order: HCA No, HCA Name, Port No, PORTGUID, separated by
-commas. A PORTGUID value obtained this way can be used to discover the EVIC
-IOCs reachable through that port using the '-G' option as follows:
-
-# ib_qlgc_vnic_query -es -G 0x0002c903000010f5
-
-HCA No = 0, HCA = mlx4_0, Port = 1, Port GUID = 0x0002c903000010f5, State = Active
-
- ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1"
- ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2"
-
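The comma-separated '-L' output described above is easy to split into fields, for example to look up the PORTGUID of a given HCA port before passing it to '-G'. This is a sketch only; the function name and dictionary keys are hypothetical, and the field order is the one stated in the text (HCA No, HCA Name, Port No, PORTGUID).

```python
def parse_port_list(text):
    """Parse 'ib_qlgc_vnic_query -L' style lines into a list of dicts."""
    ports = []
    for line in text.splitlines():
        fields = line.strip().split(",")
        if len(fields) != 4:
            continue  # skip blank or unexpected lines
        hca_no, hca_name, port_no, port_guid = fields
        ports.append({
            "hca_no": int(hca_no),
            "hca_name": hca_name,
            "port_no": int(port_no),
            "port_guid": port_guid,
        })
    return ports


sample = """0,mlx4_0,1,0x0002c903000010f5
0,mlx4_0,2,0x0002c903000010f6
1,mlx4_1,1,0x0002c90300000785
1,mlx4_1,2,0x0002c90300000786
"""

ports = parse_port_list(sample)
guids = [p["port_guid"] for p in ports if p["hca_name"] == "mlx4_1"]
print(guids)
```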
-C) Starting the QLogic VNIC driver and the QLogic VNIC interfaces
-
-To start the QLogic VNIC service as a part of startup of OFED stack, set
-
-QLGC_VNIC_LOAD=yes
-
-in the /etc/infiniband/openib.conf file. With this setting, the QLogic VNIC
-service is also stopped when the OFED stack is stopped. Also, if the OFED
-stack has been marked to start on boot, the QLogic VNIC service will also
-start on boot.
-
-The rest of the discussion in this subsection C) is valid only if
-
-QLGC_VNIC_LOAD=no
-
-is set in /etc/infiniband/openib.conf.
-
-Once you have created a configuration file, you can start the VNIC driver
-and create the VNIC interfaces specified in the configuration file with:
-
-#/sbin/service qlgc_vnic start
-
-You can stop the VNIC driver and bring down the VNIC interfaces with
-
-#/sbin/service qlgc_vnic stop
-
-To restart the QLogic VNIC driver, you can use
-
-#/sbin/service qlgc_vnic restart
-
-If you have not started the Infiniband network stack (Infinipath or OFED),
-then running "/sbin/service qlgc_vnic start" command will also cause the
-Infiniband network stack to be started since the QLogic VNIC service requires
-the Infiniband stack.
-
-On the other hand if you start the Infiniband network stack separately, then
-the correct order of starting is:
-
-- Start the Infiniband stack
-- Start QLogic VNIC service
-
-For example, if you use OFED, correct order of starting is:
-
-/sbin/service openibd start
-/sbin/service qlgc_vnic start
-
-Correct order of stopping is:
-
-- Stop QLogic VNIC service
-- Stop the Infiniband stack
-
-For example, if you use OFED, correct order of stopping is:
-
-/sbin/service qlgc_vnic stop
-/sbin/service openibd stop
-
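The start and stop ordering described above can be captured in a small helper so that the two services are always handled in the right sequence. This is a hedged sketch: the function names and the dry-run behaviour are hypothetical, and only the service names and the /sbin/service commands come from the text.

```python
import subprocess

IB_STACK_SERVICE = "openibd"   # OFED stack, as in the example above
VNIC_SERVICE = "qlgc_vnic"


def service_commands(action):
    """Return the /sbin/service commands for 'start' or 'stop' in the correct order."""
    if action == "start":
        order = [IB_STACK_SERVICE, VNIC_SERVICE]   # stack first, then VNIC
    elif action == "stop":
        order = [VNIC_SERVICE, IB_STACK_SERVICE]   # VNIC first, then stack
    else:
        raise ValueError("action must be 'start' or 'stop'")
    return [["/sbin/service", name, action] for name in order]


def run(action, dry_run=True):
    """Print the commands, or execute them when dry_run is False (needs root)."""
    for cmd in service_commands(action):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.check_call(cmd)


run("stop")
```

By default the helper only prints the commands, so it can be checked safely on a machine without the services installed.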
-If you try to stop the Infiniband stack when the QLogic VNIC service is
-running, you will get an error message that some of the modules of the
-Infiniband stack are in use by the QLogic VNIC service. Also, any QLogic VNIC
-interfaces that you created are removed (because stopping the Infiniband
-network stack causes the HCA driver to be unloaded which is required for the
-VNIC interfaces to be present).
-In this case, do the following:
-
- 1. Stop the QLogic VNIC service with "/sbin/service qlgc_vnic stop"
-
- 2. Stop the Infiniband stack again.
-
- 3. If you want to restart the QLogic VNIC interfaces, use
- "/sbin/service qlgc_vnic start".
-
-
-D) Assigning IP addresses etc for the QLogic VNIC interfaces
-
-This can be done with ifconfig or by setting up the ifcfg-XXX (ifcfg-veth0 for
-an interface named veth0 etc) network files for the corresponding VNIC interfaces.
-
-E) Information about the QLogic VNIC interfaces
-
-Information about VNIC interfaces on a given host can be obtained using a
-script "ib_qlgc_vnic_info":
-
-# ib_qlgc_vnic_info
-
-VNIC Interface : eioc0
- VNIC State : VNIC_REGISTERED
- Current Path : primary path
- Receive Checksum : true
- Transmit checksum : true
-
- Primary Path :
- VIPORT State : VIPORT_CONNECTED
- Link State : LINK_IDLING
- HCA Info. : vnic-mthca0-1
- Heartbeat : 100
- IOC String : EVIC in Chassis 0x00066a00db000010, Slot 4, Ioc 1
- IOC GUID : 66a01de000037
- DGID : fe8000000000000000066a11de000037
- P Key : ffff
-
- Secondary Path :
- VIPORT State : VIPORT_DISCONNECTED
- Link State : INVALID STATE
- HCA Info. : vnic-mthca0-2
- Heartbeat : 100
- IOC String :
- IOC GUID : 66a01de000037
- DGID : 00000000000000000000000000000000
- P Key : 0
-
-This information is collected from /sys/class/infiniband_qlgc_vnic/interfaces/
-directory under which there is a separate directory corresponding to each
-VNIC interface.
-
-F) Deleting a specific QLogic VNIC interface
-
-VNIC interfaces can be deleted by writing the name of the interface to
-the /sys/class/infiniband_qlgc_vnic/interfaces/delete_vnic file.
-
-For example, to delete interface veth0:
-
-echo -n veth0 > /sys/class/infiniband_qlgc_vnic/interfaces/delete_vnic
-
-G) Forced Failover feature for QLogic VNIC.
-
-VNIC interfaces, when configured with failover configuration, can be
-forced to failover to use other active path. For example, if VNIC interface
-"veth1" is configured with failover configuration, then to switch to other
-path, use command:
-
-echo -n veth1 > /sys/class/infiniband_qlgc_vnic/interfaces/force_failover
-
-This makes VNIC interface veth1 switch to the other active path, even if
-the path it was using before the forced failover operation is not in the
-disconnected state.
-
-This feature allows the network administrator to control the path of the
-VNIC traffic at run time, without reconfiguring or restarting the VNIC
-service.
-
-Once enabled as mentioned above, forced failover can be cleared with
-the unfailover command:
-
-echo -n veth1 > /sys/class/infiniband_qlgc_vnic/interfaces/unfailover
-
-This clears the forced failover on VNIC interface "veth1". Once cleared,
-if module parameter "default_prefer_primary" is set to 1, then VNIC
-interface switches back to primary path. If module parameter
-"default_prefer_primary" is set to 0, then VNIC interface continues to
-use its current active path.
-
-Forced failover, thus, takes priority over default_prefer_primary and the
-default_prefer_primary feature will not be active unless the forced
-failover is cleared through "unfailover".
-
-Besides this forced failover, QLogic VNIC service does retain its
-original failover feature which gets triggered when current active
-path gets disconnected.
-
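The delete_vnic, force_failover and unfailover operations above all follow the same pattern: the interface name is written, without a trailing newline, to a control file under /sys/class/infiniband_qlgc_vnic/interfaces/. A minimal sketch of that pattern is shown below. The function name and the base_dir parameter are hypothetical; the demonstration writes into a scratch directory instead of sysfs so it can run without the driver loaded.

```python
import os
import tempfile

SYSFS_DIR = "/sys/class/infiniband_qlgc_vnic/interfaces"
CONTROL_FILES = ("delete_vnic", "force_failover", "unfailover")


def write_control(control, ifname, base_dir=SYSFS_DIR):
    """Write the interface name to one of the VNIC control files.

    Equivalent to: echo -n <ifname> > <base_dir>/<control>
    """
    if control not in CONTROL_FILES:
        raise ValueError("unknown control file: %s" % control)
    path = os.path.join(base_dir, control)
    with open(path, "w") as f:
        f.write(ifname)  # no trailing newline, like "echo -n"
    return path


# Demonstration against a scratch directory (not the real sysfs tree).
demo_dir = tempfile.mkdtemp()
write_control("force_failover", "veth1", base_dir=demo_dir)
with open(os.path.join(demo_dir, "force_failover")) as f:
    demo_content = f.read()
print(repr(demo_content))
```

On a real host the default base_dir would be used and the call would need root privileges.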
-H) Infiniband Quality of Service for VNIC
-
-No configuration is required on the host side to enforce Infiniband Quality
-of Service (QoS) for the VNIC protocol. The service level for the
-VNIC protocol can be configured using the service ID or the target port GUID
-in the "qos-ulps" section of /etc/opensm/qos-policy.conf on the host
-running OpenSM.
-
-Service IDs for the EVIC IO controllers can be obtained from the output
-of ib_qlgc_vnic_query:
-
-HCA No = 1, HCA = mlx4_1, Port = 2, Port GUID = 0x0002c90300000786, State = Active
-
- IO Unit Info:
- port LID: 0008
- port GID: fe8000000000000000066a11de000070
- change ID: 0003
- max controllers: 0x02
-
-
- controller[ 1]
- GUID: 00066a01de000070
- vendor ID: 00066a
- device ID: 000030
- IO class : 2000
- ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1
- service entries: 2
-------> service[ 0]: 1000066a00000001 / InfiniNIC.InfiniConSys.Control:01
-------> service[ 1]: 1000066a00000101 / InfiniNIC.InfiniConSys.Data:01
-
- IO Unit Info:
- port LID: 0009
- port GID: fe8000000000000000066a21de000070
- change ID: 0003
- max controllers: 0x02
-
-
- controller[ 2]
- GUID: 00066a02de000070
- vendor ID: 00066a
- device ID: 000030
- IO class : 2000
- ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2
- service entries: 2
-------> service[ 0]: 1000066a00000002 / InfiniNIC.InfiniConSys.Control:02
-------> service[ 1]: 1000066a00000102 / InfiniNIC.InfiniConSys.Data:02
-
-The numbers marked with arrows (for example, 1000066a00000002 and
-1000066a00000102 for controller 2) are the required service IDs.
-
-Finer control on quality of service for the VNIC protocol can be achieved by
-configuring the service level using target port guid values of the EVIC IO
-controllers. Target port guid values for the EVIC IO controllers can be
-obtained using "saquery" command supplied by OFED package.
-
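The service IDs can also be collected from the query output with a short script instead of being read off by eye. The sketch below assumes the "service[ N]: <id> / <name>" line format shown in the sample output above; the regular expression and function name are illustrative assumptions.

```python
import re

# "service[ 0]: 1000066a00000001 / InfiniNIC.InfiniConSys.Control:01"
SERVICE_RE = re.compile(r'service\[\s*\d+\]:\s*([0-9a-f]{16})\s*/\s*(\S+)')


def service_ids(text):
    """Return (service_id, service_name) pairs found in the query output."""
    return [(m.group(1), m.group(2)) for m in SERVICE_RE.finditer(text)]


sample = """
 service entries: 2
 service[ 0]: 1000066a00000002 / InfiniNIC.InfiniConSys.Control:02
 service[ 1]: 1000066a00000102 / InfiniNIC.InfiniConSys.Data:02
"""

ids = service_ids(sample)
for service_id, name in ids:
    print(service_id, name)
```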
-I) QLogic VNIC Dynamic Update Daemon Tool and Hot Swap support
-
-This tool is started and stopped as part of the QLogic VNIC service
-(refer to C above) and provides the following features:
-
-1. Dynamic update of disconnected interfaces (which have been configured
-WITHOUT using the DGID option in the configuration file) :
-
-At the start up of VNIC driver, if the HCA port through which a particular VNIC
-interface path (primary or secondary) connects to target is down or the
-EVIC/VEx IOC is not available then all the required parameters (DGID etc) for connecting
-with the EVIC/VEx cannot be determined. Hence the corresponding VNIC interface
-path is not available at the start of the VNIC service. This daemon constantly
-monitors the configured VNIC interfaces to check if any of them are disconnected.
-If any of the interfaces are disconnected, it scans for available EVIC/VEx targets using
-"ib_qlgc_vnic_query" tool. When daemon sees that for a given path of a VNIC interface,
-the configured EVIC/VEx IOC has become available, it dynamically updates the
-VNIC kernel driver with the required information to establish connection for
-that path of the interface. In this way, the interface gets connected with
-the configured EVIC/VEx whenever it becomes available without any manual
-intervention.
-
-2. Hot Swap support :
-
-Hot swap is an operation in which an existing EVIC/VEx is replaced by another
-EVIC/VEx (in the same slot of the switch chassis as the older one). In such a
-case, the current connection for the corresponding VNIC interface will have to
-be re-established. The daemon detects this hot swap case and re-establishes
-the connection automatically. To make use of this feature of the daemon, it is
-recommended that IOCSTRING be used in the configuration file to configure the
-VNIC interfaces.
-
-This is because, after a hot swap, all other parameters of the EVIC/VEx
-(DGID, IOCGUID, etc.) change, but the IOCSTRING remains the same. Thus the daemon
-monitors for changes in IOCGUID and DGID of disconnected interfaces based on the IOCSTRING.
-If these values have changed it updates the kernel driver so that the VNIC
-interface can start using the new EVIC/VEx.
-
-If in addition to IOCSTRING, DGID and IOCGUID have been used to configure
-a VNIC interface, then on a hotswap the daemon will update the parameters as required.
-But to have that VNIC interface available immediately on the next restart of the
-QLogic VNIC service, please make sure to update the configuration file with the
-new DGID and IOCGUID values. Otherwise, the creation of such interfaces will be
-delayed till the daemon runs and updates the parameters.
-
-J) Information about creating VLAN interfaces
-
-The EVIC/VEx supports VLAN tagging without having to explicitly create VLAN
-interfaces for the VNIC interface on the host. This is done by enabling
-Egress/Ingress tagging on the EVIC/VEx and setting the "Host ignores VLAN"
-option for the VNIC interface. The "Host ignores VLAN" option is enabled
-by default due to which VLAN tags are ignored on the host by the QLogic
-VNIC driver. Thus explicitly created VLAN interfaces (using vconfig command)
-for a given VNIC interface will not be operational.
-
-If you want to explicitly create a VLAN interface for a given VNIC interface,
-then you will have to disable the "Host ignores VLAN" option for the
-VNIC interface on the EVIC/VEx. The qlgc_vnic service must be restarted
-on the host after disabling (or enabling) the "Host ignores VLAN" option.
-
-Please refer to the EVIC/VEx documentation for more information on Egress/Ingress
-port tagging feature and disabling the "Host ignores VLAN" option.
-
-K) Information about enabling IB Multicast for QLogic VNIC interface
-
-QLogic VNIC driver has been upgraded to support the IB Multicasting feature of
-EVIC/VEx. This feature enables the QLogic VNIC host driver to support the IP
-multicasting more efficiently. With this feature enabled, infiniband multicast
-group acts as a carrier of IP multicast traffic. EVIC will make use of such IB
-multicast groups for forwarding IP multicast traffic to VNIC interfaces which
-are members of a given IP multicast group. In the older QLogic VNIC host driver,
-IB multicasting was not used to carry IP multicast traffic.
-
-By default, IB multicasting is disabled on EVIC/VEx; but it is enabled by
-default at the QLogic VNIC host driver.
-
-To disable the IB multicast feature on the host driver, modify the VNIC
-configuration file by setting the parameter IB_MULTICAST=FALSE in the
-interface configuration. Please refer to the qlgc_vnic.cfg.sample for more
-details on configuration of VNIC interfaces for IB multicasting.
-To use IB multicasting, it also needs to be enabled on the EVIC/VEx. Please
-refer to the EVIC/VEx documentation for more information on enabling the IB
-multicast feature on the EVIC/VEx.
-
-L) Basic Troubleshooting
-
-1. In case of any problems, make sure that:
-
- a) The HCA ports you are trying to use have IB cables connected and are in an
- active state. You can use the "ibv_devinfo" tool to check the state of
- your HCA ports.
-
- b) If your HCA ports are not active, check if an SM is running on the fabric
- where the HCA ports are connected. If you have done a full install of
- OFED, you can use the "sminfo" command ("sminfo -P 2" for port 2) to
- check SM information.
-
- c) Make sure that the EVIC/VEx is powered up and its Ethernet cables are connected
- properly.
-
- d) Check /var/log/messages for any error messages.
-
-2. If some of your VNIC interfaces are not available:
-
- a) Use "ifconfig" tool with -a option to see if all interfaces are created.
- It is possible that the interfaces are created but do not have an
- IP address. Make sure that you have set up a correct ifcfg-XXX file for your
- VNIC interfaces for automatic assignment of IP addresses.
-
- If the VNIC interface is created and the ifcfg file is also correct
- but the VNIC interface is not UP, make sure that the target EVIC/VEx
- IOC has an Ethernet cable properly connected.
-
- b) Make sure that the VNIC configuration file has been set up properly
- with correct EVIC/VEx target DGID/IOCGUID/IOCSTRING information and
- instance numbers.
-
- c) Make sure that the EVIC/VEx target IOC specified for that interface is
- available. You can use the "ib_qlgc_vnic_query" tool to verify this. If it is not
- available when you started the service, but it becomes available later
- on, then the QLogic VNIC dynamic update daemon will bring up the
- interface when the target becomes available. You will see messages in
- /var/log/messages when the corresponding interface is created.
-
- d) Make sure that you have not exceeded the total number of Virtual interfaces
- supported by the EVIC/VEx. You can check the total number of Virtual interfaces
- currently in use on the HTTP interface of the EVIC/VEx.
-
+++ /dev/null
-
- QoS support in OFED
-
-==============================================================================
-Table of contents
-==============================================================================
-
-1. Overview
-2. Architecture
-3. Supported Policy
-4. CMA functionality
-5. IPoIB functionality
-6. SDP functionality
-7. RDS functionality
-8. SRP functionality
-9. iSER functionality
-10. OpenSM functionality
-
-
-==============================================================================
-1. Overview
-==============================================================================
-
-Quality of Service requirements stem from the realization of I/O consolidation
-over an IB network: as multiple applications and ULPs share the same fabric,
-a means to control their use of the network resources becomes a must.
-The basic need is to differentiate the service levels provided to different
-traffic flows, so that a policy can be enforced to control each flow's
-utilization of the fabric resources.
-
-IBTA specification defined several hardware features and management interfaces
-to support QoS:
-* Up to 15 Virtual Lanes (VL) carry traffic in a non-blocking manner
-* Arbitration between traffic of different VLs is performed by a 2 priority
- levels weighted round robin arbiter. The arbiter is programmable with
- a sequence of (VL, weight) pairs and maximal number of high priority credits
- to be processed before low priority is served
-* Packets carry class of service marking in the range 0 to 15 in their
- header SL field
-* Each switch can map the incoming packet by its SL to a particular output
- VL based on programmable table VL=SL-to-VL-MAP(in-port, out-port, SL)
-* The Subnet Administrator controls the parameters of each communication flow
- by providing them as a response to Path Record (PR) or MultiPathRecord (MPR)
- queries
-
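To make the VL=SL-to-VL-MAP(in-port, out-port, SL) lookup from the list above concrete, the following sketch models it as a table keyed by input port, output port and SL. This is a conceptual illustration only: the class, its methods, the default VL and the example entries are all hypothetical and do not correspond to any OFED or switch management API.

```python
class SL2VLTable:
    """Toy model of a per-switch SL-to-VL mapping table."""

    def __init__(self, default_vl=0):
        self.default_vl = default_vl
        self.table = {}

    def set(self, in_port, out_port, sl, vl):
        # The SL field carries a class-of-service marking in the range 0..15.
        if not 0 <= sl <= 15:
            raise ValueError("SL must be in the range 0..15")
        self.table[(in_port, out_port, sl)] = vl

    def lookup(self, in_port, out_port, sl):
        """Return the output VL for a packet with the given SL on this port pair."""
        return self.table.get((in_port, out_port, sl), self.default_vl)


# Hypothetical policy: traffic marked SL 1 gets its own lane between ports 1 and 2.
sl2vl = SL2VLTable()
sl2vl.set(in_port=1, out_port=2, sl=1, vl=1)

print(sl2vl.lookup(1, 2, 1))  # mapped entry
print(sl2vl.lookup(1, 2, 0))  # falls back to the default VL
```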
-The IB QoS features provide the means to implement a DiffServ like
-architecture. DiffServ architecture (IETF RFC 2474 & 2475) is widely used
-today in highly dynamic fabrics.
-
-This document provides the detailed functional definition for the various
-software elements that enable a DiffServ like architecture over the
-OpenFabrics software stack.
-
-
-==============================================================================
-2. Architecture
-==============================================================================
-
-QoS functionality is split between the SM/SA, CMA and the various ULPs.
-We take the "chronology approach" to describe how the overall system works.
-
-2.1. The network manager (human) provides a set of rules (policy) that
-define how the network is configured and how its resources are split
-among different QoS-Levels. The policy also defines how to decide which
-QoS-Level each application, ULP or service uses.
-
-2.2. The SM analyzes the provided policy to see if it is realizable and
-performs the necessary fabric setup. Part of this policy defines the default
-QoS-Level of each partition. The SA is enhanced to match the requested Source,
-Destination, QoS-Class, Service-ID, PKey against the policy, so clients
-(ULPs, programs) can obtain a policy enforced QoS. The SM may also set up
-partitions with an appropriate IPoIB broadcast group. This broadcast group
-carries its QoS attributes: SL, MTU, RATE, and Packet Lifetime.
-
-2.3. IPoIB is set up. IPoIB uses the SL, MTU, RATE and Packet Lifetime
-available on the multicast group which forms the broadcast group of this
-partition.
-
-2.4. MPI, which provides non-IB-based connection management, should be
-configured to run using hard-coded SLs. It uses these SLs for every QP
-it opens.
-
-2.5. ULPs that use the CM interface (like SRP) have their own pre-assigned
-Service-ID and use it while obtaining PathRecord/MultiPathRecord (PR/MPR)
-for establishing connections. The SA receiving the PR/MPR matches it
-against the policy and returns the appropriate PR/MPR including SL, MTU,
-RATE and Lifetime.
-
-2.6. ULPs and programs (e.g. SDP) that use CMA to establish an RC connection
-provide the CMA with the target IP and port number. ULPs might also provide
-a QoS-Class. The CMA then creates a Service-ID for the ULP and passes this
-ID and the optional QoS-Class in the PR/MPR request. The resulting PR/MPR
-is used for configuring the connection QP.
-
-PathRecord and MultiPathRecord enhancement for QoS:
-
-As mentioned above, the PathRecord and MultiPathRecord attributes are
-enhanced to carry the Service-ID, which is a 64-bit value. A new field,
-QoS-Class, is also provided.
-A new capability bit describes the SM QoS support in the SA class port info.
-This approach provides an easy migration path for the existing access layer
-and ULPs by not introducing a new set of PR/MPR attributes.
-
-
-==============================================================================
-3. Supported Policy
-==============================================================================
-
-The QoS policy, which is specified in a separate file, is divided into
-4 subsections:
-
-I) Port Group: a set of CAs, Routers or Switches that share the same settings.
- A port group might be a partition defined by the partition manager policy,
- list of GUIDs, or list of port names based on NodeDescription.
-
-II) Fabric Setup: Defines how the SL2VL and VLArb tables should be set up.
- NOTE: Currently this part of the policy is ignored. SL2VL and VLArb
- tables should be configured in the OpenSM options file
- (opensm.opts).
-
-III) QoS-Levels Definition: This section defines the possible sets of
- parameters for QoS that a client might be mapped to. Each set holds
- SL and optionally: Max MTU, Max Rate, Packet Lifetime and Path Bits.
- NOTE: Currently, Path Bits are not implemented.
-
-IV) Matching Rules: A list of rules that match an incoming PR/MPR request
- to a QoS-Level. The rules are processed in order, such that the first
- match is applied. Each rule is built out of a set of match expressions
- which should all match for the rule to apply. The matching expressions
- are defined for the following fields:
- - SRC and DST to lists of port groups
- - Service-ID to a list of Service-ID values or ranges
- - QoS-Class to a list of QoS-Class values or ranges
-
-
-==============================================================================
-4. CMA features
-==============================================================================
-
-The CMA interface supports Service-ID through the notion of port space
-as a prefix to the port_num which is part of the sockaddr provided to
-rdma_resolve_addr().
-CMA also allows the ULP (like SDP) to propagate a request for a specific
-QoS-Class. CMA uses the provided QoS-Class and Service-ID in the sent PR/MPR.
-
-
-==============================================================================
-5. IPoIB
-==============================================================================
-
-IPoIB queries the SA for its broadcast group information.
-It provides the broadcast group SL, MTU, and RATE in every following
-PathRecord query performed when a new UDAV is needed by IPoIB.
-
-
-==============================================================================
-6. SDP
-==============================================================================
-
-SDP uses CMA for building its connections.
-The Service-ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits
-holding the remote TCP/IP Port Number to connect to.
-
-
-==============================================================================
-7. RDS
-==============================================================================
-
-RDS uses CMA and thus it is very close to SDP. The Service-ID for RDS is
-0x000000000106PPPP, where PPPP are 4 hex digits holding the TCP/IP Port
-Number that the protocol connects to.
-Default port number for RDS is 0x48CA, which makes a default Service-ID
-0x00000000010648CA.
-
-
-==============================================================================
-8. SRP
-==============================================================================
-
-The current SRP implementation uses its own CM callbacks (not CMA), so SRP
-fills in the Service-ID in the PR/MPR by itself and uses that information in
-setting up the QP.
-SRP Service-ID is defined by the SRP target I/O Controller (it also complies
-with IBTA Service-ID rules). The Service-ID is reported by the I/O Controller
-in the ServiceEntries DMA attribute and should be used in the PR/MPR if the
-SA reports its ability to handle QoS PR/MPRs.
-
-
-==============================================================================
-9. iSER
-==============================================================================
-
-Similar to RDS, iSER also uses CMA. The Service-ID for iSER is similar to RDS
-(0x000000000106PPPP), with default port number 0x0CBC, which makes a default
-Service-ID 0x0000000001060CBC.
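-
-The Service-ID layouts given for SDP, RDS and iSER can be sketched as a small
-helper. This is only an illustration of the bit layout described above; the
-names service_id, SDP_PREFIX and RDS_ISER_PREFIX are hypothetical and are not
-part of OFED.
-

```python
# Illustrative only: these names are assumptions, not an OFED API.
SDP_PREFIX = 0x0000000000010000       # 0x000000000001PPPP
RDS_ISER_PREFIX = 0x0000000001060000  # 0x000000000106PPPP

RDS_DEFAULT_PORT = 0x48CA
ISER_DEFAULT_PORT = 0x0CBC


def service_id(prefix, port):
    """Place a 16-bit TCP/IP port number in the low 16 bits of the prefix."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port number must fit in 4 hex digits")
    return prefix | port


def fmt(sid):
    """Format a Service-ID as a 64-bit hex value."""
    return "0x%016X" % sid


if __name__ == "__main__":
    print("SDP port 30000:", fmt(service_id(SDP_PREFIX, 30000)))
    print("RDS default:   ", fmt(service_id(RDS_ISER_PREFIX, RDS_DEFAULT_PORT)))
    print("iSER default:  ", fmt(service_id(RDS_ISER_PREFIX, ISER_DEFAULT_PORT)))
```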
-
-
-==============================================================================
-10. OpenSM features
-==============================================================================
-
-The QoS related functionality that is provided by OpenSM can be split into two
-main parts:
-
-10.1. Fabric Setup
-During fabric initialization the SM parses the policy and applies its settings
-to the discovered fabric elements.
-
-10.2. PR/MPR query handling:
-OpenSM enforces the provided policy on client requests.
-The overall flow for such requests is: first the request is matched against
-the defined match rules such that the target QoS-Level definition is found.
-Given the QoS-Level, a path search is performed with the restrictions
-imposed by that level.
-
-==============================================================================
+++ /dev/null
-
- QoS Management in OpenSM
-
-==============================================================================
- Table of contents
-==============================================================================
-
-1. Overview
-2. Full QoS Policy File
-3. Simplified QoS Policy Definition
-4. Policy File Syntax Guidelines
-5. Examples of Full Policy File
-6. Simplified QoS Policy - Details and Examples
-7. SL2VL Mapping and VL Arbitration
-
-
-==============================================================================
- 1. Overview
-==============================================================================
-
-When QoS in OpenSM is enabled (-Q or --qos), OpenSM looks for a QoS policy
-file. The default name of the OpenSM QoS policy file is
-/usr/local/etc/opensm/qos-policy.conf. The default may be changed by using
-the -Y or --qos_policy_file option with OpenSM.
-
-During fabric initialization and at every heavy sweep OpenSM parses the QoS
-policy file, applies its settings to the discovered fabric elements, and
-enforces the provided policy on client requests. The overall flow for such
-requests is:
- - The request is matched against the defined matching rules such that the
- QoS Level definition is found.
- - Given the QoS Level, path(s) search is performed with the given
- restrictions imposed by that level.
-
-There are two ways to define QoS policy:
- - Full policy, where the policy file syntax provides an administrator
- various ways to match PathRecord/MultiPathRecord (PR/MPR) request and
- enforce various QoS constraints on the requested PR/MPR
- - Simplified QoS policy definition, where an administrator would be able to
- match PR/MPR requests by various ULPs and applications running on top of
- these ULPs.
-
-While the full policy syntax is very flexible, in many cases the simplified
-policy definition would be sufficient.
-
-
-==============================================================================
- 2. Full QoS Policy File
-==============================================================================
-
-QoS policy file has the following sections:
-
-I) Port Groups (denoted by port-groups).
-This section defines zero or more port groups that can be referred to later
-by matching rules (see below). A port group lists ports by:
- - Port GUID
- - Port name, which is a combination of NodeDescription and IB port number
- - PKey, which means that all the ports in the subnet that belong to
- partition with a given PKey belong to this port group
- - Partition name, which means that all the ports in the subnet that belong
- to partition with a given name belong to this port group
- - Node type, where possible node types are: CA, SWITCH, ROUTER, ALL, and
- SELF (SM's port).
-
-II) QoS Setup (denoted by qos-setup).
-This section describes how to set up SL2VL and VL Arbitration tables on
-various nodes in the fabric.
-However, this is not supported in OpenSM currently.
-SL2VL and VLArb tables should be configured in the OpenSM options file
-(default location - /usr/local/etc/opensm/opensm.conf).
-
-III) QoS Levels (denoted by qos-levels).
-Each QoS Level defines Service Level (SL) and a few optional fields:
- - MTU limit
- - Rate limit
- - PKey
- - Packet lifetime
-When a path search is performed, it is subject to the restrictions that
-these QoS Level parameters impose.
-One QoS level that must be defined is the DEFAULT QoS level. It is
-applied to a PR/MPR query that does not match any existing match rule.
-Like any other QoS Level, it can also be explicitly referred to by any
-match rule.
-
-IV) QoS Matching Rules (denoted by qos-match-rules).
-Each PathRecord/MultiPathRecord query that OpenSM receives is matched against
-the set of matching rules. Rules are scanned in order of appearance in the QoS
-policy file, such that the first match takes precedence.
-Each rule names the QoS level that will be applied to the matching query.
-The default QoS level is applied to a query that did not match any rule.
-Queries can be matched by:
- - Source port group (whether a source port is a member of a specified group)
- - Destination port group (same as above, only for destination port)
- - PKey
- - QoS class
- - Service ID
-To match a certain matching rule, a PR/MPR query has to match ALL the rule's
-criteria. However, not all the fields of the PR/MPR query have to appear in
-the matching rule.
-For instance, if the rule has a single criterion - Service ID, it will match
-any query that has this Service ID, regardless of the other query fields.
-However, if a certain query has only Service ID (which means that this is the
-only bit in the PR/MPR component mask that is on), it will not match any rule
-that has other matching criteria besides Service ID.
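-
-The matching semantics described above (rules scanned in order, every
-criterion of a rule must be satisfied, and a field absent from the query
-cannot satisfy a criterion) can be sketched as follows. This is a simplified
-model and not OpenSM code: queries are plain dictionaries holding only the
-fields set in the component mask, and the "Storage" level name is made up
-for the example.
-

```python
# Simplified model of first-match rule evaluation; not OpenSM's implementation.

def in_ranges(*ranges):
    """Build a predicate accepting values inside any inclusive (low, high) range."""
    return lambda value: any(low <= value <= high for low, high in ranges)


def rule_matches(criteria, query):
    """A rule matches only if every one of its criteria is met by the query."""
    for field, accepts in criteria.items():
        # A field missing from the query (bit not set in the component mask)
        # can never satisfy a criterion on that field.
        if field not in query or not accepts(query[field]):
            return False
    return True


def select_qos_level(rules, query):
    """Return the QoS level name of the first matching rule, else DEFAULT."""
    for criteria, level_name in rules:
        if rule_matches(criteria, query):
            return level_name
    return "DEFAULT"


# Hypothetical rules, listed in the order they would appear in the policy file.
RULES = [
    ({"qos_class": in_ranges((7, 9), (11, 11))}, "WholeSet"),
    ({"service_id": in_ranges((0x10000, 0x1FFFF)),
      "pkey": in_ranges((0x0F00, 0x0FFF))}, "Storage"),
]

if __name__ == "__main__":
    # Only Service ID is set, so the second rule (which also needs PKey) fails.
    print(select_qos_level(RULES, {"service_id": 0x10005}))
```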
-
-
-==============================================================================
- 3. Simplified QoS Policy Definition
-==============================================================================
-
-Simplified QoS policy definition consists of a single section denoted by
-qos-ulps. Similar to the full QoS policy, it has a list of match rules and
-their QoS Level, but in this case a match rule has only one criterion - its
-goal is to match a certain ULP (or a certain application on top of this ULP)
-PR/MPR request, and QoS Level has only one constraint - Service Level (SL).
-The simplified policy section may appear in the policy file in combination
-with the full policy, or as a stand-alone policy definition.
-See more details and the list of match rule criteria below.
-
-
-==============================================================================
- 4. Policy File Syntax Guidelines
-==============================================================================
-
-- Empty lines are ignored.
-- Leading and trailing blanks are ignored, so the indentation in the
- examples is just for better readability.
-- Comments are started with the pound sign (#) and terminated by EOL.
-- Any keyword should be the first non-blank in the line, unless it's a
- comment.
-- Keywords that denote section/subsection start have matching closing
- keywords.
-- Having a QoS Level named "DEFAULT" is a must - it is applied to PR/MPR
- requests that didn't match any of the matching rules.
-- Any section/subsection of the policy file is optional.
-
-
-==============================================================================
- 5. Examples of Full Policy File
-==============================================================================
-
-As mentioned earlier, any section of the policy file is optional, and
-the only mandatory part of the policy file is a default QoS Level.
-Here's an example of the shortest policy file:
-
- qos-levels
- qos-level
- name: DEFAULT
- sl: 0
- end-qos-level
- end-qos-levels
-
-The port groups section is missing because there are no match rules, which
-means that port groups are not referred to anywhere, and there is no need to
-define them. And since this policy file doesn't have any matching rules, a
-PR/MPR query won't match any rule, and OpenSM will enforce the default QoS
-level. Essentially, the above example is equivalent to not having a QoS
-policy file at all.
-
-The following example shows all the possible options and keywords in the
-policy file and their syntax:
-
- #
- # See the comments in the following example.
- # They explain different keywords and their meaning.
- #
- port-groups
-
- port-group # using port GUIDs
- name: Storage
- # "use" is just a description that is used for logging
- # Other than that, it is just a comment
- use: SRP Targets
- port-guid: 0x10000000000001, 0x10000000000005-0x1000000000FFFA
- port-guid: 0x1000000000FFFF
- end-port-group
-
- port-group
- name: Virtual Servers
- # The syntax of the port name is as follows:
- # "node_description/Pnum".
- # node_description is compared to the NodeDescription of the node,
- # and "Pnum" is a port number on that node.
- port-name: vs1 HCA-1/P1, vs2 HCA-1/P1
- end-port-group
-
- # using partitions defined in the partition policy
- port-group
- name: Partitions
- partition: Part1
- pkey: 0x1234
- end-port-group
-
- # using node types: CA, ROUTER, SWITCH, SELF (for node that runs SM)
- # or ALL (for all the nodes in the subnet)
- port-group
- name: CAs and SM
- node-type: CA, SELF
- end-port-group
-
- end-port-groups
-
- qos-setup
- # This section of the policy file describes how to set up SL2VL and VL
- # Arbitration tables on various nodes in the fabric.
- # However, this is not supported in OpenSM currently - the section is
- # parsed and ignored. SL2VL and VLArb tables should be configured in the
- # OpenSM options file (by default - /usr/local/etc/opensm/opensm.conf).
- end-qos-setup
-
- qos-levels
-
- # Having a QoS Level named "DEFAULT" is a must - it is applied to
- # PR/MPR requests that didn't match any of the matching rules.
- qos-level
- name: DEFAULT
- use: default QoS Level
- sl: 0
- end-qos-level
-
- # the whole set: SL, MTU-Limit, Rate-Limit, PKey, Packet Lifetime
- qos-level
- name: WholeSet
- sl: 1
- mtu-limit: 4
- rate-limit: 5
- pkey: 0x1234
- packet-life: 8
- end-qos-level
-
- end-qos-levels
-
- # Match rules are scanned in order of their appearance in the policy file.
- # First matched rule takes precedence.
- qos-match-rules
-
- # matching by single criterion: QoS class
- qos-match-rule
- use: by QoS class
- qos-class: 7-9,11
- # Name of qos-level to apply to the matching PR/MPR
- qos-level-name: WholeSet
- end-qos-match-rule
-
- # show matching by destination group and service id
- qos-match-rule
- use: Storage targets
- destination: Storage
- service-id: 0x10000000000001, 0x10000000000008-0x10000000000FFF
- qos-level-name: WholeSet
- end-qos-match-rule
-
- qos-match-rule
- source: Storage
- use: match by source group only
- qos-level-name: DEFAULT
- end-qos-match-rule
-
- qos-match-rule
- use: match by all parameters
- qos-class: 7-9,11
- source: Virtual Servers
- destination: Storage
- service-id: 0x0000000000010000-0x000000000001FFFF
- pkey: 0x0F00-0x0FFF
- qos-level-name: WholeSet
- end-qos-match-rule
-
- end-qos-match-rules
-
-
-==============================================================================
- 6. Simplified QoS Policy - Details and Examples
-==============================================================================
-
-Simplified QoS policy match rules are tailored for matching ULPs (or some
-application on top of a ULP) PR/MPR requests. This section has a list of
-per-ULP (or per-application) match rules and the SL that should be enforced
-on the matched PR/MPR query.
-
-Match rules include:
- - Default match rule that is applied to PR/MPR query that didn't match any
- of the other match rules
- - SDP
- - SDP application with a specific target TCP/IP port range
- - SRP with a specific target IB port GUID
- - RDS
- - iSER
- - iSER application with a specific target TCP/IP port range
- - IPoIB with a default PKey
- - IPoIB with a specific PKey
- - any ULP/application with a specific Service ID in the PR/MPR query
- - any ULP/application with a specific PKey in the PR/MPR query
- - any ULP/application with a specific target IB port GUID in the PR/MPR query
-
-Since any section of the policy file is optional, as long as the basic rules
-of the file are kept (such as not referring to a nonexistent port group,
-having a default QoS Level, etc), the simplified policy section (qos-ulps)
-can serve as a complete QoS policy file.
-The shortest policy file in this case would be as follows:
-
- qos-ulps
- default : 0 #default SL
- end-qos-ulps
-
-It is equivalent to the previous example of the shortest policy file, and it
-is also equivalent to not having policy file at all.
-
-Below is an example of simplified QoS policy with all the possible keywords:
-
- qos-ulps
- default : 0 # default SL
- sdp, port-num 30000 : 0 # SL for application running on top
- # of SDP when a destination
- # TCP/IP port is 30000
- sdp, port-num 10000-20000 : 0
- sdp : 1 # default SL for any other
- # application running on top of SDP
- rds : 2 # SL for RDS traffic
- iser, port-num 900 : 0 # SL for iSER with a specific target
- # port
- iser : 3 # default SL for iSER
- ipoib, pkey 0x0001 : 0 # SL for IPoIB on partition with
- # pkey 0x0001
- ipoib : 4 # default IPoIB partition,
- # pkey=0x7FFF
- any, service-id 0x6234 : 6 # match any PR/MPR query with a
- # specific Service ID
- any, pkey 0x0ABC : 6 # match any PR/MPR query with a
- # specific PKey
- srp, target-port-guid 0x1234 : 5 # SRP when SRP Target is located on
- # a specified IB port GUID
- any, target-port-guid 0x0ABC-0xFFFFF : 6 # match any PR/MPR query with
- # a specific target port GUID
- end-qos-ulps
-
-
-Similar to the full policy definition, matching of PR/MPR queries is done in
-order of appearance in the QoS policy file, such that the first match takes
-precedence, except for the "default" rule, which is applied only if the query
-didn't match any other rule.
-
-All other sections of the QoS policy file take precedence over the qos-ulps
-section. That is, if a policy file has both qos-match-rules and qos-ulps
-sections, then any query is matched first against the rules in the
-qos-match-rules section, and only if there was no match, the query is matched
-against the rules in qos-ulps section.
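-
-The lookup order described above can be modeled roughly as shown here. This
-is a sketch under simplifying assumptions and not how OpenSM is implemented:
-both sections are reduced to ordered lists of predicates that yield an SL
-directly, and the rule contents and SL values are made up for the example.
-Note that "default" is listed first, as in the example file, yet it is
-applied last.
-

```python
# Rough model of qos-match-rules / qos-ulps precedence; not OpenSM code.

def resolve_sl(query, match_rules, ulp_rules):
    """Return the SL chosen for a query.

    match_rules: ordered (predicate, sl) pairs standing in for qos-match-rules.
    ulp_rules:   ordered (keyword, predicate, sl) triples standing in for
                 qos-ulps; the "default" entry carries no predicate.
    """
    # 1. The full-policy match rules are consulted first.
    for predicate, sl in match_rules:
        if predicate(query):
            return sl

    # 2. Then the qos-ulps rules, in order of appearance. The "default"
    #    rule is remembered but never short-circuits the scan.
    default_sl = None
    for keyword, predicate, sl in ulp_rules:
        if keyword == "default":
            default_sl = sl
            continue
        if predicate(query):
            return sl

    # 3. Nothing matched: fall back to the default SL.
    return default_sl


# Hypothetical policy contents.
MATCH_RULES = [
    (lambda q: q.get("qos_class") == 7, 5),
]
ULP_RULES = [
    ("default", None, 0),
    ("sdp", lambda q: 0x10000 <= q.get("service_id", -1) <= 0x1FFFF, 1),
    ("ipoib", lambda q: q.get("pkey") == 0x7FFF, 4),
]

if __name__ == "__main__":
    print(resolve_sl({"pkey": 0x7FFF}, MATCH_RULES, ULP_RULES))
```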
-
-Note that some of these match rules may overlap, so in order to use the
-simplified QoS definition effectively, it is important to understand how each
-of the ULPs is matched:
-
-6.1 IPoIB
-IPoIB query is matched by PKey. Default PKey for IPoIB partition is 0x7fff, so
-the following three match rules are equivalent:
-
- ipoib : <SL>
- ipoib, pkey 0x7fff : <SL>
- any, pkey 0x7fff : <SL>
-
-6.2 SDP
-SDP PR query is matched by Service ID. The Service-ID for SDP is
-0x000000000001PPPP, where PPPP are 4 hex digits holding the remote TCP/IP Port
-Number to connect to. The following two match rules are equivalent:
-
- sdp : <SL>
- any, service-id 0x0000000000010000-0x000000000001ffff : <SL>
-
-6.3 RDS
-Similar to SDP, RDS PR query is matched by Service ID. The Service ID for RDS
-is 0x000000000106PPPP, where PPPP are 4 hex digits holding the remote TCP/IP
-Port Number to connect to. Default port number for RDS is 0x48CA, which makes
-a default Service-ID 0x00000000010648CA. The following two match rules are
-equivalent:
-
- rds : <SL>
- any, service-id 0x00000000010648CA : <SL>
-
-6.4 iSER
-Similar to RDS, an iSER query is matched by Service ID, where the Service ID
-is also 0x000000000106PPPP. Default port number for iSER is 0x0CBC, which makes
-a default Service-ID 0x0000000001060CBC. The following two match rules are
-equivalent:
-
- iser : <SL>
- any, service-id 0x0000000001060CBC : <SL>
-
-6.5 SRP
-Service ID for SRP varies from storage vendor to vendor, thus SRP query is
-matched by the target IB port GUID. The following two match rules are
-equivalent:
-
- srp, target-port-guid 0x1234 : <SL>
- any, target-port-guid 0x1234 : <SL>
-
-Note that any of the above ULPs might contain target port GUID in the PR
-query, so in order for these queries not to be recognized by the QoS manager
-as SRP, the SRP match rule (or any match rule that refers to the target port
-guid only) should be placed at the end of the qos-ulps match rules.
-
-6.6 MPI
-The SL for MPI is manually configured by the MPI admin. OpenSM does not force
-any SL on the MPI traffic, which is why MPI is the only ULP that does not
-appear in the qos-ulps section.
-
-
-==============================================================================
- 7. SL2VL Mapping and VL Arbitration
-==============================================================================
-
-The OpenSM cached options file has a set of QoS related configuration
-parameters that are used to configure SL2VL mapping and VL arbitration on
-IB ports. These parameters are:
- - Max VLs: the maximum number of VLs that will be on the subnet.
- - High limit: the limit of High Priority component of VL Arbitration
- table (IBA 7.6.9).
- - VLArb low table: Low priority VL Arbitration table (IBA 7.6.9) template.
- - VLArb high table: High priority VL Arbitration table (IBA 7.6.9) template.
- - SL2VL: SL2VL Mapping table (IBA 7.6.6) template. It is a list of VLs
- corresponding to SLs 0-15 (note that VL15 used here means that this SL is
- dropped).
-
-There are separate QoS configuration parameters sets for various target types:
-CAs, routers, switch external ports, and switch's enhanced port 0. The names
-of such parameters are prefixed by "qos_<type>_" string. Here is a full list
-of the currently supported sets:
-
- qos_ca_ - QoS configuration parameters set for CAs.
- qos_rtr_ - parameters set for routers.
- qos_sw0_ - parameters set for switches' port 0.
- qos_swe_ - parameters set for switches' external ports.
-
-Here is an example of typical default values for CAs and switches' external
-ports (hard-coded in OpenSM initialization):
-
- qos_ca_max_vls 15
- qos_ca_high_limit 0
- qos_ca_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
- qos_ca_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
- qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
-
- qos_swe_max_vls 15
- qos_swe_high_limit 0
- qos_swe_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
- qos_swe_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
- qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
-
-VL arbitration tables (both high and low) are lists of VL/Weight pairs.
-Each list entry contains a VL number (values from 0-14), and a weighting value
-(values 0-255), indicating the number of 64 byte units (credits) which may be
-transmitted from that VL when its turn in the arbitration occurs. A weight
-of 0 indicates that this entry should be skipped. If a list entry is
-programmed for VL15 or for a VL that is not supported or is not currently
-configured by the port, the port may either skip that entry or send from any
-supported VL for that entry.
-
-Note that the same VL may be listed multiple times in the High or Low
-priority arbitration tables, and it may also be listed in both tables.
-
-The limit of high-priority VLArb table (qos_<type>_high_limit) indicates the
-number of high-priority packets that can be transmitted without an opportunity
-to send a low-priority packet. Specifically, the number of bytes that can be
-sent is high_limit times 4K bytes.
-
-A high_limit value of 255 indicates that the byte limit is unbounded.
-Note: if the 255 value is used, the low priority VLs may be starved.
-A value of 0 indicates that only a single packet from the high-priority table
-may be sent before an opportunity is given to the low-priority table.
-
-Keep in mind that ports usually transmit packets of size equal to MTU.
-For instance, for 4KB MTU a single packet will require 64 credits, so in order
-to achieve effective VL arbitration for packets of 4KB MTU, the weighting
-values for each VL should be multiples of 64.
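-
-The arithmetic in the preceding paragraphs (64-byte credits, and a high limit
-counted in 4KB units) can be checked with a few lines of code. The function
-names below are hypothetical and serve only to illustrate the calculation.
-

```python
# Illustrative arithmetic only; function names are assumptions.

CREDIT_BYTES = 64        # one weight unit (credit) in a VLArb entry
HIGH_LIMIT_UNIT = 4096   # high_limit is expressed in units of 4K bytes
UNBOUNDED = 255          # high_limit value meaning "no byte limit"


def credits_for_packet(packet_bytes):
    """Number of 64-byte credits needed to send one packet (rounded up)."""
    return -(-packet_bytes // CREDIT_BYTES)


def high_limit_bytes(high_limit):
    """Bytes that may be sent from the high-priority table in one burst.

    Returns None for 255 (unbounded). A value of 0 yields 0 bytes, which
    the text describes as "only a single packet" before the low-priority
    table gets an opportunity.
    """
    if not 0 <= high_limit <= 255:
        raise ValueError("high_limit must be in the range 0-255")
    if high_limit == UNBOUNDED:
        return None
    return high_limit * HIGH_LIMIT_UNIT


if __name__ == "__main__":
    print("credits for a 4KB packet:", credits_for_packet(4096))
    print("burst for high_limit 6:  ", high_limit_bytes(6), "bytes")
```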
-
-Below is an example of SL2VL and VL Arbitration configuration on a subnet:
-
- qos_ca_max_vls 15
- qos_ca_high_limit 6
- qos_ca_vlarb_high 0:4
- qos_ca_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
- qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
-
- qos_swe_max_vls 15
- qos_swe_high_limit 6
- qos_swe_vlarb_high 0:4
- qos_swe_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
- qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
-
-In this example, there are 8 VLs configured on the subnet: VL0 to VL7. VL0 is
-defined as a high priority VL, and it is limited to 6 x 4KB = 24KB in a single
-transmission burst. Such a configuration would suit a VL that needs low
-latency and uses a small MTU when transmitting packets. The rest of the VLs
-are defined as low priority VLs with different weights, while VL4 is
-effectively turned off.
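-
-To see how the low-priority weights in this example relate to each other, the
-table string can be parsed and each VL's fraction of the total weight
-computed. This is a minimal sketch with hypothetical helper names; the
-fractions only approximate bandwidth shares, and only under the assumption
-that every listed VL always has traffic queued.
-

```python
# Minimal sketch; parse_vlarb and weight_shares are hypothetical helpers.

def parse_vlarb(text):
    """Parse a "VL:weight,VL:weight,..." string into a list of (vl, weight)."""
    entries = []
    for item in text.split(","):
        vl_text, weight_text = item.split(":")
        vl, weight = int(vl_text), int(weight_text)
        if not 0 <= vl <= 14:
            raise ValueError("VL must be in the range 0-14")
        if not 0 <= weight <= 255:
            raise ValueError("weight must be in the range 0-255")
        entries.append((vl, weight))
    return entries


def weight_shares(entries):
    """Return each VL's fraction of the total weight; weight 0 is skipped."""
    totals = {}
    for vl, weight in entries:
        if weight:
            # The same VL may be listed several times; sum its weights.
            totals[vl] = totals.get(vl, 0) + weight
    total = sum(totals.values())
    if total == 0:
        return {}
    return {vl: weight / total for vl, weight in totals.items()}


if __name__ == "__main__":
    low = parse_vlarb("0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64")
    for vl, share in sorted(weight_shares(low).items()):
        print("VL%d: %.1f%%" % (vl, share * 100))
```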
+++ /dev/null
-RDS(7) RDS(7)
-
-
-
-NAME
- RDS - Reliable Datagram Sockets
-
-SYNOPSIS
- #include <sys/socket.h>
- #include <netinet/in.h>
-
-DESCRIPTION
- This is an implementation of the RDS socket API. It provides reliable,
- in-order datagram delivery between sockets over a variety of trans‐
- ports.
-
- Currently, RDS can be transported over Infiniband and loopback.
- iWARP bcopy is supported, but not RDMA operations.
-
- RDS uses standard AF_INET addresses as described in ip(7) to identify
- end points.
-
- Socket Creation
- RDS is still in development and as such does not have a reserved proto‐
- col family constant. Applications must read the string representation
- of the protocol family value from the pf_rds sysctl parameter file
- described below.
-
- rds_socket = socket(pf_rds, SOCK_SEQPACKET, 0);
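-
- The same steps can be sketched from a scripting language. The example
- below is only an illustration: the sysctl file location
- /proc/sys/net/rds/pf_rds is an assumption (this page describes the file
- further below), the value is assumed to be a decimal string, and the
- helper names are hypothetical.
-

```python
# Sketch only: the sysctl path and helper names are assumptions.
import socket

ASSUMED_PF_RDS_PATH = "/proc/sys/net/rds/pf_rds"


def read_sysctl_int(path):
    """Read the decimal string stored in a sysctl parameter file."""
    with open(path) as sysctl_file:
        return int(sysctl_file.read().strip())


def open_rds_socket(pf_rds_path=ASSUMED_PF_RDS_PATH):
    """Create an RDS socket using the dynamically assigned protocol family."""
    pf_rds = read_sysctl_int(pf_rds_path)
    # Equivalent of: rds_socket = socket(pf_rds, SOCK_SEQPACKET, 0);
    return socket.socket(pf_rds, socket.SOCK_SEQPACKET, 0)
```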
-
-
- Socket Options
- RDS sockets support a number of socket options through the setsock‐
- opt(2) and getsockopt(2) calls. The following generic options (with
- socket level SOL_SOCKET) are of specific importance:
-
- SO_RCVBUF
- Specifies the size of the receive buffer. See section on "Con‐
- gestion Control" below.
-
- SO_SNDBUF
- Specifies the size of the send buffer. See "Message Transmis‐
- sion" below.
-
- SO_SNDTIMEO
- Specifies the send timeout when trying to enqueue a message on a
- socket with a full queue in blocking mode.
-
- In addition to these, RDS supports a number of protocol specific
- options (with socket level SOL_RDS). Just as with the RDS protocol
- family, an official value has not been assigned yet, so the kernel will
- assign a value dynamically. The assigned value can be retrieved from
- the sol_rds sysctl parameter file.
-
- RDS specific socket options will be described in a separate section
- below.
-
- Binding
- A new RDS socket has no local address when it is first returned from
- socket(2). It must be bound to a local address by calling bind(2)
- before any messages can be sent or received. This will also attach the
- socket to a specific transport, based on the type of interface the
- local address is attached to. From that point on, the socket can only
- reach destinations which are available through this transport.
-
- For instance, when binding to the address of an Infiniband interface
- such as ib0, the socket will use the Infiniband transport. If RDS is
- not able to associate a transport with the given address, it will
- return EADDRNOTAVAIL.
-
- An RDS socket can only be bound to one address and only one socket can
- be bound to a given address/port pair. If no port is specified in the
- binding address then an unbound port is selected at random.
-
- RDS does not allow the application to bind a previously bound socket to
- another address. Binding to the wildcard address INADDR_ANY is not per‐
- mitted either.
-
- Connecting
- The default mode of operation for RDS is to use unconnected sockets, and
- specify a destination address as an argument to sendmsg. However, RDS
- allows sockets to be connected to a remote end point using connect(2).
- If a socket is connected, calling sendmsg without specifying a destina‐
- tion address will use the previously given remote address.
-
- Congestion Control
- RDS does not have explicit congestion control like common streaming
- protocols such as TCP. However, sockets have two queue limits associ‐
- ated with them; the send queue size and the receive queue size. Mes‐
- sages are accounted based on the number of bytes of payload.
-
- The send queue size limits how much data local processes can queue on a
- local socket (see the following section). If that limit is exceeded,
- the kernel will not accept further messages until the queue is drained
- and messages have been delivered to and acknowledged by the remote
- host.
-
- The receive queue size limits how much data RDS will put on the receive
- queue of a socket before marking the socket as congested. When a
- socket becomes congested, RDS will send a congestion map update to the
- other participating hosts, who are then expected to stop sending more
- messages to this port.
-
- There is a timing window during which a remote host can still continue
- to send messages to a congested port; RDS solves this by accepting
- these messages even if the socket's receive queue is already over the
- limit.
-
- As the application pulls incoming messages off the receive queue using
- recvmsg(2), the number of bytes on the receive queue will eventually
- drop below the receive queue size, at which point the port is then
- marked uncongested, and another congestion update is sent to all par‐
- ticipating hosts. This tells them to allow applications to send addi‐
- tional messages to this port.
-
- The send and receive buffer sizes default to the system-wide socket
- buffer sizes set in the wmem_default and rmem_default sysctls,
- respectively. They can be tuned by the application through the
- SO_SNDBUF and SO_RCVBUF socket options.
-
-
- Blocking Behavior
- The sendmsg(2) and recvmsg(2) calls can block in a variety of situa‐
- tions. Whether a call blocks or returns with an error depends on the
- non-blocking setting of the file descriptor and the MSG_DONTWAIT mes‐
- sage flag. If the file descriptor is set to blocking mode (which is the
- default), and the MSG_DONTWAIT flag is not given, the call will block.
-
- In addition, the SO_SNDTIMEO and SO_RCVTIMEO socket options can be used
- to specify a timeout (in seconds) after which the call will abort wait‐
- ing, and return an error. The default timeout is 0, which tells RDS to
- block indefinitely.
-
- Message Transmission
- Messages may be sent using sendmsg(2) once the RDS socket is bound.
- Message length cannot exceed 4 gigabytes as the wire protocol uses an
- unsigned 32 bit integer to express the message length.
-
- RDS does not support out of band data. Applications are allowed to send
- to unicast addresses only; broadcast or multicast are not supported.
-
- A successful sendmsg(2) call puts the message in the socket's transmit
- queue where it will remain until either the destination acknowledges
- that the message is no longer in the network or the application removes
- the message from the send queue.
-
- Messages can be removed from the send queue with the RDS_CANCEL_SENT_TO
- socket option described below.
-
- While a message is in the transmit queue its payload bytes are
- accounted for. If an attempt is made to send a message while there is
- not sufficient room on the transmit queue, the call will either block
- or return EAGAIN.
-
- When an attempt is made to send to a destination that is marked
- congested (see above), the call will either block or return ENOBUFS.
-
- A message sent with no payload bytes will not consume any space in the
- destination's send buffer but will result in a message receipt on the
- destination. The receiver will not get any payload data but will be
- able to see the sender's address.
-
- Messages sent to a port to which no socket is bound will be silently
- discarded by the destination host. No error messages are reported to
- the sender.
-
- Message Receipt
- Messages may be received with recvmsg(2) on an RDS socket once it is
- bound to a source address. RDS will return messages in-order, i.e. mes‐
- sages from the same sender will arrive in the same order in which they
- were sent.
-
- The address of the sender will be returned in the sockaddr_in structure
- pointed to by the msg_name field, if set.
-
- If the MSG_PEEK flag is given, the first message on the receive queue is
- returned without removing it from the queue.
-
- The memory consumed by messages waiting for delivery does not limit the
- number of messages that can be queued for receive. RDS does attempt to
- perform congestion control as described in the section above.
-
- If the length of the message exceeds the size of the buffer provided to
- recvmsg(2), then the remainder of the bytes in the message are dis‐
- carded and the MSG_TRUNC flag is set in the msg_flags field. In this
- truncating case recvmsg(2) will still return the number of bytes
- copied, not the length of the entire message. If MSG_TRUNC is set in the
- flags argument to recvmsg(2), then it will return the number of bytes
- in the entire message. Thus one can examine the size of the next mes‐
- sage in the receive queue without incurring a copying overhead by pro‐
- viding a zero length buffer and setting MSG_PEEK and MSG_TRUNC in the
- flags argument.
-
- The sending address of a zero-length message will still be provided in
- the msg_name field.
-
- Control Messages
- RDS uses control messages (a.k.a. ancillary data) through the msg_con‐
- trol and msg_controllen fields in sendmsg(2) and recvmsg(2). Control
- messages generated by RDS have a cmsg_level value of sol_rds. Most
- control messages are related to the zerocopy interface added in RDS
- version 3, and are described in rds-rdma(7).
-
- The only exception is the RDS_CMSG_CONG_UPDATE message, which is
- described in the following section.
-
- Polling
- RDS supports the poll(2) interface in a limited fashion. POLLIN is
- returned when there is a message (either a proper RDS message, or a
- control message) waiting in the socket's receive queue. POLLOUT is
- always returned while there is room on the socket's send queue.
-
- Sending to congested ports requires special handling. When an applica‐
- tion tries to send to a congested destination, the system call will
- return ENOBUFS. However, it cannot poll for POLLOUT, as there is prob‐
- ably still room on the transmit queue, so the call to poll(2) would
- return immediately, even though the destination is still congested.
-
- There are two ways of dealing with this situation. The first is to sim‐
- ply poll for POLLIN. By default, a process sleeping in poll(2) is
- always woken up when the congestion map is updated, and thus the appli‐
- cation can retry any previously congested sends.
-
- The second option is explicit congestion monitoring, which gives the
- application more fine-grained control.
-
- With explicit monitoring, the application polls for POLLIN as before,
- and additionally uses the RDS_CONG_MONITOR socket option to install a
- 64bit mask value in the socket, where each bit corresponds to a group
- of ports. When a congestion update arrives, RDS checks the set of ports
- that became uncongested against the bit mask installed in the socket.
- If they overlap, a control message is enqueued on the socket, and the
- application is woken up. When it calls recvmsg(2), it will be given the
- control message containing the bitmap.
-
- The congestion monitor bitmask can be set and queried using setsock‐
- opt(2) with RDS_CONG_MONITOR, and a pointer to the 64bit mask variable.
-
- Congestion updates are delivered to the application via
- RDS_CMSG_CONG_UPDATE control messages. These control messages are
- always delivered by themselves (or possibly additional control mes‐
- sages), but never along with a RDS data message. The cmsg_data field of
- the control message is an 8 byte datum containing the 64bit mask value.
-
- Applications can use the following macros to test for and set bits in
- the bitmask:
-
- #define RDS_CONG_MONITOR_SIZE 64
- #define RDS_CONG_MONITOR_BIT(port) (((unsigned int) (port)) % RDS_CONG_MONITOR_SIZE)
- #define RDS_CONG_MONITOR_MASK(port) (1ULL << RDS_CONG_MONITOR_BIT(port))
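-
- Because the mask has only 64 bits, several ports share each bit. The
- following shell sketch mirrors the arithmetic of RDS_CONG_MONITOR_BIT
- to show which bit a port maps to; the helper name rds_cong_bit is
- illustrative and not part of RDS:

```shell
# Bit index (0..63) that a port occupies in the 64-bit monitor mask.
# Mirrors RDS_CONG_MONITOR_BIT(port), i.e. port modulo 64.
rds_cong_bit() {
    echo $(( $1 % 64 ))
}
```

- For example, ports 4000 and 4064 both map to bit 32, so a monitor
- installed for one of them is also woken by updates for the other.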
-
-
- Canceling Messages
- An application can cancel (flush) messages from the send queue using
- the RDS_CANCEL_SENT_TO socket option with setsockopt(2). This call
- takes an optional sockaddr_in address structure as argument. If given,
- only messages to the destination specified by this address are dis‐
- carded. If no address is given, all pending messages are discarded.
-
- Note that this affects messages that have not yet been transmitted as
- well as messages that have been transmitted, but for which no acknowl‐
- edgment from the remote host has been received yet.
-
- Reliability
- If sendmsg(2) succeeds, RDS guarantees that the message will be vis‐
- ible to recvmsg(2) on a socket bound to the destination address as
- long as that destination socket remains open.
-
- If there is no socket bound on the destination, the message is
- silently dropped. If the sending RDS can't be sure that there is no
- socket bound then it will try to send the message indefinitely until it
- can be sure or the sent message is canceled.
-
- If a socket is closed then all pending sent messages on the socket are
- canceled and may or may not be seen by the receiver.
-
- The RDS_CANCEL_SENT_TO socket option can be used to cancel all pending
- messages to a given destination.
-
- If a receiving socket is closed with pending messages then the sender
- considers those messages as having left the network and will not
- retransmit them.
-
- A message will only be seen by recvmsg(2) once, unless MSG_PEEK was
- specified. Once the message has been delivered it is removed from the
- sending socket's transmit queue.
-
- All messages sent from the same socket to the same destination will be
- delivered in the order they're sent. Messages sent from different sock‐
- ets, or to different destinations, may be delivered in any order.
-
-SYSCTL VALUES
- These parameters may only be accessed through their files in
- /proc/sys/net/rds. Access through sysctl(2) is not supported.
-
- pf_rds This file contains the string representation of the protocol
- family constant passed to socket(2) to create a new RDS socket.
-
- sol_rds
- This file contains the string representation of the socket level
- parameter that is passed to getsockopt(2) and setsockopt(2) to
- manipulate RDS socket options.
-
- max_unacked_bytes and max_unacked_packets
- These parameters are used to tune the generation of acknowledge‐
- ments. By default, the system receiving RDS messages does not
- send back explicit acknowledgements unless it transmits a mes‐
- sage of its own (in which case the ACK is piggybacked onto the
- outgoing message), or when the sending system requests an ACK.
-
- However, the sender needs to see an ACK from time to time so
- that it can purge old messages from the send queue. The unacked
- bytes and packet counters are used to keep track of how much
- data has been sent without requesting an ACK. The default is to
- request an acknowledgement every 16 packets, or every 16 MB,
- whichever comes first.
-
- reconnect_delay_min_ms and reconnect_delay_max_ms
- RDS uses host-to-host connections to transport RDS messages
- (both for the TCP and the InfiniBand transport). If this
- connection breaks, RDS will try to re-establish the connection.
- Because this reconnect may be triggered by both hosts at the
- same time and fail, RDS uses a random backoff before attempting
- a reconnect. These two parameters specify the minimum and maxi‐
- mum delay in milliseconds. The default values are 1 and 1000,
- respectively.
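-
- The files above can be inspected from a shell. A minimal sketch,
- assuming only that each parameter is a plain file under
- /proc/sys/net/rds; the helper name rds_sysctl_dump and its optional
- directory argument are illustrative and not part of RDS:

```shell
# Print "name = value" for every RDS tunable file in a directory.
# The default directory is the one named above; the optional argument
# exists only so the helper can be pointed at another directory.
rds_sysctl_dump() {
    dir="${1:-/proc/sys/net/rds}"
    if [ ! -d "$dir" ]; then
        echo "no RDS sysctl directory at $dir" >&2
        return 1
    fi
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        printf '%s = %s\n' "$(basename "$f")" "$(cat "$f")"
    done
}
```

- Calling rds_sysctl_dump without arguments lists the live values; a
- missing directory most likely means the RDS module is not loaded.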
-
-SEE ALSO
- rds-rdma(7), socket(2), bind(2), sendmsg(2), recvmsg(2), getsockopt(2),
- setsockopt(2).
-
-
-
- RDS(7)
Open Fabrics Enterprise Distribution (OFED)
- Version 1.5.2
- README
-
- September 2010
+ Version 1.5.3
+ README
+
+ January 2010
+
+==============================================================================
+Table of contents
+==============================================================================
+
+ 1. Overview
+ 2. OFED Package Contents
+ 3. Hardware and Software Requirements
+ 4. How to Download and Extract the OFED Distribution
+ 5. Installing OFED Software
+ 6. Building OFED RPMs
+ 7. IP-over-IB (IPoIB) Configuration
+ 8. Using SDP
+ 9. Uninstalling OFED
+ 10. Upgrading OFED
+ 11. Configuration
+ 12. Starting and Verifying the IB Fabric
+ 13. MPI (Message Passing Interface)
+ 14. Related Documentation
+
+
+==============================================================================
+1. Overview
+==============================================================================
This is the OpenFabrics Enterprise Distribution (OFED) version 1.5.2
software package supporting InfiniBand and iWARP fabrics. It is composed
of several software modules intended for use on a computer cluster
constructed as an InfiniBand subnet or an iWARP network.
-*** Note: If you plan to upgrade OFED on your cluster, please upgrade all
- its nodes to this new version.
+This document describes how to install the various modules and test them in
+a Linux environment.
-This document includes the following sections:
+General Notes:
+ 1) The install script removes all previously installed OFED packages
+ and re-installs from scratch. (Note: Configuration files will not
+ be removed). You will be prompted to acknowledge the deletion of
+ the old packages.
-1. HW and SW Requirements
-2. OFED Package Contents
-3. Installing OFED Software
-4. Starting and Verifying the IB Fabric
-5. MPI (Message Passing Interface)
-6. Related Documentation
-
-OpenFabrics Home Page: http://www.openfabrics.org
+ 2) When installing OFED on an entire [homogeneous] cluster, a common
+ strategy is to install the software on one of the cluster nodes
+ (perhaps on a shared file system such as NFS). The resulting RPMs,
+ created under OFED-X.X.X/RPMS directory, can then be installed on all
+ nodes in the cluster using any cluster-aware tools (such as pdsh).
-The OFED rev 1.5.2 software download available in
-http://www.openfabrics.org/builds/ofed-1.5.2/release/
+==============================================================================
+2. OFED Package Contents
+==============================================================================
-Please email bugs and error reports to your InfiniBand vendor, or use bugzilla
-https://bugs.openfabrics.org/
+The OFED Distribution package generates RPMs for installing the following:
+ o OpenFabrics core and ULPs:
+ - HCA drivers (mthca, mlx4, qib, ehca)
+ - iWARP driver (cxgb3, nes)
+ - core
+ - Upper Layer Protocols: IPoIB, SDP, SRP Initiator and target, iSER
+ Initiator and target, RDS, qlgc_vnic, uDAPL and NFS-RDMA
+ o OpenFabrics utilities
+ - OpenSM: InfiniBand Subnet Manager
+ - Diagnostic tools
+ - Performance tests
+ o MPI
+ - OSU MVAPICH stack supporting the InfiniBand and iWARP interface
+ - Open MPI stack supporting the InfiniBand and iWARP interface
+ - OSU MVAPICH2 stack supporting the InfiniBand and iWARP interface
+ - MPI benchmark tests (OSU BW/LAT, Intel MPI Benchmark, Presta)
+ o Extra packages
+ - open-iscsi: open-iscsi initiator with iSER support
+ - ib-bonding: Bonding driver for IPoIB interface
+ o Sources of all software modules (under conditions mentioned in the
+ modules' LICENSE files)
+ o Documentation
+==============================================================================
+3. Hardware and Software Requirements
+==============================================================================
-1. HW and SW Requirements:
-==========================
1) Server platform with InfiniBand HCA or iWARP RNIC (see OFED Distribution
Release Notes for details)
another open-iscsi version is already installed will cause the installation
process to fail.
-2. OFED Package Contents
-========================
+==============================================================================
+4. How to Download and Extract the OFED Distribution
+==============================================================================
-The OFED Distribution package generates RPMs for installing the following:
+1) Download the OFED-X.X.X.tgz file to your target Linux host.
- o OpenFabrics core and ULPs
- - HCA drivers (mthca, mlx4, mlx4_en, qib, ehca)
- - iWARP driver (cxgb3, nes)
- - core
- - Upper Layer Protocols: IPoIB, SDP, SRP Initiator and target, iSER
- Initiator and target, RDS, uDAPL, qlgc_vnic and NFS-RDMA.
- o OpenFabrics utilities
- - OpenSM: InfiniBand Subnet Manager
- - Diagnostic tools
- - Performance tests
- o MPI
- - OSU MVAPICH stack supporting the InfiniBand and iWARP interface
- - Open MPI stack supporting the InfiniBand and iWARP interface
- - OSU MVAPICH2 stack supporting the InfiniBand and iWARP interface
- - MPI benchmark tests (OSU BW/LAT, Intel MPI Benchmark, Presta)
- o Extra packages
- - open-iscsi: open-iscsi initiator with iSER support
- - ib-bonding: Bonding driver for IPoIB interface
- o Sources of all software modules (under conditions mentioned in the
- modules' LICENSE files)
- o Documentation
+ If this package is to be installed on a cluster, it is recommended to
+ download it to an NFS shared directory.
+
+2) Extract the package using:
+
+ tar xzvf OFED-X.X.X.tgz
+
+==============================================================================
+5. Installing OFED Software
+==============================================================================
+
+1) Go to the directory into which the package was extracted:
+
+ cd /..../OFED-X.X.X
+
+2) Installing the OFED package must be done as root. For a
+ menu-driven first build and installation, run the installer
+ script:
+
+ ./install.pl
+
+ Interactive menus will direct you through the install process.
+
+ Note: After the installer completes, information about the OFED
+ installation such as the prefix, the kernel version, and
+ installation parameters can be found by running
+ /etc/infiniband/info.
+
+ Information on the driver version and source git trees can be found
+ using the ofed_info utility.
+
+
+ During the interactive installation of OFED, two files are
+ generated: ofed.conf and ofed_net.conf.
+ ofed.conf holds the installed software modules and configuration settings
+ chosen by the user. ofed_net.conf holds the IPoIB settings chosen by the
+ user.
+ If the package is installed on a cluster-shared directory, these
+ files can then be used to perform an automatic, unattended
+ installation of OFED on other machines in the cluster. The
+ unattended installation will use the same choices as were selected
+ in the interactive installation.
-3. Installing OFED Software
-============================
+ For an automatic installation on any host, run the following:
-The default installation directory is: /usr
+ ./OFED-X.X.X/install.pl -c <path>/ofed.conf -n <path>/ofed_net.conf
-Install Quick Guide:
-1) Download and extract: tar xzvf OFED-1.5.2.tgz file.
-2) Change into directory: cd OFED-1.5.2
-3) Run as root: ./install.pl
-4) Follow the directions to install required components. For details, please see
- OFED_Installation_Guide.txt under OFED-1.5.2/docs.
+3) Install script usage:
+ Usage: ./install.pl [-c <packages config_file>|--all|--hpc|--basic]
+ [-n|--net <network config_file>]
+
+ -c|--config <packages config_file>. Example of the config file can
+ be found under docs (ofed.conf-example)
+ -n|--net <network config_file> Example of the config file can be
+ found under docs (ofed_net.conf-example)
+ -l|--prefix Set installation prefix.
+ -p|--print-available Print available packages for the current platform
+ and create the corresponding ofed.conf file.
+ -k|--kernel <kernel version>. Default on this system: $(uname -r)
+ -s|--kernel-sources <path to the kernel sources>. Default on this
+ system: /lib/modules/$(uname -r)/build
+ --build32 Build 32-bit libraries. Relevant for x86_64 and
+ ppc64 platforms
+ --without-depcheck Skip the distribution's library check
+ -v|-vv|-vvv. Set verbosity level
+ -q. Set quiet - no messages will be printed
+ --force Force uninstallation of RPMs that come with the distribution
+ --builddir Change build directory. Default: /var/tmp/
+
+ --all|--hpc|--basic Install all, hpc, or basic packages,
+ respectively
Notes:
-1. The install script removes previously installed IB packages and
- re-installs from scratch. You will be prompted to acknowledge the deletion
- of the old packages. However, configuration files (.conf) will be
- preserved and saved with a ".rpmsave" extension.
+------
+a. It is possible to rename and/or edit the ofed.conf and ofed_net.conf files.
+ Thus it is possible to change user choices (observing the original format).
+ See examples of ofed.conf and ofed_net.conf under OFED-X.X.X/docs.
+ Run './install.pl -p' to get ofed.conf with all available packages included.
+
+b. Important note for open-iscsi users:
+ Installing iSER as part of the OFED installation will also install
+ open-iscsi. Before installing OFED, please uninstall any open-iscsi version
+ that may be installed on your machine. Installing OFED with iSER support
+ while another open-iscsi version is already installed will cause the
+ installation process to fail.
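+
+The unattended invocation described above can be assembled by a small helper
+before it is handed to a cluster tool. This is a minimal sketch: the function
+name ofed_unattended_cmd and its three parameters are illustrative, and it
+only prints the command rather than running install.pl.

```shell
# Build the unattended install command from the two files generated by
# an interactive run. The command is printed, not executed, so it can
# be reviewed first. ofed_dir, conf, and net_conf are illustrative names.
ofed_unattended_cmd() {
    ofed_dir="$1"; conf="$2"; net_conf="$3"
    for f in "$conf" "$net_conf"; do
        if [ ! -r "$f" ]; then
            echo "missing config file: $f" >&2
            return 1
        fi
    done
    printf '%s/install.pl -c %s -n %s\n' "$ofed_dir" "$conf" "$net_conf"
}
```

+The printed line can then be run as root on each node, for example through a
+cluster-aware tool such as pdsh.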
+
+
+Install Process Results:
+------------------------
+
+o The OFED package is installed under <prefix> directory. Default prefix is /usr
+o The kernel modules are installed under:
+ - InfiniBand subsystem:
+ /lib/modules/`uname -r`/updates/kernel/drivers/infiniband/
+ - open-iscsi:
+ /lib/modules/`uname -r`/updates/kernel/drivers/scsi/
+ - Chelsio driver:
+ /lib/modules/`uname -r`/updates/kernel/drivers/net/cxgb3/
+ - ConnectX driver:
+ /lib/modules/`uname -r`/updates/kernel/drivers/net/mlx4/
+ - RDS:
+ /lib/modules/`uname -r`/updates/kernel/net/rds/
+ - NFSoRDMA:
+ /lib/modules/`uname -r`/updates/kernel/fs/exportfs/
+ /lib/modules/`uname -r`/updates/kernel/fs/lockd/
+ /lib/modules/`uname -r`/updates/kernel/fs/nfs/
+ /lib/modules/`uname -r`/updates/kernel/fs/nfs_common/
+ /lib/modules/`uname -r`/updates/kernel/fs/nfsd/
+ /lib/modules/`uname -r`/updates/kernel/net/sunrpc/
+ - Bonding module:
+ /lib/modules/`uname -r`/updates/kernel/drivers/net/bonding/bonding.ko
+o The package kernel include files are placed under <prefix>/src/ofa_kernel/.
+ These includes should be used when building kernel modules which use
+ the OpenFabrics stack. (Note that these includes, if needed, are
+ "backported" to your kernel).
+o The raw package (un-backported) source files are placed under
+ <prefix>/src/ofa_kernel-x.x.x
+o The script "openibd" is installed under /etc/init.d/. This script can
+ be used to load and unload the software stack.
+o The directory /etc/infiniband is created with the files "info" and
+ "openib.conf". The "info" script can be used to retrieve OFED
+ installation information. The "openib.conf" file contains the list of
+ modules that are loaded when the "openibd" script is used.
+o The file "90-ib.rules" is installed under /etc/udev/rules.d/
+o If libibverbs-utils is installed, then ofed.sh and ofed.csh are
+ installed under /etc/profile.d/. These automatically update the PATH
+ environment variable with <prefix>/bin. In addition, ofed.conf is
+ installed under /etc/ld.so.conf.d/ to update the dynamic linker's
+ run-time search path to find the InfiniBand shared libraries.
+o The file /etc/modprobe.d/ib_ipoib.conf is updated to include the following:
+ - "alias ib<n> ib_ipoib" for each ib<n> interface.
+o The file /etc/modprobe.d/ib_sdp.conf is updated to include the following:
+ - "alias net-pf-27 ib_sdp" for sdp.
+o If opensm is installed, the daemon opensmd is installed under /etc/init.d/
+o All verbs tests and examples are installed under <prefix>/bin and management
+ utilities under <prefix>/sbin
+o ofed_info script provides information on the OFED version and git repository.
+o If iSER is included, open-iscsi user-space files will also be installed:
+ - Configuration files will be installed at /etc/iscsi
+ - Startup script will be installed at:
+ - RedHat: /etc/init.d/iscsi
+ - SuSE: /etc/init.d/open-iscsi
+ - Other tools (iscsiadm, iscsid, iscsi_discovery, iscsi-iname, iscsistart)
+ will be installed under /sbin.
+ - Documentation will be installed under:
+ - RedHat: /usr/share/doc/iscsi-initiator-utils-<version number>
+ - SuSE: /usr/share/doc/packages/open-iscsi
+o man pages will be installed under /usr/share/man/.
+
+==============================================================================
+6. Building OFED RPMs
+==============================================================================
+
+1) Go to the directory into which the package was extracted:
+
+ cd /..../OFED-X.X.X
+
+2) Run install.pl as explained above
+ This script also builds OFED binary RPMs under OFED-X.X.X/RPMS; the sources
+ are placed in OFED-X.X.X/SRPMS/.
+
+ Once the install process has completed, the user may run ./install.pl on
+ other machines that have the same operating system and kernel to
+ install the new RPMs.
+
+Note: Depending on your hardware, the build procedure may take 30-45
+ minutes. Installation, however, is a relatively short process
+ (~5 minutes). A common strategy for OFED installation on large
+ homogeneous clusters is to extract the tarball on a network
+ file system (such as NFS), build OFED RPMs on NFS, and then run the
+ installer on each node with the RPMs that were previously built.
+
+==============================================================================
+7. IP-over-IB (IPoIB) Configuration
+==============================================================================
+
+Configuring IPoIB is an optional step during the installation. During
+an interactive installation, the user may choose to insert the ifcfg-ib<n>
+files. If this option is chosen, the ifcfg-ib<n> files will be
+installed under:
+
+- RedHat: /etc/sysconfig/network-scripts/
+- SuSE: /etc/sysconfig/network/
+
+Setting IPoIB Configuration:
+----------------------------
+There is no default configuration for IPoIB interfaces.
+
+One should manually specify the full IP configuration during the
+interactive installation: IP address, network address, netmask, and
+broadcast address, or use the ofed_net.conf file.
+
+For bonding settings, please see "ipoib_release_notes.txt".
+
+For unattended installations, a configuration file can be provided
+with this information. For each IPoIB interface, the configuration file
+must specify one of the following:
+- Fixed values for the interface
+- A configuration based on the Ethernet configuration (may be useful for
+ cluster configuration)
+
+Here are some examples of ofed_net.conf:
+
+# Static settings; all values provided by this file
+IPADDR_ib0=172.16.0.4
+NETMASK_ib0=255.255.0.0
+NETWORK_ib0=172.16.0.0
+BROADCAST_ib0=172.16.255.255
+ONBOOT_ib0=1
+
+# Based on eth0; each '*' will be replaced by the script with corresponding
+# octet from eth0.
+LAN_INTERFACE_ib0=eth0
+IPADDR_ib0=172.16.'*'.'*'
+NETMASK_ib0=255.255.0.0
+NETWORK_ib0=172.16.0.0
+BROADCAST_ib0=172.16.255.255
+ONBOOT_ib0=1
+
+# Based on the first eth<n> interface that is found (for n=0,1,...);
+# each '*' will be replaced by the script with corresponding octet from eth<n>.
+LAN_INTERFACE_ib0=
+IPADDR_ib0=172.16.'*'.'*'
+NETMASK_ib0=255.255.0.0
+NETWORK_ib0=172.16.0.0
+BROADCAST_ib0=172.16.255.255
+ONBOOT_ib0=1
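+
+The '*' substitution used in the examples above can be illustrated with a
+short shell function. This is only a sketch of the documented behavior, not
+the code install.pl uses; the name expand_ipoib_addr is illustrative, and the
+pattern is passed without the inner quotes that appear in the file.

```shell
# Replace each '*' octet in a dotted-quad pattern with the octet at the
# same position in a reference address (for example, the eth0 address).
# The body runs in a subshell so IFS and 'set -f' do not leak out.
expand_ipoib_addr() (
    set -f          # do not glob-expand the '*' fields
    IFS=.
    set -- $1 $2    # split both dotted quads into $1..$8
    if [ $# -ne 8 ]; then
        echo "expected two dotted-quad arguments" >&2
        exit 1
    fi
    p1=$1 p2=$2 p3=$3 p4=$4
    [ "$p1" = '*' ] && p1=$5
    [ "$p2" = '*' ] && p2=$6
    [ "$p3" = '*' ] && p3=$7
    [ "$p4" = '*' ] && p4=$8
    echo "$p1.$p2.$p3.$p4"
)
```

+With an Ethernet address of 10.1.7.42, the pattern 172.16.*.* expands to
+172.16.7.42; a pattern without '*' is returned unchanged.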
+
+
+==============================================================================
+8. Using SDP
+==============================================================================
+
+Overview:
+---------
+
+Sockets Direct Protocol (SDP) is an InfiniBand byte-stream transport protocol
+that provides TCP stream semantics. Capable of utilizing InfiniBand's advanced
+protocol offload capabilities, SDP can provide lower latency, higher
+bandwidth, and lower CPU utilization than IPoIB running some sockets-based
+applications.
+
+SDP can be used by applications and improve their performance transparently
+(that is, without any recompilation). Since SDP has the same socket semantics
+as TCP, an existing application is able to run using SDP; the difference is
+that the application's TCP socket gets replaced with an SDP socket.
+
+It is also possible to configure the driver to automatically translate TCP to
+SDP based on the source IP, the destination, or the application name (See
+below).
+
+The SDP protocol is composed of a kernel module that implements the SDP as a
+new address-family/protocol-family, and a library that is used for replacing
+the TCP address family with SDP according to a policy.
+
+libsdp.so Library:
+------------------
+
+libsdp.so is a dynamically linked library, which is used for transparent
+integration of applications with SDP. The library is preloaded, and therefore
+takes precedence over glibc for certain socket calls. Thus, it can
+transparently replace the TCP socket family with SDP socket calls.
+
+The library also implements a user-level socket switch. Using a configuration
+file, the system administrator can set up the policy that selects the type of
+socket to be used. libsdp.so also has the option to allow server sockets to
+listen on both SDP and TCP interfaces. The various configurations with SDP/TCP
+sockets are explained inside the /etc/libsdp.conf file.
+
+Configuring SDP:
+----------------
+
+To load SDP upon boot, edit the file /etc/infiniband/openib.conf and set
+"SDP_LOAD=yes".
+
+Note: For the changes to take effect, run: /etc/init.d/openibd restart
+
+SDP shares the same IP addresses and interface names as IPoIB. See IPoIB
+Configuration (section 7).
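+
+The openib.conf edit described above can also be scripted. A minimal sketch,
+assuming the file uses plain KEY=value lines; the function name
+sdp_enable_on_boot and its optional path argument are illustrative.

```shell
# Set SDP_LOAD=yes in an openib.conf-style file: rewrite an existing
# SDP_LOAD line, or append one if none is present.
# The path defaults to the file named above.
sdp_enable_on_boot() {
    conf="${1:-/etc/infiniband/openib.conf}"
    tmp="$conf.tmp.$$"
    if grep -q '^SDP_LOAD=' "$conf"; then
        # Write through a temporary copy, then overwrite in place so the
        # original file keeps its owner and permissions.
        sed 's/^SDP_LOAD=.*/SDP_LOAD=yes/' "$conf" > "$tmp" &&
            cat "$tmp" > "$conf" &&
            rm -f "$tmp"
    else
        echo 'SDP_LOAD=yes' >> "$conf"
    fi
}
```

+As noted above, the change only takes effect after openibd is restarted.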
+
+How to Know SDP Is Working:
+---------------------------
+
+Since SDP is a transparent TCP replacement, it can sometimes be difficult to
+know that it is working correctly.
+To figure out whether traffic is passing through SDP or TCP, check
+/proc/net/sdpstats and monitor which counters are running.
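+
+One way to see which counters are running is to compare two snapshots of the
+file taken before and after generating traffic. The sketch below is generic
+and makes no assumption about the sdpstats format beyond it being line-based;
+the function name sdp_stats_changed is illustrative.

```shell
# Print the lines of the second snapshot that differ from the first.
sdp_stats_changed() {
    # diff exits 1 when the files differ, which is the expected case here.
    { diff "$1" "$2" || true; } | sed -n 's/^> //p'
}
```

+For example, copy /proc/net/sdpstats to one file, run the application, copy
+it again to a second file, and pass both files to sdp_stats_changed. No
+output means that no line changed between the snapshots.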
+
+sdpnetstat:
+-----------
+
+The sdpnetstat program can be used to verify both that SDP is loaded and is
+being used:
+
+host1$ sdpnetstat -S
+
+This command shows all active SDP sockets using the same format as the
+traditional netstat program. Without the '-S' option, it shows all the
+information that netstat does plus SDP data.
+
+Assuming that the SDP kernel module is loaded and is being used, then the
+output of the command will show something like the following:
+
+host1$ sdpnetstat -S
+
+Proto Recv-Q Send-Q Local Address Foreign Address
+sdp 0 0 193.168.10.144:34216 193.168.10.125:12865
+sdp 0 884720 193.168.10.144:42724 193.168.10.:filenet-rmi
+
+The example output above shows two active SDP sockets and contains details
+about the connections. If the SDP kernel module is not loaded, or it is
+loaded but is not being used, then the output of the command will be something
+like the following:
+
+host1$ sdpnetstat -S
+
+Proto Recv-Q Send-Q Local Address Foreign Address
+netstat: no support for `AF INET (tcp)' on this system.
+
+To verify whether the module is loaded or not, you can use the lsmod command.
+
+Monitoring and Troubleshooting Tools:
+-------------------------------------
+
+SDP has debug support for both the user space libsdp.so library and the ib_sdp
+kernel module.
+Both can be useful to understand why a TCP socket was not redirected over SDP
+and to help find problems in the SDP implementation.
-2. After the installer completes, information about the OFED
- installation such as the prefix, the kernel version, and
- installation parameters can be found by running
- /etc/infiniband/info.
+User-space SDP debug is controlled by options in the libsdp.conf file. You
+can also have a local version and point to it explicitly using the following
+command:
-3. Information on the driver version and source git trees can be found
- using the ofed_info utility
+host1$ export LIBSDP_CONFIG_FILE=<path>/libsdp.conf
+To obtain extensive debug information, you can modify libsdp.conf to have the
+log directive produce maximum debug output (provide the min-level flag with
+the value 1). More details in the default libsdp.conf installed by OFED.
+A non-root user can configure libsdp.so to record function calls and return
+values in the file /tmp/libsdp.log.<pid>.
-4. Starting and Verifying the IB Fabric
-=======================================
+Kernel Space SDP Debug - The SDP kernel module can log detailed trace
+information if you enable it using the 'debug_level' variable in the sysfs
+filesystem. The following command performs this:
+host1$ echo 1 > /sys/module/ib_sdp/debug_level
+
+Note: Depending on the operating system distribution on your machine, you may
+need an extra level, 'parameters', in the directory structure, so you may need
+to direct the echo command to /sys/module/ib_sdp/parameters/debug_level.
+
+Turning off kernel debug is done by setting the sysfs variable to zero using
+the following command:
+
+host1$ echo 0 > /sys/module/ib_sdp/debug_level
+
+To display debug information, use the dmesg command:
+
+host1$ dmesg
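+
+Because the debug_level file may live at either of the two paths mentioned
+above, a helper can try both. This is a sketch under that assumption; the
+function name sdp_set_debug_level and the optional sysroot argument (which
+defaults to /sys) are illustrative.

```shell
# Write a debug level to whichever ib_sdp debug_level path exists and
# print the path that was used.
sdp_set_debug_level() {
    level="$1"; sysroot="${2:-/sys}"
    for p in "$sysroot/module/ib_sdp/debug_level" \
             "$sysroot/module/ib_sdp/parameters/debug_level"; do
        if [ -e "$p" ]; then
            echo "$level" > "$p"
            echo "$p"
            return 0
        fi
    done
    echo "ib_sdp debug_level not found under $sysroot" >&2
    return 1
}
```

+Running "sdp_set_debug_level 1" as root enables tracing and
+"sdp_set_debug_level 0" turns it off again.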
+
+Environment Variables:
+----------------------
+
+For the transparent integration with SDP, the following two environment
+variables are required:
+1. LD_PRELOAD - this environment variable is used to preload libsdp.so and it
+ should point to the libsdp.so library. The variable should be set by the
+ system administrator to libsdp.so.
+2. LIBSDP_CONFIG_FILE - this environment variable is used to configure the
+ policy for replacing TCP sockets with SDP sockets. By default it points to:
+ /etc/libsdp.conf
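As an illustration of how the two variables fit together, the sketch below composes them for a given stack prefix. The helper name sdp_env and its arguments are hypothetical, and the list of machine types mapped to lib64 is an assumption that varies between distributions.

```shell
#!/bin/sh
# Sketch only: sdp_env is a hypothetical helper, not shipped with OFED.
# Arguments (all optional): stack prefix, machine type, config file path.
sdp_env() {
    prefix="${1:-/usr}"
    machine="${2:-$(uname -m)}"
    conf="${3:-/etc/libsdp.conf}"
    # Assumption: these 64-bit machine types keep libraries in lib64.
    case "$machine" in
        x86_64|ppc64) libdir="$prefix/lib64" ;;
        *)            libdir="$prefix/lib" ;;
    esac
    printf 'LD_PRELOAD=%s/libsdp.so LIBSDP_CONFIG_FILE=%s\n' "$libdir" "$conf"
}
```

A single program could then be started over SDP with something like "env $(sdp_env) <program>", which leaves the rest of the login session unaffected.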
+
+Using RDMA:
+-----------
+
+For smaller buffers, the overhead of preparing a user buffer to be RDMA'ed is
+too big; therefore, it is more efficient to use BCopy. (Large buffers can also
+be sent using RDMA, but they lower CPU utilization.) This mode is called
+"ZCopy combined mode". The sendmsg syscall is blocked until the buffer is
+transferred to the socket's peer, and the data is copied directly from the user
+buffer at the source side to the user buffer at the sink side.
+
+To set the threshold, use the module parameter sdp_zcopy_thresh. This parameter
+can be accessed through sysfs (/sys/module/ib_sdp/parameters/sdp_zcopy_thresh).
+Setting it to 0 disables ZCopy.
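Reading and changing the threshold through sysfs might look like the sketch below. The function name sdp_zcopy_thresh and the SDP_PARAM_DIR override are hypothetical; only the sysfs path comes from the text above.

```shell
#!/bin/sh
# Sketch only: a hypothetical wrapper around the sdp_zcopy_thresh parameter.
# SDP_PARAM_DIR is an assumed override so the helper can be tried on a
# scratch directory instead of the real sysfs tree.
sdp_zcopy_thresh() {
    file="${SDP_PARAM_DIR:-/sys/module/ib_sdp/parameters}/sdp_zcopy_thresh"
    if [ $# -eq 0 ]; then
        cat "$file"            # no argument: print the current threshold
    else
        echo "$1" > "$file"    # one argument: set it (0 disables ZCopy)
    fi
}
```

Called without arguments the helper prints the current value; "sdp_zcopy_thresh 0" run as root would disable ZCopy as described above.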
+
+
+==============================================================================
+9. Uninstalling OFED
+==============================================================================
+
+There are two ways to uninstall OFED:
+1) Via the installation menu.
+2) Using the script ofed_uninstall.sh. The script is part of the ofed-scripts
+ package.
+
+Note: The ofed_uninstall.sh script supports the flag --unload-modules, which
+ executes 'openibd stop' before removing the RPMs.
+
+==============================================================================
+10. Upgrading OFED
+==============================================================================
+
+If an old OFED version is installed, it may be upgraded by installing a
+new OFED version as described in section 5. Note that if the old OFED
+version was loaded before upgrading, you need to restart OFED or reboot
+your machine in order to start the new OFED stack.
+
+==============================================================================
+11. Configuration
+==============================================================================
+
+Most of the OFED components can be configured or reconfigured after
+the installation by modifying the relevant configuration files. The
+list of the modules that will be loaded automatically upon boot can be
+found in the /etc/infiniband/openib.conf file. Other configuration
+files include:
+- SDP configuration file: /etc/libsdp.conf
+- OpenSM configuration file: /etc/ofa/opensm.conf (for RedHat)
+ /etc/sysconfig/opensm (for SuSE) - should be
+ created manually if required.
+- DAPL configuration file: /etc/dat.conf
+
+See the Release Notes of each package for more details.
+
+Note: After the installer completes, information about the OFED
+ installation such as the prefix, kernel version, and
+ installation parameters can be found by running
+ /etc/infiniband/info.
+
+
+==============================================================================
+12. Starting and Verifying the IB Fabric
+==============================================================================
1) If you rebooted your machine after the installation process completed,
IB interfaces should be up. If you did not reboot your machine, please
- enter the following command: /etc/init.d/openibd start
+ enter the following command: /etc/init.d/openibd restart
2) Check that the IB driver is running on all nodes: ibv_devinfo should print
"hca_id: <linux device name>" on the first line.
-
+
3) Make sure that a Subnet Manager is running by invoking the sminfo utility.
If an SM is not running, sminfo prints:
sminfo: iberror: query failed
(or LD_PRELOAD='stack_prefix'/lib64/libsdp.so on 64 bit machines)
The default 'stack_prefix' is /usr
-
-5. MPI (Message Passing Interface)
-==================================
-
+==============================================================================
+13. MPI (Message Passing Interface)
+==============================================================================
In Step 2 of the main menu of install.pl, options 2, 3 and 4 can
install one or more MPI stacks. Multiple MPI stacks can be installed
simultaneously -- they will not conflict with each other.
Please see MPI_README.txt for more details on each MPI package and how to run
the tests.
+==============================================================================
+14. Related Documentation
+==============================================================================
+
+OFED documentation is located in the ofed-docs RPM. After
+installation the documents are located under the directory:
+/usr/share/doc/ofed-docs-x.x.x for RedHat
+/usr/share/doc/packages/ofed-docs-x.x.x for SuSE
+
+Documents list:
-6. Related Documentation
-========================
-1) Release Notes for OFED Distribution components are to be found under
- OFED-1.5.2/docs and, after the package installation, under
- /usr/share/doc/ofed-docs-1.5.2 for RedHat
- /usr/share/doc/packages/ofed-docs-1.5.2 for SuSE.
-2) For a detailed installation guide, see OFED_Installation_Guide.txt.
-3) For more information, please visit the OFED web-page http://www.openfabrics.org
+ o README.txt
+ o OFED_Installation_Guide.txt
+ o MPI_README.txt
+ o Examples of configuration files
+ o OFED_tips.txt
+ o HOWTO.build_ofed
+ o All release notes and README files
+For more information, please visit the OpenFabrics web site:
+ http://www.openfabrics.org
-For more information contact your vendor.
+open-iscsi documentation is located at:
+- RedHat: /usr/share/doc/iscsi-initiator-utils-<version number>
+- SuSE: /usr/share/doc/packages/open-iscsi
+For more information, please visit the open-iscsi web site:
+ http://www.open-iscsi.org
+++ /dev/null
-===============================================================================
- OFED-1.5.1 RoCEE Support README
- February 2010
-===============================================================================
-
-Contents:
-=========
-1. Overview
-2. Software Dependencies
-3. User Guidelines
-4. Ported Applications
-5. Gid tables
-6. Using VLANs
-7. Statistic counters
-8. Firmware Requirements
-9. Supported hardware
-10. Added fearues
-11. Known Issues
-
-
-1. Overview
-===========
-RDMA over Converged Enhanced Ethernet (RoCEE) allows InfiniBand (IB) transport
-over Ethernet networks. It encapsulates IB transport and GRH headers in
-Ethernet packets bearing a dedicated ether type.
-While the use of GRH is optional within IB subnets, it is mandatory when using
-RoCEE. Verbs applications written over IB verbs should work seamlessly, but
-they require provisioning of GRH information when creating address vectors. The
-library and driver are modified to provide for mapping from GID to MAC
-addresses required by the hardware.
-
-2. Software Dependencies
-========================
-In order to use RoCEE over Mellanox ConnectX(R) hardware, the mlx4_en driver
-must be loaded. Please refer to MLNX_EN_README.txt for further details.
-
-
-3. User Guidelines
-==================
-Since RoCEE encapsulates InfiniBand traffic in Ethernet frames, the
-corresponding net device must be up and running. In case of Mellanox
-hardware, mlx4_en must be loaded and the corresponding interface configured.
-- Make sure mlx4_en.ko is loaded
-- Make sure an IP address has been configured to this interface
-- Run "ibv_devinfo". There is a new field named "link_layer" which can be
- either "Ethernet" or "IB". If the value is IB, then you need to use
- connectx_port_config to change the ConnectX ports designation to eth (see
- mlx4_release_notes.txt for details)
-- Configure the IP address of the interface so that the link will become
- active
-- All IB verbs applications which run over IB verbs should work on RoCEE
- links as long as they use GRH headers (that is, as long as they specify use
- of GRH in their address vector)
-- rdma_cm applications working over RoCEE will have the TOS field set to a
- default value of 3. The default value is given as a module paramter to
- rdma_cm:
- def_prec2sl:Default value for SL priority with RoCE. Valid values 0 - 7 (int).
-
-
-4. Ported Applications
-======================
-- ibv_*_pingpong examples have been ported too. The user must specify the GID
- of the remote peer using the new '-g' option. The GID has the same format as
- that in /sys/class/infiniband/mlx4_0/ports/1/gids/0
-
-- Note: Care should be taken when using ibv_ud_pingpong. The default message
- size is 2K, which is likely to exceed the MTU of the RoCEE link. Use
- ibv_devinfo to inspect the link MTU and specify an appropriate message size
-
-- All rdma_cm applications should work seamlessly without any change
-
-- libsdp works without any change
-
-- Performance tests have been ported
-
-
-5. Gid tables
-=============
-With RoCEE, there may be several entries in a port's GID table. The first entry
-always contains the IPv6 link local address of the corresponding ethernet
-interface. The link local address is formed in the following way:
-
-gid[0..7] = fe80000000000000
-gid[8] = mac[0] ^ 2
-gid[9] = mac[1]
-gid[10] = mac[2]
-gid[11] = ff
-gid[12] = fe
-gid[13] = mac[3]
-gid[14] = mac[4]
-gid[15] = mac[5]
-
-If VLAN is supported by the kernel, and there are VLAN interfaces on the main
-ethernet interface (the interface that the IB port is tied to), each such VLAN
-will appear as a new GID in the port's GID table. The format of the GID entry
-will be identical to the one decribed above with the following change:
-
-gid[11] = VLAN ID high byte (4 MS bits).
-gid[12] = VLAN ID low byte
-
-Please note that VLAN ID is 12 bits.
-
-Priority pause frames
----------------------
-Tagged ethernet frames carry a 3 bit priority field. The value of this field is
-derived from the IB SL field by taking the 3 LS bits of the SL field.
-
-
-6. Using VLANs
-==============
-In order for RoCEE traffic to used VLAN tagged frames, the user has to specify
-GID table entries that are derived from VLAN devices, when creating address
-vectors. Consider the example bellow:
-
-6.1 Make sure VLAN support is enabled by the kernel. Usually this requires
-loading the 8021q module.
-- modprobe 8021q
-
-6.2 Add a VLAN device
-- vconfig add eth2 7
-
-6.3 Assign IP address to the VLAN interface
-- ifconfig eth2.7 7.10.11.12
-suppose this created a new entry in the GID table in index 1.
-
-6.4 verbs test:
-server: ibv_rc_pingpong -g 1
-client: ibv_rc_pingpongs -g 1 server
-
-6.5 For rdma_cm applications, the user only needs to specify an IP address of a
-VLAN device for the traffic to go with that VLAN tagged frames.
-
-7. Statistic counters
-=====================
-RoCEE traffic is counted and can be read from the sysfs counters in the same
-manner as it is done for regular Infiniband devices. Only the following
-counters are supported:
-- port_xmit_packets
-- port_rcv_packets
-- port_rcv_data
-- port_xmit_data
-
-For example, to read the number of transmitted packets on port 2 of device
-mlx4_1, one needs to read the file:
-/sys/class/infiniband/mlx4_1/ports/2/counters/port_xmit_packets
-
-Note: RoCEE traffic will not show in the associated Etherent device's counters
-since it is offloaded by the hardware and does not go through Ethernet network
-driver.
-
-
-8. Firmware Requirements
-========================
-RoCEE has limited support with firmware 2.7.700 and will be fully supported
-with firmware 2.8.000.
-
-
-9. Supported hardware
-=====================
-Currently, ConnectX B0 hardware is supported. A0 hardware may have issues.
-
-
-10. Added fearues
-=================
-ibdev2netdev is a utility that displays the association between an HCA's port
-and the network interface bound to it. Example run:
-
-sw417:/usr/src/packages/SOURCES/ofa_kernel-1.5.2 # ibdev2netdev
-mlx4_0 port 1 ==> ib0 (Down)
-mlx4_0 port 2 ==> ib1 (Down)
-mlx4_1 port 1 ==> eth2 (Up)
-mlx4_1 port 2 ==> eth3 (Up)
-
-
-
-11. Known Issues
-===============
-- PowerPC and ia64 architectures are not supported. x32 architectures were
- not tested.
-
-- SRP is not supported.
-
-- UD QPs that send traffic with VLAN tags (e.g. 802.1q tagged frames) do not
- work. This will be fixed in a subsequent release.
+++ /dev/null
-SCSI RDMA Protocol (SRP) Target driver for Linux
-=================================================
-
-SRP Target driver is designed to work directly on top of OpenFabrics
-OFED-1.x software stack (http://www.openfabrics.org) or Infiniband
-drivers in Linux kernel tree (kernel.org). It also interfaces with
-Generic SCSI target mid-level driver - SCST (http://scst.sourceforge.net)
-
-By interfacing with SCST driver we are able to work and support a lot IO
-modes on real or virtual devices in the backend
-
-1. scst_disk -- interfacing with scsi sub-system to claim and export real
- scsi devices ie. disks, hardware raid volumes, tape library as SRP's luns
-
-2. scst_vdisk -- fileio and blockio modes. This allows you to turn software
- raid volumes, LVM volumes, IDE disks, block devices and normal files into
- SRP's luns
-
-3. NULLIO mode will allow you to measure the performance without sending IOs
- to *real* devices
-
-
-Prerequisites
--------------
-0. Supported distributions: RHEL 5.2/5.3/5.4, SLES 10 sp2/sp3, SLES 11
-
-NOTES: On distribution default kernels, you can run scst_vdisk blockio mode
- to have good performance.
-
- It is required to patch and recompile the kernel to run scst_disk
- ie. scsi pass-thru mode
- OR
- You have to compile scst with -DSTRICT_SERIALIZING enabled and this
- does not yield good performance.
-
-1. Download and install SCST driver (supported version 1.0.1.1)
-
-1a. Download scst-1.0.1.1.tar.gz from this URL
- http://scst.sourceforge.net/downloads.html
-
-1b. untar and install scst-1.0.1.1
-
- $ tar zxvf scst-1.0.1.1.tar.gz
- $ cd scst-1.0.1.1
-
- THIS STEP IS SPECIFIC FOR SLES 10 sp2/sp3 distributions:
-
- $ patch -p1 -i <path to OFED>/docs/scst/scst_sles10_sp2.patch
-
- For all distributions:
-
- $ make && make install
-
-NOTES: FOR SLES 11 distribution, skip next step (step 1c) and go directly to
- step (2)
-
-1c. patch scst.h header file with scst.patch
-
- $ cd /usr/local/include/scst
- $ patch -p1 -i <path to OFED>/docs/scst/scst.patch
-
-
-2. Download/install OFED-1.5.1 package - SRP target is part of OFED package
-
-NOTES: if your system already have OFED stack installed, you need to remove
- the previous built of kernel-ib RPMs and reinstall
-
- $ cd ~/OFED-1.5.1
- $ rm RPMS/*/*/kernel-ib*
- $ ./install.pl -c ofed.conf
-
- Make sure that srpt=y in the ofed.conf
-
-2a. download OFED packages from this URL
- http://www.openfabrics.org/downloads/OFED/OFED-1.5.1/
-
-2b. install OFED - remember to choose srpt=y
-
- $ cd ~/OFED-1.5.1
- $ ./install.pl
-
-
-How-to run
------------
-
-A. On srp target machine
-
-A1. Please refer to SCST's README for loading scst driver and its dev_handlers
- drivers (scst_disk, scst_vdisk block or file IO mode, nullio, ...)
- SCST's README locates in ~/scst-1.0.1.1/ directory
-
-NOTES: In any mode you always need to have lun 0 in any group's device list
- Then you can have any lun number following lun 0 (it does not required
- have lun number in order except that the first lun is always 0)
-
- Setting SRPT_LOAD=yes in /etc/infiniband/openib.conf is not good enough
- It only load ib_srpt module and does not load scst and its dev_handlers
-
- SCST's scst_disk module (pass-thru mode) does not run on default
- distribution kernels (kernels come with RHEL 5.2/5.3/5.4 & SLES 11)
- because it requires to patch and recompile the kernel. It can only
- run with vanilla kernels.
-
-Example 1: working with VDISK BLOCKIO mode
- (using md0 device, sda, and cciss/c1d0)
-a. modprobe scst
-b. modprobe scst_vdisk
-c. echo "open vdisk0 /dev/md0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
-d. echo "open vdisk1 /dev/sda BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
-e. echo "open vdisk2 /dev/cciss/c1d0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
-f. echo "add vdisk0 0" >/proc/scsi_tgt/groups/Default/devices
-g. echo "add vdisk1 1" >/proc/scsi_tgt/groups/Default/devices
-h. echo "add vdisk2 2" >/proc/scsi_tgt/groups/Default/devices
-
-Example 2: working with real back-end scsi disks in scsi pass-thru mode
-a. modprobe scst
-b. modprobe scst_disk
-c. cat /proc/scsi_tgt/scsi_tgt
-ibstor00:~ # cat /proc/scsi_tgt/scsi_tgt
-Device (host:ch:id:lun or name) Device handler
-0:0:0:0 dev_disk
-4:0:0:0 dev_disk
-5:0:0:0 dev_disk
-6:0:0:0 dev_disk
-7:0:0:0 dev_disk
-
-Now you want to exclude the first scsi disk and expose the last 4 scsi disks
-as IB/SRP luns for I/O
-
-echo "add 4:0:0:0 0" >/proc/scsi_tgt/groups/Default/devices
-echo "add 5:0:0:0 1" >/proc/scsi_tgt/groups/Default/devices
-echo "add 6:0:0:0 2" >/proc/scsi_tgt/groups/Default/devices
-echo "add 7:0:0:0 3" >/proc/scsi_tgt/groups/Default/devices
-
-Example 3: working with scst_vdisk FILEIO mode
- (using md0 device and file 10G-file)
-a. modprobe scst
-b. modprobe scst_vdisk
-c. echo "open vdisk0 /dev/md0" > /proc/scsi_tgt/vdisk/vdisk
-d. echo "open vdisk1 /10G-file" > /proc/scsi_tgt/vdisk/vdisk
-e. echo "add vdisk0 0" >/proc/scsi_tgt/groups/Default/devices
-f. echo "add vdisk1 1" >/proc/scsi_tgt/groups/Default/devices
-
-A2. modprobe ib_srpt
-
-
-B. On initiator machines you can manualy do the following steps:
-
-B1. modprobe ib_srp
-B2. ipsrpdm -c -d /dev/infiniband/umadX
- (to discover new SRP target)
- umad0: port 1 of the first HCA
- umad1: port 2 of the first HCA
- umad2: port 1 of the second HCA
-B3. echo {new target info} > /sys/class/infiniband_srp/srp-mthca0-1/add_target
-B4. fdisk -l (will show new discovered scsi disks)
-
-Example:
-Assume that you use port 1 of first HCA in the system ie. mthca0
-
-[root@lab104 ~]# ibsrpdm -c -d /dev/infiniband/umad0
-id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4,
-dgid=fe800000000000000002c90200226cf5,pkey=ffff,service_id=0002c90200226cf4
-[root@lab104 ~]# echo id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4,
-dgid=fe800000000000000002c90200226cf5,pkey=ffff,service_id=0002c90200226cf4 >
-/sys/class/infiniband_srp/srp-mthca0-1/add_target
-
-OR
-
-+ You can edit /etc/infiniband/openib.conf to load srp driver and srp HA daemon
-automatically ie. set SRP_LOAD=yes, SRP_DAEMON_ENABLE=yes, and SRPHA_ENABLE=yes
-+ To set up and use high availability feature you need dm-multipath driver
-and multipath tool
-+ Please refer to OFED-1.5.1 SRP's user manual for more in-details instructions
-on how-to enable/use HA feature (OFED-1.5.1/docs/srp_release_notes.txt)
-
-
-Here is an example of srp target setup file
---------------------------------------------
-
-*********************** srpt.sh *****************************************
-#!/bin/sh
-modprobe scst scst_threads=1
-modprobe scst_vdisk scst_vdisk_ID=100
-
-echo "open vdisk0 /dev/cciss/c1d0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
-echo "open vdisk1 /dev/sdb BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
-echo "open vdisk2 /dev/sdc BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
-echo "open vdisk3 /dev/sdd BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
-echo "add vdisk0 0" > /proc/scsi_tgt/groups/Default/devices
-echo "add vdisk1 1" > /proc/scsi_tgt/groups/Default/devices
-echo "add vdisk2 2" > /proc/scsi_tgt/groups/Default/devices
-echo "add vdisk3 3" > /proc/scsi_tgt/groups/Default/devices
-
-modprobe ib_srpt
-
-echo "add "mgmt"" > /proc/scsi_tgt/trace_level
-echo "add "mgmt_dbg"" > /proc/scsi_tgt/trace_level
-echo "add "out_of_mem"" > /proc/scsi_tgt/trace_level
-
-*********************** End srpt.sh **************************************
-
-
-How-to unload/shutdown
------------------------
-
-1. Unload ib_srpt
- $ modprobe -r ib_srpt
-2. Unload scst and its dev_handlers
- $ modprobe -r scst_vdisk scst
-3. Unload ofed
- $ /etc/rc.d/openibd stop
-
-===========================================================================
-Known Issues
-===========================================================================
-
-- With active connections/sesssions and active I/Os, unload ib_srpt driver
- will randomly fail and got stuck.
-
-- With active connections/sessions with active I/Os, reboot system will
- randomly get stuck.
-
+++ /dev/null
-#!/bin/bash
-#
-# Copyright (c) 2006 Mellanox Technologies. All rights reserved.
-# Copyright (c) 2004, 2005, 2006 Voltaire, Inc. All rights reserved.
-#
-# This Software is licensed under one of the following licenses:
-#
-# 1) under the terms of the "Common Public License 1.0" a copy of which is
-# available from the Open Source Initiative, see
-# http://www.opensource.org/licenses/cpl.php.
-#
-# 2) under the terms of the "The BSD License" a copy of which is
-# available from the Open Source Initiative, see
-# http://www.opensource.org/licenses/bsd-license.php.
-#
-# 3) under the terms of the "GNU General Public License (GPL) Version 2" a
-# copy of which is available from the Open Source Initiative, see
-# http://www.opensource.org/licenses/gpl-license.php.
-#
-# Licensee has the right to choose one of the above licenses.
-#
-# Redistributions of source code must retain the above copyright
-# notice and one of the license notices.
-#
-# Redistributions in binary form must reproduce both the above copyright
-# notice, one of the license notices in the documentation
-# and/or other materials provided with the distribution.
-#
-# Description: creates Module.symvers file for InfiniBand modules
-
-KVERSION=${KVERSION:-$(uname -r)}
-MOD_SYMVERS=./Module.symvers
-SYMS=/tmp/syms
-
-echo MODULES_DIR=${MODULES_DIR-:./}
-
-if [ -f ${MOD_SYMVERS} -a ! -f ${MOD_SYMVERS}.save ]; then
- mv ${MOD_SYMVERS} ${MOD_SYMVERS}.save
-fi
-rm -f $MOD_SYMVERS
-rm -f $SYMS
-
-for mod in $(find ${MODULES_DIR} -name '*.ko') ; do
- nm -o $mod |grep __crc >> $SYMS
- n_mods=$((n_mods+1))
-done
-
-n_syms=$(wc -l $SYMS |cut -f1 -d" ")
-echo Found $n_syms OFED kernel symbols in $n_mods modules
-n=1
-
-while [ $n -le $n_syms ] ; do
- line=$(head -$n $SYMS|tail -1)
-
- line1=$(echo $line|cut -f1 -d:)
- line2=$(echo $line|cut -f2 -d:)
- file=$(echo $line1| sed -e 's@./@@' -e 's@.ko@@' -e "s@$PWD/@@")
- crc=$(echo $line2|cut -f1 -d" ")
- sym=$(echo $line2|cut -f3 -d" ")
- echo -e "0x$crc\t$sym\t$file" >> $MOD_SYMVERS
- n=$((n+1))
-done
-
-echo ${MOD_SYMVERS} created.
+++ /dev/null
- Open Fabrics Enterprise Distribution (OFED)
- CHELSIO T3 RNIC RELEASE NOTES
- September 2010
-
-
-The iw_cxgb3 and cxgb3 modules provide RDMA and NIC support for the
-Chelsio S series adapters. Make sure you choose the 'cxgb3' and
-'libcxgb3' options when generating your ofed rpms.
-
-============================================
-New for ofed-1.5.2
-============================================
-
-- Bug fixes. Various upstream bug fixes have been included in this
-release.
-
-============================================
-Enabling Various MPIs
-============================================
-
-For OpenMPI, Intel MPI, HP MPI, and Scali MPI: you must set the iw_cxgb3
-module option peer2peer=1 on all systems. This can be done by writing
-to the /sys/module file system during boot. EG:
-
-# echo 1 > /sys/module/iw_cxgb3/parameters/peer2peer
-
-Or you can add the following line to /etc/modprobe.conf to set the option
-at module load time:
-
-options iw_cxgb3 peer2peer=1
-
-For Intel MPI, HP MPI, and Scali MPI: Enable the chelsio device by adding
-an entry to /etc/dat.conf for the chelsio interface. For instance,
-if your chelsio interface name is eth2, then the following line adds
-a DAT version 1.2 and 2.0 devices named "chelsio" and "chelsio2" for
-that interface:
-
-chelsio u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth2 0" ""
-chelsio2 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
-
-=============
-Intel MPI:
-=============
-
-The following env vars enable Intel MPI version 3.1.038. Place these
-in your user env after installing and setting up Intel MPI:
-
-export RSH=ssh
-export DAPL_MAX_INLINE=64
-export I_MPI_DEVICE=rdssm:chelsio
-export MPIEXEC_TIMEOUT=180
-export MPI_BIT_MODE=64
-
-Logout & log back in.
-
-Populate mpd.hosts with node names.
-Note: The hosts in this file should be Chelsio interface IP addresses.
-
-Note: I_MPI_DEVICE=rdssm:chelsio assumes you have an entry in
-/etc/dat.conf named "chelsio".
-
-Note: MPIEXEC_TIMEOUT value might be required to increase if heavy traffic
-is going across the systems.
-
-Contact Intel for obtaining their MPI with DAPL support.
-
-To run Intel MPI applications:
-
- mpdboot -n <num nodes> -r ssh --ncpus=<num cpus>
- mpiexec -ppn <process per node> -n <num nodes> <MPI Application Path>
-
-
-=============
-HP MPI:
-=============
-
-The following env vars enable HP MPI version 2.03.01.00. Place these
-in your user env after installing and setting up HP MPI:
-
-export MPI_ROOT=/opt/hpmpi
-export PATH=$MPI_ROOT/bin:/opt/bin:$PATH
-export MANPATH=$MANPATH:$MPI_ROOT/share/man
-
-Log out & log back in.
-
-To run HP MPI applications, use these mpirun options:
-
--prot -e DAPL_MAX_INLINE=64 -UDAPL
-
-EG:
-
-$ mpirun -prot -e DAPL_MAX_INLINE=64 -UDAPL -hostlist r1-iw,r2-iw ~/tests/presta-1.4.0/glob
-
-Where r1-iw and r2-iw are hostnames mapping to the chelsio interfaces.
-
-Also this assumes your first entry in /etc/dat.conf is for the chelsio
-device.
-
-Contact HP for obtaining their MPI with DAPL support.
-
-=============
-Scali MPI:
-=============
-
-The following env vars enable Scali MPI. Place these in your user env
-after installing and setting up Scali MPI for running over IWARP:
-
-export DAPL_MAX_INLINE=64
-export SCAMPI_NETWORKS=chelsio
-export SCAMPI_CHANNEL_ENTRY_COUNT="chelsio:128"
-
-Log out & log back in.
-
-Note: SCAMPI_NETWORKS=chelsio assumes you have an entry in /etc/dat.conf
-named "chelsio".
-
-Note: SCAMPI supports only dapl 1.2 library not dapl 2.0
-
-Contact Scali for obtaining their MPI with DAPL support.
-
-To run SCALI MPI applications:
-
- mpimon <SCALI Application Path> -- <node1_IP> <procs> <node2_IP> <procs>
-
-Note: <procs> is the number of processes to run on the node Note:
-<node#_IP> should be the IP of Chelsio's interface
-
-=============
-OpenMPI:
-=============
-
-OpenMPI iWARP support is only available in OpenMPI version 1.3 or greater.
-
-Open MPI will work without any specific configuration via the openib btl.
-Users wishing to performance tune the configurable options may wish to
-inspect the receive queue values. Those can be found in the "Chelsio T3"
-section of mca-btl-openib-hca-params.ini.
-
-Note: OpenMPI version 1.3 does not support newer Chelsio card with device
-ID 0x0035 and 0x0036. To use those cards add the device id of the cards
-in the "Chelsio T3" section of mca-btl-openib-hca-params.ini file.
-
-To run OpenMPI applications:
-
- mpirun --host <node1>,<node2> -mca btl openib,sm,self <OpenMPI Application Path>
-
-=============
-MVAPICH2:
-=============
-
-The following env vars enable MVAPICH2 version 1.4-2. Place these
-in your user env after installing and setting up MVAPICH2 MPI:
-
-export MVAPICH2_HOME=/usr/mpi/gcc/mvapich2-1.4/
-export MV2_USE_IWARP_MODE=1
-export MV2_USE_RDMA_CM=1
-
-On each node, add this to the end of /etc/profile.
-
- ulimit -l 999999
-
-On each node, add this to the end of /etc/init.d/sshd and restart sshd.
-
- ulimit -l 999999
- % service sshd restart
-
-Verify the ulimit changes worked. These should show '999999':
-
- % ulimit -l
- % ssh <peer> ulimit -l
-
-Note: You may have to restart sshd a few times to get it to work.
-
-Create mpd.hosts with list of hostname or ipaddrs in the cluster. They
-should be names/addresses that you can ssh to without passwords. (See
-Passwordless SSH Setup).
-
-On each node, create /etc/mv2.conf with a single line containing the
-IP address of the local T3 interface. This is how MVAPICH2 picks which
-interface to use for RDMA traffic.
-
-On each node, edit /etc/hosts file. Comment the entry if there is an
-entry with 127.0.0.1 IP Address and local host name. Add an entry for
-corporate IP address and local host name (name that you have given in
-mpd.hosts file) in /etc/hosts file.
-
-To run MVAPICH2 application:
-
- mpirun_rsh -ssh -np 8 -hostfile mpd.hosts <MVAPICH2 Application Path>
-
-============================================
-Loadable Module options:
-============================================
-
-The following options can be used when loading the iw_cxgb3 module to
-tune the iWARP driver:
-
-cong_flavor - set the congestion control algorithm. Default is 1.
- 0 == Reno
- 1 == Tahoe
- 2 == NewReno
- 3 == HighSpeed
-
-snd_win - set the TCP send window in bytes. Default is 32kB.
-
-rcv_win - set the TCP receive window in bytes. Default is 256kB.
-
-crc_enabled - set whether MPA CRC should be negotiated. Default is 1.
-
-markers_enabled - set whether to request receiving MPA markers. Default is
- 0; do not request to receive markers.
-
- NOTE: The Chelsio RNIC fully supports markers, but
- the current OFA RDMA-CM doesn't provide an API for
- requesting either markers or crc to be negotiated. Thus
- this functionality is provided via module parameters.
-
-mpa_rev - set the MPA revision to be used. Default is 1, which is
- spec compliant. Set to 0 to connect with the Ammasso 1100
- rnic.
-
-ep_timeout_secs - set the number of seconds for timing out MPA start up
- negotiation and normal close. Default is 60.
-
-peer2peer - Enables connection setup changes to allow peer2peer
- applications to work over chelsio rnics. This enables
- the following applications:
- Intel MPI
- HP MPI
- Open MPI
- Scali MPI
- MVAPICH2
- Set peer2peer=1 on all systems to enable these
- applications.
-
-The following options can be used when loading the cxgb3 module to
-tune the NIC driver:
-
-msi - whether to use MSI or MSI-X. Default is 2.
- 0 = only pin
- 1 = only MSI or pin
- 2 = use MSI/X, MSI, or pin, based on system
-
-============================================
-Updating Firmware:
-============================================
-
-This release requires firmware version 7.10.0, and Protocol SRAM
-version 1.1.0. These versions are included in the ofed-1.5.2 release
-and will be automatically loaded when the cxgb3 module is loaded and
-the interface configured. To load later/newer versions of the firmware,
-follow this procedure:
-
-If your distro/kernel supports firmware loading, you can place the chelsio
-firmware and psram images in /lib/firmware/cxgb3, then unload and reload
-the cxgb3 module to get the new images loaded. If this does not work,
-then you can load the firmware images manually:
-
-Obtain the cxgbtool tool and the update_eeprom.sh script from Chelsio.
-
-To build cxgbtool:
-
-# cd <path-to-cxgbtool>
-# make && make install
-
-Then load the cxgb3 driver:
-
-# modprobe cxgb3
-
-Now note the ethernet interface name for the T3 device. This can be
-done by typing 'ifconfig -a' and noting the interface name for the
-interface with a HW address that begins with "00:07:43". Then load the
-new firmware and eeprom file:
-
-# cxgbtool ethxx loadfw <firmware_file>
-# update_eeprom.sh ethxx <eeprom_file>
-# reboot
-
-============================================
-Testing connectivity with ping and rping:
-============================================
-
-Configure the ethernet interfaces for your cxgb3 device. After you
-modprobe iw_cxgb3 you will see one or two ethernet interfaces for the
-T3 device. Configure them with an appropriate ip address, netmask, etc.
-You can use the Linux ping command to test basic connectivity via the
-T3 interface.
-
-To test RDMA, use the rping command that is included in the librdmacm-utils
-rpm:
-
-On the server machine:
-
-# rping -s -a 0.0.0.0 -p 9999
-
-On the client machine:
-
-# rping -c -VvC10 -a server_ip_addr -p 9999
-
-You should see ping data like this on the client:
-
-ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
-ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrs
-ping data: rdma-ping-2: CDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrst
-ping data: rdma-ping-3: DEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu
-ping data: rdma-ping-4: EFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuv
-ping data: rdma-ping-5: FGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvw
-ping data: rdma-ping-6: GHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwx
-ping data: rdma-ping-7: HIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
-ping data: rdma-ping-8: IJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz
-ping data: rdma-ping-9: JKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA
-client DISCONNECT EVENT...
-#
-
-============================================
-Additional Notes and Issues
-============================================
-
-1) To run uDAPL over the Chelsio device, you must export this environment
-variable:
-
- export DAPL_MAX_INLINE=64
-
-2) If you have a multi-homed host and the physical Ethernet networks
-are bridged, or if you have multiple Chelsio RNICs in the system, then
-you need to configure ARP to send replies only on the interface with
-the target IP address:
-
- sysctl -w net.ipv4.conf.all.arp_ignore=2
-
-3) If you are building OFED against a kernel.org kernel later than
-2.6.20, then make sure your kernel is configured with the cxgb3 and
-iw_cxgb3 modules enabled. This forces the kernel to pull in the genalloc
-allocator, which is required for the OFED iw_cxgb3 module. Make sure
-these config options are included in your .config file:
-
- CONFIG_CHELSIO_T3=m
- CONFIG_INFINIBAND_CXGB=m
-
-4) If you run the RDMA latency test using the ib_rdma_lat program, make
-sure you use the following command lines to limit the amount of inline
-data to 64:
-
- server: ib_rdma_lat -c -I 64
- client: ib_rdma_lat -c -I 64 server_ip_addr
-
-5) If you're running NFSRDMA over Chelsio's T3 RNIC and your clients are
-using a 64KB page size (like PPC64 and IA64 systems) and your server is
-using a 4KB page size (like i386 and X86_64), then you need to mount the
-server using rsize=32768,wsize=32768 to avoid overrunning the Chelsio
-RNIC fast register limits. This is a known firmware limitation in the
-Chelsio RNIC.
+++ /dev/null
- Open Fabrics Enterprise Distribution (OFED)
- Diagnostic Tools in OFED 1.5 Release Notes
-
- December 2009
-
-
-Repo: git://git.openfabrics.org/~sashak/management/management.git
-URL: http://www.openfabrics.org/downloads/management
-
-
-General
--------
-Model of operation: All diag utilities use direct MAD access to perform their
-operations. Operations that require only QP0 MADs may use directed-route
-MADs, and therefore can work even in unconfigured subnets. Almost all
-utilities can operate without accessing the SM, unless GUID-to-LID translation
-is required. The only exception is saquery, which requires the SM.
-
-
-Dependencies
-------------
-Most diag utilities depend on libibmad and libibumad.
-All diag utilities depend on the ib_umad kernel module.
-
-
-Multiple port/Multiple CA support
----------------------------------
-When no IB device or port is specified (see the "local umad parameters" below),
-the libibumad library selects the port to use by the following criteria:
-1. the first port that is ACTIVE.
-2. if not found, the first port that is UP (physical link up).
-
-If a port and/or CA name is specified, the libibumad library attempts to
-satisfy the user request, and will fail if it cannot do so.
-
-For example:
- ibaddr # use the 'best port'
- ibaddr -C mthca1 # pick the best port from mthca1 only.
- ibaddr -P 2 # use the second (active/up) port from the
- first available IB device.
- ibaddr -C mthca0 -P 2 # use the specified port only.
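-
-The default selection rule above (first ACTIVE port, otherwise first port with
-the physical link up) can be illustrated with a short sketch. This is not the
-libibumad implementation: the function name pick_port and the input record
-format "<ca> <port> <state> <physical_state>" are assumptions made only for
-this illustration.
-
```shell
# Read records of the (hypothetical) form
#     <ca_name> <port_number> <state> <physical_state>
# on stdin, in device order, and print "<ca_name> <port_number>" for the port
# that the two criteria would select. Returns 1 if no port qualifies.
pick_port() {
    awk '
        # Remember the first port whose logical state is ACTIVE.
        $3 == "ACTIVE" && !active { active = $1 " " $2 }
        # Remember the first port whose physical link is UP.
        $4 == "UP" && !up         { up = $1 " " $2 }
        END {
            if (active)  print active
            else if (up) print up
            else         exit 1
        }
    '
}
```
-
-Feeding the function only the records of one CA corresponds to the -C case;
-feeding it a single record corresponds to specifying both -C and -P.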
-
-
-Common options & flags
-----------------------
-Most diagnostics take the following flags. The exact list of supported
-flags per utility can be found in the usage message and can be displayed
-using util_name -h syntax.
-
-# Debugging flags
- -d raise the IB debugging level. May be used
- several times (-ddd or -d -d -d).
- -e show umad send receive errors (timeouts and others)
- -h display the usage message
- -v increase the application verbosity level.
- May be used several times (-vv or -v -v -v)
- -V display the internal version info.
-
-# Addressing flags
- -D use directed path address arguments. The path
- is a comma separated list of out ports.
- Examples:
- "0" # self port
- "0,1,2,1,4" # out via port 1, then 2, ...
- -G use GUID address arguments. In most cases, it is the Port GUID.
- Examples:
- "0x08f1040023"
- -s <smlid> use 'smlid' as the target lid for SA queries.
-
-# Local umad parameters:
- -C <ca_name> use the specified ca_name.
- -P <ca_port> use the specified ca_port.
- -t <timeout_ms> override the default timeout for the solicited mads.
-
-
-CLI notation
-------------
-All utilities use the POSIX style notation, meaning that all options (flags)
-must precede all arguments (parameters).
-
-
-Utilities descriptions
-----------------------
-See man pages
-
-
-Bugs Fixed
-----------
-
+++ /dev/null
-
- Open Fabrics Enterprise Distribution (OFED)
- ehca in OFED 1.5.2 Release Notes
-
- September 2010
-
-
-Overview
---------
-ehca is the low level driver implementation for all IBM GX-based HCAs.
-
-Supported HCAs
---------------
-- GX Dual-port SDR 4x IB HCA
-- GX Dual-port SDR 12x IB HCA
-- GX Dual-port DDR 4x IB HCA
-- GX Dual-port DDR 12x IB HCA
-
-Available Parameters
---------------------
-In order to set ehca parameters, add the following line(s) to /etc/modprobe.conf:
-
- options ib_ehca <parameter>=<value>
-
-where <parameter> is one of the following items:
-- debug_level debug level (0: no debug traces (default), 1: with debug traces)
-- port_act_time time to wait for port activation (default: 30 sec)
-- scaling_code scaling code (0: disable (default), 1: enable)
-- open_aqp1 Open AQP1 on startup (default: no) (bool)
-- hw_level Hardware level (0: autosensing (default), 0x10..0x14: eHCA, 0x20..0x23: eHCA2) (int)
-- nr_ports number of connected ports (-1: autodetect (default), 1: port one only, 2: two ports) (int)
-- use_hp_mr Use high performance MRs (default: no) (bool)
-- poll_all_eqs Poll all event queues periodically (default: yes) (bool)
-- static_rate Set permanent static rate (default: no static rate) (int)
-- lock_hcalls Serialize all hCalls made by the driver (default: autodetect) (bool)
-- number_of_cqs Max number of CQs which can be allocated (default: autodetect) (int)
-- number_of_qps Max number of QPs which can be allocated (default: autodetect) (int)
-
-New Features
-------------
-- None
-
-Fixed Bugs ofed-1.5.2
----------------------
-- Fixed automatic detection if hcall locks should be enabled or not
-
-Fixed Bugs ofed-1.5.1
----------------------
-- Fixed crash when reading sysfs performance counters
-- Do not disable IRQs when processing EQs
-- Allow query of max_dest_rd_atomic and max_qp_rd_atomic values
-
-Fixed Bugs ofed-1.5
----------------------
-- SRQ overflow prevention
-- Performance improvements for QP creation
-- MAD redirection fix
-
-Fixed Bugs ofed-1.4.1
----------------------
-- none
-
-Fixed Bugs ofed-1.4
----------------------
-- Reject send work requests only for RESET, INIT and RTR state
-- Reject receive work requests if QP is in RESET state
-- In case of lost interrupts, trigger EOI to reenable interrupts
-- Filter PATH_MIG events if QP was never armed
-- Release mutex in error path of alloc_small_queue_page()
-- Check idr_find() return value
-- Discard double CQE for one WR
-- Generate flush status CQ entries
-- Don't allow creating UC QP with SRQ
-- Fix reported max number of QPs and CQs in systems with >1 adapter
-- Reject dynamic memory add/remove when ehca adapter is present
-- Remove reference to special QP in case of port activation failure
-- Fix locking for shca_list_lock
-
-Fixed Bugs ofed-1.3.1
----------------------
-- Support all ibv_devinfo values in query_device() and query_port()
-- Prevent posting of SQ WQEs if QP not in RTS
-- Remove mr_largepage parameter, ie always enable large page support
-- Allocate event queue size depending on max number of CQs and QPs
-- Protect QP against destroying until all async events for it are handled
-
-Fixed Bugs ofed-1.3
--------------------
-- Serialize HCA-related hCalls if necessary
-- Fix static rate if path faster than link
-- Return physical link information in query_port()
-- Fix clipping of device limits to INT_MAX
-- Fix issues related to path migration support
-- Support more than 4k QPs for userspace and kernelspace
-- Prevent sending UD packets to QP0
-- Prevent RDMA-related connection failures on some eHCA2 hardware
-
-Available backports
--------------------
-- RedHat EL5 up4: 2.6.18-164.ELsmp
-- RedHat EL5 up5: 2.6.18-194.ELsmp
-- SLES11: 2.6.27.19-5.1-smp
-- SLES11SP1: 2.6.32.12-0.7-default
-- SLES10SP3: 2.6.16.60-0.54.5
-- kernel.org: 2.6.29-32
-
-Known Issues
-------------
-1. The port(s) must be connected to an active switch port while
-loading the ehca device driver.
-
-2. Dynamic memory operations are tolerated by ehca, but are prevented by
-the driver while it is loaded.
+++ /dev/null
-IB Bonding
-===============================================================================
-
-1. Introduction
-2. How to work with interface configuration scripts
-2.1 Configuration with initscripts support
-2.1.1 Writing network scripts under Redhat-AS4 (Update 4, 5, 6 or 7)
-2.1.2 Writing network scripts under Redhat-EL5
-2.2 Configuration with sysconfig support
-2.2.1 Writing network scripts under SLES-10
-2.3 Configuring Ethernet slaves
-
-1. Introduction
--------------------------------------------------------------------------------
-ib-bonding is a High Availability solution for IPoIB interfaces. It is based
-on the Linux Ethernet Bonding Driver and was adapted to work with IPoIB.
-However, IPoIB interfaces are supported only in the active-backup mode;
-other modes should not be used.
-
-2. How to work with interface configuration scripts
--------------------------------------------------------------------------------
-To create an interface configuration script for the ibX and bondX interfaces,
-you should use the standard syntax (depending on your OS).
-
-2.1 Configuration with initscripts support
-------------------------------------------
-Note: This feature is available only for Redhat-AS4 (Update 4, Update 5,
-Update 6 or Update 7) and for Redhat-EL5 and above.
-
-2.1.1 Writing network scripts under Redhat-AS4 (Update 4, 5, 6 or 7)
------------------------------------------------------------------
-* In the master (bond) interface script add the lines:
-TYPE=Bonding
-MTU=<according to the slave's MTU>
-
-Example: for bond0 (master) the file is named /etc/sysconfig/network-scripts/ifcfg-bond0
-with the following text in the file:
-
-DEVICE=bond0
-IPADDR=192.168.1.1
-NETMASK=255.255.255.0
-NETWORK=192.168.1.0
-BROADCAST=192.168.1.255
-ONBOOT=yes
-BOOTPROTO=none
-USERCTL=no
-TYPE=Bonding
-MTU=65520
-
-Note: 65520 is a valid MTU value only if all IPoIB slaves operate in connected
-mode and are configured with the same value. For IPoIB slaves that work in
-datagram mode, use MTU=2044. If you set an incorrect MTU, or do not set the MTU
-at all (leaving it at the default value), the performance of the interface
-might decrease.
-
-* In the slave (ib) interface script put the following lines:
-SLAVE=yes
-MASTER=<bond name>
-TYPE=InfiniBand
-PRIMARY=<yes|no>
-
-Example: the script for ib0 (slave) would be named /etc/sysconfig/network-scripts/ifcfg-ib0
-with the following text in the file:
-
-DEVICE=ib0
-USERCTL=no
-ONBOOT=yes
-MASTER=bond0
-SLAVE=yes
-BOOTPROTO=none
-TYPE=InfiniBand
-PRIMARY=yes
-
-Note: If the slave interface is not primary then the line PRIMARY= is not
-required and can be omitted.
-
-After the configuration is saved, restart the network service by running:
-/etc/init.d/network restart
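-
-The MTU rule from the note in this section can be checked with a small helper.
-This is a sketch under stated assumptions: the function names are hypothetical,
-and the slave mode is assumed to be readable from /sys/class/net/<ibX>/mode
-(pass /sys/class/net as the first argument on a live system).
-
```shell
# Map an IPoIB mode string to the MTU value given in the note above.
ipoib_mtu_for_mode() {
    case "$1" in
        connected) echo 65520 ;;
        datagram)  echo 2044 ;;
        *) echo "unknown IPoIB mode: $1" >&2; return 1 ;;
    esac
}

# Print the MTU to put in the bond master script: 65520 only if every listed
# slave is in connected mode, otherwise 2044.
# $1: sysfs network directory; remaining arguments: slave interface names.
bond_mtu() {
    root="$1"; shift
    mtu=65520
    for slave in "$@"; do
        mode=$(cat "$root/$slave/mode") || return 1
        slave_mtu=$(ipoib_mtu_for_mode "$mode") || return 1
        # Keep the smallest MTU seen across all slaves.
        [ "$slave_mtu" -lt "$mtu" ] && mtu=$slave_mtu
    done
    echo "$mtu"
}
```
-
-For example, "bond_mtu /sys/class/net ib0 ib1" prints the value to use on the
-MTU= line of ifcfg-bond0.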
-
-2.1.2 Writing network scripts under Redhat-EL5
------------------------------------------------
-Follow the instructions in 2.1.1 (Writing network scripts under Redhat-AS4)
-with the following changes:
-* In the bondX (master) script - the line TYPE=Bonding is not needed.
-* In the bondX (master) script - you may add more bonding options to the
-configuration with the following line:
-BONDING_OPTS=" primary=ib0 updelay=0 downdelay=0"
-* In the ibX (slave) script - the line TYPE=InfiniBand is necessary when using
- bonding over devices configured with partitions (p_key).
-Example:
- ifcfg-ibX.8003 and ifcfg-ibY.8003 must include the TYPE=InfiniBand line in
- their configuration files when used as slaves for the bondX device.
-* In /etc/modprobe.conf add the following lines:
-alias bond0 bonding
-options bond0 miimon=100 mode=1 max_bonds=1
-
-If you want more than one bonding interface, name them bond1, bond2... and
-just add the necessary lines in /etc/modprobe.conf and change max_bonds=1 to
-max_bonds=N where N=number_of_bonding_interfaces
-
-Note: restarting OFED doesn't keep the bonding configuration via initscripts.
-You have to restart the network service in order to recreate the bonding
-interface.
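-
-For several bonding interfaces, the /etc/modprobe.conf entries can be generated
-rather than typed. The sketch below assumes that each additional bondK
-interface takes the same alias and options lines as bond0, with max_bonds set
-to the total count; the function name bonding_modprobe_lines is hypothetical.
-
```shell
# Print modprobe.conf lines for bond0 .. bond(N-1).
# $1: number of bonding interfaces (N).
bonding_modprobe_lines() {
    count="$1"
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "alias bond$i bonding"
        echo "options bond$i miimon=100 mode=1 max_bonds=$count"
        i=$((i + 1))
    done
}
```
-
-Review the output before appending it to /etc/modprobe.conf, since the file may
-already contain entries for bond0.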
-
-2.2 Configuration with sysconfig support
-----------------------------------------
-Note: This feature is available only for SLES-10 and above.
-
-2.2.1 Writing network scripts under SLES-10
------------------------------------------------
-* In the master (bond) interface script add the lines:
-
-BONDING_MASTER=yes
-BONDING_MODULE_OPTS="mode=active-backup miimon=<value>"
-BONDING_SLAVE0=slave0
-BONDING_SLAVE1=slave1
-MTU=<according to the slave's MTU>
-
-Example: for bond0 (master) the file is named /etc/sysconfig/network/ifcfg-bond0
-with the following text in the file:
-
-BOOTPROTO="static"
-BROADCAST="10.0.2.255"
-IPADDR="10.0.2.10"
-NETMASK="255.255.0.0"
-NETWORK="10.0.2.0"
-REMOTE_IPADDR=""
-STARTMODE="onboot"
-BONDING_MASTER="yes"
-BONDING_MODULE_OPTS="mode=active-backup miimon=100 primary=ib0 updelay=0 downdelay=0"
-BONDING_SLAVE0=ib0
-BONDING_SLAVE1=ib1
-MTU=65520
-
-Note: 65520 is a valid MTU value only if all IPoIB slaves operate in connected
-mode and are configured with the same value. For IPoIB slaves that work in
-datagram mode, use MTU=2044. If you set an incorrect MTU, or do not set the MTU
-at all (leaving it at the default value), the performance of the interface
-might decrease.
-
-Note: primary, downdelay and updelay are optional bonding interface
-settings. You may choose to use them, change them or delete them from the
-configuration script (by editing the line that starts with BONDING_MODULE_OPTS).
-
-* The slave (ib) interface script should look like this:
-
-BOOTPROTO='none'
-STARTMODE='off'
-PRE_DOWN_SCRIPT=/etc/sysconfig/network/unenslave.sh
-
-After the configuration is saved, restart the network service by running:
-/etc/init.d/network restart
-
-2.3 Configuring Ethernet slaves
--------------------------------
-It is not possible to have a mix of Ethernet slaves and IPoIB slaves under the
-same bonding master. It is possible, however, for a bonding master of Ethernet
-slaves and a bonding master of IPoIB slaves to co-exist in one machine.
-To configure Ethernet slaves under a bonding master use the following
-instructions (depending on the OS):
-
-* Under Redhat-AS4
-
-Use the same instructions as for IPoIB slaves with the following exceptions
-
-- In the master configuration file add the line
-SLAVEDEV=1
-- In the slave configuration file leave out the line
-TYPE=InfiniBand
-- For Ethernet, it is possible to set parameters of the bonding module in /etc/modprobe.conf
-with the following line for example
-options bonding miimon=100 mode=1 primary=eth0
-Note that alias names for the bonding module (such as bond0) may not work.
-
-* Under Redhat-AS5
-
-No special instructions are required.
-
-* Under SLES10
-
-When using both types of bonding, it is necessary to update the
-MANDATORY_DEVICES environment variable in /etc/sysconfig/network/config with the names
-of the InfiniBand devices (ib0, ib1, etc.). Otherwise, bonding devices will be created
-before InfiniBand devices at boot time.
-
-Note: If there is more than one Ethernet NIC installed then there might be a
-race for the interface name eth0, eth1 etc. This may lead to unexpected
-relation between logical and physical devices which may lead to wrong bonding
-configuration. This issue may be solved by binding a logical device name (e.g.
-eth0) to a physical (hardware) device by specifying the MAC address in the
-ethN configuration file.
+++ /dev/null
- Open Fabrics Enterprise Distribution (OFED)
- IB ACM in OFED 1.5 Release Notes
-
- July 2010
-
-
-===============================================================================
-Table of Contents
-===============================================================================
-1. Overview
-2. Quick Start Guide
-3. Operation Details
-4. Known Issues
-
-===============================================================================
-1. Overview
-===============================================================================
-The IB ACM package implements and provides a framework for experimental name,
-address, and route resolution services over InfiniBand. It is intended to
-address connection setup scalability issues running MPI applications on
-large clusters. The IB ACM provides information needed to establish a
-connection, but does not implement the CM protocol.
-
-The librdmacm can invoke IB ACM services when built using the --with-ib_acm
-option. The IB ACM services tie in under the rdma_resolve_addr,
-rdma_resolve_route, and rdma_getaddrinfo routines. For maximum benefit,
-the rdma_getaddrinfo routine should be used; however, existing applications
-should still see significant connection scaling benefits using the calls
-available in librdmacm 1.0.11 and previous releases.
-
-The IB ACM is focused on being scalable and efficient. The current
-implementation limits network traffic, SA interactions, and centralized
-services. ACM supports multiple resolution protocols in order to handle
-different fabric topologies.
-
-The IB ACM package is comprised of two components: the ib_acm service
-and a test/configuration utility - ib_acme. Both are userspace components
-and are available for Linux and Windows. Additional details are given below.
-
-===============================================================================
-2. Quick Start Guide
-===============================================================================
-
-1. Prerequisites: libibverbs and libibumad must be installed.
- The IB stack should be running with IPoIB configured.
- These steps assume that the user has administrative privileges.
-2. Install the IB ACM package
- This installs ib_acm, and ib_acme.
-3. Run ib_acme -A -O
- This will generate IB ACM address and options configuration files.
- (acm_addr.cfg and acm_opts.cfg)
-4. Run ib_acm and leave running.
- ib_acm will eventually be converted to a service/daemon, but for now
- is a userspace application. Because ib_acm uses the libibumad
- interfaces, it should be run with administrative privileges.
-5. Optionally, run ib_acme -s <source_ip> -d <dest_ip> -v
- This will verify that the ib_acm service is running.
-6. Install librdmacm using the build option --with-ib_acm.
- The librdmacm will automatically use the ib_acm service.
- On failures, the librdmacm will fall back to normal resolution.
-
-===============================================================================
-3. Operation Details
-===============================================================================
-
-ib_acme:
-The ib_acme program serves a dual role. It acts as a utility to test
-ib_acm operation and help verify if the ib_acm service and selected
-protocol are usable for a given cluster configuration. Additionally,
-it automatically generates ib_acm configuration files to assist with
-or eliminate manual setup.
-
-
-acm configuration files:
-The ib_acm service relies on two configuration files.
-
-The acm_addr.cfg file contains name and address mappings for each IB
-<device, port, pkey> endpoint. Although the names in the acm_addr.cfg
-file can be anything, ib_acme maps the host name and IP addresses to
-the IB endpoints.
-
-The acm_opts.cfg file provides a set of configurable options for the
-ib_acm service, such as timeout, number of retries, logging level, etc.
-ib_acme generates the acm_opts.cfg file using static information. A
-future enhancement would adjust options based on the current system
-and cluster size.
-
-
-ib_acm:
-The ib_acm service is responsible for resolving names and addresses to
-InfiniBand path information and caching such data. It is currently
-implemented as an executable application, but is a conceptual service
-or daemon that should execute with administrative privileges.
-
-The ib_acm implements a client interface over TCP sockets, which is
-abstracted by the librdmacm library. One or more back-end protocols are
-used by the ib_acm service to satisfy user requests. Although the
-ib_acm supports standard SA path record queries on the back-end, it
-provides an experimental multicast resolution protocol in hope of
-achieving greater scalability. The latter is not usable on all fabric
-topologies, specifically ones that may not have reversible paths.
-Users should use the ib_acme utility to verify that the multicast
-protocol is usable before running other applications.
-
-Conceptually, the ib_acm service implements an ARP like protocol and either
-uses IB multicast records to construct path record data or queries the
-SA directly, depending on the selected route protocol. By default, the
-ib_acm service uses and caches SA path record queries.
-
-Specifically, all IB endpoints join a number of multicast groups.
-Multicast groups differ based on rates, mtu, sl, etc., and are prioritized.
-All participating endpoints must be able to communicate on the lowest
-priority multicast group. The ib_acm assigns one or more names/addresses
-to each IB endpoint using the acm_addr.cfg file. Clients provide source
-and destination names or addresses as input to the service, and receive
-as output path record data.
-
-The service maps a client's source name/address to a local IB endpoint.
-If a client does not provide a source address, then the ib_acm service
-will select one based on the destination and local routing tables. If the
-destination name/address is not cached locally, it sends a multicast
-request out on the lowest priority multicast group on the local endpoint.
-The request carries a list of multicast groups that the sender can use.
-The recipient of the request selects the highest priority multicast group
-that it can use as well and returns that information directly to the sender.
-The request data is cached by all endpoints that receive the multicast
-request message. The source endpoint also caches the response and uses
-the multicast group that was selected to construct or obtain path record
-data, which is returned to the client.
-
-===============================================================================
-4. Known Issues
-===============================================================================
-
-The current implementation of the IB ACM has several restrictions:
-- The ib_acm is limited in its handling of dynamic changes;
- the ib_acm must be stopped and restarted if a cluster is reconfigured.
-- Cached data does not time out and is only updated if a new resolution
- request is received from a different QPN than a cached request.
-- Support for IPv6 has not been verified.
-- The number of addresses that can be assigned to a single endpoint is
- limited to 4.
-- The number of multicast groups that an endpoint can support is limited to 2.
-
+++ /dev/null
- Open Fabrics InfiniBand Diagnostic Utilities
- --------------------------------------------
-
-*******************************************************************************
-RELEASE: OFED 1.5
-DATE: Dec 2009
-
-===============================================================================
-Table of Contents
-===============================================================================
-1. Overview
-2. New features
-3. Major Bugs Fixed
-4. Known Issues
-
-===============================================================================
-1. Overview
-===============================================================================
-
-The ibutils package provides a set of diagnostic tools that check the health
-of an InfiniBand fabric.
-
-Package components:
-ibis: IB interface - A TCL shell that provides interface for sending various
- MADs on the IB fabric. This is the component that actually accesses
- the IB Hardware.
-
-ibdm: IB Data Model - A library that provides IB fabric analysis.
-
-ibmgtsim: An IB fabric simulator. Useful for developing IB tools.
-
-ibdiag: This package provides 3 tools which provide the user interface
- to activate the above functionality:
- - ibdiagnet: Performs various quality and health checks on the IB
- fabric.
- - ibdiagpath: Performs various fabric quality and health checks on
- the given links and nodes in a specific path.
- - ibdiagui: A GUI wrapper for the above tools.
-
-===============================================================================
-2. New Features
-===============================================================================
-
-* New "From the Edge" topology matching algorithm.
- Integrated into ibtopodiff when run with the flag -e
-
-* New library - libsysapi
- The library is a C API for IBDM C++ objects
-
-* Added ibnl definition files for Mellanox and Sun IB QDR products
-
-* Added new feature to ibdiagnet - general device info
-
-* ibdiagnet can now take port 0 as a parameter (for managed switches).
-
-
-===============================================================================
-3. Major Bugs Fixed
-===============================================================================
-
-* ibutils: various fixes in build process (dependencies, parallel build, etc)
-
-* ibdiagnet: fixed crash with -r flag
-
-* ibdiagnet: fixed regular expression for pkey matching
-
-* ibdiagnet: ibdiagnet.lst file has device IDs with trailing zeroes - fixed
-
-===============================================================================
-4. Known Issues
-===============================================================================
-
-- The ibdiagnet "-wt" option may generate a bad topology file when running on a
- cluster that contains complex switch systems.
+++ /dev/null
- Open Fabrics Enterprise Distribution (OFED)
- ipath in OFED 1.5 Release Notes
-
- December 2009
-
-======================================================================
-1. Overview
-======================================================================
-ipath is the low level driver implementation for the
-QLogic HyperTransport HCA only (model QHT7140).
-
-The qib driver is the currently supported driver for all
-PCI-Express based InfiniBand HCAs.
+++ /dev/null
- Open Fabrics Enterprise Distribution (OFED)
- IPoIB in OFED 1.5.2 Release Notes
-
- December 2010
-
-
-===============================================================================
-Table of Contents
-===============================================================================
-1. Overview
-2. Known Issues
-3. DHCP Support of IPoIB
-4. The ib-bonding driver
-5. Child interfaces
-6. Bug Fixes and Enhancements Since OFED 1.3
-7. Bug Fixes and Enhancements Since OFED 1.3.1
-8. Bug Fixes and Enhancements Since OFED 1.4
-9. Bug Fixes and Enhancements Since OFED 1.4.2
-10. Bug Fixes and Enhancements Since OFED 1.5.0
-11. Bug Fixes and Enhancements Since OFED 1.5.2
-12. Performance tuning
-
-===============================================================================
-1. Overview
-===============================================================================
-IPoIB is a network driver implementation that enables transmitting IP and ARP
-protocol packets over an InfiniBand UD channel. The implementation conforms to
-the relevant IETF working group's RFCs (http://www.ietf.org).
-
-
-Usage and configuration:
-========================
-1. To check the current mode used for outgoing connections, enter:
- cat /sys/class/net/ib0/mode
-2. To disable IPoIB CM at compile time, enter:
- cd OFED-1.5
- export OFA_KERNEL_PARAMS="--without-ipoib-cm"
- ./install.pl
-3. To change the run-time configuration for IPoIB, edit
- /etc/infiniband/openib.conf and change the following parameters:
- # Enable IPoIB Connected Mode
- SET_IPOIB_CM=yes
- # Set IPoIB MTU
- IPOIB_MTU=65520
-
-4. You can also change the mode and MTU for a specific interface manually.
-
- To enable connected mode for interface ib0, enter:
- echo connected > /sys/class/net/ib0/mode
-
- To increase MTU, enter:
- ifconfig ib0 mtu 65520
-
-5. Switching between CM and UD mode can be done at run time:
- echo datagram > /sys/class/net/ib0/mode sets the mode of ib0 to UD
- echo connected > /sys/class/net/ib0/mode sets the mode of ib0 to CM
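-
-The mode switch in the steps above can be wrapped in a small helper that
-rejects anything other than the two mode names and reads the mode back
-afterwards. This is only a sketch: the function name ipoib_set_mode is
-hypothetical, and the optional third argument exists only so the function can
-be tried against a scratch directory instead of /sys/class/net.
-
```shell
# Set the IPoIB mode of an interface and print the mode that is then in effect.
# $1: interface name (e.g. ib0)
# $2: "connected" or "datagram"
# $3 (optional): sysfs network directory; defaults to /sys/class/net.
ipoib_set_mode() {
    iface="$1"
    mode="$2"
    root="${3:-/sys/class/net}"
    case "$mode" in
        connected|datagram) ;;
        *) echo "mode must be 'connected' or 'datagram'" >&2; return 1 ;;
    esac
    echo "$mode" > "$root/$iface/mode" || return 1
    # Read the value back so the caller can confirm the change.
    cat "$root/$iface/mode"
}
```
-
-After switching to connected mode, the MTU still has to be raised separately,
-for example with "ifconfig ib0 mtu 65520" as shown in step 4.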
-
-
-===============================================================================
-2. Known Issues
-===============================================================================
-1. If a host has multiple interfaces and (a) each interface belongs to a
- different IP subnet, (b) they all use the same InfiniBand Partition, and (c)
- they are connected to the same IB Switch, then the host violates the IP rule
- requiring different broadcast domains. Consequently, the host may build an
- incorrect ARP table.
-
- The correct setting of a multi-homed IPoIB host is achieved by using a
- different PKEY for each IP subnet. If a host has multiple interfaces on the
- same IP subnet, then to prevent a peer from building an incorrect ARP entry
- (neighbor) set the net.ipv4.conf.X.arp_ignore value to 1 or 2, where X
- stands for the IPoIB (non-child) interfaces (e.g., ib0, ib1, etc). This
- causes the network stack to send ARP replies only on the interface with the
- IP address specified in the ARP request:
-
- sysctl -w net.ipv4.conf.ib0.arp_ignore=1
- sysctl -w net.ipv4.conf.ib1.arp_ignore=1
-
- Or, globally,
-
- sysctl -w net.ipv4.conf.all.arp_ignore=1
-
- To learn more about the arp_ignore parameter, see
- Documentation/networking/ip-sysctl.txt.
- Note that distributions have the means to make kernel parameters persistent.
-
-2. There are IPoIB alias lines in /etc/modprobe.d/ib_ipoib.conf which prevent
- stopping/unloading the stack (i.e., '/etc/init.d/openibd stop' will fail).
- These alias lines cause the drivers to be loaded again by udev scripts.
-
- Workaround: Change modprobe.conf to set
- OFA_KERNEL_PARAMS="--without-modprobe" before running install.pl, or remove
- the alias lines from /etc/modprobe.d/ib_ipoib.conf.
-
-3. On SLES 10:
- The ib1 interface uses the configuration script of ib0.
-
- Workaround: Invoke ifup/ifdown using both the interface name and the
- configuration script name (example: ifup ib1 ib1).
-
-4. After a hotplug event, the IPoIB interface falls back to datagram mode, and
- MTU is reduced to 2K.
- Workaround: Re-enable connected mode and increase MTU manually:
- echo connected > /sys/class/net/ib0/mode
- ifconfig ib0 mtu 65520
-
-5. Since the IPoIB configuration files (ifcfg-ib<n>) are installed under the
- standard networking scripts location (RedHat:/etc/sysconfig/network-scripts/
- and SuSE: /etc/sysconfig/network/), the option IPOIB_LOAD=no in openib.conf
- does not prevent the loading of IPoIB on boot.
-
-6. If IPoIB connected mode is enabled, it uses a large MTU for connected mode
- messages and a small MTU for datagram (in particular, multicast) messages,
- and relies on path MTU discovery to adjust MTU appropriately. Packets sent
- in the window before MTU discovery automatically reduces the MTU for a
- specific destination will be dropped, producing the following message in the
- system log:
- "packet len <actual length> (> <max allowed length>) too long to send, dropping"
-
- To warn about this, a message is produced in the system log each time MTU is
- set to a value higher than 2K.
-
-7. IPoIB IPv6 support is broken between systems with kernels < 2.6.12 and
- systems with kernels >= 2.6.12. The reason is that kernel 2.6.12 puts the link
- layer address at an offset of two bytes with respect to older kernels. This
- causes the other host to misinterpret the hardware address, resulting in a
- failure to resolve paths because they are based on wrong GIDs. As an example,
- RH 4.x and RH 5.x cannot inter-operate.
-
-8. In connected mode, TCP latency for short messages is larger by approx. 1usec
- (~5%) than in datagram mode. As a workaround, use datagram mode.
-
-9. Single-socket TCP bandwidth for kernels < 2.6.18 is lower than with
- newer kernels. It is recommended to use kernel 2.6.18 or up for
- best IPoIB performance.
-
-10. Connectivity issues are encountered when using IPv6 on ia64 systems.
-
-11. The IPoIB module uses a Linux implementation for Large Receive Offload
- (LRO) in kernel 2.6.24 and later. These kernels require installing the
- "inet_lro" module.
-
-12. ConnectX only: If you have a port configured as ETH and IPoIB is running
- in connected mode, and then you change the port type to IB, the IPoIB mode
- will change to datagram mode.
-
-13. When working with iSCSI, you must disable LRO (even if you are working in
- connected mode). This is because there is a bug in older kernels which causes
- a kernel panic.
-
-14. IPoIB datagram mode initial packet loss (bug #1287): When the datagram test
- gets to packet size 8192 or larger, it always loses the first packet in the
- sequence.
- Workaround: Increase the number of pending skb's before a neighbor is
- resolved (default is 3). This value can be changed with:
- sysctl net.ipv4.neigh.ib0.unres_qlen.
-
-15. IPoIB multicast support is broken in RH4.x kernels. This is because
- ndisc_mc_map() does not handle IPoIB hardware addresses.
-
-16. If bonding uses an IPoIB slave, then un-enslaving all slaves (or downing
- them with ifdown) followed by unloading the module ib_ipoib might crash the
- kernel. To avoid this, leave the IPoIB interfaces enslaved when unloading
- ib_ipoib.
-
-17. On SLES 11, sysconfig scripts override the interface mode and set it to
- datagram on each call to ifup, ifdown, etc. To avoid this, add the line
- IPOIB_MODE=connected
- to the interface configuration file (e.g., ifcfg-ib0).
-
-18. When installing OFED on a machine that runs kernel 2.6.30 (or another
- kernel from kernel.org that OFED supports), the installation script blocks
- the installation of ib-bonding since the bonding module that comes with the
- kernel has all the functionality to support IPoIB slaves. This approach
- however doesn't patch the sysconfig (SuSE) or initscripts (RedHat) package,
- so the network configuration script may not work properly.
- For example, if you install OFED on RHEL5.2 that runs kernel 2.6.30 and
- you try to configure and run bonding, you won't be able to restart the
- network and see bond0 up and running with IPoIB slaves.
- A workaround to this problem would be as follows:
- a. Compile ib-bonding source rpm (under SRPMS directory) separately on
- a machine with RHEL5.2 and kernel 2.6.18-92.el5 (default for this OS).
- b. Install the binary RPM while the machine runs kernel 2.6.18-92.el5.
- This will patch the OS configuration scripts and install the bonding
- module.
- c. Switch to kernel 2.6.30. The module that was compiled in (a) will
- not be loaded since it was compiled and installed for a different
- kernel.
- d. Configure bonding and restart the network. The bonding interface
- should be up and running afterwards.
-
-19. On RHEL5.X, '/etc/init.d/openibd start' prints the following messages while
- bringing up IPoIB interfaces:
-
- Setting up InfiniBand network interfaces:
- Bringing up interface ib0: [ OK ]
- RTNETLINK answers: File exists
- Error adding address 192.168.1.11 for ib1.
- Bringing up interface ib1: [ OK ]
- Setting up service network . . . [ done ]
-
- This does not affect IPoIB configuration and interfaces are configured as
- expected.
-
-20. In IPoIB connected mode, packets larger than 2016 bytes are not sent.
- https://bugs.openfabrics.org/show_bug.cgi?id=1839
-
-21. Under SLES11, if an IP configuration exists for an IPoIB interface
- that later becomes a slave of a bonding master, a network restart
- does not erase the IP configuration from the slave and it appears to have
- an IP address even though the new configuration does not set one. This
- may cause problems when trying to use the bonded network interface. To
- avoid this, restart the IB stack (openib restart) once you change the
- configuration.
- This issue is described in
- https://bugs.openfabrics.org/show_bug.cgi?id=1975
-
-22. Currently, IPoIB LRO is not supported on ConnectX-2 devices.
-
-===============================================================================
-3. IPoIB Configuration Based on DHCP
-===============================================================================
-
-Setting an IPoIB interface configuration based on DHCP (v3.0.4 which is
-available via www.isc.org) is performed similarly to the configuration of
-Ethernet interfaces. In other words, you need to make sure that IPoIB
-configuration files include the following line:
- For RedHat:
- BOOTPROTO=dhcp
- For SLES:
- BOOTPROTO=dhcp
-Note: If IPoIB configuration files are included, ifcfg-ib<n> files will be
-installed under:
-/etc/sysconfig/network-scripts/ on a RedHat machine
-/etc/sysconfig/network/ on a SuSE machine
-
-Note: Two patches for DHCP are required for supporting IPoIB. The patch files
-for DHCP v3.0.4 are available under the docs/dhcp/ directory.
-
-Standard DHCP fields holding MAC addresses are not large enough to contain an
-IPoIB hardware address. To overcome this problem, DHCP over InfiniBand messages
-convey a client identifier field used to identify the DHCP session. This client
-identifier field can be used to associate an IP address with a client identifier
-value, such that the DHCP server will grant the same IP address to any client
-that conveys this client identifier.
-
-Note: Refer to the DHCP documentation for more details on how to make this
-association.
-
-The length of the client identifier field is not fixed in the specification.
-
-3.1 DHCP Server
-In order for the DHCP server to provide configuration records for clients, an
-appropriate configuration file needs to be created. By default, the DHCP server
-looks for a configuration file called dhcpd.conf under /etc. You can either
-edit this file or create a new one and provide its full path to the DHCP server
-using the -cf flag. See a file example at docs/dhcpd.conf of this package.
-The DHCP server must run on a machine which has loaded the IPoIB module.
-
-To run the DHCP server from the command line, enter:
-dhcpd <IB network interface name> -d
-Example:
-host1# dhcpd ib0 -d
-
-3.2 DHCP Client (Optional)
-
-Note: A DHCP client can be used if you need to prepare a diskless machine with
-an IB driver.
-
-In order to use a DHCP client identifier, you need to first create a
-configuration file that defines the DHCP client identifier. Then run the DHCP
-client with this file using the following command:
-dhclient -cf <client conf file> <IB network interface name>
-Example of a configuration file for the ConnectX (PCI Device ID 26428), called
-dhclient.conf:
-# The value indicates a hexadecimal number
-interface "ib1" {
-send dhcp-client-identifier ff:00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39;
-}
-Example of a configuration file for InfiniHost III Ex (PCI Device ID 25218),
-called dhclient.conf:
-# The value indicates a hexadecimal number
-interface "ib1" {
-send dhcp-client-identifier 20:00:55:04:01:fe:80:00:00:00:00:00:00:00:02:c9:02:00:23:13:92;
-}
-
-In order to use the configuration file, run:
-host1# dhclient -cf dhclient.conf ib1
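-
-The configuration files above follow one fixed pattern, so they can be
-generated. The following is a minimal Python sketch under stated assumptions:
-the function name render_dhclient_conf is hypothetical, and the identifier
-bytes must be supplied by you (the sketch does not derive them from the port).
```python
def render_dhclient_conf(interface, client_id):
    """Return a dhclient.conf stanza that sends client_id on interface.

    interface: IPoIB interface name, e.g. "ib1".
    client_id: the DHCP client identifier as a bytes object.
    """
    # dhclient expects the identifier as colon-separated hexadecimal bytes.
    hex_id = ":".join(f"{b:02x}" for b in client_id)
    return (
        f'interface "{interface}" {{\n'
        f"send dhcp-client-identifier {hex_id};\n"
        "}\n"
    )


# Identifier copied from the ConnectX example above.
EXAMPLE_ID = bytes.fromhex(
    "ff:00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39".replace(":", "")
)

if __name__ == "__main__":
    print(render_dhclient_conf("ib1", EXAMPLE_ID), end="")
```
-Writing the returned text to dhclient.conf gives a file that can be passed to
-dhclient with the -cf flag as shown above.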
-
-
-===============================================================================
-4. The ib-bonding driver
-===============================================================================
-The ib-bonding driver is a High Availability solution for IPoIB interfaces.
-It is based on the Linux Ethernet Bonding Driver and was adapted to work with
-IPoIB. The ib-bonding driver comes with the ib-bonding package
-(run rpm -qi ib-bonding to get the package information).
-
-Using the ib-bonding driver
----------------------------
-The ib-bonding driver is loaded automatically.
-
-Automatic operation:
-Use standard OS tools (sysconfig in SuSE and initscripts in RedHat)
-to create a configuration that will come up with network restart. For details
-on this, read the documentation for the ib-bonding package.
-
-Notes:
-* Using /etc/infiniband/openib.conf to create a persistent configuration is
- no longer supported
-* On RHEL4_U7, a slave interface cannot be set as primary.
-* ib-bonding will not be compiled and installed with OFED on an OS with kernel
- that is >= 2.6.27 (e.g., SLES11). The bonding driver that comes with those
- kernels already supports enslaving IPoIB interfaces. In addition, an OS
- can come with an older kernel but with a patched bonding driver that also
- does not require modification (e.g., RHEL5.4). OFED will not replace the
- bonding module in such cases either.
- However, there might still be issues with OS configuration tools (like
- sysconfig or initscripts) that may need fixing, but such issues have not
- been observed yet.
-
-
-===============================================================================
-5. Child interfaces
-===============================================================================
-
-5.1 Subinterfaces
------------------
-You can create subinterfaces for a primary IPoIB interface to provide traffic
-isolation. Each such subinterface (also called a child interface) has
-different IP and network addresses from the primary (parent) interface. The
-default Partition Key (PKey), ff:ff, applies to the primary (parent) interface.
-
-5.1.1 Creating a Subinterface
------------------------------
-To create a child interface (subinterface), follow this procedure:
-Note: In the following procedure, ib0 is used as an example of a primary
-(parent) IB interface.
-
-Step 1. Decide on the PKey to be used in the subnet. Valid values are 0-255.
-The actual PKey used is a 16-bit number with the most significant bit set. For
-example, a value of 0 will give a PKey with the value 0x8000.
-
-Step 2. Create a child interface by running:
-host1$ echo <PKey> > /sys/class/net/<IB interface>/create_child
-Example:
-host1$ echo 0 > /sys/class/net/ib0/create_child
-This will create the interface ib0.8000.
-
-Step 3. Verify the configuration of this interface by running:
-Using the example of Step 2:
-host1$ ifconfig ib0.8000
-ib0.8000 Link encap:UNSPEC HWaddr 80-00-00-4A-FE-80-00-00-00-00-
-00-00-00-00-00-00
-BROADCAST MULTICAST MTU:2044 Metric:1
-RX packets:0 errors:0 dropped:0 overruns:0 frame:0
-TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
-collisions:0 txqueuelen:128
-RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
-
-Step 4. As can be seen, the interface does not have IP or network addresses so
-it needs to be configured.
-
-Step 5. To be able to use this interface, a configuration of the Subnet Manager
-is needed so that the PKey chosen, which defines a broadcast address, can be
-recognized.
-
-5.1.2 Removing a Subinterface
------------------------------
-To remove a child interface (subinterface), run:
-echo <subinterface PKey> > /sys/class/net/<ib_interface>/delete_child
-Using the example of Step 2:
-echo 0x8000 > /sys/class/net/ib0/delete_child
-Note that when deleting the interface you must use the PKey value with the most
-significant bit set (e.g., 0x8000 in the example above).
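-
-The PKey arithmetic used in the create and delete steps can be sketched in
-Python. This is an illustration only: the function names are hypothetical, the
-0-255 range is the one stated in Step 1, and the four-digit hexadecimal
-interface suffix is inferred from the ib0.8000 example.
```python
def full_pkey(value):
    """Return the 16-bit PKey for value, with the most significant bit set."""
    # Range taken from Step 1 of the procedure above.
    if not 0 <= value <= 255:
        raise ValueError("PKey value must be in the range 0-255")
    return value | 0x8000


def child_interface_name(parent, value):
    """Name of the child interface created for value on parent (e.g. ib0)."""
    return f"{parent}.{full_pkey(value):04x}"


def delete_child_argument(value):
    """String to write to delete_child: the PKey with the top bit set."""
    return f"0x{full_pkey(value):04x}"


if __name__ == "__main__":
    print(child_interface_name("ib0", 0))
    print(delete_child_argument(0))
```
-For the value 0 used in Step 2, the sketch yields the interface name ib0.8000
-and the delete_child argument 0x8000, matching the examples above.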
-
-
-===============================================================================
-6. Bug Fixes and Enhancements Since OFED 1.3
-===============================================================================
-- There is no default configuration for IPoIB interfaces: One should manually
- specify the full IP configuration or use the ofed_net.conf file. See
- OFED_Installation_Guide.txt for details on IPoIB configuration.
-- Don't drop multicast sends when they can be queued
-- IPoIB panics with RHEL5U1, RHEL4U6 and RHEL4U5: Bug fix when copying small
- SKBs (bug 989)
-- IPoIB failed on stress testing (bug 1004)
-- Kernel Oops during "port up/down test" (bug 1040)
-- Restarting the stack while iperf 2.0.4 is running on the client side causes
- a kernel panic (bug 985)
-- Fix neigh destructor oops on kernel versions between 2.6.17 and 2.6.20
-- Set max CM MTU when moving to CM mode, instead of setting it in openibd script
-- Fix CQ size calculations for ipoib
-- Bonding: Enable build for SLES10 SP2
-- Bonding: Fix issue in using the bonding module for Ethernet slaves (see
- documentation for details)
-
-===============================================================================
-7. Bug Fixes and Enhancements Since OFED 1.3.1
-===============================================================================
-- IPoIB: Refresh paths instead of flushing them on SM change events to improve
- failover response
-- IPoIB: Fix loss of connectivity after bonding failover on both sides
-- Bonding: Fix link state detection under RHEL4
-- Bonding: Avoid annoying messages from initscripts when starting bond
-- Bonding: Set default number of grat. ARP after failover to three (was one)
-
-===============================================================================
-8. Bug Fixes and Enhancements Since OFED 1.4
-===============================================================================
-- Performance tuning is enabled by default for IPOIB CM.
-- Clear IPOIB_FLAG_ADMIN_UP if ipoib_open fails
-- Disable napi while cq is being drained (bugzilla #1587)
-- rdma_cm: Use the rate from the ipoib broadcast when joining an ipoib
- multicast. When joining an IPoIB multicast group, use the same rate as in the
- broadcast group. Otherwise, if rdma_cm creates this group before IPoIB does,
- it might get a different rate. This will cause IPoIB to fail joining the same
- group later on, because IPoIB has a strict rate selection.
-- Fixed unprotected use of priv->broadcast in ipoib_mcast_join_task.
-- Do not join broadcast group if interface is brought down
-
-
-===============================================================================
-9. Bug Fixes and Enhancements Since OFED 1.4.2
-===============================================================================
-
-- Check that the format of multicast link addresses is correct before taking
- them from dev->mc_list to priv->multicast_list. This way we never try to
- send a bogus address to the SA, which prevents badness from erroneous
- 'ip maddr addr add', broken bonding drivers, etc. (bugzilla #1664)
-- IPoIB: Don't turn on carrier for a non-active port.
- If a bonding interface uses this IPoIB interface as a slave it might
- not detect that this slave is almost useless and failover
- functionality will be damaged. The fix checks the state of the IB
- port in the carrier_task before calling netif_carrier_on(). (bugzilla #1726)
-- Clear ipoib_neigh.dgid in ipoib_neigh_alloc()
- IPoIB can miss a change in destination GID under some conditions. The
- problem is caused when ipoib_neigh->dgid contains a stale address.
- The fix is to set ipoib_neigh->dgid to zero in ipoib_neigh_alloc().
-
-===============================================================================
-10. Bug Fixes and Enhancements Since OFED 1.5.0
-===============================================================================
-
-- Fixed lockup of the TX queue on mixed CM/UD traffic
- When there is a high rate of send traffic on both CM and UD QPs, the
- transmitter can be stopped by the CM path but not re-enabled.
-
-===============================================================================
-11. Bug Fixes and Enhancements Since OFED 1.5.2
-===============================================================================
-1. Fix IPoIB rx_frames and rx_usecs to conform to ethtool documentation.
-
-
-===============================================================================
-12. Performance Tuning
-===============================================================================
-When IPoIB is configured to run in connected mode, TCP parameter tuning is
-performed at driver startup to improve the throughput of medium and large
-messages.
-The driver startup scripts set the following TCP parameters:
-
- net.ipv4.tcp_timestamps=0
- net.ipv4.tcp_sack=0
- net.core.netdev_max_backlog=250000
- net.core.rmem_max=16777216
- net.core.wmem_max=16777216
- net.core.rmem_default=16777216
- net.core.wmem_default=16777216
- net.core.optmem_max=16777216
- net.ipv4.tcp_mem="16777216 16777216 16777216"
- net.ipv4.tcp_rmem="4096 87380 16777216"
- net.ipv4.tcp_wmem="4096 65536 16777216"
-
-This tuning is effective only for connected mode. If you run in datagram mode,
-it actually reduces performance.
-
-If you change the IPoIB run mode to "datagram" while the driver is running,
-the tuned parameters do not get reset to their default values. We therefore
-recommend that you change the IPoIB mode only while the driver is down
-(by changing the line "SET_IPOIB_CM=yes" to "SET_IPOIB_CM=no" in file
-/etc/infiniband/openib.conf, and then restarting the driver).
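-
-Because the tuned values are not reset when the mode changes, it can help to
-check which of them are still in effect. The following Python sketch is an
-illustration, not part of OFED: the names TUNED, untuned_keys and
-sysctl_commands are hypothetical, and the caller is assumed to supply the
-current values (for example, collected from the output of sysctl).
```python
# Connected-mode values listed above, keyed by sysctl name.
TUNED = {
    "net.ipv4.tcp_timestamps": "0",
    "net.ipv4.tcp_sack": "0",
    "net.core.netdev_max_backlog": "250000",
    "net.core.rmem_max": "16777216",
    "net.core.wmem_max": "16777216",
    "net.core.rmem_default": "16777216",
    "net.core.wmem_default": "16777216",
    "net.core.optmem_max": "16777216",
    "net.ipv4.tcp_mem": "16777216 16777216 16777216",
    "net.ipv4.tcp_rmem": "4096 87380 16777216",
    "net.ipv4.tcp_wmem": "4096 65536 16777216",
}


def _normalize(value):
    # The kernel separates multi-value entries with tabs; compare on tokens.
    return " ".join(str(value).split())


def untuned_keys(current):
    """Return the sorted keys whose current value differs from TUNED."""
    return sorted(
        key for key, wanted in TUNED.items()
        if _normalize(current.get(key, "")) != wanted
    )


def sysctl_commands():
    """Return one 'sysctl -w' command line per tuned parameter."""
    return [f'sysctl -w {key}="{value}"' for key, value in TUNED.items()]
```
-An empty result from untuned_keys means every connected-mode value is still
-active, which is the situation the paragraph above warns about after a switch
-to datagram mode.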
-
-
+++ /dev/null
-
- Open Fabrics Enterprise Distribution (OFED)
- iSER initiator in OFED 1.5.x Release Notes
-
- March 2010
-
-
-* Background
-
- iSER allows iSCSI to be layered over RDMA transports (including
- InfiniBand and iWARP (RNIC)).
-
- The OpenFabrics iSER initiator implementation is inter-operable with
- open-iscsi (http://www.open-iscsi.org/). It provides an alternative
- transport to iscsi_tcp in the open-iscsi framework. The iSER transport
- exposes a transport API to scsi_transport_iscsi, and a SCSI LLD API to
- the Linux SCSI mid-layer (scsi_mod). Currently, the OpenFabrics iSER
- initiator can be layered over InfiniBand (no iWARP support yet).
-
-* Supported platforms
-
- - kernel.org: 2.6.30 and higher
- - RHEL 5.4
-
- On all other platforms, OFED-1.5.x does not install iSER on top of the
- kernel, and the original iSER module shipped with the Linux distribution
- stops working because of a symbol version mismatch.
-
-* Fixed Bugs and Enhancements since OFED 1.3
- iSER:
- - Add logical unit reset support
- - Update URLs of iSER docs
- - Add change_queue_depth method
- - Fix list iteration bug
- - Handle iser_device allocation error gracefully
- - Don't change ITT endianness
- - Move high-volume debug output to higher debug level
- - Count FMR alignment violations per session
- Open-iSCSI:
- - Update open-iscsi rpm versions from
- 2.0-754 to 2.0-754.1 and from 2.0-865.15 to 2.0-869.2
- - Change open-iscsi defaults
- - iscsi_discovery: fixed printing debug information
- - iscsi_discovery: check if iscsid is running
- - Set open-iscsi for auto-startup when installing OFED
- - iscsiadm: bail out if daemon isn't running
-
-* Known Issues
- Open-iSCSI:
- - Modifying the node transport_name while a session is active creates a
- stale session, which is deleted only after reboot.
-
-* Installation/upgrade of open-iscsi
- If iSER is selected to be installed with OFED, open-iscsi will also be
- installed (or upgraded if another version of open-iscsi is already
- installed). Installing/upgrading open-iscsi is required for iSER to
- work properly. Before installing OFED, please make sure that no version
- of open-iscsi is installed or add the following key to your ofed.conf
- file: upgrade_open_iscsi=yes. Using this key will remove any old version
- of open-iscsi.
-
- If an older version of open-iscsi was installed, it is recommended to
- delete its records before running open-iscsi. This can easily be done by
- running the following command (while open-iscsi is stopped):
-
- rm -rf /etc/iscsi/nodes/* /etc/iscsi/send_targets/*
-
- Then, open-iscsi may be started, and targets may be discovered by running
- 'iscsi_discovery <target_ip>'.
-
-* iSER links
-
- Wiki pages
-
- Information on building/configuring/running the open iscsi initiator over
- iSER: https://wiki.openfabrics.org/tiki-index.php?page=iSER
-
- IETF pages
-
- iSCSI and iSER specifications come out of the IETF IP storage (IPS) work
- group.
-
- iSCSI specification: http://www.ietf.org/rfc/rfc3720.txt
- iSER specification: http://www.ietf.org/rfc/rfc5046.txt
-
- "About" page
-
- general and detailed information on iSCSI and iSER
- http://www.voltaire.com/Products/Server_Products/iSER_iSCSI_RDMA
-
+++ /dev/null
- Open Fabrics Enterprise Distribution (OFED)
- STGT/iSER target in OFED 1.5 Release Notes
-
- December 2009
-
-
-* Background
-
- iSER allows iSCSI to be layered over RDMA transports (including InfiniBand
- and iWARP (RNIC)). Linux target framework (tgt) aims to simplify various SCSI
- target driver (iSCSI, Fibre Channel, SRP, etc) creation and maintenance.
-
- tgt supports the following target drivers (among others):
-
- - iSCSI software (tcp) target driver for Ethernet/IPoIB NICs
- - iSER software target driver for InfiniBand and RDMA NICs
-
- For iSCSI and iSER, tgt consists of a user-space daemon and user-space
- tools. That is, no special kernel support is needed other than the
- kernel (and user space) RDMA stacks.
-
- The code is under the GNU General Public License version 2.
-
- This package is based on a snapshot (clone) of the tgt git tree taken
- on August 28th, 2008.
-
-* Supported platforms
-
- RHEL 5 and its updates
- SLES 10 and its service-packs
-
- The release has been tested against the Linux open iscsi initiator
-
-* STGT/iSER links
-
- STGT home page
- http://stgt.berlios.de
-
- STGT git
- git://git.kernel.org/pub/scm/linux/kernel/git/tomo/tgt.git
-
- The STGT sources have some embedded documentation; specifically,
- the README and README.iscsi files would be useful.
-
- Wiki pages
-
- Information on building/configuring/running the stgt/iser target
- https://wiki.openfabrics.org/tiki-index.php?page=iSER-target
-
- general and detailed information on iSCSI and iSER
- http://www.voltaire.com/Products/Server_Products/iSER_iSCSI_RDMA
+++ /dev/null
- Open Fabrics Enterprise Distribution (OFED)
- ConnectX driver (mlx4) in OFED 1.5.2 Release Notes
-
- December 2010
-
-
-===============================================================================
-Table of Contents
-===============================================================================
-1. Overview
-2. Supported firmware versions
-3. VPI (Virtual Protocol Interconnect)
-4. InfiniBand new features and bug fixes since OFED 1.3.1
-5. InfiniBand (mlx4_ib) new features and bug fixes since OFED 1.4
-6. Eth (mlx4_en) new features and bug fixes since OFED 1.4
-7. New features and bug fixes since OFED 1.4.1
-8. New features and bug fixes since OFED 1.4.2
-9. New features and bug fixes since OFED 1.5
-10. New features and bug fixes since OFED 1.5.1
-11. New features and bug fixes since OFED 1.5.2
-12. Known Issues
-13. mlx4 available parameters
-
-===============================================================================
-1. Overview
-===============================================================================
-mlx4 is the low level driver implementation for the ConnectX adapters designed
-by Mellanox Technologies. The ConnectX can operate as an InfiniBand adapter,
-as an Ethernet NIC, or as a Fibre Channel HBA. The driver in OFED 1.4 supports
-InfiniBand and Ethernet NIC configurations. To accommodate the supported
-configurations, the driver is split into three modules:
-
-- mlx4_core
- Handles low-level functions like device initialization and firmware
- commands processing. Also controls resource allocation so that the
- InfiniBand and Ethernet functions can share the device without
- interfering with each other.
-- mlx4_ib
- Handles InfiniBand-specific functions and plugs into the InfiniBand
- mid-layer.
-- mlx4_en
- Handles Ethernet-specific functions and plugs into the netdev mid-layer.
-
-===============================================================================
-2. Supported firmware versions
-===============================================================================
-- This release was tested with FW 2.8.0000.
-- The minimal version to use is 2.3.000.
-- To use both IB and Ethernet (VPI), use FW version 2.6.000 or higher.
-
-===============================================================================
-3. VPI (Virtual Protocol Interconnect)
-===============================================================================
-VPI enables ConnectX to be configured as an Ethernet NIC and/or an InfiniBand
-adapter.
-o Overview:
- The VPI driver is a combination of the Mellanox ConnectX HCA Ethernet and
- InfiniBand drivers.
- It supplies the user with the ability to run InfiniBand and Ethernet
- protocols on the same HCA (separately or at the same time).
- For more details on the Ethernet driver see MLNX_EN_README.txt.
-o Firmware:
- The VPI driver works with FW 25408 version 2.6.000 or higher.
- One needs to use INI files that allow different protocols over the same HCA.
-o Port type management:
- By default both ConnectX ports are initialized as InfiniBand ports.
- If you wish to change the port type use the connectx_port_config script after
- the driver is loaded.
- Running "/sbin/connectx_port_config -s" will show current port configuration
- for all ConnectX devices.
- Port configuration is saved in file: /etc/infiniband/connectx.conf.
- This saved configuration is restored at driver restart only if done via
- "/etc/init.d/openibd restart".
-
- Possible port types are:
- "eth" - Always Ethernet.
- "ib" - Always InfiniBand.
- "auto" - Link sensing mode - detect port type based on the attached
- network type. If no link is detected, the driver retries link
- sensing every few seconds.
-
- Port link type can be configured for each device in the system at run time
- using the "/sbin/connectx_port_config" script.
-
- This utility will prompt for the PCI device to be modified (if there is only
- one it will be selected automatically).
- At the next stage the user will be prompted for the desired mode for each port.
- The desired port configuration will then be set for the selected device.
- Note: This utility also has a non-interactive mode:
- "/sbin/connectx_port_config [[-d|--device <PCI device ID>] -c|--conf <port1,port2>]".
-
-- The following configurations are supported by VPI:
- Port1 = eth Port2 = eth
- Port1 = ib Port2 = ib
- Port1 = auto Port2 = auto
- Port1 = ib Port2 = eth
- Port1 = ib Port2 = auto
- Port1 = auto Port2 = eth
-
- Note: the following options are not supported:
- Port1 = eth Port2 = ib
- Port1 = eth Port2 = auto
- Port1 = auto Port2 = ib
-
-
-===============================================================================
-4. InfiniBand new features and bug fixes since OFED 1.3.1
-===============================================================================
-Features that are enabled with ConnectX firmware 2.5.0 only:
-- Send with invalidate and Local invalidate send queue work requests.
-- Resize CQ support.
-
-Features that are enabled with ConnectX firmware 2.6.0 only:
-- Fast register MR send queue work requests.
-- Local DMA L_Key.
-- Raw Ethertype QP support (one QP per port) -- receive only.
-
-Non-firmware dependent features:
-- Allow 4K messages for UD QPs
-- Allocate/free fast register MR page lists
-- More efficient MTT allocator
-- RESET->ERR QP state transition no longer supported (IB Spec 1.2.1)
-- Pass congestion management class MADs to the HCA
-- Enable firmware diagnostic counters available via sysfs
-- Enable LSO support for IPOIB
-- IB_EVENT_LID_CHANGE is generated more appropriately
-- Fixed race condition between create QP and destroy QP (bugzilla 1389)
-
-
-===============================================================================
-5. InfiniBand new features and bug fixes since OFED 1.4
-===============================================================================
-- Enable setting a 4K MTU for ConnectX ports via a module parameter
- (set_4k_mtu).
-- Support optimized registration of huge-page-backed memory.
- With this optimization, the number of MTT entries used is significantly
- lower than for regular memory, so the HCA will access registered memory with
- fewer cache misses and improved performance.
- For more information on this topic, please refer to Linux documentation file:
- Documentation/vm/hugetlbpage.txt
-- Do not enable blueflame sends if write combining is not available
-- Add write combining support for PPC64, and thus enable blueflame sends.
-- Unregister IB device before executing CLOSE_PORT.
-- Notify and exit if the kernel module used does not support XRC. This is done
- to avoid a libmlx4 compatibility problem.
-- Added a module parameter (log_mtts_per_seg) for number of MTTs per segment.
- This makes it possible to register more memory with the same number of
- segments.
-
-
-===============================================================================
-6. Eth (mlx4_en) new features and bug fixes since OFED 1.4
-===============================================================================
-6.1 Changes and New Features
-----------------------------
-- Added Tx Multi-queue support which improves multi-stream and bi-directional
- TCP performance.
-- Added IP Reassembly to improve RX bandwidth for IP fragmented packets.
-- Added linear skb support which improves UDP performance.
-- Removed the following module parameters:
- - rx/tx_ring_size
- - rx_ring_num - number of RX rings
- - pprx/pptx - global pause frames
- The parameters above are controlled through the standard Ethtool interface.
-
-6.2 Bug Fixes
--------------
-- Memory leak when driver is unloaded without configuring interfaces first.
-- Setting flow control parameters for one ConnectX port through Ethtool
- impacts the other port as well.
-- Adaptive interrupt moderation malfunctions after receiving/transmitting
- around 7 Tera-bytes of data.
-- Firmware commands fail with bad flow messages when bringing an interface up.
-- Unexpected behavior in case of memory allocation failures.
-
-===============================================================================
-7. New features and bug fixes since OFED 1.4.1
-===============================================================================
-- Added support for new device ID: 0x6764: MT26468 ConnectX EN 10GigE PCIe gen2
-
-===============================================================================
-8. New features and bug fixes since OFED 1.4.2
-===============================================================================
-8.1 Changes and New Features
-----------------------------
-- mlx4_en is now supported on PPC and IA64.
-- Added self diagnostics feature: ethtool -t eth<x>.
-- Card's vpd can be accessed for read and write using ethtool interface.
-
-8.2 Bug Fixes
--------------
-- mlx4 can now work with MSI-X on RH4 systems.
-- Enabled the driver to load on systems with 32 cores and higher.
-- The driver no longer gets stuck if the HW/FW stops responding; a reset is
- performed instead.
-- Fixed recovery flows from memory allocation failures.
-- When the system is low on memory, the mlx4_en driver now allocates smaller RX
- rings.
-- The mlx4_core driver now retries to obtain MSI-X vectors if the initial
- request is rejected by the OS.
-
-===============================================================================
-9. New features and bug fixes since OFED 1.5
-===============================================================================
-9.1 Changes and New Features
-----------------------------
-- Added RDMA over Converged Enhanced Ethernet (RoCEE) support
- See RoCEE_README.txt.
-- Masked Compare and Swap (MskCmpSwap)
- The MskCmpSwap atomic operation is an extension to the CmpSwap operation
- defined in the IB spec. MskCmpSwap allows the user to select a portion of the
- 64 bit target data for the "compare" check as well as to restrict the swap to
- a (possibly different) portion.
-- Masked Fetch and Add (MFetchAdd)
- The MFetchAdd Atomic operation extends the functionality of the standard IB
- FetchAdd by allowing the user to split the target into multiple fields of
- selectable length. The atomic add is done independently on each one of these
- fields. A bit set in the field_boundary parameter specifies the field
- boundaries.
-- Improved VLAN tagging performance for the mlx4_en driver.
-- RSS support for Ethernet UDP traffic on ConnectX-2 cards with firmware
- 2.7.700 and higher.
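-
-The masked compare-and-swap semantics described above can be modeled in
-software. The following is a minimal sketch in bash arithmetic, not the verbs
-API: the function name msk_cmp_swap, its argument order, and the global
-variables target and fetched are hypothetical, and a single shell variable
-stands in for the 64 bit target data.

```shell
#!/usr/bin/env bash
# Software model of a masked compare-and-swap on one 64-bit word.
# Illustrative only: msk_cmp_swap, target and fetched are hypothetical names,
# and nothing here touches an HCA or the verbs API.
target=0
fetched=0

msk_cmp_swap() {
    local compare=$1 compare_mask=$2 swap=$3 swap_mask=$4
    local old=$target
    # Record the original value, which a compare-and-swap reports back.
    fetched=$old
    # Compare only the bits selected by compare_mask.
    if (( (old & compare_mask) == (compare & compare_mask) )); then
        # Replace only the bits selected by swap_mask.
        target=$(( (old & ~swap_mask) | (swap & swap_mask) ))
    fi
}

target=$(( 0x1234 ))
# Compare the low byte only; on a match, swap the high byte only.
msk_cmp_swap 0x0034 0x00ff 0xab00 0xff00
printf 'fetched=0x%x target=0x%x\n' "$fetched" "$target"
```

-Because the compare mask and the swap mask are independent, the example
-checks one byte and rewrites a different one.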
-
-9.2 Bug Fixes
--------------
-- Bonding stops functioning when one of the Ethernet ports is closed.
-- "Scheduling while atomic" errors in /var/log/messages when working with
- bonding and mlx4_en drivers in several operating systems.
-
-===============================================================================
-10. New features and bug fixes since OFED 1.5.1
-===============================================================================
-10.1 Changes and New Features
-----------------------------
-1. Added RAW QP support
-2. Extended the range of log_mtts_per_seg - upper bound moved from 5 to 7.
-3. Added 0xff70 vendor ID support for MADs.
-4. Added support for GID change event.
-5. Better interrupts spreading under heavy RX load (mlx4_en)
-
-10.2 Bug Fixes
--------------
-1. Fixed chunk sg list overflow in mlx4_alloc_icm()
-2. Fixed bug in invalidation of counter index.
-3. Fixed bug in catching netdev events for updating GID table.
-4. Fixed bug in populating GID table for RoCE.
-5. Fixed XRC locking and prevention of null dereference.
-6. Added spinlock to xrc_reg_list changes and scanning in interrupt context.
-7. Fixed offload changes via Ethtool for VLAN interfaces
-
-===============================================================================
-11. New features and bug fixes since OFED 1.5.2
-===============================================================================
-11.1 Changes and new features
------------------------------
-1. RoCE counters are now added to the regular Ethernet counters. The counters
-for RoCE specific traffic are at the same place and are not changed.
-2. Forward any vendor ID SMP MADs to firmware for handling.
-3. Add blue flame support for kernel consumers. This allows lower latencies to
-be achieved. To use blue flame, a consumer needs to create the QP with inline
-support.
-
-11.2 Bug fixes
---------------
-1. Fix race when reading node description through MADs.
-2. Fix modify CQ so each of moderation parameters is independent.
-3. Limit the number of fast registration work requests to match HW capabilities.
-4. Changes to node-description via sysfs are now propagated to FW (for FW
-2.8.000 and later). This enables FW to send a 144 trap to OpenSM regarding the
-change, so that OpenSM can read that node's updated description. This fixes an
-old race condition, where OpenSM read the node's description before it was
-changed during driver startup.
-5. Fix max fast registration WRs that can be posted to CX.
-6. Fix port speed reporting for RoCE ports.
-7. Limit GID entries for VLAN to match hardware capabilities.
-8. Fix RoCE link state report.
-9. Workaround firmware bug reporting wrong number of blue flame registers.
-10. Bug fix in kernel post_send when VLANs are used.
-11. Fix in mlx4_en for handling VLAN operations when working under bond
- interfaces.
-12. Fix Ethtool transceiver type report for mlx4_en
-
-
-===============================================================================
-12. Known Issues
-===============================================================================
-- The SQD feature is not supported
-- To load the driver on machines with a 64KB default page size, the UAR bar
- must be enlarged. A 64KB page size is the default on PPC with RHEL5 and on
- Itanium with SLES 11; the same applies whenever a 64KB page size is enabled.
- Perform the following three steps:
- 1. Add the following line in the firmware configuration (INI) file under the
- [HCA] section:
- log2_uar_bar_megabytes = 5
- 2. Burn a modified firmware image with the changed INI file.
- 3. Reboot the system.
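-
-Before changing the firmware configuration, it may help to confirm that the
-machine really uses a 64KB page size. The following is a minimal sketch; the
-helper name needs_uar_change is hypothetical, and it only compares the page
-size reported by getconf against 64KB. It does not inspect the firmware or
-the INI file.

```shell
#!/usr/bin/env bash
# needs_uar_change is a hypothetical helper: it succeeds only when the given
# page size (in bytes) equals 64KB.
needs_uar_change() {
    local page_size=$1
    [ "$page_size" -eq 65536 ]
}

page_size=$(getconf PAGESIZE)
if needs_uar_change "$page_size"; then
    echo "page size is 64KB: the UAR bar steps above apply"
else
    echo "page size is ${page_size} bytes: this known issue does not apply"
fi
```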
-
-
-================================================================================
-13. mlx4 available parameters
-================================================================================
-In order to set mlx4 parameters, add the following line(s) to /etc/modprobe.conf:
- options mlx4_core parameter=<value>
- and/or
- options mlx4_ib parameter=<value>
- and/or
- options mlx4_en parameter=<value>
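-
-A minimal sketch of adding such lines from a script, assuming a helper named
-add_module_option (hypothetical, not part of OFED). It writes to a scratch
-file created with mktemp rather than to the real configuration file, and it
-skips lines that are already present. The parameters used (msi_x, num_lro)
-are taken from the lists in this section.

```shell
#!/usr/bin/env bash
# add_module_option is a hypothetical helper. It appends an "options" line
# to the given file unless an identical line is already present.
add_module_option() {
    local conf=$1 module=$2 param=$3 value=$4
    local line="options ${module} ${param}=${value}"
    touch "$conf"
    # -x matches whole lines only, -F treats the line as a fixed string.
    if ! grep -qxF -- "$line" "$conf"; then
        printf '%s\n' "$line" >> "$conf"
    fi
}

# Demonstrate on a scratch file; use the real configuration file (as root)
# only after reviewing the result.
conf=$(mktemp)
add_module_option "$conf" mlx4_core msi_x 1
add_module_option "$conf" mlx4_en num_lro 0
add_module_option "$conf" mlx4_core msi_x 1   # duplicate, not added again
cat "$conf"
```

-The new values take effect the next time the corresponding module is loaded.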
-
-mlx4_core parameters:
- set_4k_mtu: try to set 4K MTU to all ConnectX ports (int)
- debug_level: enable debug tracing if > 0 (int)
- block_loopback: block multicast loopback packets if > 0 (int)
- msi_x: attempt to use MSI-X if nonzero (int)
- log_num_mac: log2 max number of MACs per ETH port (1-7, int)
- use_prio: enable steering by VLAN priority on ETH ports
- (0/1, default 0) (bool)
- log_num_qp: log maximum number of QPs per HCA (int)
- log_num_srq: log maximum number of SRQs per HCA (int)
- log_rdmarc_per_qp: log number of RDMARC buffers per QP (int)
- log_num_cq: log maximum number of CQs per HCA (int)
- log_num_mcg: log maximum number of multicast groups per HCA
- (int)
- log_num_mpt: log maximum number of memory protection table
- entries per HCA (int)
- log_num_mtt: log maximum number of memory translation table
- segments per HCA (int)
- log_mtts_per_seg: log2 number of MTT entries per segment (1-7)
- (int)
- enable_qos: enable Quality of Service support in the HCA
- (default: off) (bool)
- enable_pre_t11_mode: set FCoXX to pre-T11 mode if non-zero
- (default 0) (int)
- internal_err_reset: reset device on internal errors if non-zero
- (default 1) (int)
-
-mlx4_ib parameters:
- debug_level: enable debug tracing if > 0 (default 0)
-
-mlx4_en parameters:
- udp_rss: enable RSS for incoming UDP traffic; 0 disables it
- tcp_rss: enable RSS for incoming TCP traffic; 0 disables it
- num_lro: number of LRO sessions per ring; 0 disables LRO
- (default is 32)
- ip_reasm: allow reassembly of fragmented IP packets (default
- is enabled)
- pfctx: priority based Flow Control policy on TX[7:0]
- per priority bit mask (default is 0)
- pfcrx: priority based Flow Control policy on RX[7:0]
- per priority bit mask (default is 0)
- inline_thold: threshold for using inline data (default is 128)
+++ /dev/null
- MPI Selector 1.0 release notes
- December 2009
- ==============================
-
-OFED contains a simple mechanism for system administrators and end
-users to select which MPI implementation they want to use. The MPI
-selector functionality is not specific to any MPI implementation; it
-can be used with any implementation that provides shell startup files
-that correctly set the environment for that MPI. The OFED installer
-will automatically add MPI selector support for each MPI that it
-installs. Additional MPIs not known by the OFED installer can be
-listed in the MPI selector; see the mpi-selector(1) man page for
-details.
-
-Note that MPI selector only affects the default MPI environment for
-*future* shells. Specifically, if you use MPI selector to select MPI
-implementation ABC, this default selection will not take effect until
-you start a new shell (e.g., log out and log in again). Other packages
-(such as environment modules) provide functionality that allows
-changing your environment to point to a new MPI implementation in the
-current shell. The MPI selector was not meant to duplicate or replace
-that functionality.
-
-The MPI selector functionality can be invoked in one of two ways:
-
-1. The mpi-selector-menu command.
-
- This command is a simple, menu-based program that allows the
- selection of the system-wide MPI (usually only settable by root)
- and a per-user MPI selection. It also shows what the current
- selections are.
-
- This command is recommended for all users.
-
-2. The mpi-selector command.
-
- This command is a CLI-equivalent of the mpi-selector-menu,
- allowing for the same functionality as mpi-selector-menu but
- without the interactive menus and prompts. It is suitable for
- scripting.
-
-See the mpi-selector(1) man page for more information.
-
+++ /dev/null
-===============================================================================
- OFED 1.5.2 for Linux
- Mellanox Firmware Burning and Diagnostic Utilities
- December 2010
-===============================================================================
-
-===============================================================================
-Table of Contents
-===============================================================================
-1. Overview
-2. New Features
-3. Bugs Fixed
-4. Known Issues
-
-===============================================================================
-1. Overview
-===============================================================================
-
-This package contains burning and diagnostic tools for Mellanox
-manufactured cards. It also provides access to the relevant source code. Please
-see the file LICENSE for licensing details.
-
-Package Contents:
- a) mstflint source code
- b) mflash lib
- This lib provides Flash access through Mellanox HCAs.
- c) mtcr lib (implemented in mtcr.h file)
- This lib enables access to adapter hardware registers via PCIe
- d) mstregdump utility
- This utility dumps hardware registers from Mellanox hardware for later
- analysis by Mellanox.
- e) mstvpd
- This utility dumps the on-card VPD (Vital Product Data, which contains
- the card serial number, part number, and other info).
- f) hca_self_test.ofed
- This script checks the status of software, firmware and hardware of the
- HCAs or NICs installed on the local host.
-
-===============================================================================
-2. New Features
-===============================================================================
-
-* Added support for flash type SST25VF016B in mstflint
-
-* Added support for flash type M25PX16 in mstflint
-
-* Added an option to set the VSD and GUIDs (mstflint commands 'sv' and 'sg') in
- a binary image file. This is useful for production to prepare images for pre-
- assembly flash burning. These new commands are supported by Mellanox 4th
- generation devices.
-
-* Added an option to set the VSD and GUIDs (mstflint commands 'sv' and 'sg') on
- an already burnt device. These commands re-burn the existing image with the
- given GUIDs or VSD.
- When the 'sg' command is applied on a device with blank (0xff) GUIDs, it
- updates the GUIDs without re-burning the image.
-
-* mstregdump: Updated address list for ConnectX2 device.
-
-===============================================================================
-3. Bugs Fixed
-===============================================================================
-
-* Show correct device names in mstflint help
-
-===============================================================================
-4. Known Issues
-===============================================================================
-
-* In rare cases, you may get the following error message when running mstflint:
- Warning: memory access to device 0a:00.0 failed: Input/output error.
- Warning: Fallback on IO: much slower, and unsafe if device in use.
- *** buffer overflow detected ***: mstflint terminated
-
- To solve the issue, run "mst start" (requires MFT - Mellanox Firmware Tools package) and
- then re-run mstflint.
-
+++ /dev/null
- Open Fabrics Enterprise Distribution (OFED)
- mthca in OFED 1.5 Release Notes
-
- December 2009
-
-
-===============================================================================
-Table of Contents
-===============================================================================
-1. Overview
-2. Fixed Bugs since OFED 1.3.1
-3. Bug fixes and enhancements since OFED 1.4
-4. Known Issues
-
-===============================================================================
-1. Overview
-===============================================================================
-mthca is the low level driver implementation for the following Mellanox
-Technologies HCAs: InfiniHost, InfiniHost III Ex and InfiniHost III Lx.
-
-mthca Available Parameters
---------------------------
-In order to set mthca parameters, add the following line to /etc/modprobe.conf:
-
- options ib_mthca parameter=<value>
-
-mthca parameters:
- catas_reset_disable: disable reset on catastrophic event if nonzero
- (int)
- fw_cmd_doorbell: post FW commands through doorbell page if
- nonzero (and supported by FW) (int)
- debug_level: Enable debug tracing if > 0 (int)
- msi_x: attempt to use MSI-X if nonzero (int)
- tune_pci: increase PCI burst from the default set by BIOS if nonzero (int)
- num_qp: maximum number of QPs per HCA (int)
- rdb_per_qp: number of RDB buffers per QP (int)
- num_cq: maximum number of CQs per HCA (int)
- num_mcg: maximum number of multicast groups per HCA (int)
- num_mpt: maximum number of memory protection table entries per HCA (int)
- num_mtt: maximum number of memory translation table segments per HCA (int)
- num_udav: maximum number of UD address vectors per HCA (int)
- fmr_reserved_mtts: number of memory translation table segments reserved for
- FMR (int)
- log_mtts_per_seg: Log2 number of MTT entries per segment (1-5) (int)
-
-===============================================================================
-2. Fixed Bugs since OFED 1.3.1
-===============================================================================
-- Fix access to freed memory in catastrophic processing
- catas_reset() used a pointer to mthca_dev, but mthca_dev is no longer valid
- after the call to __mthca_restart_one().
-
-
-===============================================================================
-3. Bug fixes and enhancements since OFED 1.4
-===============================================================================
-- Added a module parameter (log_mtts_per_seg) for number of MTTs per segment.
- This makes it possible to register more memory with the same number of
- segments.
-- Brought INIT_HCA and other command timeouts into line with the PRM. This
- solves an issue that occurred when more than 2^18 QPs were configured.
-
-===============================================================================
-4. Known Issues
-===============================================================================
-1. A UAR size other than 8MB prevents mthca driver loading. The default UAR
- size is 8MB. If the size is changed, the following error message will be
- logged to /var/log/messages upon attempting to load the mthca driver:
- ib_mthca 0000:04:00.0: Missing UAR, aborting.
-
-2. If a user level application using multicast receives a control signal
- in the process of detaching from a multicast group, its QP may remain a
- member of the multicast group (in HCA).
- Workaround: Destroy the multicast group after detaching the QP from it.
-
-3. In mem-free devices, RC QPs can be created with a maximum of (max_sge - 1)
- entries only; UD QPs can be created with a maximum of (max_sge - 3) entries.
-
-4. Performance can be degraded due to a wrong BIOS configuration:
- The PCI Express specification requires the BIOS to set the MaxReadReq
- register for each HCA card for maximum performance and stability.
-
- If you experience bandwidth performance degradation, try forcing the card to
- deviate from the PCI Express specification by setting the
- tune_pci=1 module parameter. This tune_pci=1 assignment was the default
- setting in OFED 1.0; therefore, it may have masked performance degradation
- on some systems.
-
- If tune_pci=1 improves bandwidth, please report the issue to your BIOS
- vendor. Please note that Mellanox Technologies does not recommend using
- tune_pci=1 in production systems: working with tune_pci=1 set is untested
- and is known to trigger instability issues on some platforms.
-
+++ /dev/null
-========================================================================
-
- Open Fabrics Enterprise Distribution (OFED)
- MVAPICH2-1.5.1 in OFED 1.5.2 Release Notes
-
- September 2010
-
-
-Overview
---------
-
-These are the release notes for MVAPICH2-1.5.1. MVAPICH2 is an MPI-2
-implementation over InfiniBand, iWARP and RoCEE (RDMAoE) from the Ohio
-State University (http://mvapich.cse.ohio-state.edu/).
-
-
-User Guide
-----------
-
-For more information on using MVAPICH2-1.5.1, please visit the user
-guide at http://mvapich.cse.ohio-state.edu/support/.
-
-
-Software Dependencies
----------------------
-
-MVAPICH2 depends on the installation of the OFED Distribution stack with
-OpenSM running. The MPI module also requires an established network
-interface (either InfiniBand, IPoIB, iWARP, RoCEE, uDAPL, or Ethernet).
-BLCR support is needed if built with fault tolerance support. Similarly,
-HWLOC support is needed if built with Portable Hardware Locality feature
-for CPU mapping.
-
-
-ChangeLog
----------
-
-* Features and Enhancements
- - Significantly reduce memory footprint on some systems by changing
- the stack size setting for multi-rail configurations
- - Optimization to the number of RDMA Fast Path connections
- - Performance improvements in Scatterv and Gatherv collectives for
- CH3 interface (Thanks to Dan Kokran and Max Suarez of NASA for
- identifying the issue)
- - Tuning of Broadcast Collective
- - Support for tuning of eager thresholds based on both adapter and
- platform type
- - Environment variables for message sizes can now be expressed in
- short form K=Kilobytes and M=Megabytes (e.g.
- MV2_IBA_EAGER_THRESHOLD=12K)
- - Ability to selectively use some or all HCAs using colon separated
- lists. e.g. MV2_IBA_HCA=mlx4_0:mlx4_1
- - Improved Bunch/Scatter mapping for process binding with HWLOC and
- SMT support (Thanks to Dr. Bernd Kallies of ZIB for ideas and
- suggestions)
- - Update to Hydra code from MPICH2-1.3b1
- - Auto-detection of various iWARP adapters
- - Specifying MV2_USE_IWARP=1 is no longer needed when using iWARP
- - Changing automatic eager threshold selection and tuning for iWARP
- adapters based on number of nodes in the system instead of the
- number of processes
- - PSM progress loop optimization for QLogic Adapters (Thanks to Dr.
- Avneesh Pant of QLogic for the patch)
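-
-The short-form message sizes mentioned above (e.g. 12K) can be expanded to
-byte counts when checking a job script. The following is a minimal sketch;
-the helper name size_to_bytes is hypothetical and not part of MVAPICH2, and
-it assumes that K means 1024 bytes and M means 1024*1024 bytes.

```shell
#!/usr/bin/env bash
# size_to_bytes is a hypothetical helper, not part of MVAPICH2. It assumes
# K means 1024 bytes and M means 1024*1024 bytes; a plain number is
# printed back unchanged.
size_to_bytes() {
    local value=$1
    case $value in
        *K) echo $(( ${value%K} * 1024 )) ;;
        *M) echo $(( ${value%M} * 1024 * 1024 )) ;;
        *)  echo "$value" ;;
    esac
}

size_to_bytes 12K
size_to_bytes 1M
size_to_bytes 4096
```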
-
-* Bug fixes
- - Fix memory leak in registration cache with --enable-g=all
- - Fix memory leak in operations using datatype modules
- - Fix for rdma_cross_connect issue for RDMA CM. The server is
- prevented from initiating a connection.
- - Don't fail during build if RDMA CM is unavailable
- - Various mpirun_rsh bug fixes for CH3, Nemesis and uDAPL interfaces
- - ROMIO panfs build fix
- - Update panfs for not-so-new ADIO file function pointers
- - Shared libraries can be generated with unknown compilers
- - Explicitly link against DL library to prevent build error due to
- DSO link change in Fedora 13 (introduced with gcc-4.4.3-5.fc13)
- - Fix regression that prevents the proper use of our internal HWLOC
- component
- - Remove spurious debug flags when certain options are selected at
- build time
- - Error code added for situation when received eager SMP message is
- larger than receive buffer
- - Fix for Gather and GatherV back-to-back hang problem with LiMIC2
- - Fix for packetized send in Nemesis
- - Fix related to eager threshold in nemesis ib-netmod
- - Fix initialization parameter for Nemesis based on adapter type
- - Fix for uDAPL one sided operations (Thanks to Jakub Fedoruk from
- Intel for reporting this)
- - Fix an issue with out-of-order message handling for iWARP
- - Fixes for memory leak and Shared context Handling in PSM for
- QLogic Adapters (Thanks to Dr. Avneesh Pant of QLogic for the
- patch)
-
-
-Main Verification Flows
------------------------
-
-In order to verify the correctness of MVAPICH2-1.5.1, the following
-tests and parameters were run.
-
-Test              Description
-====================================================================
-Intel             Intel's MPI functionality test suite
-OSU Benchmarks    OSU's performance tests
-IMB               Intel's MPI Benchmark test
-mpich2            Test suite distributed with MPICH2
-NAS               NAS Parallel Benchmarks (NPB3.2)
-
-
-Mailing List
-------------
-
-There is a public mailing list mvapich-discuss@cse.ohio-state.edu for
-mvapich users and developers to
-- Ask for help and support from each other and get prompt response
-- Contribute patches and enhancements
-
-========================================================================
+++ /dev/null
- Open Fabrics Enterprise Distribution (OFED)
- OSU MPI MVAPICH-1.2.0, in OFED 1.5 Release Notes
-
- December 2009
-
-
-===============================================================================
-Table of Contents
-===============================================================================
-1. Overview
-2. Software Dependencies
-3. New Features
-4. Bug Fixes
-5. Known Issues
-6. Main Verification Flows
-
-
-===============================================================================
-1. Overview
-===============================================================================
-These are the release notes for OSU MPI MVAPICH-1.2.0.
-OSU MPI is an MPI channel implementation over InfiniBand
-by Ohio State University (OSU).
-
-See http://mvapich.cse.ohio-state.edu
-
-
-===============================================================================
-2. Software Dependencies
-===============================================================================
-OSU MPI depends on the installation of the OFED stack with OpenSM running.
-The MPI module also requires an established network interface (either
-InfiniBand IPoIB or Ethernet).
-
-
-===============================================================================
-3. New Features ( Compared to mvapich 1.1.0 )
-===============================================================================
-MVAPICH-1.2.0 has the following additional features:
-- Advanced network recovery support
-- mpirun launcher improvements
-- Efficient intra-node shared memory communication
- support for diskless clusters
-- RoCEE (RDMAoE) networks support
-
-===============================================================================
-4. Bug Fixes ( Compared to mvapich 1.1.0 )
-===============================================================================
-- Multiple fixes for mpirun_rsh launcher
-
-===============================================================================
-5. Known Issues
-===============================================================================
-- Shared memory broadcast optimization is disabled by default.
-
-- MVAPICH MPI compiled on AMD x86_64 does not work with MVAPICH MPI compiled
- on Intel X86_64 (EM64t).
- Workaround:
- Use the "VIADEV_USE_COMPAT_MODE=1" run time option to enable a compatibility
- mode that works on both AMD and Intel platforms.
-
-- A process running MPI cannot fork after MPI_Init unless the environment
- variable IBV_FORK_SAFE=1 is set to enable fork support. This support also
- requires a kernel version of 2.6.16 or higher.
-
-- For users of Mellanox Technologies firmware fw-23108 or fw-25208 only:
- MVAPICH might fail in its default configuration if your HCA is burnt with an
- fw-23108 version that is earlier than 3.4.000, or with an fw-25208 version
- 4.7.400 or earlier.
-
- NOTE: There is no issue if you chose to update firmware during Mellanox
- OFED installation as newer firmware versions were burnt.
-
- Workaround:
- Option 1 - Update the firmware. For instructions, see Mellanox Firmware Tools
- (MFT) User's Manual under the docs/ folder.
- Option 2 - In mvapich.conf, set VIADEV_SRQ_ENABLE=0
-
-- MVAPICH may fail to run on some SLES 10 machines due to problems in resolving
- the host name.
- Workaround: Edit /etc/hosts and comment-out/remove the line that maps
- IP address 127.0.0.2 to the system's fully qualified hostname.
-
-
-===============================================================================
-6. Main Verification Flows
-===============================================================================
-In order to verify the correctness of MVAPICH, the following tests and
-parameters were run.
-
-Test          Description
--------------------------------------------------------------------
-Intel's       Test suite - 1400 Intel tests
-BW/LT         OSU's test for bandwidth and latency
-IMB           Intel's MPI Benchmark test
-mpitest       b_eff test
-Presta        Presta multicast test
-Linpack       Linpack benchmark
-NAS2.3        NAS NPB2.3 tests
-SuperLU       SuperLU benchmark (NERSC edition)
-NAMD          NAMD application
-CAM           CAM application
+++ /dev/null
- Open Fabrics Enterprise Distribution (OFED)
- NetEffect Ethernet Cluster Server Adapter Release Notes
- September 2010
-
-
-
-The iw_nes module and libnes user library provide RDMA and L2IF
-support for the NetEffect Ethernet Cluster Server Adapters.
-
-==========
-What's New
-==========
-OFED 1.5.2 contains several enhancements and bug fixes to the iw_nes driver.
-
-* Add new feature iWarp Multicast Acceleration (IMA).
-* Add module option to disable extra doorbell read after a write.
-* Change CQ event notification to not fire event unless there is a
- new CQE not polled.
-* Fix payload calculation for post receive with more than one SGE.
-* Fix crash when CLOSE was indicated twice due to connection close
- during remote peer's timeout on pending MPA reply.
-* Fix ifdown hang by not calling ib_unregister_device() till removal
- of iw_nes module.
-* Handle RST when state of connection is in FIN_WAIT2.
-* Correct properties for various nes_query_{qp, port, device} calls.
-
-
-============================================
-Required Setting - RDMA Unify TCP port space
-============================================
-RDMA connections use the same TCP port space as the host stack. To avoid
-conflicts, set rdma_cm module option unify_tcp_port_space to 1 by adding
-the following to /etc/modprobe.conf:
-
- options rdma_cm unify_tcp_port_space=1
-
-
-========================================
-Required Setting - Power Management Mode
-========================================
-If possible, disable Active State Power Management in the BIOS, e.g.:
-
- PCIe ASPM L0s - Active State Power Management: DISABLED
-
-
-=======================
-Loadable Module Options
-=======================
-The following options can be used when loading the iw_nes module by modifying
-the modprobe.conf file.
-
-wide_ppm_offset=0
- Setting this to 1 increases the CX4 interface clock ppm offset to 300ppm.
- The default setting, 0, is 100ppm.
-
-mpa_version=1
- MPA version to be used in MPA Req/Resp (0 or 1).
-
-disable_mpa_crc=0
- Disable checking of MPA CRC.
- Set to 1 to disable MPA CRC checking.
-
-send_first=0
- Send RDMA Message First on Active Connection.
-
-nes_drv_opt=0x00000100
- Following options are supported:
-
- 0x00000010 - Enable MSI
- 0x00000080 - No Inline Data
- 0x00000100 - Disable Interrupt Moderation
- 0x00000200 - Disable Virtual Work Queue
- 0x00001000 - Disable extra doorbell read after write
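-
-The values listed for nes_drv_opt look like individual bit flags, so several
-of them can presumably be combined with a bitwise OR. That combination rule
-is an assumption, not something these notes state. A minimal sketch of
-computing a combined value follows; the variable names and the combine_flags
-helper are hypothetical.

```shell
#!/usr/bin/env bash
# Flag values copied from the nes_drv_opt list; the variable names are
# hypothetical and exist only for this sketch.
NES_ENABLE_MSI=0x00000010
NES_NO_INLINE_DATA=0x00000080
NES_DISABLE_INT_MOD=0x00000100
NES_DISABLE_VIRT_WQ=0x00000200
NES_DISABLE_EXTRA_DB_READ=0x00001000

# combine_flags ORs its arguments together and prints an 8-digit hex value.
combine_flags() {
    local result=0 flag
    for flag in "$@"; do
        result=$(( result | flag ))
    done
    printf '0x%08x\n' "$result"
}

opt=$(combine_flags "$NES_DISABLE_INT_MOD" "$NES_DISABLE_EXTRA_DB_READ")
echo "options iw_nes nes_drv_opt=${opt}"
```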
-
-nes_debug_level=0
- Specify debug output level.
-
-wqm_quanta=65536
- Set size of data to be transmitted at a time.
-
-limit_maxrdreqsz=0
- Limit PCI read request size to 256 bytes.
-
-
-===============
-Runtime Options
-===============
-The following options can be used to alter the behavior of the iw_nes module:
-NOTE: Assuming NetEffect Ethernet Cluster Server Adapter is assigned eth2.
-
- ifconfig eth2 mtu 9000 - largest mtu supported
-
- ethtool -K eth2 tso on - enables TSO
- ethtool -K eth2 tso off - disables TSO
-
- ethtool -C eth2 rx-usecs-irq 128 - set static interrupt moderation
-
- ethtool -C eth2 adaptive-rx on - enable dynamic interrupt moderation
- ethtool -C eth2 adaptive-rx off - disable dynamic interrupt moderation
- ethtool -C eth2 rx-frames-low 16 - low watermark of rx queue for dynamic
- interrupt moderation
- ethtool -C eth2 rx-frames-high 256 - high watermark of rx queue for
- dynamic interrupt moderation
- ethtool -C eth2 rx-usecs-low 40 - smallest interrupt moderation timer
- for dynamic interrupt moderation
- ethtool -C eth2 rx-usecs-high 1000 - largest interrupt moderation timer
- for dynamic interrupt moderation
-
-===================
-uDAPL Configuration
-===================
-The rest of this document assumes the following uDAPL settings in dat.conf:
-
- OpenIB-iwarp u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth2 0" ""
- ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
-
-
-==============
-mpd.hosts file
-==============
-mpd.hosts is a text file with a list of nodes, one per line, in the MPI ring.
-Use either fully qualified hostname or IP address.
-
-
-=======================================
-Recommended Settings for HP MPI 2.2.7
-=======================================
-Add the following to mpirun command:
-
- -1sided
-
-Example mpirun command with uDAPL-2.0:
-
- mpirun -np 2 -hostfile /opt/mpd.hosts
- -UDAPL -prot -intra=shm
- -e MPI_HASIC_UDAPL=ofa-v2-iwarp
- -1sided
- /opt/hpmpi/help/hello_world
-
-Example mpirun command with uDAPL-1.2:
-
- mpirun -np 2 -hostfile /opt/mpd.hosts
- -UDAPL -prot -intra=shm
- -e MPI_HASIC_UDAPL=OpenIB-iwarp
- -1sided
- /opt/hpmpi/help/hello_world
-
-
-============================================================
-Recommended Settings for Platform MPI 7.1 (formerly HP-MPI)
-============================================================
-Add the following to mpirun command:
-
- -1sided
-
-Example mpirun command with uDAPL-2.0:
-
- mpirun -np 2 -hostfile /opt/mpd.hosts
- -UDAPL -prot -intra=shm
- -e MPI_HASIC_UDAPL=ofa-v2-iwarp
- -1sided
- /opt/platform_mpi/help/hello_world
-
-Example mpirun command with uDAPL-1.2:
-
- mpirun -np 2 -hostfile /opt/mpd.hosts
- -UDAPL -prot -intra=shm
- -e MPI_HASIC_UDAPL=OpenIB-iwarp
- -1sided
- /opt/platform_mpi/help/hello_world
-
-
-==============================================
-Recommended Settings for Intel MPI 3.2.x/4.0.x
-==============================================
-Add the following to mpiexec command:
-
- -genv I_MPI_FALLBACK_DEVICE 0
- -genv I_MPI_DEVICE rdma:OpenIB-iwarp
- -genv I_MPI_USE_RENDEZVOUS_RDMA_WRITE 1
-
-Example mpiexec command line for uDAPL-2.0:
-
- mpiexec -genv I_MPI_FALLBACK_DEVICE 0
- -genv I_MPI_DEVICE rdma:ofa-v2-iwarp
- -genv I_MPI_USE_RENDEZVOUS_RDMA_WRITE 1
- -ppn 1 -n 2
- /opt/intel/impi/3.2.2/bin64/IMB-MPI1
-
-Example mpiexec command line for uDAPL-1.2:
- mpiexec -genv I_MPI_FALLBACK_DEVICE 0
- -genv I_MPI_DEVICE rdma:OpenIB-iwarp
- -genv I_MPI_USE_RENDEZVOUS_RDMA_WRITE 1
- -ppn 1 -n 2
- /opt/intel/impi/3.2.2/bin64/IMB-MPI1
-
-
-========================================
-Recommended Setting for MVAPICH2 and OFA
-========================================
-Add the following to the mpirun command:
-
- -env MV2_USE_IWARP_MODE 1
-
-Example mpiexec command line:
-
- mpiexec -l -n 2
- -env MV2_USE_IWARP_MODE 1
- /usr/mpi/gcc/mvapich2-1.5/tests/osu_benchmarks-3.1.1/osu_latency
-
-
-==========================================
-Recommended Setting for MVAPICH2 and uDAPL
-==========================================
-Add the following to the mpirun command for 64 or more processes:
-
- -env MV2_ON_DEMAND_THRESHOLD <number of processes>
-
-Example mpirun command with uDAPL-2.0:
-
- mpiexec -l -n 64
- -env MV2_DAPL_PROVIDER ofa-v2-iwarp
- -env MV2_ON_DEMAND_THRESHOLD 64
- /usr/mpi/gcc/mvapich2-1.5/tests/IMB-3.2/IMB-MPI1
-
-Example mpirun command with uDAPL-1.2:
-
- mpiexec -l -n 64
- -env MV2_DAPL_PROVIDER OpenIB-iwarp
- -env MV2_ON_DEMAND_THRESHOLD 64
- /usr/mpi/gcc/mvapich2-1.5/tests/IMB-3.2/IMB-MPI1
-
-
-===========================
-Modify Settings in Open MPI
-===========================
-There is more than one way to specify MCA parameters in
-Open MPI. Please visit this link and use the best method
-for your environment:
-
-http://www.open-mpi.org/faq/?category=tuning#setting-mca-params
-
-
-=======================================
-Recommended Settings for Open MPI 1.4.2
-=======================================
-Allow the sender to use RDMA Writes:
-
- -mca btl_openib_flags 2
-
-Example mpirun command line:
-
- mpirun -np 2 -hostfile /opt/mpd.hosts
- -mca btl openib,self,sm
- -mca btl_mpi_leave_pinned 0
- -mca btl_openib_flags 2
- /usr/mpi/gcc/openmpi-1.4.2/tests/IMB-3.2/IMB-MPI1
-
-
-===================================
-iWARP Multicast Acceleration (IMA)
-===================================
-
-iWARP multicast acceleration enables kernel bypass for raw L2 multicast
-traffic through the user-space verbs API, using the newly defined QP type
-IBV_QPT_RAW_ETH.
-
-The L2 RAW_ETH acceleration assumes that the user application transmits and
-receives a whole L2 frame, including MAC/IP/UDP/TCP headers.
-
-ETH RAW QP usage:
-First, the application creates an IBV_QPT_RAW_ETH QP with its associated CQ,
-PD, and completion channels, as is done for an RDMA connection.
-
-The next step is to enable L2 MAC address RX filters that direct received
-multicasts to the RAW_ETH QPs, using the ibv_attach_mcast() verb.
-
-From this point the application is ready to receive and transmit multicast
-traffic.
-
-For multicast acceleration, the user application passes the whole IGMP frame
-to ibv_post_send(), including the MAC header, IP header, UDP header, and UDP
-payload. The user is responsible for IP fragmentation when the payload is
-larger than the MTU. Each fragment is transmitted as a separate L2 frame.
-ibv_poll_cq() reports the status of the transmit buffer.
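-
-The fragmentation requirement above can be estimated with a little
-arithmetic. The following is only a sketch under stated assumptions: IPv4
-with a 20-byte header (no options), an 8-byte UDP header, and an MTU that
-counts the IP header but not the MAC header. The function name is
-hypothetical and is not part of the IMA API.
-
```shell
#!/bin/bash
# Number of IPv4 fragments needed to send one UDP datagram.
# Assumes a 20-byte IP header (no options) and an 8-byte UDP header.
fragments_needed() {
    local udp_payload=$1 mtu=$2
    local ip_hdr=20 udp_hdr=8
    # The UDP header travels in the first fragment as part of the IP payload.
    local ip_payload=$(( udp_payload + udp_hdr ))
    # Every fragment except the last must carry a multiple of 8 bytes.
    local per_frag=$(( (mtu - ip_hdr) / 8 * 8 ))
    echo $(( (ip_payload + per_frag - 1) / per_frag ))
}

fragments_needed 1472 1500   # prints 1
fragments_needed 1473 1500   # prints 2
fragments_needed 4000 1500   # prints 3
```
-
-Each fragment counted here would be posted to ibv_post_send() as its own
-L2 frame, with its own MAC and IP headers.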
-
-On the receive path, ibv_poll_cq() returns information about the received L2
-packet. The RX buffer (previously posted by ibv_post_recv()) contains the
-whole L2 frame, including the MAC header, IP header, and UDP header.
-The user application is responsible for checking that the received packet is
-a valid UDP frame, so it must check the fragments and compute the
-checksums.
-
-IMA API description (NE020 specific):
-The user application must create separate CQs for the RX and TX paths.
-Only a single SGE is supported on transmit.
-The user application must post at least 65 RX buffers to keep the RX path
-working.
-
-IMA device:
-IMA requires the /dev/infiniband/nes_ud_sksq device to get access to the
-optimized IMA transmit path. The best way to create this device is to add
-the following line manually to the /etc/udev/rules.d/90-ib.rules file after
-installing the OFED distribution, and then reboot the machine.
-
-KERNEL=="nes_ud_sksq", NAME="infiniband/%k", MODE="0644"
-
-As a result, the 90-ib.rules file should look like this:
-
-KERNEL=="umad*", NAME="infiniband/%k"
-KERNEL=="issm*", NAME="infiniband/%k"
-KERNEL=="ucm*", NAME="infiniband/%k", MODE="0666"
-KERNEL=="uverbs*", NAME="infiniband/%k", MODE="0666"
-KERNEL=="ucma", NAME="infiniband/%k", MODE="0666"
-KERNEL=="rdma_cm", NAME="infiniband/%k", MODE="0666"
-KERNEL=="nes_ud_sksq", NAME="infiniband/%k", MODE="0644"
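-
-The manual edit can also be scripted. This is a sketch, not part of the OFED
-installer: the add_ima_rule function and the rules_file variable are
-hypothetical names, and the rule text is copied from the listing above. The
-function appends the rule only when no nes_ud_sksq entry is present, so it
-is safe to run more than once.
-
```shell
#!/bin/bash
# rules_file defaults to the path named above; override it to try the
# sketch on a scratch file first.
rules_file=${rules_file:-/etc/udev/rules.d/90-ib.rules}
rule='KERNEL=="nes_ud_sksq", NAME="infiniband/%k", MODE="0644"'

add_ima_rule() {
    # Append the rule only if no nes_ud_sksq entry exists yet.
    if ! grep -qs 'KERNEL=="nes_ud_sksq"' "$rules_file"; then
        printf '%s\n' "$rule" >> "$rules_file"
    fi
}
```
-
-Run add_ima_rule as root after installing OFED, then reboot as described
-above.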
-
-
-
-NetEffect is a trademark of Intel Corporation in the U.S. and other countries.
+++ /dev/null
-################################################################################
-# #
-# NFS/RDMA README #
-# #
-################################################################################
-
- Author: NetApp and Open Grid Computing
-
- Adapted for OFED 1.5.1 (from linux-2.6.30/Documentation/filesystems/nfs-rdma.txt)
- by Jon Mason
-
-Table of Contents
-~~~~~~~~~~~~~~~~~
- - Overview
- - OFED 1.5.1 limitations
- - Getting Help
- - Installation
- - Check RDMA and NFS Setup
- - NFS/RDMA Setup
-
-Overview
-~~~~~~~~
-
- This document describes how to install and set up the Linux NFS/RDMA client
- and server software.
-
- The NFS/RDMA client was first included in Linux 2.6.24. The NFS/RDMA server
- was first included in the following release, Linux 2.6.25.
-
- In our testing, we have obtained excellent performance results (full 10Gbit
- wire bandwidth at minimal client CPU) under many workloads. The code passes
- the full Connectathon test suite and operates over both InfiniBand and iWARP
- RDMA adapters.
-
-OFED 1.5.1 limitations:
-~~~~~~~~~~~~~~~~~~~~~
- NFS-RDMA is supported for the following releases:
- - Red Hat Enterprise Linux (RHEL) version 5.2
- - Red Hat Enterprise Linux (RHEL) version 5.3
- - Red Hat Enterprise Linux (RHEL) version 5.4
- - SUSE Linux Enterprise Server (SLES) version 11
-
- And the following kernel.org kernels:
- - 2.6.22
- - 2.6.25
- - 2.6.30
-
- All other Linux distributions and kernel versions are NOT supported on OFED
- 1.5.1.
-
-Getting Help
-~~~~~~~~~~~~
-
- If you get stuck, you can ask questions on the
- nfs-rdma-devel@lists.sourceforge.net or linux-rdma@vger.kernel.org
- mailing lists.
-
-Installation
-~~~~~~~~~~~~
-
- These instructions are a step-by-step guide to building a machine for
- use with NFS/RDMA.
-
- - Install an RDMA device
-
- Any device supported by the drivers in drivers/infiniband/hw is acceptable.
-
- Testing has been performed using several Mellanox-based IB cards and
- the Chelsio cxgb3 iWARP adapter.
-
- - Install OFED 1.5.1
-
- NFS/RDMA has been tested on RHEL 5.2, RHEL 5.3, RHEL 5.4, SLES 11, and
- kernels 2.6.22, 2.6.25, and 2.6.30. On these kernels,
- NFS/RDMA is installed by default if you select "install all",
- and can be explicitly included in a "custom" install.
-
- In addition, the install script will install a version of the nfs-utils that
- is required for NFS/RDMA. The binary installed will be named "mount.rnfs".
- This version is not necessary for Linux Distributions with nfs-utils 1.1 or
- later.
-
- Upon successful installation, the nfs kernel modules will be placed in the
- directory /lib/modules/`uname -r`/updates. It is recommended that you reboot
- to ensure that the correct modules are loaded.
-
-Check RDMA and NFS Setup
-~~~~~~~~~~~~~~~~~~~~~~~~
-
- Before configuring the NFS/RDMA software, it is a good idea to test
- your new kernel to ensure that the kernel is working correctly.
- In particular, it is a good idea to verify that the RDMA stack
- is functioning as expected and standard NFS over TCP/IP and/or UDP/IP
- is working properly.
-
- - Check RDMA Setup
-
- If you built the RDMA components as modules, load them at
- this time. For example, if you are using a Mellanox Tavor/Sinai/Arbel
- card:
-
- $ modprobe ib_mthca
- $ modprobe ib_ipoib
-
- If you are using InfiniBand, make sure there is a Subnet Manager (SM)
- running on the network. If your IB switch has an embedded SM, you can
- use it. Otherwise, you will need to run an SM, such as OpenSM, on one
- of your end nodes.
-
- If an SM is running on your network, you should see the following:
-
- $ cat /sys/class/infiniband/driverX/ports/1/state
- 4: ACTIVE
-
- where driverX is mthca0, ipath5, ehca3, etc.
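-
- The same check can be run across every device and port at once. This is
- a sketch only: list_inactive_ports is a hypothetical helper, and its
- optional argument (defaulting to /sys/class/infiniband) exists so the
- function can be pointed at a test directory.
-
```shell
#!/bin/bash
# Print every InfiniBand port whose state is not "4: ACTIVE".
# The optional first argument overrides the sysfs directory to scan.
list_inactive_ports() {
    local sysfs_root=${1:-/sys/class/infiniband}
    local state_file state
    for state_file in "$sysfs_root"/*/ports/*/state; do
        # Skip the unexpanded glob when no device is present.
        [ -r "$state_file" ] || continue
        state=$(cat "$state_file")
        case $state in
            "4: ACTIVE") ;;
            *) echo "$state_file: $state" ;;
        esac
    done
}
```
-
- No output from list_inactive_ports means every port found is ACTIVE. It
- also prints nothing when no device is present, so confirm that the
- driver is loaded first.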
-
- To further test the InfiniBand software stack, use IPoIB (this
- assumes you have two IB hosts named host1 and host2):
-
- host1$ ifconfig ib0 a.b.c.x
- host2$ ifconfig ib0 a.b.c.y
- host1$ ping a.b.c.y
- host2$ ping a.b.c.x
-
- For other device types, follow the appropriate procedures.
-
- - Check NFS Setup
-
- For the NFS components enabled above (client and/or server),
- test their functionality over standard Ethernet using TCP/IP or UDP/IP.
-
-NFS/RDMA Setup
-~~~~~~~~~~~~~~
-
- We recommend that you use two machines, one to act as the client and
- one to act as the server.
-
- One time configuration:
-
- - On the server system, configure the /etc/exports file and
- start the NFS/RDMA server.
-
- Exports entries with the following formats have been tested:
-
- /vol0 192.168.0.47(fsid=0,rw,async,insecure,no_root_squash)
- /vol0 192.168.0.0/255.255.255.0(fsid=0,rw,async,insecure,no_root_squash)
-
- The IP address(es) is(are) the client's IPoIB address for an InfiniBand
- HCA or the client's iWARP address(es) for an RNIC.
-
- NOTE: The "insecure" option must be used because the NFS/RDMA client does
- not use a reserved port.
-
- Each time a machine boots:
-
- - Load and configure the RDMA drivers
-
- For InfiniBand using a Mellanox adapter:
-
- $ modprobe ib_mthca
- $ modprobe ib_ipoib
- $ ifconfig ib0 a.b.c.d
-
- NOTE: use unique addresses for the client and server
-
- - Start the NFS server
-
- Load the RDMA transport module:
-
- $ modprobe svcrdma
-
- Start the server:
-
- $ /etc/init.d/nfsserver start
-
- or
-
- $ service nfs start
-
- Instruct the server to listen on the RDMA transport:
-
- $ echo rdma 20049 > /proc/fs/nfsd/portlist
-
- NOTE for SLES10 servers: The nfs start scripts on most distros start
- rpc.statd by default. However, the in-kernel lockd that was in SLES10 has
- been removed in the new kernels. Since OFED back-ports the new code to
- the older distros, there is no in-kernel lockd in SLES10, and the SLES10
- nfsserver scripts do not know that they need to start rpc.statd. Therefore,
- the nfsserver scripts are modified when the rnfs-utils rpm is installed so
- that they start and stop rpc.statd.
-
- - On the client system
-
- Load the RDMA client module:
-
- $ modprobe xprtrdma
-
- Mount the NFS/RDMA server:
-
- $ mount.rnfs <IPoIB-server-name-or-address>:/<export> /mnt -o proto=rdma,port=20049
-
- NOTE: For kernels < 2.6.23, the "-i" flag must be passed into mount.rnfs.
- This option allows the mount command to ignore the kernel version check. If
- not disabled, the check will prevent passing arguments to the kernel and not
- allow the updated version of NFS to accept the "rdma" NFS option.
-
- To verify that the mount is using RDMA, run "cat /proc/mounts" and check
- the "proto" field for the given mount.
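-
- That check can be scripted as follows. This is a sketch:
- mount_uses_rdma is a hypothetical helper, and its second argument
- (defaulting to /proc/mounts) is there only so it can be tried against a
- sample file.
-
```shell
#!/bin/bash
# Succeed (exit status 0) if the given mount point is mounted with
# proto=rdma. The optional second argument overrides the mounts file.
mount_uses_rdma() {
    local mount_point=$1 mounts_file=${2:-/proc/mounts}
    # Field 2 is the mount point; field 4 is the comma-separated option list.
    awk -v mp="$mount_point" '
        $2 == mp && $4 ~ /(^|,)proto=rdma(,|$)/ { found = 1 }
        END { exit !found }
    ' "$mounts_file"
}
```
-
- For example: mount_uses_rdma /mnt && echo "NFS/RDMA mount is active"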
-
- Congratulations! You're using NFS/RDMA!
-
-Known Issues
-~~~~~~~~~~~~
-
-If you're running NFS/RDMA over Chelsio's T3 RNIC and your clients are using
-a 64KB page size (like PPC64 and IA64 systems) and your server is using a
-4KB page size (like i386 and X86_64), then you need to mount the server
-using rsize=32768,wsize=32768 to avoid overrunning the Chelsio RNIC fast
-register limits. This is a known firmware limitation in the Chelsio RNIC.
-
-Running NFS/RDMA over Mellanox's ConnectX HCA requires that the adapter firmware
-be 2.7.0 or greater on all NFS clients and servers. Firmware 2.6.0 has known
-issues that prevent the RDMA connection from being established. Firmware 2.7.0
-has resolved these issues.
-
-IPv6 support requires a portmap that supports version 4. The portmap included
-in RHEL5 and SLES10 only supports version 2. Without version 4 support, the following
-error will be logged:
- svc: failed to register lockdv1 (errno 97).
-This error will not affect IPv4 support.
+++ /dev/null
-#!/bin/bash
-#
-# Copyright (c) 2009 Mellanox Technologies. All rights reserved.
-#
-# This Software is licensed under one of the following licenses:
-#
-# 1) under the terms of the "Common Public License 1.0" a copy of which is
-# available from the Open Source Initiative, see
-# http://www.opensource.org/licenses/cpl.php.
-#
-# 2) under the terms of the "The BSD License" a copy of which is
-# available from the Open Source Initiative, see
-# http://www.opensource.org/licenses/bsd-license.php.
-#
-# 3) under the terms of the "GNU General Public License (GPL) Version 2" a
-# copy of which is available from the Open Source Initiative, see
-# http://www.opensource.org/licenses/gpl-license.php.
-#
-# Licensee has the right to choose one of the above licenses.
-#
-# Redistributions of source code must retain the above copyright
-# notice and one of the license notices.
-#
-# Redistributions in binary form must reproduce both the above copyright
-# notice, one of the license notices in the documentation
-# and/or other materials provided with the distribution.
-#
-#
-# Add/Remove a patch to/from OFED's ofa_kernel package
-
-
-usage()
-{
-cat << EOF
-
- Usage:
- Add patch to OFED:
- `basename $0` --add
- --ofed|-o <path_to_ofed>
- --patch|-p <path_to_patch>
- --type|-t <kernel|backport <kernel tag>|addons <kernel tag>>
-
- Remove patch from OFED:
- `basename $0` --remove
- --ofed|-o <path_to_ofed>
- --patch|-p <patch name>
- --type|-t <kernel|backport <kernel tag>|addons <kernel tag>>
-
- Example:
- `basename $0` --add --ofed /tmp/OFED-1.X/ --patch /tmp/cma_establish.patch --type kernel
-
- `basename $0` --remove --ofed /tmp/OFED-1.X/ --patch cma_establish.patch --type kernel
-
-EOF
-}
-
-action=""
-
-# Echo a command, execute it, and exit if it fails
-ex()
-{
- echo "$@"
- if ! "$@"; then
- printf '\nFailed executing %s\n\n' "$*"
- exit 1
- fi
-}
-
-add_patch()
-{
- if [ -f $2/${1##*/} ]; then
- echo Replacing $2/${1##*/}
- ex /bin/rm -f $2/${1##*/}
- fi
- ex cp $1 $2
-}
-
-remove_patch()
-{
- if [ -f $2/${1##*/} ]; then
- echo Removing $2/${1##*/}
- ex /bin/rm -f $2/${1##*/}
- else
- echo Patch $2/${1##*/} was not found
- exit 1
- fi
-}
-
-set_rpm_info()
-{
- package_SRC_RPM=$(/bin/ls -1 ${ofed}/SRPMS/${1}*src.rpm 2> /dev/null)
- if [[ -n "${package_SRC_RPM}" && -s ${package_SRC_RPM} ]]; then
- package_name=$(rpm --queryformat "[%{NAME}]" -qp ${package_SRC_RPM})
- package_ver=$(rpm --queryformat "[%{VERSION}]" -qp ${package_SRC_RPM})
- package_rel=$(rpm --queryformat "[%{RELEASE}]" -qp ${package_SRC_RPM})
- else
- echo $1 src.rpm not found under ${ofed}/SRPMS
- exit 1
- fi
-}
-
-main()
-{
- while [ ! -z "$1" ]
- do
- case $1 in
- --add)
- action="add"
- shift
- ;;
- --remove)
- action="remove"
- shift
- ;;
- --ofed|-o)
- ofed=$2
- shift 2
- ;;
- --patch|-p)
- patch=$2
- shift 2
- ;;
- --type|-t)
- type=$2
- shift 2
- case ${type} in
- backport|addons)
- tag=$1
- shift
- ;;
- esac
- ;;
- --help|-h)
- usage
- exit 0
- ;;
- *)
- usage
- exit 1
- ;;
- esac
- done
-
- if [ -z "$action" ]; then
- usage
- exit 1
- fi
-
- if [ -z "$ofed" ] || [ ! -d "$ofed" ]; then
- echo Set the path to the OFED directory. Use \'--ofed\' parameter
- exit 1
- else
- ofed=$(readlink -f $ofed)
- fi
-
- if [ "$action" == "add" ]; then
- if [ -z "$patch" ] || [ ! -r "$patch" ]; then
- echo Set the path to the patch file. Use \'--patch\' parameter
- exit 1
- else
- patch=$(readlink -f $patch)
- fi
- else
- if [ -z "$patch" ]; then
- echo Set the name of the patch to be removed. Use \'--patch\' parameter
- exit 1
- fi
- fi
-
- if [ -z "$type" ]; then
- echo Set the type of the patch. Use \'--type\' parameter
- exit 1
- fi
-
- if [ "$type" == "backport" ] || [ "$type" == "addons" ]; then
- if [ -z "$tag" ]; then
- echo Set tag for backport patch.
- exit 1
- fi
- fi
-
- # Get ofa RPM version
- case $type in
- kernel|backport|addons)
- set_rpm_info ofa_kernel
- ;;
- *)
- echo "Unknown type $type"
- exit 1
- ;;
- esac
-
- package=${package_name}-${package_ver}
- cd ${ofed}
- if [ ! -e SRPMS/${package}-${package_rel}.src.rpm ]; then
- echo File ${ofed}/SRPMS/${package}-${package_rel}.src.rpm not found
- exit 1
- fi
-
- if ! ( set -x && rpm -i --define "_topdir $(pwd)" SRPMS/${package}-${package_rel}.src.rpm && set +x ); then
- echo "Failed to install ${package}-${package_rel}.src.rpm"
- exit 1
- fi
-
- cd -
-
- cd ${ofed}/SOURCES
- ex tar xzf ${package}.tgz
-
- case $type in
- kernel)
- if [ "$action" == "add" ]; then
- add_patch $patch ${package}/kernel_patches/fixes
- else
- remove_patch $patch ${package}/kernel_patches/fixes
- fi
- ;;
- backport)
- if [ "$action" == "add" ]; then
- if [ ! -d ${package}/kernel_patches/backport/$tag ]; then
- echo Creating ${package}/kernel_patches/backport/$tag directory
- ex mkdir -p ${package}/kernel_patches/backport/$tag
- echo WARNING: Check that ${package} configure supports backport/$tag
- fi
- add_patch $patch ${package}/kernel_patches/backport/$tag
- else
- remove_patch $patch ${package}/kernel_patches/backport/$tag
- fi
- ;;
- addons)
- if [ "$action" == "add" ]; then
- if [ ! -d ${package}/kernel_addons/backport/$tag ]; then
- echo Creating ${package}/kernel_addons/backport/$tag directory
- ex mkdir -p ${package}/kernel_addons/backport/$tag
- echo WARNING: Check that ${package} configure supports backport/$tag
- fi
- add_patch $patch ${package}/kernel_addons/backport/$tag
- else
- remove_patch $patch ${package}/kernel_addons/backport/$tag
- fi
- ;;
- *)
- echo Unknown patch type: $type
- exit 1
- ;;
- esac
-
- ex tar czf ${package}.tgz ${package}
- cd -
-
- cd ${ofed}
- echo Rebuilding ${package_name} source rpm:
- if ! ( set -x && rpmbuild -bs --define "_topdir $(pwd)" SPECS/${package_name}.spec && set +x ); then
- echo Failed to create ${package}-${package_rel}.src.rpm
- exit 1
- fi
- ex rm -rf SOURCES/${package}*
- if [ "$action" == "add" ]; then
- echo Patch added successfully.
- else
- echo Patch removed successfully.
- fi
- echo
- echo Remove existing RPM packages from ${ofed}/RPMS directory in order
- echo to rebuild RPMs
-}
-
-main "$@"
+++ /dev/null
- Open Fabrics Enterprise Distribution (OFED)
- Open MPI in OFED 1.5.1 Copyrights, License, and Release Notes
-
- March 2010
-
-Open MPI Copyrights
--------------------
-Most files in this release are marked with the copyrights of the
-organizations who have edited them. The copyrights below generally
-reflect members of the Open MPI core team who have contributed code to
-this release. The copyrights for code used under license from other
-parties are included in the corresponding files.
-
-Copyright (c) 2004-2008 The Trustees of Indiana University and Indiana
- University Research and Technology
- Corporation. All rights reserved.
-Copyright (c) 2004-2009 The University of Tennessee and The University
- of Tennessee Research Foundation. All rights
- reserved.
-Copyright (c) 2004-2008 High Performance Computing Center Stuttgart,
- University of Stuttgart. All rights reserved.
-Copyright (c) 2004-2007 The Regents of the University of California.
- All rights reserved.
-Copyright (c) 2006-2009 Los Alamos National Security, LLC. All rights
- reserved.
-Copyright (c) 2006-2009 Cisco Systems, Inc. All rights reserved.
-Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved.
-Copyright (c) 2006-2008 Sandia National Laboratories. All rights
- reserved.
-Copyright (c) 2006-2009 Sun Microsystems, Inc. All rights reserved.
- Use is subject to license terms.
-Copyright (c) 2006-2009 The University of Houston. All rights
- reserved.
-Copyright (c) 2006-2008 Myricom, Inc. All rights reserved.
-Copyright (c) 2007-2008 UT-Battelle, LLC. All rights reserved.
-Copyright (c) 2007-2008 IBM Corporation. All rights reserved.
-Copyright (c) 1998-2005 Forschungszentrum Juelich, Juelich
- Supercomputing
- Centre, Federal Republic of Germany
-Copyright (c) 2005-2008 ZIH, TU Dresden, Federal Republic of Germany
-Copyright (c) 2007 Evergrid, Inc. All rights reserved.
-Copyright (c) 2008 Institut National de Recherche en
- Informatique. All rights reserved.
-Copyright (c) 2007 Lawrence Livermore National Security, LLC.
- All rights reserved.
-Copyright (c) 2007-2010 Mellanox Technologies. All rights reserved.
-Copyright (c) 2006 QLogic Corporation. All rights reserved.
-
-Additional copyrights may follow
-
-Open MPI License
-----------------
-Redistribution and use in source and binary forms, with or without
-modification, are permitted provided that the following conditions are
-met:
-
-- Redistributions of source code must retain the above copyright
- notice, this list of conditions and the following disclaimer.
-
-- Redistributions in binary form must reproduce the above copyright
- notice, this list of conditions and the following disclaimer listed
- in this license in the documentation and/or other materials
- provided with the distribution.
-
-- Neither the name of the copyright holders nor the names of its
- contributors may be used to endorse or promote products derived from
- this software without specific prior written permission.
-
-The copyright holders provide no reassurances that the source code
-provided does not infringe any patent, copyright, or any other
-intellectual property rights of third parties. The copyright holders
-disclaim any liability to any recipient for claims brought against
-recipient by any third party for infringement of that parties
-intellectual property rights.
-
-THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
-"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
-LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
-A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
-OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
-SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
-LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
-DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
-THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
-(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
-OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-
-===========================================================================
-
-When submitting questions and problems, be sure to include as much
-extra information as possible. This web page details all the
-information that we request in order to provide assistance:
-
- http://www.open-mpi.org/community/help/
-
-The best way to report bugs, send comments, or ask questions is to
-sign up on the user's and/or developer's mailing list (for user-level
-and developer-level questions; when in doubt, send to the user's
-list):
-
- users@open-mpi.org
- devel@open-mpi.org
-
-Because of spam, only subscribers are allowed to post to these lists
-(ensure that you subscribe with and post from exactly the same e-mail
-address -- joe@example.com is considered different than
-joe@mycomputer.example.com!). Visit these pages to subscribe to the
-lists:
-
- http://www.open-mpi.org/mailman/listinfo.cgi/users
- http://www.open-mpi.org/mailman/listinfo.cgi/devel
-
-Thanks for your time.
-
-===========================================================================
-
-Much, much more information is also available in the Open MPI FAQ:
-
- http://www.open-mpi.org/faq/
-
-===========================================================================
-
-OFED-Specific Release Notes
----------------------------
-
-** SLES 10 with Pathscale compiler support:
-
-Using the Pathscale compiler to build Open MPI on SLES10 may result in
-a non-functional Open MPI installation (every Open MPI command fails).
-If this problem occurs, try upgrading your Pathscale installation to
-the latest maintenance release, or use a different compiler to compile
-Open MPI.
-
-** Intel compiler support:
-
-Some versions of the Intel 9.1 C++ compiler suite series produce
-incorrect code when used with the Open MPI C++ bindings. Symptoms of
-this problem include crashing applications (e.g., segmentation
-violations) and Open MPI producing errors about incorrect parameters.
-Be sure to upgrade to the latest maintenance release of the Intel 9.1
-compiler to avoid these problems.
-
-** Installing newer versions of Open MPI after OFED is installed:
-
-Open MPI can be built from source after OFED is fully installed. The
-source code for Open MPI can be extracted from the SRPM shipped with
-OFED or downloaded from the main Open MPI web site:
-http://www.open-mpi.org/.
-
-To compile Open MPI from source with OFED support, fully install
-the rest of OFED. If you used the default prefix for the OFED
-installation (/usr), Open MPI should build with OpenFabrics support by
-default. If you used a different OFED prefix, you must tell Open MPI
-what it is with the "--with-openib=<OFED_prefix>" switch to configure.
-You can verify that Open MPI installed with OpenFabrics support by
-running (the exact version numbers displayed may be different; the
-important part is that the "openib" BTL is displayed):
-
- shell$ ompi_info | grep openib
- MCA btl: openib (MCA v2.0, API v2.0, Component v1.3.2)
-
-See the rest of the documentation below for other configure command
-line options and installation instructions.
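-
-The prefix rule above can be expressed as a small helper. This is a sketch
-only: openmpi_configure_args is a hypothetical function, and /opt/ofed is
-just an example of a non-default OFED prefix.
-
```shell
#!/bin/bash
# Print the extra configure switch needed for a given OFED prefix.
# With the default prefix (/usr) no switch is needed, so an empty line
# is printed.
openmpi_configure_args() {
    local ofed_prefix=${1:-/usr}
    if [ "$ofed_prefix" = "/usr" ]; then
        echo ""
    else
        echo "--with-openib=$ofed_prefix"
    fi
}

# Typical use inside an unpacked Open MPI source tree:
#   ./configure $(openmpi_configure_args /opt/ofed) && make all install
#   ompi_info | grep openib
```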
-
-** Changelog summary
-
-Showing versions 1.2.7 - 1.4; see the "NEWS" file in an Open MPI
-distribution for the full list.
-
-1.4.1 (OFED version)
----
-- Update support for various OpenFabrics devices in the openib BTL's
- .ini file.
-- Fixing RDMA CM failure during QP creation (Ticket #2307)
-
-1.4.1
----
-- Update to PLPA v1.3.2, addressing a licensing issue identified by
- the Fedora project. See
- https://svn.open-mpi.org/trac/plpa/changeset/262 for details.
-- Add check for malformed checkpoint metadata files (Ticket #2141).
-- Fix error path in ompi-checkpoint when not able to checkpoint
- (Ticket #2138).
-- Cleanup component release logic when selecting checkpoint/restart
- enabled components (Ticket #2135).
-- Fixed VT node name detection for Cray XT platforms, and fixed some
- broken VT documentation files.
-- Fix a possible race condition in tearing down RDMA CM-based
- connections.
-- Relax error checking on MPI_GRAPH_CREATE. Thanks to David Singleton
- for pointing out the issue.
-- Fix a shared memory "hang" problem that occurred on x86/x86_64
- platforms when used with the GNU >=4.4.x compiler series.
-- Add fix for Libtool 2.2.6b's problems with the PGI 10.x compiler
- suite. Inspired directly from the upstream Libtool patches that fix
- the issue (but we need something working before the next Libtool
- release).
-
-1.4
----
-
-The *only* change in the Open MPI v1.4 release (as compared to v1.3.4)
-was to update the embedded version of Libtool's libltdl to address a
-potential security vulnerability. Specifically: Open MPI v1.3.4 was
-created with GNU Libtool 2.2.6a; Open MPI v1.4 was created with GNU
-Libtool 2.2.6b. There are no other changes between Open MPI v1.3.4
-and v1.4.
-
-
-1.3.4
------
-
-- Fix some issues in OMPI's SRPM with regard to shell_scripts_basename
- and its use with mpi-selector. Thanks to Bill Johnstone for
- pointing out the problem.
-- Added many new MPI job process affinity options to mpirun. See the
- newly-updated mpirun(1) man page for details.
-- Several updates to mpirun's XML output.
-- Update to fix a few Valgrind warnings with regards to the ptmalloc2
- allocator and Open MPI's use of PLPA.
-- Many updates and fixes to the (non-default) "sm" collective
- component (i.e., native shared memory MPI collective operations).
-- Updates and fixes to some MPI_COMM_SPAWN_MULTIPLE corner cases.
-- Fix some internal copying functions in Open MPI's use of PLPA.
-- Correct some SLURM nodelist parsing logic that may have interfered
- with large jobs. Additionally, per advice from the SLURM team,
- change the environment variable that we use for obtaining the job's
- allocation.
-- Revert to an older, safer (but slower) communicator ID allocation
- algorithm.
-- Fixed minimum distance finding for OpenFabrics devices in the openib
- BTL.
-- Relax the parameter checking MPI_CART_CREATE a bit.
-- Fix MPI_COMM_SPAWN[_MULTIPLE] to only error-check the info arguments
- on the root process. Thanks to Federico Golfre Andreasi for
- reporting the problem.
-- Fixed some BLCR configure issues.
-- Fixed a potential deadlock when the openib BTL was used with
- MPI_THREAD_MULTIPLE.
-- Fixed dynamic rules selection for the "tuned" coll component.
-- Added a launch progress meter to mpirun (useful for large jobs; set
- the orte_report_launch_progress MCA parameter to 1 to see it).
-- Reduced the number of file descriptors consumed by each MPI process.
-- Add new device IDs for Chelsio T3 RNICs to the openib BTL config file.
-- Fix some CRS self component issues.
-- Added some MCA parameters to the PSM MTL to tune its run-time
- behavior.
-- Fix some VT issues with MPI_BOTTOM/MPI_IN_PLACE.
-- Man page updates from the Debian Open MPI package maintainers.
-- Add cycle counter support for the Alpha and Sparc platforms.
-- Pass visibility flags to libltdl's configure script, resulting in
- those symbols being hidden. This appears to mainly solve the
- problem of applications attempting to use different versions of
- libltdl from that used to build Open MPI.
-
-
-1.3.3
------
-
-- Fix a number of issues with the openib BTL (OpenFabrics) RDMA CM,
- including a memory corruption bug, a shutdown deadlock, and a route
- timeout. Thanks to David McMillen and Hal Rosenstock for help in
- tracking down the issues.
-- Change the behavior of the EXTRA_STATE parameter that is passed to
- Fortran attribute callback functions: this value is now stored
- internally in MPI -- it no longer references the original value
- passed by MPI_*_CREATE_KEYVAL.
-- Allow the overriding of RFC1918 and RFC3330 for the specification of
- "private" networks, thereby influencing Open MPI's TCP
- "reachability" computations.
-- Improve flow control issues in the sm btl, by both tweaking the
- shared memory progression rules and by enabling the "sync" collective
- to barrier every 1,000th collective.
-- Various fixes for the IBM XL C/C++ v10.1 compiler.
-- Allow explicit disabling of ptmalloc2 hooks at runtime (e.g., enable
- support for Debian's builtroot system). Thanks to Manuel Prinz and
- the rest of the Debian crew for helping identify and fix this issue.
-- Various minor fixes for the I/O forwarding subsystem.
-- Big endian iWARP fixes in the Open Fabrics RDMA CM support.
-- Update support for various OpenFabrics devices in the openib BTL's
- .ini file.
-- Fixed undefined symbol issue with Open MPI's parallel debugger
- message queue support so it can be compiled by Sun Studio compilers.
-- Update MPI_SUBVERSION to 1 in the Fortran bindings.
-- Fix MPI_GRAPH_CREATE Fortran 90 binding.
-- Fix MPI_GROUP_COMPARE behavior with regards to MPI_IDENT. Thanks to
- Geoffrey Irving for identifying the problem and supplying the fix.
-- Silence gcc 4.1 compiler warnings about type punning. Thanks to
- Number Cruncher for the fix.
-- Added more Valgrind and other memory-cleanup fixes. Thanks to
- various Open MPI users for help with these issues.
-- Miscellaneous VampirTrace fixes.
-- More fixes for openib credits in heavy-congestion scenarios.
-- Slightly decrease the latency in the openib BTL in some conditions
- (add "send immediate" support to the openib BTL).
-- Ensure to allow MPI_REQUEST_GET_STATUS to accept an
- MPI_STATUS_IGNORE parameter. Thanks to Shaun Jackman for the bug
- report.
-- Added Microsoft Windows support. See README.WINDOWS file for
- details.
-
-
-1.3.2
------
-
-- Fixed a potential infinite loop in the openib BTL that could occur
- in senders in some frequent-communication scenarios. Thanks to Don
- Wood for reporting the problem.
-- Add a new checksum PML variation on ob1 (main MPI point-to-point
- communication engine) to detect memory corruption in node-to-node
- messages
-- Add a new configuration option to add padding to the openib
- header so the data is aligned
-- Add a new configuration option to use an alternative checksum algo
- when using the checksum PML
-- Fixed a problem reported by multiple users on the mailing list that
- the LSF support would fail to find the appropriate libraries at
- run-time.
-- Allow empty shell designations from getpwuid(). Thanks to Sergey
- Koposov for the bug report.
-- Ensure that mpirun exits with non-zero status when applications die
- due to user signal. Thanks to Geoffroy Pignot for suggesting the
- fix.
-- Ensure that MPI_VERSION / MPI_SUBVERSION match what is returned by
- MPI_GET_VERSION. Thanks to Rob Egan for reporting the error.
-- Updated MPI_*KEYVAL_CREATE functions to properly handle Fortran
- extra state.
-- A variety of ob1 (main MPI point-to-point communication engine) bug
- fixes that could have caused hangs or seg faults.
-- Do not install Open MPI's signal handlers in MPI_INIT if there are
- already signal handlers installed. Thanks to Kees Verstoep for
- bringing the issue to our attention.
-- Fix GM support to not seg fault in MPI_INIT.
-- Various VampirTrace fixes.
-- Various PLPA fixes.
-- No longer create BTLs for invalid (TCP) devices.
-- Various man page style and lint cleanups.
-- Fix critical OpenFabrics-related bug noted here:
- http://www.open-mpi.org/community/lists/announce/2009/03/0029.php.
- Open MPI now uses a much more robust memory intercept scheme that is
- quite similar to what is used by MX. The use of "-lopenmpi-malloc"
- is no longer necessary, is deprecated, and is expected to disappear
- in a future release. -lopenmpi-malloc will continue to work for the
- duration of the Open MPI v1.3 and v1.4 series.
-- Fix some OpenFabrics shutdown errors, both regarding iWARP and SRQ.
-- Allow the udapl BTL to work on Solaris platforms that support
- relaxed PCI ordering.
-- Fix problem where the mpirun would sometimes use rsh/ssh to launch on
- the localhost (instead of simply forking).
-- Minor SLURM stdin fixes.
-- Fix to run properly under SGE jobs.
-- Scalability and latency improvements for shared memory jobs: convert
- to using one message queue instead of N queues.
-- Automatically size the shared-memory area (mmap file) to match
- better what is needed; specifically, so that large-np jobs will start.
-- Use fixed-length MPI predefined handles in order to provide ABI
- compatibility between Open MPI releases.
-- Fix building of the posix paffinity component to properly get the
- number of processors in loosely tested environments (e.g.,
- FreeBSD). Thanks to Steve Kargl for reporting the issue.
-- Fix --with-libnuma handling in configure. Thanks to Gus Correa for
- reporting the problem.
-
-
-1.3.1
------
-
-- Added "sync" coll component to allow users to synchronize every N
- collective operations on a given communicator.
-- Increased the default values of the IB and RNR timeout MCA parameters.
-- Fix a compiler error noted by Mostyn Lewis with the PGI 8.0 compiler.
-- Fix an error that prevented stdin from being forwarded if the
- rsh launcher was in use. Thanks to Branden Moore for pointing out
- the problem.
-- Correct a case where the added datatype is considered as contiguous but
- has gaps in the beginning.
-- Fix an error that limited the number of comm_spawns that could
- simultaneously be running in some environments.
-- Correct a corner case in OB1's GET protocol for long messages; the
- error could sometimes cause MPI jobs using the openib BTL to hang.
-- Fix a bunch of bugs in the IO forwarding (IOF) subsystem and add some
- new options to output to files and redirect output to xterm. Thanks to
- Jody Weissmann for helping test out many of the new fixes and
- features.
-- Fix SLURM race condition.
-- Fix MPI_File_c2f(MPI_FILE_NULL) to return 0, not -1. Thanks to
- Lisandro Dalcin for the bug report.
-- Fix the DSO build of tm PLM.
-- Various fixes for size disparity between C int's and Fortran
- INTEGER's. Thanks to Christoph van Wullen for the bug report.
-- Ensure that mpirun exits with a non-zero exit status when daemons or
- processes abort or fail to launch.
-- Various fixes to work around Intel (NetEffect) RNIC behavior.
-- Various fixes for mpirun's --preload-files and --preload-binary
- options.
-- Fix the string name in MPI::ERRORS_THROW_EXCEPTIONS.
-- Add ability to forward SIGTSTP and SIGCONT to MPI processes if you
- set the MCA parameter orte_forward_job_control to 1.
-- Allow the sm BTL to allocate larger amounts of shared memory if
- desired (helpful for very large multi-core boxen).
-- Fix a few places where we used PATH_MAX instead of OMPI_PATH_MAX,
- leading to compile problems on some platforms. Thanks to Andrea Iob
- for the bug report.
-- Fix mca_btl_openib_warn_no_device_params_found MCA parameter; it
- was accidentally being ignored.
-- Fix some run-time issues with the sctp BTL.
-- Ensure that RTLD_NEXT exists before trying to use it (e.g., it
- doesn't exist on Cygwin). Thanks to Gustavo Seabra for reporting
- the issue.
-- Various fixes to VampirTrace, including fixing compile errors on
- some platforms.
-- Fixed missing MPI_Comm_accept.3 man page; fixed minor issue in
- orterun.1 man page. Thanks to Dirk Eddelbuettel for identifying the
- problem and submitting a patch.
-- Implement the XML formatted output of stdout/stderr/stddiag.
-- Fixed mpirun's -wdir switch to ensure that working directories for
- multiple app contexts are properly handled. Thanks to Geoffroy
- Pignot for reporting the problem.
-- Improvements to the MPI C++ integer constants:
- - Allow MPI::SEEK_* constants to be used as constants
- - Allow other MPI C++ constants to be used as array sizes
-- Fix minor problem with orte-restart's command line options. See
- ticket #1761 for details. Thanks to Gregor Dschung for reporting
- the problem.
-
-1.3
----
-
-- Extended the OS X 10.5.x (Leopard) workaround for a problem when
- assembly code is compiled with -g[0-9]. Thanks to Barry Smith for
- reporting the problem. See ticket #1701.
-- Disabled MPI_REAL16 and MPI_COMPLEX32 support on platforms where the
- bit representation of REAL*16 is different than that of the C type
- of the same size (usually long double). Thanks to Julien Devriendt
- for reporting the issue. See ticket #1603.
-- Increased the size of MPI_MAX_PORT_NAME to 1024 from 36. See ticket #1533.
-- Added "notify debugger on abort" feature. See tickets #1509 and #1510.
- Thanks to Seppo Sahrakropi for the bug report.
-- Upgraded Open MPI tarballs to use Autoconf 2.63, Automake 1.10.1,
- Libtool 2.2.6a.
-- Added missing MPI::Comm::Call_errhandler() function. Thanks to Dave
- Goodell for bringing this to our attention.
-- Increased MPI_SUBVERSION value in mpi.h to 1 (i.e., MPI 2.1).
-- Changed behavior of MPI_GRAPH_CREATE, MPI_TOPO_CREATE, and several
- other topology functions per MPI-2.1.
-- Fix the type of the C++ constant MPI::IN_PLACE.
-- Various enhancements to the openib BTL:
- - Added btl_openib_if_[in|ex]clude MCA parameters for
- including/excluding comma-delimited lists of HCAs and ports.
- - Added RDMA CM support, including btl_openib_cpc_[in|ex]clude MCA
- parameters
- - Added NUMA support to only use "near" network adapters
- - Added "Bucket SRQ" (BSRQ) support to better utilize registered
- memory, including btl_openib_receive_queues MCA parameter
- - Added ConnectX XRC support (and integrated with BSRQ)
- - Added btl_openib_ib_max_inline_data MCA parameter
- - Added iWARP support
- - Revamped flow control mechanisms to be more efficient
- - "mpi_leave_pinned=1" is now the default when possible,
- automatically improving performance for large messages when
- application buffers are re-used
-- Eliminated duplicated error messages when multiple MPI processes fail
- with the same error.
-- Added NUMA support to the shared memory BTL.
-- Add Valgrind-based memory checking for MPI-semantic checks.
-- Add support for some optional Fortran datatypes (MPI_LOGICAL1,
- MPI_LOGICAL2, MPI_LOGICAL4 and MPI_LOGICAL8).
-- Remove the use of the STL from the C++ bindings.
-- Added support for Platform/LSF job launchers. Must be Platform LSF
- v7.0.2 or later.
-- Updated ROMIO with the version from MPICH2 1.0.7.
-- Added RDMA capable one-sided component (called rdma), which
- can be used with BTL components that expose a full one-sided
- interface.
-- Added the optional datatype MPI_REAL2. As this is added to the "end of"
- predefined datatypes in the fortran header files, there will not be
- any compatibility issues.
-- Added Portable Linux Processor Affinity (PLPA) for Linux.
-- Added finer-grained symbol export control via the visibility feature
- offered by some compilers.
-- Added checkpoint/restart process fault tolerance support. Initially
- support a LAM/MPI-like protocol.
-- Removed "mvapi" BTL; all InfiniBand support now uses the OpenFabrics
- driver stacks ("openib" BTL).
-- Added more stringent MPI API parameter checking to help user-level
- debugging.
-- The ptmalloc2 memory manager component is now by default built as
- a standalone library named libopenmpi-malloc. Users wanting to
- use leave_pinned with ptmalloc2 will now need to link the library
- into their application explicitly. All other users will use the
- libc-provided allocator instead of Open MPI's ptmalloc2. This change
- may be overridden with the configure option --enable-ptmalloc2-internal.
-- The leave_pinned options will now default to using mallopt on
- Linux in the cases where ptmalloc2 was not linked in. mallopt
- will also only be available if munmap can be intercepted (the
- default whenever Open MPI is not compiled with
- --without-memory-manager).
-- Open MPI will now complain and refuse to use leave_pinned if
- no memory intercept / mallopt option is available.
-- Add option of using Perl-based wrapper compilers instead of the
- C-based wrapper compilers. The Perl-based version does not
- have the features of the C-based version, but does work better
- in cross-compile environments.
-
-
-1.2.9
------
-
-- Fix a segfault when using one-sided communications on some forms of derived
- datatypes. Thanks to Dorian Krause for reporting the bug. See #1715.
-- Fix an alignment problem affecting one-sided communications on
- some architectures (e.g., SPARC64). See #1738.
-- Fix compilation on Solaris when thread support is enabled in Open MPI
- (e.g., when using --with-threads). See #1736.
-- Correctly take into account the MTU that an OpenFabrics device port
- is using. See #1722 and
- https://bugs.openfabrics.org/show_bug.cgi?id=1369.
-- Fix two datatype engine bugs. See #1677.
- Thanks to Peter Kjellstrom for the bug report.
-- Fix the bml r2 help filename so the help message can be found. See #1623.
-- Fix a compilation problem on RHEL4U3 with the PGI 32 bit compiler
- caused by <infiniband/driver.h>. See ticket #1613.
-- Fix the --enable-cxx-exceptions configure option. See ticket #1607.
-- Properly handle when the MX BTL cannot open an endpoint. See ticket #1621.
-- Fix a double free of events on the tcp_events list. See ticket #1631.
-- Fix a buffer overrun in opal_free_list_grow (called by MPI_Init).
- Thanks to Patrick Farrell for the bug report and Stephan Kramer for
- the bugfix. See ticket #1583.
-- Fix a problem setting OPAL_PREFIX for remote sh-based shells.
- See ticket #1580.
-
-
-1.2.8
------
-
-- Tweaked one memory barrier in the openib component to be more conservative.
- May fix a problem observed on PPC machines. See ticket #1532.
-- Fix OpenFabrics IB partition support. See ticket #1557.
-- Restore v1.1 feature that sourced .profile on remote nodes if the default
- shell will not do so (e.g. /bin/sh and /bin/ksh). See ticket #1560.
-- Fix segfault in MPI_Init_thread() if ompi_mpi_init() fails. See ticket #1562.
-- Adjust SLURM support to first look for $SLURM_JOB_CPUS_PER_NODE instead of
- the deprecated $SLURM_TASKS_PER_NODE environment variable. This change
- may be *required* when using SLURM v1.2 and above. See ticket #1536.
-- Fix the MPIR_Proctable to be in process rank order. See ticket #1529.
-- Fix a regression introduced in 1.2.6 for the IBM eHCA. See ticket #1526.
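The lookup order described in the SLURM entry above can be sketched in shell. This is only an illustration of the precedence between the two environment variables, not Open MPI's actual implementation; the function name slurm_cpus_per_node is hypothetical.

```shell
# Hypothetical helper: prefer $SLURM_JOB_CPUS_PER_NODE and fall back to the
# deprecated $SLURM_TASKS_PER_NODE only when the former is unset or empty.
slurm_cpus_per_node() {
    if [ -n "${SLURM_JOB_CPUS_PER_NODE:-}" ]; then
        printf '%s\n' "$SLURM_JOB_CPUS_PER_NODE"
    elif [ -n "${SLURM_TASKS_PER_NODE:-}" ]; then
        printf '%s\n' "$SLURM_TASKS_PER_NODE"
    else
        # Neither variable is set: most likely not running under SLURM.
        return 1
    fi
}
```

Calling slurm_cpus_per_node inside a SLURM allocation would print whichever value takes precedence; outside of SLURM it returns a non-zero status.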
-
-
-1.2.7
------
-
-- Add some Sun HCA vendor IDs. See ticket #1461.
-- Fixed a memory leak in MPI_Alltoallw when called from Fortran.
- Thanks to Dave Grote for the bug report. See ticket #1457.
-- Only link in libutil when it is needed/desired. Thanks to
- Brian Barrett for diagnosing and fixing the problem. See ticket #1455.
-- Update some QLogic HCA vendor IDs. See ticket #1453.
-- Fix F90 binding for MPI_CART_GET. Thanks to Scott Beardsley for
- bringing it to our attention. See ticket #1429.
-- Remove a spurious warning message generated in/by ROMIO. See ticket #1421.
-- Fix a bug where command-line MCA parameters were not overriding
- MCA parameters set from environment variables. See ticket #1380.
-- Fix a bug in the AMD64 atomics assembly. Thanks to Gabriele Fatigati
- for the bug report and bugfix. See ticket #1351.
-- Fix a gather and scatter bug on intercommunicators when the datatype
- being moved is 0 bytes. See ticket #1331.
-- Some more man page fixes from the Debian maintainers.
- See tickets #1324 and #1329.
-- Have openib BTL (OpenFabrics support) check for the presence of
- /sys/class/infiniband before allowing itself to be used. This check
- prevents spurious "OMPI did not find RDMA hardware!" notices on
- systems that have the software drivers installed, but no
- corresponding hardware. See tickets #1321 and #1305.
-- Added vendor IDs for some ConnectX openib HCAs. See ticket #1311.
-- Fix some RPM specfile inconsistencies. See ticket #1308.
- Thanks to Jim Kusznir for noticing the problem.
-- Removed an unused function prototype that caused warnings on
- some systems (e.g., OS X). See ticket #1274.
-- Fix a deadlock in inter-communicator scatter/gather operations.
- Thanks to Martin Audet for the bug report. See ticket #1268.
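The /sys/class/infiniband presence check mentioned in the openib entry above amounts to a directory test. A minimal shell sketch of the idea follows; the helper name and its optional argument are hypothetical, and the real check lives inside the openib BTL rather than in a script.

```shell
# Hypothetical helper: succeed only if the sysfs directory exists.
# The optional argument is there only so the check can be exercised
# against a different directory.
have_infiniband_sysfs() {
    [ -d "${1:-/sys/class/infiniband}" ]
}

if have_infiniband_sysfs; then
    echo "/sys/class/infiniband is present: openib may be usable"
else
    echo "/sys/class/infiniband not found: openib would be skipped"
fi
```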
-
-===========================================================================
-
-Much, much more information is also available in the Open MPI FAQ:
-
- http://www.open-mpi.org/faq/
-
-===========================================================================
-
-General Release Notes
----------------------
-
-Detailed Open MPI v1.3 Feature List:
-
- o Open MPI RunTime Environment (ORTE) improvements
- - General robustness improvements
- - Scalable job launch (we've seen ~16K processes in less than a
- minute in a highly-optimized configuration)
- - New process mappers
- - Support for Platform/LSF environments (v7.0.2 and later)
- - More flexible processing of host lists
- - new mpirun cmd line options and associated functionality
-
- o Fault-Tolerance Features
- - Asynchronous, transparent checkpoint/restart support
- - Fully coordinated checkpoint/restart coordination component
- - Support for the following checkpoint/restart services:
- - blcr: Berkeley Lab's Checkpoint/Restart
- - self: Application level callbacks
- - Support for the following interconnects:
- - tcp
- - mx
- - openib
- - sm
- - self
- - Improved Message Logging
-
- o MPI_THREAD_MULTIPLE support for point-to-point messaging in the
- following BTLs (note that only MPI point-to-point messaging API
- functions support MPI_THREAD_MULTIPLE; other API functions likely
- do not):
- - tcp
- - sm
- - mx
- - elan
- - self
-
- o Point-to-point Messaging Layer (PML) improvements
- - Memory footprint reduction
- - Improved latency
- - Improved algorithm for multiple communication device
- ("multi-rail") support
-
- o Numerous Open Fabrics improvements/enhancements
- - Added iWARP support (including RDMA CM)
- - Memory footprint and performance improvements
- - "Bucket" SRQ support for better registered memory utilization
- - XRC/ConnectX support
- - Message coalescing
- - Improved error report mechanism with Asynchronous events
- - Automatic Path Migration (APM)
- - Improved processor/port binding
- - Infrastructure for additional wireup strategies
- - mpi_leave_pinned is now enabled by default
-
- o uDAPL BTL enhancements
- - Multi-rail support
- - Subnet checking
- - Interface include/exclude capabilities
-
- o Processor affinity
- - Linux processor affinity improvements
- - Core/socket <--> process mappings
-
- o Collectives
- - Performance improvements
- - Support for hierarchical collectives (must be activated
- manually; see below)
-
- o Miscellaneous
- - MPI 2.1 compliant
- - Sparse process groups and communicators
- - Support for Cray Compute Node Linux (CNL)
- - One-sided RDMA component (BTL-level based rather than PML-level
- based)
- - Aggregate MCA parameter sets
- - MPI handle debugging
- - Many small improvements to the MPI C++ bindings
- - Valgrind support
- - VampirTrace support
- - Updated ROMIO to the version from MPICH2 1.0.7
- - Removed the mVAPI IB stacks
- - Display most error messages only once (vs. once for each
- process)
- - Many other small improvements and bug fixes, too numerous to
- list here
-
-Known issues
-------------
-
- o There is a segfault that sometimes occurs on one of our x86_64 test
- clusters when using MPI one-sided communications over Myrinet MX.
- Since no one else has reported this problem we are not holding
- up the 1.3 release. See ticket #1757 for the details, and any
- possible workarounds.
-
- o XGrid support is currently broken.
- https://svn.open-mpi.org/trac/ompi/ticket/1777
-
- o MPI_REDUCE_SCATTER does not work with counts of 0.
- https://svn.open-mpi.org/trac/ompi/ticket/1559
-
- o Please also see the Open MPI bug tracker for bugs beyond this release.
- https://svn.open-mpi.org/trac/ompi/report
-
-===========================================================================
-
-The following abbreviated list of release notes applies to this code
-base as of this writing (10 July 2009):
-
-General notes
--------------
-
-- Open MPI includes support for a wide variety of supplemental
- hardware and software packages. When configuring Open MPI, you may
- need to supply additional flags to the "configure" script in order
- to tell Open MPI where the header files, libraries, and any other
- required files are located. As such, running "configure" by itself
- may not include support for all the devices (etc.) that you expect,
- especially if their support headers / libraries are installed in
- non-standard locations. Network interconnects are an easy example
- to discuss -- Myrinet and OpenFabrics networks, for example, both
- have supplemental headers and libraries that must be found before
- Open MPI can build support for them. You must specify where these
- files are with the appropriate options to configure. See the
- listing of configure command-line switches, below, for more details.
-
-- The majority of Open MPI's documentation is here in this file, the
- included man pages, and on the web site FAQ
- (http://www.open-mpi.org/). This will eventually be supplemented
- with cohesive installation and user documentation files.
-
-- Note that Open MPI documentation uses the word "component"
- frequently; the word "plugin" is probably more familiar to most
- users. As such, end users can probably completely substitute the
- word "plugin" wherever you see "component" in our documentation.
- For what it's worth, we use the word "component" for historical
- reasons, mainly because it is part of our acronyms and internal API
- function calls.
-
-- The run-time systems that are currently supported are:
- - rsh / ssh
- - LoadLeveler
- - PBS Pro, Open PBS, Torque
- - Platform LSF (v7.0.2 and later)
- - SLURM
- - XGrid (known to be broken in 1.3 through 1.3.2)
- - Cray XT-3 and XT-4
- - Sun Grid Engine (SGE) 6.1, 6.2 and open source Grid Engine
- - Microsoft Windows CCP (Microsoft Windows server 2003 and 2008)
-
-- Systems that have been tested are:
- - Linux (various flavors/distros), 32 bit, with gcc, and Sun Studio 12
- - Linux (various flavors/distros), 64 bit (x86), with gcc, Absoft,
- Intel, Portland, Pathscale, and Sun Studio 12 compilers (*)
- - OS X (10.4), 32 and 64 bit (i386, PPC, PPC64, x86_64), with gcc
- and Absoft compilers (*)
- - Solaris 10 update 2, 3 and 4, 32 and 64 bit (SPARC, i386, x86_64),
- with Sun Studio 10, 11 and 12
-
- (*) Be sure to read the Compiler Notes, below.
-
-- Other systems have been lightly (but not fully) tested:
- - Other 64 bit platforms (e.g., Linux on PPC64)
- - Microsoft Windows CCP (Microsoft Windows server 2003 and 2008);
- more testing and support is expected later in the Open MPI v1.3.x
- series. See the README.WINDOWS file.
-
-Compiler Notes
---------------
-
-- Mixing compilers from different vendors when building Open MPI
- (e.g., using the C/C++ compiler from one vendor and the F77/F90
- compiler from a different vendor) has been successfully employed by
- some Open MPI users (discussed on the Open MPI user's mailing list),
- but such configurations are not tested and not documented. For
- example, such configurations may require additional compiler /
- linker flags to make Open MPI build properly.
-
-- Open MPI does not support the Sparc v8 CPU target, which is the
- default on Sun Solaris. The v8plus (32 bit) or v9 (64 bit)
- targets must be used to build Open MPI on Solaris. This can be
- done by including a flag in CFLAGS, CXXFLAGS, FFLAGS, and FCFLAGS,
- -xarch=v8plus for the Sun compilers, -mv8plus for GCC.
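As an illustration of the note above, the following shell sketch assembles a configure command line that passes the 32-bit v8plus target in all four flag variables. It assumes the Sun compilers; with GCC, -mv8plus would be substituted. The variable names are chosen here for illustration, and the command is printed rather than executed.

```shell
# Assumption: Sun Studio compilers. Use ARCH_FLAG="-mv8plus" for GCC instead.
ARCH_FLAG="-xarch=v8plus"

# Pass the same target flag to the C, C++, F77, and F90 compilers.
CONFIGURE_CMD="./configure CFLAGS=$ARCH_FLAG CXXFLAGS=$ARCH_FLAG FFLAGS=$ARCH_FLAG FCFLAGS=$ARCH_FLAG"

# Print the command for review instead of running it.
echo "$CONFIGURE_CMD"
```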
-
-- At least some versions of the Intel 8.1 compiler seg fault while
- compiling certain Open MPI source code files. As such, it is not
- supported.
-
-- The Intel 9.0 v20051201 compiler on IA64 platforms seems to have a
- problem with optimizing the ptmalloc2 memory manager component (the
- generated code will segv). As such, the ptmalloc2 component will
- automatically disable itself if it detects that it is on this
- platform/compiler combination. The only effect that this should
- have is that the MCA parameter mpi_leave_pinned will be inoperative.
-
-- Early versions of the Portland Group 6.0 compiler have problems
- creating the C++ MPI bindings as a shared library (e.g., v6.0-1).
- Tests with later versions show that this has been fixed (e.g.,
- v6.0-5).
-
-- The Portland Group compilers prior to version 7.0 require the
- "-Msignextend" compiler flag to extend the sign bit when converting
- from a shorter to longer integer. This is different than other
- compilers (such as GNU). When compiling Open MPI with the Portland
- compiler suite, the following flags should be passed to Open MPI's
- configure script:
-
- shell$ ./configure CFLAGS=-Msignextend CXXFLAGS=-Msignextend \
- --with-wrapper-cflags=-Msignextend \
- --with-wrapper-cxxflags=-Msignextend ...
-
- This will both compile Open MPI with the proper compile flags and
- also automatically add "-Msignextend" when the C and C++ MPI wrapper
- compilers are used to compile user MPI applications.
-
-- Using the MPI C++ bindings with the Pathscale compiler is known
- to fail, possibly due to Pathscale compiler issues.
-
-- Using the Absoft compiler to build the MPI Fortran bindings on Suse
- 9.3 is known to fail due to a Libtool compatibility issue.
-
-- Open MPI will build bindings suitable for all common forms of
- Fortran 77 compiler symbol mangling on platforms that support it
- (e.g., Linux). On platforms that do not support weak symbols (e.g.,
- OS X), Open MPI will build Fortran 77 bindings just for the compiler
- that Open MPI was configured with.
-
- Hence, on platforms that support it, if you configure Open MPI with
- a Fortran 77 compiler that uses one symbol mangling scheme, you can
- successfully compile and link MPI Fortran 77 applications with a
- Fortran 77 compiler that uses a different symbol mangling scheme.
-
- NOTE: For platforms that support the multi-Fortran-compiler bindings
- (i.e., weak symbols are supported), due to limitations in the MPI
- standard and in Fortran compilers, it is not possible to hide these
- differences in all cases. Specifically, the following two cases may
- not be portable between different Fortran compilers:
-
- 1. The C constants MPI_F_STATUS_IGNORE and MPI_F_STATUSES_IGNORE
- will only compare properly to Fortran applications that were
- created with Fortran compilers that use the same
- name-mangling scheme as the Fortran compiler that Open MPI was
- configured with.
-
- 2. Fortran compilers may have different values for the logical
- .TRUE. constant. As such, any MPI function that uses the Fortran
- LOGICAL type may only get .TRUE. values back that correspond to
- the .TRUE. value of the Fortran compiler that Open MPI was
- configured with. Note that some Fortran compilers allow forcing
- .TRUE. to be 1 and .FALSE. to be 0. For example, the Portland
- Group compilers provide the "-Munixlogical" option, and Intel
- compilers (version >= 8.) provide the "-fpscomp logicals" option.
-
- You can use the ompi_info command to see the Fortran compiler that
- Open MPI was configured with.
-
-- The Fortran 90 MPI bindings can now be built in one of three sizes
- using --with-mpi-f90-size=SIZE (see description below). These sizes
- reflect the number of MPI functions included in the "mpi" Fortran 90
- module and therefore which functions will be subject to strict type
- checking. All functions not included in the Fortran 90 module can
- still be invoked from F90 applications, but will fall back to
- Fortran-77 style checking (i.e., little/none).
-
- - trivial: Only includes F90-specific functions from MPI-2. This
- means overloaded versions of MPI_SIZEOF for all the MPI-supported
- F90 intrinsic types.
-
- - small (default): All the functions in "trivial" plus all MPI
- functions that take no choice buffers (meaning buffers that are
- specified by the user and are of type (void*) in the C bindings --
- generally buffers specified for message passing). Hence,
- functions like MPI_COMM_RANK are included, but functions like
- MPI_SEND are not.
-
- - medium: All the functions in "small" plus all MPI functions that
- take one choice buffer (e.g., MPI_SEND, MPI_RECV, ...). All
- one-choice-buffer functions have overloaded variants for each of
- the MPI-supported Fortran intrinsic types up to the number of
- dimensions specified by --with-f90-max-array-dim (default value is
- 4).
-
- Increasing the size of the F90 module (in order from trivial, small,
- and medium) will generally increase the length of time required to
- compile user MPI applications. Specifically, "trivial"- and
- "small"-sized F90 modules generally allow user MPI applications to
- be compiled fairly quickly but lose type safety for all MPI
- functions with choice buffers. "medium"-sized F90 modules generally
- take longer to compile user applications but provide greater type
- safety for MPI functions.
-
- Note that MPI functions with two choice buffers (e.g., MPI_GATHER)
- are not currently included in Open MPI's F90 interface. Calls to
- these functions will automatically fall through to Open MPI's F77
- interface. A "large" size that includes the two choice buffer MPI
- functions is possible in future versions of Open MPI.
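The two configure options named above can be combined as in the following shell sketch. The helper f90_configure_args is hypothetical; it only validates the size against the three values listed above and prints the resulting arguments, using the documented default of 4 for the maximum array dimension.

```shell
# Hypothetical helper: print the configure arguments for a given F90 module
# size ("trivial", "small", or "medium") and an optional maximum array
# dimension (defaults to 4, the documented default).
f90_configure_args() {
    size="$1"
    max_dim="${2:-4}"
    case "$size" in
        trivial|small|medium) ;;
        *) echo "unknown F90 module size: $size" >&2; return 1 ;;
    esac
    printf '%s\n' "--with-mpi-f90-size=$size --with-f90-max-array-dim=$max_dim"
}

# Print the configure command for a "medium" module instead of running it.
echo "./configure $(f90_configure_args medium 4)"
```

A "medium" build trades longer compile times for user applications against type checking of the one-choice-buffer functions, as described above.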
-
-
-General Run-Time Support Notes
-------------------------------
-
-- The Open MPI installation must be in your PATH on all nodes (and
- potentially LD_LIBRARY_PATH, if libmpi is a shared library), unless
- using the --prefix or --enable-mpirun-prefix-by-default
- functionality (see below).
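A minimal shell sketch of the environment setup described above, assuming Open MPI was installed under the hypothetical prefix /opt/openmpi and that libmpi is a shared library. The same lines would need to take effect on every node, for example from a shell startup file.

```shell
# Assumption: Open MPI was configured with --prefix=/opt/openmpi
# (a hypothetical path; substitute the real installation prefix).
OMPI_PREFIX="/opt/openmpi"

# Put the Open MPI executables (mpirun, mpicc, ...) first in PATH.
PATH="$OMPI_PREFIX/bin:$PATH"

# Prepend the library directory; only add a separating ':' when
# LD_LIBRARY_PATH already has a value.
LD_LIBRARY_PATH="$OMPI_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

export PATH LD_LIBRARY_PATH
```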
-
-- LAM/MPI-like mpirun notation of "C" and "N" is not yet supported.
-
-- The XGrid support is experimental - see the Open MPI FAQ and this
- post on the Open MPI user's mailing list for more information:
-
- http://www.open-mpi.org/community/lists/users/2006/01/0539.php
-
-- Open MPI's run-time behavior can be customized via MCA ("MPI
- Component Architecture") parameters (see below for more information
- on how to get/set MCA parameter values). Some MCA parameters can be
- set in a way that renders Open MPI inoperable (see notes about MCA
- parameters later in this file). In particular, some parameters have
- required options that must be included.
-
- - If specified, the "btl" parameter must include the "self"
- component, or Open MPI will not be able to deliver messages to the
- same rank as the sender. For example: "mpirun --mca btl tcp,self
- ..."
- - If specified, the "btl_tcp_if_exclude" parameter must include the
- loopback device ("lo" on many Linux platforms), or Open MPI will
- not be able to route MPI messages using the TCP BTL. For example:
- "mpirun --mca btl_tcp_if_exclude lo,eth1 ..."
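The two requirements above can be checked before launching a job. The following shell sketch uses a hypothetical helper, list_contains, to test a comma-delimited MCA value for a required item; the variable names are illustrative, and "lo" is assumed to be the loopback device name, which is true on many Linux platforms but not everywhere.

```shell
# Hypothetical helper: succeed if the comma-delimited list in $1
# contains the exact item $2.
list_contains() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *) return 1 ;;
    esac
}

BTL_LIST="tcp,self"
TCP_IF_EXCLUDE="lo,eth1"

# Warn if either required item is missing.
list_contains "$BTL_LIST" self || echo "warning: btl list lacks 'self'" >&2
list_contains "$TCP_IF_EXCLUDE" lo || echo "warning: btl_tcp_if_exclude lacks 'lo'" >&2

# Print the resulting command line instead of running it.
echo "mpirun --mca btl $BTL_LIST --mca btl_tcp_if_exclude $TCP_IF_EXCLUDE ..."
```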
-
-- Running on nodes with different endian and/or different datatype
- sizes within a single parallel job is supported in this release.
- However, Open MPI does not resize data when datatypes differ in size
- (for example, sending a 4 byte MPI_DOUBLE and receiving an 8 byte
- MPI_DOUBLE will fail).
-
-
-MPI Functionality and Features
-------------------------------
-
-- All MPI-2.1 functionality is supported.
-
-- MPI_THREAD_MULTIPLE support is included, but is only lightly tested.
- It likely does not work for thread-intensive applications. Note
- that *only* the MPI point-to-point communication functions for the
- BTL's listed above are considered thread safe. Other support
- functions (e.g., MPI attributes) have not been certified as safe
- when simultaneously used by multiple threads.
-
- Note that Open MPI's thread support is in a fairly early stage; the
- above devices are likely to *work*, but the latency is likely to be
- fairly high. Specifically, efforts so far have concentrated on
- *correctness*, not *performance* (yet).
-
-- MPI_REAL16 and MPI_COMPLEX32 are only supported on platforms where a
- portable C datatype can be found that matches the Fortran type
- REAL*16, both in size and bit representation.
-
-- Asynchronous message passing progress using threads can be turned on
- with the --enable-progress-threads option to configure.
- Asynchronous message passing progress is only supported with devices
- that support MPI_THREAD_MULTIPLE, but is only very lightly tested
- (and may not provide very much performance benefit).
-
-
-Collectives
------------
-
-- The "hierarch" coll component (i.e., an implementation of MPI
- collective operations) attempts to discover network layers of
- latency in order to segregate individual "local" and "global"
- operations as part of the overall collective operation. In this
- way, network traffic can be reduced -- or possibly even minimized
- (similar to MagPIe). The current "hierarch" component only
- separates MPI processes into on- and off-node groups.
-
- Hierarch has had sufficient correctness testing, but has not
- received much performance tuning. As such, hierarch is not
- activated by default -- it must be enabled manually by setting its
- priority level to 100:
-
- mpirun --mca coll_hierarch_priority 100 ...
-
- We would appreciate feedback from the user community about how well
- hierarch works for your applications.
-
-
-Network Support
----------------
-
-- The OpenFabrics Enterprise Distribution (OFED) software package v1.0
- will not work properly with Open MPI v1.2 (and later) due to how its
- Mellanox InfiniBand plugin driver is created. The problem is fixed
- in OFED v1.1 (and later).
-
-- Older mVAPI-based InfiniBand drivers (Mellanox VAPI) are no longer
- supported. Please use an older version of Open MPI (1.2 series or
- earlier) if you need mVAPI support.
-
-- The use of fork() with the openib BTL is only partially supported,
- and only on Linux kernels >= v2.6.15 with libibverbs v1.1 or later
- (first released as part of OFED v1.2), per restrictions imposed by
- the OFED network stack.
-
-- There are two MPI network models available: "ob1" and "cm". "ob1"
- uses BTL ("Byte Transfer Layer") components for each supported
- network. "cm" uses MTL ("Matching Transport Layer") components for
- each supported network.
-
- - "ob1" supports a variety of networks that can be used in
- combination with each other (per OS constraints; e.g., there are
- reports that the GM and OpenFabrics kernel drivers do not operate
- well together):
- - OpenFabrics: InfiniBand and iWARP
- - Loopback (send-to-self)
- - Myrinet: GM and MX
- - Portals
- - Quadrics Elan
- - Shared memory
- - TCP
- - SCTP
- - uDAPL
-
- - "cm" supports a smaller number of networks (and they cannot be
- used together), but may provide better overall MPI
- performance:
- - Myrinet MX (not GM)
- - InfiniPath PSM
- - Portals
-
- Open MPI will, by default, choose to use "cm" when the InfiniPath
- PSM MTL can be used. Otherwise, OB1 will be used and the
- corresponding BTLs will be selected. Users can force the use of ob1
- or cm if desired by setting the "pml" MCA parameter at run-time:
-
- shell$ mpirun --mca pml ob1 ...
- or
- shell$ mpirun --mca pml cm ...
-
-- Myrinet MX support is shared between the 2 internal devices, the MTL
- and the BTL. The design of the BTL interface in Open MPI assumes
- that only naive one-sided communication capabilities are provided by
- the low level communication layers. However, modern communication
- layers such as Myrinet MX, InfiniPath PSM, or Portals, natively
- implement highly-optimized two-sided communication semantics. To
- leverage these capabilities, Open MPI provides the "cm" PML and
- corresponding MTL components to transfer messages rather than bytes.
- The MTL interface implements a shorter code path and lets the
- low-level network library decide which protocol to use (depending on
- issues such as message length, internal resources and other
- parameters specific to the underlying interconnect). However, Open
- MPI cannot currently use multiple MTL modules at once. In the case
- of the MX MTL, process loopback and on-node shared memory
- communications are provided by the MX library. Moreover, the
- current MX MTL does not support message pipelining, resulting in
- lower performance for non-contiguous datatypes.
-
- The "ob1" PML and BTL components use Open MPI's internal on-node
- shared memory and process loopback devices for high performance.
- The BTL interface allows multiple devices to be used simultaneously.
- For the MX BTL it is recommended that the first segment (which acts
- as a threshold between the eager and the rendezvous protocol) should
- always be at most 4KB, but there is no further restriction on the
- size of subsequent fragments.
-
- The MX MTL is recommended in the common case for best performance on
- 10G hardware when most of the data transfers cover contiguous memory
- layouts. The MX BTL is recommended in all other cases, such as when
- using multiple interconnects at the same time (including TCP), or
- transferring non-contiguous data-types.
-
-
-Shared library versioning support
----------------------------------
-
-Open MPI started using GNU-Libtool recommended shared library
-versioning with the v1.3.3 release (where all versions were set to
-0:0:0) for the main MPI libraries: libmpi, libmpi_cxx, libmpi_f77, and
-libmpi_f90.
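-
-As a rough illustration of what a Libtool "current:revision:age"
-triple means, the Python sketch below computes the numeric file name
-suffix that Libtool derives from it on Linux-style ELF platforms.
-The helper name linux_so_suffix is hypothetical and is not part of
-Open MPI, and the naming rule shown applies to Linux only; other
-platforms map the triple differently.

```python
def linux_so_suffix(version_info: str) -> str:
    """Map a Libtool 'current:revision:age' triple to the numeric
    suffix used on Linux-style platforms: (current - age).age.revision."""
    current, revision, age = (int(part) for part in version_info.split(":"))
    if age > current:
        raise ValueError("age must not exceed current")
    return f"{current - age}.{age}.{revision}"


# The 0:0:0 triple mentioned above corresponds to the suffix 0.0.0.
print("libmpi.so." + linux_so_suffix("0:0:0"))
```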
-
-Open MPI's other internal libraries are not [yet] versioned for deep
-voodoo technical reasons. Please see
-https://svn.open-mpi.org/trac/ompi/ticket/2092 for more details.
-
-===========================================================================
-
-Building Open MPI
------------------
-
-Open MPI uses a traditional configure script paired with "make" to
-build. Typical installs can be of the pattern:
-
----------------------------------------------------------------------------
-shell$ ./configure [...options...]
-shell$ make all install
----------------------------------------------------------------------------
-
-There are many available configure options (see "./configure --help"
-for a full list); a summary of the more commonly used ones follows:
-
---prefix=<directory>
- Install Open MPI into the base directory named <directory>. Hence,
- Open MPI will place its executables in <directory>/bin, its header
- files in <directory>/include, its libraries in <directory>/lib, etc.
-
---with-elan=<directory>
- Specify the directory where the Quadrics Elan library and header
- files are located. This option is generally only necessary if the
- Elan headers and libraries are not in default compiler/linker
- search paths.
-
- Elan is the support library for Quadrics-based networks.
-
---with-elan-libdir=<directory>
- Look in directory for the Quadrics Elan libraries. By default, Open
- MPI will look in <elan directory>/lib and <elan directory>/lib64,
- which covers most cases. This option is only needed for special
- configurations.
-
---with-gm=<directory>
- Specify the directory where the GM libraries and header files are
- located. This option is generally only necessary if the GM headers
- and libraries are not in default compiler/linker search paths.
-
- GM is the support library for older Myrinet-based networks (GM has
- been obsoleted by MX).
-
---with-gm-libdir=<directory>
- Look in directory for the GM libraries. By default, Open MPI will
- look in <gm directory>/lib and <gm directory>/lib64, which covers
- most cases. This option is only needed for special configurations.
-
---with-mx=<directory>
- Specify the directory where the MX libraries and header files are
- located. This option is generally only necessary if the MX headers
- and libraries are not in default compiler/linker search paths.
-
- MX is the support library for Myrinet-based networks.
-
---with-mx-libdir=<directory>
- Look in directory for the MX libraries. By default, Open MPI will
- look in <mx directory>/lib and <mx directory>/lib64, which covers
- most cases. This option is only needed for special configurations.
-
---with-openib=<directory>
- Specify the directory where the OpenFabrics (previously known as
- OpenIB) libraries and header files are located. This option is
- generally only necessary if the OpenFabrics headers and libraries
- are not in default compiler/linker search paths.
-
- "OpenFabrics" refers to iWARP- and InfiniBand-based networks.
-
---with-openib-libdir=<directory>
- Look in directory for the OpenFabrics libraries. By default, Open
- MPI will look in <openib directory>/lib and <openib
- directory>/lib64, which covers most cases. This option is only
- needed for special configurations.
-
---with-portals=<directory>
- Specify the directory where the Portals libraries and header files
- are located. This option is generally only necessary if the Portals
- headers and libraries are not in default compiler/linker search
- paths.
-
- Portals is the support library for Cray interconnects, but is also
- available on other platforms (e.g., there is a Portals library
- implemented over regular TCP).
-
---with-portals-config=<type>
- Configuration to use for Portals support. The following <type>
- values are possible: "utcp", "xt3", "xt3-modex" (default: utcp).
-
---with-portals-libs=<libs>
- Additional libraries to link with for Portals support.
-
---with-psm=<directory>
- Specify the directory where the QLogic InfiniPath PSM library and
- header files are located. This option is generally only necessary
- if the InfiniPath headers and libraries are not in default
- compiler/linker search paths.
-
- PSM is the support library for QLogic InfiniPath network adapters.
-
---with-psm-libdir=<directory>
- Look in directory for the PSM libraries. By default, Open MPI will
- look in <psm directory>/lib and <psm directory>/lib64, which covers
- most cases. This option is only needed for special configurations.
-
---with-sctp=<directory>
- Specify the directory where the SCTP libraries and header files are
- located. This option is generally only necessary if the SCTP headers
- and libraries are not in default compiler/linker search paths.
-
- SCTP is a special network stack over ethernet networks.
-
---with-sctp-libdir=<directory>
- Look in directory for the SCTP libraries. By default, Open MPI will
- look in <sctp directory>/lib and <sctp directory>/lib64, which covers
- most cases. This option is only needed for special configurations.
-
---with-udapl=<directory>
- Specify the directory where the UDAPL libraries and header files are
- located. Note that UDAPL support is disabled by default on Linux;
- the --with-udapl flag must be specified in order to enable it.
- Specifying the directory argument is generally only necessary if the
- UDAPL headers and libraries are not in default compiler/linker
- search paths.
-
- UDAPL is the support library for high performance networks in Sun
- HPC ClusterTools and on Linux OpenFabrics networks (although the
- "openib" options are preferred for Linux OpenFabrics networks, not
- UDAPL).
-
---with-udapl-libdir=<directory>
- Look in directory for the UDAPL libraries. By default, Open MPI
- will look in <udapl directory>/lib and <udapl directory>/lib64,
- which covers most cases. This option is only needed for special
- configurations.
-
---with-lsf=<directory>
- Specify the directory where the LSF libraries and header files are
- located. This option is generally only necessary if the LSF headers
- and libraries are not in default compiler/linker search paths.
-
- LSF is a resource manager system, frequently used as a batch
- scheduler in HPC systems.
-
- NOTE: If you are using LSF version 7.0.5, you will need to add
- "LIBS=-ldl" to the configure command line. For example:
-
- ./configure LIBS=-ldl --with-lsf ...
-
- This workaround should *only* be needed for LSF 7.0.5.
-
---with-lsf-libdir=<directory>
- Look in directory for the LSF libraries. By default, Open MPI will
- look in <lsf directory>/lib and <lsf directory>/lib64, which covers
- most cases. This option is only needed for special configurations.
-
---with-tm=<directory>
- Specify the directory where the TM libraries and header files are
- located. This option is generally only necessary if the TM headers
- and libraries are not in default compiler/linker search paths.
-
- TM is the support library for the Torque and PBS Pro resource
- manager systems, both of which are frequently used as a batch
- scheduler in HPC systems.
-
---with-sge
- Specify to build support for the Sun Grid Engine (SGE) resource
- manager. SGE support is disabled by default; this option must be
- specified to build OMPI's SGE support.
-
- The Sun Grid Engine (SGE) is a resource manager system, frequently
- used as a batch scheduler in HPC systems.
-
---with-mpi-param-check(=value)
- "value" can be one of: always, never, runtime. If
- --with-mpi-param-check is not specified, "runtime" is the default.
- If --with-mpi-param-check is specified with no value, "always" is
- used. Using --without-mpi-param-check is equivalent to "never".
-
- - always: the parameters of MPI functions are always checked for
- errors
- - never: the parameters of MPI functions are never checked for
- errors
- - runtime: whether the parameters of MPI functions are checked
- depends on the value of the MCA parameter mpi_param_check
- (default: yes).
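-
- The rules above can be restated as a small decision function. This
- is only a Python sketch of the documented behavior, not code from
- Open MPI's configure script; the function and argument names are
- hypothetical.

```python
from typing import Optional

VALID_MODES = ("always", "never", "runtime")


def param_check_mode(with_flag: bool = False,
                     value: Optional[str] = None,
                     without_flag: bool = False) -> str:
    """Return the effective MPI parameter-checking mode."""
    if without_flag:
        return "never"      # --without-mpi-param-check
    if not with_flag:
        return "runtime"    # option not given at all
    if value is None:
        return "always"     # --with-mpi-param-check with no value
    if value not in VALID_MODES:
        raise ValueError(f"unknown value: {value}")
    return value            # --with-mpi-param-check=<value>


print(param_check_mode())                       # option omitted
print(param_check_mode(with_flag=True))         # option given, no value
print(param_check_mode(with_flag=True, value="never"))
```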
-
---with-threads=value
- Since thread support (both support for MPI_THREAD_MULTIPLE and
- asynchronous progress) is only partially tested, it is disabled by
- default. To enable threading, use "--with-threads=posix". This is
- most useful when combined with --enable-mpi-threads and/or
- --enable-progress-threads.
-
---enable-mpi-threads
- Allows the MPI thread level MPI_THREAD_MULTIPLE. See
- --with-threads; this is currently disabled by default.
-
---enable-progress-threads
- Allows asynchronous progress in some transports. See
- --with-threads; this is currently disabled by default. See the
- above note about asynchronous progress.
-
---disable-mpi-cxx
- Disable building the C++ MPI bindings. Note that this does *not*
- disable the C++ checks during configure; some of Open MPI's tools
- are written in C++ and therefore require a C++ compiler to be built.
-
---disable-mpi-cxx-seek
- Disable the MPI::SEEK_* constants. Due to a problem with the MPI-2
- specification, these constants can conflict with system-level SEEK_*
- constants. Open MPI attempts to work around this problem, but the
- workaround may fail in some esoteric situations. The
- --disable-mpi-cxx-seek switch disables Open MPI's workarounds (and
- therefore the MPI::SEEK_* constants will be unavailable).
-
---disable-mpi-f77
- Disable building the Fortran 77 MPI bindings.
-
---disable-mpi-f90
- Disable building the Fortran 90 MPI bindings. Also related to the
- --with-f90-max-array-dim and --with-mpi-f90-size options.
-
---with-mpi-f90-size=<SIZE>
- Three sizes of the MPI F90 module can be built: trivial (only a
- handful of MPI-2 F90-specific functions are included in the F90
- module), small (trivial + all MPI functions that take no choice
- buffers), and medium (small + all MPI functions that take 1 choice
- buffer). This parameter is only used if the F90 bindings are
- enabled.
-
---with-f90-max-array-dim=<DIM>
- The F90 MPI bindings are strictly typed, even including the number of
- dimensions for arrays for MPI choice buffer parameters. Open MPI
- generates these bindings at compile time with a maximum number of
- dimensions as specified by this parameter. The default value is 4.
-
---enable-mpirun-prefix-by-default
- This option forces the "mpirun" command to always behave as if
- "--prefix $prefix" was present on the command line (where $prefix is
- the value given to the --prefix option to configure). This prevents
- most rsh/ssh-based users from needing to modify their shell startup
- files to set the PATH and/or LD_LIBRARY_PATH for Open MPI on remote
- nodes. Note, however, that such users may still desire to set PATH
- -- perhaps even in their shell startup files -- so that executables
- such as mpicc and mpirun can be found without needing to type long
- path names. --enable-orterun-prefix-by-default is a synonym for
- this option.
-
---disable-shared
- By default, libmpi is built as a shared library, and all components
- are built as dynamic shared objects (DSOs). This switch disables
- this default; it is really only useful when used with
- --enable-static. Specifically, this option does *not* imply
- --enable-static; enabling static libraries and disabling shared
- libraries are two independent options.
-
---enable-static
- Build libmpi as a static library, and statically link in all
- components. Note that this option does *not* imply
- --disable-shared; enabling static libraries and disabling shared
- libraries are two independent options.
-
---enable-sparse-groups
- Enable the usage of sparse groups. This would save memory
- significantly especially if you are creating large
- communicators. (Disabled by default)
-
---enable-peruse
- Enable the PERUSE MPI data analysis interface.
-
---enable-dlopen
- Build all of Open MPI's components as standalone Dynamic Shared
- Objects (DSO's) that are loaded at run-time. The opposite of this
- option, --disable-dlopen, causes two things:
-
- 1. All of Open MPI's components will be built as part of Open MPI's
- normal libraries (e.g., libmpi).
- 2. Open MPI will not attempt to open any DSO's at run-time.
-
- Note that this option does *not* imply that OMPI's libraries will be
- built as static objects (e.g., libmpi.a). It only specifies the
- location of OMPI's components: standalone DSOs or folded into the
- Open MPI libraries. You can control whether Open MPI's libraries
- are built as static or dynamic via --enable|disable-static and
- --enable|disable-shared.
-
---enable-heterogeneous
- Enable support for running on heterogeneous clusters (e.g., machines
- with different endian representations). Heterogeneous support is
- disabled by default because it imposes a minor performance penalty.
-
---enable-ptmalloc2-internal
- ***NOTE: This option no longer exists.
-
- This option was introduced in Open MPI v1.3 and was then removed in
- Open MPI v1.3.2. Open MPI fundamentally changed how it uses
- ptmalloc2 support in v1.3.2 such that the
- --enable-ptmalloc2-internal flag was no longer necessary. It can
- still harmlessly be supplied to Open MPI's configure script, but a
- warning will appear about how it is an unrecognized option.
-
- In v1.3 and v1.3.1, Open MPI built the ptmalloc2 library as a
- standalone library that users could choose to link in or not (by
- adding -lopenmpi-malloc to their link command). Using this option
- restored pre-v1.3 behavior of *always* forcing the user to use the
- ptmalloc2 memory manager (because it is part of libmpi).
-
- Starting with v1.3.2, ptmalloc2 is always built into Open MPI, but
- is only activated in certain scenarios.
-
---with-wrapper-cflags=<cflags>
---with-wrapper-cxxflags=<cxxflags>
---with-wrapper-fflags=<fflags>
---with-wrapper-fcflags=<fcflags>
---with-wrapper-ldflags=<ldflags>
---with-wrapper-libs=<libs>
- Add the specified flags to the default flags that are used in Open
- MPI's "wrapper" compilers (e.g., mpicc -- see below for more
- information about Open MPI's wrapper compilers). By default, Open
- MPI's wrapper compilers use the same compilers used to build Open
- MPI and specify an absolute minimum set of additional flags that are
- necessary to compile/link MPI applications. These configure options
- give system administrators the ability to embed additional flags in
- OMPI's wrapper compilers (which is a local policy decision). The
- meanings of the different flags are:
-
- <cflags>: Flags passed by the mpicc wrapper to the C compiler
- <cxxflags>: Flags passed by the mpic++ wrapper to the C++ compiler
- <fflags>: Flags passed by the mpif77 wrapper to the F77 compiler
- <fcflags>: Flags passed by the mpif90 wrapper to the F90 compiler
- <ldflags>: Flags passed by all the wrappers to the linker
- <libs>: Libraries passed by all the wrappers to the linker
-
- There are other ways to configure Open MPI's wrapper compiler
- behavior; see the Open MPI FAQ for more information.
-
-There are many other options available -- see "./configure --help".
-
-Changing the compilers that Open MPI uses to build itself uses the
-standard Autoconf mechanism of setting special environment variables
-either before invoking configure or on the configure command line.
-The following environment variables are recognized by configure:
-
-CC - C compiler to use
-CFLAGS - Compile flags to pass to the C compiler
-CPPFLAGS - Preprocessor flags to pass to the C compiler
-
-CXX - C++ compiler to use
-CXXFLAGS - Compile flags to pass to the C++ compiler
-CXXCPPFLAGS - Preprocessor flags to pass to the C++ compiler
-
-F77 - Fortran 77 compiler to use
-FFLAGS - Compile flags to pass to the Fortran 77 compiler
-
-FC - Fortran 90 compiler to use
-FCFLAGS - Compile flags to pass to the Fortran 90 compiler
-
-LDFLAGS - Linker flags to pass to all compilers
-LIBS - Libraries to pass to all compilers (it is rarely
- necessary for users to need to specify additional LIBS)
-
-For example:
-
-shell$ ./configure CC=mycc CXX=myc++ F77=myf77 FC=myf90 ...
-
-***Note: We generally suggest using the above command line form for
- setting different compilers (vs. setting environment variables and
- then invoking "./configure"). The above form will save all
- variables and values in the config.log file, which makes
- post-mortem analysis easier when problems occur.
-
-Note that you may also want to ensure that the value of
-LD_LIBRARY_PATH is set appropriately (or not at all) for your build
-(or whatever environment variable is relevant for your operating
-system). For example, some users have been tripped up by selecting
-non-default Fortran compilers via FC / F77, but then failing to
-set LD_LIBRARY_PATH to include the directory containing that
-non-default Fortran compiler's support libraries. This causes Open
-MPI's configure script to fail when it tries to compile / link / run
-simple Fortran programs.
-
-It is required that the compilers specified be compile and link
-compatible, meaning that object files created by one compiler must be
-able to be linked with object files from the other compilers and
-produce correctly functioning executables.
-
-Open MPI supports all the "make" targets that are provided by GNU
-Automake, such as:
-
-all - build the entire Open MPI package
-install - install Open MPI
-uninstall - remove all traces of Open MPI from the $prefix
-clean - clean out the build tree
-
-Once Open MPI has been built and installed, it is safe to run "make
-clean" and/or remove the entire build tree.
-
-VPATH and parallel builds are fully supported.
-
-Generally speaking, the only thing that users need to do to use Open
-MPI is ensure that <prefix>/bin is in their PATH and <prefix>/lib is
-in their LD_LIBRARY_PATH. Users may need to set the PATH
-and LD_LIBRARY_PATH in their shell setup files (e.g., .bashrc, .cshrc)
-so that non-interactive rsh/ssh-based logins will be able to find the
-Open MPI executables.
-
-===========================================================================
-
-Checking Your Open MPI Installation
------------------------------------
-
-The "ompi_info" command can be used to check the status of your Open
-MPI installation (located in <prefix>/bin/ompi_info). Running it with
-no arguments provides a summary of information about your Open MPI
-installation.
-
-Note that the ompi_info command is extremely helpful in determining
-which components are installed as well as listing all the run-time
-settable parameters that are available in each component (as well as
-their default values).
-
-The following options may be helpful:
-
---all Show a *lot* of information about your Open MPI
- installation.
---parsable Display all the information in an easily
- grep/cut/awk/sed-able format.
---param <framework> <component>
- A <framework> of "all" and a <component> of "all" will
- show all parameters to all components. Otherwise, the
- parameters of all the components in a specific framework,
- or just the parameters of a specific component can be
- displayed by using an appropriate <framework> and/or
- <component> name.
-
-Changing the values of these parameters is explained in the "The
-Modular Component Architecture (MCA)" section, below.
-
-===========================================================================
-
-Compiling Open MPI Applications
--------------------------------
-
-Open MPI provides "wrapper" compilers that should be used for
-compiling MPI applications:
-
-C: mpicc
-C++: mpiCC (or mpic++ if your filesystem is case-insensitive)
-Fortran 77: mpif77
-Fortran 90: mpif90
-
-For example:
-
-shell$ mpicc hello_world_mpi.c -o hello_world_mpi -g
-shell$
-
-All the wrapper compilers do is add a variety of compiler and linker
-flags to the command line and then invoke a back-end compiler. To be
-specific: the wrapper compilers do not parse source code at all; they
-are solely command-line manipulators, and have nothing to do with the
-actual compilation or linking of programs. The end result is an MPI
-executable that is properly linked to all the relevant libraries.
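-
-To make the "command-line manipulator" idea concrete, here is a
-minimal Python sketch of what such a wrapper does. It is not Open
-MPI's wrapper implementation: the function name, the back-end
-compiler name "cc", and the include/library paths are all
-hypothetical placeholders chosen for illustration.

```python
def wrapper_command(user_args,
                    backend="cc",
                    extra_cflags=("-I/opt/openmpi/include",),
                    extra_ldflags=("-L/opt/openmpi/lib",),
                    extra_libs=("-lmpi",)):
    """Build the back-end compiler command line.

    The user's arguments are passed through untouched; no source code
    is parsed. Only extra flags are added around them.
    """
    return [backend, *extra_cflags, *user_args, *extra_ldflags, *extra_libs]


# Show the command line that would be handed to the back-end compiler.
print(" ".join(wrapper_command(["hello_world_mpi.c", "-o", "hello_world_mpi", "-g"])))
```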
-
-Customizing the behavior of the wrapper compilers is possible (e.g.,
-changing the compiler [not recommended] or specifying additional
-compiler/linker flags); see the Open MPI FAQ for more information.
-
-===========================================================================
-
-Running Open MPI Applications
------------------------------
-
-Open MPI supports both mpirun and mpiexec (they are exactly
-equivalent). For example:
-
-shell$ mpirun -np 2 hello_world_mpi
-or
-shell$ mpiexec -np 1 hello_world_mpi : -np 1 hello_world_mpi
-
-are equivalent. Some of mpiexec's switches (such as -host and -arch)
-are not yet functional, although they will not error if you try to use
-them.
-
-The rsh launcher accepts a -hostfile parameter (the option
-"-machinefile" is equivalent); you can specify a -hostfile parameter
-indicating a standard mpirun-style hostfile (one hostname per line):
-
-shell$ mpirun -hostfile my_hostfile -np 2 hello_world_mpi
-
-If you intend to run more than one process on a node, the hostfile can
-use the "slots" attribute. If "slots" is not specified, a count of 1
-is assumed. For example, using the following hostfile:
-
----------------------------------------------------------------------------
-node1.example.com
-node2.example.com
-node3.example.com slots=2
-node4.example.com slots=4
----------------------------------------------------------------------------
-
-shell$ mpirun -hostfile my_hostfile -np 8 hello_world_mpi
-
-will launch MPI_COMM_WORLD rank 0 on node1, rank 1 on node2, ranks 2
-and 3 on node3, and ranks 4 through 7 on node4.
-
-Other starters, such as the resource manager / batch scheduling
-environments, do not require hostfiles (and will ignore the hostfile
-if it is supplied). They will also launch as many processes as slots
-have been allocated by the scheduler if no "-np" argument has been
-provided. For example, running a SLURM job with 8 processors:
-
-shell$ salloc -n 8 mpirun a.out
-
-The above command will reserve 8 processors and run 1 copy of mpirun,
-which will, in turn, launch 8 copies of a.out in a single
-MPI_COMM_WORLD on the processors that were allocated by SLURM.
-
-Note that the values of component parameters can be changed on the
-mpirun / mpiexec command line. This is explained in the section
-below, "The Modular Component Architecture (MCA)".
-
-===========================================================================
-
-The Modular Component Architecture (MCA)
-----------------------------------------
-
-The MCA is the backbone of Open MPI -- most services and functionality
-are implemented through MCA components. Here is a list of all the
-component frameworks in Open MPI:
-
----------------------------------------------------------------------------
-
-MPI component frameworks:
--------------------------
-
-allocator - Memory allocator
-bml - BTL management layer
-btl - MPI point-to-point Byte Transfer Layer, used for MPI
- point-to-point messages on some types of networks
-coll - MPI collective algorithms
-crcp - Checkpoint/restart coordination protocol
-dpm - MPI-2 dynamic process management
-io - MPI-2 I/O
-mpool - Memory pooling
-mtl - Matching transport layer, used for MPI point-to-point
- messages on some types of networks
-osc - MPI-2 one-sided communications
-pml - MPI point-to-point management layer
-pubsub - MPI-2 publish/subscribe management
-rcache - Memory registration cache
-topo - MPI topology routines
-
-Back-end run-time environment component frameworks:
----------------------------------------------------
-
-errmgr - RTE error manager
-ess - RTE environment-specific services
-filem - Remote file management
-grpcomm - RTE group communications
-iof - I/O forwarding
-notifier - System/network administrator notification system
-odls - OpenRTE daemon local launch subsystem
-oob - Out of band messaging
-plm - Process lifecycle management
-ras - Resource allocation system
-rmaps - Resource mapping system
-rml - RTE message layer
-routed - Routing table for the RML
-snapc - Snapshot coordination
-
-Miscellaneous frameworks:
--------------------------
-
-backtrace - Debugging call stack backtrace support
-carto - Cartography (host/network mapping) support
-crs - Checkpoint and restart service
-installdirs - Installation directory relocation services
-maffinity - Memory affinity
-memchecker - Run-time memory checking
-memcpy - Memory copy support
-memory - Memory management hooks
-paffinity - Processor affinity
-timer - High-resolution timers
-
----------------------------------------------------------------------------
-
-Each framework typically has one or more components that are used at
-run-time. For example, the btl framework is used by the MPI layer to
-send bytes across different types of underlying networks. The tcp btl,
-for example, sends messages across TCP-based networks; the openib btl
-sends messages across OpenFabrics-based networks; the MX btl sends
-messages across Myrinet networks.
-
-Each component typically has some tunable parameters that can be
-changed at run-time. Use the ompi_info command to check a component
-to see what its tunable parameters are. For example:
-
-shell$ ompi_info --param btl tcp
-
-shows all the parameters (and default values) for the tcp btl
-component.
-
-These values can be overridden at run-time in several ways. At
-run-time, the following locations are examined (in order) for new
-values of parameters:
-
-1. <prefix>/etc/openmpi-mca-params.conf
-
- This file is intended to set any system-wide default MCA parameter
- values -- it will apply, by default, to all users who use this Open
- MPI installation. The default file that is installed contains many
- comments explaining its format.
-
-2. $HOME/.openmpi/mca-params.conf
-
- If this file exists, it should be in the same format as
- <prefix>/etc/openmpi-mca-params.conf. It is intended to provide
- per-user default parameter values.
-
-3. environment variables of the form OMPI_MCA_<name> set equal to a
- <value>
-
- Where <name> is the name of the parameter. For example, set the
- variable named OMPI_MCA_btl_tcp_frag_size to the value 65536
- (Bourne-style shells):
-
- shell$ OMPI_MCA_btl_tcp_frag_size=65536
- shell$ export OMPI_MCA_btl_tcp_frag_size
-
-4. the mpirun command line: --mca <name> <value>
-
- Where <name> is the name of the parameter. For example:
-
- shell$ mpirun --mca btl_tcp_frag_size 65536 -np 2 hello_world_mpi
-
-These locations are checked in order, and a value found in a later
-location overrides one found in an earlier location. For example, a
-parameter value passed on the mpirun command line will override an
-environment variable; an environment variable will override the
-system-wide defaults.
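-
-The override order can be sketched in Python as a lookup across the
-four locations, where the last location that sets a parameter wins.
-This is an illustration of the documented precedence only, not Open
-MPI's parameter code; the function name and the use of plain
-dictionaries for each location are assumptions of the sketch.

```python
def resolve_mca_param(name, system_conf, user_conf, environ, cli):
    """Return the value from the last location that sets the parameter,
    or None if no location sets it."""
    candidates = [
        system_conf.get(name),            # 1. <prefix>/etc/openmpi-mca-params.conf
        user_conf.get(name),              # 2. $HOME/.openmpi/mca-params.conf
        environ.get("OMPI_MCA_" + name),  # 3. OMPI_MCA_<name> environment variable
        cli.get(name),                    # 4. mpirun --mca <name> <value>
    ]
    value = None
    for candidate in candidates:
        if candidate is not None:
            value = candidate  # later locations override earlier ones
    return value


# The numeric values here are arbitrary sample settings.
print(resolve_mca_param(
    "btl_tcp_frag_size",
    system_conf={"btl_tcp_frag_size": "16384"},
    user_conf={},
    environ={"OMPI_MCA_btl_tcp_frag_size": "32768"},
    cli={"btl_tcp_frag_size": "65536"},
))
```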
-
-===========================================================================
-
-Common Questions
-----------------
-
-Many common questions about building and using Open MPI are answered
-on the FAQ:
-
- http://www.open-mpi.org/faq/
-
-===========================================================================
-
-Got more questions?
--------------------
-
-Found a bug? Got a question? Want to make a suggestion? Want to
-contribute to Open MPI? Please let us know!
-
-When submitting questions and problems, be sure to include as much
-extra information as possible. This web page details all the
-information that we request in order to provide assistance:
-
- http://www.open-mpi.org/community/help/
-
-User-level questions and comments should generally be sent to the
-user's mailing list (users@open-mpi.org). Because of spam, only
-subscribers are allowed to post to this list (ensure that you
-subscribe with and post from *exactly* the same e-mail address --
-joe@example.com is considered different than
-joe@mycomputer.example.com!). Visit this page to subscribe to the
-user's list:
-
- http://www.open-mpi.org/mailman/listinfo.cgi/users
-
-Developer-level bug reports, questions, and comments should generally
-be sent to the developer's mailing list (devel@open-mpi.org). Please
-do not post the same question to both lists. As with the user's list,
-only subscribers are allowed to post to the developer's list. Visit
-the following web page to subscribe:
-
- http://www.open-mpi.org/mailman/listinfo.cgi/devel
-
-Make today an Open MPI day!
+++ /dev/null
- OpenSM Release Notes 3.3
- =============================
-
-Version: OpenSM 3.3.x
-Repo: git://git.openfabrics.org/~sashak/management.git
-Date: Dec 2009
-
-1 Overview
-----------
-This document describes the contents of the OpenSM 3.3 release.
-OpenSM is an InfiniBand compliant Subnet Manager and Administration,
-and runs on top of OpenIB. The OpenSM version for this release
-is opensm-3.3.5.
-
-This document includes the following sections:
-1 This Overview section (describing new features and software
- dependencies)
-2 Known Issues And Limitations
-3 Unsupported IB compliance statements
-4 Bug Fixes
-5 Main Verification Flows
-6 Qualified Software Stacks and Devices
-
-1.1 Major New Features
-
-* Mesh Analysis for LASH routing algorithm.
- The performance of LASH can be improved by preconditioning the mesh in
- cases where there are multiple links connecting switches and also in
- cases where the switches are not cabled consistently.
- Activated with --do_mesh_analysis command line and config file option.
-
-* Reloadable OpenSM configuration (preliminary implementation)
- It is now possible to reload OpenSM configuration parameters on the
- fly without restarting.
-
-* Routing paths sorted balancing (for UpDown and MinHops)
- This sorts the port order in which routing paths balancing is performed
- by OpenSM. Helps to improve performance dramatically (40-50%) for most
- popular application communication patterns.
- To override this behavior use --guid_routing_order_file command line
- option.
-
-* Weighted Lid Matrices calculation (for UpDown, MinHop and DOR).
- This low level routing fine-tuning feature provides the means to
- define a weighting factor per port for customizing the least weight
- hops for the routing. Custom weights are provided in a file
- specified with the '--hop_weights_file' command line option.
-
-* I/O nodes connectivity (for FatTree).
- This makes it possible to define the set of I/O nodes for the
- Fat-Tree routing algorithm. I/O nodes are non-CN nodes allowed to use
- up to N (specified using --max_reverse_hops) switches the wrong way
- around to improve connectivity. The list of I/O nodes is provided in
- a file specified with the --io_guid_file command line option.
-
-* MGID to MLID compression - infrastructure for many MGIDs to single MLID
- compression. This becomes helpful when the number of multicast groups
- exceeds the subnet's MLID routing capability (normally 1024 groups). In
- such cases many multicast groups (MGIDs) can be routed using the same
- MLID value.
-
-* Many code improvements, optimizations and cleanups.
-
-* Windows support (early stage).
-
-1.2 Minor New Features:
-
-cde0c0d opensm: Convert remaining helper routines for GID printing format
-bc5743c opensm: Add support for MaxCreditHint and LinkRoundTripLatency to
- osm_dump_port_info
-6cd34ab opensm: Add Dell to known vendor list
-003d6bd opensm: Add more info for traps 144 and 256-259 in osm_dump_notice
-5b0c5de opensm/osm_ucat_ftree.c Enhance min hops counters usage
-0715b92 ib_types.h: Add ib_switch_info_get_state_opt_sl2vlmapping routine
-2ddba79 opensm: Remove some __ and __osm_ prefixes
-ea0691f opensm/iba/ib_types.h: Add PortXmit/RcvDataSL PerfMgt attributes
-9c79be5 ib_types.h: Adding BKEY violation trap (259)
-c608ea6 opensm: Add and utilize ib_gid_is_notzero routine
-b639e64 opensm: Handle trap repress on trap 144 generation
-b034205 Add pkey table support to osm_get_all_port_attr
-876605b opensm/ib_types.h: Add attribute ID for PortCountersExtended
-aae3bbc opensm: PortInfo requests for discovered switches
-0147b09 opensm/osm_lid_mgr: use single array for used_lids
-a9225b0 opensm/Makefile.am: remove osm_build_id.h junk file generation
-8e3a57d opensm/osm_console.c: Add list of SMs to status command
-3d664b9 opensm/osm_console.c : Added dump_portguid function to console to
- generate a list of port guids matching one or more regexps
-85b35bc opensm/osm_helper.c: print port number as decimal
-8674cb7 opensm: sort port order for routing by switch loads
-80c0d48 opensm: rescan config file even in standby
-8b7aa5e opensm/osm_subnet.c enable log_max_size opt update
-8558ee5 opensm/include/iba/ib_types.h: Add xmit_wait for PortCounters
-ecde2f7 opensm/osm_subnet.c support subnet configuration rescan and update
-58c45e4 opensm/osm_log.c save log_max_size in subnet opt in MB
-cf88e93 opensm: Add new partition keyword for all hca, switches and routers
-4bfd4e0 opensm: remove libibcommon build dependencies
-3718fc4 opensm/event_plugin: link opensm with -rdynamic flag
-587ce14 opensm/osm_inform.c report IB traps to plugin
-ced5a6e opensm/opensm/osm_console.c: move reporting of plugins to "status"
- command.
-696aca2 opensm: Add configurable retries for transactions
-0d932ff opensm/osm_sa_mcmember_record.c: optimization in zero mgid comparison
-254c2ef opensm/osm_sm_mad_ctrl.c: In sm_mad_ctrl_send_err_cb, set init
- failure on PKeyTable and QoS initialization failure
-83bd10a opensm: Reduce heap consumption by multicast routing tables (MFTs)
-cd33bc5 opensm: Add some additional HP vendor IDs/OUIs
-f78ec3a opensm/osm_mcast_tbl.(h c): Make max_mlid_ho be maximum MLID configured
-2d13530 opensm: Add infrastructure support for PortInfo
- IsMulticastPkeyTrapSuppressionSupported
-3ace760 opensm: Reduce heap consumption by unicast routing tables (LFTs)
-eec568e osmtest: Add SA get PathRecord stress test
-aabc476 opensm: Add infrastructure support for more newly allocated PortInfo
- CapabilityMask bits
-c83c331 opensm: improve multicast re-routing requests processing
-46db92f opensm: Parallelize (Stripe) MFT sets across switches
-00c6a6e opensm: Parallelize (Stripe) LFT sets across switches
-e21c651 opensm/osm_base.h: Add new SA ClassPortInfo:CapabilityMask2 bit
- allocations
-09056b1 opensm/ib_types.h: Add CounterSelect2 field to PortCounters attribute
-6a63003 opensm: Add ability to configure SMSL
-25f071f opensm/lash: Set minimum VL for LASH to use
-622d853 opensm/osm_ucast_ftree.cd: Added support for same level links
-8146ba7 opensm: Add new Sun vendor ID
-1d7dd18 opensm/osm_ucast_ftree.c: Enhanced Fat-Tree algorithm
-e07a2f1 Add LMC support to DOR routing
-1acfe8a opensm: Add SuperMicro to list of recognized vendors
-f02f40e opensm: implement 'connect_roots' option in fat-tree routing
-748d41e opensm SA DB dump/restore: added option to dump SA DB on every sweep
-b03a95e complib/cl_fleximap: add cl_fmap_match() function
-b7a8a87 opensm/include/iba/ib_types.h: adding Congestion Control definitions
-
-1.3 Library API Changes
-
- None
-
-1.4 Software Dependencies
-
-OpenSM depends on the libibumad package (distributed as part of OFA IB
-management together with OpenSM) and on the presence of an IB stack;
-in particular, libibumad uses the user_mad kernel interface ('ib_umad'
-kernel module). The qualified driver versions are provided in Table 2,
-"Qualified IB Stacks".
-
-Also, building the QoS manager policy file parser requires flex and either
-bison or byacc to be installed.
-
-1.5 Supported Devices Firmware
-
-The main task of OpenSM is to initialize InfiniBand devices. The
-qualified devices and their corresponding firmware versions
-are listed in Table 3.
-
-2 Known Issues And Limitations
-------------------------------
-
-* No Service / Key associations:
- There is no way to manage Service access by Keys.
-
-* No SM to SM SMDB synchronization:
- This puts the burden of re-registering services, multicast groups, and
- inform-info on the client application (or IB access layer core).
-
-3 Unsupported IB Compliance Statements
---------------------------------------
-The following section lists all the IB compliance statements which
-OpenSM does not support. Please refer to the IB specification for detailed
-information regarding each compliance statement.
-
-* C14-22 (Authentication):
- M_Key, M_KeyProtectBits and M_KeyLeasePeriod shall be set in one
- SubnSet method. As a work-around, an OpenSM option is provided for
- defining the protect bits.
-
-* C14-67 (Authentication):
- On SubnGet(SMInfo) and SubnSet(SMInfo) - if M_Key is not zero then
- the SM shall generate a SubnGetResp if the M_Key matches, or
- silently drop the packet if M_Key does not match.
-
-* C15-0.1.23.4 (Authentication):
- InformInfoRecords shall always be provided with the QPN set to 0,
- except for the case of a trusted request, in which case the actual
- subscriber QPN shall be returned.
-
-* o13-17.1.2 (Event-FWD):
- If no permission to forward, the subscription should be removed and
- no further forwarding should occur.
-
-* C14-24.1.1.5 and C14-62.1.1.22 (Initialization):
- GUIDInfo - SM should enable assigning Port GUIDInfo.
-
-* C14-44 (Initialization):
- If the SM discovers that it is missing an M_Key to update CA/RT/SW,
- it should notify the higher level.
-
-* C14-62.1.1.12 (Initialization):
- PortInfo:M_Key - Set the M_Key to a node based random value.
-
-* C14-62.1.1.13 (Initialization):
- PortInfo:M_KeyProtectBits - set according to an optional policy.
-
-* C14-62.1.1.24 (Initialization):
- SwitchInfo:DefaultPort - should be configured for random FDB.
-
-* C14-62.1.1.32 (Initialization):
- RandomForwardingTable should be configured.
-
-* o15-0.1.12 (Multicast):
- If the JoinState is SendOnlyNonMember = 1 (only), then the endport
- should join as sender only.
-
-* o15-0.1.8 (Multicast):
- If a request to create an MCG has fields that cannot be met,
- return ERR_REQ_INVALID (currently ignores SL and FlowLabelTClass).
-
-* C15-0.1.8.6 (SA-Query):
- Respond to SubnAdmGetTraceTable - this is an optional attribute.
-
-* C15-0.1.13 (Services):
- Reject ServiceRecord create, modify or delete if the given
- ServiceP_Key does not match the one included in the ServiceGID port
- and the port that sent the request.
-
-* C15-0.1.14 (Services):
- Provide means to associate service name and ServiceKeys.
-
-4 Bug Fixes
------------
-
-4.1 Major Bug Fixes
-
-18990fa opensm: set IS_SM bit during opensm init
-3551389 fix local port smlid in osm_send_trap144()
-a6de48d opensm/osm_link_mgr.c initialize SMSL
-82df467 opensm/osm_req.c: Shouldn't reveal port's MKey on Trap method
-45ebff9 opensm/osm_console_io.h: Modify osm_console_exit so only the
- connection is killed, not the socket
-d10660a opensm/osm_req.c: In osm_send_trap144, set producer type according
- to node type
-8a2d2dd opensm/osm_node_info_rcv.c: create physp for the newly discovered
- port of the known node
-39b241f opensm/lid_mgr: fix duplicated lid assignment
-b44c398 opensm: invalidate routing cache when entering master state
-595f2e3 opensm: update LFTs when entering master
-8406c65 opensm: fix port chooser
-fa90512 opensm/osm_vendor_*_sa: fix incompatibility with QLogic SM
-7ec9f7c opensm: discard multicast SA PR with wildcard DGID
-5cdb53f opensm/osm_sa_node_record.c use comp mask to match by LID or GUID
-55f9772 opensm: Return single PathRecord for SubnAdmGet with DGID/SGID wild
- carded
-5ec0b5f opensm: compress IPV6 SNM groups to use a single MLID
-
-4.2 Other Bug Fixes
-
-4911e0b performance-manager-HOWTO.txt: Indicate master state
-86ccaa4 opensm/osm_pkey_mgr.c: Fix pkey endian in log message
-b79b079 opensm.8.in: Add mention of backing documentation for QoS policy
- file and performance manager
-b4d92af opensm/osm_perfmgr.c: Eliminate duplicated error number
-a10b57a opensm/osm_ucast_ftree.c: lids are always handled in host order
-44273a2 opensm/osm_ucast_ftree.c: fixing bug in indexing
-5cd98f7 Fix further bugs around console closure and clean up code.
-6b34339 opensm/osm_opensm.c: add newline to log message
-68c241c send trap144 when local priority is higher than master priority
-6462999 opensm/osm_inform.c: In __osm_send_report, make sure p_report_madw
- valid before using
-9b8561a opensm/console: Fixed osm_console poll to handle POLLHUP
-91d0700 osm_vendor_ibumad.c: In clear_madw, fix tid endian in message
-5a5136b osm_switch.h : Fixed wrong comment about return value of
- osm_switch_set_hops
-c1ec8c0 osm_ucast_ftree.c: Removed useless initialization on switch indexes
-418d01f opensm/osm_helper.c: use single buffer in osm_dump_dr_smp()
-2c9153c opensm/osm_helper.c: consolidate dr path printing code
-048c447 opensm/osm_helper.c: return then log is inactive
-dd3ef0c opensm: Return error status when cl_disp_register fails
-0143bf7 opensm/osm_perfmgr.c: Improve assert in osm_pc_rcv_process
-6622504 osm_perfmgr.c: In osm_perfmgr_shutdown, add missing cl_disp_unregister
-7b66dee opensm: remove unneeded anymore physp initializations
-f11274a opensm/partition-config.txt: Update for defmember feature
-d240e7d opensm/osm_sm_state_mgr.c: Remove unneeded return statement
-898fb8c opensm: Improve some snprintf uses
-6820e63 opensm/osm_sa_link_record.c: improve get_base_lid()
-64c8d31 opensm: initialize all switch ports
-555fae8 opensm/sweep: add log message before lid assignment
-8e22307 opensm/console: Enhance perfmgr print_counters for better nodenames
-b9721a1 opensm/osm_console.c: Improve perfmgr print_counters error message
-4d8dc72 opensm/osm_inform.c: Fix sense of zero GID compare in __match_inf_rec
-a98dd82 opensm/main.c: remove enable_stack_dump() call
-db6d51e opensm/osm_subnet: fix crash in qos string config parameters reloading
-e5111c8 opensm: proper config file rescan
-e5295b2 opensm: pre-scan command line for config file option
-e2f549e opensm/osm_console.c: Eliminate some extraneous parentheses
-0a265dc opensm/console: dump_portguid - don't duplicate matched guids
-540fefb opensm/console: dump_portguid command fixes
-d96202c opensm/osm_console.c: Add missing command in help_perfmgr
-ae1bd3c opensm/osm_helper.c: Add port counters to __osm_disp_msg_str
-1d38b31 opensm/osm_ucast_mgr.c: Add error numbers for some OSM_LOG prin
-156c749 opensm: fix structure definition for trap 257-258
-5c09f4a opensm/osm_state_mgr.c: small bug in scanning lid table
-72a2fa2 opensm/osm_sa.c: fixing SA MAD dump
-539a4d3 opensm/osm_ucast_ftree.c Fixed bad init value for down port index
-6690833 opensm/ftree: simplify root guids setup.
-90e3291 opensm/ftree: cleanup ftree_sw_tbl_element_t use
-c07d245 opensm/qos_config: no invalid option message on default values
-b382ad8 opensm: avoid memory leaks on config parameters reloading
-45f57ce opensm/osm_ucast_ftree.c: Fixed bug on index port incrementation
-3d618aa opensm/osm_subnet.c: break matching when config parameter already found
-44d98e3 opensm/osm_subnet.c: clean_val() remove trailing quotation
-173010a opensm/doc/perf-manager-arch.txt: Fix some commentary typos
-83bf6c5 opensm/osm_subnet.c fix parse functions for big endian machines
-6b9a1e9 opensm/PerfMgr: Primarily fix enhanced switch port 0 perf manager
- operation
-4f79a17 opensm/osm_perfmgr.c: In osm_perfmgr_init, eliminate memory leak
- on error
-22da81f opensm/osm_ucast_ftree.c: fix full topology dump
-aa25fcb opensm/osm_port_info_rcv.c: don't clear sw->need_update if port 0
- is active
-003bd4b opensm/osm_subnet.c Fix memory leak for QOS string parameters.
-9cbbab2 opensm/opensm.spec: fix event plugin config options
-996e8f6 OpenSM: update osmeventplugin example for the new TRAP event.
-67f4c07 opensm/lash: simplify some memory allocations
-3e6bcdb opensm/lash: fix memory leaks
-3ff97b9 opensm/vendor: save some stack memory
-ccc7621 opensm/osm_ucast_ftree.c: fixing errors in comments
-1a802b3 Corrected incoherency in __osm_ftree_fabric_route_to_non_cns comments
-85a7e54 opensm/osm_sm.c: fix MC group creation in race condition
-aad1af2 opensm/osm_trap_rcv.c: Improvements in log_trap_info()
-f619d67 opensm/osm_trap_rcv.c: Minor reorganization of trap_rcv_process_request
-084335b opensm/link_mgr: verify port's lid
-d525931 opensm/osm_vendor_ibumad: Use OSM_UMAD_MAX_AGENTS rather than
- UMAD_CA_MAX_AGENTS
-f342c62 opensm/osm_sa.c: don't ignore failure in osm_mgrp_add_port()
-587fda4 osmtest/osmt_multicast.c: fix strict aliasing breakage warning
-6931f3e opensm: make subnet's max mlid update implementation independent
-30f1acd osm_ucast_ftree.c missing reset of ca_ports
-ac04779 opensm: fix LFT allocation size
-a7838d0 opensm/osm_ucast_cache: reduce OSM_LOG_INFO debug printouts
-c027335 opensm/osm_ucast_updn.c: Further reduction in cas_per_sw allocation
-e8ee292 opensm/opensm/osm_subnet.c: adjust buffer to ensure a '\n' is printed
-84d9830 opensm/osm_ucast_updn.c: Reduce temporary allocation of cas_per_sw
-347ad64 opensm/ib_types.h: Mask off client rereg bit in set_client_rereg
-c2ab189 opensm/osm_state_mgr.c: in cleanup_switch() check only relevant
- LFT part
-40c93d3 use transportable constant attributes
-c8fa71a osmtest -code cleanup - use strncasecmp()
-770704a opensm/osm_mcast_mgr.c: In mcast_mgr_set_mft_block, fix node GUID
- in log message
-3d20f82 opensm/osm_sa_path_record.c: separate router guid resolution code
-27ea3c8 opensm: fix gcc-4.4.1 warnings
-c88bfd3 opensm/osm_lid_mgr.c: Fix typo in OSM_LOG message
-a9ea08c opensm/osm_mesh.c: Add dump_mesh routine at OSM_LOG_DEBUG level
-bc2a61e C++ style coding does not compile
-6647600 opensm: remove meanless 'const' keywords in APIs
-323a74f opensm/osm_qos_parser_y.y: fix endless loop
-0121a81 opensm: fix endless looping in mcast_mgr
-696c022 opensm: fix some obvious -Wsign-compare warnings
-b91e3c3 opensm/osm_get_port_by_lid(): don't bother with lmc
-ca582df opensm/osm_get_port_by_lid(): speedup a port lookup
-fd846ee opensm/osm_mesh.c: simplify compare_switches() function
-fe20080 osm_sa.c - void * arithmetic causes problems
-220130f osm_helper.c use explicit value for struct init
-0168ece use standard varargs syntax in macro OSM_LOG()
-180b335 update functions to match .h prototypes
-9240ef4 opensm/osm_ucast_lash: fix use after free bug
-6f1a21a opensm: osm_get_port_by_lid() helper
-c9e2818 opensm/osm_sa_path_record.c: validate multicast membership
-225dcf5 opensm/osm_mesh.c: Remove edges in lash matrix
-4dd928b opensm/osm_sa_mcmember_record.c: clean uninitialized variable use
-c48f0bc opensm/osm_perfmgr_db.c: Fix memory leak of db nodes
-82d3585 opensm/osm_notice.c: move logging code to separate function
-9557f60 opensm/osm_inform.c: For traps 64-67, use GID from DataDetails in
- log message
-e2e78d9 opensm/opensm.8.in: Indicate default rule for Default partition
-08c5beb opensm/osm_sa_node_record.c: dump NodeInfo with debug verbosity
-1fe88f0 opensm/multicast: merge mcm_port and mcm_info
-ba75747 opensm/multicast: consolidate port addition/removing code
-5e61ab8 opensm: port object reference in mcm ports list
-5c5dacf opensm: fix uninitialized return value in osm_sm_mcgrp_leave()
-7cfe18d osm_ucast_ftree.c: Removed reverse_hop parameters from
- fabric_route_upgoing_by_going_down
-aa7fb47 opensm/multicast: kill mc group to_be_deleted flag
-a4910fe opensm/osm_mcast_mgr.c: multicast routing by mlid - renaming
-1d14060 opensm/multicast: remove change id tracking
-5a84951 opensm: use mgrp pointer as osm_sm_mcgrp_join/leave() parameter
-d8e3ff5 opensm: use mgrp pointer in port mcm_info
-0631cd3 opensm doc: Indicated limited (rather than partial) partition
- membership
-1010535 opensm/osm_ucast_lash.c: In lash_core, return status -1 for all errors
-942e20f opensm/osm_helper.c: Add SM priority changed into trap 144 description
-2372999 opensm/osm_ucast_mgr: better lft setup
-e268b32 opensm/osm_helper.c: Only change method when > rather than >=
-9309e8c complib/cl_event.c: change nanosec var type long
-d93b126 opensm/complib: account for nsec overflow in timeout values
-ef4c8ac opensm/osm_qos_policy.c: matching PR query to QoS level with pkey
-c93b58b opensm: fixing some data types in osm_req_get/set
-2b89177 opensm/libvendor/osm_vendor_ibumad.c: Handle umad_alloc failure in
- osm_vendor_get
-2cba163 opensm/osm_helper.c: In osm_dump_dr_smp, fix endian of status
-47397e3 opensm/osm_sm_mad_ctrl.c: Fix endian of status in error message
-e83b7ca opensm/osm_mesh.c: Reorder switches for lash
-9256239 opensm/osm_trap_rcv.c: Validate trap is 144 before checking for
- NodeDescription changed
-011d9ca opensm/osm_ucast_lash.c: Handle calloc failure in generate_cdg_for_sp
-59964d7 opensm: fixing handling of opt.max_wire_smps
-f4e3cd0 opensm/osm_ucast_lash.c: Directly call calloc/free rather than
- create/delete_cdg
-5a208bd opensm/osm_ucast_lash.c: Added error numbers to some error log messages
-3b80d10 opensm/osm_helper.c: fix printing trap 258 details
-f682fe0 opensm: do not configure MFTs when mcast support is disabled
-cc42095 opensm/osm_sm_mad_ctrl.c: In sm_mad_ctrl_send_err_cb, indicate
- failed attribute
-aebf215 opensm/osm_ucast_lash.c: Remove osm_mesh_node_delete call from
- switch_delete
-1ef4694 opensm/osm_path.h: In osm_dr_path_init, only copy needed part of path
-c594a2d opensm: osm_dr_path_extend can fail due to invalid hop count
-46e5668 opensm/osm_lash: Fix use after free problem in osm_mesh_node_delete
-81841dc opensm/osm_ucast_lash.c: Handle malloc failures better
-2801203 opensm: remove extra "0x" from debug message.
-88821d2 opensm/main.c: Display SMSL when specified
-f814dcd opensm/osm_subnet.c: Format lash_start_vl consistent with other
- uint8 items
-66669c9 opensm/main.c: Display LASH start VL when specified
-31bb0a7 opensm/osm_mcst_mgr.c: check number of switches only once
-75e672c opensm: find MC group by MGID using fleximap
-2b7260d Clarify the syntax of the hop_weights_file
-e6f0070 opensm/osm_mesh.c: Improve VL utilization
-27497a0 opensm/osm_ucast_ftree.c Fix assert comparing number of CAs to CN ports
-3b98131 opensm/osm_qos_policy.c: Use proper size in malloc in
- osm_qos_policy_vlarb_scope_create
-e6f367d opensm/osm_ucast_ftree.c: Made error numbers unique in some log
- messages
-83261a8 osm_ucast_ftree.c Count number of hops instead of calculating it
-7bdf4ff opensm/osm_sa_(path multipath)_record.c: Fix typo in a couple of
- log messages
-0f8ed87 opensm/osm_ucast_mgr.c: Add error numbers to some error log messages
-0b5ccb4 complib/Makefile.am: prevent file duplications
-e0b8ec9 opensm/osm_sminfo_rcv.c: clean type of smi_rcv_process_get_sm()
-4d01005 opensm: sweep component processors return status value
-6ad8d78 opensm/libvendor/osm_vendor_(ibumad mlx)_sa.c: Handle malloc
- failure in __osmv_send_sa_req
-cf97ebf opensm/osm_ucast_lash.(h c): Replace memory allocation by array
-957461c opensm/osm_sa.c add attribute and component mask to error message
-5d339a1 osm_dump.c dump port if lft is set up
-518083d osm_port.c: check if op_vls = 0 before max_op_vls comparison
-b6964cb opensm/osm_port.c: Change log level of Invalid OP_VLS 0 message
- to VERBOSE
-b27568c opensm/PerfMgr: Reduce host name length
-bc495c0 opensm/osm_lid_mgr.c bug in opensm LID assignment
-5a466fd opensm/osm_perfmgr_db.c: Remove unneeded initialization in
- perfmgr_db_print_by_name
-57cf328 opensm/osm_ucast_ftree.c Increase the size of the hop table
-8323cf1 opensm/PerfMgr: Remove some underbars from internal names
-65b1c15 opensm: Changes to spec and make files for updated release notes
-cd226c7 OpenSM: include/vendor/osm_vendor.h - Replaced #elif with no
- condition by #else
-9f8bd4a management: Fixed custom_release in SPEC files
-c0b8207 opensm/PerfMgr: Change redir_tbl_size to num_ports for better clarity
-596bb08 opensm/osm_sa.c: check for SA DB file only if requested
-2f2bd4e opensm SA DB dump/restore: load SA DB only once
-4abcbf2 opensm: Added print_desc to various log messages
-5e3d235 opensm/osm_vendor_ibumad.c: Move error info into single message
-8e5ca10 opensm/libvendor//osm_vendor_ibumad_sa.c: uninitialized fields
-d13c2b6 opensm/osm_sm_mad_ctrl.c Changes to some error messages
-f79d315 opensm/osm_sm_mad_ctrl.c: Add missing call to return mad to mad pool
-150a9b1 opensm/osm_sa_mcmember_record.c: print mcast join/create failures in
- VERBOSE instead of DEBUG level
-9b7882a opensm/osm_vendor_ibumad.c: Change LID format to decimal in log message
-5256c43 opensm/osm_vendor_mlx: fix compilation error
-93db10d opensm/osm_vendor_mlx_txn.c: eliminate bunch of compilation warnings
-156fdc1 opensm/osm_helper.c Log format changes
-7a55434 opensm/osm_ucast_ftree.c Changed log level
-a1694de opensm/osm_state_mgr.c Added more info to some error messages
-fdec20a opensm/osm_trap_rcv.c: Eliminate heavy sweep on receipt of trap 145
-13a32a7 opensm - standardize on a single Windows #define - take #2
-b236a10 opensm/osm_db_files.c: kill useless malloc() castings
-4ba0c26 opensm/osm_db_files.c: add '/' path delimited
-e3b98a5 opensm/osm_sm_mad_ctrl.c: Fix qp0_mads_accounting
-dbbe5b3 opensm/osm_subnet.c: fixing bug in dumping options file
-f22856a opensm/osm_ucast_mgr.c: fix memory leak
-0d5f0b6 opensm: osm_get_mgrp_by_mgid() helper
-e3c044a osm_sa_mcmember_record.c: pass MCM Record data to mlid allocator
-3dda2dc opensm/osm_sa_member_record.c: mlid independent MGID generator
-1f95a3c opensm/osm_sa_mcmember_record.c: move mgid allocation code
-b78add1 complib: replace intn_t types by C99 intptr_t
-a864fd3 osmtest/osmt_mtl_regular_qp.c: cleaning uintn_t use
-9e01318 opensm/osm_console.c: make const functions
-f8c4c3e opensm/osm_mgrp_new(): add subnet db insertion
-80da047 complib/fleximap: make compar callback to return int
-bf7fe2d opensm: cleanup intn_t uses
-0862bba opensm/main.c: opensm cannot be killed while asking for port guid
-2b70193 opensm/complib: bug in cl_list_insert_array_head/tail functions
-4764199 opensm - use C99 transportable data type for pointer storage
-a9c326c opensm/osm_state_mgr.c: do not probe remote side of port 0
-4945706 opensm/osm_mcast_mgr.c: fix return value on alloc_mfts() failures
-8312a24 OpenSM: Fix unused variable compiler warning.
-ab8f0a3 opensm/partition: keep multicast group pointer
-a817430 opensm: Only clear SMP beyond end of PortInfo attribute
-52fb6f2 opensm/osm_switch.h: Remove dead osm_switch_get_physp_ptr routine
-aa6d932 opensm/osm_mcast_tbl.c: In osm_mcast_tbl_clear_mlid, use memset to
- clear port mask entry
-2ad846b opensm/osm_trap_rcv.c: use source_lid and port_num for logging
-b9d7756 opensm/osm_mcast_tbl: Fix size of port mask table array
-11c0a9b opensm/main.c: Use strtoul rather than strtol for parsing transaction
- timeout
-0608af9 opensm/osm_sm_mad_ctrl.c: In sm_mad_ctrl_send_err_cb, revert setting
- of init failure on QoS initialization failures
-c6b4d4a opensm/osm_vendor_ibumad.c: Add transaction ID to osm_vendor_send
- log message
-520af84 opensm/osm_sa_path_record.c: don't set dgid pointer for local subnet
-4a878fb opensm/osm_mcast_mgr.c: fix osm_mcast_mgr_compute_max_hops for
- managed switch
-
-* Other less critical or visible bugs were also fixed.
-
-5 Main Verification Flows
--------------------------
-
-OpenSM verification is run using the following activities:
-* osmtest - a stand-alone program
-* ibmgtsim (IB management simulator) based - a set of flows that
- simulate clusters, inject errors and verify OpenSM capability to
- respond and bring up the network correctly.
-* small cluster regression testing - where the SM is used on back to
- back or single switch configurations. The regression includes
- multiple OpenSM dedicated tests.
-* cluster testing - when we run OpenSM to set up a large cluster, perform
- hand-off, reboots and reconnects, verify routing correctness and SA
- responsiveness at the ULP level (IPoIB and SDP).
-
-5.1 osmtest
-
-osmtest is an automated verification tool used for OpenSM
-testing. Its verification flows are described in the list below.
-
-* Inventory File: Obtain and verify all port info, node info, link and path
- records parameters.
-
-* Service Record:
- - Register new service
- - Register another service (with a lease period)
- - Register another service (with service p_key set to zero)
- - Get all services by name
- - Delete the first service
- - Delete the third service
- - Bad flows: get/delete of an invalid service
- - Add / Get same service with different data
- - Add / Get / Delete by different component mask values (services
- by Name & Key / Name & Data / Name & Id / Id only )
-
-* Multicast Member Record:
- - Query of existing Groups (IPoIB)
- - BAD Join with insufficient comp mask (o15.0.1.3)
- - Create given MGID=0 (o15.0.1.4)
- - Create given MGID=0xFF12A01C,FE800000,00000000,12345678 (o15.0.1.4)
- - Create BAD MGID=0xFA. (o15.0.1.6)
- - Create BAD MGID=0xFF12A01B w/ link-local not set (o15.0.1.6)
- - New MGID with invalid join state (o15.0.1.9)
- - Retry of existing MGID - See JoinState update (o15.0.1.11)
- - BAD RATE when connecting to existing MGID (o15.0.1.13)
- - Partial JoinState delete request - removing FullMember (o15.0.1.14)
- - Full Delete of a group (o15.0.1.14)
- - Verify Delete by trying to Join deleted group (o15.0.1.14)
- - BAD Delete of IPoIB membership (no prev join) (o15.0.1.15)
-
-* GUIDInfo Record:
- - All GUIDInfoRecords in subnet are obtained
-
-* MultiPathRecord:
- - Perform some compliant and noncompliant MultiPathRecord requests
- - Validation is via status in responses and IB analyzer
-
-* PKeyTableRecord:
- - Perform some compliant and noncompliant PKeyTableRecord queries
- - Validation is via status in responses and IB analyzer
-
-* LinearForwardingTableRecord:
- - Perform some compliant and noncompliant LinearForwardingTableRecord queries
- - Validation is via status in responses and IB analyzer
-
-* Event Forwarding: Register for trap forwarding using reports
- - Send a trap and wait for report
- - Unregister non-existing
-
-* Trap 64/65 Flow: Register to Trap 64-65, create traps (by
- disconnecting/connecting ports) and wait for report, then unregister.
-
-* Stress Test: send PortInfoRecord queries, both single and RMPP and
- check for the rate of responses as well as their validity.
-
-
-5.2 IB Management Simulator OpenSM Test Flows:
-
-The simulator provides the ability to simulate the SM handling of virtual
-topologies that are not limited to actual lab equipment availability.
-OpenSM was simulated to bring up clusters of up to 10,000 nodes. Daily
-regressions use smaller clusters (16 and 128 nodes).
-
-The following test flows are run on the IB management simulator:
-
-* Stability:
- Up to 12 links from the fabric are randomly selected to drop packets
- at drop rates up to 90%. The SM is required to succeed in bringing the
- fabric up. The resulting routing is verified to be correct as well.
-
-* LID Manager:
- Using LMC = 2 the fabric is initialized with LIDs. Faults such as
- zero LID, Duplicated LID, non-aligned (to LMC) LIDs are
- randomly assigned to various nodes and other errors are randomly
- output to the guid2lid cache file. The SM sweep is run 5 times and
- after each iteration a complete verification is made to ensure that all
- LIDs that could possibly be maintained are kept, as well as that all nodes
- were assigned a legal LID range.
-
-* Multicast Routing:
- Nodes randomly join the 0xc000 group and eventually the
- resulting routing is verified for completeness and adherence to
- Up/Down routing rules.
-
-* osmtest:
- The complete osmtest flow as described in section 5.1 is run on
- the simulated fabrics.
-
-* Stress Test:
- This flow merges fabric, LID and stability issues with continuous
- PathRecord, ServiceRecord and Multicast Join/Leave activity to
- stress the SM/SA during continuous sweeps. InformInfo Set/Delete/Get
- were added to the test such that both existing and non-existing nodes
- perform them in random order.
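-
-The LID checks behind the LID Manager flow above can be sketched as
-follows. This is a hedged illustration, not the simulator's verification
-code: the function names and the GUID-to-LID dictionary are assumptions.
-It relies only on the facts that a port with a given LMC owns 2^LMC
-consecutive LIDs starting at an aligned base LID, that LID 0 is not valid,
-and that unicast LIDs end at 0xBFFF.
-
```python
UNICAST_LID_MAX = 0xBFFF  # highest unicast LID; 0xC000 and above are multicast


def lid_fault(base_lid: int, lmc: int):
    """Return a fault name for one base LID, or None if the LID range is legal."""
    span = 1 << lmc  # a port owns 2**lmc consecutive LIDs
    if base_lid == 0:
        return "zero LID"
    if base_lid % span != 0:
        return "non-aligned LID"
    if base_lid + span - 1 > UNICAST_LID_MAX:
        return "out of unicast range"
    return None


def check_assignments(guid_to_lid: dict, lmc: int) -> dict:
    """Map each port GUID to its fault; ports with legal, unique LIDs are omitted."""
    faults = {}
    seen = {}  # base LID -> first GUID that claimed it
    for guid, lid in guid_to_lid.items():
        fault = lid_fault(lid, lmc)
        if fault is None and lid in seen:
            fault = "duplicated LID"
        if fault is None:
            seen[lid] = guid
        else:
            faults[guid] = fault
    return faults
```
-
-With LMC = 2, a base LID of 4 covers LIDs 4 to 7, so a base LID of 6 is
-reported as non-aligned and a second port claiming 4 as duplicated.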
-
-5.3 OpenSM Regression
-
-Using a back-to-back or single switch connection, the following set of
-tests is run nightly on the stacks described in Table 2. The included
-tests are:
-
-* Stress Testing: Flood the SA with queries from multiple channel
- adapters to check the robustness of the entire stack up to the SA.
-
-* Dynamic Changes: Dynamic topology changes, made by randomly
- dropping SMP packets, are used to test OpenSM adaptation to an unstable
- network and to verify DB correctness.
-
-* Trap Injection: This flow injects traps to the SM and verifies that it
- handles them gracefully.
-
-* SA Query Test: This test exhaustively checks the SA responses to every
- possible single component mask. To do that the test examines the
- entire set of records the SA can provide, classifies them by their
- field values and then selects every field (using component mask and a
- value) and verifies that the response matches the expected set of records.
- A random selection using multiple component mask bits is also performed.
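-
-The single component mask check in the SA Query Test above can be
-sketched as follows. This is an illustration under stated assumptions,
-not the regression's code: the record fields, the bit positions in
-FIELD_BITS and the in-memory sa_query() stand-in for a real SA are all
-hypothetical.
-
```python
# Hypothetical record fields and their component mask bits.
FIELD_BITS = {"lid": 1 << 0, "port_num": 1 << 1, "state": 1 << 2}


def sa_query(records, comp_mask, template):
    """Stand-in for the SA: return records matching every field selected by comp_mask."""
    selected = [f for f, bit in FIELD_BITS.items() if comp_mask & bit]
    return [r for r in records if all(r[f] == template[f] for f in selected)]


def check_single_component_masks(records, query=sa_query):
    """For each field and each observed value, compare the query result with
    the set of records expected from classifying the full record list."""
    failures = []
    for field, bit in FIELD_BITS.items():
        by_value = {}
        for r in records:  # classify records by this field's value
            by_value.setdefault(r[field], []).append(r)
        for value, expected in by_value.items():
            got = query(records, bit, {field: value})
            if got != expected:
                failures.append((field, value))
    return failures
```
-
-An empty result means every single-field query returned exactly the
-records that share the selected value.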
-
-5.4 Cluster testing:
-
-Cluster testing is usually run before a distribution release. It
-involves real hardware setups of 16 to 32 nodes (or more if a beta site
-is available). Each test is validated by running all-to-all ping through the IB
-interface. The test procedure includes:
-
-* Cluster bringup
-
-* Hand-off between 2 or 3 SMs while performing:
- - Node reboots
- - Switch power cycles (disconnecting the SMs)
-
-* Unresponsive port detection and recovery
-
-* osmtest from multiple nodes
-
-* Trap injection and recovery
-
-
-6 Qualified Software Stacks and Devices
----------------------------------------
-
-OpenSM Compatibility
---------------------
-Note that OpenSM version 3.2.1 and earlier used a value of 1 in host
-byte order for the default SM_Key, so there is a compatibility issue
-with these earlier versions of OpenSM when the 3.2.2 or later version
-is running on a little endian machine. This affects SM handover as well
-as SA queries (saquery tool in infiniband-diags).
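-
-The byte order issue described above can be illustrated with a short
-sketch. It assumes, beyond what this note states, that version 3.2.2 and
-later place the default SM_Key value 1 on the wire in network (big-endian)
-byte order; the function name is hypothetical. SM_Key is a 64-bit field.
-
```python
import struct


def sm_key_bytes(value: int, byte_order: str) -> bytes:
    """Pack a 64-bit SM_Key as it would sit in memory for the given byte order."""
    fmt = {"big": ">Q", "little": "<Q"}[byte_order]
    return struct.pack(fmt, value)


# Value 1 stored in host byte order on a little-endian machine (3.2.1 and earlier).
host_order_little = sm_key_bytes(1, "little")
# Value 1 stored in host byte order on a big-endian machine (3.2.1 and earlier).
host_order_big = sm_key_bytes(1, "big")
# Value 1 in network byte order (assumed behavior of 3.2.2 and later).
network_order = sm_key_bytes(1, "big")

# A peer reading the little-endian host-order bytes as network order sees a
# different key, so the two versions do not agree on the default SM_Key.
seen_by_peer = int.from_bytes(host_order_little, "big")
```
-
-On a big-endian host the two encodings are identical, which is why the
-mismatch only shows up on little-endian machines.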
-
-
-Table 2 - Qualified IB Stacks
-=============================
-
-Stack | Version
------------------------------------------|--------------------------
-The main stream Linux kernel | 2.6.x
-OFED | 1.4
-OFED | 1.3
-OFED | 1.2
-OFED | 1.1
-OFED | 1.0
-
-Table 3 - Qualified Devices and Corresponding Firmware
-======================================================
-
-Mellanox
-Device | FW versions
-------------------------------------|-------------------------------
-InfiniScale | fw-43132 5.2.000 (and later)
-InfiniScale III | fw-47396 0.5.000 (and later)
-InfiniScale IV | fw-48436 7.1.000 (and later)
-InfiniHost | fw-23108 3.5.000 (and later)
-InfiniHost III Lx | fw-25204 1.2.000 (and later)
-InfiniHost III Ex (InfiniHost Mode) | fw-25208 4.8.200 (and later)
-InfiniHost III Ex (MemFree Mode) | fw-25218 5.3.000 (and later)
-ConnectX IB | fw-25408 2.3.000 (and later)
-
-QLogic/PathScale
-Device | Note
---------|-----------------------------------------------------------
-iPath | QHT6040 (PathScale InfiniPath HT-460)
-iPath | QHT6140 (PathScale InfiniPath HT-465)
-iPath | QLE6140 (PathScale InfiniPath PE-880)
-iPath | QLE7240
-iPath | QLE7280
-
-Note 1: OpenSM does not run on an IBM Galaxy (eHCA) as it does not expose
-QP0 and QP1. However, it does support it as a device on the subnet.
-
-Note 2: QoS firmware and Mellanox devices
-
-HCAs: QoS supported by ConnectX. QoS-enabled FW release is 2_5_000 and
-later.
-
-Switches: QoS supported by InfiniScale III
-Any InfiniScale III FW that is supported by OpenSM supports QoS.
+++ /dev/null
- Open Fabrics Enterprise Distribution (OFED)
- qib in OFED 1.5.1 Release Notes
-
- March 2010
-
-======================================================================
-1. Overview
-======================================================================
-qib is the low level driver implementation for all QLogic InfiniPath
-PCI-Express HCAs: gen 1 x8 SDR QLE7140, gen 1 x8 DDR QLE7240,
-gen 1 x16 DDR QLE7280, gen 2 x8 QDR QLE7340 and QLE7342.
-
-The qib driver is new for OFED 1.5.
-
-The qib kernel driver obsoletes the ipath kernel driver but is
-compatible with libipathverbs so no new user level components are needed.
+++ /dev/null
-# QLogic VNIC configuration file
-#
-# This file documents and describes the use of the
-# VNIC configuration file qlgc_vnic.cfg. This file
-# should reside in /etc/infiniband/qlgc_vnic.cfg
-#
-#
-# Knowing how to fill the configuration file
-###############################################
-#
-# For filling the configuration file you need to know
-# some information about your EVIC/VEx device. This information
-# can be obtained with the help of the ib_qlgc_vnic_query tool.
-# "ib_qlgc_vnic_query -es" command will give DGID, IOCGUID and IOCSTRING information about
-# the EVIC/VEx IOCs that are available through port 1 and
-# "ib_qlgc_vnic_query -es -d /dev/infiniband/umad1" will give information about
-# the EVIC/VEx IOCs available through port 2.
-#
-# Refer to the README for more information about the ib_qlgc_vnic_query tool.
-#
-#
-# General structure of the configuration file
-###############################################
-#
-# All lines beginning with a # are treated as comments.
-#
-# A simple configuration file consists of CREATE commands
-# for each VNIC interface to be created.
-#
-# A simple CREATE command looks like this:
-#
-# {CREATE; NAME="eioc1";
-# DGID=0xfe8000000000000000066a0258000001; IOCGUID=0x66a0130000001; IOCSTRING="Chassis 0x00066A00010003F2, Slot 1, IOC 1";
-# }
-#
-#Where
-#
-#NAME - The device name for the interface
-#
-#DGID - The DGID of the IOC to use.
-#
-# If DGID is specified then IOCGUID MUST also be specified.
-#
-# Though specifying DGID is optional, using this option is recommended,
-# as it will provide the quickest way of starting up the VNIC service.
-#
-#
-#IOCGUID - The GUID of the IOC to use.
-#
-#IOCSTRING - The IOC Profile ID String of the IOC to use.
-#
-# Either an IOCGUID or an IOCSTRING MUST always be specified.
-#
-# If DGID is specified then IOCGUID MUST also be specified.
-#
-# If no DGID is specified and both IOCGUID and IOCSTRING are specified
-# then IOCSTRING is given preference and the DGID of the IOC whose
-# IOCSTRING is specified is used to create the VNIC interface.
-#
-# If hotswap capability of EVIC/VEx is to be used, then IOCSTRING
-# must be specified.
-#
-#INSTANCE - Defaults to 0. Range 0-255. If a host will connect to the
-# same IOC more than once, each connection must be assigned a unique
-# number.
-#
-#
-#RX_CSUM - defaults to TRUE. When true, indicates that the receive checksum
-# should be done by the EVIC/VEx
-#
-#HEARTBEAT - defaults to 100. Specifies the time in 1/100ths of a second
-# between heartbeats
-#
-#PORT - Specification for local HCA port. First port is 1.
-#
-#HCA - Optional HCA specification for use with PORT specification. First HCA is 0.
-#
-#PORTGUID - The PORTGUID of the IB port to use.
-#
-# Use of PORTGUID for configuring the VNIC interface has an
-# advantage on hosts with more than one HCA installed. Because
-# PORTGUID is persistent for a given IB port, VNIC configurations
-# remain consistent and reliable, unaffected by restarts of the
-# OFED IB stack on such hosts.
-#
-# On the downside, if an HCA on the host is changed, VNIC interfaces
-# configured with PORTGUID need reconfiguration.
-#
-#IB_MULTICAST - Controls enabling or disabling of IB multicast feature on VNIC.
-# Defaults to TRUE implying IB multicast is enabled for
-# the interface. To disable IB multicast, set it to FALSE.
-#
-# Example of DGID and IOCGUID based configuration (this configuration will give
-# the quickest start up of VNIC service):
-#
-# {CREATE; NAME="eioc1";
-# DGID=0xfe8000000000000000066a0258000001; IOCGUID=0x66a0130000001;
-# }
-#
-#
-# Example of IOCGUID based configuration:
-#
-# {CREATE; NAME="eioc1"; IOCGUID=0x66A013000010C;
-# RX_CSUM=TRUE;
-# HEARTBEAT=100; }
-#
-# Example of IOCSTRING based configuration:
-#
-# {CREATE; NAME="eioc1"; IOCSTRING="Chassis 0x00066A0050000018, Slot 2, IOC 1";
-# RX_CSUM=TRUE;
-# HEARTBEAT=100; }
-#
-#
-#Failover configuration:
-#########################
-#
-# It is possible to create a VNIC interface with failover configuration
-# by using the PRIMARY and SECONDARY commands. The IOC specified in
-# the PRIMARY command will be used as the primary IOC for this interface
-# and the IOC specified in the SECONDARY command will be used as the
-# fail-over backup in case the connection with the primary IOC fails
-# for some reason.
-#
-# PRIMARY and SECONDARY commands are written in the following way:
-#
-# PRIMARY={DGID=...;IOCGUID=...; IOCSTRING=...;INSTANCE=... } -
-# IOCGUID and INSTANCE must be values that are unique to the primary interface
-#
-# SECONDARY={DGID=...;IOCGUID=...; INSTANCE=... } -
-# IOCGUID and INSTANCE must be values that are unique to the secondary interface
-#
-# OR it can also be specified without using DGID, like this:
-#
-# PRIMARY={IOCGUID=...; INSTANCE=... } - IOCGUID may be substituted with
-# IOCSTRING. IOCGUID, IOCSTRING, and INSTANCE must be values that are
-# unique to the primary interface
-#
-# SECONDARY={IOCGUID=...; INSTANCE=... } - bring up a secondary connection for
-# fail-over. IOCGUID may be substituted with IOCSTRING. IOCGUID, IOCSTRING,
-# and INSTANCE values to be used for the secondary connection
-#
-#
-#Examples of failover configuration:
-#
-#{CREATE; NAME="veth1";
-# PRIMARY={ DGID=0xfe8000000000000000066a0258000001; IOCGUID=0x66a0130000001; IOCSTRING="Chassis 0x00066A00010003F2, Slot 1, IOC 1";
-# INSTANCE=1; PORT=1; }
-# SECONDARY={DGID=0xfe8000000000000000066a0258000001; IOCGUID=0x66a0230000001; IOCSTRING="Chassis 0x00066A00010003F2, Slot 1, IOC 2";
-# INSTANCE=1; PORT=2; }
-#}
-#
-# {CREATE; NAME="eioc2";
-# PRIMARY = {IOCGUID=0x66A0130000105; INSTANCE=0; PORT=1; }
-# SECONDARY = {IOCGUID=0x66A013000010C; INSTANCE=0; PORT=2; }
-# }
-#
-#Example of configuration with IB_MULTICAST
-#
-# {CREATE; NAME="eioc2";
-# PRIMARY = {IOCGUID=0x66A0130000105; INSTANCE=0; PORT=1; IB_MULTICAST=FALSE; }
-# SECONDARY = {IOCGUID=0x66A013000010C; INSTANCE=0; PORT=2; IB_MULTICAST=FALSE; }
-# }
-#
-# Example of HCA/PORT and PORTGUID configurations:
-# {
-# CREATE; NAME="veth1";
-# PRIMARY={IOCGUID=00066a02de000070; INSTANCE=1; PORTGUID=0x0002c903000010f5; }
-# SECONDARY={IOCGUID=00066a02de000070; INSTANCE=2; PORTGUID=0x0002c903000010f6; }
-# }
-#
-# {
-# CREATE; NAME="veth2";
-# PRIMARY={IOCGUID=00066a02de000070; DGID=fe8000000000000000066a21de000070; INSTANCE=3; HCA=1; PORT=2; }
-# SECONDARY={IOCGUID=00066a02de000070; DGID=fe8000000000000000066a21de000070; INSTANCE=4; HCA=0; PORT=1; }
-# }
-#
-# {
-# CREATE; NAME="veth3";
-# IOCSTRING="EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2";
-# INSTANCE=5; PORTGUID=0x0002c90300000786;
-# }
-# {
-# CREATE; NAME="veth4";
-# IOCGUID=00066a02de000070;
-# INSTANCE=6; HCA=1; PORT=2;
-# }
+++ /dev/null
-Distribution
- Open Fabrics Enterprise Distribution (OFED) 1.5, December 2009
-
-Summary
- qperf - Measure RDMA and IP performance
-
-Overview
- qperf measures bandwidth and latency between two nodes. It can work over
- TCP/IP as well as the RDMA transports.
-
-Quick Start
- * Since qperf measures latency and bandwidth between two nodes, you need
- access to two nodes. Assume they are called node1 and node2.
-
- * On node1, run qperf without any arguments. It will act as a server and
- continue to run until asked to quit.
-
- * To measure TCP bandwidth between the two nodes, on node2, type:
- qperf node1 tcp_bw
-
- * To measure RDMA RC latency, type (on node2):
- qperf node1 rc_lat
-
- * To measure RDMA UD latency using polling, type (on node2):
- qperf node1 -P 1 ud_lat
-
- * To measure SDP bandwidth, on node2, type:
- qperf node1 sdp_bw
-
-Documentation
- * Man page available. Type
- man qperf
-
- * To get a list of examples, type:
- qperf --help examples
-
- * To get a list of tests, type:
- qperf --help tests
-
-Tests
- Miscellaneous
- conf Show configuration
- quit Cause the server to quit
- Socket Based
- rds_bw RDS streaming one way bandwidth
- rds_lat RDS one way latency
- sctp_bw SCTP streaming one way bandwidth
- sctp_lat SCTP one way latency
- sdp_bw SDP streaming one way bandwidth
- sdp_lat SDP one way latency
- tcp_bw TCP streaming one way bandwidth
- tcp_lat TCP one way latency
- udp_bw UDP streaming one way bandwidth
- udp_lat UDP one way latency
- RDMA Send/Receive
- ud_bw UD streaming one way bandwidth
- ud_bi_bw UD streaming two way bandwidth
- ud_lat UD one way latency
- rc_bw RC streaming one way bandwidth
- rc_bi_bw RC streaming two way bandwidth
- rc_lat RC one way latency
- uc_bw UC streaming one way bandwidth
- uc_bi_bw UC streaming two way bandwidth
- uc_lat UC one way latency
- RDMA
- rc_rdma_read_bw RC RDMA read streaming one way bandwidth
- rc_rdma_read_lat RC RDMA read one way latency
- rc_rdma_write_bw RC RDMA write streaming one way bandwidth
- rc_rdma_write_lat RC RDMA write one way latency
- rc_rdma_write_poll_lat RC RDMA write one way polling latency
- uc_rdma_write_bw UC RDMA write streaming one way bandwidth
- uc_rdma_write_lat UC RDMA write one way latency
- uc_rdma_write_poll_lat UC RDMA write one way polling latency
- InfiniBand Atomics
- rc_compare_swap_mr RC compare and swap messaging rate
- rc_fetch_add_mr RC fetch and add messaging rate
- Verification
- ver_rc_compare_swap Verify RC compare and swap
- ver_rc_fetch_add Verify RC fetch and add
+++ /dev/null
- Open Fabrics Enterprise Distribution (OFED)
- RDMA CM in OFED 1.5 Release Notes
-
- July 2010
-
-
-===============================================================================
-Table of Contents
-===============================================================================
-1. Overview
-2. New Features
-3. Known Issues
-
-===============================================================================
-1. Overview
-===============================================================================
-The RDMA CM is a communication manager used to set up reliable, connected
-and unreliable datagram data transfers. It provides an RDMA transport
-neutral interface for establishing connections. The API is based on sockets,
-but adapted for queue pair (QP) based semantics: communication must be
-over a specific RDMA device, and data transfers are message based.
-
-
-The RDMA CM only provides the communication management (connection setup /
-teardown) portion of an RDMA API. It works in conjunction with the verbs
-API for data transfers.
-
-===============================================================================
-2. New Features
-===============================================================================
-For OFED 1.5.2:
-
-Several enhancements were added to librdmacm release 1.0.12 that
-are intended to simplify using RDMA devices and address scalability issues.
-These changes were in response to long standing requests to make
-connection establishment 'more like sockets'. For full details,
-users should refer to the appropriate man pages. Major changes include:
-
-* Support synchronous operation for library calls. Users can control
- whether an rdma_cm_id operates asynchronously or synchronously based on
- the rdma_event_channel parameter. Use of synchronous operations
- reduces the amount of application code required to use the librdmacm
- by eliminating the need for event processing code.
-
- An rdma_cm_id will be marked for synchronous operation if the
- rdma_event_channel parameter is NULL for rdma_create_id or
- rdma_migrate_id. Users can toggle between synchronous and
- asynchronous operation through the rdma_migrate_id call.
-
- Calls that operate synchronously include rdma_resolve_addr,
- rdma_resolve_route, rdma_connect, rdma_accept, and rdma_get_request.
- Synchronous event data is returned to the user through the
- rdma_cm_id.
-
-* The addition of a new API: rdma_getaddrinfo. This call is modeled
- after getaddrinfo, but for RDMA devices and connections. It has the
- following notable deviations from getaddrinfo:
-
- A source address is returned as part of the call to allow the
- user to allocate necessary local HW resources for connections.
-
- Optional routing information may be returned to support
- InfiniBand fabrics. IB routing information includes necessary
- path record data. rdma_getaddrinfo will obtain this information
- if IB ACM support (see below) is enabled. The use of IB ACM
- is not required for rdma_getaddrinfo.
-
- rdma_getaddrinfo provides future extensions to support
- more complex address and route resolution mechanisms, such as
- multiple path support and failover.
-
-* Support for new APIs: rdma_get_request, rdma_create_ep, and
- rdma_destroy_ep. rdma_get_request simplifies the passive side
- implementation by adding synchronous support for accepting new
- connections. rdma_create_ep combines the functionality of
- rdma_create_id, rdma_create_qp, rdma_resolve_addr, and rdma_resolve_route
- in a single API that uses the output of rdma_getaddrinfo as its input.
-
-* Support for optional parameters. To simplify support for casual RDMA
- developers and researchers, the librdmacm can allocate protection
- domains, completion queues, and queue pairs on a user's behalf.
- This reduces the amount of information that a developer
- must learn in order to use RDMA, plus allows the user to take
- advantage of higher-level completion processing abstractions.
-
- In addition to optional parameters, a user can also specify that the
- librdmacm should automatically select usable values for RDMA read
- operations.
-
-* Add support for IB ACM. IB ACM (InfiniBand Assistant for Communication
- Management) defines a socket based protocol to an IB address and route
- resolution service. One implementation of that service is provided
- separately by the ibacm package, but anyone can implement the service
- provided that they adhere to the IB ACM socket protocol. IB ACM is an
- experimental service targeted at increasing the scalability of applications
- running on a large cluster.
-
- Use of IB ACM is not required and is controlled through the build option
- '--with-ib_acm'. If the librdmacm fails to contact the IB ACM service, it
- reverts to using kernel services to resolve address and routing data.
-
-* Add RDMA helper routines. The librdmacm provides a set of simpler verbs
- calls for posting work requests, registering memory, and checking for
- completions. These calls are wrappers around libibverbs routines.
-
-===============================================================================
-3. Known Issues
-===============================================================================
-The RDMA CM relies on the operating system's network configuration tables to
-map IP addresses to RDMA devices. Incorrectly configured network
-configurations can result in the RDMA CM being unable to locate the correct
-RDMA device. Currently, the RDMA CM only supports IPv4 addressing.
-
-All RDMA interfaces must provide a way to map IP addresses to an RDMA device.
-For InfiniBand, this is done using IPoIB, and requires correctly configured
-IPoIB device interfaces sharing the same multicast domain. For details on
-configuring IPoIB, refer to ipoib_release_notes.txt. For RDMA devices to
-communicate, they must support the same underlying network and data link
-layers.
-
-If you experience problems using the RDMA CM, you may want to check the
-following:
-
- * Verify that you have IP connectivity over the RDMA devices. For example,
- ping between iWarp or IPoIB devices.
-
- * Ensure that IP network addresses assigned to RDMA devices do not
- overlap with IP network addresses assigned to standard Ethernet devices.
-
- * For multicast issues, either bind directly to a specific RDMA device, or
- configure the IP routing tables to route multicast traffic over an RDMA
- device's IP address.
-
+++ /dev/null
- Open Fabrics Enterprise Distribution (OFED)
- RDS in OFED 1.5.1 Release Notes
- March 2010
-
-
-===============================================================================
-Table of Contents
-===============================================================================
-1. Overview
-2. Supported Platforms
-3. Installation & Configuration
-4. New Features
-5. Bug fixes and Enhancements since OFED 1.4
-6. Bug fixes and Enhancements since OFED 1.3.1
-7. Bug fixes and Enhancements since OFED 1.3
-8. Bug fixes and Enhancements since OFED 1.2
-9. Known Issues
-
-===============================================================================
-1. Overview
-===============================================================================
-RDS (Reliable Datagram Sockets) is a socket API that provides reliable,
-in-order datagram delivery between sockets over a variety of transports.
-For details see RDS_README.txt and man 7 rds.
-
-===============================================================================
-2. Supported Platforms
-===============================================================================
-
-Same as overall OFED release.
-
-===============================================================================
-3. Installation & Configuration
-===============================================================================
-To install RDS, select rds in OFED's manual installation or put 'rds=y' in
-ofed.conf for unattended installation.
-
-To load the RDS module at boot, edit the file '/etc/infiniband/openib.conf'
-as follows:
-
-# Load RDS module
-RDS_LOAD=yes
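
The edit described above could be scripted roughly as follows. This is only a sketch: the helper name enable_rds_load is hypothetical and not part of OFED, and it assumes openib.conf uses plain KEY=value lines as shown in the snippet above.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with OFED): make sure RDS_LOAD=yes is
# present in the given openib.conf, replacing an existing RDS_LOAD line
# or appending one when none is found.
enable_rds_load() {
    conf=${1:-/etc/infiniband/openib.conf}
    [ -f "$conf" ] || { echo "no such file: $conf" >&2; return 1; }
    if grep -q '^RDS_LOAD=' "$conf"; then
        tmp=$(mktemp) || return 1
        # Rewrite through a temporary file so the original keeps its
        # ownership and permissions.
        sed 's/^RDS_LOAD=.*/RDS_LOAD=yes/' "$conf" > "$tmp" && cat "$tmp" > "$conf"
        rm -f "$tmp"
    else
        printf '\n# Load RDS module\nRDS_LOAD=yes\n' >> "$conf"
    fi
}
```

Running "enable_rds_load" with no argument (as root) would act on /etc/infiniband/openib.conf; passing a path makes it easy to try on a copy first.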
-
-===============================================================================
-4. New Features
-===============================================================================
-
-GET_MR_FOR_DEST sockopt added. This allows a MR to be associated with
-a remote host. GET_MR sockopt deprecated.
-
-Transports now modularized: rds_rdma.ko (IB and iWARP) and
-rds_tcp.ko. This enables RDS use with TCP, without the IB stack
-loaded.
-
-Improved receive processing to lower the amount of time spent with interrupts
-disabled.
-
-===============================================================================
-5. Bug fixes and Enhancements since OFED 1.4
-===============================================================================
-
-* Set retry_count to 2 and make modifiable via modparam
-* Many locking fixes
-* Rebasing to mainline kernel 2.6.30 resulted in the RDS trace framework
- being removed.
-
-===============================================================================
-6. Bug fixes and Enhancements since OFED 1.3.1
-===============================================================================
-- RDMA completion notifications are signalled when the IB stack gives us the
- completion event for the accompanying RDS message. This is a change from the
- 1.3.x behavior, which signalled completion notifications when the RDS message
- was ACKed.
-- Fixed bugs associated with congestion monitoring.
-- FMR pool size increased from 2K to 4K
-- Added support for RDMA_CM_EVENT_ADDR_CHANGE event.
-- RDS should now work on QLogic HCAs.
-
-===============================================================================
-7. Bug fixes and Enhancements since OFED 1.3
-===============================================================================
-- Fix a bug in RDMA signaling
-- Add 3 more stats counters
-- Fix a kernel crash that can occur when RDS/IB connection drops
-- Fixes for RDMA API
-
-===============================================================================
-8. Bug fixes and Enhancements since OFED 1.2
-===============================================================================
-
-1) The wire protocols for RDS v3 and RDS v2 are not compatible.
-
-2) RDS over TCP is disabled in OFED 1.3. It will be re-enabled in a future
-   release.
-
-3) Congestion monitoring support gives the application more fine-grained
- control.
-
-With explicit monitoring, the application polls for POLLIN as before, and
-additionally uses the RDS_CONG_MONITOR socket option to install a 64bit mask
-value in the socket, where each bit corresponds to a group of ports.
-When a congestion update arrives, RDS checks the set of ports that became
-uncongested against the bit mask installed in the socket. If they overlap, a
-control message is enqueued on the socket, and the application is woken up.
-When the application calls recvmsg(2), it will be given the control message
-containing the bitmap on the socket.
-
-===============================================================================
-9. Known Issues
-===============================================================================
-1. RDMAs over 1 MiB not supported.
--- /dev/null
+ Open Fabrics Enterprise Distribution (OFED)
+ How To Build OFED 1.5.1
+
+ March 2010
+
+
+==============================================================================
+Table of contents
+==============================================================================
+1. Overview
+2. Usage
+3. Requirements
+
+==============================================================================
+1. Overview
+==============================================================================
+The script "build.pl" is used to build the OFED package based on the
+OpenFabrics project. The package is built under the /tmp directory.
+
+See OFED_release_notes.txt for more details.
+
+==============================================================================
+2. Usage
+==============================================================================
+
+The build script for the OFED package can be downloaded from:
+ git://git.openfabrics.org/~vlad/build.git
+ branch: master
+
+Name: build.pl
+
+Usage: ./build.pl --version <version> [-r|--release]|[--daily] [-d|--distribution <distribution name>] [-v|--verbose]
+ [-b|--builddir <build directory>]
+ [-p|--packagesdir <packages directory>]
+ [--pre-build <pre-build script>]
+ [--skip-prebuild]
+ [--post-build <post-build script>]
+ [--skip-postbuild]
+
+Example:
+
+ ./build.pl --version 1.5.1-rc1 -p packages-ofed
+
+ This command will create a package (i.e., subtree) called OFED-1.5.1-rc1
+ under /tmp/$USER/
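
As a small sketch of the example above, the helpers below compose a build.pl command line and the directory where the resulting tree is expected to appear. The function names ofed_build_cmd and ofed_build_dir are hypothetical, and the /tmp/$USER/OFED-<version> layout is taken from the example rather than verified against the script.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; they are not part of build.pl.

# Print the build.pl invocation for a version and a packages directory.
ofed_build_cmd() {
    version=$1
    pkgdir=${2:-packages-ofed}
    printf './build.pl --version %s -p %s\n' "$version" "$pkgdir"
}

# Print the directory where the package tree is expected to be created,
# following the /tmp/$USER/OFED-<version> pattern from the example.
ofed_build_dir() {
    printf '/tmp/%s/OFED-%s\n' "${USER:-$(id -un)}" "$1"
}

ofed_build_cmd 1.5.1-rc1
ofed_build_dir 1.5.1-rc1
```

Printing the command instead of running it keeps the sketch safe to try on a machine without the build repository checked out.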
+
+==============================================================================
+3. Requirements
+==============================================================================
+
+1. Git:
+ Can be downloaded from:
+ http://www.kernel.org/pub/software/scm/git
+
+2. Autotools:
+
+ libtool-1.5.20 or higher
+ autoconf-2.59 or higher
+ automake-1.9.6 or higher
+ m4-1.4.4 or higher
+
+ The above tools can be downloaded from the following URLs:
+
+ libtool - "http://ftp.gnu.org/gnu/libtool/libtool-1.5.20.tar.gz"
+ autoconf - "http://ftp.gnu.org/gnu/autoconf/autoconf-2.59.tar.gz"
+ automake - "http://ftp.gnu.org/gnu/automake/automake-1.9.6.tar.gz"
+ m4 - "http://ftp.gnu.org/gnu/m4/m4-1.4.4.tar.gz"
+
+3. wget or ssh client
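
The minimum versions listed above can be checked before starting a build. The following is a minimal sketch, assuming GNU-style "--version" output and a sort that supports -V; the function names version_ge and check_tool are illustrative and not part of OFED.

```shell
#!/bin/sh
# Hypothetical pre-build check; names are illustrative only.

# Succeed when version $1 is greater than or equal to version $2.
# Relies on "sort -V" (version sort), available in GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Report whether a tool is installed and meets the minimum version.
check_tool() {
    tool=$1
    min=$2
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: not found (need $min or higher)"
        return 0
    fi
    # Take the first dotted number on the first line of --version output.
    found=$("$tool" --version 2>/dev/null | head -n 1 |
            grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1)
    if [ -z "$found" ]; then
        echo "$tool: version not recognized (need $min or higher)"
    elif version_ge "$found" "$min"; then
        echo "$tool: $found (ok, need $min or higher)"
    else
        echo "$tool: $found (too old, need $min or higher)"
    fi
}

for spec in libtool:1.5.20 autoconf:2.59 automake:1.9.6 m4:1.4.4; do
    check_tool "${spec%%:*}" "${spec#*:}"
done
```

The check only reports; it does not abort, so it can be run on any host to see which of the four tools would need upgrading.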
--- /dev/null
+===============================================================================
+ MLNX_EN driver for Mellanox Adapter Cards with 10GigE Support
+ README for OFED 1.5.2
+
+ December 2010
+===============================================================================
+
+Contents:
+=========
+1. Overview
+2. Ethernet Driver Usage and Configuration
+
+
+1. Overview
+===========
+The MLNX_EN driver is composed of the mlx4_core and mlx4_en kernel modules.
+
+The MLNX_EN driver release exposes the following capabilities:
+- Single/Dual port
+- Fibre Channel over Ethernet (FCoE)
+- Up to 16 Rx queues per port
+- 5 TX queues per port
+- Rx steering mode: Receive Core Affinity (RCA)
+- Tx arbitration mode: VLAN user-priority (off by default)
+- MSI-X or INTx
+- Adaptive interrupt moderation
+- HW Tx/Rx checksum calculation
+- Large Send Offload (i.e., TCP Segmentation Offload)
+- Large Receive Offload
+- IP reassembly offload for fragmented IP packets
+- Multi-core NAPI support
+- VLAN Tx/Rx acceleration (HW VLAN stripping/insertion)
+- HW VLAN filtering
+- HW multicast filtering
+- ifconfig up/down + mtu changes (up to 10K)
+- Ethtool support
+- Net device statistics
+
+
+2. Ethernet Driver Usage and Configuration
+==========================================
+
+- To assign an IP address to the interface run:
+ #> ifconfig eth<x> <ip>
+
+ where 'x' is the OS assigned interface number.
+
+- To check driver and device information run:
+ #> ethtool -i eth<x>
+
+ Example:
+ #> ethtool -i eth2
+ driver: mlx4_en (MT_0BD0110004)
+ version: 1.5.2 (March 2010)
+ firmware-version: 2.8.000
+ bus-info: 0000:0e:00.0
+
+- To query stateless offload status run:
+ #> ethtool -k eth<x>
+
+- To set stateless offload status run:
+ #> ethtool -K eth<x> [rx on|off] [tx on|off] [sg on|off] [tso on|off]
+
+- To query interrupt coalescing settings run:
+ #> ethtool -c eth<x>
+
+- By default, the driver uses adaptive interrupt moderation for the receive path,
+ which adjusts the moderation time according to the traffic pattern.
+ Adaptive moderation can be turned on or off with:
+ #> ethtool -C eth<x> adaptive-rx on|off
+
+- To set interrupt coalescing settings run:
+ #> ethtool -C eth<x> [rx-usecs N] [rx-frames N] [tx-usecs N] [tx-frames N]
+
+ Note: usec settings correspond to the time to wait after the *last* packet
+ sent/received before triggering an interrupt
+
+- To query pause frame settings run:
+ #> ethtool -a eth<x>
+
+- To set pause frame settings run:
+ #> ethtool -A eth<x> [rx on|off] [tx on|off]
+
+- To query ring size values run:
+ #> ethtool -g eth<x>
+
+- To modify ring sizes run:
+ #> ethtool -G eth<x> [rx <N>] [tx <N>]
+
+- To obtain additional device statistics, run:
+ #> ethtool -S eth<x>
+
+- To perform a self diagnostics test, run:
+ #> ethtool -t eth<x>
+
+
+The driver defaults to the following parameters:
+- Both ports are activated (i.e., a net device is created for each port)
+- The number of Rx rings for each port is the number of on-line CPUs
+- Per-core NAPI is enabled
+- LRO is enabled with 32 concurrent sessions per Rx ring
+
+Some of these values can be changed using module parameters, which are
+detailed by running:
+#> modinfo mlx4_en
+
+To set non-default values to module parameters, the following line should be
+added to /etc/modprobe.conf file:
+ "options mlx4_en <param_name>=<value> <param_name>=<value> ..."
+
+Values of all parameters can be observed in /sys/module/mlx4_en/parameters/.
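
To dump the current values in one go, something along these lines can be used. It is a sketch only: the helper name show_module_params is hypothetical, and it assumes each entry under the parameters directory is a small readable text file, as is usual for sysfs module parameters.

```shell
#!/bin/sh
# Hypothetical helper: print name=value for every parameter file in a
# module's sysfs parameters directory (defaults to mlx4_en).
show_module_params() {
    dir=${1:-/sys/module/mlx4_en/parameters}
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # Some parameters may not be readable by unprivileged users.
        printf '%s=%s\n' "$(basename "$f")" "$(cat "$f" 2>/dev/null)"
    done
}
```

Calling "show_module_params" with no argument lists the mlx4_en parameters once the module is loaded; a different directory can be passed to inspect another module.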
+
+
--- /dev/null
+ Open Fabrics Enterprise Distribution (OFED)
+ MPI in OFED 1.5.2 README
+
+ September 2010
+
+===============================================================================
+Table of Contents
+===============================================================================
+1. Overview
+2. MVAPICH
+3. Open MPI
+4. MVAPICH2
+
+
+===============================================================================
+1. Overview
+===============================================================================
+Three MPI stacks are included in
+this release of OFED:
+- MVAPICH 1.2.0
+- Open MPI 1.4.2
+- MVAPICH2 1.5.1
+
+Setup, compilation and run information of MVAPICH, Open MPI and MVAPICH2 is
+provided below in sections 2, 3 and 4 respectively.
+
+1.1 Installation Note
+---------------------
+In Step 2 of the main menu of install.pl, options 2, 3 and 4 can install
+one or more MPI stacks. Please refer to docs/OFED_Installation_Guide.txt
+to learn about the different options.
+
+The installation script allows each MPI to be compiled using one or
+more compilers. Users need to set, per MPI stack installed, the PATH
+and/or LD_LIBRARY_PATH so as to use the desired compiled MPI stack.
+
+1.2 MPI Tests
+-------------
+OFED includes four basic tests that can be run against each MPI stack:
+bandwidth (bw), latency (lt), Intel MPI Benchmark, and Presta. The tests
+are located under: <prefix>/mpi/<compiler>/<mpi stack>/tests/,
+where <prefix> is /usr by default.
+
+1.3 Selecting Which MPI to Use: mpi-selector
+--------------------------------------------
+Depending on how the OFED installer was run, multiple different MPI
+implementations may be installed on your system. The OFED installer
+will run an MPI selector tool during the installation process,
+presenting a menu-based interface to select which MPI implementation
+is set as the default for all users. This MPI selector tool can be
+re-run at any time by the administrator after the OFED installer
+completes to modify the site-wide default MPI implementation selection
+by invoking the "mpi-selector-menu" command (root access is typically
+required to change the site-wide default).
+
+The mpi-selector-menu command can also be used by non-administrative
+users to override the site-wide default MPI implementation selection
+by setting a per-user default. Specifically: unless a user runs the
+MPI selector tool to set a per-user default, their environment will be
+set up for the site-wide default MPI implementation.
+
+Note that the default MPI selection does *not* affect the shell from
+which the command was invoked (or any other shells that were already
+running when the MPI selector tool was invoked). The default
+selection is only changed for *new* shells started after the selector
+tool was invoked. It is recommended that once the default MPI
+implementation is changed via the selector tool, users should logout
+and login again to ensure that they have a consistent view of the
+default MPI implementation. Other tools can be used to change the MPI
+environment in the current shell, such as the environment modules
+software package (which is not included in the OFED software package;
+see http://modules.sourceforge.net/ for details).
+
+Note that the site-wide default is set in a file that is typically not
+on a networked file system, and is therefore specific to the host on
+which it was run. As such, it is recommended to run the
+mpi-selector-menu command on all hosts in a cluster, picking the same
+default MPI implementation on each. It may be more convenient,
+however, to use the mpi-selector command in script-based scenarios
+(such as running on every host in a cluster); mpi-selector provides all
+the same functionality as mpi-selector-menu, but is intended for
+automated environments. See the mpi-selector(1) manual page for more
+details.
+
+Additionally, per-user defaults are set in a file in the user's $HOME
+directory. If this directory is not on a network-shared file system
+between all hosts that will be used for MPI applications, then it also
+needs to be propagated to all relevant hosts.
+
+Note: The MPI selector tool typically sets the PATH and/or
+LD_LIBRARY_PATH for a given MPI implementation. This step can, of
+course, also be performed manually by a user or on a site-wide basis.
+The MPI selector tool simply bundles up this functionality in a
+convenient set of command line tools and menus.
+
+1.4 Updating MPI Installations
+------------------------------
+Note that all of the MPI implementations included in the OFED software
+package are the versions that were available when OFED v1.5 was
+released. They have been QA tested with this version of OFED and are
+fully supported.
+
+However, note that administrators can go to the web sites of each MPI
+implementation and download / install newer versions after OFED has
+been successfully installed. There is nothing specific about the
+OFED-included MPI software packages that prohibits installing
+newer/other MPI implementations.
+
+It should be also noted that versions of MPI released after OFED v1.5
+are not supported by OFED. But since each MPI has its own release
+schedule and QA process (each of which involves testing with the OFED
+stack), it may sometimes be desirable -- or even advisable, depending
+on how old the MPI implementations are that are included in OFED -- to
+download and install a newer version of MPI.
+
+The web sites of each MPI implementation are listed below:
+
+- Open MPI: http://www.open-mpi.org/
+- MVAPICH: http://mvapich.cse.ohio-state.edu/
+- MVAPICH2: http://mvapich.cse.ohio-state.edu/overview/mvapich2/
+
+===============================================================================
+2. MVAPICH MPI
+===============================================================================
+
+This package is a 1.2.0 version of the MVAPICH software package,
+and is the officially supported MPI stack for this release of OFED.
+See http://mvapich.cse.ohio-state.edu for more details.
+
+
+2.1 Setting up for MVAPICH
+--------------------------
+To launch MPI jobs, the MVAPICH installation directory needs to be
+included in PATH and LD_LIBRARY_PATH. To set them, execute one of the
+following commands:
+ source <prefix>/mpi/<compiler>/<mpi stack>/bin/mpivars.sh
+ -- when using sh for launching MPI jobs
+ or
+ source <prefix>/mpi/<compiler>/<mpi stack>/bin/mpivars.csh
+ -- when using csh for launching MPI jobs
+
+
+2.2 Compiling MVAPICH Applications:
+-----------------------------------
+***Important note***:
+A valid Fortran compiler must be present in order to build the MVAPICH MPI
+stack and tests.
+
+The default gcc-g77 Fortran compiler is provided with all RedHat Linux
+releases. SuSE distributions earlier than SuSE Linux 9.0 do not provide
+this compiler as part of the default installation.
+
+The following compilers are supported by OFED's MVAPICH package: Gcc,
+Intel, Pathscale and PGI. The install script prompts the user to choose
+the compiler with which to build the MVAPICH RPM. Note that more
+than one compiler can be selected simultaneously, if desired.
+
+For details see:
+ http://mvapich.cse.ohio-state.edu/support
+
+To review the default configuration of the installation, check the default
+configuration file: <prefix>/mpi/<compiler>/<mpi stack>/etc/mvapich.conf
+
+2.3 Running MVAPICH Applications:
+---------------------------------
+Requirements:
+o At least two nodes. Example: mtlm01, mtlm02
+o Machine file: Includes the list of machines. Example: /root/cluster
+o Bidirectional rsh or ssh without a password
+
+Note: ssh will be used unless -rsh is specified. In order to use
+rsh, add to the mpirun_rsh command the parameter: -rsh
+
+*** Running OSU tests ***
+
+/usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.2.0/tests/osu_benchmarks-3.1.1/osu_bw
+/usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.2.0/tests/osu_benchmarks-3.1.1/osu_latency
+/usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.2.0/tests/osu_benchmarks-3.1.1/osu_bibw
+/usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.2.0/tests/osu_benchmarks-3.1.1/osu_bcast
+
+*** Running Intel MPI Benchmark test (Full test) ***
+
+/usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.2.0/tests/IMB-3.2/IMB-MPI1
+
+*** Running Presta test ***
+
+/usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.2.0/tests/presta-1.4.0/com -o 100
+/usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.2.0/tests/presta-1.4.0/glob -o 100
+/usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.2.0/tests/presta-1.4.0/globalop
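+
+The command lines above all follow one pattern, so they can be generated
+in a loop. The following is a minimal sketch, assuming the default gcc
+install prefix and the /root/cluster machine file used above. The names
+MPI_HOME, HOSTFILE and osu_commands are chosen for this illustration only.
+The script prints the commands (a dry run) so they can be reviewed first.
+
```shell
#!/bin/sh
# Assumed defaults taken from the examples above; adjust for your site.
MPI_HOME="${MPI_HOME:-/usr/mpi/gcc/mvapich-1.2.0}"
HOSTFILE="${HOSTFILE:-/root/cluster}"

# Print one mpirun_rsh command line per OSU benchmark (nothing is executed).
osu_commands() {
    for t in osu_bw osu_latency osu_bibw osu_bcast; do
        echo "$MPI_HOME/bin/mpirun_rsh -np 2 -hostfile $HOSTFILE" \
             "$MPI_HOME/tests/osu_benchmarks-3.1.1/$t"
    done
}

osu_commands
```
+
+To actually run the benchmarks, pipe the output to a shell, for example:
+osu_commands | sh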
+
+
+===============================================================================
+3. Open MPI
+===============================================================================
+
+Open MPI is a next-generation MPI implementation from the Open MPI
+Project (http://www.open-mpi.org/). Version 1.4 of Open MPI is
+included in this release, which is also available directly from the
+main Open MPI web site.
+
+A working Fortran compiler is not required to build Open MPI, but some
+of the included MPI tests are written in Fortran. These tests will
+not compile/run if Open MPI is built without Fortran support.
+
+The following compilers are supported by OFED's Open MPI package: GNU,
+Pathscale, Intel, or Portland. The install script prompts the user
+for the compiler with which to build the Open MPI RPM. Note that more
+than one compiler can be selected simultaneously, if desired.
+
+Users should check the main Open MPI web site for additional
+documentation and support. (Note: The FAQ covers OpenFabrics
+tuning, among other issues.)
+
+3.1 Setting up for Open MPI
+---------------------------
+Selecting Open MPI via the mpi-selector-menu and mpi-selector tools
+performs all the necessary setup for users to build and run Open MPI
+applications. If you use the MPI selector tools, you can skip the
+rest of this section.
+
+If you do not wish to use the MPI selector tools, the Open MPI team
+strongly advises users to put the Open MPI installation directory in
+their PATH and LD_LIBRARY_PATH. This can be done at the system level
+if all users are going to use Open MPI. Specifically:
+
+- add <prefix>/bin to PATH
+- add <prefix>/lib to LD_LIBRARY_PATH
+
+<prefix> is the directory where the desired Open MPI instance was
+installed ("instance" refers to the compiler used for Open MPI
+compilation at install time).
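+
+The two additions above can be sketched as a small shell function. This
+is an illustration only: ompi_setup_env is a hypothetical helper name, and
+the prefix shown is the default gcc build location used in the examples
+later in this file, which may differ on your system.
+
```shell
#!/bin/sh
# Prepend an Open MPI installation to PATH and LD_LIBRARY_PATH.
# ompi_setup_env is a hypothetical helper; $1 is the <prefix> directory.
ompi_setup_env() {
    prefix="$1"
    # ${VAR:+:$VAR} appends the old value only if it was non-empty.
    PATH="${prefix}/bin${PATH:+:$PATH}"
    LD_LIBRARY_PATH="${prefix}/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export PATH LD_LIBRARY_PATH
}

# Example prefix (an assumption; use your own installation directory).
ompi_setup_env /usr/mpi/gcc/openmpi-1.4.1
echo "PATH=$PATH"
```
+
+Placing the function call in a shell startup file makes the setting
+persistent for interactive logins.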
+
+If you are using a job scheduler to launch MPI jobs (e.g., SLURM,
+Torque), setting the PATH and LD_LIBRARY_PATH is still required, but
+they do not need to be set in your shell startup files. Procedures
+for adding these values to PATH and LD_LIBRARY_PATH are described in
+detail at:
+
+ http://www.open-mpi.org/faq/?category=running
+
+3.2 Open MPI Installation Support / Updates
+-------------------------------------------
+The OFED package will install Open MPI with support for TCP, shared
+memory, and the OpenFabrics network stacks. No other networks are
+supported by the OFED Open MPI installation.
+
+Open MPI supports a wide variety of run-time environments. The OFED
+installer will not include support for all of them, however (e.g.,
+Torque and PBS-based environments are not supported by the
+OFED-installed Open MPI).
+
+The ompi_info command can be used to see what support was installed;
+look for plugins for your specific environment / network / etc. If
+you do not see them, the OFED installer did not include support for
+them.
+
+As described above, administrators or users can go to the Open MPI web
+site and download / install either a newer version of Open MPI (if
+available), or the same version with different configuration options
+(e.g., support for Torque / PBS-based environments).
+
+3.3 Compiling Open MPI Applications
+-----------------------------------
+(copied from http://www.open-mpi.org/faq/?category=mpi-apps -- see
+this web page for more details)
+
+The Open MPI team strongly recommends that you simply use Open MPI's
+"wrapper" compilers to compile your MPI applications. That is, instead
+of using (for example) gcc to compile your program, use mpicc. Open
+MPI provides a wrapper compiler for four languages:
+
+ Language Wrapper compiler name
+ ------------- --------------------------------
+ C mpicc
+ C++ mpiCC, mpicxx, or mpic++
+ (note that mpiCC will not exist
+ on case-insensitive file-systems)
+ Fortran 77 mpif77
+ Fortran 90 mpif90
+ ------------- --------------------------------
+
+Note that if no Fortran 77 or Fortran 90 compilers were found when
+Open MPI was built, Fortran 77 and 90 support will automatically be
+disabled (respectively).
+
+If you expect to compile your program as:
+
+ > gcc my_mpi_application.c -lmpi -o my_mpi_application
+
+Simply use the following instead:
+
+ > mpicc my_mpi_application.c -o my_mpi_application
+
+Specifically: simply adding "-lmpi" to your normal compile/link
+command line *will not work*. See
+http://www.open-mpi.org/faq/?category=mpi-apps if you cannot use the
+Open MPI wrapper compilers.
+
+Note that Open MPI's wrapper compilers do not do any actual compiling
+or linking; all they do is manipulate the command line and add in all
+the relevant compiler / linker flags and then invoke the underlying
+compiler / linker (hence, the name "wrapper" compiler). More
+specifically, if you run into a compiler or linker error, check your
+source code and/or back-end compiler -- it is usually not the fault of
+the Open MPI wrapper compiler.
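+
+To see what a wrapper compiler adds, Open MPI's wrappers accept a
+--showme option that prints the underlying command line instead of
+running it. The sketch below wraps that in a check that the wrapper is
+actually in PATH. show_wrapper_flags is a hypothetical helper name.
+
```shell
#!/bin/sh
# Print the back-end compiler command line that an Open MPI wrapper uses.
# show_wrapper_flags is a hypothetical helper; $1 defaults to mpicc.
show_wrapper_flags() {
    wrapper="${1:-mpicc}"
    if ! command -v "$wrapper" >/dev/null 2>&1; then
        echo "$wrapper not found in PATH" >&2
        return 1
    fi
    "$wrapper" --showme
}

show_wrapper_flags mpicc || echo "(set up PATH for Open MPI first)"
```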
+
+3.4 Running Open MPI Applications:
+----------------------------------
+Open MPI uses either the "mpirun" or "mpiexec" commands to launch
+applications. If your cluster uses a resource manager (such as
+SLURM), providing a hostfile is not necessary:
+
+ > mpirun -np 4 my_mpi_application
+
+If you use rsh/ssh to launch applications, they must be set up to NOT
+prompt for a password (see http://www.open-mpi.org/faq/?category=rsh
+for more details on this topic). Moreover, you need to provide a
+hostfile containing a list of hosts to run on.
+
+Example:
+
+ > cat hostfile
+ host1.example.com
+ host2.example.com
+ host3.example.com
+ host4.example.com
+
+ > mpirun -np 4 -hostfile hostfile my_mpi_application
+ (application runs on all 4 hosts)
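+
+When one process per host is wanted, the -np value can be derived from
+the hostfile instead of being typed by hand. A minimal sketch, assuming
+a hostfile with one host per line in which blank lines and lines starting
+with '#' are to be ignored. count_hosts is a hypothetical helper name and
+the host names are placeholders. The mpirun command is only printed.
+
```shell
#!/bin/sh
# Count usable host entries: skip blank lines and '#' comment lines.
count_hosts() {
    grep -v '^[[:space:]]*#' "$1" | grep -c '[^[:space:]]'
}

# Demo with a throwaway hostfile (host names are placeholders).
hostfile=$(mktemp)
printf '%s\n' host1.example.com host2.example.com \
              host3.example.com host4.example.com > "$hostfile"
n=$(count_hosts "$hostfile")
echo "mpirun -np $n -hostfile $hostfile my_mpi_application"
rm -f "$hostfile"
```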
+
+In the following examples, replace <N> with the number of MPI processes
+to run, and <HOSTFILE> with the filename of a valid hostfile listing the
+hosts to run on (unless you are running under a supported resource
+manager, in which case a hostfile is unnecessary).
+
+Also note that Open MPI is highly run-time tunable. There are many
+options that can be tuned to obtain optimal performance of your MPI
+applications (see the Open MPI web site / FAQ for more information:
+http://www.open-mpi.org/faq/).
+
+ - <N> is an integer indicating how many MPI processes to run (e.g., 2)
+ - <HOSTFILE> is the filename of a hostfile, as described above
+
+Example 1: Running the OSU bandwidth benchmark:
+
+ > cd /usr/mpi/gcc/openmpi-1.4.1/tests/osu_benchmarks-3.1.1
+ > mpirun -np <N> -hostfile <HOSTFILE> osu_bw
+
+Example 2: Running the Intel MPI Benchmarks:
+
+ > cd /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2
+ > mpirun -np <N> -hostfile <HOSTFILE> IMB-MPI1
+
+ --> Note that the version of IMB-EXT that ships in this version of
+ OFED contains a bug that will cause it to immediately error
+ out when run with Open MPI.
+
+Example 3: Running the Presta benchmarks:
+
+ > cd /usr/mpi/gcc/openmpi-1.4.1/tests/presta-1.4.0
+ > mpirun -np <N> -hostfile <HOSTFILE> com -o 100
+
+NOTE: To run Open MPI over a RoCEE (RDMAoE) network, the following MCA
+ parameter is required:
+ --mca btl_openib_cpc_include rdmacm
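+
+As a sketch of where the parameter goes, the helper below prints a
+complete mpirun command line with it in place. rocee_mpirun_cmd is a
+hypothetical helper name, and the arguments stand in for <N>, <HOSTFILE>
+and the application.
+
```shell
#!/bin/sh
# Print an mpirun command line that selects the RDMA CM connection manager.
# rocee_mpirun_cmd is a hypothetical helper: $1=<N> $2=<HOSTFILE> $3=program.
rocee_mpirun_cmd() {
    echo "mpirun --mca btl_openib_cpc_include rdmacm -np $1 -hostfile $2 $3"
}

rocee_mpirun_cmd 2 hostfile osu_bw

# Alternative: Open MPI also reads MCA parameters from environment
# variables named OMPI_MCA_<parameter>, e.g.:
#   export OMPI_MCA_btl_openib_cpc_include=rdmacm
```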
+
+
+3.5 More Open MPI Information
+-----------------------------
+Much, much more information is available about using and tuning Open
+MPI (including OpenFabrics-specific tunable parameters) on the Open
+MPI web site FAQ:
+
+ http://www.open-mpi.org/faq/
+
+Users who cannot find the answers that they are looking for, or are
+experiencing specific problems should consult the "how to get help" web
+page for more information:
+
+ http://www.open-mpi.org/community/help/
+
+
+===============================================================================
+4. MVAPICH2 MPI
+===============================================================================
+
+MVAPICH2 is an MPI-2 implementation which includes all MPI-1 features.
+It is based on MPICH2 and MVICH. MVAPICH2 provides many features including
+fault-tolerance with checkpoint-restart, RDMA_CM support, iWARP support,
+optimized collectives, on-demand connection management, multi-core optimized
+and scalable shared memory support, and memory hook with ptmalloc2 library
+support. The ADI-3-level design of MVAPICH2 supports many features including:
+MPI-2 functionalities (one-sided, collectives and data-type), multi-threading
+and all MPI-1 functionalities. It also supports a wide range of platforms
+(architecture, OS, compilers, InfiniBand adapters and iWARP adapters). More
+information can be found on the MVAPICH2 project site:
+
+http://mvapich.cse.ohio-state.edu/overview/mvapich2/
+
+A valid Fortran compiler must be present in order to build the MVAPICH2
+MPI stack and tests. The following compilers are supported by OFED's
+MVAPICH2 MPI package: gcc, intel, pgi, and pathscale. The install script
+prompts the user to choose the compiler with which to build the MVAPICH2
+MPI RPM. Note that more than one compiler can be selected simultaneously,
+if desired.
+
+The install script prompts for various MVAPICH2 build options as detailed
+below:
+
+
+- Implementation (OFA or uDAPL) [default "OFA"]
+ - OFA (IB and iWARP) Options:
+ - ROMIO Support [default Y]
+ - Shared Library Support [default Y]
+ - Checkpoint-Restart Support [default N]
+ * requires an installation of BLCR and prompts for the
+ BLCR installation directory location
+ - uDAPL Options:
+ - ROMIO Support [default Y]
+ - Shared Library Support [default Y]
+ - Cluster Size [default "Small"]
+ - I/O Bus [default "PCI-Express"]
+ - Link Speed [default "SDR"]
+ - Default DAPL Provider [default ""]
+ * the default provider is determined based on detected OS
+
+For non-interactive builds where no MVAPICH2 build options are stored in
+the OFED configuration file, the default settings are:
+
+Implementation: OFA
+ROMIO Support: Y
+Shared Library Support: Y
+Checkpoint-Restart Support: N
+
+
+4.1 Setting up for MVAPICH2
+---------------------------
+Selecting to use MVAPICH2 via the MPI selector tools will perform
+most of the setup necessary to build and run MPI applications with
+MVAPICH2. If one does not wish to use the MPI Selector tools, using
+the following settings should be enough:
+
+ - add <prefix>/bin to PATH
+
+The <prefix> above is the directory where the desired MVAPICH2
+instance was installed ("instance" refers to the path based on
+the RPM package name, including the compiler chosen during the
+install). It is also possible to source the following files
+in order to set up the proper environment:
+
+source <prefix>/bin/mpivars.sh [for Bourne based shells]
+source <prefix>/bin/mpivars.csh [for C based shells]
+
+In addition to the user environment settings handled by the MPI selector
+tools, some other system settings might need to be modified. MVAPICH2
+requires the memlock resource limit to be modified from the default
+in /etc/security/limits.conf:
+
+* soft memlock unlimited
+
+MVAPICH2 requires bidirectional rsh or ssh without a password to work.
+The default is ssh, and in this case it will be required to add the
+following line to the /etc/init.d/sshd script before sshd is started:
+
+ulimit -l unlimited
+
+It is also possible to specify a specific size in kilobytes instead
+of unlimited if desired.
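+
+Whether the limit is in effect can be checked from a login shell on each
+node. The following is a minimal sketch. check_memlock is a hypothetical
+helper name, and the optional minimum size argument reflects the
+"specific size in kilobytes" alternative mentioned above.
+
```shell
#!/bin/sh
# Return 0 if the locked-memory limit is "unlimited", or at least min_kb
# when a minimum is given. check_memlock is a hypothetical helper:
#   $1 = limit to test (defaults to the current "ulimit -l" value)
#   $2 = acceptable minimum in KB (default 0 = only "unlimited" passes)
check_memlock() {
    limit="${1:-$(ulimit -l)}"
    min_kb="${2:-0}"
    if [ "$limit" = "unlimited" ]; then
        return 0
    fi
    [ "$min_kb" -gt 0 ] && [ "$limit" -ge "$min_kb" ]
}

if check_memlock; then
    echo "memlock limit OK: $(ulimit -l)"
else
    echo "memlock limit is $(ulimit -l); this section recommends unlimited"
fi
```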
+
+The MVAPICH2 OFA build requires an /etc/mv2.conf file specifying the
+IP address of an InfiniBand HCA (IPoIB) for RDMA-CM functionality
+or the IP address of an iWARP adapter for iWARP functionality if
+either of those is desired. The file is not required by default; it is
+needed only if one of the following runtime environment variables is
+set when using the OFA MVAPICH2 build:
+
+RDMA-CM
+-------
+MV2_USE_RDMA_CM=1
+
+iWARP
+-----
+MV2_USE_IWARP_MODE=1
+
+Otherwise, the OFA build will work without an /etc/mv2.conf file using
+only the Infiniband HCA directly.
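+
+The sketch below shows one way these variables might be passed at launch
+time. It assumes that mpirun_rsh accepts VAR=value assignments placed
+before the executable name, as described in the MVAPICH2 user guide.
+mv2_cmd is a hypothetical helper name, and the command is only printed.
+
```shell
#!/bin/sh
# Print an mpirun_rsh command line with the chosen MVAPICH2 mode enabled.
# mv2_cmd is a hypothetical helper; mode is "rdmacm", "iwarp" or "default".
mv2_cmd() {
    mode="$1"; np="$2"; hostfile="$3"; app="$4"
    case "$mode" in
        rdmacm) env_arg="MV2_USE_RDMA_CM=1 " ;;
        iwarp)  env_arg="MV2_USE_IWARP_MODE=1 " ;;
        *)      env_arg="" ;;
    esac
    # Both optional modes rely on /etc/mv2.conf, so warn if it is missing.
    if [ -n "$env_arg" ] && [ ! -r /etc/mv2.conf ]; then
        echo "warning: /etc/mv2.conf not found or not readable" >&2
    fi
    echo "mpirun_rsh -np $np -hostfile $hostfile ${env_arg}$app"
}

mv2_cmd rdmacm 2 hostfile ./my_mpi_app
```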
+
+The MVAPICH2 uDAPL build requires an /etc/dat.conf file specifying the
+DAPL provider information. The default DAPL provider is chosen at
+build time, with a default value of "ib0"; however, it can also be
+specified at runtime by setting the following environment variable:
+
+MV2_DEFAULT_DAPL_PROVIDER=<interface>
+
+More information about MVAPICH2 can be found in the MVAPICH2 User Guide:
+
+http://mvapich.cse.ohio-state.edu/support/
+
+
+4.2 Compiling MVAPICH2 Applications
+-----------------------------------
+The MVAPICH2 compiler commands for each language are:
+
+Language Compiler Command
+-------- ----------------
+C mpicc
+C++ mpicxx
+Fortran 77 mpif77
+Fortran 90 mpif90
+
+The system compiler commands should not be used directly. The Fortran 90
+compiler command only exists if a Fortran 90 compiler was used during the
+build process.
+
+
+4.3 Running MVAPICH2 Applications
+---------------------------------
+4.3.1 Running MVAPICH2 Applications with mpirun_rsh
+---------------------------------------------------
+Starting with release 1.2, MVAPICH2 comes with a faster and more scalable
+startup based on mpirun_rsh. To launch an MPI job with mpirun_rsh,
+password-less ssh needs to be enabled across all nodes.
+
+Note: ssh will be used by default. In order to use rsh, use the -rsh option on
+the mpirun_rsh command line. For more options, see mpirun_rsh -help or the
+MVAPICH2 user guide.
+
+*** Running 4 processes on 4 nodes ***
+
+$ cat > hostfile
+node1
+node2
+node3
+node4
+$ mpirun_rsh -np 4 -hostfile hostfile /path/to/my_mpi_app
+
+*** Running OSU tests ***
+
+/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.1.1/osu_bw
+/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.1.1/osu_latency
+/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.1.1/osu_bibw
+/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.1.1/osu_bcast
+
+*** Running Intel MPI Benchmark test (Full test) ***
+
+/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/IMB-3.2/IMB-MPI1
+
+*** Running Presta test ***
+
+/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/presta-1.4.0/com -o 100
+/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/presta-1.4.0/glob -o 100
+/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/presta-1.4.0/globalop
+
+4.3.2 Running MVAPICH2 Applications with mpd and mpiexec
+--------------------------------------------------------
+Launching processes with MPD is a two-step process. First, mpdboot must
+be used to launch MPD daemons on the desired hosts. Second, the mpiexec
+command is used to launch the processes. MVAPICH2 requires bidirectional
+ssh or rsh without a password. The remote shell is specified through the
+--rsh command line option when the MPD daemons are launched with the
+mpdboot command. The default is ssh. Once the processes have finished,
+stop the MPD daemons with the mpdallexit command. The following example
+shows the basic procedure:
+
+4 Processes on 4 Hosts Example:
+
+$ cat >hostsfile
+node1.example.com
+node2.example.com
+node3.example.com
+node4.example.com
+
+$ mpdboot -n 4 -f ./hostsfile
+
+$ mpiexec -n 4 ./my_mpi_application
+
+$ mpdallexit
+
+It is also possible to use the mpirun command in place of mpiexec. They are
+actually the same command in MVAPICH2; however, using mpiexec is preferred.
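+
+The boot / run / shutdown sequence can be wrapped so that the MPD daemons
+are always stopped, even when the application fails. This is a sketch
+only: run_with_mpd and the MPDBOOT, MPIEXEC and MPDALLEXIT variables are
+names invented for this illustration (the variables make it possible to
+substitute other commands, e.g. for a dry run with echo).
+
```shell
#!/bin/sh
# Commands used by the wrapper; override them for testing or dry runs.
MPDBOOT="${MPDBOOT:-mpdboot}"
MPIEXEC="${MPIEXEC:-mpiexec}"
MPDALLEXIT="${MPDALLEXIT:-mpdallexit}"

# run_with_mpd <hosts file> <number of processes> <program> [args...]
# Boots one MPD per non-empty line of the hosts file, runs the program,
# stops the daemons, and returns the exit status of mpiexec.
run_with_mpd() {
    hosts_file="$1"; nprocs="$2"; shift 2
    nhosts=$(grep -c . "$hosts_file") || return 1
    $MPDBOOT -n "$nhosts" -f "$hosts_file" || return 1
    $MPIEXEC -n "$nprocs" "$@"
    status=$?
    $MPDALLEXIT
    return "$status"
}

# Usage (not executed here):
#   run_with_mpd ./hostsfile 4 ./my_mpi_application
```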
+
+It is possible to run more processes than hosts. In this case, multiple
+processes will run on some or all of the hosts used. The following examples
+demonstrate how to run the MPI tests. The default installation prefix and
+gcc version of MVAPICH2 are shown. In each case, it is assumed that a file
+named "hosts" listing two hosts has been created in the respective directory.
+
+OSU Tests Example:
+
+$ cd /usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.1.1
+$ mpdboot -n 2 -f ./hosts
+$ mpiexec -n 2 ./osu_bcast
+$ mpiexec -n 2 ./osu_bibw
+$ mpiexec -n 2 ./osu_bw
+$ mpiexec -n 2 ./osu_latency
+$ mpdallexit
+
+Intel MPI Benchmark Example:
+
+$ cd /usr/mpi/gcc/mvapich2-1.2p1/tests/IMB-3.2
+$ mpdboot -n 2 -f ./hosts
+$ mpiexec -n 2 ./IMB-MPI1
+$ mpdallexit
+
+Presta Benchmarks Example:
+
+$ cd /usr/mpi/gcc/mvapich2-1.2p1/tests/presta-1.4.0
+$ mpdboot -n 2 -f ./hosts
+$ mpiexec -n 2 ./com -o 100
+$ mpiexec -n 2 ./glob -o 100
+$ mpiexec -n 2 ./globalop
+$ mpdallexit
--- /dev/null
+Mellanox Technologies - www.mellanox.com
+****************************************
+
+MSTFLINT Package - Firmware Burning and Diagnostics Tools
+
+1) Overview
+ This package contains a burning tool and diagnostic tools for Mellanox
+ manufactured HCA/NIC cards. It also provides access to the relevant source
+ code. Please see the file LICENSE for licensing details.
+ This package is based on a subset of the Mellanox Firmware Tools (MFT) package.
+ For full documentation of the MFT package, please refer to the downloads page
+ on the Mellanox web site.
+
+ ----------------------------------------------------------------------------
+ NOTE:
+ This burning tool should be used only with Mellanox-manufactured
+ HCA/NIC cards. Using it with cards manufactured by other vendors
+ may be harmful to the cards (due to different configurations).
+ Using the diagnostic tools is normally safe for all HCAs/NICs.
+ ----------------------------------------------------------------------------
+
+2) Package Contents
+ a) mstflint source code
+ b) mflash lib
+ This lib provides low level Flash access through Mellanox HCAs.
+ c) mtcr lib (implemented in mtcr.h file)
+ This lib enables access to HCA hardware registers.
+ d) mstregdump utility
+ This utility dumps hardware registers from Mellanox hardware
+ for later analysis by Mellanox.
+ e) mstvpd
+ This utility dumps the on-card VPD.
+ f) mstmcra
+ This debug utility reads a single word from the device configuration space.
+
+3) Installation
+ a) Build the mstflint utility. This package is built using a standard
+ autotools method.
+
+ Example:
+ > ./configure
+ > make
+ > make install
+
+ - Run "configure --help" for custom configuration options.
+ - Typically, root privileges are required to run "make install"
+
+4) Hardware Access Device Names
+ The tools in this package require a device name in the command
+ line. The device name is the identifier of the target CA.
+ This section describes the device name formats and the HW access flow.
+
+ a) The devices can be accessed by their PCI ID as displayed by lspci
+ (bus:dev.fn).
+ Example:
+ # List all Mellanox devices
+ > /sbin/lspci -d 15b3:
+ 02:00.0 Ethernet controller: Mellanox Technologies MT25448 [ConnectX EN 10GigE, PCIe 2.0 2.5GT/s] (rev a0)
+
+ # Use mstflint tool to query the firmware on this device
+ > mstflint -d 02:00.0 q
+
+ b) When the IB driver (mlx4 or mthca) is loaded, the devices can be accessed
+ by their IB device name.
+ Example:
+ # List the IB devices
+ > ibv_devinfo | grep hca_id
+ hca_id: mlx4_0
+
+ # Use mstvpd tool to dump the VPD of this device
+ > mstvpd mlx4_0
+
+ c) PCI configuration access
+ In examples a and b above, the device is accessed via PCI Memory Mapping.
+ The device can also be accessed by PCI configuration cycles.
+ PCI configuration access is slower and less safe than memory access --
+ use it only if methods a and b above do not work.
+
+ To force configuration access, use device names in the following format:
+ /proc/bus/pci/<bus>/<dev.fn>
+
+ Example:
+ # List all Mellanox devices
+ > /sbin/lspci -d 15b3:
+ 02:00.0 Ethernet controller: Mellanox Technologies MT25448 [ConnectX EN 10GigE, PCIe 2.0 2.5GT/s] (rev a0)
+
+ # Use mstregdump to dump HW registers, using PCI config cycles
+ > mstregdump /proc/bus/pci/02/00.0 > crdump.log
+
+ Note: Typically, you will need root privileges for hardware access
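+
+ The mapping from the lspci form to the configuration-access form is
+ mechanical. A minimal sketch follows, assuming the bus:dev.fn notation
+ shown above without a PCI domain prefix. pci_config_path is a
+ hypothetical helper name.
+
```shell
#!/bin/sh
# Convert "bus:dev.fn" (as printed by lspci) to the /proc/bus/pci form.
# Assumes no PCI domain prefix (i.e. "02:00.0", not "0000:02:00.0").
pci_config_path() {
    bdf="$1"
    bus="${bdf%%:*}"      # text before the first ':'  -> bus
    devfn="${bdf#*:}"     # text after the first ':'   -> dev.fn
    echo "/proc/bus/pci/${bus}/${devfn}"
}

pci_config_path 02:00.0    # prints /proc/bus/pci/02/00.0
```
+
+ The result can be passed directly as the device name, for example:
+ mstregdump "$(pci_config_path 02:00.0)" > crdump.log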
+
+ d) Accessing a multi-function device:
+
+ In some configurations, the CA device identifies itself as a multi-function device on PCI. For example:
+ > /sbin/lspci -d 15b3:
+ 07:00.0 Ethernet controller: Mellanox Technologies MT26468 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0)
+ 07:00.1 Ethernet controller: Mellanox Technologies MT26468 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0)
+ 07:00.2 Ethernet controller: Mellanox Technologies MT26468 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0)
+ 07:00.3 Ethernet controller: Mellanox Technologies MT26468 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0)
+ 07:00.4 Ethernet controller: Mellanox Technologies MT26468 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0)
+ 07:00.5 Ethernet controller: Mellanox Technologies MT26468 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0)
+ 07:00.6 Ethernet controller: Mellanox Technologies MT26468 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0)
+ 07:00.7 Ethernet controller: Mellanox Technologies MT26468 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0)
+
+ These multiple "logical" devices are actually a single physical device, so firmware update or "physical"
+ diagnostics should be run only on one of the functions.
+
+ When the device driver is loaded, only the primary physical function of the device can be accessed.
+ In Linux that would typically be function 0. This function can be accessed using memory mapping, as
+ described in subsection a) above. For example:
+ > mstflint -d 07:00.0 q
+
+ When the device driver is not loaded, all the functions can be accessed using configuration cycles, as
+ described in subsection c) above. It is recommended to use function 0 for FW update or diagnostics. For example:
+ > mstflint -d /proc/bus/pci/07/00.0 q
+
+5) Usage (mstflint):
+ Read mstflint usage. Enter "./mstflint -h" for a short help message, or
+ "./mstflint -hh" for a detailed help message.
+
+ Obtaining firmware files:
+ If you purchased your card from Mellanox Technologies, please use the
+ Mellanox website (www.mellanox.com, under 'Firmware' downloads) to
+ download the firmware for your card.
+ If you purchased your card from a vendor other than Mellanox, get a
+ specific firmware configuration (INI) file from your HCA card vendor and
+ generate the binary image.
+
+ Use mstflint to burn a device according to the burning instructions in
+ "mstflint -hh" and on the firmware page of the Mellanox web site.
+
+6) Usage (mstregdump):
+ An internal register dump is displayed to the standard output.
+ Please store it in a file for analysis by Mellanox.
+
+ Example:
+ > mstregdump mthca0 > dumpfile
+
+7) Usage (mstvpd):
+ A VPD dump is displayed to the standard output.
+ A list of keywords to dump can be supplied after the -- flag
+ to apply an output filter.
+
+ Examples:
+ > mstvpd mlx4_0
+ ID: Hawk Dual Port
+ PN: MNPH29C-XTR
+ EC: X2
+ SN: MT1001X00749
+ V0: PCIe Gen2 x8
+ V1: N/A
+ YA: N/A
+ RW:
+
+ > mstvpd mlx4_0 -- PN ID
+ PN: MNPH29C-XTR
+ ID: Hawk Dual Port
+
+8) Problem Reporting:
+ Please collect the following information when reporting issues:
+
+ uname -a
+ cat /etc/issue
+ cat /proc/bus/pci/devices
+ mstflint -vv
+ lspci
+ mstflint -d 02:00.0 v
+ mstflint -d 02:00.0 q
+ mstvpd 02:00.0
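+
+ The commands above can be gathered into a single file to attach to a
+ report. This is a sketch only: collect_report, its arguments and the
+ "###" header layout are choices made for this illustration. Commands
+ that fail or are not installed are recorded rather than aborting the run.
+
```shell
#!/bin/sh
# Run each diagnostic command and append its output under a header line.
# collect_report is a hypothetical helper: $1 = device, $2 = output file.
collect_report() {
    dev="$1"; out="$2"
    : > "$out"
    for cmd in "uname -a" "cat /etc/issue" "cat /proc/bus/pci/devices" \
               "mstflint -vv" "lspci" \
               "mstflint -d $dev v" "mstflint -d $dev q" "mstvpd $dev"; do
        echo "### $cmd" >> "$out"
        sh -c "$cmd" >> "$out" 2>&1 || echo "(command failed)" >> "$out"
    done
}

# Usage (typically as root; not executed here):
#   collect_report 02:00.0 /tmp/mst_report.txt
```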
+
+
--- /dev/null
+ Open Fabrics Enterprise Distribution (OFED)
+ Performance Tests README for OFED 1.5
+
+ December 2010
+
+
+
+===============================================================================
+Table of Contents
+===============================================================================
+1. Overview
+2. Notes on Testing Methodology
+3. Test Descriptions
+4. Running Tests
+
+===============================================================================
+1. Overview
+===============================================================================
+This is a collection of tests written over uverbs intended for use as a
+performance micro-benchmark. As an example, the tests can be used for
+HW or SW tuning and/or functional testing.
+
+The collection contains a set of BW and latency benchmarks such as:
+
+ * Read - ib_read_bw and ib_read_lat.
+ * Write - ib_write_bw, ib_write_bw_postlist and ib_write_lat.
+ * Send - ib_send_bw and ib_send_lat.
+ * RDMA - rdma_bw and rdma_lat.
+ * Additional benchmark: ib_clock_test.
+
+Please post results/observations/bugs/remarks to the mailing lists specified below:
+ * Maintainer - idos@dev.mellanox.co.il
+ * OFED mailing list - ewg@lists.openfabrics.org
+ or linux-rdma@vger.kernel.org
+ * http://openib.org/mailman/listinfo/openib-general
+
+===============================================================================
+2. Notes on Testing Methodology
+===============================================================================
+The benchmarks specified below are tested on the following architectures:
+- i686
+- x86_64
+- ia64
+
+- The benchmark uses the CPU cycle counter to get time stamps without a
+ context switch. Some CPU architectures (e.g., Intel's 80486 or older PPC)
+ do NOT have such a capability.
+
+- The benchmark measures round-trip time but reports half of that as one-way
+ latency. Thus, it may not be sufficiently accurate for asymmetrical
+ configurations.
+
+- On BW benchmarks, the BW is calculated on the send side only, as it calculates
+ the BW after collecting completion from the receive side.
+ If using the bidirectional flag, BW is calculated on both sides.
+
+- Min/Median/Max result is reported.
+ The median (vs average) is less sensitive to extreme scores.
+ Typically, the "Max" value is the first value measured.
+
+- Larger samples help only marginally. The default (1000) is sufficient.
+ Note that an array of cycles_t (typically unsigned long) is allocated
+ once to collect samples and again to store the difference between them.
+ Large sample sizes (e.g., 1 million) might expose other problems
+ with the program.
+
+- The "-H" option will dump the histogram for additional statistical analysis.
+ See xgraph, ygraph, r-base (http://www.r-project.org/), pspp, or other
+ statistical math programs.
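+
+The remark above about the median being less sensitive to extreme scores
+can be illustrated with a few lines of shell. The sample values below are
+made up for this illustration (the first one plays the role of a slow
+first measurement), and median and mean are hypothetical helper names.
+
```shell
#!/bin/sh
# Read one number per line on stdin and print the median.
median() {
    sort -n | awk '{ v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            if (NR % 2) print v[(NR + 1) / 2]
            else        print (v[NR / 2] + v[NR / 2 + 1]) / 2
        }'
}

# Read one number per line on stdin and print the arithmetic mean.
mean() {
    awk '{ s += $1 } END { if (NR == 0) exit 1; print s / NR }'
}

# Five made-up latency samples (usec); the first is an outlier.
samples='9.80
1.31
1.29
1.30
1.32'
echo "median: $(echo "$samples" | median)"
echo "mean:   $(echo "$samples" | mean)"
```
+
+With these samples the median stays at the typical value of 1.31, while
+the mean is pulled up to about 3 by the single outlier.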
+
+===============================================================================
+3. Test Descriptions
+===============================================================================
+
+rdma_lat.c latency test with RDMA write transactions
+rdma_bw.c streaming BW test with RDMA write transactions
+
+
+The following tests are mainly useful for HW/SW benchmarking.
+They are not intended as actual usage examples.
+
+send_lat.c latency test with send transactions
+send_bw.c BW test with send transactions
+write_lat.c latency test with RDMA write transactions
+write_bw.c BW test with RDMA write transactions
+read_lat.c latency test with RDMA read transactions
+read_bw.c BW test with RDMA read transactions
+
+The executable name of each test starts with the general prefix "ib_",
+e.g., ib_write_lat. The RDMA tests are the exception: their executables
+have the same name as the source file without the .c suffix.
+
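+===============================================================================
+4. Running Tests
+===============================================================================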
+
+Prerequisites:
+ kernel 2.6
+ ib_uverbs (kernel module) matches libibverbs
+ ("match" means binary compatible, but ideally of the same SVN rev)
+
+Server: ./<test name> <options>
+Client: ./<test name> <options> <server IP address>
+
+ o <server IP address> is an IPv4 or IPv6 address. You can use the IPoIB
+ address if IPoIB is configured.
+ o --help lists the available <options>
+
+ *** IMPORTANT NOTE: The SAME OPTIONS must be passed to both server and client.
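+
+One way to keep the two sides in step is to generate both command lines
+from a single option list. A minimal sketch follows. perftest_pair is a
+hypothetical helper name and 192.0.2.1 is a placeholder server address.
+The command lines are only printed.
+
```shell
#!/bin/sh
# Print matching server and client command lines from one option list,
# so both sides are guaranteed to use the same options.
# perftest_pair <test name> <server address> [options...]
perftest_pair() {
    test_name="$1"; server_addr="$2"; shift 2
    echo "server: ./$test_name $*"
    echo "client: ./$test_name $* $server_addr"
}

perftest_pair ib_write_bw 192.0.2.1 -s 65536 -n 1000
```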
+
+
+Common Options to all tests:
+ -p, --port=<port> Listen on/connect to port <port> (default: 18515).
+ -m, --mtu=<mtu> Mtu size (default: 1024).
+ -d, --ib-dev=<dev> Use IB device <dev> (default: first device found).
+ -i, --ib-port=<port> Use port <port> of IB device (default: 1).
+ -o, --out=<num_of_out> Number of outstanding reads. only in READ.
+ -q, --qp=<num_of_qps> Number of Qps to perform. only in write_bw.
+ -c, --connection=<c> Connection type : RC,UC,UD according to spec.
+ -g, --mcg=<num_of_qps> Number of Qps in MultiCast group. in SEND only
+ -M, --MGID=<addr> <addr> as the group MGID in format '255:1:X:X:X:X:X:X:X:X:X:X:X:X:X:X'.
+ -s, --size=<size> Size of message to exchange (default: 1).
+ -a, --all Run sizes from 2 till 2^23.
+ -t, --tx-depth=<dep> Size of tx queue (default: 50).
+ -r, --rx-depth=<dep> Make rx queue bigger than tx (default 600).
+ -n, --iters=<iters> Number of exchanges (at least 100, default: 1000).
+ -I, --inline_size=<size> Max size of message to be sent in inline mode.
+ Default is 1 on BW tests and 400 on latency tests.
+ -C, --report-cycles Report times in cpu cycle units.
+ -u, --qp-timeout=<timeout> QP timeout, timeout value is 4 usec*2 ^(timeout).
+ Default is 14.
+ -S, --sl=<sl> SL (default 0).
+ -H, --report-histogram Print out all results (Default: summary only).
+ Only on latency tests.
+ -x, --gid-index=<index> Test uses GID with GID index taken from command
+ line (for RDMAoE index should be 0).
+ -b, --bidirectional Measure bidirectional bandwidth (default uni).
+ On BW tests only (Implicit on latency tests).
+ -V, --version Display version number.
+ -e, --events Sleep on CQ events (default poll).
+ -N, --no peak-bw Cancel peak-bw calculation (default with peak-bw)
+ -F, --CPU-freq Do not fail even if cpufreq_ondemand module is loaded.
+
+ *** IMPORTANT NOTE: You need to be running a Subnet Manager on the switch or
+ on one of the nodes in your fabric.
+
+
--- /dev/null
+This is a release of the QLogic VNIC driver on OFED 1.4. This driver is
+currently supported on Intel x86 32 and 64 bit machines.
+Supported OS are:
+- RHEL 4 Update 4.
+- RHEL 4 Update 5.
+- RHEL 4 Update 6.
+- SLES 10.
+- SLES 10 Service Pack 1.
+- SLES 10 Service Pack 1 Update 1.
+- SLES 10 Service Pack 2.
+- RHEL 5.
+- RHEL 5 Update 1.
+- RHEL 5 Update 2.
+- vanilla 2.6.27 kernel.
+
+The VNIC driver in conjunction with the QLogic Ethernet Virtual I/O Controller
+(EVIC) provides Ethernet interfaces on a host with IB HCA(s) without the need
+for any physical Ethernet NIC.
+
+This file describes the use of the QLogic VNIC ULP service on an OFED stack
+and covers the following points:
+
+A) Creating QLogic VNIC interfaces
+B) Discovering VEx/EVIC IOCs present on the fabric using ib_qlgc_vnic_query
+C) Starting the QLogic VNIC driver and the VNIC interfaces
+D) Assigning IP addresses etc for the QLogic VNIC interfaces
+E) Information about the QLogic VNIC interfaces
+F) Deleting a specific QLogic VNIC interface
+G) Forced Failover feature for QLogic VNIC.
+H) Infiniband Quality of Service for VNIC.
+I) QLogic VNIC Dynamic Update Daemon Tool and Hot Swap support
+J) Information about creating VLAN interfaces
+K) Information about enabling IB Multicast for QLogic VNIC interface
+L) Basic Troubleshooting
+
+A) Creating QLogic VNIC interfaces
+
+The VNIC interfaces can be created with the help of
+the configuration file which must be placed at /etc/infiniband/qlgc_vnic.cfg.
+
+Please take a look at /etc/infiniband/qlgc_vnic.cfg.sample file (available also
+as part of the documentation) to see how VNIC configuration files are written.
+You can use this configuration file as the basis for creating a VNIC configuration
+file by copying it to /etc/infiniband/qlgc_vnic.cfg. Of course you will have to
+replace the IOCGUID, IOCSTRING values etc in the sample configuration file
+with those of the EVIC IOCs present on your fabric.
+
+(For backward compatibility, if this file is missing,
+/etc/infiniband/qlogic_vnic.cfg or /etc/sysconfig/ics_inic.cfg
+will be used for configuration)
+
+Please note that using DGID of the EVIC/VEx IOC is
+recommended as it will ensure the quickest startup of the
+VNIC service. If DGID is specified then you must also
+specify the IOCGUID. More details can be found in
+the qlgc_vnic.cfg.sample file.
+
+On a host with more than one HCA installed, VNIC interfaces can be configured
+based on HCA number and port number, or on PORTGUID.
+
+B) Discovering EVIC/VEx IOCs present on the fabric using ib_qlgc_vnic_query
+
+For writing the configuration file, you will need information
+about the EVIC/VEx IOCs present on the fabric like their IOCGUID,
+IOCSTRING etc. The ib_qlgc_vnic_query tool should be used to get this
+information.
+
+When ib_qlgc_vnic_query is executed without any options, it scans through ALL
+active IB ports on the host and obtains the detailed information about all the
+EVIC/VEx IOCs reachable through each active IB port:
+
+# ib_qlgc_vnic_query
+
+HCA No = 0, HCA = mlx4_0, Port = 1, Port GUID = 0x0002c903000010f5, State = Active
+
+ IO Unit Info:
+ port LID: 0008
+ port GID: fe8000000000000000066a11de000070
+ change ID: 0003
+ max controllers: 0x02
+
+
+ controller[ 1]
+ GUID: 00066a01de000070
+ vendor ID: 00066a
+ device ID: 000030
+ IO class : 2000
+ ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1
+ service entries: 2
+ service[ 0]: 1000066a00000001 / InfiniNIC.InfiniConSys.Control:01
+ service[ 1]: 1000066a00000101 / InfiniNIC.InfiniConSys.Data:01
+
+ IO Unit Info:
+ port LID: 0009
+ port GID: fe8000000000000000066a21de000070
+ change ID: 0003
+ max controllers: 0x02
+
+
+ controller[ 2]
+ GUID: 00066a02de000070
+ vendor ID: 00066a
+ device ID: 000030
+ IO class : 2000
+ ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2
+ service entries: 2
+ service[ 0]: 1000066a00000002 / InfiniNIC.InfiniConSys.Control:02
+ service[ 1]: 1000066a00000102 / InfiniNIC.InfiniConSys.Data:02
+
+HCA No = 0, HCA = mlx4_0, Port = 2, Port GUID = 0x0002c903000010f6, State = Active
+
+ IO Unit Info:
+ port LID: 0008
+ port GID: fe8000000000000000066a11de000070
+ change ID: 0003
+ max controllers: 0x02
+
+
+ controller[ 1]
+ GUID: 00066a01de000070
+ vendor ID: 00066a
+ device ID: 000030
+ IO class : 2000
+ ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1
+ service entries: 2
+ service[ 0]: 1000066a00000001 / InfiniNIC.InfiniConSys.Control:01
+ service[ 1]: 1000066a00000101 / InfiniNIC.InfiniConSys.Data:01
+
+ IO Unit Info:
+ port LID: 0009
+ port GID: fe8000000000000000066a21de000070
+ change ID: 0003
+ max controllers: 0x02
+
+
+ controller[ 2]
+ GUID: 00066a02de000070
+ vendor ID: 00066a
+ device ID: 000030
+ IO class : 2000
+ ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2
+ service entries: 2
+ service[ 0]: 1000066a00000002 / InfiniNIC.InfiniConSys.Control:02
+ service[ 1]: 1000066a00000102 / InfiniNIC.InfiniConSys.Data:02
+
+HCA No = 1, HCA = mlx4_1, Port = 1, Port GUID = 0x0002c90300000785, State = Down
+
+ Port State is Down. Skipping search of DM nodes on this port.
+
+HCA No = 1, HCA = mlx4_1, Port = 2, Port GUID = 0x0002c90300000786, State = Active
+
+ IO Unit Info:
+ port LID: 0008
+ port GID: fe8000000000000000066a11de000070
+ change ID: 0003
+ max controllers: 0x02
+
+
+ controller[ 1]
+ GUID: 00066a01de000070
+ vendor ID: 00066a
+ device ID: 000030
+ IO class : 2000
+ ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1
+ service entries: 2
+ service[ 0]: 1000066a00000001 / InfiniNIC.InfiniConSys.Control:01
+ service[ 1]: 1000066a00000101 / InfiniNIC.InfiniConSys.Data:01
+
+ IO Unit Info:
+ port LID: 0009
+ port GID: fe8000000000000000066a21de000070
+ change ID: 0003
+ max controllers: 0x02
+
+
+ controller[ 2]
+ GUID: 00066a02de000070
+ vendor ID: 00066a
+ device ID: 000030
+ IO class : 2000
+ ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2
+ service entries: 2
+ service[ 0]: 1000066a00000002 / InfiniNIC.InfiniConSys.Control:02
+ service[ 1]: 1000066a00000102 / InfiniNIC.InfiniConSys.Data:02
+
+This output tells the network administrator which HCAs and ports are present
+on the host and which EVIC IOCs are reachable through each IB port on the
+fabric. When ib_qlgc_vnic_query is run with the -e option, it reports the
+IOCGUID information, and with the -s option it reports the IOCSTRING
+information for the EVIC/VEx IOCs present on the fabric:
+
+# ib_qlgc_vnic_query -e
+
+HCA No = 0, HCA = mlx4_0, Port = 1, Port GUID = 0x0002c903000010f5, State = Active
+
+ ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff
+ ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff
+HCA No = 0, HCA = mlx4_0, Port = 2, Port GUID = 0x0002c903000010f6, State = Active
+
+ ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff
+ ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff
+HCA No = 1, HCA = mlx4_1, Port = 1, Port GUID = 0x0002c90300000785, State = Down
+
+ Port State is Down. Skipping search of DM nodes on this port.
+
+HCA No = 1, HCA = mlx4_1, Port = 2, Port GUID = 0x0002c90300000786, State = Active
+
+ ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff
+ ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff
+
+# ib_qlgc_vnic_query -s
+
+HCA No = 0, HCA = mlx4_0, Port = 1, Port GUID = 0x0002c903000010f5, State = Active
+
+"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1"
+"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2"
+HCA No = 0, HCA = mlx4_0, Port = 2, Port GUID = 0x0002c903000010f6, State = Active
+
+"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1"
+"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2"
+HCA No = 1, HCA = mlx4_1, Port = 1, Port GUID = 0x0002c90300000785, State = Down
+
+ Port State is Down. Skipping search of DM nodes on this port.
+
+HCA No = 1, HCA = mlx4_1, Port = 2, Port GUID = 0x0002c90300000786, State = Active
+
+"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1"
+"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2"
+
+# ib_qlgc_vnic_query -es
+
+HCA No = 0, HCA = mlx4_0, Port = 1, Port GUID = 0x0002c903000010f5, State = Active
+
+ ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1"
+ ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2"
+HCA No = 0, HCA = mlx4_0, Port = 2, Port GUID = 0x0002c903000010f6, State = Active
+
+ ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1"
+ ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2"
+HCA No = 1, HCA = mlx4_1, Port = 1, Port GUID = 0x0002c90300000785, State = Down
+
+ Port State is Down. Skipping search of DM nodes on this port.
+
+HCA No = 1, HCA = mlx4_1, Port = 2, Port GUID = 0x0002c90300000786, State = Active
+
+ ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1"
+ ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2"
+
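When writing the configuration file it can be convenient to turn the "-e" / "-es" output shown above into structured records. The Python sketch below does that for the line format seen in the sample output only; the function and field names are illustrative and not part of the tool.

```python
import re

# One IOC line as printed by "ib_qlgc_vnic_query -e" or "-es" in the
# examples above. The quoted IOC string is only present with -s.
IOC_LINE = re.compile(
    r'ioc_guid=(?P<ioc_guid>[0-9a-fA-F]+),'
    r'dgid=(?P<dgid>[0-9a-fA-F]+),'
    r'pkey=(?P<pkey>[0-9a-fA-F]+)'
    r'(?:,"(?P<ioc_string>[^"]*)")?'
)


def parse_ioc_lines(text):
    """Return a list of dicts, one per IOC line found in the text."""
    matches = (IOC_LINE.search(line) for line in text.splitlines())
    return [m.groupdict() for m in matches if m]
```

The same IOC appears once for every port through which it is reachable, so a caller may want to de-duplicate the result by ioc_guid.
+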
+ib_qlgc_vnic_query can be used to discover EVIC IOCs on the fabric based on
+umad device, HCA no/Port no and PORTGUID as follows:
+
+To query through a specific umad device, pass the name of the umad device
+with the '-d' option:
+
+# ib_qlgc_vnic_query -es -d /dev/infiniband/umad0
+
+HCA No = 0, HCA = mlx4_0, Port = 1
+
+ ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1"
+ ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2"
+
+If the name of the HCA and its port no is known, then ib_qlgc_vnic_query can
+make use of this information to discover EVIC IOCs on the fabric. HCA name
+and port no is specified with '-C' and '-P' options respectively.
+
+# ib_qlgc_vnic_query -es -C mlx4_1 -P 2
+
+ ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1"
+ ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2"
+
+If the HCA name is not specified but the port number is, HCA 0 is used as the
+default HCA to discover IOCs. If the port number is missing, Port 1 of the
+named HCA is used. If both are missing, ib_qlgc_vnic_query falls back to its
+default behaviour and scans all the IB ports on the host to discover the IOCs
+reachable through each of them.
+
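The defaulting rules for '-C' and '-P' described above can be summarised in a few lines of Python. This is only a model of the documented behaviour, not the tool's source; the tuple layout (HCA number, HCA name, port number) and the function name are assumptions made for illustration.

```python
def select_ports(ports, hca_name=None, port_no=None):
    """Pick the ports to scan from (hca_no, hca_name, port_no) tuples.

    Mirrors the defaults described in the text:
      - neither option given: every port is scanned
      - only port_no given:   that port on HCA number 0
      - only hca_name given:  port 1 of the named HCA
    """
    if hca_name is None and port_no is None:
        return list(ports)
    if hca_name is None:
        return [p for p in ports if p[0] == 0 and p[2] == port_no]
    wanted = 1 if port_no is None else port_no
    return [p for p in ports if p[1] == hca_name and p[2] == wanted]
```

For the four-port host used in the examples, select_ports(ports, hca_name="mlx4_1") selects only port 1 of mlx4_1.
+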
+PORTGUID information about the IB ports on given host can be obtained using
+the option '-L':
+
+# ib_qlgc_vnic_query -L
+
+0,mlx4_0,1,0x0002c903000010f5
+0,mlx4_0,2,0x0002c903000010f6
+1,mlx4_1,1,0x0002c90300000785
+1,mlx4_1,2,0x0002c90300000786
+
+This lists the IB ports present on the host, one per line, with the fields
+HCA No, HCA Name, Port No and PORTGUID separated by commas. The PORTGUID
+value obtained this way can be used with the '-G' option to discover the
+EVIC IOCs reachable through that port:
+
+# ib_qlgc_vnic_query -es -G 0x0002c903000010f5
+
+HCA No = 0, HCA = mlx4_0, Port = 1, Port GUID = 0x0002c903000010f5, State = Active
+
+ ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1"
+ ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2"
+
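The comma-separated '-L' output described above is easy to consume from a script. A minimal Python sketch, assuming exactly the four-field layout shown (HCA No, HCA Name, Port No, PORTGUID); the record and function names are illustrative.

```python
from collections import namedtuple

IbPort = namedtuple("IbPort", "hca_no hca_name port_no port_guid")


def parse_port_list(text):
    """Parse 'HCA No,HCA Name,Port No,PORTGUID' lines into IbPort records."""
    ports = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 4:
            continue  # skip blank or unexpected lines
        hca_no, hca_name, port_no, port_guid = fields
        ports.append(IbPort(int(hca_no), hca_name, int(port_no), port_guid))
    return ports
```

The port_guid field of a record can then be passed to the '-G' option.
+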
+C) Starting the QLogic VNIC driver and the QLogic VNIC interfaces
+
+To start the QLogic VNIC service as a part of startup of OFED stack, set
+
+QLGC_VNIC_LOAD=yes
+
+in the /etc/infiniband/openib.conf file. With this setting, the QLogic VNIC
+service is also stopped when the OFED stack is stopped. Also, if the OFED
+stack has been marked to start on boot, the QLogic VNIC service starts on
+boot as well.
+
+The rest of the discussion in this subsection C) is valid only if
+
+QLGC_VNIC_LOAD=no
+
+is set in /etc/infiniband/openib.conf.
+
+Once you have created a configuration file, you can start the VNIC driver
+and create the VNIC interfaces specified in the configuration file with:
+
+#/sbin/service qlgc_vnic start
+
+You can stop the VNIC driver and bring down the VNIC interfaces with
+
+#/sbin/service qlgc_vnic stop
+
+To restart the QLogic VNIC driver, you can use
+
+#/sbin/service qlgc_vnic restart
+
+If you have not started the Infiniband network stack (Infinipath or OFED),
+then running "/sbin/service qlgc_vnic start" command will also cause the
+Infiniband network stack to be started since the QLogic VNIC service requires
+the Infiniband stack.
+
+On the other hand if you start the Infiniband network stack separately, then
+the correct order of starting is:
+
+- Start the Infiniband stack
+- Start QLogic VNIC service
+
+For example, if you use OFED, correct order of starting is:
+
+/sbin/service openibd start
+/sbin/service qlgc_vnic start
+
+Correct order of stopping is:
+
+- Stop QLogic VNIC service
+- Stop the Infiniband stack
+
+For example, if you use OFED, correct order of stopping is:
+
+/sbin/service qlgc_vnic stop
+/sbin/service openibd stop
+
+If you try to stop the Infiniband stack when the QLogic VNIC service is
+running, you will get an error message that some of the modules of the
+Infiniband stack are in use by the QLogic VNIC service. Also, any QLogic VNIC
+interfaces that you created are removed (because stopping the Infiniband
+network stack causes the HCA driver to be unloaded, which is required for the
+VNIC interfaces to be present).
+In this case, do the following:
+
+ 1. Stop the QLogic VNIC service with "/sbin/service qlgc_vnic stop"
+
+ 2. Stop the Infiniband stack again.
+
+ 3. If you want to restart the QLogic VNIC interfaces, use
+ "/sbin/service qlgc_vnic start".
+
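The start and stop ordering for an OFED installation can be captured in a small wrapper so that the two services are never handled in the wrong order. This Python sketch only builds and optionally runs the /sbin/service commands quoted above; the function names and the dry-run default are illustrative choices.

```python
import subprocess

# Service names taken from the OFED example in the text.
IB_STACK = "openibd"
VNIC = "qlgc_vnic"


def service_commands(action):
    """Return the /sbin/service commands for 'start' or 'stop' in safe order."""
    if action == "start":
        order = [IB_STACK, VNIC]   # Infiniband stack first, then VNIC
    elif action == "stop":
        order = [VNIC, IB_STACK]   # VNIC first, then Infiniband stack
    else:
        raise ValueError("action must be 'start' or 'stop'")
    return [["/sbin/service", name, action] for name in order]


def run(action, dry_run=True):
    """Print the commands; execute them only when dry_run is False."""
    for cmd in service_commands(action):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # abort at the first failure
```

Calling run("stop") prints the two commands without executing them; run("stop", dry_run=False) needs root privileges.
+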
+
+D) Assigning IP addresses etc for the QLogic VNIC interfaces
+
+This can be done with ifconfig or by setting up the ifcfg-XXX (ifcfg-veth0 for
+an interface named veth0 etc) network files for the corresponding VNIC interfaces.
+
+E) Information about the QLogic VNIC interfaces
+
+Information about VNIC interfaces on a given host can be obtained using the
+script "ib_qlgc_vnic_info":
+
+# ib_qlgc_vnic_info
+
+VNIC Interface : eioc0
+ VNIC State : VNIC_REGISTERED
+ Current Path : primary path
+ Receive Checksum : true
+ Transmit checksum : true
+
+ Primary Path :
+ VIPORT State : VIPORT_CONNECTED
+ Link State : LINK_IDLING
+ HCA Info. : vnic-mthca0-1
+ Heartbeat : 100
+ IOC String : EVIC in Chassis 0x00066a00db000010, Slot 4, Ioc 1
+ IOC GUID : 66a01de000037
+ DGID : fe8000000000000000066a11de000037
+ P Key : ffff
+
+ Secondary Path :
+ VIPORT State : VIPORT_DISCONNECTED
+ Link State : INVALID STATE
+ HCA Info. : vnic-mthca0-2
+ Heartbeat : 100
+ IOC String :
+ IOC GUID : 66a01de000037
+ DGID : 00000000000000000000000000000000
+ P Key : 0
+
+This information is collected from /sys/class/infiniband_qlgc_vnic/interfaces/
+directory under which there is a separate directory corresponding to each
+VNIC interface.
+
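Because each VNIC interface has its own directory under the sysfs path named above, the configured interfaces can be enumerated without running the script. A minimal Python sketch, assuming only that directory layout; the files inside each interface directory are not described here, so the sketch does not read them.

```python
import os

SYSFS_ROOT = "/sys/class/infiniband_qlgc_vnic/interfaces"


def list_vnic_interfaces(root=SYSFS_ROOT):
    """Return the names of the per-interface directories under root."""
    if not os.path.isdir(root):
        return []  # driver not loaded, or the path differs on this system
    return sorted(
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )
```

Plain files in that directory are skipped, so only interface names are returned.
+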
+F) Deleting a specific QLogic VNIC interface
+
+VNIC interfaces can be deleted by writing the name of the interface to
+the /sys/class/infiniband_qlgc_vnic/interfaces/delete_vnic file.
+
+For example to delete interface veth0
+
+echo -n veth0 > /sys/class/infiniband_qlgc_vnic/interfaces/delete_vnic
+
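The same operation can be done from a script. The Python sketch below is a rough equivalent of the "echo -n" command above: it writes the interface name to a control file without a trailing newline. The helper name and its root parameter are illustrative; writing to the real sysfs file requires root privileges.

```python
import os

SYSFS_ROOT = "/sys/class/infiniband_qlgc_vnic/interfaces"


def write_control(control, ifname, root=SYSFS_ROOT):
    """Write an interface name to a control file, with no trailing newline.

    Roughly equivalent to: echo -n <ifname> > <root>/<control>
    """
    path = os.path.join(root, control)
    with open(path, "w") as f:
        f.write(ifname)
    return path
```

For example, write_control("delete_vnic", "veth0") corresponds to the command shown above.
+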
+G) Forced Failover feature for QLogic VNIC.
+
+VNIC interfaces, when configured with failover configuration, can be
+forced to failover to use other active path. For example, if VNIC interface
+"veth1" is configured with failover configuration, then to switch to other
+path, use command:
+
+echo -n veth1 > /sys/class/infiniband_qlgc_vnic/interfaces/force_failover
+
+This makes VNIC interface veth1 switch to the other active path, even if its
+current path is not in the disconnected state.
+
+This feature allows the network administrator to control the path of the
+VNIC traffic at run time; neither reconfiguration nor a restart of the VNIC
+service is required.
+
+Once enabled as mentioned above, forced failover can be cleared with
+the unfailover command:
+
+echo -n veth1 > /sys/class/infiniband_qlgc_vnic/interfaces/unfailover
+
+This clears the forced failover on VNIC interface "veth1". Once cleared,
+if module parameter "default_prefer_primary" is set to 1, then VNIC
+interface switches back to primary path. If module parameter
+"default_prefer_primary" is set to 0, then VNIC interface continues to
+use its current active path.
+
+Forced failover, thus, takes priority over default_prefer_primary and the
+default_prefer_primary feature will not be active unless the forced
+failover is cleared through "unfailover".
+
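The interaction between forced failover, "unfailover" and default_prefer_primary can be pictured with a toy state model. The Python sketch below is not driver code; it assumes both paths stay connected and tracks only which path is active. The class and attribute names are made up for illustration.

```python
class VnicPathModel:
    """Toy model of path selection for one VNIC interface with two live paths."""

    def __init__(self, default_prefer_primary=1):
        self.default_prefer_primary = default_prefer_primary
        self.active = "primary"
        self.forced = False

    def force_failover(self):
        # Switch to the other path even though the current one is connected.
        self.active = "secondary" if self.active == "primary" else "primary"
        self.forced = True

    def unfailover(self):
        # Clearing the forced state lets default_prefer_primary apply again.
        self.forced = False
        if self.default_prefer_primary == 1:
            self.active = "primary"
```

With default_prefer_primary set to 0, unfailover() leaves the interface on whichever path it was using.
+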
+Besides this forced failover, QLogic VNIC service does retain its
+original failover feature which gets triggered when current active
+path gets disconnected.
+
+H) Infiniband Quality of Service for VNIC.
+
+To enforce infiniband Quality of Service(QoS) for VNIC protocol, there
+is no configuration required on host side. The service level for the
+VNIC protocol can be configured using service ID or target port guid
+in the "qos-ulps" section of /etc/opensm/qos-policy.conf on the host
+running OpenSM.
+
+Service IDs for the EVIC IO controllers can be obtained from the output
+of ib_qlgc_vnic_query:
+
+HCA No = 1, HCA = mlx4_1, Port = 2, Port GUID = 0x0002c90300000786, State = Active
+
+ IO Unit Info:
+ port LID: 0008
+ port GID: fe8000000000000000066a11de000070
+ change ID: 0003
+ max controllers: 0x02
+
+
+ controller[ 1]
+ GUID: 00066a01de000070
+ vendor ID: 00066a
+ device ID: 000030
+ IO class : 2000
+ ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1
+ service entries: 2
+------> service[ 0]: 1000066a00000001 / InfiniNIC.InfiniConSys.Control:01
+------> service[ 1]: 1000066a00000101 / InfiniNIC.InfiniConSys.Data:01
+
+ IO Unit Info:
+ port LID: 0009
+ port GID: fe8000000000000000066a21de000070
+ change ID: 0003
+ max controllers: 0x02
+
+
+ controller[ 2]
+ GUID: 00066a02de000070
+ vendor ID: 00066a
+ device ID: 000030
+ IO class : 2000
+ ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2
+ service entries: 2
+------> service[ 0]: 1000066a00000002 / InfiniNIC.InfiniConSys.Control:02
+------> service[ 1]: 1000066a00000102 / InfiniNIC.InfiniConSys.Data:02
+
+The lines marked with arrows carry the required service IDs, for example
+1000066a00000002 and 1000066a00000102 for controller 2.
+
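Collecting the service IDs by hand is error-prone on larger fabrics. The Python sketch below pulls them out of ib_qlgc_vnic_query output and formats candidate lines for the "qos-ulps" section. The "any, service-id ... : SL" line format is assumed from the OpenSM QoS policy documentation and should be checked against the OpenSM version in use; the service level is a placeholder chosen by the caller.

```python
import re

# Matches lines such as:
#   service[ 0]: 1000066a00000001 / InfiniNIC.InfiniConSys.Control:01
SERVICE_LINE = re.compile(
    r'service\[\s*\d+\]:\s*([0-9a-fA-F]{16})\s*/\s*(\S+)'
)


def extract_service_ids(text):
    """Return unique (service_id, service_name) pairs in order of appearance."""
    seen = []
    for m in SERVICE_LINE.finditer(text):
        pair = (m.group(1).lower(), m.group(2))
        if pair not in seen:
            seen.append(pair)
    return seen


def qos_ulps_lines(service_ids, sl):
    """Format one matching line per service ID for the given service level."""
    return ["any, service-id 0x%s : %d" % (sid, sl) for sid, _ in service_ids]
```

The generated lines are meant as a starting point for editing /etc/opensm/qos-policy.conf by hand, not as a complete policy file.
+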
+Finer control on quality of service for the VNIC protocol can be achieved by
+configuring the service level using target port guid values of the EVIC IO
+controllers. Target port guid values for the EVIC IO controllers can be
+obtained using "saquery" command supplied by OFED package.
+
+I) QLogic VNIC Dynamic Update Daemon Tool and Hot Swap support
+
+This tool is started and stopped as part of the QLogic VNIC service
+(refer to C above) and provides the following features:
+
+1. Dynamic update of disconnected interfaces (which have been configured
+WITHOUT using the DGID option in the configuration file) :
+
+At the start up of VNIC driver, if the HCA port through which a particular VNIC
+interface path (primary or secondary) connects to target is down or the
+EVIC/VEx IOC is not available then all the required parameters (DGID etc) for connecting
+with the EVIC/VEx cannot be determined. Hence the corresponding VNIC interface
+path is not available at the start of the VNIC service. This daemon constantly
+monitors the configured VNIC interfaces to check if any of them are disconnected.
+If any of the interfaces are disconnected, it scans for available EVIC/VEx targets using
+"ib_qlgc_vnic_query" tool. When daemon sees that for a given path of a VNIC interface,
+the configured EVIC/VEx IOC has become available, it dynamically updates the
+VNIC kernel driver with the required information to establish connection for
+that path of the interface. In this way, the interface gets connected with
+the configured EVIC/VEx whenever it becomes available without any manual
+intervention.
+
+2. Hot Swap support :
+
+Hot swap is an operation in which an existing EVIC/VEx is replaced by another
+EVIC/VEx (in the same slot of the switch chassis as the older one). In such a
+case, the current connection for the corresponding VNIC interface will have to
+be re-established. The daemon detects this hot swap case and re-establishes
+the connection automatically. To make use of this feature of the daemon, it is
+recommended that IOCSTRING be used in the configuration file to configure the
+VNIC interfaces.
+
+This is because, after a hot swap, all other parameters of the EVIC/VEx such as
+DGID and IOCGUID change, but the IOCSTRING remains the same. The daemon therefore
+monitors for changes in the IOCGUID and DGID of disconnected interfaces based on
+the IOCSTRING. If these values have changed, it updates the kernel driver so that
+the VNIC interface can start using the new EVIC/VEx.
+
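The hot swap check described above amounts to a lookup keyed on IOCSTRING. The Python sketch below illustrates that idea only; it is not the daemon's implementation, and the dictionary keys and function name are assumptions.

```python
def find_update(configured, discovered):
    """Return new ioc_guid/dgid for a disconnected interface, or None.

    configured: dict with 'ioc_string', 'ioc_guid' and 'dgid'
    discovered: iterable of dicts with the same keys (e.g. from a fabric scan)
    """
    for ioc in discovered:
        if ioc["ioc_string"] != configured["ioc_string"]:
            continue  # IOCSTRING is the stable key across a hot swap
        old = (configured["ioc_guid"], configured["dgid"])
        new = (ioc["ioc_guid"], ioc["dgid"])
        if new != old:
            return {"ioc_guid": ioc["ioc_guid"], "dgid": ioc["dgid"]}
        return None  # same IOC is still present, nothing to update
    return None  # configured IOC is not visible on the fabric
```

A non-None result corresponds to the case where the kernel driver has to be given the new IOCGUID and DGID.
+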
+If in addition to IOCSTRING, DGID and IOCGUID have been used to configure
+a VNIC interface, then on a hotswap the daemon will update the parameters as required.
+But to have that VNIC interface available immediately on the next restart of the
+QLogic VNIC service, please make sure to update the configuration file with the
+new DGID and IOCGUID values. Otherwise, the creation of such interfaces will be
+delayed till the daemon runs and updates the parameters.
+
+J) Information about creating VLAN interfaces
+
+The EVIC/VEx supports VLAN tagging without having to explicitly create VLAN
+interfaces for the VNIC interface on the host. This is done by enabling
+Egress/Ingress tagging on the EVIC/VEx and setting the "Host ignores VLAN"
+option for the VNIC interface. The "Host ignores VLAN" option is enabled
+by default due to which VLAN tags are ignored on the host by the QLogic
+VNIC driver. Thus explicitly created VLAN interfaces (using vconfig command)
+for a given VNIC interface will not be operational.
+
+If you want to explicitly create a VLAN interface for a given VNIC interface,
+then you will have to disable the "Host ignores VLAN" option for the
+VNIC interface on the EVIC/VEx. The qlgc_vnic service must be restarted
+on the host after disabling (or enabling) the "Host ignores VLAN" option.
+
+Please refer to the EVIC/VEx documentation for more information on Egress/Ingress
+port tagging feature and disabling the "Host ignores VLAN" option.
+
+K) Information about enabling IB Multicast for QLogic VNIC interface
+
+The QLogic VNIC driver has been upgraded to support the IB multicasting feature of
+the EVIC/VEx. This feature enables the QLogic VNIC host driver to support IP
+multicasting more efficiently. With this feature enabled, an Infiniband multicast
+group acts as a carrier of IP multicast traffic. The EVIC makes use of such IB
+multicast groups to forward IP multicast traffic to the VNIC interfaces that
+are members of a given IP multicast group. In the older QLogic VNIC host driver,
+IB multicasting was not used to carry IP multicast traffic.
+
+By default, IB multicasting is disabled on EVIC/VEx; but it is enabled by
+default at the QLogic VNIC host driver.
+
+To disable the IB multicast feature on the host driver, modify the VNIC
+configuration file and set the parameter IB_MULTICAST=FALSE in the interface
+configuration. Please refer to qlgc_vnic.cfg.sample for more details on
+configuring VNIC interfaces for IB multicasting.
+For the feature to be used, IB multicasting also needs to be enabled on the
+EVIC/VEx. Please refer to the EVIC/VEx documentation for more information on
+enabling the IB multicast feature on the EVIC/VEx.
+
+L) Basic Troubleshooting
+
+1. In case of any problems, make sure that:
+
+ a) The HCA ports you are trying to use have IB cables connected and are in an
+ active state. You can use the "ibv_devinfo" tool to check the state of
+ your HCA ports.
+
+ b) If your HCA ports are not active, check if an SM is running on the fabric
+ where the HCA ports are connected. If you have done a full install of
+ OFED, you can use the "sminfo" command ("sminfo -P 2" for port 2) to
+ check SM information.
+
+ c) Make sure that the EVIC/VEx is powered up and its Ethernet cables are connected
+ properly.
+
+ d) Check /var/log/messages for any error messages.
+
+2. If some of your VNIC interfaces are not available:
+
+ a) Use "ifconfig" tool with -a option to see if all interfaces are created.
+ It is possible that the interfaces are created but do not have an
+ IP address. Make sure that you have setup a correct ifcfg-XXX file for your
+ VNIC interfaces for automatic assignment of IP addresses.
+
+ If the VNIC interface is created and the ifcfg file is also correct
+ but the VNIC interface is not UP, make sure that the target EVIC/VEx
+ IOC has an Ethernet cable properly connected.
+
+ b) Make sure that the VNIC configuration file has been setup properly
+ with correct EVIC/VEx target DGID/IOCGUID/IOCSTRING information and
+ instance numbers.
+
+ c) Make sure that the EVIC/VEx target IOC specified for that interface is
+ available. You can use the "ib_qlgc_vnic_query" tool to verify this. If it is not
+ available when you started the service, but it becomes available later
+ on, then the QLogic VNIC dynamic update daemon will bring up the
+ interface when the target becomes available. You will see messages in
+ /var/log/messages when the corresponding interface is created.
+
+ d) Make sure that you have not exceeded the total number of Virtual interfaces
+ supported by the EVIC/VEx. You can check the total number of Virtual interfaces
+ currently in use on the HTTP interface of the EVIC/VEx.
+
--- /dev/null
+
+ QoS support in OFED
+
+==============================================================================
+Table of contents
+==============================================================================
+
+1. Overview
+2. Architecture
+3. Supported Policy
+4. CMA functionality
+5. IPoIB functionality
+6. SDP functionality
+7. RDS functionality
+8. SRP functionality
+9. iSER functionality
+10. OpenSM functionality
+
+
+==============================================================================
+1. Overview
+==============================================================================
+
+Quality of Service requirements stem from the realization of I/O consolidation
+over an IB network: as multiple applications and ULPs share the same fabric,
+means to control their use of the network resources are becoming a must.
+The basic need is to differentiate the service levels provided to different
+traffic flows, so that a policy can be enforced to control each flow's
+utilization of the fabric resources.
+
+IBTA specification defined several hardware features and management interfaces
+to support QoS:
+* Up to 15 Virtual Lanes (VL) carry traffic in a non-blocking manner
+* Arbitration between traffic of different VLs is performed by a 2 priority
+ levels weighted round robin arbiter. The arbiter is programmable with
+ a sequence of (VL, weight) pairs and maximal number of high priority credits
+ to be processed before low priority is served
+* Packets carry class of service marking in the range 0 to 15 in their
+ header SL field
+* Each switch can map the incoming packet by its SL to a particular output
+ VL based on programmable table VL=SL-to-VL-MAP(in-port, out-port, SL)
+* The Subnet Administrator controls the parameters of each communication flow
+ by providing them in response to Path Record (PR) or MultiPathRecord (MPR)
+ queries
+
+The IB QoS features provide the means to implement a DiffServ like
+architecture. DiffServ architecture (IETF RFC 2474 & 2475) is widely used
+today in highly dynamic fabrics.
+
+This document provides the detailed functional definition for the various
+software elements that enable a DiffServ like architecture over the
+OpenFabrics software stack.
+
+
+==============================================================================
+2. Architecture
+==============================================================================
+
+QoS functionality is split between the SM/SA, CMA and the various ULPS.
+We take the "chronology approach" to describe how the overall system works.
+
+2.1. The network manager (human) provides a set of rules (policy) that
+define how the network is configured and how its resources are split
+among different QoS-Levels. The policy also defines how to decide which
+QoS-Level each application, ULP or service uses.
+
+2.2. The SM analyzes the provided policy to see if it is realizable and
+performs the necessary fabric setup. Part of this policy defines the default
+QoS-Level of each partition. The SA is enhanced to match the requested Source,
+Destination, QoS-Class, Service-ID, PKey against the policy, so clients
+(ULPs, programs) can obtain a policy enforced QoS. The SM may also set up
+partitions with appropriate IPoIB broadcast group. This broadcast group
+carries its QoS attributes: SL, MTU, RATE, and Packet Lifetime.
+
+2.3. IPoIB is being setup. IPoIB uses the SL, MTU, RATE and Packet Lifetime
+available on the multicast group which forms the broadcast group of this
+partition.
+
+2.4. MPI which provides non IB based connection management should be
+configured to run using hard coded SLs. It uses these SLs for every QP
+being opened.
+
+2.5. ULPs that use CM interface (like SRP) have their own pre-assigned
+Service-ID and use it while obtaining PathRecord/MultiPathRecord (PR/MPR)
+for establishing connections. The SA receiving the PR/MPR matches it
+against the policy and returns the appropriate PR/MPR including SL, MTU,
+RATE and Lifetime.
+
+2.6. ULPs and programs (e.g. SDP) that use CMA to establish RC connections
+provide the CMA with the target IP and port number. ULPs might also provide a
+QoS-Class. The CMA then creates a Service-ID for the ULP and passes this ID and
+the optional QoS-Class in the PR/MPR request. The resulting PR/MPR is used for
+configuring the connection QP.
+
+PathRecord and MultiPathRecord enhancement for QoS:
+
+As mentioned above the PathRecord and MultiPathRecord attributes are enhanced
+to carry the Service-ID which is a 64bit value. A new field QoS-Class is also
+provided.
+A new capability bit describes the SM QoS support in the SA class port info.
+This approach provides an easy migration path for existing access layer and
+ULPs by not introducing new set of PR/MPR attributes.
+
+
+==============================================================================
+3. Supported Policy
+==============================================================================
+
+The QoS policy that is specified in a separate file is divided into
+4 sub sections:
+
+I) Port Group: a set of CAs, Routers or Switches that share the same settings.
+ A port group might be a partition defined by the partition manager policy,
+ list of GUIDs, or list of port names based on NodeDescription.
+
+II) Fabric Setup: Defines how the SL2VL and VLArb tables should be setup.
+ NOTE: Currently this part of the policy is ignored. SL2VL and VLArb
+ tables should be configured in the OpenSM options file
+ (opensm.opts).
+
+III) QoS-Levels Definition: This section defines the possible sets of
+ parameters for QoS that a client might be mapped to. Each set holds
+ SL and optionally: Max MTU, Max Rate, Packet Lifetime and Path Bits.
+ NOTE: Currently, Path Bits are not implemented.
+
+IV) Matching Rules: A list of rules that match an incoming PR/MPR request
+ to a QoS-Level. The rules are processed in order such that the first match
+ is applied. Each rule is built out of a set of match expressions which
+ should all match for the rule to apply. The matching expressions are
+ defined for the following fields:
+ - SRC and DST to lists of port groups
+ - Service-ID to a list of Service-ID values or ranges
+ - QoS-Class to a list of QoS-Class values or ranges
+
+
+==============================================================================
+4. CMA features
+==============================================================================
+
+The CMA interface supports Service-ID through the notion of port space
+as a prefix to the port_num, which is part of the sockaddr provided to
+rdma_resolve_addr().
+CMA also allows the ULP (like SDP) to propagate a request for a specific
+QoS-Class. CMA uses the provided QoS-Class and Service-ID in the sent PR/MPR.
+
+
+==============================================================================
+5. IPoIB
+==============================================================================
+
+IPoIB queries the SA for its broadcast group information.
+It provides the broadcast group SL, MTU, and RATE in every following
+PathRecord query performed when a new UDAV is needed by IPoIB.
+
+
+==============================================================================
+6. SDP
+==============================================================================
+
+SDP uses CMA for building its connections.
+The Service-ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits
+holding the remote TCP/IP Port Number to connect to.
+
+
+==============================================================================
+7. RDS
+==============================================================================
+
+RDS uses CMA and thus it is very close to SDP. The Service-ID for RDS is
+0x000000000106PPPP, where PPPP are 4 hex digits holding the TCP/IP Port
+Number that the protocol connects to.
+Default port number for RDS is 0x48CA, which makes a default Service-ID
+0x00000000010648CA.
+
+
+==============================================================================
+8. SRP
+==============================================================================
+
+The current SRP implementation uses its own CM callbacks (not CMA). So SRP
+fills in the Service-ID in the PR/MPR by itself and uses that information in
+setting up the QP.
+SRP Service-ID is defined by the SRP target I/O Controller (it also complies
+with IBTA Service-ID rules). The Service-ID is reported by the I/O Controller
+in the ServiceEntries DMA attribute and should be used in the PR/MPR if the
+SA reports its ability to handle QoS PR/MPRs.
+
+
+==============================================================================
+9. iSER
+==============================================================================
+
+Similar to RDS, iSER also uses CMA. The Service-ID for iSER is similar to RDS
+(0x000000000106PPPP), with default port number 0x0CBC, which makes a default
+Service-ID 0x0000000001060CBC.
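+
+The Service-ID layouts given above for SDP, RDS, and iSER can be reproduced
+with a few lines of arithmetic. The following is only an illustrative sketch:
+the constant and function names are made up for this example and are not
+part of any OFED API.

```python
# Prefixes taken from the Service-ID formats described in the text.
# All names here are illustrative, not part of any OFED interface.
SDP_PREFIX = 0x0000000000010000       # 0x000000000001PPPP
RDS_ISER_PREFIX = 0x0000000001060000  # 0x000000000106PPPP


def service_id(prefix: int, port: int) -> int:
    """Place a 16-bit TCP/IP port number in the low 4 hex digits."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port must fit in 16 bits")
    return prefix | port


def fmt(sid: int) -> str:
    """Format a Service-ID as a 64-bit hex value."""
    return f"0x{sid:016X}"


print(fmt(service_id(RDS_ISER_PREFIX, 0x48CA)))  # 0x00000000010648CA (RDS default)
print(fmt(service_id(RDS_ISER_PREFIX, 0x0CBC)))  # 0x0000000001060CBC (iSER default)
print(fmt(service_id(SDP_PREFIX, 30000)))        # 0x0000000000017530
```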
+
+
+==============================================================================
+10. OpenSM features
+==============================================================================
+
+The QoS related functionality that is provided by OpenSM can be split into two
+main parts:
+
+10.1. Fabric Setup
+During fabric initialization the SM parses the policy and applies its settings
+to the discovered fabric elements.
+
+10.2. PR/MPR query handling:
+OpenSM enforces the provided policy on client requests.
+The overall flow for such requests is: first the request is matched against
+the defined match rules so that the target QoS-Level definition is found.
+Given the QoS-Level, a search for path(s) is performed with the restrictions
+imposed by that level.
+
+==============================================================================
--- /dev/null
+
+ QoS Management in OpenSM
+
+==============================================================================
+ Table of contents
+==============================================================================
+
+1. Overview
+2. Full QoS Policy File
+3. Simplified QoS Policy Definition
+4. Policy File Syntax Guidelines
+5. Examples of Full Policy File
+6. Simplified QoS Policy - Details and Examples
+7. SL2VL Mapping and VL Arbitration
+
+
+==============================================================================
+ 1. Overview
+==============================================================================
+
+When QoS in OpenSM is enabled (-Q or --qos), OpenSM looks for a QoS policy
+file. The default name of the OpenSM QoS policy file is
+/usr/local/etc/opensm/qos-policy.conf. The default may be changed by using the
+-Y or --qos_policy_file option with OpenSM.
+
+During fabric initialization and at every heavy sweep OpenSM parses the QoS
+policy file, applies its settings to the discovered fabric elements, and
+enforces the provided policy on client requests. The overall flow for such
+requests is:
+ - The request is matched against the defined matching rules such that the
+ QoS Level definition is found.
+ - Given the QoS Level, path(s) search is performed with the given
+ restrictions imposed by that level.
+
+There are two ways to define QoS policy:
+ - Full policy, where the policy file syntax provides an administrator
+ various ways to match PathRecord/MultiPathRecord (PR/MPR) request and
+ enforce various QoS constraints on the requested PR/MPR
+ - Simplified QoS policy definition, where an administrator would be able to
+ match PR/MPR requests by various ULPs and applications running on top of
+ these ULPs.
+
+While the full policy syntax is very flexible, in many cases the simplified
+policy definition would be sufficient.
+
+
+==============================================================================
+ 2. Full QoS Policy File
+==============================================================================
+
+QoS policy file has the following sections:
+
+I) Port Groups (denoted by port-groups).
+This section defines zero or more port groups that can be referred later by
+matching rules (see below). Port group lists ports by:
+ - Port GUID
+ - Port name, which is a combination of NodeDescription and IB port number
+ - PKey, which means that all the ports in the subnet that belong to
+ partition with a given PKey belong to this port group
+ - Partition name, which means that all the ports in the subnet that belong
+ to partition with a given name belong to this port group
+ - Node type, where possible node types are: CA, SWITCH, ROUTER, ALL, and
+ SELF (SM's port).
+
+II) QoS Setup (denoted by qos-setup).
+This section describes how to set up SL2VL and VL Arbitration tables on
+various nodes in the fabric.
+However, this is not supported in OpenSM currently.
+SL2VL and VLArb tables should be configured in the OpenSM options file
+(default location - /usr/local/etc/opensm/opensm.conf).
+
+III) QoS Levels (denoted by qos-levels).
+Each QoS Level defines Service Level (SL) and a few optional fields:
+ - MTU limit
+ - Rate limit
+ - PKey
+ - Packet lifetime
+When path(s) search is performed, it is done with regards to restriction that
+these QoS Level parameters impose.
+One QoS level that is mandatory to define is a DEFAULT QoS level. It is
+applied to a PR/MPR query that does not match any existing match rule.
+Similar to any other QoS Level, it can also be explicitly referred by any
+match rule.
+
+IV) QoS Matching Rules (denoted by qos-match-rules).
+Each PathRecord/MultiPathRecord query that OpenSM receives is matched against
+the set of matching rules. Rules are scanned in order of appearance in the QoS
+policy file, so the first match takes precedence.
+Each rule names the QoS level that will be applied to the matching query.
+A default QoS level is applied to a query that did not match any rule.
+Queries can be matched by:
+ - Source port group (whether a source port is a member of a specified group)
+ - Destination port group (same as above, only for destination port)
+ - PKey
+ - QoS class
+ - Service ID
+To match a certain matching rule, PR/MPR query has to match ALL the rule's
+criteria. However, not all the fields of the PR/MPR query have to appear in
+the matching rule.
+For instance, if the rule has a single criterion - Service ID, it will match
+any query that has this Service ID, disregarding rest of the query fields.
+However, if a certain query has only Service ID (which means that this is the
+only bit in the PR/MPR component mask that is on), it will not match any rule
+that has other matching criteria besides Service ID.
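+
+The matching semantics described above can be modelled with a small sketch.
+This is not OpenSM code; it only illustrates the stated rules under the
+assumption that a query is represented as a dictionary holding just the
+fields whose component mask bits are set. The level names "Strict" and
+"ByService" are hypothetical.

```python
def rule_matches(criteria, query):
    """A rule matches only if every criterion it names is present in the
    query and the query's value is one of the accepted values."""
    return all(
        field in query and query[field] in accepted
        for field, accepted in criteria.items()
    )


def select_qos_level(rules, query, default="DEFAULT"):
    """Scan rules in file order; the first matching rule wins."""
    for criteria, level in rules:
        if rule_matches(criteria, query):
            return level
    return default


# Hypothetical rule list: (criteria, QoS level name), in file order.
rules = [
    ({"service_id": {0x6234}, "pkey": {0x0ABC}}, "Strict"),
    ({"service_id": {0x6234}}, "ByService"),
]

# A query carrying only a Service ID cannot match the first rule,
# because that rule also has a PKey criterion.
print(select_qos_level(rules, {"service_id": 0x6234}))  # ByService
# Extra query fields that the rule does not mention are disregarded.
print(select_qos_level(
    rules, {"service_id": 0x6234, "pkey": 0x0ABC, "qos_class": 7}))  # Strict
# No rule matches: the DEFAULT level is applied.
print(select_qos_level(rules, {"pkey": 0x0ABC}))  # DEFAULT
```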
+
+
+==============================================================================
+ 3. Simplified QoS Policy Definition
+==============================================================================
+
+Simplified QoS policy definition consists of a single section denoted by
+qos-ulps. Similar to the full QoS policy, it has a list of match rules and
+their QoS Level, but in this case a match rule has only one criterion - its
+goal is to match a certain ULP (or a certain application on top of this ULP)
+PR/MPR request, and QoS Level has only one constraint - Service Level (SL).
+The simplified policy section may appear in the policy file in combination
+with the full policy, or as a stand-alone policy definition.
+See more details and list of match rule criteria below.
+
+
+==============================================================================
+ 4. Policy File Syntax Guidelines
+==============================================================================
+
+- Empty lines are ignored.
+- Leading and trailing blanks are ignored, so the indentation in the
+ examples is just for better readability.
+- Comments are started with the pound sign (#) and terminated by EOL.
+- Any keyword should be the first non-blank in the line, unless it's a
+ comment.
+- Keywords that denote section/subsection start have matching closing
+ keywords.
+- Having a QoS Level named "DEFAULT" is a must - it is applied to PR/MPR
+ requests that didn't match any of the matching rules.
+- Any section/subsection of the policy file is optional.
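+
+As a rough illustration of these guidelines, the sketch below strips comments
+and blanks and checks that section keywords are properly closed. It is not
+the OpenSM parser. It assumes that a section keyword stands alone on its
+line, that its closing keyword is the same word prefixed with "end-" (as in
+the policy examples in this document), and that every other line has the
+form "keyword: value".

```python
def strip_line(line):
    """Drop a '#' comment and any leading/trailing blanks."""
    return line.split("#", 1)[0].strip()


def check_nesting(text):
    """Return True if every section keyword has its matching 'end-' keyword."""
    stack = []
    for raw in text.splitlines():
        line = strip_line(raw)
        if not line:
            continue  # empty or comment-only line
        if ":" in line:
            continue  # "keyword: value" line, not a section marker
        if line.startswith("end-"):
            if not stack or stack.pop() != line[len("end-"):]:
                return False
        else:
            stack.append(line)
    return not stack


policy = """
    qos-levels
        qos-level          # the mandatory default level
            name: DEFAULT
            sl: 0
        end-qos-level
    end-qos-levels
"""
print(check_nesting(policy))                         # True
print(check_nesting("qos-levels\nend-qos-level\n"))  # False
```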
+
+
+==============================================================================
+ 5. Examples of Full Policy File
+==============================================================================
+
+As mentioned earlier, any section of the policy file is optional, and
+the only mandatory part of the policy file is a default QoS Level.
+Here's an example of the shortest policy file:
+
+ qos-levels
+ qos-level
+ name: DEFAULT
+ sl: 0
+ end-qos-level
+ end-qos-levels
+
+The port groups section is missing because there are no match rules, which
+means that port groups are not referred to anywhere, and there is no need to
+define them. And since this policy file doesn't have any matching rules, a
+PR/MPR query won't match any rule, and OpenSM will enforce the default QoS
+level. Essentially, the above example is equivalent to not having a QoS policy
+file at all.
+
+The following example shows all the possible options and keywords in the
+policy file and their syntax:
+
+ #
+ # See the comments in the following example.
+ # They explain different keywords and their meaning.
+ #
+ port-groups
+
+ port-group # using port GUIDs
+ name: Storage
+ # "use" is just a description that is used for logging
+ # Other than that, it is just a comment
+ use: SRP Targets
+ port-guid: 0x10000000000001, 0x10000000000005-0x1000000000FFFA
+ port-guid: 0x1000000000FFFF
+ end-port-group
+
+ port-group
+ name: Virtual Servers
+ # The syntax of the port name is as follows:
+ # "node_description/Pnum".
+ # node_description is compared to the NodeDescription of the node,
+ # and "Pnum" is a port number on that node.
+ port-name: vs1 HCA-1/P1, vs2 HCA-1/P1
+ end-port-group
+
+ # using partitions defined in the partition policy
+ port-group
+ name: Partitions
+ partition: Part1
+ pkey: 0x1234
+ end-port-group
+
+ # using node types: CA, ROUTER, SWITCH, SELF (for node that runs SM)
+ # or ALL (for all the nodes in the subnet)
+ port-group
+ name: CAs and SM
+ node-type: CA, SELF
+ end-port-group
+
+ end-port-groups
+
+ qos-setup
+ # This section of the policy file describes how to set up SL2VL and VL
+ # Arbitration tables on various nodes in the fabric.
+ # However, this is not supported in OpenSM currently - the section is
+ # parsed and ignored. SL2VL and VLArb tables should be configured in the
+ # OpenSM options file (by default - /usr/local/etc/opensm/opensm.conf).
+ end-qos-setup
+
+ qos-levels
+
+ # Having a QoS Level named "DEFAULT" is a must - it is applied to
+ # PR/MPR requests that didn't match any of the matching rules.
+ qos-level
+ name: DEFAULT
+ use: default QoS Level
+ sl: 0
+ end-qos-level
+
+ # the whole set: SL, MTU-Limit, Rate-Limit, PKey, Packet Lifetime
+ qos-level
+ name: WholeSet
+ sl: 1
+ mtu-limit: 4
+ rate-limit: 5
+ pkey: 0x1234
+ packet-life: 8
+ end-qos-level
+
+ end-qos-levels
+
+ # Match rules are scanned in order of their appearance in the policy file.
+ # First matched rule takes precedence.
+ qos-match-rules
+
+ # matching by a single criterion: QoS class
+ qos-match-rule
+ use: by QoS class
+ qos-class: 7-9,11
+ # Name of qos-level to apply to the matching PR/MPR
+ qos-level-name: WholeSet
+ end-qos-match-rule
+
+ # show matching by destination group and service id
+ qos-match-rule
+ use: Storage targets
+ destination: Storage
+ service-id: 0x10000000000001, 0x10000000000008-0x10000000000FFF
+ qos-level-name: WholeSet
+ end-qos-match-rule
+
+ qos-match-rule
+ source: Storage
+ use: match by source group only
+ qos-level-name: DEFAULT
+ end-qos-match-rule
+
+ qos-match-rule
+ use: match by all parameters
+ qos-class: 7-9,11
+ source: Virtual Servers
+ destination: Storage
+ service-id: 0x0000000000010000-0x000000000001FFFF
+ pkey: 0x0F00-0x0FFF
+ qos-level-name: WholeSet
+ end-qos-match-rule
+
+ end-qos-match-rules
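+
+Several keywords in the example above (port-guid, qos-class, service-id,
+pkey) take a comma-separated list of single values and "low-high" ranges.
+The sketch below shows one way such a list could be interpreted. It is an
+illustration only, and it assumes that ranges are inclusive and that values
+are either decimal or 0x-prefixed hexadecimal.

```python
def parse_ranges(value):
    """Parse 'a, b-c, ...' into a list of inclusive (low, high) tuples."""
    ranges = []
    for item in value.split(","):
        low, sep, high = item.strip().partition("-")
        lo = int(low, 0)  # base 0 accepts decimal and 0x-prefixed hex
        hi = int(high, 0) if sep else lo
        if hi < lo:
            raise ValueError(f"descending range: {item.strip()}")
        ranges.append((lo, hi))
    return ranges


def contains(ranges, x):
    """True if x falls inside any of the parsed ranges."""
    return any(lo <= x <= hi for lo, hi in ranges)


qos_classes = parse_ranges("7-9,11")
print(qos_classes)                # [(7, 9), (11, 11)]
print(contains(qos_classes, 8))   # True
print(contains(qos_classes, 10))  # False
print(contains(parse_ranges("0x0F00-0x0FFF"), 0x0F10))  # True
```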
+
+
+==============================================================================
+ 6. Simplified QoS Policy - Details and Examples
+==============================================================================
+
+Simplified QoS policy match rules are tailored for matching ULPs (or some
+application on top of a ULP) PR/MPR requests. This section has a list of
+per-ULP (or per-application) match rules and the SL that should be enforced
+on the matched PR/MPR query.
+
+Match rules include:
+ - Default match rule that is applied to PR/MPR query that didn't match any
+ of the other match rules
+ - SDP
+ - SDP application with a specific target TCP/IP port range
+ - SRP with a specific target IB port GUID
+ - RDS
+ - iSER
+ - iSER application with a specific target TCP/IP port range
+ - IPoIB with a default PKey
+ - IPoIB with a specific PKey
+ - any ULP/application with a specific Service ID in the PR/MPR query
+ - any ULP/application with a specific PKey in the PR/MPR query
+ - any ULP/application with a specific target IB port GUID in the PR/MPR query
+
+Since any section of the policy file is optional, as long as the basic rules
+of the file are kept (such as not referring to a nonexistent port group,
+having a default QoS Level, etc.), the simplified policy section (qos-ulps)
+can serve as a complete QoS policy file.
+The shortest policy file in this case would be as follows:
+
+ qos-ulps
+ default : 0 #default SL
+ end-qos-ulps
+
+It is equivalent to the previous example of the shortest policy file, and it
+is also equivalent to not having policy file at all.
+
+Below is an example of simplified QoS policy with all the possible keywords:
+
+ qos-ulps
+ default : 0 # default SL
+ sdp, port-num 30000 : 0 # SL for application running on top
+ # of SDP when a destination
+ # TCP/IP port is 30000
+ sdp, port-num 10000-20000 : 0
+ sdp : 1 # default SL for any other
+ # application running on top of SDP
+ rds : 2 # SL for RDS traffic
+ iser, port-num 900 : 0 # SL for iSER with a specific target
+ # port
+ iser : 3 # default SL for iSER
+ ipoib, pkey 0x0001 : 0 # SL for IPoIB on partition with
+ # pkey 0x0001
+ ipoib : 4 # default IPoIB partition,
+ # pkey=0x7FFF
+ any, service-id 0x6234 : 6 # match any PR/MPR query with a
+ # specific Service ID
+ any, pkey 0x0ABC : 6 # match any PR/MPR query with a
+ # specific PKey
+ srp, target-port-guid 0x1234 : 5 # SRP when SRP Target is located on
+ # a specified IB port GUID
+ any, target-port-guid 0x0ABC-0xFFFFF : 6 # match any PR/MPR query with
+ # a specific target port GUID
+ end-qos-ulps
+
+
+Similar to the full policy definition, matching of PR/MPR queries is done in
+order of appearance in the QoS policy file, so the first match takes
+precedence, except for the "default" rule, which is applied only if the query
+didn't match any other rule.
+
+All other sections of the QoS policy file take precedence over the qos-ulps
+section. That is, if a policy file has both qos-match-rules and qos-ulps
+sections, then any query is matched first against the rules in the
+qos-match-rules section, and only if there was no match, the query is matched
+against the rules in qos-ulps section.
+
+Note that some of these match rules may overlap, so in order to use the
+simplified QoS definition effectively, it is important to understand how each
+of the ULPs is matched:
+
+6.1 IPoIB
+IPoIB query is matched by PKey. Default PKey for IPoIB partition is 0x7fff, so
+the following three match rules are equivalent:
+
+ ipoib : <SL>
+ ipoib, pkey 0x7fff : <SL>
+ any, pkey 0x7fff : <SL>
+
+6.2 SDP
+SDP PR query is matched by Service ID. The Service-ID for SDP is
+0x000000000001PPPP, where PPPP are 4 hex digits holding the remote TCP/IP Port
+Number to connect to. The following two match rules are equivalent:
+
+ sdp : <SL>
+ any, service-id 0x0000000000010000-0x000000000001ffff : <SL>
+
+6.3 RDS
+Similar to SDP, RDS PR query is matched by Service ID. The Service ID for RDS
+is 0x000000000106PPPP, where PPPP are 4 hex digits holding the remote TCP/IP
+Port Number to connect to. Default port number for RDS is 0x48CA, which makes
+a default Service-ID 0x00000000010648CA. The following two match rules are
+equivalent:
+
+ rds : <SL>
+ any, service-id 0x00000000010648CA : <SL>
+
+6.4 iSER
+Similar to RDS, iSER query is matched by Service ID, where the Service ID
+is also 0x000000000106PPPP. Default port number for iSER is 0x0CBC, which makes
+a default Service-ID 0x0000000001060CBC. The following two match rules are
+equivalent:
+
+ iser : <SL>
+ any, service-id 0x0000000001060CBC : <SL>
+
+6.5 SRP
+Service ID for SRP varies from storage vendor to vendor, thus SRP query is
+matched by the target IB port GUID. The following two match rules are
+equivalent:
+
+ srp, target-port-guid 0x1234 : <SL>
+ any, target-port-guid 0x1234 : <SL>
+
+Note that any of the above ULPs might contain target port GUID in the PR
+query, so in order for these queries not to be recognized by the QoS manager
+as SRP, the SRP match rule (or any match rule that refers to the target port
+guid only) should be placed at the end of the qos-ulps match rules.
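+
+The effect of rule order can be seen in a small model of first-match
+evaluation. This is a sketch, not the QoS manager's implementation: queries
+are plain dictionaries, each rule is a predicate paired with an SL, and the
+GUID and SL values are arbitrary example values.

```python
def is_sdp(query):
    """SDP is matched by Service ID 0x000000000001PPPP."""
    sid = query.get("service_id")
    return sid is not None and 0x0000000000010000 <= sid <= 0x000000000001FFFF


def has_target_guid(guid):
    """Build a predicate for rules that look only at the target port GUID."""
    return lambda query: query.get("target_port_guid") == guid


def first_match(rules, query, default_sl):
    """Return the SL of the first matching rule, else the default SL."""
    for predicate, sl in rules:
        if predicate(query):
            return sl
    return default_sl


# An SDP query (TCP/IP port 30000) that also carries a target port GUID.
sdp_query = {"service_id": 0x0000000000017530, "target_port_guid": 0x1234}

srp_first = [(has_target_guid(0x1234), 5), (is_sdp, 1)]
srp_last = [(is_sdp, 1), (has_target_guid(0x1234), 5)]

print(first_match(srp_first, sdp_query, 0))  # 5: mistaken for SRP
print(first_match(srp_last, sdp_query, 0))   # 1: recognized as SDP
```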
+
+6.6 MPI
+SL for MPI is manually configured by the MPI admin. OpenSM does not force any
+SL on the MPI traffic, which is why MPI is the only ULP that does not appear
+in the qos-ulps section.
+
+
+==============================================================================
+ 7. SL2VL Mapping and VL Arbitration
+==============================================================================
+
+OpenSM cached options file has a set of QoS related configuration parameters,
+that are used to configure SL2VL mapping and VL arbitration on IB ports.
+These parameters are:
+ - Max VLs: the maximum number of VLs that will be on the subnet.
+ - High limit: the limit of High Priority component of VL Arbitration
+ table (IBA 7.6.9).
+ - VLArb low table: Low priority VL Arbitration table (IBA 7.6.9) template.
+ - VLArb high table: High priority VL Arbitration table (IBA 7.6.9) template.
+ - SL2VL: SL2VL Mapping table (IBA 7.6.6) template. It is a list of VLs
+ corresponding to SLs 0-15 (Note that VL15 used here means drop this SL).
+
+There are separate QoS configuration parameters sets for various target types:
+CAs, routers, switch external ports, and switch's enhanced port 0. The names
+of such parameters are prefixed by "qos_<type>_" string. Here is a full list
+of the currently supported sets:
+
+ qos_ca_ - QoS configuration parameters set for CAs.
+ qos_rtr_ - parameters set for routers.
+ qos_sw0_ - parameters set for switches' port 0.
+ qos_swe_ - parameters set for switches' external ports.
+
+Here's an example of typical default values for CAs and switches' external
+ports (hard-coded in OpenSM initialization):
+
+ qos_ca_max_vls 15
+ qos_ca_high_limit 0
+ qos_ca_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
+ qos_ca_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
+ qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
+
+ qos_swe_max_vls 15
+ qos_swe_high_limit 0
+ qos_swe_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
+ qos_swe_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
+ qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
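+
+The SL2VL template is a plain list in which the position is the SL and the
+value is the VL. The following sketch reads such a list. It is illustrative
+only and assumes exactly 16 comma-separated decimal entries, one per SL.

```python
def parse_sl2vl(value):
    """Map each SL (0-15) to the VL listed at that position."""
    vls = [int(v) for v in value.split(",")]
    if len(vls) != 16:
        raise ValueError("expected one VL for each of SLs 0-15")
    if any(not 0 <= vl <= 15 for vl in vls):
        raise ValueError("VL values must be in the range 0-15")
    return dict(enumerate(vls))


table = parse_sl2vl("0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7")
print(table[3])   # 3: SL3 is carried on VL3
print(table[15])  # 7: SL15 is carried on VL7

# SLs mapped to VL15 are dropped; this template drops none.
print([sl for sl, vl in table.items() if vl == 15])  # []
```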
+
+VL arbitration tables (both high and low) are lists of VL/Weight pairs.
+Each list entry contains a VL number (values from 0-14), and a weighting value
+(values 0-255), indicating the number of 64 byte units (credits) which may be
+transmitted from that VL when its turn in the arbitration occurs. A weight
+of 0 indicates that this entry should be skipped. If a list entry is
+programmed for VL15 or for a VL that is not supported or is not currently
+configured by the port, the port may either skip that entry or send from any
+supported VL for that entry.
+
+Note that the same VL may be listed multiple times in the High or Low
+priority arbitration tables and, further, it can be listed in both tables.
+
+The limit of high-priority VLArb table (qos_<type>_high_limit) indicates the
+number of high-priority packets that can be transmitted without an opportunity
+to send a low-priority packet. Specifically, the number of bytes that can be
+sent is high_limit times 4K bytes.
+
+A high_limit value of 255 indicates that the byte limit is unbounded.
+Note: if the 255 value is used, the low priority VLs may be starved.
+A value of 0 indicates that only a single packet from the high-priority table
+may be sent before an opportunity is given to the low-priority table.
+
+Keep in mind that ports usually transmit packets of size equal to MTU.
+For instance, for 4KB MTU a single packet will require 64 credits, so in order
+to achieve effective VL arbitration for packets of 4KB MTU, the weighting
+values for each VL should be multiples of 64.
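+
+The unit conversions used above (64-byte credits for weights, 4K-byte units
+for high_limit) can be written out as follows. The function names are made
+up for this sketch.

```python
CREDIT_BYTES = 64        # one weight unit (credit) is 64 bytes
HIGH_LIMIT_UNIT = 4096   # one high_limit unit is 4K bytes


def credits_per_packet(packet_bytes):
    """Number of 64-byte credits needed to send one packet (rounded up)."""
    return -(-packet_bytes // CREDIT_BYTES)


def high_limit_bytes(high_limit):
    """Byte budget of the high-priority table; None means unbounded.

    A value of 0 is special: a single high-priority packet may be sent
    before the low-priority table gets an opportunity.
    """
    if not 0 <= high_limit <= 255:
        raise ValueError("high_limit must be in the range 0-255")
    if high_limit == 255:
        return None
    return high_limit * HIGH_LIMIT_UNIT


print(credits_per_packet(4096))  # 64
print(credits_per_packet(2048))  # 32
print(high_limit_bytes(6))       # 24576
print(high_limit_bytes(255))     # None
```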
+
+Below is an example of SL2VL and VL Arbitration configuration on subnet:
+
+ qos_ca_max_vls 15
+ qos_ca_high_limit 6
+ qos_ca_vlarb_high 0:4
+ qos_ca_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
+ qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
+
+ qos_swe_max_vls 15
+ qos_swe_high_limit 6
+ qos_swe_vlarb_high 0:4
+ qos_swe_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
+ qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
+
+In this example, there are 8 VLs configured on the subnet: VL0 to VL7. VL0 is
+defined as a high priority VL, and it is limited to 6 x 4KB = 24KB in a single
+transmission burst. Such a configuration would suit a VL that needs low latency
+and uses a small MTU when transmitting packets. The rest of the VLs are defined
+as low priority VLs with different weights, while VL4 is effectively turned
+off.
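+
+To get a feel for the low-priority weights in this example, the sketch below
+parses a VLArb table string and computes each VL's fraction of the total
+weight. Treating that fraction as the VL's share of low-priority bandwidth is
+a simplifying assumption that holds only roughly, and only when every listed
+VL always has data to send.

```python
from fractions import Fraction


def parse_vlarb(value):
    """Parse 'VL:weight,...' into a list of (vl, weight) pairs."""
    entries = []
    for pair in value.split(","):
        vl, weight = (int(x) for x in pair.split(":"))
        if not 0 <= vl <= 14 or not 0 <= weight <= 255:
            raise ValueError(f"bad VLArb entry: {pair}")
        entries.append((vl, weight))
    return entries


def weight_shares(entries):
    """Fraction of the total weight per VL; zero-weight VLs are omitted."""
    totals = {}
    for vl, weight in entries:
        totals[vl] = totals.get(vl, 0) + weight  # a VL may be listed twice
    total = sum(totals.values())
    return {vl: Fraction(w, total) for vl, w in totals.items() if w}


low = parse_vlarb("0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64")
shares = weight_shares(low)
print(shares[3])    # 1/3
print(shares[1])    # 1/9
print(4 in shares)  # False: weight 0 means the entry is skipped
```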
--- /dev/null
+RDS(7) RDS(7)
+
+
+
+NAME
+ RDS - Reliable Datagram Sockets
+
+SYNOPSIS
+ #include <sys/socket.h>
+ #include <netinet/in.h>
+
+DESCRIPTION
+ This is an implementation of the RDS socket API. It provides reliable,
+ in-order datagram delivery between sockets over a variety of trans‐
+ ports.
+
+ Currently, RDS can be transported over Infiniband and loopback.
+ iWARP bcopy is supported, but not RDMA operations.
+
+ RDS uses standard AF_INET addresses as described in ip(7) to identify
+ end points.
+
+ Socket Creation
+ RDS is still in development and as such does not have a reserved proto‐
+ col family constant. Applications must read the string representation
+ of the protocol family value from the pf_rds sysctl parameter file
+ described below.
+
+ rds_socket = socket(pf_rds, SOCK_SEQPACKET, 0);
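+
+ The same steps can be sketched in Python. This is an illustration
+ only: the helper names are invented, the location of the pf_rds
+ parameter file is passed in by the caller rather than assumed, and
+ the file is assumed to hold a single decimal integer.

```python
import socket


def read_sysctl_int(path):
    """Read the decimal integer stored in a sysctl parameter file."""
    with open(path) as f:
        return int(f.read().strip())


def open_rds_socket(pf_rds_path):
    """Create an RDS socket using the dynamically assigned protocol family.

    pf_rds_path is the path of the pf_rds sysctl parameter file on the
    running system. Raises OSError if the kernel has no RDS support.
    """
    family = read_sysctl_int(pf_rds_path)
    return socket.socket(family, socket.SOCK_SEQPACKET, 0)
```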
+
+
+ Socket Options
+ RDS sockets support a number of socket options through the setsock‐
+ opt(2) and getsockopt(2) calls. The following generic options (with
+ socket level SOL_SOCKET) are of specific importance:
+
+ SO_RCVBUF
+ Specifies the size of the receive buffer. See section on "Con‐
+ gestion Control" below.
+
+ SO_SNDBUF
+ Specifies the size of the send buffer. See "Message Transmis‐
+ sion" below.
+
+ SO_SNDTIMEO
+ Specifies the send timeout when trying to enqueue a message on a
+ socket with a full queue in blocking mode.
+
+ In addition to these, RDS supports a number of protocol specific
+ options (with socket level SOL_RDS). Just as with the RDS protocol
+ family, an official value has not been assigned yet, so the kernel will
+ assign a value dynamically. The assigned value can be retrieved from
+ the sol_rds sysctl parameter file.
+
+ RDS specific socket options will be described in a separate section
+ below.
+
+ Binding
+ A new RDS socket has no local address when it is first returned from
+ socket(2). It must be bound to a local address by calling bind(2)
+ before any messages can be sent or received. This will also attach the
+ socket to a specific transport, based on the type of interface the
+ local address is attached to. From that point on, the socket can only
+ reach destinations which are available through this transport.
+
+ For instance, when binding to the address of an Infiniband interface
+ such as ib0, the socket will use the Infiniband transport. If RDS is
+ not able to associate a transport with the given address, it will
+ return EADDRNOTAVAIL.
+
+ An RDS socket can only be bound to one address and only one socket can
+ be bound to a given address/port pair. If no port is specified in the
+ binding address then an unbound port is selected at random.
+
+ RDS does not allow the application to bind a previously bound socket to
+ another address. Binding to the wildcard address INADDR_ANY is not per‐
+ mitted either.
+
+ Connecting
+ The default mode of operation for RDS is to use unconnected socket, and
+ specify a destination address as an argument to sendmsg. However, RDS
+ allows sockets to be connected to a remote end point using connect(2).
+ If a socket is connected, calling sendmsg without specifying a destina‐
+ tion address will use the previously given remote address.
+
+ Congestion Control
+ RDS does not have explicit congestion control like common streaming
+ protocols such as TCP. However, sockets have two queue limits associ‐
+ ated with them; the send queue size and the receive queue size. Mes‐
+ sages are accounted based on the number of bytes of payload.
+
+ The send queue size limits how much data local processes can queue on a
+ local socket (see the following section). If that limit is exceeded,
+ the kernel will not accept further messages until the queue is drained
+ and messages have been delivered to and acknowledged by the remote
+ host.
+
+ The receive queue size limits how much data RDS will put on the receive
+ queue of a socket before marking the socket as congested. When a
+ socket becomes congested, RDS will send a congestion map update to the
+ other participating hosts, who are then expected to stop sending more
+ messages to this port.
+
+ There is a timing window during which a remote host can still continue
+ to send messages to a congested port; RDS solves this by accepting
+ these messages even if the socket's receive queue is already over the
+ limit.
+
+ As the application pulls incoming messages off the receive queue using
+ recvmsg(2), the number of bytes on the receive queue will eventually
+ drop below the receive queue size, at which point the port is then
+ marked uncongested, and another congestion update is sent to all par‐
+ ticipating hosts. This tells them to allow applications to send addi‐
+ tional messages to this port.
+
+ The default values for the send and receive buffer sizes are controlled
+ by the system-wide wmem_default and rmem_default sysctls, respectively.
+ They can be tuned by the application through the SO_SNDBUF and
+ SO_RCVBUF socket options.
+
+
+ Blocking Behavior
+ The sendmsg(2) and recvmsg(2) calls can block in a variety of situa‐
+ tions. Whether a call blocks or returns with an error depends on the
+ non-blocking setting of the file descriptor and the MSG_DONTWAIT mes‐
+ sage flag. If the file descriptor is set to blocking mode (which is the
+ default), and the MSG_DONTWAIT flag is not given, the call will block.
+
+ In addition, the SO_SNDTIMEO and SO_RCVTIMEO socket options can be used
+ to specify a timeout (in seconds) after which the call will abort wait‐
+ ing, and return an error. The default timeout is 0, which tells RDS to
+ block indefinitely.
+
+ Message Transmission
+ Messages may be sent using sendmsg(2) once the RDS socket is bound.
+ Message length cannot exceed 4 gigabytes as the wire protocol uses an
+ unsigned 32 bit integer to express the message length.
+
+ RDS does not support out of band data. Applications are allowed to send
+ to unicast addresses only; broadcast or multicast are not supported.
+
+ A successful sendmsg(2) call puts the message in the socket's transmit
+ queue where it will remain until either the destination acknowledges
+ that the message is no longer in the network or the application removes
+ the message from the send queue.
+
+ Messages can be removed from the send queue with the RDS_CANCEL_SENT_TO
+ socket option described below.
+
+ While a message is in the transmit queue its payload bytes are
+ accounted for. If an attempt is made to send a message while there is
+ not sufficient room on the transmit queue, the call will either block
+ or return EAGAIN.
+
+ When trying to send to a destination that is marked congested (see
+ above), the call will either block or return ENOBUFS.
+
+ A message sent with no payload bytes will not consume any space in the
+ destination's send buffer but will result in a message receipt on the
+ destination. The receiver will not get any payload data but will be
+ able to see the sender's address.
+
+ Messages sent to a port to which no socket is bound will be silently
+ discarded by the destination host. No error messages are reported to
+ the sender.
+
+ Message Receipt
+ Messages may be received with recvmsg(2) on an RDS socket once it is
+ bound to a source address. RDS will return messages in-order, i.e. mes‐
+ sages from the same sender will arrive in the same order in which they
+ were sent.
+
+ The address of the sender will be returned in the sockaddr_in structure
+ pointed to by the msg_name field, if set.
+
+ If the MSG_PEEK flag is given, the first message on the receive queue is
+ returned without removing it from the queue.
+
+ The memory consumed by messages waiting for delivery does not limit the
+ number of messages that can be queued for receive. RDS does attempt to
+ perform congestion control as described in the section above.
+
+ If the length of the message exceeds the size of the buffer provided to
+ recvmsg(2), then the remainder of the bytes in the message are dis‐
+ carded and the MSG_TRUNC flag is set in the msg_flags field. In this
+ truncating case recvmsg(2) will still return the number of bytes
+ copied, not the length of the entire message. If MSG_TRUNC is set in the
+ flags argument to recvmsg(2), then it will return the number of bytes
+ in the entire message. Thus one can examine the size of the next mes‐
+ sage in the receive queue without incurring a copying overhead by pro‐
+ viding a zero length buffer and setting MSG_PEEK and MSG_TRUNC in the
+ flags argument.
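+
+ The MSG_PEEK and MSG_TRUNC technique is not specific to RDS. The
+ sketch below demonstrates it on a Linux AF_UNIX datagram socket pair,
+ which is used here only as a stand-in so that the example runs without
+ RDS. It passes a one-byte scratch buffer instead of a zero-length
+ one, because Python may return early without calling into the kernel
+ when the buffer is empty.

```python
import socket


def next_message_length(sock):
    """Return the full length of the next queued datagram without
    removing it from the receive queue."""
    probe = bytearray(1)  # scratch buffer; at most 1 byte is copied
    return sock.recv_into(probe, 1, socket.MSG_PEEK | socket.MSG_TRUNC)


# Stand-in for an RDS socket: a local datagram socket pair (Linux).
a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
a.send(b"x" * 300)

print(next_message_length(b))  # 300
print(len(b.recv(4096)))       # 300: the message was still queued

a.close()
b.close()
```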
+
+ The sending address of a zero-length message will still be provided in
+ the msg_name field.
+
+ Control Messages
+ RDS uses control messages (a.k.a. ancillary data) through the msg_con‐
+ trol and msg_controllen fields in sendmsg(2) and recvmsg(2). Control
+ messages generated by RDS have a cmsg_level value of sol_rds. Most
+ control messages are related to the zerocopy interface added in RDS
+ version 3, and are described in rds-rdma(7).
+
+ The only exception is the RDS_CMSG_CONG_UPDATE message, which is
+ described in the following section.
+
+ Polling
+ RDS supports the poll(2) interface in a limited fashion. POLLIN is
+ returned when there is a message (either a proper RDS message, or a
+ control message) waiting in the socket's receive queue. POLLOUT is
+ always returned while there is room on the socket's send queue.
+
+ Sending to congested ports requires special handling. When an applica‐
+ tion tries to send to a congested destination, the system call will
+ return ENOBUFS. However, it cannot poll for POLLOUT, as there is prob‐
+ ably still room on the transmit queue, so the call to poll(2) would
+ return immediately, even though the destination is still congested.
+
+ There are two ways of dealing with this situation. The first is to sim‐
+ ply poll for POLLIN. By default, a process sleeping in poll(2) is
+ always woken up when the congestion map is updated, and thus the appli‐
+ cation can retry any previously congested sends.
+
+ The second option is explicit congestion monitoring, which gives the
+ application more fine-grained control.
+
+ With explicit monitoring, the application polls for POLLIN as before,
+ and additionally uses the RDS_CONG_MONITOR socket option to install a
+ 64bit mask value in the socket, where each bit corresponds to a group
+ of ports. When a congestion update arrives, RDS checks the set of ports
+ that became uncongested against the bit mask installed in the socket.
+ If they overlap, a control message is enqueued on the socket, and the
+ application is woken up. When it calls recvmsg(2), it will be given the
+ control message containing the bitmap.
+
+ The congestion monitor bitmask can be set and queried using setsock‐
+ opt(2) with RDS_CONG_MONITOR, and a pointer to the 64bit mask variable.
+
+ Congestion updates are delivered to the application via
+ RDS_CMSG_CONG_UPDATE control messages. These control messages are
+ always delivered by themselves (or possibly with additional control messages),
+ but never along with an RDS data message. The cmsg_data field of
+ the control message is an 8 byte datum containing the 64bit mask value.
+
+ Applications can use the following macros to test for and set bits in
+ the bitmask:
+
+ #define RDS_CONG_MONITOR_SIZE 64
+ #define RDS_CONG_MONITOR_BIT(port) (((unsigned int) (port)) % RDS_CONG_MONITOR_SIZE)
+ #define RDS_CONG_MONITOR_MASK(port) (1ULL << RDS_CONG_MONITOR_BIT(port))
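+
+ The bit arithmetic behind these macros can be checked from a shell. The
+ sketch below is illustrative only: rds_cong_bit and rds_cong_mask are
+ hypothetical helper names, not part of RDS, and the sketch assumes a shell
+ whose arithmetic is 64 bits wide.
+
```shell
#!/bin/sh
# Hypothetical helpers mirroring the RDS_CONG_MONITOR_* macros above.

RDS_CONG_MONITOR_SIZE=64

# Print the bit index (0..63) that a port number maps to.
rds_cong_bit() {
    echo $(( $1 % RDS_CONG_MONITOR_SIZE ))
}

# Print the 64-bit mask with only that port's bit set, as 16 hex digits.
rds_cong_mask() {
    printf '0x%016x\n' $(( 1 << ($1 % RDS_CONG_MONITOR_SIZE) ))
}

rds_cong_bit 70      # bit index for port 70
rds_cong_mask 40     # mask for port 40; the bit lies above the low 32 bits
```
+
+ Because the port number is reduced modulo 64, ports 6 and 70 share a bit, so
+ a congestion update for either port matches a mask built for the other.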
+
+
+ Canceling Messages
+ An application can cancel (flush) messages from the send queue using
+ the RDS_CANCEL_SENT_TO socket option with setsockopt(2). This call
+ takes an optional sockaddr_in address structure as argument. If given,
+ only messages to the destination specified by this address are dis‐
+ carded. If no address is given, all pending messages are discarded.
+
+ Note that this affects messages that have not yet been transmitted as
+ well as messages that have been transmitted, but for which no acknowl‐
+ edgment from the remote host has been received yet.
+
+ Reliability
+ If sendmsg(2) succeeds, RDS guarantees that the message will be vis‐
+ ible to recvmsg(2) on a socket bound to the destination address as
+ long as that destination socket remains open.
+
+ If there is no socket bound on the destination, the message is
+ silently dropped. If the sending RDS can't be sure that there is no
+ socket bound then it will try to send the message indefinitely until it
+ can be sure or the sent message is canceled.
+
+ If a socket is closed then all pending sent messages on the socket are
+ canceled and may or may not be seen by the receiver.
+
+ The RDS_CANCEL_SENT_TO socket option can be used to cancel all pending
+ messages to a given destination.
+
+ If a receiving socket is closed with pending messages then the sender
+ considers those messages as having left the network and will not
+ retransmit them.
+
+ A message will only be seen by recvmsg(2) once, unless MSG_PEEK was
+ specified. Once the message has been delivered it is removed from the
+ sending socket's transmit queue.
+
+ All messages sent from the same socket to the same destination will be
+ delivered in the order they're sent. Messages sent from different sock‐
+ ets, or to different destinations, may be delivered in any order.
+
+SYSCTL VALUES
+ These parameters may only be accessed through their files in
+ /proc/sys/net/rds. Access through sysctl(2) is not supported.
+
+ pf_rds This file contains the string representation of the protocol
+ family constant passed to socket(2) to create a new RDS socket.
+
+ sol_rds
+ This file contains the string representation of the socket level
+ parameter that is passed to getsockopt(2) and setsockopt(2) to
+ manipulate RDS socket options.
+
+ max_unacked_bytes and max_unacked_packets
+ These parameters are used to tune the generation of acknowledge‐
+ ments. By default, the system receiving RDS messages does not
+ send back explicit acknowledgements unless it transmits a mes‐
+ sage of its own (in which case the ACK is piggybacked onto the
+ outgoing message), or when the sending system requests an ACK.
+
+ However, the sender needs to see an ACK from time to time so
+ that it can purge old messages from the send queue. The unacked
+ bytes and packet counters are used to keep track of how much
+ data has been sent without requesting an ACK. The default is to
+ request an acknowledgement every 16 packets, or every 16 MB,
+ whichever comes first.
+
+ reconnect_delay_min_ms and reconnect_delay_max_ms
+ RDS uses host-to-host connections to transport RDS messages
+ (both for the TCP and the Infiniband transport). If this connec‐
+ tion breaks, RDS will try to re-establish the connection.
+ Because this reconnect may be triggered by both hosts at the
+ same time and fail, RDS uses a random backoff before attempting
+ a reconnect. These two parameters specify the minimum and maxi‐
+ mum delay in milliseconds. The default values are 1 and 1000,
+ respectively.
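+
+ As a minimal sketch, the tunables described above can be read directly from
+ their files. The file names are taken from this page; the function name
+ rds_show_tunables and its optional root-prefix argument are hypothetical and
+ exist only to make the sketch easy to try on a host without RDS loaded.
+
```shell
#!/bin/sh
# Print "name value" for each tunable found under <root>/proc/sys/net/rds.
# $1 is an optional root prefix (hypothetical, for testing); default is "".
rds_show_tunables() {
    rds_dir="${1:-}/proc/sys/net/rds"
    for rds_name in reconnect_delay_min_ms reconnect_delay_max_ms \
                    max_unacked_bytes max_unacked_packets; do
        if [ -r "$rds_dir/$rds_name" ]; then
            printf '%s %s\n' "$rds_name" "$(cat "$rds_dir/$rds_name")"
        else
            printf '%s (not available)\n' "$rds_name"
        fi
    done
}

rds_show_tunables
```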
+
+SEE ALSO
+ rds-rdma(7), socket(2), bind(2), sendmsg(2), recvmsg(2), getsockopt(2),
+ setsockopt(2).
+
+
+
+ RDS(7)
--- /dev/null
+===============================================================================
+ OFED-1.5.1 RoCEE Support README
+ February 2010
+===============================================================================
+
+Contents:
+=========
+1. Overview
+2. Software Dependencies
+3. User Guidelines
+4. Ported Applications
+5. Gid tables
+6. Using VLANs
+7. Statistic counters
+8. Firmware Requirements
+9. Supported hardware
+10. Added features
+11. Known Issues
+
+
+1. Overview
+===========
+RDMA over Converged Enhanced Ethernet (RoCEE) allows InfiniBand (IB) transport
+over Ethernet networks. It encapsulates IB transport and GRH headers in
+Ethernet packets bearing a dedicated ether type.
+While the use of GRH is optional within IB subnets, it is mandatory when using
+RoCEE. Verbs applications written over IB verbs should work seamlessly, but
+they require provisioning of GRH information when creating address vectors. The
+library and driver are modified to provide for mapping from GID to MAC
+addresses required by the hardware.
+
+2. Software Dependencies
+========================
+In order to use RoCEE over Mellanox ConnectX(R) hardware, the mlx4_en driver
+must be loaded. Please refer to MLNX_EN_README.txt for further details.
+
+
+3. User Guidelines
+==================
+Since RoCEE encapsulates InfiniBand traffic in Ethernet frames, the
+corresponding net device must be up and running. In case of Mellanox
+hardware, mlx4_en must be loaded and the corresponding interface configured.
+- Make sure mlx4_en.ko is loaded
+- Make sure an IP address has been configured on this interface
+- Run "ibv_devinfo". There is a new field named "link_layer" which can be
+ either "Ethernet" or "IB". If the value is IB, then you need to use
+ connectx_port_config to change the ConnectX ports designation to eth (see
+ mlx4_release_notes.txt for details)
+- Configure the IP address of the interface so that the link will become
+ active
+- All verbs applications that run over IB should work on RoCEE links as long
+ as they use GRH headers (that is, as long as they specify use of GRH in
+ their address vector)
+- rdma_cm applications working over RoCEE will have the TOS field set to a
+ default value of 3. The default value is given as a module parameter to
+ rdma_cm:
+ def_prec2sl:Default value for SL priority with RoCE. Valid values 0 - 7 (int).
+
+
+4. Ported Applications
+======================
+- ibv_*_pingpong examples have been ported. The user must specify the GID
+ of the remote peer using the new '-g' option. The GID has the same format as
+ that in /sys/class/infiniband/mlx4_0/ports/1/gids/0
+
+- Note: Care should be taken when using ibv_ud_pingpong. The default message
+ size is 2K, which is likely to exceed the MTU of the RoCEE link. Use
+ ibv_devinfo to inspect the link MTU and specify an appropriate message size
+
+- All rdma_cm applications should work seamlessly without any change
+
+- libsdp works without any change
+
+- Performance tests have been ported
+
+
+5. Gid tables
+=============
+With RoCEE, there may be several entries in a port's GID table. The first entry
+always contains the IPv6 link local address of the corresponding ethernet
+interface. The link local address is formed in the following way:
+
+gid[0..7] = fe80000000000000
+gid[8] = mac[0] ^ 2
+gid[9] = mac[1]
+gid[10] = mac[2]
+gid[11] = ff
+gid[12] = fe
+gid[13] = mac[3]
+gid[14] = mac[4]
+gid[15] = mac[5]
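+
+The byte layout above can be sketched as a small shell function. This is an
+illustration, not part of OFED: the name mac_to_gid is hypothetical, the MAC
+address in the demonstration is a made-up sample, and the output is assumed
+to use the colon-separated form found in the sysfs gids files.
+
```shell
#!/bin/sh
# Print the link-local GID (gid[0..15]) derived from a MAC address given as
# aa:bb:cc:dd:ee:ff, following the byte layout listed above.
mac_to_gid() {
    mg_ifs=$IFS; IFS=:
    set -- $1            # split the MAC into six positional parameters
    IFS=$mg_ifs
    [ $# -eq 6 ] || return 1
    # gid[8] is mac[0] with bit 1 flipped; gid[11..12] are the fixed ff fe.
    printf 'fe80:0000:0000:0000:%02x%02x:%02xff:fe%02x:%02x%02x\n' \
        $(( 0x$1 ^ 2 )) "0x$2" "0x$3" "0x$4" "0x$5" "0x$6"
}

mac_to_gid 00:02:c9:03:00:01   # sample MAC, chosen only for illustration
```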
+
+If VLAN is supported by the kernel, and there are VLAN interfaces on the main
+ethernet interface (the interface that the IB port is tied to), each such VLAN
+will appear as a new GID in the port's GID table. The format of the GID entry
+will be identical to the one described above with the following change:
+
+gid[11] = VLAN ID high byte (4 MS bits).
+gid[12] = VLAN ID low byte
+
+Please note that VLAN ID is 12 bits.
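+
+A minimal sketch of the VLAN variant follows. The function name vlan_gid and
+the sample MAC address are hypothetical; the only difference from the
+link-local entry is that gid[11] and gid[12] carry the 12-bit VLAN ID.
+
```shell
#!/bin/sh
# Print the GID for a VLAN interface: $1 is the MAC (aa:bb:cc:dd:ee:ff) of
# the main Ethernet interface, $2 is the VLAN ID in decimal.
vlan_gid() {
    vg_vid=$2
    # VLAN IDs are 12 bits wide; reject anything outside 0..4095.
    [ "$vg_vid" -ge 0 ] && [ "$vg_vid" -le 4095 ] || return 1
    vg_ifs=$IFS; IFS=:
    set -- $1            # split the MAC into six positional parameters
    IFS=$vg_ifs
    [ $# -eq 6 ] || return 1
    # gid[11] = high byte of the VLAN ID, gid[12] = low byte.
    printf 'fe80:0000:0000:0000:%02x%02x:%02x%02x:%02x%02x:%02x%02x\n' \
        $(( 0x$1 ^ 2 )) "0x$2" "0x$3" \
        $(( vg_vid >> 8 )) $(( vg_vid & 0xff )) \
        "0x$4" "0x$5" "0x$6"
}

vlan_gid 00:02:c9:03:00:01 7   # sample MAC with VLAN ID 7
```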
+
+Priority pause frames
+---------------------
+Tagged ethernet frames carry a 3 bit priority field. The value of this field is
+derived from the IB SL field by taking the 3 LS bits of the SL field.
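+
+The SL-to-priority mapping is a three-bit mask, which the following sketch
+spells out. The function name sl_to_priority is hypothetical.
+
```shell
#!/bin/sh
# Print the 3-bit Ethernet priority for an IB SL value: the 3 LS bits of SL.
sl_to_priority() {
    echo $(( $1 & 7 ))
}

# Show the mapping for a few SL values, including two above 7.
for sl in 0 5 8 13; do
    printf 'SL %2d -> priority %d\n' "$sl" "$(sl_to_priority "$sl")"
done
```
+
+SL values that differ only in bit 3 (for example 5 and 13) therefore share the
+same Ethernet priority.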
+
+
+6. Using VLANs
+==============
+In order for RoCEE traffic to use VLAN tagged frames, the user has to specify
+GID table entries that are derived from VLAN devices when creating address
+vectors. Consider the example below:
+
+6.1 Make sure VLAN support is enabled by the kernel. Usually this requires
+loading the 8021q module.
+- modprobe 8021q
+
+6.2 Add a VLAN device
+- vconfig add eth2 7
+
+6.3 Assign IP address to the VLAN interface
+- ifconfig eth2.7 7.10.11.12
+Suppose this created a new entry in the GID table at index 1.
+
+6.4 verbs test:
+server: ibv_rc_pingpong -g 1
+client: ibv_rc_pingpong -g 1 server
+
+6.5 For rdma_cm applications, the user only needs to specify an IP address of a
+VLAN device for the traffic to use that VLAN's tagged frames.
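+
+Rather than assuming which index the VLAN entry received, the GID table can be
+searched for it. The sketch below is one possible way to do that; the function
+name find_gid_index is hypothetical, and the default directory is the mlx4_0
+port 1 path mentioned in section 4.
+
```shell
#!/bin/sh
# Print the index of the GID table entry equal to $1.
# $2 is the gids directory to search (default: port 1 of mlx4_0).
find_gid_index() {
    fg_dir=${2:-/sys/class/infiniband/mlx4_0/ports/1/gids}
    for fg_entry in "$fg_dir"/*; do
        [ -r "$fg_entry" ] || continue
        if [ "$(cat "$fg_entry" 2>/dev/null)" = "$1" ]; then
            basename "$fg_entry"    # the file name is the table index
            return 0
        fi
    done
    return 1
}
```
+
+The printed index is the value to pass to the '-g' option of the pingpong
+examples in step 6.4.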
+
+7. Statistic counters
+=====================
+RoCEE traffic is counted and can be read from the sysfs counters in the same
+manner as it is done for regular InfiniBand devices. Only the following
+counters are supported:
+- port_xmit_packets
+- port_rcv_packets
+- port_rcv_data
+- port_xmit_data
+
+For example, to read the number of transmitted packets on port 2 of device
+mlx4_1, one needs to read the file:
+/sys/class/infiniband/mlx4_1/ports/2/counters/port_xmit_packets
+
+Note: RoCEE traffic will not show in the associated Ethernet device's counters
+since it is offloaded by the hardware and does not go through the Ethernet
+network driver.
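+
+As a sketch, the four supported counters can be collected in one pass. The
+function name roce_counters and its optional sysfs-root argument are
+hypothetical; the counter names and the path layout are the ones given above.
+
```shell
#!/bin/sh
# Print the four supported counters for device $1, port $2 as name=value.
# $3 is an optional sysfs root (default /sys), a hypothetical hook for testing.
roce_counters() {
    rc_dir="${3:-/sys}/class/infiniband/$1/ports/$2/counters"
    for rc_name in port_xmit_packets port_rcv_packets \
                   port_rcv_data port_xmit_data; do
        if [ ! -r "$rc_dir/$rc_name" ]; then
            echo "$rc_name unavailable in $rc_dir" >&2
            return 1
        fi
        printf '%s=%s\n' "$rc_name" "$(cat "$rc_dir/$rc_name")"
    done
}
```
+
+For the example above, "roce_counters mlx4_1 2" would print the counters of
+port 2 of device mlx4_1.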
+
+
+8. Firmware Requirements
+========================
+RoCEE has limited support with firmware 2.7.700 and will be fully supported
+with firmware 2.8.000.
+
+
+9. Supported hardware
+=====================
+Currently, ConnectX B0 hardware is supported. A0 hardware may have issues.
+
+
+10. Added features
+==================
+ibdev2netdev is a utility that displays the association between an HCA's port
+and the network interface bound to it. Example run:
+
+sw417:/usr/src/packages/SOURCES/ofa_kernel-1.5.2 # ibdev2netdev
+mlx4_0 port 1 ==> ib0 (Down)
+mlx4_0 port 2 ==> ib1 (Down)
+mlx4_1 port 1 ==> eth2 (Up)
+mlx4_1 port 2 ==> eth3 (Up)
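+
+The output format shown above is easy to filter. The following sketch lists
+only the ports whose network interface is up; the function name up_ports is
+hypothetical, and the sketch assumes the output keeps the column layout of the
+example run.
+
```shell
#!/bin/sh
# Read ibdev2netdev-style lines on standard input and print
# "device:port netdev" for every port reported as (Up).
up_ports() {
    awk '$4 == "==>" && $6 == "(Up)" { print $1 ":" $3, $5 }'
}

# Demonstration on the sample output shown above.
up_ports <<'EOF'
mlx4_0 port 1 ==> ib0 (Down)
mlx4_0 port 2 ==> ib1 (Down)
mlx4_1 port 1 ==> eth2 (Up)
mlx4_1 port 2 ==> eth3 (Up)
EOF
```
+
+On a live system the same filter would be used as "ibdev2netdev | up_ports".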
+
+
+
+11. Known Issues
+===============
+- PowerPC and ia64 architectures are not supported. x32 architectures were
+ not tested.
+
+- SRP is not supported.
+
+- UD QPs that send traffic with VLAN tags (e.g. 802.1q tagged frames) do not
+ work. This will be fixed in a subsequent release.
--- /dev/null
+SCSI RDMA Protocol (SRP) Target driver for Linux
+=================================================
+
+SRP Target driver is designed to work directly on top of OpenFabrics
+OFED-1.x software stack (http://www.openfabrics.org) or Infiniband
+drivers in Linux kernel tree (kernel.org). It also interfaces with
+Generic SCSI target mid-level driver - SCST (http://scst.sourceforge.net)
+
+By interfacing with the SCST driver, the SRP target supports many I/O modes on
+real or virtual back-end devices:
+
+1. scst_disk -- interfacing with scsi sub-system to claim and export real
+ scsi devices, i.e. disks, hardware raid volumes, and tape libraries, as SRP's luns
+
+2. scst_vdisk -- fileio and blockio modes. This allows you to turn software
+ raid volumes, LVM volumes, IDE disks, block devices and normal files into
+ SRP's luns
+
+3. NULLIO mode will allow you to measure the performance without sending IOs
+ to *real* devices
+
+
+Prerequisites
+-------------
+0. Supported distributions: RHEL 5.2/5.3/5.4, SLES 10 sp2/sp3, SLES 11
+
+NOTES: On distribution default kernels, you can run scst_vdisk in blockio mode
+ to get good performance.
+
+ To run scst_disk, i.e. scsi pass-thru mode, you must either patch and
+ recompile the kernel
+ OR
+ compile scst with -DSTRICT_SERIALIZING enabled, which does not yield
+ good performance.
+
+1. Download and install SCST driver (supported version 1.0.1.1)
+
+1a. Download scst-1.0.1.1.tar.gz from this URL
+ http://scst.sourceforge.net/downloads.html
+
+1b. untar and install scst-1.0.1.1
+
+ $ tar zxvf scst-1.0.1.1.tar.gz
+ $ cd scst-1.0.1.1
+
+ THIS STEP IS SPECIFIC FOR SLES 10 sp2/sp3 distributions:
+
+ $ patch -p1 -i <path to OFED>/docs/scst/scst_sles10_sp2.patch
+
+ For all distributions:
+
+ $ make && make install
+
+NOTES: FOR SLES 11 distribution, skip next step (step 1c) and go directly to
+ step (2)
+
+1c. patch scst.h header file with scst.patch
+
+ $ cd /usr/local/include/scst
+ $ patch -p1 -i <path to OFED>/docs/scst/scst.patch
+
+
+2. Download/install OFED-1.5.1 package - SRP target is part of OFED package
+
+NOTES: If your system already has the OFED stack installed, you need to remove
+ the previously built kernel-ib RPMs and reinstall
+
+ $ cd ~/OFED-1.5.1
+ $ rm RPMS/*/*/kernel-ib*
+ $ ./install.pl -c ofed.conf
+
+ Make sure that srpt=y in the ofed.conf
+
+2a. download OFED packages from this URL
+ http://www.openfabrics.org/downloads/OFED/OFED-1.5.1/
+
+2b. install OFED - remember to choose srpt=y
+
+ $ cd ~/OFED-1.5.1
+ $ ./install.pl
+
+
+How-to run
+-----------
+
+A. On srp target machine
+
+A1. Please refer to SCST's README for loading scst driver and its dev_handlers
+ drivers (scst_disk, scst_vdisk block or file IO mode, nullio, ...)
+ SCST's README is located in the ~/scst-1.0.1.1/ directory
+
+NOTES: In any mode you always need to have lun 0 in any group's device list.
+ Any lun numbers may follow lun 0 (they do not need to be in order, except
+ that the first lun is always 0)
+
+ Setting SRPT_LOAD=yes in /etc/infiniband/openib.conf is not sufficient.
+ It only loads the ib_srpt module and does not load scst and its dev_handlers
+
+ SCST's scst_disk module (pass-thru mode) does not run on default
+ distribution kernels (the kernels that come with RHEL 5.2/5.3/5.4 & SLES 11)
+ because it requires patching and recompiling the kernel. It can only
+ run with vanilla kernels.
+
+Example 1: working with VDISK BLOCKIO mode
+ (using md0 device, sda, and cciss/c1d0)
+a. modprobe scst
+b. modprobe scst_vdisk
+c. echo "open vdisk0 /dev/md0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
+d. echo "open vdisk1 /dev/sda BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
+e. echo "open vdisk2 /dev/cciss/c1d0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
+f. echo "add vdisk0 0" >/proc/scsi_tgt/groups/Default/devices
+g. echo "add vdisk1 1" >/proc/scsi_tgt/groups/Default/devices
+h. echo "add vdisk2 2" >/proc/scsi_tgt/groups/Default/devices
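+
+Steps c through h follow a fixed pattern, so they can be generated for any
+list of block devices. The sketch below only prints the commands so they can
+be reviewed before being run; the function name scst_blockio_cmds is
+hypothetical. LUNs are numbered from 0, which satisfies the lun 0 requirement
+noted above.
+
```shell
#!/bin/sh
# Print (do not run) the /proc/scsi_tgt commands that export each block
# device given as an argument in BLOCKIO mode, numbering LUNs from 0.
scst_blockio_cmds() {
    sb_lun=0
    for sb_dev in "$@"; do
        printf 'echo "open vdisk%d %s BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk\n' \
            "$sb_lun" "$sb_dev"
        sb_lun=$(( sb_lun + 1 ))
    done
    sb_lun=0
    for sb_dev in "$@"; do
        printf 'echo "add vdisk%d %d" > /proc/scsi_tgt/groups/Default/devices\n' \
            "$sb_lun" "$sb_lun"
        sb_lun=$(( sb_lun + 1 ))
    done
}

# Reproduces steps c through h of Example 1.
scst_blockio_cmds /dev/md0 /dev/sda /dev/cciss/c1d0
```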
+
+Example 2: working with real back-end scsi disks in scsi pass-thru mode
+a. modprobe scst
+b. modprobe scst_disk
+c. cat /proc/scsi_tgt/scsi_tgt
+ibstor00:~ # cat /proc/scsi_tgt/scsi_tgt
+Device (host:ch:id:lun or name) Device handler
+0:0:0:0 dev_disk
+4:0:0:0 dev_disk
+5:0:0:0 dev_disk
+6:0:0:0 dev_disk
+7:0:0:0 dev_disk
+
+Now you want to exclude the first scsi disk and expose the last 4 scsi disks
+as IB/SRP luns for I/O
+
+echo "add 4:0:0:0 0" >/proc/scsi_tgt/groups/Default/devices
+echo "add 5:0:0:0 1" >/proc/scsi_tgt/groups/Default/devices
+echo "add 6:0:0:0 2" >/proc/scsi_tgt/groups/Default/devices
+echo "add 7:0:0:0 3" >/proc/scsi_tgt/groups/Default/devices
+
+Example 3: working with scst_vdisk FILEIO mode
+ (using md0 device and file 10G-file)
+a. modprobe scst
+b. modprobe scst_vdisk
+c. echo "open vdisk0 /dev/md0" > /proc/scsi_tgt/vdisk/vdisk
+d. echo "open vdisk1 /10G-file" > /proc/scsi_tgt/vdisk/vdisk
+e. echo "add vdisk0 0" >/proc/scsi_tgt/groups/Default/devices
+f. echo "add vdisk1 1" >/proc/scsi_tgt/groups/Default/devices
+
+A2. modprobe ib_srpt
+
+
+B. On initiator machines you can manually do the following steps:
+
+B1. modprobe ib_srp
+B2. ibsrpdm -c -d /dev/infiniband/umadX
+ (to discover new SRP target)
+ umad0: port 1 of the first HCA
+ umad1: port 2 of the first HCA
+ umad2: port 1 of the second HCA
+B3. echo {new target info} > /sys/class/infiniband_srp/srp-mthca0-1/add_target
+B4. fdisk -l (will show newly discovered scsi disks)
+
+Example:
+Assume that you use port 1 of first HCA in the system ie. mthca0
+
+[root@lab104 ~]# ibsrpdm -c -d /dev/infiniband/umad0
+id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4,
+dgid=fe800000000000000002c90200226cf5,pkey=ffff,service_id=0002c90200226cf4
+[root@lab104 ~]# echo id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4,
+dgid=fe800000000000000002c90200226cf5,pkey=ffff,service_id=0002c90200226cf4 >
+/sys/class/infiniband_srp/srp-mthca0-1/add_target
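+
+Steps B2 and B3 can be chained so that every discovered target is added in one
+pass. This is a sketch under the assumption that "ibsrpdm -c" prints one
+target description per line; the function name srp_add_targets is
+hypothetical.
+
```shell
#!/bin/sh
# Read target descriptions (one per line) from standard input and write each
# one to the add_target file given as $1. Prints the number of targets added.
srp_add_targets() {
    sa_count=0
    while IFS= read -r sa_target; do
        [ -n "$sa_target" ] || continue     # skip blank lines
        printf '%s\n' "$sa_target" > "$1" || return 1
        sa_count=$(( sa_count + 1 ))
    done
    echo "$sa_count"
}
```
+
+With the HCA and port of the example above, the call would look like:
+ibsrpdm -c -d /dev/infiniband/umad0 | srp_add_targets /sys/class/infiniband_srp/srp-mthca0-1/add_target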
+
+OR
+
++ You can edit /etc/infiniband/openib.conf to load srp driver and srp HA daemon
+automatically ie. set SRP_LOAD=yes, SRP_DAEMON_ENABLE=yes, and SRPHA_ENABLE=yes
++ To set up and use high availability feature you need dm-multipath driver
+and multipath tool
++ Please refer to the OFED-1.5.1 SRP user manual for more detailed instructions
+on how to enable and use the HA feature (OFED-1.5.1/docs/srp_release_notes.txt)
+
+
+Here is an example of srp target setup file
+--------------------------------------------
+
+*********************** srpt.sh *****************************************
+#!/bin/sh
+modprobe scst scst_threads=1
+modprobe scst_vdisk scst_vdisk_ID=100
+
+echo "open vdisk0 /dev/cciss/c1d0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
+echo "open vdisk1 /dev/sdb BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
+echo "open vdisk2 /dev/sdc BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
+echo "open vdisk3 /dev/sdd BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
+echo "add vdisk0 0" > /proc/scsi_tgt/groups/Default/devices
+echo "add vdisk1 1" > /proc/scsi_tgt/groups/Default/devices
+echo "add vdisk2 2" > /proc/scsi_tgt/groups/Default/devices
+echo "add vdisk3 3" > /proc/scsi_tgt/groups/Default/devices
+
+modprobe ib_srpt
+
+echo "add mgmt" > /proc/scsi_tgt/trace_level
+echo "add mgmt_dbg" > /proc/scsi_tgt/trace_level
+echo "add out_of_mem" > /proc/scsi_tgt/trace_level
+
+*********************** End srpt.sh **************************************
+
+
+How-to unload/shutdown
+-----------------------
+
+1. Unload ib_srpt
+ $ modprobe -r ib_srpt
+2. Unload scst and its dev_handlers
+ $ modprobe -r scst_vdisk scst
+3. Unload ofed
+ $ /etc/rc.d/openibd stop
+
+===========================================================================
+Known Issues
+===========================================================================
+
+- With active connections/sessions and active I/Os, unloading the ib_srpt
+ driver may randomly fail and hang.
+
+- With active connections/sessions and active I/Os, rebooting the system may
+ randomly hang.
+
--- /dev/null
+IB Bonding
+===============================================================================
+
+1. Introduction
+2. How to work with interface configuration scripts
+2.1 Configuration with initscripts support
+2.1.1 Writing network scripts under Redhat-AS4 (Update 4, 5, 6 or 7)
+2.1.2 Writing network scripts under Redhat-EL5
+2.2 Configuration with sysconfig support
+2.2.1 Writing network scripts under SLES-10
+2.3 Configuring Ethernet slaves
+
+1. Introduction
+-------------------------------------------------------------------------------
+ib-bonding is a High Availability solution for IPoIB interfaces. It is based
+on the Linux Ethernet Bonding Driver and was adapted to work with IPoIB.
+However, IPoIB interfaces are supported only in active-backup mode; other
+modes should not be used.
+
+2. How to work with interface configuration scripts
+-------------------------------------------------------------------------------
+To create an interface configuration script for the ibX and bondX interfaces,
+you should use the standard syntax (depending on your OS).
+
+2.1 Configuration with initscripts support
+------------------------------------------
+Note: This feature is available only for Redhat-AS4 (Update 4, Update 5,
+Update 6 or Update 7) and for Redhat-EL5 and above.
+
+2.1.1 Writing network scripts under Redhat-AS4 (Update 4, 5, 6 or 7)
+-----------------------------------------------------------------
+* In the master (bond) interface script add the lines:
+TYPE=Bonding
+MTU=<according to the slave's MTU>
+
+Example: for bond0 (master) the file is named /etc/sysconfig/network-scripts/ifcfg-bond0
+with the following text in the file:
+
+DEVICE=bond0
+IPADDR=192.168.1.1
+NETMASK=255.255.255.0
+NETWORK=192.168.1.0
+BROADCAST=192.168.1.255
+ONBOOT=yes
+BOOTPROTO=none
+USERCTL=no
+TYPE=Bonding
+MTU=65520
+
+Note: 65520 is a valid mtu value only if all IPoIB slaves operate in connected
+mode and are configured with the same value. For IPoIB slaves that work in
+datagram mode, use MTU=2044. If you set an incorrect mtu, or do not set the mtu
+at all (leaving it at the default value), performance of the interface might
+decrease.
+
+* In the slave (ib) interface script put the following lines:
+SLAVE=yes
+MASTER=<bond name>
+TYPE=InfiniBand
+PRIMARY=<yes|no>
+
+Example: the script for ib0 (slave) would be named /etc/sysconfig/network-scripts/ifcfg-ib0
+with the following text in the file:
+
+DEVICE=ib0
+USERCTL=no
+ONBOOT=yes
+MASTER=bond0
+SLAVE=yes
+BOOTPROTO=none
+TYPE=InfiniBand
+PRIMARY=yes
+
+Note: If the slave interface is not primary then the line PRIMARY= is not
+required and can be omitted.
+
+After the configuration is saved, restart the network service by running:
+/etc/init.d/network restart
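+
+The MTU rule from the note above can be captured in a small helper when
+generating the master script. This is a sketch only; the function name
+ipoib_bond_mtu is hypothetical, and the two values are the ones stated in the
+note.
+
```shell
#!/bin/sh
# Print the bonding master MTU suggested above for an IPoIB slave mode.
ipoib_bond_mtu() {
    case "$1" in
        connected) echo 65520 ;;
        datagram)  echo 2044 ;;
        *) echo "unknown IPoIB mode: $1" >&2; return 1 ;;
    esac
}

# Emit the MTU line for a master whose slaves all run in connected mode.
printf 'MTU=%s\n' "$(ipoib_bond_mtu connected)"
```
+
+All slaves must use the same mode for the chosen value to be valid, so the
+helper should be called with the mode shared by every slave of the bond.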
+
+2.1.2 Writing network scripts under Redhat-EL5
+-----------------------------------------------
+Follow the instructions in 2.1.1 (Writing network scripts under Redhat-AS4)
+with the following changes:
+* In the bondX (master) script - the line TYPE=Bonding is not needed.
+* In the bondX (master) script - you may add to the configuration more options
+with the following line
+BONDING_OPTS=" primary=ib0 updelay=0 downdelay=0"
+* In the ibX (slave) script - the line TYPE=InfiniBand is necessary when using
+ bonding over devices configured with partitions (p_key)
+Example:
+ ifcfg-ibX.8003 and ifcfg-ibY.8003 must include the TYPE=InfiniBand line in
+ their configuration files when used as slaves for the bondX device
+* In /etc/modprobe.conf add the following lines
+alias bond0 bonding
+options bond0 miimon=100 mode=1 max_bonds=1
+
+If you want more than one bonding interface, name them bond1, bond2... and
+just add the necessary lines in /etc/modprobe.conf and change max_bonds=1 to
+max_bonds=N where N=number_of_bonding_interfaces
+
+Note: restarting OFED doesn't keep the bonding configuration via initscripts.
+You have to restart the network service in order to recreate the bonding
+interface.
+
+2.2 Configuration with sysconfig support
+----------------------------------------
+Note: This feature is available only for SLES-10 and above.
+
+2.2.1 Writing network scripts under SLES-10
+-----------------------------------------------
+* In the master (bond) interface script add the lines:
+
+BONDING_MASTER=yes
+BONDING_MODULE_OPTS="mode=active-backup miimon=<value>"
+BONDING_SLAVE0=slave0
+BONDING_SLAVE1=slave1
+MTU=<according to the slave's MTU>
+
+Example: for bond0 (master) the file is named /etc/sysconfig/network/ifcfg-bond0
+with the following text in the file:
+
+BOOTPROTO="static"
+BROADCAST="10.0.2.255"
+IPADDR="10.0.2.10"
+NETMASK="255.255.0.0"
+NETWORK="10.0.2.0"
+REMOTE_IPADDR=""
+STARTMODE="onboot"
+BONDING_MASTER="yes"
+BONDING_MODULE_OPTS="mode=active-backup miimon=100 primary=ib0 updelay=0 downdelay=0"
+BONDING_SLAVE0=ib0
+BONDING_SLAVE1=ib1
+MTU=65520
+
+Note: 65520 is a valid mtu value only if all IPoIB slaves operate in connected
+mode and are configured with the same value. For IPoIB slaves that work in
+datagram mode, use MTU=2044. If you set an incorrect mtu, or do not set the mtu
+at all (leaving it at the default value), performance of the interface might
+decrease.
+
+Note: primary, downdelay and updelay are optional bonding interface settings.
+You may choose to use them, change them or delete them from the configuration
+script (by editing the line that starts with BONDING_MODULE_OPTS)
+
+* The slave (ib) interface script should look like this:
+
+BOOTPROTO='none'
+STARTMODE='off'
+PRE_DOWN_SCRIPT=/etc/sysconfig/network/unenslave.sh
+
+After the configuration is saved, restart the network service by running:
+/etc/init.d/network restart
+
+2.3 Configuring Ethernet slaves
+-------------------------------
+It is not possible to have a mix of Ethernet slaves and IPoIB slaves under the
+same bonding master. It is possible, however, for a bonding master of Ethernet
+slaves and a bonding master of IPoIB slaves to co-exist on one machine.
+To configure Ethernet slaves under a bonding master use the following
+instructions (depending on the OS)
+
+* Under Redhat-AS4
+
+Use the same instructions as for IPoIB slaves with the following exceptions
+
+- In the master configuration file add the line
+SLAVEDEV=1
+- In the slave configuration file leave the line
+TYPE=InfiniBand
+- For Ethernet, it is possible to set parameters of the bonding module in /etc/modprobe.conf
+with the following line for example
+options bonding miimon=100 mode=1 primary=eth0
+Note that alias names for the bonding module (such as bond0) may not work.
+
+* Under Redhat-AS5
+
+No special instructions are required.
+
+* Under SLES10
+
+When using both types of bonding, it is necessary to update the
+MANDATORY_DEVICES environment variable in /etc/sysconfig/network/config with the names
+of the InfiniBand devices ( ib0, ib1, etc. ). Otherwise, bonding devices will be created
+before InfiniBand devices at boot time.
+
+Note: If there is more than one Ethernet NIC installed then there might be a
+race for the interface names eth0, eth1 etc. This may lead to an unexpected
+mapping between logical and physical devices, and in turn to a wrong bonding
+configuration. This issue may be solved by binding a logical device name (e.g.
+eth0) to a physical (hardware) device by specifying the MAC address in the
+ethN configuration file.
--- /dev/null
+# QLogic VNIC configuration file
+#
+# This file documents and describes the use of the
+# VNIC configuration file qlgc_vnic.cfg. This file
+# should reside in /etc/infiniband/qlgc_vnic.cfg
+#
+#
+# Knowing how to fill the configuration file
+###############################################
+#
+# For filling the configuration file you need to know
+# some information about your EVIC/VEx device. This information
+# can be obtained with the help of the ib_qlgc_vnic_query tool.
+# "ib_qlgc_vnic_query -es" command will give DGID, IOCGUID and IOCSTRING information about
+# the EVIC/VEx IOCs that are available through port 1 and
+# "ib_qlgc_vnic_query -es -d /dev/infiniband/umad1" will give information about
+# the EVIC/VEx IOCs available through port 2.
+#
+# Refer to the README for more information about the ib_qlgc_vnic_query tool.
+#
+#
+# General structure of the configuration file
+###############################################
+#
+# All lines beginning with a # are treated as comments.
+#
+# A simple configuration file consists of CREATE commands
+# for each VNIC interface to be created.
+#
+# A simple CREATE command looks like this:
+#
+# {CREATE; NAME="eioc1";
+# DGID=0xfe8000000000000000066a0258000001; IOCGUID=0x66a0130000001; IOCSTRING="Chassis 0x00066A00010003F2, Slot 1, IOC 1";
+# }
+#
+#Where
+#
+#NAME - The device name for the interface
+#
+#DGID - The DGID of the IOC to use.
+#
+# If DGID is specified then IOCGUID MUST also be specified.
+#
+# Though specifying DGID is optional, using this option is recommended,
+# as it will provide the quickest way of starting up the VNIC service.
+#
+#
+#IOCGUID - The GUID of the IOC to use.
+#
+#IOCSTRING - The IOC Profile ID String of the IOC to use.
+#
+# Either an IOCGUID or an IOCSTRING MUST always be specified.
+#
+# If DGID is specified then IOCGUID MUST also be specified.
+#
+# If no DGID is specified and both IOCGUID and IOCSTRING are specified
+# then IOCSTRING is given preference and the DGID of the IOC whose
+# IOCSTRING is specified is used to create the VNIC interface.
+#
+# If hotswap capability of EVIC/VEx is to be used, then IOCSTRING
+# must be specified.
+#
+#INSTANCE - Defaults to 0. Range 0-255. If a host will connect to the
+# same IOC more than once, each connection must be assigned a unique
+# number.
+#
+#
+#RX_CSUM - defaults to TRUE. When true, indicates that the receive checksum
+# should be done by the EVIC/VEx
+#
+#HEARTBEAT - defaults to 100. Specifies the time in 1/100'ths of a second
+# between heartbeats
+#
+#PORT - Specification for local HCA port. First port is 1.
+#
+#HCA - Optional HCA specification for use with PORT specification. First HCA is 0.
+#
+#PORTGUID - The PORTGUID of the IB port to use.
+#
+# Use of PORTGUID for configuring the VNIC interface has an
+# advantage on hosts with more than one HCA plugged in. As
+# PORTGUID is persistent for a given IB port, VNIC configurations
+# remain consistent and reliable, unaffected by restarts of the
+# OFED IB stack on such hosts.
+#
+# On the downside, if the HCA on the host is changed, VNIC interfaces
+# configured with PORTGUID need reconfiguration.
+#
+#IB_MULTICAST - Controls enabling or disabling of IB multicast feature on VNIC.
+# Defaults to TRUE implying IB multicast is enabled for
+# the interface. To disable IB multicast, set it to FALSE.
+#
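+# The DGID/IOCGUID/IOCSTRING rules listed above can be checked before a
+# CREATE block is written. The sketch below is illustrative only: the
+# function name vnic_ids_ok is hypothetical and is not part of the VNIC
+# tools. An empty argument means that the parameter is not specified.
+#
```shell
#!/bin/sh
# Return 0 if the DGID ($1), IOCGUID ($2) and IOCSTRING ($3) combination
# satisfies the rules above; otherwise print the violated rule and return 1.
vnic_ids_ok() {
    vi_dgid=$1
    vi_iocguid=$2
    vi_iocstring=$3
    if [ -z "$vi_iocguid" ] && [ -z "$vi_iocstring" ]; then
        echo "either IOCGUID or IOCSTRING must be specified" >&2
        return 1
    fi
    if [ -n "$vi_dgid" ] && [ -z "$vi_iocguid" ]; then
        echo "DGID requires IOCGUID" >&2
        return 1
    fi
    return 0
}
```
+#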
+# Example of DGID and IOCGUID based configuration (this configuration will give
+# the quickest start up of VNIC service):
+#
+# {CREATE; NAME="eioc1";
+# DGID=0xfe8000000000000000066a0258000001; IOCGUID=0x66a0130000001;
+# }
+#
+#
+# Example of IOCGUID based configuration:
+#
+# {CREATE; NAME="eioc1"; IOCGUID=0x66A013000010C;
+# RX_CSUM=TRUE;
+# HEARTBEAT=100; }
+#
+# Example of IOCSTRING based configuration:
+#
+# {CREATE; NAME="eioc1"; IOCSTRING="Chassis 0x00066A0050000018, Slot 2, IOC 1";
+# RX_CSUM=TRUE;
+# HEARTBEAT=100; }
+#
+#
+#Failover configuration:
+#########################
+#
+# It is possible to create a VNIC interface with failover configuration
+# by using the PRIMARY and SECONDARY commands. The IOC specified in
+# the PRIMARY command will be used as the primary IOC for this interface
+# and the IOC specified in the SECONDARY command will be used as the
+# fail-over backup in case the connection with the primary IOC fails
+# for some reason.
+#
+# PRIMARY and SECONDARY commands are written in the following way:
+#
+# PRIMARY={DGID=...;IOCGUID=...; IOCSTRING=...;INSTANCE=... } -
+# IOCGUID, and INSTANCE must be values that are unique to the primary interface
+#
+# SECONDARY={DGID=...;IOCGUID=...; INSTANCE=... } -
+# IOCGUID, and INSTANCE must be values that are unique to the secondary interface
+#
+# OR it can also be specified without using DGID, like this:
+#
+# PRIMARY={IOCGUID=...; INSTANCE=... } - IOCGUID may be substituted with
+# IOCSTRING. IOCGUID, IOCSTRING, and INSTANCE must be values that are
+# unique to the primary interface
+#
+# SECONDARY={IOCGUID=...; INSTANCE=... } - bring up a secondary connection for
+# fail-over. IOCGUID may be substituted with IOCSTRING. IOCGUID, IOCSTRING,
+# and INSTANCE values to be used for the secondary connection
+#
+#
+#Examples of failover configuration:
+#
+#{CREATE; NAME="veth1";
+# PRIMARY={ DGID=0xfe8000000000000000066a0258000001; IOCGUID=0x66a0130000001; IOCSTRING="Chassis 0x00066A00010003F2, Slot 1, IOC 1";
+# INSTANCE=1; PORT=1; }
+# SECONDARY={DGID=0xfe8000000000000000066a0258000001; IOCGUID=0x66a0230000001; IOCSTRING="Chassis 0x00066A00010003F2, Slot 1, IOC 2";
+# INSTANCE=1; PORT=2; }
+#}
+#
+# {CREATE; NAME="eioc2";
+# PRIMARY = {IOCGUID=0x66A0130000105; INSTANCE=0; PORT=1; }
+# SECONDARY = {IOCGUID=0x66A013000010C; INSTANCE=0; PORT=2; }
+# }
+#
+#Example of configuration with IB_MULTICAST
+#
+# {CREATE; NAME="eioc2";
+# PRIMARY = {IOCGUID=0x66A0130000105; INSTANCE=0; PORT=1; IB_MULTICAST=FALSE; }
+# SECONDARY = {IOCGUID=0x66A013000010C; INSTANCE=0; PORT=2; IB_MULTICAST=FALSE; }
+# }
+#
+# Example of HCA/PORT and PORTGUID configurations:
+# {
+# CREATE; NAME="veth1";
+# PRIMARY={IOCGUID=00066a02de000070; INSTANCE=1; PORTGUID=0x0002c903000010f5; }
+# SECONDARY={IOCGUID=00066a02de000070; INSTANCE=2; PORTGUID=0x0002c903000010f6; }
+# }
+#
+# {
+# CREATE; NAME="veth2";
+# PRIMARY={IOCGUID=00066a02de000070; DGID=fe8000000000000000066a21de000070; INSTANCE=3; HCA=1; PORT=2; }
+# SECONDARY={IOCGUID=00066a02de000070; DGID=fe8000000000000000066a21de000070; INSTANCE=4; HCA=0; PORT=1; }
+# }
+#
+# {
+# CREATE; NAME="veth3";
+# IOCSTRING="EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2";
+# INSTANCE=5; PORTGUID=0x0002c90300000786;
+# }
+# {
+# CREATE; NAME="veth4";
+# IOCGUID=00066a02de000070;
+# INSTANCE=6; HCA=1; PORT=2;
+# }
--- /dev/null
+ Open Fabrics Enterprise Distribution (OFED)
+ CHELSIO T3 RNIC RELEASE NOTES
+ September 2010
+
+
+The iw_cxgb3 and cxgb3 modules provide RDMA and NIC support for the
+Chelsio S series adapters. Make sure you choose the 'cxgb3' and
+'libcxgb3' options when generating your ofed rpms.
+
+============================================
+New for ofed-1.5.2
+============================================
+
+- Bug fixes. Various upstream bug fixes have been included in this
+release.
+
+============================================
+Enabling Various MPIs
+============================================
+
+For OpenMPI, Intel MPI, HP MPI, and Scali MPI: you must set the iw_cxgb3
+module option peer2peer=1 on all systems. This can be done by writing
+to the /sys/module file system during boot. EG:
+
+# echo 1 > /sys/module/iw_cxgb3/parameters/peer2peer
+
+Or you can add the following line to /etc/modprobe.conf to set the option
+at module load time:
+
+options iw_cxgb3 peer2peer=1
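+
+The two methods above can be combined in one helper. The following is
+a minimal sketch, not part of OFED: the function name enable_peer2peer
+and its two optional path arguments are illustrative assumptions, and
+the defaults are simply the paths named above.

```shell
# Sketch: enable peer2peer now and at future module loads.
# $1 = sysfs parameter file, $2 = modprobe configuration file.
# Both default to the paths named in the text and can be overridden
# so the function can be tried against scratch files first.
enable_peer2peer() {
    param_file=${1:-/sys/module/iw_cxgb3/parameters/peer2peer}
    conf_file=${2:-/etc/modprobe.conf}
    line='options iw_cxgb3 peer2peer=1'

    # Runtime setting: only possible once iw_cxgb3 is loaded.
    if [ -w "$param_file" ]; then
        echo 1 > "$param_file"
    else
        echo "warning: $param_file not writable (module not loaded?)" >&2
    fi

    # Persistent setting: append the options line once, never twice.
    touch "$conf_file" || return 1
    grep -qxF "$line" "$conf_file" || echo "$line" >> "$conf_file"
}
```
+
+Run as root with no arguments to act on the real files. Calling it
+repeatedly is harmless because the options line is only added once.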
+
+For Intel MPI, HP MPI, and Scali MPI: Enable the chelsio device by adding
+entries to /etc/dat.conf for the chelsio interface. For instance,
+if your chelsio interface name is eth2, then the following lines add
+DAT version 1.2 and 2.0 devices named "chelsio" and "chelsio2" for
+that interface:
+
+chelsio u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth2 0" ""
+chelsio2 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
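+
+For an interface other than eth2, only the interface name inside the
+quoted field changes. The sketch below prints both entries for a given
+interface; the helper name dat_conf_entries is hypothetical, and the
+remaining fields are copied unchanged from the example above.

```shell
# Print the DAT 1.2 and 2.0 entries for one chelsio interface.
# $1 = interface name (for example eth2)
# $2 = device name (default "chelsio"; the 2.0 entry appends "2")
dat_conf_entries() {
    ifname=$1
    dev=${2:-chelsio}
    printf '%s u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "%s 0" ""\n' "$dev" "$ifname"
    printf '%s2 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "%s 0" ""\n' "$dev" "$ifname"
}
```
+
+To append the entries: dat_conf_entries eth3 >> /etc/dat.conf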
+
+=============
+Intel MPI:
+=============
+
+The following env vars enable Intel MPI version 3.1.038. Place these
+in your user env after installing and setting up Intel MPI:
+
+export RSH=ssh
+export DAPL_MAX_INLINE=64
+export I_MPI_DEVICE=rdssm:chelsio
+export MPIEXEC_TIMEOUT=180
+export MPI_BIT_MODE=64
+
+Log out & log back in.
+
+Populate mpd.hosts with node names.
+Note: The hosts in this file should be Chelsio interface IP addresses.
+
+Note: I_MPI_DEVICE=rdssm:chelsio assumes you have an entry in
+/etc/dat.conf named "chelsio".
+
+Note: The MPIEXEC_TIMEOUT value may need to be increased if heavy traffic
+is going across the systems.
+
+Contact Intel for obtaining their MPI with DAPL support.
+
+To run Intel MPI applications:
+
+ mpdboot -n <num nodes> -r ssh --ncpus=<num cpus>
+ mpiexec -ppn <process per node> -n <num nodes> <MPI Application Path>
+
+
+=============
+HP MPI:
+=============
+
+The following env vars enable HP MPI version 2.03.01.00. Place these
+in your user env after installing and setting up HP MPI:
+
+export MPI_ROOT=/opt/hpmpi
+export PATH=$MPI_ROOT/bin:/opt/bin:$PATH
+export MANPATH=$MANPATH:$MPI_ROOT/share/man
+
+Log out & log back in.
+
+To run HP MPI applications, use these mpirun options:
+
+-prot -e DAPL_MAX_INLINE=64 -UDAPL
+
+EG:
+
+$ mpirun -prot -e DAPL_MAX_INLINE=64 -UDAPL -hostlist r1-iw,r2-iw ~/tests/presta-1.4.0/glob
+
+Where r1-iw and r2-iw are hostnames mapping to the chelsio interfaces.
+
+Also this assumes your first entry in /etc/dat.conf is for the chelsio
+device.
+
+Contact HP for obtaining their MPI with DAPL support.
+
+=============
+Scali MPI:
+=============
+
+The following env vars enable Scali MPI. Place these in your user env
+after installing and setting up Scali MPI for running over IWARP:
+
+export DAPL_MAX_INLINE=64
+export SCAMPI_NETWORKS=chelsio
+export SCAMPI_CHANNEL_ENTRY_COUNT="chelsio:128"
+
+Log out & log back in.
+
+Note: SCAMPI_NETWORKS=chelsio assumes you have an entry in /etc/dat.conf
+named "chelsio".
+
+Note: SCAMPI supports only dapl 1.2 library not dapl 2.0
+
+Contact Scali for obtaining their MPI with DAPL support.
+
+To run SCALI MPI applications:
+
+ mpimon <SCALI Application Path> -- <node1_IP> <procs> <node2_IP> <procs>
+
+Note: <procs> is the number of processes to run on the node.
+Note: <node#_IP> should be the IP of Chelsio's interface.
+
+=============
+OpenMPI:
+=============
+
+OpenMPI iWARP support is only available in OpenMPI version 1.3 or greater.
+
+Open MPI will work without any specific configuration via the openib btl.
+Users wishing to performance tune the configurable options may wish to
+inspect the receive queue values. Those can be found in the "Chelsio T3"
+section of mca-btl-openib-hca-params.ini.
+
+Note: OpenMPI version 1.3 does not support newer Chelsio cards with device
+IDs 0x0035 and 0x0036. To use those cards, add their device IDs to the
+"Chelsio T3" section of the mca-btl-openib-hca-params.ini file.
+
+To run OpenMPI applications:
+
+ mpirun --host <node1>,<node2> -mca btl openib,sm,self <OpenMPI Application Path>
+
+=============
+MVAPICH2:
+=============
+
+The following env vars enable MVAPICH2 version 1.4-2. Place these
+in your user env after installing and setting up MVAPICH2 MPI:
+
+export MVAPICH2_HOME=/usr/mpi/gcc/mvapich2-1.4/
+export MV2_USE_IWARP_MODE=1
+export MV2_USE_RDMA_CM=1
+
+On each node, add this to the end of /etc/profile.
+
+ ulimit -l 999999
+
+On each node, add this to the end of /etc/init.d/sshd and restart sshd.
+
+ ulimit -l 999999
+ % service sshd restart
+
+Verify the ulimit changes worked. These should show '999999':
+
+ % ulimit -l
+ % ssh <peer> ulimit -l
+
+Note: You may have to restart sshd a few times to get it to work.
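+
+The verification step can be scripted. This is a minimal sketch with a
+hypothetical helper named memlock_ok. It assumes that a limit at or
+above the expected value, or "unlimited", is also acceptable, which is
+looser than the exact '999999' check described above.

```shell
# Compare a "ulimit -l" value against the expected locked-memory limit.
# $1 = value reported by "ulimit -l"
# $2 = expected value (default 999999)
memlock_ok() {
    actual=$1
    expected=${2:-999999}
    # Assumption of this sketch: "unlimited" satisfies the requirement.
    [ "$actual" = "unlimited" ] && return 0
    # Reject empty or non-numeric input before the numeric comparison.
    case $actual in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$actual" -ge "$expected" ]
}
```
+
+Usage: memlock_ok "$(ulimit -l)" for the local shell, and
+memlock_ok "$(ssh <peer> ulimit -l)" for the ssh case.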
+
+Create mpd.hosts with a list of the hostnames or IP addresses in the
+cluster. They should be names/addresses that you can ssh to without
+passwords. (See Passwordless SSH Setup).
+
+On each node, create /etc/mv2.conf with a single line containing the
+IP address of the local T3 interface. This is how MVAPICH2 picks which
+interface to use for RDMA traffic.
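+
+A minimal sketch of creating that file follows. The helper name
+write_mv2_conf is hypothetical, and the address check is deliberately
+loose (digits and dots only), so it is not a full IPv4 validator.

```shell
# Write a single-line mv2.conf containing the local T3 interface address.
# $1 = IP address of the local T3 interface
# $2 = output file (default /etc/mv2.conf)
write_mv2_conf() {
    ip=$1
    conf=${2:-/etc/mv2.conf}
    # Loose sanity check: accept only digits and dots.
    case $ip in
        ''|*[!0-9.]*) echo "not an IPv4 address: $ip" >&2; return 1 ;;
    esac
    # Overwrite the file so it always holds exactly one line.
    printf '%s\n' "$ip" > "$conf"
}
```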
+
+On each node, edit the /etc/hosts file. If there is an entry that maps
+the 127.0.0.1 IP address to the local host name, comment it out. Then
+add an entry that maps the corporate IP address to the local host name
+(the name that you have given in the mpd.hosts file).
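+
+The /etc/hosts change can be sketched with awk. The helper name
+fix_hosts is hypothetical. It prints the rewritten file on stdout
+instead of editing in place, so the result can be reviewed first. Note
+that it comments out the whole 127.0.0.1 line, including any other
+names on that line.

```shell
# Print a rewritten hosts file on stdout.
# $1 = hosts file, $2 = local host name, $3 = IP address to map it to
fix_hosts() {
    awk -v host="$2" -v ip="$3" '
        # Comment out a loopback line that also names the local host.
        $1 == "127.0.0.1" {
            for (i = 2; i <= NF; i++)
                if ($i == host) { print "#" $0; next }
        }
        # Pass every other line through unchanged.
        { print }
        # Append the new mapping at the end of the file.
        END { print ip "\t" host }
    ' "$1"
}
```
+
+Usage: fix_hosts /etc/hosts <hostname> <ip> > /tmp/hosts.new, then
+inspect the result before copying it over /etc/hosts.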
+
+To run MVAPICH2 application:
+
+ mpirun_rsh -ssh -np 8 -hostfile mpd.hosts <MVAPICH2 Application Path>
+
+============================================
+Loadable Module options:
+============================================
+
+The following options can be used when loading the iw_cxgb3 module to
+tune the iWARP driver:
+
+cong_flavor - set the congestion control algorithm. Default is 1.
+ 0 == Reno
+ 1 == Tahoe
+ 2 == NewReno
+ 3 == HighSpeed
+
+snd_win - set the TCP send window in bytes. Default is 32kB.
+
+rcv_win - set the TCP receive window in bytes. Default is 256kB.
+
+crc_enabled - set whether MPA CRC should be negotiated. Default is 1.
+
+markers_enabled - set whether to request receiving MPA markers. Default is
+ 0; do not request to receive markers.
+
+ NOTE: The Chelsio RNIC fully supports markers, but
+ the current OFA RDMA-CM doesn't provide an API for
+ requesting either markers or crc to be negotiated. Thus
+ this functionality is provided via module parameters.
+
+mpa_rev - set the MPA revision to be used. Default is 1, which is
+ spec compliant. Set to 0 to connect with the Ammasso 1100
+ rnic.
+
+ep_timeout_secs - set the number of seconds for timing out MPA start up
+ negotiation and normal close. Default is 60.
+
+peer2peer - Enables connection setup changes to allow peer2peer
+ applications to work over chelsio rnics. This enables
+ the following applications:
+ Intel MPI
+ HP MPI
+ Open MPI
+ Scali MPI
+ MVAPICH2
+ Set peer2peer=1 on all systems to enable these
+ applications.
+
+The following options can be used when loading the cxgb3 module to
+tune the NIC driver:
+
+msi - whether to use MSI or MSI-X. Default is 2.
+ 0 = only pin
+ 1 = only MSI or pin
+ 2 = use MSI/X, MSI, or pin, based on system
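+
+Once a module is loaded, the values in effect can be read back from
+/sys/module/<module>/parameters. The following sketch prints them as
+name=value pairs; the helper name show_module_params is hypothetical,
+and which parameters appear depends on what the loaded driver exports.

```shell
# Print "name=value" for every parameter a loaded module exports.
# $1 = module name (for example iw_cxgb3 or cxgb3)
# $2 = sysfs module root (default /sys/module)
show_module_params() {
    dir=${2:-/sys/module}/$1/parameters
    if [ ! -d "$dir" ]; then
        echo "$1: no parameters directory (module not loaded?)" >&2
        return 1
    fi
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # Strip the directory part to get the parameter name.
        printf '%s=%s\n' "${f##*/}" "$(cat "$f")"
    done
}
```
+
+Usage: show_module_params iw_cxgb3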
+
+============================================
+Updating Firmware:
+============================================
+
+This release requires firmware version 7.10.0, and Protocol SRAM
+version 1.1.0. These versions are included in the ofed-1.5.2 release
+and will be automatically loaded when the cxgb3 module is loaded and
+the interface configured. To load later/newer versions of the firmware,
+follow this procedure:
+
+If your distro/kernel supports firmware loading, you can place the chelsio
+firmware and psram images in /lib/firmware/cxgb3, then unload and reload
+the cxgb3 module to get the new images loaded. If this does not work,
+then you can load the firmware images manually:
+
+Obtain the cxgbtool tool and the update_eeprom.sh script from Chelsio.
+
+To build cxgbtool:
+
+# cd <path-to-cxgbtool>
+# make && make install
+
+Then load the cxgb3 driver:
+
+# modprobe cxgb3
+
+Now note the ethernet interface name for the T3 device. This can be
+done by typing 'ifconfig -a' and noting the interface name for the
+interface with a HW address that begins with "00:07:43". Then load the
+new firmware and eeprom file:
+
+# cxgbtool ethxx loadfw <firmware_file>
+# update_eeprom.sh ethxx <eeprom_file>
+# reboot
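+
+As an alternative to reading 'ifconfig -a' output by eye, the T3
+interface can be found from sysfs. This is a sketch only: the helper
+name find_t3_ifaces is hypothetical, and it assumes the kernel exposes
+each interface's hardware address in /sys/class/net/<ifname>/address.

```shell
# List interfaces whose hardware address begins with 00:07:43.
# $1 = sysfs net root (default /sys/class/net)
find_t3_ifaces() {
    root=${1:-/sys/class/net}
    for addr in "$root"/*/address; do
        [ -r "$addr" ] || continue
        case $(cat "$addr") in
            00:07:43:*)
                # Drop "/address", then keep the last path component.
                dir=${addr%/address}
                echo "${dir##*/}"
                ;;
        esac
    done
}
```
+
+The printed name is what replaces ethxx in the commands above.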
+
+============================================
+Testing connectivity with ping and rping:
+============================================
+
+Configure the ethernet interfaces for your cxgb3 device. After you
+modprobe iw_cxgb3 you will see one or two ethernet interfaces for the
+T3 device. Configure them with an appropriate ip address, netmask, etc.
+You can use the Linux ping command to test basic connectivity via the
+T3 interface.
+
+To test RDMA, use the rping command that is included in the librdmacm-utils
+rpm:
+
+On the server machine:
+
+# rping -s -a 0.0.0.0 -p 9999
+
+On the client machine:
+
+# rping -c -VvC10 -a server_ip_addr -p 9999
+
+You should see ping data like this on the client:
+
+ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
+ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrs
+ping data: rdma-ping-2: CDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrst
+ping data: rdma-ping-3: DEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu
+ping data: rdma-ping-4: EFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuv
+ping data: rdma-ping-5: FGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvw
+ping data: rdma-ping-6: GHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwx
+ping data: rdma-ping-7: HIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
+ping data: rdma-ping-8: IJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz
+ping data: rdma-ping-9: JKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA
+client DISCONNECT EVENT...
+#
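+
+The client output can also be checked by a script. The sketch below
+counts the "ping data" lines and compares the count with the -C value.
+The helper name rping_ok is hypothetical, and it assumes rping writes
+the ping data lines to standard output.

```shell
# Count "ping data" lines on stdin and compare with the expected count.
# $1 = expected number of iterations (the -C value passed to rping)
rping_ok() {
    expected=$1
    # grep -c exits non-zero when the count is 0; "|| true" hides that.
    seen=$(grep -c '^ping data: rdma-ping-[0-9][0-9]*: ' || true)
    [ "$seen" -eq "$expected" ]
}
```
+
+Usage: rping -c -VvC10 -a server_ip_addr -p 9999 | rping_ok 10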
+
+============================================
+Additional Notes and Issues
+============================================
+
+1) To run uDAPL over the chelsio device, you must export this environment
+variable:
+
+ export DAPL_MAX_INLINE=64
+
+2) If you have a multi-homed host and the physical ethernet networks
+are bridged, or if you have multiple chelsio rnics in the system, then
+you need to configure arp to only send replies on the interface with
+the target ip address:
+
+ sysctl -w net.ipv4.conf.all.arp_ignore=2
+
+3) If you are building OFED against a kernel.org kernel later than
+2.6.20, then make sure your kernel is configured with the cxgb3 and
+iw_cxgb3 modules enabled. This forces the kernel to pull in the genalloc
+allocator, which is required for the OFED iw_cxgb3 module. Make sure
+these config options are included in your .config file:
+
+ CONFIG_CHELSIO_T3=m
+ CONFIG_INFINIBAND_CXGB3=m
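+
+A kernel .config can be checked for these options before building.
+The helper below is a hypothetical sketch: it takes the option names
+as arguments and, as an assumption, accepts both =m and =y.

```shell
# Check that each named option is set to =m or =y in a kernel .config.
# $1 = path to .config; remaining arguments = option names
check_kconfig() {
    config=$1
    shift
    rc=0
    for opt in "$@"; do
        if grep -Eq "^${opt}=(m|y)\$" "$config"; then
            echo "$opt: ok"
        else
            echo "$opt: missing"
            rc=1
        fi
    done
    return $rc
}
```
+
+Pass the path to .config followed by the option names listed above.
+The return status is non-zero if any option is missing.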
+
+4) If you run the RDMA latency test using the ib_rdma_lat program, make
+sure you use the following command lines to limit the amount of inline
+data to 64:
+
+ server: ib_rdma_lat -c -I 64
+ client: ib_rdma_lat -c -I 64 server_ip_addr
+
+5) If you're running NFSRDMA over Chelsio's T3 RNIC and your clients are
+using a 64KB page size (like PPC64 and IA64 systems) and your server is
+using a 4KB page size (like i386 and X86_64), then you need to mount the
+server using rsize=32768,wsize=32768 to avoid overrunning the Chelsio
+RNIC fast register limits. This is a known firmware limitation in the
+Chelsio RNIC.
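+
+The page size rule in note 5 can be written as a small decision
+helper. This is a sketch with a hypothetical function name. It covers
+only the 64KB client and 4KB server case described above and prints
+nothing for other combinations.

```shell
# Print the rsize/wsize mount options needed for a page size pairing.
# $1 = client page size in bytes, $2 = server page size in bytes
nfsrdma_size_opts() {
    client_page=$1
    server_page=$2
    if [ "$client_page" -eq 65536 ] && [ "$server_page" -eq 4096 ]; then
        echo "rsize=32768,wsize=32768"
    fi
}
```
+
+On the client, "getconf PAGESIZE" reports the local page size; the
+server value has to be obtained on the server.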
--- /dev/null
+ Open Fabrics Enterprise Distribution (OFED)
+ Diagnostic Tools in OFED 1.5 Release Notes
+
+ December 2009
+
+
+Repo: git://git.openfabrics.org/~sashak/management/management.git
+URL: http://www.openfabrics.org/downloads/management
+
+
+General
+-------
+Model of operation: All diag utilities use direct MAD access to perform their
+operations. Operations that require QP0 mads only may use direct routed
+mads, and therefore can work even in unconfigured subnets. Almost all
+utilities can operate without accessing the SM, unless GUID to lid translation
+is required. The only exception to this is saquery which requires the SM.
+
+
+Dependencies
+------------
+Most diag utilities depend on libibmad and libibumad.
+All diag utilities depend on the ib_umad kernel module.
+
+
+Multiple port/Multiple CA support
+---------------------------------
+When no IB device or port is specified (see the "local umad parameters" below),
+the libibumad library selects the port to use by the following criteria:
+1. the first port that is ACTIVE.
+2. if not found, the first port that is UP (physical link up).
+
+If a port and/or CA name is specified, the libibumad library attempts to
+satisfy the user request, and will fail if it cannot do so.
+
+For example:
+ ibaddr # use the 'best port'
+ ibaddr -C mthca1 # pick the best port from mthca1 only.
+ ibaddr -P 2 # use the second (active/up) port from the
+ first available IB device.
+ ibaddr -C mthca0 -P 2 # use the specified port only.
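+
+The two selection criteria can be illustrated with a short sketch. It
+is not libibumad code: the helper name pick_port and the four-column
+input table are made up for illustration only.

```shell
# Pick a port using the two criteria above.
# stdin: one port per line as "<ca_name> <port_num> <state> <phys_state>"
# (a made-up table format for this sketch, not libibumad output).
pick_port() {
    awk '
        # Remember the first ACTIVE port and the first physically up port.
        $3 == "ACTIVE" && !active { active = $1 " " $2 }
        $4 == "LinkUp" && !up     { up = $1 " " $2 }
        END {
            if (active)  print active
            else if (up) print up
            else exit 1
        }
    '
}
```
+
+For example, given the rows "mthca0 1 DOWN Polling" and
+"mthca0 2 INIT LinkUp", no port is ACTIVE, so the sketch falls back to
+the first port with the physical link up and prints "mthca0 2".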
+
+
+Common options & flags
+----------------------
+Most diagnostics take the following flags. The exact list of supported
+flags per utility can be found in the usage message and can be displayed
+using util_name -h syntax.
+
+# Debugging flags
+ -d raise the IB debugging level. May be used
+ several times (-ddd or -d -d -d).
+ -e show umad send receive errors (timeouts and others)
+ -h display the usage message
+ -v increase the application verbosity level.
+ May be used several times (-vv or -v -v -v)
+ -V display the internal version info.
+
+# Addressing flags
+ -D use directed path address arguments. The path
+ is a comma separated list of out ports.
+ Examples:
+ "0" # self port
+ "0,1,2,1,4" # out via port 1, then 2, ...
+ -G use GUID address arguments. In most cases, it is the Port GUID.
+ Examples:
+ "0x08f1040023"
+ -s <smlid> use 'smlid' as the target lid for SA queries.
+
+# Local umad parameters:
+ -C <ca_name> use the specified ca_name.
+ -P <ca_port> use the specified ca_port.
+ -t <timeout_ms> override the default timeout for the solicited mads.
+
+
+CLI notation
+------------
+All utilities use the POSIX style notation, meaning that all options (flags)
+must precede all arguments (parameters).
+
+
+Utilities descriptions
+----------------------
+See man pages
+
+
+Bugs Fixed
+----------
+
--- /dev/null
+
+ Open Fabrics Enterprise Distribution (OFED)
+ ehca in OFED 1.5.2 Release Notes
+
+ September 2010
+
+
+Overview
+--------
+ehca is the low level driver implementation for all IBM GX-based HCAs.
+
+Supported HCAs
+--------------
+- GX Dual-port SDR 4x IB HCA
+- GX Dual-port SDR 12x IB HCA
+- GX Dual-port DDR 4x IB HCA
+- GX Dual-port DDR 12x IB HCA
+
+Available Parameters
+--------------------
+In order to set ehca parameters, add the following line(s) to /etc/modprobe.conf:
+
+ options ib_ehca <parameter>=<value>
+
+whereby <parameter> is one of the following items:
+- debug_level debug level (0: no debug traces (default), 1: with debug traces)
+- port_act_time time to wait for port activation (default: 30 sec)
+- scaling_code scaling code (0: disable (default), 1: enable)
+- open_aqp1 Open AQP1 on startup (default: no) (bool)
+- hw_level Hardware level (0: autosensing (default), 0x10..0x14: eHCA, 0x20..0x23: eHCA2) (int)
+- nr_ports number of connected ports (-1: autodetect (default), 1: port one only, 2: two ports) (int)
+- use_hp_mr Use high performance MRs (default: no) (bool)
+- poll_all_eqs Poll all event queues periodically (default: yes) (bool)
+- static_rate Set permanent static rate (default: no static rate) (int)
+- lock_hcalls Serialize all hCalls made by the driver (default: autodetect) (bool)
+- number_of_cqs Max number of CQs which can be allocated (default: autodetect) (int)
+- number_of_qps Max number of QPs which can be allocated (default: autodetect) (int)
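+
+Several parameters can be given on one options line. The sketch below
+builds such a line from name=value arguments. The helper name
+ehca_options_line is hypothetical, and it checks only the shape of
+each argument, not whether the parameter name is valid for ib_ehca.

```shell
# Print a modprobe options line for ib_ehca from name=value arguments.
ehca_options_line() {
    [ "$#" -gt 0 ] || return 1
    line='options ib_ehca'
    for kv in "$@"; do
        case $kv in
            # Require at least one character on each side of "=".
            ?*=?*) line="$line $kv" ;;
            *) echo "bad parameter: $kv" >&2; return 1 ;;
        esac
    done
    echo "$line"
}
```
+
+Usage: ehca_options_line nr_ports=1 debug_level=1 >> /etc/modprobe.conf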
+
+New Features
+------------
+- None
+
+Fixed Bugs ofed-1.5.2
+---------------------
+- Fixed automatic detection if hcall locks should be enabled or not
+
+Fixed Bugs ofed-1.5.1
+---------------------
+- Fixed crash when reading sysfs performance counters
+- Do not disable IRQs when processing EQs
+- Allow query of max_dest_rd_atomic and max_qp_rd_atomic values
+
+Fixed Bugs ofed-1.5
+---------------------
+- SRQ overflow prevention
+- Performance improvements for QP creation
+- MAD redirection fix
+
+Fixed Bugs ofed-1.4.1
+---------------------
+- none
+
+Fixed Bugs ofed-1.4
+---------------------
+- Reject send work requests only for RESET, INIT and RTR state
+- Reject receive work requests if QP is in RESET state
+- In case of lost interrupts, trigger EOI to reenable interrupts
+- Filter PATH_MIG events if QP was never armed
+- Release mutex in error path of alloc_small_queue_page()
+- Check idr_find() return value
+- Discard double CQE for one WR
+- Generate flush status CQ entries
+- Don't allow creating UC QP with SRQ
+- Fix reported max number of QPs and CQs in systems with >1 adapter
+- Reject dynamic memory add/remove when ehca adapter is present
+- Remove reference to special QP in case of port activation failure
+- Fix locking for shca_list_lock
+
+Fixed Bugs ofed-1.3.1
+---------------------
+- Support all ibv_devinfo values in query_device() and query_port()
+- Prevent posting of SQ WQEs if QP not in RTS
+- Remove mr_largepage parameter, ie always enable large page support
+- Allocate event queue size depending on max number of CQs and QPs
+- Protect QP against destroying until all async events for it are handled
+
+Fixed Bugs ofed-1.3
+-------------------
+- Serialize HCA-related hCalls if necessary
+- Fix static rate if path faster than link
+- Return physical link information in query_port()
+- Fix clipping of device limits to INT_MAX
+- Fix issues related to path migration support
+- Support more than 4k QPs for userspace and kernelspace
+- Prevent sending UD packets to QP0
+- Prevent RDMA-related connection failures on some eHCA2 hardware
+
+Available backports
+-------------------
+- RedHat EL5 up4: 2.6.18-164.ELsmp
+- RedHat EL5 up5: 2.6.18-194.ELsmp
+- SLES11: 2.6.27.19-5.1-smp
+- SLES11SP1: 2.6.32.12-0.7-default
+- SLES10SP3: 2.6.16.60-0.54.5
+- kernel.org: 2.6.29-32
+
+Known Issues
+------------
+1. The port(s) needs to be connected to an active switch port while
+loading the ehca device driver.
+
+2. Dynamic memory operations are tolerated by ehca, but are prevented by
+the driver while it is loaded.
--- /dev/null
+ Open Fabrics Enterprise Distribution (OFED)
+ IB ACM in OFED 1.5 Release Notes
+
+ July 2010
+
+
+===============================================================================
+Table of Contents
+===============================================================================
+1. Overview
+2. Quick Start Guide
+3. Operation Details
+4. Known Issues
+
+===============================================================================
+1. Overview
+===============================================================================
+The IB ACM package implements and provides a framework for experimental name,
+address, and route resolution services over InfiniBand. It is intended to
+address connection setup scalability issues running MPI applications on
+large clusters. The IB ACM provides information needed to establish a
+connection, but does not implement the CM protocol.
+
+The librdmacm can invoke IB ACM services when built using the --with-ib_acm
+option. The IB ACM services tie in under the rdma_resolve_addr,
+rdma_resolve_route, and rdma_getaddrinfo routines. For maximum benefit,
+the rdma_getaddrinfo routine should be used, however existing applications
+should still see significant connection scaling benefits using the calls
+available in librdmacm 1.0.11 and previous releases.
+
+The IB ACM is focused on being scalable and efficient. The current
+implementation limits network traffic, SA interactions, and centralized
+services. ACM supports multiple resolution protocols in order to handle
+different fabric topologies.
+
+The IB ACM package is comprised of two components: the ib_acm service
+and a test/configuration utility - ib_acme. Both are userspace components
+and are available for Linux and Windows. Additional details are given below.
+
+===============================================================================
+2. Quick Start Guide
+===============================================================================
+
+1. Prerequisites: libibverbs and libibumad must be installed.
+ The IB stack should be running with IPoIB configured.
+ These steps assume that the user has administrative privileges.
+2. Install the IB ACM package
+ This installs ib_acm, and ib_acme.
+3. Run ib_acme -A -O
+ This will generate IB ACM address and options configuration files.
+ (acm_addr.cfg and acm_opts.cfg)
+4. Run ib_acm and leave running.
+ ib_acm will eventually be converted to a service/daemon, but for now
+ is a userspace application. Because ib_acm uses the libibumad
+ interfaces, it should be run with administrative privileges.
+5. Optionally, run ib_acme -s <source_ip> -d <dest_ip> -v
+ This will verify that the ib_acm service is running.
+6. Install librdmacm using the build option --with-ib_acm.
+ The librdmacm will automatically use the ib_acm service.
+ On failures, the librdmacm will fall back to normal resolution.
+
+===============================================================================
+3. Operation Details
+===============================================================================
+
+ib_acme:
+The ib_acme program serves a dual role. It acts as a utility to test
+ib_acm operation and help verify if the ib_acm service and selected
+protocol is usable for a given cluster configuration. Additionally,
+it automatically generates ib_acm configuration files to assist with
+or eliminate manual setup.
+
+
+acm configuration files:
+The ib_acm service relies on two configuration files.
+
+The acm_addr.cfg file contains name and address mappings for each IB
+<device, port, pkey> endpoint. Although the names in the acm_addr.cfg
+file can be anything, ib_acme maps the host name and IP addresses to
+the IB endpoints.
+
+The acm_opts.cfg file provides a set of configurable options for the
+ib_acm service, such as timeout, number of retries, logging level, etc.
+ib_acme generates the acm_opts.cfg file using static information. A
+future enhancement would adjust options based on the current system
+and cluster size.
+
+
+ib_acm:
+The ib_acm service is responsible for resolving names and addresses to
+InfiniBand path information and caching such data. It is currently
+implemented as an executable application, but is a conceptual service
+or daemon that should execute with administrative privileges.
+
+The ib_acm implements a client interface over TCP sockets, which is
+abstracted by the librdmacm library. One or more back-end protocols are
+used by the ib_acm service to satisfy user requests. Although the
+ib_acm supports standard SA path record queries on the back-end, it
+provides an experimental multicast resolution protocol in hope of
+achieving greater scalability. The latter is not usable on all fabric
+topologies, specifically ones that may not have reversible paths.
+Users should use the ib_acme utility to verify that multicast protocol
+is usable before running other applications.
+
+Conceptually, the ib_acm service implements an ARP like protocol and either
+uses IB multicast records to construct path record data or queries the
+SA directly, depending on the selected route protocol. By default, the
+ib_acm service uses and caches SA path record queries.
+
+Specifically, all IB endpoints join a number of multicast groups.
+Multicast groups differ based on rates, mtu, sl, etc., and are prioritized.
+All participating endpoints must be able to communicate on the lowest
+priority multicast group. The ib_acm assigns one or more names/addresses
+to each IB endpoint using the acm_addr.cfg file. Clients provide source
+and destination names or addresses as input to the service, and receive
+as output path record data.
+
+The service maps a client's source name/address to a local IB endpoint.
+If a client does not provide a source address, then the ib_acm service
+will select one based on the destination and local routing tables. If the
+destination name/address is not cached locally, it sends a multicast
+request out on the lowest priority multicast group on the local endpoint.
+The request carries a list of multicast groups that the sender can use.
+The recipient of the request selects the highest priority multicast group
+that it can use as well and returns that information directly to the sender.
+The request data is cached by all endpoints that receive the multicast
+request message. The source endpoint also caches the response and uses
+the multicast group that was selected to construct or obtain path record
+data, which is returned to the client.
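+
+The recipient's group selection can be sketched as follows. This is an
+illustration, not ib_acm code: the helper name select_mcast_group and
+the group labels in the usage note are made up, and the sender's list
+is assumed to be ordered from highest to lowest priority.

```shell
# Pick the highest priority group that both sides can use.
# $1 = sender's groups, highest priority first, space separated
# $2 = groups the recipient can use, space separated
select_mcast_group() {
    # Unquoted expansion is intentional: split the lists on spaces.
    for g in $1; do
        for r in $2; do
            if [ "$g" = "$r" ]; then
                echo "$g"
                return 0
            fi
        done
    done
    # No common group: the endpoints cannot communicate this way.
    return 1
}
```
+
+For example, select_mcast_group 'qdr-4k ddr-2k sdr-2k' 'sdr-2k ddr-2k'
+prints ddr-2k, the first entry in the sender's list that the recipient
+also supports.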
+
+===============================================================================
+4. Known Issues
+===============================================================================
+
+The current implementation of the IB ACM has several restrictions:
+- The ib_acm is limited in its handling of dynamic changes;
+ the ib_acm must be stopped and restarted if a cluster is reconfigured.
+- Cached data does not time out and is only updated if a new resolution
+ request is received from a different QPN than a cached request.
+- Support for IPv6 has not been verified.
+- The number of addresses that can be assigned to a single endpoint is
+ limited to 4.
+- The number of multicast groups that an endpoint can support is limited to 2.
+
--- /dev/null
+ Open Fabrics InfiniBand Diagnostic Utilities
+ --------------------------------------------
+
+*******************************************************************************
+RELEASE: OFED 1.5
+DATE: Dec 2009
+
+===============================================================================
+Table of Contents
+===============================================================================
+1. Overview
+2. New features
+3. Major Bugs Fixed
+4. Known Issues
+
+===============================================================================
+1. Overview
+===============================================================================
+
+The ibutils package provides a set of diagnostic tools that check the health
+of an InfiniBand fabric.
+
+Package components:
+ibis: IB interface - A TCL shell that provides interface for sending various
+ MADs on the IB fabric. This is the component that actually accesses
+ the IB Hardware.
+
+ibdm: IB Data Model - A library that provides IB fabric analysis.
+
+ibmgtsim: An IB fabric simulator. Useful for developing IB tools.
+
+ibdiag: This package provides 3 tools which provide the user interface
+ to activate the above functionality:
+ - ibdiagnet: Performs various quality and health checks on the IB
+ fabric.
+ - ibdiagpath: Performs various fabric quality and health checks on
+ the given links and nodes in a specific path.
+ - ibdiagui: A GUI wrapper for the above tools.
+
+===============================================================================
+2. New Features
+===============================================================================
+
+* New "From the Edge" topology matching algorithm.
+ Integrated into ibtopodiff when run with the flag -e
+
+* New library - libsysapi
+ The library is a C API for IBDM C++ objects
+
+* Added ibnl definition files for Mellanox and Sun IB QDR products
+
+* Added new feature to ibdiagnet - general device info
+
+* ibdiagnet can now take port 0 as a parameter (for managed switches).
+
+
+===============================================================================
+3. Major Bugs Fixed
+===============================================================================
+
+* ibutils: various fixes in build process (dependencies, parallel build, etc)
+
+* ibdiagnet: fixed crash with -r flag
+
+* ibdiagnet: fixed regular expression for pkey matching
+
+* ibdiagnet: ibdiagnet.lst file has device IDs with trailing zeroes - fixed
+
+===============================================================================
+4. Known Issues
+===============================================================================
+
+- Ibdiagnet "-wt" option may generate a bad topology file when running on a
+ cluster that contains complex switch systems.
--- /dev/null
+ Open Fabrics Enterprise Distribution (OFED)
+ ipath in OFED 1.5 Release Notes
+
+ December 2009
+
+======================================================================
+1. Overview
+======================================================================
+ipath is the low level driver implementation for the
+QLogic HyperTransport HCA only (model QHT7140).
+
+The qib driver is the currently supported driver for all
+PCI-Express based InfiniBand HCAs.
--- /dev/null
+ Open Fabrics Enterprise Distribution (OFED)
+ IPoIB in OFED 1.5.2 Release Notes
+
+ December 2010
+
+
+===============================================================================
+Table of Contents
+===============================================================================
+1. Overview
+2. Known Issues
+3. IPoIB Configuration Based on DHCP
+4. The ib-bonding driver
+5. Child interfaces
+6. Bug Fixes and Enhancements Since OFED 1.3
+7. Bug Fixes and Enhancements Since OFED 1.3.1
+8. Bug Fixes and Enhancements Since OFED 1.4
+9. Bug Fixes and Enhancements Since OFED 1.4.2
+10. Bug Fixes and Enhancements Since OFED 1.5.0
+11. Bug Fixes and Enhancements Since OFED 1.5.2
+12. Performance Tuning
+
+===============================================================================
+1. Overview
+===============================================================================
+IPoIB is a network driver implementation that enables transmitting IP and ARP
+protocol packets over an InfiniBand UD channel. The implementation conforms to
+the relevant IETF working group's RFCs (http://www.ietf.org).
+
+
+Usage and configuration:
+========================
+1. To check the current mode used for outgoing connections, enter:
+ cat /sys/class/net/ib0/mode
+2. To disable IPoIB CM at compile time, enter:
+ cd OFED-1.5
+ export OFA_KERNEL_PARAMS="--without-ipoib-cm"
+ ./install.pl
+3. To change the run-time configuration for IPoIB, edit
+ /etc/infiniband/openib.conf and set the following parameters:
+ # Enable IPoIB Connected Mode
+ SET_IPOIB_CM=yes
+ # Set IPoIB MTU
+ IPOIB_MTU=65520
+
+4. You can also change the mode and MTU for a specific interface manually.
+
+ To enable connected mode for interface ib0, enter:
+ echo connected > /sys/class/net/ib0/mode
+
+ To increase MTU, enter:
+ ifconfig ib0 mtu 65520
+
+5. Switching between CM and UD mode can be done at run time:
+ echo datagram > /sys/class/net/ib0/mode sets the mode of ib0 to UD
+ echo connected > /sys/class/net/ib0/mode sets the mode of ib0 to CM
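+
+ The manual steps in items 4 and 5 can be wrapped in a small helper. The
+ following is a minimal sketch, not part of OFED: the function names are
+ hypothetical, and the datagram MTU of 2044 is an assumption that corresponds
+ to a 2K IB MTU (use the value appropriate for your fabric).
+

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with OFED).

# Print the MTU this sketch pairs with an IPoIB mode.
# 65520 is the connected-mode value used in this document;
# 2044 is an ASSUMED datagram value for a 2K IB MTU.
ipoib_mtu_for_mode() {
    case "$1" in
        connected) echo 65520 ;;
        datagram)  echo 2044 ;;
        *) echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Set the mode of an interface first, then the matching MTU.
# Requires root, since it writes to sysfs and calls ifconfig.
ipoib_set_mode() {
    ifc=$1
    mode=$2
    mtu=$(ipoib_mtu_for_mode "$mode") || return 1
    echo "$mode" > "/sys/class/net/$ifc/mode" || return 1
    ifconfig "$ifc" mtu "$mtu"
}

# Example (as root): ipoib_set_mode ib0 connected
```

+
+ The mode is written before the MTU because the large MTU is only meaningful
+ once the interface is in connected mode.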
+
+
+===============================================================================
+2. Known Issues
+===============================================================================
+1. If a host has multiple interfaces and (a) each interface belongs to a
+ different IP subnet, (b) they all use the same InfiniBand Partition, and (c)
+ they are connected to the same IB Switch, then the host violates the IP rule
+ requiring different broadcast domains. Consequently, the host may build an
+ incorrect ARP table.
+
+ The correct setting of a multi-homed IPoIB host is achieved by using a
+ different PKEY for each IP subnet. If a host has multiple interfaces on the
+ same IP subnet, then to prevent a peer from building an incorrect ARP entry
+ (neighbor) set the net.ipv4.conf.X.arp_ignore value to 1 or 2, where X
+ stands for the IPoIB (non-child) interfaces (e.g., ib0, ib1, etc). This
+ causes the network stack to send ARP replies only on the interface with the
+ IP address specified in the ARP request:
+
+ sysctl -w net.ipv4.conf.ib0.arp_ignore=1
+ sysctl -w net.ipv4.conf.ib1.arp_ignore=1
+
+ Or, globally,
+
+ sysctl -w net.ipv4.conf.all.arp_ignore=1
+
+ To learn more about the arp_ignore parameter, see
+ Documentation/networking/ip-sysctl.txt.
+ Note that distributions have the means to make kernel parameters persistent.
+
+2. There are IPoIB alias lines in /etc/modprobe.d/ib_ipoib.conf which prevent
+ stopping/unloading the stack (i.e., '/etc/init.d/openibd stop' will fail).
+ These alias lines cause the drivers to be loaded again by udev scripts.
+
+ Workaround: Set OFA_KERNEL_PARAMS="--without-modprobe" in the environment
+ before running install.pl so that the alias lines are not added, or remove
+ the alias lines from /etc/modprobe.d/ib_ipoib.conf.
+
+3. On SLES 10:
+ The ib1 interface uses the configuration script of ib0.
+
+ Workaround: Invoke ifup/ifdown using both the interface name and the
+ configuration script name (example: ifup ib1 ib1).
+
+4. After a hotplug event, the IPoIB interface falls back to datagram mode, and
+ MTU is reduced to 2K.
+ Workaround: Re-enable connected mode and increase MTU manually:
+ echo connected > /sys/class/net/ib0/mode
+ ifconfig ib0 mtu 65520
+
+5. Since the IPoIB configuration files (ifcfg-ib<n>) are installed under the
+ standard networking scripts location (RedHat:/etc/sysconfig/network-scripts/
+ and SuSE: /etc/sysconfig/network/), the option IPOIB_LOAD=no in openib.conf
+ does not prevent the loading of IPoIB on boot.
+
+6. If IPoIB connected mode is enabled, it uses a large MTU for connected mode
+ messages and a small MTU for datagram (in particular, multicast) messages,
+ and relies on path MTU discovery to adjust MTU appropriately. Packets sent
+ in the window before MTU discovery automatically reduces the MTU for a
+ specific destination will be dropped, producing the following message in the
+ system log:
+ "packet len <actual length> (> <max allowed length>) too long to send, dropping"
+
+ To warn about this, a message is produced in the system log each time MTU is
+ set to a value higher than 2K.
+
+7. IPoIB IPv6 support is broken between systems running kernels < 2.6.12 and
+ systems running kernels >= 2.6.12. The reason is that kernel 2.6.12 puts the
+ link layer address at an offset of two bytes with respect to older kernels.
+ This causes the other host to misinterpret the hardware address, so path
+ resolution fails because it is based on wrong GIDs. For example, RH 4.x and
+ RH 5.x cannot interoperate.
+
+8. In connected mode, TCP latency for short messages is larger by approx. 1usec
+ (~5%) than in datagram mode. As a workaround, use datagram mode.
+
+9. Single-socket TCP bandwidth for kernels < 2.6.18 is lower than with
+ newer kernels. It is recommended to use kernel 2.6.18 or later for
+ best IPoIB performance.
+
+10. Connectivity issues are encountered when using IPv6 on ia64 systems.
+
+11. The IPoIB module uses a Linux implementation for Large Receive Offload
+ (LRO) in kernel 2.6.24 and later. These kernels require installing the
+ "inet_lro" module.
+
+12. ConnectX only: If you have a port configured as ETH and IPoIB is running
+ in connected mode, and then you change the port type to IB, the IPoIB mode
+ will change to datagram mode.
+
+13. When working with iSCSI, you must disable LRO (even if you are working in
+ connected mode). This is because there is a bug in older kernels which causes
+ a kernel panic.
+
+14. IPoIB datagram mode initial packet loss (bug #1287): When the datagram test
+ gets to packet size 8192 or larger, it always loses the first packet in the
+ sequence.
+ Workaround: Increase the number of pending skb's before a neighbor is
+ resolved (default is 3). This value can be changed with the sysctl key
+ net.ipv4.neigh.ib0.unres_qlen.
+
+15. IPoIB multicast support is broken in RH4.x kernels. This is because
+ ndisc_mc_map() does not handle IPoIB hardware addresses.
+
+16. If bonding uses an IPoIB slave, then un-enslaving all slaves (or downing
+ them with ifdown) followed by unloading the module ib_ipoib might crash the
+ kernel. To avoid this, leave the IPoIB interfaces enslaved when unloading
+ ib_ipoib.
+
+17. On SLES 11, sysconfig scripts override the interface mode and set it to
+ datagram on each call to ifup, ifdown, etc. To avoid this, add the line
+ IPOIB_MODE=connected
+ to the interface configuration file (e.g., ifcfg-ib0).
+
+18. When installing OFED on a machine that runs kernel 2.6.30 (or another
+ kernel from kernel.org that OFED supports), the installation script blocks
+ the installation of ib-bonding since the bonding module that comes with the
+ kernel has all the functionality to support IPoIB slaves. This approach
+ however doesn't patch the sysconfig (SuSE) or initscripts (RedHat) package,
+ so the network configuration script may not work properly.
+ For example, if you install OFED on RHEL5.2 that runs kernel 2.6.30 and
+ you try to configure and run bonding, you won't be able to restart the
+ network and see bond0 up and running with IPoIB slaves.
+ A workaround to this problem would be as follows:
+ a. Compile ib-bonding source rpm (under SRPMS directory) separately on
+ a machine with RHEL5.2 and kernel 2.6.18-92.el5 (default for this OS).
+ b. Install the binary RPM while the machine runs kernel 2.6.18-92.el5.
+ This will patch the OS configuration scripts and install the bonding
+ module.
+ c. Switch to kernel 2.6.30. The module that was compiled in (a) will
+ not be loaded since it was compiled and installed for a different
+ kernel.
+ d. Configure bonding and restart the network. The bonding interface
+ should be up and running afterwards.
+
+19. On RHEL5.X, '/etc/init.d/openibd start' prints the following messages while
+ bringing up IPoIB interfaces:
+
+ Setting up InfiniBand network interfaces:
+ Bringing up interface ib0: [ OK ]
+ RTNETLINK answers: File exists
+ Error adding address 192.168.1.11 for ib1.
+ Bringing up interface ib1: [ OK ]
+ Setting up service network . . . [ done ]
+
+ This does not affect IPoIB configuration and interfaces are configured as
+ expected.
+
+20. In IPoIB connected mode, packets larger than 2016 bytes are not sent.
+ https://bugs.openfabrics.org/show_bug.cgi?id=1839
+
+21. Under SLES11, if an IP configuration exists for an IPoIB interface
+ that later becomes a slave of a bonding master, a network restart
+ does not erase the IP configuration from the slave and it appears to have
+ an IP address even though the new configuration does not set one. This
+ may cause problems when trying to use the bonded network interface. To
+ avoid this, restart the IB stack (openib restart) once you change the
+ configuration.
+ This issue is described in
+ https://bugs.openfabrics.org/show_bug.cgi?id=1975
+
+22. Currently, IPoIB LRO is not supported on ConnectX-2 devices
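+
+For known issue 1, the per-interface arp_ignore settings can be generated for
+several interfaces at once. The following is a minimal sketch; the function
+name is hypothetical, and writing the output to /etc/sysctl.conf is an
+assumption, since the persistence mechanism differs between distributions.
+

```shell
#!/bin/sh
# Hypothetical helper: print one sysctl.conf line per IPoIB interface.
#   $1       = arp_ignore value (1 or 2, as described in known issue 1)
#   $2 ...   = interface names (non-child IPoIB interfaces)
arp_ignore_lines() {
    value=$1
    shift
    for ifc in "$@"; do
        echo "net.ipv4.conf.$ifc.arp_ignore = $value"
    done
}

# Example: print the lines for ib0 and ib1.
arp_ignore_lines 1 ib0 ib1
```

+
+The printed lines use the same keys as the sysctl -w commands shown in known
+issue 1.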
+
+===============================================================================
+3. IPoIB Configuration Based on DHCP
+===============================================================================
+
+Setting an IPoIB interface configuration based on DHCP (v3.0.4 which is
+available via www.isc.org) is performed similarly to the configuration of
+Ethernet interfaces. In other words, you need to make sure that IPoIB
+configuration files include the following line:
+ For RedHat:
+ BOOTPROTO=dhcp
+ For SLES:
+ BOOTPROTO=dhcp
+Note: If IPoIB configuration files are included, ifcfg-ib<n> files will be
+installed under:
+/etc/sysconfig/network-scripts/ on a RedHat machine
+/etc/sysconfig/network/ on a SuSE machine
+
+Note: Two patches for DHCP are required for supporting IPoIB. The patch files
+for DHCP v3.0.4 are available under the docs/dhcp/ directory.
+
+Standard DHCP fields holding MAC addresses are not large enough to contain an
+IPoIB hardware address. To overcome this problem, DHCP over InfiniBand messages
+convey a client identifier field used to identify the DHCP session. This client
+identifier field can be used to associate an IP address with a client identifier
+value, such that the DHCP server will grant the same IP address to any client
+that conveys this client identifier.
+
+Note: Refer to the DHCP documentation for more details on how to make this
+association.
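+
+As an illustration of such an association, the sketch below prints a
+dhcpd.conf host entry that binds a client identifier to a fixed address. The
+function name, host name and IP address are hypothetical; the client
+identifier is the ConnectX example value that appears in the DHCP client
+subsection of this document. Check the entry against your DHCP server's
+documentation before use.
+

```shell
#!/bin/sh
# Hypothetical helper: print an ISC dhcpd.conf host entry.
#   $1 = host name
#   $2 = client identifier (hex bytes, colon-separated)
#   $3 = IP address to grant to that client
dhcpd_host_entry() {
    cat <<EOF
host $1 {
    option dhcp-client-identifier $2;
    fixed-address $3;
}
EOF
}

# Example values (host name and address are made up for illustration).
dhcpd_host_entry host1 \
    ff:00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39 \
    192.168.1.11
```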
+
+The length of the client identifier field is not fixed in the specification.
+
+3.1 DHCP Server
+In order for the DHCP server to provide configuration records for clients, an
+appropriate configuration file needs to be created. By default, the DHCP server
+looks for a configuration file called dhcpd.conf under /etc. You can either
+edit this file or create a new one and provide its full path to the DHCP server
+using the -cf flag. See a file example at docs/dhcpd.conf of this package.
+The DHCP server must run on a machine which has loaded the IPoIB module.
+
+To run the DHCP server from the command line, enter:
+dhcpd <IB network interface name> -d
+Example:
+host1# dhcpd ib0 -d
+
+3.2 DHCP Client (Optional)
+
+Note: A DHCP client can be used if you need to prepare a diskless machine with
+an IB driver.
+
+In order to use a DHCP client identifier, you need to first create a
+configuration file that defines the DHCP client identifier. Then run the DHCP
+client with this file using the following command:
+dhclient -cf <client conf file> <IB network interface name>
+Example of a configuration file for the ConnectX (PCI Device ID 26428), called
+dhclient.conf:
+# The value indicates a hexadecimal number
+interface "ib1" {
+send dhcp-client-identifier ff:00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39;
+}
+Example of a configuration file for InfiniHost III Ex (PCI Device ID 25218),
+called dhclient.conf:
+# The value indicates a hexadecimal number
+interface "ib1" {
+send dhcp-client-identifier 20:00:55:04:01:fe:80:00:00:00:00:00:00:00:02:c9:02:00:23:13:92;
+}
+
+In order to use the configuration file, run:
+host1# dhclient -cf dhclient.conf ib1
+
+
+===============================================================================
+4. The ib-bonding driver
+===============================================================================
+The ib-bonding driver is a High Availability solution for IPoIB interfaces.
+It is based on the Linux Ethernet Bonding Driver and was adapted to work with
+IPoIB. The ib-bonding driver comes with the ib-bonding package
+(run rpm -qi ib-bonding to get the package information).
+
+Using the ib-bonding driver
+---------------------------
+The ib-bonding driver is loaded automatically.
+
+Automatic operation:
+Use standard OS tools (sysconfig in SuSE and initscripts in RedHat)
+to create a configuration that will come up with network restart. For details
+on this, read the documentation for the ib-bonding package.
+
+Notes:
+* Using /etc/infiniband/openib.conf to create a persistent configuration is
+ no longer supported
+* On RHEL4_U7, a slave interface cannot be set as primary.
+* ib-bonding will not be compiled and installed with OFED on an OS with kernel
+ that is >= 2.6.27 (e.g., SLES11). The bonding driver that comes with those
+ kernels already supports enslaving IPoIB interfaces. In addition, an OS
+ can come with an older kernel but with a patched bonding driver that also
+ does not require modification (e.g., RHEL5.4). OFED will not replace the
+ bonding module in such cases either.
+ However, there might still be issues with OS configuration tools (like
+ sysconfig or initscripts) that may need fixing, but such issues have not
+ been observed yet.
+
+
+===============================================================================
+5. Child interfaces
+===============================================================================
+
+5.1 Subinterfaces
+-----------------
+You can create subinterfaces for a primary IPoIB interface to provide traffic
+isolation. Each such subinterface (also called a child interface) has
+different IP and network addresses from the primary (parent) interface. The
+default Partition Key (PKey), ff:ff, applies to the primary (parent) interface.
+
+5.1.1 Creating a Subinterface
+-----------------------------
+To create a child interface (subinterface), follow this procedure:
+Note: In the following procedure, ib0 is used as an example of a primary
+(parent) IB interface.
+
+Step 1. Decide on the PKey to be used in the subnet. Valid values are 0-255.
+The actual PKey used is a 16-bit number with the most significant bit set. For
+example, a value of 0 will give a PKey with the value 0x8000.
+
+Step 2. Create a child interface by running:
+host1$ echo <PKey> > /sys/class/net/<IB interface>/create_child
+Example:
+host1$ echo 0 > /sys/class/net/ib0/create_child
+This will create the interface ib0.8000.
+
+Step 3. Verify the configuration of this interface. Using the example of
+Step 2, run:
+host1$ ifconfig ib0.8000
+ib0.8000 Link encap:UNSPEC HWaddr 80-00-00-4A-FE-80-00-00-00-00-00-00-00-00-00-00
+BROADCAST MULTICAST MTU:2044 Metric:1
+RX packets:0 errors:0 dropped:0 overruns:0 frame:0
+TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
+collisions:0 txqueuelen:128
+RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
+
+Step 4. As can be seen, the interface does not have IP or network addresses so
+it needs to be configured.
+
+Step 5. To be able to use this interface, a configuration of the Subnet Manager
+is needed so that the PKey chosen, which defines a broadcast address, can be
+recognized.
+
+5.1.2 Removing a Subinterface
+-----------------------------
+To remove a child interface (subinterface), run:
+echo <subinterface PKey> > /sys/class/net/<ib_interface>/delete_child
+Using the example of Step 2:
+echo 0x8000 > /sys/class/net/ib0/delete_child
+Note that when deleting the interface you must use the PKey value with the most
+significant bit set (e.g., 0x8000 in the example above).
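+
+The PKey arithmetic used in this section (set the most significant bit of the
+16-bit value, then use the result in the child interface name) can be
+sketched as follows. The function names are hypothetical and are not part of
+OFED.
+

```shell
#!/bin/sh
# Hypothetical helpers illustrating the PKey arithmetic of Step 1.

# Print the full 16-bit PKey (most significant bit set) as 4 hex digits.
full_pkey() {
    printf '%04x\n' $(( 0x8000 | $1 ))
}

# Print the child interface name for a parent interface and a PKey value.
child_ifname() {
    echo "$1.$(full_pkey "$2")"
}

# Example: value 0 on ib0 gives ib0.8000, matching Step 2.
child_ifname ib0 0
```

+
+The value to write to delete_child is the same number with a 0x prefix, for
+example "0x$(full_pkey 0)".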
+
+
+===============================================================================
+6. Bug Fixes and Enhancements Since OFED 1.3
+===============================================================================
+- There is no default configuration for IPoIB interfaces: One should manually
+ specify the full IP configuration or use the ofed_net.conf file. See
+ OFED_Installation_Guide.txt for details on ipoib configuration.
+- Don't drop multicast sends when they can be queued
+- IPoIB panics with RHEL5U1, RHEL4U6 and RHEL4U5: Bug fix when copying small
+ SKBs (bug 989)
+- IPoIB failed on stress testing (bug 1004)
+- Kernel Oops during "port up/down test" (bug 1040)
+- Restarting the stack while iperf 2.0.4 is running on the client side causes
+ a kernel panic (bug 985)
+- Fix neigh destructor oops on kernel versions between 2.6.17 and 2.6.20
+- Set max CM MTU when moving to CM mode, instead of setting it in openibd script
+- Fix CQ size calculations for ipoib
+- Bonding: Enable build for SLES10 SP2
+- Bonding: Fix issue in using the bonding module for Ethernet slaves (see
+ documentation for details)
+
+===============================================================================
+7. Bug Fixes and Enhancements Since OFED 1.3.1
+===============================================================================
+- IPoIB: Refresh paths instead of flushing them on SM change events to improve
+ failover response
+- IPoIB: Fix loss of connectivity after bonding failover on both sides
+- Bonding: Fix link state detection under RHEL4
+- Bonding: Avoid annoying messages from initscripts when starting bond
+- Bonding: Set default number of grat. ARP after failover to three (was one)
+
+===============================================================================
+8. Bug Fixes and Enhancements Since OFED 1.4
+===============================================================================
+- Performance tuning is enabled by default for IPOIB CM.
+- Clear IPOIB_FLAG_ADMIN_UP if ipoib_open fails
+- Disable napi while cq is being drained (bugzilla #1587)
+- rdma_cm: Use the rate from the ipoib broadcast when joining an ipoib
+ multicast. When joining an IPoIB multicast group, use the same rate as in the
+ broadcast group. Otherwise, if rdma_cm creates this group before IPoIB does,
+ it might get a different rate. This will cause IPoIB to fail joining the same
+ group later on, because IPoIB has a strict rate selection.
+- Fixed unprotected use of priv->broadcast in ipoib_mcast_join_task.
+- Do not join broadcast group if interface is brought down
+
+
+===============================================================================
+9. Bug Fixes and Enhancements Since OFED 1.4.2
+===============================================================================
+
+- Check that the format of multicast link addresses is correct before taking
+ them from dev->mc_list to priv->multicast_list. This way we never try to
+ send a bogus address to the SA, which prevents badness from erroneous
+ 'ip maddr addr add', broken bonding drivers, etc. (bugzilla #1664)
+- IPoIB: Don't turn on carrier for a non-active port.
+ If a bonding interface uses this IPoIB interface as a slave it might
+ not detect that this slave is almost useless and failover
+ functionality will be damaged. The fix checks the state of the IB
+ port in the carrier_task before calling netif_carrier_on(). (bugzilla #1726)
+- Clear ipoib_neigh.dgid in ipoib_neigh_alloc()
+ IPoIB can miss a change in destination GID under some conditions. The
+ problem is caused when ipoib_neigh->dgid contains a stale address.
+ The fix is to set ipoib_neigh->dgid to zero in ipoib_neigh_alloc().
+
+===============================================================================
+10. Bug Fixes and Enhancements Since OFED 1.5.0
+===============================================================================
+
+- Fixed lockup of the TX queue on mixed CM/UD traffic
+ When there is a high rate of send traffic on both CM and UD QPs, the
+ transmitter can be stopped by the CM path but not re-enabled.
+
+===============================================================================
+11. Bug Fixes and Enhancements Since OFED 1.5.2
+===============================================================================
+1. Fix IPoIB rx_frames and rx_usecs to conform to ethtool documentation.
+
+
+===============================================================================
+12. Performance Tuning
+===============================================================================
+When IPoIB is configured to run in connected mode, TCP parameter tuning is
+performed at driver startup to improve the throughput of medium and large
+messages.
+The driver startup scripts set the following TCP parameters:
+
+ net.ipv4.tcp_timestamps=0
+ net.ipv4.tcp_sack=0
+ net.core.netdev_max_backlog=250000
+ net.core.rmem_max=16777216
+ net.core.wmem_max=16777216
+ net.core.rmem_default=16777216
+ net.core.wmem_default=16777216
+ net.core.optmem_max=16777216
+ net.ipv4.tcp_mem="16777216 16777216 16777216"
+ net.ipv4.tcp_rmem="4096 87380 16777216"
+ net.ipv4.tcp_wmem="4096 65536 16777216"
+
+This tuning is effective only for connected mode. If you run in datagram mode,
+it actually reduces performance.
+
+If you change the IPoIB run mode to "datagram" while the driver is running,
+the tuned parameters do not get reset to their default values. We therefore
+recommend that you change the IPoIB mode only while the driver is down
+(by setting line "SET_IPOIB_CM=yes" to "SET_IPOIB_CM=no" in file
+/etc/infiniband/openib.conf, and then restarting the driver).
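+
+To see whether the tuned values are still in effect after a mode change, the
+list above can be compared with the running values. The following is a
+minimal sketch; both function names are hypothetical, and it assumes that
+the sysctl utility is available.
+

```shell
#!/bin/sh
# Print the connected-mode tuning values listed above, one "key=value"
# pair per line (multi-field values are written without quotes).
ipoib_cm_tuning() {
    cat <<'EOF'
net.ipv4.tcp_timestamps=0
net.ipv4.tcp_sack=0
net.core.netdev_max_backlog=250000
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.core.rmem_default=16777216
net.core.wmem_default=16777216
net.core.optmem_max=16777216
net.ipv4.tcp_mem=16777216 16777216 16777216
net.ipv4.tcp_rmem=4096 87380 16777216
net.ipv4.tcp_wmem=4096 65536 16777216
EOF
}

# Hypothetical check: show each tuned value next to the running value.
# Tabs in the sysctl output are turned into spaces for readability.
ipoib_show_tuning() {
    ipoib_cm_tuning | while IFS='=' read -r key tuned; do
        current=$(sysctl -n "$key" 2>/dev/null | tr '\t' ' ')
        echo "$key: tuned=[$tuned] current=[$current]"
    done
}

# Example: ipoib_show_tuning
```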
+
+
--- /dev/null
+
+ Open Fabrics Enterprise Distribution (OFED)
+ iSER initiator in OFED 1.5.x Release Notes
+
+ March 2010
+
+
+* Background
+
+ iSER allows iSCSI to be layered over RDMA transports (including
+ InfiniBand and iWARP (RNIC)).
+
+ The OpenFabrics iSER initiator implementation is inter-operable with
+ open-iscsi (http://www.open-iscsi.org/). It provides an alternative
+ transport to iscsi_tcp in the open-iscsi framework. The iSER transport
+ exposes a transport API to scsi_transport_iscsi, and a SCSI LLD API to
+ the Linux SCSI mid-layer (scsi_mod). Currently, the OpenFabrics iSER
+ initiator can be layered over InfiniBand (no iWARP support yet).
+
+* Supported platforms
+
+ - kernel.org: 2.6.30 and higher
+ - RHEL 5.4
+
+ On any other platform, OFED-1.5.x does not install iSER, and the original
+ iSER module that comes with the Linux distribution stops working because of
+ a mismatch in symbol versions.
+
+* Fixed Bugs and Enhancements since OFED 1.3
+ iSER:
+ - Add logical unit reset support
+ - Update URLs of iSER docs
+ - Add change_queue_depth method
+ - Fix list iteration bug
+ - Handle iser_device allocation error gracefully
+ - Don't change ITT endianness
+ - Move high-volume debug output to higher debug level
+ - Count FMR alignment violations per session
+ Open-iSCSI:
+ - Update open-iscsi rpm versions from
+ 2.0-754 to 2.0-754.1 and from 2.0-865.15 to 2.0-869.2
+ - Change open-iscsi defaults
+ - iscsi_discovery: fixed printing debug information
+ - iscsi_discovery: check if iscsid is running
+ - Set open-iscsi for auto-startup when installing OFED
+ - iscsiadm: bail out if daemon isn't running
+
+* Known Issues
+ Open-iSCSI:
+ - Modifying a node's transport_name while a session is active creates a
+ stale session, which is deleted only after reboot.
+
+* Installation/upgrade of open-iscsi
+ If iSER is selected to be installed with OFED, open-iscsi will also be
+ installed (or upgraded if another version of open-iscsi is already
+ installed). Installing/upgrading open-iscsi is required for iSER to
+ work properly. Before installing OFED, please make sure that no version
+ of open-iscsi is installed or add the following key to your ofed.conf
+ file: upgrade_open_iscsi=yes. Using this key will remove any old version
+ of open-iscsi.
+
+ If an older version of open-iscsi was installed, it is recommended to
+ delete its records before running open-iscsi. This can easily be done by
+ running the following command (while open-iscsi is stopped):
+
+ rm -rf /etc/iscsi/nodes/* /etc/iscsi/send_targets/*
+
+ Then, open-iscsi may be started, and targets may be discovered by running
+ 'iscsi_discovery <target_ip>'.
+
+* iSER links
+
+ Wiki pages
+
+ Information on building/configuring/running the open iscsi initiator over
+ iSER: https://wiki.openfabrics.org/tiki-index.php?page=iSER
+
+ IETF pages
+
+ iSCSI and iSER specifications come out of the IETF IP storage (IPS) work
+ group.
+
+ iSCSI specification: http://www.ietf.org/rfc/rfc3720.txt
+ iSER specification: http://www.ietf.org/rfc/rfc5046.txt
+
+ "About" page
+
+ general and detailed information on iSCSI and iSER
+ http://www.voltaire.com/Products/Server_Products/iSER_iSCSI_RDMA
+
--- /dev/null
+ Open Fabrics Enterprise Distribution (OFED)
+ STGT/iSER target in OFED 1.5 Release Notes
+
+ December 2009
+
+
+* Background
+
+ iSER allows iSCSI to be layered over RDMA transports (including InfiniBand
+ and iWARP (RNIC)). Linux target framework (tgt) aims to simplify various SCSI
+ target driver (iSCSI, Fibre Channel, SRP, etc) creation and maintenance.
+
+ tgt supports the following target drivers (among others):
+
+ - iSCSI software (tcp) target driver for Ethernet/IPoIB NICs
+ - iSER software target driver for InfiniBand and RDMA NICs
+
+ For iSCSI and iSER, tgt consists of a user-space daemon and user-space
+ tools. That is, no special kernel support is needed other than the
+ kernel (and user space) RDMA stacks.
+
+ The code is under the GNU General Public License version 2.
+
+ This package is based on a snapshot (clone) of the tgt git tree taken
+ on August 28th, 2008
+
+* Supported platforms
+
+ RHEL 5 and its updates
+ SLES 10 and its service-packs
+
+ The release has been tested against the Linux open iscsi initiator
+
+* STGT/iSER links
+
+ STGT home page
+ http://stgt.berlios.de
+
+ STGT git
+ git://git.kernel.org/pub/scm/linux/kernel/git/tomo/tgt.git
+
+ The STGT sources include some embedded documentation; the README and
+ README.iscsi files are particularly useful.
+
+ Wiki pages
+
+ Information on building/configuring/running the stgt/iser target
+ https://wiki.openfabrics.org/tiki-index.php?page=iSER-target
+
+ general and detailed information on iSCSI and iSER
+ http://www.voltaire.com/Products/Server_Products/iSER_iSCSI_RDMA
--- /dev/null
+ Open Fabrics Enterprise Distribution (OFED)
+ ConnectX driver (mlx4) in OFED 1.5.2 Release Notes
+
+ December 2010
+
+
+===============================================================================
+Table of Contents
+===============================================================================
+1. Overview
+2. Supported firmware versions
+3. VPI (Virtual Protocol Interconnect)
+4. InfiniBand new features and bug fixes since OFED 1.3.1
+5. InfiniBand (mlx4_ib) new features and bug fixes since OFED 1.4
+6. Eth (mlx4_en) new features and bug fixes since OFED 1.4
+7. New features and bug fixes since OFED 1.4.1
+8. New features and bug fixes since OFED 1.4.2
+9. New features and bug fixes since OFED 1.5
+10. New features and bug fixes since OFED 1.5.1
+11. New features and bug fixes since OFED 1.5.2
+12. Known Issues
+13. mlx4 available parameters
+
+===============================================================================
+1. Overview
+===============================================================================
+mlx4 is the low level driver implementation for the ConnectX adapters designed
+by Mellanox Technologies. The ConnectX can operate as an InfiniBand adapter,
+as an Ethernet NIC, or as a Fibre Channel HBA. The driver in OFED 1.4 supports
+InfiniBand and Ethernet NIC configurations. To accommodate the supported
+configurations, the driver is split into three modules:
+
+- mlx4_core
+ Handles low-level functions like device initialization and firmware
+ commands processing. Also controls resource allocation so that the
+ InfiniBand and Ethernet functions can share the device without
+ interfering with each other.
+- mlx4_ib
+ Handles InfiniBand-specific functions and plugs into the InfiniBand
+ mid-layer.
+- mlx4_en
+ Handles Ethernet-specific functions and plugs into the netdev mid-layer.
+
+===============================================================================
+2. Supported firmware versions
+===============================================================================
+- This release was tested with FW 2.8.0000
+- The minimal version to use is 2.3.000.
+- To use both IB and Ethernet (VPI) use FW version 2.6.000 or higher
+
+===============================================================================
+3. VPI (Virtual Protocol Interconnect)
+===============================================================================
+VPI enables ConnectX to be configured as an Ethernet NIC and/or an InfiniBand
+adapter.
+o Overview:
+ The VPI driver is a combination of the Mellanox ConnectX HCA Ethernet and
+ InfiniBand drivers.
+ It supplies the user with the ability to run InfiniBand and Ethernet
+ protocols on the same HCA (separately or at the same time).
+ For more details on the Ethernet driver see MLNX_EN_README.txt.
+o Firmware:
+ The VPI driver works with FW 25408 version 2.6.000 or higher.
+ One needs to use INI files that allow different protocols over the same HCA.
+o Port type management:
+ By default both ConnectX ports are initialized as InfiniBand ports.
+ If you wish to change the port type use the connectx_port_config script after
+ the driver is loaded.
+ Running "/sbin/connectx_port_config -s" will show current port configuration
+ for all ConnectX devices.
+ Port configuration is saved in file: /etc/infiniband/connectx.conf.
+ This saved configuration is restored at driver restart only if done via
+ "/etc/init.d/openibd restart".
+
+ Possible port types are:
+ "eth" - Always Ethernet.
+ "ib" - Always InfiniBand.
+ "auto" - Link sensing mode - detect port type based on the attached
+ network type. If no link is detected, the driver retries link
+ sensing every few seconds.
+
+ Port link type can be configured for each device in the system at run time
+ using the "/sbin/connectx_port_config" script.
+
+ This utility will prompt for the PCI device to be modified (if there is only
+ one it will be selected automatically).
+ At the next stage the user will be prompted for the desired mode for each port.
+ The desired port configuration will then be set for the selected device.
+ Note: This utility also has a non interactive mode:
+ "/sbin/connectx_port_config [[-d|--device <PCI device ID>] -c|--conf <port1,port2>]".
+
+- The following configurations are supported by VPI:
+ Port1 = eth Port2 = eth
+ Port1 = ib Port2 = ib
+ Port1 = auto Port2 = auto
+ Port1 = ib Port2 = eth
+ Port1 = ib Port2 = auto
+ Port1 = auto Port2 = eth
+
+ Note: the following options are not supported:
+ Port1 = eth Port2 = ib
+ Port1 = eth Port2 = auto
+ Port1 = auto Port2 = ib
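+
+ The supported and unsupported combinations above can be checked before a
+ configuration is applied. The following is a minimal sketch; the function
+ name is hypothetical and is not part of the connectx_port_config script.
+

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) only for the port type
# pairs listed above as supported by VPI.
#   $1 = type of port 1, $2 = type of port 2 (eth, ib or auto)
vpi_config_supported() {
    case "$1,$2" in
        eth,eth|ib,ib|auto,auto|ib,eth|ib,auto|auto,eth) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: validate a pair before passing it to connectx_port_config.
if vpi_config_supported ib eth; then
    echo "ib,eth is supported"
fi
```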
+
+
+===============================================================================
+4. InfiniBand new features and bug fixes since OFED 1.3.1
+===============================================================================
+Features that are enabled with ConnectX firmware 2.5.0 only:
+- Send with invalidate and Local invalidate send queue work requests.
+- Resize CQ support.
+
+Features that are enabled with ConnectX firmware 2.6.0 only:
+- Fast register MR send queue work requests.
+- Local DMA L_Key.
+- Raw Ethertype QP support (one QP per port) -- receive only.
+
+Non-firmware dependent features:
+- Allow 4K messages for UD QPs
+- Allocate/free fast register MR page lists
+- More efficient MTT allocator
+- RESET->ERR QP state transition no longer supported (IB Spec 1.2.1)
+- Pass congestion management class MADs to the HCA
+- Enable firmware diagnostic counters available via sysfs
+- Enable LSO support for IPoIB
+- IB_EVENT_LID_CHANGE is generated more appropriately
+- Fixed race condition between create QP and destroy QP (bugzilla 1389)
+
+
+===============================================================================
+5. InfiniBand new features and bug fixes since OFED 1.4
+===============================================================================
+- Enable setting a 4K MTU for ConnectX ports via the set_4k_mtu module
+ parameter.
+- Support optimized registration of huge-page-backed memory.
+ With this optimization, the number of MTT entries used is significantly
+ lower than for regular memory, so the HCA will access registered memory with
+ fewer cache misses and improved performance.
+ For more information on this topic, please refer to Linux documentation file:
+ Documentation/vm/hugetlbpage.txt
+- Do not enable blueflame sends if write combining is not available
+- Add write combining support for PPC64, and thus enable blueflame sends.
+- Unregister IB device before executing CLOSE_PORT.
+- Notify and exit if the kernel module used does not support XRC. This is done
+ to avoid a libmlx4 compatibility problem.
+- Added a module parameter (log_mtts_per_seg) for the number of MTTs per
+ segment. This makes it possible to register more memory with the same number
+ of segments.
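+
+ The effect of log_mtts_per_seg on registrable memory can be estimated with
+ simple arithmetic. The Python sketch below is a rough illustration, not driver
+ code. It assumes that log_num_mtt counts MTT segments (as the parameter list
+ in this document describes) and that each MTT entry maps one 4KB page. The
+ values 20, 3 and 5 are illustrative and are not claimed to be defaults.
+
```python
def max_registered_bytes(log_num_mtt, log_mtts_per_seg, page_size=4096):
    """Upper bound on memory addressable through the MTT table, in bytes.

    Assumptions (not taken from the driver source): log_num_mtt is the log2
    of the number of MTT segments, and each MTT entry maps one page.
    """
    segments = 1 << log_num_mtt
    entries_per_segment = 1 << log_mtts_per_seg
    return segments * entries_per_segment * page_size


if __name__ == "__main__":
    for log_seg in (3, 5):
        gib = max_registered_bytes(20, log_seg) // (1 << 30)
        print("log_mtts_per_seg=%d -> %d GiB" % (log_seg, gib))
```
+
+ Under these assumptions, raising log_mtts_per_seg from 3 to 5 with the same
+ number of segments raises the bound from 32 GiB to 128 GiB.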
+
+
+===============================================================================
+6. Eth (mlx4_en) new features and bug fixes since OFED 1.4
+===============================================================================
+6.1 Changes and New Features
+----------------------------
+- Added Tx Multi-queue support which improves multi-stream and bi-directional
+ TCP performance.
+- Added IP Reassembly to improve RX bandwidth for IP fragmented packets.
+- Added linear skb support which improves UDP performance.
+- Removed the following module parameters:
+ - rx/tx_ring_size
+ - rx_ring_num - number of RX rings
+ - pprx/pptx - global pause frames
+ The parameters above are controlled through the standard Ethtool interface.
+
+6.2 Bug Fixes
+-------------
+- Memory leak when driver is unloaded without configuring interfaces first.
+- Setting flow control parameters for one ConnectX port through Ethtool
+ impacts the other port as well.
+- Adaptive interrupt moderation malfunctions after receiving/transmitting
+ around 7 terabytes of data.
+- Firmware commands fail with bad flow messages when bringing an interface up.
+- Unexpected behavior in case of memory allocation failures.
+
+===============================================================================
+7. New features and bug fixes since OFED 1.4.1
+===============================================================================
+- Added support for new device ID: 0x6764: MT26468 ConnectX EN 10GigE PCIe gen2
+
+===============================================================================
+8. New features and bug fixes since OFED 1.4.2
+===============================================================================
+8.1 Changes and New Features
+----------------------------
+- mlx4_en is now supported on PPC and IA64.
+- Added self diagnostics feature: ethtool -t eth<x>.
+- The card's VPD can be read and written through the ethtool interface.
+
+8.2 Bug Fixes
+-------------
+- mlx4 can now work with MSI-X on RH4 systems.
+- Enabled the driver to load on systems with 32 or more cores.
+- The driver no longer gets stuck if the HW/FW stops responding; a reset is
+ performed instead.
+- Fixed recovery flows from memory allocation failures.
+- When the system is low on memory, the mlx4_en driver now allocates smaller RX
+ rings.
+- The mlx4_core driver now retries obtaining MSI-X vectors if the initial
+ request is rejected by the OS.
+
+===============================================================================
+9. New features and bug fixes since OFED 1.5
+===============================================================================
+9.1 Changes and New Features
+----------------------------
+- Added RDMA over Converged Enhanced Ethernet (RoCEE) support
+ See RoCEE_README.txt.
+- Masked Compare and Swap (MskCmpSwap)
+ The MskCmpSwap atomic operation is an extension to the CmpSwap operation
+ defined in the IB spec. MskCmpSwap allows the user to select a portion of the
+ 64 bit target data for the "compare" check as well as to restrict the swap to
+ a (possibly different) portion.
+- Masked Fetch and Add (MFetchAdd)
+ The MFetchAdd Atomic operation extends the functionality of the standard IB
+ FetchAdd by allowing the user to split the target into multiple fields of
+ selectable length. The atomic add is done independently on each of these
+ fields. A bit set in the field_boundary parameter specifies the field
+ boundaries.
+- Improved VLAN tagging performance for the mlx4_en driver.
+- RSS support for Ethernet UDP traffic on ConnectX-2 cards with firmware
+ 2.7.700 and higher.
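+
+ The semantics of the two masked atomic operations in this list can be
+ modelled in a few lines of Python. This is a behavioural sketch only, not the
+ verbs API: the function names and argument names are hypothetical, the
+ operations are modelled as pure functions returning (old value, new value),
+ and the field_boundary handling assumes that a set bit marks the most
+ significant bit of a field, so that a carry out of that bit is dropped.
+
```python
MASK64 = (1 << 64) - 1


def masked_cmp_swap(target, compare, compare_mask, swap, swap_mask):
    """Model of a 64-bit masked compare-and-swap.

    Only the bits selected by compare_mask take part in the comparison, and
    only the bits selected by swap_mask are replaced on a match.
    """
    if (target & compare_mask) == (compare & compare_mask):
        new = (target & ~swap_mask & MASK64) | (swap & swap_mask)
    else:
        new = target
    return target, new


def masked_fetch_add(target, add, field_boundary):
    """Model of a 64-bit masked fetch-and-add.

    Assumption: a bit set in field_boundary is the top bit of a field, so the
    carry out of that bit is not propagated into the next field.
    """
    result = 0
    carry = 0
    for i in range(64):
        bit = 1 << i
        total = carry + ((target >> i) & 1) + ((add >> i) & 1)
        if total & 1:
            result |= bit
        carry = 0 if field_boundary & bit else total >> 1
    return target, result


if __name__ == "__main__":
    print([hex(v) for v in masked_cmp_swap(0x1234, 0x0034, 0x00FF, 0xAB00, 0xFF00)])
    print([hex(v) for v in masked_fetch_add(0x01FF, 0x0101, 0x0080)])
```
+
+ With field_boundary set to 0 the second function behaves like an ordinary
+ 64-bit add that wraps on overflow.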
+
+9.2 Bug Fixes
+-------------
+- Bonding stops functioning when one of the Ethernet ports is closed.
+- "Scheduling while atomic" errors in /var/log/messages when working with
+ bonding and mlx4_en drivers in several operating systems.
+
+===============================================================================
+10. New features and bug fixes since OFED 1.5.1
+===============================================================================
+10.1 Changes and New Features
+-----------------------------
+1. Added RAW QP support
+2. Extended the range of log_mtts_per_seg - upper bound moved from 5 to 7.
+3. Added 0xff70 vendor ID support for MADs.
+4. Added support for GID change event.
+5. Better interrupts spreading under heavy RX load (mlx4_en)
+
+10.2 Bug Fixes
+--------------
+1. Fixed chunk sg list overflow in mlx4_alloc_icm()
+2. Fixed bug in invalidation of counter index.
+3. Fixed bug in catching netdev events for updating GID table.
+4. Fixed bug in populating GID table for RoCE.
+5. Fixed XRC locking and prevention of null dereference.
+6. Added spinlock to xrc_reg_list changes and scanning in interrupt context.
+7. Fixed offload changes via Ethtool for VLAN interfaces
+
+===============================================================================
+11. New features and bug fixes since OFED 1.5.2
+===============================================================================
+11.1 Changes and new features
+-----------------------------
+1. RoCE counters are now added to the regular Ethernet counters. The counters
+for RoCE specific traffic are at the same place and are not changed.
+2. Forward any vendor ID SMP MADs to firmware for handling.
+3. Add blue flame support for kernel consumers. This allows lower latencies to
+be achieved. To use blue flame, a consumer needs to create the QP with inline
+support.
+
+11.2 Bug fixes
+--------------
+1. Fix race when reading node description through MADs.
+2. Fix modify CQ so each of moderation parameters is independent.
+3. Limit the number of fast registration work requests to match HW capabilities.
+4. Changes to node-description via sysfs are now propagated to FW (for FW
+2.8.000 and later). This enables FW to send a 144 trap to OpenSM regarding the
+change, so that OpenSM can read that node's updated description. This fixes an
+old race condition, where OpenSM read the node's description before it was
+changed during driver startup.
+5. Fix max fast registration WRs that can be posted to CX.
+6. Fix port speed reporting for RoCE ports.
+7. Limit GID entries for VLAN to match hardware capabilities.
+8. Fix RoCE link state report.
+9. Workaround firmware bug reporting wrong number of blue flame registers.
+10. Bug fix in kernel post_send when VLANs are used.
+11. Fix in mlx4_en for handling VLAN operations when working under bond
+ interfaces.
+12. Fix Ethtool transceiver type report for mlx4_en
+
+
+===============================================================================
+12. Known Issues
+===============================================================================
+- The SQD feature is not supported
+- To load the driver on machines with a 64KB default page size, the UAR bar
+ must be enlarged. This applies where 64KB is the default page size (PPC with
+ RHEL5, Itanium with SLES 11) or where a 64KB page size has been enabled.
+ Perform the following three steps:
+ 1. Add the following line in the firmware configuration (INI) file under the
+ [HCA] section:
+ log2_uar_bar_megabytes = 5
+ 2. Burn a modified firmware image with the changed INI file.
+ 3. Reboot the system.
+
+
+================================================================================
+13. mlx4 available parameters
+================================================================================
+In order to set mlx4 parameters, add the following line(s) to /etc/modprobe.conf:
+ options mlx4_core parameter=<value>
+ and/or
+ options mlx4_ib parameter=<value>
+ and/or
+ options mlx4_en parameter=<value>
+
+mlx4_core parameters:
+ set_4k_mtu: try to set 4K MTU to all ConnectX ports (int)
+ debug_level: enable debug tracing if > 0 (int)
+ block_loopback: block multicast loopback packets if > 0 (int)
+ msi_x: attempt to use MSI-X if nonzero (int)
+ log_num_mac: log2 max number of MACs per ETH port (1-7, int)
+ use_prio: enable steering by VLAN priority on ETH ports
+ (0/1, default 0) (bool)
+ log_num_qp: log maximum number of QPs per HCA (int)
+ log_num_srq: log maximum number of SRQs per HCA (int)
+ log_rdmarc_per_qp: log number of RDMARC buffers per QP (int)
+ log_num_cq: log maximum number of CQs per HCA (int)
+ log_num_mcg: log maximum number of multicast groups per HCA
+ (int)
+ log_num_mpt: log maximum number of memory protection table
+ entries per HCA (int)
+ log_num_mtt: log maximum number of memory translation table
+ segments per HCA (int)
+ log_mtts_per_seg: log2 number of MTT entries per segment (1-7)
+ (int)
+ enable_qos: enable Quality of Service support in the HCA
+ (default: off) (bool)
+ enable_pre_t11_mode: set FCoXX to pre-T11 mode if non-zero
+ (default 0) (int)
+ internal_err_reset: reset device on internal errors if non-zero
+ (default 1) (int)
+
+mlx4_ib parameters:
+ debug_level: enable debug tracing if > 0 (default 0)
+
+mlx4_en parameters:
+ udp_rss: enable RSS for incoming UDP traffic or disabled (0)
+ tcp_rss: enable RSS for incoming TCP traffic or disabled (0)
+ num_lro: number of LRO sessions per ring or disabled (0)
+ (default is 32)
+ ip_reasm: allow reassembly of fragmented IP packets (default
+ is enabled)
+ pfctx: priority based Flow Control policy on TX[7:0]
+ per priority bit mask (default is 0)
+ pfcrx: priority based Flow Control policy on RX[7:0]
+ per priority bit mask (default is 0)
+ inline_thold: threshold for using inline data (default is 128)
--- /dev/null
+ MPI Selector 1.0 release notes
+ December 2009
+ ==============================
+
+OFED contains a simple mechanism for system administrators and end
+users to select which MPI implementation they want to use. The MPI
+selector functionality is not specific to any MPI implementation; it
+can be used with any implementation that provides shell startup files
+that correctly set the environment for that MPI. The OFED installer
+will automatically add MPI selector support for each MPI that it
+installs. Additional MPIs not known by the OFED installer can be
+listed in the MPI selector; see the mpi-selector(1) man page for
+details.
+
+Note that MPI selector only affects the default MPI environment for
+*future* shells. Specifically, if you use MPI selector to select MPI
+implementation ABC, this default selection will not take effect until
+you start a new shell (e.g., logout and login again). Other packages
+(such as environment modules) provide functionality that allows
+changing your environment to point to a new MPI implementation in the
+current shell. The MPI selector was not meant to duplicate or replace
+that functionality.
+
+The MPI selector functionality can be invoked in one of two ways:
+
+1. The mpi-selector-menu command.
+
+ This command is a simple, menu-based program that allows the
+ selection of the system-wide MPI (usually only settable by root)
+ and a per-user MPI selection. It also shows what the current
+ selections are.
+
+ This command is recommended for all users.
+
+2. The mpi-selector command.
+
+ This command is a CLI-equivalent of the mpi-selector-menu,
+ allowing for the same functionality as mpi-selector-menu but
+ without the interactive menus and prompts. It is suitable for
+ scripting.
+
+See the mpi-selector(1) man page for more information.
+
--- /dev/null
+===============================================================================
+ OFED 1.5.2 for Linux
+ Mellanox Firmware Burning and Diagnostic Utilities
+ December 2010
+===============================================================================
+
+===============================================================================
+Table of Contents
+===============================================================================
+1. Overview
+2. New Features
+3. Bugs Fixed
+4. Known Issues
+
+===============================================================================
+1. Overview
+===============================================================================
+
+This package contains burning and diagnostic tools for Mellanox
+manufactured cards. It also provides access to the relevant source code. Please
+see the file LICENSE for licensing details.
+
+Package Contents:
+ a) mstflint source code
+ b) mflash lib
+ This lib provides Flash access through Mellanox HCAs.
+ c) mtcr lib (implemented in mtcr.h file)
+ This lib enables access to adapter hardware registers via PCIe
+ d) mstregdump utility
+ This utility dumps hardware registers from Mellanox hardware for later
+ analysis by Mellanox.
+ e) mstvpd
+ This utility dumps the on-card VPD (Vital Product Data, which contains
+ the card serial number, part number, and other info).
+ f) hca_self_test.ofed
+ This script checks the status of software, firmware and hardware of the
+ HCAs or NICs installed on the local host.
+
+===============================================================================
+2. New Features
+===============================================================================
+
+* Added support for flash type SST25VF016B in mstflint
+
+* Added support for flash type M25PX16 in mstflint
+
+* Added an option to set the VSD and GUIDs (mstflint command 'sv' and 'sg') in
+ a binary image file. This is useful for production to prepare images for pre-
+ assembly flash burning. These new commands are supported by Mellanox 4th
+ generation devices.
+
+* Added an option to set the VSD and GUIDs (mstflint command 'sv' and 'sg') on
+ an already burnt device. These commands re-burn the existing image with the
+ given GUIDs or VSD.
+ When the 'sg' command is applied on a device with blank (0xff) GUIDs, it
+ updates the GUIDs without re-burning the image.
+
+* mstregdump: Updated address list for ConnectX2 device.
+
+===============================================================================
+3. Bugs Fixed
+===============================================================================
+
+* Show correct device names in mstflint help
+
+===============================================================================
+4. Known Issues
+===============================================================================
+
+* On rare occasions, you may get the following error message when running
+ mstflint:
+ Warning: memory access to device 0a:00.0 failed: Input/output error.
+ Warning: Fallback on IO: much slower, and unsafe if device in use.
+ *** buffer overflow detected ***: mstflint terminated
+
+ To solve the issue, run "mst start" (requires MFT - Mellanox Firmware Tools package) and
+ then re-run mstflint.
+
--- /dev/null
+ Open Fabrics Enterprise Distribution (OFED)
+ mthca in OFED 1.5 Release Notes
+
+ December 2009
+
+
+===============================================================================
+Table of Contents
+===============================================================================
+1. Overview
+2. Fixed Bugs since OFED 1.3.1
+3. Bug fixes and enhancements since OFED 1.4
+4. Known Issues
+
+===============================================================================
+1. Overview
+===============================================================================
+mthca is the low level driver implementation for the following Mellanox
+Technologies HCAs: InfiniHost, InfiniHost III Ex and InfiniHost III Lx.
+
+mthca Available Parameters
+--------------------------
+In order to set mthca parameters, add the following line to /etc/modprobe.conf:
+
+ options ib_mthca parameter=<value>
+
+mthca parameters:
+ catas_reset_disable: disable reset on catastrophic event if nonzero
+ (int)
+ fw_cmd_doorbell: post FW commands through doorbell page if
+ nonzero (and supported by FW) (int)
+ debug_level: Enable debug tracing if > 0 (int)
+ msi_x: attempt to use MSI-X if nonzero (int)
+ tune_pci: increase PCI burst from the default set by BIOS if nonzero (int)
+ num_qp: maximum number of QPs per HCA (int)
+ rdb_per_qp: number of RDB buffers per QP (int)
+ num_cq: maximum number of CQs per HCA (int)
+ num_mcg: maximum number of multicast groups per HCA (int)
+ num_mpt: maximum number of memory protection table entries per HCA (int)
+ num_mtt: maximum number of memory translation table segments per HCA (int)
+ num_udav: maximum number of UD address vectors per HCA (int)
+ fmr_reserved_mtts: number of memory translation table segments reserved for
+ FMR (int)
+ log_mtts_per_seg: Log2 number of MTT entries per segment (1-5) (int)
+
+===============================================================================
+2. Fixed Bugs since OFED 1.3.1
+===============================================================================
+- Fix access to freed memory in catastrophic processing
+ catas_reset() uses a pointer to mthca_dev, but mthca_dev is no longer valid
+ after the call to __mthca_restart_one().
+
+
+===============================================================================
+3. Bug fixes and enhancements since OFED 1.4
+===============================================================================
+- Added a module parameter (log_mtts_per_seg) for number of MTTs per segment.
+ This makes it possible to register more memory with the same number of
+ segments.
+- Bring INIT_HCA and other command timeouts into consistency with the PRM. This
+ solves an issue seen when more than 2^18 max QPs were configured.
+
+===============================================================================
+4. Known Issues
+===============================================================================
+1. A UAR size other than 8MB prevents mthca driver loading. The default UAR
+ size is 8MB. If the size is changed, the following error message will be
+ logged to /var/log/messages upon attempting to load the mthca driver:
+ ib_mthca 0000:04:00.0: Missing UAR, aborting.
+
+2. If a user level application using multicast receives a control signal
+ in the process of detaching from a multicast group, its QP may remain a
+ member of the multicast group (in HCA).
+ Workaround: Destroy the multicast group after detaching the QP from it.
+
+3. In mem-free devices, RC QPs can be created with a maximum of (max_sge - 1)
+ entries only; UD QPs can be created with a maximum of (max_sge - 3) entries.
+
+4. Performance can be degraded due to a wrong BIOS configuration:
+ The PCI Express specification requires the BIOS to set the MaxReadReq
+ register for each HCA card for maximum performance and stability.
+
+ If you experience bandwidth performance degradation, try forcing the card to
+ deviate from the PCI Express specification by setting the
+ tune_pci=1 module parameter. This tune_pci=1 assignment was the default
+ setting in OFED 1.0; therefore, it may have masked performance degradation
+ on some systems.
+
+ If tune_pci=1 improves bandwidth, please report the issue to your BIOS
+ vendor. Please note that Mellanox Technologies does not recommend using
+ tune_pci=1 in production systems: working with tune_pci=1 set is untested
+ and is known to trigger instability issues on some platforms.
+
--- /dev/null
+========================================================================
+
+ Open Fabrics Enterprise Distribution (OFED)
+ MVAPICH2-1.5.1 in OFED 1.5.2 Release Notes
+
+ September 2010
+
+
+Overview
+--------
+
+These are the release notes for MVAPICH2-1.5.1. MVAPICH2 is an MPI-2
+implementation over InfiniBand, iWARP and RoCEE (RDMAoE) from the Ohio
+State University (http://mvapich.cse.ohio-state.edu/).
+
+
+User Guide
+----------
+
+For more information on using MVAPICH2-1.5.1, please visit the user
+guide at http://mvapich.cse.ohio-state.edu/support/.
+
+
+Software Dependencies
+---------------------
+
+MVAPICH2 depends on the installation of the OFED Distribution stack with
+OpenSM running. The MPI module also requires an established network
+interface (either InfiniBand, IPoIB, iWARP, RoCEE, uDAPL, or Ethernet).
+BLCR support is needed if built with fault tolerance support. Similarly,
+HWLOC support is needed if built with Portable Hardware Locality feature
+for CPU mapping.
+
+
+ChangeLog
+---------
+
+* Features and Enhancements
+ - Significantly reduce memory footprint on some systems by changing
+ the stack size setting for multi-rail configurations
+ - Optimization to the number of RDMA Fast Path connections
+ - Performance improvements in Scatterv and Gatherv collectives for
+ CH3 interface (Thanks to Dan Kokran and Max Suarez of NASA for
+ identifying the issue)
+ - Tuning of Broadcast Collective
+ - Support for tuning of eager thresholds based on both adapter and
+ platform type
+ - Environment variables for message sizes can now be expressed in
+ short form K=Kilobytes and M=Megabytes (e.g.
+ MV2_IBA_EAGER_THRESHOLD=12K)
+ - Ability to selectively use some or all HCAs using colon-separated
+ lists, e.g. MV2_IBA_HCA=mlx4_0:mlx4_1
+ - Improved Bunch/Scatter mapping for process binding with HWLOC and
+ SMT support (Thanks to Dr. Bernd Kallies of ZIB for ideas and
+ suggestions)
+ - Update to Hydra code from MPICH2-1.3b1
+ - Auto-detection of various iWARP adapters
+ - Specifying MV2_USE_IWARP=1 is no longer needed when using iWARP
+ - Changing automatic eager threshold selection and tuning for iWARP
+ adapters based on number of nodes in the system instead of the
+ number of processes
+ - PSM progress loop optimization for QLogic Adapters (Thanks to Dr.
+ Avneesh Pant of QLogic for the patch)
+
+* Bug fixes
+ - Fix memory leak in registration cache with --enable-g=all
+ - Fix memory leak in operations using datatype modules
+ - Fix for rdma_cross_connect issue for RDMA CM. The server is
+ prevented from initiating a connection.
+ - Don't fail during build if RDMA CM is unavailable
+ - Various mpirun_rsh bug fixes for CH3, Nemesis and uDAPL interfaces
+ - ROMIO panfs build fix
+ - Update panfs for not-so-new ADIO file function pointers
+ - Shared libraries can be generated with unknown compilers
+ - Explicitly link against DL library to prevent build error due to
+ DSO link change in Fedora 13 (introduced with gcc-4.4.3-5.fc13)
+ - Fix regression that prevents the proper use of our internal HWLOC
+ component
+ - Remove spurious debug flags when certain options are selected at
+ build time
+ - Error code added for situation when received eager SMP message is
+ larger than receive buffer
+ - Fix for Gather and GatherV back-to-back hang problem with LiMIC2
+ - Fix for packetized send in Nemesis
+ - Fix related to eager threshold in nemesis ib-netmod
+ - Fix initialization parameter for Nemesis based on adapter type
+ - Fix for uDAPL one sided operations (Thanks to Jakub Fedoruk from
+ Intel for reporting this)
+ - Fix an issue with out-of-order message handling for iWARP
+ - Fixes for memory leak and Shared context Handling in PSM for
+ QLogic Adapters (Thanks to Dr. Avneesh Pant of QLogic for the
+ patch)
+
+
+Main Verification Flows
+-----------------------
+
+In order to verify the correctness of MVAPICH2-1.5.1, the following
+tests and parameters were run.
+
+Test             Description
+====================================================================
+Intel            Intel's MPI functionality test suite
+OSU Benchmarks   OSU's performance tests
+IMB              Intel's MPI Benchmark test
+mpich2           Test suite distributed with MPICH2
+NAS              NAS Parallel Benchmarks (NPB3.2)
+
+
+Mailing List
+------------
+
+There is a public mailing list mvapich-discuss@cse.ohio-state.edu for
+mvapich users and developers to:
+- Ask for help and support from each other and get a prompt response
+- Contribute patches and enhancements
+
+========================================================================
--- /dev/null
+ Open Fabrics Enterprise Distribution (OFED)
+ OSU MPI MVAPICH-1.2.0, in OFED 1.5 Release Notes
+
+ December 2009
+
+
+===============================================================================
+Table of Contents
+===============================================================================
+1. Overview
+2. Software Dependencies
+3. New Features
+4. Bug Fixes
+5. Known Issues
+6. Main Verification Flows
+
+
+===============================================================================
+1. Overview
+===============================================================================
+These are the release notes for OSU MPI MVAPICH-1.2.0.
+OSU MPI is an MPI channel implementation over InfiniBand
+by Ohio State University (OSU).
+
+See http://mvapich.cse.ohio-state.edu
+
+
+===============================================================================
+2. Software Dependencies
+===============================================================================
+OSU MPI depends on the installation of the OFED stack with OpenSM running.
+The MPI module also requires an established network interface (either
+InfiniBand IPoIB or Ethernet).
+
+
+===============================================================================
+3. New Features ( Compared to mvapich 1.1.0 )
+===============================================================================
+MVAPICH-1.2.0 has the following additional features:
+- Advanced network recovery support
+- mpirun launcher improvements
+- Efficient intra-node shared memory communication
+ support for diskless clusters
+- RoCEE (RDMAoE) networks support
+
+===============================================================================
+4. Bug Fixes ( Compared to mvapich 1.1.0 )
+===============================================================================
+- Multiple fixes for mpirun_rsh launcher
+
+===============================================================================
+5. Known Issues
+===============================================================================
+- Shared memory broadcast optimization is disabled by default.
+
+- MVAPICH MPI compiled on AMD x86_64 does not work with MVAPICH MPI compiled
+ on Intel x86_64 (EM64T).
+ Workaround:
+ Use the "VIADEV_USE_COMPAT_MODE=1" run-time option to enable a compatibility
+ mode that works on both AMD and Intel platforms.
+
+- A process running MPI cannot fork after MPI_Init unless the environment
+ variable IBV_FORK_SAFE=1 is set to enable fork support. This support also
+ requires a kernel version of 2.6.16 or higher.
+
+- For users of Mellanox Technologies firmware fw-23108 or fw-25208 only:
+ MVAPICH might fail in its default configuration if your HCA is burnt with an
+ fw-23108 version that is earlier than 3.4.000, or with an fw-25208 version
+ 4.7.400 or earlier.
+
+ NOTE: There is no issue if you chose to update firmware during Mellanox
+ OFED installation as newer firmware versions were burnt.
+
+ Workaround:
+ Option 1 - Update the firmware. For instructions, see Mellanox Firmware Tools
+ (MFT) User's Manual under the docs/ folder.
+ Option 2 - In mvapich.conf, set VIADEV_SRQ_ENABLE=0
+
+- MVAPICH may fail to run on some SLES 10 machines due to problems in resolving
+ the host name.
+ Workaround: Edit /etc/hosts and comment-out/remove the line that maps
+ IP address 127.0.0.2 to the system's fully qualified hostname.
+
+
+===============================================================================
+6. Main Verification Flows
+===============================================================================
+In order to verify the correctness of MVAPICH, the following tests and
+parameters were run.
+
+Test      Description
+-------------------------------------------------------------------
+Intel's   Test suite - 1400 Intel tests
+BW/LT     OSU's test for bandwidth latency
+IMB       Intel's MPI Benchmark test
+mpitest   b_eff test
+Presta    Presta multicast test
+Linpack   Linpack benchmark
+NAS2.3    NAS NPB2.3 tests
+SuperLU   SuperLU benchmark (NERSC edition)
+NAMD      NAMD application
+CAM       CAM application
--- /dev/null
+ Open Fabrics Enterprise Distribution (OFED)
+ NetEffect Ethernet Cluster Server Adapter Release Notes
+ September 2010
+
+
+
+The iw_nes module and libnes user library provide RDMA and L2IF
+support for the NetEffect Ethernet Cluster Server Adapters.
+
+==========
+What's New
+==========
+OFED 1.5.2 contains several enhancements and bug fixes to the iw_nes driver.
+
+* Add new feature iWARP Multicast Acceleration (IMA).
+* Add module option to disable extra doorbell read after a write.
+* Change CQ event notification to not fire event unless there is a
+ new CQE not polled.
+* Fix payload calculation for post receive with more than one SGE.
+* Fix crash when CLOSE was indicated twice due to connection close
+ during remote peer's timeout on pending MPA reply.
+* Fix ifdown hang by not calling ib_unregister_device() till removal
+ of iw_nes module.
+* Handle RST when state of connection is in FIN_WAIT2.
+* Correct properties for various nes_query_{qp, port, device} calls.
+
+
+============================================
+Required Setting - RDMA Unify TCP port space
+============================================
+RDMA connections use the same TCP port space as the host stack. To avoid
+conflicts, set rdma_cm module option unify_tcp_port_space to 1 by adding
+the following to /etc/modprobe.conf:
+
+ options rdma_cm unify_tcp_port_space=1
+
+
+========================================
+Required Setting - Power Management Mode
+========================================
+If possible, disable Active State Power Management in the BIOS, e.g.:
+
+ PCIe ASPM L0s - Advanced State Power Management: DISABLED
+
+
+=======================
+Loadable Module Options
+=======================
+The following options can be used when loading the iw_nes module by modifying
+the modprobe.conf file.
+
+wide_ppm_offset=0
+ Setting this to 1 increases the CX4 interface clock ppm offset to 300ppm.
+ The default setting, 0, is 100ppm.
+
+mpa_version=1
+ MPA version to be used in MPA Req/Resp (0 or 1).
+
+disable_mpa_crc=0
+ Disable checking of MPA CRC.
+ Set to 1 to disable MPA CRC checking; the default, 0, leaves it enabled.
+
+send_first=0
+ Send RDMA Message First on Active Connection.
+
+nes_drv_opt=0x00000100
+ The following options are supported:
+
+ 0x00000010 - Enable MSI
+ 0x00000080 - No Inline Data
+ 0x00000100 - Disable Interrupt Moderation
+ 0x00000200 - Disable Virtual Work Queue
+ 0x00001000 - Disable extra doorbell read after write
+
+nes_debug_level=0
+ Specify debug output level.
+
+wqm_quanta=65536
+ Set size of data to be transmitted at a time.
+
+limit_maxrdreqsz=0
+ Limit PCI read request size to 256 bytes.
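+
+The nes_drv_opt values listed above occupy distinct bits, so several of them
+can be selected at once by OR-ing the values together. The sketch below
+combines "Enable MSI" with "Disable Interrupt Moderation" and prints the
+resulting modprobe.conf line. The shell variable names are illustrative only;
+they are not defined by the driver.
+

```shell
#!/bin/sh
# Flag values copied from the nes_drv_opt list above.
# The variable names are hypothetical and exist only in this sketch.
NES_OPT_ENABLE_MSI=0x00000010
NES_OPT_DISABLE_INT_MOD=0x00000100

# OR the selected flags together and format the result as 8 hex digits.
nes_drv_opt=$(printf '0x%08x' $(( $NES_OPT_ENABLE_MSI | $NES_OPT_DISABLE_INT_MOD )))

# Print the line that would be added to modprobe.conf.
echo "options iw_nes nes_drv_opt=$nes_drv_opt"
```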
+
+
+===============
+Runtime Options
+===============
+The following options can be used to alter the behavior of the iw_nes module.
+NOTE: The examples assume that the NetEffect Ethernet Cluster Server Adapter
+is assigned eth2.
+
+ ifconfig eth2 mtu 9000 - largest mtu supported
+
+ ethtool -K eth2 tso on - enables TSO
+ ethtool -K eth2 tso off - disables TSO
+
+ ethtool -C eth2 rx-usecs-irq 128 - set static interrupt moderation
+
+ ethtool -C eth2 adaptive-rx on - enable dynamic interrupt moderation
+ ethtool -C eth2 adaptive-rx off - disable dynamic interrupt moderation
+ ethtool -C eth2 rx-frames-low 16 - low watermark of rx queue for dynamic
+ interrupt moderation
+ ethtool -C eth2 rx-frames-high 256 - high watermark of rx queue for
+ dynamic interrupt moderation
+ ethtool -C eth2 rx-usecs-low 40 - smallest interrupt moderation timer
+ for dynamic interrupt moderation
+ ethtool -C eth2 rx-usecs-high 1000 - largest interrupt moderation timer
+ for dynamic interrupt moderation
+
+===================
+uDAPL Configuration
+===================
+The rest of this document assumes the following uDAPL settings in dat.conf:
+
+ OpenIB-iwarp u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth2 0" ""
+ ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
+
+
+==============
+mpd.hosts file
+==============
+mpd.hosts is a text file with a list of nodes, one per line, in the MPI ring.
+Use either a fully qualified hostname or an IP address for each node.
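+
+As an illustration, the sketch below writes a three-node mpd.hosts file and
+counts its entries. The host names and the address are placeholders, and
+HOSTS_FILE defaults to a scratch file; the mpirun examples in this document
+use /opt/mpd.hosts.
+

```shell
#!/bin/sh
# Assumption: a scratch file stands in for /opt/mpd.hosts.
HOSTS_FILE=${HOSTS_FILE:-$(mktemp)}

# One node per line. These names and the address are placeholders.
cat > "$HOSTS_FILE" <<'EOF'
node01.example.com
node02.example.com
192.168.0.13
EOF

# Count the non-empty lines, that is, the nodes in the MPI ring.
node_count=$(grep -c . "$HOSTS_FILE")
echo "$node_count nodes listed in $HOSTS_FILE"
```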
+
+
+=======================================
+Recommended Settings for HP MPI 2.2.7
+=======================================
+Add the following to mpirun command:
+
+ -1sided
+
+Example mpirun command with uDAPL-2.0:
+
+ mpirun -np 2 -hostfile /opt/mpd.hosts \
+ -UDAPL -prot -intra=shm \
+ -e MPI_HASIC_UDAPL=ofa-v2-iwarp \
+ -1sided \
+ /opt/hpmpi/help/hello_world
+
+Example mpirun command with uDAPL-1.2:
+
+ mpirun -np 2 -hostfile /opt/mpd.hosts \
+ -UDAPL -prot -intra=shm \
+ -e MPI_HASIC_UDAPL=OpenIB-iwarp \
+ -1sided \
+ /opt/hpmpi/help/hello_world
+
+
+============================================================
+Recommended Settings for Platform MPI 7.1 (formerly HP-MPI)
+============================================================
+Add the following to mpirun command:
+
+ -1sided
+
+Example mpirun command with uDAPL-2.0:
+
+ mpirun -np 2 -hostfile /opt/mpd.hosts \
+ -UDAPL -prot -intra=shm \
+ -e MPI_HASIC_UDAPL=ofa-v2-iwarp \
+ -1sided \
+ /opt/platform_mpi/help/hello_world
+
+Example mpirun command with uDAPL-1.2:
+
+ mpirun -np 2 -hostfile /opt/mpd.hosts \
+ -UDAPL -prot -intra=shm \
+ -e MPI_HASIC_UDAPL=OpenIB-iwarp \
+ -1sided \
+ /opt/platform_mpi/help/hello_world
+
+
+==============================================
+Recommended Settings for Intel MPI 3.2.x/4.0.x
+==============================================
+Add the following to mpiexec command:
+
+ -genv I_MPI_FALLBACK_DEVICE 0
+ -genv I_MPI_DEVICE rdma:OpenIB-iwarp
+ -genv I_MPI_USE_RENDEZVOUS_RDMA_WRITE 1
+
+Example mpiexec command line for uDAPL-2.0:
+
+ mpiexec -genv I_MPI_FALLBACK_DEVICE 0 \
+ -genv I_MPI_DEVICE rdma:ofa-v2-iwarp \
+ -genv I_MPI_USE_RENDEZVOUS_RDMA_WRITE 1 \
+ -ppn 1 -n 2 \
+ /opt/intel/impi/3.2.2/bin64/IMB-MPI1
+
+Example mpiexec command line for uDAPL-1.2:
+
+ mpiexec -genv I_MPI_FALLBACK_DEVICE 0 \
+ -genv I_MPI_DEVICE rdma:OpenIB-iwarp \
+ -genv I_MPI_USE_RENDEZVOUS_RDMA_WRITE 1 \
+ -ppn 1 -n 2 \
+ /opt/intel/impi/3.2.2/bin64/IMB-MPI1
+
+
+========================================
+Recommended Setting for MVAPICH2 and OFA
+========================================
+Add the following to the mpiexec command:
+
+ -env MV2_USE_IWARP_MODE 1
+
+Example mpiexec command line:
+
+ mpiexec -l -n 2 \
+ -env MV2_USE_IWARP_MODE 1 \
+ /usr/mpi/gcc/mvapich2-1.5/tests/osu_benchmarks-3.1.1/osu_latency
+
+
+==========================================
+Recommended Setting for MVAPICH2 and uDAPL
+==========================================
+Add the following to the mpiexec command for 64 or more processes:
+
+ -env MV2_ON_DEMAND_THRESHOLD <number of processes>
+
+Example mpiexec command with uDAPL-2.0:
+
+ mpiexec -l -n 64 \
+ -env MV2_DAPL_PROVIDER ofa-v2-iwarp \
+ -env MV2_ON_DEMAND_THRESHOLD 64 \
+ /usr/mpi/gcc/mvapich2-1.5/tests/IMB-3.2/IMB-MPI1
+
+Example mpiexec command with uDAPL-1.2:
+
+ mpiexec -l -n 64 \
+ -env MV2_DAPL_PROVIDER OpenIB-iwarp \
+ -env MV2_ON_DEMAND_THRESHOLD 64 \
+ /usr/mpi/gcc/mvapich2-1.5/tests/IMB-3.2/IMB-MPI1
+
+
+===========================
+Modify Settings in Open MPI
+===========================
+There is more than one way to specify MCA parameters in
+Open MPI. Please visit this link and use the best method
+for your environment:
+
+http://www.open-mpi.org/faq/?category=tuning#setting-mca-params
+
+
+=======================================
+Recommended Settings for Open MPI 1.4.2
+=======================================
+Allow the sender to use RDMA Writes:
+
+ -mca btl_openib_flags 2
+
+Example mpirun command line:
+
+ mpirun -np 2 -hostfile /opt/mpd.hosts \
+ -mca btl openib,self,sm \
+ -mca btl_mpi_leave_pinned 0 \
+ -mca btl_openib_flags 2 \
+ /usr/mpi/gcc/openmpi-1.4.2/tests/IMB-3.2/IMB-MPI1
+
+
+===================================
+iWARP Multicast Acceleration (IMA)
+===================================
+
+iWARP multicast acceleration enables kernel bypass for raw L2 multicast
+traffic through the user-space verbs API, using the newly defined QP type
+IBV_QPT_RAW_ETH.
+
+The L2 RAW_ETH acceleration assumes that the user application transmits and
+receives whole L2 frames, including the MAC/IP/UDP/TCP headers.
+
+ETH RAW QP usage:
+First, the application creates an IBV_QPT_RAW_ETH QP with its associated CQ,
+PD, and completion channels, as is done for an RDMA connection.
+
+Next, it enables L2 MAC address RX filters that direct received multicasts
+to the RAW_ETH QPs, using the ibv_attach_multicast() verb.
+
+From this point on, the application is ready to receive and transmit
+multicast traffic.
+
+In multicast acceleration, the user application passes the whole multicast
+frame to ibv_post_send(), including the MAC header, IP header, UDP header, and
+UDP payload. The application is responsible for IP fragmentation when the
+payload is larger than the MTU. Every fragment is transmitted as a separate
+L2 frame. ibv_poll_cq() reports the status of the transmit buffer.
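+
+Because each fragment is posted as its own L2 frame, the number of send
+buffers needed for one datagram can be estimated in advance. The sketch below
+is an illustration only: frag_count is a hypothetical helper, and it assumes
+a 20-byte IPv4 header without options and an 8-byte UDP header.
+

```shell
#!/bin/sh
# frag_count UDP_PAYLOAD_BYTES MTU_BYTES
# Hypothetical helper: print how many IPv4 fragments (L2 frames) are needed.
# Assumes a 20-byte IPv4 header (no options) and an 8-byte UDP header.
frag_count() {
    payload=$1
    mtu=$2
    ip_hdr=20
    udp_hdr=8
    # Data bytes carried per fragment. Every fragment except the last must
    # carry a multiple of 8 bytes, so round down to a multiple of 8.
    per_frag=$(( (mtu - ip_hdr) / 8 * 8 ))
    # The UDP header travels inside the IP payload of the first fragment.
    total=$(( payload + udp_hdr ))
    # Ceiling division.
    echo $(( (total + per_frag - 1) / per_frag ))
}

frag_count 1472 1500   # prints 1: fits in a single frame
frag_count 4000 1500   # prints 3
frag_count 8192 9000   # prints 1: fits in one jumbo frame
```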
+
+On the receive path, ibv_poll_cq() returns information about the received L2
+packet. The Rx buffer (previously posted by ibv_post_recv()) contains the
+whole L2 frame, including the MAC header, IP header, and UDP header.
+The user application is responsible for checking that the received packet is
+a valid UDP frame, so it must check the fragments and compute the checksums.
+
+IMA API description (NE020 specific):
+The user application must create separate CQs for the RX and TX paths.
+Only a single SGE is supported on transmit.
+The user application must post at least 65 RX buffers to keep the RX path working.
+
+IMA device:
+IMA requires the /dev/infiniband/nes_ud_sksq device to be created in order to
+access the optimized IMA transmit path. The best method for creating this
+device is to manually add the following line to the
+/etc/udev/rules.d/90-ib.rules file after installing the OFED distribution, and
+then reboot the machine.
+
+KERNEL=="nes_ud_sksq", NAME="infiniband/%k", MODE="0644"
+
+As a result, the 90-ib.rules file should look like:
+
+KERNEL=="umad*", NAME="infiniband/%k"
+KERNEL=="issm*", NAME="infiniband/%k"
+KERNEL=="ucm*", NAME="infiniband/%k", MODE="0666"
+KERNEL=="uverbs*", NAME="infiniband/%k", MODE="0666"
+KERNEL=="ucma", NAME="infiniband/%k", MODE="0666"
+KERNEL=="rdma_cm", NAME="infiniband/%k", MODE="0666"
+KERNEL=="nes_ud_sksq", NAME="infiniband/%k", MODE="0644"
+
+
+
+NetEffect is a trademark of Intel Corporation in the U.S. and other countries.
--- /dev/null
+################################################################################
+# #
+# NFS/RDMA README #
+# #
+################################################################################
+
+ Author: NetApp and Open Grid Computing
+
+ Adapted for OFED 1.5.1 (from linux-2.6.30/Documentation/filesystems/nfs-rdma.txt)
+ by Jon Mason
+
+Table of Contents
+~~~~~~~~~~~~~~~~~
+ - Overview
+ - OFED 1.5.1 limitations
+ - Getting Help
+ - Installation
+ - Check RDMA and NFS Setup
+ - NFS/RDMA Setup
+
+Overview
+~~~~~~~~
+
+ This document describes how to install and set up the Linux NFS/RDMA client
+ and server software.
+
+ The NFS/RDMA client was first included in Linux 2.6.24. The NFS/RDMA server
+ was first included in the following release, Linux 2.6.25.
+
+ In our testing, we have obtained excellent performance results (full 10Gbit
+ wire bandwidth at minimal client CPU) under many workloads. The code passes
+ the full Connectathon test suite and operates over both Infiniband and iWARP
+ RDMA adapters.
+
+OFED 1.5.1 limitations:
+~~~~~~~~~~~~~~~~~~~~~
+ NFS-RDMA is supported for the following releases:
+ - Red Hat Enterprise Linux (RHEL) version 5.2
+ - Red Hat Enterprise Linux (RHEL) version 5.3
+ - Red Hat Enterprise Linux (RHEL) version 5.4
+ - SUSE Linux Enterprise Server (SLES) version 11
+
+ And the following kernel.org kernels:
+ - 2.6.22
+ - 2.6.25
+ - 2.6.30
+
+ All other Linux distributions and kernel versions are NOT supported on OFED
+ 1.5.1.
+
+Getting Help
+~~~~~~~~~~~~
+
+ If you get stuck, you can ask questions on the
+ nfs-rdma-devel@lists.sourceforge.net or linux-rdma@vger.kernel.org
+ mailing lists.
+
+Installation
+~~~~~~~~~~~~
+
+ These instructions are a step by step guide to building a machine for
+ use with NFS/RDMA.
+
+ - Install an RDMA device
+
+ Any device supported by the drivers in drivers/infiniband/hw is acceptable.
+
+ Testing has been performed using several Mellanox-based IB cards and
+ the Chelsio cxgb3 iWARP adapter.
+
+ - Install OFED 1.5.1
+
+ NFS/RDMA has been tested on RHEL 5.2, RHEL 5.3, RHEL 5.4, SLES 11, and
+ kernels 2.6.22, 2.6.25, and 2.6.30. On these kernels, NFS/RDMA is installed
+ by default if you select "install all", and it can be specifically included
+ in a "custom" install.
+
+ In addition, the install script will install a version of the nfs-utils that
+ is required for NFS/RDMA. The binary installed will be named "mount.rnfs".
+ This version is not necessary for Linux Distributions with nfs-utils 1.1 or
+ later.
+
+ Upon successful installation, the nfs kernel modules will be placed in the
+ directory /lib/modules/`uname -r`/updates. It is recommended that you reboot
+ to ensure that the correct modules are loaded.
+
+Check RDMA and NFS Setup
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+ Before configuring the NFS/RDMA software, it is a good idea to test
+ your new kernel to ensure that the kernel is working correctly.
+ In particular, it is a good idea to verify that the RDMA stack
+ is functioning as expected and standard NFS over TCP/IP and/or UDP/IP
+ is working properly.
+
+ - Check RDMA Setup
+
+ If you built the RDMA components as modules, load them at
+ this time. For example, if you are using a Mellanox Tavor/Sinai/Arbel
+ card:
+
+ $ modprobe ib_mthca
+ $ modprobe ib_ipoib
+
+ If you are using InfiniBand, make sure there is a Subnet Manager (SM)
+ running on the network. If your IB switch has an embedded SM, you can
+ use it. Otherwise, you will need to run an SM, such as OpenSM, on one
+ of your end nodes.
+
+ If an SM is running on your network, you should see the following:
+
+ $ cat /sys/class/infiniband/driverX/ports/1/state
+ 4: ACTIVE
+
+ where driverX is mthca0, ipath5, ehca3, etc.
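+
+ To check every device and port in one pass, the state files can be read in
+ a loop. This is a minimal sketch: list_port_states is a hypothetical helper,
+ and its optional argument (an alternate root directory) exists only so that
+ the sketch can be exercised on a machine without RDMA hardware.
+

```shell
#!/bin/sh
# list_port_states [SYSFS_ROOT]
# Hypothetical helper: print "<device> port <n>: <state>" for every port found
# under SYSFS_ROOT (default: /sys/class/infiniband).
list_port_states() {
    root=${1:-/sys/class/infiniband}
    for f in "$root"/*/ports/*/state; do
        # If the glob matched nothing, $f is the literal pattern; skip it.
        [ -r "$f" ] || continue
        dev=${f#"$root"/}      # strip the root prefix
        dev=${dev%%/*}         # keep the device name, e.g. mthca0
        port=${f%/state}       # strip the trailing /state
        port=${port##*/}       # keep the port number
        printf '%s port %s: %s\n' "$dev" "$port" "$(cat "$f")"
    done
}

list_port_states
```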
+
+ To further test the InfiniBand software stack, use IPoIB (this
+ assumes you have two IB hosts named host1 and host2):
+
+ host1$ ifconfig ib0 a.b.c.x
+ host2$ ifconfig ib0 a.b.c.y
+ host1$ ping a.b.c.y
+ host2$ ping a.b.c.x
+
+ For other device types, follow the appropriate procedures.
+
+ - Check NFS Setup
+
+ For the NFS components enabled above (client and/or server),
+ test their functionality over standard Ethernet using TCP/IP or UDP/IP.
+
+NFS/RDMA Setup
+~~~~~~~~~~~~~~
+
+ We recommend that you use two machines, one to act as the client and
+ one to act as the server.
+
+ One time configuration:
+
+ - On the server system, configure the /etc/exports file and
+ start the NFS/RDMA server.
+
+ Exports entries with the following formats have been tested:
+
+ /vol0 192.168.0.47(fsid=0,rw,async,insecure,no_root_squash)
+ /vol0 192.168.0.0/255.255.255.0(fsid=0,rw,async,insecure,no_root_squash)
+
+ The IP address(es) is(are) the client's IPoIB address for an InfiniBand
+ HCA or the client's iWARP address(es) for an RNIC.
+
+ NOTE: The "insecure" option must be used because the NFS/RDMA client does
+ not use a reserved port.
+
+ Each time a machine boots:
+
+ - Load and configure the RDMA drivers
+
+ For InfiniBand using a Mellanox adapter:
+
+ $ modprobe ib_mthca
+ $ modprobe ib_ipoib
+ $ ifconfig ib0 a.b.c.d
+
+ NOTE: use unique addresses for the client and server
+
+ - Start the NFS server
+
+ Load the RDMA transport module:
+
+ $ modprobe svcrdma
+
+ Start the server:
+
+ $ /etc/init.d/nfsserver start
+
+ or
+
+ $ service nfs start
+
+ Instruct the server to listen on the RDMA transport:
+
+ $ echo rdma 20049 > /proc/fs/nfsd/portlist
+
+ NOTE for SLES10 servers: The nfs start scripts on most distros start
+ rpc.statd by default. However, the in-kernel lockd that was in SLES10 has
+ been removed in the new kernels. Since OFED back-ports the new code to
+ the older distros, there is no in-kernel lockd in SLES10, and the SLES10
+ nfsserver scripts do not know that they need to start rpc.statd. Therefore,
+ the nfsserver scripts are modified when the rnfs-utils rpm is installed so
+ that they start and stop rpc.statd.
+
+ - On the client system
+
+ Load the RDMA client module:
+
+ $ modprobe xprtrdma
+
+ Mount the NFS/RDMA server:
+
+ $ mount.rnfs <IPoIB-server-name-or-address>:/<export> /mnt -o proto=rdma,port=20049
+
+ NOTE: For kernels < 2.6.23, the "-i" flag must be passed into mount.rnfs.
+ This option allows the mount command to ignore the kernel version check. If
+ not disabled, the check will prevent passing arguments to the kernel and not
+ allow the updated version of NFS to accept the "rdma" NFS option.
+
+ To verify that the mount is using RDMA, run "cat /proc/mounts" and check
+ the "proto" field for the given mount.
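+
+ The same check can be scripted. The sketch below is an illustration only:
+ rdma_mounts is a hypothetical helper that prints the mount point of every
+ NFS mount whose options contain proto=rdma. Its optional argument (a file
+ in /proc/mounts format) exists so that the sketch can be tried on a machine
+ without an NFS/RDMA mount.
+

```shell
#!/bin/sh
# rdma_mounts [MOUNTS_FILE]
# Hypothetical helper: print the mount point of each NFS mount that uses RDMA.
# Fields in /proc/mounts: device, mount point, fs type, options, dump, pass.
rdma_mounts() {
    awk '$3 ~ /^nfs/ && $4 ~ /(^|,)proto=rdma(,|$)/ { print $2 }' \
        "${1:-/proc/mounts}"
}

# Only query the live table where /proc/mounts exists.
if [ -r /proc/mounts ]; then
    rdma_mounts
fi
```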
+
+ Congratulations! You're using NFS/RDMA!
+
+Known Issues
+~~~~~~~~~~~~
+
+If you are running NFS/RDMA over Chelsio's T3 RNIC and your clients use
+a 64KB page size (like PPC64 and IA64 systems) and your server uses a
+4KB page size (like i386 and X86_64), then you need to mount the server
+using rsize=32768,wsize=32768 to avoid overrunning the Chelsio RNIC fast
+register limits. This is a known firmware limitation in the Chelsio RNIC.
+
+Running NFS/RDMA over Mellanox's ConnectX HCA requires that the adapter firmware
+be 2.7.0 or greater on all NFS clients and servers. Firmware 2.6.0 has known
+issues that prevent the RDMA connection from being established. Firmware 2.7.0
+has resolved these issues.
+
+IPv6 support requires a portmap that supports version 4. The portmap included
+in RHEL5 and SLES10 supports only version 2. Without version 4 support, the
+following error will be logged:
+ svc: failed to register lockdv1 (errno 97).
+This error will not affect IPv4 support.
--- /dev/null
+ Open Fabrics Enterprise Distribution (OFED)
+ Open MPI in OFED 1.5.1 Copyrights, License, and Release Notes
+
+ March 2010
+
+Open MPI Copyrights
+-------------------
+Most files in this release are marked with the copyrights of the
+organizations who have edited them. The copyrights below generally
+reflect members of the Open MPI core team who have contributed code to
+this release. The copyrights for code used under license from other
+parties are included in the corresponding files.
+
+Copyright (c) 2004-2008 The Trustees of Indiana University and Indiana
+ University Research and Technology
+ Corporation. All rights reserved.
+Copyright (c) 2004-2009 The University of Tennessee and The University
+ of Tennessee Research Foundation. All rights
+ reserved.
+Copyright (c) 2004-2008 High Performance Computing Center Stuttgart,
+ University of Stuttgart. All rights reserved.
+Copyright (c) 2004-2007 The Regents of the University of California.
+ All rights reserved.
+Copyright (c) 2006-2009 Los Alamos National Security, LLC. All rights
+ reserved.
+Copyright (c) 2006-2009 Cisco Systems, Inc. All rights reserved.
+Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved.
+Copyright (c) 2006-2008 Sandia National Laboratories. All rights
+ reserved.
+Copyright (c) 2006-2009 Sun Microsystems, Inc. All rights reserved.
+ Use is subject to license terms.
+Copyright (c) 2006-2009 The University of Houston. All rights
+ reserved.
+Copyright (c) 2006-2008 Myricom, Inc. All rights reserved.
+Copyright (c) 2007-2008 UT-Battelle, LLC. All rights reserved.
+Copyright (c) 2007-2008 IBM Corporation. All rights reserved.
+Copyright (c) 1998-2005 Forschungszentrum Juelich, Juelich Supercomputing
+ Centre, Federal Republic of Germany
+Copyright (c) 2005-2008 ZIH, TU Dresden, Federal Republic of Germany
+Copyright (c) 2007 Evergrid, Inc. All rights reserved.
+Copyright (c) 2008 Institut National de Recherche en
+ Informatique. All rights reserved.
+Copyright (c) 2007 Lawrence Livermore National Security, LLC.
+ All rights reserved.
+Copyright (c) 2007-2010 Mellanox Technologies. All rights reserved.
+Copyright (c) 2006 QLogic Corporation. All rights reserved.
+
+Additional copyrights may follow
+
+Open MPI License
+----------------
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are
+met:
+
+- Redistributions of source code must retain the above copyright
+ notice, this list of conditions and the following disclaimer.
+
+- Redistributions in binary form must reproduce the above copyright
+ notice, this list of conditions and the following disclaimer listed
+ in this license in the documentation and/or other materials
+ provided with the distribution.
+
+- Neither the name of the copyright holders nor the names of its
+ contributors may be used to endorse or promote products derived from
+ this software without specific prior written permission.
+
+The copyright holders provide no reassurances that the source code
+provided does not infringe any patent, copyright, or any other
+intellectual property rights of third parties. The copyright holders
+disclaim any liability to any recipient for claims brought against
+recipient by any third party for infringement of that parties
+intellectual property rights.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+===========================================================================
+
+When submitting questions and problems, be sure to include as much
+extra information as possible. This web page details all the
+information that we request in order to provide assistance:
+
+ http://www.open-mpi.org/community/help/
+
+The best way to report bugs, send comments, or ask questions is to
+sign up on the user's and/or developer's mailing list (for user-level
+and developer-level questions; when in doubt, send to the user's
+list):
+
+ users@open-mpi.org
+ devel@open-mpi.org
+
+Because of spam, only subscribers are allowed to post to these lists
+(ensure that you subscribe with and post from exactly the same e-mail
+address -- joe@example.com is considered different than
+joe@mycomputer.example.com!). Visit these pages to subscribe to the
+lists:
+
+ http://www.open-mpi.org/mailman/listinfo.cgi/users
+ http://www.open-mpi.org/mailman/listinfo.cgi/devel
+
+Thanks for your time.
+
+===========================================================================
+
+Much, much more information is also available in the Open MPI FAQ:
+
+ http://www.open-mpi.org/faq/
+
+===========================================================================
+
+OFED-Specific Release Notes
+---------------------------
+
+** SLES 10 with Pathscale compiler support:
+
+Using the Pathscale compiler to build Open MPI on SLES10 may result in
+a non-functional Open MPI installation (every Open MPI command fails).
+If this problem occurs, try upgrading your Pathscale installation to
+the latest maintenance release, or use a different compiler to compile
+Open MPI.
+
+** Intel compiler support:
+
+Some versions of the Intel 9.1 C++ compiler suite series produce
+incorrect code when used with the Open MPI C++ bindings. Symptoms of
+this problem include crashing applications (e.g., segmentation
+violations) and Open MPI producing errors about incorrect parameters.
+Be sure to upgrade to the latest maintenance release of the Intel 9.1
+compiler to avoid these problems.
+
+** Installing newer versions of Open MPI after OFED is installed:
+
+Open MPI can be built from source after OFED is fully installed. The
+source code for Open MPI can be extracted from the SRPM shipped with
+OFED or downloaded from the main Open MPI web site:
+http://www.open-mpi.org/.
+
+To compile Open MPI from source with OFED support, first fully install
+the rest of OFED. If you used the default prefix for the OFED
+installation (/usr), Open MPI should build with OpenFabrics support by
+default. If you used a different OFED prefix, you must tell Open MPI
+what it is with the "--with-openib=<OFED_prefix>" switch to configure.
+You can verify that Open MPI installed with OpenFabrics support by
+running (the exact version numbers displayed may be different; the
+important part is that the "openib" BTL is displayed):
+
+ shell$ ompi_info | grep openib
+ MCA btl: openib (MCA v2.0, API v2.0, Component v1.3.2)
+
+See the rest of the documentation below for other configure command
+line options and installation instructions.
+
+** Changelog summary
+
+Showing versions 1.2.7 - 1.4; see the "NEWS" file in an Open MPI
+distribution for the full list.
+
+1.4.1 (OFED version)
+---
+- Update support for various OpenFabrics devices in the openib BTL's
+ .ini file.
+- Fixing RDMA CM failure during QP creation (Ticket #2307)
+
+1.4.1
+---
+- Update to PLPA v1.3.2, addressing a licensing issue identified by
+ the Fedora project. See
+ https://svn.open-mpi.org/trac/plpa/changeset/262 for details.
+- Add check for malformed checkpoint metadata files (Ticket #2141).
+- Fix error path in ompi-checkpoint when not able to checkpoint
+ (Ticket #2138).
+- Cleanup component release logic when selecting checkpoint/restart
+ enabled components (Ticket #2135).
+- Fixed VT node name detection for Cray XT platforms, and fixed some
+ broken VT documentation files.
+- Fix a possible race condition in tearing down RDMA CM-based
+ connections.
+- Relax error checking on MPI_GRAPH_CREATE. Thanks to David Singleton
+ for pointing out the issue.
+- Fix a shared memory "hang" problem that occurred on x86/x86_64
+ platforms when used with the GNU >=4.4.x compiler series.
+- Add fix for Libtool 2.2.6b's problems with the PGI 10.x compiler
+ suite. Inspired directly from the upstream Libtool patches that fix
+ the issue (but we need something working before the next Libtool
+ release).
+
+1.4
+---
+
+The *only* change in the Open MPI v1.4 release (as compared to v1.3.4)
+was to update the embedded version of Libtool's libltdl to address a
+potential security vulnerability. Specifically: Open MPI v1.3.4 was
+created with GNU Libtool 2.2.6a; Open MPI v1.4 was created with GNU
+Libtool 2.2.6b. There are no other changes between Open MPI v1.3.4
+and v1.4.
+
+
+1.3.4
+-----
+
+- Fix some issues in OMPI's SRPM with regard to shell_scripts_basename
+ and its use with mpi-selector. Thanks to Bill Johnstone for
+ pointing out the problem.
+- Added many new MPI job process affinity options to mpirun. See the
+ newly-updated mpirun(1) man page for details.
+- Several updates to mpirun's XML output.
+- Update to fix a few Valgrind warnings with regards to the ptmalloc2
+ allocator and Open MPI's use of PLPA.
+- Many updates and fixes to the (non-default) "sm" collective
+ component (i.e., native shared memory MPI collective operations).
+- Updates and fixes to some MPI_COMM_SPAWN_MULTIPLE corner cases.
+- Fix some internal copying functions in Open MPI's use of PLPA.
+- Correct some SLURM nodelist parsing logic that may have interfered
+ with large jobs. Additionally, per advice from the SLURM team,
+ change the environment variable that we use for obtaining the job's
+ allocation.
+- Revert to an older, safer (but slower) communicator ID allocation
+ algorithm.
+- Fixed minimum distance finding for OpenFabrics devices in the openib
+ BTL.
+- Relax the parameter checking MPI_CART_CREATE a bit.
+- Fix MPI_COMM_SPAWN[_MULTIPLE] to only error-check the info arguments
+ on the root process. Thanks to Federico Golfre Andreasi for
+ reporting the problem.
+- Fixed some BLCR configure issues.
+- Fixed a potential deadlock when the openib BTL was used with
+ MPI_THREAD_MULTIPLE.
+- Fixed dynamic rules selection for the "tuned" coll component.
+- Added a launch progress meter to mpirun (useful for large jobs; set
+ the orte_report_launch_progress MCA parameter to 1 to see it).
+- Reduced the number of file descriptors consumed by each MPI process.
+- Add new device IDs for Chelsio T3 RNICs to the openib BTL config file.
+- Fix some CRS self component issues.
+- Added some MCA parameters to the PSM MTL to tune its run-time
+ behavior.
+- Fix some VT issues with MPI_BOTTOM/MPI_IN_PLACE.
+- Man page updates from the Debian Open MPI package maintainers.
+- Add cycle counter support for the Alpha and Sparc platforms.
+- Pass visibility flags to libltdl's configure script, resulting in
+ those symbols being hidden. This appears to mainly solve the
+ problem of applications attempting to use different versions of
+ libltdl from that used to build Open MPI.
+
+
+1.3.3
+-----
+
+- Fix a number of issues with the openib BTL (OpenFabrics) RDMA CM,
+ including a memory corruption bug, a shutdown deadlock, and a route
+ timeout. Thanks to David McMillen and Hal Rosenstock for help in
+ tracking down the issues.
+- Change the behavior of the EXTRA_STATE parameter that is passed to
+ Fortran attribute callback functions: this value is now stored
+ internally in MPI -- it no longer references the original value
+ passed by MPI_*_CREATE_KEYVAL.
+- Allow the overriding of RFC1918 and RFC3330 for the specification of
+ "private" networks, thereby influencing Open MPI's TCP
+ "reachability" computations.
+- Improve flow control issues in the sm btl, by both tweaking the
+ shared memory progression rules and by enabling the "sync" collective
+ to barrier every 1,000th collective.
+- Various fixes for the IBM XL C/C++ v10.1 compiler.
+- Allow explicit disabling of ptmalloc2 hooks at runtime (e.g., enable
+ support for Debian's builtroot system). Thanks to Manuel Prinz and
+ the rest of the Debian crew for helping identify and fix this issue.
+- Various minor fixes for the I/O forwarding subsystem.
+- Big endian iWARP fixes in the Open Fabrics RDMA CM support.
+- Update support for various OpenFabrics devices in the openib BTL's
+ .ini file.
+- Fixed undefined symbol issue with Open MPI's parallel debugger
+ message queue support so it can be compiled by Sun Studio compilers.
+- Update MPI_SUBVERSION to 1 in the Fortran bindings.
+- Fix MPI_GRAPH_CREATE Fortran 90 binding.
+- Fix MPI_GROUP_COMPARE behavior with regards to MPI_IDENT. Thanks to
+ Geoffrey Irving for identifying the problem and supplying the fix.
+- Silence gcc 4.1 compiler warnings about type punning. Thanks to
+ Number Cruncher for the fix.
+- Added more Valgrind and other memory-cleanup fixes. Thanks to
+ various Open MPI users for help with these issues.
+- Miscellaneous VampirTrace fixes.
+- More fixes for openib credits in heavy-congestion scenarios.
+- Slightly decrease the latency in the openib BTL in some conditions
+ (add "send immediate" support to the openib BTL).
+- Ensure that MPI_REQUEST_GET_STATUS accepts an
+ MPI_STATUS_IGNORE parameter. Thanks to Shaun Jackman for the bug
+ report.
+- Added Microsoft Windows support. See README.WINDOWS file for
+ details.
+
+
+1.3.2
+-----
+
+- Fixed a potential infinite loop in the openib BTL that could occur
+ in senders in some frequent-communication scenarios. Thanks to Don
+ Wood for reporting the problem.
+- Add a new checksum PML variation on ob1 (main MPI point-to-point
+ communication engine) to detect memory corruption in node-to-node
+ messages.
+- Add a new configuration option to add padding to the openib
+ header so the data is aligned.
+- Add a new configuration option to use an alternative checksum
+ algorithm when using the checksum PML.
+- Fixed a problem reported by multiple users on the mailing list that
+ the LSF support would fail to find the appropriate libraries at
+ run-time.
+- Allow empty shell designations from getpwuid(). Thanks to Sergey
+ Koposov for the bug report.
+- Ensure that mpirun exits with non-zero status when applications die
+ due to user signal. Thanks to Geoffroy Pignot for suggesting the
+ fix.
+- Ensure that MPI_VERSION / MPI_SUBVERSION match what is returned by
+ MPI_GET_VERSION. Thanks to Rob Egan for reporting the error.
+- Updated MPI_*KEYVAL_CREATE functions to properly handle Fortran
+ extra state.
+- A variety of ob1 (main MPI point-to-point communication engine) bug
+ fixes that could have caused hangs or seg faults.
+- Do not install Open MPI's signal handlers in MPI_INIT if there are
+ already signal handlers installed. Thanks to Kees Verstoep for
+ bringing the issue to our attention.
+- Fix GM support to not seg fault in MPI_INIT.
+- Various VampirTrace fixes.
+- Various PLPA fixes.
+- No longer create BTLs for invalid (TCP) devices.
+- Various man page style and lint cleanups.
+- Fix critical OpenFabrics-related bug noted here:
+ http://www.open-mpi.org/community/lists/announce/2009/03/0029.php.
+ Open MPI now uses a much more robust memory intercept scheme that is
+ quite similar to what is used by MX. The use of "-lopenmpi-malloc"
+ is no longer necessary, is deprecated, and is expected to disappear
+ in a future release. -lopenmpi-malloc will continue to work for the
+ duration of the Open MPI v1.3 and v1.4 series.
+- Fix some OpenFabrics shutdown errors, both regarding iWARP and SRQ.
+- Allow the udapl BTL to work on Solaris platforms that support
+ relaxed PCI ordering.
+- Fix problem where the mpirun would sometimes use rsh/ssh to launch on
+ the localhost (instead of simply forking).
+- Minor SLURM stdin fixes.
+- Fix to run properly under SGE jobs.
+- Scalability and latency improvements for shared memory jobs: convert
+ to using one message queue instead of N queues.
+- Automatically size the shared-memory area (mmap file) to match
+ better what is needed; specifically, so that large-np jobs will start.
+- Use fixed-length MPI predefined handles in order to provide ABI
+ compatibility between Open MPI releases.
+- Fix building of the posix paffinity component to properly get the
+ number of processors in loosely tested environments (e.g.,
+ FreeBSD). Thanks to Steve Kargl for reporting the issue.
+- Fix --with-libnuma handling in configure. Thanks to Gus Correa for
+ reporting the problem.
+
+
+1.3.1
+-----
+
+- Added "sync" coll component to allow users to synchronize every N
+ collective operations on a given communicator.
+- Increased the default values of the IB and RNR timeout MCA parameters.
+- Fix a compiler error noted by Mostyn Lewis with the PGI 8.0 compiler.
+- Fix an error that prevented stdin from being forwarded if the
+ rsh launcher was in use. Thanks to Branden Moore for pointing out
+ the problem.
+- Correct a case where the added datatype is considered as contiguous but
+ has gaps in the beginning.
+- Fix an error that limited the number of comm_spawns that could
+ simultaneously be running in some environments.
+- Correct a corner case in OB1's GET protocol for long messages; the
+ error could sometimes cause MPI jobs using the openib BTL to hang.
+- Fix a bunch of bugs in the IO forwarding (IOF) subsystem and add some
+ new options to output to files and redirect output to xterm. Thanks to
+ Jody Weissmann for helping test out many of the new fixes and
+ features.
+- Fix SLURM race condition.
+- Fix MPI_File_c2f(MPI_FILE_NULL) to return 0, not -1. Thanks to
+ Lisandro Dalcin for the bug report.
+- Fix the DSO build of tm PLM.
+- Various fixes for size disparity between C int's and Fortran
+ INTEGER's. Thanks to Christoph van Wullen for the bug report.
+- Ensure that mpirun exits with a non-zero exit status when daemons or
+ processes abort or fail to launch.
+- Various fixes to work around Intel (NetEffect) RNIC behavior.
+- Various fixes for mpirun's --preload-files and --preload-binary
+ options.
+- Fix the string name in MPI::ERRORS_THROW_EXCEPTIONS.
+- Add ability to forward SIGTSTP and SIGCONT to MPI processes if you
+ set the MCA parameter orte_forward_job_control to 1.
+- Allow the sm BTL to allocate larger amounts of shared memory if
+ desired (helpful for very large multi-core boxen).
+- Fix a few places where we used PATH_MAX instead of OMPI_PATH_MAX,
+ leading to compile problems on some platforms. Thanks to Andrea Iob
+ for the bug report.
+- Fix mca_btl_openib_warn_no_device_params_found MCA parameter; it
+ was accidentally being ignored.
+- Fix some run-time issues with the sctp BTL.
+- Ensure that RTLD_NEXT exists before trying to use it (e.g., it
+ doesn't exist on Cygwin). Thanks to Gustavo Seabra for reporting
+ the issue.
+- Various fixes to VampirTrace, including fixing compile errors on
+ some platforms.
+- Fixed missing MPI_Comm_accept.3 man page; fixed minor issue in
+ orterun.1 man page. Thanks to Dirk Eddelbuettel for identifying the
+ problem and submitting a patch.
+- Implement the XML formatted output of stdout/stderr/stddiag.
+- Fixed mpirun's -wdir switch to ensure that working directories for
+ multiple app contexts are properly handled. Thanks to Geoffroy
+ Pignot for reporting the problem.
+- Improvements to the MPI C++ integer constants:
+ - Allow MPI::SEEK_* constants to be used as constants
+ - Allow other MPI C++ constants to be used as array sizes
+- Fix minor problem with orte-restart's command line options. See
+ ticket #1761 for details. Thanks to Gregor Dschung for reporting
+ the problem.
+
+1.3
+---
+
+- Extended the OS X 10.5.x (Leopard) workaround for a problem when
+ assembly code is compiled with -g[0-9]. Thanks to Barry Smith for
+ reporting the problem. See ticket #1701.
+- Disabled MPI_REAL16 and MPI_COMPLEX32 support on platforms where the
+ bit representation of REAL*16 is different than that of the C type
+ of the same size (usually long double). Thanks to Julien Devriendt
+ for reporting the issue. See ticket #1603.
+- Increased the size of MPI_MAX_PORT_NAME to 1024 from 36. See ticket #1533.
+- Added "notify debugger on abort" feature. See tickets #1509 and #1510.
+ Thanks to Seppo Sahrakropi for the bug report.
+- Upgraded Open MPI tarballs to use Autoconf 2.63, Automake 1.10.1,
+ Libtool 2.2.6a.
+- Added missing MPI::Comm::Call_errhandler() function. Thanks to Dave
+ Goodell for bringing this to our attention.
+- Increased MPI_SUBVERSION value in mpi.h to 1 (i.e., MPI 2.1).
+- Changed behavior of MPI_GRAPH_CREATE, MPI_TOPO_CREATE, and several
+ other topology functions per MPI-2.1.
+- Fix the type of the C++ constant MPI::IN_PLACE.
+- Various enhancements to the openib BTL:
+ - Added btl_openib_if_[in|ex]clude MCA parameters for
+ including/excluding comma-delimited lists of HCAs and ports.
+ - Added RDMA CM support, including btl_openib_cpc_[in|ex]clude MCA
+ parameters
+ - Added NUMA support to only use "near" network adapters
+ - Added "Bucket SRQ" (BSRQ) support to better utilize registered
+ memory, including btl_openib_receive_queues MCA parameter
+ - Added ConnectX XRC support (and integrated with BSRQ)
+ - Added btl_openib_ib_max_inline_data MCA parameter
+ - Added iWARP support
+ - Revamped flow control mechanisms to be more efficient
+ - "mpi_leave_pinned=1" is now the default when possible,
+ automatically improving performance for large messages when
+ application buffers are re-used
+- Eliminated duplicated error messages when multiple MPI processes fail
+ with the same error.
+- Added NUMA support to the shared memory BTL.
+- Add Valgrind-based memory checking for MPI-semantic checks.
+- Add support for some optional Fortran datatypes (MPI_LOGICAL1,
+ MPI_LOGICAL2, MPI_LOGICAL4 and MPI_LOGICAL8).
+- Remove the use of the STL from the C++ bindings.
+- Added support for Platform/LSF job launchers. Must be Platform LSF
+ v7.0.2 or later.
+- Updated ROMIO with the version from MPICH2 1.0.7.
+- Added RDMA capable one-sided component (called rdma), which
+ can be used with BTL components that expose a full one-sided
+ interface.
+- Added the optional datatype MPI_REAL2. As this is added to the "end of"
+ predefined datatypes in the fortran header files, there will not be
+ any compatibility issues.
+- Added Portable Linux Processor Affinity (PLPA) for Linux.
+- Added finer control over exported symbols via the visibility feature
+ offered by some compilers.
+- Added checkpoint/restart process fault tolerance support. Initially
+ support a LAM/MPI-like protocol.
+- Removed "mvapi" BTL; all InfiniBand support now uses the OpenFabrics
+ driver stacks ("openib" BTL).
+- Added more stringent MPI API parameter checking to help user-level
+ debugging.
+- The ptmalloc2 memory manager component is now by default built as
+ a standalone library named libopenmpi-malloc. Users wanting to
+ use leave_pinned with ptmalloc2 will now need to link the library
+ into their application explicitly. All other users will use the
+ libc-provided allocator instead of Open MPI's ptmalloc2. This change
+ may be overridden with the configure option --enable-ptmalloc2-internal.
+- The leave_pinned options will now default to using mallopt on
+ Linux in the cases where ptmalloc2 was not linked in. mallopt
+ will also only be available if munmap can be intercepted (the
+ default whenever Open MPI is not compiled with
+ --without-memory-manager).
+- Open MPI will now complain and refuse to use leave_pinned if
+ no memory intercept / mallopt option is available.
+- Add option of using Perl-based wrapper compilers instead of the
+ C-based wrapper compilers. The Perl-based version does not
+ have the features of the C-based version, but does work better
+ in cross-compile environments.
+
+
+1.2.9
+-----
+
+- Fix a segfault when using one-sided communications on some forms of derived
+ datatypes. Thanks to Dorian Krause for reporting the bug. See #1715.
+- Fix an alignment problem affecting one-sided communications on
+ some architectures (e.g., SPARC64). See #1738.
+- Fix compilation on Solaris when thread support is enabled in Open MPI
+ (e.g., when using --with-threads). See #1736.
+- Correctly take into account the MTU that an OpenFabrics device port
+ is using. See #1722 and
+ https://bugs.openfabrics.org/show_bug.cgi?id=1369.
+- Fix two datatype engine bugs. See #1677.
+ Thanks to Peter Kjellstrom for the bugreport.
+- Fix the bml r2 help filename so the help message can be found. See #1623.
+- Fix a compilation problem on RHEL4U3 with the PGI 32 bit compiler
+ caused by <infiniband/driver.h>. See ticket #1613.
+- Fix the --enable-cxx-exceptions configure option. See ticket #1607.
+- Properly handle when the MX BTL cannot open an endpoint. See ticket #1621.
+- Fix a double free of events on the tcp_events list. See ticket #1631.
+- Fix a buffer overrun in opal_free_list_grow (called by MPI_Init).
+ Thanks to Patrick Farrell for the bugreport and Stephan Kramer for
+ the bugfix. See ticket #1583.
+- Fix a problem setting OPAL_PREFIX for remote sh-based shells.
+ See ticket #1580.
+
+
+1.2.8
+-----
+
+- Tweaked one memory barrier in the openib component to be more conservative.
+ May fix a problem observed on PPC machines. See ticket #1532.
+- Fix OpenFabrics IB partition support. See ticket #1557.
+- Restore v1.1 feature that sourced .profile on remote nodes if the default
+ shell will not do so (e.g. /bin/sh and /bin/ksh). See ticket #1560.
+- Fix segfault in MPI_Init_thread() if ompi_mpi_init() fails. See ticket #1562.
+- Adjust SLURM support to first look for $SLURM_JOB_CPUS_PER_NODE instead of
+ the deprecated $SLURM_TASKS_PER_NODE environment variable. This change
+ may be *required* when using SLURM v1.2 and above. See ticket #1536.
+- Fix the MPIR_Proctable to be in process rank order. See ticket #1529.
+- Fix a regression introduced in 1.2.6 for the IBM eHCA. See ticket #1526.
+
+
+1.2.7
+-----
+
+- Add some Sun HCA vendor IDs. See ticket #1461.
+- Fixed a memory leak in MPI_Alltoallw when called from Fortran.
+ Thanks to Dave Grote for the bugreport. See ticket #1457.
+- Only link in libutil when it is needed/desired. Thanks to
+ Brian Barret for diagnosing and fixing the problem. See ticket #1455.
+- Update some QLogic HCA vendor IDs. See ticket #1453.
+- Fix F90 binding for MPI_CART_GET. Thanks to Scott Beardsley for
+ bringing it to our attention. See ticket #1429.
+- Remove a spurious warning message generated in/by ROMIO. See ticket #1421.
+- Fix a bug where command-line MCA parameters were not overriding
+ MCA parameters set from environment variables. See ticket #1380.
+- Fix a bug in the AMD64 atomics assembly. Thanks to Gabriele Fatigati
+ for the bug report and bugfix. See ticket #1351.
+- Fix a gather and scatter bug on intercommunicators when the datatype
+ being moved is 0 bytes. See ticket #1331.
+- Some more man page fixes from the Debian maintainers.
+ See tickets #1324 and #1329.
+- Have openib BTL (OpenFabrics support) check for the presence of
+ /sys/class/infiniband before allowing itself to be used. This check
+ prevents spurious "OMPI did not find RDMA hardware!" notices on
+ systems that have the software drivers installed, but no
+ corresponding hardware. See tickets #1321 and #1305.
+- Added vendor IDs for some ConnectX openib HCAs. See ticket #1311.
+- Fix some RPM specfile inconsistencies. See ticket #1308.
+ Thanks to Jim Kusznir for noticing the problem.
+- Removed an unused function prototype that caused warnings on
+ some systems (e.g., OS X). See ticket #1274.
+- Fix a deadlock in inter-communicator scatter/gather operations.
+ Thanks to Martin Audet for the bug report. See ticket #1268.
+
+===========================================================================
+
+Much, much more information is also available in the Open MPI FAQ:
+
+ http://www.open-mpi.org/faq/
+
+===========================================================================
+
+General Release Notes
+---------------------
+
+Detailed Open MPI v1.3 Feature List:
+
+ o Open MPI RunTime Environment (ORTE) improvements
+ - General robustness improvements
+ - Scalable job launch (we've seen ~16K processes in less than a
+ minute in a highly-optimized configuration)
+ - New process mappers
+ - Support for Platform/LSF environments (v7.0.2 and later)
+ - More flexible processing of host lists
+ - New mpirun command-line options and associated functionality
+
+ o Fault-Tolerance Features
+ - Asynchronous, transparent checkpoint/restart support
+ - Fully coordinated checkpoint/restart coordination component
+ - Support for the following checkpoint/restart services:
+ - blcr: Berkeley Lab's Checkpoint/Restart
+ - self: Application level callbacks
+ - Support for the following interconnects:
+ - tcp
+ - mx
+ - openib
+ - sm
+ - self
+ - Improved Message Logging
+
+ o MPI_THREAD_MULTIPLE support for point-to-point messaging in the
+ following BTLs (note that only MPI point-to-point messaging API
+ functions support MPI_THREAD_MULTIPLE; other API functions likely
+ do not):
+ - tcp
+ - sm
+ - mx
+ - elan
+ - self
+
+ o Point-to-point Messaging Layer (PML) improvements
+ - Memory footprint reduction
+ - Improved latency
+ - Improved algorithm for multiple communication device
+ ("multi-rail") support
+
+ o Numerous Open Fabrics improvements/enhancements
+ - Added iWARP support (including RDMA CM)
+ - Memory footprint and performance improvements
+ - "Bucket" SRQ support for better registered memory utilization
+ - XRC/ConnectX support
+ - Message coalescing
+ - Improved error report mechanism with Asynchronous events
+ - Automatic Path Migration (APM)
+ - Improved processor/port binding
+ - Infrastructure for additional wireup strategies
+ - mpi_leave_pinned is now enabled by default
+
+ o uDAPL BTL enhancements
+ - Multi-rail support
+ - Subnet checking
+ - Interface include/exclude capabilities
+
+ o Processor affinity
+ - Linux processor affinity improvements
+ - Core/socket <--> process mappings
+
+ o Collectives
+ - Performance improvements
+ - Support for hierarchical collectives (must be activated
+ manually; see below)
+
+ o Miscellaneous
+ - MPI 2.1 compliant
+ - Sparse process groups and communicators
+ - Support for Cray Compute Node Linux (CNL)
+ - One-sided RDMA component (BTL-level based rather than PML-level
+ based)
+ - Aggregate MCA parameter sets
+ - MPI handle debugging
+ - Many small improvements to the MPI C++ bindings
+ - Valgrind support
+ - VampirTrace support
+ - Updated ROMIO to the version from MPICH2 1.0.7
+ - Removed the mVAPI IB stacks
+ - Display most error messages only once (vs. once for each
+ process)
+ - Many other small improvements and bug fixes, too numerous to
+ list here
+
+Known issues
+------------
+
+ o There is a segfault that sometimes occurs on one of our x86_64 test
+ clusters when using MPI onesided communications over Myrinet MX.
+ Since no one else has reported this problem we are not holding
+ up the 1.3 release. See ticket #1757 for the details, and any
+ possible workarounds.
+
+ o XGrid support is currently broken.
+ https://svn.open-mpi.org/trac/ompi/ticket/1777
+
+ o MPI_REDUCE_SCATTER does not work with counts of 0.
+ https://svn.open-mpi.org/trac/ompi/ticket/1559
+
+ o Please also see the Open MPI bug tracker for bugs beyond this release.
+ https://svn.open-mpi.org/trac/ompi/report
+
+===========================================================================
+
+The following abbreviated list of release notes applies to this code
+base as of this writing (10 July 2009):
+
+General notes
+-------------
+
+- Open MPI includes support for a wide variety of supplemental
+ hardware and software packages. When configuring Open MPI, you may
+ need to supply additional flags to the "configure" script in order
+ to tell Open MPI where the header files, libraries, and any other
+ required files are located. As such, running "configure" by itself
+ may not include support for all the devices (etc.) that you expect,
+ especially if their support headers / libraries are installed in
+ non-standard locations. Network interconnects are an easy example
+ to discuss -- Myrinet and OpenFabrics networks, for example, both
+ have supplemental headers and libraries that must be found before
+ Open MPI can build support for them. You must specify where these
+ files are with the appropriate options to configure. See the
+ listing of configure command-line switches, below, for more details.
+
+- The majority of Open MPI's documentation is here in this file, the
+ included man pages, and on the web site FAQ
+ (http://www.open-mpi.org/). This will eventually be supplemented
+ with cohesive installation and user documentation files.
+
+- Note that Open MPI documentation uses the word "component"
+ frequently; the word "plugin" is probably more familiar to most
+ users. As such, end users can probably completely substitute the
+ word "plugin" wherever you see "component" in our documentation.
+ For what it's worth, we use the word "component" for historical
+ reasons, mainly because it is part of our acronyms and internal API
+ function calls.
+
+- The run-time systems that are currently supported are:
+ - rsh / ssh
+ - LoadLeveler
+ - PBS Pro, Open PBS, Torque
+ - Platform LSF (v7.0.2 and later)
+ - SLURM
+ - XGrid (known to be broken in 1.3 through 1.3.2)
+ - Cray XT-3 and XT-4
+ - Sun Grid Engine (SGE) 6.1, 6.2 and open source Grid Engine
+ - Microsoft Windows CCP (Microsoft Windows server 2003 and 2008)
+
+- Systems that have been tested are:
+ - Linux (various flavors/distros), 32 bit, with gcc and Sun Studio 12
+ - Linux (various flavors/distros), 64 bit (x86), with gcc, Absoft,
+ Intel, Portland, Pathscale, and Sun Studio 12 compilers (*)
+ - OS X (10.4), 32 and 64 bit (i386, PPC, PPC64, x86_64), with gcc
+ and Absoft compilers (*)
+ - Solaris 10 update 2, 3 and 4, 32 and 64 bit (SPARC, i386, x86_64),
+ with Sun Studio 10, 11 and 12
+
+ (*) Be sure to read the Compiler Notes, below.
+
+- Other systems have been lightly (but not fully) tested:
+ - Other 64 bit platforms (e.g., Linux on PPC64)
+ - Microsoft Windows CCP (Microsoft Windows server 2003 and 2008);
+ more testing and support is expected later in the Open MPI v1.3.x
+ series. See the README.WINDOWS file.
+
+Compiler Notes
+--------------
+
+- Mixing compilers from different vendors when building Open MPI
+ (e.g., using the C/C++ compiler from one vendor and the F77/F90
+ compiler from a different vendor) has been successfully employed by
+ some Open MPI users (discussed on the Open MPI user's mailing list),
+ but such configurations are not tested and not documented. For
+ example, such configurations may require additional compiler /
+ linker flags to make Open MPI build properly.
+
+- Open MPI does not support the Sparc v8 CPU target, which is the
+ default on Sun Solaris. The v8plus (32 bit) or v9 (64 bit)
+ targets must be used to build Open MPI on Solaris. This can be
+ done by including a flag in CFLAGS, CXXFLAGS, FFLAGS, and FCFLAGS,
+ -xarch=v8plus for the Sun compilers, -mv8plus for GCC.
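+
+ As a rough sketch of the flag selection described above (the helper
+ name sparc_arch_flag is hypothetical, and the configure command is
+ only printed here, not run):
+
```shell
# Hypothetical helper: map a compiler family to its 32-bit v8plus flag.
sparc_arch_flag() {
    case $1 in
        sun) printf '%s\n' "-xarch=v8plus" ;;   # Sun compilers
        gcc) printf '%s\n' "-mv8plus" ;;        # GCC
        *)   echo "unknown compiler family: $1" >&2; return 1 ;;
    esac
}

flag=$(sparc_arch_flag gcc)
# Print the configure invocation; the same flag goes into all four variables.
echo ./configure CFLAGS="$flag" CXXFLAGS="$flag" FFLAGS="$flag" FCFLAGS="$flag"
```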
+
+- At least some versions of the Intel 8.1 compiler seg fault while
+ compiling certain Open MPI source code files. As such, it is not
+ supported.
+
+- The Intel 9.0 v20051201 compiler on IA64 platforms seems to have a
+ problem with optimizing the ptmalloc2 memory manager component (the
+ generated code will segv). As such, the ptmalloc2 component will
+ automatically disable itself if it detects that it is on this
+ platform/compiler combination. The only effect that this should
+ have is that the MCA parameter mpi_leave_pinned will be inoperative.
+
+- Early versions of the Portland Group 6.0 compiler have problems
+ creating the C++ MPI bindings as a shared library (e.g., v6.0-1).
+ Tests with later versions show that this has been fixed (e.g.,
+ v6.0-5).
+
+- The Portland Group compilers prior to version 7.0 require the
+ "-Msignextend" compiler flag to extend the sign bit when converting
+ from a shorter to longer integer. This is different than other
+ compilers (such as GNU). When compiling Open MPI with the Portland
+ compiler suite, the following flags should be passed to Open MPI's
+ configure script:
+
+ shell$ ./configure CFLAGS=-Msignextend CXXFLAGS=-Msignextend \
+ --with-wrapper-cflags=-Msignextend \
+ --with-wrapper-cxxflags=-Msignextend ...
+
+ This will both compile Open MPI with the proper compile flags and
+ also automatically add "-Msignextend" when the C and C++ MPI wrapper
+ compilers are used to compile user MPI applications.
+
+- Using the MPI C++ bindings with the Pathscale compiler is known
+ to fail, possibly due to Pathscale compiler issues.
+
+- Using the Absoft compiler to build the MPI Fortran bindings on Suse
+ 9.3 is known to fail due to a Libtool compatibility issue.
+
+- Open MPI will build bindings suitable for all common forms of
+ Fortran 77 compiler symbol mangling on platforms that support it
+ (e.g., Linux). On platforms that do not support weak symbols (e.g.,
+ OS X), Open MPI will build Fortran 77 bindings just for the compiler
+ that Open MPI was configured with.
+
+ Hence, on platforms that support it, if you configure Open MPI with
+ a Fortran 77 compiler that uses one symbol mangling scheme, you can
+ successfully compile and link MPI Fortran 77 applications with a
+ Fortran 77 compiler that uses a different symbol mangling scheme.
+
+ NOTE: For platforms that support the multi-Fortran-compiler bindings
+ (i.e., weak symbols are supported), due to limitations in the MPI
+ standard and in Fortran compilers, it is not possible to hide these
+ differences in all cases. Specifically, the following two cases may
+ not be portable between different Fortran compilers:
+
+ 1. The C constants MPI_F_STATUS_IGNORE and MPI_F_STATUSES_IGNORE
+ will only compare properly to Fortran applications that were
+ created with Fortran compilers that use the same
+ name-mangling scheme as the Fortran compiler that Open MPI was
+ configured with.
+
+ 2. Fortran compilers may have different values for the logical
+ .TRUE. constant. As such, any MPI function that uses the Fortran
+ LOGICAL type may only get .TRUE. values back that correspond to
+ the .TRUE. value of the Fortran compiler that Open MPI was
+ configured with. Note that some Fortran compilers allow forcing
+ .TRUE. to be 1 and .FALSE. to be 0. For example, the Portland
+ Group compilers provide the "-Munixlogical" option, and Intel
+ compilers (version >= 8.) provide the "-fpscomp logicals" option.
+
+ You can use the ompi_info command to see the Fortran compiler that
+ Open MPI was configured with.
+
+- The Fortran 90 MPI bindings can now be built in one of three sizes
+ using --with-mpi-f90-size=SIZE (see description below). These sizes
+ reflect the number of MPI functions included in the "mpi" Fortran 90
+ module and therefore which functions will be subject to strict type
+ checking. All functions not included in the Fortran 90 module can
+ still be invoked from F90 applications, but will fall back to
+ Fortran-77 style checking (i.e., little/none).
+
+ - trivial: Only includes F90-specific functions from MPI-2. This
+ means overloaded versions of MPI_SIZEOF for all the MPI-supported
+ F90 intrinsic types.
+
+ - small (default): All the functions in "trivial" plus all MPI
+ functions that take no choice buffers (meaning buffers that are
+ specified by the user and are of type (void*) in the C bindings --
+ generally buffers specified for message passing). Hence,
+ functions like MPI_COMM_RANK are included, but functions like
+ MPI_SEND are not.
+
+ - medium: All the functions in "small" plus all MPI functions that
+ take one choice buffer (e.g., MPI_SEND, MPI_RECV, ...). All
+ one-choice-buffer functions have overloaded variants for each of
+ the MPI-supported Fortran intrinsic types up to the number of
+ dimensions specified by --with-f90-max-array-dim (default value is
+ 4).
+
+ Increasing the size of the F90 module (in order: trivial, small,
+ medium) will generally increase the length of time required to
+ compile user MPI applications. Specifically, "trivial"- and
+ "small"-sized F90 modules generally allow user MPI applications to
+ be compiled fairly quickly but lose type safety for all MPI
+ functions with choice buffers. "medium"-sized F90 modules generally
+ take longer to compile user applications but provide greater type
+ safety for MPI functions.
+
+ Note that MPI functions with two choice buffers (e.g., MPI_GATHER)
+ are not currently included in Open MPI's F90 interface. Calls to
+ these functions will automatically fall through to Open MPI's F77
+ interface. A "large" size that includes the two choice buffer MPI
+ functions is possible in future versions of Open MPI.
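+
+ The size choice can be sketched as a small shell helper that checks
+ the requested size and prints the matching configure options. The
+ helper name f90_configure_args is hypothetical; the defaults mirror
+ the ones stated above ("small" and 4 dimensions), and nothing is
+ actually configured:
+
```shell
# Hypothetical helper: validate the F90 module size and emit configure options.
f90_configure_args() {
    size=${1:-small}    # documented default size
    dims=${2:-4}        # documented default for --with-f90-max-array-dim
    case $size in
        trivial|small|medium) ;;
        *) echo "unsupported F90 size: $size" >&2; return 1 ;;
    esac
    printf '%s\n' "--with-mpi-f90-size=$size --with-f90-max-array-dim=$dims"
}

# Print (do not run) a configure line for the most type-safe size available.
echo ./configure $(f90_configure_args medium 4)
```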
+
+
+General Run-Time Support Notes
+------------------------------
+
+- The Open MPI installation must be in your PATH on all nodes (and
+ potentially LD_LIBRARY_PATH, if libmpi is a shared library), unless
+ using the --prefix or --enable-mpirun-prefix-by-default
+ functionality (see below).
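+
+ A minimal sketch of the environment setup, assuming a hypothetical
+ install prefix of /opt/openmpi (substitute whatever was passed to
+ --prefix); these lines would typically go in a shell startup file
+ that is read on every node:
+
```shell
# Hypothetical prefix; override by setting OMPI_PREFIX beforehand.
OMPI_PREFIX=${OMPI_PREFIX:-/opt/openmpi}

# Executables live in <prefix>/bin, libraries in <prefix>/lib.
PATH=$OMPI_PREFIX/bin:$PATH
# Append the old value only if it was non-empty, to avoid a trailing ":".
LD_LIBRARY_PATH=$OMPI_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
export PATH LD_LIBRARY_PATH
```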
+
+- LAM/MPI-like mpirun notation of "C" and "N" is not yet supported.
+
+- The XGrid support is experimental - see the Open MPI FAQ and this
+ post on the Open MPI user's mailing list for more information:
+
+ http://www.open-mpi.org/community/lists/users/2006/01/0539.php
+
+- Open MPI's run-time behavior can be customized via MCA ("MPI
+ Component Architecture") parameters (see below for more information
+ on how to get/set MCA parameter values). Some MCA parameters can be
+ set in a way that renders Open MPI inoperable (see notes about MCA
+ parameters later in this file). In particular, some parameters have
+ required options that must be included.
+
+ - If specified, the "btl" parameter must include the "self"
+ component, or Open MPI will not be able to deliver messages to the
+ same rank as the sender. For example: "mpirun --mca btl tcp,self
+ ..."
+ - If specified, the "btl_tcp_if_exclude" parameter must include the
+ loopback device ("lo" on many Linux platforms), or Open MPI will
+ not be able to route MPI messages using the TCP BTL. For example:
+ "mpirun --mca btl_tcp_if_exclude lo,eth1 ..."
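+
+ One way to avoid forgetting a required entry is to build the
+ comma-delimited lists with a small helper. This is only a sketch:
+ ensure_member and ./a.out are hypothetical names, "lo" is assumed
+ to be the loopback device, and the mpirun command is printed rather
+ than executed:
+
```shell
# Hypothetical helper: print LIST with ITEM appended unless already present.
ensure_member() {
    list=$1
    item=$2
    case ",$list," in
        *",$item,"*) printf '%s\n' "$list" ;;        # already there
        ",,")        printf '%s\n' "$item" ;;        # empty list
        *)           printf '%s\n' "$list,$item" ;;  # append
    esac
}

btl=$(ensure_member "tcp" self)    # "self" must always be present
excl=$(ensure_member "eth1" lo)    # the loopback device must be excluded
echo mpirun --mca btl "$btl" --mca btl_tcp_if_exclude "$excl" ./a.out
```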
+
+- Running on nodes with different endian and/or different datatype
+ sizes within a single parallel job is supported in this release.
+ However, Open MPI does not resize data when datatypes differ in size
+ (for example, sending a 4 byte MPI_DOUBLE and receiving an 8 byte
+ MPI_DOUBLE will fail).
+
+
+MPI Functionality and Features
+------------------------------
+
+- All MPI-2.1 functionality is supported.
+
+- MPI_THREAD_MULTIPLE support is included, but is only lightly tested.
+ It likely does not work for thread-intensive applications. Note
+ that *only* the MPI point-to-point communication functions for the
+ BTL's listed above are considered thread safe. Other support
+ functions (e.g., MPI attributes) have not been certified as safe
+ when simultaneously used by multiple threads.
+
+ Note that Open MPI's thread support is in a fairly early stage; the
+ above devices are likely to *work*, but the latency is likely to be
+ fairly high. Specifically, efforts so far have concentrated on
+ *correctness*, not *performance* (yet).
+
+- MPI_REAL16 and MPI_COMPLEX32 are only supported on platforms where a
+ portable C datatype can be found that matches the Fortran type
+ REAL*16, both in size and bit representation.
+
+- Asynchronous message passing progress using threads can be turned on
+ with the --enable-progress-threads option to configure.
+ Asynchronous message passing progress is only supported with devices
+ that support MPI_THREAD_MULTIPLE, and is only very lightly tested
+ (and may not provide very much performance benefit).
+
+
+Collectives
+-----------
+
+- The "hierarch" coll component (i.e., an implementation of MPI
+ collective operations) attempts to discover network layers of
+ latency in order to segregate individual "local" and "global"
+ operations as part of the overall collective operation. In this
+ way, network traffic can be reduced -- or possibly even minimized
+ (similar to MagPIe). The current "hierarch" component only
+ separates MPI processes into on- and off-node groups.
+
+ Hierarch has had sufficient correctness testing, but has not
+ received much performance tuning. As such, hierarch is not
+ activated by default -- it must be enabled manually by setting its
+ priority level to 100:
+
+ mpirun --mca coll_hierarch_priority 100 ...
+
+ We would appreciate feedback from the user community about how well
+ hierarch works for your applications.
+
+
+Network Support
+---------------
+
+- The OpenFabrics Enterprise Distribution (OFED) software package v1.0
+ will not work properly with Open MPI v1.2 (and later) due to how its
+ Mellanox InfiniBand plugin driver is created. The problem is fixed
+ in OFED v1.1 (and later).
+
+- Older mVAPI-based InfiniBand drivers (Mellanox VAPI) are no longer
+ supported. Please use an older version of Open MPI (1.2 series or
+ earlier) if you need mVAPI support.
+
+- The use of fork() with the openib BTL is only partially supported,
+ and only on Linux kernels >= v2.6.15 with libibverbs v1.1 or later
+ (first released as part of OFED v1.2), per restrictions imposed by
+ the OFED network stack.
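+
+ The kernel half of this requirement can be checked with a short
+ script. The version_ge helper below is a hypothetical sketch that
+ compares only the first three numeric fields of a dotted version;
+ it does not check the libibverbs version:
+
```shell
# Hypothetical helper: succeed if version HAVE >= version WANT.
version_ge() {
    have=${1%%[!0-9.]*}    # strip any suffix such as "-91-generic"
    want=$2
    old_ifs=$IFS
    IFS=.
    set -- $have
    h1=${1:-0}; h2=${2:-0}; h3=${3:-0}
    set -- $want
    w1=${1:-0}; w2=${2:-0}; w3=${3:-0}
    IFS=$old_ifs
    [ "$h1" -gt "$w1" ] && return 0
    [ "$h1" -lt "$w1" ] && return 1
    [ "$h2" -gt "$w2" ] && return 0
    [ "$h2" -lt "$w2" ] && return 1
    [ "$h3" -ge "$w3" ]
}

if version_ge "$(uname -r)" 2.6.15; then
    echo "kernel is new enough for fork() with the openib BTL"
else
    echo "kernel is older than 2.6.15; fork() with openib is unsupported"
fi
```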
+
+- There are two MPI network models available: "ob1" and "cm". "ob1"
+ uses BTL ("Byte Transfer Layer") components for each supported
+ network. "cm" uses MTL ("Matching Transport Layer") components for
+ each supported network.
+
+ - "ob1" supports a variety of networks that can be used in
+ combination with each other (per OS constraints; e.g., there are
+ reports that the GM and OpenFabrics kernel drivers do not operate
+ well together):
+ - OpenFabrics: InfiniBand and iWARP
+ - Loopback (send-to-self)
+ - Myrinet: GM and MX
+ - Portals
+ - Quadrics Elan
+ - Shared memory
+ - TCP
+ - SCTP
+ - uDAPL
+
+ - "cm" supports a smaller number of networks (and they cannot be
+ used together), but may provide better overall MPI
+ performance:
+ - Myrinet MX (not GM)
+ - InfiniPath PSM
+ - Portals
+
+ Open MPI will, by default, choose to use "cm" when the InfiniPath
+ PSM MTL can be used. Otherwise, OB1 will be used and the
+ corresponding BTLs will be selected. Users can force the use of ob1
+ or cm if desired by setting the "pml" MCA parameter at run-time:
+
+ shell$ mpirun --mca pml ob1 ...
+ or
+ shell$ mpirun --mca pml cm ...
+
+- Myrinet MX support is shared between the 2 internal devices, the MTL
+ and the BTL. The design of the BTL interface in Open MPI assumes
+ that only naive one-sided communication capabilities are provided by
+ the low level communication layers. However, modern communication
+ layers such as Myrinet MX, InfiniPath PSM, or Portals, natively
+ implement highly-optimized two-sided communication semantics. To
+ leverage these capabilities, Open MPI provides the "cm" PML and
+ corresponding MTL components to transfer messages rather than bytes.
+ The MTL interface implements a shorter code path and lets the
+ low-level network library decide which protocol to use (depending on
+ issues such as message length, internal resources and other
+ parameters specific to the underlying interconnect). However, Open
+ MPI cannot currently use multiple MTL modules at once. In the case
+ of the MX MTL, process loopback and on-node shared memory
+ communications are provided by the MX library. Moreover, the
+ current MX MTL does not support message pipelining, resulting in
+ lower performance for non-contiguous datatypes.
+
+ The "ob1" PML and BTL components use Open MPI's internal on-node
+ shared memory and process loopback devices for high performance.
+ The BTL interface allows multiple devices to be used simultaneously.
+ For the MX BTL it is recommended that the first segment (which acts
+ as a threshold between the eager and the rendezvous protocol) should
+ always be at most 4KB, but there is no further restriction on the
+ size of subsequent fragments.
+
+ The MX MTL is recommended in the common case for best performance on
+ 10G hardware when most of the data transfers cover contiguous memory
+ layouts. The MX BTL is recommended in all other cases, such as when
+ using multiple interconnects at the same time (including TCP), or
+ transferring non-contiguous data-types.
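+
+ The recommendation above might translate into command lines along
+ these lines. This is a sketch under assumptions: the helper name
+ mx_mca_args and ./a.out are hypothetical, selecting the MTL via an
+ "mtl" MCA parameter is assumed by analogy with the "btl" and "pml"
+ parameters shown earlier, and the commands are only printed:
+
```shell
# Hypothetical helper: pick MCA options for MX based on the workload.
mx_mca_args() {
    case $1 in
        contiguous)
            # Mostly contiguous transfers: MX MTL via the "cm" PML.
            printf '%s\n' "--mca pml cm --mca mtl mx" ;;
        mixed)
            # Non-contiguous data or several interconnects: MX BTL via "ob1".
            printf '%s\n' "--mca pml ob1 --mca btl mx,tcp,sm,self" ;;
        *)
            echo "usage: mx_mca_args contiguous|mixed" >&2; return 1 ;;
    esac
}

echo mpirun $(mx_mca_args contiguous) ./a.out
echo mpirun $(mx_mca_args mixed) ./a.out
```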
+
+
+Shared library versioning support
+---------------------------------
+
+Open MPI started using GNU-Libtool recommended shared library
+versioning with the v1.3.3 release (where all versions were set to
+0:0:0) for the main MPI libraries: libmpi, libmpi_cxx, libmpi_f77, and
+libmpi_f90.
+
+Open MPI's other internal libraries are not [yet] versioned for deep
+voodoo technical reasons. Please see
+https://svn.open-mpi.org/trac/ompi/ticket/2092 for more details.
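+
+As a rough illustration of what a Libtool version triple such as
+0:0:0 means, the sketch below maps "current:revision:age" to the
+file names Libtool conventionally produces on Linux. The function
+name is hypothetical and other platforms use different naming
+schemes, so treat this as an assumption-laden sketch rather than a
+description of Open MPI's build system.
+

```python
def linux_shared_library_names(name, current, revision, age):
    """Map a Libtool current:revision:age triple to Linux file names.

    Assumes the usual Linux convention: the soname carries
    (current - age) and the real file name appends age and revision.
    """
    if age > current:
        raise ValueError("age cannot exceed current")
    major = current - age
    soname = f"lib{name}.so.{major}"
    real_name = f"{soname}.{age}.{revision}"
    return soname, real_name


# The 0:0:0 triple mentioned above, applied to libmpi.
print(linux_shared_library_names("mpi", 0, 0, 0))
```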
+
+===========================================================================
+
+Building Open MPI
+-----------------
+
+Open MPI uses a traditional configure script paired with "make" to
+build. Typical installs can be of the pattern:
+
+---------------------------------------------------------------------------
+shell$ ./configure [...options...]
+shell$ make all install
+---------------------------------------------------------------------------
+
+There are many available configure options (see "./configure --help"
+for a full list); a summary of the more commonly used ones follows:
+
+--prefix=<directory>
+ Install Open MPI into the base directory named <directory>. Hence,
+ Open MPI will place its executables in <directory>/bin, its header
+ files in <directory>/include, its libraries in <directory>/lib, etc.
+
+--with-elan=<directory>
+ Specify the directory where the Quadrics Elan library and header
+ files are located. This option is generally only necessary if the
+ Elan headers and libraries are not in default compiler/linker
+ search paths.
+
+ Elan is the support library for Quadrics-based networks.
+
+--with-elan-libdir=<directory>
+ Look in directory for the Quadrics Elan libraries. By default, Open
+ MPI will look in <elan directory>/lib and <elan directory>/lib64,
+ which covers most cases. This option is only needed for special
+ configurations.
+
+--with-gm=<directory>
+ Specify the directory where the GM libraries and header files are
+ located. This option is generally only necessary if the GM headers
+ and libraries are not in default compiler/linker search paths.
+
+ GM is the support library for older Myrinet-based networks (GM has
+ been obsoleted by MX).
+
+--with-gm-libdir=<directory>
+ Look in directory for the GM libraries. By default, Open MPI will
+ look in <gm directory>/lib and <gm directory>/lib64, which covers
+ most cases. This option is only needed for special configurations.
+
+--with-mx=<directory>
+ Specify the directory where the MX libraries and header files are
+ located. This option is generally only necessary if the MX headers
+ and libraries are not in default compiler/linker search paths.
+
+ MX is the support library for Myrinet-based networks.
+
+--with-mx-libdir=<directory>
+ Look in directory for the MX libraries. By default, Open MPI will
+ look in <mx directory>/lib and <mx directory>/lib64, which covers
+ most cases. This option is only needed for special configurations.
+
+--with-openib=<directory>
+ Specify the directory where the OpenFabrics (previously known as
+ OpenIB) libraries and header files are located. This option is
+ generally only necessary if the OpenFabrics headers and libraries
+ are not in default compiler/linker search paths.
+
+ "OpenFabrics" refers to iWARP- and InfiniBand-based networks.
+
+--with-openib-libdir=<directory>
+ Look in directory for the OpenFabrics libraries. By default, Open
+ MPI will look in <openib directory>/lib and <openib
+ directory>/lib64, which covers most cases. This option is only
+ needed for special configurations.
+
+--with-portals=<directory>
+ Specify the directory where the Portals libraries and header files
+ are located. This option is generally only necessary if the Portals
+ headers and libraries are not in default compiler/linker search
+ paths.
+
+ Portals is the support library for Cray interconnects, but is also
+ available on other platforms (e.g., there is a Portals library
+ implemented over regular TCP).
+
+--with-portals-config=<type>
+ Configuration to use for Portals support. The following <type>
+ values are possible: "utcp", "xt3", "xt3-modex" (default: utcp).
+
+--with-portals-libs=<libs>
+ Additional libraries to link with for Portals support.
+
+--with-psm=<directory>
+ Specify the directory where the QLogic InfiniPath PSM library and
+ header files are located. This option is generally only necessary
+ if the InfiniPath headers and libraries are not in default
+ compiler/linker search paths.
+
+ PSM is the support library for QLogic InfiniPath network adapters.
+
+--with-psm-libdir=<directory>
+ Look in directory for the PSM libraries. By default, Open MPI will
+ look in <psm directory>/lib and <psm directory>/lib64, which covers
+ most cases. This option is only needed for special configurations.
+
+--with-sctp=<directory>
+ Specify the directory where the SCTP libraries and header files are
+ located. This option is generally only necessary if the SCTP headers
+ and libraries are not in default compiler/linker search paths.
+
+ SCTP is a special network stack over ethernet networks.
+
+--with-sctp-libdir=<directory>
+ Look in directory for the SCTP libraries. By default, Open MPI will
+ look in <sctp directory>/lib and <sctp directory>/lib64, which covers
+ most cases. This option is only needed for special configurations.
+
+--with-udapl=<directory>
+ Specify the directory where the UDAPL libraries and header files are
+ located. Note that UDAPL support is disabled by default on Linux;
+ the --with-udapl flag must be specified in order to enable it.
+ Specifying the directory argument is generally only necessary if the
+ UDAPL headers and libraries are not in default compiler/linker
+ search paths.
+
+ UDAPL is the support library for high performance networks in Sun
+ HPC ClusterTools and on Linux OpenFabrics networks (although the
+ "openib" options are preferred for Linux OpenFabrics networks, not
+ UDAPL).
+
+--with-udapl-libdir=<directory>
+ Look in directory for the UDAPL libraries. By default, Open MPI
+ will look in <udapl directory>/lib and <udapl directory>/lib64,
+ which covers most cases. This option is only needed for special
+ configurations.
+
+--with-lsf=<directory>
+ Specify the directory where the LSF libraries and header files are
+ located. This option is generally only necessary if the LSF headers
+ and libraries are not in default compiler/linker search paths.
+
+ LSF is a resource manager system, frequently used as a batch
+ scheduler in HPC systems.
+
+ NOTE: If you are using LSF version 7.0.5, you will need to add
+ "LIBS=-ldl" to the configure command line. For example:
+
+ ./configure LIBS=-ldl --with-lsf ...
+
+ This workaround should *only* be needed for LSF 7.0.5.
+
+--with-lsf-libdir=<directory>
+ Look in directory for the LSF libraries. By default, Open MPI will
+ look in <lsf directory>/lib and <lsf directory>/lib64, which covers
+ most cases. This option is only needed for special configurations.
+
+--with-tm=<directory>
+ Specify the directory where the TM libraries and header files are
+ located. This option is generally only necessary if the TM headers
+ and libraries are not in default compiler/linker search paths.
+
+ TM is the support library for the Torque and PBS Pro resource
+ manager systems, both of which are frequently used as a batch
+ scheduler in HPC systems.
+
+--with-sge
+ Specify to build support for the Sun Grid Engine (SGE) resource
+ manager. SGE support is disabled by default; this option must be
+ specified to build OMPI's SGE support.
+
+ The Sun Grid Engine (SGE) is a resource manager system, frequently
+ used as a batch scheduler in HPC systems.
+
+--with-mpi-param-check(=value)
+ "value" can be one of: always, never, runtime. If
+ --with-mpi-param-check is not specified, "runtime" is the default.
+ If --with-mpi-param-check is specified with no value, "always" is
+ used. Using --without-mpi-param-check is equivalent to "never".
+
+ - always: the parameters of MPI functions are always checked for
+ errors
+ - never: the parameters of MPI functions are never checked for
+ errors
+ - runtime: whether the parameters of MPI functions are checked
+ depends on the value of the MCA parameter mpi_param_check
+ (default: yes).
+
+--with-threads=value
+ Since thread support (both support for MPI_THREAD_MULTIPLE and
+ asynchronous progress) is only partially tested, it is disabled by
+ default. To enable threading, use "--with-threads=posix". This is
+ most useful when combined with --enable-mpi-threads and/or
+ --enable-progress-threads.
+
+--enable-mpi-threads
+ Allows the MPI thread level MPI_THREAD_MULTIPLE. See
+ --with-threads; this is currently disabled by default.
+
+--enable-progress-threads
+ Allows asynchronous progress in some transports. See
+ --with-threads; this is currently disabled by default. See the
+ above note about asynchronous progress.
+
+--disable-mpi-cxx
+ Disable building the C++ MPI bindings. Note that this does *not*
+ disable the C++ checks during configure; some of Open MPI's tools
+ are written in C++ and therefore require a C++ compiler to be built.
+
+--disable-mpi-cxx-seek
+ Disable the MPI::SEEK_* constants. Due to a problem with the MPI-2
+ specification, these constants can conflict with system-level SEEK_*
+ constants. Open MPI attempts to work around this problem, but the
+ workaround may fail in some esoteric situations. The
+ --disable-mpi-cxx-seek switch disables Open MPI's workarounds (and
+ therefore the MPI::SEEK_* constants will be unavailable).
+
+--disable-mpi-f77
+ Disable building the Fortran 77 MPI bindings.
+
+--disable-mpi-f90
+ Disable building the Fortran 90 MPI bindings. Also related to the
+ --with-f90-max-array-dim and --with-mpi-f90-size options.
+
+--with-mpi-f90-size=<SIZE>
+ Three sizes of the MPI F90 module can be built: trivial (only a
+ handful of MPI-2 F90-specific functions are included in the F90
+ module), small (trivial + all MPI functions that take no choice
+ buffers), and medium (small + all MPI functions that take 1 choice
+ buffer). This parameter is only used if the F90 bindings are
+ enabled.
+
+--with-f90-max-array-dim=<DIM>
+ The F90 MPI bindings are strictly typed, even including the number of
+ dimensions for arrays for MPI choice buffer parameters. Open MPI
+ generates these bindings at compile time with a maximum number of
+ dimensions as specified by this parameter. The default value is 4.
+
+--enable-mpirun-prefix-by-default
+ This option forces the "mpirun" command to always behave as if
+ "--prefix $prefix" was present on the command line (where $prefix is
+ the value given to the --prefix option to configure). This prevents
+ most rsh/ssh-based users from needing to modify their shell startup
+ files to set the PATH and/or LD_LIBRARY_PATH for Open MPI on remote
+ nodes. Note, however, that such users may still desire to set PATH
+ -- perhaps even in their shell startup files -- so that executables
+ such as mpicc and mpirun can be found without needing to type long
+ path names. --enable-orterun-prefix-by-default is a synonym for
+ this option.
+
+--disable-shared
+ By default, libmpi is built as a shared library, and all components
+ are built as dynamic shared objects (DSOs). This switch disables
+ this default; it is really only useful when used with
+ --enable-static. Specifically, this option does *not* imply
+ --enable-static; enabling static libraries and disabling shared
+ libraries are two independent options.
+
+--enable-static
+ Build libmpi as a static library, and statically link in all
+ components. Note that this option does *not* imply
+ --disable-shared; enabling static libraries and disabling shared
+ libraries are two independent options.
+
+--enable-sparse-groups
+ Enable the usage of sparse groups. This can save a significant
+ amount of memory, especially when creating large communicators.
+ (Disabled by default)
+
+--enable-peruse
+ Enable the PERUSE MPI data analysis interface.
+
+--enable-dlopen
+ Build all of Open MPI's components as standalone Dynamic Shared
+ Objects (DSO's) that are loaded at run-time. The opposite of this
+ option, --disable-dlopen, causes two things:
+
+ 1. All of Open MPI's components will be built as part of Open MPI's
+ normal libraries (e.g., libmpi).
+ 2. Open MPI will not attempt to open any DSO's at run-time.
+
+ Note that this option does *not* imply that OMPI's libraries will be
+ built as static objects (e.g., libmpi.a). It only specifies the
+ location of OMPI's components: standalone DSOs or folded into the
+ Open MPI libraries. You can control whether Open MPI's libraries
+ are built as static or dynamic via --enable|disable-static and
+ --enable|disable-shared.
+
+--enable-heterogeneous
+ Enable support for running on heterogeneous clusters (e.g., machines
+ with different endian representations). Heterogeneous support is
+ disabled by default because it imposes a minor performance penalty.
+
+--enable-ptmalloc2-internal
+ ***NOTE: This option no longer exists.
+
+ This option was introduced in Open MPI v1.3 and was then removed in
+ Open MPI v1.3.2. Open MPI fundamentally changed how it uses
+ ptmalloc2 support in v1.3.2 such that the
+ --enable-ptmalloc2-internal flag was no longer necessary. It can
+ still harmlessly be supplied to Open MPI's configure script, but a
+ warning will appear about how it is an unrecognized option.
+
+ In v1.3 and v1.3.1, Open MPI built the ptmalloc2 library as a
+ standalone library that users could choose to link in or not (by
+ adding -lopenmpi-malloc to their link command). Using this option
+ restored pre-v1.3 behavior of *always* forcing the user to use the
+ ptmalloc2 memory manager (because it is part of libmpi).
+
+ Starting with v1.3.2, ptmalloc2 is always built into Open MPI, but
+ is only activated in certain scenarios.
+
+--with-wrapper-cflags=<cflags>
+--with-wrapper-cxxflags=<cxxflags>
+--with-wrapper-fflags=<fflags>
+--with-wrapper-fcflags=<fcflags>
+--with-wrapper-ldflags=<ldflags>
+--with-wrapper-libs=<libs>
+ Add the specified flags to the default flags that are used in Open
+ MPI's "wrapper" compilers (e.g., mpicc -- see below for more
+ information about Open MPI's wrapper compilers). By default, Open
+ MPI's wrapper compilers use the same compilers used to build Open
+ MPI and specify an absolute minimum set of additional flags that are
+ necessary to compile/link MPI applications. These configure options
+ give system administrators the ability to embed additional flags in
+ OMPI's wrapper compilers (which is a local policy decision). The
+ meanings of the different flags are:
+
+ <cflags>: Flags passed by the mpicc wrapper to the C compiler
+ <cxxflags>: Flags passed by the mpic++ wrapper to the C++ compiler
+ <fflags>: Flags passed by the mpif77 wrapper to the F77 compiler
+ <fcflags>: Flags passed by the mpif90 wrapper to the F90 compiler
+ <ldflags>: Flags passed by all the wrappers to the linker
+ <libs>: Libraries passed by all the wrappers to the linker
+
+ There are other ways to configure Open MPI's wrapper compiler
+ behavior; see the Open MPI FAQ for more information.
+
+There are many other options available -- see "./configure --help".
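+
+The three --with-mpi-param-check settings described in the option
+list above can be summarized as a small decision function. This is
+only a model of the documented behavior, not Open MPI code; the
+function and argument names are hypothetical.
+

```python
def mpi_params_checked(configure_value="runtime", mca_mpi_param_check=True):
    """Model whether MPI function parameters are checked for errors.

    configure_value     -- "always", "never", or "runtime" (the value
                           given to --with-mpi-param-check)
    mca_mpi_param_check -- run-time value of the mpi_param_check MCA
                           parameter; only consulted for "runtime"
    """
    if configure_value == "always":
        return True
    if configure_value == "never":
        return False
    if configure_value == "runtime":
        return bool(mca_mpi_param_check)
    raise ValueError(f"unknown value: {configure_value!r}")


for setting in ("always", "never", "runtime"):
    print(setting, mpi_params_checked(setting, mca_mpi_param_check=False))
```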
+
+Changing the compilers that Open MPI uses to build itself uses the
+standard Autoconf mechanism of setting special environment variables
+either before invoking configure or on the configure command line.
+The following environment variables are recognized by configure:
+
+CC - C compiler to use
+CFLAGS - Compile flags to pass to the C compiler
+CPPFLAGS - Preprocessor flags to pass to the C compiler
+
+CXX - C++ compiler to use
+CXXFLAGS - Compile flags to pass to the C++ compiler
+CXXCPPFLAGS - Preprocessor flags to pass to the C++ compiler
+
+F77 - Fortran 77 compiler to use
+FFLAGS - Compile flags to pass to the Fortran 77 compiler
+
+FC - Fortran 90 compiler to use
+FCFLAGS - Compile flags to pass to the Fortran 90 compiler
+
+LDFLAGS - Linker flags to pass to all compilers
+LIBS - Libraries to pass to all compilers (it is rarely
+ necessary for users to need to specify additional LIBS)
+
+For example:
+
+shell$ ./configure CC=mycc CXX=myc++ F77=myf77 FC=myf90 ...
+
+***Note: We generally suggest using the above command line form for
+ setting different compilers (vs. setting environment variables and
+ then invoking "./configure"). The above form will save all
+ variables and values in the config.log file, which makes
+ post-mortem analysis easier when problems occur.
+
+Note that you may also want to ensure that the value of
+LD_LIBRARY_PATH is set appropriately (or not at all) for your build
+(or whatever environment variable is relevant for your operating
+system). For example, some users have been tripped up by selecting
+non-default Fortran compilers via FC / F77, but then failing to
+set LD_LIBRARY_PATH to include the directory containing that
+non-default Fortran compiler's support libraries. This causes Open
+MPI's configure script to fail when it tries to compile / link / run
+simple Fortran programs.
+
+It is required that the compilers specified be compile and link
+compatible, meaning that object files created by one compiler must be
+able to be linked with object files from the other compilers and
+produce correctly functioning executables.
+
+Open MPI supports all the "make" targets that are provided by GNU
+Automake, such as:
+
+all - build the entire Open MPI package
+install - install Open MPI
+uninstall - remove all traces of Open MPI from the $prefix
+clean - clean out the build tree
+
+Once Open MPI has been built and installed, it is safe to run "make
+clean" and/or remove the entire build tree.
+
+VPATH and parallel builds are fully supported.
+
+Generally speaking, the only thing that users need to do to use Open
+MPI is ensure that <prefix>/bin is in their PATH and <prefix>/lib is
+in their LD_LIBRARY_PATH. Users may need to set the PATH
+and LD_LIBRARY_PATH in their shell setup files (e.g., .bashrc, .cshrc)
+so that non-interactive rsh/ssh-based logins will be able to find the
+Open MPI executables.
+
+===========================================================================
+
+Checking Your Open MPI Installation
+-----------------------------------
+
+The "ompi_info" command can be used to check the status of your Open
+MPI installation (located in <prefix>/bin/ompi_info). Running it with
+no arguments provides a summary of information about your Open MPI
+installation.
+
+Note that the ompi_info command is extremely helpful in determining
+which components are installed as well as listing all the run-time
+settable parameters that are available in each component (as well as
+their default values).
+
+The following options may be helpful:
+
+--all Show a *lot* of information about your Open MPI
+ installation.
+--parsable Display all the information in an easily
+ grep/cut/awk/sed-able format.
+--param <framework> <component>
+ A <framework> of "all" and a <component> of "all" will
+ show all parameters to all components. Otherwise, the
+ parameters of all the components in a specific framework,
+ or just the parameters of a specific component can be
+ displayed by using an appropriate <framework> and/or
+ <component> name.
+
+Changing the values of these parameters is explained in the "The
+Modular Component Architecture (MCA)" section, below.
+
+===========================================================================
+
+Compiling Open MPI Applications
+-------------------------------
+
+Open MPI provides "wrapper" compilers that should be used for
+compiling MPI applications:
+
+C: mpicc
+C++: mpiCC (or mpic++ if your filesystem is case-insensitive)
+Fortran 77: mpif77
+Fortran 90: mpif90
+
+For example:
+
+shell$ mpicc hello_world_mpi.c -o hello_world_mpi -g
+shell$
+
+All the wrapper compilers do is add a variety of compiler and linker
+flags to the command line and then invoke a back-end compiler. To be
+specific: the wrapper compilers do not parse source code at all; they
+are solely command-line manipulators, and have nothing to do with the
+actual compilation or linking of programs. The end result is an MPI
+executable that is properly linked to all the relevant libraries.
+
+Customizing the behavior of the wrapper compilers is possible (e.g.,
+changing the compiler [not recommended] or specifying additional
+compiler/linker flags); see the Open MPI FAQ for more information.
+
+===========================================================================
+
+Running Open MPI Applications
+-----------------------------
+
+Open MPI supports both mpirun and mpiexec (they are exactly
+equivalent). For example:
+
+shell$ mpirun -np 2 hello_world_mpi
+or
+shell$ mpiexec -np 1 hello_world_mpi : -np 1 hello_world_mpi
+
+are equivalent. Some of mpiexec's switches (such as -host and -arch)
+are not yet functional, although they will not error if you try to use
+them.
+
+The rsh launcher accepts a -hostfile parameter (the option
+"-machinefile" is equivalent); you can specify a -hostfile parameter
+indicating a standard mpirun-style hostfile (one hostname per line):
+
+shell$ mpirun -hostfile my_hostfile -np 2 hello_world_mpi
+
+If you intend to run more than one process on a node, the hostfile can
+use the "slots" attribute. If "slots" is not specified, a count of 1
+is assumed. For example, using the following hostfile:
+
+---------------------------------------------------------------------------
+node1.example.com
+node2.example.com
+node3.example.com slots=2
+node4.example.com slots=4
+---------------------------------------------------------------------------
+
+shell$ mpirun -hostfile my_hostfile -np 8 hello_world_mpi
+
+will launch MPI_COMM_WORLD rank 0 on node1, rank 1 on node2, ranks 2
+and 3 on node3, and ranks 4 through 7 on node4.
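+
+The placement described above can be reproduced with a short model
+that reads the hostfile and fills each node's slots in order. It is
+a sketch of this one example only: the helper names are
+hypothetical, only the "slots" attribute is parsed, and
+oversubscription is deliberately not modeled.
+

```python
HOSTFILE = """\
node1.example.com
node2.example.com
node3.example.com slots=2
node4.example.com slots=4
"""


def parse_hostfile(text):
    """Return a list of (hostname, slots); slots defaults to 1."""
    hosts = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        slots = 1
        for field in fields[1:]:
            if field.startswith("slots="):
                slots = int(field[len("slots="):])
        hosts.append((fields[0], slots))
    return hosts


def place_by_slot(hosts, np):
    """Assign ranks 0..np-1 to hosts, filling each host's slots in order."""
    placement = [host for host, slots in hosts for _ in range(slots)]
    if np > len(placement):
        raise ValueError("more processes than slots; not modeled here")
    return placement[:np]


for rank, host in enumerate(place_by_slot(parse_hostfile(HOSTFILE), 8)):
    print(rank, host)
```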
+
+Other starters, such as the resource manager / batch scheduling
+environments, do not require hostfiles (and will ignore the hostfile
+if it is supplied). They will also launch as many processes as slots
+have been allocated by the scheduler if no "-np" argument has been
+provided. For example, running a SLURM job with 8 processors:
+
+shell$ salloc -n 8 mpirun a.out
+
+The above command will reserve 8 processors and run 1 copy of mpirun,
+which will, in turn, launch 8 copies of a.out in a single
+MPI_COMM_WORLD on the processors that were allocated by SLURM.
+
+Note that the values of component parameters can be changed on the
+mpirun / mpiexec command line. This is explained in the section
+below, "The Modular Component Architecture (MCA)".
+
+===========================================================================
+
+The Modular Component Architecture (MCA)
+----------------------------------------
+
+The MCA is the backbone of Open MPI -- most services and functionality
+are implemented through MCA components. Here is a list of all the
+component frameworks in Open MPI:
+
+---------------------------------------------------------------------------
+
+MPI component frameworks:
+-------------------------
+
+allocator - Memory allocator
+bml - BTL management layer
+btl - MPI point-to-point Byte Transfer Layer, used for MPI
+ point-to-point messages on some types of networks
+coll - MPI collective algorithms
+crcp - Checkpoint/restart coordination protocol
+dpm - MPI-2 dynamic process management
+io - MPI-2 I/O
+mpool - Memory pooling
+mtl - Matching transport layer, used for MPI point-to-point
+ messages on some types of networks
+osc - MPI-2 one-sided communications
+pml - MPI point-to-point management layer
+pubsub - MPI-2 publish/subscribe management
+rcache - Memory registration cache
+topo - MPI topology routines
+
+Back-end run-time environment component frameworks:
+---------------------------------------------------
+
+errmgr - RTE error manager
+ess - RTE environment-specific services
+filem - Remote file management
+grpcomm - RTE group communications
+iof - I/O forwarding
+notifier - System/network administrator notification system
+odls - OpenRTE daemon local launch subsystem
+oob - Out of band messaging
+plm - Process lifecycle management
+ras - Resource allocation system
+rmaps - Resource mapping system
+rml - RTE message layer
+routed - Routing table for the RML
+snapc - Snapshot coordination
+
+Miscellaneous frameworks:
+-------------------------
+
+backtrace - Debugging call stack backtrace support
+carto - Cartography (host/network mapping) support
+crs - Checkpoint and restart service
+installdirs - Installation directory relocation services
+maffinity - Memory affinity
+memchecker - Run-time memory checking
+memcpy - Memory copy support
+memory - Memory management hooks
+paffinity - Processor affinity
+timer - High-resolution timers
+
+---------------------------------------------------------------------------
+
+Each framework typically has one or more components that are used at
+run-time. For example, the btl framework is used by the MPI layer to
+send bytes across different types of underlying networks. The tcp btl,
+for example, sends messages across TCP-based networks; the openib btl
+sends messages across OpenFabrics-based networks; the MX btl sends
+messages across Myrinet networks.
+
+Each component typically has some tunable parameters that can be
+changed at run-time. Use the ompi_info command to check a component
+to see what its tunable parameters are. For example:
+
+shell$ ompi_info --param btl tcp
+
+shows all the parameters (and default values) for the tcp btl
+component.
+
+These values can be overridden at run-time in several ways. At
+run-time, the following locations are examined (in order) for new
+values of parameters:
+
+1. <prefix>/etc/openmpi-mca-params.conf
+
+ This file is intended to set any system-wide default MCA parameter
+ values -- it will apply, by default, to all users who use this Open
+ MPI installation. The default file that is installed contains many
+ comments explaining its format.
+
+2. $HOME/.openmpi/mca-params.conf
+
+ If this file exists, it should be in the same format as
+ <prefix>/etc/openmpi-mca-params.conf. It is intended to provide
+ per-user default parameter values.
+
+3. environment variables of the form OMPI_MCA_<name> set equal to a
+ <value>
+
+ Where <name> is the name of the parameter. For example, set the
+ variable named OMPI_MCA_btl_tcp_frag_size to the value 65536
+ (Bourne-style shells):
+
+ shell$ OMPI_MCA_btl_tcp_frag_size=65536
+ shell$ export OMPI_MCA_btl_tcp_frag_size
+
+4. the mpirun command line: --mca <name> <value>
+
+ Where <name> is the name of the parameter. For example:
+
+ shell$ mpirun --mca btl_tcp_frag_size 65536 -np 2 hello_world_mpi
+
+These locations are checked in order. For example, a parameter value
+passed on the mpirun command line will override an environment
+variable; an environment variable will override the system-wide
+defaults.
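+
+The override order can be modeled as a lookup that walks the four
+sources from lowest to highest precedence. This is an illustrative
+sketch, not Open MPI's implementation: the function name, the
+dictionaries standing in for the two files, and every value other
+than 65536 are hypothetical.
+

```python
def resolve_mca_param(name, default, system_conf, user_conf, environ, cmdline):
    """Return (value, source) for one MCA parameter.

    Sources are consulted in the documented order; each later source
    that defines the parameter overrides the earlier ones.
    """
    value, source = default, "default"
    if name in system_conf:
        value, source = system_conf[name], "<prefix>/etc/openmpi-mca-params.conf"
    if name in user_conf:
        value, source = user_conf[name], "$HOME/.openmpi/mca-params.conf"
    env_name = "OMPI_MCA_" + name
    if env_name in environ:
        value, source = environ[env_name], env_name
    if name in cmdline:
        value, source = cmdline[name], "mpirun --mca"
    return value, source


# Hypothetical settings: the environment sets one value and the
# mpirun command line sets another; the command line wins.
print(resolve_mca_param(
    "btl_tcp_frag_size",
    default="unset",
    system_conf={},
    user_conf={},
    environ={"OMPI_MCA_btl_tcp_frag_size": "131072"},
    cmdline={"btl_tcp_frag_size": "65536"},
))
```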
+
+===========================================================================
+
+Common Questions
+----------------
+
+Many common questions about building and using Open MPI are answered
+on the FAQ:
+
+ http://www.open-mpi.org/faq/
+
+===========================================================================
+
+Got more questions?
+-------------------
+
+Found a bug? Got a question? Want to make a suggestion? Want to
+contribute to Open MPI? Please let us know!
+
+When submitting questions and problems, be sure to include as much
+extra information as possible. This web page details all the
+information that we request in order to provide assistance:
+
+ http://www.open-mpi.org/community/help/
+
+User-level questions and comments should generally be sent to the
+user's mailing list (users@open-mpi.org). Because of spam, only
+subscribers are allowed to post to this list (ensure that you
+subscribe with and post from *exactly* the same e-mail address --
+joe@example.com is considered different than
+joe@mycomputer.example.com!). Visit this page to subscribe to the
+user's list:
+
+ http://www.open-mpi.org/mailman/listinfo.cgi/users
+
+Developer-level bug reports, questions, and comments should generally
+be sent to the developer's mailing list (devel@open-mpi.org). Please
+do not post the same question to both lists. As with the user's list,
+only subscribers are allowed to post to the developer's list. Visit
+the following web page to subscribe:
+
+ http://www.open-mpi.org/mailman/listinfo.cgi/devel
+
+Make today an Open MPI day!
--- /dev/null
+ OpenSM Release Notes 3.3
+ =============================
+
+Version: OpenSM 3.3.x
+Repo: git://git.openfabrics.org/~sashak/management.git
+Date: Dec 2009
+
+1 Overview
+----------
+This document describes the contents of the OpenSM 3.3 release.
+OpenSM is an InfiniBand compliant Subnet Manager and Administration,
+and runs on top of OpenIB. The OpenSM version for this release
+is opensm-3.3.5.
+
+This document includes the following sections:
+1 This Overview section (describing new features and software
+ dependencies)
+2 Known Issues And Limitations
+3 Unsupported IB compliance statements
+4 Bug Fixes
+5 Main Verification Flows
+6 Qualified Software Stacks and Devices
+
+1.1 Major New Features
+
+* Mesh Analysis for LASH routing algorithm.
+ The performance of LASH can be improved by preconditioning the mesh in
+ cases where there are multiple links connecting switches and also in
+ cases where the switches are not cabled consistently.
+ Activated with --do_mesh_analysis command line and config file option.
+
+* Reloadable OpenSM configuration (preliminary implementation)
+ It is now possible to reload OpenSM configuration parameters on the
+ fly without restarting.
+
+* Routing paths sorted balancing (for UpDown and MinHops)
+ This sorts the order of ports in which OpenSM balances routing paths.
+ It helps to improve performance dramatically (40-50%) for the most
+ popular application communication patterns.
+ To override this behavior, use the --guid_routing_order_file command
+ line option.
+
+* Weighted Lid Matrices calculation (for UpDown, MinHop and DOR).
+ This low level routing fine-tuning feature provides the means to
+ define a weighting factor per port for customizing the least weight
+ hops for the routing. Custom weights are provided in a file specified
+ with the '--hop_weights_file' command line option.
+
+* I/O nodes connectivity (for FatTree).
+ This makes it possible to define the set of I/O nodes for the
+ Fat-Tree routing algorithm. I/O nodes are non-CN nodes allowed to use
+ up to N (specified using --max_reverse_hops) switches the wrong way
+ around to improve connectivity. The I/O node list is provided in a
+ file specified with the --io_guid_file command line option.
+
+* MGID to MLID compression - infrastructure for many MGIDs to single MLID
+ compression. This becomes helpful when the number of multicast groups
+ exceeds the subnet's MLID routing capability (normally 1024 groups). In
+ such cases many multicast groups (MGIDs) can be routed using the same
+ MLID value.
+
+* Many code improvements, optimizations and cleanups.
+
+* Windows support (early stage).
+
+1.2 Minor New Features:
+
+cde0c0d opensm: Convert remaining helper routines for GID printing format
+bc5743c opensm: Add support for MaxCreditHint and LinkRoundTripLatency to
+ osm_dump_port_info
+6cd34ab opensm: Add Dell to known vendor list
+003d6bd opensm: Add more info for traps 144 and 256-259 in osm_dump_notice
+5b0c5de opensm/osm_ucat_ftree.c Enhance min hops counters usage
+0715b92 ib_types.h: Add ib_switch_info_get_state_opt_sl2vlmapping routine
+2ddba79 opensm: Remove some __ and __osm_ prefixes
+ea0691f opensm/iba/ib_types.h: Add PortXmit/RcvDataSL PerfMgt attributes
+9c79be5 ib_types.h: Adding BKEY violation trap (259)
+c608ea6 opensm: Add and utilize ib_gid_is_notzero routine
+b639e64 opensm: Handle trap repress on trap 144 generation
+b034205 Add pkey table support to osm_get_all_port_attr
+876605b opensm/ib_types.h: Add attribute ID for PortCountersExtended
+aae3bbc opensm: PortInfo requests for discovered switches
+0147b09 opensm/osm_lid_mgr: use single array for used_lids
+a9225b0 opensm/Makefile.am: remove osm_build_id.h junk file generation
+8e3a57d opensm/osm_console.c: Add list of SMs to status command
+3d664b9 opensm/osm_console.c : Added dump_portguid function to console to
+ generate a list of port guids matching one or more regexps
+85b35bc opensm/osm_helper.c: print port number as decimal
+8674cb7 opensm: sort port order for routing by switch loads
+80c0d48 opensm: rescan config file even in standby
+8b7aa5e opensm/osm_subnet.c enable log_max_size opt update
+8558ee5 opensm/include/iba/ib_types.h: Add xmit_wait for PortCounters
+ecde2f7 opensm/osm_subnet.c support subnet configuration rescan and update
+58c45e4 opensm/osm_log.c save log_max_size in subnet opt in MB
+cf88e93 opensm: Add new partition keyword for all hca, switches and routers
+4bfd4e0 opensm: remove libibcommon build dependencies
+3718fc4 opensm/event_plugin: link opensm with -rdynamic flag
+587ce14 opensm/osm_inform.c report IB traps to plugin
+ced5a6e opensm/opensm/osm_console.c: move reporting of plugins to "status"
+ command.
+696aca2 opensm: Add configurable retries for transactions
+0d932ff opensm/osm_sa_mcmember_record.c: optimization in zero mgid comparison
+254c2ef opensm/osm_sm_mad_ctrl.c: In sm_mad_ctrl_send_err_cb, set init
+ failure on PKeyTable and QoS initialization failure
+83bd10a opensm: Reduce heap consumption by multicast routing tables (MFTs)
+cd33bc5 opensm: Add some additional HP vendor IDs/OUIs
+f78ec3a opensm/osm_mcast_tbl.(h c): Make max_mlid_ho be maximum MLID configured
+2d13530 opensm: Add infrastructure support for PortInfo
+ IsMulticastPkeyTrapSuppressionSupported
+3ace760 opensm: Reduce heap consumption by unicast routing tables (LFTs)
+eec568e osmtest: Add SA get PathRecord stress test
+aabc476 opensm: Add infrastructure support for more newly allocated PortInfo
+ CapabilityMask bits
+c83c331 opensm: improve multicast re-routing requests processing
+46db92f opensm: Parallelize (Stripe) MFT sets across switches
+00c6a6e opensm: Parallelize (Stripe) LFT sets across switches
+e21c651 opensm/osm_base.h: Add new SA ClassPortInfo:CapabilityMask2 bit
+ allocations
+09056b1 opensm/ib_types.h: Add CounterSelect2 field to PortCounters attribute
+6a63003 opensm: Add ability to configure SMSL
+25f071f opensm/lash: Set minimum VL for LASH to use
+622d853 opensm/osm_ucast_ftree.cd: Added support for same level links
+8146ba7 opensm: Add new Sun vendor ID
+1d7dd18 opensm/osm_ucast_ftree.c: Enhanced Fat-Tree algorithm
+e07a2f1 Add LMC support to DOR routing
+1acfe8a opensm: Add SuperMicro to list of recognized vendors
+f02f40e opensm: implement 'connect_roots' option in fat-tree routing
+748d41e opensm SA DB dump/restore: added option to dump SA DB on every sweep
+b03a95e complib/cl_fleximap: add cl_fmap_match() function
+b7a8a87 opensm/include/iba/ib_types.h: adding Congestion Control definitions
+
+1.3 Library API Changes
+
+ None
+
+1.4 Software Dependencies
+
+OpenSM depends on the installation of libibumad package (distributed as
+part of OFA IB management together with OpenSM) and IB stack presence,
+in particular libibumad uses user_mad kernel interface ('ib_umad' kernel
+module). The qualified driver versions are provided in Table 2,
+"Qualified IB Stacks".
+
+Also, building of QoS manager policy file parser requires flex, and either
+bison or byacc installed.
+
+1.5 Supported Devices Firmware
+
+The main task of OpenSM is to initialize InfiniBand devices. The
+qualified devices and their corresponding firmware versions
+are listed in Table 3.
+
+2 Known Issues And Limitations
+------------------------------
+
+* No Service / Key associations:
+ There is no way to manage Service access by Keys.
+
+* No SM to SM SMDB synchronization:
+ Puts the burden of re-registering services, multicast groups, and
+ inform-info on the client application (or IB access layer core).
+
+3 Unsupported IB Compliance Statements
+--------------------------------------
+The following section lists all the IB compliance statements which
+OpenSM does not support. Please refer to the IB specification for detailed
+information regarding each compliance statement.
+
+* C14-22 (Authentication):
+ M_Key, M_KeyProtectBits and M_KeyLeasePeriod shall be set in one
+ SubnSet method. As a work-around, an OpenSM option is provided for
+ defining the protect bits.
+
+* C14-67 (Authentication):
+ On SubnGet(SMInfo) and SubnSet(SMInfo) - if M_Key is not zero then
+ the SM shall generate a SubnGetResp if the M_Key matches, or
+ silently drop the packet if M_Key does not match.
+
+* C15-0.1.23.4 (Authentication):
+ InformInfoRecords shall always be provided with the QPN set to 0,
+ except for the case of a trusted request, in which case the actual
+ subscriber QPN shall be returned.
+
+* o13-17.1.2 (Event-FWD):
+ If no permission to forward, the subscription should be removed and
+ no further forwarding should occur.
+
+* C14-24.1.1.5 and C14-62.1.1.22 (Initialization):
+ GUIDInfo - SM should enable assigning Port GUIDInfo.
+
+* C14-44 (Initialization):
+ If the SM discovers that it is missing an M_Key to update CA/RT/SW,
+ it should notify the higher level.
+
+* C14-62.1.1.12 (Initialization):
+ PortInfo:M_Key - Set the M_Key to a node based random value.
+
+* C14-62.1.1.13 (Initialization):
+ PortInfo:M_KeyProtectBits - set according to an optional policy.
+
+* C14-62.1.1.24 (Initialization):
+ SwitchInfo:DefaultPort - should be configured for random FDB.
+
+* C14-62.1.1.32 (Initialization):
+ RandomForwardingTable should be configured.
+
+* o15-0.1.12 (Multicast):
+ If the JoinState is SendOnlyNonMember = 1 (only), then the endport
+ should join as sender only.
+
+* o15-0.1.8 (Multicast):
+ If a request to create an MCG has fields that cannot be met,
+ return ERR_REQ_INVALID (OpenSM currently ignores SL and FlowLabelTClass).
+
+* C15-0.1.8.6 (SA-Query):
+ Respond to SubnAdmGetTraceTable - this is an optional attribute.
+
+* C15-0.1.13 (Services):
+ Reject ServiceRecord create, modify or delete if the given
+ ServiceP_Key does not match the one included in the ServiceGID port
+ and the port that sent the request.
+
+* C15-0.1.14 (Services):
+ Provide means to associate service name and ServiceKeys.
+
+4 Bug Fixes
+-----------
+
+4.1 Major Bug Fixes
+
+18990fa opensm: set IS_SM bit during opensm init
+3551389 fix local port smlid in osm_send_trap144()
+a6de48d opensm/osm_link_mgr.c initialize SMSL
+82df467 opensm/osm_req.c: Shouldn't reveal port's MKey on Trap method
+45ebff9 opensm/osm_console_io.h: Modify osm_console_exit so only the
+ connection is killed, not the socket
+d10660a opensm/osm_req.c: In osm_send_trap144, set producer type according
+ to node type
+8a2d2dd opensm/osm_node_info_rcv.c: create physp for the newly discovered
+ port of the known node
+39b241f opensm/lid_mgr: fix duplicated lid assignment
+b44c398 opensm: invalidate routing cache when entering master state
+595f2e3 opensm: update LFTs when entering master
+8406c65 opensm: fix port chooser
+fa90512 opensm/osm_vendor_*_sa: fix incompatibility with QLogic SM
+7ec9f7c opensm: discard multicast SA PR with wildcard DGID
+5cdb53f opensm/osm_sa_node_record.c use comp mask to match by LID or GUID
+55f9772 opensm: Return single PathRecord for SubnAdmGet with DGID/SGID wild
+ carded
+5ec0b5f opensm: compress IPV6 SNM groups to use a single MLID
+
+4.2 Other Bug Fixes
+
+4911e0b performance-manager-HOWTO.txt: Indicate master state
+86ccaa4 opensm/osm_pkey_mgr.c: Fix pkey endian in log message
+b79b079 opensm.8.in: Add mention of backing documentation for QoS policy
+ file and performance manager
+b4d92af opensm/osm_perfmgr.c: Eliminate duplicated error number
+a10b57a opensm/osm_ucast_ftree.c: lids are always handled in host order
+44273a2 opensm/osm_ucast_ftree.c: fixing bug in indexing
+5cd98f7 Fix further bugs around console closure and clean up code.
+6b34339 opensm/osm_opensm.c: add newline to log message
+68c241c send trap144 when local priority is higher than master priority
+6462999 opensm/osm_inform.c: In __osm_send_report, make sure p_report_madw
+ valid before using
+9b8561a opensm/console: Fixed osm_console poll to handle POLLHUP
+91d0700 osm_vendor_ibumad.c: In clear_madw, fix tid endian in message
+5a5136b osm_switch.h : Fixed wrong comment about return value of
+ osm_switch_set_hops
+c1ec8c0 osm_ucast_ftree.c: Removed useless initialization on switch indexes
+418d01f opensm/osm_helper.c: use single buffer in osm_dump_dr_smp()
+2c9153c opensm/osm_helper.c: consolidate dr path printing code
+048c447 opensm/osm_helper.c: return then log is inactive
+dd3ef0c opensm: Return error status when cl_disp_register fails
+0143bf7 opensm/osm_perfmgr.c: Improve assert in osm_pc_rcv_process
+6622504 osm_perfmgr.c: In osm_perfmgr_shutdown, add missing cl_disp_unregister
+7b66dee opensm: remove unneeded anymore physp initializations
+f11274a opensm/partition-config.txt: Update for defmember feature
+d240e7d opensm/osm_sm_state_mgr.c: Remove unneeded return statement
+898fb8c opensm: Improve some snprintf uses
+6820e63 opensm/osm_sa_link_record.c: improve get_base_lid()
+64c8d31 opensm: initialize all switch ports
+555fae8 opensm/sweep: add log message before lid assignment
+8e22307 opensm/console: Enhance perfmgr print_counters for better nodenames
+b9721a1 opensm/osm_console.c: Improve perfmgr print_counters error message
+4d8dc72 opensm/osm_inform.c: Fix sense of zero GID compare in __match_inf_rec
+a98dd82 opensm/main.c: remove enable_stack_dump() call
+db6d51e opensm/osm_subnet: fix crash in qos string config parameters reloading
+e5111c8 opensm: proper config file rescan
+e5295b2 opensm: pre-scan command line for config file option
+e2f549e opensm/osm_console.c: Eliminate some extraneous parentheses
+0a265dc opensm/console: dump_portguid - don't duplicate matched guids
+540fefb opensm/console: dump_portguid command fixes
+d96202c opensm/osm_console.c: Add missing command in help_perfmgr
+ae1bd3c opensm/osm_helper.c: Add port counters to __osm_disp_msg_str
+1d38b31 opensm/osm_ucast_mgr.c: Add error numbers for some OSM_LOG prin
+156c749 opensm: fix structure definition for trap 257-258
+5c09f4a opensm/osm_state_mgr.c: small bug in scanning lid table
+72a2fa2 opensm/osm_sa.c: fixing SA MAD dump
+539a4d3 opensm/osm_ucast_ftree.c Fixed bad init value for down port index
+6690833 opensm/ftree: simplify root guids setup.
+90e3291 opensm/ftree: cleanup ftree_sw_tbl_element_t use
+c07d245 opensm/qos_config: no invalid option message on default values
+b382ad8 opensm: avoid memory leaks on config parameters reloading
+45f57ce opensm/osm_ucast_ftree.c: Fixed bug on index port incrementation
+3d618aa opensm/osm_subnet.c: break matching when config parameter already found
+44d98e3 opensm/osm_subnet.c: clean_val() remove trailing quotation
+173010a opensm/doc/perf-manager-arch.txt: Fix some commentary typos
+83bf6c5 opensm/osm_subnet.c fix parse functions for big endian machines
+6b9a1e9 opensm/PerfMgr: Primarily fix enhanced switch port 0 perf manager
+ operation
+4f79a17 opensm/osm_perfmgr.c: In osm_perfmgr_init, eliminate memory leak
+ on error
+22da81f opensm/osm_ucast_ftree.c: fix full topology dump
+aa25fcb opensm/osm_port_info_rcv.c: don't clear sw->need_update if port 0
+ is active
+003bd4b opensm/osm_subnet.c Fix memory leak for QOS string parameters.
+9cbbab2 opensm/opensm.spec: fix event plugin config options
+996e8f6 OpenSM: update osmeventplugin example for the new TRAP event.
+67f4c07 opensm/lash: simplify some memory allocations
+3e6bcdb opensm/lash: fix memory leaks
+3ff97b9 opensm/vendor: save some stack memory
+ccc7621 opensm/osm_ucast_ftree.c: fixing errors in comments
+1a802b3 Corrected incoherency in __osm_ftree_fabric_route_to_non_cns comments
+85a7e54 opensm/osm_sm.c: fix MC group creation in race condition
+aad1af2 opensm/osm_trap_rcv.c: Improvements in log_trap_info()
+f619d67 opensm/osm_trap_rcv.c: Minor reorganization of trap_rcv_process_request
+084335b opensm/link_mgr: verify port's lid
+d525931 opensm/osm_vendor_ibumad: Use OSM_UMAD_MAX_AGENTS rather than
+ UMAD_CA_MAX_AGENTS
+f342c62 opensm/osm_sa.c: don't ignore failure in osm_mgrp_add_port()
+587fda4 osmtest/osmt_multicast.c: fix strict aliasing breakage warning
+6931f3e opensm: make subnet's max mlid update implementation independent
+30f1acd osm_ucast_ftree.c missing reset of ca_ports
+ac04779 opensm: fix LFT allocation size
+a7838d0 opensm/osm_ucast_cache: reduce OSM_LOG_INFO debug printouts
+c027335 opensm/osm_ucast_updn.c: Further reduction in cas_per_sw allocation
+e8ee292 opensm/opensm/osm_subnet.c: adjust buffer to ensure a '\n' is printed
+84d9830 opensm/osm_ucast_updn.c: Reduce temporary allocation of cas_per_sw
+347ad64 opensm/ib_types.h: Mask off client rereg bit in set_client_rereg
+c2ab189 opensm/osm_state_mgr.c: in cleanup_switch() check only relevant
+ LFT part
+40c93d3 use transportable constant attributes
+c8fa71a osmtest -code cleanup - use strncasecmp()
+770704a opensm/osm_mcast_mgr.c: In mcast_mgr_set_mft_block, fix node GUID
+ in log message
+3d20f82 opensm/osm_sa_path_record.c: separate router guid resolution code
+27ea3c8 opensm: fix gcc-4.4.1 warnings
+c88bfd3 opensm/osm_lid_mgr.c: Fix typo in OSM_LOG message
+a9ea08c opensm/osm_mesh.c: Add dump_mesh routine at OSM_LOG_DEBUG level
+bc2a61e C++ style coding does not compile
+6647600 opensm: remove meanless 'const' keywords in APIs
+323a74f opensm/osm_qos_parser_y.y: fix endless loop
+0121a81 opensm: fix endless looping in mcast_mgr
+696c022 opensm: fix some obvious -Wsign-compare warnings
+b91e3c3 opensm/osm_get_port_by_lid(): don't bother with lmc
+ca582df opensm/osm_get_port_by_lid(): speedup a port lookup
+fd846ee opensm/osm_mesh.c: simplify compare_switches() function
+fe20080 osm_sa.c - void * arithmetic causes problems
+220130f osm_helper.c use explicit value for struct init
+0168ece use standard varargs syntax in macro OSM_LOG()
+180b335 update functions to match .h prototypes
+9240ef4 opensm/osm_ucast_lash: fix use after free bug
+6f1a21a opensm: osm_get_port_by_lid() helper
+c9e2818 opensm/osm_sa_path_record.c: validate multicast membership
+225dcf5 opensm/osm_mesh.c: Remove edges in lash matrix
+4dd928b opensm/osm_sa_mcmember_record.c: clean uninitialized variable use
+c48f0bc opensm/osm_perfmgr_db.c: Fix memory leak of db nodes
+82d3585 opensm/osm_notice.c: move logging code to separate function
+9557f60 opensm/osm_inform.c: For traps 64-67, use GID from DataDetails in
+ log message
+e2e78d9 opensm/opensm.8.in: Indicate default rule for Default partition
+08c5beb opensm/osm_sa_node_record.c: dump NodeInfo with debug verbosity
+1fe88f0 opensm/multicast: merge mcm_port and mcm_info
+ba75747 opensm/multicast: consolidate port addition/removing code
+5e61ab8 opensm: port object reference in mcm ports list
+5c5dacf opensm: fix uninitialized return value in osm_sm_mcgrp_leave()
+7cfe18d osm_ucast_ftree.c: Removed reverse_hop parameters from
+ fabric_route_upgoing_by_going_down
+aa7fb47 opensm/multicast: kill mc group to_be_deleted flag
+a4910fe opensm/osm_mcast_mgr.c: multicast routing by mlid - renaming
+1d14060 opensm/multicast: remove change id tracking
+5a84951 opensm: use mgrp pointer as osm_sm_mcgrp_join/leave() parameter
+d8e3ff5 opensm: use mgrp pointer in port mcm_info
+0631cd3 opensm doc: Indicated limited (rather than partial) partition
+ membership
+1010535 opensm/osm_ucast_lash.c: In lash_core, return status -1 for all errors
+942e20f opensm/osm_helper.c: Add SM priority changed into trap 144 description
+2372999 opensm/osm_ucast_mgr: better lft setup
+e268b32 opensm/osm_helper.c: Only change method when > rather than >=
+9309e8c complib/cl_event.c: change nanosec var type long
+d93b126 opensm/complib: account for nsec overflow in timeout values
+ef4c8ac opensm/osm_qos_policy.c: matching PR query to QoS level with pkey
+c93b58b opensm: fixing some data types in osm_req_get/set
+2b89177 opensm/libvendor/osm_vendor_ibumad.c: Handle umad_alloc failure in
+ osm_vendor_get
+2cba163 opensm/osm_helper.c: In osm_dump_dr_smp, fix endian of status
+47397e3 opensm/osm_sm_mad_ctrl.c: Fix endian of status in error message
+e83b7ca opensm/osm_mesh.c: Reorder switches for lash
+9256239 opensm/osm_trap_rcv.c: Validate trap is 144 before checking for
+ NodeDescription changed
+011d9ca opensm/osm_ucast_lash.c: Handle calloc failure in generate_cdg_for_sp
+59964d7 opensm: fixing handling of opt.max_wire_smps
+f4e3cd0 opensm/osm_ucast_lash.c: Directly call calloc/free rather than
+ create/delete_cdg
+5a208bd opensm/osm_ucast_lash.c: Added error numbers to some error log messages
+3b80d10 opensm/osm_helper.c: fix printing trap 258 details
+f682fe0 opensm: do not configure MFTs when mcast support is disabled
+cc42095 opensm/osm_sm_mad_ctrl.c: In sm_mad_ctrl_send_err_cb, indicate
+ failed attribute
+aebf215 opensm/osm_ucast_lash.c: Remove osm_mesh_node_delete call from
+ switch_delete
+1ef4694 opensm/osm_path.h: In osm_dr_path_init, only copy needed part of path
+c594a2d opensm: osm_dr_path_extend can fail due to invalid hop count
+46e5668 opensm/osm_lash: Fix use after free problem in osm_mesh_node_delete
+81841dc opensm/osm_ucast_lash.c: Handle malloc failures better
+2801203 opensm: remove extra "0x" from debug message.
+88821d2 opensm/main.c: Display SMSL when specified
+f814dcd opensm/osm_subnet.c: Format lash_start_vl consistent with other
+ uint8 items
+66669c9 opensm/main.c: Display LASH start VL when specified
+31bb0a7 opensm/osm_mcst_mgr.c: check number of switches only once
+75e672c opensm: find MC group by MGID using fleximap
+2b7260d Clarify the syntax of the hop_weights_file
+e6f0070 opensm/osm_mesh.c: Improve VL utilization
+27497a0 opensm/osm_ucast_ftree.c Fix assert comparing number of CAs to CN ports
+3b98131 opensm/osm_qos_policy.c: Use proper size in malloc in
+ osm_qos_policy_vlarb_scope_create
+e6f367d opensm/osm_ucast_ftree.c: Made error numbers unique in some log
+ messages
+83261a8 osm_ucast_ftree.c Count number of hops instead of calculating it
+7bdf4ff opensm/osm_sa_(path multipath)_record.c: Fix typo in a couple of
+ log messages
+0f8ed87 opensm/osm_ucast_mgr.c: Add error numbers to some error log messages
+0b5ccb4 complib/Makefile.am: prevent file duplications
+e0b8ec9 opensm/osm_sminfo_rcv.c: clean type of smi_rcv_process_get_sm()
+4d01005 opensm: sweep component processors return status value
+6ad8d78 opensm/libvendor/osm_vendor_(ibumad mlx)_sa.c: Handle malloc
+ failure in __osmv_send_sa_req
+cf97ebf opensm/osm_ucast_lash.(h c): Replace memory allocation by array
+957461c opensm/osm_sa.c add attribute and component mask to error message
+5d339a1 osm_dump.c dump port if lft is set up
+518083d osm_port.c: check if op_vls = 0 before max_op_vls comparison
+b6964cb opensm/osm_port.c: Change log level of Invalid OP_VLS 0 message
+ to VERBOSE
+b27568c opensm/PerfMgr: Reduce host name length
+bc495c0 opensm/osm_lid_mgr.c bug in opensm LID assignment
+5a466fd opensm/osm_perfmgr_db.c: Remove unneeded initialization in
+ perfmgr_db_print_by_name
+57cf328 opensm/osm_ucast_ftree.c Increase the size of the hop table
+8323cf1 opensm/PerfMgr: Remove some underbars from internal names
+65b1c15 opensm: Changes to spec and make files for updated release notes
+cd226c7 OpenSM: include/vendor/osm_vendor.h - Replaced #elif with no
+ condition by #else
+9f8bd4a management: Fixed custom_release in SPEC files
+c0b8207 opensm/PerfMgr: Change redir_tbl_size to num_ports for better clarity
+596bb08 opensm/osm_sa.c: check for SA DB file only if requested
+2f2bd4e opensm SA DB dump/restore: load SA DB only once
+4abcbf2 opensm: Added print_desc to various log messages
+5e3d235 opensm/osm_vendor_ibumad.c: Move error info into single message
+8e5ca10 opensm/libvendor//osm_vendor_ibumad_sa.c: uninitialized fields
+d13c2b6 opensm/osm_sm_mad_ctrl.c Changes to some error messages
+f79d315 opensm/osm_sm_mad_ctrl.c: Add missing call to return mad to mad pool
+150a9b1 opensm/osm_sa_mcmember_record.c: print mcast join/create failures in
+ VERBOSE instead of DEBUG level
+9b7882a opensm/osm_vendor_ibumad.c: Change LID format to decimal in log message
+5256c43 opensm/osm_vendor_mlx: fix compilation error
+93db10d opensm/osm_vendor_mlx_txn.c: eliminate bunch of compilation warnings
+156fdc1 opensm/osm_helper.c Log format changes
+7a55434 opensm/osm_ucast_ftree.c Changed log level
+a1694de opensm/osm_state_mgr.c Added more info to some error messages
+fdec20a opensm/osm_trap_rcv.c: Eliminate heavy sweep on receipt of trap 145
+13a32a7 opensm - standardize on a single Windows #define - take #2
+b236a10 opensm/osm_db_files.c: kill useless malloc() castings
+4ba0c26 opensm/osm_db_files.c: add '/' path delimited
+e3b98a5 opensm/osm_sm_mad_ctrl.c: Fix qp0_mads_accounting
+dbbe5b3 opensm/osm_subnet.c: fixing bug in dumping options file
+f22856a opensm/osm_ucast_mgr.c: fix memory leak
+0d5f0b6 opensm: osm_get_mgrp_by_mgid() helper
+e3c044a osm_sa_mcmember_record.c: pass MCM Record data to mlid allocator
+3dda2dc opensm/osm_sa_member_record.c: mlid independent MGID generator
+1f95a3c opensm/osm_sa_mcmember_record.c: move mgid allocation code
+b78add1 complib: replace intn_t types by C99 intptr_t
+a864fd3 osmtest/osmt_mtl_regular_qp.c: cleaning uintn_t use
+9e01318 opensm/osm_console.c: make const functions
+f8c4c3e opensm/osm_mgrp_new(): add subnet db insertion
+80da047 complib/fleximap: make compar callback to return int
+bf7fe2d opensm: cleanup intn_t uses
+0862bba opensm/main.c: opensm cannot be killed while asking for port guid
+2b70193 opensm/complib: bug in cl_list_insert_array_head/tail functions
+4764199 opensm - use C99 transportable data type for pointer storage
+a9c326c opensm/osm_state_mgr.c: do not probe remote side of port 0
+4945706 opensm/osm_mcast_mgr.c: fix return value on alloc_mfts() failures
+8312a24 OpenSM: Fix unused variable compiler warning.
+ab8f0a3 opensm/partition: keep multicast group pointer
+a817430 opensm: Only clear SMP beyond end of PortInfo attribute
+52fb6f2 opensm/osm_switch.h: Remove dead osm_switch_get_physp_ptr routine
+aa6d932 opensm/osm_mcast_tbl.c: In osm_mcast_tbl_clear_mlid, use memset to
+ clear port mask entry
+2ad846b opensm/osm_trap_rcv.c: use source_lid and port_num for logging
+b9d7756 opensm/osm_mcast_tbl: Fix size of port mask table array
+11c0a9b opensm/main.c: Use strtoul rather than strtol for parsing transaction
+ timeout
+0608af9 opensm/osm_sm_mad_ctrl.c: In sm_mad_ctrl_send_err_cb, revert setting
+ of init failure on QoS initialization failures
+c6b4d4a opensm/osm_vendor_ibumad.c: Add transaction ID to osm_vendor_send
+ log message
+520af84 opensm/osm_sa_path_record.c: don't set dgid pointer for local subnet
+4a878fb opensm/osm_mcast_mgr.c: fix osm_mcast_mgr_compute_max_hops for
+ managed switch
+
+* Other less critical or visible bugs were also fixed.
+
+5 Main Verification Flows
+-------------------------
+
+OpenSM verification is run using the following activities:
+* osmtest - a stand-alone program
+* ibmgtsim (IB management simulator) based - a set of flows that
+ simulate clusters, inject errors and verify OpenSM capability to
+ respond and bring up the network correctly.
+* small cluster regression testing - where the SM is used on back to
+ back or single switch configurations. The regression includes
+ multiple OpenSM dedicated tests.
+* cluster testing - where OpenSM is run to set up a large cluster, perform
+ hand-off, reboots and reconnects, verify routing correctness and SA
+ responsiveness at the ULP level (IPoIB and SDP).
+
+5.1 osmtest
+
+osmtest is an automated verification tool used for OpenSM
+testing. Its verification flows are described in the list below.
+
+* Inventory File: Obtain and verify all port info, node info, link and path
+ records parameters.
+
+* Service Record:
+ - Register new service
+ - Register another service (with a lease period)
+ - Register another service (with service p_key set to zero)
+ - Get all services by name
+ - Delete the first service
+ - Delete the third service
+ - Bad flows: get/delete of a non-valid service
+ - Add / Get same service with different data
+ - Add / Get / Delete by different component mask values (services
+ by Name & Key / Name & Data / Name & Id / Id only )
+
+* Multicast Member Record:
+ - Query of existing Groups (IPoIB)
+ - BAD Join with insufficient comp mask (o15.0.1.3)
+ - Create given MGID=0 (o15.0.1.4)
+ - Create given MGID=0xFF12A01C,FE800000,00000000,12345678 (o15.0.1.4)
+ - Create BAD MGID=0xFA. (o15.0.1.6)
+ - Create BAD MGID=0xFF12A01B w/ link-local not set (o15.0.1.6)
+ - New MGID with invalid join state (o15.0.1.9)
+ - Retry of existing MGID - See JoinState update (o15.0.1.11)
+ - BAD RATE when connecting to existing MGID (o15.0.1.13)
+ - Partial JoinState delete request - removing FullMember (o15.0.1.14)
+ - Full Delete of a group (o15.0.1.14)
+ - Verify Delete by trying to Join deleted group (o15.0.1.14)
+ - BAD Delete of IPoIB membership (no prev join) (o15.0.1.15)
+
+* GUIDInfo Record:
+ - All GUIDInfoRecords in subnet are obtained
+
+* MultiPathRecord:
+ - Perform some compliant and noncompliant MultiPathRecord requests
+ - Validation is via status in responses and IB analyzer
+
+* PKeyTableRecord:
+ - Perform some compliant and noncompliant PKeyTableRecord queries
+ - Validation is via status in responses and IB analyzer
+
+* LinearForwardingTableRecord:
+ - Perform some compliant and noncompliant LinearForwardingTableRecord queries
+ - Validation is via status in responses and IB analyzer
+
+* Event Forwarding: Register for trap forwarding using reports
+ - Send a trap and wait for report
+ - Unregister non-existing
+
+* Trap 64/65 Flow: Register to Trap 64-65, create traps (by
+ disconnecting/connecting ports) and wait for report, then unregister.
+
+* Stress Test: send PortInfoRecord queries, both single and RMPP and
+ check for the rate of responses as well as their validity.
+
+
+5.2 IB Management Simulator OpenSM Test Flows:
+
+The simulator provides the ability to simulate the SM handling of virtual
+topologies that are not limited to actual lab equipment availability.
+OpenSM was simulated to bring up clusters of up to 10,000 nodes. Daily
+regressions use smaller clusters (16 and 128 nodes).
+
+The following test flows are run on the IB management simulator:
+
+* Stability:
+ Up to 12 links from the fabric are randomly selected to drop packets
+ at drop rates up to 90%. The SM is required to succeed in bringing the
+ fabric up. The resulting routing is verified to be correct as well.
+
+* LID Manager:
+ Using LMC = 2 the fabric is initialized with LIDs. Faults such as
+ zero LID, Duplicated LID, non-aligned (to LMC) LIDs are
+ randomly assigned to various nodes and other errors are randomly
+ output to the guid2lid cache file. The SM sweep is run 5 times and
+ after each iteration a complete verification is made to ensure that all
+ LIDs that could possibly be maintained are kept, as well as that all nodes
+ were assigned a legal LID range.
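+
+ The fault types named above can be illustrated with a minimal Python
+ sketch. It is not part of the simulator: lid_range, is_aligned and
+ find_faults are hypothetical helpers, and the port names are made up.
+ It relies only on the InfiniBand rule that a port with a given LMC
+ owns 2^LMC consecutive LIDs starting at a base LID whose low LMC bits
+ are zero.
+
```python
def lid_range(base_lid, lmc):
    """Return the 2**lmc consecutive LIDs owned by a port."""
    return list(range(base_lid, base_lid + (1 << lmc)))


def is_aligned(base_lid, lmc):
    """A base LID is aligned when its low `lmc` bits are all zero."""
    return base_lid & ((1 << lmc) - 1) == 0


def find_faults(assignments, lmc):
    """Classify faulty entries in a {port_name: base_lid} mapping."""
    faults = {}
    seen = set()
    for port, base in assignments.items():
        if base == 0:
            faults[port] = "zero LID"
        elif not is_aligned(base, lmc):
            faults[port] = "non-aligned LID"
        elif base in seen:
            faults[port] = "duplicated LID"
        else:
            seen.add(base)
    return faults


if __name__ == "__main__":
    # With LMC = 2 every port owns four LIDs, so base LIDs are multiples of 4.
    print(lid_range(8, 2))
    print(find_faults({"a": 8, "b": 0, "c": 6, "d": 8}, lmc=2))
```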
+
+* Multicast Routing:
+ Nodes randomly join the 0xc000 group and eventually the
+ resulting routing is verified for completeness and adherence to
+ Up/Down routing rules.
+
+* osmtest:
+ The complete osmtest flow as described in section 5.1 is run on
+ the simulated fabrics.
+
+* Stress Test:
+ This flow merges fabric, LID and stability issues with continuous
+ PathRecord, ServiceRecord and Multicast Join/Leave activity to
+ stress the SM/SA during continuous sweeps. InformInfo Set/Delete/Get
+ were added to the test such that both existing and non-existing nodes
+ perform them in random order.
+
+5.3 OpenSM Regression
+
+Using a back-to-back or single switch connection, the following set of
+tests is run nightly on the stacks described in Table 2. The included
+tests are:
+
+* Stress Testing: Flood the SA with queries from multiple channel
+ adapters to check the robustness of the entire stack up to the SA.
+
+* Dynamic Changes: Dynamic Topology changes, through randomly
+ dropping SMP packets, used to test OpenSM adaptation to an unstable
+ network & verify DB correctness.
+
+* Trap Injection: This flow injects traps to the SM and verifies that it
+ handles them gracefully.
+
+* SA Query Test: This test exhaustively checks the SA responses to every
+ possible single component mask. To do that the test examines the
+ entire set of records the SA can provide, classifies them by their
+ field values and then selects every field (using component mask and a
+ value) and verifies that the response matches the expected set of records.
+ A random selection using multiple component mask bits is also performed.
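+
+ The single-component-mask check can be illustrated with a minimal
+ Python sketch. It is not the regression test itself: sa_query is a
+ stand-in for a real SA query, records are plain dictionaries that all
+ share the same fields, and both function names are hypothetical.
+
```python
def sa_query(records, field, value):
    """Stand-in for an SA query with a single component mask bit set."""
    return [r for r in records if r[field] == value]


def check_single_component_masks(records, query=sa_query):
    """Query every (field, value) pair and compare with the expected set.

    Returns the list of (field, value) pairs whose response was wrong.
    """
    failures = []
    fields = sorted({f for r in records for f in r})
    for field in fields:
        # Classify the records by the values this field takes.
        for value in sorted({r[field] for r in records}):
            expected = [r for r in records if r[field] == value]
            if query(records, field, value) != expected:
                failures.append((field, value))
    return failures


if __name__ == "__main__":
    records = [{"lid": 1, "port": 1}, {"lid": 2, "port": 1}]
    print(check_single_component_masks(records))
```
+
+ An empty result means every single-field query returned exactly the
+ records that carry the selected value.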
+
+5.4 Cluster testing:
+
+Cluster testing is usually run before a distribution release. It
+involves real hardware setups of 16 to 32 nodes (or more if a beta site
+is available). Each test is validated by running all-to-all ping through the IB
+interface. The test procedure includes:
+
+* Cluster bringup
+
+* Hand-off between 2 or 3 SMs while performing:
+ - Node reboots
+ - Switch power cycles (disconnecting the SMs)
+
+* Unresponsive port detection and recovery
+
+* osmtest from multiple nodes
+
+* Trap injection and recovery
+
+
+6 Qualified Software Stacks and Devices
+---------------------------------------
+
+OpenSM Compatibility
+--------------------
+Note that OpenSM version 3.2.1 and earlier used a value of 1 in host
+byte order for the default SM_Key, so there is a compatibility issue
+with these earlier versions of OpenSM when the 3.2.2 or later version
+is running on a little endian machine. This affects SM handover as well
+as SA queries (saquery tool in infiniband-diags).
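+
+The byte-order mismatch can be illustrated with a minimal Python sketch.
+It assumes a 64-bit SM_Key and a peer that reads the key in network
+(big-endian) byte order; that interpretation is an assumption made for
+illustration, and wire_value is a hypothetical helper, not an OpenSM
+function.
+
```python
def wire_value(sm_key, host_byteorder):
    """Value a network-order reader sees for a key stored in host order."""
    raw = sm_key.to_bytes(8, host_byteorder)   # 64-bit key as stored by the host
    return int.from_bytes(raw, "big")          # reader assumes network order


if __name__ == "__main__":
    # On a big-endian host, host order and network order agree.
    print(hex(wire_value(1, "big")))
    # On a little-endian host the same key is read as a different value.
    print(hex(wire_value(1, "little")))
```
+
+A key of 1 stored by a little-endian host is therefore not seen as 1 by a
+peer that expects network byte order, so the two sides disagree on the
+SM_Key.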
+
+
+Table 2 - Qualified IB Stacks
+=============================
+
+Stack                                    | Version
+-----------------------------------------|--------------------------
+The mainstream Linux kernel              | 2.6.x
+OFED                                     | 1.4
+OFED                                     | 1.3
+OFED                                     | 1.2
+OFED                                     | 1.1
+OFED                                     | 1.0
+
+Table 3 - Qualified Devices and Corresponding Firmware
+======================================================
+
+Mellanox
+Device                              | FW versions
+------------------------------------|-------------------------------
+InfiniScale                         | fw-43132 5.2.000 (and later)
+InfiniScale III                     | fw-47396 0.5.000 (and later)
+InfiniScale IV                      | fw-48436 7.1.000 (and later)
+InfiniHost                          | fw-23108 3.5.000 (and later)
+InfiniHost III Lx                   | fw-25204 1.2.000 (and later)
+InfiniHost III Ex (InfiniHost Mode) | fw-25208 4.8.200 (and later)
+InfiniHost III Ex (MemFree Mode)    | fw-25218 5.3.000 (and later)
+ConnectX IB                         | fw-25408 2.3.000 (and later)
+
+QLogic/PathScale
+Device  | Note
+--------|-----------------------------------------------------------
+iPath   | QHT6040 (PathScale InfiniPath HT-460)
+iPath   | QHT6140 (PathScale InfiniPath HT-465)
+iPath   | QLE6140 (PathScale InfiniPath PE-880)
+iPath   | QLE7240
+iPath   | QLE7280
+
+Note 1: OpenSM does not run on an IBM Galaxy (eHCA) as it does not expose
+QP0 and QP1. However, it does support it as a device on the subnet.
+
+Note 2: QoS firmware and Mellanox devices
+
+HCAs: QoS supported by ConnectX. QoS-enabled FW release is 2_5_000 and
+later.
+
+Switches: QoS supported by InfiniScale III
+Any InfiniScale III FW that is supported by OpenSM supports QoS.
--- /dev/null
+ Open Fabrics Enterprise Distribution (OFED)
+ qib in OFED 1.5.1 Release Notes
+
+ March 2010
+
+======================================================================
+1. Overview
+======================================================================
+qib is the low level driver implementation for all QLogic InfiniPath
+PCI-Express HCAs: gen 1 x8 SDR QLE7140, gen 1 x8 DDR QLE7240,
+gen 1 x16 DDR QLE7280, gen 2 x8 QDR QLE7340 and QLE7342.
+
+The qib driver is new for OFED 1.5.
+
+The qib kernel driver obsoletes the ipath kernel driver but is
+compatible with libipathverbs so no new user level components are needed.
--- /dev/null
+Distribution
+ Open Fabrics Enterprise Distribution (OFED) 1.5, December 2009
+
+Summary
+ qperf - Measure RDMA and IP performance
+
+Overview
+ qperf measures bandwidth and latency between two nodes. It can work over
+ TCP/IP as well as the RDMA transports.
+
+Quick Start
+ * Since qperf measures latency and bandwidth between two nodes, you need
+ access to two nodes. Assume they are called node1 and node2.
+
+ * On node1, run qperf without any arguments. It will act as a server and
+ continue to run until asked to quit.
+
+ * To measure TCP bandwidth between the two nodes, on node2, type:
+ qperf node1 tcp_bw
+
+ * To measure RDMA RC latency, type (on node2):
+ qperf node1 rc_lat
+
+ * To measure RDMA UD latency using polling, type (on node2):
+ qperf node1 -P 1 ud_lat
+
+ * To measure SDP bandwidth, on node2, type:
+ qperf node1 sdp_bw
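+
+The client-side steps above can be collected in a small wrapper. This is only
+a sketch: run_qperf_suite and the QPERF override variable are hypothetical
+names, and it assumes that qperf accepts several test names in one invocation
+and that a qperf server is already running on the target node.
+
```shell
# Run a fixed set of bandwidth/latency tests against a qperf server.
# QPERF is a hypothetical override (e.g. QPERF=echo) to preview the command line.
run_qperf_suite() {
    server=$1
    "${QPERF:-qperf}" "$server" tcp_bw tcp_lat rc_bw rc_lat
}

# Usage (on node2, with "qperf" already running on node1):
#   run_qperf_suite node1
```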
+
+Documentation
+ * Man page available. Type
+ man qperf
+
+ * To get a list of examples, type:
+ qperf --help examples
+
+ * To get a list of tests, type:
+ qperf --help tests
+
+Tests
+ Miscellaneous
+ conf Show configuration
+ quit Cause the server to quit
+ Socket Based
+ rds_bw RDS streaming one way bandwidth
+ rds_lat RDS one way latency
+ sctp_bw SCTP streaming one way bandwidth
+ sctp_lat SCTP one way latency
+ sdp_bw SDP streaming one way bandwidth
+ sdp_lat SDP one way latency
+ tcp_bw TCP streaming one way bandwidth
+ tcp_lat TCP one way latency
+ udp_bw UDP streaming one way bandwidth
+ udp_lat UDP one way latency
+ RDMA Send/Receive
+ ud_bw UD streaming one way bandwidth
+ ud_bi_bw UD streaming two way bandwidth
+ ud_lat UD one way latency
+ rc_bw RC streaming one way bandwidth
+ rc_bi_bw RC streaming two way bandwidth
+ rc_lat RC one way latency
+ uc_bw UC streaming one way bandwidth
+ uc_bi_bw UC streaming two way bandwidth
+ uc_lat UC one way latency
+ RDMA
+ rc_rdma_read_bw RC RDMA read streaming one way bandwidth
+ rc_rdma_read_lat RC RDMA read one way latency
+ rc_rdma_write_bw RC RDMA write streaming one way bandwidth
+ rc_rdma_write_lat RC RDMA write one way latency
+ rc_rdma_write_poll_lat RC RDMA write one way polling latency
+ uc_rdma_write_bw UC RDMA write streaming one way bandwidth
+ uc_rdma_write_lat UC RDMA write one way latency
+ uc_rdma_write_poll_lat UC RDMA write one way polling latency
+ InfiniBand Atomics
+ rc_compare_swap_mr RC compare and swap messaging rate
+ rc_fetch_add_mr RC fetch and add messaging rate
+ Verification
+ ver_rc_compare_swap Verify RC compare and swap
+ ver_rc_fetch_add Verify RC fetch and add
--- /dev/null
+ Open Fabrics Enterprise Distribution (OFED)
+ RDMA CM in OFED 1.5 Release Notes
+
+ July 2010
+
+
+===============================================================================
+Table of Contents
+===============================================================================
+1. Overview
+2. New Features
+3. Known Issues
+
+===============================================================================
+1. Overview
+===============================================================================
+The RDMA CM is a communication manager used to set up reliable, connected
+and unreliable datagram data transfers. It provides an RDMA transport
+neutral interface for establishing connections. The API is based on sockets,
+but adapted for queue pair (QP) based semantics: communication must be
+over a specific RDMA device, and data transfers are message based.
+
+
+The RDMA CM only provides the communication management (connection setup /
+teardown) portion of an RDMA API. It works in conjunction with the verbs
+API for data transfers.
+
+===============================================================================
+2. New Features
+===============================================================================
+For OFED 1.5.2:
+
+Several enhancements were added to librdmacm release 1.0.12 that
+are intended to simplify using RDMA devices and address scalability issues.
+These changes were in response to long standing requests to make
+connection establishment 'more like sockets'. For full details,
+users should refer to the appropriate man pages. Major changes include:
+
+* Support synchronous operation for library calls. Users can control
+ whether an rdma_cm_id operates asynchronously or synchronously based on
+ the rdma_event_channel parameter. Use of synchronous operations
+ reduces the amount of application code required to use the librdmacm
+ by eliminating the need for event processing code.
+
+ An rdma_cm_id will be marked for synchronous operation if the
+ rdma_event_channel parameter is NULL for rdma_create_id or
+ rdma_migrate_id. Users can toggle between synchronous and
+ asynchronous operation through the rdma_migrate_id call.
+
+ Calls that operate synchronously include rdma_resolve_addr,
+ rdma_resolve_route, rdma_connect, rdma_accept, and rdma_get_request.
+ Synchronous event data is returned to the user through the
+ rdma_cm_id.
+
+* The addition of a new API: rdma_getaddrinfo. This call is modeled
+ after getaddrinfo, but for RDMA devices and connections. It has the
+ following notable deviations from getaddrinfo:
+
+ A source address is returned as part of the call to allow the
+ user to allocate necessary local HW resources for connections.
+
+ Optional routing information may be returned to support
+ InfiniBand fabrics. IB routing information includes necessary
+ path record data. rdma_getaddrinfo will obtain this information
+ if IB ACM support (see below) is enabled. The use of IB ACM
+ is not required for rdma_getaddrinfo.
+
+ rdma_getaddrinfo provides future extensions to support
+ more complex address and route resolution mechanisms, such as
+ multiple path support and failover.
+
+* Support for new APIs: rdma_get_request, rdma_create_ep, and
+ rdma_destroy_ep. rdma_get_request simplifies the passive side
+ implementation by adding synchronous support for accepting new
+ connections. rdma_create_ep combines the functionality of
+ rdma_create_id, rdma_create_qp, rdma_resolve_addr, and rdma_resolve_route
+ in a single API that uses the output of rdma_getaddrinfo as its input.
+
+* Support for optional parameters. To simplify support for casual RDMA
+ developers and researchers, the librdmacm can allocate protection
+ domains, completion queues, and queue pairs on a user's behalf.
+ This simplifies the amount of information that a developer
+ must learn in order to use RDMA, plus allows the user to take
+ advantage of higher-level completion processing abstractions.
+
+ In addition to optional parameters, a user can also specify that the
+ librdmacm should automatically select usable values for RDMA read
+ operations.
+
+* Add support for IB ACM. IB ACM (InfiniBand Assistant for Communication
+ Management) defines a socket based protocol to an IB address and route
+ resolution service. One implementation of that service is provided
+ separately by the ibacm package, but anyone can implement the service
+ provided that they adhere to the IB ACM socket protocol. IB ACM is an
+ experimental service targeted at increasing the scalability of applications
+ running on a large cluster.
+
+ Use of IB ACM is not required and is controlled through the build option
+ '--with-ib_acm'. If the librdmacm fails to contact the IB ACM service, it
+ reverts to using kernel services to resolve address and routing data.
+
+* Add RDMA helper routines. The librdmacm provides a set of simpler verbs
+ calls for posting work requests, registering memory, and checking for
+ completions. These calls are wrappers around libibverbs routines.
+
+===============================================================================
+3. Known Issues
+===============================================================================
+The RDMA CM relies on the operating system's network configuration tables to
+map IP addresses to RDMA devices. An incorrect network configuration can
+result in the RDMA CM being unable to locate the correct
+RDMA device. Currently, the RDMA CM only supports IPv4 addressing.
+
+All RDMA interfaces must provide a way to map IP addresses to an RDMA device.
+For InfiniBand, this is done using IPoIB, and requires correctly configured
+IPoIB device interfaces sharing the same multicast domain. For details on
+configuring IPoIB, refer to ipoib_release_notes.txt. For RDMA devices to
+communicate, they must support the same underlying network and data link
+layers.
+
+If you experience problems using the RDMA CM, you may want to check the
+following:
+
+ * Verify that you have IP connectivity over the RDMA devices. For example,
+ ping between iWarp or IPoIB devices.
+
+ * Ensure that IP network addresses assigned to RDMA devices do not
+ overlap with IP network addresses assigned to standard Ethernet devices.
+
+ * For multicast issues, either bind directly to a specific RDMA device, or
+ configure the IP routing tables to route multicast traffic over an RDMA
+ device's IP address.
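+
+The overlap check in the second item can be approximated with shell
+arithmetic. This is a minimal sketch for IPv4 only; the function names are
+hypothetical, and the addresses and prefix lengths are expected to be taken
+by hand from the interface configuration.
+
```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS; IFS=.
    set -- $1
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# network_of ADDR PREFIXLEN: print the network number as an integer.
network_of() {
    addr=$(ip_to_int "$1")
    mask=$(( (0xFFFFFFFF << (32 - $2)) & 0xFFFFFFFF ))
    echo $(( addr & mask ))
}

# networks_overlap ADDR_A PREFIX_A ADDR_B PREFIX_B
# Two networks overlap when they agree on the shorter of the two prefixes.
networks_overlap() {
    p=$2
    if [ "$4" -lt "$p" ]; then p=$4; fi
    [ "$(network_of "$1" "$p")" = "$(network_of "$3" "$p")" ]
}

# Usage (addresses are placeholders):
#   networks_overlap 192.168.1.10 24 192.168.1.200 24 && echo "overlap"
```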
+
--- /dev/null
+ Open Fabrics Enterprise Distribution (OFED)
+ RDS in OFED 1.5.1 Release Notes
+ March 2010
+
+
+===============================================================================
+Table of Contents
+===============================================================================
+1. Overview
+2. Supported Platforms
+3. Installation & Configuration
+4. New Features
+5. Bug fixes and Enhancements since OFED 1.4
+6. Bug fixes and Enhancements since OFED 1.3.1
+7. Bug fixes and Enhancements since OFED 1.3
+8. Bug fixes and Enhancements since OFED 1.2
+9. Known Issues
+
+===============================================================================
+1. Overview
+===============================================================================
+RDS (Reliable Datagram Sockets) is a socket API that provides reliable,
+in-order datagram delivery between sockets over a variety of transports.
+For details see RDS_README.txt and man 7 rds.
+
+===============================================================================
+2. Supported Platforms
+===============================================================================
+
+Same as overall OFED release.
+
+===============================================================================
+3. Installation & Configuration
+===============================================================================
+To install RDS select rds in OFED's manual installation or put 'rds=y' in the
+ofed.conf for unattended installation.
+
+To load the RDS module at boot, edit the file '/etc/infiniband/openib.conf' as
+follows:
+
+# Load RDS module
+RDS_LOAD=yes
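+
+The edit above can also be scripted. The helper below is a hypothetical
+sketch, not part of OFED: it assumes the file consists of KEY=value lines and
+that GNU sed (with the -i option) is available.
+
```shell
# set_conf_option FILE KEY VALUE
# Replace an existing KEY=... line, or append KEY=VALUE if none exists.
set_conf_option() {
    file=$1 key=$2 value=$3
    if grep -q "^${key}=" "$file" 2>/dev/null; then
        sed -i "s|^${key}=.*|${key}=${value}|" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Usage (as root):
#   set_conf_option /etc/infiniband/openib.conf RDS_LOAD yes
```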
+
+===============================================================================
+4. New Features
+===============================================================================
+
+GET_MR_FOR_DEST sockopt added. This allows a MR to be associated with
+a remote host. GET_MR sockopt deprecated.
+
+Transports now modularized: rds_rdma.ko (IB and iWARP) and
+rds_tcp.ko. This enables RDS use with TCP, without the IB stack
+loaded.
+
+Improved receive processing to lower amount of time spent with interrupts
+disabled.
+
+===============================================================================
+5. Bug fixes and Enhancements since OFED 1.4
+===============================================================================
+
+* Set retry_count to 2 and make modifiable via modparam
+* Many locking fixes
+* Rebasing to mainline kernel 2.6.30 resulted in the rds trace framework
+ being removed.
+
+===============================================================================
+6. Bug fixes and Enhancements since OFED 1.3.1
+===============================================================================
+- RDMA completion notifications are signalled when the IB stack gives us the
+ completion event for the accompanying RDS message. This is a change from the
+ 1.3.x behavior, which signalled completion notifications when the RDS message
+ was ACKed.
+- Fixed bugs associated with congestion monitoring.
+- FMR pool size increased from 2K to 4K
+- Added support for RDMA_CM_EVENT_ADDR_CHANGE event.
+- RDS should now work on Qlogic HCAs.
+
+===============================================================================
+7. Bug fixes and Enhancements since OFED 1.3
+===============================================================================
+- Fix a bug in RDMA signaling
+- Add 3 more stats counters
+- Fix a kernel crash that can occur when RDS/IB connection drops
+- Fixes for RDMA API
+
+===============================================================================
+8. Bug fixes and Enhancements since OFED 1.2
+===============================================================================
+
+1) Wire protocol for RDS v3 and RDS v2 are not compatible.
+
+2) RDS over TCP is disabled in OFED 1.3. We will re-enable in future release.
+
+3) Congestion monitoring support gives the application more fine-grained
+ control.
+
+With explicit monitoring, the application polls for POLLIN as before, and
+additionally uses the RDS_CONG_MONITOR socket option to install a 64bit mask
+value in the socket, where each bit corresponds to a group of ports.
+When a congestion update arrives, RDS checks the set of ports that became
+uncongested against the bit mask installed in the socket. If they overlap, a
+control message is enqueued on the socket, and the application is woken up.
+When the application calls recvmsg(2), it is given the control message
+containing the bitmap on the socket.
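+
+The 64-bit mask arithmetic can be illustrated as follows. The text above does
+not say how ports map to bits; this sketch assumes "port modulo 64", which is
+believed to match the RDS_CONG_MONITOR_BIT macro in the Linux rds.h header.
+The function names are hypothetical, and installing the mask on a socket
+still requires a setsockopt call from the application.
+
```shell
# cong_mask PORT...: OR together the monitor bits for the given ports.
# Assumption: a port maps to bit (port % 64). Bit 63 prints as a negative
# decimal number because shell arithmetic is signed.
cong_mask() {
    mask=0
    for port in "$@"; do
        mask=$(( mask | (1 << (port % 64)) ))
    done
    echo "$mask"
}

# cong_overlap SOCKET_MASK UNCONGESTED_MASK: succeed if any bit is shared,
# i.e. the case in which a control message would be queued on the socket.
cong_overlap() {
    [ $(( $1 & $2 )) -ne 0 ]
}

# Usage:
#   mask=$(cong_mask 4000 4001)
#   cong_overlap "$mask" "$(cong_mask 4001)" && echo "wake application"
```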
+
+===============================================================================
+9. Known Issues
+===============================================================================
+1. RDMAs over 1 MiB not supported.
--- /dev/null
+ Open Fabrics Enterprise Distribution (OFED)
+ SDP in MLNX_OFED 1.5.2 Release Notes
+
+ December 2010
+
+
+
+===============================================================================
+Table of Contents
+===============================================================================
+1. Overview
+2. Bug Fixes and Enhancements since OFED 1.5.2
+3. ZCopy
+4. Known Issues
+5. Verification Applications/Flows/Tests
+6. Module Parameters
+
+===============================================================================
+1. Overview
+===============================================================================
+Sockets Direct Protocol (SDP) is an InfiniBand byte-stream transport protocol
+that provides TCP stream semantics. Capable of utilizing InfiniBand's advanced
+protocol offload capabilities, SDP can provide lower latency, higher bandwidth,
+and lower CPU utilization than IPoIB or Ethernet when running some
+sockets-based applications.
+
+SDP in OFED is at GA level for MLNX OFED 1.5.2
+
+===============================================================================
+2. Main Features and Changes
+===============================================================================
+- Added support for Inline and blueflame
+- Improved stability
+- Bug fixes
+
+===============================================================================
+2. Bug Fixes and Enhancements since OFED 1.5.2
+===============================================================================
+* Cleanups
+ - Added support for 2.6.34 / 2.6.36.
+
+* Bug Fixes
+ - Fixed compilation problems on 32 bit hosts
+ - Do not compile in debug mode when not asked.
+ - Improved recovery from errors.
+
+* Enhancements
+ - more statistics in /proc/sdpstats
+ - added debugfs for sdp:
+ - sdpprf was moved from /proc to debugfs/sdp
+ - debugfs/<socket_id> - Socket history
+
+
+===============================================================================
+3. ZCopy
+===============================================================================
+- ZCopy is enabled by default for blocks larger than 64K. ZCopy can be disabled
+ by setting the module parameter sdp_zcopy_thresh to zero; the threshold can be
+ changed by setting it to another non-zero value.
+
+- ZCOPY mode gives good performance for large blocks with very small cpu
+ utilization. When in use, all messages longer than 'sdp_zcopy_thresh' bytes
+ in length will cause the user space buffer to be pinned and the data sent
+ directly from the original buffer. This results in less CPU usage and on many
+ systems in enhanced bandwidth.
+ ZCOPY is most efficient with multi stream jobs and it performs better as the
+ message size increases.
+ The default 64K value for 'sdp_zcopy_thresh' is sometimes too low for some
+ systems. You must experiment with your hardware to select the best value.
+
+- ZCOPY vs BCOPY:
+ ZCOPY is more efficient on systems with weak CPUs and with multi-stream
+ jobs, whereas BCOPY is more efficient for a single stream.
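+
+The threshold rule described above can be written out as a small decision
+function. This is an illustrative sketch only; sdp_copy_mode is a
+hypothetical name, and the real decision is made inside the driver.
+
```shell
# sdp_copy_mode MSG_BYTES [THRESH_BYTES]
# Messages longer than the threshold use ZCOPY; a threshold of 0 disables it.
# The default of 65536 is the 64K default of sdp_zcopy_thresh.
sdp_copy_mode() {
    len=$1 thresh=${2:-65536}
    if [ "$thresh" -ne 0 ] && [ "$len" -gt "$thresh" ]; then
        echo zcopy
    else
        echo bcopy
    fi
}

# Usage:
#   sdp_copy_mode 131072         # message of 128K with the default threshold
#   sdp_copy_mode 131072 0       # ZCopy disabled
```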
+
+===============================================================================
+4. Known Issues
+===============================================================================
+- SDP is at beta level on Infinihost HCA family
+
+- Occasionally, socket bind fails with EINVAL. Although the TCP socket is bound
+ successfully, SDP is occupied, thus causing the socket bind failure.
+ See Bugzilla 2159 and Bugzilla 2160
+
+- When SO_REUSEADDR is set, only a single socket can be bound to IP_ANY and a
+ specific port. This is a TCP limitation, unless one of the sockets is
+ listening.
+
+- BUG 1331 - Although TCP allows connecting to IP_ANY - 0.0.0.0
+ (as a destination address!), SDP does not allow connecting to the IP_ANY
+ and rejects the connection.
+
+- BUG 1444 - The setsockopt(SO_RCVBUF) is not functional in sdp socket.
+ To limit top system wide sdp memory usage for recv,
+ use the module parameter top_mem_usage.
+
+- Each SDP socket currently consumes up to 2 MBytes of memory. If this value
+ is high for your installation, it is possible to trade off performance
+ for lower memory utilization per socket by reducing the value of the
+ "rcvbuf_scale" module parameter (default: 16).
+
+ Note: The minimum legal value for the "rcvbuf_scale" module is 1.
+ At this parameter value, each socket will consume approximately 128 KBytes.
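+
+The two figures quoted above (up to 2 MBytes at the default rcvbuf_scale of
+16, approximately 128 KBytes at 1) suggest about 128 KBytes per scale unit.
+The sketch below assumes that linear relationship, which these notes do not
+state explicitly, so treat the results as rough upper-bound estimates. The
+function names are hypothetical.
+
```shell
# sdp_socket_mem_kb [RCVBUF_SCALE]: estimated per-socket memory in KBytes,
# assuming 128 KBytes per unit of rcvbuf_scale (default scale: 16).
sdp_socket_mem_kb() {
    scale=${1:-16}
    echo $(( scale * 128 ))
}

# sdp_total_mem_mb SOCKETS [RCVBUF_SCALE]: estimated total in MBytes.
sdp_total_mem_mb() {
    echo $(( $1 * ${2:-16} * 128 / 1024 ))
}

# Usage:
#   sdp_socket_mem_kb 4        # per-socket estimate at rcvbuf_scale=4
#   sdp_total_mem_mb 1000 4    # estimate for 1000 sockets at the same scale
```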
+
+- Small message size performance is low when messages are sent by client
+ at a rate lower than the rate at which they are consumed by server,
+ and when TCP_CORK is not set. This is observed, for example, with iperf
+ benchmark.
+ Workaround: Set the TCP_CORK socket option
+ to ensure data is sent in at least 32K byte chunks.
+
+- Performance is low on 32-bit kernels, as SDP utilizes high memory
+ to ease memory pressure.
+ Workaround: Move to a 64-bit kernel, even if the application remains a 32-bit one.
+
+- By default, SDP utilizes a 2 Kbyte MTU size. This may cause PCI-X cards
+ using Mellanox Technologies "Infinihost" HCAs to experience low bandwidth.
+ Workaround: Reset the MTU size to 1K in this situation, using either of
+ the two methods below:
+
+ 1. Activate the "tavor quirk" workaround in opensm:
+ a. Create an opensm options cache file (/var/cache/osm/opensm.opts):
+ > opensm --cache-options -o
+ b. Add the following line to /var/cache/osm/opensm.opts:
+ enable_quirks TRUE
+ c. Rerun opensm using your usual command line options to activate
+ the opensm quirk option.
+
+ 2. Activate the "tavor quirk" workaround in cma:
+ set the tavor_quirk module parameter of the rdma_cm module to value 1
+ (default: 0).
+
+- When waiting for RX, the driver first polls, then arms the interrupt, and then
+ goes to sleep. The polling duration can be set with the recv_poll module
+ parameter. The higher this value, the higher the CPU utilization and the
+ lower the number of interrupts.
+ This should be fine-tuned according to the specific environment and
+ application latency.
+
+- When using SDP over RoCE, and the peer has a card that does not support RoCE
+ a delay in the connection establishment may occur.
+
+- BUG2185 - Occasionally, accessing /proc/net/sdpstats, causes kernel
+ panic.
+
+- For set-user-ID/set-group-ID ELF binaries, only libraries in the standard
+ search directories that are also set-user-ID are preloaded. Since always
+ installing libsdp with this bit on is a security vulnerability, the default
+ behavior is to reset this bit. A user who wants to run such binaries should
+ modify the libsdp.spec file.
+
+===============================================================================
+5. Verification Applications/Flows/Tests
+===============================================================================
+- ssh/sshd
+- wget/netscape/firefox/apache
+- netpipe
+- netperf
+- LTP socket tests
+- iperf-2.0.2
+- ttcp
+- openmpi
+- openmpi + Intel MPI benchmarks
+- Threaded and forking echo client server examples
+- Various Java client server applications (SUN:jre, BEA:jrockit/WebLogic, GNU:gij/gcj)
+- Many UNIX utilities to verify that pre-load did not harm the applications
+
+===============================================================================
+6. Module Parameters
+===============================================================================
+
+General
+-------
+sdp_link_layer_ib_only:
+ Supports only link layer of type InfiniBand.
+ It is useful when not using SDP over RoCE.
+
+sdp_debug_level:
+ Enables connection establishment and teardown debug tracing.
+
+sdp_data_debug_level:
+ Enables datapath debug tracing. If set to 1, it shows only packets >1.
+ To enable debugging of data path, compile driver with CONFIG_SDP_DEBUG_DATA.
+
+
+recv_poll:
+ Enables poll receiving before arming the interrupt. Set a higher value
+ to decrease the number of RX interrupts. Consequently, the CPU
+ utilization will be higher.
+
+sdp_keepalive_time:
+ Default idle time in seconds before keepalive probe sent.
+
+Resources
+---------
+rcvbuf_initial_size:
+ Receive buffer initial size in bytes.
+
+rcvbuf_scale:
+ Not in use
+
+top_mem_usage:
+ Top system wide sdp memory usage for recv (in MB).
+
+max_large_sockets:
+ Not in use
+
+sdp_fmr_pool_size:
+ Number of FMRs to allocate for pool
+
+sdp_fmr_dirty_wm:
+ Watermark to flush fmr pool
+
+Thresholds
+----------
+sdp_inline_thresh:
+ Inline copy threshold. Effective for new sockets only; 0=Off.
+
+sdp_zcopy_thresh:
+ Zero copy using RDMA threshold; 0=Off.
+ If smaller than page size, set to page size.
+
+Interrupt hardware moderation:
+------------------------------
+sdp_rx_coal_target:
+ Target number of bytes to coalesce with interrupt moderation.
+
+sdp_rx_coal_time:
+ rx coal time (jiffies).
+
+sdp_rx_rate_low:
+ rx_rate low (packets/sec).
+
+sdp_rx_coal_time_low:
+ low moderation usec.
+
+sdp_rx_rate_high:
+ rx_rate high (packets/sec).
+
+sdp_rx_coal_time_high:
+ high moderation usec.
+
+sdp_rx_rate_thresh:
+ rx rate thresh ().
+
+sdp_sample_interval:
+ sample interval (jiffies).
+
+hw_int_mod_count:
+ Forced hw int moderation val. -1 for auto (packets). 0 to disable.
+
+hw_int_mod_usec:
+ Forced hw int moderation val. -1 for auto (usec). 0 to disable.
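+
+To inspect the values currently in effect, the parameters of a loaded module
+can usually be read from /sys/module/<module>/parameters. The helper below is
+a hypothetical sketch; the module name ib_sdp used in the usage line is an
+assumption, so pass whatever name lsmod reports on your system.
+
```shell
# show_module_params MODULE [SYSFS_MODULE_ROOT]
# Print name=value for every readable parameter of a loaded kernel module.
show_module_params() {
    dir=${2:-/sys/module}/$1/parameters
    if [ ! -d "$dir" ]; then
        echo "no parameters directory for $1" >&2
        return 1
    fi
    for f in "$dir"/*; do
        [ -r "$f" ] || continue
        printf '%s=%s\n' "${f##*/}" "$(cat "$f")"
    done
}

# Usage (module name is an assumption):
#   show_module_params ib_sdp
```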
--- /dev/null
+
+ Open Fabrics Enterprise Distribution (OFED)
+ SRP in OFED 1.5.2 Release Notes
+
+ December 2010
+
+
+==============================================================================
+Table of contents
+==============================================================================
+
+ 1. Overview
+ 2. Changes and Bug Fixes since OFED 1.5
+ 3. Software Dependencies
+ 4. Major Features
+ 5. Loading SRP Initiator
+ 6. Manually Establishing an SRP Connection
+ 7. SRP Tools - ibsrpdm and srp_daemon
+ 8. Automatic Discovery and Connecting to Targets
+ 9. Multiple Connections from Initiator IB Port to the Target
+ 10. High Availability
+ 11. Shutting Down SRP
+ 12. Known Issues
+ 13. Vendor Specific Notes
+
+
+==============================================================================
+1. Overview
+==============================================================================
+
+The SRP standard describes the message format and protocol definitions required
+for transferring commands and data between a SCSI initiator port and a SCSI
+target port using RDMA communication service.
+
+
+==============================================================================
+2. Changes and Bug Fixes since OFED 1.5
+==============================================================================
+* Check for scsi_id in scmnd to prevent scan/rescan from repeatedly adding new
+ SCSI devices, e.g. echo "- - -" > /sys/class/scsi_host/hostXX/scan
+* Bug fixes
+
+==============================================================================
+4. Software Dependencies
+==============================================================================
+
+The SRP Initiator depends on the installation of the OFED Distribution stack
+with OpenSM running.
+
+==============================================================================
+5. Major Features
+==============================================================================
+
+This SRP Initiator is based on source taken from openib.org gen2 implementing
+the SCSI RDMA Protocol-2 (SRP-2), Doc. no. T10/1524-D. See:
+www.t10.org/ftp/t10/drafts/srp2/srp2r00a.pdf
+
+The SRP Initiator supports:
+- Basic SCSI Primary Commands -3 (SPC-3)
+ (www.t10.org/ftp/t10/drafts/spc3/spc3r21b.pdf)
+- Basic SCSI Block Commands -2 (SBC-2)
+ (www.t10.org/ftp/t10/drafts/sbc2/sbc2r16.pdf)
+- Basic functionality, task management and limited error handling
+
+==============================================================================
+6. Loading SRP Initiator
+==============================================================================
+
+To load the SRP module, either execute the "modprobe ib_srp" command after the
+OFED driver is up, or change the value of SRP_LOAD in
+/etc/infiniband/openib.conf to "yes" (causing the srp module to be loaded
+at driver boot).
+
+NOTE: When loading the ib_srp module, it is possible to set the module
+ parameter srp_sg_tablesize. This is the maximum number of
+ gather/scatter entries per I/O (default: 12).
+
+ a. modprobe ib_srp srp_sg_tablesize=32
+ or
+ b. edit /etc/modprobe.conf and add the following line:
+ options ib_srp srp_sg_tablesize=32
+
+Module parameters:
+For the list of ib_srp module parameters, run:
+ $ modinfo ib_srp
+
+ + srp_sg_tablesize: Max number of scatter/gather entries per I/O
+ + srp_dev_loss_tmo: Number of seconds during which the srp driver will not
+ return DID_NO_CONNECT status after it loses the connection
+ to the target. During this period, it tries to
+ re-establish the connection to the target, and returns
+ DID_RESET and DID_ABORT statuses for outstanding SCSI
+ commands to prevent the DM Multipath driver from failing
+ over to the next path. Default value is 60 seconds.
+
+==============================================================================
+7. Manually Establishing an SRP Connection
+==============================================================================
+
+The following steps describe how to manually establish an SRP connection
+between the Initiator and an SRP Target. Section 8 explains how to do this
+automatically.
+
+- Make sure that the ib_srp module is loaded, the SRP Initiator is reachable
+ by the SRP Target, and that an SM is running.
+
+- To establish a connection with an SRP Target and create SRP (SCSI) device(s)
+ for that target under /dev, use the following command:
+
+ echo -n id_ext=[GUID value],ioc_guid=[GUID value],dgid=[port GID value],\
+ pkey=ffff,service_id=[service[0] value] > \
+ /sys/class/infiniband_srp/srp-mthca[hca number]-[port number]/add_target
+
+ a. Execution of the above "echo" command may take some time
+ b. The SM must be running while the command executes
+ c. It is possible to include additional parameters in the echo command:
+ > max_cmd_per_lun - Default: 63
+ > max_sect (short for max_sectors) - sets the request size of a command
+ > io_class - Default: 0x100 as in rev 16A of the specification
+ Note: In rev 10 the default was 0xff00
+ > initiator_ext - Please refer to Section 9 (Multiple Connections...)
+ d. See SRP Tools below for instructions on how the parameters in the
+ echo command above may be obtained.
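+
+The parameter string written to add_target can be assembled with a small
+helper, which avoids stray spaces or line breaks inside the string. This is a
+sketch; srp_target_string is a hypothetical name, and the HCA and port in the
+usage line (srp-mthca0-1) are placeholders.
+
```shell
# srp_target_string ID_EXT IOC_GUID DGID SERVICE_ID [PKEY]
# Print the comma-separated target description with no trailing newline.
# PKEY defaults to ffff, as in the command shown above.
srp_target_string() {
    printf 'id_ext=%s,ioc_guid=%s,dgid=%s,pkey=%s,service_id=%s' \
        "$1" "$2" "$3" "${5:-ffff}" "$4"
}

# Usage (as root; replace the placeholder arguments and sysfs path):
#   srp_target_string <id_ext> <ioc_guid> <dgid> <service_id> \
#       > /sys/class/infiniband_srp/srp-mthca0-1/add_target
```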
+
+NOTES:
+
+- If the same *echo -n <same parameters>* command is issued more than once, the
+ srp target terminates the previous connection and establishes a new
+ connection. To have multiple connections to an srp target, use different
+ initiator_ext values in the echo command.
+
+- To list the new SCSI devices that have been added by the echo command, you
+ may use either of the following two methods:
+ a. Execute "fdisk -l". This command lists all devices; the new devices are
+ included in this listing.
+ b. Execute *dmesg* or look at /var/log/messages to find messages with the
+ names of the new devices.
+
+
+==============================================================================
+8. SRP Tools - ibsrpdm and srp_daemon
+==============================================================================
+
+To assist in performing the steps in Section 6, the OFED 1.3.1 distribution
+provides two utilities which:
+- Detect targets on the fabric reachable by the Initiator (for step 1)
+- Output target attributes in a format suitable for use in the above
+ "echo" command (step 2)
+
+These utilities are: ibsrpdm and srp_daemon.
+
+The utilities can be found under /usr/local/ofed/sbin/ (or <prefix>/sbin/),
+and are part of the srptools RPM that may be installed using the
+OFED custom installation. Detailed information regarding the various
+options for these utilities are provided by their man pages.
+
+Below, several usage scenarios for these utilities are presented.
+
+ibsrpdm usage
+-------------
+1. Detecting reachable targets
+
+ a. To detect all targets reachable by the SRP initiator via the default
+ umad device (/dev/infiniband/umad0), execute the following command:
+ $ ibsrpdm
+
+ This command will output information on each SRP target detected, in
+ human-readable form.
+
+ Sample output:
+ IO Unit Info:
+ port LID: 0103
+ port GID: fe800000000000000002c90200402bd5
+ change ID: 0002
+ max controllers: 0x10
+
+ controller[ 1]
+ GUID: 0002c90200402bd4
+ vendor ID: 0002c9
+ device ID: 005a44
+ IO class : 0100
+ ID: LSI Storage Systems SRP Driver 200400a0b81146a1
+ service entries: 1
+ service[ 0]: 200400a0b81146a1 / SRP.T10:200400A0B81146A1
+
+ b. To detect all the SRP Targets reachable by the SRP Initiator via
+ another umad device, use the following command:
+
+ $ ibsrpdm -d <umad device>
+
+2. Assistance in creating an SRP connection
+
+ a. To generate output suitable for utilization in the "echo" command of
+ Section 6, add the "-c" option to ibsrpdm:
+
+ $ ibsrpdm -c
+
+ Sample output:
+ id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,
+ dgid=fe800000000000000002c90200402bd5,pkey=ffff,
+ service_id=200400a0b81146a1
+
+ b. To establish a connection with an SRP Target (Section 6) using the output
+ from the "ibsrpdm -c" example above, execute the following command (shown
+ wrapped here; enter it as a single line):
+
+ $ echo -n id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,
+ dgid=fe800000000000000002c90200402bd5,pkey=ffff,
+ service_id=200400a0b81146a1
+ > /sys/class/infiniband_srp/srp-mlnx_0-1/add_target
+
+ The SRP connection should now be up; the newly created SCSI devices should
+ appear in the listing obtained from the "fdisk -l" command.
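+
+The two steps above can be chained so that every target reported by
+"ibsrpdm -c" is written to add_target in turn. This sketch assumes that
+ibsrpdm -c prints one complete target description per line; srp_add_targets
+is a hypothetical name and the sysfs path in the usage line is a placeholder.
+
```shell
# srp_add_targets ADD_TARGET_FILE
# Read target strings from standard input, one per line, and write each one
# to the given add_target file as a separate write with no trailing newline.
srp_add_targets() {
    while IFS= read -r target; do
        [ -n "$target" ] || continue      # skip empty lines
        printf '%s' "$target" > "$1" || return 1
    done
}

# Usage (as root; adjust the HCA name and port number):
#   ibsrpdm -c | srp_add_targets /sys/class/infiniband_srp/srp-mthca0-1/add_target
```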
+
+
+srp_daemon
+----------
+The srp_daemon utility is based on ibsrpdm and extends its functionality.
+In addition to the ibsrpdm functionality described above, srp_daemon can also
+- Establish an SRP connection by itself (without the need to issue the "echo"
+ command described in Section 6)
+- Continue running in background, detecting new targets and establishing SRP
+ connections with them (daemon mode)
+- Discover reachable SRP targets given an infiniband HCA name and port, rather
+ than just by /dev/umad<N> where <N> is a digit
+- Enable High Availability operation (together with Device-Mapper Multipath)
+- Have a configuration file that determines the targets to connect to
+
+a. srp_daemon commands equivalent to ibsrpdm:
+
+ "srp_daemon -a -o" is equivalent to "ibsrpdm"
+ "srp_daemon -c -a -o" is equivalent to "ibsrpdm -c"
+
+Note: These srp_daemon commands can behave differently than the equivalent
+ ibsrpdm command when /etc/srp_daemon.conf is not empty.
+
+b. srp_daemon extensions to ibsrpdm
+
+ - To discover SRP Targets reachable from HCA device <infiniband HCA name>,
+ port <port num>, (and generate output suitable for 'echo') you may execute
+
+ srp_daemon -c -a -o -i <infiniband HCA name> -p <port number>
+
+ - To both discover the SRP Targets and establish connections with them, just
+ add the -e option to the above command.
+
+ - Executing srp_daemon over a port without the -a option displays only the
+ targets that are reachable via the port and to which the initiator is not
+ yet connected. If executing with the -e option it is better to omit -a.
+
+ - It is recommended to use the -n option. This option adds the initiator_ext
+ to the connecting string. (See Section 10 for more details).
+
+ - srp_daemon has a configuration file that can be set, where the default is
+ /etc/srp_daemon.conf. Use the -f option to supply a different configuration file
+ that configures the targets srp_daemon is allowed to connect to. The
+ configuration file can also be used to set values for additional
+ parameters (e.g., max_cmd_per_lun, max_sect).
+
+ - A continuous background (daemon) operation, providing an automatic ongoing
+ detection and connection capability. See Section 9.
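+
+ The options listed above can be combined freely. As a rough illustration,
+ the sketch below assembles an srp_daemon command line from keyword
+ arguments. The function and its parameter names are hypothetical; only the
+ flags themselves (-c, -a, -o, -e, -n, -f, -i, -p) come from this document.
+
```python
def srp_daemon_args(hca=None, port=None, echo_format=False, all_targets=False,
                    run_once=False, connect=False, add_initiator_ext=False,
                    config_file=None):
    """Build an srp_daemon argument list from the options described above."""
    args = ["srp_daemon"]
    if echo_format:
        args.append("-c")   # generate output suitable for 'echo'
    if all_targets:
        args.append("-a")   # do not filter out already-connected targets
    if run_once:
        args.append("-o")   # used in the ibsrpdm-equivalent commands
    if connect:
        args.append("-e")   # establish connections with discovered targets
    if add_initiator_ext:
        args.append("-n")   # add initiator_ext to the connecting string
    if config_file is not None:
        args += ["-f", config_file]
    if hca is not None:
        args += ["-i", hca]
    if port is not None:
        args += ["-p", str(port)]
    return args


# Discovery on one port, printed as a shell command line.
print(" ".join(srp_daemon_args("mlx4_0", 1, echo_format=True,
                               all_targets=True, run_once=True)))
```
+
+ The resulting list can be passed to subprocess.run() as is; the HCA name
+ mlx4_0 is only an example value.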
+
+==============================================================================
+9. Automatic Discovery and Connecting to Targets
+==============================================================================
+
+- Make sure that the ib_srp module is loaded, the SRP Initiator can reach an
+ SRP Target, and that an SM is running.
+
+- To connect to all the existing Targets in the fabric, execute
+ srp_daemon -e -o. This utility will scan the fabric once, connect to
+ every Target it detects, and then exit.
+
+NOTE: srp_daemon will follow the configuration it finds in
+ /etc/srp_daemon.conf. Thus, it will ignore a target that is disallowed in
+ the configuration file.
+
+- To connect to all the existing Targets in the fabric and to connect
+ to new targets that will join the fabric, execute srp_daemon -e. This utility
+ continues to execute until it is either killed by the user or encounters
+ connection errors (such as no SM in the fabric).
+
+- To execute SRP daemon as a daemon you may execute run_srp_daemon
+ (found under /usr/local/ofed/sbin/ or <prefix>/sbin/), providing it with
+ the same options used for running srp_daemon.
+
+ Note: Make sure only one instance of run_srp_daemon runs per port.
+
+- To execute SRP daemon as a daemon on all the ports, execute srp_daemon.sh
+ (found under /usr/local/ofed/sbin/ or <prefix>/sbin/).
+ srp_daemon.sh sends its log to /var/log/srp_daemon.log.
+
+- It is possible to configure this script to execute automatically when the
+ InfiniBand driver starts by setting both SRP_DAEMON_ENABLE and SRP_LOAD to
+ "yes" in /etc/infiniband/openib.conf.
+
+ Another option to configure this script to execute automatically when the
+ InfiniBand driver starts is to set SRPHA_ENABLE to "yes" in
+ /etc/infiniband/openib.conf. However, this option also enables SRP High
+ Availability, which has some additional features. (Please read the High
+ Availability section).
+
+==============================================================================
+10. Multiple Connections from Initiator IB Port to the Target
+==============================================================================
+
+Some system configurations may need multiple SRP connections from
+the SRP Initiator to the same SRP Target: to the same Target IB port,
+or to different IB ports on the same Target HCA.
+
+In case of a single Target IB port, i.e., SRP connections use the same path,
+the configuration is enabled using a different initiator_ext value for each
+SRP connection. The initiator_ext value is a 16-hexadecimal-digit value
+specified in the connection command.
+
+Also, in the case of two physical connections (i.e., network paths) from a
+single initiator IB port to two different IB ports on the same Target HCA,
+there is a need for a different initiator_ext value on each path. The
+convention is to use the Target port GUID as the initiator_ext value for the
+relevant path.
+
+If you use srp_daemon with -n flag, it automatically assigns initiator_ext
+values according to this convention. For example:
+
+ id_ext=200500A0B81146A1,ioc_guid=0002c90200402bec,
+ dgid=fe800000000000000002c90200402bed,
+ pkey=ffff,service_id=200500a0b81146a1,initiator_ext=ed2b400002c90200
+
+ Notes:
+ a. It is recommended to use the -n flag for all srp_daemon invocations.
+ b. ibsrpdm does not have a corresponding option.
+ c. srp_daemon.sh always uses the -n option (whether invoked manually by
+ the user, or automatically at startup by setting SRPHA_ENABLE or
+ SRP_DAEMON_ENABLE to yes).
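+
+In the example above, the initiator_ext value equals the low 64 bits of the
+dgid (the Target port GUID) with the byte order reversed. The sketch below
+reproduces that relationship for this one example. It is an observation about
+the sample string, not a description of how srp_daemon computes the value
+internally, and the function names are hypothetical.
+
```python
def port_guid_from_dgid(dgid):
    """Return the low 64 bits (last 16 hex digits) of a 128-bit GID string."""
    if len(dgid) != 32:
        raise ValueError("expected a 32-hex-digit dgid")
    return dgid[16:]


def reverse_bytes(hex_value):
    """Reverse the byte order of a hex string, e.g. '0a0b' -> '0b0a'."""
    return bytes.fromhex(hex_value)[::-1].hex()


# dgid copied from the srp_daemon -n example above.
dgid = "fe800000000000000002c90200402bed"
port_guid = port_guid_from_dgid(dgid)
print(port_guid)                 # 0002c90200402bed
print(reverse_bytes(port_guid))  # ed2b400002c90200
```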
+
+==============================================================================
+11. High Availability (HA)
+==============================================================================
+
+High Availability Overview
+--------------------------
+
+High Availability works using the Device-Mapper (DM) multipath and the
+SRP daemon.
+
+Each initiator is connected to the same target from several ports/HCAs.
+The DM multipath is responsible for joining together different paths to the
+same target and for fail-over between paths when one of them goes offline.
+Multipath will be executed on newly joined SCSI devices.
+
+Each initiator should execute several instances of the SRP daemon, one for each
+port. At startup, each SRP daemon detects the SRP targets in the fabric and
+sends requests to the ib_srp module to connect to each of them. These
+SRP daemons also detect targets that subsequently join the fabric, and send the
+ib_srp module requests to connect to them as well.
+
+High Availability Operation
+---------------------------
+
+When a path (from port1) to a target fails, the ib_srp module starts an error
+recovery process. If this process gets to the reset_host stage and there is no
+path to the target from this port, ib_srp will remove this scsi_host. After
+the scsi_host is removed, multipath switches to another path to this target
+(from another port/HCA).
+
+When the failed path recovers, it will be detected by the SRP daemon. The SRP
+daemon will then request ib_srp to connect to this target. Once the connection
+is up, there will be a new scsi_host for this target. Multipath will be
+executed on the devices of this host, returning to the original state (prior to
+the failed path).
+
+High Availability Prerequisites
+-------------------------------
+
+Installation for RHEL4 and RHEL5: (Execute once)
+ - Verify that the standard device-mapper-multipath rpm is installed. If not,
+ install it from the RHEL distribution.
+
+Installation for SLES10: (Execute once)
+ - Verify that multipath is installed. If not, install it from the SLES
+ distribution (you can use yast).
+
+ - Update udev: (Execute once - for manual activation of High Availability only)
+
+ - Add a file to /etc/udev/rules.d/ (you can call it 91-srp.rules)
+ This file should have one line:
+ ACTION=="add", KERNEL=="sd*[!0-9]", RUN+="/sbin/multipath %M:%m"
+
+ Note: When SRPHA_ENABLE is set to "yes" (see Automatic Activation of High
+ Availability below), this file is created upon each boot of the driver and
+ is deleted when the driver is unloaded.
+
+
+Manual Activation of High Availability
+--------------------------------------
+
+Initialization: (Execute after each boot of the driver)
+ 1) Execute modprobe dm-multipath
+ 2) Execute modprobe ib-srp
+ 3) Make sure you have created file /etc/udev/rules.d/91-srp.rules
+ as described above
+ 4) Execute for each port and each HCA:
+ srp_daemon -c -e -R 300 -i <InfiniBand HCA name> -p <port number>
+ (You can use another value for -R. See the workaround for the rare race
+ condition in the Known Issues section.)
+
+ This step can be performed by executing srp_daemon.sh, which sends
+ its log to /var/log/srp_daemon.log.
+
+ Now it is possible to access the SRP LUNs on /dev/mapper/.
+
+ NOTE: It is possible for regular (non-SRP) LUNs to also be present;
+ the SRP LUNs may be identified by their names. You can configure the
+ /etc/multipath.conf file to change multipath behavior.
+
+
+Automatic Activation of High Availability
+-----------------------------------------
+- Set the value of SRPHA_ENABLE in /etc/infiniband/openib.conf to "yes".
+ Also make sure SRP_LOAD=yes and SRP_DAEMON_ENABLE=yes.
+
+- The next time the driver is loaded, the SRP LUNs will be accessible under
+ /dev/mapper/
+ NOTE: It is possible that regular (not SRP) LUNs may also be present;
+ the SRP LUNs may be identified by their name.
+
+- It is possible to see the output of the SRP daemon in /var/log/srp_daemon.log
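+
+The three settings above are plain KEY=value lines. As a minimal sketch,
+assuming /etc/infiniband/openib.conf contains nothing more complicated than
+such lines and comments, the helper below rewrites the keys in a block of
+text. The function name and the sample input are made up for illustration.
+
```python
def set_conf_values(text, updates):
    """Return config text with each KEY in `updates` set to its new value."""
    remaining = dict(updates)
    out = []
    for line in text.splitlines():
        key = line.split("=", 1)[0].strip()
        if "=" in line and key in remaining:
            # Replace the first matching KEY=... line with the new value.
            out.append(f"{key}={remaining.pop(key)}")
        else:
            out.append(line)
    # Append any settings that were not present in the original text.
    out.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(out) + "\n"


# Made-up sample content, not a real openib.conf.
sample = "# sample settings\nSRP_LOAD=no\nSRPHA_ENABLE=no\n"
wanted = {"SRPHA_ENABLE": "yes", "SRP_LOAD": "yes", "SRP_DAEMON_ENABLE": "yes"}
print(set_conf_values(sample, wanted), end="")
```
+
+To apply it for real, read the file, pass its contents through
+set_conf_values(), and write the result back as root. Keep a backup copy of
+the original file first.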
+
+
+==============================================================================
+12. Shutting Down SRP
+==============================================================================
+
+SRP can be shut down by using "modprobe -r ib_srp", by stopping the OFED
+driver ("/etc/init.d/openibd stop"), or as a by-product of a complete system
+shutdown.
+
+Prior to shutting down SRP, it is REQUIRED to remove all references to it.
+The actions you need to take depend on the way SRP was loaded. There are
+three cases.
+
+a. Without High Availability
+------------------------------------
+When working without High Availability, you should unmount all mounted SRP
+partitions prior to shutting down SRP.
+For example, if /dev/sdd1 is an SRP partition mounted on /mnt/test:
+$ umount /mnt/test
+$ modprobe -r ib_srp
+
+NOTES: If the target is down, umount may hang for ~90 seconds per connection
+ to the target. This is because the SRP driver waits srp_dev_loss_tmo=60
+ seconds for the target to come back before returning an error status.
+ If you have shut down or removed an SRP target and the host has 4
+ connections to that target, expect to wait ~4-5 minutes for umount to
+ exit. Do not press ctrl+c to kill the umount process.
+
+b. After Manual Activation of High Availability
+-----------------------------------------------
+If you manually activated SRP High Availability, perform the following steps:
+- Unmount all SRP partitions that were mounted
+- Kill all SRP daemon instances.
+- Make sure there are no multipath instances running. If there are multipath
+ instances, wait for them to end or kill them.
+- Execute multipath -F
+
+Example:
+$ umount /mnt/test1 /mnt/test2 (wait for it to exit, do not ctrl+c)
+$ ps -ax (find and kill all srp_daemon processes)
+$ multipath -ll (wait for it to exit, do not ctrl+c)
+$ multipath -F
+$ modprobe -r ib_srp
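+
+The order of the steps above matters. The sketch below lists them as
+argument lists and prints them in dry-run mode. The helper names are
+hypothetical, and the use of pkill to stop the daemons is an assumption: the
+steps above only say to kill all srp_daemon processes.
+
```python
import subprocess


def ha_shutdown_commands(mount_points):
    """Return the shutdown commands, in order, as argument lists."""
    return [
        ["umount", *mount_points],   # wait for it to exit, do not ctrl+c
        ["pkill", "srp_daemon"],     # assumption: one way to kill the daemons
        ["multipath", "-ll"],        # wait for it to exit, do not ctrl+c
        ["multipath", "-F"],
        ["modprobe", "-r", "ib_srp"],
    ]


def run_all(commands, dry_run=True):
    """Print each command; also execute it when dry_run is False (needs root)."""
    for command in commands:
        print("$ " + " ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


run_all(ha_shutdown_commands(["/mnt/test1", "/mnt/test2"]))
```
+
+With dry_run=False, check=True stops the sequence at the first failing
+command. Note that pkill exits with a non-zero status when no process
+matched, which would also stop the sequence.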
+
+c. After Automatic Activation of High Availability
+--------------------------------------------------
+If SRP High Availability was automatically activated, SRP shutdown must be
+part of the driver shutdown ("/etc/init.d/openibd stop") which performs
+steps 2-5 of case (b) above. However, you still have to unmount all SRP
+partitions that were mounted before driver shutdown.
+
+
+HAL Issue
+---------
+The HAL (Hardware Abstraction Layer) system includes a daemon that examines
+all devices in the system. In this process, it frequently holds a reference
+to the ib_srp module. If you attempt to shut down SRP while this daemon is
+holding a reference to ib_srp, the shutdown will fail. Therefore, you
+should make sure this does not occur. One solution may be to stop "haldaemon"
+(/etc/init.d/haldaemon stop) prior to SRP shutdown.
+
+
+==============================================================================
+13. Known Issues
+==============================================================================
+
+- There is a very rare race condition which can cause the SRP daemon to miss a
+ target that joins the fabric. The race can occur if a target joins and leaves
+ the fabric several times in a short time (e.g., if the cable is not connected
+ well). In such a case, the SM may ignore this quick change of state and may
+ not send an InformInfo to the srp_daemon.
+
+ Workaround: Execute the srp_daemon command with the -R <sec> option. This
+ option causes the SRP daemon to perform a full rescan of the fabric every
+ <sec> seconds.
+
+- The srp_daemon does not support pkeys other than the default
+ pkey=ffff
+
+- It is recommended to use an SM that supports the enhanced capability mask
+ matching feature (errata MGTWG8372). With SMs which support this feature, the
+ SRP daemon generates significantly less communication traffic.
+
+- When booting OFED with SRP High Availability enabled, executing multipath for
+ all LUNs on all connections may take some time (several minutes). However, it
+ is possible to start working while this process is in progress.
+
+- Stopping the driver while SRP High Availability is enabled kills all
+ multipath processes. Consider appropriate actions in case multipath is used
+ for other purposes.
+
+- As High Availability is based on Device Mapper multipath, it inherits
+ multipath limitations as well as its configuration and tuning options.
+ See http://christophe.varoqui.free.fr/wiki/wakka.php?wiki=Home
+ for information on multipath.
+ To modify and tune multipath configuration, edit the file /etc/multipath.conf
+ according to instructions and tips listed in
+ /usr/share/doc/packages/multipath-tools/multipath.conf.*
+
+- In case your topology has two physical connections (i.e., network paths) from
+ a single initiator IB port to two different IB ports on the same Target HCA,
+ and you wish to have an SRP connection on the one path coexist with an SRP
+ connection on the second path, you must set a different initiator_ext value
+ on each path. See Section 10, "Multiple Connections from Initiator IB Port
+ to the Target" for details.
+
+- The srp_daemon tool reads by default the configuration file
+ /etc/srp_daemon.conf. In case this configuration file disallows connecting
+ to a certain target, srp_daemon will ignore the target. If you find out
+ that srp_daemon ignores a target, please check the /etc/srp_daemon.conf file.
+
+- When rebooting a system that has a mounted filesystem in an unclean state
+ and a dead connection to the SRP target, the system may get stuck.
+
+- After establishing a connection with an SRP target and rebooting the system,
+ the initiator will fail to connect to the target on the first manual
+ *echo -n* command (the target rejects it as a stale connection). You need to
+ issue *echo -n* one more time.
+ You do not see this problem in srp_daemon mode since srp_daemon retries the
+ connection.
+
+- The combination of a "weak" single-LUN SRP target, I/O with a big block size,
+ and the default max_cmd_per_lun=63, while using /dev/urandom to create a file
+ on an ext3 filesystem on an SRP LUN, may cause ext3 to be remounted with the
+ "read-only" flag.
+ Example:
+ sdb1 is the first partition of SRP LUN sdb, and an ext3 fs is created on it
+ $ mount /dev/sdb1 /mnt/sdb1; cd /mnt/sdb1
+ $ dd if=/dev/urandom of=10G-file bs=1G count=10
+ --> ext3 fs may be remounted with the read-only flag
+
+ Workarounds:
+ ------------
+ a. Log into the target with a small max_cmd_per_lun (3, 4, or 8)
+ $ echo id_ext=0002c9030008fc0c,ioc_guid=0002c9030008fc0c,
+ dgid=fe800000000000000002c90300084417,max_cmd_per_lun=4,pkey=ffff,
+ service_id=0002c9030008fc0c > /sys/class/infiniband_srp/srp-mlx4_0-1/add_target
+
+ ----------------OR-------------------
+
+ b. Run dd with /dev/zero instead of /dev/urandom
+ $ dd if=/dev/zero of=10G-file bs=1G count=10
+
+ ----------------OR-------------------
+
+ c. Run dd with smaller block size
+ $ dd if=/dev/urandom of=10G-file bs=128K count=40000
+
+ ----------------OR-------------------
+
+ d. Combine steps a, b, and c (This is the recommended workaround)
+
+==============================================================================
+14. Vendor Specific Notes
+==============================================================================
+
+Hosts connected to Qlogic SRP Targets must perform one of the following
+steps after upgrading to OFED 1.3.1 to continue accessing their storage
+successfully:
+
+1. When issuing the "echo" command to add a new SRP Target, the host
+ must append the string ",initiator_ext=0000000000000001" to the original
+ echo string.
+ Example:
+ 'ibsrpdm -c' output is as follows:
+
+ id_ext=0000000000000001,ioc_guid=00066a0138000165,dgid=fe8000000000000
+ 000066a0260000165,pkey=ffff,service_id=0000494353535250,io_class=ff00
+
+ id_ext=0000000000000001,ioc_guid=00066a0238000165,dgid=fe8000000000000
+ 000066a0260000165,pkey=ffff,service_id=0000494353535250,io_class=ff00
+
+ To connect to the first target, the echo command must be:
+
+ echo -n \
+ id_ext=0000000000000001,ioc_guid=00066a0138000165,\
+ dgid=fe8000000000000000066a0260000165,pkey=ffff,\
+ service_id=0000494353535250,io_class=ff00,\
+ initiator_ext=0000000000000001 > \
+ /sys/class/infiniband_srp/srp-mthca0-1/add_target
+
+
+2. Change the SRP map on the Qlogic SRP Target to set the expected initiator
+ extension to 0. For details on how to change the SRP map on a Qlogic SRP
+ Target, please refer to product documentation.
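+
+For option 1, the change to each 'ibsrpdm -c' output line is a simple string
+append. A minimal sketch follows; the function name is hypothetical, and the
+default value is the one given in step 1 above.
+
```python
def with_initiator_ext(target_line, initiator_ext="0000000000000001"):
    """Append initiator_ext to an 'ibsrpdm -c' line unless already present."""
    line = target_line.strip()
    if "initiator_ext=" in line:
        return line
    return f"{line},initiator_ext={initiator_ext}"


# First target from the 'ibsrpdm -c' output above, joined onto one line.
first_target = (
    "id_ext=0000000000000001,ioc_guid=00066a0138000165,"
    "dgid=fe8000000000000000066a0260000165,pkey=ffff,"
    "service_id=0000494353535250,io_class=ff00"
)
print(with_initiator_ext(first_target))
```
+
+The returned string is what would be echoed to the add_target file. Calling
+the function again on its own output leaves the string unchanged.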
+
+
--- /dev/null
+ Release Notes for
+ OFED 1.5.1 DAPL Release
+ March 2010
+
+ This release of the uDAPL reference implementation package for both the
+ DAT 1.2 and 2.0 specifications is timed to coincide with the OFED release
+ of the Open Fabrics (www.openfabrics.org) software stack.
+
+ uDAPL v1 (1.2.16-1) and v2 (2.0.27-1)
+
+ ----------------
+
+ * New Features (v2 only) - UCM provider with IB UD based CM per process.
+ More scalable than rdma_cm (cma) or socket cm (scm).
+ ----------------
+
+ * Provider descriptions and PROS/CONS (cma, scm, ucm)
+
+ 1. CMA - uses OFA rdma_cm to setup QP's. IPoIB, ARP, and SA queries required.
+
+ Provider name: ofa-v2-cma
+ PROs: OFA rdma_cm has the most testing across many applications.
+ Supports both iWARP and IB.
+
+ CONs: Serialization of conn processing with kernel based CM service
+ Requires IPoIB ARP for name resolution (risk of ARP storms)
+ Requires SA for path record queries for IB fabrics.
+ Conn Request private data limited to 52 bytes.
+
+ Settings for larger clusters (512+ cores):
+
+ setenv DAPL_CM_ROUTE_TIMEOUT_MS 20000
+ setenv DAPL_CM_ARP_TIMEOUT_MS 10000
+
+ 2. SCM - uses sockets to exchange QP information. IPoIB, ARP, and SA queries NOT required.
+
+ Provider name (connectx): ofa-v2-mlx4_0-1
+ PROs: Each rank has own instance of socket cm. More private data with requests.
+ Doesn't require path-record lookup.
+
+ CONs: Socket resources grow with scale-out, serialization of
+ connections with kernel based tcp sockets,
+ Competes for MPI socket resources/port space and other TCP applications.
+ Sockets remain in TIMEWAIT state for minutes after closure.
+ Requires ARP for name resolution.
+ Doesn't support iWARP devices.
+
+ Settings for larger clusters (512+ cores):
+
+ setenv DAPL_ACK_RETRY 7 /* IB RC Ack retry count */
+ setenv DAPL_ACK_TIMER 20 /* IB RC Ack retry timer */
+
+ 3. UCM - uses IB UD QP to exchange QP info. Sockets, ARP, IPoIB, and SA queries NOT required.
+
+ Provider name (connectx): ofa-v2-mlx4_0-1u
+ PROs: Each rank has own instance of CM in user process
+ Resources fixed per rank regardless of scale-out size
+ No serialization of user or kernel resources establishing connections,
+ Simple 3-way msg handshake, CM messages fit in inline data for lowest message latency,
+ Supports alternate paths
+ No address resolution required.
+ No path resolution required.
+
+ CONs: New provider with limited testing, a little tougher to debug.
+ Doesn't support iWARP
+
+ Settings for larger clusters (512+ cores):
+
+ setenv DAPL_UCM_REP_TIME 800 /* REQUEST timer, waiting for REPLY in millisecs */
+ setenv DAPL_UCM_RTU_TIME 400 /* REPLY timer, waiting for RTU in millisecs */
+ setenv DAPL_UCM_RETRY 15 /* REQUEST and REPLY retries */
+ setenv DAPL_ACK_RETRY 7 /* IB RC Ack retry count */
+ setenv DAPL_ACK_TIMER 20 /* IB RC Ack retry timer */
+
+ ----------------
+
+ * CM Performance: CPS profile for cma, scm, and ucm v2 uDAPL providers:
+
+ Intel SR1600 Urbanna Servers with Xeon(R) CPU X5570 @ 2.93GHz
+ Urbanna Platform - 2 node, 8 cores per node, Mellanox MLX4 IB QDR, no switch.
+
+ dtestcm (server/client):
+
+ cma: Connections: 183.21 usec, CPS 5458.31 Total 0.18 secs, poll_cnt=3403, Num=1000
+ scm: Connections: 178.80 usec, CPS 5592.93 Total 0.18 secs, poll_cnt=2344, Num=1000
+ ucm: Connections: 122.43 usec, CPS 8167.93 Total 0.12 secs, poll_cnt=2609, Num=1000
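+
+ The CPS figures above are approximately the reciprocal of the average
+ connect time. The sketch below recomputes them from the usec values;
+ small differences from the reported CPS are expected because the usec
+ values are rounded. The names used are hypothetical.
+
```python
def connections_per_second(usec_per_connection):
    """Convert an average connect time in microseconds to a rate."""
    return 1_000_000 / usec_per_connection


# Average connect times (usec) copied from the dtestcm results above.
dtestcm_usec = {"cma": 183.21, "scm": 178.80, "ucm": 122.43}

for provider, usec in dtestcm_usec.items():
    rate = connections_per_second(usec)
    speedup = dtestcm_usec["cma"] / usec
    print(f"{provider}: {rate:8.1f} CPS, {speedup:.2f}x the cma rate")
```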
+
+ dapl_cm_bw: MPI uDAPL/CM profiling application (all-to-all connections, all ranks)
+
+ CMA
+ 2 Connect times (10): Total 0.0020 per 0.0002 CPS=4997.98
+ 4 Connect times (40): Total 0.0077 per 0.0002 CPS=5224.59
+ 8 Connect times (240): Total 0.0276 per 0.0001 CPS=8710.76
+ 16 Connect times (1120): Total 0.1194 per 0.0001 CPS=9379.37
+ 32 Connect times (4800): Total 6.1949 per 0.0013 CPS=774.83
+
+ SCM
+ 2 Connect times (10): Total 0.0024 per 0.0002 CPS=4103.61
+ 4 Connect times (40): Total 0.0060 per 0.0002 CPS=6622.41
+ 8 Connect times (240): Total 0.0206 per 0.0001 CPS=11634.15
+ 16 Connect times (1120): Total 9.0118 per 0.0080 CPS=124.28
+ 32 Connect times (4800): Total 21.0198 per 0.0044 CPS=228.36
+
+ UCM
+ 2 Connect times (10): Total 0.0014 per 0.0001 CPS=7353.27
+ 4 Connect times (40): Total 0.0045 per 0.0001 CPS=8816.19
+ 8 Connect times (240): Total 0.0191 per 0.0001 CPS=12582.44
+ 16 Connect times (1120): Total 0.0799 per 0.0001 CPS=14017.68
+ 32 Connect times (4800): Total 0.3337 per 0.0001 CPS=14385.21
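+
+ In the dapl_cm_bw tables above, CPS is the connection count in
+ parentheses divided by the total time. The sketch below parses one result
+ line and recomputes the rate. The regular expression and field names are
+ hypothetical, and reading the first column as the number of ranks is an
+ assumption.
+
```python
import re

LINE = re.compile(
    r"(?P<ranks>\d+) Connect times \((?P<count>\d+)\): "
    r"Total (?P<total>[\d.]+) per (?P<per>[\d.]+) CPS=(?P<cps>[\d.]+)"
)


def parse_line(line):
    """Parse one dapl_cm_bw result line into a dict of numbers."""
    match = LINE.search(line)
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    return {
        "ranks": int(match["ranks"]),      # assumed meaning of first column
        "count": int(match["count"]),
        "total": float(match["total"]),
        "per": float(match["per"]),
        "cps": float(match["cps"]),
    }


def recomputed_cps(row):
    """Connections per second from the connection count and total time."""
    return row["count"] / row["total"]


row = parse_line("32 Connect times (4800): Total 21.0198 per 0.0044 CPS=228.36")
print(f"reported {row['cps']:.2f}, recomputed {recomputed_cps(row):.2f}")
```
+
+ For rows with very small totals (for example 0.0020 secs) the recomputed
+ value differs slightly from the reported one because the printed total is
+ rounded to four decimal places.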
+
+ ----------------
+
+ * Bug Fixes
+
+ V2.0 Package
+
+ Release 2.0.27
+ windows: add scm makefile
+ windows does not require rdma_cma_abi.h, move the include from common code
+ windows patch to fix IB_INVALID_HANDLE name collision
+ scm: dat_ep_connect fails on 32bit servers
+ undefined symbol: dapls_print_cm_list
+ cleanup CM object lock before freeing CM object memory
+ destroy verbs completion channels created via ia_open or ep_create.
+ package: update Copyright file and include the 3 license files in distribution
+ common: when copying private_data out of rdma_cm events, use the
+ cma: fix referencing freed address
+ dapl: move close device after async thread is done
+
+ Release 2.0.26
+ openib_common: add check for both gid and global routing in RTR
+ openib_common: remote memory read privilege set multi times
+ ucm, scm: DAPL_GLOBAL_ROUTING enabled causes segv
+
+ Release 2.0.25
+ winof scm: initialize opt for NODELAY setsockopt
+ winof cma: windows definition for EADDRNOTAVAIL missing
+ scm: client side setsockopt NODELAY fails if data arrives before setting
+ cma: setup_listener Cannot assign requested address
+ common: seg fault in dapl_evd_wait with multi-thread application using CNO's.
+ ucm: inbound DREQ/DREP handshake should transition QP.
+ winof: Remove duplicate include of comp_channel.cpp from cm.c as it is
+ included in opensm_ucb/device.c.
+
+ Release 2.0.24
+ winof: Utilize WinOF version of inet_ntop() for Windows OSes which do not
+ support inet_ntop().
+ ucm: windows build issue with new CQ completion channel
+ winof: add ucm provider to windows build
+ winof: add missing build files for ibal, scm
+ scm: connection peer resets under heavy load, incorrect event on error
+ ucm: increase default reply and rtu timeout values.
+ ucm: change some debug message levels and add check for valid UD REPLY during retries.
+ ucm: increase timers during subsequent retries
+ ucm, scm: address handles need destroyed when freeing Endpoints with UD QP's.
+ openib_common: ignore pd free errors, clear pd_handle and return.
+ ucm: using UD type QP's, ucm reports wrong reject event when user rejects AH resolution request.
+ ucm, scm, cma: Fix CNO support on DTO type EVD's
+ ucm: fix lock init bug in ucm_cm_find
+ ucm: fix build problem with latest windows ucm changes
+ ucm: The HCA should not be closed until all resources have been released.
+ ucm: Fix build warning when compiling on 32-bit systems.
+ ucm: Trying to deregister the same memory region twice leads to an
+ dat: reduce debug message level when parsing for location of dat.conf
+ ucm: update ucm provider for windows environment
+ ucm: add timer/retry CM logic to the ucm provider
+
+ Release 2.0.23
+ cma: cannot reuse the cm_id and qp for new connection, must reallocate a new one.
+ scm, cma: update DAPL cm protocol revision with latest address/port changes
+ ucm: modify IB address format to align better with sockaddr_in6
+ Add definition for getpid similar to that used by the other dtest apps.
+ WinOF provides a common implementation of gettimeofday that should
+ The completion manager was updated to provide an abstraction that
+ dtestcm: remove IB verb definitions
+ dtest, dtestx: remove IB verb definitions
+ scm: tighten up socket options to ensure similar behavior on Windows and Linux.
+ cma: improve serialization of destroy and event processing
+ scm: improve serialization of destroy and state changes
+ common: no cleanup/release code for timer thread
+ scm, cma: dapli_thread doesn't always get terminated on library close.
+ ucm: tighten up locking with CM processing, state changes
+ ucm: For UD type QP's, return CR p_data with CONN_EST event on passive side.
+ ucm: cleanup extra cr/lf
+ ucm: fix issues with UD QP's.
+ winof: Convert windows version of dapl and dat libraries to use private heaps.
+ dtest, dtestx: modifications for UD QP testing with ucm provider.
+ scm, ucm: UD QP support was broken when porting to common openib code base.
+ cma: cleanup warning with unused local variable, ret, in disconnect
+ cma: remove debug message after rdma_disconnect failure
+ scm: socket errno check needs O/S dependent wrapper
+ dapltest: update script files for WinOF
+ cma: conditional check for new rdma_cm definition.
+
+ Release 2.0.22
+ dapltest: add mdep processor yield and use with dapltest
+ ucm: Add new provider using a DAPL based IB-UD cm mechanism for MPI implementations.
+
+ Release 2.0.21
+ scm: Fix disconnect. QP's need to move to ERROR state in
+ modify dtest.c to cleanup CNO wait code and consolidate into
+ CNO events, once triggered will not be returned during the cno wait.
+ CNO support broken in both CMA and SCM providers.
+ common osd: include winsock2.h for IPv6 definitions.
+ common osd: include ws2tcpip.h for sockaddr_in6 definitions.
+ DAPL introduced the concept of directly waiting on the CQ for
+ dapltest: Implement a malloc() threshold for the completion reaping.
+ scm: handle connected state when freeing CM objects
+ scm, dtest: changes for winof gettimeofday and FD_SETSIZE settings.
+ scm: set TCP_NODELAY sockopt on the server side for sends.
+ remove obsolete files in dapl/udapl source tree
+ dtestcm: add UD type QP option to test
+ scm: destroy QP called before disconnect
+ cma: add support for rdma_cm TIME_WAIT event.
+ scm: remove old udapl_scm code replaced by openib_scm.
+ winof: fix issues after consolidating cma, scm code base.
+ cma: lock held when exiting as a result of a rdma_create_event_channel failure.
+ windows: all dlist functions have been moved to the header file.
+ dtestcm windows: add build infrastructure for new dtestcm test suite
+ openib_common: reorganize code base to share common mem, cq, qp, dto functions
+ scm: fixes and optimizations for connection scaling
+ scm: double the default fd_set_size
+ scm: EP reference in CR should be cleared during ep_destroy
+ dtestx: fix conn establishment event checking
+ dtestcm: new test to measure dapl connection rates.
+
+ Release 2.0.20
+ common,scm: add debug capabilities to print in-process CM lists
+ scm: disconnect EP before cleaning up orphaned CR's during dat_ep_free
+ dapltest: windows scripts updated
+ scm: private data is not handled properly via CR rejects.
+ scm: cleanup orphaned UD CR's when destroying the EP
+ scm: provider specific query for default UD MTU is wrong.
+ scm: update CM code to shutdown before closing socket
+ dapltest: windows script dt-cli.bat updated
+ dapl/windows cma provider: add support for network devices based on index
+ openib: remove 1st gen provider, replaced with openib_cma and openib_scm
+ dapltest: update windows script files
+ dapltest: windows batch files in scripts directory
+ windows_osd/linux_osd: new dapl_os_gettid macro to return thread id
+ windows: missing build files for common and udapl sub-directories
+ windows: add build files for openib_scm, remove /Wp64 build option.
+ scm: multi-hca CM processing broken. Need cr thread wakeup mechanism per HCA.
+ dtest: add connection timers on client side
+ linux_osd: use pthread_self instead of getpid for debug messages
+ windows ibal-scm: dapl/dirs file needs updated to remove ibal-scm
+
+ v1.2 Package:
+
+ Release 1.2.16
+ package: update Copyright file and include the 3 license files in distribution
+ cma: max sge incorrectly decremented during ibv_device_query
+
+ Release 1.2.15
+ dtest, dapltest: conflict with dapl-2 utils package, change to dapl1, dapltest1
+ scm: fix compiler warning, unused variable
+
+ ----------------
+
+ * Build Notes:
+
+ # NON_DEBUG build/install example for x86_64, OFED targets
+ ./configure --prefix /usr --sysconf=/etc --libdir /usr/lib64 LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include"
+ make install
+
+ # DEBUG build/install example for x86_64, using OFED targets
+ ./configure --enable-debug --prefix /usr --sysconf=/etc --libdir /usr/lib64 LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include"
+ make install
+
+ # COUNTERS build/install example for x86_64, using OFED targets
+ ./configure --prefix /usr --sysconf=/etc --libdir /usr/lib64 LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include -DDAPL_COUNTERS"
+ make install
+
+ ----------------
+
+ * BKM for running new DAPL library on your cluster without any impact on existing OFED installation:
+
+ Note: example for user /home/ardavis, (assumes /home/ardavis is exported) and MLX4 adapter, port 1
+
+ Download latest 2.x package: http://www.openfabrics.org/downloads/dapl/dapl-2.0.25.tar.gz
+
+ untar in /home/ardavis
+ cd /home/ardavis/dapl-2.0.25
+ ./configure && make (build on node with OFED 1.3 or higher installed, dependency on verb/rdma_cm libraries)
+
+ create /home/ardavis/dat.conf with following 3 lines. (entries with path to new libraries):
+
+ ofa-v2-ib0 u2.0 nonthreadsafe default /home/ardavis/dapl-2.0.25/dapl/udapl/.libs/libdaplcma.so.1 dapl.2.0 "ib0 0" ""
+ ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default /home/ardavis/dapl-2.0.25/dapl/udapl/.libs/libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
+ ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default /home/ardavis/dapl-2.0.25/dapl/udapl/.libs/libdaploucm.so.2 dapl.2.0 "mlx4_0 1" ""
+
+ Run uDAPL application or an MPI that uses uDAPL, with (assuming MLX4 connectx adapters) following:
+
+ setenv DAT_OVERRIDE /home/ardavis/dat.conf
+
+ If running Intel MPI and uDAPL socket cm, set the following:
+
+ setenv I_MPI_DEVICE rdssm:ofa-v2-mlx4_0-1
+
+ or if running Intel MPI and uDAPL IB UD cm, set the following:
+
+ setenv I_MPI_DEVICE rdssm:ofa-v2-mlx4_0-1u
+
+ or if running Intel MPI and uDAPL rdma_cm, set the following:
+
+ setenv I_MPI_DEVICE rdssm:ofa-v2-ib0
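+
+ Each dat.conf entry in the BKM above follows the same column layout. As
+ a rough sketch, the helper below formats one entry from its variable
+ parts. The function name, parameter names, and the sample library path
+ are hypothetical; the fixed columns are copied from the entries shown
+ above, and the meaning of each column is not spelled out in these notes.
+
```python
def dat_conf_entry(ia_name, library_path, device, port,
                   api_version="u2.0", provider_version="dapl.2.0"):
    """Format one dat.conf line in the layout used by the entries above."""
    return (
        f"{ia_name} {api_version} nonthreadsafe default "
        f'{library_path} {provider_version} "{device} {port}" ""'
    )


# Hypothetical build directory; substitute the path of your own build.
build_dir = "/home/user/dapl-build/dapl/udapl/.libs"
entries = [
    dat_conf_entry("ofa-v2-mlx4_0-1", f"{build_dir}/libdaploscm.so.2", "mlx4_0", 1),
    dat_conf_entry("ofa-v2-mlx4_0-1u", f"{build_dir}/libdaploucm.so.2", "mlx4_0", 1),
]
print("\n".join(entries))
```
+
+ Writing the printed lines to a file and pointing DAT_OVERRIDE at that
+ file gives the same result as editing dat.conf by hand.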
+
+-------------------------
+
+ OFED 1.4.1 RELEASE NOTES
+
+ NEW SINCE OFED 1.4 - new versions of uDAPL v1 (1.2.14-1) and v2 (2.0.19-1)
+
+ * New Features - optional counters, must be configured/built with -DDAPL_COUNTERS
+
+ * Bug Fixes
+
+ v2 - scm, cma: dat max_lmr_block_size is 32 bit, verbs max_mr_size is 64 bit
+ v2 - scm, cma: use direct SGE mappings from dat_lmr_triplet to ibv_sge
+ v2 - dtest: add flush EVD call after data transfer errors
+ v2 - scm: increase default MTU size from 1024 to 2048
+ v2 - dapltest: reset server listen ports to avoid collisions during long runs
+ v2 - dapltest: avoid duplicating ports, increment based on ep/thread count
+ v2 - dapltest: fix assumptions that multiple EP's will connect in order
+ v2 - common: sync missing with when removing items off of EVD pending queue
+ v2 - scm: reduce open time with thread start up
+ v2 - scm: getsockopt optlen needs initialized to size of optval
+ v2 - scm: cr_thread cleanup
+ v2 - OFED and WinOF code sync
+ v2 - scm: remove unnecessary query gid/lid from connection phase code.
+ v2 - scm: add optional 64-bit counters, build with -DDAPL_COUNTERS.
+ v1,v2 - spec files missing Requires(post) statements for sed/coreutils
+ v1,v2 - dtest/dapltest: use $(top_builddir) for .la files during test builds
+ v1,v2 - scm: remove unnecessary thread when using direct objects
+ v1,v2 - Fix SuSE 11 build issues, asm/atomic.h no longer exists
+
+ * Build Notes:
+
+ # NON_DEBUG build/install example for x86_64, OFED targets
+ ./configure --prefix /usr --sysconf=/etc --libdir /usr/lib64 LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include"
+ make install
+
+ # DEBUG build/install example for x86_64, using OFED targets
+ ./configure --enable-debug --prefix /usr --sysconf=/etc --libdir /usr/lib64 LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include"
+ make install
+
+ # COUNTERS build/install example for x86_64, using OFED targets
+ ./configure --prefix /usr --sysconf=/etc --libdir /usr/lib64 LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include -DDAPL_COUNTERS"
+ make install
+
+ * BKM for running new DAPL library on your cluster without any impact on existing OFED installation:
+
+ Note: example for user /home/ardavis, (assumes /home/ardavis is exported) and MLX4 adapter, port 1
+
+ Download latest 2.x package: http://www.openfabrics.org/downloads/dapl/dapl-2.0.19.tar.gz
+
+ untar in /home/ardavis
+ cd /home/ardavis/dapl-2.0.19
+ ./configure && make (build on node with OFED 1.3 or higher installed, dependency on verb/rdma_cm libraries)
+
+ create /home/ardavis/dat.conf with the following 2 lines (entries with paths to the new libraries):
+
+ ofa-v2-ib0 u2.0 nonthreadsafe default /home/ardavis/dapl-2.0.19/dapl/udapl/.libs/libdaplcma.so.1 dapl.2.0 "ib0 0" ""
+ ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default /home/ardavis/dapl-2.0.19/dapl/udapl/.libs/libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
+
+ Run a uDAPL application, or an MPI that uses uDAPL, with the following setting (assuming MLX4 ConnectX adapters):
+
+ setenv DAT_OVERRIDE /home/ardavis/dat.conf
+
+ If running Intel MPI and uDAPL socket cm, set the following:
+
+ setenv I_MPI_DEVICE rdssm:ofa-v2-mlx4_0-1
+
+ or if running Intel MPI and uDAPL rdma_cm, set the following:
+
+ setenv I_MPI_DEVICE rdssm:ofa-v2-ib0
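+
+ The dat.conf step in the BKM above can also be scripted. The following is a
+ minimal sketch, not part of the DAPL package: the function name write_dat_conf
+ and the use of a temporary output file are assumptions made for illustration,
+ while the two entries are copied from the example above.
+

```shell
#!/bin/sh
# Hypothetical helper: write the two dat.conf entries from the notes above.
#   $1 = DAPL build directory, $2 = output file
write_dat_conf() {
    libdir="$1/dapl/udapl/.libs"
    cat > "$2" <<EOF
ofa-v2-ib0 u2.0 nonthreadsafe default $libdir/libdaplcma.so.1 dapl.2.0 "ib0 0" ""
ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default $libdir/libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
EOF
}

# Write to a temporary file so the sketch does not touch a real setup.
DAT_CONF=$(mktemp)
write_dat_conf /home/ardavis/dapl-2.0.19 "$DAT_CONF"
cat "$DAT_CONF"
```

+ To use the result, copy the generated file to the location that DAT_OVERRIDE
+ points at.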
+
+-------------------------
+
+ OFED 1.4 RELEASE NOTES
+
+ NEW SINCE OFED 1.3.1 - new versions of uDAPL v1 (1.2.12-1) and v2 (2.0.15-1)
+
+ * New Features
+
+ 1. The new socket CM provider, introduced in 1.2.8 and 2.0.11 packages,
+ assumes a homogeneous cluster. It sets up the QPs based on local HCA port
+ attributes and exchanges QP information via sockets, using the hostname of
+ each node. IPoIB and rdma_cm are NOT required for this provider. QP attributes
+ can be adjusted via the following environment parameters:
+
+ DAPL_ACK_TIMER (default=16; 5 bits; timeout = 4.096us * 2^ack_timer, so 16 == 268ms)
+ DAPL_ACK_RETRY (default=7; 3 bits; 7 * 268ms = 1.8 seconds)
+ DAPL_RNR_TIMER (default=12; 5 bits; 12 == 0.64ms, 28 == 163ms, 31 == 491ms)
+ DAPL_RNR_RETRY (default=7; 3 bits; 7 == infinite)
+ DAPL_IB_MTU (default=1024; limited to the active MTU max)
+
+ The new socket cm entries in /etc/dat.conf provide a link to the actual HCA
+ device and port. Example v1 and v2 entries for a Mellanox ConnectX device, port 1:
+
+ OpenIB-mlx4_0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 1" ""
+ ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
+
+ This new socket cm provider was successfully tested on the TATA CRL cluster
+ (#8 on Top500) with Intel MPI, achieving a HPLinpack score of 132.8TFlops on
+ 1798 nodes, 14384 cores at ~76.9% of peak. DAPL_ACK_TIMER was increased to 21
+ for this scale.
+
+ 2. New v2 definitions for IB unreliable datagram extension (only supported in
+ scm provider, libdaploscm.so.2)
+
+ Extended EP dat_service_type, with DAT_IB_SERVICE_TYPE_UD
+ Add IB extension call dat_ib_post_send_ud().
+ Add address handle definition for UD calls.
+ Add IB event definitions to provide remote AH via connect and connect requests
+ See dtestx (-d) source for example usage model
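+
+ The DAPL_ACK_TIMER encoding described in item 1 above (4.096us * 2^ack_timer)
+ can be checked with a few lines of shell arithmetic. This is only a sketch:
+ the function names are hypothetical, and it assumes a shell with 64-bit
+ integer arithmetic (as bash and dash provide on common platforms).
+

```shell
#!/bin/sh
# ACK timeout in nanoseconds for a given DAPL_ACK_TIMER value:
# 4.096 us * 2^ack_timer == 4096 ns * 2^ack_timer.
ack_timeout_ns() {
    echo $(( 4096 * (1 << $1) ))
}

# Same value truncated to whole milliseconds.
ack_timeout_ms() {
    echo $(( $(ack_timeout_ns "$1") / 1000000 ))
}

# 16 is the default; 21 is the value used for the large-scale run above.
for timer in 16 21; do
    echo "DAPL_ACK_TIMER=$timer -> $(ack_timeout_ms "$timer") ms"
done
```

+ A value of 16 yields 268 ms, matching the default quoted above; each step up
+ doubles the timeout, so 21 corresponds to roughly 8.6 seconds.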
+
+ * Bug Fixes
+
+ v1,v2 - dapltest: trans test moves to cleanup stage before rdma_read processing is complete
+ v1,v2 - Fix static registration (dat.conf) to include sysconfdir override
+ v1,v2 - dat.conf: add default iwarp entry for eth2
+ v1,v2 - dapl: adjust max_rdma_read_iov to 1 for iWARP devices
+ v1,v2 - dtest: reduce default IOVs for ep_create to support iWARP
+ v1,v2 - dtest: fix 32-bit build issues
+ v1,v2 - build: $(DESTDIR) prepend needed on install hooks for dat.conf
+ v2 - scm: UD shares EPs, which requires serialization
+ v2 - dapl: fixes for IB UD extensions in common code and socket cm provider.
+ v2 - dapl: add provider specific attribute query option for IB UD MTU size
+ v2 - dapl build: add correct CFLAGS, set non-debug build by default for v2
+ v2 - dtestx: fix stack corruption problem with hostname strcpy
+ v2 - dapl extension: dapli_post_ext should always allocate cookie for requests.
+ v2 - dapltest: manpage - rdma write example incorrect
+ v1,v2 - dat, dapl, dtest, dapltest, providers: fix compiler warnings in dat common code
+ v1,v2 - dapl cma: debug message during query needs definition for inet_ntoa
+ v1,v2 - dapl scm: fix corner case that delivers duplicate disconnect events
+ v1,v2 - dat: include stddef.h for NULL definition in dat_platform_specific.h
+ v1,v2 - dapl: add debug messages during async and overflow events
+ v1,v2 - dapltest: add check for duplicate disconnect events in transaction test
+ v1,v2 - dapl scm: use correct device attribute for max_rdma_read_out, max_qp_init_rd_atom
+ v1,v2 - dapl scm: change IB RC qp inline and timer defaults.
+ v1,v2 - dapl scm: add mtu adjustments via environment, default = 1024.
+ v1,v2 - dapl scm: change connect and accept to non-blocking to avoid blocking user thread.
+ v1,v2 - dapl scm: update max_rdma_read_iov, max_rdma_write_iov EP attributes during query
+ v1,v2 - dat: allow TYPE_ERR messages to be turned off with DAT_DBG_TYPE
+ v1,v2 - dapl: remove needless terminating 0 in dto_op_str functions.
+ v1,v2 - dat: remove reference to doc/dat.conf in makefile.am
+ v1,v2 - dapl scm: fix ibv_destroy_cq busy error condition during dat_evd_free.
+ v1,v2 - dapl scm: add stdout logging for uname and gethostbyname errors during open.
+ v1,v2 - dapl scm: support global routing and set mtu based on active_mtu
+ v1,v2 - dapl: add opcode to string function to report opcode during failures.
+ v1,v2 - dapl: remove unused iov buffer allocation on the endpoint
+ v1,v2 - dapl: endpoint pending request count is wrong
+
+-------------------------
+
+ OFED 1.3.1 RELEASE NOTES
+
+ NEW SINCE OFED 1.3 - new versions of uDAPL v1 (1.2.7-1) and v2 (2.0.9-1)
+
+ * New Features - None
+
+ * Bug Fixes
+ v2 - add private data exchange with reject
+ v1,v2 - better error reporting in non-debug builds
+ v1,v2 - update only OFA entries in dat.conf, cooperate with non-ofa providers
+ v1,v2 - support for zero byte operations, iov==NULL
+ v1,v2 - multi-transport support for inline data and private data differences
+ v1,v2 - fix memory leaks and other reported bugs since OFED 1.3
+ v1,v2 - dtest,dtestx,dapltest build issues on RHEL5.1
+ v1,v2 - long delay during dat_ia_open when DNS not configured
+ v1,v2 - use rdma_read_in/out from ep_attr per consumer instead of HCA max
+
+-------------------------
+
+ OFED 1.3 RELEASE NOTES
+
+ NEW SINCE OFED 1.2
+
+ * New Features
+
+ 1. Add v2.0 library support for new 2.0 API Specification
+ 2. Separate v1.2 library release to co-exist with v2.0 libraries.
+ 3. New dat.conf with both 1.2 and 2.0 support
+ 4. New v2.0 dtestx utilities to test IB extensions
+
+ * Bug Fixes
+
+ v1.2 and v2.0
+ - uDAT: static/dynamic registry parsing fixes
+ - uDAPL: provider fixes for dat_psp_create_any
+ - dtest/dapltest: change default provider names to sync with dat.conf
+ - openib_cma: issues with destroy_cm_id and init/resp exchange
+ - dapltest: use gettimeofday instead of get_cycles for better portability
+ - dapltest: endian issue with mem_handle, mem_address
+ - dapltest fix to include inet_ntoa definitions
+ - fix build problems on 32-bit and 64-bit PowerPC
+ - cleanup packaging
+
+ v2.0
+ - set default config options to match spec file, --enable-debug --enable-ext-type=ib
+ - use unique devel target names, libdat2.so, /usr/include/dat2
+ - dtestx fix memory leak, freeaddrinfo after getaddrinfo
+ - Fix for IB extended DTO cookie deallocation on inbound rdma_Write_immed
+ - WinOF: Update OFED code base to include WinOF changes, work from same code base
+ - WinOF: add DAT_API definition, __stdcall for windows, nothing for linux
+ - dtest: add dat_evd_query to check correct size
+ - openib_cma: add macro to convert SID to PORT
+ - dtest: endian support for exchanging RMR info
+ - openib_cma: lower default settings, inline and RDMA init/resp
+ - openib_cma: missing ia_query for max_iov_segments_per_rdma_write
+
+ v1.2
+ - openib_cma: turn down dbg noise level on rejects
+ - dtest: typo in memset
+
+
+ BUILD: v1 and v2 uDAPL source install/build instructions (redhat example):
+
+ # cd to distribution SRPMS directory
+ cd /tmp/OFED-1.3/SRPMS
+ rpm -i dapl-1.2*.rpm
+ rpm -i dapl-2.0*.rpm
+ cd /usr/src/redhat/SOURCES
+ tar zxf dapl-1.2*.tgz
+ tar zxf dapl-2.0*.tgz
+
+ # NON_DEBUG build example for x86_64, using OFED targets
+
+ ./configure --prefix /usr --sysconf=/etc --libdir /usr/lib64 \
+ LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include"
+
+ # build and install
+
+ make
+ make install
+
+ # DEBUG build example for x86_64, using OFED targets
+
+ ./configure --enable-debug --prefix /usr --sysconf=/etc --libdir /usr/lib64 \
+ LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include"
+
+ # build and install
+
+ make
+ make install
+
+ # DEBUG messages: set environment variable DAPL_DBG_TYPE, default
+ mapping is 0x0003
+
+ DAPL_DBG_TYPE_ERR = 0x0001,
+ DAPL_DBG_TYPE_WARN = 0x0002,
+ DAPL_DBG_TYPE_EVD = 0x0004,
+ DAPL_DBG_TYPE_CM = 0x0008,
+ DAPL_DBG_TYPE_EP = 0x0010,
+ DAPL_DBG_TYPE_UTIL = 0x0020,
+ DAPL_DBG_TYPE_CALLBACK = 0x0040,
+ DAPL_DBG_TYPE_DTO_COMP_ERR= 0x0080,
+ DAPL_DBG_TYPE_API = 0x0100,
+ DAPL_DBG_TYPE_RTN = 0x0200,
+ DAPL_DBG_TYPE_EXCEPTION = 0x0400,
+ DAPL_DBG_TYPE_SRQ = 0x0800,
+ DAPL_DBG_TYPE_CNTR = 0x1000
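+
+ The DAPL_DBG_TYPE value is a bitwise OR of the flags above. The sketch below
+ builds a mask that adds CM and EP messages to the default ERR and WARN bits.
+ It assumes the library accepts a hexadecimal string in the environment
+ variable; the choice of flags is only an illustration.
+

```shell
#!/bin/sh
# Bit values copied from the table above.
DAPL_DBG_TYPE_ERR=0x0001
DAPL_DBG_TYPE_WARN=0x0002
DAPL_DBG_TYPE_CM=0x0008
DAPL_DBG_TYPE_EP=0x0010

# OR the selected bits together and format the result as 4-digit hex.
mask=$(( $DAPL_DBG_TYPE_ERR | $DAPL_DBG_TYPE_WARN | $DAPL_DBG_TYPE_CM | $DAPL_DBG_TYPE_EP ))
DAPL_DBG_TYPE=$(printf '0x%04x' "$mask")
export DAPL_DBG_TYPE

echo "DAPL_DBG_TYPE=$DAPL_DBG_TYPE"
```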
+
+-------------------------
+
+ OFED 1.2 RELEASE NOTES
+
+ NEW SINCE Gamma 3.2 and OFED 1.1
+
+ * New Features
+
+ 1. Added dtest and dapltest to the openfabrics build and utils rpm.
+ Includes manpages.
+ 2. Added the following environment variables to configure connection management
+ timers (default settings) for larger clusters:
+
+ DAPL_CM_ARP_TIMEOUT_MS 4000
+ DAPL_CM_ARP_RETRY_COUNT 15
+ DAPL_CM_ROUTE_TIMEOUT_MS 4000
+ DAPL_CM_ROUTE_RETRY_COUNT 15
+
+ * Bug Fixes
+
+ + Added support for new ib verbs client register event. No extra
+ processing required at the uDAPL level.
+ + Fix some issues supporting create qp without recv cq handle or
+ recv qp resources. IB verbs assume a recv_cq handle and uDAPL
+ dapl_ep_create assumes there is always recv_sge resources specified.
+ + Fix some timeout and long disconnect delay issues discovered during
+ scale-out testing. Added support to retry rdma_cm address and route
+ resolution with configuration options. Provide a disconnect call
+ when receiving the disconnect request to guarantee a disconnect reply
+ and event on the remote side. The rdma_disconnect was not being called
+ from dat_ep_disconnect() as a result of the state changing
+ to DISCONNECTED in the event callback.
+ + Changes to support exchanging and validation of the device
+ responder_resources and the initiator_depth during conn establishment
+ + Fix some build issues with dapltest on 32 bit arch, and on ia64 SUSE arch
+ + Add support for multiple IB devices to dat.conf to support IPoIB HA failover
+ + Fix atomic operation build problem with ia64 and RHEL5.
+ + Add support to return local and remote port information with dat_ep_query
+ + Cleanup RPM specfile for the dapl package, move to 1.2-1 release.
+
+ NEW SINCE Gamma 3.1 and OFED 1.0
+
+ * BUG FIXES
+
+ + Update obsolete CLK_TCK to CLOCKS_PER_SEC
+ + Fill out some uninitialized fields in the ia_attr structure returned by
+ dat_ia_query().
+ + Update dtest to support multiple segments on rdma write and change
+ makefile to use OpenIB-cma by default.
+ + Add support for dat_evd_set_unwaitable on a DTO evd in openib_cma
+ provider
+ + Added errno reporting (message and return codes) during open to help
+ diagnose create thread issues.
+ + Fix some suspicious inline assembly EIEIO_ON_SMP and ISYNC_ON_SMP
+ + Fix IA64 build problems
+ + Lower the reject debug message level so we don't see warnings when
+ consumers reject.
+ + Added support for active side TIMED_OUT event from a provider.
+ + Fix bug in dapls_ib_get_dat_event() call after adding new unreachable
+ event.
+ + Update for new rdma_create_id() function signature.
+ + Set max rdma read per EP attributes
+ + Report the proper error and timeout events.
+ + Socket CM fix to guard against using a loopback address as the local
+ device address.
+ + Use the uCM set_option feature to adjust connect request timeout
+ retry values.
+ + Fix to disallow any event after a disconnect event.
+
+ * OFED 1.1 uDAPL source build instructions:
+
+ cd /usr/local/ofed/src/openib-1.1/src/userspace/dapl
+
+ # NON_DEBUG build configuration
+
+ ./configure --disable-libcheck --prefix /usr/local/ofed \
+ --libdir /usr/local/ofed/lib64 LDFLAGS=-L/usr/local/ofed/lib64 \
+ CPPFLAGS="-I../libibverbs/include -I../librdmacm/include"
+
+ # build and install
+
+ make
+ make install
+
+ # DEBUG build configuration
+
+ ./configure --disable-libcheck --enable-debug --prefix /usr/local/ofed \
+ --libdir /usr/local/ofed/lib64 LDFLAGS=-L/usr/local/ofed/lib64 \
+ CPPFLAGS="-I../libibverbs/include -I../librdmacm/include"
+
+ # build and install
+
+ make
+ make install
+
+ # DEBUG messages: set environment variable DAPL_DBG_TYPE, default
+ mapping is 0x0003
+
+ DAPL_DBG_TYPE_ERR = 0x0001,
+ DAPL_DBG_TYPE_WARN = 0x0002,
+ DAPL_DBG_TYPE_EVD = 0x0004,
+ DAPL_DBG_TYPE_CM = 0x0008,
+ DAPL_DBG_TYPE_EP = 0x0010,
+ DAPL_DBG_TYPE_UTIL = 0x0020,
+ DAPL_DBG_TYPE_CALLBACK = 0x0040,
+ DAPL_DBG_TYPE_DTO_COMP_ERR= 0x0080,
+ DAPL_DBG_TYPE_API = 0x0100,
+ DAPL_DBG_TYPE_RTN = 0x0200,
+ DAPL_DBG_TYPE_EXCEPTION = 0x0400,
+ DAPL_DBG_TYPE_SRQ = 0x0800,
+ DAPL_DBG_TYPE_CNTR = 0x1000
+
+
+ Note: The udapl provider library libdaplscm.so is untested and
+ unsupported, thus customers should not use it.
+ It will be removed in the next OFED release.
+
+ DAPL GAMMA 3.1 RELEASE NOTES
+
+ This release of the DAPL reference implementation
+ is timed to coincide with the first release of the
+ Open Fabrics (www.openfabrics.org) software stack.
+ This release adds support for this new stack, which
+ is now the native Linux RDMA stack.
+
+ This release also adds a new licensing option. In
+ addition to the Common Public License and BSD License,
+ the code can now be licensed under the terms of the GNU
+ General Public License (GPL) version 2.
+
+ NEW SINCE Gamma 3.0
+
+ - GPL v2 added as a licensing option
+ - OpenFabrics (aka OpenIB) gen2 verbs support
+ - dapltest support for Solaris 10
+
+ * BUG FIXES
+
+ + Fixed a disconnect event processing race
+ + Fix to destroy all QPs on IA close
+ + Removed compiler warnings
+ + Removed unused variables
+ + And many more...
+
+ DAPL GAMMA 3.0 RELEASE NOTES
+
+ This is the first release based on version 1.2 of the spec. There
+ are some components, such as shared receive queues (SRQs), which
+ are not implemented yet.
+
+ Once again there were numerous bug fixes submitted by the
+ DAPL community.
+
+ NEW SINCE Beta 2.06
+
+ - DAT 1.2 headers
+ - DAT_IA_HANDLEs implemented as small integers
+ - Changed default device name to be "ia0a"
+ - Initial support for Linux 2.6.X kernels
+ - Updates to the OpenIB gen 1 provider
+
+ * BUG FIXES
+
+ + Updated Makefile for differentiation between OS releases.
+ + Updated atomic routines to use appropriate API
+ + Removed unnecessary assert from atomic_dec.
+ + Fixed bugs when freeing a PSP.
+ + Fixed error codes returned by the DAT static registry.
+ + Kernel updates for dat_strerror.
+ + Cleaned up the transport layer/adapter interface to use DAPL
+ types rather than transport types.
+ + Fixed ring buffer reallocation.
+ + Removed old test/udapl/dapltest directory.
+ + Fixed DAT_IA_HANDLE translation (from pointer to int and
+ vice versa) on 64-bit platforms.
+
+ DAPL BETA 2.06 RELEASE NOTES
+
+ We are not planning any further releases of the Beta series,
+ which are based on the 1.1 version of the spec. There may be
+ further releases for bug fixes, but we anticipate the DAPL
+ community to move to the new 1.2 version of the spec and the
+ changes mandated in the reference implementation.
+
+ The biggest item in this release is the first inclusion of the
+ OpenIB Gen 1 provider, an item generating a lot of interest in
+ the IB community. This implementation has graciously been
+ provided by the Mellanox team. The kdapl implementation is in
+ progress, and we imagine work will soon begin on Gen 2.
+
+ There are also a handful of bug fixes available, as well as a long
+ awaited update to the endpoint design document.
+
+ NEW SINCE Beta 2.05
+
+ - OpenIB gen 1 provider support has been added
+ - Added dapls_evd_post_generic_event(), routine to post generic
+ event types as requested by some providers. Also cleaned up
+ error reporting.
+ - Updated the endpoint design document in the doc/ directory.
+
+ * BUG FIXES
+
+ + Cleaned up memory leak on close by freeing the HCA structure.
+ + Removed bogus #defs for rdtsc calls on IA64.
+ + Changed dapltest thread types to use internal types for
+ portability & correctness
+ + Various 64 bit enhancements & updates
+ + Fixes to conformance test that were defining CONN_QUAL twice
+ and using it in different ways
+ + Cleaned up private data handling in ep_connect & provider
+ support: we now avoid extra copy in connect code; reduced
+ stack requirements by using private_data structure in the EP;
+ removed provider variable.
+ + Fixed problem in the dat conformance test where cno_wait would
+ attempt to dereference a timer value and SEGV.
+ + Removed old vestiges of deprecated POLLING_COMPLETIONS
+ conditionals.
+
+ DAPL BETA 2.05 RELEASE NOTES
+
+ This was to be a very minor release; the primary change was
+ going to be the new wording of the DAT license as contained in
+ the header for all source files. But the interest and
+ development occurring in DAPL provided some extra bug fixes, and
+ some new functionality that has been requested for a while.
+
+ First, you may notice that every single source file was
+ changed. If you read the release notes from DAPL BETA 2.04, you
+ were warned this would happen. There was a legal issue with the
+ wording in the header; the end result was that every source file
+ was required to change the words 'either of' to 'both'. We've
+ been putting this change off as long as possible, but we wanted
+ to do it in a clean drop before we start working on DAT 1.2
+ changes in the reference implementation, just to keep things
+ reasonably sane.
+
+ kdapltest has enabled three of the subtests supported by
+ dapltest. The Performance test in particular has been very
+ useful to dapltest in getting minima and maxima. The Limit test
+ pushes the limits by allocating the maximum number of specific
+ resources. And the FFT tests are also available.
+
+ Most vendors have supported shared memory regions for a while,
+ and several of them have asked the reference implementation team to
+ provide a common implementation. Shared memory registration has
+ been tested on ibapi, and compiled into vapi. Both InfiniBand
+ providers have the restriction that a memory region must be
+ created before it can be shared; not all RDMA APIs are this way,
+ several allow you to declare a memory region shared when it is
+ registered. Hence, details of the implementation are hidden in
+ the provider layer, rather than forcing other APIs to do
+ something strange.
+
+ This release also contains some changes that will allow dapl to
+ work on Opteron processors, as well as some preliminary support
+ for Power PC architecture. These features are not well tested
+ and may be incomplete at this time.
+
+ Finally, we have been asked several times over the course of the
+ project for a canonical interface between the common and
+ provider layers. This release includes a dummy provider to meet
+ that need. Anyone should be able to download the release and do
+ a:
+ make VERBS=DUMMY
+
+ And have a cleanly compiled dapl library. This will be useful
+ both to those porting new transport providers, as well as those
+ going to new machines.
+
+ The DUMMY provider has been compiled on both Linux and Windows
+ machines.
+
+
+ NEW SINCE Beta 2.04
+ - kdapltest enhancements:
+ * Limit subtests now work
+ * Performance subtests now work.
+ * FFT tests now work.
+
+ - The VAPI headers have been refreshed by Mellanox
+
+ - Initial Opteron and PPC support.
+
+ - Atomic data types now have consistent treatment, allowing us to
+ use native data types other than integers. The Linux kdapl
+ uses atomic_t, allowing dapl to use the kernel macros and
+ eliminate the assembly code in dapl_osd.h
+
+ - The license language was updated per the direction of the
+ DAT Collaborative. This two word change affected the header
+ of every file in the tree.
+
+ - SHARED memory regions are now supported.
+
+ - Initial support for the TOPSPIN provider.
+
+ - Added a dummy provider, essentially the NULL provider. Its
+ purpose is to aid in porting and to clarify exactly what is
+ expected in a provider implementation.
+
+ - Removed memory allocation from the DTO path for VAPI
+
+ - cq_resize will now allow the CQ to be resized smaller. Not all
+ providers support this, but it's a provider problem, not a
+ limitation of the common code.
+
+ * BUG FIXES
+
+ + Removed spurious lock in dapl_evd_connection_callb.c that
+ would have caused a deadlock.
+ + The Async EVD was getting torn down too early, potentially
+ causing lost errors. Has been moved later in the teardown
+ process.
+ + kDAPL replaced mem_map_reserve() with newer SetPageReserved()
+ for better Linux integration.
+ + kdapltest no longer allocates large print buffers on the stack and
+ is more careful to ensure buffers don't overflow.
+ + Put dapl_os_dbg_print() under DAPL_DBG conditional; it is
+ supposed to go away in a production build.
+ + dapltest protocol version has been bumped to reflect the
+ change in the Service ID.
+ + Corrected several instances of routines that did not adhere
+ to the DAT 1.1 error code scheme.
+ + Cleaned up vapi ib_reject_connection to pass DAT types rather
+ than provider specific types. Also cleaned up naming interface
+ declarations and their use in vapi_cm.c; fixed incorrect
+ #ifdef for naming.
+ + Initialize missing uDAPL provider attr, pz_support.
+ + Changes for better layering: first, moved
+ dapl_lmr_convert_privileges to the provider layer as memory
+ permissions are clearly transport specific and are not always
+ defined in an integer bitfield; removed common routines for
+ lmr and rmr. Second, moved init and release setup/teardown
+ routines into adapter_util.h, which defines the provider
+ interface.
+ + Cleaned up the HCA name cruft that allowed different types
+ of names such as strings or ints to be dealt with in common
+ code; but all names are presented by the dat_registry as
+ strings, so pushed conversions down to the provider
+ level. Greatly simplifies names.
+ + Changed deprecated true/false to DAT_TRUE/DAT_FALSE.
+ + Removed old IB_HCA_NAME type in favor of char *.
+ + Fixed race condition in kdapltest's use of dat_evd_dequeue.
+ + Changed cast for SERVER_PORT_NUMBER to DAT_CONN_QUAL as it
+ should be.
+ + Small code reorg to put the CNO into the EVD when it is
+ allocated, which simplifies things.
+ + Removed gratuitous ib_hca_port_t and ib_send_op_type_t types,
+ replaced with standard int.
+ + Pass a pointer to cqe debug routine, not a structure. Some
+ clean up of data types.
+ + kdapl threads now invoke reparent_to_init() on exit to allow
+ threads to get cleaned up.
+
+
+
+ DAPL BETA 2.04 RELEASE NOTES
+
+ The big changes for this release involve a more strict adherence
+ to the original dapl architecture. Originally, only InfiniBand
+ providers were available, so allowing various data types and
+ event codes to show through into common code wasn't a big deal.
+
+ But today, there are an increasing number of providers available
+ on a number of transports. Requiring an IP iWarp provider to
+ match up to InfiniBand events is silly, for example.
+
+ Restructuring the code allows more flexibility in providing an
+ implementation.
+
+ There are also a large number of bug fixes available in this
+ release, particularly in kdapl related code.
+
+ Be warned that the next release will change every file in the
+ tree as we move to the newly approved DAT license. This is a
+ small change, but all files are affected.
+
+ Future releases will also support the soon-to-be-ratified DAT
+ 1.2 specification.
+
+ This release has benefited from many bug reports and fixes from
+ a number of individuals and companies. On behalf of the DAPL
+ community, thank you!
+
+
+ NEW SINCE Beta 2.03
+
+ - Made several changes to be more rigorous on the layering
+ design of dapl. The intent is to make it easier for non
+ InfiniBand transports to use dapl. These changes include:
+
+ * Revamped the ib_hca_open/close code to use an hca_ptr
+ rather than an ib_handle, giving the transport layer more
+ flexibility in assigning transport handles and resources.
+
+ * Removed the CQD calls, which are specific to the IBM API;
+ folded this functionality into the provider open/close calls.
+
+ * Moved VAPI, IBAPI transport specific items into a transport
+ structure placed inside of the HCA structure. Also updated
+ routines using these fields to use the new location. Cleaned
+ up provider knobs that have been exposed for too long.
+
+ * Changed a number of provider routines to use DAPL structure
+ pointers rather than exposing provider handles & values. Moved
+ provider specific items out of common code, including provider
+ data types (e.g. ib_uint32_t).
+
+ * Pushed provider completion codes and type back into the
+ provider layer. We no longer use EVD or CM completion types at
+ the common layer, instead we obtain the appropriate DAT type
+ from the provider and process only DAT types.
+
+ * Change private_data handling such that we can now accommodate
+ variable length private data.
+
+ - Remove DAT 1.0 cruft from the DAT header files.
+
+ - Better spec compliance in headers and various routines.
+
+ - Major updates to the VAPI implementation from
+ Mellanox. Includes initial kdapl implementation
+
+ - Move kdapl platform specific support for hash routines into
+ OSD file.
+
+ - Cleanups to make the code more readable, including comments
+ and certain variable and structure names.
+
+ - Fixed CM_BUSTED code so that it works again: very useful for
+ new dapl ports where infrastructure is lacking. Also made
+ some fixes for IBHOSTS_NAMING conditional code.
+
+ - Added DAPL_MERGE_CM_DTO as a compile time switch to support
+ EVD stream merging of CM and DTO events. Default is off.
+
+ - 'Quit' test ported to kdapltest
+
+ - uDAPL now builds on Linux 2.6 platform (SuSE 9.1).
+
+ - kDAPL now builds for a larger range of Linux kernels, but
+ still lacks 2.6 support.
+
+ - Added shared memory ID to LMR structure. Shared memory is
+ still not fully supported in the reference implementation, but
+ the common code will appear soon.
+
+ * Bug fixes
+ - Various Makefiles fixed to use the correct dat registry
+ library in its new location (as of Beta 2.03)
+ - Simple reorg of dat header files to be consistent with
+ the spec.
+ - fixed bug in vapi_dto.h recv macro where we could have an
+ uninitialized pointer.
+ - Simple fix in dat_dr.c to initialize a variable early in the
+ routine before errors occur.
+ - Removed private data pointers from a CONNECTED event, as
+ there should be no private data here.
+ - dat_strerror no longer returns an uninitialized pointer if
+ the error code is not recognized.
+ - dat_dup_connect() will reject 0 timeout values, per the
+ spec.
+ - Removed unused internal_hca_names parameter from
+ ib_enum_hcas() interface.
+ - Use a temporary DAT_EVENT for kdapl up-calls rather than
+ making assumptions about the current event queue.
+ - Relocated some platform dependent code to an OSD file.
+ - Eliminated several #ifdefs in .c files.
+ - Inserted a missing unlock() on an error path.
+ - Added bounds checking on size of private data to make sure
+ we don't overrun the buffer
+ - Fixed a kdapltest problem that caused a machine to panic if
+ the user hit ^C
+ - kdapltest now uses spin locks more appropriate for their
+ context, e.g. spin_lock_bh or spin_lock_irq. Under a
+ conditional.
+ - Fixed kdapltest loops that drain EVDs so they don't go into
+ endless loops.
+ - Fixed bug in dapl_llist_add_entry link list code.
+ - Better error reporting from provider code.
+ - Handle case of user trying to reap DTO completions on an
+ EP that has been freed.
+ - No longer hold lock when ep_free() calls into provider layer
+ - Fixed cr_accept() to not have an extra copy of
+ private_data.
+ - Verify private_data pointers before using them, avoid
+ panic.
+ - Fixed memory leak in kdapltest where print buffers were not
+ getting reclaimed.
+
+
+
+ DAPL BETA 2.03 RELEASE NOTES
+
+ There are some prominent features in this release:
+ 1) dapltest/kdapltest. The dapltest test program has been
+ rearchitected such that a kernel version is now available
+ to test with kdapl. The most obvious change is a new
+ directory structure that more closely matches other core
+ dapl software. But there are a large number of changes
+ throughout the source files to accommodate both the
+ differences in udapl/kdapl interfaces and more mundane
+ things such as printing.
+
+ The new dapltest is in the tree at ./test/dapltest, while the
+ old remains at ./test/udapl/dapltest. For this release, we
+ have maintained both versions. In a future release, perhaps
+ the next release, the old dapltest directory will be
+ removed. Ongoing development will only occur in the new tree.
+
+ 2) DAT 1.1 compliance. The DAT Collaborative has been busy
+ finalizing the 1.1 revision of the spec. The header files
+ have been reviewed and posted on the DAT Collaborative web
+ site; they are now in full compliance.
+
+ The reference implementation has been at a 1.1 level for a
+ while. The current implementation has some features that will
+ be part of the 1.2 DAT specification, but only in places
+ where full compatibility can be maintained.
+
+ 3) The DAT Registry has undergone some positive changes for
+ robustness and support of more platforms. It now has the
+ ability to support several identical provider names
+ simultaneously, which enables the same dat.conf file to
+ support multiple platforms. The registry will open each
+ library and return when successful. For example, a dat.conf
+ file may contain multiple provider names for ex0a, each
+ pointing to a different library that may represent different
+ platforms or vendors. This simplifies distribution into
+ different environments by enabling the use of common
+ dat.conf files.
+
+ In addition, there are a large number of bug fixes throughout
+ the code. Bug reports and fixes have come from a number of
+ companies.
+
+ Also note that the Release notes are cleaned up, no longer
+ containing the complete text of previous releases.
+
+ * EVDs no longer support DTO and CONNECTION event types on the
+ same EVD. NOTE: The problem is maintaining the event ordering
+ between two channels such that no DTO completes before a
+ connection is received; and no DTO completes after a
+ disconnect is received. For 90% of the cases this can be made
+ to work, but the remaining 10% will cause serious performance
+ degradation to get right.
+
+ NEW SINCE Beta 2.2
+
+ * DAT 1.1 spec compliance. This includes some new types, error
+ codes, and moving structures around in the header files,
+ among other things. Note the Class bits of dat_error.h have
+ returned to a #define (from an enum) to cover the broadest
+ range of platforms.
+
+ * Several additions for robustness, including handle and
+ pointer checking, better argument checking, state
+ verification, etc. Better recovery from error conditions,
+ and some assert()s have been replaced with 'if' statements to
+ handle the error.
+
+ * EVDs now maintain the actual queue length, rather than the
+ requested amount. Both the DAT spec and IB (and other
+ transports) allow the underlying implementation to provide
+ more CQ entries than requested.
+
+ Requests for the same number of entries contained by an EVD
+ return immediate success.
+
+ * kDAPL enhancements:
+ - module parameters & OS support calls updated to work with
+ more recent Linux kernels.
+ - kDAPL build options changes to match the Linux kernel, vastly
+ reducing the size and making it more robust.
+ - kDAPL unload now works properly
+ - kDAPL takes a reference on the provider driver when it
+ obtains a verbs vector, to prevent an accidental unload
+ - Cleaned out all of the uDAPL cruft from the linux/osd files.
+
+ * New dapltest (see above).
+
+ * Added a new I/O trace facility, enabling a developer to debug
+ all I/O that are in progress or recently completed. Default
+ is OFF in the build.
+
+ * 0 timeout connections now refused, per the spec.
+
+ * Moved the remaining uDAPL specific files from the common/
+ directory to udapl/. Also removed udapl files from the kdapl
+ build.
+
+ * Bug fixes
+ - Better error reporting from provider layer
+ - Fixed race condition on reference counts for posting DTO
+ ops.
+ - Use DAT_COMPLETION_SUPPRESS_FLAG to suppress successful
+ completion of dapl_rmr_bind (instead of
+ DAT_COMPLETION_UNSIGNALLED, which is for non-notification
+ completion).
+ - Verify psp_flags value per the spec
+ - Bug in psp_create_any() checking psp_flags fixed
+ - Fixed type of flags in ib_disconnect from
+ DAT_COMPLETION_FLAGS to DAT_CLOSE_FLAGS
+ - Removed hard coded check for ASYNC_EVD. Placed all EVD
+ prevention in evd_stream_merging_supported array, and
+ prevent ASYNC_EVD from being created by an app.
+ - ep_free() fixed to comply with the spec
+ - Replaced various printfs with dbg_log statements
+ - Fixed kDAPL interaction with the Linux kernel
+ - Corrected phy_register prototype
+ - Corrected kDAPL wait/wakeup synchronization
+ - Fixed kDAPL evd_kcreate() such that it no longer depends
+ on uDAPL only code.
+ - dapl_provider.h had wrong guard #def: changed DAT_PROVIDER_H
+ to DAPL_PROVIDER_H
+ - removed extra (and bogus) call to dapls_ib_completion_notify()
+ in evd_kcreate.c
+ - Inserted missing error code assignment in
+ dapls_rbuf_realloc()
+ - When a CONNECTED event arrives, make sure we are ready for
+ it, else something bad may have happened to the EP and we
+ just return; this replaces an explicit check for a single
+ error condition, replacing it with the general check for the
+ state capable of dealing with the request.
+ - Better context pointer verification. Removed locks around
+ call to ib_disconnect on an error path, which would result
+ in a deadlock. Added code for BROKEN events.
+ - Brought the vapi code more up to date: added conditional
+ compile switches, removed obsolete __ActivePort, deal
+ with 0 length DTO
+ - Several dapltest fixes to bring the code up to the 1.1
+ specification.
+ - Fixed mismatched dapl_os_dbg_print() #else dapl_Dbg_Print();
+ the latter was replaced with the former.
+ - ep_state_subtype() now includes UNCONNECTED.
+ - Added some missing ibapi error codes.
+
+
+
+ NEW SINCE Beta 2.1
+
+ * Changes for Errata and 1.1 Spec
+ - Removed DAT_NAME_NOT_FOUND, per DAT errata
+ - EVDs with DTO and CONNECTION flags set are no longer valid.
+ - Removed DAT_IS_SUCCESS macro
+ - Moved provider attribute structures from vendor files to udat.h
+ and kdat.h
+ - kdapl UPCALL_OBJECT now passed by reference
+
+ * Completed dat_strerr return strings
+
+ * Now support interrupted system calls
+
+ * dapltest now uses dat_strerror for error reporting.
+
+ * A large number of files were reformatted to meet the project
+ standard. The changes are cosmetic but improve readability and
+ maintainability. A number of comments were also cleaned up
+ during this effort.
+
+ * dat_registry and RPM file changes (contributed by Steffen Persvold):
+ - Renamed the RPM name of the registry to be dat-registry
+ (renamed the .spec file too, some cvs add/remove needed)
+ - Added the ability to create RPMs as a normal user (using
+ temporary paths); works on SuSE, Fedora, and RedHat.
+ - 'make rpm' now works even if you didn't build first.
+ - Changed to using the GNU __attribute__((constructor)) and
+ __attribute__((destructor)) on the dat_init functions, dat_init
+ and dat_fini. The old -init and -fini options to LD make
+ applications crash on some platforms (Fedora for example).
+ - Added support for 64 bit platforms.
+ - Added code to allow multiple provider names in the registry,
+ primarily to support ia32 and ia64 libraries simultaneously.
+ Provider names are now kept in a list, the first successful
+ library open will be the provider.
+
+ * Added initial infrastructure for DAPL_DCNTR, a feature that
+ will aid in debug and tuning of a dapl implementation. Partial
+ implementation only at this point.
+
+ * Bug fixes
+ - Prevent debug messages from crashing dapl in EVD completions by
+ verifying the error code to ensure data is valid.
+ - Verify CNO before using it to clean up in evd_free()
+ - CNO timeouts now return correct error codes, per the spec.
+ - cr_accept now complies with the spec concerning connection
+ requests that go away before the accept is invoked.
+ - Verify valid EVD before posting connection events on active side
+ of a connection. EP locking also corrected.
+ - Clean up of dapltest Makefile, no longer need to declare
+ DAT_THREADSAFE
+ - Fixed check of EP states to see if we need to disconnect when an
+ IA is closed.
+ - ep_free() code reworked such that we can properly close a
+ connection pending EP.
+ - Changed disconnect processing to comply with the spec: user will
+ see a BROKEN event, not DISCONNECTED.
+ - If we get a DTO error, issue a disconnect to let the CM and
+ the user know the EP state changed to disconnect; checked IBA
+ spec to make sure we disconnect on correct error codes.
+ - ep_disconnect now properly deals with abrupt disconnects on the
+ active side of a connection.
+ - PSP now created in the correct state for psp_create_any(), making
+ it usable.
+ - dapl_evd_resize() now returns correct status, instead of always
+ DAT_NOT_IMPLEMENTED.
+ - dapl_evd_modify_cno() does better error checking before invoking
+ the provider layer, avoiding bugs.
+ - Simple change to allow dapl_evd_modify_cno() to set the CNO to
+ NULL, per the spec.
+ - Added required locking around call to dapl_sp_remove_cr.
+
+ - Fixed problems related to dapl_ep_free: the new
+ disconnect(abrupt) allows us to do a more immediate teardown of
+ connections, removing the need for the MAGIC_EP_EXIT magic
+ number/state, which has been removed. Much cleanup of paths,
+ and made more robust.
+ - Made changes to meet the spec, uDAPL 1.1 6.3.2.3: CNO is
+ triggered if there are waiters when the last EVD is removed
+ or when the IA is freed.
+ - Added code to deal with the provider synchronously telling us
+ a connection is unreachable, and generate the appropriate
+ event.
+ - Changed timer routine type from unsigned long to uintptr_t
+ to better fit with machine architectures.
+ - ep.param data now initialized in ep_create, not ep_alloc.
+ - Or Gerlitz provided updates to Mellanox files for evd_resize,
+ fw attributes, many others. Also implemented changes for correct
+ sizes on REP side of a connection request.
+
+
+
+ NEW SINCE Beta 2.0
+
+ * dat_echo now DAT 1.1 compliant. Various small enhancements.
+
+ * Revamped atomic_inc/dec to be void, the return value was never
+ used. This allows kdapl to use Linux kernel equivalents, and
+ is a small performance advantage.
+
+ * kDAPL: dapl_evd_modify_upcall implemented and tested.
+
+ * kDAPL: physical memory registration implemented and tested.
+
+ * uDAPL now builds cleanly for non-debug versions.
+
+ * Default RDMA credits increased to 8.
+
+ * Default ACK_TIMEOUT now a reasonable value (2 sec vs old 2
+ months).
+
+ * Cleaned up dat_error.h, now 1.1 compliant in comments.
+
+ * evd_resize initial implementation. Untested.
+
+ * Bug fixes
+ - __KDAPL__ is defined in kdat_config.h, so apps don't need
+ to define it.
+ - Changed include file ordering in kdat.h to put kdat_config.h
+ first.
+ - resolved connection/tear-down race on the client side.
+ - kDAPL timeouts now scaled properly; fixed 3 orders of
+ magnitude difference.
+ - kDAPL EVD callbacks now get invoked for all completions; old
+ code would drop them in heavy utilization.
+ - Fixed error path in kDAPL evd creation, so we no longer
+ leak CNOs.
+ - create_psp_any returns correct error code if it can't create
+ a connection qualifier.
+ - lock fix in ibapi disconnect code.
+ - kDAPL INFINITE waits now work properly (non connection
+ waits)
+ - kDAPL driver unload now works properly
+ - dapl_lmr_[k]create now returns 1.1 error codes
+ - ibapi routines now return DAT 1.1 error codes
+
+
+
+ NEW SINCE Beta 1.10
+
+ * kDAPL is now part of the DAPL distribution. See the release
+ notes above.
+
+ The kDAPL 1.1 spec is now contained in the doc/ subdirectory.
+
+ * Several files have been moved around as part of the kDAPL
+ checkin. Some files that were previously in udapl/ are now
+ in common/, some in common are now in udapl/. The goal was
+ to make sure files are properly located and make sense for
+ the build.
+
+ * Source code formatting changes for consistency.
+
+ * Bug fixes
+ - dapl_evd_create() was comparing the wrong bit combinations,
+ allowing bogus EVDs to be created.
+ - Removed code that swallowed zero length I/O requests, which
+ are allowed by the spec and are useful to applications.
+ - Locking in dapli_get_sp_ep was asymmetric; fixed it so the
+ routine will take and release the lock. Cosmetic change.
+ - dapl_get_consumer_context() will now verify the pointer
+ argument 'context' is not NULL.
+
+
+ OBTAIN THE CODE
+
+ To obtain the tree for your local machine you can check it
+ out of the source repository using CVS tools. CVS is common
+ on Unix systems and available as freeware on Windows machines.
+ The command to anonymously obtain the source code from
+ Source Forge (with no password) is:
+
+ cvs -d:pserver:anonymous@cvs.dapl.sourceforge.net:/cvsroot/dapl login
+ cvs -z3 -d:pserver:anonymous@cvs.dapl.sourceforge.net:/cvsroot/dapl co .
+
+ When prompted for a password, simply press the Enter key.
+
+ Source Forge also contains explicit directions on how to become
+ a developer, as well as how to use different CVS commands. You may
+ also browse the source code using the URL:
+
+ http://svn.sourceforge.net/viewvc/dapl/trunk/
+
+ SYSTEM REQUIREMENTS
+
+ This project has been implemented on Red Hat Linux 7.3, SuSE
+ SLES 8, 9, and 10, Windows 2000, RHEL 3.0, 4.0 and 5.0 and a few
+ other Linux distributions. The structure of the code is designed
+ to allow other operating systems to easily be adapted.
+
+ The DAPL team has used Mellanox Tavor based InfiniBand HCAs for
+ development, and continues with this platform. Our HCAs use the
+ IB verbs API submitted by IBM. Mellanox has contributed an
+ adapter layer using their VAPI verbs API. Either platform is
+ available to any group considering DAPL work. The structure of
+ the uDAPL source allows other provider API sets to be easily
+ integrated.
+
+ The development team uses any one of three topologies: two HCAs
+ in a single machine; a single HCA in each of two machines; and
+ most commonly, a switch. Machines connected to a switch may have
+ more than one HCA.
+
+ The DAPL Plugfest revealed that switches and HCAs available from
+ most vendors will interoperate with little trouble, given the
+ most recent releases of software. The dapl reference team makes
+ no recommendation on HCA or switch vendors.
+
+ Explicit machine configurations are available upon request.
+
+ IN THE TREE
+
+ The DAPL tree contains source code for the uDAPL and kDAPL
+ implementations, and also includes tests and documentation.
+
+ Included documentation has the base level API of the
+ providers: OpenFabrics, IBM Access, and Mellanox Verbs API. Also
+ included are a growing number of DAPL design documents which
+ lead the reader through specific DAPL subsystems. More
+ design documents are in progress and will appear in the tree in
+ the near future.
+
+ A small number of test applications and a unit test framework
+ are also included. dapltest is the primary testing application
+ used by the DAPL team; it is capable of simulating a variety of
+ loads and exercises a large number of interfaces. Full
+ documentation is included for each of the tests.
+
+ Recently, the dapl conformance test has been added to the source
+ repository. The test provides coverage of the most common
+ interfaces, doing both positive and negative testing. Vendors
+ providing DAPL implementation are strongly encouraged to run
+ this set of tests.
+
+ MAKEFILE NOTES
+
+ There are a number of #ifdef's in the code that were necessary
+ during early development. They are disappearing as we
+ have time to take advantage of features and work available from
+ newer releases of provider software. These #ifdefs are not
+ documented as the intent is to remove them as soon as possible.
+
+ CONTRIBUTIONS
+
+ As is common to Source Forge projects, there are a small number
+ of developers directly associated with the source tree and having
+ privileges to change the tree. Requested updates, changes, bug
+ fixes, enhancements, or contributions should be sent to
+ James Lentini at jlentinit@netapp.com for review. We welcome your
+ contributions and expect the quality of the project will
+ improve thanks to your help.
+
+ The core DAPL team is:
+
+ James Lentini
+ Arlin Davis
+ Steve Sears
+
+ ... with contributions from a number of excellent engineers in
+ various companies contributing to the open source effort.
+
+
+ ONGOING WORK
+
+ Not all of the DAPL spec is implemented at this time.
+ Functionality such as shared memory will probably not be
+ implemented by the reference implementation (there is a write up
+ on this in the doc/ area), and there are yet various cases where
+ work remains to be done. And of course, not all of the
+ implemented functionality has been tested yet. The DAPL team
+ continues to develop and test the tree with the intent of
+ completing the specification and delivering a robust and useful
+ implementation.
+
+
+The DAPL Team
+
--- /dev/null
+#!/bin/bash
+#
+# Copyright (c) 2006 Mellanox Technologies. All rights reserved.
+# Copyright (c) 2004, 2005, 2006 Voltaire, Inc. All rights reserved.
+#
+# This Software is licensed under one of the following licenses:
+#
+# 1) under the terms of the "Common Public License 1.0" a copy of which is
+# available from the Open Source Initiative, see
+# http://www.opensource.org/licenses/cpl.php.
+#
+# 2) under the terms of the "The BSD License" a copy of which is
+# available from the Open Source Initiative, see
+# http://www.opensource.org/licenses/bsd-license.php.
+#
+# 3) under the terms of the "GNU General Public License (GPL) Version 2" a
+# copy of which is available from the Open Source Initiative, see
+# http://www.opensource.org/licenses/gpl-license.php.
+#
+# Licensee has the right to choose one of the above licenses.
+#
+# Redistributions of source code must retain the above copyright
+# notice and one of the license notices.
+#
+# Redistributions in binary form must reproduce both the above copyright
+# notice, one of the license notices in the documentation
+# and/or other materials provided with the distribution.
+#
+# Description: creates Module.symvers file for InfiniBand modules
+
+KVERSION=${KVERSION:-$(uname -r)}
+MOD_SYMVERS=./Module.symvers
+SYMS=/tmp/syms
+
+MODULES_DIR=${MODULES_DIR:-./}
+echo MODULES_DIR=${MODULES_DIR}
+
+if [ -f ${MOD_SYMVERS} -a ! -f ${MOD_SYMVERS}.save ]; then
+ mv ${MOD_SYMVERS} ${MOD_SYMVERS}.save
+fi
+rm -f $MOD_SYMVERS
+rm -f $SYMS
+
+for mod in $(find ${MODULES_DIR} -name '*.ko') ; do
+ nm -o $mod |grep __crc >> $SYMS
+ n_mods=$((n_mods+1))
+done
+
+n_syms=$(wc -l $SYMS |cut -f1 -d" ")
+echo Found $n_syms OFED kernel symbols in $n_mods modules
+n=1
+
+while [ $n -le $n_syms ] ; do
+ line=$(head -$n $SYMS|tail -1)
+
+ line1=$(echo $line|cut -f1 -d:)
+ line2=$(echo $line|cut -f2 -d:)
+ file=$(echo $line1| sed -e 's@./@@' -e 's@.ko@@' -e "s@$PWD/@@")
+ crc=$(echo $line2|cut -f1 -d" ")
+ sym=$(echo $line2|cut -f3 -d" ")
+ echo -e "0x$crc\t$sym\t$file" >> $MOD_SYMVERS
+ n=$((n+1))
+done
+
+echo ${MOD_SYMVERS} created.
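The `while` loop in the script above re-reads the symbol list with `head`/`tail` once per symbol, so its cost grows quadratically with the number of symbols. Below is a minimal sketch of a single-pass alternative, assuming the same `nm -o` line format the script already relies on (`file:value type name`). The function name `syms_to_symvers` is hypothetical and not part of OFED, and unlike the original it only strips a leading `./` and a trailing `.ko`, not a `$PWD/` prefix.

```shell
# Hypothetical helper (not part of OFED): convert `nm -o` lines read from
# stdin into "0x<crc><TAB><symbol><TAB><module>" lines in one awk pass.
syms_to_symvers() {
    awk '{
        split($1, parts, ":")     # parts[1] = module path, parts[2] = CRC value
        file = parts[1]
        sub(/^\.\//, "", file)    # drop a leading "./"
        sub(/\.ko$/, "", file)    # drop the ".ko" suffix
        printf "0x%s\t%s\t%s\n", parts[2], $3, file
    }'
}
```

Under these assumptions it could be used as `syms_to_symvers < "$SYMS" > "$MOD_SYMVERS"` in place of the loop.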
--- /dev/null
+#!/bin/bash
+#
+# Copyright (c) 2009 Mellanox Technologies. All rights reserved.
+#
+# This Software is licensed under one of the following licenses:
+#
+# 1) under the terms of the "Common Public License 1.0" a copy of which is
+# available from the Open Source Initiative, see
+# http://www.opensource.org/licenses/cpl.php.
+#
+# 2) under the terms of the "The BSD License" a copy of which is
+# available from the Open Source Initiative, see
+# http://www.opensource.org/licenses/bsd-license.php.
+#
+# 3) under the terms of the "GNU General Public License (GPL) Version 2" a
+# copy of which is available from the Open Source Initiative, see
+# http://www.opensource.org/licenses/gpl-license.php.
+#
+# Licensee has the right to choose one of the above licenses.
+#
+# Redistributions of source code must retain the above copyright
+# notice and one of the license notices.
+#
+# Redistributions in binary form must reproduce both the above copyright
+# notice, one of the license notices in the documentation
+# and/or other materials provided with the distribution.
+#
+#
+# Add/Remove a patch to/from OFED's ofa_kernel package
+
+
+usage()
+{
+cat << EOF
+
+ Usage:
+ Add patch to OFED:
+ `basename $0` --add
+ --ofed|-o <path_to_ofed>
+ --patch|-p <path_to_patch>
+ --type|-t <kernel|backport <kernel tag>|addons <kernel tag>>
+
+ Remove patch from OFED:
+ `basename $0` --remove
+ --ofed|-o <path_to_ofed>
+ --patch|-p <patch name>
+ --type|-t <kernel|backport <kernel tag>|addons <kernel tag>>
+
+ Example:
+ `basename $0` --add --ofed /tmp/OFED-1.X/ --patch /tmp/cma_establish.patch --type kernel
+
+ `basename $0` --remove --ofed /tmp/OFED-1.X/ --patch cma_establish.patch --type kernel
+
+EOF
+}
+
+action=""
+
+# Execute command w/ echo and exit if it fails
+ex()
+{
+ echo "$@"
+ if ! "$@"; then
+ printf "\nFailed executing %s\n\n" "$*"
+ exit 1
+ fi
+}
+
+add_patch()
+{
+ if [ -f $2/${1##*/} ]; then
+ echo Replacing $2/${1##*/}
+ ex /bin/rm -f $2/${1##*/}
+ fi
+ ex cp $1 $2
+}
+
+remove_patch()
+{
+ if [ -f $2/${1##*/} ]; then
+ echo Removing $2/${1##*/}
+ ex /bin/rm -f $2/${1##*/}
+ else
+ echo Patch $2/${1##*/} was not found
+ exit 1
+ fi
+}
+
+set_rpm_info()
+{
+ package_SRC_RPM=$(/bin/ls -1 ${ofed}/SRPMS/${1}*src.rpm 2> /dev/null)
+ if [[ -n "${package_SRC_RPM}" && -s ${package_SRC_RPM} ]]; then
+ package_name=$(rpm --queryformat "[%{NAME}]" -qp ${package_SRC_RPM})
+ package_ver=$(rpm --queryformat "[%{VERSION}]" -qp ${package_SRC_RPM})
+ package_rel=$(rpm --queryformat "[%{RELEASE}]" -qp ${package_SRC_RPM})
+ else
+ echo $1 src.rpm not found under ${ofed}/SRPMS
+ exit 1
+ fi
+}
+
+main()
+{
+ while [ ! -z "$1" ]
+ do
+ case $1 in
+ --add)
+ action="add"
+ shift
+ ;;
+ --remove)
+ action="remove"
+ shift
+ ;;
+ --ofed|-o)
+ ofed=$2
+ shift 2
+ ;;
+ --patch|-p)
+ patch=$2
+ shift 2
+ ;;
+ --type|-t)
+ type=$2
+ shift 2
+ case ${type} in
+ backport|addons)
+ tag=$1
+ shift
+ ;;
+ esac
+ ;;
+ --help|-h)
+ usage
+ exit 0
+ ;;
+ *)
+ usage
+ exit 1
+ ;;
+ esac
+ done
+
+ if [ -z "$action" ]; then
+ usage
+ exit 1
+ fi
+
+ if [ -z "$ofed" ] || [ ! -d "$ofed" ]; then
+ echo Set the path to the OFED directory. Use \'--ofed\' parameter
+ exit 1
+ else
+ ofed=$(readlink -f $ofed)
+ fi
+
+ if [ "$action" == "add" ]; then
+ if [ -z "$patch" ] || [ ! -r "$patch" ]; then
+ echo Set the path to the patch file. Use \'--patch\' parameter
+ exit 1
+ else
+ patch=$(readlink -f $patch)
+ fi
+ else
+ if [ -z "$patch" ]; then
+ echo Set the name of the patch to be removed. Use \'--patch\' parameter
+ exit 1
+ fi
+ fi
+
+ if [ -z "$type" ]; then
+ echo Set the type of the patch. Use \'--type\' parameter
+ exit 1
+ fi
+
+ if [ "$type" == "backport" ] || [ "$type" == "addons" ]; then
+ if [ -z "$tag" ]; then
+ echo Set tag for backport patch.
+ exit 1
+ fi
+ fi
+
+ # Get ofa RPM version
+ case $type in
+ kernel|backport|addons)
+ set_rpm_info ofa_kernel
+ ;;
+ *)
+ echo "Unknown type $type"
+ exit 1
+ ;;
+ esac
+
+ package=${package_name}-${package_ver}
+ cd ${ofed}
+ if [ ! -e SRPMS/${package}-${package_rel}.src.rpm ]; then
+ echo File ${ofed}/SRPMS/${package}-${package_rel}.src.rpm not found
+ exit 1
+ fi
+
+ if ! ( set -x && rpm -i --define "_topdir $(pwd)" SRPMS/${package}-${package_rel}.src.rpm && set +x ); then
+ echo "Failed to install ${package}-${package_rel}.src.rpm"
+ exit 1
+ fi
+
+ cd -
+
+ cd ${ofed}/SOURCES
+ ex tar xzf ${package}.tgz
+
+ case $type in
+ kernel)
+ if [ "$action" == "add" ]; then
+ add_patch $patch ${package}/kernel_patches/fixes
+ else
+ remove_patch $patch ${package}/kernel_patches/fixes
+ fi
+ ;;
+ backport)
+ if [ "$action" == "add" ]; then
+ if [ ! -d ${package}/kernel_patches/backport/$tag ]; then
+ echo Creating ${package}/kernel_patches/backport/$tag directory
+ ex mkdir -p ${package}/kernel_patches/backport/$tag
+ echo WARNING: Check that ${package} configure supports backport/$tag
+ fi
+ add_patch $patch ${package}/kernel_patches/backport/$tag
+ else
+ remove_patch $patch ${package}/kernel_patches/backport/$tag
+ fi
+ ;;
+ addons)
+ if [ "$action" == "add" ]; then
+ if [ ! -d ${package}/kernel_addons/backport/$tag ]; then
+ echo Creating ${package}/kernel_addons/backport/$tag directory
+ ex mkdir -p ${package}/kernel_addons/backport/$tag
+ echo WARNING: Check that ${package} configure supports backport/$tag
+ fi
+ add_patch $patch ${package}/kernel_addons/backport/$tag
+ else
+ remove_patch $patch ${package}/kernel_addons/backport/$tag
+ fi
+ ;;
+ *)
+ echo Unknown patch type: $type
+ exit 1
+ ;;
+ esac
+
+ ex tar czf ${package}.tgz ${package}
+ cd -
+
+ cd ${ofed}
+ echo Rebuilding ${package_name} source rpm:
+ if ! ( set -x && rpmbuild -bs --define "_topdir $(pwd)" SPECS/${package_name}.spec && set +x ); then
+ echo Failed to create ${package}-${package_rel}.src.rpm
+ exit 1
+ fi
+ ex rm -rf SOURCES/${package}*
+ if [ "$action" == "add" ]; then
+ echo Patch added successfully.
+ else
+ echo Patch removed successfully.
+ fi
+ echo
+ echo Remove existing RPM packages from ${ofed}/RPMS directory in order
+ echo to rebuild RPMs
+}
+
+main "$@"
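The three `case` branches in `main` differ only in the directory the patch is copied to or removed from. Below is a minimal sketch of that mapping pulled out into one function, using the directory names that appear in the script. The function name `patch_dir` is hypothetical and not part of the script.

```shell
# Hypothetical helper (not in the original script): map a patch type and an
# optional kernel tag to its directory inside the unpacked ofa_kernel tree.
patch_dir() {
    # $1 = patch type, $2 = kernel tag (used by backport and addons only)
    case "$1" in
        kernel)   echo "kernel_patches/fixes" ;;
        backport) echo "kernel_patches/backport/$2" ;;
        addons)   echo "kernel_addons/backport/$2" ;;
        *)        echo "Unknown patch type: $1" >&2; return 1 ;;
    esac
}
```

With such a helper, the add and remove paths could each call `add_patch` or `remove_patch` once with `${package}/$(patch_dir "$type" "$tag")`.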
+++ /dev/null
- Open Fabrics Enterprise Distribution (OFED)
- SDP in MLNX_OFED 1.5.2 Release Notes
-
- December 2010
-
-
-
-===============================================================================
-Table of Contents
-===============================================================================
-1. Overview
-2. Bug Fixes and Enhancements since OFED 1.5.2
-3. ZCopy
-4. Known Issues
-5. Verification Applications/Flows/Tests
-6. Module Parameters
-
-===============================================================================
-1. Overview
-===============================================================================
-Sockets Direct Protocol (SDP) is an InfiniBand byte-stream transport protocol
-that provides TCP stream semantics. Capable of utilizing InfiniBand's advanced
-protocol offload capabilities, SDP can provide lower latency, higher bandwidth,
-and lower CPU utilization than IPoIB or
-Ethernet running some sockets-based applications.
-
-SDP in OFED is at GA level for MLNX OFED 1.5.2
-
-===============================================================================
-2. Main Features and Changes
-===============================================================================
-- Added support for Inline and blueflame
-- Improved stability issues
-- Bug fixes
-
-===============================================================================
-2. Bug Fixes and Enhancements since OFED 1.5.2
-===============================================================================
-* Cleanups
- - Added support for 2.6.34 / 2.6.36.
-
-* Bug Fixes
- - Fixed compilation problems on 32 bit hosts
- - Do not compile in debug mode when not asked.
- - Improved recovery from errors.
-
-* Enhancements
- - more statistics in /proc/sdpstats
- - added debugfs for sdp:
- - sdpprf was moved from /proc to debugfs/sdp
- - debugfs/<socket_id> - Socket history
-
-
-===============================================================================
-3. ZCopy
-===============================================================================
-- ZCopy is enabled by default for blocks larger than 64K. ZCopy can be disabled
- by setting the module parameter sdp_zcopy_thresh to zero, and its threshold
- can be changed by setting the parameter to another non-zero value.
-
-- ZCOPY mode gives good performance for large blocks with very small cpu
- utilization. When in use, all messages longer than 'sdp_zcopy_thresh' bytes
- in length will cause the user space buffer to be pinned and the data sent
- directly from the original buffer. This results in less CPU usage and on many
- systems in enhanced bandwidth.
- ZCOPY is most efficient with multi stream jobs and it performs better as the
- message size increases.
- The default 64K value for 'sdp_zcopy_thresh' is sometimes too low for some
- systems. You must experiment with your hardware to select the best value.
-
-- ZCOPY vs BCOPY:
- ZCOPY performance is more efficient in weak cpu and multi streams, whereas
- BCOPY is more efficient in single stream.
-
-===============================================================================
-4. Known Issues
-===============================================================================
-- SDP is at beta level on Infinihost HCA family
-
-- Occasionally, socket bind fails with EINVAL. Although the TCP socket is bound
- successfully, SDP is occupied, thus causing the socket bind failure.
- See Bugzilla 2159 and Bugzilla 2160
-
-- When SO_REUSEADDR is set, only a single socket can be bound to the IP_ANY and a
- specific port. TCP limitation, unless one of the sockets is listening.
-
-- BUG 1331 - Although TCP allows connecting to IP_ANY - 0.0.0.0
- (as a destination address!), SDP does not allow connecting to the IP_ANY
- and rejects the connection.
-
-- BUG 1444 - The setsockopt(SO_RCVBUF) is not functional in sdp socket.
- To limit top system wide sdp memory usage for recv,
- use the module parameter top_mem_usage.
-
-- Each SDP socket currently consumes up to 2 MBytes of memory. If this value
- is high for your installation, it is possible to trade off performance
- for lower memory utilization per socket by reducing the value of the
- "rcvbuf_scale" module parameter (default: 16).
-
- Note: The minimum legal value for the "rcvbuf_scale" module is 1.
- At this parameter value, each socket will consume approximately 128 KBytes.
-
-- Small message size performance is low when messages are sent by client
- at a rate lower than the rate at which they are consumed by server,
- and when TCP_CORK is not set. This is observed, for example, with iperf
- benchmark.
- Workaround: Set the TCP_CORK socket option
- to ensure data is sent in at least 32K byte chunks.
-
-- Performance is low on 32-bit kernels, as SDP utilizes high memory
- to ease memory pressure.
- Workaround: Move to a 64-bit kernel if the application remains a 32-bit one.
-
-- By default, SDP utilizes a 2 Kbyte MTU size. This may cause PCI-X cards
- using Mellanox Technologies "Infinihost" HCAs to experience low bandwidth.
- Workaround: Reset the MTU size to 1K in this situation, using either of
- the two methods below:
-
- 1. Activate the "tavor quirk" workaround in opensm:
- a. Create an opensm options cache file (/var/cache/osm/opensm.opts):
- > opensm --cache-options -o
- b. Add the following line to /var/cache/osm/opensm.opts:
- enable_quirks TRUE
- c. Rerun opensm using your usual command line options to activate
- the opensm quirk option.
-
- 2. Activate the "tavor quirk" workaround in cma:
- set the tavor_quirk module parameter of the rdma_cm module to value 1
- (default: 0).
-
-- When waiting for RX, the driver first polls, arms interrupt and then goes to
- sleep. Polling duration could be set by recv_poll module parameter. The
- higher this value is, the higher the CPU utilization is, and the number of
- interrupts is lower.
- This should be fine tuned according to the specific environment and
- application latency.
-
-- When using SDP over RoCE, and the peer has a card that does not support RoCE
- a delay in the connection establishment may occur.
-
-- BUG2185 - Occasionally, accessing /proc/net/sdpstats, causes kernel
- panic.
-
-- For set-user-ID/set-group-ID ELF binaries, only libraries in the standard
- search directories that are also set-user-ID are loaded. Since always installing
- libsdp with this bit on is a security vulnerability, the default behavior is
- to reset this bit. A user who wants to run such binaries should modify the
- libsdp.spec file.
-
-===============================================================================
-5. Verification Applications/Flows/Tests
-===============================================================================
-- ssh/sshd
-- wget/netscape/firefox/apache
-- netpipe
-- netperf
-- LTP socket tests
-- iperf-2.0.2
-- ttcp
-- openmpi
-- openmpi + Intel MPI benchmarks
-- Threaded and forking echo client server examples
-- Various Java client server applications (SUN:jre, BEA:jrockit/WebLogic, GNU:gij/gcj)
-- Many UNIX utilities to verify that pre-load did not harm the applications
-
-===============================================================================
-6. Module Parameters
-===============================================================================
-
-General
--------
-sdp_link_layer_ib_only:
- Supports only link layer of type InfiniBand.
- It is useful when not using SDP over RoCE.
-
-sdp_debug_level:
- Enables connection establishment and teardown debug tracing.
-
-sdp_data_debug_level:
- Enables datapath debug tracing. If set to 1, it shows only packets >1.
- To enable debugging of data path, compile driver with CONFIG_SDP_DEBUG_DATA.
-
-
-recv_poll:
- Enables poll receiving before arming the interrupt. Set a higher value
- to decrease the number of RX interrupts. Consequently, the CPU
- utilization will be higher.
-
-sdp_keepalive_time:
- Default idle time in seconds before keepalive probe sent.
-
-Resources
----------
-rcvbuf_initial_size:
- Receives buffer initial size in bytes.
-
-rcvbuf_scale:
- Not in use
-
-top_mem_usage:
- Top system wide sdp memory usage for recv (in MB).
-
-max_large_sockets:
- Not in use
-
-sdp_fmr_pool_size:
- Number of FMRs to allocate for pool
-
-sdp_fmr_dirty_wm:
- Watermark to flush fmr pool
-
-Thresholds
-----------
-sdp_inline_thresh:
- Inline copy threshold. effective to new sockets only; 0=Off.
-
-sdp_zcopy_thresh:
- Zero copy using RDMA threshold; 0=Off.
- If smaller than page size, set to page size.
-
-Interrupt hardware moderation:
-------------------------------
-sdp_rx_coal_target:
- Target number of bytes to coalesce with interrupt moderation.
-
-sdp_rx_coal_time:
- rx coal time (jiffies).
-
-sdp_rx_rate_low:
- rx_rate low (packets/sec).
-
-sdp_rx_coal_time_low:
- low moderation usec.
-
-sdp_rx_rate_high:
- rx_rate high (packets/sec).
-
-sdp_rx_coal_time_high:
- high moderation usec.
-
-sdp_rx_rate_thresh:
- rx rate thresh ().
-
-sdp_sample_interval:
- sample interval (jiffies).
-
-hw_int_mod_count:
- Forced hw int moderation val. -1 for auto (packets). 0 to disable.
-
-hw_int_mod_usec:
- Forced hw int moderation val. -1 for auto (usec). 0 to disable.
+++ /dev/null
-
- Open Fabrics Enterprise Distribution (OFED)
- SRP in OFED 1.5.2 Release Notes
-
- December 2010
-
-
-==============================================================================
-Table of contents
-==============================================================================
-
- 1. Overview
- 2. Changes and Bug Fixes since OFED 1.5
- 3. Software Dependencies
- 4. Major Features
- 5. Loading SRP Initiator
- 6. Manually Establishing an SRP Connection
- 7. SRP Tools - ibsrpdm and srp_daemon
- 8. Automatic Discovery and Connecting to Targets
- 9. Multiple Connections from Initiator IB Port to the Target
- 10. High Availability
- 11. Shutting Down SRP
- 12. Known Issues
- 13. Vendor Specific Notes
-
-
-==============================================================================
-1. Overview
-==============================================================================
-
-The SRP standard describes the message format and protocol definitions required
-for transferring commands and data between a SCSI initiator port and a SCSI
-target port using RDMA communication service.
-
-
-==============================================================================
-2. Changes and Bug Fixes since OFED 1.5
-==============================================================================
-* Check for scsi_id in scmnd to prevent scan/rescan from repeatedly adding new
- SCSI devices, e.g. echo "- - -" > /sys/class/scsi_host/hostXX/scan
-* Bug fixes
-
-==============================================================================
-3. Software Dependencies
-==============================================================================
-
-The SRP Initiator depends on the installation of the OFED Distribution stack
-with OpenSM running.
-
-==============================================================================
-4. Major Features
-==============================================================================
-
-This SRP Initiator is based on source taken from openib.org gen2 implementing
-the SCSI RDMA Protocol-2 (SRP-2), Doc. no. T10/1524-D. See:
-www.t10.org/ftp/t10/drafts/srp2/srp2r00a.pdf
-
-The SRP Initiator supports:
-- Basic SCSI Primary Commands -3 (SPC-3)
- (www.t10.org/ftp/t10/drafts/spc3/spc3r21b.pdf)
-- Basic SCSI Block Commands -2 (SBC-2)
- (www.t10.org/ftp/t10/drafts/sbc2/sbc2r16.pdf)
-- Basic functionality, task management and limited error handling
-
-==============================================================================
-5. Loading SRP Initiator
-==============================================================================
-
-To load the SRP module, either execute the "modprobe ib_srp" command after the
-OFED driver is up, or change the value of SRP_LOAD in
-/etc/infiniband/openib.conf to "yes" (causing the srp module to be loaded
-at driver boot).
-
-NOTE: When loading the ib_srp module, it is possible to set the module
- parameter srp_sg_tablesize. This is the maximum number of
- gather/scatter entries per I/O (default: 12).
-
- a. modprobe ib_srp srp_sg_tablesize=32
- or
- b. edit /etc/modprobe.conf and add the following line:
- options ib_srp srp_sg_tablesize=32
-
-Module parameters:
-To list the ib_srp module parameters, run:
- $ modinfo ib_srp
-
- + srp_sg_tablesize: Maximum number of scatter/gather entries per I/O.
- + srp_dev_loss_tmo: Number of seconds during which the SRP driver does not
- return DID_NO_CONNECT status after it loses the connection
- to the target. During this period, it tries to re-establish
- the connection to the target, and returns DID_RESET or
- DID_ABORT status for outstanding SCSI commands to prevent
- the DM Multipath driver from failing over to the next path.
- Default value is 60 seconds.
-
-==============================================================================
-6. Manually Establishing an SRP Connection
-==============================================================================
-
-The following steps describe how to manually establish an SRP connection
-between the Initiator and an SRP Target. Section 8 explains how to do this
-automatically.
-
-- Make sure that the ib_srp module is loaded, the SRP Initiator is reachable
- by the SRP Target, and that an SM is running.
-
-- To establish a connection with an SRP Target and create SRP (SCSI) device(s)
- for that target under /dev, use the following command:
-
- echo -n id_ext=[GUID value],ioc_guid=[GUID value],dgid=[port GID value],\
- pkey=ffff,service_id=[service[0] value] > \
- /sys/class/infiniband_srp/srp-mthca[hca number]-[port number]/add_target
-
- a. Execution of the above "echo" command may take some time
- b. The SM must be running while the command executes
- c. It is possible to include additional parameters in the echo command:
- > max_cmd_per_lun - Default: 63
- > max_sect (short for max_sectors) - sets the request size of a command
- > io_class - Default: 0x100 as in rev 16A of the specification
- Note: In rev 10 the default was 0xff00
- > initiator_ext - Please refer to Section 9 (Multiple Connections...)
- d. See SRP Tools below for instructions on how the parameters in the
- echo command above may be obtained.
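
As a minimal sketch of the step above, the following shell function assembles
the comma-separated target string as a single word, so that it can be written
to add_target in one go. The function name srp_target_string and its argument
order are illustrative assumptions and are not part of OFED; the GUID, GID and
service values must be taken from your own fabric.

```shell
# Hypothetical helper (not shipped with OFED): build the add_target string.
# Arguments: id_ext ioc_guid dgid service_id [extra comma-separated options]
srp_target_string() {
    s="id_ext=$1,ioc_guid=$2,dgid=$3,pkey=ffff,service_id=$4"
    # Optional parameters such as max_cmd_per_lun=4 or initiator_ext=...
    if [ -n "${5:-}" ]; then
        s="$s,$5"
    fi
    # No trailing newline, matching the "echo -n" form used in this document.
    printf '%s' "$s"
}

# Usage (the sysfs path depends on your HCA name and port number):
#   srp_target_string <id_ext> <ioc_guid> <dgid> <service_id> \
#       > /sys/class/infiniband_srp/srp-mthca0-1/add_target
```

Building the string in one place avoids stray spaces or line breaks that can
creep in when a long echo command is typed across several lines.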
-
-NOTES:
-
-- If the same *echo -n <same parameters>* command is issued more than once, the
- SRP target terminates the previous connection and establishes the new
- connection. To have multiple connections to the SRP target, use
- different initiator_ext values in the echo command.
-
-- To list the new SCSI devices that have been added by the echo command, you
- may use either of the following two methods:
- a. Execute "fdisk -l". This command lists all devices; the new devices are
- included in this listing.
- b. Execute *dmesg* or look at /var/log/messages to find messages with the
- names of the new devices.
-
-
-==============================================================================
-7. SRP Tools - ibsrpdm and srp_daemon
-==============================================================================
-
-To assist in performing the steps in Section 6, the OFED distribution
-provides two utilities which:
-- Detect targets on the fabric reachable by the Initiator (for step 1)
-- Output target attributes in a format suitable for use in the above
- "echo" command (step 2)
-
-These utilities are: ibsrpdm and srp_daemon.
-
-The utilities can be found under /usr/local/ofed/sbin/ (or <prefix>/sbin/),
-and are part of the srptools RPM that may be installed using the
-OFED custom installation. Detailed information regarding the various
-options for these utilities is provided by their man pages.
-
-Below, several usage scenarios for these utilities are presented.
-
-ibsrpdm usage
--------------
-1. Detecting reachable targets
-
- a. To detect all targets reachable by the SRP initiator via the default
- umad device (/dev/infiniband/umad0), execute the following command:
- $ ibsrpdm
-
- This command will output information on each SRP target detected, in
- human-readable form.
-
- Sample output:
- IO Unit Info:
- port LID: 0103
- port GID: fe800000000000000002c90200402bd5
- change ID: 0002
- max controllers: 0x10
-
- controller[ 1]
- GUID: 0002c90200402bd4
- vendor ID: 0002c9
- device ID: 005a44
- IO class : 0100
- ID: LSI Storage Systems SRP Driver 200400a0b81146a1
- service entries: 1
- service[ 0]: 200400a0b81146a1 / SRP.T10:200400A0B81146A1
-
- b. To detect all the SRP Targets reachable by the SRP Initiator via
- another umad device, use the following command:
-
- $ ibsrpdm -d <umad device>
-
-2. Assistance in creating an SRP connection
-
- a. To generate output suitable for use in the "echo" command of
- Section 6, add the "-c" option to ibsrpdm:
-
- $ ibsrpdm -c
-
- Sample output:
- id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,
- dgid=fe800000000000000002c90200402bd5,pkey=ffff,
- service_id=200400a0b81146a1
-
- b. To establish a connection with an SRP Target (Section 6) using the output
- from the "ibsrpdm -c" example above, execute the following command:
-
- $ echo -n id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,
- dgid=fe800000000000000002c90200402bd5,pkey=ffff,
- service_id=200400a0b81146a1
- > /sys/class/infiniband_srp/srp-mlnx_0-1/add_target
-
- The SRP connection should now be up; the newly created SCSI devices should
- appear in the listing obtained from the "fdisk -l" command.
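
The two steps above can be combined in a short loop. This is a sketch only,
assuming that "ibsrpdm -c" prints one complete target string per line (the
sample output above is wrapped for display). The function name srp_add_targets
is hypothetical, and the add_target path must match your HCA and port.

```shell
# Hypothetical helper: read target strings from stdin, one per line, and
# write each one to the given add_target file with a separate write.
srp_add_targets() {
    add_target=$1
    while IFS= read -r line; do
        # Skip empty lines between targets.
        [ -n "$line" ] || continue
        # One write per target, without a trailing newline (like "echo -n").
        printf '%s' "$line" > "$add_target" || return 1
    done
}

# Usage:
#   ibsrpdm -c | srp_add_targets /sys/class/infiniband_srp/srp-mthca0-1/add_target
```

srp_daemon with the -e option, described next, performs the same discovery
and connection without any scripting.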
-
-
-srp_daemon
-----------
-The srp_daemon utility is based on ibsrpdm and extends its functionality.
-In addition to the ibsrpdm functionality described above, srp_daemon can also
-- Establish an SRP connection by itself (without the need to issue the "echo"
- command described in Section 6)
-- Continue running in background, detecting new targets and establishing SRP
- connections with them (daemon mode)
-- Discover reachable SRP targets given an infiniband HCA name and port, rather
- than just by /dev/umad<N> where <N> is a digit
-- Enable High Availability operation (together with Device-Mapper Multipath)
-- Have a configuration file that determines the targets to connect to
-
-a. srp_daemon commands equivalent to ibsrpdm:
-
- "srp_daemon -a -o" is equivalent to "ibsrpdm"
- "srp_daemon -c -a -o" is equivalent to "ibsrpdm -c"
-
-Note: These srp_daemon commands can behave differently than the equivalent
- ibsrpdm command when /etc/srp_daemon.conf is not empty.
-
-b. srp_daemon extensions to ibsrpdm
-
- - To discover SRP Targets reachable from HCA device <infiniband HCA name>,
- port <port num>, (and generate output suitable for 'echo') you may execute
-
- srp_daemon -c -a -o -i <infiniband HCA name> -p <port number>
-
- - To both discover the SRP Targets and establish connections with them, just
- add the -e option to the above command.
-
- - Executing srp_daemon over a port without the -a option displays only
- those targets that are reachable via the port and to which the initiator is
- not yet connected. When executing with the -e option, it is better to omit -a.
-
- - It is recommended to use the -n option. This option adds the initiator_ext
- to the connecting string. (See Section 9 for more details).
-
- - srp_daemon has a configuration file that can be set, where the default is
- /etc/srp_daemon.conf. Use the -f option to supply a different configuration file
- that configures the targets srp_daemon is allowed to connect to. The
- configuration file can also be used to set values for additional
- parameters (e.g., max_cmd_per_lun, max_sect).
-
- - A continuous background (daemon) operation, providing an automatic ongoing
- detection and connection capability. See Section 8.
-
-==============================================================================
-8. Automatic Discovery and Connecting to Targets
-==============================================================================
-
-- Make sure that the ib_srp module is loaded, the SRP Initiator can reach an
- SRP Target, and that an SM is running.
-
-- To connect to all the existing Targets in the fabric, execute
- srp_daemon -e -o. This utility will scan the fabric once, connect to
- every Target it detects, and then exit.
-
-NOTE: srp_daemon will follow the configuration it finds in
- /etc/srp_daemon.conf. Thus, it will ignore a target that is disallowed in
- the configuration file.
-
-- To connect to all the existing Targets in the fabric and to connect
- to new targets that will join the fabric, execute srp_daemon -e. This utility
- continues to execute until it is either killed by the user or encounters
- connection errors (such as no SM in the fabric).
-
-- To execute SRP daemon as a daemon you may execute run_srp_daemon
- (found under /usr/local/ofed/sbin/ or <prefix>/sbin/), providing it with
- the same options used for running srp_daemon.
-
- Note: Make sure only one instance of run_srp_daemon runs per port.
-
-- To execute SRP daemon as a daemon on all the ports, execute srp_daemon.sh
- (found under /usr/local/ofed/sbin/ or <prefix>/sbin/).
- srp_daemon.sh sends its log to /var/log/srp_daemon.log.
-
-- It is possible to configure this script to execute automatically when the
- InfiniBand driver starts by changing the value of SRP_DAEMON_ENABLE in
- /etc/infiniband/openib.conf to "yes" and SRP_LOAD to yes as well.
-
- Another way to configure this script to execute automatically when the
- InfiniBand driver starts is by changing the value of SRPHA_ENABLE in
- /etc/infiniband/openib.conf to "yes". However, this option also enables
- SRP High Availability that has some more features. (Please read the High
- Availability section).
-
-==============================================================================
-9. Multiple Connections from Initiator IB Port to the Target
-==============================================================================
-
-Some system configurations may need multiple SRP connections from
-the SRP Initiator to the same SRP Target: to the same Target IB port,
-or to different IB ports on the same Target HCA.
-
-In case of a single Target IB port, i.e., SRP connections use the same path,
-the configuration is enabled using a different initiator_ext value for each
-SRP connection. The initiator_ext value is a 16-hexadecimal-digit value
-specified in the connection command.
-
-Also, in the case of two physical connections (i.e., network paths) from a
-single initiator IB port to two different IB ports on the same Target HCA, a
-different initiator_ext value is needed on each path. The convention is to
-use the Target port GUID as the initiator_ext value for the relevant path.
-
-If you use srp_daemon with -n flag, it automatically assigns initiator_ext
-values according to this convention. For example:
-
- id_ext=200500A0B81146A1,ioc_guid=0002c90200402bec,
- dgid=fe800000000000000002c90200402bed,
- pkey=ffff,service_id=200500a0b81146a1,initiator_ext=ed2b400002c90200
-
- Notes:
- a. It is recommended to use the -n flag for all srp_daemon invocations.
- b. ibsrpdm does not have a corresponding option.
- c. srp_daemon.sh always uses the -n option (whether invoked manually by
- the user, or automatically at startup by setting SRPHA_ENABLE or
- SRP_DAEMON_ENABLE to yes).
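
In the example above, the initiator_ext value (ed2b400002c90200) is the target
port GUID taken from the low 64 bits of the dgid (0002c90200402bed) with its
bytes in reverse order. The sketch below reproduces that value in shell. It is
inferred from this one example only; the function name is hypothetical and
this is not necessarily how srp_daemon computes the value internally.

```shell
# Hypothetical helper: derive an initiator_ext value from a 32-hex-digit dgid
# by taking the last 16 digits (port GUID) and reversing the byte order.
initiator_ext_from_dgid() {
    dgid=$1
    guid=${dgid#????????????????}   # drop the 16-digit subnet prefix
    out=""
    while [ -n "$guid" ]; do
        rest=${guid#??}             # everything after the first byte
        byte=${guid%"$rest"}        # the first byte (two hex digits)
        out="$byte$out"             # prepend, which reverses the byte order
        guid=$rest
    done
    printf '%s\n' "$out"
}

# Usage:
#   initiator_ext_from_dgid fe800000000000000002c90200402bed
```

When connecting manually with the echo command, a value derived this way keeps
the result consistent with what srp_daemon -n produced in the example.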
-
-==============================================================================
-10. High Availability (HA)
-==============================================================================
-
-High Availability Overview
---------------------------
-
-High Availability works using the Device-Mapper (DM) multipath and the
-SRP daemon.
-
-Each initiator is connected to the same target from several ports/HCAs.
-The DM multipath is responsible for joining together different paths to the
-same target and for fail-over between paths when one of them goes offline.
-Multipath will be executed on newly joined SCSI devices.
-
-Each initiator should execute several instances of the SRP daemon, one for each
-port. At startup, each SRP daemon detects the SRP targets in the fabric and
-sends requests to the ib_srp module to connect to each of them. These
-SRP daemons also detect targets that subsequently join the fabric, and send the
-ib_srp module requests to connect to them as well.
-
-High Availability Operation
----------------------------
-
-When a path (from port1) to a target fails, the ib_srp module starts an error
-recovery process. If this process gets to the reset_host stage and there is no
-path to the target from this port, ib_srp will remove this scsi_host. After
-the scsi_host is removed, multipath switches to another path to this target
-(from another port/HCA).
-
-When the failed path recovers, it will be detected by the SRP daemon. The SRP
-daemon will then request ib_srp to connect to this target. Once the connection
-is up, there will be a new scsi_host for this target. Multipath will be
-executed on the devices of this host, returning to the original state (prior to
-the failed path).
-
-High Availability Prerequisites
--------------------------------
-
-Installation for RHEL4 and RHEL5: (Execute once)
- - Verify that the standard device-mapper-multipath rpm is installed. If not,
- install it from the RHEL distribution.
-
-Installation for SLES10: (Execute once)
- - Verify that multipath is installed. If not, install it from the
- installation media (you can use YaST).
-
- - Update udev: (Execute once - for manual activation of High Availability only)
-
- - Add a file to /etc/udev/rules.d/ (you can call it 91-srp.rules)
- This file should have one line:
- ACTION=="add", KERNEL=="sd*[!0-9]", RUN+="/sbin/multipath %M:%m"
-
- Note: When SRPHA_ENABLE is set to "yes" (see Automatic Activation of High
- Availability below), this file is created upon each boot of the driver and
- is deleted when the driver is unloaded.
-
-
-Manual Activation of High Availability
---------------------------------------
-
-Initialization: (Execute after each boot of the driver)
- 1) Execute modprobe dm-multipath
- 2) Execute modprobe ib-srp
- 3) Make sure you have created file /etc/udev/rules.d/91-srp.rules
- as described above
- 4) Execute for each port and each HCA:
- srp_daemon -c -e -R 300 -i <InfiniBand HCA name> -p <port number>
- (You can use another value for -R. See under the Known Issues section
- the workaround for the rare race condition.)
-
- This step can be performed by executing srp_daemon.sh, which sends
- its log to /var/log/srp_daemon.log.
-
- Now it is possible to access the SRP LUNs on /dev/mapper/.
-
- NOTE: It is possible for regular (non-SRP) LUNs to also be present;
- the SRP LUNs may be identified by their names. You can configure the
- /etc/multipath.conf file to change multipath behavior.
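
Step 4 of the initialization can be scripted. The sketch below only prints one
srp_daemon command line per HCA port so that the list can be reviewed before
anything is started. It assumes that the ports are exposed under
/sys/class/infiniband/<HCA name>/ports/<port number>; the function name
srp_daemon_cmds is hypothetical. The srp_daemon.sh script mentioned above
remains the supported way to start the daemons on all ports.

```shell
# Hypothetical helper: print the srp_daemon command for every HCA port found
# under the given sysfs root (default: /sys/class/infiniband).
srp_daemon_cmds() {
    root=${1:-/sys/class/infiniband}
    for port_dir in "$root"/*/ports/*; do
        # Skip the unexpanded pattern when no HCA is present.
        [ -d "$port_dir" ] || continue
        port=${port_dir##*/}
        hca_dir=${port_dir%/ports/*}
        hca=${hca_dir##*/}
        printf 'srp_daemon -c -e -R 300 -i %s -p %s\n' "$hca" "$port"
    done
}

# Usage: review the list, then start each command in the background, making
# sure that only one instance runs per port.
#   srp_daemon_cmds
```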
-
-
-Automatic Activation of High Availability
------------------------------------------
-- Set the value of SRPHA_ENABLE in /etc/infiniband/openib.conf to "yes".
- Also make sure SRP_LOAD=yes and SRP_DAEMON_ENABLE=yes.
-
-- From the next loading of the driver it will be possible to access the SRP
- LUNs on /dev/mapper/
- NOTE: It is possible that regular (not SRP) LUNs may also be present;
- the SRP LUNs may be identified by their name.
-
-- It is possible to see the output of the SRP daemon in /var/log/srp_daemon.log
-
-
-==============================================================================
-11. Shutting Down SRP
-==============================================================================
-
-SRP can be shut down by using "modprobe -r ib_srp", by stopping the OFED driver
-("/etc/init.d/openibd stop"), or as a by-product of a complete system shutdown.
-
-Prior to shutting down SRP, it is REQUIRED to remove all references to it.
-The actions you need to take depend on the way SRP was loaded. There are
-three cases.
-
-a. Without High Availability
-------------------------------------
-When working without High Availability, you should unmount all SRP
-partitions that were mounted prior to shutting down SRP.
-For example, if /dev/sdd1 is an SRP partition mounted on /mnt/test:
-$ umount /mnt/test
-$ modprobe -r ib_srp
-
-NOTES: The umount may hang for ~90 seconds per connection to the target if the
- target is down. This is because the SRP driver waits srp_dev_loss_tmo
- (60 seconds by default) for the target to come back before returning
- an error status.
- If you have shut down or removed the SRP target and the host has 4
- connections to it, you should wait ~4-5 minutes for the umount to exit.
- Do not press Ctrl+C to kill the umount process.
-
-b. After Manual Activation of High Availability
------------------------------------------------
-If you manually activated SRP High Availability, perform the following steps:
-- Unmount all SRP partitions that were mounted
-- Kill all SRP daemon instances.
-- Make sure there are no multipath instances running. If there are any
- instances, wait for them to end or kill them.
-- Execute multipath -F
-
-Example:
-$ umount /mnt/test1 /mnt/test2 (wait for it to exit, do not ctrl+c)
-$ ps -ax and kill all srp_daemon processes.
-$ multipath -ll (wait for it to exit, do not ctrl+c)
-$ multipath -F
-$ modprobe -r ib_srp
-
-c. After Automatic Activation of High Availability
---------------------------------------------------
-If SRP High Availability was automatically activated, SRP shutdown must be
-part of the driver shutdown ("/etc/init.d/openibd stop") which performs
-steps 2-5 of case (b) above. However, you still have to unmount all SRP
-partitions that were mounted before driver shutdown.
-
-
-HAL Issue
----------
-The HAL (Hardware Abstraction Layer) system includes a daemon that examines
-all devices in the system. In this process, it frequently holds a reference
-to the ib_srp module. If you attempt to shut down SRP while this daemon is
-holding a reference to ib_srp, the shutdown will fail. Therefore, you
-should make sure this will not occur. One solution may be to stop "haldaemon"
-(/etc/init.d/haldaemon stop) prior to SRP shutdown.
-
-
-==============================================================================
-12. Known Issues
-==============================================================================
-
-- There is a very rare race condition which can cause the SRP daemon to miss a
- target that joins the fabric. The race can occur if a target joins and leaves
- the fabric several times in a short time (e.g., if the cable is not connected
- well). In such a case, the SM may ignore this quick change of state and may
- not send an InformInfo to the srp_daemon.
-
- Workaround: Execute the srp_daemon command with the -R <sec> option. This
- option causes the SRP daemon to perform a full rescan of the fabric every
- <sec> seconds.
-
-- The srp_daemon does not support pkeys other than the default
- pkey=ffff
-
-- It is recommended to use an SM that supports the enhanced capability mask
- matching feature (errata MGTWG8372). With SMs which support this feature, the
- SRP daemon generates significantly less communication traffic.
-
-- When booting OFED with SRP High Availability enabled, executing multipath for
- all LUNs on all connections may take some time (several minutes). However, it
- is possible to start working while this process is in progress.
-
-- Stopping the driver while SRP High Availability is enabled kills all
- multipath processes. Consider appropriate actions in case multipath is used
- for other purposes.
-
-- As High Availability is based on Device Mapper multipath, it embodies
- multipath limitations and also its configuration and tuning options.
- See http://christophe.varoqui.free.fr/wiki/wakka.php?wiki=Home
- for information on multipath.
- To modify and tune multipath configuration, edit the file /etc/multipath.conf
- according to instructions and tips listed in
- /usr/share/doc/packages/multipath-tools/multipath.conf.*
-
-- In case your topology has two physical connections (i.e., network paths) from
- a single initiator IB port to two different IB ports on the same Target HCA,
- and you wish to have an SRP connection on the one path coexist with an SRP
- connection on the second path, you must set a different initiator_ext value
- on each path. See Section 9, "Multiple Connections from Initiator IB Port
- to the Target" for details.
-
-- The srp_daemon tool reads by default the configuration file
- /etc/srp_daemon.conf. In case this configuration file disallows connecting
- to a certain target, srp_daemon will ignore the target. If you find out
- that srp_daemon ignores a target, please check the /etc/srp_daemon.conf file.
-
-- Rebooting the system with a file system that was not cleanly unmounted and a
- dead connection to the SRP target may cause the system to get stuck.
-
-- After establishing a connection with an SRP target and rebooting the system,
- the initiator fails to connect to the target on the first manual *echo -n*
- command (the target rejects it as a stale connection). You need to issue
- *echo -n* one more time.
- This problem does not occur in srp_daemon mode, since srp_daemon retries
- the connection.
-
-- The combination of a "weak" single-LUN SRP target, I/O with a big block size,
- and the default max_cmd_per_lun=63, while using /dev/urandom to create a file
- on an ext3 file system on the SRP LUN, may cause ext3 to be remounted
- read-only.
- Example:
- sdb1 is first partition of srp lun sdb, ext3 fs is created
- $ mount /dev/sdb1 /mnt/sdb1; cd /mnt/sdb1
- $ dd if=/dev/urandom of=10G-file bs=1G count=10
- --> ext3 fs may remount with read-only flag
-
- Workarounds:
- ------------
- a. Log into the target with a small max_cmd_per_lun value (3, 4, 8)
- $ echo id_ext=0002c9030008fc0c,ioc_guid=0002c9030008fc0c,
- dgid=fe800000000000000002c90300084417,max_cmd_per_lun=4,pkey=ffff,
- service_id=0002c9030008fc0c > /sys/class/infiniband_srp/srp-mlx4_0-1/add_target
-
- ----------------OR-------------------
-
- b. Run dd with /dev/zero instead of /dev/urandom
- $ dd if=/dev/zero of=10G-file bs=1G count=10
-
- ----------------OR-------------------
-
- c. Run dd with smaller block size
- $ dd if=/dev/urandom of=10G-file bs=128K count=40000
-
- ----------------OR-------------------
-
- d. Combine steps a, b and c (this is the recommended workaround)
-
-==============================================================================
-13. Vendor Specific Notes
-==============================================================================
-
-Hosts connected to Qlogic SRP Targets must perform one of the following
-steps after upgrading to OFED 1.3.1 to continue accessing their storage
-successfully:
-
-1. When issuing the "echo" command to add a new SRP Target, the host
- must append the string ",initiator_ext=0000000000000001" to the original
- echo string.
- Example:
- 'ibsrpdm -c' output is as follows:
-
- id_ext=0000000000000001,ioc_guid=00066a0138000165,dgid=fe8000000000000
- 000066a0260000165,pkey=ffff,service_id=0000494353535250,io_class=ff00
-
- id_ext=0000000000000001,ioc_guid=00066a0238000165,dgid=fe8000000000000
- 000066a0260000165,pkey=ffff,service_id=0000494353535250,io_class=ff00
-
- To connect to the first target, the echo command must be:
-
- echo -n \
- id_ext=0000000000000001,ioc_guid=00066a0138000165,\
- dgid=fe8000000000000000066a0260000165,pkey=ffff,\
- service_id=0000494353535250,io_class=ff00,\
- initiator_ext=0000000000000001 > \
- /sys/class/infiniband_srp/srp-mthca0-1/add_target
-
-
-2. Change the SRP map on the Qlogic SRP Target to set the expected initiator
- extension to 0. For details on how to change the SRP map on a Qlogic SRP
- Target, please refer to product documentation.
-
-
+++ /dev/null
- Release Notes for
- OFED 1.5.1 DAPL Release
- March 2010
-
- This release of the uDAPL reference implementation package for both
- DAT 1.2 and 2.0 specification is timed to coincide with OFED release
- of the Open Fabrics (www.openfabrics.org) software stack.
-
- uDAPL v1 (1.2.16-1) and v2 (2.0.27-1)
-
- ----------------
-
- * New Features (v2 only) - UCM provider with IB UD based CM per process.
- More scalable than rdma_cm (cma) or socket cm (scm).
- ----------------
-
- * Provider descriptions and PROS/CONS (cma, scm, ucm)
-
- 1. CMA - uses OFA rdma_cm to setup QP's. IPoIB, ARP, and SA queries required.
-
- Provider name: ofa-v2-cma
- PROs: OFA rdma_cm has the most testing across many applications.
- Supports both iWARP and IB.
-
- CONs: Serialization of conn processing with kernel based CM service
- Requires IPoIB ARP for name resolution, storms
- Requires SA for path record queries for IB fabrics.
- Conn Request private data limited to 52 bytes.
-
- Settings for larger clusters (512+ cores):
-
- setenv DAPL_CM_ROUTE_TIMEOUT_MS 20000
- setenv DAPL_CM_ARP_TIMEOUT_MS 10000
-
- 2. SCM - uses sockets to exchange QP information. IPoIB, ARP, and SA queries NOT required.
-
- Provider name (connectx): ofa-v2-mlx4_0-1
- PROs: Each rank has own instance of socket cm. More private data with requests.
- Doesn't require path-record lookup.
-
- CONs: Socket resources grow with scale-out, serialization of
- connections with kernel based tcp sockets,
- Competes for MPI socket resources/port space and other TCP applications.
- Sockets remain in TIMEWAIT state for minutes after closure.
- Requires ARP for name resolution.
- Doesn't support iWARP devices.
-
- Settings for larger clusters (512+ cores):
-
- setenv DAPL_ACK_RETRY 7 /* IB RC Ack retry count */
- setenv DAPL_ACK_TIMER 20 /* IB RC Ack retry timer */
-
- 3. UCM - uses IB UD QP to exchange QP info. Sockets, ARP, IPoIB, and SA queries NOT required.
-
- Provider name (connectx): ofa-v2-mlx4_0-1u
- PROs: Each rank has own instance of CM in user process
- Resources fixed per rank regardless of scale-out size
- No serialization of user or kernel resources establishing connections,
- Simple 3-way msg handshake, CM messages fit in inline data for lowest message latency,
- Supports alternate paths
- No address resolution required.
- No path resolution required.
-
- CONs: New provider with limited testing, a little tougher to debug.
- Doesn't support iWARP
-
- Settings for larger clusters (512+ cores):
-
- setenv DAPL_UCM_REP_TIME 800 /* REQUEST timer, waiting for REPLY in millisecs */
- setenv DAPL_UCM_RTU_TIME 400 /* REPLY timer, waiting for RTU in millisecs */
- setenv DAPL_UCM_RETRY 15 /* REQUEST and REPLY retries */
- setenv DAPL_ACK_RETRY 7 /* IB RC Ack retry count */
- setenv DAPL_ACK_TIMER 20 /* IB RC Ack retry timer */
-
- ----------------
-
- * CM Performance: CPS profile for cma, scm, and ucm v2 uDAPL providers:
-
- Intel SR1600 Urbanna Servers with Xeon(R) CPU X5570 @ 2.93GHz
- Urbanna Platform - 2 node, 8 cores per node, Mellanox MLX4 IB QDR, no switch.
-
- dtestcm (server/client):
-
- cma: Connections: 183.21 usec, CPS 5458.31 Total 0.18 secs, poll_cnt=3403, Num=1000
- scm: Connections: 178.80 usec, CPS 5592.93 Total 0.18 secs, poll_cnt=2344, Num=1000
- ucm: Connections: 122.43 usec, CPS 8167.93 Total 0.12 secs, poll_cnt=2609, Num=1000
-
- dapl_cm_bw: MPI uDAPL/CM profiling application (all-to-all connections, all ranks)
-
- CMA
- 2 Connect times (10): Total 0.0020 per 0.0002 CPS=4997.98
- 4 Connect times (40): Total 0.0077 per 0.0002 CPS=5224.59
- 8 Connect times (240): Total 0.0276 per 0.0001 CPS=8710.76
- 16 Connect times (1120): Total 0.1194 per 0.0001 CPS=9379.37
- 32 Connect times (4800): Total 6.1949 per 0.0013 CPS=774.83
-
- SCM
- 2 Connect times (10): Total 0.0024 per 0.0002 CPS=4103.61
- 4 Connect times (40): Total 0.0060 per 0.0002 CPS=6622.41
- 8 Connect times (240): Total 0.0206 per 0.0001 CPS=11634.15
- 16 Connect times (1120): Total 9.0118 per 0.0080 CPS=124.28
- 32 Connect times (4800): Total 21.0198 per 0.0044 CPS=228.36
-
- UCM
- 2 Connect times (10): Total 0.0014 per 0.0001 CPS=7353.27
- 4 Connect times (40): Total 0.0045 per 0.0001 CPS=8816.19
- 8 Connect times (240): Total 0.0191 per 0.0001 CPS=12582.44
- 16 Connect times (1120): Total 0.0799 per 0.0001 CPS=14017.68
- 32 Connect times (4800): Total 0.3337 per 0.0001 CPS=14385.21
-
- ----------------
-
- * Bug Fixes
-
- V2.0 Package
-
- Release 2.0.27
- windows: add scm makefile
- windows does not require rdma_cma_abi.h, move the include from common code
- windows patch to fix IB_INVALID_HANDLE name collision
- scm: dat_ep_connect fails on 32bit servers
- undefined symbol: dapls_print_cm_list
- cleanup CM object lock before freeing CM object memory
- destroy verbs completion channels created via ia_open or ep_create.
- package: update Copyright file and include the 3 license files in distribution
- common: when copying private_data out of rdma_cm events, use the
- cma: fix referencing freed address
- dapl: move close device after async thread is done
-
- Release 2.0.26
- openib_common: add check for both gid and global routing in RTR
- openib_common: remote memory read privilege set multi times
- ucm, scm: DAPL_GLOBAL_ROUTING enabled causes segv
-
- Release 2.0.25
- winof scm: initialize opt for NODELAY setsockopt
- winof cma: windows definition for EADDRNOTAVAIL missing
- scm: client side setsockopt NODELAY fails if data arrives before setting
- cma: setup_listener Cannot assign requested address
- common: seg fault in dapl_evd_wait with multi-thread application using CNO's.
- ucm: inbound DREQ/DREP handshake should transition QP.
- winof: Remove duplicate include of comp_channel.cpp from cm.c as it is
- included in opensm_ucb/device.c.
-
- Release 2.0.24
- winof: Utilize WinOF version of inet_ntop() for Windows OSes which do not
- support inet_ntop().
- ucm: windows build issue with new CQ completion channel
- winof: add ucm provider to windows build
- winof: add missing build files for ibal, scm
- scm: connection peer resets under heavy load, incorrect event on error
- ucm: increase default reply and rtu timeout values.
- ucm: change some debug message levels and add check for valid UD REPLY during retries.
- ucm: increase timers during subsequent retries
- ucm, scm: address handles need to be destroyed when freeing Endpoints with UD QP's.
- openib_common: ignore pd free errors, clear pd_handle and return.
- ucm: using UD type QP's, ucm reports wrong reject event when user rejects AH resolution request.
- ucm, scm, cma: Fix CNO support on DTO type EVD's
- ucm: fix lock init bug in ucm_cm_find
- ucm: fix build problem with latest windows ucm changes
- ucm: The HCA should not be closed until all resources have been released.
- ucm: Fix build warning when compiling on 32-bit systems.
- ucm: Trying to deregister the same memory region twice leads to an
- dat: reduce debug message level when parsing for location of dat.conf
- ucm: update ucm provider for windows environment
- ucm: add timer/retry CM logic to the ucm provider
-
- Release 2.0.23
- cma: cannot reuse the cm_id and qp for new connection, must reallocate a new one.
- scm, cma: update DAPL cm protocol revision with latest address/port changes
- ucm: modify IB address format to align better with sockaddr_in6
- Add definition for getpid similar to that used by the other dtest apps.
- WinOF provides a common implementation of gettimeofday that should
- The completion manager was updated to provide an abstraction that
- dtestcm: remove IB verb definitions
- dtest, dtestx: remove IB verb definitions
- scm: tighten up socket options to ensure similar behavior on Windows and Linux.
- cma: improve serialization of destroy and event processing
- scm: improve serialization of destroy and state changes
- common: no cleanup/release code for timer thread
- scm, cma: dapli_thread doesn't always get terminated on library close.
- ucm: tighten up locking with CM processing, state changes
- ucm: For UD type QP's, return CR p_data with CONN_EST event on passive side.
- ucm: cleanup extra cr/lf
- ucm: fix issues with UD QP's.
- winof: Convert windows version of dapl and dat libraries to use private heaps.
- dtest, dtestx: modifications for UD QP testing with ucm provider.
- scm, ucm: UD QP support was broken when porting to common openib code base.
- cma: cleanup warning with unused local variable, ret, in disconnect
- cma: remove debug message after rdma_disconnect failure
- scm: socket errno check needs O/S dependent wrapper
- dapltest: update script files for WinOF
- cma: conditional check for new rdma_cm definition.
-
- Release 2.0.22
- dapltest: add mdep processor yield and use with dapltest
- ucm: Add new provider using a DAPL based IB-UD cm mechanism for MPI implementations.
-
- Release 2.0.21
- scm: Fix disconnect. QP's need to move to ERROR state in
- modify dtest.c to cleanup CNO wait code and consolidate into
- CNO events, once triggered will not be returned during the cno wait.
- CNO support broken in both CMA and SCM providers.
- common osd: include winsock2.h for IPv6 definitions.
- common osd: include ws2tcpip.h for sockaddr_in6 definitions.
- DAPL introduced the concept of directly waiting on the CQ for
- dapltest: Implement a malloc() threshold for the completion reaping.
- scm: handle connected state when freeing CM objects
- scm, dtest: changes for winof gettimeofday and FD_SETSIZE settings.
- scm: set TCP_NODELAY sockopt on the server side for sends.
- remove obsolete files in dapl/udapl source tree
- dtestcm: add UD type QP option to test
- scm: destroy QP called before disconnect
- cma: add support for rdma_cm TIME_WAIT event.
- scm: remove old udapl_scm code replaced by openib_scm.
- winof: fix issues after consolidating cma, scm code base.
- cma: lock held when exiting as a result of a rdma_create_event_channel failure.
- windows: all dlist functions have been moved to the header file.
- dtestcm windows: add build infrastructure for new dtestcm test suite
- openib_common: reorganize code base to share common mem, cq, qp, dto functions
- scm: fixes and optimizations for connection scaling
- scm: double the default fd_set_size
- scm: EP reference in CR should be cleared during ep_destroy
- dtestx: fix conn establishment event checking
- dtestcm: new test to measure dapl connection rates.
-
- Release 2.0.20
- common,scm: add debug capabilities to print in-process CM lists
- scm: disconnect EP before cleaning up orphaned CR's during dat_ep_free
- dapltest: windows scripts updated
- scm: private data is not handled properly via CR rejects.
- scm: cleanup orphaned UD CR's when destroying the EP
- scm: provider specific query for default UD MTU is wrong.
- scm: update CM code to shutdown before closing socket
- dapltest: windows script dt-cli.bat updated
- dapl/windows cma provider: add support for network devices based on index
- openib: remove 1st gen provider, replaced with openib_cma and openib_scm
- dapltest: update windows script files
- dapltest: windows batch files in scripts directory
- windows_osd/linux_osd: new dapl_os_gettid macro to return thread id
- windows: missing build files for common and udapl sub-directories
- windows: add build files for openib_scm, remove /Wp64 build option.
- scm: multi-hca CM processing broken. Need cr thread wakeup mechanism per HCA.
- dtest: add connection timers on client side
- linux_osd: use pthread_self instead of getpid for debug messages
- windows ibal-scm: dapl/dirs file needs to be updated to remove ibal-scm
-
- v1.2 Package:
-
- Release 1.2.16
- package: update Copyright file and include the 3 license files in distribution
- cma: max sge incorrectly decremented during ibv_device_query
-
- Release 1.2.15
- dtest, dapltest: conflict with dapl-2 utils package, change to dapl1, dapltest1
- scm: fix compiler warning, unused variable
-
- ----------------
-
- * Build Notes:
-
- # NON_DEBUG build/install example for x86_64, OFED targets
- ./configure --prefix /usr --sysconf=/etc --libdir /usr/lib64 LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include"
- make install
-
- # DEBUG build/install example for x86_64, using OFED targets
- ./configure --enable-debug --prefix /usr --sysconf=/etc --libdir /usr/lib64 LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include"
- make install
-
- # COUNTERS build/install example for x86_64, using OFED targets
- ./configure --prefix /usr --sysconf=/etc --libdir /usr/lib64 LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include -DDAPL_COUNTERS"
- make install
-
- ----------------
-
- * BKM for running new DAPL library on your cluster without any impact on existing OFED installation:
-
- Note: example for user /home/ardavis, (assumes /home/ardavis is exported) and MLX4 adapter, port 1
-
- Download latest 2.x package: http://www.openfabrics.org/downloads/dapl/dapl-2.0.25.tar.gz
-
- untar in /home/ardavis
- cd /home/ardavis/dapl-2.0.25
- ./configure && make (build on node with OFED 1.3 or higher installed, dependency on verb/rdma_cm libraries)
-
- create /home/ardavis/dat.conf with following 3 lines. (entries with path to new libraries):
-
- ofa-v2-ib0 u2.0 nonthreadsafe default /home/ardavis/dapl-2.0.25/dapl/udapl/.libs/libdaplcma.so.1 dapl.2.0 "ib0 0" ""
- ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default /home/ardavis/dapl-2.0.25/dapl/udapl/.libs/libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
- ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default /home/ardavis/dapl-2.0.25/dapl/udapl/.libs/libdaploucm.so.2 dapl.2.0 "mlx4_0 1" ""
-
- Run a uDAPL application, or an MPI that uses uDAPL, with the following setting (assuming MLX4 ConnectX adapters):
-
- setenv DAT_OVERRIDE /home/ardavis/dat.conf
-
- If running Intel MPI and uDAPL socket cm, set the following:
-
- setenv I_MPI_DEVICE rdssm:ofa-v2-mlx4_0-1
-
- or if running Intel MPI and uDAPL IB UD cm, set the following:
-
- setenv I_MPI_DEVICE rdssm:ofa-v2-mlx4_0-1u
-
- or if running Intel MPI and uDAPL rdma_cm, set the following:
-
- setenv I_MPI_DEVICE rdssm:ofa-v2-ib0
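- For Bourne-style shells, the steps above can be sketched as follows. This is
- an illustration only: the write_dat_conf helper and the DAPL_LIBDIR and
- DAT_CONF variables are hypothetical names, the paths assume the dapl-2.0.25
- tree was unpacked and built in the home directory, and only the two mlx4
- entries are written.

```shell
# Hypothetical helper: write a private dat.conf that points at a DAPL build tree.
#   $1 = directory holding the freshly built provider libraries
#   $2 = dat.conf file to create
write_dat_conf() {
    libdir=$1
    conf=$2
    cat > "$conf" <<EOF
ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default $libdir/libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default $libdir/libdaploucm.so.2 dapl.2.0 "mlx4_0 1" ""
EOF
}

# Assumed locations; adjust to the actual build tree.
DAPL_LIBDIR="$HOME/dapl-2.0.25/dapl/udapl/.libs"
DAT_CONF="$HOME/dat.conf"
write_dat_conf "$DAPL_LIBDIR" "$DAT_CONF"

# Bourne-shell equivalent of the csh setenv line for DAT_OVERRIDE.
export DAT_OVERRIDE="$DAT_CONF"
```

- Keeping the library directory as a parameter means the same file can be
- regenerated when a newer package is built in a different directory.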
-
--------------------------
-
- OFED 1.4.1 RELEASE NOTES
-
- NEW SINCE OFED 1.4 - new versions of uDAPL v1 (1.2.14-1) and v2 (2.0.19-1)
-
- * New Features - optional counters, must be configured/built with -DDAPL_COUNTERS
-
- * Bug Fixes
-
- v2 - scm, cma: dat max_lmr_block_size is 32 bit, verbs max_mr_size is 64 bit
- v2 - scm, cma: use direct SGE mappings from dat_lmr_triplet to ibv_sge
- v2 - dtest: add flush EVD call after data transfer errors
- v2 - scm: increase default MTU size from 1024 to 2048
- v2 - dapltest: reset server listen ports to avoid collisions during long runs
- v2 - dapltest: avoid duplicating ports, increment based on ep/thread count
- v2 - dapltest: fix assumptions that multiple EP's will connect in order
- v2 - common: sync missing when removing items from EVD pending queue
- v2 - scm: reduce open time with thread start up
- v2 - scm: getsockopt optlen needs to be initialized to size of optval
- v2 - scm: cr_thread cleanup
- v2 - OFED and WinOF code sync
- v2 - scm: remove unnecessary query gid/lid from connection phase code.
- v2 - scm: add optional 64-bit counters, build with -DDAPL_COUNTERS.
- v1,v2 - spec files missing Requires(post) statements for sed/coreutils
- v1,v2 - dtest/dapltest: use $(top_builddir) for .la files during test builds
- v1,v2 - scm: remove unnecessary thread when using direct objects
- v1,v2 - Fix SuSE 11 build issues, asm/atomic.h no longer exists
-
- * Build Notes:
-
- # NON_DEBUG build/install example for x86_64, OFED targets
- ./configure --prefix /usr --sysconf=/etc --libdir /usr/lib64 LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include"
- make install
-
- # DEBUG build/install example for x86_64, using OFED targets
- ./configure --enable-debug --prefix /usr --sysconf=/etc --libdir /usr/lib64 LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include"
- make install
-
- # COUNTERS build/install example for x86_64, using OFED targets
- ./configure --prefix /usr --sysconf=/etc --libdir /usr/lib64 LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include -DDAPL_COUNTERS"
- make install
-
- * BKM for running new DAPL library on your cluster without any impact on existing OFED installation:
-
- Note: example for user /home/ardavis, (assumes /home/ardavis is exported) and MLX4 adapter, port 1
-
- Download latest 2.x package: http://www.openfabrics.org/downloads/dapl/dapl-2.0.19.tar.gz
-
- untar in /home/ardavis
- cd /home/ardavis/dapl-2.0.19
- ./configure && make (build on node with OFED 1.3 or higher installed, dependency on verb/rdma_cm libraries)
-
- create /home/ardavis/dat.conf with following 2 lines. (entries with path to new libraries):
-
- ofa-v2-ib0 u2.0 nonthreadsafe default /home/ardavis/dapl-2.0.19/dapl/udapl/.libs/libdaplcma.so.1 dapl.2.0 "ib0 0" ""
- ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default /home/ardavis/dapl-2.0.19/dapl/udapl/.libs/libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
-
- Run a uDAPL application, or an MPI that uses uDAPL, with the following setting (assuming MLX4 ConnectX adapters):
-
- setenv DAT_OVERRIDE /home/ardavis/dat.conf
-
- If running Intel MPI and uDAPL socket cm, set the following:
-
- setenv I_MPI_DEVICE rdssm:ofa-v2-mlx4_0-1
-
- or if running Intel MPI and uDAPL rdma_cm, set the following:
-
- setenv I_MPI_DEVICE rdssm:ofa-v2-ib0
-
--------------------------
-
- OFED 1.4 RELEASE NOTES
-
- NEW SINCE OFED 1.3.1 - new versions of uDAPL v1 (1.2.12-1) and v2 (2.0.15-1)
-
- * New Features
-
- 1. The new socket CM provider, introduced in 1.2.8 and 2.0.11 packages,
- assumes homogeneous cluster and will setup the QP's based on local HCA port
- attributes and exchanges QP information via socket's using the hostname of
- each node. IPoIB and rdma_cm are NOT required for this provider. QP attributes
- can be adjusted via the following environment parameters:
-
- DAPL_ACK_TIMER (default=16; 5 bits; timeout = 4.096us * 2^ack_timer, so 16 == 268ms)
- DAPL_ACK_RETRY (default=7; 3 bits; 7 * 268ms = about 1.9 seconds)
- DAPL_RNR_TIMER (default=12; 5 bits; 12 == 0.64ms, 28 == 163ms, 31 == 491ms)
- DAPL_RNR_RETRY (default=7; 3 bits; 7 == infinite)
- DAPL_IB_MTU (default=1024; limited to the active MTU maximum)
-
- The new socket cm entries in /etc/dat.conf provide a link to the actual HCA
- device and port. Example v1 and v2 entries for a Mellanox connectx device, port 1:
-
- OpenIB-mlx4_0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 1" ""
- ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
-
- This new socket cm provider was successfully tested on the TATA CRL cluster
- (#8 on Top500) with Intel MPI, achieving an HPLinpack score of 132.8 TFlops on
- 1798 nodes (14384 cores) at ~76.9% of peak. DAPL_ACK_TIMER was increased to 21
- for this scale.
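- The effect of DAPL_ACK_TIMER can be estimated from the 4.096us * 2^ack_timer
- formula given above. A minimal sketch (the ack_timeout_ms helper is a
- hypothetical name, not part of DAPL) that compares the default of 16 with
- the value of 21 used at this scale:

```shell
# Hypothetical helper: ACK timeout in milliseconds for a DAPL_ACK_TIMER value,
# computed as 4.096 microseconds * 2^ack_timer.
ack_timeout_ms() {
    awk -v n="$1" 'BEGIN { printf "%.1f\n", 4.096e-3 * 2 ^ n }'
}

ack_timeout_ms 16   # default setting
ack_timeout_ms 21   # setting used for the 1798-node run
```

- Each increment of the timer value doubles the timeout, so moving from 16 to
- 21 lengthens it by a factor of 32.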
-
- 2. New v2 definitions for IB unreliable datagram extension (only supported in
- scm provider, libdaploscm.so.2)
-
- Extended EP dat_service_type, with DAT_IB_SERVICE_TYPE_UD
- Add IB extension call dat_ib_post_send_ud().
- Add address handle definition for UD calls.
- Add IB event definitions to provide remote AH via connect and connect requests
- See dtestx (-d) source for example usage model
-
- * Bug Fixes
-
- v1,v2 - dapltest: trans test moves to cleanup stage before rdma_read processing is complete
- v1,v2 - Fix static registration (dat.conf) to include sysconfdir override
- v1,v2 - dat.conf: add default iwarp entry for eth2
- v1,v2 - dapl: adjust max_rdma_read_iov to 1 for iWARP devices
- v1,v2 - dtest: reduce default IOV's for ep_create to support iWARP
- v1,v2 - dtest: fix 32-bit build issues
- v1,v2 - build: $(DESTDIR) prepend needed on install hooks for dat.conf
- v2 - scm: UD shares EP's which requires serialization
- v2 - dapl: fixes for IB UD extensions in common code and socket cm provider.
- v2 - dapl: add provider specific attribute query option for IB UD MTU size
- v2 - dapl build: add correct CFLAGS, set non-debug build by default for v2
- v2 - dtestx: fix stack corruption problem with hostname strcpy
- v2 - dapl extension: dapli_post_ext should always allocate cookie for requests.
- v2 - dapltest: manpage - rdma write example incorrect
- v1,v2 - dat, dapl, dtest, dapltest, providers: fix compiler warnings in dat common code
- v1,v2 - dapl cma: debug message during query needs definition for inet_ntoa
- v1,v2 - dapl scm: fix corner case that delivers duplicate disconnect events
- v1,v2 - dat: include stddef.h for NULL definition in dat_platform_specific.h
- v1,v2 - dapl: add debug messages during async and overflow events
- v1,v2 - dapltest: add check for duplicate disconnect events in transaction test
- v1,v2 - dapl scm: use correct device attribute for max_rdma_read_out, max_qp_init_rd_atom
- v1,v2 - dapl scm: change IB RC qp inline and timer defaults.
- v1,v2 - dapl scm: add mtu adjustments via environment, default = 1024.
- v1,v2 - dapl scm: change connect and accept to non-blocking to avoid blocking user thread.
- v1,v2 - dapl scm: update max_rdma_read_iov, max_rdma_write_iov EP attributes during query
- v1,v2 - dat: allow TYPE_ERR messages to be turned off with DAT_DBG_TYPE
- v1,v2 - dapl: remove needless terminating 0 in dto_op_str functions.
- v1,v2 - dat: remove reference to doc/dat.conf in makefile.am
- v1,v2 - dapl scm: fix ibv_destroy_cq busy error condition during dat_evd_free.
- v1,v2 - dapl scm: add stdout logging for uname and gethostbyname errors during open.
- v1,v2 - dapl scm: support global routing and set mtu based on active_mtu
- v1,v2 - dapl: add opcode to string function to report opcode during failures.
- v1,v2 - dapl: remove unused iov buffer allocation on the endpoint
- v1,v2 - dapl: endpoint pending request count is wrong
-
--------------------------
-
- OFED 1.3.1 RELEASE NOTES
-
- NEW SINCE OFED 1.3 - new versions of uDAPL v1 (1.2.7-1) and v2 (2.0.9-1)
-
- * New Features - None
-
- * Bug Fixes
- v2 - add private data exchange with reject
- v1,v2 - better error reporting in non-debug builds
- v1,v2 - update only OFA entries in dat.conf, cooperate with non-ofa providers
- v1,v2 - support for zero byte operations, iov==NULL
- v1,v2 - multi-transport support for inline data and private data differences
- v1,v2 - fix memory leaks and other reported bugs since OFED 1.3
- v1,v2 - dtest,dtestx,dapltest build issues on RHEL5.1
- v1,v2 - long delay during dat_ia_open when DNS not configured
- v1,v2 - use rdma_read_in/out from ep_attr per consumer instead of HCA max
-
--------------------------
-
- OFED 1.3 RELEASE NOTES
-
- NEW SINCE OFED 1.2
-
- * New Features
-
- 1. Add v2.0 library support for new 2.0 API Specification
- 2. Separate v1.2 library release to co-exist with v2.0 libraries.
- 3. New dat.conf with both 1.2 and 2.0 support
- 4. New v2.0 dtestx utilities to test IB extensions
-
- * Bug Fixes
-
- v1.2 and v2.0
- - uDAT: static/dynamic registry parsing fixes
- - uDAPL: provider fixes for dat_psp_create_any
- - dtest/dapltest: change default provider names to sync with dat.conf
- - openib_cma: issues with destroy_cm_id and init/resp exchange
- - dapltest: use gettimeofday instead of get_cycles for better portability
- - dapltest: endian issue with mem_handle, mem_address
- - dapltest fix to include inet_ntoa definitions
- - fix build problems on 32-bit and 64-bit PowerPC
- - cleanup packaging
-
- v2.0
- - set default config options to match spec file, --enable-debug --enable-ext-type=ib
- - use unique devel target names, libdat2.so, /usr/include/dat2
- - dtestx fix memory leak, freeaddrinfo after getaddrinfo
- - Fix for IB extended DTO cookie deallocation on inbound rdma_Write_immed
- - WinOF: Update OFED code base to include WinOF changes, work from same code base
- - WinOF: add DAT_API definition, __stdcall for windows, nothing for linux
- - dtest: add dat_evd_query to check correct size
- - openib_cma: add macro to convert SID to PORT
- - dtest: endian support for exchanging RMR info
- - openib_cma: lower default settings, inline and RDMA init/resp
- - openib_cma: missing ia_query for max_iov_segments_per_rdma_write
-
- v1.2
- - openib_cma: turn down dbg noise level on rejects
- - dtest: typo in memset
-
-
- BUILD: v1 and v2 uDAPL source install/build instructions (redhat example):
-
- # cd to distribution SRPMS directory
- cd /tmp/OFED-1.3/SRPMS
- rpm -i dapl-1.2*.rpm
- rpm -i dapl-2.0*.rpm
- cd /usr/src/redhat/SOURCES
- tar zxf dapl-1.2*.tgz
- tar zxf dapl-2.0*.tgz
-
- # NON_DEBUG build example for x86_64, using OFED targets
-
- ./configure --prefix /usr --sysconf=/etc --libdir /usr/lib64 \
- LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include"
-
- # build and install
-
- make
- make install
-
- # DEBUG build example for x86_64, using OFED targets
-
- ./configure --enable-debug --prefix /usr --sysconf=/etc --libdir /usr/lib64 \
- LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include"
-
- # build and install
-
- make
- make install
-
- # DEBUG messages: set environment variable DAPL_DBG_TYPE, default
- mapping is 0x0003
-
- DAPL_DBG_TYPE_ERR = 0x0001,
- DAPL_DBG_TYPE_WARN = 0x0002,
- DAPL_DBG_TYPE_EVD = 0x0004,
- DAPL_DBG_TYPE_CM = 0x0008,
- DAPL_DBG_TYPE_EP = 0x0010,
- DAPL_DBG_TYPE_UTIL = 0x0020,
- DAPL_DBG_TYPE_CALLBACK = 0x0040,
- DAPL_DBG_TYPE_DTO_COMP_ERR= 0x0080,
- DAPL_DBG_TYPE_API = 0x0100,
- DAPL_DBG_TYPE_RTN = 0x0200,
- DAPL_DBG_TYPE_EXCEPTION = 0x0400,
- DAPL_DBG_TYPE_SRQ = 0x0800,
- DAPL_DBG_TYPE_CNTR = 0x1000
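- The bits listed above can be OR-ed together to select several message
- classes at once. A minimal sketch for Bourne-style shells, assuming the
- library accepts a hexadecimal value as the 0x0003 default suggests; the shell
- variable names simply mirror the bit names and are not read by DAPL itself:

```shell
# Bit values copied from the list above.
DBG_ERR=0x0001
DBG_WARN=0x0002
DBG_CM=0x0008
DBG_EP=0x0010

# Combine errors, warnings, CM and EP messages into one mask.
DAPL_DBG_TYPE=$(printf '0x%04x' $(( $DBG_ERR | $DBG_WARN | $DBG_CM | $DBG_EP )))
export DAPL_DBG_TYPE
echo "$DAPL_DBG_TYPE"
```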
-
--------------------------
-
- OFED 1.2 RELEASE NOTES
-
- NEW SINCE Gamma 3.2 and OFED 1.1
-
- * New Features
-
- 1. Added dtest and dapltest to the openfabrics build and utils rpm.
- Includes manpages.
- 2. Added the following environment variables to configure connection management
- timers (default settings) for larger clusters:
-
- DAPL_CM_ARP_TIMEOUT_MS 4000
- DAPL_CM_ARP_RETRY_COUNT 15
- DAPL_CM_ROUTE_TIMEOUT_MS 4000
- DAPL_CM_ROUTE_RETRY_COUNT 15
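- A minimal sketch of overriding these timers from a Bourne-style shell. The
- 8000 ms values are illustrative assumptions, not recommendations, and the
- bound computed at the end assumes each retry waits for the full timeout
- before the next attempt, which the notes above do not confirm:

```shell
# Illustrative values for a larger cluster (assumptions, not tested settings).
DAPL_CM_ARP_TIMEOUT_MS=8000
DAPL_CM_ARP_RETRY_COUNT=15
DAPL_CM_ROUTE_TIMEOUT_MS=8000
DAPL_CM_ROUTE_RETRY_COUNT=15
export DAPL_CM_ARP_TIMEOUT_MS DAPL_CM_ARP_RETRY_COUNT
export DAPL_CM_ROUTE_TIMEOUT_MS DAPL_CM_ROUTE_RETRY_COUNT

# Rough upper bound on address resolution time under the assumption above.
arp_bound_ms=$(( $DAPL_CM_ARP_TIMEOUT_MS * $DAPL_CM_ARP_RETRY_COUNT ))
echo "address resolution bound: ${arp_bound_ms} ms"
```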
-
- * Bug Fixes
-
- + Added support for new ib verbs client register event. No extra
- processing required at the uDAPL level.
- + Fix some issues supporting create qp without recv cq handle or
- recv qp resources. IB verbs assume a recv_cq handle and uDAPL
- dapl_ep_create assumes there is always recv_sge resources specified.
- + Fix some timeout and long disconnect delay issues discovered during
- scale-out testing. Added support to retry rdma_cm address and route
- resolution with configuration options. Provide a disconnect call
- when receiving the disconnect request to guarantee a disconnect reply
- and event on the remote side. The rdma_disconnect was not being called
- from dat_ep_disconnect() as a result of the state changing
- to DISCONNECTED in the event callback.
- + Changes to support exchanging and validation of the device
- responder_resources and the initiator_depth during conn establishment
- + Fix some build issues with dapltest on 32 bit arch, and on ia64 SUSE arch
- + Add support for multiple IB devices to dat.conf to support IPoIB HA failover
- + Fix atomic operation build problem with ia64 and RHEL5.
- + Add support to return local and remote port information with dat_ep_query
- + Cleanup RPM specfile for the dapl package, move to 1.2-1 release.
-
- NEW SINCE Gamma 3.1 and OFED 1.0
-
- * BUG FIXES
-
- + Update obsolete CLK_TCK to CLOCKS_PER_SEC
- + Fill out some uninitialized fields in the ia_attr structure returned by
- dat_ia_query().
- + Update dtest to support multiple segments on rdma write and change
- makefile to use OpenIB-cma by default.
- + Add support for dat_evd_set_unwaitable on a DTO evd in openib_cma
- provider
- + Added errno reporting (message and return codes) during open to help
- diagnose create thread issues.
- + Fix some suspicious inline assembly EIEIO_ON_SMP and ISYNC_ON_SMP
- + Fix IA64 build problems
- + Lower the reject debug message level so we don't see warnings when
- consumers reject.
- + Added support for active side TIMED_OUT event from a provider.
- + Fix bug in dapls_ib_get_dat_event() call after adding new unreachable
- event.
- + Update for new rdma_create_id() function signature.
- + Set max rdma read per EP attributes
- + Report the proper error and timeout events.
- + Socket CM fix to guard against using a loopback address as the local
- device address.
- + Use the uCM set_option feature to adjust connect request timeout
- retry values.
- + Fix to disallow any event after a disconnect event.
-
- * OFED 1.1 uDAPL source build instructions:
-
- cd /usr/local/ofed/src/openib-1.1/src/userspace/dapl
-
- # NON_DEBUG build configuration
-
- ./configure --disable-libcheck --prefix /usr/local/ofed \
- --libdir /usr/local/ofed/lib64 LDFLAGS=-L/usr/local/ofed/lib64 \
- CPPFLAGS="-I../libibverbs/include -I../librdmacm/include"
-
- # build and install
-
- make
- make install
-
- # DEBUG build configuration
-
- ./configure --disable-libcheck --enable-debug --prefix /usr/local/ofed \
- --libdir /usr/local/ofed/lib64 LDFLAGS=-L/usr/local/ofed/lib64 \
- CPPFLAGS="-I../libibverbs/include -I../librdmacm/include"
-
- # build and install
-
- make
- make install
-
- # DEBUG messages: set environment variable DAPL_DBG_TYPE, default
- mapping is 0x0003
-
- DAPL_DBG_TYPE_ERR = 0x0001,
- DAPL_DBG_TYPE_WARN = 0x0002,
- DAPL_DBG_TYPE_EVD = 0x0004,
- DAPL_DBG_TYPE_CM = 0x0008,
- DAPL_DBG_TYPE_EP = 0x0010,
- DAPL_DBG_TYPE_UTIL = 0x0020,
- DAPL_DBG_TYPE_CALLBACK = 0x0040,
- DAPL_DBG_TYPE_DTO_COMP_ERR= 0x0080,
- DAPL_DBG_TYPE_API = 0x0100,
- DAPL_DBG_TYPE_RTN = 0x0200,
- DAPL_DBG_TYPE_EXCEPTION = 0x0400,
- DAPL_DBG_TYPE_SRQ = 0x0800,
- DAPL_DBG_TYPE_CNTR = 0x1000
-
-
- Note: The udapl provider library libdaplscm.so is untested and
- unsupported, thus customers should not use it.
- It will be removed in the next OFED release.
-
- DAPL GAMMA 3.1 RELEASE NOTES
-
- This release of the DAPL reference implementation
- is timed to coincide with the first release of the
- Open Fabrics (www.openfabrics.org) software stack.
- This release adds support for this new stack, which
- is now the native Linux RDMA stack.
-
- This release also adds a new licensing option. In
- addition to the Common Public License and BSD License,
- the code can now be licensed under the terms of the GNU
- General Public License (GPL) version 2.
-
- NEW SINCE Gamma 3.0
-
- - GPL v2 added as a licensing option
- - OpenFabrics (aka OpenIB) gen2 verbs support
- - dapltest support for Solaris 10
-
- * BUG FIXES
-
- + Fixed a disconnect event processing race
- + Fix to destroy all QPs on IA close
- + Removed compiler warnings
- + Removed unused variables
- + And many more...
-
- DAPL GAMMA 3.0 RELEASE NOTES
-
- This is the first release based on version 1.2 of the spec. There
- are some components, such as shared receive queues (SRQs), which
- are not implemented yet.
-
- Once again there were numerous bug fixes submitted by the
- DAPL community.
-
- NEW SINCE Beta 2.06
-
- - DAT 1.2 headers
- - DAT_IA_HANDLEs implemented as small integers
- - Changed default device name to be "ia0a"
- - Initial support for Linux 2.6.X kernels
- - Updates to the OpenIB gen 1 provider
-
- * BUG FIXES
-
- + Updated Makefile for differentiation between OS releases.
- + Updated atomic routines to use appropriate API
- + Removed unnecessary assert from atomic_dec.
- + Fixed bugs when freeing a PSP.
- + Fixed error codes returned by the DAT static registry.
- + Kernel updates for dat_strerror.
- + Cleaned up the transport layer/adapter interface to use DAPL
- types rather than transport types.
- + Fixed ring buffer reallocation.
- + Removed old test/udapl/dapltest directory.
- + Fixed DAT_IA_HANDLE translation (from pointer to int and
- vice versa) on 64-bit platforms.
-
- DAPL BETA 2.06 RELEASE NOTES
-
- We are not planning any further releases of the Beta series,
- which are based on the 1.1 version of the spec. There may be
- further releases for bug fixes, but we anticipate the DAPL
- community to move to the new 1.2 version of the spec and the
- changes mandated in the reference implementation.
-
- The biggest item in this release is the first inclusion of the
- OpenIB Gen 1 provider, an item generating a lot of interest in
- the IB community. This implementation has graciously been
- provided by the Mellanox team. The kdapl implementation is in
- progress, and we imagine work will soon begin on Gen 2.
-
- There are also a handful of bug fixes available, as well as a long
- awaited update to the endpoint design document.
-
- NEW SINCE Beta 2.05
-
- - OpenIB gen 1 provider support has been added
- - Added dapls_evd_post_generic_event(), routine to post generic
- event types as requested by some providers. Also cleaned up
- error reporting.
- - Updated the endpoint design document in the doc/ directory.
-
- * BUG FIXES
-
- + Cleaned up memory leak on close by freeing the HCA structure;
- + Removed bogus #defs for rdtsc calls on IA64.
- + Changed dapltest thread types to use internal types for
- portability & correctness
- + Various 64 bit enhancements & updates
- + Fixes to conformance test that were defining CONN_QUAL twice
- and using it in different ways
- + Cleaned up private data handling in ep_connect & provider
- support: we now avoid extra copy in connect code; reduced
- stack requirements by using private_data structure in the EP;
- removed provider variable.
- + Fixed problem in the dat conformance test where cno_wait would
- attempt to dereference a timer value and SEGV.
- + Removed old vestiges of deprecated POLLING_COMPLETIONS
- conditionals.
-
- DAPL BETA 2.05 RELEASE NOTES
-
- This was to be a very minor release; the primary change was
- going to be the new wording of the DAT license as contained in
- the header for all source files. But the interest and
- development occurring in DAPL provided some extra bug fixes, and
- some new functionality that has been requested for a while.
-
- First, you may notice that every single source file was
- changed. If you read the release notes from DAPL BETA 2.04, you
- were warned this would happen. There was a legal issue with the
- wording in the header; the end result was that every source file
- was required to change the words 'either of' to 'both'. We've
- been putting this change off as long as possible, but we wanted
- to do it in a clean drop before we start working on DAT 1.2
- changes in the reference implementation, just to keep things
- reasonably sane.
-
- kdapltest has enabled three of the subtests supported by
- dapltest. The Performance test in particular has been very
- useful to dapltest in getting minima and maxima. The Limit test
- pushes the limits by allocating the maximum number of specific
- resources. And the FFT tests are also available.
-
- Most vendors have supported shared memory regions for a while,
- several of which have asked the reference implementation team to
- provide a common implementation. Shared memory registration has
- been tested on ibapi, and compiled into vapi. Both InfiniBand
- providers have the restriction that a memory region must be
- created before it can be shared; not all RDMA APIs are this way,
- several allow you to declare a memory region shared when it is
- registered. Hence, details of the implementation are hidden in
- the provider layer, rather than forcing other APIs to do
- something strange.
-
- This release also contains some changes that will allow dapl to
- work on Opteron processors, as well as some preliminary support
- for Power PC architecture. These features are not well tested
- and may be incomplete at this time.
-
- Finally, we have been asked several times over the course of the
- project for a canonical interface between the common and
- provider layers. This release includes a dummy provider to meet
- that need. Anyone should be able to download the release and do
- a:
- make VERBS=DUMMY
-
- And have a cleanly compiled dapl library. This will be useful
- both to those porting new transport providers, as well as those
- going to new machines.
-
- The DUMMY provider has been compiled on both Linux and Windows
- machines.
-
-
- NEW SINCE Beta 2.04
- - kdapltest enhancements:
- * Limit subtests now work
- * Performance subtests now work.
- * FFT tests now work.
-
- - The VAPI headers have been refreshed by Mellanox
-
- - Initial Opteron and PPC support.
-
- - Atomic data types now have consistent treatment, allowing us to
- use native data types other than integers. The Linux kdapl
- uses atomic_t, allowing dapl to use the kernel macros and
- eliminate the assembly code in dapl_osd.h
-
- - The license language was updated per the direction of the
- DAT Collaborative. This two word change affected the header
- of every file in the tree.
-
- - SHARED memory regions are now supported.
-
- - Initial support for the TOPSPIN provider.
-
- - Added a dummy provider, essentially the NULL provider. Its
- purpose is to aid in porting and to clarify exactly what is
- expected in a provider implementation.
-
- - Removed memory allocation from the DTO path for VAPI
-
- - cq_resize will now allow the CQ to be resized smaller. Not all
- providers support this, but it's a provider problem, not a
- limitation of the common code.
-
- * BUG FIXES
-
- + Removed spurious lock in dapl_evd_connection_callb.c that
- would have caused a deadlock.
- + The Async EVD was getting torn down too early, potentially
- causing lost errors. Has been moved later in the teardown
- process.
- + kDAPL replaced mem_map_reserve() with newer SetPageReserved()
- for better Linux integration.
- + kdapltest no longer allocates large print buffers on the stack,
- and is more careful to ensure buffers don't overflow.
- + Put dapl_os_dbg_print() under DAPL_DBG conditional, it is
- supposed to go away in a production build.
- + dapltest protocol version has been bumped to reflect the
- change in the Service ID.
- + Corrected several instances of routines that did not adhere
- to the DAT 1.1 error code scheme.
- + Cleaned up vapi ib_reject_connection to pass DAT types rather
- than provider specific types. Also cleaned up naming interface
- declarations and their use in vapi_cm.c; fixed incorrect
- #ifdef for naming.
- + Initialize missing uDAPL provider attr, pz_support.
- + Changes for better layering: first, moved
- dapl_lmr_convert_privileges to the provider layer as memory
- permissions are clearly transport specific and are not always
- defined in an integer bitfield; removed common routines for
- lmr and rmr. Second, move init and release setup/teardown
- routines into adapter_util.h, which defined the provider
- interface.
- + Cleaned up the HCA name cruft that allowed different types
- of names such as strings or ints to be dealt with in common
- code; but all names are presented by the dat_registry as
- strings, so pushed conversions down to the provider
- level. Greatly simplifies names.
- + Changed deprecated true/false to DAT_TRUE/DAT_FALSE.
- + Removed old IB_HCA_NAME type in favor of char *.
- + Fixed race condition in kdapltest's use of dat_evd_dequeue.
- + Changed cast for SERVER_PORT_NUMBER to DAT_CONN_QUAL as it
- should be.
- + Small code reorg to put the CNO into the EVD when it is
- allocated, which simplifies things.
- + Removed gratuitous ib_hca_port_t and ib_send_op_type_t types,
- replaced with standard int.
- + Pass a pointer to cqe debug routine, not a structure. Some
- clean up of data types.
- + kdapl threads now invoke reparent_to_init() on exit to allow
- threads to get cleaned up.
-
-
-
- DAPL BETA 2.04 RELEASE NOTES
-
- The big changes for this release involve a more strict adherence
- to the original dapl architecture. Originally, only InfiniBand
- providers were available, so allowing various data types and
- event codes to show through into common code wasn't a big deal.
-
- But today, there are an increasing number of providers available
- on a number of transports. Requiring an IP iWarp provider to
- match up to InfiniBand events is silly, for example.
-
- Restructuring the code allows more flexibility in providing an
- implementation.
-
- There are also a large number of bug fixes available in this
- release, particularly in kdapl related code.
-
- Be warned that the next release will change every file in the
- tree as we move to the newly approved DAT license. This is a
- small change, but all files are affected.
-
- Future releases will also support the soon-to-be-ratified DAT
- 1.2 specification.
-
- This release has benefited from many bug reports and fixes from
- a number of individuals and companies. On behalf of the DAPL
- community, thank you!
-
-
- NEW SINCE Beta 2.3
-
- - Made several changes to be more rigorous on the layering
- design of dapl. The intent is to make it easier for non
- InfiniBand transports to use dapl. These changes include:
-
- * Revamped the ib_hca_open/close code to use an hca_ptr
- rather than an ib_handle, giving the transport layer more
- flexibility in assigning transport handles and resources.
-
- * Removed the CQD calls; they are specific to the IBM API;
- folded this functionality into the provider open/close calls.
-
- * Moved VAPI, IBAPI transport specific items into a transport
- structure placed inside of the HCA structure. Also updated
- routines using these fields to use the new location. Cleaned
- up provider knobs that have been exposed for too long.
-
- * Changed a number of provider routines to use DAPL structure
- pointers rather than exposing provider handles & values. Moved
- provider specific items out of common code, including provider
- data types (e.g. ib_uint32_t).
-
- * Pushed provider completion codes and type back into the
- provider layer. We no longer use EVD or CM completion types at
- the common layer, instead we obtain the appropriate DAT type
- from the provider and process only DAT types.
-
- * Change private_data handling such that we can now accommodate
- variable length private data.
-
- - Remove DAT 1.0 cruft from the DAT header files.
-
- - Better spec compliance in headers and various routines.
-
- - Major updates to the VAPI implementation from
- Mellanox. Includes initial kdapl implementation
-
- - Move kdapl platform specific support for hash routines into
- OSD file.
-
- - Cleanups to make the code more readable, including comments
- and certain variable and structure names.
-
- - Fixed CM_BUSTED code so that it works again: very useful for
- new dapl ports where infrastructure is lacking. Also made
- some fixes for IBHOSTS_NAMING conditional code.
-
- - Added DAPL_MERGE_CM_DTO as a compile time switch to support
- EVD stream merging of CM and DTO events. Default is off.
-
- - 'Quit' test ported to kdapltest
-
- - uDAPL now builds on Linux 2.6 platform (SuSE 9.1).
-
- - kDAPL now builds for a larger range of Linux kernels, but
- still lacks 2.6 support.
-
- - Added shared memory ID to LMR structure. Shared memory is
- still not fully supported in the reference implementation, but
- the common code will appear soon.
-
- * Bug fixes
- - Various Makefiles fixed to use the correct dat registry
- library in its new location (as of Beta 2.03)
- - Simple reorg of dat headers files to be consistent with
- the spec.
- - fixed bug in vapi_dto.h recv macro where we could have an
- uninitialized pointer.
- - Simple fix in dat_dr.c to initialize a variable early in the
- routine before errors occur.
- - Removed private data pointers from a CONNECTED event, as
- there should be no private data here.
- - dat_strerror no longer returns an uninitialized pointer if
- the error code is not recognized.
- - dat_dup_connect() will reject 0 timeout values, per the
- spec.
- - Removed unused internal_hca_names parameter from
- ib_enum_hcas() interface.
- - Use a temporary DAT_EVENT for kdapl up-calls rather than
- making assumptions about the current event queue.
- - Relocated some platform dependent code to an OSD file.
- - Eliminated several #ifdefs in .c files.
- - Inserted a missing unlock() on an error path.
- - Added bounds checking on size of private data to make sure
- we don't overrun the buffer
- - Fixed a kdapltest problem that caused a machine to panic if
- the user hit ^C
- - kdapltest now uses spin locks more appropriate for their
- context, e.g. spin_lock_bh or spin_lock_irq. Under a
- conditional.
- - Fixed kdapltest loops that drain EVDs so they don't go into
- endless loops.
- - Fixed bug in dapl_llist_add_entry link list code.
- - Better error reporting from provider code.
- - Handle case of user trying to reap DTO completions on an
- EP that has been freed.
- - No longer hold lock when ep_free() calls into provider layer
- - Fixed cr_accept() to not have an extra copy of
- private_data.
- - Verify private_data pointers before using them, avoid
- panic.
- - Fixed memory leak in kdapltest where print buffers were not
- getting reclaimed.
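The private data fixes listed above (bounds checking on the size and verifying the pointer before use) can be illustrated with a minimal sketch. The structure, the function name, and the 64-byte limit are hypothetical and chosen only for illustration; they are not taken from the DAPL source.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical limit; the real maximum depends on the provider. */
#define EXAMPLE_MAX_PRIVATE_DATA 64

/* Hypothetical connection record holding a fixed private data buffer. */
struct example_conn {
    unsigned char private_data[EXAMPLE_MAX_PRIVATE_DATA];
    size_t        private_data_size;
};

/*
 * Copy caller-supplied private data into the connection record.
 * Returns 0 on success and -1 if the request is rejected.
 */
int example_set_private_data(struct example_conn *conn,
                             const void *data, size_t size)
{
    if (conn == NULL)
        return -1;

    /* Bounds check: never copy more than the buffer can hold. */
    if (size > sizeof(conn->private_data))
        return -1;

    /* Pointer check: a non-empty payload needs a valid source. */
    if (size > 0 && data == NULL)
        return -1;

    if (size > 0)
        memcpy(conn->private_data, data, size);
    conn->private_data_size = size;
    return 0;
}
```

A rejected request leaves the previously stored data and size untouched, so a caller can report the error without further cleanup.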
-
-
-
- DAPL BETA 2.03 RELEASE NOTES
-
- There are some prominent features in this release:
- 1) dapltest/kdapltest. The dapltest test program has been
- rearchitected such that a kernel version is now available
- to test with kdapl. The most obvious change is a new
- directory structure that more closely matches other core
- dapl software. But there are a large number of changes
- throughout the source files to accommodate both the
- differences in udapl/kdapl interfaces and more mundane
- things such as printing.
-
- The new dapltest is in the tree at ./test/dapltest, while the
- old remains at ./test/udapl/dapltest. For this release, we
- have maintained both versions. In a future release, perhaps
- the next release, the old dapltest directory will be
- removed. Ongoing development will only occur in the new tree.
-
- 2) DAT 1.1 compliance. The DAT Collaborative has been busy
- finalizing the 1.1 revision of the spec. The header files
- have been reviewed and posted on the DAT Collaborative web
- site; they are now in full compliance.
-
- The reference implementation has been at a 1.1 level for a
- while. The current implementation has some features that will
- be part of the 1.2 DAT specification, but only in places
- where full compatibility can be maintained.
-
- 3) The DAT Registry has undergone some positive changes for
- robustness and support of more platforms. It now has the
- ability to support several identical provider names
- simultaneously, which enables the same dat.conf file to
- support multiple platforms. The registry will open each
- library and return when successful. For example, a dat.conf
- file may contain multiple provider names for ex0a, each
- pointing to a different library that may represent different
- platforms or vendors. This simplifies distribution into
- different environments by enabling the use of common
- dat.conf files.
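The registry behaviour described in item 3 (several entries sharing one provider name, with the first library that opens successfully being used) can be sketched roughly as follows. This is not the registry's code: the entry structure, the opener callback, and the stand-in opener are all hypothetical, and a real implementation would call the platform's dynamic loader instead.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified view of one dat.conf entry. */
struct conf_entry {
    const char *provider_name;  /* e.g. "ex0a" */
    const char *library_path;   /* platform- or vendor-specific library */
};

/* Opener callback: returns a non-NULL handle on success (dlopen-like). */
typedef void *(*open_fn)(const char *path);

/*
 * Walk the entries in order and try every one whose name matches.
 * Return the handle of the first library that opens, or NULL if none do.
 * If opened_path is non-NULL it receives the path that succeeded.
 */
void *open_first_provider(const struct conf_entry *entries, size_t count,
                          const char *name, open_fn opener,
                          const char **opened_path)
{
    for (size_t i = 0; i < count; i++) {
        if (strcmp(entries[i].provider_name, name) != 0)
            continue;

        void *handle = opener(entries[i].library_path);
        if (handle != NULL) {
            if (opened_path != NULL)
                *opened_path = entries[i].library_path;
            return handle;
        }
    }
    return NULL;
}

/* Stand-in opener for illustration: pretends only "lib64" paths load. */
void *fake_opener(const char *path)
{
    return strstr(path, "lib64") != NULL ? (void *)path : NULL;
}
```

With two "ex0a" entries pointing at a 32-bit and a 64-bit library, the sketch skips the entry that fails to open and returns the one that succeeds, which is what lets a single dat.conf serve several platforms.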
-
- In addition, there are a large number of bug fixes throughout
- the code. Bug reports and fixes have come from a number of
- companies.
-
- Also note that the Release notes are cleaned up, no longer
- containing the complete text of previous releases.
-
- * EVDs no longer support DTO and CONNECTION event types on the
- same EVD. NOTE: The problem is maintaining the event ordering
- between two channels such that no DTO completes before a
- connection is received; and no DTO completes after a
- disconnect is received. For 90% of the cases this can be made
- to work, but the remaining 10% will cause serious performance
- degradation to get right.
-
- NEW SINCE Beta 2.2
-
- * DAT 1.1 spec compliance. This includes some new types, error
- codes, and moving structures around in the header files,
- among other things. Note the Class bits of dat_error.h have
- returned to a #define (from an enum) to cover the broadest
- range of platforms.
-
- * Several additions for robustness, including handle and
- pointer checking, better argument checking, state
- verification, etc. Better recovery from error conditions,
- and some assert()s have been replaced with 'if' statements to
- handle the error.
-
- * EVDs now maintain the actual queue length, rather than the
- requested amount. Both the DAT spec and IB (and other
- transports) allow the underlying implementation to provide
- more CQ entries than requested.
-
- Requests for the same number of entries contained by an EVD
- return immediate success.
-
- * kDAPL enhancements:
- - module parameters & OS support calls updated to work with
- more recent Linux kernels.
- - kDAPL build options changed to match the Linux kernel, vastly
- reducing the size and making it more robust.
- - kDAPL unload now works properly
- - kDAPL takes a reference on the provider driver when it
- obtains a verbs vector, to prevent an accidental unload
- - Cleaned out all of the uDAPL cruft from the linux/osd files.
-
- * New dapltest (see above).
-
- * Added a new I/O trace facility, enabling a developer to debug
- all I/O that is in progress or recently completed. Default
- is OFF in the build.
-
- * 0 timeout connections now refused, per the spec.
-
- * Moved the remaining uDAPL specific files from the common/
- directory to udapl/. Also removed udapl files from the kdapl
- build.
-
- * Bug fixes
- - Better error reporting from provider layer
- - Fixed race condition on reference counts for posting DTO
- ops.
- - Use DAT_COMPLETION_SUPPRESS_FLAG to suppress successful
- completion of dapl_rmr_bind (instead of
- DAT_COMPLETION_UNSIGNALLED, which is for non-notification
- completion).
- - Verify psp_flags value per the spec
- - Bug in psp_create_any() checking psp_flags fixed
- - Fixed type of flags in ib_disconnect from
- DAT_COMPLETION_FLAGS to DAT_CLOSE_FLAGS
- - Removed hard coded check for ASYNC_EVD. Placed all EVD
- prevention in evd_stream_merging_supported array, and
- prevent ASYNC_EVD from being created by an app.
- - ep_free() fixed to comply with the spec
- - Replaced various printfs with dbg_log statements
- - Fixed kDAPL interaction with the Linux kernel
- - Corrected phy_register prototype
- - Corrected kDAPL wait/wakeup synchronization
- - Fixed kDAPL evd_kcreate() such that it no longer depends
- on uDAPL only code.
- - dapl_provider.h had wrong guard #def: changed DAT_PROVIDER_H
- to DAPL_PROVIDER_H
- - removed extra (and bogus) call to dapls_ib_completion_notify()
- in evd_kcreate.c
- - Inserted missing error code assignment in
- dapls_rbuf_realloc()
- - When a CONNECTED event arrives, make sure we are ready for
- it, else something bad may have happened to the EP and we
- just return; this replaces an explicit check for a single
- error condition, replacing it with the general check for the
- state capable of dealing with the request.
- - Better context pointer verification. Removed locks around
- call to ib_disconnect on an error path, which would result
- in a deadlock. Added code for BROKEN events.
- - Brought the vapi code more up to date: added conditional
- compile switches, removed obsolete __ActivePort, deal
- with 0 length DTO
- - Several dapltest fixes to bring the code up to the 1.1
- specification.
- - Fixed mismatched dapl_os_dbg_print() #else dapl_Dbg_Print();
- the latter was replaced with the former.
- - ep_state_subtype() now includes UNCONNECTED.
- - Added some missing ibapi error codes.
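The EVD queue length change noted above (the EVD records the actual length granted, and a request for the length it already has returns immediate success) might look like the sketch below. The structure, the function names, and the power-of-two rounding in the stand-in provider are assumptions made for illustration, not the reference implementation.

```c
#include <stddef.h>

/* Hypothetical EVD record: qlen is the actual length, not the request. */
struct example_evd {
    size_t qlen;
};

/* Counts calls into the stand-in provider, for demonstration only. */
static int example_provider_calls = 0;

/*
 * Stand-in provider hook. Assumption: it rounds the request up to a
 * power of two, modelling a transport that grants more than was asked.
 */
static size_t example_provider_alloc_cq(size_t requested)
{
    size_t granted = 1;

    example_provider_calls++;
    while (granted < requested)
        granted <<= 1;
    return granted;
}

/* Resize the EVD; returns 0 on success. */
int example_evd_resize(struct example_evd *evd, size_t requested)
{
    /* Same size as the EVD already holds: succeed without any work. */
    if (requested == evd->qlen)
        return 0;

    /* Record what the provider actually granted. */
    evd->qlen = example_provider_alloc_cq(requested);
    return 0;
}
```

Because the stored value is the granted length, a later query reports the true capacity, and repeating a resize with that value never reaches the provider.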
-
-
-
- NEW SINCE Beta 2.1
-
- * Changes for Errata and 1.1 Spec
- - Removed DAT_NAME_NOT_FOUND, per DAT errata
- - EVD's with DTO and CONNECTION flags set no longer valid.
- - Removed DAT_IS_SUCCESS macro
- - Moved provider attribute structures from vendor files to udat.h
- and kdat.h
- - kdapl UPCALL_OBJECT now passed by reference
-
- * Completed dat_strerr return strings
-
- * Now support interrupted system calls
-
- * dapltest now uses dat_strerror for error reporting.
-
- * Large number of files were formatted to meet project standard,
- very cosmetic changes but improves readability and
- maintainability. Also cleaned up a number of comments during
- this effort.
-
- * dat_registry and RPM file changes (contributed by Steffen Persvold):
- - Renamed the RPM name of the registry to be dat-registry
- (renamed the .spec file too, some cvs add/remove needed)
- - Added the ability to create RPMs as normal user (using
- temporary paths), works on SuSE, Fedora, and RedHat.
- - 'make rpm' now works even if you didn't build first.
- - Changed to using the GNU __attribute__((constructor)) and
- __attribute__((destructor)) on the dat_init functions, dat_init
- and dat_fini. The old -init and -fini options to LD make
- applications crash on some platforms (Fedora for example).
- - Added support for 64 bit platforms.
- - Added code to allow multiple provider names in the registry,
- primarily to support ia32 and ia64 libraries simultaneously.
- Provider names are now kept in a list, the first successful
- library open will be the provider.
-
- * Added initial infrastructure for DAPL_DCNTR, a feature that
- will aid in debug and tuning of a dapl implementation. Partial
- implementation only at this point.
-
- * Bug fixes
- - Prevent debug messages from crashing dapl in EVD completions by
- verifying the error code to ensure data is valid.
- - Verify CNO before using it to clean up in evd_free()
- - CNO timeouts now return correct error codes, per the spec.
- - cr_accept now complies with the spec concerning connection
- requests that go away before the accept is invoked.
- - Verify valid EVD before posting connection events on active side
- of a connection. EP locking also corrected.
- - Clean up of dapltest Makefile, no longer need to declare
- DAT_THREADSAFE
- - Fixed check of EP states to see if we need to disconnect when an
- IA is closed.
- - ep_free() code reworked such that we can properly close a
- connection pending EP.
- - Changed disconnect processing to comply with the spec: user will
- see a BROKEN event, not DISCONNECTED.
- - If we get a DTO error, issue a disconnect to let the CM and
- the user know the EP state changed to disconnect; checked IBA
- spec to make sure we disconnect on correct error codes.
- - ep_disconnect now properly deals with abrupt disconnects on the
- active side of a connection.
- - PSP now created in the correct state for psp_create_any(), making
- it usable.
- - dapl_evd_resize() now returns correct status, instead of always
- DAT_NOT_IMPLEMENTED.
- - dapl_evd_modify_cno() does better error checking before invoking
- the provider layer, avoiding bugs.
- - Simple change to allow dapl_evd_modify_cno() to set the CNO to
- NULL, per the spec.
- - Added required locking around call to dapl_sp_remove_cr.
-
- - Fixed problems related to dapl_ep_free: the new
- disconnect(abrupt) allows us to do a more immediate teardown of
- connections, removing the need for the MAGIC_EP_EXIT magic
- number/state, which has been removed. Much cleanup of paths,
- and made more robust.
- - Made changes to meet the spec, uDAPL 1.1 6.3.2.3: CNO is
- triggered if there are waiters when the last EVD is removed
- or when the IA is freed.
- - Added code to deal with the provider synchronously telling us
- a connection is unreachable, and generate the appropriate
- event.
- - Changed timer routine type from unsigned long to uintptr_t
- to better fit with machine architectures.
- - ep.param data now initialized in ep_create, not ep_alloc.
- - Or Gerlitz provided updates to Mellanox files for evd_resize,
- fw attributes, many others. Also implemented changes for correct
- sizes on REP side of a connection request.
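The dat_registry change above that replaces the linker's -init and -fini options with GNU function attributes can be illustrated with a small sketch. The function names and the flag are hypothetical stand-ins for the registry's real init and fini routines, and the attributes are GCC/Clang extensions rather than standard C.

```c
/* Hypothetical flag standing in for the registry's real state. */
static int example_registry_ready = 0;

/* Runs automatically when the library (or program) is loaded. */
__attribute__((constructor))
static void example_init(void)
{
    example_registry_ready = 1;
}

/* Runs automatically at unload or at normal program exit. */
__attribute__((destructor))
static void example_fini(void)
{
    example_registry_ready = 0;
}

/* Lets callers observe whether the constructor has already run. */
int example_registry_is_ready(void)
{
    return example_registry_ready;
}
```

The attributes are attached to the functions themselves, so no special linker options are needed when building the shared library.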
-
-
-
- NEW SINCE Beta 2.0
-
- * dat_echo now DAT 1.1 compliant. Various small enhancements.
-
- * Revamped atomic_inc/dec to be void, the return value was never
- used. This allows kdapl to use Linux kernel equivalents, and
- is a small performance advantage.
-
- * kDAPL: dapl_evd_modify_upcall implemented and tested.
-
- * kDAPL: physical memory registration implemented and tested.
-
- * uDAPL now builds cleanly for non-debug versions.
-
- * Default RDMA credits increased to 8.
-
- * Default ACK_TIMEOUT now a reasonable value (2 sec vs old 2
- months).
-
- * Cleaned up dat_error.h, now 1.1 compliant in comments.
-
- * evd_resize initial implementation. Untested.
-
- * Bug fixes
- - __KDAPL__ is defined in kdat_config.h, so apps don't need
- to define it.
- - Changed include file ordering in kdat.h to put kdat_config.h
- first.
- - resolved connection/tear-down race on the client side.
- - kDAPL timeouts now scaled properly; fixed 3 orders of
- magnitude difference.
- - kDAPL EVD callbacks now get invoked for all completions; old
- code would drop them in heavy utilization.
- - Fixed error path in kDAPL evd creation, so we no longer
- leak CNOs.
- - create_psp_any returns correct error code if it can't create
- a connection qualifier.
- - lock fix in ibapi disconnect code.
- - kDAPL INFINITE waits now work properly (non connection
- waits)
- - kDAPL driver unload now works properly
- - dapl_lmr_[k]create now returns 1.1 error codes
- - ibapi routines now return DAT 1.1 error codes
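The atomic_inc/dec change in this section (making the operations return void so a kernel build can map them onto the kernel's own primitives) can be sketched in portable C11. The names below are hypothetical, and C11 atomics are swapped in here purely for illustration; the actual code used platform-specific primitives.

```c
#include <stdatomic.h>

/* Hypothetical counter type for the sketch. */
typedef atomic_int example_atomic_t;

/* Increment without returning a value; callers never used the result. */
static inline void example_atomic_inc(example_atomic_t *v)
{
    atomic_fetch_add(v, 1);
}

/* Decrement without returning a value. */
static inline void example_atomic_dec(example_atomic_t *v)
{
    atomic_fetch_sub(v, 1);
}

/* Read the current value when it is actually needed. */
static inline int example_atomic_read(example_atomic_t *v)
{
    return atomic_load(v);
}
```

Keeping the read separate from the update means the update wrappers have the same shape as a void kernel-style increment or decrement, so either backend can sit behind them.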
-
-
-
- NEW SINCE Beta 1.10
-
- * kDAPL is now part of the DAPL distribution. See the release
- notes above.
-
- The kDAPL 1.1 spec is now contained in the doc/ subdirectory.
-
- * Several files have been moved around as part of the kDAPL
- checkin. Some files that were previously in udapl/ are now
- in common/, some in common are now in udapl/. The goal was
- to make sure files are properly located and make sense for
- the build.
-
- * Source code formatting changes for consistency.
-
- * Bug fixes
- - dapl_evd_create() was comparing the wrong bit combinations,
- allowing bogus EVDs to be created.
- - Removed code that swallowed zero length I/O requests, which
- are allowed by the spec and are useful to applications.
- - Locking in dapli_get_sp_ep was asymmetric; fixed it so the
- routine will take and release the lock. Cosmetic change.
- - dapl_get_consumer_context() will now verify the pointer
- argument 'context' is not NULL.
-
-
- OBTAIN THE CODE
-
- To obtain the tree for your local machine you can check it
- out of the source repository using CVS tools. CVS is common
- on Unix systems and available as freeware on Windows machines.
- The command to anonymously obtain the source code from
- Source Forge (with no password) is:
-
- cvs -d:pserver:anonymous@cvs.dapl.sourceforge.net:/cvsroot/dapl login
- cvs -z3 -d:pserver:anonymous@cvs.dapl.sourceforge.net:/cvsroot/dapl co .
-
- When prompted for a password, simply press the Enter key.
-
- Source Forge also contains explicit directions on how to become
- a developer, as well as how to use different CVS commands. You may
- also browse the source code using the URL:
-
- http://svn.sourceforge.net/viewvc/dapl/trunk/
-
- SYSTEM REQUIREMENTS
-
- This project has been implemented on Red Hat Linux 7.3, SuSE
- SLES 8, 9, and 10, Windows 2000, RHEL 3.0, 4.0 and 5.0 and a few
- other Linux distributions. The structure of the code is designed
- to allow other operating systems to easily be adapted.
-
- The DAPL team has used Mellanox Tavor based InfiniBand HCAs for
- development, and continues with this platform. Our HCAs use the
- IB verbs API submitted by IBM. Mellanox has contributed an
- adapter layer using their VAPI verbs API. Either platform is
- available to any group considering DAPL work. The structure of
- the uDAPL source allows other provider API sets to be easily
- integrated.
-
- The development team uses any one of three topologies: two HCAs
- in a single machine; a single HCA in each of two machines; and
- most commonly, a switch. Machines connected to a switch may have
- more than one HCA.
-
- The DAPL Plugfest revealed that switches and HCAs available from
- most vendors will interoperate with little trouble, given the
- most recent releases of software. The dapl reference team makes
- no recommendation on HCA or switch vendors.
-
- Explicit machine configurations are available upon request.
-
- IN THE TREE
-
- The DAPL tree contains source code for the uDAPL and kDAPL
- implementations, and also includes tests and documentation.
-
- Included documentation has the base level API of the
- providers: OpenFabrics, IBM Access, and Mellanox Verbs API. Also
- included are a growing number of DAPL design documents which
- lead the reader through specific DAPL subsystems. More
- design documents are in progress and will appear in the tree in
- the near future.
-
- A small number of test applications and a unit test framework
- are also included. dapltest is the primary testing application
- used by the DAPL team; it is capable of simulating a variety of
- loads and exercises a large number of interfaces. Full
- documentation is included for each of the tests.
-
- Recently, the dapl conformance test has been added to the source
- repository. The test provides coverage of the most common
- interfaces, doing both positive and negative testing. Vendors
- providing a DAPL implementation are strongly encouraged to run
- this set of tests.
-
- MAKEFILE NOTES
-
- There are a number of #ifdefs in the code that were necessary
- during early development. They are disappearing as we
- have time to take advantage of features and work available from
- newer releases of provider software. These #ifdefs are not
- documented as the intent is to remove them as soon as possible.
-
- CONTRIBUTIONS
-
- As is common to Source Forge projects, there are a small number
- of developers directly associated with the source tree and having
- privileges to change the tree. Requested updates, changes, bug
- fixes, enhancements, or contributions should be sent to
- James Lentini at jlentinit@netapp.com for review. We welcome your
- contributions and expect the quality of the project will
- improve thanks to your help.
-
- The core DAPL team is:
-
- James Lentini
- Arlin Davis
- Steve Sears
-
- ... with contributions from a number of excellent engineers in
- various companies contributing to the open source effort.
-
-
- ONGOING WORK
-
- Not all of the DAPL spec is implemented at this time.
- Functionality such as shared memory will probably not be
- implemented by the reference implementation (there is a write up
- on this in the doc/ area), and there are yet various cases where
- work remains to be done. And of course, not all of the
- implemented functionality has been tested yet. The DAPL team
- continues to develop and test the tree with the intent of
- completing the specification and delivering a robust and useful
- implementation.
-
-
-The DAPL Team
-