From: Vladimir Sokolovsky Date: Tue, 18 Jan 2011 13:42:59 +0000 (+0200) Subject: Change directory structure X-Git-Url: https://openfabrics.org/gitweb/?a=commitdiff_plain;h=a99dbd15c12e9c89b770e8cf7876a15e7cd21c12;p=~ardavis%2Fofed_docs%2F.git Change directory structure Signed-off-by: Vladimir Sokolovsky --- diff --git a/HOWTO.build_ofed b/HOWTO.build_ofed deleted file mode 100644 index 195185e..0000000 --- a/HOWTO.build_ofed +++ /dev/null @@ -1,69 +0,0 @@ - Open Fabrics Enterprise Distribution (OFED) - How To Build OFED 1.5.1 - - March 2010 - - -============================================================================== -Table of contents -============================================================================== -1. Overview -2. Usage -3. Requirements - -============================================================================== -1. Overview -============================================================================== -The script "build.pl" is used to build the OFED package based on the -OpenFabrics project. The package is built under /tmp directory. - -See OFED_release_notes.txt for more details. - -============================================================================== -2. Usage -============================================================================== - -The build script for the OFED package can be downloaded from: - git://git.openfabrics.org/~vlad/build.git - branch: master - -Name: build.pl - -Usage: ./build.pl --version [-r|--release]|[--daily] [-d|--distribution ] [-v|--verbose] - [-b|--builddir ] - [-p|--packagesdir ] - [--pre-build ] - [--skip-prebuild] - [--post-build ] - [--skip-postbuild] - -Example: - - ./build.pl --version 1.5.1-rc1 -p packages-ofed - - This command will create a package (i.e., subtree) called OFED-1.5.1-rc1 - under /tmp/$USER/ - -============================================================================== -3. Requirements -============================================================================== - -1. 
Git: - Can be downloaded from: - http://www.kernel.org/pub/software/scm/git - -2. Autotools: - - libtool-1.5.20 or higher - autoconf-2.59 or higher - automake-1.9.6 or higher - m4-1.4.4 or higher - - The above tools can be downloaded from the following URLs: - - libtool - "http://ftp.gnu.org/gnu/libtool/libtool-1.5.20.tar.gz" - autoconf - "http://ftp.gnu.org/gnu/autoconf/autoconf-2.59.tar.gz" - automake - "http://ftp.gnu.org/gnu/automake/automake-1.9.6.tar.gz" - m4 - "http://ftp.gnu.org/gnu/m4/m4-1.4.4.tar.gz" - -3. wget or ssh client diff --git a/MLNX_EN_README.txt b/MLNX_EN_README.txt deleted file mode 100644 index 88464ab..0000000 --- a/MLNX_EN_README.txt +++ /dev/null @@ -1,113 +0,0 @@ -=============================================================================== - MLNX_EN driver for Mellanox Adapter Cards with 10GigE Support - README for OFED 1.5.2 - - December 2010 -=============================================================================== - -Contents: -========= -1. Overview -2. Ethernet Driver Usage and Configuration - - -1. Overview -=========== -The MLNX_EN driver is composed of the mlx4_core and mlx4_en kernel modules. - -The MLNX_EN driver release exposes the following capabilities: -- Single/Dual port -- Fibre Channel over Ethernet (FCoE) -- Up to 16 Rx queues per port -- 5 TX queues per port -- Rx steering mode: Receive Core Affinity (RCA) -- Tx arbitration mode: VLAN user-priority (off by default) -- MSI-X or INTx -- Adaptive interrupt moderation -- HW Tx/Rx checksum calculation -- Large Send Offload (i.e., TCP Segmentation Offload) -- Large Receive Offload -- IP reassembly offload for fragmented IP packets -- Multi-core NAPI support -- VLAN Tx/Rx acceleration (HW VLAN stripping/insertion) -- HW VLAN filtering -- HW multicast filtering -- ifconfig up/down + mtu changes (up to 10K) -- Ethtool support -- Net device statistics - - -2. 
Ethernet Driver Usage and Configuration -========================================== - -- To assign an IP address to the interface run: - #> ifconfig eth - - where 'x' is the OS assigned interface number. - -- To check driver and device information run: - #> ethtool -i eth - - Example: - #> ethtool -i eth2 - driver: mlx4_en (MT_0BD0110004) - version: 1.5.2 (March 2010) - firmware-version: 2.8.000 - bus-info: 0000:0e:00.0 - -- To query stateless offload status run: - #> ethtool -k eth - -- To set stateless offload status run: - #> ethtool -K eth [rx on|off] [tx on|off] [sg on|off] [tso on|off] - -- To query interrupt coalescing settings run: - #> ethtool -c eth - -- By default, the driver uses adaptive interrupt moderation for the receive path, - which adjusts the moderation time according to the traffic pattern. - Adaptive moderation settings can be set by: - #> ethtool -C eth adaptive-rx on|off - -- To set interrupt coalescing settings run: - #> ethtool -C eth [rx-usecs N] [rx-frames N] [tx-usecs N] [tx-frames N] - - Note: usec settings correspond to the time to wait after the *last* packet - sent/received before triggering an interrupt - -- To query pause frame settings run: - #> ethtool -a eth - -- To set pause frame settings run: - #> ethtool -A eth [rx on|off] [tx on|off] - -- To query ring size values run: - #> ethtool -g eth - -- To modify rings size run: - #> ethtool -G eth [rx ] [tx ] - -- To obtain additional device statistics, run: - #> ethtool -S eth - -- To perform a self diagnostics test, run: - #> ethtool -t eth - - -The driver defaults to the following parameters: -- Both ports are activated (i.e., a net device is created for each port) -- The number of Rx rings for each port is the number of on-line CPUs -- Per-core NAPI is enabled -- LRO is enabled with 32 concurrent sessions per Rx ring - -Some of these values can be changed using module parameters, which are -detailed by running: -#> modinfo mlx4_en - -To set non-default values to module 
parameters, the following line should be -added to the /etc/modprobe.conf file: - "options mlx4_en = = ..." - -Values of all parameters can be observed in /sys/module/mlx4_en/parameters/. - - diff --git a/MPI_README.txt b/MPI_README.txt deleted file mode 100644 index 4713166..0000000 --- a/MPI_README.txt +++ /dev/null @@ -1,612 +0,0 @@ - - MPI in OFED 1.5.2 README - - September 2010 - -=============================================================================== -Table of Contents -=============================================================================== -1. Overview -2. MVAPICH -3. Open MPI -4. MVAPICH2 - - -=============================================================================== -1. Overview -=============================================================================== -Three MPI stacks are included in -this release of the Open Fabrics Enterprise Distribution (OFED): -- MVAPICH 1.2.0 -- Open MPI 1.4.2 -- MVAPICH2 1.5.1 - -Setup, compilation and run information of MVAPICH, Open MPI and MVAPICH2 is -provided below in sections 2, 3 and 4 respectively. - -1.1 Installation Note ---------------------- -In Step 2 of the main menu of install.pl, options 2, 3 and 4 can install -one or more MPI stacks. Please refer to docs/OFED_Installation_Guide.txt -to learn about the different options. - -The installation script allows each MPI to be compiled using one or -more compilers. Users need to set, per MPI stack installed, the PATH -and/or LD_LIBRARY_PATH so as to install the desired compiled MPI stacks. - -1.2 MPI Tests -------------- -OFED includes four basic tests that can be run against each MPI stack: -bandwidth (bw), latency (lt), Intel MPI Benchmark, and Presta. The tests -are located under: /mpi///tests/, -where is /usr by default. - -1.3 Selecting Which MPI to Use: mpi-selector --------------------------------------------- -Depending on how the OFED installer was run, multiple different MPI -implementations may be installed on your system. 
The OFED installer -will run an MPI selector tool during the installation process, -presenting a menu-based interface to select which MPI implementation -is set as the default for all users. This MPI selector tool can be -re-run at any time by the administrator after the OFED installer -completes to modify the site-wide default MPI implementation selection -by invoking the "mpi-selector-menu" command (root access is typically -required to change the site-wide default). - -The mpi-selector-menu command can also be used by non-administrative -users to override the site-wide default MPI implementation selection -by setting a per-user default. Specifically: unless a user runs the -MPI selector tool to set a per-user default, their environment will be -set up for the site-wide default MPI implementation. - -Note that the default MPI selection does *not* affect the shell from -which the command was invoked (or any other shells that were already -running when the MPI selector tool was invoked). The default -selection is only changed for *new* shells started after the selector -tool was invoked. It is recommended that once the default MPI -implementation is changed via the selector tool, users log out -and log in again to ensure that they have a consistent view of the -default MPI implementation. Other tools can be used to change the MPI -environment in the current shell, such as the environment modules -software package (which is not included in the OFED software package; -see http://modules.sourceforge.net/ for details). - -Note that the site-wide default is set in a file that is typically not -on a networked file system, and is therefore specific to the host on -which it was run. As such, it is recommended to run the -mpi-selector-menu command on all hosts in a cluster, picking the same -default MPI implementation on each. 
It may be more convenient, -however, to use the mpi-selector command in script-based scenarios -(such as running on every host in a cluster); mpi-selector provides all -the same functionality as mpi-selector-menu, but is intended for -automated environments. See the mpi-selector(1) manual page for more -details. - -Additionally, per-user defaults are set in a file in the user's $HOME -directory. If this directory is not on a network-shared file system -between all hosts that will be used for MPI applications, then it also -needs to be propagated to all relevant hosts. - -Note: The MPI selector tool typically sets the PATH and/or -LD_LIBRARY_PATH for a given MPI implementation. This step can, of -course, also be performed manually by a user or on a site-wide basis. -The MPI selector tool simply bundles up this functionality in a -convenient set of command line tools and menus. - -1.4 Updating MPI Installations ------------------------------- -Note that all of the MPI implementations included in the OFED software -package are the versions that were available when OFED v1.5 was -released. They have been QA tested with this version of OFED and are -fully supported. - -However, note that administrators can go to the web sites of each MPI -implementation and download / install newer versions after OFED has -been successfully installed. There is nothing specific about the -OFED-included MPI software packages that prohibits installing -newer/other MPI implementations. - -It should also be noted that versions of MPI released after OFED v1.5 -are not supported by OFED. But since each MPI has its own release -schedule and QA process (each of which involves testing with the OFED -stack), it may sometimes be desirable -- or even advisable, depending -on how old the MPI implementations are that are included in OFED -- to -download and install a newer version of MPI. 
- -The web sites of each MPI implementation are listed below: - -- Open MPI: http://www.open-mpi.org/ -- MVAPICH: http://mvapich.cse.ohio-state.edu/ -- MVAPICH2: http://mvapich.cse.ohio-state.edu/overview/mvapich2/ - -=============================================================================== -2. MVAPICH MPI -=============================================================================== - -This package is a 1.2.0 version of the MVAPICH software package, -and is the officially supported MPI stack for this release of OFED. -See http://mvapich.cse.ohio-state.edu for more details. - - -2.1 Setting up for MVAPICH --------------------------- -To launch MPI jobs, its installation directory needs to be included -in PATH and LD_LIBRARY_PATH. To set them, execute one of the following -commands: - source /mpi///bin/mpivars.sh - -- when using sh for launching MPI jobs - or - source /mpi///bin/mpivars.csh - -- when using csh for launching MPI jobs - - -2.2 Compiling MVAPICH Applications: ------------------------------------ -***Important note***: -A valid Fortran compiler must be present in order to build the MVAPICH MPI -stack and tests. - -The default gcc-g77 Fortran compiler is provided with all RedHat Linux -releases. SuSE distributions earlier than SuSE Linux 9.0 do not provide -this compiler as part of the default installation. - -The following compilers are supported by OFED's MVAPICH package: Gcc, -Intel, Pathscale and PGI. The install script prompts the user to choose -the compiler with which to build the MVAPICH RPM. Note that more -than one compiler can be selected simultaneously, if desired. - -For details see: - http://mvapich.cse.ohio-state.edu/support - -To review the default configuration of the installation, check the default -configuration file: /mpi///etc/mvapich.conf - -2.3 Running MVAPICH Applications: ---------------------------------- -Requirements: -o At least two nodes. Example: mtlm01, mtlm02 -o Machine file: Includes the list of machines. 
Example: /root/cluster -o Bidirectional rsh or ssh without a password - -Note: ssh will be used unless -rsh is specified. In order to use -rsh, add to the mpirun_rsh command the parameter: -rsh - -*** Running OSU tests *** - -/usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.2.0/tests/osu_benchmarks-3.1.1/osu_bw -/usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.2.0/tests/osu_benchmarks-3.1.1/osu_latency -/usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.2.0/tests/osu_benchmarks-3.1.1/osu_bibw -/usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.2.0/tests/osu_benchmarks-3.1.1/osu_bcast - -*** Running Intel MPI Benchmark test (Full test) *** - -/usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.2.0/tests/IMB-3.2/IMB-MPI1 - -*** Running Presta test *** - -/usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.2.0/tests/presta-1.4.0/com -o 100 -/usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.2.0/tests/presta-1.4.0/glob -o 100 -/usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.2.0/tests/presta-1.4.0/globalop - - -=============================================================================== -3. Open MPI -=============================================================================== - -Open MPI is a next-generation MPI implementation from the Open MPI -Project (http://www.open-mpi.org/). Version 1.4 of Open MPI is -included in this release, which is also available directly from the -main Open MPI web site. - -A working Fortran compiler is not required to build Open MPI, but some -of the included MPI tests are written in Fortran. These tests will -not compile/run if Open MPI is built without Fortran support. 
- -The following compilers are supported by OFED's Open MPI package: GNU, -Pathscale, Intel, or Portland. The install script prompts the user -for the compiler with which to build the Open MPI RPM. Note that more -than one compiler can be selected simultaneously, if desired. - -Users should check the main Open MPI web site for additional -documentation and support. (Note: The FAQ file covers OpenFabrics -tuning among other issues.) - -3.1 Setting up for Open MPI ---------------------------- -Selecting to use Open MPI via the mpi-selector-menu and mpi-selector -tools will perform all the necessary setup for users to build and run -Open MPI applications. If you use the MPI selector tools, you can -skip the rest of this section. - -If you do not wish to use the MPI selector tools, the Open MPI team -strongly advises users to put the Open MPI installation directory in -their PATH and LD_LIBRARY_PATH. This can be done at the system level -if all users are going to use Open MPI. Specifically: - -- add /bin to PATH -- add /lib to LD_LIBRARY_PATH - - is the directory where the desired Open MPI instance was -installed ("instance" refers to the compiler used for Open MPI -compilation at install time). - -If you are using a job scheduler to launch MPI jobs (e.g., SLURM, -Torque), setting the PATH and LD_LIBRARY_PATH is still required, but -it does not need to be set in your shell startup files. Procedures -for adding these values to PATH and LD_LIBRARY_PATH are -described in detail at: - - http://www.open-mpi.org/faq/?category=running - -3.2 Open MPI Installation Support / Updates -------------------------------------------- -The OFED package will install Open MPI with support for TCP, shared -memory, and the OpenFabrics network stacks. No other networks are -supported by the OFED Open MPI installation. - -Open MPI supports a wide variety of run-time environments. 
The OFED -installer will not include support for all of them, however (e.g., -Torque and PBS-based environments are not supported by the -OFED-installed Open MPI). - -The ompi_info command can be used to see what support was installed; -look for plugins for your specific environment / network / etc. If -you do not see them, the OFED installer did not include support for -them. - -As described above, administrators or users can go to the Open MPI web -site and download / install either a newer version of Open MPI (if -available), or the same version with different configuration options -(e.g., support for Torque / PBS-based environments). - -3.3 Compiling Open MPI Applications ------------------------------------ -(copied from http://www.open-mpi.org/faq/?category=mpi-apps -- see -this web page for more details) - -The Open MPI team strongly recommends that you simply use Open MPI's -"wrapper" compilers to compile your MPI applications. That is, instead -of using (for example) gcc to compile your program, use mpicc. Open -MPI provides a wrapper compiler for four languages: - - Language Wrapper compiler name - ------------- -------------------------------- - C mpicc - C++ mpiCC, mpicxx, or mpic++ - (note that mpiCC will not exist - on case-insensitive file-systems) - Fortran 77 mpif77 - Fortran 90 mpif90 - ------------- -------------------------------- - -Note that if no Fortran 77 or Fortran 90 compilers were found when -Open MPI was built, Fortran 77 and 90 support will automatically be -disabled (respectively). - -If you expect to compile your program as: - - > gcc my_mpi_application.c -lmpi -o my_mpi_application - -Simply use the following instead: - - > mpicc my_mpi_application.c -o my_mpi_application - -Specifically: simply adding "-lmpi" to your normal compile/link -command line *will not work*. See -http://www.open-mpi.org/faq/?category=mpi-apps if you cannot use the -Open MPI wrapper compilers. 
- -Note that Open MPI's wrapper compilers do not do any actual compiling -or linking; all they do is manipulate the command line and add in all -the relevant compiler / linker flags and then invoke the underlying -compiler / linker (hence, the name "wrapper" compiler). More -specifically, if you run into a compiler or linker error, check your -source code and/or back-end compiler -- it is usually not the fault of -the Open MPI wrapper compiler. - -3.4 Running Open MPI Applications: ----------------------------------- -Open MPI uses either the "mpirun" or "mpiexec" commands to launch -applications. If your cluster uses a resource manager (such as -SLURM), providing a hostfile is not necessary: - - > mpirun -np 4 my_mpi_application - -If you use rsh/ssh to launch applications, they must be set up to NOT -prompt for a password (see http://www.open-mpi.org/faq/?category=rsh -for more details on this topic). Moreover, you need to provide a -hostfile containing a list of hosts to run on. - -Example: - - > cat hostfile - host1.example.com - host2.example.com - host3.example.com - host4.example.com - - > mpirun -np 4 -hostfile hostfile my_mpi_application - (application runs on all 4 hosts) - -In the following examples, replace with the number of hosts to run on, -and with the filename of a valid hostfile listing the hosts -to run on (unless you are running under a supported resource manager, -in which case a hostfile is unnecessary). - -Also note that Open MPI is highly run-time tunable. There are many -options that can be tuned to obtain optimal performance of your MPI -applications (see the Open MPI web site / FAQ for more information: -http://www.open-mpi.org/faq/). 
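The hostfile-based launch described above can be sketched as follows. This is a hedged illustration, not an OFED tool: the helper names write_hostfile and mpirun_cmd are hypothetical, the host names and application name are placeholders, and the script only prints the mpirun command line (a dry run) rather than launching anything. Only the -np and -hostfile options are taken from the text above.

```shell
#!/bin/sh
# Sketch only: helper names, host names and the application name
# are placeholders chosen for illustration.

# Write one host name per line, as in the hostfile example above.
write_hostfile() {
    file=$1
    shift
    : > "$file"
    for host in "$@"; do
        echo "$host" >> "$file"
    done
}

# Print the mpirun command line instead of running it (dry run).
mpirun_cmd() {
    echo "mpirun -np $1 -hostfile $2 $3"
}

write_hostfile hostfile host1.example.com host2.example.com

# Start one process per listed host.
np=$(wc -l < hostfile | tr -d ' ')
mpirun_cmd "$np" hostfile ./my_mpi_application
```

Printing the command first makes it easy to check the process count against the hostfile before starting a real job; under a supported resource manager the -hostfile argument would simply be dropped.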
- - - is an integer indicating how many MPI processes to run (e.g., 2) - - is the filename of a hostfile, as described above - -Example 1: Running the OSU bandwidth test: - - > cd /usr/mpi/gcc/openmpi-1.4.1/tests/osu_benchmarks-3.1.1 - > mpirun -np -hostfile osu_bw - -Example 2: Running the Intel MPI Benchmark: - - > cd /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2 - > mpirun -np -hostfile IMB-MPI1 - - --> Note that the version of IMB-EXT that ships in this version of - OFED contains a bug that will cause it to immediately error - out when run with Open MPI. - -Example 3: Running the Presta benchmarks: - - > cd /usr/mpi/gcc/openmpi-1.4.1/tests/presta-1.4.0 - > mpirun -np -hostfile com -o 100 - -NOTE: In order to run Open MPI over a RoCE (RDMAoE) network, the following MCA parameter - is required: - --mca btl_openib_cpc_include rdmacm - - -3.5 More Open MPI Information ------------------------------ -Much, much more information is available about using and tuning Open -MPI (to include OpenFabrics-specific tunable parameters) on the Open -MPI web site FAQ: - - http://www.open-mpi.org/faq/ - -Users who cannot find the answers that they are looking for, or are -experiencing specific problems should consult the "how to get help" web -page for more information: - - http://www.open-mpi.org/community/help/ - - -=============================================================================== -4. MVAPICH2 MPI -=============================================================================== - -MVAPICH2 is an MPI-2 implementation which includes all MPI-1 features. -It is based on MPICH2 and MVICH. MVAPICH2 provides many features including -fault-tolerance with checkpoint-restart, RDMA_CM support, iWARP support, -optimized collectives, on-demand connection management, multi-core optimized -and scalable shared memory support, and memory hook with ptmalloc2 library -support. 
The ADI-3-level design of MVAPICH2 supports many features including: -MPI-2 functionalities (one-sided, collectives and data-type), multi-threading -and all MPI-1 functionalities. It also supports a wide range of platforms -(architecture, OS, compilers, InfiniBand adapters and iWARP adapters). More -information can be found on the MVAPICH2 project site: - -http://mvapich.cse.ohio-state.edu/overview/mvapich2/ - -A valid Fortran compiler must be present in order to build the MVAPICH2 -MPI stack and tests. The following compilers are supported by OFED's -MVAPICH2 MPI package: gcc, intel, pgi, and pathscale. The install script -prompts the user to choose the compiler with which to build the MVAPICH2 -MPI RPM. Note that more than one compiler can be selected simultaneously, -if desired. - -The install script prompts for various MVAPICH2 build options as detailed -below: - - -- Implementation (OFA or uDAPL) [default "OFA"] - - OFA (IB and iWARP) Options: - - ROMIO Support [default Y] - - Shared Library Support [default Y] - - Checkpoint-Restart Support [default N] - * requires an installation of BLCR and prompts for the - BLCR installation directory location - - uDAPL Options: - - ROMIO Support [default Y] - - Shared Library Support [default Y] - - Cluster Size [default "Small"] - - I/O Bus [default "PCI-Express"] - - Link Speed [default "SDR"] - - Default DAPL Provider [default ""] - * the default provider is determined based on detected OS - -For non-interactive builds where no MVAPICH2 build options are stored in -the OFED configuration file, the default settings are: - -Implementation: OFA -ROMIO Support: Y -Shared Library Support: Y -Checkpoint-Restart Support: N - - -4.1 Setting up for MVAPICH2 ---------------------------- -Selecting to use MVAPICH2 via the MPI selector tools will perform -most of the setup necessary to build and run MPI applications with -MVAPICH2. 
If one does not wish to use the MPI Selector tools, using -the following settings should be enough: - - - add /bin to PATH - -The above is the directory where the desired MVAPICH2 -instance was installed ("instance" refers to the path based on -the RPM package name, including the compiler chosen during the -install). It is also possible to source the following files -in order to set up the proper environment: - -source /bin/mpivars.sh [for Bourne based shells] -source /bin/mpivars.csh [for C based shells] - -In addition to the user environment settings handled by the MPI selector -tools, some other system settings might need to be modified. MVAPICH2 -requires the memlock resource limit to be modified from the default -in /etc/security/limits.conf: - -* soft memlock unlimited - -MVAPICH2 requires bidirectional rsh or ssh without a password to work. -The default is ssh, and in this case it will be required to add the -following line to the /etc/init.d/sshd script before sshd is started: - -ulimit -l unlimited - -It is also possible to specify a specific size in kilobytes instead -of unlimited if desired. - -The MVAPICH2 OFA build requires an /etc/mv2.conf file specifying the -IP address of an Infiniband HCA (IPoIB) for RDMA-CM functionality -or the IP address of an iWARP adapter for iWARP functionality if -either of those is desired. This is not required by default, unless -either of the following runtime environment variables is set when -using the OFA MVAPICH2 build: - -RDMA-CM -------- -MV2_USE_RDMA_CM=1 - -iWARP ------ -MV2_USE_IWARP_MODE=1 - -Otherwise, the OFA build will work without an /etc/mv2.conf file using -only the Infiniband HCA directly. - -The MVAPICH2 uDAPL build requires an /etc/dat.conf file specifying the -DAPL provider information. 
The default DAPL provider is chosen at -build time, with a default value of "ib0"; however, it can also be -specified at runtime by setting the following environment variable: - -MV2_DEFAULT_DAPL_PROVIDER= - -More information about MVAPICH2 can be found in the MVAPICH2 User Guide: - -http://mvapich.cse.ohio-state.edu/support/ - - -4.2 Compiling MVAPICH2 Applications ------------------------------------ -The MVAPICH2 compiler commands for each language are: - -Language Compiler Command --------- ---------------- -C mpicc -C++ mpicxx -Fortran 77 mpif77 -Fortran 90 mpif90 - -The system compiler commands should not be used directly. The Fortran 90 -compiler command only exists if a Fortran 90 compiler was used during the -build process. - - -4.3 Running MVAPICH2 Applications ---------------------------------- -4.3.1 Running MVAPICH2 Applications with mpirun_rsh ---------------------------------------------------- -From release 1.2, MVAPICH2 comes with a faster and more scalable startup based -on mpirun_rsh. To launch an MPI job with mpirun_rsh, password-less ssh needs to -be enabled across all nodes. - -Note: ssh will be used by default. In order to use rsh, use the -rsh option on -the mpirun_rsh command line. For more options, see mpirun_rsh -help or the -MVAPICH2 user guide. 
- -*** Running 4 processes on 4 nodes *** - -$ cat > hostfile -node1 -node2 -node3 -node4 -$ mpirun_rsh -np 4 -hostfile hostfile /path/to/my_mpi_app - -*** Running OSU tests *** - -/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.1.1/osu_bw -/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.1.1/osu_latency -/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.1.1/osu_bibw -/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.1.1/osu_bcast - -*** Running Intel MPI Benchmark test (Full test) *** - -/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/IMB-3.2/IMB-MPI1 - -*** Running Presta test *** - -/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/presta-1.4.0/com -o 100 -/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/presta-1.4.0/glob -o 100 -/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/presta-1.4.0/globalop - -4.3.2 Running MVAPICH2 Applications with mpd and mpiexec --------------------------------------------------------- -Launching processes in MVAPICH2 is a two step process. First, mpdboot must -be used to launch MPD daemons on the desired hosts. Second, the mpiexec -command is used to launch the processes. MVAPICH2 requires bidirectional -ssh or rsh without a password. This is specified when the MPD daemons are -launched with the mpdboot command through the --rsh command line option. -The default is ssh. Once the processes are finished, stopping the MPD -daemons with the mpdallexit command should be done. 
The following example -shows the basic procedure: - -4 Processes on 4 Hosts Example: - -$ cat >hostsfile -node1.example.com -node2.example.com -node3.example.com -node4.example.com - -$ mpdboot -n 4 -f ./hostsfile - -$ mpiexec -n 4 ./my_mpi_application - -$ mpdallexit - -It is also possible to use the mpirun command in place of mpiexec. They are -actually the same command in MVAPICH2, however using mpiexec is preferred. - -It is possible to run more processes than hosts. In this case, multiple -processes will run on some or all of the hosts used. The following examples -demonstrate how to run the MPI tests. The default installation prefix and -gcc version of MVAPICH2 are shown. In each case, it is assumed that a hosts -file has been created in the specific directory with two hosts. - -OSU Tests Example: - -$ cd /usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.1.1 -$ mpdboot -n 2 -f ./hosts -$ mpiexec -n 2 ./osu_bcast -$ mpiexec -n 2 ./osu_bibw -$ mpiexec -n 2 ./osu_bw -$ mpiexec -n 2 ./osu_latency -$ mpdallexit - -Intel MPI Benchmark Example: - -$ cd /usr/mpi/gcc/mvapich2-1.2p1/tests/IMB-3.2 -$ mpdboot -n 2 -f ./hosts -$ mpiexec -n 2 ./IMB-MPI1 -$ mpdallexit - -Presta Benchmarks Example: - -$ cd /usr/mpi/gcc/mvapich2-1.2p1/tests/presta-1.4.0 -$ mpdboot -n 2 -f ./hosts -$ mpiexec -n 2 ./com -o 100 -$ mpiexec -n 2 ./glob -o 100 -$ mpiexec -n 2 ./globalop -$ mpdallexit diff --git a/MSTFLINT_README.txt b/MSTFLINT_README.txt deleted file mode 100644 index 86d3e02..0000000 --- a/MSTFLINT_README.txt +++ /dev/null @@ -1,171 +0,0 @@ -Mellanox Technologies - www.mellanox.com -**************************************** - -MSTFLINT Package - Firmware Burning and Diagnostics Tools - -1) Overview - This package contains a burning tool and diagnostic tools for Mellanox - manufactured HCA/NIC cards. It also provides access to the relevant source - code. Please see the file LICENSE for licensing details. 
- This package is based on a subset of the Mellanox Firmware Tools (MFT) package. - For a full documentation of the MFT package, please refer to the downloads page - in Mellanox web site. - - ---------------------------------------------------------------------------- - NOTE: - This burning tool should be used only with Mellanox-manufactured - HCA/NIC cards. Using it with cards manufactured by other vendors - may be harmful to the cards (due to different configurations). - Using the diagnostic tools is normally safe for all HCAs/NICs. - ---------------------------------------------------------------------------- - -2) Package Contents - a) mstflint source code - b) mflash lib - This lib provides low level Flash access through Mellanox HCAs. - c) mtcr lib (implemented in mtcr.h file) - This lib enables access to HCA hardware registers. - d) mstregdump utility - This utility dumps hardware registers from Mellanox hardware - for later analysis by Mellanox. - e) mstvpd - This utility dumps the on-card VPD. - f) mstmcra - This debug utility reads a single word from the device configuration space. - -3) Installation - a) Build the mstflint utility. This package is built using a standard - autotools method. - - Example: - > ./configure - > make - > make install - - - Run "configure --help" for custom configuration options. - - Typically, root privileges are required to run "make install" - -4) Hardware Access Device Names - The tools in this package require a device name in the command - line. The device name is the identifier of the target CA. - This section describes the device name formats and the HW access flow. - - a) The devices can be accessed by their PCI ID as displayed by lspci - (bus:dev.fn). 
- Example: - # List all Mellanox devices - > /sbin/lspci -d 15b3: - 02:00.0 Ethernet controller: Mellanox Technologies MT25448 [ConnectX EN 10GigE, PCIe 2.0 2.5GT/s] (rev a0) - - # Use mstflint tool to query the firmware on this device - > mstflint -d 02:00.0 q - - b) When the IB driver (mlx4 or mthca) is loaded, the devices can be accessed - by their IB device name. - Example: - # List the IB devices - > ibv_devinfo | grep hca_id - hca_id: mlx4_0 - - # Use mstvpd tool to dump the VPD of this device - > mstvpd mlx4_0 - - c) PCI configuration access - In examples a and b above, the device is accessed via PCI Memory Mapping. - The device can also be accessed by PCI configuration cycles. - PCI configuration access is slower and less safe than memory access -- - use it only if methods a and b above do not work. - - To force configuration access, use device names in the following format: - /proc/bus/pci// - - Example: - # List all Mellanox devices - > /sbin/lspci -d 15b3: - 02:00.0 Ethernet controller: Mellanox Technologies MT25448 [ConnectX EN 10GigE, PCIe 2.0 2.5GT/s] (rev a0) - - # Use mstregdump to dump HW registers, using PCI config cycles - > mstregdump /proc/bus/pci/02/00.0 > crdump.log - - Note: Typically, you will need root privileges for hardware access - - d) Accessing a multi-function device: - - In some configuration, the CA device identifies as a multi-function device on PCI. 
E.G.: - > /sbin/lspci -d 15b3: - 07:00.0 Ethernet controller: Mellanox Technologies MT26468 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0) - 07:00.1 Ethernet controller: Mellanox Technologies MT26468 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0) - 07:00.2 Ethernet controller: Mellanox Technologies MT26468 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0) - 07:00.3 Ethernet controller: Mellanox Technologies MT26468 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0) - 07:00.4 Ethernet controller: Mellanox Technologies MT26468 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0) - 07:00.5 Ethernet controller: Mellanox Technologies MT26468 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0) - 07:00.6 Ethernet controller: Mellanox Technologies MT26468 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0) - 07:00.7 Ethernet controller: Mellanox Technologies MT26468 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0) - - These multiple "logical" devices are actually a single physical device, so firmware update or "physical" - diagnostics should be run only on one of the functions. - - When the device driver is loaded, only the primary physical function of the device can be accessed. - In Linux that would typically be function 0. This function can be accessed using memory mapping, as - described in subsection a) above. E.G.: - > mstflint -d 07:00.0 q - - When the device driver is not loaded, all the functions can be accessed using configuration cycles, as - described in subsection c) above. It is recommended to use function 0 for FW update or diagnostics, E.G.: - > mstflint -d /proc/bus/pci/07/00.0 q - -5) Usage (mstflint): - Read mstflint usage. Enter "./mstflint -h" for a short help message, or - "./mstflint -hh" for a detailed help message. - - Obtaining firmware files: - If you purchased your card from Mellanox Technologies, please use the - Mellanox website (www.mellanox.com, under 'Firmware' downloads) to - download the firmware for your card.
- If you purchased your card from a vendor other than Mellanox, get a - specific firmware configuration (INI) file from your HCA card vendor and - generate the binary image. - - Use mstflint to burn a device according to the burning instructions in - "mstflint -hh" and in Mellanox web site firmware page. - -6) Usage (mstregdump): - An internal register dump is displayed to the standard output. - Please store it in a file for analysis by Mellanox. - - Example: - > mstregdump mthca0 > dumpfile - -7) Usage (mstvpd): - A VPD dump is displayed to the standard output. - A list of keywords to dump can be supplied after the -- flag - to apply an output filter. - - Examples: - > mstvpd mlx4_0 - ID: Hawk Dual Port - PN: MNPH29C-XTR - EC: X2 - SN: MT1001X00749 - V0: PCIe Gen2 x8 - V1: N/A - YA: N/A - RW: - - > mstvpd mlx4_0 -- PN ID - PN: MNPH29C-XTR - ID: Hawk Dual Port - -8) Problem Reporting: - Please collect the following information when reporting issues: - - uname -a - cat /etc/issue - cat /proc/bus/pci/devices - mstflint -vv - lspci - mstflint -d 02:00.0 v - mstflint -d 02:00.0 q - mstvpd 02:00.0 - - diff --git a/OFED_Installation_Guide.txt b/OFED_Installation_Guide.txt deleted file mode 100644 index dfc3000..0000000 --- a/OFED_Installation_Guide.txt +++ /dev/null @@ -1,597 +0,0 @@ - Open Fabrics Enterprise Distribution (OFED) - Version 1.5.2 - Installation Guide - - May 2010 - -============================================================================== -Table of contents -============================================================================== - - 1. Overview - 2. Contents of the OFED Distribution - 3. Hardware and Software Requirements - 4. How to Download and Extract the OFED Distribution - 5. Installing OFED Software - 6. Building OFED RPMs - 7. IPoIB Configuration - 8. Using SDP - 9. Uninstalling OFED - 10. Upgrading OFED - 11. Configuration - 12. Related Documentation - - -============================================================================== -1. 
Overview -============================================================================== - -This is the OpenFabrics Enterprise Distribution (OFED) version 1.5.2 -software package supporting InfiniBand and iWARP fabrics. It is composed -of several software modules intended for use on a computer cluster -constructed as an InfiniBand subnet or an iWARP network. - -This document describes how to install the various modules and test them in -a Linux environment. - -General Notes: - 1) The install script removes all previously installed OFED packages - and re-installs from scratch. (Note: Configuration files will not - be removed). You will be prompted to acknowledge the deletion of - the old packages. - - 2) When installing OFED on an entire [homogeneous] cluster, a common - strategy is to install the software on one of the cluster nodes - (perhaps on a shared file system such as NFS). The resulting RPMs, - created under OFED-X.X.X/RPMS directory, can then be installed on all - nodes in the cluster using any cluster-aware tools (such as pdsh). - -============================================================================== -2. 
OFED Package Contents -============================================================================== - -The OFED Distribution package generates RPMs for installing the following: - - o OpenFabrics core and ULPs: - - HCA drivers (mthca, mlx4, qib, ehca) - - iWARP driver (cxgb3, nes) - - core - - Upper Layer Protocols: IPoIB, SDP, SRP Initiator and target, iSER - Initiator and target, RDS, qlgc_vnic, uDAPL and NFS-RDMA - o OpenFabrics utilities - - OpenSM: InfiniBand Subnet Manager - - Diagnostic tools - - Performance tests - o MPI - - OSU MVAPICH stack supporting the InfiniBand and iWARP interface - - Open MPI stack supporting the InfiniBand and iWARP interface - - OSU MVAPICH2 stack supporting the InfiniBand and iWARP interface - - MPI benchmark tests (OSU BW/LAT, Intel MPI Benchmark, Presta) - o Extra packages - - open-iscsi: open-iscsi initiator with iSER support - - ib-bonding: Bonding driver for IPoIB interface - o Sources of all software modules (under conditions mentioned in the - modules' LICENSE files) - o Documentation - -============================================================================== -3. Hardware and Software Requirements -============================================================================== - -1) Server platform with InfiniBand HCA or iWARP RNIC (see OFED Distribution - Release Notes for details) - -2) Linux operating system (see OFED Distribution Release Notes for details) - -3) Administrator privileges on your machine(s) - -4) Disk Space: - For Build & Installation: 300MB - - For Installation only: 200MB - -5) For the OFED Distribution to compile on your machine, some software - packages of your operating system (OS) distribution are required. These - are listed here. 
- -OS Distribution Required Packages ---------------- ---------------------------------- -General: -o Common to all gcc, glib, glib-devel, glibc, glibc-devel, - glibc-devel-32bit (to build 32-bit libraries on x86_64 - and ppc64), zlib-devel, libstdc++-devel -o RedHat, Fedora kernel-devel, rpm-build, redhat-rpm-config -o SLES kernel-source, rpm - -Note: To build 32-bit libraries on x86_64 and ppc64 platforms, the 32-bit - glibc-devel should be installed. - -Specific Component Requirements: -o Mvapich a Fortran Compiler (such as gcc-g77) -o Mvapich2 libsysfs-devel -o Open MPI libsysfs-devel -o ibutils tcl-8.4, tcl-devel-8.4, tk, libstdc++-devel -o mstflint libstdc++-devel (32-bit on ppc64), gcc-c++ -o rnfs-utils krb5-devel, krb5-libs, libevent-devel, - nfs-utils-lib-devel, openldap-devel, - e2fsprogs-devel (on RedHat) - krb5-devel, libevent-devel, nfsidmap-devel, - libopenssl-devel, libblkid-devel (on SLES11) - krb5-devel, libevent, nfsidmap, krb5, openldap2-devel, - cyrus-sasl-devel, e2fsprogs-devel (on SLES10) - -Note: The installer will warn you if you attempt to compile any of the - above packages and do not have the prerequisites installed. - On SLES, some of the required RPMs can be found on the SLES SDK DVD. - E.g.: libgfortran43, kernel-source, ... - -*** Important Note for open-iscsi users: - Installing iSER as part of OFED installation will also install open-iscsi. - Before installing OFED, please uninstall any open-iscsi version that may - be installed on your machine. Installing OFED with iSER support while - another open-iscsi version is already installed will cause the installation - process to fail. - -============================================================================== -4. How to Download and Extract the OFED Distribution -============================================================================== - -1) Download the OFED-X.X.X.tgz file to your target Linux host.
- - If this package is to be installed on a cluster, it is recommended to - download it to an NFS shared directory. - -2) Extract the package using: - - tar xzvf OFED-X.X.X.tgz - -============================================================================== -5. Installing OFED Software -============================================================================== - -1) Go to the directory into which the package was extracted: - - cd /..../OFED-X.X.X - -2) Installing the OFED package must be done as root. For a - menu-driven first build and installation, run the installer - script: - - ./install.pl - - Interactive menus will direct you through the install process. - - Note: After the installer completes, information about the OFED - installation such as the prefix, the kernel version, and - installation parameters can be found by running - /etc/infiniband/info. - - Information on the driver version and source git trees can be found - using the ofed_info utility - - - During the interactive installation of OFED, two files are - generated: ofed.conf and ofed_net.conf. - ofed.conf holds the installed software modules and configuration settings - chosen by the user. ofed_net.conf holds the IPoIB settings chosen by the - user. - - If the package is installed on a cluster-shared directory, these - files can then be used to perform an automatic, unattended - installation of OFED on other machines in the cluster. The - unattended installation will use the same choices as were selected - in the interactive installation. - - For an automatic installation on any host, run the following: - - ./OFED-X.X.X/install.pl -c /ofed.conf -n /ofed_net.conf - -3) Install script usage: - - Usage: ./install.pl [-c |--all|--hpc|--basic] - [-n|--net ] - - -c|--config . Example of the config file can - be found under docs (ofed.conf-example) - -n|--net Example of the config file can be - found under docs (ofed_net.conf-example) - -l|--prefix Set installation prefix. 
- -p|--print-available Print available packages for current platform. - And create corresponding ofed.conf file. - -k|--kernel . Default on this system: $(uname -r) - -s|--kernel-sources . Default on this - system: /lib/modules/$(uname -r)/build - --build32 Build 32-bit libraries. Relevant for x86_64 and - ppc64 platforms - --without-depcheck Skip Distro's libraries check - -v|-vv|-vvv. Set verbosity level - -q. Set quiet - no messages will be printed - --force Force uninstall RPM coming with Distribution - --builddir Change build directory. Default: /var/tmp/ - - --all|--hpc|--basic Install all,hpc or basic packages - correspondingly - -Notes: ------- -a. It is possible to rename and/or edit the ofed.conf and ofed_net.conf files. - Thus it is possible to change user choices (observing the original format). - See examples of ofed.conf and ofed_net.conf under OFED-X.X.X/docs. - Run './install.pl -p' to get ofed.conf with all available packages included. - -b. Important note for open-iscsi users: - Installing iSER as part of the OFED installation will also install - open-iscsi. Before installing OFED, please uninstall any open-iscsi version - that may be installed on your machine. Installing OFED with iSER support - while another open-iscsi version is already installed will cause the - installation process to fail. - - -Install Process Results: ------------------------- - -o The OFED package is installed under directory. 
Default prefix is /usr -o The kernel modules are installed under: - - Infiniband subsystem: - /lib/modules/`uname -r`/updates/kernel/drivers/infiniband/ - - open-iscsi: - /lib/modules/`uname -r`/updates/kernel/drivers/scsi/ - - Chelsio driver: - /lib/modules/`uname -r`/updates/kernel/drivers/net/cxgb3/ - - ConnectX driver: - /lib/modules/`uname -r`/updates/kernel/drivers/net/mlx4/ - - RDS: - /lib/modules/`uname -r`/updates/kernel/net/rds/ - - NFSoRDMA: - /lib/modules/`uname -r`/updates/kernel/fs/exportfs/ - /lib/modules/`uname -r`/updates/kernel/fs/lockd/ - /lib/modules/`uname -r`/updates/kernel/fs/nfs/ - /lib/modules/`uname -r`/updates/kernel/fs/nfs_common/ - /lib/modules/`uname -r`/updates/kernel/fs/nfsd/ - /lib/modules/`uname -r`/updates/kernel/net/sunrpc/ - - Bonding module: - /lib/modules/`uname -r`/updates/kernel/drivers/net/bonding/bonding.ko -o The package kernel include files are placed under /src/ofa_kernel/. - These includes should be used when building kernel modules which use - the Openfabrics stack. (Note that these includes, if needed, are - "backported" to your kernel). -o The raw package (un-backported) source files are placed under - /src/ofa_kernel-x.x.x -o The script "openibd" is installed under /etc/init.d/. This script can - be used to load and unload the software stack. -o The directory /etc/infiniband is created with the files "info" and - "openib.conf". The "info" script can be used to retrieve OFED - installation information. The "openib.conf" file contains the list of - modules that are loaded when the "openibd" script is used. -o The file "90-ib.rules" is installed under /etc/udev/rules.d/ -o If libibverbs-utils is installed, then ofed.sh and ofed.csh are - installed under /etc/profile.d/. These automatically update the PATH - environment variable with /bin. In addition, ofed.conf is - installed under /etc/ld.so.conf.d/ to update the dynamic linker's - run-time search path to find the InfiniBand shared libraries. 
-o The file /etc/modprobe.d/ib_ipoib.conf is updated to include the following: - - "alias ib ib_ipoib" for each ib interface. -o The file /etc/modprobe.d/ib_sdp.conf is updated to include the following: - - "alias net-pf-27 ib_sdp" for sdp. -o If opensm is installed, the daemon opensmd is installed under /etc/init.d/ -o All verbs tests and examples are installed under /bin and management - utilities under /sbin -o ofed_info script provides information on the OFED version and git repository. -o If iSER is included, open-iscsi user-space files will be also installed: - - Configuration files will be installed at /etc/iscsi - - Startup script will be installed at: - - RedHat: /etc/init.d/iscsi - - SuSE: /etc/init.d/open-iscsi - - Other tools (iscsiadm, iscsid, iscsi_discovery, iscsi-iname, iscsistart) - will be installed under /sbin. - - Documentation will be installed under: - - RedHat: /usr/share/doc/iscsi-initiator-utils- - - SuSE: /usr/share/doc/packages/open-iscsi -o man pages will be installed under /usr/share/man/. - -============================================================================== -6. Building OFED RPMs -============================================================================== - -1) Go to the directory into which the package was extracted: - - cd /..../OFED-X.X.X - -2) Run install.pl as explained above - This script also builds OFED binary RPMs under OFED-X.X.X/RPMS; the sources - are placed in OFED-X.X.X/SRPMS/. - - Once the install process has completed, the user may run ./install.pl on - other machines that have the same operating system and kernel to - install the new RPMs. - -Note: Depending on your hardware, the build procedure may take 30-45 - minutes. Installation, however, is a relatively short process - (~5 minutes). 
A common strategy for OFED installation on large - homogeneous clusters is to extract the tarball on a network - file system (such as NFS), build OFED RPMs on NFS, and then run the - installer on each node with the RPMs that were previously built. - -============================================================================== -7. IP-over-IB (IPoIB) Configuration -============================================================================== - -Configuring IPoIB is an optional step during the installation. During -an interactive installation, the user may choose to insert the ifcfg-ib -files. If this option is chosen, the ifcfg-ib files will be -installed under: - -- RedHat: /etc/sysconfig/network-scripts/ -- SuSE: /etc/sysconfig/network/ - -Setting IPoIB Configuration: ----------------------------- -There is no default configuration for IPoIB interfaces. - -One should manually specify the full IP configuration during the -interactive installation: IP address, network address, netmask, and -broadcast address, or use the ofed_net.conf file. - -For bonding setting please see "ipoib_release_notes.txt" - -For unattended installations, a configuration file can be provided -with this information. The configuration file must specify the -following information: -- Fixed values for each IPoIB interface -- Base IPoIB configuration on Ethernet configuration (may be useful for - cluster configuration) - -Here are some examples of ofed_net.conf: - -# Static settings; all values provided by this file -IPADDR_ib0=172.16.0.4 -NETMASK_ib0=255.255.0.0 -NETWORK_ib0=172.16.0.0 -BROADCAST_ib0=172.16.255.255 -ONBOOT_ib0=1 - -# Based on eth0; each '*' will be replaced by the script with corresponding -# octet from eth0. 
-LAN_INTERFACE_ib0=eth0 -IPADDR_ib0=172.16.'*'.'*' -NETMASK_ib0=255.255.0.0 -NETWORK_ib0=172.16.0.0 -BROADCAST_ib0=172.16.255.255 -ONBOOT_ib0=1 - -# Based on the first eth interface that is found (for n=0,1,...); -# each '*' will be replaced by the script with corresponding octet from eth. -LAN_INTERFACE_ib0= -IPADDR_ib0=172.16.'*'.'*' -NETMASK_ib0=255.255.0.0 -NETWORK_ib0=172.16.0.0 -BROADCAST_ib0=172.16.255.255 -ONBOOT_ib0=1 - - -============================================================================== -8. Using SDP -============================================================================== - -Overview: ---------- - -Sockets Direct Protocol (SDP) is an InfiniBand byte-stream transport protocol -that provides TCP stream semantics. Capable of utilizing InfiniBand's advanced -protocol offload capabilities, SDP can provide lower latency, higher -bandwidth, and lower CPU utilization than IPoIB running some sockets-based -applications. - -SDP can be used by applications and improve their performance transparently -(that is, without any recompilation). Since SDP has the same socket semantics -as TCP, an existing application is able to run using SDP; the difference is -that the application's TCP socket gets replaced with an SDP socket. - -It is also possible to configure the driver to automatically translate TCP to -SDP based on the source IP, the destination, or the application name (See -below). - -The SDP protocol is composed of a kernel module that implements the SDP as a -new address-family/protocol-family, and a library that is used for replacing -the TCP address family with SDP according to a policy. - -libsdp.so Library: ------------------- - -libsdp.so is a dynamically linked library, which is used for transparent -integration of applications with SDP. The library is preloaded, and therefore -takes precedence over glibc for certain socket calls. Thus, it can -transparently replace the TCP socket family with SDP socket calls. 
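The preloading described above can be sketched as a small shell wrapper. This is only an illustration, assuming libsdp.so can be resolved by the dynamic linker; the function name run_with_sdp and the SDP_LIB override are hypothetical and are not part of OFED.

```shell
#!/bin/sh
# Illustrative wrapper (hypothetical, not shipped with OFED): run one command
# with libsdp.so preloaded so that its socket calls can be redirected to SDP.
# SDP_LIB is an assumed override for the library path; LIBSDP_CONFIG_FILE
# falls back to the default policy file /etc/libsdp.conf.
run_with_sdp() {
    LD_PRELOAD="${SDP_LIB:-libsdp.so}" \
    LIBSDP_CONFIG_FILE="${LIBSDP_CONFIG_FILE:-/etc/libsdp.conf}" \
    "$@"
}
```

For example, "run_with_sdp ./my_server" would start a (hypothetical) my_server binary with the library preloaded. Whether a given socket actually uses SDP still depends on the policy in the configuration file.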
- -The library also implements a user-level socket switch. Using a configuration -file, the system administrator can set up the policy that selects the type of -socket to be used. libsdp.so also has the option to allow server sockets to -listen on both SDP and TCP interfaces. The various configurations with SDP/TCP -sockets are explained inside the /etc/libsdp.conf file. - -Configuring SDP: ----------------- - -To load SDP upon boot, edit the file /etc/infiniband/openib.conf and set -"SDP_LOAD=yes". - -Note: For the changes to take effect, run: /etc/init.d/openibd restart - -SDP shares the same IP addresses and interface names as IPoIB. See IPoIB -Configuration (chapter 7) - -How to Know SDP Is Working: ---------------------------- - -Since SDP is a transparent TCP replacement, it can sometimes be difficult to -know that it is working correctly. -To figure out whether traffic is passing through SDP or TCP, check -/proc/net/sdpstats and monitor which counters are running. - -sdpnetstat: ------------ - -The sdpnetstat program can be used to verify both that SDP is loaded and is -being used: - -host1$ sdpnetstat -S - -This command shows all active SDP sockets using the same format as the -traditional netstat program. Without the '-S' option, it shows all the -information that netstat does plus SDP data. - -Assuming that the SDP kernel module is loaded and is being used, then the -output of the command will show something like the following: - -host1$ sdpnetstat -S - -Proto Recv-Q Send-Q Local Address Foreign Address -sdp 0 0 193.168.10.144:34216 193.168.10.125:12865 -sdp 0 884720 193.168.10.144:42724 193.168.10.:filenet-rmi - -The example output above shows two active SDP sockets and contains details -about the connections. 
If the SDP kernel module is not loaded, or it is -loaded but is not being used, then the output of the command will be something -like the following: - -host1$ sdpnetstat -S - -Proto Recv-Q Send-Q Local Address Foreign Address -netstat: no support for `AF INET (tcp)' on this system. - -To verify whether the module is loaded or not, you can use the lsmod command - -Monitoring and Troubleshooting Tools: -------------------------------------- - -SDP has debug support for both the user space libsdp.so library and the ib_sdp -kernel module. -Both can be useful to understand why a TCP socket was not redirected over SDP -and to help find problems in the SDP implementation. - -User-space SDP debug is controlled by options in the libsdp.conf file. You can also have a local -version and point to it explicitly using the following command: - -host1$ export LIBSDP_CONFIG_FILE=/libsdp.conf - -To obtain extensive debug information, you can modify libsdp.conf to have the -log directive produce maximum debug output (provide the min-level flag with -the value 1). More details in the default libsdp.conf installed by OFED. -A non-root user can configure libsdp.so to record function calls and return values in the file -/tmp/libsdp.log. - -Kernel Space SDP Debug - The SDP kernel module can log detailed trace -information if you enable it using the 'debug_level' variable in the sysfs -filesystem. The following command performs this: - -host1$ echo 1 > /sys/module/ib_sdp/debug_level - -Note: Depending on the operating system distribution on your machine, you may need -an extra level, 'parameters', in the directory structure, so you may need to direct -the echo command to /sys/module/ib_sdp/parameters/debug_level. 
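The two possible locations mentioned in the note can be handled by checking which one exists. The following is a minimal sketch, assuming a POSIX shell; the function names and the SDP_SYSFS_ROOT override are hypothetical and exist only to keep the example self-contained.

```shell
#!/bin/sh
# Sketch: locate the ib_sdp debug_level knob, which may live directly under
# /sys/module/ib_sdp or under its 'parameters' subdirectory.
# SDP_SYSFS_ROOT is a hypothetical override (not an OFED variable).
sdp_debug_path() {
    root="${SDP_SYSFS_ROOT:-/sys/module/ib_sdp}"
    if [ -e "$root/parameters/debug_level" ]; then
        printf '%s\n' "$root/parameters/debug_level"
    elif [ -e "$root/debug_level" ]; then
        printf '%s\n' "$root/debug_level"
    else
        return 1
    fi
}

# Write a level to whichever location was found (1 enables, 0 disables).
sdp_set_debug() {
    path=$(sdp_debug_path) || {
        echo "ib_sdp debug_level not found (is the module loaded?)" >&2
        return 1
    }
    echo "$1" > "$path"
}
```

Run as root, "sdp_set_debug 1" would enable kernel tracing and "sdp_set_debug 0" would turn it off again, regardless of which layout the distribution uses.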
- -Turning off kernel debug is done by setting the sysfs variable to zero using -the following command: - -host1$ echo 0 > /sys/module/ib_sdp/debug_level - -To display debug information, use the dmesg command: - -host1$ dmesg - -Environment Variables: ----------------------- - -For the transparent integration with SDP, the following two environment -variables are required: -1. LD_PRELOAD - this environment variable is used to preload libsdp.so and it - should point to the libsdp.so library. The variable should be set by the - system administrator to libsdp.so. -2. LIBSDP_CONFIG_FILE - this environment variable is used to configure the - policy for replacing TCP sockets with SDP sockets. By default it points to: - /etc/libsdp.conf - -Using RDMA: ------------ - -For smaller buffers, the overhead of preparing a user buffer to be RDMA'ed is -too big; therefore, it is more efficient to use BCopy. (Large buffers can also -be sent using RDMA, but they lower CPU utilization.) This mode is called -"ZCopy combined mode". The sendmsg syscall is blocked until the buffer is -transferred to the socket's peer, and the data is copied directly from the user -buffer at the source side to the user buffer at the sink side. - -To set the threshold, use the module parameter sdp_zcopy_thresh. This parameter -can be accessed through sysfs (/sys/module/ib_sdp/parameters/sdp_zcopy_thresh). -Setting it to 0 disables ZCopy. - - -============================================================================== -9. Uninstalling OFED -============================================================================== - -There are two ways to uninstall OFED: -1) Via the installation menu. -2) Using the script ofed_uninstall.sh. The script is part of the ofed-scripts - package. - -Note: The ofed_uninstall.sh script supports the --unload-modules flag, which - executes 'openibd stop' before removing the RPMs. - -============================================================================== -10.
Upgrading OFED -============================================================================== - -If an old OFED version is installed, it may be upgraded by installing a -new OFED version as described in section 5. Note that if the old OFED -version was loaded before upgrading, you need to restart OFED or reboot -your machine in order to start the new OFED stack. - -============================================================================== -11. Configuration -============================================================================== - -Most of the OFED components can be configured or reconfigured after -the installation by modifying the relevant configuration files. The -list of the modules that will be loaded automatically upon boot can be -found in the /etc/infiniband/openib.conf file. Other configuration -files include: -- SDP configuration file: /etc/libsdp.conf -- OpenSM configuration file: /etc/ofa/opensm.conf (for RedHat) - /etc/sysconfig/opensm (for SuSE) - should be - created manually if required. -- DAPL configuration file: /etc/dat.conf - -See packages Release Notes for more details. - -Note: After the installer completes, information about the OFED - installation such as the prefix, kernel version, and - installation parameters can be found by running - /etc/infiniband/info. - - -============================================================================== -12. Related Documentation -============================================================================== - -OFED documentation is located in the ofed-docs RPM. 
After -installation the documents are located under the directory: -/usr/share/doc/ofed-docs-x.x.x for RedHat -/usr/share/doc/packages/ofed-docs-x.x.x for SuSE - -Documents list: - - o README.txt - o OFED_Installation_Guide.txt - o MPI_README.txt - o Examples of configuration files - o OFED_tips.txt - o HOWTO.build_ofed - o All release notes and README files - -For more information, please visit the OpenFabrics web site: - http://www.openfabrics.org - -open-iscsi documentation is located at: -- RedHat: /usr/share/doc/iscsi-initiator-utils- -- SuSE: /usr/share/doc/packages/open-iscsi - -For more information, please visit the open-iscsi web site: - http://www.open-iscsi.org diff --git a/OFED_release_notes.txt b/OFED_release_notes.txt index 4132c5f..03bbab5 100644 --- a/OFED_release_notes.txt +++ b/OFED_release_notes.txt @@ -270,6 +270,7 @@ for each package in the docs directory. openmpi-1.4.3-1.src.rpm 2. Added RHEL6 support +3. Added RHEL5.6 support =============================================================================== 6. Known Issues diff --git a/PERF_TEST_README.txt b/PERF_TEST_README.txt deleted file mode 100644 index af8b119..0000000 --- a/PERF_TEST_README.txt +++ /dev/null @@ -1,146 +0,0 @@ - Open Fabrics Enterprise Distribution (OFED) - Performance Tests README for OFED 1.5 - - December 2010 - - - -=============================================================================== -Table of Contents -=============================================================================== -1. Overview -2. Notes on Testing Methodology -3. Test Descriptions -4. Running Tests - -=============================================================================== -1. Overview -=============================================================================== -This is a collection of tests written over uverbs intended for use as a -performance micro-benchmark. As an example, the tests can be used for -HW or SW tuning and/or functional testing. 
- -The collection contains a set of BW and latency benchmarks such as: - - * Read - ib_read_bw and ib_read_lat. - * Write - ib_write_bw, ib_write_bw_postlist and ib_write_lat. - * Send - ib_send_bw and ib_send_lat. - * RDMA - rdma_bw and rdma_lat. - * Additional benchmark: ib_clock_test. - -Please post results/observations/bugs/remarks to the mailing list specified below: - * Maintainer - idos@dev.mellanox.co.il - * OFED mailing list - ewg@lists.openfabrics.org - or linux-rdma@vger.kernel.org - * http://openib.org/mailman/listinfo/openib-general - -=============================================================================== -2. Notes on Testing Methodology -=============================================================================== -The benchmarks specified below are tested on the following architectures: -- i686 -- x86_64 -- ia64 - -- The benchmark uses the CPU cycle counter to get time stamps without context - switch. Some CPU architectures (e.g., Intel's 80486 or older PPC) do NOT - have such capability. - -- The benchmark measures round-trip time but reports half of that as one-way - latency. Thus, it may not be sufficiently accurate for asymmetrical - configurations. - -- On BW benchmarks, the BW is calculated on the send side only, as it calculates - the BW after collecting completion from the receive side. - If using the bidirectional flag, BW is calculated on both sides. - -- Min/Median/Max result is reported. - The median (vs average) is less sensitive to extreme scores. - Typically, the "Max" value is the first value measured. - -- Larger samples help only marginally. The default (1000) is sufficient. - Note that an array of cycles_t (typically unsigned long) is allocated - once to collect samples and again to store the difference between them. - Large sample sizes (e.g., 1 million) might expose other problems - with the program. - -- The "-H" option will dump the histogram for additional statistical analysis.
- See xgraph, ygraph, r-base (http://www.r-project.org/), pspp, or other - statistical math programs. - -=============================================================================== -3. Test Descriptions -=============================================================================== - -rdma_lat.c latency test with RDMA write transactions -rdma_bw.c streaming BW test with RDMA write transactions - - -The following tests are mainly useful for HW/SW benchmarking. -They are not intended as actual usage examples. - -send_lat.c latency test with send transactions -send_bw.c BW test with send transactions -write_lat.c latency test with RDMA write transactions -write_bw.c BW test with RDMA write transactions -read_lat.c latency test with RDMA read transactions -read_bw.c BW test with RDMA read transactions - -The executable name of each test starts with the general prefix "ib_", -e.g., ib_write_lat, except for the RDMA tests, -whose executables have the same name as the source file without the .c suffix. - -Running Tests -------------- - -Prerequisites: - kernel 2.6 - ib_uverbs (kernel module) matches libibverbs - ("match" means binary compatible, but ideally of the same SVN rev) - -Server: ./ -Client: ./ - - o is IPv4 or IPv6 address. You can use the IPoIB - address if IPoIB is configured. - o --help lists the available - - *** IMPORTANT NOTE: The SAME OPTIONS must be passed to both server and client. - - -Common Options to all tests: - -p, --port= Listen on/connect to port (default: 18515). - -m, --mtu= MTU size (default: 1024). - -d, --ib-dev= Use IB device (default: first device found). - -i, --ib-port= Use port of IB device (default: 1). - -o, --out= Number of outstanding reads. Only in READ. - -q, --qp= Number of QPs to perform. Only in write_bw. - -c, --connection= Connection type: RC, UC, UD according to spec. - -g, --mcg= Number of QPs in MultiCast group. In SEND only. - -M, --MGID= as the group MGID in format '255:1:X:X:X:X:X:X:X:X:X:X:X:X:X:X'.
- -s, --size= Size of message to exchange (default: 1). - -a, --all Run sizes from 2 till 2^23. - -t, --tx-depth= Size of tx queue (default: 50). - -r, --rx-depth= Make rx queue bigger than tx (default 600). - -n, --iters= Number of exchanges (at least 100, default: 1000). - -I, --inline_size= Max size of message to be sent in inline mode. - On BW tests default is 1, on latency tests 400. - -C, --report-cycles Report times in cpu cycle units. - -u, --qp-timeout= QP timeout, timeout value is 4 usec * 2^(timeout). - Default is 14. - -S, --sl= SL (default 0). - -H, --report-histogram Print out all results (Default: summary only). - Only on Latency tests. - -x, --gid-index= Test uses GID with GID index taken from command - Line (for RDMAoE index should be 0). - -b, --bidirectional Measure bidirectional bandwidth (default uni). - On BW tests only (Implicit on latency tests). - -V, --version Display version number. - -e, --events Sleep on CQ events (default poll). - -N, --no peak-bw Cancel peak-bw calculation (default with peak-bw) - -F, --CPU-freq Do not fail even if cpufreq_ondemand module is loaded. - - *** IMPORTANT NOTE: You need to be running a Subnet Manager on the switch or - on one of the nodes in your fabric. - - diff --git a/QLOGIC_VNIC_README.txt b/QLOGIC_VNIC_README.txt deleted file mode 100644 index 9e5a75e..0000000 --- a/QLOGIC_VNIC_README.txt +++ /dev/null @@ -1,642 +0,0 @@ -This is a release of the QLogic VNIC driver on OFED 1.4. This driver is -currently supported on Intel x86 32 and 64 bit machines. -Supported OS are: -- RHEL 4 Update 4. -- RHEL 4 Update 5. -- RHEL 4 Update 6. -- SLES 10. -- SLES 10 Service Pack 1. -- SLES 10 Service Pack 1 Update 1. -- SLES 10 Service Pack 2. -- RHEL 5. -- RHEL 5 Update 1. -- RHEL 5 Update 2. -- vanilla 2.6.27 kernel. - -The VNIC driver in conjunction with the QLogic Ethernet Virtual I/O Controller -(EVIC) provides Ethernet interfaces on a host with IB HCA(s) without the need -for any physical Ethernet NIC.
- -This file describes the use of the QLogic VNIC ULP service on an OFED stack -and covers the following points: - -A) Creating QLogic VNIC interfaces -B) Discovering VEx/EVIC IOCs present on the fabric using ib_qlgc_vnic_query -C) Starting the QLogic VNIC driver and the VNIC interfaces -D) Assigning IP addresses etc for the QLogic VNIC interfaces -E) Information about the QLogic VNIC interfaces -F) Deleting a specific QLogic VNIC interface -G) Forced Failover feature for QLogic VNIC. -H) Infiniband Quality of Service for VNIC. -I) QLogic VNIC Dynamic Update Daemon Tool and Hot Swap support -J) Information about creating VLAN interfaces -K) Information about enabling IB Multicast for QLogic VNIC interface -L) Basic Troubleshooting - -A) Creating QLogic VNIC interfaces - -The VNIC interfaces can be created with the help of -the configuration file which must be placed at /etc/infiniband/qlgc_vnic.cfg. - -Please take a look at /etc/infiniband/qlgc_vnic.cfg.sample file (available also -as part of the documentation) to see how VNIC configuration files are written. -You can use this configuration file as the basis for creating a VNIC configuration -file by copying it to /etc/infiniband/qlgc_vnic.cfg. Of course you will have to -replace the IOCGUID, IOCSTRING values etc in the sample configuration file -with those of the EVIC IOCs present on your fabric. - -(For backward compatibility, if this file is missing, -/etc/infiniband/qlogic_vnic.cfg or /etc/sysconfig/ics_inic.cfg -will be used for configuration) - -Please note that using DGID of the EVIC/VEx IOC is -recommended as it will ensure the quickest startup of the -VNIC service. If DGID is specified then you must also -specify the IOCGUID. More details can be found in -the qlgc_vnic.cfg.sample file. - -On a host with more than one HCA plugged in, VNIC -interfaces can be configured based on HCA no and Port No or PORTGUID.
- -B) Discovering EVIC/VEx IOCs present on the fabric using ib_qlgc_vnic_query - -For writing the configuration file, you will need information -about the EVIC/VEx IOCs present on the fabric like their IOCGUID, -IOCSTRING etc. The ib_qlgc_vnic_query tool should be used to get this -information. - -When ib_qlgc_vnic_query is executed without any options, it scans through ALL -active IB ports on the host and obtains the detailed information about all the -EVIC/VEx IOCs reachable through each active IB port: - -# ib_qlgc_vnic_query - -HCA No = 0, HCA = mlx4_0, Port = 1, Port GUID = 0x0002c903000010f5, State = Active - - IO Unit Info: - port LID: 0008 - port GID: fe8000000000000000066a11de000070 - change ID: 0003 - max controllers: 0x02 - - - controller[ 1] - GUID: 00066a01de000070 - vendor ID: 00066a - device ID: 000030 - IO class : 2000 - ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1 - service entries: 2 - service[ 0]: 1000066a00000001 / InfiniNIC.InfiniConSys.Control:01 - service[ 1]: 1000066a00000101 / InfiniNIC.InfiniConSys.Data:01 - - IO Unit Info: - port LID: 0009 - port GID: fe8000000000000000066a21de000070 - change ID: 0003 - max controllers: 0x02 - - - controller[ 2] - GUID: 00066a02de000070 - vendor ID: 00066a - device ID: 000030 - IO class : 2000 - ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2 - service entries: 2 - service[ 0]: 1000066a00000002 / InfiniNIC.InfiniConSys.Control:02 - service[ 1]: 1000066a00000102 / InfiniNIC.InfiniConSys.Data:02 - -HCA No = 0, HCA = mlx4_0, Port = 2, Port GUID = 0x0002c903000010f6, State = Active - - IO Unit Info: - port LID: 0008 - port GID: fe8000000000000000066a11de000070 - change ID: 0003 - max controllers: 0x02 - - - controller[ 1] - GUID: 00066a01de000070 - vendor ID: 00066a - device ID: 000030 - IO class : 2000 - ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1 - service entries: 2 - service[ 0]: 1000066a00000001 / InfiniNIC.InfiniConSys.Control:01 - service[ 1]: 1000066a00000101 / 
InfiniNIC.InfiniConSys.Data:01 - - IO Unit Info: - port LID: 0009 - port GID: fe8000000000000000066a21de000070 - change ID: 0003 - max controllers: 0x02 - - - controller[ 2] - GUID: 00066a02de000070 - vendor ID: 00066a - device ID: 000030 - IO class : 2000 - ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2 - service entries: 2 - service[ 0]: 1000066a00000002 / InfiniNIC.InfiniConSys.Control:02 - service[ 1]: 1000066a00000102 / InfiniNIC.InfiniConSys.Data:02 - -HCA No = 1, HCA = mlx4_1, Port = 1, Port GUID = 0x0002c90300000785, State = Down - - Port State is Down. Skipping search of DM nodes on this port. - -HCA No = 1, HCA = mlx4_1, Port = 2, Port GUID = 0x0002c90300000786, State = Active - - IO Unit Info: - port LID: 0008 - port GID: fe8000000000000000066a11de000070 - change ID: 0003 - max controllers: 0x02 - - - controller[ 1] - GUID: 00066a01de000070 - vendor ID: 00066a - device ID: 000030 - IO class : 2000 - ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1 - service entries: 2 - service[ 0]: 1000066a00000001 / InfiniNIC.InfiniConSys.Control:01 - service[ 1]: 1000066a00000101 / InfiniNIC.InfiniConSys.Data:01 - - IO Unit Info: - port LID: 0009 - port GID: fe8000000000000000066a21de000070 - change ID: 0003 - max controllers: 0x02 - - - controller[ 2] - GUID: 00066a02de000070 - vendor ID: 00066a - device ID: 000030 - IO class : 2000 - ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2 - service entries: 2 - service[ 0]: 1000066a00000002 / InfiniNIC.InfiniConSys.Control:02 - service[ 1]: 1000066a00000102 / InfiniNIC.InfiniConSys.Data:02 - -This is meant to help the network administrator to know about HCA/Port information -on host along with EVIC IOCs reachable through given IB ports on fabric. 
When -ib_qlgc_vnic_query is run with -e option, it reports the IOCGUID information -and with -s option it reports the IOCSTRING information for the EVIC/VEx IOCs -present on the fabric: - -# ib_qlgc_vnic_query -e - -HCA No = 0, HCA = mlx4_0, Port = 1, Port GUID = 0x0002c903000010f5, State = Active - - ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff - ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff -HCA No = 0, HCA = mlx4_0, Port = 2, Port GUID = 0x0002c903000010f6, State = Active - - ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff - ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff -HCA No = 1, HCA = mlx4_1, Port = 1, Port GUID = 0x0002c90300000785, State = Down - - Port State is Down. Skipping search of DM nodes on this port. - -HCA No = 1, HCA = mlx4_1, Port = 2, Port GUID = 0x0002c90300000786, State = Active - - ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff - ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff - -# ib_qlgc_vnic_query -s - -HCA No = 0, HCA = mlx4_0, Port = 1, Port GUID = 0x0002c903000010f5, State = Active - -"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1" -"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2" -HCA No = 0, HCA = mlx4_0, Port = 2, Port GUID = 0x0002c903000010f6, State = Active - -"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1" -"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2" -HCA No = 1, HCA = mlx4_1, Port = 1, Port GUID = 0x0002c90300000785, State = Down - - Port State is Down. Skipping search of DM nodes on this port. 
- -HCA No = 1, HCA = mlx4_1, Port = 2, Port GUID = 0x0002c90300000786, State = Active - -"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1" -"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2" - -# ib_qlgc_vnic_query -es - -HCA No = 0, HCA = mlx4_0, Port = 1, Port GUID = 0x0002c903000010f5, State = Active - - ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1" - ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2" -HCA No = 0, HCA = mlx4_0, Port = 2, Port GUID = 0x0002c903000010f6, State = Active - - ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1" - ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2" -HCA No = 1, HCA = mlx4_1, Port = 1, Port GUID = 0x0002c90300000785, State = Down - - Port State is Down. Skipping search of DM nodes on this port. 
- -HCA No = 1, HCA = mlx4_1, Port = 2, Port GUID = 0x0002c90300000786, State = Active - - ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1" - ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2" - -ib_qlgc_vnic_query can be used to discover EVIC IOCs on the fabric based on -umad device, HCA no/Port no and PORTGUID as follows: - -For umad devices, it takes the name of the umad device mentioned with '-d' -option: - -# ib_qlgc_vnic_query -es -d /dev/infiniband/umad0 - -HCA No = 0, HCA = mlx4_0, Port = 1 - - ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1" - ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2" - -If the name of the HCA and its port no is known, then ib_qlgc_vnic_query can -make use of this information to discover EVIC IOCs on the fabric. HCA name -and port no is specified with '-C' and '-P' options respectively. - -# ib_qlgc_vnic_query -es -C mlx4_1 -P 2 - - ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1" - ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2" - -In case, if HCA name is not specified but port no is specified, HCA 0 is -selected as default HCA to discover IOCs and if Port no is missing then, -Port 1 of HCA name mentioned is used to discover the IOCs. If both are -missing, the behaviour is default and ib_qlgc_vnic_query will scan all the -IB ports on the host to discover IOCs reachable through each one of them. 
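The "-e" and "-es" listings shown above are line-oriented, so they can be turned into structured records when filling in a configuration file. The following is only a minimal sketch: parse_ioc_lines and IOC_LINE are hypothetical helper names, not part of OFED, and the pattern is inferred purely from the sample output in this document.

```python
import re

# Hypothetical helper, not part of OFED. The pattern is inferred from the
# sample "ib_qlgc_vnic_query -e/-es" output shown in this README.
IOC_LINE = re.compile(
    r'ioc_guid=(?P<ioc_guid>[0-9a-fA-F]+),'
    r'dgid=(?P<dgid>[0-9a-fA-F]+),'
    r'pkey=(?P<pkey>[0-9a-fA-F]+)'
    r'(?:,"(?P<ioc_string>[^"]*)")?'   # the IOCSTRING is present only with -s
)


def parse_ioc_lines(text):
    """Return one dict per IOC line; header and 'Port State is Down' lines are skipped."""
    iocs = []
    for line in text.splitlines():
        match = IOC_LINE.search(line)
        if match:
            iocs.append(match.groupdict())
    return iocs
```

The text would typically come from capturing the tool's standard output; the sketch deliberately leaves that step out so it can be tried on saved output first.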
- -PORTGUID information about the IB ports on given host can be obtained using -the option '-L': - -# ib_qlgc_vnic_query -L - -0,mlx4_0,1,0x0002c903000010f5 -0,mlx4_0,2,0x0002c903000010f6 -1,mlx4_1,1,0x0002c90300000785 -1,mlx4_1,2,0x0002c90300000786 - -This actually lists different configurable parameters of IB ports present on -given host in the order: HCA No, HCA Name, Port No, PORTGUID separated by -commas. PORTGUID value obtained thus, can be used to discover EVIC IOCs -reachable through it using '-G' option as follows: - -# ib_qlgc_vnic_query -es -G 0x0002c903000010f5 - -HCA No = 0, HCA = mlx4_0, Port = 1, Port GUID = 0x0002c903000010f5, State = Active - - ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1" - ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2" - -C) Starting the QLogic VNIC driver and the QLogic VNIC interfaces - -To start the QLogic VNIC service as a part of startup of OFED stack, set - -QLGC_VNIC_LOAD=yes - -in /etc/infiniband/openib.conf file. With this actually, the QLogic VNIC -service will also be stopped when the OFED stack is stopped. Also, if OFED -stack has been marked to start on boot, QLogic VNIC service will also start -on boot. - -The rest of the discussion in this subsection C) is valid only if - -QLGC_VNIC_LOAD=no - -is set into /etc/infiniband/openib.conf. 
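Since the manual commands in this subsection apply only when QLGC_VNIC_LOAD=no, it can help to check the setting before proceeding. The sketch below assumes that openib.conf uses plain shell-style KEY=value lines with "#" comments; vnic_load_setting is a hypothetical helper name, not an OFED tool.

```python
def vnic_load_setting(conf_text):
    """Return the last QLGC_VNIC_LOAD value found in openib.conf-style text, or None.

    Assumption: the file uses shell-style KEY=value lines and '#' comments.
    """
    value = None
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and surrounding spaces
        if line.startswith("QLGC_VNIC_LOAD="):
            value = line.split("=", 1)[1].strip().strip("\"'").lower()
    return value
```

A caller could read /etc/infiniband/openib.conf, pass its contents to this function, and continue with the manual steps only when the result is "no".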
- -Once you have created a configuration file, you can start the VNIC driver -and create the VNIC interfaces specified in the configuration file with: - -#/sbin/service qlgc_vnic start - -You can stop the VNIC driver and bring down the VNIC interfaces with - -#/sbin/service qlgc_vnic stop - -To restart the QLogic VNIC driver, you can use - -#/sbin/service qlgc_vnic restart - -If you have not started the Infiniband network stack (Infinipath or OFED), -then running "/sbin/service qlgc_vnic start" command will also cause the -Infiniband network stack to be started since the QLogic VNIC service requires -the Infiniband stack. - -On the other hand if you start the Infiniband network stack separately, then -the correct order of starting is: - -- Start the Infiniband stack -- Start QLogic VNIC service - -For example, if you use OFED, correct order of starting is: - -/sbin/service openibd start -/sbin/service qlgc_vnic start - -Correct order of stopping is: - -- Stop QLogic VNIC service -- Stop the Infiniband stack - -For example, if you use OFED, correct order of stopping is: - -/sbin/service qlgc_vnic stop -/sbin/service openibd stop - -If you try to stop the Infiniband stack when the QLogic VNIC service is -running, -you will get an error message that some of the modules of the Infiniband stack -are in use by the QLogic VNIC service. Also, any QLogic VNIC interfaces that -you -created are removed (because stopping the Infiniband network stack causes the -HCA -driver to be unloaded which is required for the VNIC interfaces to be -present). -In this case, do the following: - - 1. Stop the QLogic VNIC service with "/sbin/service qlgc_vnic stop" - - 2. Stop the Infiniband stack again. - - 3. If you want to restart the QLogic VNIC interfaces, use - "/sbin/service qlgc_vnic start". 
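The start and stop ordering described above can be captured in a small wrapper. This is a sketch under the assumption that the OFED stack is in use (service names openibd and qlgc_vnic, as in the examples above); service_commands and run_services are hypothetical names.

```python
import subprocess

# Order in which the services must be started; stopping uses the reverse order.
START_ORDER = ["openibd", "qlgc_vnic"]


def service_commands(action):
    """Return the /sbin/service command lines for 'start' or 'stop', in the correct order."""
    if action == "start":
        services = START_ORDER
    elif action == "stop":
        services = list(reversed(START_ORDER))
    else:
        raise ValueError("action must be 'start' or 'stop'")
    return [["/sbin/service", name, action] for name in services]


def run_services(action, dry_run=True):
    """Print the commands (dry run) or execute them, stopping at the first failure."""
    for cmd in service_commands(action):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Keeping dry_run=True as the default means nothing is executed until the printed sequence has been reviewed.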
- - -D) Assigning IP addresses etc for the QLogic VNIC interfaces - -This can be done with ifconfig or by setting up the ifcfg-XXX (ifcfg-veth0 for -an interface named veth0 etc) network files for the corresponding VNIC interfaces. - -E) Information about the QLogic VNIC interfaces - -Information about VNIC interfaces on a given host can be obtained using a -script "ib_qlgc_vnic_info" :- - -# ib_qlgc_vnic_info - -VNIC Interface : eioc0 - VNIC State : VNIC_REGISTERED - Current Path : primary path - Receive Checksum : true - Transmit checksum : true - - Primary Path : - VIPORT State : VIPORT_CONNECTED - Link State : LINK_IDLING - HCA Info. : vnic-mthca0-1 - Heartbeat : 100 - IOC String : EVIC in Chassis 0x00066a00db000010, Slot 4, Ioc 1 - IOC GUID : 66a01de000037 - DGID : fe8000000000000000066a11de000037 - P Key : ffff - - Secondary Path : - VIPORT State : VIPORT_DISCONNECTED - Link State : INVALID STATE - HCA Info. : vnic-mthca0-2 - Heartbeat : 100 - IOC String : - IOC GUID : 66a01de000037 - DGID : 00000000000000000000000000000000 - P Key : 0 - -This information is collected from /sys/class/infiniband_qlgc_vnic/interfaces/ -directory under which there is a separate directory corresponding to each -VNIC interface. - -F) Deleting a specific QLogic VNIC interface - -VNIC interfaces can be deleted by writing the name of the interface to -the /sys/class/infiniband_qlgc_vnic/interfaces/delete_vnic file. - -For example to delete interface veth0 - -echo -n veth0 > /sys/class/infiniband_qlgc_vnic/interfaces/delete_vnic - -G) Forced Failover feature for QLogic VNIC. - -VNIC interfaces, when configured with failover configuration, can be -forced to failover to use other active path. 
For example, if VNIC interface -"veth1" is configured with failover configuration, then to switch to other -path, use command: - -echo -n veth1 > /sys/class/infiniband_qlgc_vnic/interfaces/force_failover - -This will make VNIC interface veth1 to switch to other active path, even though -the path of VNIC interface, before the forced failover operation, is not in -disconnected state. - -This feature allows the network administrator to control the path of the -VNIC traffic at run time and reconfiguration as well as restart of VNIC -service is not required to achieve the same. - -Once enabled as mentioned above, forced failover can be cleared with -the unfailover command: - -echo -n veth1 > /sys/class/infiniband_qlgc_vnic/interfaces/unfailover - -This clears the forced failover on VNIC interface "veth1". Once cleared, -if module parameter "default_prefer_primary" is set to 1, then VNIC -interface switches back to primary path. If module parameter -"default_prefer_primary" is set to 0, then VNIC interface continues to -use its current active path. - -Forced failover, thus, takes priority over default_prefer_primary and the -default_prefer_primary feature will not be active unless the forced -failover is cleared through "unfailover". - -Besides this forced failover, QLogic VNIC service does retain its -original failover feature which gets triggered when current active -path gets disconnected. - -H) Infiniband Quality of Service for VNIC:- - -To enforce infiniband Quality of Service(QoS) for VNIC protocol, there -is no configuration required on host side. The service level for the -VNIC protocol can be configured using service ID or target port guid -in the "qos-ulps" section of /etc/opensm/qos-policy.conf on the host -running OpenSM. 
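As a rough illustration of what such a "qos-ulps" section might look like, the sketch below generates one entry per service ID. The entry syntax ("any, service-id ... : SL", enclosed by "qos-ulps" / "end-qos-ulps") is an assumption based on OpenSM's simplified QoS policy format and should be checked against the OpenSM QoS documentation; qos_ulps_section is a hypothetical helper, and the service ID used in the usage note is only an example value.

```python
def qos_ulps_section(service_id_to_sl, default_sl=0):
    """Build a qos-ulps section mapping service IDs to service levels (SL 0..15).

    Assumption: entry syntax follows OpenSM's simplified QoS policy format.
    """
    lines = ["qos-ulps", "    default : %d" % default_sl]
    for service_id, sl in sorted(service_id_to_sl.items()):
        if not 0 <= sl <= 15:
            raise ValueError("SL must be in the range 0..15")
        lines.append("    any, service-id 0x%016x : %d" % (service_id, sl))
    lines.append("end-qos-ulps")
    return "\n".join(lines)
```

For instance, qos_ulps_section({0x1000066a00000001: 3}) yields a section that assigns SL 3 to that service ID and SL 0 to everything else; real service IDs should be taken from the ib_qlgc_vnic_query output for the fabric in question.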
- -Service IDs for the EVIC IO controllers can be obtained from the output -of ib_qlgc_vnic_query: - -HCA No = 1, HCA = mlx4_1, Port = 2, Port GUID = 0x0002c90300000786, State = Active - - IO Unit Info: - port LID: 0008 - port GID: fe8000000000000000066a11de000070 - change ID: 0003 - max controllers: 0x02 - - - controller[ 1] - GUID: 00066a01de000070 - vendor ID: 00066a - device ID: 000030 - IO class : 2000 - ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1 - service entries: 2 -------> service[ 0]: 1000066a00000001 / InfiniNIC.InfiniConSys.Control:01 -------> service[ 1]: 1000066a00000101 / InfiniNIC.InfiniConSys.Data:01 - - IO Unit Info: - port LID: 0009 - port GID: fe8000000000000000066a21de000070 - change ID: 0003 - max controllers: 0x02 - - - controller[ 2] - GUID: 00066a02de000070 - vendor ID: 00066a - device ID: 000030 - IO class : 2000 - ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2 - service entries: 2 -------> service[ 0]: 1000066a00000002 / InfiniNIC.InfiniConSys.Control:02 -------> service[ 1]: 1000066a00000102 / InfiniNIC.InfiniConSys.Data:02 - -Numbers 1000066a00000002, 1000066a00000102 are the required service IDs. - -Finer control on quality of service for the VNIC protocol can be achieved by -configuring the service level using target port guid values of the EVIC IO -controllers. Target port guid values for the EVIC IO controllers can be -obtained using "saquery" command supplied by OFED package. - -I) QLogic VNIC Dynamic Update Daemon Tool and Hot Swap support:- - -This tool is started and stopped as part of the QLogic VNIC service -(refer to C above) and provides the following features: - -1. 
Dynamic update of disconnected interfaces (which have been configured -WITHOUT using the DGID option in the configuration file) : - -At the start up of VNIC driver, if the HCA port through which a particular VNIC -interface path (primary or secondary) connects to target is down or the -EVIC/VEx IOC is not available then all the required parameters (DGID etc) for connecting -with the EVIC/VEx cannot be determined. Hence the corresponding VNIC interface -path is not available at the start of the VNIC service. This daemon constantly -monitors the configured VNIC interfaces to check if any of them are disconnected. -If any of the interfaces are disconnected, it scans for available EVIC/VEx targets using -"ib_qlgc_vnic_query" tool. When daemon sees that for a given path of a VNIC interface, -the configured EVIC/VEx IOC has become available, it dynamically updates the -VNIC kernel driver with the required information to establish connection for -that path of the interface. In this way, the interface gets connected with -the configured EVIC/VEx whenever it becomes available without any manual -intervention. - -2. Hot Swap support : - -Hot swap is an operation in which an existing EVIC/VEx is replaced by another -EVIC/VEx (in the same slot of the switch chassis as the older one). In such a -case, the current connection for the corresponding VNIC interface will have to -be re-established. The daemon detects this hot swap case and re-establishes -the connection automatically. To make use of this feature of the daemon, it is -recommended that IOCSTRING be used in the configuration file to configure the -VNIC interfaces. - -This is because, after a hot swap though all other parameters like DGID, IOCGUID etc -of the EVIC/VEx change, the IOCSTRING remains the same. Thus the daemon monitors -for changes in IOCGUID and DGID of disconnected interfaces based on the IOCSTRING. 
-If these values have changed it updates the kernel driver so that the VNIC -interface can start using the new EVIC/VEx. - -If in addition to IOCSTRING, DGID and IOCGUID have been used to configure -a VNIC interface, then on a hotswap the daemon will update the parameters as required. -But to have that VNIC interface available immediately on the next restart of the -QLogic VNIC service, please make sure to update the configuration file with the -new DGID and IOCGUID values. Otherwise, the creation of such interfaces will be -delayed till the daemon runs and updates the parameters. - -J) Information about creating VLAN interfaces - -The EVIC/VEx supports VLAN tagging without having to explicitly create VLAN -interfaces for the VNIC interface on the host. This is done by enabling -Egress/Ingress tagging on the EVIC/VEx and setting the "Host ignores VLAN" -option for the VNIC interface. The "Host ignores VLAN" option is enabled -by default due to which VLAN tags are ignored on the host by the QLogic -VNIC driver. Thus explicitly created VLAN interfaces (using vconfig command) -for a given VNIC interface will not be operational. - -If you want to explicitly create a VLAN interface for a given VNIC interface, -then you will have to disable the "Host ignores VLAN" option for the -VNIC interface on the EVIC/VEx. The qlgc_vnic service must be restarted -on the host after disabling (or enabling) the "Host ignores VLAN" option. - -Please refer to the EVIC/VEx documentation for more information on Egress/Ingress -port tagging feature and disabling the "Host ignores VLAN" option. - -K) Information about enabling IB Multicast for QLogic VNIC interface - -QLogic VNIC driver has been upgraded to support the IB Multicasting feature of -EVIC/VEx. This feature enables the QLogic VNIC host driver to support the IP -multicasting more efficiently. With this feature enabled, infiniband multicast -group acts as a carrier of IP multicast traffic. 
EVIC will make use of such IB -multicast groups for forwarding IP multicast traffic to VNIC interfaces which -are member of given IP multicast group. In the older QLogic VNIC host driver, -IB multicasting was not being used to carry IP multicast traffic. - -By default, IB multicasting is disabled on EVIC/VEx; but it is enabled by -default at the QLogic VNIC host driver. - -To disable IB multicast feature on the host driver, VNIC configuration file -needs to be modified by setting the parameter IB_MULTICAST=FALSE in the -interface configuration. Please refer to the qlgc_vnic.cfg.sample for more -details on configuration of VNIC interfaces for IB multicasting. -IB multicasting also needs to be enabled over EVIC/VEx. Please refer to the -EVIC/VEx documentation for more information on enabling IB multicast -feature over EVIC/VEx. - -L) Basic Troubleshooting - -1. In case of any problems, make sure that: - - a) The HCA ports you are trying to use have IB cables connected and are in an - active state. You can use the "ibv_devinfo" tool to check the state of - your HCA ports. - - b) If your HCA ports are not active, check if an SM is running on the fabric - where the HCA ports are connected. If you have done a full install of - OFED, you can use the "sminfo" command ("sminfo -P 2" for port 2) to - check SM information. - - c) Make sure that the EVIC/VEx is powered up and its Ethernet cables are connected - properly. - - d) Check /var/log/messages for any error messages. - -2. If some of your VNIC interfaces are not available: - - a) Use "ifconfig" tool with -a option to see if all interfaces are created. - It is possible that the interfaces are created but do not have an - IP address. Make sure that you have setup a correct ifcfg-XXX file for your - VNIC interfaces for automatic assignment of IP addresses. 
- - If the VNIC interface is created and the ifcfg file is also correct - but the VNIC interface is not UP, make sure that the target EVIC/VEx - IOC has an Ethernet cable properly connected. - - b) Make sure that the VNIC configuration file has been setup properly - with correct EVIC/VEx target DGID/IOCGUID/IOCSTRING information and - instance numbers. - - c) Make sure that the EVIC/VEx target IOC specified for that interface is - available. You can use the "ib_qlgc_vnic_query" tool to verify this. If it is not - available when you started the service, but it becomes available later - on, then the QLogic VNIC dynamic update daemon will bring up the - interface when the target becomes available. You will see messages in - /var/log/messages when the corresponding interface is created. - - d) Make sure that you have not exceeded the total number of Virtual interfaces - supported by the EVIC/VEx. You can check the total number of Virtual interfaces - currently in use on the HTTP interface of the EVIC/VEx. - diff --git a/QoS_architecture.txt b/QoS_architecture.txt deleted file mode 100644 index 1c19a98..0000000 --- a/QoS_architecture.txt +++ /dev/null @@ -1,216 +0,0 @@ - - QoS support in OFED - -============================================================================== -Table of contents -============================================================================== - -1. Overview -2. Architecture -3. Supported Policy -4. CMA functionality -5. IPoIB functionality -6. SDP functionality -7. RDS functionality -8. SRP functionality -9. iSER functionality -10. OpenSM functionality - - -============================================================================== -1. 
Overview -============================================================================== - -Quality of Service requirements stem from the realization of I/O consolidation -over IB network: As multiple applications and ULPs share the same fabric, -means to control their use of the network resources are becoming a must. -The basic need is to differentiate the service levels provided to different -traffic flows, such that a policy could be enforced and control each flow -utilization of the fabric resources. - -IBTA specification defined several hardware features and management interfaces -to support QoS: -* Up to 15 Virtual Lanes (VL) carry traffic in a non-blocking manner -* Arbitration between traffic of different VLs is performed by a 2 priority - levels weighted round robin arbiter. The arbiter is programmable with - a sequence of (VL, weight) pairs and maximal number of high priority credits - to be processed before low priority is served -* Packets carry class of service marking in the range 0 to 15 in their - header SL field -* Each switch can map the incoming packet by its SL to a particular output - VL based on programmable table VL=SL-to-VL-MAP(in-port, out-port, SL) -* The Subnet Administrator controls each communication flow parameters - by providing them as a response to Path Record (PR) or MultiPathRecord (MPR) - queries - -The IB QoS features provide the means to implement a DiffServ like -architecture. DiffServ architecture (IETF RFC 2474 & 2475) is widely used -today in highly dynamic fabrics. - -This document provides the detailed functional definition for the various -software elements that enable a DiffServ like architecture over the -OpenFabrics software stack. - - -============================================================================== -2. Architecture -============================================================================== - -QoS functionality is split between the SM/SA, CMA and the various ULPS. 
-We take the "chronology approach" to describe how the overall system works. - -2.1. The network manager (human) provides a set of rules (policy) that -define how the network is being configured and how its resources are split -to different QoS-Levels. The policy also defines how to decide which QoS-Level -each application or ULP or service uses. - -2.2. The SM analyzes the provided policy to see if it is realizable and -performs the necessary fabric setup. Part of this policy defines the default -QoS-Level of each partition. The SA is enhanced to match the requested Source, -Destination, QoS-Class, Service-ID, PKey against the policy, so clients -(ULPs, programs) can obtain a policy enforced QoS. The SM may also set up -partitions with appropriate IPoIB broadcast group. This broadcast group -carries its QoS attributes: SL, MTU, RATE, and Packet Lifetime. - -2.3. IPoIB is set up. IPoIB uses the SL, MTU, RATE and Packet Lifetime -available on the multicast group which forms the broadcast group of this -partition. - -2.4. MPI which provides non IB based connection management should be -configured to run using hard coded SLs. It uses these SLs for every QP -being opened. - -2.5. ULPs that use CM interface (like SRP) have their own pre-assigned -Service-ID and use it while obtaining PathRecord/MultiPathRecord (PR/MPR) -for establishing connections. The SA receiving the PR/MPR matches it -against the policy and returns the appropriate PR/MPR including SL, MTU, -RATE and Lifetime. - -2.6. ULPs and programs (e.g. SDP) that use CMA to establish RC connections provide -the CMA with the target IP and port number. ULPs might also provide QoS-Class. -The CMA then creates Service-ID for the ULP and passes this ID and optional -QoS-Class in the PR/MPR request. The resulting PR/MPR is used for configuring -the connection QP.
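To make the Service-ID construction concrete, here is a minimal sketch that combines a ULP prefix with a 16-bit port number. The prefixes and default ports are the ones given in the SDP, RDS and iSER sections of this document; the constant and function names are hypothetical, and this is not the CMA's actual code.

```python
# Prefixes as listed in the SDP, RDS and iSER sections of this document.
SDP_PREFIX = 0x0000000000010000        # 0x000000000001PPPP
RDS_ISER_PREFIX = 0x0000000001060000   # 0x000000000106PPPP


def service_id(prefix, port):
    """Place a 16-bit port number in the low 16 bits of a 64-bit Service-ID."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port must fit in 16 bits")
    return prefix | port


def format_service_id(value):
    """Render a Service-ID as 16 hex digits, the notation used in this document."""
    return "0x%016X" % value
```

With the default ports quoted in this document, the RDS port 0x48CA and the iSER port 0x0CBC map to 0x00000000010648CA and 0x0000000001060CBC respectively.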
- -PathRecord and MultiPathRecord enhancement for QoS: - -As mentioned above the PathRecord and MultiPathRecord attributes are enhanced -to carry the Service-ID which is a 64bit value. A new field QoS-Class is also -provided. -A new capability bit describes the SM QoS support in the SA class port info. -This approach provides an easy migration path for existing access layer and -ULPs by not introducing new set of PR/MPR attributes. - - -============================================================================== -3. Supported Policy -============================================================================== - -The QoS policy that is specified in a separate file is divided into -4 sub sections: - -I) Port Group: a set of CAs, Routers or Switches that share the same settings. - A port group might be a partition defined by the partition manager policy, - list of GUIDs, or list of port names based on NodeDescription. - -II) Fabric Setup: Defines how the SL2VL and VLArb tables should be setup. - NOTE: Currently this part of the policy is ignored. SL2VL and VLArb - tables should be configured in the OpenSM options file - (opensm.opts). - -III) QoS-Levels Definition: This section defines the possible sets of - parameters for QoS that a client might be mapped to. Each set holds - SL and optionally: Max MTU, Max Rate, Packet Lifetime and Path Bits. - NOTE: Currently, Path Bits are not implemented. - -IV) Matching Rules: A list of rules that match an incoming PR/MPR request - to a QoS-Level. The rules are processed in order such as the first match - is applied. Each rule is built out of a set of match expressions which - should all match for the rule to apply. The matching expressions are - defined for the following fields: - - SRC and DST to lists of port groups - - Service-ID to a list of Service-ID values or ranges - - QoS-Class to a list of QoS-Class values or ranges - - -============================================================================== -4. 
CMA features -============================================================================== - -The CMA interface supports Service-ID through the notion of port space -as a prefix to the port_num which is part of the sockaddr provided to -rdma_resolve_addr(). -CMA also allows the ULP (like SDP) to propagate a request for specific -QoS-Class. CMA uses the provided QoS-Class and Service-ID in the sent PR/MPR. - - -============================================================================== -5. IPoIB -============================================================================== - -IPoIB queries the SA for its broadcast group information. -It provides the broadcast group SL, MTU, and RATE in every following -PathRecord query performed when a new UDAV is needed by IPoIB. - - -============================================================================== -6. SDP -============================================================================== - -SDP uses CMA for building its connections. -The Service-ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits -holding the remote TCP/IP Port Number to connect to. - - -============================================================================== -7. RDS -============================================================================== - -RDS uses CMA and thus it is very close to SDP. The Service-ID for RDS is -0x000000000106PPPP, where PPPP are 4 hex digits holding the TCP/IP Port -Number that the protocol connects to. -Default port number for RDS is 0x48CA, which makes a default Service-ID -0x00000000010648CA. - - -============================================================================== -8. SRP -============================================================================== - -Current SRP implementation uses its own CM callbacks (not CMA). So SRP fills -in the Service-ID in the PR/MPR by itself and uses that information in setting -up the QP.
-SRP Service-ID is defined by the SRP target I/O Controller (it also complies -with IBTA Service-ID rules). The Service-ID is reported by the I/O Controller -in the ServiceEntries DM attribute and should be used in the PR/MPR if the -SA reports its ability to handle QoS PR/MPRs. - - -============================================================================== -9. iSER -============================================================================== - -Similar to RDS, iSER also uses CMA. The Service-ID for iSER is similar to RDS -(0x000000000106PPPP), with default port number 0x0CBC, which makes a default -Service-ID 0x0000000001060CBC. - - -============================================================================== -10. OpenSM features -============================================================================== - -The QoS related functionality that is provided by OpenSM can be split into two -main parts: - -10.1. Fabric Setup -During fabric initialization the SM parses the policy and applies its settings -to the discovered fabric elements. - -10.2. PR/MPR query handling: -OpenSM enforces the provided policy on client requests. -The overall flow for such requests is: first the request is matched against -the defined match rules such that the target QoS-Level definition is found. -Given the QoS-Level a path(s) search is performed with the given restrictions -imposed by that level. - -============================================================================== diff --git a/QoS_management_in_OpenSM.txt b/QoS_management_in_OpenSM.txt deleted file mode 100644 index 8c9915f..0000000 --- a/QoS_management_in_OpenSM.txt +++ /dev/null @@ -1,492 +0,0 @@ - - QoS Management in OpenSM - -============================================================================== - Table of contents -============================================================================== - -1. Overview -2. Full QoS Policy File -3. Simplified QoS Policy Definition -4.
Policy File Syntax Guidelines -5. Examples of Full Policy File -6. Simplified QoS Policy - Details and Examples -7. SL2VL Mapping and VL Arbitration - - -============================================================================== - 1. Overview -============================================================================== - -When QoS in OpenSM is enabled (-Q or --qos), OpenSM looks for QoS Policy file. -The default name of OpenSM QoS policy file is -/usr/local/etc/opensm/qos-policy.conf. The default may be changed by using -Y -or --qos_policy_file option with OpenSM. - -During fabric initialization and at every heavy sweep OpenSM parses the QoS -policy file, applies its settings to the discovered fabric elements, and -enforces the provided policy on client requests. The overall flow for such -requests is: - - The request is matched against the defined matching rules such that the - QoS Level definition is found. - - Given the QoS Level, path(s) search is performed with the given - restrictions imposed by that level. - -There are two ways to define QoS policy: - - Full policy, where the policy file syntax provides an administrator - various ways to match PathRecord/MultiPathRecord (PR/MPR) request and - enforce various QoS constraints on the requested PR/MPR - - Simplified QoS policy definition, where an administrator would be able to - match PR/MPR requests by various ULPs and applications running on top of - these ULPs. - -While the full policy syntax is very flexible, in many cases the simplified -policy definition would be sufficient. - - -============================================================================== - 2. Full QoS Policy File -============================================================================== - -QoS policy file has the following sections: - -I) Port Groups (denoted by port-groups). -This section defines zero or more port groups that can be referred later by -matching rules (see below). 
Port group lists ports by: - - Port GUID - - Port name, which is a combination of NodeDescription and IB port number - - PKey, which means that all the ports in the subnet that belong to - partition with a given PKey belong to this port group - - Partition name, which means that all the ports in the subnet that belong - to partition with a given name belong to this port group - - Node type, where possible node types are: CA, SWITCH, ROUTER, ALL, and - SELF (SM's port). - -II) QoS Setup (denoted by qos-setup). -This section describes how to set up SL2VL and VL Arbitration tables on -various nodes in the fabric. -However, this is not supported in OpenSM currently. -SL2VL and VLArb tables should be configured in the OpenSM options file -(default location - /usr/local/etc/opensm/opensm.conf). - -III) QoS Levels (denoted by qos-levels). -Each QoS Level defines Service Level (SL) and a few optional fields: - - MTU limit - - Rate limit - - PKey - - Packet lifetime -When path(s) search is performed, it is done with regards to restriction that -these QoS Level parameters impose. -One QoS level that is mandatory to define is a DEFAULT QoS level. It is -applied to a PR/MPR query that does not match any existing match rule. -Similar to any other QoS Level, it can also be explicitly referred by any -match rule. - -IV) QoS Matching Rules (denoted by qos-match-rules). -Each PathRecord/MultiPathRecord query that OpenSM receives is matched against -the set of matching rules. Rules are scanned in order of appearance in the QoS -policy file such as the first match takes precedence. -Each rule has a name of QoS level that will be applied to the matching query. -A default QoS level is applied to a query that did not match any rule. 
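The scan order described above, together with the all-criteria requirement spelled out in the next paragraph, can be sketched roughly as follows. This is an illustrative model only, not OpenSM code: the class and function names are hypothetical, and source/destination port groups are left out so that every criterion is a plain numeric range (qos_class, service_id, pkey).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Range = Tuple[int, int]  # inclusive (low, high)


@dataclass
class MatchRule:
    """Hypothetical model of one qos-match-rule."""
    qos_level_name: str
    # criterion name -> allowed ranges; criteria not listed are not checked
    criteria: Dict[str, List[Range]] = field(default_factory=dict)


def rule_matches(rule: MatchRule, query: Dict[str, int]) -> bool:
    """A query matches only if it satisfies ALL criteria of the rule."""
    for name, ranges in rule.criteria.items():
        value = query.get(name)
        if value is None:
            # The query does not carry this field, so the rule cannot match.
            return False
        if not any(low <= value <= high for low, high in ranges):
            return False
    return True


def select_qos_level(rules: List[MatchRule], query: Dict[str, int]) -> str:
    """Scan rules in file order; the first match wins, else DEFAULT."""
    for rule in rules:
        if rule_matches(rule, query):
            return rule.qos_level_name
    return "DEFAULT"


rules = [
    MatchRule("WholeSet", {"qos_class": [(7, 9), (11, 11)]}),
    MatchRule("Storage", {"service_id": [(0x10000, 0x1FFFF)],
                          "pkey": [(0x0F00, 0x0FFF)]}),
]
```

In this model a query that carries only a Service ID falls through to DEFAULT when the only rule mentioning Service ID also requires a PKey, which mirrors the component-mask behavior the text describes.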
-Queries can be matched by: - - Source port group (whether a source port is a member of a specified group) - - Destination port group (same as above, only for destination port) - - PKey - - QoS class - - Service ID -To match a certain matching rule, PR/MPR query has to match ALL the rule's -criteria. However, not all the fields of the PR/MPR query have to appear in -the matching rule. -For instance, if the rule has a single criterion - Service ID, it will match -any query that has this Service ID, disregarding the rest of the query fields. -However, if a certain query has only Service ID (which means that this is the -only bit in the PR/MPR component mask that is on), it will not match any rule -that has other matching criteria besides Service ID. - - -============================================================================== - 3. Simplified QoS Policy Definition -============================================================================== - -Simplified QoS policy definition consists of a single section denoted by -qos-ulps. Similar to the full QoS policy, it has a list of match rules and -their QoS Level, but in this case a match rule has only one criterion - its -goal is to match a certain ULP (or a certain application on top of this ULP) -PR/MPR request, and QoS Level has only one constraint - Service Level (SL). -The simplified policy section may appear in the policy file in combination with -the full policy, or as a stand-alone policy definition. -See more details and list of match rule criteria below. - - -============================================================================== - 4. Policy File Syntax Guidelines -============================================================================== - -- Empty lines are ignored. -- Leading and trailing blanks are ignored, so - the indentation in the example is just for better readability. -- Comments are started with the pound sign (#) and terminated by EOL.
-- Any keyword should be the first non-blank in the line, unless it's a - comment. -- Keywords that denote section/subsection start have matching closing - keywords. -- Having a QoS Level named "DEFAULT" is a must - it is applied to PR/MPR - requests that didn't match any of the matching rules. -- Any section/subsection of the policy file is optional. - - -============================================================================== - 5. Examples of Full Policy File -============================================================================== - -As mentioned earlier, any section of the policy file is optional, and -the only mandatory part of the policy file is a default QoS Level. -Here's an example of the shortest policy file: - - qos-levels - qos-level - name: DEFAULT - sl: 0 - end-qos-level - end-qos-levels - -Port groups section is missing because there are no match rules, which means -that port groups are not referred anywhere, and there is no need defining -them. And since this policy file doesn't have any matching rules, PR/MPR query -won't match any rule, and OpenSM will enforce default QoS level. -Essentially, the above example is equivalent to not having QoS policy file -at all. - -The following example shows all the possible options and keywords in the -policy file and their syntax: - - # - # See the comments in the following example. - # They explain different keywords and their meaning. - # - port-groups - - port-group # using port GUIDs - name: Storage - # "use" is just a description that is used for logging - # Other than that, it is just a comment - use: SRP Targets - port-guid: 0x10000000000001, 0x10000000000005-0x1000000000FFFA - port-guid: 0x1000000000FFFF - end-port-group - - port-group - name: Virtual Servers - # The syntax of the port name is as follows: - # "node_description/Pnum". - # node_description is compared to the NodeDescription of the node, - # and "Pnum" is a port number on that node. 
- port-name: vs1 HCA-1/P1, vs2 HCA-1/P1 - end-port-group - - # using partitions defined in the partition policy - port-group - name: Partitions - partition: Part1 - pkey: 0x1234 - end-port-group - - # using node types: CA, ROUTER, SWITCH, SELF (for node that runs SM) - # or ALL (for all the nodes in the subnet) - port-group - name: CAs and SM - node-type: CA, SELF - end-port-group - - end-port-groups - - qos-setup - # This section of the policy file describes how to set up SL2VL and VL - # Arbitration tables on various nodes in the fabric. - # However, this is not supported in OpenSM currently - the section is - # parsed and ignored. SL2VL and VLArb tables should be configured in the - # OpenSM options file (by default - /usr/local/etc/opensm/opensm.conf). - end-qos-setup - - qos-levels - - # Having a QoS Level named "DEFAULT" is a must - it is applied to - # PR/MPR requests that didn't match any of the matching rules. - qos-level - name: DEFAULT - use: default QoS Level - sl: 0 - end-qos-level - - # the whole set: SL, MTU-Limit, Rate-Limit, PKey, Packet Lifetime - qos-level - name: WholeSet - sl: 1 - mtu-limit: 4 - rate-limit: 5 - pkey: 0x1234 - packet-life: 8 - end-qos-level - - end-qos-levels - - # Match rules are scanned in order of their appearance in the policy file. - # First matched rule takes precedence.
- qos-match-rules - - # matching by single criteria: QoS class - qos-match-rule - use: by QoS class - qos-class: 7-9,11 - # Name of qos-level to apply to the matching PR/MPR - qos-level-name: WholeSet - end-qos-match-rule - - # show matching by destination group and service id - qos-match-rule - use: Storage targets - destination: Storage - service-id: 0x10000000000001, 0x10000000000008-0x10000000000FFF - qos-level-name: WholeSet - end-qos-match-rule - - qos-match-rule - source: Storage - use: match by source group only - qos-level-name: DEFAULT - end-qos-match-rule - - qos-match-rule - use: match by all parameters - qos-class: 7-9,11 - source: Virtual Servers - destination: Storage - service-id: 0x0000000000010000-0x000000000001FFFF - pkey: 0x0F00-0x0FFF - qos-level-name: WholeSet - end-qos-match-rule - - end-qos-match-rules - - -============================================================================== - 6. Simplified QoS Policy - Details and Examples -============================================================================== - -Simplified QoS policy match rules are tailored for matching ULPs (or some -application on top of a ULP) PR/MPR requests. This section has a list of -per-ULP (or per-application) match rules and the SL that should be enforced -on the matched PR/MPR query. 
- -Match rules include: - - Default match rule that is applied to PR/MPR query that didn't match any - of the other match rules - - SDP - - SDP application with a specific target TCP/IP port range - - SRP with a specific target IB port GUID - - RDS - - iSER - - iSER application with a specific target TCP/IP port range - - IPoIB with a default PKey - - IPoIB with a specific PKey - - any ULP/application with a specific Service ID in the PR/MPR query - - any ULP/application with a specific PKey in the PR/MPR query - - any ULP/application with a specific target IB port GUID in the PR/MPR query - -Since any section of the policy file is optional, as long as basic rules of -the file are kept (such as no referring to nonexisting port group, having -default QoS Level, etc), the simplified policy section (qos-ulps) can serve -as a complete QoS policy file. -The shortest policy file in this case would be as follows: - - qos-ulps - default : 0 #default SL - end-qos-ulps - -It is equivalent to the previous example of the shortest policy file, and it -is also equivalent to not having policy file at all. 
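To make the shape of a qos-ulps section concrete, here is a minimal sketch of how its lines could be split into (rule, SL) pairs, following the syntax guidelines in section 4 (comments start with # and run to end of line, blank lines and surrounding blanks are ignored). The function name is hypothetical and this is not OpenSM's parser; it does no validation of rule keywords or SL ranges.

```python
from typing import List, Tuple


def parse_qos_ulps(text: str) -> List[Tuple[str, int]]:
    """Return (rule, sl) pairs found between qos-ulps and end-qos-ulps."""
    rules: List[Tuple[str, int]] = []
    in_section = False
    for raw in text.splitlines():
        # Drop the comment part, then leading and trailing blanks.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if line == "qos-ulps":
            in_section = True
        elif line == "end-qos-ulps":
            in_section = False
        elif in_section:
            # The SL follows the last colon on the line.
            rule, sep, sl = line.rpartition(":")
            if not sep:
                raise ValueError(f"missing ':' in rule line: {line!r}")
            rules.append((rule.strip(), int(sl)))
    return rules


shortest = """
    qos-ulps
        default : 0 #default SL
    end-qos-ulps
"""
```

Applied to the shortest policy file above, the sketch returns a single pair for the default rule with SL 0.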
- -Below is an example of simplified QoS policy with all the possible keywords: - - qos-ulps - default : 0 # default SL - sdp, port-num 30000 : 0 # SL for application running on top - # of SDP when a destination - # TCP/IP port is 30000 - sdp, port-num 10000-20000 : 0 - sdp : 1 # default SL for any other - # application running on top of SDP - rds : 2 # SL for RDS traffic - iser, port-num 900 : 0 # SL for iSER with a specific target - # port - iser : 3 # default SL for iSER - ipoib, pkey 0x0001 : 0 # SL for IPoIB on partition with - # pkey 0x0001 - ipoib : 4 # default IPoIB partition, - # pkey=0x7FFF - any, service-id 0x6234 : 6 # match any PR/MPR query with a - # specific Service ID - any, pkey 0x0ABC : 6 # match any PR/MPR query with a - # specific PKey - srp, target-port-guid 0x1234 : 5 # SRP when SRP Target is located on - # a specified IB port GUID - any, target-port-guid 0x0ABC-0xFFFFF : 6 # match any PR/MPR query with - # a specific target port GUID - end-qos-ulps - - -Similar to the full policy definition, matching of PR/MPR queries is done in -order of appearance in the QoS policy file such that the first match takes -precedence, except for the "default" rule, which is applied only if the query -didn't match any other rule. - -All other sections of the QoS policy file take precedence over the qos-ulps -section. That is, if a policy file has both qos-match-rules and qos-ulps -sections, then any query is matched first against the rules in the -qos-match-rules section, and only if there was no match, the query is matched -against the rules in qos-ulps section. - -Note that some of these match rules may overlap, so in order to use the -simplified QoS definition effectively, it is important to understand how each -of the ULPs is matched: - -6.1 IPoIB -IPoIB query is matched by PKey.
Default PKey for IPoIB partition is 0x7fff, so -the following three match rules are equivalent: - - ipoib : - ipoib, pkey 0x7fff : - any, pkey 0x7fff : - -6.2 SDP -SDP PR query is matched by Service ID. The Service-ID for SDP is -0x000000000001PPPP, where PPPP are 4 hex digits holding the remote TCP/IP Port -Number to connect to. The following two match rules are equivalent: - - sdp : - any, service-id 0x0000000000010000-0x000000000001ffff : - -6.3 RDS -Similar to SDP, RDS PR query is matched by Service ID. The Service ID for RDS -is 0x000000000106PPPP, where PPPP are 4 hex digits holding the remote TCP/IP -Port Number to connect to. Default port number for RDS is 0x48CA, which makes -a default Service-ID 0x00000000010648CA. The following two match rules are -equivalent: - - rds : - any, service-id 0x00000000010648CA : - -6.4 iSER -Similar to RDS, iSER query is matched by Service ID, where the Service ID -is also 0x000000000106PPPP. Default port number for iSER is 0x0CBC, which makes -a default Service-ID 0x0000000001060CBC. The following two match rules are -equivalent: - - iser : - any, service-id 0x0000000001060CBC : - -6.5 SRP -Service ID for SRP varies from storage vendor to vendor, thus SRP query is -matched by the target IB port GUID. The following two match rules are -equivalent: - - srp, target-port-guid 0x1234 : - any, target-port-guid 0x1234 : - -Note that any of the above ULPs might contain target port GUID in the PR -query, so in order for these queries not to be recognized by the QoS manager -as SRP, the SRP match rule (or any match rule that refers to the target port -guid only) should be placed at the end of the qos-ulps match rules. - -6.6 MPI -SL for MPI is manually configured by MPI admin. OpenSM is not forcing any SL -on the MPI traffic, and that's why it is the only ULP that did not appear in -the qos-ulps section. - - -============================================================================== - 7.
SL2VL Mapping and VL Arbitration -============================================================================== - -OpenSM cached options file has a set of QoS related configuration parameters, -that are used to configure SL2VL mapping and VL arbitration on IB ports. -These parameters are: - - Max VLs: the maximum number of VLs that will be on the subnet. - - High limit: the limit of High Priority component of VL Arbitration - table (IBA 7.6.9). - - VLArb low table: Low priority VL Arbitration table (IBA 7.6.9) template. - - VLArb high table: High priority VL Arbitration table (IBA 7.6.9) template. - - SL2VL: SL2VL Mapping table (IBA 7.6.6) template. It is a list of VLs - corresponding to SLs 0-15 (Note that VL15 used here means drop this SL). - -There are separate QoS configuration parameters sets for various target types: -CAs, routers, switch external ports, and switch's enhanced port 0. The names -of such parameters are prefixed by "qos__" string. Here is a full list -of the currently supported sets: - - qos_ca_ - QoS configuration parameters set for CAs. - qos_rtr_ - parameters set for routers. - qos_sw0_ - parameters set for switches' port 0. - qos_swe_ - parameters set for switches' external ports. - -Here's the example of typical default values for CAs and switches' external -ports (hard-coded in OpenSM initialization): - - qos_ca_max_vls 15 - qos_ca_high_limit 0 - qos_ca_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0 - qos_ca_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4 - qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7 - - qos_swe_max_vls 15 - qos_swe_high_limit 0 - qos_swe_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0 - qos_swe_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4 - qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7 - -VL arbitration tables (both high and low) are lists of VL/Weight pairs. 
-Each list entry contains a VL number (values from 0-14), and a weighting value -(values 0-255), indicating the number of 64 byte units (credits) which may be -transmitted from that VL when its turn in the arbitration occurs. A weight -of 0 indicates that this entry should be skipped. If a list entry is -programmed for VL15 or for a VL that is not supported or is not currently -configured by the port, the port may either skip that entry or send from any -supported VL for that entry. - -Note, that the same VLs may be listed multiple times in the High or Low -priority arbitration tables, and, further, it can be listed in both tables. - -The limit of high-priority VLArb table (qos__high_limit) indicates the -number of high-priority packets that can be transmitted without an opportunity -to send a low-priority packet. Specifically, the number of bytes that can be -sent is high_limit times 4K bytes. - -A high_limit value of 255 indicates that the byte limit is unbounded. -Note: if the 255 value is used, the low priority VLs may be starved. -A value of 0 indicates that only a single packet from the high-priority table -may be sent before an opportunity is given to the low-priority table. - -Keep in mind that ports usually transmit packets of size equal to MTU. -For instance, for 4KB MTU a single packet will require 64 credits, so in order -to achieve effective VL arbitration for packets of 4KB MTU, the weighting -values for each VL should be multiples of 64. - -Below is an example of SL2VL and VL Arbitration configuration on subnet: - - qos_ca_max_vls 15 - qos_ca_high_limit 6 - qos_ca_vlarb_high 0:4 - qos_ca_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64 - qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7 - - qos_swe_max_vls 15 - qos_swe_high_limit 6 - qos_swe_vlarb_high 0:4 - qos_swe_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64 - qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7 - -In this example, there are 8 VLs configured on subnet: VL0 to VL7. 
VL0 is -defined as a high priority VL, and it is limited to 6 x 4KB = 24KB in a single -transmission burst. Such a configuration would suit a VL that needs low latency -and uses a small MTU when transmitting packets. The rest of the VLs are defined as low -priority VLs with different weights, while VL4 is effectively turned off. diff --git a/RDS_README.txt b/RDS_README.txt deleted file mode 100644 index f0f8f5d..0000000 --- a/RDS_README.txt +++ /dev/null @@ -1,335 +0,0 @@ -RDS(7) RDS(7) - - - -NAME - RDS - Reliable Datagram Sockets - -SYNOPSIS - #include - #include - -DESCRIPTION - This is an implementation of the RDS socket API. It provides reliable, - in-order datagram delivery between sockets over a variety of trans‐ - ports. - - Currently, RDS can be transported over Infiniband, and loopback. - iWARP bcopy is supported, but not RDMA operations. - - RDS uses standard AF_INET addresses as described in ip(7) to identify - end points. - - Socket Creation - RDS is still in development and as such does not have a reserved proto‐ - col family constant. Applications must read the string representation - of the protocol family value from the pf_rds sysctl parameter file - described below. - - rds_socket = socket(pf_rds, SOCK_SEQPACKET, 0); - - - Socket Options - RDS sockets support a number of socket options through the setsock‐ - opt(2) and getsockopt(2) calls. The following generic options (with - socket level SOL_SOCKET) are of specific importance: - - SO_RCVBUF - Specifies the size of the receive buffer. See section on "Con‐ - gestion Control" below. - - SO_SNDBUF - Specifies the size of the send buffer. See "Message Transmis‐ - sion" below. - - SO_SNDTIMEO - Specifies the send timeout when trying to enqueue a message on a - socket with a full queue in blocking mode. - - In addition to these, RDS supports a number of protocol specific - options (with socket level SOL_RDS).
Just as with the RDS protocol - family, an official value has not been assigned yet, so the kernel will - assign a value dynamically. The assigned value can be retrieved from - the sol_rds sysctl parameter file. - - RDS specific socket options will be described in a separate section - below. - - Binding - A new RDS socket has no local address when it is first returned from - socket(2). It must be bound to a local address by calling bind(2) - before any messages can be sent or received. This will also attach the - socket to a specific transport, based on the type of interface the - local address is attached to. From that point on, the socket can only - reach destinations which are available through this transport. - - For instance, when binding to the address of an Infiniband interface - such as ib0, the socket will use the Infiniband transport. If RDS is - not able to associate a transport with the given address, it will - return EADDRNOTAVAIL. - - An RDS socket can only be bound to one address and only one socket can - be bound to a given address/port pair. If no port is specified in the - binding address then an unbound port is selected at random. - - RDS does not allow the application to bind a previously bound socket to - another address. Binding to the wildcard address INADDR_ANY is not per‐ - mitted either. - - Connecting - The default mode of operation for RDS is to use unconnected socket, and - specify a destination address as an argument to sendmsg. However, RDS - allows sockets to be connected to a remote end point using connect(2). - If a socket is connected, calling sendmsg without specifying a destina‐ - tion address will use the previously given remote address. - - Congestion Control - RDS does not have explicit congestion control like common streaming - protocols such as TCP. However, sockets have two queue limits associ‐ - ated with them; the send queue size and the receive queue size. 
Mes‐ - sages are accounted based on the number of bytes of payload. - - The send queue size limits how much data local processes can queue on a - local socket (see the following section). If that limit is exceeded, - the kernel will not accept further messages until the queue is drained - and messages have been delivered to and acknowledged by the remote - host. - - The receive queue size limits how much data RDS will put on the receive - queue of a socket before marking the socket as congested. When a - socket becomes congested, RDS will send a congestion map update to the - other participating hosts, who are then expected to stop sending more - messages to this port. - - There is a timing window during which a remote host can still continue - to send messages to a congested port; RDS solves this by accepting - these messages even if the socket's receive queue is already over the - limit. - - As the application pulls incoming messages off the receive queue using - recvmsg(2), the number of bytes on the receive queue will eventually - drop below the receive queue size, at which point the port is then - marked uncongested, and another congestion update is sent to all par‐ - ticipating hosts. This tells them to allow applications to send addi‐ - tional messages to this port. - - The default values for the send and receive buffer size are controlled - by the system wide wmem_default and rmem_default sysctls, respectively. - They can be tuned by the application through the SO_SNDBUF and SO_RCVBUF - socket options. - - - Blocking Behavior - The sendmsg(2) and recvmsg(2) calls can block in a variety of situa‐ - tions. Whether a call blocks or returns with an error depends on the - non-blocking setting of the file descriptor and the MSG_DONTWAIT mes‐ - sage flag.
If the file descriptor is set to blocking mode (which is the - default), and the MSG_DONTWAIT flag is not given, the call will block. - - In addition, the SO_SNDTIMEO and SO_RCVTIMEO socket options can be used - to specify a timeout (in seconds) after which the call will abort wait‐ - ing, and return an error. The default timeout is 0, which tells RDS to - block indefinitely. - - Message Transmission - Messages may be sent using sendmsg(2) once the RDS socket is bound. - Message length cannot exceed 4 gigabytes as the wire protocol uses an - unsigned 32 bit integer to express the message length. - - RDS does not support out of band data. Applications are allowed to send - to unicast addresses only; broadcast or multicast are not supported. - - A successful sendmsg(2) call puts the message in the socket's transmit - queue where it will remain until either the destination acknowledges - that the message is no longer in the network or the application removes - the message from the send queue. - - Messages can be removed from the send queue with the RDS_CANCEL_SENT_TO - socket option described below. - - While a message is in the transmit queue its payload bytes are - accounted for. If an attempt is made to send a message while there is - not sufficient room on the transmit queue, the call will either block - or return EAGAIN. - - Trying to send to a destination that is marked congested (see above), - the call will either block or return ENOBUFS. - - A message sent with no payload bytes will not consume any space in the - destination's send buffer but will result in a message receipt on the - destination. The receiver will not get any payload data but will be - able to see the sender's address. - - Messages sent to a port to which no socket is bound will be silently - discarded by the destination host. No error messages are reported to - the sender. - - Message Receipt - Messages may be received with recvmsg(2) on an RDS socket once it is - bound to a source address. 
RDS will return messages in-order, i.e. mes‐ - sages from the same sender will arrive in the same order in which they - were sent. - - The address of the sender will be returned in the sockaddr_in structure - pointed to by the msg_name field, if set. - - If the MSG_PEEK flag is given, the first message on the receive queue is - returned without removing it from the queue. - - The memory consumed by messages waiting for delivery does not limit the - number of messages that can be queued for receive. RDS does attempt to - perform congestion control as described in the section above. - - If the length of the message exceeds the size of the buffer provided to - recvmsg(2), then the remainder of the bytes in the message are dis‐ - carded and the MSG_TRUNC flag is set in the msg_flags field. In this - truncating case recvmsg(2) will still return the number of bytes - copied, not the length of the entire message. If MSG_TRUNC is set in the - flags argument to recvmsg(2), then it will return the number of bytes - in the entire message. Thus one can examine the size of the next mes‐ - sage in the receive queue without incurring a copying overhead by pro‐ - viding a zero length buffer and setting MSG_PEEK and MSG_TRUNC in the - flags argument. - - The sending address of a zero-length message will still be provided in - the msg_name field. - - Control Messages - RDS uses control messages (a.k.a. ancillary data) through the msg_con‐ - trol and msg_controllen fields in sendmsg(2) and recvmsg(2). Control - messages generated by RDS have a cmsg_level value of sol_rds. Most - control messages are related to the zerocopy interface added in RDS - version 3, and are described in rds-rdma(7). - - The only exception is the RDS_CMSG_CONG_UPDATE message, which is - described in the following section. - - Polling - RDS supports the poll(2) interface in a limited fashion.
POLLIN is - returned when there is a message (either a proper RDS message, or a - control message) waiting in the socket's receive queue. POLLOUT is - always returned while there is room on the socket's send queue. - - Sending to congested ports requires special handling. When an applica‐ - tion tries to send to a congested destination, the system call will - return ENOBUFS. However, it cannot poll for POLLOUT, as there is prob‐ - ably still room on the transmit queue, so the call to poll(2) would - return immediately, even though the destination is still congested. - - There are two ways of dealing with this situation. The first is to sim‐ - ply poll for POLLIN. By default, a process sleeping in poll(2) is - always woken up when the congestion map is updated, and thus the appli‐ - cation can retry any previously congested sends. - - The second option is explicit congestion monitoring, which gives the - application more fine-grained control. - - With explicit monitoring, the application polls for POLLIN as before, - and additionally uses the RDS_CONG_MONITOR socket option to install a - 64bit mask value in the socket, where each bit corresponds to a group - of ports. When a congestion update arrives, RDS checks the set of ports - that became uncongested against the bit mask installed in the socket. - If they overlap, a control message is enqueued on the socket, and the - application is woken up. When it calls recvmsg(2), it will be given the - control message containing the bitmap. - - The congestion monitor bitmask can be set and queried using setsock‐ - opt(2) with RDS_CONG_MONITOR, and a pointer to the 64bit mask variable. - - Congestion updates are delivered to the application via - RDS_CMSG_CONG_UPDATE control messages. These control messages are - always delivered by themselves (or possibly with additional control mes‐ - sages), but never along with an RDS data message.
The cmsg_data field of - the control message is an 8 byte datum containing the 64bit mask value. - - Applications can use the following macros to test for and set bits in - the bitmask: - - #define RDS_CONG_MONITOR_SIZE 64 - #define RDS_CONG_MONITOR_BIT(port) (((unsigned int) port) % RDS_CONG_MONITOR_SIZE) - #define RDS_CONG_MONITOR_MASK(port) (1ULL << RDS_CONG_MONITOR_BIT(port)) - - - Canceling Messages - An application can cancel (flush) messages from the send queue using - the RDS_CANCEL_SENT_TO socket option with setsockopt(2). This call - takes an optional sockaddr_in address structure as argument. If given, - only messages to the destination specified by this address are dis‐ - carded. If no address is given, all pending messages are discarded. - - Note that this affects messages that have not yet been transmitted as - well as messages that have been transmitted, but for which no acknowl‐ - edgment from the remote host has been received yet. - - Reliability - If sendmsg(2) succeeds, RDS guarantees that the message will be vis‐ - ible to recvmsg(2) on a socket bound to the destination address as - long as that destination socket remains open. - - If there is no socket bound on the destination, the message is - silently dropped. If the sending RDS can't be sure that there is no - socket bound then it will try to send the message indefinitely until it - can be sure or the sent message is canceled. - - If a socket is closed then all pending sent messages on the socket are - canceled and may or may not be seen by the receiver. - - The RDS_CANCEL_SENT_TO socket option can be used to cancel all pending - messages to a given destination. - - If a receiving socket is closed with pending messages then the sender - considers those messages as having left the network and will not - retransmit them. - - A message will only be seen by recvmsg(2) once, unless MSG_PEEK was - specified. Once the message has been delivered it is removed from the - sending socket's transmit queue.
- - All messages sent from the same socket to the same destination will be - delivered in the order they're sent. Messages sent from different sock‐ - ets, or to different destinations, may be delivered in any order. - -SYSCTL VALUES - These parameters may only be accessed through their files in - /proc/sys/net/rds. Access through sysctl(2) is not supported. - - pf_rds This file contains the string representation of the protocol - family constant passed to socket(2) to create a new RDS socket. - - sol_rds - This file contains the string representation of the socket level - parameter that is passed to getsockopt(2) and setsockopt(2) to - manipulate RDS socket options. - - max_unacked_bytes and max_unacked_packets - These parameters are used to tune the generation of acknowledge‐ - ments. By default, the system receiving RDS messages does not - send back explicit acknowledgements unless it transmits a mes‐ - sage of its own (in which case the ACK is piggybacked onto the - outgoing message), or when the sending system requests an ACK. - - However, the sender needs to see an ACK from time to time so - that it can purge old messages from the send queue. The unacked - bytes and packet counters are used to keep track of how much - data has been sent without requesting an ACK. The default is to - request an acknowledgement every 16 packets, or every 16 MB, - whichever comes first. - - reconnect_delay_min_ms and reconnect_delay_max_ms - RDS uses host-to-host connections to transport RDS messages - (both for the TCP and the InfiniBand transport). If this connec‐ - tion breaks, RDS will try to re-establish the connection. - Because this reconnect may be triggered by both hosts at the - same time and fail, RDS uses a random backoff before attempting - a reconnect. These two parameters specify the minimum and maxi‐ - mum delay in milliseconds. The default values are 1 and 1000, - respectively.
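As a minimal sketch of how an application might use the pf_rds and sol_rds files described above, the C helper below reads one integer from a file under /proc/sys/net/rds. The name read_rds_constant is hypothetical and not part of RDS. It is assumed here that the "string representation" in each file is a plain integer; strtol is called with base 0 so that decimal, octal and hexadecimal notations would all parse.

```c
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of RDS): read one integer constant from a
 * file such as /proc/sys/net/rds/pf_rds or /proc/sys/net/rds/sol_rds.
 * Returns the value on success, or -1 if the file cannot be opened or
 * does not begin with a non-negative integer that fits in an int.
 */
int read_rds_constant(const char *path)
{
    char buf[64];
    char *end;
    long value;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;
    if (fgets(buf, sizeof(buf), fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);

    errno = 0;
    value = strtol(buf, &end, 0);   /* base 0: accept 21, 0x15, 025 */
    if (end == buf || errno != 0 || value < 0 || value > INT_MAX)
        return -1;
    return (int)value;
}
```

An application could pass the pf_rds result as the first argument to socket(2) and the sol_rds result as the level argument to getsockopt(2) and setsockopt(2), checking for -1 before either call.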
- -SEE ALSO - rds-rdma(7), socket(2), bind(2), sendmsg(2), recvmsg(2), getsockopt(2), - setsockopt(2). - - - - RDS(7) diff --git a/README.txt b/README.txt index 90d50b0..550529d 100644 --- a/README.txt +++ b/README.txt @@ -1,38 +1,85 @@ Open Fabrics Enterprise Distribution (OFED) - Version 1.5.2 - README - - September 2010 + Version 1.5.3 + README + + January 2010 + +============================================================================== +Table of contents +============================================================================== + + 1. Overview + 2. Contents of the OFED Distribution + 3. Hardware and Software Requirements + 4. How to Download and Extract the OFED Distribution + 5. Installing OFED Software + 6. Building OFED RPMs + 7. IPoIB Configuration + 8. Using SDP + 9. Uninstalling OFED + 10. Upgrading OFED + 11. Configuration + 12. Starting and Verifying the IB Fabric + 13. MPI (Message Passing Interface) + 14. Related Documentation + + +============================================================================== +1. Overview +============================================================================== This is the OpenFabrics Enterprise Distribution (OFED) version 1.5.2 software package supporting InfiniBand and iWARP fabrics. It is composed of several software modules intended for use on a computer cluster constructed as an InfiniBand subnet or an iWARP network. -*** Note: If you plan to upgrade OFED on your cluster, please upgrade all - its nodes to this new version. +This document describes how to install the various modules and test them in +a Linux environment. -This document includes the following sections: +General Notes: + 1) The install script removes all previously installed OFED packages + and re-installs from scratch. (Note: Configuration files will not + be removed). You will be prompted to acknowledge the deletion of + the old packages. -1. HW and SW Requirements -2. OFED Package Contents -3. Installing OFED Software -4. 
Starting and Verifying the IB Fabric -5. MPI (Message Passing Interface) -6. Related Documentation - -OpenFabrics Home Page: http://www.openfabrics.org + 2) When installing OFED on an entire [homogeneous] cluster, a common + strategy is to install the software on one of the cluster nodes + (perhaps on a shared file system such as NFS). The resulting RPMs, + created under OFED-X.X.X/RPMS directory, can then be installed on all + nodes in the cluster using any cluster-aware tools (such as pdsh). -The OFED rev 1.5.2 software download available in -http://www.openfabrics.org/builds/ofed-1.5.2/release/ +============================================================================== +2. OFED Package Contents +============================================================================== -Please email bugs and error reports to your InfiniBand vendor, or use bugzilla -https://bugs.openfabrics.org/ +The OFED Distribution package generates RPMs for installing the following: + o OpenFabrics core and ULPs: + - HCA drivers (mthca, mlx4, qib, ehca) + - iWARP driver (cxgb3, nes) + - core + - Upper Layer Protocols: IPoIB, SDP, SRP Initiator and target, iSER + Initiator and target, RDS, qlgc_vnic, uDAPL and NFS-RDMA + o OpenFabrics utilities + - OpenSM: InfiniBand Subnet Manager + - Diagnostic tools + - Performance tests + o MPI + - OSU MVAPICH stack supporting the InfiniBand and iWARP interface + - Open MPI stack supporting the InfiniBand and iWARP interface + - OSU MVAPICH2 stack supporting the InfiniBand and iWARP interface + - MPI benchmark tests (OSU BW/LAT, Intel MPI Benchmark, Presta) + o Extra packages + - open-iscsi: open-iscsi initiator with iSER support + - ib-bonding: Bonding driver for IPoIB interface + o Sources of all software modules (under conditions mentioned in the + modules' LICENSE files) + o Documentation +============================================================================== +3. 
Hardware and Software Requirements +============================================================================== -1. HW and SW Requirements: -========================== 1) Server platform with InfiniBand HCA or iWARP RNIC (see OFED Distribution Release Notes for details) @@ -85,72 +132,453 @@ Note: The installer will warn you if you attempt to compile any of the another open-iscsi version is already installed will cause the installation process to fail. -2. OFED Package Contents -======================== +============================================================================== +4. How to Download and Extract the OFED Distribution +============================================================================== -The OFED Distribution package generates RPMs for installing the following: +1) Download the OFED-X.X.X.tgz file to your target Linux host. - o OpenFabrics core and ULPs - - HCA drivers (mthca, mlx4, mlx4_en, qib, ehca) - - iWARP driver (cxgb3, nes) - - core - - Upper Layer Protocols: IPoIB, SDP, SRP Initiator and target, iSER - Initiator and target, RDS, uDAPL, qlgc_vnic and NFS-RDMA. - o OpenFabrics utilities - - OpenSM: InfiniBand Subnet Manager - - Diagnostic tools - - Performance tests - o MPI - - OSU MVAPICH stack supporting the InfiniBand and iWARP interface - - Open MPI stack supporting the InfiniBand and iWARP interface - - OSU MVAPICH2 stack supporting the InfiniBand and iWARP interface - - MPI benchmark tests (OSU BW/LAT, Intel MPI Benchmark, Presta) - o Extra packages - - open-iscsi: open-iscsi initiator with iSER support - - ib-bonding: Bonding driver for IPoIB interface - o Sources of all software modules (under conditions mentioned in the - modules' LICENSE files) - o Documentation + If this package is to be installed on a cluster, it is recommended to + download it to an NFS shared directory. + +2) Extract the package using: + + tar xzvf OFED-X.X.X.tgz + +============================================================================== +5. 
Installing OFED Software +============================================================================== + +1) Go to the directory into which the package was extracted: + + cd /..../OFED-X.X.X + +2) Installing the OFED package must be done as root. For a + menu-driven first build and installation, run the installer + script: + + ./install.pl + + Interactive menus will direct you through the install process. + + Note: After the installer completes, information about the OFED + installation such as the prefix, the kernel version, and + installation parameters can be found by running + /etc/infiniband/info. + + Information on the driver version and source git trees can be found + using the ofed_info utility + + + During the interactive installation of OFED, two files are + generated: ofed.conf and ofed_net.conf. + ofed.conf holds the installed software modules and configuration settings + chosen by the user. ofed_net.conf holds the IPoIB settings chosen by the + user. + If the package is installed on a cluster-shared directory, these + files can then be used to perform an automatic, unattended + installation of OFED on other machines in the cluster. The + unattended installation will use the same choices as were selected + in the interactive installation. -3. Installing OFED Software -============================ + For an automatic installation on any host, run the following: -The default installation directory is: /usr + ./OFED-X.X.X/install.pl -c /ofed.conf -n /ofed_net.conf -Install Quick Guide: -1) Download and extract: tar xzvf OFED-1.5.2.tgz file. -2) Change into directory: cd OFED-1.5.2 -3) Run as root: ./install.pl -4) Follow the directions to install required components. For details, please see - OFED_Installation_Guide.txt under OFED-1.5.2/docs. +3) Install script usage: + Usage: ./install.pl [-c |--all|--hpc|--basic] + [-n|--net ] + + -c|--config . 
Example of the config file can + be found under docs (ofed.conf-example) + -n|--net Example of the config file can be + found under docs (ofed_net.conf-example) + -l|--prefix Set installation prefix. + -p|--print-available Print available packages for current platform. + And create corresponding ofed.conf file. + -k|--kernel . Default on this system: $(uname -r) + -s|--kernel-sources . Default on this + system: /lib/modules/$(uname -r)/build + --build32 Build 32-bit libraries. Relevant for x86_64 and + ppc64 platforms + --without-depcheck Skip Distro's libraries check + -v|-vv|-vvv. Set verbosity level + -q. Set quiet - no messages will be printed + --force Force uninstall RPM coming with Distribution + --builddir Change build directory. Default: /var/tmp/ + + --all|--hpc|--basic Install all,hpc or basic packages + correspondingly Notes: -1. The install script removes previously installed IB packages and - re-installs from scratch. You will be prompted to acknowledge the deletion - of the old packages. However, configuration files (.conf) will be - preserved and saved with a ".rpmsave" extension. +------ +a. It is possible to rename and/or edit the ofed.conf and ofed_net.conf files. + Thus it is possible to change user choices (observing the original format). + See examples of ofed.conf and ofed_net.conf under OFED-X.X.X/docs. + Run './install.pl -p' to get ofed.conf with all available packages included. + +b. Important note for open-iscsi users: + Installing iSER as part of the OFED installation will also install + open-iscsi. Before installing OFED, please uninstall any open-iscsi version + that may be installed on your machine. Installing OFED with iSER support + while another open-iscsi version is already installed will cause the + installation process to fail. + + +Install Process Results: +------------------------ + +o The OFED package is installed under directory. 
Default prefix is /usr +o The kernel modules are installed under: + - Infiniband subsystem: + /lib/modules/`uname -r`/updates/kernel/drivers/infiniband/ + - open-iscsi: + /lib/modules/`uname -r`/updates/kernel/drivers/scsi/ + - Chelsio driver: + /lib/modules/`uname -r`/updates/kernel/drivers/net/cxgb3/ + - ConnectX driver: + /lib/modules/`uname -r`/updates/kernel/drivers/net/mlx4/ + - RDS: + /lib/modules/`uname -r`/updates/kernel/net/rds/ + - NFSoRDMA: + /lib/modules/`uname -r`/updates/kernel/fs/exportfs/ + /lib/modules/`uname -r`/updates/kernel/fs/lockd/ + /lib/modules/`uname -r`/updates/kernel/fs/nfs/ + /lib/modules/`uname -r`/updates/kernel/fs/nfs_common/ + /lib/modules/`uname -r`/updates/kernel/fs/nfsd/ + /lib/modules/`uname -r`/updates/kernel/net/sunrpc/ + - Bonding module: + /lib/modules/`uname -r`/updates/kernel/drivers/net/bonding/bonding.ko +o The package kernel include files are placed under /src/ofa_kernel/. + These includes should be used when building kernel modules which use + the Openfabrics stack. (Note that these includes, if needed, are + "backported" to your kernel). +o The raw package (un-backported) source files are placed under + /src/ofa_kernel-x.x.x +o The script "openibd" is installed under /etc/init.d/. This script can + be used to load and unload the software stack. +o The directory /etc/infiniband is created with the files "info" and + "openib.conf". The "info" script can be used to retrieve OFED + installation information. The "openib.conf" file contains the list of + modules that are loaded when the "openibd" script is used. +o The file "90-ib.rules" is installed under /etc/udev/rules.d/ +o If libibverbs-utils is installed, then ofed.sh and ofed.csh are + installed under /etc/profile.d/. These automatically update the PATH + environment variable with /bin. In addition, ofed.conf is + installed under /etc/ld.so.conf.d/ to update the dynamic linker's + run-time search path to find the InfiniBand shared libraries. 
+o The file /etc/modprobe.d/ib_ipoib.conf is updated to include the following: + - "alias ib ib_ipoib" for each ib interface. +o The file /etc/modprobe.d/ib_sdp.conf is updated to include the following: + - "alias net-pf-27 ib_sdp" for sdp. +o If opensm is installed, the daemon opensmd is installed under /etc/init.d/ +o All verbs tests and examples are installed under /bin and management + utilities under /sbin +o ofed_info script provides information on the OFED version and git repository. +o If iSER is included, open-iscsi user-space files will be also installed: + - Configuration files will be installed at /etc/iscsi + - Startup script will be installed at: + - RedHat: /etc/init.d/iscsi + - SuSE: /etc/init.d/open-iscsi + - Other tools (iscsiadm, iscsid, iscsi_discovery, iscsi-iname, iscsistart) + will be installed under /sbin. + - Documentation will be installed under: + - RedHat: /usr/share/doc/iscsi-initiator-utils- + - SuSE: /usr/share/doc/packages/open-iscsi +o man pages will be installed under /usr/share/man/. + +============================================================================== +6. Building OFED RPMs +============================================================================== + +1) Go to the directory into which the package was extracted: + + cd /..../OFED-X.X.X + +2) Run install.pl as explained above + This script also builds OFED binary RPMs under OFED-X.X.X/RPMS; the sources + are placed in OFED-X.X.X/SRPMS/. + + Once the install process has completed, the user may run ./install.pl on + other machines that have the same operating system and kernel to + install the new RPMs. + +Note: Depending on your hardware, the build procedure may take 30-45 + minutes. Installation, however, is a relatively short process + (~5 minutes). 
A common strategy for OFED installation on large + homogeneous clusters is to extract the tarball on a network + file system (such as NFS), build OFED RPMs on NFS, and then run the + installer on each node with the RPMs that were previously built. + +============================================================================== +7. IP-over-IB (IPoIB) Configuration +============================================================================== + +Configuring IPoIB is an optional step during the installation. During +an interactive installation, the user may choose to insert the ifcfg-ib +files. If this option is chosen, the ifcfg-ib files will be +installed under: + +- RedHat: /etc/sysconfig/network-scripts/ +- SuSE: /etc/sysconfig/network/ + +Setting IPoIB Configuration: +---------------------------- +There is no default configuration for IPoIB interfaces. + +One should manually specify the full IP configuration during the +interactive installation: IP address, network address, netmask, and +broadcast address, or use the ofed_net.conf file. + +For bonding setting please see "ipoib_release_notes.txt" + +For unattended installations, a configuration file can be provided +with this information. The configuration file must specify the +following information: +- Fixed values for each IPoIB interface +- Base IPoIB configuration on Ethernet configuration (may be useful for + cluster configuration) + +Here are some examples of ofed_net.conf: + +# Static settings; all values provided by this file +IPADDR_ib0=172.16.0.4 +NETMASK_ib0=255.255.0.0 +NETWORK_ib0=172.16.0.0 +BROADCAST_ib0=172.16.255.255 +ONBOOT_ib0=1 + +# Based on eth0; each '*' will be replaced by the script with corresponding +# octet from eth0. 
+LAN_INTERFACE_ib0=eth0 +IPADDR_ib0=172.16.'*'.'*' +NETMASK_ib0=255.255.0.0 +NETWORK_ib0=172.16.0.0 +BROADCAST_ib0=172.16.255.255 +ONBOOT_ib0=1 + +# Based on the first eth interface that is found (for n=0,1,...); +# each '*' will be replaced by the script with corresponding octet from eth. +LAN_INTERFACE_ib0= +IPADDR_ib0=172.16.'*'.'*' +NETMASK_ib0=255.255.0.0 +NETWORK_ib0=172.16.0.0 +BROADCAST_ib0=172.16.255.255 +ONBOOT_ib0=1 + + +============================================================================== +8. Using SDP +============================================================================== + +Overview: +--------- + +Sockets Direct Protocol (SDP) is an InfiniBand byte-stream transport protocol +that provides TCP stream semantics. Capable of utilizing InfiniBand's advanced +protocol offload capabilities, SDP can provide lower latency, higher +bandwidth, and lower CPU utilization than IPoIB running some sockets-based +applications. + +SDP can be used by applications and improve their performance transparently +(that is, without any recompilation). Since SDP has the same socket semantics +as TCP, an existing application is able to run using SDP; the difference is +that the application's TCP socket gets replaced with an SDP socket. + +It is also possible to configure the driver to automatically translate TCP to +SDP based on the source IP, the destination, or the application name (See +below). + +The SDP protocol is composed of a kernel module that implements the SDP as a +new address-family/protocol-family, and a library that is used for replacing +the TCP address family with SDP according to a policy. + +libsdp.so Library: +------------------ + +libsdp.so is a dynamically linked library, which is used for transparent +integration of applications with SDP. The library is preloaded, and therefore +takes precedence over glibc for certain socket calls. Thus, it can +transparently replace the TCP socket family with SDP socket calls. 
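To make the substitution concrete, the sketch below shows the kind of family swap that libsdp.so performs on behalf of an unmodified application. It rests on one assumption: that the SDP protocol family number is 27, inferred from the "alias net-pf-27 ib_sdp" modprobe entry listed in the install results, not taken from a header. The names SDP_FAMILY_ASSUMED, stream_family and open_stream_socket are hypothetical and not part of libsdp.

```c
#include <sys/socket.h>

/* Assumption: SDP is protocol family 27, inferred from the
 * "alias net-pf-27 ib_sdp" modprobe entry; verify on your system. */
#define SDP_FAMILY_ASSUMED 27

/* Choose the address family for a stream socket: SDP or plain TCP/IP. */
int stream_family(int use_sdp)
{
    return use_sdp ? SDP_FAMILY_ASSUMED : AF_INET;
}

/*
 * Create a stream socket over SDP or TCP.
 * Returns the descriptor, or -1 with errno set by socket(2), for example
 * when the ib_sdp module is not loaded and SDP was requested.
 */
int open_stream_socket(int use_sdp)
{
    return socket(stream_family(use_sdp), SOCK_STREAM, 0);
}
```

With the preloaded library, an application keeps calling socket(AF_INET, SOCK_STREAM, 0) and the policy in the configuration file decides whether the swap happens; the explicit form above is only meant to illustrate what changes underneath.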
+ +The library also implements a user-level socket switch. Using a configuration +file, the system administrator can set up the policy that selects the type of +socket to be used. libsdp.so also has the option to allow server sockets to +listen on both SDP and TCP interfaces. The various configurations with SDP/TCP +sockets are explained inside the /etc/libsdp.conf file. + +Configuring SDP: +---------------- + +To load SDP upon boot, edit the file /etc/infiniband/openib.conf and set +"SDP_LOAD=yes". + +Note: For the changes to take effect, run: /etc/init.d/openibd restart + +SDP shares the same IP addresses and interface names as IPoIB. See IPoIB +Configuration (chapter 7) + +How to Know SDP Is Working: +--------------------------- + +Since SDP is a transparent TCP replacement, it can sometimes be difficult to +know that it is working correctly. +To figure out whether traffic is passing through SDP or TCP, check +/proc/net/sdpstats and monitor which counters are running. + +sdpnetstat: +----------- + +The sdpnetstat program can be used to verify both that SDP is loaded and is +being used: + +host1$ sdpnetstat -S + +This command shows all active SDP sockets using the same format as the +traditional netstat program. Without the '-S' option, it shows all the +information that netstat does plus SDP data. + +Assuming that the SDP kernel module is loaded and is being used, then the +output of the command will show something like the following: + +host1$ sdpnetstat -S + +Proto Recv-Q Send-Q Local Address Foreign Address +sdp 0 0 193.168.10.144:34216 193.168.10.125:12865 +sdp 0 884720 193.168.10.144:42724 193.168.10.:filenet-rmi + +The example output above shows two active SDP sockets and contains details +about the connections. 
If the SDP kernel module is not loaded, or it is +loaded but is not being used, then the output of the command will be something +like the following: + +host1$ sdpnetstat -S + +Proto Recv-Q Send-Q Local Address Foreign Address +netstat: no support for `AF INET (tcp)' on this system. + +To verify whether the module is loaded or not, you can use the lsmod command + +Monitoring and Troubleshooting Tools: +------------------------------------- + +SDP has debug support for both the user space libsdp.so library and the ib_sdp +kernel module. +Both can be useful to understand why a TCP socket was not redirected over SDP +and to help find problems in the SDP implementation. -2. After the installer completes, information about the OFED - installation such as the prefix, the kernel version, and - installation parameters can be found by running - /etc/infiniband/info. +User-space SDP debug is controlled by options in the libsdp.conf file. You can also have a local +version and point to it explicitly using the following command: -3. Information on the driver version and source git trees can be found - using the ofed_info utility +host1$ export LIBSDP_CONFIG_FILE=/libsdp.conf +To obtain extensive debug information, you can modify libsdp.conf to have the +log directive produce maximum debug output (provide the min-level flag with +the value 1). More details in the default libsdp.conf installed by OFED. +A non-root user can configure libsdp.so to record function calls and return values in the file +/tmp/libsdp.log. -4. Starting and Verifying the IB Fabric -======================================= +Kernel Space SDP Debug - The SDP kernel module can log detailed trace +information if you enable it using the 'debug_level' variable in the sysfs +filesystem. 
The following command performs this: +host1$ echo 1 > /sys/module/ib_sdp/debug_level + +Note: Depending on the operating system distribution on your machine, you may need +an extra level, 'parameters', in the directory structure, so you may need to direct +the echo command to /sys/module/ib_sdp/parameters/debug_level. + +Turning off kernel debug is done by setting the sysfs variable to zero using +the following command: + +host1$ echo 0 > /sys/module/ib_sdp/debug_level + +To display debug information, use the dmesg command: + +host1$ dmesg + +Environment Variables: +---------------------- + +For the transparent integration with SDP, the following two environment +variables are required: +1. LD_PRELOAD - this environment variable is used to preload libsdp.so and it + should point to the libsdp.so library. The variable should be set by the + system administrator to libsdp.so. +2. LIBSDP_CONFIG_FILE - this environment variable is used to configure the + policy for replacing TCP sockets with SDP sockets. By default it points to: + /etc/libsdp.conf + +Using RDMA: +----------- + +For smaller buffers, the overhead of preparing a user buffer to be RDMA'ed is +too big; therefore, it is more efficient to use BCopy. (Large buffers can also +be sent using RDMA, but they lower CPU utilization.) This mode is called +"ZCopy combined mode". The sendmsg syscall is blocked until the buffer is +transferred to the socket's peer, and the data is copied directly from the user +buffer at the source side to the user buffer at the sink side. + +To set the threshold, use the module parameter sdp_zcopy_thresh. This parameter +can be accessed through sysfs (/sys/module/ib_sdp/parameters/sdp_zcopy_thresh). +Setting it to 0 disables ZCopy. + + +============================================================================== +9. Uninstalling OFED +============================================================================== + +There are two ways to uninstall OFED: +1) Via the installation menu.
+2) Using the script ofed_uninstall.sh. The script is part of ofed-scripts + package. +3) ofed_uninstall.sh script supports an option to execute 'openibd stop' + before removing the RPMs using the flag: --unload-modules + +============================================================================== +10. Upgrading OFED +============================================================================== + +If an old OFED version is installed, it may be upgraded by installing a +new OFED version as described in section 5. Note that if the old OFED +version was loaded before upgrading, you need to restart OFED or reboot +your machine in order to start the new OFED stack. + +============================================================================== +11. Configuration +============================================================================== + +Most of the OFED components can be configured or reconfigured after +the installation by modifying the relevant configuration files. The +list of the modules that will be loaded automatically upon boot can be +found in the /etc/infiniband/openib.conf file. Other configuration +files include: +- SDP configuration file: /etc/libsdp.conf +- OpenSM configuration file: /etc/ofa/opensm.conf (for RedHat) + /etc/sysconfig/opensm (for SuSE) - should be + created manually if required. +- DAPL configuration file: /etc/dat.conf + +See packages Release Notes for more details. + +Note: After the installer completes, information about the OFED + installation such as the prefix, kernel version, and + installation parameters can be found by running + /etc/infiniband/info. + + +============================================================================== +12. Starting and Verifying the IB Fabric +============================================================================== 1) If you rebooted your machine after the installation process completed, IB interfaces should be up.
If you did not reboot your machine, please - enter the following command: /etc/init.d/openibd start + enter the following command: /etc/init.d/openibd restart 2) Check that the IB driver is running on all nodes: ibv_devinfo should print "hca_id: " on the first line. - + 3) Make sure that a Subnet Manager is running by invoking the sminfo utility. If an SM is not running, sminfo prints: sminfo: iberror: query failed @@ -185,10 +613,9 @@ Notes: (or LD_PRELOAD='stack_prefix'/lib64/libsdp.so on 64 bit machines) The default 'stack_prefix' is /usr - -5. MPI (Message Passing Interface) -================================== - +============================================================================== +13. MPI (Message Passing Interface) +============================================================================== In Step 2 of the main menu of install.pl, options 2, 3 and 4 can install one or more MPI stacks. Multiple MPI stacks can be installed simultaneously -- they will not conflict with each other. @@ -205,16 +632,31 @@ are located under: /mpi///tests/. Please see MPI_README.txt for more details on each MPI package and how to run the tests. +============================================================================== +14. Related Documentation +============================================================================== + +OFED documentation is located in the ofed-docs RPM. After +installation the documents are located under the directory: +/usr/share/doc/ofed-docs-x.x.x for RedHat +/usr/share/doc/packages/ofed-docs-x.x.x for SuSE + +Documents list: -6. Related Documentation -======================== -1) Release Notes for OFED Distribution components are to be found under - OFED-1.5.2/docs and, after the package installation, under - /usr/share/doc/ofed-docs-1.5.2 for RedHat - /usr/share/doc/packages/ofed-docs-1.5.2 for SuSE. -2) For a detailed installation guide, see OFED_Installation_Guide.txt. 
-3) For more information, please visit the OFED web-page http://www.openfabrics.org
+   o README.txt
+   o OFED_Installation_Guide.txt
+   o MPI_README.txt
+   o Examples of configuration files
+   o OFED_tips.txt
+   o HOWTO.build_ofed
+   o All release notes and README files
+For more information, please visit the OpenFabrics web site:
+   http://www.openfabrics.org
-For more information contact your vendor.
+open-iscsi documentation is located at:
+- RedHat: /usr/share/doc/iscsi-initiator-utils-
+- SuSE: /usr/share/doc/packages/open-iscsi
+For more information, please visit the open-iscsi web site:
+   http://www.open-iscsi.org
diff --git a/RoCEE_README.txt b/RoCEE_README.txt
deleted file mode 100644
index ab6a826..0000000
--- a/RoCEE_README.txt
+++ /dev/null
@@ -1,184 +0,0 @@
-===============================================================================
-                  OFED-1.5.1 RoCEE Support README
-                          February 2010
-===============================================================================
-
-Contents:
-=========
-1. Overview
-2. Software Dependencies
-3. User Guidelines
-4. Ported Applications
-5. Gid tables
-6. Using VLANs
-7. Statistic counters
-8. Firmware Requirements
-9. Supported hardware
-10. Added features
-11. Known Issues
-
-
-1. Overview
-===========
-RDMA over Converged Enhanced Ethernet (RoCEE) allows InfiniBand (IB) transport
-over Ethernet networks. It encapsulates IB transport and GRH headers in
-Ethernet packets bearing a dedicated ether type.
-While the use of GRH is optional within IB subnets, it is mandatory when using
-RoCEE. Verbs applications written over IB verbs should work seamlessly, but
-they require provisioning of GRH information when creating address vectors. The
-library and driver are modified to provide for mapping from GID to MAC
-addresses required by the hardware.
-
-2. Software Dependencies
-========================
-In order to use RoCEE over Mellanox ConnectX(R) hardware, the mlx4_en driver
-must be loaded.
Please refer to MLNX_EN_README.txt for further details.
-
-
-3. User Guidelines
-==================
-Since RoCEE encapsulates InfiniBand traffic in Ethernet frames, the
-corresponding net device must be up and running. In case of Mellanox
-hardware, mlx4_en must be loaded and the corresponding interface configured.
-- Make sure mlx4_en.ko is loaded
-- Make sure an IP address has been configured to this interface
-- Run "ibv_devinfo". There is a new field named "link_layer" which can be
-  either "Ethernet" or "IB". If the value is IB, then you need to use
-  connectx_port_config to change the ConnectX ports designation to eth (see
-  mlx4_release_notes.txt for details)
-- Configure the IP address of the interface so that the link will become
-  active
-- All IB verbs applications which run over IB verbs should work on RoCEE
-  links as long as they use GRH headers (that is, as long as they specify use
-  of GRH in their address vector)
-- rdma_cm applications working over RoCEE will have the TOS field set to a
-  default value of 3. The default value is given as a module parameter to
-  rdma_cm:
-  def_prec2sl:Default value for SL priority with RoCE. Valid values 0 - 7 (int).
-
-
-4. Ported Applications
-======================
-- ibv_*_pingpong examples have been ported too. The user must specify the GID
-  of the remote peer using the new '-g' option. The GID has the same format as
-  that in /sys/class/infiniband/mlx4_0/ports/1/gids/0
-
-- Note: Care should be taken when using ibv_ud_pingpong. The default message
-  size is 2K, which is likely to exceed the MTU of the RoCEE link. Use
-  ibv_devinfo to inspect the link MTU and specify an appropriate message size
-
-- All rdma_cm applications should work seamlessly without any change
-
-- libsdp works without any change
-
-- Performance tests have been ported
-
-
-5. Gid tables
-=============
-With RoCEE, there may be several entries in a port's GID table.
The first entry
-always contains the IPv6 link local address of the corresponding ethernet
-interface. The link local address is formed in the following way:
-
-gid[0..7] = fe80000000000000
-gid[8] = mac[0] ^ 2
-gid[9] = mac[1]
-gid[10] = mac[2]
-gid[11] = ff
-gid[12] = fe
-gid[13] = mac[3]
-gid[14] = mac[4]
-gid[15] = mac[5]
-
-If VLAN is supported by the kernel, and there are VLAN interfaces on the main
-ethernet interface (the interface that the IB port is tied to), each such VLAN
-will appear as a new GID in the port's GID table. The format of the GID entry
-will be identical to the one described above with the following change:
-
-gid[11] = VLAN ID high byte (4 MS bits).
-gid[12] = VLAN ID low byte
-
-Please note that VLAN ID is 12 bits.
-
-Priority pause frames
----------------------
-Tagged ethernet frames carry a 3 bit priority field. The value of this field is
-derived from the IB SL field by taking the 3 LS bits of the SL field.
-
-
-6. Using VLANs
-==============
-In order for RoCEE traffic to use VLAN tagged frames, the user has to specify
-GID table entries that are derived from VLAN devices, when creating address
-vectors. Consider the example below:
-
-6.1 Make sure VLAN support is enabled by the kernel. Usually this requires
-loading the 8021q module.
-- modprobe 8021q
-
-6.2 Add a VLAN device
-- vconfig add eth2 7
-
-6.3 Assign IP address to the VLAN interface
-- ifconfig eth2.7 7.10.11.12
-suppose this created a new entry in the GID table in index 1.
-
-6.4 verbs test:
-server: ibv_rc_pingpong -g 1
-client: ibv_rc_pingpong -g 1 server
-
-6.5 For rdma_cm applications, the user only needs to specify an IP address of a
-VLAN device for the traffic to go with that VLAN tagged frames.
-
-7. Statistic counters
-=====================
-RoCEE traffic is counted and can be read from the sysfs counters in the same
-manner as it is done for regular Infiniband devices.
Only the following
-counters are supported:
-- port_xmit_packets
-- port_rcv_packets
-- port_rcv_data
-- port_xmit_data
-
-For example, to read the number of transmitted packets on port 2 of device
-mlx4_1, one needs to read the file:
-/sys/class/infiniband/mlx4_1/ports/2/counters/port_xmit_packets
-
-Note: RoCEE traffic will not show in the associated Ethernet device's counters
-since it is offloaded by the hardware and does not go through Ethernet network
-driver.
-
-
-8. Firmware Requirements
-========================
-RoCEE has limited support with firmware 2.7.700 and will be fully supported
-with firmware 2.8.000.
-
-
-9. Supported hardware
-=====================
-Currently, ConnectX B0 hardware is supported. A0 hardware may have issues.
-
-
-10. Added features
-==================
-ibdev2netdev is a utility that displays the association between an HCA's port
-and the network interface bound to it. Example run:
-
-sw417:/usr/src/packages/SOURCES/ofa_kernel-1.5.2 # ibdev2netdev
-mlx4_0 port 1 ==> ib0 (Down)
-mlx4_0 port 2 ==> ib1 (Down)
-mlx4_1 port 1 ==> eth2 (Up)
-mlx4_1 port 2 ==> eth3 (Up)
-
-
-
-11. Known Issues
-===============
-- PowerPC and ia64 architectures are not supported. x32 architectures were
-  not tested.
-
-- SRP is not supported.
-
-- UD QPs that send traffic with VLAN tags (e.g. 802.1q tagged frames) do not
-  work. This will be fixed in a subsequent release.
diff --git a/SRPT_README.txt b/SRPT_README.txt
deleted file mode 100644
index ac16fb7..0000000
--- a/SRPT_README.txt
+++ /dev/null
@@ -1,223 +0,0 @@
-SCSI RDMA Protocol (SRP) Target driver for Linux
-=================================================
-
-SRP Target driver is designed to work directly on top of OpenFabrics
-OFED-1.x software stack (http://www.openfabrics.org) or Infiniband
-drivers in Linux kernel tree (kernel.org).
It also interfaces with
-Generic SCSI target mid-level driver - SCST (http://scst.sourceforge.net)
-
-By interfacing with SCST driver we are able to work and support a lot IO
-modes on real or virtual devices in the backend
-
-1. scst_disk -- interfacing with scsi sub-system to claim and export real
-   scsi devices ie. disks, hardware raid volumes, tape library as SRP's luns
-
-2. scst_vdisk -- fileio and blockio modes. This allows you to turn software
-   raid volumes, LVM volumes, IDE disks, block devices and normal files into
-   SRP's luns
-
-3. NULLIO mode will allow you to measure the performance without sending IOs
-   to *real* devices
-
-
-Prerequisites
--------------
-0. Supported distributions: RHEL 5.2/5.3/5.4, SLES 10 sp2/sp3, SLES 11
-
-NOTES: On distribution default kernels, you can run scst_vdisk blockio mode
-       to have good performance.
-
-       It is required to patch and recompile the kernel to run scst_disk
-       ie. scsi pass-thru mode
-       OR
-       You have to compile scst with -DSTRICT_SERIALIZING enabled and this
-       does not yield good performance.
-
-1. Download and install SCST driver (supported version 1.0.1.1)
-
-1a. Download scst-1.0.1.1.tar.gz from this URL
-    http://scst.sourceforge.net/downloads.html
-
-1b. untar and install scst-1.0.1.1
-
-    $ tar zxvf scst-1.0.1.1.tar.gz
-    $ cd scst-1.0.1.1
-
-    THIS STEP IS SPECIFIC FOR SLES 10 sp2/sp3 distributions:
-
-    $ patch -p1 -i /docs/scst/scst_sles10_sp2.patch
-
-    For all distributions:
-
-    $ make && make install
-
-NOTES: FOR SLES 11 distribution, skip next step (step 1c) and go directly to
-       step (2)
-
-1c. patch scst.h header file with scst.patch
-
-    $ cd /usr/local/include/scst
-    $ patch -p1 -i /docs/scst/scst.patch
-
-
-2.
Download/install OFED-1.5.1 package - SRP target is part of OFED package
-
-NOTES: if your system already have OFED stack installed, you need to remove
-       the previous built of kernel-ib RPMs and reinstall
-
-    $ cd ~/OFED-1.5.1
-    $ rm RPMS/*/*/kernel-ib*
-    $ ./install.pl -c ofed.conf
-
-    Make sure that srpt=y in the ofed.conf
-
-2a. download OFED packages from this URL
-    http://www.openfabrics.org/downloads/OFED/OFED-1.5.1/
-
-2b. install OFED - remember to choose srpt=y
-
-    $ cd ~/OFED-1.5.1
-    $ ./install.pl
-
-
-How-to run
------------
-
-A. On srp target machine
-
-A1. Please refer to SCST's README for loading scst driver and its dev_handlers
-    drivers (scst_disk, scst_vdisk block or file IO mode, nullio, ...)
-    SCST's README locates in ~/scst-1.0.1.1/ directory
-
-NOTES: In any mode you always need to have lun 0 in any group's device list
-       Then you can have any lun number following lun 0 (it does not required
-       have lun number in order except that the first lun is always 0)
-
-       Setting SRPT_LOAD=yes in /etc/infiniband/openib.conf is not good enough
-       It only load ib_srpt module and does not load scst and its dev_handlers
-
-       SCST's scst_disk module (pass-thru mode) does not run on default
-       distribution kernels (kernels come with RHEL 5.2/5.3/5.4 & SLES 11)
-       because it requires to patch and recompile the kernel. It can only
-       run with vanilla kernels.
-
-Example 1: working with VDISK BLOCKIO mode
-           (using md0 device, sda, and cciss/c1d0)
-a. modprobe scst
-b. modprobe scst_vdisk
-c. echo "open vdisk0 /dev/md0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
-d. echo "open vdisk1 /dev/sda BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
-e. echo "open vdisk2 /dev/cciss/c1d0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
-f. echo "add vdisk0 0" >/proc/scsi_tgt/groups/Default/devices
-g. echo "add vdisk1 1" >/proc/scsi_tgt/groups/Default/devices
-h. echo "add vdisk2 2" >/proc/scsi_tgt/groups/Default/devices
-
-Example 2: working with real back-end scsi disks in scsi pass-thru mode
-a.
modprobe scst
-b. modprobe scst_disk
-c. cat /proc/scsi_tgt/scsi_tgt
-ibstor00:~ # cat /proc/scsi_tgt/scsi_tgt
-Device (host:ch:id:lun or name) Device handler
-0:0:0:0 dev_disk
-4:0:0:0 dev_disk
-5:0:0:0 dev_disk
-6:0:0:0 dev_disk
-7:0:0:0 dev_disk
-
-Now you want to exclude the first scsi disk and expose the last 4 scsi disks
-as IB/SRP luns for I/O
-
-echo "add 4:0:0:0 0" >/proc/scsi_tgt/groups/Default/devices
-echo "add 5:0:0:0 1" >/proc/scsi_tgt/groups/Default/devices
-echo "add 6:0:0:0 2" >/proc/scsi_tgt/groups/Default/devices
-echo "add 7:0:0:0 3" >/proc/scsi_tgt/groups/Default/devices
-
-Example 3: working with scst_vdisk FILEIO mode
-           (using md0 device and file 10G-file)
-a. modprobe scst
-b. modprobe scst_vdisk
-c. echo "open vdisk0 /dev/md0" > /proc/scsi_tgt/vdisk/vdisk
-d. echo "open vdisk1 /10G-file" > /proc/scsi_tgt/vdisk/vdisk
-e. echo "add vdisk0 0" >/proc/scsi_tgt/groups/Default/devices
-f. echo "add vdisk1 1" >/proc/scsi_tgt/groups/Default/devices
-
-A2. modprobe ib_srpt
-
-
-B. On initiator machines you can manually do the following steps:
-
-B1. modprobe ib_srp
-B2. ibsrpdm -c -d /dev/infiniband/umadX
-    (to discover new SRP target)
-    umad0: port 1 of the first HCA
-    umad1: port 2 of the first HCA
-    umad2: port 1 of the second HCA
-B3. echo {new target info} > /sys/class/infiniband_srp/srp-mthca0-1/add_target
-B4. fdisk -l (will show new discovered scsi disks)
-
-Example:
-Assume that you use port 1 of first HCA in the system ie. mthca0
-
-[root@lab104 ~]# ibsrpdm -c -d /dev/infiniband/umad0
-id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4,
-dgid=fe800000000000000002c90200226cf5,pkey=ffff,service_id=0002c90200226cf4
-[root@lab104 ~]# echo id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4,
-dgid=fe800000000000000002c90200226cf5,pkey=ffff,service_id=0002c90200226cf4 >
-/sys/class/infiniband_srp/srp-mthca0-1/add_target
-
-OR
-
-+ You can edit /etc/infiniband/openib.conf to load srp driver and srp HA daemon
-automatically ie.
set SRP_LOAD=yes, SRP_DAEMON_ENABLE=yes, and SRPHA_ENABLE=yes
-+ To set up and use high availability feature you need dm-multipath driver
-and multipath tool
-+ Please refer to OFED-1.5.1 SRP's user manual for more in-details instructions
-on how-to enable/use HA feature (OFED-1.5.1/docs/srp_release_notes.txt)
-
-
-Here is an example of srp target setup file
---------------------------------------------
-
-*********************** srpt.sh *****************************************
-#!/bin/sh
-modprobe scst scst_threads=1
-modprobe scst_vdisk scst_vdisk_ID=100
-
-echo "open vdisk0 /dev/cciss/c1d0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
-echo "open vdisk1 /dev/sdb BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
-echo "open vdisk2 /dev/sdc BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
-echo "open vdisk3 /dev/sdd BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
-echo "add vdisk0 0" > /proc/scsi_tgt/groups/Default/devices
-echo "add vdisk1 1" > /proc/scsi_tgt/groups/Default/devices
-echo "add vdisk2 2" > /proc/scsi_tgt/groups/Default/devices
-echo "add vdisk3 3" > /proc/scsi_tgt/groups/Default/devices
-
-modprobe ib_srpt
-
-echo "add "mgmt"" > /proc/scsi_tgt/trace_level
-echo "add "mgmt_dbg"" > /proc/scsi_tgt/trace_level
-echo "add "out_of_mem"" > /proc/scsi_tgt/trace_level
-
-*********************** End srpt.sh **************************************
-
-
-How-to unload/shutdown
------------------------
-
-1. Unload ib_srpt
-   $ modprobe -r ib_srpt
-2. Unload scst and its dev_handlers
-   $ modprobe -r scst_vdisk scst
-3. Unload ofed
-   $ /etc/rc.d/openibd stop
-
-===========================================================================
-Known Issues
-===========================================================================
-
-- With active connections/sessions and active I/Os, unload ib_srpt driver
-  will randomly fail and got stuck.
-
-- With active connections/sessions with active I/Os, reboot system will
-  randomly get stuck.
-
diff --git a/create_Module.symvers.sh b/create_Module.symvers.sh
deleted file mode 100755
index 5b2d76d..0000000
--- a/create_Module.symvers.sh
+++ /dev/null
@@ -1,64 +0,0 @@
-#!/bin/bash
-#
-# Copyright (c) 2006 Mellanox Technologies. All rights reserved.
-# Copyright (c) 2004, 2005, 2006 Voltaire, Inc. All rights reserved.
-#
-# This Software is licensed under one of the following licenses:
-#
-# 1) under the terms of the "Common Public License 1.0" a copy of which is
-#    available from the Open Source Initiative, see
-#    http://www.opensource.org/licenses/cpl.php.
-#
-# 2) under the terms of the "The BSD License" a copy of which is
-#    available from the Open Source Initiative, see
-#    http://www.opensource.org/licenses/bsd-license.php.
-#
-# 3) under the terms of the "GNU General Public License (GPL) Version 2" a
-#    copy of which is available from the Open Source Initiative, see
-#    http://www.opensource.org/licenses/gpl-license.php.
-#
-# Licensee has the right to choose one of the above licenses.
-#
-# Redistributions of source code must retain the above copyright
-# notice and one of the license notices.
-#
-# Redistributions in binary form must reproduce both the above copyright
-# notice, one of the license notices in the documentation
-# and/or other materials provided with the distribution.
-#
-# Description: creates Module.symvers file for InfiniBand modules
-
-KVERSION=${KVERSION:-$(uname -r)}
-MOD_SYMVERS=./Module.symvers
-SYMS=/tmp/syms
-
-echo MODULES_DIR=${MODULES_DIR-:./}
-
-if [ -f ${MOD_SYMVERS} -a !
-f ${MOD_SYMVERS}.save ]; then
-    mv ${MOD_SYMVERS} ${MOD_SYMVERS}.save
-fi
-rm -f $MOD_SYMVERS
-rm -f $SYMS
-
-for mod in $(find ${MODULES_DIR} -name '*.ko') ; do
-    nm -o $mod |grep __crc >> $SYMS
-    n_mods=$((n_mods+1))
-done
-
-n_syms=$(wc -l $SYMS |cut -f1 -d" ")
-echo Found $n_syms OFED kernel symbols in $n_mods modules
-n=1
-
-while [ $n -le $n_syms ] ; do
-    line=$(head -$n $SYMS|tail -1)
-
-    line1=$(echo $line|cut -f1 -d:)
-    line2=$(echo $line|cut -f2 -d:)
-    file=$(echo $line1| sed -e 's@./@@' -e 's@.ko@@' -e "s@$PWD/@@")
-    crc=$(echo $line2|cut -f1 -d" ")
-    sym=$(echo $line2|cut -f3 -d" ")
-    echo -e "0x$crc\t$sym\t$file" >> $MOD_SYMVERS
-    n=$((n+1))
-done
-
-echo ${MOD_SYMVERS} created.
diff --git a/cxgb3_release_notes.txt b/cxgb3_release_notes.txt
deleted file mode 100644
index 9a993c8..0000000
--- a/cxgb3_release_notes.txt
+++ /dev/null
@@ -1,352 +0,0 @@
-            Open Fabrics Enterprise Distribution (OFED)
-                  CHELSIO T3 RNIC RELEASE NOTES
-                         September 2010
-
-
-The iw_cxgb3 and cxgb3 modules provide RDMA and NIC support for the
-Chelsio S series adapters. Make sure you choose the 'cxgb3' and
-'libcxgb3' options when generating your ofed rpms.
-
-============================================
-New for ofed-1.5.2
-============================================
-
-- Bug fixes. Various upstream bug fixes have been included in this
-release.
-
-============================================
-Enabling Various MPIs
-============================================
-
-For OpenMPI, Intel MPI, HP MPI, and Scali MPI: you must set the iw_cxgb3
-module option peer2peer=1 on all systems. This can be done by writing
-to the /sys/module file system during boot. EG:
-
-# echo 1 > /sys/module/iw_cxgb3/parameters/peer2peer
-
-Or you can add the following line to /etc/modprobe.conf to set the option
-at module load time:
-
-options iw_cxgb3 peer2peer=1
-
-For Intel MPI, HP MPI, and Scali MPI: Enable the chelsio device by adding
-an entry to /etc/dat.conf for the chelsio interface.
For instance,
-if your chelsio interface name is eth2, then the following line adds
-a DAT version 1.2 and 2.0 devices named "chelsio" and "chelsio2" for
-that interface:
-
-chelsio u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth2 0" ""
-chelsio2 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
-
-=============
-Intel MPI:
-=============
-
-The following env vars enable Intel MPI version 3.1.038. Place these
-in your user env after installing and setting up Intel MPI:
-
-export RSH=ssh
-export DAPL_MAX_INLINE=64
-export I_MPI_DEVICE=rdssm:chelsio
-export MPIEXEC_TIMEOUT=180
-export MPI_BIT_MODE=64
-
-Logout & log back in.
-
-Populate mpd.hosts with node names.
-Note: The hosts in this file should be Chelsio interface IP addresses.
-
-Note: I_MPI_DEVICE=rdssm:chelsio assumes you have an entry in
-/etc/dat.conf named "chelsio".
-
-Note: MPIEXEC_TIMEOUT value might be required to increase if heavy traffic
-is going across the systems.
-
-Contact Intel for obtaining their MPI with DAPL support.
-
-To run Intel MPI applications:
-
-  mpdboot -n -r ssh --ncpus=
-  mpiexec -ppn -n
-
-
-=============
-HP MPI:
-=============
-
-The following env vars enable HP MPI version 2.03.01.00. Place these
-in your user env after installing and setting up HP MPI:
-
-export MPI_ROOT=/opt/hpmpi
-export PATH=$MPI_ROOT/bin:/opt/bin:$PATH
-export MANPATH=$MANPATH:$MPI_ROOT/share/man
-
-Log out & log back in.
-
-To run HP MPI applications, use these mpirun options:
-
--prot -e DAPL_MAX_INLINE=64 -UDAPL
-
-EG:
-
-$ mpirun -prot -e DAPL_MAX_INLINE=64 -UDAPL -hostlist r1-iw,r2-iw ~/tests/presta-1.4.0/glob
-
-Where r1-iw and r2-iw are hostnames mapping to the chelsio interfaces.
-
-Also this assumes your first entry in /etc/dat.conf is for the chelsio
-device.
-
-Contact HP for obtaining their MPI with DAPL support.
-
-=============
-Scali MPI:
-=============
-
-The following env vars enable Scali MPI.
Place these in your user env
-after installing and setting up Scali MPI for running over IWARP:
-
-export DAPL_MAX_INLINE=64
-export SCAMPI_NETWORKS=chelsio
-export SCAMPI_CHANNEL_ENTRY_COUNT="chelsio:128"
-
-Log out & log back in.
-
-Note: SCAMPI_NETWORKS=chelsio assumes you have an entry in /etc/dat.conf
-named "chelsio".
-
-Note: SCAMPI supports only dapl 1.2 library not dapl 2.0
-
-Contact Scali for obtaining their MPI with DAPL support.
-
-To run SCALI MPI applications:
-
-  mpimon --
-
-Note: is the number of processes to run on the node Note:
- should be the IP of Chelsio's interface
-
-=============
-OpenMPI:
-=============
-
-OpenMPI iWARP support is only available in OpenMPI version 1.3 or greater.
-
-Open MPI will work without any specific configuration via the openib btl.
-Users wishing to performance tune the configurable options may wish to
-inspect the receive queue values. Those can be found in the "Chelsio T3"
-section of mca-btl-openib-hca-params.ini.
-
-Note: OpenMPI version 1.3 does not support newer Chelsio card with device
-ID 0x0035 and 0x0036. To use those cards add the device id of the cards
-in the "Chelsio T3" section of mca-btl-openib-hca-params.ini file.
-
-To run OpenMPI applications:
-
-  mpirun --host , -mca btl openib,sm,self
-
-=============
-MVAPICH2:
-=============
-
-The following env vars enable MVAPICH2 version 1.4-2. Place these
-in your user env after installing and setting up MVAPICH2 MPI:
-
-export MVAPICH2_HOME=/usr/mpi/gcc/mvapich2-1.4/
-export MV2_USE_IWARP_MODE=1
-export MV2_USE_RDMA_CM=1
-
-On each node, add this to the end of /etc/profile.
-
-  ulimit -l 999999
-
-On each node, add this to the end of /etc/init.d/sshd and restart sshd.
-
-  ulimit -l 999999
-  % service sshd restart
-
-Verify the ulimit changes worked. These should show '999999':
-
-  % ulimit -l
-  % ssh ulimit -l
-
-Note: You may have to restart sshd a few times to get it to work.
-
-Create mpd.hosts with list of hostname or ipaddrs in the cluster.
They
-should be names/addresses that you can ssh to without passwords. (See
-Passwordless SSH Setup).
-
-On each node, create /etc/mv2.conf with a single line containing the
-IP address of the local T3 interface. This is how MVAPICH2 picks which
-interface to use for RDMA traffic.
-
-On each node, edit /etc/hosts file. Comment the entry if there is an
-entry with 127.0.0.1 IP Address and local host name. Add an entry for
-corporate IP address and local host name (name that you have given in
-mpd.hosts file) in /etc/hosts file.
-
-To run MVAPICH2 application:
-
-  mpirun_rsh -ssh -np 8 -hostfile mpd.hosts
-
-============================================
-Loadable Module options:
-============================================
-
-The following options can be used when loading the iw_cxgb3 module to
-tune the iWARP driver:
-
-cong_flavor - set the congestion control algorithm. Default is 1.
-              0 == Reno
-              1 == Tahoe
-              2 == NewReno
-              3 == HighSpeed
-
-snd_win - set the TCP send window in bytes. Default is 32kB.
-
-rcv_win - set the TCP receive window in bytes. Default is 256kB.
-
-crc_enabled - set whether MPA CRC should be negotiated. Default is 1.
-
-markers_enabled - set whether to request receiving MPA markers. Default is
-                  0; do not request to receive markers.
-
-                  NOTE: The Chelsio RNIC fully supports markers, but
-                  the current OFA RDMA-CM doesn't provide an API for
-                  requesting either markers or crc to be negotiated. Thus
-                  this functionality is provided via module parameters.
-
-mpa_rev - set the MPA revision to be used. Default is 1, which is
-          spec compliant. Set to 0 to connect with the Ammasso 1100
-          rnic.
-
-ep_timeout_secs - set the number of seconds for timing out MPA start up
-                  negotiation and normal close. Default is 60.
-
-peer2peer - Enables connection setup changes to allow peer2peer
-            applications to work over chelsio rnics.
This enables
-            the following applications:
-                Intel MPI
-                HP MPI
-                Open MPI
-                Scali MPI
-                MVAPICH2
-            Set peer2peer=1 on all systems to enable these
-            applications.
-
-The following options can be used when loading the cxgb3 module to
-tune the NIC driver:
-
-msi - whether to use MSI or MSI-X. Default is 2.
-      0 = only pin
-      1 = only MSI or pin
-      2 = use MSI/X, MSI, or pin, based on system
-
-============================================
-Updating Firmware:
-============================================
-
-This release requires firmware version 7.10.0, and Protocol SRAM
-version 1.1.0. These versions are included in the ofed-1.5.2 release
-and will be automatically loaded when the cxgb3 module is loaded and
-the interface configured. To load later/newer versions of the firmware,
-follow this procedure:
-
-If your distro/kernel supports firmware loading, you can place the chelsio
-firmware and psram images in /lib/firmware/cxgb3, then unload and reload
-the cxgb3 module to get the new images loaded. If this does not work,
-then you can load the firmware images manually:
-
-Obtain the cxgbtool tool and the update_eeprom.sh script from Chelsio.
-
-To build cxgbtool:
-
-# cd
-# make && make install
-
-Then load the cxgb3 driver:
-
-# modprobe cxgb3
-
-Now note the ethernet interface name for the T3 device. This can be
-done by typing 'ifconfig -a' and noting the interface name for the
-interface with a HW address that begins with "00:07:43". Then load the
-new firmware and eeprom file:
-
-# cxgbtool ethxx loadfw
-# update_eeprom.sh ethxx
-# reboot
-
-============================================
-Testing connectivity with ping and rping:
-============================================
-
-Configure the ethernet interfaces for your cxgb3 device. After you
-modprobe iw_cxgb3 you will see one or two ethernet interfaces for the
-T3 device. Configure them with an appropriate ip address, netmask, etc.
-You can use the Linux ping command to test basic connectivity via the
-T3 interface.
-
-To test RDMA, use the rping command that is included in the librdmacm-utils
-rpm:
-
-On the server machine:
-
-# rping -s -a 0.0.0.0 -p 9999
-
-On the client machine:
-
-# rping -c -VvC10 -a server_ip_addr -p 9999
-
-You should see ping data like this on the client:
-
-ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
-ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrs
-ping data: rdma-ping-2: CDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrst
-ping data: rdma-ping-3: DEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu
-ping data: rdma-ping-4: EFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuv
-ping data: rdma-ping-5: FGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvw
-ping data: rdma-ping-6: GHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwx
-ping data: rdma-ping-7: HIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
-ping data: rdma-ping-8: IJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz
-ping data: rdma-ping-9: JKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA
-client DISCONNECT EVENT...
-#
-
-============================================
-Additional Notes and Issues
-============================================
-
-1) To run uDAPL over the chelsio device, you must export this environment
-variable:
-
-  export DAPL_MAX_INLINE=64
-
-2) If you have a multi-homed host and the physical ethernet networks
-are bridged, or if you have multiple chelsio rnics in the system, then
-you need to configure arp to only send replies on the interface with
-the target ip address:
-
-  sysctl -w net.ipv4.conf.all.arp_ignore=2
-
-3) If you are building OFED against a kernel.org kernel later than
-2.6.20, then make sure your kernel is configured with the cxgb3 and
-iw_cxgb3 modules enabled. This forces the kernel to pull in the genalloc
-allocator, which is required for the OFED iw_cxgb3 module.
Make sure
-these config options are included in your .config file:
-
-  CONFIG_CHELSIO_T3=m
-  CONFIG_INFINIBAND_CXGB=m
-
-4) If you run the RDMA latency test using the ib_rdma_lat program, make
-sure you use the following command lines to limit the amount of inline
-data to 64:
-
-  server: ib_rdma_lat -c -I 64
-  client: ib_rdma_lat -c -I 64 server_ip_addr
-
-5) If you're running NFSRDMA over Chelsio's T3 RNIC and your clients are
-using a 64KB page size (like PPC64 and IA64 systems) and your server is
-using a 4KB page size (like i386 and X86_64), then you need to mount the
-server using rsize=32768,wsize=32768 to avoid overrunning the Chelsio
-RNIC fast register limits. This is a known firmware limitation in the
-Chelsio RNIC.
diff --git a/diags_release_notes.txt b/diags_release_notes.txt
deleted file mode 100644
index 9db4cbf..0000000
--- a/diags_release_notes.txt
+++ /dev/null
@@ -1,89 +0,0 @@
-            Open Fabrics Enterprise Distribution (OFED)
-            Diagnostic Tools in OFED 1.5 Release Notes
-
-                         December 2009
-
-
-Repo: git://git.openfabrics.org/~sashak/management/management.git
-URL: http://www.openfabrics.org/downloads/management
-
-
-General
--------
-Model of operation: All diag utilities use direct MAD access to perform their
-operations. Operations that require QP0 mads only may use direct routed
-mads, and therefore can work even in unconfigured subnets. Almost all
-utilities can operate without accessing the SM, unless GUID to lid translation
-is required. The only exception to this is saquery which requires the SM.
-
-
-Dependencies
-------------
-Most diag utilities depend on libibmad and libibumad.
-All diag utilities depend on the ib_umad kernel module.
-
-
-Multiple port/Multiple CA support
----------------------------------
-When no IB device or port is specified (see the "local umad parameters" below),
-the libibumad library selects the port to use by the following criteria:
-1. the first port that is ACTIVE.
-2.
if not found, the first port that is UP (physical link up).
-
-If a port and/or CA name is specified, the libibumad library attempts to
-satisfy the user request, and will fail if it cannot do so.
-
-For example:
-    ibaddr                  # use the 'best port'
-    ibaddr -C mthca1        # pick the best port from mthca1 only.
-    ibaddr -P 2             # use the second (active/up) port from the
-                              first available IB device.
-    ibaddr -C mthca0 -P 2   # use the specified port only.
-
-
-Common options & flags
-----------------------
-Most diagnostics take the following flags. The exact list of supported
-flags per utility can be found in the usage message and can be displayed
-using util_name -h syntax.
-
-# Debugging flags
-    -d  raise the IB debugging level. May be used
-        several times (-ddd or -d -d -d).
-    -e  show umad send receive errors (timeouts and others)
-    -h  display the usage message
-    -v  increase the application verbosity level.
-        May be used several times (-vv or -v -v -v)
-    -V  display the internal version info.
-
-# Addressing flags
-    -D  use directed path address arguments. The path
-        is a comma separated list of out ports.
-        Examples:
-        "0" # self port
-        "0,1,2,1,4" # out via port 1, then 2, ...
-    -G  use GUID address arguments. In most cases, it is the Port GUID.
-        Examples:
-        "0x08f1040023"
-    -s  use 'smlid' as the target lid for SA queries.
-
-# Local umad parameters:
-    -C  use the specified ca_name.
-    -P  use the specified ca_port.
-    -t  override the default timeout for the solicited mads.
-
-
-CLI notation
-------------
-All utilities use the POSIX style notation, meaning that all options (flags)
-must precede all arguments (parameters).
- - -Utilities descriptions ----------------------- -See man pages - - -Bugs Fixed ----------- - diff --git a/ehca_release_notes.txt b/ehca_release_notes.txt deleted file mode 100644 index e1ca30b..0000000 --- a/ehca_release_notes.txt +++ /dev/null @@ -1,113 +0,0 @@ - - Open Fabrics Enterprise Distribution (OFED) - ehca in OFED 1.5.2 Release Notes - - September 2010 - - -Overview --------- -ehca is the low level driver implementation for all IBM GX-based HCAs. - -Supported HCAs --------------- -- GX Dual-port SDR 4x IB HCA -- GX Dual-port SDR 12x IB HCA -- GX Dual-port DDR 4x IB HCA -- GX Dual-port DDR 12x IB HCA - -Available Parameters --------------------- -In order to set ehca parameters, add the following line(s) to /etc/modprobe.conf: - - options ib_ehca = - -whereby is one of the following items: -- debug_level debug level (0: no debug traces (default), 1: with debug traces) -- port_act_time time to wait for port activation (default: 30 sec) -- scaling_code scaling code (0: disable (default), 1: enable) -- open_aqp1 Open AQP1 on startup (default: no) (bool) -- hw_level Hardware level (0: autosensing (default), 0x10..0x14: eHCA, 0x20..0x23: eHCA2) (int) -- nr_ports number of connected ports (-1: autodetect (default), 1: port one only, 2: two ports) (int) -- use_hp_mr Use high performance MRs (default: no) (bool) -- poll_all_eqs Poll all event queues periodically (default: yes) (bool) -- static_rate Set permanent static rate (default: no static rate) (int) -- lock_hcalls Serialize all hCalls made by the driver (default: autodetect) (bool) -- number_of_cqs Max number of CQs which can be allocated (default: autodetect) (int) -- number_of_qps Max number of QPs which can be allocated (default: autodetect) (int) - -New Features ------------- -- None - -Fixed Bugs ofed-1.5.2 ---------------------- -- Fixed automatic detection if hcall locks should be enabled or not - -Fixed Bugs ofed-1.5.1 ---------------------- -- Fixed crash when reading sysfs performance counters 
-- Do not disable IRQs when processing EQs -- Allow query of max_dest_rd_atomic and max_qp_rd_atomic values - -Fixed Bugs ofed-1.5 ---------------------- -- SRQ overflow prevention -- Performance improvements for QP creation -- MAD redirection fix - -Fixed Bugs ofed-1.4.1 ---------------------- -- none - -Fixed Bugs ofed-1.4 ---------------------- -- Reject send work requests only for RESET, INIT and RTR state -- Reject receive work requests if QP is in RESET state -- In case of lost interrupts, trigger EOI to reenable interrupts -- Filter PATH_MIG events if QP was never armed -- Release mutex in error path of alloc_small_queue_page() -- Check idr_find() return value -- Discard double CQE for one WR -- Generate flush status CQ entries -- Don't allow creating UC QP with SRQ -- Fix reported max number of QPs and CQs in systems with >1 adapter -- Reject dynamic memory add/remove when ehca adapter is present -- Remove reference to special QP in case of port activation failure -- Fix locking for shca_list_lock - -Fixed Bugs ofed-1.3.1 ---------------------- -- Support all ibv_devinfo values in query_device() and query_port() -- Prevent posting of SQ WQEs if QP not in RTS -- Remove mr_largepage parameter, ie always enable large page support -- Allocate event queue size depending on max number of CQs and QPs -- Protect QP against destroying until all async events for it are handled - -Fixed Bugs ofed-1.3 -------------------- -- Serialize HCA-related hCalls if necessary -- Fix static rate if path faster than link -- Return physical link information in query_port() -- Fix clipping of device limits to INT_MAX -- Fix issues related to path migration support -- Support more than 4k QPs for userspace and kernelspace -- Prevent sending UD packets to QP0 -- Prevent RDMA-related connection failures on some eHCA2 hardware - -Available backports -------------------- -- RedHat EL5 up4: 2.6.18-164.ELsmp -- RedHat EL5 up5: 2.6.18-194.ELsmp -- SLES11: 2.6.27.19-5.1-smp -- SLES11SP1: 
2.6.32.12-0.7-default -- SLES10SP3: 2.6.16.60-0.54.5 -- kernel.org: 2.6.29-32 - -Known Issues ------------- -1. The port(s) needs to be connected to an active switch port while -loading the ehca device driver. - -2. Dynamic memory operations are tolerated by ehca, but are prevented by -the driver while it is loaded. diff --git a/ib-bonding.txt b/ib-bonding.txt deleted file mode 100644 index 1727d6b..0000000 --- a/ib-bonding.txt +++ /dev/null @@ -1,191 +0,0 @@ -IB Bonding -=============================================================================== - -1. Introduction -2. How to work with interface configuration scripts -2.1 Configuration with initscripts support -2.1.1 Writing network scripts under Redhat-AS4 (Update 6, 7 or 8) -2.1.2 Writing network scripts under Redhat-EL5 -2.2 Configuration with sysconfig support -2.2.1 Writing network scripts under SLES-10 -2.3 Configuring Ethernet slaves - -1. Introduction -------------------------------------------------------------------------------- -ib-bonding is a High Availability solution for IPoIB interfaces. It is based -on the Linux Ethernet Bonding Driver and was adapted to work with IPoIB. -However, the support for IPoIB interfaces is only for the active-backup -mode; other modes should not be used. - -2. How to work with interface configuration scripts -------------------------------------------------------------------------------- -To create an interface configuration script for the ibX and bondX interfaces, -you should use the standard syntax (depending on your OS). - -2.1 Configuration with initscripts support ------------------------------------------- -Note: This feature is available only for Redhat-AS4 (Update 4, Update 5, -Update 6 or Update 7) and for Redhat-EL5 and above. 
- -2.1.1 Writing network scripts under Redhat-AS4 (Update 4, 5, 6 or 7) ------------------------------------------------------------------ -* In the master (bond) interface script add the lines: -TYPE=Bonding -MTU= - -Example: for bond0 (master) the file is named /etc/sysconfig/network-scripts/ifcfg-bond0 -with the following text in the file: - -DEVICE=bond0 -IPADDR=192.168.1.1 -NETMASK=255.255.255.0 -NETWORK=192.168.1.0 -BROADCAST=192.168.1.255 -ONBOOT=yes -BOOTPROTO=none -USERCTL=no -TYPE=Bonding -MTU=65520 - -Note: 65520 is a valid mtu value only if all IPoIB slaves operate in connected -mode and are configured with the same value. For IPoIB slaves that work in -datagram mode, use MTU=2044. If you don't set the correct mtu or don't set mtu at -all (letting it be set to the default value), performance of the -interface might decrease. - -* In the slave (ib) interface script put the following lines: -SLAVE=yes -MASTER= -TYPE=InfiniBand -PRIMARY= - -Example: the script for ib0 (slave) would be named /etc/sysconfig/network-scripts/ifcfg-ib0 -with the following text in the file: - -DEVICE=ib0 -USERCTL=no -ONBOOT=yes -MASTER=bond0 -SLAVE=yes -BOOTPROTO=none -TYPE=InfiniBand -PRIMARY=yes - -Note: If the slave interface is not primary then the line PRIMARY= is not -required and can be omitted. - -After the configuration is saved, restart the network service by running: -/etc/init.d/network restart - -2.1.2 Writing network scripts under Redhat-EL5 ------------------------------------------------ -Follow the instructions in 2.1.1 (Writing network scripts under Redhat-AS4) -with the following changes: -* In the bondX (master) script - the line TYPE=Bonding is not needed. 
-* In the bondX (master) script - you may add more options to the configuration -with the following line -BONDING_OPTS=" primary=ib0 updelay=0 downdelay=0" -* in the ibX (slave) script - the line TYPE=InfiniBand is necessary when using - bonding over devices configured with partitions (p_key) -Example: - ifcfg-ibX.8003 and ifcfg-ibY.8003 must include the TYPE=InfiniBand line in - their configuration files when used as slaves for the bondX device -* in /etc/modprobe.conf add the following lines -alias bond0 bonding -options bond0 miimon=100 mode=1 max_bonds=1 - -If you want more than one bonding interface, name them bond1, bond2... and -just add the necessary lines in /etc/modprobe.conf and change max_bonds=1 to -max_bonds=N where N=number_of_bonding_interfaces - -Note: restarting OFED doesn't keep the bonding configuration via initscripts. -You have to restart the network service in order to recreate the bonding -interface. - -2.2 Configuration with sysconfig support ----------------------------------------- -Note: This feature is available only for SLES-10 and above. - -2.2.1 Writing network scripts under SLES-10 ------------------------------------------------ -* In the master (bond) interface script add the lines: - -BONDING_MASTER=yes -BONDING_MODULE_OPTS="mode=active-backup miimon=" -BONDING_SLAVE0=slave0 -BONDING_SLAVE1=slave1 -MTU= - -Example: for bond0 (master) the file is named /etc/sysconfig/network/ifcfg-bond0 -with the following text in the file: - -BOOTPROTO="static" -BROADCAST="10.0.2.255" -IPADDR="10.0.2.10" -NETMASK="255.255.0.0" -NETWORK="10.0.2.0" -REMOTE_IPADDR="" -STARTMODE="onboot" -BONDING_MASTER="yes" -BONDING_MODULE_OPTS="mode=active-backup miimon=100 primary=ib0 updelay=0 downdelay=0" -BONDING_SLAVE0=ib0 -BONDING_SLAVE1=ib1 -MTU=65520 - -Note: 65520 is a valid mtu value only if all IPoIB slaves operate in connected -mode and are configured with the same value. For IPoIB slaves that work in -datagram mode, use MTU=2044. 
If you don't set the correct mtu or don't set mtu at -all (letting it be set to the default value), performance of the -interface might decrease. - -Note: primary, downdelay and updelay are optional bonding interface -configuration options. You may choose to use them, change them or delete them from the -configuration script (by editing the line that starts with BONDING_MODULE_OPTS) - -* The slave (ib) interface script should look like this: - -BOOTPROTO='none' -STARTMODE='off' -PRE_DOWN_SCRIPT=/etc/sysconfig/network/unenslave.sh - -After the configuration is saved, restart the network service by running: -/etc/init.d/network restart - -2.3 Configuring Ethernet slaves -------------------------------- -It is not possible to have a mix of Ethernet slaves and IPoIB slaves under the -same bonding master. It is possible however that a bonding master of Ethernet -slaves and a bonding master of IPoIB slaves will co-exist in one machine. -To configure Ethernet slaves under a bonding master use the following -instructions (depending on the OS) - -* Under Redhat-AS4 - -Use the same instructions as for IPoIB slaves with the following exceptions - -- In the master configuration file add the line -SLAVEDEV=1 -- In the slave configuration file leave the line -TYPE=InfiniBand -- For Ethernet, it is possible to set parameters of the bonding module in /etc/modprobe.conf -with the following line for example -options bonding miimon=100 mode=1 primary=eth0 -Note that alias names for the bonding module (such as bond0) may not work. - -* Under Redhat-AS5 - -No special instructions are required. - -* Under SLES10 - -When using both types of bonding, it is necessary to update the -MANDATORY_DEVICES environment variable in /etc/sysconfig/network/config with the names -of the InfiniBand devices (ib0, ib1, etc.). Otherwise, bonding devices will be created -before InfiniBand devices at boot time. 
- -Note: If there is more than one Ethernet NIC installed then there might be a -race for the interface name eth0, eth1 etc. This may lead to unexpected -relation between logical and physical devices which may lead to wrong bonding -configuration. This issue may be solved by binding a logical device name (e.g. -eth0) to a physical (hardware) device by specifying the MAC address in the -ethN configuration file. diff --git a/ibacm_release_notes.txt b/ibacm_release_notes.txt deleted file mode 100644 index 4048b39..0000000 --- a/ibacm_release_notes.txt +++ /dev/null @@ -1,144 +0,0 @@ - Open Fabrics Enterprise Distribution (OFED) - IB ACM in OFED 1.5 Release Notes - - July 2010 - - -=============================================================================== -Table of Contents -=============================================================================== -1. Overview -2. Quick Start Guide -3. Operation Details -4. Known Issues - -=============================================================================== -1. Overview -=============================================================================== -The IB ACM package implements and provides a framework for experimental name, -address, and route resolution services over InfiniBand. It is intended to -address connection setup scalability issues running MPI applications on -large clusters. The IB ACM provides information needed to establish a -connection, but does not implement the CM protocol. - -The librdmacm can invoke IB ACM services when built using the --with-ib_acm -option. The IB ACM services tie in under the rdma_resolve_addr, -rdma_resolve_route, and rdma_getaddrinfo routines. For maximum benefit, -the rdma_getaddrinfo routine should be used, however existing applications -should still see significant connection scaling benefits using the calls -available in librdmacm 1.0.11 and previous releases. - -The IB ACM is focused on being scalable and efficient. 
The current -implementation limits network traffic, SA interactions, and centralized -services. ACM supports multiple resolution protocols in order to handle -different fabric topologies. - -The IB ACM package is comprised of two components: the ib_acm service -and a test/configuration utility - ib_acme. Both are userspace components -and are available for Linux and Windows. Additional details are given below. - -=============================================================================== -2. Quick Start Guide -=============================================================================== - -1. Prerequisites: libibverbs and libibumad must be installed. - The IB stack should be running with IPoIB configured. - These steps assume that the user has administrative privileges. -2. Install the IB ACM package - This installs ib_acm and ib_acme. -3. Run ib_acme -A -O - This will generate IB ACM address and options configuration files. - (acm_addr.cfg and acm_opts.cfg) -4. Run ib_acm and leave running. - ib_acm will eventually be converted to a service/daemon, but for now - is a userspace application. Because ib_acm uses the libibumad - interfaces, it should be run with administrative privileges. -5. Optionally, run ib_acme -s -d -v - This will verify that the ib_acm service is running. -6. Install librdmacm using the build option --with-ib_acm. - The librdmacm will automatically use the ib_acm service. - On failures, the librdmacm will fall back to normal resolution. - -=============================================================================== -3. Operation Details -=============================================================================== - -ib_acme: -The ib_acme program serves a dual role. It acts as a utility to test -ib_acm operation and help verify if the ib_acm service and selected -protocol are usable for a given cluster configuration. Additionally, -it automatically generates ib_acm configuration files to assist with -or eliminate manual setup. 
- - -acm configuration files: -The ib_acm service relies on two configuration files. - -The acm_addr.cfg file contains name and address mappings for each IB - endpoint. Although the names in the acm_addr.cfg -file can be anything, ib_acme maps the host name and IP addresses to -the IB endpoints. - -The acm_opts.cfg file provides a set of configurable options for the -ib_acm service, such as timeout, number of retries, logging level, etc. -ib_acme generates the acm_opts.cfg file using static information. A -future enhancement would adjust options based on the current system -and cluster size. - - -ib_acm: -The ib_acm service is responsible for resolving names and addresses to -InfiniBand path information and caching such data. It is currently -implemented as an executable application, but is a conceptual service -or daemon that should execute with administrative privileges. - -The ib_acm implements a client interface over TCP sockets, which is -abstracted by the librdmacm library. One or more back-end protocols are -used by the ib_acm service to satisfy user requests. Although the -ib_acm supports standard SA path record queries on the back-end, it -provides an experimental multicast resolution protocol in hope of -achieving greater scalability. The latter is not usable on all fabric -topologies, specifically ones that may not have reversible paths. -Users should use the ib_acme utility to verify that the multicast protocol -is usable before running other applications. - -Conceptually, the ib_acm service implements an ARP like protocol and either -uses IB multicast records to construct path record data or queries the -SA directly, depending on the selected route protocol. By default, the -ib_acm service uses and caches SA path record queries. - -Specifically, all IB endpoints join a number of multicast groups. -Multicast groups differ based on rates, mtu, sl, etc., and are prioritized. 
-All participating endpoints must be able to communicate on the lowest -priority multicast group. The ib_acm assigns one or more names/addresses -to each IB endpoint using the acm_addr.cfg file. Clients provide source -and destination names or addresses as input to the service, and receive -as output path record data. - -The service maps a client's source name/address to a local IB endpoint. -If a client does not provide a source address, then the ib_acm service -will select one based on the destination and local routing tables. If the -destination name/address is not cached locally, it sends a multicast -request out on the lowest priority multicast group on the local endpoint. -The request carries a list of multicast groups that the sender can use. -The recipient of the request selects the highest priority multicast group -that it can use as well and returns that information directly to the sender. -The request data is cached by all endpoints that receive the multicast -request message. The source endpoint also caches the response and uses -the multicast group that was selected to construct or obtain path record -data, which is returned to the client. - -=============================================================================== -4. Known Issues -=============================================================================== - -The current implementation of the IB ACM has several restrictions: -- The ib_acm is limited in its handling of dynamic changes; - the ib_acm must be stopped and restarted if a cluster is reconfigured. -- Cached data does not time out and is only updated if a new resolution - request is received from a different QPN than a cached request. -- Support for IPv6 has not been verified. -- The number of addresses that can be assigned to a single endpoint is - limited to 4. -- The number of multicast groups that an endpoint can support is limited to 2. 
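The multicast group negotiation described in "3. Operation Details" of the IB ACM notes (the sender lists the groups it can use, the recipient picks the highest priority group it can also use) can be sketched as follows. This is a simplified model under stated assumptions, not ib_acm code: McGroup, pick_group(), and the convention that a larger number means a higher priority are all hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class McGroup:
    """Hypothetical multicast group descriptor (not an ib_acm type)."""
    name: str
    priority: int  # assumption: larger value means higher priority


def pick_group(offered: Iterable[McGroup],
               supported: Set[McGroup]) -> Optional[McGroup]:
    """Recipient side: choose the best group both endpoints can use.

    offered   -- groups listed in the sender's request
    supported -- groups the recipient itself has joined
    Returns None when the two endpoints share no group.
    """
    usable = [g for g in offered if g in supported]
    if not usable:
        return None
    # Select the highest priority group usable by both sides.
    return max(usable, key=lambda g: g.priority)
```

Because the notes require every participating endpoint to be reachable on the lowest priority group, a None result in this model would correspond to a misconfigured fabric rather than a normal outcome.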
- diff --git a/ibutils_release_notes.txt b/ibutils_release_notes.txt deleted file mode 100644 index 83478cc..0000000 --- a/ibutils_release_notes.txt +++ /dev/null @@ -1,74 +0,0 @@ - Open Fabrics InfiniBand Diagnostic Utilities - -------------------------------------------- - -******************************************************************************* -RELEASE: OFED 1.5 -DATE: Dec 2009 - -=============================================================================== -Table of Contents -=============================================================================== -1. Overview -2. New features -3. Major Bugs Fixed -4. Known Issues - -=============================================================================== -1. Overview -=============================================================================== - -The ibutils package provides a set of diagnostic tools that check the health -of an InfiniBand fabric. - -Package components: -ibis: IB interface - A TCL shell that provides interface for sending various - MADs on the IB fabric. This is the component that actually accesses - the IB Hardware. - -ibdm: IB Data Model - A library that provides IB fabric analysis. - -ibmgtsim: An IB fabric simulator. Useful for developing IB tools. - -ibdiag: This package provides 3 tools which provide the user interface - to activate the above functionality: - - ibdiagnet: Performs various quality and health checks on the IB - fabric. - - ibdiagpath: Performs various fabric quality and health checks on - the given links and nodes in a specific path. - - ibdiagui: A GUI wrapper for the above tools. - -=============================================================================== -2. New Features -=============================================================================== - -* New "From the Edge" topology matching algorithm. 
- Integrated into ibtopodiff when run with the flag -e - -* New library - libsysapi - The library is a C API for IBDM C++ objects - -* Added ibnl definition files for Mellanox and Sun IB QDR products - -* Added new feature to ibdiagnet - general device info - -* ibdiagnet can now get port 0 as a parameter (for managed switches). - - -=============================================================================== -3. Major Bugs Fixed -=============================================================================== - -* ibutils: various fixes in build process (dependencies, parallel build, etc) - -* ibdiagnet: fixed crash with -r flag - -* ibdiagnet: fixed regular expression for pkey matching - -* ibdiagnet: ibdiagnet.lst file has device IDs with trailing zeroes - fixed - -=============================================================================== -4. Known Issues -=============================================================================== - -- Ibdiagnet "-wt" option may generate a bad topology file when running on a - cluster that contains complex switch systems. diff --git a/ipath_release_notes.txt b/ipath_release_notes.txt deleted file mode 100644 index 0382fe9..0000000 --- a/ipath_release_notes.txt +++ /dev/null @@ -1,13 +0,0 @@ - Open Fabrics Enterprise Distribution (OFED) - ipath in OFED 1.5 Release Notes - - December 2009 - -====================================================================== -1. Overview -====================================================================== -ipath is the low level driver implementation for the -QLogic HyperTransport HCA only (model QHT7140). - -The qib driver is the currently supported driver for all -PCI-Express based Infiniband HCAs. 
diff --git a/ipoib_release_notes.txt b/ipoib_release_notes.txt deleted file mode 100644 index c332d6c..0000000 --- a/ipoib_release_notes.txt +++ /dev/null @@ -1,483 +0,0 @@ - Open Fabrics Enterprise Distribution (OFED) - IPoIB in OFED 1.5.2 Release Notes - - December 2010 - - -=============================================================================== -Table of Contents -=============================================================================== -1. Overview -2. Known Issues -3. DHCP Support of IPoIB -4. The ib-bonding driver -5. Child interfaces -6. Bug Fixes and Enhancements Since OFED 1.3 -7. Bug Fixes and Enhancements Since OFED 1.3.1 -8. Bug Fixes and Enhancements Since OFED 1.4 -9. Bug Fixes and Enhancements Since OFED 1.4.2 -10. Bug Fixes and Enhancements Since OFED 1.5.0 -11. Bug Fixes and Enhancements Since OFED 1.5.2 -12. Performance tuning - -=============================================================================== -1. Overview -=============================================================================== -IPoIB is a network driver implementation that enables transmitting IP and ARP -protocol packets over an InfiniBand UD channel. The implementation conforms to -the relevant IETF working group's RFCs (http://www.ietf.org). - - -Usage and configuration: -======================== -1. To check the current mode used for outgoing connections, enter: - cat /sys/class/net/ib0/mode -2. To disable IPoIB CM at compile time, enter: - cd OFED-1.5 - export OFA_KERNEL_PARAMS="--without-ipoib-cm" - ./install.pl -3. To change the run-time configuration for IPoIB, enter: - edit /etc/infiniband/openib.conf, change the following parameters: - # Enable IPoIB Connected Mode - SET_IPOIB_CM=yes - # Set IPoIB MTU - IPOIB_MTU=65520 - -4. You can also change the mode and MTU for a specific interface manually. - - To enable connected mode for interface ib0, enter: - echo connected > /sys/class/net/ib0/mode - - To increase MTU, enter: - ifconfig ib0 mtu 65520 - -5. 
Switching between CM and UD mode can be done at run time: - echo datagram > /sys/class/net/ib0/mode sets the mode of ib0 to UD - echo connected > /sys/class/net/ib0/mode sets the mode of ib0 to CM - - -=============================================================================== -2. Known Issues -=============================================================================== -1. If a host has multiple interfaces and (a) each interface belongs to a - different IP subnet, (b) they all use the same InfiniBand Partition, and (c) - they are connected to the same IB Switch, then the host violates the IP rule - requiring different broadcast domains. Consequently, the host may build an - incorrect ARP table. - - The correct setting of a multi-homed IPoIB host is achieved by using a - different PKEY for each IP subnet. If a host has multiple interfaces on the - same IP subnet, then to prevent a peer from building an incorrect ARP entry - (neighbor) set the net.ipv4.conf.X.arp_ignore value to 1 or 2, where X - stands for the IPoIB (non-child) interfaces (e.g., ib0, ib1, etc). This - causes the network stack to send ARP replies only on the interface with the - IP address specified in the ARP request: - - sysctl -w net.ipv4.conf.ib0.arp_ignore=1 - sysctl -w net.ipv4.conf.ib1.arp_ignore=1 - - Or, globally, - - sysctl -w net.ipv4.conf.all.arp_ignore=1 - - To learn more about the arp_ignore parameter, see - Documentation/networking/ip-sysctl.txt. - Note that distributions have the means to make kernel parameters persistent. - -2. There are IPoIB alias lines in /etc/modprobe.d/ib_ipoib.conf which prevent - stopping/unloading the stack (i.e., '/etc/init.d/openibd stop' will fail). - These alias lines cause the drivers to be loaded again by udev scripts. - - Workaround: Change modprobe.conf to set - OFA_KERNEL_PARAMS="--without-modprobe" before running install.pl, or remove - the alias lines from /etc/modprobe.d/ib_ipoib.conf. - -3. 
On SLES 10: - The ib1 interface uses the configuration script of ib0. - - Workaround: Invoke ifup/ifdown using both the interface name and the - configuration script name (example: ifup ib1 ib1). - -4. After a hotplug event, the IPoIB interface falls back to datagram mode, and - MTU is reduced to 2K. - Workaround: Re-enable connected mode and increase MTU manually: - echo connected > /sys/class/net/ib0/mode - ifconfig ib0 mtu 65520 - -5. Since the IPoIB configuration files (ifcfg-ib) are installed under the - standard networking scripts location (RedHat:/etc/sysconfig/network-scripts/ - and SuSE: /etc/sysconfig/network/), the option IPOIB_LOAD=no in openib.conf - does not prevent the loading of IPoIB on boot. - -6. If IPoIB connected mode is enabled, it uses a large MTU for connected mode - messages and a small MTU for datagram (in particular, multicast) messages, - and relies on path MTU discovery to adjust MTU appropriately. Packets sent - in the window before MTU discovery automatically reduces the MTU for a - specific destination will be dropped, producing the following message in the - system log: - "packet len (> ) too long to send, dropping" - - To warn about this, a message is produced in the system log each time MTU is - set to a value higher than 2K. - -7. IPoIB IPv6 support is broken between systems with kernels < 2.6.12 and - systems with kernels >= 2.6.12. The reason is that kernel 2.6.12 puts the link - layer address at an offset of two bytes with respect to older kernels. This - causes the other host to misinterpret the hardware address, resulting in failure - to resolve paths that are based on wrong GIDs. As an example, RH 4.x and RH - 5.x cannot inter-operate. - -8. In connected mode, TCP latency for short messages is larger by approx. 1usec - (~5%) than in datagram mode. As a workaround, use datagram mode. - -9. Single-socket TCP bandwidth for kernels < 2.6.18 is lower than with - newer kernels. 
It is recommended to use kernel 2.6.18 or up for - best IPoIB performance. - -10. Connectivity issues encountered when using IPv6 on ia64 systems. - -11. The IPoIB module uses a Linux implementation for Large Receive Offload - (LRO) in kernel 2.6.24 and later. These kernels require installing the - "inet_lro" module. - -12. ConnectX only: If you have a port configured as ETH and IPoIB is running - in connected mode, and then you change the port type to IB, the IPoIB mode - will change to datagram mode. - -13. When working with iSCSI, you must disable LRO (even if you are working in - connected mode). This is because there is a bug in older kernels which causes - a kernel panic. - -14. IPoIB datagram mode initial packet loss (bug #1287): When the datagram test - gets to packet size 8192 or larger, it always loses the first packet in the - sequence. - Workaround: Increase the number of pending skb's before a neighbor is - resolved (default is 3). This value can be changed with: - sysctl net.ipv4.neigh.ib0.unres_qlen. - -15. IPoIB multicast support is broken in RH4.x kernels. This is because - ndisc_mc_map() does not handle IPoIB hardware addresses. - -16. If bonding uses an IPoIB slave, then un-enslaving all slaves (or downing - them with ifdown) followed by unloading the module ib_ipoib might crash the - kernel. To avoid this leave the IPoIB interfaces enslaved when unloading - ib_ipoib. - -17. On SLES 11, sysconfig scripts override the interface mode and set it to - datagram on each call to ifup, ifdown, etc. To avoid this, add the line - IPOIB_MODE=connected - to the interface configuration file (e.g. ifcfg-ib0) - -18. When installing OFED on a machine that runs kernel 2.6.30 (or another - kernel from kernel.org that OFED supports), the installation script blocks - the installation of ib-bonding since the bonding module that comes with the - kernel has all the functionality to support IPoIB slaves. 
This approach - however doesn't patch the sysconfig (SuSE) or initscripts (RedHat) package, - so the network configuration script may not work properly. - For example, if you install OFED on RHEL5.2 that runs kernel 2.6.30 and - you try to configure and run bonding, you won't be able to restart the - network and see bond0 up and running with IPoIB slaves. - A workaround to this problem would be as follows: - a. Compile ib-bonding source rpm (under SRPMS directory) separately on - a machine with RHEL5.2 and kernel 2.6.18-92.el5 (default for this OS). - b. Install the binary RPM while the machine runs kernel 2.6.18-92.el5. - This will patch the OS configuration scripts and install the bonding - module. - c. Switch to kernel 2.6.30. The module that was compiled in (a) will - not be loaded since it was compiled and installed for a different - kernel. - d. Configure bonding and restart the network. The bonding interface - should be up and running afterwards. - -19. On RHEL5.X, '/etc/init.d/openibd start' prints the following messages while - bringing up IPoIB interfaces: - - Setting up InfiniBand network interfaces: - Bringing up interface ib0: [ OK ] - RTNETLINK answers: File exists - Error adding address 192.168.1.11 for ib1. - Bringing up interface ib1: [ OK ] - Setting up service network . . . [ done ] - - This does not affect IPoIB configuration and interfaces are configured as - expected. - -20. In IPoIB connected mode, packages larger than 2016 bytes are not sent. - https://bugs.openfabrics.org/show_bug.cgi?id=1839 - -21. Under SLES11, if an IP configuration exists for an IPoIB interface - that later becomes a slave of a bonding master, a network restart - does not erase the IP configuration from the slave and it appears to have - an IP address even though the new configuration does not set one. This - may cause problems when trying to use the bonded network interface. To - avoid this, restart the IB stack (openib restart) once you change the - configuration. 
- This issue is described in - https://bugs.openfabrics.org/show_bug.cgi?id=1975 - -22. Currently, IPoIB LRO is not supported on ConnectX-2 devices - -=============================================================================== -3. IPoIB Configuration Based on DHCP -=============================================================================== - -Setting an IPoIB interface configuration based on DHCP (v3.0.4 which is -available via www.isc.org) is performed similarly to the configuration of -Ethernet interfaces. In other words, you need to make sure that IPoIB -configuration files include the following line: - For RedHat: - BOOTPROTO=dhcp - For SLES: - BOOTPROTO=dhcp -Note: If IPoIB configuration files are included, ifcfg-ib files will be -installed under: -/etc/sysconfig/network-scripts/ on a RedHat machine -/etc/sysconfig/network/ on a SuSE machine - -Note: Two patches for DHCP are required for supporting IPoIB. The patch files -for DHCP v3.0.4 are available under the docs/dhcp/ directory. - -Standard DHCP fields holding MAC addresses are not large enough to contain an -IPoIB hardware address. To overcome this problem, DHCP over InfiniBand messages -convey a client identifier field used to identify the DHCP session. This client -identifier field can be used to associate an IP address with a client identifier -value, such that the DHCP server will grant the same IP address to any client -that conveys this client identifier. - -Note: Refer to the DHCP documentation for more details how to make this -association. - -The length of the client identifier field is not fixed in the specification. - -4.1 DHCP Server -In order for the DHCP server to provide configuration records for clients, an -appropriate configuration file needs to be created. By default, the DHCP server -looks for a configuration file called dhcpd.conf under /etc. You can either -edit this file or create a new one and provide its full path to the DHCP server -using the -cf flag. 
See a file example at docs/dhcpd.conf of this package. -The DHCP server must run on a machine which has loaded the IPoIB module. - -To run the DHCP server from the command line, enter: -dhcpd -d -Example: -host1# dhcpd ib0 -d - -4.2 DHCP Client (Optional) - -Note: A DHCP client can be used if you need to prepare a diskless machine with -an IB driver. - -In order to use a DHCP client identifier, you need to first create a -configuration file that defines the DHCP client identifier. Then run the DHCP -client with this file using the following command: -dhclient cf -Example of a configuration file for the ConnectX (PCI Device ID 26428), called -dhclient.conf: -# The value indicates a hexadecimal number -interface "ib1" { -send dhcp-client-identifier ff:00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39; -} -Example of a configuration file for InfiniHost III Ex (PCI Device ID 25218), -called dhclient.conf: -# The value indicates a hexadecimal number -interface "ib1" { -send dhcp-client-identifier 20:00:55:04:01:fe:80:00:00:00:00:00:00:00:02:c9:02:00:23:13:92; -} - -In order to use the configuration file, run: -host1# dhclient -cf dhclient.conf ib1 - - -=============================================================================== -4. The ib-bonding driver -=============================================================================== -The ib-bonding driver is a High Availability solution for IPoIB interfaces. -It is based on the Linux Ethernet Bonding Driver and was adapted to work with -IPoIB. The ib-bonding driver comes with the ib-bonding package -(run rpm -qi ib-bonding to get the package information). - -Using the ib-bonding driver ---------------------------- -The ib-bonding driver is loaded automatically. - -Automatic operation: -Use standard OS tools (sysconfig in SuSE and initscripts in RedHat) -to create a configuration that will come up with network restart. For details -on this, read the documentation for the ib-bonding package. 
- -Notes: -* Using /etc/infiniband/openib.conf to create a persistent configuration is - no longer supported -* On RHEL4_U7, a slave interface cannot be set as primary. -* ib-bonding will not be compiled and installed with OFED on an OS with kernel - that is >= 2.6.27 (e.g., SLES11). The bonding driver that comes with those - kernels already supports enslaving IPoIB interfaces. In addition, an OS - can come with an older kernel but with a patched bonding driver that also - does not require modification (e.g., RHEL5.4). OFED will not replace the - bonding module in such cases either. - However, there might still be issues with OS configuration tools (like - sysconfig or initscripts) that may need fixing, but such issues have not - been observed yet. - - -=============================================================================== -5. Child interfaces -=============================================================================== - -5.1 Subinterfaces ------------------ -You can create subinterfaces for a primary IPoIB interface to provide traffic -isolation. Each such subinterface (also called a child interface) has -different IP and network addresses from the primary (parent) interface. The -default Partition Key (PKey), ff:ff, applies to the primary (parent) interface. - -5.1.1 Creating a Subinterface ------------------------------ -To create a child interface (subinterface), follow this procedure: -Note: In the following procedure, ib0 is used as an example of an IB -subinterface. - -Step 1. Decide on the PKey to be used in the subnet. Valid values are 0-255. -The actual PKey used is a 16-bit number with the most significant bit set. For -example, a value of 0 will give a PKey with the value 0x8000. - -Step 2. Create a child interface by running: -host1$ echo > /sys/class/net//create_child -Example: -host1$ echo 0 > /sys/class/net/ib0/create_child -This will create the interface ib0.8000. - -Step 3. 
Verify the configuration of this interface by running: -Using the example of Step 2: -host1$ ifconfig ib0.8000 -ib0.8000 Link encap:UNSPEC HWaddr 80-00-00-4A-FE-80-00-00-00-00- -00-00-00-00-00-00 -BROADCAST MULTICAST MTU:2044 Metric:1 -RX packets:0 errors:0 dropped:0 overruns:0 frame:0 -TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 -collisions:0 txqueuelen:128 -RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) - -Step 4. As can be seen, the interface does not have IP or network addresses so -it needs to be configured. - -Step 5. To be able to use this interface, a configuration of the Subnet Manager -is needed so that the PKey chosen, which defines a broadcast address, can be -recognized. - -5.1.2 Removing a Subinterface -To remove a child interface (subinterface), run: -echo /sys/class/net//delete_child -Using the example of Step 2: -echo 0x8000 > /sys/class/net/ib0/delete_child -Note that when deleting the interface you must use the PKey value with the most -significant bit set (e.g., 0x8000 in the example above). - - -=============================================================================== -6. Bug Fixes and Enhancements Since OFED 1.3 -=============================================================================== -- There is no default configuration for IPoIB interfaces: One should manually - specify the full IP configuration or use the ofed_net.conf file. See - OFED_Installation_Guide.txt for details on ipoib configuration. 
-- Don't drop multicast sends when they can be queued -- IPoIB panics with RHEL5U1, RHEL4U6 and RHEL4U5: Bug fix when copying small - SKBs (bug 989) -- IPoIB failed on stress testing (bug 1004) -- Kernel Oops during "port up/down test" (bug 1040) -- Restarting the stack while iperf 2.0.4 is running on the client side causes a kernel - panic (bug 985) -- Fix neigh destructor oops on kernel versions between 2.6.17 and 2.6.20 -- Set max CM MTU when moving to CM mode, instead of setting it in openibd script -- Fix CQ size calculations for ipoib -- Bonding: Enable build for SLES10 SP2 -- Bonding: Fix issue in using the bonding module for Ethernet slaves (see - documentation for details) - -=============================================================================== -7. Bug Fixes and Enhancements Since OFED 1.3.1 -=============================================================================== -- IPoIB: Refresh paths instead of flushing them on SM change events to improve - failover response -- IPoIB: Fix loss of connectivity after bonding failover on both sides -- Bonding: Fix link state detection under RHEL4 -- Bonding: Avoid annoying messages from initscripts when starting bond -- Bonding: Set default number of grat. ARP after failover to three (was one) - -=============================================================================== -8. Bug Fixes and Enhancements Since OFED 1.4 -=============================================================================== -- Performance tuning is enabled by default for IPOIB CM. -- Clear IPOIB_FLAG_ADMIN_UP if ipoib_open fails -- Disable napi while cq is being drained (bugzilla #1587) -- rdma_cm: Use the rate from the ipoib broadcast when joining an ipoib - multicast. When joining an IPoIB multicast group, use the same rate as in the - broadcast group. Otherwise, if rdma_cm creates this group before IPoIB does, - it might get a different rate.
This will cause IPoIB to fail joining the same - group later on, because IPoIB has a strict rate selection. -- Fixed unprotected use of priv->broadcast in ipoib_mcast_join_task. -- Do not join broadcast group if interface is brought down - - -=============================================================================== -9. Bug Fixes and Enhancements Since OFED 1.4.2 -=============================================================================== - -- Check that the format of multicast link addresses is correct before taking - them from dev->mc_list to priv->multicast_list. This way we never try to - send a bogus address to the SA, which prevents badness from erroneous - 'ip maddr addr add', broken bonding drivers, etc. (bugzilla #1664) -- IPoIB: Don't turn on carrier for a non-active port. - If a bonding interface uses this IPoIB interface as a slave it might - not detect that this slave is almost useless and failover - functionality will be damaged. The fix checks the state of the IB - port in the carrier_task before calling netif_carrier_on(). (bugzilla #1726) -- Clear ipoib_neigh.dgid in ipoib_neigh_alloc() - IPoIB can miss a change in destination GID under some conditions. The - problem is caused when ipoib_neigh->dgid contains a stale address. - The fix is to set ipoib_neigh->dgid to zero in ipoib_neigh_alloc(). - -=============================================================================== -10. Bug Fixes and Enhancements Since OFED 1.5.0 -=============================================================================== - -- Fixed lockup of the TX queue on mixed CM/UD traffic - When there is a high rate of send traffic on both CM and UD QPs, the - transmitter can be stopped by the CM path but not re-enabled. - -=============================================================================== -11. Bug Fixes and Enhancements Since OFED 1.5.2 -=============================================================================== -1. 
Fix IPoIB rx_frames and rx_usecs to conform to ethtool documentation. - - -=============================================================================== -12. Performance Tuning -=============================================================================== -When IPoIB is configured to run in connected mode, tcp parameter tuning is -performed at driver startup to improve the throughput of medium and large -messages. -The driver startup scripts set the following TCP parameters as follows: - - net.ipv4.tcp_timestamps=0 - net.ipv4.tcp_sack=0 - net.core.netdev_max_backlog=250000 - net.core.rmem_max=16777216 - net.core.wmem_max=16777216 - net.core.rmem_default=16777216 - net.core.wmem_default=16777216 - net.core.optmem_max=16777216 - net.ipv4.tcp_mem="16777216 16777216 16777216" - net.ipv4.tcp_rmem="4096 87380 16777216" - net.ipv4.tcp_wmem="4096 65536 16777216" - -This tuning is effective only for connected mode. If you run in datagram mode, -it actually reduces performance. - -If you change the IPoIB run mode to "datagram" while the driver is running, -the tuned parameters do not get reset to their default values. We therefore -recommend that you change the IPoIB mode only while the driver is down -(by setting line "SET_IPOIB_CM=yes" to "SET_IPOIB_CM=no" in file -/etc/infiniband/openib.conf, and then restarting the driver). - - diff --git a/iser_release_notes.txt b/iser_release_notes.txt deleted file mode 100644 index cd2d12d..0000000 --- a/iser_release_notes.txt +++ /dev/null @@ -1,90 +0,0 @@ - - Open Fabrics Enterprise Distribution (OFED) - iSER initiator in OFED 1.5.x Release Notes - - March 2010 - - -* Background - - iSER allows iSCSI to be layered over RDMA transports (including - InfiniBand and iWARP (RNIC)). - - The OpenFabrics iSER initiator implementation is inter-operable with - open-iscsi (http://www.open-iscsi.org/). It provides an alternative - transport to iscsi_tcp in the open-iscsi framework. 
The iSER transport - exposes a transport API to scsi_transport_iscsi, and a SCSI LLD API to - the Linux SCSI mid-layer (scsi_mod). Currently, the OpenFabrics iSER - initiator can be layered over InfiniBand (no iWARP support yet). - -* Supported platforms - - - kernel.org: 2.6.30 and higher - - RHEL 5.4 - - Except for these platforms, OFED-1.5.x will not install iSER on top of - the kernel and the original iSER module coming with Linux Distribution - will stop working because of mismatch in symbols version. - -* Fixed Bugs and Enhancements since OFED 1.3 - iSER: - - Add logical unit reset support - - Update URLs of iSER docs - - Add change_queue_depth method - - Fix list iteration bug - - Handle iser_device allocation error gracefully - - Don't change ITT endianess - - Move high-volume debug output to higher debug level - - Count FMR alignment violations per session - Open-iSCSI: - - Update open-iscsi rpm versions from - 2.0-754 to 2.0-754.1 and from 2.0-865.15 to 2.0-869.2 - - Change open-iscsi defaults - - iscsi_discovery: fixed printing debug information - - iscsi_discovery: check if iscsid is running - - Set open-iscsi for auto-startup when installing OFED - - iscsiadm: bail out if daemon isn't running - -* Known Issues - Open-iSCSI: - - modifying node transport_name while session is active - will create stale session. It will be deleted only after reboot. - -* Installation/upgrade of open-iscsi - If iSER is selected to be installed with OFED, open-iscsi will be also - installed (or upgraded if another version of open-iscsi is already - installed). Installing/upgrading open-iscsi is required for iSER to - work properly. Before installing OFED, please make sure that no version - of open-iscsi is installed or add the following key to your ofed.conf - file: upgrade_open_iscsi=yes. Using this key will remove any old version - of open-iscsi. - - If an older version of open-iscsi was installed, it is recommended to - delete its records before running open-iscsi. 
This can easily be done by - running the following command (while open-iscsi is stopped): - - rm -rf /etc/iscsi/nodes/* /etc/iscsi/send_targets/* - - Then, open-iscsi may be started, and targets may be discovered by running - 'iscsi_discovery '. - -* iSER links - - Wiki pages - - Information on building/configuring/running the open iscsi initiator over - iSER: https://wiki.openfabrics.org/tiki-index.php?page=iSER - - IETF pages - - iSCSI and iSER specifications come out of the IETF IP storage (IPS) work - group. - - iSCSI specification: http://www.ietf.org/rfc/rfc3720.txt - iSER specification: http://www.ietf.org/rfc/rfc5046.txt - - "About" page - - general and detailed information on iSCSI and iSER - http://www.voltaire.com/Products/Server_Products/iSER_iSCSI_RDMA - diff --git a/iser_target_release_notes.txt b/iser_target_release_notes.txt deleted file mode 100644 index b865944..0000000 --- a/iser_target_release_notes.txt +++ /dev/null @@ -1,51 +0,0 @@ - Open Fabrics Enterprise Distribution (OFED) - STGT/iSER target in OFED 1.5 Release Notes - - December 2009 - - -* Background - - iSER allows iSCSI to be layered over RDMA transports (including InfiniBand - and iWARP (RNIC)). The Linux target framework (tgt) aims to simplify the - creation and maintenance of various SCSI target drivers (iSCSI, Fibre Channel, SRP, etc.). - - tgt supports the following target drivers (among others): - - - iSCSI software (tcp) target driver for Ethernet/IPoIB NICs - - iSER software target driver for InfiniBand and RDMA NICs - - For iSCSI and iSER, tgt consists of a user-space daemon and user-space - tools. That is, no special kernel support is needed other than the - kernel (and user space) RDMA stacks. - - The code is under the GNU General Public License version 2.
- - This package is based on a snapshot (clone) of the tgt git tree taken - on August 28th, 2008 - -* Supported platforms - - RHEL 5 and its updates - SLES 10 and its service-packs - - The release has been tested against the Linux open iscsi initiator - -* STGT/iSER links - - STGT home page - http://stgt.berlios.de - - STGT git - git://git.kernel.org/pub/scm/linux/kernel/git/tomo/tgt.git - - The STGT sources have some embedded documentation; the README and - REDMA.iscsi files are particularly useful - - Wiki pages - - Information on building/configuring/running the stgt/iser target - https://wiki.openfabrics.org/tiki-index.php?page=iSER-target - - general and detailed information on iSCSI and iSER - http://www.voltaire.com/Products/Server_Products/iSER_iSCSI_RDMA diff --git a/mlx4_release_notes.txt b/mlx4_release_notes.txt deleted file mode 100644 index 897b112..0000000 --- a/mlx4_release_notes.txt +++ /dev/null @@ -1,348 +0,0 @@ - Open Fabrics Enterprise Distribution (OFED) - ConnectX driver (mlx4) in OFED 1.5.2 Release Notes - - December 2010 - - -=============================================================================== -Table of Contents -=============================================================================== -1. Overview -2. Supported firmware versions -3. VPI (Virtual Protocol Interconnect) -4. InfiniBand new features and bug fixes since OFED 1.3.1 -5. InfiniBand (mlx4_ib) new features and bug fixes since OFED 1.4 -6. Eth (mlx4_en) new features and bug fixes since OFED 1.4 -7. New features and bug fixes since OFED 1.4.1 -8. New features and bug fixes since OFED 1.4.2 -9. New features and bug fixes since OFED 1.5 -10. New features and bug fixes since OFED 1.5.1 -11. New features and bug fixes since OFED 1.5.2 -12. Known Issues -13. mlx4 available parameters - -=============================================================================== -1.
Overview -=============================================================================== -mlx4 is the low level driver implementation for the ConnectX adapters designed -by Mellanox Technologies. The ConnectX can operate as an InfiniBand adapter, -as an Ethernet NIC, or as a Fibre Channel HBA. The driver in OFED 1.4 supports -InfiniBand and Ethernet NIC configurations. To accommodate the supported -configurations, the driver is split into three modules: - -- mlx4_core - Handles low-level functions like device initialization and firmware - commands processing. Also controls resource allocation so that the - InfiniBand and Ethernet functions can share the device without - interfering with each other. -- mlx4_ib - Handles InfiniBand-specific functions and plugs into the InfiniBand - midlayer -- mlx4_en - Handles Ethernet specific functions and plugs into the netdev mid-layer. - -=============================================================================== -2. Supported firmware versions -=============================================================================== -- This release was tested with FW 2.8.0000 -- The minimal version to use is 2.3.000. -- To use both IB and Ethernet (VPI) use FW version 2.6.000 or higher - -=============================================================================== -3. VPI (Virtual Protocol Interconnect) -=============================================================================== -VPI enables ConnectX to be configured as an Ethernet NIC and/or an InfiniBand -adapter. -o Overview: - The VPI driver is a combination of the Mellanox ConnectX HCA Ethernet and - InfiniBand drivers. - It supplies the user with the ability to run InfiniBand and Ethernet - protocols on the same HCA (separately or at the same time). - For more details on the Ethernet driver see MLNX_EN_README.txt. -o Firmware: - The VPI driver works with FW 25408 version 2.6.000 or higher. - One needs to use INI files that allow different protocols over same HCA. 
-o Port type management: - By default both ConnectX ports are initialized as InfiniBand ports. - If you wish to change the port type use the connectx_port_config script after - the driver is loaded. - Running "/sbin/connectx_port_config -s" will show current port configuration - for all ConnectX devices. - Port configuration is saved in file: /etc/infiniband/connectx.conf. - This saved configuration is restored at driver restart only if done via - "/etc/init.d/openibd restart". - - Possible port types are: - "eth" - Always Ethernet. - "ib" - Always InfiniBand. - "auto" - Link sensing mode - detect port type based on the attached - network type. If no link is detected, the driver retries link - sensing every few seconds. - - Port link type can be configured for each device in the system at run time - using the "/sbin/connectx_port_config" script. - - This utility will prompt for the PCI device to be modified (if there is only - one it will be selected automatically). - At the next stage the user will be prompted for the desired mode for each port. - The desired port configuration will then be set for the selected device. - Note: This utility also has a non interactive mode: - "/sbin/connectx_port_config [[-d|--device ] -c|--conf ]". - -- The following configurations are supported by VPI: - Port1 = eth Port2 = eth - Port1 = ib Port2 = ib - Port1 = auto Port2 = auto - Port1 = ib Port2 = eth - Port1 = ib Port2 = auto - Port1 = auto Port2 = eth - - Note: the following options are not supported: - Port1 = eth Port2 = ib - Port1 = eth Port2 = auto - Port1 = auto Port2 = ib - - -=============================================================================== -4. InfiniBand new features and bug fixes since OFED 1.3.1 -=============================================================================== -Features that are enabled with ConnectX firmware 2.5.0 only: -- Send with invalidate and Local invalidate send queue work requests. -- Resize CQ support. 
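Referring back to the VPI port-type combinations listed in section 3, the following minimal Python sketch encodes the supported Port1/Port2 pairs as a lookup set. The helper name is_supported_vpi_config is hypothetical and is not part of OFED or of the connectx_port_config script; the sketch only restates the list from these notes and does not query the driver.

```python
# Hypothetical helper (an assumption for illustration, not an OFED API):
# restates the "supported by VPI" list from section 3 of these notes.
SUPPORTED_VPI_CONFIGS = {
    ("eth", "eth"),
    ("ib", "ib"),
    ("auto", "auto"),
    ("ib", "eth"),
    ("ib", "auto"),
    ("auto", "eth"),
}


def is_supported_vpi_config(port1: str, port2: str) -> bool:
    """Return True if the (Port1, Port2) pair is listed as supported."""
    return (port1, port2) in SUPPORTED_VPI_CONFIGS
```

Under these assumptions, is_supported_vpi_config("ib", "eth") is True and is_supported_vpi_config("eth", "ib") is False. The listed combinations happen to be exactly those in which Port1 does not come after Port2 in the order ib, auto, eth.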
- -Features that are enabled with ConnectX firmware 2.6.0 only: -- Fast register MR send queue work requests. -- Local DMA L_Key. -- Raw Ethertype QP support (one QP per port) -- receive only. - -Non-firmware dependent features: -- Allow 4K messages for UD QPs -- Allocate/free fast register MR page lists -- More efficient MTT allocator -- RESET->ERR QP state transition no longer supported (IB Spec 1.2.1) -- Pass congestion management class MADs to the HCA -- Enable firmware diagnostic counters available via sysfs -- Enable LSO support for IPOIB -- IB_EVENT_LID_CHANGE is generated more appropriately -- Fixed race condition between create QP and destroy QP (bugzilla 1389) - - -=============================================================================== -5. InfiniBand new features and bug fixes since OFED 1.4 -=============================================================================== -- Enable setting a 4K MTU for ConnectX ports via a module parameter (set_4k_mtu). -- Support optimized registration of memory backed by huge pages. - With this optimization, the number of MTT entries used is significantly - lower than for regular memory, so the HCA will access registered memory with - fewer cache misses and improved performance. - For more information on this topic, please refer to the Linux documentation file: - Documentation/vm/hugetlbpage.txt -- Do not enable blueflame sends if write combining is not available -- Add write combining support for PPC64, and thus enable blueflame sends. -- Unregister IB device before executing CLOSE_PORT. -- Notify and exit if the kernel module used does not support XRC. This is done - to avoid a libmlx4 compatibility problem. -- Added a module parameter (log_mtts_per_seg) for number of MTTs per segment. - This makes it possible to register more memory with the same number of segments. - - -=============================================================================== -6.
Eth (mlx4_en) new features and bug fixes since OFED 1.4 -=============================================================================== -6.1 Changes and New Features ----------------------------- -- Added Tx Multi-queue support which Improves multi-stream and bi-directional - TCP performance. -- Added IP Reassembly to improve RX bandwidth for IP fragmented packets. -- Added linear skb support which improves UDP performance. -- Removed the following module parameters: - - rx/tx_ring_size - - rx_ring_num - number of RX rings - - pprx/pptx - global pause frames - The parameters above are controlled through the standard Ethtool interface. - -Bug Fixes ---------- -- Memory leak when driver is unloaded without configuring interfaces first. -- Setting flow control parameters for one ConnectX port through Ethtool - impacts the other port as well. -- Adaptive interrupt moderation malfunctions after receiving/transmitting - around 7 Tera-bytes of data. -- Firmware commands fail with bad flow messages when bringing an interface up. -- Unexpected behavior in case of memory allocation failures. - -=============================================================================== -7. New features and bug fixes since OFED 1.4.1 -=============================================================================== -- Added support for new device ID: 0x6764: MT26468 ConnectX EN 10GigE PCIe gen2 - -=============================================================================== -8. New features and bug fixes since OFED 1.4.2 -=============================================================================== -8.1 Changes and New Features ----------------------------- -- mlx4_en is now supported on PPC and IA64. -- Added self diagnostics feature: ethtool -t eth. -- Card's vpd can be accessed for read and write using ethtool interface. - -8.2 Bug Fixes -------------- -- mlx4 can now work with MSI-X on RH4 systems. -- Enabled the driver to load on systems with 32 cores and higher. 
-- The driver no longer gets stuck if the HW/FW stops responding; a reset is - performed instead. -- Fixed recovery flows from memory allocation failures. -- When the system is low on memory, the mlx4_en driver now allocates smaller RX - rings. -- The mlx4_core driver now retries obtaining MSI-X vectors if the initial request is - rejected by the OS - -=============================================================================== -9. New features and bug fixes since OFED 1.5 -=============================================================================== -9.1 Changes and New Features ----------------------------- -- Added RDMA over Converged Enhanced Ethernet (RoCEE) support - See RoCEE_README.txt. -- Masked Compare and Swap (MskCmpSwap) - The MskCmpSwap atomic operation is an extension to the CmpSwap operation - defined in the IB spec. MskCmpSwap allows the user to select a portion of the - 64 bit target data for the "compare" check as well as to restrict the swap to - a (possibly different) portion. -- Masked Fetch and Add (MFetchAdd) - The MFetchAdd Atomic operation extends the functionality of the standard IB - FetchAdd by allowing the user to split the target into multiple fields of - selectable length. The atomic add is done independently on each of these - fields. A bit set in the field_boundary parameter specifies the field - boundaries. -- Improved VLAN tagging performance for the mlx4_en driver. -- RSS support for Ethernet UDP traffic on ConnectX-2 cards with firmware - 2.7.700 and higher. - -9.2 Bug Fixes -------------- -- Bonding stops functioning when one of the Ethernet ports is closed. -- "Scheduling while atomic" errors in /var/log/messages when working with - bonding and mlx4_en drivers in several operating systems. - -=============================================================================== -10.
New features and bug fixes since OFED 1.5.1 -=============================================================================== -10.1 Changes and New Features ----------------------------- -1. Added RAW QP support -2. Extended the range of log_mtts_per_seg - upper bound moved from 5 to 7. -3. Added 0xff70 vendor ID support for MADs. -4. Added support for GID change event. -5. Better interrupt spreading under heavy RX load (mlx4_en) - -10.2 Bug Fixes -------------- -1. Fixed chunk sg list overflow in mlx4_alloc_icm() -2. Fixed bug in invalidation of counter index. -3. Fixed bug in catching netdev events for updating GID table. -4. Fixed bug in populating GID table for RoCE. -5. Fixed XRC locking and prevention of null dereference. -6. Added spinlock to xrc_reg_list changes and scanning in interrupt context. -7. Fixed offload changes via Ethtool for VLAN interfaces - -=============================================================================== -11. New features and bug fixes since OFED 1.5.2 -=============================================================================== -11.1 Changes and new features ------------------------------ -1. RoCE counters are now added to the regular Ethernet counters. The counters -for RoCE specific traffic remain in the same place and are not changed. -2. Forward any vendor ID SMP MADs to firmware for handling. -3. Add blue flame support for kernel consumers. This allows lower latencies to -be achieved. To use blue flame, a consumer needs to create the QP with inline -support. - -11.2 Bug fixes --------------- -1. Fix race when reading node description through MADs. -2. Fix modify CQ so each of the moderation parameters is independent. -3. Limit the number of fast registration work requests to match HW capabilities. -4. Changes to node-description via sysfs are now propagated to FW (for FW -2.8.000 and later). This enables FW to send a 144 trap to OpenSM regarding the -change, so that OpenSM can read that node's updated description.
This fixes an -old race condition, where OpenSM read the node's description before it was -changed during driver startup. -5. Fix max fast registration WRs that can be posted to CX. -6. Fix port speed reporting for RoCE ports. -7. Limit GID entries for VLAN to match hardware capabilities. -8. Fix RoCE link state report. -9. Work around a firmware bug that reports the wrong number of blue flame registers. -10. Bug fix in kernel post_send when VLANs are used. -11. Fix in mlx4_en for handling VLAN operations when working under bond - interfaces. -12. Fix Ethtool transceiver type report for mlx4_en - - -=============================================================================== -12. Known Issues -=============================================================================== -- The SQD feature is not supported -- To load the driver on machines with a 64KB default page size, the UAR bar - must be enlarged. 64KB page size is the default of PPC with RHEL5 and Itanium - with SLES 11 or when 64KB page size is enabled. - Perform the following three steps: - 1. Add the following line in the firmware configuration (INI) file under the - [HCA] section: - log2_uar_bar_megabytes = 5 - 2. Burn a modified firmware image with the changed INI file. - 3. Reboot the system. - - -================================================================================ -13.
mlx4 available parameters -================================================================================ -In order to set mlx4 parameters, add the following line(s) to /etc/modprobe.conf: - options mlx4_core parameter= - and/or - options mlx4_ib parameter= - and/or - options mlx4_en parameter= - -mlx4_core parameters: - set_4k_mtu: try to set 4K MTU to all ConnectX ports (int) - debug_level: enable debug tracing if > 0 (int) - block_loopback: block multicast loopback packets if > 0 (int) - msi_x: attempt to use MSI-X if nonzero (int) - log_num_mac: log2 max number of MACs per ETH port (1-7, int) - use_prio: enable steering by VLAN priority on ETH ports - (0/1, default 0) (bool) - log_num_qp: log maximum number of QPs per HCA (int) - log_num_srq: log maximum number of SRQs per HCA (int) - log_rdmarc_per_qp: log number of RDMARC buffers per QP (int) - log_num_cq: log maximum number of CQs per HCA (int) - log_num_mcg: log maximum number of multicast groups per HCA - (int) - log_num_mpt: log maximum number of memory protection table - entries per HCA (int) - log_num_mtt: log maximum number of memory translation table - segments per HCA (int) - log_mtts_per_seg: log2 number of MTT entries per segment (1-5) - (int) - enable_qos: enable Quality of Service support in the HCA - (default: off) (bool) - enable_pre_t11_mode: set FCoXX to pre-T11 mode if non-zero - (default 0) (int) - internal_err_reset: reset device on internal errors if non-zero - (default 1) (int) - -mlx4_ib parameters: - debug_level: enable debug tracing if > 0 (default 0) - -mlx4_en parameters: - udp_rss: enable RSS for incoming UDP traffic or disabled (0) - tcp_rss: enable RSS for incoming TCP traffic or disabled (0) - num_lro: number of LRO sessions per ring or disabled (0) - (default is 32) - ip_reasm: allow reassembly of fragmented IP packets (default - is enabled) - pfctx: priority based Flow Control policy on TX[7:0] - per priority bit mask (default is 0) - pfcrx: priority based Flow Control policy
on RX[7:0] - per priority bit mask (default is 0) - inline_thold: threshold for using inline data (default is 128) diff --git a/mpi-selector_release_notes.txt b/mpi-selector_release_notes.txt deleted file mode 100644 index 95944dc..0000000 --- a/mpi-selector_release_notes.txt +++ /dev/null @@ -1,43 +0,0 @@ - MPI Selector 1.0 release notes - December 2009 - ============================== - -OFED contains a simple mechanism for system administrators and end -users to select which MPI implementation they want to use. The MPI -selector functionality is not specific to any MPI implementation; it -can be used with any implementation that provides shell startup files -that correctly set the environment for that MPI. The OFED installer -will automatically add MPI selector support for each MPI that it -installs. Additional MPI's not known by the OFED installer can be -listed in the MPI selector; see the mpi-selector(1) man page for -details. - -Note that MPI selector only affects the default MPI environment for -*future* shells. Specifically, if you use MPI selector to select MPI -implementation ABC, this default selection will not take effect until -you start a new shell (e.g., logout and login again). Other packages -(such as environment modules) provide functionality that allows -changing your environment to point to a new MPI implementation in the -current shell. The MPI selector was not meant to duplicate or replace -that functionality. - -The MPI selector functionality can be invoked in one of two ways: - -1. The mpi-selector-menu command. - - This command is a simple, menu-based program that allows the - selection of the system-wide MPI (usually only settable by root) - and a per-user MPI selection. It also shows what the current - selections are. - - This command is recommended for all users. - -2. The mpi-selector command. 
- - This command is a CLI-equivalent of the mpi-selector-menu, - allowing for the same functionality as mpi-selector-menu but - without the interactive menus and prompts. It is suitable for - scripting. - -See the mpi-selector(1) man page for more information. - diff --git a/mstflint_release_notes.txt b/mstflint_release_notes.txt deleted file mode 100644 index 15c7c79..0000000 --- a/mstflint_release_notes.txt +++ /dev/null @@ -1,77 +0,0 @@ -=============================================================================== - OFED 1.5.2 for Linux - Mellanox Firmware Burning and Diagnostic Utilities - December 2010 -=============================================================================== - -=============================================================================== -Table of Contents -=============================================================================== -1. Overview -2. New Features -3. Major Bugs Fixed -4. Known Issues - -=============================================================================== -1. Overview -=============================================================================== - -This package contains burning and diagnostic tools for Mellanox -manufactured cards. It also provides access to the relevant source code. Please -see the file LICENSE for licensing details. - -Package Contents: - a) mstflint source code - b) mflash lib - This lib provides Flash access through Mellanox HCAs. - c) mtcr lib (implemented in mtcr.h file) - This lib enables access to adapter hardware registers via PCIe - d) mstregdump utility - This utility dumps hardware registers from Mellanox hardware for later - analysis by Mellanox. - e) mstvpd - This utility dumps the on-card VPD (Vital Product Data, which contains - the card serial number, part number, and other info). - f) hca_self_test.ofed - This script checks the status of software, firmware and hardware of the - HCAs or NICs installed on the local host.
- -=============================================================================== -2. New Features -=============================================================================== - -* Added support for flash type SST25VF016B in mstflint - -* Added support for flash type M25PX16 in mstflint - -* Added an option to set the VSD and GUIDs (mstflint command 'sv' and 'sg') in - a binary image file. This is useful for production to prepare images for pre- - assembly flash burning. These new commands are supported by Mellanox 4th - generation devices. - -* Added an option to set the VSD and GUIDs (mstflint command 'sv' and 'sg') on - an already burnt device. These commands re-burn the existing image with the - given GUIDs or VSD. - When the 'sg' command is applied on a device with blank (0xff) GUIDs, it - updates the GUIDs without re-burning the image. - -* mstregdump: Updated address list for ConnectX2 device. - -=============================================================================== -3. Bugs Fixed -=============================================================================== - -* Show correct device names in mstflint help - -=============================================================================== -4. Known Issues -=============================================================================== - -* In rare cases, you may get the following error message when running mstflint: - Warning: memory access to device 0a:00.0 failed: Input/output error. - Warning: Fallback on IO: much slower, and unsafe if device in use. - *** buffer overflow detected ***: mstflint terminated - - To solve the issue, run "mst start" (requires MFT - Mellanox Firmware Tools package) and - then re-run mstflint.
- diff --git a/mthca_release_notes.txt b/mthca_release_notes.txt deleted file mode 100644 index 40f3c4e..0000000 --- a/mthca_release_notes.txt +++ /dev/null @@ -1,92 +0,0 @@ - Open Fabrics Enterprise Distribution (OFED) - mthca in OFED 1.5 Release Notes - - December 2009 - - -=============================================================================== -Table of Contents -=============================================================================== -1. Overview -2. Fixed Bugs since OFED 1.3.1 -3. Bug fixes and enhancements since OFED 1.4 -4. Known Issues - -=============================================================================== -1. Overview -=============================================================================== -mthca is the low level driver implementation for the following Mellanox -Technologies HCAs: InfiniHost, InfiniHost III Ex and InfiniHost III Lx. - -mthca Available Parameters --------------------------- -In order to set mthca parameters, add the following line to /etc/modprobe.conf: - - options ib_mthca parameter= - -mthca parameters: - catas_reset_disable: disable reset on catastrophic event if nonzero - (int) - fw_cmd_doorbell: post FW commands through doorbell page if - nonzero (and supported by FW) (int) - debug_level: Enable debug tracing if > 0 (int) - msi_x: attempt to use MSI-X if nonzero (int) - tune_pci: increase PCI burst from the default set by BIOS if nonzero (int) - num_qp: maximum number of QPs per HCA (int) - rdb_per_qp: number of RDB buffers per QP (int) - num_cq: maximum number of CQs per HCA (int) - num_mcg: maximum number of multicast groups per HCA (int) - num_mpt: maximum number of memory protection table entries per HCA (int) - num_mtt: maximum number of memory translation table segments per HCA (int) - num_udav: maximum number of UD address vectors per HCA (int) - fmr_reserved_mtts: number of memory translation table segments reserved for - FMR (int) - log_mtts_per_seg: Log2 number of MTT entries per segment (1-5)
(int) - -=============================================================================== -2. Fixed Bugs -=============================================================================== -- Fix access to freed memory in catastrophic processing - catas_reset() uses pointer to mthca_dev, but mthca_dev is not valid after - the call to __mthca_restart_one(). - - -=============================================================================== -3. Bug fixes and enhancements since OFED 1.4 -=============================================================================== -- Added a module parameter (log_mtts_per_seg) for number of MTTs per segment. - This enables registering more memory with the same number of segments. -- Bring INIT_HCA and other commands timeout into consistency with PRM. This - solves an issue that occurred when more than 2^18 max QPs were configured. - -=============================================================================== -4. Known Issues -=============================================================================== -1. A UAR size other than 8MB prevents mthca driver loading. The default UAR - size is 8MB. If the size is changed, the following error message will be - logged to /var/log/messages upon attempting to load the mthca driver: - ib_mthca 0000:04:00.0: Missing UAR, aborting. - -2. If a user level application using multicast receives a control signal - in the process of detaching from a multicast group, its QP may remain a - member of the multicast group (in HCA). - Workaround: Destroy the multicast group after detaching the QP from it. - -3. In mem-free devices, RC QPs can be created with a maximum of (max_sge - 1) - entries only; UD QPs can be created with a maximum of (max_sge - 3) entries. - -4. Performance can be degraded due to a wrong BIOS configuration: - The PCI Express specification requires the BIOS to set the MaxReadReq - register for each HCA card for maximum performance and stability.
- - If you experience bandwidth performance degradation, try forcing the card to - behave not according to the PCI Express specification by setting the - tune_pci=1 module parameter. This tune_pci=1 assignment was the default - setting in OFED 1.0; therefore, it may have masked performance degradation - on some systems. - - If tune_pci=1 improves bandwidth, please report the issue to your BIOS - vendor. Please note that Mellanox Technologies does not recommend using - tune_pci=1 in production systems: working with tune_pci=1 set is untested - and is known to trigger instability issues on some platforms. - diff --git a/mvapich2_release_notes.txt b/mvapich2_release_notes.txt deleted file mode 100644 index 9a0fa90..0000000 --- a/mvapich2_release_notes.txt +++ /dev/null @@ -1,118 +0,0 @@ -======================================================================== - - Open Fabrics Enterprise Distribution (OFED) - MVAPICH2-1.5.1 in OFED 1.5.2 Release Notes - - September 2010 - - -Overview --------- - -These are the release notes for MVAPICH2-1.5.1. MVAPICH2 is an MPI-2 -implementation over InfiniBand, iWARP and RoCEE (RDMAoE) from the Ohio -State University (http://mvapich.cse.ohio-state.edu/). - - -User Guide ----------- - -For more information on using MVAPICH2-1.5.1, please visit the user -guide at http://mvapich.cse.ohio-state.edu/support/. - - -Software Dependencies ---------------------- - -MVAPICH2 depends on the installation of the OFED Distribution stack with -OpenSM running. The MPI module also requires an established network -interface (either InfiniBand, IPoIB, iWARP, RoCEE uDAPL, or Ethernet). -BLCR support is needed if built with fault tolerance support. Similarly, -HWLOC support is needed if built with Portable Hardware Locality feature -for CPU mapping. 
- - -ChangeLog ---------- - -* Features and Enhancements - - Significantly reduce memory footprint on some systems by changing - the stack size setting for multi-rail configurations - - Optimization to the number of RDMA Fast Path connections - - Performance improvements in Scatterv and Gatherv collectives for - CH3 interface (Thanks to Dan Kokran and Max Suarez of NASA for - identifying the issue) - - Tuning of Broadcast Collective - - Support for tuning of eager thresholds based on both adapter and - platform type - - Environment variables for message sizes can now be expressed in - short form K=Kilobytes and M=Megabytes (e.g. - MV2_IBA_EAGER_THRESHOLD=12K) - - Ability to selectively use some or all HCAs using colon separated - lists. e.g. MV2_IBA_HCA=mlx4_0:mlx4_1 - - Improved Bunch/Scatter mapping for process binding with HWLOC and - SMT support (Thanks to Dr. Bernd Kallies of ZIB for ideas and - suggestions) - - Update to Hydra code from MPICH2-1.3b1 - - Auto-detection of various iWARP adapters - - Specifying MV2_USE_IWARP=1 is no longer needed when using iWARP - - Changing automatic eager threshold selection and tuning for iWARP - adapters based on number of nodes in the system instead of the - number of processes - - PSM progress loop optimization for QLogic Adapters (Thanks to Dr. - Avneesh Pant of QLogic for the patch) - -* Bug fixes - - Fix memory leak in registration cache with --enable-g=all - - Fix memory leak in operations using datatype modules - - Fix for rdma_cross_connect issue for RDMA CM. The server is - prevented from initiating a connection. 
- - Don't fail during build if RDMA CM is unavailable - - Various mpirun_rsh bug fixes for CH3, Nemesis and uDAPL interfaces - - ROMIO panfs build fix - - Update panfs for not-so-new ADIO file function pointers - - Shared libraries can be generated with unknown compilers - - Explicitly link against DL library to prevent build error due to - DSO link change in Fedora 13 (introduced with gcc-4.4.3-5.fc13) - - Fix regression that prevents the proper use of our internal HWLOC - component - - Remove spurious debug flags when certain options are selected at - build time - - Error code added for situation when received eager SMP message is - larger than receive buffer - - Fix for Gather and GatherV back-to-back hang problem with LiMIC2 - - Fix for packetized send in Nemesis - - Fix related to eager threshold in nemesis ib-netmod - - Fix initialization parameter for Nemesis based on adapter type - - Fix for uDAPL one sided operations (Thanks to Jakub Fedoruk from - Intel for reporting this) - - Fix an issue with out-of-order message handling for iWARP - - Fixes for memory leak and Shared context Handling in PSM for - QLogic Adapters (Thanks to Dr. Avneesh Pant of QLogic for the - patch) - - -Main Verification Flows ------------------------ - -In order to verify the correctness of MVAPICH2-1.4.1, the following -tests and parameters were run. 
- -Test Description -==================================================================== -Intel Intel's MPI functionality test suite -OSU Benchmarks OSU's performance tests -IMB Intel's MPI Benchmark test -mpich2 Test suite distributed with MPICH2 -NAS NAS Parallel Benchmarks (NPB3.2) - - -Mailing List ------------- - -There is a public mailing list mvapich-discuss@cse.ohio-state.edu for -mvapich users and developers to -- Ask for help and support from each other and get prompt response -- Contribute patches and enhancements - -======================================================================== diff --git a/mvapich_release_notes.txt b/mvapich_release_notes.txt deleted file mode 100644 index 8c872ae..0000000 --- a/mvapich_release_notes.txt +++ /dev/null @@ -1,102 +0,0 @@ - Open Fabrics Enterprise Distribution (OFED) - OSU MPI MVAPICH-1.2.0, in OFED 1.5 Release Notes - - December 2009 - - -=============================================================================== -Table of Contents -=============================================================================== -1. Overview -2. Software Dependencies -3. New Features -4. Bug Fixes -5. Known Issues -6. Main Verification Flows - - -=============================================================================== -1. Overview -=============================================================================== -These are the release notes for OSU MPI MVAPICH-1.2.0. -OSU MPI is an MPI channel implementation over InfiniBand -by Ohio State University (OSU). - -See http://mvapich.cse.ohio-state.edu - - -=============================================================================== -2. Software Dependencies -=============================================================================== -OSU MPI depends on the installation of the OFED stack with OpenSM running. -The MPI module also requires an established network interface (either -InfiniBand IPoIB or Ethernet). 
- - -=============================================================================== -3. New Features ( Compared to mvapich 1.1.0 ) -=============================================================================== -MVAPICH-1.2.0 has the following additional features: -- Advanced network recovery support -- mpirun launcher improvements -- Efficient intra-node shared memory communication - support for diskless clusters -- RoCEE (RDMAoE) networks support - -=============================================================================== -4. Bug Fixes ( Compared to mvapich 1.1.0 ) -=============================================================================== -- Multiple fixes for mpirun_rsh launcher - -=============================================================================== -5. Known Issues -=============================================================================== -- Shared memory broadcast optimization is disabled by default. - -- MVAPICH MPI compiled on AMD x86_64 does not work with MVAPICH MPI compiled - on Intel X86_64 (EM64t). - Workaround: - Use "VIADEV_USE_COMPAT_MODE=1" run time option in order to enable compatibility - mode that works for AMD and Intel platform. - -- A process running MPI cannot fork after MPI_Init unless the environment - variable IBV_FORK_SAFE=1 is set to enable fork support. This support also - requires a kernel version of 2.6.16 or higher. - -- For users of Mellanox Technologies firmware fw-23108 or fw-25208 only: - MVAPICH might fail in its default configuration if your HCA is burnt with an - fw-23108 version that is earlier than 3.4.000, or with an fw-25208 version - 4.7.400 or earlier. - - NOTE: There is no issue if you chose to update firmware during Mellanox - OFED installation as newer firmware versions were burnt. - - Workaround: - Option 1 - Update the firmware. For instructions, see Mellanox Firmware Tools - (MFT) User's Manual under the docs/ folder. 
- Option 2 - In mvapich.conf, set VIADEV_SRQ_ENABLE=0 - -- MVAPICH may fail to run on some SLES 10 machines due to problems in resolving - the host name. - Workaround: Edit /etc/hosts and comment-out/remove the line that maps - IP address 127.0.0.2 to the system's fully qualified hostname. - - -=============================================================================== -6. Main Verification Flows -=============================================================================== -In order to verify the correctness of MVAPICH, the following tests and -parameters were run. - -Test Description -------------------------------------------------------------------- -Intel's Test suite - 1400 Intel tests -BW/LT OSU's test for bandwidth latency -IMB Intel's MPI Benchmark test -mpitest b_eff test -Presta Presta multicast test -Linpack Linpack benchmark -NAS2.3 NAS NPB2.3 tests -SuperLU SuperLU benchmark (NERSC edition) -NAMD NAMD application -CAM CAM application diff --git a/nes_release_notes.txt b/nes_release_notes.txt deleted file mode 100644 index 0233d90..0000000 --- a/nes_release_notes.txt +++ /dev/null @@ -1,319 +0,0 @@ - Open Fabrics Enterprise Distribution (OFED) - NetEffect Ethernet Cluster Server Adapter Release Notes - September 2010 - - - -The iw_nes module and libnes user library provide RDMA and L2IF -support for the NetEffect Ethernet Cluster Server Adapters. - -========== -What's New -========== -OFED 1.5.2 contains several enhancements and bug fixes to iw_nes driver. - -* Add new feature iWarp Multicast Acceleration (IMA). -* Add module option to disable extra doorbell read after a write. -* Change CQ event notification to not fire event unless there is a - new CQE not polled. -* Fix payload calculation for post receive with more than one SGE. -* Fix crash when CLOSE was indicated twice due to connection close - during remote peer's timeout on pending MPA reply. -* Fix ifdown hang by not calling ib_unregister_device() till removal - of iw_nes module. 
-* Handle RST when state of connection is in FIN_WAIT2. -* Correct properties for various nes_query_{qp, port, device} calls. - - -============================================ -Required Setting - RDMA Unify TCP port space -============================================ -RDMA connections use the same TCP port space as the host stack. To avoid -conflicts, set rdma_cm module option unify_tcp_port_space to 1 by adding -the following to /etc/modprobe.conf: - - options rdma_cm unify_tcp_port_space=1 - - -======================================== -Required Setting - Power Management Mode -======================================== -If possible, disable Active State Power Management in the BIOS, e.g.: - - PCIe ASPM L0s - Advanced State Power Management: DISABLED - - -======================= -Loadable Module Options -======================= -The following options can be used when loading the iw_nes module by modifying -the modprobe.conf file. - -wide_ppm_offset=0 - Setting to 1 increases the CX4 interface clock ppm offset to 300ppm. - Default setting 0 is 100ppm. - -mpa_version=1 - MPA version to be used in MPA Req/Resp (0 or 1). - -disable_mpa_crc=0 - Disable checking of MPA CRC. - Set to 1 to enable MPA CRC. - -send_first=0 - Send RDMA Message First on Active Connection. - -nes_drv_opt=0x00000100 - Following options are supported: - - 0x00000010 - Enable MSI - 0x00000080 - No Inline Data - 0x00000100 - Disable Interrupt Moderation - 0x00000200 - Disable Virtual Work Queue - 0x00001000 - Disable extra doorbell read after write - -nes_debug_level=0 - Specify debug output level. - -wqm_quanta=65536 - Set size of data to be transmitted at a time. - -limit_maxrdreqsz=0 - Limit PCI read request size to 256 bytes. - - -=============== -Runtime Options -=============== -The following options can be used to alter the behavior of the iw_nes module: -NOTE: Assuming NetEffect Ethernet Cluster Server Adapter is assigned eth2.
- - ifconfig eth2 mtu 9000 - largest mtu supported - - ethtool -K eth2 tso on - enables TSO - ethtool -K eth2 tso off - disables TSO - - ethtool -C eth2 rx-usecs-irq 128 - set static interrupt moderation - - ethtool -C eth2 adaptive-rx on - enable dynamic interrupt moderation - ethtool -C eth2 adaptive-rx off - disable dynamic interrupt moderation - ethtool -C eth2 rx-frames-low 16 - low watermark of rx queue for dynamic - interrupt moderation - ethtool -C eth2 rx-frames-high 256 - high watermark of rx queue for - dynamic interrupt moderation - ethtool -C eth2 rx-usecs-low 40 - smallest interrupt moderation timer - for dynamic interrupt moderation - ethtool -C eth2 rx-usecs-high 1000 - largest interrupt moderation timer - for dynamic interrupt moderation - -=================== -uDAPL Configuration -=================== -Rest of the document assumes the following uDAPL settings in dat.conf: - - OpenIB-iwarp u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth2 0" "" - ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" "" - - -============== -mpd.hosts file -============== -mpd.hosts is a text file with a list of nodes, one per line, in the MPI ring. -Use either fully qualified hostname or IP address. 
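As a sketch of the mpd.hosts format described above, the following shell snippet writes a three-node file and counts its entries. The host names, the IP address, and the ./mpd.hosts path are illustrative assumptions and are not taken from these release notes; the mpirun examples in this document pass /opt/mpd.hosts.

```shell
# Illustrative only: the host names and address below are placeholders.
hosts_file=./mpd.hosts

# One node per line: fully qualified hostname or IP address.
cat > "$hosts_file" <<'EOF'
node01.example.com
node02.example.com
192.168.0.47
EOF

# Count the non-empty lines, i.e. the nodes listed for the MPI ring.
node_count=$(grep -c . "$hosts_file")
echo "nodes listed: $node_count"
```

Mixing hostnames and IP addresses in one file is shown only to illustrate both accepted forms.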
- - -======================================= -Recommended Settings for HP MPI 2.2.7 -======================================= -Add the following to mpirun command: - - -1sided - -Example mpirun command with uDAPL-2.0: - - mpirun -np 2 -hostfile /opt/mpd.hosts - -UDAPL -prot -intra=shm - -e MPI_HASIC_UDAPL=ofa-v2-iwarp - -1sided - /opt/hpmpi/help/hello_world - -Example mpirun command with uDAPL-1.2: - - mpirun -np 2 -hostfile /opt/mpd.hosts - -UDAPL -prot -intra=shm - -e MPI_HASIC_UDAPL=OpenIB-iwarp - -1sided - /opt/hpmpi/help/hello_world - - -============================================================ -Recommended Settings for Platform MPI 7.1 (formerly HP-MPI) -============================================================ -Add the following to mpirun command: - - -1sided - -Example mpirun command with uDAPL-2.0: - - mpirun -np 2 -hostfile /opt/mpd.hosts - -UDAPL -prot -intra=shm - -e MPI_HASIC_UDAPL=ofa-v2-iwarp - -1sided - /opt/platform_mpi/help/hello_world - -Example mpirun command with uDAPL-1.2: - - mpirun -np 2 -hostfile /opt/mpd.hosts - -UDAPL -prot -intra=shm - -e MPI_HASIC_UDAPL=OpenIB-iwarp - -1sided - /opt/platform_mpi/help/hello_world - - -============================================== -Recommended Settings for Intel MPI 3.2.x/4.0.x -============================================== -Add the following to mpiexec command: - - -genv I_MPI_FALLBACK_DEVICE 0 - -genv I_MPI_DEVICE rdma:OpenIB-iwarp - -genv I_MPI_USE_RENDEZVOUS_RDMA_WRITE 1 - -Example mpiexec command line for uDAPL-2.0: - - mpiexec -genv I_MPI_FALLBACK_DEVICE 0 - -genv I_MPI_DEVICE rdma:ofa-v2-iwarp - -genv I_MPI_USE_RENDEZVOUS_RDMA_WRITE 1 - -ppn 1 -n 2 - /opt/intel/impi/3.2.2/bin64/IMB-MPI1 - -Example mpiexec command line for uDAPL-1.2: - mpiexec -genv I_MPI_FALLBACK_DEVICE 0 - -genv I_MPI_DEVICE rdma:OpenIB-iwarp - -genv I_MPI_USE_RENDEZVOUS_RDMA_WRITE 1 - -ppn 1 -n 2 - /opt/intel/impi/3.2.2/bin64/IMB-MPI1 - - -======================================== -Recommended Setting for MVAPICH2 and OFA 
-======================================== -Add the following to the mpirun command: - - -env MV2_USE_IWARP_MODE 1 - -Example mpiexec command line: - - mpiexec -l -n 2 - -env MV2_USE_IWARP_MODE 1 - /usr/mpi/gcc/mvapich2-1.5/tests/osu_benchmarks-3.1.1/osu_latency - - -========================================== -Recommended Setting for MVAPICH2 and uDAPL -========================================== -Add the following to the mpirun command for 64 or more processes: - - -env MV2_ON_DEMAND_THRESHOLD - -Example mpirun command with uDAPL-2.0: - - mpiexec -l -n 64 - -env MV2_DAPL_PROVIDER ofa-v2-iwarp - -env MV2_ON_DEMAND_THRESHOLD 64 - /usr/mpi/gcc/mvapich2-1.5/tests/IMB-3.2/IMB-MPI1 - -Example mpirun command with uDAPL-1.2: - - mpiexec -l -n 64 - -env MV2_DAPL_PROVIDER OpenIB-iwarp - -env MV2_ON_DEMAND_THRESHOLD 64 - /usr/mpi/gcc/mvapich2-1.5/tests/IMB-3.2/IMB-MPI1 - - -=========================== -Modify Settings in Open MPI -=========================== -There is more than one way to specify MCA parameters in -Open MPI. Please visit this link and use the best method -for your environment: - -http://www.open-mpi.org/faq/?category=tuning#setting-mca-params - - -======================================= -Recommended Settings for Open MPI 1.4.2 -======================================= -Allow the sender to use RDMA Writes: - - -mca btl_openib_flags 2 - -Example mpirun command line: - - mpirun -np 2 -hostfile /opt/mpd.hosts - -mca btl openib,self,sm - -mca btl_mpi_leave_pinned 0 - -mca btl_openib_flags 2 - /usr/mpi/gcc/openmpi-1.4.2/tests/IMB-3.2/IMB-MPI1 - - -=================================== -iWARP Multicast Acceleration (IMA) -=================================== - -iWARP multicast acceleration enables raw L2 multicast traffic kernel -bypass using user-space verbs API using the new defined QP type -IBV_QPT_RAW_ETH. - -The L2 RAW_ETH acceleration assumes that user application transmits and -receives a whole L2 frame including MAC/IP/UDP/TCP headers. 
- -ETH RAW QP usage: -First the application creates an IBV_QPT_RAW_ETH QP with associated CQ, PD, -and completion channels, as is done for an RDMA connection. - -The next step is enabling L2 MAC address RX filters for directing received -multicasts to the RAW_ETH QPs using the ibv_attach_multicast() verb. - -From this point the application is ready to receive and transmit multicast -traffic. - -In multicast acceleration the user application passes to ibv_post_send() -a whole IGMP frame including MAC header, IP header, UDP header and UDP payload. -It is the user's responsibility to perform IP fragmentation when the required -payload is larger than the MTU. Every fragment is a separate L2 frame to transmit. -ibv_poll_cq() provides information about the status of the transmit buffer. - -On the receive path, ibv_poll_cq() returns information about the received L2 -packet; the Rx buffer (previously posted by ibv_post_recv() ) contains -the whole L2 frame including MAC header, IP header and UDP header. -It is the user application's responsibility to check whether the received packet -is a valid UDP frame, so the fragments must be checked and checksums must be -computed. - -IMA API description (NE020 specific): -User application must create separate CQs for RX and TX path. -Only a single SGE on transmit is supported. -User application must post at least 65 rx buffers to keep RX path working. - -IMA device: -IMA requires creation of the /dev/infiniband/nes_ud_sksq device to get -access to the optimized IMA transmit path. The best method for creating this -device is to manually add the following line to the /etc/udev/rules.d/90-ib.rules -file after installing the OFED distribution and rebooting the machine.
- -KERNEL=="nes_ud_sksq", NAME="infiniband/%k", MODE="0644" - -As a result the 90-ib.rules should look like: - -KERNEL=="umad*", NAME="infiniband/%k" -KERNEL=="issm*", NAME="infiniband/%k" -KERNEL=="ucm*", NAME="infiniband/%k", MODE="0666" -KERNEL=="uverbs*", NAME="infiniband/%k", MODE="0666" -KERNEL=="ucma", NAME="infiniband/%k", MODE="0666" -KERNEL=="rdma_cm", NAME="infiniband/%k", MODE="0666" -KERNEL=="nes_ud_sksq", NAME="infiniband/%k", MODE="0644" - - - -NetEffect is a trademark of Intel Corporation in the U.S. and other countries. diff --git a/nfs-rdma.release-notes.txt b/nfs-rdma.release-notes.txt deleted file mode 100644 index 9b1a794..0000000 --- a/nfs-rdma.release-notes.txt +++ /dev/null @@ -1,230 +0,0 @@ -################################################################################ -# # -# NFS/RDMA README # -# # -################################################################################ - - Author: NetApp and Open Grid Computing - - Adapted for OFED 1.5.1 (from linux-2.6.30/Documentation/filesystems/nfs-rdma.txt) - by Jon Mason - -Table of Contents -~~~~~~~~~~~~~~~~~ - - Overview - - OFED 1.5.1 limitations - - Getting Help - - Installation - - Check RDMA and NFS Setup - - NFS/RDMA Setup - -Overview -~~~~~~~~ - - This document describes how to install and setup the Linux NFS/RDMA client - and server software. - - The NFS/RDMA client was first included in Linux 2.6.24. The NFS/RDMA server - was first included in the following release, Linux 2.6.25. - - In our testing, we have obtained excellent performance results (full 10Gbit - wire bandwidth at minimal client CPU) under many workloads. The code passes - the full Connectathon test suite and operates over both Infiniband and iWARP - RDMA adapters. 
- -OFED 1.5.1 limitations: -~~~~~~~~~~~~~~~~~~~~~ - NFS-RDMA is supported for the following releases: - - Redhat Enterprise Linux (RHEL) version 5.2 - - Redhat Enterprise Linux (RHEL) version 5.3 - - Redhat Enterprise Linux (RHEL) version 5.4 - - SUSE Linux Enterprise Server (SLES) version 11 - - And the following kernel.org kernels: - - 2.6.22 - - 2.6.25 - - 2.6.30 - - All other Linux Distributions and kernel versions are NOT supported on OFED - 1.5.1 - -Getting Help -~~~~~~~~~~~~ - - If you get stuck, you can ask questions on the - nfs-rdma-devel@lists.sourceforge.net, or linux-rdma@vger.kernel.org - mailing lists. - -Installation -~~~~~~~~~~~~ - - These instructions are a step by step guide to building a machine for - use with NFS/RDMA. - - - Install an RDMA device - - Any device supported by the drivers in drivers/infiniband/hw is acceptable. - - Testing has been performed using several Mellanox-based IB cards and - the Chelsio cxgb3 iWARP adapter. - - - Install OFED 1.5.1 - - NFS/RDMA has been tested on RHEL5.2, RHEL 5.3, RHEL5.4, SLES11, - kernels 2.6.22, 2.6.25, and 2.6.30. On these kernels, - NFS-RDMA will be installed by default if you simply select "install all", - and can be specifically included by a "custom" install. - - In addition, the install script will install a version of the nfs-utils that - is required for NFS/RDMA. The binary installed will be named "mount.rnfs". - This version is not necessary for Linux Distributions with nfs-utils 1.1 or - later. - - Upon successful installation, the nfs kernel modules will be placed in the - directory /lib/modules/`uname -r`/updates. It is recommended that you reboot - to ensure that the correct modules are loaded. - -Check RDMA and NFS Setup -~~~~~~~~~~~~~~~~~~~~~~~~ - - Before configuring the NFS/RDMA software, it is a good idea to test - your new kernel to ensure that the kernel is working correctly.
- In particular, it is a good idea to verify that the RDMA stack - is functioning as expected and standard NFS over TCP/IP and/or UDP/IP - is working properly. - - - Check RDMA Setup - - If you built the RDMA components as modules, load them at - this time. For example, if you are using a Mellanox Tavor/Sinai/Arbel - card: - - $ modprobe ib_mthca - $ modprobe ib_ipoib - - If you are using InfiniBand, make sure there is a Subnet Manager (SM) - running on the network. If your IB switch has an embedded SM, you can - use it. Otherwise, you will need to run an SM, such as OpenSM, on one - of your end nodes. - - If an SM is running on your network, you should see the following: - - $ cat /sys/class/infiniband/driverX/ports/1/state - 4: ACTIVE - - where driverX is mthca0, ipath5, ehca3, etc. - - To further test the InfiniBand software stack, use IPoIB (this - assumes you have two IB hosts named host1 and host2): - - host1$ ifconfig ib0 a.b.c.x - host2$ ifconfig ib0 a.b.c.y - host1$ ping a.b.c.y - host2$ ping a.b.c.x - - For other device types, follow the appropriate procedures. - - - Check NFS Setup - - For the NFS components enabled above (client and/or server), - test their functionality over standard Ethernet using TCP/IP or UDP/IP. - -NFS/RDMA Setup -~~~~~~~~~~~~~~ - - We recommend that you use two machines, one to act as the client and - one to act as the server. - - One time configuration: - - - On the server system, configure the /etc/exports file and - start the NFS/RDMA server. - - Exports entries with the following formats have been tested: - - /vol0 192.168.0.47(fsid=0,rw,async,insecure,no_root_squash) - /vol0 192.168.0.0/255.255.255.0(fsid=0,rw,async,insecure,no_root_squash) - - The IP address(es) is(are) the client's IPoIB address for an InfiniBand - HCA or the client's iWARP address(es) for an RNIC. - - NOTE: The "insecure" option must be used because the NFS/RDMA client does - not use a reserved port. 
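As a rough illustration of the NOTE above, the following shell sketch checks that every exports entry for a given path carries the "insecure" option. The function name check_rdma_export is hypothetical and is not part of OFED or nfs-utils. It assumes one client specification per line, as in the tested entries above, and it only reads the file.

```shell
#!/bin/bash
# Hypothetical helper (not shipped with OFED or nfs-utils).
# usage: check_rdma_export <exports-file> <exported-path>
# Returns 0 if the path is exported and every matching entry lists "insecure".
# Assumption: each line holds a single "path client(options)" pair.
check_rdma_export()
{
    local file path first rest opts status
    file=$1
    path=$2
    status=1                      # stays 1 if the path is never found
    while read -r first rest; do
        # Comment and blank lines never match an absolute path, so they are skipped here.
        [ "$first" = "$path" ] || continue
        opts=${rest#*\(}          # drop everything up to the opening parenthesis
        opts=${opts%\)*}          # drop the closing parenthesis
        case ",$opts," in
            *,insecure,*)
                status=0
                ;;
            *)
                echo "$path: entry '$rest' lacks the insecure option"
                return 1
                ;;
        esac
    done < "$file"
    return $status
}
```

For example, "check_rdma_export /etc/exports /vol0" would be expected to succeed for the two tested entries shown above and to fail if "insecure" were removed from either of them.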
- - Each time a machine boots: - - - Load and configure the RDMA drivers - - For InfiniBand using a Mellanox adapter: - - $ modprobe ib_mthca - $ modprobe ib_ipoib - $ ifconfig ib0 a.b.c.d - - NOTE: use unique addresses for the client and server - - - Start the NFS server - - Load the RDMA transport module: - - $ modprobe svcrdma - - Start the server: - - $ /etc/init.d/nfsserver start - - or - - $ service nfs start - - Instruct the server to listen on the RDMA transport: - - $ echo rdma 20049 > /proc/fs/nfsd/portlist - - NOTE for SLES10 servers: The nfs start scripts on most distros start - rpc.statd by default. However, the in-kernel lockd that was in SLES10 has - been removed in the new kernels. Since OFED is back-porting the new code to - the older distros, there is no in-kernel lockd in SLES10 and the SLES10 - nfsserver scripts do not know that they need to start it. Therefore, the - nfsserver scripts will be modified when the rnfs-utils rpm is installed to - start/stop rpc.statd. - - - On the client system - - Load the RDMA client module: - - $ modprobe xprtrdma - - Mount the NFS/RDMA server: - - $ mount.rnfs :/ /mnt -o proto=rdma,port=20049 - - NOTE: For kernels < 2.6.23, the "-i" flag must be passed into mount.rnfs. - This option allows the mount command to ignore the kernel version check. If - not disabled, the check will prevent passing arguments to the kernel and not - allow the updated version of NFS to accept the "rdma" NFS option. - - To verify that the mount is using RDMA, run "cat /proc/mounts" and check - the "proto" field for the given mount. - - Congratulations! You're using NFS/RDMA! - -Known Issues -~~~~~~~~~~~~~~~~~~~~~~~~ - -If you're running NFSRDMA over Chelsio's T3 RNIC and your clients are using -a 64KB page size (like PPC64 and IA64 systems) and your server is using a -4KB page size (like i386 and X86_64), then you need to mount the server -using rsize=32768,wsize=32768 to avoid overrunning the Chelsio RNIC fast -register limits.
This is a known firmware limitation in the Chelsio RNIC. - -Running NFSRDMA over Mellanox's ConnectX HCA requires that the adapter firmware -be 2.7.0 or greater on all NFS clients and servers. Firmware 2.6.0 has known -issues that prevent the RDMA connection from being established. Firmware 2.7.0 -has resolved these issues. - -IPv6 support requires portmap that supports version 4. Portmap included in RHEL5 -and SLES10 only supports version 2. Without version 4 support, the following -error will be logged: - svc: failed to register lockdv1 (errno 97). -This error will not affect IPv4 support. diff --git a/ofed_patch.sh b/ofed_patch.sh deleted file mode 100755 index 0497f95..0000000 --- a/ofed_patch.sh +++ /dev/null @@ -1,268 +0,0 @@ -#!/bin/bash -# -# Copyright (c) 2009 Mellanox Technologies. All rights reserved. -# -# This Software is licensed under one of the following licenses: -# -# 1) under the terms of the "Common Public License 1.0" a copy of which is -# available from the Open Source Initiative, see -# http://www.opensource.org/licenses/cpl.php. -# -# 2) under the terms of the "The BSD License" a copy of which is -# available from the Open Source Initiative, see -# http://www.opensource.org/licenses/bsd-license.php. -# -# 3) under the terms of the "GNU General Public License (GPL) Version 2" a -# copy of which is available from the Open Source Initiative, see -# http://www.opensource.org/licenses/gpl-license.php. -# -# Licensee has the right to choose one of the above licenses. -# -# Redistributions of source code must retain the above copyright -# notice and one of the license notices. -# -# Redistributions in binary form must reproduce both the above copyright -# notice, one of the license notices in the documentation -# and/or other materials provided with the distribution. 
-# -# -# Add/Remove a patch to/from OFED's ofa_kernel package - - -usage() -{ -cat << EOF - - Usage: - Add patch to OFED: - `basename $0` --add - --ofed|-o - --patch|-p - --type|-t |addons > - - Remove patch from OFED: - `basename $0` --remove - --ofed|-o - --patch|-p - --type|-t |addons > - - Example: - `basename $0` --add --ofed /tmp/OFED-1.X/ --patch /tmp/cma_establish.patch --type kernel - - `basename $0` --remove --ofed /tmp/OFED-1.X/ --patch cma_establish.patch --type kernel - -EOF -} - -action="" - -# Execute command w/ echo and exit if it fail -ex() -{ - echo "$@" - if ! "$@"; then - printf "\nFailed executing $@\n\n" - exit 1 - fi -} - -add_patch() -{ - if [ -f $2/${1##*/} ]; then - echo Replacing $2/${1##*/} - ex /bin/rm -f $2/${1##*/} - fi - ex cp $1 $2 -} - -remove_patch() -{ - if [ -f $2/${1##*/} ]; then - echo Removing $2/${1##*/} - ex /bin/rm -f $2/${1##*/} - else - echo Patch $2/${1##*/} was not found - exit 1 - fi -} - -set_rpm_info() -{ - package_SRC_RPM=$(/bin/ls -1 ${ofed}/SRPMS/${1}*src.rpm 2> /dev/null) - if [[ -n "${package_SRC_RPM}" && -s ${package_SRC_RPM} ]]; then - package_name=$(rpm --queryformat "[%{NAME}]" -qp ${package_SRC_RPM}) - package_ver=$(rpm --queryformat "[%{VERSION}]" -qp ${package_SRC_RPM}) - package_rel=$(rpm --queryformat "[%{RELEASE}]" -qp ${package_SRC_RPM}) - else - echo $1 src.rpm not found under ${ofed}/SRPMS - exit 1 - fi -} - -main() -{ - while [ ! -z "$1" ] - do - case $1 in - --add) - action="add" - shift - ;; - --remove) - action="remove" - shift - ;; - --ofed|-o) - ofed=$2 - shift 2 - ;; - --patch|-p) - patch=$2 - shift 2 - ;; - --type|-t) - type=$2 - shift 2 - case ${type} in - backport|addons) - tag=$1 - shift - ;; - esac - ;; - --help|-h) - usage - exit 0 - ;; - *) - usage - exit 1 - ;; - esac - done - - if [ -z "$action" ]; then - usage - exit 1 - fi - - if [ -z "$ofed" ] || [ ! -d "$ofed" ]; then - echo Set the path to the OFED directory. 
Use \'--ofed\' parameter - exit 1 - else - ofed=$(readlink -f $ofed) - fi - - if [ "$action" == "add" ]; then - if [ -z "$patch" ] || [ ! -r "$patch" ]; then - echo Set the path to the patch file. Use \'--patch\' parameter - exit 1 - else - patch=$(readlink -f $patch) - fi - else - if [ -z "$patch" ]; then - echo Set the name of the patch to be removed. Use \'--patch\' parameter - exit 1 - fi - fi - - if [ -z "$type" ]; then - echo Set the type of the patch. Use \'--type\' parameter - exit 1 - fi - - if [ "$type" == "backport" ] || [ "$type" == "addons" ]; then - if [ -z "$tag" ]; then - echo Set tag for backport patch. - exit 1 - fi - fi - - # Get ofa RPM version - case $type in - kernel|backport|addons) - set_rpm_info ofa_kernel - ;; - *) - echo "Unknown type $type" - exit 1 - ;; - esac - - package=${package_name}-${package_ver} - cd ${ofed} - if [ ! -e SRPMS/${package}-${package_rel}.src.rpm ]; then - echo File ${ofed}/SRPMS/${package}-${package_rel}.src.rpm not found - exit 1 - fi - - if ! ( set -x && rpm -i --define "_topdir $(pwd)" SRPMS/${package}-${package_rel}.src.rpm && set +x ); then - echo "Failed to install ${package}-${package_rel}.src.rpm" - exit 1 - fi - - cd - - - cd ${ofed}/SOURCES - ex tar xzf ${package}.tgz - - case $type in - kernel) - if [ "$action" == "add" ]; then - add_patch $patch ${package}/kernel_patches/fixes - else - remove_patch $patch ${package}/kernel_patches/fixes - fi - ;; - backport) - if [ "$action" == "add" ]; then - if [ ! -d ${package}/kernel_patches/backport/$tag ]; then - echo Creating ${package}/kernel_patches/backport/$tag directory - ex mkdir -p ${package}/kernel_patches/backport/$tag - echo WARNING: Check that ${package} configure supports backport/$tag - fi - add_patch $patch ${package}/kernel_patches/backport/$tag - else - remove_patch $patch ${package}/kernel_patches/backport/$tag - fi - ;; - addons) - if [ "$action" == "add" ]; then - if [ ! 
-d ${package}/kernel_addons/backport/$tag ]; then - echo Creating ${package}/kernel_addons/backport/$tag directory - ex mkdir -p ${package}/kernel_addons/backport/$tag - echo WARNING: Check that ${package} configure supports backport/$tag - fi - add_patch $patch ${package}/kernel_addons/backport/$tag - else - remove_patch $patch ${package}/kernel_addons/backport/$tag - fi - ;; - *) - echo Unknown patch type: $type - exit 1 - ;; - esac - - ex tar czf ${package}.tgz ${package} - cd - - - cd ${ofed} - echo Rebuilding ${package_name} source rpm: - if ! ( set -x && rpmbuild -bs --define "_topdir $(pwd)" SPECS/${package_name}.spec && set +x ); then - echo Failed to create ${package}-${package_rel}.src.rpm - exit 1 - fi - ex rm -rf SOURCES/${package}* - if [ "$action" == "add" ]; then - echo Patch added successfully. - else - echo Patch removed successfully. - fi - echo - echo Remove existing RPM packages from ${ofed}/RPMS directory in order - echo to rebuild RPMs -} - -main "$@" diff --git a/open_mpi_release_notes.txt b/open_mpi_release_notes.txt deleted file mode 100644 index e231b04..0000000 --- a/open_mpi_release_notes.txt +++ /dev/null @@ -1,1756 +0,0 @@ - Open Fabrics Enterprise Distribution (OFED) - Open MPI in OFED 1.5.1 Copyrights, License, and Release Notes - - March 2010 - -Open MPI Copyrights -------------------- -Most files in this release are marked with the copyrights of the -organizations who have edited them. The copyrights below generally -reflect members of the Open MPI core team who have contributed code to -this release. The copyrights for code used under license from other -parties are included in the corresponding files. - -Copyright (c) 2004-2008 The Trustees of Indiana University and Indiana - University Research and Technology - Corporation. All rights reserved. -Copyright (c) 2004-2009 The University of Tennessee and The University - of Tennessee Research Foundation. All rights - reserved.
-Copyright (c) 2004-2008 High Performance Computing Center Stuttgart, - University of Stuttgart. All rights reserved. -Copyright (c) 2004-2007 The Regents of the University of California. - All rights reserved. -Copyright (c) 2006-2009 Los Alamos National Security, LLC. All rights - reserved. -Copyright (c) 2006-2009 Cisco Systems, Inc. All rights reserved. -Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved. -Copyright (c) 2006-2008 Sandia National Laboratories. All rights - reserved. -Copyright (c) 2006-2009 Sun Microsystems, Inc. All rights reserved. - Use is subject to license terms. -Copyright (c) 2006-2009 The University of Houston. All rights - reserved. -Copyright (c) 2006-2008 Myricom, Inc. All rights reserved. -Copyright (c) 2007-2008 UT-Battelle, LLC. All rights reserved. -Copyright (c) 2007-2008 IBM Corporation. All rights reserved. -Copyright (c) 1998-2005 Forschungszentrum Juelich, Juelich - Supercomputing - Centre, Federal Republic of Germany -Copyright (c) 2005-2008 ZIH, TU Dresden, Federal Republic of Germany -Copyright (c) 2007 Evergrid, Inc. All rights reserved. -Copyright (c) 2008 Institut National de Recherche en - Informatique. All rights reserved. -Copyright (c) 2007 Lawrence Livermore National Security, LLC. - All rights reserved. -Copyright (c) 2007-2010 Mellanox Technologies. All rights reserved. -Copyright (c) 2006 QLogic Corporation. All rights reserved. - -Additional copyrights may follow - -Open MPI License ----------------- -Redistribution and use in source and binary forms, with or without -modification, are permitted provided that the following conditions are -met: - -- Redistributions of source code must retain the above copyright - notice, this list of conditions and the following disclaimer. 
- -- Redistributions in binary form must reproduce the above copyright - notice, this list of conditions and the following disclaimer listed - in this license in the documentation and/or other materials - provided with the distribution. - -- Neither the name of the copyright holders nor the names of its - contributors may be used to endorse or promote products derived from - this software without specific prior written permission. - -The copyright holders provide no reassurances that the source code -provided does not infringe any patent, copyright, or any other -intellectual property rights of third parties. The copyright holders -disclaim any liability to any recipient for claims brought against -recipient by any third party for infringement of that parties -intellectual property rights. - -THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS -"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT -LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR -A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT -OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, -SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT -LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, -DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY -THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT -(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE -OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - -=========================================================================== - -When submitting questions and problems, be sure to include as much -extra information as possible. 
This web page details all the -information that we request in order to provide assistance: - - http://www.open-mpi.org/community/help/ - -The best way to report bugs, send comments, or ask questions is to -sign up on the user's and/or developer's mailing list (for user-level -and developer-level questions; when in doubt, send to the user's -list): - - users@open-mpi.org - devel@open-mpi.org - -Because of spam, only subscribers are allowed to post to these lists -(ensure that you subscribe with and post from exactly the same e-mail -address -- joe@example.com is considered different than -joe@mycomputer.example.com!). Visit these pages to subscribe to the -lists: - - http://www.open-mpi.org/mailman/listinfo.cgi/users - http://www.open-mpi.org/mailman/listinfo.cgi/devel - -Thanks for your time. - -=========================================================================== - -Much, much more information is also available in the Open MPI FAQ: - - http://www.open-mpi.org/faq/ - -=========================================================================== - -OFED-Specific Release Notes ---------------------------- - -** SLES 10 with Pathscale compiler support: - -Using the Pathscale compiler to build Open MPI on SLES10 may result in -a non-functional Open MPI installation (every Open MPI command fails). -If this problem occurs, try upgrading your Pathscale installation to -the latest maintenance release, or use a different compiler to compile -Open MPI. - -** Intel compiler support: - -Some versions of the Intel 9.1 C++ compiler suite series produce -incorrect code when used with the Open MPI C++ bindings. Symptoms of -this problem include crashing applications (e.g., segmentation -violations) and Open MPI producing errors about incorrect parameters. -Be sure to upgrade to the latest maintenance release of the Intel 9.1 -compiler to avoid these problems. 
- -** Installing newer versions of Open MPI after OFED is installed: - -Open MPI can be built from source after OFED is fully installed. The -source code for Open MPI can be extracted from the SRPM shipped with -OFED or downloaded from the main Open MPI web site: -http://www.open-mpi.org/. - -To compile with Open MPI from source with OFED support, fully install -the rest of OFED. If you used the default prefix for the OFED -installation (/usr), Open MPI should build with OpenFabrics support by -default. If you used a different OFED prefix, you must tell Open MPI -what it is with the "--with-openib=" switch to configure. -You can verify that Open MPI installed with OpenFabrics support by -running (the exact version numbers displayed may be different; the -important part is that the "openib" BTL is displayed): - - shell$ ompi_info | grep openib - MCA btl: openib (MCA v2.0, API v2.0, Component v1.3.2) - -See the rest of the documentation below for other configure command -line options and installation instructions. - -** Changelog summary - -Showing versions 1.2.7 - 1.4; see the "NEWS" file in an Open MPI -distribution for the full list. - -1.4.1 (OFED version) ---- -- Update support for various OpenFabrics devices in the openib BTL's - .ini file. -- Fixing RDMA CM failure during QP creation (Ticket #2307) - -1.4.1 ---- -- Update to PLPA v1.3.2, addressing a licensing issue identified by - the Fedora project. See - https://svn.open-mpi.org/trac/plpa/changeset/262 for details. -- Add check for malformed checkpoint metadata files (Ticket #2141). -- Fix error path in ompi-checkpoint when not able to checkpoint - (Ticket #2138). -- Cleanup component release logic when selecting checkpoint/restart - enabled components (Ticket #2135). -- Fixed VT node name detection for Cray XT platforms, and fixed some - broken VT documentation files. -- Fix a possible race condition in tearing down RDMA CM-based - connections. -- Relax error checking on MPI_GRAPH_CREATE. 
Thanks to David Singleton - for pointing out the issue. -- Fix a shared memory "hang" problem that occurred on x86/x86_64 - platforms when used with the GNU >=4.4.x compiler series. -- Add fix for Libtool 2.2.6b's problems with the PGI 10.x compiler - suite. Inspired directly from the upstream Libtool patches that fix - the issue (but we need something working before the next Libtool - release). - -1.4 ---- - -The *only* change in the Open MPI v1.4 release (as compared to v1.3.4) -was to update the embedded version of Libtool's libltdl to address a -potential security vulnerability. Specifically: Open MPI v1.3.4 was -created with GNU Libtool 2.2.6a; Open MPI v1.4 was created with GNU -Libtool 2.2.6b. There are no other changes between Open MPI v1.3.4 -and v1.4. - - -1.3.4 ------ - -- Fix some issues in OMPI's SRPM with regard to shell_scripts_basename - and its use with mpi-selector. Thanks to Bill Johnstone for - pointing out the problem. -- Added many new MPI job process affinity options to mpirun. See the - newly-updated mpirun(1) man page for details. -- Several updates to mpirun's XML output. -- Update to fix a few Valgrind warnings with regards to the ptmalloc2 - allocator and Open MPI's use of PLPA. -- Many updates and fixes to the (non-default) "sm" collective - component (i.e., native shared memory MPI collective operations). -- Updates and fixes to some MPI_COMM_SPAWN_MULTIPLE corner cases. -- Fix some internal copying functions in Open MPI's use of PLPA. -- Correct some SLURM nodelist parsing logic that may have interfered - with large jobs. Additionally, per advice from the SLURM team, - change the environment variable that we use for obtaining the job's - allocation. -- Revert to an older, safer (but slower) communicator ID allocation - algorithm. -- Fixed minimum distance finding for OpenFabrics devices in the openib - BTL. -- Relax the parameter checking MPI_CART_CREATE a bit. 
-- Fix MPI_COMM_SPAWN[_MULTIPLE] to only error-check the info arguments - on the root process. Thanks to Federico Golfre Andreasi for - reporting the problem. -- Fixed some BLCR configure issues. -- Fixed a potential deadlock when the openib BTL was used with - MPI_THREAD_MULTIPLE. -- Fixed dynamic rules selection for the "tuned" coll component. -- Added a launch progress meter to mpirun (useful for large jobs; set - the orte_report_launch_progress MCA parameter to 1 to see it). -- Reduced the number of file descriptors consumed by each MPI process. -- Add new device IDs for Chelsio T3 RNICs to the openib BTL config file. -- Fix some CRS self component issues. -- Added some MCA parameters to the PSM MTL to tune its run-time - behavior. -- Fix some VT issues with MPI_BOTTOM/MPI_IN_PLACE. -- Man page updates from the Debian Open MPI package maintainers. -- Add cycle counter support for the Alpha and Sparc platforms. -- Pass visibility flags to libltdl's configure script, resulting in - those symbols being hidden. This appears to mainly solve the - problem of applications attempting to use different versions of - libltdl from that used to build Open MPI. - - -1.3.3 ------ - -- Fix a number of issues with the openib BTL (OpenFabrics) RDMA CM, - including a memory corruption bug, a shutdown deadlock, and a route - timeout. Thanks to David McMillen and Hal Rosenstock for help in - tracking down the issues. -- Change the behavior of the EXTRA_STATE parameter that is passed to - Fortran attribute callback functions: this value is now stored - internally in MPI -- it no longer references the original value - passed by MPI_*_CREATE_KEYVAL. -- Allow overriding of RFC1918 and RFC3330 for the specification of - "private" networks, thereby influencing Open MPI's TCP - "reachability" computations. -- Improve flow control issues in the sm BTL, by both tweaking the - shared memory progression rules and by enabling the "sync" collective - to barrier every 1,000th collective.
-- Various fixes for the IBM XL C/C++ v10.1 compiler. -- Allow explicit disabling of ptmalloc2 hooks at runtime (e.g., enable - support for Debian's builtroot system). Thanks to Manuel Prinz and - the rest of the Debian crew for helping identify and fix this issue. -- Various minor fixes for the I/O forwarding subsystem. -- Big endian iWARP fixes in the Open Fabrics RDMA CM support. -- Update support for various OpenFabrics devices in the openib BTL's - .ini file. -- Fixed undefined symbol issue with Open MPI's parallel debugger - message queue support so it can be compiled by Sun Studio compilers. -- Update MPI_SUBVERSION to 1 in the Fortran bindings. -- Fix MPI_GRAPH_CREATE Fortran 90 binding. -- Fix MPI_GROUP_COMPARE behavior with regards to MPI_IDENT. Thanks to - Geoffrey Irving for identifying the problem and supplying the fix. -- Silence gcc 4.1 compiler warnings about type punning. Thanks to - Number Cruncher for the fix. -- Added more Valgrind and other memory-cleanup fixes. Thanks to - various Open MPI users for help with these issues. -- Miscellaneous VampirTrace fixes. -- More fixes for openib credits in heavy-congestion scenarios. -- Slightly decrease the latency in the openib BTL in some conditions - (add "send immediate" support to the openib BTL). -- Ensure to allow MPI_REQUEST_GET_STATUS to accept an - MPI_STATUS_IGNORE parameter. Thanks to Shaun Jackman for the bug - report. -- Added Microsoft Windows support. See README.WINDOWS file for - details. - - -1.3.2 ------ - -- Fixed a potential infinite loop in the openib BTL that could occur - in senders in some frequent-communication scenarios. Thanks to Don - Wood for reporting the problem. 
-- Add a new checksum PML variation on ob1 (main MPI point-to-point - communication engine) to detect memory corruption in node-to-node - messages -- Add a new configuration option to add padding to the openib - header so the data is aligned -- Add a new configuration option to use an alternative checksum algo - when using the checksum PML -- Fixed a problem reported by multiple users on the mailing list that - the LSF support would fail to find the appropriate libraries at - run-time. -- Allow empty shell designations from getpwuid(). Thanks to Sergey - Koposov for the bug report. -- Ensure that mpirun exits with non-zero status when applications die - due to user signal. Thanks to Geoffroy Pignot for suggesting the - fix. -- Ensure that MPI_VERSION / MPI_SUBVERSION match what is returned by - MPI_GET_VERSION. Thanks to Rob Egan for reporting the error. -- Updated MPI_*KEYVAL_CREATE functions to properly handle Fortran - extra state. -- A variety of ob1 (main MPI point-to-point communication engine) bug - fixes that could have caused hangs or seg faults. -- Do not install Open MPI's signal handlers in MPI_INIT if there are - already signal handlers installed. Thanks to Kees Verstoep for - bringing the issue to our attention. -- Fix GM support to not seg fault in MPI_INIT. -- Various VampirTrace fixes. -- Various PLPA fixes. -- No longer create BTLs for invalid (TCP) devices. -- Various man page style and lint cleanups. -- Fix critical OpenFabrics-related bug noted here: - http://www.open-mpi.org/community/lists/announce/2009/03/0029.php. - Open MPI now uses a much more robust memory intercept scheme that is - quite similar to what is used by MX. The use of "-lopenmpi-malloc" - is no longer necessary, is deprecated, and is expected to disappear - in a future release. -lopenmpi-malloc will continue to work for the - duration of the Open MPI v1.3 and v1.4 series. -- Fix some OpenFabrics shutdown errors, both regarding iWARP and SRQ. 
-- Allow the udapl BTL to work on Solaris platforms that support - relaxed PCI ordering. -- Fix problem where the mpirun would sometimes use rsh/ssh to launch on - the localhost (instead of simply forking). -- Minor SLURM stdin fixes. -- Fix to run properly under SGE jobs. -- Scalability and latency improvements for shared memory jobs: convert - to using one message queue instead of N queues. -- Automatically size the shared-memory area (mmap file) to match - better what is needed; specifically, so that large-np jobs will start. -- Use fixed-length MPI predefined handles in order to provide ABI - compatibility between Open MPI releases. -- Fix building of the posix paffinity component to properly get the - number of processors in loosely tested environments (e.g., - FreeBSD). Thanks to Steve Kargl for reporting the issue. -- Fix --with-libnuma handling in configure. Thanks to Gus Correa for - reporting the problem. - - -1.3.1 ------ - -- Added "sync" coll component to allow users to synchronize every N - collective operations on a given communicator. -- Increased the default values of the IB and RNR timeout MCA parameters. -- Fix a compiler error noted by Mostyn Lewis with the PGI 8.0 compiler. -- Fix an error that prevented stdin from being forwarded if the - rsh launcher was in use. Thanks to Branden Moore for pointing out - the problem. -- Correct a case where the added datatype is considered as contiguous but - has gaps in the beginning. -- Fix an error that limited the number of comm_spawns that could - simultaneously be running in some environments -- Correct a corner case in OB1's GET protocol for long messages; the - error could sometimes cause MPI jobs using the openib BTL to hang. -- Fix a bunch of bugs in the IO forwarding (IOF) subsystem and add some - new options to output to files and redirect output to xterm. Thanks to - Jody Weissmann for helping test out many of the new fixes and - features. -- Fix SLURM race condition. 
-- Fix MPI_File_c2f(MPI_FILE_NULL) to return 0, not -1. Thanks to - Lisandro Dalcin for the bug report. -- Fix the DSO build of tm PLM. -- Various fixes for size disparity between C int's and Fortran - INTEGER's. Thanks to Christoph van Wullen for the bug report. -- Ensure that mpirun exits with a non-zero exit status when daemons or - processes abort or fail to launch. -- Various fixes to work around Intel (NetEffect) RNIC behavior. -- Various fixes for mpirun's --preload-files and --preload-binary - options. -- Fix the string name in MPI::ERRORS_THROW_EXCEPTIONS. -- Add ability to forward SIGTSTP and SIGCONT to MPI processes if you - set the MCA parameter orte_forward_job_control to 1. -- Allow the sm BTL to allocate larger amounts of shared memory if - desired (helpful for very large multi-core boxen). -- Fix a few places where we used PATH_MAX instead of OMPI_PATH_MAX, - leading to compile problems on some platforms. Thanks to Andrea Iob - for the bug report. -- Fix mca_btl_openib_warn_no_device_params_found MCA parameter; it - was accidentally being ignored. -- Fix some run-time issues with the sctp BTL. -- Ensure that RTLD_NEXT exists before trying to use it (e.g., it - doesn't exist on Cygwin). Thanks to Gustavo Seabra for reporting - the issue. -- Various fixes to VampirTrace, including fixing compile errors on - some platforms. -- Fixed missing MPI_Comm_accept.3 man page; fixed minor issue in - orterun.1 man page. Thanks to Dirk Eddelbuettel for identifying the - problem and submitting a patch. -- Implement the XML formatted output of stdout/stderr/stddiag. -- Fixed mpirun's -wdir switch to ensure that working directories for - multiple app contexts are properly handled. Thanks to Geoffroy - Pignot for reporting the problem. -- Improvements to the MPI C++ integer constants: - - Allow MPI::SEEK_* constants to be used as constants - - Allow other MPI C++ constants to be used as array sizes -- Fix minor problem with orte-restart's command line options.
See - ticket #1761 for details. Thanks to Gregor Dschung for reporting - the problem. - -1.3 ---- - -- Extended the OS X 10.5.x (Leopard) workaround for a problem when - assembly code is compiled with -g[0-9]. Thanks to Barry Smith for - reporting the problem. See ticket #1701. -- Disabled MPI_REAL16 and MPI_COMPLEX32 support on platforms where the - bit representation of REAL*16 is different than that of the C type - of the same size (usually long double). Thanks to Julien Devriendt - for reporting the issue. See ticket #1603. -- Increased the size of MPI_MAX_PORT_NAME to 1024 from 36. See ticket #1533. -- Added "notify debugger on abort" feature. See tickets #1509 and #1510. - Thanks to Seppo Sahrakropi for the bug report. -- Upgraded Open MPI tarballs to use Autoconf 2.63, Automake 1.10.1, - Libtool 2.2.6a. -- Added missing MPI::Comm::Call_errhandler() function. Thanks to Dave - Goodell for bringing this to our attention. -- Increased MPI_SUBVERSION value in mpi.h to 1 (i.e., MPI 2.1). -- Changed behavior of MPI_GRAPH_CREATE, MPI_TOPO_CREATE, and several - other topology functions per MPI-2.1. -- Fix the type of the C++ constant MPI::IN_PLACE. -- Various enhancements to the openib BTL: - - Added btl_openib_if_[in|ex]clude MCA parameters for - including/excluding comma-delimited lists of HCAs and ports. 
- - Added RDMA CM support, including btl_openib_cpc_[in|ex]clude MCA - parameters - - Added NUMA support to only use "near" network adapters - - Added "Bucket SRQ" (BSRQ) support to better utilize registered - memory, including btl_openib_receive_queues MCA parameter - - Added ConnectX XRC support (and integrated with BSRQ) - - Added btl_openib_ib_max_inline_data MCA parameter - - Added iWARP support - - Revamped flow control mechanisms to be more efficient - - "mpi_leave_pinned=1" is now the default when possible, - automatically improving performance for large messages when - application buffers are re-used -- Eliminated duplicated error messages when multiple MPI processes fail - with the same error. -- Added NUMA support to the shared memory BTL. -- Add Valgrind-based memory checking for MPI-semantic checks. -- Add support for some optional Fortran datatypes (MPI_LOGICAL1, - MPI_LOGICAL2, MPI_LOGICAL4 and MPI_LOGICAL8). -- Remove the use of the STL from the C++ bindings. -- Added support for Platform/LSF job launchers. Must be Platform LSF - v7.0.2 or later. -- Updated ROMIO with the version from MPICH2 1.0.7. -- Added RDMA capable one-sided component (called rdma), which - can be used with BTL components that expose a full one-sided - interface. -- Added the optional datatype MPI_REAL2. As this is added to the "end of" - predefined datatypes in the Fortran header files, there will not be - any compatibility issues. -- Added Portable Linux Processor Affinity (PLPA) for Linux. -- Addition of a finer symbols export control via the visibility feature - offered by some compilers. -- Added checkpoint/restart process fault tolerance support. Initially - support a LAM/MPI-like protocol. -- Removed "mvapi" BTL; all InfiniBand support now uses the OpenFabrics - driver stacks ("openib" BTL). -- Added more stringent MPI API parameter checking to help user-level - debugging.
-- The ptmalloc2 memory manager component is now by default built as - a standalone library named libopenmpi-malloc. Users wanting to - use leave_pinned with ptmalloc2 will now need to link the library - into their application explicitly. All other users will use the - libc-provided allocator instead of Open MPI's ptmalloc2. This change - may be overridden with the configure option --enable-ptmalloc2-internal. -- The leave_pinned options will now default to using mallopt on - Linux in the cases where ptmalloc2 was not linked in. mallopt - will also only be available if munmap can be intercepted (the - default whenever Open MPI is not compiled with - --without-memory-manager). -- Open MPI will now complain and refuse to use leave_pinned if - no memory intercept / mallopt option is available. -- Add option of using Perl-based wrapper compilers instead of the - C-based wrapper compilers. The Perl-based version does not - have the features of the C-based version, but does work better - in cross-compile environments. - - -1.2.9 ------ - -- Fix a segfault when using one-sided communications on some forms of derived - datatypes. Thanks to Dorian Krause for reporting the bug. See #1715. -- Fix an alignment problem affecting one-sided communications on - some architectures (e.g., SPARC64). See #1738. -- Fix compilation on Solaris when thread support is enabled in Open MPI - (e.g., when using --with-threads). See #1736. -- Correctly take into account the MTU that an OpenFabrics device port - is using. See #1722 and - https://bugs.openfabrics.org/show_bug.cgi?id=1369. -- Fix two datatype engine bugs. See #1677. - Thanks to Peter Kjellstrom for the bugreport. -- Fix the bml r2 help filename so the help message can be found. See #1623. -- Fix a compilation problem on RHEL4U3 with the PGI 32 bit compiler - caused by . See ticket #1613. -- Fix the --enable-cxx-exceptions configure option. See ticket #1607. -- Properly handle when the MX BTL cannot open an endpoint. See ticket #1621. 
-- Fix a double free of events on the tcp_events list. See ticket #1631. -- Fix a buffer overrun in opal_free_list_grow (called by MPI_Init). - Thanks to Patrick Farrell for the bugreport and Stephan Kramer for - the bugfix. See ticket #1583. -- Fix a problem setting OPAL_PREFIX for remote sh-based shells. - See ticket #1580. - - -1.2.8 ------ - -- Tweaked one memory barrier in the openib component to be more conservative. - May fix a problem observed on PPC machines. See ticket #1532. -- Fix OpenFabrics IB partition support. See ticket #1557. -- Restore v1.1 feature that sourced .profile on remote nodes if the default - shell will not do so (e.g. /bin/sh and /bin/ksh). See ticket #1560. -- Fix segfault in MPI_Init_thread() if ompi_mpi_init() fails. See ticket #1562. -- Adjust SLURM support to first look for $SLURM_JOB_CPUS_PER_NODE instead of - the deprecated $SLURM_TASKS_PER_NODE environment variable. This change - may be *required* when using SLURM v1.2 and above. See ticket #1536. -- Fix the MPIR_Proctable to be in process rank order. See ticket #1529. -- Fix a regression introduced in 1.2.6 for the IBM eHCA. See ticket #1526. - - -1.2.7 ------ - -- Add some Sun HCA vendor IDs. See ticket #1461. -- Fixed a memory leak in MPI_Alltoallw when called from Fortran. - Thanks to Dave Grote for the bugreport. See ticket #1457. -- Only link in libutil when it is needed/desired. Thanks to - Brian Barret for diagnosing and fixing the problem. See ticket #1455. -- Update some QLogic HCA vendor IDs. See ticket #1453. -- Fix F90 binding for MPI_CART_GET. Thanks to Scott Beardsley for - bringing it to our attention. See ticket #1429. -- Remove a spurious warning message generated in/by ROMIO. See ticket #1421. -- Fix a bug where command-line MCA parameters were not overriding - MCA parameters set from environment variables. See ticket #1380. -- Fix a bug in the AMD64 atomics assembly. Thanks to Gabriele Fatigati - for the bug report and bugfix. See ticket #1351. 
-- Fix a gather and scatter bug on intercommunicators when the datatype - being moved is 0 bytes. See ticket #1331. -- Some more man page fixes from the Debian maintainers. - See tickets #1324 and #1329. -- Have openib BTL (OpenFabrics support) check for the presence of - /sys/class/infiniband before allowing itself to be used. This check - prevents spurious "OMPI did not find RDMA hardware!" notices on - systems that have the software drivers installed, but no - corresponding hardware. See tickets #1321 and #1305. -- Added vendor IDs for some ConnectX openib HCAs. See ticket #1311. -- Fix some RPM specfile inconsistencies. See ticket #1308. - Thanks to Jim Kusznir for noticing the problem. -- Removed an unused function prototype that caused warnings on - some systems (e.g., OS X). See ticket #1274. -- Fix a deadlock in inter-communicator scatter/gather operations. - Thanks to Martin Audet for the bug report. See ticket #1268. - -=========================================================================== - -Much, much more information is also available in the Open MPI FAQ: - - http://www.open-mpi.org/faq/ - -=========================================================================== - -General Release Notes ---------------------- - -Detailed Open MPI v1.3 Feature List: - - o Open MPI RunTime Environment (ORTE) improvements - - General robustness improvements - - Scalable job launch (we've seen ~16K processes in less than a - minute in a highly-optimized configuration) - - New process mappers - - Support for Platform/LSF environments (v7.0.2 and later) - - More flexible processing of host lists - - new mpirun cmd line options and associated functionality - - o Fault-Tolerance Features - - Asynchronous, transparent checkpoint/restart support - - Fully coordinated checkpoint/restart coordination component - - Support for the following checkpoint/restart services: - - blcr: Berkeley Lab's Checkpoint/Restart - - self: Application level callbacks - - Support for the 
following interconnects: - - tcp - - mx - - openib - - sm - - self - - Improved Message Logging - - o MPI_THREAD_MULTIPLE support for point-to-point messaging in the - following BTLs (note that only MPI point-to-point messaging API - functions support MPI_THREAD_MULTIPLE; other API functions likely - do not): - - tcp - - sm - - mx - - elan - - self - - o Point-to-point Messaging Layer (PML) improvements - - Memory footprint reduction - - Improved latency - - Improved algorithm for multiple communication device - ("multi-rail") support - - o Numerous Open Fabrics improvements/enhancements - - Added iWARP support (including RDMA CM) - - Memory footprint and performance improvements - - "Bucket" SRQ support for better registered memory utilization - - XRC/ConnectX support - - Message coalescing - - Improved error report mechanism with Asynchronous events - - Automatic Path Migration (APM) - - Improved processor/port binding - - Infrastructure for additional wireup strategies - - mpi_leave_pinned is now enabled by default - - o uDAPL BTL enhancements - - Multi-rail support - - Subnet checking - - Interface include/exclude capabilities - - o Processor affinity - - Linux processor affinity improvements - - Core/socket <--> process mappings - - o Collectives - - Performance improvements - - Support for hierarchical collectives (must be activated - manually; see below) - - o Miscellaneous - - MPI 2.1 compliant - - Sparse process groups and communicators - - Support for Cray Compute Node Linux (CNL) - - One-sided RDMA component (BTL-level based rather than PML-level - based) - - Aggregate MCA parameter sets - - MPI handle debugging - - Many small improvements to the MPI C++ bindings - - Valgrind support - - VampirTrace support - - Updated ROMIO to the version from MPICH2 1.0.7 - - Removed the mVAPI IB stacks - - Display most error messages only once (vs. 
once for each - process) - - Many other small improvements and bug fixes, too numerous to - list here - -Known issues ------------- - - o There is a segfault that sometimes occurs on one of our x86_64 test - clusters when using MPI onesided communications over Myrinet MX. - Since no one else has reported this problem we are not holding - up the 1.3 release. See ticket #1757 for the details, and any - possible workarounds. - - o XGrid support is currently broken. - https://svn.open-mpi.org/trac/ompi/ticket/1777 - - o MPI_REDUCE_SCATTER does not work with counts of 0. - https://svn.open-mpi.org/trac/ompi/ticket/1559 - - o Please also see the Open MPI bug tracker for bugs beyond this release. - https://svn.open-mpi.org/trac/ompi/report - -=========================================================================== - -The following abbreviated list of release notes applies to this code -base as of this writing (10 July 2009): - -General notes -------------- - -- Open MPI includes support for a wide variety of supplemental - hardware and software package. When configuring Open MPI, you may - need to supply additional flags to the "configure" script in order - to tell Open MPI where the header files, libraries, and any other - required files are located. As such, running "configure" by itself - may not include support for all the devices (etc.) that you expect, - especially if their support headers / libraries are installed in - non-standard locations. Network interconnects are an easy example - to discuss -- Myrinet and OpenFabrics networks, for example, both - have supplemental headers and libraries that must be found before - Open MPI can build support for them. You must specify where these - files are with the appropriate options to configure. See the - listing of configure command-line switches, below, for more details. - -- The majority of Open MPI's documentation is here in this file, the - included man pages, and on the web site FAQ - (http://www.open-mpi.org/). 
This will eventually be supplemented - with cohesive installation and user documentation files. - -- Note that Open MPI documentation uses the word "component" - frequently; the word "plugin" is probably more familiar to most - users. As such, end users can probably completely substitute the - word "plugin" wherever you see "component" in our documentation. - For what it's worth, we use the word "component" for historical - reasons, mainly because it is part of our acronyms and internal API - function calls. - -- The run-time systems that are currently supported are: - - rsh / ssh - - LoadLeveler - - PBS Pro, Open PBS, Torque - - Platform LSF (v7.0.2 and later) - - SLURM - - XGrid (known to be broken in 1.3 through 1.3.2) - - Cray XT-3 and XT-4 - - Sun Grid Engine (SGE) 6.1, 6.2 and open source Grid Engine - - Microsoft Windows CCP (Microsoft Windows server 2003 and 2008) - -- Systems that have been tested are: - - Linux (various flavors/distros), 32 bit, with gcc, and Sun Studio 12 - - Linux (various flavors/distros), 64 bit (x86), with gcc, Absoft, - Intel, Portland, Pathscale, and Sun Studio 12 compilers (*) - - OS X (10.4), 32 and 64 bit (i386, PPC, PPC64, x86_64), with gcc - and Absoft compilers (*) - - Solaris 10 update 2, 3 and 4, 32 and 64 bit (SPARC, i386, x86_64), - with Sun Studio 10, 11 and 12 - - (*) Be sure to read the Compiler Notes, below. - -- Other systems have been lightly (but not fully) tested: - - Other 64 bit platforms (e.g., Linux on PPC64) - - Microsoft Windows CCP (Microsoft Windows server 2003 and 2008); - more testing and support is expected later in the Open MPI v1.3.x - series. See the README.WINDOWS file. 
- -Compiler Notes --------------- - -- Mixing compilers from different vendors when building Open MPI - (e.g., using the C/C++ compiler from one vendor and the F77/F90 - compiler from a different vendor) has been successfully employed by - some Open MPI users (discussed on the Open MPI user's mailing list), - but such configurations are not tested and not documented. For - example, such configurations may require additional compiler / - linker flags to make Open MPI build properly. - -- Open MPI does not support the Sparc v8 CPU target, which is the - default on Sun Solaris. The v8plus (32 bit) or v9 (64 bit) - targets must be used to build Open MPI on Solaris. This can be - done by including a flag in CFLAGS, CXXFLAGS, FFLAGS, and FCFLAGS, - -xarch=v8plus for the Sun compilers, -mv8plus for GCC. - -- At least some versions of the Intel 8.1 compiler seg fault while - compiling certain Open MPI source code files. As such, it is not - supported. - -- The Intel 9.0 v20051201 compiler on IA64 platforms seems to have a - problem with optimizing the ptmalloc2 memory manager component (the - generated code will segv). As such, the ptmalloc2 component will - automatically disable itself if it detects that it is on this - platform/compiler combination. The only effect that this should - have is that the MCA parameter mpi_leave_pinned will be inoperative. - -- Early versions of the Portland Group 6.0 compiler have problems - creating the C++ MPI bindings as a shared library (e.g., v6.0-1). - Tests with later versions show that this has been fixed (e.g., - v6.0-5). - -- The Portland Group compilers prior to version 7.0 require the - "-Msignextend" compiler flag to extend the sign bit when converting - from a shorter to longer integer. This is different than other - compilers (such as GNU). 
When compiling Open MPI with the Portland - compiler suite, the following flags should be passed to Open MPI's - configure script: - - shell$ ./configure CFLAGS=-Msignextend CXXFLAGS=-Msignextend \ - --with-wrapper-cflags=-Msignextend \ - --with-wrapper-cxxflags=-Msignextend ... - - This will both compile Open MPI with the proper compile flags and - also automatically add "-Msignextend" when the C and C++ MPI wrapper - compilers are used to compile user MPI applications. - -- Using the MPI C++ bindings with the Pathscale compiler is known - to fail, possibly due to Pathscale compiler issues. - -- Using the Absoft compiler to build the MPI Fortran bindings on Suse - 9.3 is known to fail due to a Libtool compatibility issue. - -- Open MPI will build bindings suitable for all common forms of - Fortran 77 compiler symbol mangling on platforms that support it - (e.g., Linux). On platforms that do not support weak symbols (e.g., - OS X), Open MPI will build Fortran 77 bindings just for the compiler - that Open MPI was configured with. - - Hence, on platforms that support it, if you configure Open MPI with - a Fortran 77 compiler that uses one symbol mangling scheme, you can - successfully compile and link MPI Fortran 77 applications with a - Fortran 77 compiler that uses a different symbol mangling scheme. - - NOTE: For platforms that support the multi-Fortran-compiler bindings - (i.e., weak symbols are supported), due to limitations in the MPI - standard and in Fortran compilers, it is not possible to hide these - differences in all cases. Specifically, the following two cases may - not be portable between different Fortran compilers: - - 1. The C constants MPI_F_STATUS_IGNORE and MPI_F_STATUSES_IGNORE - will only compare properly to Fortran applications that were - created with Fortran compilers that use the same - name-mangling scheme as the Fortran compiler that Open MPI was - configured with. - - 2. 
Fortran compilers may have different values for the logical - .TRUE. constant. As such, any MPI function that uses the Fortran - LOGICAL type may only get .TRUE. values back that correspond to - the .TRUE. value of the Fortran compiler that Open MPI was - configured with. Note that some Fortran compilers allow forcing - .TRUE. to be 1 and .FALSE. to be 0. For example, the Portland - Group compilers provide the "-Munixlogical" option, and Intel - compilers (version >= 8.) provide the "-fpscomp logicals" option. - - You can use the ompi_info command to see the Fortran compiler that - Open MPI was configured with. - -- The Fortran 90 MPI bindings can now be built in one of three sizes - using --with-mpi-f90-size=SIZE (see description below). These sizes - reflect the number of MPI functions included in the "mpi" Fortran 90 - module and therefore which functions will be subject to strict type - checking. All functions not included in the Fortran 90 module can - still be invoked from F90 applications, but will fall back to - Fortran-77 style checking (i.e., little/none). - - - trivial: Only includes F90-specific functions from MPI-2. This - means overloaded versions of MPI_SIZEOF for all the MPI-supported - F90 intrinsic types. - - - small (default): All the functions in "trivial" plus all MPI - functions that take no choice buffers (meaning buffers that are - specified by the user and are of type (void*) in the C bindings -- - generally buffers specified for message passing). Hence, - functions like MPI_COMM_RANK are included, but functions like - MPI_SEND are not. - - - medium: All the functions in "small" plus all MPI functions that - take one choice buffer (e.g., MPI_SEND, MPI_RECV, ...). All - one-choice-buffer functions have overloaded variants for each of - the MPI-supported Fortran intrinsic types up to the number of - dimensions specified by --with-f90-max-array-dim (default value is - 4). 
- - Increasing the size of the F90 module (in order from trivial, small, - and medium) will generally increase the length of time required to - compile user MPI applications. Specifically, "trivial"- and - "small"-sized F90 modules generally allow user MPI applications to - be compiled fairly quickly but lose type safety for all MPI - functions with choice buffers. "medium"-sized F90 modules generally - take longer to compile user applications but provide greater type - safety for MPI functions. - - Note that MPI functions with two choice buffers (e.g., MPI_GATHER) - are not currently included in Open MPI's F90 interface. Calls to - these functions will automatically fall through to Open MPI's F77 - interface. A "large" size that includes the two choice buffer MPI - functions is possible in future versions of Open MPI. - - -General Run-Time Support Notes ------------------------------- - -- The Open MPI installation must be in your PATH on all nodes (and - potentially LD_LIBRARY_PATH, if libmpi is a shared library), unless - using the --prefix or --enable-mpirun-prefix-by-default - functionality (see below). - -- LAM/MPI-like mpirun notation of "C" and "N" is not yet supported. - -- The XGrid support is experimental - see the Open MPI FAQ and this - post on the Open MPI user's mailing list for more information: - - http://www.open-mpi.org/community/lists/users/2006/01/0539.php - -- Open MPI's run-time behavior can be customized via MCA ("MPI - Component Architecture") parameters (see below for more information - on how to get/set MCA parameter values). Some MCA parameters can be - set in a way that renders Open MPI inoperable (see notes about MCA - parameters later in this file). In particular, some parameters have - required options that must be included. - - - If specified, the "btl" parameter must include the "self" - component, or Open MPI will not be able to deliver messages to the - same rank as the sender. For example: "mpirun --mca btl tcp,self - ..." 
- - If specified, the "btl_tcp_if_exclude" parameter must include the - loopback device ("lo" on many Linux platforms), or Open MPI will - not be able to route MPI messages using the TCP BTL. For example: - "mpirun --mca btl_tcp_if_exclude lo,eth1 ..." - -- Running on nodes with different endian and/or different datatype - sizes within a single parallel job is supported in this release. - However, Open MPI does not resize data when datatypes differ in size - (for example, sending a 4 byte MPI_DOUBLE and receiving an 8 byte - MPI_DOUBLE will fail). - - -MPI Functionality and Features ------------------------------- - -- All MPI-2.1 functionality is supported. - -- MPI_THREAD_MULTIPLE support is included, but is only lightly tested. - It likely does not work for thread-intensive applications. Note - that *only* the MPI point-to-point communication functions for the - BTL's listed above are considered thread safe. Other support - functions (e.g., MPI attributes) have not been certified as safe - when simultaneously used by multiple threads. - - Note that Open MPI's thread support is in a fairly early stage; the - above devices are likely to *work*, but the latency is likely to be - fairly high. Specifically, efforts so far have concentrated on - *correctness*, not *performance* (yet). - -- MPI_REAL16 and MPI_COMPLEX32 are only supported on platforms where a - portable C datatype can be found that matches the Fortran type - REAL*16, both in size and bit representation. - -- Asynchronous message passing progress using threads can be turned on - with the --enable-progress-threads option to configure. - Asynchronous message passing progress is only supported with devices - that support MPI_THREAD_MULTIPLE, but is only very lightly tested - (and may not provide very much performance benefit). 
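For readers who want to experiment with the thread-related features noted above, a build might be configured along the following lines. This is only a sketch, not a recommended configuration: the helper name thread_configure_cmd is made up for this example, and the command line is printed rather than executed. The flags are the ones this file documents (--with-threads=posix, --enable-mpi-threads, --enable-progress-threads).

```shell
#!/bin/sh
# Sketch: print (do not run) a configure line that enables the lightly
# tested thread support. thread_configure_cmd is a hypothetical helper name.
thread_configure_cmd() {
    # $1 = "yes" to also request asynchronous progress threads
    cmd="./configure --with-threads=posix --enable-mpi-threads"
    if [ "$1" = "yes" ]; then
        cmd="$cmd --enable-progress-threads"
    fi
    echo "$cmd"
}

# Show the command line with progress threads requested.
thread_configure_cmd yes
```

Printing the command first makes it easy to review the flags before committing to a long build.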
- - -Collectives ------------ - -- The "hierarch" coll component (i.e., an implementation of MPI - collective operations) attempts to discover network layers of - latency in order to segregate individual "local" and "global" - operations as part of the overall collective operation. In this - way, network traffic can be reduced -- or possibly even minimized - (similar to MagPIe). The current "hierarch" component only - separates MPI processes into on- and off-node groups. - - Hierarch has had sufficient correctness testing, but has not - received much performance tuning. As such, hierarch is not - activated by default -- it must be enabled manually by setting its - priority level to 100: - - mpirun --mca coll_hierarch_priority 100 ... - - We would appreciate feedback from the user community about how well - hierarch works for your applications. - - -Network Support ---------------- - -- The OpenFabrics Enterprise Distribution (OFED) software package v1.0 - will not work properly with Open MPI v1.2 (and later) due to how its - Mellanox InfiniBand plugin driver is created. The problem is fixed in - OFED v1.1 (and later). - -- Older mVAPI-based InfiniBand drivers (Mellanox VAPI) are no longer - supported. Please use an older version of Open MPI (1.2 series or - earlier) if you need mVAPI support. - -- The use of fork() with the openib BTL is only partially supported, - and only on Linux kernels >= v2.6.15 with libibverbs v1.1 or later - (first released as part of OFED v1.2), per restrictions imposed by - the OFED network stack. - -- There are two MPI network models available: "ob1" and "cm". "ob1" - uses BTL ("Byte Transfer Layer") components for each supported - network. "cm" uses MTL ("Matching Transport Layer") components for - each supported network. 
- - - "ob1" supports a variety of networks that can be used in - combination with each other (per OS constraints; e.g., there are - reports that the GM and OpenFabrics kernel drivers do not operate - well together): - - OpenFabrics: InfiniBand and iWARP - - Loopback (send-to-self) - - Myrinet: GM and MX - - Portals - - Quadrics Elan - - Shared memory - - TCP - - SCTP - - uDAPL - - - "cm" supports a smaller number of networks (and they cannot be - used together), but may provide better overall MPI - performance: - - Myrinet MX (not GM) - - InfiniPath PSM - - Portals - - Open MPI will, by default, choose to use "cm" when the InfiniPath - PSM MTL can be used. Otherwise, OB1 will be used and the - corresponding BTLs will be selected. Users can force the use of ob1 - or cm if desired by setting the "pml" MCA parameter at run-time: - - shell$ mpirun --mca pml ob1 ... - or - shell$ mpirun --mca pml cm ... - -- Myrinet MX support is shared between the 2 internal devices, the MTL - and the BTL. The design of the BTL interface in Open MPI assumes - that only naive one-sided communication capabilities are provided by - the low level communication layers. However, modern communication - layers such as Myrinet MX, InfiniPath PSM, or Portals, natively - implement highly-optimized two-sided communication semantics. To - leverage these capabilities, Open MPI provides the "cm" PML and - corresponding MTL components to transfer messages rather than bytes. - The MTL interface implements a shorter code path and lets the - low-level network library decide which protocol to use (depending on - issues such as message length, internal resources and other - parameters specific to the underlying interconnect). However, Open - MPI cannot currently use multiple MTL modules at once. In the case - of the MX MTL, process loopback and on-node shared memory - communications are provided by the MX library. 
Moreover, the - current MX MTL does not support message pipelining, resulting in - lower performance in the case of non-contiguous data-types. - - The "ob1" PML and BTL components use Open MPI's internal on-node - shared memory and process loopback devices for high performance. - The BTL interface allows multiple devices to be used simultaneously. - For the MX BTL it is recommended that the first segment (which acts as - a threshold between the eager and the rendezvous protocol) should - always be at most 4KB, but there is no further restriction on the - size of subsequent fragments. - - The MX MTL is recommended in the common case for best performance on - 10G hardware when most of the data transfers cover contiguous memory - layouts. The MX BTL is recommended in all other cases, such as when - using multiple interconnects at the same time (including TCP), or - transferring non contiguous data-types. - - -Shared library versioning support ---------------------------------- - -Open MPI started using GNU-Libtool recommended shared library -versioning with the v1.3.3 release (where all versions were set to -0:0:0) for the main MPI libraries: libmpi, libmpi_cxx, libmpi_f77, and -libmpi_f90. - -Open MPI's other internal libraries are not [yet] versioned for deep -voodoo technical reasons. Please see -https://svn.open-mpi.org/trac/ompi/ticket/2092 for more details. - -=========================================================================== - -Building Open MPI ------------------ - -Open MPI uses a traditional configure script paired with "make" to -build. Typical installs can be of the pattern: - ---------------------------------------------------------------------------- -shell$ ./configure [...options...] 
-shell$ make all install ---------------------------------------------------------------------------- - -There are many available configure options (see "./configure --help" -for a full list); a summary of the more commonly used ones follows: - ---prefix= - Install Open MPI into the base directory named . Hence, - Open MPI will place its executables in /bin, its header - files in /include, its libraries in /lib, etc. - ---with-elan= - Specify the directory where the Quadrics Elan library and header - files are located. This option is generally only necessary if the - Elan headers and libraries are not in default compiler/linker - search paths. - - Elan is the support library for Quadrics-based networks. - ---with-elan-libdir= - Look in directory for the Quadrics Elan libraries. By default, Open - MPI will look in /lib and /lib64, - which covers most cases. This option is only needed for special - configurations. - ---with-gm= - Specify the directory where the GM libraries and header files are - located. This option is generally only necessary if the GM headers - and libraries are not in default compiler/linker search paths. - - GM is the support library for older Myrinet-based networks (GM has - been obsoleted by MX). - ---with-gm-libdir= - Look in directory for the GM libraries. By default, Open MPI will - look in /lib and /lib64, which covers - most cases. This option is only needed for special configurations. - ---with-mx= - Specify the directory where the MX libraries and header files are - located. This option is generally only necessary if the MX headers - and libraries are not in default compiler/linker search paths. - - MX is the support library for Myrinet-based networks. - ---with-mx-libdir= - Look in directory for the MX libraries. By default, Open MPI will - look in /lib and /lib64, which covers - most cases. This option is only needed for special configurations. 
- ---with-openib= - Specify the directory where the OpenFabrics (previously known as - OpenIB) libraries and header files are located. This option is - generally only necessary if the OpenFabrics headers and libraries - are not in default compiler/linker search paths. - - "OpenFabrics" refers to iWARP- and InfiniBand-based networks. - ---with-openib-libdir= - Look in directory for the OpenFabrics libraries. By default, Open - MPI will look in /lib and /lib64, which covers most cases. This option is only - needed for special configurations. - ---with-portals= - Specify the directory where the Portals libraries and header files - are located. This option is generally only necessary if the Portals - headers and libraries are not in default compiler/linker search - paths. - - Portals is the support library for Cray interconnects, but is also - available on other platforms (e.g., there is a Portals library - implemented over regular TCP). - ---with-portals-config= - Configuration to use for Portals support. The following - values are possible: "utcp", "xt3", "xt3-modex" (default: utcp). - ---with-portals-libs= - Additional libraries to link with for Portals support. - ---with-psm= - Specify the directory where the QLogic InfiniPath PSM library and - header files are located. This option is generally only necessary - if the InfiniPath headers and libraries are not in default - compiler/linker search paths. - - PSM is the support library for QLogic InfiniPath network adapters. - ---with-psm-libdir= - Look in directory for the PSM libraries. By default, Open MPI will - look in /lib and /lib64, which covers - most cases. This option is only needed for special configurations. - ---with-sctp= - Specify the directory where the SCTP libraries and header files are - located. This option is generally only necessary if the SCTP headers - and libraries are not in default compiler/linker search paths. - - SCTP is a special network stack over ethernet networks. 
- ---with-sctp-libdir= - Look in directory for the SCTP libraries. By default, Open MPI will - look in /lib and /lib64, which covers - most cases. This option is only needed for special configurations. - ---with-udapl= - Specify the directory where the UDAPL libraries and header files are - located. Note that UDAPL support is disabled by default on Linux; - the --with-udapl flag must be specified in order to enable it. - Specifying the directory argument is generally only necessary if the - UDAPL headers and libraries are not in default compiler/linker - search paths. - - UDAPL is the support library for high performance networks in Sun - HPC ClusterTools and on Linux OpenFabrics networks (although the - "openib" options are preferred for Linux OpenFabrics networks, not - UDAPL). - ---with-udapl-libdir= - Look in directory for the UDAPL libraries. By default, Open MPI - will look in /lib and /lib64, - which covers most cases. This option is only needed for special - configurations. - ---with-lsf= - Specify the directory where the LSF libraries and header files are - located. This option is generally only necessary if the LSF headers - and libraries are not in default compiler/linker search paths. - - LSF is a resource manager system, frequently used as a batch - scheduler in HPC systems. - - NOTE: If you are using LSF version 7.0.5, you will need to add - "LIBS=-ldl" to the configure command line. For example: - - ./configure LIBS=-ldl --with-lsf ... - - This workaround should *only* be needed for LSF 7.0.5. - ---with-lsf-libdir= - Look in directory for the LSF libraries. By default, Open MPI will - look in /lib and /lib64, which covers - most cases. This option is only needed for special configurations. - ---with-tm= - Specify the directory where the TM libraries and header files are - located. This option is generally only necessary if the TM headers - and libraries are not in default compiler/linker search paths. 
- - TM is the support library for the Torque and PBS Pro resource - manager systems, both of which are frequently used as a batch - scheduler in HPC systems. - ---with-sge - Specify to build support for the Sun Grid Engine (SGE) resource - manager. SGE support is disabled by default; this option must be - specified to build OMPI's SGE support. - - The Sun Grid Engine (SGE) is a resource manager system, frequently - used as a batch scheduler in HPC systems. - ---with-mpi-param-check(=value) - "value" can be one of: always, never, runtime. If --with-mpi-param-check - is not specified, "runtime" is the default. If --with-mpi-param-check - is specified with no value, "always" is used. Using - --without-mpi-param-check is equivalent to "never". - - - always: the parameters of MPI functions are always checked for - errors - - never: the parameters of MPI functions are never checked for - errors - - runtime: whether the parameters of MPI functions are checked - depends on the value of the MCA parameter mpi_param_check - (default: yes). - ---with-threads=value - Since thread support (both support for MPI_THREAD_MULTIPLE and - asynchronous progress) is only partially tested, it is disabled by - default. To enable threading, use "--with-threads=posix". This is - most useful when combined with --enable-mpi-threads and/or - --enable-progress-threads. - ---enable-mpi-threads - Allows the MPI thread level MPI_THREAD_MULTIPLE. See - --with-threads; this is currently disabled by default. - ---enable-progress-threads - Allows asynchronous progress in some transports. See - --with-threads; this is currently disabled by default. See the - above note about asynchronous progress. - ---disable-mpi-cxx - Disable building the C++ MPI bindings. Note that this does *not* - disable the C++ checks during configure; some of Open MPI's tools - are written in C++ and therefore require a C++ compiler to be built. - ---disable-mpi-cxx-seek - Disable the MPI::SEEK_* constants.
Due to a problem with the MPI-2 - specification, these constants can conflict with system-level SEEK_* - constants. Open MPI attempts to work around this problem, but the - workaround may fail in some esoteric situations. The - --disable-mpi-cxx-seek switch disables Open MPI's workarounds (and - therefore the MPI::SEEK_* constants will be unavailable). - ---disable-mpi-f77 - Disable building the Fortran 77 MPI bindings. - ---disable-mpi-f90 - Disable building the Fortran 90 MPI bindings. Also related to the - --with-f90-max-array-dim and --with-mpi-f90-size options. - ---with-mpi-f90-size= - Three sizes of the MPI F90 module can be built: trivial (only a - handful of MPI-2 F90-specific functions are included in the F90 - module), small (trivial + all MPI functions that take no choice - buffers), and medium (small + all MPI functions that take 1 choice - buffer). This parameter is only used if the F90 bindings are - enabled. - ---with-f90-max-array-dim= - The F90 MPI bindings are strictly typed, even including the number of - dimensions for arrays for MPI choice buffer parameters. Open MPI - generates these bindings at compile time with a maximum number of - dimensions as specified by this parameter. The default value is 4. - ---enable-mpirun-prefix-by-default - This option forces the "mpirun" command to always behave as if - "--prefix $prefix" was present on the command line (where $prefix is - the value given to the --prefix option to configure). This prevents - most rsh/ssh-based users from needing to modify their shell startup - files to set the PATH and/or LD_LIBRARY_PATH for Open MPI on remote - nodes. Note, however, that such users may still desire to set PATH - -- perhaps even in their shell startup files -- so that executables - such as mpicc and mpirun can be found without needing to type long - path names. --enable-orterun-prefix-by-default is a synonym for - this option. 
- ---disable-shared - By default, libmpi is built as a shared library, and all components - are built as dynamic shared objects (DSOs). This switch disables - this default; it is really only useful when used with - --enable-static. Specifically, this option does *not* imply - --enable-static; enabling static libraries and disabling shared - libraries are two independent options. - ---enable-static - Build libmpi as a static library, and statically link in all - components. Note that this option does *not* imply - --disable-shared; enabling static libraries and disabling shared - libraries are two independent options. - ---enable-sparse-groups - Enable the usage of sparse groups. This would save memory - significantly especially if you are creating large - communicators. (Disabled by default) - ---enable-peruse - Enable the PERUSE MPI data analysis interface. - ---enable-dlopen - Build all of Open MPI's components as standalone Dynamic Shared - Objects (DSO's) that are loaded at run-time. The opposite of this - option, --disable-dlopen, causes two things: - - 1. All of Open MPI's components will be built as part of Open MPI's - normal libraries (e.g., libmpi). - 2. Open MPI will not attempt to open any DSO's at run-time. - - Note that this option does *not* imply that OMPI's libraries will be - built as static objects (e.g., libmpi.a). It only specifies the - location of OMPI's components: standalone DSOs or folded into the - Open MPI libraries. You can control whether Open MPI's libraries - are built as static or dynamic via --enable|disable-static and - --enable|disable-shared. - ---enable-heterogeneous - Enable support for running on heterogeneous clusters (e.g., machines - with different endian representations). Heterogeneous support is - disabled by default because it imposes a minor performance penalty. - ---enable-ptmalloc2-internal - ***NOTE: This option no longer exists.
- - This option was introduced in Open MPI v1.3 and was then removed in - Open MPI v1.3.2. Open MPI fundamentally changed how it uses - ptmalloc2 support in v1.3.2 such that the - --enable-ptmalloc2-internal flag was no longer necessary. It can - still harmlessly be supplied to Open MPI's configure script, but a - warning will appear about how it is an unrecognized option. - - In v1.3 and v1.3.1, Open MPI built the ptmalloc2 library as a - standalone library that users could choose to link in or not (by - adding -lopenmpi-malloc to their link command). Using this option - restored pre-v1.3 behavior of *always* forcing the user to use the - ptmalloc2 memory manager (because it is part of libmpi). - - Starting with v1.3.2, ptmalloc2 is always built into Open MPI, but - is only activated in certain scenarios. - ---with-wrapper-cflags= ---with-wrapper-cxxflags= ---with-wrapper-fflags= ---with-wrapper-fcflags= ---with-wrapper-ldflags= ---with-wrapper-libs= - Add the specified flags to the default flags that are used in Open - MPI's "wrapper" compilers (e.g., mpicc -- see below for more - information about Open MPI's wrapper compilers). By default, Open - MPI's wrapper compilers use the same compilers used to build Open - MPI and specify an absolute minimum set of additional flags that are - necessary to compile/link MPI applications. These configure options - give system administrators the ability to embed additional flags in - OMPI's wrapper compilers (which is a local policy decision).
The - meanings of the different flags are: - - : Flags passed by the mpicc wrapper to the C compiler - : Flags passed by the mpic++ wrapper to the C++ compiler - : Flags passed by the mpif77 wrapper to the F77 compiler - : Flags passed by the mpif90 wrapper to the F90 compiler - : Flags passed by all the wrappers to the linker - : Flags passed by all the wrappers to the linker - - There are other ways to configure Open MPI's wrapper compiler - behavior; see the Open MPI FAQ for more information. - -There are many other options available -- see "./configure --help". - -Changing the compilers that Open MPI uses to build itself uses the -standard Autoconf mechanism of setting special environment variables -either before invoking configure or on the configure command line. -The following environment variables are recognized by configure: - -CC - C compiler to use -CFLAGS - Compile flags to pass to the C compiler -CPPFLAGS - Preprocessor flags to pass to the C compiler - -CXX - C++ compiler to use -CXXFLAGS - Compile flags to pass to the C++ compiler -CXXCPPFLAGS - Preprocessor flags to pass to the C++ compiler - -F77 - Fortran 77 compiler to use -FFLAGS - Compile flags to pass to the Fortran 77 compiler - -FC - Fortran 90 compiler to use -FCFLAGS - Compile flags to pass to the Fortran 90 compiler - -LDFLAGS - Linker flags to pass to all compilers -LIBS - Libraries to pass to all compilers (it is rarely - necessary for users to need to specify additional LIBS) - -For example: - -shell$ ./configure CC=mycc CXX=myc++ F77=myf77 FC=myf90 ... - -***Note: We generally suggest using the above command line form for - setting different compilers (vs. setting environment variables and - then invoking "./configure"). The above form will save all - variables and values in the config.log file, which makes - post-mortem analysis easier when problems occur.
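As a rough sketch of the command-line form recommended above, the helper below turns compiler variables that are already set in the shell into VAR=value arguments for configure. The name configure_args is hypothetical (it is not part of Open MPI or Autoconf), and the sketch assumes the values contain no whitespace, which holds for plain compiler names such as the placeholders mycc and myf90.

```shell
# configure_args is a hypothetical helper, not part of Open MPI or Autoconf.
# It prints one VAR=value line for each of CC, CXX, F77 and FC that is set
# to a non-empty value in the current shell.
configure_args() {
    for var in CC CXX F77 FC; do
        # Indirect lookup of the variable named by $var (empty if unset).
        eval "val=\${$var:-}"
        if [ -n "$val" ]; then
            printf '%s=%s\n' "$var" "$val"
        fi
    done
}

# Possible usage (not run here); relies on the values containing no spaces:
#   ./configure $(configure_args) ...
```

Passing the variables this way, rather than leaving them only in the environment, keeps them on the configure command line, which is the form the note above suggests.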
- -Note that you may also want to ensure that the value of -LD_LIBRARY_PATH is set appropriately (or not at all) for your build -(or whatever environment variable is relevant for your operating -system). For example, some users have been tripped up by setting to -use non-default Fortran compilers via FC / F77, but then failing to -set LD_LIBRARY_PATH to include the directory containing that -non-default Fortran compiler's support libraries. This causes Open -MPI's configure script to fail when it tries to compile / link / run -simple Fortran programs. - -It is required that the compilers specified be compile and link -compatible, meaning that object files created by one compiler must be -able to be linked with object files from the other compilers and -produce correctly functioning executables. - -Open MPI supports all the "make" targets that are provided by GNU -Automake, such as: - -all - build the entire Open MPI package -install - install Open MPI -uninstall - remove all traces of Open MPI from the $prefix -clean - clean out the build tree - -Once Open MPI has been built and installed, it is safe to run "make -clean" and/or remove the entire build tree. - -VPATH and parallel builds are fully supported. - -Generally speaking, the only thing that users need to do to use Open -MPI is ensure that /bin is in their PATH and /lib is -in their LD_LIBRARY_PATH. Users may need to ensure to set the PATH -and LD_LIBRARY_PATH in their shell setup files (e.g., .bashrc, .cshrc) -so that non-interactive rsh/ssh-based logins will be able to find the -Open MPI executables. - -=========================================================================== - -Checking Your Open MPI Installation ------------------------------------ - -The "ompi_info" command can be used to check the status of your Open -MPI installation (located in /bin/ompi_info). Running it with -no arguments provides a summary of information about your Open MPI -installation. 
- -Note that the ompi_info command is extremely helpful in determining -which components are installed as well as listing all the run-time -settable parameters that are available in each component (as well as -their default values). - -The following options may be helpful: - ---all Show a *lot* of information about your Open MPI - installation. ---parsable Display all the information in an easily - grep/cut/awk/sed-able format. ---param - A of "all" and a of "all" will - show all parameters to all components. Otherwise, the - parameters of all the components in a specific framework, - or just the parameters of a specific component can be - displayed by using an appropriate and/or - name. - -Changing the values of these parameters is explained in the "The -Modular Component Architecture (MCA)" section, below. - -=========================================================================== - -Compiling Open MPI Applications -------------------------------- - -Open MPI provides "wrapper" compilers that should be used for -compiling MPI applications: - -C: mpicc -C++: mpiCC (or mpic++ if your filesystem is case-insensitive) -Fortran 77: mpif77 -Fortran 90: mpif90 - -For example: - -shell$ mpicc hello_world_mpi.c -o hello_world_mpi -g -shell$ - -All the wrapper compilers do is add a variety of compiler and linker -flags to the command line and then invoke a back-end compiler. To be -specific: the wrapper compilers do not parse source code at all; they -are solely command-line manipulators, and have nothing to do with the -actual compilation or linking of programs. The end result is an MPI -executable that is properly linked to all the relevant libraries. - -Customizing the behavior of the wrapper compilers is possible (e.g., -changing the compiler [not recommended] or specifying additional -compiler/linker flags); see the Open MPI FAQ for more information. 
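To illustrate the point that a wrapper compiler only manipulates the command line, here is a toy stand-in written as a shell function. Everything in it is an assumption made for illustration: toy_mpicc, TOY_CC and TOY_PREFIX are hypothetical names, and the include and library paths are placeholders rather than the flags Open MPI's real wrappers add. It prints the back-end command instead of running it.

```shell
# toy_mpicc is a hypothetical stand-in for a wrapper compiler.
# TOY_CC and TOY_PREFIX are made-up variables; the -I/-L/-l flags are
# illustrative placeholders, not the real flags used by mpicc.
toy_mpicc() {
    backend=${TOY_CC:-cc}
    prefix=${TOY_PREFIX:-/opt/openmpi}
    # The user's arguments are passed through untouched; no source is parsed.
    echo "$backend -I$prefix/include $* -L$prefix/lib -lmpi"
}

# Example: show what the toy wrapper would hand to the back-end compiler.
toy_mpicc hello_world_mpi.c -o hello_world_mpi -g
```

The real wrappers can print their underlying command line with the --showme option (for example, "mpicc --showme"), which is the reliable way to see the actual flags on a given installation.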
- -=========================================================================== - -Running Open MPI Applications ------------------------------ - -Open MPI supports both mpirun and mpiexec (they are exactly -equivalent). For example: - -shell$ mpirun -np 2 hello_world_mpi -or -shell$ mpiexec -np 1 hello_world_mpi : -np 1 hello_world_mpi - -are equivalent. Some of mpiexec's switches (such as -host and -arch) -are not yet functional, although they will not error if you try to use -them. - -The rsh launcher accepts a -hostfile parameter (the option -"-machinefile" is equivalent); you can specify a -hostfile parameter -indicating a standard mpirun-style hostfile (one hostname per line): - -shell$ mpirun -hostfile my_hostfile -np 2 hello_world_mpi - -If you intend to run more than one process on a node, the hostfile can -use the "slots" attribute. If "slots" is not specified, a count of 1 -is assumed. For example, using the following hostfile: - ---------------------------------------------------------------------------- -node1.example.com -node2.example.com -node3.example.com slots=2 -node4.example.com slots=4 ---------------------------------------------------------------------------- - -shell$ mpirun -hostfile my_hostfile -np 8 hello_world_mpi - -will launch MPI_COMM_WORLD rank 0 on node1, rank 1 on node2, ranks 2 -and 3 on node3, and ranks 4 through 7 on node4. - -Other starters, such as the resource manager / batch scheduling -environments, do not require hostfiles (and will ignore the hostfile -if it is supplied). They will also launch as many processes as slots -have been allocated by the scheduler if no "-np" argument has been -provided. For example, running a SLURM job with 8 processors: - -shell$ salloc -n 8 mpirun a.out - -The above command will reserve 8 processors and run 1 copy of mpirun, -which will, in turn, launch 8 copies of a.out in a single -MPI_COMM_WORLD on the processors that were allocated by SLURM.
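The rank placement described for the hostfile example earlier in this section (hosts filled in order, one rank per slot, "slots" defaulting to 1) can be sketched with a small shell/awk helper. map_ranks is a hypothetical name, not an Open MPI tool, and the sketch is a simplification: it only reproduces the placement described in the text and simply stops when the hostfile runs out of slots, which is not necessarily what mpirun itself does in that case.

```shell
# map_ranks is a hypothetical helper, not part of Open MPI.
# Usage: map_ranks HOSTFILE NP
# Prints "rank N -> host" for ranks 0..NP-1, filling each host's slots in
# file order. A host without a "slots=" attribute counts as one slot.
map_ranks() {
    hostfile=$1
    np=$2
    awk -v np="$np" '
        # Skip blank lines and comment lines.
        NF == 0 || $1 ~ /^#/ { next }
        {
            slots = 1
            for (i = 2; i <= NF; i++)
                if ($i ~ /^slots=[0-9]+$/) {
                    split($i, kv, "=")
                    slots = kv[2] + 0
                }
            # Assign consecutive ranks until this host is full or NP is reached.
            for (s = 0; s < slots && rank < np; s++)
                printf "rank %d -> %s\n", rank++, $1
        }
    ' "$hostfile"
}
```

Run as "map_ranks my_hostfile 8" against the four-node hostfile shown above, this lists rank 0 on node1.example.com, rank 1 on node2.example.com, ranks 2 and 3 on node3.example.com, and ranks 4 through 7 on node4.example.com, matching the placement stated in the text.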
- -Note that the values of component parameters can be changed on the -mpirun / mpiexec command line. This is explained in the section -below, "The Modular Component Architecture (MCA)". - -=========================================================================== - -The Modular Component Architecture (MCA) - -The MCA is the backbone of Open MPI -- most services and functionality -are implemented through MCA components. Here is a list of all the -component frameworks in Open MPI: - ---------------------------------------------------------------------------- - -MPI component frameworks: -------------------------- - -allocator - Memory allocator -bml - BTL management layer -btl - MPI point-to-point Byte Transfer Layer, used for MPI - point-to-point messages on some types of networks -coll - MPI collective algorithms -crcp - Checkpoint/restart coordination protocol -dpm - MPI-2 dynamic process management -io - MPI-2 I/O -mpool - Memory pooling -mtl - Matching transport layer, used for MPI point-to-point - messages on some types of networks -osc - MPI-2 one-sided communications -pml - MPI point-to-point management layer -pubsub - MPI-2 publish/subscribe management -rcache - Memory registration cache -topo - MPI topology routines - -Back-end run-time environment component frameworks: ---------------------------------------------------- - -errmgr - RTE error manager -ess - RTE environment-specific services -filem - Remote file management -grpcomm - RTE group communications -iof - I/O forwarding -notifier - System/network administrator notification system -odls - OpenRTE daemon local launch subsystem -oob - Out of band messaging -plm - Process lifecycle management -ras - Resource allocation system -rmaps - Resource mapping system -rml - RTE message layer -routed - Routing table for the RML -snapc - Snapshot coordination - -Miscellaneous frameworks: -------------------------- - -backtrace - Debugging call stack backtrace support -carto - Cartography (host/network mapping)
support -crs - Checkpoint and restart service -installdirs - Installation directory relocation services -maffinity - Memory affinity -memchecker - Run-time memory checking -memcpy - Memory copy support -memory - Memory management hooks -paffinity - Processor affinity -timer - High-resolution timers - ---------------------------------------------------------------------------- - -Each framework typically has one or more components that are used at -run-time. For example, the btl framework is used by the MPI layer to -send bytes across different types of underlying networks. The tcp btl, -for example, sends messages across TCP-based networks; the openib btl -sends messages across OpenFabrics-based networks; the MX btl sends -messages across Myrinet networks. - -Each component typically has some tunable parameters that can be -changed at run-time. Use the ompi_info command to check a component -to see what its tunable parameters are. For example: - -shell$ ompi_info --param btl tcp - -shows all the parameters (and default values) for the tcp btl -component. - -These values can be overridden at run-time in several ways. At -run-time, the following locations are examined (in order) for new -values of parameters: - -1. /etc/openmpi-mca-params.conf - - This file is intended to set any system-wide default MCA parameter - values -- it will apply, by default, to all users who use this Open - MPI installation. The default file that is installed contains many - comments explaining its format. - -2. $HOME/.openmpi/mca-params.conf - - If this file exists, it should be in the same format as - /etc/openmpi-mca-params.conf. It is intended to provide - per-user default parameter values. - -3. environment variables of the form OMPI_MCA_ set equal to a - - - Where is the name of the parameter.
For example, set the - variable named OMPI_MCA_btl_tcp_frag_size to the value 65536 - (Bourne-style shells): - - shell$ OMPI_MCA_btl_tcp_frag_size=65536 - shell$ export OMPI_MCA_btl_tcp_frag_size - -4. the mpirun command line: --mca - - Where is the name of the parameter. For example: - - shell$ mpirun --mca btl_tcp_frag_size 65536 -np 2 hello_world_mpi - -These locations are checked in order. For example, a parameter value -passed on the mpirun command line will override an environment -variable; an environment variable will override the system-wide -defaults. - -=========================================================================== - -Common Questions ----------------- - -Many common questions about building and using Open MPI are answered -on the FAQ: - - http://www.open-mpi.org/faq/ - -=========================================================================== - -Got more questions? -------------------- - -Found a bug? Got a question? Want to make a suggestion? Want to -contribute to Open MPI? Please let us know! - -When submitting questions and problems, be sure to include as much -extra information as possible. This web page details all the -information that we request in order to provide assistance: - - http://www.open-mpi.org/community/help/ - -User-level questions and comments should generally be sent to the -user's mailing list (users@open-mpi.org). Because of spam, only -subscribers are allowed to post to this list (ensure that you -subscribe with and post from *exactly* the same e-mail address -- -joe@example.com is considered different than -joe@mycomputer.example.com!). Visit this page to subscribe to the -user's list: - - http://www.open-mpi.org/mailman/listinfo.cgi/users - -Developer-level bug reports, questions, and comments should generally -be sent to the developer's mailing list (devel@open-mpi.org). Please -do not post the same question to both lists. As with the user's list, -only subscribers are allowed to post to the developer's list. 
Visit -the following web page to subscribe: - - http://www.open-mpi.org/mailman/listinfo.cgi/devel - -Make today an Open MPI day! diff --git a/opensm_release_notes.txt b/opensm_release_notes.txt deleted file mode 100644 index 9b5de67..0000000 --- a/opensm_release_notes.txt +++ /dev/null @@ -1,728 +0,0 @@ - OpenSM Release Notes 3.3 - ============================= - -Version: OpenSM 3.3.x -Repo: git://git.openfabrics.org/~sashak/management.git -Date: Dec 2009 - -1 Overview ----------- -This document describes the contents of the OpenSM 3.3 release. -OpenSM is an InfiniBand compliant Subnet Manager and Administration, -and runs on top of OpenIB. The OpenSM version for this release -is opensm-3.3.5. - -This document includes the following sections: -1 This Overview section (describing new features and software - dependencies) -2 Known Issues And Limitations -3 Unsupported IB compliance statements -4 Bug Fixes -5 Main Verification Flows -6 Qualified Software Stacks and Devices - -1.1 Major New Features - -* Mesh Analysis for LASH routing algorithm. - The performance of LASH can be improved by preconditioning the mesh in - cases where there are multiple links connecting switches and also in - cases where the switches are not cabled consistently. - Activated with --do_mesh_analysis command line and config file option. - -* Reloadable OpenSM configuration (preliminary implementation) - It is now possible to reload OpenSM configuration parameters on the - fly without restarting. - -* Routing paths sorted balancing (for UpDown and MinHops) - This sorts the port order in which routing paths balancing is performed - by OpenSM. Helps to improve performance dramatically (40-50%) for most - popular application communication patterns. - To overwrite this behavior use --guid_routing_order_file command line - option. - -* Weighted Lid Matrices calculation (for UpDown, MinHop and DOR).
- This low level routing fine-tuning feature provides the means to - define a weighting factor per port for customizing the least weight - hops for the routing. Custom weights are provided using file specified - with '--hop_weights_file' command line option. - -* I/O nodes connectivity (for FatTree). - This provides possibility to define the set of I/O nodes for the - Fat-Tree routing algorithm. I/O nodes are non-CN nodes allowed to use - up to N (specified using --max_reverse_hops) switches the wrong way - around to improve connectivity. I/O nodes list is provided using file - and --io_guid_file command line option. - -* MGID to MLID compression - infrastructure for many MGIDs to single MLID - compression. This becomes helpful when number of multicast groups - exceeds subnet's MLID routing capability (normally 1024 groups). In such - cases many multicast groups (MGID) can be routed using same MLID value. - -* Many code improvements, optimizations and cleanups. - -* Windows support (early stage). 
- -1.2 Minor New Features: - -cde0c0d opensm: Convert remaining helper routines for GID printing format -bc5743c opensm: Add support for MaxCreditHint and LinkRoundTripLatency to - osm_dump_port_info -6cd34ab opensm: Add Dell to known vendor list -003d6bd opensm: Add more info for traps 144 and 256-259 in osm_dump_notice -5b0c5de opensm/osm_ucat_ftree.c Enhance min hops counters usage -0715b92 ib_types.h: Add ib_switch_info_get_state_opt_sl2vlmapping routine -2ddba79 opensm: Remove some __ and __osm_ prefixes -ea0691f opensm/iba/ib_types.h: Add PortXmit/RcvDataSL PerfMgt attributes -9c79be5 ib_types.h: Adding BKEY violation trap (259) -c608ea6 opensm: Add and utilize ib_gid_is_notzero routine -b639e64 opensm: Handle trap repress on trap 144 generation -b034205 Add pkey table support to osm_get_all_port_attr -876605b opensm/ib_types.h: Add attribute ID for PortCountersExtended -aae3bbc opensm: PortInfo requests for discovered switches -0147b09 opensm/osm_lid_mgr: use single array for used_lids -a9225b0 opensm/Makefile.am: remove osm_build_id.h junk file generation -8e3a57d opensm/osm_console.c: Add list of SMs to status command -3d664b9 opensm/osm_console.c : Added dump_portguid function to console to - generate a list of port guids matching one or more regexps -85b35bc opensm/osm_helper.c: print port number as decimal -8674cb7 opensm: sort port order for routing by switch loads -80c0d48 opensm: rescan config file even in standby -8b7aa5e opensm/osm_subnet.c enable log_max_size opt update -8558ee5 opensm/include/iba/ib_types.h: Add xmit_wait for PortCounters -ecde2f7 opensm/osm_subnet.c support subnet configuration rescan and update -58c45e4 opensm/osm_log.c save log_max_size in subnet opt in MB -cf88e93 opensm: Add new partition keyword for all hca, switches and routers -4bfd4e0 opensm: remove libibcommon build dependencies -3718fc4 opensm/event_plugin: link opensm with -rdynamic flag -587ce14 opensm/osm_inform.c report IB traps to plugin -ced5a6e 
opensm/opensm/osm_console.c: move reporting of plugins to "status" - command. -696aca2 opensm: Add configurable retries for transactions -0d932ff opensm/osm_sa_mcmember_record.c: optimization in zero mgid comparison -254c2ef opensm/osm_sm_mad_ctrl.c: In sm_mad_ctrl_send_err_cb, set init - failure on PKeyTable and QoS initialization failure -83bd10a opensm: Reduce heap consumption by multicast routing tables (MFTs) -cd33bc5 opensm: Add some additional HP vendor IDs/OUIs -f78ec3a opensm/osm_mcast_tbl.(h c): Make max_mlid_ho be maximum MLID configured -2d13530 opensm: Add infrastructure support for PortInfo - IsMulticastPkeyTrapSuppressionSupported -3ace760 opensm: Reduce heap consumption by unicast routing tables (LFTs) -eec568e osmtest: Add SA get PathRecord stress test -aabc476 opensm: Add infrastructure support for more newly allocated PortInfo - CapabilityMask bits -c83c331 opensm: improve multicast re-routing requests processing -46db92f opensm: Parallelize (Stripe) MFT sets across switches -00c6a6e opensm: Parallelize (Stripe) LFT sets across switches -e21c651 opensm/osm_base.h: Add new SA ClassPortInfo:CapabilityMask2 bit - allocations -09056b1 opensm/ib_types.h: Add CounterSelect2 field to PortCounters attribute -6a63003 opensm: Add ability to configure SMSL -25f071f opensm/lash: Set minimum VL for LASH to use -622d853 opensm/osm_ucast_ftree.cd: Added support for same level links -8146ba7 opensm: Add new Sun vendor ID -1d7dd18 opensm/osm_ucast_ftree.c: Enhanced Fat-Tree algorithm -e07a2f1 Add LMC support to DOR routing -1acfe8a opensm: Add SuperMicro to list of recognized vendors -f02f40e opensm: implement 'connect_roots' option in fat-tree routing -748d41e opensm SA DB dump/restore: added option to dump SA DB on every sweep -b03a95e complib/cl_fleximap: add cl_fmap_match() function -b7a8a87 opensm/include/iba/ib_types.h: adding Congestion Control definitions - -1.3 Library API Changes - - None - -1.4 Software Dependencies - -OpenSM depends on the 
installation of libibumad package (distributed as -part of OFA IB management together with OpenSM) and IB stack presence, -in particular libibumad uses user_mad kernel interface ('ib_umad' kernel -module). The qualified driver versions are provided in Table 2, -"Qualified IB Stacks". - -Also, building of QoS manager policy file parser requires flex, and either -bison or byacc installed. - -1.5 Supported Devices Firmware - -The main task of OpenSM is to initialize InfiniBand devices. The -qualified devices and their corresponding firmware versions -are listed in Table 3. - -2 Known Issues And Limitations ------------------------------- - -* No Service / Key associations: - There is no way to manage Service access by Keys. - -* No SM to SM SMDB synchronization: - Puts the burden of re-registering services, multicast groups, and - inform-info on the client application (or IB access layer core). - -3 Unsupported IB Compliance Statements --------------------------------------- -The following section lists all the IB compliance statements which -OpenSM does not support. Please refer to the IB specification for detailed -information regarding each compliance statement. - -* C14-22 (Authentication): - M_Key M_KeyProtectBits and M_KeyLeasePeriod shall be set in one - SubnSet method. As a work-around, an OpenSM option is provided for - defining the protect bits. - -* C14-67 (Authentication): - On SubnGet(SMInfo) and SubnSet(SMInfo) - if M_Key is not zero then - the SM shall generate a SubnGetResp if the M_Key matches, or - silently drop the packet if M_Key does not match. - -* C15-0.1.23.4 (Authentication): - InformInfoRecords shall always be provided with the QPN set to 0, - except for the case of a trusted request, in which case the actual - subscriber QPN shall be returned. - -* o13-17.1.2 (Event-FWD): - If no permission to forward, the subscription should be removed and - no further forwarding should occur. 
- -* C14-24.1.1.5 and C14-62.1.1.22 (Initialization): - GUIDInfo - SM should enable assigning Port GUIDInfo. - -* C14-44 (Initialization): - If the SM discovers that it is missing an M_Key to update CA/RT/SW, - it should notify the higher level. - -* C14-62.1.1.12 (Initialization): - PortInfo:M_Key - Set the M_Key to a node based random value. - -* C14-62.1.1.13 (Initialization): - PortInfo:M_KeyProtectBits - set according to an optional policy. - -* C14-62.1.1.24 (Initialization): - SwitchInfo:DefaultPort - should be configured for random FDB. - -* C14-62.1.1.32 (Initialization): - RandomForwardingTable should be configured. - -* o15-0.1.12 (Multicast): - If the JoinState is SendOnlyNonMember = 1 (only), then the endport - should join as sender only. - -* o15-0.1.8 (Multicast): - If a request for creating an MCG with fields that cannot be met, - return ERR_REQ_INVALID (currently ignores SL and FlowLabelTClass). - -* C15-0.1.8.6 (SA-Query): - Respond to SubnAdmGetTraceTable - this is an optional attribute. - -* C15-0.1.13 Services: - Reject ServiceRecord create, modify or delete if the given - ServiceP_Key does not match the one included in the ServiceGID port - and the port that sent the request. - -* C15-0.1.14 (Services): - Provide means to associate service name and ServiceKeys. 
- -4 Bug Fixes ------------ - -4.1 Major Bug Fixes - -18990fa opensm: set IS_SM bit during opensm init -3551389 fix local port smlid in osm_send_trap144() -a6de48d opensm/osm_link_mgr.c initialize SMSL -82df467 opensm/osm_req.c: Shouldn't reveal port's MKey on Trap method -45ebff9 opensm/osm_console_io.h: Modify osm_console_exit so only the - connection is killed, not the socket -d10660a opensm/osm_req.c: In osm_send_trap144, set producer type according - to node type -8a2d2dd opensm/osm_node_info_rcv.c: create physp for the newly discovered - port of the known node -39b241f opensm/lid_mgr: fix duplicated lid assignment -b44c398 opensm: invalidate routing cache when entering master state -595f2e3 opensm: update LFTs when entering master -8406c65 opensm: fix port chooser -fa90512 opensm/osm_vendor_*_sa: fix incompatibility with QLogic SM -7ec9f7c opensm: discard multicast SA PR with wildcard DGID -5cdb53f opensm/osm_sa_node_record.c use comp mask to match by LID or GUID -55f9772 opensm: Return single PathRecord for SubnAdmGet with DGID/SGID wild - carded -5ec0b5f opensm: compress IPV6 SNM groups to use a single MLID - -4.2 Other Bug Fixes - -4911e0b performance-manager-HOWTO.txt: Indicate master state -86ccaa4 opensm/osm_pkey_mgr.c: Fix pkey endian in log message -b79b079 opensm.8.in: Add mention of backing documentation for QoS policy - file and performance manager -b4d92af opensm/osm_perfmgr.c: Eliminate duplicated error number -a10b57a opensm/osm_ucast_ftree.c: lids are always handled in host order -44273a2 opensm/osm_ucast_ftree.c: fixing bug in indexing -5cd98f7 Fix further bugs around console closure and clean up code. 
-6b34339 opensm/osm_opensm.c: add newline to log message -68c241c send trap144 when local priority is higher than master priority -6462999 opensm/osm_inform.c: In __osm_send_report, make sure p_report_madw - valid before using -9b8561a opensm/console: Fixed osm_console poll to handle POLLHUP -91d0700 osm_vendor_ibumad.c: In clear_madw, fix tid endian in message -5a5136b osm_switch.h : Fixed wrong comment about return value of - osm_switch_set_hops -c1ec8c0 osm_ucast_ftree.c: Removed useless initialization on switch indexes -418d01f opensm/osm_helper.c: use single buffer in osm_dump_dr_smp() -2c9153c opensm/osm_helper.c: consolidate dr path printing code -048c447 opensm/osm_helper.c: return then log is inactive -dd3ef0c opensm: Return error status when cl_disp_register fails -0143bf7 opensm/osm_perfmgr.c: Improve assert in osm_pc_rcv_process -6622504 osm_perfmgr.c: In osm_perfmgr_shutdown, add missing cl_disp_unregister -7b66dee opensm: remove unneeded anymore physp initializations -f11274a opensm/partition-config.txt: Update for defmember feature -d240e7d opensm/osm_sm_state_mgr.c: Remove unneeded return statement -898fb8c opensm: Improve some snprintf uses -6820e63 opensm/osm_sa_link_record.c: improve get_base_lid() -64c8d31 opensm: initialize all switch ports -555fae8 opensm/sweep: add log message before lid assignment -8e22307 opensm/console: Enhance perfmgr print_counters for better nodenames -b9721a1 opensm/osm_console.c: Improve perfmgr print_counters error message -4d8dc72 opensm/osm_inform.c: Fix sense of zero GID compare in __match_inf_rec -a98dd82 opensm/main.c: remove enable_stack_dump() call -db6d51e opensm/osm_subnet: fix crash in qos string config parameters reloading -e5111c8 opensm: proper config file rescan -e5295b2 opensm: pre-scan command line for config file option -e2f549e opensm/osm_console.c: Eliminate some extraneous parentheses -0a265dc opensm/console: dump_portguid - don't duplicate matched guids -540fefb opensm/console: dump_portguid 
command fixes -d96202c opensm/osm_console.c: Add missing command in help_perfmgr -ae1bd3c opensm/osm_helper.c: Add port counters to __osm_disp_msg_str -1d38b31 opensm/osm_ucast_mgr.c: Add error numbers for some OSM_LOG prin -156c749 opensm: fix structure definition for trap 257-258 -5c09f4a opensm/osm_state_mgr.c: small bug in scanning lid table -72a2fa2 opensm/osm_sa.c: fixing SA MAD dump -539a4d3 opensm/osm_ucast_ftree.c Fixed bad init value for down port index -6690833 opensm/ftree: simplify root guids setup. -90e3291 opensm/ftree: cleanup ftree_sw_tbl_element_t use -c07d245 opensm/qos_config: no invalid option message on default values -b382ad8 opensm: avoid memory leaks on config parameters reloading -45f57ce opensm/osm_ucast_ftree.c: Fixed bug on index port incrementation -3d618aa opensm/osm_subnet.c: break matching when config parameter already found -44d98e3 opensm/osm_subnet.c: clean_val() remove trailing quotation -173010a opensm/doc/perf-manager-arch.txt: Fix some commentary typos -83bf6c5 opensm/osm_subnet.c fix parse functions for big endian machines -6b9a1e9 opensm/PerfMgr: Primarily fix enhanced switch port 0 perf manager - operation -4f79a17 opensm/osm_perfmgr.c: In osm_perfmgr_init, eliminate memory leak - on error -22da81f opensm/osm_ucast_ftree.c: fix full topology dump -aa25fcb opensm/osm_port_info_rcv.c: don't clear sw->need_update if port 0 - is active -003bd4b opensm/osm_subnet.c Fix memory leak for QOS string parameters. -9cbbab2 opensm/opensm.spec: fix event plugin config options -996e8f6 OpenSM: update osmeventplugin example for the new TRAP event. 
-67f4c07 opensm/lash: simplify some memory allocations -3e6bcdb opensm/lash: fix memory leaks -3ff97b9 opensm/vendor: save some stack memory -ccc7621 opensm/osm_ucast_ftree.c: fixing errors in comments -1a802b3 Corrected incoherency in __osm_ftree_fabric_route_to_non_cns comments -85a7e54 opensm/osm_sm.c: fix MC group creation in race condition -aad1af2 opensm/osm_trap_rcv.c: Improvements in log_trap_info() -f619d67 opensm/osm_trap_rcv.c: Minor reorganization of trap_rcv_process_request -084335b opensm/link_mgr: verify port's lid -d525931 opensm/osm_vendor_ibumad: Use OSM_UMAD_MAX_AGENTS rather than - UMAD_CA_MAX_AGENTS -f342c62 opensm/osm_sa.c: don't ignore failure in osm_mgrp_add_port() -587fda4 osmtest/osmt_multicast.c: fix strict aliasing breakage warning -6931f3e opensm: make subnet's max mlid update implementation independent -30f1acd osm_ucast_ftree.c missing reset of ca_ports -ac04779 opensm: fix LFT allocation size -a7838d0 opensm/osm_ucast_cache: reduce OSM_LOG_INFO debug printouts -c027335 opensm/osm_ucast_updn.c: Further reduction in cas_per_sw allocation -e8ee292 opensm/opensm/osm_subnet.c: adjust buffer to ensure a '\n' is printed -84d9830 opensm/osm_ucast_updn.c: Reduce temporary allocation of cas_per_sw -347ad64 opensm/ib_types.h: Mask off client rereg bit in set_client_rereg -c2ab189 opensm/osm_state_mgr.c: in cleanup_switch() check only relevant - LFT part -40c93d3 use transportable constant attributes -c8fa71a osmtest -code cleanup - use strncasecmp() -770704a opensm/osm_mcast_mgr.c: In mcast_mgr_set_mft_block, fix node GUID - in log message -3d20f82 opensm/osm_sa_path_record.c: separate router guid resolution code -27ea3c8 opensm: fix gcc-4.4.1 warnings -c88bfd3 opensm/osm_lid_mgr.c: Fix typo in OSM_LOG message -a9ea08c opensm/osm_mesh.c: Add dump_mesh routine at OSM_LOG_DEBUG level -bc2a61e C++ style coding does not compile -6647600 opensm: remove meanless 'const' keywords in APIs -323a74f opensm/osm_qos_parser_y.y: fix endless loop -0121a81 
opensm: fix endless looping in mcast_mgr -696c022 opensm: fix some obvious -Wsign-compare warnings -b91e3c3 opensm/osm_get_port_by_lid(): don't bother with lmc -ca582df opensm/osm_get_port_by_lid(): speedup a port lookup -fd846ee opensm/osm_mesh.c: simplify compare_switches() function -fe20080 osm_sa.c - void * arithmetic causes problems -220130f osm_helper.c use explicit value for struct init -0168ece use standard varargs syntax in macro OSM_LOG() -180b335 update functions to match .h prototypes -9240ef4 opensm/osm_ucast_lash: fix use after free bug -6f1a21a opensm: osm_get_port_by_lid() helper -c9e2818 opensm/osm_sa_path_record.c: validate multicast membership -225dcf5 opensm/osm_mesh.c: Remove edges in lash matrix -4dd928b opensm/osm_sa_mcmember_record.c: clean uninitialized variable use -c48f0bc opensm/osm_perfmgr_db.c: Fix memory leak of db nodes -82d3585 opensm/osm_notice.c: move logging code to separate function -9557f60 opensm/osm_inform.c: For traps 64-67, use GID from DataDetails in - log message -e2e78d9 opensm/opensm.8.in: Indicate default rule for Default partition -08c5beb opensm/osm_sa_node_record.c: dump NodeInfo with debug verbosity -1fe88f0 opensm/multicast: merge mcm_port and mcm_info -ba75747 opensm/multicast: consolidate port addition/removing code -5e61ab8 opensm: port object reference in mcm ports list -5c5dacf opensm: fix uninitialized return value in osm_sm_mcgrp_leave() -7cfe18d osm_ucast_ftree.c: Removed reverse_hop parameters from - fabric_route_upgoing_by_going_down -aa7fb47 opensm/multicast: kill mc group to_be_deleted flag -a4910fe opensm/osm_mcast_mgr.c: multicast routing by mlid - renaming -1d14060 opensm/multicast: remove change id tracking -5a84951 opensm: use mgrp pointer as osm_sm_mcgrp_join/leave() parameter -d8e3ff5 opensm: use mgrp pointer in port mcm_info -0631cd3 opensm doc: Indicated limited (rather than partial) partition - membership -1010535 opensm/osm_ucast_lash.c: In lash_core, return status -1 for all errors -942e20f 
opensm/osm_helper.c: Add SM priority changed into trap 144 description -2372999 opensm/osm_ucast_mgr: better lft setup -e268b32 opensm/osm_helper.c: Only change method when > rather than >= -9309e8c complib/cl_event.c: change nanosec var type long -d93b126 opensm/complib: account for nsec overflow in timeout values -ef4c8ac opensm/osm_qos_policy.c: matching PR query to QoS level with pkey -c93b58b opensm: fixing some data types in osm_req_get/set -2b89177 opensm/libvendor/osm_vendor_ibumad.c: Handle umad_alloc failure in - osm_vendor_get -2cba163 opensm/osm_helper.c: In osm_dump_dr_smp, fix endian of status -47397e3 opensm/osm_sm_mad_ctrl.c: Fix endian of status in error message -e83b7ca opensm/osm_mesh.c: Reorder switches for lash -9256239 opensm/osm_trap_rcv.c: Validate trap is 144 before checking for - NodeDescription changed -011d9ca opensm/osm_ucast_lash.c: Handle calloc failure in generate_cdg_for_sp -59964d7 opensm: fixing handling of opt.max_wire_smps -f4e3cd0 opensm/osm_ucast_lash.c: Directly call calloc/free rather than - create/delete_cdg -5a208bd opensm/osm_ucast_lash.c: Added error numbers to some error log messages -3b80d10 opensm/osm_helper.c: fix printing trap 258 details -f682fe0 opensm: do not configure MFTs when mcast support is disabled -cc42095 opensm/osm_sm_mad_ctrl.c: In sm_mad_ctrl_send_err_cb, indicate - failed attribute -aebf215 opensm/osm_ucast_lash.c: Remove osm_mesh_node_delete call from - switch_delete -1ef4694 opensm/osm_path.h: In osm_dr_path_init, only copy needed part of path -c594a2d opensm: osm_dr_path_extend can fail due to invalid hop count -46e5668 opensm/osm_lash: Fix use after free problem in osm_mesh_node_delete -81841dc opensm/osm_ucast_lash.c: Handle malloc failures better -2801203 opensm: remove extra "0x" from debug message. 
-88821d2 opensm/main.c: Display SMSL when specified -f814dcd opensm/osm_subnet.c: Format lash_start_vl consistent with other - uint8 items -66669c9 opensm/main.c: Display LASH start VL when specified -31bb0a7 opensm/osm_mcst_mgr.c: check number of switches only once -75e672c opensm: find MC group by MGID using fleximap -2b7260d Clarify the syntax of the hop_weights_file -e6f0070 opensm/osm_mesh.c: Improve VL utilization -27497a0 opensm/osm_ucast_ftree.c Fix assert comparing number of CAs to CN ports -3b98131 opensm/osm_qos_policy.c: Use proper size in malloc in - osm_qos_policy_vlarb_scope_create -e6f367d opensm/osm_ucast_ftree.c: Made error numbers unique in some log - messages -83261a8 osm_ucast_ftree.c Count number of hops instead of calculating it -7bdf4ff opensm/osm_sa_(path multipath)_record.c: Fix typo in a couple of - log messages -0f8ed87 opensm/osm_ucast_mgr.c: Add error numbers to some error log messages -0b5ccb4 complib/Makefile.am: prevent file duplications -e0b8ec9 opensm/osm_sminfo_rcv.c: clean type of smi_rcv_process_get_sm() -4d01005 opensm: sweep component processors return status value -6ad8d78 opensm/libvendor/osm_vendor_(ibumad mlx)_sa.c: Handle malloc - failure in __osmv_send_sa_req -cf97ebf opensm/osm_ucast_lash.(h c): Replace memory allocation by array -957461c opensm/osm_sa.c add attribute and component mask to error message -5d339a1 osm_dump.c dump port if lft is set up -518083d osm_port.c: check if op_vls = 0 before max_op_vls comparison -b6964cb opensm/osm_port.c: Change log level of Invalid OP_VLS 0 message - to VERBOSE -b27568c opensm/PerfMgr: Reduce host name length -bc495c0 opensm/osm_lid_mgr.c bug in opensm LID assignment -5a466fd opensm/osm_perfmgr_db.c: Remove unneeded initialization in - perfmgr_db_print_by_name -57cf328 opensm/osm_ucast_ftree.c Increase the size of the hop table -8323cf1 opensm/PerfMgr: Remove some underbars from internal names -65b1c15 opensm: Changes to spec and make files for updated release notes -cd226c7 
OpenSM: include/vendor/osm_vendor.h - Replaced #elif with no - condition by #else -9f8bd4a management: Fixed custom_release in SPEC files -c0b8207 opensm/PerfMgr: Change redir_tbl_size to num_ports for better clarity -596bb08 opensm/osm_sa.c: check for SA DB file only if requested -2f2bd4e opensm SA DB dump/restore: load SA DB only once -4abcbf2 opensm: Added print_desc to various log messages -5e3d235 opensm/osm_vendor_ibumad.c: Move error info into single message -8e5ca10 opensm/libvendor//osm_vendor_ibumad_sa.c: uninitialized fields -d13c2b6 opensm/osm_sm_mad_ctrl.c Changes to some error messages -f79d315 opensm/osm_sm_mad_ctrl.c: Add missing call to return mad to mad pool -150a9b1 opensm/osm_sa_mcmember_record.c: print mcast join/create failures in - VERBOSE instead of DEBUG level -9b7882a opensm/osm_vendor_ibumad.c: Change LID format to decimal in log message -5256c43 opensm/osm_vendor_mlx: fix compilation error -93db10d opensm/osm_vendor_mlx_txn.c: eliminate bunch of compilation warnings -156fdc1 opensm/osm_helper.c Log format changes -7a55434 opensm/osm_ucast_ftree.c Changed log level -a1694de opensm/osm_state_mgr.c Added more info to some error messages -fdec20a opensm/osm_trap_rcv.c: Eliminate heavy sweep on receipt of trap 145 -13a32a7 opensm - standardize on a single Windows #define - take #2 -b236a10 opensm/osm_db_files.c: kill useless malloc() castings -4ba0c26 opensm/osm_db_files.c: add '/' path delimited -e3b98a5 opensm/osm_sm_mad_ctrl.c: Fix qp0_mads_accounting -dbbe5b3 opensm/osm_subnet.c: fixing bug in dumping options file -f22856a opensm/osm_ucast_mgr.c: fix memory leak -0d5f0b6 opensm: osm_get_mgrp_by_mgid() helper -e3c044a osm_sa_mcmember_record.c: pass MCM Record data to mlid allocator -3dda2dc opensm/osm_sa_member_record.c: mlid independent MGID generator -1f95a3c opensm/osm_sa_mcmember_record.c: move mgid allocation code -b78add1 complib: replace intn_t types by C99 intptr_t -a864fd3 osmtest/osmt_mtl_regular_qp.c: cleaning uintn_t use 
-9e01318 opensm/osm_console.c: make const functions -f8c4c3e opensm/osm_mgrp_new(): add subnet db insertion -80da047 complib/fleximap: make compar callback to return int -bf7fe2d opensm: cleanup intn_t uses -0862bba opensm/main.c: opensm cannot be killed while asking for port guid -2b70193 opensm/complib: bug in cl_list_insert_array_head/tail functions -4764199 opensm - use C99 transportable data type for pointer storage -a9c326c opensm/osm_state_mgr.c: do not probe remote side of port 0 -4945706 opensm/osm_mcast_mgr.c: fix return value on alloc_mfts() failures -8312a24 OpenSM: Fix unused variable compiler warning. -ab8f0a3 opensm/partition: keep multicast group pointer -a817430 opensm: Only clear SMP beyond end of PortInfo attribute -52fb6f2 opensm/osm_switch.h: Remove dead osm_switch_get_physp_ptr routine -aa6d932 opensm/osm_mcast_tbl.c: In osm_mcast_tbl_clear_mlid, use memset to - clear port mask entry -2ad846b opensm/osm_trap_rcv.c: use source_lid and port_num for logging -b9d7756 opensm/osm_mcast_tbl: Fix size of port mask table array -11c0a9b opensm/main.c: Use strtoul rather than strtol for parsing transaction - timeout -0608af9 opensm/osm_sm_mad_ctrl.c: In sm_mad_ctrl_send_err_cb, revert setting - of init failure on QoS initialization failures -c6b4d4a opensm/osm_vendor_ibumad.c: Add transaction ID to osm_vendor_send - log message -520af84 opensm/osm_sa_path_record.c: don't set dgid pointer for local subnet -4a878fb opensm/osm_mcast_mgr.c: fix osm_mcast_mgr_compute_max_hops for - managed switch - -* Other less critical or visible bugs were also fixed. - -5 Main Verification Flows -------------------------- - -OpenSM verification is run using the following activities: -* osmtest - a stand-alone program -* ibmgtsim (IB management simulator) based - a set of flows that - simulate clusters, inject errors and verify OpenSM capability to - respond and bring up the network correctly. 
-* small cluster regression testing - where the SM is used on back to - back or single switch configurations. The regression includes - multiple OpenSM dedicated tests. -* cluster testing - when we run OpenSM to setup a large cluster, perform - hand-off, reboots and reconnects, verify routing correctness and SA - responsiveness at the ULP level (IPoIB and SDP). - -5.1 osmtest - -osmtest is an automated verification tool used for OpenSM -testing. Its verification flows are described by list below. - -* Inventory File: Obtain and verify all port info, node info, link and path - records parameters. - -* Service Record: - - Register new service - - Register another service (with a lease period) - - Register another service (with service p_key set to zero) - - Get all services by name - - Delete the first service - - Delete the third service - - Added bad flows of get/delete non valid service - - Add / Get same service with different data - - Add / Get / Delete by different component mask values (services - by Name & Key / Name & Data / Name & Id / Id only ) - -* Multicast Member Record: - - Query of existing Groups (IPoIB) - - BAD Join with insufficient comp mask (o15.0.1.3) - - Create given MGID=0 (o15.0.1.4) - - Create given MGID=0xFF12A01C,FE800000,00000000,12345678 (o15.0.1.4) - - Create BAD MGID=0xFA. 
(o15.0.1.6) - - Create BAD MGID=0xFF12A01B w/ link-local not set (o15.0.1.6) - - New MGID with invalid join state (o15.0.1.9) - - Retry of existing MGID - See JoinState update (o15.0.1.11) - - BAD RATE when connecting to existing MGID (o15.0.1.13) - - Partial JoinState delete request - removing FullMember (o15.0.1.14) - - Full Delete of a group (o15.0.1.14) - - Verify Delete by trying to Join deleted group (o15.0.1.14) - - BAD Delete of IPoIB membership (no prev join) (o15.0.1.15) - -* GUIDInfo Record: - - All GUIDInfoRecords in subnet are obtained - -* MultiPathRecord: - - Perform some compliant and noncompliant MultiPathRecord requests - - Validation is via status in responses and IB analyzer - -* PKeyTableRecord: - - Perform some compliant and noncompliant PKeyTableRecord queries - - Validation is via status in responses and IB analyzer - -* LinearForwardingTableRecord: - - Perform some compliant and noncompliant LinearForwardingTableRecord queries - - Validation is via status in responses and IB analyzer - -* Event Forwarding: Register for trap forwarding using reports - - Send a trap and wait for report - - Unregister non-existing - -* Trap 64/65 Flow: Register to Trap 64-65, create traps (by - disconnecting/connecting ports) and wait for report, then unregister. - -* Stress Test: send PortInfoRecord queries, both single and RMPP and - check for the rate of responses as well as their validity. - - -5.2 IB Management Simulator OpenSM Test Flows: - -The simulator provides ability to simulate the SM handling of virtual -topologies that are not limited to actual lab equipment availability. -OpenSM was simulated to bring up clusters of up to 10,000 nodes. Daily -regressions use smaller (16 and 128 nodes clusters). - -The following test flows are run on the IB management simulator: - -* Stability: - Up to 12 links from the fabric are randomly selected to drop packets - at drop rates up to 90%. The SM is required to succeed in bringing the - fabric up. 
The resulting routing is verified to be correct as well. - -* LID Manager: - Using LMC = 2 the fabric is initialized with LIDs. Faults such as - zero LID, duplicated LID, and non-aligned (to LMC) LIDs are - randomly assigned to various nodes and other errors are randomly - output to the guid2lid cache file. The SM sweep is run 5 times and - after each iteration a complete verification is made to ensure that all - LIDs that could possibly be maintained are kept, as well as that all nodes - were assigned a legal LID range. - -* Multicast Routing: - Nodes randomly join the 0xc000 group and eventually the - resulting routing is verified for completeness and adherence to - Up/Down routing rules. - -* osmtest: - The complete osmtest flow as described in section 5.1 is run on - the simulated fabrics. - -* Stress Test: - This flow merges fabric, LID and stability issues with continuous - PathRecord, ServiceRecord and Multicast Join/Leave activity to - stress the SM/SA during continuous sweeps. InformInfo Set/Delete/Get - were added to the test such that both existing and non-existing nodes - perform them in random order. - -5.3 OpenSM Regression - -Using a back-to-back or single switch connection, the following set of -tests is run nightly on the stacks described in table 2. The included -tests are: - -* Stress Testing: Flood the SA with queries from multiple channel - adapters to check the robustness of the entire stack up to the SA. - -* Dynamic Changes: Dynamic Topology changes, through randomly - dropping SMP packets, used to test OpenSM adaptation to an unstable - network & verify DB correctness. - -* Trap Injection: This flow injects traps to the SM and verifies that it - handles them gracefully. - -* SA Query Test: This test exhaustively checks the SA responses to all - possible single component masks.
To do that the test examines the - entire set of records the SA can provide, classifies them by their - field values and then selects every field (using component mask and a - value) and verifies that the response matches the expected set of records. - A random selection using multiple component mask bits is also performed. - -5.4 Cluster testing: - -Cluster testing is usually run before a distribution release. It -involves real hardware setups of 16 to 32 nodes (or more if a beta site -is available). Each test is validated by running all-to-all ping through the IB -interface. The test procedure includes: - -* Cluster bringup - -* Hand-off between 2 or 3 SM's while performing: - - Node reboots - - Switch power cycles (disconnecting the SM's) - -* Unresponsive port detection and recovery - -* osmtest from multiple nodes - -* Trap injection and recovery - - -6 Qualified Software Stacks and Devices ---------------------------------------- - -OpenSM Compatibility --------------------- -Note that OpenSM version 3.2.1 and earlier used a value of 1 in host -byte order for the default SM_Key, so there is a compatibility issue -with these earlier versions of OpenSM when the 3.2.2 or later version -is running on a little endian machine. This affects SM handover as well -as SA queries (saquery tool in infiniband-diags). 
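The SM_Key byte order issue noted above can be illustrated with a small
sketch. It is not OpenSM code; it only assumes that the SM_Key is a 64-bit
field and that a receiver decodes it in network (big-endian) byte order,
which the note implies but does not state. The helper name as_seen_on_wire
is made up for this illustration.

```python
import struct


def as_seen_on_wire(value, host_byteorder):
    """Return the integer a receiver decodes in network (big-endian) order
    when the sender stored `value` in its own host byte order.

    Hypothetical helper for illustration only; not part of OpenSM.
    """
    fmt = "<Q" if host_byteorder == "little" else ">Q"
    raw = struct.pack(fmt, value)          # bytes as laid out by the sender
    return struct.unpack(">Q", raw)[0]     # bytes as read in network order


if __name__ == "__main__":
    # Default key of 1 stored in host order on a little endian machine.
    print("0x%016x" % as_seen_on_wire(1, "little"))
    # Same default stored on a big endian machine.
    print("0x%016x" % as_seen_on_wire(1, "big"))
```

Under these assumptions, a key of 1 written in host order on a little endian
machine is read as 1 << 56 by a peer that expects network order, so the two
sides disagree on the default key; on a big endian machine both readings
agree.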
- - -Table 2 - Qualified IB Stacks -============================= - -Stack | Version ------------------------------------------|-------------------------- -The main stream Linux kernel | 2.6.x -OFED | 1.4 -OFED | 1.3 -OFED | 1.2 -OFED | 1.1 -OFED | 1.0 - -Table 3 - Qualified Devices and Corresponding Firmware -====================================================== - -Mellanox -Device | FW versions -------------------------------------|------------------------------- -InfiniScale | fw-43132 5.2.000 (and later) -InfiniScale III | fw-47396 0.5.000 (and later) -InfiniScale IV | fw-48436 7.1.000 (and later) -InfiniHost | fw-23108 3.5.000 (and later) -InfiniHost III Lx | fw-25204 1.2.000 (and later) -InfiniHost III Ex (InfiniHost Mode) | fw-25208 4.8.200 (and later) -InfiniHost III Ex (MemFree Mode) | fw-25218 5.3.000 (and later) -ConnectX IB | fw-25408 2.3.000 (and later) - -QLogic/PathScale -Device | Note ---------|----------------------------------------------------------- -iPath | QHT6040 (PathScale InfiniPath HT-460) -iPath | QHT6140 (PathScale InfiniPath HT-465) -iPath | QLE6140 (PathScale InfiniPath PE-880) -iPath | QLE7240 -iPath | QLE7280 - -Note 1: OpenSM does not run on an IBM Galaxy (eHCA) as it does not expose -QP0 and QP1. However, it does support it as a device on the subnet. - -Note 2: QoS firmware and Mellanox devices - -HCAs: QoS supported by ConnectX. QoS-enabled FW release is 2_5_000 and -later. - -Switches: QoS supported by InfiniScale III -Any InfiniScale III FW that is supported by OpenSM supports QoS. diff --git a/qib_release_notes.txt b/qib_release_notes.txt deleted file mode 100644 index 3c5558f..0000000 --- a/qib_release_notes.txt +++ /dev/null @@ -1,16 +0,0 @@ - Open Fabrics Enterprise Distribution (OFED) - qib in OFED 1.5.1 Release Notes - - March 2010 - -====================================================================== -1. 
Overview -====================================================================== -qib is the low level driver implementation for all QLogic InfiniPath -PCI-Express HCAs: gen 1 x8 SDR QLE7140, gen 1 x8 DDR QLE7240, -gen 1 x16 DDR QLE7280, gen 2 x8 QDR QLE7340 and QLE7342. - -The qib driver is new for OFED 1.5. - -The qib kernel driver obsoletes the ipath kernel driver but is -compatible with libipathverbs so no new user level components are needed. diff --git a/qlgc_vnic.cfg.sample b/qlgc_vnic.cfg.sample deleted file mode 100644 index 58734a6..0000000 --- a/qlgc_vnic.cfg.sample +++ /dev/null @@ -1,186 +0,0 @@ -# QLogic VNIC configuration file -# -# This file documents and describes the use of the -# VNIC configuration file qlgc_vnic.cfg. This file -# should reside in /etc/infiniband/qlgc_vnic.cfg -# -# -# Knowing how to fill the configuration file -############################################### -# -# For filling the configuration file you need to know -# some information about your EVIC/VEx device. This information -# can be obtained with the help of the ib_qlgc_vnic_query tool. -# "ib_qlgc_vnic_query -es" command will give DGID, IOCGUID and IOCSTRING information about -# the EVIC/VEx IOCs that are available through port 1 and -# "ib_qlgc_vnic_query -es -d /dev/infiniband/umad1" will give information about -# the EVIC/VEX IOCs available through port 2. -# -# Refer to the README for more information about the ib_qlgc_vnic_query tool. -# -# -# General structure of the configuration file -############################################### -# -# All lines beginning with a # are treated as comments. -# -# A simple configuration file consists of CREATE commands -# for each VNIC interface to be created. 
-# -# A simple CREATE command looks like this: -# -# {CREATE; NAME="eioc1"; -# DGID=0xfe8000000000000000066a0258000001; IOCGUID=0x66a0130000001; IOCSTRING="Chassis 0x00066A00010003F2, Slot 1, IOC 1"; -# } -# -#Where -# -#NAME - The device name for the interface -# -#DGID - The DGID of the IOC to use. -# -# If DGID is specified then IOCGUID MUST also be specified. -# -# Though specifying DGID is optional, using this option is recommended, -# as it will provide the quickest way of starting up the VNIC service. -# -# -#IOCGUID - The GUID of the IOC to use. -# -#IOCSTRING - The IOC Profile ID String of the IOC to use. -# -# Either an IOCGUID or an IOCSTRING MUST always be specified. -# -# If DGID is specified then IOCGUID MUST also be specified. -# -# If no DGID is specified and both IOCGUID and IOCSTRING are specified -# then IOCSTRING is given preference and the DGID of the IOC whose -# IOCSTRING is specified is used to create the VNIC interface. -# -# If hotswap capability of EVIC/VEx is to be used, then IOCSTRING -# must be specified. -# -#INSTANCE - Defaults to 0. Range 0-255. If a host will connect to the -# same IOC more than once, each connection must be assigned a unique -# number. -# -# -#RX_CSUM - defaults to TRUE. When true, indicates that the receive checksum -# should be done by the EVIC/VEx -# -#HEARTBEAT - defaults to 100. Specifies the time in 1/100'ths of a second -# between heartbeats -# -#PORT - Specification for local HCA port. First port is 1. -# -#HCA - Optional HCA specification for use with PORT specification. First HCA is 0. -# -#PORTGUID - The PORTGUID of the IB port to use. -# -# Use of PORTGUID for configuring the VNIC interface has an -# advantage on hosts having more than 1 HCAs plugged in. As -# PORTGUID is persistent for given IB port, VNIC configurations -# would be consistent and reliable - unaffected by restarts of -# OFED IB stack on host having more than 1 HCAs plugged in. 
-# -# On the downside, if HCA on the host is changed, VNIC interfaces -# configured with PORTGUID needs reconfiguration. -# -#IB_MULTICAST - Controls enabling or disabling of IB multicast feature on VNIC. -# Defaults to TRUE implying IB multicast is enabled for -# the interface. To disable IB multicast, set it to FALSE. -# -# Example of DGID and IOCGUID based configuration (this configuration will give -# the quickest start up of VNIC service): -# -# {CREATE; NAME="eioc1"; -# DGID=0xfe8000000000000000066a0258000001; IOCGUID=0x66a0130000001; -# } -# -# -# Example of IOCGUID based configuration: -# -# {CREATE; NAME="eioc1"; IOCGUID=0x66A013000010C; -# RX_CSUM=TRUE; -# HEARTBEAT=100; } -# -# Example of IOCSTRING based configuration: -# -# {CREATE; NAME="eioc1"; IOCSTRING="Chassis 0x00066A0050000018, Slot 2, IOC 1"; -# RX_CSUM=TRUE; -# HEARTBEAT=100; } -# -# -#Failover configuration: -######################### -# -# It is possible to create a VNIC interface with failover configuration -# by using the PRIMARY and SECONDARY commands. The IOC specified in -# the PRIMARY command will be used as the primary IOC for this interface -# and the IOC specified in the SECONDARY command will be used as the -# fail-over backup in case the connection with the primary IOC fails -# for some reason. -# -# PRIMARY and SECONDARY commands are written in the following way: -# -# PRIMARY={DGID=...;IOCGUID=...; IOCSTRING=...;INSTANCE=... } - -# IOCGUID, and INSTANCE must be values that are unique to the primary interface -# -# SECONDARY={DGID=...;IOCGUID=...; INSTANCE=... } - -# IOCGUID, and INSTANCE must be values that are unique to the secondary interface -# -# OR it can also be specified without using DGID, like this: -# -# PRIMARY={IOCGUID=...; INSTANCE=... } - IOCGUID may be substituted with -# IOCSTRING. IOCGUID, IOCSTRING, and INSTANCE must be values that are -# unique to the primary interface -# -# SECONDARY={IOCGUID=...; INSTANCE=... 
} - bring up a secondary connection for -# fail-over. IOCGUID may be substituted with IOCSTRING. IOCGUID, IOCSTRING, -# and INSTANCE values to be used for the secondary connection -# -# -#Examples of failover configuration: -# -#{CREATE; NAME="veth1"; -# PRIMARY={ DGID=0xfe8000000000000000066a0258000001; IOCGUID=0x66a0130000001; IOCSTRING="Chassis 0x00066A00010003F2, Slot 1, IOC 1"; -# INSTANCE=1; PORT=1; } -# SECONDARY={DGID=0xfe8000000000000000066a0258000001; IOCGUID=0x66a0230000001; IOCSTRING="Chassis 0x00066A00010003F2, Slot 1, IOC 2"; -# INSTANCE=1; PORT=2; } -#} -# -# {CREATE; NAME="eioc2"; -# PRIMARY = {IOCGUID=0x66A0130000105; INSTANCE=0; PORT=1; } -# SECONDARY = {IOCGUID=0x66A013000010C; INSTANCE=0; PORT=2; } -# } -# -#Example of configuration with IB_MULTICAST -# -# {CREATE; NAME="eioc2"; -# PRIMARY = {IOCGUID=0x66A0130000105; INSTANCE=0; PORT=1; IB_MULTICAST=FALSE; } -# SECONDARY = {IOCGUID=0x66A013000010C; INSTANCE=0; PORT=2; IB_MULTICAST=FALSE; } -# } -# -# Example of HCA/PORT and PORTGUID configurations: -# { -# CREATE; NAME="veth1"; -# PRIMARY={IOCGUID=00066a02de000070; INSTANCE=1; PORTGUID=0x0002c903000010f5; } -# SECONDARY={IOCGUID=00066a02de000070; INSTANCE=2; PORTGUID=0x0002c903000010f6; } -# } -# -# { -# CREATE; NAME="veth2"; -# PRIMARY={IOCGUID=00066a02de000070; DGID=fe8000000000000000066a21de000070; INSTANCE=3; HCA=1; PORT=2; } -# SECONDARY={IOCGUID=00066a02de000070; DGID=fe8000000000000000066a21de000070; INSTANCE=4; HCA=0; PORT=1; } -# } -# -# { -# CREATE; NAME="veth3"; -# IOCSTRING="EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2"; -# INSTANCE=5; PORTGUID=0x0002c90300000786; -# } -# { -# CREATE; NAME="veth4"; -# IOCGUID=00066a02de000070; -# INSTANCE=6; HCA=1; PORT=2; -# } diff --git a/qperf_release_notes.txt b/qperf_release_notes.txt deleted file mode 100644 index 28a5114..0000000 --- a/qperf_release_notes.txt +++ /dev/null @@ -1,79 +0,0 @@ -Distribution - Open Fabrics Enterprise Distribution (OFED) 1.5, December 2009 - -Summary - qperf - 
Measure RDMA and IP performance - -Overview - qperf measures bandwidth and latency between two nodes. It can work over - TCP/IP as well as the RDMA transports. - -Quick Start - * Since qperf measures latency and bandwidth between two nodes, you need - access to two nodes. Assume they are called node1 and node2. - - * On node1, run qperf without any arguments. It will act as a server and - continue to run until asked to quit. - - * To measure TCP bandwidth between the two nodes, on node2, type: - qperf node1 tcp_bw - - * To measure RDMA RC latency, type (on node2): - qperf node1 rc_lat - - * To measure RDMA UD latency using polling, type (on node2): - qperf node1 -P 1 ud_lat - - * To measure SDP bandwidth, on node2, type: - qperf node1 sdp_bw - -Documentation - * Man page available. Type - man qperf - - * To get a list of examples, type: - qperf --help examples - - * To get a list of tests, type: - qperf --help tests - -Tests - Miscellaneous - conf Show configuration - quit Cause the server to quit - Socket Based - rds_bw RDS streaming one way bandwidth - rds_lat RDS one way latency - sctp_bw SCTP streaming one way bandwidth - sctp_lat SCTP one way latency - sdp_bw SDP streaming one way bandwidth - sdp_lat SDP one way latency - tcp_bw TCP streaming one way bandwidth - tcp_lat TCP one way latency - udp_bw UDP streaming one way bandwidth - udp_lat UDP one way latency - RDMA Send/Receive - ud_bw UD streaming one way bandwidth - ud_bi_bw UD streaming two way bandwidth - ud_lat UD one way latency - rc_bw RC streaming one way bandwidth - rc_bi_bw RC streaming two way bandwidth - rc_lat RC one way latency - uc_bw UC streaming one way bandwidth - uc_bi_bw UC streaming two way bandwidth - uc_lat UC one way latency - RDMA - rc_rdma_read_bw RC RDMA read streaming one way bandwidth - rc_rdma_read_lat RC RDMA read one way latency - rc_rdma_write_bw RC RDMA write streaming one way bandwidth - rc_rdma_write_lat RC RDMA write one way latency - rc_rdma_write_poll_lat RC RDMA write 
one way polling latency - uc_rdma_write_bw UC RDMA write streaming one way bandwidth - uc_rdma_write_lat UC RDMA write one way latency - uc_rdma_write_poll_lat UC RDMA write one way polling latency - InfiniBand Atomics - rc_compare_swap_mr RC compare and swap messaging rate - rc_fetch_add_mr RC fetch and add messaging rate - Verification - ver_rc_compare_swap Verify RC compare and swap - ver_rc_fetch_add Verify RC fetch and add diff --git a/rdma_cm_release_notes.txt b/rdma_cm_release_notes.txt deleted file mode 100644 index 07e4cab..0000000 --- a/rdma_cm_release_notes.txt +++ /dev/null @@ -1,133 +0,0 @@ - Open Fabrics Enterprise Distribution (OFED) - RDMA CM in OFED 1.5 Release Notes - - July 2010 - - -=============================================================================== -Table of Contents -=============================================================================== -1. Overview -2. New Features -3. Known Issues - -=============================================================================== -1. Overview -=============================================================================== -The RDMA CM is a communication manager used to setup reliable, connected -and unreliable datagram data transfers. It provides an RDMA transport -neutral interface for establishing connections. The API is based on sockets, -but adapted for queue pair (QP) based semantics: communication must be -over a specific RDMA device, and data transfers are message based. - - -The RDMA CM only provides the communication management (connection setup / -teardown) portion of an RDMA API. It works in conjunction with the verbs -API for data transfers. - -=============================================================================== -2. 
New Features -=============================================================================== -for OFED 1.5.2: - -Several enhancements were added to librdmacm release 1.0.12 that -are intended to simplify using RDMA devices and address scalability issues. -These changes were in response to long standing requests to make -connection establishment 'more like sockets'. For full details, -users should refer to the appropriate man pages. Major changes include: - -* Support synchronous operation for library calls. Users can control - whether an rdma_cm_id operates asynchronously or synchronously based on - the rdma_event_channel parameter. Use of synchronous operations - reduces the amount of application code required to use the librdmacm - by eliminating the need for event processing code. - - An rdma_cm_id will be marked for synchronous operation if the - rdma_event_channel parameter is NULL for rdma_create_id or - rdma_migrate_id. Users can toggle between synchronous and - asynchronous operation through the rdma_migrate_id call. - - Calls that operate synchronously include rdma_resolve_addr, - rdma_resolve_route, rdma_connect, rdma_accept, and rdma_get_request. - Synchronous event data is returned to the user through the - rdma_cm_id. - -* The addition of a new API: rdma_getaddrinfo. This call is modeled - after getaddrinfo, but for RDMA devices and connections. It has the - following notable deviations from getaddrinfo: - - A source address is returned as part of the call to allow the - user to allocate necessary local HW resources for connections. - - Optional routing information may be returned to support - Infiniband fabrics. IB routing information includes necessary - path record data. rdma_getaddrinfo will obtain this information - if IB ACM support (see below) is enabled. The use of IB ACM - is not required for rdma_getaddrinfo. 
-
-  - rdma_getaddrinfo provides future extensions to support
-    more complex address and route resolution mechanisms, such as
-    multiple path support and failover.
-
-* Support for new APIs: rdma_get_request, rdma_create_ep, and
-  rdma_destroy_ep. rdma_get_request simplifies the passive side
-  implementation by adding synchronous support for accepting new
-  connections. rdma_create_ep combines the functionality of
-  rdma_create_id, rdma_create_qp, rdma_resolve_addr, and rdma_resolve_route
-  in a single API that uses the output of rdma_getaddrinfo as its input.
-
-* Support for optional parameters. To simplify support for casual RDMA
-  developers and researchers, the librdmacm can allocate protection
-  domains, completion queues, and queue pairs on a user's behalf.
-  This reduces the amount of information that a developer
-  must learn in order to use RDMA, and allows the user to take
-  advantage of higher-level completion processing abstractions.
-
-  In addition to optional parameters, a user can also specify that the
-  librdmacm should automatically select usable values for RDMA read
-  operations.
-
-* Add support for IB ACM. IB ACM (InfiniBand Assistant for Communication
-  Management) defines a socket based protocol to an IB address and route
-  resolution service. One implementation of that service is provided
-  separately by the ibacm package, but anyone can implement the service
-  provided that they adhere to the IB ACM socket protocol. IB ACM is an
-  experimental service targeted at increasing the scalability of applications
-  running on a large cluster.
-
-  Use of IB ACM is not required and is controlled through the build option
-  '--with-ib_acm'. If the librdmacm fails to contact the IB ACM service, it
-  reverts to using kernel services to resolve address and routing data.
-
-* Add RDMA helper routines. The librdmacm provides a set of simpler verbs
-  calls for posting work requests, registering memory, and checking for
-  completions.
These calls are wrappers around libibverbs routines. - -=============================================================================== -3. Known Issues -=============================================================================== -The RDMA CM relies on the operating system's network configuration tables to -map IP addresses to RDMA devices. Incorrectly configured network -configurations can result in the RDMA CM being unable to locate the correct -RDMA device. Currently, the RDMA CM only supports IPv4 addressing. - -All RDMA interfaces must provide a way to map IP addresses to an RDMA device. -For Infiniband, this is done using IPoIB, and requires correctly configured -IPoIB device interfaces sharing the same multicast domain. For details on -configuring IPoIB, refer to ipoib_release_notes.txt. For RDMA devices to -communicate, they must support the same underlying network and data link -layers. - -If you experience problems using the RDMA CM, you may want to check the -following: - - * Verify that you have IP connectivity over the RDMA devices. For example, - ping between iWarp or IPoIB devices. - - * Ensure that IP network addresses assigned to RDMA devices do not - overlap with IP network addresses assigned to standard Ethernet devices. - - * For multicast issues, either bind directly to a specific RDMA device, or - configure the IP routing tables to route multicast traffic over an RDMA - device's IP address. - diff --git a/rds_release_notes.txt b/rds_release_notes.txt deleted file mode 100644 index d7a6638..0000000 --- a/rds_release_notes.txt +++ /dev/null @@ -1,110 +0,0 @@ - Open Fabrics Enterprise Distribution (OFED) - RDS in OFED 1.5.1 Release Notes - March 2010 - - -=============================================================================== -Table of Contents -=============================================================================== -1. Overview -2. Supported Platforms -3. Installation & Configuration -4. New Features -5. 
Bug fixes and Enhancements since OFED 1.4
-6. Bug fixes and Enhancements since OFED 1.3.1
-7. Bug fixes and Enhancements since OFED 1.3
-8. Bug fixes and Enhancements since OFED 1.2
-9. Known Issues
-
-===============================================================================
-1. Overview
-===============================================================================
-RDS (Reliable Datagram Sockets) is a socket API that provides reliable,
-in-order datagram delivery between sockets over a variety of transports.
-For details see RDS_README.txt and man 7 rds.
-
-===============================================================================
-2. Supported Platforms
-===============================================================================
-
-Same as overall OFED release.
-
-===============================================================================
-3. Installation & Configuration
-===============================================================================
-To install RDS, select rds in OFED's manual installation or put 'rds=y' in
-ofed.conf for unattended installation.
-
-To load the RDS module at boot, edit the file '/etc/infiniband/openib.conf' as
-follows:
-
-# Load RDS module
-RDS_LOAD=yes
-
-===============================================================================
-4. New Features
-===============================================================================
-
-GET_MR_FOR_DEST sockopt added. This allows an MR to be associated with
-a remote host. GET_MR sockopt deprecated.
-
-Transports now modularized: rds_rdma.ko (IB and iWARP) and
-rds_tcp.ko. This enables RDS use with TCP, without the IB stack
-loaded.
-
-Improved receive processing to lower the amount of time spent with interrupts
-disabled.
-
-===============================================================================
-5. 
Bug fixes and Enhancements since OFED 1.4 -=============================================================================== - -* Set retry_count to 2 and make modifiable via modparam -* Many locking fixes -* Rebased to mainline kernel 2.6.30 resulted in rds trace framework - being removed. - -=============================================================================== -6. Bug fixes and Enhancements since OFED 1.3.1 -=============================================================================== -- RDMA completion notifications are signalled when the IB stack gives us the - completion event for the accompanying RDS message. This is a change from the - 1.3.x behavior, which signalled completion notifications when the RDS message - was ACKed. -- Fixed bugs associated with congestion monitoring. -- FMR pool size increased from 2K to 4K -- Added support for RDMA_CM_EVENT_ADDR_CHANGE event. -- RDS should now work on Qlogic HCAs. - -=============================================================================== -7. Bug fixes and Enhancements since OFED 1.3 -=============================================================================== -- Fix a bug in RDMA signaling -- Add 3 more stats counters -- Fix a kernel crash that can occur when RDS/IB connection drops -- Fixes for RDMA API - -=============================================================================== -8. Bug fixes and Enhancements since OFED 1.2 -=============================================================================== - -1) Wire protocol for RDS v3 and RDS v2 are not compatible. - -2) RDS over TCP is disabled in OFED 1.3. We will re-enable in future release. - -3) Congestion monitoring support gives the application more fine-grained - control. - -With explicit monitoring, the application polls for POLLIN as before, and -additionally uses the RDS_CONG_MONITOR socket option to install a 64bit mask -value in the socket, where each bit corresponds to a group of ports. 
-When a congestion update arrives, RDS checks the set of ports that became
-uncongested against the bit mask installed in the socket. If they overlap, a
-control message is enqueued on the socket, and the application is woken up.
-When the application calls recvmsg(2), it will be given the control message
-containing the bitmap on the socket.
-
-===============================================================================
-9. Known Issues
-===============================================================================
-1. RDMAs over 1 MiB are not supported.
diff --git a/readme_and_howto/HOWTO.build_ofed b/readme_and_howto/HOWTO.build_ofed
new file mode 100644
index 0000000..195185e
--- /dev/null
+++ b/readme_and_howto/HOWTO.build_ofed
@@ -0,0 +1,69 @@
+ Open Fabrics Enterprise Distribution (OFED)
+ How To Build OFED 1.5.1
+
+ March 2010
+
+
+==============================================================================
+Table of contents
+==============================================================================
+1. Overview
+2. Usage
+3. Requirements
+
+==============================================================================
+1. Overview
+==============================================================================
+The script "build.pl" is used to build the OFED package based on the
+OpenFabrics project. The package is built under the /tmp directory.
+
+See OFED_release_notes.txt for more details.
+
+==============================================================================
+2. 
Usage
+==============================================================================
+
+The build script for the OFED package can be downloaded from:
+ git://git.openfabrics.org/~vlad/build.git
+ branch: master
+
+Name: build.pl
+
+Usage: ./build.pl --version [-r|--release]|[--daily] [-d|--distribution ] [-v|--verbose]
+ [-b|--builddir ]
+ [-p|--packagesdir ]
+ [--pre-build ]
+ [--skip-prebuild]
+ [--post-build ]
+ [--skip-postbuild]
+
+Example:
+
+ ./build.pl --version 1.5.1-rc1 -p packages-ofed
+
+ This command will create a package (i.e., subtree) called OFED-1.5.1-rc1
+ under /tmp/$USER/
+
+==============================================================================
+3. Requirements
+==============================================================================
+
+1. Git:
+ Can be downloaded from:
+ http://www.kernel.org/pub/software/scm/git
+
+2. Autotools:
+
+ libtool-1.5.20 or higher
+ autoconf-2.59 or higher
+ automake-1.9.6 or higher
+ m4-1.4.4 or higher
+
+ The above tools can be downloaded from the following URLs:
+
+ libtool - "http://ftp.gnu.org/gnu/libtool/libtool-1.5.20.tar.gz"
+ autoconf - "http://ftp.gnu.org/gnu/autoconf/autoconf-2.59.tar.gz"
+ automake - "http://ftp.gnu.org/gnu/automake/automake-1.9.6.tar.gz"
+ m4 - "http://ftp.gnu.org/gnu/m4/m4-1.4.4.tar.gz"
+
+3. wget or ssh client
diff --git a/readme_and_howto/MLNX_EN_README.txt b/readme_and_howto/MLNX_EN_README.txt
new file mode 100644
index 0000000..88464ab
--- /dev/null
+++ b/readme_and_howto/MLNX_EN_README.txt
@@ -0,0 +1,113 @@
+===============================================================================
+ MLNX_EN driver for Mellanox Adapter Cards with 10GigE Support
+ README for OFED 1.5.2
+
+ December 2010
+===============================================================================
+
+Contents:
+=========
+1. Overview
+2. Ethernet Driver Usage and Configuration
+
+
+1. Overview
+===========
+The MLNX_EN driver is composed of the mlx4_core and mlx4_en kernel modules.
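Because the driver consists of two kernel modules, a quick sanity check is to confirm that both are loaded. The following is a minimal sketch, assuming a Linux system that lists loaded modules in /proc/modules; the function name mlx4_modules_loaded and its optional file argument are hypothetical and exist only for illustration.

```shell
# Hypothetical helper: succeed only if both MLNX_EN modules appear in the
# module list. The optional argument (default /proc/modules) is an
# assumption, added so the check can be tried against a saved copy.
mlx4_modules_loaded() {
    modules_file="${1:-/proc/modules}"
    for mod in mlx4_core mlx4_en; do
        # Each /proc/modules line starts with the module name and a space.
        if ! grep -q "^${mod} " "$modules_file"; then
            echo "module ${mod} is not loaded" >&2
            return 1
        fi
    done
    echo "mlx4_core and mlx4_en are loaded"
}
```

Calling mlx4_modules_loaded with no argument checks the running system; a non-zero exit status means at least one of the two modules is missing.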
+ +The MLNX_EN driver release exposes the following capabilities: +- Single/Dual port +- Fibre Channel over Ethernet (FCoE) +- Up to 16 Rx queues per port +- 5 TX queues per port +- Rx steering mode: Receive Core Affinity (RCA) +- Tx arbitration mode: VLAN user-priority (off by default) +- MSI-X or INTx +- Adaptive interrupt moderation +- HW Tx/Rx checksum calculation +- Large Send Offload (i.e., TCP Segmentation Offload) +- Large Receive Offload +- IP reassembly offload for fragmented IP packets +- Multi-core NAPI support +- VLAN Tx/Rx acceleration (HW VLAN stripping/insertion) +- HW VLAN filtering +- HW multicast filtering +- ifconfig up/down + mtu changes (up to 10K) +- Ethtool support +- Net device statistics + + +2. Ethernet Driver Usage and Configuration +========================================== + +- To assign an IP address to the interface run: + #> ifconfig eth + + where 'x' is the OS assigned interface number. + +- To check driver and device information run: + #> ethtool -i eth + + Example: + #> ethtool -i eth2 + driver: mlx4_en (MT_0BD0110004) + version: 1.5.2 (March 2010) + firmware-version: 2.8.000 + bus-info: 0000:0e:00.0 + +- To query stateless offload status run: + #> ethtool -k eth + +- To set stateless offload status run: + #> ethtool -K eth [rx on|off] [tx on|off] [sg on|off] [tso on|off] + +- To query interrupt coalescing settings run: + #> ethtool -c eth + +- By default, the driver uses adaptive interrupt moderation for the receive path, + which adjusts the moderation time according to the traffic pattern. 
+ Adaptive moderation settings can be set by: + #> ethtool -C eth adaptive-rx on|off + +- To set interrupt coalescing settings run: + #> ethtool -C eth [rx-usecs N] [rx-frames N] [tx-usecs N] [tx-frames N] + + Note: usec settings correspond to the time to wait after the *last* packet + sent/received before triggering an interrupt + +- To query pause frame settings run: + #> ethtool -a eth + +- To set pause frame settings run: + #> ethtool -A eth [rx on|off] [tx on|off] + +- To query ring size values run: + #> ethtool -g eth + +- To modify rings size run: + #> ethtool -G eth [rx ] [tx ] + +- To obtain additional device statistics, run: + #> ethtool -S eth + +- To perform a self diagnostics test, run: + #> ethtool -t eth + + +The driver defaults to the following parameters: +- Both ports are activated (i.e., a net device is created for each port) +- The number of Rx rings for each port is the number of on-line CPUs +- Per-core NAPI is enabled +- LRO is enabled with 32 concurrent sessions per Rx ring + +Some of these values can be changed using module parameters, which are +detailed by running: +#> modinfo mlx4_en + +To set non-default values to module parameters, the following line should be +added to /etc/modprobe.conf file: + "options mlx4_en = = ..." + +Values of all parameters can be observed in /sys/module/mlx4_en/parameters/. + + diff --git a/readme_and_howto/MPI_README.txt b/readme_and_howto/MPI_README.txt new file mode 100644 index 0000000..4713166 --- /dev/null +++ b/readme_and_howto/MPI_README.txt @@ -0,0 +1,612 @@ + + MPI in OFED 1.5.2 README + + September 2010 + +=============================================================================== +Table of Contents +=============================================================================== +1. Overview +2. MVAPICH +3. Open MPI +4. MVAPICH2 + + +=============================================================================== +1. 
Overview
+===============================================================================
+Three MPI stacks are included in this release of the Open Fabrics
+Enterprise Distribution (OFED):
+- MVAPICH 1.2.0
+- Open MPI 1.4.2
+- MVAPICH2 1.5.1
+
+Setup, compilation and run information of MVAPICH, Open MPI and MVAPICH2 is
+provided below in sections 2, 3 and 4 respectively.
+
+1.1 Installation Note
+---------------------
+In Step 2 of the main menu of install.pl, options 2, 3 and 4 can install
+one or more MPI stacks. Please refer to docs/OFED_Installation_Guide.txt
+to learn about the different options.
+
+The installation script allows each MPI to be compiled using one or
+more compilers. Users need to set, per MPI stack installed, the PATH
+and/or LD_LIBRARY_PATH so as to use the desired compiled MPI stacks.
+
+1.2 MPI Tests
+-------------
+OFED includes four basic tests that can be run against each MPI stack:
+bandwidth (bw), latency (lt), Intel MPI Benchmark, and Presta. The tests
+are located under: /mpi///tests/,
+where is /usr by default.
+
+1.3 Selecting Which MPI to Use: mpi-selector
+--------------------------------------------
+Depending on how the OFED installer was run, multiple different MPI
+implementations may be installed on your system. The OFED installer
+will run an MPI selector tool during the installation process,
+presenting a menu-based interface to select which MPI implementation
+is set as the default for all users. This MPI selector tool can be
+re-run at any time by the administrator after the OFED installer
+completes to modify the site-wide default MPI implementation selection
+by invoking the "mpi-selector-menu" command (root access is typically
+required to change the site-wide default).
+
+The mpi-selector-menu command can also be used by non-administrative
+users to override the site-wide default MPI implementation selection
+by setting a per-user default.
Specifically: unless a user runs the +MPI selector tool to set a per-user default, their environment will be +setup for the site-wide default MPI implementation. + +Note that the default MPI selection does *not* affect the shell from +which the command was invoked (or any other shells that were already +running when the MPI selector tool was invoked). The default +selection is only changed for *new* shells started after the selector +tool was invoked. It is recommended that once the default MPI +implementation is changed via the selector tool, users should logout +and login again to ensure that they have a consistent view of the +default MPI implementation. Other tools can be used to change the MPI +environment in the current shell, such as the environment modules +software package (which is not included in the OFED software package; +see http://modules.sourceforge.net/ for details). + +Note that the site-wide default is set in a file that is typically not +on a networked file system, and is therefore specific to the host on +which it was run. As such, it is recommended to run the +mpi-selector-menu command on all hosts in a cluster, picking the same +default MPI implementation on each. It may be more convenient, +however, to use the mpi-selector command in script-based scenarios +(such as running on every host in a cluster); mpi-selector effects all +the same functionality as mpi-selector-menu, but is intended for +automated environments. See the mpi-selector(1) manual page for more +details. + +Additionally, per-user defaults are set in a file in the user's $HOME +directory. If this directory is not on a network-shared file system +between all hosts that will be used for MPI applications, then it also +needs to be propagated to all relevant hosts. + +Note: The MPI selector tool typically sets the PATH and/or +LD_LIBRARY_PATH for a given MPI implementation. This step can, of +course, also be performed manually by a user or on a site-wide basis. 
+The MPI selector tool simply bundles up this functionality in a
+convenient set of command line tools and menus.
+
+1.4 Updating MPI Installations
+------------------------------
+Note that all of the MPI implementations included in the OFED software
+package are the versions that were available when OFED v1.5 was
+released. They have been QA tested with this version of OFED and are
+fully supported.
+
+However, note that administrators can go to the web sites of each MPI
+implementation and download / install newer versions after OFED has
+been successfully installed. There is nothing specific about the
+OFED-included MPI software packages that prohibits installing
+newer/other MPI implementations.
+
+It should also be noted that versions of MPI released after OFED v1.5
+are not supported by OFED. But since each MPI has its own release
+schedule and QA process (each of which involves testing with the OFED
+stack), it may sometimes be desirable -- or even advisable, depending
+on how old the MPI implementations are that are included in OFED -- to
+download and install a newer version of MPI.
+
+The web sites of each MPI implementation are listed below:
+
+- Open MPI: http://www.open-mpi.org/
+- MVAPICH: http://mvapich.cse.ohio-state.edu/
+- MVAPICH2: http://mvapich.cse.ohio-state.edu/overview/mvapich2/
+
+===============================================================================
+2. MVAPICH MPI
+===============================================================================
+
+This package is version 1.2.0 of the MVAPICH software package,
+and is the officially supported MPI stack for this release of OFED.
+See http://mvapich.cse.ohio-state.edu for more details.
+
+
+2.1 Setting up for MVAPICH
+--------------------------
+To launch MPI jobs, the MVAPICH installation directory needs to be included
+in PATH and LD_LIBRARY_PATH.
To set them, execute one of the following +commands: + source /mpi///bin/mpivars.sh + -- when using sh for launching MPI jobs + or + source /mpi///bin/mpivars.csh + -- when using csh for launching MPI jobs + + +2.2 Compiling MVAPICH Applications: +----------------------------------- +***Important note***: +A valid Fortran compiler must be present in order to build the MVAPICH MPI +stack and tests. + +The default gcc-g77 Fortran compiler is provided with all RedHat Linux +releases. SuSE distributions earlier than SuSE Linux 9.0 do not provide +this compiler as part of the default installation. + +The following compilers are supported by OFED's MVAPICH package: Gcc, +Intel, Pathscale and PGI. The install script prompts the user to choose +the compiler with which to build the MVAPICH RPM. Note that more +than one compiler can be selected simultaneously, if desired. + +For details see: + http://mvapich.cse.ohio-state.edu/support + +To review the default configuration of the installation, check the default +configuration file: /mpi///etc/mvapich.conf + +2.3 Running MVAPICH Applications: +--------------------------------- +Requirements: +o At least two nodes. Example: mtlm01, mtlm02 +o Machine file: Includes the list of machines. Example: /root/cluster +o Bidirectional rsh or ssh without a password + +Note: ssh will be used unless -rsh is specified. 
In order to use +rsh, add to the mpirun_rsh command the parameter: -rsh + +*** Running OSU tests *** + +/usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.2.0/tests/osu_benchmarks-3.1.1/osu_bw +/usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.2.0/tests/osu_benchmarks-3.1.1/osu_latency +/usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.2.0/tests/osu_benchmarks-3.1.1/osu_bibw +/usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.2.0/tests/osu_benchmarks-3.1.1/osu_bcast + +*** Running Intel MPI Benchmark test (Full test) *** + +/usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.2.0/tests/IMB-3.2/IMB-MPI1 + +*** Running Presta test *** + +/usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.2.0/tests/presta-1.4.0/com -o 100 +/usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.2.0/tests/presta-1.4.0/glob -o 100 +/usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.2.0/tests/presta-1.4.0/globalop + + +=============================================================================== +3. Open MPI +=============================================================================== + +Open MPI is a next-generation MPI implementation from the Open MPI +Project (http://www.open-mpi.org/). Version 1.4 of Open MPI is +included in this release, which is also available directly from the +main Open MPI web site. + +A working Fortran compiler is not required to build Open MPI, but some +of the included MPI tests are written in Fortran. These tests will +not compile/run if Open MPI is built without Fortran support. + +The following compilers are supported by OFED's Open MPI package: GNU, +Pathscale, Intel, or Portland. 
The install script prompts the user
+for the compiler with which to build the Open MPI RPM. Note that more
+than one compiler can be selected simultaneously, if desired.
+
+Users should check the main Open MPI web site for additional
+documentation and support. (Note: The FAQ file considers OpenFabrics
+tuning among other issues.)
+
+3.1 Setting up for Open MPI
+---------------------------
+Selecting Open MPI via the mpi-selector-menu and mpi-selector
+tools will perform all the necessary setup for users to build and run
+Open MPI applications. If you use the MPI selector tools, you can
+skip the rest of this section.
+
+If you do not wish to use the MPI selector tools, the Open MPI team
+strongly advises users to put the Open MPI installation directory in
+their PATH and LD_LIBRARY_PATH. This can be done at the system level
+if all users are going to use Open MPI. Specifically:
+
+- add /bin to PATH
+- add /lib to LD_LIBRARY_PATH
+
+ is the directory where the desired Open MPI instance was
+installed ("instance" refers to the compiler used for Open MPI
+compilation at install time).
+
+If you are using a job scheduler to launch MPI jobs (e.g., SLURM,
+Torque), setting the PATH and LD_LIBRARY_PATH is still required, but
+it does not need to be set in your shell startup files. Procedures
+for adding these values to PATH and LD_LIBRARY_PATH are
+described in detail at:
+
+ http://www.open-mpi.org/faq/?category=running
+
+3.2 Open MPI Installation Support / Updates
+-------------------------------------------
+The OFED package will install Open MPI with support for TCP, shared
+memory, and the OpenFabrics network stacks. No other networks are
+supported by the OFED Open MPI installation.
+
+Open MPI supports a wide variety of run-time environments. The OFED
+installer will not include support for all of them, however (e.g.,
+Torque and PBS-based environments are not supported by the
+OFED-installed Open MPI).
+ +The ompi_info command can be used to see what support was installed; +look for plugins for your specific environment / network / etc. If +you do not see them, the OFED installer did not include support for +them. + +As described above, administrators or users can go to the Open MPI web +site and download / install either a newer version of Open MPI (if +available), or the same version with different configuration options +(e.g., support for Torque / PBS-based environments). + +3.3 Compiling Open MPI Applications +----------------------------------- +(copied from http://www.open-mpi.org/faq/?category=mpi-apps -- see +this web page for more details) + +The Open MPI team strongly recommends that you simply use Open MPI's +"wrapper" compilers to compile your MPI applications. That is, instead +of using (for example) gcc to compile your program, use mpicc. Open +MPI provides a wrapper compiler for four languages: + + Language Wrapper compiler name + ------------- -------------------------------- + C mpicc + C++ mpiCC, mpicxx, or mpic++ + (note that mpiCC will not exist + on case-insensitive file-systems) + Fortran 77 mpif77 + Fortran 90 mpif90 + ------------- -------------------------------- + +Note that if no Fortran 77 or Fortran 90 compilers were found when +Open MPI was built, Fortran 77 and 90 support will automatically be +disabled (respectively). + +If you expect to compile your program as: + + > gcc my_mpi_application.c -lmpi -o my_mpi_application + +Simply use the following instead: + + > mpicc my_mpi_application.c -o my_mpi_application + +Specifically: simply adding "-lmpi" to your normal compile/link +command line *will not work*. See +http://www.open-mpi.org/faq/?category=mpi-apps if you cannot use the +Open MPI wrapper compilers. 
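When the wrapper compilers cannot be used directly, they can still report the flags they would add. The following is a minimal sketch, assuming an Open MPI wrapper such as mpicc is on PATH; it relies on the wrapper's --showme:compile and --showme:link options, and the function name show_wrapper_flags is hypothetical.

```shell
# Hypothetical helper: print the compile and link flags that an Open MPI
# wrapper compiler (default: mpicc) would pass to the back-end compiler.
show_wrapper_flags() {
    wrapper="${1:-mpicc}"
    # Bail out early if the wrapper is not installed or not on PATH.
    if ! command -v "$wrapper" >/dev/null 2>&1; then
        echo "wrapper '$wrapper' not found in PATH" >&2
        return 1
    fi
    echo "compile flags: $("$wrapper" --showme:compile)"
    echo "link flags: $("$wrapper" --showme:link)"
}
```

The printed flags can then be added by hand to an existing build system that must invoke the back-end compiler itself.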
+ +Note that Open MPI's wrapper compilers do not do any actual compiling +or linking; all they do is manipulate the command line and add in all +the relevant compiler / linker flags and then invoke the underlying +compiler / linker (hence, the name "wrapper" compiler). More +specifically, if you run into a compiler or linker error, check your +source code and/or back-end compiler -- it is usually not the fault of +the Open MPI wrapper compiler. + +3.4 Running Open MPI Applications: +---------------------------------- +Open MPI uses either the "mpirun" or "mpiexec" commands to launch +applications. If your cluster uses a resource manager (such as +SLURM), providing a hostfile is not necessary: + + > mpirun -np 4 my_mpi_application + +If you use rsh/ssh to launch applications, they must be set up to NOT +prompt for a password (see http://www.open-mpi.org/faq/?category=rsh +for more details on this topic). Moreover, you need to provide a +hostfile containing a list of hosts to run on. + +Example: + + > cat hostfile + host1.example.com + host2.example.com + host3.example.com + host4.example.com + + > mpirun -np 4 -hostfile hostfile my_mpi_application + (application runs on all 4 hosts) + +In the following examples, replace with the number of hosts to run on, +and with the filename of a valid hostfile listing the hosts +to run on (unless you are running under a supported resource manager, +in which case a hostfile is unnecessary). + +Also note that Open MPI is highly run-time tunable. There are many +options that can be tuned to obtain optimal performance of your MPI +applications (see the Open MPI web site / FAQ for more information: +http://www.open-mpi.org/faq/). 
+ + - is an integer indicating how many MPI processes to run (e.g., 2) + - is the filename of a hostfile, as described above + +Example 1: Running the OSU bandwidth benchmark: + + > cd /usr/mpi/gcc/openmpi-1.4.1/tests/osu_benchmarks-3.1.1 + > mpirun -np -hostfile osu_bw + +Example 2: Running the Intel MPI Benchmark benchmarks: + + > cd /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2 + > mpirun -np -hostfile IMB-MPI1 + + --> Note that the version of IMB-EXT that ships in this version of + OFED contains a bug that will cause it to immediately error + out when run with Open MPI. + +Example 3: Running the Presta benchmarks: + + > cd /usr/mpi/gcc/openmpi-1.4.1/tests/presta-1.4.0 + > mpirun -np -hostfile com -o 100 + +NOTE: In order to run Open MPI over a RoCE (RDMAoE) network, the following MCA + parameter is required: + --mca btl_openib_cpc_include rdmacm + + +3.5 More Open MPI Information +----------------------------- +Much, much more information is available about using and tuning Open +MPI (to include OpenFabrics-specific tunable parameters) on the Open +MPI web site FAQ: + + http://www.open-mpi.org/faq/ + +Users who cannot find the answers that they are looking for, or are +experiencing specific problems should consult the "how to get help" web +page for more information: + + http://www.open-mpi.org/community/help/ + + +=============================================================================== +4. MVAPICH2 MPI +=============================================================================== + +MVAPICH2 is an MPI-2 implementation which includes all MPI-1 features. +It is based on MPICH2 and MVICH. MVAPICH2 provides many features including +fault-tolerance with checkpoint-restart, RDMA_CM support, iWARP support, +optimized collectives, on-demand connection management, multi-core optimized +and scalable shared memory support, and memory hook with ptmalloc2 library +support. 
The ADI-3-level design of MVAPICH2 supports many features including: +MPI-2 functionalities (one-sided, collectives and data-type), multi-threading +and all MPI-1 functionalities. It also supports a wide range of platforms +(architecture, OS, compilers, InfiniBand adapters and iWARP adapters). More +information can be found on the MVAPICH2 project site: + +http://mvapich.cse.ohio-state.edu/overview/mvapich2/ + +A valid Fortran compiler must be present in order to build the MVAPICH2 +MPI stack and tests. The following compilers are supported by OFED's +MVAPICH2 MPI package: gcc, intel, pgi, and pathscale. The install script +prompts the user to choose the compiler with which to build the MVAPICH2 +MPI RPM. Note that more than one compiler can be selected simultaneously, +if desired. + +The install script prompts for various MVAPICH2 build options as detailed +below: + + +- Implementation (OFA or uDAPL) [default "OFA"] + - OFA (IB and iWARP) Options: + - ROMIO Support [default Y] + - Shared Library Support [default Y] + - Checkpoint-Restart Support [default N] + * requires an installation of BLCR and prompts for the + BLCR installation directory location + - uDAPL Options: + - ROMIO Support [default Y] + - Shared Library Support [default Y] + - Cluster Size [default "Small"] + - I/O Bus [default "PCI-Express"] + - Link Speed [default "SDR"] + - Default DAPL Provider [default ""] + * the default provider is determined based on detected OS + +For non-interactive builds where no MVAPICH2 build options are stored in +the OFED configuration file, the default settings are: + +Implementation: OFA +ROMIO Support: Y +Shared Library Support: Y +Checkpoint-Restart Support: N + + +4.1 Setting up for MVAPICH2 +--------------------------- +Selecting to use MVAPICH2 via the MPI selector tools will perform +most of the setup necessary to build and run MPI applications with +MVAPICH2. 
If one does not wish to use the MPI Selector tools, using +the following settings should be enough: + + - add /bin to PATH + +The above is the directory where the desired MVAPICH2 +instance was installed ("instance" refers to the path based on +the RPM package name, including the compiler chosen during the +install). It is also possible to source the following files +in order to setup the proper environment: + +source /bin/mpivars.sh [for Bourne based shells] +source /bin/mpivars.csh [for C based shells] + +In addition to the user environment settings handled by the MPI selector +tools, some other system settings might need to be modified. MVAPICH2 +requires the memlock resource limit to be modified from the default +in /etc/security/limits.conf: + +* soft memlock unlimited + +MVAPICH2 requires bidirectional rsh or ssh without a password to work. +The default is ssh, and in this case it will be required to add the +following line to the /etc/init.d/sshd script before sshd is started: + +ulimit -l unlimited + +It is also possible to specify a specific size in kilobytes instead +of unlimited if desired. + +The MVAPICH2 OFA build requires an /etc/mv2.conf file specifying the +IP address of an Infiniband HCA (IPoIB) for RDMA-CM functionality +or the IP address of an iWARP adapter for iWARP functionality if +either of those are desired. This is not required by default, unless +either of the following runtime environment variables are set when +using the OFA MVAPICH2 build: + +RDMA-CM +------- +MV2_USE_RDMA_CM=1 + +iWARP +----- +MV2_USE_IWARP_MODE=1 + +Otherwise, the OFA build will work without an /etc/mv2.conf file using +only the Infiniband HCA directly. + +The MVAPICH2 uDAPL build requires an /etc/dat.conf file specifying the +DAPL provider information. 
The default DAPL provider is chosen at +build time, with a default value of "ib0"; however, it can also be +specified at runtime by setting the following environment variable: + +MV2_DEFAULT_DAPL_PROVIDER= + +More information about MVAPICH2 can be found in the MVAPICH2 User Guide: + +http://mvapich.cse.ohio-state.edu/support/ + + +4.2 Compiling MVAPICH2 Applications +----------------------------------- +The MVAPICH2 compiler commands for each language are: + +Language Compiler Command +-------- ---------------- +C mpicc +C++ mpicxx +Fortran 77 mpif77 +Fortran 90 mpif90 + +The system compiler commands should not be used directly. The Fortran 90 +compiler command only exists if a Fortran 90 compiler was used during the +build process. + + +4.3 Running MVAPICH2 Applications +--------------------------------- +4.3.1 Running MVAPICH2 Applications with mpirun_rsh +--------------------------------------------------- +From release 1.2, MVAPICH2 comes with a faster and more scalable startup based +on mpirun_rsh. To launch an MPI job with mpirun_rsh, password-less ssh needs to +be enabled across all nodes. + +Note: ssh will be used by default. In order to use rsh, use the -rsh option on +the mpirun_rsh command line. For more options, see mpirun_rsh -help or the +MVAPICH2 user guide. 
+ +*** Running 4 processes on 4 nodes *** + +$ cat > hostfile +node1 +node2 +node3 +node4 +$ mpirun_rsh -np 4 -hostfile hostfile /path/to/my_mpi_app + +*** Running OSU tests *** + +/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.1.1/osu_bw +/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.1.1/osu_latency +/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.1.1/osu_bibw +/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.1.1/osu_bcast + +*** Running Intel MPI Benchmark test (Full test) *** + +/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/IMB-3.2/IMB-MPI1 + +*** Running Presta test *** + +/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/presta-1.4.0/com -o 100 +/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/presta-1.4.0/glob -o 100 +/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/presta-1.4.0/globalop + +4.3.2 Running MVAPICH2 Applications with mpd and mpiexec +-------------------------------------------------------- +Launching processes in MVAPICH2 is a two step process. First, mpdboot must +be used to launch MPD daemons on the desired hosts. Second, the mpiexec +command is used to launch the processes. MVAPICH2 requires bidirectional +ssh or rsh without a password. This is specified when the MPD daemons are +launched with the mpdboot command through the --rsh command line option. +The default is ssh. Once the processes are finished, stopping the MPD +daemons with the mpdallexit command should be done. 
The following example +shows the basic procedure: + +4 Processes on 4 Hosts Example: + +$ cat >hostsfile +node1.example.com +node2.example.com +node3.example.com +node4.example.com + +$ mpdboot -n 4 -f ./hostsfile + +$ mpiexec -n 4 ./my_mpi_application + +$ mpdallexit + +It is also possible to use the mpirun command in place of mpiexec. They are +actually the same command in MVAPICH2, however using mpiexec is preferred. + +It is possible to run more processes than hosts. In this case, multiple +processes will run on some or all of the hosts used. The following examples +demonstrate how to run the MPI tests. The default installation prefix and +gcc version of MVAPICH2 are shown. In each case, it is assumed that a hosts +file has been created in the specific directory with two hosts. + +OSU Tests Example: + +$ cd /usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.1.1 +$ mpdboot -n 2 -f ./hosts +$ mpiexec -n 2 ./osu_bcast +$ mpiexec -n 2 ./osu_bibw +$ mpiexec -n 2 ./osu_bw +$ mpiexec -n 2 ./osu_latency +$ mpdallexit + +Intel MPI Benchmark Example: + +$ cd /usr/mpi/gcc/mvapich2-1.2p1/tests/IMB-3.2 +$ mpdboot -n 2 -f ./hosts +$ mpiexec -n 2 ./IMB-MPI1 +$ mpdallexit + +Presta Benchmarks Example: + +$ cd /usr/mpi/gcc/mvapich2-1.2p1/tests/presta-1.4.0 +$ mpdboot -n 2 -f ./hosts +$ mpiexec -n 2 ./com -o 100 +$ mpiexec -n 2 ./glob -o 100 +$ mpiexec -n 2 ./globalop +$ mpdallexit diff --git a/readme_and_howto/MSTFLINT_README.txt b/readme_and_howto/MSTFLINT_README.txt new file mode 100644 index 0000000..86d3e02 --- /dev/null +++ b/readme_and_howto/MSTFLINT_README.txt @@ -0,0 +1,171 @@ +Mellanox Technologies - www.mellanox.com +**************************************** + +MSTFLINT Package - Firmware Burning and Diagnostics Tools + +1) Overview + This package contains a burning tool and diagnostic tools for Mellanox + manufactured HCA/NIC cards. It also provides access to the relevant source + code. Please see the file LICENSE for licensing details. 
+ This package is based on a subset of the Mellanox Firmware Tools (MFT) package. + For a full documentation of the MFT package, please refer to the downloads page + in Mellanox web site. + + ---------------------------------------------------------------------------- + NOTE: + This burning tool should be used only with Mellanox-manufactured + HCA/NIC cards. Using it with cards manufactured by other vendors + may be harmful to the cards (due to different configurations). + Using the diagnostic tools is normally safe for all HCAs/NICs. + ---------------------------------------------------------------------------- + +2) Package Contents + a) mstflint source code + b) mflash lib + This lib provides low level Flash access through Mellanox HCAs. + c) mtcr lib (implemented in mtcr.h file) + This lib enables access to HCA hardware registers. + d) mstregdump utility + This utility dumps hardware registers from Mellanox hardware + for later analysis by Mellanox. + e) mstvpd + This utility dumps the on-card VPD. + f) mstmcra + This debug utility reads a single word from the device configuration space. + +3) Installation + a) Build the mstflint utility. This package is built using a standard + autotools method. + + Example: + > ./configure + > make + > make install + + - Run "configure --help" for custom configuration options. + - Typically, root privileges are required to run "make install" + +4) Hardware Access Device Names + The tools in this package require a device name in the command + line. The device name is the identifier of the target CA. + This section describes the device name formats and the HW access flow. + + a) The devices can be accessed by their PCI ID as displayed by lspci + (bus:dev.fn). 
+ Example: + # List all Mellanox devices + > /sbin/lspci -d 15b3: + 02:00.0 Ethernet controller: Mellanox Technologies MT25448 [ConnectX EN 10GigE, PCIe 2.0 2.5GT/s] (rev a0) + + # Use mstflint tool to query the firmware on this device + > mstflint -d 02:00.0 q + + b) When the IB driver (mlx4 or mthca) is loaded, the devices can be accessed + by their IB device name. + Example: + # List the IB devices + > ibv_devinfo | grep hca_id + hca_id: mlx4_0 + + # Use mstvpd tool to dump the VPD of this device + > mstvpd mlx4_0 + + c) PCI configuration access + In examples a and b above, the device is accessed via PCI Memory Mapping. + The device can also be accessed by PCI configuration cycles. + PCI configuration access is slower and less safe than memory access -- + use it only if methods a and b above do not work. + + To force configuration access, use device names in the following format: + /proc/bus/pci// + + Example: + # List all Mellanox devices + > /sbin/lspci -d 15b3: + 02:00.0 Ethernet controller: Mellanox Technologies MT25448 [ConnectX EN 10GigE, PCIe 2.0 2.5GT/s] (rev a0) + + # Use mstregdump to dump HW registers, using PCI config cycles + > mstregdump /proc/bus/pci/02/00.0 > crdump.log + + Note: Typically, you will need root privileges for hardware access + + d) Accessing a multi-function device: + + In some configuration, the CA device identifies as a multi-function device on PCI. 
E.G.: + > /sbin/lspci -d 15b3: + 07:00.0 Ethernet controller: Mellanox Technologies MT26468 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0) + 07:00.1 Ethernet controller: Mellanox Technologies MT26468 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0) + 07:00.2 Ethernet controller: Mellanox Technologies MT26468 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0) + 07:00.3 Ethernet controller: Mellanox Technologies MT26468 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0) + 07:00.4 Ethernet controller: Mellanox Technologies MT26468 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0) + 07:00.5 Ethernet controller: Mellanox Technologies MT26468 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0) + 07:00.6 Ethernet controller: Mellanox Technologies MT26468 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0) + 07:00.7 Ethernet controller: Mellanox Technologies MT26468 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0) + + These multiple "logical" devices are actually a single physical device, so firmware update or "physical" + diagnostics should be run only on one of the functions. + + When the device driver is loaded, only the primary physical function of the device can be accessed. + In Linux that would typically be function 0. This function can be accessed using memory mapping, as + described in subsection a) above. E.G.: + > mstflint -d 07:00.0 q + + When the device driver is not loaded, all the functions can be accessed using configuration cycles, as + described in subsection c) above. It is recommended to use function 0 for FW update or diagnostics, E.G.: + > mstflint -d /proc/bus/pci/07/00.0 q + +5) Usage (mstflint): + Read mstflint usage. Enter "./mstflint -h" for a short help message, or + "./mstflint -hh" for a detailed help message. + + Obtaining firmware files: + If you purchased your card from Mellanox Technologies, please use the + Mellanox website (www.mellanox.com, under 'Firmware' downloads) to + download the firmware for your card. 
+ If you purchased your card from a vendor other than Mellanox, get a + specific firmware configuration (INI) file from your HCA card vendor and + generate the binary image. + + Use mstflint to burn a device according to the burning instructions in + "mstflint -hh" and in Mellanox web site firmware page. + +6) Usage (mstregdump): + An internal register dump is displayed to the standard output. + Please store it in a file for analysis by Mellanox. + + Example: + > mstregdump mthca0 > dumpfile + +7) Usage (mstvpd): + A VPD dump is displayed to the standard output. + A list of keywords to dump can be supplied after the -- flag + to apply an output filter. + + Examples: + > mstvpd mlx4_0 + ID: Hawk Dual Port + PN: MNPH29C-XTR + EC: X2 + SN: MT1001X00749 + V0: PCIe Gen2 x8 + V1: N/A + YA: N/A + RW: + + > mstvpd mlx4_0 -- PN ID + PN: MNPH29C-XTR + ID: Hawk Dual Port + +8) Problem Reporting: + Please collect the following information when reporting issues: + + uname -a + cat /etc/issue + cat /proc/bus/pci/devices + mstflint -vv + lspci + mstflint -d 02:00.0 v + mstflint -d 02:00.0 q + mstvpd 02:00.0 + + diff --git a/readme_and_howto/PERF_TEST_README.txt b/readme_and_howto/PERF_TEST_README.txt new file mode 100644 index 0000000..af8b119 --- /dev/null +++ b/readme_and_howto/PERF_TEST_README.txt @@ -0,0 +1,146 @@ + Open Fabrics Enterprise Distribution (OFED) + Performance Tests README for OFED 1.5 + + December 2010 + + + +=============================================================================== +Table of Contents +=============================================================================== +1. Overview +2. Notes on Testing Methodology +3. Test Descriptions +4. Running Tests + +=============================================================================== +1. Overview +=============================================================================== +This is a collection of tests written over uverbs intended for use as a +performance micro-benchmark. 
As an example, the tests can be used for +HW or SW tuning and/or functional testing. + +The collection contains a set of BW and latency benchmarks, such as: + + * Read - ib_read_bw and ib_read_lat. + * Write - ib_write_bw, ib_write_bw_postlist and ib_write_lat. + * Send - ib_send_bw and ib_send_lat. + * RDMA - rdma_bw and rdma_lat. + * Additional benchmark: ib_clock_test. + +Please post results/observations/bugs/remarks to the mailing list specified below: + * Maintainer - idos@dev.mellanox.co.il + * OFED mailing list - ewg@lists.openfabrics.org + or linux-rdma@vger.kernel.org + * http://openib.org/mailman/listinfo/openib-general + +=============================================================================== +2. Notes on Testing Methodology +=============================================================================== +The benchmarks specified below are tested on the following architectures: +- i686 +- x86_64 +- ia64 + +- The benchmark uses the CPU cycle counter to get time stamps without a context + switch. Some CPU architectures (e.g., Intel's 80486 or older PPC) do NOT + have such capability. + +- The benchmark measures round-trip time but reports half of that as one-way + latency. Thus, it may not be sufficiently accurate for asymmetrical + configurations. + +- On BW benchmarks, the BW is calculated on the send side only, as it calculates + the BW after collecting completion from the receive side. + If using the bidirectional flag, BW is calculated on both sides. + +- Min/Median/Max result is reported. + The median (vs average) is less sensitive to extreme scores. + Typically, the "Max" value is the first value measured. + +- Larger samples help only marginally. The default (1000) is sufficient. + Note that an array of cycles_t (typically unsigned long) is allocated + once to collect samples and again to store the difference between them. + Large sample sizes (e.g., 1 million) might expose other problems + with the program. 
+ +- The "-H" option will dump the histogram for additional statistical analysis. + See xgraph, ygraph, r-base (http://www.r-project.org/), pspp, or other + statistical math programs. + +=============================================================================== +3. Test Descriptions +=============================================================================== + +rdma_lat.c latency test with RDMA write transactions +rdma_bw.c streaming BW test with RDMA write transactions + + +The following tests are mainly useful for HW/SW benchmarking. +They are not intended as actual usage examples. + +send_lat.c latency test with send transactions +send_bw.c BW test with send transactions +write_lat.c latency test with RDMA write transactions +write_bw.c BW test with RDMA write transactions +read_lat.c latency test with RDMA read transactions +read_bw.c BW test with RDMA read transactions + +The executable name of each test starts with the general prefix "ib_", +e.g., ib_write_lat, except for the RDMA tests, whose executables +have the same name as the source file without the .c suffix. + +Running Tests +------------- + +Prerequisites: + kernel 2.6 + ib_uverbs (kernel module) matches libibverbs + ("match" means binary compatible, but ideally of the same SVN rev) + +Server: ./ +Client: ./ + + o is IPv4 or IPv6 address. You can use the IPoIB + address if IPoIB is configured. + o --help lists the available + + *** IMPORTANT NOTE: The SAME OPTIONS must be passed to both server and client. + + +Common Options to all tests: + -p, --port= Listen on/connect to port (default: 18515). + -m, --mtu= MTU size (default: 1024). + -d, --ib-dev= Use IB device (default: first device found). + -i, --ib-port= Use port of IB device (default: 1). + -o, --out= Number of outstanding reads. Only in READ. + -q, --qp= Number of QPs to perform. Only in write_bw. + -c, --connection= Connection type: RC, UC, UD according to spec. + -g, --mcg= Number of QPs in MultiCast group. 
In SEND only. + -M, --MGID= as the group MGID in format '255:1:X:X:X:X:X:X:X:X:X:X:X:X:X:X'. + -s, --size= Size of message to exchange (default: 1). + -a, --all Run sizes from 2 till 2^23. + -t, --tx-depth= Size of tx queue (default: 50). + -r, --rx-depth= Make rx queue bigger than tx (default 600). + -n, --iters= Number of exchanges (at least 100, default: 1000). + -I, --inline_size= Max size of message to be sent in inline mode. + On BW tests the default is 1; on latency tests it is 400. + -C, --report-cycles Report times in CPU cycle units. + -u, --qp-timeout= QP timeout, timeout value is 4 usec*2 ^(timeout). + Default is 14. + -S, --sl= SL (default 0). + -H, --report-histogram Print out all results (Default: summary only). + Only on latency tests. + -x, --gid-index= Test uses GID with GID index taken from command + line (for RDMAoE index should be 0). + -b, --bidirectional Measure bidirectional bandwidth (default uni). + On BW tests only (Implicit on latency tests). + -V, --version Display version number. + -e, --events Sleep on CQ events (default poll). + -N, --no peak-bw Cancel peak-bw calculation (default with peak-bw) + -F, --CPU-freq Do not fail even if cpufreq_ondemand module is loaded. + + *** IMPORTANT NOTE: You need to be running a Subnet Manager on the switch or + on one of the nodes in your fabric. + + diff --git a/readme_and_howto/QLOGIC_VNIC_README.txt b/readme_and_howto/QLOGIC_VNIC_README.txt new file mode 100644 index 0000000..9e5a75e --- /dev/null +++ b/readme_and_howto/QLOGIC_VNIC_README.txt @@ -0,0 +1,642 @@ +This is a release of the QLogic VNIC driver on OFED 1.4. This driver is +currently supported on Intel x86 32 and 64 bit machines. +Supported OS are: +- RHEL 4 Update 4. +- RHEL 4 Update 5. +- RHEL 4 Update 6. +- SLES 10. +- SLES 10 Service Pack 1. +- SLES 10 Service Pack 1 Update 1. +- SLES 10 Service Pack 2. +- RHEL 5. +- RHEL 5 Update 1. +- RHEL 5 Update 2. +- vanilla 2.6.27 kernel. 
+ +The VNIC driver in conjunction with the QLogic Ethernet Virtual I/O Controller +(EVIC) provides Ethernet interfaces on a host with IB HCA(s) without the need +for any physical Ethernet NIC. + +This file describes the use of the QLogic VNIC ULP service on an OFED stack +and covers the following points: + +A) Creating QLogic VNIC interfaces +B) Discovering VEx/EVIC IOCs present on the fabric using ib_qlgc_vnic_query +C) Starting the QLogic VNIC driver and the VNIC interfaces +D) Assigning IP addresses etc for the QLogic VNIC interfaces +E) Information about the QLogic VNIC interfaces +F) Deleting a specific QLogic VNIC interface +G) Forced Failover feature for QLogic VNIC. +H) Infiniband Quality of Service for VNIC. +I) QLogic VNIC Dynamic Update Daemon Tool and Hot Swap support +J) Information about creating VLAN interfaces +K) Information about enabling IB Multicast for QLogic VNIC interface +L) Basic Troubleshooting + +A) Creating QLogic VNIC interfaces + +The VNIC interfaces are created from a configuration file, +which must be placed at /etc/infiniband/qlgc_vnic.cfg. + +Please take a look at the /etc/infiniband/qlgc_vnic.cfg.sample file (also available +as part of the documentation) to see how VNIC configuration files are written. +You can use this sample file as the basis for creating a VNIC configuration +file by copying it to /etc/infiniband/qlgc_vnic.cfg. You will have to +replace the IOCGUID, IOCSTRING values etc in the sample configuration file +with those of the EVIC IOCs present on your fabric. + +(For backward compatibility, if this file is missing, +/etc/infiniband/qlogic_vnic.cfg or /etc/sysconfig/ics_inic.cfg +will be used for configuration) + +Please note that using the DGID of the EVIC/VEx IOC is +recommended as it will ensure the quickest startup of the +VNIC service. If DGID is specified then you must also +specify the IOCGUID. More details can be found in +the qlgc_vnic.cfg.sample file. 
+ +In case of a host consisting of more than 1 HCAs plugged in, VNIC +interfaces can be configured based on HCA no and Port No or PORTGUID. + +B) Discovering EVIC/VEx IOCs present on the fabric using ib_qlgc_vnic_query + +For writing the configuration file, you will need information +about the EVIC/VEx IOCs present on the fabric like their IOCGUID, +IOCSTRING etc. The ib_qlgc_vnic_query tool should be used to get this +information. + +When ib_qlgc_vnic_query is executed without any options, it scans through ALL +active IB ports on the host and obtains the detailed information about all the +EVIC/VEx IOCs reachable through each active IB port: + +# ib_qlgc_vnic_query + +HCA No = 0, HCA = mlx4_0, Port = 1, Port GUID = 0x0002c903000010f5, State = Active + + IO Unit Info: + port LID: 0008 + port GID: fe8000000000000000066a11de000070 + change ID: 0003 + max controllers: 0x02 + + + controller[ 1] + GUID: 00066a01de000070 + vendor ID: 00066a + device ID: 000030 + IO class : 2000 + ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1 + service entries: 2 + service[ 0]: 1000066a00000001 / InfiniNIC.InfiniConSys.Control:01 + service[ 1]: 1000066a00000101 / InfiniNIC.InfiniConSys.Data:01 + + IO Unit Info: + port LID: 0009 + port GID: fe8000000000000000066a21de000070 + change ID: 0003 + max controllers: 0x02 + + + controller[ 2] + GUID: 00066a02de000070 + vendor ID: 00066a + device ID: 000030 + IO class : 2000 + ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2 + service entries: 2 + service[ 0]: 1000066a00000002 / InfiniNIC.InfiniConSys.Control:02 + service[ 1]: 1000066a00000102 / InfiniNIC.InfiniConSys.Data:02 + +HCA No = 0, HCA = mlx4_0, Port = 2, Port GUID = 0x0002c903000010f6, State = Active + + IO Unit Info: + port LID: 0008 + port GID: fe8000000000000000066a11de000070 + change ID: 0003 + max controllers: 0x02 + + + controller[ 1] + GUID: 00066a01de000070 + vendor ID: 00066a + device ID: 000030 + IO class : 2000 + ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 
1 + service entries: 2 + service[ 0]: 1000066a00000001 / InfiniNIC.InfiniConSys.Control:01 + service[ 1]: 1000066a00000101 / InfiniNIC.InfiniConSys.Data:01 + + IO Unit Info: + port LID: 0009 + port GID: fe8000000000000000066a21de000070 + change ID: 0003 + max controllers: 0x02 + + + controller[ 2] + GUID: 00066a02de000070 + vendor ID: 00066a + device ID: 000030 + IO class : 2000 + ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2 + service entries: 2 + service[ 0]: 1000066a00000002 / InfiniNIC.InfiniConSys.Control:02 + service[ 1]: 1000066a00000102 / InfiniNIC.InfiniConSys.Data:02 + +HCA No = 1, HCA = mlx4_1, Port = 1, Port GUID = 0x0002c90300000785, State = Down + + Port State is Down. Skipping search of DM nodes on this port. + +HCA No = 1, HCA = mlx4_1, Port = 2, Port GUID = 0x0002c90300000786, State = Active + + IO Unit Info: + port LID: 0008 + port GID: fe8000000000000000066a11de000070 + change ID: 0003 + max controllers: 0x02 + + + controller[ 1] + GUID: 00066a01de000070 + vendor ID: 00066a + device ID: 000030 + IO class : 2000 + ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1 + service entries: 2 + service[ 0]: 1000066a00000001 / InfiniNIC.InfiniConSys.Control:01 + service[ 1]: 1000066a00000101 / InfiniNIC.InfiniConSys.Data:01 + + IO Unit Info: + port LID: 0009 + port GID: fe8000000000000000066a21de000070 + change ID: 0003 + max controllers: 0x02 + + + controller[ 2] + GUID: 00066a02de000070 + vendor ID: 00066a + device ID: 000030 + IO class : 2000 + ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2 + service entries: 2 + service[ 0]: 1000066a00000002 / InfiniNIC.InfiniConSys.Control:02 + service[ 1]: 1000066a00000102 / InfiniNIC.InfiniConSys.Data:02 + +This is meant to help the network administrator to know about HCA/Port information +on host along with EVIC IOCs reachable through given IB ports on fabric. 
When +ib_qlgc_vnic_query is run with -e option, it reports the IOCGUID information +and with -s option it reports the IOCSTRING information for the EVIC/VEx IOCs +present on the fabric: + +# ib_qlgc_vnic_query -e + +HCA No = 0, HCA = mlx4_0, Port = 1, Port GUID = 0x0002c903000010f5, State = Active + + ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff + ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff +HCA No = 0, HCA = mlx4_0, Port = 2, Port GUID = 0x0002c903000010f6, State = Active + + ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff + ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff +HCA No = 1, HCA = mlx4_1, Port = 1, Port GUID = 0x0002c90300000785, State = Down + + Port State is Down. Skipping search of DM nodes on this port. + +HCA No = 1, HCA = mlx4_1, Port = 2, Port GUID = 0x0002c90300000786, State = Active + + ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff + ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff + +# ib_qlgc_vnic_query -s + +HCA No = 0, HCA = mlx4_0, Port = 1, Port GUID = 0x0002c903000010f5, State = Active + +"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1" +"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2" +HCA No = 0, HCA = mlx4_0, Port = 2, Port GUID = 0x0002c903000010f6, State = Active + +"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1" +"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2" +HCA No = 1, HCA = mlx4_1, Port = 1, Port GUID = 0x0002c90300000785, State = Down + + Port State is Down. Skipping search of DM nodes on this port. 
+ +HCA No = 1, HCA = mlx4_1, Port = 2, Port GUID = 0x0002c90300000786, State = Active + +"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1" +"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2" + +# ib_qlgc_vnic_query -es + +HCA No = 0, HCA = mlx4_0, Port = 1, Port GUID = 0x0002c903000010f5, State = Active + + ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1" + ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2" +HCA No = 0, HCA = mlx4_0, Port = 2, Port GUID = 0x0002c903000010f6, State = Active + + ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1" + ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2" +HCA No = 1, HCA = mlx4_1, Port = 1, Port GUID = 0x0002c90300000785, State = Down + + Port State is Down. Skipping search of DM nodes on this port. 
+ +HCA No = 1, HCA = mlx4_1, Port = 2, Port GUID = 0x0002c90300000786, State = Active + + ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1" + ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2" + +ib_qlgc_vnic_query can be used to discover EVIC IOCs on the fabric based on +umad device, HCA no/Port no and PORTGUID as follows: + +For umad devices, it takes the name of the umad device mentioned with '-d' +option: + +# ib_qlgc_vnic_query -es -d /dev/infiniband/umad0 + +HCA No = 0, HCA = mlx4_0, Port = 1 + + ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1" + ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2" + +If the name of the HCA and its port no is known, then ib_qlgc_vnic_query can +make use of this information to discover EVIC IOCs on the fabric. HCA name +and port no is specified with '-C' and '-P' options respectively. + +# ib_qlgc_vnic_query -es -C mlx4_1 -P 2 + + ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1" + ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2" + +In case, if HCA name is not specified but port no is specified, HCA 0 is +selected as default HCA to discover IOCs and if Port no is missing then, +Port 1 of HCA name mentioned is used to discover the IOCs. If both are +missing, the behaviour is default and ib_qlgc_vnic_query will scan all the +IB ports on the host to discover IOCs reachable through each one of them. 
+ +PORTGUID information about the IB ports on given host can be obtained using +the option '-L': + +# ib_qlgc_vnic_query -L + +0,mlx4_0,1,0x0002c903000010f5 +0,mlx4_0,2,0x0002c903000010f6 +1,mlx4_1,1,0x0002c90300000785 +1,mlx4_1,2,0x0002c90300000786 + +This actually lists different configurable parameters of IB ports present on +given host in the order: HCA No, HCA Name, Port No, PORTGUID separated by +commas. PORTGUID value obtained thus, can be used to discover EVIC IOCs +reachable through it using '-G' option as follows: + +# ib_qlgc_vnic_query -es -G 0x0002c903000010f5 + +HCA No = 0, HCA = mlx4_0, Port = 1, Port GUID = 0x0002c903000010f5, State = Active + + ioc_guid=00066a01de000070,dgid=fe8000000000000000066a11de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1" + ioc_guid=00066a02de000070,dgid=fe8000000000000000066a21de000070,pkey=ffff,"EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2" + +C) Starting the QLogic VNIC driver and the QLogic VNIC interfaces + +To start the QLogic VNIC service as a part of startup of OFED stack, set + +QLGC_VNIC_LOAD=yes + +in /etc/infiniband/openib.conf file. With this actually, the QLogic VNIC +service will also be stopped when the OFED stack is stopped. Also, if OFED +stack has been marked to start on boot, QLogic VNIC service will also start +on boot. + +The rest of the discussion in this subsection C) is valid only if + +QLGC_VNIC_LOAD=no + +is set into /etc/infiniband/openib.conf. 
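Before relying on the manual start/stop commands described in the rest of this subsection, it can help to confirm which value QLGC_VNIC_LOAD actually has. The following is a minimal sketch only: the helper name qlgc_vnic_load_setting is hypothetical and not part of OFED, and it assumes the file uses plain KEY=value lines with the last assignment taking effect.

```shell
# Hypothetical helper (not part of OFED): report the QLGC_VNIC_LOAD value
# found in an openib.conf-style file. The last assignment in the file wins;
# a missing file or a missing key is reported as "unset".
qlgc_vnic_load_setting() {
    conf="${1:-/etc/infiniband/openib.conf}"
    # Print only the value part of every QLGC_VNIC_LOAD= line, keep the last.
    value=$(sed -n 's/^[[:space:]]*QLGC_VNIC_LOAD=\([^[:space:]#]*\).*$/\1/p' \
        "$conf" 2>/dev/null | tail -n 1)
    echo "${value:-unset}"
}
```

Calling qlgc_vnic_load_setting with no argument reads /etc/infiniband/openib.conf; passing a path reads that file instead.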
+ +Once you have created a configuration file, you can start the VNIC driver +and create the VNIC interfaces specified in the configuration file with: + +#/sbin/service qlgc_vnic start + +You can stop the VNIC driver and bring down the VNIC interfaces with + +#/sbin/service qlgc_vnic stop + +To restart the QLogic VNIC driver, you can use + +#/sbin/service qlgc_vnic restart + +If you have not started the Infiniband network stack (Infinipath or OFED), +then running "/sbin/service qlgc_vnic start" command will also cause the +Infiniband network stack to be started since the QLogic VNIC service requires +the Infiniband stack. + +On the other hand if you start the Infiniband network stack separately, then +the correct order of starting is: + +- Start the Infiniband stack +- Start QLogic VNIC service + +For example, if you use OFED, correct order of starting is: + +/sbin/service openibd start +/sbin/service qlgc_vnic start + +Correct order of stopping is: + +- Stop QLogic VNIC service +- Stop the Infiniband stack + +For example, if you use OFED, correct order of stopping is: + +/sbin/service qlgc_vnic stop +/sbin/service openibd stop + +If you try to stop the Infiniband stack when the QLogic VNIC service is +running, +you will get an error message that some of the modules of the Infiniband stack +are in use by the QLogic VNIC service. Also, any QLogic VNIC interfaces that +you +created are removed (because stopping the Infiniband network stack causes the +HCA +driver to be unloaded which is required for the VNIC interfaces to be +present). +In this case, do the following: + + 1. Stop the QLogic VNIC service with "/sbin/service qlgc_vnic stop" + + 2. Stop the Infiniband stack again. + + 3. If you want to restart the QLogic VNIC interfaces, use + "/sbin/service qlgc_vnic start". 
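The start and stop ordering described above can be captured in two small wrapper functions. This is a sketch under stated assumptions: the function names and the SERVICE variable are hypothetical and exist only so that the service command can be replaced (for example with "echo") for a dry run. The service names openibd and qlgc_vnic are the ones used in this section.

```shell
# SERVICE is an assumption made for this sketch so that the command can be
# overridden (for example with "echo") for a dry run.
SERVICE="${SERVICE:-/sbin/service}"

# Start the InfiniBand stack first, then the QLogic VNIC service.
vnic_stack_start() {
    "$SERVICE" openibd start && "$SERVICE" qlgc_vnic start
}

# Stop in the reverse order: the VNIC service first, then the stack.
vnic_stack_stop() {
    "$SERVICE" qlgc_vnic stop && "$SERVICE" openibd stop
}
```

Because each function chains its two commands with &&, the second step runs only if the first one succeeds.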
+ + +D) Assigning IP addresses etc for the QLogic VNIC interfaces + +This can be done with ifconfig or by setting up the ifcfg-XXX (ifcfg-veth0 for +an interface named veth0 etc) network files for the corresponding VNIC interfaces. + +E) Information about the QLogic VNIC interfaces + +Information about VNIC interfaces on a given host can be obtained using a +script "ib_qlgc_vnic_info" :- + +# ib_qlgc_vnic_info + +VNIC Interface : eioc0 + VNIC State : VNIC_REGISTERED + Current Path : primary path + Receive Checksum : true + Transmit checksum : true + + Primary Path : + VIPORT State : VIPORT_CONNECTED + Link State : LINK_IDLING + HCA Info. : vnic-mthca0-1 + Heartbeat : 100 + IOC String : EVIC in Chassis 0x00066a00db000010, Slot 4, Ioc 1 + IOC GUID : 66a01de000037 + DGID : fe8000000000000000066a11de000037 + P Key : ffff + + Secondary Path : + VIPORT State : VIPORT_DISCONNECTED + Link State : INVALID STATE + HCA Info. : vnic-mthca0-2 + Heartbeat : 100 + IOC String : + IOC GUID : 66a01de000037 + DGID : 00000000000000000000000000000000 + P Key : 0 + +This information is collected from /sys/class/infiniband_qlgc_vnic/interfaces/ +directory under which there is a separate directory corresponding to each +VNIC interface. + +F) Deleting a specific QLogic VNIC interface + +VNIC interfaces can be deleted by writing the name of the interface to +the /sys/class/infiniband_qlgc_vnic/interfaces/delete_vnic file. + +For example to delete interface veth0 + +echo -n veth0 > /sys/class/infiniband_qlgc_vnic/interfaces/delete_vnic + +G) Forced Failover feature for QLogic VNIC. + +VNIC interfaces, when configured with failover configuration, can be +forced to failover to use other active path. 
For example, if VNIC interface +"veth1" is configured with failover configuration, then to switch to the other +path, use the command: + +echo -n veth1 > /sys/class/infiniband_qlgc_vnic/interfaces/force_failover + +This makes VNIC interface veth1 switch to the other active path, even if +the path in use before the forced failover operation is not in the +disconnected state. + +This feature allows the network administrator to control the path of the +VNIC traffic at run time; neither reconfiguration nor a restart of the VNIC +service is required. + +Once enabled as mentioned above, forced failover can be cleared with +the unfailover command: + +echo -n veth1 > /sys/class/infiniband_qlgc_vnic/interfaces/unfailover + +This clears the forced failover on VNIC interface "veth1". Once cleared, +if module parameter "default_prefer_primary" is set to 1, then the VNIC +interface switches back to the primary path. If module parameter +"default_prefer_primary" is set to 0, then the VNIC interface continues to +use its current active path. + +Forced failover thus takes priority over default_prefer_primary, and the +default_prefer_primary feature will not be active unless the forced +failover is cleared through "unfailover". + +Besides this forced failover, the QLogic VNIC service retains its +original failover feature, which is triggered when the current active +path gets disconnected. + +H) Infiniband Quality of Service for VNIC:- + +To enforce InfiniBand Quality of Service (QoS) for the VNIC protocol, no +configuration is required on the host side. The service level for the +VNIC protocol can be configured using service ID or target port guid +in the "qos-ulps" section of /etc/opensm/qos-policy.conf on the host +running OpenSM. 
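As an illustration only, a "qos-ulps" section that assigns a service level to VNIC traffic by service ID might look like the fragment below. The service IDs are the ones shown in the ib_qlgc_vnic_query output in this subsection. The SL values (0 and 2) are arbitrary placeholders chosen for this sketch and must be replaced with values that fit the fabric's own policy.

```
qos-ulps
    default                              : 0   # SL for everything else (placeholder)
    any, service-id 0x1000066a00000001   : 2   # VNIC control service, IOC 1 (placeholder SL)
    any, service-id 0x1000066a00000101   : 2   # VNIC data service, IOC 1 (placeholder SL)
end-qos-ulps
```

A rule keyed on target port guid would follow the same pattern, using the "any, target-port-guid" keyword in place of "any, service-id".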
+ +Service IDs for the EVIC IO controllers can be obtained from the output +of ib_qlgc_vnic_query: + +HCA No = 1, HCA = mlx4_1, Port = 2, Port GUID = 0x0002c90300000786, State = Active + + IO Unit Info: + port LID: 0008 + port GID: fe8000000000000000066a11de000070 + change ID: 0003 + max controllers: 0x02 + + + controller[ 1] + GUID: 00066a01de000070 + vendor ID: 00066a + device ID: 000030 + IO class : 2000 + ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 1 + service entries: 2 +------> service[ 0]: 1000066a00000001 / InfiniNIC.InfiniConSys.Control:01 +------> service[ 1]: 1000066a00000101 / InfiniNIC.InfiniConSys.Data:01 + + IO Unit Info: + port LID: 0009 + port GID: fe8000000000000000066a21de000070 + change ID: 0003 + max controllers: 0x02 + + + controller[ 2] + GUID: 00066a02de000070 + vendor ID: 00066a + device ID: 000030 + IO class : 2000 + ID: EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2 + service entries: 2 +------> service[ 0]: 1000066a00000002 / InfiniNIC.InfiniConSys.Control:02 +------> service[ 1]: 1000066a00000102 / InfiniNIC.InfiniConSys.Data:02 + +Numbers 1000066a00000002, 1000066a00000102 are the required service IDs. + +Finer control on quality of service for the VNIC protocol can be achieved by +configuring the service level using target port guid values of the EVIC IO +controllers. Target port guid values for the EVIC IO controllers can be +obtained using "saquery" command supplied by OFED package. + +I) QLogic VNIC Dynamic Update Daemon Tool and Hot Swap support:- + +This tool is started and stopped as part of the QLogic VNIC service +(refer to C above) and provides the following features: + +1. 
Dynamic update of disconnected interfaces (which have been configured +WITHOUT using the DGID option in the configuration file) : + +At the start up of VNIC driver, if the HCA port through which a particular VNIC +interface path (primary or secondary) connects to target is down or the +EVIC/VEx IOC is not available then all the required parameters (DGID etc) for connecting +with the EVIC/VEx cannot be determined. Hence the corresponding VNIC interface +path is not available at the start of the VNIC service. This daemon constantly +monitors the configured VNIC interfaces to check if any of them are disconnected. +If any of the interfaces are disconnected, it scans for available EVIC/VEx targets using +"ib_qlgc_vnic_query" tool. When daemon sees that for a given path of a VNIC interface, +the configured EVIC/VEx IOC has become available, it dynamically updates the +VNIC kernel driver with the required information to establish connection for +that path of the interface. In this way, the interface gets connected with +the configured EVIC/VEx whenever it becomes available without any manual +intervention. + +2. Hot Swap support : + +Hot swap is an operation in which an existing EVIC/VEx is replaced by another +EVIC/VEx (in the same slot of the switch chassis as the older one). In such a +case, the current connection for the corresponding VNIC interface will have to +be re-established. The daemon detects this hot swap case and re-establishes +the connection automatically. To make use of this feature of the daemon, it is +recommended that IOCSTRING be used in the configuration file to configure the +VNIC interfaces. + +This is because, after a hot swap though all other parameters like DGID, IOCGUID etc +of the EVIC/VEx change, the IOCSTRING remains the same. Thus the daemon monitors +for changes in IOCGUID and DGID of disconnected interfaces based on the IOCSTRING. 
+If these values have changed it updates the kernel driver so that the VNIC +interface can start using the new EVIC/VEx. + +If in addition to IOCSTRING, DGID and IOCGUID have been used to configure +a VNIC interface, then on a hotswap the daemon will update the parameters as required. +But to have that VNIC interface available immediately on the next restart of the +QLogic VNIC service, please make sure to update the configuration file with the +new DGID and IOCGUID values. Otherwise, the creation of such interfaces will be +delayed till the daemon runs and updates the parameters. + +J) Information about creating VLAN interfaces + +The EVIC/VEx supports VLAN tagging without having to explicitly create VLAN +interfaces for the VNIC interface on the host. This is done by enabling +Egress/Ingress tagging on the EVIC/VEx and setting the "Host ignores VLAN" +option for the VNIC interface. The "Host ignores VLAN" option is enabled +by default due to which VLAN tags are ignored on the host by the QLogic +VNIC driver. Thus explicitly created VLAN interfaces (using vconfig command) +for a given VNIC interface will not be operational. + +If you want to explicitly create a VLAN interface for a given VNIC interface, +then you will have to disable the "Host ignores VLAN" option for the +VNIC interface on the EVIC/VEx. The qlgc_vnic service must be restarted +on the host after disabling (or enabling) the "Host ignores VLAN" option. + +Please refer to the EVIC/VEx documentation for more information on Egress/Ingress +port tagging feature and disabling the "Host ignores VLAN" option. + +K) Information about enabling IB Multicast for QLogic VNIC interface + +QLogic VNIC driver has been upgraded to support the IB Multicasting feature of +EVIC/VEx. This feature enables the QLogic VNIC host driver to support the IP +multicasting more efficiently. With this feature enabled, infiniband multicast +group acts as a carrier of IP multicast traffic. 
EVIC will make use of such IB +multicast groups for forwarding IP multicast traffic to VNIC interfaces which +are member of given IP multicast group. In the older QLogic VNIC host driver, +IB multicasting was not being used to carry IP multicast traffic. + +By default, IB multicasting is disabled on EVIC/VEx; but it is enabled by +default at the QLogic VNIC host driver. + +To disable IB multicast feature on the host driver, VNIC configuration file +needs to be modified by setting the parameter IB_MULTICAST=FALSE in the +interface configuration. Please refer to the qlgc_vnic.cfg.sample for more +details on configuration of VNIC interfaces for IB multicasting. +IB multicasting also needs to be enabled over EVIC/VEx. Please refer to the +EVIC/VEx documentation for more information on enabling IB multicast +feature over EVIC/VEx. + +L) Basic Troubleshooting + +1. In case of any problems, make sure that: + + a) The HCA ports you are trying to use have IB cables connected and are in an + active state. You can use the "ibv_devinfo" tool to check the state of + your HCA ports. + + b) If your HCA ports are not active, check if an SM is running on the fabric + where the HCA ports are connected. If you have done a full install of + OFED, you can use the "sminfo" command ("sminfo -P 2" for port 2) to + check SM information. + + c) Make sure that the EVIC/VEx is powered up and its Ethernet cables are connected + properly. + + d) Check /var/log/messages for any error messages. + +2. If some of your VNIC interfaces are not available: + + a) Use "ifconfig" tool with -a option to see if all interfaces are created. + It is possible that the interfaces are created but do not have an + IP address. Make sure that you have setup a correct ifcfg-XXX file for your + VNIC interfaces for automatic assignment of IP addresses. 
+ + If the VNIC interface is created and the ifcfg file is also correct + but the VNIC interface is not UP, make sure that the target EVIC/VEx + IOC has an Ethernet cable properly connected. + + b) Make sure that the VNIC configuration file has been setup properly + with correct EVIC/VEx target DGID/IOCGUID/IOCSTRING information and + instance numbers. + + c) Make sure that the EVIC/VEx target IOC specified for that interface is + available. You can use the "ib_qlgc_vnic_query" tool to verify this. If it is not + available when you started the service, but it becomes available later + on, then the QLogic VNIC dynamic update daemon will bring up the + interface when the target becomes available. You will see messages in + /var/log/messages when the corresponding interface is created. + + d) Make sure that you have not exceeded the total number of Virtual interfaces + supported by the EVIC/VEx. You can check the total number of Virtual interfaces + currently in use on the HTTP interface of the EVIC/VEx. + diff --git a/readme_and_howto/QoS_architecture.txt b/readme_and_howto/QoS_architecture.txt new file mode 100644 index 0000000..1c19a98 --- /dev/null +++ b/readme_and_howto/QoS_architecture.txt @@ -0,0 +1,216 @@ + + QoS support in OFED + +============================================================================== +Table of contents +============================================================================== + +1. Overview +2. Architecture +3. Supported Policy +4. CMA functionality +5. IPoIB functionality +6. SDP functionality +7. RDS functionality +8. SRP functionality +9. iSER functionality +10. OpenSM functionality + + +============================================================================== +1. 
Overview +============================================================================== + +Quality of Service requirements stem from the realization of I/O consolidation +over IB network: As multiple applications and ULPs share the same fabric, +means to control their use of the network resources are becoming a must. +The basic need is to differentiate the service levels provided to different +traffic flows, such that a policy could be enforced and control each flow +utilization of the fabric resources. + +IBTA specification defined several hardware features and management interfaces +to support QoS: +* Up to 15 Virtual Lanes (VL) carry traffic in a non-blocking manner +* Arbitration between traffic of different VLs is performed by a 2 priority + levels weighted round robin arbiter. The arbiter is programmable with + a sequence of (VL, weight) pairs and maximal number of high priority credits + to be processed before low priority is served +* Packets carry class of service marking in the range 0 to 15 in their + header SL field +* Each switch can map the incoming packet by its SL to a particular output + VL based on programmable table VL=SL-to-VL-MAP(in-port, out-port, SL) +* The Subnet Administrator controls each communication flow parameters + by providing them as a response to Path Record (PR) or MultiPathRecord (MPR) + queries + +The IB QoS features provide the means to implement a DiffServ like +architecture. DiffServ architecture (IETF RFC 2474 & 2475) is widely used +today in highly dynamic fabrics. + +This document provides the detailed functional definition for the various +software elements that enable a DiffServ like architecture over the +OpenFabrics software stack. + + +============================================================================== +2. Architecture +============================================================================== + +QoS functionality is split between the SM/SA, CMA and the various ULPS. 
+We take the "chronology approach" to describe how the overall system works. + +2.1. The network manager (human) provides a set of rules (policy) that +define how the network is being configured and how its resources are split +to different QoS-Levels. The policy also defines how to decide which QoS-Level +each application, ULP or service uses. + +2.2. The SM analyzes the provided policy to see if it is realizable and +performs the necessary fabric setup. Part of this policy defines the default +QoS-Level of each partition. The SA is enhanced to match the requested Source, +Destination, QoS-Class, Service-ID, PKey against the policy, so clients +(ULPs, programs) can obtain a policy enforced QoS. The SM may also set up +partitions with an appropriate IPoIB broadcast group. This broadcast group +carries its QoS attributes: SL, MTU, RATE, and Packet Lifetime. + +2.3. IPoIB is set up. IPoIB uses the SL, MTU, RATE and Packet Lifetime +available on the multicast group which forms the broadcast group of this +partition. + +2.4. MPI, which provides non-IB-based connection management, should be +configured to run using hard coded SLs. It uses these SLs for every QP +being opened. + +2.5. ULPs that use the CM interface (like SRP) have their own pre-assigned +Service-ID and use it while obtaining PathRecord/MultiPathRecord (PR/MPR) +for establishing connections. The SA receiving the PR/MPR matches it +against the policy and returns the appropriate PR/MPR including SL, MTU, +RATE and Lifetime. + +2.6. ULPs and programs (e.g. SDP) that use CMA to establish RC connections +provide the CMA with the target IP and port number. ULPs might also provide +QoS-Class. The CMA then creates a Service-ID for the ULP and passes this ID and +the optional QoS-Class in the PR/MPR request. The resulting PR/MPR is used for +configuring the connection QP. 
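The Service-ID layout that sections 6, 7 and 9 of this document give for SDP, RDS and iSER (a fixed prefix followed by four hex digits of port number) can be reproduced with a few lines of shell. This is a sketch for checking values by hand, not the CMA implementation; the function name cma_service_id is hypothetical.

```shell
# Hypothetical helper: build a 64-bit Service-ID string from a ULP prefix and
# a 16-bit port number. Returns 1 and prints nothing if the port is out of range.
cma_service_id() {
    prefix=$1   # 0x0001 for SDP, 0x0106 for RDS and iSER (per this document)
    port=$2     # TCP/IP port number, decimal or 0x-prefixed hex
    if [ $(($port)) -lt 0 ] || [ $(($port)) -gt 65535 ]; then
        return 1
    fi
    # Upper 48 bits hold the prefix, lower 16 bits hold the port number.
    printf '0x%012X%04X\n' "$prefix" "$port"
}
```

For example, "cma_service_id 0x0106 0x48CA" prints 0x00000000010648CA, the default RDS Service-ID quoted in section 7.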
+ +PathRecord and MultiPathRecord enhancement for QoS: + +As mentioned above the PathRecord and MultiPathRecord attributes are enhanced +to carry the Service-ID which is a 64bit value. A new field QoS-Class is also +provided. +A new capability bit describes the SM QoS support in the SA class port info. +This approach provides an easy migration path for existing access layer and +ULPs by not introducing new set of PR/MPR attributes. + + +============================================================================== +3. Supported Policy +============================================================================== + +The QoS policy that is specified in a separate file is divided into +4 sub sections: + +I) Port Group: a set of CAs, Routers or Switches that share the same settings. + A port group might be a partition defined by the partition manager policy, + list of GUIDs, or list of port names based on NodeDescription. + +II) Fabric Setup: Defines how the SL2VL and VLArb tables should be setup. + NOTE: Currently this part of the policy is ignored. SL2VL and VLArb + tables should be configured in the OpenSM options file + (opensm.opts). + +III) QoS-Levels Definition: This section defines the possible sets of + parameters for QoS that a client might be mapped to. Each set holds + SL and optionally: Max MTU, Max Rate, Packet Lifetime and Path Bits. + NOTE: Currently, Path Bits are not implemented. + +IV) Matching Rules: A list of rules that match an incoming PR/MPR request + to a QoS-Level. The rules are processed in order such as the first match + is applied. Each rule is built out of a set of match expressions which + should all match for the rule to apply. The matching expressions are + defined for the following fields: + - SRC and DST to lists of port groups + - Service-ID to a list of Service-ID values or ranges + - QoS-Class to a list of QoS-Class values or ranges + + +============================================================================== +4. 
CMA features +============================================================================== + +The CMA interface supports Service-ID through the notion of port space +as a prefix to the port_num which is part of the sockaddr provided to +rdma_resolve_addr(). +CMA also allows the ULP (like SDP) to propagate a request for a specific +QoS-Class. CMA uses the provided QoS-Class and Service-ID in the sent PR/MPR. + + +============================================================================== +5. IPoIB +============================================================================== + +IPoIB queries the SA for its broadcast group information. +It provides the broadcast group SL, MTU, and RATE in every following +PathRecord query performed when a new UDAV is needed by IPoIB. + + +============================================================================== +6. SDP +============================================================================== + +SDP uses CMA for building its connections. +The Service-ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits +holding the remote TCP/IP Port Number to connect to. + + +============================================================================== +7. RDS +============================================================================== + +RDS uses CMA and thus it is very close to SDP. The Service-ID for RDS is +0x000000000106PPPP, where PPPP are 4 hex digits holding the TCP/IP Port +Number that the protocol connects to. +Default port number for RDS is 0x48CA, which makes a default Service-ID +0x00000000010648CA. + + +============================================================================== +8. SRP +============================================================================== + +Current SRP implementation uses its own CM callbacks (not CMA). So SRP fills +in the Service-ID in the PR/MPR by itself and uses that information in setting +up the QP. 
+SRP Service-ID is defined by the SRP target I/O Controller (it also complies +with IBTA Service-ID rules). The Service-ID is reported by the I/O Controller +in the ServiceEntries DMA attribute and should be used in the PR/MPR if the +SA reports its ability to handle QoS PR/MPRs. + + +============================================================================== +9. iSER +============================================================================== + +Similar to RDS, iSER also uses CMA. The Service-ID for iSER is similar to RDS +(0x000000000106PPPP), with default port number 0x0CBC, which makes a default +Service-ID 0x0000000001060CBC. + + +============================================================================== +10. OpenSM features +============================================================================== + +The QoS related functionality that is provided by OpenSM can be split into two +main parts: + +10.1. Fabric Setup +During fabric initialization the SM parses the policy and applies its settings +to the discovered fabric elements. + +10.2. PR/MPR query handling: +OpenSM enforces the provided policy on client requests. +The overall flow for such requests is: first the request is matched against +the defined match rules such that the target QoS-Level definition is found. +Given the QoS-Level a path(s) search is performed with the given restrictions +imposed by that level. + +============================================================================== diff --git a/readme_and_howto/QoS_management_in_OpenSM.txt b/readme_and_howto/QoS_management_in_OpenSM.txt new file mode 100644 index 0000000..8c9915f --- /dev/null +++ b/readme_and_howto/QoS_management_in_OpenSM.txt @@ -0,0 +1,492 @@ + + QoS Management in OpenSM + +============================================================================== + Table of contents +============================================================================== + +1. Overview +2. Full QoS Policy File +3. 
Simplified QoS Policy Definition +4. Policy File Syntax Guidelines +5. Examples of Full Policy File +6. Simplified QoS Policy - Details and Examples +7. SL2VL Mapping and VL Arbitration + + +============================================================================== + 1. Overview +============================================================================== + +When QoS in OpenSM is enabled (-Q or --qos), OpenSM looks for QoS Policy file. +The default name of OpenSM QoS policy file is +/usr/local/etc/opensm/qos-policy.conf. The default may be changed by using -Y +or --qos_policy_file option with OpenSM. + +During fabric initialization and at every heavy sweep OpenSM parses the QoS +policy file, applies its settings to the discovered fabric elements, and +enforces the provided policy on client requests. The overall flow for such +requests is: + - The request is matched against the defined matching rules such that the + QoS Level definition is found. + - Given the QoS Level, path(s) search is performed with the given + restrictions imposed by that level. + +There are two ways to define QoS policy: + - Full policy, where the policy file syntax provides an administrator + various ways to match PathRecord/MultiPathRecord (PR/MPR) request and + enforce various QoS constraints on the requested PR/MPR + - Simplified QoS policy definition, where an administrator would be able to + match PR/MPR requests by various ULPs and applications running on top of + these ULPs. + +While the full policy syntax is very flexible, in many cases the simplified +policy definition would be sufficient. + + +============================================================================== + 2. Full QoS Policy File +============================================================================== + +QoS policy file has the following sections: + +I) Port Groups (denoted by port-groups). +This section defines zero or more port groups that can be referred later by +matching rules (see below). 
A port group lists ports by: + - Port GUID + - Port name, which is a combination of NodeDescription and IB port number + - PKey, which means that all the ports in the subnet that belong to + partition with a given PKey belong to this port group + - Partition name, which means that all the ports in the subnet that belong + to partition with a given name belong to this port group + - Node type, where possible node types are: CA, SWITCH, ROUTER, ALL, and + SELF (SM's port). + +II) QoS Setup (denoted by qos-setup). +This section describes how to set up SL2VL and VL Arbitration tables on +various nodes in the fabric. +However, this is not supported in OpenSM currently. +SL2VL and VLArb tables should be configured in the OpenSM options file +(default location - /usr/local/etc/opensm/opensm.conf). + +III) QoS Levels (denoted by qos-levels). +Each QoS Level defines Service Level (SL) and a few optional fields: + - MTU limit + - Rate limit + - PKey + - Packet lifetime +When path(s) search is performed, it is subject to the restrictions that +these QoS Level parameters impose. +One QoS level that is mandatory to define is a DEFAULT QoS level. It is +applied to a PR/MPR query that does not match any existing match rule. +Similar to any other QoS Level, it can also be explicitly referred to by any +match rule. + +IV) QoS Matching Rules (denoted by qos-match-rules). +Each PathRecord/MultiPathRecord query that OpenSM receives is matched against +the set of matching rules. Rules are scanned in order of appearance in the QoS +policy file such that the first match takes precedence. +Each rule has a name of QoS level that will be applied to the matching query. +A default QoS level is applied to a query that did not match any rule. 
+Queries can be matched by: + - Source port group (whether a source port is a member of a specified group) + - Destination port group (same as above, only for destination port) + - PKey + - QoS class + - Service ID +To match a certain matching rule, PR/MPR query has to match ALL the rule's +criteria. However, not all the fields of the PR/MPR query have to appear in +the matching rule. +For instance, if the rule has a single criterion - Service ID, it will match +any query that has this Service ID, disregarding the rest of the query fields. +However, if a certain query has only Service ID (which means that this is the +only bit in the PR/MPR component mask that is on), it will not match any rule +that has other matching criteria besides Service ID. + + +============================================================================== + 3. Simplified QoS Policy Definition +============================================================================== + +Simplified QoS policy definition consists of a single section denoted by +qos-ulps. Similar to the full QoS policy, it has a list of match rules and +their QoS Level, but in this case a match rule has only one criterion - its +goal is to match a certain ULP (or a certain application on top of this ULP) +PR/MPR request, and QoS Level has only one constraint - Service Level (SL). +The simplified policy section may appear in the policy file in combination with +the full policy, or as a stand-alone policy definition. +See more details and list of match rule criteria below. + + +============================================================================== + 4. Policy File Syntax Guidelines +============================================================================== + +- Empty lines are ignored. +- Leading and trailing blanks, as well as empty lines, are ignored, so + the indentation in the example is just for better readability. +- Comments are started with the pound sign (#) and terminated by EOL. 
+- Any keyword should be the first non-blank in the line, unless it's a + comment. +- Keywords that denote section/subsection start have matching closing + keywords. +- Having a QoS Level named "DEFAULT" is a must - it is applied to PR/MPR + requests that didn't match any of the matching rules. +- Any section/subsection of the policy file is optional. + + +============================================================================== + 5. Examples of Full Policy File +============================================================================== + +As mentioned earlier, any section of the policy file is optional, and +the only mandatory part of the policy file is a default QoS Level. +Here's an example of the shortest policy file: + + qos-levels + qos-level + name: DEFAULT + sl: 0 + end-qos-level + end-qos-levels + +Port groups section is missing because there are no match rules, which means +that port groups are not referred anywhere, and there is no need defining +them. And since this policy file doesn't have any matching rules, PR/MPR query +won't match any rule, and OpenSM will enforce default QoS level. +Essentially, the above example is equivalent to not having QoS policy file +at all. + +The following example shows all the possible options and keywords in the +policy file and their syntax: + + # + # See the comments in the following example. + # They explain different keywords and their meaning. + # + port-groups + + port-group # using port GUIDs + name: Storage + # "use" is just a description that is used for logging + # Other than that, it is just a comment + use: SRP Targets + port-guid: 0x10000000000001, 0x10000000000005-0x1000000000FFFA + port-guid: 0x1000000000FFFF + end-port-group + + port-group + name: Virtual Servers + # The syntax of the port name is as follows: + # "node_description/Pnum". + # node_description is compared to the NodeDescription of the node, + # and "Pnum" is a port number on that node. 
+ port-name: vs1 HCA-1/P1, vs2 HCA-1/P1 + end-port-group + + # using partitions defined in the partition policy + port-group + name: Partitions + partition: Part1 + pkey: 0x1234 + end-port-group + + # using node types: CA, ROUTER, SWITCH, SELF (for node that runs SM) + # or ALL (for all the nodes in the subnet) + port-group + name: CAs and SM + node-type: CA, SELF + end-port-group + + end-port-groups + + qos-setup + # This section of the policy file describes how to set up SL2VL and VL + # Arbitration tables on various nodes in the fabric. + # However, this is not supported in OpenSM currently - the section is + # parsed and ignored. SL2VL and VLArb tables should be configured in the + # OpenSM options file (by default - /usr/local/etc/opensm/opensm.conf). + end-qos-setup + + qos-levels + + # Having a QoS Level named "DEFAULT" is a must - it is applied to + # PR/MPR requests that didn't match any of the matching rules. + qos-level + name: DEFAULT + use: default QoS Level + sl: 0 + end-qos-level + + # the whole set: SL, MTU-Limit, Rate-Limit, PKey, Packet Lifetime + qos-level + name: WholeSet + sl: 1 + mtu-limit: 4 + rate-limit: 5 + pkey: 0x1234 + packet-life: 8 + end-qos-level + + end-qos-levels + + # Match rules are scanned in order of their apperance in the policy file. + # First matched rule takes precedence. 
+ qos-match-rules + + # matching by single criteria: QoS class + qos-match-rule + use: by QoS class + qos-class: 7-9,11 + # Name of qos-level to apply to the matching PR/MPR + qos-level-name: WholeSet + end-qos-match-rule + + # show matching by destination group and service id + qos-match-rule + use: Storage targets + destination: Storage + service-id: 0x10000000000001, 0x10000000000008-0x10000000000FFF + qos-level-name: WholeSet + end-qos-match-rule + + qos-match-rule + source: Storage + use: match by source group only + qos-level-name: DEFAULT + end-qos-match-rule + + qos-match-rule + use: match by all parameters + qos-class: 7-9,11 + source: Virtual Servers + destination: Storage + service-id: 0x0000000000010000-0x000000000001FFFF + pkey: 0x0F00-0x0FFF + qos-level-name: WholeSet + end-qos-match-rule + + end-qos-match-rules + + +============================================================================== + 6. Simplified QoS Policy - Details and Examples +============================================================================== + +Simplified QoS policy match rules are tailored for matching ULPs (or some +application on top of a ULP) PR/MPR requests. This section has a list of +per-ULP (or per-application) match rules and the SL that should be enforced +on the matched PR/MPR query. 
+ +Match rules include: + - Default match rule that is applied to PR/MPR query that didn't match any + of the other match rules + - SDP + - SDP application with a specific target TCP/IP port range + - SRP with a specific target IB port GUID + - RDS + - iSER + - iSER application with a specific target TCP/IP port range + - IPoIB with a default PKey + - IPoIB with a specific PKey + - any ULP/application with a specific Service ID in the PR/MPR query + - any ULP/application with a specific PKey in the PR/MPR query + - any ULP/application with a specific target IB port GUID in the PR/MPR query + +Since any section of the policy file is optional, as long as basic rules of +the file are kept (such as no referring to nonexisting port group, having +default QoS Level, etc), the simplified policy section (qos-ulps) can serve +as a complete QoS policy file. +The shortest policy file in this case would be as follows: + + qos-ulps + default : 0 #default SL + end-qos-ulps + +It is equivalent to the previous example of the shortest policy file, and it +is also equivalent to not having policy file at all. 
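The syntax guidelines in section 4 (comments run from "#" to end of line, blanks and empty lines are ignored) together with the shortest qos-ulps example above can be illustrated with a small reader. This is only a sketch in Python: the function name parse_qos_ulps is hypothetical, it is not OpenSM's parser, and it assumes each rule has the form "rule text : SL" with SL in the range 0-15.

```python
def parse_qos_ulps(text):
    """Return a list of (rule, sl) pairs found in a qos-ulps section.

    Hypothetical helper for illustration only; not OpenSM's parser.
    Assumes every rule line has the form "<rule text> : <SL>".
    """
    rules = []
    in_section = False
    for raw in text.splitlines():
        # Comments start at '#' and end at EOL; surrounding blanks are ignored.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # empty lines are ignored
        if line == "qos-ulps":
            in_section = True
        elif line == "end-qos-ulps":
            in_section = False
        elif in_section:
            # Split on the last ':' so the rule text is kept intact.
            rule, _, sl_text = line.rpartition(":")
            rule = rule.strip()
            if not rule:
                raise ValueError(f"malformed rule: {line!r}")
            sl = int(sl_text)
            if not 0 <= sl <= 15:
                raise ValueError(f"SL out of range in rule: {line!r}")
            rules.append((rule, sl))
    return rules


shortest = """
    qos-ulps
        default : 0 #default SL
    end-qos-ulps
"""
print(parse_qos_ulps(shortest))
```

For the shortest policy file this yields a single pair: the "default" rule with SL 0.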
+ +Below is an example of simplified QoS policy with all the possible keywords: + + qos-ulps + default : 0 # default SL + sdp, port-num 30000 : 0 # SL for application running on top + # of SDP when a destination + # TCP/IPport is 30000 + sdp, port-num 10000-20000 : 0 + sdp : 1 # default SL for any other + # application running on top of SDP + rds : 2 # SL for RDS traffic + iser, port-num 900 : 0 # SL for iSER with a specific target + # port + iser : 3 # default SL for iSER + ipoib, pkey 0x0001 : 0 # SL for IPoIB on partition with + # pkey 0x0001 + ipoib : 4 # default IPoIB partition, + # pkey=0x7FFF + any, service-id 0x6234 : 6 # match any PR/MPR query with a + # specific Service ID + any, pkey 0x0ABC : 6 # match any PR/MPR query with a + # specific PKey + srp, target-port-guid 0x1234 : 5 # SRP when SRP Target is located on + # a specified IB port GUID + any, target-port-guid 0x0ABC-0xFFFFF : 6 # match any PR/MPR query with + # a specific target port GUID + end-qos-ulps + + +Similar to the full policy definition, matching of PR/MPR queries is done in +order of appearance in the QoS policy file such as the first match takes +precedence, except for the "default" rule, which is applied only if the query +didn't match any other rule. + +All other sections of the QoS policy file take precedence over the qos-ulps +section. That is, if a policy file has both qos-match-rules and qos-ulps +sections, then any query is matched first against the rules in the +qos-match-rules section, and only if there was no match, the query is matched +against the rules in qos-ulps section. + +Note that some of these match rules may overlap, so in order to use the +simplified QoS definition effectively, it is important to understand how each +of the ULPs is matched: + +6.1 IPoIB +IPoIB query is matched by PKey. 
Default PKey for IPoIB partition is 0x7fff, so +the following three match rules are equivalent: + + ipoib : <SL> + ipoib, pkey 0x7fff : <SL> + any, pkey 0x7fff : <SL> + +6.2 SDP +SDP PR query is matched by Service ID. The Service-ID for SDP is +0x000000000001PPPP, where PPPP are 4 hex digits holding the remote TCP/IP Port +Number to connect to. The following two match rules are equivalent: + + sdp : <SL> + any, service-id 0x0000000000010000-0x000000000001ffff : <SL> + +6.3 RDS +Similar to SDP, RDS PR query is matched by Service ID. The Service ID for RDS +is 0x000000000106PPPP, where PPPP are 4 hex digits holding the remote TCP/IP +Port Number to connect to. Default port number for RDS is 0x48CA, which makes +a default Service-ID 0x00000000010648CA. The following two match rules are +equivalent: + + rds : <SL> + any, service-id 0x00000000010648CA : <SL> + +6.4 iSER +Similar to RDS, iSER query is matched by Service ID, where the Service ID +is also 0x000000000106PPPP. Default port number for iSER is 0x0CBC, which makes +a default Service-ID 0x0000000001060CBC. The following two match rules are +equivalent: + + iser : <SL> + any, service-id 0x0000000001060CBC : <SL> + +6.5 SRP +Service ID for SRP varies from storage vendor to vendor, thus SRP query is +matched by the target IB port GUID. The following two match rules are +equivalent: + + srp, target-port-guid 0x1234 : <SL> + any, target-port-guid 0x1234 : <SL> + +Note that any of the above ULPs might contain target port GUID in the PR +query, so in order for these queries not to be recognized by the QoS manager +as SRP, the SRP match rule (or any match rule that refers to the target port +guid only) should be placed at the end of the qos-ulps match rules. + +6.6 MPI +SL for MPI is manually configured by MPI admin. OpenSM is not forcing any SL +on the MPI traffic, and that's why it is the only ULP that did not appear in +the qos-ulps section. + + +============================================================================== + 7.
SL2VL Mapping and VL Arbitration +============================================================================== + +OpenSM cached options file has a set of QoS related configuration parameters, +that are used to configure SL2VL mapping and VL arbitration on IB ports. +These parameters are: + - Max VLs: the maximum number of VLs that will be on the subnet. + - High limit: the limit of High Priority component of VL Arbitration + table (IBA 7.6.9). + - VLArb low table: Low priority VL Arbitration table (IBA 7.6.9) template. + - VLArb high table: High priority VL Arbitration table (IBA 7.6.9) template. + - SL2VL: SL2VL Mapping table (IBA 7.6.6) template. It is a list of VLs + corresponding to SLs 0-15 (Note that VL15 used here means drop this SL). + +There are separate QoS configuration parameters sets for various target types: +CAs, routers, switch external ports, and switch's enhanced port 0. The names +of such parameters are prefixed by "qos_<type>_" string. Here is a full list +of the currently supported sets: + + qos_ca_ - QoS configuration parameters set for CAs. + qos_rtr_ - parameters set for routers. + qos_sw0_ - parameters set for switches' port 0. + qos_swe_ - parameters set for switches' external ports. + +Here's the example of typical default values for CAs and switches' external +ports (hard-coded in OpenSM initialization): + + qos_ca_max_vls 15 + qos_ca_high_limit 0 + qos_ca_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0 + qos_ca_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4 + qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7 + + qos_swe_max_vls 15 + qos_swe_high_limit 0 + qos_swe_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0 + qos_swe_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4 + qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7 + +VL arbitration tables (both high and low) are lists of VL/Weight pairs.
+Each list entry contains a VL number (values from 0-14), and a weighting value +(values 0-255), indicating the number of 64 byte units (credits) which may be +transmitted from that VL when its turn in the arbitration occurs. A weight +of 0 indicates that this entry should be skipped. If a list entry is +programmed for VL15 or for a VL that is not supported or is not currently +configured by the port, the port may either skip that entry or send from any +supported VL for that entry. + +Note that the same VL may be listed multiple times in the High or Low +priority arbitration tables and, further, can be listed in both tables. + +The limit of high-priority VLArb table (qos_<type>_high_limit) indicates the +number of high-priority packets that can be transmitted without an opportunity +to send a low-priority packet. Specifically, the number of bytes that can be +sent is high_limit times 4K bytes. + +A high_limit value of 255 indicates that the byte limit is unbounded. +Note: if the 255 value is used, the low priority VLs may be starved. +A value of 0 indicates that only a single packet from the high-priority table +may be sent before an opportunity is given to the low-priority table. + +Keep in mind that ports usually transmit packets of size equal to MTU. +For instance, for 4KB MTU a single packet will require 64 credits, so in order +to achieve effective VL arbitration for packets of 4KB MTU, the weighting +values for each VL should be multiples of 64. + +Below is an example of SL2VL and VL Arbitration configuration on subnet: + + qos_ca_max_vls 15 + qos_ca_high_limit 6 + qos_ca_vlarb_high 0:4 + qos_ca_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64 + qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7 + + qos_swe_max_vls 15 + qos_swe_high_limit 6 + qos_swe_vlarb_high 0:4 + qos_swe_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64 + qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7 + +In this example, there are 8 VLs configured on subnet: VL0 to VL7.
VL0 is +defined as a high priority VL, and it is limited to 6 x 4KB = 24KB in a single +transmission burst. Such configuration would suit a VL that needs low latency +and uses small MTU when transmitting packets. Rest of VLs are defined as low +priority VLs with different weights, while VL4 is effectively turned off. diff --git a/readme_and_howto/RDS_README.txt b/readme_and_howto/RDS_README.txt new file mode 100644 index 0000000..f0f8f5d --- /dev/null +++ b/readme_and_howto/RDS_README.txt @@ -0,0 +1,335 @@ +RDS(7) RDS(7) + + + +NAME + RDS - Reliable Datagram Sockets + +SYNOPSIS + #include + #include + +DESCRIPTION + This is an implementation of the RDS socket API. It provides reliable, + in-order datagram delivery between sockets over a variety of trans‐ + ports. + + Currently, RDS can be transported over Infiniband, and loopback. + iWARP bcopy is supported, but not RDMA operations. + + RDS uses standard AF_INET addresses as described in ip(7) to identify + end points. + + Socket Creation + RDS is still in development and as such does not have a reserved proto‐ + col family constant. Applications must read the string representation + of the protocol family value from the pf_rds sysctl parameter file + described below. + + rds_socket = socket(pf_rds, SOCK_SEQPACKET, 0); + + + Socket Options + RDS sockets support a number of socket options through the setsock‐ + opt(2) and getsockopt(2) calls. The following generic options (with + socket level SOL_SOCKET) are of specific importance: + + SO_RCVBUF + Specifies the size of the receive buffer. See section on "Con‐ + gestion Control" below. + + SO_SNDBUF + Specifies the size of the send buffer. See "Message Transmis‐ + sion" below. + + SO_SNDTIMEO + Specifies the send timeout when trying to enqueue a message on a + socket with a full queue in blocking mode. + + In addition to these, RDS supports a number of protocol specific + options (with socket level SOL_RDS).
Just as with the RDS protocol + family, an official value has not been assigned yet, so the kernel will + assign a value dynamically. The assigned value can be retrieved from + the sol_rds sysctl parameter file. + + RDS specific socket options will be described in a separate section + below. + + Binding + A new RDS socket has no local address when it is first returned from + socket(2). It must be bound to a local address by calling bind(2) + before any messages can be sent or received. This will also attach the + socket to a specific transport, based on the type of interface the + local address is attached to. From that point on, the socket can only + reach destinations which are available through this transport. + + For instance, when binding to the address of an Infiniband interface + such as ib0, the socket will use the Infiniband transport. If RDS is + not able to associate a transport with the given address, it will + return EADDRNOTAVAIL. + + An RDS socket can only be bound to one address and only one socket can + be bound to a given address/port pair. If no port is specified in the + binding address then an unbound port is selected at random. + + RDS does not allow the application to bind a previously bound socket to + another address. Binding to the wildcard address INADDR_ANY is not per‐ + mitted either. + + Connecting + The default mode of operation for RDS is to use unconnected socket, and + specify a destination address as an argument to sendmsg. However, RDS + allows sockets to be connected to a remote end point using connect(2). + If a socket is connected, calling sendmsg without specifying a destina‐ + tion address will use the previously given remote address. + + Congestion Control + RDS does not have explicit congestion control like common streaming + protocols such as TCP. However, sockets have two queue limits associ‐ + ated with them; the send queue size and the receive queue size. 
Mes‐ + sages are accounted based on the number of bytes of payload. + + The send queue size limits how much data local processes can queue on a + local socket (see the following section). If that limit is exceeded, + the kernel will not accept further messages until the queue is drained + and messages have been delivered to and acknowledged by the remote + host. + + The receive queue size limits how much data RDS will put on the receive + queue of a socket before marking the socket as congested. When a + socket becomes congested, RDS will send a congestion map update to the + other participating hosts, who are then expected to stop sending more + messages to this port. + + There is a timing window during which a remote host can still continue + to send messages to a congested port; RDS solves this by accepting + these messages even if the socket's receive queue is already over the + limit. + + As the application pulls incoming messages off the receive queue using + recvmsg(2), the number of bytes on the receive queue will eventually + drop below the receive queue size, at which point the port is then + marked uncongested, and another congestion update is sent to all par‐ + ticipating hosts. This tells them to allow applications to send addi‐ + tional messages to this port. + + The default values for the send and receive buffer sizes are controlled + by the system-wide socket buffer sizes: a given RDS socket's transmit + and receive buffer space defaults to the values set in the + wmem_default and rmem_default sysctls, respectively. They can be tuned + by the application through the SO_SNDBUF and SO_RCVBUF socket options. + + + Blocking Behavior + The sendmsg(2) and recvmsg(2) calls can block in a variety of situa‐ + tions. Whether a call blocks or returns with an error depends on the + non-blocking setting of the file descriptor and the MSG_DONTWAIT mes‐ + sage flag.
If the file descriptor is set to blocking mode (which is the + default), and the MSG_DONTWAIT flag is not given, the call will block. + + In addition, the SO_SNDTIMEO and SO_RCVTIMEO socket options can be used + to specify a timeout (in seconds) after which the call will abort wait‐ + ing, and return an error. The default timeout is 0, which tells RDS to + block indefinitely. + + Message Transmission + Messages may be sent using sendmsg(2) once the RDS socket is bound. + Message length cannot exceed 4 gigabytes as the wire protocol uses an + unsigned 32 bit integer to express the message length. + + RDS does not support out of band data. Applications are allowed to send + to unicast addresses only; broadcast or multicast are not supported. + + A successful sendmsg(2) call puts the message in the socket's transmit + queue where it will remain until either the destination acknowledges + that the message is no longer in the network or the application removes + the message from the send queue. + + Messages can be removed from the send queue with the RDS_CANCEL_SENT_TO + socket option described below. + + While a message is in the transmit queue its payload bytes are + accounted for. If an attempt is made to send a message while there is + not sufficient room on the transmit queue, the call will either block + or return EAGAIN. + + Trying to send to a destination that is marked congested (see above), + the call will either block or return ENOBUFS. + + A message sent with no payload bytes will not consume any space in the + destination's send buffer but will result in a message receipt on the + destination. The receiver will not get any payload data but will be + able to see the sender's address. + + Messages sent to a port to which no socket is bound will be silently + discarded by the destination host. No error messages are reported to + the sender. + + Message Receipt + Messages may be received with recvmsg(2) on an RDS socket once it is + bound to a source address. 
RDS will return messages in-order, i.e. mes‐ + sages from the same sender will arrive in the same order in which they + were sent. + + The address of the sender will be returned in the sockaddr_in structure + pointed to by the msg_name field, if set. + + If the MSG_PEEK flag is given, the first message on the receive queue is + returned without removing it from the queue. + + The memory consumed by messages waiting for delivery does not limit the + number of messages that can be queued for receive. RDS does attempt to + perform congestion control as described in the section above. + + If the length of the message exceeds the size of the buffer provided to + recvmsg(2), then the remainder of the bytes in the message are dis‐ + carded and the MSG_TRUNC flag is set in the msg_flags field. In this + truncating case recvmsg(2) will still return the number of bytes + copied, not the length of the entire message. If MSG_TRUNC is set in the + flags argument to recvmsg(2), then it will return the number of bytes + in the entire message. Thus one can examine the size of the next mes‐ + sage in the receive queue without incurring a copying overhead by pro‐ + viding a zero length buffer and setting MSG_PEEK and MSG_TRUNC in the + flags argument. + + The sending address of a zero-length message will still be provided in + the msg_name field. + + Control Messages + RDS uses control messages (a.k.a. ancillary data) through the msg_con‐ + trol and msg_controllen fields in sendmsg(2) and recvmsg(2). Control + messages generated by RDS have a cmsg_level value of sol_rds. Most + control messages are related to the zerocopy interface added in RDS + version 3, and are described in rds-rdma(7). + + The only exception is the RDS_CMSG_CONG_UPDATE message, which is + described in the following section. + + Polling + RDS supports the poll(2) interface in a limited fashion.
POLLIN is + returned when there is a message (either a proper RDS message, or a + control message) waiting in the socket's receive queue. POLLOUT is + always returned while there is room on the socket's send queue. + + Sending to congested ports requires special handling. When an applica‐ + tion tries to send to a congested destination, the system call will + return ENOBUFS. However, it cannot poll for POLLOUT, as there is prob‐ + ably still room on the transmit queue, so the call to poll(2) would + return immediately, even though the destination is still congested. + + There are two ways of dealing with this situation. The first is to sim‐ + ply poll for POLLIN. By default, a process sleeping in poll(2) is + always woken up when the congestion map is updated, and thus the appli‐ + cation can retry any previously congested sends. + + The second option is explicit congestion monitoring, which gives the + application more fine-grained control. + + With explicit monitoring, the application polls for POLLIN as before, + and additionally uses the RDS_CONG_MONITOR socket option to install a + 64bit mask value in the socket, where each bit corresponds to a group + of ports. When a congestion update arrives, RDS checks the set of ports + that became uncongested against the bit mask installed in the socket. + If they overlap, a control message is enqueued on the socket, and the + application is woken up. When it calls recvmsg(2), it will be given the + control message containing the bitmap. + + The congestion monitor bitmask can be set and queried using setsock‐ + opt(2) with RDS_CONG_MONITOR, and a pointer to the 64bit mask variable. + + Congestion updates are delivered to the application via + RDS_CMSG_CONG_UPDATE control messages. These control messages are + always delivered by themselves (or possibly with additional control mes‐ + sages), but never along with an RDS data message.
The cmsg_data field of + the control message is an 8 byte datum containing the 64bit mask value. + + Applications can use the following macros to test for and set bits in + the bitmask: + + #define RDS_CONG_MONITOR_SIZE 64 + #define RDS_CONG_MONITOR_BIT(port) (((unsigned int) port) % RDS_CONG_MONITOR_SIZE) + #define RDS_CONG_MONITOR_MASK(port) (1ULL << RDS_CONG_MONITOR_BIT(port)) + + + Canceling Messages + An application can cancel (flush) messages from the send queue using + the RDS_CANCEL_SENT_TO socket option with setsockopt(2). This call + takes an optional sockaddr_in address structure as argument. If given, + only messages to the destination specified by this address are dis‐ + carded. If no address is given, all pending messages are discarded. + + Note that this affects messages that have not yet been transmitted as + well as messages that have been transmitted, but for which no acknowl‐ + edgment from the remote host has been received yet. + + Reliability + If sendmsg(2) succeeds, RDS guarantees that the message will be vis‐ + ible to recvmsg(2) on a socket bound to the destination address as + long as that destination socket remains open. + + If there is no socket bound on the destination, the message is + silently dropped. If the sending RDS can't be sure that there is no + socket bound then it will try to send the message indefinitely until it + can be sure or the sent message is canceled. + + If a socket is closed then all pending sent messages on the socket are + canceled and may or may not be seen by the receiver. + + The RDS_CANCEL_SENT_TO socket option can be used to cancel all pending + messages to a given destination. + + If a receiving socket is closed with pending messages then the sender + considers those messages as having left the network and will not + retransmit them. + + A message will only be seen by recvmsg(2) once, unless MSG_PEEK was + specified. Once the message has been delivered it is removed from the + sending socket's transmit queue.
+ + All messages sent from the same socket to the same destination will be + delivered in the order they're sent. Messages sent from different sock‐ + ets, or to different destinations, may be delivered in any order. + +SYSCTL VALUES + These parameters may only be accessed through their files in + /proc/sys/net/rds. Access through sysctl(2) is not supported. + + pf_rds This file contains the string representation of the protocol + family constant passed to socket(2) to create a new RDS socket. + + sol_rds + This file contains the string representation of the socket level + parameter that is passed to getsockopt(2) and setsockopt(2) to + manipulate RDS socket options. + + max_unacked_bytes and max_unacked_packets + These parameters are used to tune the generation of acknowledge‐ + ments. By default, the system receiving RDS messages does not + send back explicit acknowledgements unless it transmits a mes‐ + sage of its own (in which case the ACK is piggybacked onto the + outgoing message), or when the sending system requests an ACK. + + However, the sender needs to see an ACK from time to time so + that it can purge old messages from the send queue. The unacked + bytes and packet counters are used to keep track of how much + data has been sent without requesting an ACK. The default is to + request an acknowledgement every 16 packets, or every 16 MB, + whichever comes first. + + reconnect_delay_min_ms and reconnect_delay_max_ms + RDS uses host-to-host connections to transport RDS messages + (both for the TCP and the Infiniband transport). If this connec‐ + tion breaks, RDS will try to re-establish the connection. + Because this reconnect may be triggered by both hosts at the + same time and fail, RDS uses a random backoff before attempting + a reconnect. These two parameters specify the minimum and maxi‐ + mum delay in milliseconds. The default values are 1 and 1000, + respectively.
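The socket-creation procedure described under "Socket Creation" (read the protocol family from the pf_rds file, then pass it to socket(2)) can be sketched as follows. This is a minimal illustration in Python rather than C. The helper names read_rds_sysctl and open_rds_socket are hypothetical, and the sketch assumes each sysctl file holds a single decimal integer.

```python
import socket
from pathlib import Path

# Directory named in SYSCTL VALUES above.
RDS_SYSCTL_DIR = Path("/proc/sys/net/rds")


def read_rds_sysctl(name, base_dir=RDS_SYSCTL_DIR):
    """Return the integer stored in an RDS sysctl file such as pf_rds.

    Hypothetical helper. Assumption: the file holds one decimal integer.
    """
    return int((Path(base_dir) / name).read_text().strip())


def open_rds_socket(base_dir=RDS_SYSCTL_DIR):
    """Create an (unbound) RDS socket using the dynamically assigned family.

    The socket must still be bound with bind(2) before any message
    can be sent or received.
    """
    pf_rds = read_rds_sysctl("pf_rds", base_dir)
    return socket.socket(pf_rds, socket.SOCK_SEQPACKET, 0)


if __name__ == "__main__":
    # Only meaningful on a host where the RDS module is loaded.
    print("pf_rds  =", read_rds_sysctl("pf_rds"))
    print("sol_rds =", read_rds_sysctl("sol_rds"))
```

The value read from sol_rds would be used in the same way as the level argument to setsockopt(2) and getsockopt(2) for the RDS-specific options.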
+ +SEE ALSO + rds-rdma(7), socket(2), bind(2), sendmsg(2), recvmsg(2), getsockopt(2), + setsockopt(2). + + + + RDS(7) diff --git a/readme_and_howto/RoCEE_README.txt b/readme_and_howto/RoCEE_README.txt new file mode 100644 index 0000000..ab6a826 --- /dev/null +++ b/readme_and_howto/RoCEE_README.txt @@ -0,0 +1,184 @@ +=============================================================================== + OFED-1.5.1 RoCEE Support README + February 2010 +=============================================================================== + +Contents: +========= +1. Overview +2. Software Dependencies +3. User Guidelines +4. Ported Applications +5. Gid tables +6. Using VLANs +7. Statistic counters +8. Firmware Requirements +9. Supported hardware +10. Added features +11. Known Issues + + +1. Overview +=========== +RDMA over Converged Enhanced Ethernet (RoCEE) allows InfiniBand (IB) transport +over Ethernet networks. It encapsulates IB transport and GRH headers in +Ethernet packets bearing a dedicated ether type. +While the use of GRH is optional within IB subnets, it is mandatory when using +RoCEE. Verbs applications written over IB verbs should work seamlessly, but +they require provisioning of GRH information when creating address vectors. The +library and driver are modified to provide for mapping from GID to MAC +addresses required by the hardware. + +2. Software Dependencies +======================== +In order to use RoCEE over Mellanox ConnectX(R) hardware, the mlx4_en driver +must be loaded. Please refer to MLNX_EN_README.txt for further details. + + +3. User Guidelines +================== +Since RoCEE encapsulates InfiniBand traffic in Ethernet frames, the +corresponding net device must be up and running. In case of Mellanox +hardware, mlx4_en must be loaded and the corresponding interface configured. +- Make sure mlx4_en.ko is loaded +- Make sure an IP address has been configured to this interface +- Run "ibv_devinfo".
There is a new field named "link_layer" which can be + either "Ethernet" or "IB". If the value is IB, then you need to use + connectx_port_config to change the ConnectX ports designation to eth (see + mlx4_release_notes.txt for details) +- Configure the IP address of the interface so that the link will become + active +- All IB verbs applications which run over IB verbs should work on RoCEE + links as long as they use GRH headers (that is, as long as they specify use + of GRH in their address vector) +- rdma_cm applications working over RoCEE will have the TOS field set to a + default value of 3. The default value is given as a module parameter to + rdma_cm: + def_prec2sl:Default value for SL priority with RoCE. Valid values 0 - 7 (int). + + +4. Ported Applications +====================== +- ibv_*_pingpong examples have been ported too. The user must specify the GID + of the remote peer using the new '-g' option. The GID has the same format as + that in /sys/class/infiniband/mlx4_0/ports/1/gids/0 + +- Note: Care should be taken when using ibv_ud_pingpong. The default message + size is 2K, which is likely to exceed the MTU of the RoCEE link. Use + ibv_devinfo to inspect the link MTU and specify an appropriate message size + +- All rdma_cm applications should work seamlessly without any change + +- libsdp works without any change + +- Performance tests have been ported + + +5. Gid tables +============= +With RoCEE, there may be several entries in a port's GID table. The first entry +always contains the IPv6 link local address of the corresponding ethernet +interface.
The link local address is formed in the following way: + +gid[0..7] = fe80000000000000 +gid[8] = mac[0] ^ 2 +gid[9] = mac[1] +gid[10] = mac[2] +gid[11] = ff +gid[12] = fe +gid[13] = mac[3] +gid[14] = mac[4] +gid[15] = mac[5] + +If VLAN is supported by the kernel, and there are VLAN interfaces on the main +ethernet interface (the interface that the IB port is tied to), each such VLAN +will appear as a new GID in the port's GID table. The format of the GID entry +will be identical to the one described above with the following change: + +gid[11] = VLAN ID high byte (4 MS bits). +gid[12] = VLAN ID low byte + +Please note that VLAN ID is 12 bits. + +Priority pause frames +--------------------- +Tagged ethernet frames carry a 3 bit priority field. The value of this field is +derived from the IB SL field by taking the 3 LS bits of the SL field. + + +6. Using VLANs +============== +In order for RoCEE traffic to use VLAN tagged frames, the user has to specify +GID table entries that are derived from VLAN devices, when creating address +vectors. Consider the example below: + +6.1 Make sure VLAN support is enabled by the kernel. Usually this requires +loading the 8021q module. +- modprobe 8021q + +6.2 Add a VLAN device +- vconfig add eth2 7 + +6.3 Assign IP address to the VLAN interface +- ifconfig eth2.7 7.10.11.12 +Suppose this created a new entry in the GID table at index 1. + +6.4 verbs test: +server: ibv_rc_pingpong -g 1 +client: ibv_rc_pingpong -g 1 server + +6.5 For rdma_cm applications, the user only needs to specify an IP address of a +VLAN device for the traffic to go out in that VLAN's tagged frames. + +7. Statistic counters +===================== +RoCEE traffic is counted and can be read from the sysfs counters in the same +manner as it is done for regular InfiniBand devices.
Only the following +counters are supported: +- port_xmit_packets +- port_rcv_packets +- port_rcv_data +- port_xmit_data + +For example, to read the number of transmitted packets on port 2 of device +mlx4_1, one needs to read the file: +/sys/class/infiniband/mlx4_1/ports/2/counters/port_xmit_packets + +Note: RoCEE traffic will not show in the associated Ethernet device's counters +since it is offloaded by the hardware and does not go through the Ethernet network +driver. + + +8. Firmware Requirements +======================== +RoCEE has limited support with firmware 2.7.700 and will be fully supported +with firmware 2.8.000. + + +9. Supported hardware +===================== +Currently, ConnectX B0 hardware is supported. A0 hardware may have issues. + + +10. Added features +================== +ibdev2netdev is a utility that displays the association between an HCA's port +and the network interface bound to it. Example run: + +sw417:/usr/src/packages/SOURCES/ofa_kernel-1.5.2 # ibdev2netdev +mlx4_0 port 1 ==> ib0 (Down) +mlx4_0 port 2 ==> ib1 (Down) +mlx4_1 port 1 ==> eth2 (Up) +mlx4_1 port 2 ==> eth3 (Up) + + + +11. Known Issues +================ +- PowerPC and ia64 architectures are not supported. x32 architectures were + not tested. + +- SRP is not supported. + +- UD QPs that send traffic with VLAN tags (e.g. 802.1q tagged frames) do not + work. This will be fixed in a subsequent release. diff --git a/readme_and_howto/SRPT_README.txt b/readme_and_howto/SRPT_README.txt new file mode 100644 index 0000000..ac16fb7 --- /dev/null +++ b/readme_and_howto/SRPT_README.txt @@ -0,0 +1,223 @@ +SCSI RDMA Protocol (SRP) Target driver for Linux +================================================= + +SRP Target driver is designed to work directly on top of OpenFabrics +OFED-1.x software stack (http://www.openfabrics.org) or InfiniBand +drivers in Linux kernel tree (kernel.org).
It also interfaces with +Generic SCSI target mid-level driver - SCST (http://scst.sourceforge.net) + +By interfacing with SCST driver we are able to work and support a lot IO +modes on real or virtual devices in the backend + +1. scst_disk -- interfacing with scsi sub-system to claim and export real + scsi devices ie. disks, hardware raid volumes, tape library as SRP's luns + +2. scst_vdisk -- fileio and blockio modes. This allows you to turn software + raid volumes, LVM volumes, IDE disks, block devices and normal files into + SRP's luns + +3. NULLIO mode will allow you to measure the performance without sending IOs + to *real* devices + + +Prerequisites +------------- +0. Supported distributions: RHEL 5.2/5.3/5.4, SLES 10 sp2/sp3, SLES 11 + +NOTES: On distribution default kernels, you can run scst_vdisk blockio mode + to have good performance. + + It is required to patch and recompile the kernel to run scst_disk + ie. scsi pass-thru mode + OR + You have to compile scst with -DSTRICT_SERIALIZING enabled and this + does not yield good performance. + +1. Download and install SCST driver (supported version 1.0.1.1) + +1a. Download scst-1.0.1.1.tar.gz from this URL + http://scst.sourceforge.net/downloads.html + +1b. untar and install scst-1.0.1.1 + + $ tar zxvf scst-1.0.1.1.tar.gz + $ cd scst-1.0.1.1 + + THIS STEP IS SPECIFIC FOR SLES 10 sp2/sp3 distributions: + + $ patch -p1 -i /docs/scst/scst_sles10_sp2.patch + + For all distributions: + + $ make && make install + +NOTES: FOR SLES 11 distribution, skip next step (step 1c) and go directly to + step (2) + +1c. patch scst.h header file with scst.patch + + $ cd /usr/local/include/scst + $ patch -p1 -i /docs/scst/scst.patch + + +2. 
Download/install OFED-1.5.1 package - SRP target is part of OFED package + +NOTES: if your system already has the OFED stack installed, you need to remove + the previous build of kernel-ib RPMs and reinstall + + $ cd ~/OFED-1.5.1 + $ rm RPMS/*/*/kernel-ib* + $ ./install.pl -c ofed.conf + + Make sure that srpt=y in the ofed.conf + +2a. download OFED packages from this URL + http://www.openfabrics.org/downloads/OFED/OFED-1.5.1/ + +2b. install OFED - remember to choose srpt=y + + $ cd ~/OFED-1.5.1 + $ ./install.pl + + +How-to run +----------- + +A. On srp target machine + +A1. Please refer to SCST's README for loading scst driver and its dev_handlers + drivers (scst_disk, scst_vdisk block or file IO mode, nullio, ...) + SCST's README is located in the ~/scst-1.0.1.1/ directory + +NOTES: In any mode you always need to have lun 0 in any group's device list + Then you can have any lun number following lun 0 (the lun numbers are not + required to be in order except that the first lun is always 0) + + Setting SRPT_LOAD=yes in /etc/infiniband/openib.conf is not good enough + It only loads the ib_srpt module and does not load scst and its dev_handlers + + SCST's scst_disk module (pass-thru mode) does not run on default + distribution kernels (the kernels that come with RHEL 5.2/5.3/5.4 & SLES 11) + because it requires patching and recompiling the kernel. It can only + run with vanilla kernels. + +Example 1: working with VDISK BLOCKIO mode + (using md0 device, sda, and cciss/c1d0) +a. modprobe scst +b. modprobe scst_vdisk +c. echo "open vdisk0 /dev/md0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk +d. echo "open vdisk1 /dev/sda BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk +e. echo "open vdisk2 /dev/cciss/c1d0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk +f. echo "add vdisk0 0" >/proc/scsi_tgt/groups/Default/devices +g. echo "add vdisk1 1" >/proc/scsi_tgt/groups/Default/devices +h. echo "add vdisk2 2" >/proc/scsi_tgt/groups/Default/devices + +Example 2: working with real back-end scsi disks in scsi pass-thru mode +a.
modprobe scst +b. modprobe scst_disk +c. cat /proc/scsi_tgt/scsi_tgt +ibstor00:~ # cat /proc/scsi_tgt/scsi_tgt +Device (host:ch:id:lun or name) Device handler +0:0:0:0 dev_disk +4:0:0:0 dev_disk +5:0:0:0 dev_disk +6:0:0:0 dev_disk +7:0:0:0 dev_disk + +Now you want to exclude the first scsi disk and expose the last 4 scsi disks +as IB/SRP luns for I/O + +echo "add 4:0:0:0 0" >/proc/scsi_tgt/groups/Default/devices +echo "add 5:0:0:0 1" >/proc/scsi_tgt/groups/Default/devices +echo "add 6:0:0:0 2" >/proc/scsi_tgt/groups/Default/devices +echo "add 7:0:0:0 3" >/proc/scsi_tgt/groups/Default/devices + +Example 3: working with scst_vdisk FILEIO mode + (using md0 device and file 10G-file) +a. modprobe scst +b. modprobe scst_vdisk +c. echo "open vdisk0 /dev/md0" > /proc/scsi_tgt/vdisk/vdisk +d. echo "open vdisk1 /10G-file" > /proc/scsi_tgt/vdisk/vdisk +e. echo "add vdisk0 0" >/proc/scsi_tgt/groups/Default/devices +f. echo "add vdisk1 1" >/proc/scsi_tgt/groups/Default/devices + +A2. modprobe ib_srpt + + +B. On initiator machines you can manually do the following steps: + +B1. modprobe ib_srp +B2. ibsrpdm -c -d /dev/infiniband/umadX + (to discover new SRP target) + umad0: port 1 of the first HCA + umad1: port 2 of the first HCA + umad2: port 1 of the second HCA +B3. echo {new target info} > /sys/class/infiniband_srp/srp-mthca0-1/add_target +B4. fdisk -l (will show newly discovered scsi disks) + +Example: +Assume that you use port 1 of first HCA in the system ie. mthca0 + +[root@lab104 ~]# ibsrpdm -c -d /dev/infiniband/umad0 +id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4, +dgid=fe800000000000000002c90200226cf5,pkey=ffff,service_id=0002c90200226cf4 +[root@lab104 ~]# echo id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4, +dgid=fe800000000000000002c90200226cf5,pkey=ffff,service_id=0002c90200226cf4 > +/sys/class/infiniband_srp/srp-mthca0-1/add_target + +OR + ++ You can edit /etc/infiniband/openib.conf to load srp driver and srp HA daemon +automatically ie.
set SRP_LOAD=yes, SRP_DAEMON_ENABLE=yes, and SRPHA_ENABLE=yes ++ To set up and use high availability feature you need dm-multipath driver +and multipath tool ++ Please refer to OFED-1.5.1 SRP's user manual for more detailed instructions +on how-to enable/use HA feature (OFED-1.5.1/docs/srp_release_notes.txt) + + +Here is an example of srp target setup file +-------------------------------------------- + +*********************** srpt.sh ***************************************** +#!/bin/sh +modprobe scst scst_threads=1 +modprobe scst_vdisk scst_vdisk_ID=100 + +echo "open vdisk0 /dev/cciss/c1d0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk +echo "open vdisk1 /dev/sdb BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk +echo "open vdisk2 /dev/sdc BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk +echo "open vdisk3 /dev/sdd BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk +echo "add vdisk0 0" > /proc/scsi_tgt/groups/Default/devices +echo "add vdisk1 1" > /proc/scsi_tgt/groups/Default/devices +echo "add vdisk2 2" > /proc/scsi_tgt/groups/Default/devices +echo "add vdisk3 3" > /proc/scsi_tgt/groups/Default/devices + +modprobe ib_srpt + +echo "add mgmt" > /proc/scsi_tgt/trace_level +echo "add mgmt_dbg" > /proc/scsi_tgt/trace_level +echo "add out_of_mem" > /proc/scsi_tgt/trace_level + +*********************** End srpt.sh ************************************** + + +How-to unload/shutdown +----------------------- + +1. Unload ib_srpt + $ modprobe -r ib_srpt +2. Unload scst and its dev_handlers + $ modprobe -r scst_vdisk scst +3. Unload ofed + $ /etc/rc.d/openibd stop + +=========================================================================== +Known Issues +=========================================================================== + +- With active connections/sessions and active I/Os, unloading the ib_srpt + driver will randomly fail and get stuck. + +- With active connections/sessions and active I/Os, rebooting the system will + randomly get stuck.
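The srpt.sh example above repeats one "open" and one "add" command per exported
device. A minimal sketch of generating those commands for an arbitrary device
list follows. The function name scst_blockio_cmds is hypothetical (it is not
part of SCST or OFED), vdisks and luns are simply numbered from 0 in argument
order, and the function only prints the commands so that they can be reviewed
before being run as root on the target.

```shell
#!/bin/sh
# scst_blockio_cmds DEV...
# Hypothetical helper: print the BLOCKIO "open" commands for every device,
# then the matching "add" commands, in the same order as srpt.sh.
# Nothing is written to /proc here; the output is plain text.
scst_blockio_cmds() {
    i=0
    for dev in "$@"; do
        printf 'echo "open vdisk%d %s BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk\n' "$i" "$dev"
        i=$((i + 1))
    done
    i=0
    for dev in "$@"; do
        # lun number equals vdisk number, so the first device gets lun 0
        printf 'echo "add vdisk%d %d" > /proc/scsi_tgt/groups/Default/devices\n' "$i" "$i"
        i=$((i + 1))
    done
}
```

For instance, scst_blockio_cmds /dev/sdb /dev/sdc prints two "open" lines
followed by two "add" lines. Printing rather than executing keeps the sketch
safe to try on a machine without SCST loaded.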
+ diff --git a/readme_and_howto/ib-bonding.txt b/readme_and_howto/ib-bonding.txt new file mode 100644 index 0000000..1727d6b --- /dev/null +++ b/readme_and_howto/ib-bonding.txt @@ -0,0 +1,191 @@ +IB Bonding +=============================================================================== + +1. Introduction +2. How to work with interface configuration scripts +2.1 Configuration with initscripts support +2.1.1 Writing network scripts under Redhat-AS4 (Update 4, 5, 6 or 7) +2.1.2 Writing network scripts under Redhat-EL5 +2.2 Configuration with sysconfig support +2.2.1 Writing network scripts under SLES-10 +2.3 Configuring Ethernet slaves + +1. Introduction +------------------------------------------------------------------------------- +ib-bonding is a High Availability solution for IPoIB interfaces. It is based +on the Linux Ethernet Bonding Driver and was adapted to work with IPoIB. +However, the support for IPoIB interfaces is only for the active-backup +mode; other modes should not be used. + +2. How to work with interface configuration scripts +------------------------------------------------------------------------------- +To create an interface configuration script for the ibX and bondX interfaces, +you should use the standard syntax (depending on your OS). + +2.1 Configuration with initscripts support +------------------------------------------ +Note: This feature is available only for Redhat-AS4 (Update 4, Update 5, +Update 6 or Update 7) and for Redhat-EL5 and above.
+ +2.1.1 Writing network scripts under Redhat-AS4 (Update 4, 5, 6 or 7) +----------------------------------------------------------------- +* In the master (bond) interface script add the lines: +TYPE=Bonding +MTU= + +Example: for bond0 (master) the file is named /etc/sysconfig/network-scripts/ifcfg-bond0 +with the following text in the file: + +DEVICE=bond0 +IPADDR=192.168.1.1 +NETMASK=255.255.255.0 +NETWORK=192.168.1.0 +BROADCAST=192.168.1.255 +ONBOOT=yes +BOOTPROTO=none +USERCTL=no +TYPE=Bonding +MTU=65520 + +Note: 65520 is a valid mtu value only if all IPoIB slaves operate in connected +mode and are configured with the same value. For IPoIB slaves that work in +datagram mode, use MTU=2044. If you don't set the correct mtu, or don't set the +mtu at all (letting it take the default value), performance of the +interface might decrease. + +* In the slave (ib) interface script put the following lines: +SLAVE=yes +MASTER= +TYPE=InfiniBand +PRIMARY= + +Example: the script for ib0 (slave) would be named /etc/sysconfig/network-scripts/ifcfg-ib0 +with the following text in the file: + +DEVICE=ib0 +USERCTL=no +ONBOOT=yes +MASTER=bond0 +SLAVE=yes +BOOTPROTO=none +TYPE=InfiniBand +PRIMARY=yes + +Note: If the slave interface is not primary then the line PRIMARY= is not +required and can be omitted. + +After the configuration is saved, restart the network service by running: +/etc/init.d/network restart + +2.1.2 Writing network scripts under Redhat-EL5 +---------------------------------------------- +Follow the instructions in 2.1.1 (Writing network scripts under Redhat-AS4) +with the following changes: +* In the bondX (master) script - the line TYPE=Bonding is not needed.
+* In the bondX (master) script - you may add to the configuration more options +with the following line +BONDING_OPTS=" primary=ib0 updelay=0 downdelay=0" +* in the ibX (slave) script - the line TYPE=InfiniBand is necessary when using + bonding over devices configured with partitions (p_key) +Example: + ifcfg-ibX.8003 and ifcfg-ibY.8003 must include TYPE=InfiniBand line in + their configuration files, when using as slaves for bondX device +* in /etc/modprobe.conf add the following lines +alias bond0 bonding +options bond0 miimon=100 mode=1 max_bonds=1 + +If you want more than one bonding interface, name them bond1, bond2... and +just add the necessary lines in /etc/modprobe.conf and change max_bonds=1 to +max_bonds=N where N=number_of_bonding_interfaces + +Note: restarting OFED doesn't keep the bonding configuration via initscripts. +You have to restart the network service in order to recreate the bonding +interface. + +2.2 Configuration with sysconfig support +---------------------------------------- +Note: This feature is available only for SLES-10 and above. + +2.2.1 Writing network scripts under SLES-10 +----------------------------------------------- +* In the master (bond) interface script add the lines: + +BONDING_MASTER=yes +BONDING_MODULE_OPTS="mode=active-backup miimon=" +BONDING_SLAVE0=slave0 +BONDING_SLAVE1=slave1 +MTU= + +Example: for bond0 (master) the file is named /etc/sysconfig/network/ifcfg-bond0 +with the following text in the file: + +BOOTPROTO="static" +BROADCAST="10.0.2.255" +IPADDR="10.0.2.10" +NETMASK="255.255.0.0" +NETWORK="10.0.2.0" +REMOTE_IPADDR="" +STARTMODE="onboot" +BONDING_MASTER="yes" +BONDING_MODULE_OPTS="mode=active-backup miimon=100 primary=ib0 updelay=0 downdelay=0" +BONDING_SLAVE0=ib0 +BONDING_SLAVE1=ib1 +MTU=65520 + +Note: 65520 is a valid mtu value only if all IPoIB slaves operate in connected +mode and are configured with the same value. For IPoIB slaves that work in +datagram mode, use MTU=2044.
If you don't set the correct mtu, or don't set the mtu at +all (letting it take the default value), performance of the +interface might decrease. + +Note: primary, downdelay and updelay are optional bonding interface +settings. You may choose to use them, change them or delete them from the +configuration script (by editing the line that starts with BONDING_MODULE_OPTS) + +* The slave (ib) interface script should look like this: + +BOOTPROTO='none' +STARTMODE='off' +PRE_DOWN_SCRIPT=/etc/sysconfig/network/unenslave.sh + +After the configuration is saved, restart the network service by running: +/etc/init.d/network restart + +2.3 Configuring Ethernet slaves +------------------------------- +It is not possible to have a mix of Ethernet slaves and IPoIB slaves under the +same bonding master. It is possible however that a bonding master of Ethernet +slaves and a bonding master of IPoIB slaves will co-exist in one machine. +To configure Ethernet slaves under a bonding master use the following +instructions (depending on the OS) + +* Under Redhat-AS4 + +Use the same instructions as for IPoIB slaves with the following exceptions + +- In the master configuration file add the line +SLAVEDEV=1 +- In the slave configuration file leave the line +TYPE=InfiniBand +- For Ethernet, it is possible to set parameters of the bonding module in /etc/modprobe.conf +with the following line for example +options bonding miimon=100 mode=1 primary=eth0 +Note that alias names for the bonding module (such as bond0) may not work. + +* Under Redhat-AS5 + +No special instructions are required. + +* Under SLES10 + +When using both types of bonding, it is necessary to update the +MANDATORY_DEVICES environment variable in /etc/sysconfig/network/config with the names +of the InfiniBand devices (ib0, ib1, etc.). Otherwise, bonding devices will be created +before InfiniBand devices at boot time.
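As an illustration of the SLES10 step, the fragment below is a minimal sketch
of the resulting entry. It assumes two IPoIB slaves named ib0 and ib1
(substitute the device names on your system) and assumes the usual sysconfig
convention of a space-separated list inside a shell variable assignment.

```shell
# /etc/sysconfig/network/config (fragment)
# Assumed example: ib0 and ib1 are the InfiniBand devices that must exist
# before the bonding devices are created at boot time.
MANDATORY_DEVICES="ib0 ib1"
```

If the variable already lists other devices, the InfiniBand names would be
appended to the existing value rather than replacing it.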
+ +Note: If there is more than one Ethernet NIC installed then there might be a +race for the interface name eth0, eth1 etc. This may lead to unexpected +relation between logical and physical devices which may lead to wrong bonding +configuration. This issue may be solved by binding a logical device name (e.g. +eth0) to a physical (hardware) device by specifying the MAC address in the +ethN configuration file. diff --git a/readme_and_howto/qlgc_vnic.cfg.sample b/readme_and_howto/qlgc_vnic.cfg.sample new file mode 100644 index 0000000..58734a6 --- /dev/null +++ b/readme_and_howto/qlgc_vnic.cfg.sample @@ -0,0 +1,186 @@ +# QLogic VNIC configuration file +# +# This file documents and describes the use of the +# VNIC configuration file qlgc_vnic.cfg. This file +# should reside in /etc/infiniband/qlgc_vnic.cfg +# +# +# Knowing how to fill the configuration file +############################################### +# +# For filling the configuration file you need to know +# some information about your EVIC/VEx device. This information +# can be obtained with the help of the ib_qlgc_vnic_query tool. +# "ib_qlgc_vnic_query -es" command will give DGID, IOCGUID and IOCSTRING information about +# the EVIC/VEx IOCs that are available through port 1 and +# "ib_qlgc_vnic_query -es -d /dev/infiniband/umad1" will give information about +# the EVIC/VEX IOCs available through port 2. +# +# Refer to the README for more information about the ib_qlgc_vnic_query tool. +# +# +# General structure of the configuration file +############################################### +# +# All lines beginning with a # are treated as comments. +# +# A simple configuration file consists of CREATE commands +# for each VNIC interface to be created. 
+# +# A simple CREATE command looks like this: +# +# {CREATE; NAME="eioc1"; +# DGID=0xfe8000000000000000066a0258000001; IOCGUID=0x66a0130000001; IOCSTRING="Chassis 0x00066A00010003F2, Slot 1, IOC 1"; +# } +# +#Where +# +#NAME - The device name for the interface +# +#DGID - The DGID of the IOC to use. +# +# If DGID is specified then IOCGUID MUST also be specified. +# +# Though specifying DGID is optional, using this option is recommended, +# as it will provide the quickest way of starting up the VNIC service. +# +# +#IOCGUID - The GUID of the IOC to use. +# +#IOCSTRING - The IOC Profile ID String of the IOC to use. +# +# Either an IOCGUID or an IOCSTRING MUST always be specified. +# +# If DGID is specified then IOCGUID MUST also be specified. +# +# If no DGID is specified and both IOCGUID and IOCSTRING are specified +# then IOCSTRING is given preference and the DGID of the IOC whose +# IOCSTRING is specified is used to create the VNIC interface. +# +# If hotswap capability of EVIC/VEx is to be used, then IOCSTRING +# must be specified. +# +#INSTANCE - Defaults to 0. Range 0-255. If a host will connect to the +# same IOC more than once, each connection must be assigned a unique +# number. +# +# +#RX_CSUM - defaults to TRUE. When true, indicates that the receive checksum +# should be done by the EVIC/VEx +# +#HEARTBEAT - defaults to 100. Specifies the time in 1/100'ths of a second +# between heartbeats +# +#PORT - Specification for local HCA port. First port is 1. +# +#HCA - Optional HCA specification for use with PORT specification. First HCA is 0. +# +#PORTGUID - The PORTGUID of the IB port to use. +# +# Use of PORTGUID for configuring the VNIC interface has an +# advantage on hosts having more than 1 HCAs plugged in. As +# PORTGUID is persistent for given IB port, VNIC configurations +# would be consistent and reliable - unaffected by restarts of +# OFED IB stack on host having more than 1 HCAs plugged in. 
+# +# On the downside, if HCA on the host is changed, VNIC interfaces +# configured with PORTGUID needs reconfiguration. +# +#IB_MULTICAST - Controls enabling or disabling of IB multicast feature on VNIC. +# Defaults to TRUE implying IB multicast is enabled for +# the interface. To disable IB multicast, set it to FALSE. +# +# Example of DGID and IOCGUID based configuration (this configuration will give +# the quickest start up of VNIC service): +# +# {CREATE; NAME="eioc1"; +# DGID=0xfe8000000000000000066a0258000001; IOCGUID=0x66a0130000001; +# } +# +# +# Example of IOCGUID based configuration: +# +# {CREATE; NAME="eioc1"; IOCGUID=0x66A013000010C; +# RX_CSUM=TRUE; +# HEARTBEAT=100; } +# +# Example of IOCSTRING based configuration: +# +# {CREATE; NAME="eioc1"; IOCSTRING="Chassis 0x00066A0050000018, Slot 2, IOC 1"; +# RX_CSUM=TRUE; +# HEARTBEAT=100; } +# +# +#Failover configuration: +######################### +# +# It is possible to create a VNIC interface with failover configuration +# by using the PRIMARY and SECONDARY commands. The IOC specified in +# the PRIMARY command will be used as the primary IOC for this interface +# and the IOC specified in the SECONDARY command will be used as the +# fail-over backup in case the connection with the primary IOC fails +# for some reason. +# +# PRIMARY and SECONDARY commands are written in the following way: +# +# PRIMARY={DGID=...;IOCGUID=...; IOCSTRING=...;INSTANCE=... } - +# IOCGUID, and INSTANCE must be values that are unique to the primary interface +# +# SECONDARY={DGID=...;IOCGUID=...; INSTANCE=... } - +# IOCGUID, and INSTANCE must be values that are unique to the secondary interface +# +# OR it can also be specified without using DGID, like this: +# +# PRIMARY={IOCGUID=...; INSTANCE=... } - IOCGUID may be substituted with +# IOCSTRING. IOCGUID, IOCSTRING, and INSTANCE must be values that are +# unique to the primary interface +# +# SECONDARY={IOCGUID=...; INSTANCE=... 
} - bring up a secondary connection for +# fail-over. IOCGUID may be substituted with IOCSTRING. IOCGUID, IOCSTRING, +# and INSTANCE values to be used for the secondary connection +# +# +#Examples of failover configuration: +# +#{CREATE; NAME="veth1"; +# PRIMARY={ DGID=0xfe8000000000000000066a0258000001; IOCGUID=0x66a0130000001; IOCSTRING="Chassis 0x00066A00010003F2, Slot 1, IOC 1"; +# INSTANCE=1; PORT=1; } +# SECONDARY={DGID=0xfe8000000000000000066a0258000001; IOCGUID=0x66a0230000001; IOCSTRING="Chassis 0x00066A00010003F2, Slot 1, IOC 2"; +# INSTANCE=1; PORT=2; } +#} +# +# {CREATE; NAME="eioc2"; +# PRIMARY = {IOCGUID=0x66A0130000105; INSTANCE=0; PORT=1; } +# SECONDARY = {IOCGUID=0x66A013000010C; INSTANCE=0; PORT=2; } +# } +# +#Example of configuration with IB_MULTICAST +# +# {CREATE; NAME="eioc2"; +# PRIMARY = {IOCGUID=0x66A0130000105; INSTANCE=0; PORT=1; IB_MULTICAST=FALSE; } +# SECONDARY = {IOCGUID=0x66A013000010C; INSTANCE=0; PORT=2; IB_MULTICAST=FALSE; } +# } +# +# Example of HCA/PORT and PORTGUID configurations: +# { +# CREATE; NAME="veth1"; +# PRIMARY={IOCGUID=00066a02de000070; INSTANCE=1; PORTGUID=0x0002c903000010f5; } +# SECONDARY={IOCGUID=00066a02de000070; INSTANCE=2; PORTGUID=0x0002c903000010f6; } +# } +# +# { +# CREATE; NAME="veth2"; +# PRIMARY={IOCGUID=00066a02de000070; DGID=fe8000000000000000066a21de000070; INSTANCE=3; HCA=1; PORT=2; } +# SECONDARY={IOCGUID=00066a02de000070; DGID=fe8000000000000000066a21de000070; INSTANCE=4; HCA=0; PORT=1; } +# } +# +# { +# CREATE; NAME="veth3"; +# IOCSTRING="EVIC in Chassis 0x00066a00db00001e, Slot 1, Ioc 2"; +# INSTANCE=5; PORTGUID=0x0002c90300000786; +# } +# { +# CREATE; NAME="veth4"; +# IOCGUID=00066a02de000070; +# INSTANCE=6; HCA=1; PORT=2; +# } diff --git a/release_notes/cxgb3_release_notes.txt b/release_notes/cxgb3_release_notes.txt new file mode 100644 index 0000000..9a993c8 --- /dev/null +++ b/release_notes/cxgb3_release_notes.txt @@ -0,0 +1,352 @@ + Open Fabrics Enterprise Distribution (OFED) + CHELSIO T3 RNIC
RELEASE NOTES + September 2010 + + +The iw_cxgb3 and cxgb3 modules provide RDMA and NIC support for the +Chelsio S series adapters. Make sure you choose the 'cxgb3' and +'libcxgb3' options when generating your ofed rpms. + +============================================ +New for ofed-1.5.2 +============================================ + +- Bug fixes. Various upstream bug fixes have been included in this +release. + +============================================ +Enabling Various MPIs +============================================ + +For OpenMPI, Intel MPI, HP MPI, and Scali MPI: you must set the iw_cxgb3 +module option peer2peer=1 on all systems. This can be done by writing +to the /sys/module file system during boot. EG: + +# echo 1 > /sys/module/iw_cxgb3/parameters/peer2peer + +Or you can add the following line to /etc/modprobe.conf to set the option +at module load time: + +options iw_cxgb3 peer2peer=1 + +For Intel MPI, HP MPI, and Scali MPI: Enable the chelsio device by adding +an entry to /etc/dat.conf for the chelsio interface. For instance, +if your chelsio interface name is eth2, then the following line adds +a DAT version 1.2 and 2.0 devices named "chelsio" and "chelsio2" for +that interface: + +chelsio u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth2 0" "" +chelsio2 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" "" + +============= +Intel MPI: +============= + +The following env vars enable Intel MPI version 3.1.038. Place these +in your user env after installing and setting up Intel MPI: + +export RSH=ssh +export DAPL_MAX_INLINE=64 +export I_MPI_DEVICE=rdssm:chelsio +export MPIEXEC_TIMEOUT=180 +export MPI_BIT_MODE=64 + +Logout & log back in. + +Populate mpd.hosts with node names. +Note: The hosts in this file should be Chelsio interface IP addresses. + +Note: I_MPI_DEVICE=rdssm:chelsio assumes you have an entry in +/etc/dat.conf named "chelsio". 
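The note above depends on /etc/dat.conf containing an entry whose name is
"chelsio". One way to check that before launching a job is sketched below.
The helper name has_dat_entry is hypothetical, and the check assumes only what
the example entries show: that the entry name is the first whitespace-separated
field of a non-comment line.

```shell
#!/bin/sh
# has_dat_entry NAME [FILE]
# Hypothetical helper: succeed (exit status 0) if FILE, /etc/dat.conf by
# default, has a line whose first field is exactly NAME. Lines starting
# with '#' never match because their first field begins with '#'.
has_dat_entry() {
    name=$1
    file=${2:-/etc/dat.conf}
    awk -v n="$name" '$1 == n { found = 1 } END { exit !found }' "$file"
}
```

For example, has_dat_entry chelsio || echo "no chelsio entry in /etc/dat.conf"
reports a missing entry before I_MPI_DEVICE=rdssm:chelsio is relied upon.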
+ +Note: MPIEXEC_TIMEOUT value might be required to increase if heavy traffic +is going across the systems. + +Contact Intel for obtaining their MPI with DAPL support. + +To run Intel MPI applications: + + mpdboot -n -r ssh --ncpus= + mpiexec -ppn -n + + +============= +HP MPI: +============= + +The following env vars enable HP MPI version 2.03.01.00. Place these +in your user env after installing and setting up HP MPI: + +export MPI_ROOT=/opt/hpmpi +export PATH=$MPI_ROOT/bin:/opt/bin:$PATH +export MANPATH=$MANPATH:$MPI_ROOT/share/man + +Log out & log back in. + +To run HP MPI applications, use these mpirun options: + +-prot -e DAPL_MAX_INLINE=64 -UDAPL + +EG: + +$ mpirun -prot -e DAPL_MAX_INLINE=64 -UDAPL -hostlist r1-iw,r2-iw ~/tests/presta-1.4.0/glob + +Where r1-iw and r2-iw are hostnames mapping to the chelsio interfaces. + +Also this assumes your first entry in /etc/dat.conf is for the chelsio +device. + +Contact HP for obtaining their MPI with DAPL support. + +============= +Scali MPI: +============= + +The following env vars enable Scali MPI. Place these in your user env +after installing and setting up Scali MPI for running over IWARP: + +export DAPL_MAX_INLINE=64 +export SCAMPI_NETWORKS=chelsio +export SCAMPI_CHANNEL_ENTRY_COUNT="chelsio:128" + +Log out & log back in. + +Note: SCAMPI_NETWORKS=chelsio assumes you have an entry in /etc/dat.conf +named "chelsio". + +Note: SCAMPI supports only dapl 1.2 library not dapl 2.0 + +Contact Scali for obtaining their MPI with DAPL support. + +To run SCALI MPI applications: + + mpimon -- + +Note: is the number of processes to run on the node Note: + should be the IP of Chelsio's interface + +============= +OpenMPI: +============= + +OpenMPI iWARP support is only available in OpenMPI version 1.3 or greater. + +Open MPI will work without any specific configuration via the openib btl. +Users wishing to performance tune the configurable options may wish to +inspect the receive queue values. 
Those can be found in the "Chelsio T3" +section of mca-btl-openib-hca-params.ini. + +Note: OpenMPI version 1.3 does not support newer Chelsio card with device +ID 0x0035 and 0x0036. To use those cards add the device id of the cards +in the "Chelsio T3" section of mca-btl-openib-hca-params.ini file. + +To run OpenMPI applications: + + mpirun --host , -mca btl openib,sm,self + +============= +MVAPICH2: +============= + +The following env vars enable MVAPICH2 version 1.4-2. Place these +in your user env after installing and setting up MVAPICH2 MPI: + +export MVAPICH2_HOME=/usr/mpi/gcc/mvapich2-1.4/ +export MV2_USE_IWARP_MODE=1 +export MV2_USE_RDMA_CM=1 + +On each node, add this to the end of /etc/profile. + + ulimit -l 999999 + +On each node, add this to the end of /etc/init.d/sshd and restart sshd. + + ulimit -l 999999 + % service sshd restart + +Verify the ulimit changes worked. These should show '999999': + + % ulimit -l + % ssh ulimit -l + +Note: You may have to restart sshd a few times to get it to work. + +Create mpd.hosts with list of hostname or ipaddrs in the cluster. They +should be names/addresses that you can ssh to without passwords. (See +Passwordless SSH Setup). + +On each node, create /etc/mv2.conf with a single line containing the +IP address of the local T3 interface. This is how MVAPICH2 picks which +interface to use for RDMA traffic. + +On each node, edit /etc/hosts file. Comment the entry if there is an +entry with 127.0.0.1 IP Address and local host name. Add an entry for +corporate IP address and local host name (name that you have given in +mpd.hosts file) in /etc/hosts file. + +To run MVAPICH2 application: + + mpirun_rsh -ssh -np 8 -hostfile mpd.hosts + +============================================ +Loadable Module options: +============================================ + +The following options can be used when loading the iw_cxgb3 module to +tune the iWARP driver: + +cong_flavor - set the congestion control algorithm. Default is 1. 
+ 0 == Reno + 1 == Tahoe + 2 == NewReno + 3 == HighSpeed + +snd_win - set the TCP send window in bytes. Default is 32kB. + +rcv_win - set the TCP receive window in bytes. Default is 256kB. + +crc_enabled - set whether MPA CRC should be negotiated. Default is 1. + +markers_enabled - set whether to request receiving MPA markers. Default is + 0; do not request to receive markers. + + NOTE: The Chelsio RNIC fully supports markers, but + the current OFA RDMA-CM doesn't provide an API for + requesting either markers or crc to be negotiated. Thus + this functionality is provided via module parameters. + +mpa_rev - set the MPA revision to be used. Default is 1, which is + spec compliant. Set to 0 to connect with the Ammasso 1100 + rnic. + +ep_timeout_secs - set the number of seconds for timing out MPA start up + negotiation and normal close. Default is 60. + +peer2peer - Enables connection setup changes to allow peer2peer + applications to work over chelsio rnics. This enables + the following applications: + Intel MPI + HP MPI + Open MPI + Scali MPI + MVAPICH2 + Set peer2peer=1 on all systems to enable these + applications. + +The following options can be used when loading the cxgb3 module to +tune the NIC driver: + +msi - whether to use MSI or MSI-X. Default is 2. + 0 = only pin + 1 = only MSI or pin + 2 = use MSI/X, MSI, or pin, based on system + +============================================ +Updating Firmware: +============================================ + +This release requires firmware version 7.10.0, and Protocol SRAM +version 1.1.0. These versions are included in the ofed-1.5.2 release +and will be automatically loaded when the cxgb3 module is loaded and +the interface configured. To load later/newer versions of the firmware, +follow this procedure: + +If your distro/kernel supports firmware loading, you can place the chelsio +firmware and psram images in /lib/firmware/cxgb3, then unload and reload +the cxgb3 module to get the new images loaded. 
If this does not work, +then you can load the firmware images manually: + +Obtain the cxgbtool tool and the update_eeprom.sh script from Chelsio. + +To build cxgbtool: + +# cd +# make && make install + +Then load the cxgb3 driver: + +# modprobe cxgb3 + +Now note the ethernet interface name for the T3 device. This can be +done by typing 'ifconfig -a' and noting the interface name for the +interface with a HW address that begins with "00:07:43". Then load the +new firmware and eeprom file: + +# cxgbtool ethxx loadfw +# update_eeprom.sh ethxx +# reboot + +============================================ +Testing connectivity with ping and rping: +============================================ + +Configure the ethernet interfaces for your cxgb3 device. After you +modprobe iw_cxgb3 you will see one or two ethernet interfaces for the +T3 device. Configure them with an appropriate ip address, netmask, etc. +You can use the Linux ping command to test basic connectivity via the +T3 interface. + +To test RDMA, use the rping command that is included in the librdmacm-utils +rpm: + +On the server machine: + +# rping -s -a 0.0.0.0 -p 9999 + +On the client machine: + +# rping -c -VvC10 -a server_ip_addr -p 9999 + +You should see ping data like this on the client: + +ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr +ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrs +ping data: rdma-ping-2: CDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrst +ping data: rdma-ping-3: DEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu +ping data: rdma-ping-4: EFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuv +ping data: rdma-ping-5: FGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvw +ping data: rdma-ping-6: GHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwx +ping data: rdma-ping-7: HIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy +ping data: rdma-ping-8: IJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz +ping data: rdma-ping-9: 
JKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA +client DISCONNECT EVENT... +# + +============================================ +Additional Notes and Issues +============================================ + +1) To run uDAPL over the chelsio device, you must export this environment +variable: + + export DAPL_MAX_INLINE=64 + +2) If you have a multi-homed host and the physical ethernet networks +are bridged, or if you have multiple chelsio rnics in the system, then +you need to configure arp to only send replies on the interface with +the target ip address: + + sysctl -w net.ipv4.conf.all.arp_ignore=2 + +3) If you are building OFED against a kernel.org kernel later than +2.6.20, then make sure your kernel is configured with the cxgb3 and +iw_cxgb3 modules enabled. This forces the kernel to pull in the genalloc +allocator, which is required for the OFED iw_cxgb3 module. Make sure +these config options are included in your .config file: + + CONFIG_CHELSIO_T3=m + CONFIG_INFINIBAND_CXGB3=m + +4) If you run the RDMA latency test using the ib_rdma_lat program, make +sure you use the following command lines to limit the amount of inline +data to 64: + + server: ib_rdma_lat -c -I 64 + client: ib_rdma_lat -c -I 64 server_ip_addr + +5) If you're running NFSRDMA over Chelsio's T3 RNIC and your clients are +using a 64KB page size (like PPC64 and IA64 systems) and your server is +using a 4KB page size (like i386 and X86_64), then you need to mount the +server using rsize=32768,wsize=32768 to avoid overrunning the Chelsio +RNIC fast register limits. This is a known firmware limitation in the +Chelsio RNIC.
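Note 3 above can be checked before building. The following is a minimal
sketch and is not part of OFED: missing_kconfig_opts is a hypothetical helper
name, and the path to the kernel .config depends on your build tree. It prints
every option passed to it that is not set to =y or =m in the given file.

```shell
# Hypothetical helper (not shipped with OFED).
# Usage: missing_kconfig_opts CONFIG_FILE OPTION...
# Prints each OPTION that is neither built in (=y) nor a module (=m).
missing_kconfig_opts() {
    kconfig_file=$1
    shift
    for kconfig_opt in "$@"; do
        # A "# CONFIG_X is not set" line or an absent option counts as missing.
        if ! grep -Eq "^${kconfig_opt}=(y|m)\$" "$kconfig_file"; then
            printf '%s\n' "$kconfig_opt"
        fi
    done
}
```

For example, running it with the path to your .config and CONFIG_CHELSIO_T3
prints nothing when the cxgb3 option is enabled. Pass the iw_cxgb3 option from
note 3 the same way.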
diff --git a/release_notes/diags_release_notes.txt b/release_notes/diags_release_notes.txt new file mode 100644 index 0000000..9db4cbf --- /dev/null +++ b/release_notes/diags_release_notes.txt @@ -0,0 +1,89 @@ + Open Fabrics Enterprise Distribution (OFED) + Diagnostic Tools in OFED 1.5 Release Notes + + December 2009 + + +Repo: git://git.openfabrics.org/~sashak/management/management.git +URL: http://www.openfabrics.org/downloads/management + + +General +------- +Model of operation: All diag utilities use direct MAD access to perform their +operations. Operations that require QP0 mads only may use direct routed +mads, and therefore can work even in unconfigured subnets. Almost all +utilities can operate without accessing the SM, unless GUID to lid translation +is required. The only exception to this is saquery which requires the SM. + + +Dependencies +------------ +Most diag utilities depend on libibmad and libibumad. +All diag utilities depend on the ib_umad kernel module. + + +Multiple port/Multiple CA support +--------------------------------- +When no IB device or port is specified (see the "local umad parameters" below), +the libibumad library selects the port to use by the following criteria: +1. the first port that is ACTIVE. +2. if not found, the first port that is UP (physical link up). + +If a port and/or CA name is specified, the libibumad library attempts to +satisfy the user request, and will fail if it cannot do so. + +For example: + ibaddr # use the 'best port' + ibaddr -C mthca1 # pick the best port from mthca1 only. + ibaddr -P 2 # use the second (active/up) port from the + first available IB device. + ibaddr -C mthca0 -P 2 # use the specified port only. + + +Common options & flags +---------------------- +Most diagnostics take the following flags. The exact list of supported +flags per utility can be found in the usage message and can be displayed +using util_name -h syntax. + +# Debugging flags + -d raise the IB debugging level. 
May be used + several times (-ddd or -d -d -d). + -e show umad send receive errors (timeouts and others) + -h display the usage message + -v increase the application verbosity level. + May be used several times (-vv or -v -v -v) + -V display the internal version info. + +# Addressing flags + -D use directed path address arguments. The path + is a comma separated list of out ports. + Examples: + "0" # self port + "0,1,2,1,4" # out via port 1, then 2, ... + -G use GUID address arguments. In most cases, it is the Port GUID. + Examples: + "0x08f1040023" + -s use 'smlid' as the target lid for SA queries. + +# Local umad parameters: + -C use the specified ca_name. + -P use the specified ca_port. + -t override the default timeout for the solicited mads. + + +CLI notation +------------ +All utilities use the POSIX style notation, meaning that all options (flags) +must precede all arguments (parameters). + + +Utilities descriptions +---------------------- +See man pages + + +Bugs Fixed +---------- + diff --git a/release_notes/ehca_release_notes.txt b/release_notes/ehca_release_notes.txt new file mode 100644 index 0000000..e1ca30b --- /dev/null +++ b/release_notes/ehca_release_notes.txt @@ -0,0 +1,113 @@ + + Open Fabrics Enterprise Distribution (OFED) + ehca in OFED 1.5.2 Release Notes + + September 2010 + + +Overview +-------- +ehca is the low level driver implementation for all IBM GX-based HCAs. 
+ +Supported HCAs +-------------- +- GX Dual-port SDR 4x IB HCA +- GX Dual-port SDR 12x IB HCA +- GX Dual-port DDR 4x IB HCA +- GX Dual-port DDR 12x IB HCA + +Available Parameters +-------------------- +In order to set ehca parameters, add the following line(s) to /etc/modprobe.conf: + + options ib_ehca = + +whereby is one of the following items: +- debug_level debug level (0: no debug traces (default), 1: with debug traces) +- port_act_time time to wait for port activation (default: 30 sec) +- scaling_code scaling code (0: disable (default), 1: enable) +- open_aqp1 Open AQP1 on startup (default: no) (bool) +- hw_level Hardware level (0: autosensing (default), 0x10..0x14: eHCA, 0x20..0x23: eHCA2) (int) +- nr_ports number of connected ports (-1: autodetect (default), 1: port one only, 2: two ports) (int) +- use_hp_mr Use high performance MRs (default: no) (bool) +- poll_all_eqs Poll all event queues periodically (default: yes) (bool) +- static_rate Set permanent static rate (default: no static rate) (int) +- lock_hcalls Serialize all hCalls made by the driver (default: autodetect) (bool) +- number_of_cqs Max number of CQs which can be allocated (default: autodetect) (int) +- number_of_qps Max number of QPs which can be allocated (default: autodetect) (int) + +New Features +------------ +- None + +Fixed Bugs ofed-1.5.2 +--------------------- +- Fixed automatic detection if hcall locks should be enabled or not + +Fixed Bugs ofed-1.5.1 +--------------------- +- Fixed crash when reading sysfs performance counters +- Do not disable IRQs when processing EQs +- Allow query of max_dest_rd_atomic and max_qp_rd_atomic values + +Fixed Bugs ofed-1.5 +--------------------- +- SRQ overflow prevention +- Performance improvements for QP creation +- MAD redirection fix + +Fixed Bugs ofed-1.4.1 +--------------------- +- none + +Fixed Bugs ofed-1.4 +--------------------- +- Reject send work requests only for RESET, INIT and RTR state +- Reject receive work requests if QP is in RESET 
state +- In case of lost interrupts, trigger EOI to reenable interrupts +- Filter PATH_MIG events if QP was never armed +- Release mutex in error path of alloc_small_queue_page() +- Check idr_find() return value +- Discard double CQE for one WR +- Generate flush status CQ entries +- Don't allow creating UC QP with SRQ +- Fix reported max number of QPs and CQs in systems with >1 adapter +- Reject dynamic memory add/remove when ehca adapter is present +- Remove reference to special QP in case of port activation failure +- Fix locking for shca_list_lock + +Fixed Bugs ofed-1.3.1 +--------------------- +- Support all ibv_devinfo values in query_device() and query_port() +- Prevent posting of SQ WQEs if QP not in RTS +- Remove mr_largepage parameter, ie always enable large page support +- Allocate event queue size depending on max number of CQs and QPs +- Protect QP against destroying until all async events for it are handled + +Fixed Bugs ofed-1.3 +------------------- +- Serialize HCA-related hCalls if necessary +- Fix static rate if path faster than link +- Return physical link information in query_port() +- Fix clipping of device limits to INT_MAX +- Fix issues related to path migration support +- Support more than 4k QPs for userspace and kernelspace +- Prevent sending UD packets to QP0 +- Prevent RDMA-related connection failures on some eHCA2 hardware + +Available backports +------------------- +- RedHat EL5 up4: 2.6.18-164.ELsmp +- RedHat EL5 up5: 2.6.18-194.ELsmp +- SLES11: 2.6.27.19-5.1-smp +- SLES11SP1: 2.6.32.12-0.7-default +- SLES10SP3: 2.6.16.60-0.54.5 +- kernel.org: 2.6.29-32 + +Known Issues +------------ +1. The port(s) needs to be connected to an active switch port while +loading the ehca device driver. + +2. Dynamic memory operations are tolerated by ehca, but are prevented by +the driver while it is loaded. 
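The options line described under "Available Parameters" can be generated and
checked with a small helper. This is a sketch and is not part of the ehca
driver: ehca_options_line is a hypothetical function that accepts only the
parameter names listed in these notes and prints a line in the modprobe
"options" format.

```shell
# Hypothetical helper (not shipped with OFED).
# Usage: ehca_options_line name=value [name=value ...]
ehca_options_line() {
    ehca_line='options ib_ehca'
    for ehca_kv in "$@"; do
        # Every argument must have the form name=value.
        case $ehca_kv in
            *=*) ;;
            *) echo "expected name=value, got: $ehca_kv" >&2; return 1 ;;
        esac
        # Accept only parameter names listed in these release notes.
        case ${ehca_kv%%=*} in
            debug_level|port_act_time|scaling_code|open_aqp1|hw_level|nr_ports) ;;
            use_hp_mr|poll_all_eqs|static_rate|lock_hcalls|number_of_cqs|number_of_qps) ;;
            *) echo "unknown ehca parameter: ${ehca_kv%%=*}" >&2; return 1 ;;
        esac
        ehca_line="$ehca_line $ehca_kv"
    done
    printf '%s\n' "$ehca_line"
}
```

Calling it as "ehca_options_line debug_level=1 nr_ports=2" prints
"options ib_ehca debug_level=1 nr_ports=2", which can then be appended to
/etc/modprobe.conf. The values themselves are not validated.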
diff --git a/release_notes/ibacm_release_notes.txt b/release_notes/ibacm_release_notes.txt new file mode 100644 index 0000000..4048b39 --- /dev/null +++ b/release_notes/ibacm_release_notes.txt @@ -0,0 +1,144 @@ + Open Fabrics Enterprise Distribution (OFED) + IB ACM in OFED 1.5 Release Notes + + July 2010 + + +=============================================================================== +Table of Contents +=============================================================================== +1. Overview +2. Quick Start Guide +3. Operation Details +4. Known Issues + +=============================================================================== +1. Overview +=============================================================================== +The IB ACM package implements and provides a framework for experimental name, +address, and route resolution services over InfiniBand. It is intended to +address connection setup scalability issues running MPI applications on +large clusters. The IB ACM provides information needed to establish a +connection, but does not implement the CM protocol. + +The librdmacm can invoke IB ACM services when built using the --with-ib_acm +option. The IB ACM services tie in under the rdma_resolve_addr, +rdma_resolve_route, and rdma_getaddrinfo routines. For maximum benefit, +the rdma_getaddrinfo routine should be used, however existing applications +should still see significant connection scaling benefits using the calls +available in librdmacm 1.0.11 and previous releases. + +The IB ACM is focused on being scalable and efficient. The current +implementation limits network traffic, SA interactions, and centralized +services. ACM supports multiple resolution protocols in order to handle +different fabric topologies. + +The IB ACM package is comprised of two components: the ib_acm service +and a test/configuration utility - ib_acme. Both are userspace components +and are available for Linux and Windows. Additional details are given below. 
+ +=============================================================================== +2. Quick Start Guide +=============================================================================== + +1. Prerequisites: libibverbs and libibumad must be installed. + The IB stack should be running with IPoIB configured. + These steps assume that the user has administrative privileges. +2. Install the IB ACM package + This installs ib_acm and ib_acme. +3. Run ib_acme -A -O + This will generate IB ACM address and options configuration files. + (acm_addr.cfg and acm_opts.cfg) +4. Run ib_acm and leave running. + ib_acm will eventually be converted to a service/daemon, but for now + is a userspace application. Because ib_acm uses the libibumad + interfaces, it should be run with administrative privileges. +5. Optionally, run ib_acme -s -d -v + This will verify that the ib_acm service is running. +6. Install librdmacm using the build option --with-ib_acm. + The librdmacm will automatically use the ib_acm service. + On failures, the librdmacm will fall back to normal resolution. + +=============================================================================== +3. Operation Details +=============================================================================== + +ib_acme: +The ib_acme program serves a dual role. It acts as a utility to test +ib_acm operation and help verify if the ib_acm service and selected +protocol are usable for a given cluster configuration. Additionally, +it automatically generates ib_acm configuration files to assist with +or eliminate manual setup. + + +acm configuration files: +The ib_acm service relies on two configuration files. + +The acm_addr.cfg file contains name and address mappings for each IB + endpoint. Although the names in the acm_addr.cfg +file can be anything, ib_acme maps the host name and IP addresses to +the IB endpoints.
+ +The acm_opts.cfg file provides a set of configurable options for the +ib_acm service, such as timeout, number of retries, logging level, etc. +ib_acme generates the acm_opts.cfg file using static information. A +future enhancement would adjust options based on the current system +and cluster size. + + +ib_acm: +The ib_acm service is responsible for resolving names and addresses to +InfiniBand path information and caching such data. It is currently +implemented as an executable application, but is a conceptual service +or daemon that should execute with administrative privileges. + +The ib_acm implements a client interface over TCP sockets, which is +abstracted by the librdmacm library. One or more back-end protocols are +used by the ib_acm service to satisfy user requests. Although the +ib_acm supports standard SA path record queries on the back-end, it +provides an experimental multicast resolution protocol in hope of +achieving greater scalability. The latter is not usable on all fabric +topologies, specifically ones that may not have reversible paths. +Users should use the ib_acme utility to verify that multicast protocol +is usable before running other applications. + +Conceptually, the ib_acm service implements an ARP like protocol and either +uses IB multicast records to construct path record data or queries the +SA directly, depending on the selected route protocol. By default, the +ib_acm services uses and caches SA path record queries. + +Specifically, all IB endpoints join a number of multicast groups. +Multicast groups differ based on rates, mtu, sl, etc., and are prioritized. +All participating endpoints must be able to communicate on the lowest +priority multicast group. The ib_acm assigns one or more names/addresses +to each IB endpoint using the acm_addr.cfg file. Clients provide source +and destination names or addresses as input to the service, and receive +as output path record data. 
+ +The service maps a client's source name/address to a local IB endpoint. +If a client does not provide a source address, then the ib_acm service +will select one based on the destination and local routing tables. If the +destination name/address is not cached locally, it sends a multicast +request out on the lowest priority multicast group on the local endpoint. +The request carries a list of multicast groups that the sender can use. +The recipient of the request selects the highest priority multicast group +that it can use as well and returns that information directly to the sender. +The request data is cached by all endpoints that receive the multicast +request message. The source endpoint also caches the response and uses +the multicast group that was selected to construct or obtain path record +data, which is returned to the client. + +=============================================================================== +4. Known Issues +=============================================================================== + +The current implementation of the IB ACM has several restrictions: +- The ib_acm is limited in its handling of dynamic changes; + the ib_acm must be stopped and restarted if a cluster is reconfigured. +- Cached data does not time out and is only updated if a new resolution + request is received from a different QPN than a cached request. +- Support for IPv6 has not been verified. +- The number of addresses that can be assigned to a single endpoint is + limited to 4. +- The number of multicast groups that an endpoint can support is limited to 2.
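The group negotiation described in section 3 (the recipient picks the highest
priority multicast group that both sides can use) can be sketched as follows.
This is an illustration only, not the ib_acm implementation:
select_mcast_group is a hypothetical function, the group names are arbitrary
labels, and the sender's list is assumed to be ordered from highest to lowest
priority.

```shell
# Illustration only; ib_acm does this internally in its own protocol.
# Usage: select_mcast_group "SENDER_GROUPS" "RECIPIENT_GROUPS"
# Both arguments are space-separated lists of group labels.
select_mcast_group() {
    sender_groups=$1       # assumed ordered, highest priority first
    recipient_groups=$2    # groups the recipient can use, in any order
    for sender_group in $sender_groups; do
        for recipient_group in $recipient_groups; do
            if [ "$sender_group" = "$recipient_group" ]; then
                # First match in the sender's order is the best common group.
                printf '%s\n' "$sender_group"
                return 0
            fi
        done
    done
    # No common group: the endpoints cannot resolve each other this way.
    return 1
}
```

With a sender list of "fast mid slow" and a recipient list of "mid slow", the
function prints "mid". A non-zero return corresponds to the requirement above
that all participating endpoints share at least the lowest priority group.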
+ diff --git a/release_notes/ibutils_release_notes.txt b/release_notes/ibutils_release_notes.txt new file mode 100644 index 0000000..83478cc --- /dev/null +++ b/release_notes/ibutils_release_notes.txt @@ -0,0 +1,74 @@ + Open Fabrics InfiniBand Diagnostic Utilities + -------------------------------------------- + +******************************************************************************* +RELEASE: OFED 1.5 +DATE: Dec 2009 + +=============================================================================== +Table of Contents +=============================================================================== +1. Overview +2. New features +3. Major Bugs Fixed +4. Known Issues + +=============================================================================== +1. Overview +=============================================================================== + +The ibutils package provides a set of diagnostic tools that check the health +of an InfiniBand fabric. + +Package components: +ibis: IB interface - A TCL shell that provides interface for sending various + MADs on the IB fabric. This is the component that actually accesses + the IB Hardware. + +ibdm: IB Data Model - A library that provides IB fabric analysis. + +ibmgtsim: An IB fabric simulator. Useful for developing IB tools. + +ibdiag: This package provides 3 tools which provide the user interface + to activate the above functionality: + - ibdiagnet: Performs various quality and health checks on the IB + fabric. + - ibdiagpath: Performs various fabric quality and health checks on + the given links and nodes in a specific path. + - ibdiagui: A GUI wrapper for the above tools. + +=============================================================================== +2. New Features +=============================================================================== + +* New "From the Edge" topology matching algorithm. 
+ Integrated into ibtopodiff when run with the flag -e + +* New library - libsysapi + The library is a C API for IBDM C++ objects + +* Added ibnl definition files for Mellanox and Sun IB QDR products + +* Added new feature to ibdiagnet - general device info + +* ibdiagnet can now accept port 0 as a parameter (for managed switches). + + +=============================================================================== +3. Major Bugs Fixed +=============================================================================== + +* ibutils: various fixes in build process (dependencies, parallel build, etc) + +* ibdiagnet: fixed crash with -r flag + +* ibdiagnet: fixed regular expression for pkey matching + +* ibdiagnet: ibdiagnet.lst file has device IDs with trailing zeroes - fixed + +=============================================================================== +4. Known Issues +=============================================================================== + +- Ibdiagnet "-wt" option may generate a bad topology file when running on a + cluster that contains complex switch systems. diff --git a/release_notes/ipath_release_notes.txt b/release_notes/ipath_release_notes.txt new file mode 100644 index 0000000..0382fe9 --- /dev/null +++ b/release_notes/ipath_release_notes.txt @@ -0,0 +1,13 @@ + Open Fabrics Enterprise Distribution (OFED) + ipath in OFED 1.5 Release Notes + + December 2009 + +====================================================================== +1. Overview +====================================================================== +ipath is the low level driver implementation for the +QLogic HyperTransport HCA only (model QHT7140). + +The qib driver is the currently supported driver for all +PCI-Express based InfiniBand HCAs.
diff --git a/release_notes/ipoib_release_notes.txt b/release_notes/ipoib_release_notes.txt new file mode 100644 index 0000000..c332d6c --- /dev/null +++ b/release_notes/ipoib_release_notes.txt @@ -0,0 +1,483 @@ + Open Fabrics Enterprise Distribution (OFED) + IPoIB in OFED 1.5.2 Release Notes + + December 2010 + + +=============================================================================== +Table of Contents +=============================================================================== +1. Overview +2. Known Issues +3. DHCP Support of IPoIB +4. The ib-bonding driver +5. Child interfaces +6. Bug Fixes and Enhancements Since OFED 1.3 +7. Bug Fixes and Enhancements Since OFED 1.3.1 +8. Bug Fixes and Enhancements Since OFED 1.4 +9. Bug Fixes and Enhancements Since OFED 1.4.2 +10. Bug Fixes and Enhancements Since OFED 1.5.0 +11. Bug Fixes and Enhancements Since OFED 1.5.2 +12. Performance tuning + +=============================================================================== +1. Overview +=============================================================================== +IPoIB is a network driver implementation that enables transmitting IP and ARP +protocol packets over an InfiniBand UD channel. The implementation conforms to +the relevant IETF working group's RFCs (http://www.ietf.org). + + +Usage and configuration: +======================== +1. To check the current mode used for outgoing connections, enter: + cat /sys/class/net/ib0/mode +2. To disable IPoIB CM at compile time, enter: + cd OFED-1.5 + export OFA_KERNEL_PARAMS="--without-ipoib-cm" + ./install.pl +3. To change the run-time configuration for IPoIB, enter: + edit /etc/infiniband/openib.conf, change the following parameters: + # Enable IPoIB Connected Mode + SET_IPOIB_CM=yes + # Set IPoIB MTU + IPOIB_MTU=65520 + +4. You can also change the mode and MTU for a specific interface manually. 
+ + To enable connected mode for interface ib0, enter: + echo connected > /sys/class/net/ib0/mode + + To increase MTU, enter: + ifconfig ib0 mtu 65520 + +5. Switching between CM and UD mode can be done in run time: + echo datagram > /sys/class/net/ib0/mode sets the mode of ib0 to UD + echo connected > /sys/class/net/ib0/mode sets the mode ib0 to CM + + +=============================================================================== +2. Known Issues +=============================================================================== +1. If a host has multiple interfaces and (a) each interface belongs to a + different IP subnet, (b) they all use the same InfiniBand Partition, and (c) + they are connected to the same IB Switch, then the host violates the IP rule + requiring different broadcast domains. Consequently, the host may build an + incorrect ARP table. + + The correct setting of a multi-homed IPoIB host is achieved by using a + different PKEY for each IP subnet. If a host has multiple interfaces on the + same IP subnet, then to prevent a peer from building an incorrect ARP entry + (neighbor) set the net.ipv4.conf.X.arp_ignore value to 1 or 2, where X + stands for the IPoIB (non-child) interfaces (e.g., ib0, ib1, etc). This + causes the network stack to send ARP replies only on the interface with the + IP address specified in the ARP request: + + sysctl -w net.ipv4.conf.ib0.arp_ignore=1 + sysctl -w net.ipv4.conf.ib1.arp_ignore=1 + + Or, globally, + + sysctl -w net.ipv4.conf.all.arp_ignore=1 + + To learn more about the arp_ignore parameter, see + Documentation/networking/ip-sysctl.txt. + Note that distributions have the means to make kernel parameters persistent. + +2. There are IPoIB alias lines in /etc/modprobe.d/ib_ipoib.conf which prevent + stopping/unloading the stack (i.e., '/etc/init.d/openibd stop' will fail). + These alias lines cause the drivers to be loaded again by udev scripts. 
+ + Workaround: Change modprobe.conf to set + OFA_KERNEL_PARAMS="--without-modprobe" before running install.pl, or remove + the alias lines from /etc/modprobe.d/ib_ipoib.conf. + +3. On SLES 10: + The ib1 interface uses the configuration script of ib0. + + Workaround: Invoke ifup/ifdown using both the interface name and the + configuration script name (example: ifup ib1 ib1). + +4. After a hotplug event, the IPoIB interface falls back to datagram mode, and + MTU is reduced to 2K. + Workaround: Re-enable connected mode and increase MTU manually: + echo connected > /sys/class/net/ib0/mode + ifconfig ib0 mtu 65520 + +5. Since the IPoIB configuration files (ifcfg-ib) are installed under the + standard networking scripts location (RedHat:/etc/sysconfig/network-scripts/ + and SuSE: /etc/sysconfig/network/), the option IPOIB_LOAD=no in openib.conf + does not prevent the loading of IPoIB on boot. + +6. If IPoIB connected mode is enabled, it uses a large MTU for connected mode + messages and a small MTU for datagram (in particular, multicast) messages, + and relies on path MTU discovery to adjust MTU appropriately. Packets sent + in the window before MTU discovery automatically reduces the MTU for a + specific destination will be dropped, producing the following message in the + system log: + "packet len (> ) too long to send, dropping" + + To warn about this, a message is produced in the system log each time MTU is + set to a value higher than 2K. + +7. IPoIB IPv6 support is broken between systems with kernels < 2.6.12 and + systems with kernels >= 2.6.12. The reason is that kernel 2.6.12 puts the link + layer address at an offset of two bytes with respect to older kernels. This + causes the other host to misinterpret the hardware address, resulting in failure + to resolve paths that are based on wrong GIDs. As an example, RH 4.x and RH + 5.x cannot inter-operate. + +8. In connected mode, TCP latency for short messages is larger by approx. 1usec + (~5%) than in datagram mode.
As a workaround, use datagram mode. + +9. Single-socket TCP bandwidth for kernels < 2.6.18 is lower than with + newer kernels. It is recommended to use kernel 2.6.18 or up for + best IPoIB performance. + +10. Connectivity issues encountered when using IPv6 on ia64 systems. + +11. The IPoIB module uses a Linux implementation for Large Receive Offload + (LRO) in kernel 2.6.24 and later. These kernels require installing the + "inet_lro" module. + +12. ConnectX only: If you have a port configured as ETH and IPoIB is running + in connected mode, and then you change the port type to IB, the IPoIB mode + will change to datagram mode. + +13. When working with iSCSI, you must disable LRO (even if you are working in + connected mode). This is because there is a bug in older kernels which causes + a kernel panic. + +14. IPoIB datagram mode initial packet loss (bug #1287): When the datagram test + gets to packet size 8192 or larger, it always loses the first packet in the + sequence. + Workaround: Increase the number of pending skb's before a neighbor is + resolved (default is 3). This value can be changed with: + sysctl net.ipv4.neigh.ib0.unres_qlen. + +15. IPoIB multicast support is broken in RH4.x kernels. This is because + ndisc_mc_map() does not handle IPoIB hardware addresses. + +16. If bonding uses an IPoIB slave, then un-enslaving all slaves (or downing + them with ifdown) followed by unloading the module ib_ipoib might crash the + kernel. To avoid this leave the IPoIB interfaces enslaved when unloading + ib_ipoib. + +17. On SLES 11, sysconfig scripts override the interface mode and set it to + datagram on each call to ifup, ifdown, etc. To avoid this, add the line + IPOIB_MODE=connected + to the interface configuration file (e.g. ifcfg-ib0) + +18. 
When installing OFED on a machine that runs kernel 2.6.30 (or another + kernel from kernel.org that OFED supports), the installation script blocks + the installation of ib-bonding since the bonding module that comes with the + kernel has all the functionality to support IPoIB slaves. This approach, + however, doesn't patch the sysconfig (SuSE) or initscripts (RedHat) package, + so the network configuration script may not work properly. + For example, if you install OFED on RHEL5.2 that runs kernel 2.6.30 and + you try to configure and run bonding, you won't be able to restart the + network and see bond0 up and running with IPoIB slaves. + A workaround to this problem would be as follows: + a. Compile ib-bonding source rpm (under SRPMS directory) separately on + a machine with RHEL5.2 and kernel 2.6.18-92.el5 (default for this OS). + b. Install the binary RPM while the machine runs kernel 2.6.18-92.el5. + This will patch the OS configuration scripts and install the bonding + module. + c. Switch to kernel 2.6.30. The module that was compiled in (a) will + not be loaded since it was compiled and installed for a different + kernel. + d. Configure bonding and restart the network. The bonding interface + should be up and running afterwards. + +19. On RHEL5.X, '/etc/init.d/openibd start' prints the following messages while + bringing up IPoIB interfaces: + + Setting up InfiniBand network interfaces: + Bringing up interface ib0: [ OK ] + RTNETLINK answers: File exists + Error adding address 192.168.1.11 for ib1. + Bringing up interface ib1: [ OK ] + Setting up service network . . . [ done ] + + This does not affect IPoIB configuration and interfaces are configured as + expected. + +20. In IPoIB connected mode, packets larger than 2016 bytes are not sent. + https://bugs.openfabrics.org/show_bug.cgi?id=1839 + +21.
Under SLES11, if an IP configuration exists for an IPoIB interface + that later becomes a slave of a bonding master, a network restart + does not erase the IP configuration from the slave and it appears to have + an IP address even though the new configuration does not set one. This + may cause problems when trying to use the bonded network interface. To + avoid this, restart the IB stack (openib restart) once you change the + configuration. + This issue is described in + https://bugs.openfabrics.org/show_bug.cgi?id=1975 + +22. Currently, IPoIB LRO is not supported on ConnectX-2 devices + +=============================================================================== +3. IPoIB Configuration Based on DHCP +=============================================================================== + +Setting an IPoIB interface configuration based on DHCP (v3.0.4 which is +available via www.isc.org) is performed similarly to the configuration of +Ethernet interfaces. In other words, you need to make sure that IPoIB +configuration files include the following line: + For RedHat: + BOOTPROTO=dhcp + For SLES: + BOOTPROTO=dhcp +Note: If IPoIB configuration files are included, ifcfg-ib files will be +installed under: +/etc/sysconfig/network-scripts/ on a RedHat machine +/etc/sysconfig/network/ on a SuSE machine + +Note: Two patches for DHCP are required for supporting IPoIB. The patch files +for DHCP v3.0.4 are available under the docs/dhcp/ directory. + +Standard DHCP fields holding MAC addresses are not large enough to contain an +IPoIB hardware address. To overcome this problem, DHCP over InfiniBand messages +convey a client identifier field used to identify the DHCP session. This client +identifier field can be used to associate an IP address with a client identifier +value, such that the DHCP server will grant the same IP address to any client +that conveys this client identifier. + +Note: Refer to the DHCP documentation for more details how to make this +association. 
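One common way to make this association in ISC dhcpd is a host declaration
that matches on the client identifier. The sketch below only prints such a
declaration. dhcpd_host_entry, the host name, and the IP address are
illustrative assumptions, and the exact syntax should be checked against the
documentation of the DHCP version in use.

```shell
# Hypothetical helper (not shipped with OFED or ISC DHCP).
# Usage: dhcpd_host_entry HOST_NAME CLIENT_IDENTIFIER FIXED_ADDRESS
# Prints a dhcpd.conf host declaration keyed on the client identifier.
dhcpd_host_entry() {
    printf 'host %s {\n' "$1"
    # The identifier is a colon-separated list of hexadecimal bytes.
    printf '    option dhcp-client-identifier %s;\n' "$2"
    printf '    fixed-address %s;\n' "$3"
    printf '}\n'
}
```

The output can be reviewed and then appended to the dhcpd.conf file used by
the server. The client must send the same identifier for the match to apply.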
+ +The length of the client identifier field is not fixed in the specification. + +4.1 DHCP Server +In order for the DHCP server to provide configuration records for clients, an +appropriate configuration file needs to be created. By default, the DHCP server +looks for a configuration file called dhcpd.conf under /etc. You can either +edit this file or create a new one and provide its full path to the DHCP server +using the -cf flag. See a file example at docs/dhcpd.conf of this package. +The DHCP server must run on a machine which has loaded the IPoIB module. + +To run the DHCP server from the command line, enter: +dhcpd -d +Example: +host1# dhcpd ib0 -d + +4.2 DHCP Client (Optional) + +Note: A DHCP client can be used if you need to prepare a diskless machine with +an IB driver. + +In order to use a DHCP client identifier, you need to first create a +configuration file that defines the DHCP client identifier. Then run the DHCP +client with this file using the following command: +dhclient -cf +Example of a configuration file for the ConnectX (PCI Device ID 26428), called +dhclient.conf: +# The value indicates a hexadecimal number +interface "ib1" { +send dhcp-client-identifier ff:00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39; +} +Example of a configuration file for InfiniHost III Ex (PCI Device ID 25218), +called dhclient.conf: +# The value indicates a hexadecimal number +interface "ib1" { +send dhcp-client-identifier 20:00:55:04:01:fe:80:00:00:00:00:00:00:00:02:c9:02:00:23:13:92; +} + +In order to use the configuration file, run: +host1# dhclient -cf dhclient.conf ib1 + + +=============================================================================== +4. The ib-bonding driver +=============================================================================== +The ib-bonding driver is a High Availability solution for IPoIB interfaces. +It is based on the Linux Ethernet Bonding Driver and was adapted to work with +IPoIB. 
The ib-bonding driver comes with the ib-bonding package +(run rpm -qi ib-bonding to get the package information). + +Using the ib-bonding driver +--------------------------- +The ib-bonding driver is loaded automatically. + +Automatic operation: +Use standard OS tools (sysconfig in SuSE and initscripts in RedHat) +to create a configuration that will come up with network restart. For details +on this, read the documentation for the ib-bonding package. + +Notes: +* Using /etc/infiniband/openib.conf to create a persistent configuration is + no longer supported +* On RHEL4_U7, a slave interface cannot be set as primary. +* ib-bonding will not be compiled and installed with OFED on an OS with kernel + that is >= 2.6.27 (e.g., SLES11). The bonding driver that comes with those + kernels already supports enslaving IPoIB interfaces. In addition, an OS + can come with an older kernel but with a patched bonding driver that also + does not require modification (e.g., RHEL5.4). OFED will not replace the + bonding module in such cases either. + However, there might still be issues with OS configuration tools (like + sysconfig or initscripts) that may need fixing, but such issues have not + been observed yet. + + +=============================================================================== +5. Child interfaces +=============================================================================== + +5.1 Subinterfaces +----------------- +You can create subinterfaces for a primary IPoIB interface to provide traffic +isolation. Each such subinterface (also called a child interface) has +different IP and network addresses from the primary (parent) interface. The +default Partition Key (PKey), ff:ff, applies to the primary (parent) interface. + +5.1.1 Creating a Subinterface +----------------------------- +To create a child interface (subinterface), follow this procedure: +Note: In the following procedure, ib0 is used as an example of an IB +subinterface. + +Step 1. 
Decide on the PKey to be used in the subnet. Valid values are 0-255. +The actual PKey used is a 16-bit number with the most significant bit set. For +example, a value of 0 will give a PKey with the value 0x8000. + +Step 2. Create a child interface by running: +host1$ echo > /sys/class/net//create_child +Example: +host1$ echo 0 > /sys/class/net/ib0/create_child +This will create the interface ib0.8000. + +Step 3. Verify the configuration of this interface by running: +Using the example of Step 2: +host1$ ifconfig ib0.8000 +ib0.8000 Link encap:UNSPEC HWaddr 80-00-00-4A-FE-80-00-00-00-00- +00-00-00-00-00-00 +BROADCAST MULTICAST MTU:2044 Metric:1 +RX packets:0 errors:0 dropped:0 overruns:0 frame:0 +TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 +collisions:0 txqueuelen:128 +RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) + +Step 4. As can be seen, the interface does not have IP or network addresses so +it needs to be configured. + +Step 5. To be able to use this interface, a configuration of the Subnet Manager +is needed so that the PKey chosen, which defines a broadcast address, can be +recognized. + +5.1.2 Removing a Subinterface +To remove a child interface (subinterface), run: +echo /sys/class/net//delete_child +Using the example of Step 2: +echo 0x8000 > /sys/class/net/ib0/delete_child +Note that when deleting the interface you must use the PKey value with the most +significant bit set (e.g., 0x8000 in the example above). + + +=============================================================================== +6. Bug Fixes and Enhancements Since OFED 1.3 +=============================================================================== +- There is no default configuration for IPoIB interfaces: One should manually + specify the full IP configuration or use the ofed_net.conf file. See + OFED_Installation_Guide.txt for details on ipoib configuration. 
+- Don't drop multicast sends when they can be queued +- IPoIB panics with RHEL5U1, RHEL4U6 and RHEL4U5: Bug fix when copying small + SKBs (bug 989) +- IPoIB failed on stress testing (bug 1004) +- Kernel Oops during "port up/down test" (bug 1040) +- Restarting the stack while iperf 2.0.4 is running on the client side causes a + kernel panic (bug 985) +- Fix neigh destructor oops on kernel versions between 2.6.17 and 2.6.20 +- Set max CM MTU when moving to CM mode, instead of setting it in openibd script +- Fix CQ size calculations for ipoib +- Bonding: Enable build for SLES10 SP2 +- Bonding: Fix issue in using the bonding module for Ethernet slaves (see + documentation for details) + +=============================================================================== +7. Bug Fixes and Enhancements Since OFED 1.3.1 +=============================================================================== +- IPoIB: Refresh paths instead of flushing them on SM change events to improve + failover response +- IPoIB: Fix loss of connectivity after bonding failover on both sides +- Bonding: Fix link state detection under RHEL4 +- Bonding: Avoid annoying messages from initscripts when starting bond +- Bonding: Set default number of grat. ARP after failover to three (was one) + +=============================================================================== +8. Bug Fixes and Enhancements Since OFED 1.4 +=============================================================================== +- Performance tuning is enabled by default for IPOIB CM. +- Clear IPOIB_FLAG_ADMIN_UP if ipoib_open fails +- Disable napi while cq is being drained (bugzilla #1587) +- rdma_cm: Use the rate from the ipoib broadcast when joining an ipoib + multicast. When joining an IPoIB multicast group, use the same rate as in the + broadcast group. Otherwise, if rdma_cm creates this group before IPoIB does, + it might get a different rate. 
This will cause IPoIB to fail joining the same + group later on, because IPoIB has a strict rate selection. +- Fixed unprotected use of priv->broadcast in ipoib_mcast_join_task. +- Do not join broadcast group if interface is brought down + + +=============================================================================== +9. Bug Fixes and Enhancements Since OFED 1.4.2 +=============================================================================== + +- Check that the format of multicast link addresses is correct before taking + them from dev->mc_list to priv->multicast_list. This way we never try to + send a bogus address to the SA, which prevents badness from erroneous + 'ip maddr addr add', broken bonding drivers, etc. (bugzilla #1664) +- IPoIB: Don't turn on carrier for a non-active port. + If a bonding interface uses this IPoIB interface as a slave it might + not detect that this slave is almost useless and failover + functionality will be damaged. The fix checks the state of the IB + port in the carrier_task before calling netif_carrier_on(). (bugzilla #1726) +- Clear ipoib_neigh.dgid in ipoib_neigh_alloc() + IPoIB can miss a change in destination GID under some conditions. The + problem is caused when ipoib_neigh->dgid contains a stale address. + The fix is to set ipoib_neigh->dgid to zero in ipoib_neigh_alloc(). + +=============================================================================== +10. Bug Fixes and Enhancements Since OFED 1.5.0 +=============================================================================== + +- Fixed lockup of the TX queue on mixed CM/UD traffic + When there is a high rate of send traffic on both CM and UD QPs, the + transmitter can be stopped by the CM path but not re-enabled. + +=============================================================================== +11. Bug Fixes and Enhancements Since OFED 1.5.2 +=============================================================================== +1. 
Fix IPoIB rx_frames and rx_usecs to conform to ethtool documentation. + + +=============================================================================== +12. Performance Tuning +=============================================================================== +When IPoIB is configured to run in connected mode, tcp parameter tuning is +performed at driver startup to improve the throughput of medium and large +messages. +The driver startup scripts set the following TCP parameters as follows: + + net.ipv4.tcp_timestamps=0 + net.ipv4.tcp_sack=0 + net.core.netdev_max_backlog=250000 + net.core.rmem_max=16777216 + net.core.wmem_max=16777216 + net.core.rmem_default=16777216 + net.core.wmem_default=16777216 + net.core.optmem_max=16777216 + net.ipv4.tcp_mem="16777216 16777216 16777216" + net.ipv4.tcp_rmem="4096 87380 16777216" + net.ipv4.tcp_wmem="4096 65536 16777216" + +This tuning is effective only for connected mode. If you run in datagram mode, +it actually reduces performance. + +If you change the IPoIB run mode to "datagram" while the driver is running, +the tuned parameters do not get reset to their default values. We therefore +recommend that you change the IPoIB mode only while the driver is down +(by setting line "SET_IPOIB_CM=yes" to "SET_IPOIB_CM=no" in file +/etc/infiniband/openib.conf, and then restarting the driver). + + diff --git a/release_notes/iser_release_notes.txt b/release_notes/iser_release_notes.txt new file mode 100644 index 0000000..cd2d12d --- /dev/null +++ b/release_notes/iser_release_notes.txt @@ -0,0 +1,90 @@ + + Open Fabrics Enterprise Distribution (OFED) + iSER initiator in OFED 1.5.x Release Notes + + March 2010 + + +* Background + + iSER allows iSCSI to be layered over RDMA transports (including + InfiniBand and iWARP (RNIC)). + + The OpenFabrics iSER initiator implementation is inter-operable with + open-iscsi (http://www.open-iscsi.org/). It provides an alternative + transport to iscsi_tcp in the open-iscsi framework. 
The iSER transport + exposes a transport API to scsi_transport_iscsi, and a SCSI LLD API to + the Linux SCSI mid-layer (scsi_mod). Currently, the OpenFabrics iSER + initiator can be layered over InfiniBand (no iWARP support yet). + +* Supported platforms + + - kernel.org: 2.6.30 and higher + - RHEL 5.4 + + Except for these platforms, OFED-1.5.x will not install iSER on top of + the kernel and the original iSER module coming with the Linux distribution + will stop working because of a mismatch in symbol versions. + +* Fixed Bugs and Enhancements since OFED 1.3 + iSER: + - Add logical unit reset support + - Update URLs of iSER docs + - Add change_queue_depth method + - Fix list iteration bug + - Handle iser_device allocation error gracefully + - Don't change ITT endianness + - Move high-volume debug output to higher debug level + - Count FMR alignment violations per session + Open-iSCSI: + - Update open-iscsi rpm versions from + 2.0-754 to 2.0-754.1 and from 2.0-865.15 to 2.0-869.2 + - Change open-iscsi defaults + - iscsi_discovery: fixed printing debug information + - iscsi_discovery: check if iscsid is running + - Set open-iscsi for auto-startup when installing OFED + - iscsiadm: bail out if daemon isn't running + +* Known Issues + Open-iSCSI: + - modifying node transport_name while session is active + will create stale session. It will be deleted only after reboot. + +* Installation/upgrade of open-iscsi + If iSER is selected to be installed with OFED, open-iscsi will be also + installed (or upgraded if another version of open-iscsi is already + installed). Installing/upgrading open-iscsi is required for iSER to + work properly. Before installing OFED, please make sure that no version + of open-iscsi is installed or add the following key to your ofed.conf + file: upgrade_open_iscsi=yes. Using this key will remove any old version + of open-iscsi. + + If an older version of open-iscsi was installed, it is recommended to + delete its records before running open-iscsi. 
This can easily be done by + running the following command (while open-iscsi is stopped): + + rm -rf /etc/iscsi/nodes/* /etc/iscsi/send_targets/* + + Then, open-iscsi may be started, and targets may be discovered by running + 'iscsi_discovery '. + +* iSER links + + Wiki pages + + Information on building/configuring/running the open iscsi initiator over + iSER: https://wiki.openfabrics.org/tiki-index.php?page=iSER + + IETF pages + + iSCSI and iSER specifications come out of the IETF IP storage (IPS) work + group. + + iSCSI specification: http://www.ietf.org/rfc/rfc3720.txt + iSER specification: http://www.ietf.org/rfc/rfc5046.txt + + "About" page + + general and detailed information on iSCSI and iSER + http://www.voltaire.com/Products/Server_Products/iSER_iSCSI_RDMA + diff --git a/release_notes/iser_target_release_notes.txt b/release_notes/iser_target_release_notes.txt new file mode 100644 index 0000000..b865944 --- /dev/null +++ b/release_notes/iser_target_release_notes.txt @@ -0,0 +1,51 @@ + Open Fabrics Enterprise Distribution (OFED) + STGT/iSER target in OFED 1.5 Release Notes + + December 2009 + + +* Background + + iSER allows iSCSI to be layered over RDMA transports (including InfiniBand + and iWARP (RNIC)). Linux target framework (tgt) aims to simplify various SCSI + target driver (iSCSI, Fibre Channel, SRP, etc) creation and maintenance. + + tgt supports the following target drivers (among others) + + - iSCSI software (tcp) target driver for Ethernet/IPoIB NICs + - iSER software target driver for Infiniband and RDMA NICs + + For iSCSI and iSER, tgt consists of a user-space daemon and user-space + tools. That is, no special kernel support is needed other than the + kernel (and user space) RDMA stacks. + + The code is under the GNU General Public License version 2. 
+ + This package is based on a snapshot (clone) of the tgt git tree taken + on August 28th, 2008 + +* Supported platforms + + RHEL 5 and its updates + SLES 10 and its service-packs + + The release has been tested against the Linux open iscsi initiator + +* STGT/iSER links + + STGT home page + http://stgt.berlios.de + + STGT git + git://git.kernel.org/pub/scm/linux/kernel/git/tomo/tgt.git + + the STGT sources have some embedded documentation, specifically + the README and README.iscsi files would be useful + + Wiki pages + + Information on building/configuring/running the stgt/iser target + https://wiki.openfabrics.org/tiki-index.php?page=iSER-target + + general and detailed information on iSCSI and iSER + http://www.voltaire.com/Products/Server_Products/iSER_iSCSI_RDMA diff --git a/release_notes/mlx4_release_notes.txt b/release_notes/mlx4_release_notes.txt new file mode 100644 index 0000000..897b112 --- /dev/null +++ b/release_notes/mlx4_release_notes.txt @@ -0,0 +1,348 @@ + Open Fabrics Enterprise Distribution (OFED) + ConnectX driver (mlx4) in OFED 1.5.2 Release Notes + + December 2010 + + +=============================================================================== +Table of Contents +=============================================================================== +1. Overview +2. Supported firmware versions +3. VPI (Virtual Protocol Interconnect) +4. InfiniBand new features and bug fixes since OFED 1.3.1 +5. InfiniBand (mlx4_ib) new features and bug fixes since OFED 1.4 +6. Eth (mlx4_en) new features and bug fixes since OFED 1.4 +7. New features and bug fixes since OFED 1.4.1 +8. New features and bug fixes since OFED 1.4.2 +9. New features and bug fixes since OFED 1.5 +10. New features and bug fixes since OFED 1.5.1 +11. New features and bug fixes since OFED 1.5.2 +12. Known Issues +13. mlx4 available parameters + +=============================================================================== +1. 
Overview +=============================================================================== +mlx4 is the low level driver implementation for the ConnectX adapters designed +by Mellanox Technologies. The ConnectX can operate as an InfiniBand adapter, +as an Ethernet NIC, or as a Fibre Channel HBA. The driver in OFED 1.4 supports +InfiniBand and Ethernet NIC configurations. To accommodate the supported +configurations, the driver is split into three modules: + +- mlx4_core + Handles low-level functions like device initialization and firmware + commands processing. Also controls resource allocation so that the + InfiniBand and Ethernet functions can share the device without + interfering with each other. +- mlx4_ib + Handles InfiniBand-specific functions and plugs into the InfiniBand + midlayer +- mlx4_en + Handles Ethernet specific functions and plugs into the netdev mid-layer. + +=============================================================================== +2. Supported firmware versions +=============================================================================== +- This release was tested with FW 2.8.0000 +- The minimal version to use is 2.3.000. +- To use both IB and Ethernet (VPI) use FW version 2.6.000 or higher + +=============================================================================== +3. VPI (Virtual Protocol Interconnect) +=============================================================================== +VPI enables ConnectX to be configured as an Ethernet NIC and/or an InfiniBand +adapter. +o Overview: + The VPI driver is a combination of the Mellanox ConnectX HCA Ethernet and + InfiniBand drivers. + It supplies the user with the ability to run InfiniBand and Ethernet + protocols on the same HCA (separately or at the same time). + For more details on the Ethernet driver see MLNX_EN_README.txt. +o Firmware: + The VPI driver works with FW 25408 version 2.6.000 or higher. + One needs to use INI files that allow different protocols over same HCA. 
+o Port type management: + By default both ConnectX ports are initialized as InfiniBand ports. + If you wish to change the port type use the connectx_port_config script after + the driver is loaded. + Running "/sbin/connectx_port_config -s" will show current port configuration + for all ConnectX devices. + Port configuration is saved in file: /etc/infiniband/connectx.conf. + This saved configuration is restored at driver restart only if done via + "/etc/init.d/openibd restart". + + Possible port types are: + "eth" - Always Ethernet. + "ib" - Always InfiniBand. + "auto" - Link sensing mode - detect port type based on the attached + network type. If no link is detected, the driver retries link + sensing every few seconds. + + Port link type can be configured for each device in the system at run time + using the "/sbin/connectx_port_config" script. + + This utility will prompt for the PCI device to be modified (if there is only + one it will be selected automatically). + At the next stage the user will be prompted for the desired mode for each port. + The desired port configuration will then be set for the selected device. + Note: This utility also has a non interactive mode: + "/sbin/connectx_port_config [[-d|--device ] -c|--conf ]". + +- The following configurations are supported by VPI: + Port1 = eth Port2 = eth + Port1 = ib Port2 = ib + Port1 = auto Port2 = auto + Port1 = ib Port2 = eth + Port1 = ib Port2 = auto + Port1 = auto Port2 = eth + + Note: the following options are not supported: + Port1 = eth Port2 = ib + Port1 = eth Port2 = auto + Port1 = auto Port2 = ib + + +=============================================================================== +4. InfiniBand new features and bug fixes since OFED 1.3.1 +=============================================================================== +Features that are enabled with ConnectX firmware 2.5.0 only: +- Send with invalidate and Local invalidate send queue work requests. +- Resize CQ support. 
+ +Features that are enabled with ConnectX firmware 2.6.0 only: +- Fast register MR send queue work requests. +- Local DMA L_Key. +- Raw Ethertype QP support (one QP per port) -- receive only. + +Non-firmware dependent features: +- Allow 4K messages for UD QPs +- Allocate/free fast register MR page lists +- More efficient MTT allocator +- RESET->ERR QP state transition no longer supported (IB Spec 1.2.1) +- Pass congestion management class MADs to the HCA +- Enable firmware diagnostic counters available via sysfs +- Enable LSO support for IPOIB +- IB_EVENT_LID_CHANGE is generated more appropriately +- Fixed race condition between create QP and destroy QP (bugzilla 1389) + + +=============================================================================== +5. InfiniBand new features and bug fixes since OFED 1.4 +=============================================================================== +- Enable setting via module param (set_4k_mtu) 4K MTU for ConnectX ports. +- Support optimized registration of huge pages backed memory. + With this optimization, the number of MTT entries used is significantly + lower than for regular memory, so the HCA will access registered memory with + fewer cache misses and improved performance. + For more information on this topic, please refer to Linux documentation file: + Documentation/vm/hugetlbpage.txt +- Do not enable blueflame sends if write combining is not available +- Add write combining support for PPC64, and thus enable blueflame sends. +- Unregister IB device before executing CLOSE_PORT. +- Notify and exit if the kernel module used does not support XRC. This is done + to avoid libmlx4 compatibility problem. +- Added a module parameter (log_mtts_per_seg) for number of MTTs per segment. + This makes it possible to register more memory with the same number of + segments. + + +=============================================================================== +6. 
Eth (mlx4_en) new features and bug fixes since OFED 1.4 +=============================================================================== +6.1 Changes and New Features +---------------------------- +- Added Tx Multi-queue support which improves multi-stream and bi-directional + TCP performance. +- Added IP Reassembly to improve RX bandwidth for IP fragmented packets. +- Added linear skb support which improves UDP performance. +- Removed the following module parameters: + - rx/tx_ring_size + - rx_ring_num - number of RX rings + - pprx/pptx - global pause frames + The parameters above are controlled through the standard Ethtool interface. + +Bug Fixes +--------- +- Memory leak when driver is unloaded without configuring interfaces first. +- Setting flow control parameters for one ConnectX port through Ethtool + impacts the other port as well. +- Adaptive interrupt moderation malfunctions after receiving/transmitting + around 7 Tera-bytes of data. +- Firmware commands fail with bad flow messages when bringing an interface up. +- Unexpected behavior in case of memory allocation failures. + +=============================================================================== +7. New features and bug fixes since OFED 1.4.1 +=============================================================================== +- Added support for new device ID: 0x6764: MT26468 ConnectX EN 10GigE PCIe gen2 + +=============================================================================== +8. New features and bug fixes since OFED 1.4.2 +=============================================================================== +8.1 Changes and New Features +---------------------------- +- mlx4_en is now supported on PPC and IA64. +- Added self diagnostics feature: ethtool -t eth. +- Card's vpd can be accessed for read and write using ethtool interface. + +8.2 Bug Fixes +------------- +- mlx4 can now work with MSI-X on RH4 systems. +- Enabled the driver to load on systems with 32 cores and higher. 
+- The driver no longer gets stuck if the HW/FW stops responding; a reset is + performed instead. +- Fixed recovery flows from memory allocation failures. +- When the system is low on memory, the mlx4_en driver now allocates smaller RX + rings. +- The mlx4_core driver now retries to obtain MSI-X vectors if the initial request is + rejected by the OS + +=============================================================================== +9. New features and bug fixes since OFED 1.5 +=============================================================================== +9.1 Changes and New Features +---------------------------- +- Added RDMA over Converged Enhanced Ethernet (RoCEE) support + See RoCEE_README.txt. +- Masked Compare and Swap (MskCmpSwap) + The MskCmpSwap atomic operation is an extension to the CmpSwap operation + defined in the IB spec. MskCmpSwap allows the user to select a portion of the + 64 bit target data for the "compare" check as well as to restrict the swap to + a (possibly different) portion. +- Masked Fetch and Add (MFetchAdd) + The MFetchAdd Atomic operation extends the functionality of the standard IB + FetchAdd by allowing the user to split the target into multiple fields of + selectable length. The atomic add is done independently on each one of these + fields. A bit set in the field_boundary parameter specifies the field + boundaries. +- Improved VLAN tagging performance for the mlx4_en driver. +- RSS support for Ethernet UDP traffic on ConnectX-2 cards with firmware + 2.7.700 and higher. + +9.2 Bug Fixes +------------- +- Bonding stops functioning when one of the Ethernet ports is closed. +- "Scheduling while atomic" errors in /var/log/messages when working with + bonding and mlx4_en drivers in several operating systems. + +=============================================================================== +10. 
New features and bug fixes since OFED 1.5.1 +=============================================================================== +10.1 Changes and New Features +---------------------------- +1. Added RAW QP support +2. Extended the range of log_mtts_per_seg - upper bound moved from 5 to 7. +3. Added 0xff70 vendor ID support for MADs. +4. Added support for GID change event. +5. Better interrupts spreading under heavy RX load (mlx4_en) + +10.2 Bug Fixes +------------- +1. Fixed chunk sg list overflow in mlx4_alloc_icm() +2. Fixed bug in invalidation of counter index. +3. Fixed bug in catching netdev events for updating GID table. +4. Fixed bug in populating GID table for RoCE. +5. Fixed XRC locking and prevention of null dereference. +6. Added spinlock to xrc_reg_list changes and scanning in interrupt context. +7. Fixed offload changes via Ethtool for VLAN interfaces + +=============================================================================== +11. New features and bug fixes since OFED 1.5.2 +=============================================================================== +11.1 Changes and new features +----------------------------- +1. RoCE counters are now added to the regular Ethernet counters. The counters +for RoCE specific traffic are at the same place and are not changed. +2. Forward any vendor ID SMP MADs to firmware for handling. +3. Add blue flame support for kernel consumers. This allows lower latencies to +be achieved. To use blue flame, a consumer needs to create the QP with inline +support. + +11.2 Bug fixes +-------------- +1. Fix race when reading node description through MADs. +2. Fix modify CQ so each of moderation parameters is independent. +3. Limit the number of fast registration work requests to match HW capabilities. +4. Changes to node-description via sysfs are now propagated to FW (for FW +2.8.000 and later). This enables FW to send a 144 trap to OpenSM regarding the +change, so that OpenSM can read that node's updated description. 
This fixes an +old race condition, where OpenSM read the node's description before it was +changed during driver startup. +5. Fix max fast registration WRs that can be posted to CX. +6. Fix port speed reporting for RoCE ports. +7. Limit GID entries for VLAN to match hardware capabilities. +8. Fix RoCE link state report. +9. Workaround firmware bug reporting wrong number of blue flame registers. +10. Bug fix in kernel pos_send when VLANs are used. +11. Fix in mlx4_en for handling VLAN operations when working under bond + interfaces. +12. Fix Ethtool transceiver type report for mlx4_en + + +=============================================================================== +12. Known Issues +=============================================================================== +- The SQD feature is not supported +- To load the driver on machines with a 64KB default page size, the UAR bar + must be enlarged. 64KB page size is the default of PPC with RHEL5 and Itanium + with SLES 11 or when 64KB page size is enabled. + Perform the following three steps: + 1. Add the following line in the firmware configuration (INI) file under the + [HCA] section: + log2_uar_bar_megabytes = 5 + 2. Burn a modified firmware image with the changed INI file. + 3. Reboot the system. + + +================================================================================ +13. 
mlx4 available parameters +================================================================================ +In order to set mlx4 parameters, add the following line(s) to /etc/modprobe.conf: + options mlx4_core parameter= + and/or + options mlx4_ib parameter= + and/or + options mlx4_en parameter= + +mlx4_core parameters: + set_4k_mtu: try to set 4K MTU to all ConnectX ports (int) + debug_level: enable debug tracing if > 0 (int) + block_loopback: block multicast loopback packets if > 0 (int) + msi_x: attempt to use MSI-X if nonzero (int) + log_num_mac: log2 max number of MACs per ETH port (1-7, int) + use_prio: enable steering by VLAN priority on ETH ports + (0/1, default 0) (bool) + log_num_qp: log maximum number of QPs per HCA (int) + log_num_srq: log maximum number of SRQs per HCA (int) + log_rdmarc_per_qp: log number of RDMARC buffers per QP (int) + log_num_cq: log maximum number of CQs per HCA (int) + log_num_mcg: log maximum number of multicast groups per HCA + (int) + log_num_mpt: log maximum number of memory protection table + entries per HCA (int) + log_num_mtt: log maximum number of memory translation table + segments per HCA (int) + log_mtts_per_seg: log2 number of MTT entries per segment (1-5) + (int) + enable_qos: enable Quality of Service support in the HCA + (default: off) (bool) + enable_pre_t11_mode: set FCoXX to pre-T11 mode if non-zero + (default 0) (int) + internal_err_reset: reset device on internal errors if non-zero + (default 1) (int) + +mlx4_ib parameters: + debug_level: enable debug tracing if > 0 (default 0) + +mlx4_en parameters: + udp_rss: enable RSS for incoming UDP traffic or disabled (0) + tcp_rss: enable RSS for incoming TCP traffic or disabled (0) + num_lro: number of LRO sessions per ring or disabled (0) + (default is 32) + ip_reasm: allow reassembly of fragmented IP packets (default + is enabled) + pfctx: priority based Flow Control policy on TX[7:0] + per priority bit mask (default is 0) + pfcrx: priority based Flow Control policy 
on RX[7:0] + per priority bit mask (default is 0) + inline_thold: threshold for using inline data (default is 128) diff --git a/release_notes/mpi-selector_release_notes.txt b/release_notes/mpi-selector_release_notes.txt new file mode 100644 index 0000000..95944dc --- /dev/null +++ b/release_notes/mpi-selector_release_notes.txt @@ -0,0 +1,43 @@ + MPI Selector 1.0 release notes + December 2009 + ============================== + +OFED contains a simple mechanism for system administrators and end +users to select which MPI implementation they want to use. The MPI +selector functionality is not specific to any MPI implementation; it +can be used with any implementation that provides shell startup files +that correctly set the environment for that MPI. The OFED installer +will automatically add MPI selector support for each MPI that it +installs. Additional MPI's not known by the OFED installer can be +listed in the MPI selector; see the mpi-selector(1) man page for +details. + +Note that MPI selector only affects the default MPI environment for +*future* shells. Specifically, if you use MPI selector to select MPI +implementation ABC, this default selection will not take effect until +you start a new shell (e.g., logout and login again). Other packages +(such as environment modules) provide functionality that allows +changing your environment to point to a new MPI implementation in the +current shell. The MPI selector was not meant to duplicate or replace +that functionality. + +The MPI selector functionality can be invoked in one of two ways: + +1. The mpi-selector-menu command. + + This command is a simple, menu-based program that allows the + selection of the system-wide MPI (usually only settable by root) + and a per-user MPI selection. It also shows what the current + selections are. + + This command is recommended for all users. + +2. The mpi-selector command. 
+ + This command is a CLI-equivalent of the mpi-selector-menu, + allowing for the same functionality as mpi-selector-menu but + without the interactive menus and prompts. It is suitable for + scripting. + +See the mpi-selector(1) man page for more information. + diff --git a/release_notes/mstflint_release_notes.txt b/release_notes/mstflint_release_notes.txt new file mode 100644 index 0000000..15c7c79 --- /dev/null +++ b/release_notes/mstflint_release_notes.txt @@ -0,0 +1,77 @@ +=============================================================================== + OFED 1.5.2 for Linux + Mellanox Firmware Burning and Diagnostic Utilities + December 2010 +=============================================================================== + +=============================================================================== +Table of Contents +=============================================================================== +1. Overview +2. New Features +3. Bugs Fixed +4. Known Issues + +=============================================================================== +1. Overview +=============================================================================== + +This package contains burning and diagnostic tools for Mellanox +manufactured cards. It also provides access to the relevant source code. Please +see the file LICENSE for licensing details. + +Package Contents: + a) mstflint source code + b) mflash lib + This lib provides Flash access through Mellanox HCAs. + c) mtcr lib (implemented in mtcr.h file) + This lib enables access to adapter hardware registers via PCIe. + d) mstregdump utility + This utility dumps hardware registers from Mellanox hardware for later + analysis by Mellanox. + e) mstvpd + This utility dumps the on-card VPD (Vital Product Data, which contains + the card serial number, part number, and other info). + f) hca_self_test.ofed + This script checks the status of software, firmware and hardware of the + HCAs or NICs installed on the local host. 
+ +=============================================================================== +2. New Features +=============================================================================== + +* Added support for flash type SST25VF016B in mstflint + +* Added support for flash type M25PX16 in mstflint + +* Added an option to set the VSD and GUIDs (mstflint commands 'sv' and 'sg') in + a binary image file. This is useful for production to prepare images for pre- + assembly flash burning. These new commands are supported by Mellanox 4th + generation devices. + +* Added an option to set the VSD and GUIDs (mstflint commands 'sv' and 'sg') on + an already burnt device. These commands re-burn the existing image with the + given GUIDs or VSD. + When the 'sg' command is applied on a device with blank (0xff) GUIDs, it + updates the GUIDs without re-burning the image. + +* mstregdump: Updated address list for ConnectX2 device. + +=============================================================================== +3. Bugs Fixed +=============================================================================== + +* Show correct device names in mstflint help + +=============================================================================== +4. Known Issues +=============================================================================== + +* On rare occasions, you may get the following error message when running mstflint: + Warning: memory access to device 0a:00.0 failed: Input/output error. + Warning: Fallback on IO: much slower, and unsafe if device in use. + *** buffer overflow detected ***: mstflint terminated + + To solve the issue, run "mst start" (requires MFT - Mellanox Firmware Tools package) and + then re-run mstflint. 
+ diff --git a/release_notes/mthca_release_notes.txt b/release_notes/mthca_release_notes.txt new file mode 100644 index 0000000..40f3c4e --- /dev/null +++ b/release_notes/mthca_release_notes.txt @@ -0,0 +1,92 @@ + Open Fabrics Enterprise Distribution (OFED) + mthca in OFED 1.5 Release Notes + + December 2009 + + +=============================================================================== +Table of Contents +=============================================================================== +1. Overview +2. Fixed Bugs since OFED 1.3.1 +3. Bug fixes and enhancements since OFED 1.4 +4. Known Issues + +=============================================================================== +1. Overview +=============================================================================== +mthca is the low level driver implementation for the following Mellanox +Technologies HCAs: InfiniHost, InfiniHost III Ex and InfiniHost III Lx. + +mthca Available Parameters +-------------------------- +In order to set mthca parameters, add the following line to /etc/modprobe.conf: + + options ib_mthca parameter= + +mthca parameters: + catas_reset_disable: disable reset on catastrophic event if nonzero + (int) + fw_cmd_doorbell: post FW commands through doorbell page if + nonzero (and supported by FW) (int) + debug_level: Enable debug tracing if > 0 (int) + msi_x: attempt to use MSI-X if nonzero (int) + tune_pci: increase PCI burst from the default set by BIOS if nonzero (int) + num_qp: maximum number of QPs per HCA (int) + rdb_per_qp: number of RDB buffers per QP (int) + num_cq: maximum number of CQs per HCA (int) + num_mcg: maximum number of multicast groups per HCA (int) + num_mpt: maximum number of memory protection table entries per HCA (int) + num_mtt: maximum number of memory translation table segments per HCA (int) + num_udav: maximum number of UD address vectors per HCA (int) + fmr_reserved_mtts: number of memory translation table segments reserved for + FMR (int) + log_mtts_per_seg: Log2 
number of MTT entries per segment (1-5) (int) + +=============================================================================== +2. Fixed Bugs since OFED 1.3.1 +=============================================================================== +- Fix access to freed memory in catastrophic event processing: + catas_reset() uses a pointer to mthca_dev, but mthca_dev is not valid after + the call to __mthca_restart_one(). + + +=============================================================================== +3. Bug fixes and enhancements since OFED 1.4 +=============================================================================== +- Added a module parameter (log_mtts_per_seg) for number of MTTs per segment. + This makes it possible to register more memory with the same number of segments. +- Bring the timeouts of INIT_HCA and other commands into consistency with the PRM. This + solves an issue seen when more than 2^18 max QPs were configured. + +=============================================================================== +4. Known Issues +=============================================================================== +1. A UAR size other than 8MB prevents mthca driver loading. The default UAR + size is 8MB. If the size is changed, the following error message will be + logged to /var/log/messages upon attempting to load the mthca driver: + ib_mthca 0000:04:00.0: Missing UAR, aborting. + +2. If a user level application using multicast receives a control signal + in the process of detaching from a multicast group, its QP may remain a + member of the multicast group (in HCA). + Workaround: Destroy the multicast group after detaching the QP from it. + +3. In mem-free devices, RC QPs can be created with a maximum of (max_sge - 1) + entries only; UD QPs can be created with a maximum of (max_sge - 3) entries. + +4. Performance can be degraded due to an incorrect BIOS configuration: + The PCI Express specification requires the BIOS to set the MaxReadReq + register for each HCA card for maximum performance and stability. 
+ + If you experience bandwidth performance degradation, try forcing the card to + behave not according to the PCI Express specification by setting the + tune_pci=1 module parameter. This tune_pci=1 assignment was the default + setting in OFED 1.0; therefore, it may have masked performance degradation + on some systems. + + If tune_pci=1 improves bandwidth, please report the issue to your BIOS + vendor. Please note that Mellanox Technologies does not recommend using + tune_pci=1 in production systems: working with tune_pci=1 set is untested + and is known to trigger instability issues on some platforms. + diff --git a/release_notes/mvapich2_release_notes.txt b/release_notes/mvapich2_release_notes.txt new file mode 100644 index 0000000..9a0fa90 --- /dev/null +++ b/release_notes/mvapich2_release_notes.txt @@ -0,0 +1,118 @@ +======================================================================== + + Open Fabrics Enterprise Distribution (OFED) + MVAPICH2-1.5.1 in OFED 1.5.2 Release Notes + + September 2010 + + +Overview +-------- + +These are the release notes for MVAPICH2-1.5.1. MVAPICH2 is an MPI-2 +implementation over InfiniBand, iWARP and RoCEE (RDMAoE) from the Ohio +State University (http://mvapich.cse.ohio-state.edu/). + + +User Guide +---------- + +For more information on using MVAPICH2-1.5.1, please visit the user +guide at http://mvapich.cse.ohio-state.edu/support/. + + +Software Dependencies +--------------------- + +MVAPICH2 depends on the installation of the OFED Distribution stack with +OpenSM running. The MPI module also requires an established network +interface (either InfiniBand, IPoIB, iWARP, RoCEE uDAPL, or Ethernet). +BLCR support is needed if built with fault tolerance support. Similarly, +HWLOC support is needed if built with Portable Hardware Locality feature +for CPU mapping. 
+ + +ChangeLog +--------- + +* Features and Enhancements + - Significantly reduce memory footprint on some systems by changing + the stack size setting for multi-rail configurations + - Optimization to the number of RDMA Fast Path connections + - Performance improvements in Scatterv and Gatherv collectives for + CH3 interface (Thanks to Dan Kokran and Max Suarez of NASA for + identifying the issue) + - Tuning of Broadcast Collective + - Support for tuning of eager thresholds based on both adapter and + platform type + - Environment variables for message sizes can now be expressed in + short form K=Kilobytes and M=Megabytes (e.g. + MV2_IBA_EAGER_THRESHOLD=12K) + - Ability to selectively use some or all HCAs using colon separated + lists. e.g. MV2_IBA_HCA=mlx4_0:mlx4_1 + - Improved Bunch/Scatter mapping for process binding with HWLOC and + SMT support (Thanks to Dr. Bernd Kallies of ZIB for ideas and + suggestions) + - Update to Hydra code from MPICH2-1.3b1 + - Auto-detection of various iWARP adapters + - Specifying MV2_USE_IWARP=1 is no longer needed when using iWARP + - Changing automatic eager threshold selection and tuning for iWARP + adapters based on number of nodes in the system instead of the + number of processes + - PSM progress loop optimization for QLogic Adapters (Thanks to Dr. + Avneesh Pant of QLogic for the patch) + +* Bug fixes + - Fix memory leak in registration cache with --enable-g=all + - Fix memory leak in operations using datatype modules + - Fix for rdma_cross_connect issue for RDMA CM. The server is + prevented from initiating a connection. 
+ - Don't fail during build if RDMA CM is unavailable + - Various mpirun_rsh bug fixes for CH3, Nemesis and uDAPL interfaces + - ROMIO panfs build fix + - Update panfs for not-so-new ADIO file function pointers + - Shared libraries can be generated with unknown compilers + - Explicitly link against DL library to prevent build error due to + DSO link change in Fedora 13 (introduced with gcc-4.4.3-5.fc13) + - Fix regression that prevents the proper use of our internal HWLOC + component + - Remove spurious debug flags when certain options are selected at + build time + - Error code added for situation when received eager SMP message is + larger than receive buffer + - Fix for Gather and GatherV back-to-back hang problem with LiMIC2 + - Fix for packetized send in Nemesis + - Fix related to eager threshold in nemesis ib-netmod + - Fix initialization parameter for Nemesis based on adapter type + - Fix for uDAPL one sided operations (Thanks to Jakub Fedoruk from + Intel for reporting this) + - Fix an issue with out-of-order message handling for iWARP + - Fixes for memory leak and Shared context Handling in PSM for + QLogic Adapters (Thanks to Dr. Avneesh Pant of QLogic for the + patch) + + +Main Verification Flows +----------------------- + +In order to verify the correctness of MVAPICH2-1.4.1, the following +tests and parameters were run. 
+ +Test Description +==================================================================== +Intel Intel's MPI functionality test suite +OSU Benchmarks OSU's performance tests +IMB Intel's MPI Benchmark test +mpich2 Test suite distributed with MPICH2 +NAS NAS Parallel Benchmarks (NPB3.2) + + +Mailing List +------------ + +There is a public mailing list mvapich-discuss@cse.ohio-state.edu for +mvapich users and developers to +- Ask for help and support from each other and get prompt response +- Contribute patches and enhancements + +======================================================================== diff --git a/release_notes/mvapich_release_notes.txt b/release_notes/mvapich_release_notes.txt new file mode 100644 index 0000000..8c872ae --- /dev/null +++ b/release_notes/mvapich_release_notes.txt @@ -0,0 +1,102 @@ + Open Fabrics Enterprise Distribution (OFED) + OSU MPI MVAPICH-1.2.0, in OFED 1.5 Release Notes + + December 2009 + + +=============================================================================== +Table of Contents +=============================================================================== +1. Overview +2. Software Dependencies +3. New Features +4. Bug Fixes +5. Known Issues +6. Main Verification Flows + + +=============================================================================== +1. Overview +=============================================================================== +These are the release notes for OSU MPI MVAPICH-1.2.0. +OSU MPI is an MPI channel implementation over InfiniBand +by Ohio State University (OSU). + +See http://mvapich.cse.ohio-state.edu + + +=============================================================================== +2. Software Dependencies +=============================================================================== +OSU MPI depends on the installation of the OFED stack with OpenSM running. +The MPI module also requires an established network interface (either +InfiniBand IPoIB or Ethernet). 
+ + +=============================================================================== +3. New Features ( Compared to mvapich 1.1.0 ) +=============================================================================== +MVAPICH-1.2.0 has the following additional features: +- Advanced network recovery support +- mpirun launcher improvements +- Efficient intra-node shared memory communication + support for diskless clusters +- RoCEE (RDMAoE) networks support + +=============================================================================== +4. Bug Fixes ( Compared to mvapich 1.1.0 ) +=============================================================================== +- Multiple fixes for mpirun_rsh launcher + +=============================================================================== +5. Known Issues +=============================================================================== +- Shared memory broadcast optimization is disabled by default. + +- MVAPICH MPI compiled on AMD x86_64 does not work with MVAPICH MPI compiled + on Intel X86_64 (EM64t). + Workaround: + Use "VIADEV_USE_COMPAT_MODE=1" run time option in order to enable compatibility + mode that works for AMD and Intel platform. + +- A process running MPI cannot fork after MPI_Init unless the environment + variable IBV_FORK_SAFE=1 is set to enable fork support. This support also + requires a kernel version of 2.6.16 or higher. + +- For users of Mellanox Technologies firmware fw-23108 or fw-25208 only: + MVAPICH might fail in its default configuration if your HCA is burnt with an + fw-23108 version that is earlier than 3.4.000, or with an fw-25208 version + 4.7.400 or earlier. + + NOTE: There is no issue if you chose to update firmware during Mellanox + OFED installation as newer firmware versions were burnt. + + Workaround: + Option 1 - Update the firmware. For instructions, see Mellanox Firmware Tools + (MFT) User's Manual under the docs/ folder. 
+ Option 2 - In mvapich.conf, set VIADEV_SRQ_ENABLE=0 + +- MVAPICH may fail to run on some SLES 10 machines due to problems in resolving + the host name. + Workaround: Edit /etc/hosts and comment-out/remove the line that maps + IP address 127.0.0.2 to the system's fully qualified hostname. + + +=============================================================================== +6. Main Verification Flows +=============================================================================== +In order to verify the correctness of MVAPICH, the following tests and +parameters were run. + +Test Description +------------------------------------------------------------------- +Intel's Test suite - 1400 Intel tests +BW/LT OSU's test for bandwidth latency +IMB Intel's MPI Benchmark test +mpitest b_eff test +Presta Presta multicast test +Linpack Linpack benchmark +NAS2.3 NAS NPB2.3 tests +SuperLU SuperLU benchmark (NERSC edition) +NAMD NAMD application +CAM CAM application diff --git a/release_notes/nes_release_notes.txt b/release_notes/nes_release_notes.txt new file mode 100644 index 0000000..0233d90 --- /dev/null +++ b/release_notes/nes_release_notes.txt @@ -0,0 +1,319 @@ + Open Fabrics Enterprise Distribution (OFED) + NetEffect Ethernet Cluster Server Adapter Release Notes + September 2010 + + + +The iw_nes module and libnes user library provide RDMA and L2IF +support for the NetEffect Ethernet Cluster Server Adapters. + +========== +What's New +========== +OFED 1.5.2 contains several enhancements and bug fixes to iw_nes driver. + +* Add new feature iWarp Multicast Acceleration (IMA). +* Add module option to disable extra doorbell read after a write. +* Change CQ event notification to not fire event unless there is a + new CQE not polled. +* Fix payload calculation for post receive with more than one SGE. +* Fix crash when CLOSE was indicated twice due to connection close + during remote peer's timeout on pending MPA reply. 
+* Fix ifdown hang by not calling ib_unregister_device() till removal + of iw_nes module. +* Handle RST when state of connection is in FIN_WAIT2. +* Correct properties for various nes_query_{qp, port, device} calls. + + +============================================ +Required Setting - RDMA Unify TCP port space +============================================ +RDMA connections use the same TCP port space as the host stack. To avoid +conflicts, set rdma_cm module option unify_tcp_port_space to 1 by adding +the following to /etc/modprobe.conf: + + options rdma_cm unify_tcp_port_space=1 + + +======================================== +Required Setting - Power Management Mode +======================================== +If possible, disable Active State Power Management in the BIOS, e.g.: + + PCIe ASPM L0s - Advanced State Power Management: DISABLED + + +======================= +Loadable Module Options +======================= +The following options can be used when loading the iw_nes module by modifying the +modprobe.conf file. + +wide_ppm_offset=0 + Setting this to 1 increases the CX4 interface clock ppm offset to 300ppm. + The default setting, 0, is 100ppm. + +mpa_version=1 + MPA version to be used in MPA Req/Resp (0 or 1). + +disable_mpa_crc=0 + Disable checking of MPA CRC. + Set to 1 to disable MPA CRC checking. + +send_first=0 + Send RDMA Message First on Active Connection. + +nes_drv_opt=0x00000100 + The following options are supported: + + 0x00000010 - Enable MSI + 0x00000080 - No Inline Data + 0x00000100 - Disable Interrupt Moderation + 0x00000200 - Disable Virtual Work Queue + 0x00001000 - Disable extra doorbell read after write + +nes_debug_level=0 + Specify debug output level. + +wqm_quanta=65536 + Set size of data to be transmitted at a time. + +limit_maxrdreqsz=0 + Limit PCI read request size to 256 bytes. 
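The nes_drv_opt values listed above look like independent bit flags. As a minimal sketch, assuming they can be combined with a bitwise OR (these notes do not state this explicitly), the snippet below builds a modprobe.conf line that keeps the default "Disable Interrupt Moderation" flag and adds "Disable extra doorbell read after write". The variable names and the chosen combination are illustrative only.

```shell
#!/bin/sh
# Assumption: nes_drv_opt values are independent bit flags that may be ORed.
DISABLE_INT_MOD=0x00000100    # Disable Interrupt Moderation (the listed default)
NO_EXTRA_DB_READ=0x00001000   # Disable extra doorbell read after write

# Combine the two flags and format a line suitable for modprobe.conf.
opt=$(( $DISABLE_INT_MOD | $NO_EXTRA_DB_READ ))
line=$(printf 'options iw_nes nes_drv_opt=0x%08x' "$opt")
echo "$line"
```

The printed line can be appended to modprobe.conf in the same way as the rdma_cm option shown earlier; it takes effect the next time the iw_nes module is loaded.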
+ + +=============== +Runtime Options +=============== +The following options can be used to alter the behavior of the iw_nes module: +NOTE: Assuming NetEffect Ethernet Cluster Server Adapter is assigned eth2. + + ifconfig eth2 mtu 9000 - largest mtu supported + + ethtool -K eth2 tso on - enables TSO + ethtool -K eth2 tso off - disables TSO + + ethtool -C eth2 rx-usecs-irq 128 - set static interrupt moderation + + ethtool -C eth2 adaptive-rx on - enable dynamic interrupt moderation + ethtool -C eth2 adaptive-rx off - disable dynamic interrupt moderation + ethtool -C eth2 rx-frames-low 16 - low watermark of rx queue for dynamic + interrupt moderation + ethtool -C eth2 rx-frames-high 256 - high watermark of rx queue for + dynamic interrupt moderation + ethtool -C eth2 rx-usecs-low 40 - smallest interrupt moderation timer + for dynamic interrupt moderation + ethtool -C eth2 rx-usecs-high 1000 - largest interrupt moderation timer + for dynamic interrupt moderation + +=================== +uDAPL Configuration +=================== +Rest of the document assumes the following uDAPL settings in dat.conf: + + OpenIB-iwarp u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth2 0" "" + ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" "" + + +============== +mpd.hosts file +============== +mpd.hosts is a text file with a list of nodes, one per line, in the MPI ring. +Use either fully qualified hostname or IP address. 
+ + +======================================= +Recommended Settings for HP MPI 2.2.7 +======================================= +Add the following to mpirun command: + + -1sided + +Example mpirun command with uDAPL-2.0: + + mpirun -np 2 -hostfile /opt/mpd.hosts + -UDAPL -prot -intra=shm + -e MPI_HASIC_UDAPL=ofa-v2-iwarp + -1sided + /opt/hpmpi/help/hello_world + +Example mpirun command with uDAPL-1.2: + + mpirun -np 2 -hostfile /opt/mpd.hosts + -UDAPL -prot -intra=shm + -e MPI_HASIC_UDAPL=OpenIB-iwarp + -1sided + /opt/hpmpi/help/hello_world + + +============================================================ +Recommended Settings for Platform MPI 7.1 (formerly HP-MPI) +============================================================ +Add the following to mpirun command: + + -1sided + +Example mpirun command with uDAPL-2.0: + + mpirun -np 2 -hostfile /opt/mpd.hosts + -UDAPL -prot -intra=shm + -e MPI_HASIC_UDAPL=ofa-v2-iwarp + -1sided + /opt/platform_mpi/help/hello_world + +Example mpirun command with uDAPL-1.2: + + mpirun -np 2 -hostfile /opt/mpd.hosts + -UDAPL -prot -intra=shm + -e MPI_HASIC_UDAPL=OpenIB-iwarp + -1sided + /opt/platform_mpi/help/hello_world + + +============================================== +Recommended Settings for Intel MPI 3.2.x/4.0.x +============================================== +Add the following to mpiexec command: + + -genv I_MPI_FALLBACK_DEVICE 0 + -genv I_MPI_DEVICE rdma:OpenIB-iwarp + -genv I_MPI_USE_RENDEZVOUS_RDMA_WRITE 1 + +Example mpiexec command line for uDAPL-2.0: + + mpiexec -genv I_MPI_FALLBACK_DEVICE 0 + -genv I_MPI_DEVICE rdma:ofa-v2-iwarp + -genv I_MPI_USE_RENDEZVOUS_RDMA_WRITE 1 + -ppn 1 -n 2 + /opt/intel/impi/3.2.2/bin64/IMB-MPI1 + +Example mpiexec command line for uDAPL-1.2: + mpiexec -genv I_MPI_FALLBACK_DEVICE 0 + -genv I_MPI_DEVICE rdma:OpenIB-iwarp + -genv I_MPI_USE_RENDEZVOUS_RDMA_WRITE 1 + -ppn 1 -n 2 + /opt/intel/impi/3.2.2/bin64/IMB-MPI1 + + +======================================== +Recommended Setting for MVAPICH2 and OFA 
+======================================== +Add the following to the mpirun command: + + -env MV2_USE_IWARP_MODE 1 + +Example mpiexec command line: + + mpiexec -l -n 2 + -env MV2_USE_IWARP_MODE 1 + /usr/mpi/gcc/mvapich2-1.5/tests/osu_benchmarks-3.1.1/osu_latency + + +========================================== +Recommended Setting for MVAPICH2 and uDAPL +========================================== +Add the following to the mpirun command for 64 or more processes: + + -env MV2_ON_DEMAND_THRESHOLD + +Example mpirun command with uDAPL-2.0: + + mpiexec -l -n 64 + -env MV2_DAPL_PROVIDER ofa-v2-iwarp + -env MV2_ON_DEMAND_THRESHOLD 64 + /usr/mpi/gcc/mvapich2-1.5/tests/IMB-3.2/IMB-MPI1 + +Example mpirun command with uDAPL-1.2: + + mpiexec -l -n 64 + -env MV2_DAPL_PROVIDER OpenIB-iwarp + -env MV2_ON_DEMAND_THRESHOLD 64 + /usr/mpi/gcc/mvapich2-1.5/tests/IMB-3.2/IMB-MPI1 + + +=========================== +Modify Settings in Open MPI +=========================== +There is more than one way to specify MCA parameters in +Open MPI. Please visit this link and use the best method +for your environment: + +http://www.open-mpi.org/faq/?category=tuning#setting-mca-params + + +======================================= +Recommended Settings for Open MPI 1.4.2 +======================================= +Allow the sender to use RDMA Writes: + + -mca btl_openib_flags 2 + +Example mpirun command line: + + mpirun -np 2 -hostfile /opt/mpd.hosts + -mca btl openib,self,sm + -mca btl_mpi_leave_pinned 0 + -mca btl_openib_flags 2 + /usr/mpi/gcc/openmpi-1.4.2/tests/IMB-3.2/IMB-MPI1 + + +=================================== +iWARP Multicast Acceleration (IMA) +=================================== + +iWARP multicast acceleration enables raw L2 multicast traffic kernel +bypass using user-space verbs API using the new defined QP type +IBV_QPT_RAW_ETH. + +The L2 RAW_ETH acceleration assumes that user application transmits and +receives a whole L2 frame including MAC/IP/UDP/TCP headers. 
+ +ETH RAW QP usage: +First, the application creates an IBV_QPT_RAW_ETH QP with the associated CQ, PD, +and completion channels, as is done for an RDMA connection. + +The next step is to enable L2 MAC address RX filters that direct received +multicasts to the RAW_ETH QPs, using the ibv_attach_mcast() verb. + +From this point the application is ready to receive and transmit multicast +traffic. + +In multicast acceleration the user application passes to ibv_post_send() +a whole IGMP frame including MAC header, IP header, UDP header and UDP payload. +It is the user's responsibility to perform IP fragmentation when the required payload +is larger than the MTU. Every fragment is transmitted as a separate L2 frame. +ibv_poll_cq() provides information about the status of the transmit buffer. + +On the receive path, ibv_poll_cq() returns information about the received L2 +packet; the Rx buffer (previously posted by ibv_post_recv()) contains +the whole L2 frame including MAC header, IP header and UDP header. +It is the user application's responsibility to check whether the received packet is +a valid UDP frame, so the fragments must be checked and checksums must be +computed. + +IMA API description (NE020 specific): +The user application must create separate CQs for the RX and TX paths. +Only a single SGE on transmit is supported. +The user application must post at least 65 rx buffers to keep the RX path working. + +IMA device: +IMA requires creation of the /dev/infiniband/nes_ud_sksq device to get +access to the optimized IMA transmit path. The best method for creating this +device is to manually add the following line to the /etc/udev/rules.d/90-ib.rules +file after installing the OFED distribution and rebooting the machine. 
+ +KERNEL=="nes_ud_sksq", NAME="infiniband/%k", MODE="0644" + +As a result the 90-ib.rules should look like: + +KERNEL=="umad*", NAME="infiniband/%k" +KERNEL=="issm*", NAME="infiniband/%k" +KERNEL=="ucm*", NAME="infiniband/%k", MODE="0666" +KERNEL=="uverbs*", NAME="infiniband/%k", MODE="0666" +KERNEL=="ucma", NAME="infiniband/%k", MODE="0666" +KERNEL=="rdma_cm", NAME="infiniband/%k", MODE="0666" +KERNEL=="nes_ud_sksq", NAME="infiniband/%k", MODE="0644" + + + +NetEffect is a trademark of Intel Corporation in the U.S. and other countries. diff --git a/release_notes/nfs-rdma.release-notes.txt b/release_notes/nfs-rdma.release-notes.txt new file mode 100644 index 0000000..9b1a794 --- /dev/null +++ b/release_notes/nfs-rdma.release-notes.txt @@ -0,0 +1,230 @@ +################################################################################ +# # +# NFS/RDMA README # +# # +################################################################################ + + Author: NetApp and Open Grid Computing + + Adapted for OFED 1.5.1 (from linux-2.6.30/Documentation/filesystems/nfs-rdma.txt) + by Jon Mason + +Table of Contents +~~~~~~~~~~~~~~~~~ + - Overview + - OFED 1.5.1 limitations + - Getting Help + - Installation + - Check RDMA and NFS Setup + - NFS/RDMA Setup + +Overview +~~~~~~~~ + + This document describes how to install and setup the Linux NFS/RDMA client + and server software. + + The NFS/RDMA client was first included in Linux 2.6.24. The NFS/RDMA server + was first included in the following release, Linux 2.6.25. + + In our testing, we have obtained excellent performance results (full 10Gbit + wire bandwidth at minimal client CPU) under many workloads. The code passes + the full Connectathon test suite and operates over both Infiniband and iWARP + RDMA adapters. 
+ +OFED 1.5.1 limitations: +~~~~~~~~~~~~~~~~~~~~~~~ + NFS-RDMA is supported for the following releases: + - Red Hat Enterprise Linux (RHEL) version 5.2 + - Red Hat Enterprise Linux (RHEL) version 5.3 + - Red Hat Enterprise Linux (RHEL) version 5.4 + - SUSE Linux Enterprise Server (SLES) version 11 + + And the following kernel.org kernels: + - 2.6.22 + - 2.6.25 + - 2.6.30 + + All other Linux distributions and kernel versions are NOT supported on OFED + 1.5.1. + +Getting Help +~~~~~~~~~~~~ + + If you get stuck, you can ask questions on the + nfs-rdma-devel@lists.sourceforge.net or linux-rdma@vger.kernel.org + mailing lists. + +Installation +~~~~~~~~~~~~ + + These instructions are a step by step guide to building a machine for + use with NFS/RDMA. + + - Install an RDMA device + + Any device supported by the drivers in drivers/infiniband/hw is acceptable. + + Testing has been performed using several Mellanox-based IB cards and + the Chelsio cxgb3 iWARP adapter. + + - Install OFED 1.5.1 + + NFS/RDMA has been tested on RHEL 5.2, RHEL 5.3, RHEL 5.4, SLES 11, and + kernels 2.6.22, 2.6.25, and 2.6.30. On these kernels, + NFS-RDMA will be installed by default if you simply select "install all", + and can be specifically included by a "custom" install. + + In addition, the install script will install a version of the nfs-utils that + is required for NFS/RDMA. The binary installed will be named "mount.rnfs". + This version is not necessary for Linux distributions with nfs-utils 1.1 or + later. + + Upon successful installation, the nfs kernel modules will be placed in the + directory /lib/modules/`uname -r`/updates. It is recommended that you reboot + to ensure that the correct modules are loaded. + +Check RDMA and NFS Setup +~~~~~~~~~~~~~~~~~~~~~~~~ + + Before configuring the NFS/RDMA software, it is a good idea to test + your new kernel to ensure that the kernel is working correctly. 
+ In particular, it is a good idea to verify that the RDMA stack + is functioning as expected and standard NFS over TCP/IP and/or UDP/IP + is working properly. + + - Check RDMA Setup + + If you built the RDMA components as modules, load them at + this time. For example, if you are using a Mellanox Tavor/Sinai/Arbel + card: + + $ modprobe ib_mthca + $ modprobe ib_ipoib + + If you are using InfiniBand, make sure there is a Subnet Manager (SM) + running on the network. If your IB switch has an embedded SM, you can + use it. Otherwise, you will need to run an SM, such as OpenSM, on one + of your end nodes. + + If an SM is running on your network, you should see the following: + + $ cat /sys/class/infiniband/driverX/ports/1/state + 4: ACTIVE + + where driverX is mthca0, ipath5, ehca3, etc. + + To further test the InfiniBand software stack, use IPoIB (this + assumes you have two IB hosts named host1 and host2): + + host1$ ifconfig ib0 a.b.c.x + host2$ ifconfig ib0 a.b.c.y + host1$ ping a.b.c.y + host2$ ping a.b.c.x + + For other device types, follow the appropriate procedures. + + - Check NFS Setup + + For the NFS components enabled above (client and/or server), + test their functionality over standard Ethernet using TCP/IP or UDP/IP. + +NFS/RDMA Setup +~~~~~~~~~~~~~~ + + We recommend that you use two machines, one to act as the client and + one to act as the server. + + One time configuration: + + - On the server system, configure the /etc/exports file and + start the NFS/RDMA server. + + Exports entries with the following formats have been tested: + + /vol0 192.168.0.47(fsid=0,rw,async,insecure,no_root_squash) + /vol0 192.168.0.0/255.255.255.0(fsid=0,rw,async,insecure,no_root_squash) + + The IP address(es) is(are) the client's IPoIB address for an InfiniBand + HCA or the client's iWARP address(es) for an RNIC. + + NOTE: The "insecure" option must be used because the NFS/RDMA client does + not use a reserved port. 
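 The NOTE above says that every exports entry used with NFS/RDMA needs the
 "insecure" option. The check below is a minimal sketch of one way to confirm
 that before starting the server. The function name check_insecure and its
 optional file argument are illustrative assumptions, not part of OFED or
 nfs-utils.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with OFED or nfs-utils): list exports
# entries whose option list does not contain "insecure".
# Usage: check_insecure [exports-file]   (defaults to /etc/exports)
check_insecure() {
    exports_file="${1:-/etc/exports}"
    awk '
        /^[ \t]*#/ { next }    # skip comment lines
        /^[ \t]*$/ { next }    # skip blank lines
        # Flag the entry when "insecure" is not delimited by "(" or ","
        # on the left and "," or ")" on the right.
        !/[(,]insecure[,)]/ { print "missing insecure: " $0; bad = 1 }
        END { exit bad }       # non-zero exit if any entry was flagged
    ' "$exports_file"
}
```

 Running "check_insecure" with no argument reads /etc/exports. The function
 prints each entry that lacks the option and returns non-zero if it finds one.
 It only looks for the word inside an option list on the same line, so entries
 continued across lines with a backslash are outside what this sketch handles.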
+ + Each time a machine boots: + + - Load and configure the RDMA drivers + + For InfiniBand using a Mellanox adapter: + + $ modprobe ib_mthca + $ modprobe ib_ipoib + $ ifconfig ib0 a.b.c.d + + NOTE: use unique addresses for the client and server + + - Start the NFS server + + Load the RDMA transport module: + + $ modprobe svcrdma + + Start the server: + + $ /etc/init.d/nfsserver start + + or + + $ service nfs start + + Instruct the server to listen on the RDMA transport: + + $ echo rdma 20049 > /proc/fs/nfsd/portlist + + NOTE for SLES10 servers: The nfs start scripts on most distros start + rpc.statd by default. However, the in-kernel lockd that was in SLES10 has + been removed in the new kernels. Since OFED is back-porting the new code to + the older distros, there is no in-kernel lockd in SLES10 and the SLES10 + nfsserver scripts do not know that rpc.statd needs to be started. Therefore, + the nfsserver scripts will be modified when the rnfs-utils rpm is installed to + start/stop rpc.statd. + + - On the client system + + Load the RDMA client module: + + $ modprobe xprtrdma + + Mount the NFS/RDMA server: + + $ mount.rnfs :/ /mnt -o proto=rdma,port=20049 + + NOTE: For kernels < 2.6.23, the "-i" flag must be passed into mount.rnfs. + This option allows the mount command to ignore the kernel version check. If + not disabled, the check will prevent passing arguments to the kernel and not + allow the updated version of NFS to accept the "rdma" NFS option. + + To verify that the mount is using RDMA, run "cat /proc/mounts" and check + the "proto" field for the given mount. + + Congratulations! You're using NFS/RDMA! + +Known Issues +~~~~~~~~~~~~ + +If you're running NFS/RDMA over Chelsio's T3 RNIC and your clients are using +a 64KB page size (like PPC64 and IA64 systems) and your server is using a +4KB page size (like i386 and x86_64), then you need to mount the server +using rsize=32768,wsize=32768 to avoid overrunning the Chelsio RNIC fast +register limits.
This is a known firmware limitation in the Chelsio RNIC. + +Running NFSRDMA over Mellanox's ConnectX HCA requires that the adapter firmware +be 2.7.0 or greater on all NFS clients and servers. Firmware 2.6.0 has known +issues that prevent the RDMA connection from being established. Firmware 2.7.0 +has resolved these issues. + +IPv6 support requires portmap that supports version 4. Portmap included in RHEL5 +and SLES10 only supports version 2. Without version 4 support, the following +error will be logged: + svc: failed to register lockdv1 (errno 97). +This error will not affect IPv4 support. diff --git a/release_notes/open_mpi_release_notes.txt b/release_notes/open_mpi_release_notes.txt new file mode 100644 index 0000000..e231b04 --- /dev/null +++ b/release_notes/open_mpi_release_notes.txt @@ -0,0 +1,1756 @@ + Open Fabrics Enterprise Distribution (OFED) + Open MPI in OFED 1.5.1 Copyrights, License, and Release Notes + + March 2010 + +Open MPI Copyrights +------------------- +Most files in this release are marked with the copyrights of the +organizations who have edited them. The copyrights below generally +reflect members of the Open MPI core team who have contributed code to +this release. The copyrights for code used under license from other +parties are included in the corresponding files. + +Copyright (c) 2004-2008 The Trustees of Indiana University and Indiana + University Research and Technology + Corporation. All rights reserved. +Copyright (c) 2004-2009 The University of Tennessee and The University + of Tennessee Research Foundation. All rights + reserved. +Copyright (c) 2004-2008 High Performance Computing Center Stuttgart, + University of Stuttgart. All rights reserved. +Copyright (c) 2004-2007 The Regents of the University of California. + All rights reserved. +Copyright (c) 2006-2009 Los Alamos National Security, LLC. All rights + reserved. +Copyright (c) 2006-2009 Cisco Systems, Inc. All rights reserved. +Copyright (c) 2006-2008 Voltaire, Inc. 
All rights reserved. +Copyright (c) 2006-2008 Sandia National Laboratories. All rights + reserved. +Copyright (c) 2006-2009 Sun Microsystems, Inc. All rights reserved. + Use is subject to license terms. +Copyright (c) 2006-2009 The University of Houston. All rights + reserved. +Copyright (c) 2006-2008 Myricom, Inc. All rights reserved. +Copyright (c) 2007-2008 UT-Battelle, LLC. All rights reserved. +Copyright (c) 2007-2008 IBM Corporation. All rights reserved. +Copyright (c) 1998-2005 Forschungszentrum Juelich, Juelich + Supercomputing + Centre, Federal Republic of Germany +Copyright (c) 2005-2008 ZIH, TU Dresden, Federal Republic of Germany +Copyright (c) 2007 Evergrid, Inc. All rights reserved. +Copyright (c) 2008 Institut National de Recherche en + Informatique. All rights reserved. +Copyright (c) 2007 Lawrence Livermore National Security, LLC. + All rights reserved. +Copyright (c) 2007-2010 Mellanox Technologies. All rights reserved. +Copyright (c) 2006 QLogic Corporation. All rights reserved. + +Additional copyrights may follow + +Open MPI License +---------------- +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions are +met: + +- Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + +- Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer listed + in this license in the documentation and/or other materials + provided with the distribution. + +- Neither the name of the copyright holders nor the names of its + contributors may be used to endorse or promote products derived from + this software without specific prior written permission. + +The copyright holders provide no reassurances that the source code +provided does not infringe any patent, copyright, or any other +intellectual property rights of third parties. 
The copyright holders +disclaim any liability to any recipient for claims brought against +recipient by any third party for infringement of that parties +intellectual property rights. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS +"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT +LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR +A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT +OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, +SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT +LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, +DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY +THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +=========================================================================== + +When submitting questions and problems, be sure to include as much +extra information as possible. This web page details all the +information that we request in order to provide assistance: + + http://www.open-mpi.org/community/help/ + +The best way to report bugs, send comments, or ask questions is to +sign up on the user's and/or developer's mailing list (for user-level +and developer-level questions; when in doubt, send to the user's +list): + + users@open-mpi.org + devel@open-mpi.org + +Because of spam, only subscribers are allowed to post to these lists +(ensure that you subscribe with and post from exactly the same e-mail +address -- joe@example.com is considered different than +joe@mycomputer.example.com!). Visit these pages to subscribe to the +lists: + + http://www.open-mpi.org/mailman/listinfo.cgi/users + http://www.open-mpi.org/mailman/listinfo.cgi/devel + +Thanks for your time. 
+ +=========================================================================== + +Much, much more information is also available in the Open MPI FAQ: + + http://www.open-mpi.org/faq/ + +=========================================================================== + +OFED-Specific Release Notes +--------------------------- + +** SLES 10 with Pathscale compiler support: + +Using the Pathscale compiler to build Open MPI on SLES10 may result in +a non-functional Open MPI installation (every Open MPI command fails). +If this problem occurs, try upgrading your Pathscale installation to +the latest maintenance release, or use a different compiler to compile +Open MPI. + +** Intel compiler support: + +Some versions of the Intel 9.1 C++ compiler suite series produce +incorrect code when used with the Open MPI C++ bindings. Symptoms of +this problem include crashing applications (e.g., segmentation +violations) and Open MPI producing errors about incorrect parameters. +Be sure to upgrade to the latest maintenance release of the Intel 9.1 +compiler to avoid these problems. + +** Installing newer versions of Open MPI after OFED is installed: + +Open MPI can be built from source after OFED is fully installed. The +source code for Open MPI can be extracted from the SRPM shipped with +OFED or downloaded from the main Open MPI web site: +http://www.open-mpi.org/. + +To compile with Open MPI from source with OFED support, fully install +the rest of OFED. If you used the default prefix for the OFED +installation (/usr), Open MPI should build with OpenFabrics support by +default. If you used a different OFED prefix, you must tell Open MPI +what it is with the "--with-openib=" switch to configure. 
+You can verify that Open MPI installed with OpenFabrics support by +running (the exact version numbers displayed may be different; the +important part is that the "openib" BTL is displayed): + + shell$ ompi_info | grep openib + MCA btl: openib (MCA v2.0, API v2.0, Component v1.3.2) + +See the rest of the documentation below for other configure command +line options and installation instructions. + +** Changelog summary + +Showing versions 1.2.7 - 1.4; see the "NEWS" file in an Open MPI +distribution for the full list. + +1.4.1 (OFED version) +--- +- Update support for various OpenFabrics devices in the openib BTL's + .ini file. +- Fixing RDMA CM failure during QP creation (Ticket #2307) + +1.4.1 +--- +- Update to PLPA v1.3.2, addressing a licensing issue identified by + the Fedora project. See + https://svn.open-mpi.org/trac/plpa/changeset/262 for details. +- Add check for malformed checkpoint metadata files (Ticket #2141). +- Fix error path in ompi-checkpoint when not able to checkpoint + (Ticket #2138). +- Cleanup component release logic when selecting checkpoint/restart + enabled components (Ticket #2135). +- Fixed VT node name detection for Cray XT platforms, and fixed some + broken VT documentation files. +- Fix a possible race condition in tearing down RDMA CM-based + connections. +- Relax error checking on MPI_GRAPH_CREATE. Thanks to David Singleton + for pointing out the issue. +- Fix a shared memory "hang" problem that occurred on x86/x86_64 + platforms when used with the GNU >=4.4.x compiler series. +- Add fix for Libtool 2.2.6b's problems with the PGI 10.x compiler + suite. Inspired directly from the upstream Libtool patches that fix + the issue (but we need something working before the next Libtool + release). + +1.4 +--- + +The *only* change in the Open MPI v1.4 release (as compared to v1.3.4) +was to update the embedded version of Libtool's libltdl to address a +potential security vulnerability. 
Specifically: Open MPI v1.3.4 was +created with GNU Libtool 2.2.6a; Open MPI v1.4 was created with GNU +Libtool 2.2.6b. There are no other changes between Open MPI v1.3.4 +and v1.4. + + +1.3.4 +----- + +- Fix some issues in OMPI's SRPM with regard to shell_scripts_basename + and its use with mpi-selector. Thanks to Bill Johnstone for + pointing out the problem. +- Added many new MPI job process affinity options to mpirun. See the + newly-updated mpirun(1) man page for details. +- Several updates to mpirun's XML output. +- Update to fix a few Valgrind warnings with regards to the ptmalloc2 + allocator and Open MPI's use of PLPA. +- Many updates and fixes to the (non-default) "sm" collective + component (i.e., native shared memory MPI collective operations). +- Updates and fixes to some MPI_COMM_SPAWN_MULTIPLE corner cases. +- Fix some internal copying functions in Open MPI's use of PLPA. +- Correct some SLURM nodelist parsing logic that may have interfered + with large jobs. Additionally, per advice from the SLURM team, + change the environment variable that we use for obtaining the job's + allocation. +- Revert to an older, safer (but slower) communicator ID allocation + algorithm. +- Fixed minimum distance finding for OpenFabrics devices in the openib + BTL. +- Relax the parameter checking MPI_CART_CREATE a bit. +- Fix MPI_COMM_SPAWN[_MULTIPLE] to only error-check the info arguments + on the root process. Thanks to Federico Golfre Andreasi for + reporting the problem. +- Fixed some BLCR configure issues. +- Fixed a potential deadlock when the openib BTL was used with + MPI_THREAD_MULTIPLE. +- Fixed dynamic rules selection for the "tuned" coll component. +- Added a launch progress meter to mpirun (useful for large jobs; set + the orte_report_launch_progress MCA parameter to 1 to see it). +- Reduced the number of file descriptors consumed by each MPI process. +- Add new device IDs for Chelsio T3 RNICs to the openib BTL config file. 
+- Fix some CRS self component issues. +- Added some MCA parameters to the PSM MTL to tune its run-time + behavior. +- Fix some VT issues with MPI_BOTTOM/MPI_IN_PLACE. +- Man page updates from the Debian Open MPI package maintainers. +- Add cycle counter support for the Alpha and Sparc platforms. +- Pass visibility flags to libltdl's configure script, resulting in + those symbols being hidden. This appears to mainly solve the + problem of applications attempting to use different versions of + libltdl from that used to build Open MPI. + + +1.3.3 +----- + +- Fix a number of issues with the openib BTL (OpenFabrics) RDMA CM, + including a memory corruption bug, a shutdown deadlock, and a route + timeout. Thanks to David McMillen and Hal Rosenstock for help in + tracking down the issues. +- Change the behavior of the EXTRA_STATE parameter that is passed to + Fortran attribute callback functions: this value is now stored + internally in MPI -- it no longer references the original value + passed by MPI_*_CREATE_KEYVAL. +- Allow overriding RFC1918 and RFC3330 for the specification of + "private" networks, thereby influencing Open MPI's TCP + "reachability" computations. +- Improve flow control in the sm BTL, both by tweaking the + shared memory progression rules and by enabling the "sync" collective + to barrier every 1,000th collective. +- Various fixes for the IBM XL C/C++ v10.1 compiler. +- Allow explicit disabling of ptmalloc2 hooks at runtime (e.g., enable + support for Debian's builtroot system). Thanks to Manuel Prinz and + the rest of the Debian crew for helping identify and fix this issue. +- Various minor fixes for the I/O forwarding subsystem. +- Big endian iWARP fixes in the OpenFabrics RDMA CM support. +- Update support for various OpenFabrics devices in the openib BTL's + .ini file. +- Fixed undefined symbol issue with Open MPI's parallel debugger + message queue support so it can be compiled by Sun Studio compilers.
+- Update MPI_SUBVERSION to 1 in the Fortran bindings. +- Fix MPI_GRAPH_CREATE Fortran 90 binding. +- Fix MPI_GROUP_COMPARE behavior with regards to MPI_IDENT. Thanks to + Geoffrey Irving for identifying the problem and supplying the fix. +- Silence gcc 4.1 compiler warnings about type punning. Thanks to + Number Cruncher for the fix. +- Added more Valgrind and other memory-cleanup fixes. Thanks to + various Open MPI users for help with these issues. +- Miscellaneous VampirTrace fixes. +- More fixes for openib credits in heavy-congestion scenarios. +- Slightly decrease the latency in the openib BTL in some conditions + (add "send immediate" support to the openib BTL). +- Ensure to allow MPI_REQUEST_GET_STATUS to accept an + MPI_STATUS_IGNORE parameter. Thanks to Shaun Jackman for the bug + report. +- Added Microsoft Windows support. See README.WINDOWS file for + details. + + +1.3.2 +----- + +- Fixed a potential infinite loop in the openib BTL that could occur + in senders in some frequent-communication scenarios. Thanks to Don + Wood for reporting the problem. +- Add a new checksum PML variation on ob1 (main MPI point-to-point + communication engine) to detect memory corruption in node-to-node + messages +- Add a new configuration option to add padding to the openib + header so the data is aligned +- Add a new configuration option to use an alternative checksum algo + when using the checksum PML +- Fixed a problem reported by multiple users on the mailing list that + the LSF support would fail to find the appropriate libraries at + run-time. +- Allow empty shell designations from getpwuid(). Thanks to Sergey + Koposov for the bug report. +- Ensure that mpirun exits with non-zero status when applications die + due to user signal. Thanks to Geoffroy Pignot for suggesting the + fix. +- Ensure that MPI_VERSION / MPI_SUBVERSION match what is returned by + MPI_GET_VERSION. Thanks to Rob Egan for reporting the error. 
+- Updated MPI_*KEYVAL_CREATE functions to properly handle Fortran + extra state. +- A variety of ob1 (main MPI point-to-point communication engine) bug + fixes that could have caused hangs or seg faults. +- Do not install Open MPI's signal handlers in MPI_INIT if there are + already signal handlers installed. Thanks to Kees Verstoep for + bringing the issue to our attention. +- Fix GM support to not seg fault in MPI_INIT. +- Various VampirTrace fixes. +- Various PLPA fixes. +- No longer create BTLs for invalid (TCP) devices. +- Various man page style and lint cleanups. +- Fix critical OpenFabrics-related bug noted here: + http://www.open-mpi.org/community/lists/announce/2009/03/0029.php. + Open MPI now uses a much more robust memory intercept scheme that is + quite similar to what is used by MX. The use of "-lopenmpi-malloc" + is no longer necessary, is deprecated, and is expected to disappear + in a future release. -lopenmpi-malloc will continue to work for the + duration of the Open MPI v1.3 and v1.4 series. +- Fix some OpenFabrics shutdown errors, both regarding iWARP and SRQ. +- Allow the udapl BTL to work on Solaris platforms that support + relaxed PCI ordering. +- Fix problem where the mpirun would sometimes use rsh/ssh to launch on + the localhost (instead of simply forking). +- Minor SLURM stdin fixes. +- Fix to run properly under SGE jobs. +- Scalability and latency improvements for shared memory jobs: convert + to using one message queue instead of N queues. +- Automatically size the shared-memory area (mmap file) to match + better what is needed; specifically, so that large-np jobs will start. +- Use fixed-length MPI predefined handles in order to provide ABI + compatibility between Open MPI releases. +- Fix building of the posix paffinity component to properly get the + number of processors in loosely tested environments (e.g., + FreeBSD). Thanks to Steve Kargl for reporting the issue. +- Fix --with-libnuma handling in configure. 
Thanks to Gus Correa for + reporting the problem. + + +1.3.1 +----- + +- Added "sync" coll component to allow users to synchronize every N + collective operations on a given communicator. +- Increased the default values of the IB and RNR timeout MCA parameters. +- Fix a compiler error noted by Mostyn Lewis with the PGI 8.0 compiler. +- Fix an error that prevented stdin from being forwarded if the + rsh launcher was in use. Thanks to Branden Moore for pointing out + the problem. +- Correct a case where the added datatype is considered as contiguous but + has gaps at the beginning. +- Fix an error that limited the number of comm_spawns that could + simultaneously be running in some environments. +- Correct a corner case in OB1's GET protocol for long messages; the + error could sometimes cause MPI jobs using the openib BTL to hang. +- Fix a bunch of bugs in the IO forwarding (IOF) subsystem and add some + new options to output to files and redirect output to xterm. Thanks to + Jody Weissmann for helping test out many of the new fixes and + features. +- Fix SLURM race condition. +- Fix MPI_File_c2f(MPI_FILE_NULL) to return 0, not -1. Thanks to + Lisandro Dalcin for the bug report. +- Fix the DSO build of tm PLM. +- Various fixes for size disparity between C int's and Fortran + INTEGER's. Thanks to Christoph van Wullen for the bug report. +- Ensure that mpirun exits with a non-zero exit status when daemons or + processes abort or fail to launch. +- Various fixes to work around Intel (NetEffect) RNIC behavior. +- Various fixes for mpirun's --preload-files and --preload-binary + options. +- Fix the string name in MPI::ERRORS_THROW_EXCEPTIONS. +- Add ability to forward SIGTSTP and SIGCONT to MPI processes if you + set the MCA parameter orte_forward_job_control to 1. +- Allow the sm BTL to allocate larger amounts of shared memory if + desired (helpful for very large multi-core boxen).
+- Fix a few places where we used PATH_MAX instead of OMPI_PATH_MAX, + leading to compile problems on some platforms. Thanks to Andrea Iob + for the bug report. +- Fix mca_btl_openib_warn_no_device_params_found MCA parameter; it + was accidentally being ignored. +- Fix some run-time issues with the sctp BTL. +- Ensure that RTLD_NEXT exists before trying to use it (e.g., it + doesn't exist on Cygwin). Thanks to Gustavo Seabra for reporting + the issue. +- Various fixes to VampirTrace, including fixing compile errors on + some platforms. +- Fixed missing MPI_Comm_accept.3 man page; fixed minor issue in + orterun.1 man page. Thanks to Dirk Eddelbuettel for identifying the + problem and submitting a patch. +- Implement the XML formatted output of stdout/stderr/stddiag. +- Fixed mpirun's -wdir switch to ensure that working directories for + multiple app contexts are properly handled. Thanks to Geoffroy + Pignot for reporting the problem. +- Improvements to the MPI C++ integer constants: + - Allow MPI::SEEK_* constants to be used as constants + - Allow other MPI C++ constants to be used as array sizes +- Fix minor problem with orte-restart's command line options. See + ticket #1761 for details. Thanks to Gregor Dschung for reporting + the problem. + +1.3 +--- + +- Extended the OS X 10.5.x (Leopard) workaround for a problem when + assembly code is compiled with -g[0-9]. Thanks to Barry Smith for + reporting the problem. See ticket #1701. +- Disabled MPI_REAL16 and MPI_COMPLEX32 support on platforms where the + bit representation of REAL*16 is different than that of the C type + of the same size (usually long double). Thanks to Julien Devriendt + for reporting the issue. See ticket #1603. +- Increased the size of MPI_MAX_PORT_NAME to 1024 from 36. See ticket #1533. +- Added "notify debugger on abort" feature. See tickets #1509 and #1510. + Thanks to Seppo Sahrakropi for the bug report. +- Upgraded Open MPI tarballs to use Autoconf 2.63, Automake 1.10.1, + Libtool 2.2.6a. 
+- Added missing MPI::Comm::Call_errhandler() function. Thanks to Dave + Goodell for bringing this to our attention. +- Increased MPI_SUBVERSION value in mpi.h to 1 (i.e., MPI 2.1). +- Changed behavior of MPI_GRAPH_CREATE, MPI_TOPO_CREATE, and several + other topology functions per MPI-2.1. +- Fix the type of the C++ constant MPI::IN_PLACE. +- Various enhancements to the openib BTL: + - Added btl_openib_if_[in|ex]clude MCA parameters for + including/excluding comma-delimited lists of HCAs and ports. + - Added RDMA CM support, including btl_openib_cpc_[in|ex]clude MCA + parameters. + - Added NUMA support to only use "near" network adapters + - Added "Bucket SRQ" (BSRQ) support to better utilize registered + memory, including btl_openib_receive_queues MCA parameter + - Added ConnectX XRC support (and integrated with BSRQ) + - Added btl_openib_ib_max_inline_data MCA parameter + - Added iWARP support + - Revamped flow control mechanisms to be more efficient + - "mpi_leave_pinned=1" is now the default when possible, + automatically improving performance for large messages when + application buffers are re-used +- Eliminated duplicated error messages when multiple MPI processes fail + with the same error. +- Added NUMA support to the shared memory BTL. +- Add Valgrind-based memory checking for MPI-semantic checks. +- Add support for some optional Fortran datatypes (MPI_LOGICAL1, + MPI_LOGICAL2, MPI_LOGICAL4 and MPI_LOGICAL8). +- Remove the use of the STL from the C++ bindings. +- Added support for Platform/LSF job launchers. Must be Platform LSF + v7.0.2 or later. +- Updated ROMIO with the version from MPICH2 1.0.7. +- Added RDMA capable one-sided component (called rdma), which + can be used with BTL components that expose a full one-sided + interface. +- Added the optional datatype MPI_REAL2. As this is added to the "end of" + predefined datatypes in the Fortran header files, there will not be + any compatibility issues.
+- Added Portable Linux Processor Affinity (PLPA) for Linux. +- Added finer control over symbol exports via the visibility feature + offered by some compilers. +- Added checkpoint/restart process fault tolerance support. Initially + supports a LAM/MPI-like protocol. +- Removed "mvapi" BTL; all InfiniBand support now uses the OpenFabrics + driver stacks ("openib" BTL). +- Added more stringent MPI API parameter checking to help user-level + debugging. +- The ptmalloc2 memory manager component is now by default built as + a standalone library named libopenmpi-malloc. Users wanting to + use leave_pinned with ptmalloc2 will now need to link the library + into their application explicitly. All other users will use the + libc-provided allocator instead of Open MPI's ptmalloc2. This change + may be overridden with the configure option enable-ptmalloc2-internal. +- The leave_pinned options will now default to using mallopt on + Linux in the cases where ptmalloc2 was not linked in. mallopt + will also only be available if munmap can be intercepted (the + default whenever Open MPI is not compiled with + --without-memory-manager). +- Open MPI will now complain and refuse to use leave_pinned if + no memory intercept / mallopt option is available. +- Add option of using Perl-based wrapper compilers instead of the + C-based wrapper compilers. The Perl-based version does not + have the features of the C-based version, but does work better + in cross-compile environments. + + +1.2.9 +----- + +- Fix a segfault when using one-sided communications on some forms of derived + datatypes. Thanks to Dorian Krause for reporting the bug. See #1715. +- Fix an alignment problem affecting one-sided communications on + some architectures (e.g., SPARC64). See #1738. +- Fix compilation on Solaris when thread support is enabled in Open MPI + (e.g., when using --with-threads). See #1736. +- Correctly take into account the MTU that an OpenFabrics device port + is using.
See #1722 and + https://bugs.openfabrics.org/show_bug.cgi?id=1369. +- Fix two datatype engine bugs. See #1677. + Thanks to Peter Kjellstrom for the bug report. +- Fix the bml r2 help filename so the help message can be found. See #1623. +- Fix a compilation problem on RHEL4U3 with the PGI 32-bit compiler + caused by . See ticket #1613. +- Fix the --enable-cxx-exceptions configure option. See ticket #1607. +- Properly handle when the MX BTL cannot open an endpoint. See ticket #1621. +- Fix a double free of events on the tcp_events list. See ticket #1631. +- Fix a buffer overrun in opal_free_list_grow (called by MPI_Init). + Thanks to Patrick Farrell for the bug report and Stephan Kramer for + the bugfix. See ticket #1583. +- Fix a problem setting OPAL_PREFIX for remote sh-based shells. + See ticket #1580. + + +1.2.8 +----- + +- Tweaked one memory barrier in the openib component to be more conservative. + May fix a problem observed on PPC machines. See ticket #1532. +- Fix OpenFabrics IB partition support. See ticket #1557. +- Restore v1.1 feature that sourced .profile on remote nodes if the default + shell will not do so (e.g., /bin/sh and /bin/ksh). See ticket #1560. +- Fix segfault in MPI_Init_thread() if ompi_mpi_init() fails. See ticket #1562. +- Adjust SLURM support to first look for $SLURM_JOB_CPUS_PER_NODE instead of + the deprecated $SLURM_TASKS_PER_NODE environment variable. This change + may be *required* when using SLURM v1.2 and above. See ticket #1536. +- Fix the MPIR_Proctable to be in process rank order. See ticket #1529. +- Fix a regression introduced in 1.2.6 for the IBM eHCA. See ticket #1526. + + +1.2.7 +----- + +- Add some Sun HCA vendor IDs. See ticket #1461. +- Fixed a memory leak in MPI_Alltoallw when called from Fortran. + Thanks to Dave Grote for the bug report. See ticket #1457. +- Only link in libutil when it is needed/desired. Thanks to + Brian Barret for diagnosing and fixing the problem. See ticket #1455.
+- Update some QLogic HCA vendor IDs. See ticket #1453. +- Fix F90 binding for MPI_CART_GET. Thanks to Scott Beardsley for + bringing it to our attention. See ticket #1429. +- Remove a spurious warning message generated in/by ROMIO. See ticket #1421. +- Fix a bug where command-line MCA parameters were not overriding + MCA parameters set from environment variables. See ticket #1380. +- Fix a bug in the AMD64 atomics assembly. Thanks to Gabriele Fatigati + for the bug report and bugfix. See ticket #1351. +- Fix a gather and scatter bug on intercommunicators when the datatype + being moved is 0 bytes. See ticket #1331. +- Some more man page fixes from the Debian maintainers. + See tickets #1324 and #1329. +- Have openib BTL (OpenFabrics support) check for the presence of + /sys/class/infiniband before allowing itself to be used. This check + prevents spurious "OMPI did not find RDMA hardware!" notices on + systems that have the software drivers installed, but no + corresponding hardware. See tickets #1321 and #1305. +- Added vendor IDs for some ConnectX openib HCAs. See ticket #1311. +- Fix some RPM specfile inconsistencies. See ticket #1308. + Thanks to Jim Kusznir for noticing the problem. +- Removed an unused function prototype that caused warnings on + some systems (e.g., OS X). See ticket #1274. +- Fix a deadlock in inter-communicator scatter/gather operations. + Thanks to Martin Audet for the bug report. See ticket #1268. 
+ +=========================================================================== + +Much, much more information is also available in the Open MPI FAQ: + + http://www.open-mpi.org/faq/ + +=========================================================================== + +General Release Notes +--------------------- + +Detailed Open MPI v1.3 Feature List: + + o Open MPI RunTime Environment (ORTE) improvements + - General robustness improvements + - Scalable job launch (we've seen ~16K processes in less than a + minute in a highly-optimized configuration) + - New process mappers + - Support for Platform/LSF environments (v7.0.2 and later) + - More flexible processing of host lists + - new mpirun cmd line options and associated functionality + + o Fault-Tolerance Features + - Asynchronous, transparent checkpoint/restart support + - Fully coordinated checkpoint/restart coordination component + - Support for the following checkpoint/restart services: + - blcr: Berkeley Lab's Checkpoint/Restart + - self: Application level callbacks + - Support for the following interconnects: + - tcp + - mx + - openib + - sm + - self + - Improved Message Logging + + o MPI_THREAD_MULTIPLE support for point-to-point messaging in the + following BTLs (note that only MPI point-to-point messaging API + functions support MPI_THREAD_MULTIPLE; other API functions likely + do not): + - tcp + - sm + - mx + - elan + - self + + o Point-to-point Messaging Layer (PML) improvements + - Memory footprint reduction + - Improved latency + - Improved algorithm for multiple communication device + ("multi-rail") support + + o Numerous Open Fabrics improvements/enhancements + - Added iWARP support (including RDMA CM) + - Memory footprint and performance improvements + - "Bucket" SRQ support for better registered memory utilization + - XRC/ConnectX support + - Message coalescing + - Improved error report mechanism with Asynchronous events + - Automatic Path Migration (APM) + - Improved processor/port binding + - 
Infrastructure for additional wireup strategies + - mpi_leave_pinned is now enabled by default + + o uDAPL BTL enhancements + - Multi-rail support + - Subnet checking + - Interface include/exclude capabilities + + o Processor affinity + - Linux processor affinity improvements + - Core/socket <--> process mappings + + o Collectives + - Performance improvements + - Support for hierarchical collectives (must be activated + manually; see below) + + o Miscellaneous + - MPI 2.1 compliant + - Sparse process groups and communicators + - Support for Cray Compute Node Linux (CNL) + - One-sided RDMA component (BTL-level based rather than PML-level + based) + - Aggregate MCA parameter sets + - MPI handle debugging + - Many small improvements to the MPI C++ bindings + - Valgrind support + - VampirTrace support + - Updated ROMIO to the version from MPICH2 1.0.7 + - Removed the mVAPI IB stacks + - Display most error messages only once (vs. once for each + process) + - Many other small improvements and bug fixes, too numerous to + list here + +Known issues +------------ + + o There is a segfault that sometimes occurs on one of our x86_64 test + clusters when using MPI onesided communications over Myrinet MX. + Since no one else has reported this problem we are not holding + up the 1.3 release. See ticket #1757 for the details, and any + possible workarounds. + + o XGrid support is currently broken. + https://svn.open-mpi.org/trac/ompi/ticket/1777 + + o MPI_REDUCE_SCATTER does not work with counts of 0. + https://svn.open-mpi.org/trac/ompi/ticket/1559 + + o Please also see the Open MPI bug tracker for bugs beyond this release. 
+ https://svn.open-mpi.org/trac/ompi/report + +=========================================================================== + +The following abbreviated list of release notes applies to this code +base as of this writing (10 July 2009): + +General notes +------------- + +- Open MPI includes support for a wide variety of supplemental + hardware and software packages. When configuring Open MPI, you may + need to supply additional flags to the "configure" script in order + to tell Open MPI where the header files, libraries, and any other + required files are located. As such, running "configure" by itself + may not include support for all the devices (etc.) that you expect, + especially if their support headers / libraries are installed in + non-standard locations. Network interconnects are an easy example + to discuss -- Myrinet and OpenFabrics networks, for example, both + have supplemental headers and libraries that must be found before + Open MPI can build support for them. You must specify where these + files are with the appropriate options to configure. See the + listing of configure command-line switches, below, for more details. + +- The majority of Open MPI's documentation is here in this file, the + included man pages, and on the web site FAQ + (http://www.open-mpi.org/). This will eventually be supplemented + with cohesive installation and user documentation files. + +- Note that Open MPI documentation uses the word "component" + frequently; the word "plugin" is probably more familiar to most + users. As such, end users can probably completely substitute the + word "plugin" wherever you see "component" in our documentation. + For what it's worth, we use the word "component" for historical + reasons, mainly because it is part of our acronyms and internal API + function calls. 
+ +- The run-time systems that are currently supported are: + - rsh / ssh + - LoadLeveler + - PBS Pro, Open PBS, Torque + - Platform LSF (v7.0.2 and later) + - SLURM + - XGrid (known to be broken in 1.3 through 1.3.2) + - Cray XT-3 and XT-4 + - Sun Grid Engine (SGE) 6.1, 6.2 and open source Grid Engine + - Microsoft Windows CCP (Microsoft Windows server 2003 and 2008) + +- Systems that have been tested are: + - Linux (various flavors/distros), 32 bit, with gcc, and Sun Studio 12 + - Linux (various flavors/distros), 64 bit (x86), with gcc, Absoft, + Intel, Portland, Pathscale, and Sun Studio 12 compilers (*) + - OS X (10.4), 32 and 64 bit (i386, PPC, PPC64, x86_64), with gcc + and Absoft compilers (*) + - Solaris 10 update 2, 3 and 4, 32 and 64 bit (SPARC, i386, x86_64), + with Sun Studio 10, 11 and 12 + + (*) Be sure to read the Compiler Notes, below. + +- Other systems have been lightly (but not fully) tested: + - Other 64 bit platforms (e.g., Linux on PPC64) + - Microsoft Windows CCP (Microsoft Windows server 2003 and 2008); + more testing and support is expected later in the Open MPI v1.3.x + series. See the README.WINDOWS file. + +Compiler Notes +-------------- + +- Mixing compilers from different vendors when building Open MPI + (e.g., using the C/C++ compiler from one vendor and the F77/F90 + compiler from a different vendor) has been successfully employed by + some Open MPI users (discussed on the Open MPI user's mailing list), + but such configurations are not tested and not documented. For + example, such configurations may require additional compiler / + linker flags to make Open MPI build properly. + +- Open MPI does not support the Sparc v8 CPU target, which is the + default on Sun Solaris. The v8plus (32 bit) or v9 (64 bit) + targets must be used to build Open MPI on Solaris. This can be + done by including a flag in CFLAGS, CXXFLAGS, FFLAGS, and FCFLAGS, + -xarch=v8plus for the Sun compilers, -mv8plus for GCC. 
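As a minimal sketch of the Sparc note above, the helper below assembles a configure line that passes the same 32 bit target flag in all four variables. The function name sparc_configure_cmd is hypothetical and not part of Open MPI; it only prints the command so that it can be reviewed before running it.

```shell
# Hypothetical helper (not part of Open MPI): print a configure command
# that sets the Sparc v8plus target flag in CFLAGS, CXXFLAGS, FFLAGS,
# and FCFLAGS.  $1 is the compiler family: "sun" (default) or "gcc".
sparc_configure_cmd() {
    case "${1:-sun}" in
        sun) flag=-xarch=v8plus ;;   # flag named in the note for Sun compilers
        gcc) flag=-mv8plus ;;        # flag named in the note for GCC
        *)   echo "unknown compiler family: $1" >&2; return 1 ;;
    esac
    echo "./configure CFLAGS=$flag CXXFLAGS=$flag FFLAGS=$flag FCFLAGS=$flag"
}

sparc_configure_cmd sun
```

Other configure options (for example --prefix) would be appended to the printed line as usual.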
+ +- At least some versions of the Intel 8.1 compiler seg fault while + compiling certain Open MPI source code files. As such, it is not + supported. + +- The Intel 9.0 v20051201 compiler on IA64 platforms seems to have a + problem with optimizing the ptmalloc2 memory manager component (the + generated code will segv). As such, the ptmalloc2 component will + automatically disable itself if it detects that it is on this + platform/compiler combination. The only effect that this should + have is that the MCA parameter mpi_leave_pinned will be inoperative. + +- Early versions of the Portland Group 6.0 compiler have problems + creating the C++ MPI bindings as a shared library (e.g., v6.0-1). + Tests with later versions show that this has been fixed (e.g., + v6.0-5). + +- The Portland Group compilers prior to version 7.0 require the + "-Msignextend" compiler flag to extend the sign bit when converting + from a shorter to longer integer. This is different than other + compilers (such as GNU). When compiling Open MPI with the Portland + compiler suite, the following flags should be passed to Open MPI's + configure script: + + shell$ ./configure CFLAGS=-Msignextend CXXFLAGS=-Msignextend \ + --with-wrapper-cflags=-Msignextend \ + --with-wrapper-cxxflags=-Msignextend ... + + This will both compile Open MPI with the proper compile flags and + also automatically add "-Msignextend" when the C and C++ MPI wrapper + compilers are used to compile user MPI applications. + +- Using the MPI C++ bindings with the Pathscale compiler is known + to fail, possibly due to Pathscale compiler issues. + +- Using the Absoft compiler to build the MPI Fortran bindings on Suse + 9.3 is known to fail due to a Libtool compatibility issue. + +- Open MPI will build bindings suitable for all common forms of + Fortran 77 compiler symbol mangling on platforms that support it + (e.g., Linux). 
On platforms that do not support weak symbols (e.g., + OS X), Open MPI will build Fortran 77 bindings just for the compiler + that Open MPI was configured with. + + Hence, on platforms that support it, if you configure Open MPI with + a Fortran 77 compiler that uses one symbol mangling scheme, you can + successfully compile and link MPI Fortran 77 applications with a + Fortran 77 compiler that uses a different symbol mangling scheme. + + NOTE: For platforms that support the multi-Fortran-compiler bindings + (i.e., weak symbols are supported), due to limitations in the MPI + standard and in Fortran compilers, it is not possible to hide these + differences in all cases. Specifically, the following two cases may + not be portable between different Fortran compilers: + + 1. The C constants MPI_F_STATUS_IGNORE and MPI_F_STATUSES_IGNORE + will only compare properly to Fortran applications that were + created with Fortran compilers that use the same + name-mangling scheme as the Fortran compiler that Open MPI was + configured with. + + 2. Fortran compilers may have different values for the logical + .TRUE. constant. As such, any MPI function that uses the Fortran + LOGICAL type may only get .TRUE. values back that correspond to + the .TRUE. value of the Fortran compiler that Open MPI was + configured with. Note that some Fortran compilers allow forcing + .TRUE. to be 1 and .FALSE. to be 0. For example, the Portland + Group compilers provide the "-Munixlogical" option, and Intel + compilers (version >= 8.) provide the "-fpscomp logicals" option. + + You can use the ompi_info command to see the Fortran compiler that + Open MPI was configured with. + +- The Fortran 90 MPI bindings can now be built in one of three sizes + using --with-mpi-f90-size=SIZE (see description below). These sizes + reflect the number of MPI functions included in the "mpi" Fortran 90 + module and therefore which functions will be subject to strict type + checking. 
All functions not included in the Fortran 90 module can + still be invoked from F90 applications, but will fall back to + Fortran-77 style checking (i.e., little/none). + + - trivial: Only includes F90-specific functions from MPI-2. This + means overloaded versions of MPI_SIZEOF for all the MPI-supported + F90 intrinsic types. + + - small (default): All the functions in "trivial" plus all MPI + functions that take no choice buffers (meaning buffers that are + specified by the user and are of type (void*) in the C bindings -- + generally buffers specified for message passing). Hence, + functions like MPI_COMM_RANK are included, but functions like + MPI_SEND are not. + + - medium: All the functions in "small" plus all MPI functions that + take one choice buffer (e.g., MPI_SEND, MPI_RECV, ...). All + one-choice-buffer functions have overloaded variants for each of + the MPI-supported Fortran intrinsic types up to the number of + dimensions specified by --with-f90-max-array-dim (default value is + 4). + + Increasing the size of the F90 module (in order from trivial, small, + and medium) will generally increase the length of time required to + compile user MPI applications. Specifically, "trivial"- and + "small"-sized F90 modules generally allow user MPI applications to + be compiled fairly quickly but lose type safety for all MPI + functions with choice buffers. "medium"-sized F90 modules generally + take longer to compile user applications but provide greater type + safety for MPI functions. + + Note that MPI functions with two choice buffers (e.g., MPI_GATHER) + are not currently included in Open MPI's F90 interface. Calls to + these functions will automatically fall through to Open MPI's F77 + interface. A "large" size that includes the two choice buffer MPI + functions is possible in future versions of Open MPI. 
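The F90 module sizes described above map onto two configure options. The sketch below assumes a hypothetical helper name, f90_configure_cmd, which is not part of Open MPI; it validates the requested size against the three values named in these notes and prints the resulting configure line rather than running it.

```shell
# Hypothetical helper (not part of Open MPI): print a configure command
# selecting the F90 module size and the maximum array dimension.
# $1 is the size (trivial, small, or medium; default small).
# $2 is the value for --with-f90-max-array-dim (default 4).
f90_configure_cmd() {
    size=${1:-small}
    dim=${2:-4}
    case "$size" in
        trivial|small|medium) ;;   # the three sizes listed in these notes
        *) echo "unsupported F90 module size: $size" >&2; return 1 ;;
    esac
    echo "./configure --with-mpi-f90-size=$size --with-f90-max-array-dim=$dim"
}

f90_configure_cmd medium 4
```

Asking for "large" is rejected here because these notes describe that size only as a possibility for future versions.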
+ + +General Run-Time Support Notes +------------------------------ + +- The Open MPI installation must be in your PATH on all nodes (and + potentially LD_LIBRARY_PATH, if libmpi is a shared library), unless + using the --prefix or --enable-mpirun-prefix-by-default + functionality (see below). + +- LAM/MPI-like mpirun notation of "C" and "N" is not yet supported. + +- The XGrid support is experimental - see the Open MPI FAQ and this + post on the Open MPI user's mailing list for more information: + + http://www.open-mpi.org/community/lists/users/2006/01/0539.php + +- Open MPI's run-time behavior can be customized via MCA ("MPI + Component Architecture") parameters (see below for more information + on how to get/set MCA parameter values). Some MCA parameters can be + set in a way that renders Open MPI inoperable (see notes about MCA + parameters later in this file). In particular, some parameters have + required options that must be included. + + - If specified, the "btl" parameter must include the "self" + component, or Open MPI will not be able to deliver messages to the + same rank as the sender. For example: "mpirun --mca btl tcp,self + ..." + - If specified, the "btl_tcp_if_exclude" parameter must include the + loopback device ("lo" on many Linux platforms), or Open MPI will + not be able to route MPI messages using the TCP BTL. For example: + "mpirun --mca btl_tcp_if_exclude lo,eth1 ..." + +- Running on nodes with different endian and/or different datatype + sizes within a single parallel job is supported in this release. + However, Open MPI does not resize data when datatypes differ in size + (for example, sending a 4 byte MPI_DOUBLE and receiving an 8 byte + MPI_DOUBLE will fail). + + +MPI Functionality and Features +------------------------------ + +- All MPI-2.1 functionality is supported. + +- MPI_THREAD_MULTIPLE support is included, but is only lightly tested. + It likely does not work for thread-intensive applications. 
Note + that *only* the MPI point-to-point communication functions for the + BTL's listed above are considered thread safe. Other support + functions (e.g., MPI attributes) have not been certified as safe + when simultaneously used by multiple threads. + + Note that Open MPI's thread support is in a fairly early stage; the + above devices are likely to *work*, but the latency is likely to be + fairly high. Specifically, efforts so far have concentrated on + *correctness*, not *performance* (yet). + +- MPI_REAL16 and MPI_COMPLEX32 are only supported on platforms where a + portable C datatype can be found that matches the Fortran type + REAL*16, both in size and bit representation. + +- Asynchronous message passing progress using threads can be turned on + with the --enable-progress-threads option to configure. + Asynchronous message passing progress is only supported with devices + that support MPI_THREAD_MULTIPLE, but is only very lightly tested + (and may not provide very much performance benefit). + + +Collectives +----------- + +- The "hierarch" coll component (i.e., an implementation of MPI + collective operations) attempts to discover network layers of + latency in order to segregate individual "local" and "global" + operations as part of the overall collective operation. In this + way, network traffic can be reduced -- or possibly even minimized + (similar to MagPIe). The current "hierarch" component only + separates MPI processes into on- and off-node groups. + + Hierarch has had sufficient correctness testing, but has not + received much performance tuning. As such, hierarch is not + activated by default -- it must be enabled manually by setting its + priority level to 100: + + mpirun --mca coll_hierarch_priority 100 ... + + We would appreciate feedback from the user community about how well + hierarch works for your applications. 
+ + +Network Support +--------------- + +- The OpenFabrics Enterprise Distribution (OFED) software package v1.0 + will not work properly with Open MPI v1.2 (and later) due to how its + Mellanox InfiniBand plugin driver is created. The problem is fixed in + OFED v1.1 (and later). + +- Older mVAPI-based InfiniBand drivers (Mellanox VAPI) are no longer + supported. Please use an older version of Open MPI (1.2 series or + earlier) if you need mVAPI support. + +- The use of fork() with the openib BTL is only partially supported, + and only on Linux kernels >= v2.6.15 with libibverbs v1.1 or later + (first released as part of OFED v1.2), per restrictions imposed by + the OFED network stack. + +- There are two MPI network models available: "ob1" and "cm". "ob1" + uses BTL ("Byte Transfer Layer") components for each supported + network. "cm" uses MTL ("Matching Transport Layer") components for + each supported network. + + - "ob1" supports a variety of networks that can be used in + combination with each other (per OS constraints; e.g., there are + reports that the GM and OpenFabrics kernel drivers do not operate + well together): + - OpenFabrics: InfiniBand and iWARP + - Loopback (send-to-self) + - Myrinet: GM and MX + - Portals + - Quadrics Elan + - Shared memory + - TCP + - SCTP + - uDAPL + + - "cm" supports a smaller number of networks (and they cannot be + used together), but may provide better overall MPI + performance: + - Myrinet MX (not GM) + - InfiniPath PSM + - Portals + + Open MPI will, by default, choose to use "cm" when the InfiniPath + PSM MTL can be used. Otherwise, OB1 will be used and the + corresponding BTLs will be selected. Users can force the use of ob1 + or cm if desired by setting the "pml" MCA parameter at run-time: + + shell$ mpirun --mca pml ob1 ... + or + shell$ mpirun --mca pml cm ... + +- Myrinet MX support is shared between the 2 internal devices, the MTL + and the BTL. 
The design of the BTL interface in Open MPI assumes + that only naive one-sided communication capabilities are provided by + the low level communication layers. However, modern communication + layers such as Myrinet MX, InfiniPath PSM, or Portals, natively + implement highly-optimized two-sided communication semantics. To + leverage these capabilities, Open MPI provides the "cm" PML and + corresponding MTL components to transfer messages rather than bytes. + The MTL interface implements a shorter code path and lets the + low-level network library decide which protocol to use (depending on + issues such as message length, internal resources and other + parameters specific to the underlying interconnect). However, Open + MPI cannot currently use multiple MTL modules at once. In the case + of the MX MTL, process loopback and on-node shared memory + communications are provided by the MX library. Moreover, the + current MX MTL does not support message pipelining, resulting in + lower performance for non-contiguous data-types. + + The "ob1" PML and BTL components use Open MPI's internal on-node + shared memory and process loopback devices for high performance. + The BTL interface allows multiple devices to be used simultaneously. + For the MX BTL it is recommended that the first segment (which acts as + a threshold between the eager and the rendezvous protocol) should + always be at most 4KB, but there is no further restriction on the + size of subsequent fragments. + + The MX MTL is recommended in the common case for best performance on + 10G hardware when most of the data transfers cover contiguous memory + layouts. The MX BTL is recommended in all other cases, such as when + using multiple interconnects at the same time (including TCP), or + transferring non contiguous data-types. 
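The MX recommendation above can be sketched as a choice between two mpirun prefixes. The helper name mx_mpirun_cmd and its "contiguous" / "mixed" keywords are hypothetical, and selecting the MX MTL through an "mtl" MCA parameter is an assumption of this sketch (these notes only show the "pml" and "btl" parameters explicitly). The function prints the prefix; the application and its arguments would follow it.

```shell
# Hypothetical helper (not part of Open MPI): print an mpirun prefix for
# Myrinet MX.  $1 is "contiguous" (mostly contiguous transfers over MX
# only) or "mixed" (non-contiguous datatypes or several interconnects).
mx_mpirun_cmd() {
    case "${1:-}" in
        contiguous)
            # "cm" PML with the MX MTL; the "mtl" parameter name is an
            # assumption of this sketch.
            echo "mpirun --mca pml cm --mca mtl mx" ;;
        mixed)
            # "ob1" PML with the MX BTL plus shared memory and loopback.
            echo "mpirun --mca pml ob1 --mca btl mx,sm,self" ;;
        *)
            echo "usage: mx_mpirun_cmd contiguous|mixed" >&2; return 1 ;;
    esac
}

mx_mpirun_cmd contiguous
mx_mpirun_cmd mixed
```

The "mixed" case keeps "self" in the BTL list because, as noted earlier in this file, an explicit "btl" value must include it.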
+ + +Shared library versioning support +--------------------------------- + +Open MPI started using GNU-Libtool recommended shared library +versioning with the v1.3.3 release (where all versions were set to +0:0:0) for the main MPI libraries: libmpi, libmpi_cxx, libmpi_f77, and +libmpi_f90. + +Open MPI's other internal libraries are not [yet] versioned for deep +voodoo technical reasons. Please see +https://svn.open-mpi.org/trac/ompi/ticket/2092 for more details. + +=========================================================================== + +Building Open MPI +----------------- + +Open MPI uses a traditional configure script paired with "make" to +build. Typical installs can be of the pattern: + +--------------------------------------------------------------------------- +shell$ ./configure [...options...] +shell$ make all install +--------------------------------------------------------------------------- + +There are many available configure options (see "./configure --help" +for a full list); a summary of the more commonly used ones follows: + +--prefix= + Install Open MPI into the base directory named . Hence, + Open MPI will place its executables in /bin, its header + files in /include, its libraries in /lib, etc. + +--with-elan= + Specify the directory where the Quadrics Elan library and header + files are located. This option is generally only necessary if the + Elan headers and libraries are not in default compiler/linker + search paths. + + Elan is the support library for Quadrics-based networks. + +--with-elan-libdir= + Look in directory for the Quadrics Elan libraries. By default, Open + MPI will look in /lib and /lib64, + which covers most cases. This option is only needed for special + configurations. + +--with-gm= + Specify the directory where the GM libraries and header files are + located. This option is generally only necessary if the GM headers + and libraries are not in default compiler/linker search paths. 
+ + GM is the support library for older Myrinet-based networks (GM has + been obsoleted by MX). + +--with-gm-libdir= + Look in directory for the GM libraries. By default, Open MPI will + look in /lib and /lib64, which covers + most cases. This option is only needed for special configurations. + +--with-mx= + Specify the directory where the MX libraries and header files are + located. This option is generally only necessary if the MX headers + and libraries are not in default compiler/linker search paths. + + MX is the support library for Myrinet-based networks. + +--with-mx-libdir= + Look in directory for the MX libraries. By default, Open MPI will + look in /lib and /lib64, which covers + most cases. This option is only needed for special configurations. + +--with-openib= + Specify the directory where the OpenFabrics (previously known as + OpenIB) libraries and header files are located. This option is + generally only necessary if the OpenFabrics headers and libraries + are not in default compiler/linker search paths. + + "OpenFabrics" refers to iWARP- and InfiniBand-based networks. + +--with-openib-libdir= + Look in directory for the OpenFabrics libraries. By default, Open + MPI will look in /lib and /lib64, which covers most cases. This option is only + needed for special configurations. + +--with-portals= + Specify the directory where the Portals libraries and header files + are located. This option is generally only necessary if the Portals + headers and libraries are not in default compiler/linker search + paths. + + Portals is the support library for Cray interconnects, but is also + available on other platforms (e.g., there is a Portals library + implemented over regular TCP). + +--with-portals-config= + Configuration to use for Portals support. The following + values are possible: "utcp", "xt3", "xt3-modex" (default: utcp). + +--with-portals-libs= + Additional libraries to link with for Portals support. 
+ +--with-psm= + Specify the directory where the QLogic InfiniPath PSM library and + header files are located. This option is generally only necessary + if the InfiniPath headers and libraries are not in default + compiler/linker search paths. + + PSM is the support library for QLogic InfiniPath network adapters. + +--with-psm-libdir= + Look in directory for the PSM libraries. By default, Open MPI will + look in /lib and /lib64, which covers + most cases. This option is only needed for special configurations. + +--with-sctp= + Specify the directory where the SCTP libraries and header files are + located. This option is generally only necessary if the SCTP headers + and libraries are not in default compiler/linker search paths. + + SCTP is a special network stack over ethernet networks. + +--with-sctp-libdir= + Look in directory for the SCTP libraries. By default, Open MPI will + look in /lib and /lib64, which covers + most cases. This option is only needed for special configurations. + +--with-udapl= + Specify the directory where the UDAPL libraries and header files are + located. Note that UDAPL support is disabled by default on Linux; + the --with-udapl flag must be specified in order to enable it. + Specifying the directory argument is generally only necessary if the + UDAPL headers and libraries are not in default compiler/linker + search paths. + + UDAPL is the support library for high performance networks in Sun + HPC ClusterTools and on Linux OpenFabrics networks (although the + "openib" options are preferred for Linux OpenFabrics networks, not + UDAPL). + +--with-udapl-libdir= + Look in directory for the UDAPL libraries. By default, Open MPI + will look in /lib and /lib64, + which covers most cases. This option is only needed for special + configurations. + +--with-lsf= + Specify the directory where the LSF libraries and header files are + located. 
This option is generally only necessary if the LSF headers + and libraries are not in default compiler/linker search paths. + + LSF is a resource manager system, frequently used as a batch + scheduler in HPC systems. + + NOTE: If you are using LSF version 7.0.5, you will need to add + "LIBS=-ldl" to the configure command line. For example: + + ./configure LIBS=-ldl --with-lsf ... + + This workaround should *only* be needed for LSF 7.0.5. + +--with-lsf-libdir= + Look in directory for the LSF libraries. By default, Open MPI will + look in /lib and /lib64, which covers + most cases. This option is only needed for special configurations. + +--with-tm= + Specify the directory where the TM libraries and header files are + located. This option is generally only necessary if the TM headers + and libraries are not in default compiler/linker search paths. + + TM is the support library for the Torque and PBS Pro resource + manager systems, both of which are frequently used as a batch + scheduler in HPC systems. + +--with-sge + Specify to build support for the Sun Grid Engine (SGE) resource + manager. SGE support is disabled by default; this option must be + specified to build OMPI's SGE support. + + The Sun Grid Engine (SGE) is a resource manager system, frequently + used as a batch scheduler in HPC systems. + +--with-mpi-param-check(=value) + "value" can be one of: always, never, runtime. If --with-mpi-param-check + is not specified, "runtime" is the default. If --with-mpi-param-check + is specified with no value, "always" is used. Using + --without-mpi-param-check is equivalent to "never". + + - always: the parameters of MPI functions are always checked for + errors + - never: the parameters of MPI functions are never checked for + errors + - runtime: whether the parameters of MPI functions are checked + depends on the value of the MCA parameter mpi_param_check + (default: yes). 
+ +--with-threads=value + Since thread support (both support for MPI_THREAD_MULTIPLE and + asynchronous progress) is only partially tested, it is disabled by + default. To enable threading, use "--with-threads=posix". This is + most useful when combined with --enable-mpi-threads and/or + --enable-progress-threads. + +--enable-mpi-threads + Allows the MPI thread level MPI_THREAD_MULTIPLE. See + --with-threads; this is currently disabled by default. + +--enable-progress-threads + Allows asynchronous progress in some transports. See + --with-threads; this is currently disabled by default. See the + above note about asynchronous progress. + +--disable-mpi-cxx + Disable building the C++ MPI bindings. Note that this does *not* + disable the C++ checks during configure; some of Open MPI's tools + are written in C++ and therefore require a C++ compiler to be built. + +--disable-mpi-cxx-seek + Disable the MPI::SEEK_* constants. Due to a problem with the MPI-2 + specification, these constants can conflict with system-level SEEK_* + constants. Open MPI attempts to work around this problem, but the + workaround may fail in some esoteric situations. The + --disable-mpi-cxx-seek switch disables Open MPI's workarounds (and + therefore the MPI::SEEK_* constants will be unavailable). + +--disable-mpi-f77 + Disable building the Fortran 77 MPI bindings. + +--disable-mpi-f90 + Disable building the Fortran 90 MPI bindings. Also related to the + --with-f90-max-array-dim and --with-mpi-f90-size options. + +--with-mpi-f90-size= + Three sizes of the MPI F90 module can be built: trivial (only a + handful of MPI-2 F90-specific functions are included in the F90 + module), small (trivial + all MPI functions that take no choice + buffers), and medium (small + all MPI functions that take 1 choice + buffer). This parameter is only used if the F90 bindings are + enabled. 
+ +--with-f90-max-array-dim= + The F90 MPI bindings are strictly typed, even including the number of + dimensions for arrays for MPI choice buffer parameters. Open MPI + generates these bindings at compile time with a maximum number of + dimensions as specified by this parameter. The default value is 4. + +--enable-mpirun-prefix-by-default + This option forces the "mpirun" command to always behave as if + "--prefix $prefix" was present on the command line (where $prefix is + the value given to the --prefix option to configure). This prevents + most rsh/ssh-based users from needing to modify their shell startup + files to set the PATH and/or LD_LIBRARY_PATH for Open MPI on remote + nodes. Note, however, that such users may still desire to set PATH + -- perhaps even in their shell startup files -- so that executables + such as mpicc and mpirun can be found without needing to type long + path names. --enable-orterun-prefix-by-default is a synonym for + this option. + +--disable-shared + By default, libmpi is built as a shared library, and all components + are built as dynamic shared objects (DSOs). This switch disables + this default; it is really only useful when used with + --enable-static. Specifically, this option does *not* imply + --enable-static; enabling static libraries and disabling shared + libraries are two independent options. + +--enable-static + Build libmpi as a static library, and statically link in all + components. Note that this option does *not* imply + --disable-shared; enabling static libraries and disabling shared + libraries are two independent options. + +--enable-sparse-groups + Enable the usage of sparse groups. This would save memory + significantly especially if you are creating large + communicators. (Disabled by default) + +--enable-peruse + Enable the PERUSE MPI data analysis interface. + +--enable-dlopen + Build all of Open MPI's components as standalone Dynamic Shared + Objects (DSO's) that are loaded at run-time. 
The opposite of this + option, --disable-dlopen, causes two things: + + 1. All of Open MPI's components will be built as part of Open MPI's + normal libraries (e.g., libmpi). + 2. Open MPI will not attempt to open any DSO's at run-time. + + Note that this option does *not* imply that OMPI's libraries will be + built as static objects (e.g., libmpi.a). It only specifies the + location of OMPI's components: standalone DSOs or folded into the + Open MPI libraries. You can control whether Open MPI's libraries + are built as static or dynamic via --enable|disable-static and + --enable|disable-shared. + +--enable-heterogeneous + Enable support for running on heterogeneous clusters (e.g., machines + with different endian representations). Heterogeneous support is + disabled by default because it imposes a minor performance penalty. + +--enable-ptmalloc2-internal + ***NOTE: This option no longer exists. + + This option was introduced in Open MPI v1.3 and was then removed in + Open MPI v1.3.2. Open MPI fundamentally changed how it uses + ptmalloc2 support in v1.3.2 such that the + --enable-ptmalloc2-internal flag was no longer necessary. It can + still harmlessly be supplied to Open MPI's configure script, but a + warning will appear about how it is an unrecognized option. + + In v1.3 and v1.3.1, Open MPI built the ptmalloc2 library as a + standalone library that users could choose to link in or not (by + adding -lopenmpi-malloc to their link command). Using this option + restored pre-v1.3 behavior of *always* forcing the user to use the + ptmalloc2 memory manager (because it is part of libmpi). + + Starting with v1.3.2, ptmalloc2 is always built into Open MPI, but + is only activated in certain scenarios.
+ +--with-wrapper-cflags= +--with-wrapper-cxxflags= +--with-wrapper-fflags= +--with-wrapper-fcflags= +--with-wrapper-ldflags= +--with-wrapper-libs= + Add the specified flags to the default flags that are used in Open + MPI's "wrapper" compilers (e.g., mpicc -- see below for more + information about Open MPI's wrapper compilers). By default, Open + MPI's wrapper compilers use the same compilers used to build Open + MPI and specify an absolute minimum set of additional flags that are + necessary to compile/link MPI applications. These configure options + give system administrators the ability to embed additional flags in + OMPI's wrapper compilers (which is a local policy decision). The + meanings of the different flags are: + + : Flags passed by the mpicc wrapper to the C compiler + : Flags passed by the mpic++ wrapper to the C++ compiler + : Flags passed by the mpif77 wrapper to the F77 compiler + : Flags passed by the mpif90 wrapper to the F90 compiler + : Flags passed by all the wrappers to the linker + : Flags passed by all the wrappers to the linker + + There are other ways to configure Open MPI's wrapper compiler + behavior; see the Open MPI FAQ for more information. + +There are many other options available -- see "./configure --help". + +Changing the compilers that Open MPI uses to build itself uses the +standard Autoconf mechanism of setting special environment variables +either before invoking configure or on the configure command line.
+The following environment variables are recognized by configure: + +CC - C compiler to use +CFLAGS - Compile flags to pass to the C compiler +CPPFLAGS - Preprocessor flags to pass to the C compiler + +CXX - C++ compiler to use +CXXFLAGS - Compile flags to pass to the C++ compiler +CXXCPPFLAGS - Preprocessor flags to pass to the C++ compiler + +F77 - Fortran 77 compiler to use +FFLAGS - Compile flags to pass to the Fortran 77 compiler + +FC - Fortran 90 compiler to use +FCFLAGS - Compile flags to pass to the Fortran 90 compiler + +LDFLAGS - Linker flags to pass to all compilers +LIBS - Libraries to pass to all compilers (users rarely + need to specify additional LIBS) + +For example: + +shell$ ./configure CC=mycc CXX=myc++ F77=myf77 FC=myf90 ... + +***Note: We generally suggest using the above command line form for + setting different compilers (vs. setting environment variables and + then invoking "./configure"). The above form will save all + variables and values in the config.log file, which makes + post-mortem analysis easier when problems occur. + +Note that you may also want to ensure that the value of +LD_LIBRARY_PATH is set appropriately (or not at all) for your build +(or whatever environment variable is relevant for your operating +system). For example, some users have been tripped up by choosing +non-default Fortran compilers via FC / F77, but then failing to +set LD_LIBRARY_PATH to include the directory containing that +non-default Fortran compiler's support libraries. This causes Open +MPI's configure script to fail when it tries to compile / link / run +simple Fortran programs. + +It is required that the compilers specified be compile and link +compatible, meaning that object files created by one compiler must be +able to be linked with object files from the other compilers and +produce correctly functioning executables.
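The command-line form recommended above (VAR=value arguments passed to configure rather than exported beforehand) can be assembled programmatically. The following is only a sketch: the helper name build_configure_command, the compiler names, and the /opt/openmpi prefix are hypothetical and are not part of Open MPI.

```python
import shlex


def build_configure_command(variables, extra_args=()):
    """Return an argv list that passes each variable as VAR=value to configure.

    Passing the settings as arguments (instead of exporting them first) is the
    form the text recommends, because configure records them in config.log.
    """
    argv = ["./configure"]
    for name, value in variables.items():
        argv.append(f"{name}={value}")
    argv.extend(extra_args)
    return argv


# Hypothetical compiler names and install prefix, for illustration only.
command = build_configure_command(
    {"CC": "mycc", "CXX": "myc++", "F77": "myf77", "FC": "myf90"},
    extra_args=["--prefix=/opt/openmpi"],
)

# shlex.join (Python 3.8+) quotes any argument that needs it for the shell.
print(shlex.join(command))
```

The resulting list can be handed to subprocess.run() unchanged, which avoids a second round of shell quoting.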
+ +Open MPI supports all the "make" targets that are provided by GNU +Automake, such as: + +all - build the entire Open MPI package +install - install Open MPI +uninstall - remove all traces of Open MPI from the $prefix +clean - clean out the build tree + +Once Open MPI has been built and installed, it is safe to run "make +clean" and/or remove the entire build tree. + +VPATH and parallel builds are fully supported. + +Generally speaking, the only thing that users need to do to use Open +MPI is ensure that /bin is in their PATH and /lib is +in their LD_LIBRARY_PATH. Users may need to ensure to set the PATH +and LD_LIBRARY_PATH in their shell setup files (e.g., .bashrc, .cshrc) +so that non-interactive rsh/ssh-based logins will be able to find the +Open MPI executables. + +=========================================================================== + +Checking Your Open MPI Installation +----------------------------------- + +The "ompi_info" command can be used to check the status of your Open +MPI installation (located in /bin/ompi_info). Running it with +no arguments provides a summary of information about your Open MPI +installation. + +Note that the ompi_info command is extremely helpful in determining +which components are installed as well as listing all the run-time +settable parameters that are available in each component (as well as +their default values). + +The following options may be helpful: + +--all Show a *lot* of information about your Open MPI + installation. +--parsable Display all the information in an easily + grep/cut/awk/sed-able format. +--param + A of "all" and a of "all" will + show all parameters to all components. Otherwise, the + parameters of all the components in a specific framework, + or just the parameters of a specific component can be + displayed by using an appropriate and/or + name. + +Changing the values of these parameters is explained in the "The +Modular Component Architecture (MCA)" section, below. 
+ +=========================================================================== + +Compiling Open MPI Applications +------------------------------- + +Open MPI provides "wrapper" compilers that should be used for +compiling MPI applications: + +C: mpicc +C++: mpiCC (or mpic++ if your filesystem is case-insensitive) +Fortran 77: mpif77 +Fortran 90: mpif90 + +For example: + +shell$ mpicc hello_world_mpi.c -o hello_world_mpi -g +shell$ + +All the wrapper compilers do is add a variety of compiler and linker +flags to the command line and then invoke a back-end compiler. To be +specific: the wrapper compilers do not parse source code at all; they +are solely command-line manipulators, and have nothing to do with the +actual compilation or linking of programs. The end result is an MPI +executable that is properly linked to all the relevant libraries. + +Customizing the behavior of the wrapper compilers is possible (e.g., +changing the compiler [not recommended] or specifying additional +compiler/linker flags); see the Open MPI FAQ for more information. + +=========================================================================== + +Running Open MPI Applications +----------------------------- + +Open MPI supports both mpirun and mpiexec (they are exactly +equivalent). For example: + +shell$ mpirun -np 2 hello_world_mpi +or +shell$ mpiexec -np 1 hello_world_mpi : -np 1 hello_world_mpi + +are equivalent. Some of mpiexec's switches (such as -host and -arch) +are not yet functional, although they will not error if you try to use +them. + +The rsh launcher accepts a -hostfile parameter (the option +"-machinefile" is equivalent); you can specify a -hostfile parameter +indicating a standard mpirun-style hostfile (one hostname per line): + +shell$ mpirun -hostfile my_hostfile -np 2 hello_world_mpi + +If you intend to run more than one process on a node, the hostfile can +use the "slots" attribute. If "slots" is not specified, a count of 1 +is assumed.
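The "slots" rule just described can be sketched in a few lines. This is an illustration only, not Open MPI's own parser: the function names and host names are hypothetical, "#" comment handling is an assumption, and the sketch simply fills each host's slots in file order and refuses to oversubscribe, whereas the real mpirun applies its own policy.

```python
def parse_hostfile(text):
    """Return a list of (hostname, slots) pairs; slots defaults to 1."""
    hosts = []
    for raw_line in text.splitlines():
        # Assumption: anything after '#' is a comment.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        slots = 1  # a count of 1 is assumed when "slots" is absent
        for field in fields[1:]:
            if field.startswith("slots="):
                slots = int(field[len("slots="):])
        hosts.append((fields[0], slots))
    return hosts


def assign_ranks(hosts, num_procs):
    """Place ranks 0..num_procs-1 by filling each host's slots in order."""
    placement = []
    for host, slots in hosts:
        for _ in range(slots):
            if len(placement) == num_procs:
                return placement
            placement.append(host)
    if len(placement) < num_procs:
        # Simplification of this sketch; not a statement about mpirun.
        raise ValueError("not enough slots for the requested process count")
    return placement


# Hypothetical hostfile contents.
hosts = parse_hostfile("hostA.example.com\nhostB.example.com slots=2\n")
print(hosts)
print(assign_ranks(hosts, 3))
```

The index of each entry in the returned list is the MPI_COMM_WORLD rank placed on that host.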
For example, using the following hostfile: + +--------------------------------------------------------------------------- +node1.example.com +node2.example.com +node3.example.com slots=2 +node4.example.com slots=4 +--------------------------------------------------------------------------- + +shell$ mpirun -hostfile my_hostfile -np 8 hello_world_mpi + +will launch MPI_COMM_WORLD rank 0 on node1, rank 1 on node2, ranks 2 +and 3 on node3, and ranks 4 through 7 on node4. + +Other starters, such as the resource manager / batch scheduling +environments, do not require hostfiles (and will ignore the hostfile +if it is supplied). They will also launch as many processes as slots +have been allocated by the scheduler if no "-np" argument has been +provided. For example, running a SLURM job with 8 processors: + +shell$ salloc -n 8 mpirun a.out + +The above command will reserve 8 processors and run 1 copy of mpirun, +which will, in turn, launch 8 copies of a.out in a single +MPI_COMM_WORLD on the processors that were allocated by SLURM. + +Note that the values of component parameters can be changed on the +mpirun / mpiexec command line. This is explained in the section +below, "The Modular Component Architecture (MCA)". + +=========================================================================== + +The Modular Component Architecture (MCA) + +The MCA is the backbone of Open MPI -- most services and functionality +are implemented through MCA components. 
Here is a list of all the +component frameworks in Open MPI: + +--------------------------------------------------------------------------- + +MPI component frameworks: +------------------------- + +allocator - Memory allocator +bml - BTL management layer +btl - MPI point-to-point Byte Transfer Layer, used for MPI + point-to-point messages on some types of networks +coll - MPI collective algorithms +crcp - Checkpoint/restart coordination protocol +dpm - MPI-2 dynamic process management +io - MPI-2 I/O +mpool - Memory pooling +mtl - Matching transport layer, used for MPI point-to-point + messages on some types of networks +osc - MPI-2 one-sided communications +pml - MPI point-to-point management layer +pubsub - MPI-2 publish/subscribe management +rcache - Memory registration cache +topo - MPI topology routines + +Back-end run-time environment component frameworks: +--------------------------------------------------- + +errmgr - RTE error manager +ess - RTE environment-specific services +filem - Remote file management +grpcomm - RTE group communications +iof - I/O forwarding +notifier - System/network administrator notification system +odls - OpenRTE daemon local launch subsystem +oob - Out of band messaging +plm - Process lifecycle management +ras - Resource allocation system +rmaps - Resource mapping system +rml - RTE message layer +routed - Routing table for the RML +snapc - Snapshot coordination + +Miscellaneous frameworks: +------------------------- + +backtrace - Debugging call stack backtrace support +carto - Cartography (host/network mapping) support +crs - Checkpoint and restart service +installdirs - Installation directory relocation services +maffinity - Memory affinity +memchecker - Run-time memory checking +memcpy - Memory copy support +memory - Memory management hooks +paffinity - Processor affinity +timer - High-resolution timers + +--------------------------------------------------------------------------- + +Each framework typically has one or more
components that are used at +run-time. For example, the btl framework is used by the MPI layer to +send bytes across different types of underlying networks. The tcp btl, +for example, sends messages across TCP-based networks; the openib btl +sends messages across OpenFabrics-based networks; the MX btl sends +messages across Myrinet networks. + +Each component typically has some tunable parameters that can be +changed at run-time. Use the ompi_info command to check a component +to see what its tunable parameters are. For example: + +shell$ ompi_info --param btl tcp + +shows all the parameters (and default values) for the tcp btl +component. + +These values can be overridden at run-time in several ways. At +run-time, the following locations are examined (in order) for new +values of parameters: + +1. /etc/openmpi-mca-params.conf + + This file is intended to set any system-wide default MCA parameter + values -- it will apply, by default, to all users who use this Open + MPI installation. The default file that is installed contains many + comments explaining its format. + +2. $HOME/.openmpi/mca-params.conf + + If this file exists, it should be in the same format as + /etc/openmpi-mca-params.conf. It is intended to provide + per-user default parameter values. + +3. environment variables of the form OMPI_MCA_ set equal to a + + + Where is the name of the parameter. For example, set the + variable named OMPI_MCA_btl_tcp_frag_size to the value 65536 + (Bourne-style shells): + + shell$ OMPI_MCA_btl_tcp_frag_size=65536 + shell$ export OMPI_MCA_btl_tcp_frag_size + +4. the mpirun command line: --mca + + Where is the name of the parameter. For example: + + shell$ mpirun --mca btl_tcp_frag_size 65536 -np 2 hello_world_mpi + +These locations are checked in order. For example, a parameter value +passed on the mpirun command line will override an environment +variable; an environment variable will override the system-wide +defaults.
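The lookup order above (system-wide file, per-user file, environment, mpirun command line, with later locations overriding earlier ones) can be sketched as follows. This is an illustration, not Open MPI's implementation: the function names are hypothetical, and the "name = value" file syntax with "#" comments is an assumption about the parameter file format.

```python
def parse_param_file(text):
    """Parse 'name = value' lines into a dict, skipping blanks and comments."""
    params = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # assumption: '#' comments
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        params[name.strip()] = value.strip()
    return params


def resolve_mca_param(name, system_params, user_params, environ, cli_params):
    """Return the effective value of one parameter, or None if it is unset.

    Sources are consulted in the documented order; each later source that
    defines the parameter overrides the value found so far.
    """
    value = None
    for file_params in (system_params, user_params):
        if name in file_params:
            value = file_params[name]
    env_key = "OMPI_MCA_" + name
    if env_key in environ:
        value = environ[env_key]
    if name in cli_params:  # values given via "mpirun --mca name value"
        value = cli_params[name]
    return value


# Illustrative inputs; the file contents and values are made up.
system = parse_param_file("# site defaults\nbtl_tcp_frag_size = 1024\n")
env = {"OMPI_MCA_btl_tcp_frag_size": "65536"}

print(resolve_mca_param("btl_tcp_frag_size", system, {}, env, {}))
print(resolve_mca_param("btl_tcp_frag_size", system, {}, env,
                        {"btl_tcp_frag_size": "131072"}))
```

In real use the environ argument would be os.environ; a plain dict is passed here so the example does not depend on the caller's environment.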
+ +=========================================================================== + +Common Questions +---------------- + +Many common questions about building and using Open MPI are answered +on the FAQ: + + http://www.open-mpi.org/faq/ + +=========================================================================== + +Got more questions? +------------------- + +Found a bug? Got a question? Want to make a suggestion? Want to +contribute to Open MPI? Please let us know! + +When submitting questions and problems, be sure to include as much +extra information as possible. This web page details all the +information that we request in order to provide assistance: + + http://www.open-mpi.org/community/help/ + +User-level questions and comments should generally be sent to the +user's mailing list (users@open-mpi.org). Because of spam, only +subscribers are allowed to post to this list (ensure that you +subscribe with and post from *exactly* the same e-mail address -- +joe@example.com is considered different than +joe@mycomputer.example.com!). Visit this page to subscribe to the +user's list: + + http://www.open-mpi.org/mailman/listinfo.cgi/users + +Developer-level bug reports, questions, and comments should generally +be sent to the developer's mailing list (devel@open-mpi.org). Please +do not post the same question to both lists. As with the user's list, +only subscribers are allowed to post to the developer's list. Visit +the following web page to subscribe: + + http://www.open-mpi.org/mailman/listinfo.cgi/devel + +Make today an Open MPI day! 
diff --git a/release_notes/opensm_release_notes.txt b/release_notes/opensm_release_notes.txt new file mode 100644 index 0000000..9b5de67 --- /dev/null +++ b/release_notes/opensm_release_notes.txt @@ -0,0 +1,728 @@ + OpenSM Release Notes 3.3 + ============================= + +Version: OpenSM 3.3.x +Repo: git://git.openfabrics.org/~sashak/management.git +Date: Dec 2009 + +1 Overview +---------- +This document describes the contents of the OpenSM 3.3 release. +OpenSM is an InfiniBand compliant Subnet Manager and Administration, +and runs on top of OpenIB. The OpenSM version for this release +is opensm-3.3.5. + +This document includes the following sections: +1 This Overview section (describing new features and software + dependencies) +2 Known Issues And Limitations +3 Unsupported IB compliance statements +4 Bug Fixes +5 Main Verification Flows +6 Qualified Software Stacks and Devices + +1.1 Major New Features + +* Mesh Analysis for LASH routing algorithm. + The performance of LASH can be improved by preconditioning the mesh in + cases where there are multiple links connecting switches and also in + cases where the switches are not cabled consistently. + Activated with --do_mesh_analysis command line and config file option. + +* Reloadable OpenSM configuration (preliminary implementation) + It is now possible to reload OpenSM configuration parameters on the + fly without restarting. + +* Routing paths sorted balancing (for UpDown and MinHops) + This sorts the port order in which routing paths balancing is performed + by OpenSM. Helps to improve performance dramatically (40-50%) for most + popular application communication patterns. + To override this behavior use --guid_routing_order_file command line + option. + +* Weighted Lid Matrices calculation (for UpDown, MinHop and DOR). + This low level routing fine-tuning feature provides the means to + define a weighting factor per port for customizing the least weight + hops for the routing.
Custom weights are provided using a file specified + with the '--hop_weights_file' command line option. + +* I/O nodes connectivity (for FatTree). + This makes it possible to define the set of I/O nodes for the + Fat-Tree routing algorithm. I/O nodes are non-CN nodes allowed to use + up to N (specified using --max_reverse_hops) switches the wrong way + around to improve connectivity. The I/O node list is provided in a file + specified with the --io_guid_file command line option. + +* MGID to MLID compression - infrastructure for many MGIDs to single MLID + compression. This becomes helpful when the number of multicast groups + exceeds the subnet's MLID routing capability (normally 1024 groups). In such + cases many multicast groups (MGID) can be routed using the same MLID value. + +* Many code improvements, optimizations and cleanups. + +* Windows support (early stage). + +1.2 Minor New Features: + +cde0c0d opensm: Convert remaining helper routines for GID printing format +bc5743c opensm: Add support for MaxCreditHint and LinkRoundTripLatency to + osm_dump_port_info +6cd34ab opensm: Add Dell to known vendor list +003d6bd opensm: Add more info for traps 144 and 256-259 in osm_dump_notice +5b0c5de opensm/osm_ucat_ftree.c Enhance min hops counters usage +0715b92 ib_types.h: Add ib_switch_info_get_state_opt_sl2vlmapping routine +2ddba79 opensm: Remove some __ and __osm_ prefixes +ea0691f opensm/iba/ib_types.h: Add PortXmit/RcvDataSL PerfMgt attributes +9c79be5 ib_types.h: Adding BKEY violation trap (259) +c608ea6 opensm: Add and utilize ib_gid_is_notzero routine +b639e64 opensm: Handle trap repress on trap 144 generation +b034205 Add pkey table support to osm_get_all_port_attr +876605b opensm/ib_types.h: Add attribute ID for PortCountersExtended +aae3bbc opensm: PortInfo requests for discovered switches +0147b09 opensm/osm_lid_mgr: use single array for used_lids +a9225b0 opensm/Makefile.am: remove osm_build_id.h junk file generation +8e3a57d opensm/osm_console.c: Add list of SMs to status command
+3d664b9 opensm/osm_console.c : Added dump_portguid function to console to + generate a list of port guids matching one or more regexps +85b35bc opensm/osm_helper.c: print port number as decimal +8674cb7 opensm: sort port order for routing by switch loads +80c0d48 opensm: rescan config file even in standby +8b7aa5e opensm/osm_subnet.c enable log_max_size opt update +8558ee5 opensm/include/iba/ib_types.h: Add xmit_wait for PortCounters +ecde2f7 opensm/osm_subnet.c support subnet configuration rescan and update +58c45e4 opensm/osm_log.c save log_max_size in subnet opt in MB +cf88e93 opensm: Add new partition keyword for all hca, switches and routers +4bfd4e0 opensm: remove libibcommon build dependencies +3718fc4 opensm/event_plugin: link opensm with -rdynamic flag +587ce14 opensm/osm_inform.c report IB traps to plugin +ced5a6e opensm/opensm/osm_console.c: move reporting of plugins to "status" + command. +696aca2 opensm: Add configurable retries for transactions +0d932ff opensm/osm_sa_mcmember_record.c: optimization in zero mgid comparison +254c2ef opensm/osm_sm_mad_ctrl.c: In sm_mad_ctrl_send_err_cb, set init + failure on PKeyTable and QoS initialization failure +83bd10a opensm: Reduce heap consumption by multicast routing tables (MFTs) +cd33bc5 opensm: Add some additional HP vendor IDs/OUIs +f78ec3a opensm/osm_mcast_tbl.(h c): Make max_mlid_ho be maximum MLID configured +2d13530 opensm: Add infrastructure support for PortInfo + IsMulticastPkeyTrapSuppressionSupported +3ace760 opensm: Reduce heap consumption by unicast routing tables (LFTs) +eec568e osmtest: Add SA get PathRecord stress test +aabc476 opensm: Add infrastructure support for more newly allocated PortInfo + CapabilityMask bits +c83c331 opensm: improve multicast re-routing requests processing +46db92f opensm: Parallelize (Stripe) MFT sets across switches +00c6a6e opensm: Parallelize (Stripe) LFT sets across switches +e21c651 opensm/osm_base.h: Add new SA ClassPortInfo:CapabilityMask2 bit + allocations 
+09056b1 opensm/ib_types.h: Add CounterSelect2 field to PortCounters attribute +6a63003 opensm: Add ability to configure SMSL +25f071f opensm/lash: Set minimum VL for LASH to use +622d853 opensm/osm_ucast_ftree.cd: Added support for same level links +8146ba7 opensm: Add new Sun vendor ID +1d7dd18 opensm/osm_ucast_ftree.c: Enhanced Fat-Tree algorithm +e07a2f1 Add LMC support to DOR routing +1acfe8a opensm: Add SuperMicro to list of recognized vendors +f02f40e opensm: implement 'connect_roots' option in fat-tree routing +748d41e opensm SA DB dump/restore: added option to dump SA DB on every sweep +b03a95e complib/cl_fleximap: add cl_fmap_match() function +b7a8a87 opensm/include/iba/ib_types.h: adding Congestion Control definitions + +1.3 Library API Changes + + None + +1.4 Software Dependencies + +OpenSM depends on the installation of libibumad package (distributed as +part of OFA IB management together with OpenSM) and IB stack presence, +in particular libibumad uses user_mad kernel interface ('ib_umad' kernel +module). The qualified driver versions are provided in Table 2, +"Qualified IB Stacks". + +Also, building of QoS manager policy file parser requires flex, and either +bison or byacc installed. + +1.5 Supported Devices Firmware + +The main task of OpenSM is to initialize InfiniBand devices. The +qualified devices and their corresponding firmware versions +are listed in Table 3. + +2 Known Issues And Limitations +------------------------------ + +* No Service / Key associations: + There is no way to manage Service access by Keys. + +* No SM to SM SMDB synchronization: + Puts the burden of re-registering services, multicast groups, and + inform-info on the client application (or IB access layer core). + +3 Unsupported IB Compliance Statements +-------------------------------------- +The following section lists all the IB compliance statements which +OpenSM does not support. 
Please refer to the IB specification for detailed +information regarding each compliance statement. + +* C14-22 (Authentication): + M_Key M_KeyProtectBits and M_KeyLeasePeriod shall be set in one + SubnSet method. As a work-around, an OpenSM option is provided for + defining the protect bits. + +* C14-67 (Authentication): + On SubnGet(SMInfo) and SubnSet(SMInfo) - if M_Key is not zero then + the SM shall generate a SubnGetResp if the M_Key matches, or + silently drop the packet if M_Key does not match. + +* C15-0.1.23.4 (Authentication): + InformInfoRecords shall always be provided with the QPN set to 0, + except for the case of a trusted request, in which case the actual + subscriber QPN shall be returned. + +* o13-17.1.2 (Event-FWD): + If no permission to forward, the subscription should be removed and + no further forwarding should occur. + +* C14-24.1.1.5 and C14-62.1.1.22 (Initialization): + GUIDInfo - SM should enable assigning Port GUIDInfo. + +* C14-44 (Initialization): + If the SM discovers that it is missing an M_Key to update CA/RT/SW, + it should notify the higher level. + +* C14-62.1.1.12 (Initialization): + PortInfo:M_Key - Set the M_Key to a node based random value. + +* C14-62.1.1.13 (Initialization): + PortInfo:M_KeyProtectBits - set according to an optional policy. + +* C14-62.1.1.24 (Initialization): + SwitchInfo:DefaultPort - should be configured for random FDB. + +* C14-62.1.1.32 (Initialization): + RandomForwardingTable should be configured. + +* o15-0.1.12 (Multicast): + If the JoinState is SendOnlyNonMember = 1 (only), then the endport + should join as sender only. + +* o15-0.1.8 (Multicast): + If a request for creating an MCG with fields that cannot be met, + return ERR_REQ_INVALID (currently ignores SL and FlowLabelTClass). + +* C15-0.1.8.6 (SA-Query): + Respond to SubnAdmGetTraceTable - this is an optional attribute. 
+ +* C15-0.1.13 Services: + Reject ServiceRecord create, modify or delete if the given + ServiceP_Key does not match the one included in the ServiceGID port + and the port that sent the request. + +* C15-0.1.14 (Services): + Provide means to associate service name and ServiceKeys. + +4 Bug Fixes +----------- + +4.1 Major Bug Fixes + +18990fa opensm: set IS_SM bit during opensm init +3551389 fix local port smlid in osm_send_trap144() +a6de48d opensm/osm_link_mgr.c initialize SMSL +82df467 opensm/osm_req.c: Shouldn't reveal port's MKey on Trap method +45ebff9 opensm/osm_console_io.h: Modify osm_console_exit so only the + connection is killed, not the socket +d10660a opensm/osm_req.c: In osm_send_trap144, set producer type according + to node type +8a2d2dd opensm/osm_node_info_rcv.c: create physp for the newly discovered + port of the known node +39b241f opensm/lid_mgr: fix duplicated lid assignment +b44c398 opensm: invalidate routing cache when entering master state +595f2e3 opensm: update LFTs when entering master +8406c65 opensm: fix port chooser +fa90512 opensm/osm_vendor_*_sa: fix incompatibility with QLogic SM +7ec9f7c opensm: discard multicast SA PR with wildcard DGID +5cdb53f opensm/osm_sa_node_record.c use comp mask to match by LID or GUID +55f9772 opensm: Return single PathRecord for SubnAdmGet with DGID/SGID wild + carded +5ec0b5f opensm: compress IPV6 SNM groups to use a single MLID + +4.2 Other Bug Fixes + +4911e0b performance-manager-HOWTO.txt: Indicate master state +86ccaa4 opensm/osm_pkey_mgr.c: Fix pkey endian in log message +b79b079 opensm.8.in: Add mention of backing documentation for QoS policy + file and performance manager +b4d92af opensm/osm_perfmgr.c: Eliminate duplicated error number +a10b57a opensm/osm_ucast_ftree.c: lids are always handled in host order +44273a2 opensm/osm_ucast_ftree.c: fixing bug in indexing +5cd98f7 Fix further bugs around console closure and clean up code. 
+6b34339 opensm/osm_opensm.c: add newline to log message +68c241c send trap144 when local priority is higher than master priority +6462999 opensm/osm_inform.c: In __osm_send_report, make sure p_report_madw + valid before using +9b8561a opensm/console: Fixed osm_console poll to handle POLLHUP +91d0700 osm_vendor_ibumad.c: In clear_madw, fix tid endian in message +5a5136b osm_switch.h : Fixed wrong comment about return value of + osm_switch_set_hops +c1ec8c0 osm_ucast_ftree.c: Removed useless initialization on switch indexes +418d01f opensm/osm_helper.c: use single buffer in osm_dump_dr_smp() +2c9153c opensm/osm_helper.c: consolidate dr path printing code +048c447 opensm/osm_helper.c: return then log is inactive +dd3ef0c opensm: Return error status when cl_disp_register fails +0143bf7 opensm/osm_perfmgr.c: Improve assert in osm_pc_rcv_process +6622504 osm_perfmgr.c: In osm_perfmgr_shutdown, add missing cl_disp_unregister +7b66dee opensm: remove unneeded anymore physp initializations +f11274a opensm/partition-config.txt: Update for defmember feature +d240e7d opensm/osm_sm_state_mgr.c: Remove unneeded return statement +898fb8c opensm: Improve some snprintf uses +6820e63 opensm/osm_sa_link_record.c: improve get_base_lid() +64c8d31 opensm: initialize all switch ports +555fae8 opensm/sweep: add log message before lid assignment +8e22307 opensm/console: Enhance perfmgr print_counters for better nodenames +b9721a1 opensm/osm_console.c: Improve perfmgr print_counters error message +4d8dc72 opensm/osm_inform.c: Fix sense of zero GID compare in __match_inf_rec +a98dd82 opensm/main.c: remove enable_stack_dump() call +db6d51e opensm/osm_subnet: fix crash in qos string config parameters reloading +e5111c8 opensm: proper config file rescan +e5295b2 opensm: pre-scan command line for config file option +e2f549e opensm/osm_console.c: Eliminate some extraneous parentheses +0a265dc opensm/console: dump_portguid - don't duplicate matched guids +540fefb opensm/console: dump_portguid 
command fixes +d96202c opensm/osm_console.c: Add missing command in help_perfmgr +ae1bd3c opensm/osm_helper.c: Add port counters to __osm_disp_msg_str +1d38b31 opensm/osm_ucast_mgr.c: Add error numbers for some OSM_LOG prin +156c749 opensm: fix structure definition for trap 257-258 +5c09f4a opensm/osm_state_mgr.c: small bug in scanning lid table +72a2fa2 opensm/osm_sa.c: fixing SA MAD dump +539a4d3 opensm/osm_ucast_ftree.c Fixed bad init value for down port index +6690833 opensm/ftree: simplify root guids setup. +90e3291 opensm/ftree: cleanup ftree_sw_tbl_element_t use +c07d245 opensm/qos_config: no invalid option message on default values +b382ad8 opensm: avoid memory leaks on config parameters reloading +45f57ce opensm/osm_ucast_ftree.c: Fixed bug on index port incrementation +3d618aa opensm/osm_subnet.c: break matching when config parameter already found +44d98e3 opensm/osm_subnet.c: clean_val() remove trailing quotation +173010a opensm/doc/perf-manager-arch.txt: Fix some commentary typos +83bf6c5 opensm/osm_subnet.c fix parse functions for big endian machines +6b9a1e9 opensm/PerfMgr: Primarily fix enhanced switch port 0 perf manager + operation +4f79a17 opensm/osm_perfmgr.c: In osm_perfmgr_init, eliminate memory leak + on error +22da81f opensm/osm_ucast_ftree.c: fix full topology dump +aa25fcb opensm/osm_port_info_rcv.c: don't clear sw->need_update if port 0 + is active +003bd4b opensm/osm_subnet.c Fix memory leak for QOS string parameters. +9cbbab2 opensm/opensm.spec: fix event plugin config options +996e8f6 OpenSM: update osmeventplugin example for the new TRAP event. 
+67f4c07 opensm/lash: simplify some memory allocations +3e6bcdb opensm/lash: fix memory leaks +3ff97b9 opensm/vendor: save some stack memory +ccc7621 opensm/osm_ucast_ftree.c: fixing errors in comments +1a802b3 Corrected incoherency in __osm_ftree_fabric_route_to_non_cns comments +85a7e54 opensm/osm_sm.c: fix MC group creation in race condition +aad1af2 opensm/osm_trap_rcv.c: Improvements in log_trap_info() +f619d67 opensm/osm_trap_rcv.c: Minor reorganization of trap_rcv_process_request +084335b opensm/link_mgr: verify port's lid +d525931 opensm/osm_vendor_ibumad: Use OSM_UMAD_MAX_AGENTS rather than + UMAD_CA_MAX_AGENTS +f342c62 opensm/osm_sa.c: don't ignore failure in osm_mgrp_add_port() +587fda4 osmtest/osmt_multicast.c: fix strict aliasing breakage warning +6931f3e opensm: make subnet's max mlid update implementation independent +30f1acd osm_ucast_ftree.c missing reset of ca_ports +ac04779 opensm: fix LFT allocation size +a7838d0 opensm/osm_ucast_cache: reduce OSM_LOG_INFO debug printouts +c027335 opensm/osm_ucast_updn.c: Further reduction in cas_per_sw allocation +e8ee292 opensm/opensm/osm_subnet.c: adjust buffer to ensure a '\n' is printed +84d9830 opensm/osm_ucast_updn.c: Reduce temporary allocation of cas_per_sw +347ad64 opensm/ib_types.h: Mask off client rereg bit in set_client_rereg +c2ab189 opensm/osm_state_mgr.c: in cleanup_switch() check only relevant + LFT part +40c93d3 use transportable constant attributes +c8fa71a osmtest -code cleanup - use strncasecmp() +770704a opensm/osm_mcast_mgr.c: In mcast_mgr_set_mft_block, fix node GUID + in log message +3d20f82 opensm/osm_sa_path_record.c: separate router guid resolution code +27ea3c8 opensm: fix gcc-4.4.1 warnings +c88bfd3 opensm/osm_lid_mgr.c: Fix typo in OSM_LOG message +a9ea08c opensm/osm_mesh.c: Add dump_mesh routine at OSM_LOG_DEBUG level +bc2a61e C++ style coding does not compile +6647600 opensm: remove meanless 'const' keywords in APIs +323a74f opensm/osm_qos_parser_y.y: fix endless loop +0121a81 
opensm: fix endless looping in mcast_mgr +696c022 opensm: fix some obvious -Wsign-compare warnings +b91e3c3 opensm/osm_get_port_by_lid(): don't bother with lmc +ca582df opensm/osm_get_port_by_lid(): speedup a port lookup +fd846ee opensm/osm_mesh.c: simplify compare_switches() function +fe20080 osm_sa.c - void * arithmetic causes problems +220130f osm_helper.c use explicit value for struct init +0168ece use standard varargs syntax in macro OSM_LOG() +180b335 update functions to match .h prototypes +9240ef4 opensm/osm_ucast_lash: fix use after free bug +6f1a21a opensm: osm_get_port_by_lid() helper +c9e2818 opensm/osm_sa_path_record.c: validate multicast membership +225dcf5 opensm/osm_mesh.c: Remove edges in lash matrix +4dd928b opensm/osm_sa_mcmember_record.c: clean uninitialized variable use +c48f0bc opensm/osm_perfmgr_db.c: Fix memory leak of db nodes +82d3585 opensm/osm_notice.c: move logging code to separate function +9557f60 opensm/osm_inform.c: For traps 64-67, use GID from DataDetails in + log message +e2e78d9 opensm/opensm.8.in: Indicate default rule for Default partition +08c5beb opensm/osm_sa_node_record.c: dump NodeInfo with debug verbosity +1fe88f0 opensm/multicast: merge mcm_port and mcm_info +ba75747 opensm/multicast: consolidate port addition/removing code +5e61ab8 opensm: port object reference in mcm ports list +5c5dacf opensm: fix uninitialized return value in osm_sm_mcgrp_leave() +7cfe18d osm_ucast_ftree.c: Removed reverse_hop parameters from + fabric_route_upgoing_by_going_down +aa7fb47 opensm/multicast: kill mc group to_be_deleted flag +a4910fe opensm/osm_mcast_mgr.c: multicast routing by mlid - renaming +1d14060 opensm/multicast: remove change id tracking +5a84951 opensm: use mgrp pointer as osm_sm_mcgrp_join/leave() parameter +d8e3ff5 opensm: use mgrp pointer in port mcm_info +0631cd3 opensm doc: Indicated limited (rather than partial) partition + membership +1010535 opensm/osm_ucast_lash.c: In lash_core, return status -1 for all errors +942e20f 
opensm/osm_helper.c: Add SM priority changed into trap 144 description +2372999 opensm/osm_ucast_mgr: better lft setup +e268b32 opensm/osm_helper.c: Only change method when > rather than >= +9309e8c complib/cl_event.c: change nanosec var type long +d93b126 opensm/complib: account for nsec overflow in timeout values +ef4c8ac opensm/osm_qos_policy.c: matching PR query to QoS level with pkey +c93b58b opensm: fixing some data types in osm_req_get/set +2b89177 opensm/libvendor/osm_vendor_ibumad.c: Handle umad_alloc failure in + osm_vendor_get +2cba163 opensm/osm_helper.c: In osm_dump_dr_smp, fix endian of status +47397e3 opensm/osm_sm_mad_ctrl.c: Fix endian of status in error message +e83b7ca opensm/osm_mesh.c: Reorder switches for lash +9256239 opensm/osm_trap_rcv.c: Validate trap is 144 before checking for + NodeDescription changed +011d9ca opensm/osm_ucast_lash.c: Handle calloc failure in generate_cdg_for_sp +59964d7 opensm: fixing handling of opt.max_wire_smps +f4e3cd0 opensm/osm_ucast_lash.c: Directly call calloc/free rather than + create/delete_cdg +5a208bd opensm/osm_ucast_lash.c: Added error numbers to some error log messages +3b80d10 opensm/osm_helper.c: fix printing trap 258 details +f682fe0 opensm: do not configure MFTs when mcast support is disabled +cc42095 opensm/osm_sm_mad_ctrl.c: In sm_mad_ctrl_send_err_cb, indicate + failed attribute +aebf215 opensm/osm_ucast_lash.c: Remove osm_mesh_node_delete call from + switch_delete +1ef4694 opensm/osm_path.h: In osm_dr_path_init, only copy needed part of path +c594a2d opensm: osm_dr_path_extend can fail due to invalid hop count +46e5668 opensm/osm_lash: Fix use after free problem in osm_mesh_node_delete +81841dc opensm/osm_ucast_lash.c: Handle malloc failures better +2801203 opensm: remove extra "0x" from debug message. 
+88821d2 opensm/main.c: Display SMSL when specified +f814dcd opensm/osm_subnet.c: Format lash_start_vl consistent with other + uint8 items +66669c9 opensm/main.c: Display LASH start VL when specified +31bb0a7 opensm/osm_mcst_mgr.c: check number of switches only once +75e672c opensm: find MC group by MGID using fleximap +2b7260d Clarify the syntax of the hop_weights_file +e6f0070 opensm/osm_mesh.c: Improve VL utilization +27497a0 opensm/osm_ucast_ftree.c Fix assert comparing number of CAs to CN ports +3b98131 opensm/osm_qos_policy.c: Use proper size in malloc in + osm_qos_policy_vlarb_scope_create +e6f367d opensm/osm_ucast_ftree.c: Made error numbers unique in some log + messages +83261a8 osm_ucast_ftree.c Count number of hops instead of calculating it +7bdf4ff opensm/osm_sa_(path multipath)_record.c: Fix typo in a couple of + log messages +0f8ed87 opensm/osm_ucast_mgr.c: Add error numbers to some error log messages +0b5ccb4 complib/Makefile.am: prevent file duplications +e0b8ec9 opensm/osm_sminfo_rcv.c: clean type of smi_rcv_process_get_sm() +4d01005 opensm: sweep component processors return status value +6ad8d78 opensm/libvendor/osm_vendor_(ibumad mlx)_sa.c: Handle malloc + failure in __osmv_send_sa_req +cf97ebf opensm/osm_ucast_lash.(h c): Replace memory allocation by array +957461c opensm/osm_sa.c add attribute and component mask to error message +5d339a1 osm_dump.c dump port if lft is set up +518083d osm_port.c: check if op_vls = 0 before max_op_vls comparison +b6964cb opensm/osm_port.c: Change log level of Invalid OP_VLS 0 message + to VERBOSE +b27568c opensm/PerfMgr: Reduce host name length +bc495c0 opensm/osm_lid_mgr.c bug in opensm LID assignment +5a466fd opensm/osm_perfmgr_db.c: Remove unneeded initialization in + perfmgr_db_print_by_name +57cf328 opensm/osm_ucast_ftree.c Increase the size of the hop table +8323cf1 opensm/PerfMgr: Remove some underbars from internal names +65b1c15 opensm: Changes to spec and make files for updated release notes +cd226c7 
OpenSM: include/vendor/osm_vendor.h - Replaced #elif with no + condition by #else +9f8bd4a management: Fixed custom_release in SPEC files +c0b8207 opensm/PerfMgr: Change redir_tbl_size to num_ports for better clarity +596bb08 opensm/osm_sa.c: check for SA DB file only if requested +2f2bd4e opensm SA DB dump/restore: load SA DB only once +4abcbf2 opensm: Added print_desc to various log messages +5e3d235 opensm/osm_vendor_ibumad.c: Move error info into single message +8e5ca10 opensm/libvendor//osm_vendor_ibumad_sa.c: uninitialized fields +d13c2b6 opensm/osm_sm_mad_ctrl.c Changes to some error messages +f79d315 opensm/osm_sm_mad_ctrl.c: Add missing call to return mad to mad pool +150a9b1 opensm/osm_sa_mcmember_record.c: print mcast join/create failures in + VERBOSE instead of DEBUG level +9b7882a opensm/osm_vendor_ibumad.c: Change LID format to decimal in log message +5256c43 opensm/osm_vendor_mlx: fix compilation error +93db10d opensm/osm_vendor_mlx_txn.c: eliminate bunch of compilation warnings +156fdc1 opensm/osm_helper.c Log format changes +7a55434 opensm/osm_ucast_ftree.c Changed log level +a1694de opensm/osm_state_mgr.c Added more info to some error messages +fdec20a opensm/osm_trap_rcv.c: Eliminate heavy sweep on receipt of trap 145 +13a32a7 opensm - standardize on a single Windows #define - take #2 +b236a10 opensm/osm_db_files.c: kill useless malloc() castings +4ba0c26 opensm/osm_db_files.c: add '/' path delimited +e3b98a5 opensm/osm_sm_mad_ctrl.c: Fix qp0_mads_accounting +dbbe5b3 opensm/osm_subnet.c: fixing bug in dumping options file +f22856a opensm/osm_ucast_mgr.c: fix memory leak +0d5f0b6 opensm: osm_get_mgrp_by_mgid() helper +e3c044a osm_sa_mcmember_record.c: pass MCM Record data to mlid allocator +3dda2dc opensm/osm_sa_member_record.c: mlid independent MGID generator +1f95a3c opensm/osm_sa_mcmember_record.c: move mgid allocation code +b78add1 complib: replace intn_t types by C99 intptr_t +a864fd3 osmtest/osmt_mtl_regular_qp.c: cleaning uintn_t use 
+9e01318 opensm/osm_console.c: make const functions +f8c4c3e opensm/osm_mgrp_new(): add subnet db insertion +80da047 complib/fleximap: make compar callback to return int +bf7fe2d opensm: cleanup intn_t uses +0862bba opensm/main.c: opensm cannot be killed while asking for port guid +2b70193 opensm/complib: bug in cl_list_insert_array_head/tail functions +4764199 opensm - use C99 transportable data type for pointer storage +a9c326c opensm/osm_state_mgr.c: do not probe remote side of port 0 +4945706 opensm/osm_mcast_mgr.c: fix return value on alloc_mfts() failures +8312a24 OpenSM: Fix unused variable compiler warning. +ab8f0a3 opensm/partition: keep multicast group pointer +a817430 opensm: Only clear SMP beyond end of PortInfo attribute +52fb6f2 opensm/osm_switch.h: Remove dead osm_switch_get_physp_ptr routine +aa6d932 opensm/osm_mcast_tbl.c: In osm_mcast_tbl_clear_mlid, use memset to + clear port mask entry +2ad846b opensm/osm_trap_rcv.c: use source_lid and port_num for logging +b9d7756 opensm/osm_mcast_tbl: Fix size of port mask table array +11c0a9b opensm/main.c: Use strtoul rather than strtol for parsing transaction + timeout +0608af9 opensm/osm_sm_mad_ctrl.c: In sm_mad_ctrl_send_err_cb, revert setting + of init failure on QoS initialization failures +c6b4d4a opensm/osm_vendor_ibumad.c: Add transaction ID to osm_vendor_send + log message +520af84 opensm/osm_sa_path_record.c: don't set dgid pointer for local subnet +4a878fb opensm/osm_mcast_mgr.c: fix osm_mcast_mgr_compute_max_hops for + managed switch + +* Other less critical or visible bugs were also fixed. + +5 Main Verification Flows +------------------------- + +OpenSM verification is run using the following activities: +* osmtest - a stand-alone program +* ibmgtsim (IB management simulator) based - a set of flows that + simulate clusters, inject errors and verify OpenSM capability to + respond and bring up the network correctly. 
+* small cluster regression testing - where the SM is used on back to + back or single switch configurations. The regression includes + multiple OpenSM dedicated tests. +* cluster testing - when we run OpenSM to set up a large cluster, perform + hand-off, reboots and reconnects, verify routing correctness and SA + responsiveness at the ULP level (IPoIB and SDP). + +5.1 osmtest + +osmtest is an automated verification tool used for OpenSM +testing. Its verification flows are described in the list below. + +* Inventory File: Obtain and verify all port info, node info, link and path + record parameters. + +* Service Record: + - Register new service + - Register another service (with a lease period) + - Register another service (with service p_key set to zero) + - Get all services by name + - Delete the first service + - Delete the third service + - Added bad flows of get/delete of a non-valid service + - Add / Get same service with different data + - Add / Get / Delete by different component mask values (services + by Name & Key / Name & Data / Name & Id / Id only ) + +* Multicast Member Record: + - Query of existing Groups (IPoIB) + - BAD Join with insufficient comp mask (o15.0.1.3) + - Create given MGID=0 (o15.0.1.4) + - Create given MGID=0xFF12A01C,FE800000,00000000,12345678 (o15.0.1.4) + - Create BAD MGID=0xFA.
(o15.0.1.6) + - Create BAD MGID=0xFF12A01B w/ link-local not set (o15.0.1.6) + - New MGID with invalid join state (o15.0.1.9) + - Retry of existing MGID - See JoinState update (o15.0.1.11) + - BAD RATE when connecting to existing MGID (o15.0.1.13) + - Partial JoinState delete request - removing FullMember (o15.0.1.14) + - Full Delete of a group (o15.0.1.14) + - Verify Delete by trying to Join deleted group (o15.0.1.14) + - BAD Delete of IPoIB membership (no prev join) (o15.0.1.15) + +* GUIDInfo Record: + - All GUIDInfoRecords in subnet are obtained + +* MultiPathRecord: + - Perform some compliant and noncompliant MultiPathRecord requests + - Validation is via status in responses and IB analyzer + +* PKeyTableRecord: + - Perform some compliant and noncompliant PKeyTableRecord queries + - Validation is via status in responses and IB analyzer + +* LinearForwardingTableRecord: + - Perform some compliant and noncompliant LinearForwardingTableRecord queries + - Validation is via status in responses and IB analyzer + +* Event Forwarding: Register for trap forwarding using reports + - Send a trap and wait for report + - Unregister non-existing + +* Trap 64/65 Flow: Register to Trap 64-65, create traps (by + disconnecting/connecting ports) and wait for report, then unregister. + +* Stress Test: send PortInfoRecord queries, both single and RMPP and + check for the rate of responses as well as their validity. + + +5.2 IB Management Simulator OpenSM Test Flows: + +The simulator provides the ability to simulate the SM handling of virtual +topologies that are not limited to actual lab equipment availability. +OpenSM was simulated to bring up clusters of up to 10,000 nodes. Daily +regressions use smaller clusters (16 and 128 nodes). + +The following test flows are run on the IB management simulator: + +* Stability: + Up to 12 links from the fabric are randomly selected to drop packets + at drop rates up to 90%. The SM is required to succeed in bringing the + fabric up.
The resulting routing is verified to be correct as well. + +* LID Manager: + Using LMC = 2 the fabric is initialized with LIDs. Faults such as + zero LID, duplicated LID, non-aligned (to LMC) LIDs are + randomly assigned to various nodes and other errors are randomly + output to the guid2lid cache file. The SM sweep is run 5 times and + after each iteration a complete verification is made to ensure that all + LIDs that could possibly be maintained are kept, as well as that all nodes + were assigned a legal LID range. + +* Multicast Routing: + Nodes randomly join the 0xc000 group and eventually the + resulting routing is verified for completeness and adherence to + Up/Down routing rules. + +* osmtest: + The complete osmtest flow as described in the previous section is run on + the simulated fabrics. + +* Stress Test: + This flow merges fabric, LID and stability issues with continuous + PathRecord, ServiceRecord and Multicast Join/Leave activity to + stress the SM/SA during continuous sweeps. InformInfo Set/Delete/Get + were added to the test such that both existing and non-existing nodes + perform them in random order. + +5.3 OpenSM Regression + +Using a back-to-back or single switch connection, the following set of +tests is run nightly on the stacks described in Table 2. The included +tests are: + +* Stress Testing: Flood the SA with queries from multiple channel + adapters to check the robustness of the entire stack up to the SA. + +* Dynamic Changes: Dynamic Topology changes, through randomly + dropping SMP packets, used to test OpenSM adaptation to an unstable + network & verify DB correctness. + +* Trap Injection: This flow injects traps to the SM and verifies that it + handles them gracefully. + +* SA Query Test: This test exhaustively checks the SA responses to all + possible single component masks.
To do that the test examines the + entire set of records the SA can provide, classifies them by their + field values and then selects every field (using component mask and a + value) and verifies that the response matches the expected set of records. + A random selection using multiple component mask bits is also performed. + +5.4 Cluster testing: + +Cluster testing is usually run before a distribution release. It +involves real hardware setups of 16 to 32 nodes (or more if a beta site +is available). Each test is validated by running all-to-all ping through the IB +interface. The test procedure includes: + +* Cluster bringup + +* Hand-off between 2 or 3 SM's while performing: + - Node reboots + - Switch power cycles (disconnecting the SM's) + +* Unresponsive port detection and recovery + +* osmtest from multiple nodes + +* Trap injection and recovery + + +6 Qualified Software Stacks and Devices +--------------------------------------- + +OpenSM Compatibility +-------------------- +Note that OpenSM version 3.2.1 and earlier used a value of 1 in host +byte order for the default SM_Key, so there is a compatibility issue +with these earlier versions of OpenSM when the 3.2.2 or later version +is running on a little endian machine. This affects SM handover as well +as SA queries (saquery tool in infiniband-diags). 
+ + +Table 2 - Qualified IB Stacks +============================= + +Stack | Version +-----------------------------------------|-------------------------- +The main stream Linux kernel | 2.6.x +OFED | 1.4 +OFED | 1.3 +OFED | 1.2 +OFED | 1.1 +OFED | 1.0 + +Table 3 - Qualified Devices and Corresponding Firmware +====================================================== + +Mellanox +Device | FW versions +------------------------------------|------------------------------- +InfiniScale | fw-43132 5.2.000 (and later) +InfiniScale III | fw-47396 0.5.000 (and later) +InfiniScale IV | fw-48436 7.1.000 (and later) +InfiniHost | fw-23108 3.5.000 (and later) +InfiniHost III Lx | fw-25204 1.2.000 (and later) +InfiniHost III Ex (InfiniHost Mode) | fw-25208 4.8.200 (and later) +InfiniHost III Ex (MemFree Mode) | fw-25218 5.3.000 (and later) +ConnectX IB | fw-25408 2.3.000 (and later) + +QLogic/PathScale +Device | Note +--------|----------------------------------------------------------- +iPath | QHT6040 (PathScale InfiniPath HT-460) +iPath | QHT6140 (PathScale InfiniPath HT-465) +iPath | QLE6140 (PathScale InfiniPath PE-880) +iPath | QLE7240 +iPath | QLE7280 + +Note 1: OpenSM does not run on an IBM Galaxy (eHCA) as it does not expose +QP0 and QP1. However, it does support it as a device on the subnet. + +Note 2: QoS firmware and Mellanox devices + +HCAs: QoS supported by ConnectX. QoS-enabled FW release is 2_5_000 and +later. + +Switches: QoS supported by InfiniScale III +Any InfiniScale III FW that is supported by OpenSM supports QoS. diff --git a/release_notes/qib_release_notes.txt b/release_notes/qib_release_notes.txt new file mode 100644 index 0000000..3c5558f --- /dev/null +++ b/release_notes/qib_release_notes.txt @@ -0,0 +1,16 @@ + Open Fabrics Enterprise Distribution (OFED) + qib in OFED 1.5.1 Release Notes + + March 2010 + +====================================================================== +1. 
Overview +====================================================================== +qib is the low level driver implementation for all QLogic InfiniPath +PCI-Express HCAs: gen 1 x8 SDR QLE7140, gen 1 x8 DDR QLE7240, +gen 1 x16 DDR QLE7280, gen 2 x8 QDR QLE7340 and QLE7342. + +The qib driver is new for OFED 1.5. + +The qib kernel driver obsoletes the ipath kernel driver but is +compatible with libipathverbs so no new user level components are needed. diff --git a/release_notes/qperf_release_notes.txt b/release_notes/qperf_release_notes.txt new file mode 100644 index 0000000..28a5114 --- /dev/null +++ b/release_notes/qperf_release_notes.txt @@ -0,0 +1,79 @@ +Distribution + Open Fabrics Enterprise Distribution (OFED) 1.5, December 2009 + +Summary + qperf - Measure RDMA and IP performance + +Overview + qperf measures bandwidth and latency between two nodes. It can work over + TCP/IP as well as the RDMA transports. + +Quick Start + * Since qperf measures latency and bandwidth between two nodes, you need + access to two nodes. Assume they are called node1 and node2. + + * On node1, run qperf without any arguments. It will act as a server and + continue to run until asked to quit. + + * To measure TCP bandwidth between the two nodes, on node2, type: + qperf node1 tcp_bw + + * To measure RDMA RC latency, type (on node2): + qperf node1 rc_lat + + * To measure RDMA UD latency using polling, type (on node2): + qperf node1 -P 1 ud_lat + + * To measure SDP bandwidth, on node2, type: + qperf node1 sdp_bw + +Documentation + * Man page available. 
Type + man qperf + + * To get a list of examples, type: + qperf --help examples + + * To get a list of tests, type: + qperf --help tests + +Tests + Miscellaneous + conf Show configuration + quit Cause the server to quit + Socket Based + rds_bw RDS streaming one way bandwidth + rds_lat RDS one way latency + sctp_bw SCTP streaming one way bandwidth + sctp_lat SCTP one way latency + sdp_bw SDP streaming one way bandwidth + sdp_lat SDP one way latency + tcp_bw TCP streaming one way bandwidth + tcp_lat TCP one way latency + udp_bw UDP streaming one way bandwidth + udp_lat UDP one way latency + RDMA Send/Receive + ud_bw UD streaming one way bandwidth + ud_bi_bw UD streaming two way bandwidth + ud_lat UD one way latency + rc_bw RC streaming one way bandwidth + rc_bi_bw RC streaming two way bandwidth + rc_lat RC one way latency + uc_bw UC streaming one way bandwidth + uc_bi_bw UC streaming two way bandwidth + uc_lat UC one way latency + RDMA + rc_rdma_read_bw RC RDMA read streaming one way bandwidth + rc_rdma_read_lat RC RDMA read one way latency + rc_rdma_write_bw RC RDMA write streaming one way bandwidth + rc_rdma_write_lat RC RDMA write one way latency + rc_rdma_write_poll_lat RC RDMA write one way polling latency + uc_rdma_write_bw UC RDMA write streaming one way bandwidth + uc_rdma_write_lat UC RDMA write one way latency + uc_rdma_write_poll_lat UC RDMA write one way polling latency + InfiniBand Atomics + rc_compare_swap_mr RC compare and swap messaging rate + rc_fetch_add_mr RC fetch and add messaging rate + Verification + ver_rc_compare_swap Verify RC compare and swap + ver_rc_fetch_add Verify RC fetch and add diff --git a/release_notes/rdma_cm_release_notes.txt b/release_notes/rdma_cm_release_notes.txt new file mode 100644 index 0000000..07e4cab --- /dev/null +++ b/release_notes/rdma_cm_release_notes.txt @@ -0,0 +1,133 @@ + Open Fabrics Enterprise Distribution (OFED) + RDMA CM in OFED 1.5 Release Notes + + July 2010 + + 
+=============================================================================== +Table of Contents +=============================================================================== +1. Overview +2. New Features +3. Known Issues + +=============================================================================== +1. Overview +=============================================================================== +The RDMA CM is a communication manager used to set up reliable, connected +and unreliable datagram data transfers. It provides an RDMA transport +neutral interface for establishing connections. The API is based on sockets, +but adapted for queue pair (QP) based semantics: communication must be +over a specific RDMA device, and data transfers are message based. + + +The RDMA CM only provides the communication management (connection setup / +teardown) portion of an RDMA API. It works in conjunction with the verbs +API for data transfers. + +=============================================================================== +2. New Features +=============================================================================== +for OFED 1.5.2: + +Several enhancements were added to librdmacm release 1.0.12 that +are intended to simplify using RDMA devices and address scalability issues. +These changes were in response to long standing requests to make +connection establishment 'more like sockets'. For full details, +users should refer to the appropriate man pages. Major changes include: + +* Support synchronous operation for library calls. Users can control + whether an rdma_cm_id operates asynchronously or synchronously based on + the rdma_event_channel parameter. Use of synchronous operations + reduces the amount of application code required to use the librdmacm + by eliminating the need for event processing code. + + An rdma_cm_id will be marked for synchronous operation if the + rdma_event_channel parameter is NULL for rdma_create_id or + rdma_migrate_id.
Users can toggle between synchronous and + asynchronous operation through the rdma_migrate_id call. + + Calls that operate synchronously include rdma_resolve_addr, + rdma_resolve_route, rdma_connect, rdma_accept, and rdma_get_request. + Synchronous event data is returned to the user through the + rdma_cm_id. + +* The addition of a new API: rdma_getaddrinfo. This call is modeled + after getaddrinfo, but for RDMA devices and connections. It has the + following notable deviations from getaddrinfo: + + A source address is returned as part of the call to allow the + user to allocate necessary local HW resources for connections. + + Optional routing information may be returned to support + InfiniBand fabrics. IB routing information includes necessary + path record data. rdma_getaddrinfo will obtain this information + if IB ACM support (see below) is enabled. The use of IB ACM + is not required for rdma_getaddrinfo. + + rdma_getaddrinfo allows for future extensions to support + more complex address and route resolution mechanisms, such as + multiple path support and failover. + +* Support for new APIs: rdma_get_request, rdma_create_ep, and + rdma_destroy_ep. rdma_get_request simplifies the passive side + implementation by adding synchronous support for accepting new + connections. rdma_create_ep combines the functionality of + rdma_create_id, rdma_create_qp, rdma_resolve_addr, and rdma_resolve_route + in a single API that uses the output of rdma_getaddrinfo as its input. + +* Support for optional parameters. To simplify support for casual RDMA + developers and researchers, the librdmacm can allocate protection + domains, completion queues, and queue pairs on a user's behalf. + This reduces the amount of information that a developer + must learn in order to use RDMA, plus allows the user to take + advantage of higher-level completion processing abstractions.
+ + In addition to optional parameters, a user can also specify that the + librdmacm should automatically select usable values for RDMA read + operations. + +* Add support for IB ACM. IB ACM (InfiniBand Assistant for Communication + Management) defines a socket based protocol to an IB address and route + resolution service. One implementation of that service is provided + separately by the ibacm package, but anyone can implement the service + provided that they adhere to the IB ACM socket protocol. IB ACM is an + experimental service targeted at increasing the scalability of applications + running on a large cluster. + + Use of IB ACM is not required and is controlled through the build option + '--with-ib_acm'. If the librdmacm fails to contact the IB ACM service, it + reverts to using kernel services to resolve address and routing data. + +* Add RDMA helper routines. The librdmacm provides a set of simpler verbs + calls for posting work requests, registering memory, and checking for + completions. These calls are wrappers around libibverbs routines. + +=============================================================================== +3. Known Issues +=============================================================================== +The RDMA CM relies on the operating system's network configuration tables to +map IP addresses to RDMA devices. Incorrect network +configurations can result in the RDMA CM being unable to locate the correct +RDMA device. Currently, the RDMA CM only supports IPv4 addressing. + +All RDMA interfaces must provide a way to map IP addresses to an RDMA device. +For InfiniBand, this is done using IPoIB, and requires correctly configured +IPoIB device interfaces sharing the same multicast domain. For details on +configuring IPoIB, refer to ipoib_release_notes.txt. For RDMA devices to +communicate, they must support the same underlying network and data link +layers.
+ +If you experience problems using the RDMA CM, you may want to check the +following: + + * Verify that you have IP connectivity over the RDMA devices. For example, + ping between iWARP or IPoIB devices. + + * Ensure that IP network addresses assigned to RDMA devices do not + overlap with IP network addresses assigned to standard Ethernet devices. + + * For multicast issues, either bind directly to a specific RDMA device, or + configure the IP routing tables to route multicast traffic over an RDMA + device's IP address. + diff --git a/release_notes/rds_release_notes.txt b/release_notes/rds_release_notes.txt new file mode 100644 index 0000000..d7a6638 --- /dev/null +++ b/release_notes/rds_release_notes.txt @@ -0,0 +1,110 @@ + Open Fabrics Enterprise Distribution (OFED) + RDS in OFED 1.5.1 Release Notes + March 2010 + + +=============================================================================== +Table of Contents +=============================================================================== +1. Overview +2. Supported Platforms +3. Installation & Configuration +4. New Features +5. Bug fixes and Enhancements since OFED 1.4 +6. Bug fixes and Enhancements since OFED 1.3.1 +7. Bug fixes and Enhancements since OFED 1.3 +8. Bug fixes and Enhancements since OFED 1.2 +9. Known Issues + +=============================================================================== +1. Overview +=============================================================================== +RDS is a socket API. It provides reliable, in-order datagram delivery between +sockets over a variety of transports. +For details see RDS_README.txt and man 7 rds. + +=============================================================================== +2. Supported Platforms +=============================================================================== + +Same as overall OFED release. + +=============================================================================== +3.
Installation & Configuration +=============================================================================== +To install RDS, select rds in OFED's manual installation or put 'rds=y' in the +ofed.conf for unattended installation. + +To load the RDS module at boot, edit the file '/etc/infiniband/openib.conf' as +follows: + +# Load RDS module +RDS_LOAD=yes + +=============================================================================== +4. New Features +=============================================================================== + +GET_MR_FOR_DEST sockopt added. This allows an MR to be associated with +a remote host. GET_MR sockopt deprecated. + +Transports now modularized: rds_rdma.ko (IB and iWARP) and +rds_tcp.ko. This enables RDS use with TCP, without the IB stack +loaded. + +Improved receive processing to lower the amount of time spent with interrupts +disabled. + +=============================================================================== +5. Bug fixes and Enhancements since OFED 1.4 +=============================================================================== + +* Set retry_count to 2 and make modifiable via modparam +* Many locking fixes +* Rebasing to mainline kernel 2.6.30 resulted in the rds trace framework + being removed. + +=============================================================================== +6. Bug fixes and Enhancements since OFED 1.3.1 +=============================================================================== +- RDMA completion notifications are signalled when the IB stack gives us the + completion event for the accompanying RDS message. This is a change from the + 1.3.x behavior, which signalled completion notifications when the RDS message + was ACKed. +- Fixed bugs associated with congestion monitoring. +- FMR pool size increased from 2K to 4K +- Added support for RDMA_CM_EVENT_ADDR_CHANGE event. +- RDS should now work on QLogic HCAs. + +=============================================================================== +7.
Bug fixes and Enhancements since OFED 1.3 +=============================================================================== +- Fix a bug in RDMA signaling +- Add 3 more stats counters +- Fix a kernel crash that can occur when RDS/IB connection drops +- Fixes for RDMA API + +=============================================================================== +8. Bug fixes and Enhancements since OFED 1.2 +=============================================================================== + +1) The wire protocols for RDS v3 and RDS v2 are not compatible. + +2) RDS over TCP is disabled in OFED 1.3. It will be re-enabled in a future release. + +3) Congestion monitoring support gives the application more fine-grained + control. + +With explicit monitoring, the application polls for POLLIN as before, and +additionally uses the RDS_CONG_MONITOR socket option to install a 64-bit mask +value in the socket, where each bit corresponds to a group of ports. +When a congestion update arrives, RDS checks the set of ports that became +uncongested against the bit mask installed in the socket. If they overlap, a +control message is enqueued on the socket, and the application is woken up. +When the application calls recvmsg(2), it will be given the control message +containing the bitmap on the socket. + +=============================================================================== +9. Known Issues +=============================================================================== +1. RDMAs over 1 MiB are not supported. diff --git a/release_notes/sdp_release_notes.txt b/release_notes/sdp_release_notes.txt new file mode 100644 index 0000000..52fc1b7 --- /dev/null +++ b/release_notes/sdp_release_notes.txt @@ -0,0 +1,251 @@ + Open Fabrics Enterprise Distribution (OFED) + SDP in MLNX_OFED 1.5.2 Release Notes + + December 2010 + + + +=============================================================================== +Table of Contents +=============================================================================== +1.
Overview +2. Bug Fixes and Enhancements since OFED 1.5.2 +3. ZCopy +4. Known Issues +5. Verification Applications/Flows/Tests +6. Module Parameters + +=============================================================================== +1. Overview +=============================================================================== +Sockets Direct Protocol (SDP) is an InfiniBand byte-stream transport protocol +that provides TCP stream semantics. Capable of utilizing InfiniBand's advanced +protocol offload capabilities, SDP can provide lower latency, higher bandwidth, +and lower CPU utilization than IPoIB or +Ethernet running some sockets-based applications. + +SDP in OFED is at GA level for MLNX OFED 1.5.2. + +=============================================================================== +2. Main Features and Changes +=============================================================================== +- Added support for Inline and blueflame +- Improved stability +- Bug fixes + +=============================================================================== +2. Bug Fixes and Enhancements since OFED 1.5.2 +=============================================================================== +* Cleanups + - Added support for 2.6.34 / 2.6.36. + +* Bug Fixes + - Fixed compilation problems on 32-bit hosts + - Do not compile in debug mode when not asked. + - Improved recovery from errors. + +* Enhancements + - more statistics in /proc/sdpstats + - added debugfs for sdp: + - sdpprf was moved from /proc to debugfs/sdp + - debugfs/ - Socket history + + +=============================================================================== +3. ZCopy +=============================================================================== +- ZCopy is enabled by default for blocks larger than 64K. ZCopy can be disabled + by setting the module parameter sdp_zcopy_thresh to zero; the threshold can + be changed by setting it to another non-zero value. 
+ +- ZCOPY mode gives good performance for large blocks with very low CPU + utilization. When in use, all messages longer than 'sdp_zcopy_thresh' bytes + in length will cause the user space buffer to be pinned and the data sent + directly from the original buffer. This results in less CPU usage and on many + systems in enhanced bandwidth. + ZCOPY is most efficient with multi stream jobs and it performs better as the + message size increases. + The default 64K value for 'sdp_zcopy_thresh' is too low for some + systems. You must experiment with your hardware to select the best value. + +- ZCOPY vs BCOPY: + ZCOPY is more efficient with a weak CPU and with multiple streams, whereas + BCOPY is more efficient with a single stream. + +=============================================================================== +4. Known Issues +=============================================================================== +- SDP is at beta level on Infinihost HCA family + +- Occasionally, socket bind fails with EINVAL. Although the TCP socket is bound + successfully, SDP is occupied, thus causing the socket bind failure. + See Bugzilla 2159 and Bugzilla 2160 + +- When SO_REUSEADDR is set, only a single socket can be bound to IP_ANY and a + specific port. This is a TCP limitation, unless one of the sockets is + listening. + +- BUG 1331 - Although TCP allows connecting to IP_ANY - 0.0.0.0 + (as a destination address!), SDP does not allow connecting to the IP_ANY + and rejects the connection. + +- BUG 1444 - The setsockopt(SO_RCVBUF) is not functional in sdp socket. + To limit top system wide sdp memory usage for recv, + use the module parameter top_mem_usage. + +- Each SDP socket currently consumes up to 2 MBytes of memory. If this value + is high for your installation, it is possible to trade off performance + for lower memory utilization per socket by reducing the value of the + "rcvbuf_scale" module parameter (default: 16). 
+ + Note: The minimum legal value for the "rcvbuf_scale" module is 1. + At this parameter value, each socket will consume approximately 128 KBytes. + +- Small message size performance is low when messages are sent by client + at a rate lower than the rate at which they are consumed by server, + and when TCP_CORK is not set. This is observed, for example, with iperf + benchmark. + Workaround: Set the TCP_CORK socket option + to ensure data is sent in at least 32K byte chunks. + +- Performance is low on 32-bit kernels, as SDP utilizes high memory + to ease memory pressure. + Workaround: Move to a 64-bit kernel if the application remains a 32-bit one. + +- By default, SDP utilizes a 2 Kbyte MTU size. This may cause PCI-X cards + using Mellanox Technologies "Infinihost" HCAs to experience low bandwidth. + Workaround: Reset the MTU size to 1K in this situation, using either of + the two methods below: + + 1. Activate the "tavor quirk" workaround in opensm: + a. Create an opensm options cache file (/var/cache/osm/opensm.opts): + > opensm --cache-options -o + b. Add the following line to /var/cache/osm/opensm.opts: + enable_quirks TRUE + c. Rerun opensm using your usual command line options to activate + the opensm quirk option. + + 2. Activate the "tavor quirk" workaround in cma: + set the tavor_quirk module parameter of the rdma_cm module to value 1 + (default: 0). + +- When waiting for RX, the driver first polls, arms interrupt and then goes to + sleep. Polling duration could be set by recv_poll module parameter. The + higher this value is, the higher the CPU utilization is, and the number of + interrupts is lower. + This should be fine tuned according to the specific environment and + application latency. + +- When using SDP over RoCE, and the peer has a card that does not support RoCE + a delay in the connection establishment may occur. + +- BUG2185 - Occasionally, accessing /proc/net/sdpstats, causes kernel + panic. 
+ +- For set-user-ID/set-group-ID ELF binaries, only libraries in the standard + search directories that are also set-user-ID will be preloaded. Since always + installing libsdp with this bit on is a security vulnerability, the default + behavior is to reset this bit. A user who wants to run such binaries should + modify the libsdp.spec file. + +=============================================================================== +5. Verification Applications/Flows/Tests +=============================================================================== +- ssh/sshd +- wget/netscape/firefox/apache +- netpipe +- netperf +- LTP socket tests +- iperf-2.0.2 +- ttcp +- openmpi +- openmpi + Intel MPI benchmarks +- Threaded and forking echo client server examples +- Various Java client server applications (SUN:jre, BEA:jrockit/WebLogic, GNU:gij/gcj) +- Many UNIX utilities to verify that pre-load did not harm the applications + +=============================================================================== +6. Module Parameters +=============================================================================== + +General +------- +sdp_link_layer_ib_only: + Supports only link layer of type InfiniBand. + It is useful when not using SDP over RoCE. + +sdp_debug_level: + Enables connection establishment and teardown debug tracing. + +sdp_data_debug_level: + Enables datapath debug tracing. If set to 1, it shows only packets >1. + To enable debugging of data path, compile driver with CONFIG_SDP_DEBUG_DATA. + + +recv_poll: + Enables poll receiving before arming the interrupt. Set a higher value + to decrease the number of RX interrupts. Consequently, the CPU + utilization will be higher. + +sdp_keepalive_time: + Default idle time in seconds before a keepalive probe is sent. + +Resources +--------- +rcvbuf_initial_size: + Receive buffer initial size in bytes. + +rcvbuf_scale: + Not in use + +top_mem_usage: + Top system wide sdp memory usage for recv (in MB). 
+ +max_large_sockets: + Not in use + +sdp_fmr_pool_size: + Number of FMRs to allocate for pool + +sdp_fmr_dirty_wm: + Watermark to flush fmr pool + +Thresholds +---------- +sdp_inline_thresh: + Inline copy threshold. effective to new sockets only; 0=Off. + +sdp_zcopy_thresh: + Zero copy using RDMA threshold; 0=Off. + If smaller than page size, set to page size. + +Interrupt hardware moderation: +------------------------------ +sdp_rx_coal_target: + Target number of bytes to coalesce with interrupt moderation. + +sdp_rx_coal_time: + rx coal time (jiffies). + +sdp_rx_rate_low: + rx_rate low (packets/sec). + +sdp_rx_coal_time_low: + low moderation usec. + +sdp_rx_rate_high: + rx_rate high (packets/sec). + +sdp_rx_coal_time_high: + high moderation usec. + +sdp_rx_rate_thresh: + rx rate thresh (). + +sdp_sample_interval: + sample interval (jiffies). + +hw_int_mod_count: + Forced hw int moderation val. -1 for auto (packets). 0 to disable. + +hw_int_mod_usec: + Forced hw int moderation val. -1 for auto (usec). 0 to disable. diff --git a/release_notes/srp_release_notes.txt b/release_notes/srp_release_notes.txt new file mode 100644 index 0000000..b6cee0b --- /dev/null +++ b/release_notes/srp_release_notes.txt @@ -0,0 +1,613 @@ + + Open Fabrics Enterprise Distribution (OFED) + SRP in OFED 1.5.2 Release Notes + + December 2010 + + +============================================================================== +Table of contents +============================================================================== + + 1. Overview + 2. Changes and Bug Fixes since OFED 1.5 + 3. Software Dependencies + 4. Major Features + 5. Loading SRP Initiator + 6. Manually Establishing an SRP Connection + 7. SRP Tools - ibsrpdm and srp_daemon + 8. Automatic Discovery and Connecting to Targets + 9. Multiple Connections from Initiator IB Port to the Target + 10. High Availability + 11. Shutting Down SRP + 12. Known Issues + 13. 
Vendor Specific Notes + + +============================================================================== +1. Overview +============================================================================== + +The SRP standard describes the message format and protocol definitions required +for transferring commands and data between a SCSI initiator port and a SCSI +target port using RDMA communication service. + + +============================================================================== +2. Changes and Bug Fixes since OFED 1.5 +============================================================================== +* Check for scsi_id in scmnd to prevent scan/rescan keep adding new scsi devices + ie. echo "- - -" > /sys/class/scsi_host/hostXX/scan +* Bug fixing + +============================================================================== +4. Software Dependencies +============================================================================== + +The SRP Initiator depends on the installation of the OFED Distribution stack +with OpenSM running. + +============================================================================== +5. Major Features +============================================================================== + +This SRP Initiator is based on source taken from openib.org gen2 implementing +the SCSI RDMA Protocol-2 (SRP-2), Doc. no. T10/1524-D. See: +www.t10.org/ftp/t10/drafts/srp2/srp2r00a.pdf + +The SRP Initiator supports: +- Basic SCSI Primary Commands -3 (SPC-3) + (www.t10.org/ftp/t10/drafts/spc3/spc3r21b.pdf) +- Basic SCSI Block Commands -2 (SBC-2) + (www.t10.org/ftp/t10/drafts/sbc2/sbc2r16.pdf) +- Basic functionality, task management and limited error handling + +============================================================================== +6. 
Loading SRP Initiator +============================================================================== + +To load the SRP module, either execute the "modprobe ib_srp" command after the +OFED driver is up, or change the value of SRP_LOAD in +/etc/infiniband/openib.conf to "yes" (causing the srp module to be loaded +at driver boot). + +NOTE: When loading the ib_srp module, it is possible to set the module + parameter srp_sg_tablesize. This is the maximum number of + gather/scatter entries per I/O (default: 12). + + a. modprobe ib_srp srp_sg_tablesize=32 + or + b. edit /etc/modprobe.conf and add the following line: + options ib_srp srp_sg_tablesize=32 + +Module parameters: +To list the ib_srp module parameters, run: + $ modinfo ib_srp + + + srp_sg_tablesize: Max number of scatter/gather entries per I/O + + srp_dev_loss_tmo: Number of seconds that srp driver will not return + DID_NO_CONNECT status after it loses the connection to + the target. During this period, it will try to + re-establish the connection to target, and return + DID_RESET, DID_ABORT statuses for outstanding scsi + commands to prevent the DM Multipath driver from + failing over to the next paths. + Default value is 60 seconds. + +============================================================================== +7. Manually Establishing an SRP Connection +============================================================================== + +The following steps describe how to manually establish an SRP connection between +the Initiator and an SRP Target. Section 8 explains how to do this +automatically. + +- Make sure that the ib_srp module is loaded, the SRP Initiator is reachable + by the SRP Target, and that an SM is running. 
+ +- To establish a connection with an SRP Target and create SRP (SCSI) device(s) + for that target under /dev, use the following command: + + echo -n id_ext=[GUID value],ioc_guid=[GUID value],dgid=[port GID value],\ + pkey=ffff,service_id=[service[0] value] > \ + /sys/class/infiniband_srp/srp-mthca[hca number]-[port number]/add_target + + a. Execution of the above "echo" command may take some time + b. The SM must be running while the command executes + c. It is possible to include additional parameters in the echo command: + > max_cmd_per_lun - Default: 63 + > max_sect (short for max_sectors) - sets the request size of a command + > io_class - Default: 0x100 as in rev 16A of the specification + Note: In rev 10 the default was 0xff00 + > initiator_ext - Please refer to Section 9 (Multiple Connections...) + d. See SRP Tools below for instructions on how the parameters in the + echo command above may be obtained. + +NOTES: + +- If the same *echo -n * command is used more than once, the srp target + will terminate the previous connection and establish the new + connection. To have two or more connections to the srp target, please use + different initiator_ext values in the echo command. + +- To list the new SCSI devices that have been added by the echo command, you + may use either of the following two methods: + a. Execute "fdisk -l". This command lists all devices; the new devices are + included in this listing. + b. Execute *dmesg* or look at /var/log/messages to find messages with the + names of the new devices. + + +============================================================================== +8. 
SRP Tools - ibsrpdm and srp_daemon +============================================================================== + +To assist in performing the steps in Section 6, the OFED 1.3.1 distribution +provides two utilities which: +- Detect targets on the fabric reachable by the Initiator (for step 1) +- Output target attributes in a format suitable for use in the above + "echo" command (step 2) + +These utilities are: ibsrpdm and srp_daemon. + +The utilities can be found under /usr/local/ofed/sbin/ (or /sbin/), +and are part of the srptools RPM that may be installed using the +OFED custom installation. Detailed information regarding the various +options for these utilities are provided by their man pages. + +Below, several usage scenarios for these utilities are presented. + +ibsrpdm usage +------------- +1. Detecting reachable targets + + a. To detect all targets reachable by the SRP initiator via the default + umad device (/dev/infiniband/umad0), execute the following command: + $ ibsrpdm + + This command will output information on each SRP target detected, in + human-readable form. + + Sample output: + IO Unit Info: + port LID: 0103 + port GID: fe800000000000000002c90200402bd5 + change ID: 0002 + max controllers: 0x10 + + controller[ 1] + GUID: 0002c90200402bd4 + vendor ID: 0002c9 + device ID: 005a44 + IO class : 0100 + ID: LSI Storage Systems SRP Driver 200400a0b81146a1 + service entries: 1 + service[ 0]: 200400a0b81146a1 / SRP.T10:200400A0B81146A1 + + b. To detect all the SRP Targets reachable by the SRP Initiator via + another umad device, use the following command: + + $ ibsrpdm -d + +2. Assistance in creating an SRP connection + + a. To generate output suitable for utilization in the "echo" command of + section 5, add the "-c" option to ibsrpdm: + + $ ibsrpdm -c + + Sample output: + id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4, + dgid=fe800000000000000002c90200402bd5,pkey=ffff, + service_id=200400a0b81146a1 + + b. 
To establish a connection with an SRP Target (Section 6) using the output + from the "ibsrpdm -c" example above, execute the following command: + + $ echo -n id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4, + dgid=fe800000000000000002c90200402bd5,pkey=ffff, + service_id=200400a0b81146a1 + > /sys/class/infiniband_srp/srp-mlnx_0-1/add_target + + The SRP connection should now be up; the newly created SCSI devices should + appear in the listing obtained from the "fdisk -l" command. + + +srp_daemon +---------- +The srp_daemon utility is based on ibsrpdm and extends its functionality. +In addition to the ibsrpdm functionality described above, srp_daemon can also: +- Establish an SRP connection by itself (without the need to issue the "echo" + command described in Section 6) +- Continue running in background, detecting new targets and establishing SRP + connections with them (daemon mode) +- Discover reachable SRP targets given an InfiniBand HCA name and port, rather + than just by /dev/umad where is a digit +- Enable High Availability operation (together with Device-Mapper Multipath) +- Have a configuration file that determines the targets to connect to + +a. srp_daemon commands equivalent to ibsrpdm: + + "srp_daemon -a -o" is equivalent to "ibsrpdm" + "srp_daemon -c -a -o" is equivalent to "ibsrpdm -c" + +Note: These srp_daemon commands can behave differently than the equivalent + ibsrpdm command when /etc/srp_daemon.conf is not empty. + +b. srp_daemon extensions to ibsrpdm + + - To discover SRP Targets reachable from HCA device , + port , (and generate output suitable for 'echo') you may execute + + srp_daemon -c -a -o -i -p + + - To both discover the SRP Targets and establish connections with them, just + add the -e option to the above command. + + - Executing srp_daemon over a port without the -a option displays only + the targets that are reachable via the port and to which the initiator is + not connected. If executing with the -e option it is better to omit -a. 
+ + - It is recommended to use the -n option. This option adds the initiator_ext + to the connecting string. (See Section 9 for more details). + + - srp_daemon has a configuration file that can be set, where the default is + /etc/srp_daemon.conf. Use the -f to supply a different configuration file + that configures the targets srp_daemon is allowed to connect to. The + configuration file can also be used to set values for additional + parameters (e.g., max_cmd_per_lun, max_sect). + + - A continuous background (daemon) operation, providing an automatic ongoing + detection and connection capability. See Section 8. + +============================================================================== +9. Automatic Discovery and Connecting to Targets +============================================================================== + +- Make sure that the ib_srp module is loaded, the SRP Initiator can reach an + SRP Target, and that an SM is running. + +- To connect to all the existing Targets in the fabric, execute + srp_daemon -e -o. This utility will scan the fabric once, connect to + every Target it detects, and then exit. + +NOTE: srp_daemon will follow the configuration it finds in + /etc/srp_daemon.conf. Thus, it will ignore a target that is disallowed in + the configuration file. + +- To connect to all the existing Targets in the fabric and to connect + to new targets that will join the fabric, execute srp_daemon -e. This utility + continues to execute until it is either killed by the user or encounters + connection errors (such as no SM in the fabric). + +- To execute SRP daemon as a daemon you may execute run_srp_daemon + (found under /usr/local/ofed/sbin/ or /sbin/), providing it with + the same options used for running srp_daemon. + + Note: Make sure only one instance of run_srp_daemon runs per port. + +- To execute SRP daemon as a daemon on all the ports, execute srp_daemon.sh + (found under /usr/local/ofed/sbin/ or /sbin/). 
+ srp_daemon.sh sends its log to /var/log/srp_daemon.log. + +- It is possible to configure this script to execute automatically when the + InfiniBand driver starts by changing the value of SRP_DAEMON_ENABLE in + /etc/infiniband/openib.conf to "yes" and SRP_LOAD to yes as well. + + Another option to configure this script to execute automatically when the + InfiniBand driver starts is by changing the value of SRPHA_ENABLE in + /etc/infiniband/openib.conf to "yes". However, this option also enables + SRP High Availability that has some more features. (Please read the High + Availability section). + +============================================================================== +10. Multiple Connections from Initiator IB Port to the Target +============================================================================== + +Some system configurations may need multiple SRP connections from +the SRP Initiator to the same SRP Target: to the same Target IB port, +or to different IB ports on the same Target HCA. + +In case of a single Target IB port, i.e., SRP connections use the same path, +the configuration is enabled using a different initiator_ext value for each +SRP connection. The initiator_ext value is a 16-hexadecimal-digit value +specified in the connection command. + +Also in case of two physical connections (i.e., network paths) from a single +initiator IB port to two different IB ports on the same Target HCA, a +different initiator_ext value is needed on each path. The convention is to +use the Target port GUID as the initiator_ext value for the relevant path. + +If you use srp_daemon with the -n flag, it automatically assigns initiator_ext +values according to this convention. For example: + + id_ext=200500A0B81146A1,ioc_guid=0002c90200402bec, + dgid=fe800000000000000002c90200402bed, + pkey=ffff,service_id=200500a0b81146a1,initiator_ext=ed2b400002c90200 + + Notes: + a. It is recommended to use the -n flag for all srp_daemon invocations. + b. 
ibsrpdm does not have a corresponding option. + c. srp_daemon.sh always uses the -n option (whether invoked manually by + the user, or automatically at startup by setting SRPHA_ENABLE or + SRP_DAEMON_ENABLE to yes). + +============================================================================== +11. High Availability (HA) +============================================================================== + +High Availability Overview +-------------------------- + +High Availability works using the Device-Mapper (DM) multipath and the +SRP daemon. + +Each initiator is connected to the same target from several ports/HCAs. +The DM multipath is responsible for joining together different paths to the +same target and for fail-over between paths when one of them goes offline. +Multipath will be executed on newly joined SCSI devices. + +Each initiator should execute several instances of the SRP daemon, one for each +port. At startup, each SRP daemon detects the SRP targets in the fabric and +sends requests to the ib_srp module to connect to each of them. These +SRP daemons also detect targets that subsequently join the fabric, and send the +ib_srp module requests to connect to them as well. + +High Availability Operation +--------------------------- + +When a path (from port1) to a target fails, the ib_srp module starts an error +recovery process. If this process gets to the reset_host stage and there is no +path to the target from this port, ib_srp will remove this scsi_host. After +the scsi_host is removed, multipath switches to another path to this target +(from another port/HCA). + +When the failed path recovers, it will be detected by the SRP daemon. The SRP +daemon will then request ib_srp to connect to this target. Once the connection +is up, there will be a new scsi_host for this target. Multipath will be +executed on the devices of this host, returning to the original state (prior to +the failed path). 
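The one-daemon-per-port arrangement described in the High Availability Overview
can be sketched as a dry run that only prints the commands. This is a minimal
sketch, not part of OFED: the helper name print_srp_daemon_cmds is
hypothetical, the entry names are assumed to follow the srp-<hca>-<port> form
seen in the add_target paths of these notes, and the options (-c -e -R 300 -i
-p) are the ones these notes use for manual activation of High Availability.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not an OFED tool): print, but do not
# run, one srp_daemon command line per port.
# Each argument is an entry name of the form srp-<hca>-<port>, as seen under
# /sys/class/infiniband_srp/ (for example srp-mlx4_0-1).
print_srp_daemon_cmds() {
    for entry in "$@"; do
        name=${entry#srp-}      # strip the "srp-" prefix   (mlx4_0-1)
        hca=${name%-*}          # text before the last "-"  (mlx4_0)
        port=${name##*-}        # text after the last "-"   (1)
        echo "srp_daemon -c -e -R 300 -i $hca -p $port"
    done
}

# Example: two ports on one HCA.
print_srp_daemon_cmds srp-mlx4_0-1 srp-mlx4_0-2
```

Review the printed lines before running any of them, and keep to one daemon
instance per port.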
+ +High Availability Prerequisites +------------------------------- + +Installation for RHEL4 and RHEL5: (Execute once) + - Verify that the standard device-mapper-multipath rpm is installed. If not, + install it from the RHEL distribution. + +Installation for SLES10: (Execute once) + - Verify that multipath is installed. If not, install it from the + installation (You can use yast). + + - Update udev: (Execute once - for manual activation of High Availability only) + + - Add a file to /etc/udev/rules.d/ (you can call it 91-srp.rules) + This file should have one line: + ACTION=="add", KERNEL=="sd*[!0-9]", RUN+="/sbin/multipath %M:%m" + + Note: When SRPHA_ENABLE is set to "yes" (see Automatic Activation of High + Availability below), this file is created upon each boot of the driver and + is deleted when the driver is unloaded. + + +Manual Activation of High Availability +-------------------------------------- + +Initialization: (Execute after each boot of the driver) + 1) Execute modprobe dm-multipath + 2) Execute modprobe ib-srp + 3) Make sure you have created file /etc/udev/rules.d/91-srp.rules + as described above + 4) Execute for each port and each HCA: + srp_daemon -c -e -R 300 -i -p + (You can use another value for -R. See under the Known Issues section + the workaround for the rare race condition.) + + This step can be performed by executing srp_daemon.sh, which sends + its log to /var/log/srp_daemon.log. + + Now it is possible to access the SRP LUNs on /dev/mapper/. + + NOTE: It is possible for regular (non-SRP) LUNs to also be present; + the SRP LUNs may be identified by their names. You can configure the + /etc/multipath.conf file to change multipath behavior. + + +Automatic Activation of High Availability +----------------------------------------- +- Set the value of SRPHA_ENABLE in /etc/infiniband/openib.conf to "yes". + Also make sure SRP_LOAD=yes and SRP_DAEMON_ENABLE=yes. 
+ +- From the next loading of the driver it will be possible to access the SRP + LUNs on /dev/mapper/ + NOTE: It is possible that regular (not SRP) LUNs may also be present; + the SRP LUNs may be identified by their name. + +- It is possible to see the output of the SRP daemon in /var/log/srp_daemon.log + + +============================================================================== +12. Shutting Down SRP +============================================================================== + +SRP can be shutdown by using "modprobe -r ib_srp", or by stopping the OFED +("/etc/init.d/openibd stop"), or as a by-product of a complete system shutdown. + +Prior to shutting down SRP, it is REQUIRED to remove all references to it. +The actions you need to take depend on the way SRP was loaded. There are +three cases. + +a. Without High Availability +------------------------------------ +When working without High Availability, you should unmount all SRP +partitions that were mounted prior to shutting down SRP. +For example, /dev/sdd1 is srp partition and mount to /mnt/test +$ umount /mnt/test +$ modprobe -r ib_srp + +NOTES: the umount may get stuck ~90 seconds per connection to target if the + target is down. This is due to the srp_dev_loss_tmo=60 seconds which + srp driver waits for the target coming back before returning error + status. + If you have shutdown/remove srp target and the host have 4 connections + to the SRP target, you should wait ~4-5 minutes for the unmount to exit. + Do not ctrl+c to kill umount process. + +b. After Manual Activation of High Availability +----------------------------------------------- +If you manually activated SRP High Availability, perform the following steps: +- Unmount all SRP partitions that were mounted +- Kill all SRP daemon instances. +- Make sure there are no multipath instances running. If there are multiple + instances, wait for them to end or kill them. 
+- Execute multipath -F + +Example: +$ umount /mnt/test1 /mnt/test2 (wait for it to exit, do not ctrl+c) +$ ps -ax and kill all srp_daemon processes. +$ multipath -ll (wait for it to exit, do not ctrl+c) +$ multipath -F +$ modprobe -r ib_srp + +c. After Automatic Activation of High Availability +-------------------------------------------------- +If SRP High Availability was automatically activated, SRP shutdown must be +part of the driver shutdown ("/etc/init.d/openibd stop") which performs +steps 2-5 of case (b) above. However, you still have to unmount all SRP +partitions that were mounted before driver shutdown. + + +HAL Issue +--------- +The HAL (Hardware Abstraction Layer) system includes a daemon that examines +all devices in the system. In this process, it frequently holds a reference +to the ib_srp module. If you attempt to shutdown SRP while this daemon is +holding a reference to ib_srp, the shutdown will fail. Therefore, you +should make sure this will not occur. One solution may be to stop "haldaemon" +(/etc/init.d/haldaemon stop) prior to SRP shutdown. + + +============================================================================== +13. Known Issues +============================================================================== + +- There is a very rare race condition which can cause the SRP daemon to miss a + target that joins the fabric. The race can occur if a target joins and leaves + the fabric several times in a short time (e.g., if the cable is not connected + well). In such a case, the SM may ignore this quick change of state and may + not send an InformInfo to the srp_daemon. + + Workaround: Execute the srp_daemon command with the -R option. This + option causes the SRP daemon to perform a full rescan of the fabric every + seconds. + +- The srp_daemon does not support different pkeys other than the default + pkey=ffff + +- It is recommended to use an SM that supports the enhanced capability mask + matching feature (errata MGTWG8372). 
With SMs which support this feature, the + SRP daemon generates significantly less communication traffic. + +- When booting OFED with SRP High Availability enabled, executing multipath for + all LUNs on all connections may take some time (several minutes). However, it + is possible to start working while this process is in progress. + +- Stopping the driver while SRP High Availability is enabled kills all + multipath processes. Consider appropriate actions in case multipath is used + for other purposes. + +- As High Availability is based on Device Mapper multipath, it embodies + multipath limitations and also its configuration and tuning options. + See http://christophe.varoqui.free.fr/wiki/wakka.php?wiki=Home + for information on multipath. + To modify and tune multipath configuration, edit the file /etc/multipath.conf + according to instructions and tips listed in + /usr/share/doc/packages/multipath-tools/multipath.conf.* + +- In case your topology has two physical connections (i.e., network paths) from + a single initiator IB port to two different IB ports on the same Target HCA, + and you wish to have an SRP connection on the one path coexist with an SRP + connection on the second path, you must set a different initiator_ext value + on each path. See Section 9, "Multiple Connections from Initiator IB Port + to the Target" for details. + +- The srp_daemon tool reads by default the configuration file + /etc/srp_daemon.conf. In case this configuration file disallows connecting + to a certain target, srp_daemon will ignore the target. If you find out + that srp_daemon ignores a target, please check the /etc/srp_daemon.conf file. + +- When rebooting the system with an uncleanly mounted filesystem and a dead + connection to the SRP target, the system may get stuck. + +- After establishing the connection with the srp target and rebooting the + system, the initiator will fail to connect to the target at the first manual + *echo -n* command (the target rejects it as a stale connection). 
You need to do *echo -n* one more + time. + You do not see this problem with srp_daemon mode since srp_daemon will + retry to connect. + +- The combination of "weak" single lun srp target, I/O with big block size, + default max_command_per_lun=63 while using /dev/urandom to create file with + ext3 fs on srp lun, may cause ext3 remount with "read-only" flag + ie. + Example: + sdb1 is first partition of srp lun sdb, ext3 fs is created + $ mount /dev/sdb1 /mnt/sdb1; cd /mnt/sdb1 + $ dd if=/dev/urandom of=10G-file bs=1G count=10 + --> ext3 fs may remount with read-only flag + + Workarounds: + ------------ + a. Log into the target with small max_command_per_lun (3,4,8) + $ echo id_ext=0002c9030008fc0c,ioc_guid=0002c9030008fc0c, + dgid=fe800000000000000002c90300084417,max_cmd_per_lun=4,pkey=ffff, + service_id=0002c9030008fc0c > /sys/class/infiniband_srp/srp-mlx4_0-1/add_target + + ----------------OR------------------- + + b. Run dd with /dev/zero instead of /dev/urandom + $ dd if=/dev/zero of=10G-file bs=1G count=10 + + ----------------OR------------------- + + c. Run dd with smaller block size + $ dd if=/dev/urandom of=10G-file bs=128K count=40000 + + ----------------OR------------------- + + d. Combine the a,b,c steps (This is the recommended workaround) + +============================================================================== +14. Vendor Specific Notes +============================================================================== + +Hosts connected to Qlogic SRP Targets must perform one of the following +steps after upgrading to OFED 1.3.1 to continue accessing their storage +successfully: + +1. When issuing the "echo" command to add a new SRP Target, the host + must append the string ",initiator_ext=0000000000000001" to the original + echo string. 
+ Example: + 'ibsrpdm -c' output is as follows: + + id_ext=0000000000000001,ioc_guid=00066a0138000165,dgid=fe8000000000000 + 000066a0260000165,pkey=ffff,service_id=0000494353535250,io_class=ff00 + + id_ext=0000000000000001,ioc_guid=00066a0238000165,dgid=fe8000000000000 + 000066a0260000165,pkey=ffff,service_id=0000494353535250,io_class=ff00 + + To connect to the first target, the echo command must be: + + echo -n \ + id_ext=0000000000000001,ioc_guid=00066a0138000165,\ + dgid=fe8000000000000000066a0260000165,pkey=ffff,\ + service_id=0000494353535250,io_class=ff00,\ + initiator_ext=0000000000000001 > \ + /sys/class/infiniband_srp/srp-mthca0-1/add_target + + +2. Change the SRP map on the Qlogic SRP Target to set the expected initiator + extension to 0. For details on how to change the SRP map on a Qlogic SRP + Target, please refer to product documentation. + + diff --git a/release_notes/uDAPL_release_notes.txt b/release_notes/uDAPL_release_notes.txt new file mode 100644 index 0000000..60b83bb --- /dev/null +++ b/release_notes/uDAPL_release_notes.txt @@ -0,0 +1,1482 @@ + Release Notes for + OFED 1.5.1 DAPL Release + March 2010 + + This release of the uDAPL reference implementation package for both + DAT 1.2 and 2.0 specifications is timed to coincide with the OFED release + of the Open Fabrics (www.openfabrics.org) software stack. + + uDAPL v1 (1.2.16-1) and v2 (2.0.27-1) + + ---------------- + + * New Features (v2 only) - UCM provider with IB UD based CM per process. + More scalable than rdma_cm (cma) or socket cm (scm). + ---------------- + + * Provider descriptions and PROS/CONS (cma, scm, ucm) + + 1. CMA - uses OFA rdma_cm to setup QP's. IPoIB, ARP, and SA queries required. + + Provider name: ofa-v2-cma + PROs: OFA rdma_cm has the most testing across many applications. + Supports both iWARP and IB. + + CONs: Serialization of conn processing with kernel based CM service + Requires IPoIB ARP for name resolution, storms + Requires SA for path record queries for IB fabrics.
+ Conn Request private data limited to 52 bytes. + + Settings for larger clusters (512+ cores): + + setenv DAPL_CM_ROUTE_TIMEOUT_MS 20000 + setenv DAPL_CM_ARP_TIMEOUT_MS 10000 + + 2. SCM - uses sockets to exchange QP information. IPoIB, ARP, and SA queries NOT required. + + Provider name (connectx): ofa-v2-mlx4_0-1 + PROs: Each rank has own instance of socket cm. More private data with requests. + Doesn't require path-record lookup. + + CONs: Socket resources grow with scale-out, serialization of + connections with kernel based tcp sockets, + Competes for MPI socket resources/port space and other TCP applications. + Sockets remain in TIMEWAIT state for minutes after closure. + Requires ARP for name resolution. + Doesn't support iWARP devices. + + Settings for larger clusters (512+ cores): + + setenv DAPL_ACK_RETRY 7 /* IB RC Ack retry count */ + setenv DAPL_ACK_TIMER 20 /* IB RC Ack retry timer */ + + 3. UCM - uses IB UD QP to exchange QP info. Sockets, ARP, IPoIB, and SA queries NOT required. + + Provider name (connectx): ofa-v2-mlx4_0-1u + PROs: Each rank has own instance of CM in user process + Resources fixed per rank regardless of scale-out size + No serialization of user or kernel resources establishing connections, + Simple 3-way msg handshake, CM messages fit in inline data for lowest message latency, + Supports alternate paths + No address resolution required. + No path resolution required. + + CONs: New provider with limited testing, a little tougher to debug.
+ Doesn't support iWARP + + Settings for larger clusters (512+ cores): + + setenv DAPL_UCM_REP_TIME 800 /* REQUEST timer, waiting for REPLY in millisecs */ + setenv DAPL_UCM_RTU_TIME 400 /* REPLY timer, waiting for RTU in millisecs */ + setenv DAPL_UCM_RETRY 15 /* REQUEST and REPLY retries */ + setenv DAPL_ACK_RETRY 7 /* IB RC Ack retry count */ + setenv DAPL_ACK_TIMER 20 /* IB RC Ack retry timer */ + + ---------------- + + * CM Performance: CPS profile for cma, scm, and ucm v2 uDAPL providers: + + Intel SR1600 Urbanna Servers with Xeon(R) CPU X5570 @ 2.93GHz + Urbanna Platform - 2 node, 8 cores per node, Mellanox MLX4 IB QDR, no switch. + + dtestcm (server/client): + + cma: Connections: 183.21 usec, CPS 5458.31 Total 0.18 secs, poll_cnt=3403, Num=1000 + scm: Connections: 178.80 usec, CPS 5592.93 Total 0.18 secs, poll_cnt=2344, Num=1000 + ucm: Connections: 122.43 usec, CPS 8167.93 Total 0.12 secs, poll_cnt=2609, Num=1000 + + dapl_cm_bw: MPI uDAPL/CM profiling application (all-to-all connections, all ranks) + + CMA + 2 Connect times (10): Total 0.0020 per 0.0002 CPS=4997.98 + 4 Connect times (40): Total 0.0077 per 0.0002 CPS=5224.59 + 8 Connect times (240): Total 0.0276 per 0.0001 CPS=8710.76 + 16 Connect times (1120): Total 0.1194 per 0.0001 CPS=9379.37 + 32 Connect times (4800): Total 6.1949 per 0.0013 CPS=774.83 + + SCM + 2 Connect times (10): Total 0.0024 per 0.0002 CPS=4103.61 + 4 Connect times (40): Total 0.0060 per 0.0002 CPS=6622.41 + 8 Connect times (240): Total 0.0206 per 0.0001 CPS=11634.15 + 16 Connect times (1120): Total 9.0118 per 0.0080 CPS=124.28 + 32 Connect times (4800): Total 21.0198 per 0.0044 CPS=228.36 + + UCM + 2 Connect times (10): Total 0.0014 per 0.0001 CPS=7353.27 + 4 Connect times (40): Total 0.0045 per 0.0001 CPS=8816.19 + 8 Connect times (240): Total 0.0191 per 0.0001 CPS=12582.44 + 16 Connect times (1120): Total 0.0799 per 0.0001 CPS=14017.68 + 32 Connect times (4800): Total 0.3337 per 0.0001 CPS=14385.21 + + ---------------- + + * Bug 
Fixes + + V2.0 Package + + Release 2.0.27 + windows: add scm makefile + windows does not require rdma_cma_abi.h, move the include from common code + windows patch to fix IB_INVALID_HANDLE name collision + scm: dat_ep_connect fails on 32bit servers + undefined symbol: dapls_print_cm_list + cleanup CM object lock before freeing CM object memory + destroy verbs completion channels created via ia_open or ep_create. + package: update Copyright file and include the 3 license files in distribution + common: when copying private_data out of rdma_cm events, use the + cma: fix referencing freed address + dapl: move close device after async thread is done + + Release 2.0.26 + openib_common: add check for both gid and global routing in RTR + openib_common: remote memory read privilege set multi times + ucm, scm: DAPL_GLOBAL_ROUTING enabled causes segv + + Release 2.0.25 + winof scm: initialize opt for NODELAY setsockopt + winof cma: windows definition for EADDRNOTAVAIL missing + scm: client side setsockopt NODELAY fails if data arrives before setting + cma: setup_listener Cannot assign requested address + common: seg fault in dapl_evd_wait with multi-thread application using CNO's. + ucm: inbound DREQ/DREP handshake should transition QP. + winof: Remove duplicate include of comp_channel.cpp from cm.c as it is + included in opensm_ucb/device.c. + + Release 2.0.24 + winof: Utilize WinOF version of inet_ntop() for Windows OSes which do not + support inet_ntop(). + ucm: windows build issue with new CQ completion channel + winof: add ucm provider to windows build + winof: add missing build files for ibal, scm + scm: connection peer resets under heavy load, incorrect event on error + ucm: increase default reply and rtu timeout values. + ucm: change some debug message levels and add check for valid UD REPLY during retries. + ucm: increase timers during subsequent retries + ucm, scm: address handles need destroyed when freeing Endpoints with UD QP's. 
+ openib_common: ignore pd free errors, clear pd_handle and return. + ucm: using UD type QP's, ucm reports wrong reject event when user rejects AH resolution request. + ucm, scm, cma: Fix CNO support on DTO type EVD's + ucm: fix lock init bug in ucm_cm_find + ucm: fix build problem with latest windows ucm changes + ucm: The HCA should not be closed until all resources have been released. + ucm: Fix build warning when compiling on 32-bit systems. + ucm: Trying to deregister the same memory region twice leads to an + dat: reduce debug message level when parsing for location of dat.conf + ucm: update ucm provider for windows environment + ucm: add timer/retry CM logic to the ucm provider + + Release 2.0.23 + cma: cannot reuse the cm_id and qp for new connection, must reallocate a new one. + scm, cma: update DAPL cm protocol revision with latest address/port changes + ucm: modify IB address format to align better with sockaddr_in6 + Add definition for getpid similar to that used by the other dtest apps. + WinOF provides a common implementation of gettimeofday that should + The completion manager was updated to provide an abstraction that + dtestcm: remove IB verb definitions + dtest, dtestx: remove IB verb definitions + scm: tighten up socket options to insure similiar behavior on Windows and Linux. + cma: improve serialization of destroy and event processing + scm: improve serialization of destroy and state changes + common: no cleanup/release code for timer thread + scm, cma: dapli_thread doesn't always get teminated on library close. + ucm: tighten up locking with CM processing, state changes + ucm: For UD type QP's, return CR p_data with CONN_EST event on passive side. + ucm: cleanup extra cr/lf + ucm: fix issues with UD QP's. + winof: Convert windows version of dapl and dat libaries to use private heaps. + dtest, dtestx: modifications for UD QP testing with ucm provider. + scm, ucm: UD QP support was broken when porting to common openib code base. 
+ cma: cleanup warning with unused local variable, ret, in disconnect + cma: remove debug message after rdma_disconnect failure + scm: socket errno check needs O/S dependent wrapper + dapltest: update script files for WinOF + cma: conditional check for new rdma_cm definition. + + Release 2.0.22 + dapltest: add mdep processor yield and use with dapltest + ucm: Add new provider using a DAPL based IB-UD cm mechanism for MPI implementations. + + Release 2.0.21 + scm: Fix disconnect. QP's need to move to ERROR state in + modify dtest.c to cleanup CNO wait code and consolidate into + CNO events, once triggered will not be returned during the cno wait. + CNO support broken in both CMA and SCM providers. + common osd: include winsock2.h for IPv6 definitions. + common osd: include w2tcpip.h for sockaddr_in6 definitions. + DAPL introduced the concept of directly waiting on the CQ for + dapltest: Implement a malloc() threshold for the completion reaping. + scm: handle connected state when freeing CM objects + scm, dtest: changes for winof gettimeofday and FD_SETSIZE settings. + scm: set TCP_NODELAY sockopt on the server side for sends. + remove obsolete files in dapl/udapl source tree + dtestcm: add UD type QP option to test + scm: destroy QP called before disconnect + cma: add support for rdma_cm TIME_WAIT event. + scm: remove old udapl_scm code replaced by openib_scm. + winof: fix issues after consolidating cma, scm code base. + cma: lock held when exiting as a result of a rdma_create_event_channel failure. + windows: all dlist functions have been moved to the header file. + dtestcm windows: add build infrastructure for new dtestcm test suite + openib_common: reorganize code base to share common mem, cq, qp, dto functions + scm: fixes and optimizations for connection scaling + scm: double the default fd_set_size + scm: EP reference in CR should be cleared during ep_destroy + dtestx: fix conn establishment event checking + dtestcm: new test to measure dapl connection rates. 
+ + Release 2.0.20 + common,scm: add debug capabilities to print in-process CM lists + scm: disconnect EP before cleaning up orphaned CR's during dat_ep_free + dapltest: windows scripts updated + scm: private data is not handled properly via CR rejects. + scm: cleanup orphaned UD CR's when destroying the EP + scm: provider specific query for default UD MTU is wrong. + scm: update CM code to shutdown before closing socket + dapltest: windows script dt-cli.bat updated + dapl/windows cma provider: add support for network devices based on index + openib: remove 1st gen provider, replaced with openib_cma and openib_scm + dapltest: update windows script files + dapltest: windows batch files in sripts directory + windows_osd/linux_osd: new dapl_os_gettid macro to return thread id + windows: missing build files for common and udapl sub-directories + windows: add build files for openib_scm, remove /Wp64 build option. + scm: multi-hca CM processing broken. Need cr thread wakeup mechanism per HCA. 
+ dtest: add connection timers on client side + linux_osd: use pthread_self instead of getpid for debug messages + windows ibal-scm: dapl/dirs file needs updated to remove ibal-scm + + v1.2 Package: + + Release 1.2.16 + package: update Copyright file and include the 3 license files in distribution + cma: max sge incorrectly decremented during ibv_device_query + + Release 1.2.15 + dtest, dapltest: conflict with dapl-2 utils package, change to dapl1, dapltest1 + scm: fix compiler warning, unused variable + + ---------------- + + * Build Notes: + + # NON_DEBUG build/install example for x86_64, OFED targets + ./configure --prefix /usr --sysconf=/etc --libdir /usr/lib64 LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include" + make install + + # DEBUG build/install example for x86_64, using OFED targets + ./configure --enable-debug --prefix /usr --sysconf=/etc --libdir /usr/lib64 LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include" + make install + + # COUNTERS build/install example for x86_64, using OFED targets + ./configure --prefix /usr --sysconf=/etc --libdir /usr/lib64 LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include -DDAPL_COUNTERS" + make install + + ---------------- + + * BKM for running new DAPL library on your cluster without any impact on existing OFED installation: + + Note: example for user /home/ardavis, (assumes /home/ardavis is exported) and MLX4 adapter, port 1 + + Download latest 2.x package: http://www.openfabrics.org/downloads/dapl/dapl-2.0.25.tar.gz + + untar in /home/ardavis + cd /home/ardavis/dapl-2.0.25 + ./configure && make (build on node with OFED 1.3 or higher installed, dependency on verb/rdma_cm libraries) + + create /home/ardavis/dat.conf with following 3 lines. 
(entries with path to new libraries): + + ofa-v2-ib0 u2.0 nonthreadsafe default /home/ardavis/dapl-2.0.25/dapl/udapl/.libs/libdaplcma.so.1 dapl.2.0 "ib0 0" "" + ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default /home/ardavis/dapl-2.0.25/dapl/udapl/.libs/libdaploscm.so.2 dapl.2.0 "mlx4_0 1" "" + ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default /home/ardavis/dapl-2.0.25/dapl/udapl/.libs/libdaploucm.so.2 dapl.2.0 "mlx4_0 1" "" + + Run uDAPL application or an MPI that uses uDAPL, with the following (assuming MLX4 connectx adapters): + + setenv DAT_OVERRIDE /home/ardavis/dat.conf + + If running Intel MPI and uDAPL socket cm, set the following: + + setenv I_MPI_DEVICE rdssm:ofa-v2-mlx4_0-1 + + or if running Intel MPI and uDAPL IB UD cm, set the following: + + setenv I_MPI_DEVICE rdssm:ofa-v2-mlx4_0-1u + + or if running Intel MPI and uDAPL rdma_cm, set the following: + + setenv I_MPI_DEVICE rdssm:ofa-v2-ib0 + +------------------------- + + OFED 1.4.1 RELEASE NOTES + + NEW SINCE OFED 1.4 - new versions of uDAPL v1 (1.2.14-1) and v2 (2.0.19-1) + + * New Features - optional counters, must be configured/built with -DDAPL_COUNTERS + + * Bug Fixes + + v2 - scm, cma: dat max_lmr_block_size is 32 bit, verbs max_mr_size is 64 bit + v2 - scm, cma: use direct SGE mappings from dat_lmr_triplet to ibv_sge + v2 - dtest: add flush EVD call after data transfer errors + v2 - scm: increase default MTU size from 1024 to 2048 + v2 - dapltest: reset server listen ports to avoid collisions during long runs + v2 - dapltest: avoid duplicating ports, increment based on ep/thread count + v2 - dapltest: fix assumptions that multiple EP's will connect in order + v2 - common: sync missing with when removing items off of EVD pending queue + v2 - scm: reduce open time with thread start up + v2 - scm: getsockopt optlen needs initialized to size of optval + v2 - scm: cr_thread cleanup + v2 - OFED and WinOF code sync + v2 - scm: remove unnecessary query gid/lid from connection phase code.
+ v2 - scm: add optional 64-bit counters, build with -DDAPL_COUNTERS. + v1,v2 - spec files missing Requires(post) statements for sed/coreutils + v1,v2 - dtest/dapltest: use $(top_builddir) for .la files during test builds + v1,v2 - scm: remove unecessary thread when using direct objects + v1,v2 - Fix SuSE 11 build issues, asm/atomic.h no longer exists + + * Build Notes: + + # NON_DEBUG build/install example for x86_64, OFED targets + ./configure --prefix /usr --sysconf=/etc --libdir /usr/lib64 LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include" + make install + + # DEBUG build/install example for x86_64, using OFED targets + ./configure --enable-debug --prefix /usr --sysconf=/etc --libdir /usr/lib64 LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include" + make install + + # COUNTERS build/install example for x86_64, using OFED targets + ./configure --prefix /usr --sysconf=/etc --libdir /usr/lib64 LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include -DDAPL_COUNTERS" + make install + + * BKM for running new DAPL library on your cluster without any impact on existing OFED installation: + + Note: example for user /home/ardavis, (assumes /home/ardavis is exported) and MLX4 adapter, port 1 + + Download latest 2.x package: http://www.openfabrics.org/downloads/dapl/dapl-2.0.19.tar.gz + + untar in /home/ardavis + cd /home/ardavis/dapl-2.0.19 + ./configure && make (build on node with OFED 1.3 or higher installed, dependency on verb/rdma_cm libraries) + + create /home/ardavis/dat.conf with following 2 lines. 
(entries with path to new libraries): + + ofa-v2-ib0 u2.0 nonthreadsafe default /home/ardavis/dapl-2.0.19/dapl/udapl/.libs/libdaplcma.so.1 dapl.2.0 "ib0 0" "" + ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default /home/ardavis/dapl-2.0.19/dapl/udapl/.libs/libdaploscm.so.2 dapl.2.0 "mlx4_0 1" "" + + Run uDAPL application or an MPI that uses uDAPL, with the following (assuming MLX4 connectx adapters): + + setenv DAT_OVERRIDE /home/ardavis/dat.conf + + If running Intel MPI and uDAPL socket cm, set the following: + + setenv I_MPI_DEVICE rdssm:ofa-v2-mlx4_0-1 + + or if running Intel MPI and uDAPL rdma_cm, set the following: + + setenv I_MPI_DEVICE rdssm:ofa-v2-ib0 + +------------------------- + + OFED 1.4 RELEASE NOTES + + NEW SINCE OFED 1.3.1 - new versions of uDAPL v1 (1.2.12-1) and v2 (2.0.15-1) + + * New Features + + 1. The new socket CM provider, introduced in 1.2.8 and 2.0.11 packages, + assumes a homogeneous cluster and will set up the QP's based on local HCA port + attributes and exchanges QP information via sockets using the hostname of + each node. IPoIB and rdma_cm are NOT required for this provider. QP attributes + can be adjusted via the following environment parameters: + + DAPL_ACK_TIMER (default=16 5 bits, 4.096us*2^ack_timer. 16 == 268ms) + DAPL_ACK_RETRY (default=7 3 bits, 7 * 268ms = 1.8 seconds) + DAPL_RNR_TIMER (default=12 5 bits, 12 == 0.64ms, 28 == 163ms, 31 == 491ms) + DAPL_RNR_RETRY (default=7 3 bits, 7 == infinite) + DAPL_IB_MTU (default=1024 limited to active MTU max) + + The new socket cm entries in /etc/dat.conf provide a link to the actual HCA + device and port.
Example v1 and v2 entries for a Mellanox connectx device, port 1: + + OpenIB-mlx4_0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 1" "" + ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" "" + + This new socket cm provider, was successfully tested on the TATA CRL cluster + (#8 on Top500) with Intel MPI, achieving a HPLinpack score of 132.8TFlops on + 1798 nodes, 14384 cores at ~76.9% of peak. DAPL_ACK_TIMER was increased to 21 + for this scale. + + 2. New v2 definitions for IB unreliable datagram extension (only supported in + scm provider, libdaploscm.so.2) + + Extended EP dat_service_type, with DAT_IB_SERVICE_TYPE_UD + Add IB extension call dat_ib_post_send_ud(). + Add address handle definition for UD calls. + Add IB event definitions to provide remote AH via connect and connect requests + See dtestx (-d) source for example usage model + + * Bug Fixes + + v1,v2 - dapltest: trans test moves to cleanup stage before rdma_read processing is complete + v1,v2 - Fix static registration (dat.conf) to include sysconfdir override + v1,v2 - dat.conf: add default iwarp entry for eth2 + v1,v2 - dapl: adjust max_rdma_read_iov to 1 for iWARP devices + v1,v2 - dtest: reduce default IOV's for ep_create to support iWARP + v1,v2 - dtest: fix 32-bit build issues + v1,v2 - build: $(DESTDIR) prepend needed on install hooks for dat.conf + v2 - scm: UD shares EP;s which requires serialization + v2 - dapl: fixes for IB UD extensions in common code and socket cm provider. + v2 - dapl: add provider specific attribute query option for IB UD MTU size + v2 - dapl build: add correct CFLAGS, set non-debug build by default for v2 + v2 - dtestx: fix stack corruption problem with hostname strcpy + v2 - dapl extension: dapli_post_ext should always allocate cookie for requests. 
+ v2 - dapltest: manpage - rdma write example incorrect + v1,v2 - dat, dapl, dtest, dapltest, providers: fix compiler warnings in dat common code + v1,v2 - dapl cma: debug message during query needs definition for inet_ntoa + v1,v2 - dapl scm: fix corner case that delivers duplicate disconnect events + v1,v2 - dat: include stddef.h for NULL definition in dat_platform_specific.h + v1,v2 - dapl: add debug messages during async and overflow events + v1,v2 - dapltest: add check for duplicate disconnect events in transaction test + v1,v2 - dapl scm: use correct device attribute for max_rdma_read_out, max_qp_init_rd_atom + v1,v2 - dapl scm: change IB RC qp inline and timer defaults. + v1,v2 - dapl scm: add mtu adjustments via environment, default = 1024. + v1,v2 - dapl scm: change connect and accept to non-blocking to avoid blocking user thread. + v1,v2 - dapl scm: update max_rdma_read_iov, max_rdma_write_iov EP attributes during query + v1,v2 - dat: allow TYPE_ERR messages to be turned off with DAT_DBG_TYPE + v1,v2 - dapl: remove needless terminating 0 in dto_op_str functions. + v1,v2 - dat: remove reference to doc/dat.conf in makefile.am + v1,v2 - dapl scm: fix ibv_destroy_cq busy error condition during dat_evd_free. + v1,v2 - dapl scm: add stdout logging for uname and gethostbyname errors during open. + v1,v2 - dapl scm: support global routing and set mtu based on active_mtu + v1,v2 - dapl: add opcode to string function to report opcode during failures. 
+ v1,v2 - dapl: remove unused iov buffer allocation on the endpoint + v1,v2 - dapl: endpoint pending request count is wrong + +------------------------- + + OFED 1.3.1 RELEASE NOTES + + NEW SINCE OFED 1.3 - new versions of uDAPL v1 (1.2.7-1) and v2 (2.0.9-1) + + * New Features - None + + * Bug Fixes + v2 - add private data exchange with reject + v1,v2 - better error reporting in non-debug builds + v1,v2 - update only OFA entries in dat.conf, cooperate with non-ofa providers + v1,v2 - support for zero byte operations, iov==NULL + v1,v2 - multi-transport support for inline data and private data differences + v1,v2 - fix memory leaks and other reported bugs since OFED 1.3 + v1,v2 - dtest,dtestx,dapltest build issues on RHEL5.1 + v1,v2 - long delay during dat_ia_open when DNS not configured + v1,v2 - use rdma_read_in/out from ep_attr per consumer instead of HCA max + +------------------------- + + OFED 1.3 RELEASE NOTES + + NEW SINCE OFED 1.2 + + * New Features + + 1. Add v2.0 library support for new 2.0 API Specification + 2. Separate v1.2 library release to co-exist with v2.0 libraries. + 3. New dat.conf with both 1.2 and 2.0 support + 4. 
New v2.0 dtestx utilities to test IB extensions + + * Bug Fixes + + v1.2 and v2.0 + - uDAT: static/dynamic registry parsing fixes + - uDAPL: provider fixes for dat_psp_create_any + - dtest/dapltest: change default provider names to sync with dat.conf + - openib_cma: issues with destroy_cm_id and init/resp exchange + - dapltest: use gettimeofday instead of get_cycles for better portability + - dapltest: endian issue with mem_handle, mem_address + - dapltest fix to include inet_ntoa definitions + - fix build problems on 32-bit and 64-bit PowerPC + - cleanup packaging + + v2.0 + - set default config options to match spec file, --enable-debug --enable-ext-type=ib + - use unique devel target names, libdat2.so, /usr/include/dat2 + - dtestx fix memory leak, freeaddrinfo after getaddrinfo + - Fix for IB extended DTO cookie deallocation on inbound rdma_Write_immed + - WinOF: Update OFED code base to include WinOF changes, work from same code base + - WinOF: add DAT_API definition, __stdcall for windows, nothing for linux + - dtest: add dat_evd_query to check correct size + - openib_cma: add macro to convert SID to PORT + - dtest: endian support for exchanging RMR info + - openib_cma: lower default settings, inline and RDMA init/resp + - openib_cma: missing ia_query for max_iov_segments_per_rdma_write + + v1.2 + - openib_cma: turn down dbg noise level on rejects + - dtest: typo in memset + + + BUILD: v1 and v2 uDAPL source install/build instructions (redhat example): + + # cd to distribution SRPMS directory + cd /tmp/OFED-1.3/SRPMS + rpm -i dapl-1.2*.rpm + rpm -i dapl-2.0*.rpm + cd /usr/src/redhat/SOURCES + tar zxf dapl-1.2*.tgz + tar zxf dapl-2.0*.tgz + + # NON_DEBUG build example for x86_64, using OFED targets + + ./configure --prefix /usr --sysconf=/etc --libdir /usr/lib64 + LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include" + + # build and install + + make + make install + + # DEBUG build example for x86_64, using OFED targets + + ./configure --enable-debug --prefix /usr 
--sysconf=/etc --libdir /usr/lib64 + LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include" + + # build and install + + make + make install + + # DEBUG messages: set environment variable DAPL_DBG_TYPE, default + mapping is 0x0003 + + DAPL_DBG_TYPE_ERR = 0x0001, + DAPL_DBG_TYPE_WARN = 0x0002, + DAPL_DBG_TYPE_EVD = 0x0004, + DAPL_DBG_TYPE_CM = 0x0008, + DAPL_DBG_TYPE_EP = 0x0010, + DAPL_DBG_TYPE_UTIL = 0x0020, + DAPL_DBG_TYPE_CALLBACK = 0x0040, + DAPL_DBG_TYPE_DTO_COMP_ERR= 0x0080, + DAPL_DBG_TYPE_API = 0x0100, + DAPL_DBG_TYPE_RTN = 0x0200, + DAPL_DBG_TYPE_EXCEPTION = 0x0400, + DAPL_DBG_TYPE_SRQ = 0x0800, + DAPL_DBG_TYPE_CNTR = 0x1000 + +------------------------- + + OFED 1.2 RELEASE NOTES + + NEW SINCE Gamma 3.2 and OFED 1.1 + + * New Features + + 1. Added dtest and dapltest to the openfabrics build and utils rpm. + Includes manpages. + 2. Added following enviroment variables to configure connection management + timers (default settings) for larger clusters: + + DAPL_CM_ARP_TIMEOUT_MS 4000 + DAPL_CM_ARP_RETRY_COUNT 15 + DAPL_CM_ROUTE_TIMEOUT_MS 4000 + DAPL_CM_ROUTE_RETRY_COUNT 15 + + * Bug Fixes + + + Added support for new ib verbs client register event. No extra + processing required at the uDAPL level. + + Fix some issues supporting create qp without recv cq handle or + recv qp resources. IB verbs assume a recv_cq handle and uDAPL + dapl_ep_create assumes there is always recv_sge resources specified. + + Fix some timeout and long disconnect delay issues discovered during + scale-out testing. Added support to retry rdma_cm address and route + resolution with configuration options. Provide a disconnect call + when receiving the disconnect request to guarantee a disconnect reply + and event on the remote side. The rdma_disconnect was not being called + from dat_ep_disconnect() as a result of the state changing + to DISCONNECTED in the event callback. 
+ + Changes to support exchanging and validation of the device + responder_resources and the initiator_depth during conn establishment + + Fix some build issues with dapltest on 32 bit arch, and on ia64 SUSE arch + + Add support for multiple IB devices to dat.conf to support IPoIB HA failover + + Fix atomic operation build problem with ia64 and RHEL5. + + Add support to return local and remote port information with dat_ep_query + + Cleanup RPM specfile for the dapl package, move to 1.2-1 release. + + NEW SINCE Gamma 3.1 and OFED 1.0 + + * BUG FIXES + + + Update obsolete CLK_TCK to CLOCKS_PER_SEC + + Fill out some unitialized fields in the ia_attr structure returned by + dat_ia_query(). + + Update dtest to support multiple segments on rdma write and change + makefile to use OpenIB-cma by default. + + Add support for dat_evd_set_unwaitable on a DTO evd in openib_cma + provider + + Added errno reporting (message and return codes) during open to help + diagnose create thread issues. + + Fix some suspicious inline assembly EIEIO_ON_SMP and ISYNC_ON_SMP + + Fix IA64 build problems + + Lower the reject debug message level so we don't see warnings when + consumers reject. + + Added support for active side TIMED_OUT event from a provider. + + Fix bug in dapls_ib_get_dat_event() call after adding new unreachable + event. + + Update for new rdma_create_id() function signature. + + Set max rdma read per EP attributes + + Report the proper error and timeout events. + + Socket CM fix to guard against using a loopback address as the local + device address. + + Use the uCM set_option feature to adjust connect request timeout + retry values. + + Fix to disallow any event after a disconnect event. 
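The connection management timers added in this release (item 2 under New Features) are set through the environment before a uDAPL application starts. The following is a minimal sketch for a Bourne-style shell, not a tuning recommendation: the 10000 ms and 20000 ms values are the large-cluster cma settings quoted in the OFED 1.5.1 notes, and the retry counts stay at the documented default of 15. Whether a particular installed uDAPL version honors all four variables is an assumption to verify on your system.

```shell
#!/bin/sh
# Sketch only: variable names and values are taken from these release notes.
# Documented defaults are 4000 ms for both timeouts and 15 for both retry counts.

# Larger-cluster timeouts quoted for the cma provider (512+ cores).
export DAPL_CM_ARP_TIMEOUT_MS=10000
export DAPL_CM_ROUTE_TIMEOUT_MS=20000

# Retry counts left at the documented default.
export DAPL_CM_ARP_RETRY_COUNT=15
export DAPL_CM_ROUTE_RETRY_COUNT=15

# Show what a uDAPL application launched from this shell would inherit.
env | grep '^DAPL_CM_' | sort
```

Under csh or tcsh, the shell used elsewhere in these notes, the equivalent form is "setenv NAME value" with no equals sign.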
+ + * OFED 1.1 uDAPL source build instructions: + + cd /usr/local/ofed/src/openib-1.1/src/userspace/dapl + + # NON_DEBUG build configuration + + ./configure --disable-libcheck --prefix /usr/local/ofed + --libdir /usr/local/ofed/lib64 LDFLAGS=-L/usr/local/ofed/lib64 + CPPFLAGS="-I../libibverbs/include -I../librdmacm/include" + + # build and install + + make + make install + + # DEBUG build configuration + + ./configure --disable-libcheck --enable-debug --prefix /usr/local/ofed + --libdir /usr/local/ofed/lib64 LDFLAGS=-L/usr/local/ofed/lib64 + CPPFLAGS="-I../libibverbs/include -I../librdmacm/include" + + # build and install + + make + make install + + # DEBUG messages: set environment variable DAPL_DBG_TYPE, default + mapping is 0x0003 + + DAPL_DBG_TYPE_ERR = 0x0001, + DAPL_DBG_TYPE_WARN = 0x0002, + DAPL_DBG_TYPE_EVD = 0x0004, + DAPL_DBG_TYPE_CM = 0x0008, + DAPL_DBG_TYPE_EP = 0x0010, + DAPL_DBG_TYPE_UTIL = 0x0020, + DAPL_DBG_TYPE_CALLBACK = 0x0040, + DAPL_DBG_TYPE_DTO_COMP_ERR= 0x0080, + DAPL_DBG_TYPE_API = 0x0100, + DAPL_DBG_TYPE_RTN = 0x0200, + DAPL_DBG_TYPE_EXCEPTION = 0x0400, + DAPL_DBG_TYPE_SRQ = 0x0800, + DAPL_DBG_TYPE_CNTR = 0x1000 + + + Note: The udapl provider library libdaplscm.so is untested and + unsupported, thus customers should not use it. + It will be removed in the next OFED release. + + DAPL GAMMA 3.1 RELEASE NOTES + + This release of the DAPL reference implementation + is timed to coincide with the first release of the + Open Fabrics (www.openfabrics.org) software stack. + This release adds support for this new stack, which + is now the native Linux RDMA stack. + + This release also adds a new licensing option. In + addition to the Common Public License and BSD License, + the code can now be licensed under the terms of the GNU + General Public License (GPL) version 2. 
+ + NEW SINCE Gamma 3.0 + + - GPL v2 added as a licensing option + - OpenFabrics (aka OpenIB) gen2 verbs support + - dapltest support for Solaris 10 + + * BUG FIXES + + + Fixed a disconnect event processing race + + Fix to destroy all QPs on IA close + + Removed compiler warnings + + Removed unused variables + + And many more... + + DAPL GAMMA 3.0 RELEASE NOTES + + This is the first release based on version 1.2 of the spec. There + are some components, such as shared receive queues (SRQs), which + are not implemented yet. + + Once again there were numerous bug fixes submitted by the + DAPL community. + + NEW SINCE Beta 2.06 + + - DAT 1.2 headers + - DAT_IA_HANDLEs implemented as small integers + - Changed default device name to be "ia0a" + - Initial support for Linux 2.6.X kernels + - Updates to the OpenIB gen 1 provider + + * BUG FIXES + + + Updated Makefile for differentiation between OS releases. + + Updated atomic routines to use appropriate API + + Removed unnecessary assert from atomic_dec. + + Fixed bugs when freeing a PSP. + + Fixed error codes returned by the DAT static registry. + + Kernel updates for dat_strerror. + + Cleaned up the transport layer/adapter interface to use DAPL + types rather than transport types. + + Fixed ring buffer reallocation. + + Removed old test/udapl/dapltest directory. + + Fixed DAT_IA_HANDLE translation (from pointer to int and + vice versa) on 64-bit platforms. + + DAPL BETA 2.06 RELEASE NOTES + + We are not planning any further releases of the Beta series, + which are based on the 1.1 version of the spec. There may be + further releases for bug fixes, but we anticipate the DAPL + community to move to the new 1.2 version of the spec and the + changes mandated in the reference implementation. + + The biggest item in this release is the first inclusion of the + OpenIB Gen 1 provider, an item generating a lot of interest in + the IB community. This implementation has graciously been + provided by the Mellanox team.
The kdapl implementation is in + progress, and we imagine work will soon begin on Gen 2. + + There are also a handful of bug fixes available, as well as a long + awaited update to the endpoint design document. + + NEW SINCE Beta 2.05 + + - OpenIB gen 1 provider support has been added + - Added dapls_evd_post_generic_event(), routine to post generic + event types as requested by some providers. Also cleaned up + error reporting. + - Updated the endpoint design document in the doc/ directory. + + * BUG FIXES + + + Cleaned up memory leak on close by freeing the HCA structure. + + Removed bogus #defs for rdtsc calls on IA64. + + Changed dapltest thread types to use internal types for + portability & correctness + + Various 64 bit enhancements & updates + + Fixes to conformance test that were defining CONN_QUAL twice + and using it in different ways + + Cleaned up private data handling in ep_connect & provider + support: we now avoid extra copy in connect code; reduced + stack requirements by using private_data structure in the EP; + removed provider variable. + + Fixed problem in the dat conformance test where cno_wait would + attempt to dereference a timer value and SEGV. + + Removed old vestiges of deprecated POLLING_COMPLETIONS + conditionals. + + DAPL BETA 2.05 RELEASE NOTES + + This was to be a very minor release; the primary change was + going to be the new wording of the DAT license as contained in + the header for all source files. But the interest and + development occurring in DAPL provided some extra bug fixes, and + some new functionality that has been requested for a while. + + First, you may notice that every single source file was + changed. If you read the release notes from DAPL BETA 2.04, you + were warned this would happen. There was a legal issue with the + wording in the header; the end result was that every source file + was required to change the word 'either of' to 'both'.
We've + been putting this change off as long as possible, but we wanted + to do it in a clean drop before we start working on DAT 1.2 + changes in the reference implementation, just to keep things + reasonably sane. + + kdapltest has enabled three of the subtests supported by + dapltest. The Performance test in particular has been very + useful to dapltest in getting minima and maxima. The Limit test + pushes the limits by allocating the maximum number of specific + resources. And the FFT tests are also available. + + Most vendors have supported shared memory regions for a while, + several of which have asked the reference implementation team to + provide a common implementation. Shared memory registration has + been tested on ibapi, and compiled into vapi. Both InfiniBand + providers have the restriction that a memory region must be + created before it can be shared; not all RDMA APIs are this way, + several allow you to declare a memory region shared when it is + registered. Hence, details of the implementation are hidden in + the provider layer, rather than forcing other APIs to do + something strange. + + This release also contains some changes that will allow dapl to + work on Opteron processors, as well as some preliminary support + for Power PC architecture. These features are not well tested + and may be incomplete at this time. + + Finally, we have been asked several times over the course of the + project for a canonical interface between the common and + provider layers. This release includes a dummy provider to meet + that need. Anyone should be able to download the release and do + a: + make VERBS=DUMMY + + And have a cleanly compiled dapl library. This will be useful + both to those porting new transport providers, as well as those + going to new machines. + + The DUMMY provider has been compiled on both Linux and Windows + machines. + + + NEW SINCE Beta 2.4 + - kdapltest enhancements: + * Limit subtests now work + * Performance subtests now work. 
+ * FFT tests now work. + + - The VAPI headers have been refreshed by Mellanox + + - Initial Opteron and PPC support. + + - Atomic data types now have consistent treatment, allowing us to + use native data types other than integers. The Linux kdapl + uses atomic_t, allowing dapl to use the kernel macros and + eliminate the assembly code in dapl_osd.h + + - The license language was updated per the direction of the + DAT Collaborative. This two word change affected the header + of every file in the tree. + + - SHARED memory regions are now supported. + + - Initial support for the TOPSPIN provider. + + - Added a dummy provider, essentially the NULL provider. Its + purpose is to aid in porting and to clarify exactly what is + expected in a provider implementation. + + - Removed memory allocation from the DTO path for VAPI + + - cq_resize will now allow the CQ to be resized smaller. Not all + providers support this, but it's a provider problem, not a + limitation of the common code. + + * BUG FIXES + + + Removed spurious lock in dapl_evd_connection_callb.c that + would have caused a deadlock. + + The Async EVD was getting torn down too early, potentially + causing lost errors. It has been moved later in the teardown + process. + + kDAPL replaced mem_map_reserve() with newer SetPageReserved() + for better Linux integration. + + kdapltest no longer allocates large print buffers on the stack and + is more careful to ensure buffers don't overflow. + + Put dapl_os_dbg_print() under DAPL_DBG conditional, it is + supposed to go away in a production build. + + dapltest protocol version has been bumped to reflect the + change in the Service ID. + + Corrected several instances of routines that did not adhere + to the DAT 1.1 error code scheme. + + Cleaned up vapi ib_reject_connection to pass DAT types rather + than provider specific types. Also cleaned up naming interface + declarations and their use in vapi_cm.c; fixed incorrect + #ifdef for naming.
+ + Initialize missing uDAPL provider attr, pz_support. + + Changes for better layering: first, moved + dapl_lmr_convert_privileges to the provider layer as memory + permissions are clearly transport specific and are not always + defined in an integer bitfield; removed common routines for + lmr and rmr. Second, move init and release setup/teardown + routines into adapter_util.h, which defined the provider + interface. + + Cleaned up the HCA name cruft that allowed different types + of names such as strings or ints to be dealt with in common + code; but all names are presented by the dat_registry as + strings, so pushed conversions down to the provider + level. Greatly simplifies names. + + Changed deprecated true/false to DAT_TRUE/DAT_FALSE. + + Removed old IB_HCA_NAME type in favor of char *. + + Fixed race condition in kdapltest's use of dat_evd_dequeue. + + Changed cast for SERVER_PORT_NUMBER to DAT_CONN_QUAL as it + should be. + + Small code reorg to put the CNO into the EVD when it is + allocated, which simplifies things. + + Removed gratuitous ib_hca_port_t and ib_send_op_type_t types, + replaced with standard int. + + Pass a pointer to cqe debug routine, not a structure. Some + clean up of data types. + + kdapl threads now invoke reparent_to_init() on exit to allow + threads to get cleaned up. + + + + DAPL BETA 2.04 RELEASE NOTES + + The big changes for this release involve a more strict adherence + to the original dapl architecture. Originally, only InfiniBand + providers were available, so allowing various data types and + event codes to show through into common code wasn't a big deal. + + But today, there are an increasing number of providers available + on a number of transports. Requiring an IP iWarp provider to + match up to InfiniBand events is silly, for example. + + Restructuring the code allows more flexibility in providing an + implementation. 
+ + There are also a large number of bug fixes available in this + release, particularly in kdapl related code. + + Be warned that the next release will change every file in the + tree as we move to the newly approved DAT license. This is a + small change, but all files are affected. + + Future releases will also support the soon to be ratified DAT + 1.2 specification. + + This release has benefited from many bug reports and fixes from + a number of individuals and companies. On behalf of the DAPL + community, thank you! + + + NEW SINCE Beta 2.3 + + - Made several changes to be more rigorous on the layering + design of dapl. The intent is to make it easier for non + InfiniBand transports to use dapl. These changes include: + + * Revamped the ib_hca_open/close code to use an hca_ptr + rather than an ib_handle, giving the transport layer more + flexibility in assigning transport handles and resources. + + * Removed the CQD calls, they are specific to the IBM API; + folded this functionality into the provider open/close calls. + + * Moved VAPI, IBAPI transport specific items into a transport + structure placed inside of the HCA structure. Also updated + routines using these fields to use the new location. Cleaned + up provider knobs that have been exposed for too long. + + * Changed a number of provider routines to use DAPL structure + pointers rather than exposing provider handles & values. Moved + provider specific items out of common code, including provider + data types (e.g. ib_uint32_t). + + * Pushed provider completion codes and type back into the + provider layer. We no longer use EVD or CM completion types at + the common layer, instead we obtain the appropriate DAT type + from the provider and process only DAT types. + + * Change private_data handling such that we can now accommodate + variable length private data. + + - Remove DAT 1.0 cruft from the DAT header files. + + - Better spec compliance in headers and various routines.
+ + - Major updates to the VAPI implementation from + Mellanox. Includes initial kdapl implementation + + - Move kdapl platform specific support for hash routines into + OSD file. + + - Cleanups to make the code more readable, including comments + and certain variable and structure names. + + - Fixed CM_BUSTED code so that it works again: very useful for + new dapl ports where infrastructure is lacking. Also made + some fixes for IBHOSTS_NAMING conditional code. + + - Added DAPL_MERGE_CM_DTO as a compile time switch to support + EVD stream merging of CM and DTO events. Default is off. + + - 'Quit' test ported to kdapltest + + - uDAPL now builds on Linux 2.6 platform (SuSE 9.1). + + - kDAPL now builds for a larger range of Linux kernels, but + still lacks 2.6 support. + + - Added shared memory ID to LMR structure. Shared memory is + still not fully supported in the reference implementation, but + the common code will appear soon. + + * Bug fixes + - Various Makefiles fixed to use the correct dat registry + library in its new location (as of Beta 2.03) + - Simple reorg of dat headers files to be consistent with + the spec. + - fixed bug in vapi_dto.h recv macro where we could have an + uninitialized pointer. + - Simple fix in dat_dr.c to initialize a variable early in the + routine before errors occur. + - Removed private data pointers from a CONNECTED event, as + there should be no private data here. + - dat_strerror no longer returns an uninitialized pointer if + the error code is not recognized. + - dat_dup_connect() will reject 0 timeout values, per the + spec. + - Removed unused internal_hca_names parameter from + ib_enum_hcas() interface. + - Use a temporary DAT_EVENT for kdapl up-calls rather than + making assumptions about the current event queue. + - Relocated some platform dependent code to an OSD file. + - Eliminated several #ifdefs in .c files. + - Inserted a missing unlock() on an error path. 
+ - Added bounds checking on size of private data to make sure + we don't overrun the buffer + - Fixed a kdapltest problem that caused a machine to panic if + the user hit ^C + - kdapltest now uses spin locks more appropriate for their + context, e.g. spin_lock_bh or spin_lock_irq. Under a + conditional. + - Fixed kdapltest loops that drain EVDs so they don't go into + endless loops. + - Fixed bug in dapl_llist_add_entry link list code. + - Better error reporting from provider code. + - Handle case of user trying to reap DTO completions on an + EP that has been freed. + - No longer hold lock when ep_free() calls into provider layer + - Fixed cr_accept() to not have an extra copy of + private_data. + - Verify private_data pointers before using them, avoid + panic. + - Fixed memory leak in kdapltest where print buffers were not + getting reclaimed. + + + + DAPL BETA 2.03 RELEASE NOTES + + There are some prominent features in this release: + 1) dapltest/kdapltest. The dapltest test program has been + rearchitected such that a kernel version is now available + to test with kdapl. The most obvious change is a new + directory structure that more closely matches other core + dapl software. But there are a large number of changes + throughout the source files to accommodate both the + differences in udapl/kdapl interfaces, but also more mundane + things such as printing. + + The new dapltest is in the tree at ./test/dapltest, while the + old remains at ./test/udapl/dapltest. For this release, we + have maintained both versions. In a future release, perhaps + the next release, the old dapltest directory will be + removed. Ongoing development will only occur in the new tree. + + 2) DAT 1.1 compliance. The DAT Collaborative has been busy + finalizing the 1.1 revision of the spec. The header files + have been reviewed and posted on the DAT Collaborative web + site, they are now in full compliance. + + The reference implementation has been at a 1.1 level for a + while. 
The current implementation has some features that will + be part of the 1.2 DAT specification, but only in places + where full compatibility can be maintained. + + 3) The DAT Registry has undergone some positive changes for + robustness and support of more platforms. It now has the + ability to support several identical provider names + simultaneously, which enables the same dat.conf file to + support multiple platforms. The registry will open each + library and return when successful. For example, a dat.conf + file may contain multiple provider names for ex0a, each + pointing to a different library that may represent different + platforms or vendors. This simplifies distribution into + different environments by enabling the use of common + dat.conf files. + + In addition, there are a large number of bug fixes throughout + the code. Bug reports and fixes have come from a number of + companies. + + Also note that the Release notes are cleaned up, no longer + containing the complete text of previous releases. + + * EVDs no longer support DTO and CONNECTION event types on the + same EVD. NOTE: The problem is maintaining the event ordering + between two channels such that no DTO completes before a + connection is received; and no DTO completes after a + disconnect is received. For 90% of the cases this can be made + to work, but the remaining 10% will cause serious performance + degradation to get right. + + NEW SINCE Beta 2.2 + + * DAT 1.1 spec compliance. This includes some new types, error + codes, and moving structures around in the header files, + among other things. Note the Class bits of dat_error.h have + returned to a #define (from an enum) to cover the broadest + range of platforms. + + * Several additions for robustness, including handle and + pointer checking, better argument checking, state + verification, etc. Better recovery from error conditions, + and some assert()s have been replaced with 'if' statements to + handle the error. 
+ + * EVDs now maintain the actual queue length, rather than the + requested amount. Both the DAT spec and IB (and other + transports) allow the underlying implementation to provide + more CQ entries than requested. + + Requests for the same number of entries contained by an EVD + return immediate success. + + * kDAPL enhancements: + - module parameters & OS support calls updated to work with + more recent Linux kernels. + - kDAPL build options changes to match the Linux kernel, vastly + reducing the size and making it more robust. + - kDAPL unload now works properly + - kDAPL takes a reference on the provider driver when it + obtains a verbs vector, to prevent an accidental unload + - Cleaned out all of the uDAPL cruft from the linux/osd files. + + * New dapltest (see above). + + * Added a new I/O trace facility, enabling a developer to debug + all I/O that are in progress or recently completed. Default + is OFF in the build. + + * 0 timeout connections now refused, per the spec. + + * Moved the remaining uDAPL specific files from the common/ + directory to udapl/. Also removed udapl files from the kdapl + build. + + * Bug fixes + - Better error reporting from provider layer + - Fixed race condition on reference counts for posting DTO + ops. + - Use DAT_COMPLETION_SUPPRESS_FLAG to suppress successful + completion of dapl_rmr_bind (instead of + DAT_COMPLEITON_UNSIGNALLED, which is for non-notification + completion). + - Verify psp_flags value per the spec + - Bug in psp_create_any() checking psp_flags fixed + - Fixed type of flags in ib_disconnect from + DAT_COMPLETION_FLAGS to DAT_CLOSE_FLAGS + - Removed hard coded check for ASYNC_EVD. Placed all EVD + prevention in evd_stream_merging_supported array, and + prevent ASYNC_EVD from being created by an app. 
+ - ep_free() fixed to comply with the spec + - Replaced various printfs with dbg_log statements + - Fixed kDAPL interaction with the Linux kernel + - Corrected phy_register prototype + - Corrected kDAPL wait/wakeup synchronization + - Fixed kDAPL evd_kcreate() such that it no longer depends + on uDAPL only code. + - dapl_provider.h had wrong guard #def: changed DAT_PROVIDER_H + to DAPL_PROVIDER_H + - removed extra (and bogus) call to dapls_ib_completion_notify() + in evd_kcreate.c + - Inserted missing error code assignment in + dapls_rbuf_realloc() + - When a CONNECTED event arrives, make sure we are ready for + it, else something bad may have happened to the EP and we + just return; this replaces an explicit check for a single + error condition, replacing it with the general check for the + state capable of dealing with the request. + - Better context pointer verification. Removed locks around + call to ib_disconnect on an error path, which would result + in a deadlock. Added code for BROKEN events. + - Brought the vapi code more up to date: added conditional + compile switches, removed obsolete __ActivePort, deal + with 0 length DTO + - Several dapltest fixes to bring the code up to the 1.1 + specification. + - Fixed mismatched dapl_os_dbg_print() #else dapl_Dbg_Print(); + the latter was replaced with the former. + - ep_state_subtype() now includes UNCONNECTED. + - Added some missing ibapi error codes. + + + + NEW SINCE Beta 2.1 + + * Changes for Errata and 1.1 Spec + - Removed DAT_NAME_NOT_FOUND, per DAT errata + - EVD's with DTO and CONNECTION flags set no longer valid. + - Removed DAT_IS_SUCCESS macro + - Moved provider attribute structures from vendor files to udat.h + and kdat.h + - kdapl UPCALL_OBJECT now passed by reference + + * Completed dat_strerr return strings + + * Now support interrupted system calls + + * dapltest now uses dat_strerror for error reporting.
+ + * A large number of files were formatted to meet the project standard; + these are cosmetic changes that improve readability and + maintainability. Also cleaned up a number of comments during + this effort. + + * dat_registry and RPM file changes (contributed by Steffen Persvold): + - Renamed the RPM name of the registry to be dat-registry + (renamed the .spec file too, some cvs add/remove needed) + - Added the ability to create RPMs as normal user (using + temporary paths), works on SuSE, Fedora, and RedHat. + - 'make rpm' now works even if you didn't build first. + - Changed to using the GNU __attribute__((constructor)) and + __attribute__((destructor)) on the dat_init functions, dat_init + and dat_fini. The old -init and -fini options to LD make + applications crash on some platforms (Fedora for example). + - Added support for 64 bit platforms. + - Added code to allow multiple provider names in the registry, + primarily to support ia32 and ia64 libraries simultaneously. + Provider names are now kept in a list, the first successful + library open will be the provider. + + * Added initial infrastructure for DAPL_DCNTR, a feature that + will aid in debug and tuning of a dapl implementation. Partial + implementation only at this point. + + * Bug fixes + - Prevent debug messages from crashing dapl in EVD completions by + verifying the error code to ensure data is valid. + - Verify CNO before using it to clean up in evd_free() + - CNO timeouts now return correct error codes, per the spec. + - cr_accept now complies with the spec concerning connection + requests that go away before the accept is invoked. + - Verify valid EVD before posting connection events on active side + of a connection. EP locking also corrected. + - Clean up of dapltest Makefile, no longer need to declare + DAT_THREADSAFE + - Fixed check of EP states to see if we need to disconnect when an + IA is closed. + - ep_free() code reworked such that we can properly close a + connection pending EP.
+ - Changed disconnect processing to comply with the spec: user will + see a BROKEN event, not DISCONNECTED. + - If we get a DTO error, issue a disconnect to let the CM and + the user know the EP state changed to disconnect; checked IBA + spec to make sure we disconnect on correct error codes. + - ep_disconnect now properly deals with abrupt disconnects on the + active side of a connection. + - PSP now created in the correct state for psp_create_any(), making + it usable. + - dapl_evd_resize() now returns correct status, instead of always + DAT_NOT_IMPLEMENTED. + - dapl_evd_modify_cno() does better error checking before invoking + the provider layer, avoiding bugs. + - Simple change to allow dapl_evd_modify_cno() to set the CNO to + NULL, per the spec. + - Added required locking around call to dapl_sp_remove_cr. + + - Fixed problems related to dapl_ep_free: the new + disconnect(abrupt) allows us to do a more immediate teardown of + connections, removing the need for the MAGIC_EP_EXIT magic + number/state, which has been removed. Much cleanup of paths, + and made more robust. + - Made changes to meet the spec, uDAPL 1.1 6.3.2.3: CNO is + triggered if there are waiters when the last EVD is removed + or when the IA is freed. + - Added code to deal with the provider synchronously telling us + a connection is unreachable, and generate the appropriate + event. + - Changed timer routine type from unsigned long to uintptr_t + to better fit with machine architectures. + - ep.param data now initialized in ep_create, not ep_alloc. + - Or Gerlitz provided updates to Mellanox files for evd_resize, + fw attributes, many others. Also implemented changes for correct + sizes on REP side of a connection request. + + + + NEW SINCE Beta 2.0 + + * dat_echo now DAT 1.1 compliant. Various small enhancements. + + * Revamped atomic_inc/dec to be void, the return value was never + used. This allows kdapl to use Linux kernel equivalents, and + is a small performance advantage.
+ + * kDAPL: dapl_evd_modify_upcall implemented and tested. + + * kDAPL: physical memory registration implemented and tested. + + * uDAPL now builds cleanly for non-debug versions. + + * Default RDMA credits increased to 8. + + * Default ACK_TIMEOUT now a reasonable value (2 sec vs old 2 + months). + + * Cleaned up dat_error.h, now 1.1 compliant in comments. + + * evd_resize initial implementation. Untested. + + * Bug fixes + - __KDAPL__ is defined in kdat_config.h, so apps don't need + to define it. + - Changed include file ordering in kdat.h to put kdat_config.h + first. + - resolved connection/tear-down race on the client side. + - kDAPL timeouts now scaled properly; fixed 3 orders of + magnitude difference. + - kDAPL EVD callbacks now get invoked for all completions; old + code would drop them in heavy utilization. + - Fixed error path in kDAPL evd creation, so we no longer + leak CNOs. + - create_psp_any returns correct error code if it can't create + a connection qualifier. + - lock fix in ibapi disconnect code. + - kDAPL INFINITE waits now work properly (non connection + waits) + - kDAPL driver unload now works properly + - dapl_lmr_[k]create now returns 1.1 error codes + - ibapi routines now return DAT 1.1 error codes + + + + NEW SINCE Beta 1.10 + + * kDAPL is now part of the DAPL distribution. See the release + notes above. + + The kDAPL 1.1 spec is now contained in the doc/ subdirectory. + + * Several files have been moved around as part of the kDAPL + checkin. Some files that were previously in udapl/ are now + in common/, some in common are now in udapl/. The goal was + to make sure files are properly located and make sense for + the build. + + * Source code formatting changes for consistency. + + * Bug fixes + - dapl_evd_create() was comparing the wrong bit combinations, + allowing bogus EVDs to be created. + - Removed code that swallowed zero length I/O requests, which + are allowed by the spec and are useful to applications. 
+ - Locking in dapli_get_sp_ep was asymmetric; fixed it so the + routine will take and release the lock. Cosmetic change. + - dapl_get_consumer_context() will now verify the pointer + argument 'context' is not NULL. + + + OBTAIN THE CODE + + To obtain the tree for your local machine you can check it + out of the source repository using CVS tools. CVS is common + on Unix systems and available as freeware on Windows machines. + The command to anonymously obtain the source code from + Source Forge (with no password) is: + + cvs -d:pserver:anonymous@cvs.dapl.sourceforge.net:/cvsroot/dapl login + cvs -z3 -d:pserver:anonymous@cvs.dapl.sourceforge.net:/cvsroot/dapl co . + + When prompted for a password, simply press the Enter key. + + Source Forge also contains explicit directions on how to become + a developer, as well as how to use different CVS commands. You may + also browse the source code using the URL: + + http://svn.sourceforge.net/viewvc/dapl/trunk/ + + SYSTEM REQUIREMENTS + + This project has been implemented on Red Hat Linux 7.3, SuSE + SLES 8, 9, and 10, Windows 2000, RHEL 3.0, 4.0 and 5.0 and a few + other Linux distributions. The structure of the code is designed + to allow other operating systems to easily be adapted. + + The DAPL team has used Mellanox Tavor based InfiniBand HCAs for + development, and continues with this platform. Our HCAs use the + IB verbs API submitted by IBM. Mellanox has contributed an + adapter layer using their VAPI verbs API. Either platform is + available to any group considering DAPL work. The structure of + the uDAPL source allows other provider API sets to be easily + integrated. + + The development team uses any one of three topologies: two HCAs + in a single machine; a single HCA in each of two machines; and + most commonly, a switch. Machines connected to a switch may have + more than one HCA.
+ + The DAPL Plugfest revealed that switches and HCAs available from + most vendors will interoperate with little trouble, given the + most recent releases of software. The dapl reference team makes + no recommendation on HCA or switch vendors. + + Explicit machine configurations are available upon request. + + IN THE TREE + + The DAPL tree contains source code for the uDAPL and kDAPL + implementations, and also includes tests and documentation. + + Included documentation has the base level API of the + providers: OpenFabrics, IBM Access, and Mellanox Verbs API. Also + included are a growing number of DAPL design documents which + lead the reader through specific DAPL subsystems. More + design documents are in progress and will appear in the tree in + the near future. + + A small number of test applications and a unit test framework + are also included. dapltest is the primary testing application + used by the DAPL team; it is capable of simulating a variety of + loads and exercises a large number of interfaces. Full + documentation is included for each of the tests. + + Recently, the dapl conformance test has been added to the source + repository. The test provides coverage of the most common + interfaces, doing both positive and negative testing. Vendors + providing DAPL implementations are strongly encouraged to run + this set of tests. + + MAKEFILE NOTES + + There are a number of #ifdef's in the code that were necessary + during early development. They are disappearing as we + have time to take advantage of features and work available from + newer releases of provider software. These #ifdefs are not + documented as the intent is to remove them as soon as possible. + + CONTRIBUTIONS + + As is common to Source Forge projects, there are a small number + of developers directly associated with the source tree and having + privileges to change the tree.
Requested updates, changes, bug + fixes, enhancements, or contributions should be sent to + James Lentini at jlentinit@netapp.com for review. We welcome your + contributions and expect the quality of the project will + improve thanks to your help. + + The core DAPL team is: + + James Lentini + Arlin Davis + Steve Sears + + ... with contributions from a number of excellent engineers in + various companies contributing to the open source effort. + + + ONGOING WORK + + Not all of the DAPL spec is implemented at this time. + Functionality such as shared memory will probably not be + implemented by the reference implementation (there is a write up + on this in the doc/ area), and there are yet various cases where + work remains to be done. And of course, not all of the + implemented functionality has been tested yet. The DAPL team + continues to develop and test the tree with the intent of + completing the specification and delivering a robust and useful + implementation. + + +The DAPL Team + diff --git a/scripts/create_Module.symvers.sh b/scripts/create_Module.symvers.sh new file mode 100755 index 0000000..5b2d76d --- /dev/null +++ b/scripts/create_Module.symvers.sh @@ -0,0 +1,64 @@ +#!/bin/bash +# +# Copyright (c) 2006 Mellanox Technologies. All rights reserved. +# Copyright (c) 2004, 2005, 2006 Voltaire, Inc. All rights reserved. +# +# This Software is licensed under one of the following licenses: +# +# 1) under the terms of the "Common Public License 1.0" a copy of which is +# available from the Open Source Initiative, see +# http://www.opensource.org/licenses/cpl.php. +# +# 2) under the terms of the "The BSD License" a copy of which is +# available from the Open Source Initiative, see +# http://www.opensource.org/licenses/bsd-license.php. +# +# 3) under the terms of the "GNU General Public License (GPL) Version 2" a +# copy of which is available from the Open Source Initiative, see +# http://www.opensource.org/licenses/gpl-license.php. 
+# +# Licensee has the right to choose one of the above licenses. +# +# Redistributions of source code must retain the above copyright +# notice and one of the license notices. +# +# Redistributions in binary form must reproduce both the above copyright +# notice, one of the license notices in the documentation +# and/or other materials provided with the distribution. +# +# Description: creates Module.symvers file for InfiniBand modules + +KVERSION=${KVERSION:-$(uname -r)} +MOD_SYMVERS=./Module.symvers +SYMS=/tmp/syms + +echo MODULES_DIR=${MODULES_DIR-:./} + +if [ -f ${MOD_SYMVERS} -a ! -f ${MOD_SYMVERS}.save ]; then + mv ${MOD_SYMVERS} ${MOD_SYMVERS}.save +fi +rm -f $MOD_SYMVERS +rm -f $SYMS + +for mod in $(find ${MODULES_DIR} -name '*.ko') ; do + nm -o $mod |grep __crc >> $SYMS + n_mods=$((n_mods+1)) +done + +n_syms=$(wc -l $SYMS |cut -f1 -d" ") +echo Found $n_syms OFED kernel symbols in $n_mods modules +n=1 + +while [ $n -le $n_syms ] ; do + line=$(head -$n $SYMS|tail -1) + + line1=$(echo $line|cut -f1 -d:) + line2=$(echo $line|cut -f2 -d:) + file=$(echo $line1| sed -e 's@./@@' -e 's@.ko@@' -e "s@$PWD/@@") + crc=$(echo $line2|cut -f1 -d" ") + sym=$(echo $line2|cut -f3 -d" ") + echo -e "0x$crc\t$sym\t$file" >> $MOD_SYMVERS + n=$((n+1)) +done + +echo ${MOD_SYMVERS} created. diff --git a/scripts/ofed_patch.sh b/scripts/ofed_patch.sh new file mode 100755 index 0000000..0497f95 --- /dev/null +++ b/scripts/ofed_patch.sh @@ -0,0 +1,268 @@ +#!/bin/bash +# +# Copyright (c) 2009 Mellanox Technologies. All rights reserved. +# +# This Software is licensed under one of the following licenses: +# +# 1) under the terms of the "Common Public License 1.0" a copy of which is +# available from the Open Source Initiative, see +# http://www.opensource.org/licenses/cpl.php. +# +# 2) under the terms of the "The BSD License" a copy of which is +# available from the Open Source Initiative, see +# http://www.opensource.org/licenses/bsd-license.php. 
+# +# 3) under the terms of the "GNU General Public License (GPL) Version 2" a +# copy of which is available from the Open Source Initiative, see +# http://www.opensource.org/licenses/gpl-license.php. +# +# Licensee has the right to choose one of the above licenses. +# +# Redistributions of source code must retain the above copyright +# notice and one of the license notices. +# +# Redistributions in binary form must reproduce both the above copyright +# notice, one of the license notices in the documentation +# and/or other materials provided with the distribution. +# +# +# Add/Remove a patch to/from OFED's ofa_kernel package + + +usage() +{ +cat << EOF + + Usage: + Add patch to OFED: + `basename $0` --add + --ofed|-o + --patch|-p + --type|-t |addons > + + Remove patch from OFED: + `basename $0` --remove + --ofed|-o + --patch|-p + --type|-t |addons > + + Example: + `basename $0` --add --ofed /tmp/OFED-1.X/ --patch /tmp/cma_establish.patch --type kernel + + `basename $0` --remove --ofed /tmp/OFED-1.X/ --patch cma_establish.patch --type kernel + +EOF +} + +action="" + +# Execute command w/ echo and exit if it fail +ex() +{ + echo "$@" + if ! "$@"; then + printf "\nFailed executing $@\n\n" + exit 1 + fi +} + +add_patch() +{ + if [ -f $2/${1##*/} ]; then + echo Replacing $2/${1##*/} + ex /bin/rm -f $2/${1##*/} + fi + ex cp $1 $2 +} + +remove_patch() +{ + if [ -f $2/${1##*/} ]; then + echo Removing $2/${1##*/} + ex /bin/rm -f $2/${1##*/} + else + echo Patch $2/${1##*/} was not found + exit 1 + fi +} + +set_rpm_info() +{ + package_SRC_RPM=$(/bin/ls -1 ${ofed}/SRPMS/${1}*src.rpm 2> /dev/null) + if [[ -n "${package_SRC_RPM}" && -s ${package_SRC_RPM} ]]; then + package_name=$(rpm --queryformat "[%{NAME}]" -qp ${package_SRC_RPM}) + package_ver=$(rpm --queryformat "[%{VERSION}]" -qp ${package_SRC_RPM}) + package_rel=$(rpm --queryformat "[%{RELEASE}]" -qp ${package_SRC_RPM}) + else + echo $1 src.rpm not found under ${ofed}/SRPMS + exit 1 + fi +} + +main() +{ + while [ ! 
-z "$1" ] + do + case $1 in + --add) + action="add" + shift + ;; + --remove) + action="remove" + shift + ;; + --ofed|-o) + ofed=$2 + shift 2 + ;; + --patch|-p) + patch=$2 + shift 2 + ;; + --type|-t) + type=$2 + shift 2 + case ${type} in + backport|addons) + tag=$1 + shift + ;; + esac + ;; + --help|-h) + usage + exit 0 + ;; + *) + usage + exit 1 + ;; + esac + done + + if [ -z "$action" ]; then + usage + exit 1 + fi + + if [ -z "$ofed" ] || [ ! -d "$ofed" ]; then + echo Set the path to the OFED directory. Use \'--ofed\' parameter + exit 1 + else + ofed=$(readlink -f $ofed) + fi + + if [ "$action" == "add" ]; then + if [ -z "$patch" ] || [ ! -r "$patch" ]; then + echo Set the path to the patch file. Use \'--patch\' parameter + exit 1 + else + patch=$(readlink -f $patch) + fi + else + if [ -z "$patch" ]; then + echo Set the name of the patch to be removed. Use \'--patch\' parameter + exit 1 + fi + fi + + if [ -z "$type" ]; then + echo Set the type of the patch. Use \'--type\' parameter + exit 1 + fi + + if [ "$type" == "backport" ] || [ "$type" == "addons" ]; then + if [ -z "$tag" ]; then + echo Set tag for backport patch. + exit 1 + fi + fi + + # Get ofa RPM version + case $type in + kernel|backport|addons) + set_rpm_info ofa_kernel + ;; + *) + echo "Unknown type $type" + exit 1 + ;; + esac + + package=${package_name}-${package_ver} + cd ${ofed} + if [ ! -e SRPMS/${package}-${package_rel}.src.rpm ]; then + echo File ${ofed}/SRPMS/${package}-${package_rel}.src.rpm not found + exit 1 + fi + + if ! ( set -x && rpm -i --define "_topdir $(pwd)" SRPMS/${package}-${package_rel}.src.rpm && set +x ); then + echo "Failed to install ${package}-${package_rel}.src.rpm" + exit 1 + fi + + cd - + + cd ${ofed}/SOURCES + ex tar xzf ${package}.tgz + + case $type in + kernel) + if [ "$action" == "add" ]; then + add_patch $patch ${package}/kernel_patches/fixes + else + remove_patch $patch ${package}/kernel_patches/fixes + fi + ;; + backport) + if [ "$action" == "add" ]; then + if [ ! 
-d ${package}/kernel_patches/backport/$tag ]; then + echo Creating ${package}/kernel_patches/backport/$tag directory + ex mkdir -p ${package}/kernel_patches/backport/$tag + echo WARNING: Check that ${package} configure supports backport/$tag + fi + add_patch $patch ${package}/kernel_patches/backport/$tag + else + remove_patch $patch ${package}/kernel_patches/backport/$tag + fi + ;; + addons) + if [ "$action" == "add" ]; then + if [ ! -d ${package}/kernel_addons/backport/$tag ]; then + echo Creating ${package}/kernel_addons/backport/$tag directory + ex mkdir -p ${package}/kernel_addons/backport/$tag + echo WARNING: Check that ${package} configure supports backport/$tag + fi + add_patch $patch ${package}/kernel_addons/backport/$tag + else + remove_patch $patch ${package}/kernel_addons/backport/$tag + fi + ;; + *) + echo Unknown patch type: $type + exit 1 + ;; + esac + + ex tar czf ${package}.tgz ${package} + cd - + + cd ${ofed} + echo Rebuilding ${package_name} source rpm: + if ! ( set -x && rpmbuild -bs --define "_topdir $(pwd)" SPECS/${package_name}.spec && set +x ); then + echo Failed to create ${package}-${package_rel}.src.rpm + exit 1 + fi + ex rm -rf SOURCES/${package}* + if [ "$action" == "add" ]; then + echo Patch added successfully. + else + echo Patch removed successfully. + fi + echo + echo Remove existing RPM packages from ${ofed}/RPMS direcory in order + echo to rebuild RPMs +} + +main $@ diff --git a/sdp_release_notes.txt b/sdp_release_notes.txt deleted file mode 100644 index 52fc1b7..0000000 --- a/sdp_release_notes.txt +++ /dev/null @@ -1,251 +0,0 @@ - Open Fabrics Enterprise Distribution (OFED) - SDP in MLNX_OFED 1.5.2 Release Notes - - December 2010 - - - -=============================================================================== -Table of Contents -=============================================================================== -1. Overview -2. Bug Fixes and Enhancements since OFED 1.5.2 -3. ZCopy -4. Known Issues -5. 
Verification Applications/Flows/Tests -6. Module Parameters - -=============================================================================== -1. Overview -=============================================================================== -Sockets Direct Protocol (SDP) is an InfiniBand byte-stream transport protocol -that provides TCP stream semantics. Capable of utilizing InfiniBand's advanced -protocol offload capabilities, SDP can provide lower latency, higher bandwidth, -and lower CPU utilization than IPoIB or -Ethernet running some sockets-based applications. - -SDP in OFED is at GA level for MLNX OFED 1.5.2 - -=============================================================================== -2. Main Features and Changes -=============================================================================== -- Added support for Inline and blueflame -- Improved stability -- Bug fixes - -=============================================================================== -2. Bug Fixes and Enhancements since OFED 1.5.2 -=============================================================================== -* Cleanups - - Added support for 2.6.34 / 2.6.36. - -* Bug Fixes - - Fixed compilation problems on 32 bit hosts - - Do not compile in debug mode when not asked. - - Improved recovery from errors. - -* Enhancements - - more statistics in /proc/sdpstats - - added debugfs for sdp: - - sdpprf was moved from /proc to debugfs/sdp - - debugfs/ - Socket history - - -=============================================================================== -3. ZCopy -=============================================================================== -- ZCopy is enabled by default for blocks larger than 64K. ZCopy can be disabled - by setting the module parameter sdp_zcopy_thresh to zero; setting it to a - non-zero value changes the threshold instead. - -- ZCOPY mode gives good performance for large blocks with very low CPU - utilization. 
When in use, all messages longer than 'sdp_zcopy_thresh' bytes - in length will cause the user space buffer to be pinned and the data sent - directly from the original buffer. This results in less CPU usage and, on many - systems, in higher bandwidth. - ZCOPY is most efficient with multi stream jobs and it performs better as the - message size increases. - The default 64K value for 'sdp_zcopy_thresh' is too low for some - systems. You must experiment with your hardware to select the best value. - -- ZCOPY vs BCOPY: - ZCOPY is more efficient on weak CPUs and with multiple streams, whereas - BCOPY is more efficient with a single stream. - -=============================================================================== -4. Known Issues -=============================================================================== -- SDP is at beta level on Infinihost HCA family - -- Occasionally, socket bind fails with EINVAL. Although the TCP socket is bound - successfully, SDP is occupied, thus causing the socket bind failure. - See Bugzilla 2159 and Bugzilla 2160 - -- When SO_REUSEADDR is set, only a single socket can be bound to IP_ANY and a - specific port. This is a TCP limitation, unless one of the sockets is listening. - -- BUG 1331 - Although TCP allows connecting to IP_ANY - 0.0.0.0 - (as a destination address!), SDP does not allow connecting to IP_ANY - and rejects the connection. - -- BUG 1444 - The setsockopt(SO_RCVBUF) is not functional in sdp socket. - To limit top system wide sdp memory usage for recv, - use the module parameter top_mem_usage. - -- Each SDP socket currently consumes up to 2 MBytes of memory. If this value - is high for your installation, it is possible to trade off performance - for lower memory utilization per socket by reducing the value of the - "rcvbuf_scale" module parameter (default: 16). - - Note: The minimum legal value for the "rcvbuf_scale" module parameter is 1. - At this parameter value, each socket will consume approximately 128 KBytes. 
- -- Small message size performance is low when messages are sent by client - at a rate lower than the rate at which they are consumed by server, - and when TCP_CORK is not set. This is observed, for example, with iperf - benchmark. - Workaround: Set the TCP_CORK socket option - to ensure data is sent in at least 32K byte chunks. - -- Performance is low on 32-bit kernels, as SDP utilizes high memory - to ease memory pressure. - Workaround: Move to a 64-bit kernel if the application remains a 32-bit one. - -- By default, SDP utilizes a 2 Kbyte MTU size. This may cause PCI-X cards - using Mellanox Technologies "Infinihost" HCAs to experience low bandwidth. - Workaround: Reset the MTU size to 1K in this situation, using either of - the two methods below: - - 1. Activate the "tavor quirk" workaround in opensm: - a. Create an opensm options cache file (/var/cache/osm/opensm.opts): - > opensm --cache-options -o - b. Add the following line to /var/cache/osm/opensm.opts: - enable_quirks TRUE - c. Rerun opensm using your usual command line options to activate - the opensm quirk option. - - 2. Activate the "tavor quirk" workaround in cma: - set the tavor_quirk module parameter of the rdma_cm module to value 1 - (default: 0). - -- When waiting for RX, the driver first polls, arms interrupt and then goes to - sleep. Polling duration could be set by recv_poll module parameter. The - higher this value is, the higher the CPU utilization is, and the number of - interrupts is lower. - This should be fine tuned according to the specific environment and - application latency. - -- When using SDP over RoCE, and the peer has a card that does not support RoCE - a delay in the connection establishment may occur. - -- BUG2185 - Occasionally, accessing /proc/net/sdpstats, causes kernel - panic. - -- For set-user-ID/set-group-ID ELF binaries, only libraries in the standard - search directories that are also set-user-ID. 
Since always installing - libsdp with this bit on is a security vulnerability, the default behavior is - to reset this bit. A user that want to run such binaries should modify the - libsdp.spec file. - -=============================================================================== -5. Verification Applications/Flows/Tests -=============================================================================== -- ssh/sshd -- wget/netscape/firefox/apache -- netpipe -- netperf -- LTP socket tests -- iperf-2.0.2 -- ttcp -- openmpi -- openmpi + Intel MPI benchmarks -- Threaded and forking echo client server examples -- Various Java client server applications (SUN:jre, BEA:jrockit/WebLogic, GNU:gij/gcj) -- Many UNIX utilities to verify that pre-load did not harm the applications - -=============================================================================== -6. Module Parameters -=============================================================================== - -General -------- -sdp_link_layer_ib_only: - Supports only link layer of type InfiniBand. - It is useful when not using SDP over RoCE. - -sdp_debug_level: - Enables connection establishment and teardown debug tracing. - -sdp_data_debug_level: - Enables datapath debug tracing. If set to 1, it shows only packets >1. - To enable debugging of data path, compile driver with CONFIG_SDP_DEBUG_DATA. - - -recv_poll: - Enables poll receiving before arming the interrupt. Set a higher value - to decrease the number of RX interrupts. Consequently, the CPU - utilization will be higher. - -sdp_keepalive_time: - Default idle time in seconds before keepalive probe sent. - -Resources ---------- -rcvbuf_initial_size: - Receives buffer initial size in bytes. - -rcvbuf_scale: - Not in use - -top_mem_usage: - Top system wide sdp memory usage for recv (in MB). 
- -max_large_sockets: - Not in use - -sdp_fmr_pool_size: - Number of FMRs to allocate for pool - -sdp_fmr_dirty_wm: - Watermark to flush fmr pool - -Thresholds ----------- -sdp_inline_thresh: - Inline copy threshold. effective to new sockets only; 0=Off. - -sdp_zcopy_thresh: - Zero copy using RDMA threshold; 0=Off. - If smaller than page size, set to page size. - -Interrupt hardware moderation: ------------------------------- -sdp_rx_coal_target: - Target number of bytes to coalesce with interrupt moderation. - -sdp_rx_coal_time: - rx coal time (jiffies). - -sdp_rx_rate_low: - rx_rate low (packets/sec). - -sdp_rx_coal_time_low: - low moderation usec. - -sdp_rx_rate_high: - rx_rate high (packets/sec). - -sdp_rx_coal_time_high: - high moderation usec. - -sdp_rx_rate_thresh: - rx rate thresh (). - -sdp_sample_interval: - sample interval (jiffies). - -hw_int_mod_count: - Forced hw int moderation val. -1 for auto (packets). 0 to disable. - -hw_int_mod_usec: - Forced hw int moderation val. -1 for auto (usec). 0 to disable. diff --git a/srp_release_notes.txt b/srp_release_notes.txt deleted file mode 100644 index b6cee0b..0000000 --- a/srp_release_notes.txt +++ /dev/null @@ -1,613 +0,0 @@ - - Open Fabrics Enterprise Distribution (OFED) - SRP in OFED 1.5.2 Release Notes - - December 2010 - - -============================================================================== -Table of contents -============================================================================== - - 1. Overview - 2. Changes and Bug Fixes since OFED 1.5 - 3. Software Dependencies - 4. Major Features - 5. Loading SRP Initiator - 6. Manually Establishing an SRP Connection - 7. SRP Tools - ibsrpdm and srp_daemon - 8. Automatic Discovery and Connecting to Targets - 9. Multiple Connections from Initiator IB Port to the Target - 10. High Availability - 11. Shutting Down SRP - 12. Known Issues - 13. Vendor Specific Notes - - -============================================================================== -1. 
Overview -============================================================================== - -The SRP standard describes the message format and protocol definitions required -for transferring commands and data between a SCSI initiator port and a SCSI -target port using RDMA communication service. - - -============================================================================== -2. Changes and Bug Fixes since OFED 1.5 -============================================================================== -* Check for scsi_id in scmnd to prevent scan/rescan keep adding new scsi devices - ie. echo "- - -" > /sys/class/scsi_host/hostXX/scan -* Bug fixing - -============================================================================== -4. Software Dependencies -============================================================================== - -The SRP Initiator depends on the installation of the OFED Distribution stack -with OpenSM running. - -============================================================================== -5. Major Features -============================================================================== - -This SRP Initiator is based on source taken from openib.org gen2 implementing -the SCSI RDMA Protocol-2 (SRP-2), Doc. no. T10/1524-D. See: -www.t10.org/ftp/t10/drafts/srp2/srp2r00a.pdf - -The SRP Initiator supports: -- Basic SCSI Primary Commands -3 (SPC-3) - (www.t10.org/ftp/t10/drafts/spc3/spc3r21b.pdf) -- Basic SCSI Block Commands -2 (SBC-2) - (www.t10.org/ftp/t10/drafts/sbc2/sbc2r16.pdf) -- Basic functionality, task management and limited error handling - -============================================================================== -6. 
Loading SRP Initiator -============================================================================== - -To load the SRP module, either execute the "modprobe ib_srp" command after the -OFED driver is up, or change the value of SRP_LOAD in -/etc/infiniband/openib.conf to "yes" (causing the srp module to be loaded -at driver boot). - -NOTE: When loading the ib_srp module, it is possible to set the module - parameter srp_sg_tablesize. This is the maximum number of - gather/scatter entries per I/O (default: 12). - - a. modprobe ib_srp srp_sg_tablesize=32 - or - b. edit /etc/modprobe.conf and add the following line: - options ib_srp srp_sg_tablesize=32 - -Module parameters: -For the list of ib_srp module parameters - $ modinfo ib_srp - - + srp_sg_tablesize: Max number of scatter/gather entries per I/O - + srp_dev_loss_tmo: Number of seconds during which the srp driver will not - return DID_NO_CONNECT status after it loses the - connection to a target. During this period, it tries to - re-establish the connection to the target and returns - DID_RESET or DID_ABORT status for outstanding SCSI - commands, to prevent the DM Multipath driver from - failing over to the next path. - Default value is 60 seconds. - -============================================================================== -7. Manually Establishing an SRP Connection -============================================================================== - -The following steps describe how to manually establish an SRP connection between -the Initiator and an SRP Target. Section 8 explains how to do this -automatically. - -- Make sure that the ib_srp module is loaded, the SRP Initiator is reachable - by the SRP Target, and that an SM is running. 
- -- To establish a connection with an SRP Target and create SRP (SCSI) device(s) - for that target under /dev, use the following command: - - echo -n id_ext=[GUID value],ioc_guid=[GUID value],dgid=[port GID value],\ - pkey=ffff,service_id=[service[0] value] > \ - /sys/class/infiniband_srp/srp-mthca[hca number]-[port number]/add_target - - a. Execution of the above "echo" command may take some time - b. The SM must be running while the command executes - c. It is possible to include additional parameters in the echo command: - > max_cmd_per_lun - Default: 63 - > max_sect (short for max_sectors) - sets the request size of a command - > io_class - Default: 0x100 as in rev 16A of the specification - Note: In rev 10 the default was 0xff00 - > initiator_ext - Please refer to Section 9 (Multiple Connections...) - d. See SRP Tools below for instructions on how the parameters in the - echo command above may be obtained. - -NOTES: - -- If the same *echo -n * command is issued more than once, the srp target - terminates the previous connection and establishes a new - connection. To have multiple connections to an srp target, use - different initiator_ext values in the echo command. - -- To list the new SCSI devices that have been added by the echo command, you - may use either of the following two methods: - a. Execute "fdisk -l". This command lists all devices; the new devices are - included in this listing. - b. Execute *dmesg* or look at /var/log/messages to find messages with the - names of the new devices. - - -============================================================================== -8. 
SRP Tools - ibsrpdm and srp_daemon -============================================================================== - -To assist in performing the steps in Section 6, the OFED 1.3.1 distribution -provides two utilities which: -- Detect targets on the fabric reachable by the Initiator (for step 1) -- Output target attributes in a format suitable for use in the above - "echo" command (step 2) - -These utilities are: ibsrpdm and srp_daemon. - -The utilities can be found under /usr/local/ofed/sbin/ (or /sbin/), -and are part of the srptools RPM that may be installed using the -OFED custom installation. Detailed information regarding the various -options for these utilities are provided by their man pages. - -Below, several usage scenarios for these utilities are presented. - -ibsrpdm usage -------------- -1. Detecting reachable targets - - a. To detect all targets reachable by the SRP initiator via the default - umad device (/dev/infiniband/umad0), execute the following command: - $ ibsrpdm - - This command will output information on each SRP target detected, in - human-readable form. - - Sample output: - IO Unit Info: - port LID: 0103 - port GID: fe800000000000000002c90200402bd5 - change ID: 0002 - max controllers: 0x10 - - controller[ 1] - GUID: 0002c90200402bd4 - vendor ID: 0002c9 - device ID: 005a44 - IO class : 0100 - ID: LSI Storage Systems SRP Driver 200400a0b81146a1 - service entries: 1 - service[ 0]: 200400a0b81146a1 / SRP.T10:200400A0B81146A1 - - b. To detect all the SRP Targets reachable by the SRP Initiator via - another umad device, use the following command: - - $ ibsrpdm -d - -2. Assistance in creating an SRP connection - - a. To generate output suitable for utilization in the "echo" command of - section 5, add the "-c" option to ibsrpdm: - - $ ibsrpdm -c - - Sample output: - id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4, - dgid=fe800000000000000002c90200402bd5,pkey=ffff, - service_id=200400a0b81146a1 - - b. 
To establish a connection with an SRP Target (Section 6) using the output - from the "ibsrpdm -c" example above, execute the following command: - - $ echo -n id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4, - dgid=fe800000000000000002c90200402bd5,pkey=ffff, - service_id=200400a0b81146a1 - > /sys/class/infiniband_srp/srp-mlnx_0-1/add_target - - The SRP connection should now be up; the newly created SCSI devices should - appear in the listing obtained from the "fdisk -l" command. - - -srp_daemon ----------- -The srp_daemon utility is based on ibsrpdm and extends its functionality. -In addition to the ibsrpdm functionality described above, srp_daemon can also -- Establish an SRP connection by itself (without the need to issue the "echo" - command described in Section 6) -- Continue running in background, detecting new targets and establishing SRP - connections with them (daemon mode) -- Discover reachable SRP targets given an InfiniBand HCA name and port, rather - than just by /dev/umad where is a digit -- Enable High Availability operation (together with Device-Mapper Multipath) -- Have a configuration file that determines the targets to connect to - -a. srp_daemon commands equivalent to ibsrpdm: - - "srp_daemon -a -o" is equivalent to "ibsrpdm" - "srp_daemon -c -a -o" is equivalent to "ibsrpdm -c" - -Note: These srp_daemon commands can behave differently than the equivalent - ibsrpdm command when /etc/srp_daemon.conf is not empty. - -b. srp_daemon extensions to ibsrpdm - - - To discover SRP Targets reachable from HCA device , - port , (and generate output suitable for 'echo') you may execute - - srp_daemon -c -a -o -i -p - - - To both discover the SRP Targets and establish connections with them, just - add the -e option to the above command. - - - Executing srp_daemon over a port without the -a option displays only - the targets that are reachable via the port and to which the initiator is - not yet connected. When executing with the -e option, it is better to omit -a. 
- - - It is recommended to use the -n option. This option adds the initiator_ext - to the connecting string. (See Section 9 for more details). - - - srp_daemon has a configuration file that can be set, where the default is - /etc/srp_daemon.conf. Use the -f to supply a different configuration file - that configures the targets srp_daemon is allowed to connect to. The - configuration file can also be used to set values for additional - parameters (e.g., max_cmd_per_lun, max_sect). - - - A continuous background (daemon) operation, providing an automatic ongoing - detection and connection capability. See Section 8. - -============================================================================== -9. Automatic Discovery and Connecting to Targets -============================================================================== - -- Make sure that the ib_srp module is loaded, the SRP Initiator can reach an - SRP Target, and that an SM is running. - -- To connect to all the existing Targets in the fabric, execute - srp_daemon -e -o. This utility will scan the fabric once, connect to - every Target it detects, and then exit. - -NOTE: srp_daemon will follow the configuration it finds in - /etc/srp_daemon.conf. Thus, it will ignore a target that is disallowed in - the configuration file. - -- To connect to all the existing Targets in the fabric and to connect - to new targets that will join the fabric, execute srp_daemon -e. This utility - continues to execute until it is either killed by the user or encounters - connection errors (such as no SM in the fabric). - -- To execute SRP daemon as a daemon you may execute run_srp_daemon - (found under /usr/local/ofed/sbin/ or /sbin/), providing it with - the same options used for running srp_daemon. - - Note: Make sure only one instance of run_srp_daemon runs per port. - -- To execute SRP daemon as a daemon on all the ports, execute srp_daemon.sh - (found under /usr/local/ofed/sbin/ or /sbin/). 
- srp_daemon.sh sends its log to /var/log/srp_daemon.log. - -- It is possible to configure this script to execute automatically when the - InfiniBand driver starts by changing the value of SRP_DAEMON_ENABLE in - /etc/infiniband/openib.conf to "yes" and SRP_LOAD to yes as well. - - Another way to configure this script to execute automatically when the - InfiniBand driver starts is to change the value of SRPHA_ENABLE in - /etc/infiniband/openib.conf to "yes". However, this option also enables - SRP High Availability, which has some more features. (Please read the High - Availability section). - -============================================================================== -10. Multiple Connections from Initiator IB Port to the Target -============================================================================== - -Some system configurations may need multiple SRP connections from -the SRP Initiator to the same SRP Target: to the same Target IB port, -or to different IB ports on the same Target HCA. - -In case of a single Target IB port, i.e., SRP connections use the same path, -the configuration is enabled using a different initiator_ext value for each -SRP connection. The initiator_ext value is a 16-hexadecimal-digit value -specified in the connection command. - -Also in case of two physical connections (i.e., network paths) from a single -initiator IB port to two different IB ports on the same Target HCA, there is -need for a different initiator_ext value on each path. The convention is to -use the Target port GUID as the initiator_ext value for the relevant path. - -If you use srp_daemon with the -n flag, it automatically assigns initiator_ext -values according to this convention. For example: - - id_ext=200500A0B81146A1,ioc_guid=0002c90200402bec, - dgid=fe800000000000000002c90200402bed, - pkey=ffff,service_id=200500a0b81146a1,initiator_ext=ed2b400002c90200 - - Notes: - a. It is recommended to use the -n flag for all srp_daemon invocations. - b. 
ibsrpdm does not have a corresponding option. - c. srp_daemon.sh always uses the -n option (whether invoked manually by - the user, or automatically at startup by setting SRPHA_ENABLE or - SRP_DAEMON_ENABLE to yes). - -============================================================================== -11. High Availability (HA) -============================================================================== - -High Availability Overview --------------------------- - -High Availability works using the Device-Mapper (DM) multipath and the -SRP daemon. - -Each initiator is connected to the same target from several ports/HCAs. -The DM multipath is responsible for joining together different paths to the -same target and for fail-over between paths when one of them goes offline. -Multipath will be executed on newly joined SCSI devices. - -Each initiator should execute several instances of the SRP daemon, one for each -port. At startup, each SRP daemon detects the SRP targets in the fabric and -sends requests to the ib_srp module to connect to each of them. These -SRP daemons also detect targets that subsequently join the fabric, and send the -ib_srp module requests to connect to them as well. - -High Availability Operation ---------------------------- - -When a path (from port1) to a target fails, the ib_srp module starts an error -recovery process. If this process gets to the reset_host stage and there is no -path to the target from this port, ib_srp will remove this scsi_host. After -the scsi_host is removed, multipath switches to another path to this target -(from another port/HCA). - -When the failed path recovers, it will be detected by the SRP daemon. The SRP -daemon will then request ib_srp to connect to this target. Once the connection -is up, there will be a new scsi_host for this target. Multipath will be -executed on the devices of this host, returning to the original state (prior to -the failed path). 
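The one-daemon-per-port arrangement described in the High Availability overview can be sketched in shell. This is a minimal dry-run sketch under stated assumptions, not part of OFED: the function name srp_daemon_cmds and the HCA names mthca0/mthca1 are hypothetical, and the "-c -e -R 300 -i -p" options are simply the ones these notes use for srp_daemon. It only prints the command lines so they can be reviewed before anything is started.

```shell
#!/bin/bash
# Hypothetical helper (not shipped with OFED): print one srp_daemon
# command line per HCA port. Each argument is an "hca:port" pair,
# e.g. "mthca0:1"; the HCA names used below are assumptions.
srp_daemon_cmds() {
    local pair hca port
    for pair in "$@"; do
        hca=${pair%%:*}     # text before the first ':'
        port=${pair##*:}    # text after the last ':'
        # Dry run only: echo the command instead of executing it.
        echo "srp_daemon -c -e -R 300 -i $hca -p $port"
    done
}

# Example: two ports on one HCA and one port on a second HCA.
srp_daemon_cmds mthca0:1 mthca0:2 mthca1:1
```

Printing rather than executing keeps the sketch safe to run on a machine without InfiniBand hardware; srp_daemon.sh remains the supported way to start a daemon on every port.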
- -High Availability Prerequisites -------------------------------- - -Installation for RHEL4 and RHEL5: (Execute once) - - Verify that the standard device-mapper-multipath rpm is installed. If not, - install it from the RHEL distribution. - -Installation for SLES10: (Execute once) - - Verify that multipath is installed. If not, install it from the - installation (You can use yast). - - - Update udev: (Execute once - for manual activation of High Availability only) - - - Add a file to /etc/udev/rules.d/ (you can call it 91-srp.rules) - This file should have one line: - ACTION=="add", KERNEL=="sd*[!0-9]", RUN+="/sbin/multipath %M:%m" - - Note: When SRPHA_ENABLE is set to "yes" (see Automatic Activation of High - Availability below), this file is created upon each boot of the driver and - is deleted when the driver is unloaded. - - -Manual Activation of High Availability --------------------------------------- - -Initialization: (Execute after each boot of the driver) - 1) Execute modprobe dm-multipath - 2) Execute modprobe ib-srp - 3) Make sure you have created file /etc/udev/rules.d/91-srp.rules - as described above - 4) Execute for each port and each HCA: - srp_daemon -c -e -R 300 -i -p - (You can use another value for -R. See under the Known Issues section - the workaround for the rare race condition.) - - This step can be performed by executing srp_daemon.sh, which sends - its log to /var/log/srp_daemon.log. - - Now it is possible to access the SRP LUNs on /dev/mapper/. - - NOTE: It is possible for regular (non-SRP) LUNs to also be present; - the SRP LUNs may be identified by their names. You can configure the - /etc/multipath.conf file to change multipath behavior. - - -Automatic Activation of High Availability ------------------------------------------ -- Set the value of SRPHA_ENABLE in /etc/infiniband/openib.conf to "yes". - Also make sure SRP_LOAD=yes and SRP_DAEMON_ENABLE=yes. 
- -- From the next loading of the driver it will be possible to access the SRP - LUNs on /dev/mapper/ - NOTE: It is possible that regular (not SRP) LUNs may also be present; - the SRP LUNs may be identified by their name. - -- It is possible to see the output of the SRP daemon in /var/log/srp_daemon.log - - -============================================================================== -12. Shutting Down SRP -============================================================================== - -SRP can be shutdown by using "modprobe -r ib_srp", or by stopping the OFED -("/etc/init.d/openibd stop"), or as a by-product of a complete system shutdown. - -Prior to shutting down SRP, it is REQUIRED to remove all references to it. -The actions you need to take depend on the way SRP was loaded. There are -three cases. - -a. Without High Availability ------------------------------------- -When working without High Availability, you should unmount all SRP -partitions that were mounted prior to shutting down SRP. -For example, /dev/sdd1 is srp partition and mount to /mnt/test -$ umount /mnt/test -$ modprobe -r ib_srp - -NOTES: the umount may get stuck ~90 seconds per connection to target if the - target is down. This is due to the srp_dev_loss_tmo=60 seconds which - srp driver waits for the target coming back before returning error - status. - If you have shutdown/remove srp target and the host have 4 connections - to the SRP target, you should wait ~4-5 minutes for the unmount to exit. - Do not ctrl+c to kill umount process. - -b. After Manual Activation of High Availability ------------------------------------------------ -If you manually activated SRP High Availability, perform the following steps: -- Unmount all SRP partitions that were mounted -- Kill all SRP daemon instances. -- Make sure there are no multipath instances running. If there are multiple - instances, wait for them to end or kill them. 
-- Execute multipath -F - -Example: -$ umount /mnt/test1 /mnt/test2 (wait for it to exit, do not ctrl+c) -$ ps -ax and kill all srp_daemon processes. -$ multipath -ll (wait for it to exit, do not ctrl+c) -$ multipath -F -$ modprobe -r ib_srp - -c. After Automatic Activation of High Availability --------------------------------------------------- -If SRP High Availability was automatically activated, SRP shutdown must be -part of the driver shutdown ("/etc/init.d/openibd stop") which performs -steps 2-5 of case (b) above. However, you still have to unmount all SRP -partitions that were mounted before driver shutdown. - - -HAL Issue ---------- -The HAL (Hardware Abstraction Layer) system includes a daemon that examines -all devices in the system. In this process, it frequently holds a reference -to the ib_srp module. If you attempt to shutdown SRP while this daemon is -holding a reference to ib_srp, the shutdown will fail. Therefore, you -should make sure this will not occur. One solution may be to stop "haldaemon" -(/etc/init.d/haldaemon stop) prior to SRP shutdown. - - -============================================================================== -13. Known Issues -============================================================================== - -- There is a very rare race condition which can cause the SRP daemon to miss a - target that joins the fabric. The race can occur if a target joins and leaves - the fabric several times in a short time (e.g., if the cable is not connected - well). In such a case, the SM may ignore this quick change of state and may - not send an InformInfo to the srp_daemon. - - Workaround: Execute the srp_daemon command with the -R option. This - option causes the SRP daemon to perform a full rescan of the fabric every - seconds. - -- The srp_daemon does not support different pkeys other than the default - pkey=ffff - -- It is recommended to use an SM that supports the enhanced capability mask - matching feature (errata MGTWG8372). 
With SMs which support this feature, the - SRP daemon generates significantly less communication traffic. - -- When booting OFED with SRP High Availability enabled, executing multipath for - all LUNs on all connections may take some time (several minutes). However, it - is possible to start working while this process is in progress. - -- Stopping the driver while SRP High Availability is enabled kills all - multipath processes. Consider appropriate actions in case multipath is used - for other purposes. - -- As High Availability is based on Device Mapper multipath, it embodies - multipath limitations and also its configuration and tuning options. - See http://christophe.varoqui.free.fr/wiki/wakka.php?wiki=Home - for information on multipath. - To modify and tune multipath configuration, edit the file /etc/multipath.conf - according to instructions and tips listed in - /usr/share/doc/packages/multipath-tools/multipath.conf.* - -- In case your topology has two physical connections (i.e., network paths) from - a single initiator IB port to two different IB ports on the same Target HCA, - and you wish to have an SRP connection on the one path coexist with an SRP - connection on the second path, you must set a different initiator_ext value - on each path. See Section 10, "Multiple Connections from Initiator IB Port - to the Target" for details. - -- The srp_daemon tool reads by default the configuration file - /etc/srp_daemon.conf. In case this configuration file disallows connecting - to a certain target, srp_daemon will ignore the target. If you find out - that srp_daemon ignores a target, please check the /etc/srp_daemon.conf file. - -- When rebooting the system with an uncleanly mounted filesystem and a dead - connection to the SRP target, the system may get stuck. - -- After establishing the connection with the srp target and rebooting the system, - the initiator will fail to connect to the target at the first manual *echo -n* command - (the target rejects it as a stale connection). 
You need to do *echo -n* one more - time. - You do not see this problem with srp_daemon mode since srp_daemon will - retry to connect. - -- The combination of "weak" single lun srp target, I/O with big block size, - default max_command_per_lun=63 while using /dev/urandom to create file with - ext3 fs on srp lun, may cause ext3 remount with "read-only" flag - ie. - Example: - sdb1 is first partition of srp lun sdb, ext3 fs is created - $ mount /dev/sdb1 /mnt/sdb1; cd /mnt/sdb1 - $ dd if=/dev/urandom of=10G-file bs=1G count=10 - --> ext3 fs may remount with read-only flag - - Workarounds: - ------------ - a. Log into the target with small max_command_per_lun (3,4,8) - $ echo id_ext=0002c9030008fc0c,ioc_guid=0002c9030008fc0c, - dgid=fe800000000000000002c90300084417,max_cmd_per_lun=4,pkey=ffff, - service_id=0002c9030008fc0c > /sys/class/infiniband_srp/srp-mlx4_0-1/add_target - - ----------------OR------------------- - - b. Run dd with /dev/zero instead of /dev/urandom - $ dd if=/dev/zero of=10G-file bs=1G count=10 - - ----------------OR------------------- - - c. Run dd with smaller block size - $ dd if=/dev/urandom of=10G-file bs=128K count=40000 - - ----------------OR------------------- - - d. Combine the a,b,c steps (This is the recommended workaround) - -============================================================================== -14. Vendor Specific Notes -============================================================================== - -Hosts connected to Qlogic SRP Targets must perform one of the following -steps after upgrading to OFED 1.3.1 to continue accessing their storage -successfully: - -1. When issuing the "echo" command to add a new SRP Target, the host - must append the string ",initiator_ext=0000000000000001" to the original - echo string. 
- Example: - 'ibsrpdm -c' output is as follows: - - id_ext=0000000000000001,ioc_guid=00066a0138000165,dgid=fe8000000000000 - 000066a0260000165,pkey=ffff,service_id=0000494353535250,io_class=ff00 - - id_ext=0000000000000001,ioc_guid=00066a0238000165,dgid=fe8000000000000 - 000066a0260000165,pkey=ffff,service_id=0000494353535250,io_class=ff00 - - To connect to the first target, the echo command must be: - - echo -n \ - id_ext=0000000000000001,ioc_guid=00066a0138000165,\ - dgid=fe8000000000000000066a0260000165,pkey=ffff,\ - service_id=0000494353535250,io_class=ff00,\ - initiator_ext=0000000000000001 > \ - /sys/class/infiniband_srp/srp-mthca0-1/add_target - - -2. Change the SRP map on the Qlogic SRP Target to set the expected initiator - extension to 0. For details on how to change the SRP map on a Qlogic SRP - Target, please refer to product documentation. - - diff --git a/uDAPL_release_notes.txt b/uDAPL_release_notes.txt deleted file mode 100644 index 60b83bb..0000000 --- a/uDAPL_release_notes.txt +++ /dev/null @@ -1,1482 +0,0 @@ - Release Notes for - OFED 1.5.1 DAPL Release - March 2010 - - This release of the uDAPL reference implementation package for both - DAT 1.2 and 2.0 specification is timed to coincide with OFED release - of the Open Fabrics (www.openfabrics.org) software stack. - - uDAPL v1 (1.2.16-1) and v2 (2.0.27-1) - - ---------------- - - * New Features (v2 only) - UCM provider with IB UD based CM per process. - More scalable than rdma_cm (cma) or socket cm (scm). - ---------------- - - * Provider descriptions and PROS/CONS (cma, scm, ucm) - - 1. CMA - uses OFA rdma_cm to setup QP's. IPoIB, ARP, and SA queries required. - - Provider name: ofa-v2-cma - PROs: OFA rdma_cm has the most testing across many applications. - Supports both iWARP and IB. - - CONs: Serialization of conn processing with kernel based CM service - Requires IPoIB ARP for name resolution, storms - Requires SA for path record queries for IB fabrics. 
- Conn Request private data limited to 52 bytes. - - Settings for larger clusters (512+ cores): - - setenv DAPL_CM_ROUTE_TIMEOUT_MS 20000 - setenv DAPL_CM_ARP_TIMEOUT_MS 10000 - - 2. SCM - uses sockets to exchange QP information. IPoIB, ARP, and SA queries NOT required. - - Provider name (connectx): ofa-v2-mlx4_0-1 - PROs: Each rank has own instance of socket cm. More private data with requests. - Doesn't require path-record lookup. - - CONs: Socket resources grow with scale-out, serialization of - connections with kernel based tcp sockets, - Competes for MPI socket resources/port space and other TCP applications. - Sockets remain in TIMEWAIT state for minutes after closure. - Requires ARP for name resolution. - Doesn't support iWARP devices. - - Settings for larger clusters (512+ cores): - - setenv DAPL_ACK_RETRY 7 /* IB RC Ack retry count */ - setenv DAPL_ACK_TIMER 20 /* IB RC Ack retry timer */ - - 3. UCM - uses IB UD QP to exchange QP info. Sockets, ARP, IPoIB, and SA queries NOT required. - - Provider name (connectx): ofa-v2-mlx4_0-1u - PROs: Each rank has own instance of CM in user process - Resources fixed per rank regardless of scale-out size - No serialization of user or kernel resources establishing connections, - Simple 3-way msg handshake, CM messages fit in inline data for lowest message latency, - Supports alternate paths - No address resolution required. - No path resolution required. - - CONs: New provider with limited testing, a little tougher to debug. 
- Doesn't support iWARP - - Settings for larger clusters (512+ cores): - - setenv DAPL_UCM_REP_TIME 800 /* REQUEST timer, waiting for REPLY in millisecs */ - setenv DAPL_UCM_RTU_TIME 400 /* REPLY timer, waiting for RTU in millisecs */ - setenv DAPL_UCM_RETRY 15 /* REQUEST and REPLY retries */ - setenv DAPL_ACK_RETRY 7 /* IB RC Ack retry count */ - setenv DAPL_ACK_TIMER 20 /* IB RC Ack retry timer */ - - ---------------- - - * CM Performance: CPS profile for cma, scm, and ucm v2 uDAPL providers: - - Intel SR1600 Urbanna Servers with Xeon(R) CPU X5570 @ 2.93GHz - Urbanna Platform - 2 node, 8 cores per node, Mellanox MLX4 IB QDR, no switch. - - dtestcm (server/client): - - cma: Connections: 183.21 usec, CPS 5458.31 Total 0.18 secs, poll_cnt=3403, Num=1000 - scm: Connections: 178.80 usec, CPS 5592.93 Total 0.18 secs, poll_cnt=2344, Num=1000 - ucm: Connections: 122.43 usec, CPS 8167.93 Total 0.12 secs, poll_cnt=2609, Num=1000 - - dapl_cm_bw: MPI uDAPL/CM profiling application (all-to-all connections, all ranks) - - CMA - 2 Connect times (10): Total 0.0020 per 0.0002 CPS=4997.98 - 4 Connect times (40): Total 0.0077 per 0.0002 CPS=5224.59 - 8 Connect times (240): Total 0.0276 per 0.0001 CPS=8710.76 - 16 Connect times (1120): Total 0.1194 per 0.0001 CPS=9379.37 - 32 Connect times (4800): Total 6.1949 per 0.0013 CPS=774.83 - - SCM - 2 Connect times (10): Total 0.0024 per 0.0002 CPS=4103.61 - 4 Connect times (40): Total 0.0060 per 0.0002 CPS=6622.41 - 8 Connect times (240): Total 0.0206 per 0.0001 CPS=11634.15 - 16 Connect times (1120): Total 9.0118 per 0.0080 CPS=124.28 - 32 Connect times (4800): Total 21.0198 per 0.0044 CPS=228.36 - - UCM - 2 Connect times (10): Total 0.0014 per 0.0001 CPS=7353.27 - 4 Connect times (40): Total 0.0045 per 0.0001 CPS=8816.19 - 8 Connect times (240): Total 0.0191 per 0.0001 CPS=12582.44 - 16 Connect times (1120): Total 0.0799 per 0.0001 CPS=14017.68 - 32 Connect times (4800): Total 0.3337 per 0.0001 CPS=14385.21 - - ---------------- - - * Bug 
Fixes - - V2.0 Package - - Release 2.0.27 - windows: add scm makefile - windows does not require rdma_cma_abi.h, move the include from common code - windows patch to fix IB_INVALID_HANDLE name collision - scm: dat_ep_connect fails on 32bit servers - undefined symbol: dapls_print_cm_list - cleanup CM object lock before freeing CM object memory - destroy verbs completion channels created via ia_open or ep_create. - package: update Copyright file and include the 3 license files in distribution - common: when copying private_data out of rdma_cm events, use the - cma: fix referencing freed address - dapl: move close device after async thread is done - - Release 2.0.26 - openib_common: add check for both gid and global routing in RTR - openib_common: remote memory read privilege set multi times - ucm, scm: DAPL_GLOBAL_ROUTING enabled causes segv - - Release 2.0.25 - winof scm: initialize opt for NODELAY setsockopt - winof cma: windows definition for EADDRNOTAVAIL missing - scm: client side setsockopt NODELAY fails if data arrives before setting - cma: setup_listener Cannot assign requested address - common: seg fault in dapl_evd_wait with multi-thread application using CNO's. - ucm: inbound DREQ/DREP handshake should transition QP. - winof: Remove duplicate include of comp_channel.cpp from cm.c as it is - included in opensm_ucb/device.c. - - Release 2.0.24 - winof: Utilize WinOF version of inet_ntop() for Windows OSes which do not - support inet_ntop(). - ucm: windows build issue with new CQ completion channel - winof: add ucm provider to windows build - winof: add missing build files for ibal, scm - scm: connection peer resets under heavy load, incorrect event on error - ucm: increase default reply and rtu timeout values. - ucm: change some debug message levels and add check for valid UD REPLY during retries. - ucm: increase timers during subsequent retries - ucm, scm: address handles need destroyed when freeing Endpoints with UD QP's. 
- openib_common: ignore pd free errors, clear pd_handle and return. - ucm: using UD type QP's, ucm reports wrong reject event when user rejects AH resolution request. - ucm, scm, cma: Fix CNO support on DTO type EVD's - ucm: fix lock init bug in ucm_cm_find - ucm: fix build problem with latest windows ucm changes - ucm: The HCA should not be closed until all resources have been released. - ucm: Fix build warning when compiling on 32-bit systems. - ucm: Trying to deregister the same memory region twice leads to an - dat: reduce debug message level when parsing for location of dat.conf - ucm: update ucm provider for windows environment - ucm: add timer/retry CM logic to the ucm provider - - Release 2.0.23 - cma: cannot reuse the cm_id and qp for new connection, must reallocate a new one. - scm, cma: update DAPL cm protocol revision with latest address/port changes - ucm: modify IB address format to align better with sockaddr_in6 - Add definition for getpid similar to that used by the other dtest apps. - WinOF provides a common implementation of gettimeofday that should - The completion manager was updated to provide an abstraction that - dtestcm: remove IB verb definitions - dtest, dtestx: remove IB verb definitions - scm: tighten up socket options to insure similiar behavior on Windows and Linux. - cma: improve serialization of destroy and event processing - scm: improve serialization of destroy and state changes - common: no cleanup/release code for timer thread - scm, cma: dapli_thread doesn't always get teminated on library close. - ucm: tighten up locking with CM processing, state changes - ucm: For UD type QP's, return CR p_data with CONN_EST event on passive side. - ucm: cleanup extra cr/lf - ucm: fix issues with UD QP's. - winof: Convert windows version of dapl and dat libaries to use private heaps. - dtest, dtestx: modifications for UD QP testing with ucm provider. - scm, ucm: UD QP support was broken when porting to common openib code base. 
- cma: cleanup warning with unused local variable, ret, in disconnect - cma: remove debug message after rdma_disconnect failure - scm: socket errno check needs O/S dependent wrapper - dapltest: update script files for WinOF - cma: conditional check for new rdma_cm definition. - - Release 2.0.22 - dapltest: add mdep processor yield and use with dapltest - ucm: Add new provider using a DAPL based IB-UD cm mechanism for MPI implementations. - - Release 2.0.21 - scm: Fix disconnect. QP's need to move to ERROR state in - modify dtest.c to cleanup CNO wait code and consolidate into - CNO events, once triggered will not be returned during the cno wait. - CNO support broken in both CMA and SCM providers. - common osd: include winsock2.h for IPv6 definitions. - common osd: include w2tcpip.h for sockaddr_in6 definitions. - DAPL introduced the concept of directly waiting on the CQ for - dapltest: Implement a malloc() threshold for the completion reaping. - scm: handle connected state when freeing CM objects - scm, dtest: changes for winof gettimeofday and FD_SETSIZE settings. - scm: set TCP_NODELAY sockopt on the server side for sends. - remove obsolete files in dapl/udapl source tree - dtestcm: add UD type QP option to test - scm: destroy QP called before disconnect - cma: add support for rdma_cm TIME_WAIT event. - scm: remove old udapl_scm code replaced by openib_scm. - winof: fix issues after consolidating cma, scm code base. - cma: lock held when exiting as a result of a rdma_create_event_channel failure. - windows: all dlist functions have been moved to the header file. - dtestcm windows: add build infrastructure for new dtestcm test suite - openib_common: reorganize code base to share common mem, cq, qp, dto functions - scm: fixes and optimizations for connection scaling - scm: double the default fd_set_size - scm: EP reference in CR should be cleared during ep_destroy - dtestx: fix conn establishment event checking - dtestcm: new test to measure dapl connection rates. 
- - Release 2.0.20 - common,scm: add debug capabilities to print in-process CM lists - scm: disconnect EP before cleaning up orphaned CR's during dat_ep_free - dapltest: windows scripts updated - scm: private data is not handled properly via CR rejects. - scm: cleanup orphaned UD CR's when destroying the EP - scm: provider specific query for default UD MTU is wrong. - scm: update CM code to shutdown before closing socket - dapltest: windows script dt-cli.bat updated - dapl/windows cma provider: add support for network devices based on index - openib: remove 1st gen provider, replaced with openib_cma and openib_scm - dapltest: update windows script files - dapltest: windows batch files in sripts directory - windows_osd/linux_osd: new dapl_os_gettid macro to return thread id - windows: missing build files for common and udapl sub-directories - windows: add build files for openib_scm, remove /Wp64 build option. - scm: multi-hca CM processing broken. Need cr thread wakeup mechanism per HCA. 
- dtest: add connection timers on client side - linux_osd: use pthread_self instead of getpid for debug messages - windows ibal-scm: dapl/dirs file needs updated to remove ibal-scm - - v1.2 Package: - - Release 1.2.16 - package: update Copyright file and include the 3 license files in distribution - cma: max sge incorrectly decremented during ibv_device_query - - Release 1.2.15 - dtest, dapltest: conflict with dapl-2 utils package, change to dapl1, dapltest1 - scm: fix compiler warning, unused variable - - ---------------- - - * Build Notes: - - # NON_DEBUG build/install example for x86_64, OFED targets - ./configure --prefix /usr --sysconf=/etc --libdir /usr/lib64 LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include" - make install - - # DEBUG build/install example for x86_64, using OFED targets - ./configure --enable-debug --prefix /usr --sysconf=/etc --libdir /usr/lib64 LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include" - make install - - # COUNTERS build/install example for x86_64, using OFED targets - ./configure --prefix /usr --sysconf=/etc --libdir /usr/lib64 LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include -DDAPL_COUNTERS" - make install - - ---------------- - - * BKM for running new DAPL library on your cluster without any impact on existing OFED installation: - - Note: example for user /home/ardavis, (assumes /home/ardavis is exported) and MLX4 adapter, port 1 - - Download latest 2.x package: http://www.openfabrics.org/downloads/dapl/dapl-2.0.25.tar.gz - - untar in /home/ardavis - cd /home/ardavis/dapl-2.0.25 - ./configure && make (build on node with OFED 1.3 or higher installed, dependency on verb/rdma_cm libraries) - - create /home/ardavis/dat.conf with following 3 lines. 
(entries with path to new libraries): - - ofa-v2-ib0 u2.0 nonthreadsafe default /home/ardavis/dapl-2.0.25/dapl/udapl/.libs/libdaplcma.so.1 dapl.2.0 "ib0 0" "" - ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default /home/ardavis/dapl-2.0.25/dapl/udapl/.libs/libdaploscm.so.2 dapl.2.0 "mlx4_0 1" "" - ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default /home/ardavis/dapl-2.0.25/dapl/udapl/.libs/libdaploucm.so.2 dapl.2.0 "mlx4_0 1" "" - - Run uDAPL application or an MPI that uses uDAPL, with (assuming MLX4 connectx adapters) following: - - setenv DAT_OVERRIDE /home/ardavis/dat.conf - - If running Intel MPI and uDAPL socket cm, set the following: - - setenv I_MPI_DEVICE rdssm:ofa-v2-mlx4_0-1 - - or if running Intel MPI and uDAPL IB UD cm, set the following: - - setenv I_MPI_DEVICE rdssm:ofa-v2-mlx4_0-1u - - or if running Intel MPI and uDAPL rdma_cm, set the following: - - setenv I_MPI_DEVICE rdssm:ofa-v2-ib0 - -------------------------- - - OFED 1.4.1 RELEASE NOTES - - NEW SINCE OFED 1.4 - new versions of uDAPL v1 (1.2.14-1) and v2 (2.0.19-1) - - * New Features - optional counters, must be configured/built with -DDAPL_COUNTERS - - * Bug Fixes - - v2 - scm, cma: dat max_lmr_block_size is 32 bit, verbs max_mr_size is 64 bit - v2 - scm, cma: use direct SGE mappings from dat_lmr_triplet to ibv_sge - v2 - dtest: add flush EVD call after data transfer errors - v2 - scm: increase default MTU size from 1024 to 2048 - v2 - dapltest: reset server listen ports to avoid collisions during long runs - v2 - dapltest: avoid duplicating ports, increment based on ep/thread count - v2 - dapltest: fix assumptions that multiple EP's will connect in order - v2 - common: sync missing with when removing items off of EVD pending queue - v2 - scm: reduce open time with thread start up - v2 - scm: getsockopt optlen needs initialized to size of optval - v2 - scm: cr_thread cleanup - v2 - OFED and WinOF code sync - v2 - scm: remove unnecessary query gid/lid from connection phase code. 
- v2 - scm: add optional 64-bit counters, build with -DDAPL_COUNTERS. - v1,v2 - spec files missing Requires(post) statements for sed/coreutils - v1,v2 - dtest/dapltest: use $(top_builddir) for .la files during test builds - v1,v2 - scm: remove unecessary thread when using direct objects - v1,v2 - Fix SuSE 11 build issues, asm/atomic.h no longer exists - - * Build Notes: - - # NON_DEBUG build/install example for x86_64, OFED targets - ./configure --prefix /usr --sysconf=/etc --libdir /usr/lib64 LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include" - make install - - # DEBUG build/install example for x86_64, using OFED targets - ./configure --enable-debug --prefix /usr --sysconf=/etc --libdir /usr/lib64 LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include" - make install - - # COUNTERS build/install example for x86_64, using OFED targets - ./configure --prefix /usr --sysconf=/etc --libdir /usr/lib64 LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include -DDAPL_COUNTERS" - make install - - * BKM for running new DAPL library on your cluster without any impact on existing OFED installation: - - Note: example for user /home/ardavis, (assumes /home/ardavis is exported) and MLX4 adapter, port 1 - - Download latest 2.x package: http://www.openfabrics.org/downloads/dapl/dapl-2.0.19.tar.gz - - untar in /home/ardavis - cd /home/ardavis/dapl-2.0.19 - ./configure && make (build on node with OFED 1.3 or higher installed, dependency on verb/rdma_cm libraries) - - create /home/ardavis/dat.conf with following 2 lines. 
(entries with path to new libraries): - - ofa-v2-ib0 u2.0 nonthreadsafe default /home/ardavis/dapl-2.0.19/dapl/udapl/.libs/libdaplcma.so.1 dapl.2.0 "ib0 0" "" - ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default /home/ardavis/dapl-2.0.19/dapl/udapl/.libs/libdaploscm.so.2 dapl.2.0 "mlx4_0 1" "" - - Run uDAPL application or an MPI that uses uDAPL, with (assuming MLX4 connectx adapters) following: - - setenv DAT_OVERRIDE /home/ardavis/dat.conf - - If running Intel MPI and uDAPL socket cm, set the following: - - setenv I_MPI_DEVICE rdssm:ofa-v2-mlx4_0-1 - - if running Intel MPI and uDAPL rdma_cm, set the following: - - setenv I_MPI_DEVICE rdssm:ofa-v2-ib0 - -------------------------- - - OFED 1.4 RELEASE NOTES - - NEW SINCE OFED 1.3.1 - new versions of uDAPL v1 (1.2.12-1) and v2 (2.0.15-1) - - * New Features - - 1. The new socket CM provider, introduced in 1.2.8 and 2.0.11 packages, - assumes homogeneous cluster and will setup the QP's based on local HCA port - attributes and exchanges QP information via sockets using the hostname of - each node. IPoIB and rdma_cm are NOT required for this provider. QP attributes - can be adjusted via the following environment parameters: - - DAPL_ACK_TIMER (default=16 5 bits, 4.096us*2^ack_timer. 16 == 268ms) - DAPL_ACK_RETRY (default=7 3 bits, 7 * 268ms = 1.8 seconds) - DAPL_RNR_TIMER (default=12 5 bits, 12 == 64ms, 28 == 163ms, 31 == 491ms) - DAPL_RNR_RETRY (default=7 3 bits, 7 == infinite) - DAPL_IB_MTU (default=1024 limited to active MTU max) - - The new socket cm entries in /etc/dat.conf provide a link to the actual HCA - device and port. 
Example v1 and v2 entries for a Mellanox connectx device, port 1: - - OpenIB-mlx4_0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 1" "" - ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" "" - - This new socket cm provider, was successfully tested on the TATA CRL cluster - (#8 on Top500) with Intel MPI, achieving a HPLinpack score of 132.8TFlops on - 1798 nodes, 14384 cores at ~76.9% of peak. DAPL_ACK_TIMER was increased to 21 - for this scale. - - 2. New v2 definitions for IB unreliable datagram extension (only supported in - scm provider, libdaploscm.so.2) - - Extended EP dat_service_type, with DAT_IB_SERVICE_TYPE_UD - Add IB extension call dat_ib_post_send_ud(). - Add address handle definition for UD calls. - Add IB event definitions to provide remote AH via connect and connect requests - See dtestx (-d) source for example usage model - - * Bug Fixes - - v1,v2 - dapltest: trans test moves to cleanup stage before rdma_read processing is complete - v1,v2 - Fix static registration (dat.conf) to include sysconfdir override - v1,v2 - dat.conf: add default iwarp entry for eth2 - v1,v2 - dapl: adjust max_rdma_read_iov to 1 for iWARP devices - v1,v2 - dtest: reduce default IOV's for ep_create to support iWARP - v1,v2 - dtest: fix 32-bit build issues - v1,v2 - build: $(DESTDIR) prepend needed on install hooks for dat.conf - v2 - scm: UD shares EP;s which requires serialization - v2 - dapl: fixes for IB UD extensions in common code and socket cm provider. - v2 - dapl: add provider specific attribute query option for IB UD MTU size - v2 - dapl build: add correct CFLAGS, set non-debug build by default for v2 - v2 - dtestx: fix stack corruption problem with hostname strcpy - v2 - dapl extension: dapli_post_ext should always allocate cookie for requests. 
- v2 - dapltest: manpage - rdma write example incorrect - v1,v2 - dat, dapl, dtest, dapltest, providers: fix compiler warnings in dat common code - v1,v2 - dapl cma: debug message during query needs definition for inet_ntoa - v1,v2 - dapl scm: fix corner case that delivers duplicate disconnect events - v1,v2 - dat: include stddef.h for NULL definition in dat_platform_specific.h - v1,v2 - dapl: add debug messages during async and overflow events - v1,v2 - dapltest: add check for duplicate disconnect events in transaction test - v1,v2 - dapl scm: use correct device attribute for max_rdma_read_out, max_qp_init_rd_atom - v1,v2 - dapl scm: change IB RC qp inline and timer defaults. - v1,v2 - dapl scm: add mtu adjustments via environment, default = 1024. - v1,v2 - dapl scm: change connect and accept to non-blocking to avoid blocking user thread. - v1,v2 - dapl scm: update max_rdma_read_iov, max_rdma_write_iov EP attributes during query - v1,v2 - dat: allow TYPE_ERR messages to be turned off with DAT_DBG_TYPE - v1,v2 - dapl: remove needless terminating 0 in dto_op_str functions. - v1,v2 - dat: remove reference to doc/dat.conf in makefile.am - v1,v2 - dapl scm: fix ibv_destroy_cq busy error condition during dat_evd_free. - v1,v2 - dapl scm: add stdout logging for uname and gethostbyname errors during open. - v1,v2 - dapl scm: support global routing and set mtu based on active_mtu - v1,v2 - dapl: add opcode to string function to report opcode during failures. 
- v1,v2 - dapl: remove unused iov buffer allocation on the endpoint - v1,v2 - dapl: endpoint pending request count is wrong - -------------------------- - - OFED 1.3.1 RELEASE NOTES - - NEW SINCE OFED 1.3 - new versions of uDAPL v1 (1.2.7-1) and v2 (2.0.9-1) - - * New Features - None - - * Bug Fixes - v2 - add private data exchange with reject - v1,v2 - better error reporting in non-debug builds - v1,v2 - update only OFA entries in dat.conf, cooperate with non-ofa providers - v1,v2 - support for zero byte operations, iov==NULL - v1,v2 - multi-transport support for inline data and private data differences - v1,v2 - fix memory leaks and other reported bugs since OFED 1.3 - v1,v2 - dtest,dtestx,dapltest build issues on RHEL5.1 - v1,v2 - long delay during dat_ia_open when DNS not configured - v1,v2 - use rdma_read_in/out from ep_attr per consumer instead of HCA max - -------------------------- - - OFED 1.3 RELEASE NOTES - - NEW SINCE OFED 1.2 - - * New Features - - 1. Add v2.0 library support for new 2.0 API Specification - 2. Separate v1.2 library release to co-exist with v2.0 libraries. - 3. New dat.conf with both 1.2 and 2.0 support - 4. 
New v2.0 dtestx utilities to test IB extensions - - * Bug Fixes - - v1.2 and v2.0 - - uDAT: static/dynamic registry parsing fixes - - uDAPL: provider fixes for dat_psp_create_any - - dtest/dapltest: change default provider names to sync with dat.conf - - openib_cma: issues with destroy_cm_id and init/resp exchange - - dapltest: use gettimeofday instead of get_cycles for better portability - - dapltest: endian issue with mem_handle, mem_address - - dapltest fix to include inet_ntoa definitions - - fix build problems on 32-bit and 64-bit PowerPC - - cleanup packaging - - v2.0 - - set default config options to match spec file, --enable-debug --enable-ext-type=ib - - use unique devel target names, libdat2.so, /usr/include/dat2 - - dtestx fix memory leak, freeaddrinfo after getaddrinfo - - Fix for IB extended DTO cookie deallocation on inbound rdma_Write_immed - - WinOF: Update OFED code base to include WinOF changes, work from same code base - - WinOF: add DAT_API definition, __stdcall for windows, nothing for linux - - dtest: add dat_evd_query to check correct size - - openib_cma: add macro to convert SID to PORT - - dtest: endian support for exchanging RMR info - - openib_cma: lower default settings, inline and RDMA init/resp - - openib_cma: missing ia_query for max_iov_segments_per_rdma_write - - v1.2 - - openib_cma: turn down dbg noise level on rejects - - dtest: typo in memset - - - BUILD: v1 and v2 uDAPL source install/build instructions (redhat example): - - # cd to distribution SRPMS directory - cd /tmp/OFED-1.3/SRPMS - rpm -i dapl-1.2*.rpm - rpm -i dapl-2.0*.rpm - cd /usr/src/redhat/SOURCES - tar zxf dapl-1.2*.tgz - tar zxf dapl-2.0*.tgz - - # NON_DEBUG build example for x86_64, using OFED targets - - ./configure --prefix /usr --sysconf=/etc --libdir /usr/lib64 - LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include" - - # build and install - - make - make install - - # DEBUG build example for x86_64, using OFED targets - - ./configure --enable-debug --prefix /usr 
--sysconf=/etc --libdir /usr/lib64 - LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include" - - # build and install - - make - make install - - # DEBUG messages: set environment variable DAPL_DBG_TYPE, default - mapping is 0x0003 - - DAPL_DBG_TYPE_ERR = 0x0001, - DAPL_DBG_TYPE_WARN = 0x0002, - DAPL_DBG_TYPE_EVD = 0x0004, - DAPL_DBG_TYPE_CM = 0x0008, - DAPL_DBG_TYPE_EP = 0x0010, - DAPL_DBG_TYPE_UTIL = 0x0020, - DAPL_DBG_TYPE_CALLBACK = 0x0040, - DAPL_DBG_TYPE_DTO_COMP_ERR= 0x0080, - DAPL_DBG_TYPE_API = 0x0100, - DAPL_DBG_TYPE_RTN = 0x0200, - DAPL_DBG_TYPE_EXCEPTION = 0x0400, - DAPL_DBG_TYPE_SRQ = 0x0800, - DAPL_DBG_TYPE_CNTR = 0x1000 - -------------------------- - - OFED 1.2 RELEASE NOTES - - NEW SINCE Gamma 3.2 and OFED 1.1 - - * New Features - - 1. Added dtest and dapltest to the openfabrics build and utils rpm. - Includes manpages. - 2. Added the following environment variables to configure connection management - timers (default settings) for larger clusters: - - DAPL_CM_ARP_TIMEOUT_MS 4000 - DAPL_CM_ARP_RETRY_COUNT 15 - DAPL_CM_ROUTE_TIMEOUT_MS 4000 - DAPL_CM_ROUTE_RETRY_COUNT 15 - - * Bug Fixes - - + Added support for new ib verbs client register event. No extra - processing required at the uDAPL level. - + Fix some issues supporting create qp without recv cq handle or - recv qp resources. IB verbs assume a recv_cq handle and uDAPL - dapl_ep_create assumes there is always recv_sge resources specified. - + Fix some timeout and long disconnect delay issues discovered during - scale-out testing. Added support to retry rdma_cm address and route - resolution with configuration options. Provide a disconnect call - when receiving the disconnect request to guarantee a disconnect reply - and event on the remote side. The rdma_disconnect was not being called - from dat_ep_disconnect() as a result of the state changing - to DISCONNECTED in the event callback. 
- + Changes to support exchange and validation of the device - responder_resources and the initiator_depth during conn establishment - + Fix some build issues with dapltest on 32 bit arch, and on ia64 SUSE arch - + Add support for multiple IB devices to dat.conf to support IPoIB HA failover - + Fix atomic operation build problem with ia64 and RHEL5. - + Add support to return local and remote port information with dat_ep_query - + Cleanup RPM specfile for the dapl package, move to 1.2-1 release. - - NEW SINCE Gamma 3.1 and OFED 1.0 - - * BUG FIXES - - + Update obsolete CLK_TCK to CLOCKS_PER_SEC - + Fill out some uninitialized fields in the ia_attr structure returned by - dat_ia_query(). - + Update dtest to support multiple segments on rdma write and change - makefile to use OpenIB-cma by default. - + Add support for dat_evd_set_unwaitable on a DTO evd in openib_cma - provider - + Added errno reporting (message and return codes) during open to help - diagnose create thread issues. - + Fix some suspicious inline assembly EIEIO_ON_SMP and ISYNC_ON_SMP - + Fix IA64 build problems - + Lower the reject debug message level so we don't see warnings when - consumers reject. - + Added support for active side TIMED_OUT event from a provider. - + Fix bug in dapls_ib_get_dat_event() call after adding new unreachable - event. - + Update for new rdma_create_id() function signature. - + Set max rdma read per EP attributes - + Report the proper error and timeout events. - + Socket CM fix to guard against using a loopback address as the local - device address. - + Use the uCM set_option feature to adjust connect request timeout - retry values. - + Fix to disallow any event after a disconnect event. 
- - * OFED 1.1 uDAPL source build instructions: - - cd /usr/local/ofed/src/openib-1.1/src/userspace/dapl - - # NON_DEBUG build configuration - - ./configure --disable-libcheck --prefix /usr/local/ofed - --libdir /usr/local/ofed/lib64 LDFLAGS=-L/usr/local/ofed/lib64 - CPPFLAGS="-I../libibverbs/include -I../librdmacm/include" - - # build and install - - make - make install - - # DEBUG build configuration - - ./configure --disable-libcheck --enable-debug --prefix /usr/local/ofed - --libdir /usr/local/ofed/lib64 LDFLAGS=-L/usr/local/ofed/lib64 - CPPFLAGS="-I../libibverbs/include -I../librdmacm/include" - - # build and install - - make - make install - - # DEBUG messages: set environment variable DAPL_DBG_TYPE, default - mapping is 0x0003 - - DAPL_DBG_TYPE_ERR = 0x0001, - DAPL_DBG_TYPE_WARN = 0x0002, - DAPL_DBG_TYPE_EVD = 0x0004, - DAPL_DBG_TYPE_CM = 0x0008, - DAPL_DBG_TYPE_EP = 0x0010, - DAPL_DBG_TYPE_UTIL = 0x0020, - DAPL_DBG_TYPE_CALLBACK = 0x0040, - DAPL_DBG_TYPE_DTO_COMP_ERR= 0x0080, - DAPL_DBG_TYPE_API = 0x0100, - DAPL_DBG_TYPE_RTN = 0x0200, - DAPL_DBG_TYPE_EXCEPTION = 0x0400, - DAPL_DBG_TYPE_SRQ = 0x0800, - DAPL_DBG_TYPE_CNTR = 0x1000 - - - Note: The udapl provider library libdaplscm.so is untested and - unsupported, thus customers should not use it. - It will be removed in the next OFED release. - - DAPL GAMMA 3.1 RELEASE NOTES - - This release of the DAPL reference implementation - is timed to coincide with the first release of the - Open Fabrics (www.openfabrics.org) software stack. - This release adds support for this new stack, which - is now the native Linux RDMA stack. - - This release also adds a new licensing option. In - addition to the Common Public License and BSD License, - the code can now be licensed under the terms of the GNU - General Public License (GPL) version 2. 
- - NEW SINCE Gamma 3.0 - - - GPL v2 added as a licensing option - - OpenFabrics (aka OpenIB) gen2 verbs support - - dapltest support for Solaris 10 - - * BUG FIXES - - + Fixed a disconnect event processing race - + Fix to destroy all QPs on IA close - + Removed compiler warnings - + Removed unused variables - + And many more... - - DAPL GAMMA 3.0 RELEASE NOTES - - This is the first release based on version 1.2 of the spec. There - are some components, such as shared receive queues (SRQs), which - are not implemented yet. - - Once again there were numerous bug fixes submitted by the - DAPL community. - - NEW SINCE Beta 2.06 - - - DAT 1.2 headers - - DAT_IA_HANDLEs implemented as small integers - - Changed default device name to be "ia0a" - - Initial support for Linux 2.6.X kernels - - Updates to the OpenIB gen 1 provider - - * BUG FIXES - - + Updated Makefile for differentiation between OS releases. - + Updated atomic routines to use appropriate API - + Removed unnecessary assert from atomic_dec. - + Fixed bugs when freeing a PSP. - + Fixed error codes returned by the DAT static registry. - + Kernel updates for dat_strerror. - + Cleaned up the transport layer/adapter interface to use DAPL - types rather than transport types. - + Fixed ring buffer reallocation. - + Removed old test/udapl/dapltest directory. - + Fixed DAT_IA_HANDLE translation (from pointer to int and - vice versa) on 64-bit platforms. - - DAPL BETA 2.06 RELEASE NOTES - - We are not planning any further releases of the Beta series, - which are based on the 1.1 version of the spec. There may be - further releases for bug fixes, but we anticipate the DAPL - community to move to the new 1.2 version of the spec and the - changes mandated in the reference implementation. - - The biggest item in this release is the first inclusion of the - OpenIB Gen 1 provider, an item generating a lot of interest in - the IB community. This implementation has graciously been - provided by the Mellanox team. 
The kdapl implementation is in - progress, and we imagine work will soon begin on Gen 2. - - There are also a handful of bug fixes available, as well as a long - awaited update to the endpoint design document. - - NEW SINCE Beta 2.05 - - - OpenIB gen 1 provider support has been added - - Added dapls_evd_post_generic_event(), routine to post generic - event types as requested by some providers. Also cleaned up - error reporting. - - Updated the endpoint design document in the doc/ directory. - - * BUG FIXES - - + Cleaned up memory leak on close by freeing the HCA structure; - + Removed bogus #defs for rdtsc calls on IA64. - + Changed dapltest thread types to use internal types for - portability & correctness - + Various 64 bit enhancements & updates - + Fixes to conformance test that were defining CONN_QUAL twice - and using it in different ways - + Cleaned up private data handling in ep_connect & provider - support: we now avoid extra copy in connect code; reduced - stack requirements by using private_data structure in the EP; - removed provider variable. - + Fixed problem in the dat conformance test where cno_wait would - attempt to dereference a timer value and SEGV. - + Removed old vestiges of deprecated POLLING_COMPLETIONS - conditionals. - - DAPL BETA 2.05 RELEASE NOTES - - This was to be a very minor release, the primary change was - going to be the new wording of the DAT license as contained in - the header for all source files. But the interest and - development occurring in DAPL provided some extra bug fixes, and - some new functionality that has been requested for a while. - - First, you may notice that every single source file was - changed. If you read the release notes from DAPL BETA 2.04, you - were warned this would happen. There was a legal issue with the - wording in the header, the end result was that every source file - was required to change the word 'either of' to 'both'. 
We've - been putting this change off as long as possible, but we wanted - to do it in a clean drop before we start working on DAT 1.2 - changes in the reference implementation, just to keep things - reasonably sane. - - kdapltest has enabled three of the subtests supported by - dapltest. The Performance test in particular has been very - useful to dapltest in getting minima and maxima. The Limit test - pushes the limits by allocating the maximum number of specific - resources. And the FFT tests are also available. - - Most vendors have supported shared memory regions for a while, - several of which have asked the reference implementation team to - provide a common implementation. Shared memory registration has - been tested on ibapi, and compiled into vapi. Both InfiniBand - providers have the restriction that a memory region must be - created before it can be shared; not all RDMA APIs are this way, - several allow you to declare a memory region shared when it is - registered. Hence, details of the implementation are hidden in - the provider layer, rather than forcing other APIs to do - something strange. - - This release also contains some changes that will allow dapl to - work on Opteron processors, as well as some preliminary support - for Power PC architecture. These features are not well tested - and may be incomplete at this time. - - Finally, we have been asked several times over the course of the - project for a canonical interface between the common and - provider layers. This release includes a dummy provider to meet - that need. Anyone should be able to download the release and do - a: - make VERBS=DUMMY - - And have a cleanly compiled dapl library. This will be useful - both to those porting new transport providers, as well as those - going to new machines. - - The DUMMY provider has been compiled on both Linux and Windows - machines. - - - NEW SINCE Beta 2.4 - - kdapltest enhancements: - * Limit subtests now work - * Performance subtests now work. 
- FFT tests now work. - - - The VAPI headers have been refreshed by Mellanox - - - Initial Opteron and PPC support. - - - Atomic data types now have consistent treatment, allowing us to - use native data types other than integers. The Linux kdapl - uses atomic_t, allowing dapl to use the kernel macros and - eliminate the assembly code in dapl_osd.h - - - The license language was updated per the direction of the - DAT Collaborative. This two word change affected the header - of every file in the tree. - - - SHARED memory regions are now supported. - - - Initial support for the TOPSPIN provider. - - - Added a dummy provider, essentially the NULL provider. Its - purpose is to aid in porting and to clarify exactly what is - expected in a provider implementation. - - - Removed memory allocation from the DTO path for VAPI - - - cq_resize will now allow the CQ to be resized smaller. Not all - providers support this, but it's a provider problem, not a - limitation of the common code. - - * BUG FIXES - - + Removed spurious lock in dapl_evd_connection_callb.c that - would have caused a deadlock. - + The Async EVD was getting torn down too early, potentially - causing lost errors. Has been moved later in the teardown - process. - + kDAPL replaced mem_map_reserve() with newer SetPageReserved() - for better Linux integration. - + kdapltest no longer allocates large print buffers on the stack, - and is more careful to ensure buffers don't overflow. - + Put dapl_os_dbg_print() under DAPL_DBG conditional, it is - supposed to go away in a production build. - + dapltest protocol version has been bumped to reflect the - change in the Service ID. - + Corrected several instances of routines that did not adhere - to the DAT 1.1 error code scheme. - + Cleaned up vapi ib_reject_connection to pass DAT types rather - than provider specific types. Also cleaned up naming interface - declarations and their use in vapi_cm.c; fixed incorrect - #ifdef for naming. 
- + Initialize missing uDAPL provider attr, pz_support. - + Changes for better layering: first, moved - dapl_lmr_convert_privileges to the provider layer as memory - permissions are clearly transport specific and are not always - defined in an integer bitfield; removed common routines for - lmr and rmr. Second, move init and release setup/teardown - routines into adapter_util.h, which defined the provider - interface. - + Cleaned up the HCA name cruft that allowed different types - of names such as strings or ints to be dealt with in common - code; but all names are presented by the dat_registry as - strings, so pushed conversions down to the provider - level. Greatly simplifies names. - + Changed deprecated true/false to DAT_TRUE/DAT_FALSE. - + Removed old IB_HCA_NAME type in favor of char *. - + Fixed race condition in kdapltest's use of dat_evd_dequeue. - + Changed cast for SERVER_PORT_NUMBER to DAT_CONN_QUAL as it - should be. - + Small code reorg to put the CNO into the EVD when it is - allocated, which simplifies things. - + Removed gratuitous ib_hca_port_t and ib_send_op_type_t types, - replaced with standard int. - + Pass a pointer to cqe debug routine, not a structure. Some - clean up of data types. - + kdapl threads now invoke reparent_to_init() on exit to allow - threads to get cleaned up. - - - - DAPL BETA 2.04 RELEASE NOTES - - The big changes for this release involve a more strict adherence - to the original dapl architecture. Originally, only InfiniBand - providers were available, so allowing various data types and - event codes to show through into common code wasn't a big deal. - - But today, there are an increasing number of providers available - on a number of transports. Requiring an IP iWarp provider to - match up to InfiniBand events is silly, for example. - - Restructuring the code allows more flexibility in providing an - implementation. 
- - There are also a large number of bug fixes available in this - release, particularly in kdapl related code. - - Be warned that the next release will change every file in the - tree as we move to the newly approved DAT license. This is a - small change, but all files are affected. - - Future releases will also support the soon to be ratified DAT - 1.2 specification. - - This release has benefited from many bug reports and fixes from - a number of individuals and companies. On behalf of the DAPL - community, thank you! - - - NEW SINCE Beta 2.3 - - - Made several changes to be more rigorous on the layering - design of dapl. The intent is to make it easier for non - InfiniBand transports to use dapl. These changes include: - - * Revamped the ib_hca_open/close code to use an hca_ptr - rather than an ib_handle, giving the transport layer more - flexibility in assigning transport handles and resources. - - * Removed the CQD calls, they are specific to the IBM API; - folded this functionality into the provider open/close calls. - - * Moved VAPI, IBAPI transport specific items into a transport - structure placed inside of the HCA structure. Also updated - routines using these fields to use the new location. Cleaned - up provider knobs that have been exposed for too long. - - * Changed a number of provider routines to use DAPL structure - pointers rather than exposing provider handles & values. Moved - provider specific items out of common code, including provider - data types (e.g. ib_uint32_t). - - * Pushed provider completion codes and type back into the - provider layer. We no longer use EVD or CM completion types at - the common layer, instead we obtain the appropriate DAT type - from the provider and process only DAT types. - - * Change private_data handling such that we can now accommodate - variable length private data. - - - Remove DAT 1.0 cruft from the DAT header files. - - - Better spec compliance in headers and various routines. 
- - - Major updates to the VAPI implementation from - Mellanox. Includes initial kdapl implementation - - - Move kdapl platform specific support for hash routines into - OSD file. - - - Cleanups to make the code more readable, including comments - and certain variable and structure names. - - - Fixed CM_BUSTED code so that it works again: very useful for - new dapl ports where infrastructure is lacking. Also made - some fixes for IBHOSTS_NAMING conditional code. - - - Added DAPL_MERGE_CM_DTO as a compile time switch to support - EVD stream merging of CM and DTO events. Default is off. - - - 'Quit' test ported to kdapltest - - - uDAPL now builds on Linux 2.6 platform (SuSE 9.1). - - - kDAPL now builds for a larger range of Linux kernels, but - still lacks 2.6 support. - - - Added shared memory ID to LMR structure. Shared memory is - still not fully supported in the reference implementation, but - the common code will appear soon. - - * Bug fixes - - Various Makefiles fixed to use the correct dat registry - library in its new location (as of Beta 2.03) - - Simple reorg of dat headers files to be consistent with - the spec. - - fixed bug in vapi_dto.h recv macro where we could have an - uninitialized pointer. - - Simple fix in dat_dr.c to initialize a variable early in the - routine before errors occur. - - Removed private data pointers from a CONNECTED event, as - there should be no private data here. - - dat_strerror no longer returns an uninitialized pointer if - the error code is not recognized. - - dat_dup_connect() will reject 0 timeout values, per the - spec. - - Removed unused internal_hca_names parameter from - ib_enum_hcas() interface. - - Use a temporary DAT_EVENT for kdapl up-calls rather than - making assumptions about the current event queue. - - Relocated some platform dependent code to an OSD file. - - Eliminated several #ifdefs in .c files. - - Inserted a missing unlock() on an error path. 
- - Added bounds checking on size of private data to make sure - we don't overrun the buffer - - Fixed a kdapltest problem that caused a machine to panic if - the user hit ^C - - kdapltest now uses spin locks more appropriate for their - context, e.g. spin_lock_bh or spin_lock_irq. Under a - conditional. - - Fixed kdapltest loops that drain EVDs so they don't go into - endless loops. - - Fixed bug in dapl_llist_add_entry link list code. - - Better error reporting from provider code. - - Handle case of user trying to reap DTO completions on an - EP that has been freed. - - No longer hold lock when ep_free() calls into provider layer - - Fixed cr_accept() to not have an extra copy of - private_data. - - Verify private_data pointers before using them, avoid - panic. - - Fixed memory leak in kdapltest where print buffers were not - getting reclaimed. - - - - DAPL BETA 2.03 RELEASE NOTES - - There are some prominent features in this release: - 1) dapltest/kdapltest. The dapltest test program has been - rearchitected such that a kernel version is now available - to test with kdapl. The most obvious change is a new - directory structure that more closely matches other core - dapl software. But there are a large number of changes - throughout the source files to accommodate both the - differences in udapl/kdapl interfaces, but also more mundane - things such as printing. - - The new dapltest is in the tree at ./test/dapltest, while the - old remains at ./test/udapl/dapltest. For this release, we - have maintained both versions. In a future release, perhaps - the next release, the old dapltest directory will be - removed. Ongoing development will only occur in the new tree. - - 2) DAT 1.1 compliance. The DAT Collaborative has been busy - finalizing the 1.1 revision of the spec. The header files - have been reviewed and posted on the DAT Collaborative web - site, they are now in full compliance. - - The reference implementation has been at a 1.1 level for a - while. 
The current implementation has some features that will - be part of the 1.2 DAT specification, but only in places - where full compatibility can be maintained. - - 3) The DAT Registry has undergone some positive changes for - robustness and support of more platforms. It now has the - ability to support several identical provider names - simultaneously, which enables the same dat.conf file to - support multiple platforms. The registry will open each - library and return when successful. For example, a dat.conf - file may contain multiple provider names for ex0a, each - pointing to a different library that may represent different - platforms or vendors. This simplifies distribution into - different environments by enabling the use of common - dat.conf files. - - In addition, there are a large number of bug fixes throughout - the code. Bug reports and fixes have come from a number of - companies. - - Also note that the Release notes are cleaned up, no longer - containing the complete text of previous releases. - - * EVDs no longer support DTO and CONNECTION event types on the - same EVD. NOTE: The problem is maintaining the event ordering - between two channels such that no DTO completes before a - connection is received; and no DTO completes after a - disconnect is received. For 90% of the cases this can be made - to work, but the remaining 10% will cause serious performance - degradation to get right. - - NEW SINCE Beta 2.2 - - * DAT 1.1 spec compliance. This includes some new types, error - codes, and moving structures around in the header files, - among other things. Note the Class bits of dat_error.h have - returned to a #define (from an enum) to cover the broadest - range of platforms. - - * Several additions for robustness, including handle and - pointer checking, better argument checking, state - verification, etc. Better recovery from error conditions, - and some assert()s have been replaced with 'if' statements to - handle the error. 
- - * EVDs now maintain the actual queue length, rather than the - requested amount. Both the DAT spec and IB (and other - transports) allow the underlying implementation to provide - more CQ entries than requested. - - Requests for the same number of entries contained by an EVD - return immediate success. - - * kDAPL enhancements: - - module parameters & OS support calls updated to work with - more recent Linux kernels. - - kDAPL build options changes to match the Linux kernel, vastly - reducing the size and making it more robust. - - kDAPL unload now works properly - - kDAPL takes a reference on the provider driver when it - obtains a verbs vector, to prevent an accidental unload - - Cleaned out all of the uDAPL cruft from the linux/osd files. - - * New dapltest (see above). - - * Added a new I/O trace facility, enabling a developer to debug - all I/O that are in progress or recently completed. Default - is OFF in the build. - - * 0 timeout connections now refused, per the spec. - - * Moved the remaining uDAPL specific files from the common/ - directory to udapl/. Also removed udapl files from the kdapl - build. - - * Bug fixes - - Better error reporting from provider layer - - Fixed race condition on reference counts for posting DTO - ops. - - Use DAT_COMPLETION_SUPPRESS_FLAG to suppress successful - completion of dapl_rmr_bind (instead of - DAT_COMPLETION_UNSIGNALLED, which is for non-notification - completion). - - Verify psp_flags value per the spec - - Bug in psp_create_any() checking psp_flags fixed - - Fixed type of flags in ib_disconnect from - DAT_COMPLETION_FLAGS to DAT_CLOSE_FLAGS - - Removed hard coded check for ASYNC_EVD. Placed all EVD - prevention in evd_stream_merging_supported array, and - prevent ASYNC_EVD from being created by an app. 
- - ep_free() fixed to comply with the spec - - Replaced various printfs with dbg_log statements - - Fixed kDAPL interaction with the Linux kernel - - Corrected phy_register prototype - - Corrected kDAPL wait/wakeup synchronization - - Fixed kDAPL evd_kcreate() such that it no longer depends - on uDAPL only code. - - dapl_provider.h had wrong guard #def: changed DAT_PROVIDER_H - to DAPL_PROVIDER_H - - removed extra (and bogus) call to dapls_ib_completion_notify() - in evd_kcreate.c - - Inserted missing error code assignment in - dapls_rbuf_realloc() - - When a CONNECTED event arrives, make sure we are ready for - it, else something bad may have happened to the EP and we - just return; this replaces an explicit check for a single - error condition, replacing it with the general check for the - state capable of dealing with the request. - - Better context pointer verification. Removed locks around - call to ib_disconnect on an error path, which would result - in a deadlock. Added code for BROKEN events. - - Brought the vapi code more up to date: added conditional - compile switches, removed obsolete __ActivePort, deal - with 0 length DTO - - Several dapltest fixes to bring the code up to the 1.1 - specification. - - Fixed mismatched dapl_os_dbg_print() #else dapl_Dbg_Print(); - the latter was replaced with the former. - - ep_state_subtype() now includes UNCONNECTED. - - Added some missing ibapi error codes. - - - - NEW SINCE Beta 2.1 - - * Changes for Errata and 1.1 Spec - - Removed DAT_NAME_NOT_FOUND, per DAT errata - - EVD's with DTO and CONNECTION flags set no longer valid. - - Removed DAT_IS_SUCCESS macro - - Moved provider attribute structures from vendor files to udat.h - and kdat.h - - kdapl UPCALL_OBJECT now passed by reference - - * Completed dat_strerror return strings - - * Now support interrupted system calls - - * dapltest now uses dat_strerror for error reporting. 
- - * Large number of files were formatted to meet project standard, - very cosmetic changes but improves readability and - maintainability. Also cleaned up a number of comments during - this effort. - - * dat_registry and RPM file changes (contributed by Steffen Persvold): - - Renamed the RPM name of the registry to be dat-registry - (renamed the .spec file too, some cvs add/remove needed) - - Added the ability to create RPMs as normal user (using - temporary paths), works on SuSE, Fedora, and RedHat. - - 'make rpm' now works even if you didn't build first. - - Changed to using the GNU __attribute__((constructor)) and - __attribute__((destructor)) on the dat_init functions, dat_init - and dat_fini. The old -init and -fini options to LD makes - applications crash on some platforms (Fedora for example). - - Added support for 64 bit platforms. - - Added code to allow multiple provider names in the registry, - primarily to support ia32 and ia64 libraries simultaneously. - Provider names are now kept in a list, the first successful - library open will be the provider. - - * Added initial infrastructure for DAPL_DCNTR, a feature that - will aid in debug and tuning of a dapl implementation. Partial - implementation only at this point. - - * Bug fixes - - Prevent debug messages from crashing dapl in EVD completions by - verifying the error code to ensure data is valid. - - Verify CNO before using it to clean up in evd_free() - - CNO timeouts now return correct error codes, per the spec. - - cr_accept now complies with the spec concerning connection - requests that go away before the accept is invoked. - - Verify valid EVD before posting connection events on active side - of a connection. EP locking also corrected. - - Clean up of dapltest Makefile, no longer need to declare - DAT_THREADSAFE - - Fixed check of EP states to see if we need to disconnect an - IA is closed. - - ep_free() code reworked such that we can properly close a - connection pending EP. 
- - Changed disconnect processing to comply with the spec: user will - see a BROKEN event, not DISCONNECTED. - - If we get a DTO error, issue a disconnect to let the CM and - the user know the EP state changed to disconnect; checked IBA - spec to make sure we disconnect on correct error codes. - - ep_disconnect now properly deals with abrupt disconnects on the - active side of a connection. - - PSP now created in the correct state for psp_create_any(), making - it usable. - - dapl_evd_resize() now returns correct status, instead of always - DAT_NOT_IMPLEMENTED. - - dapl_evd_modify_cno() does better error checking before invoking - the provider layer, avoiding bugs. - - Simple change to allow dapl_evd_modify_cno() to set the CNO to - NULL, per the spec. - - Added required locking around call to dapl_sp_remove_cr. - - - Fixed problems related to dapl_ep_free: the new - disconnect(abrupt) allows us to do a more immediate teardown of - connections, removing the need for the MAGIC_EP_EXIT magic - number/state, which has been removed. Much cleanup of paths, - and made more robust. - - Made changes to meet the spec, uDAPL 1.1 6.3.2.3: CNO is - triggered if there are waiters when the last EVD is removed - or when the IA is freed. - - Added code to deal with the provider synchronously telling us - a connection is unreachable, and generate the appropriate - event. - - Changed timer routine type from unsigned long to uintptr_t - to better fit with machine architectures. - - ep.param data now initialized in ep_create, not ep_alloc. - - Or Gerlitz provided updates to Mellanox files for evd_resize, - fw attributes, many others. Also implemented changes for correct - sizes on REP side of a connection request. - - - - NEW SINCE Beta 2.0 - - * dat_echo now DAT 1.1 compliant. Various small enhancements. - - * Revamped atomic_inc/dec to be void, the return value was never - used. This allows kdapl to use Linux kernel equivalents, and - is a small performance advantage. 
- - * kDAPL: dapl_evd_modify_upcall implemented and tested. - - * kDAPL: physical memory registration implemented and tested. - - * uDAPL now builds cleanly for non-debug versions. - - * Default RDMA credits increased to 8. - - * Default ACK_TIMEOUT now a reasonable value (2 sec vs old 2 - months). - - * Cleaned up dat_error.h, now 1.1 compliant in comments. - - * evd_resize initial implementation. Untested. - - * Bug fixes - - __KDAPL__ is defined in kdat_config.h, so apps don't need - to define it. - - Changed include file ordering in kdat.h to put kdat_config.h - first. - - resolved connection/tear-down race on the client side. - - kDAPL timeouts now scaled properly; fixed 3 orders of - magnitude difference. - - kDAPL EVD callbacks now get invoked for all completions; old - code would drop them in heavy utilization. - - Fixed error path in kDAPL evd creation, so we no longer - leak CNOs. - - create_psp_any returns correct error code if it can't create - a connection qualifier. - - lock fix in ibapi disconnect code. - - kDAPL INFINITE waits now work properly (non connection - waits) - - kDAPL driver unload now works properly - - dapl_lmr_[k]create now returns 1.1 error codes - - ibapi routines now return DAT 1.1 error codes - - - - NEW SINCE Beta 1.10 - - * kDAPL is now part of the DAPL distribution. See the release - notes above. - - The kDAPL 1.1 spec is now contained in the doc/ subdirectory. - - * Several files have been moved around as part of the kDAPL - checkin. Some files that were previously in udapl/ are now - in common/, some in common are now in udapl/. The goal was - to make sure files are properly located and make sense for - the build. - - * Source code formatting changes for consistency. - - * Bug fixes - - dapl_evd_create() was comparing the wrong bit combinations, - allowing bogus EVDs to be created. - - Removed code that swallowed zero length I/O requests, which - are allowed by the spec and are useful to applications. 
- - Locking in dapli_get_sp_ep was asymmetric; fixed it so the - routine will take and release the lock. Cosmetic change. - - dapl_get_consumer_context() will now verify the pointer - argument 'context' is not NULL. - - - OBTAIN THE CODE - - To obtain the tree for your local machine you can check it - out of the source repository using CVS tools. CVS is common - on Unix systems and available as freeware on Windows machines. - The command to anonymously obtain the source code from - Source Forge (with no password) is: - - cvs -d:pserver:anonymous@cvs.dapl.sourceforge.net:/cvsroot/dapl login - cvs -z3 -d:pserver:anonymous@cvs.dapl.sourceforge.net:/cvsroot/dapl co . - - When prompted for a password, simply press the Enter key. - - Source Forge also contains explicit directions on how to become - a developer, as well as how to use different CVS commands. You may - also browse the source code using the URL: - - http://svn.sourceforge.net/viewvc/dapl/trunk/ - - SYSTEM REQUIREMENTS - - This project has been implemented on Red Hat Linux 7.3, SuSE - SLES 8, 9, and 10, Windows 2000, RHEL 3.0, 4.0 and 5.0 and a few - other Linux distributions. The structure of the code is designed - to allow other operating systems to easily be adapted. - - The DAPL team has used Mellanox Tavor based InfiniBand HCAs for - development, and continues with this platform. Our HCAs use the - IB verbs API submitted by IBM. Mellanox has contributed an - adapter layer using their VAPI verbs API. Either platform is - available to any group considering DAPL work. The structure of - the uDAPL source allows other provider API sets to be easily - integrated. - - The development team uses any one of three topologies: two HCAs - in a single machine; a single HCA in each of two machines; and - most commonly, a switch. Machines connected to a switch may have - more than one HCA.
- - The DAPL Plugfest revealed that switches and HCAs available from - most vendors will interoperate with little trouble, given the - most recent releases of software. The dapl reference team makes - no recommendation on HCA or switch vendors. - - Explicit machine configurations are available upon request. - - IN THE TREE - - The DAPL tree contains source code for the uDAPL and kDAPL - implementations, and also includes tests and documentation. - - Included documentation has the base level API of the - providers: OpenFabrics, IBM Access, and Mellanox Verbs API. Also - included are a growing number of DAPL design documents which - lead the reader through specific DAPL subsystems. More - design documents are in progress and will appear in the tree in - the near future. - - A small number of test applications and a unit test framework - are also included. dapltest is the primary testing application - used by the DAPL team; it is capable of simulating a variety of - loads and exercises a large number of interfaces. Full - documentation is included for each of the tests. - - Recently, the dapl conformance test has been added to the source - repository. The test provides coverage of the most common - interfaces, doing both positive and negative testing. Vendors - providing a DAPL implementation are strongly encouraged to run - this set of tests. - - MAKEFILE NOTES - - There are a number of #ifdefs in the code that were necessary - during early development. They are disappearing as we - have time to take advantage of features and work available from - newer releases of provider software. These #ifdefs are not - documented as the intent is to remove them as soon as possible. - - CONTRIBUTIONS - - As is common to Source Forge projects, there are a small number - of developers directly associated with the source tree and having - privileges to change the tree.
Requested updates, changes, bug - fixes, enhancements, or contributions should be sent to - James Lentini at jlentinit@netapp.com for review. We welcome your - contributions and expect the quality of the project will - improve thanks to your help. - - The core DAPL team is: - - James Lentini - Arlin Davis - Steve Sears - - ... with contributions from a number of excellent engineers in - various companies contributing to the open source effort. - - - ONGOING WORK - - Not all of the DAPL spec is implemented at this time. - Functionality such as shared memory will probably not be - implemented by the reference implementation (there is a write-up - on this in the doc/ area), and there are still various cases where - work remains to be done. And of course, not all of the - implemented functionality has been tested yet. The DAPL team - continues to develop and test the tree with the intent of - completing the specification and delivering a robust and useful - implementation. - - -The DAPL Team -