From: Tziporet Koren Date: Thu, 1 Feb 2007 14:12:57 +0000 (+0200) Subject: first doc files - from ofed 1.2 X-Git-Url: https://openfabrics.org/gitweb/?a=commitdiff_plain;h=f43a950c36d081c939fbb407c64d1fd6d97c1cd7;p=compat-rdma%2Fdocs.git first doc files - from ofed 1.2 --- f43a950c36d081c939fbb407c64d1fd6d97c1cd7 diff --git a/HOWTO.build_ofed b/HOWTO.build_ofed new file mode 100644 index 0000000..40b38bf --- /dev/null +++ b/HOWTO.build_ofed @@ -0,0 +1,160 @@ + Open Fabrics Enterprise Distribution (OFED) + How To Build OFED 1.1 + + October 2006 + + +============================================================================== +Table of contents +============================================================================== +1. Overview +2. Usage +3. Requirements + +============================================================================== +1. Overview +============================================================================== +The script "build_ofed.sh" is used to build the OFED package based on the +OpenFabrics project and InfiniBand git tree. The package is built under the +current working directory. + +The OFED package includes InfiniBand kernel modules, userspace libraries, +diagnostic tools, performance benchmarks, firmware burning tools, Open MPI and +OSU MPI. + +See OFED_release_notes.txt for more details. + +============================================================================== +2. Usage +============================================================================== + +The build script for the OFED package can be downloaded from: + https://openib.org/svn/gen2/branches/1.1/ofed/build + +Name: build_ofed.sh + + +Usage: build_ofed.sh --ver|-v --git|-g + [--svnrev|-r ] + [--tmpdir ] + [--without-makedist] + [--userspace|-u ] + [--ofed-scripts ] + [--ofed-docs ] + [--mpidir|-m ] + [--extrasdir|-e ] + + Required: + --ver Determines the name of the OFED version that is built + --git Path to a local GIT tree (directory). 
The tree must + have previously been created by one of the methods + provided in the "Requirements" section below. + + Optional: + --svnrev The svn revision for extraction of the userspace + component (default: most recent) + + --tmpdir Directory to use as a work area (default: /tmp ) + + --without-makedist Do not execute "make dist" for the userspace + component (default: do "make dist") + + --userspace If you have already checked out the userspace + component, you may use this option to request + that the userspace component be taken from the + given directory. Otherwise, the userspace_URL + (see below) will be used. + + --ofed_scripts If you have already checked out the scripts + component, you may use this option to request that + the scripts component be taken from the given + directory. Otherwise, the ofed_scripts_URL (see + below) will be used. + + --ofed_docs If you have already checked out the docs component, + you may use this option to request that the docs + component be taken from the given directory. + Otherwise, the ofed_docs_URL (see below) will be + used. + + --mpidir If you have already checked out the mpi component, + you may use this option to request that the mpi + component be taken from the given directory. + Otherwise, the mpi_URL (see below) will be used. + + --extrasdir If you have already checked out the extras + component, you may use this option to request that + the extras component be taken from the given + directory. Otherwise, the extras_URL (see below) + will be used. 
+ + +Sources are extracted by default from the following locations: + userspace_URL: + https://openib.org/svn/gen2/branches/1.1/src/userspace + openib_scripts_URL: + https://openib.org/svn/gen2/branches/1.1/ofed/openib/scripts + ofed_scripts_URL: + https://openib.org/svn/gen2/branches/1.1/ofed/scripts + ofed_docs_URL: + https://openib.org/svn/gen2/branches/1.1/ofed/docs + mpi_URL: + https://openib.org/svn/gen2/branches/1.1/ofed/mpi + extras_URL: + https://openib.org/svn/gen2/branches/1.1/ofed/extras + +Example: + + ./build_ofed.sh --ver 1.1-rc6 --git /local/git/ofed_1_1/ + + This command will create a package (i.e., subtree) called OFED-1.1-rc6 + in the current working directory. The git tree "/local/git/ofed_1_1/" + in this example is a local InfiniBand git tree which was created using + one of the methods in the "Requirements" section below. + +============================================================================== +3. Requirements +============================================================================== + +1. Git: + Can be downloaded from: + http://www.kernel.org/pub/software/scm/git/git-1.4.2.tar.gz + +2. Subversion: + Can be downloaded from: + http://subversion.tigris.org + +3. InfiniBand Git tree: + There are two ways to get the InfiniBand git tree: + - The faster way: + mkdir gitdir + cd gitdir + git clone --bare \ + git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git \ + .git + git fetch git://www.mellanox.co.il/~git/infiniband ofed_1_1 \ + ofed_addons cma_branch ehca_branch mst_sdp + + - The slower way: + mkdir gitdir + cd gitdir + git clone -s --bare git://www.mellanox.co.il/~git/infiniband .git + git checkout ofed_1_1 `git-ls-tree -r --name-only ofed_1_1 \ + include/rdma include/scsi/srp.h drivers/infiniband \ + Documentation/infiniband ofed_scripts kernel_patches` + echo 'ref: refs/heads/ofed_1_1' > .git/HEAD + +4. 
Autotools: + + libtool-1.5.20 or higher + autoconf-2.59 or higher + automake-1.9.6 or higher + m4-1.4.4 or higher + + The above tools can be downloaded from the following URLs: + + libtool - "http://ftp.gnu.org/gnu/libtool/libtool-1.5.20.tar.gz" + autoconf - "http://ftp.gnu.org/gnu/autoconf/autoconf-2.59.tar.gz" + automake - "http://ftp.gnu.org/gnu/automake/automake-1.9.6.tar.gz" + m4 - "http://ftp.gnu.org/gnu/m4/m4-1.4.4.tar.gz" + diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000..380a563 --- /dev/null +++ b/LICENSE @@ -0,0 +1,26 @@ +OpenIB.org BSD license: + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions +are met: + +* Redistributions of source code must retain the above copyright +notice, this list of conditions and the following disclaimer. + +* Redistributions in binary form must reproduce the above +copyright notice, this list of conditions and the following +disclaimer in the documentation and/or other materials provided +with the distribution. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS +"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT +LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS +FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE +COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, +INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, +BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; +LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER +CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN +ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE +POSSIBILITY OF SUCH DAMAGE. 
diff --git a/MPI_README.txt b/MPI_README.txt new file mode 100644 index 0000000..ddd1589 --- /dev/null +++ b/MPI_README.txt @@ -0,0 +1,249 @@ + Open Fabrics Enterprise Distribution (OFED) + MPI in OFED 1.1 README + + October 2006 + + +=============================================================================== +Table of Contents +=============================================================================== +1. General +2. OSU MVAPICH MPI +3. Open MPI + + +=============================================================================== +1. General +=============================================================================== +Two MPI stacks are included in this release of OFED: + +- Ohio State University (OSU) MVAPICH 0.9.7 (Modified by Mellanox + Technologies) +- Open MPI 1.1.1-1 + +Setup, compilation and run information of OSU MVAPICH and Open MPI is +provided below in sections 2 and 3 respectively. + +1.1 Installation Note +--------------------- +In Step 2 of the main menu of install.sh, options 2, 3 and 4 can install +one or more MPI stacks. Please refer to docs/OFED_Installation_Guide.txt +to learn about the different options. + +The installation script allows each MPI to be compiled using one or +more compilers. Users need to set, per MPI stack installed, the PATH +and/or LD_LIBRARY_PATH so as to install the desired compiled MPI stacks. + +1.2 MPI Tests +------------- +OFED includes four basic tests that can be run against each MPI stack: +bandwidth (bw), latency (lt), Intel MPI Benchmark, and Presta. The tests +are located under: /mpi///tests/, +where is /usr/local/ofed by default. + +=============================================================================== +2. OSU MVAPICH MPI +=============================================================================== + +This package is a modified version of the Ohio State University (OSU) +MVAPICH Rev 0.9.7 MPI software package, and is the officially supported +MPI stack for this release of OFED. 
Modifications to the original version +include: additional features, bug fixes, and RPM packaging. +See http://nowlab.cse.ohio-state.edu/projects/mpi-iba/ for more details. + + +2.1 Setting up for OSU MVAPICH MPI +---------------------------------- +To launch OSU MPI jobs, its installation directory needs to be included +in PATH and LD_LIBRARY_PATH. To set them, execute one of the following +commands: + source /mpi///etc/mvapich.sh + -- when using sh for launching MPI jobs + or + source /mpi///etc/mvapich.csh + -- when using csh for launching MPI jobs + + +2.2 Compiling OSU MVAPICH MPI Applications: +------------------------------------------- +***Important note***: +A valid Fortran compiler must be present in order to build the MVAPICH MPI +stack and tests. + +The default gcc-g77 Fortran compiler is provided with all RedHat Linux +releases. SuSE distributions earlier than SuSE Linux 9.0 do not provide +this compiler as part of the default installation. + +The following compilers are supported by OFED's OSU MPI package: gcc, +intel and pathscale. The install script prompts the user to choose +the compiler with which to build the OSU MVAPICH MPI RPM. Note that more +than one compiler can be selected simultaneously, if desired. + +For details see: + http://nowlab.cse.ohio-state.edu/projects/mpi-iba/mvapich_user_guide.html + +To review the default configuration of the installation, check the default +configuration file: /mpi///etc/mvapich.conf + +2.3 Running OSU MVAPICH MPI Applications: +----------------------------------------- +Requirements: +o At least two nodes. Example: mtlm01, mtlm02 +o Machine file: Includes the list of machines. Example: /root/cluster +o Bidirectional rsh or ssh without a password + +Note for OSU: ssh will be used unless -rsh is specified. 
In order to use +rsh, add to the mpirun_rsh command the parameter: -rsh + +*** Running OSU tests *** + +/usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/tests/osu_benchmarks-2.2/osu_bw +/usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/tests/osu_benchmarks-2.2/osu_latency +/usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/tests/osu_benchmarks-2.2/osu_bibw +/usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/tests/osu_benchmarks-2.2/osu_bcast + +*** Running Intel MPI Benchmark test (Full test) *** + +/usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/tests/IMB-2.3/IMB-MPI1 + +*** Running Presta test *** + +/usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/tests/presta-1.4.0/com -o 100 +/usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/tests/presta-1.4.0/glob -o 100 +/usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/tests/presta-1.4.0/globalop + + +=============================================================================== +3. Open MPI +=============================================================================== + +Open MPI is a next-generation MPI implementation from the Open MPI +Project (http://www.open-mpi.org/). Version 1.1.1-1 of Open MPI is +included in this release, which is also available directly +from the main Open MPI web site. 
This MPI stack is being offered in +OFED as a "technology preview," meaning that it is not officially +supported yet. It is expected that future releases of OFED will have +fully supported versions of Open MPI. + +A working Fortran compiler is not required to build Open MPI, but some +of the included MPI tests are written in Fortran. These tests will not +compile/run if Open MPI is built without Fortran support. + +The following compilers are supported by OFED's Open MPI package: GNU, +Pathscale, Intel, or Portland. The install script prompts the user +for the compiler with which to build the Open MPI RPM. Note that more +than one compiler can be selected simultaneously, if desired. + +Users should check the main Open MPI web site for additional +documentation and support. (Note: The FAQ file considers +InfiniBand tuning among other issues.) + +3.1 Setting up for Open MPI: +---------------------------- +The Open MPI team strongly advises users to put the Open MPI installation +directory in their PATH and LD_LIBRARY_PATH. This can be done at the +system level if all users are going to use Open MPI. Specifically: +- add /bin to PATH +- add /lib to LD_LIBRARY_PATH + + is the directory where the desired Open MPI instance was installed. +("instance" refers to the compiler used for Open MPI compilation at install +time.) + +If using rsh or ssh to launch MPI jobs, you *must* set the variables described +above in your shell startup files (e.g., .bashrc, .cshrc, etc.). + +If you are using a job scheduler to launch MPI jobs (e.g., SLURM, Torque), +setting the PATH and LD_LIBRARY_PATH is still required, but it does +not need to be set in your shell startup files. 
Procedures describing +how to add these values to PATH and LD_LIBRARY_PATH are described in +detail at: + http://www.open-mpi.org/faq/?category=running + +3.2 Compiling Open MPI Applications: +------------------------------------ +(copied from http://www.open-mpi.org/faq/?category=mpi-apps -- see +this web page for more details) + +The Open MPI team strongly recommends that you simply use Open MPI's +"wrapper" compilers to compile your MPI applications. That is, instead +of using (for example) gcc to compile your program, use mpicc. Open +MPI provides a wrapper compiler for four languages: + + Language Wrapper compiler name + ------------- -------------------------------- + C mpicc + C++ mpiCC, mpicxx, or mpic++ + (note that mpiCC will not exist + on case-insensitive file-systems) + Fortran 77 mpif77 + Fortran 90 mpif90 + ------------- -------------------------------- + +Note that if no Fortran 77 or Fortran 90 compilers were found when +Open MPI was built, Fortran 77 and 90 support will automatically be +disabled (respectively). + +If you expect to compile your program as: + + > gcc my_mpi_application.c -lmpi -o my_mpi_application + +Simply use the following instead: + + > mpicc my_mpi_application.c -o my_mpi_application + +Specifically: simply adding "-lmpi" to your normal compile/link +command line *will not work*. See +http://www.open-mpi.org/faq/?category=mpi-apps if you cannot use the +Open MPI wrapper compilers. + +Note that Open MPI's wrapper compilers do not do any actual compiling +or linking; all they do is manipulate the command line and add in all +the relevant compiler / linker flags and then invoke the underlying +compiler / linker (hence, the name "wrapper" compiler). More +specifically, if you run into a compiler or linker error, check your +source code and/or back-end compiler -- it is usually not the fault of +the Open MPI wrapper compiler. 
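The wrapper-compiler workflow described in section 3.2 can be sketched as a short shell session. This is a minimal illustration, not part of the OFED package: the file name hello_mpi.c and the program itself are hypothetical, and the --showme option is an Open MPI wrapper feature whose availability may depend on the installed version.

```shell
#!/bin/sh
# Sketch only: hello_mpi.c is a hypothetical file name, not an OFED file.
# Write a minimal MPI program that reports each process's rank.
cat > hello_mpi.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF

if command -v mpicc >/dev/null 2>&1; then
    # Assumption: Open MPI's wrapper accepts --showme, which prints the
    # underlying compiler command line instead of running it.
    mpicc --showme hello_mpi.c -o hello_mpi
    # Compile and link through the wrapper; no explicit -lmpi is given.
    mpicc hello_mpi.c -o hello_mpi
else
    echo "mpicc not found in PATH; see section 3.1" >&2
fi
```

If the wrapper is found, the resulting hello_mpi binary can be launched with mpirun as described in section 3.3.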
+ +3.3 Running Open MPI Applications: +---------------------------------- +Open MPI uses either the "mpirun" or "mpiexec" commands to launch +applications. If your cluster uses a resource manager (such as SLURM +or Torque), providing a hostfile is not necessary: + + > mpirun -np 4 my_mpi_application + +If you use rsh/ssh to launch applications, they must be set up to NOT +prompt for a password (see http://www.open-mpi.org/faq/?category=rsh +for more details on this topic). Moreover, you need to provide a hostfile +containing a list of hosts to run on. + +Example: + + > cat hostfile + node1.example.com + node2.example.com + node3.example.com + node4.example.com + + > mpirun -np 4 -hostfile hostfile my_mpi_application + (application runs on all 4 nodes) + +In the following examples, replace with the number of nodes to run on, +and with the filename of a valid hostfile listing the nodes +to run on. + +Example1: Running the OSU bandwidth: + + > cd /usr/local/ofed/mpi/gcc/openmpi-1.1.1-1/tests/osu_benchmarks-2.2 + > mpirun -np -hostfile osu_bw + +Example2: Running the Intel MPI Benchmark benchmarks: + + > cd /usr/local/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3 + > mpirun -np -hostfile IMB-MPI1 + +Example3: Running the Presta benchmarks: + + > cd /usr/local/ofed/mpi/gcc/openmpi-1.1.1-1/tests/presta-1.4.0 + > mpirun -np -hostfile com -o 100 diff --git a/OFED_Installation_Guide.txt b/OFED_Installation_Guide.txt new file mode 100644 index 0000000..3b832cc --- /dev/null +++ b/OFED_Installation_Guide.txt @@ -0,0 +1,356 @@ + Open Fabrics Enterprise Distribution (OFED) + Version 1.1 + Installation Guide + + October 2006 + +============================================================================== +Table of contents +============================================================================== + + 1. Overview + 2. Contents of the OFED Distribution + 3. HW and SW Requirements + 4. How to download and extract the OFED + 5. Installing OFED Software + 6. 
Building OFED RPMs + 7. IPoIB Configuration + 8. Uninstalling OFED + 9. Configuration + 10. Related Documentation + + +============================================================================== +1. Overview +============================================================================== + +This is the OpenFabrics Enterprise Distribution (OFED) version 1.1 +software package supporting InfiniBand fabrics. It is composed of +several software modules intended for use on a computer cluster +constructed as an InfiniBand subnet. + +This document describes how to install the various modules and test them in +a Linux environment. + +General Notes: + 1) The install script removes all previously installed OFED packages + and re-installs from scratch. (Note: Configuration files will not + be removed). You will be prompted to acknowledge the deletion of + the old packages. + + 2) When installing OFED on an entire [homogeneous] cluster, a common + strategy is to build the software only once (perhaps on a shared + file system such as NFS). The resulting RPMs can then be installed + on all nodes in the cluster using any cluster-aware tools (such as + pdsh). + +============================================================================== +2. 
OFED Package Contents +============================================================================== + +The OFED Distribution package generates RPMs for installing the following: + + o OpenFabrics core and ULPs: + - HCA drivers (mthca, ipath, ehca) + - core + - Upper Layer Protocols: IPoIB, SDP, SRP Initiator, iSER Initiator + and uDAPL + o OpenFabrics utilities: + - OpenSM: InfiniBand Subnet Manager + - Diagnostic tools + - Performance tests + o MPI: + - OSU MPI stack supporting the InfiniBand interface + - Open MPI stack supporting the InfiniBand interface + - MPI benchmark tests (OSU BW/LAT, Intel MPI Benchmark, Presta) + o Sources of all software modules (under conditions mentioned in the + modules' LICENSE files) + o Documentation + +============================================================================== +3. HW and SW Requirements +============================================================================== + +1) Server platform with InfiniBand HCA (see OFED Distribution + Release Notes for details) + +2) Linux OS (see OFED Distribution Release Notes for details) + +3) Administrator privileges on your machine(s) + +4) Disk Space: - For Build & Installation: 300MB + - For Installation only: 200MB + +5) For the OFED Distribution to compile on your machine, some software + packages of your OS distribution are required. These are listed here. + +OS Distribution Required Packages +--------------- ---------------------------------- +General: +o Common to all gcc, glib, glib-devel, glibc, glibc-devel, + automake, autoconf, libtool. 
+o RedHat kernel-devel, sysfsutils, sysfsutils-devel, rpm-build +o SLES 9.0 kernel-source, udev, rpm +o SLES 10.0 kernel-source, sysfsutils, sysfsutils-devel, rpm + +Specific Component Requirements: +o OSU MPI a Fortran Compiler (such as gcc-g77) +o ibutils tcl-8.4, tcl-devel-8.4 +o oiscsi-iser-support open-iscsi, db-devel +o tvflash pciutils-devel + + The installer will warn you if you attempt to compile any of the + above packages and do not have the prerequisites installed. + +============================================================================== +4. How to download and extract the OFED Distribution +============================================================================== + +1) Download the OFED-X.X.X.tgz file to your target Linux host. + + If this package is to be installed on a cluster, it is recommended to + download it to an NFS shared directory. + +2) Extract the package using: + + tar xzvf OFED-X.X.X.tgz + +============================================================================== +5. Installing OFED Software +============================================================================== + +1) Go to the directory into which the package was extracted: + + cd /..../OFED-X.X.X + +2) Installing the OFED package must be done as root. For a + menu-driven, first build and installation, run the installer + script: + + ./install.sh + + Interactive menus will direct you through the install process. + + Note: After the installer completes, information about the OFED + installation such as the prefix, kernel version, and + installation parameters can be found by running + /etc/infiniband/info. + + During the interactive installation of OFED, two files are + generated: ofed.conf and ofed_net.conf. ofed.conf holds the + installed software modules and configuration settings chosen by the + user. ofed_net.conf holds the IPoIB settings chosen by the user. 
+ + If the package is installed on a cluster-shared directory, these + files can then be used to perform an automatic, unattended + installation of OFED on other machines in the cluster. The + unattended installation will use the same choices as were selected + in the interactive installation. + + For an automatic installation on any host, run the following: + + ./OFED-X.X.X/install.sh -c /ofed.conf -net /ofed_net.conf + + Note: It is possible to rename and/or edit the ofed.conf and ofed_net.conf + files. Thus it is possible to change user choices (observing the + original format). See examples of ofed.conf and ofed_net.conf under + OFED-X.X.X/docs. + +Install Process Results: +------------------------ + +o The OFED package is installed under directory. +o Kernel modules are copied to: + /lib/modules/`uname -r`/kernel/drivers/infiniband/ +o The package kernel include files are placed under /src/openib. + These includes should be used when building kernel modules which use + the Infiniband stack. (Note that these includes, if needed, have + been "backported" to your kernel). +o The package raw (unbackported) source files are placed under + /src/openib-1.1. +o The script "openibd" is installed under /etc/init.d/. This script can + be used to load and unload the software stack. +o The directory /etc/infiniband is created with the files "info" and + "openib.conf". The "info" script can be used to retrieve OFED + installation information. The "openib.conf" file contains the list of + modules that will be loaded when the "openibd" script is used. +o The file "90-ib.rules" is installed to /etc/udev/rules.d/ +o If libibverbs-utils is installed, then ofed.sh and ofed.csh are + installed under /etc/profile.d/. These automatically update the PATH + environment variable with /bin. In addition, ofed.conf is + installed under /etc/ld.so.conf.d/ to update dynamic linker's + run-time search path to find the InfiniBand shared libraries. 
+o The file /etc/modprobe.conf is updated to include the following: + - "alias ib ib_ipoib" for each ib interface. + - "alias net-pf-27 ib_sdp" for sdp. +o If opensm is installed, the daemon opensmd is installed under /etc/init.d/ + and opensm.conf is installed under /etc. +o If IPoIB configuration files were included, ifcfg-ib files will be + installed at: + - RedHat: /etc/sysconfig/network-scripts/ + - SuSE: /etc/sysconfig/network/ + + +============================================================================== +6. Building OFED RPMs +============================================================================== + +1) Go to the directory into which the package was extracted: + + cd /..../OFED-X.X.X + + Building RPMs can be done as a non-root user. + +2) For interactive build run the build.sh script: + + ./build.sh + + Interactive menus will direct you through the build process. + + During the manual building of OFED RPMs, ofed.conf is generated. + ofed.conf holds the selected software modules and configuration + settings chosen by the user. + +3) For an automated build, run the following: + + ./OFED-X.X.X/build.sh -c /ofed.conf + + Note: It is possible to rename and/or edit the ofed.conf file. Thus + it is possible to change user choices (observing the original format). + See an example of ofed.conf under OFED-X.X.X/docs. + +Build Process Results +--------------------- + +The OFED build.sh script builds OFED binary RPMs under +OFED-X.X.X/RPMS; the sources are placed in OFED-X.X.X/SRPMS/. +Running this script does not change any currently installed +components, and the script does not change the current kernel build. + +Once the build process has completed, the user may run ./install.sh to +install the new RPMs. This time, however, any previously installed +OFED components will be uninstalled and the newly built package will +be installed. + +Note: Depending on your hardware, the build procedure may take 30-45 + minutes. 
Installation, however, is a relatively short process + (~5 minutes). A common strategy for OFED installation on large + homogeneous clusters is to extract the tarball on a network + file system (such as NFS), do the build on NFS, and then run + the installer on each node with the RPMs that were previously built. + +*** Important Note for Open MPI users ONLY: The Open MPI software + requires that the InfiniBand drivers be installed before it is + built. Hence, Open MPI will only be built if you select the + "install" option. Open MPI will *not* be built if you only select + the "build" option. + +============================================================================== +7. IP-over-IB (IPoIB) Configuration +============================================================================== + +Configuring IPoIB is an optional step during the installation. During +interactive installation, the user may choose to insert the ifcfg-ib +files. If this option is chosen, the ifcfg-ib files will be +installed at: + +- RedHat: /etc/sysconfig/network-scripts/ +- SuSE: /etc/sysconfig/network/ + +Setting IPoIB Configuration: +---------------------------- + +The default IPoIB interface configuration is based on DHCP. Note that +a special patch for DHCP servers is required for supporting IPoIB +clients. A patch for dhcp v3.0.4 is available under +OFED-X.X.X/docs/dhcp. + +If you are not using DHCP to obtain IP addresses for clients using +IPoIB, you must manually specify the full IP configuration during the +interactive installation: IP address, network address, netmask, and +broadcast address. + +For unattended installations, a configuration file can be provided +with this information. The configuration file must provide the IPoIB +settings in one of the following ways: +- Fixed values for each IPoIB interface. +- Values based on the Ethernet configuration (may be useful for + cluster configuration). 
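The Ethernet-based option can be pictured as octet substitution: in the ofed_net.conf examples that follow, each '*' in the IPoIB address is replaced with the corresponding octet of the LAN interface's address. The sketch below only illustrates that rule. The function derive_ipoib_addr is a hypothetical helper, not part of OFED, and the installer's actual implementation is not shown in this document.

```shell
#!/bin/sh
# Hypothetical helper, not part of OFED: fill each '*' octet of an IPoIB
# address template with the octet at the same position in a LAN address.
derive_ipoib_addr() {
    template=$1   # e.g. 172.16.*.* (as in IPADDR_ib0, without the quotes)
    lan_addr=$2   # e.g. the IPv4 address configured on eth0
    result=""
    i=1
    while [ "$i" -le 4 ]; do
        t=$(printf '%s' "$template" | cut -d. -f"$i")
        l=$(printf '%s' "$lan_addr" | cut -d. -f"$i")
        if [ "$t" = "*" ]; then
            octet=$l   # wildcard: take the octet from the LAN address
        else
            octet=$t   # fixed octet: keep the template value
        fi
        result="${result:+$result.}$octet"
        i=$((i + 1))
    done
    printf '%s\n' "$result"
}

# With a template of 172.16.*.* and an eth0 address of 10.4.3.7,
# this prints 172.16.3.7
derive_ipoib_addr '172.16.*.*' '10.4.3.7'
```

Deriving the last two octets from the Ethernet address lets one ofed_net.conf file serve every node in a cluster, since each node ends up with a distinct IPoIB address.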
+ +Here are some examples of ofed_net.conf: + +# Static settings; all values provided by this file +IPADDR_ib0=172.16.0.4 +NETMASK_ib0=255.255.0.0 +NETWORK_ib0=172.16.0.0 +BROADCAST_ib0=172.16.255.255 +ONBOOT_ib0=1 + +# Based on eth0; each '*' will be replaced with corresponding octet +# from eth0. +LAN_INTERFACE_ib0=eth0 +IPADDR_ib0=172.16.'*'.'*' +NETMASK_ib0=255.255.0.0 +NETWORK_ib0=172.16.0.0 +BROADCAST_ib0=172.16.255.255 +ONBOOT_ib0=1 + +# Based on the first eth interface that is found (for n=0,1,...); +# each '*' will be replaced with corresponding octet from eth. +LAN_INTERFACE_ib0= +IPADDR_ib0=172.16.'*'.'*' +NETMASK_ib0=255.255.0.0 +NETWORK_ib0=172.16.0.0 +BROADCAST_ib0=172.16.255.255 +ONBOOT_ib0=1 + +============================================================================== +8. Uninstalling OFED +============================================================================== + +There are two ways to uninstall OFED: +1) Via the installation menu. +2) Using the script uninstall.sh. The script resides under OFED-X.X.X/ + and under the installation directory. + + +============================================================================== +9. Configuration +============================================================================== + +Most of the OFED components can be configured or reconfigured after +the installation by modifying the relevant configuration files. The +list of the modules that will be loaded automatically upon boot can be +found in the /etc/infiniband/openib.conf file. Other configuration +files include: +- SDP configuration file: /etc/libsdp.conf +- OpenSM configuration file: /etc/opensm.conf + +See packages Release Notes for more details. + +Note: After the installer completes, information about the OFED + installation such as the prefix, kernel version, and + installation parameters can be found by running + /etc/infiniband/info. + +============================================================================== +10. 
Related Documentation +============================================================================== + +OFED documentation is located in the ofed-docs RPM. After +installation the documents are located under the directory: +/docs (the default prefix is /usr/local/ofed). + +Document list: + + o README.txt + o OFED_Installation_Guide.txt + o MPI_README.txt + o Examples of configuration files + o OFED_tips.txt + o HOWTO.build_ofed + o All release notes + +For more information, please visit the OpenFabrics web site: + + http://www.openfabrics.org/ diff --git a/OFED_release_notes.txt b/OFED_release_notes.txt new file mode 100644 index 0000000..d48e667 --- /dev/null +++ b/OFED_release_notes.txt @@ -0,0 +1,233 @@ + Open Fabrics Enterprise Distribution (OFED) + Version 1.1 + Release Notes + + October 2006 + + +=============================================================================== +Table of Contents +=============================================================================== +1. Overview, which includes: + - OFED Distribution Rev 1.1 Contents + - Supported Platforms and Operating Systems + - Supported HCA Adapter Cards and Firmware Versions + - Tested Switch Platforms + - Third party Test Packages + - OFED sources +2. Main Changes from OFED 1.0 +3. Fixed Bugs +4. Known Issues + + +=============================================================================== +1. Overview +=============================================================================== +These are the release notes of Open Fabrics Enterprise Distribution (OFED) +release 1.1. The OFED software package is composed of several software modules, +and is intended for use on a computer cluster constructed as an InfiniBand +network. + +Note: If you plan to upgrade the OFED package on your cluster, please upgrade +all of its nodes to this new version. 
+ + +1.1 OFED 1.1 Contents +--------------------- +The OFED package contains the following components: + o OpenFabrics core and ULPs: + - HCA drivers (mthca, ipath, ehca) + - core + - Upper Layer Protocols: IPoIB, SDP, SRP Initiator, iSER Host and uDAPL + o OpenFabrics utilities: + - OpenSM (OSM): InfiniBand Subnet Manager + - Diagnostic tools + - Performance tests + o MPI: + - OSU MPI stack supporting the InfiniBand interface + - Open MPI stack supporting the InfiniBand interface + - MPI benchmark tests (OSU benchmarks, Intel MPI benchmarks, Presta) + o Sources of all software modules (under conditions mentioned in the modules' + LICENSE files) + o Documentation + +Notes: +1. SDP is in beta quality. +2. ehca driver is in technology preview state. +3. All other OFED components are of production quality. +4. See release notes for each package in the docs directory. +5. Any Topspin copyright belongs to Cisco Systems, Inc. + +1.2 Supported Platforms and Operating Systems +--------------------------------------------- + o CPU architectures: + - x86_64 + - x86 + - ia64 + - ppc64 + + o Linux Operating Systems: + - RedHat EL4 up3: 2.6.9-34.ELsmp + - RedHat EL4 up4: 2.6.9-42.ELsmp + - SLES9 SP3: 2.6.5-7.244-smp + - SLES10: 2.6.16.21-0.8-smp + - kernel.org: 2.6.17.x and 2.6.18.x + +1.3 HCAs Supported +------------------ +This release supports HCAs by Mellanox Technologies, Qlogic and IBM. + + o Mellanox Technologies HCAs: + - InfiniHost (fw-23108 Rev 3.5.000) + - InfiniHost III Ex (MemFree: fw-25218 Rev 5.1.400 + with memory: fw-25208 Rev 4.7.600) + - InfiniHost III Lx (fw-25204 Rev 1.1.000) + The SDR and DDR modes of the InfiniHost III family are supported. 
+ + For official firmware versions please see: + http://www.mellanox.com/support/firmware_table.php + + o Qlogic HCAs: + - QHT6040 (PathScale InfiniPath HT-460) + - QHT6140 (PathScale InfiniPath HT-465) + - QLE6140 (PathScale InfiniPath PE-880) + + o IBM HCAs: + - GX Dual-port 4x IB HCA + - GX Dual-port 12x IB HCA + + +1.4 Switches Supported +---------------------- +This release was tested with switches and gateways provided by the following +companies: + - Cisco + - Voltaire + - SilverStorm + - Flextronics + +1.5 Third Party Packages +------------------------ +The following third party packages have been tested with OFED 1.1: +1. Intel MPI, Version 2.0.1 - refresh, and Version 3.0 + +1.6 OFED Sources: +----------------- +Source repositories: +Kernel: git://www.mellanox.co.il/~git/infiniband ofed_1_1 +User: https://openib.org/svn/gen2/branches/1.1/src/userspace + +The kernel sources are based on the Linux 2.6.18-rc6 mainline kernel. Its patches +are included in the OFED sources directory. +For details see HOWTO.build_ofed. + + +=============================================================================== +2. Main Changes from OFED 1.0 +=============================================================================== +Note: For details regarding the various changes, please see the release notes +for each package in the docs directory. + + 2.1 General changes: + o Kernel code based on 2.6.18 + o HCA fatal - kernel flow support + o High Availability in IPoIB and SRP + o RDS was removed from the OFED package + o IBM low level driver (ehca) was added + + 2.1 IPoIB: + o High Availability support using a user-level daemon (beta quality) + + 2.2 SDP: + o Beta quality (higher stability) + o Improved latency + o Implemented the Nagle algorithm + o Supports sending/receiving out-of-band data + o Interoperability with previous SDP implementation + + 2.3 SRP: + o GA quality + o DM (Device Mapper) - for high availability (beta quality).
+ o New srp_daemon was added + + 2.4 iSER: + o Testing more platforms (e.g., ppc64 and ia64) + + 2.5 uDAPL: + o Scalability features needed for Intel MPI + + 2.6 MPI: + a. OSU MVAPICH: + o Version was changed to 0.9.7-mlx2.2.0 + o Message coalescing + b. Open MPI: + o Version was updated to v1.1.1 + o Bug fixes and general enhancements over v1.1 + o See http://www.open-mpi.org/svn/new.php for details + c. MPI tests: + o Updated the tests to the latest versions from LLNL, Intel, OSU + + 2.7 OSM: + o Partition Manager (Pkey) + o Pre-computed routing load from file + o Primitive QoS - as technology preview + + 2.8 Management: + o Added Madeye utility + o Added saquery tool + o Enhanced ibnetdiscover tool with grouping function + o New ibutils package: + o Port error counter check + o Port performance counters dump + o Link width and Link speed check by flag + + 2.9 Install: + o Create both 32-bit and 64-bit user-level libraries on x86_64 and + ppc64 platforms + o OSM RPM was separated into several RPMs to enable installing + diagnostic tools without the opensm executable. + o The package kernel include files are placed under /src/openib. + These includes should be used when building kernel modules which use + the InfiniBand stack. (Note that these includes, if needed, have + been "backported" to your kernel). + o The package raw (unbackported) source files are placed under + /src/openib-1.1. + +=============================================================================== +3. Fixed Bugs +=============================================================================== +1. OFED installation now supports installing lib32 on 64-bit systems. +2. Registration of huge page memory buffers is now supported. +3. Diagnostic tools no longer require an opensm executable to be installed. +4. Hotplug removal no longer hangs the system when the device is used by + the uverbs interface. +5. MVAPICH now works on ppc64.
+ +Bugs fixed in each package are reported in that package's release notes. + +=============================================================================== +4. Known Issues +=============================================================================== +The following is a list of major limitations and known issues of the various +components of the OFED 1.1 release. + +1. Memory registration by a user is limited according to the administrator's + setting. See "Pinning (Locking) User Memory Pages" in OFED_tips.txt for + system configuration. +2. Fork support is available on kernel 2.6.12 and above, provided that + applications do not use threads. fork() is supported as long as the + parent process does not run before the child exits or calls exec(). + The former can be achieved by calling wait(childpid); the latter can be + achieved by application-specific means. The POSIX system() call is + supported. +3. On RedHat EL4 up2 and Fedora Core 4 the driver may not load properly if + SELINUX is enforced. + Workaround: Change the value of the parameter SELINUX in + /etc/sysconfig/selinux from "enforcing" to "permissive" or "disabled". +4. libibcm is not thread safe: if several threads use libibcm, the function + ib_cm_get_device will give the same device to all of the threads, which + can cause thread X to get events that were sent to thread Y. +5. The ehca driver is supported only on kernel 2.6.18. +6. The ipath driver is supported only on 64-bit platforms. + +Note: See the release notes of each component for additional issues. diff --git a/OFED_tips.txt b/OFED_tips.txt new file mode 100644 index 0000000..56fcbb8 --- /dev/null +++ b/OFED_tips.txt @@ -0,0 +1,238 @@ + Open Fabrics Enterprise Distribution (OFED) + Tips for Working with OFED 1.1 + + October 2006 + +=============================================================================== +Table of Contents +=============================================================================== +1. OFED Utilities +2. Debug HOWTOs +3.
Pinning (Locking) User Memory Pages + + +=============================================================================== +1. OFED Utilities +=============================================================================== + +The OFED package includes utilities under <prefix>/bin, where <prefix> stands +for the OFED installation path. To retrieve this path, run the script +"/etc/infiniband/info" as explained in Section 2.2 below. + +Notes: +------ +1. This document includes descriptions for a subset of the existing utilities. + To learn about other utilities, use their --help flag. + +2. The sources of the utilities are not part of the RPM installation. However, + all sources exist in the openib-1.1.tgz tarball. + + +1.1 Device Information +---------------------- +Device information can be obtained using several utilities: + +a. ibv_devinfo + + ibv_devinfo prints the CA attributes. + + usage: + ibv_devinfo + + Options: + -d, --ib-dev= use IB device (default: first device found) + -i, --ib-port= use port of IB device (default: all ports) + -l, --list print only the IB device names + -v, --verbose print all the attributes of the IB device(s) + +b. ibstat + + usage: + ibstat [OPTIONS] [portnum] + + Options: + -d debug + -l list all IB devices + -s print short device summary + -p print port GUIDs + -V print ibstat version information and exit + -h print usage + + Examples: + ibstat -l # list all IB devices + ibstat mthca0 2 # stat port 2 of mthca0 + +c.
Using sysfs file system + The driver supports the sysfs file system under: /sys/class/infiniband + + Examples: + + > ls /sys/class/infiniband/mthca0/ + board_id device fw_ver hca_type hw_rev node_desc node_guid node_type + ports sys_image_guid + + > cat /sys/class/infiniband/mthca0/board_id + MT_0200000001 + + > ls /sys/class/infiniband/mthca0/ports/1/ + cap_mask counters gids lid lid_mask_count phys_state pkeys rate sm_lid + sm_sl state + + > cat /sys/class/infiniband/mthca0/ports/1/state + 4: ACTIVE + +1.2 Performance Tests +--------------------- + The following performance tests are provided with the OFED release: + + 1. Latency tests: + - ib_read_lat: RDMA read + - ib_write_lat: RDMA write + - ib_send_lat: UD, UC and RC (default) send + + 2. Bandwidth tests: + - ib_read_bw: RDMA read + - ib_write_bw: RDMA write + - ib_send_bw: UD, UC and RC (default) send + + Usage: + Server: + Client: + is an Ethernet or IPoIB address. + --help lists the available . The same options must be + passed to both server and client. + + Note: See PERF_TEST_README.txt for more information on the performance + tests. + + Example: ib_send_bw + Usage: + ib_send_bw start a server and wait for connection + ib_send_bw connect to server at + + options: + -p, --port= listen on/connect to port + (default: 18515) + -d, --ib-dev= use IB device + (default: first device found) + -i, --ib-port= use port of IB device + (default: 1) + -c, --connection= connection type RC/UC/UD (default: RC) + -m, --mtu= mtu size (default: 1024) + -s, --size= size of message to exchange + (default: 65536) + -a, --all run sizes from 2 up to 2^23 + -t, --tx-depth= size of tx queue (default: 300) + -n, --iters= number of exchanges + (at least 2, default: 1000) + -b, --bidirectional measure bidirectional bandwidth + (default: unidirectional) + -V, --version display version number + +1.3 Ping-pong Example Tests +--------------------------- + The ping-pong example tests provide basic connectivity tests. 
Each test + has a help message (-h). + - ibv_ud_pingpong + - ibv_rc_pingpong + - ibv_srq_pingpong + - ibv_uc_pingpong + + Example: ibv_ud_pingpong --h + Usage: + ibv_ud_pingpong start a server and wait for connection + ibv_ud_pingpong connect to server at + + options: + -p, --port= listen on/connect to port + (default: 18515) + -d, --ib-dev= use IB device + (default: first device found) + -i, --ib-port= use port of IB device (default: 1) + -s, --size= size of message to exchange (default: 2048) + -r, --rx-depth= number of receives to post at a time + (default: 500) + -n, --iters= number of exchanges (default: 1000) + -e, --events sleep on CQ events (default: poll) + + +=============================================================================== +2. Debug HOWTOs +=============================================================================== + +2.1 OFED Components and version information +------------------------------------------- +The text file /BUILD_ID provides data on all OFED components (whether +installed or not). + +For example: + + > cat /usr/local/ofed/BUILD_ID + OFED-1.1-rc4 + + openib-1.1 (REV=9304) + # User space + https://openib.org/svn/gen2/branches/1.1/src/userspace + Git: + ref: refs/heads/ofed_1_1 + commit d39c60f1406d29eb8e336529610574800a81d81e + + # MPI + mpi_osu-0.9.7-mlx2.2.0.tgz + openmpi-1.1.1-1.src.rpm + mpitests-2.0-0.src.rpm + +2.2 Installed OFED Components +------------------------------- +The script /etc/infiniband/info provides data on the specific OFED installation +on this machine. 
+ +For example: + + > /etc/infiniband/info + prefix=/usr/local/ofed + Kernel=2.6.9-22.ELsmp + + MODULES: CONFIG_INFINIBAND=m CONFIG_INFINIBAND_USER_MAD=m + CONFIG_INFINIBAND_USER_ACCESS=m CONFIG_INFINIBAND_ADDR_TRANS=y + CONFIG_INFINIBAND_MTHCA=m CONFIG_IPATH_CORE=m CONFIG_INFINIBAND_IPATH=m + CONFIG_INFINIBAND_IPOIB=m + + User level: --kernel-version 2.6.9-22.ELsmp --kernel-sources + /lib/modules/2.6.9-22.ELsmp/build --with-libibcm --with-libibverbs + --with-libipathverbs --with-libmthca --with-mstflint --with-perftest + +2.3 Building/Installing IB Modules with debug information +--------------------------------------------------------- +To compile/build/install the IB modules so that they will contain debug +information, set OPENIB_KERNEL_EXTRA_CFLAGS="-g" in your environment +before running OFED install.sh/build.sh. + + +=============================================================================== +3. Pinning (Locking) User Memory Pages +=============================================================================== + +Memory locking is managed by the kernel on a per-user basis. Regular users (as +opposed to root) have a limited number of pages which they may pin, where +the limit is pre-set by the administrator. Registering memory for IB verbs +requires pinning memory, thus an application cannot register more memory than +it is allowed to pin. + +The user can change the system per-process memory lock limit by adding +the following two lines to the file /etc/security/limits.conf: + + * soft memlock <number> + * hard memlock <number> + + where <number> denotes the number of KBytes that may be locked by a + user process. + +The above change to /etc/security/limits.conf will allow any user process in the +system to lock up to <number> KBytes of memory. + +On some systems, it may be possible to use "unlimited" for the size to disable +these limits entirely. + +Note: The file /etc/security/limits.conf contains further documentation.
+ diff --git a/PERF_TEST_README.txt b/PERF_TEST_README.txt new file mode 100644 index 0000000..f79b9a2 --- /dev/null +++ b/PERF_TEST_README.txt @@ -0,0 +1,121 @@ + Open Fabrics Enterprise Distribution (OFED) + Performance Tests README for OFED 1.1 + + October 2006 + + + +=============================================================================== +Table of Contents +=============================================================================== +1. Overview +2. Notes on Testing Methodology +3. Test Descriptions +4. Running Tests + +=============================================================================== +1. Overview +=============================================================================== +This is a collection of tests written over uverbs intended for use as a +performance micro-benchmark. As an example, the tests can be used for +HW or SW tuning and/or functional testing. + +Please post results/observations to the openib-general mailing list. +See "Contact Us" at http://openib.org/mailman/listinfo/openib-general and +http://www.openib.org. + + +=============================================================================== +2. Notes on Testing Methodology +=============================================================================== +- The benchmark uses the CPU cycle counter to get time stamps without a context + switch. Some CPU architectures (e.g., Intel's 80486 or older PPC) do NOT + have such capability. + +- The benchmark measures round-trip time but reports half of that as one-way + latency. This means that it may not be sufficiently accurate for asymmetrical + configurations. + +- Min/Median/Max result is reported. + The median (vs average) is less sensitive to extreme scores. + Typically, the "Max" value is the first value measured. + +- Larger samples help only marginally. The default (1000) is pretty good.
+ Note that an array of cycles_t (typically unsigned long) is allocated + once to collect samples and again to store the difference between them. + Really big sample sizes (e.g., 1 million) might expose other problems + with the program. + +- The "-H" option will dump the histogram for additional statistical analysis. + See xgraph, ygraph, r-base (http://www.r-project.org/), pspp, or other + statistical math programs. + +Architectures tested: i686, x86_64, ia64 + + + +=============================================================================== +3. Test Descriptions +=============================================================================== + +rdma_lat.c latency test with RDMA write transactions +rdma_bw.c streaming BW test with RDMA write transactions + + +The following tests are mainly useful for HW/SW benchmarking. +They are not intended as actual usage examples. + +send_lat.c latency test with send transactions +send_bw.c BW test with send transactions +write_lat.c latency test with RDMA write transactions +write_bw.c BW test with RDMA write transactions +read_lat.c latency test with RDMA read transactions +read_bw.c BW test with RDMA read transactions + +The executable name of each test starts with the general prefix "ib_", +e.g., ib_write_lat. + +4. Running Tests +---------------- + +Prerequisites: + kernel 2.6 + ib_uverbs (kernel module) matches libibverbs + ("match" means binary compatible, but ideally of the same SVN rev) + +Server: ./ +Client: ./ + + o is IPv4 or IPv6 address. You can use the IPoIB + address if IPoIB is configured. + o --help lists the available + + *** IMPORTANT NOTE: The SAME OPTIONS must be passed to both server and client.
+ + +Common Options to all tests: + -p, --port= listen on/connect to port (default: 18515) + -m, --mtu= mtu size (default: 1024) + -d, --ib-dev= use IB device (default: first device found) + -i, --ib-port= use port of IB device (default: 1) + -s, --size= size of message to exchange (default: 1) + -a, --all run sizes from 2 till 2^23 + -t, --tx-depth= size of tx queue (default: 50) + -n, --iters= number of exchanges (at least 100, default: 1000) + -C, --report-cycles report times in cpu cycle units + (default: microseconds) + -H, --report-histogram print out all results + (default: print summary only) + -U, --report-unsorted (implies -H) print out unsorted results + (default: sorted) + -V, --version display version number + + *** IMPORTANT NOTE: You need to be running a Subnet Manager on the switch or + on one of the nodes in your fabric. + +Example: +Run "ib_rdma_lat -C" on the server side. +Then run "ib_rdma_lat -C " on the client. + +ib_rdma_lat will exit on both server and client after printing results. + diff --git a/README.txt b/README.txt new file mode 100644 index 0000000..5672fec --- /dev/null +++ b/README.txt @@ -0,0 +1,232 @@ + Open Fabrics Enterprise Distribution (OFED) + Version 1.1 + README + + October 2006 + + +This is the OpenFabrics Enterprise Distribution (OFED) version 1.1 software +package supporting InfiniBand fabrics. It is composed of several software +modules intended for use on a computer cluster constructed as an InfiniBand +network. + +*** Note: If you plan to upgrade OFED on your cluster, please upgrade all + its nodes to this new version. + +This document includes the following sections: + +1. HW and SW Requirements +2. OFED Package Contents +3. A Note on the Installation Process +4. Building OFED Software RPMs +5. Installing OFED +6. Starting and Verifying the IB Fabric +7. MPI (Message Passing Interface) +8. 
Related Documentation + + +OpenFabrics Home Page: http://www.openfabrics.org +The OFED rev 1.1 software download is available at www.openib.org/downloads.html + +Please email bugs and error reports to your InfiniBand vendor, or use bugzilla: +http://openib.org/bugzilla/ + + + +1. HW and SW Requirements: +========================== +1) Server platform with InfiniBand HCA (see OFED Distribution + Release Notes for details) + +2) Linux OS (see OFED Distribution Release Notes for details) + +3) Administrator privileges on your machine(s) + +4) Disk Space: - For Build & Installation: 300MB + - For Installation only: 200MB + +5) For the OFED Distribution to compile on your machine, some software packages + of your OS distribution are required. These are listed here. + +OS Distribution Required Packages +--------------- ---------------------------------- +General: +o Common to all gcc, glib, glib-devel, glibc, glibc-devel, + automake, autoconf, libtool. +o RedHat kernel-devel, sysfsutils, sysfsutils-devel, rpm-build +o SLES 9.0 kernel-source, udev, rpm +o SLES 10.0 kernel-source, sysfsutils, sysfsutils-devel, rpm + +Specific Component Requirements: +o OSU MPI requires: Fortran compiler (default: gcc-g77) +o ibutils: tcl-8.4, tcl-devel-8.4 +o oiscsi-iser-support: open-iscsi, db-devel +o tvflash: pciutils-devel + + +2.
OFED Package Contents +======================== + +The OFED Distribution package generates RPMs for installing the following: + + o OpenFabrics core and ULPs: + - HCA drivers (mthca, ipath, ehca) + - core + - Upper Layer Protocols: IPoIB, SDP, SRP Initiator, iSER Initiator, + and uDAPL + o OpenFabrics utilities: + - OpenSM: InfiniBand Subnet Manager + - Diagnostic tools + - Performance tests + o MPI: + - OSU MPI stack supporting the InfiniBand interface + - Open MPI stack supporting the InfiniBand interface + - MPI benchmark tests (OSU BW/LAT, Intel MPI Benchmark, Presta) + o Sources of all software modules (under conditions mentioned in the + modules' LICENSE files) + o Documentation + + +3. A Note on the Installation Process +===================================== + +The OFED build process can take up to 40 minutes. If you are planning to +install the OFED package on a multi-node cluster, it is recommended to build +OFED RPMs once into a shared directory, and use the created RPMs in order to +install the package on the rest of the cluster machines. + +Use the script build.sh to build the OFED RPMs. This script can be used as a +non-root user. + +To install the package, use the install.sh script. When installing from scratch, +install.sh will first build the RPMs, then install them onto the local machine. +If the RPMs already exist, the install.sh script will simply install them onto +the local machine without re-building them. + +*** Important Note for Open MPI users ONLY: + You must install OFED (run install.sh). Building the OFED RPMs is + not sufficient. + +4. Building OFED Software RPMs +============================== + +Building OFED SW RPM packages can be a separate process or part of the +installer. In the latter case you may skip this section and move to the next +one: "Installing OFED Software". + +Some users may wish to build OFED RPM files separate from the main +installation flow. To do this, please run the ./build.sh script. 
(See note in +Section 3 above.) + +The build process will temporarily use the following default directory: +/var/tmp/OFED. The build.sh script will prompt the user to enter a different +temporary directory if desired. + +build.sh will also prompt the user for the installation directory. By default it +is /usr/local/ofed. +The RPMs will be placed under ./RPMS directory. + +For further details, see "Building OFED RPMs" and "Advanced Usage of OFED" in +OFED_Installation_Guide.txt under OFED-1.1/docs. + + +5. Installing OFED Software +============================ + +The default installation directory is: /usr/local/ofed + +Install Quick Guide: +1) Download and extract: tar xzvf OFED-1.1.tgz file. +2) Change into directory: cd OFED-1.1 +3) Run as root: ./install.sh +4) Follow the directions to install required components. For details, please see + OFED_Installation_Guide.txt under OFED-1.1/docs. + + +Note: The install script removes previously installed IB packages and + re-installs from scratch. You will be prompted to acknowledge the deletion + of the old packages. However, configuration files (.conf) will be + preserved and saved with a ".rpmsave" extension. + + +6. Starting and Verifying the IB Fabric +======================================= + +1) If you rebooted your machine after the installation process completed, + IB interfaces should be up. If you did not reboot your machine, please + enter the following command: /etc/init.d/openibd start + +2) Check that the IB driver is running on all nodes: ibv_devinfo should print + "hca_id: " on the first line. + +3) Make sure that a Subnet Manager is running by invoking the sminfo utility. + If an SM is not running, sminfo prints: + sminfo: iberror: query failed + If an SM is running, sminfo prints the LID and other SM node information. 
+ Example: + sminfo: sm lid 0x1 sm guid 0x2c9010b7c2ae1, activity count 20 priority 1 + + To check if OpenSM is running on the management node, enter: /etc/init.d/opensmd status + To start OpenSM, enter: /etc/init.d/opensmd start + + Note: OpenSM parameters can be set via the file /etc/opensm.conf. + Note: OpenSM can be configured to run upon boot by setting 'ONBOOT=yes' + in /etc/opensm.conf. + +4) Verify the status of ports by using ibv_devinfo: all connected ports should + report a "PORT_ACTIVE" state. + +5) Check the network connectivity status: run ibchecknet to see if the subnet + is "clean" and ready for ULP/application use. The following tools display + more information in addition to IB info: ibnetdiscover, ibhosts, and + ibswitches. + +6) Alternatively, instead of running steps 3 to 5 you can use the ibdiagnet + utility to perform a set of tests on your network. Upon finding an error, + ibdiagnet will print a message starting with a "-E-". For a more complete + report of the network features you should run ibdiagnet -r. If you have a + topology file describing your network you can feed this file to ibdiagnet + (using the option: -t ) and all reports will use the names they + appear in the file (instead of LIDs, GUIDs and directed routes). + +7) To run an application over SDP set the following variables: + env LD_PRELOAD='stack_prefix'/lib/libsdp.so + LIBSDP_CONFIG_FILE='stack_prefix'/etc/libsdp.conf + (or LD_PRELOAD='stack_prefix'/lib64/libsdp.so on 64 bit machines) + The default 'stack_prefix' is /usr/local/ofed. + + +7. MPI (Message Passing Interface) +================================== + +In Step 2 of the main menu of install.sh, options 2, 3 and 4 can +install one or more MPI stacks. Multiple MPI stacks can be installed +simultaneously -- they will not conflict with each other. 
+ +There are two MPI stacks included in this release of OFED: + +- Ohio State University's MVAPICH 0.9.7 (specifically updated and + modified by Mellanox Technologies and Cisco for this release of OFED) +- Open MPI 1.1.1 + +OFED also includes 4 basic tests that can be run against each MPI +stack: bandwidth (bw), latency (lt), Intel MPI Benchmark and Presta. The tests +are located under: /mpi///tests/. + +Please see MPI_README.txt for more details on each MPI package and how to run +the tests. + + +8. Related Documentation +======================== +1) Release Notes for OFED Distribution components are to be found under + OFED-1.1/docs and, after the package installation, under /docs. +2) For a detailed installation guide, see OFED_Installation_Guide.txt. +3) MPI_README.txt under /docs. +4) OFED_tips.txt under /docs. +5) PERF_TEST_README.txt under /docs. +6) For more information, please visit the OFED web page: http://www.openfabrics.org + + +For more information contact your InfiniBand vendor. + diff --git a/dhcp/dhcp-3.0.4.patch b/dhcp/dhcp-3.0.4.patch new file mode 100755 index 0000000..f3f2b1c --- /dev/null +++ b/dhcp/dhcp-3.0.4.patch @@ -0,0 +1,30 @@ +Index: dhcp-3.0.4/includes/site.h +=================================================================== +--- dhcp-3.0.4.orig/includes/site.h 2002-03-12 20:33:39.000000000 +0200 ++++ dhcp-3.0.4/includes/site.h 2006-05-23 11:34:38.000000000 +0300 +@@ -135,7 +135,7 @@ + the aforementioned problems do not matter to you, or if no other + API is supported for your system, you may want to go with it. */ + +-/* #define USE_SOCKETS */ ++#define USE_SOCKETS + + /* Define this to use the Sun Streams NIT API.
+ +Index: dhcp-3.0.4/common/discover.c +=================================================================== +--- dhcp-3.0.4.orig/common/discover.c 2006-02-23 00:43:27.000000000 +0200 ++++ dhcp-3.0.4/common/discover.c 2006-05-23 11:45:16.000000000 +0300 +@@ -532,6 +532,12 @@ void discover_interfaces (state) + break; + #endif + ++ case ARPHRD_INFINIBAND: ++ tmp -> hw_address.hlen = 1; ++ tmp -> hw_address.hbuf [0] = ARPHRD_INFINIBAND; ++ memcpy (&tmp -> hw_address.hbuf [1], sa.sa_data, 20); ++ break; ++ + default: + log_error ("%s: unknown hardware address type %d", + ifr.ifr_name, sa.sa_family); diff --git a/diags_release_notes.txt b/diags_release_notes.txt new file mode 100644 index 0000000..41e6d68 --- /dev/null +++ b/diags_release_notes.txt @@ -0,0 +1,90 @@ + Open Fabrics Enterprise Distribution (OFED) + Diagnostic Tools in OFED 1.1 Release Notes + + October 2006 + + +Repo: https://openib.org/svn/gen2/branches/1.1/src/userspace/management/diags +Version: 9535 + + +General +------- +Model of operation: All utilities use direct MAD access to perform their +operations. Operations that require QP0 mads only may use direct routed +mads, and therefore may work even in unconfigured subnets. Almost all +utilities can operate without accessing the SM, unless GUID to lid translation +is required. The only exception to this is saquery which requires the SM. + + +Dependencies +------------ +Most utilities depend on libibmad and libibumad. +All utilities depend on the ib_umad kernel module. + + +Multiple port/Multiple CA support +--------------------------------- +When no IB device or port is specified (see the "local umad parameters" below), +the libibumad library selects the port to use by the following criteria: +1. the first port that is ACTIVE. +2. if not found, the first port that is UP (physical link up). + +If a port and/or CA name is specified, the libibumad library attempts to +satisfy the user request, and will fail if it cannot do so. 
+ +For example: + ibaddr # use the 'best port' + ibaddr -C mthca1 # pick the best port from mthca1 only. + ibaddr -P 2 # use the second (active/up) port from the + first available IB device. + ibaddr -C mthca0 -P 2 # use the specified port only. + + +Common options & flags +---------------------- +Most diagnostics take the following flags. The exact list of supported +flags per utility can be found in the usage message and can be displayed +using util_name -h syntax. + +# Debugging flags + -d raise the IB debugging level. May be used + several times (-ddd or -d -d -d). + -e show umad send receive errors (timeouts and others) + -h display the usage message + -v increase the application verbosity level. + May be used several times (-vv or -v -v -v) + -V display the internal version info. + +# Addressing flags + -D use directed path address arguments. The path + is a comma separated list of out ports. + Examples: + "0" # self port + "0,1,2,1,4" # out via port 1, then 2, ... + -G use GUID address arguments. In most cases, it is the Port GUID. + Examples: + "0x08f1040023" + -s use 'smlid' as the target lid for SA queries. + +# Local umad parameters: + -C use the specified ca_name. + -P use the specified ca_port. + -t override the default timeout for the solicited mads. + + +CLI notation +------------ +All utilities use the POSIX style notation, meaning that all options (flags) +must precede all arguments (parameters). + + +Bugs Fixed +---------- +man pages are now supplied. + + +Utilities descriptions +---------------------- +See man pages + diff --git a/ehca_release_notes.txt b/ehca_release_notes.txt new file mode 100644 index 0000000..abc0f88 --- /dev/null +++ b/ehca_release_notes.txt @@ -0,0 +1,51 @@ + Open Fabrics Enterprise Distribution (OFED) + ehca in OFED 1.1 Release Notes + + October 2006 + + +Overview +-------- +ehca is the low level driver implementation for all IBM GX-based HCAs. 
+ +ehca Available Parameters +-------------------------- +In order to set ehca parameters, add the following line(s) to /etc/modprobe.conf: + + options ib_ehca = + +whereby is one of the following items: +- debug_level debug level (0: no debug traces (default), 1: with debug traces) +- nr_ports number of connected ports (default: 2) +- port_act_time time to wait for port activation (default: 30 sec) + +Known Issues +------------ + +1. The device driver normally uses both ports. For using just one port connect +the ports as shown in Figure 1 and load the device driver by running +`modprobe ib_ehca nr_ports=1`. + + --------- IB Card in p570 + | + \ / + +---+ + | # | + | # | + | # | + | # | <--- Port 2: NOT CONNECTED + | # | + |---| + | # | + | # | + | # | + | # | <--- Port 1: CONNECTED TO THE INFINIBAND SWITCH + | # | + +---+ + +*Figure 1:* Connections if only one port is used. + +NOTE: In OpenPower 720 and p550 port 1 is at the top, port 2 is at the bottom. + +2. Furthermore the port(s) needs to be connected to an active switch port while +loading the ehca device driver. diff --git a/ibutils_release_note.txt b/ibutils_release_note.txt new file mode 100644 index 0000000..f87e793 --- /dev/null +++ b/ibutils_release_note.txt @@ -0,0 +1,109 @@ + Open Fabrics Enterprise Distribution (OFED) + IBUTILS in OFED 1.1 Release Notes + + October 2006 + + +=============================================================================== +Table of Contents +=============================================================================== +1. Overview +2. Requirements +3. Reports +4. Known Issues + + +=============================================================================== +1. Overview +=============================================================================== + +The IBUTILS package provides means for debugging the connectivity and +status of InfiniBand (IB) devices in a fabric. 
+ +The package tools are intended to provide the following services: +* Discover the InfiniBand fabric connectivity +* Determine whether or not a Subnet Manager (SM) is running +* Identify links which drop packets and/or incur errors by sending MAD + packets multiple times, across all the links, reporting port monitor counters +* Identify fabric level mismatches or inconsistencies such as: + - Duplicate port GUIDs - Two or more different ports with the same GUID + - Duplicate node GUIDs - Two or more different nodes with the same node GUID + - Duplicate LIDs - Two or more devices that have the same assigned LID + - Zero valued LIDs - A device with LID=0 indicates that the SM did not + assign a LID to this device. + - Zero valued system GUIDs - A device with system GUID=0 indicates that + the vendor did not assign it a GUID. + - An InfiniBand link is in the INIT state, which prevents data transfer + - Unexpected link width (when using the -lw flag) + - Unexpected link speed (when using the -ls flag) + +The IBUTILS package includes the following stand-alone tools: + + o ibdiagnet + Discovers the network, providing a listing of the following: + - All the nodes, ports and links in the fabric + - Link Forwarding Tables (LFT) dump file + - Multicast Forwarding Tables (MFT) dump file + - Fabric Subnet Managers (SMs) information file and a list of all + the masked GUIDs found + + o ibdiagpath + - Traces a path between two nodes specified by LIDs or a directed path + between the source and destination nodes. + - Provides information regarding the nodes and ports traversed. + - Utilizes device-specific health queries for the different devices + along the path between the source and destination nodes. + +Note: There are man pages for both tools. + + +=============================================================================== +2. Requirements +=============================================================================== + 1. ibis must be installed. + + 2. 
The path environment variable must include the path to ibis. To define the + path to ibis use one of the following commands (depending on your shell): + export PATH=:$PATH + or + setenv PATH :$PATH + + (the default path to ibis is: /usr/local/ofed/bin/) + +=============================================================================== +3. Reports +=============================================================================== +The default directory for all generated report files is /tmp . + +Both utilities collect summary information regarding all the fabric SMs +during the run, and then output that information at the end of the run in the file +/tmp/ibdiagnet.sm. + +Each report message includes: + - Device Type + - Device portGUID + - The direct path to the device + - If a topology file is provided to be matched with the discovered fabric, + the node name is also provided in the report message. Otherwise, host + names are included only in HCA-related report messages. + +=============================================================================== +4. Known Issues +=============================================================================== + ibdiagpath issues: + - If no subnet manager is initialized in the subnet, FDB tables may be + incorrectly set. Consequently, PortCounter MADs cannot be sent. + + - A link along a LID-routed path in INIT state causes ibdiagpath performance + queries to fail. The performance queries fail since they cannot proceed via + non-ACTIVE links. + + - ibdiagpath cannot validate the provided topology file against the existing + fabric topology. If the topology file includes a device/link that does not + exist, or the device/link information is incorrect, then ibdiagpath may + -- in name-based routing -- extract a non-existing path based on the + incorrect topology file. + + - If the hostname provided for the -s flag is not the actual local hostname, + then all the extracted names from the topology file will be incorrect.
+ However, all the other information provided will be correct. diff --git a/ipath_release_notes.txt b/ipath_release_notes.txt new file mode 100644 index 0000000..9fcfa88 --- /dev/null +++ b/ipath_release_notes.txt @@ -0,0 +1,61 @@ + Open Fabrics Enterprise Distribution (OFED) + ipath in OFED 1.1 Release Notes + + October 2006 + + +Overview +-------- +ipath is the low level driver implementation for all QLogic HCAs. + + +1. MSI (Message Signalled Interrupt) support required with QLogic HCAs +---------------------------------------------------------------------- +The QLogic adapters require MSI (Message Signalled Interrupts) to be +enabled in the kernel. In addition, the kernels provided with some Linux +distributions must be rebuilt after installation of a kernel patch to +fix the pci_msi_quirk bug introduced in the 2.6.12 kernel. + +1.1. If the InfiniPath driver is being compiled on a machine without +CONFIG_PCI_MSI=y configured, a warning similar to this will appear in +dmesg at boot: + +[root@sqa-00 ~]# dmesg | grep ipath +ipath_core 0000:01:00.0: infinipath0: pci_enable_msi failed: -1, interrupts may not work +ipath_core 0000:01:00.0: infinipath0: irq is 0, BIOS error? Interrupts won't work + +OpenFabrics on a QLogic adapter will not work correctly unless the kernel +is configured with CONFIG_PCI_MSI=y. + +1.2. Systems with QLogic adapters that contain the AMD8131 PCI +bridge may require installation of the pci_msi_quirk kernel patch. If +the following messages are displayed on the console during boot, or are +in /var/log/messages, you will need to install the patch. + +PCI: MSI quirk detected. pci_msi_quirk set. +path_core 0000:03:00.0: pci_enable_msi failed: -22, interrupts +may not work + +NOTE: This problem has been fixed in the 2.6.17 kernel.org kernel.
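As an illustration of the CONFIG_PCI_MSI=y requirement in section 1.1, the following is a minimal sketch of a check against a kernel config file. The helper name msi_enabled is hypothetical, and the location /boot/config-$(uname -r) is an assumption that holds on many distributions but is not stated in these notes; adjust the path to wherever your kernel config is kept.

```shell
#!/bin/sh
# Sketch only: report whether a kernel config file enables MSI support.
# Assumption: the config is a plain-text file with one CONFIG_* setting per
# line. msi_enabled is a hypothetical helper, not part of OFED.
msi_enabled() {
    # $1: path to a kernel config file
    grep -q '^CONFIG_PCI_MSI=y$' "$1"
}

# Assumed, distribution-dependent location of the running kernel's config.
config_file="/boot/config-$(uname -r)"
if [ -r "$config_file" ]; then
    if msi_enabled "$config_file"; then
        echo "CONFIG_PCI_MSI=y is set"
    else
        echo "CONFIG_PCI_MSI=y is not set"
    fi
fi
```

If the file is not readable at that location, the sketch prints nothing; in that case the dmesg messages shown in section 1.1 remain the indicator described in these notes.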
+ +To install pci_msi_quirk patch and configure MSI for use with QLogic adapter +---------------------------------------------------------------------------- +To remedy both of these problems simultaneously, build the +kernel RPMs yourself with stock kernel SRPMs available from your +distribution source, and install the kernel patches provided at +http://www.pathscale.com/infinipath_support/downloads-1.3.html. + +See http://www.pathscale.com/docs/infinipath/1.3/msi_patch_notes.txt +for details. + +Once these instructions are completed, rebuild the OFED kernel +modules with the OFED installer. + +2. Note: +-------- +When running Fedora Core 4 with the QLogic adapters, it is recommended +to use the 2.6.16 kernel. + +3. Known Issues: +---------------- +The ipath driver only supports the x86_64 architecture in this release. diff --git a/ipoib_release_notes.txt b/ipoib_release_notes.txt new file mode 100644 index 0000000..5436a7d --- /dev/null +++ b/ipoib_release_notes.txt @@ -0,0 +1,151 @@ + Open Fabrics Enterprise Distribution (OFED) + IPoIB in OFED 1.1 Release Notes + + October 2006 + + +=============================================================================== +Table of Contents +=============================================================================== +1. Overview +2. New Features +3. Known Issues +4. DHCP Support of IPoIB +5. High Availability (HA) Service + +=============================================================================== +1. Overview +=============================================================================== +IPoIB is a network driver implementation that enables transmission of IP and +ARP protocol packets over an InfiniBand UD channel. The implementation conforms +to the relevant IETF working group's RFCs (http://www.ietf.org). + + +=============================================================================== +2. 
New Features +=============================================================================== +IPoIB supports increasing the verbosity of debug messages through a module +parameter. The parameter value can be controlled at load time or runtime. +At load time this can be done by inserting the following line in +/etc/modprobe.conf: + options ib_ipoib debug_level= + +At runtime the value can be controlled by writing the following to sysfs: + echo > /sys/module/ib_ipoib/parameters/debug_level + +The value can also be inspected by running: + cat /sys/module/ib_ipoib/parameters/debug_level + +Note that on some older kernels (like the one supplied with Redhat AS 4.0), +the path for inspecting the debug level is different and is as follows: + /sys/module/ib_ipoib/debug_level + +=============================================================================== +3. Known Issues +=============================================================================== +1. If a host has multiple interfaces, each belonging to a different IP subnet, + yet they use the same InfiniBand switch, the host may build an incorrect ARP + table. This may lead to problems which seem like violations of the IP rule + requiring different broadcast domains -- a rule not observed in this + implementation of IPoIB. + +2. On Fedora Core 4, SuSE 10 and SLES 10: + a. There are IPoIB alias lines in modprobe.conf which prevent stopping/ + unloading the stack (i.e., '/etc/init.d/openibd stop' will fail). + These alias lines cause the drivers to be loaded again by udev scripts. + + Workaround: Change modprobe.conf to set + OPENIB_PARAMS="--without-modprobe" before running install.sh, + or remove the alias lines from modprobe.conf. + + b. The ib1 interface uses the configuration script of ib0. + + Workaround: Invoke ifup/ifdown using both the interface name and the + configuration script name (example: ifup ib1 ib1). + +3. On RedHat EL 4 up2 the driver may not load properly if SELINUX is enforced. 
+ + Workaround: Change the value of the parameter SELINUX in + /etc/sysconfig/selinux from "enforcing" to "permissive" or "disabled". + +4. Since the IPoIB configuration files (ifcfg-ib) are installed at the + standard networking scripts location (RedHat:/etc/sysconfig/network-scripts/ + and SuSE: /etc/sysconfig/network/), the option IPOIB_LOAD=no in openib.conf + does not prevent the loading of IPoIB on boot. + +5. On RedHat EL 4 up4, ipoib multicast group membership does not work + due to missing code in the kernel which was available in u3 and removed + in u4. + + +=============================================================================== +4. DHCP Support of IPoIB +=============================================================================== +IPoIB is configured by default to use information obtained dynamically from a +DHCP server, at driver startup time, to configure its interfaces. + +Note: To use DHCP the user must apply a special patch (see "DHCP Notes" below). + +DHCP Supported Operating Systems +-------------------------------- +1. SLES 10 +2. SuSE 10 +3. Any kernel from 2.6.14 (tested with kernel 2.6.16.18) + +DHCP Unsupported Operating Systems +---------------------------------- +No RedHat EL distributions are supported. + + +DHCP Notes +---------- +1. It may be required to run over different UDP ports than the well known ports + (67 and 68). Free port numbers greater than 0x8000 must be chosen. To + specify a server or client port number, use the option -p . + The client's port number must be the chosen server's port number plus one. + +2. For IPoIB to use DHCP, it is required to patch ISC's DHCP. The patch file can + be found under OFED-1.1/docs/dhcp after extracting the distribution file + (after installation it can also be found under /docs/dhcp). The patch + should be applied for the server and for each client. Tests were run on + version 3.0.4 of the DHCP package. 
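The port rules in DHCP note 1 (ports greater than 0x8000, client port equal to the server port plus one) can be sketched as a small helper. This is an illustration only: the function name dhcp_client_port and the example port 40000 are hypothetical, and the upper bound of 65535 is the general UDP port limit rather than something stated in these notes.

```shell
#!/bin/sh
# Sketch only: derive the DHCP client port from a chosen server port.
# Rules taken from the DHCP notes: the port must be greater than 0x8000
# (32768 decimal), and the client port is the server port plus one.
# dhcp_client_port is a hypothetical helper, not part of OFED or ISC DHCP.
dhcp_client_port() {
    # $1: chosen server port, in decimal
    server_port=$1
    if [ "$server_port" -le 32768 ]; then
        echo "server port must be greater than 32768 (0x8000)" >&2
        return 1
    fi
    client_port=$((server_port + 1))
    # 65535 is the highest valid UDP port number.
    if [ "$client_port" -gt 65535 ]; then
        echo "client port would exceed 65535" >&2
        return 1
    fi
    echo "$client_port"
}

dhcp_client_port 40000   # prints 40001
```

The resulting pair of numbers would then be passed to the server and to each client through the -p option mentioned in note 1.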
+ + +=============================================================================== +5. High Availability (HA) Service +=============================================================================== +High Availability (HA) service for IPoIB interfaces is provided via the +ipoibtools package. Ipoibtools currently includes a perl script, ipoib_ha.pl, +and two executables, arpingib and mcasthandle. + +The HA service operates as follows: A user-level daemon runs in background to +detect failure of the primary IPoIB interface. If such a failure is detected +(e.g., port down), the daemon configures the secondary IPoIB interface with the +configuration parameters of the primary IPoIB interface (so that the secondary +interface assumes the IP identity of the primary interface). + +Enabling the HA Service +----------------------- +To enable HA service automatically (upon bootup of the driver), +perform the following steps: + +1. Edit file '/etc/infiniband/openib.conf' as follows: + + IPOIBHA_ENABLE=yes + PRIMARY_IPOIB_DEV=ib0 + SECONDARY_IPOIB_DEV=ib1 + +2. Run '/etc/init.d/openibd restart' to restart the driver. + +The HA service may also be activated manually, via the following command: + + ipoib_ha.pl -p -s \ + --with-arping --with-multicast [-v] + + -p primary IPoIB interface (default: ib0) + -s secondary IPoIB interface (default: ib1) + --with-arping use modified arping utility to send an unsolicited + ARP REPLY + --with-multicast support applications that are using multicast + -v verbose output + diff --git a/iser_release_notes.txt b/iser_release_notes.txt new file mode 100644 index 0000000..be39f90 --- /dev/null +++ b/iser_release_notes.txt @@ -0,0 +1,44 @@ + Open Fabrics Enterprise Distribution (OFED) + iSER initiator in OFED 1.1 Release Notes + + October 2006 + + +* Background + + iSER allows iSCSI to be layered over RDMA transports (including InfiniBand + and iWARP (RNIC)). 
+ + The OpenIB iSER initiator implementation is interoperable with open-iscsi + (http://www.open-iscsi.org/). It provides an alternative transport to + iscsi_tcp in the open iscsi framework. The iSER transport exposes a + transport API to scsi_transport_iscsi, and a SCSI LLD API to the Linux + SCSI mid-layer (scsi_mod). + +* supported platforms + + SLES10 (RC1 and later) + + the release has been tested against Voltaire iSCSI/iSER target running + in Voltaire's IB/Fibre-Channel router + +* known issues + + SCSI command aborts may cause some instabilities + +* iSER links + + WIKI pages + + information on building/configuring/running the open iscsi initiator w. iSER + https://openib.org/tiki/tiki-index.php?page=iSER + + IETF pages + + iSCSI and iSER specifications come out of the IETF IP storage (IPS) work group + http://www.ietf.org/html.charters/ips-charter.html. + + "ABOUT" page + + general and detailed information on iSCSI and iSER + http://www.voltaire.com/iser.htm diff --git a/mthca_release_notes.txt b/mthca_release_notes.txt new file mode 100644 index 0000000..16f1ccc --- /dev/null +++ b/mthca_release_notes.txt @@ -0,0 +1,92 @@ + Open Fabrics Enterprise Distribution (OFED) + mthca in OFED 1.1 Release Notes + + October 2006 + + +=============================================================================== +Table of Contents +=============================================================================== +1. Overview +2. New Features +3. Fixed Bugs +4. Known Issues + +=============================================================================== +1. Overview +=============================================================================== +mthca is the low level driver implementation for all Mellanox Technologies HCAs. 
+ +mthca Available Parameters +-------------------------- +In order to set mthca parameters, add the following line to /etc/modprobe.conf: + + options ib_mthca parameter= + +mthca parameters: + + - tune_pci increase PCI burst from the default set by BIOS if nonzero + - msi attempt to use MSI if nonzero + - msi_x attempt to use MSI-X if nonzero + - fw_cmd_doorbell post FW commands through doorbell page if nonzero (and + supported by FW) + - catas_reset_disable disable device reset on a catastrophic event if nonzero + + +=============================================================================== +2. New Features +=============================================================================== +1. Catastrophic event reset: catastrophic event handling has been expanded + to include resetting the device. After generating the IB_EVENT_DEVICE_FATAL + async event, mthca now resets the device (assuming that the + catas_reset_disable module parameter described above is zero). + + Note that the reset entails removing then adding the device. For the device + to complete the reset, all user-level applications using device resources + directly via the user verbs layer must release those resources. Thus, such + applications should register to receive async events, should detect the + IBV_EVENT_DEVICE_FATAL event, and should release all resources for that + device upon receiving such an event. + + +=============================================================================== +3. Fixed Bugs +=============================================================================== +1. mthca no longer misses restoring the following PCI-X/PCI Express + registers after reset: + o PCI-X device: PCI-X command register + o PCI-X bridge: upstream and downstream split transaction registers + o PCI Express: PCI Express device control and link control registers +2. Fence bit is now supported properly. +3. Fixed modify_qp, modify_srq and resize_cq methods to be fully reentrant.
+ +=============================================================================== +4. Known Issues +=============================================================================== +1. UAR size other than 8MB prevents mthca driver loading. The default UAR + size is 8MB. If it is changed, the following error message will be logged to + /var/log/messages upon attempt to load the mthca driver: + ib_mthca 0000:04:00.0: Missing UAR, aborting. + +2. If a user level application that uses multicast receives a control signal + in the process of detaching from a multicast group, its QP may remain a + member of the multicast group (in HCA). + Workaround: Destroy the multicast group after detaching the QP from it. + +3. On MemFree devices, RC QPs can be created with a maximum of (max_sge - 3) + entries only. + +4. Performance degradation due to wrong BIOS configuration: + The PCI Express spec. requires BIOS to set the MaxReadReq register + for each card for maximum performance and stability. + + If you are seeing bandwidth performance degradation, you can try forcing + the card to behave out of PCI Express spec. by setting the tune_pci=1 module + parameter. This tune_pci=1 option was the default setting in OFED + 1.0, which might have masked performance degradation on some systems. + + If tune_pci=1 improves bandwidth, please report the issue to your + BIOS vendor. Please note that Mellanox Technologies does not recommend using + tune_pci=1 in production systems: working with tune_pci=1 option set is + untested and is known to trigger stability issues on some platforms. + diff --git a/ofed-docs.spec b/ofed-docs.spec new file mode 100644 index 0000000..af7001a --- /dev/null +++ b/ofed-docs.spec @@ -0,0 +1,65 @@ +# +# Copyright (c) 2006 Mellanox Technologies. All rights reserved. 
+# +# This Software is licensed under one of the following licenses: +# +# 1) under the terms of the "Common Public License 1.0" a copy of which is +# available from the Open Source Initiative, see +# http://www.opensource.org/licenses/cpl.php. +# +# 2) under the terms of the "The BSD License" a copy of which is +# available from the Open Source Initiative, see +# http://www.opensource.org/licenses/bsd-license.php. +# +# 3) under the terms of the "GNU General Public License (GPL) Version 2" a +# copy of which is available from the Open Source Initiative, see +# http://www.opensource.org/licenses/gpl-license.php. +# +# Licensee has the right to choose one of the above licenses. +# +# Redistributions of source code must retain the above copyright +# notice and one of the license notices. +# +# Redistributions in binary form must reproduce both the above copyright +# notice, one of the license notices in the documentation +# and/or other materials provided with the distribution. +# +# +# $Id: ofed-docs.spec 7948 2006-06-13 12:42:34Z vlad $ +# + +Summary: OFED docs +Name: ofed-docs +Version: @VERSION@ +Release: 0 +License: GPL/BSD +Url: https://openib.org/svn/gen2/branches/1.1/ofed/docs +Group: Documentation/Man +Source: %{name}-%{version}.tar.gz +BuildRoot: %{?build_root:%{build_root}}%{!?build_root:/var/tmp/%{name}-%{version}-root} +Vendor: OpenFabrics +%description +OpenFabrics documentation + +%prep +%setup -q -n %{name}-%{version} + +%install +mkdir -p $RPM_BUILD_ROOT%{_prefix} +cp -a * $RPM_BUILD_ROOT%{_prefix} + +%clean +rm -rf $RPM_BUILD_ROOT + +%files +%defattr(-,root,root) +%{_prefix}/docs +%{_prefix}/README.txt +%{_prefix}/LICENSE +%{_prefix}/BUILD_ID + +%changelog +* Thu Jul 27 2006 Vladimir Sokolovsky +- Changed version to 1.1 +* Tue Jun 6 2006 Vladimir Sokolovsky +- Initial packaging diff --git a/ofed.conf-example b/ofed.conf-example new file mode 100644 index 0000000..00fb798 --- /dev/null +++ b/ofed.conf-example @@ -0,0 +1,57 @@ 
+STACK_PREFIX=/usr/local/ofed +BUILD_ROOT=/var/tmp/OFED +kernel_ib=y +kernel_ib_devel=y +libibverbs=y +libibverbs_devel=y +libibverbs_utils=y +libibcm=y +libibcm_devel=y +libmthca=y +libehca=y +libmthca_devel=y +perftest=y +mstflint=y +libipathverbs=y +libipathverbs_devel=y +oiscsi_iser=y +ofed_docs=y +ofed_scripts=y +libsdp=y +srptools=y +ipoibtools=y +tvflash=y +libibcommon=y +libibcommon_devel=y +libibmad=y +libibmad_devel=y +libibumad=y +libibumad_devel=y +libopensm=y +libopensm_devel=y +opensm=y +libosmcomp=y +libosmcomp_devel=y +libosmvendor=y +libosmvendor_devel=y +openib_diags=y +librdmacm=y +librdmacm_devel=y +librdmacm_utils=y +dapl=y +dapl_devel=y +mpi_osu=y +openmpi=y +mpitests=y +ibutils=y +ib_verbs=y +ib_mthca=y +ib_ehca=y +ib_ipoib=y +ib_ipath=y +ib_sdp=y +ib_srp=y +ib_iser=y +MPI_COMPILER_mpi_osu=" gcc intel pathscale" +MPI_COMPILER_openmpi=" gcc intel pathscale" +# OPENIB_PARAMS="--with-memtrack --without-modprobe --with-madeye-mod --without-ipoibconf" diff --git a/ofed_net.conf-example b/ofed_net.conf-example new file mode 100644 index 0000000..fe7a947 --- /dev/null +++ b/ofed_net.conf-example @@ -0,0 +1,13 @@ +LAN_INTERFACE_ib0=eth0 +IPADDR_ib0=192.168.0.'*' +NETMASK_ib0=255.255.255.0 +NETWORK_ib0=192.168.0.0 +BROADCAST_ib0=192.168.0.255 +ONBOOT_ib0=1 + +LAN_INTERFACE_ib1=eth0 +IPADDR_ib1=172.16.'*'.'*' +NETMASK_ib1=255.255.0.0 +NETWORK_ib1=172.16.0.0 +BROADCAST_ib1=172.16.255.255 +ONBOOT_ib1=1 diff --git a/open_mpi_release_notes.txt b/open_mpi_release_notes.txt new file mode 100644 index 0000000..90da614 --- /dev/null +++ b/open_mpi_release_notes.txt @@ -0,0 +1,817 @@ + Open Fabrics Enterprise Distribution (OFED) + Open MPI in OFED 1.1 Copyrights, License, and Release Notes + + October 2006 + + +Open MPI Copyrights +------------------- +Copyright (c) 2004-2006 The Trustees of Indiana University and Indiana + University Research and Technology + Corporation. All rights reserved. 
+Copyright (c) 2004-2006 The University of Tennessee and The University + of Tennessee Research Foundation. All rights + reserved. +Copyright (c) 2004-2006 High Performance Computing Center Stuttgart, + University of Stuttgart. All rights reserved. +Copyright (c) 2004-2006 The Regents of the University of California. + All rights reserved. +Copyright (c) 2006 Cisco Systems, Inc. All rights reserved. +Copyright (c) 2006 Voltaire, Inc. All rights reserved. + +Additional copyrights may follow + +Open MPI License +---------------- +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions are +met: + +- Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + +- Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer listed + in this license in the documentation and/or other materials + provided with the distribution. + +- Neither the name of the copyright holders nor the names of its + contributors may be used to endorse or promote products derived from + this software without specific prior written permission. + +The copyright holders provide no reassurances that the source code +provided does not infringe any patent, copyright, or any other +intellectual property rights of third parties. The copyright holders +disclaim any liability to any recipient for claims brought against +recipient by any third party for infringement of that parties +intellectual property rights. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS +"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT +LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR +A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT +OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, +SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT +LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, +DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY +THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +=========================================================================== + +The best way to report bugs, send comments, or ask questions is to +sign up on the user's and/or developer's mailing list (for user-level +and developer-level questions; when in doubt, send to the user's +list): + + users@open-mpi.org + devel@open-mpi.org + +Because of spam, only subscribers are allowed to post to these lists +(ensure that you subscribe with and post from exactly the same e-mail +address -- joe@example.com is considered different than +joe@mycomputer.example.com!). Visit these pages to subscribe to the +lists: + + http://www.open-mpi.org/mailman/listinfo.cgi/users + http://www.open-mpi.org/mailman/listinfo.cgi/devel + +Thanks for your time. + +=========================================================================== + +OFED-Specific Release Notes +--------------------------- + +SLES 10 with non-gcc compiler support: + +The Open MPI v1.1.1-1 SRPM included in OFED v1.1 will not build with +non-gcc compilers on SLES 10 because SLES's rpmbuild command inserts +"-D_FORTIFY_SOURCE=2" into the build flags, which -- for lack of a +longer, boring explanation -- doesn't yet work with non-gcc compilers. + +Open MPI can be built from source to work around this problem. The +source code for Open MPI can be extracted from the SRPM shipped with +OFED or downloaded from the main Open MPI web site: +http://www.open-mpi.org/.
+ +To compile Open MPI from source with OFED support, fully install +the rest of OFED and then configure Open MPI with the +"--with-openib=/usr/local/ofed" command line option. See the rest of +the documentation below for other configure command line options and +installation instructions. + +Intel compiler support: + +Some versions of the Intel 9.1 C++ compiler suite series produce +incorrect code when used with the Open MPI C++ bindings. Symptoms of +this problem include crashing applications (e.g., segmentation +violations) and Open MPI producing errors about incorrect parameters. + +=========================================================================== + +General Release Notes +--------------------- + +The following abbreviated list of release notes applies to this code +base as of this writing (17 Jun 2006): + +- Open MPI includes support for a wide variety of supplemental + hardware and software packages. When configuring Open MPI, you may + need to supply additional flags to the "configure" script in order + to tell Open MPI where the header files, libraries, and any other + required files are located. As such, running "configure" by itself + may not include support for all the devices (etc.) that you expect, + especially if their support headers / libraries are installed in + non-standard locations. Network interconnects are an easy example + to discuss -- Myrinet and Infiniband, for example, both have + supplemental headers and libraries that must be found before Open + MPI can build support for them. You must specify where these files + are with the appropriate options to configure. See the listing of + configure command-line switches, below, for more details. + +- The Open MPI installation must be in your PATH on all nodes (and + potentially LD_LIBRARY_PATH, if libmpi is a shared library). + +- LAM/MPI-like mpirun notation of "C" and "N" is not yet supported.
+ +- Striping MPI messages across multiple networks is supported (and + happens automatically when multiple networks are available), but + needs performance tuning. + +- The run-time systems that are currently supported are: + - rsh / ssh + - Recent versions of BProc (e.g., Clustermatic) + - PBS Pro, Open PBS, Torque (i.e., anything that supports the TM + interface) + - SLURM + - XGrid + - Cray XT-3 / Red Storm + +- The majority of Open MPI's documentation is here in this file and on + the web site FAQ (http://www.open-mpi.org/). This will eventually + be supplemented with cohesive installation and user documentation + files. + +- Systems that have been tested are: + - Linux, 32 bit, with gcc + - Linux, 64 bit (x86), with gcc + - OS X (10.3), 32 bit, with gcc + - OS X (10.4), 32 bit, with gcc + +- Other systems have been lightly (but not fully) tested: + - Other compilers on Linux, 32 and 64 bit + - Other 64 bit platforms (Linux and AIX on PPC64, SPARC) + +- Some MCA parameters can be set in a way that renders Open MPI + inoperable (see notes about MCA parameters later in this file). In + particular, some parameters have required options that must be + included. + - If specified, the "btl" parameter must include the "self" + component, or Open MPI will not be able to deliver messages to the + same rank as the sender. For example: "mpirun --mca btl tcp,self + ..." + - If specified, the "btl_tcp_if_exclude" parameter must include the + loopback device ("lo" on many Linux platforms), or Open MPI will + not be able to route MPI messages using the TCP BTL. For example: + "mpirun --mca btl_tcp_if_exclude lo,eth1 ..." + +- Building shared libraries on AIX with the xlc compilers is only + supported if you supply the following command line option to + configure: LDFLAGS=-Wl,-brtl. + +- Open MPI does not support the Sparc v8 CPU target, which is the + default on Sun Solaris. The v8plus (32 bit) or v9 (64 bit) + targets must be used to build Open MPI on Solaris.
This can be
+  done by including a flag in CFLAGS, CXXFLAGS, FFLAGS, and FCFLAGS:
+  -xarch=v8plus for the Sun compilers, or -mv8plus for GCC.
+
+- At least some versions of the Intel 8.1 compiler seg fault while
+  compiling certain Open MPI source code files. As such, it is not
+  supported.
+
+- The Intel 9.0 v20051201 compiler on IA64 platforms seems to have a
+  problem with optimizing the ptmalloc2 memory manager component (the
+  generated code will segv). As such, the ptmalloc2 component will
+  automatically disable itself if it detects that it is on this
+  platform/compiler combination. The only effect that this should
+  have is that the MCA parameter mpi_leave_pinned will be inoperative.
+
+- Early versions of the Portland Group 6.0 compiler have problems
+  creating the C++ MPI bindings as a shared library (e.g., v6.0-1).
+  Tests with later versions show that this has been fixed (e.g.,
+  v6.0-5).
+
+- The Portland Group compilers require the "-Msignextend" compiler
+  flag to extend the sign bit when converting from a shorter to longer
+  integer. This is different from other compilers (such as GNU).
+  When compiling Open MPI with the Portland compiler suite, the
+  following flags should be passed to Open MPI's configure script:
+
+  shell$ ./configure CFLAGS=-Msignextend CXXFLAGS=-Msignextend \
+         --with-wrapper-cflags=-Msignextend \
+         --with-wrapper-cxxflags=-Msignextend ...
+
+  This will both compile Open MPI with the proper compile flags and
+  also automatically add "-Msignextend" when the C and C++ MPI wrapper
+  compilers are used to compile user MPI applications.
+
+- Open MPI will build bindings suitable for all common forms of
+  Fortran 77 compiler symbol mangling on platforms that support weak
+  symbols (e.g., Linux). On platforms that do not support weak
+  symbols (e.g., OS X), Open MPI will build Fortran 77 bindings just
+  for the compiler that Open MPI was configured with.
+
+  Hence, on platforms that support weak symbols, if you configure Open
+  MPI with a Fortran 77 compiler that uses one symbol mangling scheme,
+  you can successfully compile and link MPI Fortran 77 applications
+  with a Fortran 77 compiler that uses a different symbol mangling
+  scheme.
+
+  NOTE: For platforms that support the multi-Fortran-compiler bindings
+  (i.e., weak symbols are supported), due to limitations in the MPI
+  standard and in Fortran compilers, it is not possible to hide these
+  differences in all cases. Specifically, the following two cases may
+  not be portable between different Fortran compilers:
+
+  1. The C constants MPI_F_STATUS_IGNORE and MPI_F_STATUSES_IGNORE
+     will only compare properly to Fortran applications that were
+     created with Fortran compilers that use the same
+     name-mangling scheme as the Fortran compiler that Open MPI was
+     configured with.
+
+  2. Fortran compilers may have different values for the logical
+     .TRUE. constant. As such, any MPI function that uses the Fortran
+     LOGICAL type may only get .TRUE. values back that correspond to
+     the .TRUE. value of the Fortran compiler that Open MPI was
+     configured with. Note that some Fortran compilers allow forcing
+     .TRUE. to be 1 and .FALSE. to be 0. For example, the Portland
+     Group compilers provide the "-Munixlogical" option, and Intel
+     compilers (version >= 8.) provide the "-fpscomp logicals" option.
+
+  You can use the ompi_info command to see the Fortran compiler that
+  Open MPI was configured with.
+
+- The MPI and run-time layers do not free all used memory properly
+  during MPI_FINALIZE.
+
+- Running on nodes with different endianness and/or different datatype
+  sizes within a single parallel job is supported starting with Open
+  MPI v1.1. However, Open MPI does not resize data when datatypes
+  differ in size (for example, sending a 4 byte MPI_LONG and receiving
+  an 8 byte MPI_LONG will fail).
+
+- MPI_THREAD_MULTIPLE support is included, but is only lightly tested.
+
+- Asynchronous message passing progress using threads can be turned on
+  with the --enable-progress-threads option to configure.
+  Asynchronous message passing progress is only supported for TCP,
+  shared memory, and Myrinet/GM. Myrinet/GM has only been lightly
+  tested.
+
+- Due to limitations in the Libtool 1.5 series, Fortran 90 MPI
+  bindings support can only be built as a static library. It is
+  expected that Libtool 2.0 (and therefore future releases of Open
+  MPI) will be able to support shared libraries for the Fortran 90
+  bindings.
+
+- The XGrid support is experimental - see the Open MPI FAQ and this
+  post on the Open MPI user's mailing list for more information:
+
+  http://www.open-mpi.org/community/lists/users/2006/01/0539.php
+
+- The MX library limits the maximum message fragment size for both
+  on-node and off-node messages. As of MX v1.0.3, the inter-node
+  maximum fragment size is 32k, and the intra-node maximum fragment
+  size is 16k -- fragments larger than these sizes will fail to send.
+  Open MPI automatically fragments large messages; it currently limits
+  its first fragment size on MX networks to the lower of these two
+  values -- 16k. As such, increasing the value of the MCA parameter
+  named "btl_mx_first_frag_size" above 16k may cause failures in
+  some cases (i.e., when using MX to send large messages to processes
+  on the same node); it will cause failures in all cases if it is set
+  above 32k. Note that this only affects the *first* fragment of
+  messages; later fragments do not have this size restriction. The
+  MCA parameter btl_mx_max_send_size can be used to vary the maximum
+  size of subsequent fragments.
+
+- The current version of the Open MPI point-to-point engine does not
+  yet support hardware-level MPI message matching. As such, MPI
+  message matching must be performed in software, artificially
+  increasing latency for short messages on certain networks (such as
+  MX and hardware-supported Portals).
Future versions of Open MPI + will support hardware matching on networks that provide it, and will + eliminate the extra overhead of software MPI message matching where + possible. + +- The Fortran 90 MPI bindings can now be built in one of three sizes + using --with-mpi-f90-size=SIZE (see description below). These sizes + reflect the number of MPI functions included in the "mpi" Fortran 90 + module and therefore which functions will be subject to strict type + checking. All functions not included in the Fortran 90 module can + still be invoked from F90 applications, but will fall back to + Fortran-77 style checking (i.e., little/none). + + - trivial: Only includes F90-specific functions from MPI-2. This + means overloaded versions of MPI_SIZEOF for all the MPI-supported + F90 intrinsic types. + + - small (default): All the functions in "trivial" plus all MPI + functions that take no choice buffers (meaning buffers that are + specified by the user and are of type (void*) in the C bindings -- + generally buffers specified for message passing). Hence, + functions like MPI_COMM_RANK are included, but functions like + MPI_SEND are not. + + - medium: All the functions in "small" plus all MPI functions that + take one choice buffer (e.g., MPI_SEND, MPI_RECV, ...). All + one-choice-buffer functions have overloaded variants for each of + the MPI-supported Fortran intrinsic types up to the number of + dimensions specified by --with-f90-max-array-dim (default value is + 4). + + Increasing the size of the F90 module (in order from trivial, small, + and medium) will generally increase the length of time required to + compile user MPI applications. Specifically, "trivial"- and + "small"-sized F90 modules generally allow user MPI applications to + be compiled fairly quickly but lose type safety for all MPI + functions with choice buffers. "medium"-sized F90 modules generally + take longer to compile user applications but provide greater type + safety for MPI functions. 
+ + Note that MPI functions with two choice buffers (e.g., MPI_GATHER) + are not currently included in Open MPI's F90 interface. Calls to + these functions will automatically fall through to Open MPI's F77 + interface. A "large" size that includes the two choice buffer MPI + functions is possible in future versions of Open MPI. + +=========================================================================== + +Building Open MPI +----------------- + +Open MPI uses a traditional configure script paired with "make" to +build. Typical installs can be of the pattern: + +--------------------------------------------------------------------------- +shell$ ./configure [...options...] +shell$ make all install +--------------------------------------------------------------------------- + +There are many available configure options (see "./configure --help" +for a full list); a summary of the more commonly used ones follows: + +--prefix= + Install Open MPI into the base directory named . Hence, + Open MPI will place its executables in /bin, its header + files in /include, its libraries in /lib, etc. + +--with-gm= + Specify the directory where the GM libraries and header files are + located. This enables GM support in Open MPI. + +--with-gm-libdir= + Look in directory for the GM libraries. By default, Open MPI will + look in /lib and /lib64, which covers + most cases. This option is only needed for special configurations. + +--with-mx= + Specify the directory where the MX libraries and header files are + located. This enables MX support in Open MPI. + +--with-mx-libdir= + Look in directory for the MX libraries. By default, Open MPI will + look in /lib and /lib64, which covers + most cases. This option is only needed for special configurations. + +--with-mvapi= + Specify the directory where the mVAPI libraries and header files are + located. This enables mVAPI support in Open MPI. + +--with-mvapi-libdir= + Look in directory for the MVAPI libraries. 
By default, Open MPI will
+  look in /lib and /lib64, which covers
+  most cases. This option is only needed for special configurations.
+
+--with-openib=
+  Specify the directory where the Open Fabrics (previously known as
+  OpenIB) libraries and header files are located. This enables Open
+  Fabrics support in Open MPI.
+
+--with-openib-libdir=
+  Look in directory for the OPENIB libraries. By default, Open MPI will
+  look in /lib and /lib64, which covers
+  most cases. This option is only needed for special configurations.
+
+--with-tm=
+  Specify the directory where the TM libraries and header files are
+  located. This enables PBS / Torque support in Open MPI.
+
+--with-mpi-param-check(=value)
+  "value" can be one of: always, never, runtime. If
+  --with-mpi-param-check is not specified, "runtime" is the default.
+  If --with-mpi-param-check is specified with no value, "always" is
+  used. Using --without-mpi-param-check is equivalent to "never".
+
+  - always: the parameters of MPI functions are always checked for
+    errors
+  - never: the parameters of MPI functions are never checked for
+    errors
+  - runtime: whether the parameters of MPI functions are checked
+    depends on the value of the MCA parameter mpi_param_check
+    (default: yes).
+
+--with-threads=value
+  Since thread support (both support for MPI_THREAD_MULTIPLE and
+  asynchronous progress) is only partially tested, it is disabled by
+  default. To enable threading, use "--with-threads=posix". This is
+  most useful when combined with --enable-mpi-threads and/or
+  --enable-progress-threads.
+
+--enable-mpi-threads
+  Allows the MPI thread level MPI_THREAD_MULTIPLE. See
+  --with-threads; this is currently disabled by default.
+
+--enable-progress-threads
+  Allows asynchronous progress in some transports. See
+  --with-threads; this is currently disabled by default.
+
+--disable-mpi-cxx
+  Disable building the C++ MPI bindings.
Note that this does *not*
+  disable the C++ checks during configure; some of Open MPI's tools
+  are written in C++ and therefore require a C++ compiler to be built.
+
+--disable-mpi-f77
+  Disable building the Fortran 77 MPI bindings.
+
+--disable-mpi-f90
+  Disable building the Fortran 90 MPI bindings. Also related to the
+  --with-f90-max-array-dim and --with-mpi-f90-size options.
+
+--with-mpi-f90-size=
+  Three sizes of the MPI F90 module can be built: trivial (only a
+  handful of MPI-2 F90-specific functions are included in the F90
+  module), small (trivial + all MPI functions that take no choice
+  buffers), and medium (small + all MPI functions that take 1 choice
+  buffer). This parameter is only used if the F90 bindings are
+  enabled.
+
+--with-f90-max-array-dim=
+  The F90 MPI bindings are strictly typed, even including the number of
+  dimensions for arrays for MPI choice buffer parameters. Open MPI
+  generates these bindings at compile time with a maximum number of
+  dimensions as specified by this parameter. The default value is 4.
+
+--disable-shared
+  By default, libmpi is built as a shared library, and all components
+  are built as dynamic shared objects (DSOs). This switch disables
+  this default; it is really only useful when used with
+  --enable-static. Specifically, this option does *not* imply
+  --enable-static; enabling static libraries and disabling shared
+  libraries are two independent options.
+
+--enable-static
+  Build libmpi as a static library, and statically link in all
+  components. Note that this option does *not* imply
+  --disable-shared; enabling static libraries and disabling shared
+  libraries are two independent options.
+
+There are several other options available -- see "./configure --help".
+
+To change the compilers that Open MPI uses to build itself, use the
+standard Autoconf mechanism of setting special environment variables
+either before invoking configure or on the configure command line.
+The following environment variables are recognized by configure:
+
+CC          - C compiler to use
+CFLAGS      - Compile flags to pass to the C compiler
+CPPFLAGS    - Preprocessor flags to pass to the C compiler
+
+CXX         - C++ compiler to use
+CXXFLAGS    - Compile flags to pass to the C++ compiler
+CXXCPPFLAGS - Preprocessor flags to pass to the C++ compiler
+
+F77         - Fortran 77 compiler to use
+FFLAGS      - Compile flags to pass to the Fortran 77 compiler
+
+FC          - Fortran 90 compiler to use
+FCFLAGS     - Compile flags to pass to the Fortran 90 compiler
+
+LDFLAGS     - Linker flags to pass to all compilers
+LIBS        - Libraries to pass to all compilers (users rarely need
+              to specify additional LIBS)
+
+For example:
+
+shell$ ./configure CC=mycc CXX=myc++ F77=myf77 FC=myf90 ...
+
+The compilers specified must be compile and link compatible, meaning
+that object files created by one compiler must be able to be linked
+with object files from the other compilers and produce correctly
+functioning executables.
+
+Open MPI supports all the "make" targets that are provided by GNU
+Automake, such as:
+
+all       - build the entire Open MPI package
+install   - install Open MPI
+uninstall - remove all traces of Open MPI from the $prefix
+clean     - clean out the build tree
+
+Once Open MPI has been built and installed, it is safe to run "make
+clean" and/or remove the entire build tree.
+
+VPATH builds are fully supported.
+
+Generally speaking, the only thing that users need to do to use Open
+MPI is ensure that /bin is in their PATH and /lib is
+in their LD_LIBRARY_PATH. Users may need to set PATH and
+LD_LIBRARY_PATH in their shell setup files (e.g., .bashrc, .cshrc)
+so that rsh/ssh-based logins will be able to find the Open MPI
+executables.
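
As a minimal sketch of the PATH and LD_LIBRARY_PATH setup described
above, the following Bourne-style fragment could be placed in a shell
setup file such as .bashrc. The prefix /opt/openmpi and the helper name
prepend_path are assumptions made only for illustration; substitute the
prefix that was actually passed to configure.

```shell
# Hypothetical install prefix; replace with the value given to --prefix.
OMPI_PREFIX="${OMPI_PREFIX:-/opt/openmpi}"

# Prepend a directory to a colon-separated path variable unless it is
# already listed, so re-sourcing this file does not grow the variable.
prepend_path() {
    # $1 = variable name, $2 = directory to add
    eval "current=\${$1:-}"
    case ":$current:" in
        *":$2:"*) ;;    # already present; nothing to do
        *) eval "$1=\"$2\${current:+:\$current}\"" ;;
    esac
    export "$1"
}

prepend_path PATH "$OMPI_PREFIX/bin"
prepend_path LD_LIBRARY_PATH "$OMPI_PREFIX/lib"
```

Because rsh/ssh-based launches start non-interactive shells, a fragment
like this only helps if it sits in a file that such shells actually
read; which file that is depends on the shell and site configuration.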
+ +=========================================================================== + +Checking Your Open MPI Installation +----------------------------------- + +The "ompi_info" command can be used to check the status of your Open +MPI installation (located in /bin/ompi_info). Running it with +no arguments provides a summary of information about your Open MPI +installation. + +Note that the ompi_info command is extremely helpful in determining +which components are installed as well as listing all the run-time +settable parameters that are available in each component (as well as +their default values). + +The following options may be helpful: + +--all Show a *lot* of information about your Open MPI + installation. +--parsable Display all the information in an easily + grep/cut/awk/sed-able format. +--param + A of "all" and a of "all" will + show all parameters to all components. Otherwise, the + parameters of all the components in a specific framework, + or just the parameters of a specific component can be + displayed by using an appropriate and/or + name. + +Changing the values of these parameters is explained in the "The +Modular Component Architecture (MCA)" section, below. + +=========================================================================== + +Compiling Open MPI Applications +------------------------------- + +Open MPI provides "wrapper" compilers that should be used for +compiling MPI applications: + +C: mpicc +C++: mpiCC (or mpic++ if your filesystem is case-insensitive) +Fortran 77: mpif77 +Fortran 90: mpif90 + +For example: + +shell$ mpicc hello_world_mpi.c -o hello_world_mpi -g +shell$ + +All the wrapper compilers do is add a variety of compiler and linker +flags to the command line and then invoke a back-end compiler. To be +specific: the wrapper compilers do not parse source code at all; they +are solely command-line manipulators, and have nothing to do with the +actual compilation or linking of programs. 
The end result is an MPI
+executable that is properly linked to all the relevant libraries.
+
+===========================================================================
+
+Running Open MPI Applications
+-----------------------------
+
+Open MPI supports both mpirun and mpiexec (they are exactly
+equivalent). For example:
+
+shell$ mpirun -np 2 hello_world_mpi
+
+or
+
+shell$ mpiexec -np 1 hello_world_mpi : -np 1 hello_world_mpi
+
+are equivalent. Some of mpiexec's switches (such as -host and -arch)
+are not yet functional, although they will not error if you try to use
+them.
+
+The rsh starter accepts a -hostfile parameter (the option
+"-machinefile" is equivalent); you can specify a -hostfile parameter
+indicating a standard mpirun-style hostfile (one hostname per line):
+
+shell$ mpirun -hostfile my_hostfile -np 2 hello_world_mpi
+
+If you intend to run more than one process on a node, the hostfile can
+use the "slots" attribute. If "slots" is not specified, a count of 1
+is assumed. For example, using the following hostfile:
+
+---------------------------------------------------------------------------
+node1.example.com
+node2.example.com
+node3.example.com slots=2
+node4.example.com slots=4
+---------------------------------------------------------------------------
+
+shell$ mpirun -hostfile my_hostfile -np 8 hello_world_mpi
+
+will launch MPI_COMM_WORLD rank 0 on node1, rank 1 on node2, ranks 2
+and 3 on node3, and ranks 4 through 7 on node4.
+
+Other starters, such as the batch scheduling environments, do not
+require hostfiles (and will ignore the hostfile if it is supplied).
+
+Note that the values of component parameters can be changed on the
+mpirun / mpiexec command line. This is explained in the section
+below, "The Modular Component Architecture (MCA)".
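
The rank placement in the hostfile example earlier in this section can
be approximated with a short script. This is only an illustrative
sketch, not how mpirun itself is implemented: the map_ranks helper is
hypothetical, and it assumes ranks are assigned host by host in
hostfile order, filling all of a host's slots before moving on.

```shell
# Write the example hostfile from the text to a temporary file.
hostfile=$(mktemp)
cat > "$hostfile" <<'EOF'
node1.example.com
node2.example.com
node3.example.com slots=2
node4.example.com slots=4
EOF

# Assign ranks host by host, filling every slot on a host before moving
# to the next one. A host without a "slots" attribute counts as 1 slot.
map_ranks() {
    # $1 = hostfile, $2 = number of processes (the -np value)
    awk -v np="$2" '
        NF == 0 || $1 ~ /^#/ { next }    # skip blank and comment lines
        {
            slots = 1
            for (i = 2; i <= NF; i++)
                if ($i ~ /^slots=[0-9]+$/) slots = substr($i, 7) + 0
            for (s = 0; s < slots && rank < np; s++)
                printf "rank %d -> %s\n", rank++, $1
        }
    ' "$1"
}

map_ranks "$hostfile" 8
rm -f "$hostfile"
```

With the hostfile above and 8 processes, the sketch lists rank 0 on
node1, rank 1 on node2, ranks 2 and 3 on node3, and ranks 4 through 7
on node4, matching the placement described in the text. It does not
model what happens when -np exceeds the total number of slots.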
+ +=========================================================================== + +The Modular Component Architecture (MCA) + +The MCA is the backbone of Open MPI -- most services and functionality +are implemented through MCA components. Here is a list of all the +component frameworks in Open MPI: + +--------------------------------------------------------------------------- +MPI component frameworks: +------------------------- + +allocator - Memory allocator +bml - BTL management layer +btl - MPI point-to-point byte transfer layer +coll - MPI collective algorithms +io - MPI-2 I/O +mpool - Memory pooling +pml - MPI point-to-point management layer +ptl - (Outdated / deprecated) MPI point-to-point transport layer +rcache - Memory registration cache +topo - MPI topology routines + +Back-end run-time environment component frameworks: +--------------------------------------------------- + +errmgr - RTE error manager +gpr - General purpose registry +iof - I/O forwarding +ns - Name server +oob - Out of band messaging +pls - Process launch system +ras - Resource allocation system +rds - Resource discovery system +rmaps - Resource mapping system +rmgr - Resource manager +rml - RTE message layer +schema - Name schemas +sds - Startup / discovery service +soh - State of health monitor + +Miscellaneous frameworks: +------------------------- + +maffinity - Memory affinity +memory - Memory subsystem hooks +paffinity - Processor affinity +timer - High-resolution timers + +--------------------------------------------------------------------------- + +Each framework typically has one or more components that are used at +run-time. For example, the btl framework is used by MPI to send bytes +across underlying networks. The tcp btl, for example, sends messages +across TCP-based networks; the gm btl sends messages across GM +Myrinet-based networks. + +Each component typically has some tunable parameters that can be +changed at run-time. 
Use the ompi_info command to check a component +to see what its tunable parameters are. For example: + +shell$ ompi_info --param btl tcp + +shows all the parameters (and default values) for the tcp btl +component. + +These values can be overridden at run-time in several ways. At +run-time, the following locations are examined (in order) for new +values of parameters: + +1. /etc/openmpi-mca-params.conf + + This file is intended to set any system-wide default MCA parameter + values -- it will apply, by default, to all users who use this Open + MPI installation. The default file that is installed contains many + comments explaining its format. + +2. $HOME/.openmpi/mca-params.conf + + If this file exists, it should be in the same format as + /etc/openmpi-mca-params.conf. It is intended to provide + per-user default parameter values. + +3. environment variables of the form OMPI_MCA_ set equal to a + + + Where is the name of the parameter. For example, set the + variable named OMPI_MCA_btl_tcp_frag_size to the value 65536 + (Bourne-style shells): + + shell$ OMPI_MCA_btl_tcp_frag_size=65536 + shell$ export OMPI_MCA_btl_tcp_frag_size + +4. the mpirun command line: --mca + + Where is the name of the parameter. For example: + + shell$ mpirun --mca btl_tcp_frag_size 65536 -np 2 hello_world_mpi + +These locations are checked in order. For example, a parameter value +passed on the mpirun command line will override an environment +variable; an environment variable will override the system-wide +defaults. + +=========================================================================== + +Common Questions +---------------- + +Many common questions about building and using Open MPI are answered +on the FAQ: + + http://www.open-mpi.org/faq/ + +=========================================================================== + +Got more questions? +------------------- + +Found a bug? Got a question? Want to make a suggestion? Want to +contribute to Open MPI? Please let us know! 
+
+User-level questions and comments should generally be sent to the
+user's mailing list (users@open-mpi.org). Because of spam, only
+subscribers are allowed to post to this list (ensure that you
+subscribe with and post from *exactly* the same e-mail address --
+joe@example.com is considered different from
+joe@mycomputer.example.com!). Visit this page to subscribe to the
+user's list:
+
+  http://www.open-mpi.org/mailman/listinfo.cgi/users
+
+Developer-level bug reports, questions, and comments should generally
+be sent to the developer's mailing list (devel@open-mpi.org). Please
+do not post the same question to both lists. As with the user's list,
+only subscribers are allowed to post to the developer's list. Visit
+the following web page to subscribe:
+
+  http://www.open-mpi.org/mailman/listinfo.cgi/devel
+
+When submitting bug reports to either list, be sure to include the
+following information in your mail (please compress!):
+
+- the stdout and stderr from Open MPI's configure
+- the top-level config.log file
+- the stdout and stderr from building Open MPI
+- the output from "ompi_info --all" (if possible)
+
+For Bourne-type shells, here's one way to capture this information:
+
+shell$ ./configure ... 2>&1 | tee config.out
+[...lots of configure output...]
+shell$ make 2>&1 | tee make.out
+[...lots of make output...]
+shell$ mkdir ompi-output
+shell$ cp config.out config.log make.out ompi-output
+shell$ ompi_info --all 2>&1 | tee ompi-output/ompi-info.out
+shell$ tar cvf ompi-output.tar ompi-output
+[...output from tar...]
+shell$ gzip ompi-output.tar
+
+For C shell-type shells, the procedure is only slightly different:
+
+shell% ./configure ... |& tee config.out
+[...lots of configure output...]
+shell% make |& tee make.out
+[...lots of make output...]
+shell% mkdir ompi-output
+shell% cp config.out config.log make.out ompi-output
+shell% ompi_info --all |& tee ompi-output/ompi-info.out
+shell% tar cvf ompi-output.tar ompi-output
+[...output from tar...]
+shell% gzip ompi-output.tar + +In either case, attach the resulting ompi-output.tar.gz file to your +mail. This provides the Open MPI developers with a lot of information +about your installation and can greatly assist us in helping with your +problem. + +Be sure to also include any other useful files (in the +ompi-output.tar.gz tarball), such as output showing specific errors. diff --git a/opensm_release_notes.txt b/opensm_release_notes.txt new file mode 100644 index 0000000..2c38a40 --- /dev/null +++ b/opensm_release_notes.txt @@ -0,0 +1,486 @@ + OpenSM Release Notes 2.0.5 + ============================ + +Version: OpenFabrics Enterprise Distribution (OFED) 1.1 +Repo: https://openib.org/svn/gen2/branches/1.1/src/userspace/management/osm +Version: 9535 (openib-2.0.5) +Date: October 2006 + +1 Overview +---------- +This document describes the contents of the OpenSM OFED 1.1 release. +OpenSM is an InfiniBand compliant Subnet Manager and Administration, +and runs on top of OpenIB. The OpenSM version for this release +is openib-2.0.5 + +This document includes the following sections: +1 This Overview section (describing new features and software + dependencies) +2 Known Issues And Limitations +3 Unsupported IB compliance statements +4 Major Bug Fixes +5 Main Verification Flows +6 Qualified software stacks and devices + +1.1 Major New Features + +* Partition manager: + The partition manager provides a means to setup multiple partitions + by providing a partition policy file. For details please read the + doc/partition-config.txt or the opensm man page. + +* Basic QoS Manager: + Provides a uniform configuration of the entire fabric with values defined + in the OpenSM options file. The options support different settings for + CAs, Switches, and Routers. Note that this is disabled by default and + using -Q enables QoS fabric setup. + +* Loading pre-routes from a file: + A new routing module enables loading pre-routes from a file. 
+   To use this option you should use the command line options:
+   "-R file -U " or
+   "--routing_engine file --ucast_file "
+   For more information refer to the file doc/modular-routing.txt
+   or the opensm man page.
+
+* SA MultiPathRecord support:
+   The SA can now handle requests for multiple PathRecords in one query.
+   This includes methods SA GetMulti/GetMultiResp and dual sided RMPP.
+
+* PPC64 is now QAed and supported
+
+* Support LMC > 0 for Switch Enhanced Port 0:
+   Allows enhanced switch port 0 (ESP0) to have a non-zero
+   LMC. Use the configured subnet wide LMC for this. Modifications were
+   necessary to the LID assignment and routing to support this.
+   Also, added an option to the configuration to use the LMC configured
+   for the subnet for enhanced switch port 0, or to set it to 0 even if
+   a non-zero LMC is configured for the subnet. The default is currently
+   the latter option. The new configuration option is: lmc_esp0
+
+1.2 Minor New Features:
+
+* IPoIB broadcast group configuration:
+   It is now possible to control the IPoIB broadcast group parameters
+   (MTU, rate, SL) through the partitions configuration file.
+
+* Limiting OpenSM log file size:
+   By providing the command line option: "-L " or
+   "--log_limit " the user can limit the generated log
+   file size. When specified, the log file will be truncated upon reaching
+   this limit.
+
+* Favor 1K MTU for Tavor (MT23108) HCA
+   In cases where a PathRecord or MultiPathRecord is queried and the
+   requestor does not specify the MTU or does specify it in a way
+   that allows for MTU to be 1K and one of the path ends in a Tavor,
+   limit the MTU to 1K max.
+
+* Man pages:
+   Added opensm.8 and osmtest.8
+
+* Leaf VL stall count control:
+   A new parameter (leaf_vl_stall_count) was added to the options file.
+   It controls the number of sequential packets dropped on a switch
+   port driving an HCA/TCA/Router that causes the port to enter the
+   VLStalled state.
+ +* SM Polling/Handover defaults changed + The default SMInfo polling retries was decreased from 18 to 4 + which reduces the default handover time from 3 min to 40 seconds. + +1.3 Library API Changes + +* cl_mem* APIs deprecated in complib: + These functions are now considered as deprecated and should be + replaced by direct calls to malloc, free, memset, etc. + +* osm_log_init_v2 API added in libopensm: + Supports providing the new option for log file truncation. + +1.4 Software Dependencies + +OpenSM depends on the installation of either OFED 1.1, OFED 1.0, +OpenIB gen2 (e.g. IBG2 distribution), OpenIB gen1 (e.g. IBGD +distribution), or Mellanox VAPI stacks. The qualified driver versions +are provided in Table 2, "Qualified IB Stacks". + +1.5 Supported Devices Firmware + +The main task of OpenSM is to initialize InfiniBand devices. The +qualified devices and their corresponding firmware versions +are listed in Table 3. + +2 Known Issues And Limitations +------------------------------ + +* No Service / Key associations: + There is no way to manage Service access by Keys. + +* No SM to SM SMDB synchronization: + Puts the burden of re-registering services, multicast groups, and + inform-info on the client application (or IB access layer core). + +* No "port down" event handling: + Changing the switch port through which OpenSM connects to the IB + fabric may cause incorrect operation. Please restart OpenSM whenever + such a connectivity change is made. + +* Changing connections during SM operation: + Under some conditions the SM can get confused by a change in + cabling (moving a cable from one switch port to the other) and + momentarily see this as having the same GUID appear connected + to two different IB ports. Under some conditions, when the SM fails to + get the corresponding change event it might mistakenly report this case + as a "duplicated GUID" case and abort. 
It is advisable to double-check + the syslog after each such change in connectivity and restart + OpenSM if it has exited. + +3 Unsupported IB Compliance Statements +-------------------------------------- +The following section lists all the IB compliance statements which +OpenSM does not support. Please refer to the IB specification for detailed +information regarding each compliance statement. + +* C14-22 (Authentication): + M_Key M_KeyProtectBits and M_KeyLeasePeriod shall be set in one + SubnSet method. As a work-around, an OpenSM option is provided for + defining the protect bits. + +* C14-67 (Authentication): + On SubnGet(SMInfo) and SubnSet(SMInfo) - if M_Key is not zero then + the SM shall generate a SubnGetResp if the M_Key matches, or + silently drop the packet if M_Key does not match. + +* C15-0.1.23.4 (Authentication): + InformInfoRecords shall always be provided with the QPN set to 0, + except for the case of a trusted request, in which case the actual + subscriber QPN shall be returned. + +* o13-17.1.2 (Event-FWD): + If no permission to forward, the subscription should be removed and + no further forwarding should occur. + +* C14-24.1.1.5 and C14-62.1.1.22 (Initialization): + GUIDInfo - SM should enable assigning Port GUIDInfo. + +* C14-44 (Initialization): + If the SM discovers that it is missing an M_Key to update CA/RT/SW, + it should notify the higher level. + +* C14-62.1.1.12 (Initialization): + PortInfo:M_Key - Set the M_Key to a node based random value. + +* C14-62.1.1.13 (Initialization): + PortInfo:P_KeyProtectBits - set according to an optional policy. + +* C14-62.1.1.24 (Initialization): + SwitchInfo:DefaultPort - should be configured for random FDB. + +* C14-62.1.1.32 (Initialization): + RandomForwardingTable should be configured. + +* o15-0.1.12 (Multicast): + If the JoinState is SendOnlyNonMember = 1 (only), then the endport + should join as sender only. 
+ +* o15-0.1.8 (Multicast): + If a request to create an MCG has fields that cannot be met, + return ERR_REQ_INVALID (currently ignores SL and FlowLabelTClass). + +* C15-0.1.8.6 (SA-Query): + Respond to SubnAdmGetTraceTable - this is an optional attribute. + +* C15-0.1.13 (Services): + Reject ServiceRecord create, modify or delete if the given + ServiceP_Key does not match the one included in the ServiceGID port + and the port that sent the request. + +* C15-0.1.14 (Services): + Provide means to associate service name and ServiceKeys. + +4 Major Bug Fixes +----------------- + +The following is a list of bugs that were fixed. Note that other less critical +or visible bugs were also fixed. + +* "Broken" fabric (duplicated port GUIDs) handling improved + Replace assert with a real check to handle invalid physical port + in osm_node_info_rcv.c which could occur on a broken fabric + +* SA client synchronous request failed but status returned was IB_SUCCESS + even if there was no response. + There was a missing setting of the status in the synchronous case. + +* Memory leak fixes: + 1. In libvendor/osm_vendor_ibumad.c:osm_vendor_get_all_port_attr + 2. In libvendor/osm_vendor_ibumad_sa.c:__osmv_sa_mad_rcv_cb + 3. On receiving SMInfo SA request from a node that does not share a + partition, the response mad was allocated but never free'd + as it was never sent. + +* Set(InformInfo) OpenSM Deadlock: + When receiving a request with unknown LID + +* PathRecord to inconsistent multicast destination: + Fix the return error when multicast destination is not consistently + indicated. 
+ +* Remove double calculation of reversible path + In osm_sa_path_record.c:__osm_pr_rcv_get_lid_pair_path a PathRecord + query used to double check if the path is reversible + +* Some PathRecord log messages use "net order": + Fix GUID net to host conversion in some osm_log messages + +* DR/LID routed SMPs direction bit handling: + osm_resp.c:osm_resp_make_resp_smp, set direction bit only if direct + routed class. This bug caused two issues: + 1. Get/Set responses always had direction bit set. + 2. Trap represses never had direction bit set. + The direction bit needs setting in direct routed responses and it + doesn't exist in LID routed responses. + osm_sm_mad_ctrl.c: did not detect the "direction bit" correctly. + +* OpenSM crash due to transaction lookup (interop with Cisco stack) + When a wire TID that maps to internal TID of zero (after applying + mask) was received the lookup of the transaction was successful. + The stale transaction pointed to "free'd" memory. + +* Better handling for Path/MultiPath requests for raw traffic + +* Wrong ProducerType provided in Notice Reports: + When formatting an SM generated report, the ProducerType was using + CL_NTOH32 which cannot be used to format a 24-bit network order number. + +* OpenSM break on PPC64 + complib: Fixed memory corruption in cl_pool.c:cl_qcpool_init. This + affected big endian 64-bit architectures only. + +* Illegal Set(InformInfo) was wrongly successful in updating the SMDB + osm_sa_informinfo.c: In osm_infr_rcv_process_set_method, if sending + error, don't call osm_infr_rcv_process_set_method + +* RMPP queries of InformInfoRecord fail + ib_types.h: Pad ib_inform_info_record_t to be modulo 8 in size so + that attribute offset is calculated properly + +* Returning "invalid request" rather than "unsupported method/attribute" + In these cases, a noncompliant response was being provided. 
+ +* Noncompliant response for SubnAdmGet(PortInfoRecord) with no match + osm_pir_rcv_process, now returns "SA no records error" for SubnAdmGet + with 0 records found + +* Noncompliant non base LID returned by some queries: + The following attributes used to return the request LID rather than + its base LID in responses: PKeyTableRecord, GUIDInfoRecord, + SLtoVLMappingTableRecord, VLArbitrationTableRecord, LinkRecord + +* Noncompliant SubnAdmGet and SubnAdmGetTable: + Mixing of error codes in case of no records or multiple records + fixed for the attributes: + LinearForwardingTableRecord, GUIDInfoRecord, + VLArbitrationTableRecord, LinkRecord, PathRecord + +* segfault in InformInfo flows + Under stress concurrent Set/Delete/Get flows. Fixed by adding + missing lock. + +* SA queries containing LID out of range did not return ERR_REQ_INVALID + +5 Main Verification Flows +------------------------- + +OpenSM verification is run using the following activities: +* osmtest - a stand-alone program +* ibmgtsim (IB management simulator) based - a set of flows that + simulate clusters, inject errors and verify OpenSM capability to + respond and bring up the network correctly. +* small cluster regression testing - where the SM is used on back to + back or single switch configurations. The regression includes + multiple OpenSM dedicated tests. +* cluster testing - when we run OpenSM to set up a large cluster, perform + hand-off, reboots and reconnects, verify routing correctness and SA + responsiveness at the ULP level (IPoIB and SDP). + +5.1 osmtest + +osmtest is an automated verification tool used for OpenSM +testing. Its verification flows are described in the list below. + +* Inventory File: Obtain and verify all port info, node info, link and path + records parameters. 
+ +* Service Record: + - Register new service + - Register another service (with a lease period) + - Register another service (with service p_key set to zero) + - Get all services by name + - Delete the first service + - Delete the third service + - Added bad flows of get/delete non valid service + - Add / Get same service with different data + - Add / Get / Delete by different component mask values (services + by Name & Key / Name & Data / Name & Id / Id only ) + +* Multicast Member Record: + - Query of existing Groups (IPoIB) + - BAD Join with insufficient comp mask (o15.0.1.3) + - Create given MGID=0 (o15.0.1.4) + - Create given MGID=0xFF12A01C,FE800000,00000000,12345678 (o15.0.1.4) + - Create BAD MGID=0xFA. (o15.0.1.6) + - Create BAD MGID=0xFF12A01B w/ link-local not set (o15.0.1.6) + - New MGID with invalid join state (o15.0.1.9) + - Retry of existing MGID - See JoinState update (o15.0.1.11) + - BAD RATE when connecting to existing MGID (o15.0.1.13) + - Partial JoinState delete request - removing FullMember (o15.0.1.14) + - Full Delete of a group (o15.0.1.14) + - Verify Delete by trying to Join deleted group (o15.0.1.14) + - BAD Delete of IPoIB membership (no prev join) (o15.0.1.15) + +* GUIDInfo Record: + - All GUIDInfoRecords in subnet are obtained + +* MultiPathRecord: + - Perform some compliant and noncompliant MultiPathRecord requests + - Validation is via status in responses and IB analyzer + +* PKeyTableRecord: + - Perform some compliant and noncompliant PKeyTableRecord queries + - Validation is via status in responses and IB analyzer + +* LinearForwardingTableRecord: + - Perform some compliant and noncompliant LinearForwardingTableRecord queries + - Validation is via status in responses and IB analyzer + +* Event Forwarding: Register for trap forwarding using reports + - Send a trap and wait for report + - Unregister non-existing + +* Trap 64/65 Flow: Register to Trap 64-65, create traps (by + disconnecting/connecting ports) and wait for report, then 
unregister. + +* Stress Test: send PortInfoRecord queries, both single and RMPP and + check for the rate of responses as well as their validity. + + +5.2 IB Management Simulator OpenSM Test Flows: + +The simulator provides the ability to simulate the SM handling of virtual +topologies that are not limited to actual lab equipment availability. +OpenSM was simulated to bring up clusters of up to 10,000 nodes. Daily +regressions use smaller clusters (16 and 128 nodes). + +The following test flows are run on the IB management simulator: + +* Stability: + Up to 12 links from the fabric are randomly selected to drop packets + at drop rates up to 90%. The SM is required to succeed in bringing the + fabric up. The resulting routing is verified to be correct as well. + +* LID Manager: + Using LMC = 2 the fabric is initialized with LIDs. Faults such as + zero LID, Duplicated LID, non-aligned (to LMC) LIDs are + randomly assigned to various nodes and other errors are randomly + output to the guid2lid cache file. The SM sweep is run 5 times and + after each iteration a complete verification is made to ensure that all + LIDs that could possibly be maintained are kept, as well as that all nodes + were assigned a legal LID range. + +* Multicast Routing: + Nodes randomly join the 0xc000 group and eventually the + resulting routing is verified for completeness and adherence to + Up/Down routing rules. + +* osmtest: + The complete osmtest flow as described in Section 5.1 is run on + the simulated fabrics. + +* Stress Test: + This flow merges fabric, LID and stability issues with continuous + PathRecord, ServiceRecord and Multicast Join/Leave activity to + stress the SM/SA during continuous sweeps. InformInfo Set/Delete/Get + were added to the test such that both existing and non-existing nodes + perform them in random order. + +5.3 OpenSM Regression + +Using a back-to-back or single switch connection, the following set of +tests is run nightly on the stacks described in Table 2. 
The included +tests are: + +* Stress Testing: Flood the SA with queries from multiple channel + adapters to check the robustness of the entire stack up to the SA. + +* Dynamic Changes: Dynamic Topology changes, through randomly + dropping SMP packets, used to test OpenSM adaptation to an unstable + network & verify DB correctness. + +* Trap Injection: This flow injects traps to the SM and verifies that it + handles them gracefully. + +* SA Query Test: This test exhaustively checks the SA responses to all + possible single component masks. To do that the test examines the + entire set of records the SA can provide, classifies them by their + field values and then selects every field (using component mask and a + value) and verifies that the response matches the expected set of records. + A random selection using multiple component mask bits is also performed. + +5.4 Cluster testing: + +Cluster testing is usually run before a distribution release. It +involves real hardware setups of 16 to 32 nodes (or more if a beta site +is available). Each test is validated by running all-to-all ping through the IB +interface. 
The test procedure includes: + +* Cluster bringup + +* Hand-off between 2 or 3 SM's while performing: + - Node reboots + - Switch power cycles (disconnecting the SM's) + +* Unresponsive port detection and recovery + +* osmtest from multiple nodes + +* Trap injection and recovery + + +6 Qualification +---------------- + +Table 2 - Qualified IB Stacks +============================= + +Stack | Version +-----------------------------------------|-------------------------- +OFED | 1.1 +OFED | 1.0 +OpenIB Gen2 (IBG2 distribution) | 1.0 +OpenIB Gen1 (IBGD distribution) | 1.8.0 +VAPI (Mellanox InfiniBand HCA Driver) | 3.2 and later + +Table 3 - Qualified Devices and Corresponding Firmware +====================================================== + +Mellanox +Device | FW versions +--------|----------------------------------------------------------- +MT43132 | InfiniScale - fw-43132 5.2.0 (and later) +MT47396 | InfiniScale III - fw-47396 0.5.0 (and later) +MT23108 | InfiniHost - fw-23108 3.3.2 +MT25204 | InfiniHost III Lx - fw-25204 1.0.1 +MT25208 | InfiniHost III Ex (InfiniHost Mode) - fw-25208 4.6.2 (and later) +MT25208 | InfiniHost III Ex (MemFree Mode) - fw-25218 5.0.1 (and later) + +QLogic/PathScale +Device | Note +--------|----------------------------------------------------------- +iPath | QHT6040 (PathScale InfiniPath HT-460) +iPath | QHT6140 (PathScale InfiniPath HT-465) +iPath | QLE6140 (PathScale InfiniPath PE-880) + +Note: OpenSM does not run on an IBM Galaxy (eHCA) as it does not expose +QP0 and QP1. However, it does support it as a device on the subnet. 
diff --git a/osu_mpi_release_notes.txt b/osu_mpi_release_notes.txt new file mode 100644 index 0000000..8ff2db2 --- /dev/null +++ b/osu_mpi_release_notes.txt @@ -0,0 +1,77 @@ + Open Fabrics Enterprise Distribution (OFED) + OSU MPI MVAPICH-0.9.7, Rev 0.9.7-mlx2.2.0 in OFED 1.1 Release Notes + + October 2006 + + +Overview +-------- +These are the release notes for OSU MPI MVAPICH-0.9.7, Rev 0.9.7-mlx2.2.0 +This is OFED's edition of the OSU MPI MVAPICH-0.9.7 release. OSU MPI is an MPI +channel implementation over InfiniBand from Ohio State University (OSU) +(http://nowlab.cse.ohio-state.edu/projects/mpi-iba/). + +Software Dependencies +--------------------- +OSU MPI depends on the installation of the OFED Distribution stack with OpenSM +running. The MPI module also requires an established network interface (either +InfiniBand IPoIB or Ethernet). + +New Features +------------ +The mlx2.2.0 module is based on the MVAPICH-0.9.7 (MPI-1 over OpenIB/Gen2) module at +openib.org gen2. This version for OFED has the following additional features: +- Message coalescing +- SRQ flow optimization + +Bug Fixes +--------- +- Affinity support is now enabled by default +- Multiple fixes in rsh/ssh launcher +- LD_LIBRARY_PATH fix (MLNX #37387) +- TotalView scalability fix +- FastPath is enabled on start for small clusters +- Fix for correct f90 support (openib bugzilla 191) +- Fix for comment support in mpirun_rsh (openib bugzilla 143) + +Known Issues +------------ +- A process running MPI cannot fork after MPI_Init. Using fork might cause a + segmentation fault. +- Using mpirun with ssh has a signal collection problem. Killing the run + (using CTRL-C) might leave some of the processes running on some of the + nodes. This can also happen if one of the processes exits with an error. + Note: This problem does not exist with rsh. +- The MPD job launcher feature of OSU MPI module has not been tested by Mellanox + Technologies. 
See http://nowlab.cse.ohio-state.edu/projects/mpi-iba/ for more + details. +- For users of Mellanox Technologies firmware fw-23108 or fw-25208 only: + OSU MPI might fail in its default configuration if your HCA is burnt with an + fw-23108 version that is earlier than 3.4.000, or with an fw-25208 version + 4.7.400 or earlier. + Workaround: + Option 1 - Update the firmware + Option 2 - In mvapich.conf, set VIADEV_SRQ_ENABLE=0 +- MVAPICH does not run on RHEL4 U3 ppc64 +- MVAPICH may fail to run on some SLES 10 machines due to problems in resolving + the host name. + Workaround: Edit /etc/hosts and comment-out/remove the line that maps + IP address 127.0.0.2 to the system's fully qualified hostname. + +Main Verification Flows +----------------------- +In order to verify the correctness of OSU MPI, the following tests and +parameters were run. + +Test Description +=================================================================== +Intel's test suite 1400 Intel tests +BW/LT OSU's test for bandwidth latency +IMB Intel's MPI Benchmark test +mpitest b_eff test +Presta Presta multicast test +Linpack Linpack benchmark +NAS2.3 NAS NPB2.3 tests +SuperLU SuperLU benchmark (NERSC edition) +NAMD NAMD application +CAM CAM application diff --git a/sdp_release_notes.txt b/sdp_release_notes.txt new file mode 100644 index 0000000..ebde1ab --- /dev/null +++ b/sdp_release_notes.txt @@ -0,0 +1,187 @@ + Open Fabrics Enterprise Distribution (OFED) + SDP in OFED 1.1 Release Notes + + October 2006 + + + +=============================================================================== +Table of Contents +=============================================================================== +1. Overview +2. Bug Fixes +3. Known Issues +4. Verification Applications/Flows/Tests + +=============================================================================== +1. Overview +=============================================================================== +SDP in OFED is at beta level for OFED 1.1. 
+ + +=============================================================================== +2. Bug Fixes +=============================================================================== +* SDP now disables timewait on close if the socket has been disconnected + +* SDP now reports EPIPE if a packet gets queued after disconnect + +* Improved urgent data latency + +* Fixed data corruption upon changing the TCP_NODELAY socket option + +* Fixed a crash that occurs when a child is disconnected while its parent is + being destroyed. + +* SDP now recovers from RTU packet loss. + + +=============================================================================== +3. Known Issues +=============================================================================== +- Each SDP socket currently consumes up to 2 MBytes of memory. If this value + is high for your installation, it is possible to trade off performance + for lower memory utilization per socket by reducing the value of the + "rcvbuf_scale" module parameter (default: 16). + + Note: the minimum legal value for this parameter is 1. + At this parameter value, each socket will consume approximately 128 KBytes. + +- Small message size performance is low when messages are sent by client + at a rate lower than the rate at which they are consumed by server, + and when TCP_CORK is not set. This is observed, for example, with iperf + benchmark. As a workaround, set the TCP_CORK socket option + to ensure data is sent in at least 32K byte chunks. + +- Performance is low on 32-bit kernels, as SDP utilizes high memory + to ease memory pressure. Moving to a 64-bit kernel solves this + problem even if the application remains a 32-bit one. + +- By default, SDP utilizes a 2 Kbyte MTU size. This may cause PCI-X cards + using Mellanox Technologies "Infinihost" HCAs to experience low bandwidth. + Workaround: reset the MTU size to 1K in this situation, using either of + the two methods below: + + 1. 
Activate the "tavor quirk" workaround in opensm: + a. Create an opensm options cache file (/var/cache/osm/opensm.opts): + > opensm --cache-options -o + b. Add the following line to /var/cache/osm/opensm.opts: + enable_quirks TRUE + c. Rerun opensm using your usual command line options to activate + the opensm quirk option. + + 2. Activate the "tavor quirk" workaround in cma: + set the tavor_quirk module parameter of the rdma_cm module to value 1 + (default: 0). + + +=============================================================================== +4. Verification Applications/Flows/Tests +=============================================================================== +- ssh/sshd +- wget/netscape/firefox/apache +- netpipe +- netperf +- LTP socket tests +- iperf-2.0.2 +- ttcp +- Threaded and forking echo client server examples +- Various Java client server applications (SUN:jre, BEA:jrockit/WebLogic, GNU:gij/gcj) +- Many UNIX utilities to verify that pre-load did not harm the applications + + +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Open Fabrics Enterprise Distribution (OFED) + libsdp v. 9382 in OFED 1.1 Release Notes + + October 2006 + + +=============================================================================== +Table of Contents +=============================================================================== +1. Overview +2. New Features +3. Bug Fixes +4. Known Issues +5. Verification Applications/Flows/Tests + +=============================================================================== +1. Overview +=============================================================================== +This document describes the contents of the libsdp OFED 1.1 release. +libsdp is a LD_PRELOAD-able library that can be used to migrate existing +applications to use InfiniBand Sockets Direct Protocol (SDP) instead of +TCP sockets, transparently and without recompilations. To setup libsdp +please follow the instructions below. 
The libsdp version for this release +is 1.1. + + +=============================================================================== +2. New Features +=============================================================================== +* New verbosity level "7" used for reporting connect/accept calls + and the address family used. This results in a reasonably short + log file that shows which connections used SDP and which ones used TCP. + + +=============================================================================== +3 Bug Fixes +=============================================================================== +The following list of bugs were fixed. Note that other less critical +or visible bugs were also fixed. + +* Some applications provide IPv6 address in partial struct (missing the newer + scope_id field). The fix avoids libsdp memory corruption in this case. + +* Fixed address conversion bug in loopback address and also on all IPv4->IPv6 + (missed the extra 0xffff to the IPv6). + +* listen - had a missing flow for handling implicit bind. This caused the + "use both" case to provide two different/unrelated ports for the SDP and + TCP ports. Eventually, this caused the SDP port to be unusable (since the + client was only obtaining the TCP port number). Fixed by adding a flow similar + to the one used by bind with ANY_PORT. + +* Several bugs in using getsockname()/getpeername() which prevented the correct + address length from being returned when SDP-provided IPv4 addresses had to be + converted back to IPv6. + +* Fixed memory corruption caused by using struct sockaddr to store IPv6 address. + sockaddr_storage is used instead. + +* Accept now handles a null output address pointer. + +* errno was corrupted by call to is_invalid_addr, reporting false errno to + applications. 
+ +* Fixed socket leak in the flow for bind(ANY_PORT) + +* libsdp now avoids log errors on connect() using async mode returning -1 when + errno == EINPROGRESS + +=============================================================================== +4. Known Issues +=============================================================================== +* libsdp cannot provide its socket switch functionality for executables + statically linked with libc. + +* When using server to listen on both SDP and TCP, the number of sockets is + doubled. + +* A rare race still exists when performing bind/listen on ANY_PORT. The race + is between applications and has been greatly minimized. A test to reproduce it + has not been found yet. The race is between libsdp running the sequence + close(fd1) and bind(fd2, port), and another application/thread explicitly + trying to bind(fd3, port) to the same port. + + To resolve this race a change in SDP/CMA behavior is required (provide + different port number in successive calls to bind (ANY_PORT) and SDP support + for "unbind"). + + +=============================================================================== +5. Verification Applications/Flows/Tests +=============================================================================== +See the corresponding section in the SDP release notes above. + diff --git a/srp_release_notes.txt b/srp_release_notes.txt new file mode 100644 index 0000000..92cef61 --- /dev/null +++ b/srp_release_notes.txt @@ -0,0 +1,505 @@ + + Open Fabrics Enterprise Distribution (OFED) + SRP in OFED 1.1 Release Notes + + October 2006 + + +============================================================================== +Table of contents +============================================================================== + + 1. Overview + 2. Software Dependencies + 3. Major Features + 4. Loading SRP Initiator + 5. Manually Establishing an SRP Connection + 6. SRP Tools - ibsrpdm and srp_daemon + 7. 
Automatic Discovery and Connecting to Targets + 8. Multiple Connections from Initiator IB Port to the Target + 9. High Availability + 10. Shutting Down SRP + 11. Known Issues + 12. Vendor Specific Notes + + +============================================================================== +1. Overview +============================================================================== + +The SRP standard describes the message format and protocol definitions required +for transferring commands and data between a SCSI initiator port and a SCSI +target port using RDMA communication service. + + +============================================================================== +2. Software Dependencies +============================================================================== + +The SRP Initiator depends on the installation of the OFED Distribution stack +with OpenSM running. + +============================================================================== +3. Major Features +============================================================================== + +This SRP Initiator is based on source taken from openib.org gen2 implementing +the SCSI RDMA Protocol-2 (SRP-2), Doc. no. T10/1524-D. See: +www.t10.org/ftp/t10/drafts/srp2/srp2r00a.pdf + +The SRP Initiator supports: +- Basic SCSI Primary Commands -3 (SPC-3) + (www.t10.org/ftp/t10/drafts/spc3/spc3r21b.pdf) +- Basic SCSI Block Commands -2 (SBC-2) + (www.t10.org/ftp/t10/drafts/sbc2/sbc2r16.pdf) +- Basic functionalities, task management and limited error handling + +============================================================================== +4. Loading SRP Initiator +============================================================================== + +To load the SRP module, either execute the "modprobe ib_srp" command after the +OFED driver is up, or change the value of SRP_LOAD in +/etc/infiniband/openib.conf to "yes" (causing the srp module to be loaded +at driver boot). 
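The second option above (setting SRP_LOAD) can be sketched as a small shell helper. This is only an illustration under stated assumptions: enable_srp_load is a hypothetical function, not part of OFED, and it assumes that openib.conf consists of plain KEY=value lines. It takes the file path as an argument so that it can be tried on a copy first.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with OFED): set SRP_LOAD=yes in an
# openib.conf-style file. Assumption: the file uses KEY=value lines.
enable_srp_load() {
    conf="$1"
    if grep -q '^SRP_LOAD=' "$conf"; then
        # Rewrite the existing SRP_LOAD line, leaving all other lines alone.
        tmp=$(mktemp) || return 1
        sed 's/^SRP_LOAD=.*/SRP_LOAD=yes/' "$conf" > "$tmp" &&
            cat "$tmp" > "$conf"
        status=$?
        rm -f "$tmp"
        return "$status"
    else
        # No SRP_LOAD line yet: append one.
        echo 'SRP_LOAD=yes' >> "$conf"
    fi
}
```

For example, "enable_srp_load ./openib.conf.copy" modifies a local copy; running it against /etc/infiniband/openib.conf requires root, and the change takes effect the next time the driver is started.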
+ +NOTE: When loading the ib_srp module, it is possible to set the module + parameter srp_sg_tablesize. This is the maximum number of + gather/scatter entries per I/O (default: 12). + + +============================================================================== +5. Manually Establishing an SRP Connection +============================================================================== + +The following steps describe how to manually establish an SRP connection between +the Initiator and an SRP Target. Section 7 explains how to do this +automatically. + +- Make sure that the ib_srp module is loaded, the SRP Initiator is reachable + by the SRP Target, and that an SM is running. + +- To establish a connection with an SRP Target and create SRP (SCSI) device(s) + for that target under /dev, use the following command: + + echo id_ext=[GUID value],ioc_guid=[GUID value],dgid=[port GID value],\ + pkey=ffff,service_id=[service[0] value] > \ + /sys/class/infiniband_srp/srp-mthca[hca number]-[port number]/add_target + + Notes: + a. Execution of the above "echo" command may take some time + b. The SM must be running while the command executes + c. It is possible to include additional parameters in the echo command: + > max_cmd_per_lun - Default: 63 + > max_sect (short for max_sectors) - sets the request size of a command + > io_class - Default: 0x100 as in rev 16A of the specification + Note: In rev 10 the default was 0xff00 + > initiator_ext - Please refer to Section 8 (Multiple Connections...) + d. See SRP Tools below for instructions on how the parameters in the + echo command above may be obtained. + +- To list the new SCSI devices that have been added by the echo command, you + may use either of the following two methods: + a. Execute "fdisk -l". This command lists all devices; the new devices are + included in this listing. + b. Execute "dmesg" or look at /var/log/messages to find messages with the names + of the new devices. 
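The add_target string used in the steps above can be composed with a small shell helper, which avoids typing the comma-separated list by hand. This is a minimal sketch, not an OFED tool: build_srp_target and srp_add_target_path are hypothetical names, the pkey default of ffff follows the command shown above, and the sample values are the ones from the ibsrpdm sample output in Section 6. The block only prints text; it does not write to sysfs.

```shell
#!/bin/sh
# Hypothetical helpers (not part of OFED) for the Section 5 "echo" command.

# Compose the target description string.
# $1=id_ext  $2=ioc_guid  $3=dgid  $4=service_id  $5=pkey (optional, default ffff)
build_srp_target() {
    printf 'id_ext=%s,ioc_guid=%s,dgid=%s,pkey=%s,service_id=%s\n' \
        "$1" "$2" "$3" "${5:-ffff}" "$4"
}

# Compose the add_target path for a given mthca HCA number and port number.
# $1=hca number  $2=port number
srp_add_target_path() {
    printf '/sys/class/infiniband_srp/srp-mthca%s-%s/add_target\n' "$1" "$2"
}

# Print both pieces for the sample target; nothing is written to sysfs here.
build_srp_target 200400A0B81146A1 0002c90200402bd4 \
    fe800000000000000002c90200402bd5 200400a0b81146a1
srp_add_target_path 0 1
```

To actually create the connection, the first output would be redirected into the second, for example: build_srp_target ... > "$(srp_add_target_path 0 1)". As noted above, this requires root privileges, a loaded ib_srp module and a running SM.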
+ + +============================================================================== +6. SRP Tools - ibsrpdm and srp_daemon +============================================================================== + +To assist in performing the steps in Section 5, the OFED 1.1 distribution +provides two utilities which: +- Detect targets on the fabric reachable by the Initiator (for step 1) +- Output target attributes in a format suitable for use in the above + "echo" command (step 2) + +These utilities are: ibsrpdm and srp_daemon. + +The utilities can be found under /usr/local/ofed/sbin/ (or /sbin/), +and are part of the srptools RPM that may be installed using the +OFED custom installation. Detailed information regarding the various +options for these utilities is provided by their man pages. + +Below, several usage scenarios for these utilities are presented. + +ibsrpdm usage +------------- +1. Detecting reachable targets + + a. To detect all targets reachable by the SRP initiator via the default + umad device (/dev/umad0), execute the following command: + > ibsrpdm + + This command will output information on each SRP target detected, in + human-readable form. + + Sample output: + IO Unit Info: + port LID: 0103 + port GID: fe800000000000000002c90200402bd5 + change ID: 0002 + max controllers: 0x10 + + controller[ 1] + GUID: 0002c90200402bd4 + vendor ID: 0002c9 + device ID: 005a44 + IO class : 0100 + ID: LSI Storage Systems SRP Driver 200400a0b81146a1 + service entries: 1 + service[ 0]: 200400a0b81146a1 / SRP.T10:200400A0B81146A1 + + b. To detect all the SRP Targets reachable by the SRP Initiator via + another umad device, use the following command: + + > ibsrpdm -d + +2. Assistance in creating an SRP connection + + a. 
To generate output suitable for utilization in the "echo" command of + Section 5, add the "-c" option to ibsrpdm: + + >ibsrpdm -c + + Sample output: + id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4, + dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1 + + b. To establish a connection with an SRP Target (Section 5) using the output + from the "ibsrpdm -c" example above, execute the following command: + + echo id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4, + dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1 + > /sys/class/infiniband_srp/srp-mthca0-1/add_target + + The SRP connection should now be up; the newly created SCSI devices should appear + in the listing obtained from the "fdisk -l" command. + +srp_daemon +---------- +The srp_daemon utility is based on ibsrpdm and extends its functionality. +In addition to the ibsrpdm functionality described above, srp_daemon can also: +- Establish an SRP connection by itself (without the need to issue the "echo" + command described in Section 5) +- Continue running in background, detecting new targets and establishing SRP + connections with them (daemon mode) +- Discover reachable SRP targets given an InfiniBand HCA name and port, rather + than just by /dev/umad where is a digit +- Enable High Availability operation (together with Device-Mapper Multipath) + +a. srp_daemon commands equivalent to ibsrpdm: + + "srp_daemon -a -o" is equivalent to "ibsrpdm" + "srp_daemon -c -a -o" is equivalent to "ibsrpdm -c" + +b. srp_daemon extensions to ibsrpdm + + - To discover SRP Targets reachable from HCA device , + port , (and generate output suitable for 'echo') you may execute + + srp_daemon -c -a -o -i -p + + - To both discover the SRP Targets and establish connections with them, just + add the -e option to the above command. 
+ + - Executing srp_daemon without -a option will display only the reachable + Targets to which the initiator is not connected (via the port upon which + srp_daemon was activated). + + - Continuous background (daemon) operation, providing automatic ongoing + detection and connection capability -- see the next section. + +============================================================================== +7. Automatic Discovery and Connecting to Targets +============================================================================== + +- Make sure that the ib_srp module is loaded, the SRP Initiator can reach an + SRP Target, and that an SM is running. + +- To connect to all the existing Targets in the fabric, execute + srp_daemon -e -o. This utility will scan the fabric once, connect to + every Target it detects, and then exit. + +- To connect to all the existing Targets in the fabric and to connect + to new targets that will join the fabric, execute srp_daemon -e. This utility + continues to execute until it is either killed by the user or encounters + connection errors (such as no SM in the fabric). + +- To execute SRP daemon as a daemon you may execute run_srp_daemon + (found under /usr/local/ofed/sbin/ or /sbin/), providing it with + the same options used for running srp_daemon. + + Note: Make sure only one instance of run_srp_daemon runs per port. + +- To execute SRP daemon as a daemon on all the ports, execute srp_daemon.sh + (found under /usr/local/ofed/sbin/ or /sbin/). + srp_daemon.sh sends its log to /var/log/srp_daemon.log. + +- It is possible to configure this script to execute automatically when the + InfiniBand driver starts by changing the value of SRPHA_ENABLE in + /etc/infiniband/openib.conf to "yes". However, this option also enables + SRP High Availability that has some more features. (Please read the High + Availability section). + +============================================================================== +8. 
Multiple Connections from Initiator IB Port to the Target +============================================================================== + +Some system configurations may need multiple SRP connections from +the SRP Initiator to the same SRP Target: to the same Target IB port, +or to different IB ports on the same Target HCA. + +In the case of a single Target IB port, i.e., SRP connections use the same path, +the configuration is enabled using a different initiator_ext value for each +SRP connection. The initiator_ext value is a 16-hexadecimal-digit value +specified in the connection command. + +Likewise, in the case of two physical connections (i.e., network paths) from a single +initiator IB port to two different IB ports on the same Target HCA, a +different initiator_ext value is needed on each path. The convention is to +use the Target port GUID as the initiator_ext value for the relevant path. + +If you use srp_daemon with the -n flag, it automatically assigns initiator_ext +values according to this convention. For example: + + id_ext=200500A0B81146A1,ioc_guid=0002c90200402bec,dgid=fe800000000000000002c90200402bed,\ + pkey=ffff,service_id=200500a0b81146a1,initiator_ext=ed2b400002c90200 + + Notes: + a. It is recommended to use the -n flag for all srp_daemon invocations. + b. ibsrpdm does not have a corresponding option. + c. srp_daemon.sh always uses the -n option (whether invoked manually by + the user, or automatically at startup by setting SRPHA_ENABLE to yes). + +============================================================================== +9. High Availability (HA) +============================================================================== + + Note: This is a Beta release of the High Availability feature for the + SCSI RDMA Protocol (SRP) Initiator. + It is intended for development use, not as a complete product. + +High Availability Overview +-------------------------- + +High Availability works using the Device-Mapper (DM) multipath and the +SRP daemon.
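Section 8 above requires a distinct 16-hexadecimal-digit initiator_ext for each SRP connection to the same Target. The following is a minimal sketch, not part of OFED, of a shell helper that assembles a connection string in the key=value format shown in Section 8 and rejects an initiator_ext that is not 16 hexadecimal digits. The function names is_hex16 and build_target_string are hypothetical and chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with OFED): assemble a connection
# string in the key=value format shown in Section 8.

# Succeeds only if $1 is exactly 16 hexadecimal digits.
is_hex16() {
    printf '%s\n' "$1" | grep -Eq '^[0-9A-Fa-f]{16}$'
}

# Usage: build_target_string ID_EXT IOC_GUID DGID PKEY SERVICE_ID INITIATOR_EXT
# Prints the comma-separated string on stdout; returns 1 on a bad initiator_ext.
build_target_string() {
    if ! is_hex16 "$6"; then
        echo "initiator_ext must be 16 hexadecimal digits: $6" >&2
        return 1
    fi
    printf 'id_ext=%s,ioc_guid=%s,dgid=%s,pkey=%s,service_id=%s,initiator_ext=%s\n' \
        "$1" "$2" "$3" "$4" "$5" "$6"
}
```

Called with the six values from the Section 8 example, build_target_string prints that example string on one line. The output could then be redirected to an add_target file such as /sys/class/infiniband_srp/srp-mthca0-1/add_target, in the same way as the "echo" command described in Section 5.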
+ +Each initiator is connected to the same target from several ports/HCAs. +The DM multipath is responsible for joining together different paths to the +same target and for fail-over between paths when one of them goes offline. +Rules were added to udev that will execute multipath on newly joined SCSI +devices. + +Each initiator should execute several instances of the SRP daemon, one for each +port. At startup, each SRP daemon detects the SRP targets in the fabric and +sends requests to the ib_srp module to connect to each of them. These +SRP daemons also detect targets that subsequently join the fabric, and send the +ib_srp module requests to connect to them as well. + +High Availability Operation +--------------------------- + +When a path (from port1) to a target fails, the ib_srp module starts an error +recovery process. If this process gets to the reset_host stage and there is no +path to the target from this port, ib_srp will remove this scsi_host. After +the scsi_host is removed, multipath switches to another path to this target +(from another port/HCA). + +When the failed path recovers, it will be detected by the SRP daemon. The SRP +daemon will then request ib_srp to connect to this target. Once the connection +is up, there will be a new scsi_host for this target. The udev rule will then +execute multipath on the devices of this host, and we will return to the +original state (before the path failed). + +High Availability Prerequisites +------------------------------- + +Installation: (Execute once) +- Verify that multipath is installed. If not, it is possible to download it + from http://christophe.varoqui.free.fr/multipath-tools/multipath-tools-0.4.7.tar.bz2 + and then compile and install it. 
+ +- Update udev: (Execute once - for manual activation of High Availability only) + +- Add a file to /etc/udev/rules.d/ (you can call it 91-srp.rules) + This file should have one line: + ACTION=="add", KERNEL=="sd*[!0-9]", RUN+="/sbin/multipath %M:%m" + +Note that when SRPHA_ENABLE is set to "yes" (see the Automatic Activation +of High Availability subsection below), this file is created upon each boot of +the driver and deleted when the driver is unloaded. + +Manual Activation of High Availability +-------------------------------------- + +Initialization: (Execute after each boot of the driver) + 1) Execute modprobe dm-multipath + 2) Execute modprobe ib-srp + 3) Make sure you have created the file /etc/udev/rules.d/91-srp.rules + as described above + 4) Execute for each port and each HCA: + srp_daemon -c -e -R 300 -i <HCA name> -p <port number> + (You can use another value for -R. See the workaround for the rare + race condition in the Known Issues section.) + + This step can be performed by executing srp_daemon.sh, which sends + its log to /var/log/srp_daemon.log. + + Now it is possible to access the SRP LUNs on /dev/mapper/. + + NOTE: It is possible that regular (not SRP) LUNs may also be present; + the SRP LUNs may be identified by their name. + + +Automatic Activation of High Availability +----------------------------------------- +- Set the value of SRPHA_ENABLE in /etc/infiniband/openib.conf to "yes". + +- From the next loading of the driver it will be possible to access the SRP + LUNs on /dev/mapper/ + NOTE: It is possible that regular (not SRP) LUNs may also be present; + the SRP LUNs may be identified by their name. + +- It is possible to see the output of the SRP daemon in /var/log/srp_daemon.log + + +============================================================================== +10.
Shutting Down SRP +============================================================================== + +SRP can be shut down by using "rmmod ib_srp", or by stopping the OFED driver +("/etc/init.d/openibd stop"), or as a by-product of a complete system shutdown. + +Prior to shutting down SRP, remove all references to it. The actions you need +to take depend on the way SRP was loaded. There are three cases. + +a. Without High Availability +------------------------------------ +When working without High Availability, you should unmount all mounted SRP +partitions before shutting down SRP. + + +b. After Manual Activation of High Availability +----------------------------------------------- +If you manually activated SRP High Availability, perform the following steps: +1) Unmount all SRP partitions that were mounted +2) Kill the SRP daemon instances +3) Make sure there are no multipath instances running. If any instances + are running, wait for them to end or kill them. +4) Execute multipath -F + + +c. After Automatic Activation of High Availability +-------------------------------------------------- +If SRP High Availability was automatically activated, SRP shutdown must be +part of the driver shutdown ("/etc/init.d/openibd stop"), which performs +steps 2-4 of case b above. However, you still have to unmount all SRP +partitions that were mounted before driver shutdown. + + +HAL Issue +--------- +The HAL (Hardware Abstraction Layer) system includes a daemon that examines +all devices in the system. In this process, it frequently holds a reference +to the ib_srp module. If you attempt to shut down SRP while this daemon is +holding a reference to ib_srp, the shutdown will fail. Therefore, you +should make sure this will not occur. One solution may be to stop "haldaemon" +(/etc/init.d/haldaemon stop) prior to SRP shutdown. + + +============================================================================== +11.
Known Issues +============================================================================== + +- SRP is not supported on a 32-bit operating system running on a 64-bit + platform. + +- The SCSI device is sent offline when a link goes down for several seconds, + when the subnet manager goes down for a long time, or when a disk is removed + from a target during run-time. + +- There is a very rare race condition which can cause the SRP daemon to miss a + target that joins the fabric. The race can occur if a target that left the + fabric rejoins it after the ib_srp module has decided to remove this target, + but before the scsi_host has been removed. As a result, when the SRP daemon + checks if this target is already connected, it will receive a positive + response and will therefore not reconnect to this target. + + Workaround: Execute the srp_daemon command with the -R option and a + rescan interval in seconds (for example, -R 300). This option causes the + SRP daemon to perform a full rescan of the fabric at that interval. + +- It is recommended to use an SM that supports the enhanced capability mask + matching feature (errata MGTWG8372). With SMs which support this feature, the + SRP daemon generates significantly less communication traffic. + +- When booting OFED with SRP High Availability enabled, executing multipath for + all LUNs on all connections may take some time (several minutes). However, it + is possible to start working while this process is in progress. + +- If SRP High Availability is enabled, disconnections while OFED is booting, or + simultaneous disconnections and connections during normal operation, may lead + to what appears to be a deadlock between multipath instances. + +- High Availability uses multipath, which needs at least udev version 050. The + RHEL4 distribution uses udev 039; therefore, High Availability does not work + on the standard Red Hat distribution. + +- Stopping the driver while SRP High Availability is enabled kills all + multipath processes.
Take appropriate action if multipath is also used + for other purposes. + +- As High Availability is based on Device Mapper multipath, it inherits + multipath's limitations as well as its configuration and tuning options. + See http://christophe.varoqui.free.fr/wiki/wakka.php?wiki=Home + for information on multipath. + To modify and tune the multipath configuration, edit the file /etc/multipath.conf + according to the instructions and tips listed in + /usr/share/doc/packages/multipath-tools/multipath.conf.* + +- If your topology has two physical connections (i.e., network paths) from + a single initiator IB port to two different IB ports on the same Target HCA, + and you wish to have an SRP connection on one path coexist with an SRP + connection on the second path, you must set a different initiator_ext value + on each path. See Section 8, "Multiple Connections from Initiator IB Port + to the Target" for details. + +============================================================================== +12. Vendor Specific Notes +============================================================================== + +Hosts connected to Silverstorm SRP Targets must perform one of the following +steps after upgrading to OFED 1.1 to continue accessing their storage +successfully: + +1. When issuing the "echo" command to add a new SRP Target, the host + must append the string ",initiator_ext=0000000000000001" to the original + echo string.
+ Example: + 'ibsrpdm -c' output is as follows: + + id_ext=0000000000000001,ioc_guid=00066a0138000165,dgid=fe8000000000000 + 000066a0260000165,pkey=ffff,service_id=0000494353535250,io_class=ff00 + + id_ext=0000000000000001,ioc_guid=00066a0238000165,dgid=fe8000000000000 + 000066a0260000165,pkey=ffff,service_id=0000494353535250,io_class=ff00 + + To connect to the first target, the echo command must be: + + echo -n \ + id_ext=0000000000000001,ioc_guid=00066a0138000165,\ + dgid=fe8000000000000000066a0260000165,pkey=ffff,\ + service_id=0000494353535250,io_class=ff00,\ + initiator_ext=0000000000000001 > \ + /sys/class/infiniband_srp/srp-mthca0-1/add_target + + +2. Change the SRP map on the Silverstorm SRP Target to set the expected + initiator extension to 0. For details on how to change the SRP map on a + Silverstorm SRP Target, please refer to the product documentation. + + diff --git a/uDAPL_release_notes.txt b/uDAPL_release_notes.txt new file mode 100644 index 0000000..972e329 --- /dev/null +++ b/uDAPL_release_notes.txt @@ -0,0 +1,925 @@ + + Release Notes for + Gamma 3.2 and OFED 1.1 DAPL Release + October 2006 + + + DAPL GAMMA 3.2/OFED 1.1 RELEASE NOTES + + This release of the DAPL reference implementation + is timed to coincide with OFED release 1.1 of the + Open Fabrics (www.openfabrics.org) software stack. + + NEW SINCE Gamma 3.1 and OFED 1.0 + + * BUG FIXES + + + Update obsolete CLK_TCK to CLOCKS_PER_SEC + + Fill out some uninitialized fields in the ia_attr structure returned by + dat_ia_query(). + + Update dtest to support multiple segments on rdma write and change + makefile to use OpenIB-cma by default. + + Add support for dat_evd_set_unwaitable on a DTO evd in openib_cma + provider + + Added errno reporting (message and return codes) during open to help + diagnose create thread issues.
+ + Fix some suspicious inline assembly EIEIO_ON_SMP and ISYNC_ON_SMP + + Fix IA64 build problems + + Lower the reject debug message level so we don't see warnings when + consumers reject. + + Added support for active side TIMED_OUT event from a provider. + + Fix bug in dapls_ib_get_dat_event() call after adding new unreachable + event. + + Update for new rdma_create_id() function signature. + + Set max rdma read per EP attributes + + Report the proper error and timeout events. + + Socket CM fix to guard against using a loopback address as the local + device address. + + Use the uCM set_option feature to adjust connect request timeout + retry values. + + Fix to disallow any event after a disconnect event. + + * OFED 1.1 uDAPL source build instructions: + + cd /usr/local/ofed/src/openib-1.1/src/userspace/dapl + + # NON_DEBUG build configuration + + ./configure --disable-libcheck --prefix /usr/local/ofed + --libdir /usr/local/ofed/lib64 LDFLAGS=-L/usr/local/ofed/lib64 + CPPFLAGS="-I../libibverbs/include -I../librdmacm/include" + + # build and install + + make + make install + + # DEBUG build configuration + + ./configure --disable-libcheck --enable-debug --prefix /usr/local/ofed + --libdir /usr/local/ofed/lib64 LDFLAGS=-L/usr/local/ofed/lib64 + CPPFLAGS="-I../libibverbs/include -I../librdmacm/include" + + # build and install + + make + make install + + # DEBUG messages: set environment variable DAPL_DBG_TYPE, default + mapping is 0x0003 + + DAPL_DBG_TYPE_ERR = 0x0001, + DAPL_DBG_TYPE_WARN = 0x0002, + DAPL_DBG_TYPE_EVD = 0x0004, + DAPL_DBG_TYPE_CM = 0x0008, + DAPL_DBG_TYPE_EP = 0x0010, + DAPL_DBG_TYPE_UTIL = 0x0020, + DAPL_DBG_TYPE_CALLBACK = 0x0040, + DAPL_DBG_TYPE_DTO_COMP_ERR= 0x0080, + DAPL_DBG_TYPE_API = 0x0100, + DAPL_DBG_TYPE_RTN = 0x0200, + DAPL_DBG_TYPE_EXCEPTION = 0x0400, + DAPL_DBG_TYPE_SRQ = 0x0800, + DAPL_DBG_TYPE_CNTR = 0x1000 + + + Note: The udapl provider library libdaplscm.so is untested and + unsupported, thus customers should not use it. 
+ It will be removed in the next OFED release. + + DAPL GAMMA 3.1 RELEASE NOTES + + This release of the DAPL reference implementation + is timed to coincide with the first release of the + Open Fabrics (www.openfabrics.org) software stack. + This release adds support for this new stack, which + is now the native Linux RDMA stack. + + This release also adds a new licensing option. In + addition to the Common Public License and BSD License, + the code can now be licensed under the terms of the GNU + General Public License (GPL) version 2. + + NEW SINCE Gamma 3.0 + + - GPL v2 added as a licensing option + - OpenFabrics (aka OpenIB) gen2 verbs support + - dapltest support for Solaris 10 + + * BUG FIXES + + + Fixed a disconnect event processing race + + Fix to destroy all QPs on IA close + + Removed compiler warnings + + Removed unused variables + + And many more... + + DAPL GAMMA 3.0 RELEASE NOTES + + This is the first release based on version 1.2 of the spec. There + are some components, such as shared receive queues (SRQs), which + are not implemented yet. + + Once again there were numerous bug fixes submitted by the + DAPL community. + + NEW SINCE Beta 2.06 + + - DAT 1.2 headers + - DAT_IA_HANDLEs implemented as small integers + - Changed default device name to be "ia0a" + - Initial support for Linux 2.6.X kernels + - Updates to the OpenIB gen 1 provider + + * BUG FIXES + + + Updated Makefile for differentiation between OS releases. + + Updated atomic routines to use appropriate API + + Removed unnecessary assert from atomic_dec. + + Fixed bugs when freeing a PSP. + + Fixed error codes returned by the DAT static registry. + + Kernel updates for dat_strerror. + + Cleaned up the transport layer/adapter interface to use DAPL + types rather than transport types. + + Fixed ring buffer reallocation. + + Removed old test/udapl/dapltest directory. + + Fixed DAT_IA_HANDLE translation (from pointer to int and + vice versa) on 64-bit platforms.
+ + DAPL BETA 2.06 RELEASE NOTES + + We are not planning any further releases of the Beta series, + which are based on the 1.1 version of the spec. There may be + further releases for bug fixes, but we expect the DAPL + community to move to the new 1.2 version of the spec and the + changes mandated in the reference implementation. + + The biggest item in this release is the first inclusion of the + OpenIB Gen 1 provider, an item generating a lot of interest in + the IB community. This implementation has graciously been + provided by the Mellanox team. The kdapl implementation is in + progress, and we imagine work will soon begin on Gen 2. + + There are also a handful of bug fixes available, as well as a long + awaited update to the endpoint design document. + + NEW SINCE Beta 2.05 + + - OpenIB gen 1 provider support has been added + - Added dapls_evd_post_generic_event(), a routine to post generic + event types as requested by some providers. Also cleaned up + error reporting. + - Updated the endpoint design document in the doc/ directory. + + * BUG FIXES + + + Cleaned up memory leak on close by freeing the HCA structure. + + Removed bogus #defs for rdtsc calls on IA64. + + Changed dapltest thread types to use internal types for + portability & correctness + + Various 64 bit enhancements & updates + + Fixes to conformance test that were defining CONN_QUAL twice + and using it in different ways + + Cleaned up private data handling in ep_connect & provider + support: we now avoid extra copy in connect code; reduced + stack requirements by using private_data structure in the EP; + removed provider variable. + + Fixed problem in the dat conformance test where cno_wait would + attempt to dereference a timer value and SEGV. + + Removed old vestiges of deprecated POLLING_COMPLETIONS + conditionals.
+ + DAPL BETA 2.05 RELEASE NOTES + + This was to be a very minor release, the primary change was + going to be the new wording of the DAT license as contained in + the header for all source files. But the interest and + development occurring in DAPL provided some extra bug fixes, and + some new functionality that has been requested for a while. + + First, you may notice that every single source file was + changed. If you read the release notes from DAPL BETA 2.04, you + were warned this would happen. There was a legal issue with the + wording in the header, the end result was that every source file + was required to change the word 'either of' to 'both'. We've + been putting this change off as long as possible, but we wanted + to do it in a clean drop before we start working on DAT 1.2 + changes in the reference implementation, just to keep things + reasonably sane. + + kdapltest has enabled three of the subtests supported by + dapltest. The Performance test in particular has been very + useful to dapltest in getting minima and maxima. The Limit test + pushes the limits by allocating the maximum number of specific + resources. And the FFT tests are also available. + + Most vendors have supported shared memory regions for a while, + several of which have asked the reference implementation team to + provide a common implementation. Shared memory registration has + been tested on ibapi, and compiled into vapi. Both InfiniBand + providers have the restriction that a memory region must be + created before it can be shared; not all RDMA APIs are this way, + several allow you to declare a memory region shared when it is + registered. Hence, details of the implementation are hidden in + the provider layer, rather than forcing other APIs to do + something strange. + + This release also contains some changes that will allow dapl to + work on Opteron processors, as well as some preliminary support + for Power PC architecture. 
These features are not well tested + and may be incomplete at this time. + + Finally, we have been asked several times over the course of the + project for a canonical interface between the common and + provider layers. This release includes a dummy provider to meet + that need. Anyone should be able to download the release and do + a: + make VERBS=DUMMY + + And have a cleanly compiled dapl library. This will be useful + both to those porting new transport providers and to those + going to new machines. + + The DUMMY provider has been compiled on both Linux and Windows + machines. + + + NEW SINCE Beta 2.4 + - kdapltest enhancements: + * Limit subtests now work + * Performance subtests now work. + * FFT tests now work. + + - The VAPI headers have been refreshed by Mellanox + + - Initial Opteron and PPC support. + + - Atomic data types now have consistent treatment, allowing us to + use native data types other than integers. The Linux kdapl + uses atomic_t, allowing dapl to use the kernel macros and + eliminate the assembly code in dapl_osd.h + + - The license language was updated per the direction of the + DAT Collaborative. This two word change affected the header + of every file in the tree. + + - SHARED memory regions are now supported. + + - Initial support for the TOPSPIN provider. + + - Added a dummy provider, essentially the NULL provider. Its + purpose is to aid in porting and to clarify exactly what is + expected in a provider implementation. + + - Removed memory allocation from the DTO path for VAPI + + - cq_resize will now allow the CQ to be resized smaller. Not all + providers support this, but it's a provider problem, not a + limitation of the common code. + + * BUG FIXES + + + Removed spurious lock in dapl_evd_connection_callb.c that + would have caused a deadlock. + + The Async EVD was getting torn down too early, potentially + causing lost errors. It has been moved later in the teardown + process.
+ + kDAPL replaced mem_map_reserve() with newer SetPageReserved() + for better Linux integration. + + kdapltest no longer allocate large print buffers on the stack, + is more careful to ensure buffers don't overflow. + + Put dapl_os_dbg_print() under DAPL_DBG conditional, it is + supposed to go away in a production build. + + dapltest protocol version has been bumped to reflect the + change in the Service ID. + + Corrected several instances of routines that did not adhere + to the DAT 1.1 error code scheme. + + Cleaned up vapi ib_reject_connection to pass DAT types rather + than provider specific types. Also cleaned up naming interface + declarations and their use in vapi_cm.c; fixed incorrect + #ifdef for naming. + + Initialize missing uDAPL provider attr, pz_support. + + Changes for better layering: first, moved + dapl_lmr_convert_privileges to the provider layer as memory + permissions are clearly transport specific and are not always + defined in an integer bitfield; removed common routines for + lmr and rmr. Second, move init and release setup/teardown + routines into adapter_util.h, which defined the provider + interface. + + Cleaned up the HCA name cruft that allowed different types + of names such as strings or ints to be dealt with in common + code; but all names are presented by the dat_registry as + strings, so pushed conversions down to the provider + level. Greatly simplifies names. + + Changed deprecated true/false to DAT_TRUE/DAT_FALSE. + + Removed old IB_HCA_NAME type in favor of char *. + + Fixed race condition in kdapltest's use of dat_evd_dequeue. + + Changed cast for SERVER_PORT_NUMBER to DAT_CONN_QUAL as it + should be. + + Small code reorg to put the CNO into the EVD when it is + allocated, which simplifies things. + + Removed gratuitous ib_hca_port_t and ib_send_op_type_t types, + replaced with standard int. + + Pass a pointer to cqe debug routine, not a structure. Some + clean up of data types. 
+ + kdapl threads now invoke reparent_to_init() on exit to allow + threads to get cleaned up. + + + + DAPL BETA 2.04 RELEASE NOTES + + The big changes for this release involve stricter adherence + to the original dapl architecture. Originally, only InfiniBand + providers were available, so allowing various data types and + event codes to show through into common code wasn't a big deal. + + But today, there are an increasing number of providers available + on a number of transports. Requiring an IP iWarp provider to + match up to InfiniBand events is silly, for example. + + Restructuring the code allows more flexibility in providing an + implementation. + + There are also a large number of bug fixes available in this + release, particularly in kdapl related code. + + Be warned that the next release will change every file in the + tree as we move to the newly approved DAT license. This is a + small change, but all files are affected. + + Future releases will also support the soon-to-be-ratified DAT + 1.2 specification. + + This release has benefited from many bug reports and fixes from + a number of individuals and companies. On behalf of the DAPL + community, thank you! + + + NEW SINCE Beta 2.3 + + - Made several changes to be more rigorous on the layering + design of dapl. The intent is to make it easier for non + InfiniBand transports to use dapl. These changes include: + + * Revamped the ib_hca_open/close code to use an hca_ptr + rather than an ib_handle, giving the transport layer more + flexibility in assigning transport handles and resources. + + * Removed the CQD calls, as they are specific to the IBM API; + folded this functionality into the provider open/close calls. + + * Moved VAPI, IBAPI transport specific items into a transport + structure placed inside of the HCA structure. Also updated + routines using these fields to use the new location. Cleaned + up provider knobs that have been exposed for too long.
+ + * Changed a number of provider routines to use DAPL structure + pointers rather than exposing provider handles & values. Moved + provider specific items out of common code, including provider + data types (e.g. ib_uint32_t). + + * Pushed provider completion codes and type back into the + provider layer. We no longer use EVD or CM completion types at + the common layer, instead we obtain the appropriate DAT type + from the provider and process only DAT types. + + * Change private_data handling such that we can now accommodate + variable length private data. + + - Remove DAT 1.0 cruft from the DAT header files. + + - Better spec compliance in headers and various routines. + + - Major updates to the VAPI implementation from + Mellanox. Includes initial kdapl implementation + + - Move kdapl platform specific support for hash routines into + OSD file. + + - Cleanups to make the code more readable, including comments + and certain variable and structure names. + + - Fixed CM_BUSTED code so that it works again: very useful for + new dapl ports where infrastructure is lacking. Also made + some fixes for IBHOSTS_NAMING conditional code. + + - Added DAPL_MERGE_CM_DTO as a compile time switch to support + EVD stream merging of CM and DTO events. Default is off. + + - 'Quit' test ported to kdapltest + + - uDAPL now builds on Linux 2.6 platform (SuSE 9.1). + + - kDAPL now builds for a larger range of Linux kernels, but + still lacks 2.6 support. + + - Added shared memory ID to LMR structure. Shared memory is + still not fully supported in the reference implementation, but + the common code will appear soon. + + * Bug fixes + - Various Makefiles fixed to use the correct dat registry + library in its new location (as of Beta 2.03) + - Simple reorg of dat headers files to be consistent with + the spec. + - fixed bug in vapi_dto.h recv macro where we could have an + uninitialized pointer. 
+ - Simple fix in dat_dr.c to initialize a variable early in the + routine before errors occur. + - Removed private data pointers from a CONNECTED event, as + there should be no private data here. + - dat_strerror no longer returns an uninitialized pointer if + the error code is not recognized. + - dat_dup_connect() will reject 0 timeout values, per the + spec. + - Removed unused internal_hca_names parameter from + ib_enum_hcas() interface. + - Use a temporary DAT_EVENT for kdapl up-calls rather than + making assumptions about the current event queue. + - Relocated some platform dependent code to an OSD file. + - Eliminated several #ifdefs in .c files. + - Inserted a missing unlock() on an error path. + - Added bounds checking on size of private data to make sure + we don't overrun the buffer + - Fixed a kdapltest problem that caused a machine to panic if + the user hit ^C + - kdapltest now uses spin locks more appropriate for their + context, e.g. spin_lock_bh or spin_lock_irq. Under a + conditional. + - Fixed kdapltest loops that drain EVDs so they don't go into + endless loops. + - Fixed bug in dapl_llist_add_entry link list code. + - Better error reporting from provider code. + - Handle case of user trying to reap DTO completions on an + EP that has been freed. + - No longer hold lock when ep_free() calls into provider layer + - Fixed cr_accept() to not have an extra copy of + private_data. + - Verify private_data pointers before using them, avoid + panic. + - Fixed memory leak in kdapltest where print buffers were not + getting reclaimed. + + + + DAPL BETA 2.03 RELEASE NOTES + + There are some prominent features in this release: + 1) dapltest/kdapltest. The dapltest test program has been + rearchitected such that a kernel version is now available + to test with kdapl. The most obvious change is a new + directory structure that more closely matches other core + dapl software. 
But there are a large number of changes + throughout the source files to accommodate both the + differences in udapl/kdapl interfaces, but also more mundane + things such as printing. + + The new dapltest is in the tree at ./test/dapltest, while the + old remains at ./test/udapl/dapltest. For this release, we + have maintained both versions. In a future release, perhaps + the next release, the old dapltest directory will be + removed. Ongoing development will only occur in the new tree. + + 2) DAT 1.1 compliance. The DAT Collaborative has been busy + finalizing the 1.1 revision of the spec. The header files + have been reviewed and posted on the DAT Collaborative web + site, they are now in full compliance. + + The reference implementation has been at a 1.1 level for a + while. The current implementation has some features that will + be part of the 1.2 DAT specification, but only in places + where full compatibility can be maintained. + + 3) The DAT Registry has undergone some positive changes for + robustness and support of more platforms. It now has the + ability to support several identical provider names + simultaneously, which enables the same dat.conf file to + support multiple platforms. The registry will open each + library and return when successful. For example, a dat.conf + file may contain multiple provider names for ex0a, each + pointing to a different library that may represent different + platforms or vendors. This simplifies distribution into + different environments by enabling the use of common + dat.conf files. + + In addition, there are a large number of bug fixes throughout + the code. Bug reports and fixes have come from a number of + companies. + + Also note that the Release notes are cleaned up, no longer + containing the complete text of previous releases. + + * EVDs no longer support DTO and CONNECTION event types on the + same EVD. 
NOTE: The problem is maintaining the event ordering + between two channels such that no DTO completes before a + connection is received; and no DTO completes after a + disconnect is received. For 90% of the cases this can be made + to work, but the remaining 10% will cause serious performance + degradation to get right. + + NEW SINCE Beta 2.2 + + * DAT 1.1 spec compliance. This includes some new types, error + codes, and moving structures around in the header files, + among other things. Note the Class bits of dat_error.h have + returned to a #define (from an enum) to cover the broadest + range of platforms. + + * Several additions for robustness, including handle and + pointer checking, better argument checking, state + verification, etc. Better recovery from error conditions, + and some assert()s have been replaced with 'if' statements to + handle the error. + + * EVDs now maintain the actual queue length, rather than the + requested amount. Both the DAT spec and IB (and other + transports) allow the underlying implementation to provide + more CQ entries than requested. + + Requests for the same number of entries contained by an EVD + return immediate success. + + * kDAPL enhancements: + - module parameters & OS support calls updated to work with + more recent Linux kernels. + - kDAPL build options changes to match the Linux kernel, vastly + reducing the size and making it more robust. + - kDAPL unload now works properly + - kDAPL takes a reference on the provider driver when it + obtains a verbs vector, to prevent an accidental unload + - Cleaned out all of the uDAPL cruft from the linux/osd files. + + * New dapltest (see above). + + * Added a new I/O trace facility, enabling a developer to debug + all I/O that are in progress or recently completed. Default + is OFF in the build. + + * 0 timeout connections now refused, per the spec. + + * Moved the remaining uDAPL specific files from the common/ + directory to udapl/. 
Also removed udapl files from the kdapl + build. + + * Bug fixes + - Better error reporting from provider layer + - Fixed race condition on reference counts for posting DTO + ops. + - Use DAT_COMPLETION_SUPPRESS_FLAG to suppress successful + completion of dapl_rmr_bind (instead of + DAT_COMPLETION_UNSIGNALLED, which is for non-notification + completion). + - Verify psp_flags value per the spec + - Bug in psp_create_any() checking psp_flags fixed + - Fixed type of flags in ib_disconnect from + DAT_COMPLETION_FLAGS to DAT_CLOSE_FLAGS + - Removed hard coded check for ASYNC_EVD. Placed all EVD + prevention in evd_stream_merging_supported array, and + prevent ASYNC_EVD from being created by an app. + - ep_free() fixed to comply with the spec + - Replaced various printfs with dbg_log statements + - Fixed kDAPL interaction with the Linux kernel + - Corrected phy_register prototype + - Corrected kDAPL wait/wakeup synchronization + - Fixed kDAPL evd_kcreate() such that it no longer depends + on uDAPL-only code. + - dapl_provider.h had wrong guard #def: changed DAT_PROVIDER_H + to DAPL_PROVIDER_H + - removed extra (and bogus) call to dapls_ib_completion_notify() + in evd_kcreate.c + - Inserted missing error code assignment in + dapls_rbuf_realloc() + - When a CONNECTED event arrives, make sure we are ready for + it, else something bad may have happened to the EP and we + just return; this replaces an explicit check for a single + error condition with the general check for a + state capable of dealing with the request. + - Better context pointer verification. Removed locks around + call to ib_disconnect on an error path, which would result + in a deadlock. Added code for BROKEN events. + - Brought the vapi code more up to date: added conditional + compile switches, removed obsolete __ActivePort, deal + with 0 length DTO + - Several dapltest fixes to bring the code up to the 1.1 + specification.
+ - Fixed mismatched dalp_os_dbg_print() #else dapl_Dbg_Print(); + the latter was replaced with the former. + - ep_state_subtype() now includes UNCONNECTED. + - Added some missing ibapi error codes. + + + + NEW SINCE Beta 2.1 + + * Changes for Errata and 1.1 Spec + - Removed DAT_NAME_NOT_FOUND, per DAT errata + - EVDs with DTO and CONNECTION flags set are no longer valid. + - Removed DAT_IS_SUCCESS macro + - Moved provider attribute structures from vendor files to udat.h + and kdat.h + - kdapl UPCALL_OBJECT now passed by reference + + * Completed dat_strerr return strings + + * Now support interrupted system calls + + * dapltest now uses dat_strerror for error reporting. + + * A large number of files were reformatted to meet the project standard; + the changes are cosmetic but improve readability and + maintainability. Also cleaned up a number of comments during + this effort. + + * dat_registry and RPM file changes (contributed by Steffen Persvold): + - Renamed the RPM name of the registry to be dat-registry + (renamed the .spec file too, some cvs add/remove needed) + - Added the ability to create RPMs as a normal user (using + temporary paths); works on SuSE, Fedora, and RedHat. + - 'make rpm' now works even if you didn't build first. + - Changed to using the GNU __attribute__((constructor)) and + __attribute__((destructor)) on the dat_init functions, dat_init + and dat_fini. The old -init and -fini options to LD make + applications crash on some platforms (Fedora for example). + - Added support for 64 bit platforms. + - Added code to allow multiple provider names in the registry, + primarily to support ia32 and ia64 libraries simultaneously. + Provider names are now kept in a list; the first successful + library open will be the provider. + + * Added initial infrastructure for DAPL_DCNTR, a feature that + will aid in debug and tuning of a dapl implementation. Partial + implementation only at this point.
+ + * Bug fixes + - Prevent debug messages from crashing dapl in EVD completions by + verifying the error code to ensure data is valid. + - Verify CNO before using it to clean up in evd_free() + - CNO timeouts now return correct error codes, per the spec. + - cr_accept now complies with the spec concerning connection + requests that go away before the accept is invoked. + - Verify valid EVD before posting connection events on active side + of a connection. EP locking also corrected. + - Clean up of dapltest Makefile, no longer need to declare + DAT_THREADSAFE + - Fixed check of EP states to see if we need to disconnect when an + IA is closed. + - ep_free() code reworked such that we can properly close a + connection pending EP. + - Changed disconnect processing to comply with the spec: user will + see a BROKEN event, not DISCONNECTED. + - If we get a DTO error, issue a disconnect to let the CM and + the user know the EP state changed to disconnect; checked IBA + spec to make sure we disconnect on correct error codes. + - ep_disconnect now properly deals with abrupt disconnects on the + active side of a connection. + - PSP now created in the correct state for psp_create_any(), making + it usable. + - dapl_evd_resize() now returns correct status, instead of always + DAT_NOT_IMPLEMENTED. + - dapl_evd_modify_cno() does better error checking before invoking + the provider layer, avoiding bugs. + - Simple change to allow dapl_evd_modify_cno() to set the CNO to + NULL, per the spec. + - Added required locking around call to dapl_sp_remove_cr. + + - Fixed problems related to dapl_ep_free: the new + disconnect(abrupt) allows us to do a more immediate teardown of + connections, removing the need for the MAGIC_EP_EXIT magic + number/state, which has been removed. Much cleanup of paths, + and made more robust. + - Made changes to meet the spec, uDAPL 1.1 6.3.2.3: CNO is + triggered if there are waiters when the last EVD is removed + or when the IA is freed.
+ - Added code to deal with the provider synchronously telling us + a connection is unreachable, and generate the appropriate + event. + - Changed timer routine type from unsigned long to uintptr_t + to better fit with machine architectures. + - ep.param data now initialized in ep_create, not ep_alloc. + - Or Gerlitz provided updates to Mellanox files for evd_resize, + fw attributes, many others. Also implemented changes for correct + sizes on REP side of a connection request. + + + + NEW SINCE Beta 2.0 + + * dat_echo now DAT 1.1 compliant. Various small enhancements. + + * Revamped atomic_inc/dec to be void, the return value was never + used. This allows kdapl to use Linux kernel equivalents, and + is a small performance advantage. + + * kDAPL: dapl_evd_modify_upcall implemented and tested. + + * kDAPL: physical memory registration implemented and tested. + + * uDAPL now builds cleanly for non-debug versions. + + * Default RDMA credits increased to 8. + + * Default ACK_TIMEOUT now a reasonable value (2 sec vs old 2 + months). + + * Cleaned up dat_error.h, now 1.1 compliant in comments. + + * evd_resize initial implementation. Untested. + + * Bug fixes + - __KDAPL__ is defined in kdat_config.h, so apps don't need + to define it. + - Changed include file ordering in kdat.h to put kdat_config.h + first. + - resolved connection/tear-down race on the client side. + - kDAPL timeouts now scaled properly; fixed 3 orders of + magnitude difference. + - kDAPL EVD callbacks now get invoked for all completions; old + code would drop them in heavy utilization. + - Fixed error path in kDAPL evd creation, so we no longer + leak CNOs. + - create_psp_any returns correct error code if it can't create + a connection qualifier. + - lock fix in ibapi disconnect code. 
+ - kDAPL INFINITE waits now work properly (non connection + waits) + - kDAPL driver unload now works properly + - dapl_lmr_[k]create now returns 1.1 error codes + - ibapi routines now return DAT 1.1 error codes + + + + NEW SINCE Beta 1.10 + + * kDAPL is now part of the DAPL distribution. See the release + notes above. + + The kDAPL 1.1 spec is now contained in the doc/ subdirectory. + + * Several files have been moved around as part of the kDAPL + checkin. Some files that were previously in udapl/ are now + in common/, and some in common/ are now in udapl/. The goal was + to make sure files are properly located and make sense for + the build. + + * Source code formatting changes for consistency. + + * Bug fixes + - dapl_evd_create() was comparing the wrong bit combinations, + allowing bogus EVDs to be created. + - Removed code that swallowed zero length I/O requests, which + are allowed by the spec and are useful to applications. + - Locking in dapli_get_sp_ep was asymmetric; fixed it so the + routine will take and release the lock. Cosmetic change. + - dapl_get_consumer_context() will now verify the pointer + argument 'context' is not NULL. + + + + + OBTAIN THE CODE + + To obtain the tree for your local machine you can check it + out of the source repository using CVS tools. CVS is common + on Unix systems and available as freeware on Windows machines. + The commands to anonymously obtain the source code from + Source Forge (with no password) are: + + cvs -d:pserver:anonymous@cvs.dapl.sourceforge.net:/cvsroot/dapl login + cvs -z3 -d:pserver:anonymous@cvs.dapl.sourceforge.net:/cvsroot/dapl co . + + When prompted for a password, simply press the Enter key. + + Source Forge also contains explicit directions on how to become + a developer, as well as how to use different CVS commands.
You may + also browse the source code using the URL: + + http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/dapl/ + + + SYSTEM REQUIREMENTS + + This project has been implemented on Red Hat Linux 7.3, SuSE + SLES 8, Windows 2000, RHEL 3.0, and a couple of other Linux + distributions. The structure of the code is designed to be + easily adapted to other operating systems. + + The DAPL team has used Mellanox Tavor based InfiniBand HCAs for + development, and continues with this platform. Our HCAs use the + IB verbs API submitted by IBM. Mellanox has contributed an + adapter layer using their VAPI verbs API. Either platform is + available to any group considering DAPL work. The structure of + the uDAPL source allows other provider API sets to be easily + integrated. + + The development team uses any one of three topologies: two HCAs + in a single machine; a single HCA in each of two machines; and + most commonly, a switch. Machines connected to a switch may have + more than one HCA. + + The DAPL Plugfest revealed that switches and HCAs available from + most vendors will interoperate with little trouble, given the + most recent releases of software. The dapl reference team makes + no recommendation on HCA or switch vendors. + + Explicit machine configurations are available upon request. + + + IN THE TREE + + The DAPL tree contains source code for the uDAPL and kDAPL + implementations, and also includes tests and documentation. + + The included documentation covers the base level API of the + providers: the IBM Access API and the Mellanox Verbs API. Also + included are a growing number of DAPL design documents which + lead the reader through specific DAPL subsystems. More + design documents are in progress and will appear in the tree in + the near future. + + A small number of test applications and a unit test framework + are also included.
dapltest is the primary testing application + used by the DAPL team; it is capable of simulating a variety of + loads and exercises a large number of interfaces. Full + documentation is included for each of the tests. + + Recently, the dapl conformance test has been added to the source + repository. The test provides coverage of the most common + interfaces, doing both positive and negative testing. Vendors + providing a DAPL implementation are strongly encouraged to run + this set of tests. + + + MAKEFILE NOTES + + There are a number of #ifdefs in the code that were necessary + during early development. They are disappearing as we + have time to take advantage of features and work available from + newer releases of provider software. You may notice an #ifdef + _BUSTED, which indicates a particular feature was not + working at the time the code was written and the DAPL team + developed a work-around. + + These #ifdefs are not documented as the intent is to remove + them as soon as possible. + + Of particular relevance are the following #defines: + + - CM_BUSTED + + The DAPL team has been an early adopter of InfiniBand and has + had to improvise missing functionality while the vendors lag + our development. InfiniBand uses a Connection Manager (CM) to + establish a connection between nodes. This #define essentially + 'fakes' a connection by moving a QP into the appropriate + state. Most of the IB vendors have a working CM now and this + is no longer the default, but the code remains as some + development groups are working to catch up. + + - NO_NAME_SERVICE + + Naming is a thorny issue in InfiniBand: a + hostname or an interface name must be translated to a GID that can be used to + establish a connection with a remote machine. The reference + implementation provides a simple name service under this + #define. The goal is to use IPoIB when it becomes + available. NO_NAME_SERVICE will probably remain in the code + long term in order to enable various implementations.
A + description of how this works is found in the end_point_design + document in the doc/ directory. + + + CONTRIBUTIONS + + As is common to Source Forge projects, there are a small number + of developers directly associated with the source tree and having + privileges to change the tree. Requested updates, changes, bug + fixes, enhancements, or contributions should be sent to Steve + Sears at sjs@netapp.com for review. We welcome your + contributions and expect the quality of the project will + improve thanks to your help. + + The core DAPL team is: + + Steve Sears + Philip Christopher + + ... with contributions from a number of excellent engineers in + various companies contributing to the open source effort. + + + ONGOING WORK + + Not all of the DAPL spec is implemented at this time. + Functionality such as shared memory will probably not be + implemented by the reference implementation (there is a write up + on this in the doc/ area), and there are yet various cases where + work remains to be done. And of course, not all of the + implemented functionality has been tested yet. The DAPL team + continues to develop and test the tree with the intent of + completing the specification and delivering a robust and useful + implementation. + + +The DAPL Team +