From 7289bf0d72f260dbbe8d11d34e974dcaa06186e9 Mon Sep 17 00:00:00 2001
From: Stan Smith
The OpenFabrics Enterprise Distribution for Windows package is composed of software modules intended
A single running process (opensm.exe) is required to configure
and thus make an InfiniBand subnet usable. For most cases, InfiniBand
User's Manual
Release 2.3
06/18/2010
Overview
<return-to-top>
Subnet Management with OpenSM version 3.3.6
If opensm.exe is run from a command window, %TEMP% is not
defined as %windir%\TEMP\.
InfiniBand Subnet Management from a command window

opensm - InfiniBand subnet manager and administration (SM/SA)
SYNOPSIS
opensm
opensm now also contains an experimental performance manager.
opensm defaults were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB fabric, initialize it, and sweep occasionally for changes.
opensm attaches to a specific IB port on the local machine and configures
only the fabric connected to it. (If the local machine has other IB ports,
opensm will ignore the fabrics connected to those other ports.) If no port is
specified, it will select the first "best" available port.

opensm can present the available ports and prompt for a port number to
attach to.

By default, the run is logged to two files: %TEMP%\osm.syslog (aka
%windir%\temp\osm.syslog) and %windir%\temp\opensm.log. The first file will
register only general major events, whereas the second will include details
of reported errors. All errors reported in this second file should be treated
as indicators of IB fabric health issues. (Note that when a fatal and
non-recoverable error occurs, opensm will exit.) Both log files should
include the message "SUBNET UP" if opensm was able to set up the subnet
correctly.
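As a quick health check, one can confirm the "SUBNET UP" message is present in the log. The sketch below simulates this with a stand-in log fragment; the file name and contents are illustrative, not real opensm output:

```shell
# Simulate a fragment of an opensm log; a real log lives under %windir%\temp\.
printf 'Entering MASTER state\nSUBNET UP\n' > opensm.log.sample

# Count occurrences of the health message (expect at least one).
grep -c 'SUBNET UP' opensm.log.sample
```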
BIT    LOG LEVEL ENABLED
----   -----------------
0x01 - ERROR (error messages)
0x02 - INFO (basic messages, low volume)
0x20 - FRAMES (dumps all SMP and GMP frames)
0x40 - ROUTING (dump FDB routing information)
0x80 - currently unused.
Without -D, OpenSM defaults to ERROR + INFO (0x3).
Specifying -D 0 disables all messages.
Specifying -D 0xFF enables all messages (see -V).
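Since -D takes a bitwise OR of the flags in the table above, a combined verbosity mask can be computed before launching opensm. In this sketch the chosen combination (ERROR + INFO + ROUTING) is just an example:

```shell
# Flag values from the log-level table above.
ERROR=$((0x01)); INFO=$((0x02)); ROUTING=$((0x40))

# Bitwise-OR the flags into a single mask: 0x01 | 0x02 | 0x40 = 0x43.
MASK=$(printf '0x%X' $(( ERROR | INFO | ROUTING )))
echo "$MASK"

# opensm would then be started as (not run here):
#   opensm -D $MASK
```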
This option specifies a debug option.
These options are not normally needed.
The number following -d selects the debug
option to enable as follows:
OPT Description
--- -----------------
-d0 - Ignore other SM nodes
Display this usage info then exit.
Examples:
sc.exe control OpenSM 128
# will clear the contents of %windir%\temp\osm.log (logrotate)
sc.exe control OpenSM 129
# start a new heavy sweep
The default name of the OpenSM partitions configuration file is
%ProgramFiles%\OFED\OpenSM\partitions.conf. The default may be changed by
using the --Pconfig (-P) option with OpenSM.

The default partition will be created by OpenSM unconditionally, even when
the partition configuration file does not exist or cannot be accessed.

The default partition has a P_Key value of 0x7fff. OpenSM's port will always
have full membership in the default partition. All other end ports will have
full membership if the partition configuration file is not found or cannot be
accessed, or limited membership if the file exists and can be accessed but
there is no rule for the Default partition.

Effectively, this amounts to the same as if one of the following rules
appeared in the partition configuration file.

In the case of no rule for the Default partition:

Default=0x7fff : ALL=limited, SELF=full ;

In the case of no partition configuration file, or the file cannot be
accessed:

Default=0x7fff : ALL=full ;

File Format

Comments:

Any line content following a '#' character is a comment and is ignored by
the parser.

General file format:

<Partition Definition>:<PortGUIDs list> ;

Partition Definition:
[PartitionName][=PKey][,flag[=value]][,defmember=full|limited]

  PartitionName - string, will be used with logging. When omitted, an
                  empty string will be used.
  PKey          - P_Key value for this partition. Only the low 15 bits
                  will be used. When omitted, the value will be
                  autogenerated.
  flag          - used to indicate IPoIB capability of this partition.
  defmember=full|limited - specifies default membership for the port guid
                  list. Default is limited.

Currently recognized flags are:

  ipoib      - indicates that this partition may be used for IPoIB; as a
               result, an IPoIB-capable MC group will be created.
  rate=<val> - specifies rate for this IPoIB MC group
               (default is 3 (10Gbps))
scope=<val> - specifies scope for this IPoIB MC group
(default is 2 (link local)). Multiple scope settings
are permitted for a partition.
Note that values for rate, mtu, and scope should be specified as defined in the IBTA specification (for example, mtu=4 for 2048).
PortGUIDs list:

  PortGUID        - GUID of partition member EndPort. Hexadecimal
                    numbers should start with 0x; decimal numbers are
                    accepted too.
  full or limited - indicates full or limited membership for this
                    port. When omitted (or unrecognized), limited
                    membership is assumed.

There are several useful keywords for PortGUID definition:

  - 'ALL' means all end ports in this subnet.
  - 'ALL_CAS' means all Channel Adapter end ports in this subnet.
  - 'ALL_SWITCHES' means all Switch end ports in this subnet.
  - 'ALL_ROUTERS' means all Router end ports in this subnet.
  - 'SELF' means the subnet manager's port.

An empty list means no ports in this partition.

Notes:

White space is permitted between delimiters ('=', ',', ':', ';').

A line can be wrapped after the ':' that follows the Partition Definition,
and between PortGUIDs.

PartitionName does not need to be unique; PKey does need to be unique. If a
PKey is repeated, those partition configurations will be merged and the
first PartitionName will be used (see also the next note).

It is possible to split a partition configuration across more than one
definition, but then the PKey should be explicitly specified (otherwise
different PKey values will be generated for those definitions).
Examples:

  Default=0x7fff : ALL, SELF=full ;
  Default=0x7fff : ALL, ALL_SWITCHES=full, SELF=full ;

  NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306 ;

  YetAnotherOne = 0x300 : SELF=full ;
  YetAnotherOne = 0x300 : ALL=limited ;

  ShareIO = 0x80 , defmember=full : 0x123451, 0x123452;
  # 0x123453, 0x123454 will be limited
  ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full;
  # 0x123456, 0x123457 will be limited
  ShareIO = 0x80 : defmember=limited : 0x123456, 0x123457, 0x123458=full;
  ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a;
  ShareIO = 0x80 , defmember=full : 0x12345b, 0x12345c=limited, 0x12345d;

Note:
The following rule is equivalent to how OpenSM used to run prior to the
partition manager:

  Default=0x7fff,ipoib:ALL=full;
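Putting the format rules together, a small partitions.conf might look like the sketch below; the partition names, P_Key values, and GUIDs are invented for illustration:

```shell
# Write an illustrative partitions.conf; all names, P_Keys, and GUIDs are
# made-up examples, not values from a real fabric.
cat > partitions.conf.sample <<'EOF'
# Default partition, IPoIB capable, everyone a full member
Default=0x7fff, ipoib : ALL=full ;

# A storage partition: two named full members, everyone else limited
Storage=0x0080, defmember=limited : 0x0002c90300001234=full, 0x0002c90300005678=full ;
EOF

# Count the lines granting full membership.
grep -c '=full' partitions.conf.sample
```

OpenSM would be pointed at such a file with the -P (--Pconfig) option.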
qos_max_vls - The maximum number of VLs that will be on the subnet
qos_high_limit - The limit of High Priority component of VL
Arbitration table (IBA 7.6.9)
qos_vlarb_low - Low priority VL Arbitration table (IBA 7.6.9)
qos_sl2vl - SL2VL Mapping table (IBA 7.6.6) template. It is
a list of VLs corresponding to SLs 0-15 (Note
that VL15 used here means drop this SL)
Typical default values (hard-coded in OpenSM initialization) are:

qos_max_vls 15
qos_high_limit 0
qos_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
qos_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
qos_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7

The syntax is compatible with the rest of the OpenSM configuration options,
and values may be stored in the OpenSM config file (cached options file).
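For instance, the cached options file could carry entries like the following to override a few of the defaults above; these particular values are illustrative only, not recommendations:

```
# Hypothetical overrides in the OpenSM cached options file
qos_max_vls 8
qos_high_limit 4
qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
```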
In addition to the above, we may define separate QoS configuration
parameter sets for various target types. As targets, we currently support
CAs, routers, switch external ports, and switch's enhanced port 0. The
names of such specialized parameters are prefixed by "qos_<type>_"
string. Here is a full list of the currently supported sets:

qos_ca_  - QoS configuration parameters set for CAs.
qos_rtr_ - parameters set for routers.
qos_sw0_ - parameters set for switches' port 0.
qos_swe_ - parameters set for switches' external ports.
Examples:
qos_sw0_max_vls=2
qos_ca_sl2vl=0,1,2,3,5,5,5,12,12,0,
are wild-carded means that a path record query specifying any
off-subnet DGID should return a path to the first available router.
This configuration yields the same behavior formerly achieved by
compiling opensm with -DROUTER_EXP which has been obsoleted.

OpenSM now offers five routing engines:

1. Min Hop Algorithm - based on the minimum hops to each node where the
path length is optimized.

2. UPDN Unicast routing algorithm - also based on the minimum hops to each
node, but it is constrained by ranking rules. This algorithm should be
chosen if the subnet is not a pure Fat Tree and a deadlock may occur due to
a loop in the subnet.

3. Fat Tree Unicast routing algorithm - this algorithm optimizes routing
for a congestion-free "shift" communication pattern. It should be chosen if
a subnet is a symmetrical or almost symmetrical fat-tree of various types,
not just K-ary-N-Trees: non-constant K, not fully staffed, any Constant
Bisectional Bandwidth (CBB) ratio. Similar to UPDN, Fat Tree routing is
constrained by ranking rules.

4. LASH unicast routing algorithm - uses InfiniBand virtual layers (SL) to
provide deadlock-free shortest-path routing while also distributing the
paths between layers. LASH is an alternative deadlock-free,
topology-agnostic routing algorithm to the non-minimal UPDN algorithm,
avoiding the use of a potentially congested root node.

5. DOR Unicast routing algorithm - based on the Min Hop algorithm, but
avoids port equalization except for redundant links between the same two
switches. This provides deadlock-free routes for hypercubes when the fabric
is cabled as a hypercube and for meshes when cabled as a mesh (see details
below).

OpenSM also supports a file method which can load routes from a table. See
'Modular Routing Engine' for more information on this.

The basic routing algorithm is comprised of two stages:
1. MinHop matrix calculation
How many hops are required to get from each port to each LID?
The algorithm to fill these tables is different if you run standard
min hop from every destination LID through neighbor switches
For Up/Down routing, a BFS from every target is used. The BFS tracks link
direction (up or down) and avoids steps that would perform an up step after
a down step was used.
2. Once MinHop matrices exist, each switch is visited and for each target LID a
decision is made as to what port should be used to get to that LID.
This step is common to standard and Up/Down routing. Each port has a
same target port,
the previous LID of the same LMC group)
c. if none - prefer those which go through another NodeGuid
d. fall back to the number of paths method (if all go to same node).
Effect of Topology Changes

OpenSM will preserve existing routing in any case where there is no change
in the fabric switches unless the -r (--reassign_lids) option is specified.
-r
may disrupt subnet traffic.
Without -r, OpenSM attempts to preserve existing
LID assignments, resolving multiple uses of the same LID.
If a link is added or removed, OpenSM does not recalculate the routes that do not have to change. A route has to change if the port is no longer UP or no longer the MinHop. When routing changes are performed, the same algorithm for balancing the routes is invoked.
In the case of using file-based routing, any topology changes are currently ignored. The 'file' routing engine just loads the LFTs from the file specified, with no reaction to the real topology. Obviously, this will not be able to recheck LIDs (by GUID) for disconnected nodes, and LFTs for non-existent switches will be skipped. Multicast is not affected by the 'file' routing engine (it uses the min hop tables).

Min Hop Algorithm

The Min Hop algorithm is invoked by default if no routing algorithm is
specified. It can also be invoked by specifying '-R minhop'.

The Min Hop algorithm is divided into two stages: computation of min-hop
tables on every switch, and LFT output port assignment. Link subscription
is also equalized, with the ability to override based on port GUID. The
latter is supplied by:
-i <equalize-ignore-guids-file>
equalization algorithm. Note that only endports (CA,
switch port 0, and router ports) and not switch external
ports are supported.

LMC awareness routes on a (remote) system or switch basis.

Purpose of UPDN Algorithm

The UPDN algorithm is designed to prevent deadlocks from occurring in loops
of the subnet. A loop-deadlock is a situation in which it is no longer
possible to send data between any two hosts connected through the loop. As
such, the UPDN routing algorithm should be used if the subnet is not a pure
Fat Tree, and one of its loops may experience a deadlock (due, for example,
to high pressure).

The UPDN algorithm is based on the following main stages:
1. Auto-detect root nodes - based on the CA hop length from any switch in
the subnet, a statistical histogram is built for each switch (hop num vs.
number of occurrences). If the histogram reflects a specific column (higher
than others) for a certain node, then it is marked as a root node. Since
the algorithm is statistical, it may not find any root nodes. The list of
the root nodes found by this auto-detect stage is used by the ranking
process stage.
Note 1: The user can override the node list manually.
Note 2: If this stage cannot find any root nodes, and the user did
not specify a guid list file, OpenSM defaults back to the
Min Hop routing algorithm.
2. Ranking process - All root switch nodes (found in stage 1) are assigned
a rank of 0. Using the BFS algorithm, the rest of the switch nodes in the
subnet are ranked incrementally. This ranking aids in the process of
enforcing rules that ensure loop-free paths.

3. Min Hop Table setting - after ranking is done, a BFS algorithm is run
from each (CA or switch) node in the subnet. During the BFS process, the
FDB table of each switch node traversed by BFS is updated, in reference to
the starting node, based on the ranking rules and guid values.

At the end of the process, the updated FDB tables ensure loop-free paths
through the subnet.
Note: Up/Down routing does not allow LID routing communication between
switches that are located inside spine "switch systems". The reason is that
there is no way to allow a LID route between them that does not break the
Up/Down rule. One ramification of this is that you cannot run the SM on
switches other than the leaf switches of the fabric.

UPDN Algorithm Usage

Activation through OpenSM

Use the '-R updn' option (instead of the old '-u') to activate the UPDN
algorithm. Use '-a <root_guid_file>' to add a UPDN guid file that contains
the root nodes for ranking. If the '-a' option is not used, OpenSM uses its
auto-detect root nodes algorithm.

Notes on the guid list file:
1. A valid guid file specifies one guid in each line. Lines with an invalid
format will be discarded.
2. The user should specify the root switch guids. However, it is also
possible to specify CA guids; OpenSM will use the guid of the switch (if
it exists) that connects the CA to the subnet as a root node.
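Following the rules above, a root guid file is simply one guid per line. The sketch below writes such a file with invented switch GUIDs (opensm itself is not invoked here):

```shell
# Create a hypothetical root-guid file: one GUID per line, as expected by
# the '-a <root_guid_file>' option. These GUIDs are examples only.
cat > root_guids.sample <<'EOF'
0x0008f104003f1234
0x0008f104003f5678
EOF

# Each non-empty line should be a single hex guid.
wc -l < root_guids.sample

# OpenSM would then be started as (not run here):
#   opensm -R updn -a root_guids.sample
```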
Fat-tree Routing Algorithm

The fat-tree algorithm optimizes routing for "shift" communication
patterns. It should be chosen if a subnet is a symmetrical or almost
symmetrical fat-tree of various types. It supports more than just
K-ary-N-Trees, by handling non-constant K, cases where not all leaves (CAs)
are present, and any CBB ratio. As in UPDN, fat-tree also prevents
credit-loop deadlocks.
If the root guid file is not provided ('-a' or '--root_guid_file' options),
the topology has to be pure fat-tree that complies with the following rules:
- Tree rank should be between two and eight (inclusively)
- Switches of the same rank should have the same number
of ports in each DOWN-going port group.
- All the CAs have to be at the same tree level (rank).
If the root guid file is provided, the topology doesn't have to be pure
fat-tree, and it should only comply with the following rules:
- Tree rank should be between two and eight (inclusively)
- All the Compute Nodes** have to be at the same tree level (rank).
Note that non-compute node CAs are allowed here to be at different
tree ranks.
* Ports that are connected to the same remote switch are referenced as a
'port group'.

** The list of compute nodes (CNs) can be specified with the '-u' or
'--cn_guid_file' OpenSM options.

Topologies that do not comply cause a fallback to Min Hop routing. Note
that this can also occur on link failures which cause the topology to no
longer be a "pure" fat-tree.
Note that although the fat-tree algorithm supports trees with a non-integer
CBB ratio, the routing will not be as balanced as in the case of an integer
CBB ratio. In addition, although the algorithm allows leaf switches to have
any number of CAs, the closer the tree is to being fully populated, the
more effective the "shift" communication pattern will be. In general, even
if the root list is provided, the closer the topology is to a pure and
symmetrical fat-tree, the more optimal the routing will be.
The algorithm also dumps a compute node ordering file
(opensm-ftree-ca-order.dump) in the same directory where the OpenSM log
resides. This ordering file provides the CN order that may be used to
create an efficient communication pattern that will match the routing
tables.
Routing between non-CN nodes

The use of the cn_guid_file option allows non-CN nodes to be located on
different levels in the fat tree. In such a case, it is not guaranteed that
the Fat Tree algorithm will route between two non-CN nodes. To solve this
problem, a list of non-CN nodes can be specified by the '-G' or
'--io_guid_file' option. These nodes will be allowed to use switches the
wrong way round a specific number of times (specified by '-H' or
'--max_reverse_hops'). With the proper max_reverse_hops and io_guid_file
values, you can ensure full connectivity in the Fat Tree.

Please note that using max_reverse_hops creates routes that use the switch
in a counter-stream way. This option should never be used to connect nodes
with high-bandwidth traffic between them! It should only be used to allow
connectivity for HA purposes or similar. Also, having routes the other way
around can in theory cause credit loops.

Use these options with extreme care!
Activation through OpenSM

Use the '-R ftree' option to activate the fat-tree algorithm. Use
'-a <root_guid_file>' to provide root nodes for ranking. If the '-a' option
is not used, the routing algorithm will detect roots automatically. Use
'-u <root_cn_file>' to provide the list of compute nodes. If the '-u'
option is not used, all the CAs are considered as compute nodes.

Note: LMC > 0 is not supported by fat-tree routing. If this is specified,
the default routing algorithm is invoked instead.

LASH Routing Algorithm

LASH is an acronym for LAyered SHortest Path Routing. It is a deterministic
shortest path routing algorithm that enables topology-agnostic
deadlock-free routing within communication networks.
When computing the routing function, LASH analyzes the network topology for
the shortest-path routes between all pairs of sources/destinations and
groups these paths into virtual layers in such a way as to avoid deadlock.

Note that LASH analyzes routes and ensures deadlock freedom between switch
pairs. The links between HCAs and switches do not need virtual layers, as
deadlock will not arise between a switch and an HCA.

In more detail, the algorithm works as follows:

1) LASH determines the shortest path between all pairs of
source/destination switches. Note, LASH ensures the same SL is used for all
SRC/DST - DST/SRC pairs, and there is no guarantee that the return path for
a given DST/SRC will be the reverse of the route SRC/DST.

2) LASH then begins an SL assignment process where a route is assigned to a
layer (SL) if the addition of that route does not cause deadlock within
that layer. This is achieved by maintaining and analysing a channel
dependency graph for each layer. Once the potential addition of a path
could lead to deadlock, LASH opens a new layer and continues the process.

3) Once this stage has been completed, it is highly likely that the first
layers processed will contain more paths than the latter ones. To better
balance the use of layers, LASH moves paths from one layer to another so
that the number of paths in each layer averages out.

Note, the implementation of LASH in opensm attempts to use as few layers as
possible. This number can be less than the number of actual layers
available.

In general, LASH is a very flexible algorithm. It can, for example, reduce
to Dimension Order Routing in certain topologies; it is topology-agnostic
and fares well in the face of faults.

It has been shown that for both regular and irregular topologies, LASH
outperforms Up/Down. The reason for this is that LASH distributes the
traffic more evenly through a network, avoiding the bottleneck issues
related to a root node, and always routes shortest-path.
The algorithm was developed by Simula Research Laboratory.

Use the '-R lash -Q' option to activate the LASH algorithm.
Note: QoS support has to be turned on in order that SL/VL mappings are
used.

Note: LMC > 0 is not supported by LASH routing. If this is specified, the
default routing algorithm is invoked instead.

For open regular cartesian meshes the DOR algorithm is the ideal routing
algorithm. For toroidal meshes, on the other hand, there are routing loops
that can cause deadlocks. LASH can be used to add an additional phase that
analyses the mesh to try to determine the dimension and size of a mesh. If
it determines that the mesh looks like an open or closed cartesian mesh, it
reorders the ports in dimension order before the rest of the LASH algorithm
runs.
DOR Routing Algorithm

The Dimension Order Routing algorithm is based on the Min Hop algorithm and
so uses shortest paths. Instead of spreading traffic out across different
paths with the same shortest distance, it chooses among the available
shortest paths based on an ordering of dimensions. Each port must be
consistently cabled to represent a hypercube dimension or a mesh dimension.
Alternatively, the -O option can be used to assign a custom mapping between
the ports on a given switch and the associated dimension. Paths are grown
from a destination back to a source using the lowest dimension (port) of
available paths at each step. This provides the ordering necessary to avoid
deadlock. When there are multiple links between any two switches, they
still represent only one dimension, and traffic is balanced across them
unless port equalization is turned off. In the case of hypercubes, the same
port must be used throughout the fabric to represent the hypercube
dimension and match on both ends of the cable, or the -O option used to
accomplish the alignment. In the case of meshes, the dimension should
consistently use the same pair of ports, one port on one end of the cable
and the other port on the other end, continuing along the mesh dimension,
or the -O option used as an override.
Use the '-R dor' option to activate the DOR algorithm.

Routing References
To learn more about deadlock-free routing, see the article "Deadlock Free
Message Routing in Multiprocessor Interconnection Networks" by William J.
Dally and Charles L. Seitz (1985).

To learn more about the up/down algorithm, see the article "Effective
Strategy to Compute Forwarding Tables for InfiniBand Networks" by Jose
Carlos Sancho, Antonio Robles, and Jose Duato at the Universidad
Politecnica de Valencia.

To learn more about LASH and the flexibility behind it, the requirement for
layers, and performance comparisons to other algorithms, see the following
articles:

"Layered Routing in Irregular Networks", Lysne et al., IEEE Transactions on
Parallel and Distributed Systems, Vol. 16, No. 12, December 2005.

"Routing for the ASI Fabric Manager", Solheim et al., IEEE Communications
Magazine, Vol. 44, No. 7, July 2006.

"Layered Shortest Path (LASH) Routing in Irregular System Area Networks",
Skeie et al., IEEE Computer Society Communication Architecture for Clusters
2002.

Modular Routing Engine

Modular routing engine structure allows for the ease of "plugging in" new
routing modules.

Currently, only unicast callbacks are supported. Multicast can be added
later.

One existing routing module is up-down ("updn"), which may be activated
with the '-R updn' option (instead of the old '-u').

General usage is: $ opensm -R 'module-name'

There is also a trivial routing module which is able to load LFT tables
from a file.
Main features:

 - this will load switch LFTs and/or LID matrices (min hops tables)
- this will load switch LFTs according to the path entries introduced
in the file
- no additional checks will be performed (such as "is port connected",
LFTs correctly if endport GUIDs are represented in the file
(in order to disable this, GUIDs may be removed from the file
or zeroed)

The file format is compatible with the output of the 'ibroute' utility and,
for the whole fabric, can be generated with the dump_lfts.sh script.
To activate file based routing module, use:

  opensm -R file -U /path/to/lfts_file

If the lfts_file is not found or is in error, the default routing algorithm is utilized.
The ability to dump switch lid matrices (aka min hops tables) to file and later to load these is also supported.
The usage is similar to unicast forwarding tables loading from a lfts
file (introduced by 'file' routing engine), but new lid matrix file
name should be specified by the -M or --lid_matrix_file option. For
example:

opensm -R file -M ./opensm-lid-matrix.dump

The dump file is named 'opensm-lid-matrix.dump' and will be generated in
the standard opensm dump directory (%TEMP% by default) when the
OSM_LOG_ROUTING logging flag is set.
When routing engine 'file' is activated but the lfts file is not specified or cannot be opened, the default lid matrix algorithm will be used.
There is also a switch forwarding tables dumper which generates a file
compatible with dump_lfts.sh output. This file can be used as input for
forwarding tables loading by the 'file' routing engine. Both or one of the
options -U and -M can be specified together with '-R file'.

osmtest is a test program to validate InfiniBand subnet manager and - administration (SM/SA). Default is to run all flows with the exception of - the QoS flow. osmtest provides a test suite for opensm. osmtest has the - following capabilities and testing flows: It creates an inventory file of - all available Nodes, Ports, and PathRecords, including all their fields. It - verifies the existing inventory, with all the object fields, and matches it - to a pre-saved one. A Multicast Compliancy test. An Event Forwarding test. A - Service Record registration test. An RMPP stress test. A Small SA Queries - stress test. It is recommended that after installing opensm, the user should - run "osmtest -f c" to generate the inventory file, and immediately - afterwards run "osmtest -f a" to test OpenSM. Another recommendation for - osmtest usage is to create the inventory when the IB fabric is stable, and - occasionally run "osmtest -v" to verify that nothing has changed. -
-Note - osmtest will not run on the node where OpenSM is running.
- See 'osmtest -h' for all options.
Functionality:
--+osmtest -f c +
SYNOPSIS
+ +osmtest + +[-f(low) <c|a|v|s|e|f|m|q|t>] [-w(ait) <trap_wait_time>] [-d(ebug) <number>] +[-m(ax_lid) <LID in hex>] [-g(uid)[=]<GUID in hex>] [-p(ort)] +[-i(nventory) <filename>] [-s(tress)] [-M(ulticast_Mode)] +[-t(imeout) <milliseconds>] [-l | --log_file] [-v] [-vf <flags>] +[-h(elp)] +DESCRIPTION
osmtest is a test program used to validate the correct operation of the InfiniBand subnet manager and administration (SM/SA).

By default, all flows are run except the QoS flow.

osmtest provides a test suite for opensm. osmtest has the following capabilities and testing flows:

It creates an inventory file of all available Nodes, Ports, and PathRecords, including all their fields.
It verifies the existing inventory, with all the object fields, and matches it to a pre-saved one.
A Multicast Compliancy test.
An Event Forwarding test.
A Service Record registration test.
An RMPP stress test.
A Small SA Queries stress test.

It is recommended that after installing opensm, the user run "osmtest -f c" to generate the inventory file, and immediately afterwards run "osmtest -f a" to test OpenSM.

Another recommendation for osmtest usage is to create the inventory when the IB fabric is stable, and occasionally run "osmtest -v" to verify that nothing has changed.
OPTIONS
-f, --flow
This option directs osmtest to run a specific flow:
FLOW DESCRIPTION
c = create an inventory file with all nodes, ports and paths
a = run all validation tests (expecting an input inventory)
v = only validate the given inventory file
s = run service registration, deregistration, and lease test
e = run event forwarding test
f = flood the SA with queries according to the stress mode
m = multicast flow
q = QoS info: dump VLArb and SLtoVL tables
t = run trap 64/65 flow (this flow requires running an external tool)
(default is all flows except QoS)

-w, --wait
This option specifies the wait time for trap 64/65 in seconds.
It is used only when running "-f t" - the trap 64/65 flow
(default: 10 seconds).
-d, --debug
This option specifies a debug option. These options are not normally needed.
The number following -d selects the debug option to enable, as follows:

OPT Description
--- -----------------
-d0 - Ignore other SM nodes
-d1 - Force single-threaded dispatching
-d2 - Force log flushing after each log message
-d3 - Disable multicast support

-m, --max_lid
This option specifies the maximal LID number to be searched for during inventory file build (default: 100).
-g, --guid
This option specifies the local port GUID value with which OpenSM should bind. OpenSM may be bound to 1 port at a time.
If the GUID given is 0, OpenSM displays a list of possible port GUIDs and waits for user input.
Without -g, OpenSM tries to use the default port.

-p, --port
This option displays a menu of possible local port GUID values with which osmtest could bind.

-i, --inventory
This option specifies the name of the inventory file.
Normally, osmtest expects to find an inventory file, which osmtest uses to validate real-time information received from the SA during testing.
If -i is not specified, osmtest defaults to the file 'osmtest.dat'.
See the "-f c" flow for related information.
-s, --stress
This option runs the specified stress test instead of the normal test suite.
Stress test options are as follows:

OPT Description
--- -----------------
-s1 - Single-MAD (RMPP) response SA queries
-s2 - Multi-MAD (RMPP) response SA queries
-s3 - Multi-MAD (RMPP) Path Record SA queries
-s4 - Single-MAD (non-RMPP) get Path Record SA queries

Without -s, stress testing is not performed.
-M, --Multicast_Mode
This option specifies the length of the Multicast test:

OPT Description
--- -----------------
-M1 - Short Multicast Flow (default) - single mode
-M2 - Short Multicast Flow - multiple mode
-M3 - Long Multicast Flow - single mode
-M4 - Long Multicast Flow - multiple mode

Single mode - osmtest is tested alone, with no other apps that interact with OpenSM MC.
Multiple mode - Could be run with other apps using MC with OpenSM.
Without -M, default flow testing is performed.
-t, --timeout
This option specifies the time in milliseconds used for transaction timeouts.
Specifying -t 0 disables timeouts.
Without -t, OpenSM defaults to a timeout value of 200 milliseconds.

-l, --log_file
This option defines the log to be the given file. By default the log goes to stdout.

-v, --verbose
This option increases the log verbosity level. The -v option may be specified multiple times to further increase the verbosity level.
See the -vf option for more information about log verbosity.

-V
This option sets the maximum verbosity level and forces log flushing.
-V is equivalent to '-vf 0xFF -d 2'.
See the -vf option for more information about log verbosity.

-vf
This option sets the log verbosity level.
A flags field must follow the -vf option.
A bit set/clear in the flags enables/disables a specific log level as follows:
BIT  LOG LEVEL ENABLED
---- -----------------
0x01 - ERROR (error messages)
0x02 - INFO (basic messages, low volume)
0x04 - VERBOSE (interesting stuff, moderate volume)
0x08 - DEBUG (diagnostic, high volume)
0x10 - FUNCS (function entry/exit, very high volume)
0x20 - FRAMES (dumps all SMP and GMP frames)
0x40 - ROUTING (dump FDB routing information)
0x80 - currently unused

Without -vf, osmtest defaults to ERROR + INFO (0x3).
Specifying -vf 0 disables all messages.
Specifying -vf 0xFF enables all messages (see -V).
High verbosity levels may require increasing the transaction timeout with the -t option.
-h, --help
Display this usage info, then exit.
AUTHORS
Hal Rosenstock <hal.rosenstock@gmail.com>
Eitan Zahavi <eitan@mellanox.co.il>
EXAMPLES
Note - osmtest will not run on the node where OpenSM is running.
See 'osmtest -h' for all options.

Functionality:
osmtest -f c   # creates the osmtest.dat inventory file in the current directory; required by other osmtest runs.
osmtest -f v   # validate the default inventory file 'osmtest.dat'.
osmtest -f a   # run all validation tests (expecting an input inventory file 'osmtest.dat' in the current folder).
Stress tests
osmtest -f f -s1   # flood the SA with Single-MAD (RMPP) response SA queries.
DAT ENVIRONMENT:

DAT/DAPL 2.0 (free-build) libraries are identified in %SystemRoot%\System32 as dat2.dll and dapl2.dll. Debug versions of the v2.0 runtime libraries are located in '%SystemDrive%\%ProgramFiles%\OFED'.
IA32 (aka, 32-bit)
In order for DAT/DAPL programs to execute correctly, the runtime library files 'dat2.dll' and 'dapl2.dll' must be present in one of the following folders: current directory, %SystemRoot%, %SystemRoot%\System32, or in the library search path.
The default OFED installation places the runtime library files dat2.dll and dapl2.dll in the '%SystemRoot%\System32' folder; symbol files (.pdb) are located in '%ProgramFiles%\OFED\'.
The default DAPL configuration file is defined as '%SystemDrive%\DAT\dat.conf'.
Within the dat.conf file, the DAPL library specification can be located as the 5th whitespace-separated line argument. By default the DAPL library file is installed as '%SystemRoot%\System32\dapl2.dll'.
Should you choose to relocate the DAPL library file to a path where whitespace appears in the full library path specification, then the full library file specification must be enclosed in double quotes.
File: %windir%\System32\dapl2.dll
dat.conf Provider name: ibnic0v2
Socket-CM Provider
File: %windir%\System32\dapl2-ofa-scm.dll
dat.conf
RDMA-CM Provider
File: %windir%\System32\dapl2-ofa-cma.dll
dat.conf Provider

IBV_GET_DEVICE_LIST
NAME
ibv_get_device_list, ibv_free_device_list - get and release the list of available RDMA devices
SYNOPSIS
DESCRIPTION
ibv_get_device_list() returns a NULL-terminated array of RDMA devices currently available. The argument num_devices is optional; if not NULL, it is set to the number of devices returned in the array.
ibv_free_device_list() frees the array of devices returned by ibv_get_device_list().
RETURN VALUE
ibv_get_device_list() returns the array of available RDMA devices, or sets errno and returns NULL if the request fails. If no devices are found then num_devices is set to 0, and non-NULL is returned.
ibv_free_device_list() returns no value.
ERRORS
- EPERM
- Permission denied.
- ENOSYS
- No kernel support for RDMA.
- ENOMEM
- Insufficient memory to complete the operation.
Client code should open all the devices it intends to use with ibv_open_device() before calling ibv_free_device_list(). Once it frees the array with ibv_free_device_list(), it will be able to use only the open devices; pointers to unopened devices will no longer be valid.
SEE ALSO
ibv_get_device_name, ibv_get_device_guid, ibv_open_device

IBV_GET_DEVICE_GUID
NAME
ibv_get_device_guid - get an RDMA device's GUID
SYNOPSIS
#include <infiniband/verbs.h>
DESCRIPTION
ibv_get_device_guid() returns the Global Unique IDentifier (GUID) of the RDMA device device.
RETURN VALUE
ibv_get_device_guid() returns the GUID of the device in network byte order.
SEE ALSO
ibv_get_device_list, ibv_get_device_name
IBV_CLOSE_DEVICE
NAME
ibv_open_device, ibv_close_device - open and close an RDMA device context
SYNOPSIS
#include <infiniband/verbs.h>
DESCRIPTION
ibv_open_device() opens the device device and creates a context for further use.
ibv_close_device() closes the device context context.
RETURN VALUE
ibv_open_device() returns a pointer to the allocated device context, or NULL if the request fails.
ibv_close_device() returns 0 on success, -1 on failure.
NOTES
ibv_close_device() does not release all the resources allocated using context context. To avoid resource leaks, the user should release all associated resources before closing a context.
SEE ALSO
ibv_get_device_list, ibv_query_device
NAME
ibv_get_async_event, ibv_ack_async_event - get or acknowledge asynchronous events
SYNOPSIS
#include <infiniband/verbs.h>
DESCRIPTION
ibv_get_async_event() waits for the next async event of the RDMA device context context and returns it through the pointer event, which is an ibv_async_event struct, as defined in <infiniband/verbs.h>.

struct ibv_async_event {
        union {
                struct ibv_cq  *cq;       /* CQ that got the event */
                struct ibv_qp  *qp;       /* QP that got the event */
                struct ibv_srq *srq;      /* SRQ that got the event */
                int             port_num; /* port number that got the event */
        } element;
        enum ibv_event_type event_type;   /* type of the event */
};

One member of the element union will be valid, depending on the event_type member of the structure.
ibv_ack_async_event() acknowledges the async event event.
RETURN VALUE
ibv_get_async_event() returns 0 on success, and -1 on error.
ibv_ack_async_event() returns no value.
NOTES
All async events that ibv_get_async_event() returns must be acknowledged using ibv_ack_async_event(). To avoid races, destroying an object (CQ, SRQ or QP) will wait for all affiliated events for the object to be acknowledged; this avoids an application retrieving an affiliated event after the corresponding object has already been destroyed.
ibv_get_async_event() is a blocking function. If multiple threads call this function simultaneously, then when an async event occurs, only one thread will receive it, and it is not possible to predict which thread will receive it.
EXAMPLES
The following code example demonstrates one possible way to work with async events in non-blocking mode. It performs the following steps:

1. Set the async events queue work mode to be non-blocked
2. Poll the queue until it has an async event
3. Get the async event and ack it

SEE ALSO
ibv_open_device
IBV_QUERY_DEVICE
NAME
ibv_query_device - query an RDMA device's attributes
SYNOPSIS
#include <infiniband/verbs.h>
DESCRIPTION
ibv_query_device() returns the attributes of the device with context context. The argument device_attr is a pointer to an ibv_device_attr struct, as defined in <infiniband/verbs.h>.

struct ibv_device_attr {
        char    fw_ver[64];    /* FW version */
        /* ... (capability and resource-limit fields elided) ... */
        uint8_t phys_port_cnt; /* Number of physical ports */
};
RETURN VALUE
ibv_query_device() returns 0 on success, or the value of errno on failure (which indicates the failure reason).
NOTES
The maximum values returned by this function are the upper limits of supported resources by the device. However, it may not be possible to use these maximum values in practice; the actual limits depend on, among other things, user permissions, and the amount of resources already in use by other users/processes.
SEE ALSO
ibv_open_device, ibv_query_port, ibv_query_pkey, ibv_query_gid
IBV_QUERY_GID
IBV_QUERY_PKEY
NAME
ibv_query_pkey - query an InfiniBand port's P_Key table
SYNOPSIS
#include <infiniband/verbs.h>
DESCRIPTION
ibv_query_pkey() returns the P_Key value (in network byte order) in entry index of port port_num for device context context through the pointer pkey.
RETURN VALUE
ibv_query_pkey() returns 0 on success, and -1 on error.
SEE ALSO
ibv_open_device, ibv_query_device
IBV_QUERY_PORT
NAME
ibv_query_port - query an RDMA port's attributes
SYNOPSIS
#include <infiniband/verbs.h>
DESCRIPTION
ibv_query_port() returns the attributes of port port_num for device context context through the pointer port_attr. The argument port_attr is an ibv_port_attr struct, as defined in <infiniband/verbs.h>.

struct ibv_port_attr {
        enum ibv_port_state state;      /* Logical port state */
        /* ... (other port attribute fields elided) ... */
        uint8_t             phys_state; /* Physical port state */
};
RETURN VALUE
ibv_query_port() returns 0 on success, or the value of errno on failure (which indicates the failure reason).
SEE ALSO
ibv_create_qp, ibv_destroy_qp, ibv_query_qp

IBV_ALLOC_PD
NAME
ibv_alloc_pd, ibv_dealloc_pd - allocate or deallocate a protection domain (PD)
SYNOPSIS
#include <infiniband/verbs.h>

struct ibv_pd *ibv_alloc_pd(struct ibv_context *context);
int ibv_dealloc_pd(struct ibv_pd *pd);
DESCRIPTION
ibv_alloc_pd() allocates a PD for the RDMA device context context.
ibv_dealloc_pd() deallocates the PD pd.
RETURN VALUE
ibv_alloc_pd() returns a pointer to the allocated PD, or NULL if the request fails.
ibv_dealloc_pd() returns 0 on success, or the value of errno on failure (which indicates the failure reason).
NOTES
ibv_dealloc_pd() may fail if any other resource is still associated with the PD being freed.
SEE ALSO
ibv_reg_mr, ibv_create_srq, ibv_create_qp

IBV_REG_MR
NAME
ibv_reg_mr, ibv_dereg_mr - register or deregister a memory region (MR)
SYNOPSIS
#include <infiniband/verbs.h>
DESCRIPTION
ibv_reg_mr() registers a memory region (MR) associated with the protection domain pd. The MR's starting address is addr and its size is length. The argument access describes the desired memory protection attributes; it is either 0 or the bitwise OR of one or more of the following flags:
- IBV_ACCESS_LOCAL_WRITE Enable Local Write Access
RETURN VALUE
ibv_reg_mr() returns a pointer to the registered MR, or NULL if the request fails. The local key (L_Key) field lkey is used as the lkey field of struct ibv_sge when posting buffers with ibv_post_* verbs, and the remote key (R_Key) field rkey is used by remote processes to perform Atomic and RDMA operations. The remote process places this rkey as the rkey field of struct ibv_send_wr passed to the ibv_post_send function.
ibv_dereg_mr() returns 0 on success, or the value of errno on failure (which indicates the failure reason).
NOTES
ibv_dereg_mr() fails if any memory window is still bound to this MR.
SEE ALSO
ibv_alloc_pd, ibv_post_send, ibv_post_recv, ibv_post_srq_recv
IBV_CREATE_AH
DESCRIPTION
ibv_create_ah() creates an address handle (AH) associated with the protection domain pd. The argument attr is an ibv_ah_attr struct, as defined in <infiniband/verbs.h>.

struct ibv_ah_attr {
        struct ibv_global_route grh; /* Global Routing Header (GRH) attributes */
        /* ... (remaining address-vector fields elided) ... */
};

ibv_destroy_ah() destroys the AH ah.
RETURN VALUE
ibv_create_ah() returns a pointer to the created AH, or NULL if the request fails.
ibv_destroy_ah() returns 0 on success, or the value of errno on failure (which indicates the failure reason).
SEE ALSO
ibv_alloc_pd, ibv_init_ah_from_wc, ibv_create_ah_from_wc
IBV_CREATE_AH_FROM_WC
NAME
ibv_init_ah_from_wc, ibv_create_ah_from_wc - initialize or create an address handle (AH) from a work completion
SYNOPSIS
#include <infiniband/verbs.h>
DESCRIPTION
ibv_init_ah_from_wc() initializes the address handle (AH) attribute structure ah_attr for the RDMA device context context using the port number port_num, using attributes from the work completion wc and the Global Routing Header (GRH) structure grh.
ibv_create_ah_from_wc() creates an AH associated with the protection domain pd using the port number port_num, using attributes from the work completion wc and the Global Routing Header (GRH) structure grh.
RETURN VALUE
ibv_init_ah_from_wc() returns 0 on success, and -1 on error.
ibv_create_ah_from_wc() returns a pointer to the created AH, or NULL if the request fails.
NOTES
The filled structure ah_attr returned from ibv_init_ah_from_wc() can be used to create a new AH using ibv_create_ah().
SEE ALSO
ibv_open_device, ibv_alloc_pd, ibv_create_ah

IBV_CREATE_COMP_CHANNEL
NAME
ibv_create_comp_channel, ibv_destroy_comp_channel - create or destroy a completion event channel
SYNOPSIS
#include <infiniband/verbs.h>

struct ibv_comp_channel *ibv_create_comp_channel(struct ibv_context *context);
int ibv_destroy_comp_channel(struct ibv_comp_channel *channel);
DESCRIPTION
ibv_create_comp_channel() creates a completion event channel for the RDMA device context context.
ibv_destroy_comp_channel() destroys the completion event channel channel.
RETURN VALUE
ibv_create_comp_channel() returns a pointer to the created completion event channel, or NULL if the request fails.
ibv_destroy_comp_channel() returns 0 on success, or the value of errno on failure (which indicates the failure reason).
NOTES
A completion channel is essentially a file descriptor that is used to deliver completion notifications to a userspace process. When a completion event is generated for a completion queue (CQ), the event is delivered via the completion channel attached to that CQ. This may be useful to steer completion events to different threads by using multiple completion channels.
ibv_destroy_comp_channel() fails if any CQs are still associated with the completion event channel being destroyed.
SEE ALSO
IBV_CREATE_CQ
NAME
ibv_create_cq, ibv_destroy_cq - create or destroy a completion queue (CQ)
SYNOPSIS
#include <infiniband/verbs.h>
DESCRIPTION
ibv_create_cq() creates a completion queue (CQ) with at least cqe entries for the RDMA device context context. The pointer cq_context will be used to set the user context pointer of the CQ structure. The argument channel is optional; if not NULL, the completion channel channel will be used to return completion events. The CQ will use the completion vector comp_vector for signaling completion events; it must be at least zero and less than context->num_comp_vectors.
ibv_destroy_cq() destroys the CQ cq.
RETURN VALUE
ibv_create_cq() returns a pointer to the CQ, or NULL if the request fails.
ibv_destroy_cq() returns 0 on success, or the value of errno on failure (which indicates the failure reason).
NOTES
ibv_create_cq() may create a CQ with size greater than or equal to the requested size. Check the cqe attribute in the returned CQ for the actual size.
ibv_destroy_cq() fails if any queue pair is still associated with this CQ.
SEE ALSO
IBV_POLL_CQ
NAME
ibv_poll_cq - poll a completion queue (CQ)
SYNOPSIS
#include <infiniband/verbs.h>
DESCRIPTION
ibv_poll_cq() polls the CQ cq for work completions and returns the first num_entries (or all available completions if the CQ contains fewer than this number) in the array wc. The argument wc is a pointer to an array of ibv_wc structs, as defined in <infiniband/verbs.h>.

struct ibv_wc {
        uint64_t wr_id; /* ID of the completed Work Request (WR) */
        /* ... (status, opcode and other completion fields elided) ... */
};
RETURN VALUE
ibv_poll_cq() returns a non-negative value equal to the number of completions found, or a negative value on failure.
NOTES
ibv_resize_cq() may resize the CQ to a size greater than or equal to the requested size; the cqe member of cq will be updated to the actual size.
SEE ALSO
ibv_create_cq, ibv_destroy_cq
IBV_GET_CQ_EVENT
NAME
ibv_get_cq_event, ibv_ack_cq_events - get and acknowledge completion queue (CQ) events
SYNOPSIS
#include <infiniband/verbs.h>
DESCRIPTION
ibv_get_cq_event() waits for the next completion event in the completion event channel channel. It fills the argument cq with the CQ that got the event and cq_context with the CQ's context.
ibv_ack_cq_events() acknowledges nevents events on the CQ cq.
RETURN VALUE
ibv_get_cq_event() returns 0 on success, and -1 on error.
ibv_ack_cq_events() returns no value.
NOTES
All completion events that ibv_get_cq_event() returns must be acknowledged using ibv_ack_cq_events(). To avoid races, destroying a CQ will wait for all completion events to be acknowledged; this guarantees a one-to-one correspondence between acks and successful gets.
Calling ibv_ack_cq_events() may be relatively expensive in the datapath, since it must take a mutex. Therefore it may be better to amortize this cost by keeping a count of the number of events needing acknowledgement and acking several completion events in one call to ibv_ack_cq_events().
EXAMPLES
The following code example demonstrates one possible way to work with completion events. It performs the following steps:

Stage I: Preparation
1. Creates a CQ
2. Requests for notification upon a new (first) completion event

IBV_REQ_NOTIFY_CQ
SYNOPSIS
#include <infiniband/verbs.h>

int ibv_req_notify_cq(struct ibv_cq *cq, int solicited_only);
DESCRIPTION
ibv_req_notify_cq() requests a completion notification on the completion queue (CQ) cq.
Upon the addition of a new CQ entry (CQE) to cq, a completion event will be added to the completion channel associated with the CQ. If the argument solicited_only is zero, a completion event is generated for any new CQE. If solicited_only is nonzero, an event is generated only for a new CQE with the solicited indication set, or for a CQE with an error status; a successful send completion is unsolicited.
RETURN VALUE
ibv_req_notify_cq() returns 0 on success, or the value of errno on failure (which indicates the failure reason).
NOTES
The request for notification is "one shot." Only one completion event will be generated for each call to ibv_req_notify_cq().
SEE ALSO
ibv_create_comp_channel, ibv_create_cq, ibv_get_cq_event
NAME
ibv_create_srq, ibv_destroy_srq - create or destroy a shared receive queue (SRQ)
SYNOPSIS
#include <infiniband/verbs.h>

int ibv_destroy_srq(struct ibv_srq *srq);
DESCRIPTION
ibv_create_srq() creates a shared receive queue (SRQ) associated with the protection domain pd.
ibv_create_xrc_srq() creates an XRC shared receive queue (SRQ) associated with the protection domain pd, the XRC domain xrc_domain and the CQ which will hold the XRC completion xrc_cq.
max_wr and max_sge will be greater than or equal to the values requested.
ibv_destroy_srq() destroys the SRQ srq.
RETURN VALUE
ibv_create_srq() returns a pointer to the created SRQ, or NULL if the request fails.
ibv_destroy_srq() returns 0 on success, or the value of errno on failure (which indicates the failure reason).
NOTES
IBV_MODIFY_SRQ
NAME
ibv_modify_srq - modify attributes of a shared receive queue (SRQ)
SYNOPSIS
DESCRIPTION
ibv_modify_srq() modifies the attributes of the SRQ srq with the attributes in srq_attr according to the mask srq_attr_mask. The argument srq_attr is an ibv_srq_attr struct, as defined in <infiniband/verbs.h>.

struct ibv_srq_attr {
        uint32_t max_wr;    /* maximum number of outstanding work requests (WRs) in the SRQ */
        uint32_t max_sge;   /* maximum number of scatter elements per WR */
        uint32_t srq_limit; /* the limit value of the SRQ */
};
RETURN VALUE
ibv_modify_srq() returns 0 on success, or the value of errno on failure (which indicates the failure reason).
NOTES
If any of the modify attributes is invalid, none of the attributes will be modified.
Not all devices support resizing SRQs. To check if a device supports it, check if the IBV_DEVICE_SRQ_RESIZE bit is set in the device capabilities flags.
IBV_QUERY_SRQ
NAME
ibv_query_srq - get the attributes of a shared receive queue (SRQ)
SYNOPSIS
DESCRIPTION
ibv_query_srq() gets the attributes of the SRQ srq and returns them through the pointer srq_attr. The argument srq_attr is an ibv_srq_attr struct, as defined in <infiniband/verbs.h>.

struct ibv_srq_attr {
        uint32_t max_wr;    /* maximum number of outstanding work requests (WRs) in the SRQ */
        uint32_t max_sge;   /* maximum number of scatter elements per WR */
        uint32_t srq_limit; /* the limit value of the SRQ */
};
RETURN VALUE
ibv_query_srq() returns 0 on success, or the value of errno on failure (which indicates the failure reason).
NOTES
If the value returned for srq_limit is 0, then the SRQ limit reached ("low watermark") event is not (or no longer) armed, and no asynchronous events will be generated until the event is rearmed.
SEE ALSO
ibv_create_srq, ibv_destroy_srq

IBV_CREATE_XRC_RCV_QP
This QP number should be passed to the remote node (sender). The remote node will use xrc_rcv_qpn in ibv_post_send() when sending to an XRC SRQ on this host in the same xrc domain as the XRC receive QP. This QP is created in kernel space, and persists until the last process registered for the QP calls ibv_unreg_xrc_rcv_qp() (at which time the QP is destroyed).
The process which creates this QP is automatically registered for it, and should also call ibv_unreg_xrc_rcv_qp() at some point, to unregister.
Processes which wish to receive on an XRC SRQ via this QP should call ibv_reg_xrc_rcv_qp() for this QP.

IBV_MODIFY_XRC_RCV_QP
ibv_modify_xrc_rcv_qp() modifies the attributes of the XRC receive QP with the number xrc_qp_num which is associated with the XRC domain xrc_domain, with the attributes in attr according to the mask attr_mask, and moves the QP state through the following transitions: Reset -> Init -> RTR. attr_mask should indicate all of the attributes which will be used in this QP transition and the following masks (at least) should be set:
Next state   Required attributes
----------   ----------------------------------------

RETURN VALUE
ibv_modify_xrc_rcv_qp() returns 0 on success, or the value of errno on failure (which indicates the failure reason).
NOTES
If any of the modify attributes or the modify mask are invalid, none of the attributes will be modified (including the QP state).
Not all devices support alternate paths. To check if a device supports it, check if the IBV_DEVICE_AUTO_PATH_MIG bit is set in the device capabilities flags.
SEE ALSO
ibv_open_xrc_domain, ibv_create_xrc_rcv_qp, ibv_query_xrc_rcv_qp
IBV_OPEN_XRC_DOMAIN
NAME
ibv_open_xrc_domain, ibv_close_xrc_domain - open or close an eXtended Reliable Connection (XRC) domain
SYNOPSIS
#include <fcntl.h>
#include <infiniband/verbs.h>
DESCRIPTION
ibv_open_xrc_domain() opens an XRC domain for the RDMA device context context, or returns a reference to an opened one. fd is the file descriptor to be associated with the XRC domain. The argument oflag describes the desired file creation attributes; it is either 0 or the bitwise OR of one or more of the following flags:
- O_CREAT
- If a domain belonging to the device named by context is already associated with the inode, this flag has no effect, except as noted under O_EXCL below. Otherwise, a new XRC domain is created and is associated with the inode specified by fd.
- O_EXCL
- If O_EXCL and O_CREAT are set, open will fail if a domain associated with the inode exists. The check for the existence of the domain and the creation of the domain if it does not exist is atomic with respect to other processes executing open with fd naming the same inode.
If fd equals -1, no inode is associated with the domain, and the only valid value for oflag is O_CREAT.
ibv_close_xrc_domain() closes the XRC domain xrc_domain. If this is the last reference, the XRC domain will be destroyed.
RETURN VALUE
ibv_open_xrc_domain() returns a pointer to an opened XRC domain, or NULL if the request fails.
ibv_close_xrc_domain() returns 0 on success, or the value of errno on failure (which indicates the failure reason).
NOTES
Not all devices support XRC. To check if a device supports it, check if the IBV_DEVICE_XRC bit is set in the device capabilities flags.
ibv_close_xrc_domain() may fail if any QP or SRQ are still associated with the XRC domain being closed.
SEE ALSO
IBV_QUERY_XRC_RCV_QP
NAME
ibv_query_xrc_rcv_qp - get the attributes of an XRC receive queue pair (QP)
SYNOPSIS
#include <infiniband/verbs.h>
DESCRIPTION
ibv_query_xrc_rcv_qp() gets the attributes for the XRC receive QP with the number xrc_qp_num which is associated with the XRC domain xrc_domain, and returns them through the pointers attr and init_attr. The argument attr is an ibv_qp_attr struct, as defined in <infiniband/verbs.h>.

struct ibv_qp_attr {
        enum ibv_qp_state qp_state; /* Current QP state */
        /* ... (remaining QP attribute fields elided) ... */
};

For details on struct ibv_ah_attr see the description of ibv_create_ah().
RETURN VALUE
ibv_query_xrc_rcv_qp() returns 0 on success, or the value of errno on failure (which indicates the failure reason).
NOTES
The argument attr_mask is a hint that specifies the minimum list of attributes to retrieve. Some InfiniBand devices may return extra attributes not requested, for example if the value can be returned cheaply.
Attribute values are valid if they have been set using ibv_modify_xrc_rcv_qp(). The exact list of valid attributes depends on the QP state.
Multiple calls to ibv_query_xrc_rcv_qp() may yield some differences in the values returned for the following attributes: qp_state, path_mig_state, sq_draining, ah_attr (if APM is enabled).
NAME
ibv_reg_xrc_rcv_qp, ibv_unreg_xrc_rcv_qp - register and unregister a user process with an XRC receive queue pair (QP)
SYNOPSIS
#include <infiniband/verbs.h>
DESCRIPTION
ibv_reg_xrc_rcv_qp() registers a user process with the XRC receive QP (created via ibv_create_xrc_rcv_qp()) whose number is xrc_qp_num, and which is associated with the XRC domain xrc_domain.
ibv_unreg_xrc_rcv_qp() unregisters a user process from the XRC receive QP number xrc_qp_num, which is associated with the XRC domain xrc_domain. When the number of user processes registered with this XRC receive QP drops to zero, the QP is destroyed.
NOTES
ibv_reg_xrc_rcv_qp() and ibv_unreg_xrc_rcv_qp() may fail if the number xrc_qp_num is not the number of a valid XRC receive QP (the QP is not allocated or it is the number of a non-XRC QP), or the XRC receive QP was created with an XRC domain other than xrc_domain.
If a process is still registered with any XRC RCV QPs belonging to some domain, ibv_close_xrc_domain() will return failure if called for that domain in that process.
IBV_CREATE_QP
NAME
ibv_create_qp, ibv_destroy_qp - create or destroy a queue pair (QP)
SYNOPSIS
DESCRIPTION
ibv_create_qp() creates a queue pair (QP) associated with the protection domain pd. The argument qp_init_attr is an ibv_qp_init_attr struct, as defined in <infiniband/verbs.h>.
struct ibv_qp_init_attr {
void *qp_context; /* Associated context of the QP */
...
};
ibv_create_qp() will update the qp_init_attr->cap struct with the actual values of the QP that was created; the values will be greater than or equal to the values requested.
ibv_destroy_qp() destroys the QP qp.
RETURN VALUE
ibv_create_qp() returns a pointer to the created QP, or NULL if the request fails. Check the QP number (qp_num) in the returned QP.
ibv_destroy_qp() returns 0 on success, or the value of errno on failure (which indicates the failure reason).
NOTES
ibv_create_qp() will fail if it is asked to create a QP of a type other than IBV_QPT_RC or IBV_QPT_UD that is associated with an SRQ.
The attributes max_recv_wr and max_recv_sge are ignored by ibv_create_qp() if the QP is to be associated with an SRQ.
ibv_destroy_qp() fails if the QP is attached to a multicast group.
SEE ALSO
ibv_alloc_pd, ibv_modify_qp, ibv_query_qp
IBV_MODIFY_QP
NAME
ibv_modify_qp - modify the attributes of a queue pair (QP)
SYNOPSIS
#include <infiniband/verbs.h>
DESCRIPTION
ibv_modify_qp() modifies the attributes of QP qp with the attributes in attr according to the mask attr_mask. The argument attr is an ibv_qp_attr struct, as defined in <infiniband/verbs.h>.
struct ibv_qp_attr {
enum ibv_qp_state qp_state; /* Move the QP to this state */
...
};
The attr_mask argument is either 0 or the bitwise OR of one or more of the following flags:
...
RETURN VALUE
ibv_modify_qp() returns 0 on success, or the value of errno on failure (which indicates the failure reason).
NOTES
If any of the modify attributes or the modify mask are invalid, none of the attributes will be modified (including the QP state).
Not all devices support resizing QPs. To check if a device supports it, check if the IBV_DEVICE_RESIZE_MAX_WR bit is set in the device capabilities flags.
IBV_POST_RECV
NAME
ibv_post_recv - post a list of work requests (WRs) to a receive queue
SYNOPSIS
#include <infiniband/verbs.h>
DESCRIPTION
ibv_post_recv() posts the linked list of work requests (WRs) starting with wr to the receive queue of the queue pair qp. It stops processing WRs from this list at the first failure (that can be detected immediately while requests are being posted), and returns this failing WR through bad_wr.
The argument wr is an ibv_recv_wr struct, as defined in <infiniband/verbs.h>.
...
uint32_t lkey; /* Key of the local Memory Region */
};
RETURN VALUE
ibv_post_recv() returns 0 on success, or the value of errno on failure (which indicates the failure reason).
NOTES
The buffers used by a WR can only be safely reused after the request is fully executed and a work completion has been retrieved from the corresponding completion queue (CQ).
If the QP qp is associated with a shared receive queue, you must use the function ibv_post_srq_recv(), and not ibv_post_recv(), since the QP's own receive queue will not be used.
SEE ALSO
ibv_create_qp, ibv_post_send, ibv_post_srq_recv, ibv_poll_cq
IBV_POST_SEND
NAME
ibv_post_send - post a list of work requests (WRs) to a send queue
SYNOPSIS
#include <infiniband/verbs.h>
DESCRIPTION
ibv_post_send() posts the linked list of work requests (WRs) starting with wr to the send queue of the queue pair qp. It stops processing WRs from this list at the first failure (that can be detected immediately while requests are being posted), and returns this failing WR through bad_wr.
The argument wr is an ibv_send_wr struct, as defined in <infiniband/verbs.h>.
...
The send_flags argument is either 0 or the bitwise OR of one or more of the following flags:
- IBV_SEND_INLINE Send data in given gather list as inline data in a send WQE. Valid only for Send and RDMA Write. The L_Key will not be checked.
RETURN VALUE
ibv_post_send() returns 0 on success, or the value of errno on failure (which indicates the failure reason).
NOTES
The user should not alter or destroy AHs associated with WRs until the request is fully executed and a work completion has been retrieved from the corresponding completion queue (CQ), to avoid unexpected behavior.
The buffers used by a WR can only be safely reused after the request is fully executed and a work completion has been retrieved from the corresponding completion queue (CQ). However, if the IBV_SEND_INLINE flag was set, the buffer can be reused immediately after the call returns.
IBV_POST_SRQ_RECV
NAME
ibv_post_srq_recv - post a list of work requests (WRs) to a shared receive queue
SYNOPSIS
#include <infiniband/verbs.h>
DESCRIPTION
ibv_post_srq_recv() posts the linked list of work requests (WRs) starting with wr to the shared receive queue (SRQ) srq. It stops processing WRs from this list at the first failure (that can be detected immediately while requests are being posted), and returns this failing WR through bad_wr.
The argument wr is an ibv_recv_wr struct, as defined in <infiniband/verbs.h>.
...
uint32_t lkey; /* Key of the local Memory Region */
};
RETURN VALUE
ibv_post_srq_recv() returns 0 on success, or the value of errno on failure (which indicates the failure reason).
NOTES
The buffers used by a WR can only be safely reused after the request is fully executed and a work completion has been retrieved from the corresponding completion queue (CQ).
If a WR is being posted to a UD QP, the Global Routing Header (GRH) of the incoming message will be placed in the first 40 bytes of the buffer(s) in the scatter list. If no GRH is present in the incoming message, then the first 40 bytes will be undefined. This means that in all cases, the actual data of the incoming message will start at an offset of 40 bytes into the buffer(s) in the scatter list.
SEE ALSO
ibv_create_qp, ibv_post_send, ibv_post_recv, ibv_poll_cq
IBV_QUERY_QP
NAME
ibv_query_qp - get the attributes of a queue pair (QP)
SYNOPSIS
DESCRIPTION
ibv_query_qp() gets the attributes specified in attr_mask for the QP qp and returns them through the pointers attr and init_attr. The argument attr is an ibv_qp_attr struct, as defined in <infiniband/verbs.h>.
struct ibv_qp_attr {
enum ibv_qp_state qp_state; /* Current QP state */
...
};
For details on struct ibv_ah_attr see the description of ibv_create_ah().
The argument attr_mask is a hint that specifies the minimum list of attributes to retrieve. Some RDMA devices may return extra attributes not requested, for example if the value can be returned cheaply. This has the same form as in ibv_modify_qp().
Attribute values are valid if they have been set using ibv_modify_qp(). The exact list of valid attributes depends on the QP state.
Multiple calls to ibv_query_qp() may yield some differences in the values returned for the following attributes: qp_state, path_mig_state, sq_draining, ah_attr (if APM is enabled).
IBV_ATTACH_MCAST
NAME
ibv_attach_mcast, ibv_detach_mcast - attach and detach a queue pair (QP) to/from a multicast group
SYNOPSIS
int ibv_attach_mcast(struct ibv_qp *qp, const union ibv_gid *gid, uint16_t lid);
int ibv_detach_mcast(struct ibv_qp *qp, const union ibv_gid *gid, uint16_t lid);
DESCRIPTION
ibv_attach_mcast() attaches the QP qp to the multicast group having MGID gid and MLID lid.
ibv_detach_mcast() detaches the QP qp from the multicast group having MGID gid and MLID lid.
RETURN VALUE
ibv_attach_mcast() and ibv_detach_mcast() return 0 on success, or the value of errno on failure (which indicates the failure reason).
NOTES
Only QPs of Transport Service Type IBV_QPT_UD may be attached to multicast groups.
If a QP is attached to the same multicast group multiple times, the QP will still receive a single copy of a multicast message.
In order to receive multicast messages, a join request for the multicast group must be sent to the subnet administrator (SA), so that the fabric's multicast routing is configured to deliver messages to the local port.
IBV_RATE_TO_MULT
NAME
ibv_rate_to_mult - convert an IB rate enumeration to a multiplier of 2.5 Gbit/sec
mult_to_ibv_rate - convert a multiplier of 2.5 Gbit/sec to an IB rate enumeration
DESCRIPTION
ibv_rate_to_mult() converts the IB transmission rate enumeration rate to a multiple of 2.5 Gbit/sec (the base rate). For example, if rate is IBV_RATE_5_GBPS, the value 2 will be returned (5 Gbit/sec = 2 * 2.5 Gbit/sec).
mult_to_ibv_rate() converts the multiplier value (of 2.5 Gbit/sec) mult to an IB transmission rate enumeration. For example, if mult is 2, the rate enumeration IBV_RATE_5_GBPS will be returned.
RETURN VALUE
ibv_rate_to_mult() returns the multiplier of the base rate 2.5 Gbit/sec.
mult_to_ibv_rate() returns the enumeration representing the IB transmission rate.
SEE ALSO
ibv_query_port
RDMA_RESOLVE_ADDR
NAME
rdma_resolve_addr - Resolve destination and optional source addresses.
SYNOPSIS
#include <rdma/rdma_cma.h>
int rdma_resolve_addr (struct rdma_cm_id *id, struct sockaddr *src_addr, struct sockaddr *dst_addr, int timeout_ms);
ARGUMENTS
- id
- RDMA identifier.
- src_addr
- Source address information. This parameter may be NULL.
- dst_addr
- Destination address information.
- timeout_ms
- Time to wait for resolution to complete.
DESCRIPTION
Resolve destination and optional source addresses from IP addresses to an RDMA address. If successful, the specified rdma_cm_id will be bound to a local device.
NOTES
This call is used to map a given destination IP address to a usable RDMA address. The IP to RDMA address mapping is done using the local routing tables, or via ARP. If a source address is given, the rdma_cm_id is bound to that address, the same as if rdma_bind_addr were called. If no source address is given, and the rdma_cm_id has not yet been bound to a device, then the rdma_cm_id will be bound to a source address based on the local routing tables. After this call, the rdma_cm_id will be bound to an RDMA device. This call is typically made from the active side of a connection before calling rdma_resolve_route and rdma_connect.
INFINIBAND SPECIFIC
This call maps the destination and, if given, source IP addresses to GIDs. In order to perform the mapping, IPoIB must be running on both the local and remote nodes.
SEE ALSO
rdma_create_id, rdma_resolve_route, rdma_connect, rdma_create_qp, rdma_get_cm_event, rdma_bind_addr, rdma_get_src_port, rdma_get_dst_port, rdma_get_local_addr, rdma_get_peer_addr
RDMA_GET_CM_EVENT
NAME
rdma_get_cm_event - Retrieves the next pending communication event.
SYNOPSIS
#include <rdma/rdma_cma.h>
int rdma_get_cm_event (struct rdma_event_channel *channel, struct rdma_cm_event **event);
ARGUMENTS
- channel
- Event channel to check for events.
- event
- Allocated information about the next communication event.
DESCRIPTION
Retrieves a communication event. If no events are pending, by default, the call will block until an event is received.
NOTES
The default synchronous behavior of this routine can be changed by modifying the file descriptor associated with the given channel. All events that are reported must be acknowledged by calling rdma_ack_cm_event. Destruction of an rdma_cm_id will block until related events have been acknowledged.
EVENT DATA
Communication event details are returned in the rdma_cm_event structure. This structure is allocated by the rdma_cm and released by the rdma_ack_cm_event routine. Details of the rdma_cm_event structure are given below.
- id
- The rdma_cm identifier associated with the event. If the event type is RDMA_CM_EVENT_CONNECT_REQUEST, then this references a new id for that communication.
- listen_id
- For RDMA_CM_EVENT_CONNECT_REQUEST event types, this references the corresponding listening request identifier.
- event
- Specifies the type of communication event which occurred. See EVENT TYPES below.
- status
- Returns any asynchronous error information associated with an event. The status is zero unless the corresponding operation failed.
- param
- Provides additional details based on the type of event. Users should select the conn or ud subfields based on the rdma_port_space of the rdma_cm_id associated with the event. See UD EVENT DATA and CONN EVENT DATA below.
UD EVENT DATA
Event parameters related to unreliable datagram (UD) services: RDMA_PS_UDP and RDMA_PS_IPOIB. The UD event data is valid for RDMA_CM_EVENT_ESTABLISHED and RDMA_CM_EVENT_MULTICAST_JOIN events, unless stated otherwise.
- private_data
- References any user-specified data associated with the event. The data referenced by this field matches that specified by the remote side when calling rdma_connect or rdma_accept. This field is NULL if the event does not include private data. The buffer referenced by this pointer is deallocated when calling rdma_ack_cm_event.
- private_data_len
- The size of the private data buffer. Users should note that the size of the private data buffer may be larger than the amount of private data sent by the remote side. Any additional space in the buffer will be zeroed out.
- ah_attr
- Address information needed to send data to the remote endpoint(s). Users should use this structure when allocating their address handle.
- qp_num
- QP number of the remote endpoint or multicast group.
- qkey
- QKey needed to send data to the remote endpoint(s).
CONN EVENT DATA
Event parameters related to connected QP services: RDMA_PS_TCP. The connection related event data is valid for RDMA_CM_EVENT_CONNECT_REQUEST and RDMA_CM_EVENT_ESTABLISHED events, unless stated otherwise.
- private_data
- References any user-specified data associated with the event. The data referenced by this field matches that specified by the remote side when calling rdma_connect or rdma_accept. This field is NULL if the event does not include private data. The buffer referenced by this pointer is deallocated when calling rdma_ack_cm_event.
- private_data_len
- The size of the private data buffer. Users should note that the size of the private data buffer may be larger than the amount of private data sent by the remote side. Any additional space in the buffer will be zeroed out.
- responder_resources
- The number of responder resources requested of the recipient. This field matches the initiator depth specified by the remote node when calling rdma_connect and rdma_accept.
- initiator_depth
- The maximum number of outstanding RDMA read/atomic operations that the recipient may have outstanding. This field matches the responder resources specified by the remote node when calling rdma_connect and rdma_accept.
- flow_control
- Indicates if hardware level flow control is provided by the sender.
- retry_count
- For RDMA_CM_EVENT_CONNECT_REQUEST events only, indicates the number of times that the recipient should retry send operations.
- rnr_retry_count
- The number of times that the recipient should retry receiver not ready (RNR) NACK errors.
- srq
- Specifies if the sender is using a shared-receive queue.
- qp_num
- Indicates the remote QP number for the connection.
EVENT TYPES
The following types of communication events may be reported.
- RDMA_CM_EVENT_ADDR_RESOLVED
- Address resolution (rdma_resolve_addr) completed successfully.
- RDMA_CM_EVENT_ADDR_ERROR
- Address resolution (rdma_resolve_addr) failed.
- RDMA_CM_EVENT_ROUTE_RESOLVED
- Route resolution (rdma_resolve_route) completed successfully.
- RDMA_CM_EVENT_ROUTE_ERROR
- Route resolution (rdma_resolve_route) failed.
- RDMA_CM_EVENT_CONNECT_REQUEST
- Generated on the passive side to notify the user of a new connection request.
- RDMA_CM_EVENT_CONNECT_RESPONSE
- Generated on the active side to notify the user of a successful response to a connection request. It is only generated on rdma_cm_id's that do not have a QP associated with them.
- RDMA_CM_EVENT_CONNECT_ERROR
- Indicates that an error has occurred while trying to establish a connection. May be generated on the active or passive side of a connection.
- RDMA_CM_EVENT_UNREACHABLE
- Generated on the active side to notify the user that the remote server is not reachable or unable to respond to a connection request.
- RDMA_CM_EVENT_REJECTED
- Indicates that a connection request or response was rejected by the remote end point.
- RDMA_CM_EVENT_ESTABLISHED
- Indicates that a connection has been established with the remote end point.
- RDMA_CM_EVENT_DISCONNECTED
- The connection has been disconnected.
- RDMA_CM_EVENT_DEVICE_REMOVAL
- The local RDMA device associated with the rdma_cm_id has been removed. Upon receiving this event, the user must destroy the related rdma_cm_id.
- RDMA_CM_EVENT_MULTICAST_JOIN
- The multicast join operation (rdma_join_multicast) completed successfully.
- RDMA_CM_EVENT_MULTICAST_ERROR
- An error either occurred joining a multicast group, or, if the group had already been joined, on an existing group. The specified multicast group is no longer accessible and should be rejoined, if desired.
- RDMA_CM_EVENT_ADDR_CHANGE
- The network device associated with this ID through address resolution changed its HW address, e.g. following a bonding failover. This event can serve as a hint for applications that want the links used for their RDMA sessions to align with the network stack.
- RDMA_CM_EVENT_TIMEWAIT_EXIT
- The QP associated with a connection has exited its timewait state and is now ready to be re-used. After a QP has been disconnected, it is maintained in a timewait state to allow any in-flight packets to exit the network. After the timewait state has completed, the rdma_cm will report this event.
SEE ALSO
RDMA_ACK_CM_EVENT
NAME
rdma_ack_cm_event - Free a communication event.
SYNOPSIS
#include <rdma/rdma_cma.h>
int rdma_ack_cm_event (struct rdma_cm_event *event);
DESCRIPTION
All events which are allocated by rdma_get_cm_event must be released; there should be a one-to-one correspondence between successful gets and acks. This call frees the event structure and any memory that it references.
ARGUMENTS
- event
- Event to be released.
SEE ALSO
rdma_get_cm_event, rdma_destroy_id
SEE ALSO
rdma_connect, rdma_accept, rdma_get_cm_event, rdma_get_src_port, rdma_get_local_addr, rdma_get_peer_addr
RDMA_GET_LOCAL_ADDR
RDMA_LEAVE_MULTICAST
NAME
rdma_leave_multicast - Leaves a multicast group.
SYNOPSIS
#include <rdma/rdma_cma.h>
int rdma_leave_multicast (struct rdma_cm_id *id, struct sockaddr *addr);
ARGUMENTS
- id
- Communication identifier associated with the request.
- addr
- Multicast address identifying the group to leave.
Users should be aware that messages received from the multicast group may still be queued for completion processing immediately after leaving a multicast group. Destroying an rdma_cm_id will automatically leave all multicast groups.
SEE ALSO
rdma_join_multicast, rdma_destroy_qp
RDMA_SET_OPTION