Sean Hefty [Mon, 17 Sep 2012 23:00:12 +0000 (16:00 -0700)]
libibverbs: Add support for XRC QPs
XRC queue pairs: xrc defines two new types of QPs. The
initiator, or send-side, xrc qp behaves similarly to a
send-only RC qp. xrc send qp's are managed through the existing
QP functions. The send_wr structure is extended in a
backwards compatible way to support posting sends on a send xrc
qp, which requires specifying the remote xrc srq.
The target, or receive-side, xrc qp behaves differently
than other implemented qp's. A recv xrc qp can be created,
modified, and destroyed like other qp's through the existing
calls. The qp_init_attr structure is extended for xrc qp's.
Because xrc recv qp's are bound to an xrcd, rather than a pd,
they are intended to be used among multiple processes. Any process
with access to an xrcd may allocate and connect an xrc recv qp.
The actual xrc recv qp is allocated and managed by the kernel.
If the owning process explicitly destroys the xrc recv qp, it is
destroyed. However, if the xrc recv qp is left open when the
user process exits or closes its device, then the lifetime of
the xrc recv qp is bound with the lifetime of the xrcd.
Sean Hefty [Mon, 17 Sep 2012 19:34:55 +0000 (12:34 -0700)]
libibverbs: Add support for XRC SRQs
XRC support requires the use of a new type of SRQ.
XRC shared receive queues: xrc srq's are similar to normal
srq's, except that they are bound to an xrcd, rather
than to a protection domain. Based on the current spec
and implementation, they are only usable with xrc qps. To
support xrc srq's, we define a new srq_init_attr structure
to include an srq type and other needed information.
The kernel ABI is also updated to allow creating extended
SRQs.
Sean Hefty [Fri, 7 Sep 2012 21:38:07 +0000 (14:38 -0700)]
libibverbs: Introduce XRC domains
XRC introduces several new concepts and structures, one of
which is the XRC domain.
XRC domains: xrcd's are a type of protection domain used to
associate shared receive queues with xrc queue pairs. Since
xrcd's are meant to be shared among multiple processes, we
introduce new APIs to open/close xrcd's.
The user to kernel ABI is extended to account for opening/
closing the xrcd.
If the underlying provider supports extensions, it returns -1 from its
alloc_context() call. Ibverbs then allocates the ibv_context structure and
calls into the provider to finish initializing it.
When extensions are supported, the ibv_device structure is embedded in a
larger verbs_device structure. Similarly, ibv_context is embedded inside
a larger verbs_context structure.
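The embedding described above can be sketched with the usual container_of idiom. This is only an illustration: the demo_* structures, their fields, and demo_verbs_get_ctx() are hypothetical stand-ins, not the real library layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the legacy context structure. */
struct demo_context {
    int cmd_fd;
};

/* Hypothetical extended structure with the legacy one embedded inside. */
struct demo_verbs_context {
    size_t sz;                    /* size of the extended structure */
    int has_xrc;                  /* example of an extension-only field */
    struct demo_context context;  /* legacy structure seen by old callers */
};

/* Recover the enclosing extended structure from a pointer to the
 * embedded legacy structure. */
struct demo_verbs_context *demo_verbs_get_ctx(struct demo_context *ctx)
{
    assert(ctx != NULL);
    return (struct demo_verbs_context *)
        ((char *)ctx - offsetof(struct demo_verbs_context, context));
}
```

Old callers keep passing the embedded pointer around unchanged, while extension-aware code can step back to the larger structure when it needs the new fields.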
Bart Van Assche [Sun, 7 Aug 2011 18:05:30 +0000 (20:05 +0200)]
Fix compiler warnings with NVALGRIND
Fix compiler warnings when compiling with NVALGRIND defined and the
latest Valgrind header files. Recently the Valgrind client request
implementation has been modified in order to not trigger compiler
warnings when building with gcc 4.6. A side effect of that change is
that Valgrind client request macros that return a value have to be
cast to void in order to avoid a compiler warning.
For more information, see also:
* Valgrind manual about VALGRIND_MAKE_MEM_DEFINED (http://valgrind.org/docs/manual/mc-manual.html).
* Valgrind trunk r11755 (http://article.gmane.org/gmane.comp.debugging.valgrind.devel/13489).
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Add support to ibv_devinfo for displaying extended speeds
Add code to ibv_devinfo to display the following new speeds:
8: FDR-10 is a proprietary link speed which is 10.3125 Gbps with 64b/66b
encoding rather than 8b/10b encoding.
16: FDR - 14.0625 Gbps
32: EDR - 25.78125 Gbps
Signed-off-by: Marcel Apfelbaum <marcela@dev.mellanox.co.il>
Reviewed-by: Hal Rosenstock <hal@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
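A minimal sketch of how a speed code could be mapped to a label. Codes 8, 16 and 32 and their rates come from the list above; codes 1, 2 and 4 are the pre-existing SDR/DDR/QDR rates. The function name demo_speed_str() and the exact label wording are assumptions, not the strings ibv_devinfo actually prints.

```c
#include <stdint.h>

/* Map an active_speed code to a human-readable label. */
const char *demo_speed_str(uint8_t speed)
{
    switch (speed) {
    case 1:  return "2.5 Gbps";
    case 2:  return "5.0 Gbps";
    case 4:  return "10.0 Gbps";
    case 8:  return "10.3125 Gbps (FDR-10)";
    case 16: return "14.0625 Gbps (FDR)";
    case 32: return "25.78125 Gbps (EDR)";
    default: return "invalid speed";
    }
}
```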
Bart Van Assche [Sun, 7 Aug 2011 18:01:48 +0000 (18:01 +0000)]
Makefile.am: Fix an automake warning
Fix the following automake warning message:
Makefile.am:1: `INCLUDES' is the old name for `AM_CPPFLAGS' (or `*_CPPFLAGS')
A quote from the automake manual:
INCLUDES
This does the same job as AM_CPPFLAGS (or any per-target _CPPFLAGS variable
if it is used). It is an older name for the same functionality. This
variable is deprecated; we suggest using AM_CPPFLAGS and per-target
_CPPFLAGS instead.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Bart Van Assche [Sun, 7 Aug 2011 18:01:08 +0000 (18:01 +0000)]
Add "foreign" option to AM_INIT_AUTOMAKE
Switch to the modern form of the AM_INIT_AUTOMAKE macro and tell
automake that the libibverbs package does not follow the GNU
standards. This change makes it possible to use 'autoreconf' for the
libibverbs package.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Or Gerlitz [Tue, 19 Jul 2011 09:31:32 +0000 (09:31 +0000)]
Update examples for IBoE
Since IBoE requires usage of GRH, update ibv_*_pingpong examples to
accept GIDs. GIDs are given as an index to the local port's table and
are exchanged between the client and the server through the socket
connection.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.co.il>
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
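Exchanging a GID over a text-based socket protocol needs some wire encoding. The sketch below uses plain byte-wise hex; the demo_* names and this particular encoding are assumptions for illustration and may differ from what the examples actually send.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define DEMO_GID_LEN 16

/* Encode a 16-byte GID as 32 hex characters plus a terminating NUL. */
void demo_gid_to_wire(const uint8_t gid[DEMO_GID_LEN],
                      char wire[2 * DEMO_GID_LEN + 1])
{
    for (int i = 0; i < DEMO_GID_LEN; ++i)
        snprintf(wire + 2 * i, 3, "%02x", (unsigned int)gid[i]);
}

/* Decode the text form back into raw bytes. Returns 0 on success, or -1
 * if the input is not exactly 32 characters long or a pair of characters
 * fails to parse as hex. */
int demo_wire_to_gid(const char *wire, uint8_t gid[DEMO_GID_LEN])
{
    if (strlen(wire) != 2 * DEMO_GID_LEN)
        return -1;
    for (int i = 0; i < DEMO_GID_LEN; ++i) {
        unsigned int byte;
        if (sscanf(wire + 2 * i, "%2x", &byte) != 1)
            return -1;
        gid[i] = (uint8_t)byte;
    }
    return 0;
}
```

Each side would encode its own GID, send the string over the socket, and decode the peer's string before building the address handle attributes.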
Or Gerlitz [Tue, 19 Jul 2011 09:28:42 +0000 (09:28 +0000)]
Update kernel API header to include link_layer
Modify the code to handle returning the link layer of a port from the
kernel to the library. The kernel has done this since commit 2420b60b1dc4 ("IB/uverbs: Return link layer type to userspace for
query port operation"), merged in 2.6.37-rc1.
The new field does not change the size of struct ibv_query_port_resp
as it replaces a reserved field. Binary compatibility between the
kernel and the library is kept, since old kernels running below new
library will leave that field zeroed, so it will be read as "unspecified,"
while an old library running over new kernel will ignore the value
returned by the kernel.
The solution was suggested by Roland Dreier <roland@purestorage.com>
and Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.co.il>
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Or Gerlitz [Wed, 20 Jul 2011 19:37:24 +0000 (19:37 +0000)]
Add link_layer field port attribute
The new field has three possible values: IBV_LINK_LAYER_UNSPECIFIED,
IBV_LINK_LAYER_INFINIBAND, IBV_LINK_LAYER_ETHERNET. It can be used by
applications to know the link layer used by the port, which can be
either InfiniBand or Ethernet.
The addition of the new field does not change the size of struct
ibv_port_attr due to alignment of the preceding fields. Binary
compatibility between the library and applications is kept, since old
apps running over new library do not read this field, and new apps
running over old library will determine the link layer as unspecified
and hence take their IB code path.
The solution was suggested by Roland Dreier <roland@purestorage.com>
and Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.co.il>
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
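A sketch of how an application might branch on the new field. The demo_* enum mirrors the three values named above; the numeric values are an assumption, chosen so that a zeroed field reads as "unspecified". demo_needs_grh() is a hypothetical helper.

```c
#include <stdint.h>

/* Stand-ins for the three link layer values; numeric values assumed. */
enum demo_link_layer {
    DEMO_LINK_LAYER_UNSPECIFIED = 0,
    DEMO_LINK_LAYER_INFINIBAND  = 1,
    DEMO_LINK_LAYER_ETHERNET    = 2
};

/* Return 1 when the port runs over Ethernet and the application should
 * take its GRH/GID-based path, 0 when it should take its IB code path.
 * "Unspecified" deliberately falls through to the IB path. */
int demo_needs_grh(uint8_t link_layer)
{
    return link_layer == DEMO_LINK_LAYER_ETHERNET;
}
```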
Handle huge pages in ibv_fork_init() and madvise tracking
When fork support is enabled in libibverbs, madvise() is called for
every memory page that is registered as a memory region. Memory
ranges that are passed to madvise() must be page aligned and the size
must be a multiple of the page size.
libibverbs uses sysconf(_SC_PAGESIZE) to find out the system page size
and rounds all ranges passed to reg_mr() according to this page size.
When memory from libhugetlbfs is passed to reg_mr(), this does not
work as the page size for this memory range might be different
(e.g. 16MB). So libibverbs would have to use the huge page size to
calculate a page aligned range for madvise.
As huge pages are provided to the application "under the hood" when
preloading libhugetlbfs, the application does not have any knowledge
about when it registers a huge page or a usual page.
To work around this issue, detect the use of huge pages in libibverbs
and align memory ranges passed to madvise according to the huge page
size. Determining the page size of a given memory range by watching
madvise() fail has proven to be unreliable. So we introduce the
RDMAV_HUGEPAGES_SAFE environment variable to let the user decide if
the page size should be checked on every reg_mr() call or not. This
requires the user to be aware if huge pages are used by the running
application or not.
I did not add an additional API call to enable this, as applications
can use setenv() + ibv_fork_init() to enable checking for huge pages
in the code.
Signed-off-by: Alexander Schmidt <alexs@linux.vnet.ibm.com>
[ Updated ibv_fork_init() manpage for RDMAV_HUGEPAGES_SAFE. - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
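The alignment step can be sketched as follows: the registered range is widened to page boundaries, using whichever page size applies to that memory. demo_align_range() is a hypothetical helper, not the library's internal function, and it assumes the page size is a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Expand [start, start + len) to the smallest enclosing range whose
 * bounds are multiples of page_size (which must be a power of two). */
void demo_align_range(uintptr_t start, size_t len, size_t page_size,
                      uintptr_t *astart, size_t *alen)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    uintptr_t mask = (uintptr_t)page_size - 1;
    uintptr_t end = (start + len + mask) & ~mask;   /* round end up */

    *astart = start & ~mask;                        /* round start down */
    *alen = (size_t)(end - *astart);
}
```

With a 4 KB page the same 16-byte buffer yields a 4 KB range, while with a 16 MB huge page it yields a 16 MB range, which is why the page size of the underlying memory matters. As the message notes, an application opts in by setting RDMAV_HUGEPAGES_SAFE in its environment before calling ibv_fork_init().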
Roland Dreier [Fri, 27 May 2011 18:20:46 +0000 (11:20 -0700)]
Fix crash if no devices and ibv_get_device_list() is called multiple times
If no devices are found, ibverbs_init() sets num_devices to 0. This
means the next call to __ibv_get_device_list() would call
ibverbs_init() again, which crashes because ibverbs_init() leaves
various internal pointers pointing to freed memory.
Fix this by using pthread_once() to call ibverbs_init() exactly once,
and then doing the right thing even if num_devices stays 0.
Tested-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
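The pthread_once() pattern can be sketched like this. The demo_* names are hypothetical and the initializer is a stub that always finds no devices; the point is only that a device count of 0 no longer triggers a second initialization.

```c
#include <pthread.h>

static pthread_once_t demo_once = PTHREAD_ONCE_INIT;
static int demo_init_calls;     /* how many times the initializer ran */
static int demo_num_devices;    /* stays 0 when nothing is found */

static void demo_init(void)
{
    ++demo_init_calls;
    demo_num_devices = 0;       /* pretend the scan found no devices */
}

/* Every caller goes through pthread_once(), so demo_init() runs exactly
 * once even though the device count remains 0 afterwards. */
int demo_get_num_devices(void)
{
    pthread_once(&demo_once, demo_init);
    return demo_num_devices;
}

/* Accessor used to observe how often the initializer was invoked. */
int demo_init_call_count(void)
{
    return demo_init_calls;
}
```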
Jason Gunthorpe [Thu, 7 Oct 2010 22:38:33 +0000 (22:38 +0000)]
Fix autotools to include the necessary m4 files
Running autogen.sh with a new version of autotools and then building
on a system with an older version tends to explode. Unfortunately
this is sometimes necessary since the new version is required by the
package. The fix changes the autogen.sh output from:
+ aclocal -I config
+ libtoolize --force --copy
libtoolize: putting auxiliary files in AC_CONFIG_AUX_DIR, `config'.
libtoolize: copying file `config/ltmain.sh'
libtoolize: Consider adding `AC_CONFIG_MACRO_DIR([m4])' to configure.in and
libtoolize: rerunning libtoolize, to keep the correct libtool macros in-tree.
libtoolize: Consider adding `-I m4' to ACLOCAL_AMFLAGS in Makefile.am.
+ autoheader
+ automake --foreign --add-missing --copy
+ autoconf
Hakon Bugge [Wed, 2 Jun 2010 17:01:10 +0000 (10:01 -0700)]
Force line-buffering in ibv_asyncwatch
ibv_asyncwatch defaults to block-buffering when stdout is redirected to
a file or pipe. Changing to line-buffered mode makes it more usable in
scripted environments.
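One way to get line-buffered output regardless of where stdout points is the standard setvbuf() call. This is a sketch of the technique; demo_force_line_buffering() is a hypothetical wrapper and the actual change may be written differently.

```c
#include <stdio.h>

/* Switch a stream to line-buffered mode. Returns 0 on success. This
 * should be called before any other operation on the stream. */
int demo_force_line_buffering(FILE *stream)
{
    return setvbuf(stream, NULL, _IOLBF, 0);
}
```

Calling this on stdout at the start of main() makes each event line appear as soon as it is printed, even when the output goes to a pipe.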
Sean Hefty [Thu, 6 May 2010 23:20:48 +0000 (16:20 -0700)]
Add path record definitions to sa.h
Add definitions for the path record wire format. This will be used by
the librdmacm and ib_acm service, and is exchanged with the kernel
using the newer set and query route functionality.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Alex Vainman [Sun, 28 Mar 2010 17:06:16 +0000 (20:06 +0300)]
Undo changes in memory range tree when madvise() fails
ibv_madvise_range() doesn't clean up if madvise() fails. This patch
rolls back changes already made in the memory range tracking tree by
madvise() calls before the one that failed. We can do this fairly
simply by rerunning ibv_madvise_range() from the original
start to the current location with the opposite advice/inc values.
Signed-off-by: Alex Vainman <alexv@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
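The rollback idea can be sketched with a toy model: apply an operation page by page and, on failure, rerun over the pages already processed with the opposite value. Everything here (demo_advise_range(), the callback type, the fake backend and its state array) is hypothetical and far simpler than the real range tree.

```c
#include <stddef.h>

#define DEMO_PAGES 8

/* Toy backend state: a per-page counter standing in for tracked state. */
static int demo_state[DEMO_PAGES];
/* Page index at which the toy backend fails (DEMO_PAGES means never). */
static size_t demo_fail_at = DEMO_PAGES;

/* Hypothetical per-page operation; returns 0 on success. */
typedef int (*demo_advise_fn)(size_t page, int inc);

/* Toy backend: fails for a positive inc on the configured page. */
int demo_fake_advise(size_t page, int inc)
{
    if (inc > 0 && page == demo_fail_at)
        return -1;
    demo_state[page] += inc;
    return 0;
}

/* Apply inc to pages [first, end). If one call fails, rerun over the
 * pages already processed with undo_inc so no partial change remains.
 * Returns 0 on success, -1 on failure. */
int demo_advise_range(size_t first, size_t end, int inc, int undo_inc,
                      demo_advise_fn fn)
{
    for (size_t page = first; page < end; ++page) {
        if (fn(page, inc) != 0) {
            for (size_t done = first; done < page; ++done)
                fn(done, undo_inc);     /* best-effort rollback */
            return -1;
        }
    }
    return 0;
}
```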
Alex Vainman [Mon, 1 Feb 2010 05:57:45 +0000 (07:57 +0200)]
Fix incorrect splits/merges in the memory tree when madvise() fails.
ibv_madvise_range() first manages (splits or merges) memory ranges in
the tree and only then calls madvise(). If madvise() fails, the
tree's memory range may contain incorrectly split or merged ranges.
The patch undoes the split and merge operations performed on the node
which caused the madvise() failure as well as on that node's
neighbors.
Signed-off-by: Alex Vainman <alexv@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Alex Vainman [Mon, 1 Feb 2010 05:58:00 +0000 (07:58 +0200)]
Increment node refcount in ibv_madvise_range() only if madvise() succeeds
ibv_madvise_range() first updates the memory range reference count and
then calls madvise(). If madvise() fails, the reference count of
the failed node is incorrect. Fix this by updating the node's
reference count only after a successful call to madvise() (or if no
call to madvise() was needed).
Signed-off-by: Alex Vainman <alexv@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
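The ordering fix can be sketched in a few lines: do the fallible call first and touch the reference count only afterwards. The demo_* names and the callback standing in for madvise() are hypothetical.

```c
/* Hypothetical tracked node with a reference count. */
struct demo_node {
    int refcnt;
};

/* Hypothetical stand-in for the madvise() call; returns 0 on success. */
typedef int (*demo_op_fn)(void);

/* Bump the reference count only once the underlying call has succeeded,
 * or when no call was needed at all, so a failure leaves it untouched. */
int demo_ref_node(struct demo_node *node, int need_call, demo_op_fn op)
{
    if (need_call && op() != 0)
        return -1;
    ++node->refcnt;
    return 0;
}

/* Toy operations for trying the helper out. */
int demo_op_ok(void)   { return 0; }
int demo_op_fail(void) { return -1; }
```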
Jason Gunthorpe [Wed, 28 Oct 2009 23:42:21 +0000 (17:42 -0600)]
Return errors from ibv_get_device_list() via errno
Get rid of the output to stderr on various failure cases from
ibv_get_device_list() such as no device driver found, so that
applications can control how to present errors. Fix up the examples
and the man page to match.
Code that expects this behavior but links against an old libibverbs
will get the old fprintf() output and errno set to garbage (probably
ESPIPE).
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
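The calling convention can be sketched with a stand-in function: return NULL and set errno, and leave the presentation to the caller. demo_get_device_list() is hypothetical, and the ENOSYS value is chosen only for illustration.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a device-list call that reports failure
 * through errno instead of printing to stderr. */
const char **demo_get_device_list(int driver_present)
{
    static const char *devices[] = { "demo0", NULL };

    if (!driver_present) {
        errno = ENOSYS;     /* error code chosen for illustration */
        return NULL;
    }
    return devices;
}
```

A caller that wants the old behavior can print the failure itself with perror() or strerror(errno); one that wants silence simply checks for NULL.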
Jason Gunthorpe [Thu, 30 Jul 2009 22:58:16 +0000 (16:58 -0600)]
Do not use enum types for bit flags
Arithmetic operations on enum members do not result in the enum type;
C++ is stricter about this than C. So using flag enums results in
compile errors when they are OR'd together in a C++ application.
To fix this, replace all flag enum objects with int. int was selected
to preserve the ABI; we checked that enum types are the same size as
int on at least i386, x86-64, ppc32, ppc64, ia64, and mips, and arm
and sparc also appear compatible with this choice.
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Jason Gunthorpe [Sun, 19 Jul 2009 05:26:34 +0000 (23:26 -0600)]
Make the gid argument to ibv_attach_mcast and ibv_detach_mcast const
ibv_attach_mcast() and ibv_detach_mcast() don't change the gid
argument, so the arguments should be const to allow applications to
pass in constant gids. This constness flows through to the driver
call struct and into the drivers and back into
ibv_cmd_attach_mcast()/ibv_cmd_detach_mcast().
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Jason Gunthorpe [Tue, 14 Jul 2009 21:16:53 +0000 (15:16 -0600)]
Allow config file paths to the driver library to be absolute
If the driver line starts with a / then no lib prefix is applied and
the full path is passed to dlopen(). This allows a completely
self-contained installation that relies on RPATH for the binaries and
this mechanism for the drivers.
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
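The name selection can be sketched as below. An absolute path is passed through untouched; for anything else this sketch adds a "lib" prefix and a ".so" suffix, which is an assumption about the decoration and not the exact string the library builds. demo_driver_so_name() is a hypothetical helper.

```c
#include <assert.h>
#include <stdio.h>

/* Build the name to hand to dlopen(). Returns 0 on success, or -1 if
 * the result does not fit in the buffer. */
int demo_driver_so_name(const char *name, char *buf, size_t len)
{
    int n;

    assert(name != NULL && buf != NULL);
    if (name[0] == '/')
        n = snprintf(buf, len, "%s", name);         /* absolute: as-is */
    else
        n = snprintf(buf, len, "lib%s.so", name);   /* assumed decoration */
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```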
Roland Dreier [Wed, 24 Jun 2009 18:08:36 +0000 (11:08 -0700)]
Update build system to use shave
Add shave (git://git.lespiau.name/shave) to make build output of libibverbs
much more readable by abbreviating the printed commands so that
warnings become visible, etc.
Shirley Ma [Tue, 22 Jul 2008 22:06:50 +0000 (15:06 -0700)]
Implement PPC wmb() with sync instead of eieio
wmb() for PPC was incorrectly defined as an eieio instruction in
libibverbs. eieio only orders accesses to pure I/O memory or to pure
system memory, not a mix of the two. In a situation where the device
drivers use the d_map
kernel services to share a portion of system memory with an I/O
adapter, we need to use sync() instead. See below link for reference:
Roland Dreier [Wed, 25 Jun 2008 02:32:27 +0000 (19:32 -0700)]
Revert conversion of ibv_devinfo to use ibv_port_state_str()
Using ibv_port_state_str() changes the port state output of ibv_devinfo
(eg "PORT_DOWN" becomes "down"), which is reported to break scripts that
parse this output. Revert to using the old code in ibv_devinfo; we want
ibv_port_state_str() to continue producing the nicer-looking lower case
output, so just leave the open-coded alternative in ibv_devinfo.
Reported-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Or Gerlitz [Wed, 25 Jul 2007 06:45:16 +0000 (09:45 +0300)]
Document IBV_SEND_INLINE buffer ownership
If the IBV_SEND_INLINE flag is set in a work request posted with
ibv_post_send(), the data buffers can be reused immediately after the
call returns. Document this.
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Dotan Barak [Sun, 3 Feb 2008 15:55:04 +0000 (17:55 +0200)]
Fixes for man pages
Some fixes and updates to several man pages:
* Correct formatting in a few places.
* Add more "SEE ALSO" functions where appropriate.
* Document byte order of GUID and P_Key fields.
* Fix example code in ibv_get_cq_event.3
* Document GRH handling on receive.
Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Dotan Barak [Wed, 10 Oct 2007 09:25:18 +0000 (11:25 +0200)]
Fix several valgrind false positives
Fix several issues that were reported by valgrind:
* Initialize reserved attributes of command structures
* Fix the pointer and size when calling VALGRIND_MAKE_MEM_DEFINED in
ibv_cmd_reg_mr() and ibv_cmd_create_cq_v2(): if we have struct
xxx_resp *resp and resp_size, we need to do
Roland Dreier [Thu, 24 Jan 2008 04:04:50 +0000 (20:04 -0800)]
Convert hyphen to minus sign in ibv_query_pkey man page
A bare "-" in a man page will be rendered as a hyphen; to get a minus
sign, "\-" must be used. Very pedantic people (or automatic checkers,
such as Debian's lintian tool) may notice the difference. The man page
for ibv_query_pkey incorrectly wrote a negative return value as "-1".
Fix this to be the correct "\-1".
Roland Dreier [Tue, 20 Nov 2007 21:12:15 +0000 (13:12 -0800)]
Always return valid bad_wr on error from ibv_post_{send,recv,srq_recv}
There are error cases in the kernel's uverbs work request posting
functions where the return value is negative (i.e., an error) and yet a
non-zero resp.bad_wr is not written back to userspace. In this case,
ibv_cmd_post_send() should still set the bad_wr pointer.
Bug pointed out in ibv_post_send() by Ralph Campbell
<ralph.campbell@qlogic.com>, and noticed elsewhere by Dotan Barak
<dotanb@dev.mellanox.co.il>.
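The fallback can be sketched as follows. The demo_* names are hypothetical, and the sketch assumes the index written back by the kernel is 1-based, with 0 meaning that nothing was written back.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical work request: only the list linkage matters here. */
struct demo_wr {
    struct demo_wr *next;
};

/* Translate the written-back index into a pointer. If the call failed
 * (ret != 0) but no index came back (bad_index == 0), fall back to the
 * head of the list so *bad_wr is always valid on error. */
void demo_set_bad_wr(struct demo_wr *head, unsigned int bad_index, int ret,
                     struct demo_wr **bad_wr)
{
    assert(bad_wr != NULL);
    if (bad_index) {
        struct demo_wr *i = head;

        while (--bad_index && i)
            i = i->next;
        *bad_wr = i;
    } else if (ret) {
        *bad_wr = head;
    }
}
```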
Dotan Barak [Wed, 8 Aug 2007 13:27:08 +0000 (16:27 +0300)]
Initialize reserved attributes in modify QP command
Initialize the reserved attributes in modify QP command to eliminate
valgrind warnings like:
==23549== Syscall param write(buf) points to uninitialised byte(s)
==23549== at 0x316B1B933F: (within /lib64/tls/libc-2.3.4.so)
==23549== by 0x4A33AF7: ibv_cmd_modify_qp (cmd.c:782)
==23549== by 0x4F860D8: mlx4_modify_qp (verbs.c:480)
==23549== by 0x4A37A53: ibv_modify_qp@@IBVERBS_1.1 (verbs.c:441)
==23549== by 0x40972E: qp_reset_to_rtr (mr_test_fun.c:1189)
==23549== by 0x403AFC: mr_test_connect_qp (mr_test.c:232)
==23549== by 0x404956: do_test (mr_test.c:85)
==23549== by 0x402DF8: main (main.c:448)
==23549== Address 0x7FEFFF2AE is on thread 1's stack
Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Roland Dreier [Tue, 10 Jul 2007 18:23:18 +0000 (11:23 -0700)]
Fix too-big madvise() call in ibv_madvise_range()
When the first memory range found in ibv_madvise_range() is merged
with the previous range before entering the loop that calls madvise(),
a too-big range could be passed to madvise(). This could lead to
trying to madvise() memory that has already been freed and unmapped,
which causes madvise() and therefore ibv_reg_mr() to fail.
Fix this by making sure we don't madvise() any memory outside the
range passed into ibv_madvise_range().
This fixes <https://bugs.openfabrics.org/show_bug.cgi?id=682>.
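The clamping can be sketched as a range intersection: whatever the merged node covers, only the part inside the caller's request is passed on. demo_clamp_range() is a hypothetical helper using inclusive bounds, not the code in the library.

```c
#include <stdint.h>

/* Intersect a tracked node [node_start, node_end] with the caller's
 * request [start, end] (all bounds inclusive). Returns 0 and fills the
 * outputs, or -1 when the two ranges do not overlap. */
int demo_clamp_range(uintptr_t node_start, uintptr_t node_end,
                     uintptr_t start, uintptr_t end,
                     uintptr_t *out_start, uintptr_t *out_end)
{
    uintptr_t s = node_start > start ? node_start : start;
    uintptr_t e = node_end < end ? node_end : end;

    if (s > e)
        return -1;
    *out_start = s;
    *out_end = e;
    return 0;
}
```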
Roland Dreier [Tue, 3 Jul 2007 18:53:47 +0000 (11:53 -0700)]
Fix Valgrind annotations so they can actually be built
The AC_CHECK_HEADER() test for <valgrind/memcheck.h> will never result
in HAVE_VALGRIND_MEMCHECK_H being defined, so ibverbs.h will never
include <valgrind/memcheck.h> and Valgrind annotations will never actually
get built. Fix this by adding an AC_DEFINE() of HAVE_VALGRIND_MEMCHECK_H
if the header is found.
Roland Dreier [Tue, 3 Jul 2007 18:48:40 +0000 (11:48 -0700)]
Clean up NVALGRIND comment in config.h.in
Update configure.in so that the comment generated by autoheader for
NVALGRIND in config.h.in is a complete sentence to match the style of
the rest of the file.
For newly created QPs, set qp->state to IBV_QPS_RESET. At least
libmlx4 needs this fix, or else it won't correctly initialize the QP's
send queue when transitioning to INIT.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The amount of memory root can lock isn't limited, so the rlimit value
doesn't matter in this case. Do not print a warning about
RLIMIT_MEMLOCK being too low if EUID is 0.
Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Roland Dreier [Thu, 3 May 2007 18:09:44 +0000 (11:09 -0700)]
Fix call to ibv_free_device_list() in pingpong examples
When a -d option to specify which device to use is passed to the
pingpong examples, they iterate through the device list by
incrementing the dev_list pointer. This means that the call to
ibv_free_device_list() may not get the right pointer.
Fix this by using an index to iterate through the array and leaving
the dev_list pointer itself alone.
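The index-based iteration can be sketched with a plain NULL-terminated array of names standing in for the device list. demo_find_device() is a hypothetical helper; the point is that the list pointer is never advanced, so the original pointer remains available for the matching free call.

```c
#include <stddef.h>
#include <string.h>

/* Find a device by name by indexing into the NULL-terminated list.
 * Returns the index of the match, or -1 if there is none. */
int demo_find_device(const char *const *dev_list, const char *wanted)
{
    for (int i = 0; dev_list[i] != NULL; ++i)
        if (strcmp(dev_list[i], wanted) == 0)
            return i;
    return -1;
}
```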
Roland Dreier [Sat, 28 Apr 2007 21:18:43 +0000 (14:18 -0700)]
Update Debian build
Use DEB_AUTO_UPDATE_LIBTOOL rather than manually rerunning autotools to
avoid setting RPATH. Remove DEB_DH_STRIP_ARGS since cdbs should
handle this automatically at debhelper compat level 5. Let cdbs
generate build-deps automatically (move control to control.in).