git.openfabrics.org - ~shefty/librdmacm.git/log
11 years ago librdmacm: Enable AF_IB support
Sean Hefty [Fri, 17 Aug 2012 21:02:45 +0000 (14:02 -0700)]
librdmacm: Enable AF_IB support

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
11 years ago librdmacm: Set address family for source address returned by ACM
Sean Hefty [Thu, 23 Aug 2012 22:48:06 +0000 (15:48 -0700)]
librdmacm: Set address family for source address returned by ACM

Set the sa_family type when saving the source address returned
by ACM.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
11 years ago librdmacm: Report error level in error messages
Yann Droneaud [Mon, 27 Aug 2012 23:37:29 +0000 (16:37 -0700)]
librdmacm: Report error level in error messages

Report error messages as either 'Warning' or 'Fatal'.

Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
11 years ago librdmacm: Use common prefix for error messages
Yann Droneaud [Mon, 27 Aug 2012 23:35:32 +0000 (16:35 -0700)]
librdmacm: Use common prefix for error messages

Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
11 years ago librdmacm: Report error messages on stderr
Yann Droneaud [Mon, 27 Aug 2012 23:33:50 +0000 (16:33 -0700)]
librdmacm: Report error messages on stderr

Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
11 years ago rspreload: Avoid rsocket calls until after fork
Sean Hefty [Thu, 23 Aug 2012 18:20:08 +0000 (11:20 -0700)]
rspreload: Avoid rsocket calls until after fork

When an rsocket call is made before an application calls fork(),
the forked applications can hang.  This can be seen by running
netserver and two netperf clients simultaneously.  The second
netperf client will eventually stop performing data transfers.

LD_PRELOAD=librspreload.so netserver -D

LD_PRELOAD=librspreload.so netperf -v2 -c -C -H 192.168.0.101 -l30
LD_PRELOAD=librspreload.so netperf -v2 -c -C -H 192.168.0.101 -l30

It's not clear what the specific problem is.  The best guess is
that libibverbs or the provider library (e.g. libmlx4) performs
some initialization, such as mmap'ing device memory, which does not
work when fork is called.

As a work-around, avoid calling rsocket routines until immediately
before they are needed.  This allows the process to fork before
the libraries are initialized.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
11 years ago rspreload: Fix checks in fork_active/passive
Sean Hefty [Mon, 20 Aug 2012 16:06:49 +0000 (09:06 -0700)]
rspreload: Fix checks in fork_active/passive

Fix passing in wrong variable to rconnect(), check state instead
of type, and move call to getpeername until after we are sure that
the normal socket connection has completed.

Problems pointed out by Sridhar Samudrala <sri@us.ibm.com>

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
11 years ago librdmacm: Re-enable ibacm support
Sean Hefty [Fri, 17 Aug 2012 23:41:04 +0000 (16:41 -0700)]
librdmacm: Re-enable ibacm support

Commit 272c3cc024d0e5854cbafa6c2f1e8560398a68d7, "Delay ACM
connection until resolving an address", removed the call to
ucma_ib_init without adding it back in the correct location.
As a result, the librdmacm no longer uses ibacm.  Fix this
by adding the initialization call when resolving an address.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
11 years ago rstream: Use MSG_WAITALL for blocking test
Sean Hefty [Thu, 16 Aug 2012 22:41:35 +0000 (15:41 -0700)]
rstream: Use MSG_WAITALL for blocking test

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
11 years ago rsockets: Add support for MSG_WAITALL rrecv() flag
Sean Hefty [Thu, 28 Jun 2012 18:34:38 +0000 (11:34 -0700)]
rsockets: Add support for MSG_WAITALL rrecv() flag

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
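
The MSG_WAITALL behavior named in the commit above can be sketched as a
loop that keeps receiving until the full length has arrived.  This is a
minimal illustration, not the rsockets implementation: recv_fn,
recv_waitall, and the chunk_src demo source are hypothetical names
introduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical transport hook: bytes received, 0 on end of stream, -1 on error. */
typedef ssize_t (*recv_fn)(void *ctx, void *buf, size_t len);

/* Keep receiving until 'len' bytes arrive, the stream ends, or an error occurs. */
static ssize_t recv_waitall(recv_fn recv_some, void *ctx, void *buf, size_t len)
{
    size_t done = 0;

    while (done < len) {
        ssize_t n = recv_some(ctx, (char *)buf + done, len - done);
        if (n < 0)
            return done ? (ssize_t)done : -1;  /* report partial data first */
        if (n == 0)
            break;                             /* end of stream */
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/* Demo source (assumption): hands out at most max_chunk bytes per call. */
struct chunk_src {
    const char *data;
    size_t len, pos, max_chunk;
};

static ssize_t chunk_recv(void *ctx, void *buf, size_t len)
{
    struct chunk_src *s = ctx;
    size_t n = s->len - s->pos;

    if (n > s->max_chunk)
        n = s->max_chunk;
    if (n > len)
        n = len;
    memcpy(buf, s->data + s->pos, n);
    s->pos += n;
    return (ssize_t)n;
}
```

A caller that asks for 11 bytes from a source that delivers 3 bytes at a
time still gets all 11 bytes from a single recv_waitall() call.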
11 years ago rspreload: Add fstat support
Sean Hefty [Tue, 7 Aug 2012 16:37:24 +0000 (09:37 -0700)]
rspreload: Add fstat support

vsftpd calls fstat on a socket.  Fake it out.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
11 years ago rspreload: Support sendfile
Sean Hefty [Tue, 14 Aug 2012 00:00:42 +0000 (17:00 -0700)]
rspreload: Support sendfile

Handle users calling sendfile with an rsocket.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
11 years ago rspreload: Do not block connect when supporting fork
Sean Hefty [Sat, 11 Aug 2012 04:44:39 +0000 (21:44 -0700)]
rspreload: Do not block connect when supporting fork

Many FTP servers require fork support.  However, FTP clients,
such as ncftp, will perform the following call sequence:

send PASV request to server over connection 1
         server will listen for connection 2
issue nonblocking connect to server
send ACCEPT request to server over connection 1
         server will accept connection 2

The current fork support converts all nonblocking connect
calls to blocking.  The result is that the FTP client ends up
blocked waiting for the server to accept the connection,
which it will never do.

To handle this case, we have the active side follow the same
rule as the server side and defer establishing the rsocket
connection until the user calls the first data transfer routine.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
11 years ago rspreload: Minor cleanup of fork_passive handling
Sean Hefty [Mon, 13 Aug 2012 23:00:16 +0000 (16:00 -0700)]
rspreload: Minor cleanup of fork_passive handling

Minor code cleanup in passive side handling of fork support.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
11 years ago rsockets: Support SO_OOBINLINE
Sean Hefty [Wed, 8 Aug 2012 04:31:12 +0000 (21:31 -0700)]
rsockets: Support SO_OOBINLINE

We don't support urgent data, so just return success.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
11 years ago rspreload: Support dup2 calls
Sean Hefty [Mon, 30 Jul 2012 23:06:32 +0000 (16:06 -0700)]
rspreload: Support dup2 calls

vsftpd requires dup2() support.  To handle dup2, we need to add
reference count tracking to the preload fd's.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
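
The reference counting that the dup2 commit above describes can be
sketched as follows.  The table layout and the names fd_info, fd_open,
fd_put, and fd_dup2 are assumptions made for illustration; the preload
library's actual structures may differ.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-descriptor state, shared by every slot that aliases it. */
struct fd_info {
    int real_fd;   /* underlying socket or rsocket handle */
    int refcnt;    /* number of table slots pointing here */
};

#define MAX_FDS 64
static struct fd_info *fd_table[MAX_FDS];

static int fd_valid(int index)
{
    return index >= 0 && index < MAX_FDS;
}

/* Register a new descriptor with a single reference. */
static int fd_open(int index, int real_fd)
{
    struct fd_info *fdi;

    if (!fd_valid(index))
        return -1;
    fdi = calloc(1, sizeof(*fdi));
    if (!fdi)
        return -1;
    fdi->real_fd = real_fd;
    fdi->refcnt = 1;
    fd_table[index] = fdi;
    return index;
}

/* Drop one reference.  Returns 1 when the last reference went away,
 * which is the point where the caller would close the real descriptor. */
static int fd_put(int index)
{
    struct fd_info *fdi;
    int last;

    if (!fd_valid(index) || !fd_table[index])
        return 0;
    fdi = fd_table[index];
    fd_table[index] = NULL;
    last = (--fdi->refcnt == 0);
    if (last)
        free(fdi);
    return last;
}

/* dup2 semantics: newfd releases whatever it held, then aliases oldfd. */
static int fd_dup2(int oldfd, int newfd)
{
    if (!fd_valid(oldfd) || !fd_valid(newfd) || !fd_table[oldfd])
        return -1;
    if (oldfd == newfd)
        return newfd;
    fd_put(newfd);
    fd_table[newfd] = fd_table[oldfd];
    fd_table[newfd]->refcnt++;
    return newfd;
}
```

With this scheme, closing one of two aliased descriptors leaves the
underlying connection open until the second one is closed as well.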
11 years ago rspreload: Call real.close in fd_close
Sean Hefty [Wed, 1 Aug 2012 23:26:11 +0000 (16:26 -0700)]
rspreload: Call real.close in fd_close

The index into the preload lookup table is obtained by opening
/dev/null and using the returned value.  When closing the file,
use the real close call and not the preload close call.  This
is a minor optimization, but clarifies the expected operation.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
11 years ago rsocket: Improve disconnect time under normal conditions
Sean Hefty [Fri, 27 Jul 2012 17:46:42 +0000 (10:46 -0700)]
rsocket: Improve disconnect time under normal conditions

When both sides of a connection attempt to close at the same
time, one of the two sides can easily get an error when sending
a disconnect message.  This results in that side hanging
during close until the send times out.  (The time out is caused
by the remote side destroying its QP.)

We can reduce the chance of this occurring by immediately
assuming that the disconnect has been successful once we've
received the remote side's disconnect message, or we've
polled a send completion for the local disconnect message.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
11 years ago rsockets: Use wr_id to determine completion type
Sean Hefty [Thu, 26 Jul 2012 22:35:32 +0000 (15:35 -0700)]
rsockets: Use wr_id to determine completion type

If a work request has completed in error, the completion type
field is undefined.  Use the wr_id to determine if the failed
completion was a send or receive.

This fixes an issue where MPI can hang during finalize.  With
both sides of a connection shutting down simultaneously, one
side may complete quicker and delete its QP before the other
side receives an acknowledgement to their disconnect message.
Eventually, the disconnect message will time out, but because
the completion type field is undefined, it may be processed
as a failed receive, rather than a failed send.  The end
result is that the second side hangs waiting for the send to
complete.

This problem showed up more easily after commit
2e5b0fc95964f74ea59dd725e849027faa0cd526, but existed beforehand.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
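
One way to make the wr_id carry the send/receive distinction described
in the commit above is to reserve a bit in it.  The sketch below assumes
the top bit of the 64-bit wr_id is free for that purpose; the macro and
function names are hypothetical and the encoding rsockets actually uses
may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the top bit of the 64-bit wr_id is reserved to mark receives. */
#define WR_ID_RECV (UINT64_C(1) << 63)

enum wr_kind { WR_SEND, WR_RECV };

/* Tag a caller-chosen cookie when posting the work request. */
static uint64_t wr_id_for_send(uint64_t cookie)
{
    return cookie & ~WR_ID_RECV;
}

static uint64_t wr_id_for_recv(uint64_t cookie)
{
    return cookie | WR_ID_RECV;
}

/* Classify a completion from wr_id alone, so the result does not depend
 * on the completion type field, which is undefined for failed requests. */
static enum wr_kind wr_id_kind(uint64_t wr_id)
{
    return (wr_id & WR_ID_RECV) ? WR_RECV : WR_SEND;
}
```

A completion handler would call wr_id_kind() on the completion's wr_id
before looking at any other field.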
11 years ago rsockets: Enable support for privileged ports
Sean Hefty [Wed, 25 Jul 2012 18:11:56 +0000 (11:11 -0700)]
rsockets: Enable support for privileged ports

Allow the preload library to use rsockets with privileged
ports.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
11 years ago rspreload: Call init from getsockname()
Sean Hefty [Tue, 24 Jul 2012 21:13:55 +0000 (14:13 -0700)]
rspreload: Call init from getsockname()

netperf for some unknown reason calls getsockname() using a
hard coded value of 0, without first allocating a socket.
This causes the rsocket preload library to crash, since the
library has not been properly initialized.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
11 years ago rstream: Add option to test fork support
Sean Hefty [Tue, 17 Jul 2012 22:32:54 +0000 (15:32 -0700)]
rstream: Add option to test fork support

If the user specifies '-T f', rstream will process
connections in a child process.  The server continues
to run until all child processes have completed their
tests.

Fork support requires use of the librspreload library.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
11 years ago librspreload: Support server apps that call fork()
Sean Hefty [Tue, 24 Jul 2012 18:40:10 +0000 (11:40 -0700)]
librspreload: Support server apps that call fork()

Provide limited support for applications that call fork().  To
handle fork(), we establish connections using normal sockets.
The socket is later converted to an rsocket when the user
makes the first call to a data transfer function (e.g. send,
recv, read, write, etc.).

Fork support is indicated by setting the environment variable
RDMAV_FORK_SAFE = 1.  When set, the preload library will delay
converting to an rsocket until the user attempts to send or receive
data on the socket.  To convert from a normal socket to an
rsocket, the preload library must inject a message on the
normal socket to synchronize between the client and server.  As
a result, if the rsocket connection fails, the ability to
silently fall back to the normal socket may be compromised.  Fork
support is disabled by default.

The current implementation works for simple test apps under
ideal conditions.  Although it supports nonblocking sockets, it
uses blocking rsockets when migrating connections.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago librspreload: Make socket_fallback() call more generic
Sean Hefty [Mon, 16 Jul 2012 21:17:58 +0000 (14:17 -0700)]
librspreload: Make socket_fallback() call more generic

socket_fallback is used to switch from an rsocket to a normal
socket in the case of failures.  Rename the call and make it
more generic, so that it can switch between an rsocket and
a normal socket in either direction.  This will be used to
support fork().

As part of this change, we move the list of hooked and rsocket
calls into structures, versus maintaining a large number of
static variables.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago librdmacm: Only allocate verbs resources when needed
Sean Hefty [Thu, 19 Jul 2012 17:09:48 +0000 (10:09 -0700)]
librdmacm: Only allocate verbs resources when needed

The librdmacm allocates a PD per device on initialization.  Although
we need to maintain the device list while the library is loaded
(see rdma_get_devices), we can reduce the overhead by only allocating
verbs resources when they are needed.

This allows the rsocket preload library to support fork for
applications that spawn connections off to child processes.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago librdmacm: Remove unused 'ib' variable from ucma_init
Sean Hefty [Thu, 19 Jul 2012 17:13:50 +0000 (10:13 -0700)]
librdmacm: Remove unused 'ib' variable from ucma_init

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago v1.0.16
Sean Hefty [Wed, 11 Jul 2012 00:55:32 +0000 (17:55 -0700)]
v1.0.16

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago librspreload: Fix typecast to eliminate compile warnings
Hal Rosenstock [Thu, 12 Jul 2012 16:24:10 +0000 (09:24 -0700)]
librspreload: Fix typecast to eliminate compile warnings

src/preload.c: In function 'bind':
src/preload.c:350: warning: assignment from incompatible pointer type
src/preload.c: In function 'connect':
src/preload.c:397: warning: assignment from incompatible pointer type

Signed-off-by: Hal Rosenstock <hal@mellanox.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago rspreload: Document use of librspreload library
Sean Hefty [Wed, 11 Jul 2012 22:13:29 +0000 (15:13 -0700)]
rspreload: Document use of librspreload library

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago librdmacm: Include src/common.h in distribution
sean.hefty@intel.com [Wed, 11 Jul 2012 19:39:08 +0000 (12:39 -0700)]
librdmacm: Include src/common.h in distribution

Add missing header file to distribution to allow rpmbuild to
work.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago librdmacm: Validate source address protocol family in rdma_resolve_addr
Yann Droneaud [Wed, 11 Jul 2012 18:54:39 +0000 (11:54 -0700)]
librdmacm: Validate source address protocol family in rdma_resolve_addr

If a source address is provided but its protocol family is not recognized,
return an error.

Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago rsocket: Build librspreload library as part of build
Sean Hefty [Mon, 9 Jul 2012 21:58:14 +0000 (14:58 -0700)]
rsocket: Build librspreload library as part of build

Build the rsocket preload library as part of the build.  To reduce the
risk of the preload library intercepting calls without the user's
knowledge, the preload library is installed into {_libdir}/rsocket.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago rsocket: Support IPV6_V6ONLY socket option
Sean Hefty [Tue, 12 Jun 2012 19:02:04 +0000 (12:02 -0700)]
rsocket: Support IPV6_V6ONLY socket option

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago rsocket: Handle other shutdown option
Sean Hefty [Mon, 25 Jun 2012 21:19:54 +0000 (14:19 -0700)]
rsocket: Handle other shutdown option

Handle SHUT_RD and SHUT_WR shutdown options.

In order to handle shutting down the send and receive sides
separately, we break the connection state into multiple sub-states.
This allows us to be partially connected (i.e. for either just
reads or just writes).

Support for SHUT_WR is needed to handle netperf properly, which
shuts down a socket by having the client use SHUT_WR, followed by
the server completing the disconnect with SHUT_RDWR.  The following
patch eliminates an error message from netperf:

'shutdown_control: no response received  errno 95'

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
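
The sub-state idea in the shutdown commit above can be sketched with one
bit per direction.  The flag names and apply_shutdown() are hypothetical;
only SHUT_RD, SHUT_WR, and SHUT_RDWR come from the sockets API.

```c
#include <assert.h>
#include <sys/socket.h>   /* SHUT_RD, SHUT_WR, SHUT_RDWR */

/* Hypothetical connection sub-states: one bit per direction. */
enum {
    CONN_READABLE  = 1 << 0,
    CONN_WRITABLE  = 1 << 1,
    CONN_CONNECTED = CONN_READABLE | CONN_WRITABLE
};

/* Return the state that results from shutdown(how) on 'state'. */
static int apply_shutdown(int state, int how)
{
    switch (how) {
    case SHUT_RD:
        return state & ~CONN_READABLE;
    case SHUT_WR:
        return state & ~CONN_WRITABLE;
    case SHUT_RDWR:
        return state & ~CONN_CONNECTED;
    default:
        return state;     /* unknown option: leave the state untouched */
    }
}
```

After SHUT_WR the connection is still readable, which matches the
netperf sequence in which the client shuts down its write side first and
the server finishes with SHUT_RDWR.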
12 years ago rsocket: Set readfds event if rsocket has been disconnected
Sean Hefty [Mon, 25 Jun 2012 22:04:52 +0000 (15:04 -0700)]
rsocket: Set readfds event if rsocket has been disconnected

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago rsocket: Add rsocket man page
Sean Hefty [Mon, 11 Jun 2012 20:20:18 +0000 (13:20 -0700)]
rsocket: Add rsocket man page

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago rsocket: Use configuration files to specify default settings
Sean Hefty [Tue, 5 Jun 2012 22:28:18 +0000 (15:28 -0700)]
rsocket: Use configuration files to specify default settings

Give an administrator control over the default settings
used by rsockets.  Use files under %sysconfig%/rdma/rsocket as shown:

mem_default - default size of receive buffer(s)
wmem_default - default size of send buffer(s)
sqsize_default - default size of send queue
rqsize_default - default size of receive queue
inline_default - default size of inline data

If configuration files are not available, rsockets will continue to
use internal defaults.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
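
Reading one of the per-setting files listed in the commit above, with a
fall back to an internal default, might look like the sketch below.
read_default() is a hypothetical helper and the caller supplies the full
path, since the %sysconfig% prefix depends on how the package was
configured.

```c
#include <assert.h>
#include <stdio.h>

/* Read one unsigned value from 'path'.  Keep 'fallback' when the file is
 * missing or does not start with a number. */
static unsigned int read_default(const char *path, unsigned int fallback)
{
    unsigned int value;
    FILE *f = fopen(path, "r");

    if (!f)
        return fallback;           /* no file: use the internal default */
    if (fscanf(f, "%u", &value) != 1)
        value = fallback;          /* malformed content: same fallback */
    fclose(f);
    return value;
}
```

A caller would invoke it once per setting, for example with the path of
sqsize_default and the built-in queue size as the fallback.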
12 years ago rsocket: Spin before blocking on an rsocket
Sean Hefty [Mon, 4 Jun 2012 21:51:41 +0000 (14:51 -0700)]
rsocket: Spin before blocking on an rsocket

The latency cost of blocking is significant compared to round
trip ping-pong time.  Spin briefly on rsockets before calling
into the kernel and blocking.

The time to spin before blocking is read from an rsocket
configuration file %sysconfig%/rdma/rsocket/polling_time.  This
is user adjustable.

As a completely unintentional side effect, this just happens to
improve application performance in benchmarks, like netpipe,
significantly. ;)

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
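
The spin-then-block pattern from the commit above can be sketched with
ordinary file descriptors.  This is only an illustration of the idea:
poll() with a zero timeout stands in for whatever readiness check the
library performs while spinning, and poll_spin_then_block() and
elapsed_us() are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <time.h>
#include <unistd.h>

/* Microseconds elapsed since 'start' on the monotonic clock. */
static long elapsed_us(const struct timespec *start)
{
    struct timespec now;

    clock_gettime(CLOCK_MONOTONIC, &now);
    return (long)(now.tv_sec - start->tv_sec) * 1000000L +
           (now.tv_nsec - start->tv_nsec) / 1000L;
}

/* Check for readiness repeatedly for up to spin_us microseconds, then
 * fall back to a poll() that may block for timeout_ms milliseconds. */
static int poll_spin_then_block(struct pollfd *fds, nfds_t nfds,
                                long spin_us, int timeout_ms)
{
    struct timespec start;
    int ret;

    clock_gettime(CLOCK_MONOTONIC, &start);
    do {
        ret = poll(fds, nfds, 0);   /* zero timeout: never sleeps */
        if (ret != 0)
            return ret;             /* event ready, or error */
    } while (elapsed_us(&start) < spin_us);

    return poll(fds, nfds, timeout_ms);
}
```

The spin_us argument plays the role of the value read from the
polling_time configuration file.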
12 years ago rsocket: Handle TCP_MAXSEG socket option
Sean Hefty [Mon, 4 Jun 2012 20:22:10 +0000 (13:22 -0700)]
rsocket: Handle TCP_MAXSEG socket option

netperf uses the TCP_MAXSEG socket option.  Add support for it.
Problem reported by Sridhar Samudrala <sri@us.ibm.com>

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago man: List RDMA_PS_IB as a supported port space in rdma_getaddrinfo man page
Hal Rosenstock [Fri, 1 Jun 2012 17:52:03 +0000 (10:52 -0700)]
man: List RDMA_PS_IB as a supported port space in rdma_getaddrinfo man page

Signed-off-by: Hal Rosenstock <hal@mellanox.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago rstream: Use separate connections for latency/bw tests
Sean Hefty [Sun, 27 May 2012 21:07:42 +0000 (14:07 -0700)]
rstream: Use separate connections for latency/bw tests

Optimize each connection for either latency or bandwidth
results.  This improves small message latency under 384
bytes by .5 - 1 us, while increasing bandwidth by
1 - 1.5 Gbps.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago rstream: Use snprintf in place of sprintf
Sean Hefty [Tue, 29 May 2012 21:15:57 +0000 (14:15 -0700)]
rstream: Use snprintf in place of sprintf

Avoid possible buffer overrun.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago rstream: Add option to specify size of send/recv buffers
Sean Hefty [Wed, 23 May 2012 18:32:46 +0000 (11:32 -0700)]
rstream: Add option to specify size of send/recv buffers

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago rsockets: Change the default QP size from 512 to 384
Sean Hefty [Thu, 24 May 2012 21:31:12 +0000 (14:31 -0700)]
rsockets: Change the default QP size from 512 to 384

Simple latency/bandwidth tests using rstream showed minimal
difference in performance between using a QP sized to 384
entries versus 512.  Reduce the overhead of a default rsocket
by using 384 entries.  A user can request a larger size by
calling rsetsockopt.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago rsockets: Simplify state checks
Sean Hefty [Sat, 26 May 2012 00:28:44 +0000 (17:28 -0700)]
rsockets: Simplify state checks

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago rsocket preload: Use environment variable to set QP size
Sean Hefty [Tue, 22 May 2012 01:46:36 +0000 (18:46 -0700)]
rsocket preload: Use environment variable to set QP size

Allow the user to specify the size of the send/receive queues
and inline data size through environment variables:
RS_SQ_SIZE, RS_RQ_SIZE, and RS_INLINE.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago rsocket: Add option to specify size of inline data
Sean Hefty [Tue, 22 May 2012 18:39:21 +0000 (11:39 -0700)]
rsocket: Add option to specify size of inline data

Allow the user to override the default inline data size.
We still require a minimum size in order to transfer receive
buffer update messages.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago rsockets: Allow user to specify the QP sizes
Sean Hefty [Fri, 18 May 2012 23:56:15 +0000 (16:56 -0700)]
rsockets: Allow user to specify the QP sizes

Add setsockopt options that allow the user to specify the desired
size of the underlying QP.  The provided sizes are used as the
maximum size when creating the QP.  The actual sizes of the QP
are the smaller of the user provided maximum and the maximum
sizes supported by the underlying hardware.

A user may retrieve the actual sizes of the QP through the
getsockopt call.

The send and receive queue sizes are specified separately.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
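
The sizing rule in the commit above (actual size is the smaller of the
requested maximum and the hardware maximum, per queue) can be written
out as a short sketch.  struct qp_sizes and clamp_qp_sizes() are
hypothetical names, and a single hw_max_wr limit for both queues is an
assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pair of queue depths; send and receive are kept separate. */
struct qp_sizes {
    uint32_t sq_size;
    uint32_t rq_size;
};

static uint32_t min_u32(uint32_t a, uint32_t b)
{
    return a < b ? a : b;
}

/* Clamp each requested size to the hardware limit independently. */
static struct qp_sizes clamp_qp_sizes(struct qp_sizes requested,
                                      uint32_t hw_max_wr)
{
    struct qp_sizes actual;

    actual.sq_size = min_u32(requested.sq_size, hw_max_wr);
    actual.rq_size = min_u32(requested.rq_size, hw_max_wr);
    return actual;
}
```

The clamped values are what a later getsockopt-style query would report
back to the user.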
12 years ago rsockets: Define options specific to rsockets
Sean Hefty [Fri, 18 May 2012 23:36:07 +0000 (16:36 -0700)]
rsockets: Define options specific to rsockets

Allow a user to control some of the RDMA related attributes
of an rsocket through setsockopt/getsockopt.  A user specifies
that the rsocket should be modified through SOL_RDMA level.

This patch provides the initial framework.  Subsequent patches
will add the configurable parameters.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago rsockets: Reduce QP size if larger than hardware maximums
Sean Hefty [Sat, 19 May 2012 00:07:11 +0000 (17:07 -0700)]
rsockets: Reduce QP size if larger than hardware maximums

When porting rsockets to iwarp, it was discovered that the default
QP size (512) was larger than that supported by the hardware.
Decrease the size of the QP if the default size is larger than
the maximum supported by the hardware.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago rs-preload: Handle recursive socket() calls
Sean Hefty [Fri, 25 May 2012 19:42:12 +0000 (12:42 -0700)]
rs-preload: Handle recursive socket() calls

When ACM support is enabled in the librdmacm, it will attempt to
establish a socket connection to the ACM daemon.  When the rsocket
preload library is in use, this can result in a recursive call
to socket() that results in the library hanging.  The resulting
call stack is:

socket() -> rsocket() -> rdma_create_id() -> ucma_init() ->
socket() -> rsocket() -> rdma_create_id() -> ucma_init()

The second call to ucma_init() hangs because initialization is
still pending.

Fix this by checking for recursive calls to socket() in the preload
library.  When detected, call the real socket() call instead of
directing the call back into rsocket().  Since rsockets is a part
of the librdmacm, it can call rsockets directly if it wants to use
rsockets instead of standard sockets.

This problem and its cause were reported by Chet Murthy <chet@watson.ibm.com>

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
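
A recursion check of the kind described in the commit above could be
built from a per-thread flag, as in the sketch below.  The names
guarded_socket, demo_accel, and demo_real are hypothetical, and whether
the preload library uses a per-thread or a global flag is not stated in
the commit message.

```c
#include <assert.h>

typedef int (*socket_fn)(int domain, int type, int protocol);

/* Set while the accelerated path is running on this thread. */
static _Thread_local int in_socket_call;

/* Route to 'accelerated' normally, but to 'real' when re-entered. */
static int guarded_socket(socket_fn accelerated, socket_fn real,
                          int domain, int type, int protocol)
{
    int ret;

    if (in_socket_call)
        return real(domain, type, protocol);   /* recursive call: bypass */

    in_socket_call = 1;
    ret = accelerated(domain, type, protocol);
    in_socket_call = 0;
    return ret;
}

/* Demo stand-ins: the accelerated path opens a helper socket while it
 * initializes, much like the ACM connection in the call stack above. */
static int helper_fd = -1;

static int demo_real(int domain, int type, int protocol)
{
    (void)domain; (void)type; (void)protocol;
    return 100;
}

static int demo_accel(int domain, int type, int protocol)
{
    helper_fd = guarded_socket(demo_accel, demo_real, domain, type, protocol);
    return 200;
}
```

The nested call made from inside demo_accel() lands in demo_real()
instead of recursing, which is the behavior the fix is after.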
12 years ago librdmacm: Delay ACM connection until resolving an address
Sean Hefty [Fri, 25 May 2012 17:48:47 +0000 (10:48 -0700)]
librdmacm: Delay ACM connection until resolving an address

Avoid creating a connection to the ACM service when
it's not needed.  For example, if the user of the librdmacm
is a server application, it will not use ACM services.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago acm: Use -1 to indicate an invalid socket rather than 0
Sean Hefty [Fri, 25 May 2012 19:23:10 +0000 (12:23 -0700)]
acm: Use -1 to indicate an invalid socket rather than 0

socket() can return 0 as a valid socket.  This can happen
when using a daemon that closes stdin/out/err.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
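
The point of the commit above is that 0 is a legal descriptor, so "not
open" has to be encoded as -1.  A minimal sketch, with the hypothetical
names svc_conn, svc_init, svc_attach, and svc_is_open:

```c
#include <assert.h>

#define INVALID_SOCK (-1)

/* Hypothetical handle for a connection to a local service. */
struct svc_conn {
    int sock;
};

static void svc_init(struct svc_conn *c)
{
    c->sock = INVALID_SOCK;        /* not 0: fd 0 can be a real socket */
}

static void svc_attach(struct svc_conn *c, int fd)
{
    c->sock = fd;
}

static int svc_is_open(const struct svc_conn *c)
{
    return c->sock >= 0;
}
```

In a daemon that has closed stdin, the first socket() call can return 0,
and svc_is_open() still reports the connection as open.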
12 years ago rsocket: Fix hang in rrecv/rsend after disconnecting
Sean Hefty [Sat, 26 May 2012 00:24:08 +0000 (17:24 -0700)]
rsocket: Fix hang in rrecv/rsend after disconnecting

If a user calls rrecv() after a blocking rsocket has been disconnected,
it will hang.  This problem and its cause were reported by Sridhar Samudrala
<samudrala@us.ibm.com>.  It can be reproduced by running netserver -f -D
using the rs-preload library.  A similar issue exists with rsend().

Fix this by not blocking on a CQ unless we're connected.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago rstream: Check for connection error on async connect
Sean Hefty [Mon, 21 May 2012 23:39:46 +0000 (16:39 -0700)]
rstream: Check for connection error on async connect

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago librdmacm: Check that send and recv CQs are different before destroying
Sean Hefty [Mon, 21 May 2012 05:41:38 +0000 (22:41 -0700)]
librdmacm: Check that send and recv CQs are different before destroying

ucma_destroy_cqs() destroys both the send and recv CQs if they
are non-null.  If the two CQs are actually the same one, this
results in a crash when trying to destroy the second CQ.  Check
that the CQs are different before destroying the second CQ.

This fixes a crash when using rsockets, which sets the send and
recv CQs to the same CQ.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
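
The check described in the commit above can be sketched without any
verbs calls.  struct cq, struct ep, destroy_cq(), and destroy_cqs() are
stand-ins invented for this illustration; they are not the librdmacm
definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a completion queue; counts how often it was destroyed. */
struct cq {
    int destroy_count;
};

/* Stand-in for an endpoint that may share one CQ for both directions. */
struct ep {
    struct cq *send_cq;
    struct cq *recv_cq;
};

static void destroy_cq(struct cq *cq)
{
    cq->destroy_count++;
}

/* Destroy both CQs, but only once when they are the same object. */
static void destroy_cqs(struct ep *ep)
{
    if (ep->recv_cq)
        destroy_cq(ep->recv_cq);
    if (ep->send_cq && ep->send_cq != ep->recv_cq)
        destroy_cq(ep->send_cq);
    ep->send_cq = NULL;
    ep->recv_cq = NULL;
}
```

With a shared CQ the destroy routine runs exactly once, and with two
distinct CQs it runs once for each.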
12 years ago librdmacm: Support older acm.h header files
Sean Hefty [Fri, 18 May 2012 17:00:58 +0000 (10:00 -0700)]
librdmacm: Support older acm.h header files

Older versions of acm.h do not include the resolve_data or
perf_data fields in struct acm_msg.  If we're using an older
version of the acm.h header file, use an internal definition
of struct acm_msg.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago rstream: Add test option to include more sizes
Sean Hefty [Thu, 17 May 2012 16:50:15 +0000 (09:50 -0700)]
rstream: Add test option to include more sizes

Allow user to specify that a full set of transfer sizes should
be tested.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago rstream: Group latency/bandwidth tests together
Sean Hefty [Thu, 17 May 2012 16:26:13 +0000 (09:26 -0700)]
rstream: Group latency/bandwidth tests together

Rather than grouping tests by transfer size, group by the test type.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago rstream: Set rsocket nonblocking if set to async operation
Sean Hefty [Wed, 16 May 2012 22:23:41 +0000 (15:23 -0700)]
rstream: Set rsocket nonblocking if set to async operation

If asynchronous use is specified (use of poll/select), set the
rsocket to nonblocking.  This matches the common usage case for
asynchronous sockets.

When asynchronous support is enabled, the nonblocking/blocking
test option determines whether the poll/select call will block,
or if rstream will spin on the calls.

This provides more flexibility with how the rsocket is used.
Specifically, MPI often uses nonblocking sockets, but spins on
poll/select.  However, many apps will use nonblocking sockets,
but wait on poll/select.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago rstream: Clarify use of async test option
Sean Hefty [Fri, 11 May 2012 17:41:02 +0000 (10:41 -0700)]
rstream: Clarify use of async test option

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago librdmacm/rstream: Set rsocket nonblocking for base tests
Sean Hefty [Fri, 11 May 2012 17:33:13 +0000 (10:33 -0700)]
librdmacm/rstream: Set rsocket nonblocking for base tests

The base set of rstream tests want nonblocking rsockets, but don't
actually set the rsocket to nonblocking.  It instead relies on the
MSG_DONTWAIT flag.  Make the code match the expected behavior and
set the rsocket to nonblocking and make nonblocking the default.

Provide a test option to switch it back to blocking mode.  We keep
the existing nonblocking test option for compatibility.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago rstream: Always set TCP_NODELAY on rsocket
Sean Hefty [Wed, 16 May 2012 22:16:40 +0000 (15:16 -0700)]
rstream: Always set TCP_NODELAY on rsocket

The NODELAY option is coupled with whether the socket is blocking
or nonblocking.  Remove this coupling and always set the NODELAY
option.

NODELAY currently has no effect on rsockets.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago librdmacm/rsocket: Succeed setsockopt REUSEADDR on connected sockets
Sean Hefty [Thu, 10 May 2012 18:17:32 +0000 (11:17 -0700)]
librdmacm/rsocket: Succeed setsockopt REUSEADDR on connected sockets

The RDMA CM fails calls to set REUSEADDR on an rdma_cm_id if
it is not in the idle state.  As a result, this causes a failure
in NetPipe when run with socket calls intercepted by rsockets.
Fix this by returning success when REUSEADDR is set on an rsocket
that has already been connected.  When running over IB, REUSEADDR
is not necessary, since the TCP/IP addresses are mapped.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago rsockets: Optimize synchronization to improve performance
Sean Hefty [Tue, 8 May 2012 00:16:47 +0000 (17:16 -0700)]
rsockets: Optimize synchronization to improve performance

Hotspot performance analysis using VTune showed pthread_mutex_unlock()
as the most significant hotspot when transferring small messages using
rstream.  To reduce the impact of using pthread mutexes, replace them
with a custom lock built using an atomic variable and a semaphore.
When there's no contention for the lock (which is the expected case
for nonblocking sockets), the synchronization is reduced to
incrementing and decrementing an atomic variable.

A test that acquired and released a lock 2 billion times reported that
the custom lock was roughly 20% faster than using the mutex:
26.6 seconds versus 33.0 seconds.

Unfortunately, further analysis showed that using the custom lock
provided a minimal performance gain on rstream itself, and simply
moved the hotspot to the custom unlock call.  The hotspot is likely
a result of some other interaction, rather than caused by slowness
in releasing a lock.  However, we keep the custom lock based on
the results of the direct lock tests that were done.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
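
A lock built from an atomic variable and a semaphore, as the commit
above describes, can be sketched with C11 atomics and a POSIX semaphore.
This is an illustration under those assumptions, not the code from the
patch; struct fastlock and its functions are named here for convenience.

```c
#include <assert.h>
#include <errno.h>
#include <semaphore.h>
#include <stdatomic.h>

struct fastlock {
    atomic_int cnt;   /* current holder plus any waiters */
    sem_t sem;        /* only touched when there is contention */
};

static int fastlock_init(struct fastlock *l)
{
    atomic_init(&l->cnt, 0);
    return sem_init(&l->sem, 0, 0);
}

static void fastlock_destroy(struct fastlock *l)
{
    sem_destroy(&l->sem);
}

static void fastlock_acquire(struct fastlock *l)
{
    /* fetch_add returns the old value: nonzero means the lock is held,
     * so sleep until the holder posts the semaphore. */
    if (atomic_fetch_add(&l->cnt, 1) > 0) {
        while (sem_wait(&l->sem) == -1 && errno == EINTR)
            ;   /* retry if interrupted by a signal */
    }
}

static void fastlock_release(struct fastlock *l)
{
    /* An old value above 1 means a waiter is parked, or about to park,
     * on the semaphore, so hand the lock over with one post. */
    if (atomic_fetch_sub(&l->cnt, 1) > 1)
        sem_post(&l->sem);
}
```

In the uncontended case both calls reduce to a single atomic increment
or decrement and the semaphore is never used.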
12 years ago rping: Replace sprintf with snprintf to protect from buffer overflow
Dotan Barak [Tue, 24 Apr 2012 18:20:55 +0000 (11:20 -0700)]
rping: Replace sprintf with snprintf to protect from buffer overflow

Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago rsocket: Succeed setting SO_KEEPALIVE option
Sean Hefty [Mon, 9 Apr 2012 19:05:39 +0000 (12:05 -0700)]
rsocket: Succeed setting SO_KEEPALIVE option

memcached sets SO_KEEPALIVE, so succeed any requests to set
that option.  We don't actually implement keepalive at this time.

To implement keepalive, we would need to record the last time
that a message was sent or received on an rsocket.  If no
new messages are processed within the keepalive timeout, then
we would need to issue a keepalive message.  For rsockets,
this would simply mean sending a 0-byte control message that
gets ignored on the remote side.

The only real difficulty with handling keepalive is doing it
without adding threads.  This requires additional work in
poll to occasionally timeout, send keepalive messages, then
return to polling if no new data has arrived.  Alternatively,
we can add a thread to handle sending keepalive messages.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago rsocket: Succeed SO_LINGER socket option
Sean Hefty [Fri, 6 Apr 2012 22:40:10 +0000 (15:40 -0700)]
rsocket: Succeed SO_LINGER socket option

Succeed calls to set the SO_LINGER socket option.  We don't
actually implement SO_LINGER semantics because we never place
an rsocket into a timewait state.  Unlike socket behavior,
we do wait for all pending data to be transferred by the hardware.
This is done so that the disconnect message can be sent over
the rsocket connection.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago rsocket: Handle socket option toggling on/off
Sean Hefty [Mon, 9 Apr 2012 19:16:21 +0000 (12:16 -0700)]
rsocket: Handle socket option toggling on/off

If the user turns a socket option off, record that, so that
rgetsockopt returns the correct state of the option.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years ago rsocket: Discard unrecognized control messages
Sean Hefty [Fri, 6 Apr 2012 22:12:36 +0000 (15:12 -0700)]
rsocket: Discard unrecognized control messages

If we receive a control message that is not known, simply discard it.
This provides some ability to support forward compatibility.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
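
Discarding unknown control messages, as in the commit above, usually
comes down to a default branch that does nothing.  The opcodes and the
conn_state structure below are hypothetical and do not reflect the
rsockets wire format.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical control opcodes known to this version. */
enum ctrl_op {
    CTRL_DISCONNECT = 1,
    CTRL_SHUTDOWN   = 2
};

enum ctrl_result { CTRL_HANDLED, CTRL_IGNORED };

struct conn_state {
    int peer_disconnected;
    int peer_shutdown;
};

/* Act on known opcodes.  Anything else is dropped without being treated
 * as an error, so a newer peer can send messages this side predates. */
static enum ctrl_result handle_ctrl(struct conn_state *cs, uint32_t op)
{
    switch (op) {
    case CTRL_DISCONNECT:
        cs->peer_disconnected = 1;
        return CTRL_HANDLED;
    case CTRL_SHUTDOWN:
        cs->peer_shutdown = 1;
        return CTRL_HANDLED;
    default:
        return CTRL_IGNORED;
    }
}
```

An unknown opcode leaves the connection state untouched, which is what
gives the protocol room to grow.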
12 years ago rsocket: Work-arounds to support RH EL5
Sean Hefty [Tue, 21 Feb 2012 20:48:04 +0000 (12:48 -0800)]
rsocket: Work-arounds to support RH EL5

Discard ENOSYS errors when trying to set address reuse.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years agorsocket: Allow use of LD_PRELOAD to intercept socket calls
Sean Hefty [Tue, 3 Apr 2012 21:25:23 +0000 (14:25 -0700)]
rsocket: Allow use of LD_PRELOAD to intercept socket calls

Intercept socket calls and convert TCP socket operation to
streaming over RDMA.

Allow falling back from rsockets to normal sockets on error
or when trying to bind/connect to a reserved port.  This is
needed to handle MPI job startup, where MPI should use rsockets,
but mpiexec needs to communicate using ssh over normal sockets.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years agorsocket: Add sample application to copy files over rsockets
Sean Hefty [Thu, 1 Dec 2011 20:52:28 +0000 (12:52 -0800)]
rsocket: Add sample application to copy files over rsockets

rcopy will copy files from a source system to a specified remote
server.  It's essentially a really dumb FTP type program that can
be used to quickly transfer files between systems, which can be
useful to verify data integrity.

(It was easier to create this program than modify an existing FTP
client and server application, which was my first choice.  Fork
support is difficult.)

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years agorsocket: Add example program that uses rsocket
Sean Hefty [Thu, 8 Dec 2011 19:30:12 +0000 (11:30 -0800)]
rsocket: Add example program that uses rsocket

rstream provides an example that uses either rsocket or socket
APIs.  The latter allows rstream to be used to verify rsocket
behavior compared to socket.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years agolibrdmacm: Define streaming over RDMA interface (rsockets)
Sean Hefty [Tue, 3 Apr 2012 21:25:04 +0000 (14:25 -0700)]
librdmacm: Define streaming over RDMA interface (rsockets)

Introduces a new set of APIs that support a byte streaming interface
over RDMA devices.  The new interface matches sockets, except that all
function calls are prefixed with an 'r'.

The following functions are defined:

rsocket
rbind, rlisten, raccept, rconnect
rshutdown, rclose
rrecv, rrecvfrom, rrecvmsg, rread, rreadv
rsend, rsendto, rsendmsg, rwrite, rwritev
rpoll, rselect
rgetpeername, rgetsockname
rsetsockopt, rgetsockopt, rfcntl

Functions take the same parameters as those used for sockets.  The
following capabilities and flags are supported at this time:

PF_INET, PF_INET6, SOCK_STREAM, IPPROTO_TCP
MSG_DONTWAIT, MSG_PEEK
SO_REUSEADDR, TCP_NODELAY, SO_ERROR, SO_SNDBUF, SO_RCVBUF
O_NONBLOCK

The rpoll call supports polling both rsockets and normal fd's.
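
Because the calls mirror sockets, the same code can be built
against either API.  The sketch below assumes a hypothetical
USE_RSOCKETS build flag and hypothetical rs_* wrapper macros;
without the flag it falls back to normal sockets so that it
builds on a system without librdmacm.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#ifdef USE_RSOCKETS             /* hypothetical build flag for this sketch */
#include <rdma/rsocket.h>
#define rs_socket     rsocket
#define rs_setsockopt rsetsockopt
#define rs_close      rclose
#else
#define rs_socket     socket
#define rs_setsockopt setsockopt
#define rs_close      close
#endif

/* Open a streaming endpoint and enable address reuse; returns fd or -1. */
static int open_stream(void)
{
	int fd, on = 1;

	fd = rs_socket(PF_INET, SOCK_STREAM, IPPROTO_TCP);
	if (fd < 0)
		return -1;

	if (rs_setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof on)) {
		rs_close(fd);
		return -1;
	}
	return fd;
}
```

Only the function names change between the two builds; the
parameters and return conventions are the same.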

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years agoucmatose: Fix segfault on address error
Sagi Grimberg [Wed, 4 Apr 2012 21:36:49 +0000 (14:36 -0700)]
ucmatose: Fix segfault on address error

Client connect_events() should fail if it received some error,
otherwise the program will try to reach a non-existent QP
resource resulting in a segfault.  Return an error from
cma_handler() if we had a connection error.

Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years agoAutomatically detect if ibacm is installed
Sean Hefty [Fri, 2 Mar 2012 01:13:58 +0000 (17:13 -0800)]
Automatically detect if ibacm is installed

If the ibacm header file is available, automatically have the
librdmacm configured to use it.  This removes the --with-ib_acm
configure option.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years agoUpdate rdma_disconnect to indicate both sides should call it.
Sean Hefty [Fri, 2 Mar 2012 17:17:48 +0000 (09:17 -0800)]
Update rdma_disconnect to indicate both sides should call it.

rdma_disconnect should be called from both sides to quickly disconnect.
Clarify this in the man page.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years agolibrdmacm: Check for valid route in ucma_set_ib_route
Sean Hefty [Fri, 30 Sep 2011 21:37:02 +0000 (14:37 -0700)]
librdmacm: Check for valid route in ucma_set_ib_route

ucma_set_ib_route will call rdma_getaddrinfo to obtain IB path
information.  However, rdma_getaddrinfo will return success,
but not provide routing data if no route can be found (the IB
ACM service is not running).  In this case, we can call
rdma_set_option without a valid route.  Although the kernel
will trap this and fail, we can detect the error in the library.
This will speed up the connection rate if IB ACM is not in use.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years agolibrdmacm: Fix warning 'resolve_msg' breaks aliasing rules
Sean Hefty [Fri, 30 Sep 2011 19:02:19 +0000 (12:02 -0700)]
librdmacm: Fix warning 'resolve_msg' breaks aliasing rules

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years agolibrdmacm: Return an error if user specifies AF_IB but it is not supported
Sean Hefty [Fri, 30 Sep 2011 16:17:04 +0000 (09:17 -0700)]
librdmacm: Return an error if user specifies AF_IB but it is not supported

If the user specifies an AF_IB address into rdma_bind_addr,
rdma_resolve_addr, rdma_join_multicast, or rdma_leave_multicast,
but the kernel does not support AF_IB, return an error.

Note that rdma_getaddrinfo will never return an AF_IB address to the
user unless kernel support is present.  An application would need
to construct an AF_IB address by hand before making one of the
above mentioned calls.  This check prevents overrunning the
command buffer written to the kernel.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years agoudaddy: Update udaddy to use rdma_getaddrinfo
Sean Hefty [Thu, 29 Sep 2011 23:57:40 +0000 (16:57 -0700)]
udaddy: Update udaddy to use rdma_getaddrinfo

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years agocmatose: Replace use of getaddrinfo with rdma_getaddrinfo
Sean Hefty [Thu, 29 Sep 2011 21:23:47 +0000 (14:23 -0700)]
cmatose: Replace use of getaddrinfo with rdma_getaddrinfo

Now that rdma_getaddrinfo exists, use it rather than getaddrinfo.
This will eventually allow us to specify native IB addresses into
cmatose once AF_IB support is there.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years agolibrdmacm: Report AF_IB as second rdma_addrinfo
Sean Hefty [Wed, 28 Sep 2011 21:34:01 +0000 (14:34 -0700)]
librdmacm: Report AF_IB as second rdma_addrinfo

If AF_IB is supported, the librdmacm will attempt to convert
AF_INET or AF_INET6 to AF_IB.  Rather than replacing the
AF_INET/6 rdma_addrinfo, provide the AF_IB addresses as a
second rdma_addrinfo linked from the AF_INET/6 version.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years agolibrdmacm: Set errno correctly in ucma_complete
Sean Hefty [Fri, 2 Mar 2012 01:01:41 +0000 (17:01 -0800)]
librdmacm: Set errno correctly in ucma_complete

The status value is negative, so convert it to positive before setting errno.
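
A minimal sketch of the conversion, assuming the status is
either 0 or a negative errno value.  status_to_ret() is a
hypothetical helper, not a librdmacm function.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper: turn a status of 0 or a negative errno value
 * into the usual library convention of returning -1 with errno set
 * to the positive code.
 */
static int status_to_ret(int status)
{
	assert(status <= 0);
	if (!status)
		return 0;

	errno = -status;
	return -1;
}
```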

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years agordma_verbs: Set errno correctly in rdma_get_send/recv_comp
Sean Hefty [Fri, 2 Mar 2012 01:01:39 +0000 (17:01 -0800)]
rdma_verbs: Set errno correctly in rdma_get_send/recv_comp

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years agoMerge branch 'sor'
Sean Hefty [Thu, 15 Dec 2011 00:40:52 +0000 (16:40 -0800)]
Merge branch 'sor'

12 years agolibrdmacm: Update web site and email addresses
Sean Hefty [Thu, 15 Dec 2011 00:38:45 +0000 (16:38 -0800)]
librdmacm: Update web site and email addresses

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years agoMerge branch 'sor'
Sean Hefty [Mon, 12 Dec 2011 19:28:57 +0000 (11:28 -0800)]
Merge branch 'sor'

12 years agoudaddy/ucmatose: allow easy setting of tos in hex
Or Gerlitz [Mon, 12 Dec 2011 19:23:59 +0000 (11:23 -0800)]
udaddy/ucmatose: allow easy setting of tos in hex

Under IBoE, the 3 MSBits of the TOS map to the SL, hence letting
the user specify them in hex makes the interface friendlier.
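
As an illustration, parsing with strtol and base 0 accepts both
decimal and 0x-prefixed hex input.  parse_tos() and tos_to_sl()
are hypothetical helpers, and the shift simply extracts the
three most significant bits of an 8-bit TOS value.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a TOS argument; base 0 accepts decimal, 0x-prefixed hex, and octal. */
static uint8_t parse_tos(const char *arg)
{
	assert(arg);
	return (uint8_t) strtol(arg, NULL, 0);
}

/* Extract the three most significant bits of the 8-bit TOS value. */
static unsigned int tos_to_sl(uint8_t tos)
{
	return tos >> 5;
}
```

With hex input, the top digit makes the selected bits easy to
read: 0x60 is binary 0110 0000, so the three MSBits are 011.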

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years agolibrdmacm: Return ECONNREFUSED from rdma_connect on reject
Sean Hefty [Wed, 23 Nov 2011 01:17:04 +0000 (17:17 -0800)]
librdmacm: Return ECONNREFUSED from rdma_connect on reject

Make the errno return code from rdma_connect consistent with
connect.  The underlying status value is available by reading
the event data.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years agordma/cma: minor code refactoring when saving a string content
Dotan Barak [Mon, 31 Oct 2011 15:53:07 +0000 (08:53 -0700)]
rdma/cma: minor code refactoring when saving a string content

In this case, using strdup provides cleaner code
(and may be a little faster too).

Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years agolibrdmacm/udaddy: Fix resource leak in case of error
Dotan Barak [Wed, 26 Oct 2011 14:19:25 +0000 (07:19 -0700)]
librdmacm/udaddy: Fix resource leak in case of error

Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years agolibrdmacm: Verify size of route_len
Sean Hefty [Wed, 28 Sep 2011 06:22:21 +0000 (23:22 -0700)]
librdmacm: Verify size of route_len

If the user specifies route information on input to rdma_getaddrinfo,
verify that the size of the routing data is something that we're
prepared to handle.

The routing data is only useful if IB ACM is enabled and may be
either struct ibv_path_record or struct ibv_path_data on input.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years agolibrdmacm: Fix duplicate free of connect
Sean Hefty [Tue, 27 Sep 2011 18:19:36 +0000 (11:19 -0700)]
librdmacm: Fix duplicate free of connect

The connect data stored with the cma_id_private is freed in
rdma_connect, since it is no longer needed.  Avoid duplicating
the free in rdma_destroy_id by checking whether connect_len is 0,
rather than whether connect is NULL.
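
The pattern can be sketched as follows, assuming the connect
path zeroes the length but leaves the pointer untouched.  The
struct id_priv type and the three helpers are hypothetical
stand-ins, not the library's code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the private id; only these two fields matter. */
struct id_priv {
	void  *connect;
	size_t connect_len;
};

/* Save a private copy of the connect data. */
static int save_connect(struct id_priv *id, const void *data, size_t len)
{
	assert(len > 0);
	id->connect = malloc(len);
	if (!id->connect)
		return -1;

	memcpy(id->connect, data, len);
	id->connect_len = len;
	return 0;
}

/* Connect path: the data is no longer needed, so free it and zero the length. */
static void finish_connect(struct id_priv *id)
{
	if (id->connect_len) {
		free(id->connect);
		id->connect_len = 0;   /* the pointer itself is left stale */
	}
}

/* Destroy path: trust the length, not the pointer, to avoid a second free. */
static void destroy(struct id_priv *id)
{
	if (id->connect_len)
		free(id->connect);
	id->connect_len = 0;
}
```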

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years agordma/verbs: Fix race polling for completions
Sean Hefty [Fri, 16 Sep 2011 19:06:40 +0000 (12:06 -0700)]
rdma/verbs: Fix race polling for completions

To avoid hanging in rdma_get_send/recv_comp, we need to rearm
the CQ inside of the while loop.  If the CQ is armed,
the HCA will write an entry to the CQ, then generate a CQ
event.  However, a caller could poll the CQ, find the entry,
then attempt to rearm the CQ before the HCA generates the CQ
event.  In this case, the rearm call (ibv_req_notify_cq) will
act as a no-op, since the HCA hasn't finished generating the
event for the previous completion.  At this point, the event
will be queued.

A call to ibv_get_cq_event will find the event, but not
a CQ entry.  The CQ is now not armed, and the next call to
ibv_get_cq_event will block waiting for an event that will
never occur.

Problem was found in an rdma_cm example test under development.
The test can ping-pong messages between two applications.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years agov1.0.15 v1.0.15
Sean Hefty [Wed, 14 Sep 2011 18:23:46 +0000 (11:23 -0700)]
v1.0.15

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years agolibrdmacm: Fix resource in rdma_migrate_id() error flow
Dotan Barak [Tue, 23 Aug 2011 19:20:56 +0000 (12:20 -0700)]
librdmacm: Fix resource in rdma_migrate_id() error flow

Prevent a resource leak by destroying the event channel before returning
from the function in an error flow.

Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years agolibrdmacm: Fix resource leak when CMA_CREATE_MSG_CMD_RESP fails
Sean Hefty [Mon, 22 Aug 2011 23:31:46 +0000 (16:31 -0700)]
librdmacm: Fix resource leak when CMA_CREATE_MSG_CMD_RESP fails

If resources are allocated before CMA_CREATE_MSG_CMD_RESP or
CMA_CREATE_MSG_CMD is called, and that call fails, we need
to clean up the resources before returning.

Fix this by changing the CMA_CREATE_MSG macros so that they no
longer call alloca or return.  The request and response structures
are now declared directly on the stack.  To accomplish this,
we merge the abi header definition into each command structure.

Problem reported by: Dotan Barak <dotanb@dev.mellanox.co.il>

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
12 years agordma_xserver/client: Add new test apps
Sean Hefty [Mon, 6 Jun 2011 19:25:24 +0000 (12:25 -0700)]
rdma_xserver/client: Add new test apps

Add new versions of the rdma_server and rdma_client tests that
support other types of connections and show how to use more
RDMA features.  We keep the existing rdma_server and rdma_client
tests as simple examples.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>