rsocket: Segmentation fault fix in case of multiple connections
When more than 16 rsocket connections are established,
the "svc->rss" buffer is reallocated with more memory.
Index 0 is reserved for the service's communication
socket, and this was not taken into account when data
is copied from the old buffer location to the new one.
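The off-by-one can be illustrated with a minimal sketch. This is not the
actual rsocket code: struct svc_sketch, its fields, and svc_grow are
hypothetical stand-ins, and the only assumption taken from the text above
is that slot 0 is reserved, so the rsocket entries occupy indexes 1..cnt.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the service state; names are illustrative. */
struct svc_sketch {
	int cnt;	/* number of rsocket entries, stored at indexes 1..cnt */
	int size;	/* number of rsocket entries the array has room for */
	void **rss;	/* size + 1 slots: slot 0 is the service's own socket */
};

/* Grow the array by 'grow' entries, keeping slot 0 and all cnt entries. */
static int svc_grow(struct svc_sketch *svc, int grow)
{
	int new_size;
	void **rss;

	assert(grow > 0);
	new_size = svc->size + grow;
	rss = calloc((size_t)new_size + 1, sizeof(*rss));
	if (!rss)
		return -1;

	if (svc->rss) {
		/* Copy cnt + 1 slots, not cnt: slot 0 has to be counted,
		 * otherwise the entry at index cnt is left behind. */
		memcpy(rss, svc->rss, sizeof(*rss) * ((size_t)svc->cnt + 1));
		free(svc->rss);
	}
	svc->rss = rss;
	svc->size = new_size;
	return 0;
}
```

Copying only cnt slots would preserve indexes 0..cnt-1 and leave the last
rsocket pointer NULL in the new buffer, which is one plausible way to end
up with the reported segmentation fault.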
Signed-off-by: Ilya Nelkenbaum <ilyan@mellanox.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Sean Hefty [Wed, 16 Jul 2014 20:44:56 +0000 (13:44 -0700)]
riostream: Only verify last data transfer
Data verification will fail when running the bandwidth
tests or when the transfer count is > 1. The issue is that
subsequent writes by the initiator side will overwrite
the data in the target buffer before the receiver can
verify that it is correct.
To fix this, only verify that the data in the buffer
is correct after the last transfer has completed.
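As a rough illustration of the approach, the sketch below simulates the
transfers with memcpy in place of RDMA writes. It is not the riostream
code: fill_pattern, verify_pattern, and run_transfers are hypothetical
helpers, and the per-transfer pattern is an assumption made only so that
an overwritten buffer is detectable.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fill buf with a pattern that depends on the transfer number. */
static void fill_pattern(unsigned char *buf, size_t len, int xfer)
{
	size_t i;

	for (i = 0; i < len; i++)
		buf[i] = (unsigned char)(xfer + i);
}

/* Return 0 if buf holds the pattern of transfer xfer, -1 otherwise. */
static int verify_pattern(const unsigned char *buf, size_t len, int xfer)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (buf[i] != (unsigned char)(xfer + i))
			return -1;
	return 0;
}

/* Run 'count' transfers into target and verify only the final one. */
static int run_transfers(unsigned char *target, size_t len, int count)
{
	unsigned char src[64];
	int i;

	assert(len <= sizeof(src) && count > 0);
	for (i = 0; i < count; i++) {
		fill_pattern(src, len, i);
		/* Stand-in for an RDMA write: each transfer overwrites the
		 * previous contents of the target buffer. */
		memcpy(target, src, len);
	}
	/* Earlier transfers may already be overwritten, so only the
	 * contents left by the last transfer can be checked reliably. */
	return verify_pattern(target, len, count - 1);
}
```

Checking the buffer against any earlier transfer number fails once a later
write has landed, which mirrors the failure described above.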
0-byte RDMA writes appear to be working correctly with
HCAs from 2 different vendors. The original problem that
was reported turned out to be a user error.
Sean Hefty [Thu, 3 Jul 2014 20:45:52 +0000 (13:45 -0700)]
rsocket: Update correct rsocket keepalive time
When the keepalive time of an rsocket is updated, the
updated information is forwarded to the keepalive service
thread. However, the thread updates the time for the
wrong service as shown:
Sean Hefty [Thu, 3 Jul 2014 20:55:39 +0000 (13:55 -0700)]
rsocket: Fix removing rsocket from service thread
When removing an rsocket from a service thread, we replace
the removed service with the one at the end of the service list.
This keeps the array tightly packed. However, rs_svc_rm_rs
decrements the rsocket count before doing the swap. The result
is that the entry at the end of the list gets dropped off.
Defer decrementing the count until the swap has been made.
In this case, the cnt value is a valid index into the array,
because we start at index 1. Index 0 is used internally by
the service thread.
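The ordering can be sketched as follows. This is an illustrative
reconstruction, not the body of rs_svc_rm_rs: struct svc_sketch and
svc_remove are hypothetical names, and the fixed-size array is an
assumption made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

#define SVC_SKETCH_MAX 8

/* Hypothetical stand-in: entries live at indexes 1..cnt, slot 0 is
 * reserved for the service thread itself. */
struct svc_sketch {
	int cnt;
	void *rss[SVC_SKETCH_MAX + 1];
};

/* Remove rs from the list; return 0 on success, -1 if it is not found. */
static int svc_remove(struct svc_sketch *svc, void *rs)
{
	int i;

	assert(svc->cnt <= SVC_SKETCH_MAX);
	for (i = 1; i <= svc->cnt; i++) {
		if (svc->rss[i] != rs)
			continue;
		/* Move the last entry into the hole first. rss[cnt] is a
		 * valid slot because the entries start at index 1. */
		svc->rss[i] = svc->rss[svc->cnt];
		svc->rss[svc->cnt] = NULL;
		/* Only now shrink the count. Decrementing before the swap
		 * would copy rss[cnt - 1] instead and drop the last entry. */
		svc->cnt--;
		return 0;
	}
	return -1;
}
```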
Sean Hefty [Wed, 2 Jul 2014 22:37:10 +0000 (15:37 -0700)]
rsocket: Fix crash resulting from keepalive timeout
The following crash was reported by Hal Rosenstock,
<hal@mellanox.com>, with keepalive enabled. The crash
occurs in the keepalive thread attempting to send a
keepalive message.
report:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffecf08700 (LWP 6013)]
rs_post_write (rs=<value optimized out>, sgl=0x0, nsge=0, wr_data=3758096385,
flags=0, addr=0, rkey=0) at src/rsocket.c:1660
1660 return rdma_seterrno(ibv_post_send(rs->cm_id->qp, &wr, &bad));
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.107.el6.x86_64
(gdb)
(gdb) p/x rs
$1 = value has been optimized out
So I added in the following to debug:
1660 if (rs == NULL)
1661 abort();
1662 if (rs->cm_id == NULL)
1663 abort();
1664 if (rs->cm_id->qp == NULL)
1665 abort();
1666 return rdma_seterrno(ibv_post_send(rs->cm_id->qp, &wr, &bad));
1667 }
And saw in gdb:
Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffecf08700 (LWP 8096)]
0x00000030d50328a5 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.107.el6.x86_64
(gdb)
(gdb) bt
#0 0x00000030d50328a5 in raise () from /lib64/libc.so.6
#1 0x00000030d5034085 in abort () from /lib64/libc.so.6
#2 0x00007ffff057fe23 in rs_post_write (rs=<value optimized out>, sgl=0x1fa0,
nsge=6, wr_data=4294967295, flags=0, addr=0, rkey=0) at src/rsocket.c:1665
#3 0x00007ffff058193d in tcp_svc_send_keepalive (arg=0x7ffff0789f20)
at src/rsocket.c:4245
#4 tcp_svc_run (arg=0x7ffff0789f20) at src/rsocket.c:4279
#5 0x00000030d5807851 in start_thread () from /lib64/libpthread.so.0
#6 0x00000030d50e890d in clone () from /lib64/libc.so.6
(gdb) fr 2
#2 0x00007ffff057fe23 in rs_post_write (rs=<value optimized out>, sgl=0x1fa0,
nsge=6, wr_data=4294967295, flags=0, addr=0, rkey=0) at src/rsocket.c:1665
1665 abort();
So qp is NULL somehow...
:end report
There is an issue if an rsocket is closed without going through
rshutdown.
int rshutdown(int socket, int how)
{
	...
	if (rs->opts & RS_OPT_SVC_ACTIVE)
		rs_notify_svc(&tcp_svc, rs, RS_SVC_REM_KEEPALIVE);
We remove the rsocket from the keepalive thread in rshutdown.
int rclose(int socket)
{
	...
	if (rs->state & rs_connected)
		rshutdown(socket, SHUT_RDWR);
	...
	rs_free(rs);
rclose will call rshutdown only if we're connected. However, if the
keepalive failed, the socket will be in an error state. In that
case rshutdown is never called, which leaves the freed rsocket on
the keepalive thread's list.
The fix is to have rclose remove an rsocket from being processed
by a service thread if it is still active.
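A self-contained sketch of that close path is shown below. It models the
logic only and does not use the rsocket API: struct sock_sketch, the SK_*
flags, and the ka_* list helpers are all hypothetical stand-ins for the
rsocket state, RS_OPT_SVC_ACTIVE, and the keepalive service list.

```c
#include <assert.h>
#include <stdlib.h>

#define SK_CONNECTED	0x1	/* hypothetical state bits */
#define SK_ERROR	0x2
#define SK_SVC_ACTIVE	0x1	/* hypothetical option bit */
#define KA_MAX		8

struct sock_sketch {
	int state;
	int opts;
};

/* Hypothetical keepalive list, standing in for the service thread. */
static struct sock_sketch *ka_list[KA_MAX];
static int ka_cnt;

static void ka_add(struct sock_sketch *rs)
{
	assert(ka_cnt < KA_MAX);
	ka_list[ka_cnt++] = rs;
	rs->opts |= SK_SVC_ACTIVE;
}

static void ka_remove(struct sock_sketch *rs)
{
	int i;

	for (i = 0; i < ka_cnt; i++) {
		if (ka_list[i] == rs) {
			ka_list[i] = ka_list[ka_cnt - 1];
			ka_list[--ka_cnt] = NULL;
			break;
		}
	}
	rs->opts &= ~SK_SVC_ACTIVE;
}

static void sk_shutdown(struct sock_sketch *rs)
{
	if (rs->opts & SK_SVC_ACTIVE)
		ka_remove(rs);
	rs->state &= ~SK_CONNECTED;
}

static void sk_close(struct sock_sketch *rs)
{
	if (rs->state & SK_CONNECTED)
		sk_shutdown(rs);
	/* The added check: a socket in an error state skipped the
	 * shutdown path above, so it may still be on the list. */
	if (rs->opts & SK_SVC_ACTIVE)
		ka_remove(rs);
	free(rs);
}
```

Without the second check in sk_close, a socket whose state changed from
connected to error would be freed while its pointer remained on the list,
which matches the use-after-free pattern in the report.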
Testing has shown that this does not always result in the
keep-alive message working correctly, such that a broken
connection is reported as having failed. The reason for this
behavior is unknown, but revert the patch until the issue has
been resolved.