Doug Ledford [Wed, 18 Jun 2014 17:45:23 +0000 (10:45 -0700)]
rdma_server: handle IBV_SEND_INLINE correctly
Not all RDMA devices support IBV_SEND_INLINE. At least some of those
that don't will ignore the flag passed to rdma_post_send and attempt to
send the command using an SGE entry instead. Because we never
register the send memory, this fails. The proper way to deal with
IBV_SEND_INLINE not being guaranteed is to check the value returned in
our cap struct to see whether the device supports inline data and, if
not, fall back to non-inline sends and register the send memory
region.
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
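The fallback described above can be sketched as a small decision helper. This is an illustrative sketch, not the actual example code: the flag constant is defined locally (in the real code it comes from <infiniband/verbs.h>), and the helper name is hypothetical.

```c
#include <stddef.h>

/* Local stand-in for IBV_SEND_INLINE from <infiniband/verbs.h>, so the
 * sketch is self-contained; pick_send_flags() is a hypothetical name. */
#define SEND_INLINE_FLAG 0x8

/* After QP creation, the provider writes the inline size it actually
 * supports back into the cap struct; max_inline_data may be smaller
 * than requested, or even 0. */
static int pick_send_flags(unsigned int max_inline_data, size_t msg_len,
			   int base_flags)
{
	if (max_inline_data >= msg_len)
		return base_flags | SEND_INLINE_FLAG;
	/* Device cannot inline this message: the caller must instead
	 * register the send buffer with ibv_reg_mr() and post the send
	 * through the resulting MR. */
	return base_flags;
}
```

The caller checks the returned flags once at setup and registers the send buffer only when the inline path is unavailable.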
Doug Ledford [Wed, 18 Jun 2014 17:44:49 +0000 (10:44 -0700)]
rdma_client: handle IBV_SEND_INLINE correctly
Not all RDMA devices support IBV_SEND_INLINE. At least some of those
that don't will ignore the flag passed to rdma_post_send and attempt to
send the command using an SGE entry instead. Because we never
register the send memory, this fails. The proper way to deal with
IBV_SEND_INLINE not being guaranteed is to check the value returned in
our cap struct to see whether the device supports inline data and, if
not, fall back to non-inline sends and register the send memory
region.
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Doug Ledford [Wed, 18 Jun 2014 17:44:28 +0000 (10:44 -0700)]
rdma_server: use perror, unwind allocs on failure
Our main test function prints out errno directly, which is hard to read
as it's not decoded at all. Instead, use perror() to make failures more
readable. Also redo the failure flow so that we can do a simple unwind
at the end of the function and just jump to the right unwind spot on
error.
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
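The error-handling style described above can be sketched with plain allocations standing in for the real rdma/verbs resources (function and label names are illustrative):

```c
#include <stdio.h>
#include <stdlib.h>

/* Minimal sketch of the perror() + goto-unwind pattern: each failure
 * prints a decoded errno message and jumps to the unwind point that
 * frees everything allocated so far, in reverse order. */
static int run_test(void)
{
	int ret = -1;
	char *send_buf, *recv_buf;

	send_buf = malloc(4096);
	if (!send_buf) {
		perror("malloc send_buf");	/* decoded, not a raw errno */
		goto out;
	}
	recv_buf = malloc(4096);
	if (!recv_buf) {
		perror("malloc recv_buf");
		goto free_send;
	}

	/* ... the actual test work would go here ... */
	ret = 0;

	free(recv_buf);
free_send:
	free(send_buf);
out:
	return ret;
}
```

The same shape scales to the real resources (ids, QPs, MRs): every allocation adds one label, and every error path jumps to exactly the right unwind spot.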
Doug Ledford [Wed, 18 Jun 2014 17:44:13 +0000 (10:44 -0700)]
rdma_client: use perror, unwind allocs on failure
Our main test function prints out errno directly, which is hard to read
as it's not decoded at all. Instead, use perror() to make failures more
readable. Also redo the failure flow so that we can do a simple unwind
at the end of the function and just jump to the right unwind spot on
error.
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Doug Ledford [Wed, 18 Jun 2014 17:43:04 +0000 (10:43 -0700)]
cmtime: rework program to be multithreaded
When using very large numbers of connections (10,000 in this case),
fixing a performance problem in the kernel cma.c code exposed a new
one: with the kernel issue resolved, 10,000 connect requests would
flood the server side of the test, and the cmtime application would
respond as quickly as possible. The client side, however, did not
check any of the responses until after it had sent all 10,000 connect
requests. While the kernel had a serializing performance problem this
was OK; once that was fixed, it caused a general slowdown in connect
operations due to overruns in the event processing. This patch makes
the client side fire off threads that handle responses to connect
requests as they come in, instead of allowing them to backlog
uncontrollably. Times for a 10,000-connection run changed from this:
[root@rdma-dev-01 ~]# more 3.12.0-rc1.cached_gids+optimized_connect+trimmed_cache+.output
ib1:
step total ms max ms min us us / conn
create id : 46.64 0.10 1.00 4.66
bind addr : 89.61 0.04 7.00 8.96
resolve addr : 50.63 26.18 23976.00 5.06
resolve route: 565.44 538.77 26736.00 56.54
create qp : 4028.31 5.70 326.00 402.83
connect : 50077.42 49990.49 90734.00 5007.74
disconnect : 5277.25 4850.35 380017.00 527.72
destroy : 42.15 0.04 2.00 4.21
ib0:
step total ms max ms min us us / conn
create id : 34.82 0.04 1.00 3.48
bind addr : 25.94 0.02 1.00 2.59
resolve addr : 48.18 25.01 22779.00 4.82
resolve route: 501.28 476.26 25071.00 50.13
create qp : 3274.12 6.05 257.00 327.41
connect : 55549.64 55490.32 62150.00 5554.96
disconnect : 5263.64 4851.18 375628.00 526.36
destroy : 47.20 0.07 2.00 4.72
to this:
[root@rdma-dev-01 ~]# more 3.12.0-rc1.cached_gids+optimized_connect+trimmed_cache+-fixed-cmtime.output
ib1:
step total ms max ms min us us / conn
create id : 34.45 0.08 1.00 3.44
bind addr : 88.41 0.04 7.00 8.84
resolve addr : 33.59 4.65 612.00 3.36
resolve route: 618.68 0.61 97.00 61.87
create qp : 4024.03 6.30 341.00 402.40
connect : 6983.35 6886.33 8509.00 698.33
disconnect : 5066.47 230.34 831.00 506.65
destroy : 37.02 0.03 2.00 3.70
ib0:
step total ms max ms min us us / conn
create id : 42.61 0.14 1.00 4.26
bind addr : 27.05 0.03 2.00 2.70
resolve addr : 40.65 10.73 869.00 4.06
resolve route: 626.75 0.60 103.00 62.68
create qp : 3334.50 6.48 273.00 333.45
connect : 6310.29 6251.59 13298.00 631.03
disconnect : 5111.12 365.87 867.00 511.11
destroy : 36.57 0.02 2.00 3.66
with this patch.
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
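The restructuring above can be sketched with plain pthreads: worker threads drain a (simulated) response stream concurrently with the poster, instead of checking responses only after everything has been posted. Everything here is illustrative; the real cmtime processes rdma_cm events, not counters.

```c
#include <pthread.h>
#include <stdatomic.h>

#define NUM_CONN    10000
#define NUM_THREADS 4

static atomic_int posted;	/* connect requests issued so far */
static atomic_int handled;	/* connect responses processed so far */

/* Worker: claim and "process" responses as they become available,
 * so they never backlog behind the posting loop. */
static void *event_worker(void *arg)
{
	(void)arg;
	for (;;) {
		int h = atomic_load(&handled);
		if (h >= NUM_CONN)
			break;
		if (h < atomic_load(&posted))
			/* claim one response; the CAS fails harmlessly
			 * if another worker got there first */
			atomic_compare_exchange_weak(&handled, &h, h + 1);
	}
	return NULL;
}

static int run_demo(void)
{
	pthread_t tids[NUM_THREADS];
	int i;

	atomic_store(&posted, 0);
	atomic_store(&handled, 0);
	for (i = 0; i < NUM_THREADS; i++)
		pthread_create(&tids[i], NULL, event_worker, NULL);
	for (i = 0; i < NUM_CONN; i++)
		atomic_fetch_add(&posted, 1);	/* stands in for rdma_connect() */
	for (i = 0; i < NUM_THREADS; i++)
		pthread_join(tids[i], NULL);
	return atomic_load(&handled);
}
```

Because responses are consumed while requests are still being posted, the event queue stays short, which is what eliminated the overruns above.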
Sean Hefty [Thu, 22 May 2014 23:13:08 +0000 (16:13 -0700)]
indexer: Free index_map resources when cleared
Free memory allocated for index map entries when they are no
longer in use. To handle this, count the number of entries
stored by the index map item arrays and release the arrays when
no items are being tracked.
This reduces valgrind noise.
Problem reported by: Hannes Weisbach <hannes_weisbach@gmx.net>
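The counting scheme described above can be sketched with a simplified two-level map; the names, sizes, and layout here are illustrative, not the real indexer API.

```c
#include <stdlib.h>

#define CHUNK_BITS 6
#define CHUNK_SIZE (1 << CHUNK_BITS)
#define NUM_CHUNKS 64

/* Each second-level array counts its live entries and is released
 * once that count drops to zero. */
struct chunk {
	void *slots[CHUNK_SIZE];
	int used;		/* live entries in this array */
};

static struct chunk *map[NUM_CHUNKS];

/* Store an item; assumes the slot was previously empty. */
static int map_set(int index, void *item)
{
	struct chunk *c = map[index >> CHUNK_BITS];

	if (!c) {
		c = calloc(1, sizeof(*c));
		if (!c)
			return -1;
		map[index >> CHUNK_BITS] = c;
	}
	c->slots[index & (CHUNK_SIZE - 1)] = item;
	c->used++;
	return 0;
}

static void map_clear(int index)
{
	struct chunk *c = map[index >> CHUNK_BITS];

	if (!c || !c->slots[index & (CHUNK_SIZE - 1)])
		return;
	c->slots[index & (CHUNK_SIZE - 1)] = NULL;
	if (--c->used == 0) {	/* no items tracked: release the array */
		free(c);
		map[index >> CHUNK_BITS] = NULL;
	}
}
```

Freeing the whole array only when its count hits zero is what removes the valgrind noise without per-entry bookkeeping.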
Sean Hefty [Thu, 17 Apr 2014 05:01:51 +0000 (22:01 -0700)]
rsocket: Relax requirement for minimal inline data
Inline data support is optional. Allow rsockets to work
with devices that do not support inline data, provided
that they do support RDMA writes with immediate data.
This allows rsockets to work over the Intel TrueScale HCA.
Patch derived from work by: Amir Hanania
Signed-off-by: Amir Hanania <amir.hanania@intel.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Sean Hefty [Thu, 17 Apr 2014 05:33:38 +0000 (22:33 -0700)]
rsocket: Modify when control messages are available
Rsockets currently tracks how many control messages (i.e.
entries in the send queue) are available using a
single ctrl_avail counter. Seems simple enough.
However, control messages currently require the use of
inline data. In order to support control messages that
do not use inline data, we need to associate each
control message with a specific data buffer. This will
become easier to manage if we modify how we track when
control messages are available.
We replace the single ctrl_avail counter with two new
counters. The new counters conceptually treat control
messages as if each message had its own sequence number.
The sequence number will then be able to correspond to
a specific data buffer in a follow up patch.
ctrl_seqno will be used to indicate the current control
message being sent. ctrl_max_seqno will track the
highest control message that may be sent.
A side effect of this change is that we will be able to
see how many control messages have been sent. This also
separates the updating of the control count on the
sending side, versus the receiving side.
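The two-counter scheme described above can be sketched as follows; the field names come from the commit message, but the surrounding struct and helpers are simplified stand-ins.

```c
#include <stdint.h>

struct ctrl_state {
	uint32_t ctrl_seqno;		/* next control message to send */
	uint32_t ctrl_max_seqno;	/* highest seqno we may send */
};

/* Equivalent of the old ctrl_avail counter */
static int ctrl_avail(const struct ctrl_state *s)
{
	return (int)(s->ctrl_max_seqno - s->ctrl_seqno);
}

/* Send one control message; returns its sequence number, or -1 if
 * sending would exceed the current window. */
static int ctrl_send(struct ctrl_state *s)
{
	if (!ctrl_avail(s))
		return -1;
	/* the seqno identifies this message, and in the follow-up patch
	 * can select a dedicated data buffer, e.g.
	 * buf[s->ctrl_seqno % nr_ctrl_bufs] */
	return (int)(s->ctrl_seqno++);
}

/* Called when the completion for a control message is reaped */
static void ctrl_done(struct ctrl_state *s)
{
	s->ctrl_max_seqno++;
}
```

Note how the sender only ever advances ctrl_seqno and the completion path only ever advances ctrl_max_seqno, giving the separation of the two sides that the commit message describes.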
Sean Hefty [Thu, 17 Apr 2014 15:37:47 +0000 (08:37 -0700)]
rsocket: Dedicate a fixed number of SQEs for control messages
The number of SQEs allocated for control messages is set
to one of two constant values (either 4 or 2). A default
value is used unless the size of the SQ is below a certain
threshold (16 entries). This results in additional code
complexity, and it is highly unlikely that the SQ would
ever be allocated smaller than 16 entries.
Simplify the code to use a single constant value for the
number of SQEs allocated for control messages. This will
also help in subsequent patches that will need to deal
with HCAs that do not support inline data.
Sean Hefty [Thu, 17 Apr 2014 04:42:06 +0000 (21:42 -0700)]
rsocket: Check max inline data after creating QP
The ipath provider will ignore the max_inline_size
specified as input into ibv_create_qp and instead
return the size that it supports (which is 0) on
output.
Update the actual inline size returned from create QP,
and check that it meets the minimum requirement for
rsockets.
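A sketch of the check, with a hypothetical minimum and a simplified cap struct standing in for the verbs QP capability attributes:

```c
/* RS_MIN_INLINE is a hypothetical minimum for illustration; the struct
 * stands in for the cap attributes read back after ibv_create_qp(). */
#define RS_MIN_INLINE 64

struct qp_cap { unsigned int max_inline_data; };

/* The provider may overwrite max_inline_data on output (ipath returns
 * 0 regardless of the requested size), so re-read it after QP creation
 * and verify it meets the minimum. Returns 1 if inline sends are
 * usable, 0 if the caller must fall back to the non-inline path. */
static int check_inline(struct qp_cap *cap, unsigned int requested)
{
	if (cap->max_inline_data < RS_MIN_INLINE)
		return 0;
	/* never claim more than was asked for */
	if (cap->max_inline_data > requested)
		cap->max_inline_data = requested;
	return 1;
}
```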