Sean Hefty [Fri, 11 May 2012 17:33:13 +0000 (10:33 -0700)]
librdmacm/rstream: Set rsocket nonblocking for base tests
The base set of rstream tests want nonblocking rsockets but don't
actually set the rsocket to nonblocking; they rely instead on the
MSG_DONTWAIT flag. Make the code match the expected behavior by
setting the rsocket to nonblocking, and make nonblocking the default.
Provide a test option to switch back to blocking mode, and keep
the existing nonblocking test option for compatibility.
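The pattern for making a descriptor nonblocking is the standard fcntl() dance; rsockets mirrors it through its own fcntl entry point. A minimal sketch, shown here with plain fcntl() on an ordinary descriptor so it stands alone (the rsocket variant would go through the library's fcntl wrapper instead):

```c
#include <fcntl.h>
#include <unistd.h>

/* Set a descriptor nonblocking while preserving its other flags. */
static int set_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL, 0);

	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}
```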
Sean Hefty [Thu, 10 May 2012 18:17:32 +0000 (11:17 -0700)]
librdmacm/rsocket: Succeed setsockopt REUSEADDR on connected sockets
The RDMA CM fails calls to set REUSEADDR on an rdma_cm_id that
is not in the idle state. As a result, NetPipe fails when run
with socket calls intercepted by rsockets.
Fix this by returning success when REUSEADDR is set on an rsocket
that has already been connected. When running over IB, REUSEADDR
is not necessary, since the TCP/IP addresses are mapped.
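The shape of the fix can be sketched as a short-circuit in the setsockopt path: if the rsocket is already connected, report success for SO_REUSEADDR rather than forwarding the option to the RDMA CM. The state names below are illustrative, not the library's actual identifiers:

```c
#include <sys/socket.h>

/* Illustrative rsocket states; the real library tracks more. */
enum rs_state { rs_idle, rs_connected };

/* Handle SO_REUSEADDR: a no-op success on a connected rsocket,
 * since the underlying rdma_cm_id would reject the option anyway
 * and the address mapping makes it unnecessary over IB. */
static int rs_set_reuseaddr(enum rs_state state)
{
	if (state == rs_connected)
		return 0;	/* already connected: nothing to change */

	/* ... otherwise forward the option to the RDMA CM ... */
	return 0;
}
```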
Sean Hefty [Tue, 8 May 2012 00:16:47 +0000 (17:16 -0700)]
rsockets: Optimize synchronization to improve performance
Hotspot performance analysis using VTune showed pthread_mutex_unlock()
as the most significant hotspot when transferring small messages using
rstream. To reduce the impact of using pthread mutexes, replace it
with a custom lock built using an atomic variable and a semaphore.
When there's no contention for the lock (which is the expected case
for nonblocking sockets), the synchronization is reduced to
incrementing and decrementing an atomic variable.
A test that acquired and released a lock 2 billion times showed the
custom lock roughly 20% faster than the mutex: 26.6 seconds versus
33.0 seconds.
Unfortunately, further analysis showed that using the custom lock
provided a minimal performance gain on rstream itself, and simply
moved the hotspot to the custom unlock call. The hotspot is likely
a result of some other interaction, rather than caused by slowness
in releasing a lock. However, we keep the custom lock based on
the results of the direct lock tests that were done.
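The lock described above can be sketched as an atomic counter backed by a semaphore. On the uncontended path only the atomic is touched; the semaphore exists solely to park and wake waiters. A minimal sketch in C11:

```c
#include <semaphore.h>
#include <stdatomic.h>

typedef struct {
	sem_t sem;		/* waiters sleep here; starts at 0 */
	atomic_int cnt;		/* number of threads in or waiting */
} fastlock_t;

static void fastlock_init(fastlock_t *lock)
{
	sem_init(&lock->sem, 0, 0);
	atomic_init(&lock->cnt, 0);
}

static void fastlock_acquire(fastlock_t *lock)
{
	/* First claimant sees 0 -> 1 and proceeds without a syscall;
	 * later claimants sleep on the semaphore. */
	if (atomic_fetch_add(&lock->cnt, 1) > 0)
		sem_wait(&lock->sem);
}

static void fastlock_release(fastlock_t *lock)
{
	/* If another thread bumped the counter while we held the
	 * lock, wake exactly one waiter. */
	if (atomic_fetch_sub(&lock->cnt, 1) > 1)
		sem_post(&lock->sem);
}
```

The design trades the mutex's general fairness machinery for a two-instruction fast path, which is why it pays off only when contention is rare, as with nonblocking sockets.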
Sean Hefty [Tue, 8 May 2012 00:16:47 +0000 (17:16 -0700)]
rsockets: Optimize synchronization to improve performance
Performance analysis using VTune showed that pthread_mutex_unlock()
is the single biggest contributor to increasing latency for 64-byte
transfers. It was followed by get_sw_cqe(), then
__pthread_mutex_lock(). Replace the use of mutexes with an atomic
and a semaphore. When there's no contention for the lock (which
would usually be the case when using nonblocking sockets), the
code simply increments and decrements an atomic variable. Semaphores
are only used when contention occurs.
Sean Hefty [Mon, 9 Apr 2012 19:05:39 +0000 (12:05 -0700)]
rsocket: Succeed setting SO_KEEPALIVE option
memcached sets SO_KEEPALIVE, so succeed any requests to set
that option. We don't actually implement keepalive at this time.
To implement keepalive, we would need to record the last time
that a message was sent or received on an rsocket. If no
new messages are processed within the keepalive timeout, then
we would need to issue a keepalive message. For rsockets,
this would simply mean sending a 0-byte control message that
gets ignored on the remote side.
The only real difficulty with handling keepalive is doing it
without adding threads. This requires additional work in
poll to occasionally timeout, send keepalive messages, then
return to polling if no new data has arrived. Alternatively,
we can add a thread to handle sending keepalive messages.
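The poll-timeout bookkeeping described above reduces to a small computation: given the time of the last message and the keepalive interval, work out how long poll() may sleep before a 0-byte keepalive is due. A hypothetical helper sketching that (rsockets has no keepalive implementation; the name and signature are illustrative):

```c
#include <time.h>

/* Return the poll() timeout in milliseconds: 0 means the keepalive
 * interval has already elapsed and a keepalive message should be
 * sent now; otherwise, the remaining time until one is due. */
static int keepalive_poll_timeout(time_t last_activity, time_t now,
				  int keepalive_secs)
{
	time_t elapsed = now - last_activity;

	if (elapsed >= keepalive_secs)
		return 0;
	return (int) (keepalive_secs - elapsed) * 1000;
}
```

A poll loop would use this as its timeout, send the keepalive on a zero return, refresh last_activity on any traffic, and go back to polling.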