Sean Hefty [Fri, 11 May 2012 17:33:13 +0000 (10:33 -0700)]
librdmacm/rstream: Set rsocket nonblocking for base tests
The base set of rstream tests wants nonblocking rsockets, but doesn't
actually set the rsocket to nonblocking. It instead relies on the
MSG_DONTWAIT flag. Make the code match the expected behavior: set
the rsocket to nonblocking and make nonblocking the default.
Provide a test option to switch it back to blocking mode. We keep
the existing nonblocking test option for compatibility.
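
The difference between passing MSG_DONTWAIT per call and marking the
descriptor itself nonblocking can be sketched as follows. This is a minimal
illustration on an ordinary POSIX socket, not the rstream code: the helper
names set_nonblocking() and open_nonblocking_tcp_socket() are hypothetical.
rsockets provides rfcntl() as its counterpart to fcntl(), so the rsocket
version would presumably take the same shape.

```c
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: switch O_NONBLOCK on or off for a descriptor.
 * Returns 0 on success, -1 on error (errno is set by fcntl).
 */
static inline int set_nonblocking(int fd, int enable)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags == -1)
        return -1;

    flags = enable ? (flags | O_NONBLOCK) : (flags & ~O_NONBLOCK);
    return fcntl(fd, F_SETFL, flags) == -1 ? -1 : 0;
}

/*
 * Hypothetical helper: create a TCP socket that is nonblocking by default,
 * so callers no longer need MSG_DONTWAIT on every send/recv call.
 * Returns the descriptor, or -1 on error.
 */
static inline int open_nonblocking_tcp_socket(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd == -1)
        return -1;

    if (set_nonblocking(fd, 1) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Calling set_nonblocking(fd, 0) afterwards corresponds to the new test
option that switches back to blocking mode.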
Sean Hefty [Thu, 10 May 2012 18:17:32 +0000 (11:17 -0700)]
librdmacm/rsocket: Succeed setsockopt REUSEADDR on connected sockets
The RDMA CM fails calls to set REUSEADDR on an rdma_cm_id if
it is not in the idle state. This causes a failure in NetPipe
when it is run with socket calls intercepted by rsockets.
Fix this by returning success when REUSEADDR is set on an rsocket
that has already been connected. When running over IB, REUSEADDR
is not necessary, since the TCP/IP addresses are mapped.
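
The behavior described above can be sketched with a small stand-in model.
Everything here is hypothetical (the sketch_* names, the two-state enum, and
the choice of EINVAL as the error); it does not reproduce the rsocket or
RDMA CM code, only the idea of reporting success for REUSEADDR once the
socket is connected.

```c
#include <errno.h>

/* Hypothetical, simplified socket states. */
enum sketch_state { SKETCH_IDLE, SKETCH_CONNECTED };

struct sketch_sock {
    enum sketch_state state;
    int reuseaddr;
};

/*
 * Stand-in for the underlying call, which only accepts the option
 * while the id is idle. EINVAL is an assumed error code.
 */
static inline int sketch_cm_set_reuseaddr(struct sketch_sock *s, int on)
{
    if (s->state != SKETCH_IDLE) {
        errno = EINVAL;
        return -1;
    }
    s->reuseaddr = on;
    return 0;
}

/*
 * Wrapper in the spirit of the fix: on a connected socket the option
 * has no effect, so report success instead of forwarding the call.
 */
static inline int sketch_set_reuseaddr(struct sketch_sock *s, int on)
{
    if (s->state == SKETCH_CONNECTED)
        return 0;

    return sketch_cm_set_reuseaddr(s, on);
}
```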
Sean Hefty [Tue, 8 May 2012 00:16:47 +0000 (17:16 -0700)]
rsockets: Optimize synchronization to improve performance
Hotspot performance analysis using VTune showed pthread_mutex_unlock()
as the most significant hotspot when transferring small messages using
rstream. To reduce the impact of using pthread mutexes, replace them
with a custom lock built using an atomic variable and a semaphore.
When there's no contention for the lock (which is the expected case
for nonblocking sockets), the synchronization is reduced to
incrementing and decrementing an atomic variable.
A test that acquired and released a lock 2 billion times reported that
the custom lock was roughly 20% faster than using the mutex:
26.6 seconds versus 33.0 seconds.
Unfortunately, further analysis showed that using the custom lock
provided a minimal performance gain on rstream itself, and simply
moved the hotspot to the custom unlock call. The hotspot is likely
a result of some other interaction, rather than caused by slowness
in releasing a lock. However, we keep the custom lock based on
the results of the direct lock tests that were done.
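
A lock of the kind described above, built from an atomic counter and a
semaphore, might look like the sketch below. The sketch_lock_* names are
hypothetical and the code uses C11 atomics, which may differ from what the
patch itself uses. It assumes a platform with working unnamed POSIX
semaphores, such as Linux.

```c
#include <errno.h>
#include <semaphore.h>
#include <stdatomic.h>

/* Hypothetical lock type: not the librdmacm definition. */
typedef struct {
    sem_t sem;       /* waiters sleep here only when the lock is contended */
    atomic_int cnt;  /* number of threads holding or waiting for the lock */
} sketch_lock_t;

/* Returns 0 on success, -1 on error (errno set by sem_init). */
static inline int sketch_lock_init(sketch_lock_t *lock)
{
    atomic_init(&lock->cnt, 0);
    return sem_init(&lock->sem, 0, 0);
}

static inline void sketch_lock_acquire(sketch_lock_t *lock)
{
    /*
     * atomic_fetch_add returns the previous value. Zero means the lock
     * was free and is now ours; anything else means we must wait for
     * the current holder to post the semaphore.
     */
    if (atomic_fetch_add(&lock->cnt, 1) > 0) {
        while (sem_wait(&lock->sem) == -1 && errno == EINTR)
            ; /* retry if interrupted by a signal */
    }
}

static inline void sketch_lock_release(sketch_lock_t *lock)
{
    /* A previous value above one means at least one thread is waiting. */
    if (atomic_fetch_sub(&lock->cnt, 1) > 1)
        sem_post(&lock->sem);
}
```

In the uncontended case, acquire and release each reduce to a single atomic
read-modify-write, and the semaphore is never touched.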
Sean Hefty [Tue, 8 May 2012 00:16:47 +0000 (17:16 -0700)]
rsockets: Optimize synchronization to improve performance
Performance analysis using VTune showed that pthread_mutex_unlock()
is the single biggest contributor to increasing latency for 64-byte
transfers. The unlock call was followed by get_sw_cqe(), then
__pthread_mutex_lock(). Replace the use of mutexes with an atomic
and a semaphore. When there's no contention for the lock (which
would usually be the case when using nonblocking sockets), the
code simply increments and decrements an atomic variable. Semaphores
are only used when contention occurs.