rsockets: Optimize synchronization to improve performance
Hotspot performance analysis using VTune showed pthread_mutex_unlock()
as the most significant hotspot when transferring small messages using
rstream. To reduce the cost of the pthread mutexes, replace them
with a custom lock built from an atomic variable and a semaphore.
When there's no contention for the lock (which is the expected case
for nonblocking sockets), the synchronization is reduced to
incrementing and decrementing an atomic variable.
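
For illustration only, a lock of this shape can be sketched as follows.
This is a minimal sketch and not necessarily the code added by this
patch: the sketch_lock_* names are hypothetical, and it assumes C11
atomics and POSIX unnamed semaphores are available.

```c
#include <assert.h>
#include <errno.h>
#include <semaphore.h>
#include <stdatomic.h>

/* Hypothetical type and names, chosen for this sketch only. */
typedef struct {
	sem_t sem;      /* waiters block here when the lock is contended */
	atomic_int cnt; /* number of threads holding or waiting for the lock */
} sketch_lock_t;

static inline int sketch_lock_init(sketch_lock_t *lock)
{
	atomic_init(&lock->cnt, 0);
	/* Semaphore starts at 0: a waiter sleeps until a release posts it. */
	return sem_init(&lock->sem, 0, 0);
}

static inline void sketch_lock_destroy(sketch_lock_t *lock)
{
	sem_destroy(&lock->sem);
}

static inline void sketch_lock_acquire(sketch_lock_t *lock)
{
	/* Uncontended case: previous count was 0, so only the atomic
	 * increment is executed. */
	if (atomic_fetch_add(&lock->cnt, 1) > 0) {
		/* Contended case: sleep until the holder posts; retry if a
		 * signal interrupts the wait. */
		while (sem_wait(&lock->sem) == -1 && errno == EINTR)
			;
	}
}

static inline void sketch_lock_release(sketch_lock_t *lock)
{
	int prev = atomic_fetch_sub(&lock->cnt, 1);

	/* Releasing a lock that nobody holds is a caller bug. */
	assert(prev > 0);
	/* Previous count above 1 means at least one thread is waiting. */
	if (prev > 1)
		sem_post(&lock->sem);
}
```

In this sketch the semaphore is touched only when a second thread
arrives while the lock is held, which is why the uncontended path costs
just the two atomic operations.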
A test that acquired and released a lock 2 billion times reported that
the custom lock was roughly 20% faster than the mutex:
26.6 seconds versus 33.0 seconds.
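
A direct lock test of this kind might look like the sketch below. It is
not the test program that produced the numbers above. The helper name
time_mutex_pairs() is hypothetical, and the sketch assumes a single
thread (so the lock is never contended) and a POSIX system that provides
CLOCK_MONOTONIC. Only the mutex side is shown; the same loop would be
repeated with the custom lock for comparison.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical helper: return the seconds taken by `iters` uncontended
 * lock/unlock pairs on a pthread mutex. */
static double time_mutex_pairs(unsigned long iters)
{
	pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
	struct timespec start, end;
	unsigned long i;
	int rc;

	rc = clock_gettime(CLOCK_MONOTONIC, &start);
	assert(rc == 0);

	for (i = 0; i < iters; i++) {
		pthread_mutex_lock(&mutex);
		pthread_mutex_unlock(&mutex);
	}

	rc = clock_gettime(CLOCK_MONOTONIC, &end);
	assert(rc == 0);
	(void) rc; /* silence unused warning when NDEBUG is defined */

	pthread_mutex_destroy(&mutex);

	/* Convert the timespec difference to seconds. */
	return (double) (end.tv_sec - start.tv_sec) +
	       (double) (end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling time_mutex_pairs(2000000000UL) would match the iteration count
quoted above. Absolute timings depend on the machine and compiler
options.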
Unfortunately, further analysis showed that the custom lock
provided only a minimal performance gain on rstream itself and simply
moved the hotspot to the custom unlock call. The hotspot is likely
the result of some other interaction rather than of slowness
in releasing a lock. However, we keep the custom lock based on
the results of the direct lock tests described above.