From d0d85629ed68accebe82a882b88cd861af08aec3 Mon Sep 17 00:00:00 2001
From: Sean Hefty
Date: Tue, 11 Sep 2012 15:09:59 -0700
Subject: [PATCH] Refresh of rs-docs

---
 docs/rsocket | 143 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 142 insertions(+), 1 deletion(-)

diff --git a/docs/rsocket b/docs/rsocket
index 87cc2894..5399f6cb 100644
--- a/docs/rsocket
+++ b/docs/rsocket
@@ -1,3 +1,144 @@
-rsocket protocol and design guide
+rsocket Protocol and Design Guide               9/10/2012
 
+Overview
+--------
+Rsockets is a protocol over RDMA that supports a socket-level API
+for applications. For details on the current state of the
+implementation, readers should refer to the rsocket man page. This
+document describes the rsocket protocol, general design, and
+some implementation details.
 
+Rsockets exchanges data by performing RDMA write operations into
+exposed data buffers. In addition to RDMA write data, rsockets uses
+small, 32-bit messages for internal communication. RDMA writes
+are used to transfer application data into remote data buffers
+and to notify the peer when new target data buffers are available.
+The following figure highlights the operation.
+
+   host A                   host B
+                          remote SGL
+  target SGL  <-------------  [  ]
+     [  ] ------
+     [  ] --    ------  receive buffer(s)
+            --        ----->  +--+
+              --              |  |
+                --            |  |
+                  --          |  |
+                    --        +--+
+                      --
+                        --->  +--+
+                              |  |
+                              |  |
+                              +--+
+
+The remote SGL contains the address, size, and rkey of the target SGL. As
+receive buffers become available on host B, rsockets will issue an RDMA
+write against one of the entries in the target SGL on host A. The
+updated entry will reference an available receive buffer. Immediate data
+included with the RDMA write will indicate to host A that a target SGE
+has been updated.
+
+When host A has data to send, it will check its target SGL. The current
+target SGE will contain the address, size, and rkey of the next receive
+buffer on host B. If the data transfer is smaller than the size of the
+remote receive buffer, host A will update its target SGE to reflect the
+remaining size of the receive buffer. That is, once a receive buffer has
+been published to a remote peer, it will be fully consumed before a second
+buffer is used.
+
+Rsockets relies on immediate data to notify the remote peer when data has
+been transferred or when a target SGL has been updated. Because immediate
+data requires that the remote QP have a posted receive, rsockets also uses
+a credit-based flow control mechanism. The number of credits is based on
+the size of the receive queue, with initial credits exchanged during
+connection setup. In order to transfer data, rsockets requires both
+available receive buffers (published via the target SGL) and data credits.
+
+Since immediate data is limited to 32 bits, messages may either indicate
+the arrival of application data or may be an internal message, but not both.
+To avoid credit deadlock, rsockets reserves a small number of available
+credits for control messages only, with the protocol relying on RNR NAKs
+and retries to make forward progress.
+
+
+Connection Establishment
+------------------------
+Rsockets uses the RDMA CM for connection establishment. Struct rs_conn_data
+is exchanged during the connection exchange as private data in the request
+and reply messages.
+
+struct rs_sge {
+	uint64_t addr;
+	uint32_t key;
+	uint32_t length;
+};
+
+#define RS_CONN_FLAG_NET 1
+
+struct rs_conn_data {
+	uint8_t		version;
+	uint8_t		flags;
+	uint16_t	credits;
+	uint32_t	reserved2;
+	struct rs_sge	target_sgl;
+	struct rs_sge	data_buf;
+};
+
+Version - current version is 1
+Flags
+RS_CONN_FLAG_NET - Set to 1 if host is big endian.
+                   Determines byte ordering for RDMA write messages
+Credits - number of initial receive credits
+Reserved2 - set to 0
+Target SGL - Address, size (# entries), and rkey of target SGL.
+             Remote side will copy this into their remote SGL.
+Data Buffer - Initial receive buffer address, size (in bytes), and rkey.
+              Remote side will copy this into their first target SGE.
+
+
+Message Format
+--------------
+Rsockets uses RDMA writes with immediate data for all message exchanges.
+RDMA writes of 0 length are used if no additional data beyond the message
+needs to be exchanged. Immediate data is limited to 32 bits. Rsockets
+defines the following format for messages.
+
+The upper 3 bits are used to define the type of message being exchanged,
+with the meaning of the lower 29 bits determined by the upper bits.
+
+Bits    Message         Meaning of
+31:29   Type            Bits 28:0
+000     Data Transfer   bytes transferred
+001     reserved
+010     reserved
+011     reserved
+100     Credit Update   received credits granted
+101     reserved
+110     reserved
+111     Control         control message type
+
+Data Transfer
+Indicates that application data has been written into the next available
+receive buffer. The size of the transfer, in bytes, is carried in the lower
+bits of the message.
+
+Credit Update
+Used to indicate that additional receive buffers and credits are available.
+The number of available credits is carried in the lower bits of the message.
+A credit update message is also used to indicate that a target SGE has been
+updated, in which case the number of additional credits may be 0. The
+receiver of a credit update message must check for updates to the target SGL
+by inspecting the contents of the SGL. The rsocket implementation must take
+care not to modify a remote target SGL while it may be in use. This is done
+by tracking when a receive buffer referenced by a remote target SGL has been
+filled.
+
+Control Message - DISCONNECT
+Indicates that the rsocket connection has been fully disconnected and will no
+longer send or receive data. Data received before the disconnect message was
+processed may still be available for reading.
+
+Control Message - SHUTDOWN
+Indicates that the remote rsocket has shut down the send side of its
+connection. The recipient of a shutdown message will no longer accept
+incoming data, but may still transfer outbound data.
-- 
2.41.0
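
The Message Format section added by this patch can be illustrated with a
minimal C sketch that packs and unpacks the 32-bit immediate-data word: a
3-bit type in bits 31:29 and a 29-bit value in bits 28:0. All identifiers
below (rs_msg_pack, rs_msg_get_type, rs_msg_get_data, the RS_MSG_* names) are
hypothetical and chosen for illustration only; the guide specifies the bit
layout, not these names. Byte ordering of the immediate data on the wire is
not handled here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names: the guide defines only the bit patterns. */
enum rs_msg_kind {
	RS_MSG_DATA    = 0x0,	/* 000: data transfer, value = bytes transferred */
	RS_MSG_CREDIT  = 0x4,	/* 100: credit update, value = credits granted */
	RS_MSG_CONTROL = 0x7	/* 111: control, value = control message type */
};

#define RS_MSG_TYPE_SHIFT 29
#define RS_MSG_DATA_MASK  ((UINT32_C(1) << RS_MSG_TYPE_SHIFT) - 1)

/* Pack a 3-bit message type and a 29-bit value into one 32-bit word. */
static inline uint32_t rs_msg_pack(enum rs_msg_kind kind, uint32_t data)
{
	assert(data <= RS_MSG_DATA_MASK);	/* value must fit in bits 28:0 */
	return ((uint32_t)kind << RS_MSG_TYPE_SHIFT) | data;
}

/* Extract the message type from bits 31:29. */
static inline enum rs_msg_kind rs_msg_get_type(uint32_t msg)
{
	return (enum rs_msg_kind)(msg >> RS_MSG_TYPE_SHIFT);
}

/* Extract the type-dependent value from bits 28:0. */
static inline uint32_t rs_msg_get_data(uint32_t msg)
{
	return msg & RS_MSG_DATA_MASK;
}
```

Under these assumptions, a credit update granting 5 credits packs to
0x80000005, and a data transfer message of 4096 bytes packs to the value 4096
itself, since the data transfer type is 000.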