--- /dev/null
+rsocket Protocol and Design Guide 9/10/2012
+
+Overview
+--------
+Rsockets is a protocol over RDMA that supports a socket-level API
+for applications. For details on the current state of the
+implementation, readers should refer to the rsocket man page. This
+document describes the rsocket protocol, general design, and
+some implementation details.
+
+Rsockets exchanges data by performing RDMA write operations into
+exposed data buffers. In addition to RDMA write data, rsockets uses
+small, 32-bit messages for internal communication. RDMA writes
+are used to transfer application data into remote data buffers
+and to notify the peer when new target data buffers are available.
+The following figure highlights the operation.
+
+   host A                   host B
+                          remote SGL
+  target SGL  <-------------  [  ]
+     [  ] ------
+     [  ] --    ------  receive buffer(s)
+            --        ----->  +--+
+              --              |  |
+                --            |  |
+                  --          |  |
+                    --        +--+
+                      --
+                        --->  +--+
+                              |  |
+                              |  |
+                              +--+
+
+The remote SGL contains the address, size, and rkey of the target SGL. As
+receive buffers become available on host B, rsockets will issue an RDMA
+write against one of the entries in the target SGL on host A. The
+updated entry will reference an available receive buffer. Immediate data
+included with the RDMA write will indicate to host A that a target SGE
+has been updated.
+
+When host A has data to send, it will check its target SGL. The current
+target SGE will contain the address, size, and rkey of the next receive
+buffer on host B. If the data transfer is smaller than the size of the
+remote receive buffer, host A will update its target SGE to reflect the
+remaining size of the receive buffer. That is, once a receive buffer has
+been published to a remote peer, it will be fully consumed before a second
+buffer is used.
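+
+The sender-side bookkeeping described above can be sketched as follows. This
+is a minimal illustration, not the library's code: the helper name
+sge_consume() and its parameters are hypothetical, and only struct rs_sge
+is taken from this document.
+
```c
#include <assert.h>
#include <stdint.h>

/* Same layout as struct rs_sge defined later in this document. */
struct rs_sge {
	uint64_t addr;
	uint32_t key;
	uint32_t length;
};

/*
 * Hypothetical helper: carve the next chunk out of the current target SGE.
 * Returns the number of bytes that may be written, reports the remote
 * address and rkey to use for the RDMA write, and advances the SGE so that
 * it describes only the unused remainder of the receive buffer.
 */
static uint32_t sge_consume(struct rs_sge *sge, uint32_t want,
			    uint64_t *raddr, uint32_t *rkey)
{
	uint32_t len;

	assert(sge && raddr && rkey);
	len = want < sge->length ? want : sge->length;

	*raddr = sge->addr;
	*rkey = sge->key;
	sge->addr += len;
	sge->length -= len;	/* 0 means the receive buffer is fully consumed */
	return len;
}
```
+
+A 40-byte send against a 100-byte target SGE leaves an SGE that starts 40
+bytes later and has 60 bytes remaining. In this sketch the sender moves on
+to the next target SGE only once the length reaches 0.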
+
+Rsockets relies on immediate data to notify the remote peer when data has
+been transferred or when a target SGL has been updated. Because immediate
+data requires that the remote QP have a posted receive, rsockets also uses
+a credit-based flow control mechanism. The number of credits is based on
+the size of the receive queue, with initial credits exchanged during
+connection setup. In order to transfer data, rsockets requires both
+available receive buffers (published via the target SGL) and data credits.
+
+Since immediate data is limited to 32 bits, a message may either indicate
+the arrival of application data or be an internal message, but not both.
+To avoid credit deadlock, rsockets reserves a small number of available
+credits for control messages only, with the protocol relying on RNR NAKs
+and retries to make forward progress.
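+
+One way to picture the two conditions for sending data (receive buffer space
+and credits) together with the control reserve is the sketch below. The
+struct, the function names, and the reserve size of 2 are all assumptions
+made for illustration; the text above states only that "a small number" of
+credits is held back for control messages.
+
```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical value; the document does not give the reserve size. */
#define CTRL_RESERVED_CREDITS 2

/* Hypothetical per-connection send state. */
struct send_state {
	uint16_t credits;	/* receive credits granted by the peer */
	uint32_t target_len;	/* bytes left in the current target SGE */
};

/*
 * Application data may be sent only when the peer has published receive
 * buffer space and a credit is available beyond the control reserve.
 */
static bool can_send_data(const struct send_state *s)
{
	return s->target_len > 0 && s->credits > CTRL_RESERVED_CREDITS;
}

/* Account for one RDMA write with immediate data. */
static void use_credit(struct send_state *s)
{
	assert(s->credits > 0);
	s->credits--;
}
```
+
+With 3 credits and a reserve of 2, exactly one data message may be sent
+before the data path stalls and waits for a credit update.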
+
+
+Connection Establishment
+------------------------
+Rsockets uses the RDMA CM for connection establishment. The rs_conn_data
+structure is exchanged as private data in the connection request and reply
+messages.
+
+struct rs_sge {
+	uint64_t addr;
+	uint32_t key;
+	uint32_t length;
+};
+
+#define RS_CONN_FLAG_NET 1
+
+struct rs_conn_data {
+	uint8_t		version;
+	uint8_t		flags;
+	uint16_t	credits;
+	uint32_t	reserved2;
+	struct rs_sge	target_sgl;
+	struct rs_sge	data_buf;
+};
+
+Version     - Current version is 1.
+Flags
+  RS_CONN_FLAG_NET - Set to 1 if the host is big-endian.
+                     Determines byte ordering for RDMA write messages.
+Credits     - Number of initial receive credits.
+Reserved2   - Set to 0.
+Target SGL  - Address, size (# entries), and rkey of the target SGL.
+              The remote side copies this into its remote SGL.
+Data Buffer - Initial receive buffer address, size (in bytes), and rkey.
+              The remote side copies this into its first target SGE.
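+
+As an illustration of the field descriptions above, the sketch below fills
+in an rs_conn_data structure. The helper names host_is_big_endian() and
+conn_data_init() are hypothetical. This document does not state the wire
+byte order of the multi-byte fields, so the sketch leaves them in host order
+and makes no claim about what the implementation does.
+
```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct rs_sge {
	uint64_t addr;
	uint32_t key;
	uint32_t length;
};

#define RS_CONN_FLAG_NET 1

struct rs_conn_data {
	uint8_t		version;
	uint8_t		flags;
	uint16_t	credits;
	uint32_t	reserved2;
	struct rs_sge	target_sgl;
	struct rs_sge	data_buf;
};

/* Hypothetical helper: returns 1 on a big-endian host, 0 otherwise. */
static int host_is_big_endian(void)
{
	const uint16_t probe = 1;
	uint8_t first;

	memcpy(&first, &probe, 1);	/* lowest-addressed byte of 0x0001 */
	return first == 0;
}

/* Hypothetical helper: populate the private data for a request or reply. */
static void conn_data_init(struct rs_conn_data *c, uint16_t credits,
			   const struct rs_sge *target_sgl,
			   const struct rs_sge *data_buf)
{
	assert(c && target_sgl && data_buf);
	memset(c, 0, sizeof(*c));	/* also sets reserved2 to 0 */
	c->version = 1;
	c->flags = host_is_big_endian() ? RS_CONN_FLAG_NET : 0;
	c->credits = credits;		/* initial receive credits */
	c->target_sgl = *target_sgl;	/* length = number of SGL entries */
	c->data_buf = *data_buf;	/* length = buffer size in bytes */
}
```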
+
+
+Message Format
+--------------
+Rsockets uses RDMA writes with immediate data for all message exchanges.
+RDMA writes of 0 length are used if no additional data beyond the message
+needs to be exchanged. Immediate data is limited to 32 bits. Rsockets
+defines the following format for messages.
+
+The upper 3 bits are used to define the type of message being exchanged,
+with the meaning of the lower 29 bits determined by the upper bits.
+
+Bits    Message          Meaning of
+31:29   Type             Bits 28:0
+000     Data Transfer    bytes transferred
+001     reserved
+010     reserved
+011     reserved
+100     Credit Update    received credits granted
+101     reserved
+110     reserved
+111     Control          control message type
+Data Transfer
+Indicates that application data has been written into the next available
+receive buffer. The size of the transfer, in bytes, is carried in the lower
+bits of the message.
+
+Credit Update
+Used to indicate that additional receive buffers and credits are available.
+The number of available credits is carried in the lower bits of the message.
+A credit update message is also used to indicate that a target SGE has been
+updated, in which case the number of additional credits may be 0. The
+receiver of a credit update message must check for updates to the target SGL
+by inspecting the contents of the SGL. The rsocket implementation must take
+care not to modify a remote target SGL while it may be in use. This is done
+by tracking when a receive buffer referenced by a remote target SGL has been
+filled.
+
+Control Message - DISCONNECT
+Indicates that the rsocket connection has been fully disconnected and will no
+longer send or receive data. Data received before the disconnect message was
+processed may still be available for reading.
+
+Control Message - SHUTDOWN
+Indicates that the remote rsocket has shut down the send side of its
+connection. The recipient of a shutdown message will no longer accept
+incoming data, but may still transfer outbound data.