update according to latest RDS man

author Tziporet Koren <tziporet@mellanox.co.il>

Tue, 19 Feb 2008 13:48:20 +0000 (15:48 +0200)

committer Tziporet Koren <tziporet@mellanox.co.il>

Tue, 19 Feb 2008 13:48:20 +0000 (15:48 +0200)
diff --git a/RDS_README.txt b/RDS_README.txt

index 41d5c11325935a69a2c69da9d0a0158f5ff5a8c0..54d5fd7f3a54ec604e9b1dea0e3d896e73d590e3 100644 (file)
--- a/RDS_README.txt
+++ b/RDS_README.txt
@@ -1,117 +1,274 @@
-RDS(7)                    Linux Programmer’s Manual                  RDS(7)
+RDS(7)                                                                 RDS(7)
  
  
  
  NAME
-       rds - RDS socket API
+       RDS - Reliable Datagram Sockets
  
  SYNOPSIS
         #include <sys/socket.h>
         #include <netinet/in.h>
-       #define RDS_CANCEL_SENT_TO    1
-       #define RDS_SNDBUF  2
-
-       rds_socket = socket(pf_rds, SOCK_SEQPACKET, 0);
  
  DESCRIPTION
-       This is an implementation of the RDS socket API.         It provides reliable,
-       in-order datagram delivery between sockets over a  variety  of  trans-
+       This  is an implementation of the RDS socket API. It provides reliable,
+       in-order datagram delivery between sockets over a  variety  of  trans‐
         ports.
  
+       Currently, RDS can be transported over Infiniband  and  loopback.  RDS
+       over TCP is disabled, but will be re-enabled in the near future.
+
+       RDS uses standard AF_INET addresses as described in ip(7)  to  identify
+       end points.
  
-SOCKET CREATION
-       RDS is still in development and as such does not have a reserved proto-
-       col family constant.  Applications must read the string representation
+   Socket Creation
+       RDS is still in development and as such does not have a reserved proto‐
+       col family constant. Applications must read the string  representation
         of  the protocol  family  value  from the pf_rds sysctl parameter file
         described below.
  
+       rds_socket = socket(pf_rds, SOCK_SEQPACKET, 0);
  
-BINDING
-       A new RDS socket has no local address when it is         first  returned  from
-       socket(2).   It must  be  bound  to a local address by calling bind(2)
-       before any messages can be sent or received.  RDS sockets do  not  sup-
-       port connecting to remote endpoints with connect(2).  An RDS socket can
-       only be bound to one address and only one socket         can  be  bound  to  a
-       given  address. If no port is specified in the binding address then an
-       unbound port is selected at random.
  
-       RDS has the notion of associating a socket to an underlying  transport.
-       The  transport  for a socket is decided based on the local address that
-       is bound.  From that point on the socket can  only  reach  destinations
-       which are available through the this transport.
+   Socket Options
+       RDS sockets support a number of socket  options through  the  setsock‐
+       opt(2)  and  getsockopt(2)  calls.  The following generic options (with
+       socket level SOL_SOCKET) are of specific importance:
+
+       SO_RCVBUF
+             Specifies the size of the receive buffer. See section  on  "Con‐
+             gestion Control" below.
+
+       SO_SNDBUF
+             Specifies  the  size  of the send buffer. See "Message Transmis‐
+             sion" below.
+
+       SO_SNDTIMEO
+             Specifies the send timeout when trying to enqueue a message on a
+             socket with a full queue in blocking mode.
  
+       In  addition  to         these,  RDS  supports  a  number of protocol specific
+       options (with socket level SOL_RDS).  Just as  with  the         RDS  protocol
+       family, an official value has not been assigned yet, so the kernel will
+       assign a value dynamically.  The assigned value can be  retrieved  from
+       the sol_rds sysctl parameter file.
  
-MESSAGE TRANSMISSION
-       Messages         may  be  sent  using sendmsg(2) once the RDS socket is bound.
-       Message length cannot exceed 4 gigabytes as the wire protocol  uses  an
+       RDS  specific  socket  options  will be described in a separate section
+       below.
+
+   Binding
+       A new RDS socket has no local address when it is         first  returned  from
+       socket(2).   It must  be  bound  to a local address by calling bind(2)
+       before any messages can be sent or received. This will also attach  the
+       socket  to  a  specific transport,  based on the type of interface the
+       local address is attached to.  From that point on, the socket can  only
+       reach destinations which are available through this transport.
+
+       For  instance,  when  binding to the address of an Infiniband interface
+       such as ib0, the socket will use the Infiniband transport.  If  RDS  is
+       not  able  to  associate         a  transport  with the given address, it will
+       return EADDRNOTAVAIL.
+
+       An RDS socket can only be bound to one address and only one socket  can
+       be  bound  to a given address/port pair. If no port is specified in the
+       binding address then an unbound port is selected at random.
+
+       RDS does not allow the application to bind a previously bound socket to
+       another address. Binding to the wildcard address INADDR_ANY is not per‐
+       mitted either.
+
+   Connecting
+       The default mode of operation for RDS is to use unconnected sockets and
+       specify a destination address as an argument to sendmsg(2). However, RDS
+       allows sockets to be connected to a remote end point using  connect(2).
+       If a socket is connected, calling sendmsg without specifying a destina‐
+       tion address will use the previously given remote address.
+
+   Congestion Control
+       RDS does not have explicit congestion  control  like  common  streaming
+       protocols  such as TCP. However, sockets have two queue limits associ‐
+       ated with them; the send queue size and the receive queue  size.          Mes‐
+       sages are accounted based on the number of bytes of payload.
+
+       The send queue size limits how much data local processes can queue on a
+       local socket (see the following section). If that  limit         is  exceeded,
+       the  kernel will not accept further messages until the queue is drained
+       and messages have been delivered to  and         acknowledged  by  the  remote
+       host.
+
+       The receive queue size limits how much data RDS will put on the receive
+       queue of a socket before marking         the  socket  as  congested.   When  a
+       socket  becomes congested, RDS will send a congestion map update to the
+       other participating hosts, who are then expected to stop         sending  more
+       messages to this port.
+
+       There  is a timing window during which a remote host can still continue
+       to send messages to a congested port;  RDS  solves  this         by  accepting
+       these  messages even if the socket's receive queue is already over the
+       limit.
+
+       As the application pulls incoming messages off the receive queue         using
+       recvmsg(2),  the         number  of bytes on the receive queue will eventually
+       drop below the receive queue size, at which  point  the port  is  then
+       marked  uncongested,  and another congestion update is sent to all par‐
+       ticipating hosts. This tells them to allow applications to  send         addi‐
+       tional messages to this port.
+
+       The  default values for the send and receive buffer size are controlled
+       by the system wide socket buffer sizes set  in  the  wmem_default  and
+       rmem_default  sysctls,  respectively. A given RDS socket therefore has
+       limited buffer space; the limits can be  tuned  by  the  application
+       through the SO_SNDBUF and SO_RCVBUF socket options.
+
+
+   Blocking Behavior
+       The sendmsg(2) and recvmsg(2) calls can block in a  variety  of situa‐
+       tions.  Whether  a call blocks or returns with an error depends on the
+       non-blocking setting of the file descriptor and the  MSG_DONTWAIT  mes‐
+       sage flag. If the file descriptor is set to blocking mode (which is the
+       default), and the MSG_DONTWAIT flag is not given, the call will block.
+
+       In addition, the SO_SNDTIMEO and SO_RCVTIMEO socket options can be used
+       to specify a timeout (in seconds) after which the call will abort wait‐
+       ing,  and return an error. The default timeout is 0, which tells RDS to
+       block indefinitely.
+
+   Message Transmission
+       Messages may be sent using sendmsg(2) once the  RDS  socket  is bound.
+       Message length  cannot exceed 4 gigabytes as the wire protocol uses an
         unsigned 32 bit integer to express the message length.
  
-       RDS does not support out of band data.
+       RDS does not support out of band data. Applications are allowed to send
+       to unicast addresses only; broadcast or multicast are not supported.
  
-       A  successful sendmsg(2) call puts the message in the socket’s transmit
+       A  successful sendmsg(2) call puts the message in the socket's transmit
         queue where it will remain until either the  destination  acknowledges
         that the message is no longer in the network or the application removes
-       the message from the send queue.         Messages are removed  from  the  send
-       queue with the RDS_CANCEL_SENT_TO socket option described below.
+       the message from the send queue.
  
-       A  given RDS socket has limited transmit buffer space for each destina-
-       tion address.  While a message is in the         transmit  queue  its  payload
-       bytes  are accounted for.  If an attempt is made to send a message to a
-       destination whose buffer does not have room for the  new         message  then
-       the sender will block or EAGAIN will be returned depending on MSG_DONT-
-       WAIT message flag.  The SO_SNDTIMEO socket option dictates how long the
-       send will wait for buffer.
+       Messages can be removed from the send queue with the RDS_CANCEL_SENT_TO
+       socket option described below.
  
-       The  size of the send buffer for a given destination is governed by the
-       RDS_SNDBUF socket option and sysctl parameters  described  below.   The
-       SO_SNDBUF socket option is ignored.
+       While  a         message  is  in  the  transmit  queue  its  payload bytes are
+       accounted for.  If an attempt is made to send a message while there  is
+       not  sufficient room on the transmit queue, the call will either block
+       or return EAGAIN.
+
+       When trying to send to a destination that is  marked  congested  (see
+       above), the call will either block or return ENOBUFS.
  
         A  message sent with no payload bytes will not consume any space in the
-       destination’s send buffer but will result in a message receipt  on  the
-       destination.   The  receiver  will not get any payload data but will be
-       able to see the sender’s address.
+       destination's send buffer but will result in a message receipt  on  the
+       destination.  The  receiver  will  not get any payload data but will be
+       able to see the sender's address.
  
+       Messages sent to a port to which no socket is bound  will  be  silently
+       discarded  by  the  destination host. No error messages are reported to
+       the sender.
  
-MESSAGE RECEIPT
+   Message Receipt
         Messages may be received with recvmsg(2) on an RDS socket  once it  is
-       bound to a source address.  The MSG_DONTWAIT message flag determines if
-       the receive will block waiting for message arrival and the  SO_RCVTIMEO
-       socket  option  dictates         how long the receive will wait.  The MSG_PEEK
-       flag stops the message from being removed from the receive queue.
+       bound to a source address. RDS will return messages in-order, i.e. mes‐
+       sages from the same sender will arrive in the same order in which  they
+       were sent.
+
+       The address of the sender will be returned in the sockaddr_in structure
+       pointed to by the msg_name field, if set.
+
+       If the MSG_PEEK flag is given, the first message on the receive  queue
+       is returned without removing it from the queue.
  
         The memory consumed by messages waiting for delivery does not limit the
-       number  of  messages  that  can be queued for receive.  Senders must be
-       careful not to overwhelm the receiver  by  sizing  their         send  buffers
-       appropriately.  The SO_RCVBUF socket option is ignored.
+       number of messages that can be queued for receive. RDS does attempt  to
+       perform congestion control as described in the section above.
  
         If the length of the message exceeds the size of the buffer provided to
-       recvmsg(2) then the remainder of the bytes in the message are discarded
-       and the MSG_TRUNC flag is set in the msg_flags field.  In this truncat-
-       ing case recvmsg(2) will still return the number of bytes  copied,  not
-       the length of entire messge.  If MSG_TRUNC is set in the flags argument
-       to recvmsg(2) then it will return the number of bytes  in  the  entire
-       message.          Thus  one  can  examine  the size of the next message in the
-       receive queue without incuring a copying overhead by providing  a  zero
-       length buffer and setting MSG_PEEK and MSG_TRUNC in the flags argument.
+       recvmsg(2), then the remainder of the bytes in  the  message  are  dis‐
+       carded  and  the         MSG_TRUNC flag is set in the msg_flags field. In this
+       truncating case recvmsg(2)  will         still  return  the  number  of  bytes
+       copied, not the length of the entire message. If MSG_TRUNC is set in the
+       flags argument to recvmsg(2), then it will return the number  of         bytes
+       in  the entire message. Thus one can examine the size of the next mes‐
+       sage in the receive queue without incurring a copying overhead by  pro‐
+       viding  a  zero length buffer and setting MSG_PEEK and MSG_TRUNC in the
+       flags argument.
  
         The sending address of a zero-length message will still be provided  in
         the msg_name field.
  
-
-POLL
-       RDS supports a limited poll(2) API.  POLLIN is returned when there is a
-       message waiting in the  socket’s       receive  queue.   POLLOUT  is  always
-       returned,  it  is  up to the application to back off if poll is used to
-       trigger sends.
-
-
-RELIABILITY
-       If sendmsg(2) succeeds then RDS guarantees that the  message  will  be
-       visible to  recvmsg(2) on a socket bound to the destination address as
+   Control Messages
+       RDS  uses control messages (a.k.a. ancillary data) through the msg_con‐
+       trol and msg_controllen fields in sendmsg(2) and         recvmsg(2).   Control
+       messages         generated  by  RDS  have a cmsg_level value of sol_rds.  Most
+       control messages are related to the zerocopy  interface added  in  RDS
+       version 3, and are described in rds-rdma(7).
+
+       The  only  exception  is         the  RDS_CMSG_CONG_UPDATE  message,  which is
+       described in the following section.
+
+   Polling
+       RDS supports the poll(2) interface in a limited  fashion.   POLLIN  is
+       returned         when  there  is  a message (either a proper RDS message, or a
+       control message) waiting in the socket's         receive  queue.   POLLOUT  is
+       always returned while there is room on the socket's send queue.
+
+       Sending to congested ports requires special handling. When an applica‐
+       tion tries to send to a congested destination,  the  system  call  will
+       return ENOBUFS. However, it cannot poll for POLLOUT, as there is prob‐
+       ably still room on the transmit queue, so the  call  to poll(2)  would
+       return immediately, even though the destination is still congested.
+
+       There are two ways of dealing with this situation. The first is to sim‐
+       ply poll for POLLIN.  By default, a  process  sleeping  in  poll(2)  is
+       always woken up when the congestion map is updated, and thus the appli‐
+       cation can retry any previously congested sends.
+
+       The second option is explicit congestion monitoring,  which  gives  the
+       application more fine-grained control.
+
+       With  explicit  monitoring, the application polls for POLLIN as before,
+       and additionally uses the RDS_CONG_MONITOR socket option to  install  a
+       64bit  mask  value in the socket, where each bit corresponds to a group
+       of ports. When a congestion update arrives, RDS checks the set of ports
+       that  became  uncongested against the bit mask installed in the socket.
+       If they overlap, a control message is enqueued on the socket, and  the
+       application is woken up. When it calls recvmsg(2), it will be given the
+       control message containing the bitmap.
+
+       The congestion monitor bitmask can be set and  queried  using  setsock‐
+       opt(2) with RDS_CONG_MONITOR, and a pointer to the 64bit mask variable.
+
+       Congestion   updates   are   delivered  to   the    application    via
+       RDS_CMSG_CONG_UPDATE  control  messages.         These  control  messages  are
+       always delivered by themselves (or possibly with  additional  control
+       messages), but never along with an RDS data message. The cmsg_data field
+       of the control message is an 8 byte datum containing the 64bit mask value.
+
+       Applications  can  use the following macros to test for and set bits in
+       the bitmask:
+
+       #define RDS_CONG_MONITOR_SIZE   64
+       #define RDS_CONG_MONITOR_BIT(port)  (((unsigned int) (port)) % RDS_CONG_MONITOR_SIZE)
+       #define RDS_CONG_MONITOR_MASK(port) (1ULL << RDS_CONG_MONITOR_BIT(port))
+
+
+   Canceling Messages
+       An application can cancel (flush) messages from the  send  queue         using
+       the  RDS_CANCEL_SENT_TO socket  option  with setsockopt(2).  This call
+       takes an optional sockaddr_in address structure as argument. If given,
+       only  messages  to  the destination specified by this address are dis‐
+       carded. If no address is given, all pending messages are discarded.
+
+       Note that this affects messages that have not yet been  transmitted  as
+       well  as messages that have been transmitted, but for which no acknowl‐
+       edgment from the remote host has been received yet.
+
+   Reliability
+       If sendmsg(2) succeeds, RDS guarantees that the message  will  be vis‐
+       ible   to  recvmsg(2)  on  a socket bound to the destination address as
         long as that destination socket remains open.
  
-       If there is no socket bound on the  destination than  the  message  is
-       silently         dropped.   If  the sending RDS can’t be sure that there is no
+       If there is no socket bound on  the   destination,   the          message   is
+       silently         dropped.   If  the sending RDS can't be sure that there is no
         socket bound then it will try to send the message indefinitely until it
         can be sure or the sent message is canceled.
  
@@ -122,97 +279,57 @@ RELIABILITY
         messages to a given destination.
  
         If  a  receiving socket is closed with pending messages then the sender
-       considers those messages as  having  left  the  network and  will  not
+       considers those messages as  having  left  the  network and  will   not
         retransmit them.
  
-       A  message will only be seen by recvmsg(2) without MSG_PEEK once.  Once
-       the message has been delivered it is removed from the sending  socket’s
-       transmit queue.
+       A   message  will  only be seen by recvmsg(2) once, unless MSG_PEEK was
+       specified. Once the message has been delivered it is removed  from  the
+       sending socket's transmit queue.
  
         All  messages sent from the same socket to the same destination will be
-       delivered in the order they’re  sent.   Messages       sent  from  different
-       sockets, or to different destinations, may be delivered in any order.
+       delivered in the order they're sent. Messages sent from different sock‐
+       ets, or to different destinations, may be delivered in any order.
  
-
-ADDRESS FORMATS
-       RDS  uses  sockaddr_in  as  described  in  ip(7) to describe addresses,
-       including setting sin_family to AF_INET .  RDS  only  supports  unicast
-       communication -- broadcast and multicast addresses are not supported.
-
-
-SOCKET OPTIONS
-       The  following  RDS  specific  socket  options  are  available when the
-       sol_rds sysctl parameter is read and used as the         level  with  getsock-
-       opt(2) or setsockopt(2)
-
-
-       RDS_SNDBUF
-             This  determines the total number of bytes that may be queued in
-             the transmit queue for a given destination.  Changing this  does
-             not  have  an  immediate  effect  on pending transmission, it is
-             intended to be set early and infrequently.  The  default,  mini-
-             mum,  and maximum values of this option are governed by the snd-
-             buf_* sysctl parameters described below.
-
-
-       RDS_CANCEL_SENT_TO
-             Setting this option is used to cancel messages sent  to  a  spe-
-             cific  destination.   The  destination  address  is specified by
-             passing a sockaddr pointer and length as the optval  and  optlen
-             arguments  to  setsockopt(2)  .  Errors are only returned if the
-             socket is not yet bound or if sockaddr is malformed.   No  error
-             is returned if there are no messages queued for the given desti-
-             nation.  getsockopt(2) is not supported on this option and  will
-             return ENOPROTOOPT .
-
-
-SYSCTL
+SYSCTL VALUES
         These  parameters  may  only  be  accessed  through  their  files  in
-       /proc/sys/net/rds/ . Access through sysctl(2) is not supported.
-
+       /proc/sys/net/rds.  Access through sysctl(2) is not supported.
  
         pf_rds This file contains the string  representation  of         the  protocol
               family  constant passed to socket(2) to create a new RDS socket.
  
-
         sol_rds
               This file contains the string representation of the socket level
               parameter  that  is passed to getsockopt(2) and setsockopt(2) to
               manipulate RDS socket options.
  
-
-       sndbuf_default_bytes
-             This parameter determines the initial value of RDS_SNDBUF  on  a
-             newly  created socket.  New values written to this file must not
-             be  less  than  sndbuf_min_bytes  and  not  greater  than   snd-
-             buf_max_bytes
-
-       sndbuf_max_bytes
-             This   parameter  determines  the  maximum  value  of  the  snd-
-             buf_default_bytes and sndbuf_min_bytes parameters.  It  can  not
-             be  greater  than the number of bytes represented in an unsigned
-             32bit integer (4 gigabytes).
-
-       sndbuf_min_bytes
-             This  parameter  determines  the  minimum  value  of  the   snd-
-             buf_default_bytes  and  sndbuf_max_bytes parameters.  It can not
-             be less than 0.
-
-       reconnect_delay_min_ms
-             This parameter determines the minimum amount of time  that  will
-             pass  before  attempting  to  reconnect to a peer after a failed
-             connect attempt.
-
-       reconnect_delay_max_ms
-             This parameter determines the maximum amount of time  that  will
-             seperate  reconnect  attempts.   The  reconnect delay approaches
-             this by exponentially increasing the minimum delay.
-
+       max_unacked_bytes and max_unacked_packets
+             These parameters are used to tune the generation of acknowledge‐
+             ments.  By  default,  the system receiving RDS messages does not
+             send back explicit acknowledgements unless it transmits  a  mes‐
+             sage  of  its own (in which case the ACK is piggybacked onto the
+             outgoing message), or when the sending system requests an ACK.
+
+             However, the sender needs to see an ACK from  time  to  time  so
+             that  it can purge old messages from the send queue. The unacked
+             bytes and packet counters are used to keep  track  of  how  much
+             data  has been sent without requesting an ACK. The default is to
+             request an acknowledgement every 16 packets,  or  every  16  MB,
+             whichever comes first.
+
+       reconnect_delay_min_ms and reconnect_delay_max_ms
+             RDS  uses  host-to-host  connections  to  transport RDS messages
+             (both for the TCP and the Infiniband transport). If this connec‐
+             tion  breaks,  RDS  will  try  to  re-establish  the connection.
+             Because this reconnect may be triggered by  both  hosts  at  the
+             same  time and fail, RDS uses a random backoff before attempting
+             a reconnect. These two parameters specify the minimum and  maxi‐
+             mum  delay  in  milliseconds. The default values are 1 and 1000,
+             respectively.
  
  SEE ALSO
-       socket(2), bind(2), sendmsg(2), recvmsg(2),  getsockopt(2).   setsock-
-       opt(2).
+       rds-rdma(7), socket(2), bind(2), sendmsg(2), recvmsg(2), getsockopt(2),
+       setsockopt(2).
  
  
  
-Linux Man Page                                                         RDS(7)
+                                                                       RDS(7)
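
Note on the congestion monitor macros documented in the "Polling" section:
the sketch below shows one way an application might build and test the 64bit
mask. It is only an illustration under stated assumptions. The helper names
cong_mask_for_ports and cong_update_covers_port are hypothetical and not part
of RDS, ports are assumed to be given in host byte order, and the shift uses a
64-bit constant (1ULL) so that bits 32 to 63 are representable.

```c
#include <stddef.h>
#include <stdint.h>

/* Macros as documented in the README, using a 64-bit constant for the shift. */
#define RDS_CONG_MONITOR_SIZE   64
#define RDS_CONG_MONITOR_BIT(port)  (((unsigned int) (port)) % RDS_CONG_MONITOR_SIZE)
#define RDS_CONG_MONITOR_MASK(port) (1ULL << RDS_CONG_MONITOR_BIT(port))

/*
 * Hypothetical helper: OR together the mask bits for every port the
 * application cares about. The result is the value one would hand to
 * setsockopt(2) with RDS_CONG_MONITOR.
 */
static inline uint64_t cong_mask_for_ports(const uint16_t *ports, size_t n)
{
    uint64_t mask = 0;

    for (size_t i = 0; i < n; i++)
        mask |= RDS_CONG_MONITOR_MASK(ports[i]);
    return mask;
}

/*
 * Hypothetical helper: test whether the 64bit value carried in an
 * RDS_CMSG_CONG_UPDATE control message covers the group that 'port'
 * belongs to. Each bit stands for a group of ports (port modulo 64),
 * so a hit means "some port in this group became uncongested".
 */
static inline int cong_update_covers_port(uint64_t update, uint16_t port)
{
    return (update & RDS_CONG_MONITOR_MASK(port)) != 0;
}
```

Because each bit covers every port with the same value modulo 64, a match
only indicates that the destination may have become uncongested; the
application still has to retry the send and handle ENOBUFS again.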