r3378: Added DAPL documentation.

author James Lentini <jlentini@netapp.com>

Mon, 12 Sep 2005 19:14:43 +0000 (19:14 +0000)

committer James Lentini <jlentini@netapp.com>

Mon, 12 Sep 2005 19:14:43 +0000 (19:14 +0000)
diff --git a/doc/dapl_coding_style.txt b/doc/dapl_coding_style.txt

new file mode 100644 (file)

index 0000000..74e44f1
--- /dev/null
+++ b/doc/dapl_coding_style.txt
@@ -0,0 +1,264 @@
+#######################################################################
+#                                                                     #
+# DAPL Coding style reference                                         #
+#                                                                     #
+# Steve Sears                                                         #
+# sjs2 at users.sourceforge.net                                       #
+#                                                                     #
+# 12/13/2002                                                          #
+#                                                                     #
+#######################################################################
+
+======================================================================
+Introduction
+======================================================================
+
+The purpose of this document is to establish the coding style adopted by
+the team implementing the DAPL reference implementation. The rules
+presented here were arrived at by consensus; they are intended to
+provide consistency of implementation and make it intuitive to work with
+the source code.
+
+======================================================================
+Source code conventions
+======================================================================
+
+1. Brackets
+
+   Brackets should follow C99 conventions and declare a block. The
+   following convention is followed:
+
+   if (x)
+   {
+       statement;
+       statement;
+   }
+
+   The following bracket styles are to be avoided:
+
+   K&R style:
+
+   if (x) {            /* DON'T DO THIS */
+       statement;
+   }
+
+   GNU style:
+
+   if (x)              /* DON'T DO THIS */
+       {
+       statement;
+       }
+
+   Statements are always indented from brackets.
+
+   Brackets are always used for any statement in order to avoid dangling
+   clause bugs. E.g.
+
+   RIGHT:
+       if ( x )
+       {
+           j = 0;
+       }
+
+   WRONG:
+       if ( x )
+           j = 0;
+
+2. Indents
+
+   Indents are always 4 spaces; tabs are 8 columns. A tab may serve as
+   a double indent. Many of the reference implementation files have an
+   emacs format statement at the bottom.
+
+3. Comments
+
+   Comments are always full C style comments, and never C++
+   style. Comments take the form:
+
+   /*
+    * comment
+    */
+
+4. Variable Declarations
+
+   Variables are always declared on their own line; we do not declare
+   multiple variables on the same line.
+
+   Variables are never initialized in their declaration; they are
+   initialized in the body of the code.
+
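+   As an illustration of both rules, here is a minimal sketch. The
+   function and variable names are hypothetical and are not taken from
+   the DAPL source; the IN macro is assumed to be an empty direction
+   marker, as used in the function declaration rule.
+

```c
#include <stddef.h>

#define IN      /* assumed stand-in for the empty DAT direction marker */

/*
 * Hypothetical helper, used only to illustrate the declaration rules.
 */
size_t
example_count_nonzero (
    IN      const int       *values,
    IN      size_t          num_values )
{
    size_t      count;
    size_t      i;

    /*
     * Initialization happens in the body, not in the declarations.
     */
    count = 0;

    for ( i = 0; i < num_values; i++ )
    {
        if ( values[i] != 0 )
        {
            count++;
        }
    }

    return count;
}
```

+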
+5. Function Declarations
+
+   The return type of a function is declared on a separate line from the
+   function name.
+
+   Parameters each receive a line and should be clearly labeled as IN
+   or OUT or INOUT. Parameter declarations begin one tab stop from the
+   margin.
+
+   For example:
+
+   DAT_RETURN
+   dapl_function (
+       IN      DAT_IA_HANDLE           ia_handle,
+       OUT     DAT_EP_HANDLE           *ep_handle )
+   {
+       ... function body ...
+   }
+
+6. White space
+
+   Don't be afraid of white space; the goal is to make the code readable
+   and maintainable. We use white space:
+
+   - One space following function names or conditional expressions. It
+     might be better to say one space before any open parenthesis.
+
+   - Suggestion: One space following open parens and one space before
+     closing parens. Not all of the code follows this convention; use
+     your best judgment.
+
+   Example:
+
+        foo ( x1, x2 );
+
+7. Conditional code
+
+   We generally try to avoid conditional compilation, but there are
+   certain places where it cannot be avoided. Whenever possible, move
+   the conditional code into a macro or otherwise work to put it into an
+   include file that can be used by the platform (e.g. Linux or Windows
+   osd files), or by the underlying provider (e.g. IBM Torrent or
+   Mellanox Tavor).
+
+   Conditionals should be descriptive, and the associated #endif should
+   contain the declaration. E.g.
+
+   #ifdef THIS_IS_AN_EXAMPLE
+
+         /* code */
+
+   #endif /* THIS_IS_AN_EXAMPLE */
+
+   You may change the ending comment if a #else clause is present. E.g.
+   
+   #ifdef THIS_IS_AN_EXAMPLE
+         /* code */
+
+   #else
+         /* other code */
+
+   #endif /* !THIS_IS_AN_EXAMPLE */
+   
+
+======================================================================
+Naming conventions
+======================================================================
+
+1. Variable Names
+
+   Variable names for DAPL data structures generally follow their type
+   and should be the same in all source files. A few examples:
+
+   Handles
+   DAT_IA_HANDLE       ia_handle
+   DAT_EP_HANDLE       ep_handle
+
+   Pointers
+
+   DAPL_IA             *ia_ptr;
+   DAPL_EP             *ep_ptr;
+
+2. Return Code Names
+
+   There are at least two different subsystems supported in the DAPL
+   reference implementation. In order to bring sanity to the error
+   space, return codes are named and used for their appropriate
+   subsystem. E.g.
+
+   ib_status:  InfiniBand status return code
+   dat_status: DAT/DAPL return code
+
+3. Function Names
+
+   Function names describe the scope to which they apply. There are
+   essentially three names in the reference implementation:
+
+   dapl_*      Name of an exported function visible externally.
+              These functions have a 1 to 1 correspondence to
+              their DAT counterparts.
+
+   dapls_*     Name of a function that is called from more than one
+              source file, but is limited to a subsystem.
+
+   dapli_*     Local function, internal to a file. Should always be
+              of type STATIC.
+
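+   The three scopes can be sketched as follows. All three functions
+   are hypothetical and exist only to show the prefixes; the STATIC
+   macro is assumed to expand to the C keyword 'static'.
+

```c
#define STATIC  static      /* assumed stand-in for DAPL's STATIC macro */

/*
 * dapli_*: local helper, visible only in this file.
 */
STATIC int
dapli_example_in_range (
    int         value )
{
    return ( value > 0 );
}

/*
 * dapls_*: shared by several source files of one subsystem,
 * but not exported to DAT consumers.
 */
int
dapls_example_check (
    int         value )
{
    return dapli_example_in_range ( value );
}

/*
 * dapl_*: exported entry point; a real one would mirror a dat_*
 * function one to one.
 */
int
dapl_example_entry (
    int         value )
{
    int         ok;

    ok = dapls_example_check ( value );

    return ok ? 0 : -1;
}
```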
+
+======================================================================
+Util files
+======================================================================
+
+The Reference implementation is organized such that a single, exported
+function is located in its own file. If you are trying to find the DAPL
+function to create an End Point, it will be found in the file named
+after the dapl version of the DAT function in the spec. E.g.
+
+dapl_ep_create() is found in dapl_ep_create.c
+dapl_evd_free() is found in dapl_evd_free.c
+
+It is often the case that the implementation must interact with data
+structures or call into other subsystems. All utility functions for a
+subsystem are gathered into the appropriate "util" file.
+
+For example, dapl_ep_create must allocate a DAPL_EP structure. The
+routine to allocate and initialize memory is found in the
+dapl_ep_util.c file and is named dapl_ep_alloc(). Appropriate routines
+for the util file are:
+
+    - Alloc
+    - Free
+    - Assign defaults
+    - linking routines
+    - Check restrictions
+    - Perform operations on a data structure.
+
+The util file is an object-oriented idea applied to a non-OO
+language. It encourages a clean implementation.
+
+For each util.c file, there is also a util.h file. The purpose of the
+util include file is to define the prototypes for the util file, and to
+supply any local flags or values necessary to the subsystem.
+
+======================================================================
+Include files, prototypes
+======================================================================
+
+Include files are organized according to subsystem and/or OS
+platform. The include directory contains files that are global to the
+entire source set. Prototypes are found in include files that pertain to
+the data they support.
+
+Commenting on the DAPL Reference Implementation tree:
+
+          dapl/common
+          dapl/include
+               Contains global dapl data structures, symbols, and
+               prototypes
+          dapl/tavor
+               Contains tavor prototypes and symbols
+          dapl/torrent
+               Contains torrent prototypes and symbols
+          dapl/udapl
+               Contains include files to support udapl specific files
+          dapl/udapl/linux
+               Contains osd files for Linux
+          dapl/udapl/windows
+               Contains osd files for Windows
+
+For completeness, the dat files described by the DAT Specification are
+in the tree under the dat/ subdirectory:
+
+           dat/include/dat/
+
+
diff --git a/doc/dapl_end_point_design.txt b/doc/dapl_end_point_design.txt

new file mode 100644 (file)

index 0000000..d351d96
--- /dev/null
+++ b/doc/dapl_end_point_design.txt
@@ -0,0 +1,1129 @@
+#######################################################################
+#                                                                     #
+# DAPL End Point Management Design                                    #
+#                                                                     #
+# Steve Sears                                                         #
+# sjs2 at users.sourceforge.net                                       #
+#                                                                     #
+# 10/04/2002                                                          #
+# Updates                                                             #
+#   02/06/04                                                          #
+#   10/07/04                                                          #
+#                                                                     #
+#######################################################################
+
+
+======================================================================
+Referenced Documents
+======================================================================
+
+uDAPL: User Direct Access Programming Library, Version 1.1.  Published
+05/08/2003.  http://www.datcollaborative.org/uDAPL_050803.pdf.
+Referred to in this document as the "DAT Specification".
+
+InfiniBand Access Application Programming Interface Specification,
+Version 1.2, 4/15/2002.  In DAPL SourceForge repository at
+doc/api/access_api.pdf.  Referred to in this document as the "IBM
+Access API Specification".
+
+InfiniBand Architecture Specification Volume 1, Release 1.0.a.  Referred
+to in this document as the "InfiniBand Spec".
+
+======================================================================
+Introduction to EndPoints
+======================================================================
+
+An EndPoint is the fundamental channel abstraction for the DAT API. An
+application communicates and exchanges data using an EndPoint. Most of
+the time EndPoints are explicitly allocated, but there is an exception
+whereby a connection event can yield an EndPoint as a side effect; this
+is not supported by all transports or implementations, but it is
+supported in the InfiniBand reference implementation.
+
+Each DAT API function is implemented in a file named 
+
+     dapl_<function name>.c
+
+There is a simple mapping provided by the dat library that maps dat_* to
+dapl_*.  For example, dat_pz_create is implemented in dapl_pz_create.c.
+Other examples:
+
+  DAT                   DAPL                    Found in
+  ------------          ---------------         ------------------
+  dat_ep_create          dapl_ep_create          dapl_ep_create.c
+  dat_ep_query           dapl_ep_query           dapl_ep_query.c
+
+There are very few exceptions to this naming convention; the Reference
+Implementation tried to be consistent.
+
+There are also dapl_<object name>_util.{h,c} files for each object.  For
+example, there are dapl_pz_util.h and dapl_pz_util.c files which contain
+common helper functions specific to the 'pz' subsystem.  The use of util
+files follows the convention used elsewhere in the DAPL reference
+implementation.  These files contain common object creation and
+destruction code, linked list manipulation, and other helper functions.
+
+This implementation has a simple naming convention designed to alert
+someone reading the source code to the nature and scope of a
+function. The convention is in the function name, such that:
+
+       dapl_   Primary entry from a dat_ function, e.g. 
+               dapl_ep_create(), which mirrors dat_ep_create(). 
+       dapls_  The 's' restricts it to the subsystem, e.g. the
+               'ep' subsystem. dapls_ functions are not exposed
+               externally, but are internal to dapl.
+       dapli_  The 'i' restricts the function to the file where it 
+               is declared. These functions are always 'static' C
+               functions.
+
+This convention is not followed as consistently as we would like, but is
+common in the reference implementation.
+
+1. End Points (EPs)
+-------------------------
+
+DAPL End Points provide a channel abstraction necessary to transmit and
+receive data. EPs interact with Service Points, either Public Service
+Points or Reserved Service Points, to establish a connection from one
+provider to another.
+
+The primary EP entry points in the DAT API as they relate to DAPL are
+listed in the following table:
+
+  dat_ep_create
+  dat_ep_query
+  dat_ep_modify
+  dat_ep_connect
+  dat_ep_dup_connect
+  dat_ep_disconnect
+  dat_ep_post_send
+  dat_ep_post_recv 
+  dat_ep_post_rdma_read
+  dat_ep_post_rdma_write
+  dat_ep_get_status
+  dat_ep_reset
+  dat_ep_free
+
+Additionally, the following connection functions interact with
+EndPoints:
+  dat_psp_create
+  dat_psp_query
+  dat_psp_free
+  dat_rsp_create
+  dat_rsp_query
+  dat_rsp_free
+  dat_cr_accept
+  dat_cr_reject
+  dat_cr_query
+  dat_cr_handoff
+
+The reference implementation maps the EndPoint abstraction onto an
+InfiniBand Queue Pair (QP).
+
+The DAPL_EP structure is used to maintain the state and components of
+the EP object and the underlying QP. As will be explained below, keeping
+track of the QP state is critical for successful operation. Access to
+the DAPL_EP fields is done atomically.
+
+
+======================================================================
+Goals
+======================================================================
+
+Initial goals
+-------------
+-- Implement all of the dat_ep_* calls described in the DAT
+   Specification. 
+
+-- Implement connection calls described in the DAT Specification with
+   the following exception:
+   - dat_cr_handoff. This is best done with kernel mediation, and is
+     therefore out of scope for the reference implementation.
+
+-- The implementation should be as portable as possible, to facilitate
+   HCA vendors' efforts to implement vendor-specific versions of DAPL.
+
+-- The implementation must be able to work during ongoing development
+   of provider software agents, drivers, etc.
+
+Later goals
+-----------
+-- Examine various possible performance optimizations.  This document
+   lists potential performance improvements, but the specific
+   performance improvements implemented should be guided by customer
+   requirements.  
+
+======================================================================
+Requirements, constraints, and design inputs
+======================================================================
+
+The EndPoint is the base channel abstraction. An EndPoint must be
+established before data can be exchanged with a remote node. The
+EndPoint is mapped to the underlying InfiniBand QP channel abstraction.
+When a connection is initiated, the InfiniBand Connection Manager will
+be solicited. The implementation is constrained by the capabilities and
+behavior of the underlying InfiniBand facilities.
+
+Note that transports other than InfiniBand may not need to rely on
+Connection Managers or other infrastructure; this is an artifact of
+this transport.
+
+An EP is not an exact match to an InfiniBand QP; the differences
+introduce constraints that are not obvious. There are three primary
+areas of conflict between the DAPL and InfiniBand models:
+
+1) EP and QP creation differences
+2) Provider-provided EPs on passive side of connections
+3) Connection timeouts
+
+-- EP and QP creation
+
+The most obvious difference between an EP and a QP is the presence of a
+memory handle when the object is created. InfiniBand requires a
+Protection Domain (PD) be specified when a QP is created; in the DAPL
+world, a Protection Zone (PZ) maps to an InfiniBand Protection Domain.
+DAPL does not require a PZ to be present when an EP is created, and that
+introduces two problems:
+
+1) If a PZ is NULL when an EP is created, a QP will not be bound to
+   the EP until dat_ep_modify() is used to assign it later. A PZ is
+   required before RECV requests can be posted and before a connection
+   can be established.
+
+2) If a DAPL user changes the PZ on an EP before it is connected,
+   DAPL must release the current QP and create a new one with a
+   new Protection Domain.
+
+-- Provider-provided EPs on connection
+
+The second area where the DAPL and IB models conflict is a direct result
+of the requirement to specify a Protection Domain when a QP is created.
+
+DAPL allows a PSP to be created in such a way that an EP will
+automatically be provided to the user when a connection occurs. This is
+not critical to the DAPL model but in fact does provide some convenience
+to the user. InfiniBand provides a similar mechanism, but with an
+important difference: InfiniBand requires the user to supply the
+Protection Domain for the passive connection endpoint that will be
+supplied to all QPs created as a result of connection requests; DAPL
+mandates a NULL PZ and requires the user to change the PZ before using
+the EP.
+
+The reference implementation creates an 'empty' EP when the user
+specifies the DAT_PSP_PROVIDER flag; it is empty in the sense that a QP
+is not attached to the EP. Before the user can dat_cr_accept the
+connection, the EP must be modified to have a PZ bound to it, which in
+turn will cause a QP to be bound to the EP.
+
+To keep track of the current state of the EP, the DAPL_EP structure
+has a qp_state field. The type of this field is specific to the
+provider and the states are provider-specified states for a particular
+transport, with the addition of a single state from dapl:
+DAPL_QP_STATE_UNATTACHED, indicating that no QP has been bound to the
+EP. The qp_state field is an open enumerator, containing a single DAPL
+state in addition to the states specified by the provider.
+DAPL_QP_STATE_UNATTACHED is arbitrarily defined to be 0xFFF0, a
+value selected strictly because it has the property that it will not
+collide with provider states; if this is not true, this value must be
+changed such that it is unique.
+
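+A minimal sketch of the open enumerator follows. The provider state
+names and the EXAMPLE_EP structure are hypothetical; only the
+DAPL_QP_STATE_UNATTACHED name and its 0xFFF0 value come from the text
+above.
+

```c
/*
 * Hypothetical provider QP states; a real provider supplies its own.
 */
typedef enum example_ib_qp_state
{
    EXAMPLE_IB_QP_STATE_RESET = 0,
    EXAMPLE_IB_QP_STATE_INIT  = 1,
    EXAMPLE_IB_QP_STATE_RTR   = 2,
    EXAMPLE_IB_QP_STATE_RTS   = 3,
    EXAMPLE_IB_QP_STATE_ERROR = 4
} EXAMPLE_IB_QP_STATE;

/*
 * The single DAPL state; it must not collide with any provider state.
 */
#define DAPL_QP_STATE_UNATTACHED    0xFFF0

typedef struct example_ep
{
    unsigned int    qp_state;   /* provider state or UNATTACHED */
} EXAMPLE_EP;

/*
 * The only question the common layer can ask of qp_state:
 * is a QP bound to this EP?
 */
int
example_ep_has_qp (
    const EXAMPLE_EP    *ep_ptr )
{
    return ( ep_ptr->qp_state != DAPL_QP_STATE_UNATTACHED );
}
```

+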
+The common layer of DAPL only looks at this single value for qp_state;
+it cannot be aware of states that are unique to the provider. However,
+the provider layer is free to update this field and may use it as a
+cache for current QP state. The field must be updated when a QP (or
+other endpoint resource) is bound to the EP.
+
+DAPL 1.2 provides DAT level states that will make this obsolete, but it
+exists in pre DAPL 1.2 code.
+
+
+-- Connection Timeouts
+
+The third difference in the DAPL and InfiniBand models has to do with
+timeouts on connections. InfiniBand does not provide a way to specify a
+connection timeout, so it will wait indefinitely for a connection to
+occur. dat_ep_connect supports a timeout value providing the user with
+control over how long they are willing to wait for a connection to
+occur.
+
+DAPL maintains a timer thread to watch over pending connections. A
+shared timer queue has a sorted list of timeout values. If a timeout
+is requested, dapl_ep_connect() will invoke dapls_timer_set(), which
+will add a timer record to the sorted list of timeouts. The timeout
+thread is started lazily: that is, it isn't started until a timeout is
+requested. Once a timeout has been requested, the thread will continue
+to exist until the application terminates.
+
+The timer record is actually a part of the DAPL_EP structure, so there
+are no extra memory allocations required for timeouts. dapls_timer_set()
+will initialize the timer record and insert it into the sorted queue at
+the appropriate place. If this is the first record, or is inserted
+before the first record (which will be the 'next' timeout to expire),
+the timer thread will be awakened so it can recalculate how long it must
+sleep until the timeout occurs.
+
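+The sorted insertion described above might look like the following
+sketch. The EXAMPLE_TIMER type and example_timer_insert() are
+hypothetical and do not reproduce dapls_timer_set(); locking of the
+shared queue and the actual thread wakeup are omitted.
+

```c
#include <stddef.h>

typedef struct example_timer
{
    unsigned long           expires;    /* absolute expiration time */
    struct example_timer    *next;
} EXAMPLE_TIMER;

/*
 * Insert 'timer' into the list sorted by ascending expiration time.
 * Returns 1 if the new record became the head of the list, in which
 * case the caller would wake the timer thread so it can recalculate
 * its sleep time; returns 0 otherwise.
 */
int
example_timer_insert (
    EXAMPLE_TIMER   **head,
    EXAMPLE_TIMER   *timer )
{
    EXAMPLE_TIMER   **link;

    link = head;

    /*
     * Walk past every record that expires no later than the new one.
     */
    while ( *link != NULL && (*link)->expires <= timer->expires )
    {
        link = &(*link)->next;
    }

    timer->next = *link;
    *link = timer;

    return ( link == head );
}
```

+Because the record is embedded in the caller's structure, the insert
+itself allocates no memory, which matches the description above.
+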
+When a timeout does occur, the timeout code will cancel the connection
+request by invoking the provider routine dapls_ib_disconnect_clean(). 
+This allows the software module with explicit knowledge of the provider
+to take appropriate action and cancel the connection attempt. As a side
+effect, the EP will be placed into the UNCONNECTED state, and the QP
+will be in the ERROR state. A side effect of this state change is that
+all DTOs will be flushed. The provider must support a mechanism to 
+completely cancel a connection request.
+
+
+======================================================================
+DAPL EP Subsystem Design
+======================================================================
+
+In section 6.5.1 of the DAT Specification there is a UML state
+transition diagram for an EndPoint which goes over the transitions and
+states during the lifetime of an EP. It is nearly impossible to read.
+The reference implementation is faithful to the DAT Spec and is
+believed to be correct.
+
+This description of the EP will follow from creation to connection to
+termination. It will also discuss the source code organization as this
+is part of the design expression.
+
+-- EP and QP creation
+
+The preamble to creating an EP requires us to verify the attributes
+specified by the user. If a user were to specify max_recv_dtos as 0, for
+example, the EP would not be useful in any regard. If the user does not
+provide EP attrs, the DAPL layer will supply a set of common defaults
+resulting in a reasonable EP. The defaults are set up in 
+dapli_ep_default_attrs(), and the default values are given at the top of
+dapl_ep_util.c. Non-InfiniBand transports will want to examine these
+values to make sure they are 'reasonable'. This simplistic mechanism may
+change in the future.
+
+A number of handles are bound to the EP, so a reference count is taken
+on each of them. All reference counts in the DAPL system are incremented
+or decremented using atomic operations; it is important to always use
+the OS dependent atomic routines and not substitute a lock, as it will
+not be observed elsewhere in the system and will have unpredictable
+results.
+
+Reference counts are taken if there are non NULL values on any of:
+         pz_handle
+         connect_evd_handle
+         recv_evd_handle
+         request_evd_handle
+
+The purpose of reference counts should be obvious: to prevent premature
+release of resources that are still being used.
+
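+The reference counting described above can be sketched with C11
+atomics. The reference implementation uses its own OS dependent atomic
+routines; <stdatomic.h> is substituted here only to keep the example
+self-contained, and all names are hypothetical.
+

```c
#include <stdatomic.h>

typedef struct example_evd
{
    atomic_int      ref_count;
} EXAMPLE_EVD;

/*
 * Take a reference, e.g. when an EP is bound to this EVD.
 */
void
example_ref_take (
    EXAMPLE_EVD     *evd_ptr )
{
    atomic_fetch_add ( &evd_ptr->ref_count, 1 );
}

/*
 * Release a reference, e.g. when the EP is freed or rebound.
 */
void
example_ref_release (
    EXAMPLE_EVD     *evd_ptr )
{
    atomic_fetch_sub ( &evd_ptr->ref_count, 1 );
}

/*
 * A free routine would refuse to release the object while
 * references remain.
 */
int
example_evd_can_free (
    EXAMPLE_EVD     *evd_ptr )
{
    return ( atomic_load ( &evd_ptr->ref_count ) == 0 );
}
```

+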
+As has been discussed above, each EP is bound to a QP before it can be
+connected. If a valid PZ is provided at creation time then a QP is bound
+to the EP immediately. If the user later uses ep_modify() to change the
+PZ, the QP will be destroyed and a new one created with the appropriate
+Protection Domain.
+
+Finally, an EP is an IA resource and is linked onto the EP chain of the
+superior IA. EPs linked onto an IA are assumed to be complete, so this
+is the final step of EP creation.
+
+After an EP is created, the ep_state will be DAT_EP_STATE_UNCONNECTED
+and the qp_state will either be DAPL_QP_STATE_UNATTACHED or assigned by
+the provider layer (e.g. IB_QP_STATE_INIT). The qp_state indicates the QP
+binding and the current state of the QP.
+
+A qp_state of DAPL_QP_STATE_UNATTACHED indicates there is no QP bound
+to this EP. This is the result of a NULL PZ when dat_ep_create() was
+invoked, as explained in detail above. The user must call
+dat_ep_modify() and install a valid PZ before the EP can be used.
+
+When an InfiniBand QP is created it is in the RESET state, which is
+specified in the InfiniBand Spec, section 10.3. However, DAPL creates
+the EP in the UNCONNECTED state and requires an unconnected EP to be
+able to queue RECV requests before a connection occurs. The InfiniBand
+spec allows RECV requests to be queued on a QP if the QP is in the INIT
+state, so after creating a QP the DAPL provider code must transition it
+to the INIT state.
+
+There is a mapping between the DAPL EP state and the InfiniBand QP
+state. DAT_EP_STATE_UNCONNECTED, for an EP with a QP attached, indicates
+the underlying QP is in the INIT state. This is critical: RECV DTOs can
+be posted on an EP in the UNCONNECTED state, so the underlying QP must
+be in the appropriate state to allow this to happen.
+
+There is an obvious design tradeoff in transitioning the QP
+state. Immediately moving the state to INIT takes extra time at creation
+but allows immediate posting of RECV operations; however, it will
+involve a more complex tear down procedure if the QP must be replaced as
+a side effect of a dat_ep_modify operation. The alternative would be to
+delay transitioning the QP to INIT until a post operation is invoked,
+but that requires a run time check for every post operation. This design
+assumes users will infrequently cause a QP to be replaced after it is
+created and prefer to pay the state transition penalty at creation time.
+
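+The creation and PZ replacement behavior described in this section can
+be sketched as a small state model. Everything here is hypothetical:
+the structure, the state constants other than the 0xFFF0 value given
+earlier, and the function names. No real QP is created; the
+qp_generation counter only records how many QPs would have been
+allocated.
+

```c
#include <stddef.h>

#define EXAMPLE_QP_STATE_RESET          0
#define EXAMPLE_QP_STATE_INIT           1
#define EXAMPLE_QP_STATE_UNATTACHED     0xFFF0

typedef struct example_ep
{
    const void      *pz_handle;
    unsigned int    qp_state;
    unsigned int    qp_generation;  /* number of QPs created so far */
} EXAMPLE_EP;

static void
example_qp_alloc (
    EXAMPLE_EP      *ep_ptr )
{
    /*
     * A new QP starts in RESET and is moved to INIT right away so
     * that RECV requests can be posted before a connection exists.
     */
    ep_ptr->qp_state = EXAMPLE_QP_STATE_RESET;
    ep_ptr->qp_generation++;
    ep_ptr->qp_state = EXAMPLE_QP_STATE_INIT;
}

void
example_ep_create (
    EXAMPLE_EP      *ep_ptr,
    const void      *pz_handle )
{
    ep_ptr->pz_handle = pz_handle;
    ep_ptr->qp_state = EXAMPLE_QP_STATE_UNATTACHED;
    ep_ptr->qp_generation = 0;

    if ( pz_handle != NULL )
    {
        example_qp_alloc ( ep_ptr );
    }
}

void
example_ep_modify_pz (
    EXAMPLE_EP      *ep_ptr,
    const void      *pz_handle )
{
    if ( pz_handle == ep_ptr->pz_handle )
    {
        return;
    }

    /*
     * A changed PZ means any existing QP is released and a new one
     * is created under the new Protection Domain.
     */
    ep_ptr->qp_state = EXAMPLE_QP_STATE_UNATTACHED;
    ep_ptr->pz_handle = pz_handle;

    if ( pz_handle != NULL )
    {
        example_qp_alloc ( ep_ptr );
    }
}
```

+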
+-- EP Query and Modify operations
+
+Because all of the ep_param data are kept up to date in the dapl_ep
+structure, and because they use the complete DAT specified structure, a
+query operation is trivial: a simple assignment from the internal
+structure to the user parameter. uDAPL allows the implementation to
+either return the fields specified by the user, or to return more than
+the user requested; the reference implementation does the latter.  It is
+simpler and faster to copy the entire structure rather than to determine
+which of all of the possible fields the user requested.
+
+dat_ep_query() requires the implementation to report the address of the
+remote node, if the EP is connected. This is different from standard
+InfiniBand, if only because of the difference in name space. InfiniBand
+has the information on the remote LID, but it does not have the remote
+IP address, which is what DAT specifies. The reference implementation
+makes use of a lookup/name-service called ATS (Address Translation
+Service), which is built using the InfiniBand Subnet Administrator. ATS
+is InfiniBand only; other transports will use a different mechanism.
+
+A driver will register itself and one or more IP addresses with ATS
+at some point before a connection can be made. How the addresses are
+provided to the driver, or how this is managed by the driver is not
+specified. The ATS proposal is available from the DAT Collaborative.
+
+When dat_ep_query() is invoked on a connected EP, it will request the
+remote address from the provider layer. The provider layer will use
+whatever means are necessary to obtain the IP address of the other end of
+the connection. The results are placed into a buffer that is part of the
+EP structure. Finally, the address of the EP structure is placed into
+the ep_param.remote_ia_address_ptr field.
+
+The ep_modify operation will modify the fields in the DAT_PARAM
+structure. There are some fields that cannot be updated, and there are
+others that can only be updated if the EP is in the correct state. The
+uDAPL spec outlines the EP states permitting ep modifications, but
+generally they are DAT_EP_STATE_UNCONNECTED and
+DAT_EP_STATE_PASSIVE_CONNECTION_PENDING.
+
+When replacing EVD handles it is a simple matter of releasing a
+reference on the previous handle and taking a new reference on the new
+handle. The Reference Implementation manages resource tracking using
+reference counts, which guarantees a particular handle will not be
+released prematurely. Reference counts are checked in the free routines
+of various objects.
+
+As has been mentioned previously, if the PZ handle is changed then the
+QP must be released, if already assigned, and a new QP must be created
+to bind to this EP.
+
+There are some fields in the DAT_PARAM structure that are related to the
+underlying hardware implementation. For these values DAPL will do a
+fresh query of the QP, rather than depend on stale values. Even so, the
+values returned are 'best effort' as a competing thread may change
+certain values before the requesting thread has the opportunity to read
+them. Applications should protect against this.
+
+Finally, the underlying provider is invoked to update the QP with new
+values, but only if some of the attributes have been changed.  As is
+true of most of the implementation, we only invoke the provider code
+when necessary.
+
+======================================================================
+Connections
+======================================================================
+
+There are of course two sides to a connection, and in the DAPL model
+there is an Active and a Passive side. For clarity, the Passive side
+is a server waiting for a connection, and the Active side is a client
+requesting a connection from the Passive server. We will discuss each
+of these in turn.
+
+Connections happen in the InfiniBand world by using a Connection Manager
+(CM) interface. Those unfamiliar with the IB model of addressing and
+management agents may want to familiarize themselves with these aspects
+of the IB spec before proceeding in this document. Be warned that the
+connection section of the IB spec is the most ambiguous portion of the
+spec.
+
+First, let's walk through a primitive diagram of a connection:
+
+
+SERVER (passive)                                CLIENT (active)
+---------------                                 ---------------
+1. dapl_psp_create
+   or dapl_rsp_create
+   [ now listening ]
+
+2.                                              dapl_ep_connect
+                           <-------------
+3. dapls_cr_callback
+   DAT_CONNECTION_REQUEST_EVENT
+   [ Create and post a DAT_CONNECTION_REQUEST_EVENT event ]
+
+4. Event code processing
+
+5. Create an EP if necessary
+   (according to the flags
+    when the PSP was created)
+
+6. dapl_cr_accept or dapl_cr_reject
+                           ------------->
+7.                                              dapl_evd_connection_callback
+                                                DAT_CONNECTION_EVENT_ESTABLISHED
+                                                [ Create and post a
+                                                  DAT_CONNECTION_EVENT_ESTABLISHED
+                                                  event ]
+
+8.                         <------------- RTU
+
+9. dapls_cr_callback
+   DAT_CONNECTION_EVENT_ESTABLISHED
+   [ Create and post a DAT_CONNECTION_EVENT_ESTABLISHED 
+     event ]
+
+10. ...processing...
+
+11. Either side issues a dat_ep_disconnect
+
+12.  dapls_cr_callback
+     DAT_CONNECTION_EVENT_DISCONNECTED
+
+   [ Create and post a 
+     DAT_CONNECTION_EVENT_DISCONNECTED
+     event ]
+
+13.                                             dapl_evd_connection_callback
+                                                DAT_CONNECTION_EVENT_DISCONNECTED
+                                                [ Create and post a
+                                                  DAT_CONNECTION_EVENT_DISCONNECTED
+                                                  event ]
+
+
+In the above diagram, time is numbered in the left hand column and is
+represented vertically.
+
+We will continue our discussion of connections using the above diagram,
+following a sequential order for connection establishment.
+
+There are in fact two types of service points detailed in the uDAPL
+specification. We will limit our discussion to PSPs for convenience, but
+there are only minor differences between PSPs and RSPs.
+
+The reader should observe that all passive-side connection events will
+be received by dapls_cr_callback(), and all active-side connection
+events occur through dapl_evd_connection_callback(). At one point during
+the implementation these routines were combined, as they are very
+similar, but subtle differences between them led to keeping them
+separate.
+
+Progressing through the series of events as outlined in the diagram
+above:
+
+1. dapl_psp_create
+
+   When a PSP is created, the final act will be to set it listening for
+   connections from remote nodes. It is important to realize that a
+   connection may in fact arrive from a remote node before the routine
+   setting up a listener has returned to dapl_psp_create; as soon as
+   dapls_ib_setup_conn_listener() is invoked connection callbacks may
+   arrive. To reduce race conditions this routine must be called as the
+   last practical operation when creating a PSP.
+
+   dapls_ib_setup_conn_listener() is provider specific. The key insight
+   is that the DAPL connection qualifier (conn_qual) will become the
+   InfiniBand Service ID. The passive side of the connection is now
+   listening for connection requests. It should be obvious that the
+   conn_qual must be unique.
+
+   InfiniBand allows a 64 bit connection qualifier, which is supported
+   by the DAT spec. IP based networks may be limited to 16 bits, so
+   provider implementations may want to return an error if the
+   conn_qual exceeds the maximum allowed by the transport.
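+
+   A provider-side range check of this kind could be sketched as
+   follows. The function name and the max_bits parameter are
+   hypothetical and are not part of DAPL; only the 64 bit and 16 bit
+   widths come from the text above.
+
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if conn_qual fits in a transport
 * whose service identifier is max_bits wide (1..64), 0 otherwise. */
static int conn_qual_fits(uint64_t conn_qual, unsigned max_bits)
{
    assert(max_bits >= 1 && max_bits <= 64);
    if (max_bits == 64)
        return 1;                        /* InfiniBand: any value fits */
    return (conn_qual >> max_bits) == 0; /* no bits above the limit */
}
```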
+
+2. dapl_ep_connect
+
+   The active side initiates a connection with dapl_ep_connect, which
+   will transition the EP into DAT_EP_STATE_ACTIVE_CONNECTION_PENDING.
+   Again, connections are in the domain of the provider's Connection
+   Manager and the mechanics are very much provider specific. The key
+   point is that a DAT_IA_ADDRESS_PTR must be translated to a GID
+   before a connection initiation can occur. This is discussed below.
+
+   InfiniBand supports different amounts of private data on various
+   connection functions. Other transports allow variable sizes of
+   private data with no practical limit. The DAPL connection code does
+   not enforce a fixed amount of private data, but rather offers the
+   user all the space it has, up to the limit specified by
+   DAPL_MAX_PRIVATE_DATA_SIZE.
+
+   Private data will be stored in a fixed buffer as part of the
+   connection record, which is the primary reason to limit the size.
+
+   To assist development on new transports that do not have a full
+   connection infrastructure in place, there are a couple of compile time
+   flags that will include certain code: CM_BUSTED and
+   IBHOSTS_NAMING. These are discussed below in more detail, but
+   essentially:
+
+   CM_BUSTED: fakes a connection on both sides of the wire, does not
+   transmit any private data.
+
+   IBHOSTS_NAMING: provides a simple IP_ADDRESS to LID translation
+   mechanism in a text file, which is read when the dapl library
+   loads. Private data is exchanged in this case, but it includes a
+   header that contains the remote IP address. Technically, this defines
+   a protocol and is in violation of the DAT spec, but it has proved
+   useful in development.
+
+3. dapls_cr_callback
+
+   The connection sequence is entirely event driven. An operation is
+   posted, then an asynchronous event will occur some time later. The
+   event may cause other actions to occur which may result in still
+   more events.
+
+   dapls_ib_setup_conn_listener() registered for a callback for
+   connection events, and we now receive a DAT event for a connection
+   request. The provider layer will translate the native event type to
+   a DAT event.
+
+   An upcall is invoked on the server side of the connection with an
+   event of type DAT_CONNECTION_REQUEST_EVENT. This is a unique event
+   in the callback code as it is the only case when an EP is not
+   already in play; in all other cases, it is possible to look up the
+   relevant EP for an operation.
+
+   Code exists to make sure the relevant connection object, the PSP or
+   RSP, is actually in a useful state and ready to be connected
+   to. One of the critical differences between a PSP and an RSP is
+   that an RSP is a one-shot connection object; once a connection
+   occurs, no other connections can be made to it.
+
+   There is a small difference in the InfiniBand and DAPL connection
+   models here as well. DAPL may disable a PSP at any time without
+   affecting current connections. When you tear down an InfiniBand
+   service endpoint, all of the connections are torn down too. Because
+   of this difference, when a DAPL app frees a PSP, only a state
+   change is made. The underlying service point is still available and
+   technically capable of receiving connections. If a connection
+   request arrives when the PSP is in this state, a rejection message
+   is sent such that the requesting node believes no service point is
+   listening.
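+
+   The admission check described above might look roughly like the
+   following sketch. All type, field, and function names here are
+   hypothetical; the real DAPL structures differ. It only illustrates
+   the two rules from the text: a freed PSP rejects new requests, and
+   an RSP accepts exactly one.
+
```c
#include <assert.h>
#include <stddef.h>

enum sp_kind { SP_PSP, SP_RSP };           /* hypothetical names */

struct service_point {
    enum sp_kind kind;
    int listening;      /* cleared when the application frees a PSP */
    int connections;    /* connection requests admitted so far */
};

/* Returns 1 if the request may proceed, 0 if it must be rejected so
 * that the remote node believes no service point is listening. */
static int sp_admit_request(struct service_point *sp)
{
    assert(sp != NULL);
    if (!sp->listening)
        return 0;                  /* freed PSP, or RSP already used */
    if (sp->kind == SP_RSP)
        sp->listening = 0;         /* one-shot: no further connections */
    sp->connections++;
    return 1;
}
```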
+
+   Once the connection has been examined, it will continue with the
+   connection protocol. The EP will move to a CONNECTION_PENDING
+   state.
+
+   The connection request will cause a CR record to be allocated,
+   which holds all of the important connection request
+   information. The CR record will be linked onto the PSP structure
+   for retrieval in the future when other requests arrive.
+
+   The astute reader of the spec will observe that there is not a
+   dapl_cr_create call: CR records are created as part of a connection
+   attempt on the passive side of the connection. A CR is created now
+   and set up.  A point that will become important later, caps for
+   emphasis:
+
+   A CR WILL EXIST FOR THE LIFE OF A CONNECTION; IT IS DESTROYED AT
+   DISCONNECT TIME.
+
+   In the connection request processing a CR and an EVENT are created;
+   the event will be posted along with the connection information just
+   received.
+
+   Private data is also copied into the CR record. Private data
+   arrived with the connection request and is not a permanent
+   resource, so it is copied into the dapl space to be used at a later
+   time. Different transports have varying capabilities on the size of
+   private data, so a call to the provider is invoked to determine how
+   big it actually is. There is an upper bound on the amount of
+   private data the implementation will deal with, set at
+   DAPL_MAX_PRIVATE_DATA_SIZE (256 bytes at this writing).
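+
+   A minimal sketch of copying transient private data into a fixed
+   buffer follows. The struct and function names are hypothetical, and
+   truncating oversized data (rather than rejecting it) is an
+   assumption made for illustration; only the 256 byte limit comes
+   from the text above.
+
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DAPL_MAX_PRIVATE_DATA_SIZE 256  /* value quoted in the text */

/* Hypothetical stand-in for the private data area of a CR record. */
struct cr_private {
    unsigned char data[DAPL_MAX_PRIVATE_DATA_SIZE];
    size_t        size;   /* number of valid bytes in data */
};

/* Copy at most DAPL_MAX_PRIVATE_DATA_SIZE bytes of transient provider
 * data into the fixed buffer; returns the number of bytes kept. */
static size_t cr_save_private_data(struct cr_private *dst,
                                   const void *src, size_t src_size)
{
    assert(dst != NULL);
    assert(src != NULL || src_size == 0);
    dst->size = src_size < sizeof dst->data ? src_size : sizeof dst->data;
    if (dst->size > 0)
        memcpy(dst->data, src, dst->size);
    return dst->size;
}
```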
+
+4. Event code processing
+
+   The final stage in a connection request is to generate an event on
+   a connection EVD using dapls_evd_post_cr_arrival_event().
+
+5. Create an EP if necessary
+
+   When the app processes a connection event, it needs to respond. If
+   the PSP is configured to create an EP automatically, the callback
+   code has already done so, creating an EP with no attached QP.
+   Otherwise, the user must provide an EP to make the connection.
+
+   (4) and (5) are all done in user mode. The only interesting thing is
+   that when the user calls dat_cr_accept(), a ready EP must be
+   provided. If the EP was supplied by the PSP in the callback, it
+   must have a PZ associated with it and whatever other attributes
+   need to be set.
+
+6. dapl_cr_accept or dapl_cr_reject
+
+   For discussion purposes, we will follow the accept
+   path. dapl_cr_reject says you are done and there will be no further
+   events to deal with.
+
+   Assuming it accepts the connection for our example, the dapl code
+   will verify that an EP is in place and will deal with private data
+   that can be transmitted in a cr_accept call. The underlying
+   provider is invoked to complete this leg of the protocol.
+
+7. dapl_evd_connection_callback
+
+   An EVD callback is always a response to a connection oriented
+   request. As such, an EP is always present, and in fact is passed
+   into the upcall as the 'context' argument. 
+
+   Connection requests may take an arbitrary amount of time, so the EP
+   is always checked for a running timer when the upcall is made. As
+   has been discussed above, if a timer expires before an upcall
+   occurs, the connection must be completely canceled such that there
+   is no upcall.
+
+   The event signifying completion of the connection is
+   DAT_CONNECTION_EVENT_ESTABLISHED, and it will move the EP to the
+   CONNECTED state and post this event on the connection EVD. Private
+   data will be copied to an area in the EP structure, which is
+   persistent.
+
+   At this point, the EP is connected and the application is free to
+   post DTOs.
+
+8i. RTU
+
+   This item is labeled "8i" as it is internal to the InfiniBand
+   implementation; it is not initiated by dapl. The final leg of a
+   connection is an RTU (Ready To Use) message sent from the
+   initiating node to the server node, indicating the connection has
+   been made successfully.
+
+   Other transports may have a different connection protocol.
+
+9. dapls_cr_callback
+
+   When the RTU arrives, an upcall is invoked with a
+   DAT_CONNECTION_EVENT_ESTABLISHED event, which will be posted to the
+   connection EVD event queue. The EP is moved to the CONNECTED
+   state.
+
+   There is no private data for dapl to deal with, even though some
+   transports may provide private data at each step of a connection.
+
+   The connection activity is occurring on a separate channel from the
+   EP, so this is inherently a racy operation. A correct
+   application will always post RECV buffers on an EP before
+   initiating a connection sequence, as it is entirely possible for
+   DTOs to arrive *before* the final connection event arrives.
+
+   The architecturally interesting feature of this exchange occurs
+   because of differences in the InfiniBand and the DAT connection
+   models, which are briefly outlined here.
+
+   InfiniBand maintains the original connecting objects throughout the
+   life of the connection. That is, we originally get a callback event
+   associated with the Service (DAT PSP) that is listening for
+   connection events. A QP will be connected but the callback event
+   will still be received on the Service. Later, a callback event will
+   occur for a DISCONNECT, and again the Service will be the object of
+   the connection. In the DAPL implementation, the Service will
+   provide the PSP that is registered as listening on that connection
+   qualifier.
+
+   The difference is that DAT has a PSP receive a connection event,
+   but subsequently hands all connection events off to an EP. After a
+   dat_cr_accept is issued, all connection/disconnection events occur
+   on the EP. DAT more closely follows the IP connection model.
+
+   To support the DAT model, a CR is maintained through the life of
+   the connection. There is exactly one CR per connection, but any
+   number of CRs may exist for any given PSP. CRs are maintained on a
+   linked list pointed to by the PSP structure. A lookup routine will
+   match the cm_handle, unique for each connection, with the
+   appropriate CR. This allows us to find the appropriate EP which
+   will be used to create an event to be posted to the user.
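+
+   The lookup described above can be sketched as a walk over a singly
+   linked list. The types below are simplified, hypothetical stand-ins
+   (the real code uses ib_cm_handle_t and the DAPL list primitives),
+   so treat this as an illustration rather than the actual routine.
+
```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long cm_handle_t;     /* stand-in for ib_cm_handle_t */

struct ep;                             /* opaque for this sketch */

struct cr {
    cm_handle_t cm_handle;  /* unique per connection */
    struct ep  *ep;         /* EP that receives later events */
    struct cr  *next;       /* next CR on the same PSP */
};

struct psp {
    struct cr *cr_list;     /* head of the CR list */
};

/* Walk the PSP's CR list and return the CR for cm_handle, or NULL. */
static struct cr *psp_find_cr(const struct psp *psp, cm_handle_t h)
{
    struct cr *cr;

    assert(psp != NULL);
    for (cr = psp->cr_list; cr != NULL; cr = cr->next)
        if (cr->cm_handle == h)
            return cr;
    return NULL;
}
```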
+
+* dat_psp_destroy
+
+   It should be understood that the PSP will maintain all of the CR
+   records, and hence the PSP must persist until the final disconnect.
+   In the DAT model, however, there is no association between a PSP
+   and a connected QP, so an application is free to destroy a PSP
+   before the final disconnect.
+
+   Because of the model mismatch we must preserve the PSP until the
+   final disconnect. If the user invokes dat_psp_destroy(), all of the
+   associations maintained by the PSP will be severed; but the PSP
+   structure itself remains as a container for the CR records. The PSP
+   structure maintains a simple count of CR records so we can easily
+   determine the final disconnect and release memory. Once a
+   disconnect event is received for a specific cm_handle, no further
+   events will be received and it is safe to discard the CR record.
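+
+   The deferred release can be sketched with a listening flag and a CR
+   count. The struct layout and function names are hypothetical and
+   much simpler than the real PSP; the sketch assumes the PSP was
+   allocated with malloc().
+
```c
#include <assert.h>
#include <stdlib.h>

struct psp {
    int listening;   /* 0 once the application has destroyed the PSP */
    int cr_count;    /* CR records still attached */
};

/* Application-level destroy: only a state change while CRs remain.
 * Returns 1 if the structure was released, 0 if release is deferred. */
static int psp_destroy(struct psp *psp)
{
    assert(psp != NULL);
    psp->listening = 0;
    if (psp->cr_count > 0)
        return 0;
    free(psp);
    return 1;
}

/* Called when the disconnect for one connection arrives. Returns 1 if
 * this was the final disconnect and the PSP was released. */
static int psp_cr_disconnected(struct psp *psp)
{
    assert(psp != NULL && psp->cr_count > 0);
    psp->cr_count--;
    if (psp->listening || psp->cr_count > 0)
        return 0;
    free(psp);
    return 1;
}
```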
+
+10. ...processing...
+
+   This is just a placeholder to show that applications actually do
+   something after making a connection. Then again, they might not.
+
+11. Either side issues a dat_ep_disconnect
+
+   dat_ep_disconnect() can be initiated by either side of a
+   connection.  There are two kinds of disconnect flags that can be
+   passed in, but the final result is largely the same.
+
+   DAT_CLOSE_ABRUPT_FLAG will cause the connection to be immediately
+   terminated. In InfiniBand terms, the QP is immediately moved to the
+   ERROR state, and after some time it will be moved to the RESET
+   state.
+
+   DAT_CLOSE_GRACEFUL_FLAG will allow in-progress DTOs to complete.
+   The underlying implementation will first transition the QP to the
+   SQE state, before going to RESET.
+
+   Both cases are handled by the underlying CM; there is no extra work
+   for DAPL.
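+
+   The two QP state sequences can be summarized in a small sketch. The
+   enum and function names are hypothetical; real providers define
+   their own flag and state types, and the CM performs these
+   transitions itself.
+
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names standing in for the DAT flags and IB QP states. */
enum close_flag { CLOSE_ABRUPT, CLOSE_GRACEFUL };
enum qp_state   { QP_ERROR, QP_SQE, QP_RESET };

/* Fill path[] with the QP states visited for the given flag and
 * return how many entries were written (always 2 here). */
static size_t disconnect_path(enum close_flag flag, enum qp_state path[2])
{
    assert(path != NULL);
    path[0] = (flag == CLOSE_ABRUPT) ? QP_ERROR : QP_SQE;
    path[1] = QP_RESET;
    return 2;
}
```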
+
+12. dapls_cr_callback
+
+   A disconnect will arrive on the passive side of the connection
+   through dapls_cr_callback() with connection event
+   DAT_CONNECTION_EVENT_DISCONNECTED. With this event the EP lookup
+   code will free the CR associated with the connection, and may free
+   the PSP if it is no longer listening, indicating it has been freed
+   by the application.
+
+   The callback will create and post a
+   DAT_CONNECTION_EVENT_DISCONNECTED event for the application.
+
+13. dapl_evd_connection_callback
+
+   The active side of the connection will receive
+   DAT_CONNECTION_EVENT_DISCONNECTED as the connection event for
+   dapl_evd_connection_callback(), and will create and post a
+   DAT_CONNECTION_EVENT_DISCONNECTED event.  Other than transitioning
+   the EP to the DISCONNECTED state, there is no further processing.
+
+
+Observe that there are a number of exception conditions resulting in a
+disconnect of the EP, most of which will generate unique DAT events
+for the application to deal with.
+
+
+* Addressing and Naming
+
+   The DAT Spec calls for a DAT_IA_ADDRESS_PTR to be an IP address,
+   either IPv4 or IPv6. It is in fact a struct sockaddr in most
+   systems. The dapl structures typically use IPv6 data types to
+   accommodate the largest possible addresses, but most implementations
+   use IPv4 formatted addresses.
+
+   InfiniBand uses a transport specific address known as a LID, which
+   typically is dynamically assigned by a Subnet Manager. Each HCA
+   also has a global address, similar to an Ethernet MAC address,
+   known as a GUID. ATS, mentioned above, is a mechanism using
+   InfiniBand infrastructure to map from GUID/LID to IP addresses. It
+   is not necessary for transports that use IP addresses natively,
+   such as Ethernet devices.
+
+   If a new implementation does not yet have a name service
+   infrastructure, the DAPL implementation provides a simple name
+   service facility under the #ifdef NO_NAME_SERVICE. This depends on
+   two things: valid IP addresses registered and available to standard
+   DNS system calls such as gethostbyname(); and a name/GID mapping
+   file.
+
+   IP addresses may be set up by system administrators or by a local
+   power-user simply by editing the values into the /etc/hosts file.
+   Setting IP addresses up in this manner is beyond the scope of this
+   document.
+
+   A simple mapping of names to GIDs is maintained in the ibhosts
+   file, currently located at /etc/dapl/ibhosts. The format of
+   the file is:
+
+   <IP name>     0x<GID Prefix>    0x<GUID>
+
+   For example:
+
+   dat-linux3-ib0p0 0xfe80000000000000 0x0001730000003d11
+   dat-linux3-ib0p1 0xfe80000000000000 0x0001730000003d11
+   dat-linux3-ib1   0xfe80000000000000 0x0001730000003d52
+   dat-linux5-ib0   0xfe80000000000000 0x0001730000003d91
+
+   And for each hostname, there must be an entry in the /etc/hosts file
+   similar to:
+
+   198.165.10.11     dat-linux3-ib0p0
+   198.165.10.12     dat-linux3-ib0p1
+   198.165.10.21     dat-linux3-ib1
+   198.165.10.31     dat-linux5-ib0
+
+
+   In this example we have adopted the convention of naming each
+   InfiniBand interface by using the form
+
+             <node_name>-ib<device_number>[port_number]
+
+   In the above example we can see that the machine dat-linux3 has three
+   InfiniBand interfaces: two ports on the first HCA and one port on a
+   second. Because standard DNS naming is used, the convention for
+   identifying individual ports is completely up to the administrator.
+
+   The GID Prefix and GUID are obtained from the HCA and map a port on
+   the HCA: together they form the GID that is required by a CM to
+   connect with the remote node.
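+
+   As an illustration, one line of the ibhosts file could be parsed
+   and turned into a 16 byte GID as sketched here. The struct and
+   function names are hypothetical. Placing the prefix in the high 8
+   bytes and the GUID in the low 8 bytes, both in network byte order,
+   follows the usual InfiniBand GID layout; how the reference
+   implementation stores the value internally is not shown here.
+
```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

struct ibhosts_entry {
    char    name[64];   /* host name, also present in /etc/hosts */
    uint8_t gid[16];    /* prefix in bytes 0..7, GUID in bytes 8..15 */
};

/* Store a 64 bit value in big-endian (network) byte order. */
static void put_be64(uint8_t *out, uint64_t v)
{
    int i;

    for (i = 7; i >= 0; i--) {
        out[i] = (uint8_t)(v & 0xff);
        v >>= 8;
    }
}

/* Parse "<name> 0x<GID Prefix> 0x<GUID>"; returns 0 on success. */
static int ibhosts_parse_line(const char *line, struct ibhosts_entry *e)
{
    uint64_t prefix, guid;

    assert(line != NULL && e != NULL);
    /* the %x conversions accept an optional leading "0x" */
    if (sscanf(line, "%63s %" SCNx64 " %" SCNx64,
               e->name, &prefix, &guid) != 3)
        return -1;
    put_be64(e->gid, prefix);
    put_be64(e->gid + 8, guid);
    return 0;
}
```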
+
+   The simple name service builds an internal table after processing
+   the ibhosts file which contains IP addresses and GIDs. It will use
+   the standard getaddrinfo() function to obtain IP address
+   information.
+
+   When an application invokes dat_ep_connect(), the
+   DAT_IA_ADDRESS_PTR will be looked up in the table and the
+   destination GID established if a match is found. If the address is
+   not found, the user must first add the name to the ibhosts file.
+
+   With a valid GID for the destination node, the underlying CM is
+   invoked to make a connection.
+
+* Connection Management
+
+   Getting a working CM has taken some time; in fact the DAPL project
+   was nearly complete by the time a CM was available. In order to
+   make progress, a connection hack was introduced that allows
+   specific connections to take place. This is noted in the code by
+   the CM_BUSTED #define.
+
+   CM_BUSTED takes the place of a CM and will manually transition a QP
+   through the various states to connect: INIT->RTR->RTS. It will also
+   disconnect the connection, although the Torrent implementation
+   simply destroys the QP and recreates a new one rather than
+   transitioning through the typical disconnect states (which didn't
+   work on early IB implementations).
+
+   CM_BUSTED makes some assumptions about the remote end of the
+   connection as no real information is exchanged. The ibapi
+   implementation assumes both HCAs have the same LID, which implies
+   there is no SM running. The vapi implementation assumes the LIDs
+   are 0 and 1. Depending on the hardware, the LID value may in fact
+   not make any difference. This code does not set the Global Route
+   Header (GRH), which would cause the InfiniBand chip to be carefully
+   checking LID information.
+
+   The QP number is assumed to be identical on both ends of the
+   connection, or differing by 1 if this is a loopback. There is an
+   environment variable that will be read at initialization time if
+   you are configured with a loopback; this value is checked when
+   setting up a QP. The obvious downside to this scheme is that
+   applications must stay synchronized in their QP usage or the
+   initial exchange will fail as they are not truly connected.
+
+   Add to this the limitation that HCAs must be connected in
+   Point-to-Point topology or in a loopback. Without a GRH it will not
+   work in a fabric.  Again, using an SM will not work when CM_BUSTED
+   is enabled.
+
+   Despite these shortcomings, CM_BUSTED has proven very useful and
+   will remain in the code for a while in order to aid development
+   groups with new hardware and software. It is a hack to be sure, but
+   it is relatively well isolated.
+
+
+-- Notes on Disconnecting
+
+An EP can only be disconnected if it is connected or unconnected; you
+cannot disconnect 'in progress' connections. An 'in progress'
+connection may in fact time out, but the DAT Spec does not allow you
+to 'kill' it. DAPL will use the CM interface to disconnect from the
+remote node; this of course results in an asynchronous callback
+notifying the application the disconnect is complete.
+
+Disconnecting an unconnected EP is currently the only way to remove
+pending RECV operations from the EP. The DAPL spec notes that all
+DTOs must be removed from an EP before it can be deallocated, yet
+there is no explicit interface to remove pending RECV DTOs. The user
+will disconnect an unconnected EP to force the pending operations off
+of the queue, resulting in DTO callbacks indicating an error. The
+underlying InfiniBand implementation will cause the correct behavior
+to result. When doing this operation the DAT_CLOSE flag is ignored;
+DAPL will instruct the provider layer to abruptly disconnect the QP.
+
+As has been noted previously, specifying DAT_CLOSE_ABRUPT_FLAG as the
+disconnect completion flag will cause the CM implementation to
+transition the QP to the ERROR state to abort all operations, and then
+transition to the RESET state; if the flag is DAT_CLOSE_GRACEFUL_FLAG,
+the CM will first move to the SQE state and allow all pending I/Os to
+drain before moving to the RESET state. In either case, DAPL only
+needs to know that the QP is now in the RESET state, as it will need
+to be transitioned to the INIT state before it can be used again.
+
+======================================================================
+Data Transfer Operations (DTOs)
+======================================================================
+
+The DTO code is a straightforward translation of the DAT_LMR_TRIPLET
+to an InfiniBand work request. Unfortunately, IB does not specify what
+a work request looks like so this tends to be very vendor specific
+code. Each provider will supply a routine for this operation.
+
+InfiniBand allows the DTO to attach a unique 64 bit work_req_id to
+each work request. The DAPL implementation will install a pointer to a
+DAPL_DTO_COOKIE in this field. Observe that a DAPL_DTO_COOKIE is not
+the same as the user DAT_DTO_COOKIE; indeed, the former has a pointer
+field pointing to the latter.  Different values will be placed in the
+cookie, according to the type of operation it is and the type of data
+required by its completion event. This is a simple scheme to bind DAPL
+data to the DTO and associated completion callback. Each DTO has a
+unique cookie associated with it.
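+
+The binding between a work request and its cookie can be sketched as
+a pointer round trip through the 64 bit id. The two struct types here
+are simplified, hypothetical stand-ins for DAPL_DTO_COOKIE and
+DAT_DTO_COOKIE, and the helper names are not taken from the source.
+
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified versions of the two cookie types. */
typedef struct { uint64_t as_64; } user_dto_cookie;   /* consumer's cookie */

typedef struct {
    int              op_type;      /* e.g. send, receive, RDMA */
    user_dto_cookie *user_cookie;  /* pointer to the consumer's cookie */
} dapl_dto_cookie;

/* Pack a cookie pointer into the 64 bit work request id. */
static uint64_t cookie_to_wr_id(dapl_dto_cookie *cookie)
{
    /* the scheme only works if a pointer fits in 64 bits */
    assert(sizeof(uintptr_t) <= sizeof(uint64_t));
    return (uint64_t)(uintptr_t)cookie;
}

/* Recover the cookie in the completion callback. */
static dapl_dto_cookie *wr_id_to_cookie(uint64_t wr_id)
{
    return (dapl_dto_cookie *)(uintptr_t)wr_id;
}
```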
+
+Observe that an InfiniBand work_request remains under control of the
+user, and when a post operation occurs the underlying implementation
+will copy data out of the work_request into a hardware based
+structure. Further, no application can perform a DTO operation on the
+same EP at the same time according to the thread guarantees mandated
+by the specification. This allows us to provide a recv_iov and a
+send_iov in the EP structure for all DTO operations, eliminating any
+malloc operations from this critical path.
+
+The underlying provider implementation will invoke
+dapl_evd_dto_callback() upon completion of DTO operations.
+dapl_evd_dto_callback() is the asynchronous completion for a DTO and
+will create and post an event for the user. Much of this callback is
+concerned with managing error completions.
+
+
+======================================================================
+Data Structure
+======================================================================
+
+The main data structure for an EndPoint is the dapl_ep structure,
+defined in include/dapl.h. The reference implementation uses the
+InfiniBand QP to maintain hardware state, providing a relatively
+simple mapping.
+
+/* DAPL_EP maps to DAT_EP_HANDLE */
+struct dapl_ep
+{
+    DAPL_HEADER                header;
+    /* What the DAT Consumer asked for */
+    DAT_EP_PARAM               param;
+
+    /* The RC Queue Pair (IBM OS API) */
+    ib_qp_handle_t             qp_handle;
+    unsigned int               qpn;    /* qp number */
+    ib_qp_state_t              qp_state;
+
+    /* communications manager handle (IBM OS API) */
+    ib_cm_handle_t             cm_handle;
+    /* store the remote IA address here, reference from the param
+     * struct which only has a pointer, no storage
+     */
+    DAT_SOCK_ADDR6             remote_ia_address;
+
+    /* For passive connections we maintain a back pointer to the CR */
+    void *                     cr_ptr;
+
+    /* pointer to connection timer, if set */
+    struct dapl_timer_entry    *cxn_timer;
+
+    /* private data container */
+    DAPL_PRIVATE               private;
+
+    /* DTO data */
+    DAPL_ATOMIC                req_count;
+    DAPL_ATOMIC                recv_count;
+
+    DAPL_COOKIE_BUFFER         req_buffer;
+    DAPL_COOKIE_BUFFER         recv_buffer;
+
+    ib_data_segment_t          *recv_iov;
+    DAT_COUNT                  recv_iov_num;
+
+    ib_data_segment_t          *send_iov;
+    DAT_COUNT                  send_iov_num;
+#ifdef DAPL_DBG_IO_TRC
+    int                        ibt_dumped;
+    struct io_buf_track        *ibt_base;
+    DAPL_RING_BUFFER           ibt_queue;
+#endif /* DAPL_DBG_IO_TRC */
+};
+
+The simple explanation of the fields in the dapl_ep structure follows:
+
+header:   The dapl object header, common to all dapl objects.
+          It contains a lock field, links to appropriate lists, and
+          handles specifying the IA domain it is a part of.
+
+param:    The bulk of the EP attributes called out in the DAT
+          specification are maintained in the DAT_EP_PARAM
+          structure. All internal references to these fields
+          use this structure.
+
+qp_handle: Handle to the underlying InfiniBand provider implementation
+          for a QP. All EPs are mapped to an InfiniBand QP.
+
+qpn:      Number of the QP as returned by the underlying provider
+          implementation. Primarily useful for debugging.
+
+qp_state:  Current state of the QP. The values of this field indicate
+          if a QP is bound to the EP, and the current state of a
+          QP.
+
+cm_handle: Handle to the IB provider's CMA (Connection Manager Agent).
+          Used for the CM operations that connect and disconnect.
+
+remote_ia_address:
+          Remote IP address of the connection. Only valid after the user
+          has asked for it.
+
+cr_ptr:    Attaches the EP to the appropriate CR. Assigned on the passive
+          side of a connection in cr_accept. It is used when an abrupt
+          disconnect is invoked by the app, and we need to 'fake' a
+          callback. It is also used in clean up of an EP and removing
+          connection elements from the associated PSP.
+
+cxn_timer: Pointer to a timer entry, used as a token to set and remove
+          timers.
+
+private:   Local Private data area on the active side of a connection.
+
+req_count: Count of outstanding request DTO operations, including memory
+          ops. Atomically incremented/decremented.
+
+recv_count:
+          Count of outstanding receive DTO operations. Atomically
+          incremented/decremented.
+
+req_buffer:
+          Ring buffer of request cookies.
+
+recv_buffer:
+          Ring buffer of receive cookies.
+
+recv_iov:  Storage for provider receive work request.
+
+recv_iov_num:
+          Maximum number of receive IOVs. Number is obtained from
+          the provider in a query.
+
+send_iov:  Storage for provider send work request.
+
+send_iov_num:
+          Maximum number of send IOVs. Number is obtained from the
+          provider in a query.
+
+ibt_dumped:
+          DTO debugging aid. Boolean value to control how often DTO
+          tracing data is printed.
+
+ibt_base:  DTO debugging aid. Base address of DTO ring buffer containing
+          information on DTO processing.
+
+ibt_queue: Ring buffer containing information on DTO processing.
+
+
+** Debug
+
+The Reference Implementation includes a trace facility that allows a
+developer to see all DTO operations, specifically to catch those that
+are not completing as expected. The DAPL_DBG_IO_TRC conditional will
+enable this code.
+
+A simple ring buffer is used to account for all outstanding DTO
+traffic. The buffer may be dumped when DTOs are not getting
+completions, with enough data to aid the developer to determine where
+things went wrong.
+
+It is implemented as a ring buffer as there are often bugs in this
+part of a provider's implementation which do not manifest until
+intensive data exchange has occurred for many hours.
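+
+The bounded-memory property of a ring buffer can be illustrated with
+a generic sketch that overwrites the oldest entry once full. This is
+not the DAPL_RING_BUFFER implementation; the names, the tiny slot
+count, and the overwrite policy are assumptions for illustration.
+
```c
#include <assert.h>
#include <stddef.h>

#define TRACE_SLOTS 4            /* small on purpose; hypothetical size */

struct trace_ring {
    unsigned long slot[TRACE_SLOTS];  /* e.g. one id per posted DTO */
    size_t        head;               /* next slot to write */
    size_t        count;              /* valid entries, <= TRACE_SLOTS */
};

/* Record an entry, overwriting the oldest one when the ring is full. */
static void trace_add(struct trace_ring *r, unsigned long id)
{
    assert(r != NULL);
    r->slot[r->head] = id;
    r->head = (r->head + 1) % TRACE_SLOTS;
    if (r->count < TRACE_SLOTS)
        r->count++;
}

/* Return the i-th oldest entry still held (i < r->count). */
static unsigned long trace_get(const struct trace_ring *r, size_t i)
{
    assert(r != NULL && i < r->count);
    return r->slot[(r->head + TRACE_SLOTS - r->count + i) % TRACE_SLOTS];
}
```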
diff --git a/doc/dapl_environ.txt b/doc/dapl_environ.txt

new file mode 100644 (file)

index 0000000..17aabd1
--- /dev/null
+++ b/doc/dapl_environ.txt
@@ -0,0 +1,42 @@
+               DAPL Environment Guide v. 0.01
+                ------------------------------
+
+The following environment variables affect the behavior of the DAPL
+provider library: 
+
+
+DAPL_DBG_TYPE
+-------------
+
+ Value specifies which parts of the registry will print debugging
+ information. Valid values are:
+
+    DAPL_DBG_TYPE_ERR          = 0x0001
+    DAPL_DBG_TYPE_WARN         = 0x0002
+    DAPL_DBG_TYPE_EVD          = 0x0004
+    DAPL_DBG_TYPE_CM           = 0x0008
+    DAPL_DBG_TYPE_EP           = 0x0010
+    DAPL_DBG_TYPE_UTIL         = 0x0020
+    DAPL_DBG_TYPE_CALLBACK     = 0x0040
+    DAPL_DBG_TYPE_DTO_COMP_ERR  = 0x0080
+    DAPL_DBG_TYPE_API           = 0x0100
+    DAPL_DBG_TYPE_RTN           = 0x0200
+    DAPL_DBG_TYPE_EXCEPTION     = 0x0400
+
+ or any combination of these. For example, you can use 0xC to get both
+ EVD and CM output.
+
+ Example: setenv DAPL_DBG_TYPE 0xC
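+
+ How such a bit mask is typically parsed and tested can be sketched as
+ follows. The helper names are hypothetical and this is not the
+ library's own parsing code; the three constants are copied from the
+ list above.
+
```c
#include <assert.h>
#include <stdlib.h>

#define DAPL_DBG_TYPE_EVD  0x0004   /* values from the list above */
#define DAPL_DBG_TYPE_CM   0x0008
#define DAPL_DBG_TYPE_EP   0x0010

/* Parse the text of DAPL_DBG_TYPE; a NULL (unset) variable yields 0.
 * Base 0 lets strtoul accept "0xC" as well as decimal "12". */
static unsigned long dbg_parse_mask(const char *text)
{
    return text != NULL ? strtoul(text, NULL, 0) : 0UL;
}

/* Returns nonzero if messages of the given type should be printed. */
static int dbg_enabled(unsigned long mask, unsigned long type)
{
    assert(type != 0);
    return (mask & type) != 0;
}
```
+
+ A caller would pass getenv("DAPL_DBG_TYPE") as the text argument.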
+
+  
+DAPL_DBG_DEST
+-------------
+
+ Value sets the output destination. Valid values are:
+  
+    DAPL_DBG_DEST_STDOUT       = 0x1
+    DAPL_DBG_DEST_SYSLOG       = 0x2 
+    DAPL_DBG_DEST_ALL          = 0x3 
+  
+ For example, 0x3 will output to both stdout and the syslog. 
+
diff --git a/doc/dapl_event_design.txt b/doc/dapl_event_design.txt

new file mode 100644 (file)

index 0000000..247c7ef
--- /dev/null
+++ b/doc/dapl_event_design.txt
@@ -0,0 +1,875 @@
+               DAPL Event Subsystem Design v. 0.96
+                -----------------------------------
+
+=================
+Table of Contents
+=================
+
+* Table of Contents
+* Referenced Documents
+* Goals
+       + Initial Goals
+       + Later Goals
+* Requirements, constraints, and design inputs
+       + DAT Specification Constraints
+               + Object and routine functionality, in outline
+               + Detailed object and routine specification
+               + Synchronization
+       + IBM Access API constraints
+               + Nature of DAPL Event Streams in IBM Access API.
+               + Nature of access to CQs
+       + Operating System (Pthread) Constraints
+       + Performance model
+               + A note on context switches
+* DAPL Event Subsystem Design
+       + OS Proxy Wait Object
+               + Definition
+               + Suggested Usage
+       + Event Storage
+       + Synchronization
+               + EVD Synchronization: Locking vs. Producer/Consumer queues
+               + EVD Synchronization: Waiter vs. Callback
+               + CNO Synchronization
+               + Inter-Object Synchronization
+               + CQ -> CQEH Assignments
+       + CQ Callbacks
+       + Dynamic Resizing of EVDs
+       + Structure and pseudo-code
+               + EVD
+               + CNO
+* Future directions
+       + Performance improvements: Reducing context switches
+       + Performance improvements: Reducing copying of event data
+       + Performance improvements: Reducing locking
+       + Performance improvements: Reducing atomic operations
+       + Performance improvements: Incrementing concurrency.
+
+====================
+Referenced Documents
+====================
+
+uDAPL: User Direct Access Programming Library, Version 1.0.  Published
+6/21/2002.  http://www.datcollaborative.org/uDAPL_062102.pdf.
+Referred to in this document as the "DAT Specification".
+
+InfiniBand Access Application Programming Interface Specification,
+Version 1.2, 4/15/2002.  In DAPL SourceForge repository at
+doc/api/access_api.pdf.  Referred to in this document as the "IBM
+Access API Specification".
+
+=====
+Goals
+=====
+
+Initial goals
+-------------
+-- Implement the dat_evd_* calls described in the DAT Specification (except
+   for dat_evd_resize).
+
+-- The implementation should be as portable as possible, to facilitate
+   HCA vendors' efforts to implement vendor-specific versions of DAPL.
+
+Later goals
+-----------
+-- Examine various possible performance optimizations.  This document
+   lists potential performance improvements, but the specific
+   performance improvements implemented should be guided by customer
+   requirements.
+
+-- Implement the dat_cno_* calls described in the DAT Specification.
+
+-- Implement OS Proxy Wait Objects.
+
+-- Implement dat_evd_resize.
+
+Non-goals
+---------
+-- Thread safe implementation
+
+============================================
+Requirements, constraints, and design inputs
+============================================
+
+DAT Specification Constraints
+-----------------------------
+
+-- Object and routine functionality, in outline
+
+The following section summarizes the requirements of the DAT
+Specification in a form that is simpler to follow for purposes of
+implementation.  This section presumes the reader has read the DAT
+Specification with regard to events.
+
+Events are delivered to DAPL through Event Streams.  Each Event Stream
+targets a specific Event Descriptor (EVD); multiple Event Streams may
+target the same EVD.  The Event Stream<->EVD association is
+effectively static; it may not be changed after the time at which
+events start being delivered.  The DAT Consumer always retrieves
+events from EVDs.  EVDs are intended to be 1-to-1 associated with the
+"native" event convergence object on the underlying transport.  For
+InfiniBand, this would imply a 1-to-1 association between EVDs and
+CQs.
+
+EVDs may optionally have an associated Consumer Notification Object
+(CNO).  Multiple EVDs may target the same CNO, and the EVD<->CNO
+association may be dynamically altered.  The DAT Consumer may wait for
+events on either EVDs or CNOs; if there is no waiter on an EVD and it
+is enabled, its associated CNO is triggered on event arrival.  An EVD
+may have only a single waiter; a CNO may have multiple waiters.
+Triggering of a CNO is "sticky"; if there is no waiter on a CNO when
+it is triggered, the next CNO waiter will return immediately.
+
+CNOs may have an associated OS Proxy Wait Object, which is signaled
+when the CNO is triggered.
+
+-- Detailed object and routine specification
+
+Individual events may be "signaling" or "non-signaling", depending
+on the interaction of:
+       * Receive completion endpoint attributes
+       * Request completion endpoint attributes
+       * dat_ep_post_send completion flags
+       * dat_ep_post_recv completion flags
+The nature of this interaction is outside the scope of this document;
+see the DAT Specification 1.0 (or, failing that, clarifications in a
+later version of the DAT Specification).
+
+A call to dat_evd_dequeue returns successfully if there are events on
+the EVD to dequeue.  A call to dat_evd_wait blocks if there are fewer
+events present on the EVD than the value of the "threshold" parameter
+passed in the call.  Such a call to dat_evd_wait will be awoken by the
+first signaling event arriving on the EVD that raises the EVD's event
+count to >= the threshold value specified by dat_evd_wait().
+
+If a signaling event arrives on an EVD that does not have a waiter,
+and that EVD is enabled, the CNO associated with the EVD will be
+triggered.
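+
+The wakeup and trigger rules above can be sketched as a single
+decision function.  This is only an illustration: the names evd_view,
+evd_action and evd_on_event are hypothetical and appear in neither the
+DAT Specification nor the DAPL source.
+
```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; none of these come from the DAT headers. */
typedef enum {
    EVD_ACTION_NONE,         /* the event is simply stored            */
    EVD_ACTION_WAKE_WAITER,  /* unblock the thread in dat_evd_wait    */
    EVD_ACTION_TRIGGER_CNO   /* no waiter: trigger the associated CNO */
} evd_action;

typedef struct {
    bool     has_waiter;   /* a thread is blocked in dat_evd_wait     */
    bool     enabled;      /* the EVD enabled/disabled bit            */
    bool     has_cno;      /* a CNO is associated with the EVD        */
    unsigned event_count;  /* events on the EVD, including the new one */
    unsigned threshold;    /* threshold passed by the waiter          */
} evd_view;

/* Decide what the arrival of one event should cause. */
static evd_action evd_on_event(const evd_view *evd, bool signaling)
{
    if (!signaling)
        return EVD_ACTION_NONE;          /* non-signaling events never wake */

    if (evd->has_waiter) {
        /* Wake only once the count reaches the waiter's threshold. */
        return evd->event_count >= evd->threshold
                   ? EVD_ACTION_WAKE_WAITER
                   : EVD_ACTION_NONE;
    }

    if (evd->enabled && evd->has_cno)
        return EVD_ACTION_TRIGGER_CNO;   /* no waiter, EVD enabled */

    return EVD_ACTION_NONE;
}
```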
+
+A CNO has some number of associated waiters, and an optional
+associated OS Proxy Wait Object.  When a CNO is triggered, two things
+happen independently:
+       * The OS Proxy Wait Object associated with the CNO, if any, is
+         signaled, given the handle of an EVD associated with the CNO
+         that has an event on it, and disassociated from the CNO.
+       * If:
+               * there are one or more waiters associated with the
+                 CNO, one of the waiters is unblocked and given the
+                 handle of an EVD associated with the CNO that has an
+                 event on it.
+               * there are no waiters associated with the CNO, the
+                 CNO is placed in the triggered state.
+
+When a thread waits on a CNO, if:
+       * The CNO is in the untriggered state, the waiter goes to
+         sleep pending the CNO being triggered.
+       * The CNO is in the triggered state, the waiter returns
+         immediately with the handle of an EVD associated with the
+         CNO that has an event on it, and the CNO is moved to the
+         untriggered state.
+
+Note specifically that the signaling of the OS Proxy Wait Object is
+independent of the CNO moving into the triggered state or not; it
+occurs based on the state transition from Not-Triggered to Triggered.
+Signaling the OS Proxy Wait Object only occurs when a CNO is
+triggered.  In contrast, waiters on a CNO are unblocked whenever the
+CNO is in the triggered *state*, and that state is sticky.
+
+Note also that which EVD is returned to the caller in a CNO wait is
+not specified; it may be any EVD associated with the CNO on which an
+event arrival might have triggered the CNO.  This includes the
+possibility that the EVD returned to the caller may not have any
+events on it, if the dat_cno_wait() caller raced with a separate
+thread doing a dat_evd_dequeue().
+
+The DAT Specification is silent as to what behavior is to be expected
+from an EVD after an overflow error has occurred on it.  Thus this
+design will also be silent on that issue.
+
+The DAT Specification has minimal requirements on inter-Event Stream
+ordering of events.  Specifically, any connection events must precede
+(in consumption order) any DTO Events for the same endpoint.
+Similarly, any successful disconnection events must follow any DTO
+Events for an endpoint.
+
+-- Synchronization
+
+Our initial implementation is not thread safe.  This means that we do
+not need to protect against the possibility of multiple simultaneous
+user calls occurring on the same object (EVD, CNO, EP, etc.); that is
+the responsibility of the DAT Consumer.
+
+However, there are collisions that we do need to guard against
+because the DAT Consumer cannot. Specifically, since the user
+cannot control the timing of callbacks from the IBM Access API
+Implementation, we need to protect against possible collisions between
+user calls and such callbacks.  We also need to make sure that such
+callbacks do not conflict with one another in some fashion, possibly
+by assuring that they are single-threaded.
+
+In addition, for the sake of simplicity in the user interface, I have
+defined "not thread safe" as "It is the DAT Consumer's responsibility
+to make sure that all calls against an individual object do not
+conflict".  This does, however, suggest that the DAPL library needs to
+protect against calls to different objects that may result in
+collisions "under the covers" (e.g. a call on an EVD vs. a call on its
+associated CNO).
+
+So our synchronization requirements for this implementation are:
+       + Protection against collisions between user calls and IBM
+         Access API callbacks.
+       + Avoidance of or protection against collisions between
+         different IBM Access API callbacks.
+       + Protection against collisions between user calls targeted at
+         different DAT objects.
+
+IBM Access API constraints
+--------------------------
+
+-- Nature of DAPL Event Streams in IBM Access API
+
+DAPL Event Streams are delivered through the IBM Access API in two fashions:
+       + Delivery of a completion to a CQ.
+       + A callback is made directly to a previously registered DAPL
+         function with parameters describing the event.
+(Software events are not delivered through the IBM Access API).
+
+The delivery of a completion to a CQ may spark a call to a previously
+registered callback depending on the attributes of the CQ and the
+reason for the completion.  Event Streams that fall into this class
+are:
+       + Send data transport operation
+       + Receive data transport operation
+       + RMR bind
+
+The Event Streams that are delivered directly through an IBM Access API
+callback include:
+       + Connection request arrival
+       + Connection resolution (establishment or rejection)
+       + Disconnection
+       + Asynchronous errors
+
+Callbacks associated with CQs are further structured by membership in
+a particular CQ Event Handling (CQEH) domain (specified at CQ creation
+time).  All CQ callbacks within a CQEH domain are serviced by the same
+thread, and hence will not collide.
+
+In addition, all connection-related callbacks are serviced by the same
+thread, and will not collide.  Similarly, all asynchronous error
+callbacks are serviced by the same thread, and will not collide.
+Collisions between any pair of a CQEH domain, a connection callback,
+and an asynchronous error callback are possible.
+
+-- Nature of access to CQs
+
+The only probe operation the IBM Access API allows on CQs is
+dequeuing.  The only notification operation the IBM Access API
+supports for CQs is calling a previously registered callback.
+
+Specifically, the IB Consumer may not query the number of completions
+on the CQ; the only way to find out the number of completions on a CQ
+is through dequeuing them all.  It is not possible to block waiting
+on a CQ for the next completion to arrive, with or without a
+threshold parameter.
+
+Operating System Constraints
+----------------------------
+
+The initial platform for implementation of DAPL is RedHat Linux 7.2 on
+Intel hardware.  On this platform, inter-thread synchronization is
+provided by a POSIX Pthreads implementation.  From the viewpoint of
+DAPL, the details of the Pthreads interface are platform specific.
+However, Pthreads is a very widely used threading library, common on
+almost all Unix variants (though not used on the different variations
+of Microsoft Windows(tm)).  In addition, RedHat Linux 7.2 provides
+POSIX thread semaphore operations (e.g. see sem_init(3)), which are
+not normally considered part of pthreads.
+
+Microsoft Windows(tm) provides many synchronization primitives,
+including mutual exclusion locks, and semaphores.
+
+DAPL defines an internal API (not exposed to the consumer), through
+which it accesses Operating System Dependent services; this is called
+the OSD API.  It is intended that this layer contain all operating
+system dependencies, and that porting DAPL to a new operating system
+should only require changes to this layer.
+
+We have chosen to define the synchronization interfaces established at
+this layer in terms of two specific objects: mutexes and semaphores
+with a timeout on waiting.  Mutexes provide mutual exclusion in a way
+that is common to all known operating systems.  The functionality of
+semaphores also exists on most known operating systems, though the
+semaphores provided by POSIX do not provide timeout capabilities.
+We chose semaphores over condition variables for three reasons.
+First, in contrast to condition variables (the native pthreads
+waiting/signalling object), operations on semaphores do not require
+use of other synchronization variables (i.e. mutexes).  Second, it is
+fairly easy to emulate semaphores using condition variables, and it
+is not simple to emulate condition variables using semaphores.  And
+third, there are some anticipated platforms for DAPL that implement
+condition variables in relation to some types of locks but not
+others, and hence constrain appropriate implementation choices for a
+potential DAPL interface modeled after condition variables.
+
+Implementation of the DAPL OS Wait Objects will initially be based on
+condition variables (requiring the use of an internal lock) since
+POSIX semaphores do not provide a needed timeout capability.  However,
+if improved performance is required, a helper thread could be created
+that arranges to signal waiting semaphores when timeouts have
+expired.  This is a potential future (or vendor) optimization.
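+
+A minimal sketch of a counting wait object with timeout, emulated with
+a pthreads condition variable and an internal lock as described above.
+The type and function names (wait_object, wait_object_signal,
+wait_object_wait) are hypothetical and are not the actual DAPL OSD
+interface; the millisecond timeout unit is also an assumption.
+
```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical wait object; not the real DAPL OSD type. */
typedef struct {
    pthread_mutex_t lock;   /* the internal lock the text mentions */
    pthread_cond_t  cond;
    unsigned        count;  /* signals posted but not yet consumed */
} wait_object;

static int wait_object_init(wait_object *w)
{
    int rc = pthread_mutex_init(&w->lock, NULL);
    if (rc != 0)
        return rc;
    w->count = 0;
    return pthread_cond_init(&w->cond, NULL);
}

/* Post one signal and wake a sleeper, if any. */
static void wait_object_signal(wait_object *w)
{
    pthread_mutex_lock(&w->lock);
    w->count++;
    pthread_cond_signal(&w->cond);
    pthread_mutex_unlock(&w->lock);
}

/* Consume one signal.  Returns 0, or ETIMEDOUT after timeout_ms. */
static int wait_object_wait(wait_object *w, long timeout_ms)
{
    struct timespec deadline;
    int rc = 0;

    /* pthread_cond_timedwait takes an absolute CLOCK_REALTIME deadline. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec  += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&w->lock);
    /* Loop to absorb spurious wakeups; stop on timeout or error. */
    while (w->count == 0 && rc == 0)
        rc = pthread_cond_timedwait(&w->cond, &w->lock, &deadline);
    if (w->count > 0) {
        w->count--;
        rc = 0;
    }
    pthread_mutex_unlock(&w->lock);
    return rc;
}
```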
+
+Performance Model
+-----------------
+One constraint on the DAPL Event Subsystem implementation is that it
+should perform as well as possible.  We define "as well as possible"
+by listing the characteristics of this subsystem that will affect its
+performance most strongly.  In approximate order of importance, these
+are:
+       + The number of context switches on the critical path.
+       + The amount of copying on the critical path.
+       + The base cost of locking (assuming no contention) on the
+         critical path.  This is proportional to the number of locks
+         taken.
+       + The amount of locking contention expected.  We make a
+         simplifying assumption and take this as the number of cycles
+         for which we expect to hold locks on the critical path.
+       + The number of "atomic" bus operations executed (these take
+         more cycles than normal operations, as they require locking
+         the bus).
+
+We obviously wish to minimize all of these costs.
+
+-- A note on context switches
+
+In general, it's difficult to minimize context switches in a user
+space library directly communicating with a hardware board.  This is
+because context switches, by their nature, have to go through the
+operating system, but the information about which thread to wake up
+and whether to wake it up is generally in user space.  In addition,
+the IBM Access API delivers all Event Streams as callbacks in user
+context (as opposed to, for example, allowing a thread to block within
+the API waiting for a wakeup).  For this reason, the default sequence
+of events for a wakeup generated from the hardware is:
+       * Hardware interrupts the main processor.
+       * Interrupt thread schedules a user-level IBM Access API
+         provider service thread parked in the kernel.
+       * Provider service thread wakes up the sleeping user-level
+         event DAT implementation thread.
+This implies that any wakeup will involve three context switches.
+This could be reduced by one if there were a way for user threads to
+block in the kernel; we could then skip the user-level provider thread.
+
+===========================
+DAPL Event Subsystem Design
+===========================
+
+
+OS Proxy Wait Object
+--------------------
+
+The interface and nature of the OS Proxy Wait Object are specified in
+the uDAPL v. 1.0 header files as a DAT_OS_WAIT_PROXY_AGENT via the
+following defines:
+
+typedef void (*DAT_AGENT_FUNC)
+        (
+        DAT_PVOID,      /* instance data   */
+        DAT_EVD_HANDLE  /* Event Dispatcher*/
+        );
+
+typedef struct dat_os_wait_proxy_agent
+        {
+        DAT_PVOID instance_data;
+        DAT_AGENT_FUNC proxy_agent_func;
+        } DAT_OS_WAIT_PROXY_AGENT;
+
+In other words, an OS Proxy Wait Object is a (function, data) pair,
+and signalling the OS Proxy Wait Object is a matter of calling the
+function on the data and an EVD handle associated with the CNO.
+The nature of that function and its associated data is completely up
+to the uDAPL consumer.
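+
+As an illustration, a consumer might supply a (function, data) pair
+like the one below.  The DAT_PVOID and DAT_EVD_HANDLE typedefs here are
+stand-ins so that the sketch compiles without the uDAPL headers, and
+my_proxy_state, my_proxy_func and signal_proxy_agent are hypothetical
+names, not part of uDAPL or DAPL.
+
```c
#include <assert.h>
#include <stddef.h>

/* Stand-in typedefs (assumptions); the real ones live in the uDAPL
 * header files. */
typedef void *DAT_PVOID;
typedef void *DAT_EVD_HANDLE;

typedef void (*DAT_AGENT_FUNC)(DAT_PVOID, DAT_EVD_HANDLE);

typedef struct dat_os_wait_proxy_agent {
    DAT_PVOID      instance_data;
    DAT_AGENT_FUNC proxy_agent_func;
} DAT_OS_WAIT_PROXY_AGENT;

/* Hypothetical consumer-side state carried as the instance data. */
typedef struct {
    int            wakeups;   /* how many times the agent was signalled */
    DAT_EVD_HANDLE last_evd;  /* EVD handle passed on the last signal   */
} my_proxy_state;

/* Hypothetical consumer-supplied agent function. */
static void my_proxy_func(DAT_PVOID data, DAT_EVD_HANDLE evd)
{
    my_proxy_state *st = data;
    st->wakeups++;
    st->last_evd = evd;
}

/* Hypothetical helper: signalling is just calling the function on the
 * data and an EVD handle. */
static void signal_proxy_agent(const DAT_OS_WAIT_PROXY_AGENT *agent,
                               DAT_EVD_HANDLE evd)
{
    if (agent->proxy_agent_func != NULL)
        agent->proxy_agent_func(agent->instance_data, evd);
}
```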
+
+Event Storage
+-------------
+
+The data associated with an Event (the type, the EVD, and any type
+specific data required) must be stored between event production and
+event consumption.  If storage is not provided by the underlying
+Verbs, that data must be stored in the EVD itself.  This may require
+an extra copy (one at event production and one at event consumption).
+
+Event Streams associated purely with callbacks (i.e. IB events that
+are not mediated by CQs) or user calls (i.e. software events) don't
+have any storage allocated for them by the underlying verbs and hence
+must store their data in the EVD.
+
+Event Streams that are associated with CQs have the possibility of
+leaving the information associated with the CQ between the time the
+event is produced and the time it is consumed.  However, even in this
+case, if the user calls dat_evd_wait with a threshold argument, the
+event information must be copied to storage in the EVD.  This is
+because it is not possible to determine how many completions there are
+on a CQ without dequeuing them, and that determination must be made by
+the CQ notification callback in order to decide whether to wake up a
+dat_evd_wait() waiter.  Note that this determination must be made
+dynamically based on the arguments to dat_evd_wait().
+
+Further, leaving events from Event Streams associated with the CQs "in
+the CQs" until event consumption raises issues about in what order
+events should be dequeued if there are multiple event streams entering
+an EVD.  Should the CQ events be dequeued first, or should the events
+stored in the EVD be dequeued first?  In general this is a complex
+question; the uDAPL spec does not put many restrictions on event
+order, but the major one that it does place is to restrict connection
+events associated with a QP to be dequeued before DTOs associated with
+that QP, and disconnection events after.  Unfortunately, if we adopt
+the policy of always dequeueing CQ events first, followed by EVD
+events, this means that in situations where CQ events have been copied
+to the EVD, CQ events may be received on the EVD out of order.
+
+However, leaving events from Event Streams associated with CQs on the
+CQs allows us to avoid enabling CQ callbacks in cases where there is
+no waiter associated with the EVDs.  This can potentially save a
+large number of gratuitous context switches.
+
+For the initial implementation, we will leave all event information
+associated with CQs until dequeued by the consumer.  All other event
+information will be put in storage on the EVD itself.  We will always
+dequeue from the EVD first and the CQ second, to handle ordering among
+CQ events in cases in which CQ events have been copied to the EVD.
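+
+The dequeue policy can be sketched with two small FIFOs, one standing
+in for EVD storage and one for the CQ.  Everything here (fifo,
+evd_model, evd_dequeue, evd_copy_from_cq, integer events) is a
+hypothetical model for illustration, not the DAPL data structures.
+
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAP 8

/* Hypothetical FIFO used to model both EVD storage and the CQ. */
typedef struct {
    int    items[QUEUE_CAP];
    size_t head;   /* index of the next item to remove */
    size_t count;  /* number of stored items           */
} fifo;

static bool fifo_push(fifo *q, int v)
{
    if (q->count == QUEUE_CAP)
        return false;
    q->items[(q->head + q->count) % QUEUE_CAP] = v;
    q->count++;
    return true;
}

static bool fifo_pop(fifo *q, int *out)
{
    if (q->count == 0)
        return false;
    *out = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    return true;
}

typedef struct {
    fifo evd_events;  /* events stored on the EVD itself     */
    fifo cq;          /* completions still sitting on the CQ */
} evd_model;

/* Policy from the text: dequeue from the EVD first, the CQ second. */
static bool evd_dequeue(evd_model *evd, int *event)
{
    if (fifo_pop(&evd->evd_events, event))
        return true;
    return fifo_pop(&evd->cq, event);
}

/* Move completions from the CQ into EVD storage, preserving order;
 * returns how many were copied. */
static size_t evd_copy_from_cq(evd_model *evd)
{
    size_t copied = 0;
    int e;

    while (evd->evd_events.count < QUEUE_CAP && fifo_pop(&evd->cq, &e)) {
        fifo_push(&evd->evd_events, e);
        copied++;
    }
    return copied;
}
```
+
+Because copied completions are always drained before anything still on
+the CQ, completions keep their relative order in this model even when
+some were copied and others were not.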
+
+
+Synchronization
+---------------
+
+-- EVD synchronization: Locking vs. Producer/Consumer queues.
+
+In the current code, two circular producer/consumer queues are used
+for non-CQ event storage (one holds free events, one holds posted
+events).  Event producers "consume" events from the free queue, and
+produce events onto the posted event queue.  Event consumers consume
+events from the posted event queue, and "produce" events onto the free
+queue.  In what follows, we discuss synchronization onto the posted
+event queue, but since the usage of the queues is symmetric, all of
+what we say also applies to the free event queue (just in the reverse
+polarity).
+
+The reason for using these circular queues is to allow synchronization
+between producer and consumer without locking in some situations.
+Unfortunately, a circular queue is only an effective method of
+synchronization if we can guarantee that there are only two accessors
+to it at a given time: one producer, and one consumer.  The model will
+not work if there are multiple producers, or if there are multiple
+consumers (though obviously a subsidiary lock could be used to
+single-thread either the producers or the consumers).
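+
+A minimal sketch of such a circular queue for exactly one producer and
+one consumer.  It uses C11 <stdatomic.h>, which postdates this design
+and is an assumption of the sketch; the primitives used by the actual
+DAPL code are not shown here.  The names spsc_ring, ring_produce and
+ring_consume are hypothetical.
+
```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SLOTS 4   /* one slot stays empty: holds RING_SLOTS - 1 items */

typedef struct {
    int           slots[RING_SLOTS];
    atomic_size_t head;  /* written only by the consumer */
    atomic_size_t tail;  /* written only by the producer */
} spsc_ring;

static void ring_init(spsc_ring *r)
{
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side.  Safe without a lock only if there is one producer. */
static bool ring_produce(spsc_ring *r, int v)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t next = (tail + 1) % RING_SLOTS;

    if (next == atomic_load_explicit(&r->head, memory_order_acquire))
        return false;                       /* queue is full */
    r->slots[tail] = v;
    /* Release: publish the slot contents before moving the tail. */
    atomic_store_explicit(&r->tail, next, memory_order_release);
    return true;
}

/* Consumer side.  Safe without a lock only if there is one consumer. */
static bool ring_consume(spsc_ring *r, int *out)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);

    if (head == atomic_load_explicit(&r->tail, memory_order_acquire))
        return false;                       /* queue is empty */
    *out = r->slots[head];
    atomic_store_explicit(&r->head, (head + 1) % RING_SLOTS,
                          memory_order_release);
    return true;
}
```
+
+With a second producer, two threads could read the same tail value and
+write the same slot, which is why multiple producers need a subsidiary
+lock.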
+
+There are several difficulties with guaranteeing the producers and
+consumers will each be single threaded in accessing the EVD:
+       * Constraints of the IB specification and IBM Access API
+         (differing sources for event streams without guarantees of
+         IB provider synchronization between them) make it difficult
+         to avoid multiple producers.
+       * The primitives used for the producer/consumer queue are not
+         as widely accepted as locks, and may render the design less
+         portable.
+
+We will take locks when needed when producing events.  The details of
+this plan are described below.
+
+This reasoning is described in more detail below to inform judgments
+about future performance improvements.
+
+* EVD producer synchronization
+
+The producers fall into two classes:
+       * Callbacks announcing IA associated events such as connection
+         requests, connections, disconnections, DT ops, RMR bind,
+         etc.
+       * User calls posting a software event onto the EVD.
+
+It is the user's responsibility to protect against simultaneous
+postings of software events onto the same EVD.  Similarly, the CQEH
+mechanism provided by the IBM Access API allows us to avoid collisions
+between IBM Access API callbacks associated with CQs.  However, we
+must protect against software events colliding with IBM Access API
+callbacks, and against non-CQ associated IB verb callbacks (connection
+events and asynchronous errors) colliding with CQ associated IBM
+Access API callbacks, or with other non-CQ associated IBM Access API
+callbacks (i.e. a connection callback colliding with an asynchronous
+error callback).
+
+Note that CQ related callbacks do not act as producers on the circular
+list; instead they leave the event information on the CQ until
+dequeue; see "Event Storage" above.  However, there are certain
+situations in which it is necessary for the consumer to determine the
+number of events on the EVD.  The only way that IB provides to do this
+is to dequeue the CQEs from the CQ and count them.  In these
+situations, the consumer will also act as an event producer for the
+EVD event storage, copying all event information from the CQ to the
+EVD.
+
+Based on the above, the only case in which we may do without locking
+on the producer side is when all Event Streams of all of the following
+types may be presumed to be single threaded:
+       * Software events
+       * Non-CQ associated callbacks
+       * Consumers within dat_evd_wait
+
+We use a lock on the producer side of the EVD whenever we have
+multiple threads of producers.
+
+* EVD Consumer synchronization
+
+It is the consumer's responsibility to avoid multiple callers into
+dat_evd_wait and dat_evd_dequeue.  For this reason, there is no
+requirement for a lock on the consumer side.
+
+* CQ synchronization
+
+We simplify synchronization on the CQ by identifying the CQ consumer
+with the EVD consumer.  In other words, we prohibit any thread other
+than a user thread in dat_evd_wait() or dat_evd_dequeue() from
+dequeueing events from the CQ.  This means that we can rely on the
+uDAPL spec guarantee that only a single thread will be in the
+dat_evd_wait() or dat_evd_dequeue() on a single CQ at a time.  It has
+the negative cost that (because there is no way to probe for the
+number of entries on a CQ without dequeueing) the thread blocked in
+dat_evd_wait() with a threshold argument greater than 1 will be woken
+up on each notification on that CQ, in order to dequeue entries from
+the CQ and determine if the threshold value has been reached.
+
+-- EVD Synchronization: Waiter vs. Callback
+
+Our decision to restrict dequeueing from the IB CQ to the user thread
+(rather than the notification callback thread) means that
+re-requesting notifications must also be done from that thread.  This
+leads to a subtle requirement for synchronization: the request for
+notification (ib_completion_notify) must be atomic with the wait on
+the condition variable by the user thread (atomic in the sense that
+locks must be held to force the signalling from any such notification
+to occur after the sleep on the condition variable).  Otherwise it is
+possible for the notification requested by the ib_completion_notify
+call to occur before the return from that call.  The signal done by
+that notify will be ignored, and no further notifications will be
+enabled, resulting in the thread sleeping forever.  The CQE
+associated with the notification might be noticed upon return from the
+notify request, but that CQE might also have been reaped by a previous
+call.
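+
+One way to sketch this requirement with pthreads is shown below.  The
+request_cq_notification() stub merely stands in for the real
+ib_completion_notify call, whose signature is not reproduced here; the
+"notified" flag and all other names are additions of this sketch
+rather than something the design specifies.
+
```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            notified;         /* set by callback, cleared by waiter */
    int             notify_requests;  /* bookkeeping for the stub below     */
} evd_waiter;

static int evd_waiter_init(evd_waiter *w)
{
    int rc = pthread_mutex_init(&w->lock, NULL);
    if (rc != 0)
        return rc;
    w->notified = false;
    w->notify_requests = 0;
    return pthread_cond_init(&w->cond, NULL);
}

/* Hypothetical stub standing in for the provider's notify request. */
static void request_cq_notification(evd_waiter *w)
{
    w->notify_requests++;
}

/* Runs on the provider's callback thread. */
static void cq_notification_callback(evd_waiter *w)
{
    /* Taking the lock means this cannot slip in between the waiter's
     * notification request and its sleep. */
    pthread_mutex_lock(&w->lock);
    w->notified = true;
    pthread_cond_signal(&w->cond);
    pthread_mutex_unlock(&w->lock);
}

/* Runs on the user thread inside the wait path. */
static void wait_for_notification(evd_waiter *w)
{
    pthread_mutex_lock(&w->lock);
    request_cq_notification(w);   /* requested while holding the lock */
    while (!w->notified)          /* pthread_cond_wait drops the lock */
        pthread_cond_wait(&w->cond, &w->lock);
    w->notified = false;
    pthread_mutex_unlock(&w->lock);
}
```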
+
+-- CNO Synchronization
+
+In order to protect data items that are changed during CNO signalling
+(OS Proxy Wait Object, EVD associated with triggering, CNO state), it
+is necessary to use locking when triggering and waiting on a CNO.
+
+Note that the synchronization between triggerer and waiter on CNO must
+take into account the possibility of the waiter returning from the
+wait because of a timeout.  I.e. it must handle the possibility that,
+even though the waiter was detected and the OS Wait Object signalled
+under an atomic lock, there would be no waiter on the OS Wait Object
+when it was signalled.  To handle this case, we make the job of the
+triggerer to be setting the state to triggered and signalling the OS
+Wait Object; all other manipulation is done by the waiter.
+
+-- Inter-Object Synchronization
+
+By the requirements specified above, the DAPL implementation is
+responsible for avoiding collisions between DAT Consumer calls on
+different DAT objects, even in a non-thread safe implementation.
+Luckily, no such collisions exist in this implementation; all exported
+DAPL Event Subsystem calls involve operations only on the objects to
+which they are targeted.  No inter-object synchronization is
+required.
+
+The one exception to this is the posting of a software event on an EVD
+associated with a CNO; this may result in triggering the CNO.
+However, this case was dealt with above in the discussion of
+synchronizing between event producers and consumers; the posting of a
+software event is a DAPL API call, but it's also an event producer.
+
+To avoid lock hierarchy issues between EVDs and CNOs and minimize lock
+contention, we arrange not to hold the EVD lock when triggering the
+CNO.  That is the only context in which we would naturally attempt to
+hold both locks.
+
+-- CQ -> CQEH Assignments
+
+For the initial implementation, we will assign all CQs to the same
+CQEH.  This is for simplicity and efficient use of threading
+resources; we do not want to dedicate a thread per CQ (where the
+number of CQs may grow arbitrarily high), and we have no way of
+knowing which partitioning of CQs is best for the DAPL consumer.
+
+CQ Callbacks
+------------
+
+The responsibility of a CQ callback is to wake up any waiters
+associated with the CQ--no data needs to be dequeued/delivered, since
+that is always done by the consumer.  Therefore, CQ callbacks must be
+enabled when:
+       * Any thread is in dat_evd_wait() on the EVD associated with
+         the CQ.
+       * The EVD is enabled and has a non-null CNO.  (An alternative
+         design would be to have waiters on a CNO enable callbacks on
+         all CQs associated with all EVDs associated with the CNO,
+         but this choice does not scale well as the number of EVDs
+         associated with a CNO increases).
+
+Dynamic Resizing of EVDs
+------------------------
+
+dat_evd_resize() creates a special problem for the implementor, as it
+requires that the storage allocated in the EVD be changed in size as
+events may be arriving.  If a lock is held by all operations that use
+the EVD, implementation of dat_evd_resize() is trivial; it substitutes
+a new storage mechanism for the old one, copying over all current
+events, all under lock.
+
+However, we wish to avoid universal locking for the initial
+implementation.  This puts the implementation of dat_evd_resize() into
+a tar pit.  Because of the DAT Consumer requirements for a non-thread
+safe DAPL Implementation, there will be no danger of conflict with
+Event Consumers.  However, if an Event Producer is in process of
+adding an event to the circular list when the resize occurs, that
+event may be lost or overwrite freed memory.
+
+If we are willing to make the simplifying decision that any EVD that
+has non-CQ events on it will always do full producer side locking, we
+can solve this problem relatively easily.  Resizing of the underlying
+CQ can be done via ib_cq_resize(), which we can assume available
+because of the IB spec.  Resizing of the EVD storage may be done under
+lock, and there will be no collisions with other uses of the EVD as
+all other uses of the EVD must either take the lock or are prohibited
+by the uDAPL spec.
+
+dat_evd_resize() has not yet been implemented in the DAPL Event
+subsystem.
+
+Structure and pseudo-code
+-------------------------
+
+-- EVD
+
+All EVDs will have associated with them:
+       + A lock
+       + A DAPL OS Wait Object
+       + An enabled/disabled bit
+       + A CNO pointer (may be null)
+       + A state (no_waiter, waiter, dead)
+       + A threshold count
+       + An event list
+       + A CQ (optional, but common)
+
+Posting an event to the EVD (presumably from a callback) will involve:
+^              + Checking for valid state
+|lock A        + Putting the event on the event list
+|      ^lock B + Signal the DAPL OS Wait Object, if appropriate
+v      v         (waiter & signaling event & over threshold)
+               + Trigger the CNO if appropriate (enabled & signaling
+                 event & no waiter).  Note that the EVD lock is not
+                 held for this operation to avoid holding multiple locks.
+
+("lock A" is used if producer side locking is needed.  "lock B" is
+used if producer side locking is not needed.  Regardless, the lock is
+only held to confirm that the EVD is in the WAITED state, not for
+the wakeup).  
+
+Waiting on an EVD will include:
+       + Loop:
+               + Copy all elements from CQ to EVD
+               + If we have enough, break
+               + If we haven't enabled the CQ callback
+                       + Enable it
+                       + Continue
+               + Sleep on DAPL OS Wait Object
+       + Dequeue and return an event
+
+The CQ callback will include:
+       + If there's a waiter:
+               + Signal it
+       + Otherwise, if the EVD is in the OPEN state, there's
+         a CNO, and the EVD is enabled:
+               + Reenable completion
+               + Trigger CNO
+
+Setting the enable/disable state of the EVD or setting the associated
+CNO will simply set the bits and enable the completion if needed (if a
+CNO trigger is implied); no locking is required.
+
+-- CNO
+
+All CNOs will have associated with them:
+       + A lock
+       + A DAPL OS Wait Object
+       + A state (triggered, untriggered, dead)
+       + A waiter count
+       + An EVD handle (last event which triggered the CNO)
+       + An OS Proxy Wait Object pointer (may be null)
+
+Triggering a CNO will involve:
+       ^ + If the CNO state is untriggered:
+       |     + Set it to triggered
+       |     + Note the OS Proxy wait object and zero it.
+       |     + If there are any waiters associated with the CNO,
+       |       signal them.
+       v     + Signal the OS proxy wait object if noted
+
+Waiting on a CNO will involve:
+       ^ + While the state is not triggered and the timeout has not occurred:
+       |       + Increment the CNO waiter count
+       lock    + Wait on the DAPL OS Wait Object
+       |       + Decrement the CNO waiter count
+       v + If the state is triggered, note fact&EVD and set to untriggered.
+         + Return EVD and success if state was triggered
+         + Return timeout otherwise
+
+Setting the OS Proxy Wait Object on a CNO, under lock, checks for a
+valid state and sets the OS Proxy Wait Object.
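+
+The CNO trigger and wait steps above might look roughly like the
+following in C with pthreads.  This is a sketch under assumptions: the
+cno struct, its field names, the proxy_func type and the millisecond
+timeout are hypothetical, a condition variable stands in for the DAPL
+OS Wait Object, and the proxy is called after the lock is dropped,
+which is a choice of this sketch rather than of the pseudo-code.
+
```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

typedef void (*proxy_func)(void *data, void *evd);  /* hypothetical */

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;       /* stands in for the DAPL OS Wait Object */
    bool            triggered;  /* sticky CNO state                      */
    unsigned        waiters;    /* waiter count                          */
    void           *evd;        /* EVD that last triggered the CNO       */
    proxy_func      proxy;      /* OS Proxy Wait Object, may be NULL     */
    void           *proxy_data;
} cno;

static int cno_init(cno *c)
{
    int rc = pthread_mutex_init(&c->lock, NULL);
    if (rc != 0)
        return rc;
    c->triggered = false;
    c->waiters = 0;
    c->evd = NULL;
    c->proxy = NULL;
    c->proxy_data = NULL;
    return pthread_cond_init(&c->cond, NULL);
}

static void cno_trigger(cno *c, void *evd)
{
    proxy_func proxy = NULL;
    void *proxy_data = NULL;

    pthread_mutex_lock(&c->lock);
    if (!c->triggered) {
        c->triggered = true;
        c->evd = evd;
        proxy = c->proxy;            /* note the proxy and zero it */
        proxy_data = c->proxy_data;
        c->proxy = NULL;
        if (c->waiters > 0)
            pthread_cond_broadcast(&c->cond);
    }
    pthread_mutex_unlock(&c->lock);

    if (proxy != NULL)               /* signal the proxy if noted */
        proxy(proxy_data, evd);
}

/* Returns 0 and stores the EVD in *evd, or ETIMEDOUT. */
static int cno_wait(cno *c, long timeout_ms, void **evd)
{
    struct timespec deadline;
    int rc = 0;

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec  += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&c->lock);
    while (!c->triggered && rc == 0) {
        c->waiters++;
        rc = pthread_cond_timedwait(&c->cond, &c->lock, &deadline);
        c->waiters--;
    }
    if (c->triggered) {              /* consume the sticky trigger */
        c->triggered = false;
        *evd = c->evd;
        rc = 0;
    }
    pthread_mutex_unlock(&c->lock);
    return rc;
}
```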
+
+
+==============
+Known Problems
+==============
+
+-- Because many event streams are actually delivered to EVDs by
+   callbacks, we cannot in general make any guarantees about the order
+   in which those event streams arrive; we are at the mercy of the
+   thread scheduler.  Thus we cannot hold to the guarantee given by
+   the uDAPL 1.0 specification that within a particular EVD,
+   connection events on a QP will always be before successful DTO
+   operations on that QP.
+
+   Because we have chosen to dequeue EVD events first and CQ events
+   second, we will also not be able to guarantee that all successful
+   DTO events will be received before a disconnect event.  Ability to
+   probe the CQ for its number of entries would solve this problem.
+
+
+=================
+Future Directions
+=================
+
+This section includes both functionality enhancements, and a series of
+performance improvements.  I mark these performance optimizations with
+the following flags:
+       * VerbMod: Requires modifications to the IB Verbs/the IBM
+         Access API to be effective.
+       * VerbInteg: Requires integration between the DAPL
+         implementation and the IB Verbs implementation and IB device
+         driver.
+
+Functionality Enhancements
+--------------------------
+
+-- dat_evd_resize() may be implemented by forcing producer side
+   locking whenever an event producer may occur asynchronously with
+   calls to dat_evd_resize() (i.e. when there are non-CQ event streams
+   associated with the EVD).  See the details under "Dynamic Resizing
+   of EVDs" above.
+
+-- [VerbMod] If we had a verbs modification allowing us to probe for
+   the current number of entries on a CQ, we could:
+       * Avoid waking up a dat_evd_wait(threshold>1) thread until
+         there were enough events for it.
+       * Avoid copying events from the CQ to the EVD to satisfy the
+         requirements of the "*nmore" out argument to dat_evd_wait(),
+         as well as the non-unary threshold argument.
+       * Implement the "all successful DTO operation events before
+         disconnect event" uDAPL guarantee (because we would no
+         longer have to copy CQ events to an EVD, and hence dequeue
+         first from the EVD and then from the CQ).
+   This optimization is also relevant to two of the performance
+   improvement cases below (reducing context switches and reducing
+   copies).
+
+
+Performance improvements: Reducing context switches
+---------------------------------------------------
+-- [VerbMod] If we had a verbs modification allowing us to probe for
+   the current size of a CQ, we could avoid waking up a
+   dat_evd_wait(threshold>1) thread until there were enough events
+   for it.  See the Functionality Enhancement entry covering this
+   possibility.
+
+-- [VerbMod] If we had a verbs modification allowing threads to wait
+   for completions to occur on CQs (presumably in the kernel in some
+   efficient manner), we could optimize the case of
+   dat_evd_wait(...,threshold=1,...) on EVDs with only a single CQ
+   associated Event Stream.  In this case, we could avoid the extra
+   context switch into the user callback thread; instead, the user
+   thread waiting on the EVD would be woken up by the kernel directly.
+
+-- [VerbMod] If we had the above verbs modification with a threshold
+   argument on CQs, we could implement the threshold=n case.
+
+-- [VerbInteg] In general, it would be useful to provide ways for
+   threads blocked on EVDs or CNOs to sleep in the hardware driver,
+   and for the driver interrupt thread to determine if they should be
+   awoken rather than handing that determination off to another,
+   user-level thread.  This would allow us to reduce by one the number
+   of context switches required for waking up the various blocked
+   threads.
+
+-- If an EVD has only a single Event Stream coming into it that is
+   only associated with one work queue (send or receive), it may be
+   possible to create thresholding by marking only every nth WQE on
+   the associated send or receive WQ to signal a completion.  The
+   difficulty with this is that the threshold is specified when
+   waiting on an EVD, and requesting completion signaling is
+   specified when posting a WQE; those two events may not in general
+   be synchronized enough for this strategy.  It is probably
+   worthwhile letting the consumer implement this strategy directly if
+   they so choose, by specifying the correct flags on EP and DTO so
+   that the CQ events are only signaling on every nth completion.
+   They could then use dat_evd_wait() with a threshold of 1.
+
+Performance improvements: Reducing copying of event data
+--------------------------------------------------------
+-- [VerbMod] If we had the ability to query a CQ for the number of
+   completions on it, we could avoid the cost of copying event data from the
+   CQ to the EVD.  This is a duplicate of the second entry under
+   "Functionality Enhancements" above.
+
+Performance improvements: Reducing locking
+------------------------------------------
+-- dat_evd_dequeue() may be modified to not take any locks.
+
+-- If there is no waiter associated with an EVD and there is only a
+   single event producer, we may avoid taking any locks in producing
+   events onto that EVD.  This must be done carefully to handle the
+   case of racing with a waiter waiting on the EVD as we deliver the
+   event.
+
+-- If there is no waiter associated with an EVD, and we create a
+   producer/consumer queue per event stream with a central counter
+   modified with atomic operations, we may avoid locking on the EVD.
+
+-- It may be possible, through judicious use of atomic operations, to
+   avoid locking when triggering a CNO unless there is a waiter on the
+   CNO.  This has not been done to keep the initial design simple.
+
+Performance improvements: Reducing atomic operations
+----------------------------------------------------
+-- We could combine the EVD circular lists, to avoid a single atomic
+   operation on each production and each consumption of an event.  In
+   this model, event structures would not move from list to list;
+   whether or not they had valid information on them would simply
+   depend on where they were on the lists.
+
+-- We may avoid the atomic increments on the circular queues (which
+   have a noticeable performance cost on the bus) if all accesses to an
+   EVD take locks.
+
+
+Performance improvements: Increasing concurrency
+------------------------------------------------
+-- When running on a multi-CPU platform, it may be appropriate to
+   assign CQs to several separate CQEHs, to increase the concurrency
+   of execution of CQ callbacks.  However, note that consumer code is
+   never run within a CQ callback, so those callbacks should take very
+   little time per callback.  This plan would only make sense in
+   situations where there were very many CQs, all of which were
+   active, and for whatever reason (high threshold, polling, etc)
+   user threads were usually not woken up by the execution of a
+   provider CQ callback.
+
+
diff --git a/doc/dapl_ibm_api_variations.txt b/doc/dapl_ibm_api_variations.txt

new file mode 100644 (file)

index 0000000..764e552
--- /dev/null
+++ b/doc/dapl_ibm_api_variations.txt
@@ -0,0 +1,34 @@
+
+               DAPL Variations from IBM OS Access API
+                --------------------------------------
+
+The DAPL reference implementation is targeted at the IBM OS Access
+API (see doc/api/IBM_access_api.pdf).  However, in the course of
+developing the reference implementation it has become necessary to
+alter or enhance this API specification in minor ways.  This document
+describes the ways in which the Access API has been altered to
+accommodate the needs of the reference implementation.
+
+Note that this document is a work in progress/a place holder; it does
+not yet contain all of the API variations used by the reference
+implementation.  It is intended that it will be brought up to date
+before the final release of the DAPL reference implementation.
+
+The variations from the IBM OS Access API are listed below.
+
+-- Thread safety
+
+The IBM OS Access API specifies:
+
+"Implementation of the Access APIs should ensure that multiple threads
+ can call the APIs, provided they do not access the same InfiniBand
+ entity (such as a queue pair or a completion queue)."
+
+This has been extended in two ways:
+       * It is safe for multiple threads to call into the API
+         accessing the same HCA.
+       * Threads calling ib_post_send_req on a particular QP do not
+         conflict with threads calling ib_post_rcv_req on the same
+         QP.  I.e. while there cannot be multiple threads in
+         ib_post_send_req or ib_post_rcv_req on the same QP, there
+         may be one thread in each routine simultaneously.
diff --git a/doc/dapl_memory_management_design.txt b/doc/dapl_memory_management_design.txt

new file mode 100644 (file)

index 0000000..09ff153
--- /dev/null
+++ b/doc/dapl_memory_management_design.txt
@@ -0,0 +1,173 @@
+#######################################################################
+#                                                                     #
+# DAPL Memory Management Design                                       #
+#                                                                     #
+# James Lentini                                                       #
+# jlentini at users.sourceforge.net                                   #
+#                                                                     #
+# Created 05/06/2002                                                  #
+# Updated 08/22/2002                                                  #
+#                                                                     #
+#######################################################################
+
+
+Contents
+--------
+0. Introduction
+1. Protection Zones (PZs)
+2. Local Memory Regions (LMRs)
+3. Remote Memory Regions (RMRs)
+
+
+0. Introduction
+---------------
+
+   The memory management subsystem allows consumers to register and 
+unregister memory regions.  The DAT API distinguishes between local 
+and remote memory areas.  The former serve as local buffers for DTO 
+operations while the latter are used for RDMA operations.  
+
+Each DAT function is implemented in a file named dapl_<function name>.c.  
+For example, dat_pz_create is implemented in dapl_pz_create.c.  There 
+are also dapl_<object name>_util.{h,c} files for each object.  For 
+example, there are dapl_pz_util.h and dapl_pz_util.c files.  The 
+use of util files follows the convention used elsewhere in the DAPL 
+reference provider.  These files contain common object creation and 
+destruction code.
+
+
+1. Protection Zones (PZs)
+-------------------------
+
+   DAPL protection zones provide consumers with a means to associate 
+various DAPL objects with one another.  The association can then be 
+validated before allowing these objects to be manipulated.  The DAT 
+functions related to PZs are:
+
+dat_pz_create
+dat_pz_free
+dat_pz_query
+
+These are implemented in the DAPL reference provider by 
+
+dapl_pz_create
+dapl_pz_free
+dapl_pz_query
+
+The reference implementation maps the DAPL PZ concept onto Infiniband 
+protection domains (PDs).  
+
+The DAT_PZ_HANDLE value returned to DAT consumers is a pointer to a 
+DAPL_PZ data structure. The DAPL_PZ structure is used to represent all 
+PZ objects. Code that manipulates this structure should atomically 
+increment and decrement the ref_count member to track the number of 
+objects referencing the PZ.
+
+
+2. Local Memory Regions (LMRs)
+------------------------------
+
+    DAPL local memory regions represent a memory area on the host 
+system that the consumer wishes to access via local DTO operations.  
+The DAT functions related to LMRs are:
+
+dat_lmr_create
+dat_lmr_free
+dat_lmr_query
+
+These are implemented in 
+
+dapl_lmr_create
+dapl_lmr_free
+dapl_lmr_query
+
+In the reference implementation, DAPL LMRs are mapped onto 
+Infiniband memory regions (MRs).  
+
+LMR creation produces two values: a DAT_LMR_CONTEXT and a 
+DAT_LMR_HANDLE. 
+
+The DAT_LMR_CONTEXT value is used to uniquely identify the LMR 
+when posting data transfer operations. These values map directly 
+to Infiniband L_KEYs.
+
+Since some DAT functions need to translate a DAT_LMR_CONTEXT value 
+into a DAT_LMR_HANDLE (ex. dat_rmr_bind), a dictionary data structure 
+is used to associate DAT_LMR_CONTEXT values with their corresponding 
+DAT_LMR_HANDLE.  Each time a new LMR is created, the DAT_LMR_HANDLE 
+should be inserted into the dictionary with the associated 
+DAT_LMR_CONTEXT as the key. 
+
+A hash table was chosen to implement this data structure. Since the 
+L_KEY values are being used by the CA hardware for indexing purposes, 
+their distribution is expected to be uniform and hence ideal for hashing.
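+
+A minimal sketch of such a dictionary is shown below.  It assumes a
+32-bit context value and a fixed, power-of-two bucket count, and it
+uses the low bits of the key directly as the bucket index, which is
+only reasonable if the keys really are uniformly distributed as
+argued above.  The names (lmr_table, lmr_table_insert, and so on) are
+hypothetical, and locking is left out.
+

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define LMR_BUCKETS 64u   /* illustrative size; must be a power of two */

typedef struct lmr_node {
    uint32_t         context;  /* stands in for DAT_LMR_CONTEXT (an L_KEY) */
    void            *handle;   /* stands in for DAT_LMR_HANDLE */
    struct lmr_node *next;     /* chain of entries sharing a bucket */
} lmr_node;

typedef struct {
    lmr_node *bucket[LMR_BUCKETS];
} lmr_table;

/* Uniformly distributed keys need no mixing; the low bits pick the bucket. */
static size_t lmr_slot(uint32_t context)
{
    return context & (LMR_BUCKETS - 1u);
}

/* Insert a (context, handle) pair.  Duplicate keys are not checked for. */
int lmr_table_insert(lmr_table *t, uint32_t context, void *handle)
{
    lmr_node *n;

    assert(t != NULL);
    n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->context = context;
    n->handle = handle;
    n->next = t->bucket[lmr_slot(context)];
    t->bucket[lmr_slot(context)] = n;
    return 0;
}

/* Return the handle stored for this context, or NULL if there is none. */
void *lmr_table_search(const lmr_table *t, uint32_t context)
{
    const lmr_node *n;

    for (n = t->bucket[lmr_slot(context)]; n != NULL; n = n->next)
        if (n->context == context)
            return n->handle;
    return NULL;
}

/* Remove the entry for this context and return its handle, or NULL. */
void *lmr_table_remove(lmr_table *t, uint32_t context)
{
    lmr_node **link = &t->bucket[lmr_slot(context)];

    for (; *link != NULL; link = &(*link)->next) {
        if ((*link)->context == context) {
            lmr_node *n = *link;
            void *handle = n->handle;
            *link = n->next;
            free(n);
            return handle;
        }
    }
    return NULL;
}
```

+
+In this sketch the insert would be made when an LMR is created, the
+remove when it is freed, and the search when a function such as
+dat_rmr_bind has only the context value in hand.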
+
+The DAT_LMR_HANDLE value returned to DAT consumers is a pointer to 
+a DAPL_LMR data structure. The DAPL_LMR structure is used to represent 
+all LMR objects. The ref_count member should be used to track objects 
+associated with a given LMR.
+
+The DAT API exposes the DAT_LMR_CONTEXT to consumers to allow 
+for sharing of memory registrations between multiple address spaces. 
+The mechanism by which such a feature would be implemented does not 
+yet exist. Consumers may be able to take advantage of this 
+feature on future transports. 
+
+
+3. Remote Memory Regions (RMRs)
+-------------------------------
+
+    DAPL remote memory regions represent a memory area on the host 
+system to which the consumer wishes to allow RDMA operations.  The 
+related DAT functions are
+
+dat_rmr_create
+dat_rmr_free
+dat_rmr_query
+dat_rmr_bind
+
+which are implemented in 
+
+dapl_rmr_create
+dapl_rmr_free
+dapl_rmr_query
+dapl_rmr_bind
+
+The reference provider maps RMR objects onto Infiniband memory 
+windows.
+
+The DAT_RMR_HANDLE value returned to DAT consumers is a pointer to 
+a DAPL_RMR data structure. The DAPL_RMR structure is used to represent 
+all RMR objects.
+
+The API for binding an LMR to an RMR has the following function 
+signature:
+
+DAT_RETURN
+dapl_rmr_bind (
+       IN      DAT_RMR_HANDLE          rmr_handle,
+       IN      const DAT_LMR_TRIPLET   *lmr_triplet,
+       IN      DAT_MEM_PRIV_FLAGS      mem_priv,
+       IN      DAT_EP_HANDLE           ep_handle,
+       IN      DAT_RMR_COOKIE          user_cookie,
+       IN      DAT_COMPLETION_FLAGS    completion_flags,
+       OUT     DAT_RMR_CONTEXT         *rmr_context )
+
+where a DAT_LMR_TRIPLET is defined as: 
+
+typedef struct dat_lmr_triplet
+    {
+    DAT_LMR_CONTEXT     lmr_context;
+    DAT_UINT32          pad;
+    DAT_VADDR           virtual_address;
+    DAT_VLEN            segment_length;
+    } DAT_LMR_TRIPLET;
+
+In the case of IB, the DAT_LMR_CONTEXT value is an L_KEY.
+As described in the IB spec, the Bind Memory Window verb 
+takes both an L_KEY and a Memory Region Handle among other 
+parameters. Therefore a data structure must be used to 
+map a DAT_LMR_CONTEXT (L_KEY) value to a DAPL_LMR so 
+that the needed memory region handle can be retrieved.
+The LMR hash table described above is used for this 
+purpose.
diff --git a/doc/dapl_patch.txt b/doc/dapl_patch.txt

new file mode 100644 (file)

index 0000000..5bc5f9d
--- /dev/null
+++ b/doc/dapl_patch.txt
@@ -0,0 +1,83 @@
+#######################################################################
+#                                                                     #
+# DAPL Patch Guide                                                    #
+#                                                                     #
+# James Lentini                                                       #
+# jlentini at users.sourceforge.net                                   #
+#                                                                     #
+# Created 03/30/2005                                                  #
+# Version 1.0                                                         #
+#                                                                     #
+#######################################################################
+
+
+Overview
+--------
+
+The DAPL Reference Implementation (RI) Team welcomes code contributions 
+and bug fixes from RI users. This document describes the format for 
+submitting patches to the project.
+
+Directions
+----------
+
+When implementing a new feature or bug fix, please remember to:
+
++ Use the project coding style, described in doc/dapl_coding_style.txt
++ Remember that the RI supports multiple platforms and transports. If 
+  your modification is not applicable to all platforms and transports,
+  please ensure that the implementation does not affect these other 
+  configurations.
+
+When creating the patch:
+
++ Create the patch using a unified diff as follows: 
+  diff -Naur old-code new-code > patch
++ Create the patch from the root of the CVS tree.
+
+When submitting the patch:
+
++ Compose an email message containing a brief description of the patch, 
+  a Signed-off-by line, and the patch.
++ Have the text "[PATCH]" at the start of the subject line
++ Send the message to dapl-devel@lists.sourceforge.net
+
+Example
+-------
+
+Here is an example patch message:
+
+------------------------------------------------------------
+Date: 30 Mar 2005 11:49:45 -0500
+From: Jane Doe
+To: dapl-devel@lists.sourceforge.net
+Subject: [PATCH] fixed status returns
+
+Here's a patch to fix the status return value in 
+dats_handle_vector_init().
+
+Signed-off-by: Jane Doe <jdoe at pseudonyme.com>
+
+--- dat/common/dat_api.c~       2005-03-30 11:58:40.838968000 -0500
++++ dat/common/dat_api.c        2005-03-28 12:33:29.502076000 -0500
+@@ -70,16 +70,15 @@
+ {
+     DAT_RETURN         dat_status;
+     int                        i;
+-    int                        status;
+
+     dat_status = DAT_SUCCESS;
+
+     g_hv.handle_max   = DAT_HANDLE_ENTRY_STEP;
+
+-    status = dat_os_lock_init (&g_hv.handle_lock);
+-    if ( DAT_SUCCESS != status )
++    dat_status = dat_os_lock_init (&g_hv.handle_lock);
++    if ( DAT_SUCCESS != dat_status )
+     {
+-       return status;
++       return dat_status;
+     }
+
+     g_hv.handle_array = dat_os_alloc (sizeof(void *) * DAT_HANDLE_ENTRY_STEP);
+------------------------------------------------------------
diff --git a/doc/dapl_registry_design.txt b/doc/dapl_registry_design.txt

new file mode 100644 (file)

index 0000000..7215cc0
--- /dev/null
+++ b/doc/dapl_registry_design.txt
@@ -0,0 +1,631 @@
+               DAT Registry Subsystem Design v. 0.90
+                -------------------------------------
+
+=================
+Table of Contents
+=================
+
+* Table of Contents
+* Referenced Documents
+* Introduction
+* Goals
+* Provider API
+* Consumer API
+* Registry Design
+    + Registry Database
+    + Provider API pseudo-code
+    + Consumer API pseudo-code
+    + Platform Specific API pseudo-code
+
+====================
+Referenced Documents
+====================
+
+uDAPL: User Direct Access Programming Library, Version 1.0.  Published
+6/21/2002.  http://www.datcollaborative.org/uDAPL_062102.pdf. Referred
+to in this document as the "DAT Specification". 
+
+============
+Introduction
+============
+
+The DAT architecture supports the use of multiple DAT providers within
+a single consumer application. Consumers implicitly select a provider
+using the Interface Adapter name parameter passed to dat_ia_open(). 
+
+The subsystem that maps Interface Adapter names to provider
+implementations is known as the DAT registry. When a consumer calls
+dat_ia_open(), the appropriate provider is found and notified of the
+consumer's request to access the IA. After this point, all DAT API
+calls acting on DAT objects are automatically directed to the
+appropriate provider entry points.
+
+A persistent, administratively configurable database is used to store
+mappings from IA names to provider information. This provider
+information includes: the file system path to the provider library
+object, version information, and thread safety information. The
+location and format of the registry is platform dependent. This
+database is known as the Static Registry (SR). The process of adding a
+provider entry is termed Static Registration.   
+
+Within each DAT consumer, there is a per-process database that
+maps from ia_name -> provider information. When dat_ia_open() is
+called, the provider library is loaded, the ia_open_func is found, and
+the ia_open_func is called.  
+
+=====
+Goals
+=====
+
+-- Implement the registration mechanism described in the uDAPL
+   Specification. 
+
+-- The DAT registry should be thread safe.
+   
+-- On a consumer's performance critical data transfer path, the DAT
+   registry should not require any significant overhead. 
+
+-- The DAT registry should not limit the number of IAs or providers
+   supported.  
+
+-- The user level registry should be tolerant of arbitrary library 
+   initialization orders and support calls from library initialization 
+   functions.
+
+============
+Provider API
+============
+
+Provider libraries must register themselves with the DAT registry.
+Along with the Interface Adapter name they wish to map, they must
+provide a routines vector containing provider-specific implementations
+of all DAT APIs.  If a provider wishes to service multiple Interface
+Adapter names with the same DAT APIs, it must register each name
+separately with the DAT registry. The Provider API is not exposed to
+consumers.
+
+The user level registry must ensure that the Provider API may be
+called from a library's initialization function. Therefore the
+registry must not rely on a specific library initialization order.
+
+    DAT_RETURN
+    dat_registry_add_provider(
+        IN DAT_PROVIDER                 *provider ) 
+
+Description: Allows the provider to add a mapping.  It will return an
+error if the Interface Adapter name already exists. 
+
+    DAT_RETURN
+    dat_registry_remove_provider(
+        IN  DAT_PROVIDER                *provider )
+
+Description: Allows the Provider to remove a mapping. It will return
+an error if the mapping does not already exist.  
+
+============
+Consumer API
+============
+
+Consumers that wish to use a provider library call the DAT registry to
+map Interface Adapter names to provider libraries. The consumer API is
+exposed to both consumers and providers.
+
+    DAT_RETURN
+    dat_ia_open (
+        IN   const DAT_NAME        device_name,
+        IN    DAT_COUNT            async_event_qlen,
+        INOUT DAT_EVD_HANDLE       *async_event_handle,
+        OUT   DAT_IA_HANDLE        *ia_handle )
+
+Description: Upon success, this function returns a DAT_IA_HANDLE to
+the consumer. This handle, while opaque to the consumer, provides
+direct access to the provider supplied library. To support this
+feature, all DAT_HANDLEs must be pointers to a pointer to a
+DAT_PROVIDER structure.
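+
+The pointer-to-pointer requirement can be pictured with the sketch
+below.  It assumes that each provider object keeps a pointer to its
+provider's routines vector as its first member, so that any handle
+can be turned back into the provider without knowing the object's
+real type.  The types and names here (provider, example_ia,
+HANDLE_TO_PROVIDER, registry_ia_close) are hypothetical stand-ins,
+not the DAT definitions.
+

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for DAT_PROVIDER: a name plus a routines vector. */
typedef struct provider {
    const char *device_name;
    int       (*ia_close_func)(void *ia_handle);
} provider;

/* A provider-private object; the provider pointer must be the first member. */
typedef struct {
    provider *prov;
    int       is_open;
} example_ia;

/* A handle points at an object whose first member points at the provider. */
#define HANDLE_TO_PROVIDER(handle) (*(provider **)(handle))

/* Provider-side implementation of the close routine. */
int example_ia_close(void *ia_handle)
{
    example_ia *ia = ia_handle;

    ia->is_open = 0;
    return 0;
}

/* Registry-side dispatch: no lookup, just two pointer dereferences. */
int registry_ia_close(void *ia_handle)
{
    provider *p;

    assert(ia_handle != NULL);
    p = HANDLE_TO_PROVIDER(ia_handle);
    return p->ia_close_func(ia_handle);
}
```

+
+Dispatch of this kind needs no dictionary search and no lock, which
+is in keeping with the goal of adding no significant overhead on the
+data transfer path.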
+
+    DAT_RETURN
+    dat_ia_close (
+        IN DAT_IA_HANDLE            ia_handle )
+
+Description: Closes the Interface Adapter.
+
+    DAT_RETURN
+    dat_registry_list_providers(
+        IN  DAT_COUNT                   max_to_return,
+        OUT DAT_COUNT                   *entries_returned,
+        OUT DAT_PROVIDER_INFO           *(dat_provider_list[]) )
+
+Description: Lists the current mappings.
+
+===============
+Registry Design
+===============
+
+There are three separate portions of the DAT registry system:
+
+* Registry Database
+
+* Provider API
+
+* Consumer API 
+
+We address each of these areas in order. The final section will
+describe any necessary platform specific functions.
+
+Registry Database
+-----------------
+
+Static Registry
+................
+
+The Static Registry is a persistent database containing provider
+information keyed by Interface Adapter name. The Static Registry will
+be examined once when the DAT library is loaded. 
+
+There is no synchronization mechanism protecting access to the Static
+Registry. Multiple readers and writers may concurrently access the
+Static Registry and as a result there is no guarantee that the
+database will be in a consistent format at any given time. DAT
+consumers should be aware of this and not run DAT programs when the
+registry is being modified (for example, when a new provider is being
+installed). However, the DAT library must be robust enough to recognize
+an inconsistent Static Registry and ignore invalid entries.
+
+Information in the Static Registry will be used to initialize the
+registry database. The registry will refuse to load libraries for DAT
+API versions different than its DAT API version. Switching API
+versions will require switching versions of the registry library (the
+library explicitly placed on the link line of DAPL programs) as well
+as the header files included by the program. 
+
+Set DAT_NO_STATIC_REGISTRY at compile time if you wish to compile
+DAT without a static registry.
+
+UNIX Registry Format
+.....................
+
+The UNIX registry will be a plain text file with the following
+properties:  
+       * All characters after # on a line are ignored (comments). 
+       * Lines on which there are no characters other than whitespace
+         and comments are considered blank lines and are ignored.
+       * Non-blank lines must have seven whitespace separated fields.
+         These fields may contain whitespace if the field is quoted
+         with double quotes.  Within fields quoted with double quotes, 
+          the following are valid escape sequences:
+
+          \\   backslash
+          \"   quote
+
+       * Each non-blank line will contain the following fields:
+
+        - The IA Name.
+        - The API version of the library:
+          [k|u]major.minor where "major" and "minor" are both integers
+          in decimal format. Examples: "k1.0", "u1.0", and "u1.1".
+        - Whether the library is thread-safe:
+          [threadsafe|nonthreadsafe]
+        - Whether this is the default section: [default|nondefault]
+        - The path name for the library image to be loaded. 
+        - The version of the driver: major.minor, for example, "5.13".
+
+The format of any remaining fields on the line is dependent on the API
+version of the library specified on that line. For API versions 1.0
+and 1.1 (both kDAPL and uDAPL), there is only a single additional
+field, which is:
+
+       - An optional string with instance data, which will be passed to 
+         the loaded library as its run-time arguments.
+
+This file format is described by the following grammar:
+
+<entry-list>      -> <entry> <entry-list> | <eof>
+<entry>           -> <ia-name> <api-ver> <thread-safety> <default-section>
+                     <lib-path> <driver-ver> <ia-params> [<eor>|<eof>] | 
+                     [<eor>|<eof>]
+<ia-name>         -> string
+<api-ver>         -> [k|u]decimal.decimal
+<thread-safety>   -> [threadsafe|nonthreadsafe]
+<default-section> -> [default|nondefault]
+<lib-path>        -> string
+<driver-ver>      -> decimal.decimal
+<ia-params>       -> string
+<eof>             -> end of file
+<eor>             -> newline
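+
+The field rules above can be sketched as a small line splitter.  The
+code below is an illustration only: the function names and the
+256-byte field limit are assumptions, and the sample line used to
+exercise it (an IA called "ia0" with a made-up library path) is
+hypothetical rather than a real registry entry.  It handles comments,
+whitespace separation, double-quoted fields, and the two escape
+sequences, and it does not validate the contents of individual fields.
+

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define SR_FIELDS   7     /* fields per non-blank line */
#define SR_FIELD_SZ 256   /* illustrative limit, not part of the format */

/*
 * Split one registry line into fields.  Returns the number of fields
 * found (0 for a blank or comment-only line), or -1 for a malformed
 * line: unterminated quote, unknown escape, oversized field, or more
 * than SR_FIELDS fields.
 */
int sr_split_line(const char *line, char fields[SR_FIELDS][SR_FIELD_SZ])
{
    const char *p = line;
    int n = 0;

    assert(line != NULL);
    for (;;) {
        size_t len = 0;
        int quoted;

        while (*p != '\0' && isspace((unsigned char)*p))
            p++;
        if (*p == '\0' || *p == '#')
            return n;                      /* end of line or start of comment */
        if (n == SR_FIELDS)
            return -1;                     /* too many fields */

        quoted = (*p == '"');
        if (quoted)
            p++;
        for (;;) {
            char c = *p;

            if (quoted) {
                if (c == '\0')
                    return -1;             /* unterminated quote */
                if (c == '"') {
                    p++;                   /* consume the closing quote */
                    break;
                }
                if (c == '\\') {
                    c = *++p;              /* only \\ and \" are valid */
                    if (c != '\\' && c != '"')
                        return -1;
                }
            } else if (c == '\0' || c == '#' || isspace((unsigned char)c)) {
                break;
            }
            if (len + 1 >= SR_FIELD_SZ)
                return -1;                 /* field too long */
            fields[n][len++] = c;
            p++;
        }
        fields[n][len] = '\0';
        n++;
    }
}

/* Interpret the thread-safety field: 1, 0, or -1 if it is neither keyword. */
int sr_is_threadsafe(const char *field)
{
    if (strcmp(field, "threadsafe") == 0)
        return 1;
    if (strcmp(field, "nonthreadsafe") == 0)
        return 0;
    return -1;
}
```

+
+A caller would treat a return value of 0 as a blank line, 7 as a
+candidate entry, and anything else as an invalid entry to be ignored.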
+
+The location of this file may be specified by setting the environment
+variable DAT_CONF. If DAT_CONF is not set, the default location will
+be /etc/dat.conf.
+
+Windows Registry Format
+.......................
+
+Standardization of the Windows registry format is not complete at this
+time.
+
+Registry Database Data Structures
+.................................
+
+The Registry Database is implemented as a dictionary data structure that
+stores (key, value) pairs. 
+
+Initially the dictionary will be implemented as a linked list. This
+will allow for an arbitrary number of mappings within the resource
+limits of a given system. Although the search algorithm will have O(n)
+worst case time when n elements are stored in the data structure, we
+do not anticipate this to be an issue. We believe that the number of
+IA names and providers will remain relatively small (on the order of
+10). If performance is found to be an issue, the dictionary can be
+re-implemented using another data structure without changing the
+Registry Database API. 
+
+The dictionary uses IA name strings as keys and stores pointers to a
+DAT_REGISTRY_ENTRY structure, which contains the following
+information: 
+
+    - provider library path string,            library_path
+    - DAT_OS_LIBRARY_HANDLE,                   library_handle
+    - IA parameter string,                     ia_params
+    - DAT_IA_OPEN_FUNC function pointer,       ia_open_func
+    - thread safety indicator,                 is_thread_safe
+    - reference counter,                       ref_count
+
+The entire registry database data structure is protected by a single
+lock. All threads that wish to query/modify the database must possess
+this lock. Serializing access in this manner is not expected to have a
+detrimental effect on performance as contention is expected to be
+minimal. 
+
+An important property of the registry is that entries may be inserted
+into the registry, but no entries are ever removed. The contents of
+the static registry are used to populate the initially empty registry
+database. Since these mappings are by definition persistent, no
+mechanism is provided to remove them from the registry database.
+
+NOTE: There is currently no DAT interface to set a provider's IA 
+specific parameters. A solution for this problem has been proposed for
+uDAPL 1.1.
+
+Registry Database API
+.....................
+
+The static variable Dat_Registry_Db is used to store information about
+the Registry Database and has the following members:
+
+    - lock
+    - dictionary
+
+The Registry Database is accessed via the following internal API:
+
+Algorithm: dat_registry_init
+    Input: void
+   Output: DAT_RETURN
+{
+    initialize Dat_Registry_Db
+
+    dat_os_sr_load()
+}
+
+Algorithm: dat_registry_insert
+    Input: IN  const DAT_STATIC_REGISTRY_ENTRY sr_entry
+   Output: DAT_RETURN
+{
+    dat_os_lock(&Dat_Registry_Db.lock)
+
+    create and initialize DAT_REGISTRY_ENTRY structure 
+
+    dat_dictionary_add(&Dat_Registry_Db.dictionary, &entry)
+
+    dat_os_unlock(&Dat_Registry_Db.lock)
+}
+
+Algorithm: dat_registry_search
+    Input: IN    const DAT_NAME_PTR     ia_name
+           OUT   DAT_REGISTRY_ENTRY     **entry
+   Output: DAT_RETURN
+{
+    dat_os_lock(&Dat_Registry_Db.lock)
+
+    entry gets dat_dictionary_search(&Dat_Registry_Db.dictionary, &ia_name)
+
+    dat_os_unlock(&Dat_Registry_Db.lock)
+}
+
+Algorithm: dat_registry_list
+    Input: IN  DAT_COUNT                max_to_return
+           OUT DAT_COUNT                *entries_returned
+           OUT DAT_PROVIDER_INFO        *(dat_provider_list[])
+   Output: DAT_RETURN
+{
+    dat_os_lock(&Dat_Registry_Db.lock)
+
+    size = dat_dictionary_size(Dat_Registry_Db.dictionary)
+
+    for ( i = 0, j = 0; 
+          (i < max_to_return) && (j < size); 
+          i++, j++ ) 
+    {
+        initialize dat_provider_list[i] w/ j-th element in dictionary
+    }
+
+    dat_os_unlock(&Dat_Registry_Db.lock)
+
+    *entries_returned = i;
+}
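+
+For comparison, the insert, search, and list algorithms above might
+look like this in C.  The sketch assumes POSIX mutexes in place of
+the dat_os_lock routines and a plain linked list for the dictionary;
+the names (reg_entry, reg_insert, reg_search, reg_list) are
+hypothetical, and rejecting a duplicate IA name at insert time is a
+choice made here rather than something the pseudo-code states.
+

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct reg_entry {
    const char       *ia_name;       /* key; the caller keeps the string alive */
    const char       *library_path;
    struct reg_entry *next;
} reg_entry;

/* One lock protects the whole database, as in the design. */
static struct {
    pthread_mutex_t lock;
    reg_entry      *head;
    size_t          size;
} reg_db = { PTHREAD_MUTEX_INITIALIZER, NULL, 0 };

/* Add a mapping.  Returns 0 on success, -1 on a duplicate name or no memory. */
int reg_insert(const char *ia_name, const char *library_path)
{
    reg_entry *e, *it;
    int rc = 0;

    assert(ia_name != NULL);
    e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->ia_name = ia_name;
    e->library_path = library_path;

    pthread_mutex_lock(&reg_db.lock);
    for (it = reg_db.head; it != NULL; it = it->next)
        if (strcmp(it->ia_name, ia_name) == 0)
            rc = -1;
    if (rc == 0) {
        e->next = reg_db.head;
        reg_db.head = e;
        reg_db.size++;
    }
    pthread_mutex_unlock(&reg_db.lock);

    if (rc != 0)
        free(e);
    return rc;
}

/* O(n) search.  The result stays valid because entries are never removed. */
const reg_entry *reg_search(const char *ia_name)
{
    const reg_entry *it;

    pthread_mutex_lock(&reg_db.lock);
    for (it = reg_db.head; it != NULL; it = it->next)
        if (strcmp(it->ia_name, ia_name) == 0)
            break;
    pthread_mutex_unlock(&reg_db.lock);
    return it;
}

/* Copy up to max IA names into names[] and return how many were copied. */
size_t reg_list(size_t max, const char *names[])
{
    const reg_entry *it;
    size_t i = 0;

    pthread_mutex_lock(&reg_db.lock);
    for (it = reg_db.head; it != NULL && i < max; it = it->next)
        names[i++] = it->ia_name;
    pthread_mutex_unlock(&reg_db.lock);
    return i;
}
```

+
+Returning an entry pointer after the lock has been dropped is only
+safe because of the insert-only property described earlier; a
+registry that removed entries would need reference counting or would
+have to copy the entry out while still holding the lock.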
+
+Provider API pseudo-code
+------------------------
+
++ dat_registry_add_provider()
+
+Algorithm: dat_registry_add_provider
+    Input: IN DAT_PROVIDER              *provider
+   Output: DAT_RETURN
+{
+    dat_init()
+
+    dat_registry_search(provider->device_name, &entry)
+
+    if IA name is not found then dat_registry_insert(new entry)
+
+    if entry.ia_open_func is not NULL return an error
+
+    entry.ia_open_func = provider->ia_open_func
+}
+
++ dat_registry_remove_provider()
+
+Algorithm: dat_registry_remove_provider
+    Input: IN  DAT_PROVIDER                *provider 
+   Output: DAT_RETURN
+{
+    dat_init()
+
+    dat_registry_search(provider->device_name, &entry)
+
+    if IA name is not found return an error
+
+    entry.ia_open_func = NULL
+}        
+
+Consumer API pseudo-code
+------------------------
+
+* dat_ia_open() 
+
+This function looks up the specified IA name in the ia_dictionary, 
+loads the provider library, retrieves a function pointer to the
+provider's IA open function from the provider_dictionary, and calls
+the provider's IA open function. 
+
+Algorithm: dat_ia_open
+    Input: IN    const DAT_NAME_PTR    name
+          IN    DAT_COUNT              async_event_qlen
+          INOUT DAT_EVD_HANDLE         *async_event_handle
+          OUT   DAT_IA_HANDLE          *ia_handle
+   Output: DAT_RETURN 
+
+{
+    dat_registry_search(name, &entry)
+
+    if the name is not found return an error
+    
+    dat_os_library_load(entry.library_path, &entry.library_handle)
+
+    if the library fails to load return an error
+    
+    if the entry's ia_open_func is invalid 
+    {
+        dat_os_library_unload(entry.library_handle)
+        return an error
+    }
+
+    (*ia_open_func) (name, 
+                     async_event_qlen,
+                     async_event_handle,
+                     ia_handle);
+}
+
+* dat_ia_close()
+
+Algorithm: dat_ia_close
+    Input: IN DAT_IA_HANDLE             ia_handle
+           IN DAT_CLOSE_FLAGS           ia_flags
+   Output: DAT_RETURN 
+{
+    provider = DAT_HANDLE_TO_PROVIDER(ia_handle)
+
+    (*provider->ia_close_func) (ia_handle, ia_flags)
+
+    dat_registry_search(provider->device_name, &entry)
+
+    dat_os_library_unload(entry.library_handle)
+}
+
++ dat_registry_list_providers()
+
+Algorithm: dat_registry_list_providers
+    Input: IN  DAT_COUNT                   max_to_return
+           OUT DAT_COUNT                   *entries_returned
+           OUT DAT_PROVIDER_INFO           *(dat_provider_list[])
+   Output: DAT_RETURN
+{
+    validate parameters
+
+    dat_registry_list(max_to_return, entries_returned, dat_provider_list)
+}
+
+Platform Specific API pseudo-code
+---------------------------------
+
+Below are descriptions of platform specific functions required by the
+DAT Registry. These descriptions are for Linux.
+
+Each entry in the static registry is represented by an OS specific
+structure, DAT_OS_STATIC_REGISTRY_ENTRY. On Linux, this structure will
+have the following members:
+
+ - IA name string
+ - API version
+ - thread safety 
+ - default section
+ - library path string
+ - driver version
+ - IA parameter string
+
+The tokenizer will return a DAT_OS_SR_TOKEN structure
+containing:
+
+ - DAT_OS_SR_TOKEN_TYPE value
+ - string with the field's value
+
+The tokenizer will ignore all white space and comments. The tokenizer
+will also translate any escape sequences found in a string.
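
A tokenizer with the behavior described above might look like the sketch below. It reads from an in-memory string rather than a file descriptor to stay short. The type and function names are hypothetical, and two details are assumptions the text does not state: that `#` starts a comment running to the end of the line, and that a backslash makes the following character literal.

```c
#include <stddef.h>

/* Hypothetical names; the real DAT_OS_SR_TOKEN_TYPE values may differ. */
enum sr_token_type { SR_TOKEN_STRING, SR_TOKEN_NEWLINE, SR_TOKEN_EOF };

struct sr_token {
    enum sr_token_type type;
    char value[128];           /* field text for SR_TOKEN_STRING */
};

/* Reads the next token starting at *cursor and advances the cursor. */
void sr_token_next(const char **cursor, struct sr_token *tok)
{
    const char *p = *cursor;
    size_t n = 0;

    /* Skip blanks and comments, but stop at a newline: it is a token. */
    for (;;) {
        while (*p == ' ' || *p == '\t' || *p == '\r')
            p++;
        if (*p != '#')
            break;
        while (*p != '\0' && *p != '\n')
            p++;
    }

    tok->value[0] = '\0';
    if (*p == '\0') {
        tok->type = SR_TOKEN_EOF;
    } else if (*p == '\n') {
        tok->type = SR_TOKEN_NEWLINE;
        p++;
    } else {
        tok->type = SR_TOKEN_STRING;
        while (*p != '\0' && *p != ' ' && *p != '\t' &&
               *p != '\r' && *p != '\n') {
            if (*p == '\\' && p[1] != '\0')
                p++;                       /* take the escaped character */
            if (n + 1 < sizeof tok->value) /* silently truncate long fields */
                tok->value[n++] = *p;
            p++;
        }
        tok->value[n] = '\0';
    }
    *cursor = p;
}
```

Because the newline is returned as its own token, a caller can tell where one registry entry ends and resynchronize after a malformed line.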
+
+Algorithm: dat_os_sr_load
+    Input: n/a
+   Output: DAT_RETURN
+{
+    if DAT_CONF environment variable is set
+     static_registry_file = contents of DAT_CONF
+    else
+     static_registry_file = /etc/dat.conf
+
+    sr_fd = dat_os_open(static_registry_file)
+
+    forever
+    {
+        initialize DAT_OS_SR_ENTRY entry
+
+        do        
+        {
+            // discard blank lines
+            dat_os_token_next(sr_fd, &token)
+        } while token is newline
+
+        if token type is EOF then break // all done
+        // else the token must be a string
+        
+        entry.ia_name = token.value
+
+        dat_os_token_next(sr_fd, &token)
+
+        if token type is EOF then break // all done
+        else if token type is not string then 
+        {
+            // an error has occurred
+            dat_os_token_sync(sr_fd)
+            continue
+        }
+        else if ( dat_os_convert_api(token.value, &entry.api) fails )
+        {
+            // an error has occurred
+            dat_os_token_sync(sr_fd)
+            continue
+        }
+
+        dat_os_token_next(sr_fd, &token)
+
+        if token type is EOF then break // all done
+        else if token type is not string then 
+        {
+            // an error has occurred
+            dat_os_token_sync(sr_fd)
+            continue
+        }
+        else if ( dat_os_convert_thread_safety(token.value, &entry.thread_safety) fails )
+        {
+            // an error has occurred
+            dat_os_token_sync(sr_fd)
+            continue
+        }
+
+        dat_os_token_next(sr_fd, &token)
+
+        if token type is EOF then break // all done
+        else if token type is not string then 
+        {
+            // an error has occurred
+            dat_os_token_sync(sr_fd)
+            continue
+        }
+        else if ( dat_os_convert_default(token.value, &entry.default) fails )
+        {
+            // an error has occurred
+            dat_os_token_sync(sr_fd)
+            continue
+        }
+
+        dat_os_token_next(sr_fd, &token)
+
+        if token type is EOF then break // all done
+        else if token type is not string then 
+        {
+            // an error has occurred
+            dat_os_token_sync(sr_fd)
+            continue
+        }
+
+        entry.lib_path = token.value
+
+        dat_os_token_next(sr_fd, &token)
+
+        if token type is EOF then break // all done
+        else if token type is not string then 
+        {
+            // an error has occurred
+            dat_os_token_sync(sr_fd)
+            continue
+        }
+        else if ( dat_os_convert_driver_version(token.value, &entry.driver_version) fails )
+        {
+            // an error has occurred
+            dat_os_token_sync(sr_fd)
+            continue
+        }
+
+        dat_os_token_next(sr_fd, &token)
+
+        if token type is EOF then break // all done
+        else if token type is not string then 
+        {
+            // an error has occurred
+            dat_os_token_sync(sr_fd)
+            continue
+        }
+
+        entry.ia_params = token.value
+
+        dat_os_token_next(sr_fd, &token)
+
+        if token type is EOF then break // all done
+        else if token type is not newline then 
+        {
+            // an error has occurred
+            dat_os_token_sync(sr_fd)
+            continue
+        }
+        
+        if ( dat_os_sr_is_valid(entry) )
+        {
+            dat_registry_insert(entry)
+        }
+    }
+
+    dat_os_close(sr_fd)
+}
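
Two parts of dat_os_sr_load can be illustrated compactly: choosing the registry file and rejecting a malformed entry. The sketch below is a simplification under stated assumptions. It splits a line on white space with sscanf(3) instead of using the tokenizer, so it does not handle escape sequences or the per-field conversion functions, and the `sr_entry` layout and function names are hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical, string-only version of the static registry entry. */
struct sr_entry {
    char ia_name[64], api_version[16], thread_safety[32], is_default[32];
    char lib_path[128], driver_version[16], ia_params[64];
};

/* Picks the registry file: the DAT_CONF value if set, else /etc/dat.conf. */
const char *sr_choose_path(const char *env_value)
{
    return (env_value != NULL) ? env_value : "/etc/dat.conf";
}

/* Convenience wrapper that consults the environment. */
const char *sr_config_path(void)
{
    return sr_choose_path(getenv("DAT_CONF"));
}

/* Parses one line into an entry. Returns 0 when exactly seven fields are
 * present, -1 otherwise, in which case the caller skips to the next line. */
int sr_parse_line(const char *line, struct sr_entry *e)
{
    char extra;
    int n = sscanf(line, "%63s %15s %31s %31s %127s %15s %63s %c",
                   e->ia_name, e->api_version, e->thread_safety,
                   e->is_default, e->lib_path, e->driver_version,
                   e->ia_params, &extra);
    return (n == 7) ? 0 : -1;
}
```

Returning an error for a bad line, rather than aborting, mirrors the "sync and continue" behavior of the pseudo-code: one malformed entry does not prevent later entries from being loaded.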
+
+Algorithm: dat_os_library_load
+    Input: IN  const DAT_NAME_PTR       *library_path
+           OUT DAT_LIBRARY_HANDLE       *library_handle
+   Output: DAT_RETURN
+{
+    *library_handle = dlopen(library_path);
+}
+
+Algorithm: dat_os_library_unload
+    Input: IN  const DAT_LIBRARY_HANDLE library_handle
+   Output: DAT_RETURN
+{
+    dlclose(library_handle)
+}
diff --git a/doc/dapl_shared_memory_design.txt b/doc/dapl_shared_memory_design.txt

new file mode 100644 (file)

index 0000000..f4f3524
--- /dev/null
+++ b/doc/dapl_shared_memory_design.txt
@@ -0,0 +1,876 @@
+#######################################################################
+#                                                                     #
+# DAPL Shared Memory Design                                           #
+#                                                                     #
+# James Lentini                                                       #
+# jlentini at users.sourceforge.net                                   #
+#                                                                     #
+# Created 09/17/2002                                                  #
+# Updated 01/21/2005                                                  #
+# Version 0.04                                                        #
+#                                                                     #
+#######################################################################
+
+
+Contents
+--------
+0. Introduction
+1. Referenced Documents
+2. Requirements
+3. Interface
+4. Implementation Options
+
+
+Introduction
+------------
+
+This document describes the design of shared memory registration for 
+the DAPL reference implementation (RI).
+
+Implementing shared memory support completely within the DAPL RI 
+would not be an ideal solution. A more robust and efficient
+implementation can be achieved by HCA vendors that integrate a DAT
+provider into their software stack. Therefore the RI will not contain
+an implementation of this feature. 
+
+
+Referenced Documents
+--------------------
+
+kDAPL: Kernel Direct Access Programming Library, Version 1.2.
+uDAPL: User Direct Access Programming Library, Version 1.2.
+Available in the DAPL SourceForge repository at
+[doc/api/kDAPL_spec.pdf] and [doc/api/uDAPL_spec.pdf]. Collectively
+referred to in this document as the "DAT Specification". 
+
+InfiniBand Access Application Programming Interface Specification,
+Version 1.2, 4/15/2002. Available in the DAPL SourceForge repository
+at [doc/api/IBM_access_api.pdf]. Referred to in this document as the
+"IBM Access API Specification". 
+
+Mellanox IB-Verbs API (VAPI) Mellanox Software Programmers Interface
+for InfiniBand Verbs. Available in the DAPL SourceForge repository
+at [doc/api/MellanoxVerbsAPI.pdf]. Referred to in this document as the
+"VAPI API Specification".
+
+InfiniBand Architecture Specification, Volumes 1 and 2, Release
+1.2. Available from http://www.infinibandta.org/
+Referred to in this document as the "Infiniband Specification".
+
+
+Requirements
+------------
+
+The DAT shared memory model can be characterized as a peer-to-peer
+model since the order in which consumers register a region is not
+dictated by the programming interface.
+
+The DAT API function used to register shared memory is:
+
+DAT_RETURN
+dat_lmr_create (
+        IN      DAT_IA_HANDLE           ia_handle,
+        IN      DAT_MEM_TYPE            mem_type,
+        IN      DAT_REGION_DESCRIPTION  region_description,
+        IN      DAT_VLEN                length,
+        IN      DAT_PZ_HANDLE           pz_handle,
+        IN      DAT_MEM_PRIV_FLAGS      mem_privileges,
+        OUT     DAT_LMR_HANDLE          *lmr_handle,
+        OUT     DAT_LMR_CONTEXT         *lmr_context,
+        OUT     DAT_RMR_CONTEXT         *rmr_context,
+        OUT     DAT_VLEN                *registered_length,
+        OUT     DAT_VADDR               *registered_address );
+
+where a DAT_REGION_DESCRIPTION is defined as:
+
+typedef union dat_region_description
+{
+        DAT_PVOID                   for_va;
+        DAT_LMR_HANDLE              for_lmr_handle;
+        DAT_SHARED_MEMORY           for_shared_memory;
+} DAT_REGION_DESCRIPTION;
+
+In the case of a shared memory registration the DAT consumer will set
+the DAT_MEM_TYPE flag to DAT_MEM_TYPE_SHARED_VIRTUAL and place a
+cookie in the DAT_REGION_DESCRIPTION union's DAT_SHARED_MEMORY
+member. The DAT_SHARED_MEMORY type is defined as follows:
+
+typedef struct dat_shared_memory
+{
+        DAT_PVOID                   virtual_address;
+        DAT_LMR_COOKIE              shared_memory_id;
+} DAT_SHARED_MEMORY;
+
+Unlike the DAT peer-to-peer model, the Infiniband shared memory model
+requires a master-slave relationship. A memory region must first be
+registered using the Register Memory Region verb with subsequent
+registrations made using the Register Shared Memory Region verb. 
+
+The latter is implemented in the IBM OS Access API as:
+
+ib_int32_t 
+ib_mr_shared_register_us( 
+        ib_hca_handle_t         hca_handle,
+        ib_mr_handle_t          *mr_handle, /* IN-OUT: could be changed */
+        ib_pd_handle_t          pd_handle, /* IN */
+        ib_uint32_t             access_control, /* IN */
+        ib_uint32_t             *l_key, /* OUT */
+        ib_uint32_t             *r_key, /* OUT: if remote access needed */
+        ib_uint8_t              **va ); /* IN-OUT: virt. addr. to register */
+
+The important parameter is the memory region handle which must be the
+same as an already registered region.
+
+Two requirements are implied by this difference between the DAT and 
+Infiniband models. First, DAPL implementations need a way to determine
+the first registration of a shared region. Second, implementations must
+map DAT_LMR_COOKIE values to memory region handles both within and
+across processes. To satisfy the above requirements DAPL must maintain
+this information in a system wide database.
+
+The difficulty of implementing such a database at the DAT provider
+level is the reason the RI's shared memory code is meant to be
+temporary. Such a database is much better suited as part of the HCA
+vendor's software stack, specifically as part of their HCA driver. 
+
+If DAPL were based on a master-slave model like InfiniBand, the
+implementation of shared memory would be straightforward.
+Specifically, the complexity is a result of the consumer being
+responsible for specifying the DAT_LMR_COOKIE values. If the DAPL
+spec. were changed to allow the provider and not the consumer to
+specify the DAT_LMR_COOKIE value, the implementation of this feature
+would be greatly simplified. Since the DAPL API already requires
+consumers to communicate the DAT_LMR_COOKIE values between processes,
+such a change places minimal additional requirements on the
+consumer. The dapl_lmr_query call could easily be adapted to allow the
+consumer to query the provider for a given LMR's DAT_LMR_COOKIE
+value. The only spec changes needed would be to add a DAT_LMR_COOKIE
+member to the DAT_LMR_PARAM structure and a DAT_LMR_FIELD_LMR_COOKIE
+constant to the DAT_LMR_PARAM_MASK enumeration. A provider could then
+store the given LMR's memory region handle in this value, greatly
+simplifying the implementation of shared memory in DAPL.  
+
+
+Interface
+---------
+
+To allow the database implementation to easily change, the RI would use
+a well defined interface between the memory subsystem and the
+database. Conceptually the database would contain a single table with
+the following columns:
+
+[ LMR Cookie ][ MR Handle ][ Reference Count ][ Initialized ]
+
+where the LMR Cookie column is the primary key.
+
+The following functions would be used to access the database:
+
+DAT_RETURN
+dapls_mrdb_init (
+        void );
+
+ Called by dapl_init(.) to perform any necessary database
+ initialization. 
+
+DAT_RETURN
+dapls_mrdb_exit (
+        void );
+
+ Called by dapl_fini(.) to perform any necessary database cleanup.
+
+DAT_RETURN
+dapls_mrdb_record_insert (
+        IN  DAPL_LMR_COOKIE     cookie );
+
+ If there is no record for the specified cookie, an empty record is
+ added with a reference count of 1 and the initialized field is set to
+ false. If a record already exists, the function returns an error.
+
+DAT_RETURN 
+dapls_mrdb_record_update (
+        IN  DAPL_LMR_COOKIE     cookie, 
+        IN  ib_mr_handle_t      mr_handle );
+
+ If there is a record for the specified cookie, the MR handle field is
+ set to the specified mr_handle value and the initialized field is set
+ to true. Otherwise an error is returned.
+
+DAT_RETURN
+dapls_mrdb_record_query (
+        IN  DAPL_LMR_COOKIE     cookie,
+        OUT ib_mr_handle_t      *mr_handle );
+
+ If there is a record for the specified cookie and the initialized
+ field is true, the MR handle field is returned and the reference
+ count field is incremented. Otherwise an error is returned. 
+
+DAT_RETURN
+dapls_mrdb_record_dec (
+        IN  DAPL_LMR_COOKIE     cookie );
+
+ If there is a record for the specified cookie, the reference count
+ field is decremented, and if the reference count is zero after the
+ decrement, the record is removed from the database. If there is no
+ record for the specified cookie, an error is returned. 
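
To make the record semantics concrete, here is a minimal single-process model of the table in C. It is only a sketch: the `mrdb_*` names, the fixed-size array, and the integer typedefs standing in for the cookie and MR handle types are assumptions, and a real database would also need the system wide sharing and locking discussed later in this document.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the LMR cookie and MR handle types. */
typedef uint64_t lmr_cookie_t;
typedef uint32_t mr_handle_t;

struct mrdb_record {
    lmr_cookie_t cookie;       /* primary key */
    mr_handle_t  mr_handle;
    unsigned     ref_count;    /* 0 marks a free slot */
    int          initialized;
};

#define MRDB_MAX 32
static struct mrdb_record mrdb[MRDB_MAX];

static struct mrdb_record *mrdb_find(lmr_cookie_t cookie)
{
    for (size_t i = 0; i < MRDB_MAX; i++)
        if (mrdb[i].ref_count > 0 && mrdb[i].cookie == cookie)
            return &mrdb[i];
    return NULL;
}

int mrdb_record_insert(lmr_cookie_t cookie)
{
    if (mrdb_find(cookie) != NULL)
        return -1;                         /* record already exists */
    for (size_t i = 0; i < MRDB_MAX; i++) {
        if (mrdb[i].ref_count == 0) {
            mrdb[i] = (struct mrdb_record){ cookie, 0, 1, 0 };
            return 0;
        }
    }
    return -1;                             /* table full */
}

int mrdb_record_update(lmr_cookie_t cookie, mr_handle_t mr_handle)
{
    struct mrdb_record *r = mrdb_find(cookie);
    if (r == NULL)
        return -1;
    r->mr_handle = mr_handle;
    r->initialized = 1;
    return 0;
}

int mrdb_record_query(lmr_cookie_t cookie, mr_handle_t *mr_handle)
{
    struct mrdb_record *r = mrdb_find(cookie);
    if (r == NULL || !r->initialized)
        return -1;
    *mr_handle = r->mr_handle;
    r->ref_count++;
    return 0;
}

int mrdb_record_dec(lmr_cookie_t cookie)
{
    struct mrdb_record *r = mrdb_find(cookie);
    if (r == NULL)
        return -1;
    r->ref_count--;                        /* reaching 0 frees the slot */
    return 0;
}
```

A query on a record that has been inserted but not yet updated fails, which is how a second registrant can tell that the first registration is still in progress.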
+
+The generic algorithms for creating and destroying a shared memory
+region are:
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: CreateShared
+ Inputs: 
+         ia_handle
+         pz_handle
+         address
+         length
+         lmr_cookie
+         privileges
+ Outputs:
+         lmr_handle
+         lmr_context
+         registered_address
+         registered_length
+
+forever 
+{
+   if dapls_mrdb_record_insert(lmr_cookie) is successful
+   {
+      if dapl_lmr_create_virtual is not successful 
+      {
+         dapls_mrdb_record_dec(lmr_cookie)
+         return error
+      }
+      else if dapls_mrdb_record_update(lmr_cookie, lmr->mr_handle) is not successful
+      {
+         dapls_mrdb_record_dec(lmr_cookie)
+         return error
+      }
+      else break
+   }
+   else if dapls_mrdb_record_query(lmr_cookie, mr_handle) is successful
+   {
+      if ib_mr_shared_register_us is not successful
+      {
+         dapls_mrdb_record_dec(lmr_cookie)
+         return error
+      }
+      else break
+   }
+}
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: FreeShared
+ Inputs: 
+         lmr
+ Outputs:
+
+if dapls_ib_mr_deregister(lmr) is successful
+   dapls_mrdb_record_dec(lmr->cookie)
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
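
The CreateShared loop can be exercised with a small C model. Everything here apart from the control flow is an assumption made for illustration: the one-record database, the stubbed registration calls, and the `fail_register` switch are hypothetical. The sketch also bounds the number of retries with `max_tries`, which the pseudo-code above does not do (it loops forever).

```c
#include <stdint.h>

/* Hypothetical one-record database standing in for the dapls_mrdb_* calls. */
static struct { uint64_t cookie; int handle; unsigned refs; int init; } rec;

static int db_insert(uint64_t c)
{
    if (rec.refs > 0)
        return -1;             /* record exists (or the table is full) */
    rec.cookie = c; rec.handle = 0; rec.refs = 1; rec.init = 0;
    return 0;
}

static int db_update(uint64_t c, int h)
{
    if (rec.refs == 0 || rec.cookie != c)
        return -1;
    rec.handle = h; rec.init = 1;
    return 0;
}

static int db_query(uint64_t c, int *h)
{
    if (rec.refs == 0 || rec.cookie != c || !rec.init)
        return -1;
    *h = rec.handle; rec.refs++;
    return 0;
}

static void db_dec(uint64_t c)
{
    if (rec.refs > 0 && rec.cookie == c)
        rec.refs--;
}

/* Stubs for the two registration paths; fail_register forces an error. */
static int fail_register;
static int register_first(int *h)
{
    if (fail_register) return -1;
    *h = 1234;
    return 0;
}
static int register_shared(int *h) { (void)h; return fail_register ? -1 : 0; }

/* Returns 0 on success, -1 on error or when max_tries is exhausted. */
int create_shared(uint64_t cookie, int *mr_handle, int max_tries)
{
    while (max_tries-- > 0) {
        if (db_insert(cookie) == 0) {
            /* First registrant: register, then publish the handle. */
            if (register_first(mr_handle) != 0 ||
                db_update(cookie, *mr_handle) != 0) {
                db_dec(cookie);
                return -1;
            }
            return 0;
        }
        if (db_query(cookie, mr_handle) == 0) {
            /* Later registrant: share the already registered region. */
            if (register_shared(mr_handle) != 0) {
                db_dec(cookie);
                return -1;
            }
            return 0;
        }
        /* The record exists but holds no handle yet: another registrant
         * is between insert and update, so try again. */
    }
    return -1;
}
```

The retry branch is the reason for the loop: a second registrant that arrives between the first registrant's insert and update can neither insert nor query, and must try again.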
+
+
+Implementation Options
+----------------------
+
+As described above the crucial functionality needed to implement
+shared memory support is a system wide database for mapping LMR
+cookies to memory region handles. The following designs represent some
+of the options for implementing such a database. Adding a database
+increases the complexity of DAPL from both an implementor's and a user's
+perspective. These designs should be evaluated on the degree to which
+they minimize the additional complexity while still providing a robust
+solution. 
+
+
+ File System Database
+ --------------------
+
+Employing a database that is already part of the system would be
+ideal. One option on Linux is to use the file system. An area of the
+file system could be set aside for the creation of files to represent
+each LMR cookie. The area of the file system could be specified
+through a hard coded value, an environment variable, or a
+configuration file. A good candidate would be a DAPL subdirectory of
+/tmp. 
+
+Exclusive file creation is available in Linux through the open(2)
+system call with the O_CREAT and O_EXCL flags. The standard I/O
+interface (fopen(3), etc.) does not support this feature, making
+porting difficult. However, porting to other environments is not a
+goal of this design since the entire scheme is only a temporary
+solution. 
+
+Determining when to delete the files is a difficult problem. A
+reference count is required to properly remove a file once all the
+memory regions it represents are deregistered. The synchronization
+mechanism necessary for maintaining the reference count is not easily
+implemented. As an alternative, a script could be provided to clean up
+the database by removing all the files. The script would need to be
+run before any DAPL consumers were started to ensure a clean
+database. The disadvantage of using a script is that no DAPL instances
+can be running when it is used. Another option would be to store the
+process ID (PID) of the process that created the file as part of the
+file's contents. Upon finding a record for a given LMR cookie value, a
+DAPL instance could determine if there was a process with the same PID
+in the system. To accomplish this the kill(2) system call could be
+used (e.g. kill(pid, 0)). This method of validating the record assumes
+that all DAPL instances can signal one another and that the PID values
+do not wrap before the check is made. 
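
The PID check could be written as below. This is a hedged sketch with a hypothetical function name. It treats EPERM as "the process exists", since that error means the target was found but the caller may not signal it, which relaxes the assumption that all DAPL instances can signal one another.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/* Returns 1 if a process with this PID currently exists, 0 if not. */
int process_exists(pid_t pid)
{
    if (kill(pid, 0) == 0)     /* signal 0: error checking only */
        return 1;
    return errno == EPERM;     /* exists, but we may not signal it */
}
```

The check cannot tell the original creator from an unrelated process that has since been given the same PID, which is the wrap-around caveat noted above.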
+
+Another difficulty with this solution is choosing an accessible
+portion of the file system. The area must have permissions that allow
+all processes using DAPL to access and modify its contents. System
+administrators are typically reluctant to allow areas without any
+access controls. Typically such areas are on a dedicated file system
+of a minimal size to ensure that malicious or malfunctioning software
+does not monopolize the system's storage capacity. Since very little
+information will be stored in each file it is unlikely that DAPL would
+need a large amount of storage space even if a large number of shared
+memory regions were in use. However since a file is needed for each
+shared region, a large number of shared registrations may lead to the
+consumption of all a file system's inodes. Again since this solution
+is meant to be only temporary this constraint may be acceptable.
+
+There is also the possibility for database corruption should a process
+crash or deadlock at an inopportune time. If a process creates file x
+and then crashes, all other processes waiting for the memory handle to
+be written to x will fail. 
+
+The database interface could be implemented as follows:
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_init
+ Inputs: 
+
+ Outputs:
+
+return success
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_exit
+ Inputs: 
+
+ Outputs:
+
+return success
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_record_insert 
+ Inputs: 
+         cookie
+ Outputs:
+
+file_name = convert cookie to valid file name
+
+fd = exclusively create file_name
+if fd is invalid
+   return failure
+
+if close fd fails
+   return failure
+
+return success
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_record_update 
+ Inputs: 
+         cookie
+         mr_handle
+ Outputs:
+ 
+file_name = convert cookie to valid file name
+
+fd = open file_name
+if fd is invalid
+   return failure
+
+if write mr_handle to file_name fails
+   return failure
+
+if close fd fails
+   return failure
+
+return success
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_record_query 
+ Inputs: 
+         cookie
+
+ Outputs:
+         mr_handle
+
+file_name = convert cookie to valid file name
+
+fd = open file_name
+if fd is invalid
+   return failure
+
+if read mr_handle from file_name fails
+   return failure
+
+if close fd fails
+   return failure
+
+return success
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_record_dec 
+ Inputs: 
+         cookie
+ Outputs:
+
+return success
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
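
The update and query functions above might be sketched in C as follows. The names and the 32-bit handle type are assumptions, and the handle is stored as raw bytes, which is only meaningful between processes on the same machine. Unlike the pseudo-code, the sketch closes the descriptor on every path.

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Writes the handle into an existing record file. Returns 0 or -1. */
int record_file_update(const char *file_name, uint32_t mr_handle)
{
    int fd = open(file_name, O_WRONLY);    /* record must already exist */
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, &mr_handle, sizeof mr_handle);
    if (close(fd) != 0 || n != (ssize_t)sizeof mr_handle)
        return -1;
    return 0;
}

/* Reads the handle back. Returns 0 or -1. */
int record_file_query(const char *file_name, uint32_t *mr_handle)
{
    int fd = open(file_name, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, mr_handle, sizeof *mr_handle);
    close(fd);
    /* A short read means the creator has not written the handle yet. */
    return (n == (ssize_t)sizeof *mr_handle) ? 0 : -1;
}
```

An empty file plays the role of the "initialized = false" state: the record exists, but a query fails until the creator has written the handle.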
+
+
+ Daemon Database
+ ---------------
+
+The database could be maintained by a separate daemon process. 
+The DAPL instances would act as clients of the daemon server and
+communicate with the daemon through the various IPC mechanisms
+available on Linux: Unix sockets, TCP/IP sockets, System V message
+queues, FIFOs, or RPCs.
+
+As with the file system based database, process crashes can potentially
+cause database corruption.
+
+While the portability of this implementation will depend on the chosen
+IPC mechanism, this approach will be at best Unix centric and possibly
+Linux specific. 
+
+The database interface could be implemented as follows:
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_init
+ Inputs: 
+
+ Outputs:
+
+initialize IPC mechanism
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_exit
+ Inputs: 
+
+ Outputs:
+
+shutdown IPC mechanism
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_record_insert 
+ Inputs: 
+         cookie
+ Outputs:
+
+if send insert message for cookie fails
+   return error 
+
+if receive insert response message fails
+   return error
+
+if insert success
+   return success
+else return error
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_record_update 
+ Inputs: 
+         cookie
+         mr_handle
+
+ Outputs:
+ 
+if send update message for cookie and mr_handle fails
+   return error
+else return success
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_record_query 
+ Inputs: 
+         cookie
+
+ Outputs:
+         mr_handle
+
+if send query message for cookie fails
+   return error
+
+else if receive query response message with mr_handle fails
+   return error 
+
+else return success
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_record_dec 
+ Inputs: 
+         cookie
+ Outputs:
+
+if send decrement message for cookie fails
+   return error
+else return success
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+
+ Shared Memory Database
+ ----------------------
+
+The database could be maintained in an area of memory shared by all
+DAPL instances running on a system. Linux supports the System V shared
+memory functions shmget(2), shmctl(2), shmat(2), and shmdt(2). A hard
+coded key_t value could be used so that each DAPL instance attaches to
+the same piece of shared memory. The size of the database would be
+constrained by the size of the shared memory region. Synchronization
+could be achieved by using atomic operations targeting memory in the
+shared region.
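
One way to build the lock from atomic operations is a test-and-set spinlock, sketched here with C11 <stdatomic.h>. The `mrdb_lock` names are hypothetical, and the sketch assumes the flag lives inside the shared region, that the platform's atomic_flag works across processes, and that it starts out clear.

```c
#include <stdatomic.h>

/* Would be placed at a fixed offset inside the shared region. */
struct mrdb_lock {
    atomic_flag held;
};

/* Spins until the lock is obtained. */
void mrdb_lock_acquire(struct mrdb_lock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;                      /* busy-wait while another holder exists */
}

/* Single attempt: returns 1 if the lock was obtained, 0 if it was held. */
int mrdb_lock_try(struct mrdb_lock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

void mrdb_lock_release(struct mrdb_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

Nothing releases the flag if its holder dies, so a process that crashes while holding the lock leaves every other DAPL instance spinning.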
+
+Such a design would suffer from the corruption problems described
+above. If a process crashed there would be no easy way to clean up its
+locks and roll back the database to a consistent state.
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_init
+ Inputs: 
+
+ Outputs:
+
+attach shared region
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_exit
+ Inputs: 
+
+ Outputs:
+
+detach shared region
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_record_insert 
+ Inputs: 
+         cookie
+ Outputs:
+
+lock database
+
+if db does not contain cookie 
+   add record for cookie
+
+unlock database
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_record_update 
+ Inputs: 
+         cookie
+         mr_handle
+ Outputs:
+ 
+lock database
+
+if db contains cookie 
+   update record's mr_handle
+
+unlock database
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_record_query 
+ Inputs: 
+         cookie
+
+ Outputs:
+         mr_handle
+
+lock database
+
+if db contains cookie 
+   set mr_handle to record's value
+   increment record's reference count
+
+unlock database
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_record_dec 
+ Inputs: 
+         cookie
+ Outputs:
+
+lock database
+
+if db contains cookie 
+   decrement record's reference count
+
+   if reference count is 0
+      remove record
+
+unlock database
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+
+ Kernel Module Database
+ ----------------------
+
+If the DAT library were integrated with an HCA vendor's software
+stack, the database could be managed by the HCA driver. Placing the
+database in the kernel would alleviate the synchronization problems
+posed by multiple processes. Since memory registration operations
+already involve a transition into the kernel, no extra overhead would
+be incurred by this design.
+
+The RI could include a kernel module with this functionality as a
+temporary solution. The module could identify itself as a character
+device driver and communicate with user level processes through an
+ioctl(2). The driver could also create an entry in the proc file
+system to display the database's contents for diagnostic purposes.
+
+A major benefit of a kernel based implementation is that the database
+can remain consistent even in the presence of application
+errors. Since DAPL instances communicate with the driver by means of
+ioctl(2) calls on a file, the driver can arrange to be informed
+when the file is closed and perform any necessary actions. The driver
+is guaranteed to be notified of a close regardless of the manner in
+which the process exits. 
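
The cleanup-on-close idea can be modeled in user space without any kernel interfaces. The sketch below is not driver code. It only mimics the bookkeeping: each open file keeps a list of the records it references, and closing the file drops those references. All names are hypothetical, and cookies are assumed to be small integers so that an array can stand in for the dictionary.

```c
#define MAX_COOKIES 8          /* cookies are small integers in this model */
#define MAX_REFS    16

static unsigned ref_count[MAX_COOKIES];    /* stands in for the dictionary */

struct dev_data {                          /* state kept per open file */
    int cookies[MAX_REFS];
    int n;
};

/* Records that this open file holds a reference to the cookie. */
int record_ref(struct dev_data *d, int cookie)
{
    if (cookie < 0 || cookie >= MAX_COOKIES || d->n == MAX_REFS)
        return -1;
    d->cookies[d->n++] = cookie;
    ref_count[cookie]++;
    return 0;
}

/* Drops one reference held by this open file. */
int record_dec(struct dev_data *d, int cookie)
{
    for (int i = 0; i < d->n; i++) {
        if (d->cookies[i] == cookie) {
            d->n--;
            d->cookies[i] = d->cookies[d->n];  /* fill the gap */
            ref_count[cookie]--;
            return 0;
        }
    }
    return -1;
}

/* Called when the file is closed, including on abnormal process exit. */
void device_close(struct dev_data *d)
{
    while (d->n > 0)
        record_dec(d, d->cookies[d->n - 1]);
}
```

Because the per-file list is released on close, a process that exits without deregistering its regions still has its reference counts returned to the database.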
+
+The database could be implemented as a dictionary using the LMR cookie
+values as keys. 
+
+The following pseudo-code describes the functions needed by the kernel
+module and the database interface.
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: KernelModuleInit
+ Inputs: 
+
+ Outputs:
+
+dictionary = create_dictionary()
+create_proc_entry()
+create_character_device_entry()
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: KernelModuleExit
+ Inputs: 
+
+ Outputs:
+
+remove_character_device_entry()
+remove_proc_entry()
+free_dictionary(dictionary)
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: DeviceOpen
+ Inputs: 
+         file
+
+ Outputs:
+
+dev_data = allocate device data
+
+file->private_data = dev_data
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: DeviceClose
+ Inputs: 
+         file
+
+ Outputs:
+
+dev_data = file->private_data 
+
+for each record in dev_data
+{
+    RecordDecIoctl
+}
+
+deallocate dev_data
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: RecordInsertIoctl
+ Inputs: 
+         file
+         cookie
+
+ Outputs:
+
+lock dictionary
+
+if cookie is not in dictionary 
+   insert cookie into dictionary
+   
+
+unlock dictionary
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: RecordUpdateIoctl
+ Inputs: 
+         file
+         cookie
+         mr_handle
+
+ Outputs:
+
+dev_data = file->private_data
+
+lock dictionary
+
+if cookie is in dictionary 
+   add record reference to dev_data
+   update mr_handle 
+
+unlock dictionary
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: RecordQueryIoctl
+ Inputs: 
+         file
+         cookie
+
+ Outputs:
+        mr_handle
+
+dev_data = file->private_data
+
+lock dictionary
+
+if cookie is in dictionary 
+   add record reference to dev_data
+   retrieve mr_handle
+
+unlock dictionary
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: RecordDecIoctl
+ Inputs: 
+         file
+         cookie
+
+ Outputs:
+
+dev_data = file->private_data
+remove record reference from dev_data
+
+lock dictionary
+
+if cookie is in dictionary 
+   decrement reference count
+   if reference count is 0
+      remove record
+
+unlock dictionary
+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_init
+ Inputs: 
+
+ Outputs:
+
+fd = open device file
+
+if fd is invalid 
+   return error
+else 
+   return success
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_exit
+ Inputs: 
+
+ Outputs:
+
+close fd
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_record_insert 
+ Inputs: 
+         cookie
+ Outputs:
+
+ioctl on fd
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_record_update 
+ Inputs: 
+         cookie
+         mr_handle
+ Outputs:
+ 
+ioctl on fd
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_record_query 
+ Inputs: 
+         cookie
+
+ Outputs:
+         mr_handle
+
+ioctl on fd
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_record_dec 
+ Inputs: 
+         cookie
+ Outputs:
+
+ioctl on fd
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
diff --git a/doc/dapl_vendor_specific_changes.txt b/doc/dapl_vendor_specific_changes.txt

new file mode 100644 (file)

index 0000000..19bbbf8
--- /dev/null
+++ b/doc/dapl_vendor_specific_changes.txt
@@ -0,0 +1,394 @@
+       Suggested Vendor-Specific Changes v. 0.92
+        -----------------------------------------
+
+=================
+Table of Contents
+=================
+
+* Table of Contents
+* Introduction
+* Referenced documents
+* Functionality Changes
+       + Missing Functionality
+               + dat_evd_resize
+               + Ordering guarantees on connect/disconnect.
+               + Shared memory
+               + dat_cr_handoff
+* Performance optimizations
+       + Reduction of context switches
+         [Many interrelated optimizations]
+       + Reducing copying of data
+               + Avoidance of s/g list copy on posting
+               + Avoidance of event data copy from CQ to EVD
+       + Elimination of locks
+       + Eliminating subroutine calls          
+
+
+============
+Introduction
+============
+
+This document is a list of functionality enhancements and
+optimizations hardware vendors porting uDAPL may want to consider as
+part of their port.  The functionality enhancements mentioned in this
+document are situations in which HCA Vendors, with their access to
+driver and verb-level source code, and their reduced portability
+concerns, are in a much better position than the reference
+implementation to implement portions of the uDAPL v. 1.0
+specification.  (Additional areas in which the reference
+implementation, because of a lack of time or resources, did not fully
+implement the uDAPL 1.0 specification are not addressed in this file;
+see the file doc/dapl_unimplemented_functionality.txt, forthcoming).
+Vendors should be guided in their implementation of these
+functionality enhancements by their customers' need for the features
+involved.
+
+The optimizations suggested in this document have been identified by
+the uDAPL Reference Implementation team as areas in which performance
+may be improved by "breaching" the IB Verbs API boundary.  They are
+inappropriate for the reference implementation (which has portability
+as one of its primary goals) but may be appropriate for a HCA-specific
+port of uDAPL.  Note that no expected performance gain is attached to
+the suggested optimizations.  This is intentional.  Vendors should be
+guided in their performance improvements by performance evaluations
+done in the context of a representative workload, and the expected
+benefit from a particular optimization weighed against the cost in
+code complexity and scheduling, before the improvement is implemented.
+This document is intended to seed that process; it is not intended to
+be a roadmap for that process.
+
+We divide functionality changes into two categories:
+       * Areas in which functionality is lacking in the reference
+         implementation. 
+       * Areas in which the functionality is present in the reference
+         implementation, but needs improvement.
+
+We divide performance improvements into four types:
+       * Reducing context switches
+       * Reducing copying of data (*)
+       * Eliminating locks
+       * Eliminating subroutine calls
+
+(*) Note that the data referred to in "reducing copying of data" is
+the meta data describing an operation (e.g. scatter/gather list or
+event information), not the actual data to be transferred.  No data
+transfer copies are required within the uDAPL reference
+implementation.
+
+====================
+Referenced Documents
+====================
+
+uDAPL: User Direct Access Programming Library, Version 1.0.  Published
+6/21/2002.  http://www.datcollaborative.org/uDAPL_062102.pdf.
+Referred to in this document as the "DAT Specification".
+
+InfiniBand Access Application Programming Interface Specification,
+Version 1.2, 4/15/2002.  In DAPL SourceForge repository at
+doc/api/access_api.pdf.  Referred to in this document as the "IBM
+Access API Specification".
+
+uDAPL Reference Implementation Event System Design.  In DAPL
+SourceForge repository at doc/dapl_event_design.txt.
+
+uDAPL Reference Implementation Shared Memory Design.  In DAPL
+SourceForge repository at doc/dapl_shared_memory_design.txt. 
+
+uDAPL list of unimplemented functionality.  In DAPL SourceForge
+repository at doc/dapl_unimplemented_functionality.txt (forthcoming).
+
+===========================================
+Suggested Vendor Functionality Enhancements
+===========================================
+
+Missing Functionality
+---------------------
+-- dat_evd_resize
+
+The uDAPL event system does not currently implement dat_evd_resize.
+The primary reason for this is that it is not currently possible to
+identify EVDs with the CQs that back them.  Hence uDAPL must keep a
+separate list of events, and any changes to the size of that event
+list would require careful synchronization with all users of that EVD
+(see the uDAPL Event System design for more details).  If the various
+vendor specific optimizations in this document were implemented that
+eliminated the requirement for the EVD to keep its own event list,
+dat_evd_resize might be easily implemented by a call or calls to
+ib_cq_resize. 
+
+-- Ordering guarantees on connect/disconnect.
+
+The DAPL 1.1 specification specifies that if an EVD combines event
+streams for connection events and DTO events for the same endpoint,
+there is an ordering guarantee: the connection event on the AP occurs
+before any DTO events, and the disconnection event occurs after all
+successful DTO events.  Since DTO events are provided by the IBM OS
+Access API through ib_completion_poll (in response to consumer
+request) and connection events are provided through callbacks (which
+may race with consumer requests) there is no trivial way to implement
+this functionality.  The functionality may be implemented through
+under-the-table synchronization between EVD and EP; specifically:
+       * The first time a DTO event is seen on an endpoint, if the
+         connection event has not yet arrived it is created and
+         delivered ahead of that DTO event.
+       * When a connection event is seen on an endpoint, if a
+         connection event has already been created for that endpoint
+         it is silently discarded.
+       * When a disconnection event is seen on an endpoint, it is
+         "held" until either: a) all expected DTO events for that
+         endpoint have completed, or b) a DTO marked as "flushed by
+         disconnect" is received.  At that point it is delivered.
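+
+The three rules above can be sketched as a small per-endpoint state
+machine.  This is an illustration only, not reference implementation
+code: the ep_order structure, the event log standing in for the EVD,
+and every function name are hypothetical, and locking between the
+callback and the consumer thread is omitted.  The sketch delivers the
+DTO that releases a held disconnection before the disconnection
+itself, which is one possible reading of the last rule.

```c
#include <assert.h>
#include <stddef.h>

enum ev_kind { EV_CONNECTED, EV_DTO, EV_DISCONNECTED };

#define EV_LOG_MAX 32

struct ep_order {
    int          conn_delivered;     /* connection event already handed out */
    int          disc_held;          /* disconnection seen but not delivered */
    unsigned int dtos_outstanding;   /* posted DTOs not yet completed */
    enum ev_kind log[EV_LOG_MAX];    /* stand-in for the EVD's event stream */
    size_t       log_len;
};

static void ep_deliver(struct ep_order *ep, enum ev_kind kind)
{
    assert(ep->log_len < EV_LOG_MAX);
    ep->log[ep->log_len++] = kind;
}

/* Called when the consumer posts a data transfer operation. */
void ep_post_dto(struct ep_order *ep)
{
    ep->dtos_outstanding++;
}

/* Connection callback: discard if an event was already synthesized. */
void ep_on_connect(struct ep_order *ep)
{
    if (ep->conn_delivered)
        return;
    ep->conn_delivered = 1;
    ep_deliver(ep, EV_CONNECTED);
}

/* DTO completion: synthesize the connection event if it has not
 * arrived yet, then release a held disconnection when appropriate. */
void ep_on_dto(struct ep_order *ep, int flushed_by_disconnect)
{
    if (!ep->conn_delivered) {
        ep->conn_delivered = 1;
        ep_deliver(ep, EV_CONNECTED);
    }
    ep_deliver(ep, EV_DTO);
    if (ep->dtos_outstanding > 0)
        ep->dtos_outstanding--;
    if (ep->disc_held &&
        (ep->dtos_outstanding == 0 || flushed_by_disconnect)) {
        ep->disc_held = 0;
        ep_deliver(ep, EV_DISCONNECTED);
    }
}

/* Disconnection callback: hold the event while DTOs are outstanding. */
void ep_on_disconnect(struct ep_order *ep)
{
    if (ep->dtos_outstanding == 0)
        ep_deliver(ep, EV_DISCONNECTED);
    else
        ep->disc_held = 1;
}
```

+The extra state and the check on every DTO completion are the kind of
+bookkeeping that would sit on the critical path.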
+         
+Because of the complexity and performance overhead of implementing
+this feature, the DAPL 1.1 reference implementation has chosen to take
+the second approach allowed by the 1.1 specification: disallowing
+integration of connection and data transfer events on the same EVD.
+This finesses the problem, is in accordance with the specification, and
+is more closely aligned with the ITWG IT-API currently in development,
+which only allows a single event stream type for each simple EVD.
+However, other vendors may choose to implement the functionality
+described above in order to support more integration of event streams.
+
+-- Shared memory implementation
+
+The difficulties involved in the dapl shared memory implementation are
+fully described in doc/dapl_shared_memory_design.txt.  To briefly
+recap: 
+
+The uDAPL spec describes a peer-to-peer shared memory model; all uDAPL
+instances that want to share registration resources for a section of
+memory indicate this by providing the same cookie.  No uDAPL
+instance is unique; all register their memory in the same way, and no
+communication between the instances is required.
+
+In contrast, the IB shared memory interface requires the first process
+to register the memory to do so using the standard memory registration
+verbs.  All other processes after that must use the shared memory
+registration verb, and provide to that verb the memory region handle
+returned from the initial call.  This means that the first process to
+register the memory must communicate the memory region handle it
+receives to all the other processes who wish to share this memory.
+This is a master-slave model of shared memory registration; the
+initial process (the master), is unique in its role, and it must tell
+the slaves how to register the memory after it.
+
+To translate between these two models, the uDAPL implementation
+requires some mapping between the shared cookie and the memory region
+handle.  This mapping must be exclusive and must have inserts occur
+atomically with lookups (so that only one process can set the memory
+region handle; the others retrieve it).  It must also track the
+deregistration of the shared memory, and the exiting of the processes
+registering the shared memory; when all processes have deregistered
+(possibly by exiting) it must remove the mapping from cookie to
+memory region handle.
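+
+The required semantics (the first registrant sets the handle, later
+registrants retrieve it, and the last one out removes the mapping) can
+be sketched as below.  This is an assumption-laden illustration: a
+real mapping must be shared by every process on the host, whereas
+this table lives in one process and only demonstrates the locking
+discipline.  The names shm_map_attach, shm_map_detach and
+fake_register, the integer cookie, and the mr_handle_t type are all
+hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define SHM_MAP_MAX 8                /* hypothetical capacity */

typedef unsigned long mr_handle_t;   /* stand-in for the IB memory region handle */

struct shm_entry {
    int           in_use;
    unsigned long cookie;
    mr_handle_t   handle;
    unsigned int  users;
};

static struct shm_entry shm_map[SHM_MAP_MAX];
static pthread_mutex_t shm_map_lock = PTHREAD_MUTEX_INITIALIZER;

/* Lookup and insert happen under one lock, so only one caller can ever
 * run register_first for a given cookie.
 * Returns 1 if the caller was the first registrant, 0 if it attached
 * to an existing entry, -1 if the table is full. */
int shm_map_attach(unsigned long cookie,
                   mr_handle_t (*register_first)(unsigned long cookie),
                   mr_handle_t *handle_out)
{
    struct shm_entry *hit = NULL, *free_slot = NULL;
    int rc = -1;

    pthread_mutex_lock(&shm_map_lock);
    for (size_t i = 0; i < SHM_MAP_MAX; i++) {
        if (shm_map[i].in_use && shm_map[i].cookie == cookie)
            hit = &shm_map[i];
        else if (!shm_map[i].in_use && free_slot == NULL)
            free_slot = &shm_map[i];
    }
    if (hit != NULL) {
        hit->users++;
        *handle_out = hit->handle;
        rc = 0;
    } else if (free_slot != NULL) {
        free_slot->in_use = 1;
        free_slot->cookie = cookie;
        free_slot->handle = register_first(cookie);
        free_slot->users = 1;
        *handle_out = free_slot->handle;
        rc = 1;
    }
    pthread_mutex_unlock(&shm_map_lock);
    return rc;
}

/* Drop one user; the mapping disappears with the last one. */
void shm_map_detach(unsigned long cookie)
{
    pthread_mutex_lock(&shm_map_lock);
    for (size_t i = 0; i < SHM_MAP_MAX; i++) {
        if (shm_map[i].in_use && shm_map[i].cookie == cookie) {
            assert(shm_map[i].users > 0);
            if (--shm_map[i].users == 0)
                shm_map[i].in_use = 0;
            break;
        }
    }
    pthread_mutex_unlock(&shm_map_lock);
}

/* Hypothetical registration routine used only for the example. */
mr_handle_t fake_register(unsigned long cookie)
{
    return cookie + 1000;
}
```

+Tracking process exit, which the text also requires, is not covered
+by this sketch.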
+
+This mapping must obviously be shared between all uDAPL
+implementations on a given host.  Implementing such a shared mapping
+is problematic in a pure user-space implementation (as the reference
+implementation is) but is expected to be relatively easy in vendor
+supplied uDAFS implementations, which will presumably include a
+kernel/device driver component.  For this reason, we have chosen to
+leave this functionality unimplemented in the reference implementation.
+
+-- Implementation of dat_cr_handoff
+
+Given that the change of service point involves a change in associated 
+connection qualifier, which has been advertised at the underlying 
+Verbs/driver level, it is not clear how to implement this function
+cleanly within the reference implementation.  We thus choose to defer
+it for implementation by the hardware vendors.
+
+=========================
+Performance Optimizations
+=========================
+
+
+Reduction of context switches
+-----------------------------
+Currently, three context switches are required on the standard
+uDAPL notification path.  These are:
+       * Invocation of the hardware interrupt handler in the kernel.
+         Through this method the hardware notifies the CPU of
+         completion queue entries for operations that have requested
+         notification. 
+       * Unblocking of the per-process IB provider service thread
+         blocked within the driver.  This thread returns to
+         user-space within its process, where it causes the next
+         switch.
+       * Unblocking of the user thread blocked within the uDAPL entry
+         point (dat_evd_wait() or dat_cno_wait()).
+         
+There are several reasons for the high number of context switches,
+specifically: 
+       * The fact that the IB interface delivers notifications
+         through callbacks rather than through unblocking waiting
+         threads; this does not match uDAPL's blocking interface.
+       * The fact that the IB interface for blocking on a CQ doesn't
+         have a threshold.  If it did, we could often convert a
+         dat_evd_wait() into a wait on that CQ.
+       * The lack of a parallel concept to the CNO within IB.  
+
+These are all areas in which closer integration between the IB
+verbs/driver and uDAPL could allow the user thread to wait within the
+driver.  This would allow the hardware interrupt thread to directly
+unblock the user thread, saving a context switch.
+
+The specific optimizations considered here are:
+       * Allow blocking on an IB CQ.  This would allow removal of the
+         excess context switch for dat_evd_wait() in cases where
+         there is a 1-to-1 correspondence between an EVD and a CQ and
+         no threshold was passed to dat_evd_wait(). 
+       * Allow blocking on an IB CQ to take a threshold argument.
+         This would allow removal of the excess context switch for
+         dat_evd_wait() in cases where there is a 1-to-1
+         correspondence between an EVD and a CQ regardless of the
+         threshold value.
+       * Give the HCA device driver knowledge of and access to the
+         implementation of the uDAPL EVD, and implement dat_evd_wait()
+         as an ioctl blocking within the device driver.  This would
+         allow removal of the excess context switch in all cases for
+         a dat_evd_wait().
+       * Give the HCA device driver knowledge of and access to the
+         implementation of the uDAPL CNO, and implement dat_cno_wait()
+         as an ioctl blocking within the device driver.  This would
+         allow removal of the excess context switch in all cases for
+         a dat_cno_wait(), and could improve performance for blocking
+         on OS Proxy Wait Objects related to the uDAPL CNO.
+
+See the DAPL Event Subsystem Design (doc/dapl_event_design.txt) for
+more details on this class of optimization.
+
+========================
+Reducing Copying of Data
+========================
+
+There are two primary places in which a closer integration between the
+IB verbs/driver and the uDAPL implementation could reduce copying
+costs:
+
+-- Avoidance of s/g list copy on posting
+
+Currently there are two copies involved in posting a data transfer
+request in uDAPL:
+       * From the user context to uDAPL.  This copy is required
+         because the scatter/gather list formats for uDAPL and IB
+         differ; a copy is required to change formats.
+       * From uDAPL to the WQE.  This copy is required because IB
+         specifies that all user parameters are owned by the user
+         upon return from the IB call, and therefore IB must keep its
+         own copy for use during the data transfer operation.
+
+If the uDAPL data transfer dispatch operations were implemented
+directly on the IB hardware, these copies could be combined.
+
+-- Avoidance of Event data copy from CQ to EVD
+
+Currently there are two copies of data involved in receiving an event
+in a standard data transfer operation:
+       * From the CQ on which the IB completion occurs to an event
+         structure held within the uDAPL EVD.  This is because the IB
+         verbs provide no way to discover how many elements have been
+         posted to a CQ.  This copy is not
+         required for dat_evd_dequeue.  However, dat_evd_wait
+         requires this copy in order to correctly implement the
+         threshold argument; the callback must know when to wake up
+         the waiting thread.  In addition, copying all CQ entries
+         (not just the one to be returned) is necessary before
+         returning from dat_evd_wait in order to set the *nmore OUT
+         parameter. 
+       * From the EVD into the  event structure provided in the
+         dat_evd_wait() call.  This copy is required because of the
+         DAT specification, which requires a user-provided event
+         structure to the dat_evd_wait() call in which the event
+         information will be returned.  If dat_evd_wait() were
+         instead, for example, to hand back a pointer to the already
+         allocated event structure, that would eventually require the
+         event subsystem to allocate more event structures.  This is
+         avoided in the critical path. 
+
+A tighter integration between the IB verbs/driver and the uDAPL
+implementation would allow the avoidance of the first copy.
+Specifically, providing a way to get information as to the number of
+completions on a CQ would allow avoidance of that copy.
+
+See the uDAPL Event Subsystem Design for more details on this class of
+optimization.
+
+====================
+Elimination of Locks
+====================
+
+Currently there is only a single lock used on the critical path in the
+reference implementation, in dat_evd_wait() and dat_evd_dequeue().
+This lock is in place because the ib_completion_poll() routine is not
+defined as thread safe, and both dat_evd_wait() and dat_evd_dequeue()
+are.  If there were some way for a vendor to make ib_completion_poll()
+thread safe without a lock (e.g. if the appropriate hardware/software
+interactions were naturally safe against races), and certain other
+modifications were made to the code, the lock might be removed.
+
+The modifications required are:
+       * Making racing consumers from DAPL ring buffers thread safe.
+         This is possible, but somewhat tricky; the key is to make
+         the interaction with the producer occur through a count of
+         elements on the ring buffer (atomically incremented and
+         decremented), but to dequeue elements with a separate atomic
+         pointer increment.  The atomic modification of the element
+         count synchronizes with the producer and acquires the right
+         to do an atomic pointer increment to get the actual data.
+         The atomic pointer increment synchronizes with the other
+         consumers and actually gets the buffer data.
+       * The optimization described above for avoiding copies from
+         the CQ to the DAPL EVD Event storage queue.  Without this
+         optimization a potential race between dat_evd_dequeue() and
+         dat_evd_wait() exists where dat_evd_dequeue will return an
+         element advanced in the event stream from the one returned
+         from dat_evd_wait():
+
+                                       dat_evd_dequeue() called
+
+                                         EVD state checked; ok for
+                                         dat_evd_dequeue()
+               dat_evd_wait() called
+
+                 State changed to reserve EVD
+                 for dat_evd_wait() 
+
+                 Partial copy of CQ to EVD Event store
+
+                                         Dequeue of CQE from CQ
+
+                 Completion of copy of CQ to EVD Event store
+
+                 Return of first CQE copied to EVD Event store.
+
+                                         Return of this CQE from the middle
+                                         of the copied stream.
+
+
+         If no copy occurs, dat_evd_wait() and dat_evd_dequeue() may
+         race, but if all operations on which they may race (access
+         to the EVD Event Queue and access to the CQ) are thread
+         safe, this race will cause no problems.
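+
+The two-step consumer protocol from the first modification above
+(claim an element through the atomic count, then take an index with a
+separate atomic increment) can be sketched with C11 atomics.  This is
+not the DAPL ring buffer code; the ring structure and the rb_put and
+rb_get names are hypothetical.  The per-slot "full" flag is an
+addition of this sketch so that the single producer never reuses a
+slot that a slow consumer is still reading.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RB_SLOTS 4u                       /* hypothetical capacity, power of two */

struct ring {
    void        *slot[RB_SLOTS];
    atomic_bool  full[RB_SLOTS];          /* sketch-only: guards slot reuse */
    atomic_int   count;                   /* elements consumers may claim */
    atomic_uint  head;                    /* next index to dequeue */
    unsigned int tail;                    /* touched by the single producer only */
};

/* Single producer. Returns 0 on success, -1 if the next slot is busy. */
int rb_put(struct ring *rb, void *item)
{
    unsigned int i = rb->tail % RB_SLOTS;

    assert(item != NULL);                 /* NULL is reserved for "empty" */
    if (atomic_load(&rb->full[i]))
        return -1;
    rb->slot[i] = item;
    atomic_store(&rb->full[i], true);
    rb->tail++;
    atomic_fetch_add(&rb->count, 1);      /* publish to consumers */
    return 0;
}

/* Any number of racing consumers. Returns NULL when nothing is available. */
void *rb_get(struct ring *rb)
{
    int n = atomic_load(&rb->count);

    /* Step 1: claim one element by atomically decrementing the count.
     * This synchronizes with the producer. */
    do {
        if (n <= 0)
            return NULL;
    } while (!atomic_compare_exchange_weak(&rb->count, &n, n - 1));

    /* Step 2: the claim entitles this consumer to exactly one index.
     * This synchronizes with the other consumers. */
    unsigned int i = atomic_fetch_add(&rb->head, 1) % RB_SLOTS;
    void *item = rb->slot[i];

    atomic_store(&rb->full[i], false);    /* hand the slot back */
    return item;
}
```

+A ring with static storage duration starts out zeroed, which is the
+empty state in this sketch.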
+
+============================
+Eliminating Subroutine Calls
+============================
+
+This area is the simplest, as there are many DAPL calls on the
+critical path that are very thin veneers on top of their IB
+equivalents.  All of these calls are candidates for being merged with
+those IB equivalents.  In cases where there are other optimizations
+that may be achieved with the call described above (e.g. within the
+event subsystem, the data transfer operation posting code), that call
+is not mentioned here: 
+       * dat_pz_create
+       * dat_pz_free
+       * dat_pz_query
+       * dat_lmr_create
+       * dat_lmr_free
+       * dat_lmr_query
+       * dat_rmr_create
+       * dat_rmr_free
+       * dat_rmr_query
+       * dat_rmr_bind
+
+       
diff --git a/doc/dat.conf b/doc/dat.conf

new file mode 100644 (file)

index 0000000..9d50857
--- /dev/null
+++ b/doc/dat.conf
@@ -0,0 +1,15 @@
+#
+# DAT 1.1 configuration file
+#
+# Each entry should have the following fields:
+#
+# <ia_name> <api_version> <threadsafety> <default> <lib_path> \ 
+#           <provider_version> <ia_params> <platform_params>
+#
+
+ia0 u1.1 nonthreadsafe default /usr/lib/libdapl.so ri.1.1 "ia_params" "pd_params"
+
+# Example for openib using the first Mellanox adapter,  port 1 and port 2
+OpenIB1 u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "mthca0 1" ""
+OpenIB2 u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "mthca0 2" ""
+
diff --git a/doc/dat_environ.txt b/doc/dat_environ.txt

new file mode 100644 (file)

index 0000000..7c32037
--- /dev/null
+++ b/doc/dat_environ.txt
@@ -0,0 +1,45 @@
+               DAT Environment Guide v. 0.01
+                -----------------------------
+
+The following environment variables affect the behavior of the DAT
+library: 
+
+
+DAT_OVERRIDE
+------------
+ Value used as the static registry configuration file, overriding the
+ default location, /etc/dat.conf
+
+ Example: setenv DAT_OVERRIDE /path/to/my/private.conf
+
+
+DAT_DBG_TYPE
+------------
+
+ Value specifies which parts of the registry will print debugging
+ information.  Valid values are
+
+    DAT_OS_DBG_TYPE_ERROR              = 0x1
+    DAT_OS_DBG_TYPE_GENERIC            = 0x2
+    DAT_OS_DBG_TYPE_SR                 = 0x4
+    DAT_OS_DBG_TYPE_DR                 = 0x8
+    DAT_OS_DBG_TYPE_PROVIDER_API       = 0x10
+    DAT_OS_DBG_TYPE_CONSUMER_API       = 0x20
+    DAT_OS_DBG_TYPE_ALL                = 0xff
+
+ or any combination of these. For example you can use 0xC to get both 
+ static and dynamic registry output.
+
+ Example: setenv DAT_DBG_TYPE 0xC
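+
+ The way the values combine can be sketched in C.  The constants are
+ copied from the table above; the helper names dbg_mask_parse and
+ dbg_enabled are hypothetical, and the registry's own parsing of the
+ variable is not shown here.

```c
#include <assert.h>
#include <stdlib.h>

/* Values copied from the DAT_DBG_TYPE table. */
enum {
    DAT_OS_DBG_TYPE_ERROR        = 0x1,
    DAT_OS_DBG_TYPE_GENERIC      = 0x2,
    DAT_OS_DBG_TYPE_SR           = 0x4,
    DAT_OS_DBG_TYPE_DR           = 0x8,
    DAT_OS_DBG_TYPE_PROVIDER_API = 0x10,
    DAT_OS_DBG_TYPE_CONSUMER_API = 0x20,
    DAT_OS_DBG_TYPE_ALL          = 0xff
};

/* Parse a DAT_DBG_TYPE-style string such as "0xC" or "12" into a mask.
 * Returns 0 for NULL or unparsable input. */
unsigned long dbg_mask_parse(const char *text)
{
    char *end = NULL;
    unsigned long value;

    if (text == NULL)
        return 0;
    value = strtoul(text, &end, 0);       /* base 0 accepts 0x prefixes */
    if (end == text)
        return 0;
    return value & DAT_OS_DBG_TYPE_ALL;
}

/* Nonzero when any bit of type is set in mask. */
int dbg_enabled(unsigned long mask, unsigned long type)
{
    assert(type != 0);
    return (mask & type) != 0;
}
```

+ With this sketch, the string "0xC" yields DAT_OS_DBG_TYPE_SR |
+ DAT_OS_DBG_TYPE_DR, matching the static and dynamic registry example.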
+  
+DAT_DBG_DEST
+------------ 
+
+ Value sets the output destination.  Valid values are
+  
+    DAT_OS_DBG_DEST_STDOUT              = 0x1
+    DAT_OS_DBG_DEST_SYSLOG              = 0x2 
+    DAT_OS_DBG_DEST_ALL                 = 0x3 
+  
+ For example, 0x3 will output to both stdout and the syslog. 
+
diff --git a/doc/ibhosts b/doc/ibhosts

new file mode 100644 (file)

index 0000000..4a72bd9
--- /dev/null
+++ b/doc/ibhosts
@@ -0,0 +1,3 @@
+dat-linux3-ib0 0xfe80000000000000 0x0001730000003d11
+dat-linux5-ib0 0xfe80000000000000 0x0001730000003d91
+dat-linux6-ib0 0xfe80000000000000 0x0001730000009791