--- /dev/null
+#######################################################################
+# #
+# DAPL Coding style reference #
+# #
+# Steve Sears #
+# sjs2 at users.sourceforge.net #
+# #
+# 12/13/2002 #
+# #
+#######################################################################
+
+======================================================================
+Introduction
+======================================================================
+
+The purpose of this document is to establish the coding style adopted by
+the team implementing the DAPL reference implementation. The rules
+presented here were arrived at by consensus; they are intended to
+provide consistency of implementation and make it intuitive to work with
+the source code.
+
+======================================================================
+Source code conventions
+======================================================================
+
+1. Brackets
+
+ Brackets should follow C99 conventions and declare a block. The
+ bracket style used is:
+
+     if (x)
+     {
+         statement;
+         statement;
+     }
+
+ The following bracket styles are to be avoided:
+
+ K&R style:
+
+     if (x) {        /* DON'T DO THIS */
+         statement;
+     }
+
+ GNU style:
+
+     if (x)          /* DON'T DO THIS */
+       {
+         statement;
+       }
+
+ Statements are always indented from brackets.
+
+ Brackets are always used, even around a single statement, in order to
+ avoid dangling clause bugs. E.g.
+
+ RIGHT:
+     if ( x )
+     {
+         j = 0;
+     }
+
+ WRONG:
+     if ( x )
+         j = 0;
+
+2. Indents
+
+ Indents are always 4 columns and tab stops are 8 columns, so a tab
+ may serve as a double indent. Many of the reference implementation
+ files have an emacs format statement at the bottom.
+
+3. Comments
+
+ Comments are always full C style comments, and never C++
+ style. Comments take the form:
+
+     /*
+      * comment
+      */
+
+4. Variable Declarations
+
+ Variables are always declared on their own line; we do not declare
+ multiple variables on the same line.
+
+ Variables are never initialized in their declaration; they are
+ initialized in the body of the code.
+
+5. Function Declarations
+
+ The return type of a function is declared on a separate line from the
+ function name.
+
+ Each parameter receives its own line and should be clearly labeled as
+ IN, OUT, or INOUT. Parameter declarations begin one tab stop from the
+ margin.
+
+ For example:
+
+ DAT_RETURN
+ dapl_function (
+         IN  DAT_IA_HANDLE ia_handle,
+         OUT DAT_EP_HANDLE *ep_handle )
+ {
+     ... function body ...
+ }
+
+6. White space
+
+ Don't be afraid of white space; the goal is to make the code readable
+ and maintainable. We use white space as follows:
+
+ - One space following function names or conditional expressions. It
+ might be better to say one space before any open parenthesis.
+
+ - Suggestion: One space following open parens and one space before
+ closing parens. Not all of the code follows this convention; use
+ your best judgment.
+
+ Example:
+
+     foo ( x1, x2 );
+
+7. Conditional code
+
+ We generally try to avoid conditional compilation, but there are
+ certain places where it cannot be avoided. Whenever possible, move
+ the conditional code into a macro or otherwise work to put it into an
+ include file that can be used by the platform (e.g. Linux or Windows
+ osd files), or by the underlying provider (e.g. IBM Torrent or
+ Mellanox Tavor).
+
+ Conditionals should be descriptive, and the associated #endif should
+ repeat the tested symbol in a comment. E.g.
+
+ #ifdef THIS_IS_AN_EXAMPLE
+
+     /* code */
+
+ #endif /* THIS_IS_AN_EXAMPLE */
+
+ The ending comment may be changed if a #else clause is present, to
+ show which branch is being closed. E.g.
+
+ #ifdef THIS_IS_AN_EXAMPLE
+     /* code */
+
+ #else
+     /* other code */
+
+ #endif /* !THIS_IS_AN_EXAMPLE */
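+
+ As an illustration of moving conditional code into an include file,
+ the sketch below hides a platform difference behind one macro, so the
+ calling code carries no #ifdef of its own. The names
+ EXAMPLE_PATH_SEPARATOR and example_basename are hypothetical and are
+ not part of the reference implementation; IN is defined as an empty
+ marker here only so the fragment compiles on its own.
+

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define IN      /* empty direction marker, defined here for the sketch */

/* This part would live in a hypothetical per-platform osd header. */
#ifdef _WIN32
#define EXAMPLE_PATH_SEPARATOR '\\'
#else
#define EXAMPLE_PATH_SEPARATOR '/'
#endif /* !_WIN32 */

/*
 * Common code: return the file name portion of a path. No conditional
 * compilation is needed here because the macro hides the difference.
 */
const char *
example_basename (
        IN const char *path )
{
    const char *sep;

    assert ( path != NULL );
    sep = strrchr ( path, EXAMPLE_PATH_SEPARATOR );
    if ( sep == NULL )
    {
        return path;
    }
    return sep + 1;
}
```

+
+ The conditional is confined to a single, clearly labeled place, and
+ every caller of example_basename() stays free of platform tests.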
+
+
+======================================================================
+Naming conventions
+======================================================================
+
+1. Variable Names
+
+ Variable names for DAPL data structures generally follow their type
+ and should be the same in all source files. A few examples:
+
+ Handles
+
+     DAT_IA_HANDLE ia_handle;
+     DAT_EP_HANDLE ep_handle;
+
+ Pointers
+
+     DAPL_IA *ia_ptr;
+     DAPL_EP *ep_ptr;
+
+2. Return Code Names
+
+ There are at least two different subsystems supported in the DAPL
+ reference implementation. In order to bring sanity to the error
+ space, return codes are named and used for their appropriate
+ subsystem. E.g.
+
+ ib_status: InfiniBand status return code
+ dat_status: DAT/DAPL return code
+
+3. Function Names
+
+ Function names describe the scope to which they apply. There are
+ essentially three name prefixes in the reference implementation:
+
+ dapl_*   Name of an exported function visible externally.
+          These functions have a 1 to 1 correspondence to
+          their DAT counterparts.
+
+ dapls_*  Name of a function that is called from more than one
+          source file, but is limited to a subsystem.
+
+ dapli_*  Local function, internal to a file. Should always be
+          of type STATIC.
+
+
+======================================================================
+Util files
+======================================================================
+
+The Reference Implementation is organized such that each exported
+function is located in its own file. If you are trying to find the DAPL
+function that creates an End Point, look in the file named after the
+dapl version of the corresponding DAT function in the spec. E.g.
+
+dapl_ep_create() is found in dapl_ep_create.c
+dapl_evd_free() is found in dapl_evd_free.c
+
+It is often the case that the implementation must interact with data
+structures or call into other subsystems. All utility functions for a
+subsystem are gathered into the appropriate "util" file.
+
+For example, dapl_ep_create must allocate a DAPL_EP structure. The
+routine to allocate and initialize memory is found in the
+dapl_ep_util.c file and is named dapl_ep_alloc(). Appropriate routines
+for the util file are:
+
+ - Alloc
+ - Free
+ - Assign defaults
+ - Linking routines
+ - Check restrictions
+ - Perform operations on a data structure
+
+The idea of a util file is an object-oriented idea for a non-OO
+language. It encourages a clean implementation.
+
+For each util.c file, there is also a util.h file. The purpose of the
+util include file is to define the prototypes for the util file, and to
+supply any local flags or values necessary to the subsystem.
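+
+A minimal sketch of what such a util file might contain is shown below.
+It is not taken from the reference implementation: the EXAMPLE_EP type,
+its fields, the default values, and all function names are hypothetical,
+chosen only to show the alloc, free, and assign-defaults routines living
+together, with the file-local helper declared static.
+

```c
#include <assert.h>
#include <stdlib.h>

#define IN      /* empty direction marker, defined here for the sketch */

/* Hypothetical object managed by this util file. */
typedef struct example_ep
{
    int max_recv_dtos;
    int max_request_dtos;
} EXAMPLE_EP;

/* Hypothetical defaults; a util.h file is where such values would go. */
#define EXAMPLE_DEFAULT_RECV_DTOS       16
#define EXAMPLE_DEFAULT_REQUEST_DTOS    16

/* Assign defaults: local to this file, hence static. */
static void
examplei_ep_default_attrs (
        IN EXAMPLE_EP *ep_ptr )
{
    ep_ptr->max_recv_dtos = EXAMPLE_DEFAULT_RECV_DTOS;
    ep_ptr->max_request_dtos = EXAMPLE_DEFAULT_REQUEST_DTOS;
}

/* Alloc: allocate and initialize an object; returns NULL on failure. */
EXAMPLE_EP *
example_ep_alloc ( void )
{
    EXAMPLE_EP *ep_ptr;

    ep_ptr = malloc ( sizeof ( *ep_ptr ) );
    if ( ep_ptr != NULL )
    {
        examplei_ep_default_attrs ( ep_ptr );
    }
    return ep_ptr;
}

/* Free: release an object obtained from example_ep_alloc(). */
void
example_ep_free (
        IN EXAMPLE_EP *ep_ptr )
{
    assert ( ep_ptr != NULL );
    free ( ep_ptr );
}
```

+
+Callers in other files of the subsystem would see only the prototypes
+for example_ep_alloc() and example_ep_free() in the matching util.h.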
+
+======================================================================
+Include files, prototypes
+======================================================================
+
+Include files are organized according to subsystem and/or OS
+platform. The include directory contains files that are global to the
+entire source set. Prototypes are found in include files that pertain to
+the data they support.
+
+The DAPL Reference Implementation tree, with comments:
+
+ dapl/common
+ dapl/include
+     Contains global dapl data structures, symbols, and
+     prototypes
+ dapl/tavor
+     Contains tavor prototypes and symbols
+ dapl/torrent
+     Contains torrent prototypes and symbols
+ dapl/udapl
+     Contains include files to support udapl specific files
+ dapl/udapl/linux
+     Contains osd files for Linux
+ dapl/udapl/windows
+     Contains osd files for Windows
+
+For completeness, the dat files described by the DAT Specification are
+in the tree under the dat/ subdirectory:
+
+ dat/include/dat/
+
+
--- /dev/null
+#######################################################################
+# #
+# DAPL End Point Management Design #
+# #
+# Steve Sears #
+# sjs2 at users.sourceforge.net #
+# #
+# 10/04/2002 #
+# Updates #
+# 02/06/04 #
+# 10/07/04 #
+# #
+#######################################################################
+
+
+======================================================================
+Referenced Documents
+======================================================================
+
+uDAPL: User Direct Access Programming Library, Version 1.1. Published
+05/08/2003. http://www.datcollaborative.org/uDAPL_050803.pdf.
+Referred to in this document as the "DAT Specification".
+
+InfiniBand Access Application Programming Interface Specification,
+Version 1.2, 4/15/2002. In DAPL SourceForge repository at
+doc/api/access_api.pdf. Referred to in this document as the "IBM
+Access API Specification".
+
+InfiniBand Architecture Specification Volume 1, Release 1.0.a. Referred
+to in this document as the "InfiniBand Spec".
+
+======================================================================
+Introduction to EndPoints
+======================================================================
+
+An EndPoint is the fundamental channel abstraction for the DAT API. An
+application communicates and exchanges data using an EndPoint. Most of
+the time EndPoints are explicitly allocated, but there is an exception
+whereby a connection event can yield an EndPoint as a side effect; this
+is not supported by all transports or implementations, but it is
+supported in the InfiniBand reference implementation.
+
+Each DAT API function is implemented in a file named
+
+ dapl_<function name>.c
+
+There is a simple mapping provided by the dat library that maps dat_* to
+dapl_*. For example, dat_pz_create is implemented in dapl_pz_create.c.
+Other examples:
+
+ DAT            DAPL            Found in
+ -------------  --------------  ----------------
+ dat_ep_create  dapl_ep_create  dapl_ep_create.c
+ dat_ep_query   dapl_ep_query   dapl_ep_query.c
+
+There are very few exceptions to this naming convention; the Reference
+Implementation tried to be consistent.
+
+There are also dapl_<object name>_util.{h,c} files for each object. For
+example, there are dapl_pz_util.h and dapl_pz_util.c files which contain
+common helper functions specific to the 'pz' subsystem. The use of util
+files follows the convention used elsewhere in the DAPL reference
+implementation. These files contain common object creation and
+destruction code, linked list manipulation, and other helper functions.
+
+This implementation has a simple naming convention designed to alert
+someone reading the source code to the nature and scope of a
+function. The convention is in the function name, such that:
+
+ dapl_   Primary entry from a dat_ function, e.g.
+         dapl_ep_create(), which mirrors dat_ep_create().
+ dapls_  The 's' restricts it to the subsystem, e.g. the
+         'ep' subsystem. dapls_ functions are not exposed
+         externally, but are internal to dapl.
+ dapli_  The 'i' restricts the function to the file where it
+         is declared. These functions are always 'static' C
+         functions.
+
+This convention is not followed as consistently as we would like, but is
+common in the reference implementation.
+
+1. End Points (EPs)
+-------------------------
+
+DAPL End Points provide a channel abstraction necessary to transmit and
+receive data. EPs interact with Service Points, either Public Service
+Points or Reserved Service Points, to establish a connection from one
+provider to another.
+
+The primary EP entry points in the DAT API as they relate to DAPL are
+listed in the following table:
+
+ dat_ep_create
+ dat_ep_query
+ dat_ep_modify
+ dat_ep_connect
+ dat_ep_dup_connect
+ dat_ep_disconnect
+ dat_ep_post_send
+ dat_ep_post_recv
+ dat_ep_post_rdma_read
+ dat_ep_post_rdma_write
+ dat_ep_get_status
+ dat_ep_reset
+ dat_ep_free
+
+Additionally, the following connection functions interact with
+EndPoints:
+ dat_psp_create
+ dat_psp_query
+ dat_psp_free
+ dat_rsp_create
+ dat_rsp_query
+ dat_rsp_free
+ dat_cr_accept
+ dat_cr_reject
+ dat_cr_query
+ dat_cr_handoff
+
+The reference implementation maps the EndPoint abstraction onto an
+InfiniBand Queue Pair (QP).
+
+The DAPL_EP structure is used to maintain the state and components of
+the EP object and the underlying QP. As will be explained below, keeping
+track of the QP state is critical for successful operation. Access to
+the DAPL_EP fields is done atomically.
+
+
+======================================================================
+Goals
+======================================================================
+
+Initial goals
+-------------
+-- Implement all of the dat_ep_* calls described in the DAT
+ Specification.
+
+-- Implement connection calls described in the DAT Specification with
+ the following exception:
+ - dat_cr_handoff. This is best done with kernel mediation, and is
+ therefore out of scope for the reference implementation.
+
+-- The implementation should be as portable as possible, to facilitate
+ HCA vendors' efforts to implement vendor-specific versions of DAPL.
+
+-- The implementation must be able to work during ongoing development
+ of provider software agents, drivers, etc.
+
+Later goals
+-----------
+-- Examine various possible performance optimizations. This document
+ lists potential performance improvements, but the specific
+ performance improvements implemented should be guided by customer
+ requirements.
+
+======================================================================
+Requirements, constraints, and design inputs
+======================================================================
+
+The EndPoint is the base channel abstraction. An EndPoint must be
+established before data can be exchanged with a remote node. The
+EndPoint is mapped to the underlying InfiniBand QP channel abstraction.
+When a connection is initiated, the InfiniBand Connection Manager will
+be solicited. The implementation is constrained by the capabilities and
+behavior of the underlying InfiniBand facilities.
+
+Note that transports other than InfiniBand may not need to rely on
+Connection Managers or other infrastructure; this is an artifact of
+this transport.
+
+An EP is not an exact match to an InfiniBand QP; the differences
+introduce constraints that are not obvious. There are three primary
+areas of conflict between the DAPL and InfiniBand models:
+
+1) EP and QP creation differences
+2) Provider-provided EPs on passive side of connections
+3) Connection timeouts
+
+-- EP and QP creation
+
+The most obvious difference between an EP and a QP is the presence of a
+memory handle when the object is created. InfiniBand requires a
+Protection Domain (PD) be specified when a QP is created; in the DAPL
+world, a Protection Zone (PZ) maps to an InfiniBand Protection Domain.
+DAPL does not require a PZ to be present when an EP is created, and that
+introduces two problems:
+
+1) If a PZ is NULL when an EP is created, a QP will not be bound to
+ the EP until dat_ep_modify() is used to assign it later. A PZ is
+ required before RECV requests can be posted and before a connection
+ can be established.
+
+2) If a DAPL user changes the PZ on an EP before it is connected,
+ DAPL must release the current QP and create a new one with a
+ new Protection Domain.
+
+-- Provider-provided EPs on connection
+
+The second area where the DAPL and IB models conflict is a direct result
+of the requirement to specify a Protection Domain when a QP is created.
+
+DAPL allows a PSP to be created in such a way that an EP will
+automatically be provided to the user when a connection occurs. This is
+not critical to the DAPL model but in fact does provide some convenience
+to the user. InfiniBand provides a similar mechanism, but with an
+important difference: InfiniBand requires the user to supply, up
+front, the Protection Domain that will be applied to all QPs created
+for the passive connection endpoint as a result of connection
+requests; DAPL mandates a NULL PZ and requires the user to change the
+PZ before using the EP.
+
+The reference implementation creates an 'empty' EP when the user
+specifies the DAT_PSP_PROVIDER flag; it is empty in the sense that a QP
+is not attached to the EP. Before the user can dat_cr_accept the
+connection, the EP must be modified to have a PZ bound to it, which in
+turn will cause a QP to be bound to the EP.
+
+To keep track of the current state of the EP, the DAPL_EP structure
+has a qp_state field. The type of this field is specific to the
+provider and the states are provider-specified states for a particular
+transport, with the addition of a single state from dapl:
+DAPL_QP_STATE_UNATTACHED, indicating that no QP has been bound to the
+EP. The qp_state field is thus an open enumeration, containing a single
+DAPL state in addition to the states specified by the provider.
+DAPL_QP_STATE_UNATTACHED is arbitrarily defined to be 0xFFF0, a
+value selected strictly because it is not expected to collide with
+provider states; if it does collide, the value must be changed such
+that it is unique.
+
+The common layer of DAPL only looks at this single value for qp_state;
+it cannot be aware of states that are unique to the provider. However,
+the provider layer is free to update this field and may use it as a
+cache for current QP state. The field must be updated when a QP (or
+other endpoint resource) is bound to the EP.
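+
+The division of labor described above can be sketched as follows. Only
+DAPL_QP_STATE_UNATTACHED and its value 0xFFF0 come from the text; the
+EXAMPLE_ names, the provider state values, and both functions are
+hypothetical stand-ins and do not reflect the actual DAPL_EP layout.
+

```c
#include <assert.h>
#include <stddef.h>

#define IN      /* empty direction markers, defined here for the sketch */
#define INOUT

/* Stand-ins for provider-defined QP states; real values are provider specific. */
#define EXAMPLE_IB_QP_STATE_RESET   0
#define EXAMPLE_IB_QP_STATE_INIT    1

/* The single state owned by the common layer, as given in the text. */
#define DAPL_QP_STATE_UNATTACHED    0xFFF0

typedef struct example_ep
{
    unsigned int qp_state;  /* provider states plus UNATTACHED */
} EXAMPLE_EP;

/* Common layer: the only question it may ask is whether a QP is bound. */
int
example_ep_has_qp (
        IN const EXAMPLE_EP *ep_ptr )
{
    assert ( ep_ptr != NULL );
    return ( ep_ptr->qp_state != DAPL_QP_STATE_UNATTACHED );
}

/* Provider layer: record the QP state when a QP is bound to the EP. */
void
example_ep_bind_qp (
        INOUT EXAMPLE_EP *ep_ptr,
        IN    unsigned int provider_state )
{
    assert ( ep_ptr != NULL );
    /* A provider state must never collide with the DAPL value. */
    assert ( provider_state != DAPL_QP_STATE_UNATTACHED );
    ep_ptr->qp_state = provider_state;
}
```
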
+
+DAPL 1.2 provides DAT level states that will make this obsolete, but it
+exists in pre-DAPL 1.2 code.
+
+
+-- Connection Timeouts
+
+The third difference in the DAPL and InfiniBand models has to do with
+timeouts on connections. InfiniBand does not provide a way to specify a
+connection timeout, so it will wait indefinitely for a connection to
+occur. dat_ep_connect supports a timeout value providing the user with
+control over how long they are willing to wait for a connection to
+occur.
+
+DAPL maintains a timer thread to watch over pending connections. A
+shared timer queue has a sorted list of timeout values. If a timeout
+is requested, dapl_ep_connect() will invoke dapls_timer_set(), which
+will add a timer record to the sorted list of timeouts. The timeout
+thread is started lazily: that is, it isn't started until a timeout is
+requested. Once a timeout has been requested, the thread will continue
+to exist until the application terminates.
+
+The timer record is actually a part of the DAPL_EP structure, so there
+are no extra memory allocations required for timeouts. dapls_timer_set()
+will initialize the timer record and insert it into the sorted queue at
+the appropriate place. If this is the first record, or is inserted
+before the first record (which will be the 'next' timeout to expire),
+the timer thread will be awakened so it can recalculate how long it must
+sleep until the timeout occurs.
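+
+The insertion rule just described can be sketched as below. This is an
+illustration only, not the code of dapls_timer_set(): the EXAMPLE_TIMER
+type and example_timer_insert() are hypothetical, the queue is a plain
+singly linked list, and the locking that a shared queue would need is
+omitted. The return value tells the caller whether the new record became
+the next timeout to expire, which is the case where the timer thread
+must be awakened.
+

```c
#include <assert.h>
#include <stddef.h>

#define IN      /* empty direction markers, defined here for the sketch */
#define INOUT

/* Hypothetical timer record; in the text it is embedded in the EP. */
typedef struct example_timer
{
    unsigned long        expires;   /* absolute expiry time, any unit */
    struct example_timer *next;
} EXAMPLE_TIMER;

/*
 * Insert a record into a list sorted by ascending expiry time.
 * Returns 1 if the record is now first in the list (the timer thread
 * must recalculate its sleep), 0 otherwise.
 */
int
example_timer_insert (
        INOUT EXAMPLE_TIMER **head,
        IN    EXAMPLE_TIMER *timer )
{
    EXAMPLE_TIMER **link;

    assert ( head != NULL );
    assert ( timer != NULL );

    /* Walk past every record that expires no later than the new one. */
    link = head;
    while ( *link != NULL && (*link)->expires <= timer->expires )
    {
        link = &(*link)->next;
    }
    timer->next = *link;
    *link = timer;

    return ( link == head );
}
```

+
+Because the record is supplied by the caller, the insert itself performs
+no memory allocation, in line with the design described above.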
+
+When a timeout does occur, the timeout code will cancel the connection
+request by invoking the provider routine dapls_ib_disconnect_clean().
+This allows the software module with explicit knowledge of the provider
+to take appropriate action and cancel the connection attempt. As a
+result, the EP will be placed into the UNCONNECTED state and the QP
+will be in the ERROR state, which in turn causes all DTOs to be
+flushed. The provider must support a mechanism to completely cancel a
+connection request.
+
+
+======================================================================
+DAPL EP Subsystem Design
+======================================================================
+
+In section 6.5.1 of the DAT Specification there is a UML state
+transition diagram for an EndPoint which goes over the transitions and
+states during the lifetime of an EP. It is nearly impossible to read.
+The reference implementation is faithful to the DAT Spec and is
+believed to be correct.
+
+This description of the EP will follow from creation to connection to
+termination. It will also discuss the source code organization as this
+is part of the design expression.
+
+-- EP and QP creation
+
+The preamble to creating an EP requires us to verify the attributes
+specified by the user. If a user were to specify max_recv_dtos as 0, for
+example, the EP would not be useful in any regard. If the user does not
+provide EP attrs, the DAPL layer will supply a set of common defaults
+resulting in a reasonable EP. The defaults are set up in
+dapli_ep_default_attrs(), and the default values are given at the top of
+dapl_ep_util.c. Non-InfiniBand transports will want to examine these
+values to make sure they are 'reasonable'. This simplistic mechanism may
+change in the future.
+
+A number of handles are bound to the EP, so a reference count is taken
+on each of them. All reference counts in the DAPL system are incremented
+or decremented using atomic operations; it is important to always use
+the OS dependent atomic routines and not substitute a lock, as it will
+not be observed elsewhere in the system and will have unpredictable
+results.
+
+Reference counts are taken if there are non NULL values on any of:
+ pz_handle
+ connect_evd_handle
+ recv_evd_handle
+ request_evd_handle
+
+The purpose of reference counts should be obvious: to prevent premature
+release of resources that are still being used.
+
+As has been discussed above, each EP is bound to a QP before it can be
+connected. If a valid PZ is provided at creation time then a QP is bound
+to the EP immediately. If the user later uses ep_modify() to change the
+PZ, the QP will be destroyed and a new one created with the appropriate
+Protection Domain.
+
+Finally, an EP is an IA resource and is linked onto the EP chain of the
+superior IA. EPs linked onto an IA are assumed to be complete, so this
+is the final step of EP creation.
+
+After an EP is created, the ep_state will be DAT_EP_STATE_UNCONNECTED
+and the qp_state will either be DAPL_QP_STATE_UNATTACHED or assigned by
+the provider layer (e.g. IB_QP_STATE_INIT). The qp_state indicates the QP
+binding and the current state of the QP.
+
+A qp_state of DAPL_QP_STATE_UNATTACHED indicates there is no QP bound
+to this EP. This is a result of a NULL PZ when dat_ep_create() was
+invoked, and which has been explained in detail above. The user must
+call dat_ep_modify() and install a valid PZ before the EP can be used.
+
+When an InfiniBand QP is created it is in the RESET state, which is
+specified in the InfiniBand Spec, section 10.3. However, DAPL creates
+the EP in the UNCONNECTED state and requires an unconnected EP to be
+able to queue RECV requests before a connection occurs. The InfiniBand
+spec allows RECV requests to be queued on a QP if the QP is in the INIT
+state, so after creating a QP the DAPL provider code must transition it
+to the INIT state.
+
+There is a mapping between the DAPL EP state and the InfiniBand QP
+state. For an EP with a QP bound to it, DAT_EP_STATE_UNCONNECTED
+indicates the underlying QP is in the INIT state. This is critical:
+RECV DTOs can be posted on an EP in the UNCONNECTED state, so the
+underlying QP must be in the appropriate state to allow this to happen.
+
+There is an obvious design tradeoff in transitioning the QP
+state. Immediately moving the state to INIT takes extra time at creation
+but allows immediate posting of RECV operations; however, it will
+involve a more complex tear down procedure if the QP must be replaced as
+a side effect of a dat_ep_modify operation. The alternative would be to
+delay transitioning the QP to INIT until a post operation is invoked,
+but that requires a run time check for every post operation. This design
+assumes users will infrequently cause a QP to be replaced after it is
+created and prefer to pay the state transition penalty at creation time.
+
+-- EP Query and Modify operations
+
+Because all of the ep_param data are kept up to date in the dapl_ep
+structure, and because they use the complete DAT specified structure, a
+query operation is trivial: a simple assignment from the internal
+structure to the user parameter. uDAPL allows the implementation to
+either return the fields specified by the user, or to return more than
+the user requested; the reference implementation does the latter. It is
+simpler and faster to copy the entire structure rather than to determine
+which of all of the possible fields the user requested.
+
+dat_ep_query() requires the implementation to report the address of the
+remote node, if the EP is connected. This is different from standard
+InfiniBand, if only because of the difference in name space. InfiniBand
+has the information on the remote LID, but it does not have the remote
+IP address, which is what DAT specifies. The reference implementation
+makes use of a lookup/name-service called ATS (Address Translation
+Service), which is built using the InfiniBand Subnet Administrator. ATS
+is InfiniBand only; other transports will use a different mechanism.
+
+A driver will register itself and one or more IP addresses with ATS
+at some point before a connection can be made. How the addresses are
+provided to the driver, or how this is managed by the driver is not
+specified. The ATS proposal is available from the DAT Collaborative.
+
+When dat_ep_query() is invoked on a connected EP, it will request the
+remote address from the provider layer. The provider layer will use
+whatever means are necessary to obtain the IP address of the other end of
+the connection. The results are placed into a buffer that is part of the
+EP structure. Finally, the address of that buffer is placed into the
+ep_param.remote_ia_address_ptr field.
+
+The ep_modify operation will modify the fields in the DAT_PARAM
+structure. There are some fields that cannot be updated, and there are
+others that can only be updated if the EP is in the correct state. The
+uDAPL spec outlines the EP states permitting ep modifications, but
+generally they are DAT_EP_STATE_UNCONNECTED and
+DAT_EP_STATE_PASSIVE_CONNECTION_PENDING.
+
+When replacing EVD handles it is a simple matter of releasing a
+reference on the previous handle and taking a new reference on the new
+handle. The Reference Implementation manages resource tracking using
+reference counts, which guarantees a particular handle will not be
+released prematurely. Reference counts are checked in the free routines
+of various objects.
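+
+A minimal sketch of this release-and-take pattern follows. It uses C11
+<stdatomic.h> as a stand-in for the OS dependent atomic routines, and
+the EXAMPLE_EVD and EXAMPLE_EP types, their fields, and both function
+names are hypothetical rather than taken from the reference
+implementation.
+

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define IN      /* empty direction markers, defined here for the sketch */
#define INOUT

/* Hypothetical EVD carrying only its reference count. */
typedef struct example_evd
{
    atomic_int ref_count;
} EXAMPLE_EVD;

/* Hypothetical EP holding one EVD handle. */
typedef struct example_ep
{
    EXAMPLE_EVD *recv_evd;
} EXAMPLE_EP;

/*
 * Replace the EVD bound to an EP. The new reference is taken before
 * the old one is released, so passing the same EVD twice is harmless.
 */
void
example_ep_set_recv_evd (
        INOUT EXAMPLE_EP *ep_ptr,
        IN    EXAMPLE_EVD *new_evd )
{
    EXAMPLE_EVD *old_evd;

    assert ( ep_ptr != NULL );
    old_evd = ep_ptr->recv_evd;
    if ( new_evd != NULL )
    {
        atomic_fetch_add ( &new_evd->ref_count, 1 );
    }
    ep_ptr->recv_evd = new_evd;
    if ( old_evd != NULL )
    {
        atomic_fetch_sub ( &old_evd->ref_count, 1 );
    }
}

/*
 * Free routine check: refuse to release an object that is still
 * referenced. Returns 0 if the object may be freed, -1 otherwise.
 */
int
example_evd_free (
        IN EXAMPLE_EVD *evd_ptr )
{
    assert ( evd_ptr != NULL );
    if ( atomic_load ( &evd_ptr->ref_count ) != 0 )
    {
        return -1;
    }
    return 0;
}
```

+
+The sketch updates only the counts atomically; it does not serialize
+two threads modifying the same EP at once, which a real implementation
+would also have to consider.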
+
+As has been mentioned previously, if the PZ handle is changed then the
+QP must be released, if already assigned, and a new QP must be created
+to bind to this EP.
+
+There are some fields in the DAT_PARAM structure that are related to the
+underlying hardware implementation. For these values DAPL will do a
+fresh query of the QP, rather than depend on stale values. Even so, the
+values returned are 'best effort' as a competing thread may change
+certain values before the requesting thread has the opportunity to read
+them. Applications should protect against this.
+
+Finally, the underlying provider is invoked to update the QP with new
+values, but only if some of the attributes have been changed. As is
+true of most of the implementation, we only invoke the provider code
+when necessary.
+
+======================================================================
+Connections
+======================================================================
+
+There are of course two sides to a connection, and in the DAPL model
+there is an Active and a Passive side. For clarity, the Passive side
+is a server waiting for a connection, and the Active side is a client
+requesting a connection from the Passive server. We will discuss each
+of these in turn.
+
+Connections happen in the InfiniBand world by using a Connection Manager
+(CM) interface. Those unfamiliar with the IB model of addressing and
+management agents may want to familiarize themselves with these aspects
+of the IB spec before proceeding in this document. Be warned that the
+connection section of the IB spec is the most ambiguous portion of the
+spec.
+
+First, let's walk through a primitive diagram of a connection:
+
+
+SERVER (passive)                      CLIENT (active)
+----------------                      ---------------
+1.  dapl_psp_create
+    or dapl_rsp_create
+    [ now listening ]
+
+2.                                    dapl_ep_connect
+                      <-------------
+3.  dapls_cr_callback
+    DAT_CONNECTION_REQUEST_EVENT
+    [ Create and post a DAT_CONNECTION_REQUEST_EVENT event ]
+
+4.  Event code processing
+
+5.  Create an EP if necessary
+    (according to the flags
+    when the PSP was created)
+
+6.  dapl_cr_accept or dapl_cr_reject
+                      ------------->
+7.                                    dapl_evd_connection_callback
+                                      DAT_CONNECTION_EVENT_ESTABLISHED
+                                      [ Create and post a
+                                        DAT_CONNECTION_EVENT_ESTABLISHED
+                                        event ]
+
+8.                    <------------- RTU
+
+9.  dapls_cr_callback
+    DAT_CONNECTION_EVENT_ESTABLISHED
+    [ Create and post a DAT_CONNECTION_EVENT_ESTABLISHED
+      event ]
+
+10. ...processing...
+
+11. Either side issues a dat_ep_disconnect
+
+12. dapls_cr_callback
+    DAT_CONNECTION_EVENT_DISCONNECTED
+
+    [ Create and post a
+      DAT_CONNECTION_EVENT_DISCONNECTED
+      event ]
+
+13.                                   dapl_evd_connection_callback
+                                      DAT_CONNECTION_EVENT_DISCONNECTED
+                                      [ Create and post a
+                                        DAT_CONNECTION_EVENT_DISCONNECTED
+                                        event ]
+
+
+In the above diagram, time is numbered in the left hand column and is
+represented vertically.
+
+We will continue our discussion of connections using the above diagram,
+following a sequential order for connection establishment.
+
+There are in fact two types of service points detailed in the uDAPL
+specification. We will limit our discussion to PSPs for convenience, but
+there are only minor differences between PSPs and RSPs.
+
+The reader should observe that all passive-side connection events will
+be received by dapls_cr_callback(), and all active side connection
+events occur through dapl_evd_connection_callback(). At one point during
+the implementation these routines were combined as they are very
+similar, but there are subtle differences causing them to remain
+separate.
+
+Progressing through the series of events as outlined in the diagram
+above:
+
+1. dapl_psp_create
+
+ When a PSP is created, the final act will be to set it listening for
+ connections from remote nodes. It is important to realize that a
+ connection may in fact arrive from a remote node before the routine
+ setting up a listener has returned to dapl_psp_create; as soon as
+ dapls_ib_setup_conn_listener() is invoked connection callbacks may
+ arrive. To reduce race conditions this routine must be called as the
+ last practical operation when creating a PSP.
+
+ dapls_ib_setup_conn_listener() is provider specific. The key insight
+ is that the DAPL connection qualifier (conn_qual) will become the
+ InfiniBand Service ID. The passive side of the connection is now
+ listening for connection requests. It should be obvious that the
+ conn_qual must be unique.
+
+ InfiniBand allows a 64 bit connection qualifier, which is supported
+ by the DAT spec. IP based networks may be limited to 16 bits, so
+ provider implementations may want to return an error if the conn_qual
+ exceeds the maximum allowable by the transport.
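+
+ Such a range check might look like the sketch below. The limit macro,
+ the function name, and the 0 / -1 return convention are hypothetical;
+ a real provider would return the appropriate DAT error code.
+

```c
#include <stdint.h>

#define IN      /* empty direction marker, defined here for the sketch */

/* Hypothetical limit for a transport whose qualifier is 16 bits wide. */
#define EXAMPLE_MAX_CONN_QUAL   UINT64_C(0xFFFF)

/* Returns 0 if the qualifier fits the transport, -1 if it is too large. */
int
example_check_conn_qual (
        IN uint64_t conn_qual )
{
    if ( conn_qual > EXAMPLE_MAX_CONN_QUAL )
    {
        return -1;
    }
    return 0;
}
```
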
+
+2. dapl_ep_connect
+
+ The active side initiates a connection with dapl_ep_connect, which
+ will transition the EP into DAT_EP_STATE_ACTIVE_CONNECTION_PENDING.
+ Again, connections are in the domain of the provider's Connection
+ Manager and the mechanics are very much provider specific. The key
+ point is that a DAT_IA_ADDRESS_PTR must be translated to a GID
+ before a connection initiation can occur. This is discussed below.
+
+ InfiniBand supports different amounts of private data on various
+ connection functions. Other transports allow variable sizes of
+ private data with no practical limit. The DAPL connection code does
+ not enforce a fixed amount of private data, but rather makes
+ available to the user all it has available, as specified by
+ DAPL_MAX_PRIVATE_DATA_SIZE.
+
+ Private data will be stored in a fixed buffer as part of the
+ connection record, which is the primary reason to limit the size.
+
+ To assist development on new transports that do not have a full
+ connection infrastructure in place, there are a couple of compile time
+ flags that enable workaround code: CM_BUSTED and
+ IBHOSTS_NAMING. These are discussed below in more detail, but
+ essentially:
+
+ CM_BUSTED: fakes a connection on both sides of the wire, does not
+ transmit any private data.
+
+ IBHOSTS_NAMING: provides a simple IP_ADDRESS to LID translation
+ mechanism in a text file, which is read when the dapl library
+ loads. Private data is exchanged in this case, but it includes a
+ header that contains the remote IP address. Technically, this defines
+ a protocol and is in violation of the DAT spec, but it has proved
+ useful in development.
+
+3. dapls_cr_callback
+
+ The connection sequence is entirely event driven. An operation is
+ posted, then an asynchronous event will occur some time later. The
+ event may cause other actions to occur which may result in still
+ more events.
+
+ dapls_ib_setup_conn_listener() registered for a callback for
+ connection events, and we now receive a DAT event for a connection
+ request. The provider layer will translate the native event type to
+ a DAT event.
+
+ An upcall is invoked on the server side of the connection with an
+ event of type DAT_CONNECTION_REQUEST_EVENT. This is a unique event
+ in the callback code as it is the only case when an EP is not
+ already in play; in all other cases, it is possible to look up the
+ relevant EP for an operation.
+
+ Code exists to make sure the relevant connection object, the PSP or
+ RSP, is actually in a useful state and ready to be connected
+ to. One of the critical differences between a PSP and an RSP is
+ that an RSP is a one-shot connection object; once a connection
+ occurs, no other connections can be made to it.
+
+ There is a small difference in the InfiniBand and DAPL connection
+ models here as well. DAPL may disable a PSP at any time without
+ affecting current connections. When you tear down an InfiniBand
+ service endpoint, all of the connections are torn down too. Because
+ of this difference, when a DAPL app frees a PSP, only a state
+ change is made. The underlying service point is still available and
+ technically capable of receiving connections. If a connection
+ request arrives when the PSP is in this state, a rejection message
+ is sent such that the requesting node believes no service point is
+ listening.
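+
+ The checks described above can be sketched as follows. All names
+ here (struct service_point, sp_check_request, and so on) are
+ hypothetical stand-ins, and marking the RSP as used at request time
+ is an assumption made for brevity; the real code operates on the
+ DAPL PSP/RSP structures and the provider's reject call.
+

```c
#include <assert.h>
#include <stddef.h>

enum sp_kind
{
    SP_KIND_PSP,
    SP_KIND_RSP
};

enum cr_verdict
{
    CR_PROCEED,     /* continue with the connection protocol */
    CR_REJECT       /* answer as if nobody were listening */
};

/* Hypothetical stand-in for the PSP/RSP state consulted by the callback */
struct service_point
{
    enum sp_kind kind;
    int          listening;     /* cleared when the app frees the PSP */
    int          used;          /* RSP only: set by the first connection */
};

static enum cr_verdict
sp_check_request (struct service_point *sp)
{
    assert (sp != NULL);

    if (!sp->listening)
    {
        /* Freed PSP: the IB service point still exists, so reject */
        return CR_REJECT;
    }

    if (sp->kind == SP_KIND_RSP)
    {
        if (sp->used)
        {
            return CR_REJECT;   /* an RSP is a one-shot object */
        }
        sp->used = 1;
    }

    return CR_PROCEED;
}
```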
+
+ Once the connection has been examined, it will continue with the
+ connection protocol. The EP will move to a CONNECTION_PENDING
+ state.
+
+ The connection request will cause a CR record to be allocated,
+ which holds all of the important connection request
+ information. The CR record will be linked onto the PSP structure
+ for retrieval in the future when other requests arrive.
+
+ The astute reader of the spec will observe that there is not a
+ dapl_cr_create call: CR records are created as part of a connection
+ attempt on the passive side of the connection. A CR is created now
+ and set up. A point that will become important later, caps for
+ emphasis:
+
+ A CR WILL EXIST FOR THE LIFE OF A CONNECTION; THEY ARE DESTROYED AT
+ DISCONNECT TIME.
+
+ In the connection request processing a CR and an EVENT are created;
+ the event will be posted along with the connection information just
+ received.
+
+ Private data is also copied into the CR record. Private data
+ arrived with the connection request and is not a permanent
+ resource, so it is copied into the dapl space to be used at a later
+ time. Different transports have varying capabilities on the size of
+ private data, so a call to the provider is invoked to determine how
+ big it actually is. There is an upper bound on the amount of
+ private data the implementation will deal with, set at
+ DAPL_MAX_PRIVATE_DATA_SIZE (256 bytes at this writing).
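+
+ As an illustration only, copying private data into a fixed buffer
+ might look like the sketch below. struct cr_private and
+ cr_save_private_data() are hypothetical names, and silently
+ truncating oversized data is an assumption of this sketch, not a
+ statement about what the reference implementation does.
+

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PRIVATE_DATA_SIZE 256   /* mirrors DAPL_MAX_PRIVATE_DATA_SIZE */

/* Hypothetical stand-in for the private data area of a CR record */
struct cr_private
{
    unsigned char data[MAX_PRIVATE_DATA_SIZE];
    size_t        size;
};

/*
 * Copy at most MAX_PRIVATE_DATA_SIZE bytes from the transient buffer
 * into the CR. Returns the number of bytes actually kept.
 */
static size_t
cr_save_private_data (struct cr_private *cr, const void *src, size_t src_size)
{
    size_t n;

    assert (cr != NULL);

    n = src_size;
    if (src == NULL)
    {
        n = 0;
    }
    if (n > MAX_PRIVATE_DATA_SIZE)
    {
        n = MAX_PRIVATE_DATA_SIZE;
    }

    memset (cr->data, 0, sizeof (cr->data));
    if (n > 0)
    {
        memcpy (cr->data, src, n);
    }
    cr->size = n;

    return n;
}
```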
+
+4. Event code processing
+
+ The final stage in a connection request is to generate an event on
+ a connection EVD using dapls_evd_post_cr_arrival_event().
+
+5. Create an EP if necessary
+
+ When the app processes a connection event, it needs to respond. If
+ the PSP is configured to create an EP automatically, the callback
+ code has already done so, creating an EP with no attached QP.
+ Otherwise, the user must provide an EP to make the connection.
+
+ Steps (4) and (5) are both done in user mode. The only interesting
+ thing is that when the user calls dat_cr_accept(), a ready EP must
+ be provided. If the EP was supplied by the PSP in the callback, it
+ must have a PZ associated with it and whatever other attributes
+ need to be set.
+
+6. dapl_cr_accept or dapl_cr_reject
+
+ For discussion purposes, we will follow the accept
+ path. dapl_cr_reject says you are done and there will be no further
+ events to deal with.
+
+ Assuming it accepts the connection for our example, the dapl code
+ will verify that an EP is in place and will deal with private data
+ that can be transmitted in a cr_accept call. The underlying
+ provider is invoked to complete this leg of the protocol.
+
+7. dapl_evd_connection_callback
+
+ An EVD callback is always a response to a connection oriented
+ request. As such, an EP is always present, and in fact is passed
+ into the upcall as the 'context' argument.
+
+ Connection requests may take an arbitrary amount of time, so the EP
+ is always checked for a running timer when the upcall is made. As
+ has been discussed above, if a timer expires before an upcall
+ occurs, the connection must be completely canceled such that there
+ is no upcall.
+
+ The event signifying completion of the connection is
+ DAT_CONNECTION_EVENT_ESTABLISHED, and it will move the EP to the
+ CONNECTED state and post this event on the connection EVD. Private
+ data will be copied to an area in the EP structure, which is
+ persistent.
+
+ At this point, the EP is connected and the application is free to
+ post DTOs.
+
+8i. RTU
+
+ This item is labeled "8i" as it is internal to the InfiniBand
+ implementation; it is not initiated by dapl. The final leg of a
+ connection is an RTU sent from the initiating node to the server
+ node, indicating the connection has been made successfully.
+
+ Other transports may have a different connection protocol.
+
+9. dapls_cr_callback
+
+ When the RTU arrives, an upcall is invoked with a
+ DAT_CONNECTION_EVENT_ESTABLISHED event, which will be posted to the
+ connection EVD event queue. The EP is moved to the CONNECTED
+ state.
+
+ There is no private data for dapl to deal with, even though some
+ transports may provide private data at each step of a connection.
+
+ The connection activity is occurring on a separate channel from the
+ EP, so this is inherently a racy operation. A correct
+ application will always post RECV buffers on an EP before
+ initiating a connection sequence, as it is entirely possible for
+ DTOs to arrive *before* the final connection event arrives.
+
+ The architecturally interesting feature of this exchange occurs
+ because of differences in the InfiniBand and the DAT connection
+ models, which are briefly outlined here.
+
+ InfiniBand maintains the original connecting objects throughout the
+ life of the connection. That is, we originally get a callback event
+ associated with the Service (DAT PSP) that is listening for
+ connection events. A QP will be connected but the callback event
+ will still be received on the Service. Later, a callback event will
+ occur for a DISCONNECT, and again the Service will be the object of
+ the connection. In the DAPL implementation, the Service will
+ provide the PSP that is registered as listening on that connection
+ qualifier.
+
+ The difference is that DAT has a PSP receive a connection event,
+ but subsequently hands all connection events off to an EP. After a
+ dat_cr_accept is issued, all connection/disconnection events occur
+ on the EP. DAT more closely follows the IP connection model.
+
+ To support the DAT model, a CR is maintained through the life of
+ the connection. There is exactly one CR per connection, but any
+ number of CRs may exist for any given PSP. CRs are maintained on a
+ linked list pointed to by the PSP structure. A lookup routine will
+ match the cm_handle, unique for each connection, with the
+ appropriate CR. This allows us to find the appropriate EP which
+ will be used to create an event to be posted to the user.
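+
+ The lookup just described can be sketched as a walk over a singly
+ linked list. The types below are simplified, hypothetical stand-ins
+ (cm_handle_t in particular is only a placeholder for the provider's
+ ib_cm_handle_t); the real structures carry far more state.
+

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long cm_handle_t;  /* placeholder for ib_cm_handle_t */

struct ep;                          /* opaque in this sketch */

/* Hypothetical, stripped-down CR record */
struct cr
{
    struct cr   *next;
    cm_handle_t  cm_handle;         /* unique for each connection */
    struct ep   *ep;                /* EP that receives the events */
};

/* Hypothetical, stripped-down PSP holding the list of CRs */
struct psp
{
    struct cr *cr_list;
};

/* Return the CR matching cm_handle, or NULL if there is none */
static struct cr *
psp_find_cr (const struct psp *psp, cm_handle_t cm_handle)
{
    struct cr *cur;

    assert (psp != NULL);

    for (cur = psp->cr_list; cur != NULL; cur = cur->next)
    {
        if (cur->cm_handle == cm_handle)
        {
            return cur;
        }
    }

    return NULL;
}
```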
+
+* dat_psp_destroy
+
+ It should be understood that the PSP will maintain all of the CR
+ records, and hence the PSP must persist until the final disconnect.
+ In the DAT model there is no association between a PSP and a
+ connected QP, so there is no reason not to destroy a PSP before the
+ final disconnect.
+
+ Because of the model mismatch we must preserve the PSP until the
+ final disconnect. If the user invokes dat_psp_destroy(), all of the
+ associations maintained by the PSP will be severed; but the PSP
+ structure itself remains as a container for the CR records. The PSP
+ structure maintains a simple count of CR records so we can easily
+ determine the final disconnect and release memory. Once a
+ disconnect event is received for a specific cm_handle, no further
+ events will be received and it is safe to discard the CR record.
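+
+ A rough sketch of this deferred release, using a plain counter, is
+ shown below. The structure and function names are hypothetical, and
+ the "released" flag merely marks the point at which the real code
+ would free the PSP memory.
+

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the PSP lifetime bookkeeping */
struct psp_life
{
    int listening;      /* cleared by dat_psp_destroy() */
    int cr_count;       /* CR records still attached */
    int released;       /* set where the memory would be freed */
};

static void
psp_maybe_release (struct psp_life *psp)
{
    if (!psp->listening && psp->cr_count == 0)
    {
        psp->released = 1;
    }
}

/* A connection request arrived and a CR was linked onto the PSP */
static void
psp_on_cr_created (struct psp_life *psp)
{
    assert (psp != NULL);
    psp->cr_count++;
}

/* The application called dat_psp_destroy() */
static void
psp_on_destroy (struct psp_life *psp)
{
    assert (psp != NULL);
    psp->listening = 0;
    psp_maybe_release (psp);
}

/* A disconnect event arrived for one cm_handle; its CR is discarded */
static void
psp_on_disconnect (struct psp_life *psp)
{
    assert (psp != NULL && psp->cr_count > 0);
    psp->cr_count--;
    psp_maybe_release (psp);
}
```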
+
+10. ...processing...
+
+ This is just a placeholder to show that applications actually do
+ something after making a connection. Then again, they might not.
+
+11. Either side issues a dat_ep_disconnect
+
+ dat_ep_disconnect() can be initiated by either side of a
+ connection. There are two kinds of disconnect flags that can be
+ passed in, but the final result is largely the same.
+
+ DAT_CLOSE_ABRUPT_FLAG will cause the connection to be immediately
+ terminated. In InfiniBand terms, the QP is immediately moved to the
+ ERROR state, and after some time it will be moved to the RESET
+ state.
+
+ DAT_CLOSE_GRACEFUL_FLAG will allow in-progress DTOs to complete.
+ The underlying implementation will first transition the QP to the
+ SQE state, before going to RESET.
+
+ Both cases are handled by the underlying CM; there is no extra work
+ for DAPL.
+
+12. dapls_cr_callback
+
+ A disconnect will arrive on the passive side of the connection
+ through dapls_cr_callback() with connection event
+ DAT_CONNECTION_EVENT_DISCONNECTED. With this event the EP lookup
+ code will free the CR associated with the connection, and may free
+ the PSP if it is no longer listening, indicating it has been freed
+ by the application.
+
+ The callback will create and post a
+ DAT_CONNECTION_EVENT_DISCONNECTED event for the application.
+
+13. dapl_evd_connection_callback
+
+ The active side of the connection will receive
+ DAT_CONNECTION_EVENT_DISCONNECTED as the connection event for
+ dapl_evd_connection_callback(), and will create and post a
+ DAT_CONNECTION_EVENT_DISCONNECTED event. Other than transitioning
+ the EP to the DISCONNECTED state, there is no further processing.
+
+
+Observe that there are a number of exception conditions resulting in a
+disconnect of the EP, most of which will generate unique DAT events
+for the application to deal with.
+
+
+* Addressing and Naming
+
+ The DAT Spec calls for a DAT_IA_ADDRESS_PTR to be an IP address,
+ either IPv4 or IPv6. It is in fact a struct sockaddr in most
+ systems. The dapl structures typically use IPv6 data types to
+ accommodate the largest possible addresses, but most implementations
+ use IPv4 formatted addresses.
+
+ InfiniBand uses a transport specific address known as a LID, which
+ typically is dynamically assigned by a Subnet Manager. Each HCA
+ also has a global address, similar to an Ethernet MAC address,
+ known as a GUID. ATS, mentioned above, is a mechanism using
+ InfiniBand infrastructure to map from GUID/LID to IP addresses. It
+ is not necessary for transports that use IP addresses natively,
+ such as Ethernet devices.
+
+ If a new implementation does not yet have a name service
+ infrastructure, the DAPL implementation provides a simple name
+ service facility under the #ifdef NO_NAME_SERVICE. This depends on
+ two things: valid IP addresses registered and available to standard
+ DNS system calls such as gethostbyname(); and a name/GID mapping
+ file.
+
+ IP addresses may be set up by system administrators or by a local
+ power-user simply by editing the values into the /etc/hosts file.
+ Setting IP addresses up in this manner is beyond the scope of this
+ document.
+
+ A simple mapping of names to GIDs is maintained in the ibhosts
+ file, currently located at /etc/dapl/ibhosts. The format of
+ the file is:
+
+ <IP name> 0x<GID Prefix> 0x<GUID>
+
+ For example:
+
+ dat-linux3-ib0p0 0xfe80000000000000 0x0001730000003d11
+ dat-linux3-ib0p1 0xfe80000000000000 0x0001730000003d11
+ dat-linux3-ib1 0xfe80000000000000 0x0001730000003d52
+ dat-linux5-ib0 0xfe80000000000000 0x0001730000003d91
+
+ And for each hostname, there must be an entry in the /etc/hosts file.
+ Each entry gives the IP address followed by the name, similar to:
+
+ 198.165.10.11 dat-linux3-ib0p0
+ 198.165.10.12 dat-linux3-ib0p1
+ 198.165.10.21 dat-linux3-ib1
+ 198.165.10.31 dat-linux5-ib0
+
+
+ In this example we have adopted the convention of naming each
+ InfiniBand interface by using the form
+
+ <node_name>-ib<device_number>[port_number]
+
+ In the above example we can see that the machine dat-linux3 has three
+ InfiniBand interfaces: two ports on the first HCA and another port on
+ a second. Because standard DNS naming is used, the convention for
+ identifying individual ports is completely up to the administrator.
+
+ The GID Prefix and GUID are obtained from the HCA and map a port on
+ the HCA: together they form the GID that is required by a CM to
+ connect with the remote node.
+
+ The simple name service builds an internal table after processing
+ the ibhosts file which contains IP addresses and GIDs. It will use
+ the standard getaddrinfo() function to obtain IP address
+ information.
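+
+ For illustration, one line of the ibhosts file could be parsed as
+ sketched below. The names ibhosts_entry, ibhosts_parse_line and
+ ibhosts_make_gid are hypothetical, and placing the prefix and GUID
+ into the 16 byte GID in big-endian order is an assumption of this
+ sketch rather than a description of the actual DAPL code.
+

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define IBHOSTS_NAME_MAX 64

/* Hypothetical table entry built from one ibhosts line */
struct ibhosts_entry
{
    char     name[IBHOSTS_NAME_MAX];
    uint64_t gid_prefix;
    uint64_t guid;
};

/* Parse "<IP name> 0x<GID Prefix> 0x<GUID>"; 0 on success, -1 if malformed */
static int
ibhosts_parse_line (const char *line, struct ibhosts_entry *entry)
{
    int fields;

    assert (line != NULL && entry != NULL);

    /* The %x conversion accepts the leading "0x" by itself */
    fields = sscanf (line, "%63s %" SCNx64 " %" SCNx64,
                     entry->name, &entry->gid_prefix, &entry->guid);

    return (fields == 3) ? 0 : -1;
}

/* Assemble the 128 bit GID: prefix in bytes 0..7, GUID in bytes 8..15 */
static void
ibhosts_make_gid (const struct ibhosts_entry *entry, unsigned char gid[16])
{
    int i;

    for (i = 0; i < 8; i++)
    {
        gid[i]     = (unsigned char) (entry->gid_prefix >> (56 - 8 * i));
        gid[8 + i] = (unsigned char) (entry->guid >> (56 - 8 * i));
    }
}
```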
+
+ When an application invokes dat_ep_connect(), the
+ DAT_IA_ADDRESS_PTR will be looked up in the table and the
+ destination GID established if a match is found. If the address is
+ not found, the user must first add the name to the ibhosts file.
+
+ With a valid GID for the destination node, the underlying CM is
+ invoked to make a connection.
+
+* Connection Management
+
+ Getting a working CM has taken some time; in fact, the DAPL project
+ was nearly complete by the time a CM was available. In order to
+ make progress, a connection hack was introduced that allows
+ specific connections to take place. This is noted in the code by
+ the CM_BUSTED #define.
+
+ CM_BUSTED takes the place of a CM and will manually transition a QP
+ through the various states to connect: INIT->RTR->RTS. It will also
+ disconnect the connection, although the Torrent implementation
+ simply destroys the QP and recreates a new one rather than
+ transitioning through the typical disconnect states (which didn't
+ work on early IB implementations).
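+
+ The manual state walk can be pictured with the small sketch below.
+ The enum and function names are hypothetical; the real code issues
+ a provider specific modify-QP call for each transition instead of
+ just updating a variable.
+

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of QP states, for illustration only */
enum qp_state
{
    QP_RESET,
    QP_INIT,
    QP_RTR,
    QP_RTS,
    QP_ERROR
};

/* Next state on the connect path, or the same state if there is none */
static enum qp_state
qp_next_connect_state (enum qp_state cur)
{
    switch (cur)
    {
    case QP_INIT:
        return QP_RTR;
    case QP_RTR:
        return QP_RTS;
    default:
        return cur;
    }
}

/*
 * Walk the QP to RTS. Returns the number of transitions made, or -1 if
 * the starting state is not on the INIT->RTR->RTS path.
 */
static int
qp_fake_connect (enum qp_state *state)
{
    enum qp_state next;
    int           steps;

    assert (state != NULL);

    steps = 0;
    while (*state != QP_RTS)
    {
        next = qp_next_connect_state (*state);
        if (next == *state)
        {
            return -1;
        }
        *state = next;
        steps++;
    }

    return steps;
}
```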
+
+ CM_BUSTED makes some assumptions about the remote end of the
+ connection as no real information is exchanged. The ibapi
+ implementation assumes both HCAs have the same LID, which implies
+ there is no SM running. The vapi implementation assumes the LIDs
+ are 0 and 1. Depending on the hardware, the LID value may in fact
+ not make any difference. This code does not set the Global Route
+ Header (GRH), which would cause the InfiniBand chip to check the
+ LID information carefully.
+
+ The QP number is assumed to be identical on both ends of the
+ connection, or differing by 1 if this is a loopback. There is an
+ environment variable that will be read at initialization time if
+ you are configured with a loopback; this value is checked when
+ setting up a QP. The obvious downside to this scheme is that
+ applications must stay synchronized in their QP usage or the
+ initial exchange will fail as they are not truly connected.
+
+ Add to this the limitation that HCAs must be connected in
+ Point-to-Point topology or in a loopback. Without a GRH it will not
+ work in a fabric. Again, using an SM will not work when CM_BUSTED
+ is enabled.
+
+ Despite these shortcomings, CM_BUSTED has proven very useful and
+ will remain in the code for a while in order to aid development
+ groups with new hardware and software. It is a hack to be sure, but
+ it is relatively well isolated.
+
+
+-- Notes on Disconnecting
+
+An EP can only be disconnected if it is connected or unconnected; you
+cannot disconnect 'in progress' connections. An 'in progress'
+connection may in fact time out, but the DAT Spec does not allow you
+to 'kill' it. DAPL will use the CM interface to disconnect from the
+remote node; this of course results in an asynchronous callback
+notifying the application the disconnect is complete.
+
+Disconnecting an unconnected EP is currently the only way to remove
+pending RECV operations from the EP. The DAT spec notes that all
+DTOs must be removed from an EP before it can be deallocated, yet
+there is no explicit interface to remove pending RECV DTOs. The user
+will disconnect an unconnected EP to force the pending operations off
+of the queue, resulting in DTO callbacks indicating an error. The
+underlying InfiniBand implementation will cause the correct behavior
+to result. When doing this operation the DAT_CLOSE flag is ignored;
+DAPL will instruct the provider layer to abruptly disconnect the QP.
+
+As has been noted previously, specifying DAT_CLOSE_ABRUPT_FLAG as the
+disconnect completion flag will cause the CM implementation to
+transition the QP to the ERROR state to abort all operations, and then
+transition to the RESET state; if the flag is DAT_CLOSE_GRACEFUL_FLAG,
+the CM will first move to the SQE state and allow all pending I/Os to
+drain before moving to the RESET state. In either case, DAPL only
+needs to know that the QP is now in the RESET state, as it will need
+to be transitioned to the INIT state before it can be used again.
+
+======================================================================
+Data Transfer Operations (DTOs)
+======================================================================
+
+The DTO code is a straightforward translation of the DAT_LMR_TRIPLET
+to an InfiniBand work request. Unfortunately, IB does not specify what
+a work request looks like so this tends to be very vendor specific
+code. Each provider will supply a routine for this operation.
+
+InfiniBand allows the DTO to attach a unique 64 bit work_req_id to
+each work request. The DAPL implementation will install a pointer to a
+DAPL_DTO_COOKIE in this field. Observe that a DAPL_DTO_COOKIE is not
+the same as the user DAT_DTO_COOKIE; indeed, the former has a pointer
+field pointing to the latter. Different values will be placed in the
+cookie, according to the type of operation it is and the type of data
+required by its completion event. This is a simple scheme to bind DAPL
+data to the DTO and associated completion callback. Each DTO has a
+unique cookie associated with it.
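+
+The round trip of a cookie pointer through the 64 bit work request id
+can be sketched as below. The structures are simplified, hypothetical
+stand-ins for DAPL_DTO_COOKIE and DAT_DTO_COOKIE, and the helper names
+are invented for this example.
+

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the user's DAT_DTO_COOKIE */
struct user_cookie
{
    uint64_t as_64;
};

enum cookie_type
{
    COOKIE_SEND,
    COOKIE_RECV
};

/* Stand-in for DAPL_DTO_COOKIE: internal data plus the user's cookie */
struct dto_cookie
{
    enum cookie_type    type;
    struct user_cookie *user;
};

/* Value to install in the work request id when posting the DTO */
static uint64_t
cookie_to_wr_id (struct dto_cookie *cookie)
{
    assert (sizeof (uintptr_t) <= sizeof (uint64_t));
    return (uint64_t) (uintptr_t) cookie;
}

/* Recover the cookie from the id reported with the completion */
static struct dto_cookie *
wr_id_to_cookie (uint64_t wr_id)
{
    return (struct dto_cookie *) (uintptr_t) wr_id;
}
```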
+
+Observe that an InfiniBand work_request remains under control of the
+user, and when a post operation occurs the underlying implementation
+will copy data out of the work_request into a hardware based
+structure. Further, no application can perform a DTO operation on the
+same EP at the same time according to the thread guarantees mandated
+by the specification. This allows us to provide a recv_iov and a
+send_iov in the EP structure for all DTO operations, eliminating any
+malloc operations from this critical path.
+
+The underlying provider implementation will invoke
+dapl_evd_dto_callback() upon completion of DTO operations.
+dapl_evd_dto_callback() is the asynchronous completion for a DTO and
+will create and post an event for the user. Much of this callback is
+concerned with managing error completions.
+
+
+======================================================================
+Data Structure
+======================================================================
+
+The main data structure for an EndPoint is the dapl_ep structure,
+defined in include/dapl.h. The reference implementation uses the
+InfiniBand QP to maintain hardware state, providing a relatively
+simple mapping.
+
+/* DAPL_EP maps to DAT_EP_HANDLE */
+struct dapl_ep
+{
+ DAPL_HEADER header;
+ /* What the DAT Consumer asked for */
+ DAT_EP_PARAM param;
+
+ /* The RC Queue Pair (IBM OS API) */
+ ib_qp_handle_t qp_handle;
+ unsigned int qpn; /* qp number */
+ ib_qp_state_t qp_state;
+
+ /* communications manager handle (IBM OS API) */
+ ib_cm_handle_t cm_handle;
+ /* store the remote IA address here, reference from the param
+ * struct which only has a pointer, no storage
+ */
+ DAT_SOCK_ADDR6 remote_ia_address;
+
+ /* For passive connections we maintain a back pointer to the CR */
+ void * cr_ptr;
+
+ /* pointer to connection timer, if set */
+ struct dapl_timer_entry *cxn_timer;
+
+ /* private data container */
+ DAPL_PRIVATE private;
+
+ /* DTO data */
+ DAPL_ATOMIC req_count;
+ DAPL_ATOMIC recv_count;
+
+ DAPL_COOKIE_BUFFER req_buffer;
+ DAPL_COOKIE_BUFFER recv_buffer;
+
+ ib_data_segment_t *recv_iov;
+ DAT_COUNT recv_iov_num;
+
+ ib_data_segment_t *send_iov;
+ DAT_COUNT send_iov_num;
+#ifdef DAPL_DBG_IO_TRC
+ int ibt_dumped;
+ struct io_buf_track *ibt_base;
+ DAPL_RING_BUFFER ibt_queue;
+#endif /* DAPL_DBG_IO_TRC */
+};
+
+The simple explanation of the fields in the dapl_ep structure follows:
+
+header: The dapl object header, common to all dapl objects.
+ It contains a lock field, links to appropriate lists, and
+ handles specifying the IA domain it is a part of.
+
+param: The bulk of the EP attributes called out in the DAT
+ specification are maintained in the DAT_EP_PARAM
+ structure. All internal references to these fields
+ use this structure.
+
+qp_handle: Handle to the underlying InfiniBand provider implementation
+ for a QP. All EPs are mapped to an InfiniBand QP.
+
+qpn: Number of the QP as returned by the underlying provider
+ implementation. Primarily useful for debugging.
+
+qp_state: Current state of the QP. The values of this field indicate
+ if a QP is bound to the EP, and the current state of a
+ QP.
+
+cm_handle: Handle to the IB provider's CMA (Connection Manager Agent).
+ Used for CM operations that connect and disconnect.
+
+remote_ia_address:
+ Remote IP address of the connection. Only valid after the user
+ has asked for it.
+
+cr_ptr: Attaches the EP to the appropriate CR. Assigned on the passive
+ side of a connection in cr_accept. It is used when an abrupt
+ disconnect is invoked by the app, and we need to 'fake' a
+ callback. It is also used in clean up of an EP and removing
+ connection elements from the associated PSP.
+
+cxn_timer: Pointer to a timer entry, used as a token to set and remove
+ timers.
+
+private: Local Private data area on the active side of a connection.
+
+req_count: Count of outstanding request DTO operations, including memory
+ ops. Atomically incremented/decremented.
+
+recv_count:Count of outstanding receive DTO operations. Atomically
+ incremented/decremented.
+
+req_buffer:Ring buffer of request cookies.
+
+recv_buffer:
+ Ring buffer of receive cookies.
+
+recv_iov: Storage for provider receive work request.
+
+recv_iov_num:
+ Maximum number of receive IOVs. Number is obtained from
+ the provider in a query.
+
+send_iov: Storage for provider send work request.
+
+send_iov_num:
+ Maximum number of send IOVs. Number is obtained from the
+ provider in a query.
+
+ibt_dumped:DTO debugging aid. Boolean value to control how often DTO
+ tracing data is printed.
+
+ibt_base: DTO debugging aid. Base address of DTO ring buffer containing
+ information on DTO processing.
+
+ibt_queue: Ring buffer containing information on DTO processing.
+
+
+** Debug
+
+The Reference Implementation includes a trace facility that allows a
+developer to see all DTO operations, specifically to catch those that
+are not completing as expected. The DAPL_DBG_IO_TRC conditional will
+enable this code.
+
+A simple ring buffer is used to account for all outstanding DTO
+traffic. The buffer may be dumped when DTOs are not getting
+completions, with enough data to aid the developer to determine where
+things went wrong.
+
+It is implemented as a ring buffer as there are often bugs in this
+part of a provider's implementation which do not manifest until
+intensive data exchange has occurred for many hours.
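+
+A minimal sketch of a fixed size trace ring is shown below. It is not
+the DAPL_RING_BUFFER implementation: the names are hypothetical, the
+ring is deliberately tiny, and overwriting the oldest entry when the
+ring is full is an assumption of this sketch. It only illustrates why
+a ring keeps memory use bounded during a run lasting many hours.
+

```c
#include <assert.h>
#include <stddef.h>

#define TRACE_SLOTS 4   /* tiny, for illustration */

/* Hypothetical trace ring holding one id per DTO */
struct trace_ring
{
    unsigned long slot[TRACE_SLOTS];
    size_t        next;     /* index of the next slot to write */
    size_t        count;    /* valid entries, at most TRACE_SLOTS */
};

static void
trace_init (struct trace_ring *ring)
{
    ring->next = 0;
    ring->count = 0;
}

/* Record an entry, overwriting the oldest one when the ring is full */
static void
trace_record (struct trace_ring *ring, unsigned long id)
{
    ring->slot[ring->next] = id;
    ring->next = (ring->next + 1) % TRACE_SLOTS;
    if (ring->count < TRACE_SLOTS)
    {
        ring->count++;
    }
}

/* Fetch the i-th retained entry; i == 0 is the oldest */
static unsigned long
trace_get (const struct trace_ring *ring, size_t i)
{
    size_t start;

    assert (i < ring->count);

    start = (ring->next + TRACE_SLOTS - ring->count) % TRACE_SLOTS;
    return ring->slot[(start + i) % TRACE_SLOTS];
}
```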
--- /dev/null
+ DAPL Environment Guide v. 0.01
+ ------------------------------
+
+The following environment variables affect the behavior of the DAPL
+provider library:
+
+
+DAPL_DBG_TYPE
+-------------
+
+ Value specifies which parts of the registry will print debugging
+ information. Valid values are
+
+ DAPL_DBG_TYPE_ERR = 0x0001
+ DAPL_DBG_TYPE_WARN = 0x0002
+ DAPL_DBG_TYPE_EVD = 0x0004
+ DAPL_DBG_TYPE_CM = 0x0008
+ DAPL_DBG_TYPE_EP = 0x0010
+ DAPL_DBG_TYPE_UTIL = 0x0020
+ DAPL_DBG_TYPE_CALLBACK = 0x0040
+ DAPL_DBG_TYPE_DTO_COMP_ERR = 0x0080
+ DAPL_DBG_TYPE_API = 0x0100
+ DAPL_DBG_TYPE_RTN = 0x0200
+ DAPL_DBG_TYPE_EXCEPTION = 0x0400
+
+ or any combination of these. For example you can use 0xC to get both
+ EVD and CM output.
+
+ Example: setenv DAPL_DBG_TYPE 0xC
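+
+ How such a mask might be read and tested is sketched below. The
+ helper names and the shortened DBG_TYPE_* macros are hypothetical;
+ only the numeric values come from the list above, and the actual
+ parsing done by the library may differ.
+

```c
#include <assert.h>
#include <stdlib.h>

#define DBG_TYPE_EVD 0x0004UL
#define DBG_TYPE_CM  0x0008UL
#define DBG_TYPE_EP  0x0010UL

/*
 * Convert the text of the environment variable into a bit mask. Base 0
 * lets strtoul accept "0xC" as well as decimal. A missing variable
 * (NULL) yields an empty mask.
 */
static unsigned long
dbg_parse_mask (const char *text)
{
    if (text == NULL)
    {
        return 0;
    }

    return strtoul (text, NULL, 0);
}

/* Non-zero if any bit of type is enabled in mask */
static int
dbg_enabled (unsigned long mask, unsigned long type)
{
    assert (type != 0);
    return (mask & type) != 0;
}
```

+ A caller would typically pass getenv ("DAPL_DBG_TYPE") to
+ dbg_parse_mask() once at initialization and keep the result.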
+
+
+DAPL_DBG_DEST
+-------------
+
+ Value sets the output destination. Valid values are
+
+ DAPL_DBG_DEST_STDOUT = 0x1
+ DAPL_DBG_DEST_SYSLOG = 0x2
+ DAPL_DBG_DEST_ALL = 0x3
+
+ For example, 0x3 will output to both stdout and the syslog.
+
--- /dev/null
+ DAPL Event Subsystem Design v. 0.96
+ -----------------------------------
+
+=================
+Table of Contents
+=================
+
+* Table of Contents
+* Referenced Documents
+* Goals
+ + Initial Goals
+ + Later Goals
+* Requirements, constraints, and design inputs
+ + DAT Specification Constraints
+ + Object and routine functionality, in outline
+ + Detailed object and routine specification
+ + Synchronization
+ + IBM Access API constraints
+ + Nature of DAPL Event Streams in IBM Access API.
+ + Nature of access to CQs
+ + Operating System (Pthread) Constraints
+ + Performance model
+ + A note on context switches
+* DAPL Event Subsystem Design
+ + OS Proxy Wait Object
+ + Definition
+ + Suggested Usage
+ + Event Storage
+ + Synchronization
+ + EVD Synchronization: Locking vs. Producer/Consumer queues
+ + EVD Synchronization: Waiter vs. Callback
+ + CNO Synchronization
+ + Inter-Object Synchronization
+ + CQ -> CQEH Assignments
+ + CQ Callbacks
+ + Dynamic Resizing of EVDs
+ + Structure and pseudo-code
+ + EVD
+ + CNO
+* Future directions
+ + Performance improvements: Reducing context switches
+ + Performance improvements: Reducing copying of event data
+ + Performance improvements: Reducing locking
+ + Performance improvements: Reducing atomic operations
+ + Performance improvements: Incrementing concurrency.
+
+====================
+Referenced Documents
+====================
+
+uDAPL: User Direct Access Programming Library, Version 1.0. Published
+6/21/2002. http://www.datcollaborative.org/uDAPL_062102.pdf.
+Referred to in this document as the "DAT Specification".
+
+InfiniBand Access Application Programming Interface Specification,
+Version 1.2, 4/15/2002. In DAPL SourceForge repository at
+doc/api/access_api.pdf. Referred to in this document as the "IBM
+Access API Specification".
+
+=====
+Goals
+=====
+
+Initial goals
+-------------
+-- Implement the dat_evd_* calls described in the DAT Specification (except
+ for dat_evd_resize).
+
+-- The implementation should be as portable as possible, to facilitate
+ HCA vendors' efforts to implement vendor-specific versions of DAPL.
+
+Later goals
+-----------
+-- Examine various possible performance optimizations. This document
+ lists potential performance improvements, but the specific
+ performance improvements implemented should be guided by customer
+ requirements.
+
+-- Implement the dat_cno_* calls described in the DAT 1.0 spec
+
+-- Implement OS Proxy Wait Objects.
+
+-- Implement dat_evd_resize
+
+Non-goals
+---------
+-- Thread safe implementation
+
+============================================
+Requirements, constraints, and design inputs
+============================================
+
+DAT Specification Constraints
+-----------------------------
+
+-- Object and routine functionality, in outline
+
+The following section summarizes the requirements of the DAT
+Specification in a form that is simpler to follow for purposes of
+implementation. This section presumes the reader has read the DAT
+Specification with regard to events.
+
+Events are delivered to DAPL through Event Streams. Each Event Stream
+targets a specific Event Descriptor (EVD); multiple Event Streams may
+target the same EVD. The Event Stream<->EVD association is
+effectively static; it may not be changed after the time at which
+events start being delivered. The DAT Consumer always retrieves
+events from EVDs. EVDs are intended to be 1-to-1 associated with the
+"native" event convergence object on the underlying transport. For
+InfiniBand, this would imply a 1-to-1 association between EVDs and
+CQs.
+
+EVDs may optionally have an associated Consumer Notification Object
+(CNO). Multiple EVDs may target the same CNO, and the EVD<->CNO
+association may be dynamically altered. The DAT Consumer may wait for
+events on either EVDs or CNOs; if there is no waiter on an EVD and it
+is enabled, its associated CNO is triggered on event arrival. An EVD
+may have only a single waiter; a CNO may have multiple waiters.
+Triggering of a CNO is "sticky"; if there is no waiter on a CNO when
+it is triggered, the next CNO waiter will return immediately.
+
+CNOs may have an associated OS Proxy Wait Object, which is signaled
+when the CNO is triggered.
+
+-- Detailed object and routine specification
+
+Individual events may be "signaling" or "non-signaling", depending
+on the interaction of:
+ * Receive completion endpoint attributes
+ * Request completion endpoint attributes
+ * dat_ep_post_send completion flags
+ * dat_ep_post_recv completion flags
+The nature of this interaction is outside the scope of this document;
+see the DAT Specification 1.0 (or, failing that, clarifications in a
+later version of the DAT Specification).
+
+A call to dat_evd_dequeue returns successfully if there are events on
+the EVD to dequeue. A call to dat_evd_wait blocks if there are fewer
+events present on the EVD than the value of the "threshold" parameter
+passed in the call. Such a call to dat_evd_wait will be awoken by the
+first signaling event arriving on the EVD that raises the EVD's event
+count to >= the threshold value specified by dat_evd_wait().
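+
+The wake-up rule can be expressed as a small predicate. The sketch
+below is single-threaded and uses hypothetical names (evd_wait_state,
+evd_event_arrived); it shows only the decision, not the blocking or
+locking a real implementation needs.
+

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical view of the EVD state that matters to a waiter */
struct evd_wait_state
{
    unsigned int event_count;       /* events currently on the EVD */
    unsigned int threshold;         /* from the dat_evd_wait() call */
    int          waiter_present;    /* a thread is blocked in wait */
};

/*
 * Account for one arriving event. Returns 1 if the blocked waiter
 * should be woken: the event must be signaling and must bring the
 * count up to the threshold.
 */
static int
evd_event_arrived (struct evd_wait_state *evd, int signaling)
{
    assert (evd != NULL);

    evd->event_count++;

    if (!evd->waiter_present)
    {
        return 0;
    }
    if (!signaling)
    {
        return 0;
    }

    return evd->event_count >= evd->threshold;
}
```

+Note that a non-signaling event that reaches the threshold does not
+wake the waiter; the next signaling event does.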
+
+If a signaling event arrives on an EVD that does not have a waiter,
+and that EVD is enabled, the CNO associated with the EVD will be
+triggered.
+
+A CNO has some number of associated waiters, and an optional
+associated OS Proxy Wait Object. When a CNO is triggered, two things
+happen independently:
+ * The OS Proxy Wait Object associated with the CNO, if any, is
+ signaled, given the handle of an EVD associated with the CNO
+ that has an event on it, and disassociated from the CNO.
+ * If:
+ * there are one or more waiters associated with the
+ CNO, one of the waiters is unblocked and given the
+ handle of an EVD associated with the CNO that has an
+ event on it.
+ * there are no waiters associated with the CNO, the
+ CNO is placed in the triggered state.
+
+When a thread waits on a CNO, if:
+ * The CNO is in the untriggered state, the waiter goes to
+ sleep pending the CNO being triggered.
+ * The CNO is in the triggered state, the waiter returns
+ immediately with the handle of an EVD associated with the
+ CNO that has an event on it, and the CNO is moved to the
+ untriggered state.
+
+Note specifically that the signaling of the OS Proxy Wait Object is
+independent of the CNO moving into the triggered state or not; it
+occurs based on the state transition from Not-Triggered to Triggered.
+Signaling the OS Proxy Wait Object only occurs when a CNO is
+triggered. In contrast, waiters on a CNO are unblocked whenever the
+CNO is in the triggered *state*, and that state is sticky.
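+
+The trigger and wait rules above can be modeled as follows. This is a
+single-threaded sketch with hypothetical names; counters stand in for
+actually signaling the OS Proxy Wait Object and unblocking a thread,
+and the EVD handle returned to the waiter is omitted.
+

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of CNO state */
struct cno
{
    int triggered;      /* sticky triggered state */
    int waiters;        /* threads currently blocked in wait */
    int has_proxy;      /* an OS Proxy Wait Object is associated */
    int proxy_signals;  /* times the proxy object was signaled */
    int woken;          /* times a blocked waiter was released */
};

static void
cno_init (struct cno *cno, int has_proxy)
{
    cno->triggered = 0;
    cno->waiters = 0;
    cno->has_proxy = has_proxy;
    cno->proxy_signals = 0;
    cno->woken = 0;
}

static void
cno_trigger (struct cno *cno)
{
    assert (cno != NULL);

    /* Independent action 1: signal and disassociate the proxy object */
    if (cno->has_proxy)
    {
        cno->proxy_signals++;
        cno->has_proxy = 0;
    }

    /* Independent action 2: wake one waiter, or remember the trigger */
    if (cno->waiters > 0)
    {
        cno->waiters--;
        cno->woken++;
    }
    else
    {
        cno->triggered = 1;
    }
}

/* Returns 1 if the wait returns immediately, 0 if the caller blocks */
static int
cno_wait (struct cno *cno)
{
    assert (cno != NULL);

    if (cno->triggered)
    {
        cno->triggered = 0;
        return 1;
    }

    cno->waiters++;
    return 0;
}
```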
+
+Note also that which EVD is returned to the caller in a CNO wait is
+not specified; it may be any EVD associated with the CNO on which an
+event arrival might have triggered the CNO. This includes the
+possibility that the EVD returned to the caller may not have any
+events on it, if the dat_cno_wait() caller raced with a separate
+thread doing a dat_evd_dequeue().
+
+The DAT Specification is silent as to what behavior is to be expected
+from an EVD after an overflow error has occurred on it. Thus this
+design will also be silent on that issue.
+
+The DAT Specification has minimal requirements on inter-Event Stream
+ordering of events. Specifically, any connection events must precede
+(in consumption order) any DTO Events for the same endpoint.
+Similarly, any successful disconnection events must follow any DTO
+Events for an endpoint.
+
+-- Synchronization
+
+Our initial implementation is not thread safe. This means that we do
+not need to protect against the possibility of multiple simultaneous
+user calls occurring on the same object (EVD, CNO, EP, etc.); that is
+the responsibility of the DAT Consumer.
+
+However, there are synchronization guards that we do need to protect
+against because the DAT Consumer cannot. Specifically, since the user
+cannot control the timing of callbacks from the IBM Access API
+Implementation, we need to protect against possible collisions between
+user calls and such callbacks. We also need to make sure that such
+callbacks do not conflict with one another in some fashion, possibly
+by assuring that they are single-threaded.
+
+In addition, for the sake of simplicity in the user interface, I have
+defined "not thread safe" as "It is the DAT Consumer's responsibility
+to make sure that all calls against an individual object do not
+conflict". This does, however, suggest that the DAPL library needs to
+protect against calls to different objects that may result in
+collisions "under the covers" (e.g. a call on an EVD vs. a call on its
+associated CNO).
+
+So our synchronization requirements for this implementation are:
+ + Protection against collisions between user calls and IBM
+ Access API callbacks.
+ + Avoidance of or protection against collisions between
+ different IBM Access API callbacks.
+ + Protection against collisions between user calls targeted at
+ different DAT objects.
+
+IBM Access API constraints
+--------------------------
+
+-- Nature of DAPL Event Streams in IBM Access API
+
+DAPL Event Streams are delivered through the IBM Access API in two fashions:
+ + Delivery of a completion to a CQ.
+ + A callback is made directly to a previously registered DAPL
+ function with parameters describing the event.
+(Software events are not delivered through the IBM Access API).
+
+The delivery of a completion to a CQ may spark a call to a previously
+registered callback depending on the attributes of the CQ and the
+reason for the completion. Event Streams that fall into this class
+are:
+ + Send data transport operation
+ + Receive data transport operation
+ + RMR bind
+
+The Event Streams that are delivered directly through an IBM Access API
+callback include:
+ + Connection request arrival
+ + Connection resolution (establishment or rejection)
+ + Disconnection
+ + Asynchronous errors
+
+Callbacks associated with CQs are further structured by membership in a
+particular CQ Event Handling (CQEH) domain (specified at CQ creation
+time). All CQ callbacks within a CQEH domain are serviced by the same
+thread, and hence will not collide.
+
+In addition, all connection-related callbacks are serviced by the same
+thread, and will not collide. Similarly, all asynchronous error
+callbacks are serviced by the same thread, and will not collide.
+Collisions between any pair of a CQEH domain, a connection callback,
+and an asynchronous error callback are possible.
+
+-- Nature of access to CQs
+
+The only probe operation the IBM Access API allows on CQs is
+dequeuing. The only notification operation the IBM Access API
+supports for CQs is calling a previously registered callback.
+
+Specifically, the IB Consumer may not query the number of completions
+on the CQ; the only way to find out the number of completions on a CQ
+is through dequeuing them all. It is not possible to block waiting
+on a CQ for the next completion to arrive, with or without a
+threshold parameter.
+
+Operating System Constraints
+----------------------------
+
+The initial platform for implementation of DAPL is RedHat Linux 7.2 on
+Intel hardware. On this platform, inter-thread synchronization is
+provided by a POSIX Pthreads implementation. From the viewpoint of
+DAPL, the details of the Pthreads interface are platform specific.
+However, Pthreads is a very widely used threading library, common on
+almost all Unix variants (though not used on the different variations
+of Microsoft Windows(tm)). In addition, RedHat Linux 7.2 provides
+POSIX thread semaphore operations (e.g. see sem_init(3)), which are
+not normally considered part of pthreads.
+
+Microsoft Windows(tm) provides many synchronization primitives,
+including mutual exclusion locks, and semaphores.
+
+DAPL defines an internal API (not exposed to the consumer), through
+which it accesses Operating System Dependent services; this is called
+the OSD API. It is intended that this layer contain all operating
+system dependencies, and that porting DAPL to a new operating system
+should only require changes to this layer.
+
+We have chosen to define the synchronization interfaces established at
+this layer in terms of two specific objects: mutexes and semaphores
+with a timeout on waiting. Mutexes provide mutual exclusion in a way
+that is common to all known operating systems. The functionality of
+semaphores also exists on most known operating systems, though the
+semaphores provided by POSIX do not provide timeout capabilities.
+We chose semaphores over condition variables for three reasons.
+First, in contrast to Condition Variables (the native pthreads
+waiting/signalling object), operations on semaphores do not require
+use of other synchronization variables (i.e. mutexes). Second, it is
+fairly easy to emulate semaphores using condition variables, and it
+is not simple to emulate condition variables using semaphores. And
+third, there are some anticipated
+platforms for DAPL that implement condition variables in relation to
+some types of locks but not others, and hence constrain appropriate
+implementation choices for a potential DAPL interface modeled after
+condition variables.
+
+Implementation of the DAPL OS Wait Objects will initially be based on
+condition variables (requiring the use of an internal lock) since
+POSIX semaphores do not provide a needed timeout capability. However,
+if improved performance is required, a helper thread could be created
+that arranges to signal waiting semaphores when timeouts have
+expired. This is a potential future (or vendor) optimization.
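+
+The approach described above can be sketched roughly as follows. This
+is a minimal illustration, not the reference implementation: the names
+wait_obj_t, wait_obj_init(), wait_obj_signal() and wait_obj_wait() are
+hypothetical, and the sketch assumes a POSIX system that provides
+clock_gettime() and pthread_cond_timedwait().

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L     /* for clock_gettime() */
#endif
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical wait object: a counting semaphore with a timed wait. */
typedef struct
{
    pthread_mutex_t lock;   /* the internal lock the text mentions */
    pthread_cond_t  cond;
    unsigned        count;  /* signals posted but not yet consumed */
} wait_obj_t;

int wait_obj_init(wait_obj_t *w)
{
    int rc;

    w->count = 0;
    rc = pthread_mutex_init(&w->lock, NULL);
    if (rc != 0)
    {
        return rc;
    }
    return pthread_cond_init(&w->cond, NULL);
}

void wait_obj_signal(wait_obj_t *w)
{
    pthread_mutex_lock(&w->lock);
    w->count++;
    pthread_cond_signal(&w->cond);
    pthread_mutex_unlock(&w->lock);
}

/* Returns 0 if a signal was consumed, ETIMEDOUT if timeout_ms elapsed. */
int wait_obj_wait(wait_obj_t *w, long timeout_ms)
{
    struct timespec deadline;
    int rc;

    rc = 0;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L)
    {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&w->lock);
    /* Loop to absorb spurious wakeups; stop on a signal or an error. */
    while (w->count == 0 && rc == 0)
    {
        rc = pthread_cond_timedwait(&w->cond, &w->lock, &deadline);
    }
    if (w->count > 0)
    {
        w->count--;
        rc = 0;
    }
    pthread_mutex_unlock(&w->lock);
    return rc;
}
```

+The count makes a signal "sticky": a signal posted before the waiter
+arrives is not lost, which is the semaphore behavior the OSD layer
+asks for.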
+
+Performance Model
+-----------------
+One constraint on the DAPL Event Subsystem implementation is that it
+should perform as well as possible. We define "as well as possible"
+by listing the characteristics of this subsystem that will affect its
+performance most strongly. In approximate order of importance, these
+are:
+ + The number of context switches on the critical path.
+ + The amount of copying on the critical path.
+ + The base cost of locking (assuming no contention) on the
+ critical path. This is proportional to the number of locks
+ taken.
+ + The amount of locking contention expected. We make a
+ simplifying assumption and take this as the number of cycles
+ for which we expect to hold locks on the critical path.
+ + The number of "atomic" bus operations executed (these take
+ more cycles than normal operations, as they require locking
+ the bus).
+
+We obviously wish to minimize all of these costs.
+
+-- A note on context switches
+
+In general, it's difficult to minimize context switches in a user
+space library directly communicating with a hardware board. This is
+because context switches, by their nature, have to go through the
+operating system, but the information about which thread to wake up
+and whether to wake it up is generally in user space. In addition,
+the IBM Access API delivers all Event Streams as callbacks in user
+context (as opposed to, for example, allowing a thread to block within
+the API waiting for a wakeup). For this reason, the default sequence
+of events for a wakeup generated from the hardware is:
+ * Hardware interrupts the main processor.
+ * Interrupt thread schedules a user-level IBM Access API
+ provider service thread parked in the kernel.
+ * Provider service thread wakes up the sleeping user-level
+ event DAT implementation thread.
+This implies that any wakeup will involve three context switches.
+This could be reduced by one if there were a way for user threads to
+block in the kernel; we could then skip the user-level provider thread.
+
+===========================
+DAPL Event Subsystem Design
+===========================
+
+
+OS Proxy Wait Object
+--------------------
+
+The interface and nature of the OS Proxy Wait Object is specified in
+the uDAPL v. 1.0 header files as a DAT_OS_WAIT_PROXY_AGENT via the
+following defines:
+
+typedef void (*DAT_AGENT_FUNC)
+ (
+ DAT_PVOID, /* instance data */
+ DAT_EVD_HANDLE /* Event Dispatcher*/
+ );
+
+typedef struct dat_os_wait_proxy_agent
+ {
+ DAT_PVOID instance_data;
+ DAT_AGENT_FUNC proxy_agent_func;
+ } DAT_OS_WAIT_PROXY_AGENT;
+
+In other words, an OS Proxy Wait Object is a (function, data) pair,
+and signalling the OS Proxy Wait Object is a matter of calling the
+function on the data and an EVD handle associated with the CNO.
+The nature of that function and its associated data is completely up
+to the uDAPL consumer.
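+
+As an illustration of the (function, data) pair, the sketch below
+signals a proxy agent by calling its function with its data and an
+EVD handle. The two pointer typedefs are stand-ins so that the
+fragment compiles without the DAT headers; record_agent(),
+struct proxy_record and signal_proxy_agent() are hypothetical names
+invented for this example.

```c
#include <stddef.h>

/* Stand-in typedefs (assumptions) so the sketch builds on its own. */
typedef void *DAT_PVOID;
typedef void *DAT_EVD_HANDLE;

typedef void (*DAT_AGENT_FUNC)
    (
    DAT_PVOID,          /* instance data */
    DAT_EVD_HANDLE      /* Event Dispatcher */
    );

typedef struct dat_os_wait_proxy_agent
{
    DAT_PVOID       instance_data;
    DAT_AGENT_FUNC  proxy_agent_func;
} DAT_OS_WAIT_PROXY_AGENT;

/* Hypothetical consumer-side instance data: remembers what it was told. */
struct proxy_record
{
    int             calls;
    DAT_EVD_HANDLE  last_evd;
};

/* Hypothetical consumer-supplied agent function. */
void record_agent(DAT_PVOID data, DAT_EVD_HANDLE evd)
{
    struct proxy_record *rec;

    rec = (struct proxy_record *) data;
    rec->calls++;
    rec->last_evd = evd;
}

/* Signalling the proxy is just calling the function on the data. */
void signal_proxy_agent(const DAT_OS_WAIT_PROXY_AGENT *agent,
                        DAT_EVD_HANDLE evd)
{
    if (agent->proxy_agent_func != NULL)
    {
        agent->proxy_agent_func(agent->instance_data, evd);
    }
}
```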
+
+Event Storage
+-------------
+
+The data associated with an Event (the type, the EVD, and any type
+specific data required) must be stored between event production and
+event consumption. If storage is not provided by the underlying
+Verbs, that data must be stored in the EVD itself. This may require
+an extra copy (one at event production and one at event consumption).
+
+Event Streams associated purely with callbacks (i.e. IB events that
+are not mediated by CQs) or user calls (i.e. software events) don't
+have any storage allocated for them by the underlying verbs and hence
+must store their data in the EVD.
+
+Event Streams that are associated with CQs have the possibility of
+leaving the information associated with the CQ between the time the
+event is produced and the time it is consumed. However, even in this
+case, if the user calls dat_evd_wait with a threshold argument, the
+event information must be copied to storage in the EVD. This is
+because it is not possible to determine how many completions there are
+on a CQ without dequeuing them, and that determination must be made by
+the CQ notification callback in order to decide whether to wake up a
+dat_evd_wait() waiter. Note that this determination must be made
+dynamically based on the arguments to dat_evd_wait().
+
+Further, leaving events from Event Streams associated with the CQs "in
+the CQs" until event consumption raises issues about in what order
+events should be dequeued if there are multiple event streams entering
+an EVD. Should the CQ events be dequeued first, or should the events
+stored in the EVD be dequeued first? In general this is a complex
+question; the uDAPL spec does not put many restrictions on event
+order, but the major one that it does place is to restrict connection
+events associated with a QP to be dequeued before DTOs associated with
+that QP, and disconnection events after. Unfortunately, if we adopt
+the policy of always dequeueing CQ events first, followed by EVD
+events, this means that in situations where CQ events have been copied
+to the EVD, CQ events may be received on the EVD out of order.
+
+However, leaving events from Event Streams associated with CQs in the
+CQs allows us to avoid enabling CQ callbacks in cases where there is
+no waiter associated with the EVDs. This can be a potentially large
+savings of gratuitous context switches.
+
+For the initial implementation, we will leave all event information
+associated with CQs until dequeued by the consumer. All other event
+information will be put in storage on the EVD itself. We will always
+dequeue from the EVD first and the CQ second, to handle ordering among
+CQ events in cases in which CQ events have been copied to the EVD.
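+
+The dequeue order chosen above can be modeled in a few lines. This is
+a single-threaded toy model under stated assumptions, not DAPL code:
+fifo_t, evd_model_t, evd_copy_from_cq() and evd_dequeue() are
+hypothetical names, and events are reduced to integer ids.

```c
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_SLOTS 16

/* Hypothetical FIFO of event ids; models both EVD storage and a CQ. */
typedef struct
{
    int    item[QUEUE_SLOTS];
    size_t head;    /* next slot to read */
    size_t count;   /* entries currently queued */
} fifo_t;

typedef struct
{
    fifo_t stored;  /* events held in the EVD itself */
    fifo_t cq;      /* completions still sitting on the CQ */
} evd_model_t;

bool fifo_put(fifo_t *q, int ev)
{
    if (q->count == QUEUE_SLOTS)
    {
        return false;
    }
    q->item[(q->head + q->count) % QUEUE_SLOTS] = ev;
    q->count++;
    return true;
}

bool fifo_get(fifo_t *q, int *ev)
{
    if (q->count == 0)
    {
        return false;
    }
    *ev = q->item[q->head];
    q->head = (q->head + 1) % QUEUE_SLOTS;
    q->count--;
    return true;
}

/*
 * Counting completions means draining them: move CQ entries into EVD
 * storage and return the number of events now stored on the EVD.
 */
size_t evd_copy_from_cq(evd_model_t *evd)
{
    int ev;

    while (evd->stored.count < QUEUE_SLOTS && fifo_get(&evd->cq, &ev))
    {
        fifo_put(&evd->stored, ev);
    }
    return evd->stored.count;
}

/* Dequeue from EVD storage first, and only then from the CQ. */
bool evd_dequeue(evd_model_t *evd, int *ev)
{
    if (fifo_get(&evd->stored, ev))
    {
        return true;
    }
    return fifo_get(&evd->cq, ev);
}
```

+If completions 1 and 2 were copied to the EVD by an earlier threshold
+wait and completion 3 then arrives on the CQ, EVD-first dequeueing
+returns 1, 2, 3; CQ-first dequeueing would return 3 ahead of 1 and 2.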
+
+
+Synchronization
+---------------
+
+-- EVD synchronization: Locking vs. Producer/Consumer queues.
+
+In the current code, two circular producer/consumer queues are used
+for non-CQ event storage (one holds free events, one holds posted
+events). Event producers "consume" events from the free queue, and
+produce events onto the posted event queue. Event consumers consume
+events from the posted event queue, and "produce" events onto the free
+queue. In what follows, we discuss synchronization onto the posted
+event queue, but since the usage of the queues is symmetric, all of
+what we say also applies to the free event queue (just in the reverse
+polarity).
+
+The reason for using these circular queues is to allow synchronization
+between producer and consumer without locking in some situations.
+Unfortunately, a circular queue is only an effective method of
+synchronization if we can guarantee that there are only two accessors
+to it at a given time: one producer, and one consumer. The model will
+not work if there are multiple producers, or if there are multiple
+consumers (though obviously a subsidiary lock could be used to
+single-thread either the producers or the consumers).
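+
+To make the single-producer/single-consumer restriction concrete,
+here is a minimal sketch of such a circular queue. It uses C11
+<stdatomic.h>, which is an assumption made for this example and not
+necessarily the primitive the DAPL code relies on; ring_t, ring_init(),
+ring_put() and ring_get() are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SLOTS 8

/*
 * Lock-free only for exactly one producer thread and one consumer
 * thread: the producer alone writes "tail", the consumer alone writes
 * "head". A second producer would race on "tail" and on the slot.
 */
typedef struct
{
    void         *slot[RING_SLOTS];
    atomic_size_t head;     /* count of events consumed so far */
    atomic_size_t tail;     /* count of events produced so far */
} ring_t;

void ring_init(ring_t *r)
{
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side. Returns false if the ring is full. */
bool ring_put(ring_t *r, void *event)
{
    size_t tail;
    size_t head;

    tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_SLOTS)
    {
        return false;
    }
    r->slot[tail % RING_SLOTS] = event;
    /* Publish the slot contents before the new tail becomes visible. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side. Returns false if the ring is empty. */
bool ring_get(ring_t *r, void **event)
{
    size_t head;
    size_t tail;

    head = atomic_load_explicit(&r->head, memory_order_relaxed);
    tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
    {
        return false;
    }
    *event = r->slot[head % RING_SLOTS];
    /* Release the slot back to the producer. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```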
+
+There are several difficulties with guaranteeing the producers and
+consumers will each be single threaded in accessing the EVD:
+ * Constraints of the IB specification and IBM Access API
+ (differing sources for event streams without guarantees of
+ IB provider synchronization between them) make it difficult
+ to avoid multiple producers.
+ * The primitives used for the producer/consumer queue are not
+ as widely accepted as locks, and may render the design less
+ portable.
+
+We will take locks when needed when producing events. The details of
+this plan are described below.
+
+This reasoning is described in more detail below to inform judgments
+about future performance improvements.
+
+* EVD producer synchronization
+
+The producers fall into two classes:
+ * Callbacks announcing IA associated events such as connection
+ requests, connections, disconnections, DT ops, RMR bind,
+ etc.
+ * User calls posting a software event onto the EVD.
+
+It is the user's responsibility to protect against simultaneous
+postings of software events onto the same EVD. Similarly, the CQEH
+mechanism provided by the IBM Access API allows us to avoid collisions
+between IBM Access API callbacks associated with CQs. However, we
+must protect against software events colliding with IBM Access API
+callbacks, and against non-CQ associated IB verb callbacks (connection
+events and asynchronous errors) colliding with CQ associated IBM
+Access API callbacks, or with other non-CQ associated IBM Access API
+callbacks (i.e. a connection callback colliding with an asynchronous
+error callback).
+
+Note that CQ related callbacks do not act as producers on the circular
+list; instead they leave the event information on the CQ until
+dequeue; see "Event Storage" above. However, there are certain
+situations in which it is necessary for the consumer to determine the
+number of events on the EVD. The only way that IB provides to do this
+is to dequeue the CQEs from the CQ and count them. In these
+situations, the consumer will also act as an event producer for the
+EVD event storage, copying all event information from the CQ to the
+EVD.
+
+Based on the above, the only case in which we may do without locking
+on the producer side is when all Event Streams of all of the following
+types may be presumed to be single threaded:
+ * Software events
+ * Non-CQ associated callbacks
+ * Consumers within dat_evd_wait
+
+We use a lock on the producer side of the EVD whenever we have
+multiple threads of producers.
+
+* EVD Consumer synchronization
+
+It is the consumer's responsibility to avoid multiple callers into
+dat_evd_wait and dat_evd_dequeue. For this reason, there is no
+requirement for a lock on the consumer side.
+
+* CQ synchronization
+
+We simplify synchronization on the CQ by identifying the CQ consumer
+with the EVD consumer. In other words, we prohibit any thread other
+than a user thread in dat_evd_wait() or dat_evd_dequeue() from
+dequeueing events from the CQ. This means that we can rely on the
+uDAPL spec guarantee that only a single thread will be in the
+dat_evd_wait() or dat_evd_dequeue() on a single CQ at a time. It has
+the negative cost that (because there is no way to probe for the
+number of entries on a CQ without dequeueing) the thread blocked in
+dat_evd_wait() with a threshold argument greater than 1 will be woken
+up on each notification on that CQ, in order to dequeue entries from
+the CQ and determine if the threshold value has been reached.
+
+-- EVD Synchronization: Waiter vs. Callback
+
+Our decision to restrict dequeueing from the IB CQ to the user thread
+(rather than the notification callback thread) means that
+re-requesting notifications must also be done from that thread. This
+leads to a subtle requirement for synchronization: the request for
+notification (ib_completion_notify) must be atomic with the wait on
+the condition variable by the user thread (atomic in the sense that
+locks must be held to force the signalling from any such notification
+to occur after the sleep on the condition variable). Otherwise it is
+possible for the notification requested by the ib_completion_notify
+call to occur before the return from that call. The signal done by
+that notify will be ignored, and no further notifications will be
+enabled, resulting in the thread sleeping forever. The CQE
+associated with the notification might be noticed upon return from the
+notify request, but that CQE might also have been reaped by a previous
+call.
+
+-- CNO Synchronization
+
+In order to protect data items that are changed during CNO signalling
+(OS Proxy Wait Object, EVD associated with triggering, CNO state), it
+is necessary to use locking when triggering and waiting on a CNO.
+
+Note that the synchronization between triggerer and waiter on CNO must
+take into account the possibility of the waiter returning from the
+wait because of a timeout. I.e. it must handle the possibility that,
+even though the waiter was detected and the OS Wait Object signalled
+under an atomic lock, there would be no waiter on the OS Wait Object
+when it was signalled. To handle this case, we make the job of the
+triggerer to be setting the state to triggered and signalling the OS
+Wait Object; all other manipulation is done by the waiter.
+
+-- Inter-Object Synchronization
+
+By the requirements specified above, the DAPL implementation is
+responsible for avoiding collisions between DAT Consumer calls on
+different DAT objects, even in a non-thread safe implementation.
+Luckily, no such collisions exist in this implementation; all exported
+DAPL Event Subsystem calls involve operations only on the objects to
+which they are targeted. No inter-object synchronization is
+required.
+
+The one exception to this is the posting of a software event on an EVD
+associated with a CNO; this may result in triggering the CNO.
+However, this case was dealt with above in the discussion of
+synchronizing between event producers and consumers; the posting of a
+software event is a DAPL API call, but it's also an event producer.
+
+To avoid lock hierarchy issues between EVDs and CNOs and minimize lock
+contention, we arrange not to hold the EVD lock when triggering the
+CNO. That is the only context in which we would naturally attempt to
+hold both locks.
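+
+A rough sketch of that rule, assuming pthread mutexes: the producer
+decides under the EVD lock whether a CNO trigger is needed, drops the
+EVD lock, and only then takes the CNO lock. The structures and the
+names evd_post_event() and cno_trigger() are hypothetical and much
+reduced compared to any real EVD or CNO.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct
{
    pthread_mutex_t lock;
    bool            triggered;
} cno_t;

typedef struct
{
    pthread_mutex_t lock;
    int             nevents;
    bool            enabled;
    bool            has_waiter;
    cno_t          *cno;        /* may be NULL */
} evd_t;

void cno_init(cno_t *cno)
{
    pthread_mutex_init(&cno->lock, NULL);
    cno->triggered = false;
}

void evd_init(evd_t *evd, cno_t *cno)
{
    pthread_mutex_init(&evd->lock, NULL);
    evd->nevents = 0;
    evd->enabled = true;
    evd->has_waiter = false;
    evd->cno = cno;
}

void cno_trigger(cno_t *cno)
{
    pthread_mutex_lock(&cno->lock);
    cno->triggered = true;
    pthread_mutex_unlock(&cno->lock);
}

void evd_post_event(evd_t *evd)
{
    bool   trigger;
    cno_t *cno;

    pthread_mutex_lock(&evd->lock);
    evd->nevents++;
    cno = evd->cno;
    /* Decide under the EVD lock, act after releasing it. */
    trigger = (cno != NULL && evd->enabled && !evd->has_waiter);
    pthread_mutex_unlock(&evd->lock);

    if (trigger)
    {
        /* The EVD lock is not held here, so no lock ordering arises. */
        cno_trigger(cno);
    }
}
```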
+
+-- CQ -> CQEH Assignments
+
+For the initial implementation, we will assign all CQs to the same
+CQEH. This is for simplicity and efficient use of threading
+resources; we do not want to dedicate a thread per CQ (where the
+number of CQs may grow arbitrarily high), and we have no way of
+knowing which partitioning of CQs is best for the DAPL consumer.
+
+CQ Callbacks
+------------
+
+The responsibility of a CQ callback is to wake up any waiters
+associated with the CQ--no data needs to be dequeued/delivered, since
+that is always done by the consumer. Therefore, CQ callbacks must be
+enabled when:
+ * Any thread is in dat_evd_wait() on the EVD associated with
+ the CQ.
+ * The EVD is enabled and has a non-null CNO. (An alternative
+ design would be to have waiters on a CNO enable callbacks on
+ all CQs associated with all EVDs associated with the CNO,
+ but this choice does not scale well as the number of EVDs
+ associated with a CNO increases).
+
+Dynamic Resizing of EVDs
+------------------------
+
+dat_evd_resize() creates a special problem for the implementor, as it
+requires that the storage allocated in the EVD be changed in size as
+events may be arriving. If a lock is held by all operations that use
+the EVD, implementation of dat_evd_resize() is trivial; it substitutes
+a new storage mechanism for the old one, copying over all current
+events, all under lock.
+
+However, we wish to avoid universal locking for the initial
+implementation. This puts the implementation of dat_evd_resize() into
+a tar pit. Because of the DAT Consumer requirements for a non-thread
+safe DAPL Implementation, there will be no danger of conflict with
+Event Consumers. However, if an Event Producer is in process of
+adding an event to the circular list when the resize occurs, that
+event may be lost or overwrite freed memory.
+
+If we are willing to make the simplifying decision that any EVD that
+has non-CQ events on it will always do full producer side locking, we
+can solve this problem relatively easily. Resizing of the underlying
+CQ can be done via ib_cq_resize(), which we can assume available
+because of the IB spec. Resizing of the EVD storage may be done under
+lock, and there will be no collisions with other uses of the EVD as
+all other uses of the EVD must either take the lock or are prohibited
+by the uDAPL spec.
+
+dat_evd_resize() has not yet been implemented in the DAPL Event
+subsystem.
+
+Structure and pseudo-code
+-------------------------
+
+-- EVD
+
+All EVDs will have associated with them:
+ + A lock
+ + A DAPL OS Wait Object
+ + An enabled/disabled bit
+ + A CNO pointer (may be null)
+ + A state (no_waiter, waiter, dead)
+ + A threshold count
+ + An event list
+ + A CQ (optional, but common)
+
+Posting an event to the EVD (presumably from a callback) will involve:
+^ + Checking for valid state
+|lock A + Putting the event on the event list
+| ^lock B + Signal the DAPL OS Wait Object, if appropriate
+v v (waiter & signaling event & over threshold)
+ + Trigger the CNO if appropriate (enabled & signaling
+ event & no waiter). Note that the EVD lock is not
+ held for this operation to avoid holding multiple locks.
+
+("lock A" is used if producer side locking is needed. "lock B" is
+used if producer side locking is not needed. Regardless, the lock is
+only held to confirm that the EVD is in the WAITED state, not for
+the wakeup).
+
+Waiting on an EVD will include:
+ + Loop:
+ + Copy all elements from CQ to EVD
+ + If we have enough, break
+ + If we haven't enabled the CQ callback
+ + Enable it
+ + Continue
+ + Sleep on DAPL OS Wait Object
+ + Dequeue and return an event
+
+The CQ callback will include:
+ + If there's a waiter:
+ + Signal it
+ + Otherwise, if the evd is in the OPEN state, there's
+ a CNO, and the EVD is enabled:
+ + Reenable completion
+ + Trigger CNO
+
+Setting the enable/disable state of the EVD or setting the associated
+CNO will simply set the bits and enable the completion if needed (if a
+CNO trigger is implied); no locking is required.
+
+-- CNO
+
+All CNOs will have associated with them:
+ + A lock
+ + A DAPL OS Wait Object
+ + A state (triggered, untriggered, dead)
+ + A waiter count
+ + An EVD handle (last event which triggered the CNO)
+ + An OS Proxy Wait Object pointer (may be null)
+
+Triggering a CNO will involve:
+ ^ + If the CNO state is untriggered:
+ | + Set it to triggered
+ | + Note the OS Proxy wait object and zero it.
+ | + If there are any waiters associated with the CNO,
+ | signal them.
+ v + Signal the OS proxy wait object if noted
+
+Waiting on a CNO will involve:
+ ^ + While the state is not triggered and the timeout has not occurred:
+ | + Increment the CNO waiter count
+ lock + Wait on the DAPL OS Wait Object
+ | + Decrement the CNO waiter count
+ v + If the state is triggered, note fact&EVD and set to untriggered.
+ + Return EVD and success if state was triggered
+ + Return timeout otherwise
+
+Setting the OS Proxy Wait Object on a CNO, under lock, checks for a
+valid state and sets the OS Proxy Wait Object.
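+
+The CNO outline above might translate into C along the following
+lines. This is a sketch under assumptions, not the reference code: it
+uses a pthread condition variable as the wait object, void pointers in
+place of EVD handles, and the hypothetical names cno_t, cno_init(),
+cno_trigger(), cno_wait() and count_proxy(). The "dead" state is
+declared but not exercised.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L     /* for clock_gettime() */
#endif
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <time.h>

typedef enum { CNO_UNTRIGGERED, CNO_TRIGGERED, CNO_DEAD } cno_state_t;
typedef void (*proxy_func_t)(void *data, void *evd);

typedef struct
{
    pthread_mutex_t lock;
    pthread_cond_t  wait_obj;
    cno_state_t     state;
    int             waiters;
    void           *evd;            /* EVD that last triggered the CNO */
    proxy_func_t    proxy_func;     /* may be NULL */
    void           *proxy_data;
} cno_t;

/* Hypothetical proxy: counts how many times it was signalled. */
void count_proxy(void *data, void *evd)
{
    (void) evd;
    (*(int *) data)++;
}

void cno_init(cno_t *cno, proxy_func_t func, void *data)
{
    pthread_mutex_init(&cno->lock, NULL);
    pthread_cond_init(&cno->wait_obj, NULL);
    cno->state = CNO_UNTRIGGERED;
    cno->waiters = 0;
    cno->evd = NULL;
    cno->proxy_func = func;
    cno->proxy_data = data;
}

void cno_trigger(cno_t *cno, void *evd)
{
    proxy_func_t func;

    func = NULL;
    pthread_mutex_lock(&cno->lock);
    if (cno->state == CNO_UNTRIGGERED)
    {
        cno->state = CNO_TRIGGERED;
        cno->evd = evd;
        func = cno->proxy_func;     /* note the proxy and zero it */
        cno->proxy_func = NULL;
        if (cno->waiters > 0)
        {
            pthread_cond_signal(&cno->wait_obj);
        }
    }
    if (func != NULL)
    {
        func(cno->proxy_data, evd); /* under lock, as in the outline */
    }
    pthread_mutex_unlock(&cno->lock);
}

/* Returns 0 and sets *evd if triggered, ETIMEDOUT otherwise. */
int cno_wait(cno_t *cno, long timeout_ms, void **evd)
{
    struct timespec deadline;
    int rc;

    rc = 0;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L)
    {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&cno->lock);
    while (cno->state != CNO_TRIGGERED && rc == 0)
    {
        cno->waiters++;
        rc = pthread_cond_timedwait(&cno->wait_obj, &cno->lock, &deadline);
        cno->waiters--;
    }
    if (cno->state == CNO_TRIGGERED)
    {
        *evd = cno->evd;
        cno->state = CNO_UNTRIGGERED;
        rc = 0;
    }
    pthread_mutex_unlock(&cno->lock);
    return rc;
}
```

+Because the triggerer only sets the state and signals, a waiter that
+has already timed out is harmless: the state simply stays triggered
+for the next caller of cno_wait().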
+
+
+==============
+Known Problems
+==============
+
+-- Because many event streams are actually delivered to EVDs by
+ callbacks, we cannot in general make any guarantees about the order
+ in which those event streams arrive; we are at the mercy of the
+ thread scheduler. Thus we cannot hold to the guarantee given by
+ the uDAPL 1.0 specification that within a particular EVD,
+ connection events on a QP will always be before successful DTO
+ operations on that QP.
+
+ Because we have chosen to dequeue EVD events first and CQ events
+ second, we will also not be able to guarantee that all successful
+ DTO events will be received before a disconnect event. Ability to
+ probe the CQ for its number of entries would solve this problem.
+
+
+=================
+Future Directions
+=================
+
+This section includes both functionality enhancements, and a series of
+performance improvements. I mark these performance optimizations with
+the following flags:
+ * VerbMod: Requires modifications to the IB Verbs/the IBM
+ Access API to be effective.
+ * VerbInteg: Requires integration between the DAPL
+ implementation and the IB Verbs implementation and IB device
+ driver.
+
+Functionality Enhancements
+--------------------------
+
+-- dat_evd_resize() may be implemented by forcing producer side
+ locking whenever an event producer may occur asynchronously with
+ calls to dat_evd_resize() (i.e. when there are non-CQ event streams
+ associated with the EVD). See the details under "Dynamic Resizing
+ of EVDs" above.
+
+-- [VerbMod] If we had a verbs modification allowing us to probe for
+ the current number of entries on a CQ, we could:
+ * Avoid waking up a dat_evd_wait(threshold>1) thread until
+ there were enough events for it.
+ * Avoid copying events from the CQ to the EVD to satisfy the
+ requirements of the "*nmore" out argument to dat_evd_wait(),
+ as well as the non-unary threshold argument.
+ * Implement the "all successful DTO operation events before
+ disconnect event" uDAPL guarantee (because we would no
+ longer have to copy CQ events to an EVD, and hence dequeue
+ first from the EVD and then from the CQ).
+ This optimization also is relevant for two of the performance
+ improvements cases below (Reducing context switches, and reducing
+ copies).
+
+
+Performance improvements: Reducing context switches
+---------------------------------------------------
+-- [VerbMod] If we had a verbs modification allowing us to probe for
+ the current size of a CQ, we could avoid waking up a
+ dat_evd_wait(threshold>1) thread until there were enough events
+ for it. See the Functionality Enhancement entry covering this
+ possibility.
+
+-- [VerbMod] If we had a verbs modification allowing threads to wait
+ for completions to occur on CQs (presumably in the kernel in some
+ efficient manner), we could optimize the case of
+ dat_evd_wait(...,threshold=1,...) on EVDs with only a single CQ
+ associated Event Stream. In this case, we could avoid the extra
+ context switch into the user callback thread; instead, the user
+ thread waiting on the EVD would be woken up by the kernel directly.
+
+-- [VerbMod] If we had the above verbs modification with a threshold
+ argument on CQs, we could implement the threshold=n case.
+
+-- [VerbInteg] In general, it would be useful to provide ways for
+ threads blocked on EVDs or CNOs to sleep in the hardware driver,
+ and for the driver interrupt thread to determine if they should be
+ awoken rather than handing that determination off to another,
+ user-level thread. This would allow us to reduce by one the number
+ of context switches required for waking up the various blocked
+ threads.
+
+-- If an EVD has only a single Event Stream coming into it that is
+ only associated with one work queue (send or receive), it may be
+ possible to create thresholding by marking only every nth WQE on
+ the associated send or receive WQ to signal a completion. The
+ difficulty with this is that the threshold is specified when
+ waiting on an EVD, and requesting completion signaling is
+ specified when posting a WQE; those two events may not in general
+ be synchronized enough for this strategy. It is probably
+ worthwhile letting the consumer implement this strategy directly if
+ they so choose, by specifying the correct flags on EP and DTO so
+ that the CQ events are only signaling on every nth completion.
+ They could then use dat_evd_wait() with a threshold of 1.
+
+Performance improvements: Reducing copying of event data
+--------------------------------------------------------
+-- [VerbMod] If we had the ability to query a CQ for the number of
+ completions on it, we could avoid the cost of copying event data from the
+ CQ to the EVD. This is a duplicate of the second entry under
+ "Functionality Enhancements" above.
+
+Performance improvements: Reducing locking
+------------------------------------------
+-- dat_evd_dequeue() may be modified to not take any locks.
+
+-- If there is no waiter associated with an EVD and there is only a
+ single event producer, we may avoid taking any locks in producing
+ events onto that EVD. This must be done carefully to handle the
+ case of racing with a waiter waiting on the EVD as we deliver the
+ event.
+
+-- If there is no waiter associated with an EVD, and we create a
+ producer/consumer queue per event stream with a central counter
+ modified with atomic operations, we may avoid locking on the EVD.
+
+-- It may be possible, through judicious use of atomic operations, to
+ avoid locking when triggering a CNO unless there is a waiter on the
+ CNO. This has not been done to keep the initial design simple.
+
+Performance improvements: Reducing atomic operations
+----------------------------------------------------
+-- We could combine the EVD circular lists, to avoid a single atomic
+ operation on each production and each consumption of an event. In
+ this model, event structures would not move from list to list;
+ whether or not they had valid information on them would simply
+ depend on where they were on the lists.
+
+-- We may avoid the atomic increments on the circular queues (which
+ have a noticeable performance cost on the bus) if all accesses to an
+ EVD take locks.
+
+
+Performance improvements: Increasing concurrency
+------------------------------------------------
+-- When running on a multi-CPU platform, it may be appropriate to
+ assign CQs to several separate CQEHs, to increase the concurrency
+ of execution of CQ callbacks. However, note that consumer code is
+ never run within a CQ callback, so those callbacks should take very
+ little time per callback. This plan would only make sense in
+ situations where there were very many CQs, all of which were
+ active, and for whatever reason (high threshold, polling, etc)
+ user threads were usually not woken up by the execution of a
+ provider CQ callback.
+
+
--- /dev/null
+
+ DAPL Variations from IBM OS Access API
+ --------------------------------------
+
+The DAPL reference implementation is targeted at the IBM OS Access
+API (see doc/api/IBM_access_api.pdf). However, in the course of
+developing the reference implementation it has become necessary to
+alter or enhance this API specification in minor ways. This document
+describes the ways in which the Access API has been altered to
+accommodate the needs of the reference implementation.
+
+Note that this document is a work in progress/a place holder; it does
+not yet contain all of the API variations used by the reference
+implementation. It is intended that it will be brought up to date
+before the final release of the DAPL reference implementation.
+
+The variations from the IBM OS Access API are listed below.
+
+-- Thread safety
+
+The IBM OS Access API specifies:
+
+"Implementation of the Access APIs should ensure that multiple threads
+ can call the APIs, provided they do not access the same InfiniBand
+ entity (such as a queue pair or a completion queue)."
+
+This has been extended in two ways:
+ * It is safe for multiple threads to call into the API
+ accessing the same HCA.
+ * Threads calling ib_post_send_req on a particular QP do not
+ conflict with threads calling ib_post_rcv_req on the same
+ QP. I.e. while there cannot be multiple threads in
+ ib_post_send_req or ib_post_rcv_req on the same QP, there
+ may be one thread in each routine simultaneously.
--- /dev/null
+#######################################################################
+# #
+# DAPL Memory Management Design #
+# #
+# James Lentini #
+# jlentini at users.sourceforge.net #
+# #
+# Created 05/06/2002 #
+# Updated 08/22/2002 #
+# #
+#######################################################################
+
+
+Contents
+--------
+0. Introduction
+1. Protection Zones (PZs)
+2. Local Memory Regions (LMRs)
+3. Remote Memory Regions (RMRs)
+
+
+0. Introduction
+---------------
+
+ The memory management subsystem allows consumers to register and
+unregister memory regions. The DAT API distinguishes between local
+and remote memory areas. The former serve as local buffers for DTO
+operations while the latter are used for RDMA operations.
+
+Each DAT function is implemented in a file named dapl_<function name>.c.
+For example, dat_pz_create is implemented in dapl_pz_create.c. There
+are also dapl_<object name>_util.{h,c} files for each object. For
+example, there are dapl_pz_util.h and dapl_pz_util.c files. The
+use of util files follows the convention used elsewhere in the DAPL
+reference provider. These files contain common object creation and
+destruction code.
+
+
+1. Protection Zones (PZs)
+-------------------------
+
+ DAPL protection zones provide consumers with a means to associate
+various DAPL objects with one another. The association can then be
+validated before allowing these objects to be manipulated. The DAT
+functions related to PZs are:
+
+dat_pz_create
+dat_pz_free
+dat_pz_query
+
+These are implemented in the DAPL reference provider by
+
+dapl_pz_create
+dapl_pz_free
+dapl_pz_query
+
+The reference implementation maps the DAPL PZ concept onto Infiniband
+protection domains (PDs).
+
+The DAT_PZ_HANDLE value returned to DAT consumers is a pointer to a
+DAPL_PZ data structure. The DAPL_PZ structure is used to represent all
+PZ objects. Code that manipulates this structure should atomically
+increment and decrement the ref_count member to track the number of
+objects referencing the PZ.
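+
+A minimal sketch of this reference counting follows. It assumes C11
+<stdatomic.h>, which is a substitution made for illustration; the
+atomic primitives actually used by the reference provider are not
+described in this document. SKETCH_PZ and the sketch_pz_* functions are
+hypothetical names.
+
```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for DAPL_PZ; only the counter is modelled. */
typedef struct sketch_pz
{
    atomic_int ref_count; /* number of objects referencing the PZ */
} SKETCH_PZ;

static void
sketch_pz_init (SKETCH_PZ *pz)
{
    atomic_init (&pz->ref_count, 0);
}

/* Called when an object (for example an LMR) starts using the PZ. */
static void
sketch_pz_ref (SKETCH_PZ *pz)
{
    atomic_fetch_add (&pz->ref_count, 1);
}

/* Called when that object is destroyed. */
static void
sketch_pz_unref (SKETCH_PZ *pz)
{
    atomic_fetch_sub (&pz->ref_count, 1);
}

/* A free request can only be honoured when nothing references the PZ. */
static int
sketch_pz_can_free (SKETCH_PZ *pz)
{
    return 0 == atomic_load (&pz->ref_count);
}
```
+
+In this sketch the check in sketch_pz_can_free and the subsequent
+teardown are separate steps, so a real free path would also have to
+prevent a new reference from being taken in between.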
+
+
+2. Local Memory Regions (LMRs)
+------------------------------
+
+ DAPL local memory regions represent a memory area on the host
+system that the consumer wishes to access via local DTO operations.
+The DAT functions related to LMRs are:
+
+dat_lmr_create
+dat_lmr_free
+dat_lmr_query
+
+These are implemented in
+
+dapl_lmr_create
+dapl_lmr_free
+dapl_lmr_query
+
+In the reference implementation, DAPL LMRs are mapped onto
+Infiniband memory regions (MRs).
+
+LMR creation produces two values: a DAT_LMR_CONTEXT and a
+DAT_LMR_HANDLE.
+
+The DAT_LMR_CONTEXT value is used to uniquely identify the LMR
+when posting data transfer operations. These values map directly
+to Infiniband L_KEYs.
+
+Since some DAT functions need to translate a DAT_LMR_CONTEXT value
+into a DAT_LMR_HANDLE (ex. dat_rmr_bind), a dictionary data structure
+is used to associate DAT_LMR_CONTEXT values with their corresponding
+DAT_LMR_HANDLE. Each time a new LMR is created, the DAT_LMR_HANDLE
+should be inserted into the dictionary with the associated
+DAT_LMR_CONTEXT as the key.
+
+A hash table was chosen to implement this data structure. Since the
+L_KEY values are being used by the CA hardware for indexing purposes,
+their distribution is expected to be uniform and hence ideal for hashing.
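+
+The dictionary described above could be sketched as a small chained
+hash table keyed directly by the 32-bit context value. This is only an
+illustration: the bucket count, the SKETCH_* names and the use of a
+void pointer in place of a DAT_LMR_HANDLE are assumptions, and the
+sketch neither locks the table nor rejects duplicate keys.
+
```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_HASH_BUCKETS 64 /* hypothetical table size */

typedef struct sketch_hash_node
{
    uint32_t                 key;   /* DAT_LMR_CONTEXT (L_KEY) value */
    void                    *value; /* stand-in for a DAT_LMR_HANDLE */
    struct sketch_hash_node *next;
} SKETCH_HASH_NODE;

typedef struct sketch_hash_table
{
    SKETCH_HASH_NODE *buckets[SKETCH_HASH_BUCKETS];
} SKETCH_HASH_TABLE;

/* Insert at the head of the bucket chain; returns 0 on success. */
static int
sketch_hash_insert (SKETCH_HASH_TABLE *table, uint32_t key, void *value)
{
    SKETCH_HASH_NODE *node;

    node = malloc (sizeof (*node));
    if ( NULL == node )
    {
        return -1;
    }
    node->key = key;
    node->value = value;
    node->next = table->buckets[key % SKETCH_HASH_BUCKETS];
    table->buckets[key % SKETCH_HASH_BUCKETS] = node;
    return 0;
}

/* Returns the stored value, or NULL if the key is not present. */
static void *
sketch_hash_search (const SKETCH_HASH_TABLE *table, uint32_t key)
{
    const SKETCH_HASH_NODE *node;

    for ( node = table->buckets[key % SKETCH_HASH_BUCKETS];
          NULL != node;
          node = node->next )
    {
        if ( node->key == key )
        {
            return node->value;
        }
    }
    return NULL;
}
```
+
+The table must start out zeroed. Because the keys are expected to be
+uniformly distributed, the sketch uses the key modulo the bucket count
+as its hash function.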
+
+The DAT_LMR_HANDLE value returned to DAT consumers is a pointer to
+a DAPL_LMR data structure. The DAPL_LMR structure is used to represent
+all LMR objects. The ref_count member should be used to track objects
+associated with a given LMR.
+
+The DAT API exposes the DAT_LMR_CONTEXT to consumers to allow
+for sharing of memory registrations between multiple address spaces.
+The mechanism by which such a feature would be implemented does not
+yet exist. Consumers may be able to take advantage of this
+feature on future transports.
+
+
+3. Remote Memory Regions (RMRs)
+-------------------------------
+
+ DAPL remote memory regions represent a memory area on the host
+system to which the consumer wishes to allow RDMA operations. The
+related DAT functions are
+
+dat_rmr_create
+dat_rmr_free
+dat_rmr_query
+dat_rmr_bind
+
+which are implemented in
+
+dapl_rmr_create
+dapl_rmr_free
+dapl_rmr_query
+dapl_rmr_bind
+
+The reference provider maps RMR objects onto Infiniband memory
+windows.
+
+The DAT_RMR_HANDLE value returned to DAT consumers is a pointer to
+a DAPL_RMR data structure. The DAPL_RMR structure is used to represent
+all RMR objects.
+
+The API for binding an LMR to an RMR has the following function
+signature:
+
+DAT_RETURN
+dapl_rmr_bind (
+ IN DAT_RMR_HANDLE rmr_handle,
+ IN const DAT_LMR_TRIPLET *lmr_triplet,
+ IN DAT_MEM_PRIV_FLAGS mem_priv,
+ IN DAT_EP_HANDLE ep_handle,
+ IN DAT_RMR_COOKIE user_cookie,
+ IN DAT_COMPLETION_FLAGS completion_flags,
+ OUT DAT_RMR_CONTEXT *rmr_context )
+
+where a DAT_LMR_TRIPLET is defined as:
+
+typedef struct dat_lmr_triplet
+ {
+ DAT_LMR_CONTEXT lmr_context;
+ DAT_UINT32 pad;
+ DAT_VADDR virtual_address;
+ DAT_VLEN segment_length;
+ } DAT_LMR_TRIPLET;
+
+In the case of IB, the DAT_LMR_CONTEXT value is an L_KEY.
+As described in the IB spec, the Bind Memory Window verb
+takes both an L_KEY and a Memory Region Handle among other
+parameters. Therefore a data structure must be used to
+map a DAT_LMR_CONTEXT (L_KEY) value to a DAPL_LMR so
+that the needed memory region handle can be retrieved.
+The LMR hash table described above is used for this
+purpose.
--- /dev/null
+#######################################################################
+# #
+# DAPL Patch Guide #
+# #
+# James Lentini #
+# jlentini at users.sourceforge.net #
+# #
+# Created 03/30/2005 #
+# Version 1.0 #
+# #
+#######################################################################
+
+
+Overview
+--------
+
+The DAPL Reference Implementation (RI) Team welcomes code contributions
+and bug fixes from RI users. This document describes the format for
+submitting patches to the project.
+
+Directions
+----------
+
+When implementing a new feature or bug fix, please remember to:
+
++ Use the project coding style, described in doc/dapl_coding_style.txt
++ Remember that the RI supports multiple platforms and transports. If
+ your modification is not applicable to all platforms and transports,
+ please ensure that the implementation does not affect these other
+ configurations.
+
+When creating the patch:
+
++ Create the patch using a unified diff as follows:
+ diff -Naur old-code new-code > patch
++ Create the patch from the root of the CVS tree.
+
+When submitting the patch:
+
++ Compose an email message containing a brief description of the patch,
+ a Signed-off-by line, and the patch.
++ Have the text "[PATCH]" at the start of the subject line
++ Send the message to dapl-devel@lists.sourceforge.net
+
+Example
+-------
+
+Here is an example patch message:
+
+------------------------------------------------------------
+Date: 30 Mar 2005 11:49:45 -0500
+From: Jane Doe
+To: dapl-devel@lists.sourceforge.net
+Subject: [PATCH] fixed status returns
+
+Here's a patch to fix the status return value in
+dats_handle_vector_init().
+
+Signed-off-by: Jane Doe <jdoe at pseudonyme.com>
+
+--- dat/common/dat_api.c~ 2005-03-30 11:58:40.838968000 -0500
++++ dat/common/dat_api.c 2005-03-28 12:33:29.502076000 -0500
+@@ -70,16 +70,15 @@
+ {
+ DAT_RETURN dat_status;
+ int i;
+- int status;
+
+ dat_status = DAT_SUCCESS;
+
+ g_hv.handle_max = DAT_HANDLE_ENTRY_STEP;
+
+- status = dat_os_lock_init (&g_hv.handle_lock);
+- if ( DAT_SUCCESS != status )
++ dat_status = dat_os_lock_init (&g_hv.handle_lock);
++ if ( DAT_SUCCESS != dat_status )
+ {
+- return status;
++ return dat_status;
+ }
+
+ g_hv.handle_array = dat_os_alloc (sizeof(void *) * DAT_HANDLE_ENTRY_STEP);
+------------------------------------------------------------
--- /dev/null
+ DAT Registry Subsystem Design v. 0.90
+ -------------------------------------
+
+=================
+Table of Contents
+=================
+
+* Table of Contents
+* Referenced Documents
+* Introduction
+* Goals
+* Provider API
+* Consumer API
+* Registry Design
+ + Registry Database
+ + Provider API pseudo-code
+ + Consumer API pseudo-code
+ + Platform Specific API pseudo-code
+
+====================
+Referenced Documents
+====================
+
+uDAPL: User Direct Access Programming Library, Version 1.0. Published
+6/21/2002. http://www.datcollaborative.org/uDAPL_062102.pdf. Referred
+to in this document as the "DAT Specification".
+
+============
+Introduction
+============
+
+The DAT architecture supports the use of multiple DAT providers within
+a single consumer application. Consumers implicitly select a provider
+using the Interface Adapter name parameter passed to dat_ia_open().
+
+The subsystem that maps Interface Adapter names to provider
+implementations is known as the DAT registry. When a consumer calls
+dat_ia_open(), the appropriate provider is found and notified of the
+consumer's request to access the IA. After this point, all DAT API
+calls acting on DAT objects are automatically directed to the
+appropriate provider entry points.
+
+A persistent, administratively configurable database is used to store
+mappings from IA names to provider information. This provider
+information includes: the file system path to the provider library
+object, version information, and thread safety information. The
+location and format of the registry is platform dependent. This
+database is known as the Static Registry (SR). The process of adding a
+provider entry is termed Static Registration.
+
+Within each DAT consumer, there is a per-process database that
+maps from ia_name -> provider information. When dat_ia_open() is
+called, the provider library is loaded, the ia_open_func is found, and
+the ia_open_func is called.
+
+=====
+Goals
+=====
+
+-- Implement the registration mechanism described in the uDAPL
+ Specification.
+
+-- The DAT registry should be thread safe.
+
+-- On a consumer's performance critical data transfer path, the DAT
+ registry should not require any significant overhead.
+
+-- The DAT registry should not limit the number of IAs or providers
+ supported.
+
+-- The user level registry should be tolerant of arbitrary library
+ initialization orders and support calls from library initialization
+ functions.
+
+============
+Provider API
+============
+
+Provider libraries must register themselves with the DAT registry.
+Along with the Interface Adapter name they wish to map, they must
+provide a routines vector containing provider-specific implementations
+of all DAT APIs. If a provider wishes to service multiple Interface
+Adapter names with the same DAT APIs, it must register each name
+separately with the DAT registry. The Provider API is not exposed to
+consumers.
+
+The user level registry must ensure that the Provider API may be
+called from a library's initialization function. Therefore the
+registry must not rely on a specific library initialization order.
+
+ DAT_RETURN
+ dat_registry_add_provider(
+ IN DAT_PROVIDER *provider )
+
+Description: Allows the provider to add a mapping. It will return an
+error if the Interface Adapter name already exists.
+
+ DAT_RETURN
+ dat_registry_remove_provider(
+ IN DAT_PROVIDER *provider )
+
+Description: Allows the Provider to remove a mapping. It will return
+an error if the mapping does not already exist.
+
+============
+Consumer API
+============
+
+Consumers that wish to use a provider library call the DAT registry to
+map Interface Adapter names to provider libraries. The consumer API is
+exposed to both consumers and providers.
+
+ DAT_RETURN
+ dat_ia_open (
+ IN const DAT_NAME device_name,
+ IN DAT_COUNT async_event_qlen,
+ INOUT DAT_EVD_HANDLE *async_event_handle,
+ OUT DAT_IA_HANDLE *ia_handle )
+
+Description: Upon success, this function returns a DAT_IA_HANDLE to
+the consumer. This handle, while opaque to the consumer, provides
+direct access to the provider supplied library. To support this
+feature, all DAT_HANDLEs must be pointers to a pointer to a
+DAT_PROVIDER structure.
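+
+The handle convention can be sketched as follows. The structure
+layouts, the single ia_close_func entry and all SKETCH_* names are
+assumptions made for illustration; the real DAT_PROVIDER carries a
+full vector of entry points.
+
```c
#include <assert.h>

typedef struct sketch_provider SKETCH_PROVIDER;

struct sketch_provider
{
    const char *device_name;
    int (*ia_close_func) (void *ia_handle);
};

/* Every provider object begins with a pointer to its provider. */
typedef struct sketch_ia
{
    SKETCH_PROVIDER *provider; /* must be the first member */
    int              is_open;
} SKETCH_IA;

/* A handle is a pointer to a pointer to the provider structure. */
#define SKETCH_HANDLE_TO_PROVIDER(handle) (*(SKETCH_PROVIDER **) (handle))

/* Provider-side implementation, reached only through the vector. */
static int
sketch_provider_ia_close (void *ia_handle)
{
    ((SKETCH_IA *) ia_handle)->is_open = 0;
    return 0;
}

/* Registry-side entry point: dispatch without knowing the object layout. */
static int
sketch_ia_close (void *ia_handle)
{
    SKETCH_PROVIDER *provider;

    provider = SKETCH_HANDLE_TO_PROVIDER (ia_handle);
    return (*provider->ia_close_func) (ia_handle);
}
```
+
+Because the provider pointer is the first member of every object, the
+registry can redirect a call with one dereference and no lookup, which
+is what keeps it off the data transfer path.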
+
+ DAT_RETURN
+ dat_ia_close (
+ IN DAT_IA_HANDLE ia_handle,
+ IN DAT_CLOSE_FLAGS ia_flags )
+
+Description: Closes the Interface Adapter.
+
+ DAT_RETURN
+ dat_registry_list_providers(
+ IN DAT_COUNT max_to_return,
+ OUT DAT_COUNT *entries_returned,
+ OUT DAT_PROVIDER_INFO *(dat_provider_list[]) )
+
+Description: Lists the current mappings.
+
+===============
+Registry Design
+===============
+
+There are three separate portions of the DAT registry system:
+
+* Registry Database
+
+* Provider API
+
+* Consumer API
+
+We address each of these areas in order. The final section will
+describe any necessary platform specific functions.
+
+Registry Database
+-----------------
+
+Static Registry
+................
+
+The Static Registry is a persistent database containing provider
+information keyed by Interface Adapter name. The Static Registry will
+be examined once when the DAT library is loaded.
+
+There is no synchronization mechanism protecting access to the Static
+Registry. Multiple readers and writers may concurrently access the
+Static Registry and as a result there is no guarantee that the
+database will be in a consistent format at any given time. DAT
+consumers should be aware of this and not run DAT programs when the
+registry is being modified (for example, when a new provider is being
+installed). However, the DAT library must be robust enough to recognize
+an inconsistent Static Registry and ignore invalid entries.
+
+Information in the Static Registry will be used to initialize the
+registry database. The registry will refuse to load libraries for DAT
+API versions different than its DAT API version. Switching API
+versions will require switching versions of the registry library (the
+library explicitly placed on the link line of DAPL programs) as well
+as the header files included by the program.
+
+Set DAT_NO_STATIC_REGISTRY at compile time if you wish to compile
+DAT without a static registry.
+
+UNIX Registry Format
+.....................
+
+The UNIX registry will be a plain text file with the following
+properties:
+ * All characters after # on a line are ignored (comments).
+ * Lines on which there are no characters other than whitespace
+ and comments are considered blank lines and are ignored.
+ * Non-blank lines must have seven whitespace separated fields.
+ These fields may contain whitespace if the field is quoted
+ with double quotes. Within fields quoted with double quotes,
+ the following are valid escape sequences:
+
+ \\ backslash
+ \" quote
+
+ * Each non-blank line will contain the following fields:
+
+ - The IA Name.
+ - The API version of the library:
+ [k|u]major.minor where "major" and "minor" are both integers
+ in decimal format. Examples: "k1.0", "u1.0", and "u1.1".
+ - Whether the library is thread-safe:
+ [threadsafe|nonthreadsafe]
+ - Whether this is the default section: [default|nondefault]
+ - The path name for the library image to be loaded.
+ - The version of the driver: major.minor, for example, "5.13".
+
+The format of any remaining fields on the line is dependent on the API
+version of the library specified on that line. For API versions 1.0
+and 1.1 (both kDAPL and uDAPL), there is only a single additional
+field, which is:
+
+ - An optional string with instance data, which will be passed to
+ the loaded library as its run-time arguments.
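+
+As an example of validating one of these fields, the sketch below
+parses the API version field, [k|u]major.minor. It is a hypothetical
+illustration only: SKETCH_API_VERSION, sketch_parse_decimal and
+sketch_convert_api are invented names, and overflow of very long
+numbers is not checked.
+
```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef struct sketch_api_version
{
    char type;  /* 'k' for kDAPL, 'u' for uDAPL */
    int  major;
    int  minor;
} SKETCH_API_VERSION;

/* Parse one or more decimal digits; returns the first unparsed char. */
static const char *
sketch_parse_decimal (const char *s, int *value)
{
    int result;

    if ( !isdigit ((unsigned char) *s) )
    {
        return NULL;
    }
    result = 0;
    while ( isdigit ((unsigned char) *s) )
    {
        result = result * 10 + (*s - '0');
        s++;
    }
    *value = result;
    return s;
}

/* Returns 0 on success, -1 if field is not of the form [k|u]major.minor. */
static int
sketch_convert_api (const char *field, SKETCH_API_VERSION *version)
{
    const char *cursor;

    if ( 'k' != field[0] && 'u' != field[0] )
    {
        return -1;
    }
    cursor = sketch_parse_decimal (field + 1, &version->major);
    if ( NULL == cursor || '.' != *cursor )
    {
        return -1;
    }
    cursor = sketch_parse_decimal (cursor + 1, &version->minor);
    if ( NULL == cursor || '\0' != *cursor )
    {
        return -1; /* missing minor number or trailing characters */
    }
    version->type = field[0];
    return 0;
}
```
+
+A loader built this way can reject a malformed line and move on to the
+next one, in keeping with the requirement that invalid entries be
+ignored.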
+
+This file format is described by the following grammar:
+
+<entry-list> -> <entry> <entry-list> | <eof>
+<entry> -> <ia-name> <api-ver> <thread-safety> <default-section>
+ <lib-path> <driver-ver> <ia-params> [<eor>|<eof>] |
+ [<eor>|<eof>]
+<ia-name> -> string
+<api-ver> -> [k|u]decimal.decimal
+<thread-safety> -> [threadsafe|nonthreadsafe]
+<default-section> -> [default|nondefault]
+<lib-path> -> string
+<driver-ver> -> decimal.decimal
+<ia-params> -> string
+<eof> -> end of file
+<eor> -> newline
+
+The location of this file may be specified by setting the environment
+variable DAT_CONF. If DAT_CONF is not set, the default location will
+be /etc/dat.conf.
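+
+The lookup rule above amounts to the following sketch. The function
+and macro names are hypothetical; getenv is the standard C library
+call.
+
```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_DEFAULT_CONF "/etc/dat.conf"

/* Returns the static registry path: DAT_CONF if set, else the default. */
static const char *
sketch_static_registry_path (void)
{
    const char *path;

    path = getenv ("DAT_CONF");
    if ( NULL == path )
    {
        return SKETCH_DEFAULT_CONF;
    }
    return path;
}
```
+
+Note that this sketch treats an empty DAT_CONF value as set; whether
+the registry should do so is not specified here.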
+
+Windows Registry Format
+.......................
+
+Standardization of the Windows registry format is not complete at this
+time.
+
+Registry Database Data Structures
+.................................
+
+The Registry Database is implemented as a dictionary data structure that
+stores (key, value) pairs.
+
+Initially the dictionary will be implemented as a linked list. This
+will allow for an arbitrary number of mappings within the resource
+limits of a given system. Although the search algorithm will have O(n)
+worst case time when n elements are stored in the data structure, we
+do not anticipate this to be an issue. We believe that the number of
+IA names and providers will remain relatively small (on the order of
+10). If performance is found to be an issue, the dictionary can be
+re-implemented using another data structure without changing the
+Registry Database API.
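+
+A linked-list dictionary of this kind might look like the sketch
+below. The SKETCH_* names are hypothetical, the value is a void
+pointer standing in for a pointer to the registry entry, keys are not
+copied, and stored values are assumed to be non-NULL so that NULL can
+mean "not found".
+
```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct sketch_dict_node
{
    const char              *key;   /* IA name; not copied in this sketch */
    void                    *value; /* stand-in for a registry entry pointer */
    struct sketch_dict_node *next;
} SKETCH_DICT_NODE;

typedef struct sketch_dictionary
{
    SKETCH_DICT_NODE *head;
    int               size;
} SKETCH_DICTIONARY;

/* O(n) search; returns the stored value or NULL if the key is absent. */
static void *
sketch_dictionary_search (const SKETCH_DICTIONARY *dict, const char *key)
{
    const SKETCH_DICT_NODE *node;

    for ( node = dict->head; NULL != node; node = node->next )
    {
        if ( 0 == strcmp (node->key, key) )
        {
            return node->value;
        }
    }
    return NULL;
}

/* Returns 0 on success, -1 on a duplicate key or allocation failure. */
static int
sketch_dictionary_add (SKETCH_DICTIONARY *dict, const char *key, void *value)
{
    SKETCH_DICT_NODE *node;

    if ( NULL != sketch_dictionary_search (dict, key) )
    {
        return -1;
    }
    node = malloc (sizeof (*node));
    if ( NULL == node )
    {
        return -1;
    }
    node->key = key;
    node->value = value;
    node->next = dict->head;
    dict->head = node;
    dict->size++;
    return 0;
}
```
+
+Only search and add are shown. Callers see nothing of the list itself,
+so a different data structure could replace it behind the same two
+functions.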
+
+The dictionary uses IA name strings as keys and stores pointers to a
+DAT_REGISTRY_ENTRY structure, which contains the following
+information:
+
+ - provider library path string, library_path
+ - DAT_OS_LIBRARY_HANDLE, library_handle
+ - IA parameter string, ia_params
+ - DAT_IA_OPEN_FUNC function pointer, ia_open_func
+ - thread safety indicator, is_thread_safe
+ - reference counter, ref_count
+
+The entire registry database data structure is protected by a single
+lock. All threads that wish to query/modify the database must possess
+this lock. Serializing access in this manner is not expected to have a
+detrimental effect on performance as contention is expected to be
+minimal.
+
+An important property of the registry is that entries may be inserted
+into the registry, but no entries are ever removed. The contents of
+the static registry are used to populate the initially empty registry
+database. Since these mappings are by definition persistent, no
+mechanism is provided to remove them from the registry database.
+
+NOTE: There is currently no DAT interface to set a provider's IA
+specific parameters. A solution for this problem has been proposed for
+uDAPL 1.1.
+
+Registry Database API
+.....................
+
+The static variable Dat_Registry_Db is used to store information about
+the Registry Database and has the following members:
+
+ - lock
+ - dictionary
+
+The Registry Database is accessed via the following internal API:
+
+Algorithm: dat_registry_init
+ Input: void
+ Output: DAT_RETURN
+{
+ initialize Dat_Registry_Db
+
+ dat_os_sr_load()
+}
+
+Algorithm: dat_registry_insert
+ Input: IN const DAT_STATIC_REGISTRY_ENTRY sr_entry
+ Output: DAT_RETURN
+{
+ dat_os_lock(&Dat_Registry_Db.lock)
+
+ create and initialize DAT_REGISTRY_ENTRY structure
+
+ dat_dictionary_add(&Dat_Registry_Db.dictionary, &entry)
+
+ dat_os_unlock(&Dat_Registry_Db.lock)
+}
+
+Algorithm: dat_registry_search
+ Input: IN const DAT_NAME_PTR ia_name
+ OUT DAT_REGISTRY_ENTRY **entry
+ Output: DAT_RETURN
+{
+ dat_os_lock(&Dat_Registry_Db.lock)
+
+ entry gets dat_dictionary_search(&Dat_Registry_Db.dictionary, &ia_name)
+
+ dat_os_unlock(&Dat_Registry_Db.lock)
+}
+
+Algorithm: dat_registry_list
+ Input: IN DAT_COUNT max_to_return
+ OUT DAT_COUNT *entries_returned
+ OUT DAT_PROVIDER_INFO *(dat_provider_list[])
+ Output: DAT_RETURN
+{
+ dat_os_lock(&Dat_Registry_Db.lock)
+
+ size = dat_dictionary_size(Dat_Registry_Db.dictionary)
+
+ for ( i = 0, j = 0;
+ (i < max_to_return) && (j < size);
+ i++, j++ )
+ {
+ initialize dat_provider_list[i] w/ j-th element in dictionary
+ }
+
+ dat_os_unlock(&Dat_Registry_Db.lock)
+
+ *entries_returned = i;
+}
+
+Provider API pseudo-code
+------------------------
+
++ dat_registry_add_provider()
+
+Algorithm: dat_registry_add_provider
+ Input: IN DAT_PROVIDER *provider
+ Output: DAT_RETURN
+{
+ dat_init()
+
+ dat_registry_search(provider->device_name, &entry)
+
+ if IA name is not found then dat_registry_insert(new entry)
+
+ if entry.ia_open_func is not NULL return an error
+
+ entry.ia_open_func = provider->ia_open_func
+}
+
++ dat_registry_remove_provider()
+
+Algorithm: dat_registry_remove_provider
+ Input: IN DAT_PROVIDER *provider
+ Output: DAT_RETURN
+{
+ dat_init()
+
+ dat_registry_search(provider->device_name, &entry)
+
+ if IA name is not found return an error
+
+ entry.ia_open_func = NULL
+}
+
+Consumer API pseudo-code
+------------------------
+
+* dat_ia_open()
+
+This function looks up the specified IA name in the registry database,
+loads the provider library, retrieves a function pointer to the
+provider's IA open function from the registry entry, and calls that
+function.
+
+Algorithm: dat_ia_open
+ Input: IN const DAT_NAME_PTR name
+ IN DAT_COUNT async_event_qlen
+ INOUT DAT_EVD_HANDLE *async_event_handle
+ OUT DAT_IA_HANDLE *ia_handle
+ Output: DAT_RETURN
+
+{
+ dat_registry_search(name, &entry)
+
+ if the name is not found return an error
+
+ dat_os_library_load(entry.library_path, &entry.library_handle)
+
+ if the library fails to load return an error
+
+ if the entry's ia_open_func is invalid
+ {
+ dat_os_library_unload(entry.library_handle)
+ return an error
+ }
+
+ (*entry.ia_open_func) (name,
+ async_event_qlen,
+ async_event_handle,
+ ia_handle);
+}
+
+* dat_ia_close()
+
+Algorithm: dat_ia_close
+ Input: IN DAT_IA_HANDLE ia_handle
+ IN DAT_CLOSE_FLAGS ia_flags
+ Output: DAT_RETURN
+{
+ provider = DAT_HANDLE_TO_PROVIDER(ia_handle)
+
+ (*provider->ia_close_func) (ia_handle, ia_flags)
+
+ dat_registry_search(provider->device_name, &entry)
+
+ dat_os_library_unload(entry.library_handle)
+}
+
++ dat_registry_list_providers()
+
+Algorithm: dat_registry_list_providers
+ Input: IN DAT_COUNT max_to_return
+ OUT DAT_COUNT *entries_returned
+ OUT DAT_PROVIDER_INFO *(dat_provider_list[])
+ Output: DAT_RETURN
+{
+ validate parameters
+
+ dat_registry_list(max_to_return, entries_returned, dat_provider_list)
+}
+
+Platform Specific API pseudo-code
+--------------------------------
+
+Below are descriptions of platform specific functions required by the
+DAT Registry. These descriptions are for Linux.
+
+Each entry in the static registry is represented by an OS specific
+structure, DAT_OS_STATIC_REGISTRY_ENTRY. On Linux, this structure will
+have the following members:
+
+ - IA name string
+ - API version
+ - thread safety
+ - default section
+ - library path string
+ - driver version
+ - IA parameter string
+
+The tokenizer will return a DAT_OS_SR_TOKEN structure
+containing:
+
+ - DAT_OS_SR_TOKEN_TYPE value
+ - string with the field's value
+
+The tokenizer will ignore all white space and comments. The tokenizer
+will also translate any escape sequences found in a string.
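+
+A tokenizer with this behaviour could be sketched as below. For
+simplicity it reads from an in-memory string rather than a file
+descriptor, and every SKETCH_* name and the error token type are
+assumptions. An unrecognised escape inside quotes is kept literally.
+
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum sketch_token_type
{
    SKETCH_TOKEN_STRING,
    SKETCH_TOKEN_EOR,   /* end of record (newline) */
    SKETCH_TOKEN_EOF,
    SKETCH_TOKEN_ERROR  /* unterminated quote or value too long */
} SKETCH_TOKEN_TYPE;

/*
 * Read one token starting at *cursor and advance *cursor past it.
 * String tokens are copied, with escapes translated, into value.
 * value_size must be at least 1.
 */
static SKETCH_TOKEN_TYPE
sketch_token_next (const char **cursor, char *value, size_t value_size)
{
    const char *p;
    size_t      len;
    int         quoted;

    p = *cursor;
    len = 0;

    while ( ' ' == *p || '\t' == *p )
    {
        p++; /* skip blanks between fields */
    }
    if ( '#' == *p )
    {
        p += strcspn (p, "\n"); /* skip a comment, but not its newline */
    }
    if ( '\0' == *p )
    {
        *cursor = p;
        return SKETCH_TOKEN_EOF;
    }
    if ( '\n' == *p )
    {
        *cursor = p + 1;
        return SKETCH_TOKEN_EOR;
    }

    quoted = ('"' == *p);
    if ( quoted )
    {
        p++;
    }
    for ( ;; )
    {
        if ( quoted && '\0' == *p )
        {
            return SKETCH_TOKEN_ERROR;
        }
        if ( quoted && '"' == *p )
        {
            p++; /* consume the closing quote */
            break;
        }
        if ( !quoted && NULL != strchr (" \t\n#", *p) )
        {
            break; /* strchr also matches the terminating '\0' */
        }
        if ( quoted && '\\' == *p && ('\\' == p[1] || '"' == p[1]) )
        {
            p++; /* drop the backslash, keep the escaped character */
        }
        if ( len + 1 >= value_size )
        {
            return SKETCH_TOKEN_ERROR;
        }
        value[len] = *p;
        len++;
        p++;
    }
    value[len] = '\0';
    *cursor = p;
    return SKETCH_TOKEN_STRING;
}
```
+
+Repeated calls on one line yield a string token per field followed by
+an end-of-record token, which is the sequence the load algorithm below
+expects.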
+
+Algorithm: dat_os_sr_load
+ Input: n/a
+ Output: DAT_RETURN
+{
+ if DAT_CONF environment variable is set
+ static_registry_file = value of DAT_CONF
+ else
+ static_registry_file = /etc/dat.conf
+
+ sr_fd = dat_os_open(static_registry_file)
+
+ forever
+ {
+ initialize DAT_OS_STATIC_REGISTRY_ENTRY entry
+
+ do
+ {
+ // discard blank lines
+ dat_os_token_next(sr_fd, &token)
+ } while token is newline
+
+ if token type is EOF then break // all done
+ // else the token must be a string
+
+ entry.ia_name = token.value
+
+ dat_os_token_next(sr_fd, &token)
+
+ if token type is EOF then break // all done
+ else if token type is not string then
+ {
+ // an error has occurred
+ dat_os_token_sync(sr_fd)
+ continue
+ }
+ else if ( dat_os_convert_api(token.value, &entry.api) fails )
+ {
+ // an error has occurred
+ dat_os_token_sync(sr_fd)
+ continue
+ }
+
+ dat_os_token_next(sr_fd, &token)
+
+ if token type is EOF then break // all done
+ else if token type is not string then
+ {
+ // an error has occurred
+ dat_os_token_sync(sr_fd)
+ continue
+ }
+ else if ( dat_os_convert_thread_safety(token.value, &entry.thread_safety) fails )
+ {
+ // an error has occurred
+ dat_os_token_sync(sr_fd)
+ continue
+ }
+
+ dat_os_token_next(sr_fd, &token)
+
+ if token type is EOF then break // all done
+ else if token type is not string then
+ {
+ // an error has occurred
+ dat_os_token_sync(sr_fd)
+ continue
+ }
+ else if ( dat_os_convert_default(token.value, &entry.default) fails )
+ {
+ // an error has occurred
+ dat_os_token_sync(sr_fd)
+ continue
+ }
+
+ dat_os_token_next(sr_fd, &token)
+
+ if token type is EOF then break // all done
+ else if token type is not string then
+ {
+ // an error has occurred
+ dat_os_token_sync(sr_fd)
+ continue
+ }
+
+ entry.lib_path = token.value
+
+ dat_os_token_next(sr_fd, &token)
+
+ if token type is EOF then break // all done
+ else if token type is not string then
+ {
+ // an error has occurred
+ dat_os_token_sync(sr_fd)
+ continue
+ }
+ else if ( dat_os_convert_driver_version(token.value, &entry.driver_version) fails )
+ {
+ // an error has occurred
+ dat_os_token_sync(sr_fd)
+ continue
+ }
+
+ dat_os_token_next(sr_fd, &token)
+
+ if token type is EOF then break // all done
+ else if token type is not string then
+ {
+ // an error has occurred
+ dat_os_token_sync(sr_fd)
+ continue
+ }
+
+ entry.ia_params = token.value
+
+ dat_os_token_next(sr_fd, &token)
+
+ if token type is EOF then break // all done
+ else if token type is not newline then
+ {
+ // an error has occurred
+ dat_os_token_sync(sr_fd)
+ continue
+ }
+
+ if ( dat_os_sr_is_valid(entry) )
+ {
+ dat_registry_insert(entry)
+ }
+ }
+
+ dat_os_close(sr_fd)
+}
+
+Algorithm: dat_os_library_load
+ Input: IN const DAT_NAME_PTR *library_path
+ OUT DAT_LIBRARY_HANDLE *library_handle
+ Output: DAT_RETURN
+{
+ *library_handle = dlopen(library_path, RTLD_NOW);
+}
+
+Algorithm: dat_os_library_unload
+ Input: IN const DAT_LIBRARY_HANDLE library_handle
+ Output: DAT_RETURN
+{
+ dlclose(library_handle)
+}
--- /dev/null
+#######################################################################
+# #
+# DAPL Shared Memory Design #
+# #
+# James Lentini #
+# jlentini at users.sourceforge.net #
+# #
+# Created 09/17/2002 #
+# Updated 01/21/2005 #
+# Version 0.04 #
+# #
+#######################################################################
+
+
+Contents
+--------
+0. Introduction
+1. Referenced Documents
+2. Requirements
+3. Interface
+4. Implementation Options
+
+
+Introduction
+------------
+
+This document describes the design of shared memory registration for
+the DAPL reference implementation (RI).
+
+Implementing shared memory support completely within the DAPL RI
+would not be an ideal solution. A more robust and efficient
+implementation can be achieved by HCA vendors that integrate a DAT
+provider into their software stack. Therefore the RI will not contain
+an implementation of this feature.
+
+
+Referenced Documents
+--------------------
+
+kDAPL: Kernel Direct Access Programming Library, Version 1.2.
+uDAPL: User Direct Access Programming Library, Version 1.2.
+Available in the DAPL SourceForge repository at
+[doc/api/kDAPL_spec.pdf] and [doc/api/uDAPL_spec.pdf]. Collectively
+referred to in this document as the "DAT Specification".
+
+InfiniBand Access Application Programming Interface Specification,
+Version 1.2, 4/15/2002. Available in the DAPL SourceForge repository
+at [doc/api/IBM_access_api.pdf]. Referred to in this document as the
+"IBM Access API Specification".
+
+Mellanox IB-Verbs API (VAPI) Mellanox Software Programmers Interface
+for InfiniBand Verbs. Available in the DAPL SourceForge repository
+at [doc/api/MellanoxVerbsAPI.pdf]. Referred to in this document as the
+"VAPI API Specification".
+
+InfiniBand Architecture Specification, Volumes 1 and 2, Release
+1.2. Available from http://www.infinibandta.org/.
+Referred to in this document as the "Infiniband Specification".
+
+
+Requirements
+------------
+
+The DAT shared memory model can be characterized as a peer-to-peer
+model since the order in which consumers register a region is not
+dictated by the programming interface.
+
+The DAT API function used to register shared memory is:
+
+DAT_RETURN
+dat_lmr_create (
+ IN DAT_IA_HANDLE ia_handle,
+ IN DAT_MEM_TYPE mem_type,
+ IN DAT_REGION_DESCRIPTION region_description,
+ IN DAT_VLEN length,
+ IN DAT_PZ_HANDLE pz_handle,
+ IN DAT_MEM_PRIV_FLAGS mem_privileges,
+ OUT DAT_LMR_HANDLE *lmr_handle,
+ OUT DAT_LMR_CONTEXT *lmr_context,
+ OUT DAT_RMR_CONTEXT *rmr_context,
+ OUT DAT_VLEN *registered_length,
+ OUT DAT_VADDR *registered_address );
+
+where a DAT_REGION_DESCRIPTION is defined as:
+
+typedef union dat_region_description
+{
+ DAT_PVOID for_va;
+ DAT_LMR_HANDLE for_lmr_handle;
+ DAT_SHARED_MEMORY for_shared_memory;
+} DAT_REGION_DESCRIPTION;
+
+In the case of a shared memory registration the DAT consumer will set
+the DAT_MEM_TYPE flag to DAT_MEM_TYPE_SHARED_VIRTUAL and place a
+cookie in the DAT_REGION_DESCRIPTION union's DAT_SHARED_MEMORY
+member. The DAT_SHARED_MEMORY type is defined as follows:
+
+typedef struct dat_shared_memory
+{
+ DAT_PVOID virtual_address;
+ DAT_LMR_COOKIE shared_memory_id;
+} DAT_SHARED_MEMORY;
+
+Unlike the DAT peer-to-peer model, the Infiniband shared memory model
+requires a master-slave relationship. A memory region must first be
+registered using the Register Memory Region verb with subsequent
+registrations made using the Register Shared Memory Region verb.
+
+The latter is implemented in the IBM OS Access API as:
+
+ib_int32_t
+ib_mr_shared_register_us(
+ ib_hca_handle_t hca_handle,
+ ib_mr_handle_t *mr_handle, /* IN-OUT: could be changed */
+ ib_pd_handle_t pd_handle, /* IN */
+ ib_uint32_t access_control, /* IN */
+ ib_uint32_t *l_key, /* OUT */
+ ib_uint32_t *r_key, /* OUT: if remote access needed */
+ ib_uint8_t **va ); /* IN-OUT: virt. addr. to register */
+
+The important parameter is the memory region handle which must be the
+same as an already registered region.
+
+Two requirements are implied by this difference between the DAT and
+Infiniband models. First, DAPL implementations need a way to determine
+the first registration of a shared region. Second, implementations must
+map DAT_LMR_COOKIE values to memory region handles both within and
+across processes. To satisfy the above requirements DAPL must maintain
+this information in a system wide database.
+
+The difficulty of implementing such a database at the DAT provider
+level is the reason the RI's shared memory code is meant to be
+temporary. Such a database is much better suited as part of the HCA
+vendor's software stack, specifically as part of their HCA driver.
+
+If DAPL were based on a master-slave model like InfiniBand, the
+implementation of shared memory would be straightforward.
+Specifically, the complexity is a result of the consumer being
+responsible for specifying the DAT_LMR_COOKIE values. If the DAPL
+spec. were changed to allow the provider and not the consumer to
+specify the DAT_LMR_COOKIE value, the implementation of this feature
+would be greatly simplified. Since the DAPL API already requires
+consumers to communicate the DAT_LMR_COOKIE values between processes,
+such a change places minimal additional requirements on the
+consumer. The dapl_lmr_query call could easily be adapted to allow the
+consumer to query the provider for a given LMR's DAT_LMR_COOKIE
+value. The only spec changes needed would be to add a DAT_LMR_COOKIE
+member to the DAT_LMR_PARAM structure and a DAT_LMR_FIELD_LMR_COOKIE
+constant to the DAT_LMR_PARAM_MASK enumeration. A provider could then
+store the given LMR's memory region handle in this value, greatly
+simplifying the implementation of shared memory in DAPL.
+
+
+Interface
+---------
+
+To allow the database implementation to easily change, the RI would use
+a well defined interface between the memory subsystem and the
+database. Conceptually the database would contain a single table with
+the following columns:
+
+[ LMR Cookie ][ MR Handle ][ Reference Count ][ Initialized ]
+
+where the LMR Cookie column is the primary key.
+
+The following functions would be used to access the database:
+
+DAT_RETURN
+dapls_mrdb_init (
+ void );
+
+ Called by dapl_init(.) to perform any necessary database
+ initialization.
+
+DAT_RETURN
+dapls_mrdb_exit (
+ void );
+
+ Called by dapl_fini(.) to perform any necessary database cleanup.
+
+DAT_RETURN
+dapls_mrdb_record_insert (
+ IN DAPL_LMR_COOKIE cookie );
+
+ If there is no record for the specified cookie, an empty record is
+ added with a reference count of 1 and the initialized field is set to
+ false. If a record already exists, the function returns an error.
+
+DAT_RETURN
+dapls_mrdb_record_update (
+ IN DAPL_LMR_COOKIE cookie,
+ IN ib_mr_handle_t mr_handle );
+
+ If there is a record for the specified cookie, the MR handle field is
+ set to the specified mr_handle value and the initialized field is set
+ to true. Otherwise an error is returned.
+
+DAT_RETURN
+dapls_mrdb_record_query (
+ IN DAPL_LMR_COOKIE cookie,
+ OUT ib_mr_handle_t *mr_handle );
+
+ If there is a record for the specified cookie and the initialized
+ field is true, the MR handle field is returned and the reference
+ count field is incremented. Otherwise an error is returned.
+
+DAT_RETURN
+dapls_mrdb_record_dec (
+ IN DAPL_LMR_COOKIE cookie );
+
+ If there is a record for the specified cookie, the reference count
+ field is decremented. If the reference count is zero after the
+ decrement, the record is removed from the database. If there is no
+ record for the specified cookie, an error is returned.
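
The semantics of the four record functions can be illustrated with a
minimal single-process sketch. It is not the RI's implementation: the
table lives in ordinary process memory, there is no locking, and the
names mrdb_cookie_t, mrdb_handle_t, MRDB_OK, MRDB_ERR and
MRDB_MAX_RECORDS are hypothetical stand-ins for DAPL_LMR_COOKIE,
ib_mr_handle_t and DAT_RETURN values.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for DAPL_LMR_COOKIE and ib_mr_handle_t. */
typedef uint64_t mrdb_cookie_t;
typedef void    *mrdb_handle_t;
enum { MRDB_OK = 0, MRDB_ERR = -1, MRDB_MAX_RECORDS = 64 };

/* One row of the conceptual table; cookie is the primary key. */
struct mrdb_record
{
    bool          in_use;
    mrdb_cookie_t cookie;
    mrdb_handle_t mr_handle;
    unsigned int  ref_count;
    bool          initialized;
};

static struct mrdb_record mrdb_table[MRDB_MAX_RECORDS];

/* Return the record for cookie, or NULL if there is none. */
static struct mrdb_record *mrdb_find (mrdb_cookie_t cookie)
{
    size_t i;
    for (i = 0; i < MRDB_MAX_RECORDS; i++)
    {
        if (mrdb_table[i].in_use && mrdb_table[i].cookie == cookie)
        {
            return &mrdb_table[i];
        }
    }
    return NULL;
}

/* Add an empty record: reference count 1, not yet initialized. */
int mrdb_record_insert (mrdb_cookie_t cookie)
{
    size_t i;
    if (mrdb_find (cookie) != NULL)
    {
        return MRDB_ERR;
    }
    for (i = 0; i < MRDB_MAX_RECORDS; i++)
    {
        if (!mrdb_table[i].in_use)
        {
            mrdb_table[i].in_use      = true;
            mrdb_table[i].cookie      = cookie;
            mrdb_table[i].ref_count   = 1;
            mrdb_table[i].initialized = false;
            return MRDB_OK;
        }
    }
    return MRDB_ERR;    /* table full */
}

/* Store the MR handle and mark the record initialized. */
int mrdb_record_update (mrdb_cookie_t cookie, mrdb_handle_t mr_handle)
{
    struct mrdb_record *rec;
    rec = mrdb_find (cookie);
    if (rec == NULL)
    {
        return MRDB_ERR;
    }
    rec->mr_handle   = mr_handle;
    rec->initialized = true;
    return MRDB_OK;
}

/* Return the MR handle of an initialized record and take a reference. */
int mrdb_record_query (mrdb_cookie_t cookie, mrdb_handle_t *mr_handle)
{
    struct mrdb_record *rec;
    rec = mrdb_find (cookie);
    if (rec == NULL || !rec->initialized)
    {
        return MRDB_ERR;
    }
    *mr_handle = rec->mr_handle;
    rec->ref_count++;
    return MRDB_OK;
}

/* Drop a reference; remove the record when the count reaches zero. */
int mrdb_record_dec (mrdb_cookie_t cookie)
{
    struct mrdb_record *rec;
    rec = mrdb_find (cookie);
    if (rec == NULL)
    {
        return MRDB_ERR;
    }
    rec->ref_count--;
    rec->in_use = (rec->ref_count > 0);
    return MRDB_OK;
}
```

A query made between the insert and the update fails because the
initialized field is still false; this is the window in which a second
process must retry.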
+
+The generic algorithms for creating and destroying a shared memory
+region are:
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: CreateShared
+ Inputs:
+ ia_handle
+ pz_handle
+ address
+ length
+ lmr_cookie
+ privileges
+ Outputs:
+ lmr_handle
+ lmr_context
+ registered_address
+ registered_length
+
+forever
+{
+ if dapls_mrdb_record_insert(cookie) is successful
+ {
+ if dapl_lmr_create_virtual is not successful
+ dapls_mrdb_record_dec(cookie)
+ return error
+
+ else if dapls_mrdb_record_update(cookie, lmr->mr_handle) is not successful
+ dapls_mrdb_record_dec(cookie)
+ return error
+
+ else break
+ }
+ else if dapls_mrdb_record_query(cookie, mr_handle) is successful
+ {
+ if ib_mrdb_shared_register_us is not successful
+ dapls_mrdb_record_dec(cookie)
+ return error
+
+ else break
+ }
+}
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: FreeShared
+ Inputs:
+ lmr
+ Outputs:
+
+if dapls_ib_mr_deregister(lmr) is successful
+ dapls_mrdb_record_dec(lmr->cookie)
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+
+Implementation Options
+----------------------
+
+As described above the crucial functionality needed to implement
+shared memory support is a system wide database for mapping LMR
+cookies to memory region handles. The following designs represent some
+of the options for implementing such a database. Adding a database
+increases the complexity of DAPL from both an implementor's and a
+user's perspective. These designs should be evaluated on the degree
+to which they minimize the additional complexity while still
+providing a robust solution.
+
+
+ File System Database
+ --------------------
+
+Employing a database that is already part of the system would be
+ideal. One option on Linux is to use the file system. An area of the
+file system could be set aside for the creation of files to represent
+each LMR cookie. The area of the file system could be specified
+through a hard coded value, an environment variable, or a
+configuration file. A good candidate would be a DAPL subdirectory of
+/tmp.
+
+Exclusive file creation is available through the open(2) system call
+in Linux when the O_CREAT and O_EXCL flags are combined. The standard
+I/O interface (fopen(3), etc.) does not support this feature, making
+porting difficult. However, porting to other environments is not a
+goal of this design since the entire scheme is only a temporary
+solution.
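
A minimal sketch of exclusive creation follows, assuming open(2) with
the O_CREAT and O_EXCL flags is used. The function name
mrdb_file_insert, the directory argument and the hexadecimal file
naming scheme are hypothetical and chosen only for illustration.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Exclusively create the file that represents a cookie.
 * Returns 0 if this caller created the file and -1 otherwise;
 * errno is EEXIST when a file for the cookie already exists.
 */
int mrdb_file_insert (const char *dir, unsigned long long cookie)
{
    char path[4096];
    int  len;
    int  fd;

    /* Hypothetical naming scheme: the cookie as 16 hex digits. */
    len = snprintf (path, sizeof (path), "%s/%016llx", dir, cookie);
    if (len < 0 || (size_t) len >= sizeof (path))
    {
        errno = ENAMETOOLONG;
        return -1;
    }

    /* O_EXCL makes open() fail with EEXIST if the file exists. */
    fd = open (path, O_WRONLY | O_CREAT | O_EXCL, 0666);
    if (fd < 0)
    {
        return -1;
    }
    if (close (fd) != 0)
    {
        return -1;
    }
    return 0;
}
```

Only one of several processes racing to insert the same cookie sees a
return value of 0; the others see EEXIST and would fall back to the
query path.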
+
+Determining when to delete the files is a difficult problem. A
+reference count is required to properly remove a file once all the
+memory regions it represents are deregistered. The synchronization
+mechanism necessary for maintaining the reference count is not easily
+implemented. As an alternative, a script could be provided to clean up
+the database by removing all the files. The script would need to be
+run before any DAPL consumers were started to ensure a clean
+database. The disadvantage of using a script is that no DAPL instances
+can be running when it is used. Another option would be to store the
+process ID (PID) of the process that created the file as part of the
+file's contents. Upon finding a record for a given LMR cookie value, a
+DAPL instance could determine if there was a process with the same PID
+in the system. To accomplish this the kill(2) system call could be
+used (e.g. kill(pid, 0)). This method of validating the record assumes
+that all DAPL instances can signal one another and that the PID values
+do not wrap before the check is made.
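
The PID check can be sketched as below. The function name
mrdb_pid_is_alive is hypothetical. Treating EPERM as "process exists"
is a choice made for this sketch; it relaxes the assumption that all
DAPL instances can signal one another, but it does nothing about PID
reuse.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/*
 * Return 1 if a process with the given PID exists, 0 if it does not.
 * With signal number 0, kill() performs its existence and permission
 * checks but sends no signal.
 */
int mrdb_pid_is_alive (pid_t pid)
{
    if (kill (pid, 0) == 0)
    {
        return 1;
    }
    /* EPERM: the process exists but we may not signal it. */
    return (errno == EPERM) ? 1 : 0;
}
```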
+
+Another difficulty with this solution is choosing an accessible
+portion of the file system. The area must have permissions that allow
+all processes using DAPL to access and modify its contents. System
+administrators are typically reluctant to allow areas without any
+access controls. Typically such areas are on a dedicated file system
+of a minimal size to ensure that malicious or malfunctioning software
+does not monopolize the system's storage capacity. Since very little
+information will be stored in each file it is unlikely that DAPL would
+need a large amount of storage space even if a large number of shared
+memory regions were in use. However since a file is needed for each
+shared region, a large number of shared registrations may lead to the
+consumption of all a file system's inodes. Again since this solution
+is meant to be only temporary this constraint may be acceptable.
+
+There is also the possibility for database corruption should a process
+crash or deadlock at an inopportune time. If a process creates file x
+and then crashes, all other processes waiting for the memory handle to
+be written to x will fail.
+
+The database interface could be implemented as follows:
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_init
+ Inputs:
+
+ Outputs:
+
+return success
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_exit
+ Inputs:
+
+ Outputs:
+
+return success
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_record_insert
+ Inputs:
+ cookie
+ Outputs:
+
+file_name = convert cookie to valid file name
+
+fd = exclusively create file_name
+if fd is invalid
+ return failure
+
+if close fd fails
+ return failure
+
+return success
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_record_update
+ Inputs:
+ cookie
+ mr_handle
+ Outputs:
+
+file_name = convert cookie to valid file name
+
+fd = open file_name
+if fd is invalid
+ return failure
+
+if write mr_handle to file_name fails
+ return failure
+
+if close fd fails
+ return failure
+
+return success
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_record_query
+ Inputs:
+ cookie
+
+ Outputs:
+ mr_handle
+
+file_name = convert cookie to valid file name
+
+fd = open file_name
+if fd is invalid
+ return failure
+
+if read mr_handle from file_name fails
+ return failure
+
+if close fd fails
+ return failure
+
+return success
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_record_dec
+ Inputs:
+ cookie
+ Outputs:
+
+return success
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+
+ Daemon Database
+ ---------------
+
+The database could be maintained by a separate daemon process.
+The DAPL instances would act as clients of the daemon server and
+communicate with the daemon through the various IPC mechanisms
+available on Linux: Unix sockets, TCP/IP sockets, System V message
+queues, FIFOs, or RPCs.
+
+As with the file system based database, process crashes can potentially
+cause database corruption.
+
+While the portability of this implementation will depend on the chosen
+IPC mechanism, this approach will be at best Unix centric and possibly
+Linux specific.
+
+The database interface could be implemented as follows:
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_init
+ Inputs:
+
+ Outputs:
+
+initialize IPC mechanism
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_exit
+ Inputs:
+
+ Outputs:
+
+shutdown IPC mechanism
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_record_insert
+ Inputs:
+ cookie
+ Outputs:
+
+if send insert message for cookie fails
+ return error
+
+if receive insert response message fails
+ return error
+
+if insert success
+ return success
+else return error
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_record_update
+ Inputs:
+ cookie
+ mr_handle
+
+ Outputs:
+
+if send update message for cookie and mr_handle fails
+ return error
+else return success
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_record_query
+ Inputs:
+ cookie
+
+ Outputs:
+ mr_handle
+
+if send query message for cookie fails
+ return error
+
+else if receive query response message with mr_handle fails
+ return error
+
+else return success
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_record_dec
+ Inputs:
+ cookie
+ Outputs:
+
+if send decrement message for cookie fails
+ return error
+else return success
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+
+ Shared Memory Database
+ ----------------------
+
+The database could be maintained in an area of memory shared by all
+DAPL instances running on a system. Linux supports the System V shared
+memory functions shmget(2), shmctl(2), shmat(2), and shmdt(2). A hard
+coded key_t value could be used so that each DAPL instance attached to
+the same piece of shared memory. The size of the database would be
+constrained by the size of the shared memory region. Synchronization
+could be achieved by using atomic operations targeting memory in the
+shared region.
+
+Such a design would suffer from the corruption problems described
+above. If a process crashed there would be no easy way to clean up its
+locks and roll back the database to a consistent state.
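
As an illustration of synchronizing through atomic operations on
memory in the shared region, the sketch below implements a simple spin
lock. It assumes C11 <stdatomic.h> is available and that a
hypothetical struct mrdb_shm_header sits at the start of the region;
neither is part of the RI.

```c
#include <stdatomic.h>

/* Hypothetical header placed at the start of the shared region. */
struct mrdb_shm_header
{
    atomic_flag lock;
};

/* Try once to take the lock; returns 1 on success, 0 if it is held. */
int mrdb_shm_trylock (struct mrdb_shm_header *hdr)
{
    if (atomic_flag_test_and_set_explicit (&hdr->lock,
                                           memory_order_acquire))
    {
        return 0;
    }
    return 1;
}

/* Spin until the lock is acquired. */
void mrdb_shm_lock (struct mrdb_shm_header *hdr)
{
    while (!mrdb_shm_trylock (hdr))
    {
        /* busy wait */
    }
}

/* Release the lock. */
void mrdb_shm_unlock (struct mrdb_shm_header *hdr)
{
    atomic_flag_clear_explicit (&hdr->lock, memory_order_release);
}
```

The sketch also shows the weakness noted above: if the lock holder
crashes before calling mrdb_shm_unlock, the flag stays set and every
other process spins forever.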
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_init
+ Inputs:
+
+ Outputs:
+
+attach shared region
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_exit
+ Inputs:
+
+ Outputs:
+
+detach shared region
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_record_insert
+ Inputs:
+ cookie
+ Outputs:
+
+lock database
+
+if db does not contain cookie
+ add record for cookie
+
+unlock database
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_record_update
+ Inputs:
+ cookie
+ mr_handle
+ Outputs:
+
+lock database
+
+if db contains cookie
+ update record's mr_handle
+
+unlock database
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_record_query
+ Inputs:
+ cookie
+
+ Outputs:
+ mr_handle
+
+lock database
+
+if db contains cookie
+ set mr_handle to record's value
+ increment record's reference count
+
+unlock database
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_record_dec
+ Inputs:
+ cookie
+ Outputs:
+
+lock database
+
+if db contains cookie
+ decrement record's reference count
+
+ if reference count is 0
+ remove record
+
+unlock database
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+
+ Kernel Module Database
+ ----------------------
+
+If the DAT library were integrated with an HCA vendor's software
+stack, the database could be managed by the HCA driver. Placing the
+database in the kernel would alleviate the synchronization problems
+posed by multiple processes. Since memory registration operations
+already involve a transition into the kernel, no extra overhead would
+be incurred by this design.
+
+The RI could include a kernel module with this functionality as a
+temporary solution. The module could identify itself as a character
+device driver and communicate with user level processes through an
+ioctl(2). The driver could also create an entry in the proc file
+system to display the database's contents for diagnostic purposes.
+
+A major benefit of a kernel based implementation is that the database
+can remain consistent even in the presence of application
+errors. Since DAPL instances communicate with the driver by means of
+ioctl(2) calls on a file, the driver can arrange to be informed
+when the file is closed and perform any necessary actions. The driver
+is guaranteed to be notified of a close regardless of the manner in
+which the process exits.
+
+The database could be implemented as a dictionary using the LMR cookie
+values as keys.
+
+The following pseudo-code describes the functions needed by the kernel
+module and the database interface.
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: KernelModuleInit
+ Inputs:
+
+ Outputs:
+
+dictionary = create_dictionary()
+create_proc_entry()
+create_character_device_entry()
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: KernelModuleExit
+ Inputs:
+
+ Outputs:
+
+remove_character_device_entry()
+remove_proc_entry()
+free_dictionary(dictionary)
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: DeviceOpen
+ Inputs:
+ file
+
+ Outputs:
+
+dev_data = allocate device data
+
+file->private_data = dev_data
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: DeviceClose
+ Inputs:
+ file
+
+ Outputs:
+
+dev_data = file->private_data
+
+for each record in dev_data
+{
+ RecordDecIoctl
+}
+
+deallocate dev_data
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: RecordInsertIoctl
+ Inputs:
+ file
+ cookie
+
+ Outputs:
+
+lock dictionary
+
+if cookie is not in dictionary
+ insert cookie into dictionary
+
+
+unlock dictionary
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: RecordUpdateIoctl
+ Inputs:
+ file
+ cookie
+ mr_handle
+
+ Outputs:
+
+dev_data = file->private_data
+
+lock dictionary
+
+if cookie is in dictionary
+ add record reference to dev_data
+ update mr_handle
+
+unlock dictionary
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: RecordQueryIoctl
+ Inputs:
+ file
+ cookie
+
+ Outputs:
+ mr_handle
+
+dev_data = file->private_data
+
+lock dictionary
+
+if cookie is in dictionary
+ add record reference to dev_data
+ retrieve mr_handle
+
+unlock dictionary
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: RecordDecIoctl
+ Inputs:
+ file
+ cookie
+
+ Outputs:
+
+dev_data = file->private_data
+remove record reference from dev_data
+
+lock dictionary
+
+if cookie is in dictionary
+ decrement reference count
+ if reference count is 0
+ remove record
+
+unlock dictionary
+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_init
+ Inputs:
+
+ Outputs:
+
+fd = open device file
+
+if fd is invalid
+ return error
+else
+ return success
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_exit
+ Inputs:
+
+ Outputs:
+
+close fd
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_record_insert
+ Inputs:
+ cookie
+ Outputs:
+
+ioctl on fd
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_record_update
+ Inputs:
+ cookie
+ mr_handle
+ Outputs:
+
+ioctl on fd
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_record_query
+ Inputs:
+ cookie
+
+ Outputs:
+ mr_handle
+
+ioctl on fd
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ Function: dapls_mrdb_record_dec
+ Inputs:
+ cookie
+ Outputs:
+
+ioctl on fd
+
++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
--- /dev/null
+ Suggested Vendor-Specific Changes v. 0.92
+ -----------------------------------------
+
+=================
+Table of Contents
+=================
+
+* Table of Contents
+* Introduction
+* Referenced documents
+* Functionality Changes
+ + Missing Functionality
+ + dat_evd_resize
+ + Ordering guarantees on connect/disconnect.
+ + Shared memory
+ + dat_cr_handoff
+* Performance optimizations
+ + Reduction of context switches
+ [Many interrelated optimizations]
+ + Reducing copying of data
+ + Avoidance of s/g list copy on posting
+ + Avoidance of event data copy from CQ to EVD
+ + Elimination of locks
+ + Eliminating subroutine calls
+
+
+============
+Introduction
+============
+
+This document is a list of functionality enhancements and
+optimizations hardware vendors porting uDAPL may want to consider as
+part of their port. The functionality enhancements mentioned in this
+document are situations in which HCA Vendors, with their access to
+driver and verb-level source code, and their reduced portability
+concerns, are in a much better position than the reference
+implementation to implement portions of the uDAPL v. 1.0
+specification. (Additional areas in which the reference
+implementation, because of a lack of time or resources, did not fully
+implement the uDAPL 1.0 specification are not addressed in this file;
+see the file doc/dapl_unimplemented_functionality.txt, forthcoming).
+Vendors should be guided in their implementation of these
+functionality enhancements by their customers' need for the features
+involved.
+
+The optimizations suggested in this document have been identified by
+the uDAPL Reference Implementation team as areas in which performance
+may be improved by "breaching" the IB Verbs API boundary. They are
+inappropriate for the reference implementation (which has portability
+as one of its primary goals) but may be appropriate for a HCA-specific
+port of uDAPL. Note that no expected performance gain is attached to
+the suggested optimizations. This is intentional. Vendors should be
+guided in their performance improvements by performance evaluations
+done in the context of a representative workload, and the expected
+benefit from a particular optimization weighed against the cost in
+code complexity and scheduling, before the improvement is implemented.
+This document is intended to seed that process; it is not intended to
+be a roadmap for that process.
+
+We divide functionality changes into two categories:
+ * Areas in which functionality is lacking in the reference
+ implementation.
+ * Areas in which the functionality is present in the reference
+ implementation, but needs improvement.
+
+We divide performance improvements into three types:
+ * Reducing context switches
+ * Reducing copying of data (*)
+ * Eliminating subroutine calls
+
+(*) Note that the data referred to in "reducing copying of data" is
+the meta data describing an operation (e.g. scatter/gather list or
+event information), not the actual data to be transferred. No data
+transfer copies are required within the uDAPL reference
+implementation.
+
+====================
+Referenced Documents
+====================
+
+uDAPL: User Direct Access Programming Library, Version 1.0. Published
+6/21/2002. http://www.datcollaborative.org/uDAPL_062102.pdf.
+Referred to in this document as the "DAT Specification".
+
+InfiniBand Access Application Programming Interface Specification,
+Version 1.2, 4/15/2002. In DAPL SourceForge repository at
+doc/api/access_api.pdf. Referred to in this document as the "IBM
+Access API Specification".
+
+uDAPL Reference Implementation Event System Design. In DAPL
+SourceForge repository at doc/dapl_event_design.txt.
+
+uDAPL Reference Implementation Shared Memory Design. In DAPL
+SourceForge repository at doc/dapl_shared_memory_design.txt.
+
+uDAPL list of unimplemented functionality. In DAPL SourceForge
+repository at doc/dapl_unimplemented_functionality.txt (forthcoming).
+
+===========================================
+Suggested Vendor Functionality Enhancements
+===========================================
+
+Missing Functionality
+---------------------
+-- dat_evd_resize
+
+The uDAPL event system does not currently implement dat_evd_resize.
+The primary reason for this is that it is not currently possible to
+identify EVDs with the CQs that back them. Hence uDAPL must keep a
+separate list of events, and any changes to the size of that event
+list would require careful synchronization with all users of that EVD
+(see the uDAPL Event System design for more details). If the various
+vendor specific optimizations in this document were implemented that
+eliminated the requirement for the EVD to keep its own event list,
+dat_evd_resize might be easily implemented by a call or calls to
+ib_cq_resize.
+
+-- Ordering guarantees on connect/disconnect.
+
+The DAPL 1.1 specification specifies that if an EVD combines event
+streams for connection events and DTO events for the same endpoint,
+there is an ordering guarantee: the connection event on the AP occurs
+before any DTO events, and the disconnection event occurs after all
+successful DTO events. Since DTO events are provided by the IBM OS
+Access API through ib_completion_poll (in response to consumer
+request) and connection events are provided through callbacks (which
+may race with consumer requests) there is no trivial way to implement
+this functionality. The functionality may be implemented through
+under the table synchronizations between EVD and EP; specifically:
+ * The first time a DTO event is seen on an endpoint, if the
+ connection event has not yet arrived it is created and
+ delivered ahead of that DTO event.
+ * When a connection event is seen on an endpoint, if a
+ connection event has already been created for that endpoint
+ it is silently discarded.
+ * When a disconnection event is seen on an endpoint, it is
+ "held" until either: a) all expected DTO events for that
+ endpoint have completed, or b) a DTO marked as "flushed by
+ disconnect" is received. At that point it is delivered.
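
The three rules above can be sketched as a small per-endpoint state
machine. This is an illustration only; struct ep_order and the ep_*
functions are hypothetical names, and delivering the held
disconnection event ahead of the first flushed DTO is a choice made
for this sketch.

```c
#include <stdbool.h>

enum ev_kind { EV_CONNECTED, EV_DTO, EV_DISCONNECTED };

/* Hypothetical per-endpoint ordering state shared by EVD and EP. */
struct ep_order
{
    bool         conn_delivered;  /* connection event already delivered */
    bool         disc_held;       /* disconnection seen, not delivered */
    int          dto_outstanding; /* posted DTOs awaiting completion */
    enum ev_kind out[16];         /* events given to consumer, in order */
    int          out_count;
};

static void ep_deliver (struct ep_order *ep, enum ev_kind kind)
{
    if (ep->out_count < 16)
    {
        ep->out[ep->out_count] = kind;
        ep->out_count++;
    }
}

/* Record that a DTO was posted, so its completion is expected. */
void ep_post_dto (struct ep_order *ep)
{
    ep->dto_outstanding++;
}

/* Connection callback: discard it if one was already created. */
void ep_on_connected (struct ep_order *ep)
{
    if (!ep->conn_delivered)
    {
        ep->conn_delivered = true;
        ep_deliver (ep, EV_CONNECTED);
    }
}

/* Disconnection callback: hold the event while DTOs are outstanding. */
void ep_on_disconnected (struct ep_order *ep)
{
    if (ep->dto_outstanding == 0)
    {
        ep_deliver (ep, EV_DISCONNECTED);
    }
    else
    {
        ep->disc_held = true;
    }
}

/* DTO completion; flushed is true for "flushed by disconnect". */
void ep_on_dto (struct ep_order *ep, bool flushed)
{
    /* Create the connection event first if it has not arrived. */
    ep_on_connected (ep);
    if (ep->dto_outstanding > 0)
    {
        ep->dto_outstanding--;
    }
    if (ep->disc_held && flushed)
    {
        ep->disc_held = false;
        ep_deliver (ep, EV_DISCONNECTED);
    }
    ep_deliver (ep, EV_DTO);
    if (ep->disc_held && ep->dto_outstanding == 0)
    {
        ep->disc_held = false;
        ep_deliver (ep, EV_DISCONNECTED);
    }
}
```

A real implementation would also need to serialize these calls, since
connection callbacks may race with consumer polling.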
+
+Because of the complexity and performance overhead of implementing
+this feature, the DAPL 1.1 reference implementation has chosen to take
+the second approach allowed by the 1.1 specification: disallowing
+integration of connection and data transfer events on the same EVD.
+This finesses the problem, is in accordance with the specification, and
+is more closely aligned with the ITWG IT-API currently in development,
+which only allows a single event stream type for each simple EVD.
+However, other vendors may choose to implement the functionality
+described above in order to support more integration of event streams.
+
+-- Shared memory implementation
+
+The difficulties involved in the dapl shared memory implementation are
+fully described in doc/dapl_shared_memory_design.txt. To briefly
+recap:
+
+The uDAPL spec describes a peer-to-peer shared memory model; all uDAPL
+instances that want to share registration resources for a section of
+memory do so by providing the same cookie. No uDAPL
+instance is unique; all register their memory in the same way, and no
+communication between the instances is required.
+
+In contrast, the IB shared memory interface requires the first process
+to register the memory to do so using the standard memory registration
+verbs. All other processes after that must use the shared memory
+registration verb, and provide to that verb the memory region handle
+returned from the initial call. This means that the first process to
+register the memory must communicate the memory region handle it
+receives to all the other processes who wish to share this memory.
+This is a master-slave model of shared memory registration; the
+initial process (the master), is unique in its role, and it must tell
+the slaves how to register the memory after it.
+
+To translate between these two models, the uDAPL implementation
+requires some mapping between the shared cookie and the memory region
+handle. This mapping must be exclusive and must have inserts occur
+atomically with lookups (so that only one process can set the memory
+region handle; the others retrieve it). It must also track the
+deregistration of the shared memory, and the exiting of the processes
+registering the shared memory; when all processes have deregistered
+(possibly by exiting) it must remove the mapping from cookie to
+memory region handle.
+
+This mapping must obviously be shared between all uDAPL
+implementations on a given host. Implementing such a shared mapping
+is problematic in a pure user-space implementation (as the reference
+implementation is) but is expected to be relatively easy in vendor
+supplied uDAPL implementations, which will presumably include a
+kernel/device driver component. For this reason, we have chosen to
+leave this functionality unimplemented in the reference implementation.
+
+-- Implementation of dat_cr_handoff
+
+Given that the change of service point involves a change in associated
+connection qualifier, which has been advertised at the underlying
+Verbs/driver level, it is not clear how to implement this function
+cleanly within the reference implementation. We thus choose to defer
+it for implementation by the hardware vendors.
+
+=========================
+Performance Optimizations
+=========================
+
+
+Reduction of context switches
+-----------------------------
+Currently, three context switches are required on the standard
+uDAPL notification path. These are:
+ * Invocation of the hardware interrupt handler in the kernel.
+ Through this method the hardware notifies the CPU of
+ completion queue entries for operations that have requested
+ notification.
+ * Unblocking of the per-process IB provider service thread
+ blocked within the driver. This thread returns to
+ user-space within its process, where it causes
+ * Unblocking of the user thread blocked within the uDAPL entry
+ point (dat_evd_wait() or dat_cno_wait()).
+
+There are several reasons for the high number of context switches,
+specifically:
+ * The fact that the IB interface delivers notifications
+ through callbacks rather than through unblocking waiting
+ threads; this does not match uDAPL's blocking interface.
+ * The fact that the IB interface for blocking on a CQ doesn't
+ have a threshold. If it did, we could often convert a
+ dat_evd_wait() into a wait on that CQ.
+ * The lack of a parallel concept to the CNO within IB.
+
+These are all areas in which closer integration between the IB
+verbs/driver and uDAPL could allow the user thread to wait within the
+driver. This would allow the hardware interrupt thread to directly
+unblock the user thread, saving a context switch.
+
+The specific optimizations considered here are:
+ * Allow blocking on an IB CQ. This would allow removal of the
+ excess context switch for dat_evd_wait() in cases where
+ there is a 1-to-1 correspondence between an EVD and a CQ and
+ no threshold was passed to dat_evd_wait().
+ * Allow blocking on an IB CQ to take a threshold argument.
+ This would allow removal of the excess context switch for
+ dat_evd_wait() in cases where there is a 1-to-1
+ correspondence between an EVD and a CQ regardless of the
+ threshold value.
+ * Give the HCA device driver knowledge of and access to the
+ implementation of the uDAPL EVD, and implement dat_evd_wait()
+ as an ioctl blocking within the device driver. This would
+ allow removal of the excess context switch in all cases for
+ a dat_evd_wait().
+ * Give the HCA device driver knowledge of and access to the
+ implementation of the uDAPL CNO, and implement dat_cno_wait()
+ as an ioctl blocking within the device driver. This would
+ allow removal of the excess context switch in all cases for
+ a dat_cno_wait(), and could improve performance for blocking
+ on OS Proxy Wait Objects related to the uDAPL CNO.
+
+See the DAPL Event Subsystem Design (doc/dapl_event_design.txt) for
+more details on this class of optimization.
+
+========================
+Reducing Copying of Data
+========================
+
+There are two primary places in which a closer integration between the
+IB verbs/driver and the uDAPL implementation could reduce copying
+costs:
+
+-- Avoidance of s/g list copy on posting
+
+Currently there are two copies involved in posting a data transfer
+request in uDAPL:
+ * From the user context to uDAPL. This copy is required
+ because the scatter/gather list formats for uDAPL and IB
+ differ; a copy is required to change formats.
+ * From uDAPL to the WQE. This copy is required because IB
+ specifies that all user parameters are owned by the user
+ upon return from the IB call, and therefore IB must keep its
+ own copy for use during the data transfer operation.
+
+If the uDAPL data transfer dispatch operations were implemented
+directly on the IB hardware, these copies could be combined.
+
+-- Avoidance of Event data copy from CQ to EVD
+
+Currently there are two copies of data involved in receiving an event
+in a standard data transfer operation:
+ * From the CQ on which the IB completion occurs to an event
+ structure held within the uDAPL EVD. This is because the IB
+ verbs provide no way to discover how many elements have been
+ posted to a CQ. This copy is not
+ required for dat_evd_dequeue. However, dat_evd_wait
+ requires this copy in order to correctly implement the
+ threshold argument; the callback must know when to wake up
+ the waiting thread. In addition, copying all CQ entries
+ (not just the one to be returned) is necessary before
+ returning from dat_evd_wait in order to set the *nmore OUT
+ parameter.
+ * From the EVD into the event structure provided in the
+ dat_evd_wait() call. This copy is required because of the
+ DAT specification, which requires a user-provided event
+ structure to the dat_evd_wait() call in which the event
+ information will be returned. If dat_evd_wait() were
+ instead, for example, to hand back a pointer to the already
+ allocated event structure, that would eventually require the
+ event subsystem to allocate more event structures. This is
+ avoided in the critical path.
+
+A tighter integration between the IB verbs/driver and the uDAPL
+implementation would allow the avoidance of the first copy.
+Specifically, providing a way to get information as to the number of
+completions on a CQ would allow avoidance of that copy.
+
+See the uDAPL Event Subsystem Design for more details on this class of
+optimization.
+
+====================
+Elimination of Locks
+====================
+
+Currently there is only a single lock used on the critical path in the
+reference implementation, in dat_evd_wait() and dat_evd_dequeue().
+This lock is in place because the ib_completion_poll() routine is not
+defined as thread safe, and both dat_evd_wait() and dat_evd_dequeue()
+are. If there were some way for a vendor to make ib_completion_poll()
+thread safe without a lock (e.g. if the appropriate hardware/software
+interactions were naturally safe against races), and certain other
+modifications were made to the code, the lock might be removed.
+
+The modifications required are:
+ * Making racing consumers from DAPL ring buffers thread safe.
+ This is possible, but somewhat tricky; the key is to make
+ the interaction with the producer occur through a count of
+ elements on the ring buffer (atomically incremented and
+ decremented), but to dequeue elements with a separate atomic
+ pointer increment. The atomic modification of the element
+ count synchronizes with the producer and acquires the right
+ to do an atomic pointer increment to get the actual data.
+ The atomic pointer increment synchronizes with the other
+ consumers and actually gets the buffer data.
+ * The optimization described above for avoiding copies from
+ the CQ to the DAPL EVD Event storage queue. Without this
+ optimization a potential race between dat_evd_dequeue() and
+ dat_evd_wait() exists, in which dat_evd_dequeue() returns an
+ element later in the event stream than the one returned
+ from dat_evd_wait():
+
+    dat_evd_dequeue() caller        dat_evd_wait() caller
+    ------------------------        ---------------------
+    dat_evd_dequeue() called
+
+    EVD state checked; ok for
+    dat_evd_dequeue()
+                                    dat_evd_wait() called
+
+                                    State changed to reserve EVD
+                                    for dat_evd_wait()
+
+                                    Partial copy of CQ to EVD
+                                    Event store
+
+    Dequeue of CQE from CQ
+
+                                    Completion of copy of CQ to
+                                    EVD Event store
+
+                                    Return of first CQE copied
+                                    to EVD Event store.
+
+    Return of this CQE from the
+    middle of the copied stream.
+
+
+ If no copy occurs, dat_evd_wait() and dat_evd_dequeue() may
+ race, but if all operations on which they may race (access
+ to the EVD Event Queue and access to the CQ) are thread
+ safe, this race will cause no problems.
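+
+The two-step consumer scheme in the first item can be sketched as
+below. This is a minimal illustration using C11 <stdatomic.h>, not
+code from the reference implementation. The names (struct ring,
+ring_put, ring_get) are hypothetical, a single producer is assumed,
+and the use of a NULL slot to tell the producer that a slot may be
+reused is an addition that the text above does not describe.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SLOTS 8u   /* power of two, so index wrap-around is harmless */

/* Hypothetical ring of element pointers; a NULL slot is empty. */
struct ring
{
    _Atomic(void *) slot[RING_SLOTS];
    atomic_size_t   count;  /* elements published and not yet claimed */
    atomic_size_t   head;   /* next index to hand to a consumer */
    size_t          tail;   /* next index to fill; producer only */
};

static void
ring_init (struct ring *r)
{
    size_t i;

    for (i = 0; i < RING_SLOTS; i++)
    {
        atomic_init (&r->slot[i], NULL);
    }
    atomic_init (&r->count, 0);
    atomic_init (&r->head, 0);
    r->tail = 0;
}

/* Single producer only.  Returns false if the next slot is still in use. */
static bool
ring_put (struct ring *r, void *elem)
{
    size_t i;

    assert (elem != NULL);      /* NULL is reserved for "empty" */
    i = r->tail % RING_SLOTS;
    if (atomic_load (&r->slot[i]) != NULL)
    {
        return false;
    }
    atomic_store (&r->slot[i], elem);
    r->tail++;
    atomic_fetch_add (&r->count, 1);    /* publish to the consumers */
    return true;
}

/* Any number of racing consumers.  Returns NULL if the ring is empty. */
static void *
ring_get (struct ring *r)
{
    size_t n;
    size_t i;

    /* Step 1: synchronize with the producer by claiming one element. */
    n = atomic_load (&r->count);
    do
    {
        if (n == 0)
        {
            return NULL;
        }
    } while (!atomic_compare_exchange_weak (&r->count, &n, n - 1));

    /* Step 2: synchronize with other consumers; each gets its own index. */
    i = atomic_fetch_add (&r->head, 1) % RING_SLOTS;

    /* Take the element and mark the slot as reusable. */
    return atomic_exchange (&r->slot[i], NULL);
}
```

+
+All operations use the default sequentially consistent ordering,
+which keeps the sketch simple; a tuned version could likely relax
+some of them.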
+
+============================
+Eliminating Subroutine Calls
+============================
+
+This area is the simplest, as there are many DAPL calls on the
+critical path that are very thin veneers on top of their IB
+equivalents. All of these calls are candidates for being merged with
+those IB equivalents. Calls already covered by the optimizations
+described above (e.g. within the event subsystem, the data transfer
+operation posting code) are omitted; the remaining candidates are:
+ * dat_pz_create
+ * dat_pz_free
+ * dat_pz_query
+ * dat_lmr_create
+ * dat_lmr_free
+ * dat_lmr_query
+ * dat_rmr_create
+ * dat_rmr_free
+ * dat_rmr_query
+ * dat_rmr_bind
+
+
--- /dev/null
+#
+# DAT 1.1 configuration file
+#
+# Each entry should have the following fields:
+#
+# <ia_name> <api_version> <threadsafety> <default> <lib_path> \
+# <provider_version> <ia_params> <platform_params>
+#
+
+ia0 u1.1 nonthreadsafe default /usr/lib/libdapl.so ri.1.1 "ia_params" "pd_params"
+
+# Example for openib using the first Mellanox adapter, port 1 and port 2
+OpenIB1 u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "mthca0 1" ""
+OpenIB2 u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "mthca0 2" ""
+
--- /dev/null
+ DAT Environment Guide v. 0.01
+ -----------------------------
+
+The following environment variables affect the behavior of the DAT
+library:
+
+
+DAT_OVERRIDE
+------------
+ Value used as the static registry configuration file, overriding the
+ default location, /etc/dat.conf
+
+ Example: setenv DAT_OVERRIDE /path/to/my/private.conf
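+
+ The lookup described above amounts to the following sketch. The
+ helper names are hypothetical and are not part of the DAT library;
+ treating an empty value the same as an unset variable is also an
+ assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DAT_DEFAULT_CONF "/etc/dat.conf"

/* Hypothetical helper: choose between an override and the default. */
static const char *
conf_path_from (const char *override)
{
    if (override == NULL || override[0] == '\0')
    {
        return DAT_DEFAULT_CONF;
    }
    return override;
}

/* Hypothetical helper: path of the static registry configuration file. */
static const char *
registry_conf_path (void)
{
    return conf_path_from (getenv ("DAT_OVERRIDE"));
}

/* Usage check: the override wins whenever it is set. */
static void
registry_conf_path_check (void)
{
    assert (strcmp (conf_path_from (NULL), DAT_DEFAULT_CONF) == 0);
    assert (strcmp (conf_path_from ("/path/to/my/private.conf"),
                    "/path/to/my/private.conf") == 0);
}
```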
+
+
+DAT_DBG_TYPE
+------------
+
+ Value specifies which parts of the registry will print debugging
+ information. Valid values are:
+
+ DAT_OS_DBG_TYPE_ERROR = 0x1
+ DAT_OS_DBG_TYPE_GENERIC = 0x2
+ DAT_OS_DBG_TYPE_SR = 0x4
+ DAT_OS_DBG_TYPE_DR = 0x8
+ DAT_OS_DBG_TYPE_PROVIDER_API = 0x10
+ DAT_OS_DBG_TYPE_CONSUMER_API = 0x20
+ DAT_OS_DBG_TYPE_ALL = 0xff
+
+ or any combination of these. For example, you can use 0xC to get both
+ static and dynamic registry output.
+
+ Example: setenv DAT_DBG_TYPE 0xC
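+
+ Because the values are independent bits, combining and testing them
+ is plain mask arithmetic. The sketch below copies the values listed
+ above; the two helper functions are hypothetical illustrations, not
+ the registry's own parsing code.

```c
#include <assert.h>
#include <stdlib.h>

/* Values copied from the DAT_DBG_TYPE list above. */
#define DAT_OS_DBG_TYPE_ERROR        0x1
#define DAT_OS_DBG_TYPE_GENERIC      0x2
#define DAT_OS_DBG_TYPE_SR           0x4
#define DAT_OS_DBG_TYPE_DR           0x8
#define DAT_OS_DBG_TYPE_PROVIDER_API 0x10
#define DAT_OS_DBG_TYPE_CONSUMER_API 0x20
#define DAT_OS_DBG_TYPE_ALL          0xff

/* Hypothetical helper: turn the variable's text into a bit mask. */
static unsigned long
dbg_type_parse (const char *value)
{
    if (value == NULL)
    {
        return 0;
    }
    /* Base 0 accepts "0xC", "014" and "12" alike. */
    return strtoul (value, NULL, 0);
}

/* Hypothetical helper: is one message type selected by the mask? */
static int
dbg_type_enabled (unsigned long mask, unsigned long type)
{
    assert (type != 0);
    return (mask & type) != 0;
}
```

+
+ With DAT_DBG_TYPE set to 0xC, dbg_type_enabled() is true for
+ DAT_OS_DBG_TYPE_SR and DAT_OS_DBG_TYPE_DR and false for the other
+ individual types.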
+
+DAT_DBG_DEST
+------------
+
+ Value sets the output destination. Valid values are:
+
+ DAT_OS_DBG_DEST_STDOUT = 0x1
+ DAT_OS_DBG_DEST_SYSLOG = 0x2
+ DAT_OS_DBG_DEST_ALL = 0x3
+
+ For example, 0x3 will output to both stdout and the syslog.
+
--- /dev/null
+dat-linux3-ib0 0xfe80000000000000 0x0001730000003d11
+dat-linux5-ib0 0xfe80000000000000 0x0001730000003d91
+dat-linux6-ib0 0xfe80000000000000 0x0001730000009791