Arlin Davis [Mon, 8 Aug 2011 05:06:09 +0000 (22:06 -0700)]
dat: add definitions for MPI offloaded collectives in IB transport extensions
The collective extensions are designed to support MPI and general
multicast operations over IB fabrics that support offloaded collectives.
Where feasible, they come as close to MPI semantics as possible.
Unless otherwise stated, all members participating in a data collective
operation must call the associated collective routine for the data
transfer operation to complete. Unless otherwise stated, the
root collective member of a data operation will receive its own portion
of the collective data. In most cases, the root member can prevent
sending/receiving data when such operations would be redundant. When root
data is already "in place" the root member may set the send and/or receive
buffer pointer argument to NULL.
Unlike standard DAPL movement operations that require registered
memory and LMR objects, collective data movement operations employ
pointers to user-virtual address space that do not require
pre-registration by the application. From a resource usage point
of view, the API user should consider that the provider implementation
may perform memory registrations/deregistrations on behalf of the
application to accomplish a data transfer.
Most collective calls are asynchronous. Upon completion, an event
will be posted to the EVD specified when the collective was created.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Sat, 12 Feb 2011 19:46:08 +0000 (11:46 -0800)]
ucm: delay freeing of active side UD cm object in case RTU is dropped
The ucm was freeing the UD CM object too quickly so a retried REPLY
was dropped and the passive side never received the AH info via RTU.
Keep active side UD cm objects on work queue until QP is destroyed
so RTU can be resent if necessary.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Sat, 12 Feb 2011 19:36:35 +0000 (11:36 -0800)]
ucm: cm object needs to be on work queue before req sent on wire
With this delay in cm object queuing there is potential for replies
being dropped coming back with a NO MATCH. Start with INIT
state and queue it up, move to state REP_PENDING when
sending out on the wire to start request timer.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Fri, 21 Jan 2011 02:31:30 +0000 (18:31 -0800)]
common: reduce default max inline data size because of performance anomaly
Increasing max inline causes small message rates to decrease from
4M/sec when set to 64 to about 1M/sec when set to 400. This has
been observed on latest mlx4 adapters. Set default to 64 until resolved.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Mon, 17 Jan 2011 22:43:58 +0000 (14:43 -0800)]
ucm, scm: exchange max_qp_rd_atom and limit outstanding requests
exchange and add proper checking to limit outstanding
rdma reads and atomics. Use one of the reserve bytes
in CM message protocol to exchange limits and reset
EP attribute rdma_out and set QP RTS attribute properly.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
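
To illustrate the limit exchange described in the entry above, here is a minimal sketch. It is not the dapl code: the struct and function names (cm_limits, negotiate_rd_atom) are hypothetical, and only the idea of carrying the peer's max_qp_rd_atom in one byte and limiting outstanding RDMA reads/atomics to it comes from the commit text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: one reserved byte of the CM message carries the limit. */
struct cm_limits {
    uint8_t rd_atom_in; /* max incoming RDMA reads/atomics the sender accepts */
};

/* Clamp the locally requested outstanding depth to what the peer advertised. */
static uint8_t negotiate_rd_atom(uint8_t local_rdma_out,
                                 const struct cm_limits *peer)
{
    assert(peer != NULL);
    return local_rdma_out < peer->rd_atom_in ? local_rdma_out
                                             : peer->rd_atom_in;
}
```

The clamped value is what would be written back to the EP attribute and used when moving the QP to RTS.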
Arlin Davis [Tue, 7 Dec 2010 00:06:47 +0000 (16:06 -0800)]
libdat: static provider entries created for local SR database not freed
During load (dat_sr_init) the SR database is created with all dat.conf entries,
but the entries are never cleaned up during unload. Add new functions
dat_sr_remove_all() and dat_sr_remove() to clean up and deallocate the SR database
entries and the database itself via dat_sr_fini().
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Fri, 3 Dec 2010 22:56:21 +0000 (14:56 -0800)]
ucm: release UD cm objects after AH is exchanged to avoid duplicate request drops
When EP is in UD mode, AH resolution is handled with DAT connection semantics
connect and accept. Since AH info can be resolved for the same EPs you can
get false duplicate requests because a previous CR is still on the
CM processing list. The CM object will remain on the EP free list and not
be freed until EP is destroyed given the possibility of consumer accessing CR
private data buffer.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Fri, 3 Dec 2010 22:24:40 +0000 (14:24 -0800)]
ucm: hold lock when sending cm_msgs to sync timer start with packet send
releasing the lock after setting start timer and before
ucm_send could result in incorrect timeout on CM operations
if thread is scheduled out when releasing lock.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Fri, 3 Dec 2010 22:02:25 +0000 (14:02 -0800)]
ucm: add debugging to include process id for better scale up debug aids
Use part of the resv[] area of the cm_msg to include local and
remote process ids. Add more debug messages to help isolate
problems that occur when many processes are running.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Fri, 3 Dec 2010 18:25:46 +0000 (10:25 -0800)]
cma: disconnect can block for excessive times waiting for rdma_cm DREP timeout
rdma_cm uses the same timeout values for connect and disconnect
request/reply. Disconnect abrupt option allows DAT consumers to
specify a prompt disconnect with immediate event. If the remote
node goes down or is non-responsive a CM disconnect event could
take minutes. Add a time limit waiting for event and move EP to
disconnected state to prevent callback from issuing duplicate
disconnect event via callback. The EP to CM linking will
cleanup/cancel any pending events before destroying cm_id.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Fri, 22 Oct 2010 17:15:15 +0000 (10:15 -0700)]
scm, ucm: MPI spawn test on oversubscribed server taking excessive time to complete
Simultaneous DREQ processing from user and CM thread caused some improper
state change on UCM. State change can incorrectly change from FREE back
to DISC in certain corner cases. Add checking on internal disconnect call
to prevent double callback events and improper state change.
For SCM, a remote DREQ will shutdown socket which will cause POLLERR
on the disconnected FD. This will in turn cause the cm_thread to
wakeup continuously unnecessarily. Fix thread thrashing by moving
CM object to FREE state and removing object FD from pollfd array.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
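
The SCM part of the fix above, dropping a disconnected FD from the poll set so POLLERR stops waking the thread, can be sketched as follows. This is an illustration under assumptions, not the provider code: pollfd_remove is a hypothetical helper, and the real cm_thread may rebuild its pollfd array differently.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>

/*
 * Remove the entry for `fd` from a pollfd array by overwriting it with
 * the last entry. Returns the new entry count; order is not preserved.
 * If `fd` is not present the count is returned unchanged.
 */
static size_t pollfd_remove(struct pollfd *set, size_t count, int fd)
{
    size_t i;

    assert(set != NULL || count == 0);
    for (i = 0; i < count; i++) {
        if (set[i].fd == fd) {
            set[i] = set[count - 1];
            return count - 1;
        }
    }
    return count;
}
```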
ardavis [Tue, 19 Oct 2010 16:52:45 +0000 (09:52 -0700)]
common: dat_evd_dequeue (poll_cq) fails with invalid parameter after EP (qp) free
Failure occurred during Intel MPI spawn test on Windows.
The QP's need to be flushed and processed via EVD's during
the EP (QP) destroy to avoid an error on poll_cq. IBAL
provider was not moving to ERR state during QP destroy.
Better flush CQ processing was added and pushed down to the provider
level via dapls_ib_qp_free() where it can move QP to ERR, flush CQ,
and then free QP after flushing. Because there is no QP_ERR_FLUSH
state on a QP, the spin on poll_cq (until empty) after modify_qp
to ERR could return empty before all WQEs are flushed. This
could result in a CQE being added to CQ with an invalid QP reference.
So, an additional check was added to flush_evds for the recv_q to
poll_cq until all pending recvs are complete. For transmit_q there
is no guarantee that the posted work is signaled, so the best
that can be done is poll_cq until empty.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
ardavis [Mon, 11 Oct 2010 19:24:31 +0000 (12:24 -0700)]
ucm: allow configuration of CM burst (signal) threshold on posting
Add new DAPL_UCM_TX_BURST environment variable, default=50.
Every 50 posted send messages will signal an event, which
is 10 percent of the default 500 message limit.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
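
A minimal sketch of the burst threshold described in the entry above. Only the DAPL_UCM_TX_BURST variable name and the default of 50 come from the commit text; the helper names (tx_burst_threshold, tx_should_signal) and the fallback behavior on bad input are assumptions made for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define DEFAULT_TX_BURST 50 /* default stated in the commit message */

/*
 * Read the burst threshold from the environment. Falls back to the
 * default when the variable is unset, malformed, or not positive.
 */
static int tx_burst_threshold(void)
{
    const char *val = getenv("DAPL_UCM_TX_BURST");
    char *end;
    long n;

    if (val == NULL)
        return DEFAULT_TX_BURST;
    n = strtol(val, &end, 0);
    if (end == val || *end != '\0' || n <= 0 || n > INT_MAX)
        return DEFAULT_TX_BURST;
    return (int)n;
}

/*
 * Hypothetical helper: `posted` is the 1-based count of sends posted so
 * far. Returns 1 when this send should request a signaled completion.
 */
static int tx_should_signal(unsigned long posted, int burst)
{
    assert(burst > 0);
    return (posted % (unsigned long)burst) == 0;
}
```

With the default of 50, sends number 50, 100, 150, and so on are signaled and the rest are posted unsignaled.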
ardavis [Thu, 7 Oct 2010 18:14:03 +0000 (11:14 -0700)]
Allow DAPL out of band connection models to use ibacm to obtain
path record data. This will enable support for a wider range of
topologies, where the SL is required from the SA to prevent
deadlock.
DAPL will obtain path record data using rdma_getaddrinfo, provided
that IB ACM support is enabled. On failure, dapl will fall back to
using its default SL value. The IB ACM can be configured to cache
path information or always query the SA to ensure that the SL that is
obtained is current.
ibal provider calls ib_cm_drep in response to receiving
a dreq. The result is that the user's QP is transitioned
through the error state, which fails any outstanding send
operations and flushes all receives. The disconnect request
is then reported to the user.
Since a user can receive errors from the QP before they are
aware of a pending disconnect request, the application may
respond to the errors as, well, actual errors. Fix this by
delaying the QP transition until the user responds to the
dreq.
This fixes an error with Intel MPI running over the ibal
dapl provider with a 'spawn' test.
Arlin Davis [Wed, 22 Sep 2010 17:35:24 +0000 (10:35 -0700)]
ibal: delay QP transition during disconnect phase
The ibal provider calls ib_cm_drep in response to receiving
a dreq. The result is that the user's QP is transitioned
through the error state, which fails any outstanding send
operations and flushes all receives. The disconnect request
is then reported to the user.
Since a user can receive errors from the QP before they are
aware of a pending disconnect request, the application may
respond to the errors as, well, actual errors. Fix this by
delaying the QP transition until the user responds to the
dreq.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Arlin Davis [Mon, 20 Sep 2010 17:42:41 +0000 (10:42 -0700)]
common: restructure EVD processing to handle EP destruction phase
EVD processing in the common code will return unformatted events
if EP context is invalid as a result of destruction. During
EP destruction, add changes to flush EVD and process DTO completions
before the EP freeing is called. Simplified the locking in the
EVD code to eliminate the unnecessary and very confusing condition
checking of evd_producer_locking_needed.
New dapls_ep_flush_cqs() call created to synchronize flush and
event processing.
Unnecessary KDAPL code removed in the EVD processing.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Arlin Davis [Mon, 13 Sep 2010 23:19:44 +0000 (16:19 -0700)]
ibal: sync QP destruction and device close
Make QP destruction synchronous to ensure that no callbacks are
in progress for a QP after dapl has destroyed it. This fixes a
use after free error accessing the dapl ep structure from a qp
callback that results in an application crash.
Arlin Davis [Mon, 12 Jul 2010 22:57:34 +0000 (15:57 -0700)]
The linux compatibility header file _errno.h is moving out of verbs.h.
Include _errno.h in the windows osd header files, similar to how
errno.h is included in the linux osd header files.
Arlin Davis [Thu, 17 Jun 2010 19:40:21 +0000 (12:40 -0700)]
scm, ucm: add pkey, pkey_index, sl override for QP's
On a per open basis, add environment variables
DAPL_IB_SL and DAPL_IB_PKEY and use on
connection setup (QP modify) to override default
values of 0 for SL and PKEY index. If pkey is
provided then find the pkey index with
ibv_query_pkey for dev_attr.max_pkeys.
Will be used for RC and UD type QP's.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
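
The pkey-to-index lookup described in the entry above can be sketched as a linear scan over the port's pkey table. This is an illustration only: the callback type, the table-backed query, and find_pkey_index are hypothetical stand-ins so the example runs without a device. In the provider the per-index query would be ibv_query_pkey, and real code would also need to account for the byte order of the returned pkey.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a per-index pkey query; returns 0 on success. */
typedef int (*pkey_query_fn)(void *ctx, int index, uint16_t *pkey);

/* Demonstration backend: an in-memory pkey table instead of a device. */
struct pkey_table {
    const uint16_t *pkeys;
    int count;
};

static int table_query(void *ctx, int index, uint16_t *pkey)
{
    const struct pkey_table *t = ctx;

    if (index < 0 || index >= t->count)
        return -1;
    *pkey = t->pkeys[index];
    return 0;
}

/*
 * Scan indexes 0..max_pkeys-1 and return the first index whose pkey
 * matches `wanted`, or -1 if none matches.
 */
static int find_pkey_index(pkey_query_fn query, void *ctx,
                           int max_pkeys, uint16_t wanted)
{
    int i;

    assert(query != NULL);
    for (i = 0; i < max_pkeys; i++) {
        uint16_t pkey;

        if (query(ctx, i, &pkey) == 0 && pkey == wanted)
            return i;
    }
    return -1;
}
```

When no index matches, a caller would presumably keep the default pkey index of 0; that fallback is an assumption, not something the commit text states.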
Arlin Davis [Wed, 2 Jun 2010 17:05:03 +0000 (10:05 -0700)]
ucm: incorrectly freeing port on passive side after reject
cm_release was incorrectly freeing a client port
assuming it was the server listening port. Move
the listening port cleanup to remove_conn_listner
and only cleanup client ports in cm_release.
Arlin Davis [Wed, 19 May 2010 22:17:58 +0000 (15:17 -0700)]
dapltest: server info devicename is not large enough for dapl_name storage
Server info device name is an 80 char array but the dapl device name
that is copied is 256 bytes. Increase started_server.devicename definition.
Chalk one up for windows SDK OACR (auto code review).
Arlin Davis [Mon, 17 May 2010 23:15:21 +0000 (16:15 -0700)]
scm, cma: fini code can be called multiple times and hang via fork
The providers should protect against forked child exits and
not cleanup until the parent init actually exits. Otherwise,
the child will hang trying to cleanup dapl thread. Modify to
check process id for proper init to fini cleanup and limit
cleanup to parent only.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
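
The process id check described in the entry above can be sketched like this. The function names (provider_init, provider_fini) and the return convention are hypothetical; the only part taken from the commit text is recording the initializing process and limiting cleanup to it.

```c
#include <assert.h>
#include <sys/types.h>
#include <unistd.h>

static pid_t init_pid = (pid_t)-1; /* process that performed init */

static void provider_init(void)
{
    assert(init_pid == (pid_t)-1); /* sketch assumes a single init */
    init_pid = getpid();
}

/*
 * Returns 1 if cleanup ran, 0 if it was skipped because the caller is
 * not the process that ran provider_init() (for example a forked child).
 */
static int provider_fini(void)
{
    if (init_pid != getpid())
        return 0;
    /* ... stop provider threads and release resources here ... */
    init_pid = (pid_t)-1;
    return 1;
}
```

A forked child inherits init_pid from the parent, so its getpid() differs and its fini call becomes a no-op instead of waiting on a thread that does not exist in the child.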
Arlin Davis [Thu, 13 May 2010 17:31:17 +0000 (10:31 -0700)]
scm: SOCKOPT ERR Connection timed out on large clusters
With large scale all-to-all connections on 1000+ cores,
the listen backlog is reached and SYNs are dropped,
which causes the connect to time out. Retry connect
on timeout errors.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Mon, 10 May 2010 19:46:17 +0000 (12:46 -0700)]
ucm: UD mode, active side cm object released too soon, the RTU could be lost.
Will see following message with DAPL_DBG_TYPE set for Errors & Warnings (0x3):
ucm_recv: NO MATCH op REP 0x120 65487 i0x60005e c0x60005e < 0xd2 19824 0x60006a
The cm object was released on the active side after the connection
was established, RTU sent. This is a problem if the RTU is lost
and the remote side retries the REPLY. The RTU is never resent.
Keep the cm object until the EP is destroyed.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Mon, 10 May 2010 19:35:51 +0000 (12:35 -0700)]
cma, ucm: cleanup issues with dat_ep_free on a connected EP without disconnecting.
During EP free, disconnecting with ABRUPT close flag, the disconnect should wait
for the DISC event to fire to allow the CM to be properly destroyed upon return.
The cma must also release the lock when calling the blocking rdma_destroy_id given
the callback thread could attempt to acquire the lock for reference counting.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Tue, 27 Apr 2010 18:20:08 +0000 (11:20 -0700)]
scm: remove modify QP to ERR state during disconnect on UD type QP
The disconnect on a UD type QP should not modify QP to error
since this is a shared QP. The disconnect should be treated
as a NOP on the UD type QP and only be transitioned during
the QP destroy (dat_ep_free).
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Thu, 8 Apr 2010 16:38:57 +0000 (09:38 -0700)]
common: EP links to EVD, PZ incorrectly released before provider CM objects freed.
unlink/clear references after ALL CM objects linked to EP are freed.
Otherwise, event processing via CM objects could reference the handles
still linked to EP. After CM objects are freed (blocking) these handles
linked to EP are guaranteed not to be referenced by the underlying provider.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Wed, 7 Apr 2010 18:12:21 +0000 (11:12 -0700)]
common: remove unnecessary lmr lkey hashing and duplicate lkey checking
lmr lkey hashing is too restrictive given the returned lkey could be
the same value for different regions on some rdma devices. Actually,
this checking is really unnecessary and requires considerable overhead
for hashing so just remove hashing of lmr lkey's. Let verbs device
level do the checking and validation.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Tue, 16 Mar 2010 22:47:58 +0000 (14:47 -0800)]
ucm: fix issues with new EP to CM linking changes
Add EP locking around QP modify
Remove release during disconnect event processing
Add check in cm_free to check state and schedule thread if necessary.
Add some additional debugging
Add processing in disconnect_clean for conn_req timeout
Remove extra CR's
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Tue, 16 Mar 2010 17:44:44 +0000 (09:44 -0800)]
scm: new cm_ep linking broke UD mode over socket cm
Add EP locking around modify_qp for EP state.
Add new dapli_ep_check for debugging EP
Cleanup extra CR's
Change socket errno to dapl_socket_errno() abstraction
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Tue, 16 Mar 2010 17:12:11 +0000 (09:12 -0800)]
common: dat_ep_connect should not set timer on UD endpoints
connect for UD type is simply AH resolution and doesn't
need to be timed. The common code is not designed to handle
multiple timed events on connect requests so just ignore
timing UD AH requests.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Mon, 8 Mar 2010 20:53:45 +0000 (12:53 -0800)]
ibal: changes for EP to CM linking and synchronization.
Windows IBAL changes to allocate and manage CM objects
and to link them to the EP. This will ensure the CM
IBAL objects and cm_id's are not destroyed before the EP.
Remove windows only ibal_cm_handle in EP structure.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Wed, 24 Feb 2010 18:03:57 +0000 (10:03 -0800)]
common: add CM-EP linking to support multiple CM's and proper protection during destruction
Add linking for CM to EP, including reference counting, to ensure synchronization
during creation and destruction. A cm_list_head has been added to the EP object to
support multiple CM objects (UD) per EP. If the CM object is linked to an EP it
cannot be destroyed.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
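
The CM-to-EP linking described in the entry above can be sketched with a singly linked list and a reference count. Only cm_list_head is named in the commit text; the struct layouts and function names here are hypothetical, and the locking that the real code needs around these operations is omitted.

```c
#include <assert.h>
#include <stddef.h>

struct ep;

struct cm {
    struct cm *next;   /* next CM object linked to the same EP */
    struct ep *ep;     /* owning EP, NULL when unlinked */
    int ref_count;
};

struct ep {
    struct cm *cm_list_head; /* supports multiple CM objects (UD) per EP */
};

/* Link a CM object to an EP and take a reference for the link. */
static void cm_link(struct ep *ep, struct cm *cm)
{
    assert(ep != NULL && cm != NULL && cm->ep == NULL);
    cm->ep = ep;
    cm->next = ep->cm_list_head;
    ep->cm_list_head = cm;
    cm->ref_count++;
}

/* Unlink a CM object from its EP and drop the link's reference. */
static void cm_unlink(struct cm *cm)
{
    struct cm **pp;

    assert(cm != NULL && cm->ep != NULL);
    for (pp = &cm->ep->cm_list_head; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == cm) {
            *pp = cm->next;
            break;
        }
    }
    cm->next = NULL;
    cm->ep = NULL;
    cm->ref_count--;
}

/* A CM object may be destroyed only when unlinked and unreferenced. */
static int cm_can_destroy(const struct cm *cm)
{
    return cm->ep == NULL && cm->ref_count == 0;
}
```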