git.openfabrics.org - ~ardavis/dapl.git/log
~ardavis/dapl.git
13 years agoRelease 2.0.31 dapl-2.0.31-1
Arlin Davis [Fri, 10 Dec 2010 22:26:07 +0000 (14:26 -0800)]
Release 2.0.31

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
13 years agocommon: clean up build warning for unused variable event_ptr
Arlin Davis [Fri, 10 Dec 2010 22:19:45 +0000 (14:19 -0800)]
common: clean up build warning for unused variable event_ptr

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
13 years agoscm, ucm: set RAI_NOROUTE flag with rdma_getaddrinfo() call to avoid blocking.
Arlin Davis [Fri, 10 Dec 2010 21:49:47 +0000 (13:49 -0800)]
scm, ucm: set RAI_NOROUTE flag with rdma_getaddrinfo() call to avoid blocking.

if path is not returned, print warning message and use default SL.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
13 years agocma: definition for dapl_sp_remove_ep() is missing in cm.c
Arlin Davis [Fri, 10 Dec 2010 21:47:15 +0000 (13:47 -0800)]
cma: definition for dapl_sp_remove_ep() is missing in cm.c

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
13 years agolibdat: static provider entries created for local SR database not freed
Arlin Davis [Tue, 7 Dec 2010 00:06:47 +0000 (16:06 -0800)]
libdat: static provider entries created for local SR database not freed

During load (dat_sr_init), the SR database is created with all dat.conf entries,
but they are never cleaned up during unload. Add new dat_sr_remove_all() and
dat_sr_remove() functions to clean up and deallocate the SR database entries and
the database itself via dat_sr_fini().

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
13 years agolibdat: memory leak in static registration during parsing
Arlin Davis [Tue, 7 Dec 2010 00:02:13 +0000 (16:02 -0800)]
libdat: memory leak in static registration during parsing

The platform_params char string, allocated when
parsing dat.conf, is not freed.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
13 years agocommon: increase default IB inline send threshold to 400
Arlin Davis [Sat, 4 Dec 2010 00:13:31 +0000 (16:13 -0800)]
common: increase default IB inline send threshold to 400

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
13 years agocommon cq: a mixup of errno and the -1 return from poll in dapls_wait_comp_channel
Pradeep Satyanarayana [Fri, 3 Dec 2010 23:52:55 +0000 (15:52 -0800)]
common cq: a mixup of errno and the -1 return from poll in dapls_wait_comp_channel

call should return errno and not status returned from poll.

Signed-off-by: Pradeep Satyanarayana <pradeeps@linux.vnet.ibm.com>
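
The errno fix above can be illustrated with a minimal sketch. It is not the DAPL source: `wait_readable` is a hypothetical helper, and mapping a timeout to ETIMEDOUT is an assumption made only for this example.

```c
#include <errno.h>
#include <poll.h>

/* Hypothetical helper, not taken from DAPL: wait for fd to become
 * readable. Returns 0 when ready, ETIMEDOUT on timeout, or the errno
 * value set by poll() on failure (never the raw -1 status). */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int status = poll(&pfd, 1, timeout_ms);

    if (status > 0)
        return 0;          /* fd reported an event */
    if (status == 0)
        return ETIMEDOUT;  /* nothing arrived in time */
    return errno;          /* status == -1: report errno, not -1 */
}
```

Returning the -1 status directly would hand the caller a value that is not a valid error code, which is the mixup the commit title describes.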
13 years agoucm: release UD cm objects after AH is exchanged to avoid duplicate request drops
Arlin Davis [Fri, 3 Dec 2010 22:56:21 +0000 (14:56 -0800)]
ucm: release UD cm objects after AH is exchanged to avoid duplicate request drops

When an EP is in UD mode, AH resolution is handled with the DAT connection
semantics of connect and accept. Since AH info can be resolved for the same EPs,
you can get false duplicate requests because a previous CR is still on the
CM processing list. The CM object will remain on the EP free list and not
be freed until the EP is destroyed, given the possibility of the consumer
accessing the CR private data buffer.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
13 years agoucm: decrease timeout retry count for disconnect requests
Arlin Davis [Fri, 3 Dec 2010 22:52:26 +0000 (14:52 -0800)]
ucm: decrease timeout retry count for disconnect requests

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
13 years agoucm: hold lock when sending cm_msgs to sync timer start with packet send
Arlin Davis [Fri, 3 Dec 2010 22:24:40 +0000 (14:24 -0800)]
ucm: hold lock when sending cm_msgs to sync timer start with packet send

Releasing the lock after setting the start timer and before
ucm_send could result in an incorrect timeout on CM operations
if the thread is scheduled out when releasing the lock.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
13 years agoucm: add debugging to include process id for better scale up debug aids
Arlin Davis [Fri, 3 Dec 2010 22:02:25 +0000 (14:02 -0800)]
ucm: add debugging to include process id for better scale up debug aids

Use part of the resv[] area of the cm_msg to include local and
remote process ids. Add more debug messages to help isolate
problems in runs with many processes.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
13 years agocma: disconnect can block for excessive times waiting for rdma_cm DREP timeout
Arlin Davis [Fri, 3 Dec 2010 18:25:46 +0000 (10:25 -0800)]
cma: disconnect can block for excessive times waiting for rdma_cm DREP timeout

rdma_cm uses the same timeout values for connect and disconnect
request/reply. Disconnect abrupt option allows DAT consumers to
specify a prompt disconnect with immediate event. If the remote
node goes down or is non-responsive a CM disconnect event could
take minutes. Add a time limit waiting for event and move EP to
disconnected state to prevent callback from issuing duplicate
disconnect event via callback. The EP to CM linking will
cleanup/cancel any pending events before destroying cm_id.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
13 years agoucm: configure the recv channel FD to non-blocking
Arlin Davis [Tue, 16 Nov 2010 22:48:10 +0000 (14:48 -0800)]
ucm: configure the recv channel FD to non-blocking

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
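
For reference, switching a file descriptor to non-blocking mode is typically done with fcntl(). The helper below is a generic sketch with a hypothetical name (`set_nonblocking`); which descriptor the provider actually configures is not shown in this log.

```c
#include <fcntl.h>

/* Hypothetical helper: set O_NONBLOCK on fd while preserving the other
 * status flags. Returns 0 on success, -1 on failure (errno is set by
 * fcntl). */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}
```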
13 years agowindows: Missing librdmacm include path for build
Stan Smith [Fri, 29 Oct 2010 17:11:41 +0000 (10:11 -0700)]
windows: Missing librdmacm include path for build

Signed-off-by: stan smith <stan.smith@intel.com>
13 years agodebug build: only timestamp if sending to stdout to avoid performance hit
Arlin Davis [Thu, 28 Oct 2010 18:12:33 +0000 (11:12 -0700)]
debug build: only timestamp if sending to stdout to avoid performance hit

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
13 years agocommon: print out errors on free build and not just debug builds
Arlin Davis [Thu, 28 Oct 2010 18:11:12 +0000 (11:11 -0700)]
common: print out errors on free build and not just debug builds

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
13 years agocma: fix debug build issue
Arlin Davis [Fri, 22 Oct 2010 18:58:19 +0000 (11:58 -0700)]
cma: fix debug build issue

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
13 years agoscm, ucm: MPI spawn test on oversubscribed server taking excessive time to complete
Arlin Davis [Fri, 22 Oct 2010 17:15:15 +0000 (10:15 -0700)]
scm, ucm: MPI spawn test on oversubscribed server taking excessive time to complete

Simultaneous DREQ processing from user and CM thread caused some improper
state change on UCM. State change can incorrectly change from FREE back
to DISC in certain corner cases. Add checking on internal disconnect call
to prevent double callback events and improper state change.

For SCM, a remote DREQ will shutdown socket which will cause POLLERR
on the disconnected FD. This will in turn cause the cm_thread to
wakeup continuously unnecessarily. Fix thread thrashing by moving
CM object to FREE state and removing object FD from pollfd array.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
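
The SCM part of this fix removes a disconnected FD from the pollfd array so the thread stops waking on it. A minimal sketch of that idea follows; `prune_dead_fds` is a hypothetical helper, and treating POLLHUP and POLLNVAL like POLLERR is an assumption of the example.

```c
#include <poll.h>

/* Hypothetical helper: drop entries whose last poll() reported an error
 * or hangup by compacting the array in place. Returns the new count. */
int prune_dead_fds(struct pollfd *fds, int count)
{
    int kept = 0;

    for (int i = 0; i < count; i++) {
        if (fds[i].revents & (POLLERR | POLLHUP | POLLNVAL))
            continue;          /* dead socket: do not poll it again */
        fds[kept++] = fds[i];  /* keep live entries, preserving order */
    }
    return kept;
}
```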
13 years agocommon: add high resolution time stamps and thread id to stdout debug logs
Arlin Davis [Fri, 22 Oct 2010 17:04:21 +0000 (10:04 -0700)]
common: add high resolution time stamps and thread id to stdout debug logs

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
13 years agocommon: modify debug in dat_evd_dequeue to reduce noise, only output on non-empty
ardavis [Fri, 22 Oct 2010 17:01:12 +0000 (10:01 -0700)]
common: modify debug in dat_evd_dequeue to reduce noise, only output on non-empty

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
13 years agocma: rdma_destroy_id called twice during device open bind error
sean.hefty@intel.com [Tue, 19 Oct 2010 20:54:42 +0000 (13:54 -0700)]
cma: rdma_destroy_id called twice during device open bind error

Signed-off-by: Pradeep Satyanarayana <pradeeps@linux.vnet.ibm.com>
13 years agocommon: dat_evd_dequeue (poll_cq) fails with invalid parameter after EP (qp) free
ardavis [Tue, 19 Oct 2010 16:52:45 +0000 (09:52 -0700)]
common: dat_evd_dequeue (poll_cq) fails with invalid parameter after EP (qp) free

Failure occurred during Intel MPI spawn test on windows.
The QPs need to be flushed and processed via EVDs during
the EP (QP) destroy to avoid an error on poll_cq. IBAL
provider was not moving to ERR state during QP destroy.

Better flush CQ processing was added and pushed down to the provider
level via dapls_ib_qp_free() where it can move QP to ERR, flush CQ,
and then free QP after flushing. Because there is no QP_ERR_FLUSH
state on a QP, the spin on poll_cq (until empty) after modify_qp
to ERR could return empty before all WQEs are flushed. This
could result in a CQE being added to CQ with an invalid QP reference.
So, an additional check was added to flush_evds for the recv_q to
poll_cq until all pending recvs are complete. For transmit_q there
is no guarantee that the posted work is signaled, and so the best
that can be done is poll_cq until empty.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
13 years agoucm: allow configuration of CM burst (signal) threshold on posting
ardavis [Mon, 11 Oct 2010 19:24:31 +0000 (12:24 -0700)]
ucm: allow configuration of CM burst (signal) threshold on posting

Add new DAPL_UCM_TX_BURST environment variable, default=50.
With the default, every 50th posted send message signals an event,
which corresponds to blocks of 10 percent of the default 500 message limit.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
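
A rough sketch of how such a burst threshold could be read and applied. Only the variable name DAPL_UCM_TX_BURST and the default of 50 come from the commit message; the function names, the parsing rules, and the upper bound are assumptions for illustration.

```c
#include <stdlib.h>

#define UCM_TX_BURST_DEFAULT 50  /* default stated in the commit message */

/* Read the threshold from the environment; fall back to the default
 * when the variable is unset or is not a sensible positive integer. */
int tx_burst_threshold(void)
{
    const char *val = getenv("DAPL_UCM_TX_BURST");

    if (val != NULL) {
        char *end;
        long n = strtol(val, &end, 0);

        if (end != val && *end == '\0' && n > 0 && n <= 1000000)
            return (int)n;
    }
    return UCM_TX_BURST_DEFAULT;
}

/* Return 1 if the send with 1-based sequence number seq should be
 * posted signaled (every burst-th send), otherwise 0. */
int should_signal(unsigned long seq, int burst)
{
    return burst > 0 && seq % (unsigned long)burst == 0;
}
```

With the default of 50, sends 50, 100, 150 and so on would be signaled, and the sends in between would be posted unsignaled.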
13 years agocma: fix debug build
ardavis [Mon, 11 Oct 2010 19:23:50 +0000 (12:23 -0700)]
cma: fix debug build

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
13 years agowindows: debug version of windows does not build.
ardavis [Thu, 7 Oct 2010 21:29:21 +0000 (14:29 -0700)]
windows: debug version of windows does not build.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
13 years agoAllow DAPL out of band connection models to use ibacm to obtain
ardavis [Thu, 7 Oct 2010 18:14:03 +0000 (11:14 -0700)]
Allow DAPL out of band connection models to use ibacm to obtain
path record data.  This will enable support for a wider range of
topologies, where the SL is required from the SA to prevent
deadlock.

DAPL will obtain path record data using rdma_getaddrinfo, provided
that IB ACM support is enabled.  On failure, dapl will fall back to
using its default SL value.  The IB ACM can be configured to cache
path information or always query the SA to ensure that the SL that is
obtained is current.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
13 years agoucm: add missing map file for UCM provider
ardavis [Mon, 27 Sep 2010 18:12:08 +0000 (11:12 -0700)]
ucm: add missing map file for UCM provider

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
13 years agoibal: delay QP transition during disconnect phase
ardavis [Fri, 24 Sep 2010 17:47:30 +0000 (10:47 -0700)]
ibal: delay QP transition during disconnect phase

ibal provider calls ib_cm_drep in response to receiving
a dreq.  The result is that the user's QP is transitioned
through the error state, which fails any outstanding send
operations and flushes all receives.  The disconnect request
is then reported to the user.

Since a user can receive errors from the QP before they are
aware of a pending disconnect request, the application may
respond to the errors as, well, actual errors.  Fix this by
delaying the QP transition until the user responds to the
dreq.

This fixes an error with Intel MPI running over the ibal
dapl provider with a 'spawn' test.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
13 years agoRevert "ibal: delay QP transition during disconnect phase"
ardavis [Thu, 23 Sep 2010 20:50:05 +0000 (13:50 -0700)]
Revert "ibal: delay QP transition during disconnect phase"

This reverts commit 4eda455d9bc80c35743b3a2f6773e6c4a500affc.

13 years agoibal: delay QP transition during disconnect phase
Arlin Davis [Wed, 22 Sep 2010 17:35:24 +0000 (10:35 -0700)]
ibal: delay QP transition during disconnect phase

The ibal provider calls ib_cm_drep in response to receiving
a dreq.  The result is that the user's QP is transitioned
through the error state, which fails any outstanding send
operations and flushes all receives.  The disconnect request
is then reported to the user.

Since a user can receive errors from the QP before they are
aware of a pending disconnect request, the application may
respond to the errors as, well, actual errors.  Fix this by
delaying the QP transition until the user responds to the
dreq.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
---

13 years agocommon: restructure EVD processing to handle EP destruction phase
Arlin Davis [Mon, 20 Sep 2010 17:42:41 +0000 (10:42 -0700)]
common: restructure EVD processing to handle EP destruction phase

EVD processing in the common code will return unformatted events
if EP context is invalid as a result of destruction. During
EP destruction, add changes to flush EVD and process DTO completions
before the EP freeing is called. Simplified the locking in the
EVD code to eliminate the unnecessary and very confusing condition
checking of evd_producer_locking_needed.

New dapls_ep_flush_cqs() call created to synchronize flush and
event processing.

Unnecessary KDAPL code removed in the EVD processing.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
13 years agoibal: sync QP destruction and device close
Arlin Davis [Mon, 13 Sep 2010 23:19:44 +0000 (16:19 -0700)]
ibal: sync QP destruction and device close

Make QP destruction synchronous to ensure that no callbacks are
in progress for a QP after dapl has destroyed it.  This fixes a
use after free error accessing the dapl ep structure from a qp
callback that results in an application crash.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
13 years agoucm: remove unnecessary debug warning in async callback
Arlin Davis [Mon, 13 Sep 2010 22:06:42 +0000 (15:06 -0700)]
ucm: remove unnecessary debug warning in async callback

The switch() cases print when necessary.

signed-off-by: stan smith <stan.smith@intel.com>

13 years agoRelease 2.0.30 dapl-2.0.30-1 ofed_1.5.2_v2
Arlin Davis [Mon, 9 Aug 2010 21:28:20 +0000 (14:28 -0700)]
Release 2.0.30

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
13 years agocommon: increase default logging to include warnings
Arlin Davis [Mon, 9 Aug 2010 21:25:09 +0000 (14:25 -0700)]
common: increase default logging to include warnings

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
13 years agocommon: add more debug levels for cm logging
Arlin Davis [Mon, 9 Aug 2010 21:19:50 +0000 (14:19 -0700)]
common: add more debug levels for cm logging

DAPL_DBG_TYPE_CM_EST   = 0x8000,
DAPL_DBG_TYPE_CM_WARN  = 0x10000

Add level for connection establishment and events
and for retries/timer events.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
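
The two new levels are single-bit flags, so they can be combined and tested with a bitmask. In the sketch below the two enum values are taken from the commit message, while `dbg_enabled` is a hypothetical helper rather than a DAPL function.

```c
/* Values as listed in the commit message. */
enum {
    DAPL_DBG_TYPE_CM_EST  = 0x8000,
    DAPL_DBG_TYPE_CM_WARN = 0x10000
};

/* Hypothetical helper: return 1 if the given debug type is enabled in
 * the configured mask, otherwise 0. */
int dbg_enabled(unsigned int mask, unsigned int type)
{
    return (mask & type) != 0;
}
```

A mask of 0x18000 would enable both connection-establishment and retry/timer messages.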
13 years agocommon: cleanup CR linkings after DTO error on EP
Arlin Davis [Mon, 2 Aug 2010 16:51:30 +0000 (09:51 -0700)]
common: cleanup CR linkings after DTO error on EP

Add cleanup to remove CR from SP and EP
during DTO errors in dapli_evd_cqe_to_event.

dapl_sp_remove_ep needs to remove cr_ptr
reference from EP before freeing cr object.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
13 years agoucm: cleanup CM debug warning messages
Arlin Davis [Mon, 2 Aug 2010 16:49:23 +0000 (09:49 -0700)]
ucm: cleanup CM debug warning messages

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
13 years agoscm, ucm: improperly handles pkey check/query in host order
Arlin Davis [Fri, 23 Jul 2010 22:17:53 +0000 (15:17 -0700)]
scm, ucm: improperly handles pkey check/query in host order

Convert consumer input to network order before verbs
query pkey check.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
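
The byte-order issue can be sketched as follows. The array stands in for a device pkey table whose entries are in network byte order, as the verbs pkey query returns them; `find_pkey_index` is a hypothetical helper and does not call into libibverbs.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Hypothetical lookup: table_be holds pkeys in network byte order and
 * pkey_host is the consumer-supplied value in host order. Returns the
 * matching index or -1 if the pkey is not in the table. */
int find_pkey_index(const uint16_t *table_be, int entries, uint16_t pkey_host)
{
    uint16_t wanted = htons(pkey_host);  /* convert the input once */

    for (int i = 0; i < entries; i++) {
        if (table_be[i] == wanted)
            return i;
    }
    return -1;
}
```

Comparing the host-order value directly against the table would miss on little-endian hosts for most pkey values, which matches the symptom named in the commit title.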
14 years agoThe linux compatibility header file _errno.h is moving out of verbs.h.
Arlin Davis [Mon, 12 Jul 2010 22:57:34 +0000 (15:57 -0700)]
The linux compatibility header file _errno.h is moving out of verbs.h.
Include _errno.h in the windows osd header files, similar to how
errno.h is included in the linux osd header files.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
14 years agowindows: update SOURCES files to link winverbs.lib, which is
Arlin Davis [Wed, 30 Jun 2010 19:03:41 +0000 (12:03 -0700)]
windows: update SOURCES files to link winverbs.lib, which is
needed for common ofa providers.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
14 years agoRelease 2.0.29
Arlin Davis [Thu, 17 Jun 2010 19:58:22 +0000 (12:58 -0700)]
Release 2.0.29

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agoscm, ucm: add pkey, pkey_index, sl override for QP's
Arlin Davis [Thu, 17 Jun 2010 19:40:21 +0000 (12:40 -0700)]
scm, ucm: add pkey, pkey_index, sl override for QP's

On a per open basis, add environment variables
DAPL_IB_SL and DAPL_IB_PKEY and use on
connection setup (QP modify) to override default
values of 0 for SL and PKEY index. If pkey is
provided then find the pkey index with
ibv_query_pkey for dev_attr.max_pkeys.
Will be used for RC and UD type QP's.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agocma: remove dependency on rdma_cma_abi.h
Arlin Davis [Thu, 10 Jun 2010 18:40:45 +0000 (11:40 -0700)]
cma: remove dependency on rdma_cma_abi.h

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agoconfigure: need a false conditional for verbs attr.link_layer member check
Arlin Davis [Wed, 2 Jun 2010 21:13:05 +0000 (14:13 -0700)]
configure: need a false conditional for verbs attr.link_layer member check

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agoucm: incorrectly freeing port on passive side after reject
Arlin Davis [Wed, 2 Jun 2010 17:05:03 +0000 (10:05 -0700)]
ucm: incorrectly freeing port on passive side after reject

cm_release was incorrectly freeing a client port
assuming it was the server listening port. Move
the listening port cleanup to remove_conn_listner
and only cleanup client ports in cm_release.

Error Messages indicating problem:

  CM_REQ retry 1 [lid, port, qpn]: 9 ff9a 340085 -> 9 6fa 34004e Time(ms) 1999 > 1600
  DUPLICATE: op REQ st CM_CONNECTED [lid, port, qpn]: 9 6fa 0x0 <- 0x9 ff9a 0x340085

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agoucm: modify debug CM output for consistency, all ports, qpn in hex
Arlin Davis [Wed, 2 Jun 2010 16:45:42 +0000 (09:45 -0700)]
ucm: modify debug CM output for consistency, all ports, qpn in hex

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agoRelease 2.0.28
Arlin Davis [Mon, 24 May 2010 23:44:25 +0000 (16:44 -0700)]
Release 2.0.28

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agoconfig: add conditional check for new verbs port_attr.link_layer
Arlin Davis [Mon, 24 May 2010 23:28:05 +0000 (16:28 -0700)]
config: add conditional check for new verbs port_attr.link_layer

Check for link_layer type ETHERNET and set global for GID
configuration on modify QP.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agodat.conf: update manpage with latest provider information, add examples
Arlin Davis [Mon, 24 May 2010 17:30:28 +0000 (10:30 -0700)]
dat.conf: update manpage with latest provider information, add examples

Add information regarding OpenFabrics provider choices
and explain cma, scm, and ucm providers.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agocma, scm: new provider entries for Mellanox RDMA over Ethernet device
Arlin Davis [Wed, 19 May 2010 23:38:53 +0000 (16:38 -0700)]
cma, scm: new provider entries for Mellanox RDMA over Ethernet device

Add options for netdev eth2 and eth3 for cma and for device mlx4_0 port 1 and 2 for scm.

ofa-v2-cma-roe-eth2 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
ofa-v2-cma-roe-eth3 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth3 0" ""
ofa-v2-scm-roe-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-scm-roe-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agodapltest: server info devicename is not large enough for dapl_name storage
Arlin Davis [Wed, 19 May 2010 22:17:58 +0000 (15:17 -0700)]
dapltest: server info devicename is not large enough for dapl_name storage

Server info device name is an 80-char array, but the dapl device name
that is copied is 256 bytes. Increase started_server.devicename definition.
Chalk one up for windows SDK OACR (auto code review).

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
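
Independent of the buffer size chosen, a bounded copy can detect this kind of mismatch at run time. The sketch below uses a hypothetical `copy_name` helper and is not how dapltest copies the name.

```c
#include <stdio.h>

/* Hypothetical helper: copy name into dst (dst_size bytes). Returns 0 if
 * the whole name fit, -1 if it was truncated. dst is NUL-terminated in
 * both cases as long as dst_size is at least 1. */
int copy_name(char *dst, size_t dst_size, const char *name)
{
    int n = snprintf(dst, dst_size, "%s", name);

    return (n < 0 || (size_t)n >= dst_size) ? -1 : 0;
}
```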
14 years agowindows: comp_channel.cpp is included by util.c in the openib_common.
Arlin Davis [Wed, 19 May 2010 21:48:49 +0000 (14:48 -0700)]
windows: comp_channel.cpp is included by util.c in the openib_common.

Remove it from device.c in individual providers to avoid
duplicate definitions.

Line endings were corrected to linux format from windows as part of
the change.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
14 years agowindows: need to include linux directory to pick up _errno.h
Arlin Davis [Wed, 19 May 2010 21:45:55 +0000 (14:45 -0700)]
windows: need to include linux directory to pick up _errno.h

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
14 years agoscm: check for hca object before signaling thread
Arlin Davis [Mon, 17 May 2010 23:22:30 +0000 (16:22 -0700)]
scm: check for hca object before signaling thread

There may not be an hca object attached to cm object
when freeing during cleanup.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agoscm, cma: fini code can be called multiple times and hang via fork
Arlin Davis [Mon, 17 May 2010 23:15:21 +0000 (16:15 -0700)]
scm, cma: fini code can be called multiple times and hang via fork

The providers should protect against forked child exits and
not cleanup until the parent init actually exits. Otherwise,
the child will hang trying to cleanup dapl thread. Modify to
check process id for proper init to fini cleanup and limit
cleanup to parent only.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
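
The process id check described above might look roughly like this. `provider_init` and `provider_fini` are hypothetical names, and the actual cleanup work is elided.

```c
#include <sys/types.h>
#include <unistd.h>

static pid_t init_pid;    /* pid of the process that ran init */
static int   initialized; /* nonzero between init and fini */

void provider_init(void)
{
    init_pid = getpid();
    initialized = 1;
}

/* Returns 1 if cleanup ran, 0 if it was skipped because init never ran
 * or because the caller is a forked child (its pid differs from the pid
 * recorded at init time). */
int provider_fini(void)
{
    if (!initialized || getpid() != init_pid)
        return 0;
    initialized = 0;
    /* ... stop threads and release resources here ... */
    return 1;
}
```

A forked child inherits `initialized` but has a different pid, so its exit path skips the cleanup and leaves it to the parent.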
14 years agoscm: add option to use other network devices with environment variable DAPL_SCM_NETDEV
Arlin Davis [Fri, 14 May 2010 23:20:52 +0000 (16:20 -0700)]
scm: add option to use other network devices with environment variable DAPL_SCM_NETDEV

New environment variable can be used to set the netdev
for sockets to use instead of the default network device
returned using gethostname.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agoscm: cr_thread occasionally segv's when disconnecting all-to-all MPI static connections
Arlin Davis [Fri, 14 May 2010 17:27:50 +0000 (10:27 -0700)]
scm: cr_thread occasionally segv's when disconnecting all-to-all MPI static connections

Note: no valid calltrace for segv on cr_thread because
of state changing in switch statement from another
thread, jumped to unknown location.

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x41a65940 (LWP 1328)]
0x00002b2e7d9d5134 in ?? ()

Add cm object locking on all state change/checking. When
freeing CM object wakeup cr_thread to process
state change to CM_FREE.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agoscm: SOCKOPT ERR Connection timed out on large clusters
Arlin Davis [Thu, 13 May 2010 17:31:17 +0000 (10:31 -0700)]
scm: SOCKOPT ERR Connection timed out on large clusters

With large-scale all-to-all connections on 1000+ cores,
the listen backlog is reached and SYNs are dropped,
which causes the connect to time out. Retry connect
on timeout errors.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
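
A minimal sketch of retrying only on timeout errors. The callback type, `connect_with_retry`, and the simulated `flaky_connect` peer are all hypothetical; the retry count the provider uses is not stated in this log.

```c
#include <errno.h>

/* Hypothetical callback: returns 0 on success or an errno value. */
typedef int (*connect_fn)(void *ctx);

/* Try once, then retry up to max_retries times while the failure is a
 * timeout. Any other error is returned immediately. */
int connect_with_retry(connect_fn try_connect, void *ctx, int max_retries)
{
    int err = try_connect(ctx);

    for (int i = 0; err == ETIMEDOUT && i < max_retries; i++)
        err = try_connect(ctx);
    return err;
}

/* Simulated peer for illustration: times out until the counter that
 * ctx points to reaches zero, then accepts. */
int flaky_connect(void *ctx)
{
    int *timeouts_left = ctx;

    if (*timeouts_left > 0) {
        (*timeouts_left)--;
        return ETIMEDOUT;
    }
    return 0;
}
```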
14 years agoucm: UD mode, active side cm object released too soon, the RTU could be lost.
Arlin Davis [Mon, 10 May 2010 19:46:17 +0000 (12:46 -0700)]
ucm: UD mode, active side cm object released too soon, the RTU could be lost.

Will see following message with DAPL_DBG_TYPE set for Errors & Warnings (0x3):
ucm_recv: NO MATCH op REP 0x120 65487 i0x60005e c0x60005e < 0xd2 19824 0x60006a

The cm object was released on the active side after the connection
was established, RTU sent. This is a problem if the RTU is lost
and the remote side retries the REPLY. The RTU is never resent.
Keep the cm object until the EP is destroyed.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agocma, ucm: cleanup issues with dat_ep_free on a connected EP without disconnecting.
Arlin Davis [Mon, 10 May 2010 19:35:51 +0000 (12:35 -0700)]
cma, ucm: cleanup issues with dat_ep_free on a connected EP without disconnecting.

During EP free, disconnecting with ABRUPT close flag, the disconnect should wait
for the DISC event to fire to allow the CM to be properly destroyed upon return.

The cma must also release the lock when calling the blocking rdma_destroy_id given
the callback thread could attempt to acquire the lock for reference counting.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agoucm: increase default UCM retry count for connect reply to 15
Arlin Davis [Wed, 28 Apr 2010 22:37:27 +0000 (15:37 -0700)]
ucm: increase default UCM retry count for connect reply to 15

On very large clusters UCM is timing out with retries at 10.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agoscm: remove modify QP to ERR state during disconnect on UD type QP
Arlin Davis [Tue, 27 Apr 2010 18:20:08 +0000 (11:20 -0700)]
scm: remove modify QP to ERR state during disconnect on UD type QP

The disconnect on a UD type QP should not modify QP to error
since this is a shared QP. The disconnect should be treated
as a NOP on the UD type QP and only be transitioned during
the QP destroy (dat_ep_free).

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agowindows: remove static paths from dapltest scripts
Arlin Davis [Thu, 8 Apr 2010 23:32:02 +0000 (16:32 -0700)]
windows: remove static paths from dapltest scripts

signed-off-by: stan smith <stan.smith@intel.com>

14 years agocommon: EP links to EVD, PZ incorrectly released before provider CM objects freed.
Arlin Davis [Thu, 8 Apr 2010 16:38:57 +0000 (09:38 -0700)]
common: EP links to EVD, PZ incorrectly released before provider CM objects freed.

Unlink/clear references after ALL CM objects linked to EP are freed.
Otherwise, event processing via CM objects could reference the handles
still linked to EP. After CM objects are freed (blocking), these handles
linked to EP are guaranteed not to be referenced by the underlying provider.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agocommon: remove unnecessary lmr lkey hashing and duplicate lkey checking
Arlin Davis [Wed, 7 Apr 2010 18:12:21 +0000 (11:12 -0700)]
common: remove unnecessary lmr lkey hashing and duplicate lkey checking

lmr lkey hashing is too restrictive given the returned lkey could be
the same value for different regions on some rdma devices. Actually,
this checking is really unnecessary and requires considerable overhead
for hashing so just remove hashing of lmr lkeys. Let verbs device
level do the checking and validation.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agoibal: output completion code in decimal & hex as intended
Arlin Davis [Mon, 29 Mar 2010 20:20:34 +0000 (12:20 -0800)]
ibal: output completion code in decimal & hex as intended

sign-off-by: stan smith <stan.smith@intel.com>

14 years agoucm: set timer during RTU_PENDING state change.
Arlin Davis [Tue, 16 Mar 2010 23:02:44 +0000 (15:02 -0800)]
ucm: set timer during RTU_PENDING state change.

The timer thread may pick up an uninitialized timer
value and time out before the reply was sent.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agoucm: fix issues with new EP to CM linking changes
Arlin Davis [Tue, 16 Mar 2010 22:47:58 +0000 (14:47 -0800)]
ucm: fix issues with new EP to CM linking changes

Add EP locking around QP modify
Remove release during disconnect event processing
Add check in cm_free to check state and schedule thread if necessary.
Add some additional debugging
Add processing in disconnect_clean for conn_req timeout
Remove extra CR's

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agoscm: add EP locking and cm checking to socket cm disconnect
Arlin Davis [Tue, 16 Mar 2010 22:18:06 +0000 (14:18 -0800)]
scm: add EP locking and cm checking to socket cm disconnect

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agoscm: new cm_ep linking broke UD mode over socket cm
Arlin Davis [Tue, 16 Mar 2010 17:44:44 +0000 (09:44 -0800)]
scm: new cm_ep linking broke UD mode over socket cm

Add EP locking around modify_qp for EP state.
Add new dapli_ep_check for debugging EP
Cleanup extra CR's
Change socket errno to dapl_socket_errno() abstraction

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agoopenib common: add some debug prints to help isolate QP type issues
Arlin Davis [Tue, 16 Mar 2010 17:17:01 +0000 (09:17 -0800)]
openib common: add some debug prints to help isolate QP type issues

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agocommon: dapl_event_str function missing 2 IB extended events
Arlin Davis [Tue, 16 Mar 2010 17:15:12 +0000 (09:15 -0800)]
common: dapl_event_str function missing 2 IB extended events

Add all IB extended events in event string print function

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agocommon: dat_ep_connect should not set timer UD endpoints
Arlin Davis [Tue, 16 Mar 2010 17:12:11 +0000 (09:12 -0800)]
common: dat_ep_connect should not set timer UD endpoints

Connect for UD type is simply AH resolution and doesn't
need to be timed. The common code is not designed to handle
multiple timed events on connect requests so just ignore
timing UD AH requests.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agoucm: fix error path during accept_usr reply failure
Arlin Davis [Mon, 15 Mar 2010 18:23:47 +0000 (10:23 -0800)]
ucm: fix error path during accept_usr reply failure

if accept_usr fails when sending reply the EP was
being linked to CM instead of properly unlinked.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agoibal: add missing windows makefile
Arlin Davis [Mon, 8 Mar 2010 21:56:28 +0000 (13:56 -0800)]
ibal: add missing windows makefile

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agoibal: changes for EP to CM linking and synchronization.
Arlin Davis [Mon, 8 Mar 2010 20:53:45 +0000 (12:53 -0800)]
ibal: changes for EP to CM linking and synchronization.

Windows IBAL changes to allocate and manage CM objects
and to link them to the EP. This will ensure the CM
IBAL objects and cm_ids are not destroyed before the EP.
Remove windows only ibal_cm_handle in EP structure.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agoscm: add support for canceling conn request that times out.
Arlin Davis [Wed, 24 Feb 2010 20:00:07 +0000 (12:00 -0800)]
scm: add support for canceling conn request that times out.

print warning message during timeout.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agoscm, cma, ucm: consolidate dat event/provider event translation
Arlin Davis [Wed, 24 Feb 2010 19:28:04 +0000 (11:28 -0800)]
scm, cma, ucm: consolidate dat event/provider event translation

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agocommon: missed linking changes from atomic to acquire/release
Arlin Davis [Wed, 24 Feb 2010 19:26:25 +0000 (11:26 -0800)]
common: missed linking changes from atomic to acquire/release

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agocommon: add CM-EP linking to support multiple CM's and proper protection during destru...
Arlin Davis [Wed, 24 Feb 2010 18:03:57 +0000 (10:03 -0800)]
common: add CM-EP linking to support multiple CM's and proper protection during destruction

Add linking for CM to EP, including reference counting, to ensure synchronization
during creation and destruction. A cm_list_head has been added to the EP object to
support multiple CM objects (UD) per EP. If the CM object is linked to an EP it
cannot be destroyed.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
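
The linking rule described in this commit can be sketched as follows. This is a minimal illustration only, not the DAPL source: the struct layouts and the function names cm_link, cm_unlink and cm_can_destroy are hypothetical, and locking is omitted (the real code would presumably hold an EP or CM lock around these updates).

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the DAPL EP and CM objects. */
struct cm;

struct ep {
    struct cm *cm_list_head;   /* CM objects currently linked to this EP */
};

struct cm {
    struct ep *ep;             /* owning EP, NULL when not linked */
    struct cm *next;           /* next CM on the same EP */
    int ref_count;             /* references held on this CM */
};

/* Link a CM to an EP; the link itself holds one reference on the CM. */
void cm_link(struct ep *ep, struct cm *cm)
{
    cm->ep = ep;
    cm->next = ep->cm_list_head;
    ep->cm_list_head = cm;
    cm->ref_count++;
}

/* Remove a CM from its EP's list and drop the link's reference. */
void cm_unlink(struct cm *cm)
{
    struct cm **pp;

    if (cm->ep == NULL)
        return;
    for (pp = &cm->ep->cm_list_head; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == cm) {
            *pp = cm->next;
            break;
        }
    }
    cm->ep = NULL;
    cm->next = NULL;
    cm->ref_count--;
}

/* Return 0 if the CM may be destroyed, -1 while it is linked or referenced. */
int cm_can_destroy(const struct cm *cm)
{
    return (cm->ep == NULL && cm->ref_count == 0) ? 0 : -1;
}
```

Because the EP keeps a list rather than a single pointer, several CM objects (the UD case) can be linked to one EP at the same time.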
14 years agoRelease 2.0.27-1 dapl-2.0.27-1
Arlin Davis [Wed, 24 Feb 2010 00:26:41 +0000 (16:26 -0800)]
Release 2.0.27-1

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agowindows: add scm makefile
Arlin Davis [Mon, 22 Feb 2010 17:42:17 +0000 (09:42 -0800)]
windows: add scm makefile

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agoWindows does not require rdma_cma_abi.h, move the include from common code and to...
Arlin Davis [Mon, 22 Feb 2010 17:41:13 +0000 (09:41 -0800)]
Windows does not require rdma_cma_abi.h, move the include from common code and to OSD file.

Signed-off-by: stan smith <stan.smith@intel.com>
14 years agoWindows patch to fix IB_INVALID_HANDLE name collision
Arlin Davis [Fri, 19 Feb 2010 22:52:01 +0000 (14:52 -0800)]
Windows patch to fix IB_INVALID_HANDLE name collision

Signed-off-by: stan smith <stan.smith@intel.com>
14 years agoscm: dat_ep_connect fails on 32bit servers
Arlin Davis [Mon, 8 Feb 2010 21:49:35 +0000 (13:49 -0800)]
scm: dat_ep_connect fails on 32bit servers

memcpy for remote IA address uses incorrect sizeof for a pointer type.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
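
The class of bug named here (taking sizeof of a pointer instead of the object it points to) can be illustrated with a short sketch. The struct ia_addr type, its 28-byte size, and both helper names are assumptions made for the example; they are not taken from the scm provider.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical address type; the 28-byte size is an assumption. */
struct ia_addr {
    unsigned char bytes[28];
};

/* Buggy form: sizeof(src) is the size of the pointer (commonly 4 bytes on
 * 32-bit builds, 8 on 64-bit), so only part of the address is copied. */
size_t copy_addr_buggy(struct ia_addr *dst, const struct ia_addr *src)
{
    size_t n = sizeof(src);
    memcpy(dst, src, n);
    return n;
}

/* Fixed form: sizeof(*src) is the size of the address structure itself. */
size_t copy_addr_fixed(struct ia_addr *dst, const struct ia_addr *src)
{
    size_t n = sizeof(*src);
    memcpy(dst, src, n);
    return n;
}
```

Since the pointer size differs between 32-bit and 64-bit builds, a bug of this kind can behave differently on the two, which would be consistent with a failure seen only on 32-bit servers.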
14 years agoundefined symbol: dapls_print_cm_list
Arlin Davis [Fri, 5 Feb 2010 19:51:16 +0000 (11:51 -0800)]
undefined symbol: dapls_print_cm_list

call prototype should be dependent on DAPL_COUNTERS.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agoCleanup CM object lock before freeing CM object memory
Arlin Davis [Fri, 5 Feb 2010 19:39:21 +0000 (11:39 -0800)]
Cleanup CM object lock before freeing CM object memory

Ran the Windows Application Verifier for uDAPL validation
of all 3 providers. Clean up the memory lock leaks found
by the verifier.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
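
A minimal sketch of the ordering this commit describes: destroy the lock embedded in an object before releasing the object's memory. It uses POSIX mutexes as a stand-in for DAPL's OS-dependent lock wrappers, and the cm_obj type and function names are hypothetical.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical CM object that embeds its own lock. */
struct cm_obj {
    pthread_mutex_t lock;
    int state;
};

/* Allocate a CM object and initialize its lock; returns NULL on failure. */
struct cm_obj *cm_obj_alloc(void)
{
    struct cm_obj *cm = calloc(1, sizeof(*cm));

    if (cm == NULL)
        return NULL;
    if (pthread_mutex_init(&cm->lock, NULL) != 0) {
        free(cm);
        return NULL;
    }
    return cm;
}

/* Destroy the lock first, then free the memory that contains it.
 * Returns the result of pthread_mutex_destroy(). */
int cm_obj_free(struct cm_obj *cm)
{
    int ret = pthread_mutex_destroy(&cm->lock);

    free(cm);
    return ret;
}
```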
14 years agodestroy verbs completion channels created via ia_open or ep_create.
Arlin Davis [Thu, 4 Feb 2010 00:21:30 +0000 (16:21 -0800)]
destroy verbs completion channels created via ia_open or ep_create.

Completion channels are created with ia_open for CNO events and
with ep_create in cases where DAT allows EP(qp) to be created with
no EVD(cq) and IB doesn't. These completion channels need to be
destroyed at close along with a CQ for the EP without CQ case.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agoUpdate Copyright file and include the 3 license files in distribution
Arlin Davis [Wed, 3 Feb 2010 19:06:45 +0000 (11:06 -0800)]
Update Copyright file and include the 3 license files in distribution

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agoWhen copying private_data out of rdma_cm events, use the
Arlin Davis [Tue, 2 Feb 2010 22:43:03 +0000 (14:43 -0800)]
When copying private_data out of rdma_cm events, use the
reported private_data_len for the size, and not IB maximums.
This fixes a bug running over the librdmacm on windows, where
DAPL accessed invalid memory.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
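
The fix described above amounts to bounding the copy by the length the event reports rather than by a fixed maximum. A minimal sketch, assuming a simplified event structure: struct conn_event and copy_private_data are hypothetical, with only the private_data and private_data_len field names taken from the commit message.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified view of a connection event. */
struct conn_event {
    const void *private_data;       /* may be NULL when no data was sent */
    unsigned char private_data_len; /* number of valid bytes */
};

/* Copy at most private_data_len bytes, and never more than dst_size.
 * Returns the number of bytes actually copied. */
size_t copy_private_data(void *dst, size_t dst_size,
                         const struct conn_event *ev)
{
    size_t len = ev->private_data_len;

    if (ev->private_data == NULL)
        return 0;
    if (len > dst_size)
        len = dst_size;
    if (len > 0)
        memcpy(dst, ev->private_data, len);
    return len;
}
```

Copying a fixed maximum instead would read past the end of the event's buffer whenever the peer sent less data than that maximum.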
14 years agodapl/cma: fix referencing freed address
Sean Hefty [Thu, 28 Jan 2010 18:19:20 +0000 (10:19 -0800)]
dapl/cma: fix referencing freed address

DAPL uses a pointer to reference the local and remote addresses
of an endpoint.  It expects that those addresses are located
in memory that is always accessible.  Typically, for the local
address, the pointer references the address stored with the DAPL
HCA device.  However, for the cma provider, it changes this pointer
to reference the address stored with the rdma_cm_id.

This causes a problem when that endpoint is connected on the
passive side of a connection.  When connect requests are given
to DAPL, a new rdma_cm_id is associated with the request.  The
DAPL code replaces the current rdma_cm_id associated with a
user's endpoint with the new rdma_cm_id.  The old rdma_cm_id is
then deleted.  But the endpoint's local address pointer still
references the address stored with the old rdma_cm_id.  The
result is that any reference to the address will access freed
memory.

Fix this by keeping the local address pointer always pointing
to the address associated with the DAPL HCA device.  This is about
the best that can be done given the DAPL interface design.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
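
The lifetime problem and the fix can be sketched with simplified types. Everything below is hypothetical (struct names, field names, the 28-byte address size); it only illustrates why pointing at storage owned by the HCA device stays valid when the cm_id is replaced.

```c
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the objects in the commit text. */
struct addr   { unsigned char bytes[28]; };
struct hca    { struct addr local_addr; };  /* lives as long as the device */
struct cm_id  { struct addr local_addr; };  /* freed when the id is replaced */

struct ep {
    struct hca *hca;
    struct cm_id *cm_id;
    const struct addr *local_addr;
};

/* Point the EP's local address at the HCA's copy, never at the cm_id's. */
void ep_init(struct ep *ep, struct hca *hca, struct cm_id *id)
{
    ep->hca = hca;
    ep->cm_id = id;
    ep->local_addr = &hca->local_addr;
}

/* Passive side: swap in the cm_id that arrived with the connect request
 * and free the old one. local_addr is untouched, so it cannot dangle. */
void ep_replace_cm_id(struct ep *ep, struct cm_id *new_id)
{
    struct cm_id *old_id = ep->cm_id;

    ep->cm_id = new_id;
    free(old_id);
}
```

Had ep_init stored &id->local_addr instead, the pointer would refer to freed memory after ep_replace_cm_id returned.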
14 years agodapl: move close device after async thread is done
Sean Hefty [Tue, 26 Jan 2010 23:13:03 +0000 (15:13 -0800)]
dapl: move close device after async thread is done

using it

Before calling ibv_close_device, wait for the asynchronous
processing thread to finish using the device.  This prevents
a use after free error.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
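
The ordering described above (wait for the thread, then close) can be sketched with POSIX threads. The device structure and function names are hypothetical, and setting is_open to 0 merely stands in for the ibv_close_device call; no verbs API is used here.

```c
#include <pthread.h>

/* Hypothetical device state shared with the async thread. */
struct device {
    int is_open;
    int events_handled;
};

/* Stand-in for the async processing thread: it touches the device. */
static void *async_thread(void *arg)
{
    struct device *dev = arg;
    int i;

    for (i = 0; i < 1000; i++)
        dev->events_handled++;
    return NULL;
}

/* Open the (fake) device and start the thread that uses it. */
int device_start(struct device *dev, pthread_t *tid)
{
    dev->is_open = 1;
    dev->events_handled = 0;
    return pthread_create(tid, NULL, async_thread, dev);
}

/* Join the thread first; only then release the device it was using. */
int device_shutdown(struct device *dev, pthread_t tid)
{
    int ret = pthread_join(tid, NULL);

    dev->is_open = 0;   /* stand-in for closing the real device */
    return ret;
}
```

Closing before the join would leave a window in which the thread still dereferences a device that has already been released.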
14 years agoRelease 2.0.26-1 dapl-2.0.26-1
Arlin Davis [Mon, 11 Jan 2010 17:03:10 +0000 (09:03 -0800)]
Release 2.0.26-1

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agoopenib_common: add check for both gid and global routing in RTR
Arlin Davis [Tue, 22 Dec 2009 22:00:33 +0000 (14:00 -0800)]
openib_common: add check for both gid and global routing in RTR

check for valid gid pointer along with global route setting
during transition to RTR. Add more GID information to
debug print statement in qp modify call.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agoopenib_common: remote memory read privilege set multiple times
Arlin Davis [Fri, 4 Dec 2009 20:31:22 +0000 (12:31 -0800)]
openib_common: remote memory read privilege set multiple times

Remove the duplicate setting of the read privilege in dapls_convert_privileges.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agoucm, scm: DAPL_GLOBAL_ROUTING enabled causes segv
Arlin Davis [Fri, 4 Dec 2009 20:25:30 +0000 (12:25 -0800)]
ucm, scm: DAPL_GLOBAL_ROUTING enabled causes segv

socket cm and ud cm providers support QP modify with is_global
set and GRH. New v2 providers didn't pass GID information
in modify_qp RTR call and incorrectly byte swapped the already
network order GID. Add debug print of GID during global modify.

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agoRelease 2.0.25-1 dapl-2.0.25-1
Arlin Davis [Wed, 25 Nov 2009 06:16:58 +0000 (22:16 -0800)]
Release 2.0.25-1

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
14 years agowinof scm: initialize opt for NODELAY setsockopt
Arlin Davis [Wed, 25 Nov 2009 06:15:46 +0000 (22:15 -0800)]
winof scm: initialize opt for NODELAY setsockopt

Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>