Sean Hefty [Wed, 14 Oct 2009 16:34:18 +0000 (09:34 -0700)]
The HCA should not be closed until all resources have been released.
This results in a hang on windows, since closing the device frees
the event processing thread.
Arlin Davis [Thu, 8 Oct 2009 23:02:52 +0000 (16:02 -0700)]
ucm: add timer/retry CM logic to the ucm provider
add reply, rtu and retry count options via
environment variables. Times in msecs.
DAPL_UCM_RETRY 10
DAPL_UCM_REP_TIME 400
DAPL_UCM_RTU_TIME 200
Add RTU_PENDING and DISC_RECV states
Add check timer code to the cm_thread
and the option to the select abstraction
to take timeout values in msecs.
DREQ, REQ, and REPLY will all be timed
and retried.
Split out reply code and disconnect_final
code to better facilitate retry timers.
Add checking for duplicate messages.
Added new UD extension events for errors.
DAT_IB_UD_CONNECTION_REJECT_EVENT
DAT_IB_UD_CONNECTION_ERROR_EVENT
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
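A minimal sketch of how the three environment options above could be read,
assuming the values listed (10, 400, 200) are the defaults. The helper
env_int() and struct ucm_timers are hypothetical names for illustration
and are not taken from the provider source.

```c
#include <stdlib.h>

/* Hypothetical helper: read a non-negative integer from the environment,
 * falling back to def when the variable is unset or malformed. */
static int env_int(const char *name, int def)
{
    const char *val = getenv(name);
    char *end;
    long n;

    if (val == NULL || *val == '\0')
        return def;
    n = strtol(val, &end, 10);
    if (*end != '\0' || n < 0 || n > 1000000)
        return def;
    return (int)n;
}

/* Hypothetical container for the ucm CM timer settings. */
struct ucm_timers {
    int retry_cnt;   /* DAPL_UCM_RETRY */
    int rep_time_ms; /* DAPL_UCM_REP_TIME, msecs */
    int rtu_time_ms; /* DAPL_UCM_RTU_TIME, msecs */
};

static void ucm_timers_init(struct ucm_timers *t)
{
    /* Defaults are assumed from the values listed in the commit message. */
    t->retry_cnt   = env_int("DAPL_UCM_RETRY", 10);
    t->rep_time_ms = env_int("DAPL_UCM_REP_TIME", 400);
    t->rtu_time_ms = env_int("DAPL_UCM_RTU_TIME", 200);
}
```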
Arlin Davis [Fri, 2 Oct 2009 21:48:15 +0000 (14:48 -0700)]
cma: cannot reuse the cm_id and qp for new connection, must reallocate a new one.
When merging the common code base, dapls_ib_reinit_ep was mistakenly
changed to modify the QP to reset then init for all providers. This will
not work for rdma_cm (cma provider) since the cm_id cannot
be reused. Add build check for _OPENIB_CMA_ to pull in the correct
free and reallocate method for reinit_ep.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Fri, 2 Oct 2009 20:50:12 +0000 (13:50 -0700)]
scm, cma: update DAPL cm protocol revision with latest address/port changes
CM protocol changed, roll revision to 6.
The socket cm could be competing for port space if the
application above also uses sockets to exchange information,
as dapltest and MPI consumers do. Adjust port on listen
and connect to reduce the chance of port collision with
the application above.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Fri, 2 Oct 2009 19:47:37 +0000 (12:47 -0700)]
ucm: modify IB address format to align better with sockaddr_in6
Restructure the dcm_addr union to map the IB side
closer to sockaddr_in6 and initialize family to
AF_INET6 to ensure callee allocates enough memory
for ucm dat_ia_address type. Put qpn in flowinfo
and gid in sin6_addr. Change the test suites
to print address information based on AF_INET
or AF_INET6 instead of using specific IB address
union from the provider.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
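The mapping described above (qpn in flowinfo, gid in sin6_addr) might look
roughly like the sketch below. The function names are hypothetical, and
storing the qpn in network byte order is an assumption; the actual
dcm_addr union layout is not reproduced here.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: pack an IB QP number and 16-byte GID into a
 * sockaddr_in6 so that callers size their buffers for AF_INET6. */
static void ib_to_sockaddr6(struct sockaddr_in6 *sa, uint32_t qpn,
                            const uint8_t gid[16])
{
    memset(sa, 0, sizeof(*sa));
    sa->sin6_family = AF_INET6;
    sa->sin6_flowinfo = htonl(qpn);          /* assumed network order */
    memcpy(sa->sin6_addr.s6_addr, gid, 16);  /* GID is 128 bits */
}

/* Hypothetical helper: recover the QP number in host order. */
static uint32_t sockaddr6_qpn(const struct sockaddr_in6 *sa)
{
    return ntohl(sa->sin6_flowinfo);
}
```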
Sean Hefty [Wed, 30 Sep 2009 21:27:50 +0000 (14:27 -0700)]
The completion manager was updated to provide an abstraction that
better mimicked how fd's were used. Update dapl to use this
abstraction, rather than the older completion manager api.
This helps minimize changes between linux and windows.
Arlin Davis [Mon, 28 Sep 2009 17:59:36 +0000 (10:59 -0700)]
scm: tighten up socket options to ensure similar behavior on Windows and Linux.
Add IPPROTO_TCP to create socket. Specify device IP address
when binding instead of INADDR_ANY and remove setsockopt
REUSEADDR on the listen socket to avoid any issues with
portability. Don't want duplicate port bindings.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
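A rough POSIX sketch of the listen socket setup described above: explicit
IPPROTO_TCP, bind to a specific device address, and no SO_REUSEADDR. The
function name open_listen_socket() and the backlog value are assumptions
for illustration, not the provider's code.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns a listening socket bound to dev_ip:port,
 * or -1 on failure (including a port that is already bound). */
static int open_listen_socket(const char *dev_ip, unsigned short port)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, dev_ip, &addr.sin_addr) != 1)
        goto err;

    /* SO_REUSEADDR is deliberately not set, so a duplicate
     * port binding fails instead of silently succeeding. */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto err;
    if (listen(fd, 128) < 0)
        goto err;
    return fd;
err:
    close(fd);
    return -1;
}
```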
Arlin Davis [Mon, 28 Sep 2009 17:46:26 +0000 (10:46 -0700)]
cma: improve serialization of destroy and event processing
WinOF testing with slightly different scheduler and verbs
showed some issues with cleanup. Add better protection around
destroy and event processing thread.
Remove destroy flag and add refs counting to conn objects
to block destroy until all references are cleared. Add
locking around ref counting and passive and active
event processing.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
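One way to block destroy until all references are cleared is a counter
guarded by a mutex and condition variable. The sketch below is illustrative
only; struct conn and its functions are hypothetical and do not mirror the
cma provider's actual conn object.

```c
#include <pthread.h>

/* Hypothetical connection object; field names are illustrative. */
struct conn {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             refs;
};

static void conn_init(struct conn *c)
{
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->cond, NULL);
    c->refs = 1; /* reference held by the owner */
}

/* Taken by event processing before it touches the object. */
static void conn_ref(struct conn *c)
{
    pthread_mutex_lock(&c->lock);
    c->refs++;
    pthread_mutex_unlock(&c->lock);
}

/* Dropped when event processing is done; wakes a waiting destroy. */
static void conn_deref(struct conn *c)
{
    pthread_mutex_lock(&c->lock);
    if (--c->refs == 0)
        pthread_cond_broadcast(&c->cond);
    pthread_mutex_unlock(&c->lock);
}

/* Drops the owner's reference and blocks until no one else holds one. */
static void conn_destroy(struct conn *c)
{
    pthread_mutex_lock(&c->lock);
    c->refs--;
    while (c->refs > 0)
        pthread_cond_wait(&c->cond, &c->lock);
    pthread_mutex_unlock(&c->lock);
    pthread_mutex_destroy(&c->lock);
    pthread_cond_destroy(&c->cond);
}
```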
Arlin Davis [Mon, 28 Sep 2009 17:42:52 +0000 (10:42 -0700)]
scm: improve serialization of destroy and state changes
WinOF testing with slightly different scheduler and verbs
showed some issues with cleanup. Add better protection around
destroy and move state change before socket send to ensure
correct state in multi-thread environment targeting the same
device on send and recv.
Change DCM_RTU_PENDING to DCM_REP_PENDING
and add static definition to local routines for better
readability.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Thu, 17 Sep 2009 15:56:06 +0000 (08:56 -0700)]
common: no cleanup/release code for timer thread
dapl_set_timer() creates a thread to process timers for dat_ep_connect
but provides no mechanism to destroy/exit during dapl library unload.
Timers are initialized in library init code and should be released
in the fini code. Add a dapl_timer_release call to the dapl_fini
function to check state of timer thread and destroy before exiting.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Thu, 17 Sep 2009 15:53:29 +0000 (08:53 -0700)]
scm, cma: dapli_thread doesn't always get terminated on library close.
DAPL doesn't actually wait for the async processing thread to exit before
allowing the library to close. It will wait up to 10 seconds, which under
heavy load isn't enough time. Since the thread is created by an application
level thread, it will continue to run as long as the application runs. But
if the application closes the library, then all library data and code is
invalid, which can result in the thread running something that's not
library code and accessing freed memory.
With this change, I was able to run mpi ping-pong, 16 ranks on a single
system (scm provider) without crashes 1300 times.
Arlin Davis [Wed, 9 Sep 2009 16:44:03 +0000 (09:44 -0700)]
ucm: For UD type QP's, return CR p_data with CONN_EST event on passive side.
Intel MPI uses the p_data provided with CONN_EST as a reference to the
UD pair and remote rank. The ucm provider was overwriting the CR p_data
with the ACCEPT p_data. Change to save CR p_data but also provide
storage for user provided ACCEPT p_data in case the REPLY is lost
and needs to be retransmitted.
p_data size was provided to event processing in network order
instead of host order.
For new QP's create new address handles and do not use
existing AH's created for the CM. Different PD's are
associated with each.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
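The byte order fix above amounts to converting the wire value before handing
it to event processing. A minimal sketch, assuming a 16-bit size field; the
struct and field names are hypothetical and not the ucm message layout.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Hypothetical CM message header; p_size is in network order on the wire. */
struct cm_msg_hdr {
    uint16_t p_size;
};

/* Return the private data size in host order for event processing. */
static uint16_t cm_msg_p_size(const struct cm_msg_hdr *m)
{
    return ntohs(m->p_size);
}
```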
Arlin Davis [Wed, 2 Sep 2009 21:01:51 +0000 (14:01 -0700)]
dtest, dtestx: modifications for UD QP testing with ucm provider.
remote_addr is wrong for IP remote address.
The dtestx requires the server connect back to the client
for the UD test. With the ucm provider you need to provide
the QPN and the LID which you cannot get until the dtest
client starts. So, for now, don't support UD testing
on UCM providers.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Wed, 2 Sep 2009 20:54:59 +0000 (13:54 -0700)]
scm, ucm: UD QP support was broken when porting to common openib code base.
create remote_ah was moved out of modify_qp_state function but not
included in the RTU and ACCEPT code for UD QP's. qp type check
should be on daddr not saddr in ucm cm code.
QP number must be converted to host order before supplying remote_ah,
and qp number to consumer.
The mask setting used when modifying a UD QP to the RTR state was incorrect.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Tue, 1 Sep 2009 19:36:31 +0000 (12:36 -0700)]
cma: remove debug message after rdma_disconnect failure
DAPL automatically calls rdma_disconnect() when a disconnect request is
received. If the user also calls disconnect, that calls rdma_disconnect() as
well, but the connection has already been disconnected by DAPL and is no longer
valid. The result is that the user's call to rdma_disconnect() will fail. Do
not display an error message if this occurs.
Locking could be added to prevent calling rdma_disconnect() multiple times, but
since the librdmacm provides synchronization to trap this, we might as well take
advantage of it.
Intel MPI checks the uDAPL error code when calling dat_psp_create() to see if
the port number that it provides is in use or not. Convert winsock error codes
to unix errno values.
This fixes the following error reported by Intel MPI:
'DAPL provider is not found and fallback device is not enabled'
Arlin Davis [Tue, 18 Aug 2009 17:15:15 +0000 (10:15 -0700)]
ucm: Add new provider using a DAPL based IB-UD cm mechanism for MPI implementations.
New provider uses its own CM protocol on top of IB-UD queue pairs.
During device open, this provider creates a UD queue pair and
returns local address information via dat_ia_query. This 24 byte
opaque address must be exchanged out-of-band before connecting to a
server via dat_ep_connect. This provider is targeted for MPI
implementations that already exchange address information
during mpi_init phase.
Future release may provide some ARP mechanism via multicast.
dtest, dtestx, and dtestcm were modified to report the lid and qpn
information on the server side so you can provide appropriate
destination address information for the client test suite.
dapltest will not work with this provider.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Wed, 5 Aug 2009 03:49:09 +0000 (20:49 -0700)]
scm: Fix disconnect. QP's need to move to ERROR state in
order to flush work requests and notify consumer. Moving to
RESET removed all requests but did not notify consumer.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Wed, 5 Aug 2009 03:48:03 +0000 (20:48 -0700)]
modify dtest.c to cleanup CNO wait code and consolidate into
collect_event() call. After waking up from CNO wait the
consumer must check all EVD's. The EVD's under the CNO
could be dropped if already triggered or could come in any order.
DT_RetToString changed to DT_RetToStr and DT_EventToSTr
changed to DT_EventToStr for consistency.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Wed, 5 Aug 2009 03:47:17 +0000 (20:47 -0700)]
CNO events, once triggered will not be returned during the cno wait.
Check for triggered state before going to sleep in cno_wait. Reset
triggered EVD reference after reporting.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
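The fix above follows the usual condition variable pattern: test the
predicate before sleeping, and clear it once reported. A sketch under that
assumption; struct cno here is a hypothetical stand-in, not the DAPL CNO
object.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical CNO; triggered_evd is NULL when nothing is pending. */
struct cno {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    void           *triggered_evd;
};

static void cno_init(struct cno *c)
{
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->cond, NULL);
    c->triggered_evd = NULL;
}

/* Called when an EVD under the CNO fires. */
static void cno_trigger(struct cno *c, void *evd)
{
    pthread_mutex_lock(&c->lock);
    c->triggered_evd = evd;
    pthread_cond_signal(&c->cond);
    pthread_mutex_unlock(&c->lock);
}

/* Returns the triggered EVD. If the CNO was already triggered before the
 * call, it returns immediately instead of going to sleep. */
static void *cno_wait(struct cno *c)
{
    void *evd;

    pthread_mutex_lock(&c->lock);
    while (c->triggered_evd == NULL)
        pthread_cond_wait(&c->cond, &c->lock);
    evd = c->triggered_evd;
    c->triggered_evd = NULL; /* reset the reference after reporting */
    pthread_mutex_unlock(&c->lock);
    return evd;
}
```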
Arlin Davis [Sun, 2 Aug 2009 21:21:09 +0000 (14:21 -0700)]
CNO support broken in both CMA and SCM providers.
CQ thread/callback mechanism was removed by mistake. Still
need indirect DTO callbacks when CNO is attached to EVD's.
Add CQ event channel to cma provider's thread and add
to select for rdma_cm and async channels.
For the scm provider there is no easy way to add this channel
to the select across sockets on Windows. So, for portability
reasons, a second thread is started to process the ASYNC and
CQ channels for events.
Must disable EVD (evd_enabled=FALSE) during destroy
to prevent EVD events firing for CNOs and re-arming CQ while
CQ is being destroyed.
Change dtest to check EVD after CNO times out.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Sean Hefty [Mon, 27 Jul 2009 22:07:33 +0000 (15:07 -0700)]
DAPL introduced the concept of directly waiting on the CQ for
events by adding a compile time flag and special handling in the common
code. Rather than using the compile time flag and modifying the
common code, let the provider implement the best way to wait for
CQ events.
This simplifies the code and allows the common openib providers to
optimize for Linux and Windows platforms independently, rather than
assuming a specific implementation for signaling events.
Arlin Davis [Thu, 16 Jul 2009 19:41:22 +0000 (12:41 -0700)]
dapltest: Implement a malloc() threshold for the completion reaping.
change byte vector allocation to stack in functions:
DT_handle_send_op, DT_handle_rdma_op & DT_handle_recv_op.
When allocation size is under the threshold, use a stack local
allocation instead of malloc/free. Move redundant bzero() to
be called only in the case of using local stack allocation as
DT_Mdep_malloc() already does a bzero(). Consolidate error handling
return and free() check to a single point by using goto.
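The threshold pattern described above could be sketched as follows. The
threshold value, the function name check_completions(), and its arguments
are hypothetical; calloc() stands in for DT_Mdep_malloc(), which the commit
says already zeroes its allocation.

```c
#include <stdlib.h>
#include <string.h>

#define STACK_THRESHOLD 64 /* hypothetical cutoff, in bytes */

/* Returns 0 when each of n_ops operations appears exactly once in done[],
 * -1 otherwise. Small byte vectors live on the stack; large ones are
 * heap allocated. */
static int check_completions(const size_t *done, size_t n_done, size_t n_ops)
{
    unsigned char stack_vec[STACK_THRESHOLD];
    unsigned char *vec;
    int ret = -1;
    size_t i;

    if (n_ops <= STACK_THRESHOLD) {
        vec = stack_vec;
        memset(vec, 0, n_ops); /* only the stack path needs zeroing */
    } else {
        vec = calloc(n_ops, 1); /* already zeroed */
        if (vec == NULL)
            goto out;
    }

    for (i = 0; i < n_done; i++) {
        if (done[i] >= n_ops || vec[done[i]])
            goto out; /* out of range or duplicate completion */
        vec[done[i]] = 1;
    }
    for (i = 0; i < n_ops; i++)
        if (!vec[i])
            goto out; /* missing completion */
    ret = 0;
out:
    /* Single exit point: free only what came from the heap. */
    if (vec != stack_vec)
        free(vec);
    return ret;
}
```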
Arlin Davis [Thu, 16 Jul 2009 19:32:09 +0000 (12:32 -0700)]
scm: handle connected state when freeing CM objects
The QP could be freed before being disconnected
so the provider needs to process the disconnect before freeing
the CM object. The disconnect clean will finish
the destroy process during the disc callback.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Thu, 2 Jul 2009 21:11:20 +0000 (14:11 -0700)]
dtestcm: add UD type QP option to test
Add -u for UD type QP's during connection setup.
Will setup UD QPs and provide remote AH
in connect establishment event. Measures
setup/exchange rates.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Thu, 2 Jul 2009 21:07:36 +0000 (14:07 -0700)]
scm: destroy QP called before disconnect
Handle the case where QP is destroyed before
disconnect processing. Windows supports
reinit_qp during a disconnect call by
destroying the QP and recreating the
QP instead of state change from reset
to init. Call disconnect in destroy
CM code to handle this unexpected state.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Mon, 29 Jun 2009 15:57:46 +0000 (08:57 -0700)]
openib_common: reorganize provider code base to share common mem, cq, qp, dto functions
add new openib_common directory with cq, qp, util, dto, mem function calls
and definitions. This basically leaves the unique CM and Device definitions
and functions to the individual providers directory of openib_scm and openib_cma.
modifications to dapl_cr_accept required. ep->cm_handle is allocated
and managed entirely in provider so dapl common code should not update
ep_handle->cm_handle from the cr->cm_handle automatically. The provider
should determine which cm_handle is required for the accept.
openib_cma defines _OPENIB_CMA_ and openib_scm defines _OPENIB_SCM_ for provider
specific build needs in common code.
Arlin Davis [Fri, 26 Jun 2009 21:45:34 +0000 (14:45 -0700)]
scm: fixes and optimizations for connection scaling
Prioritize accepts on listen ports via FD_READ
process the accepts ahead of other work to avoid
socket half_connection (SYN_RECV) stalls.
Fix dapl_poll to return DAPL_FD_ERROR on
all event error types.
Add new state for socket released, but CR
not yet destroyed. This enables scm to release
the socket resources immediately after exchanging
all QP information. Also, add state to str call.
Only add the CR reference to the EP if it is
RC type. UD has multiple CR's per EP so when
a UD EP disconnect_clean was called, from a
timeout, it destroyed the wrong CR.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Sat, 20 Jun 2009 03:52:51 +0000 (20:52 -0700)]
common,scm: add debug capabilities to print in-process CM lists
Add a new debug bit DAPL_DBG_TYPE_CM_LIST.
If set, the pending CM requests will be
dumped when dat_print_counters is called.
Only provided when built with -DDAPL_COUNTERS
Add new dapl_cm_state_str() call for state
to string conversion for debug prints.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Wed, 10 Jun 2009 16:09:56 +0000 (09:09 -0700)]
scm: cleanup orphaned UD CR's when destroying the EP
UD CR objects are kept active because of direct private data references
from CONN events. The cr->socket is closed and marked inactive but the
object remains allocated and queued on the CR resource list. There can
be multiple CR's associated with a given EP and there is no way to
determine when consumer is finished with event until the dat_ep_free.
Schedule destruction for all CR's associated with this EP during
free call. cr_thread will complete cleanup with state of SCM_DESTROY.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Wed, 10 Jun 2009 17:06:59 +0000 (10:06 -0700)]
scm: update CM code to shutdown before closing socket
data could be lost without calling shutdown on the socket
before closing. Update to shutdown and then close. Add
definition for SHUT_RDWR to SD_BOTH for windows.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
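A minimal POSIX sketch of the shutdown-then-close ordering described above.
The function name cm_close_socket() is hypothetical. On Windows the same
call would use SD_BOTH and closesocket(), which is not shown here.

```c
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: shut down both directions before closing, so the
 * peer sees an orderly end of stream after any queued data. */
static int cm_close_socket(int fd)
{
    /* Ignore the shutdown result: the peer may already have gone away. */
    (void)shutdown(fd, SHUT_RDWR);
    return close(fd);
}
```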
Sean Hefty [Thu, 4 Jun 2009 15:19:12 +0000 (08:19 -0700)]
dapl/windows cma provider: add support for network devices based on index
The linux cma provider provides support for named network devices, such
as 'ib0' or 'eth0'. This allows the same dapl configuration file to
be used easily across a cluster.
To allow similar support on Windows, allow users to specify the device
name 'rdma_devN' in the dapl.conf file. The given index, N, is mapped to a
corresponding IP address that is associated with an RDMA device.
Arlin Davis [Mon, 18 May 2009 16:06:19 +0000 (09:06 -0700)]
windows: add build files for openib_scm, remove /Wp64 build option.
Add build files for windows socket cm and change build
option on windows providers. The new Win7 WDK issues a
deprecated compiler option warning for /Wp64
(Enable 64-bit porting warnings)
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Mon, 18 May 2009 15:50:35 +0000 (08:50 -0700)]
scm: multi-hca CM processing broken. Need cr thread wakeup mechanism per HCA.
Currently there is only one pipe across all
device opens. This results in some posted CR work
getting delayed or not processed at all. Provide a
pipe for each device open, with the cr thread created
and managed at a per device level.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Fri, 15 May 2009 16:48:38 +0000 (09:48 -0700)]
linux_osd: use pthread_self instead of getpid for debug messages
getpid provides process ids which are not unique. Use unique thread
id's in debug messages to help isolate issues across many device
opens with multiple CM threads.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
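A small sketch of tagging debug output with the thread id rather than the
process id. The function name dbg_format() is hypothetical, and casting
pthread_t to unsigned long is an assumption that holds on Linux but is not
portable, since pthread_t is an opaque type.

```c
#include <pthread.h>
#include <stdio.h>

/* Hypothetical helper: prefix a debug message with the calling thread's id. */
static int dbg_format(char *buf, size_t len, const char *msg)
{
    /* Linux-specific assumption: pthread_t fits in an unsigned long. */
    return snprintf(buf, len, "%lx: %s",
                    (unsigned long)pthread_self(), msg);
}
```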
Arlin Davis [Wed, 29 Apr 2009 15:39:37 +0000 (08:39 -0700)]
dtest: add flush EVD call after data transfer errors
Flush and print entries on async, request, and receive
queues after any data transfer error. Will help
identify failing operation during operations
without completion events requested.
Fix -B0 so burst size of 0 works.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Thu, 16 Apr 2009 21:35:18 +0000 (14:35 -0700)]
dapltest: reset server listen ports to avoid collisions during long runs
If server is running continuously the port number increments
from base without resetting between tests. This will
eventually cause collisions in port space.
Signed-off-by: Arlin Davis <ardavis@ichips.intel.com>
Sean Hefty [Thu, 16 Apr 2009 17:21:51 +0000 (10:21 -0700)]
To avoid duplicating port numbers between different tests, the next port
number to use must increment based on the number of endpoints per thread *
the number of threads.
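As a worked example of the increment described above, with a hypothetical
helper name and example values:

```c
/* Hypothetical helper: each test consumes one port per endpoint, so the
 * next test must start past eps_per_thread * threads ports. */
static unsigned next_base_port(unsigned base_port, unsigned eps_per_thread,
                               unsigned threads)
{
    return base_port + eps_per_thread * threads;
}
```

With a base port of 5000, 4 endpoints per thread, and 8 threads, the next
test would start at port 5032.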
Sean Hefty [Thu, 16 Apr 2009 17:21:45 +0000 (10:21 -0700)]
dapltest assumes that events across multiple endpoints occur in a specific
order. Since this is a false assumption, avoid this by directing events to
per endpoint EVDs, rather than using shared EVDs.
Sean Hefty [Thu, 16 Apr 2009 17:21:41 +0000 (10:21 -0700)]
Synchronization is missing between removing items from an EVD and queuing
them. Since the removal thread is the user's, but the queuing thread is
not, the synchronization must be provided by DAPL. Hold the evd lock
around any calls to dapls_rbuf_*.
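The locking rule above can be illustrated with a small ring buffer whose
add and remove paths both take the same lock. This is a sketch only; struct
evd and its functions are hypothetical stand-ins for the DAPL EVD and the
dapls_rbuf_* calls.

```c
#include <pthread.h>
#include <stddef.h>

#define RBUF_SIZE 8 /* hypothetical capacity */

/* Hypothetical EVD with an embedded ring buffer. */
struct evd {
    pthread_mutex_t lock;
    void           *slots[RBUF_SIZE];
    size_t          head;
    size_t          count;
};

static void evd_init(struct evd *e)
{
    pthread_mutex_init(&e->lock, NULL);
    e->head = 0;
    e->count = 0;
}

/* Queuing side (provider thread). Returns 0, or -1 when full. */
static int evd_enqueue(struct evd *e, void *event)
{
    int ret = -1;

    pthread_mutex_lock(&e->lock);
    if (e->count < RBUF_SIZE) {
        e->slots[(e->head + e->count) % RBUF_SIZE] = event;
        e->count++;
        ret = 0;
    }
    pthread_mutex_unlock(&e->lock);
    return ret;
}

/* Removal side (user thread). Returns NULL when empty. */
static void *evd_dequeue(struct evd *e)
{
    void *event = NULL;

    pthread_mutex_lock(&e->lock);
    if (e->count > 0) {
        event = e->slots[e->head];
        e->head = (e->head + 1) % RBUF_SIZE;
        e->count--;
    }
    pthread_mutex_unlock(&e->lock);
    return event;
}
```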
Sean Hefty [Thu, 16 Apr 2009 17:21:26 +0000 (10:21 -0700)]
Communication to the CR thread is done using an internal socket. When a
new connection request is ready for processing, an object is placed on
the CR list, and data is written to the internal socket. The write causes
the CR thread to wake-up and process anything on its cr list.
If multiple objects are placed on the CR list around the same time, then
the CR thread will read in a single character, but process the entire list.
This results in additional data being left on the internal socket. When
the CR does a select(), it will find more data to read, read the data, but
not have any real work to do. The result is that the thread spins in a
loop checking for changes when none have occurred until all data on the
internal socket has been read.
Avoid this overhead by reading all data off the internal socket before
processing the CR list.
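A sketch of draining the wakeup descriptor before walking the list, assuming
the descriptor has been set non-blocking. The function names are
hypothetical; the provider's own internal socket handling is not shown.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: put the wakeup descriptor in non-blocking mode. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Read and discard everything currently queued on fd, so that one pass
 * over the CR list answers every pending wakeup. Returns bytes discarded. */
static long drain_wakeup_fd(int fd)
{
    char buf[64];
    long total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));

        if (n > 0) {
            total += n;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;
        break; /* empty (EAGAIN), end of file, or error */
    }
    return total;
}
```

The CR thread would call drain_wakeup_fd() once after select() reports the
descriptor readable, then process the whole CR list.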