Arlin Davis [Wed, 12 Feb 2014 22:55:25 +0000 (14:55 -0800)]
new lightweight open_query/close_query IB extension for fast attribute query
Consumers that need provider attributes must do a full device open
in order to get any provider/device information. With so many static device
entries in /etc/dat.conf, consumers are building classification
mechanisms to identify provider type, locality, name, device
mode, and decide which device is appropriate. The existing DAT interface
doesn't provide a lightweight mechanism for queries.
The following fast query functions have been added to dat_ib_extensions.h:
In addition, the DAT extension interface, dat_extension_op, has been
expanded to include new internal calls to handle quick provider load
and function linkage via the udat_extension_open and udat_extension_close
functions. Extended operations needing DAT open/close services need
to be defined from DAT_OPEN_EXTENSION_BASE or DAT_CLOSE_EXTENSION_BASE,
respectively.
NOTE: The ia_handle returned with open_query must be closed with a subsequent
close_query and not used with any other dat_ia_ operations. Attribute
storage from open_query is not valid after the close_query call.
The IB extensions have been rolled to version 2.0.8 with this new API.
The changes are backward compatible.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Wed, 12 Feb 2014 21:41:37 +0000 (13:41 -0800)]
mpxyd: need CM to QP linking with CM references
Complete coding support for ref_cnt on CM to allow for
proper destruction of CM resources. A reference is taken for CM alloc,
QP linking, and queue list. List dequeue will trigger CM
free, move to destroy state, and dealloc if ref_cnt is zero.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
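The reference counting described in the mpxyd CM change above can be sketched roughly as follows. This is a single-threaded illustration only, not the mpxyd code: struct cm_obj, its fields, the state names, and the helper functions are all hypothetical, and the real daemon would need locking around these updates.

```c
#include <assert.h>

/* Hypothetical stand-in for a CM object; names are assumptions. */
enum cm_state { CM_ACTIVE, CM_DESTROY, CM_FREE };

struct cm_obj {
    int ref_cnt;            /* one reference each: alloc, QP link, queue list */
    enum cm_state state;
    int freed;              /* set instead of calling free() so it can be observed */
};

/* Allocation holds the first reference. */
static void cm_init(struct cm_obj *cm)
{
    cm->ref_cnt = 1;
    cm->state = CM_ACTIVE;
    cm->freed = 0;
}

/* Take a reference (QP linking, list enqueue). */
static void cm_get(struct cm_obj *cm)
{
    assert(cm->ref_cnt > 0);
    cm->ref_cnt++;
}

/* Drop a reference; the object is released only when the count hits zero. */
static void cm_put(struct cm_obj *cm)
{
    assert(cm->ref_cnt > 0);
    if (--cm->ref_cnt == 0) {
        cm->state = CM_FREE;
        cm->freed = 1;
    }
}

/* List dequeue: drop the list reference, then start the free by moving
 * to the destroy state and dropping the allocation reference. */
static void cm_dequeue(struct cm_obj *cm)
{
    cm_put(cm);
    if (cm->state == CM_ACTIVE) {
        cm->state = CM_DESTROY;
        cm_put(cm);
    }
}
```

With a QP still linked, cm_dequeue leaves the object in the destroy state with one reference outstanding; the final cm_put from the QP unlink releases it.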
Arlin Davis [Tue, 4 Feb 2014 02:58:04 +0000 (18:58 -0800)]
mpxyd: proxy-in added to proxy-out service to increase cross socket performance
Proxy-in service added to MCM dapl providers to
improve cross socket MIC adapter performance.
Additional RX thread created to handle PI service,
new CM wire protocol to exchange WR and WC references,
and new DTO wire protocol to Read remote PO data
segments and forward via SCIF writeto.
In order to maintain DAT API compatibility the IB
MR addr, rkeys are translated to SCIF addresses
and TPT entries created on the MPXYD to handle inbound
rdma writes targeted to MIC adapters.
Code broken out into separate source files:
mpxy_in.c - proxy_in service
mpxy_out.c - proxy_out service
util.c - general utilities
mix.c - MIC to HOST operations
mpxyd.c - device open, RX, TX, OP, CM threads.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Tue, 4 Feb 2014 02:49:07 +0000 (18:49 -0800)]
mcm: add proxy in support to MCM provider and MPXYD interface
Add dapli_mix_post_recv, dapli_mix_mr_create, dapli_mix_mr_free
no QPr exist on MIC with MXS to MXS connections
cm addr becomes addr1, save all QPr addr1 info during rejects
verify CM service exists before freeing port space, could be on mpxyd
system guid support to verify locality to inside/outside the box
change UD mode checking to EP instead of QP, QP doesn't exist on MXS
add reject support
Fix for CM service RX posting, walking queue doesn't include GRH.
Add system_guid field to MCM provider ib_hca_transport struct
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Tue, 4 Feb 2014 02:37:34 +0000 (18:37 -0800)]
open_ib common: qp, cq, and post_recv changes for proxy-in
Modify common QP, CQ, and DTO services to support proxy-in
service that eliminates the need for local QP and CQ resources
on the MIC adapter.
Change WR UD type check to support no QP mode.
Add dapli_mix_post_recv functionality for PI, QPr on mpxyd.
Store platform unique guid for EP locality - inside/outside.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Tue, 4 Feb 2014 02:31:31 +0000 (18:31 -0800)]
common: add lmr support for proxy in service
Registration details must be transferred to proxy service
to enable proxy-in data transfers. IB registration
and SCIF registration are sent to mpxyd for inbound
rdma write TPT services for IB RW store and SCIF writeto
forward capabilities. Extend DAT LMR to include
scif information and ID. If proxy service is
in use call new functions dapli_mix_mr_create/free
to sync with mpxyd.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Tue, 4 Feb 2014 02:26:53 +0000 (18:26 -0800)]
new definitions and states for CCL Proxy-in support
MCM proxy data limits, new CM free state, EP mode support for EP locality
New ep mapping field in address structure
New MIX ops mix_recv and mix_cm_reject_user
Expanded MIX ops mr structure to include IB and SCIF details
Changed dat_mix_send struct name to dat_mix_sr for send and recv
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Wed, 25 Sep 2013 22:10:56 +0000 (15:10 -0700)]
ucm, scm: UD mode triggers list_head assert with large scale alltoall test
1024+ ranks, IMB alltoall may hit assert when running Intel MPI in UD mode.
CR clean up was implemented with EP to CR references still linked.
During cr_accept, the CR remote_ia_address is linked to EP object
by mistake with UD mode. UD mode may have multiple CRs per EP so
no direct mappings to CR memory can exist unless RC mode which
always has one EP to CR mapping.
In scm, ucm: for CM object free with CR references the search and
unlinking from SP must be under SP lock to serialize. Also,
cleanup thread wakeup logic to only trigger the thread if
reference count indicates the need for more processing.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Fri, 13 Sep 2013 22:12:05 +0000 (15:12 -0700)]
mpxyd: ERR: stalled, insufficient proxy memory
When scaling up/out with lots of QP's using shared
proxy buffer the rdma writes can block waiting for
memory to free. The signal rate on the posted
writes must be reduced to ensure proxy buffers
are freed in a more timely manner.
Add logic to return failure if stalling becomes
excessive.
Allow administrator to adjust IB mcm_signal_rate
via mpxyd.conf. Default is now 10 instead of 100.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
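As a rough illustration of why the default matters for the proxy memory stall fix above: if mcm_signal_rate is read as "request one completion per N posted writes" (an assumption; the entry does not spell out the exact semantics), a smaller N produces completions, and therefore buffer releases, more often. The function names below are hypothetical.

```c
#include <assert.h>

/* Returns 1 if the posted-th write (1-based) should request a completion,
 * assuming one signaled write per signal_rate posted writes. */
static int should_signal(unsigned posted, unsigned signal_rate)
{
    assert(signal_rate > 0);
    return (posted % signal_rate) == 0;
}

/* Number of completions generated for a burst of total writes. */
static unsigned count_signaled(unsigned total, unsigned signal_rate)
{
    unsigned i, n = 0;

    for (i = 1; i <= total; i++)
        n += (unsigned)should_signal(i, signal_rate);
    return n;
}
```

Under this reading, a burst of 1000 writes yields 100 completions at a rate of 10 versus 10 completions at a rate of 100, so proxy buffers can be reclaimed ten times as often.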
Arlin Davis [Thu, 12 Sep 2013 21:03:58 +0000 (14:03 -0700)]
mpxyd: handle catastrophic IB async events, including IBV_EVENT_LID_CHANGE
cleanup mdev destroy functions, use mcm_ib_async_str for all IB events.
Destroy all mdev resources, including CM services, and abort all
open clients when receiving the following IB async events:
Arlin Davis [Thu, 12 Sep 2013 16:12:55 +0000 (09:12 -0700)]
mcm: reduce max qp depth and msg size in proxy mode, allow override
DAPL_MCM_WR_MAX is used to set max qp depth on mcm provider, default=500
DAPL_MCM_MSG_MAX is used to set max msg size on mcm provider, default=8388608
DAPL_WR_MAX is used to override max qp depth on all IB providers.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
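A minimal sketch of how the environment overrides in the mcm qp depth change above might be resolved. The variable names and defaults come from the entry; the helper functions, the rejection of non-positive or malformed values, and the precedence of DAPL_WR_MAX over DAPL_MCM_WR_MAX are assumptions for illustration, not the provider's actual parsing code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Parse a positive integer setting; fall back to def when the string is
 * missing, empty, malformed, or not positive. */
static long parse_limit(const char *s, long def)
{
    char *end;
    long v;

    assert(def > 0);
    if (s == NULL || *s == '\0')
        return def;
    errno = 0;
    v = strtol(s, &end, 0);
    if (errno != 0 || *end != '\0' || v <= 0)
        return def;
    return v;
}

/* Resolve max qp depth from the two settings (precedence is assumed). */
static long resolve_qp_depth(const char *mcm_wr_max, const char *wr_max)
{
    long depth = parse_limit(mcm_wr_max, 500);  /* mcm provider default */

    return parse_limit(wr_max, depth);          /* override for all IB providers */
}

/* Read the settings from the environment. */
static long mcm_max_qp_depth(void)
{
    return resolve_qp_depth(getenv("DAPL_MCM_WR_MAX"), getenv("DAPL_WR_MAX"));
}

/* Max message size for the mcm provider. */
static long mcm_max_msg_size(void)
{
    return parse_limit(getenv("DAPL_MCM_MSG_MAX"), 8388608);
}
```

For example, running a consumer with DAPL_MCM_WR_MAX=256 set in its environment would make mcm_max_qp_depth() return 256 in this sketch.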
Arlin Davis [Wed, 11 Sep 2013 22:04:37 +0000 (15:04 -0700)]
mpxyd: CM_REPLY: RETRIES (7) EXHAUSTED
The client's RTU is not processed by mpxyd thread in corner cases.
The SCIF EP, handling the client cm thread (scif_ev_ep) operations,
was not added to select FD set so the op_thread didn't wake up in the
case where RTU's were sent on scif_ev_ep and no operations are
being sent on scif_op_ep.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
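The RTU fix above comes down to putting both descriptors in the select() read set so that traffic on either one wakes the thread. A sketch using plain file descriptors as stand-ins for the SCIF endpoints; the function name and return convention are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

/* Wait up to timeout_ms for data on either descriptor.
 * Returns a bitmask (1 = op_fd readable, 2 = ev_fd readable),
 * 0 on timeout, or -1 on select() failure. */
static int wait_op_or_ev(int op_fd, int ev_fd, int timeout_ms)
{
    fd_set rfds;
    struct timeval tv;
    int maxfd = op_fd > ev_fd ? op_fd : ev_fd;
    int rc, ready = 0;

    assert(op_fd >= 0 && ev_fd >= 0);
    FD_ZERO(&rfds);
    FD_SET(op_fd, &rfds);
    FD_SET(ev_fd, &rfds);   /* leaving this out is the bug described above */

    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    rc = select(maxfd + 1, &rfds, NULL, NULL, &tv);
    if (rc < 0)
        return -1;
    if (FD_ISSET(op_fd, &rfds))
        ready |= 1;
    if (FD_ISSET(ev_fd, &rfds))
        ready |= 2;
    return ready;
}
```

With only op_fd in the set, a message arriving on ev_fd alone would leave the caller asleep until the timeout, which matches the exhausted CM_REPLY retries in the title.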
Arlin Davis [Tue, 10 Sep 2013 16:19:18 +0000 (09:19 -0700)]
common: cleanup async event processing and logging
Add formatted string print for ib verbs async events
Remove unnecessary logging and duplicate async callbacks
Modify all IB providers to use dapli_async_event_cb()
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Thu, 18 Jul 2013 17:17:11 +0000 (10:17 -0700)]
mcm: support incompatible verbs definitions inter-node within the platform
OFA verbs 3.5 and 1.5.4 are incompatible so there can be no direct
mappings to verbs within any MIC to Host communications. Remove
all direct verbs mappings in MIX and create inline construct functions
to convert verbs to new dat_mix_wr and dat_mix_wc types for both work requests
and work completions.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Tue, 16 Jul 2013 23:12:37 +0000 (16:12 -0700)]
dapltest: add -n parameter to override default server port number (45278)
Modify all tests and commands to take a new -n parameter option for server
listen port. The default port, when running multiple EP's and threads,
will sometimes collide and fail with EADDRINUSE on iWARP configurations
using rdma_bind_addr with sin_port=0.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Fri, 12 Jul 2013 18:52:33 +0000 (11:52 -0700)]
ucm,scm: UD mode creates many CR objects per EP that need to be cleaned up
After connection is established and the AH is provided to consumer
on UD connect establishment there is no need to keep the CR object
on the SP. For large clusters this results in a growing memory
footprint for CR objects and long cleanup times on device close.
Change ucm and scm providers to unlink and free CR resources
during CM object free if this is a UD QP and CONN_EST state.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Wed, 12 Jun 2013 16:45:45 +0000 (09:45 -0700)]
cma: long delays when opening cma provider with no IPoIB configured
The rdma_cm provider (ofa-v2-ib0) can take netdev, ip address, or hostname
for local address bindings. When trying to open a non-existent netdev (ib0)
the provider will fall through and use the getaddrinfo sys call assuming
dat.conf parameter is either an IP address or hostname and not a netdev.
When trying hostname option it will attempt to resolve the name via the
name services. On a KNC this can result in long timeouts depending on the
configuration. This changes the error handling when opening the cma provider
on a non-existent netdev and will only call getaddrinfo with AI_CANONNAME
hints after checking the dat.conf parameter for a valid hostname.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Wed, 29 May 2013 23:00:32 +0000 (16:00 -0700)]
mpxyd: CM optimizations for MIC clients, improved checking on inbound CM messages
allow CM operations to be received on OP or EV channels from
MIC clients and provide each SMD channel with aligned message buffer
for scif_recv processing.
add checking for NO match at MD level after checking all SMD children
for inbound CM message match and add dump_cm_lists function for debug.
add check for inline message threshold, DAT_MIX_INLINE_MAX
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Wed, 29 May 2013 22:36:29 +0000 (15:36 -0700)]
mcm: scif_recv err on mpxyd when scaling up on MPI IMB scatter benchmark
The inline send changes incorporated fragmented scif_send options which
de-serialized the stream operation on the scif endpoint. This can result
in a CM operation from the CM thread to interleave with the post_send
inline operation that sends a hdr and inline data separately.
Modify the post_send to use only one scif_send operation for inline.
Also optimize CM and Operations by moving all CM message to the
scif_ev_ep. Cleanup operation log messages to include op strings
for easier debug.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
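One way to picture the single-send fix above: the header and inline payload are packed into one contiguous buffer so that a single send call carries both, and a CM message from another thread cannot land between them. The header layout, INLINE_MAX_SKETCH, and pack_inline below are hypothetical and are not the MIX wire format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define INLINE_MAX_SKETCH 256   /* hypothetical inline limit */

/* Hypothetical message header; not the real MIX header. */
struct mix_hdr_sketch {
    uint32_t op;
    uint32_t len;
};

/* Pack header plus payload into out so both can go out in one send.
 * Returns the total byte count, or 0 if the payload does not fit. */
static size_t pack_inline(uint8_t *out, size_t out_len, uint32_t op,
                          const void *data, uint32_t len)
{
    struct mix_hdr_sketch hdr;

    assert(out != NULL && (data != NULL || len == 0));
    if (len > INLINE_MAX_SKETCH || out_len < sizeof(hdr) + len)
        return 0;

    hdr.op = op;
    hdr.len = len;
    memcpy(out, &hdr, sizeof(hdr));
    if (len > 0)
        memcpy(out + sizeof(hdr), data, len);
    return sizeof(hdr) + len;
}
```

The caller would then hand the packed buffer to one send call on the endpoint instead of issuing separate sends for the header and the data.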
pmmccorm [Mon, 13 May 2013 21:03:04 +0000 (14:03 -0700)]
Enable ccl-proxy support if possible by default:
yes, if nothing specified and scif.h is present
no, if scif.h is not present and nothing specified
no, if --enable-mcm=no is specified
yes, if --enable-mcm=yes and scif.h is present
error, if --enable-mcm=yes and scif.h missing
Make the corresponding changes to the spec file so that whatever
options are specified, the RPM will contain the right files (before
we were shipping the mpxyd service and conf regardless).
Arlin Davis [Fri, 17 May 2013 19:27:36 +0000 (12:27 -0700)]
SCM: getifaddrs modifications for better out of the box experience with MIC
socket cm will now walk list of interfaces and ignore loopback
and ignore IB devices, unless the IB netdev is the only device.
Works better in a heterogeneous environment with a mix of MICs.
Tested with br0, mic0, and mic0:ib netdev mixes.
Overriding with DAPL_SCM_NETDEV still works as is.
Signed-off-by: Patrick Mccormick <patrick.m.mccormick@intel.com>
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
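The interface selection rule in the SCM getifaddrs change above can be sketched as a pure function over an already gathered device list. struct netdev_sketch and pick_netdev are hypothetical; in a real walk the loopback flag would typically come from IFF_LOOPBACK in the getifaddrs() results, and how the provider recognizes an IB netdev is not stated in the entry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of one interface from the getifaddrs() walk. */
struct netdev_sketch {
    const char *name;
    int is_loopback;
    int is_ib;
};

/* Pick the first device that is neither loopback nor IB. Fall back to
 * the first IB device only when nothing else qualifies. Returns NULL
 * when there is no usable device. */
static const struct netdev_sketch *
pick_netdev(const struct netdev_sketch *devs, size_t n)
{
    const struct netdev_sketch *ib = NULL;
    size_t i;

    assert(devs != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (devs[i].is_loopback)
            continue;               /* always ignored */
        if (devs[i].is_ib) {
            if (ib == NULL)
                ib = &devs[i];      /* remember as last resort */
            continue;
        }
        return &devs[i];
    }
    return ib;
}
```

An explicit DAPL_SCM_NETDEV setting would bypass this selection entirely, per the entry.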
Arlin Davis [Mon, 25 Mar 2013 17:18:05 +0000 (10:18 -0700)]
mpxyd: add support for full work request or memory pool
Current implementation will fail when WR or memory is full. Change to
throttle and retry mix post_send operations during full work queue.
New wr_pp (pst pending) added to m_qp for tracking outstanding
IB work request in flight.
Add counters for full wr and mem pool cases. Print mix-version on
startup.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
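A sketch of the throttle idea in the full work request change above: a post against a full queue is counted and reported as "retry later" rather than treated as a failure, and completions free slots again. struct qp_sketch and both helpers are hypothetical; only the wr_pp name is taken from the entry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-QP bookkeeping. */
struct qp_sketch {
    int wr_max;         /* work queue depth */
    int wr_pp;          /* work requests posted and still in flight */
    unsigned full_cnt;  /* number of times a post found the queue full */
};

/* Returns 0 if the request was accepted, -1 if the caller should hold
 * the request and retry once a completion has freed a slot. */
static int qp_try_post(struct qp_sketch *qp)
{
    assert(qp != NULL && qp->wr_max > 0);
    if (qp->wr_pp >= qp->wr_max) {
        qp->full_cnt++;
        return -1;
    }
    qp->wr_pp++;
    return 0;
}

/* A completion retires one in-flight request. */
static void qp_complete(struct qp_sketch *qp)
{
    assert(qp != NULL && qp->wr_pp > 0);
    qp->wr_pp--;
}
```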
Add eager completion, configurable, to signal writes or sends
after scif_readfrom is signaled and all data is local to proxy
instead of waiting for IB signal. User data on MIC is available
for reuse.
Combine sends and writes to mix_post_send command, provide
ordering guarantees between inline and dma data. Allows
direct posting from OP thread if head of queue.
Add new counters for inline and signaled IO.
Extend m_wr to include flags for controlling eager completions
and proxy buffer and work request management.
cq event FD is now non-blocking and processed via TX thread
instead of OP thread. Allows for polling > 1 event at a time.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Fri, 15 Mar 2013 19:23:16 +0000 (12:23 -0700)]
mcm: add support for mix inline data, improve mix_poll events
mpxyd can be configured for inline data for posted
writes and sends. This will use scif_send/recv instead
of scif_readfrom based on threshold set in mpxyd.conf.
Change the mix_poll command to NOT issue the request
on scif and simply wait for mpxyd to write completion
back to EVD. This removes unnecessary SCIF command
traffic.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>