Arlin Davis [Fri, 16 May 2014 17:04:21 +0000 (10:04 -0700)]
mpxyd: MIC scale-up issue with MPI gather workloads, I_MPI_FABRICS=dapl:dapl
Issue with the shared proxy-in buffer pool when RDMA reads complete
out of order across QPs. The tail adjustment when a read completes
fails to walk the entire queue and process the head entry.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
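The tail handling described above can be sketched as follows. This is a minimal illustration, not the mpxyd code: the buf_pool struct, POOL_SLOTS, and pool_complete() are hypothetical names. The point it shows is that a completion only marks its own slot, and the tail is then advanced across every finished slot, up to and including the entry just behind the head.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_SLOTS 8 /* hypothetical pool size */

/* Hypothetical shared pool: slots are handed out at hd and reclaimed at tl. */
struct buf_pool {
    bool done[POOL_SLOTS]; /* read finished, slot not yet reclaimed */
    size_t hd;             /* next slot to hand out */
    size_t tl;             /* oldest slot still in use */
    size_t used;           /* number of slots between tl and hd */
};

/* Mark one slot complete, then advance tl across every finished slot.
 * Returns how many slots were reclaimed by this call. */
static size_t pool_complete(struct buf_pool *p, size_t idx)
{
    size_t freed = 0;

    assert(p != NULL && idx < POOL_SLOTS);
    p->done[idx] = true;

    /* Walk the whole in-use range; stop only at an unfinished slot. */
    while (p->used > 0 && p->done[p->tl]) {
        p->done[p->tl] = false;
        p->tl = (p->tl + 1) % POOL_SLOTS;
        p->used--;
        freed++;
    }
    return freed;
}
```

With three slots outstanding, completing slots 2 and 1 first reclaims nothing; completing slot 0 then reclaims all three in one walk.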
Arlin Davis [Fri, 16 May 2014 16:21:26 +0000 (09:21 -0700)]
mcm: mpxyd error event of m_pi_prep_rcv_q: ERR: ib_qp == 0
When an incorrect ep_mode is provided by the consumer, the proxy-in
service modes get set up incorrectly. Validate the remote address
ep_mode and set it to unknown if out of range.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Wed, 30 Apr 2014 18:13:35 +0000 (11:13 -0700)]
mpxyd,mcm: changes for backward compatibility with older v4 MIC clients
Allow mpxyd service to run with older MIC clients that support only proxy-out
and not proxy-in capabilities. Define minimal and compatible versions and
sync to MIC client during device open.
Create and use dat_mcm_msg_compat, dat_mix_mr_compat, and dat_mix_cm_compat
messages and operations with older v4 clients.
Move current MIX command version to v5.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Fri, 14 Mar 2014 17:47:06 +0000 (10:47 -0700)]
dapltest: change server port, from 45278 to 62000, out of registered IANA range
The existing port 45278 is in the registered port range.
RFC 6335:
System Ports, well known, 0-1023 (assigned by IANA)
User Ports, registered, 1024-49151 (assigned by IANA)
Dynamic Ports, private or Ephemeral, 49152-65535 (never assigned)
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
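As a quick check of the ranges quoted above, a small classifier can be sketched. The enum and function names are illustrative only and are not part of dapltest.

```c
#include <assert.h>

enum port_class { PORT_SYSTEM, PORT_USER, PORT_DYNAMIC };

/* Classify a port number using the RFC 6335 ranges. */
static enum port_class port_classify(long port)
{
    assert(port >= 0 && port <= 65535);
    if (port <= 1023)
        return PORT_SYSTEM;  /* 0-1023, well known */
    if (port <= 49151)
        return PORT_USER;    /* 1024-49151, registered */
    return PORT_DYNAMIC;     /* 49152-65535, never assigned */
}
```

Under this classification the old default 45278 falls in the User (registered) range, while 62000 falls in the Dynamic range.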
dapltest: Add final send/recv "sync" for transaction tests.
The transaction tests need both sides to send a sync message after running the test. This ensures that all remote operations are complete before dapltest deregisters memory and disconnects the endpoints.
Without this logic, we see intermittent async errors on iwarp devices because a read response or write arrives after the rmr has been destroyed.
I believe this is more likely to happen with iWARP than IB because iWARP completions only indicate that the local buffer can be reused. They don't imply that the message has even arrived at the peer, let alone been placed in the peer application's memory.
Changes from V1:
- allocate new send/recv buffers for the Final Sync message.
- post the Final Sync recv buffer at the beginning of the final iteration of a test.
- tests ok on cxgb4 and mlx4 devices.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Arlin Davis [Tue, 15 Apr 2014 22:28:10 +0000 (15:28 -0700)]
mpxyd: scale-up improvements to support 200-300 processes per MIC
Change scif_send_msg from blocking to non-blocking.
Serialize listen port space across MIC devices on same IB device.
Set CM to reject state after sending or receiving user reject.
Serialize usage of scif_ev_ep on CM and DTO events across TX and CM threads.
Increase default scif listen backlog.
Reduce default proxy buffer, WR, and WC queues.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Mon, 14 Apr 2014 23:55:47 +0000 (16:55 -0700)]
mcm: serialize CM cmds on ev_ep, add dev_id, increase dev_open listen backlog
user thread or cm thread could be processing CM commands
and events so use of scif_ev_ep needs locking.
Add dev_id to req_id for client to mpxyd device open linking
and increase backlog on MCM scif_listen for mpxyd to avoid
connection refused scenarios during device open.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Tue, 25 Mar 2014 18:36:37 +0000 (11:36 -0700)]
mcm: add host to mic cross socket support to proxy-in service to improve performance
mcm provider running on host will now connect and move data via
the remote mpxyd proxy in (PI) service when connecting to a MIC
cross socket from HCA.
CM protocol/service enhanced to connect multi-pathed QP's
between a non-MIC and MICs for optimized speed paths per
direction.
HOST-> MSS (same socket) will connect QP2 directly to remote MIC rcv
QP1 for send data and connect QP1 to MPXYD PO service QP2 on remote
host for recv data.
HOST-> MXS (cross socket) will connect QP2 to MPXYD PI service QP1
on remote host for send data and connect QP1 to MPXYD PO service QP2
on remote host for recv data.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Thu, 6 Mar 2014 17:24:04 +0000 (09:24 -0800)]
mpxyd: 64KB segment sizes hang with MPI IMB pingpong cross socket
Proxy-out work request processing rounds down the starting address
and rounds up the size to a 64 byte cacheline. When the size was rounded
up from a non-64 byte length, the result was a 0 byte RDMA segment. Add
checking for actual len versus rounded up l_len for last segment.
Add additional perf profiling via MCM_PROFILE_DBG.
Signal on LS if not marked via signal rate modulo.
Add support for new M_READ_FROM_DONE work request state.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
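The rounding arithmetic and the last-segment length check can be illustrated as below. This is a sketch under assumptions: the commit does not spell out the exact failing case, and seg_plan, seg_round(), and seg_payload() are hypothetical names rather than mpxyd functions. It only shows rounding to 64 byte cachelines and clamping a segment to the bytes actually left in the request.

```c
#include <assert.h>
#include <stdint.h>

#define CACHELINE 64u

struct seg_plan {
    uint64_t l_addr; /* start address rounded down to a cacheline */
    uint64_t l_len;  /* length rounded up to whole cachelines */
};

/* Round the start down and the covered length up to 64 byte multiples. */
static struct seg_plan seg_round(uint64_t addr, uint64_t len)
{
    const uint64_t mask = ~(uint64_t)(CACHELINE - 1);
    struct seg_plan p;

    assert(len > 0);
    p.l_addr = addr & mask;
    p.l_len = (addr + len - p.l_addr + CACHELINE - 1) & mask;
    return p;
}

/* Payload bytes for the next segment: the actual length still unsent,
 * capped at the segment size, so padding never counts as payload. */
static uint64_t seg_payload(uint64_t len, uint64_t sent, uint64_t seg_size)
{
    uint64_t left;

    assert(sent <= len && seg_size > 0);
    left = len - sent;
    return left < seg_size ? left : seg_size;
}
```

For example, a 50 byte request at address 100 is covered by the aligned range starting at 64 with l_len 128, while the payload to forward remains 50 bytes.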
Arlin Davis [Thu, 6 Mar 2014 05:23:27 +0000 (21:23 -0800)]
mpxyd: add new M_READ_FROM_DONE state for send WR's and add more profiling options
new state added to work request flag along with a m_qp->wr_tl_rf field
to limit wr pending thread processing to just RF pending entries
and avoiding needless processing of M_SEND_POSTED entries.
Add more perf profiling capabilities to defer IB RDMA until after all the post_send
scif_readfrom's, first to last segment, are complete.
disable MCM_PROFILE_DBG compile option by default
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Mon, 24 Feb 2014 17:03:15 +0000 (09:03 -0800)]
mpxyd: sync PI WC trigger to PO MP_SIG
The PI and PO segment comp signal was out of sync so the WC update
back from the PI incorrectly updated the m_po_buf_tl
on the PO side. Set signaling on both sides based
on the rdma_write initiator setting via the WR M_SEND_MP_SIG bit.
Modify some log levels and add check for m_idx during tail update.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Wed, 19 Feb 2014 00:02:18 +0000 (16:02 -0800)]
mpxyd: improve QP destruction to manage QP1 and QP2 variations
With proxy-in and proxy-out connection combinations the
proxy agent sometimes manages 2 QPs. Change QP flush
and destruction to manage all combinations of QPs.
QP can also be on both tx and rx link-list for proxy-in
and proxy-out processing. QP free needs to be modified
to serialize and remove QP object from all lists.
Remove QPN option from mix_get_qp call.
Proxy-in RX_IMM message processing changed to validate
CM connected state and IB QP state before reposting.
Proxy-in pending_wr processing should send WC's to release
proxy buffers more frequently instead of on last segment.
With multiple QP's sharing proxy buffer it could stall
waiting for last segment WC's. It will now signal on last
segment or every 10th segment by default.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
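The signaling rule in the last paragraph (signal on the last segment or every 10th segment) reduces to a one-line predicate. The names below are hypothetical; only the default of 10 comes from the text.

```c
#include <assert.h>
#include <stdbool.h>

#define PI_SIGNAL_RATE 10u /* hypothetical name; default taken from the text */

/* seg_num is 1-based. Signal on the final segment or on every rate-th one. */
static bool seg_should_signal(unsigned seg_num, unsigned seg_total, unsigned rate)
{
    assert(rate > 0 && seg_num >= 1 && seg_num <= seg_total);
    return seg_num == seg_total || (seg_num % rate) == 0;
}
```

For a 25 segment transfer this signals on segments 10, 20, and 25, so proxy buffers shared with other QPs can be released before the final segment completes.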
Arlin Davis [Tue, 18 Feb 2014 23:23:37 +0000 (15:23 -0800)]
mpxyd: proxy out doesn't release proxy buffer as quickly as necessary
Change proxy-out WC processing to release the proxy-out buffer
during every event, not just consumer signaled events.
The remote proxy-in will only send a WC if this WR segment has been
completely moved and is ready for reuse.
Change WR and proxy memory stall logic to limit retries to 5 seconds.
Print warning messages every 100 retries with appropriate queue info.
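The stall policy above (give up after 5 seconds, warn every 100 retries) can be sketched as a pure decision function. The enum, macros, and stall_check() are hypothetical names, and the caller is assumed to track the retry count and elapsed time itself.

```c
#include <assert.h>

enum stall_action { STALL_RETRY, STALL_WARN, STALL_FAIL };

#define STALL_TIMEOUT_MS 5000u /* limit retries to 5 seconds */
#define STALL_WARN_EVERY 100u  /* warn every 100 retries */

/* Decide what to do after the retries-th failed attempt (1-based). */
static enum stall_action stall_check(unsigned retries, unsigned elapsed_ms)
{
    assert(retries > 0);
    if (elapsed_ms >= STALL_TIMEOUT_MS)
        return STALL_FAIL;  /* stop retrying and report failure */
    if (retries % STALL_WARN_EVERY == 0)
        return STALL_WARN;  /* log queue info, then keep retrying */
    return STALL_RETRY;
}
```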
Arlin Davis [Wed, 12 Feb 2014 22:55:25 +0000 (14:55 -0800)]
new lightweight open_query/close_query IB extension for fast attribute query
Consumers that need provider attributes must do a full device open
in order to get any provider/device information. With so many static device
entries in /etc/dat.conf consumers are building classification
mechanisms to identify provider type, locality, name, device
mode, and decide which device is appropriate. The existing DAT interface
doesn't provide a lightweight mechanism for queries.
The following fast query functions have been added to dat_ib_extensions.h:
In addition, DAT extension interface, dat_extension_op, has been
expanded to include new internal calls to handle quick provider load
and function linkage via udat_extension_open, and udat_extension_close
functions. Extended operations needing DAT open/close services need
to be defined from a DAT_OPEN_EXTENSION_BASE or DAT_CLOSE_EXTENSION_BASE
respectively.
NOTE: The ia_handle returned with open_query must be closed with a subsequent
close_query and not used with any other dat_ia_ operations. Attribute
storage from open_query is not valid after the close_query call.
The IB extensions have been rolled to version 2.0.8 with this new API.
The changes are backward compatible.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Wed, 12 Feb 2014 21:41:37 +0000 (13:41 -0800)]
mpxyd: need CM to QP linking with CM references
Complete coding support for ref_cnt on CM to allow for
proper destruction of CM resources. Ref count for CM alloc,
QP linking, and queue list. List dequeue will trigger CM
free, move to destroy state, and dealloc if ref_cnt is zero.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
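The reference counting scheme can be sketched as below. This is an illustration only: struct cm and the cm_alloc/cm_get/cm_put helpers are hypothetical names, and the sketch is single-threaded, whereas the real service would need a lock or atomic operations around ref_cnt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical CM object carrying only its reference count. */
struct cm {
    int ref_cnt;
};

/* Allocation holds the first reference. */
static struct cm *cm_alloc(void)
{
    struct cm *c = calloc(1, sizeof(*c));

    if (c != NULL)
        c->ref_cnt = 1;
    return c;
}

/* Take a reference, e.g. when linking to a QP or queuing on a list. */
static void cm_get(struct cm *c)
{
    assert(c != NULL && c->ref_cnt > 0);
    c->ref_cnt++;
}

/* Drop a reference; the object is deallocated only when the count
 * reaches zero. Returns true if this call freed it. */
static bool cm_put(struct cm *c)
{
    assert(c != NULL && c->ref_cnt > 0);
    if (--c->ref_cnt > 0)
        return false;
    free(c);
    return true;
}
```

In this model the alloc, the QP link, and the list entry each hold one reference, so the dequeue that drops the last one is the call that frees the object.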
Arlin Davis [Tue, 4 Feb 2014 02:58:04 +0000 (18:58 -0800)]
mpxyd: proxy-in added to proxy-out service to increase cross socket performance
Proxy-in service added to MCM dapl providers to
improve cross socket MIC adapter performance.
Additional RX thread created to handle PI service,
new CM wire protocol to exchange WR and WC references,
and new DTO wire protocol to Read remote PO data
segments and forward via SCIF writeto.
In order to maintain DAT API compatibility the IB
MR addr, rkeys are translated to SCIF addresses
and TPT entries created on the MPXYD to handle inbound
rdma writes targeted to MIC adapters.
Code broken out into separate source files:
mpxy_in.c - proxy_in service
mpxy_out.c - proxy_out service
util.c - general utilities
mix.c - MIC to HOST operations
mpxyd.c - device open, RX, TX, OP, CM threads.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Tue, 4 Feb 2014 02:49:07 +0000 (18:49 -0800)]
mcm: add proxy in support to MCM provider and MPXYD interface
Add dapli_mix_post_recv, dapli_mix_mr_create, dapli_mix_mr_free
no QPr exist on MIC with MXS to MXS connections
cm addr becomes addr1, save all QPr addr1 info during rejects
verify CM service exists before freeing port space, could be on mpxyd
system guid support to verify locality to inside/outside the box
change UD mode checking on EP instead of QP, QP doesn't exist on MXS
add reject support
Fix for CM service RX posting, walking queue doesn't include GRH.
Add system_guid field to MCM provider ib_hca_transport struct
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Tue, 4 Feb 2014 02:37:34 +0000 (18:37 -0800)]
open_ib common: qp, cq, and post_recv changes for proxy-in
Modify common QP, CQ, and DTO services to support proxy-in
service that eliminates the need for local QP and CQ resources
on the MIC adapter.
Change WR UD type check to support no QP mode.
Add dapli_mix_post_recv functionality for PI, QPr on mpxyd.
Store platform unique guid for EP locality - inside/outside.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Tue, 4 Feb 2014 02:31:31 +0000 (18:31 -0800)]
common: add lmr support for proxy in service
Registration details must be transferred to proxy service
to enable proxy-in data transfers. IB registration
and SCIF registration is sent to mpxyd for inbound
rdma write TPT services for IB RW store and SCIF writeto
forward capabilities. Extend DAT LMR to include
scif information and ID. If proxy service is
in use call new functions dapli_mix_mr_create/free
to sync with mpxyd.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Tue, 4 Feb 2014 02:26:53 +0000 (18:26 -0800)]
new definitions and states for CCL Proxy-in support
MCM proxy data limits, new CM free state, EP mode support for EP locality
New ep mapping field in address structure
New MIX ops mix_recv and mix_cm_reject_user
Expanded MIX ops mr structure to include IB and SCIF details
Changed dat_mix_send struct name to dat_mix_sr for send and recv
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Wed, 25 Sep 2013 22:10:56 +0000 (15:10 -0700)]
ucm, scm: UD mode triggers list_head assert with large scale alltoall test
1024+ ranks, IMB alltoall may hit assert when running Intel MPI in UD mode.
CR clean up was implemented with EP to CR references still linked.
During cr_accept, the CR remote_ia_address is linked to EP object
by mistake with UD mode. UD mode may have multiple CRs per EP so
no direct mappings to CR memory can exist unless RC mode which
always has one EP to CR mapping.
In scm, ucm: for CM object free with CR references the search and
unlinking from SP must be under SP lock to serialize. Also,
cleanup thread wakeup logic to only trigger the thread if
reference count indicates the need for more processing.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Fri, 13 Sep 2013 22:12:05 +0000 (15:12 -0700)]
mpxyd: ERR: stalled, insufficient proxy memory
When scaling up/out with lots of QPs using a shared
proxy buffer the rdma writes can block waiting for
memory to free. The signal rate on the posted
writes must be reduced to ensure proxy buffers
are freed in a more timely manner.
Add logic to return failure if stalling becomes
excessive.
Allow administrator to adjust IB mcm_signal_rate
via mpxyd.conf. Default is now 10 instead of 100.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Thu, 12 Sep 2013 21:03:58 +0000 (14:03 -0700)]
mpxyd: handle catastrophic IB async events, including IBV_EVENT_LID_CHANGE
cleanup mdev destroy functions, use mcm_ib_async_str for all IB events.
Destroy all mdev resources, including CM services, and abort all
open clients when receiving the following IB async events:
Arlin Davis [Thu, 12 Sep 2013 16:12:55 +0000 (09:12 -0700)]
mcm: reduce max qp depth and msg size in proxy mode, allow override
DAPL_MCM_WR_MAX is used to set max qp depth on mcm provider, default=500
DAPL_MCM_MSG_MAX is used to set max msg size on mcm provider, default=8388608
DAPL_WR_MAX is used to override max qp depth on all IB providers.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
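Reading such an override from the environment might look like the sketch below. The env_long() helper is hypothetical and is not the provider's actual lookup routine. The commit does not describe the precedence between DAPL_WR_MAX and DAPL_MCM_WR_MAX, so the sketch only covers reading one variable with a fallback default.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Return the positive integer value of environment variable name,
 * or dflt if it is unset, empty, malformed, or not positive. */
static long env_long(const char *name, long dflt)
{
    const char *val;
    char *end;
    long n;

    assert(name != NULL);
    val = getenv(name);
    if (val == NULL || *val == '\0')
        return dflt;

    errno = 0;
    n = strtol(val, &end, 0);
    if (errno != 0 || *end != '\0' || n <= 0)
        return dflt;
    return n;
}
```

A caller could then use env_long("DAPL_MCM_WR_MAX", 500) and env_long("DAPL_MCM_MSG_MAX", 8388608) to pick up the defaults quoted above when no override is set.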
Arlin Davis [Wed, 11 Sep 2013 22:04:37 +0000 (15:04 -0700)]
mpxyd: CM_REPLY: RETRIES (7) EXHAUSTED
The client's RTU is not processed by the mpxyd thread in corner cases.
The SCIF EP, handling the client cm thread (scif_ev_ep) operations,
was not added to select FD set so the op_thread didn't wake up in the
case where RTU's were sent on scif_ev_ep and no operations are
being sent on scif_op_ep.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Tue, 10 Sep 2013 16:19:18 +0000 (09:19 -0700)]
common: cleanup async event processing and logging
Add formatted string print for ib verbs async events
Remove unnecessary logging and duplicate async callbacks
Modify all IB providers to use dapli_async_event_cb()
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Thu, 18 Jul 2013 17:17:11 +0000 (10:17 -0700)]
mcm: support incompatible verbs definitions inter-node within the platform
OFA verbs 3.5 and 1.5.4 are incompatible so there can be no direct
mappings to verbs within any MIC to Host communications. Remove
all direct verbs mappings in MIX and create inline construct functions
to convert verbs to new dat_mix_wr and dat_mix_wc types for both work requests
and work completions.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Tue, 16 Jul 2013 23:12:37 +0000 (16:12 -0700)]
dapltest: add -n parameter to override default server port number (45278)
Modify all tests and commands to take a new -n parameter option for server
listen port. The default port, when running multiple EP's and threads,
will sometimes collide and fail with EADDRINUSE on iWARP configurations
using rdma_bind_addr with sin_port=0.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Fri, 12 Jul 2013 18:52:33 +0000 (11:52 -0700)]
ucm,scm: UD mode creates many CR objects per EP that need to be cleaned up
After connection is established and the AH is provided to consumer
on UD connect establishment there is no need to keep the CR object
on the SP. For large clusters this results in a growing memory
footprint for CR objects and long cleanup times on device close.
Change ucm and scm providers to unlink and free CR resources
during CM object free if this is a UD QP and CONN_EST state.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Wed, 12 Jun 2013 16:45:45 +0000 (09:45 -0700)]
cma: long delays when opening cma provider with no IPoIB configured
The rdma_cm provider (ofa-v2-ib0) can take netdev, ip address, or hostname
for local address bindings. When trying to open a non-existent netdev (ib0)
the provider will fall through and use the getaddrinfo sys call assuming
dat.conf parameter is either an IP address or hostname and not a netdev.
When trying hostname option it will attempt to resolve the name via the
name services. On a KNC this can result in long timeouts depending on the
configuration. This changes the error handling when opening the cma provider
on a non-existent netdev and will only call getaddrinfo with AI_CANONNAME
hints after checking the dat.conf parameter for a valid hostname.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>