Arlin Davis [Wed, 10 Feb 2016 22:45:12 +0000 (14:45 -0800)]
openib_common: set providers mtu to active_mtu instead of 2048
Better out of the box performance when setting mtu to active_mtu
instead of default settings of 2K. The new mtu settings are applied
on a per QP basis and negotiated via CM mtu 8-bit field. One of the
reserved 8-bit CM message fields is used to ensure compatibility
with older versions.
If older endpoints are mixed with newer versions, the connection falls back to
the pre-existing 2K MTU setting, unless overridden by DAPL_IB_MTU.
The change has been made across all providers including ucm, scm, mcm,
and cma (rdma_cm). The mcm provider on a MIC will notify the CCL Proxy
service of a DAPL_IB_MTU override via a new MIX_OP_FLAGS bit
MIX_OP_MTU during the open call.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
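The fallback rule described above can be sketched roughly as follows. This is
an illustration only, not the provider code: negotiate_mtu() and its
parameters are hypothetical, the MTU codes mirror the values of libibverbs'
enum ibv_mtu, and it assumes an older peer leaves the reserved field at zero
and that the override has already been converted to an MTU code.

```c
#include <stdint.h>

/* IB MTU codes, same numbering as libibverbs' enum ibv_mtu. */
enum mtu_code { MTU_256 = 1, MTU_512 = 2, MTU_1024 = 3, MTU_2048 = 4, MTU_4096 = 5 };

/*
 * Hypothetical helper: choose the MTU code for a new QP.
 *   local_active  this port's active_mtu code
 *   remote_field  8-bit MTU field from the peer's CM message
 *                 (ASSUMPTION: 0 when the peer predates the field)
 *   override_mtu  DAPL_IB_MTU as an MTU code, 0 when not set
 *                 (ASSUMPTION: the override wins in every case)
 */
static uint8_t negotiate_mtu(uint8_t local_active, uint8_t remote_field,
                             uint8_t override_mtu)
{
    if (override_mtu != 0)
        return override_mtu;      /* explicit user setting */
    if (remote_field == 0)
        return MTU_2048;          /* older peer: keep the 2K default */
    /* both sides advertise an MTU: use the smaller one */
    return local_active < remote_field ? local_active : remote_field;
}
```

For example, negotiate_mtu(MTU_4096, 0, 0) yields MTU_2048, while two new
endpoints that both report MTU_4096 keep MTU_4096.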
Amir Hanania [Tue, 26 Jan 2016 22:03:16 +0000 (14:03 -0800)]
dtest: enhancement to test, -D option for data check
With -D option, dtest will run pingpong rdma write test
with data validation. Changes pattern during iterations.
Aborts and reports location/pattern with any miscompare.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Signed-off-by: Amir Hanania <amir.hanania@intel.com>
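A data check of this kind can be sketched as below. The pattern and the
function names are assumptions chosen for illustration; the actual dtest
pattern is not described in the commit message.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill the buffer with a pattern that changes with the iteration number
 * (hypothetical pattern: byte offset plus iteration, truncated to 8 bits). */
static void fill_pattern(uint8_t *buf, size_t len, unsigned iter)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = (uint8_t)(i + iter);
}

/* Return the offset of the first byte that does not match the pattern
 * expected for this iteration, or -1 when the whole buffer matches. */
static long find_miscompare(const uint8_t *buf, size_t len, unsigned iter)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] != (uint8_t)(i + iter))
            return (long)i;
    return -1;
}
```

A caller would fill the send buffer before each rdma write, run
find_miscompare() on the receive buffer afterwards, and abort with the
returned offset and the expected byte when the result is not -1.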
Amir Hanania [Mon, 25 Jan 2016 20:30:38 +0000 (12:30 -0800)]
mcm: add support for Intel Omni-Path driver (hfi) via mic MFO mode
Set MIC based consumer to MFO (full offload) mode for both qib and new hfi devices.
Add to dat.conf entries for hfi verbs support. This can be run from mic or host
endpoints.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Signed-off-by: Amir Hanania <amir.hanania@intel.com>
Arlin Davis [Mon, 25 Jan 2016 19:51:33 +0000 (11:51 -0800)]
mpxyd: fix ordering issues with the CCL Proxy receive side forwarding mechanism
scif_writeto doesn't guarantee ordering on DMA posting like IB rdma writes.
Since CCL Proxy is emulating IB semantics we must preserve order of
the rdma write request from MIC consumers via any proxy scif operations.
Changes made to proxy-in to defer forwarding RR completed segments
unless they are middle segments of a larger write operation. On FS or LS
the previous scif_writeto DMA operations must be completed and signaled
before posting a first or last segment. Last segment scif_writeto
operation is ordered to ensure last byte is the last byte of
complete rdma write proxied operation.
During scif_wt errors send WC error status for each pending segment
with rdma write operation for accurate proxy-out error processing.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
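The posting rule for first (FS), middle, and last (LS) segments might be
expressed as in this sketch. The flag values and can_post_segment() are
hypothetical names for illustration and are not taken from mpxyd.

```c
#include <stdbool.h>

#define SEG_FIRST 0x1u   /* FS: first segment of a proxied rdma write */
#define SEG_LAST  0x2u   /* LS: last segment of a proxied rdma write */

/*
 * Decide whether a segment may be handed to the DMA engine now.
 * Middle segments carry no ordering constraint; a first or last segment
 * waits until every earlier DMA operation has completed and been signaled.
 */
static bool can_post_segment(unsigned seg_flags, unsigned dma_outstanding)
{
    if ((seg_flags & (SEG_FIRST | SEG_LAST)) == 0)
        return true;                 /* middle segment: post immediately */
    return dma_outstanding == 0;     /* FS or LS: defer until drained */
}
```

A segment that is both first and last (a single-segment write) is treated
like any other FS/LS segment and waits for the queue to drain.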
Arlin Davis [Thu, 10 Dec 2015 22:48:05 +0000 (14:48 -0800)]
mpxyd: with abnormal CM termination a CM object can be referenced after QP destroy
The proxy-in CQ is not flushed and processed properly during
mix_qp_destroy. Depending on the EP mode there can be 2 separate
connections with multiple CQs to process. Add new mix_cq_flush
function that will flush all pending work on TX and RX side of
proxy engine. CM object is destroyed and reset only after all
pending work is processed on ALL endpoint CQ associations.
Add error logging when WR resources are exhausted.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Error caused by cm_msg size compatibility issue with new v8
protocol and older socket cm providers (2.1.4 and older).
The ucm, cma, and mcm providers are not affected.
Modify socket data sizes for SCM request/reply to interoperate
between new v8 with smaller private data and older protocols.
Adjust SCM reply/rtu based on remote CM version and retry a failed
request with pre-v8 adjusted size in case of server side failure.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
In function dapls_ib_qp_free(), pointers qp and cm_ptr->cm_id->qp are pointing to the same qp
structure, initialized in function dapls_ib_qp_alloc(). The memory pointed to by these pointers is freed
twice in function dapls_ib_qp_free(), using rdma_destroy_qp() for the case _OPENIB_CMA defined and
then further using ibv_destroy_qp(), causing a segmentation fault while freeing the qp. Therefore
assigned NULL value to qp to avoid freeing illegal memory.
Fixes: 7ff4f840bf11 ("common: add CM-EP linking to support mutiple CM's and proper protection during
destruction")
Signed-off-by: Bharat Potnuri <bharat@chelsio.com>
Acked-by: Arlin Davis <arlin.r.davis@intel.com>
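The aliasing problem and the fix can be illustrated with a small stand-alone
sketch. Everything here is a mock: mock_cm_destroy_qp() and mock_destroy_qp()
only stand in for rdma_destroy_qp() and ibv_destroy_qp(), and qp_free() is a
simplified, hypothetical version of the cleanup path rather than the actual
dapls_ib_qp_free() code.

```c
#include <stdlib.h>

struct qp    { int num; };
struct cm_id { struct qp *qp; };

static int destroy_calls;   /* how many times a QP was destroyed */

/* Mock standing in for rdma_destroy_qp(): frees the QP owned by the id. */
static void mock_cm_destroy_qp(struct cm_id *id)
{
    free(id->qp);
    id->qp = NULL;
    destroy_calls++;
}

/* Mock standing in for ibv_destroy_qp(). */
static void mock_destroy_qp(struct qp *qp)
{
    free(qp);
    destroy_calls++;
}

/* Simplified cleanup: qp may alias id->qp, so it is cleared after the
 * first destroy and the second destroy is skipped. */
static void qp_free(struct cm_id *id, struct qp *qp)
{
    if (id != NULL && id->qp == qp) {
        mock_cm_destroy_qp(id);
        qp = NULL;              /* the alias is dangling now: drop it */
    }
    if (qp != NULL)
        mock_destroy_qp(qp);
}
```

Without the qp = NULL assignment the second branch would free the same
memory again, which is the double free described above.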
Amir Hanania [Wed, 23 Sep 2015 21:43:38 +0000 (14:43 -0700)]
mpxyd: add P2P inline support for data size <= 96 bytes
Improve small message latency for proxy to proxy service
by including data with the proxy work request. Necessary
changes made to preserve order across WR's regardless
of size. Additional logging included. Improves single byte
one-way latency by about 27% on MFO configurations.
Changes made to avoid forwarding 0-byte rdma write to
scif_writeto, remove CPU hand copies, and order.
Changes for numa_node == -1 such that mic0 assumes MSS
and mic1 assumes MXS modes.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Signed-off-by: Amir Hanania <amir.hanania@intel.com>
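The size-based choice of data path could look like the sketch below. The
96-byte limit comes from the commit subject; the enum, the function name, and
the treatment of a 0-byte write as its own case are assumptions made for
illustration.

```c
#include <stddef.h>

#define P2P_INLINE_MAX 96   /* from the commit subject: data size <= 96 bytes */

enum wr_path {
    WR_NO_DATA,   /* 0-byte write: nothing is handed to scif_writeto */
    WR_INLINE,    /* payload travels inside the proxy work request */
    WR_DMA        /* payload is moved separately from the work request */
};

/* Hypothetical classifier for an rdma write of len bytes. */
static enum wr_path classify_write(size_t len)
{
    if (len == 0)
        return WR_NO_DATA;
    if (len <= P2P_INLINE_MAX)
        return WR_INLINE;
    return WR_DMA;
}
```

Whichever path is chosen, work requests still have to complete in posting
order, which is why the commit also touches the ordering logic.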
Arlin Davis [Mon, 21 Sep 2015 22:48:15 +0000 (15:48 -0700)]
dtest: change rdma_write_ping_pong so client is always last receiver
The server always waits after the test loops for a DREQ event, so in order
to shut down gracefully the client should always receive the last handshake
message and issue the DREQ. Remove logging in loop.
Always init data and increase min rdma buffer size to 4KB.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Carol L Soto [Mon, 24 Aug 2015 19:58:58 +0000 (12:58 -0700)]
dapltest: dapltest with no argument not working in ppc64 arch
If dapltest is run with no args then the client was getting
Warning: conn_event_wait DAT_CONNECTION_EVENT_NON_PEER_REJECTED
Reference to RH1056487- dapltest Read and Write performance
tests are not working
Signed-off-by: Carol L Soto <clsoto@linux.vnet.ibm.com>
Arlin Davis [Wed, 12 Aug 2015 16:46:30 +0000 (09:46 -0700)]
mpxyd: proxy_in data transfers can improperly start before RTU received
Proxy-in data transfers must be deferred until RTU is received
and QP is in CONN state. Otherwise, the remote PI WC address/rkey
information is still uninitialized.
Check for initial CONN state before processing RR or WT data phase
and set RR to pause state until RTU and remote PI WRC information
is processed. Update pi_req_event error logging.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
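The pause-until-RTU behaviour can be modelled with a few lines of C. The
structure and function names are hypothetical and the counters only stand in
for real request queues; the sketch shows the state check, not the mpxyd
data path.

```c
/* Hypothetical per-endpoint bookkeeping for the proxy-in side. */
struct pi_ep {
    int      connected;   /* set once RTU has been received (CONN state) */
    unsigned paused;      /* requests held back until RTU arrives */
    unsigned started;     /* requests whose data phase has begun */
};

/* A new RR/WT request arrives: start it only when connected. */
static void pi_submit(struct pi_ep *ep)
{
    if (ep->connected)
        ep->started++;
    else
        ep->paused++;     /* remote PI address/rkey not known yet */
}

/* RTU received: remote information is valid, release paused requests. */
static void pi_on_rtu(struct pi_ep *ep)
{
    ep->connected = 1;
    ep->started += ep->paused;
    ep->paused = 0;
}
```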
Amir Hanania [Wed, 5 Aug 2015 21:55:30 +0000 (14:55 -0700)]
mpxyd: add MFO support on proxy side
Add checking for MFO and MXS and provide proxy-in and proxy-out
services for each mode. MXS_EP check is now MXF_EP (MFO or MXS).
Add new MIX device open, query, port query, pz operations.
Add new pz list and object management via scif_dev structure.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Signed-off-by: Amir Hanania <amir.hanania@intel.com>
Amir Hanania [Wed, 5 Aug 2015 20:41:32 +0000 (13:41 -0700)]
mcm: add MFO support to openib_common code base
Provide full proxy support of CQ, QP, PZ, MR and device.
Use new MXF_EP macro to switch proxy service based
on MXS (cross socket) or MFO (full offload) modes.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Signed-off-by: Amir Hanania <amir.hanania@intel.com>
Arlin Davis [Fri, 24 Jul 2015 23:01:29 +0000 (16:01 -0700)]
mcm: add intra-node support via ibscif device and mcm provider
- New device entry ofa-v2-scif0-m
- Support for different CM and EP locality (MIC vs proxy LID)
- MSS mode for all scif device opens via proxy
- logging changes for multi-lid options
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Tue, 14 Jul 2015 21:58:32 +0000 (14:58 -0700)]
mcm,mpxyd: fix dreq processing to defer QP flush when proxy WRs still pending
The proxy will now defer DREQ flushing of proxy QPs if PI and PO
data engines have outstanding requests. Add mcm_qp_busy routine
for checking PI and PO data engines. When MIC calls disconnect
always send DREQ up to proxy in order to handle deferred flush
of proxy side posted rcv messages.
Change QP free to modify both local and proxy QPs and check for
outstanding rcv message before qp_destroy to avoid infinite wait
in dapls_ep_flush_cqs.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Amir Hanania [Wed, 17 Jun 2015 17:12:24 +0000 (10:12 -0700)]
mcm: bug fixes for non-inline devices
mcm proxy mi_send_pi set up the registered WR structure properly for no
inline data support but incorrectly overwrote sg.addr with the
WR structure on stack.
qp create didn't check for no inline and set up the create accordingly.
Signed-off-by: Amir Hanania <amir.hanania@intel.com>
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Fri, 5 Jun 2015 19:14:37 +0000 (12:14 -0700)]
mpxyd,mcm: RDMA write with immed data not signaled on request side
With eager completions set, the wc_flags is not set properly on event.
With eager completions not set, the proxy CQ reference is incorrect
and event is forwarded to MCM receive EVD instead of transmit EVD.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Wed, 20 May 2015 18:43:03 +0000 (11:43 -0700)]
mcm: add HST side provider support for device without inline data capability
Add registered WR buffers for HST->MXS (proxy in) mode
when inline data is not supported by device. Use registered
memory for source WR buffer instead of stack when sending
RDMA write request to peer proxy-in service.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Mon, 18 May 2015 21:51:08 +0000 (14:51 -0700)]
ucm: CM changes for UD extended port space and indexer
Tested on 1200n 28ppn cluster, AlltoAll Intel MPI, UD mode.
Both static and dynamic modes, over 500m connections.
Change port manager to indexer and service ID manager
to bitarray indexer. Reduces footprint for service IDs
and allows direct lookup on CM messages.
New insert, remove, lookup functions for processing ID
based CM objects. Inbound requests, with the exception
of new CM requests, will no longer parse list but
use hash table lookups.
AH caching is now used to prevent unnecessarily
creating multiple AH's for same QP destination.
Add 24-bit port space support to CM processing code and
to wire protocol via DCM message reserve space.
Add version check to limit to 16-bit for backward compatibility.
Bump CM protocol version to 8 for xport and rtns fields.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
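A bit-array ID manager with a version-dependent port space can be sketched
as follows. This is not the ucm indexer; the structure, the function names,
and the exact version test are assumptions. Only the 24-bit and 16-bit
limits and protocol version 8 come from the text above.

```c
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

/* One bit per service ID: small footprint and direct lookup by ID. */
struct id_map { unsigned long *bits; uint32_t max; };

/* ASSUMPTION: peers at protocol version 8 or later get the 24-bit port
 * space, older peers stay limited to 16 bits. */
static uint32_t port_space_limit(unsigned peer_version)
{
    return peer_version >= 8 ? (1u << 24) : (1u << 16);
}

static int id_map_init(struct id_map *m, uint32_t max)
{
    m->bits = calloc((max + WORD_BITS - 1) / WORD_BITS, sizeof(unsigned long));
    m->max = max;
    return m->bits ? 0 : -1;
}

static int id_test(const struct id_map *m, uint32_t id)
{
    return (int)((m->bits[id / WORD_BITS] >> (id % WORD_BITS)) & 1UL);
}

/* Reserve an ID; fails when it is out of range or already in use. */
static int id_reserve(struct id_map *m, uint32_t id)
{
    if (id >= m->max || id_test(m, id))
        return -1;
    m->bits[id / WORD_BITS] |= 1UL << (id % WORD_BITS);
    return 0;
}

static void id_release(struct id_map *m, uint32_t id)
{
    if (id < m->max)
        m->bits[id / WORD_BITS] &= ~(1UL << (id % WORD_BITS));
}
```

With a 24-bit space the bitmap costs 2 MB (2^24 bits), compared with one
list or table entry per ID in a more conventional manager.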
Arlin Davis [Mon, 18 May 2015 21:21:07 +0000 (14:21 -0700)]
ucm: optimizations for large scale UD communication management
AH caching per QP, AH space set to 48K for LID unicast
Bump port space up to 24 bits
Reduce CM object and reduce private data to 68 bytes
Add xport space and rtns to DCM reserve fields.
New indexer macros for port space hash table management
Add hash table storage to ibtrans device objects
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
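A per-QP AH cache indexed by destination LID might look like the sketch
below. The 48K table size matches the unicast LID range 0x0001 to 0xBFFF;
the structures are hypothetical and mock_create_ah() merely stands in for
the verbs call that creates an address handle.

```c
#include <stdint.h>
#include <stdlib.h>

#define UCAST_LID_SPACE 0xC000u   /* 48K entries: unicast LIDs 0x0001..0xBFFF */

struct ah { uint16_t dlid; };     /* mock address handle */

struct ah_cache {
    struct ah *slot[UCAST_LID_SPACE];   /* one slot per destination LID */
    unsigned   created;                 /* number of AHs actually created */
};

/* Mock standing in for the real AH creation call. */
static struct ah *mock_create_ah(uint16_t dlid)
{
    struct ah *ah = malloc(sizeof *ah);
    if (ah != NULL)
        ah->dlid = dlid;
    return ah;
}

/* Return the cached AH for dlid, creating it on first use only. */
static struct ah *ah_get(struct ah_cache *c, uint16_t dlid)
{
    if (dlid == 0 || dlid >= UCAST_LID_SPACE)
        return NULL;                    /* not a unicast LID */
    if (c->slot[dlid] == NULL) {
        c->slot[dlid] = mock_create_ah(dlid);
        if (c->slot[dlid] != NULL)
            c->created++;
    }
    return c->slot[dlid];
}
```

The table should be heap allocated (for example with calloc) since it holds
48K pointers; repeated sends to the same LID then reuse a single AH.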
Arlin Davis [Thu, 12 Feb 2015 20:21:37 +0000 (15:21 -0500)]
mpxyd: add support for devices without inline data support
Add function to check for inline support during device open.
If inline data is not supported, the CM service and Proxy
data mover will not use inline data option on small IO.
The PO->PI service will now allocate and register necessary
memory to send mcm_wr_rx and mcm_wc_rx operations from
registered memory locations if inline data not supported.
If inline is supported, no extra memory will be allocated
and src buffer will be built on stack as before.
Cleanup some build warnings.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Thu, 22 Jan 2015 23:49:25 +0000 (15:49 -0800)]
openib: add inline data support check during device open
Not all rdma devices support inline data, however without
a verbs device attribute the only way to determine
support is with a QP create with max_inline_send set.
Add a common function to verify inline data support
before setting default to 64 bytes.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
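The probe-then-default idea can be sketched without a real device by passing
the QP creation attempt in as a callback. All names here are hypothetical;
in a real provider the callback would attempt a verbs QP create with the
inline size requested and destroy the QP again on success.

```c
#include <stddef.h>
#include <stdint.h>

#define DEFAULT_INLINE 64u   /* default inline size named in the commit */

/* Callback that tries to create a QP with the given inline size.
 * Returns 0 on success, nonzero on failure. */
typedef int (*qp_probe_fn)(uint32_t max_inline, void *ctx);

/* Return the inline size to use: the default if the probe succeeds,
 * 0 (no inline data) if the device rejects the request. */
static uint32_t probe_inline_default(qp_probe_fn try_create, void *ctx)
{
    return try_create(DEFAULT_INLINE, ctx) == 0 ? DEFAULT_INLINE : 0;
}

/* Stand-in devices for illustration: one accepts inline data, one does not. */
static int probe_ok(uint32_t max_inline, void *ctx)
{
    (void)max_inline; (void)ctx;
    return 0;
}

static int probe_fail(uint32_t max_inline, void *ctx)
{
    (void)max_inline; (void)ctx;
    return -1;
}
```

probe_inline_default(probe_ok, NULL) gives 64 and
probe_inline_default(probe_fail, NULL) gives 0, so later code can test the
result before requesting inline sends.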
Arlin Davis [Mon, 15 Dec 2014 20:15:54 +0000 (12:15 -0800)]
dapl: mpxyd service changes to support multi-thread single-core option
The proxy service has been changed to reduce the number of cores required
on the host side. Provides new option, via mpxyd.conf, to use single-core
and allow system administrator to bind to specific core id for all Intel
Xeon Phi adapters in the platform.
mcm_affinity = 2 will set to single core (per Intel Xeon Phi).
mcm_affinity_base_mic will set to specific core for all adapters.
Best performance can be achieved with mcm_affinity = 2 and
mcm_affinity_base_mic == 0. This option will cause a single core
to remain busy, polling operations from clients, as long
as the device is open and being used by clients for data
transfers.
Arlin Davis [Mon, 15 Dec 2014 20:05:33 +0000 (12:05 -0800)]
dapl: add rdma_write_imm and write only option to dtest
New write_only (-w) option with rdma_write_imm can
be used with providers that support IB extensions.
Allows more options for write bandwidth profiling
with immediate data and signaling rate options
to increase write data rates, especially on MIC
clients that use proxy services.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Arlin Davis [Fri, 21 Nov 2014 22:26:40 +0000 (14:26 -0800)]
mpxyd: DTO completion ERR: status 12, op RDMA_WRITE running MPI alltoall test
Running MIC scale-up configuration with mcm provider on a MXS node
instead of shm causes DTO error due to heavy use of proxy-in buffer pools.
Hit corner case where proxy buffer management hd ptr crossed tl
ptr due to 64 byte alignment on start when hd < 64 bytes behind tl.
Add additional checking on PO and PI buffer management to handle
the case of HD passing TL on start locations. Also changed PO
processing to hold lock until hd ptr is registered with buf_wc slot
management to preserve order of memory usage across threads.
Reduced the size of WC queue for PO and PI buffer management.
Profiling, via MCM_PROFILE, was added to monitor and trigger buffer
management errors.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
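The head/tail corner case can be reproduced with a small ring-buffer sketch.
ring_reserve() is a hypothetical helper, not the mpxyd buffer manager, and
it assumes hd == tl means the ring is empty. It shows why the aligned start
has to be checked against the tail, not just the raw head.

```c
#include <stddef.h>

#define BUF_ALIGN 64u   /* start addresses are aligned to 64 bytes */

/*
 * Reserve len bytes in a ring of `size` bytes.
 *   hd  offset where the next allocation would start (before alignment)
 *   tl  offset of the oldest region still in use
 * Returns the aligned start offset, or -1 when the request does not fit.
 */
static long ring_reserve(size_t size, size_t hd, size_t tl, size_t len)
{
    size_t start = (hd + (BUF_ALIGN - 1)) & ~(size_t)(BUF_ALIGN - 1);

    if (hd < tl) {
        /* Free space is [hd, tl). Rounding hd up can land on or beyond
         * tl when hd is less than 64 bytes behind it: reject that case. */
        if (start >= tl || len >= tl - start)
            return -1;
        return (long)start;
    }
    /* hd >= tl: free space is [hd, size), then [0, tl) after wrapping. */
    if (start < size && len < size - start)
        return (long)start;
    if (len < tl)
        return 0;               /* wrap to the beginning of the ring */
    return -1;
}
```

With size 4096, hd = 100 and tl = 120, the aligned start would be 128, which
is already past the tail, so the request is refused instead of overwriting
data that is still in use.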
Arlin Davis [Tue, 30 Sep 2014 21:07:52 +0000 (14:07 -0700)]
dtestx: update IB extension example test with new v2.0.9 features
Add support for new IB extensions for CM and AH resource cleanup.
Check for v2.0.9 and call dat_ib_ud_cm_free after connection
establishment and dat_ib_ud_ah_free after all data has been
transferred on UD endpoints.
Also add socket based address exchange to eliminate the need
to include lid and qpn parameters on the client side.
Change the multiple EP mode to send from EP 0 to EP[0-3] on
server side and EP[0-3] to EP[0-3] on client side.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
Amir Hanania [Thu, 25 Sep 2014 23:32:06 +0000 (16:32 -0700)]
common: add srq support for openib verbs providers
Add necessary components and hooks to support ib_verbs shared
receive queues for both RC and UD QP's. External interfaces
were already provided per DAT 2.0 specification but internal
support was missing.
A new dtestsrq will be provided with package for testing and
example code.