[WinOF] Not all installers (svr2003/XP) default TARGETDIR to WindowsVolume; explicitly set WindowsVolume so DAT\ is installed where expected: '%SystemDrive%\DAT'.
[WinOF] Document the logic of loading the Windows driver store first with the mlx4_hca driver, then with the mlx4_bus driver. There is a race between PNP and the MSFT installer: if the mlx4_bus driver is installed first, it sets up for mlx4_hca, and PNP requests the mlx4_hca driver before that driver has been installed into the driver store. The net result is a failed mlx4_hca load/startup. Install into the driver store: first mlx4_hca, then mlx4_bus.
dapl2
Move close socket calls to the connection thread, to prevent accessing a socket
after it has been closed.
Remove the setsockopt call to mark listen addresses as reusable. A reusable
address allows other libraries, such as MPI, to bind to the same address.
Create all sockets as IPPROTO_TCP, rather than undefined.
Listen on a specific address, rather than any address, to prevent two listens
from occurring on the same port, but using different IP addresses. This
prevents connections from going to the wrong process.
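A minimal sketch of the listen setup described in the dapl2 notes above, written with POSIX sockets so it can be compiled off Windows (the Winsock version would also need WSAStartup and closesocket). The helper name listen_on and the backlog value are assumptions made for illustration; this is not the DAPL code itself.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a TCP listen socket bound to one specific
 * IPv4 address. SO_REUSEADDR is deliberately NOT set, so a second listen
 * on the same address and port fails instead of silently sharing it.
 * Pass *port == 0 to let the stack pick a port; the chosen port is
 * written back through *port. Returns the socket, or -1 on failure. */
static int listen_on(const char *ip, unsigned short *port)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);

    /* Explicit protocol rather than 0 ("undefined"). */
    int s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    if (s < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(*port);

    /* Bind to the given address only, not INADDR_ANY. */
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1 ||
        bind(s, (struct sockaddr *)&addr, sizeof(addr)) != 0 ||
        listen(s, 8) != 0 ||
        getsockname(s, (struct sockaddr *)&addr, &len) != 0) {
        close(s);
        return -1;
    }

    *port = ntohs(addr.sin_port);
    return s;
}
```

With this shape, a second call for the same address and port fails at bind while the first listener is still open, which is the behavior a caller such as MPI can use to detect a port that is already taken.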
winverbs
Map select winsock errors to winverbs error codes. MPI relies on specific error
codes mapped through DAPL, specifically to determine if an address is in use.
Also adds devices to the end of the device list, rather than the beginning,
which helps maintain a more natural order of the devices when matched against
system device lists.
librdmacm
Have librdmacm release all libibverbs resources when the last librdmacm
structure goes away. This allows for graceful cleanup when the libraries are
properly unloaded, which is useful for debugging applications that crash or do
not clean up all resources properly.
winverbs: fix crash accessing freed memory from async thread
If an application exits while asynchronous accept processing is queued,
it's possible for the async processing to access the IbCmId after it has
been freed. A similar problem to this was fixed that dealt with accessing
the verbs QP handle.
A simpler, more generic solution to this problem is to handle application
exit in the same manner as device removal, and lock the winverb provider
lookup lists with exclusive access. Asynchronous operations that are in
process will run to completion, and future operations will be blocked until
the provider cleanup thread has completed. Once they run, they will fail
to acquire a reference on the desired object, which should result in a
graceful failure.
This avoids more complicated locking to use handles belonging to the lower
level code. If a reference on an object can be acquired, the handle will
be available for use until the reference is released. To handle IB CM
callbacks, additional state checking is required to avoid processing
CM events when we're trying to destroy the endpoint.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
git-svn-id: svn://openib.tc.cornell.edu/gen1@2462 ad392aa1-c5ef-ae45-8dd8-e69d62a5ef86
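The reference-based approach in the commit message above can be sketched as follows. This is a single-threaded illustration only: the struct and function names are hypothetical, and the real driver would perform the closing check and the reference count update under the provider lock.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object with a reference count and a closing flag. */
struct obj {
    int  refs;
    bool closing;   /* set once cleanup has started */
    int  handle;    /* stand-in for a lower-level handle */
};

/* Try to take a reference; fails once cleanup has begun. */
static bool obj_acquire(struct obj *o)
{
    if (o == NULL || o->closing)
        return false;
    o->refs++;
    return true;
}

static void obj_release(struct obj *o)
{
    o->refs--;
}

/* Called by the cleanup path (application exit or device removal). */
static void obj_begin_cleanup(struct obj *o)
{
    o->closing = true;
}

/* Async work item: touches the handle only while holding a reference.
 * Returns the handle value, or -1 if the reference cannot be acquired. */
static int async_work(struct obj *o)
{
    int h;

    if (!obj_acquire(o))
        return -1;      /* graceful failure, no access to the handle */
    h = o->handle;      /* valid until the reference is released */
    obj_release(o);
    return h;
}
```

Work that runs before cleanup sees the handle; work that runs afterwards fails at the acquire step and never reaches the freed state.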
winverbs: fix crash accessing freed memory from async thread
If an application exits while asynchronous accept processing is queued,
it's possible for the async processing to access the IbCmId after it has
been freed. A similar problem to this was fixed that dealt with accessing
the verbs QP handle.
A simpler, more generic solution to this problem is to handle application
exit in the same manner as device removal, and lock the winverb provider
lookup lists with exclusive access. Asynchronous operations that are in
process will run to completion, and future operations will be blocked until
the provider cleanup thread has completed. Once they run, they will fail
to acquire a reference on the desired object, which should result in a
graceful failure.
This avoids more complicated locking to use handles belonging to the lower
level code. If a reference on an object can be acquired, the handle will
be available for use until the reference is released. To handle IB CM
callbacks, additional state checking is required to avoid processing
CM events when we're trying to destroy the endpoint.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
git-svn-id: svn://openib.tc.cornell.edu/gen1@2453 ad392aa1-c5ef-ae45-8dd8-e69d62a5ef86
[DAPL2] wait for async processing thread to actually exit.
DAPL doesn't actually wait for the async processing thread to exit before
allowing the library to close. It will wait up to 10 seconds, which under
heavy load isn't enough time. Since the thread is created by an application
level thread, it will continue to run as long as the application runs. But
if the application closes the library, then all library data and code is
invalid, which can result in the thread running something that's not
library code and accessing freed memory.
With this change, I was able to run mpi ping-pong, 16 ranks on a single
system (scm provider) without crashes 1300 times.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
git-svn-id: svn://openib.tc.cornell.edu/gen1@2450 ad392aa1-c5ef-ae45-8dd8-e69d62a5ef86
[DAPL2] add cleanup/release code for timer thread
dapl_set_timer() creates a thread to process timers for dat_ep_connect
but provides no mechanism to destroy/exit during dapl library unload.
Timers are initialized in library init code and should be released
in the fini code. Add a dapl_timer_release call to the dapl_fini
function to check state of timer thread and destroy before exiting.
Signed-off-by: Arlin Davis <arlin.r.davis@intel.com>
git-svn-id: svn://openib.tc.cornell.edu/gen1@2449 ad392aa1-c5ef-ae45-8dd8-e69d62a5ef86
[WinVerbs] fix crash accessing freed memory from async thread
If an application exits while asynchronous accept processing is queued,
it's possible for the async processing to access the IbCmId after it has
been freed. A similar problem to this was fixed that dealt with accessing
the verbs QP handle.
A simpler, more generic solution to this problem is to handle application
exit in the same manner as device removal, and lock the winverb provider
lookup lists with exclusive access. Asynchronous operations that are in
process will run to completion, and future operations will be blocked until
the provider cleanup thread has completed. Once they run, they will fail
to acquire a reference on the desired object, which should result in a
graceful failure.
This avoids more complicated locking to use handles belonging to the lower
level code. If a reference on an object can be acquired, the handle will
be available for use until the reference is released. To handle IB CM
callbacks, additional state checking is required to avoid processing
CM events when we're trying to destroy the endpoint.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
git-svn-id: svn://openib.tc.cornell.edu/gen1@2448 ad392aa1-c5ef-ae45-8dd8-e69d62a5ef86
[WinOF,ETC] When building with latest WDK, turn off OACR (Auto-Code-Review) as it slows down the build process.
Remove extra 'call' when invoking bldwo.bat from buildrelease.bat.
[WinOF] Enable Windows 7 & Svr 2008 R2 build support by making WDK 7600.16385.0 the default build env when using winof\buildrelease.bat to build a release; only affects builds using buildrelease.bat script.
[WinOF] Identify platform processor architecture [32/64 bit] for which the installer is targeted. Use the new include file WinOF\Wix\common\UserInterface.inc to define the WIX UI for all installer variants.
[WinOF] reduce Wix installer (.msi file generation) complexity & redundancy by moving bulk of Makefile to a parameterized WinOF\Wix\common\Makefile.inc
[IBAL] use non-pageable memory to prevent possible problems on power down.
IBAL uses pageable memory to create the PnP context. This can cause problems in power-down flows when the system is under contention. We saw a similar case at a customer. There is no strong evidence that this was the cause, but with this patch IBAL is safer at no cost. WinOF 2.1 testing has demonstrated that with this patch, infrequent (1 out of 10) power-down BSODs have disappeared.
[WinOF] Remove redundant 'root' folder specifications, as TARGETDIR is the %SysVolume% 'root' (aka 'slash'). Identify 64-bit installers and install WinOF into the ProgramFiles64 folder, which == ProgramFiles, so that WinOF installs in the same folder for all architectures.
[WinOF/DAPL2] Reflect that WinOF now installs into %ProgramFiles% on all systems, not %ProgramFiles(x86)%.
Move Dat/Dapl 1.1 providers to the end of the list for those MPIs which use the 1st available provider; ends up defaulting to DAT/DAPL V2.0. Thus begins the phasing out of DAT/DAPL 1.1 support.
[WinOF] Streamline driver uninstall and cleanup to play nicely with MSFT PNP; believe PNP will cleanup .inf referenced files when device references have reached zero (shutdown ND & WSD services prior to PNP device removal).
[WinOF] Place librdmacm.dll in %windir% folder along with the rest of the WinOF libs. Review folder usage to consider using system32/syswow64 at a later date.
[WinOF] remove correct WinOF shortcut folder & rename WinOF start menu Command link to 'WinOF Command' so as not to confuse the Svr 2008 start menu auto include entries.
[WinOF] Streamline WinOF uninstall such that it plays nicely with MSFT PNP.
1) allow PNP to remove .inf referenced files and cleanup driver store.
2) shutdown ND & WSD prior to PNP device removal.
3) remove stale code which checks for OpenIB installs and forces a reboot
[IBBUS,COMPLIB] Eliminate re-initialization of the stop lock. Crash reported upon running “System Common Scenario” WHQL test with our stack. The crash: C4 (0xd7), which means Driver Verifier revealed a re-initializing of Remove Lock.
Signed-off by Leonid Keller leonid@mellanox.co.il
To better mimic Linux fd support, add the concept of channel sets to completion channels. This is needed to minimize changes in platform independent uDAPL code.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
git-svn-id: svn://openib.tc.cornell.edu/gen1@2412 ad392aa1-c5ef-ae45-8dd8-e69d62a5ef86
[HW] allow retrieving CA attributes with pageable memory. [winof: 2408]
Modify the HCA drivers to support querying for attributes using a pageable buffer. Since the query calls block, it seems appropriate for the calls to allow pageable memory, rather than forcing the user to allocate a non-paged buffer in order to obtain a list of attributes. The problem stems from the HCA drivers accessing a user's buffer after acquiring a spinlock that raises IRQL.
This fixes kernel crashes with both the winmad and winverbs drivers.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
git-svn-id: svn://openib.tc.cornell.edu/gen1@2411 ad392aa1-c5ef-ae45-8dd8-e69d62a5ef86
If an application calls Connect or Accept, their IRP is queued to a
work queue for asynchronous processing. However, if the application
crashes or exits before the work queue can process the IRP, the cleanup
code will call WvEpFree(). This destroys the IbCmId.
When the work queue finally runs, it can access a freed IbCmId.
This is bad. A similar race exists with the QP and the asynchronous
disconnect processing. The disconnect processing can access
the hVerbsQp handle after it has been destroyed.
Additionally, in all three cases, the IRPs assume that the WV provider
is able to process IRPs. Specifically, they require that the index
tables maintained by the provider are still valid. References must
be held on the WV provider until the IRPs finish their processing to
ensure this.
Fix invalid accesses to the IbCmId and hVerbsQp handles by locking
around their use after valid state checks. In the case of the QP, we
add a guarded mutex for synchronization purposes and use that in place
where the PD mutex had been used.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
git-svn-id: svn://openib.tc.cornell.edu/gen1@2410 ad392aa1-c5ef-ae45-8dd8-e69d62a5ef86
Winmad returns an incorrect error code when using send or receive in synchronous mode. The OFED MAD code ends up working, since it checks for errors by comparing the return value < 0. In this case, the return value is positive, when it should be zero. Simplify the code and return the correct error code.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
git-svn-id: svn://openib.tc.cornell.edu/gen1@2409 ad392aa1-c5ef-ae45-8dd8-e69d62a5ef86
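To illustrate why the incorrect winmad return value went unnoticed, the sketch below contrasts a caller that only tests for negative values with one that treats any nonzero value as an error. All function names and the specific positive value are hypothetical stand-ins, not the winmad or OFED MAD code.

```c
/* Hypothetical stand-in for the old synchronous path: returns a
 * positive value on success instead of zero. */
static int sync_send_buggy(int ok)
{
    return ok ? 1 : -1;
}

/* Hypothetical stand-in for the corrected path: zero on success. */
static int sync_send_fixed(int ok)
{
    return ok ? 0 : -1;
}

/* Caller in the style described above: only negative means error. */
static int caller_lenient(int ret)
{
    return ret < 0 ? -1 : 0;
}

/* Stricter caller: any nonzero return is treated as an error. */
static int caller_strict(int ret)
{
    return ret != 0 ? -1 : 0;
}
```

The lenient caller accepts both variants, which is why the existing code kept working; a strict caller would report a failure for the old positive return even though the operation succeeded.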
[HW] allow retrieving CA attributes with pageable memory
Modify the HCA drivers to support querying for attributes using a pageable buffer. Since the query calls block, it seems appropriate for the calls to allow pageable memory, rather than forcing the user to allocate a non-paged buffer in order to obtain a list of attributes. The problem stems from the HCA drivers accessing a user's buffer after acquiring a spinlock that raises IRQL.
This fixes kernel crashes with both the winmad and winverbs drivers.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
git-svn-id: svn://openib.tc.cornell.edu/gen1@2408 ad392aa1-c5ef-ae45-8dd8-e69d62a5ef86
Winmad returns an incorrect error code when using send or receive in synchronous mode. The OFED MAD code ends up working, since it checks for errors by comparing the return value < 0. In this case, the return value is positive, when it should be zero. Simplify the code and return the correct error code.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
git-svn-id: svn://openib.tc.cornell.edu/gen1@2407 ad392aa1-c5ef-ae45-8dd8-e69d62a5ef86
If an application calls Connect or Accept, their IRP is queued to a
work queue for asynchronous processing. However, if the application
crashes or exits before the work queue can process the IRP, the cleanup
code will call WvEpFree(). This destroys the IbCmId.
When the work queue finally runs, it can access a freed IbCmId.
This is bad. A similar race exists with the QP and the asynchronous
disconnect processing. The disconnect processing can access
the hVerbsQp handle after it has been destroyed.
Additionally, in all three cases, the IRPs assume that the WV provider
is able to process IRPs. Specifically, they require that the index
tables maintained by the provider are still valid. References must
be held on the WV provider until the IRPs finish their processing to
ensure this.
Fix invalid accesses to the IbCmId and hVerbsQp handles by locking
around their use after valid state checks. In the case of the QP, we
add a guarded mutex for synchronization purposes and use that in place
where the PD mutex had been used.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
git-svn-id: svn://openib.tc.cornell.edu/gen1@2406 ad392aa1-c5ef-ae45-8dd8-e69d62a5ef86
DAPL automatically calls rdma_disconnect() when a disconnect request is
received. If the user also calls disconnect, that calls rdma_disconnect() as
well, but the connection has already been disconnected by DAPL and is no longer
valid. The result is that the user's call to rdma_disconnect() will fail. Do
not display an error message if this occurs.
Locking could be added to prevent calling rdma_disconnect() multiple times, but
since the librdmacm provides synchronization to trap this, we might as well take
advantage of it.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
git-svn-id: svn://openib.tc.cornell.edu/gen1@2405 ad392aa1-c5ef-ae45-8dd8-e69d62a5ef86
winmad: allocate registration struct from NonPagedPool
Apparently data structures that are accessed from within MAD callbacks must be
allocated from NonPagedPool. Allocated the WM_REGISTRATION structure from non
paged pool.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
git-svn-id: svn://openib.tc.cornell.edu/gen1@2404 ad392aa1-c5ef-ae45-8dd8-e69d62a5ef86
winmad: allocate registration struct from NonPagedPool
Apparently data structures that are accessed from within MAD callbacks must be
allocated from NonPagedPool. Allocated the WM_REGISTRATION structure from non
paged pool.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
git-svn-id: svn://openib.tc.cornell.edu/gen1@2403 ad392aa1-c5ef-ae45-8dd8-e69d62a5ef86
[WinOF] Shutdown NetworkDirect and Winsock direct before DIFxApp removes devices. Makes sure no lingering device references are on the IB stack which would prevent components from being removed.
Moved ND/WSD shutdown into separate CustomAction called before MsiProcessDevices.
[DAPL2] udapl/scm: convert error code into dapl error code
Intel MPI checks the uDAPL error code when calling dat_psp_create() to see if
the port number that it provides is in use or not. Convert winsock error codes
to unix errno values.
This fixes the following error reported by Intel MPI:
'DAPL provider is not found and fallback device is not enabled'
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
git-svn-id: svn://openib.tc.cornell.edu/gen1@2401 ad392aa1-c5ef-ae45-8dd8-e69d62a5ef86
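A minimal sketch of the kind of winsock-to-errno conversion described in the commit above. The numeric Winsock values are redefined locally (with an SK_ prefix) so the sketch compiles without winsock2.h; the function name, the set of codes covered, and the EIO fallback are assumptions, not the actual DAPL mapping.

```c
#include <errno.h>

/* Winsock error values, redefined locally so this compiles off Windows. */
enum {
    SK_WSAEADDRINUSE   = 10048,
    SK_WSAETIMEDOUT    = 10060,
    SK_WSAECONNREFUSED = 10061
};

/* Hypothetical converter: map a Winsock error code to a unix errno
 * value. Unknown codes fall back to EIO (an arbitrary choice here). */
static int wsa_to_errno(int wsa_err)
{
    switch (wsa_err) {
    case 0:
        return 0;
    case SK_WSAEADDRINUSE:
        return EADDRINUSE;
    case SK_WSAETIMEDOUT:
        return ETIMEDOUT;
    case SK_WSAECONNREFUSED:
        return ECONNREFUSED;
    default:
        return EIO;
    }
}
```

A caller that probes ports can then compare the converted value against EADDRINUSE and move on to the next port, rather than treating the failure as a missing provider.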
[WINMAD] winmad: allocate registration struct from NonPagedPool.
Apparently data structures that are accessed from within MAD callbacks must be
allocated from NonPagedPool. Allocated the WM_REGISTRATION structure from non
paged pool.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
git-svn-id: svn://openib.tc.cornell.edu/gen1@2400 ad392aa1-c5ef-ae45-8dd8-e69d62a5ef86
stansmith [Sat, 29 Aug 2009 01:03:54 +0000 (01:03 +0000)]
[WinOF] Add WinOF to Command Window name to distinguish it from other Command Windows as Svr 2008 likes to add recently used commands to the start menu.
signed off by stan.smith@intel.com
stansmith [Sat, 29 Aug 2009 01:01:14 +0000 (01:01 +0000)]
[WINVERBS] Should have been part of Revision: 2391; DllMain is called multiple times for a given process. Prevent double initialization of critical sections by only initializing it during process attach. This avoids corrupting the critical section while it may be in use. Signed-off-by: Sean Hefty <sean.hefty@intel.com>
git-svn-id: svn://openib.tc.cornell.edu/gen1@2397 ad392aa1-c5ef-ae45-8dd8-e69d62a5ef86
stansmith [Sat, 29 Aug 2009 00:58:34 +0000 (00:58 +0000)]
[WINVERBS] winverbs: fix race in async connect handling. If an application calls Connect or Accept, their IRP is queued to a work queue for asynchronous processing. However, if the application crashes or exits before the work queue can process the IRP, the cleanup code will call WvEpFree(). This destroys the IbCmId. When the work queue finally runs, it can access a freed IbCmId. This is bad. A similar race exists with the QP and the asynchronous disconnect processing. The disconnect processing can access the hVerbsQp handle after it has been destroyed.
Additionally, in all three cases, the IRPs assume that the WV provider is able to process IRPs. Specifically, they require that the index tables maintained by the provider are still valid. References must be held on the WV provider until the IRPs finish their processing to ensure this.
Fix invalid accesses to the IbCmId and hVerbsQp handles by locking around their use after valid state checks. In the case of the QP, we add a guarded mutex for synchronization purposes and use that in place where the PD mutex had been used. Signed-off-by: Sean Hefty <sean.hefty@intel.com>
git-svn-id: svn://openib.tc.cornell.edu/gen1@2396 ad392aa1-c5ef-ae45-8dd8-e69d62a5ef86
stansmith [Sat, 29 Aug 2009 00:52:02 +0000 (00:52 +0000)]
[WINVERBS] To help match memory allocations with free, replace ExFreePool with ExFreePoolWithTag. Signed-off-by: Sean Hefty <sean.hefty@intel.com>
git-svn-id: svn://openib.tc.cornell.edu/gen1@2395 ad392aa1-c5ef-ae45-8dd8-e69d62a5ef86
stansmith [Sat, 29 Aug 2009 00:44:58 +0000 (00:44 +0000)]
[WINVERBS] Endpoints are not maintained in a list associated with a provider. The list entry for an endpoint is used to track connection requests with listens. When an endpoint is unassociated from a listen, it is removed from the listen list. Trying to remove it from a list during provider cleanup results in a duplicate removal, can corrupt the listen list, and may access freed memory. Signed-off-by: Sean Hefty <sean.hefty@intel.com>
git-svn-id: svn://openib.tc.cornell.edu/gen1@2394 ad392aa1-c5ef-ae45-8dd8-e69d62a5ef86
stansmith [Sat, 29 Aug 2009 00:42:29 +0000 (00:42 +0000)]
[WINVERBS] The winverbs PD structure contains both an event and a guarded mutex. Both must be allocated as part of resident memory, or vague system corruptions may occur if their memory is paged out. The fix is to allocate the PD structure from NonPagedPool. Signed-off-by: Sean Hefty <sean.hefty@intel.com>
git-svn-id: svn://openib.tc.cornell.edu/gen1@2393 ad392aa1-c5ef-ae45-8dd8-e69d62a5ef86
stansmith [Sat, 29 Aug 2009 00:39:47 +0000 (00:39 +0000)]
[WINVERBS] Fix a memory leak. We need to free the port array, which is allocated separately from the device structure. Signed-off-by: Sean Hefty <sean.hefty@intel.com>
git-svn-id: svn://openib.tc.cornell.edu/gen1@2392 ad392aa1-c5ef-ae45-8dd8-e69d62a5ef86
stansmith [Sat, 29 Aug 2009 00:37:08 +0000 (00:37 +0000)]
[WINVERBS] DllMain is called multiple times for a given process. Prevent double initialization of critical sections by only initializing it during process attach. This avoids corrupting the critical section while it may be in use. Signed-off-by: Sean Hefty <sean.hefty@intel.com>
git-svn-id: svn://openib.tc.cornell.edu/gen1@2391 ad392aa1-c5ef-ae45-8dd8-e69d62a5ef86
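The DllMain fix above can be sketched in portable C as follows. The reason codes mirror the values in the Windows headers but are redefined locally with an SK_ prefix, and a counter stands in for the critical-section initialization; sketch_dll_main and the counters are hypothetical names, not the winverbs source.

```c
/* DllMain reason codes, redefined locally so this compiles off Windows. */
enum {
    SK_DLL_PROCESS_DETACH = 0,
    SK_DLL_PROCESS_ATTACH = 1,
    SK_DLL_THREAD_ATTACH  = 2,
    SK_DLL_THREAD_DETACH  = 3
};

static int lock_init_count;  /* times the stand-in lock was initialized */
static int lock_live;        /* nonzero while the stand-in lock exists */

/* Hypothetical entry point: the lock is set up on process attach only.
 * Thread attach/detach notifications must not touch it, since it may
 * already be in use by other threads. */
static int sketch_dll_main(int reason)
{
    switch (reason) {
    case SK_DLL_PROCESS_ATTACH:
        lock_init_count++;   /* InitializeCriticalSection would go here */
        lock_live = 1;
        break;
    case SK_DLL_PROCESS_DETACH:
        lock_live = 0;       /* DeleteCriticalSection would go here */
        break;
    default:
        break;               /* thread attach/detach: nothing to do */
    }
    return 1;
}
```

However many thread notifications arrive between process attach and process detach, the lock is initialized exactly once.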