Bug 4382: "Detect/document ucx-conduit native kinds when UCX build lacks GDR support" #10

Draft
PHHargrove wants to merge 1 commit into BerkeleyLab:develop from PHHargrove:bug4382-ucx-gdr-detect

Conversation

@PHHargrove
Collaborator

This PR is "DRAFT" pending at least:

  1. A UCX release in which memory kinds function correctly and perform at least comparably to ibv-conduit on the same hardware. Without both correctness and performance, we should not recommend use of UCX native kinds.
  2. Completion of the "document" aspect of the bug report, which is not yet present.

TO DO:

This work has only been tested with CUDA. HIP (on both NVIDIA and AMD hardware) and ZE kinds support must be tested before this can be considered ready for merge.


ucx: check memory kinds support at MK_Create time

This commit resolves the "detect" aspect of Bug 4382: "Detect/document
ucx-conduit native kinds when UCX build lacks GDR support".

As agreed in the bug report, `gex_MK_Create` will now return a
non-fatal `GASNET_ERR_BAD_ARG`.  A well-written client can potentially
recover by falling back to some form of non-kinds communication.  In the
future, however, GASNet-EX would ideally fall back to a reference
implementation which would "do the right thing" transparently.
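
The recovery pattern described above might look roughly like the following
sketch. It is deliberately self-contained and does not include any GASNet-EX
headers: `demo_mk_create`, `DEMO_OK`, `DEMO_ERR_BAD_ARG`, `xfer_path_t` and
`demo_choose_path` are hypothetical stand-ins invented for illustration, and
"staging through host memory" is only one assumed form that a non-kinds
fallback could take.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Stand-ins for this sketch only; a real client would use the GASNet-EX
 * definitions instead. The numeric values here are arbitrary. */
enum { DEMO_OK = 0, DEMO_ERR_BAD_ARG = 1 };

/* Which communication path the client ends up using (hypothetical). */
typedef enum { PATH_DEVICE_KIND, PATH_HOST_STAGING } xfer_path_t;

/* Hypothetical stub standing in for a call such as
 *   gex_MK_Create(&kind, myclient, &args, 0)
 * It returns the non-fatal error when the (simulated) UCX library
 * lacks support for the requested device memory type. */
static int demo_mk_create(bool ucx_has_device_support) {
    return ucx_has_device_support ? DEMO_OK : DEMO_ERR_BAD_ARG;
}

/* Attempt to create the device memory kind. On the non-fatal error,
 * fall back to a non-kinds path instead of aborting the job. */
static xfer_path_t demo_choose_path(bool ucx_has_device_support) {
    int rc = demo_mk_create(ucx_has_device_support);
    if (rc == DEMO_OK) {
        return PATH_DEVICE_KIND;
    }
    /* The stub can only fail in this one way. */
    assert(rc == DEMO_ERR_BAD_ARG);
    fprintf(stderr, "memory kind unavailable (rc=%d); using host staging\n", rc);
    return PATH_HOST_STAGING;
}
```

A client following this pattern would make the choice once at startup and
then select its transfer routine accordingly, rather than treating the
failed kind creation as fatal.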

Example output from process 0 in a run of `testcudauva` on a system
with CUDA GPUs, but lacking CUDA support in the UCX library:

```
*** WARNING (proc 0): GASNet gasnetc_mk_create_hook returning an error code: GASNET_ERR_BAD_ARG (Invalid function parameter passed)
  at /[REDACTED]/gasnet/ucx-conduit/gasnet_kinds.c:93
  reason: Requested device memory type is not supported in the UCX library
ERROR calling: gex_MK_Create(&kind, myclient, &args, 0)
 at: /[REDACTED]/gasnet/tests/testcudauva.c:203
 error: GASNET_ERR_BAD_ARG (Invalid function parameter passed)
```
@PHHargrove
Collaborator Author

See https://bitbucket.org/berkeleylab/gasnet/pull-requests/517 for prior history of this PR.
