Distinguish between GPU-aware MPI tracing #504
Draft
abagusetty wants to merge 14 commits into
Introspect L0/CUDA/HIP pointers at runtime (via dlopen, no build deps) for every buffer-bearing MPI call and emit a buffer_info tracepoint. The interval consumer prefixes the call name with '*' when any buffer resides in device or shared/managed memory.
Avoid creating an L0 context inside MPI_Init's pthread_once. The application's runtime setup (zeInit, ZE_AFFINITY_MASK, etc.) is allowed to complete before we touch the GPU stack.
Make pointer introspection a silent default. This removes the only new getenv on the branch; the existing LTTNG_UST_MPI_VERBOSE no longer gates anything we added.
metababel strips qualifiers when emitting the callback typedef, so the 'const void *' field in mpi_events.yaml becomes 'void *' in the generated lttng_ust_mpi_properties_buffer_info_callback_f signature. Match it in the callback.
The YAML top-level key is the Ruby symbol :meta_parameters
(serialised as ':meta_parameters:'), so fetch('meta_parameters', ...)
returns the empty default. Switch to fetch(:meta_parameters, {}) so
prologues are actually registered for all 145 buffer-bearing MPI
calls. Without this, _dump_buffer_info was unreferenced in the
generated tracer_mpi.c and the build failed under -Werror.
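The string-versus-symbol lookup described above can be reproduced in isolation. This is a minimal sketch, assuming a trimmed-down model whose only faithful detail is the symbol-style top-level key; `MODEL_YAML`, `load_meta_parameters`, and the `MPI_Send` entry are hypothetical stand-ins, not the contents of the real model file.

```ruby
require 'yaml'

# Hypothetical stand-in for the model file. Only the ':meta_parameters:'
# spelling of the top-level key mirrors the commit message.
MODEL_YAML = <<~YAML
  :meta_parameters:
    MPI_Send:
      - buf
YAML

# Hypothetical helper: load the YAML text and fetch one top-level key,
# falling back to an empty hash exactly like fetch(key, {}) in the fix.
def load_meta_parameters(yaml_text, key)
  model = YAML.load(yaml_text)
  model.fetch(key, {})
end

# The String key is absent, so the empty default comes back silently.
p load_meta_parameters(MODEL_YAML, 'meta_parameters')
# The Symbol key is the one Psych actually produced.
p load_meta_parameters(MODEL_YAML, :meta_parameters)
```

Because `fetch` with a default never raises, the wrong key type fails silently, which is consistent with the symptom above: no prologues registered rather than an exception.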
iprof manages its own LTTng session and ignores tracer_mpi.sh, so the enable-event line previously added to tracer_mpi.sh.in never took effect under iprof. Add the matching line in enable_events_mpi() so buffer_info events make it into the trace under both invocation paths.
- clang-format-18 on btx_mpiinterval_callbacks.cpp and tracer_mpi_helpers.include.c (strip vertical alignment, collapse multi-statement switch cases).
- yamlfmt on mpi_events.yaml (drop --- marker, strip inner spaces from flow sequences).
- rubocop: 'Note:' -> 'NOTE:' in mpi_model.rb (Style/CommentAnnotation).
Per review (Thomas Applencourt): the MPI tracer should not call into
the GPU runtime to classify pointers. Specifically:
- calling zeInit can break apps that rely on zesInit being first
(deprecated API + sysman interaction); THAPI takes care to never
call it,
- dlopen('libze_loader.so') from inside an LD_PRELOAD'd libmpi.so
actually opens the ze backend's own interceptor wrapper, so every
zeMemGetAllocProperties query produces a spurious ze event and
self-traces,
- querying zeMemGetAllocProperties with a private context returns
correct results only by accident on the current Intel L0 loader.
Revert all producer-side changes (tracer_mpi_helpers.include.c,
mpi_events.yaml, mpi_model.rb prologue registration, gen_mpi.rb
includes, Makefile.am probe entry, tracer_mpi.sh.in / xprof.rb.in
enable-event lines, mpi.h.include). Classification will be done
post-mortem in btx_mpiinterval_callbacks.cpp using the existing
zeMemAlloc*/zeMemFree events emitted by the ze backend, mirroring
the rangeset_memory_{host,device,shared} pattern in
btx_zeinterval_callbacks.cpp.
Implements Thomas's review suggestion: instead of querying the GPU
runtime from the MPI tracer, classify MPI buffer pointers post-mortem
in btx_mpiinterval_callbacks.cpp by joining the ze backend's
zeMemAlloc/zeMemFree events into the MPI filter's upstream model.
Changes:
backends/mpi/Makefile.am:
Union backends/ze/btx_ze_model.yaml into the MPI filter's -u so
metababel routes ze events to btx_filter_mpi. Adds a build-order
dependency mpi -> ze.
backends/mpi/btx_mpimatching_model.yaml:
- Anchor the existing entries/exits categories to the lttng_ust_mpi
provider so ze events do not accidentally enter them.
- Add mpi_1buf_entry / mpi_2buf_entry categories that capture
payload buffer pointer(s) of buffer-bearing MPI calls.
- Add ze_alloc_entry / ze_alloc_exit / ze_free_entry categories
that capture ze USM allocation events.
backends/mpi/btx_mpiinterval_callbacks.cpp:
- Maintain per-process address-range maps rangeset_memory_device
and rangeset_memory_shared, mirroring the pattern in
backends/ze/btx_zeinterval_callbacks.cpp.
- Populate the maps from ze_alloc_*/ze_free_entry callbacks (host
USM is intentionally not classified as GPU-aware).
- On every buffer-bearing MPI entry event, look up each buffer
pointer; if it falls in a tracked GPU range, record the call's
{hostname,vpid,vtid} in gpu_aware_calls.
- In send_host_message, prepend '*' to the MPI call name when the
flag is set, so the tally aggregates GPU-aware and CPU-only
variants as distinct rows.
When the user runs --backend mpi without --backend ze, no ze events
appear in the trace, the range-map stays empty, and the output is
unchanged from upstream (no '*' ever appears). No producer-side
changes; the MPI tracer hot path is unaffected.
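The classification logic can be sketched independently of the consumer. The real implementation lives in C++ in btx_mpiinterval_callbacks.cpp; the Ruby below is only an illustration of the idea, written in the language of the repo's generators. `RangeSet` and `display_name` are hypothetical names, the sketch tracks a single process, and a linear scan stands in for whatever ordered lookup the consumer actually uses.

```ruby
# Hypothetical sketch of one address-range set (e.g. device or shared USM).
class RangeSet
  def initialize
    @ranges = {} # base address => size in bytes
  end

  # Record an allocation; a failed allocation reports a NULL pointer and is skipped.
  def add(base, size)
    @ranges[base] = size unless base.zero?
  end

  # Forget an allocation when the matching free event is seen.
  def remove(base)
    @ranges.delete(base)
  end

  # True when ptr falls inside any tracked [base, base + size) range.
  def include?(ptr)
    @ranges.any? { |base, size| ptr >= base && ptr < base + size }
  end
end

# Prefix the call name with '*' when any buffer lies in a tracked GPU range.
def display_name(call_name, buffers, *gpu_rangesets)
  gpu = buffers.any? { |ptr| gpu_rangesets.any? { |set| set.include?(ptr) } }
  gpu ? "*#{call_name}" : call_name
end

device = RangeSet.new
shared = RangeSet.new
device.add(0x1000, 0x100)
puts display_name('MPI_Send', [0x1080], device, shared) # inside a device range
puts display_name('MPI_Send', [0x2000], device, shared) # untracked pointer
```

With both sets empty, as in a run without the ze backend, `display_name` always returns the unmodified name, matching the behaviour described above.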
metababel's bare `-u model_a,model_b` upstream union fails when both
models declare the same environment field, raising:
Match expression '^hostname$' must match only one member,
'2' matched ["hostname", "hostname"]
Every backend model generated by utils/gen_babeltrace_model_helper.rb
declares :environment: :entries: [{name: hostname, type: string}],
so any two backend models clash. metababel itself owns the rule and
isn't ours to change.
Side-step it: generate a single merged upstream model
(btx_mpi_with_ze_model.yaml) that starts from btx_mpi_model.yaml and
copies only the ze allocation/free event classes we actually need
into MPI's stream class. One environment declaration, one stream
class, no duplicates.
- backends/mpi/gen_btx_mpi_with_ze_model.rb: the generator.
- backends/mpi/Makefile.am: builds the merged YAML, feeds it to
metababel as the sole `-u` source.
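The merge strategy can be sketched as follows. This is not the contents of gen_btx_mpi_with_ze_model.rb; the model layout (`:stream_classes`, `:event_classes`), the event names, the `merge_models` helper, and the selection regex are all assumptions made for illustration.

```ruby
# Hypothetical, heavily trimmed models. Both declare the same environment entry.
SAMPLE_MPI_MODEL = {
  environment: { entries: [{ name: 'hostname', type: 'string' }] },
  stream_classes: [
    { event_classes: [{ name: 'lttng_ust_mpi:MPI_Send_entry' }] }
  ]
}

SAMPLE_ZE_MODEL = {
  environment: { entries: [{ name: 'hostname', type: 'string' }] },
  stream_classes: [
    { event_classes: [
      { name: 'lttng_ust_ze:zeInit_entry' },
      { name: 'lttng_ust_ze:zeMemAllocDevice_entry' },
      { name: 'lttng_ust_ze:zeMemFree_entry' }
    ] }
  ]
}

# Start from the MPI model and copy only the wanted ze event classes into
# its first stream class. The ze environment is never copied, so the merged
# model keeps a single 'hostname' declaration.
def merge_models(mpi_model, ze_model, wanted)
  merged = Marshal.load(Marshal.dump(mpi_model)) # deep copy, inputs untouched
  ze_events = ze_model[:stream_classes].flat_map { |sc| sc[:event_classes] }
  picked = ze_events.select { |ev| wanted.match?(ev[:name]) }
  merged[:stream_classes].first[:event_classes].concat(picked)
  merged
end

merged = merge_models(SAMPLE_MPI_MODEL, SAMPLE_ZE_MODEL,
                      /zeMem(Alloc\w+|Free)_(entry|exit)\z/)
puts merged[:stream_classes].first[:event_classes].map { |ev| ev[:name] }
```

Copying event classes into an existing stream class, rather than concatenating whole models, is what keeps the result down to one environment declaration and one stream class.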
Two matching failures surfaced when metababel walked the merged upstream model:
1. MPI buffer parameters are declared with two different cast_types in btx_mpi_model.yaml: 'const void *' for input buffers (sendbuf, buf on _Send variants, ...) and plain 'void *' for output buffers (recvbuf, buf on _Recv variants, ...). The matching model required 'const void *' on both, so mpi_2buf_entry matched zero events: 'No event matched mpi_2buf_entry, at least one matching event required.' Relax to '^(const )?void \\*$' on both buffer slots.
2. zeResult is typed ze_result_t in btx_ze_model.yaml, so matching against '^int$' fails. Using ze_result_t directly would force btx_mpiinterval_callbacks.cpp to include ze_api.h (a heavy build dependency on the MPI consumer just to spell an enum). Drop zeResult from the match entirely; a failed allocation has pptr_val == NULL, which the callback already guards against.
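The effect of relaxing the cast_type pattern can be checked with plain Ruby regular expressions. This is a small sketch, assuming the patterns are applied to the cast_type string as-is; `STRICT_VOID_PTR`, `RELAXED_VOID_PTR`, and `matching_types` are hypothetical names, and the single backslash here is the Ruby literal form of the pattern quoted above.

```ruby
# The pattern the matching model originally required on both buffer slots.
STRICT_VOID_PTR = /^const void \*$/
# The relaxed pattern: the 'const ' qualifier becomes optional.
RELAXED_VOID_PTR = /^(const )?void \*$/

# Hypothetical helper: return the cast_types a pattern accepts.
def matching_types(pattern, cast_types)
  cast_types.select { |type| pattern.match?(type) }
end

cast_types = ['const void *', 'void *', 'const int *']
p matching_types(STRICT_VOID_PTR, cast_types)  # output buffers are rejected
p matching_types(RELAXED_VOID_PTR, cast_types) # both buffer flavours accepted
```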
metababel's matching engine requires each :name: regex to bind to
EXACTLY ONE field per matched event, so the broad regex
^(sendbuf|buf|buffer|inbuf|origin_addr|inoutbuf)$
fails on MPI_Reduce_local (has both inbuf and inoutbuf) with:
Match expression '...' must match only one member,
'2' matched ["inbuf", "inoutbuf"]
Enumerate one category per distinct MPI buffer-shape so member regexes
never overlap within an event:
- mpi_cas_entry : origin_addr + compare_addr + result_addr
(MPI_Compare_and_swap)
- mpi_sendrecv_entry : sendbuf + recvbuf
(Sendrecv, Allreduce, Reduce, Gather, ...)
- mpi_reduce_local_entry : inbuf + inoutbuf
(MPI_Reduce_local)
- mpi_rma_fetch_entry : origin_addr + result_addr
(MPI_Get_accumulate, MPI_Fetch_and_op)
- mpi_1buf_entry : everything else with a single buffer field
Each category subtracts the prior set_ids from its domain so an event
falls into exactly one bucket. Register one consumer-side callback per
shape; all delegate to the same tag_if_gpu helper.
After enumerating all MPI calls and excluding the 4 multi-buffer
categories, every remaining buffer-bearing MPI call carries exactly
one of {buf, buffer, origin_addr}. Tightening the 1buf regex to
^(buf|buffer|origin_addr)$ removes the last source of in-event
ambiguity for metababel.
Pack/Unpack/Pack_external/Unpack_external are intentionally NOT
classified: they carry inbuf+outbuf but the buffers never traverse
the network, so GPU-aware tagging would be semantically meaningless.
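The bucketing rules above can be sketched as a first-match-wins classifier. This imitates the "exactly one member" constraint for illustration only and is not metababel's matching engine; `bind_one`, `BUFFER_SHAPES`, `ONE_BUF`, and `classify` are hypothetical names, and checking shapes in declaration order stands in for subtracting prior set_ids.

```ruby
# Imitation of the rule that a name regex may bind to at most one field.
def bind_one(pattern, field_names)
  hits = field_names.grep(pattern)
  if hits.size > 1
    raise ArgumentError, "must match only one member, '#{hits.size}' matched #{hits}"
  end
  hits.first
end

# Multi-buffer shapes, most specific first so each event lands in one bucket.
BUFFER_SHAPES = {
  mpi_cas_entry: %w[origin_addr compare_addr result_addr],
  mpi_sendrecv_entry: %w[sendbuf recvbuf],
  mpi_reduce_local_entry: %w[inbuf inoutbuf],
  mpi_rma_fetch_entry: %w[origin_addr result_addr]
}.freeze

# Tightened single-buffer pattern from the commit message.
ONE_BUF = /^(buf|buffer|origin_addr)$/

# Return the category for an event's field names, or nil when unclassified.
def classify(field_names)
  BUFFER_SHAPES.each do |shape, members|
    return shape if (members - field_names).empty?
  end
  bind_one(ONE_BUF, field_names) ? :mpi_1buf_entry : nil
end

p classify(%w[inbuf inoutbuf count datatype op])      # MPI_Reduce_local fields
p classify(%w[buf count datatype dest tag comm])      # MPI_Send fields
p classify(%w[inbuf incount datatype outbuf outsize]) # Pack-like fields, left untagged
```

Calling `bind_one` with the original broad pattern on the MPI_Reduce_local field list raises, which mirrors the failure quoted earlier.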
Two issues surfaced when the consumer compiled:
1. metababel emitted callback typedefs for the full ze allocation events (ze_context_handle_t, ze_device_mem_alloc_desc_t *, ze_result_t, ...). libMPIInterval is built with only mpi.h visible, so btx_upstream.h failed to compile with 'unknown type name ze_context_handle_t' (and friends). Teach gen_btx_mpi_with_ze_model.rb to keep only the payload fields the MPI consumer actually uses (size, pptr_val, ptr). After stripping, every generated typedef references only size_t / void *, which mpi.h transitively pulls in.
2. metababel strips qualifiers when generating callback typedefs, so the matching model's 'const void *' becomes 'void *' in the generated mpi_*_entry_callback_f signatures. Drop 'const' from each mpi_*_entry_callback parameter so the signatures match. (Same fix shape as the earlier buffer_info_callback fix.)

Also replace filter_map with map.compact for Ruby <= 2.6 compatibility, even though Aurora's Ruby (3.3) supports filter_map.
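The payload stripping and the `map.compact` substitution can be sketched together. This is an illustration under assumptions, not the generator's code: the event layout, the field names other than size, pptr_val, and ptr, and the `strip_payload` helper are hypothetical.

```ruby
# Payload fields the MPI consumer uses, per the commit message.
KEPT_FIELDS = %w[size pptr_val ptr].freeze

# Keep only the needed payload fields. map + compact gives the same result
# as filter_map here, and also runs on Ruby versions before 2.7.
def strip_payload(event)
  payload = event[:payload].map do |field|
    { name: field[:name], cast_type: field[:cast_type] } if KEPT_FIELDS.include?(field[:name])
  end.compact
  event.merge(payload: payload)
end

# Hypothetical allocation event with ze-specific handle types in its payload.
alloc_event = {
  name: 'lttng_ust_ze:zeMemAllocDevice_entry',
  payload: [
    { name: 'hContext', cast_type: 'ze_context_handle_t' },
    { name: 'size', cast_type: 'size_t' },
    { name: 'pptr', cast_type: 'void **' }
  ]
}

stripped = strip_payload(alloc_event)
puts stripped[:payload].map { |field| field[:cast_type] }
```

After stripping, no remaining field in this sample carries a ze-specific type, which is the property the consumer build depends on.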
Fixes #495
Sample output: