Distinguish between GPU-aware MPI tracing by abagusetty · Pull Request #504 · argonne-lcf/THAPI

abagusetty · 2026-06-03T22:01:38Z

Fixes #495

Sample output:

BACKEND_MPI | 1 Hostnames | 12 Processes | 12 Threads | 

         Name |     Time | Time(%) | Calls |  Average |      Min |      Max |         
     MPI_Init |    5.81s |  95.72% |    12 | 484.22ms | 381.20ms | 689.20ms |         
 MPI_Finalize | 164.94ms |   2.72% |    12 |  13.74ms |  12.27ms |  15.31ms |         
*MPI_Sendrecv |  68.77ms |   1.13% |    24 |   2.87ms |   1.01ms |   4.19ms |         
  MPI_Barrier |  24.17ms |   0.40% |    36 | 671.36us |  32.31us |   2.25ms |         
   MPI_Reduce |   1.61ms |   0.03% |    12 | 133.85us |  96.51us | 197.54us |         
 MPI_Sendrecv | 318.50us |   0.01% |    12 |  26.54us |  23.40us |  31.27us |         
MPI_Comm_size |  33.61us |   0.00% |    12 |   2.80us |    952ns |   6.59us |         
MPI_Comm_rank |  11.97us |   0.00% |    12 | 997.33ns |    869ns |   1.12us |         
        Total |    6.07s | 100.00% |   132 |

Introspect L0/CUDA/HIP pointers at runtime (via dlopen, no build deps) for every buffer-bearing MPI call and emit a buffer_info tracepoint. The interval consumer prefixes the call name with '*' when any buffer resides in device or shared/managed memory.

Avoid creating an L0 context inside MPI_Init's pthread_once. The application's runtime setup (zeInit, ZE_AFFINITY_MASK, etc.) is allowed to complete before we touch the GPU stack.

Make pointer introspection a silent default. Removes the only new getenv on the branch — the existing LTTNG_UST_MPI_VERBOSE no longer gates anything we added.

metababel strips qualifiers when emitting the callback typedef, so the 'const void *' field in mpi_events.yaml becomes 'void *' in the generated lttng_ust_mpi_properties_buffer_info_callback_f signature. Match it in the callback.

The YAML top-level key is the Ruby symbol :meta_parameters (serialised as ':meta_parameters:'), so fetch('meta_parameters', ...) returns the empty default. Switch to fetch(:meta_parameters, {}) so prologues are actually registered for all 145 buffer-bearing MPI calls. Without this, _dump_buffer_info was unreferenced in the generated tracer_mpi.c and the build failed under -Werror.
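The string-versus-symbol mismatch is easy to reproduce in isolation. The following is a minimal sketch, not the code in mpi_model.rb: the YAML body (one MPI call with one buffer field) is invented for illustration, and only the ':meta_parameters:' key spelling is taken from the description above.

```ruby
require 'yaml'

# Hypothetical document: the top-level key was serialised from a Ruby symbol,
# so it appears as ':meta_parameters:'. The nested content is made up.
doc = <<~DOC
  :meta_parameters:
    MPI_Send:
      - buf
DOC

model = YAML.load(doc)

# The key is the Symbol :meta_parameters, not the String 'meta_parameters',
# so the string lookup silently falls back to the default.
by_string = model.fetch('meta_parameters', {})
by_symbol = model.fetch(:meta_parameters, {})

puts "string key: #{by_string.inspect}"
puts "symbol key: #{by_symbol.inspect}"
```

Because fetch is given a default, the wrong key type does not raise; it just yields an empty hash, which is why the failure only showed up later as an unreferenced helper under -Werror.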

iprof manages its own LTTng session and ignores tracer_mpi.sh, so the enable-event line previously added to tracer_mpi.sh.in never took effect under iprof. Add the matching line in enable_events_mpi() so buffer_info events make it into the trace under both invocation paths.

- clang-format-18 on btx_mpiinterval_callbacks.cpp and tracer_mpi_helpers.include.c (strip vertical alignment, collapse multi-statement switch cases).
- yamlfmt on mpi_events.yaml (drop --- marker, strip inner spaces from flow sequences).
- rubocop: 'Note:' -> 'NOTE:' in mpi_model.rb (Style/CommentAnnotation).

Per review (Thomas Applencourt): the MPI tracer should not call into the GPU runtime to classify pointers. Specifically:

- calling zeInit can break apps that rely on zesInit being first (deprecated API + sysman interaction); THAPI takes care to never call it,
- dlopen('libze_loader.so') from inside an LD_PRELOAD'd libmpi.so actually opens the ze backend's own interceptor wrapper, so every zeMemGetAllocProperties query produces a spurious ze event and self-traces,
- querying zeMemGetAllocProperties with a private context returns correct results only by accident on the current Intel L0 loader.

Revert all producer-side changes (tracer_mpi_helpers.include.c, mpi_events.yaml, mpi_model.rb prologue registration, gen_mpi.rb includes, Makefile.am probe entry, tracer_mpi.sh.in / xprof.rb.in enable-event lines, mpi.h.include). Classification will be done post-mortem in btx_mpiinterval_callbacks.cpp using the existing zeMemAlloc*/zeMemFree events emitted by the ze backend, mirroring the rangeset_memory_{host,device,shared} pattern in btx_zeinterval_callbacks.cpp.

Implements Thomas's review suggestion: instead of querying the GPU runtime from the MPI tracer, classify MPI buffer pointers post-mortem in btx_mpiinterval_callbacks.cpp by joining the ze backend's zeMemAlloc/zeMemFree events into the MPI filter's upstream model.

Changes:

backends/mpi/Makefile.am: Union backends/ze/btx_ze_model.yaml into the MPI filter's -u so metababel routes ze events to btx_filter_mpi. Adds a build-order dependency mpi -> ze.

backends/mpi/btx_mpimatching_model.yaml:
- Anchor the existing entries/exits categories to the lttng_ust_mpi provider so ze events do not accidentally enter them.
- Add mpi_1buf_entry / mpi_2buf_entry categories that capture payload buffer pointer(s) of buffer-bearing MPI calls.
- Add ze_alloc_entry / ze_alloc_exit / ze_free_entry categories that capture ze USM allocation events.

backends/mpi/btx_mpiinterval_callbacks.cpp:
- Maintain per-process address-range maps rangeset_memory_device and rangeset_memory_shared, mirroring the pattern in backends/ze/btx_zeinterval_callbacks.cpp.
- Populate the maps from ze_alloc_*/ze_free_entry callbacks (host USM is intentionally not classified as GPU-aware).
- On every buffer-bearing MPI entry event, look up each buffer pointer; if it falls in a tracked GPU range, record the call's {hostname,vpid,vtid} in gpu_aware_calls.
- In send_host_message, prepend '*' to the MPI call name when the flag is set, so the tally aggregates GPU-aware and CPU-only variants as distinct rows.

When the user runs --backend mpi without --backend ze, no ze events appear in the trace, the range-map stays empty, and the output is unchanged from upstream (no '*' ever appears). No producer-side changes; the MPI tracer hot path is unaffected.
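The post-mortem classification described above can be sketched roughly as follows. This is an illustration only: the real consumer is C++ in btx_mpiinterval_callbacks.cpp, and the RangeSet class, display_name helper, and all addresses here are hypothetical names and values chosen for the example.

```ruby
# Hypothetical stand-in for one per-process range map (device or shared).
class RangeSet
  def initialize
    @ranges = {} # allocation start address => size in bytes
  end

  # Called for an allocation event; a NULL pointer means the allocation failed.
  def add(ptr, size)
    return if ptr.nil? || ptr.zero?

    @ranges[ptr] = size
  end

  # Called for a free event.
  def remove(ptr)
    @ranges.delete(ptr)
  end

  # True when addr falls inside any live allocation [start, start + size).
  def cover?(addr)
    @ranges.any? { |start, size| addr >= start && addr < start + size }
  end
end

# Prefix the call name with '*' when any buffer lies in a tracked GPU range.
def display_name(name, buffers, rangesets)
  gpu_aware = buffers.any? { |buf| rangesets.any? { |rs| rs.cover?(buf) } }
  gpu_aware ? "*#{name}" : name
end

device = RangeSet.new
shared = RangeSet.new
device.add(0x1000, 256) # simulated device allocation
device.add(0, 64)       # failed allocation, ignored

tagged = display_name('MPI_Sendrecv', [0x1010, 0x9000], [device, shared])
plain = display_name('MPI_Sendrecv', [0x8000, 0x9000], [device, shared])

device.remove(0x1000) # simulated free
after_free = display_name('MPI_Sendrecv', [0x1010, 0x9000], [device, shared])

puts tagged, plain, after_free
```

With empty range sets, which is the situation when no ze events are in the trace, display_name always returns the unprefixed name, matching the "output is unchanged from upstream" behaviour stated above.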

metababel's bare `-u model_a,model_b` upstream union fails when both models declare the same environment field, raising:

    Match expression '^hostname$' must match only one member, '2' matched ["hostname", "hostname"]

Every backend model generated by utils/gen_babeltrace_model_helper.rb declares :environment: :entries: [{name: hostname, type: string}], so any two backend models clash. metababel itself owns the rule and isn't ours to change.

Side-step it: generate a single merged upstream model (btx_mpi_with_ze_model.yaml) that starts from btx_mpi_model.yaml and copies only the ze allocation/free event classes we actually need into MPI's stream class. One environment declaration, one stream class, no duplicates.

- backends/mpi/gen_btx_mpi_with_ze_model.rb: the generator.
- backends/mpi/Makefile.am: builds the merged YAML, feeds it to metababel as the sole `-u` source.
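A minimal sketch of the merge idea, working on in-memory hashes rather than YAML files. This is not the contents of gen_btx_mpi_with_ze_model.rb: the merge_models helper, the key layout (:environment, :stream_classes, :event_classes), and the event names are assumptions made for illustration.

```ruby
# Start from the MPI model and append only the wanted ze event classes to
# MPI's single stream class, leaving one environment declaration.
def merge_models(mpi_model, ze_model, wanted)
  merged = Marshal.load(Marshal.dump(mpi_model)) # deep copy, input untouched
  ze_events = ze_model[:stream_classes].flat_map { |sc| sc[:event_classes] }
  picked = ze_events.select { |ec| wanted.any? { |re| re.match?(ec[:name]) } }
  merged[:stream_classes].first[:event_classes].concat(Marshal.load(Marshal.dump(picked)))
  merged
end

# Both models declare the same environment entry, which is what makes a
# plain union ambiguous.
env = { entries: [{ name: 'hostname', type: 'string' }] }

mpi_model = {
  environment: env,
  stream_classes: [{ name: 'mpi', event_classes: [{ name: 'lttng_ust_mpi:MPI_Send_entry' }] }]
}

ze_model = {
  environment: env,
  stream_classes: [{
    name: 'ze',
    event_classes: [
      { name: 'lttng_ust_ze:zeMemAllocDevice_entry' },
      { name: 'lttng_ust_ze:zeMemAllocDevice_exit' },
      { name: 'lttng_ust_ze:zeMemFree_entry' },
      { name: 'lttng_ust_ze:zeCommandListCreate_entry' }
    ]
  }]
}

# Hypothetical selection: allocation entry/exit and free entry only.
wanted = [/zeMemAlloc(Device|Shared)_(entry|exit)$/, /zeMemFree_entry$/]

merged = merge_models(mpi_model, ze_model, wanted)
names = merged[:stream_classes].first[:event_classes].map { |ec| ec[:name] }
puts names
```

The merged result keeps a single stream class and a single hostname entry, so a '^hostname$' match expression has exactly one candidate.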

Two matching failures surfaced when metababel walked the merged upstream model:

1. MPI buffer parameters are declared with two different cast_types in btx_mpi_model.yaml: 'const void *' for input buffers (sendbuf, buf on _Send variants, ...) and plain 'void *' for output buffers (recvbuf, buf on _Recv variants, ...). The matching model required 'const void *' on both, so mpi_2buf_entry matched zero events: 'No event matched mpi_2buf_entry, at least one matching event required.' Relax to '^(const )?void \\*$' on both buffer slots.
2. zeResult is typed ze_result_t in btx_ze_model.yaml, so matching against '^int$' fails. Using ze_result_t directly would force btx_mpiinterval_callbacks.cpp to include ze_api.h (a heavy build dependency on the MPI consumer just to spell an enum). Drop zeResult from the match entirely; a failed allocation has pptr_val == NULL, which the callback already guards against.
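The effect of relaxing the cast_type regex can be checked directly. A small sketch, assuming the two cast_type spellings quoted above; the STRICT and RELAXED constant names and the field-to-type hash are illustrative only.

```ruby
# The original pattern and the relaxed one from the fix. In the YAML model the
# backslash is doubled; in a Ruby regex literal a single backslash is enough.
STRICT = /^const void \*$/
RELAXED = /^(const )?void \*$/

# Hypothetical two-buffer event: input buffer is const, output buffer is not.
cast_types = { 'sendbuf' => 'const void *', 'recvbuf' => 'void *' }

strict_ok = cast_types.values.all? { |t| STRICT.match?(t) }
relaxed_ok = cast_types.values.all? { |t| RELAXED.match?(t) }

puts "strict matches both slots:  #{strict_ok}"
puts "relaxed matches both slots: #{relaxed_ok}"
```

The relaxed pattern is still anchored at both ends, so it accepts only the two intended spellings and rejects other pointer types such as 'int *'.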

metababel's matching engine requires each :name: regex to bind to EXACTLY ONE field per matched event, so the broad regex ^(sendbuf|buf|buffer|inbuf|origin_addr|inoutbuf)$ fails on MPI_Reduce_local (has both inbuf and inoutbuf) with:

    Match expression '...' must match only one member, '2' matched ["inbuf", "inoutbuf"]

Enumerate one category per distinct MPI buffer-shape so member regexes never overlap within an event:

- mpi_cas_entry : origin_addr + compare_addr + result_addr (MPI_Compare_and_swap)
- mpi_sendrecv_entry : sendbuf + recvbuf (Sendrecv, Allreduce, Reduce, Gather, ...)
- mpi_reduce_local_entry : inbuf + inoutbuf (MPI_Reduce_local)
- mpi_rma_fetch_entry : origin_addr + result_addr (MPI_Get_accumulate, MPI_Fetch_and_op)
- mpi_1buf_entry : everything else with a single buffer field

Each category subtracts the prior set_ids from its domain so an event falls into exactly one bucket. Register one consumer-side callback per shape; all delegate to the same tag_if_gpu helper.
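The one-member rule can be modelled in a few lines. This is a simplified stand-in for the behaviour described above, not metababel's implementation: bind_one is a hypothetical helper, and the field list is the parameter list of MPI_Reduce_local.

```ruby
# Bind a member regex to an event's field names. Mirrors the rule that a
# regex may match at most one member; more than one is an error.
def bind_one(regex, fields)
  hits = fields.grep(regex)
  if hits.size > 1
    raise ArgumentError,
          "Match expression '#{regex.source}' must match only one member, " \
          "'#{hits.size}' matched #{hits.inspect}"
  end
  hits.first
end

fields = %w[inbuf inoutbuf count datatype op]

# The broad pattern hits both inbuf and inoutbuf on this event.
broad = /^(sendbuf|buf|buffer|inbuf|origin_addr|inoutbuf)$/
error = begin
  bind_one(broad, fields)
  nil
rescue ArgumentError => e
  e.message
end

# One narrow pattern per slot, as in a dedicated per-shape category.
bound = [bind_one(/^inbuf$/, fields), bind_one(/^inoutbuf$/, fields)]

puts error
puts bound.inspect
```

Splitting the match into one narrow regex per buffer slot is what lets each shape-specific category bind unambiguously.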

After enumerating all MPI calls and excluding the 4 multi-buffer categories, every remaining buffer-bearing MPI call carries exactly one of {buf, buffer, origin_addr}. Tightening the 1buf regex to ^(buf|buffer|origin_addr)$ removes the last source of in-event ambiguity for metababel. Pack/Unpack/Pack_external/Unpack_external are intentionally NOT classified: they carry inbuf+outbuf but the buffers never traverse the network, so GPU-aware tagging would be semantically meaningless.

Two issues surfaced when the consumer compiled:

1. metababel emitted callback typedefs for the full ze allocation events (ze_context_handle_t, ze_device_mem_alloc_desc_t *, ze_result_t, ...). libMPIInterval is built with only mpi.h visible, so btx_upstream.h failed to compile with 'unknown type name ze_context_handle_t' (and friends). Teach gen_btx_mpi_with_ze_model.rb to keep only the payload fields the MPI consumer actually uses (size, pptr_val, ptr). After stripping, every generated typedef references only size_t / void *, which mpi.h transitively pulls in.
2. metababel strips qualifiers when generating callback typedefs, so the matching model's 'const void *' becomes 'void *' in the generated mpi_*_entry_callback_f signatures. Drop 'const' from each mpi_*_entry_callback parameter so the signatures match. (Same fix shape as the earlier buffer_info_callback fix.)

Also replace filter_map with map.compact for Ruby <= 2.6 compatibility, even though Aurora's Ruby (3.3) supports filter_map.
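The field stripping and the filter_map replacement can be sketched together. This is an illustration under assumptions: strip_payload, the KEEP list constant, and the payload layout (:name, :cast_type, :note) are hypothetical, and the sample event only loosely follows zeMemAllocDevice's parameters.

```ruby
# Payload fields the MPI consumer needs, per the description above.
KEEP = %w[size pptr_val ptr].freeze

# Keep only the wanted members and reduce each to name + cast_type.
# map { ... }.compact is equivalent to filter_map here; filter_map was added
# in Ruby 2.7, so this form also runs on older interpreters.
def strip_payload(event_class)
  kept = event_class[:payload].map do |member|
    next unless KEEP.include?(member[:name])

    { name: member[:name], cast_type: member[:cast_type] }
  end.compact
  event_class.merge(payload: kept)
end

# Hypothetical event class with ze-specific types mixed in.
alloc_entry = {
  name: 'lttng_ust_ze:zeMemAllocDevice_entry',
  payload: [
    { name: 'hContext', cast_type: 'ze_context_handle_t', note: 'handle' },
    { name: 'device_desc', cast_type: 'ze_device_mem_alloc_desc_t *', note: 'pointer' },
    { name: 'size', cast_type: 'size_t', note: 'integer' },
    { name: 'hDevice', cast_type: 'ze_device_handle_t', note: 'handle' }
  ]
}

stripped = strip_payload(alloc_entry)
remaining_types = stripped[:payload].map { |m| m[:cast_type] }
puts remaining_types.inspect
```

After stripping, no ze_* type names remain in the payload, which is the property that lets the generated header compile without ze_api.h.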

abagusetty added 6 commits June 3, 2026 12:51

mpi: defer GPU introspection loader to first buffer query

6daf724

mpi: drop GPU introspection diagnostic prints

3a48873

abagusetty requested a review from TApplencourt June 3, 2026 22:02

abagusetty marked this pull request as draft June 3, 2026 22:11

abagusetty added 7 commits June 3, 2026 17:23

abagusetty wants to merge 14 commits into argonne-lcf:devel from abagusetty:mpi-gpu-aware-tracing
