Distinguish between GPU-aware MPI tracing #504

Draft
abagusetty wants to merge 14 commits into argonne-lcf:devel from abagusetty:mpi-gpu-aware-tracing

@abagusetty

Fixes #495

Sample output:

BACKEND_MPI | 1 Hostnames | 12 Processes | 12 Threads | 

         Name |     Time | Time(%) | Calls |  Average |      Min |      Max |         
     MPI_Init |    5.81s |  95.72% |    12 | 484.22ms | 381.20ms | 689.20ms |         
 MPI_Finalize | 164.94ms |   2.72% |    12 |  13.74ms |  12.27ms |  15.31ms |         
*MPI_Sendrecv |  68.77ms |   1.13% |    24 |   2.87ms |   1.01ms |   4.19ms |         
  MPI_Barrier |  24.17ms |   0.40% |    36 | 671.36us |  32.31us |   2.25ms |         
   MPI_Reduce |   1.61ms |   0.03% |    12 | 133.85us |  96.51us | 197.54us |         
 MPI_Sendrecv | 318.50us |   0.01% |    12 |  26.54us |  23.40us |  31.27us |         
MPI_Comm_size |  33.61us |   0.00% |    12 |   2.80us |    952ns |   6.59us |         
MPI_Comm_rank |  11.97us |   0.00% |    12 | 997.33ns |    869ns |   1.12us |         
        Total |    6.07s | 100.00% |   132 |                                

Introspect L0/CUDA/HIP pointers at runtime (via dlopen, no build deps)
for every buffer-bearing MPI call and emit a buffer_info tracepoint.
The interval consumer prefixes the call name with '*' when any buffer
resides in device or shared/managed memory.

Avoid creating an L0 context inside MPI_Init's pthread_once. The
application's runtime setup (zeInit, ZE_AFFINITY_MASK, etc.) is
allowed to complete before we touch the GPU stack.

Make pointer introspection a silent default. Removes the only new
getenv on the branch — the existing LTTNG_UST_MPI_VERBOSE no longer
gates anything we added.

metababel strips qualifiers when emitting the callback typedef, so the
'const void *' field in mpi_events.yaml becomes 'void *' in the
generated lttng_ust_mpi_properties_buffer_info_callback_f signature.
Match it in the callback.

The YAML top-level key is the Ruby symbol :meta_parameters
(serialised as ':meta_parameters:'), so fetch('meta_parameters', ...)
returns the empty default. Switch to fetch(:meta_parameters, {}) so
prologues are actually registered for all 145 buffer-bearing MPI
calls. Without this, _dump_buffer_info was unreferenced in the
generated tracer_mpi.c and the build failed under -Werror.
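
The key mismatch described above is easy to reproduce in isolation. The sketch below uses a made-up one-entry document (the `MPI_Send` entry and its value are illustrative, not copied from mpi_events.yaml) to show why a string lookup silently falls back to the default while the symbol lookup finds the entries.

```ruby
require 'yaml'

# Illustrative document only: the real mpi_events.yaml content is not shown here.
doc = <<~YAML
  :meta_parameters:
    MPI_Send:
      - buf
YAML

model = YAML.load(doc)

# Psych turns the ':meta_parameters' scalar into a Symbol key, so a String
# lookup misses and fetch returns its default instead of raising.
by_string = model.fetch('meta_parameters', {})
by_symbol = model.fetch(:meta_parameters, {})

puts "string key: #{by_string.inspect}"
puts "symbol key: #{by_symbol.inspect}"
```

Because `fetch` with a default never raises, the wrong key type produces no error at generation time; the problem only surfaces later, as it did here through the unreferenced helper.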

iprof manages its own LTTng session and ignores tracer_mpi.sh, so the
enable-event line previously added to tracer_mpi.sh.in never took
effect under iprof. Add the matching line in enable_events_mpi() so
buffer_info events make it into the trace under both invocation paths.

@abagusetty requested a review from TApplencourt June 3, 2026 22:02

- clang-format-18 on btx_mpiinterval_callbacks.cpp and
  tracer_mpi_helpers.include.c (strip vertical alignment, collapse
  multi-statement switch cases).
- yamlfmt on mpi_events.yaml (drop --- marker, strip inner spaces
  from flow sequences).
- rubocop: 'Note:' -> 'NOTE:' in mpi_model.rb (Style/CommentAnnotation).

@abagusetty marked this pull request as draft June 3, 2026 22:11

Per review (Thomas Applencourt): the MPI tracer should not call into
the GPU runtime to classify pointers. Specifically:

  - calling zeInit can break apps that rely on zesInit being first
    (deprecated API + sysman interaction); THAPI takes care to never
    call it,
  - dlopen('libze_loader.so') from inside an LD_PRELOAD'd libmpi.so
    actually opens the ze backend's own interceptor wrapper, so every
    zeMemGetAllocProperties query produces a spurious ze event and
    self-traces,
  - querying zeMemGetAllocProperties with a private context returns
    correct results only by accident on the current Intel L0 loader.

Revert all producer-side changes (tracer_mpi_helpers.include.c,
mpi_events.yaml, mpi_model.rb prologue registration, gen_mpi.rb
includes, Makefile.am probe entry, tracer_mpi.sh.in / xprof.rb.in
enable-event lines, mpi.h.include). Classification will be done
post-mortem in btx_mpiinterval_callbacks.cpp using the existing
zeMemAlloc*/zeMemFree events emitted by the ze backend, mirroring
the rangeset_memory_{host,device,shared} pattern in
btx_zeinterval_callbacks.cpp.

Implements Thomas's review suggestion: instead of querying the GPU
runtime from the MPI tracer, classify MPI buffer pointers post-mortem
in btx_mpiinterval_callbacks.cpp by joining the ze backend's
zeMemAlloc/zeMemFree events into the MPI filter's upstream model.

Changes:

  backends/mpi/Makefile.am:
    Union backends/ze/btx_ze_model.yaml into the MPI filter's -u so
    metababel routes ze events to btx_filter_mpi. Adds a build-order
    dependency mpi -> ze.

  backends/mpi/btx_mpimatching_model.yaml:
    - Anchor the existing entries/exits categories to the lttng_ust_mpi
      provider so ze events do not accidentally enter them.
    - Add mpi_1buf_entry / mpi_2buf_entry categories that capture
      payload buffer pointer(s) of buffer-bearing MPI calls.
    - Add ze_alloc_entry / ze_alloc_exit / ze_free_entry categories
      that capture ze USM allocation events.

  backends/mpi/btx_mpiinterval_callbacks.cpp:
    - Maintain per-process address-range maps rangeset_memory_device
      and rangeset_memory_shared, mirroring the pattern in
      backends/ze/btx_zeinterval_callbacks.cpp.
    - Populate the maps from ze_alloc_*/ze_free_entry callbacks (host
      USM is intentionally not classified as GPU-aware).
    - On every buffer-bearing MPI entry event, look up each buffer
      pointer; if it falls in a tracked GPU range, record the call's
      {hostname,vpid,vtid} in gpu_aware_calls.
    - In send_host_message, prepend '*' to the MPI call name when the
      flag is set, so the tally aggregates GPU-aware and CPU-only
      variants as distinct rows.

When the user runs --backend mpi without --backend ze, no ze events
appear in the trace, the range-map stays empty, and the output is
unchanged from upstream (no '*' ever appears). No producer-side
changes; the MPI tracer hot path is unaffected.
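
The consumer-side bookkeeping described above can be modeled in a few lines. This is a Ruby sketch of the idea, not the C++ in btx_mpiinterval_callbacks.cpp: the `RangeSet` class, the `tag_name` helper, the process key and all addresses are hypothetical, and allocations are assumed to be half-open ranges `[base, base + size)`.

```ruby
# Hypothetical model of a rangeset: base address => allocation size in bytes.
class RangeSet
  def initialize
    @ranges = {}
  end

  # Record a successful allocation; NULL or zero-sized results are ignored.
  def add(base, size)
    @ranges[base] = size unless base.nil? || base.zero? || size.zero?
  end

  # Forget an allocation when it is freed.
  def remove(base)
    @ranges.delete(base)
  end

  # True when ptr lies in any tracked half-open range [base, base + size).
  def cover?(ptr)
    @ranges.any? { |base, size| ptr >= base && ptr < base + size }
  end
end

# One pair of range sets per process, keyed here by [hostname, vpid].
device = Hash.new { |h, k| h[k] = RangeSet.new }
shared = Hash.new { |h, k| h[k] = RangeSet.new }

# Prefix the call name with '*' when any buffer is device or shared memory.
def tag_name(name, buffers, device_set, shared_set)
  gpu = buffers.any? { |p| device_set.cover?(p) || shared_set.cover?(p) }
  gpu ? "*#{name}" : name
end

process = ['node0', 4242]
device[process].add(0x1000, 256) # made-up device allocation

gpu_call = tag_name('MPI_Sendrecv', [0x1010, 0x9000], device[process], shared[process])
cpu_call = tag_name('MPI_Sendrecv', [0x8000, 0x9000], device[process], shared[process])

device[process].remove(0x1000)
after_free = tag_name('MPI_Sendrecv', [0x1010, 0x9000], device[process], shared[process])

puts gpu_call, cpu_call, after_free
```

With no allocation events recorded, both sets stay empty and no name is ever starred, which matches the mpi-only behavior described above. The linear scan keeps the sketch short; an ordered structure would be the natural choice for large numbers of allocations.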

metababel's bare `-u model_a,model_b` upstream union fails when both
models declare the same environment field, raising:

  Match expression '^hostname$' must match only one member,
  '2' matched ["hostname", "hostname"]

Every backend model generated by utils/gen_babeltrace_model_helper.rb
declares :environment: :entries: [{name: hostname, type: string}],
so any two backend models clash. metababel itself owns the rule and
isn't ours to change.

Side-step it: generate a single merged upstream model
(btx_mpi_with_ze_model.yaml) that starts from btx_mpi_model.yaml and
copies only the ze allocation/free event classes we actually need
into MPI's stream class. One environment declaration, one stream
class, no duplicates.

  - backends/mpi/gen_btx_mpi_with_ze_model.rb: the generator.
  - backends/mpi/Makefile.am: builds the merged YAML, feeds it to
    metababel as the sole `-u` source.
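
The merge strategy can be sketched as follows. This is not gen_btx_mpi_with_ze_model.rb itself: the model layout is reduced to an assumed minimum (one environment block, one stream class with a list of event classes), and the event names and the `wanted` pattern are illustrative.

```ruby
# Simplified, assumed model layout; real metababel models carry more keys.
mpi_model = {
  environment: { entries: [{ name: 'hostname', type: 'string' }] },
  stream_classes: [{ event_classes: [{ name: 'lttng_ust_mpi:MPI_Send_entry' }] }]
}
ze_model = {
  environment: { entries: [{ name: 'hostname', type: 'string' }] },
  stream_classes: [{ event_classes: [
    { name: 'lttng_ust_ze:zeMemAllocDevice_entry' },
    { name: 'lttng_ust_ze:zeMemFree_entry' },
    { name: 'lttng_ust_ze:zeCommandListCreate_entry' }
  ] }]
}

# Assumed selection rule: keep only allocation and free event classes.
wanted = /:zeMem(Alloc\w+|Free)_(entry|exit)$/

def merge_models(base, extra, wanted)
  merged = Marshal.load(Marshal.dump(base)) # deep copy; base stays untouched
  picked = extra[:stream_classes].flat_map { |sc| sc[:event_classes] }
                                 .select { |ec| ec[:name] =~ wanted }
  merged[:stream_classes].first[:event_classes].concat(picked)
  merged # environment comes from base only, so 'hostname' is declared once
end

merged = merge_models(mpi_model, ze_model, wanted)
names = merged[:stream_classes].first[:event_classes].map { |ec| ec[:name] }
puts names
```

Starting from the MPI model and copying event classes into its existing stream class means the second environment block is never seen by metababel, so the one-member rule is not triggered.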

Two matching failures surfaced when metababel walked the merged
upstream model:

1. MPI buffer parameters are declared with two different cast_types
   in btx_mpi_model.yaml: 'const void *' for input buffers (sendbuf,
   buf on _Send variants, ...) and plain 'void *' for output buffers
   (recvbuf, buf on _Recv variants, ...). The matching model required
   'const void *' on both, so mpi_2buf_entry matched zero events:
   'No event matched mpi_2buf_entry, at least one matching event
   required.'

   Relax to '^(const )?void \\*$' on both buffer slots.

2. zeResult is typed ze_result_t in btx_ze_model.yaml, so matching
   against '^int$' fails. Using ze_result_t directly would force
   btx_mpiinterval_callbacks.cpp to include ze_api.h (a heavy build
   dep on the MPI consumer just to spell an enum).

   Drop zeResult from the match entirely; a failed allocation has
   pptr_val == NULL, which the callback already guards against.
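
The effect of relaxing the cast_type pattern in item 1 can be checked directly. The sketch below writes both patterns as Ruby regex literals (with a single backslash) and runs them against the two buffer cast types named above plus one unrelated type.

```ruby
strict  = /^const void \*$/
relaxed = /^(const )?void \*$/

cast_types = ['const void *', 'void *', 'int']

cast_types.each do |t|
  puts format('%-14s strict=%-5s relaxed=%s', t.inspect, strict.match?(t), relaxed.match?(t))
end
```

The strict pattern rejects plain `void *`, which is why a category needing both an input and an output buffer matched no event; the relaxed one accepts both spellings and still rejects unrelated types.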

metababel's matching engine requires each :name: regex to bind to
EXACTLY ONE field per matched event, so the broad regex

  ^(sendbuf|buf|buffer|inbuf|origin_addr|inoutbuf)$

fails on MPI_Reduce_local (has both inbuf and inoutbuf) with:

  Match expression '...' must match only one member,
  '2' matched ["inbuf", "inoutbuf"]

Enumerate one category per distinct MPI buffer-shape so member regexes
never overlap within an event:

  - mpi_cas_entry          : origin_addr + compare_addr + result_addr
                             (MPI_Compare_and_swap)
  - mpi_sendrecv_entry     : sendbuf + recvbuf
                             (Sendrecv, Allreduce, Reduce, Gather, ...)
  - mpi_reduce_local_entry : inbuf + inoutbuf
                             (MPI_Reduce_local)
  - mpi_rma_fetch_entry    : origin_addr + result_addr
                             (MPI_Get_accumulate, MPI_Fetch_and_op)
  - mpi_1buf_entry         : everything else with a single buffer field

Each category subtracts the prior set_ids from its domain so an event
falls into exactly one bucket. Register one consumer-side callback per
shape; all delegate to the same tag_if_gpu helper.

After enumerating all MPI calls and excluding the 4 multi-buffer
categories, every remaining buffer-bearing MPI call carries exactly
one of {buf, buffer, origin_addr}. Tightening the 1buf regex to
^(buf|buffer|origin_addr)$ removes the last source of in-event
ambiguity for metababel.
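
The exactly-one-field rule and the ordered buckets can be illustrated with a small model. This is a sketch of the idea, not metababel's matching engine: the `classify` helper is hypothetical, first-match ordering stands in for the set_id subtraction, and only a handful of MPI calls are listed, with parameter names as in the MPI standard.

```ruby
events = {
  'MPI_Compare_and_swap' => %w[origin_addr compare_addr result_addr datatype target_rank target_disp win],
  'MPI_Sendrecv' => %w[sendbuf sendcount sendtype dest sendtag recvbuf recvcount recvtype source recvtag comm status],
  'MPI_Reduce_local' => %w[inbuf inoutbuf count datatype op],
  'MPI_Fetch_and_op' => %w[origin_addr result_addr datatype target_rank target_disp op win],
  'MPI_Send' => %w[buf count datatype dest tag comm]
}

# The broad pattern binds two fields of MPI_Reduce_local, hence the error.
broad = /^(sendbuf|buf|buffer|inbuf|origin_addr|inoutbuf)$/
ambiguous = events['MPI_Reduce_local'].grep(broad)

# Categories in priority order; one pattern per buffer slot.
categories = [
  ['mpi_cas_entry', [/^origin_addr$/, /^compare_addr$/, /^result_addr$/]],
  ['mpi_sendrecv_entry', [/^sendbuf$/, /^recvbuf$/]],
  ['mpi_reduce_local_entry', [/^inbuf$/, /^inoutbuf$/]],
  ['mpi_rma_fetch_entry', [/^origin_addr$/, /^result_addr$/]],
  ['mpi_1buf_entry', [/^(buf|buffer|origin_addr)$/]]
]

# An event joins the first category whose every pattern binds exactly one field.
def classify(fields, categories)
  hit = categories.find do |_name, patterns|
    patterns.all? { |re| fields.grep(re).size == 1 }
  end
  hit && hit.first
end

buckets = events.map { |name, fields| [name, classify(fields, categories)] }.to_h
puts "ambiguous under broad pattern: #{ambiguous.inspect}"
buckets.each { |name, cat| puts "#{name} -> #{cat}" }
```

Order matters in this model: MPI_Compare_and_swap also satisfies the origin_addr + result_addr shape, so it has to be claimed by its own category before the RMA fetch category is tried.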

Pack/Unpack/Pack_external/Unpack_external are intentionally NOT
classified: they carry inbuf+outbuf but the buffers never traverse
the network, so GPU-aware tagging would be semantically meaningless.

Two issues surfaced when the consumer compiled:

1. metababel emitted callback typedefs for the full ze allocation
   events (ze_context_handle_t, ze_device_mem_alloc_desc_t *,
   ze_result_t, ...). libMPIInterval is built with only mpi.h
   visible, so btx_upstream.h failed to compile with 'unknown type
   name ze_context_handle_t' (and friends).

   Teach gen_btx_mpi_with_ze_model.rb to keep only the payload
   fields the MPI consumer actually uses (size, pptr_val, ptr).
   After stripping, every generated typedef references only
   size_t / void *, which mpi.h transitively pulls in.

2. metababel strips qualifiers when generating callback typedefs,
   so the matching model's 'const void *' becomes 'void *' in the
   generated mpi_*_entry_callback_f signatures. Drop 'const' from
   each mpi_*_entry_callback parameter so the signatures match.
   (Same fix shape as the earlier buffer_info_callback fix.)
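
The field stripping from item 1 amounts to a filter over each copied event class. A minimal sketch, assuming an event class is a name plus a list of payload fields; the event name, the field list and the `strip_fields` helper are illustrative rather than taken from the generator.

```ruby
# Assumed shape of one event class: a name plus a list of payload fields.
alloc_exit = {
  name: 'lttng_ust_ze:zeMemAllocDevice_exit', # illustrative event name
  payload: [
    { name: 'zeResult', cast_type: 'ze_result_t' },
    { name: 'pptr_val', cast_type: 'void *' }
  ]
}

keep = %w[size pptr_val ptr] # the fields named in the commit message

# Return a copy of the event class that carries only the wanted fields.
def strip_fields(event_class, keep)
  kept = event_class[:payload].select { |f| keep.include?(f[:name]) }
  event_class.merge(payload: kept)
end

stripped = strip_fields(alloc_exit, keep)
remaining_types = stripped[:payload].map { |f| f[:cast_type] }
puts remaining_types.inspect
```

After the filter, no ze-specific type name is left in the payload, so a header generated from it would not need ze_api.h.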

Also replace filter_map with map.compact for Ruby <= 2.6
compatibility, even though Aurora's Ruby (3.3) supports filter_map.
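
The two spellings are interchangeable for this kind of block. The sketch below uses made-up field names and guards the `filter_map` call so that it also runs on Ruby 2.6.

```ruby
fields = [{ name: 'size' }, { name: 'hContext' }, { name: 'ptr' }]
keep = %w[size ptr]

# The block returns the name for wanted fields and nil otherwise.
pick = ->(f) { keep.include?(f[:name]) ? f[:name] : nil }

portable = fields.map(&pick).compact

# filter_map exists from Ruby 2.7 on; fall back so the sketch runs on 2.6 too.
modern = fields.respond_to?(:filter_map) ? fields.filter_map(&pick) : portable

puts portable.inspect, modern.inspect
```

One caveat when swapping in general: `filter_map` drops both `nil` and `false`, while `compact` only drops `nil`, so the forms differ for blocks that can return `false`.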
Development

Successfully merging this pull request may close these issues.

Thapi to mark MPI routines using GPU vs CPU