Skip to content

Adding Telemetry support for GPU EP#29

Draft
urpetkov-amd wants to merge 3 commits into
mainfrom
telemetry_support
Draft

Adding Telemetry support for GPU EP#29
urpetkov-amd wants to merge 3 commits into
mainfrom
telemetry_support

Conversation

@urpetkov-amd

Copy link
Copy Markdown
Collaborator

GPU EP Telemetry

Lightweight, fire-and-forget telemetry for the AMD GPU Execution Provider. Each
model load appends a single line to a shared log file describing the EP version,
backend, GPU architecture, model, and cache state.

Goals

  • Lightweight: one line per model load, no measurable runtime cost.
  • Safe under concurrency: many processes/threads may write the same file.
    Each append takes an exclusive lock, writes, and releases.
  • Never affects the caller: if the file can't be opened, or the lock can't
    be acquired within ~10 ms, the record is silently dropped. Telemetry never
    throws and never propagates a failure into inference.

Log location

Platform Path
Windows %ProgramData%\AMD\GPUEP\telemetry.log
Linux /var/log/AMD/GPUEP/telemetry.log (override with AMD_GPUEP_TELEMETRY_DIR)

%ProgramData% is machine-wide and writable by normal (non-elevated) user
processes, which suits inference hosts. A compile-time switch
(telemetry::kUseProgramFiles in telemetry.h) can move the log to
%ProgramFiles%\AMD\GPUEP\ once the installer grants write access there via an
ACL — until then, %ProgramFiles% would silently fail for non-elevated
processes (by design).

Record format

One newline-terminated line per model load:

<ISO-8601-UTC> v=<schema> ep_ver=<v> backend=<name> gfx=<arch> model=<file> mxr_cache=<hit|miss> parent=<proc>

Example:

2026-06-19T13:50:03Z v=1 ep_ver=0.1.0 backend=MIGraphX gfx=gfx1100 model=synthetic_model.onnx mxr_cache=miss parent=python.exe
  • key=value was chosen over CSV/JSON so the consumer can add or ignore fields
    without breaking parsers.
  • A leading v=<schema> (currently 1) lets the format evolve.
  • Optional fields are omitted when unknown (e.g. gfx/mxr_cache are not
    emitted on the DirectML path).
  • Values are sanitized (whitespace/newlines → _) so a value can never split a
    line.

Telemetry component (src/shared/common/)

Follows the repo's existing platform-split convention:

File Responsibility
telemetry.h Public API: Record, Log(), AppendLine(), LogFilePath(), process-name helpers, constants
telemetry.cc Cross-platform: Record::Format(), LogFilePath() (path join + create_directories), Log() orchestration
platform/windows/telemetry.cc LockFileEx append, %ProgramData% from env, Toolhelp process lookup
platform/linux/telemetry.cc fcntl(F_SETLK) append, base dir, /proc process lookup

All cross-cutting logic lives in telemetry.cc; the platform files contain only
the irreducible OS-specific code.

Locking mechanism

Exclusive file lock with a ~10 ms acquire timeout. On
Windows this maps directly to LockFileEx:

  1. Open with GENERIC_READ | GENERIC_WRITE, FILE_SHARE_READ | FILE_SHARE_WRITE,
    OPEN_ALWAYS. Sharing is allowed; the byte-range lock provides mutual
    exclusion.
  2. Acquire the lock with LOCKFILE_EXCLUSIVE_LOCK | LOCKFILE_FAIL_IMMEDIATELY in
    a poll loop bounded by GetTickCount64() + ~10 ms (LockFileEx has no native
    timeout). Give up silently on timeout.
  3. Seek to EOF under the lock and WriteFile.
  4. UnlockFileEx, CloseHandle.

A sentinel byte far past EOF (offset 0x7FFFFFFF'FFFFFFFF) is locked as the
gate rather than byte 0, so Windows' mandatory locks never block a concurrent
reader.

Files changed

New:

  • src/shared/common/telemetry.h
  • src/shared/common/telemetry.cc
  • src/shared/common/platform/windows/telemetry.cc
  • src/shared/common/platform/linux/telemetry.cc
  • src/shared/tests/telemetry_stress_test.cc

Modified:

  • src/shared/CMakeLists.txt — add telemetry sources; add BUILD_TELEMETRY_STRESS_TEST option.
  • src/amdgpu/gpu_ep.{h,cc} — record backend choice; emit telemetry for the
    DirectML path (LogTelemetry, std::call_once).
  • src/amdgpu/gpu_factory.h — public Version() accessor.
  • src/migraphx/mgx_ep.{h,cc} — emit full telemetry record in Compile();
    thread MXR cache-hit out of CreateNodeComputeInfoFromGraph.
  • src/migraphx/mgx_factory.h — public Version() accessor.
  • src/amdgpu/pyproject.toml — package the migraphx-backend and
    directml-backend components (see Related change).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant