Adding Telemetry support for GPU EP by urpetkov-amd · Pull Request #29 · onnxruntime/onnxruntime-ep-amdgpu

urpetkov-amd · 2026-06-19T14:34:25Z

GPU EP Telemetry

Lightweight, fire-and-forget telemetry for the AMD GPU Execution Provider. Each
model load appends a single line to a shared log file describing the EP version,
backend, GPU architecture, model, and cache state.

Goals

- Lightweight: one line per model load, no measurable runtime cost.
- Safe under concurrency: many processes/threads may write the same file.
  Each append takes an exclusive lock, writes, and releases.
- Never affects the caller: if the file can't be opened, or the lock can't
  be acquired within ~10 ms, the record is silently dropped. Telemetry never
  throws and never propagates a failure into inference.

Log location

Platform	Path
Windows	`%ProgramData%\AMD\GPUEP\telemetry.log`
Linux	`/var/log/AMD/GPUEP/telemetry.log` (override with `AMD_GPUEP_TELEMETRY_DIR`)

%ProgramData% is machine-wide and writable by normal (non-elevated) user
processes, which suits inference hosts. A compile-time switch
(telemetry::kUseProgramFiles in telemetry.h) can move the log to
%ProgramFiles%\AMD\GPUEP\ once the installer grants write access there via an
ACL. Until then, writes under %ProgramFiles% fail for non-elevated processes,
and by design the record is dropped silently.
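
A minimal sketch of how the path resolution described above could look. The
function name `ResolveLogFilePath` is hypothetical, and the sketch assumes that
`AMD_GPUEP_TELEMETRY_DIR` replaces the whole directory rather than a parent of
it; the actual `LogFilePath()` in this PR may differ.

```cpp
#include <cstdlib>
#include <filesystem>
#include <optional>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: returns the log file path, or nullopt if the base
// directory cannot be determined or created. It does not throw on I/O errors.
std::optional<fs::path> ResolveLogFilePath() {
  fs::path base;
#ifdef _WIN32
  // %ProgramData% is read from the environment.
  const char* program_data = std::getenv("ProgramData");
  if (program_data == nullptr || *program_data == '\0') return std::nullopt;
  base = fs::path(program_data) / "AMD" / "GPUEP";
#else
  // Assumption: the override variable names the directory itself.
  const char* override_dir = std::getenv("AMD_GPUEP_TELEMETRY_DIR");
  base = (override_dir != nullptr && *override_dir != '\0')
             ? fs::path(override_dir)
             : fs::path("/var/log/AMD/GPUEP");
#endif
  // The error_code overload reports failure without throwing, which keeps
  // telemetry failures out of the caller's control flow.
  std::error_code ec;
  fs::create_directories(base, ec);
  if (ec) return std::nullopt;
  return base / "telemetry.log";
}
```

A caller would treat `std::nullopt` as "drop the record" and continue.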

Record format

One newline-terminated line per model load:

<ISO-8601-UTC> v=<schema> ep_ver=<v> backend=<name> gfx=<arch> model=<file> mxr_cache=<hit|miss> parent=<proc>

Example:

2026-06-19T13:50:03Z v=1 ep_ver=0.1.0 backend=MIGraphX gfx=gfx1100 model=synthetic_model.onnx mxr_cache=miss parent=python.exe

- key=value was chosen over CSV/JSON so that fields can be added later, and
  unknown fields ignored by consumers, without breaking parsers.
- A leading v=<schema> (currently 1) lets the format evolve.
- Optional fields are omitted when unknown (e.g. gfx/mxr_cache are not
  emitted on the DirectML path).
- Values are sanitized (whitespace/newlines → _) so a value can never split a
  line.
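
The formatting rules above can be sketched as follows. This is an illustration
only: the `Record` layout, its field names, and the `Sanitize` and
`FormatRecord` helpers are hypothetical stand-ins, not the PR's actual
`Record::Format()`. The timestamp is assumed to be produced by the caller.

```cpp
#include <cctype>
#include <optional>
#include <string>

// Hypothetical stand-in for the PR's Record type; field names are illustrative.
struct Record {
  std::string timestamp;                 // ISO-8601 UTC, supplied by the caller
  std::string ep_version;
  std::string backend;
  std::optional<std::string> gfx;        // unknown on the DirectML path
  std::string model;
  std::optional<std::string> mxr_cache;  // "hit" or "miss"
  std::string parent;
};

// Replace every whitespace character (including newlines) with '_' so that a
// value can never split a line or a key=value pair.
std::string Sanitize(const std::string& value) {
  std::string out = value;
  for (char& c : out) {
    if (std::isspace(static_cast<unsigned char>(c))) c = '_';
  }
  return out;
}

// Build one newline-terminated record line; optional fields are skipped.
std::string FormatRecord(const Record& r) {
  std::string line = r.timestamp + " v=1";  // schema version, currently 1
  auto append = [&line](const char* key, const std::string& value) {
    line += ' ';
    line += key;
    line += '=';
    line += Sanitize(value);
  };
  append("ep_ver", r.ep_version);
  append("backend", r.backend);
  if (r.gfx) append("gfx", *r.gfx);
  append("model", r.model);
  if (r.mxr_cache) append("mxr_cache", *r.mxr_cache);
  append("parent", r.parent);
  line += '\n';
  return line;
}
```

With this sketch, a record for the DirectML path simply has no `gfx=` or
`mxr_cache=` token, and a model named `my model.onnx` is written as
`model=my_model.onnx`.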

Telemetry component (`src/shared/common/`)

Follows the repo's existing platform-split convention:

File	Responsibility
`telemetry.h`	Public API: `Record`, `Log()`, `AppendLine()`, `LogFilePath()`, process-name helpers, constants
`telemetry.cc`	Cross-platform: `Record::Format()`, `LogFilePath()` (path join + `create_directories`), `Log()` orchestration
`platform/windows/telemetry.cc`	`LockFileEx` append, `%ProgramData%` from env, Toolhelp process lookup
`platform/linux/telemetry.cc`	`fcntl(F_SETLK)` append, base dir, `/proc` process lookup

All cross-cutting logic lives in telemetry.cc; the platform files contain only
the irreducible OS-specific code.

Locking mechanism

Each append takes an exclusive file lock with a ~10 ms acquire timeout. On
Windows this maps directly to LockFileEx:

1. Open with GENERIC_READ | GENERIC_WRITE, FILE_SHARE_READ | FILE_SHARE_WRITE,
   OPEN_ALWAYS. Sharing is allowed; the byte-range lock provides mutual
   exclusion.
2. Acquire the lock with LOCKFILE_EXCLUSIVE_LOCK | LOCKFILE_FAIL_IMMEDIATELY in
   a poll loop bounded by GetTickCount64() + ~10 ms (LockFileEx has no native
   timeout). Give up silently on timeout.
3. Seek to EOF under the lock and WriteFile.
4. UnlockFileEx, CloseHandle.

A sentinel byte far past EOF (offset 0x7FFFFFFF'FFFFFFFF) is locked as the
gate rather than byte 0, so Windows' mandatory locks never block a concurrent
reader.
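
For comparison, the Linux side of the same open, lock-with-deadline, append,
unlock sequence might look like the sketch below. It assumes that the
`fcntl(F_SETLK)` path in `platform/linux/telemetry.cc` uses the same
poll-until-deadline strategy as the Windows code; the function name
`AppendLineLocked`, the 1 ms sleep between attempts, and the whole-file lock
range are illustrative choices, not taken from the PR.

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <chrono>
#include <string>
#include <thread>

// Hypothetical helper: appends `line` under an exclusive advisory lock.
// Returns false (and writes nothing) if the file cannot be opened or the lock
// is not acquired within roughly 10 ms.
bool AppendLineLocked(const std::string& path, const std::string& line) {
  const int fd =
      ::open(path.c_str(), O_WRONLY | O_CREAT | O_APPEND | O_CLOEXEC, 0666);
  if (fd < 0) return false;

  struct flock lock {};
  lock.l_type = F_WRLCK;     // exclusive (write) lock
  lock.l_whence = SEEK_SET;
  lock.l_start = 0;
  lock.l_len = 0;            // length 0 means "to end of file"

  // F_SETLK fails immediately when the lock is held elsewhere, so poll until
  // the deadline passes.
  const auto deadline =
      std::chrono::steady_clock::now() + std::chrono::milliseconds(10);
  bool locked = false;
  do {
    if (::fcntl(fd, F_SETLK, &lock) == 0) {
      locked = true;
      break;
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  } while (std::chrono::steady_clock::now() < deadline);

  bool ok = false;
  if (locked) {
    // O_APPEND positions the write at EOF, so no explicit seek is needed.
    const ssize_t n = ::write(fd, line.data(), line.size());
    ok = (n == static_cast<ssize_t>(line.size()));
    lock.l_type = F_UNLCK;
    ::fcntl(fd, F_SETLK, &lock);
  }
  ::close(fd);
  return ok;
}
```

Note that POSIX record locks taken with `F_SETLK` are owned by the process, so
they exclude other processes but not other threads of the same process; a
sketch like this would need an in-process mutex as well to cover that case.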

Files changed

New:

src/shared/common/telemetry.h
src/shared/common/telemetry.cc
src/shared/common/platform/windows/telemetry.cc
src/shared/common/platform/linux/telemetry.cc
src/shared/tests/telemetry_stress_test.cc

Modified:

src/shared/CMakeLists.txt — add telemetry sources; add BUILD_TELEMETRY_STRESS_TEST option.
src/amdgpu/gpu_ep.{h,cc} — record backend choice; emit telemetry for the
DirectML path (LogTelemetry, std::call_once).
src/amdgpu/gpu_factory.h — public Version() accessor.
src/migraphx/mgx_ep.{h,cc} — emit full telemetry record in Compile();
thread MXR cache-hit out of CreateNodeComputeInfoFromGraph.
src/migraphx/mgx_factory.h — public Version() accessor.
src/amdgpu/pyproject.toml — package the migraphx-backend and
directml-backend components (see Related change).

urpetkov-amd added 3 commits June 19, 2026 15:02

Adding telemetry support

b0c820f

More stuff for telemetry

76c7401

More changes to telemetry

a00d061

urpetkov-amd requested review from apwojcik and tperry-amd June 19, 2026 14:35

urpetkov-amd wants to merge 3 commits into main from telemetry_support
