Adding Telemetry support for GPU EP#29
Draft
urpetkov-amd wants to merge 3 commits into
Draft
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
GPU EP Telemetry
Lightweight, fire-and-forget telemetry for the AMD GPU Execution Provider. Each
model load appends a single line to a shared log file describing the EP version,
backend, GPU architecture, model, and cache state.
Goals
Each append takes an exclusive lock, writes, and releases.
be acquired within ~10 ms, the record is silently dropped. Telemetry never
throws and never propagates a failure into inference.
Log location
%ProgramData%\AMD\GPUEP\telemetry.log/var/log/AMD/GPUEP/telemetry.log(override withAMD_GPUEP_TELEMETRY_DIR)%ProgramData%is machine-wide and writable by normal (non-elevated) userprocesses, which suits inference hosts. A compile-time switch
(
telemetry::kUseProgramFilesintelemetry.h) can move the log to%ProgramFiles%\AMD\GPUEP\once the installer grants write access there via anACL — until then,
%ProgramFiles%would silently fail for non-elevatedprocesses (by design).
Record format
One newline-terminated line per model load:
Example:
key=valuewas chosen over CSV/JSON so the consumer can add or ignore fieldswithout breaking parsers.
v=<schema>(currently1) lets the format evolve.gfx/mxr_cacheare notemitted on the DirectML path).
_) so a value can never split aline.
Telemetry component (
src/shared/common/)Follows the repo's existing platform-split convention:
telemetry.hRecord,Log(),AppendLine(),LogFilePath(), process-name helpers, constantstelemetry.ccRecord::Format(),LogFilePath()(path join +create_directories),Log()orchestrationplatform/windows/telemetry.ccLockFileExappend,%ProgramData%from env, Toolhelp process lookupplatform/linux/telemetry.ccfcntl(F_SETLK)append, base dir,/procprocess lookupAll cross-cutting logic lives in
telemetry.cc; the platform files contain onlythe irreducible OS-specific code.
Locking mechanism
Exclusive file lock with a ~10 ms acquire timeout. On
Windows this maps directly to
LockFileEx:GENERIC_READ | GENERIC_WRITE,FILE_SHARE_READ | FILE_SHARE_WRITE,OPEN_ALWAYS. Sharing is allowed; the byte-range lock provides mutualexclusion.
LOCKFILE_EXCLUSIVE_LOCK | LOCKFILE_FAIL_IMMEDIATELYina poll loop bounded by
GetTickCount64()+ ~10 ms (LockFileExhas no nativetimeout). Give up silently on timeout.
WriteFile.UnlockFileEx,CloseHandle.A sentinel byte far past EOF (offset
0x7FFFFFFF'FFFFFFFF) is locked as thegate rather than byte 0, so Windows' mandatory locks never block a concurrent
reader.
Files changed
New:
src/shared/common/telemetry.hsrc/shared/common/telemetry.ccsrc/shared/common/platform/windows/telemetry.ccsrc/shared/common/platform/linux/telemetry.ccsrc/shared/tests/telemetry_stress_test.ccModified:
src/shared/CMakeLists.txt— add telemetry sources; addBUILD_TELEMETRY_STRESS_TESToption.src/amdgpu/gpu_ep.{h,cc}— record backend choice; emit telemetry for theDirectML path (
LogTelemetry,std::call_once).src/amdgpu/gpu_factory.h— publicVersion()accessor.src/migraphx/mgx_ep.{h,cc}— emit full telemetry record inCompile();thread MXR cache-hit out of
CreateNodeComputeInfoFromGraph.src/migraphx/mgx_factory.h— publicVersion()accessor.src/amdgpu/pyproject.toml— package themigraphx-backendanddirectml-backendcomponents (see Related change).