Preserve live-allocation tracking across worker resets #550

Draft
r1viollet wants to merge 5 commits into main from r1viollet/preserve-live-alloc-on-worker-reset

@r1viollet
Collaborator

What

Snapshot LiveAllocation to a parent-held memfd before a worker restart and
restore it in the newly-forked worker, so live-heap tracking survives the
periodic worker reset instead of starting from zero each time.

Why

Workers are reset by forking a fresh process from the parent (every worker_period exports).
Everything in DDProfWorkerContext, including the heap-tracking aggregator
in LiveAllocation, is discarded. The target process still holds live allocations
at addresses the profiler no longer has stacks for, so live-heap is undercounted
until natural alloc/free traffic refills the map. The in-target library cannot
replay them because it tracks only addresses, not stacks.

How

  • main_loop creates a memfd that the parent keeps open and every worker
    child inherits.
  • Outgoing child captures a self-owned snapshot: UnwindOutput handles are
    resolved back to portable strings via libdatadog Function2/Mapping2
    read-back; all string_views into Process/base-frame caches are copied.
  • The blob is written to the memfd just before the child exits (after the
    final synchronous export).
  • Incoming child reads the blob in worker_library_init after the new
    SymbolHdr / Symbolizer / ProfilesDictionary are constructed but
    before the poll loop drains events. Mappings and functions are re-interned
    into the fresh dictionary; LiveAllocation owns a string deque backing
    the string_views of restored UnwindOutputs.
  • The memfd is truncated on read so a non-restarting exit doesn't leak stale
    state into the next worker.
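
As a rough illustration of the hand-off above, the sketch below writes a blob to a memfd and reads it back with truncate-on-read. It is not the code in this PR: `write_snapshot` and `take_snapshot` are hypothetical names, the blob is an opaque string, and error handling (for example `EINTR`) is reduced to the minimum. In the real flow the parent would create the fd once and each forked worker would inherit it.

```cpp
#include <sys/mman.h>  // memfd_create (Linux, glibc >= 2.27)
#include <sys/stat.h>  // fstat
#include <unistd.h>    // pread, pwrite, ftruncate, close

#include <cstddef>
#include <string>

// Hypothetical helper: replace the memfd contents with `blob`.
// Called by the outgoing worker just before it exits.
inline bool write_snapshot(int fd, const std::string &blob) {
  if (ftruncate(fd, 0) != 0) {
    return false;
  }
  size_t off = 0;
  while (off < blob.size()) {
    const ssize_t n = pwrite(fd, blob.data() + off, blob.size() - off,
                             static_cast<off_t>(off));
    if (n <= 0) {
      return false;
    }
    off += static_cast<size_t>(n);
  }
  return true;
}

// Hypothetical helper: read the whole blob, then truncate the memfd so a
// later worker never sees the same snapshot twice.
// Returns an empty string when there is nothing to restore.
inline std::string take_snapshot(int fd) {
  struct stat st;
  if (fstat(fd, &st) != 0 || st.st_size <= 0) {
    return std::string();
  }
  std::string blob(static_cast<size_t>(st.st_size), '\0');
  size_t off = 0;
  while (off < blob.size()) {
    const ssize_t n = pread(fd, &blob[0] + off, blob.size() - off,
                            static_cast<off_t>(off));
    if (n <= 0) {
      return std::string();
    }
    off += static_cast<size_t>(n);
  }
  if (ftruncate(fd, 0) != 0) {
    return std::string();
  }
  return blob;
}
```

Using `pread`/`pwrite` with explicit offsets avoids depending on the file position, which is shared between processes that inherit the same descriptor.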

Budget enforcement (value-preserving)

  • Default target 4 MB, hard ceiling 20 MB.
  • Over budget: rank stacks by aggregate value, drop the lowest. Their
    addresses get remapped to a synthetic [live-alloc cleared] common frame
    so per-PID heap totals remain correct even when detail is shed.
  • Still over: drop entire PIDs from lowest aggregate value upwards.
  • New stats: live_alloc.snapshot.bytes, live_alloc.snapshot.cleared_stacks,
    live_alloc.snapshot.dropped_pids.
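
The first shedding step can be sketched as follows, for a single PID. This is a simplified model rather than the PR's implementation: `StackAgg`, `ShedResult` and `shed_to_budget` are hypothetical names, the per-stack size is an assumed estimate, and the cost of the synthetic frame itself is ignored.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

struct StackAgg {        // hypothetical: one unique stack within a PID
  uint64_t value;        // aggregate live bytes attributed to this stack
  size_t encoded_bytes;  // estimated cost of serialising this stack
  bool cleared;          // true once remapped to the synthetic frame
};

struct ShedResult {
  size_t bytes;            // estimated snapshot size after shedding
  size_t cleared_stacks;   // number of stacks remapped
  uint64_t cleared_value;  // value now carried by [live-alloc cleared]
};

// Mark the lowest-value stacks as cleared until the estimate fits the budget.
// Their value is not discarded: it moves to the synthetic bucket, so the
// per-PID total (kept value + cleared value) is unchanged.
inline ShedResult shed_to_budget(std::vector<StackAgg> &stacks, size_t budget) {
  ShedResult r{0, 0, 0};
  for (const StackAgg &s : stacks) {
    r.bytes += s.encoded_bytes;
  }
  // Visit stacks from lowest to highest aggregate value.
  std::vector<size_t> order(stacks.size());
  std::iota(order.begin(), order.end(), size_t{0});
  std::sort(order.begin(), order.end(), [&stacks](size_t a, size_t b) {
    return stacks[a].value < stacks[b].value;
  });
  for (const size_t idx : order) {
    if (r.bytes <= budget) {
      break;
    }
    r.bytes -= stacks[idx].encoded_bytes;
    stacks[idx].cleared = true;
    ++r.cleared_stacks;
    r.cleared_value += stacks[idx].value;
  }
  return r;
}
```

If the estimate is still above the budget after every stack has been cleared, the caller would fall back to dropping whole PIDs, which this sketch does not cover.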

Known gap

Events arriving between the old child's exit and the new child's first poll
are still lost. A library-side pause hook can be added separately.

Tests

  • New live_allocation_snapshot-ut (binary round-trip, bad-magic and
    truncation rejection, empty snapshot).
  • Existing live_allocation-ut still green.
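
For readers unfamiliar with the cases the new unit test covers, here is a minimal sketch of a framed blob that rejects a bad magic and a truncated payload. The layout (4-byte magic, 4-byte length, payload), the magic value and the function names are assumptions for illustration only; the actual snapshot format in this PR may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

constexpr uint32_t k_snapshot_magic = 0x4C415331;  // hypothetical value
constexpr size_t k_header_size = 2 * sizeof(uint32_t);

// Frame a payload as [magic][length][payload], in host byte order.
inline std::string encode_snapshot(const std::string &payload) {
  const uint32_t magic = k_snapshot_magic;
  const uint32_t len = static_cast<uint32_t>(payload.size());
  std::string out(k_header_size, '\0');
  std::memcpy(&out[0], &magic, sizeof(magic));
  std::memcpy(&out[sizeof(magic)], &len, sizeof(len));
  out += payload;
  return out;
}

// Returns false (leaving `payload` untouched) on a short header, a wrong
// magic, or a length that does not match the bytes actually present.
inline bool decode_snapshot(const std::string &blob, std::string &payload) {
  if (blob.size() < k_header_size) {
    return false;
  }
  uint32_t magic = 0;
  uint32_t len = 0;
  std::memcpy(&magic, blob.data(), sizeof(magic));
  std::memcpy(&len, blob.data() + sizeof(magic), sizeof(len));
  if (magic != k_snapshot_magic) {
    return false;
  }
  if (blob.size() - k_header_size != len) {
    return false;
  }
  payload = blob.substr(k_header_size, len);
  return true;
}
```

Host byte order is acceptable here because writer and reader are forks of the same parent on the same machine.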

r1viollet added 2 commits May 26, 2026 12:32
Workers are restarted by forking a fresh process from the parent, which
loses everything in DDProfWorkerContext — including the heap-tracking
aggregator in LiveAllocation. Live-heap is then undercounted until natural
alloc/free traffic refills the map; allocations that were live at the reset
and are never freed stay missing for the rest of the target's life.

Add a serialisation path that survives the fork:

- main_loop allocates a memfd that the parent keeps open and every
  worker child inherits.
- On 'restart_worker', the outgoing child resolves its UnwindOutput
  handles back to portable strings (via libdatadog Function2/Mapping2
  read-back) and writes a self-owned snapshot to the memfd.
- The new child reads the snapshot in worker_library_init, re-interns
  mappings/functions into its fresh ProfilesDictionary and rebuilds the
  LiveAllocation maps before the poll loop starts draining events.
- LiveAllocation owns a string deque backing the string_views of
  restored UnwindOutputs; live entries built from incoming events keep
  using Process/base-frame views.

Budget enforcement, value-preserving:

- Default target 4 MB, hard ceiling 20 MB.
- When over budget, rank stacks by aggregate value and drop the lowest;
  their addresses are remapped to a synthetic [live-alloc cleared]
  common frame so per-PID heap totals remain correct.
- If still over after dropping all stacks, drop entire PIDs from the
  lowest aggregate value upwards.

In-flight events between the old child exit and the new child's first
poll are still lost; a library-side pause hook is a separate change.

Add a third live-heap variant to simple_malloc-ut.sh that drives the
worker into at least one reset (upload_period=2s, worker_period=2) with
--skip-free 100 keeping ~99% of allocations live, and checks:

  - at least one '[live-alloc] Snapshot restored' log line
  - zero 'Tracked address count mismatch' warnings between the profiler
    and the in-target library after restore

Adds ~7s to the simple_malloc suite (the target needs to outlive 2 export
cycles). The same test also runs under DD_PROFILING_REORDER_EVENTS=1.
@datadog-official

datadog-official Bot commented May 26, 2026

Pipelines

⚠️ Warnings

🚦 1 Pipeline job failed

DataDog/apm-reliability/ddprof | report_gitlab_CI_status

🔄 Retry job. This looks flaky and may succeed on retry. Job failed: command terminated with exit code 1 indicating a possible issue with the downstream pipeline.

🔗 Commit SHA: deaabdc

r1viollet added 3 commits May 26, 2026 15:08
clang-tidy errors flagged by CI:
- readability-math-missing-parentheses on sizeof(T) * N + ... arithmetic
- cppcoreguidelines-avoid-const-or-ref-data-members on Writer::_out
  (switch the reference member to a non-owning pointer)
- readability-uppercase-literal-suffix (0u -> 0U)
- misc-const-correctness on loop indices (uint32_t idx -> uint32_t const idx)

Also adds a TODO block above portable_to_uo() spelling out the four
overlapping caches (ProfilesDictionary, SymbolTable/MapInfoTable,
RuntimeSymbolLookup et al., _restored_strings), the duplicate-entry
cost we accept on the restore path, and how a future PR can unify the
model by making FunLoc identity content-based on libdatadog handles.

- DD_PROFILING_NATIVE_LIVE_ALLOC_SNAPSHOT_MAX_BYTES overrides the
  per-capture budget. Capped at the hard ceiling. Lets tests force
  the cleared-stack remap path and the dropped-pid fallback without
  rebuilding the binary.

- simple_malloc --unique-sites N spreads allocations across up to 256
  templated alloc_at_site<Tag> instantiations, each producing a
  distinct innermost frame to the unwinder. Used to stress-test the
  snapshot path with many unique stacks per cycle.
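
One way to get many distinct innermost frames from templates is sketched below. It is an illustration under assumptions, not the code added to simple_malloc: `make_sites`, `site_index` and `alloc_spread` are hypothetical names, and the empty inline asm is a GCC/Clang idiom used here to stop the compiler from turning the `malloc` call into a tail call or folding identical instantiations.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdlib>
#include <utility>

// Each Tag yields a separate function, hence a separate innermost frame.
template <size_t Tag>
__attribute__((noinline)) void *alloc_at_site(size_t size) {
  void *p = std::malloc(size);
  // Consuming p after the call keeps this frame on the stack during malloc;
  // consuming Tag makes every instantiation's body differ.
  __asm__ __volatile__("" : : "r"(p), "r"(Tag) : "memory");
  return p;
}

using AllocFn = void *(*)(size_t);
constexpr size_t k_max_sites = 256;

// Expand Tags 0..N-1 into a table of function pointers.
template <size_t... Tags>
inline std::array<AllocFn, sizeof...(Tags)>
make_sites(std::index_sequence<Tags...> /*tags*/) {
  return {{&alloc_at_site<Tags>...}};
}

static const std::array<AllocFn, k_max_sites> k_sites =
    make_sites(std::make_index_sequence<k_max_sites>{});

// Map the i-th allocation onto one of `unique_sites` sites (clamped to 1..256).
inline size_t site_index(size_t i, size_t unique_sites) {
  const size_t n = std::min(std::max<size_t>(unique_sites, 1), k_max_sites);
  return i % n;
}

inline void *alloc_spread(size_t i, size_t unique_sites, size_t size) {
  return k_sites[site_index(i, unique_sites)](size);
}
```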

Verified locally at three budget levels: full preservation, cleared
remap (stacks=30 cleared=582 dropped_pids=0 at 240 KB), and pid drop
(dropped_pids=1 at 8 KB). All paths keep 'Tracked address count
mismatch' warnings at zero in the steady state.
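
The budget override could be parsed roughly as below. Only the variable name, the 4 MB default and the 20 MB ceiling come from this PR; treating MB as 1024 * 1024 bytes, falling back to the default on unparsable input, and the function name `snapshot_budget_from_env` are assumptions of this sketch.

```cpp
#include <algorithm>
#include <cerrno>
#include <cstddef>
#include <cstdlib>

constexpr size_t k_default_budget = 4 * 1024 * 1024;  // assumed 4 MiB
constexpr size_t k_hard_ceiling = 20 * 1024 * 1024;   // assumed 20 MiB

// Read the per-capture budget from the environment, capped at the ceiling.
inline size_t snapshot_budget_from_env() {
  const char *raw =
      std::getenv("DD_PROFILING_NATIVE_LIVE_ALLOC_SNAPSHOT_MAX_BYTES");
  if (raw == nullptr || *raw == '\0') {
    return k_default_budget;
  }
  char *end = nullptr;
  errno = 0;
  const unsigned long long v = std::strtoull(raw, &end, 10);
  if (errno != 0 || end == raw || *end != '\0') {
    return k_default_budget;  // assumption: ignore values that do not parse
  }
  return static_cast<size_t>(std::min<unsigned long long>(v, k_hard_ceiling));
}
```

With this shape, a test can set the variable to a few kilobytes to force the cleared-stack and dropped-PID paths, while an oversized value silently resolves to the ceiling.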