add TaskBlock events for blocking intervals#570
Conversation
|
CI Test ResultsRun: #27828360029 | Commit:
Status Overview
Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled Summary: Total: 32 | Passed: 32 | Failed: 0 Updated: 2026-06-19 13:40:05 UTC |
bcf3c2b to
a46d368
Compare
3f2f71a to
d1e3210
Compare
|
@copilot review this PR |
Reviewed and found a flaky assertion in |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d21e4936d2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
2101007 to
a94677a
Compare
a94677a to
681582c
Compare
There was a problem hiding this comment.
Can you add more test coverage?
Tier 1
directly threatens the design's core invariant (no test exists, and the race is real, not theoretical):
-
Concurrent enter/exit/snapshot on the same ThreadFilter slot. enterBlockedRun, snapshotAndExitBlockedRun, and the signal-handler's markSampledThisRun are all lock-free on shared slot state, but every test drives them single-threaded. The whole feature is built on a signal handler racing the application thread — and that interleaving is never tested. This is the headline gap.
-
Snapshot consistency under concurrent mutation. snapshotBlockRun does several independent relaxed loads (sampled_this_run, owner, anchor_sample_id, suppressed_sample_count). No test mutates state mid-snapshot to confirm the result is self-consistent (e.g. anchored=false but non-zero suppressed count).
-
TaskBlockQueue concurrent push/pop. It's an MPSC-style sequence-cell queue (the hard-to-get-right kind), tested only single-threaded. No multi-producer race, no cell-reuse-staleness test.
### Tier 2 error/early-exit paths that are written but never executed by a test: -
NativeBlockScope constructor gates. Of its ~6 early-exit gates (!taskBlockAsyncActive, !filter.enabled, current==nullptr, non-Java thread, slot_id<0, enterBlockedRun==0), only the spanId!=0 skip is tested. The others are silent - a regression that flips one would pass CI.
-
Null orig* function pointer → ENOSYS. The interposer handles it; no test injects null.
-
getsockopt failure with errno ≠ ENOTSOCK - uncovered, and it's the one real perf cliff.
Tier 3
lifecycle, worth one test each:
- Profiler restart with changed args (wallprecheck=true then false; ASGCT↔JVMTI engine switch with precheck active) — the static-singleton counters and drain-thread lifecycle across restart are untested. Note the prior memory: restart-in-prod isn't a supported mode, which lowers this priority.
- JVMTI-unavailable / J9 fallback — precheck wiring on non-ASGCT engines is asserted by the agents to silently no-op; worth a test to confirm it degrades cleanly rather than leaving weight slots un-restored.
98cb129 to
c278669
Compare
| }); | ||
| } | ||
|
|
||
| ssize_t NativeSocketInterposer::write_hook(int fd, const void* buf, size_t len) { |
There was a problem hiding this comment.
[Sphinx Review — MEDIUM] read()/write() GOT hooks intercept ALL process read/write calls (files, pipes, tty), not just sockets, while the interposer is active. Each call performs an fd classification lookup and, on cache miss, a getsockopt syscall on the I/O hot path.
Suggestion: Consider a cheaper first-line filter before classification (e.g. skip well-known non-socket fds 0/1/2, or only treat fds that have appeared in connect/accept/socket as candidates), or document/measure the overhead of intercepting all read/write. Confirm the getsockopt-on-miss cost is acceptable for file-heavy workloads.
| return errno == ENOTSOCK ? FD_TYPE_NON_SOCKET : 0; | ||
| }; | ||
|
|
||
| if (static_cast<size_t>(fd) >= static_cast<size_t>(FD_TYPE_CACHE_SIZE)) { |
There was a problem hiding this comment.
[Sphinx Review — MEDIUM] fd values >= FD_TYPE_CACHE_SIZE (65536) are never cached; every intercepted I/O call on such an fd issues a fresh getsockopt syscall.
Suggestion: Either size the cache to the process RLIMIT_NOFILE (or use a hashed cache) or document that the syscall fast-path overhead applies to fds above the cache size. High-fd-count servers are exactly the workloads where this matters.
|
|
||
| class TaskBlockQueue { | ||
| private: | ||
| static const size_t kCapacity = 4096; |
There was a problem hiding this comment.
[Sphinx Review — LOW] Fixed 4096-entry async queue silently drops task-block events under contention bursts between 1ms drains.
Suggestion: Confirm 4096 is adequate for expected monitor-contention burst rates; the drop counter exists for observability. Consider documenting the drain interval/capacity relationship.
| event._ctx = ctx; | ||
| if (call_trace_id != 0 || correlation_id != 0 || | ||
| observed_state != OSThreadState::UNKNOWN) { | ||
| setTaskBlockStackReference(event, call_trace_id, correlation_id, |
There was a problem hiding this comment.
[Sphinx Review — LOW] On the explicit-stack-reference deferred path the caller-supplied observedBlockingState is recorded verbatim with no validation; if the caller's captured state is stale relative to the target thread, the emitted event records an incorrect blocking state.
Suggestion: Document the contract that the caller is responsible for capturing observedBlockingState atomically with the block interval; the deferred path cannot validate it. No code change required if the contract is acceptable.
| * Intended for background drain threads that record events on behalf of a different (sleeping) | ||
| * thread; avoids reading OTEP TLS from the calling thread. | ||
| */ | ||
| public void recordTaskBlockFromContext(int tid, long startTicks, long endTicks, |
There was a problem hiding this comment.
[Sphinx Review — LOW] recordTaskBlockFromContext (and recordTaskBlockWithContext) discard the boolean emit/skip result computed by the native recorder, so callers cannot detect dropped events.
Suggestion: Consider returning the boolean (or surfacing a drop counter) so the trace integration can observe whether the deferred record actually emitted.
| return call(fn); | ||
| } | ||
|
|
||
| NativeBlockScope block(kind, fd); |
There was a problem hiding this comment.
[Sphinx Review — LOW] When the interposer and sampler are both active, the TaskBlock duration measured by NativeBlockScope encloses the sampler's own work (probe, getpeername, LRU, recordSample), inflating the reported blocked interval, and the same syscall produces two distinct event types with no documented coordination.
Suggestion: Either snapshot the syscall boundary ticks tightly around the raw fn() call and pass them into NativeBlockScope::finish(), or document that the combined operation deliberately emits both event types and accept the bounded duration overhead. Add a test exercising both engines active at once.
c278669 to
90c80bd
Compare
…concurrency coverage test
90c80bd to
93362c8
Compare
What does this PR do?:
Adds
datadog.TaskBlockJFR events for blocking intervals such asLockSupport.park,Object.wait, monitor contention, including recording APIs used bydd-trace-javafor instrumented blocking operations such asThread.sleep.Motivation:
This builds on
paul.fournillon/wallclock-suppression(#560) by preserving visibility into blocked spans as explicit duration events.Additional Notes:
This PR does not add the dd-trace-java instrumentation itself;
Thread.sleepemission depends on that side calling the new profiler APIs. Monitor callback support is HotSpot-specific, and virtual-thread carrier attribution is avoided.How to test the change?:
./.claude/commands/build-and-summarize :ddprof-test:testDebug -Ptests="*.wallclock.*TaskBlockTest"./.claude/commands/build-and-summarize :ddprof-lib:gtestDebugFor Datadog employees:
credentials of any kind, I've requested a review from
@DataDog/security-design-and-guidance.Unsure? Have a question? Request a review!