From 75e153934458c58bd8ad74d672cd0c6c68a7ae11 Mon Sep 17 00:00:00 2001 From: Madhav Bissa Date: Thu, 2 Apr 2026 03:11:16 +0000 Subject: [PATCH 01/12] draft lb delay metric --- grpc-lb-delay-metrics.md | 910 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 910 insertions(+) create mode 100644 grpc-lb-delay-metrics.md diff --git a/grpc-lb-delay-metrics.md b/grpc-lb-delay-metrics.md new file mode 100644 index 000000000..7762ed9ab --- /dev/null +++ b/grpc-lb-delay-metrics.md @@ -0,0 +1,910 @@ +Load Balancer Pick Queue Delay Observability +---- +* Author(s): Madhav Bissa (@madhavbissa) +* Approver: TBD +* Status: Draft +* Implemented in: Go, Java, C++ +* Last updated: 2026-04-22 +* Discussion at: TBD + +## Abstract + +This proposal introduces observability for the duration that RPCs spend waiting in the gRPC client channel for a load balancing pick to complete. Today, when a picker indicates that no subchannel is available, the client channel then defers the RPC (either by waiting or blocking the caller) until the LB policy provides a new picker or the RPC context is canceled/times out. This latency is invisible to operators, making it difficult to diagnose which layer of the LB policy tree is causing the delay. + +This gRFC defines: +1. A new histogram metric (`grpc.lb.pick_delay_duration`) recorded by the client + channel, with bounded `grpc.lb.delay_reason` and optional `grpc.method` + labels. The histogram emits one observation per delay reason segment, + providing per-layer delay breakdown when an RPC transitions through multiple + LB policy states. +2. A new up-down counter metric (`grpc.lb.rpc_waiting`) tracking the + number of RPCs currently waiting, providing visibility into stuck traffic. +3. An enhancement to the existing "Delayed LB pick complete" span event ([A72]) + to include the delay reason and delay duration as span event attributes. +4. 
API changes in each language to allow pickers to expose the delay reason as a + bounded, pre-computed string token, with support for both picker-level tokens + (uniform policies) and per-pick tokens (RLS, cluster_manager). + +## Background + +gRPC uses a tree of load balancing policies to select a subchannel for each RPC. +When no suitable subchannel is available, the picker indicates this state to the client channel. The client channel then defers the RPC (either by waiting or blocking the caller) until the LB policy provides a new picker or the RPC's context is canceled or times out. +Operators currently have no visibility into: +- **How long** RPCs wait. +- **Why** they are delayed (which policy in the tree caused it). + +The existing `grpc.client.attempt.duration` metric (A66) includes pick delay but +does not isolate it, and provides no information about the cause of the delay. + +### Related Proposals + +- [A66: OpenTelemetry Metrics][A66] — Base instrumentation and stats plugin + architecture. +- [A72: OpenTelemetry Tracing][A72] — Tracing architecture including the + existing "Delayed LB pick complete" event. +- [A78: gRPC Metrics for WRR, Pick First, and xDS][A78] — LB policy metrics + patterns (`grpc.lb.*` naming convention, locality labels). +- [A79: Non-per-call Metrics Architecture][A79] — `GlobalInstrumentsRegistry`, + `MetricsRecorder`, metric descriptor registration, and stability levels. +- [A91: Outlier Detection Metrics][A91] — Template for LB-level non-per-call + metrics. +- [A94: Subchannel OTel Metrics][A94] — Subchannel-level metrics. +- [A56: Priority LB Policy][A56] — Priority policy, `initTimer`, failover. +- [A50: xDS Outlier Detection][A50] — Outlier detection ejection behavior. 
+ +## Proposal + +Using [A79]'s non-per-call metrics architecture and [A72]'s tracing framework, +we will add a histogram metric, an up-down counter of waiting RPCs, and an enhanced trace span +event that together expose the time each RPC spends waiting for a load balancing +pick. + +Each LB policy's picker provides a bounded string **delay reason token** +describing why it would delay RPCs (e.g., `"pick_first:connecting"`). For +policies with uniform picker state (pick_first, round_robin, ring_hash, +priority), the token is pre-computed at picker construction time. For policies +that make per-request routing decisions (RLS, cluster_manager), the token is +determined inside `Pick()` and varies per-RPC. Delegating policies compose +tokens by prepending their own prefix to their child picker's token (e.g., +`"priority_p0:pick_first:connecting"`). + +The client channel's pick loop reads this token from the picker when a +pick is delayed, records the delay start time, and emits one histogram +observation per delay reason segment. If the token changes during the wait +(e.g., an RLS lookup completes, causing a transition from `rls:lookup_pending` +to `rls:pick_first:connecting`), the current segment is closed and a new one +begins. An up-down counter tracks how many RPCs are currently waiting, +providing visibility into stuck traffic that has not yet emitted a histogram +observation. + +For policies with uniform picker state, tokens are pre-computed strings stored +on the picker, so the pick path reads them by reference with zero dynamic +allocations. The token API uses optional interfaces (Go) or default-returning +methods (Java, C++) so that existing custom LB policies continue to work without +modification: they simply report an empty token, and the metric is still +recorded with an empty `grpc.lb.delay_reason`, distinguished by `grpc.target` alone. + +### Metric Definition + +The following metric is registered via the non-per-call metrics framework +defined in [A79][A79]. 
+ +| Field | Value | +|---|---| +| **Name** | `grpc.lb.pick_delay_duration` | +| **Type** | Float64 Histogram | +| **Unit** | `s` (seconds) | +| **Description** | EXPERIMENTAL. Time an RPC spent waiting for a load balancing pick, broken down by the reason for the delay. | +| **Labels** | `grpc.target`, `grpc.lb.delay_reason` | +| **Optional Labels** | `grpc.method` | +| **Bucket Boundaries** | Same as A66 latency buckets: 0, 0.00001, 0.00005, 0.0001, 0.0003, 0.0006, 0.0008, 0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.008, 0.01, 0.013, 0.016, 0.02, 0.025, 0.03, 0.04, 0.05, 0.065, 0.08, 0.1, 0.13, 0.16, 0.2, 0.25, 0.3, 0.4, 0.5, 0.65, 0.8, 1, 2, 5, 10, 20, 50, 100 | +| **Default Enabled** | `false` (experimental, opt-in) | + +> **Note on histogram aggregation**: The gRPC non-per-call metrics framework +> ([A79]) currently supports only explicit bucket histograms. Explicit buckets +> lose fidelity for values between wide boundaries (e.g., values in the +> `[20, 50)` range are indistinguishable). When the framework adds support for +> exponential bucket histograms, this metric should be migrated to use +> exponential aggregation for better tail-latency fidelity. In the interim, +> operators can override the aggregation strategy to exponential using +> OpenTelemetry SDK Views at the application level. + +If the delay reason token changes during the wait (e.g., due to an RLS lookup +completing or a priority failover), the recording loop emits one histogram +observation **per delay reason segment**. A segment ends when a new picker +arrives with a different token, or when the wait ends. A single RPC may +therefore contribute multiple histogram observations, each covering the time +spent under a specific delay reason. This per-segment approach gives operators +visibility into how much time each layer of the LB policy tree contributed to +the total pick delay. 
+ +The following up-down counter metric is also registered to provide visibility into RPCs +that are currently waiting and have not yet emitted a histogram observation: + +| Field | Value | +|---|---| +| **Name** | `grpc.lb.rpc_waiting` | +| **Type** | Int64 UpDownCounter | +| **Unit** | `{call}` | +| **Description** | EXPERIMENTAL. Number of RPCs currently waiting for a load balancing pick. | +| **Labels** | `grpc.target` | +| **Default Enabled** | `false` (experimental, opt-in) | + +The counter is incremented (`+1`) when a pick is first deferred and decremented +(`-1`) when the wait ends (successful pick or cancellation). This matches +the pattern used by `grpc.subchannel.open_connections` ([A94]) and +`grpc.tcp.connection_count` ([A80]). + +#### Label Definitions + +| Label | Type | Description | +|---|---|---| +| `grpc.target` | String | The target URI of the channel. Required. | +| `grpc.lb.delay_reason` | String | The bounded delay reason token from the picker. Required. Empty when the picker provides no token. | +| `grpc.method` | String | The full method name of the RPC (e.g., `/pkg.Service/Method`). Optional label. Available from `PickInfo` at pick time. Primarily useful for policies like RLS where wait behavior varies by method. | + +### Delay Reason Token Semantics + +Delay reason tokens are static, bounded strings following the grammar: + +``` +token = terminal | prefix ":" token +prefix = "priority_p" index | "rls" | "cds" +index = 1*DIGIT +terminal = policy_name ":" state +policy_name = "pick_first" | "round_robin" | "ring_hash" | "rls" | "cds" +state = "connecting" | "lookup_pending" | "throttled" | "discovery_pending" +``` + +Dynamic values (IP addresses, cluster names, endpoint identifiers) are +**strictly forbidden** in tokens. + +#### Terminal Tokens + +Terminal tokens are emitted by leaf policies when they cannot complete a pick. + +| Policy | Token | When Emitted | +|---|---|---| +| `pick_first` | `pick_first:connecting` | No READY subchannel; connection attempt in progress. | +| `round_robin` | `round_robin:connecting` | No READY subchannel in the round-robin set. 
| +| `ring_hash` | `ring_hash:connecting` | The hashed (or scanned) endpoint is in CONNECTING or IDLE state. | +| `rls` | `rls:lookup_pending` | Waiting for an RLS server response for the request key. | +| `rls` | `rls:throttled` | RLS requests are being throttled due to server errors. | +| `cds` | `cds:discovery_pending` | Waiting for the xDS CDS resource. | + +#### Delegating Prefixes + +Delegating policies prepend a prefix to their child picker's token. + +| Policy | Prefix | Notes | +|---|---|---| +| `priority` | `priority_p{N}:` | N is the 0-based priority index. Typical deployments use ≤5 priorities, bounding cardinality. | +| `cluster_manager` | *(per-pick)* | Routes RPCs to different child pickers per-request. The token is the child picker's token, resolved at pick time. See Per-Pick Tokens. | +| `rls` | `rls:` | When routing to a resolved child policy (post-lookup). | +| `cds` | `cds:` | When delegating to a child policy (post-discovery). | +| `wrr_locality` | *(transparent)* | Passes child tokens through unmodified. | +| `outlier_detection` | *(transparent)* | Passes child tokens through unmodified; OD does not itself cause queueing. | + +#### Token Composition + +For most policies, token composition occurs at **picker construction time**, not +on the pick path. When a delegating policy creates its picker, it reads the +child picker's token and stores the composite string as a member of its own +picker. This ensures zero allocations during `Pick()`. + +**Example flow** for `priority → pick_first`: +1. `pick_first` creates a picker with token `"pick_first:connecting"`. +2. `priority` wraps the child, creating its picker with token + `"priority_p0:pick_first:connecting"` (stored as a member string). + +When a delegating policy has a **READY** child (no queueing), the token is the +empty string `""`, and the pick completes without queueing. 
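
As a rough sketch of composition at construction time, the snippet below wraps a leaf token with a delegating prefix exactly once, when the wrapper is built. The names `tokener`, `leafPicker`, `prefixPicker`, and `wrap` are hypothetical stand-ins chosen for this illustration. They only mirror the shape of the picker-level token accessor and are not gRPC types.

```go
package main

import "fmt"

// tokener mirrors the shape of a picker-level token accessor.
// It is a stand-in for illustration, not the real balancer interface.
type tokener interface {
	DelayMetricToken() string
}

// leafPicker stands in for a leaf policy's picker (e.g. pick_first).
type leafPicker struct{ token string }

func (p leafPicker) DelayMetricToken() string { return p.token }

// prefixPicker stands in for a delegating policy's picker. Its token is
// stored as a field, so reading it later does no string building.
type prefixPicker struct{ token string }

func (p prefixPicker) DelayMetricToken() string { return p.token }

// wrap composes the token once, at construction. A READY child (empty
// token) yields an empty composite token.
func wrap(prefix string, child tokener) prefixPicker {
	t := child.DelayMetricToken()
	if t == "" {
		return prefixPicker{}
	}
	return prefixPicker{token: prefix + t}
}

func main() {
	connecting := wrap("priority_p0:", leafPicker{token: "pick_first:connecting"})
	ready := wrap("priority_p0:", leafPicker{})
	fmt.Printf("%q %q\n", connecting.DelayMetricToken(), ready.DelayMetricToken())
}
```

Because a delegating picker is itself a token source, the same helper nests: wrapping `connecting` again with a `cds:` prefix would yield `cds:priority_p0:pick_first:connecting` without any work on the pick path.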
+ +#### Per-Pick Tokens + +Some policies make per-request routing decisions within `Pick()`, causing the +delay reason to vary across RPCs handled by the same picker. These policies +provide their token via the pick result rather than a picker-level method: + +- **RLS**: A single `rlsPicker` performs a per-request cache lookup. A cache + miss delays with `rls:lookup_pending`, while a cache hit that delegates to a + child in CONNECTING state delays with `rls:pick_first:connecting`. The token + must be determined inside `Pick()` and attached to the pending result. + +- **cluster_manager**: Routes RPCs to different child pickers based on the + cluster selected by xDS routing. Different RPCs may reach different child + pickers in different states. The token is read from whichever child picker + handled the specific RPC. + +The mechanism for attaching per-pick tokens is described in the language-specific +API sections below. + +### Language-Specific API Changes + +> [!NOTE] +> The code examples in the following sections are illustrative and simplified to demonstrate the design intent. Actual implementations may vary slightly across languages to conform to idiomatic patterns and existing codebase structure. + +#### Go + +Go's `Picker` interface has a single `Pick()` method. Adding a method to the interface would +break all existing implementations. Two mechanisms are therefore provided for supplying +delay reason tokens: + +**Picker-level token** — for policies with uniform picker state (pick_first, +round_robin, ring_hash, priority). An optional interface that pickers may +implement: + +```go +// In package balancer + +// DelayMetricTokener is an optional interface that Pickers can implement +// to provide a bounded metric token describing why this picker delays RPCs. +// The token MUST be a static, pre-computed string with no dynamic components. +type DelayMetricTokener interface { + // DelayMetricToken returns the bounded delay reason token. + // Returns "" if the picker does not delay RPCs. 
+ DelayMetricToken() string +} +``` + +**Per-pick token** — for policies where the delay reason varies per-RPC (RLS, +cluster_manager). A structured error type that wraps `ErrNoSubConnAvailable` +with a token: + +```go +// In package balancer + +// PendingPickError is returned by Pick() to signal a delay with a per-pick +// metric token. It is equivalent to ErrNoSubConnAvailable but carries +// a delay reason token that may vary per-RPC. +type PendingPickError struct { + // MetricToken is the bounded delay reason token for this specific pick. + MetricToken string +} + +func (e *PendingPickError) Error() string { return "no SubConn available" } +func (e *PendingPickError) Is(target error) bool { + return target == ErrNoSubConnAvailable +} +``` + +The recording loop resolves the token with the following precedence: if the +error is a `*PendingPickError`, use its `MetricToken`; otherwise, check whether the +picker implements `DelayMetricTokener`; otherwise, use the empty string. + +Built-in pickers (`pick_first`, `round_robin`, `ring_hash`, `priority`, etc.) +implement `DelayMetricTokener`. Custom LB policies that implement neither +mechanism report an empty token. + +**Example — `pick_first` (picker-level):** +```go +type pfPicker struct { + subConn balancer.SubConn + delayToken string // set at construction: "pick_first:connecting" or "" +} + +func (p *pfPicker) Pick(balancer.PickInfo) (balancer.PickResult, error) { + return balancer.PickResult{SubConn: p.subConn}, nil +} + +func (p *pfPicker) DelayMetricToken() string { + return p.delayToken +} +``` + +**Example — `rls` (per-pick):** +```go +func (p *rlsPicker) Pick(info balancer.PickInfo) (balancer.PickResult, error) { + // ... cache lookup using info.FullMethodName and request keys ... + switch { + case dcEntry == nil && pendingEntry == nil: + p.sendRouteLookupRequest(cacheKey, ...) 
+ return balancer.PickResult{}, + &balancer.PendingPickError{MetricToken: "rls:lookup_pending"} + case dcEntry == nil && pendingEntry != nil: + return balancer.PickResult{}, + &balancer.PendingPickError{MetricToken: "rls:lookup_pending"} + case dcEntry != nil: + // Delegate to child policy; the child's Pick() may return its + // own PendingPickError or ErrNoSubConnAvailable. + pr, err := childPicker.Pick(info) + if qe, ok := err.(*balancer.PendingPickError); ok { + // Return a fresh error instead of mutating the child's, + // which may be shared across picks. + return pr, &balancer.PendingPickError{ + MetricToken: "rls:" + qe.MetricToken, + } + } + return pr, err + } +} +``` + +**Example — `priority` (picker-level, delegating):** +```go +type priorityPicker struct { + childPicker balancer.Picker + delayToken string // pre-composed for static children + prefix string // "priority_p0:" +} + +func newPriorityPicker(child balancer.Picker, priorityName string) balancer.Picker { + prefix := priorityName + ":" + p := &priorityPicker{childPicker: child, prefix: prefix} + if dt, ok := child.(balancer.DelayMetricTokener); ok { + childToken := dt.DelayMetricToken() + if childToken != "" { + p.delayToken = prefix + childToken + } + } + return p +} + +func (p *priorityPicker) Pick(info balancer.PickInfo) (balancer.PickResult, error) { + res, err := p.childPicker.Pick(info) + if err != nil { + // If child returned a per-pick token, prepend our prefix + if pe, ok := err.(*balancer.PendingPickError); ok { + return res, &balancer.PendingPickError{ + MetricToken: p.prefix + pe.MetricToken, + } + } + } + return res, err +} + +func (p *priorityPicker) DelayMetricToken() string { + return p.delayToken +} +``` + +#### Java + +Java's `PickResult` is `final` and `@Immutable`. The per-pick token is carried +on the `PickResult` itself via a new factory method, which naturally supports +policies like RLS where the delay reason varies per-RPC. 
The `SubchannelPicker` +also gets a default method for delegating policies that have uniform state: + +```java +// In LoadBalancer.PickResult + +// New field (private, immutable) +@Nullable private final String delayMetricToken; + +// New factory method for delayed results with a reason token +public static PickResult withNoResult(String delayMetricToken) { + return new PickResult(null, null, Status.OK, false, null, delayMetricToken); +} + +// Getter +public String getDelayMetricToken() { + return delayMetricToken != null ? delayMetricToken : ""; +} +``` + +The existing `withNoResult()` continues to return the `NO_RESULT` singleton with +an empty token. The `SubchannelPicker` also gets a default method for +delegating policies to read the child's token: + +```java +// In LoadBalancer.SubchannelPicker + +/** + * Returns the bounded metric token describing why this picker delays RPCs. + * Default returns empty string (no delay information). + */ +public String getDelayMetricToken() { return ""; } +``` + +**Example implementation in `PickFirstLeafLoadBalancer`:** +```java +private class Picker extends SubchannelPicker { + private final String delayMetricToken; + + Picker(String token) { this.delayMetricToken = token; } + + @Override + public PickResult pickSubchannel(PickSubchannelArgs args) { + // When delaying, use the token-aware factory: + return PickResult.withNoResult(delayMetricToken); + } + + @Override + public String getDelayMetricToken() { return delayMetricToken; } +} +``` + +**Example implementation in a delegating load balancer (`PriorityLoadBalancer`):** +```java +private class PriorityPicker extends SubchannelPicker { + private final SubchannelPicker childPicker; + private final String delayMetricToken; + private final String prefix; + + PriorityPicker(SubchannelPicker child, String prefix) { + this.childPicker = child; + this.prefix = prefix; + // Pre-compose for static children + String childToken = child.getDelayMetricToken(); + this.delayMetricToken 
 = childToken.isEmpty() ? "" : prefix + childToken; + } + + @Override + public PickResult pickSubchannel(PickSubchannelArgs args) { + PickResult result = childPicker.pickSubchannel(args); + String childToken = result.getDelayMetricToken(); + if (childToken.isEmpty()) { + return result; + } + // Static child token: reuse the pre-composed string so the prefix is not lost. + if (childToken.equals(childPicker.getDelayMetricToken())) { + return PickResult.withNoResult(delayMetricToken); + } + // Per-pick child token: prepend our prefix. + return PickResult.withNoResult(prefix + childToken); + } + + @Override + public String getDelayMetricToken() { + return delayMetricToken; + } +} +``` + +#### C++ (Core) + +The `PickResult::Queue` struct is extended to hold a `string_view` carrying the +per-pick token. Since `Queue` is returned from `Pick()`, this naturally supports +policies where the token varies per-RPC. The `string_view` points to a +picker-owned string, involving zero dynamic allocations on the pick path: + +```cpp +// In LoadBalancingPolicy::PickResult + +struct Queue { + // Bounded delay reason token. Points to a string owned by the Picker. + // The Picker's lifetime exceeds the pick, so this is safe. + // Empty string_view means no token is provided. + absl::string_view metric_token; + + Queue() = default; + explicit Queue(absl::string_view token) : metric_token(token) {} +}; +``` + +The `SubchannelPicker` base class gets a virtual method for delegating policies: + +```cpp +class SubchannelPicker : public DualRefCounted<SubchannelPicker> { + public: + virtual PickResult Pick(PickArgs args) = 0; + + // Returns the bounded delay reason token. Default returns empty. 
+ virtual absl::string_view GetDelayMetricToken() const { return ""; } +}; +``` + +**Example implementation in a leaf picker:** +```cpp +class PickFirstQueuePicker final : public SubchannelPicker { + public: + PickResult Pick(PickArgs) override { + return PickResult::Queue(kToken); + } + absl::string_view GetDelayMetricToken() const override { return kToken; } + + private: + static constexpr absl::string_view kToken = "pick_first:connecting"; +}; +``` + +**Example delegating picker:** +```cpp +class PriorityPicker final : public SubchannelPicker { + public: + PriorityPicker(int priority_index, + RefCountedPtr<SubchannelPicker> child) + : child_(std::move(child)) { + // Compose token at construction time (stored as member string). + auto child_token = child_->GetDelayMetricToken(); + if (!child_token.empty()) { + composite_token_ = absl::StrCat("priority_p", + priority_index, ":", + child_token); + } + } + + PickResult Pick(PickArgs args) override { + auto result = child_->Pick(args); + if (auto* q = absl::get_if<PickResult::Queue>(&result.result)) { + q->metric_token = composite_token_; + } + return result; + } + + absl::string_view GetDelayMetricToken() const override { + return composite_token_; + } + + private: + RefCountedPtr<SubchannelPicker> child_; + std::string composite_token_; // owned by this picker +}; +``` + +### Language-Specific Recording Logic + +Recording is performed by the client channel's pick deferring infrastructure, not +by the LB policy itself. The LB policy provides the token; the channel measures +time and records the metric. 
+ +#### Go — `picker_wrapper.go` + +In the `pick()` method's blocking loop, the recording logic resolves the token +from either a `PendingPickError` or the `DelayMetricTokener` interface, and emits a +histogram observation each time the token changes (per-segment emission): + +```go +func (pw *pickerWrapper) pick(ctx context.Context, failfast bool, + info balancer.PickInfo) (pick, error) { + var delayStartTime time.Time + var delayToken string + var delayed bool + method := info.FullMethodName + + for { + // ... existing picker load and blocking logic ... + + pickResult, err := p.Pick(info) + if err != nil { + if errors.Is(err, balancer.ErrNoSubConnAvailable) { + // Resolve per-pick token from PendingPickError, or fall + // back to picker-level DelayMetricTokener. + currentToken := "" + if qe, ok := err.(*balancer.PendingPickError); ok { + currentToken = qe.MetricToken + } else if qt, ok := p.(balancer.DelayMetricTokener); ok { + currentToken = qt.DelayMetricToken() + } + + if !delayed { + // First delay event — start timing, increment counter. + delayed = true + delayStartTime = time.Now() + delayToken = currentToken + rpcWaitingMetric.Record(metricsRecorder, 1, target) + } else if currentToken != delayToken { + // Token changed (new picker with different state). + // Emit segment for the previous token. + duration := time.Since(delayStartTime).Seconds() + pickDelayDurationMetric.Record(metricsRecorder, + duration, target, delayToken, method) + // Start new segment. + delayStartTime = time.Now() + delayToken = currentToken + } + continue + } + // ... existing error handling ... + } + + // Pick succeeded. If we were delayed, record final segment. 
+ if delayed { + duration := time.Since(delayStartTime).Seconds() + pickDelayDurationMetric.Record(metricsRecorder, + duration, target, delayToken, method) + rpcWaitingMetric.Record(metricsRecorder, -1, target) + if attemptSpan != nil { + attemptSpan.AddEvent("Delayed LB pick complete", + attribute.Float64("delay_duration", duration), + attribute.String("delay_reason", delayToken)) + } + } + // ... existing transport handling ... + } +} +``` + +On context cancellation (existing `case <-ctx.Done():` block), the same +recording logic applies: +```go +case <-ctx.Done(): + if delayed { + duration := time.Since(delayStartTime).Seconds() + pickDelayDurationMetric.Record(metricsRecorder, + duration, target, delayToken, method) + rpcWaitingMetric.Record(metricsRecorder, -1, target) + if attemptSpan != nil { + attemptSpan.AddEvent("Delayed LB pick complete", + attribute.Float64("delay_duration", duration), + attribute.String("delay_reason", delayToken)) + } + } + // ... existing error return ... +``` + +The `metricsRecorder` and `target` are available from the `ClientConn` that owns +the `pickerWrapper`. + +#### Java — `DelayedClientTransport.java` + +When a `PendingStream` is created in `createPendingStream()`, the delay start +time and token are captured: + +```java +private PendingStream createPendingStream(PickSubchannelArgs args, + ClientStreamTracer[] tracers, PickResult pickResult) { + PendingStream pendingStream = new PendingStream(args, tracers); + pendingStream.delayStartNanos = System.nanoTime(); + + // Resolve token: prefer PickResult token (dynamic), fallback to Picker token (static) + String token = pickResult != null ? pickResult.getDelayMetricToken() : ""; + if (token.isEmpty() && pickerState.lastPicker != null) { + token = pickerState.lastPicker.getDelayMetricToken(); + } + pendingStream.delayToken = token; + // ... existing logic ... 
+} +``` + +When the stream is dequeued in `reprocess()` (transport obtained), or cancelled, +the duration is recorded: + +```java +if (transport != null) { + long durationNanos = System.nanoTime() - stream.delayStartNanos; + double durationSeconds = durationNanos / 1_000_000_000.0; + metricsRecorder.recordPickDelayDuration(durationSeconds, + target, stream.delayToken); + stream.getTracer().recordAnnotation("Delayed LB pick complete", + Attributes.of( + "delay_duration", durationSeconds, + "delay_reason", stream.delayToken)); + // ... existing stream creation ... +} +``` + +#### C++ (Core) — `load_balanced_call_destination.cc` + +In the pick loop, when `PickResult::Queue` is returned, the start time and token +are captured. If the token changes during the wait, the current segment is recorded +and a new one begins. When the pick eventually succeeds or the call is cancelled, the +final duration is recorded: + +```cpp +// In the pick processing lambda: +auto delay_func = [&](LoadBalancingPolicy::PickResult::Queue* delay_pick) { + std::string current_token = std::string(delay_pick->metric_token); + if (!delay_start_time.has_value()) { + delay_start_time = Timestamp::Now(); + delay_token = current_token; + } else if (current_token != delay_token) { + // Token changed: Emit segment for the previous token + Duration delay_duration = Timestamp::Now() - *delay_start_time; + stats_plugin_group.RecordHistogram( + kPickDelayDurationHandle, + delay_duration.seconds(), + {target}, {delay_token}); + // Start new segment + delay_start_time = Timestamp::Now(); + delay_token = current_token; + } + return Continue{}; +}; + +// On successful pick (in the completion callback): +if (delay_start_time.has_value()) { + Duration delay_duration = Timestamp::Now() - *delay_start_time; + stats_plugin_group.RecordHistogram( + kPickDelayDurationHandle, + delay_duration.seconds(), + {target}, {delay_token}); + call_tracer->RecordAnnotation("Delayed LB pick complete", { + {"delay_duration", 
absl::StrCat(delay_duration.seconds())}, + {"delay_reason", delay_token} + }); +} +``` + +#### C-Core Specifics: RLS Cache Misses and HTTP/2 Max Concurrent Streams + +In `grpc-core`, the boundary between Load Balancing and the Transport layer introduces two highly specific queuing scenarios that must be handled distinctly in the `PickResult::Queue` tokenization. + +**1. RLS Cache Misses (Dynamic Token Lifetime)** +Unlike static policies (like `pick_first`), the RLS policy in C-Core evaluates targets dynamically per-call. When a cache miss occurs, the RLS policy initiates a control-plane request and returns `PickResult::Queue`. +* **The C-Core Constraint:** The `PickResult::Queue` struct uses an `absl::string_view` to prevent allocations on the hot path. However, because an RLS cache miss is dynamic, the string it points to must safely outlive the pick. +* **Implementation:** The RLS LB Policy must maintain a static, pre-allocated pool of `std::string` constants for its state transitions (e.g., `static const std::string kRlsPending = "rls:lookup_pending";`). During a cache miss `Pick()`, the policy returns a `string_view` pointing explicitly to this constant memory address, ensuring safe reference lifecycle without dynamic memory allocation per RPC. + +**2. MAX_CONCURRENT_STREAMS (Transport-Level Delay)** +In C-Core, waiting can occur even when the LB Policy successfully finds a `READY` subchannel. If the HTTP/2 transport for that subchannel has reached its `SETTINGS_MAX_CONCURRENT_STREAMS` limit, the `grpc_call` must wait. +* **The Distinction:** This is a *transport* delay, not an *LB routing* delay. +* **Implementation:** The LB policy's `Pick()` method will actually succeed and return a `READY` subchannel. However, the `client_channel` filter will detect the transport exhaustion. To maintain observability, the `client_channel` filter itself will inject a synthetic token: `"transport:max_concurrent_streams"`. 
The delay timer will start, and the RPC will wait in the filter's pending list until a stream becomes available or the context deadline is exceeded. This clearly separates network capacity limits from LB routing failures in the resulting telemetry. + +### Metric Instrument Registration + +#### Go + +```go +import estats "google.golang.org/grpc/experimental/stats" + +var pickDelayDurationMetric = estats.RegisterFloat64Histo( + estats.MetricDescriptor{ + Name: "grpc.lb.pick_delay_duration", + Description: "EXPERIMENTAL. Time an RPC spent waiting " + + "for a load balancing pick.", + Unit: "s", + Labels: []string{"grpc.target", "grpc.lb.delay_reason"}, + OptionalLabels: []string{"grpc.method"}, + Default: false, + }) + +var rpcWaitingMetric = estats.RegisterInt64UpDownCount( + estats.MetricDescriptor{ + Name: "grpc.lb.rpc_waiting", + Description: "EXPERIMENTAL. Number of RPCs currently waiting " + + "for a load balancing pick.", + Unit: "{call}", + Labels: []string{"grpc.target"}, + Default: false, + }) +``` + +#### Java + +```java +private static final DoubleHistogramMetricInstrument PICK_DELAY_DURATION = + InstrumentRegistry.registerDoubleHistogram( + "grpc.lb.pick_delay_duration", + "EXPERIMENTAL. Time an RPC spent waiting " + + "for a load balancing pick.", + "s", + List.of("grpc.target", "grpc.lb.delay_reason"), // required labels + List.of("grpc.method"), // optional labels + false); // default disabled + +private static final LongUpDownCounterMetricInstrument RPC_WAITING = + InstrumentRegistry.registerLongUpDownCounter( + "grpc.lb.rpc_waiting", + "EXPERIMENTAL. Number of RPCs currently waiting " + + "for a load balancing pick.", + "{call}", + List.of("grpc.target"), + List.of(), + false); +``` + +#### C++ (Core) + +```cpp +const auto kPickDelayDurationHandle = + GlobalInstrumentsRegistry::RegisterDoubleHistogram( + "grpc.lb.pick_delay_duration", + "EXPERIMENTAL. 
Time an RPC spent waiting " + "for a load balancing pick.", + "s", + /*label_keys=*/{"grpc.target", "grpc.lb.delay_reason"}, + /*optional_label_keys=*/{"grpc.method"}, + /*enable_by_default=*/false); + +const auto kRpcWaitingHandle = + GlobalInstrumentsRegistry::RegisterUpDownInt64Counter( + "grpc.lb.rpc_waiting", + "EXPERIMENTAL. Number of RPCs currently waiting " + "for a load balancing pick.", + "{call}", + /*label_keys=*/{"grpc.target"}, + /*optional_label_keys=*/{}, + /*enable_by_default=*/false); +``` + +### Metric Stability + +Per [A79][A79], this metric starts as **experimental** and **disabled by +default**. Users must explicitly opt in via their OpenTelemetry plugin +configuration. The metric description is prefixed with `EXPERIMENTAL.` to +signal this status. + +The metric will be promoted to stable after it has been implemented and validated +in all three languages, with at least one release cycle of user feedback. + +### Temporary environment variable protection + +The metric will be guarded by the environment variable +`GRPC_EXPERIMENTAL_ENABLE_PICK_DELAY_METRIC`. When set to `true`, the metric +will be registered and reported even if the user has not explicitly enabled it +via the OpenTelemetry plugin configuration. This guard will be removed once the +feature is deemed stable. + +### Tracing Enhancement + +[A72] defines a "Delayed LB pick complete" span event on the attempt span, +emitted when an RPC experiences load balancer pick delay. Today, this event +carries no attributes. This proposal enhances it with the following attributes: + +| Attribute Key | Type | Description | +|---|---|---| +| `delay_duration` | double | The wait duration in seconds. | +| `delay_reason` | string | The delay reason token from the picker. | + +The event is emitted at the point where the wait ends (successful pick or +cancellation), on the attempt span. The attribute values come from the same +delay start time and token already captured for the metric. 
+ +This enhancement is additive to [A72] — the event name remains "Delayed LB pick +complete" and the event is still emitted only when the RPC experienced delay. +The only change is the addition of attributes. If tracing is not configured via +the OpenTelemetry plugin, no span events are emitted and there is no overhead. + +## Rationale + +The metric is recorded by the client channel's pick deferring infrastructure +rather than by the LB policy itself, because the LB policy knows its internal +state but does not know how many RPCs are waiting or for how long. Only the +channel physically buffers RPCs and can accurately measure per-RPC delay +duration. This mirrors the existing design where per-call metrics ([A66]) are +recorded by the channel, not by LB policies. + +Two mechanisms are provided for supplying delay reason tokens — a picker-level +optional interface (`DelayMetricTokener`) and a per-pick error/result type +(`PendingPickError` / `PickResult.withNoResult(token)` / `Queue::metric_token`). Most +LB policies have uniform picker state: all RPCs see the same delay reason from a +given picker instance (pick_first, round_robin, ring_hash, priority). These +policies use the picker-level mechanism, which involves zero allocations on the +pick path. However, RLS and cluster_manager make per-request routing decisions +within `Pick()` — RLS does a per-request cache lookup, and cluster_manager +routes RPCs to different child pickers based on the xDS-selected cluster. For +these policies, the delay reason varies per-RPC, requiring the token to be +determined inside `Pick()` and attached to the pending result. + +The recording loop emits one histogram observation per delay reason segment +rather than a single observation per RPC. An RPC that transitions through +multiple delay reasons (e.g., `rls:lookup_pending` for 10 seconds, then +`rls:pick_first:connecting` for 0.1 seconds after the lookup completes) produces +two histogram observations, one for each segment. 
This per-segment approach +provides visibility into how much time each layer of the LB policy tree +contributed to the total pick delay. Without it, a single 10.1-second +observation attributed to `rls:lookup_pending` would obscure the fact that the +RLS lookup accounted for the vast majority of the delay. + +The `grpc.lb.rpc_waiting` up-down counter is provided because the +histogram metric is emitted only when a wait ends. RPCs that are +indefinitely delayed (e.g., target unreachable, infinite backoff) +never emit a histogram observation. The counter gives operators a real-time +count of currently-waiting RPCs, enabling alerting on stuck traffic. Every LB +policy in the tree can cause indefinite waiting (pick_first cycling through +CONNECTING/TF, RLS server unreachable, CDS resource never arriving), making this +counter essential. An up-down counter is used rather than a callback gauge +because the recording loop has the exact synchronous entry/exit points, matching +the pattern used by `grpc.subchannel.open_connections` ([A94]) and +`grpc.tcp.connection_count` ([A80]). + +`grpc.method` is included as an optional label because some policies (RLS, +cluster_manager) make routing decisions based on the method name, causing wait +behavior to vary across methods. For simpler policies (pick_first, +round_robin), the delay behavior is method-independent, but operators may still +want per-method delay analysis to correlate with per-call attempt metrics. As an +optional label, it defaults to off and only adds cardinality when explicitly +enabled. + +`wrr_locality` and `outlier_detection` are treated as transparent because +neither makes independent delaying decisions. `wrr_locality` is a configuration +wrapper; `outlier_detection` manipulates subchannel state but the actual +delaying is decided by the child policy's picker. 
When outlier detection ejects +100% of endpoints, the child policy (e.g., pick_first) enters +TRANSIENT_FAILURE and returns an error rather than a pending result, so no delay +metric is recorded. In partial ejection scenarios, the child is genuinely in +CONNECTING state and the token is accurate. Operators can correlate delay +duration with the existing `grpc.lb.outlier_detection.ejections_enforced` metric +([A91]) for root cause analysis. + +The tracing enhancement reuses the same bounded delay reason token as the metric +label rather than introducing a separate high-cardinality trace reason string. +This keeps the API surface minimal — one token serves both metrics and traces — +and avoids the need for pickers to maintain two separate strings. If richer, +dynamic trace context (e.g., IP addresses, RLS keys) is needed in the future, +it can be added as additional span event attributes without changing the token +API. + +## Implementation + +Implementation will proceed in Go, Java, and C++ (Core), in that order. + +[A66]: A66-otel-stats.md +[A72]: A72-open-telemetry-tracing.md +[A78]: A78-grpc-metrics-wrr-pf-xds.md +[A79]: A79-non-per-call-metrics-architecture.md +[A80]: A80-tcp-metrics.md +[A91]: A91-outlier-detection-metrics.md +[A94]: A94-subchannel-otel-metrics.md +[A56]: A56-priority-lb-policy.md +[A50]: A50-xds-outlier-detection.md \ No newline at end of file From 631f5f9de79721fe5182a0dca7653752f61f3ef8 Mon Sep 17 00:00:00 2001 From: Madhav Bissa Date: Mon, 1 Jun 2026 04:17:49 +0000 Subject: [PATCH 02/12] delay metric rough --- grpc-lb-delay-metrics.md | 100 +++++++++++++++++++++++++++++++++------ 1 file changed, 86 insertions(+), 14 deletions(-) diff --git a/grpc-lb-delay-metrics.md b/grpc-lb-delay-metrics.md index 7762ed9ab..c54e62806 100644 --- a/grpc-lb-delay-metrics.md +++ b/grpc-lb-delay-metrics.md @@ -274,6 +274,29 @@ Built-in pickers (`pick_first`, `round_robin`, `ring_hash`, `priority`, etc.) implement `DelayMetricTokener`. 
Custom LB policies that implement neither mechanism report an empty token. +**Tracing Signals** — To support the new child span design in Go without introducing OpenTelemetry dependencies in the core library, we introduce new internal stats signals. These are emitted by the channel infrastructure and handled by the registered stats handler (e.g., the OpenTelemetry plugin): + +```go +// In package stats + +// DelayStart indicates that the RPC has entered a delay period. +// This is a signal to the stats handler to start a new child span. +type DelayStart struct { + // Reason is the bounded token describing the cause of delay. + Reason string +} + +func (*DelayStart) IsClient() bool { return true } +func (*DelayStart) isRPCStats() {} + +// DelayEnd indicates that the current delay period has resolved. +// This is a signal to the stats handler to end the child span. +type DelayEnd struct{} + +func (*DelayEnd) IsClient() bool { return true } +func (*DelayEnd) isRPCStats() {} +``` + **Example — `pick_first` (picker-level):** ```go type pfPicker struct { @@ -390,6 +413,30 @@ delegating policies to read the child's token: public String getDelayMetricToken() { return ""; } ``` +**Tracing APIs** — To support the new child span design in Java (and to avoid relying on span events), we add new methods with no-op default implementations to `ClientStreamTracer`. These allow the channel to notify the tracer when a delay starts and ends. (Note: `ClientStreamTracer` corresponds to `CallAttemptTracer` in C++. If a separate `CallTracer` is introduced for call-level tracking, it would have similar methods.) + +```java +// In package io.grpc + +public abstract class ClientStreamTracer extends StreamTracer { + // ... existing methods ... + + /** + * Called when the RPC enters a delay period (e.g., waiting for a pick). + * The implementation should create a new child span for the delay. + * + * @param reason The bounded token describing the cause of delay. 
+ */ + public void recordDelayStart(String reason) {} + + /** + * Called when the current delay period resolves. + * The implementation should end the child span. + */ + public void recordDelayEnd() {} +} +``` + **Example implementation in `PickFirstLeafLoadBalancer`:** ```java private class Picker extends SubchannelPicker { @@ -474,6 +521,34 @@ class SubchannelPicker : public DualRefCounted { }; ``` +**Tracing APIs** — To support the new child span design in C++ (and to avoid relying on span events), we add new virtual methods to both `ClientCallTracerInterface` and `ClientCallTracerInterface::CallAttemptTracer`. These allow the channel to notify the tracer when a delay starts and ends at both the call and attempt levels: + +```cpp +// In src/core/telemetry/call_tracer.h + +class ClientCallTracerInterface : public CallTracerAnnotationInterface { + public: + // ... existing methods ... + + // Records the start of a delay period at the call level. + virtual void RecordDelayStart(absl::string_view reason) = 0; + + // Records the end of the current delay period at the call level. + virtual void RecordDelayEnd() = 0; + + class CallAttemptTracer : public CallTracerInterface { + public: + // ... existing methods ... + + // Records the start of a delay period at the attempt level. + virtual void RecordDelayStart(absl::string_view reason) = 0; + + // Records the end of the current delay period at the attempt level. + virtual void RecordDelayEnd() = 0; + }; +}; +``` + **Example implementation in a leaf picker:** ```cpp class PickFirstQueuePicker final : public SubchannelPicker { @@ -807,23 +882,20 @@ feature is deemed stable. ### Tracing Enhancement -[A72] defines a "Delayed LB pick complete" span event on the attempt span, -emitted when an RPC experiences load balancer pick delay. Today, this event -carries no attributes. 
This proposal enhances it with the following attributes: +To provide high-fidelity visibility into delays, this proposal introduces a **generic delay framework** using child spans, rather than just adding events to existing spans. This allows capturing the start, end, and specific reason for any processing delay (not limited to load balancing). -| Attribute Key | Type | Description | -|---|---|---| -| `delay_duration` | double | The wait duration in seconds. | -| `delay_reason` | string | The delay reason token from the picker. | +When an RPC experiences a delay (such as waiting for a load balancer pick), the channel will create a **new child span** named simply **`"Delay"`**. + +This child span will have the following attributes: +* `delay_type`: Describes the category of delay. For this proposal, it will be `"load_balancing"`. +* `delay_reason`: The specific bounded token describing why the delay happened (e.g., `"pick_first:connecting"`, `"rls:lookup_pending"`). -The event is emitted at the point where the wait ends (successful pick or -cancellation), on the attempt span. The attribute values come from the same -delay start time and token already captured for the metric. +The span is created when the delay condition begins and is ended when the condition resolves or transitions to a new reason (in which case a new segment span is created). This provides a clear, time-accurate visualization of delay segments in the trace. -This enhancement is additive to [A72] — the event name remains "Delayed LB pick -complete" and the event is still emitted only when the RPC experienced delay. -The only change is the addition of attributes. If tracing is not configured via -the OpenTelemetry plugin, no span events are emitted and there is no overhead. +To support this in each language: +* **C++**: New APIs in `ClientCallTracerInterface::CallAttemptTracer` to start and end delay spans. +* **Java**: New default methods in `ClientStreamTracer` to start and end delay spans. 
+* **Go**: New internal stats signals (`stats.DelayStart` and `stats.DelayEnd`) to notify the stats handler when to manage the delay span lifecycle. ## Rationale From 8ca5bbcacc0a1c7d41dde6d090437cc25bf1ddb2 Mon Sep 17 00:00:00 2001 From: Madhav Bissa Date: Fri, 19 Jun 2026 14:09:19 +0000 Subject: [PATCH 03/12] newer changes after review --- grpc-lb-delay-metrics.md | 1332 ++++++++++++++------------------------ 1 file changed, 477 insertions(+), 855 deletions(-) diff --git a/grpc-lb-delay-metrics.md b/grpc-lb-delay-metrics.md index c54e62806..eef2899a6 100644 --- a/grpc-lb-delay-metrics.md +++ b/grpc-lb-delay-metrics.md @@ -1,982 +1,604 @@ -Load Balancer Pick Queue Delay Observability +# RPC Delay Observability ---- * Author(s): Madhav Bissa (@madhavbissa) -* Approver: TBD -* Status: Draft +* Approver: markdroth * Implemented in: Go, Java, C++ -* Last updated: 2026-04-22 -* Discussion at: TBD +* Last updated: 2026-06-19 +* Discussion at: (filled after thread exists) ## Abstract -This proposal introduces observability for the duration that RPCs spend waiting in the gRPC client channel for a load balancing pick to complete. Today, when a picker indicates that no subchannel is available, the client channel then defers the RPC (either by waiting or blocking the caller) until the LB policy provides a new picker or the RPC context is canceled/times out. This latency is invisible to operators, making it difficult to diagnose which layer of the LB policy tree is causing the delay. - -This gRFC defines: -1. A new histogram metric (`grpc.lb.pick_delay_duration`) recorded by the client - channel, with bounded `grpc.lb.delay_reason` and optional `grpc.method` - labels. The histogram emits one observation per delay reason segment, - providing per-layer delay breakdown when an RPC transitions through multiple - LB policy states. -2. A new up-down counter metric (`grpc.lb.rpc_waiting`) tracking the - number of RPCs currently waiting, providing visibility into stuck traffic. -3. 
An enhancement to the existing "Delayed LB pick complete" span event ([A72]) - to include the delay reason and delay duration as span event attributes. -4. API changes in each language to allow pickers to expose the delay reason as a - bounded, pre-computed string token, with support for both picker-level tokens - (uniform policies) and per-pick tokens (RLS, cluster_manager). +This proposal introduces client-side metrics and tracing to measure the delays an RPC experiences within the client channel before it is sent over the network. In this context, "delay" refers to the time an RPC spends blocked or queued inside the channel waiting for name resolution, configuration parsing, or downstream connection establishment. To expose these states, this document proposes two new histograms, two dedicated tracing spans, and corresponding additions to the CallTracer and CallAttemptTracer APIs. + ## Background -gRPC uses a tree of load balancing policies to select a subchannel for each RPC. -When no suitable subchannel is available, the picker indicates this state to the client channel. The client channel then defers the RPC (either by waiting or blocking the caller) until the LB policy provides a new picker or the RPC's context is canceled or times out. -Operators currently have no visibility into: -- **How long** RPCs wait. -- **Why** they are delayed (which policy in the tree caused it). - -The existing `grpc.client.attempt.duration` metric (A66) includes pick delay but -does not isolate it, and provides no information about the cause of the delay. - -### Related Proposals - -- [A66: OpenTelemetry Metrics][A66] — Base instrumentation and stats plugin - architecture. -- [A72: OpenTelemetry Tracing][A72] — Tracing architecture including the - existing "Delayed LB pick complete" event. -- [A78: gRPC Metrics for WRR, Pick First, and xDS][A78] — LB policy metrics - patterns (`grpc.lb.*` naming convention, locality labels). 
-- [A79: Non-per-call Metrics Architecture][A79] — `GlobalInstrumentsRegistry`, - `MetricsRecorder`, metric descriptor registration, and stability levels. -- [A91: Outlier Detection Metrics][A91] — Template for LB-level non-per-call - metrics. -- [A94: Subchannel OTel Metrics][A94] — Subchannel-level metrics. -- [A56: Priority LB Policy][A56] — Priority policy, `initTimer`, failover. -- [A50: xDS Outlier Detection][A50] — Outlier detection ejection behavior. +Existing gRPC core telemetry infrastructure, as defined in [gRPC A66 (OpenTelemetry Metrics)][A66] and [gRPC A72 (OpenTelemetry Tracing)][A72], tracks the overall end-to-end duration of RPC calls and attempts. However, these metrics and spans function as aggregate buckets that do not decompose latency, leaving delays inside the client channel invisible to operators. -## Proposal +Before an RPC attempt can be sent over the network, the client channel must perform several critical operations, including resolving the target name, parsing service configurations, instantiating load balancing policies, and obtaining a connectivity picker. If any of these phases stall—such as during a slow DNS lookup, a Route Lookup Service (RLS) control-plane query, or a Cluster Discovery Service (CDS) metadata fetch—the RPC is delayed. To the application, this appears as high latency or a timeout. However, because current telemetry lacks visibility into these resolution and routing states, developers cannot distinguish between a slow network, a slow backend, or a channel initialization delay. This is particularly challenging for clients utilizing the Route Lookup Service (RLS), where diagnosing RLS-related hangs or isolating whether delays stem from pending route lookups requires complex manual debugging. 
-Using [A79]'s non-per-call metrics architecture and [A72]'s tracing framework, -we will add a histogram metric, an active waiting RPC counter, and enhanced trace span -event that measure the time each RPC spends waiting for a load balancing -pick. - -Each LB policy's picker provides a bounded string **delay reason token** -describing why it would delay RPCs (e.g., `"pick_first:connecting"`). For -policies with uniform picker state (pick_first, round_robin, ring_hash, -priority), the token is pre-computed at picker construction time. For policies -that make per-request routing decisions (RLS, cluster_manager), the token is -determined inside `Pick()` and varies per-RPC. Delegating policies compose -tokens by prepending their own prefix to their child picker's token (e.g., -`"priority_p0:pick_first:connecting"`). - -The client channel's pick loop reads this token from the picker when a -pick is delayed, records the delay start time, and emits one histogram -observation per delay reason segment — if the token changes during the wait -(e.g., an RLS lookup completes, causing a transition from `rls:lookup_pending` -to `rls:pick_first:connecting`), the current segment is closed and a new one -begins. An up-down counter tracks how many RPCs are currently waiting, -providing visibility into stuck traffic that has not yet emitted a histogram -observation. - -For policies with uniform picker state, tokens are pre-computed strings stored -on the picker, so the pick path reads them by reference with zero dynamic -allocations. The token API uses optional interfaces (Go) or default-returning -methods (Java, C++) so that existing custom LB policies continue to work without -modification — they simply report an empty token, and the metric is still -recorded under the `grpc.target` label alone. - -### Metric Definition - -The following metric is registered via the non-per-call metrics framework -defined in [A79][A79]. 
+While there are many potential sources of delay along the client-side pipeline (including interceptor execution, credential fetching, and filter evaluation), this proposal focuses specifically on introducing observability for the two most common bottlenecks: **name resolution** and **load balancing pick** delays. This proposal establishes a generic telemetry framework extensible to other client-side delays in the future. -| Field | Value | -|---|---| -| **Name** | `grpc.lb.pick_delay_duration` | -| **Type** | Float64 Histogram | -| **Unit** | `s` (seconds) | -| **Description** | EXPERIMENTAL. Time an RPC spent waiting for a load balancing pick, broken down by the reason for the delay. | -| **Labels** | `grpc.target`, `grpc.lb.delay_reason` | -| **Optional Labels** | `grpc.method` | -| **Bucket Boundaries** | Same as A66 latency buckets: 0, 0.00001, 0.00005, 0.0001, 0.0003, 0.0006, 0.0008, 0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.008, 0.01, 0.013, 0.016, 0.02, 0.025, 0.03, 0.04, 0.05, 0.065, 0.08, 0.1, 0.13, 0.16, 0.2, 0.25, 0.3, 0.4, 0.5, 0.65, 0.8, 1, 2, 5, 10, 20, 50, 100 | -| **Default Enabled** | `false` (experimental, opt-in) | -> **Note on histogram aggregation**: The gRPC non-per-call metrics framework -> ([A79]) currently supports only explicit bucket histograms. Explicit buckets -> lose fidelity for values between wide boundaries (e.g., values in the -> `[20, 50)` range are indistinguishable). When the framework adds support for -> exponential bucket histograms, this metric should be migrated to use -> exponential aggregation for better tail-latency fidelity. In the interim, -> operators can override the aggregation strategy to exponential using -> OpenTelemetry SDK Views at the application level. - -If the delay reason token changes during the wait (e.g., due to an RLS lookup -completing or a priority failover), the recording loop emits one histogram -observation **per delay reason segment**. 
A segment ends when a new picker -arrives with a different token, or when the wait ends. A single RPC may -therefore contribute multiple histogram observations, each covering the time -spent under a specific delay reason. This per-segment approach gives operators -visibility into how much time each layer of the LB policy tree contributed to -the total pick delay. - -The following gauge metric is also registered to provide visibility into RPCs -that are currently waiting and have not yet emitted a histogram observation: +### Related Proposals: +* **[gRPC A66: OpenTelemetry Metrics][A66]**: Establishes base OpenTelemetry metrics. +* **[gRPC A72: OpenTelemetry Tracing][A72]**: Establishes tracing spans and events. +* **[gRPC A56: Priority LB Policy][A56]**: Establishes the priority load balancing policy and child failover mechanics. +* **[gRPC A28: xDS Traffic Splitting and Routing][A28]**: Establishes the `xds_cluster_manager` and `weighted_target` container policies. -| Field | Value | -|---|---| -| **Name** | `grpc.lb.rpc_waiting` | -| **Type** | Int64 UpDownCounter | -| **Unit** | `{call}` | -| **Description** | EXPERIMENTAL. Number of RPCs currently waiting for a load balancing pick. | -| **Labels** | `grpc.target` | -| **Default Enabled** | `false` (experimental, opt-in) | -The counter is incremented (`+1`) when a pick is first deferred and decremented -(`-1`) when the wait ends (successful pick or cancellation). This matches -the pattern used by `grpc.subchannel.open_connections` ([A94]) and -`grpc.tcp.connection_count` ([A80]). +## Proposal -#### Label Definitions +### 1. Telemetry Schema -| Label | Type | Description | -|---|---|---| -| `grpc.target` | String | The target URI of the channel. Required. | -| `grpc.lb.delay_reason` | String | The bounded delay reason token from the picker. Optional label. | -| `grpc.method` | String | The full method name of the RPC (e.g., `/pkg.Service/Method`). Optional label. Available from `PickInfo` at pick time. 
Primarily useful for policies like RLS where wait behavior varies by method. | +To measure client-side delays, we introduce a duration histogram and a dedicated tracing span at the **Call Level** (for name resolution delays), and another histogram and span at the **Attempt Level** (for load balancing pick delays). -### Delay Reason Token Semantics +We split observability data between low-cardinality metrics and high-cardinality debug tracing across two layers: -Delay reason tokens are static, bounded strings following the grammar: +1. **Call Level Observability (Channel-level Delays)**: + * **Scope**: Measures delays that occur at the channel level before an individual network attempt is initiated. This is primarily used to observe name resolution and configuration resolution delays. + * **Metric**: `grpc.client.call.delay.duration` (Float64 Histogram). + * **Tracing Span**: A child span named **`"Call Delay"`**, created as a child of the main RPC Call span. +2. **Attempt Level Observability (Attempt-level Delays)**: + * **Scope**: Measures delays that occur within the lifecycle of a specific RPC attempt, primarily during the load balancing pick loop and connection establishment. + * **Metric**: `grpc.client.attempt.delay.duration` (Float64 Histogram). + * **Tracing Span**: A child span named **`"Attempt Delay"`**, created as a child of the individual Attempt span. -``` -token = terminal | prefix ":" token -prefix = policy_prefix -terminal = policy_name ":" state -policy_name = "pick_first" | "round_robin" | "ring_hash" | "rls" | "cds" -``` +#### Span Attributes and Events -Dynamic values (IP addresses, cluster names, endpoint identifiers) are -**strictly forbidden** in tokens. +For both layers, a child span is created for each distinct `grpc.delay_type`, and transitions within that type are recorded as span events: -#### Terminal Tokens +* **`grpc.delay_type` (Span Attribute & Metric Label)**: A low-cardinality, composed string representing the type of the delay. 
The value format is `[parent_prefix:]base_type` (e.g., `connecting`, `p0:connecting`, or `rls_lookup_pending`). It is recorded as a **span attribute** on the child span and as a **metric label** on the corresponding duration histogram. +* **`grpc.delay_reason` (Span Event Attribute)**: A high-cardinality string capturing runtime details (such as target names, subchannel IP addresses, or connection error messages). -Terminal tokens are emitted by leaf policies when they cannot complete a pick. +When a delay begins, the tracer creates the child span (`"Call Delay"` or `"Attempt Delay"`) carrying the `grpc.delay_type` attribute. It also records the initial reason as a span event named `"delay_state_transition"` containing the `grpc.delay_reason` string. -| Policy | Token | When Emitted | -|---|---|---| -| `pick_first` | `pick_first:connecting` | No READY subchannel; connection attempt in progress. | -| `round_robin` | `round_robin:connecting` | No READY subchannel in the round-robin set. | -| `ring_hash` | `ring_hash:connecting` | The hashed (or scanned) endpoint is in CONNECTING or IDLE state. | -| `rls` | `rls:lookup_pending` | Waiting for an RLS server response for the request key. | -| `rls` | `rls:throttled` | RLS requests are being throttled due to server errors. | -| `cds` | `cds:discovery_pending` | Waiting for the xDS CDS resource. | +If the state changes but the `grpc.delay_type` remains the same (e.g., a priority policy fails over or RLS changes backend targets), the tracer emits a new `"delay_state_transition"` span event on the active child span with the updated `grpc.delay_reason` string, without recreating the span. -#### Delegating Prefixes +When the delay resolves or transitions to a different delay type, the active child span is closed. -Delegating policies prepend a prefix to their child picker's token. +```mermaid +graph TD + ParentSpan["Parent RPC Attempt Span"] --> DelaySpan["Child Span: 'Attempt Delay'
(Attribute: grpc.delay_type = 'connecting')"] + DelaySpan --> Event1["Span Event @ 0ms: 'delay_state_transition'
(Attribute: grpc.delay_reason = 'subchannel_idle_trigger')"] + DelaySpan --> Event2["Span Event @ 100ms: 'delay_state_transition'
(Attribute: grpc.delay_reason = 'subchannel_connecting')"] + DelaySpan --> Event3["Span Event @ 200ms: 'delay_state_transition'
(Attribute: grpc.delay_reason = 'waiting_for_health_report')"] +``` -| Policy | Prefix | Notes | -|---|---|---| -| `priority` | `priority_p{N}:` | N is the 0-based priority index. Typical deployments use ≤5 priorities, bounding cardinality. | -| `cluster_manager` | *(per-pick)* | Routes RPCs to different child pickers per-request. The token is the child picker's token, resolved at pick time. See Per-Pick Tokens. | -| `rls` | `rls:` | When routing to a resolved child policy (post-lookup). | -| `cds` | `cds:` | When delegating to a child policy (post-discovery). | -| `wrr_locality` | *(transparent)* | Passes child tokens through unmodified. | -| `outlier_detection` | *(transparent)* | Passes child tokens through unmodified; OD does not itself cause queueing. | +#### Span Event Schema Specification +To ensure tracing backends can programmatically parse `"delay_state_transition"` events, the event MUST conform to the following structural schema: +```json +{ + "name": "delay_state_transition", + "attributes": { + "grpc.delay_type": "string (the active delay type, e.g., 'connecting')", + "grpc.delay_reason": "string (the new granular state description)" + } +} +``` -#### Token Composition +#### Metrics Definitions -For most policies, token composition occurs at **picker construction time**, not -on the pick path. When a delegating policy creates its picker, it reads the -child picker's token and stores the composite string as a member of its own -picker. This ensures zero allocations during `Pick()`. +The following metrics are registered as client-side per-call metrics, extending the instrumentation framework defined in [gRPC A66][A66]. -**Example flow** for `priority → pick_first`: -1. `pick_first` creates a picker with token `"pick_first:connecting"`. -2. `priority` wraps the child, creating its picker with token - `"priority_p0:pick_first:connecting"` (stored as a member string). +##### 1. 
Call Delay Duration Histogram -When a delegating policy has a **READY** child (no queueing), the token is the -empty string `""`, and the pick completes without queueing. +| Field | Value | +|---|---| +| **Name** | `grpc.client.call.delay.duration` | +| **Type** | Float64 Histogram | +| **Unit** | `s` (seconds) | +| **Description** | EXPERIMENTAL. Time an RPC spent waiting at the call level before an attempt was initiated, such as waiting for name resolution. | +| **Labels** | `grpc.target`, `grpc.delay_type` | +| **Optional Labels** | `grpc.method` | +| **Bucket Boundaries** | Same as A66 latency buckets: 0, 0.00001, 0.00005, 0.0001, 0.0003, 0.0006, 0.0008, 0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.008, 0.01, 0.013, 0.016, 0.02, 0.025, 0.03, 0.04, 0.05, 0.065, 0.08, 0.1, 0.13, 0.16, 0.2, 0.25, 0.3, 0.4, 0.5, 0.65, 0.8, 1, 2, 5, 10, 20, 50, 100 | +| **Default Enabled** | `false` (experimental, opt-in) | -#### Per-Pick Tokens +##### 2. Attempt Delay Duration Histogram -Some policies make per-request routing decisions within `Pick()`, causing the -delay reason to vary across RPCs handled by the same picker. These policies -provide their token via the pick result rather than a picker-level method: +| Field | Value | +|---|---| +| **Name** | `grpc.client.attempt.delay.duration` | +| **Type** | Float64 Histogram | +| **Unit** | `s` (seconds) | +| **Description** | EXPERIMENTAL. Time an RPC attempt spent waiting for a load balancing pick or connection establishment. 
| +| **Labels** | `grpc.target`, `grpc.delay_type` | +| **Optional Labels** | `grpc.method` | +| **Bucket Boundaries** | Same as A66 latency buckets: 0, 0.00001, 0.00005, 0.0001, 0.0003, 0.0006, 0.0008, 0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.008, 0.01, 0.013, 0.016, 0.02, 0.025, 0.03, 0.04, 0.05, 0.065, 0.08, 0.1, 0.13, 0.16, 0.2, 0.25, 0.3, 0.4, 0.5, 0.65, 0.8, 1, 2, 5, 10, 20, 50, 100 | +| **Default Enabled** | `false` (experimental, opt-in) | -- **RLS**: A single `rlsPicker` performs a per-request cache lookup. A cache - miss delays with `rls:lookup_pending`, while a cache hit that delegates to a - child in CONNECTING state delays with `rls:pick_first:connecting`. The token - must be determined inside `Pick()` and attached to the pending result. -- **cluster_manager**: Routes RPCs to different child pickers based on the - cluster selected by xDS routing. Different RPCs may reach different child - pickers in different states. The token is read from whichever child picker - handled the specific RPC. +### 2. Telemetry Value Taxonomy -The mechanism for attaching per-pick tokens is described in the language-specific -API sections below. +To ensure consistency across implementations, we define a taxonomy mapping client channel states and load balancing policies to metric and tracing labels. -### Language-Specific API Changes +All connection-related delays are consolidated into a single low-cardinality metric label value (`"connecting"`), while their detailed root causes are recorded in the tracing `grpc.delay_reason` attribute. -> [!NOTE] -> The code examples in the following sections are illustrative and simplified to demonstrate the design intent. Actual implementations may vary slightly across languages to conform to idiomatic patterns and existing codebase structure. +#### 1. 
Metric Delay Types (`grpc.delay_type`) +The `grpc.delay_type` label is restricted to a closed, low-cardinality set of values: +* `"connecting"`: Any client-side delay spent waiting for name resolution, config parsing, subchannel connection establishment, or picker initialization. +* `"rls_lookup_pending"`: Specifically for Route Lookup Service (RLS) control-plane cache-miss lookups. +* `"cds_dynamic_discovery"`: Specifically for xDS Cluster Discovery Service (CDS) dynamic metadata resource fetches. -#### Go +#### 2. Taxonomy of Delay Reasons (`grpc.delay_reason`) +The high-cardinality `grpc.delay_reason` string represents the exact state of the connection or resolver. -Go's `Picker` interface has a single `Pick()` method. Adding a method would -break all existing implementations. Two mechanisms are provided for supplying -queue reason tokens: +##### Category A: Resolver & Metadata Delays +Recorded when the channel is stalled obtaining the initial endpoints or configurations: +* `"resolver_dns_query_pending"`: DNS name resolution query is in progress. +* `"rls_lookup_pending"`: RLS control-plane query is in progress. +* `"cds_metadata_fetch"`: Waiting for dynamic CDS cluster resource definitions over xDS. -**Picker-level token** — for policies with uniform picker state (pick_first, -round_robin, ring_hash, priority). An optional interface that pickers may -implement: +##### Category B: Subchannel Connection Delays (Metric Label: `"connecting"`) +Recorded when an RPC attempt is queued waiting for a subchannel to establish a connection: +* `"subchannel_idle_trigger"`: The target subchannel was in `IDLE` state when picked, and a new connection attempt had to be triggered. +* `"subchannel_connecting"`: The subchannel is actively in `CONNECTING` state (TCP handshake or TLS negotiation in progress). 
+* `"subchannel_scaling_a105"`: The subchannel had a connection but transitioned back to `CONNECTING` or `TRANSIENT_FAILURE` to scale up additional connections to support a high number of concurrent streams (per gRFC A105). +* `"subchannel_waiting_for_health"`: The network connection is established, but the RPC is blocked waiting for the initial Out-Of-Band (OOB) health check report to return `SERVING`. +* `"subchannel_does_not_exist"`: The load balancer has not yet resolved or created the subchannel. -```go -// In package balancer - -// DelayMetricTokener is an optional interface that Pickers can implement -// to provide a bounded metric token describing why this picker delays RPCs. -// The token MUST be a static, pre-computed string with no dynamic components. -type DelayMetricTokener interface { - // DelayMetricToken returns the bounded delay reason token. - // Returns "" if the picker does not delay RPCs. - DelayMetricToken() string -} -``` +##### Category C: Picker and State Mismatches (Metric Label: `"connecting"`) +* `"subchannel_state_mismatch"`: The subchannel transitioned out of `READY` before the active picker could be updated. +* `"picker_failing_with_wait_for_ready"`: A `wait_for_ready` RPC is queued because the picker is in `TRANSIENT_FAILURE`. -**Per-pick token** — for policies where the delay reason varies per-RPC (RLS, -cluster_manager). A structured error type that wraps `ErrNoSubConnAvailable` -with a token: +##### Category D: Container Policy Pass-Through & Wrapping +Container load balancing policies propagate their children's delays by prepending a prefix, while forwarding the child's type and reason: +* **priority**: Prepends the active tier index (e.g., `p0:connecting` or `p1:connecting`). +* **xds_cluster_manager**: Prepends the targeted xDS cluster name (e.g., `xds_cluster_manager: child 'cluster-abc': connecting`). +* **weighted_target**: Prepends the targeted weight group (e.g., `weighted_target: child 'canary': connecting`).
+* **rls**: Prepends the target route shard (e.g., `RLS: child 'shard-eu': connecting`). -```go -// In package balancer - -// PendingPickError is returned by Pick() to signal a delay with a per-pick -// metric token. It is equivalent to ErrNoSubConnAvailable but carries -// a delay reason token that may vary per-RPC. -type PendingPickError struct { - // MetricToken is the bounded delay reason token for this specific pick. - MetricToken string -} -func (e *PendingPickError) Error() string { return "no SubConn available" } -func (e *PendingPickError) Is(target error) bool { - return target == ErrNoSubConnAvailable -} -``` +### 3. Tracer API Changes -The recording loop resolves the token with the following precedence: if the -error is a `*PendingPickError`, use its `MetricToken`; otherwise, check whether the -picker implements `DelayMetricTokener`; otherwise, use the empty string. +#### Language-Agnostic Lifecycle & State Machine -Built-in pickers (`pick_first`, `round_robin`, `ring_hash`, `priority`, etc.) -implement `DelayMetricTokener`. Custom LB policies that implement neither -mechanism report an empty token. +The client channel, load balancer, and telemetry tracer coordinate synchronously to record delays without dynamic memory allocation during routing. -**Tracing Signals** — To support the new child span design in Go without introducing OpenTelemetry dependencies in the core library, we introduce new internal stats signals. 
These are emitted by the channel infrastructure and handled by the registered stats handler (e.g., the OpenTelemetry plugin): +```mermaid +sequenceDiagram + autonumber + participant Channel as Client Channel + participant LB as Load Balancer (Picker) + participant Tracer as Telemetry Tracer + + Channel->>LB: PickSubchannel() + Note over LB: Subchannels idle or connecting + LB-->>Channel: Return PendingPickError (delay_type="connecting", delay_reason="subchannel_idle_trigger") + Channel->>Tracer: RecordDelayStart("connecting", "subchannel_idle_trigger") + Note over Tracer: Starts stopwatch & creates "Attempt Delay" span + + loop During buffering + Channel->>LB: Re-eval Pick + Note over LB: State transitions to TCP connecting + LB-->>Channel: Return PendingPickError (delay_type="connecting", delay_reason="subchannel_connecting") + Channel->>Tracer: RecordDelayReasonChanged("subchannel_connecting") + Note over Tracer: Emits span event "delay_state_transition" + end + + Channel->>LB: Re-eval Pick + LB-->>Channel: Return Subchannel (READY) + Channel->>Tracer: RecordDelayEnd() + Note over Tracer: Stops stopwatch, records metric, closes span +``` +1. **Initiating a Delay Segment**: + When the client channel encounters a blocking state (e.g., name resolution pending or load balancing pick deferred): + - A timer is started to measure the delay segment duration. + - A child span (`"Call Delay"` at the call level, or `"Attempt Delay"` at the attempt level) is created under the parent RPC span. + - The span is assigned the initial `grpc.delay_type` as a span attribute, and a `"delay_state_transition"` span event is emitted containing the `grpc.delay_reason` description. + +2. **Handling State Transitions**: + During an active delay, the internal state may transition (e.g., a subchannel reconnects, a different priority tier is chosen, or RLS updates its cache): + - **Scenario A: Only the reason changes (same delay type)**. 
+ If the new state maps to the *same* `grpc.delay_type` but a different `grpc.delay_reason`: + - The active child span remains open. + - A new `"delay_state_transition"` span event is appended to the active span with the updated `grpc.delay_reason` string. + - *No metric is recorded* and the timer continues running. + - **Scenario B: The delay type changes**. + If the new state maps to a *different* `grpc.delay_type`: + - The timer is stopped. + - The duration is recorded to the corresponding histogram (`grpc.client.call.delay.duration` or `grpc.client.attempt.delay.duration`) using the old `grpc.delay_type` as a label. + - The active child span is closed. + - A new delay segment is immediately initiated (restarting the timer, creating a new child span with the new `grpc.delay_type` attribute, and emitting the initial transition event). + +3. **Concluding the Delay**: + When the blocking state resolves (e.g., name resolution completes, a picker successfully selects a connection, or the RPC is cancelled or timed out): + - The timer is stopped. + - The final duration is recorded to the histogram using the active `grpc.delay_type` as a label. + - The active child span is closed. + +#### Language-Specific API Definitions + +##### Go +In `google.golang.org/grpc/stats`: ```go -// In package stats +package stats -// DelayStart indicates that the RPC has entered a delay period. -// This is a signal to the stats handler to start a new child span. +// DelayStart indicates the start of a delay segment. type DelayStart struct { - // Reason is the bounded token describing the cause of delay. - Reason string + // DelayType describes the type of delay (e.g., "connecting"). + DelayType string + // Reason is the initial dynamic debug string. + Reason string } func (*DelayStart) IsClient() bool { return true } func (*DelayStart) isRPCStats() {} -// DelayEnd indicates that the current delay period has resolved. -// This is a signal to the stats handler to end the child span.
-type DelayEnd struct{} - -func (*DelayEnd) IsClient() bool { return true } -func (*DelayEnd) isRPCStats() {} -``` - -**Example — `pick_first` (picker-level):** -```go -type pfPicker struct { - subConn balancer.SubConn - delayToken string // set at construction: "pick_first:connecting" or "" +// DelayReasonChanged indicates a transition in the dynamic delay reason. +type DelayReasonChanged struct { + // Reason is the new dynamic debug string. + Reason string } -func (p *pfPicker) Pick(balancer.PickInfo) (balancer.PickResult, error) { - return balancer.PickResult{SubConn: p.subConn}, nil -} +func (*DelayReasonChanged) IsClient() bool { return true } +func (*DelayReasonChanged) isRPCStats() {} -func (p *pfPicker) DelayMetricToken() string { - return p.delayToken -} -``` +// DelayEnd indicates the end of a delay segment. +type DelayEnd struct{} -**Example — `rls` (per-pick):** -```go -func (p *rlsPicker) Pick(info balancer.PickInfo) (balancer.PickResult, error) { - // ... cache lookup using info.FullMethodName and request keys ... - switch { - case dcEntry == nil && pendingEntry == nil: - p.sendRouteLookupRequest(cacheKey, ...) - return balancer.PickResult{}, - &balancer.PendingPickError{MetricToken: "rls:lookup_pending"} - case dcEntry == nil && pendingEntry != nil: - return balancer.PickResult{}, - &balancer.PendingPickError{MetricToken: "rls:lookup_pending"} - case dcEntry != nil: - // Delegate to child policy — child's Pick() may return its - // own PendingPickError or ErrNoSubConnAvailable. 
- pr, err := childPicker.Pick(info) - if qe, ok := err.(*balancer.PendingPickError); ok { - qe.MetricToken = "rls:" + qe.MetricToken - } - return pr, err - } -} +func (*DelayEnd) IsClient() bool { return true } +func (*DelayEnd) isRPCStats() {} ``` -**Example — `priority` (picker-level, delegating):** +In `google.golang.org/grpc/internal/telemetry`: +*Note: This package is introduced internally as part of this gRFC to align Go's internal telemetry abstractions with C++ and Java.* ```go -type priorityPicker struct { - childPicker balancer.Picker - delayToken string // pre-composed for static children - prefix string // "priority_p0:" -} +package telemetry -func newPriorityPicker(child balancer.Picker, priorityName string) balancer.Picker { - prefix := priorityName + ":" - p := &priorityPicker{childPicker: child, prefix: prefix} - if dt, ok := child.(balancer.DelayMetricTokener); ok { - childToken := dt.DelayMetricToken() - if childToken != "" { - p.delayToken = prefix + childToken - } - } - return p -} +import ( + "context" + "net" -func (p *priorityPicker) Pick(info balancer.PickInfo) (balancer.PickResult, error) { - res, err := p.childPicker.Pick(info) - if err != nil { - // If child returned a per-pick token, prepend our prefix - if pe, ok := err.(*balancer.PendingPickError); ok { - return res, &balancer.PendingPickError{ - MetricToken: p.prefix + pe.MetricToken, - } - } - } - return res, err -} + "google.golang.org/grpc/metadata" +) -func (p *priorityPicker) DelayMetricToken() string { - return p.delayToken +type Provider interface { + NewClientCallTracer(ctx context.Context, target string, method string, failFast bool, isClientStream bool, isServerStream bool) ClientCallTracer } -``` - -#### Java - -Java's `PickResult` is `final` and `@Immutable`. The per-pick token is carried -on the `PickResult` itself via a new factory method, which naturally supports -policies like RLS where the delay reason varies per-RPC. 
The `SubchannelPicker` -also gets a default method for delegating policies that have uniform state: - -```java -// In LoadBalancer.PickResult -// New field (private, immutable) -@Nullable private final String delayMetricToken; - -// New factory method for delayed results with a reason token -public static PickResult withNoResult(String delayMetricToken) { - return new PickResult(null, null, Status.OK, false, null, delayMetricToken); +type ClientCallTracer interface { + RecordDelayStart(delayType string, reason string) + RecordDelayReasonChanged(reason string) + RecordDelayEnd() + StartNewAttempt(ctx context.Context, isTransparent bool) (context.Context, ClientCallAttemptTracer) + RecordEnd(ctx context.Context, err error) } -// Getter -public String getDelayMetricToken() { - return delayMetricToken != null ? delayMetricToken : ""; +type ClientCallAttemptTracer interface { + RecordBegin(ctx context.Context) + RecordDelayStart(delayType string, reason string) + RecordDelayReasonChanged(reason string) + RecordDelayEnd() + RecordOutHeader(ctx context.Context, remoteAddr net.Addr, localAddr net.Addr, compression string, md metadata.MD) + RecordInHeader(ctx context.Context, wireLength int, compression string, md metadata.MD) + RecordOutPayload(ctx context.Context, compressedLen int, uncompressedLen int) + RecordInPayload(ctx context.Context, compressedLen int, uncompressedLen int) + RecordInTrailer(ctx context.Context, wireLength int, md metadata.MD) + RecordEnd(ctx context.Context, err error) } -``` - -The existing `withNoResult()` continues to return the `NO_RESULT` singleton with -an empty token. The `SubchannelPicker` also gets a default method for -delegating policies to read the child's token: -```java -// In LoadBalancer.SubchannelPicker - -/** - * Returns the bounded metric token describing why this picker delays RPCs. - * Default returns empty string (no delay information). 
- */ -public String getDelayMetricToken() { return ""; } +// GetClientCallAttemptTracer retrieves the active tracer implementation from the context. +func GetClientCallAttemptTracer(ctx context.Context) ClientCallAttemptTracer ``` -**Tracing APIs** — To support the new child span design in Java (and avoiding span events), we add new default methods to `ClientStreamTracer`. These allow the channel to notify the tracer when a delay starts and ends. (Note: `ClientStreamTracer` corresponds to `CallAttemptTracer` in C++. If a separate `CallTracer` is introduced for call-level tracking, it would have similar methods). - +##### Java +In `io.grpc.ClientStreamTracer`: +*Note: Methods are explicitly separated into Call-level and Attempt-level signatures to resolve scope collision, since Java utilizes a single ClientStreamTracer.* ```java -// In package io.grpc +package io.grpc; public abstract class ClientStreamTracer extends StreamTracer { - // ... existing methods ... - - /** - * Called when the RPC enters a delay period (e.g., waiting for a pick). - * The implementation should create a new child span for the delay. - * - * @param reason The bounded token describing the cause of delay. - */ - public void recordDelayStart(String reason) {} - - /** - * Called when the current delay period resolves. - * The implementation should end the child span. 
- */ - public void recordDelayEnd() {} -} -``` - -**Example implementation in `PickFirstLeafLoadBalancer`:** -```java -private class Picker extends SubchannelPicker { - private final String delayMetricToken; - - Picker(String token) { this.delayMetricToken = token; } - - @Override - public PickResult pickSubchannel(PickSubchannelArgs args) { - // When delaying, use the token-aware factory: - return PickResult.withNoResult(delayMetricToken); - } - - @Override - public String getDelayMetricToken() { return delayMetricToken; } -} -``` - -**Example implementation in a delegating load balancer (`PriorityLoadBalancer`):** -```java -private class PriorityPicker extends SubchannelPicker { - private final SubchannelPicker childPicker; - private final String delayMetricToken; - private final String prefix; - - PriorityPicker(SubchannelPicker child, String prefix) { - this.childPicker = child; - this.prefix = prefix; - // Pre-compose for static children - String childToken = child.getDelayMetricToken(); - this.delayMetricToken = childToken.isEmpty() ? "" : prefix + childToken; - } - - @Override - public PickResult pickSubchannel(PickSubchannelArgs args) { - PickResult result = childPicker.pickSubchannel(args); - // If child returned a per-pick delay token, prepend our prefix - String childDynamicToken = result.getDelayMetricToken(); - if (!childDynamicToken.isEmpty() && !childDynamicToken.equals(childPicker.getDelayMetricToken())) { - return PickResult.withNoResult(prefix + childDynamicToken); - } - return result; - } - - @Override - public String getDelayMetricToken() { - return delayMetricToken; - } + /** + * Called when a call-level delay segment (e.g. name resolution) starts. + */ + public void recordCallDelayStart(String delayType, String delayReason) {} + + /** + * Called when a call-level delay reason changes. + */ + public void recordCallDelayReasonChanged(String delayReason) {} + + /** + * Called when a call-level delay segment ends. 
+ */ + public void recordCallDelayEnd() {} + + /** + * Called when an attempt-level delay segment (e.g. LB Pick connection) starts. + */ + public void recordAttemptDelayStart(String delayType, String delayReason) {} + + /** + * Called when an attempt-level delay reason changes. + */ + public void recordAttemptDelayReasonChanged(String delayReason) {} + + /** + * Called when an attempt-level delay segment ends. + */ + public void recordAttemptDelayEnd() {} } ``` -#### C++ (Core) - -The `PickResult::Queue` struct is extended to hold a `string_view` carrying the -per-pick token. Since `Queue` is returned from `Pick()`, this naturally supports -policies where the token varies per-RPC. The `string_view` points to a -picker-owned string, involving zero dynamic allocations on the pick path: - +##### C++ (Core) +In `src/core/telemetry/call_tracer.h`: +*Note: To prevent core interface bloat and align with the core thinning refactoring, we leverage the existing C++ Annotation framework rather than adding new virtual methods.* ```cpp -// In LoadBalancingPolicy::PickResult - -struct Queue { - // Bounded queue reason token. Points to a string owned by the Picker. - // The Picker's lifetime exceeds the pick, so this is safe. - // Empty string_view means no token is provided. - absl::string_view metric_token; - - Queue() = default; - explicit Queue(absl::string_view token) : metric_token(token) {} -}; -``` - -The `SubchannelPicker` base class gets a virtual method for delegating policies: +namespace grpc_core { -```cpp -class SubchannelPicker : public DualRefCounted { +class DelayAnnotation final : public CallTracerAnnotationInterface::Annotation { public: - virtual PickResult Pick(PickArgs args) = 0; - - // Returns the bounded delay reason token. Default returns empty. 
- virtual absl::string_view GetDelayMetricToken() const { return ""; } -}; -``` - -**Tracing APIs** — To support the new child span design in C++ (and avoiding span events), we add new virtual methods to both `ClientCallTracerInterface` and `ClientCallTracerInterface::CallAttemptTracer`. These allow the channel to notify the tracer when a delay starts and ends at both the call and attempt levels: - -```cpp -// In src/core/telemetry/call_tracer.h - -class ClientCallTracerInterface : public CallTracerAnnotationInterface { - public: - // ... existing methods ... - - // Records the start of a delay period at the call level. - virtual void RecordDelayStart(absl::string_view reason) = 0; - - // Records the end of the current delay period at the call level. - virtual void RecordDelayEnd() = 0; - - class CallAttemptTracer : public CallTracerInterface { - public: - // ... existing methods ... - - // Records the start of a delay period at the attempt level. - virtual void RecordDelayStart(absl::string_view reason) = 0; - - // Records the end of the current delay period at the attempt level. 
- virtual void RecordDelayEnd() = 0; - }; -}; -``` - -**Example implementation in a leaf picker:** -```cpp -class PickFirstQueuePicker final : public SubchannelPicker { - public: - PickResult Pick(PickArgs) override { - return PickResult::Queue(kToken); - } - absl::string_view GetDelayMetricToken() const override { return kToken; } + enum class Stage { kStart, kReasonChanged, kEnd }; + + DelayAnnotation(Stage stage, absl::string_view type, absl::string_view reason) + : Annotation(CallTracerAnnotationInterface::AnnotationType::kDelay), + stage_(stage), type_(type), reason_(reason) {} + + Stage stage() const { return stage_; } + absl::string_view type() const { return type_; } + absl::string_view reason() const { return reason_; } private: - static constexpr absl::string_view kToken = "pick_first:connecting"; + Stage stage_; + absl::string_view type_; + absl::string_view reason_; }; -``` - -**Example delegating picker:** -```cpp -class PriorityPicker final : public SubchannelPicker { - public: - PriorityPicker(int priority_index, - RefCountedPtr child) - : child_(std::move(child)) { - // Compose token at construction time (stored as member string). - auto child_token = child_->GetDelayMetricToken(); - if (!child_token.empty()) { - composite_token_ = absl::StrCat("priority_p", - priority_index, ":", - child_token); - } - } - - PickResult Pick(PickArgs args) override { - auto result = child_->Pick(args); - if (auto* q = absl::get_if(&result.result)) { - q->metric_token = composite_token_; - } - return result; - } - - absl::string_view GetDelayMetricToken() const override { - return composite_token_; - } - private: - RefCountedPtr child_; - std::string composite_token_; // owned by this picker -}; +} // namespace grpc_core ``` -### Language-Specific Recording Logic -Recording is performed by the client channel's pick deferring infrastructure, not -by the LB policy itself. The LB policy provides the token; the channel measures -time and records the metric. 
- -#### Go — `picker_wrapper.go` - -In the `pick()` method's blocking loop, the recording logic resolves the token -from either a `PendingPickError` or the `DelayMetricTokener` interface, and emits a -histogram observation each time the token changes (per-segment emission): +#### Language-Specific Recording Logic (Code Examples) +##### Go — `picker_wrapper.go` ```go -func (pw *pickerWrapper) pick(ctx context.Context, failfast bool, - info balancer.PickInfo) (pick, error) { - var delayStartTime time.Time - var delayToken string - var delayed bool - method := info.FullMethodName - - for { - // ... existing picker load and blocking logic ... - - pickResult, err := p.Pick(info) - if err != nil { - if errors.Is(err, balancer.ErrNoSubConnAvailable) { - // Resolve per-pick token from PendingPickError, or fall - // back to picker-level DelayMetricTokener. - currentToken := "" - if qe, ok := err.(*balancer.PendingPickError); ok { - currentToken = qe.MetricToken - } else if qt, ok := p.(balancer.DelayMetricTokener); ok { - currentToken = qt.DelayMetricToken() - } - - if !delayed { - // First delay event — start timing, increment counter. - delayed = true - delayStartTime = time.Now() - delayToken = currentToken - rpcWaitingMetric.Record(metricsRecorder, 1, target) - } else if currentToken != delayToken { - // Token changed (new picker with different state). - // Emit segment for the previous token. - duration := time.Since(delayStartTime).Seconds() - pickDelayDurationMetric.Record(metricsRecorder, - duration, target, delayToken, method) - // Start new segment. - delayStartTime = time.Now() - delayToken = currentToken - } - continue - } - // ... existing error handling ... 
- } +func (pw *pickerWrapper) pick(ctx context.Context, failfast bool, info balancer.PickInfo) (pick, error) { + var ch chan struct{} + var lastPickErr error + pickBlocked := false + + var delayStartTime time.Time + var delayType string + var delayReason string + var delayed bool + + for { + pg := pw.pickerGen.Load() + if pg == nil { + return pick{}, ErrClientConnClosing + } + if pg.picker == nil { + ch = pg.blockingCh + } + if ch == pg.blockingCh { + // We are about to block on the channel. Record connecting delay. + if !delayed { + delayed = true + delayStartTime = time.Now() + delayType = "connecting" + delayReason = "channel_connecting" + if tracer := telemetry.GetClientCallAttemptTracer(ctx); tracer != nil { + tracer.RecordDelayStart(delayType, delayReason) + } + } + // Block goroutine until next picker is updated + select { + case <-ctx.Done(): + if delayed { + duration := time.Since(delayStartTime).Seconds() + clientAttemptDelayDurationMetric.Record(metricsRecorder, duration, target, delayType, info.FullMethodName) + if tracer := telemetry.GetClientCallAttemptTracer(ctx); tracer != nil { + tracer.RecordDelayEnd() + } + } + return pick{}, ctx.Err() + case <-ch: + } + continue + } + + if ch != nil { + pickBlocked = true + } + ch = pg.blockingCh + p := pg.picker + + pickResult, err := p.Pick(info) + if err != nil { + // errors.Is (not ==) so that a *PendingPickError, which matches + // ErrNoSubConnAvailable via its Is method, also takes this branch. + if errors.Is(err, balancer.ErrNoSubConnAvailable) { + currentType := "connecting" + currentReason := "subchannel_connecting" + if qe, ok := err.(*balancer.PendingPickError); ok { + currentType = qe.DelayType + currentReason = qe.DelayReason + } + + if !delayed { + delayed = true + delayStartTime = time.Now() + delayType = currentType + delayReason = currentReason + if tracer := telemetry.GetClientCallAttemptTracer(ctx); tracer != nil { + tracer.RecordDelayStart(delayType, delayReason) + } + } else if currentType != delayType { + // Type changed: Record metric and restart span.
+ duration := time.Since(delayStartTime).Seconds() + clientAttemptDelayDurationMetric.Record(metricsRecorder, duration, target, delayType, info.FullMethodName) + if tracer := telemetry.GetClientCallAttemptTracer(ctx); tracer != nil { + tracer.RecordDelayEnd() + tracer.RecordDelayStart(currentType, currentReason) + } + delayStartTime = time.Now() + delayType = currentType + delayReason = currentReason + } else if currentReason != delayReason { + // Only reason changed: Record span event. + delayReason = currentReason + if tracer := telemetry.GetClientCallAttemptTracer(ctx); tracer != nil { + tracer.RecordDelayReasonChanged(delayReason) + } + } + continue + } + // ... non-delay error handling ... + } + + // Success path + if delayed { + duration := time.Since(delayStartTime).Seconds() + clientAttemptDelayDurationMetric.Record(metricsRecorder, duration, target, delayType, info.FullMethodName) + if tracer := telemetry.GetClientCallAttemptTracer(ctx); tracer != nil { + tracer.RecordDelayEnd() + } + } + return pickResult, nil + } +} +``` + +##### Java — `DelayedClientTransport.java` +```java +// Tracer callbacks are invoked strictly OUTSIDE DelayedClientTransport locks. +// AtomicBoolean guards guarantee recordDelayEnd() is executed exactly once. + +private class PendingStream extends DelayedStream { + private final PickSubchannelArgs args; + private final ClientStreamTracer[] tracers; + private final long delayStartNanos; + private final String delayType; + private final String delayReason; + private final AtomicBoolean delayRecorded = new AtomicBoolean(); + + PendingStream(PickSubchannelArgs args, ClientStreamTracer[] tracers, PickResult pickResult) { + this.args = args; + this.tracers = tracers; + this.delayStartNanos = System.nanoTime(); + this.delayType = pickResult != null ? pickResult.getDelayType() : "connecting"; + this.delayReason = pickResult != null ? pickResult.getDelayReason() : "channel_connecting"; + } - // Pick succeeded. 
If we were delayed, record final segment. - if delayed { - duration := time.Since(delayStartTime).Seconds() - pickDelayDurationMetric.Record(metricsRecorder, - duration, target, delayToken, method) - rpcWaitingMetric.Record(metricsRecorder, -1, target) - if attemptSpan != nil { - attemptSpan.AddEvent("Delayed LB pick complete", - attribute.Float64("delay_duration", duration), - attribute.String("delay_reason", delayToken)) + void recordDelayEnd(MetricsRecorder recorder, String target) { + if (delayRecorded.compareAndSet(false, true)) { + long durationNanos = System.nanoTime() - this.delayStartNanos; + double durationSeconds = durationNanos / 1_000_000_000.0; + recorder.recordClientAttemptDelayDuration(durationSeconds, target, this.delayType); + for (ClientStreamTracer tracer : this.tracers) { + tracer.recordAttemptDelayEnd(); } } - // ... existing transport handling ... } } -``` -On context cancellation (existing `case <-ctx.Done():` block), the same -recording logic applies: -```go -case <-ctx.Done(): - if delayed { - duration := time.Since(delayStartTime).Seconds() - pickDelayDurationMetric.Record(metricsRecorder, - duration, target, delayToken, method) - rpcWaitingMetric.Record(metricsRecorder, -1, target) - if attemptSpan != nil { - attemptSpan.AddEvent("Delayed LB pick complete", - attribute.Float64("delay_duration", duration), - attribute.String("delay_reason", delayToken)) - } +// In DelayedClientTransport.newStream(): +PendingStream pendingStream = null; +synchronized (lock) { + if (state == SHUTDOWN) { + return new FailingClientStream(status); } - // ... existing error return ... -``` - -The `metricsRecorder` and `target` are available from the `ClientConn` that owns -the `pickerWrapper`. 
- -#### Java — `DelayedClientTransport.java` - -When a `PendingStream` is created in `createPendingStream()`, the delay start -time and token are captured: - -```java -private PendingStream createPendingStream(PickSubchannelArgs args, - ClientStreamTracer[] tracers, PickResult pickResult) { - PendingStream pendingStream = new PendingStream(args, tracers); - pendingStream.delayStartNanos = System.nanoTime(); - - // Resolve token: prefer PickResult token (dynamic), fallback to Picker token (static) - String token = pickResult != null ? pickResult.getDelayMetricToken() : ""; - if (token.isEmpty() && pickerState.lastPicker != null) { - token = pickerState.lastPicker.getDelayMetricToken(); + if (transport == null) { + pendingStream = new PendingStream(args, tracers, pickResult); + pendingStreams.add(pendingStream); } - pendingStream.delayToken = token; - // ... existing logic ... } -``` - -When the stream is dequeued in `reprocess()` (transport obtained), or cancelled, -the duration is recorded: - -```java -if (transport != null) { - long durationNanos = System.nanoTime() - stream.delayStartNanos; - double durationSeconds = durationNanos / 1_000_000_000.0; - metricsRecorder.recordPickDelayDuration(durationSeconds, - target, stream.delayToken); - stream.getTracer().recordAnnotation("Delayed LB pick complete", - Attributes.of( - "delay_duration", durationSeconds, - "delay_reason", stream.delayToken)); - // ... existing stream creation ... +if (pendingStream != null) { + // Invoke tracer callbacks OUTSIDE critical locks + for (ClientStreamTracer tracer : tracers) { + tracer.recordAttemptDelayStart(pendingStream.delayType, pendingStream.delayReason); + } } ``` -#### C++ (Core) — `load_balanced_call_destination.cc` +##### C++ (Core) — `load_balanced_call_destination.cc` +```cpp +// LbDelayState is an RAII object that guarantees RecordDelayEnd is called on destruction (cancellation). +// Persisted across asynchronous Loop iterations by capturing it in the Loop lambda state. 
-In the pick loop, when `PickResult::Queue` is returned, the start time and token -are captured. If the token changes during the wait, the current segment is recorded -and a new one begins. When the pick eventually succeeds or the call is cancelled, the -final duration is recorded: +class LbDelayState { + public: + LbDelayState(ClientCallTracerInterface::CallAttemptTracer* tracer, std::string target) + : tracer_(tracer), target_(std::move(target)) {} -```cpp -// In the pick processing lambda: -auto delay_func = [&](LoadBalancingPolicy::PickResult::Queue* delay_pick) { - std::string current_token = std::string(delay_pick->metric_token); - if (!delay_start_time.has_value()) { - delay_start_time = Timestamp::Now(); - delay_token = current_token; - } else if (current_token != delay_token) { - // Token changed: Emit segment for the previous token - Duration delay_duration = Timestamp::Now() - *delay_start_time; - stats_plugin_group.RecordHistogram( - kPickDelayDurationHandle, - delay_duration.seconds(), - {target}, {delay_token}); - // Start new segment - delay_start_time = Timestamp::Now(); - delay_token = current_token; + ~LbDelayState() { + if (delay_start_time_.has_value()) { + // RAII Cancellation: Record final metric and close tracer + RecordEnd(); } - return Continue{}; -}; + } + + void Update(absl::string_view type, absl::string_view reason) { + if (!delay_start_time_.has_value()) { + delay_start_time_ = Timestamp::Now(); + type_ = std::string(type); + reason_ = std::string(reason); + tracer_->RecordAnnotation(DelayAnnotation(DelayAnnotation::Stage::kStart, type_, reason_)); + } else if (type != type_) { + RecordEnd(); + delay_start_time_ = Timestamp::Now(); + type_ = std::string(type); + reason_ = std::string(reason); + tracer_->RecordAnnotation(DelayAnnotation(DelayAnnotation::Stage::kStart, type_, reason_)); + } else if (reason != reason_) { + reason_ = std::string(reason); + tracer_->RecordAnnotation(DelayAnnotation(DelayAnnotation::Stage::kReasonChanged, 
type_, reason_)); + } + } -// On successful pick (in the completion callback): -if (delay_start_time.has_value()) { - Duration delay_duration = Timestamp::Now() - *delay_start_time; + void RecordEnd() { + if (!delay_start_time_.has_value()) return; + Duration duration = Timestamp::Now() - *delay_start_time_; stats_plugin_group.RecordHistogram( - kPickDelayDurationHandle, - delay_duration.seconds(), - {target}, {delay_token}); - call_tracer->RecordAnnotation("Delayed LB pick complete", { - {"delay_duration", absl::StrCat(delay_duration.seconds())}, - {"delay_reason", delay_token} - }); -} -``` - -#### C-Core Specifics: RLS Cache Misses and HTTP/2 Max Concurrent Streams - -In `grpc-core`, the boundary between Load Balancing and the Transport layer introduces two highly specific queuing scenarios that must be handled distinctly in the `PickResult::Queue` tokenization. + kClientAttemptDelayDurationHandle, + duration.seconds(), + {target_, type_}, + {}); + tracer_->RecordAnnotation(DelayAnnotation(DelayAnnotation::Stage::kEnd, type_, reason_)); + delay_start_time_.reset(); + } -**1. RLS Cache Misses (Dynamic Token Lifetime)** -Unlike static policies (like `pick_first`), the RLS policy in C-Core evaluates targets dynamically per-call. When a cache miss occurs, the RLS policy initiates a control-plane request and returns `PickResult::Queue`. -* **The C-Core Constraint:** The `PickResult::Queue` struct uses an `absl::string_view` to prevent allocations on the hot path. However, because an RLS cache miss is dynamic, the string it points to must safely outlive the pick. -* **Implementation:** The RLS LB Policy must maintain a static, pre-allocated pool of `std::string` constants for its state transitions (e.g., `static const std::string kRlsPending = "rls:lookup_pending";`). During a cache miss `Pick()`, the policy returns a `string_view` pointing explicitly to this constant memory address, ensuring safe reference lifecycle without dynamic memory allocation per RPC. - -**2. 
MAX_CONCURRENT_STREAMS (Transport-Level Delay)** -In C-Core, waiting can occur even when the LB Policy successfully finds a `READY` subchannel. If the HTTP/2 transport for that subchannel has reached its `SETTINGS_MAX_CONCURRENT_STREAMS` limit, the `grpc_call` must wait. -* **The Distinction:** This is a *transport* delay, not an *LB routing* delay. -* **Implementation:** The LB policy's `Pick()` method will actually succeed and return a `READY` subchannel. However, the `client_channel` filter will detect the transport exhaustion. To maintain observability, the `client_channel` filter itself will inject a synthetic token: `"transport:max_concurrent_streams"`. The delay timer will start, and the RPC will wait in the filter's pending list until a stream becomes available or the context deadline is exceeded. This clearly separates network capacity limits from LB routing failures in the resulting telemetry. - -### Metric Instrument Registration - -#### Go - -```go -import estats "google.golang.org/grpc/experimental/stats" - -var pickDelayDurationMetric = estats.RegisterFloat64Histo( - estats.MetricDescriptor{ - Name: "grpc.lb.pick_delay_duration", - Description: "EXPERIMENTAL. Time an RPC spent waiting " + - "for a load balancing pick.", - Unit: "s", - Labels: []string{"grpc.target", "grpc.lb.delay_reason"}, - OptionalLabels: []string{"grpc.method"}, - Default: false, - }) - -var rpcWaitingMetric = estats.RegisterInt64UpDownCount( - estats.MetricDescriptor{ - Name: "grpc.lb.rpc_waiting", - Description: "EXPERIMENTAL. Number of RPCs currently waiting " + - "for a load balancing pick.", - Unit: "{call}", - Labels: []string{"grpc.target"}, - Default: false, - }) -``` - -#### Java - -```java -private static final LongHistogramInstrumentDescriptor PICK_DELAY_DURATION = - InstrumentRegistry.registerDoubleHistogram( - "grpc.lb.pick_delay_duration", - "EXPERIMENTAL. 
Time an RPC spent waiting " - + "for a load balancing pick.", - "s", - List.of("grpc.target", "grpc.lb.delay_reason"), // required labels - List.of("grpc.method"), // optional labels - false); // default disabled - -private static final LongUpDownCounterMetricInstrument RPC_WAITING = - InstrumentRegistry.registerLongUpDownCounter( - "grpc.lb.rpc_waiting", - "EXPERIMENTAL. Number of RPCs currently waiting " - + "for a load balancing pick.", - "{call}", - List.of("grpc.target"), - List.of(), - false); -``` + private: + ClientCallTracerInterface::CallAttemptTracer* tracer_; + std::string target_; + std::optional<Timestamp> delay_start_time_; + std::string type_; + std::string reason_; +}; -#### C++ (Core) +// Inside UnstartedCallHandler::StartCall mutable loop lambda: +auto delay_state = std::make_shared<LbDelayState>(call_tracer, target); -```cpp -const auto kPickDelayDurationHandle = - GlobalInstrumentsRegistry::RegisterDoubleHistogram( - "grpc.lb.pick_delay_duration", - "EXPERIMENTAL. Time an RPC spent waiting " - "for a load balancing pick.", - "s", - /*label_keys=*/{"grpc.target", "grpc.lb.delay_reason"}, - /*optional_label_keys=*/{"grpc.method"}, - /*enable_by_default=*/false); - -const auto kRpcWaitingHandle = - GlobalInstrumentsRegistry::RegisterUpDownInt64Counter( - "grpc.lb.rpc_waiting", - "EXPERIMENTAL. Number of RPCs currently waiting " - "for a load balancing pick.", - "{call}", - /*label_keys=*/{"grpc.target"}, - /*optional_label_keys=*/{}, - /*enable_by_default=*/false); +return Loop( + [delay_state, picker, unstarted_handler]() mutable { + return PickSubchannel( + *picker, + *unstarted_handler, + delay_state.get() // Pass state pointer to persist across asynchronous iterations + ); + } +); ``` -### Metric Stability - -Per [A79][A79], this metric starts as **experimental** and **disabled by -default**. Users must explicitly opt in via their OpenTelemetry plugin -configuration. The metric description is prefixed with `EXPERIMENTAL.` to -signal this status. 
- -The metric will be promoted to stable after it has been implemented and validated -in all three languages, with at least one release cycle of user feedback. - -### Temporary environment variable protection - -The metric will be guarded by the environment variable -`GRPC_EXPERIMENTAL_ENABLE_PICK_DELAY_METRIC`. When set to `true`, the metric -will be registered and reported even if the user has not explicitly enabled it -via the OpenTelemetry plugin configuration. This guard will be removed once the -feature is deemed stable. +### Feature Flag -### Tracing Enhancement +All delay metrics, tracing, and API hooks will be guarded by a feature flag: +* **Go/Java Env Var**: `GRPC_EXPERIMENTAL_ENABLE_DELAY_OBSERVABILITY` (Default: `false`) +* **C++ Core Experiment**: `IsExperimentEnabled("client_delay_observability")` (registered in `experiments.h`) -To provide high-fidelity visibility into delays, this proposal introduces a **generic delay framework** using child spans, rather than just adding events to existing spans. This allows capturing the start, end, and specific reason for any processing delay (not limited to load balancing). - -When an RPC experiences a delay (such as waiting for a load balancer pick), the channel will create a **new child span** named simply **`"Delay"`**. - -This child span will have the following attributes: -* `delay_type`: Describes the category of delay. For this proposal, it will be `"load_balancing"`. -* `delay_reason`: The specific bounded token describing why the delay happened (e.g., `"pick_first:connecting"`, `"rls:lookup_pending"`). +## Rationale -The span is created when the delay condition begins and is ended when the condition resolves or transitions to a new reason (in which case a new segment span is created). This provides a clear, time-accurate visualization of delay segments in the trace. 
+We implement recording at the client channel level rather than inside specific load balancing policies because only the client channel manages the buffering, queueing, and context cancellation lifecycles of RPC calls. This decouples policy-level state reporting from duration measurement. -To support this in each language: -* **C++**: New APIs in `ClientCallTracerInterface::CallAttemptTracer` to start and end delay spans. -* **Java**: New default methods in `ClientStreamTracer` to start and end delay spans. -* **Go**: New internal stats signals (`stats.DelayStart` and `stats.DelayEnd`) to notify the stats handler when to manage the delay span lifecycle. +To minimize overhead, pickers pre-compute tokens at configuration time, avoiding dynamic string allocations during picks. Dynamic tokens are only used for policies that perform per-request routing (e.g., RLS, xDS cluster manager). -## Rationale +## Implementation -The metric is recorded by the client channel's pick deferring infrastructure -rather than by the LB policy itself, because the LB policy knows its internal -state but does not know how many RPCs are waiting or for how long. Only the -channel physically buffers RPCs and can accurately measure per-RPC delay -duration. This mirrors the existing design where per-call metrics ([A66]) are -recorded by the channel, not by LB policies. - -Two mechanisms are provided for supplying delay reason tokens — a picker-level -optional interface (`DelayMetricTokener`) and a per-pick error/result type -(`PendingPickError` / `PickResult.withNoResult(token)` / `Queue::metric_token`). Most -LB policies have uniform picker state: all RPCs see the same delay reason from a -given picker instance (pick_first, round_robin, ring_hash, priority). These -policies use the picker-level mechanism, which involves zero allocations on the -pick path. 
However, RLS and cluster_manager make per-request routing decisions -within `Pick()` — RLS does a per-request cache lookup, and cluster_manager -routes RPCs to different child pickers based on the xDS-selected cluster. For -these policies, the delay reason varies per-RPC, requiring the token to be -determined inside `Pick()` and attached to the pending result. - -The recording loop emits one histogram observation per delay reason segment -rather than a single observation per RPC. An RPC that transitions through -multiple delay reasons (e.g., `rls:lookup_pending` for 10 seconds, then -`rls:pick_first:connecting` for 0.1 seconds after the lookup completes) produces -two histogram observations, one for each segment. This per-segment approach -provides visibility into how much time each layer of the LB policy tree -contributed to the total pick delay. Without it, a single 10.1-second -observation attributed to `rls:lookup_pending` would obscure the fact that the -RLS lookup accounted for the vast majority of the delay. - -The `grpc.lb.rpc_waiting` up-down counter is provided because the -histogram metric is emitted only when a wait ends. RPCs that are -indefinitely delayed (e.g., target unreachable, infinite backoff) -never emit a histogram observation. The counter gives operators a real-time -count of currently-waiting RPCs, enabling alerting on stuck traffic. Every LB -policy in the tree can cause indefinite waiting (pick_first cycling through -CONNECTING/TF, RLS server unreachable, CDS resource never arriving), making this -counter essential. An up-down counter is used rather than a callback gauge -because the recording loop has the exact synchronous entry/exit points, matching -the pattern used by `grpc.subchannel.open_connections` ([A94]) and -`grpc.tcp.connection_count` ([A80]). 
- -`grpc.method` is included as an optional label because some policies (RLS, -cluster_manager) make routing decisions based on the method name, causing wait -behavior to vary across methods. For simpler policies (pick_first, -round_robin), the delay behavior is method-independent, but operators may still -want per-method delay analysis to correlate with per-call attempt metrics. As an -optional label, it defaults to off and only adds cardinality when explicitly -enabled. - -`wrr_locality` and `outlier_detection` are treated as transparent because -neither makes independent delaying decisions. `wrr_locality` is a configuration -wrapper; `outlier_detection` manipulates subchannel state but the actual -delaying is decided by the child policy's picker. When outlier detection ejects -100% of endpoints, the child policy (e.g., pick_first) enters -TRANSIENT_FAILURE and returns an error rather than a pending result, so no delay -metric is recorded. In partial ejection scenarios, the child is genuinely in -CONNECTING state and the token is accurate. Operators can correlate delay -duration with the existing `grpc.lb.outlier_detection.ejections_enforced` metric -([A91]) for root cause analysis. - -The tracing enhancement reuses the same bounded delay reason token as the metric -label rather than introducing a separate high-cardinality trace reason string. -This keeps the API surface minimal — one token serves both metrics and traces — -and avoids the need for pickers to maintain two separate strings. If richer, -dynamic trace context (e.g., IP addresses, RLS keys) is needed in the future, -it can be added as additional span event attributes without changing the token -API. +We will implement this in Go, Java, and C++ (Core), in that order. -## Implementation -Implementation will proceed in Go, Java, and C++ (Core), in that order. 
[A66]: A66-otel-stats.md [A72]: A72-open-telemetry-tracing.md -[A78]: A78-grpc-metrics-wrr-pf-xds.md -[A79]: A79-non-per-call-metrics-architecture.md -[A80]: A80-tcp-metrics.md -[A91]: A91-outlier-detection-metrics.md -[A94]: A94-subchannel-otel-metrics.md [A56]: A56-priority-lb-policy.md -[A50]: A50-xds-outlier-detection.md \ No newline at end of file +[A28]: A28-xds-traffic-splitting-and-routing.md \ No newline at end of file From 5cbe0c0637f088fb65611261c01e9d92c19b6762 Mon Sep 17 00:00:00 2001 From: Madhav Bissa Date: Tue, 23 Jun 2026 03:42:32 +0000 Subject: [PATCH 04/12] go v2 change and cleanup --- grpc-lb-delay-metrics.md | 588 +++++++++++++-------------------------- 1 file changed, 199 insertions(+), 389 deletions(-) diff --git a/grpc-lb-delay-metrics.md b/grpc-lb-delay-metrics.md index eef2899a6..e86f0143d 100644 --- a/grpc-lb-delay-metrics.md +++ b/grpc-lb-delay-metrics.md @@ -51,25 +51,18 @@ For both layers, a child span is created for each distinct `grpc.delay_type`, an * **`grpc.delay_type` (Span Attribute & Metric Label)**: A low-cardinality, composed string representing the type of the delay. The value format is `[parent_prefix:]base_type` (e.g., `connecting`, `p0:connecting`, or `rls_lookup_pending`). It is recorded as a **span attribute** on the child span and as a **metric label** on the corresponding duration histogram. * **`grpc.delay_reason` (Span Event Attribute)**: A high-cardinality string capturing runtime details (such as target names, subchannel IP addresses, or connection error messages). -When a delay begins, the tracer creates the child span (`"Call Delay"` or `"Attempt Delay"`) carrying the `grpc.delay_type` attribute. It also records the initial reason as a span event named `"delay_state_transition"` containing the `grpc.delay_reason` string. +When a delay begins, the tracer creates the child span (`"Call Delay"` or `"Attempt Delay"`) carrying the `grpc.delay_type` attribute. 
It also records the initial reason as a span event named `"Delay state transition"` containing the `grpc.delay_reason` string. -If the state changes but the `grpc.delay_type` remains the same (e.g., a priority policy fails over or RLS changes backend targets), the tracer emits a new `"delay_state_transition"` span event on the active child span with the updated `grpc.delay_reason` string, without recreating the span. +If the state changes but the `grpc.delay_type` remains the same (e.g., a priority policy fails over or RLS changes backend targets), the tracer emits a new `"Delay state transition"` span event on the active child span with the updated `grpc.delay_reason` string, without recreating the span. -When the delay resolves or transitions to a different delay type, the active child span is closed. -```mermaid -graph TD - ParentSpan["Parent RPC Attempt Span"] --> DelaySpan["Child Span: 'Attempt Delay'<br/>(Attribute: grpc.delay_type = 'connecting')"] - DelaySpan --> Event1["Span Event @ 0ms: 'delay_state_transition'<br/>(Attribute: grpc.delay_reason = 'subchannel_idle_trigger')"] - DelaySpan --> Event2["Span Event @ 100ms: 'delay_state_transition'<br/>(Attribute: grpc.delay_reason = 'subchannel_connecting')"] - DelaySpan --> Event3["Span Event @ 200ms: 'delay_state_transition'<br/>(Attribute: grpc.delay_reason = 'waiting_for_health_report')"] -``` +When the delay resolves or transitions to a different delay type, the active child span is closed. #### Span Event Schema Specification -To ensure tracing backends can programmatically parse `"delay_state_transition"` events, the event MUST conform to the following structural schema: +To ensure tracing backends can programmatically parse `"Delay state transition"` events, the event MUST conform to the following structural schema: ```json { - "name": "delay_state_transition", + "name": "Delay state transition", "attributes": { "grpc.delay_type": "string (the active delay type, e.g., 'connecting')", "grpc.delay_reason": "string (the new granular state description)" @@ -107,218 +100,265 @@ The following metrics are registered as client-side per-call metrics, extending | **Bucket Boundaries** | Same as A66 latency buckets: 0, 0.00001, 0.00005, 0.0001, 0.0003, 0.0006, 0.0008, 0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.008, 0.01, 0.013, 0.016, 0.02, 0.025, 0.03, 0.04, 0.05, 0.065, 0.08, 0.1, 0.13, 0.16, 0.2, 0.25, 0.3, 0.4, 0.5, 0.65, 0.8, 1, 2, 5, 10, 20, 50, 100 | | **Default Enabled** | `false` (experimental, opt-in) | - ### 2. Telemetry Value Taxonomy -To ensure consistency across implementations, we define a taxonomy mapping client channel states and load balancing policies to metric and tracing labels. +To ensure consistency across implementations, we define the taxonomy of metric and tracing labels. -All connection-related delays are consolidated into a single low-cardinality metric label value (`"connecting"`), while their detailed root causes are recorded in the tracing `grpc.delay_reason` attribute. +All connection-related delays are consolidated into a single low-cardinality metric label value (`"connecting"`), while their detailed reasons are recorded inside the `"Delay state transition"` span event (as the `grpc.delay_reason` event attribute). #### 1.
Metric Delay Types (`grpc.delay_type`) -The `grpc.delay_type` label is restricted to a closed, low-cardinality set of values: -* `"connecting"`: Any client-side delay spent waiting for name resolution, config parsing, subchannel connection establishment, or picker initialization. -* `"rls_lookup_pending"`: Specifically for Route Lookup Service (RLS) control-plane cache-miss lookups. -* `"cds_dynamic_discovery"`: Specifically for xDS Cluster Discovery Service (CDS) dynamic metadata resource fetches. - -#### 2. Taxonomy of Delay Reasons (`grpc.delay_reason`) -The high-cardinality `grpc.delay_reason` string represents the exact state of the connection or resolver. +The `grpc.delay_type` label is a low-cardinality, restricted set of values. To ensure maximum clarity, these types are explicitly partitioned into Call-Level and Attempt-Level scopes, corresponding to their respective duration histograms. Structural container policies may compose the attempt-level values by prepending logical prefixes. -##### Category A: Resolver & Metadata Delays -Recorded when the channel is stalled obtaining the initial endpoints or configurations: -* `"resolver_dns_query_pending"`: DNS name resolution query is in progress. -* `"rls_lookup_pending"`: RLS control-plane query is in progress. -* `"cds_metadata_fetch"`: Waiting for dynamic CDS cluster resource definitions over xDS. +##### 1. Call-Level Delay Types +Recorded on the `grpc.client.call.delay.duration` histogram. These represent delays that occur before an individual attempt is created: +* `"resolving"`: The channel is delayed waiting for name resolution or configuration parsing. -##### Category B: Subchannel Connection Delays (Metric Label: `"connecting"`) -Recorded when an RPC attempt is queued waiting for a subchannel to establish a connection: -* `"subchannel_idle_trigger"`: The target subchannel was in `IDLE` state when picked, and a new connection attempt had to be triggered. 
-* `"subchannel_connecting"`: The subchannel is actively in `CONNECTING` state (TCP handshake or TLS negotiation in progress). -* `"subchannel_scaling_a105"`: The subchannel had a connection but transitioned back to `CONNECTING` or `TRANSIENT_FAILURE` to scale up additional connections to satisfy high concurrent streams (per gRFC A105). -* `"subchannel_waiting_for_health"`: The network connection is established, but the RPC is blocked waiting for the initial Out-Of-Band (OOB) health check report to return `SERVING`. -* `"subchannel_does_not_exist"`: The load balancer has not yet resolved or created the subchannel. - -##### Category C: Picker and State Mismatches (Metric Label: `"connecting"`) -* `"subchannel_state_mismatch"`: The subchannel transitioned out of `READY` before the active picker could be updated. -* `"picker_failing_with_wait_for_ready"`: A `wait_for_ready` RPC is queued because the picker is in `TRANSIENT_FAILURE`. - -##### Category D: Container Policy Pass-Through & Wrapping -Container load balancing policies propagate their children's delays by prepending a prefix, while forwarding the child's type and reason: -* **priority**: Prepends the active tier index (e.g., `p0:connecting` or `p1:connecting`). -* **xds_cluster_manager**: Prepends the targeted xDS cluster name (e.g. `xds_cluster_manager: child 'cluster-abc': connecting`). -* **weighted_target**: Prepends the targeted weight group (e.g., `weighted_target: child 'canary': connecting`). -* **rls**: Prepends the target route shard (e.g., `RLS: child 'shard-eu': connecting`). +##### 2. Attempt-Level Delay Types +Recorded on the `grpc.client.attempt.delay.duration` histogram. These represent delays that occur during a specific RPC attempt: +* `"connecting"`: The attempt is delayed waiting for subchannel connection establishment or picker initialization. +* `"rls_lookup_pending"`: Specifically for Route Lookup Service (RLS) control-plane cache-miss query lookups. 
+* `"cds_dynamic_discovery"`: Specifically for xDS Cluster Discovery Service (CDS) dynamic metadata resource fetches. +* `"subchannel_state_mismatch"`: The target subchannel transitioned out of `READY` (e.g., disconnected or went to `TRANSIENT_FAILURE`), but the active channel picker has not yet been updated with a new picker. +* `"picker_failing_with_wait_for_ready"`: A `wait_for_ready` RPC is buffered/queued because the picker is in `TRANSIENT_FAILURE`, waiting for the next connection attempt to succeed. + +##### 3. Composed Attempt-Level Types +Structural container policies prepend their logical prefixes to the base attempt-level types using a colon separator: +* **Priority Policy**: Prepends the active priority tier index, resulting in composed types like: + * `p0:connecting` + * `p1:subchannel_state_mismatch` + * `p0:picker_failing_with_wait_for_ready` +* **Pass-Through Container Policies**: Policies like `xds_cluster_manager`, `weighted_target`, and `rls` **do not prepend any prefix or wrap the delay type**. They simply bubble up the child's `grpc.delay_type` (e.g., `"connecting"`) directly as-is. + +##### Metadata Propagation & Composition +The prepending and wrapping logic is handled entirely inside the Balancer Picker tree hierarchy, keeping the channel's attempt-routing wrapper and tracer plugins completely decoupled and simple: +1. **Leaf Pickers** (such as `pick_first` or `round_robin`): Generate the base leaf types (e.g., `delay_type = "connecting"`) and the initial, descriptive `delay_reason` (e.g., `"subchannel connecting: TCP handshake in progress"`). +2. **Container Pickers**: Intercept the child's deferred pick result as it bubbles up the picker tree. The `priority` picker prepends its active tier index to the type (e.g., producing `"p0:connecting"`). A pass-through picker (such as `xds_cluster_manager`) forwards the child's `delay_type` unmodified, but can enrich the `delay_reason` with its own structural details. +3. 
**The Channel Wrapper**: Receives the final, fully-composed `delay_type` and `delay_reason` strings from the root picker and passes them directly to the tracer (`recordAttemptDelayStart`) without any parsing, branching, or type checks. + +--- + +#### 2. Taxonomy of Delay Reasons (Span Event Attribute: `grpc.delay_reason`) +Unlike the strict, low-cardinality metric types, the `grpc.delay_reason` is an **unconstrained, free-form debug string** designed to convey maximum troubleshooting context. +* It is written as a **human-readable, spaced string** (e.g. `"subchannel is connecting"` or `"waiting for DNS query to complete"`), **not** a `snake_case` token or closed enum. +* It is designed to contain **high-cardinality metadata** (such as specific subchannel IP addresses, target names, cache keys, or raw connection error messages) to provide rich diagnostics inside the trace span. + +To assist implementers, we explain the physical scenarios that are associated with each `grpc.delay_type` below. These scenarios represent common connection/resolver bottlenecks but are **non-exhaustive**; implementations are encouraged to append additional debug details. + +##### Category A: Resolver Scenarios (Type: `"resolving"`) +* **DNS Resolver Pending**: The channel is waiting for the initial name resolution query to complete. The reason string should describe the pending resolver query (e.g., `"waiting for DNS query to complete for target example.com"`). + +##### Category B: Control-Plane & Metadata Scenarios (Type: `"rls_lookup_pending"`, `"cds_dynamic_discovery"`) +* **RLS Cache-Miss Lookup**: An RPC is blocked because RLS is executing a control-plane query. The reason string describes the pending query and can include the RLS server target address or RLS cache keys (e.g., `"Route Lookup Service query pending on rls-server:8080"`). +* **CDS Metadata Fetch**: An xDS channel is waiting for dynamic CDS cluster resource definitions. 
The reason string describes the dynamic discovery request and can include the targeted cluster name (e.g., `"waiting for CDS resource definition for cluster cluster_abc"`). + +##### Category C: Subchannel Connection Scenarios (Type: `"connecting"`) +Recorded when an RPC attempt is queued waiting for a subchannel to establish a transport: +* **Idle Subchannel Trigger**: The picker selects a subchannel that is in `IDLE` state, triggering a new connection attempt. The reason string captures this transition (e.g., `"subchannel was idle, triggering connection attempt"`). +* **Active TCP/TLS Handshake**: The subchannel is actively connecting. The reason string describes that the TCP or TLS handshake is in progress, and can include the target subchannel's remote address (e.g., `"subchannel connecting: TCP handshake in progress to 192.168.1.50:8080"`). +* **A105 Connection Scaling**: Per gRFC A105, an established subchannel connection is scaling up to add additional connections to satisfy high concurrent streams. The reason string captures this scaling state (e.g., `"scaling up additional subchannel connections per A105 limits"`). +* **Waiting for Health Report**: The physical transport connection is established, but the picker is blocked waiting for the initial Out-of-Band (OOB) health check stream to return `SERVING` (e.g., `"transport connected, waiting for initial health check report"`). +* **Subchannel Non-Existence**: The load balancer has not yet created or resolved the target subchannel. The reason string describes the missing subchannel state. +* **Round Robin / WRR Scenarios**: The policy is connecting because it is waiting for any of its configured subchannels to become ready. The reason string describes the policy's connectivity attempt and can list the subchannels being attempted. +* **Ring Hash Scenarios**: The hash ring is waiting for subchannels in the ring to connect. 
The reason string describes that the ring is connecting and can list the specific ring nodes being attempted. +* **xDS Override Host Scenarios**: The policy is attempting to connect to a specific overridden host (subchannel) but the host is not yet connected. The reason string explains the overridden host state. + +##### Category D: Picker and State Mismatch Scenarios (Type: `"subchannel_state_mismatch"`, `"picker_failing_with_wait_for_ready"`) +* **Subchannel State Mismatch**: The subchannel transitioned out of `READY` (e.g., disconnected or entered `TRANSIENT_FAILURE`) before the active picker could be updated. The reason string captures the mismatch and the subchannel's target. +* **Wait-For-Ready Buffering**: A `wait_for_ready` RPC is buffered because the picker is in `TRANSIENT_FAILURE`. The reason string captures the last connection error message (e.g., `"picker is in transient failure: connection refused by peer"`). + +##### Category E: Priority Policy Behavior (Type: Composed `p0:`, `p1:` prefix) +The priority load balancing policy ([gRPC A56][A56]) manages failover between multiple priority groups (e.g. `p0`, `p1`, `p2`). Its delay telemetry behaves as follows: +* When the priority policy is waiting on its primary tier (`p0`) to connect, the overall metric delay type is `"p0:connecting"`. +* If `p0` fails (enters `TRANSIENT_FAILURE`) and the policy fails over to `p1`, the picker updates the active delay type to `"p1:connecting"`. +* Because the delay type has transitioned, the active child span is closed, a new child span `"p1:connecting"` is opened, and the picker records the exact failover reason as a spaced string event (which can include the logical priority name or child policy name from the configuration, e.g., `"waiting on priority group p1 (child 'tier-1-backup') (p0 failed: connection timeout)"`). This provides a clear, high-fidelity timeline of the failover sequence in the trace. 
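
The composition rules for container policies can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the proposed API: `PendingPick`, `wrapPriority`, and `wrapPassThrough` are hypothetical names introduced only for this example, and the strings are built with `fmt.Sprintf` for brevity, whereas a real picker would be expected to pre-compute its delay type when the picker is constructed rather than on each pick.

```go
package main

import "fmt"

// PendingPick is a hypothetical stand-in for the deferred pick result that
// bubbles up the picker tree. It is not part of any gRPC API.
type PendingPick struct {
	DelayType   string // low-cardinality metric label, e.g. "connecting"
	DelayReason string // free-form debug string for the span event
}

// wrapPriority models the priority picker: it prepends the active tier index
// to the child's delay type and mentions the tier in the reason.
func wrapPriority(tier int, child PendingPick) PendingPick {
	return PendingPick{
		DelayType:   fmt.Sprintf("p%d:%s", tier, child.DelayType),
		DelayReason: fmt.Sprintf("waiting on priority group p%d: %s", tier, child.DelayReason),
	}
}

// wrapPassThrough models xds_cluster_manager, weighted_target, or rls: the
// delay type is forwarded unchanged and only the reason is enriched.
func wrapPassThrough(childName string, child PendingPick) PendingPick {
	return PendingPick{
		DelayType:   child.DelayType,
		DelayReason: fmt.Sprintf("waiting for child '%s': %s", childName, child.DelayReason),
	}
}

func main() {
	leaf := PendingPick{
		DelayType:   "connecting",
		DelayReason: "subchannel connecting: TCP handshake in progress",
	}

	// The priority picker changes the type, so a failover from p0 to p1
	// produces a new delay type (and therefore a new child span).
	fmt.Println(wrapPriority(0, leaf).DelayType)
	fmt.Println(wrapPriority(1, leaf).DelayType)

	// A pass-through container leaves the type alone and enriches the reason.
	wrapped := wrapPassThrough("cluster-abc", leaf)
	fmt.Println(wrapped.DelayType)
	fmt.Println(wrapped.DelayReason)
}
```

Because only the priority wrapper alters `DelayType`, the metric label cardinality in this sketch is bounded by the number of priority tiers times the number of base types, while cluster names and similar details stay confined to the reason string.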
+ +##### Category F: Pass-Through Container Policy Scenarios (Type: `"connecting"` bubbled up directly) +Policies like `xds_cluster_manager`, `weighted_target`, and `rls` do not modify the metric `grpc.delay_type` (it remains strictly `"connecting"` or whatever leaf type is bubbled up from the child). +However, to provide visibility in tracing, the parent container policy's structural details (such as the target cluster name, RLS route shard, or weight group) are recorded inside the trace event's free-form description or as span attributes (e.g., `"waiting for child cluster 'cluster-abc' to connect"`). ### 3. Tracer API Changes -#### Language-Agnostic Lifecycle & State Machine +#### Lifecycle & State Machine The client channel, load balancer, and telemetry tracer coordinate synchronously to record delays without dynamic memory allocation during routing. -```mermaid -sequenceDiagram - autonumber - participant Channel as Client Channel - participant LB as Load Balancer (Picker) - participant Tracer as Telemetry Tracer - - Channel->>LB: PickSubchannel() - Note over LB: Subchannels idle or connecting - LB-->>Channel: Return PendingPickError (delay_type="connecting", delay_reason="subchannel_idle_trigger") - Channel->>Tracer: RecordDelayStart("connecting", "subchannel_idle_trigger") - Note over Tracer: Starts stopwatch & creates "Attempt Delay" span - - loop During buffering - Channel->>LB: Re-eval Pick - Note over LB: State transitions to TCP connecting - LB-->>Channel: Return PendingPickError (delay_type="connecting", delay_reason="subchannel_connecting") - Channel->>Tracer: RecordDelayReasonChanged("subchannel_connecting") - Note over Tracer: Emits span event "delay_state_transition" - end - - Channel->>LB: Re-eval Pick - LB-->>Channel: Return Subchannel (READY) - Channel->>Tracer: RecordDelayEnd() - Note over Tracer: Stops stopwatch, records metric, closes span -``` +##### 1. 
Timer Orchestration +* **Resolver / Control-Plane (Call-Level Delay)**: When the client channel is initialized or name resolution is re-triggered, the channel starts a logical timer and invokes `recordCallDelayStart("resolving", reason)` to create the `"Call Delay"` child span carrying the `grpc.delay_type = "resolving"` attribute. When the resolver successfully applies the first valid service config and endpoints, the channel stops the logical timer and invokes `recordCallDelayEnd()`. +* **LB Picker (Attempt-Level Delay)**: When an RPC attempt is initiated, the picker executes. If the picker defers the pick (e.g. returning `ErrNoSubConnAvailable`), it returns a pending result containing the metric delay type (e.g., `"connecting"`) and the free-form debug reason. The channel's attempt-routing wrapper starts a logical timer and invokes `recordAttemptDelayStart(type, reason)` to create the `"Attempt Delay"` child span. As the attempt remains buffered, subsequent picker evaluations may return different reasons, which the channel updates using `recordAttemptDelayReasonChanged(reason)` to append span events. When a picker evaluation successfully assigns a subchannel, the channel stops the logical timer and invokes `recordAttemptDelayEnd()`. +* **Scope Resolution**: The decision of whether a delay segment is Call-Level (recorded to `grpc.client.call.delay.duration`) or Attempt-Level (recorded to `grpc.client.attempt.delay.duration`) is determined entirely by the scope of the tracer object on which the callbacks are invoked. Call-level delays (such as `"resolving"`) are invoked strictly on the call-scoped tracer (e.g. `ClientCallTracer` in Go, `ClientStreamTracer.Factory` in Java, or `ClientCallTracerInterface` in C++). Attempt-level delays (such as `"connecting"`) are invoked strictly on the attempt-scoped tracer (e.g. `ClientCallAttemptTracer` in Go, `ClientStreamTracer` in Java, or `CallAttemptTracer` in C++). -1. 
**Initiating a Delay Segment**: - When the client channel encounters a blocking state (e.g., name resolution pending or load balancing pick deferred): - - A timer is started to measure the delay segment duration. - - A child span (`"Call Delay"` at the call level, or `"Attempt Delay"` at the attempt level) is created under the parent RPC span. - - The span is assigned the initial `grpc.delay_type` as a span attribute, and a `"delay_state_transition"` span event is emitted containing the `grpc.delay_reason` description. - -2. **Handling State Transitions**: - During an active delay, the internal state may transition (e.g., a subchannel reconnects, a different priority tier is chosen, or RLS updates its cache): - - **Scenario A: Only the reason changes (same delay type)**. - If the new state maps to the *same* `grpc.delay_type` but a different `grpc.delay_reason`: - - The active child span remains open. - - A new `"delay_state_transition"` span event is appended to the active span with the updated `grpc.delay_reason` string. - - *No metric is recorded* and the timer continues running. - - **Scenario B: The delay type changes**. - If the new state maps to a *different* `grpc.delay_type`: - - The timer is stopped. - - The duration is recorded to the corresponding histogram (`grpc.client.call.delay.duration` or `grpc.client.attempt.delay.duration`) using the old `grpc.delay_type` as a label. - - The active child span is closed. - - A new delay segment is immediately initiated (restarting the timer, creating a new child span with the new `grpc.delay_type` attribute, and emitting the initial transition event). - -3. **Concluding the Delay**: - When the blocking state resolves (e.g., name resolution completes, a picker successfully selects a connection, or the RPC is cancelled or timed out): - - The timer is stopped. - - The final duration is recorded to the histogram using the active `grpc.delay_type` as a label. - - The active child span is closed. +##### 2. 
Emission +Both metric duration recording and trace span management are fully delegated to the registered telemetry plugin (e.g., OpenTelemetry) via the tracer API: +* **Span Lifecycle**: Upon receiving the `recordDelayStart` signal, the telemetry plugin instantiates the child trace span. Subsequent `recordDelayReasonChanged` calls append `"Delay state transition"` events to the active span. +* **Simultaneous Metric & Span Close**: Upon receiving the `recordDelayEnd` signal, the telemetry plugin **simultaneously ends the child trace span and emits the final duration metric** to the corresponding duration histogram (`grpc.client.call.delay.duration` or `grpc.client.attempt.delay.duration`), measuring the elapsed time of the logical timer. -#### Language-Specific API Definitions +##### 3. Cancellation & Timeout +If an active RPC call or attempt is cancelled (due to client cancellation, `DEADLINE_EXCEEDED` timeouts, or channel shutdown) while a delay is logically active: +* The core channel **MUST** stop the logical timer. +* The core channel **MUST** invoke the tracer's end callback (`recordCallDelayEnd()` or `recordAttemptDelayEnd()`). +* This ensures the telemetry plugin can capture the elapsed duration up to the point of failure, close the trace span, and emit the partial duration metric, guaranteeing that bottlenecks preceding a failure remain fully observable. -##### Go -In `google.golang.org/grpc/stats`: -```go -package stats - -// DelayStart indicates the start of a delay segment. -type DelayStart struct { - // DelayType describes the type of delay (e.g., "load_balancing"). - DelayType string - // Reason is the initial dynamic debug string. - Reason string -} - -func (*DelayStart) IsClient() bool { return true } -func (*DelayStart) isRPCStats() {} +##### 4. Retries & Hedging +Each RPC attempt is tracked independently: +* **Attempt Isolation**: A transparent retry or hedged attempt will instantiate its own attempt-level tracer. 
+* **Independent Timing**: If the new attempt's picker is deferred, the attempt starts its own independent delay logical timer and child trace span, without affecting the call-level resolver timing or the timing of other concurrent attempts. -// DelayReasonChanged indicates a transition in the dynamic delay reason. -type DelayReasonChanged struct { - // Reason is the new dynamic debug string. - Reason string -} +#### Language-Specific API Definitions -func (*DelayReasonChanged) IsClient() bool { return true } -func (*DelayReasonChanged) isRPCStats() {} +##### Go (Experimental V2 Stats Handler Framework) +In gRPC-Go, the current `stats.Handler` interface has several critical limitations that prevent it from supporting modern telemetry needs and call-level observability: +1. **Strictly Attempt-Scoped**: All RPC events in `stats.Handler` (`TagRPC`, `HandleRPC`) are executed on individual HTTP/2 streams (attempts). There is no "Call-scoped" hook to track milestones that span multiple attempts or occur before an attempt is created (such as DNS name resolution, configuration parsing, or RLS control-plane queries). +2. **Type-Assertion Overhead**: The V1 `HandleRPC(context.Context, RPCStats)` method passes stats via the empty interface `RPCStats`. Implementations must perform runtime type assertions (e.g. `switch s := stat.(type)`) to process events. This degrades CPU performance and prevents compile-time contract enforcement. +3. **Dynamic Memory Allocations**: Wrapping every telemetry event in a struct (e.g. `stats.Begin`, `stats.InPayload`) and casting it to `RPCStats` requires heap allocations, which adds garbage collection pressure in high-throughput RPC paths. +4. **Coupled Connection & RPC Lifecycles**: Connection-level stats and RPC-level stats are mixed in a single interface, preventing modular plugin registration. -// DelayEnd indicates the end of a delay segment. 
-type DelayEnd struct{} +To completely replace the V1 stats handler with a high-performance, pluggable, and fully observable framework, gRPC-Go will introduce a new **V2 Stats Handler** framework as part of the existing `google.golang.org/grpc/stats` package. The V2 framework moves away from V1's struct-based event bubbling and adopts a **direct method/callback-based interface** with three isolated tracer scopes, decorated with standard Go experimental doc comments. -func (*DelayEnd) IsClient() bool { return true } -func (*DelayEnd) isRPCStats() {} -``` +###### 1. Interface Definitions in `google.golang.org/grpc/stats` -In `google.golang.org/grpc/internal/telemetry`: -*Note: This package is introduced internally as part of this gRFC to align Go's internal telemetry abstractions with C++ and Java.* ```go -package telemetry +package stats import ( "context" "net" + "time" "google.golang.org/grpc/metadata" ) -type Provider interface { - NewClientCallTracer(ctx context.Context, target string, method string, failFast bool, isClientStream bool, isServerStream bool) ClientCallTracer +// HandlerV2 is the factory interface that telemetry plugins implement. +// It instantiates stateful, scoped tracers for calls, attempts, and connections. +// +// Experimental: this interface is experimental and subject to change. +type HandlerV2 interface { + // NewCallTracer instantiates a CallTracer to monitor a client-side overall RPC call. + // The returned context is used throughout the lifetime of the call. + NewCallTracer(ctx context.Context, info *CallInfo) (context.Context, CallTracer) + + // NewAttemptTracer instantiates an AttemptTracer to monitor an individual RPC stream attempt. + // The returned context is used throughout the lifetime of the attempt. + NewAttemptTracer(ctx context.Context, info *AttemptInfo) (context.Context, AttemptTracer) + + // NewConnTracer instantiates a ConnTracer to monitor a physical transport connection. 
+ // The returned context is used throughout the lifetime of the connection. + NewConnTracer(ctx context.Context, info *ConnInfo) (context.Context, ConnTracer) } -type ClientCallTracer interface { +// CallTracer is a call-scoped interface tracking the overall client RPC call. +// +// Experimental: this interface is experimental and subject to change. +type CallTracer interface { + // RecordDelayStart indicates the start of a call-level delay (e.g., "resolving"). RecordDelayStart(delayType string, reason string) + // RecordDelayReasonChanged indicates a transition in the call-level delay reason. RecordDelayReasonChanged(reason string) + // RecordDelayEnd indicates the end of the call-level delay. RecordDelayEnd() - StartNewAttempt(ctx context.Context, isTransparent bool) (context.Context, ClientCallAttemptTracer) - RecordEnd(ctx context.Context, err error) + // OnCallEnd is called when the overall RPC call completes. + OnCallEnd(err error) } -type ClientCallAttemptTracer interface { - RecordBegin(ctx context.Context) +// AttemptTracer is an attempt-scoped interface tracking an individual stream attempt. +// +// Experimental: this interface is experimental and subject to change. +type AttemptTracer interface { + // RecordDelayStart indicates the start of an attempt-level delay (e.g., "connecting"). RecordDelayStart(delayType string, reason string) + // RecordDelayReasonChanged indicates a transition in the attempt-level delay reason. RecordDelayReasonChanged(reason string) + // RecordDelayEnd indicates the end of the attempt-level delay. 
RecordDelayEnd() - RecordOutHeader(ctx context.Context, remoteAddr net.Addr, localAddr net.Addr, compression string, md metadata.MD) - RecordInHeader(ctx context.Context, wireLength int, compression string, md metadata.MD) - RecordOutPayload(ctx context.Context, compressedLen int, uncompressedLen int) - RecordInPayload(ctx context.Context, compressedLen int, uncompressedLen int) - RecordInTrailer(ctx context.Context, wireLength int, md metadata.MD) - RecordEnd(ctx context.Context, err error) + + // OnHeaderSent/OnHeaderRecv replace V1 OutHeader and InHeader. + OnHeaderSent(compression string, md metadata.MD) + OnHeaderRecv(wireLength int, compression string, md metadata.MD) + + // OnPayloadSent/OnPayloadRecv replace V1 OutPayload and InPayload. + OnPayloadSent(payload any, length, compressedLength, wireLength int, sentTime time.Time) + OnPayloadRecv(payload any, length, compressedLength, wireLength int, recvTime time.Time) + + // OnTrailerSent/OnTrailerRecv replace V1 OutTrailer and InTrailer. + OnTrailerSent(md metadata.MD) + OnTrailerRecv(wireLength int, md metadata.MD) + + // OnAttemptEnd is called when this specific stream attempt completes (replacing V1 End). + OnAttemptEnd(err error) } -// GetClientCallAttemptTracer retrieves the active tracer implementation from the context. -func GetClientCallAttemptTracer(ctx context.Context) ClientCallAttemptTracer +// ConnTracer is a connection-scoped interface tracking physical transport connections. +type ConnTracer interface { + // OnConnEnd is called when the transport connection terminates (replacing V1 ConnEnd). + OnConnEnd() +} + +// Supporting Metadata Structs +type CallInfo struct { + FullMethod string + FailFast bool +} + +type AttemptInfo struct { + IsTransparentRetry bool + IsHedged bool + RemoteAddr net.Addr + LocalAddr net.Addr +} + +type ConnInfo struct { + RemoteAddr net.Addr + LocalAddr net.Addr +} ``` +###### 2. 
Coexistence & Migration Strategy +During the experimental phase, gRPC-Go will support both V1 and V2 interfaces concurrently to avoid breaking existing ecosystems (such as OpenCensus and older OpenTelemetry plugins): +* **Registration**: The channel option `grpc.WithStatsHandler()` will accept both `Handler` (V1) and `HandlerV2` (V2) types using interface checks. +* **Adaptation**: An internal adapter will be provided to wrap a V1 `Handler` into a V2 `HandlerV2` (converting method callbacks back into struct events and dispatching them to `HandleRPC`), ensuring older plugins continue to function. +* **Native Performance**: If a native V2 handler is registered (such as the new OpenTelemetry stats handler), the channel will bypass all V1 event struct allocations and type-assertions, running completely on the high-performance, zero-allocation callback path. + ##### Java -In `io.grpc.ClientStreamTracer`: -*Note: Methods are explicitly separated into Call-level and Attempt-level signatures to resolve scope collision, since Java utilizes a single ClientStreamTracer.* +In `io.grpc.ClientStreamTracer` and `ClientStreamTracer.Factory`: +*Note: Call-level delays (like name resolution) occur before an attempt is created. Because `ClientStreamTracer` is attempt-scoped, the call-level delay APIs are added to `ClientStreamTracer.Factory` (which is call-scoped and plumbed throughout the channel), while the attempt-level delay APIs remain on the `ClientStreamTracer` instance.* ```java package io.grpc; public abstract class ClientStreamTracer extends StreamTracer { /** - * Called when a call-level delay segment (e.g. name resolution) starts. + * Called when an attempt-level delay segment (e.g. LB Pick connection) starts. */ - public void recordCallDelayStart(String delayType, String delayReason) {} + public void recordAttemptDelayStart(String delayType, String delayReason) {} /** - * Called when a call-level delay reason changes. 
+ * Called when an attempt-level delay reason changes. */ - public void recordCallDelayReasonChanged(String delayReason) {} + public void recordAttemptDelayReasonChanged(String delayReason) {} /** - * Called when a call-level delay segment ends. + * Called when an attempt-level delay segment ends. */ - public void recordCallDelayEnd() {} + public void recordAttemptDelayEnd() {} +} +// Nested inside ClientStreamTracer: +public static abstract class Factory { /** - * Called when an attempt-level delay segment (e.g. LB Pick connection) starts. + * Called when a call-level delay segment (e.g. name resolution) starts. */ - public void recordAttemptDelayStart(String delayType, String delayReason) {} + public void recordCallDelayStart(String delayType, String delayReason) {} /** - * Called when an attempt-level delay reason changes. + * Called when a call-level delay reason changes. */ - public void recordAttemptDelayReasonChanged(String delayReason) {} + public void recordCallDelayReasonChanged(String delayReason) {} /** - * Called when an attempt-level delay segment ends. + * Called when a call-level delay segment ends. 
*/ - public void recordAttemptDelayEnd() {} + public void recordCallDelayEnd() {} + + public abstract ClientStreamTracer newClientStreamTracer(StreamInfo info, Metadata headers); } ``` @@ -349,239 +389,9 @@ class DelayAnnotation final : public CallTracerAnnotationInterface::Annotation { } // namespace grpc_core ``` - -#### Language-Specific Recording Logic (Code Examples) - -##### Go — `picker_wrapper.go` -```go -func (pw *pickerWrapper) pick(ctx context.Context, failfast bool, info balancer.PickInfo) (pick, error) { - var ch chan struct{} - var lastPickErr error - pickBlocked := false - - var delayStartTime time.Time - var delayType string - var delayReason string - var delayed bool - - for { - pg := pw.pickerGen.Load() - if pg == nil { - return pick{}, ErrClientConnClosing - } - if pg.picker == nil { - ch = pg.blockingCh - } - if ch == pg.blockingCh { - // We are about to block on the channel. Record connecting delay. - if !delayed { - delayed = true - delayStartTime = time.Now() - delayType = "connecting" - delayReason = "channel_connecting" - if tracer := telemetry.GetClientCallAttemptTracer(ctx); tracer != nil { - tracer.RecordDelayStart(delayType, delayReason) - } - } - // Block goroutine until next picker is updated - select { - case <-ctx.Done(): - if delayed { - duration := time.Since(delayStartTime).Seconds() - clientAttemptDelayDurationMetric.Record(metricsRecorder, duration, target, delayType, info.FullMethodName) - if tracer := telemetry.GetClientCallAttemptTracer(ctx); tracer != nil { - tracer.RecordDelayEnd() - } - } - return pick{}, ctx.Err() - case <-ch: - } - continue - } - - if ch != nil { - pickBlocked = true - } - ch = pg.blockingCh - p := pg.picker - - pickResult, err := p.Pick(info) - if err != nil { - if err == balancer.ErrNoSubConnAvailable { - currentType := "connecting" - currentReason := "subchannel_connecting" - if qe, ok := err.(*balancer.PendingPickError); ok { - currentType = qe.DelayType - currentReason = qe.DelayReason - } - - if 
!delayed { - delayed = true - delayStartTime = time.Now() - delayType = currentType - delayReason = currentReason - if tracer := telemetry.GetClientCallAttemptTracer(ctx); tracer != nil { - tracer.RecordDelayStart(delayType, delayReason) - } - } else if currentType != delayType { - // Type changed: Record metric and restart span. - duration := time.Since(delayStartTime).Seconds() - clientAttemptDelayDurationMetric.Record(metricsRecorder, duration, target, delayType, info.FullMethodName) - if tracer := telemetry.GetClientCallAttemptTracer(ctx); tracer != nil { - tracer.RecordDelayEnd() - tracer.RecordDelayStart(currentType, currentReason) - } - delayStartTime = time.Now() - delayType = currentType - delayReason = currentReason - } else if currentReason != delayReason { - // Only reason changed: Record span event. - delayReason = currentReason - if tracer := telemetry.GetClientCallAttemptTracer(ctx); tracer != nil { - tracer.RecordDelayReasonChanged(delayReason) - } - } - continue - } - // ... non-delay error handling ... - } - - // Success path - if delayed { - duration := time.Since(delayStartTime).Seconds() - clientAttemptDelayDurationMetric.Record(metricsRecorder, duration, target, delayType, info.FullMethodName) - if tracer := telemetry.GetClientCallAttemptTracer(ctx); tracer != nil { - tracer.RecordDelayEnd() - } - } - return pickResult, nil - } -} -``` - -##### Java — `DelayedClientTransport.java` -```java -// Tracer callbacks are invoked strictly OUTSIDE DelayedClientTransport locks. -// AtomicBoolean guards guarantee recordDelayEnd() is executed exactly once. 
- -private class PendingStream extends DelayedStream { - private final PickSubchannelArgs args; - private final ClientStreamTracer[] tracers; - private final long delayStartNanos; - private final String delayType; - private final String delayReason; - private final AtomicBoolean delayRecorded = new AtomicBoolean(); - - PendingStream(PickSubchannelArgs args, ClientStreamTracer[] tracers, PickResult pickResult) { - this.args = args; - this.tracers = tracers; - this.delayStartNanos = System.nanoTime(); - this.delayType = pickResult != null ? pickResult.getDelayType() : "connecting"; - this.delayReason = pickResult != null ? pickResult.getDelayReason() : "channel_connecting"; - } - - void recordDelayEnd(MetricsRecorder recorder, String target) { - if (delayRecorded.compareAndSet(false, true)) { - long durationNanos = System.nanoTime() - this.delayStartNanos; - double durationSeconds = durationNanos / 1_000_000_000.0; - recorder.recordClientAttemptDelayDuration(durationSeconds, target, this.delayType); - for (ClientStreamTracer tracer : this.tracers) { - tracer.recordAttemptDelayEnd(); - } - } - } -} - -// In DelayedClientTransport.newStream(): -PendingStream pendingStream = null; -synchronized (lock) { - if (state == SHUTDOWN) { - return new FailingClientStream(status); - } - if (transport == null) { - pendingStream = new PendingStream(args, tracers, pickResult); - pendingStreams.add(pendingStream); - } -} -if (pendingStream != null) { - // Invoke tracer callbacks OUTSIDE critical locks - for (ClientStreamTracer tracer : tracers) { - tracer.recordAttemptDelayStart(pendingStream.delayType, pendingStream.delayReason); - } -} -``` - -##### C++ (Core) — `load_balanced_call_destination.cc` -```cpp -// LbDelayState is an RAII object that guarantees RecordDelayEnd is called on destruction (cancellation). -// Persisted across asynchronous Loop iterations by capturing it in the Loop lambda state. 
-
-class LbDelayState {
- public:
-  LbDelayState(ClientCallTracerInterface::CallAttemptTracer* tracer, std::string target)
-      : tracer_(tracer), target_(std::move(target)) {}
-
-  ~LbDelayState() {
-    if (delay_start_time_.has_value()) {
-      // RAII Cancellation: Record final metric and close tracer
-      RecordEnd();
-    }
-  }
-
-  void Update(absl::string_view type, absl::string_view reason) {
-    if (!delay_start_time_.has_value()) {
-      delay_start_time_ = Timestamp::Now();
-      type_ = std::string(type);
-      reason_ = std::string(reason);
-      tracer_->RecordAnnotation(DelayAnnotation(DelayAnnotation::Stage::kStart, type_, reason_));
-    } else if (type != type_) {
-      RecordEnd();
-      delay_start_time_ = Timestamp::Now();
-      type_ = std::string(type);
-      reason_ = std::string(reason);
-      tracer_->RecordAnnotation(DelayAnnotation(DelayAnnotation::Stage::kStart, type_, reason_));
-    } else if (reason != reason_) {
-      reason_ = std::string(reason);
-      tracer_->RecordAnnotation(DelayAnnotation(DelayAnnotation::Stage::kReasonChanged, type_, reason_));
-    }
-  }
-
-  void RecordEnd() {
-    if (!delay_start_time_.has_value()) return;
-    Duration duration = Timestamp::Now() - *delay_start_time_;
-    stats_plugin_group.RecordHistogram(
-        kClientAttemptDelayDurationHandle,
-        duration.seconds(),
-        {target_, type_},
-        {});
-    tracer_->RecordAnnotation(DelayAnnotation(DelayAnnotation::Stage::kEnd, type_, reason_));
-    delay_start_time_.reset();
-  }
-
- private:
-  ClientCallTracerInterface::CallAttemptTracer* tracer_;
-  std::string target_;
-  std::optional<Timestamp> delay_start_time_;
-  std::string type_;
-  std::string reason_;
-};
-
-// Inside UnstartedCallHandler::StartCall mutable loop lambda:
-auto delay_state = std::make_shared<LbDelayState>(call_tracer, target);
-
-return Loop(
-  [delay_state, picker, unstarted_handler]() mutable {
-    return PickSubchannel(
-      *picker,
-      *unstarted_handler,
-      delay_state.get() // Pass state pointer to persist across asynchronous iterations
-    );
-  }
-);
-```
-
 ### Feature Flag
+
 All delay metrics,
tracing, and API hooks will be guarded by a feature flag: * **Go/Java Env Var**: `GRPC_EXPERIMENTAL_ENABLE_DELAY_OBSERVABILITY` (Default: `false`) * **C++ Core Experiment**: `IsExperimentEnabled("client_delay_observability")` (registered in `experiments.h`) @@ -594,7 +404,7 @@ To minimize overhead, pickers pre-compute tokens at configuration time, avoiding ## Implementation -We will implement this in Go, Java, and C++ (Core), in that order. +We will implement this in Go, Java, and C++ (Core). From d39ed0f6cfe9f70814b266945caed685b3a9c356 Mon Sep 17 00:00:00 2001 From: Madhav Bissa Date: Tue, 23 Jun 2026 03:44:24 +0000 Subject: [PATCH 05/12] A121 gRFC title and filename renaming --- ...md => A121-client-side-rpc-delay-observability.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) rename grpc-lb-delay-metrics.md => A121-client-side-rpc-delay-observability.md (98%) diff --git a/grpc-lb-delay-metrics.md b/A121-client-side-rpc-delay-observability.md similarity index 98% rename from grpc-lb-delay-metrics.md rename to A121-client-side-rpc-delay-observability.md index e86f0143d..49200193f 100644 --- a/grpc-lb-delay-metrics.md +++ b/A121-client-side-rpc-delay-observability.md @@ -1,7 +1,7 @@ -# RPC Delay Observability +# A121: Client-Side RPC Delay Observability ---- -* Author(s): Madhav Bissa (@madhavbissa) -* Approver: markdroth +* Author(s): Madhav Bissa (@mbissa) +* Approver: @markdroth, @ejona86, @dfawley, @easwars * Implemented in: Go, Java, C++ * Last updated: 2026-06-19 * Discussion at: (filled after thread exists) @@ -100,7 +100,7 @@ The following metrics are registered as client-side per-call metrics, extending | **Bucket Boundaries** | Same as A66 latency buckets: 0, 0.00001, 0.00005, 0.0001, 0.0003, 0.0006, 0.0008, 0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.008, 0.01, 0.013, 0.016, 0.02, 0.025, 0.03, 0.04, 0.05, 0.065, 0.08, 0.1, 0.13, 0.16, 0.2, 0.25, 0.3, 0.4, 0.5, 0.65, 0.8, 1, 2, 5, 10, 20, 50, 100 | | **Default Enabled** | `false` 
(experimental, opt-in) | -### 2. Telemetry Value Taxonomy +### 2. Delay Types & Reasons To ensure consistency across implementations, we define the taxonomy of metric and tracing labels. @@ -177,7 +177,7 @@ Policies like `xds_cluster_manager`, `weighted_target`, and `rls` do not modify However, to provide visibility in tracing, the parent container policy's structural details (such as the target cluster name, RLS route shard, or weight group) are recorded inside the trace event's free-form description or as span attributes (e.g., `"waiting for child cluster 'cluster-abc' to connect"`). -### 3. Tracer API Changes +### 3. Telemetry API #### Lifecycle & State Machine @@ -186,7 +186,7 @@ The client channel, load balancer, and telemetry tracer coordinate synchronously ##### 1. Timer Orchestration * **Resolver / Control-Plane (Call-Level Delay)**: When the client channel is initialized or name resolution is re-triggered, the channel starts a logical timer and invokes `recordCallDelayStart("resolving", reason)` to create the `"Call Delay"` child span carrying the `grpc.delay_type = "resolving"` attribute. When the resolver successfully applies the first valid service config and endpoints, the channel stops the logical timer and invokes `recordCallDelayEnd()`. * **LB Picker (Attempt-Level Delay)**: When an RPC attempt is initiated, the picker executes. If the picker defers the pick (e.g. returning `ErrNoSubConnAvailable`), it returns a pending result containing the metric delay type (e.g., `"connecting"`) and the free-form debug reason. The channel's attempt-routing wrapper starts a logical timer and invokes `recordAttemptDelayStart(type, reason)` to create the `"Attempt Delay"` child span. As the attempt remains buffered, subsequent picker evaluations may return different reasons, which the channel updates using `recordAttemptDelayReasonChanged(reason)` to append span events. 
When a picker evaluation successfully assigns a subchannel, the channel stops the logical timer and invokes `recordAttemptDelayEnd()`. -* **Scope Resolution**: The decision of whether a delay segment is Call-Level (recorded to `grpc.client.call.delay.duration`) or Attempt-Level (recorded to `grpc.client.attempt.delay.duration`) is determined entirely by the scope of the tracer object on which the callbacks are invoked. Call-level delays (such as `"resolving"`) are invoked strictly on the call-scoped tracer (e.g. `ClientCallTracer` in Go, `ClientStreamTracer.Factory` in Java, or `ClientCallTracerInterface` in C++). Attempt-level delays (such as `"connecting"`) are invoked strictly on the attempt-scoped tracer (e.g. `ClientCallAttemptTracer` in Go, `ClientStreamTracer` in Java, or `CallAttemptTracer` in C++). +* **Scope Resolution**: The decision of whether a delay segment is Call-Level (recorded to `grpc.client.call.delay.duration`) or Attempt-Level (recorded to `grpc.client.attempt.delay.duration`) is determined entirely by the scope of the tracer object on which the callbacks are invoked. Call-level delays (such as `"resolving"`) are invoked strictly on the call-scoped tracer object, while attempt-level delays (such as `"connecting"`) are invoked strictly on the attempt-scoped tracer object. This mapping is enforced statically by the respective language API signatures. ##### 2. 
Emission Both metric duration recording and trace span management are fully delegated to the registered telemetry plugin (e.g., OpenTelemetry) via the tracer API: From 386b1f83419eb34dd51f5b07c3a22d1b379f1a6d Mon Sep 17 00:00:00 2001 From: Madhav Bissa Date: Tue, 23 Jun 2026 03:48:20 +0000 Subject: [PATCH 06/12] fixing title and filename --- ...pc-delay-observability.md => A121-rpc-delay-observability.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) rename A121-client-side-rpc-delay-observability.md => A121-rpc-delay-observability.md (99%) diff --git a/A121-client-side-rpc-delay-observability.md b/A121-rpc-delay-observability.md similarity index 99% rename from A121-client-side-rpc-delay-observability.md rename to A121-rpc-delay-observability.md index 49200193f..24bc3de88 100644 --- a/A121-client-side-rpc-delay-observability.md +++ b/A121-rpc-delay-observability.md @@ -1,4 +1,4 @@ -# A121: Client-Side RPC Delay Observability +# A121: RPC Delay Observability ---- * Author(s): Madhav Bissa (@mbissa) * Approver: @markdroth, @ejona86, @dfawley, @easwars From a7b6979a68f2528ba980bfb88a9ec93e9b2bec5e Mon Sep 17 00:00:00 2001 From: Madhav Bissa Date: Tue, 23 Jun 2026 10:44:48 +0530 Subject: [PATCH 07/12] v1 feature parity for go stats handler and detailed rationale --- A121-rpc-delay-observability.md | 41 ++++++++++++++++++++++++--------- 1 file changed, 30 insertions(+), 11 deletions(-) diff --git a/A121-rpc-delay-observability.md b/A121-rpc-delay-observability.md index 24bc3de88..11aeadd51 100644 --- a/A121-rpc-delay-observability.md +++ b/A121-rpc-delay-observability.md @@ -118,8 +118,8 @@ Recorded on the `grpc.client.attempt.delay.duration` histogram. These represent * `"connecting"`: The attempt is delayed waiting for subchannel connection establishment or picker initialization. * `"rls_lookup_pending"`: Specifically for Route Lookup Service (RLS) control-plane cache-miss query lookups. 
* `"cds_dynamic_discovery"`: Specifically for xDS Cluster Discovery Service (CDS) dynamic metadata resource fetches. -* `"subchannel_state_mismatch"`: The target subchannel transitioned out of `READY` (e.g., disconnected or went to `TRANSIENT_FAILURE`), but the active channel picker has not yet been updated with a new picker. -* `"picker_failing_with_wait_for_ready"`: A `wait_for_ready` RPC is buffered/queued because the picker is in `TRANSIENT_FAILURE`, waiting for the next connection attempt to succeed. +* `"subchannel_state_mismatch"`: The target subchannel transitioned out of `READY` (e.g., disconnected or went to `TRANSIENT_FAILURE`), but the active channel picker has not yet been updated with a new picker. *Note: This delay type is generated by the channel intercepting a stale pick, not by the picker itself.* +* `"picker_failing_with_wait_for_ready"`: A `wait_for_ready` RPC is buffered/queued because the picker is in `TRANSIENT_FAILURE`, waiting for the next connection attempt to succeed. *Note: This delay type is generated by the channel wrapper detecting a `wait_for_ready` RPC when the picker returns an error, keeping pickers ignorant of `wait_for_ready` semantics.* ##### 3. Composed Attempt-Level Types Structural container policies prepend their logical prefixes to the base attempt-level types using a colon separator: @@ -133,7 +133,7 @@ Structural container policies prepend their logical prefixes to the base attempt The prepending and wrapping logic is handled entirely inside the Balancer Picker tree hierarchy, keeping the channel's attempt-routing wrapper and tracer plugins completely decoupled and simple: 1. **Leaf Pickers** (such as `pick_first` or `round_robin`): Generate the base leaf types (e.g., `delay_type = "connecting"`) and the initial, descriptive `delay_reason` (e.g., `"subchannel connecting: TCP handshake in progress"`). 2. **Container Pickers**: Intercept the child's deferred pick result as it bubbles up the picker tree. 
The `priority` picker prepends its active tier index to the type (e.g., producing `"p0:connecting"`). A pass-through picker (such as `xds_cluster_manager`) forwards the child's `delay_type` unmodified, but can enrich the `delay_reason` with its own structural details. -3. **The Channel Wrapper**: Receives the final, fully-composed `delay_type` and `delay_reason` strings from the root picker and passes them directly to the tracer (`recordAttemptDelayStart`) without any parsing, branching, or type checks. +3. **The Channel Wrapper**: Receives the final, fully-composed `delay_type` and `delay_reason` strings from the root picker and passes them directly to the tracer (`recordAttemptDelayStart`). The channel wrapper is also responsible for intercepting specific channel-level states (e.g., generating `"picker_failing_with_wait_for_ready"` when a picker returns an error for a `wait_for_ready` RPC, or `"subchannel_state_mismatch"` when a transport disconnects before the picker is updated). --- @@ -184,7 +184,7 @@ However, to provide visibility in tracing, the parent container policy's structu The client channel, load balancer, and telemetry tracer coordinate synchronously to record delays without dynamic memory allocation during routing. ##### 1. Timer Orchestration -* **Resolver / Control-Plane (Call-Level Delay)**: When the client channel is initialized or name resolution is re-triggered, the channel starts a logical timer and invokes `recordCallDelayStart("resolving", reason)` to create the `"Call Delay"` child span carrying the `grpc.delay_type = "resolving"` attribute. When the resolver successfully applies the first valid service config and endpoints, the channel stops the logical timer and invokes `recordCallDelayEnd()`. 
+* **Resolver / Control-Plane (Call-Level Delay)**: When an RPC is blocked waiting for name resolution or configuration parsing to complete, the channel starts a logical timer and invokes `recordCallDelayStart("resolving", reason)` to create the `"Call Delay"` child span carrying the `grpc.delay_type = "resolving"` attribute. When the resolver successfully applies the first valid service config and endpoints, the channel stops the logical timer and invokes `recordCallDelayEnd()`. (Note: If resolution completes before any RPC is blocked, no delay is recorded.) * **LB Picker (Attempt-Level Delay)**: When an RPC attempt is initiated, the picker executes. If the picker defers the pick (e.g. returning `ErrNoSubConnAvailable`), it returns a pending result containing the metric delay type (e.g., `"connecting"`) and the free-form debug reason. The channel's attempt-routing wrapper starts a logical timer and invokes `recordAttemptDelayStart(type, reason)` to create the `"Attempt Delay"` child span. As the attempt remains buffered, subsequent picker evaluations may return different reasons, which the channel updates using `recordAttemptDelayReasonChanged(reason)` to append span events. When a picker evaluation successfully assigns a subchannel, the channel stops the logical timer and invokes `recordAttemptDelayEnd()`. * **Scope Resolution**: The decision of whether a delay segment is Call-Level (recorded to `grpc.client.call.delay.duration`) or Attempt-Level (recorded to `grpc.client.attempt.delay.duration`) is determined entirely by the scope of the tracer object on which the callbacks are invoked. Call-level delays (such as `"resolving"`) are invoked strictly on the call-scoped tracer object, while attempt-level delays (such as `"connecting"`) are invoked strictly on the attempt-scoped tracer object. This mapping is enforced statically by the respective language API signatures. 
@@ -204,7 +204,12 @@ Each RPC attempt is tracked independently: * **Attempt Isolation**: A transparent retry or hedged attempt will instantiate its own attempt-level tracer. * **Independent Timing**: If the new attempt's picker is deferred, the attempt starts its own independent delay logical timer and child trace span, without affecting the call-level resolver timing or the timing of other concurrent attempts. -#### Language-Specific API Definitions +#### API Definitions + +##### 1. Picker-Side API +Pickers must pass the `delay_type` and `delay_reason` metadata back to the channel when they defer a pick as part of the Pick Result. + +##### 2. Tracer-Side API ##### Go (Experimental V2 Stats Handler Framework) In gRPC-Go, the current `stats.Handler` interface has several critical limitations that prevent it from supporting modern telemetry needs and call-level observability: @@ -295,13 +300,17 @@ type ConnTracer interface { // Supporting Metadata Structs type CallInfo struct { - FullMethod string - FailFast bool + FullMethodName string + FailFast bool + IsClientStream bool + IsServerStream bool } type AttemptInfo struct { + FullMethodName string IsTransparentRetry bool IsHedged bool + PreviousAttempts int RemoteAddr net.Addr LocalAddr net.Addr } @@ -374,7 +383,7 @@ class DelayAnnotation final : public CallTracerAnnotationInterface::Annotation { DelayAnnotation(Stage stage, absl::string_view type, absl::string_view reason) : Annotation(CallTracerAnnotationInterface::AnnotationType::kDelay), - stage_(stage), type_(type), reason_(reason) {} + stage_(stage), type_(type), reason_(std::string(reason)) {} Stage stage() const { return stage_; } absl::string_view type() const { return type_; } @@ -383,7 +392,7 @@ class DelayAnnotation final : public CallTracerAnnotationInterface::Annotation { private: Stage stage_; absl::string_view type_; - absl::string_view reason_; + std::string reason_; }; } // namespace grpc_core @@ -396,11 +405,21 @@ All delay metrics, tracing, and API 
hooks will be guarded by a feature flag: * **Go/Java Env Var**: `GRPC_EXPERIMENTAL_ENABLE_DELAY_OBSERVABILITY` (Default: `false`) * **C++ Core Experiment**: `IsExperimentEnabled("client_delay_observability")` (registered in `experiments.h`) +## Metric Registration and Recording + +Metric registration and recording will follow the established architectural patterns defined in [gRPC A66][A66] and [gRPC A94][A94]. The channel and attempts will delegate duration measurement directly to the registered telemetry plugin (e.g., OpenTelemetry) to avoid code duplication across the core stack. The specific plugin API registrations are omitted here for brevity but will conform to language-idiomatic OTel APIs (e.g., `RegisterFloat64Histo` in Go, `registerDoubleHistogram` in Java, and `RegisterDoubleHistogram` in C++). + ## Rationale -We implement recording at the client channel level rather than inside specific load balancing policies because only the client channel manages the buffering, queueing, and context cancellation lifecycles of RPC calls. This decouples policy-level state reporting from duration measurement. +**Call vs. Attempt Metrics:** We split metrics into two distinct histograms (Call-level vs Attempt-level) to clearly differentiate between channel initialization bottlenecks (like DNS or service configuration) and per-attempt routing bottlenecks. Combining them into a single histogram would obscure whether a delay occurred before or during connection establishment. + +**Child Spans vs. Events:** We use child spans for delays instead of attaching events to the parent RPC span because it provides better visual flame-graph isolation in tracing backends, especially when delays are long or when multiple attempts are made (e.g., hedging or retries). + +**Free-form `delay_reason`:** We explicitly chose to allow unconstrained, free-form debug strings for the trace event's `delay_reason` rather than bounded enum tokens. 
While this risks higher cardinality in tracing backends, it provides immense diagnostic value by allowing raw IP addresses, subchannel targets, and detailed connection error messages to be embedded directly into the trace. The metric label (`grpc.delay_type`) remains strictly bounded to ensure metric cardinality is kept low. + +**Channel-Intercepted Delays:** We implement specific delay types like `picker_failing_with_wait_for_ready` and `subchannel_state_mismatch` as channel-level interceptions rather than picker-generated types. This intentionally keeps leaf pickers ignorant of `wait_for_ready` semantics and transport-level race conditions, enforcing a clean separation of concerns where the picker only reports its current state, and the channel dictates queueing behavior. -To minimize overhead, pickers pre-compute tokens at configuration time, avoiding dynamic string allocations during picks. Dynamic tokens are only used for policies that perform per-request routing (e.g., RLS, xDS cluster manager). +**Delegated Recording:** We implement duration recording at the client channel wrapper level rather than inside specific load balancing policies because only the client channel manages the buffering, queueing, and context cancellation lifecycles of RPC calls. This decouples policy-level state reporting from duration measurement. 
## Implementation From db81e29b1264db24400fa055b1dea83b8cc1a007 Mon Sep 17 00:00:00 2001 From: Madhav Bissa Date: Thu, 25 Jun 2026 14:00:00 +0000 Subject: [PATCH 08/12] descope go v2 api verbose changes and cleanup api change section to keep simple --- A121-rpc-delay-observability.md | 224 ++++++++------------------------ 1 file changed, 53 insertions(+), 171 deletions(-) diff --git a/A121-rpc-delay-observability.md b/A121-rpc-delay-observability.md index 11aeadd51..30335fc67 100644 --- a/A121-rpc-delay-observability.md +++ b/A121-rpc-delay-observability.md @@ -3,8 +3,8 @@ * Author(s): Madhav Bissa (@mbissa) * Approver: @markdroth, @ejona86, @dfawley, @easwars * Implemented in: Go, Java, C++ -* Last updated: 2026-06-19 -* Discussion at: (filled after thread exists) +* Last updated: 2026-06-25 +* Discussion at: https://groups.google.com/g/grpc-io/c/NsxXJ2MxXM4 ## Abstract @@ -111,7 +111,7 @@ The `grpc.delay_type` label is a low-cardinality, restricted set of values. To e ##### 1. Call-Level Delay Types Recorded on the `grpc.client.call.delay.duration` histogram. These represent delays that occur before an individual attempt is created: -* `"resolving"`: The channel is delayed waiting for name resolution or configuration parsing. +* `"resolving"`: The channel is delayed waiting for name resolution and configuration parsing. ##### 2. Attempt-Level Delay Types Recorded on the `grpc.client.attempt.delay.duration` histogram. These represent delays that occur during a specific RPC attempt: @@ -129,6 +129,8 @@ Structural container policies prepend their logical prefixes to the base attempt * `p0:picker_failing_with_wait_for_ready` * **Pass-Through Container Policies**: Policies like `xds_cluster_manager`, `weighted_target`, and `rls` **do not prepend any prefix or wrap the delay type**. They simply bubble up the child's `grpc.delay_type` (e.g., `"connecting"`) directly as-is. 
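The composition rule above can be sketched as two small wrappers, one for the priority policy and one for pass-through containers. This is a minimal sketch under stated assumptions: `deferredPick`, `priorityWrap`, and `passThroughWrap` are hypothetical names and not part of any gRPC API, and the reason-enrichment format is invented for illustration.

```go
package main

import (
	"fmt"
	"strconv"
)

// deferredPick is a hypothetical stand-in for the metadata a picker surfaces
// when it cannot return a ready connection yet. It is not a gRPC-Go type.
type deferredPick struct {
	DelayType   string // bounded metric label, e.g. "connecting"
	DelayReason string // free-form diagnostic text for tracing
}

// priorityWrap models the priority policy: it prepends its active tier index
// to the child's delay type using a colon separator.
func priorityWrap(tier int, child deferredPick) deferredPick {
	return deferredPick{
		DelayType:   "p" + strconv.Itoa(tier) + ":" + child.DelayType,
		DelayReason: child.DelayReason,
	}
}

// passThroughWrap models xds_cluster_manager, weighted_target, or rls: the
// delay type is forwarded unmodified and only the reason is enriched.
func passThroughWrap(detail string, child deferredPick) deferredPick {
	return deferredPick{
		DelayType:   child.DelayType,
		DelayReason: detail + ": " + child.DelayReason,
	}
}

func main() {
	leaf := deferredPick{DelayType: "connecting", DelayReason: "subchannel connecting"}
	composed := passThroughWrap("cluster_manager child=foo", priorityWrap(0, leaf))
	fmt.Println(composed.DelayType)
	fmt.Println(composed.DelayReason)
}
```

In this sketch the composed type is `p0:connecting` regardless of how many pass-through containers sit above the priority policy, since only `priorityWrap` touches `DelayType`.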
+Because only the priority policy contributes a prefix and it contributes exactly one (its active tier index), the resulting `grpc.delay_type` cardinality stays bounded at roughly *(base attempt-level types) × (configured priority tiers)*; prefixes do not stack across nested pass-through containers. Implementations SHOULD keep the prefix a small bounded token (the tier index) so metric cardinality remains low; richer per-container structure belongs in the `grpc.delay_reason` span event, not the metric label. + ##### Metadata Propagation & Composition The prepending and wrapping logic is handled entirely inside the Balancer Picker tree hierarchy, keeping the channel's attempt-routing wrapper and tracer plugins completely decoupled and simple: 1. **Leaf Pickers** (such as `pick_first` or `round_robin`): Generate the base leaf types (e.g., `delay_type = "connecting"`) and the initial, descriptive `delay_reason` (e.g., `"subchannel connecting: TCP handshake in progress"`). @@ -184,8 +186,9 @@ However, to provide visibility in tracing, the parent container policy's structu The client channel, load balancer, and telemetry tracer coordinate synchronously to record delays without dynamic memory allocation during routing. ##### 1. Timer Orchestration -* **Resolver / Control-Plane (Call-Level Delay)**: When an RPC is blocked waiting for name resolution or configuration parsing to complete, the channel starts a logical timer and invokes `recordCallDelayStart("resolving", reason)` to create the `"Call Delay"` child span carrying the `grpc.delay_type = "resolving"` attribute. When the resolver successfully applies the first valid service config and endpoints, the channel stops the logical timer and invokes `recordCallDelayEnd()`. (Note: If resolution completes before any RPC is blocked, no delay is recorded.) -* **LB Picker (Attempt-Level Delay)**: When an RPC attempt is initiated, the picker executes. If the picker defers the pick (e.g. 
returning `ErrNoSubConnAvailable`), it returns a pending result containing the metric delay type (e.g., `"connecting"`) and the free-form debug reason. The channel's attempt-routing wrapper starts a logical timer and invokes `recordAttemptDelayStart(type, reason)` to create the `"Attempt Delay"` child span. As the attempt remains buffered, subsequent picker evaluations may return different reasons, which the channel updates using `recordAttemptDelayReasonChanged(reason)` to append span events. When a picker evaluation successfully assigns a subchannel, the channel stops the logical timer and invokes `recordAttemptDelayEnd()`. +The channel owns the **lifecycle** of a delay segment (deciding when it starts, transitions, and ends) while the telemetry plugin owns the **timing** of that segment (it timestamps the start on `recordDelayStart`, computes the elapsed duration on `recordDelayEnd`, and emits the histogram). The channel therefore does not maintain a separate duration clock; "logical timer" below refers to this plugin-side measurement, delimited by the channel's start/end signals. +* **Resolver / Control-Plane (Call-Level Delay)**: When an RPC is blocked waiting for name resolution, the channel invokes `recordCallDelayStart("resolving", reason)` to open the `"Call Delay"` child span carrying the `grpc.delay_type = "resolving"` attribute. When the resolver successfully applies the first valid service config and endpoints, the channel invokes `recordCallDelayEnd()`. (Note: If resolution completes before any RPC is blocked, no delay is recorded.) +* **LB Picker (Attempt-Level Delay)**: When an RPC attempt is initiated, the picker executes. If the picker defers the pick (i.e. it cannot return a ready connection yet), it surfaces the metric delay type (e.g., `"connecting"`) and the free-form debug reason alongside the existing "no connection available" signal. 
The channel's attempt-routing wrapper invokes `recordAttemptDelayStart(type, reason)` to open the `"Attempt Delay"` child span. As the attempt remains buffered, subsequent picker evaluations may return different reasons, which the channel updates using `recordAttemptDelayReasonChanged(reason)` to append span events. When a picker evaluation successfully assigns a subchannel, the channel invokes `recordAttemptDelayEnd()`. * **Scope Resolution**: The decision of whether a delay segment is Call-Level (recorded to `grpc.client.call.delay.duration`) or Attempt-Level (recorded to `grpc.client.attempt.delay.duration`) is determined entirely by the scope of the tracer object on which the callbacks are invoked. Call-level delays (such as `"resolving"`) are invoked strictly on the call-scoped tracer object, while attempt-level delays (such as `"connecting"`) are invoked strictly on the attempt-scoped tracer object. This mapping is enforced statically by the respective language API signatures. ##### 2. Emission @@ -195,8 +198,7 @@ Both metric duration recording and trace span management are fully delegated to ##### 3. Cancellation & Timeout If an active RPC call or attempt is cancelled (due to client cancellation, `DEADLINE_EXCEEDED` timeouts, or channel shutdown) while a delay is logically active: -* The core channel **MUST** stop the logical timer. -* The core channel **MUST** invoke the tracer's end callback (`recordCallDelayEnd()` or `recordAttemptDelayEnd()`). +* The core channel **MUST** invoke the tracer's end callback (`recordCallDelayEnd()` or `recordAttemptDelayEnd()`), which finalizes the plugin-side timing for the open segment. * This ensures the telemetry plugin can capture the elapsed duration up to the point of failure, close the trace span, and emit the partial duration metric, guaranteeing that bottlenecks preceding a failure remain fully observable. ##### 4. 
Retries & Hedging @@ -207,195 +209,75 @@ Each RPC attempt is tracked independently: #### API Definitions ##### 1. Picker-Side API -Pickers must pass the `delay_type` and `delay_reason` metadata back to the channel when they defer a pick as part of the Pick Result. - -##### 2. Tracer-Side API - -##### Go (Experimental V2 Stats Handler Framework) -In gRPC-Go, the current `stats.Handler` interface has several critical limitations that prevent it from supporting modern telemetry needs and call-level observability: -1. **Strictly Attempt-Scoped**: All RPC events in `stats.Handler` (`TagRPC`, `HandleRPC`) are executed on individual HTTP/2 streams (attempts). There is no "Call-scoped" hook to track milestones that span multiple attempts or occur before an attempt is created (such as DNS name resolution, configuration parsing, or RLS control-plane queries). -2. **Type-Assertion Overhead**: The V1 `HandleRPC(context.Context, RPCStats)` method passes stats via the empty interface `RPCStats`. Implementations must perform runtime type assertions (e.g. `switch s := stat.(type)`) to process events. This degrades CPU performance and prevents compile-time contract enforcement. -3. **Dynamic Memory Allocations**: Wrapping every telemetry event in a struct (e.g. `stats.Begin`, `stats.InPayload`) and casting it to `RPCStats` requires heap allocations, which adds garbage collection pressure in high-throughput RPC paths. -4. **Coupled Connection & RPC Lifecycles**: Connection-level stats and RPC-level stats are mixed in a single interface, preventing modular plugin registration. - -To completely replace the V1 stats handler with a high-performance, pluggable, and fully observable framework, gRPC-Go will introduce a new **V2 Stats Handler** framework as part of the existing `google.golang.org/grpc/stats` package. 
The V2 framework moves away from V1's struct-based event bubbling and adopts a **direct method/callback-based interface** with three isolated tracer scopes, decorated with standard Go experimental doc comments. - -###### 1. Interface Definitions in `google.golang.org/grpc/stats` - -```go -package stats - -import ( - "context" - "net" - "time" +When a picker defers a pick (it cannot return a ready connection yet), it surfaces two optional values **alongside the existing "no connection available" deferral signal**, carried on the runtime's existing pick-result value so that no new return channel is introduced: +* `delay_type`: the bounded metric label for the current wait (e.g. `"connecting"`), composed by container pickers per [Section 2](#2-delay-types--reasons). +* `delay_reason`: the free-form diagnostic string for the current wait. - "google.golang.org/grpc/metadata" -) +The channel reads these only on the deferred path; a successful pick ignores them. Leaf pickers populate the base values and container pickers compose them as the result bubbles up the picker tree ([Section 2](#2-delay-types--reasons)). Pickers remain ignorant of `wait_for_ready` semantics and transport races — the channel itself synthesizes `picker_failing_with_wait_for_ready` and `subchannel_state_mismatch` ([Section 2](#2-delay-types--reasons), [Rationale](#rationale)). -// HandlerV2 is the factory interface that telemetry plugins implement. -// It instantiates stateful, scoped tracers for calls, attempts, and connections. -// -// Experimental: this interface is experimental and subject to change. -type HandlerV2 interface { - // NewCallTracer instantiates a CallTracer to monitor a client-side overall RPC call. - // The returned context is used throughout the lifetime of the call. - NewCallTracer(ctx context.Context, info *CallInfo) (context.Context, CallTracer) - - // NewAttemptTracer instantiates an AttemptTracer to monitor an individual RPC stream attempt. 
- // The returned context is used throughout the lifetime of the attempt. - NewAttemptTracer(ctx context.Context, info *AttemptInfo) (context.Context, AttemptTracer) - - // NewConnTracer instantiates a ConnTracer to monitor a physical transport connection. - // The returned context is used throughout the lifetime of the connection. - NewConnTracer(ctx context.Context, info *ConnInfo) (context.Context, ConnTracer) -} - -// CallTracer is a call-scoped interface tracking the overall client RPC call. -// -// Experimental: this interface is experimental and subject to change. -type CallTracer interface { - // RecordDelayStart indicates the start of a call-level delay (e.g., "resolving"). - RecordDelayStart(delayType string, reason string) - // RecordDelayReasonChanged indicates a transition in the call-level delay reason. - RecordDelayReasonChanged(reason string) - // RecordDelayEnd indicates the end of the call-level delay. - RecordDelayEnd() - // OnCallEnd is called when the overall RPC call completes. - OnCallEnd(err error) -} - -// AttemptTracer is an attempt-scoped interface tracking an individual stream attempt. -// -// Experimental: this interface is experimental and subject to change. -type AttemptTracer interface { - // RecordDelayStart indicates the start of an attempt-level delay (e.g., "connecting"). - RecordDelayStart(delayType string, reason string) - // RecordDelayReasonChanged indicates a transition in the attempt-level delay reason. - RecordDelayReasonChanged(reason string) - // RecordDelayEnd indicates the end of the attempt-level delay. - RecordDelayEnd() - - // OnHeaderSent/OnHeaderRecv replace V1 OutHeader and InHeader. - OnHeaderSent(compression string, md metadata.MD) - OnHeaderRecv(wireLength int, compression string, md metadata.MD) - - // OnPayloadSent/OnPayloadRecv replace V1 OutPayload and InPayload. 
- OnPayloadSent(payload any, length, compressedLength, wireLength int, sentTime time.Time) - OnPayloadRecv(payload any, length, compressedLength, wireLength int, recvTime time.Time) - - // OnTrailerSent/OnTrailerRecv replace V1 OutTrailer and InTrailer. - OnTrailerSent(md metadata.MD) - OnTrailerRecv(wireLength int, md metadata.MD) - - // OnAttemptEnd is called when this specific stream attempt completes (replacing V1 End). - OnAttemptEnd(err error) -} +Concretely, each runtime extends the type it already uses to express a deferred pick: +* **Go**: two optional fields on `balancer.PickResult`. Because a deferred pick is signalled today through the `ErrNoSubConnAvailable` sentinel rather than a populated `PickResult`, the deferred path is extended to also surface a populated result carrying the delay metadata to the pick wrapper. +* **Java**: carried on `PickResult` (which already transports an LB-supplied `ClientStreamTracer.Factory`), read where the channel interprets a no-result pick. +* **C++ (Core)**: core has no synchronous picker-tree return to a channel wrapper, so the equivalent `delay_type`/`delay_reason` are recorded at the LB-pick point of the call's promise chain rather than returned upward (see Tracer-Side API). -// ConnTracer is a connection-scoped interface tracking physical transport connections. -type ConnTracer interface { - // OnConnEnd is called when the transport connection terminates (replacing V1 ConnEnd). - OnConnEnd() -} - -// Supporting Metadata Structs -type CallInfo struct { - FullMethodName string - FailFast bool - IsClientStream bool - IsServerStream bool -} - -type AttemptInfo struct { - FullMethodName string - IsTransparentRetry bool - IsHedged bool - PreviousAttempts int - RemoteAddr net.Addr - LocalAddr net.Addr -} +##### 2. 
Tracer-Side API -type ConnInfo struct { - RemoteAddr net.Addr - LocalAddr net.Addr -} -``` +The channel drives the telemetry plugin through six logical operations, split across the two existing telemetry scopes. The end callbacks carry no duration argument because the plugin owns timing ([3.1](#lifecycle--state-machine)): -###### 2. Coexistence & Migration Strategy -During the experimental phase, gRPC-Go will support both V1 and V2 interfaces concurrently to avoid breaking existing ecosystems (such as OpenCensus and older OpenTelemetry plugins): -* **Registration**: The channel option `grpc.WithStatsHandler()` will accept both `Handler` (V1) and `HandlerV2` (V2) types using interface checks. -* **Adaptation**: An internal adapter will be provided to wrap a V1 `Handler` into a V2 `HandlerV2` (converting method callbacks back into struct events and dispatching them to `HandleRPC`), ensuring older plugins continue to function. -* **Native Performance**: If a native V2 handler is registered (such as the new OpenTelemetry stats handler), the channel will bypass all V1 event struct allocations and type-assertions, running completely on the high-performance, zero-allocation callback path. - -##### Java -In `io.grpc.ClientStreamTracer` and `ClientStreamTracer.Factory`: -*Note: Call-level delays (like name resolution) occur before an attempt is created. Because `ClientStreamTracer` is attempt-scoped, the call-level delay APIs are added to `ClientStreamTracer.Factory` (which is call-scoped and plumbed throughout the channel), while the attempt-level delay APIs remain on the `ClientStreamTracer` instance.* -```java -package io.grpc; - -public abstract class ClientStreamTracer extends StreamTracer { - /** - * Called when an attempt-level delay segment (e.g. LB Pick connection) starts. - */ - public void recordAttemptDelayStart(String delayType, String delayReason) {} - - /** - * Called when an attempt-level delay reason changes. 
- */ - public void recordAttemptDelayReasonChanged(String delayReason) {} - - /** - * Called when an attempt-level delay segment ends. - */ - public void recordAttemptDelayEnd() {} -} +| Logical operation | Scope | Effect | +|---|---|---| +| `recordCallDelayStart(delay_type, reason)` | call | open the `"Call Delay"` span; begin timing | +| `recordCallDelayReasonChanged(reason)` | call | append a `"Delay state transition"` event | +| `recordCallDelayEnd()` | call | close the span; emit `grpc.client.call.delay.duration` | +| `recordAttemptDelayStart(delay_type, reason)` | attempt | open the `"Attempt Delay"` span; begin timing | +| `recordAttemptDelayReasonChanged(reason)` | attempt | append a `"Delay state transition"` event | +| `recordAttemptDelayEnd()` | attempt | close the span; emit `grpc.client.attempt.delay.duration` | -// Nested inside ClientStreamTracer: -public static abstract class Factory { - /** - * Called when a call-level delay segment (e.g. name resolution) starts. - */ - public void recordCallDelayStart(String delayType, String delayReason) {} +The **scope of the receiving tracer object** (call-scoped vs attempt-scoped) statically determines which histogram a segment is recorded to ([3.1](#lifecycle--state-machine), Scope Resolution). Each language binds these operations onto its client-side telemetry API as follows: - /** - * Called when a call-level delay reason changes. 
- */ - public void recordCallDelayReasonChanged(String delayReason) {} +| Scope | Go | Java | C++ (Core) | +|---|---|---|---| +| Call-level delay hooks | a **new** call-scoped tracer (new V2 stats-handler API) | new methods on the existing `ClientStreamTracer.Factory` (call-scoped, plumbed through the channel) | new `DelayAnnotation` subtype recorded on the existing call tracer | +| Attempt-level delay hooks | a **new** attempt-scoped tracer (new V2 stats-handler API) | new methods on the existing `ClientStreamTracer` (attempt-scoped) | new `DelayAnnotation` subtype recorded on the existing attempt tracer | - /** - * Called when a call-level delay segment ends. - */ - public void recordCallDelayEnd() {} +###### Language-specific bindings +The binding *mechanism* is deliberately asymmetric, because each language's current telemetry API differs; only the six logical operations and their two scopes are common. This asymmetry is inherent to the existing APIs, not a difference in the delay design itself: +* **Go**: Introduces a **new** call-scoped and attempt-scoped tracer API (gRPC-Go's V2 stats handler). A new API is required because the V1 `stats.Handler` is attempt-scoped only and cannot host a call-level hook. (The closest existing V1 signal, `stats.DelayedPickComplete`, is attempt-scoped and merely marks the *end* of a blocked pick — it is an analogue, not a base we extend.) The broader V2 API is out of scope here beyond the delay hooks it must expose. +* **Java**: **Reuses the existing tracer types** but adds new no-op-default methods to them — call hooks on `ClientStreamTracer.Factory` (which is call-scoped and exists before the first attempt is spawned) and attempt hooks on `ClientStreamTracer` (attempt-scoped). The `"resolving"` hooks build on the channel's stream-buffering path (such as `createPendingStream()`). 
Since a `ClientStreamTracer` instance is instantiated prior to the first pick attempt (per the gRPC-Java retry architecture), the attempt-level `"connecting"` delay is recorded **directly on the attempt-scoped `ClientStreamTracer` instance**, ensuring attempt-level telemetry remains cleanly isolated to the individual attempt's tracer. +* **C++ (Core)**: **Reuses the existing annotation framework**, adding only a new `DelayAnnotation` subtype that replaces the current free-form `"Delayed name resolution complete."` / `"Delayed LB pick complete."` string annotations. - public abstract ClientStreamTracer newClientStreamTracer(StreamInfo info, Metadata headers); -} -``` +In short: C++ reuses its mechanism wholesale (one new subtype), Java reuses its tracer types but extends them with new methods, and Go introduces new tracer types outright. -##### C++ (Core) -In `src/core/telemetry/call_tracer.h`: -*Note: To prevent core interface bloat and align with the core thinning refactoring, we leverage the existing C++ Annotation framework rather than adding new virtual methods.* +###### Illustrative binding: C++ structured annotation +The delay transition is carried as an `Annotation` subtype on the C++ Core annotation framework. 
It implements **both** pure-virtual hooks of the base `Annotation` — `ToString()` (human-readable form) and `ForEachKeyValue()` (structured `grpc.delay_type`/`grpc.delay_reason` pairs) — and **owns** its strings so the annotation may safely outlive the stack frame that produced it: ```cpp namespace grpc_core { class DelayAnnotation final : public CallTracerAnnotationInterface::Annotation { public: enum class Stage { kStart, kReasonChanged, kEnd }; - + DelayAnnotation(Stage stage, absl::string_view type, absl::string_view reason) : Annotation(CallTracerAnnotationInterface::AnnotationType::kDelay), - stage_(stage), type_(type), reason_(std::string(reason)) {} - + stage_(stage), type_(type), reason_(reason) {} + Stage stage() const { return stage_; } absl::string_view type() const { return type_; } absl::string_view reason() const { return reason_; } + // Both overrides are required: the base Annotation declares them pure-virtual. + std::string ToString() const override; + void ForEachKeyValue( + absl::FunctionRef callback) + const override; + private: Stage stage_; - absl::string_view type_; - std::string reason_; + std::string type_; // owned (not a string_view) to avoid a dangling view + std::string reason_; // owned }; -} // namespace grpc_core +} // namespace grpc_core ``` ### Feature Flag @@ -403,7 +285,7 @@ class DelayAnnotation final : public CallTracerAnnotationInterface::Annotation { All delay metrics, tracing, and API hooks will be guarded by a feature flag: * **Go/Java Env Var**: `GRPC_EXPERIMENTAL_ENABLE_DELAY_OBSERVABILITY` (Default: `false`) -* **C++ Core Experiment**: `IsExperimentEnabled("client_delay_observability")` (registered in `experiments.h`) +* **C++ Core Experiment**: register `client_delay_observability` in `experiments.yaml`; gate via the generated `IsClientDelayObservabilityEnabled()` accessor (core uses generated per-experiment accessors, not string-keyed lookups). 
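As a rough illustration of the Go/Java environment-variable gate listed above, the following Java sketch reads the flag once and treats anything other than `true` as disabled. The class `DelayObservabilityFlag` and its `parse`/`isEnabled` methods are hypothetical names, not gRPC-Java APIs, and accepting only a case-insensitive `true` is an assumption; only the variable name and the `false` default come from this proposal.

```java
import java.util.Locale;

/** Hypothetical helper for illustration only; not part of gRPC-Java. */
final class DelayObservabilityFlag {
  /** Variable name taken from the Feature Flag section of this proposal. */
  static final String ENV_VAR = "GRPC_EXPERIMENTAL_ENABLE_DELAY_OBSERVABILITY";

  private DelayObservabilityFlag() {}

  /**
   * Parses a raw environment value. An unset or unrecognized value keeps the
   * feature off, matching the documented default of false. Accepting only
   * "true" (case-insensitive, surrounding whitespace ignored) is an assumption.
   */
  static boolean parse(String raw) {
    if (raw == null) {
      return false;
    }
    return raw.trim().toLowerCase(Locale.ROOT).equals("true");
  }

  /** Reads the process environment; a caller would typically cache the result. */
  static boolean isEnabled() {
    return parse(System.getenv(ENV_VAR));
  }
}
```

A channel implementation could consult such a gate once at construction time and skip every delay hook when it returns `false`, so the disabled path adds no per-RPC work.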
## Metric Registration and Recording From 6f6efbb0014b47dfb1469857fc4fca1326b53cee Mon Sep 17 00:00:00 2001 From: Madhav Bissa Date: Thu, 25 Jun 2026 14:05:58 +0000 Subject: [PATCH 09/12] add special section for attempt level delays handled by channel --- A121-rpc-delay-observability.md | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/A121-rpc-delay-observability.md b/A121-rpc-delay-observability.md index 30335fc67..9abcd71dc 100644 --- a/A121-rpc-delay-observability.md +++ b/A121-rpc-delay-observability.md @@ -111,7 +111,7 @@ The `grpc.delay_type` label is a low-cardinality, restricted set of values. To e ##### 1. Call-Level Delay Types Recorded on the `grpc.client.call.delay.duration` histogram. These represent delays that occur before an individual attempt is created: -* `"resolving"`: The channel is delayed waiting for name resolution and configuration parsing. +* `"resolving"`: The channel is delayed waiting for name resolution. ##### 2. Attempt-Level Delay Types Recorded on the `grpc.client.attempt.delay.duration` histogram. These represent delays that occur during a specific RPC attempt: @@ -206,6 +206,16 @@ Each RPC attempt is tracked independently: * **Attempt Isolation**: A transparent retry or hedged attempt will instantiate its own attempt-level tracer. * **Independent Timing**: If the new attempt's picker is deferred, the attempt starts its own independent delay logical timer and child trace span, without affecting the call-level resolver timing or the timing of other concurrent attempts. +##### 5. Attempt-Level Channel-Interceded Delays +Specific attempt-level delays are not generated by load-balancing pickers themselves but are interceded and synthesized directly by the channel wrapper when handling transport-level states: +* **`picker_failing_with_wait_for_ready` (Wait-For-Ready Buffering)**: + 1. 
*Trigger*: When an RPC is initiated and the root picker returns a failing/deferred result, the channel wrapper checks the RPC's `wait_for_ready` configuration. If `wait_for_ready` is `false`, the RPC fails immediately (no delay is timed). If `wait_for_ready` is `true`, the channel wrapper queues the RPC, starts a logical timer, and invokes `recordAttemptDelayStart("picker_failing_with_wait_for_ready", error_reason)`. + 2. *Transition*: While the RPC remains queued, if the channel receives an updated picker that continues to return a different error, the channel wrapper updates the reason via `recordAttemptDelayReasonChanged(new_error_reason)`. + 3. *Termination*: When an updated picker successfully assigns a ready subchannel, the channel wrapper stops the timer, invokes `recordAttemptDelayEnd()`, and creates the stream. If the RPC's deadline expires or the call is cancelled while queued, the channel wrapper invokes `recordAttemptDelayEnd()` to capture the partial duration. +* **`subchannel_state_mismatch` (Post-Pick Transport Race)**: + 1. *Trigger*: When the root picker returns a successful pick (assigning a subchannel), the channel wrapper attempts to create a transport stream on it. If the stream creation fails immediately because the subchannel has transitioned out of `READY` (a race condition before the picker is updated), the channel wrapper initiates a transparent retry. During this retry, the channel wrapper starts a logical timer and invokes `recordAttemptDelayStart("subchannel_state_mismatch", socket_disconnect_reason)`. + 2. *Termination*: Once the load balancer publishes an updated picker and the channel wrapper successfully executes a pick that assigns a new, actually ready subchannel, the channel wrapper stops the timer and invokes `recordAttemptDelayEnd()`. + #### API Definitions ##### 1. 
Picker-Side API From 5d8c3e150ed5eb1d586e406018f9d30821144b22 Mon Sep 17 00:00:00 2001 From: Madhav Bissa Date: Thu, 25 Jun 2026 14:15:21 +0000 Subject: [PATCH 10/12] addressing review comment --- A121-rpc-delay-observability.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/A121-rpc-delay-observability.md b/A121-rpc-delay-observability.md index 9abcd71dc..a5998b401 100644 --- a/A121-rpc-delay-observability.md +++ b/A121-rpc-delay-observability.md @@ -133,7 +133,7 @@ Because only the priority policy contributes a prefix and it contributes exactly ##### Metadata Propagation & Composition The prepending and wrapping logic is handled entirely inside the Balancer Picker tree hierarchy, keeping the channel's attempt-routing wrapper and tracer plugins completely decoupled and simple: -1. **Leaf Pickers** (such as `pick_first` or `round_robin`): Generate the base leaf types (e.g., `delay_type = "connecting"`) and the initial, descriptive `delay_reason` (e.g., `"subchannel connecting: TCP handshake in progress"`). +1. **Leaf Pickers** (such as `pick_first` or `round_robin`): Generate the base delay types (e.g., `delay_type = "connecting"`) and the initial, descriptive `delay_reason` (e.g., `"subchannel connecting: TCP handshake in progress"`). 2. **Container Pickers**: Intercept the child's deferred pick result as it bubbles up the picker tree. The `priority` picker prepends its active tier index to the type (e.g., producing `"p0:connecting"`). A pass-through picker (such as `xds_cluster_manager`) forwards the child's `delay_type` unmodified, but can enrich the `delay_reason` with its own structural details. 3. **The Channel Wrapper**: Receives the final, fully-composed `delay_type` and `delay_reason` strings from the root picker and passes them directly to the tracer (`recordAttemptDelayStart`). 
The channel wrapper is also responsible for intercepting specific channel-level states (e.g., generating `"picker_failing_with_wait_for_ready"` when a picker returns an error for a `wait_for_ready` RPC, or `"subchannel_state_mismatch"` when a transport disconnects before the picker is updated). From 8265fe0259b3ac45194e9058106c1dbab95f71aa Mon Sep 17 00:00:00 2001 From: Madhav Bissa Date: Thu, 25 Jun 2026 15:16:45 +0000 Subject: [PATCH 11/12] fix AI slop --- A121-rpc-delay-observability.md | 83 +++++++++++++++++---------------- 1 file changed, 42 insertions(+), 41 deletions(-) diff --git a/A121-rpc-delay-observability.md b/A121-rpc-delay-observability.md index a5998b401..9892cf805 100644 --- a/A121-rpc-delay-observability.md +++ b/A121-rpc-delay-observability.md @@ -15,7 +15,7 @@ This proposal introduces client-side metrics and tracing to measure the delays a Existing gRPC core telemetry infrastructure, as defined in [gRPC A66 (OpenTelemetry Metrics)][A66] and [gRPC A72 (OpenTelemetry Tracing)][A72], tracks the overall end-to-end duration of RPC calls and attempts. However, these metrics and spans function as aggregate buckets that do not decompose latency, leaving delays inside the client channel invisible to operators. -Before an RPC attempt can be sent over the network, the client channel must perform several critical operations, including resolving the target name, parsing service configurations, instantiating load balancing policies, and obtaining a connectivity picker. If any of these phases stall—such as during a slow DNS lookup, a Route Lookup Service (RLS) control-plane query, or a Cluster Discovery Service (CDS) metadata fetch—the RPC is delayed. To the application, this appears as high latency or a timeout. However, because current telemetry lacks visibility into these resolution and routing states, developers cannot distinguish between a slow network, a slow backend, or a channel initialization delay. 
This is particularly challenging for clients utilizing the Route Lookup Service (RLS), where diagnosing RLS-related hangs or isolating whether delays stem from pending route lookups requires complex manual debugging. +Before an RPC attempt can be sent over the network, the client channel must perform several operations, including resolving the target name, parsing service configurations, instantiating load balancing policies, and obtaining a connectivity picker. If any of these phases stall—such as during a slow DNS lookup, a Route Lookup Service (RLS) control-plane query, or a Cluster Discovery Service (CDS) metadata fetch—the RPC is delayed. To the application, this appears as high latency or a timeout. However, because current telemetry lacks visibility into these resolution and routing states, developers cannot distinguish between a slow network, a slow backend, or a channel initialization delay. This is particularly challenging for clients using the Route Lookup Service (RLS), where diagnosing RLS-related hangs or isolating whether delays stem from pending route lookups requires manual debugging. While there are many potential sources of delay along the client-side pipeline (including interceptor execution, credential fetching, and filter evaluation), this proposal focuses specifically on introducing observability for the two most common bottlenecks: **name resolution** and **load balancing pick** delays. This proposal establishes a generic telemetry framework extensible to other client-side delays in the future. @@ -107,7 +107,7 @@ To ensure consistency across implementations, we define the taxonomy of metric a All connection-related delays are consolidated into a single low-cardinality metric label value (`"connecting"`), while their detailed reasons are recorded inside the `"Delay state transition"` span event (as the `grpc.delay_reason` event attribute). #### 1. 
Metric Delay Types (`grpc.delay_type`) -The `grpc.delay_type` label is a low-cardinality, restricted set of values. To ensure maximum clarity, these types are explicitly partitioned into Call-Level and Attempt-Level scopes, corresponding to their respective duration histograms. Structural container policies may compose the attempt-level values by prepending logical prefixes. +The `grpc.delay_type` label is a low-cardinality, restricted set of values. These types are partitioned into Call-Level and Attempt-Level scopes, corresponding to their respective duration histograms. Structural container policies may compose the attempt-level values by prepending logical prefixes. ##### 1. Call-Level Delay Types Recorded on the `grpc.client.call.delay.duration` histogram. These represent delays that occur before an individual attempt is created: @@ -127,24 +127,24 @@ Structural container policies prepend their logical prefixes to the base attempt * `p0:connecting` * `p1:subchannel_state_mismatch` * `p0:picker_failing_with_wait_for_ready` -* **Pass-Through Container Policies**: Policies like `xds_cluster_manager`, `weighted_target`, and `rls` **do not prepend any prefix or wrap the delay type**. They simply bubble up the child's `grpc.delay_type` (e.g., `"connecting"`) directly as-is. +* **Pass-Through Container Policies**: Policies like `xds_cluster_manager`, `weighted_target`, and `rls` **do not prepend any prefix or wrap the delay type**. They bubble up the child's `grpc.delay_type` (e.g., `"connecting"`) as-is. -Because only the priority policy contributes a prefix and it contributes exactly one (its active tier index), the resulting `grpc.delay_type` cardinality stays bounded at roughly *(base attempt-level types) × (configured priority tiers)*; prefixes do not stack across nested pass-through containers. 
Implementations SHOULD keep the prefix a small bounded token (the tier index) so metric cardinality remains low; richer per-container structure belongs in the `grpc.delay_reason` span event, not the metric label. +Because only the priority policy contributes a prefix and it contributes exactly one (its active tier index), the resulting `grpc.delay_type` cardinality stays bounded; prefixes do not stack across nested pass-through containers. Implementations should keep the prefix a small bounded token (the tier index) so metric cardinality remains low; detailed per-container structure belongs in the `grpc.delay_reason` span event, not the metric label. ##### Metadata Propagation & Composition -The prepending and wrapping logic is handled entirely inside the Balancer Picker tree hierarchy, keeping the channel's attempt-routing wrapper and tracer plugins completely decoupled and simple: -1. **Leaf Pickers** (such as `pick_first` or `round_robin`): Generate the base delay types (e.g., `delay_type = "connecting"`) and the initial, descriptive `delay_reason` (e.g., `"subchannel connecting: TCP handshake in progress"`). -2. **Container Pickers**: Intercept the child's deferred pick result as it bubbles up the picker tree. The `priority` picker prepends its active tier index to the type (e.g., producing `"p0:connecting"`). A pass-through picker (such as `xds_cluster_manager`) forwards the child's `delay_type` unmodified, but can enrich the `delay_reason` with its own structural details. -3. **The Channel Wrapper**: Receives the final, fully-composed `delay_type` and `delay_reason` strings from the root picker and passes them directly to the tracer (`recordAttemptDelayStart`). The channel wrapper is also responsible for intercepting specific channel-level states (e.g., generating `"picker_failing_with_wait_for_ready"` when a picker returns an error for a `wait_for_ready` RPC, or `"subchannel_state_mismatch"` when a transport disconnects before the picker is updated). 
+The prepending and wrapping logic is handled inside the balancer picker tree, keeping the channel's attempt-routing wrapper and tracer plugins decoupled: +1. **Leaf Pickers** (such as `pick_first` or `round_robin`): Generate the base delay types (e.g., `delay_type = "connecting"`) and the initial `delay_reason` (e.g., `"subchannel connecting: TCP handshake in progress"`). +2. **Container Pickers**: Intercept the child's deferred pick result as it bubbles up the picker tree. The `priority` picker prepends its active tier index to the type (e.g., producing `"p0:connecting"`). A pass-through picker (such as `xds_cluster_manager`) forwards the child's `delay_type` unmodified, but can add its own structural details to the `delay_reason`. +3. **The Channel Wrapper**: Receives the composed `delay_type` and `delay_reason` strings from the root picker and passes them to the tracer (`recordAttemptDelayStart`) once per `delay_type` segment (a new segment begins whenever the `delay_type` value changes). The channel wrapper is also responsible for intercepting specific channel-level states (e.g., generating `"picker_failing_with_wait_for_ready"` when a picker returns an error for a `wait_for_ready` RPC, or `"subchannel_state_mismatch"` when a transport disconnects before the picker is updated). --- #### 2. Taxonomy of Delay Reasons (Span Event Attribute: `grpc.delay_reason`) -Unlike the strict, low-cardinality metric types, the `grpc.delay_reason` is an **unconstrained, free-form debug string** designed to convey maximum troubleshooting context. -* It is written as a **human-readable, spaced string** (e.g. `"subchannel is connecting"` or `"waiting for DNS query to complete"`), **not** a `snake_case` token or closed enum. -* It is designed to contain **high-cardinality metadata** (such as specific subchannel IP addresses, target names, cache keys, or raw connection error messages) to provide rich diagnostics inside the trace span.
+Unlike the strict, low-cardinality metric types, the `grpc.delay_reason` is an **unconstrained, free-form debug string** designed to convey troubleshooting context. +* It is written as a **human-readable string** (e.g. `"subchannel is connecting"` or `"waiting for DNS query to complete"`), **not** a `snake_case` token or closed enum. +* It is designed to contain **high-cardinality metadata** (such as specific subchannel IP addresses, target names, cache keys, or raw connection error messages) to provide diagnostics inside the trace span. -To assist implementers, we explain the physical scenarios that are associated with each `grpc.delay_type` below. These scenarios represent common connection/resolver bottlenecks but are **non-exhaustive**; implementations are encouraged to append additional debug details. +To assist implementers, we explain the scenarios associated with each `grpc.delay_type` below. These scenarios represent common connection/resolver bottlenecks but are **non-exhaustive**; implementations are encouraged to append additional debug details. ##### Category A: Resolver Scenarios (Type: `"resolving"`) * **DNS Resolver Pending**: The channel is waiting for the initial name resolution query to complete. The reason string should describe the pending resolver query (e.g., `"waiting for DNS query to complete for target example.com"`). @@ -172,7 +172,7 @@ Recorded when an RPC attempt is queued waiting for a subchannel to establish a t The priority load balancing policy ([gRPC A56][A56]) manages failover between multiple priority groups (e.g. `p0`, `p1`, `p2`). Its delay telemetry behaves as follows: * When the priority policy is waiting on its primary tier (`p0`) to connect, the overall metric delay type is `"p0:connecting"`. * If `p0` fails (enters `TRANSIENT_FAILURE`) and the policy fails over to `p1`, the picker updates the active delay type to `"p1:connecting"`. 
-* Because the delay type has transitioned, the active child span is closed, a new child span `"p1:connecting"` is opened, and the picker records the exact failover reason as a spaced string event (which can include the logical priority name or child policy name from the configuration, e.g., `"waiting on priority group p1 (child 'tier-1-backup') (p0 failed: connection timeout)"`). This provides a clear, high-fidelity timeline of the failover sequence in the trace. +* Because the delay type has transitioned, the active child span is closed, a new child span `"p1:connecting"` is opened, and the picker records the exact failover reason as a human-readable string event (which can include the logical priority name or child policy name from the configuration, e.g., `"waiting on priority group p1 (child 'tier-1-backup') (p0 failed: connection timeout)"`). This provides a timeline of the failover sequence in the trace. ##### Category F: Pass-Through Container Policy Scenarios (Type: `"connecting"` bubbled up directly) Policies like `xds_cluster_manager`, `weighted_target`, and `rls` do not modify the metric `grpc.delay_type` (it remains strictly `"connecting"` or whatever leaf type is bubbled up from the child). @@ -187,48 +187,48 @@ The client channel, load balancer, and telemetry tracer coordinate synchronously ##### 1. Timer Orchestration The channel owns the **lifecycle** of a delay segment (deciding when it starts, transitions, and ends) while the telemetry plugin owns the **timing** of that segment (it timestamps the start on `recordDelayStart`, computes the elapsed duration on `recordDelayEnd`, and emits the histogram). The channel therefore does not maintain a separate duration clock; "logical timer" below refers to this plugin-side measurement, delimited by the channel's start/end signals.
-* **Resolver / Control-Plane (Call-Level Delay)**: When an RPC is blocked waiting for name resolution, the channel invokes `recordCallDelayStart("resolving", reason)` to open the `"Call Delay"` child span carrying the `grpc.delay_type = "resolving"` attribute. When the resolver successfully applies the first valid service config and endpoints, the channel invokes `recordCallDelayEnd()`. (Note: If resolution completes before any RPC is blocked, no delay is recorded.) +* **Resolver / Control-Plane (Call-Level Delay)**: When an RPC is blocked waiting for name resolution or configuration parsing to complete, the channel invokes `recordCallDelayStart("resolving", reason)` to open the `"Call Delay"` child span carrying the `grpc.delay_type = "resolving"` attribute. When the resolver successfully applies the first valid service config and endpoints, the channel invokes `recordCallDelayEnd()`. (Note: If resolution completes before any RPC is blocked, no delay is recorded.) * **LB Picker (Attempt-Level Delay)**: When an RPC attempt is initiated, the picker executes. If the picker defers the pick (i.e. it cannot return a ready connection yet), it surfaces the metric delay type (e.g., `"connecting"`) and the free-form debug reason alongside the existing "no connection available" signal. The channel's attempt-routing wrapper invokes `recordAttemptDelayStart(type, reason)` to open the `"Attempt Delay"` child span. As the attempt remains buffered, subsequent picker evaluations may return different reasons, which the channel updates using `recordAttemptDelayReasonChanged(reason)` to append span events. When a picker evaluation successfully assigns a subchannel, the channel invokes `recordAttemptDelayEnd()`. 
* **Scope Resolution**: The decision of whether a delay segment is Call-Level (recorded to `grpc.client.call.delay.duration`) or Attempt-Level (recorded to `grpc.client.attempt.delay.duration`) is determined entirely by the scope of the tracer object on which the callbacks are invoked. Call-level delays (such as `"resolving"`) are invoked strictly on the call-scoped tracer object, while attempt-level delays (such as `"connecting"`) are invoked strictly on the attempt-scoped tracer object. This mapping is enforced statically by the respective language API signatures. ##### 2. Emission -Both metric duration recording and trace span management are fully delegated to the registered telemetry plugin (e.g., OpenTelemetry) via the tracer API: +Both metric duration recording and trace span management are delegated to the registered telemetry plugin (e.g., OpenTelemetry) via the tracer API: * **Span Lifecycle**: Upon receiving the `recordDelayStart` signal, the telemetry plugin instantiates the child trace span. Subsequent `recordDelayReasonChanged` calls append `"Delay state transition"` events to the active span. * **Simultaneous Metric & Span Close**: Upon receiving the `recordDelayEnd` signal, the telemetry plugin **simultaneously ends the child trace span and emits the final duration metric** to the corresponding duration histogram (`grpc.client.call.delay.duration` or `grpc.client.attempt.delay.duration`), measuring the elapsed time of the logical timer. ##### 3. Cancellation & Timeout If an active RPC call or attempt is cancelled (due to client cancellation, `DEADLINE_EXCEEDED` timeouts, or channel shutdown) while a delay is logically active: * The core channel **MUST** invoke the tracer's end callback (`recordCallDelayEnd()` or `recordAttemptDelayEnd()`), which finalizes the plugin-side timing for the open segment. 
-* This ensures the telemetry plugin can capture the elapsed duration up to the point of failure, close the trace span, and emit the partial duration metric, guaranteeing that bottlenecks preceding a failure remain fully observable. +* This ensures the telemetry plugin can capture the elapsed duration up to the point of failure, close the trace span, and emit the partial duration metric, guaranteeing that bottlenecks preceding a failure remain observable. ##### 4. Retries & Hedging Each RPC attempt is tracked independently: * **Attempt Isolation**: A transparent retry or hedged attempt will instantiate its own attempt-level tracer. * **Independent Timing**: If the new attempt's picker is deferred, the attempt starts its own independent delay logical timer and child trace span, without affecting the call-level resolver timing or the timing of other concurrent attempts. -##### 5. Attempt-Level Channel-Interceded Delays -Specific attempt-level delays are not generated by load-balancing pickers themselves but are interceded and synthesized directly by the channel wrapper when handling transport-level states: +##### 5. Attempt-Level Channel-Intercepted Delays +Some attempt-level delays are not generated by load-balancing pickers but are intercepted and synthesized by the channel wrapper when handling transport-level states: * **`picker_failing_with_wait_for_ready` (Wait-For-Ready Buffering)**: - 1. *Trigger*: When an RPC is initiated and the root picker returns a failing/deferred result, the channel wrapper checks the RPC's `wait_for_ready` configuration. If `wait_for_ready` is `false`, the RPC fails immediately (no delay is timed). If `wait_for_ready` is `true`, the channel wrapper queues the RPC, starts a logical timer, and invokes `recordAttemptDelayStart("picker_failing_with_wait_for_ready", error_reason)`. - 2. 
*Transition*: While the RPC remains queued, if the channel receives an updated picker that continues to return a different error, the channel wrapper updates the reason via `recordAttemptDelayReasonChanged(new_error_reason)`. - 3. *Termination*: When an updated picker successfully assigns a ready subchannel, the channel wrapper stops the timer, invokes `recordAttemptDelayEnd()`, and creates the stream. If the RPC's deadline expires or the call is cancelled while queued, the channel wrapper invokes `recordAttemptDelayEnd()` to capture the partial duration. -* **`subchannel_state_mismatch` (Post-Pick Transport Race)**: - 1. *Trigger*: When the root picker returns a successful pick (assigning a subchannel), the channel wrapper attempts to create a transport stream on it. If the stream creation fails immediately because the subchannel has transitioned out of `READY` (a race condition before the picker is updated), the channel wrapper initiates a transparent retry. During this retry, the channel wrapper starts a logical timer and invokes `recordAttemptDelayStart("subchannel_state_mismatch", socket_disconnect_reason)`. - 2. *Termination*: Once the load balancer publishes an updated picker and the channel wrapper successfully executes a pick that assigns a new, actually ready subchannel, the channel wrapper stops the timer and invokes `recordAttemptDelayEnd()`. + 1. *Trigger*: When an RPC is initiated and the root picker returns a failing result (an error other than "no connection available"), the channel wrapper checks the RPC's `wait_for_ready` configuration. If `wait_for_ready` is `false`, the RPC fails immediately (no delay is timed). If `wait_for_ready` is `true`, the channel wrapper queues the RPC and invokes `recordAttemptDelayStart("picker_failing_with_wait_for_ready", error_reason)`. + 2. 
*Transition*: While the RPC remains queued, if an updated picker returns a different error, the channel wrapper updates the reason via `recordAttemptDelayReasonChanged(new_error_reason)`. + 3. *Termination*: When an updated picker assigns a ready subchannel, the channel wrapper invokes `recordAttemptDelayEnd()` and creates the stream. If the RPC's deadline expires or the call is cancelled while queued, the channel wrapper invokes `recordAttemptDelayEnd()` to capture the partial duration. +* **`subchannel_state_mismatch` (Post-Pick Stale Pick)**: + 1. *Trigger*: When the root picker returns a successful pick but the assigned subchannel has transitioned out of `READY` before the RPC can use it (a race before the picker is updated), the channel wrapper re-queues the RPC on the same attempt and invokes `recordAttemptDelayStart("subchannel_state_mismatch", state_change_reason)`. It then waits for the next picker and re-picks; no new attempt is created. + 2. *Termination*: When a re-pick assigns a subchannel that is actually ready, the channel wrapper invokes `recordAttemptDelayEnd()`. If the deadline expires or the call is cancelled while re-queued, the channel wrapper invokes `recordAttemptDelayEnd()` to capture the partial duration. #### API Definitions ##### 1. Picker-Side API -When a picker defers a pick (it cannot return a ready connection yet), it surfaces two optional values **alongside the existing "no connection available" deferral signal**, carried on the runtime's existing pick-result value so that no new return channel is introduced: +When a picker defers a pick (it cannot return a ready connection yet), it surfaces two values **alongside the existing "no connection available" deferral signal**, carried on the runtime's existing pick-result value so that no new return channel is introduced: * `delay_type`: the bounded metric label for the current wait (e.g. `"connecting"`), composed by container pickers per [Section 2](#2-delay-types--reasons).
* `delay_reason`: the free-form diagnostic string for the current wait. The channel reads these only on the deferred path; a successful pick ignores them. Leaf pickers populate the base values and container pickers compose them as the result bubbles up the picker tree ([Section 2](#2-delay-types--reasons)). Pickers remain ignorant of `wait_for_ready` semantics and transport races — the channel itself synthesizes `picker_failing_with_wait_for_ready` and `subchannel_state_mismatch` ([Section 2](#2-delay-types--reasons), [Rationale](#rationale)). -Concretely, each runtime extends the type it already uses to express a deferred pick: -* **Go**: two optional fields on `balancer.PickResult`. Because a deferred pick is signalled today through the `ErrNoSubConnAvailable` sentinel rather than a populated `PickResult`, the deferred path is extended to also surface a populated result carrying the delay metadata to the pick wrapper. -* **Java**: carried on `PickResult` (which already transports an LB-supplied `ClientStreamTracer.Factory`), read where the channel interprets a no-result pick. -* **C++ (Core)**: core has no synchronous picker-tree return to a channel wrapper, so the equivalent `delay_type`/`delay_reason` are recorded at the LB-pick point of the call's promise chain rather than returned upward (see Tracer-Side API). +Add two fields to the pick-result type each runtime already returns. Leaf pickers set them when they defer; container pickers (e.g. `priority`) propagate and compose them as the deferred result bubbles up ([Section 2](#2-delay-types--reasons)); the channel reads them where it already handles a deferred pick. No new return channel and no pick-loop restructuring are introduced. The fields live on: +* **Go**: `balancer.PickResult` (deferred pick = the `ErrNoSubConnAvailable` return). +* **Java**: `PickResult` (deferred pick = `PickResult.withNoResult()`). +* **C++ (Core)**: the `PickResult::Queue` variant (deferred pick = `Queue`). ##### 2. 
Tracer-Side API @@ -243,23 +243,23 @@ The channel drives the telemetry plugin through six logical operations, split ac | `recordAttemptDelayReasonChanged(reason)` | attempt | append a `"Delay state transition"` event | | `recordAttemptDelayEnd()` | attempt | close the span; emit `grpc.client.attempt.delay.duration` | -The **scope of the receiving tracer object** (call-scoped vs attempt-scoped) statically determines which histogram a segment is recorded to ([3.1](#lifecycle--state-machine), Scope Resolution). Each language binds these operations onto its client-side telemetry API as follows: +The **scope of the receiving tracer object** (call-scoped vs attempt-scoped) statically determines which histogram a segment is recorded to ([3.1](#lifecycle--state-machine), Scope Resolution). Each runtime binds these operations onto its client-side telemetry API as follows: | Scope | Go | Java | C++ (Core) | |---|---|---|---| | Call-level delay hooks | a **new** call-scoped tracer (new V2 stats-handler API) | new methods on the existing `ClientStreamTracer.Factory` (call-scoped, plumbed through the channel) | new `DelayAnnotation` subtype recorded on the existing call tracer | | Attempt-level delay hooks | a **new** attempt-scoped tracer (new V2 stats-handler API) | new methods on the existing `ClientStreamTracer` (attempt-scoped) | new `DelayAnnotation` subtype recorded on the existing attempt tracer | -###### Language-specific bindings -The binding *mechanism* is deliberately asymmetric, because each language's current telemetry API differs; only the six logical operations and their two scopes are common. This asymmetry is inherent to the existing APIs, not a difference in the delay design itself: -* **Go**: Introduces a **new** call-scoped and attempt-scoped tracer API (gRPC-Go's V2 stats handler). A new API is required because the V1 `stats.Handler` is attempt-scoped only and cannot host a call-level hook. 
(The closest existing V1 signal, `stats.DelayedPickComplete`, is attempt-scoped and merely marks the *end* of a blocked pick — it is an analogue, not a base we extend.) The broader V2 API is out of scope here beyond the delay hooks it must expose. -* **Java**: **Reuses the existing tracer types** but adds new no-op-default methods to them — call hooks on `ClientStreamTracer.Factory` (which is call-scoped and exists before the first attempt is spawned) and attempt hooks on `ClientStreamTracer` (attempt-scoped). The `"resolving"` hooks build on the channel's stream-buffering path (such as `createPendingStream()`). Since a `ClientStreamTracer` instance is instantiated prior to the first pick attempt (per the gRPC-Java retry architecture), the attempt-level `"connecting"` delay is recorded **directly on the attempt-scoped `ClientStreamTracer` instance**, ensuring attempt-level telemetry remains cleanly isolated to the individual attempt's tracer. -* **C++ (Core)**: **Reuses the existing annotation framework**, adding only a new `DelayAnnotation` subtype that replaces the current free-form `"Delayed name resolution complete."` / `"Delayed LB pick complete."` string annotations. +###### Per-runtime binding +The binding *mechanism* is asymmetric because each runtime's current telemetry API differs; only the six logical operations and their two scopes are common. This asymmetry is inherent to the existing APIs, not a difference in the delay design itself: +* **Go**: introduces a **new** call-scoped and attempt-scoped tracer API (gRPC-Go's V2 stats handler). A new API is required because the V1 `stats.Handler` is attempt-scoped only and cannot host a call-level hook. The broader V2 API is out of scope here beyond the delay hooks it must expose. 
+* **Java**: **reuses the existing tracer types** but adds new no-op-default methods to them — attempt hooks on `ClientStreamTracer`, call hooks on `ClientStreamTracer.Factory` — and reuses the existing name-resolution-delay plumbing (`ClientStreamTracer.NAME_RESOLUTION_DELAYED` and the `createPendingStream()` callback) for the `"resolving"` segment. Because no attempt-level `ClientStreamTracer` instance exists while an RPC is buffered waiting on a pick, a buffered-pick (`"connecting"`) delay is anchored on the call-scoped `Factory` until the stream is created. +* **C++ (Core)**: **reuses the existing annotation framework**, adding only a new `DelayAnnotation` subtype that replaces the current free-form `"Delayed name resolution complete."` / `"Delayed LB pick complete."` string annotations. In short: C++ reuses its mechanism wholesale (one new subtype), Java reuses its tracer types but extends them with new methods, and Go introduces new tracer types outright. ###### Illustrative binding: C++ structured annotation -The delay transition is carried as an `Annotation` subtype on the C++ Core annotation framework. 
It implements **both** pure-virtual hooks of the base `Annotation` — `ToString()` (human-readable form) and `ForEachKeyValue()` (structured `grpc.delay_type`/`grpc.delay_reason` pairs) — and **owns** its strings so the annotation may safely outlive the stack frame that produced it: +Rather than add new virtual methods to the core tracer interface, the delay transition is carried as an `Annotation` subtype that owns its strings: ```cpp namespace grpc_core { @@ -282,13 +282,14 @@ class DelayAnnotation final : public CallTracerAnnotationInterface::Annotation { const override; private: - Stage stage_; - std::string type_; // owned (not a string_view) to avoid a dangling view - std::string reason_; // owned + const Stage stage_; + const std::string type_; // owned (not a string_view) to avoid dangling + const std::string reason_; // owned }; } // namespace grpc_core ``` +A new `AnnotationType::kDelay` is added to the annotation-type enum (before the trailing `kDoNotUse_MustBeLast` sentinel). The OpenTelemetry plugin reads `stage()`/`type()`/`reason()` to open, update, and close the delay child span. ### Feature Flag @@ -307,9 +308,9 @@ Metric registration and recording will follow the established architectural patt **Child Spans vs. Events:** We use child spans for delays instead of attaching events to the parent RPC span because it provides better visual flame-graph isolation in tracing backends, especially when delays are long or when multiple attempts are made (e.g., hedging or retries). -**Free-form `delay_reason`:** We explicitly chose to allow unconstrained, free-form debug strings for the trace event's `delay_reason` rather than bounded enum tokens. While this risks higher cardinality in tracing backends, it provides immense diagnostic value by allowing raw IP addresses, subchannel targets, and detailed connection error messages to be embedded directly into the trace. 
The metric label (`grpc.delay_type`) remains strictly bounded to ensure metric cardinality is kept low. +**Free-form `delay_reason`:** We allow unconstrained, free-form debug strings for the trace event's `delay_reason` rather than bounded enum tokens. While this risks higher cardinality in tracing backends, it provides diagnostic value by allowing raw IP addresses, subchannel targets, and detailed connection error messages to be embedded directly into the trace. The metric label (`grpc.delay_type`) remains strictly bounded to ensure metric cardinality is kept low. -**Channel-Intercepted Delays:** We implement specific delay types like `picker_failing_with_wait_for_ready` and `subchannel_state_mismatch` as channel-level interceptions rather than picker-generated types. This intentionally keeps leaf pickers ignorant of `wait_for_ready` semantics and transport-level race conditions, enforcing a clean separation of concerns where the picker only reports its current state, and the channel dictates queueing behavior. +**Channel-Intercepted Delays:** We implement specific delay types like `picker_failing_with_wait_for_ready` and `subchannel_state_mismatch` as channel-level interceptions rather than picker-generated types. This keeps leaf pickers ignorant of `wait_for_ready` semantics and transport-level race conditions, enforcing a separation of concerns where the picker only reports its current state, and the channel dictates queueing behavior. **Delegated Recording:** We implement duration recording at the client channel wrapper level rather than inside specific load balancing policies because only the client channel manages the buffering, queueing, and context cancellation lifecycles of RPC calls. This decouples policy-level state reporting from duration measurement. 
From f6b5128a280ce652e3cbe93a9dce12ce38256a4f Mon Sep 17 00:00:00 2001 From: Madhav Bissa Date: Thu, 25 Jun 2026 18:43:12 +0000 Subject: [PATCH 12/12] cleaning up some more --- A121-rpc-delay-observability.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/A121-rpc-delay-observability.md b/A121-rpc-delay-observability.md index 9892cf805..3b68336a6 100644 --- a/A121-rpc-delay-observability.md +++ b/A121-rpc-delay-observability.md @@ -125,8 +125,6 @@ Recorded on the `grpc.client.attempt.delay.duration` histogram. These represent Structural container policies prepend their logical prefixes to the base attempt-level types using a colon separator: * **Priority Policy**: Prepends the active priority tier index, resulting in composed types like: * `p0:connecting` - * `p1:subchannel_state_mismatch` - * `p0:picker_failing_with_wait_for_ready` * **Pass-Through Container Policies**: Policies like `xds_cluster_manager`, `weighted_target`, and `rls` **do not prepend any prefix or wrap the delay type**. They bubble up the child's `grpc.delay_type` (e.g., `"connecting"`) as-is. Because only the priority policy contributes a prefix and it contributes exactly one (its active tier index), the resulting `grpc.delay_type` cardinality stays bounded; prefixes do not stack across nested pass-through containers. Implementations should keep the prefix a small bounded token (the tier index) so metric cardinality remains low; detailed per-container structure belongs in the `grpc.delay_reason` span event, not the metric label. @@ -183,7 +181,7 @@ However, to provide visibility in tracing, the parent container policy's structu #### Lifecycle & State Machine -The client channel, load balancer, and telemetry tracer coordinate synchronously to record delays without dynamic memory allocation during routing. +The client channel, load balancer, and telemetry tracer coordinate synchronously to record delays during routing. ##### 1. 
Timer Orchestration The channel owns the **lifecycle** of a delay segment (deciding when it starts, transitions, and ends) while the telemetry plugin owns the **timing** of that segment (it timestamps the start on `recordDelayStart`, computes the elapsed duration on `recordDelayEnd`, and emits the histogram). The channel therefore does not maintain a separate duration clock; "logical timer" below refers to this plugin-side measurement, delimited by the channel's start/end signals. @@ -253,7 +251,7 @@ The **scope of the receiving tracer object** (call-scoped vs attempt-scoped) sta ###### Per-runtime binding The binding *mechanism* is asymmetric because each runtime's current telemetry API differs; only the six logical operations and their two scopes are common. This asymmetry is inherent to the existing APIs, not a difference in the delay design itself: * **Go**: introduces a **new** call-scoped and attempt-scoped tracer API (gRPC-Go's V2 stats handler). A new API is required because the V1 `stats.Handler` is attempt-scoped only and cannot host a call-level hook. The broader V2 API is out of scope here beyond the delay hooks it must expose. -* **Java**: **reuses the existing tracer types** but adds new no-op-default methods to them — attempt hooks on `ClientStreamTracer`, call hooks on `ClientStreamTracer.Factory` — and reuses the existing name-resolution-delay plumbing (`ClientStreamTracer.NAME_RESOLUTION_DELAYED` and the `createPendingStream()` callback) for the `"resolving"` segment. Because no attempt-level `ClientStreamTracer` instance exists while an RPC is buffered waiting on a pick, a buffered-pick (`"connecting"`) delay is anchored on the call-scoped `Factory` until the stream is created. 
+* **Java**: **reuses the existing tracer types** but adds new no-op-default methods to them — attempt hooks on `ClientStreamTracer`, call hooks on `ClientStreamTracer.Factory` — and reuses the existing name-resolution-delay plumbing (`ClientStreamTracer.NAME_RESOLUTION_DELAYED` and the `createPendingStream()` callback) for the `"resolving"` segment. The attempt-level `"connecting"` delay is recorded directly on the attempt-scoped `ClientStreamTracer` instance. * **C++ (Core)**: **reuses the existing annotation framework**, adding only a new `DelayAnnotation` subtype that replaces the current free-form `"Delayed name resolution complete."` / `"Delayed LB pick complete."` string annotations. In short: C++ reuses its mechanism wholesale (one new subtype), Java reuses its tracer types but extends them with new methods, and Go introduces new tracer types outright.
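As a rough illustration of the attempt-scoped operations named above (`recordAttemptDelayStart`, `recordAttemptDelayReasonChanged`, `recordAttemptDelayEnd`), the following is a minimal sketch of a recorder that timestamps a segment on start and computes its elapsed duration on end. It assumes the plugin-side timing model described under Timer Orchestration. The class `AttemptDelayRecorder`, its `Segment` struct, the injectable clock, and the choice to keep the delay type fixed across reason changes are all hypothetical and are not part of any gRPC API.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical illustration only; none of these names are gRPC APIs.
class AttemptDelayRecorder {
 public:
  using Clock = std::chrono::steady_clock;
  using NowFn = std::function<Clock::time_point()>;

  struct Segment {
    std::string type;                  // bounded label, e.g. "connecting"
    std::vector<std::string> reasons;  // free-form reasons, in arrival order
    std::chrono::nanoseconds duration{0};
  };

  // The clock is injectable so the timing logic can be checked deterministically.
  explicit AttemptDelayRecorder(NowFn now = [] { return Clock::now(); })
      : now_(std::move(now)) {}

  // Opens a delay segment and timestamps its start.
  void RecordAttemptDelayStart(std::string type, std::string reason) {
    assert(!open_.has_value() && "a delay segment is already open");
    open_ = Segment{std::move(type), {std::move(reason)}};
    start_ = now_();
  }

  // Appends a new reason to the open segment (assumption: the type is unchanged).
  void RecordAttemptDelayReasonChanged(std::string reason) {
    if (open_.has_value()) open_->reasons.push_back(std::move(reason));
  }

  // Closes the open segment and stores its elapsed duration. A real plugin
  // would emit the duration to the attempt-level histogram here instead.
  void RecordAttemptDelayEnd() {
    if (!open_.has_value()) return;
    open_->duration =
        std::chrono::duration_cast<std::chrono::nanoseconds>(now_() - start_);
    finished_.push_back(std::move(*open_));
    open_.reset();
  }

  const std::vector<Segment>& finished() const { return finished_; }

 private:
  NowFn now_;
  std::optional<Segment> open_;
  Clock::time_point start_{};
  std::vector<Segment> finished_;
};
```

In this sketch the caller only signals start, reason change, and end; it never reads a clock itself, which mirrors the split where the channel owns the segment lifecycle and the plugin owns the timing. Passing a fake `NowFn` makes the recorded duration reproducible in tests.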