Skip to content

fix: route telemetry HTTP through shared connection stack; fix close() flush race#362

Merged
samikshya-db merged 2 commits intomainfrom
telemetry-2-vikrant-fixes
Apr 21, 2026
Merged

fix: route telemetry HTTP through shared connection stack; fix close() flush race#362
samikshya-db merged 2 commits intomainfrom
telemetry-2-vikrant-fixes

Conversation

@samikshya-db
Copy link
Copy Markdown
Collaborator

Summary

Targeted fixes addressing Vikrant's Apr 21 review comments on #325.

  • Shared HTTP stack (DatabricksTelemetryExporter, FeatureFlagCache): sendRequest and fetchWithRetry now use connectionProvider.getRetryPolicy().invokeWithRetry(), matching the CloudFetchResultHandler pattern. Removes bespoke per-file fetch/retry logic.
  • MetricsAggregator.close() flush race: adds a closing flag so batch-triggered fire-and-forget flushes are suppressed during the close loop, ensuring flushForClose() is the single awaited drain with no escaping promises.

Test plan

  • 215 unit tests pass
  • 3-file diff only — no unrelated changes

This pull request was AI-assisted by Isaac.

…) flush race

- DatabricksTelemetryExporter.sendRequest and FeatureFlagCache.fetchWithRetry
  now use connectionProvider.getRetryPolicy().invokeWithRetry(), matching the
  CloudFetchResultHandler pattern instead of bespoke fetch/retry logic
- MetricsAggregator: add closing flag so batch-triggered fire-and-forget flushes
  are suppressed during close(), ensuring a single awaited flushForClose() drains
  all remaining metrics without racing past process.exit()

Co-authored-by: samikshya-chand_data
@github-actions
Copy link
Copy Markdown

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

Replace completeStatementForClose/flushForClose toggle pattern with a
closing flag that suppresses batch-triggered fire-and-forget flushes.
Set closing=true first so completeStatement works normally via
addPendingMetric (closed is still false), then seal with closed=true
and drain with a single awaited flush(false).

Co-authored-by: samikshya-chand_data
@github-actions
Copy link
Copy Markdown

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

@samikshya-db samikshya-db merged commit 3a5e659 into main Apr 21, 2026
7 of 8 checks passed
@samikshya-db samikshya-db deleted the telemetry-2-vikrant-fixes branch April 21, 2026 08:51
samikshya-db added a commit that referenced this pull request Apr 21, 2026
…e, MetricsAggregator

Picks up #362 (shared connection stack + close() flush race fix) merged to main.
Resolves add/add conflicts by preferring main's versions of the three infrastructure
files, consistent with the prior merge strategy.

Co-authored-by: samikshya-chand_data
samikshya-db added a commit that referenced this pull request Apr 28, 2026
…iring, error telemetry

Rebase the regressed exporter/aggregator/feature-flag-cache on main's
hardened versions and re-apply only the genuinely new functionality
(CONNECTION_CLOSE event, chunk-timing aggregation) on top. Closes the
critical findings from the multi-reviewer audit:

  - SSRF guard, redactSensitive, sanitizeProcessName, hasAuthorization,
    auth-missing warn-once — all restored via main's telemetryUtils.
  - MetricsAggregator memory bounds (maxPendingMetrics with error-preferring
    drop, maxErrorsPerStatement, statementTtlMs eviction) restored.
  - FeatureFlagCache in-flight fetch dedup and TTL clamp [60s, 3600s]
    restored; lib/telemetry/urlUtils.ts deleted.
  - close() now properly awaits aggregator drain — fixes the close()/flush
    race that PR #362 already fixed once.
  - Driver version reads lib/version.ts via buildUserAgentString instead
    of hardcoded '1.0.0'; uuidv4() restored in place of Math.random().
  - TelemetryTerminalError re-exported from lib/index.ts.

Type-safe wiring:

  - Added optional getTelemetryEmitter() / getTelemetryAggregator() to
    IClientContext; removed all 7 `(this.context as any)` casts.
  - Six copy-pasted event listeners in DBSQLClient.initializeTelemetry
    collapsed into one `Object.values(TelemetryEventType)` loop — closes
    the listener-name mismatch that silently dropped error events.
  - mapAuthType now covers all 6 authType values instead of defaulting
    everything to 'pat'.

TelemetryClient now owns the host-scoped resources:

  - TelemetryClientProvider is a process-wide singleton (getInstance()).
  - TelemetryClient owns DatabricksTelemetryExporter, MetricsAggregator,
    CircuitBreakerRegistry, and FeatureFlagCache for its host. Implements
    IClientContext itself so the owned components have a stable context
    that survives any single DBSQLClient's close.
  - DBSQLClient instances on the same host share the breaker counters,
    feature-flag cache, exporter, and HTTP batches. Fixes the per-instance
    breaker-fragmentation noted in iter-2 architecture review.
  - Each DBSQLClient still holds its own TelemetryEventEmitter (respects
    per-client telemetryEnabled); emitters bridge into the shared aggregator.
  - Exporter falls back to context.getAuthProvider() when no explicit auth
    provider is passed, so the shared exporter resolves auth through the
    TelemetryClient's FIFO of registered DBSQLClients.

Error telemetry wired across operation entry points:

  - Re-added emitErrorEvent(error) on DBSQLOperation; uses
    ExceptionClassifier.isTerminal() to classify.
  - fetchChunk, cancel, close, getMetadata wrap their bodies in try/catch
    that calls emitErrorEvent before re-throwing. Verified end-to-end
    against a real Azure Databricks workspace: failed query produces
    STATEMENT_COMPLETE + ERROR (with redacted stack) on the wire.
  - Removed the await getMetadata() call from emitStatementComplete —
    eliminates the extra Thrift RPC on every close (F19) AND prevents
    spurious error telemetry from getMetadata's wrapper firing during
    close-cleanup of an already-failed operation.

Other:

  - Iterating Map.keys() while mutating made safe via snapshot in close().
  - STATEMENT_COMPLETE no longer zeroes accumulated chunk metrics when
    the emit doesn't supply them (matches sibling-field guards).
  - Tests for the rebased modules restored from main; provider tests
    updated for the singleton API; deleted unused TelemetryExporterStub.

484 unit tests passing. Diff vs main: ~+2110/-383, down from the
original PR's +3640/-1173.

Co-authored-by: Isaac
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants