fix: route telemetry HTTP through shared connection stack; fix close() flush race by samikshya-db · Pull Request #362 · databricks/databricks-sql-nodejs

samikshya-db · 2026-04-21T06:26:32Z

Summary

Targeted fixes addressing Vikrant's Apr 21 review comments on #325.

Shared HTTP stack (DatabricksTelemetryExporter, FeatureFlagCache): sendRequest and fetchWithRetry now use connectionProvider.getRetryPolicy().invokeWithRetry(), matching the CloudFetchResultHandler pattern. Removes bespoke per-file fetch/retry logic.
MetricsAggregator.close() flush race: adds a closing flag so batch-triggered fire-and-forget flushes are suppressed during the close loop, ensuring flushForClose() is the single awaited drain with no escaping promises.

Test plan

215 unit tests pass
3-file diff only — no unrelated changes

This pull request was AI-assisted by Isaac.

…) flush race - DatabricksTelemetryExporter.sendRequest and FeatureFlagCache.fetchWithRetry now use connectionProvider.getRetryPolicy().invokeWithRetry(), matching the CloudFetchResultHandler pattern instead of bespoke fetch/retry logic - MetricsAggregator: add closing flag so batch-triggered fire-and-forget flushes are suppressed during close(), ensuring a single awaited flushForClose() drains all remaining metrics without racing past process.exit() Co-authored-by: samikshya-chand_data

github-actions · 2026-04-21T06:27:31Z

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

Replace completeStatementForClose/flushForClose toggle pattern with a closing flag that suppresses batch-triggered fire-and-forget flushes. Set closing=true first so completeStatement works normally via addPendingMetric (closed is still false), then seal with closed=true and drain with a single awaited flush(false). Co-authored-by: samikshya-chand_data

github-actions · 2026-04-21T06:59:08Z

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

…e, MetricsAggregator Picks up #362 (shared connection stack + close() flush race fix) merged to main. Resolves add/add conflicts by preferring main's versions of the three infrastructure files, consistent with the prior merge strategy. Co-authored-by: samikshya-chand_data

…iring, error telemetry Rebase the regressed exporter/aggregator/feature-flag-cache on main's hardened versions and re-apply only the genuinely new functionality (CONNECTION_CLOSE event, chunk-timing aggregation) on top. Closes the critical findings from the multi-reviewer audit: - SSRF guard, redactSensitive, sanitizeProcessName, hasAuthorization, auth-missing warn-once — all restored via main's telemetryUtils. - MetricsAggregator memory bounds (maxPendingMetrics with error-preferring drop, maxErrorsPerStatement, statementTtlMs eviction) restored. - FeatureFlagCache in-flight fetch dedup and TTL clamp [60s, 3600s] restored; lib/telemetry/urlUtils.ts deleted. - close() now properly awaits aggregator drain — fixes the close()/flush race that PR #362 already fixed once. - Driver version reads lib/version.ts via buildUserAgentString instead of hardcoded '1.0.0'; uuidv4() restored in place of Math.random(). - TelemetryTerminalError re-exported from lib/index.ts. Type-safe wiring: - Added optional getTelemetryEmitter() / getTelemetryAggregator() to IClientContext; removed all 7 `(this.context as any)` casts. - Six copy-pasted event listeners in DBSQLClient.initializeTelemetry collapsed into one `Object.values(TelemetryEventType)` loop — closes the listener-name mismatch that silently dropped error events. - mapAuthType now covers all 6 authType values instead of defaulting everything to 'pat'. TelemetryClient now owns the host-scoped resources: - TelemetryClientProvider is a process-wide singleton (getInstance()). - TelemetryClient owns DatabricksTelemetryExporter, MetricsAggregator, CircuitBreakerRegistry, and FeatureFlagCache for its host. Implements IClientContext itself so the owned components have a stable context that survives any single DBSQLClient's close. - DBSQLClient instances on the same host share the breaker counters, feature-flag cache, exporter, and HTTP batches. Fixes the per-instance breaker-fragmentation noted in iter-2 architecture review. - Each DBSQLClient still holds its own TelemetryEventEmitter (respects per-client telemetryEnabled); emitters bridge into the shared aggregator. - Exporter falls back to context.getAuthProvider() when no explicit auth provider is passed, so the shared exporter resolves auth through the TelemetryClient's FIFO of registered DBSQLClients. Error telemetry wired across operation entry points: - Re-added emitErrorEvent(error) on DBSQLOperation; uses ExceptionClassifier.isTerminal() to classify. - fetchChunk, cancel, close, getMetadata wrap their bodies in try/catch that calls emitErrorEvent before re-throwing. Verified end-to-end against a real Azure Databricks workspace: failed query produces STATEMENT_COMPLETE + ERROR (with redacted stack) on the wire. - Removed the await getMetadata() call from emitStatementComplete — eliminates the extra Thrift RPC on every close (F19) AND prevents spurious error telemetry from getMetadata's wrapper firing during close-cleanup of an already-failed operation. Other: - Iterating Map.keys() while mutating made safe via snapshot in close(). - STATEMENT_COMPLETE no longer zeroes accumulated chunk metrics when the emit doesn't supply them (matches sibling-field guards). - Tests for the rebased modules restored from main; provider tests updated for the singleton API; deleted unused TelemetryExporterStub. 484 unit tests passing. Diff vs main: ~+2110/-383, down from the original PR's +3640/-1173. Co-authored-by: Isaac

samikshya-db requested a deployment to azure-prod April 21, 2026 06:26 — with GitHub Actions Waiting

samikshya-db temporarily deployed to azure-prod April 21, 2026 06:58 — with GitHub Actions Inactive

samikshya-db requested a review from vikrantpuppala April 21, 2026 07:02

vikrantpuppala approved these changes Apr 21, 2026

View reviewed changes

samikshya-db merged commit 3a5e659 into main Apr 21, 2026
7 of 8 checks passed

samikshya-db deleted the telemetry-2-vikrant-fixes branch April 21, 2026 08:51

samikshya-db mentioned this pull request Apr 21, 2026

[3/7] Telemetry Client Management: TelemetryClient and Provider #326

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: route telemetry HTTP through shared connection stack; fix close() flush race#362

fix: route telemetry HTTP through shared connection stack; fix close() flush race#362
samikshya-db merged 2 commits intomainfrom
telemetry-2-vikrant-fixes

samikshya-db commented Apr 21, 2026

Uh oh!

github-actions Bot commented Apr 21, 2026

Uh oh!

github-actions Bot commented Apr 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

samikshya-db commented Apr 21, 2026

Summary

Test plan

Uh oh!

github-actions Bot commented Apr 21, 2026

Uh oh!

github-actions Bot commented Apr 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants