feat: CLI telemetry for all commands + crash reporting#122
Conversation
Closes security-audit finding #1 on PR #122 (telemetry message sanitization). `error.message` was flowing into 4 capture sites unsanitized, leaking homedir paths (and rarely, credentials) to the WorkOS gateway. - Add `sanitizeMessage()` in crash-reporter.ts: homedir strip + Bearer/ sk_*/JWT redaction + 1KB truncation. - Factor secret redaction into shared `redactSecrets()` used by both `sanitizeMessage` and `sanitizeStack` (Node echoes `.message` into the leading `Error.stack` line, so message-only sanitization was insufficient). - Add private `extractErrorFields()` chokepoint on `Analytics`; route all 4 capture sites through it (`captureException`, `stepCompleted`, `commandExecuted`, `captureUnhandledCrash`). `replaceLastCommandEvent` inherits sanitization via its delegation to `commandExecuted`. - `captureUnhandledCrash` now uses `sanitizeStack` instead of inline truncation, providing defense-in-depth for callers that bypass the crash-reporter wrapper. - Add regression guard test (`telemetry-sanitize.spec.ts`): poisons every capture method with homedir + Bearer + sk_live_ + JWT, asserts no marker reaches the serialized queue. Reviewed: ideation:reviewer cycle 1 PASS (0 critical, 0 high).
📝 WalkthroughWalkthroughAdds end-to-end telemetry: types and schema for command/crash events, device-id persistence, crash sanitization/reporting, telemetry-client queue and persist, store-and-forward recovery, analytics command lifecycle and auth-mode tracking, command telemetry middleware/wrappers, enriched error apiContext and exit-code mapping, and CLI wiring and tests. ChangesTelemetry Infrastructure & Command Tracking
Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 9
🧹 Nitpick comments (4)
src/utils/telemetry-types.ts (1)
58-58: ⚡ Quick winAlign
startTimestampoptionality with backward-compat contract.
telemetry-schema.spec.tsexplicitly acceptsstep/agent.toolwithoutstartTimestamp, but these interfaces currently require it. Making both optional avoids contract drift.Suggested patch
export interface StepEvent extends TelemetryEvent { type: 'step'; name: string; - startTimestamp: string; + startTimestamp?: string; durationMs: number; success: boolean; error?: { @@ export interface AgentToolEvent extends TelemetryEvent { type: 'agent.tool'; toolName: string; - startTimestamp: string; + startTimestamp?: string; durationMs: number; success: boolean; }Also applies to: 70-70
src/utils/telemetry-client.spec.ts (1)
60-105: ⚡ Quick winAdd coverage for claim-token auth headers.
setClaimTokenAuth()is new behavior in this PR, but there’s no test assertingx-workos-claim-token+x-workos-client-idemission (and bearer-token precedence when both exist).Also applies to: 124-218
src/utils/telemetry-store-forward.ts (1)
1-3: ⚡ Quick winUse async fs APIs for startup recovery path.
recoverPendingEvents()(Line 25+) is async but uses sync disk calls (Lines 27-56), which blocks startup and conflicts with the project rule for TS files.As per coding guidelines "Avoid Node-specific sync APIs (crypto, fs sync) unless necessary".
Also applies to: 27-56
src/utils/output.ts (1)
123-124: ⚡ Quick winProtect process termination from telemetry failures
analytics.recordTermination(...)on Line 123 can throw and prevent Line 124 from executing. Error exits should remain deterministic even when telemetry is unhealthy.🔧 Proposed fix
const reason = error.apiContext ? 'api_error' : codeReason; - analytics.recordTermination(reason, error.code, error.apiContext); - process.exit(exit); + try { + analytics.recordTermination(reason, error.code, error.apiContext); + } finally { + process.exit(exit); + } }
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c45afda9-659c-43a0-9646-cd08dd0cabe3
📒 Files selected for processing (27)
CLAUDE.mdsrc/bin.tssrc/commands/debug.tssrc/lib/api-error-handler.spec.tssrc/lib/api-error-handler.tssrc/lib/command-aliases.tssrc/lib/device-id.spec.tssrc/lib/device-id.tssrc/lib/run-with-core.tssrc/utils/analytics.spec.tssrc/utils/analytics.tssrc/utils/command-telemetry.spec.tssrc/utils/command-telemetry.tssrc/utils/crash-reporter.spec.tssrc/utils/crash-reporter.tssrc/utils/exit-codes.spec.tssrc/utils/exit-codes.tssrc/utils/output.spec.tssrc/utils/output.tssrc/utils/register-subcommand.tssrc/utils/telemetry-client.spec.tssrc/utils/telemetry-client.tssrc/utils/telemetry-sanitize.spec.tssrc/utils/telemetry-schema.spec.tssrc/utils/telemetry-store-forward.spec.tssrc/utils/telemetry-store-forward.tssrc/utils/telemetry-types.ts
| try { | ||
| if (fs.existsSync(filePath)) { | ||
| const raw = fs.readFileSync(filePath, 'utf8').trim(); | ||
| if (UUID_REGEX.test(raw)) { | ||
| cached = raw; | ||
| return raw; | ||
| } | ||
| } | ||
|
|
||
| const id = crypto.randomUUID(); | ||
| fs.mkdirSync(path.dirname(filePath), { recursive: true, mode: 0o700 }); | ||
| fs.writeFileSync(filePath, id, { encoding: 'utf8', mode: 0o600 }); | ||
| cached = id; |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift
Replace sync fs usage in the command path.
Line 31–43 and Line 49 use synchronous fs/crypto APIs on a hot CLI path. Please move this to async (node:fs/promises) and cache a pending promise to keep call sites simple.
As per coding guidelines src/**/*.ts: “Avoid Node-specific sync APIs (crypto, fs sync) unless necessary”.
Also applies to: 49-50
| export function extractUserFlags(rawArgs: string[]): string[] { | ||
| const passedFlags = rawArgs | ||
| .filter((arg) => arg.startsWith('--') || (arg.startsWith('-') && arg.length === 2)) | ||
| .map((arg) => arg.replace(/^-+/, '').split('=')[0]); | ||
| return [...new Set(passedFlags)]; |
There was a problem hiding this comment.
Harden flag extraction to avoid invalid telemetry flags.
Line 32/33 currently treats -- and values like -1 as flags, which can pollute command.flags.
Suggested fix
export function extractUserFlags(rawArgs: string[]): string[] {
const passedFlags = rawArgs
- .filter((arg) => arg.startsWith('--') || (arg.startsWith('-') && arg.length === 2))
- .map((arg) => arg.replace(/^-+/, '').split('=')[0]);
+ .filter((arg) => {
+ if (arg === '--') return false;
+ if (/^--[A-Za-z][\w-]*(=.*)?$/.test(arg)) return true;
+ if (/^-[A-Za-z]$/.test(arg)) return true;
+ return false;
+ })
+ .map((arg) => arg.replace(/^-+/, '').split('=')[0])
+ .filter(Boolean);
return [...new Set(passedFlags)];
}| const topLevelCommand = commandParts[0] ?? ''; | ||
| if (SKIP_TELEMETRY_COMMANDS.has(topLevelCommand)) return; |
There was a problem hiding this comment.
Normalize the skip check using canonical top-level command.
Line 60/61 checks the raw command token. If a skipped command is invoked via alias, the exemption can be bypassed and telemetry still emitted.
Suggested fix
- const topLevelCommand = commandParts[0] ?? '';
+ const rawTopLevelCommand = commandParts[0] ?? '';
+ const topLevelCommand = COMMAND_ALIASES[rawTopLevelCommand] ?? rawTopLevelCommand;
if (SKIP_TELEMETRY_COMMANDS.has(topLevelCommand)) return;| export function sanitizeStack(stack: string | undefined): string { | ||
| if (!stack) return ''; | ||
| let sanitized = stack.replaceAll(HOME, '~'); | ||
| sanitized = sanitized.replace(/\/[^\s:]+\/(node_modules|dist|src)\//g, '$1/'); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
find . -name "crash-reporter.ts" -type fRepository: workos/cli
Length of output: 84
🏁 Script executed:
cat -n src/utils/crash-reporter.tsRepository: workos/cli
Length of output: 3137
🏁 Script executed:
python3 << 'EOF'
import re
# Test the current regex against various path formats
current_pattern = r'/[^\s:]+/(node_modules|dist|src)/'
windows_pattern = r'[A-Za-z]:\\[^\s:]+\\(node_modules|dist|src)\\'
# Test cases
test_cases = [
# POSIX paths
"/home/user/project/node_modules/lib/index.js",
"/var/app/dist/main.js",
"/Users/dev/src/index.ts",
# Windows paths
"C:\\Users\\dev\\project\\node_modules\\lib\\index.js",
"D:\\app\\dist\\main.js",
"E:\\code\\src\\index.ts",
# Stack trace examples
" at functionName (/home/user/project/node_modules/pkg/index.js:10:5)",
" at functionName (C:\\Users\\dev\\project\\node_modules\\pkg\\index.js:10:5)",
]
print("Testing CURRENT regex pattern:")
print(f"Pattern: {current_pattern}")
for test in test_cases:
match = re.search(current_pattern, test)
result = re.sub(current_pattern, r'\1/', test) if match else test
print(f" '{test[:60]}...' -> Match: {bool(match)}, Result: '{result[:60]}...'")
print("\nTesting PROPOSED Windows pattern:")
print(f"Pattern: {windows_pattern}")
for test in test_cases:
match = re.search(windows_pattern, test)
result = re.sub(windows_pattern, r'\1\\', test) if match else test
print(f" '{test[:60]}...' -> Match: {bool(match)}, Result: '{result[:60]}...'")
print("\nTesting COMBINED patterns:")
for test in test_cases:
result = test
result = re.sub(current_pattern, r'\1/', result)
result = re.sub(windows_pattern, r'\1\\', result)
print(f" '{test[:60]}...' -> Result: '{result[:60]}...'")
EOFRepository: workos/cli
Length of output: 2630
Windows absolute paths leak in stack trace sanitization
Line 26 only collapses POSIX-style paths (/path/to/node_modules/). Windows paths (e.g., C:\Users\...\node_modules\) bypass this regex entirely and leak full local filesystem paths into telemetry payloads.
Regex testing confirms the current pattern does not match Windows absolute paths:
C:\Users\dev\project\node_modules\pkg\index.js→ remains unsanitized/home/user/project/node_modules/pkg/index.js→ correctly collapsed tonode_modules/pkg/index.js
🔧 Proposed fix
export function sanitizeStack(stack: string | undefined): string {
if (!stack) return '';
let sanitized = stack.replaceAll(HOME, '~');
- sanitized = sanitized.replace(/\/[^\s:]+\/(node_modules|dist|src)\//g, '$1/');
+ sanitized = sanitized
+ .replace(/\/[^\s:]+\/(node_modules|dist|src)\//g, '$1/')
+ .replace(/[A-Za-z]:\\[^\s:]+\\(node_modules|dist|src)\\/g, '$1\\');
sanitized = redactSecrets(sanitized);
return sanitized.length > MAX_STACK_LENGTH
? sanitized.slice(0, MAX_STACK_LENGTH) + '\n...[truncated]'
: sanitized;
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| sanitized = sanitized.replace(/\/[^\s:]+\/(node_modules|dist|src)\//g, '$1/'); | |
| export function sanitizeStack(stack: string | undefined): string { | |
| if (!stack) return ''; | |
| let sanitized = stack.replaceAll(HOME, '~'); | |
| sanitized = sanitized | |
| .replace(/\/[^\s:]+\/(node_modules|dist|src)\//g, '$1/') | |
| .replace(/[A-Za-z]:\\[^\s:]+\\(node_modules|dist|src)\\/g, '$1\\'); | |
| sanitized = redactSecrets(sanitized); | |
| return sanitized.length > MAX_STACK_LENGTH | |
| ? sanitized.slice(0, MAX_STACK_LENGTH) + '\n...[truncated]' | |
| : sanitized; | |
| } |
| export function exitWithCode(code: ExitCodeValue, error?: { code: string; message: string }): never { | ||
| if (error) { | ||
| outputError(error); | ||
| } | ||
| analytics.recordTermination(reasonForExitCode(code), error?.code); | ||
| process.exit(code); |
There was a problem hiding this comment.
Classify termination from error.code when available.
Line 76 derives reason only from numeric exit code, so all code-1 exits become validation_error. That misclassifies API failures like http_500 when they go through exitWithCode.
Suggested fix
export function exitWithCode(code: ExitCodeValue, error?: { code: string; message: string }): never {
if (error) {
outputError(error);
}
- analytics.recordTermination(reasonForExitCode(code), error?.code);
+ const reason = error ? resolveErrorCode(error.code).reason : reasonForExitCode(code);
+ analytics.recordTermination(reason, error?.code);
process.exit(code);
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| export function exitWithCode(code: ExitCodeValue, error?: { code: string; message: string }): never { | |
| if (error) { | |
| outputError(error); | |
| } | |
| analytics.recordTermination(reasonForExitCode(code), error?.code); | |
| process.exit(code); | |
| export function exitWithCode(code: ExitCodeValue, error?: { code: string; message: string }): never { | |
| if (error) { | |
| outputError(error); | |
| } | |
| const reason = error ? resolveErrorCode(error.code).reason : reasonForExitCode(code); | |
| analytics.recordTermination(reason, error?.code); | |
| process.exit(code); | |
| } |
| async flush(): Promise<boolean> { | ||
| if (this.events.length === 0) return true; | ||
| if (!this.gatewayUrl) { | ||
| debug('[Telemetry] No gateway URL configured, skipping flush'); | ||
| return; | ||
| return false; | ||
| } | ||
|
|
||
| const payload: TelemetryRequest = { events: [...this.events] }; | ||
| this.events = []; | ||
| const count = this.events.length; | ||
| const payload: TelemetryRequest = { events: this.events.slice(0, count) }; | ||
|
|
There was a problem hiding this comment.
Guard against concurrent flush() calls to prevent event loss/duplication.
Line 85 currently allows overlapping flushes. Two concurrent calls can send the same snapshot twice, and later splice(0, count) can remove events queued after the first flush.
Suggested patch
export class TelemetryClient {
private events: TelemetryEvent[] = [];
+ private flushInFlight: Promise<boolean> | null = null;
@@
async flush(): Promise<boolean> {
+ if (this.flushInFlight) return this.flushInFlight;
+ this.flushInFlight = this.flushInternal();
+ try {
+ return await this.flushInFlight;
+ } finally {
+ this.flushInFlight = null;
+ }
+ }
+
+ private async flushInternal(): Promise<boolean> {
if (this.events.length === 0) return true;
@@
- }
+ }
}Also applies to: 135-137, 143-144
| mkdirSync(dirname(filePath), { recursive: true }); | ||
| writeFileSync(filePath, JSON.stringify(this.events), 'utf-8'); | ||
| this.events = []; |
There was a problem hiding this comment.
Persist telemetry with restrictive filesystem permissions.
persistToFile() currently relies on default permissions. Since payloads can include user/device identifiers, write with 0700 dir and 0600 file modes.
Suggested patch
- mkdirSync(dirname(filePath), { recursive: true });
- writeFileSync(filePath, JSON.stringify(this.events), 'utf-8');
+ mkdirSync(dirname(filePath), { recursive: true, mode: 0o700 });
+ writeFileSync(filePath, JSON.stringify(this.events), { encoding: 'utf-8', mode: 0o600 });📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| mkdirSync(dirname(filePath), { recursive: true }); | |
| writeFileSync(filePath, JSON.stringify(this.events), 'utf-8'); | |
| this.events = []; | |
| mkdirSync(dirname(filePath), { recursive: true, mode: 0o700 }); | |
| writeFileSync(filePath, JSON.stringify(this.events), { encoding: 'utf-8', mode: 0o600 }); | |
| this.events = []; |
Greptile SummaryThis PR adds comprehensive CLI telemetry (command events, crash reporting, store-and-forward persistence) to every non-installer command, complementing the existing installer session telemetry. The implementation is well-structured with provisional events, alias normalization, env fingerprinting, and a PID-based store-forward mechanism to survive
Confidence Score: 4/5The telemetry wiring is sound overall; one issue in the crash reporter changes observable behavior for users encountering unhandled errors. The crash handler replaces Node's default uncaught-exception and unhandled-rejection output without printing anything to stderr, so a genuine crash now exits silently. Users relying on the error message to diagnose a problem will see nothing. This is a regression in error visibility for an edge case that the reporter is specifically designed to capture. src/utils/crash-reporter.ts — the uncaughtException and unhandledRejection handlers need to emit the error to stderr before exiting. Important Files Changed
|
…amps, and new event types Add environment fingerprint fields (OS, Node version, CI detection, shell) to session.start and session.end events. Add startTimestamp to step and agent.tool events for span reconstruction. Define command and crash event types with stub emission methods. Add discriminated union Zod schema validation tests mirroring the API schema.
… persistence Wire up yargs middleware that emits a provisional command event before each handler runs, then replaces it with actual duration/success on completion. This covers the ~25 process.exit() call sites without modifying them. - Command telemetry middleware with canonical name resolution and flag extraction - Crash reporter with sanitized stack traces (sync handlers, no async) - Store-forward: persist unsent events to temp file on exit, recover on next run - Fix flush() to retain events until HTTP success (was clearing before fetch) - Auto-wrap handlers in registerSubcommand() (single change point) - Shared COMMAND_ALIASES map for telemetry and help-json - analytics.initForNonInstaller() sets gatewayUrl + JWT from stored creds
- Enable debug output for non-installer commands via env var - Log telemetry event details (type, name, duration, attributes) on flush - Register in debug command's env var catalog
- Wrap inline command handlers (seed, setup-org, doctor, etc.) with wrapCommandHandler so they report real duration/success - Skip provisional telemetry event for install command (has own session telemetry) - Add claim -> env.claim to canonical alias map - Defer store-forward file deletion until after successful flush
Client errors (401, 403) are permanent failures that won't succeed on retry. Only retain events for 5xx (transient server errors) and network failures where store-forward retry is meaningful.
- flush() returns true (sent/dropped) or false (retryable) so callers can act on the result - Use splice(0, count) instead of clearing all events, protecting events queued concurrently during the fetch - wrapCommandHandler flushes in-process so events are sent immediately instead of always deferring to next invocation via store-forward - Store-forward recovery deletes files after loading into memory (events are re-persisted by exit handler if flush fails) - Skip provisional events for dashboard and $0 (installer entry points) - Add 4xx drop test coverage
Add a section to CLAUDE.md explaining which commands auto-emit telemetry (registerSubcommand) versus which need manual wrapCommandHandler wrapping (inline top-level .command() calls). Add a pointer comment in bin.ts near the workflow commands block. Prevents new top-level commands from silently emitting duration=0 telemetry.
- Add workos.user_id to command and crash events (from stored credentials or unclaimed environment clientId) so dashboards can count unique users - Add cli.version to command and crash events for release adoption tracking - Support claim-token auth path on the telemetry client, so unclaimed environments' telemetry reaches the API (guard accepts this path too) - Rename CrashEvent's installer.version to cli.version (crashes happen outside the installer too) - initForNonInstaller() now wires up user_id and claim-token auth
Closes security-audit finding #1 on PR #122 (telemetry message sanitization). `error.message` was flowing into 4 capture sites unsanitized, leaking homedir paths (and rarely, credentials) to the WorkOS gateway. - Add `sanitizeMessage()` in crash-reporter.ts: homedir strip + Bearer/ sk_*/JWT redaction + 1KB truncation. - Factor secret redaction into shared `redactSecrets()` used by both `sanitizeMessage` and `sanitizeStack` (Node echoes `.message` into the leading `Error.stack` line, so message-only sanitization was insufficient). - Add private `extractErrorFields()` chokepoint on `Analytics`; route all 4 capture sites through it (`captureException`, `stepCompleted`, `commandExecuted`, `captureUnhandledCrash`). `replaceLastCommandEvent` inherits sanitization via its delegation to `commandExecuted`. - `captureUnhandledCrash` now uses `sanitizeStack` instead of inline truncation, providing defense-in-depth for callers that bypass the crash-reporter wrapper. - Add regression guard test (`telemetry-sanitize.spec.ts`): poisons every capture method with homedir + Bearer + sk_live_ + JWT, asserts no marker reaches the serialized queue. Reviewed: ideation:reviewer cycle 1 PASS (0 critical, 0 high).
Introduces two additive telemetry signals requested by the signals spec:
- device.id — persistent per-install UUID stored at ~/.workos/device-id.
File IO failures fall through to a one-shot session UUID; never throws.
- auth.mode — 4-state enum ('jwt' | 'claim_token' | 'api_key' | 'none')
derived during initForNonInstaller with JWT > claim_token > api_key
priority. Installer flow sets it in run-with-core after credentials
resolve.
Both fields are injected via getEnvFingerprint so they appear on every
event that already carries env context (session.start, session.end,
command, crash). No change to downstream gateway — existing deliveries
accept the new attrs as optional.
Implements Phase 1 of docs/ideation/telemetry-signals/spec-phase-1.md.
Gateway-side type mirroring is tracked separately in the gateway repo.
…and events Replace the boolean `command.success` as the primary outcome dimension with a structured `termination.reason` enum (success | cancelled | auth_required | validation_error | api_error | crash) plus a `error.code` string. - `exitWithError`, `exitWithCode`, `exitWithAuthRequired` now patch the provisional command event via `analytics.recordTermination` before calling `process.exit`. Previously the queued provisional event (success true, duration 0) was persisted as-is by store-forward, producing misleading dashboard data. - `exitWithError` now honors the string error code for exit mapping: `auth_required` -> 4, `cancelled` -> 2, `http_*`/`not_found`/`unknown_error` -> 1 (api_error), everything else -> 1 (validation_error). Previously it always exited 1. - New `TelemetryClient.patchLastEventOfType` helper mutates a queued event in place. Unlike `replaceLastEventOfType`, it preserves the event so multiple callers can update fields incrementally. - `wrapCommandHandler` records `reason: 'success'` on clean completion and `reason: 'crash'` with `error.name` on uncaught throw. - `command.success` remains on CommandEvent for backward-compat. - `api.status`/`api.code`/`api.resource` fields added to CommandEvent and plumbed through `recordTermination` for Phase 3 consumption. Review passed (2 medium + 1 low non-blocking; low fixed in same commit for crash-path test symmetry).
…mmand events Wires `createApiErrorHandler` through to the provisional command event via an optional `apiContext` param on `exitWithError`, populating `api.status`, `api.code`, and `api.resource` attributes on API-failure command events. Also treats `apiContext` presence as ground truth for `api_error` termination reason — WorkOS error codes like `rate_limited` or `validation_error` would otherwise fall through `resolveErrorCode` to `validation_error`, hiding legitimate API failures from api_error dashboards. The `CommandEvent` type already had the `api.*` attribute slots (added as forward-compat in Phase 2); only the wiring and tests are new.
- Fix duration=0 on exit-path events: middleware now sets analytics.commandStartTime so recordTermination can compute real duration when exitWithError/exitWithCode bypass the wrapper - Remove not_found/unknown_error from ERROR_CODE_MAP to prevent local config misses (env.ts) from being misclassified as API errors - Resolve auth.mode before sessionStart in installer flow so the session.start event carries the correct credential source - Make recordTermination fully idempotent: clears stale api.* and error.code fields when called without them - Tighten device-id UUID validation to proper RFC 4122 v4 regex - Only override termination reason to api_error when the code-based classification falls to the generic validation_error fallback, preserving more specific reasons like auth_required
Wrap the `api` command handler with `wrapCommandHandler()` so it reports real duration and success/failure instead of the provisional defaults (duration=0, success=true). Replace raw `process.exit()` calls with `exitWithCode`/`exitWithError` across 10 command files so `recordTermination` fires before exit. This ensures `termination.reason` and `error.code` are populated on every command event. Long-running server handlers (dev/emulate SIGINT) are left as-is. Fix help-json.ts alias drift by importing from the shared `COMMAND_ALIASES` map instead of maintaining a private copy. Add a coverage test that scans bin.ts for inline handlers missing `wrapCommandHandler`, catching the class of regression that let the `api` command slip through.
31829ec to
0475ea8
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e6b7b384-44b3-47fe-9030-070bd92bf44e
📒 Files selected for processing (39)
CLAUDE.mdsrc/bin.tssrc/commands/api/index.spec.tssrc/commands/api/index.tssrc/commands/api/interactive.spec.tssrc/commands/api/interactive.tssrc/commands/debug.tssrc/commands/dev.tssrc/commands/doctor.tssrc/commands/emulate.tssrc/commands/env.tssrc/commands/install-skill.tssrc/commands/login.tssrc/commands/uninstall-skill.tssrc/lib/api-error-handler.spec.tssrc/lib/api-error-handler.tssrc/lib/command-aliases.tssrc/lib/device-id.spec.tssrc/lib/device-id.tssrc/lib/run-with-core.tssrc/utils/analytics.spec.tssrc/utils/analytics.tssrc/utils/command-telemetry.spec.tssrc/utils/command-telemetry.tssrc/utils/crash-reporter.spec.tssrc/utils/crash-reporter.tssrc/utils/exit-codes.spec.tssrc/utils/exit-codes.tssrc/utils/help-json.tssrc/utils/output.spec.tssrc/utils/output.tssrc/utils/register-subcommand.tssrc/utils/telemetry-client.spec.tssrc/utils/telemetry-client.tssrc/utils/telemetry-sanitize.spec.tssrc/utils/telemetry-schema.spec.tssrc/utils/telemetry-store-forward.spec.tssrc/utils/telemetry-store-forward.tssrc/utils/telemetry-types.ts
✅ Files skipped from review due to trivial changes (1)
- src/commands/emulate.ts
🚧 Files skipped from review as they are similar to previous changes (22)
- CLAUDE.md
- src/lib/command-aliases.ts
- src/utils/register-subcommand.ts
- src/utils/telemetry-client.spec.ts
- src/utils/crash-reporter.spec.ts
- src/commands/debug.ts
- src/utils/telemetry-schema.spec.ts
- src/utils/telemetry-client.ts
- src/utils/telemetry-store-forward.spec.ts
- src/lib/device-id.spec.ts
- src/utils/output.spec.ts
- src/utils/telemetry-store-forward.ts
- src/utils/command-telemetry.spec.ts
- src/lib/api-error-handler.spec.ts
- src/lib/device-id.ts
- src/lib/api-error-handler.ts
- src/utils/telemetry-sanitize.spec.ts
- src/utils/crash-reporter.ts
- src/utils/analytics.ts
- src/utils/telemetry-types.ts
- src/utils/exit-codes.ts
- src/utils/analytics.spec.ts
| if (!options.json) { | ||
| clack.log.error(`Doctor failed: ${error instanceof Error ? error.message : 'Unknown error'}`); | ||
| } else { | ||
| console.error(JSON.stringify({ error: error instanceof Error ? error.message : 'Unknown error' })); | ||
| } |
There was a problem hiding this comment.
Emit structured JSON errors in doctor JSON mode.
Line 40 currently writes {"error":"..."}. JSON-mode errors should keep a stable error.code and error.message object shape.
Suggested fix
- } else {
- console.error(JSON.stringify({ error: error instanceof Error ? error.message : 'Unknown error' }));
- }
+ } else {
+ console.error(
+ JSON.stringify({
+ error: {
+ code: 'doctor_failed',
+ message: error instanceof Error ? error.message : 'Unknown error',
+ },
+ }),
+ );
+ }As per coding guidelines src/**/*.ts: Return structured JSON error objects to stderr in non-TTY mode: { "error": { "code": "...", "message": "..." } }.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if (!options.json) { | |
| clack.log.error(`Doctor failed: ${error instanceof Error ? error.message : 'Unknown error'}`); | |
| } else { | |
| console.error(JSON.stringify({ error: error instanceof Error ? error.message : 'Unknown error' })); | |
| } | |
| if (!options.json) { | |
| clack.log.error(`Doctor failed: ${error instanceof Error ? error.message : 'Unknown error'}`); | |
| } else { | |
| console.error( | |
| JSON.stringify({ | |
| error: { | |
| code: 'doctor_failed', | |
| message: error instanceof Error ? error.message : 'Unknown error', | |
| }, | |
| }), | |
| ); | |
| } |
| const preSessionEnv = getActiveEnvironment(); | ||
| if (preSessionEnv?.clientId && preSessionEnv?.apiKey) { | ||
| analytics.setAuthMode(isUnclaimedEnvironment(preSessionEnv) ? 'claim_token' : 'api_key'); | ||
| } else if (getAccessToken()) { | ||
| analytics.setAuthMode('jwt'); | ||
| } else if (process.env.WORKOS_API_KEY) { | ||
| analytics.setAuthMode('api_key'); | ||
| } |
There was a problem hiding this comment.
Include CLI/discovered API key in pre-session auth-mode detection.
At Line 546, pre-session classification only checks process.env.WORKOS_API_KEY. If auth comes from --api-key (or .env.local merged into augmentedOptions.apiKey), session.start can miss auth.mode = "api_key".
Suggested fix
const preSessionEnv = getActiveEnvironment();
if (preSessionEnv?.clientId && preSessionEnv?.apiKey) {
analytics.setAuthMode(isUnclaimedEnvironment(preSessionEnv) ? 'claim_token' : 'api_key');
} else if (getAccessToken()) {
analytics.setAuthMode('jwt');
- } else if (process.env.WORKOS_API_KEY) {
+ } else if (augmentedOptions.apiKey || process.env.WORKOS_API_KEY) {
analytics.setAuthMode('api_key');
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const preSessionEnv = getActiveEnvironment(); | |
| if (preSessionEnv?.clientId && preSessionEnv?.apiKey) { | |
| analytics.setAuthMode(isUnclaimedEnvironment(preSessionEnv) ? 'claim_token' : 'api_key'); | |
| } else if (getAccessToken()) { | |
| analytics.setAuthMode('jwt'); | |
| } else if (process.env.WORKOS_API_KEY) { | |
| analytics.setAuthMode('api_key'); | |
| } | |
| const preSessionEnv = getActiveEnvironment(); | |
| if (preSessionEnv?.clientId && preSessionEnv?.apiKey) { | |
| analytics.setAuthMode(isUnclaimedEnvironment(preSessionEnv) ? 'claim_token' : 'api_key'); | |
| } else if (getAccessToken()) { | |
| analytics.setAuthMode('jwt'); | |
| } else if (augmentedOptions.apiKey || process.env.WORKOS_API_KEY) { | |
| analytics.setAuthMode('api_key'); | |
| } |
| process.on('uncaughtException', (error) => { | ||
| reportCrashSync(error); | ||
| process.exit(1); | ||
| }); | ||
|
|
||
| process.on('unhandledRejection', (reason) => { | ||
| const error = reason instanceof Error ? reason : new Error(String(reason)); | ||
| reportCrashSync(error); | ||
| process.exit(1); | ||
| }); |
There was a problem hiding this comment.
Crash handler suppresses all user-visible error output
Both uncaughtException and unhandledRejection handlers call process.exit(1) without writing anything to stderr. Adding your own uncaughtException handler replaces Node's default behavior (print the error + stack trace, then exit), so when a genuine crash occurs the process now exits silently — the user has no way to know what went wrong. The same applies to unhandledRejection in Node ≥15. Even a simple process.stderr.write(error.stack + '\n') before process.exit(1) would restore the expected output.
Summary
Right now we only have telemetry on the installer flow. This PR adds telemetry coverage to every CLI command, plus crash reporting and a debug flag.
What changed:
Event type enrichment -- session.start/end now carry environment fingerprint (OS, Node version, CI detection, shell). Step and tool events get explicit
startTimestampfor accurate span reconstruction on the backend. Two new event types:commandandcrash.Command telemetry for all commands -- every yargs command (both
registerSubcommandand inline handlers) now emits acommandevent with canonical name, duration, success/failure, and which flags were used. Alias resolution normalizesorg->organization,claim->env.claimso metrics don't fragment.Crash reporting -- global
uncaughtException/unhandledRejectionhandlers capture unhandled crashes with sanitized stack traces (home dir stripped, truncated to 4KB). Handlers are synchronous to avoid Node's async-handler footgun.Store-and-forward -- events are persisted to a PID-based temp file on process exit via
writeFileSync. Next CLI invocation recovers and sends them. Handles the ~25 locations that callprocess.exit()directly by queuing a provisional event in the middleware before the handler runs.Flush improvements --
flush()now returns a boolean (sent vs retryable), usesspliceinstead of clearing the whole queue (protects events queued during an in-flight fetch), drops events on 4xx (permanent failures like 401 won't accumulate), and retains on 5xx/network errors for store-forward. Commands flush in-process via the handler wrapper rather than always deferring to next invocation.WORKOS_DEBUG=1-- enables verbose debug logging for all commands (not just the installer's--debugflag). Shows telemetry event details on flush.Design decisions worth noting:
install,dashboard, and default$0commands are excluded from command-level telemetry since they have their own session-based telemetry.Summary by CodeRabbit
Documentation
New Features
Improvements
Privacy