feat: CLI telemetry for all commands + crash reporting by nicknisi · Pull Request #122 · workos/cli

nicknisi · 2026-04-14T23:55:36Z

Summary

Right now we only have telemetry on the installer flow. This PR adds telemetry coverage to every CLI command, plus crash reporting and a debug flag.

What changed:

Event type enrichment -- session.start/end now carry environment fingerprint (OS, Node version, CI detection, shell). Step and tool events get explicit startTimestamp for accurate span reconstruction on the backend. Two new event types: command and crash.
Command telemetry for all commands -- every yargs command (both registerSubcommand and inline handlers) now emits a command event with canonical name, duration, success/failure, and which flags were used. Alias resolution normalizes org -> organization, claim -> env.claim so metrics don't fragment.
Crash reporting -- global uncaughtException/unhandledRejection handlers capture unhandled crashes with sanitized stack traces (home dir stripped, truncated to 4KB). Handlers are synchronous to avoid Node's async-handler footgun.
Store-and-forward -- events are persisted to a PID-based temp file on process exit via writeFileSync. Next CLI invocation recovers and sends them. Handles the ~25 locations that call process.exit() directly by queuing a provisional event in the middleware before the handler runs.
Flush improvements -- flush() now returns a boolean (sent vs retryable), uses splice instead of clearing the whole queue (protects events queued during an in-flight fetch), drops events on 4xx (permanent failures like 401 won't accumulate), and retains on 5xx/network errors for store-forward. Commands flush in-process via the handler wrapper rather than always deferring to next invocation.
WORKOS_DEBUG=1 -- enables verbose debug logging for all commands (not just the installer's --debug flag). Shows telemetry event details on flush.

Design decisions worth noting:

Non-installer telemetry only works for JWT-authenticated users. API-key-only users don't have a token the LLM gateway guard accepts, so their events are silently dropped. Acceptable for v1.
The install, dashboard, and default $0 commands are excluded from command-level telemetry since they have their own session-based telemetry.
The crash reporter is sync-only -- it queues the event and relies on store-forward's exit handler to persist it. No async flush attempts in the crash path.

Summary by CodeRabbit

Documentation
- Added telemetry guidance for command handlers, aliases, and installer exclusions.
New Features
- WORKOS_DEBUG for verbose CLI logs.
- Crash reporting and persistent device identifier for telemetry.
- Store-and-forward telemetry persistence and recovery.
Improvements
- Richer telemetry (command/crash events, auth-mode, env/device fingerprint).
- Unified command aliasing to avoid fragmented metrics.
- Safer exit handling and more structured error/termination reporting.
Privacy
- Stronger sanitization of secrets in telemetry.

Closes security-audit finding #1 on PR #122 (telemetry message sanitization). `error.message` was flowing into 4 capture sites unsanitized, leaking homedir paths (and rarely, credentials) to the WorkOS gateway. - Add `sanitizeMessage()` in crash-reporter.ts: homedir strip + Bearer/ sk_*/JWT redaction + 1KB truncation. - Factor secret redaction into shared `redactSecrets()` used by both `sanitizeMessage` and `sanitizeStack` (Node echoes `.message` into the leading `Error.stack` line, so message-only sanitization was insufficient). - Add private `extractErrorFields()` chokepoint on `Analytics`; route all 4 capture sites through it (`captureException`, `stepCompleted`, `commandExecuted`, `captureUnhandledCrash`). `replaceLastCommandEvent` inherits sanitization via its delegation to `commandExecuted`. - `captureUnhandledCrash` now uses `sanitizeStack` instead of inline truncation, providing defense-in-depth for callers that bypass the crash-reporter wrapper. - Add regression guard test (`telemetry-sanitize.spec.ts`): poisons every capture method with homedir + Bearer + sk_live_ + JWT, asserts no marker reaches the serialized queue. Reviewed: ideation:reviewer cycle 1 PASS (0 critical, 0 high).

coderabbitai · 2026-05-12T19:54:15Z

📝 Walkthrough

Walkthrough

Adds end-to-end telemetry: types and schema for command/crash events, device-id persistence, crash sanitization/reporting, telemetry-client queue and persist, store-and-forward recovery, analytics command lifecycle and auth-mode tracking, command telemetry middleware/wrappers, enriched error apiContext and exit-code mapping, and CLI wiring and tests.

Changes

Telemetry Infrastructure & Command Tracking

Layer / File(s)	Summary
All telemetry & wiring changes `src/utils/telemetry-types.ts`, `src/utils/telemetry-schema.spec.ts`, `src/lib/device-id.ts`, `src/lib/device-id.spec.ts`, `src/utils/crash-reporter.ts`, `src/utils/crash-reporter.spec.ts`, `src/utils/telemetry-client.ts`, `src/utils/telemetry-client.spec.ts`, `src/utils/telemetry-store-forward.ts`, `src/utils/telemetry-store-forward.spec.ts`, `src/utils/analytics.ts`, `src/utils/analytics.spec.ts`, `src/utils/command-telemetry.ts`, `src/utils/command-telemetry.spec.ts`, `src/lib/command-aliases.ts`, `src/utils/register-subcommand.ts`, `src/utils/telemetry-sanitize.spec.ts`, `src/utils/telemetry-schema.spec.ts`, `src/bin.ts`, various command modules and tests`	Implements telemetry event types (`command`, `crash`), device-id generation and caching, crash stack/message sanitization and synchronous crash reporting, telemetry-client queue mutations and persist-to-file, store-and-forward persistence and recovery, Analytics extensions (auth-mode, command lifecycle, termination recording, sanitization), command telemetry middleware and wrap handler, canonical command aliases, CLI bootstrap wiring (debug/telemetry init, crash reporter, store-forward, middleware, top-level handler wrapping), enriched API error handling with structured `apiContext` and exit-code → termination mapping, and comprehensive tests covering these behaviors.

Possibly related PRs

workos/cli#142: Related to top-level workos api command registration and telemetry wrapping.
workos/cli#139: Overlaps in src/commands/login.ts termination/auth flow changes.
workos/cli#130: Related to doctor command updates and command registration options.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 44.74% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title 'feat: CLI telemetry for all commands + crash reporting' accurately and concisely summarizes the main changes—adding telemetry coverage for CLI commands and implementing crash reporting functionality.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch nicknisi/telemetry

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 9

🧹 Nitpick comments (4)

src/utils/telemetry-types.ts (1)
58-58: ⚡ Quick win

Align startTimestamp optionality with backward-compat contract.

telemetry-schema.spec.ts explicitly accepts step/agent.tool without startTimestamp, but these interfaces currently require it. Making both optional avoids contract drift.
Suggested patch
 export interface StepEvent extends TelemetryEvent {
   type: 'step';
   name: string;
-  startTimestamp: string;
+  startTimestamp?: string;
   durationMs: number;
   success: boolean;
   error?: {
@@
 export interface AgentToolEvent extends TelemetryEvent {
   type: 'agent.tool';
   toolName: string;
-  startTimestamp: string;
+  startTimestamp?: string;
   durationMs: number;
   success: boolean;
 }
Also applies to: 70-70
src/utils/telemetry-client.spec.ts (1)

60-105: ⚡ Quick win

Add coverage for claim-token auth headers.

setClaimTokenAuth() is new behavior in this PR, but there’s no test asserting x-workos-claim-token + x-workos-client-id emission (and bearer-token precedence when both exist).

Also applies to: 124-218

src/utils/telemetry-store-forward.ts (1)

1-3: ⚡ Quick win

Use async fs APIs for startup recovery path.

recoverPendingEvents() (Line 25+) is async but uses sync disk calls (Lines 27-56), which blocks startup and conflicts with the project rule for TS files.

As per coding guidelines "Avoid Node-specific sync APIs (crypto, fs sync) unless necessary".

Also applies to: 27-56
src/utils/output.ts (1)
123-124: ⚡ Quick win

Protect process termination from telemetry failures

analytics.recordTermination(...) on Line 123 can throw and prevent Line 124 from executing. Error exits should remain deterministic even when telemetry is unhealthy.
🔧 Proposed fix
   const reason = error.apiContext ? 'api_error' : codeReason;
-  analytics.recordTermination(reason, error.code, error.apiContext);
-  process.exit(exit);
+  try {
+    analytics.recordTermination(reason, error.code, error.apiContext);
+  } finally {
+    process.exit(exit);
+  }
 }

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c45afda9-659c-43a0-9646-cd08dd0cabe3

📥 Commits

Reviewing files that changed from the base of the PR and between 524c709 and 31829ec.

📒 Files selected for processing (27)

CLAUDE.md
src/bin.ts
src/commands/debug.ts
src/lib/api-error-handler.spec.ts
src/lib/api-error-handler.ts
src/lib/command-aliases.ts
src/lib/device-id.spec.ts
src/lib/device-id.ts
src/lib/run-with-core.ts
src/utils/analytics.spec.ts
src/utils/analytics.ts
src/utils/command-telemetry.spec.ts
src/utils/command-telemetry.ts
src/utils/crash-reporter.spec.ts
src/utils/crash-reporter.ts
src/utils/exit-codes.spec.ts
src/utils/exit-codes.ts
src/utils/output.spec.ts
src/utils/output.ts
src/utils/register-subcommand.ts
src/utils/telemetry-client.spec.ts
src/utils/telemetry-client.ts
src/utils/telemetry-sanitize.spec.ts
src/utils/telemetry-schema.spec.ts
src/utils/telemetry-store-forward.spec.ts
src/utils/telemetry-store-forward.ts
src/utils/telemetry-types.ts

coderabbitai · 2026-05-12T20:00:07Z

+  try {
+    if (fs.existsSync(filePath)) {
+      const raw = fs.readFileSync(filePath, 'utf8').trim();
+      if (UUID_REGEX.test(raw)) {
+        cached = raw;
+        return raw;
+      }
+    }
+
+    const id = crypto.randomUUID();
+    fs.mkdirSync(path.dirname(filePath), { recursive: true, mode: 0o700 });
+    fs.writeFileSync(filePath, id, { encoding: 'utf8', mode: 0o600 });
+    cached = id;


🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift

Replace sync fs usage in the command path.

Line 31–43 and Line 49 use synchronous fs/crypto APIs on a hot CLI path. Please move this to async (node:fs/promises) and cache a pending promise to keep call sites simple.

As per coding guidelines src/**/*.ts: “Avoid Node-specific sync APIs (crypto, fs sync) unless necessary”.

Also applies to: 49-50

coderabbitai · 2026-05-12T20:00:07Z

+export function extractUserFlags(rawArgs: string[]): string[] {
+  const passedFlags = rawArgs
+    .filter((arg) => arg.startsWith('--') || (arg.startsWith('-') && arg.length === 2))
+    .map((arg) => arg.replace(/^-+/, '').split('=')[0]);
+  return [...new Set(passedFlags)];


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Harden flag extraction to avoid invalid telemetry flags.

Line 32/33 currently treats -- and values like -1 as flags, which can pollute command.flags.

Suggested fix

export function extractUserFlags(rawArgs: string[]): string[] { const passedFlags = rawArgs - .filter((arg) => arg.startsWith('--') || (arg.startsWith('-') && arg.length === 2)) - .map((arg) => arg.replace(/^-+/, '').split('=')[0]); + .filter((arg) => { + if (arg === '--') return false; + if (/^--[A-Za-z][\w-]*(=.*)?$/.test(arg)) return true; + if (/^-[A-Za-z]$/.test(arg)) return true; + return false; + }) + .map((arg) => arg.replace(/^-+/, '').split('=')[0]) + .filter(Boolean); return [...new Set(passedFlags)]; }

coderabbitai · 2026-05-12T20:00:07Z

+    const topLevelCommand = commandParts[0] ?? '';
+    if (SKIP_TELEMETRY_COMMANDS.has(topLevelCommand)) return;


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Normalize the skip check using canonical top-level command.

Line 60/61 checks the raw command token. If a skipped command is invoked via alias, the exemption can be bypassed and telemetry still emitted.

Suggested fix

- const topLevelCommand = commandParts[0] ?? ''; + const rawTopLevelCommand = commandParts[0] ?? ''; + const topLevelCommand = COMMAND_ALIASES[rawTopLevelCommand] ?? rawTopLevelCommand; if (SKIP_TELEMETRY_COMMANDS.has(topLevelCommand)) return;

coderabbitai · 2026-05-12T20:00:07Z

+export function sanitizeStack(stack: string | undefined): string {
+  if (!stack) return '';
+  let sanitized = stack.replaceAll(HOME, '~');
+  sanitized = sanitized.replace(/\/[^\s:]+\/(node_modules|dist|src)\//g, '$1/');


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

find . -name "crash-reporter.ts" -type f

Repository: workos/cli

Length of output: 84

🏁 Script executed:

cat -n src/utils/crash-reporter.ts

Repository: workos/cli

Length of output: 3137

🏁 Script executed:

python3 << 'EOF' import re # Test the current regex against various path formats current_pattern = r'/[^\s:]+/(node_modules|dist|src)/' windows_pattern = r'[A-Za-z]:\\[^\s:]+\\(node_modules|dist|src)\\' # Test cases test_cases = [ # POSIX paths "/home/user/project/node_modules/lib/index.js", "/var/app/dist/main.js", "/Users/dev/src/index.ts", # Windows paths "C:\\Users\\dev\\project\\node_modules\\lib\\index.js", "D:\\app\\dist\\main.js", "E:\\code\\src\\index.ts", # Stack trace examples " at functionName (/home/user/project/node_modules/pkg/index.js:10:5)", " at functionName (C:\\Users\\dev\\project\\node_modules\\pkg\\index.js:10:5)", ] print("Testing CURRENT regex pattern:") print(f"Pattern: {current_pattern}") for test in test_cases: match = re.search(current_pattern, test) result = re.sub(current_pattern, r'\1/', test) if match else test print(f" '{test[:60]}...' -> Match: {bool(match)}, Result: '{result[:60]}...'") print("\nTesting PROPOSED Windows pattern:") print(f"Pattern: {windows_pattern}") for test in test_cases: match = re.search(windows_pattern, test) result = re.sub(windows_pattern, r'\1\\', test) if match else test print(f" '{test[:60]}...' -> Match: {bool(match)}, Result: '{result[:60]}...'") print("\nTesting COMBINED patterns:") for test in test_cases: result = test result = re.sub(current_pattern, r'\1/', result) result = re.sub(windows_pattern, r'\1\\', result) print(f" '{test[:60]}...' -> Result: '{result[:60]}...'") EOF

Repository: workos/cli

Length of output: 2630

Windows absolute paths leak in stack trace sanitization

Line 26 only collapses POSIX-style paths (/path/to/node_modules/). Windows paths (e.g., C:\Users\...\node_modules\) bypass this regex entirely and leak full local filesystem paths into telemetry payloads.

Regex testing confirms the current pattern does not match Windows absolute paths:

C:\Users\dev\project\node_modules\pkg\index.js → remains unsanitized

/home/user/project/node_modules/pkg/index.js → correctly collapsed to node_modules/pkg/index.js

🔧 Proposed fix

export function sanitizeStack(stack: string | undefined): string { if (!stack) return ''; let sanitized = stack.replaceAll(HOME, '~'); - sanitized = sanitized.replace(/\/[^\s:]+\/(node_modules|dist|src)\//g, '$1/'); + sanitized = sanitized + .replace(/\/[^\s:]+\/(node_modules|dist|src)\//g, '$1/') + .replace(/[A-Za-z]:\\[^\s:]+\\(node_modules|dist|src)\\/g, '$1\\'); sanitized = redactSecrets(sanitized); return sanitized.length > MAX_STACK_LENGTH ? sanitized.slice(0, MAX_STACK_LENGTH) + '\n...[truncated]' : sanitized; }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

sanitized = sanitized.replace(/\/[^\s:]+\/(node_modules|dist|src)\//g, '$1/');

export function sanitizeStack(stack: string | undefined): string {

if (!stack) return '';

let sanitized = stack.replaceAll(HOME, '~');

sanitized = sanitized

.replace(/\/[^\s:]+\/(node_modules|dist|src)\//g, '$1/')

.replace(/[A-Za-z]:\\[^\s:]+\\(node_modules|dist|src)\\/g, '$1\\');

sanitized = redactSecrets(sanitized);

return sanitized.length > MAX_STACK_LENGTH

? sanitized.slice(0, MAX_STACK_LENGTH) + '\n...[truncated]'

: sanitized;

}

coderabbitai · 2026-05-12T20:00:07Z

 export function exitWithCode(code: ExitCodeValue, error?: { code: string; message: string }): never {
  if (error) {
    outputError(error);
  }
+  analytics.recordTermination(reasonForExitCode(code), error?.code);
  process.exit(code);


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Classify termination from error.code when available.

Line 76 derives reason only from numeric exit code, so all code-1 exits become validation_error. That misclassifies API failures like http_500 when they go through exitWithCode.

Suggested fix

export function exitWithCode(code: ExitCodeValue, error?: { code: string; message: string }): never { if (error) { outputError(error); } - analytics.recordTermination(reasonForExitCode(code), error?.code); + const reason = error ? resolveErrorCode(error.code).reason : reasonForExitCode(code); + analytics.recordTermination(reason, error?.code); process.exit(code); }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

export function exitWithCode(code: ExitCodeValue, error?: { code: string; message: string }): never {

if (error) {

outputError(error);

}

analytics.recordTermination(reasonForExitCode(code), error?.code);

process.exit(code);

export function exitWithCode(code: ExitCodeValue, error?: { code: string; message: string }): never {

if (error) {

outputError(error);

}

const reason = error ? resolveErrorCode(error.code).reason : reasonForExitCode(code);

analytics.recordTermination(reason, error?.code);

process.exit(code);

}

coderabbitai · 2026-05-12T20:00:07Z

+  async flush(): Promise<boolean> {
+    if (this.events.length === 0) return true;
    if (!this.gatewayUrl) {
      debug('[Telemetry] No gateway URL configured, skipping flush');
-      return;
+      return false;
    }

-    const payload: TelemetryRequest = { events: [...this.events] };
-    this.events = [];
+    const count = this.events.length;
+    const payload: TelemetryRequest = { events: this.events.slice(0, count) };



⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Guard against concurrent flush() calls to prevent event loss/duplication.

Line 85 currently allows overlapping flushes. Two concurrent calls can send the same snapshot twice, and later splice(0, count) can remove events queued after the first flush.

Suggested patch

export class TelemetryClient { private events: TelemetryEvent[] = []; + private flushInFlight: Promise<boolean> | null = null; @@ async flush(): Promise<boolean> { + if (this.flushInFlight) return this.flushInFlight; + this.flushInFlight = this.flushInternal(); + try { + return await this.flushInFlight; + } finally { + this.flushInFlight = null; + } + } + + private async flushInternal(): Promise<boolean> { if (this.events.length === 0) return true; @@ - } + } }

Also applies to: 135-137, 143-144

coderabbitai · 2026-05-12T20:00:07Z

+      mkdirSync(dirname(filePath), { recursive: true });
+      writeFileSync(filePath, JSON.stringify(this.events), 'utf-8');
+      this.events = [];


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Persist telemetry with restrictive filesystem permissions.

persistToFile() currently relies on default permissions. Since payloads can include user/device identifiers, write with 0700 dir and 0600 file modes.

Suggested patch

- mkdirSync(dirname(filePath), { recursive: true }); - writeFileSync(filePath, JSON.stringify(this.events), 'utf-8'); + mkdirSync(dirname(filePath), { recursive: true, mode: 0o700 }); + writeFileSync(filePath, JSON.stringify(this.events), { encoding: 'utf-8', mode: 0o600 });

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

mkdirSync(dirname(filePath), { recursive: true });

writeFileSync(filePath, JSON.stringify(this.events), 'utf-8');

this.events = [];

mkdirSync(dirname(filePath), { recursive: true, mode: 0o700 });

writeFileSync(filePath, JSON.stringify(this.events), { encoding: 'utf-8', mode: 0o600 });

this.events = [];

greptile-apps · 2026-05-12T20:01:08Z

Greptile Summary

This PR adds comprehensive CLI telemetry (command events, crash reporting, store-and-forward persistence) to every non-installer command, complementing the existing installer session telemetry. The implementation is well-structured with provisional events, alias normalization, env fingerprinting, and a PID-based store-forward mechanism to survive process.exit() call sites.

Command telemetry middleware + wrapCommandHandler: a provisional event is queued before the handler runs (surviving direct process.exit() calls), then replaced with the real duration and termination reason on completion or failure.
Store-and-forward: unsent events are persisted synchronously to a PID-keyed tmpdir file on exit and recovered on the next invocation, ensuring events from crashed or fast-exiting processes are eventually delivered.
Crash reporter: global uncaughtException/unhandledRejection handlers capture unhandled crashes with sanitized stack traces; however, both handlers suppress Node's default stderr output, leaving users with a silent exit when a genuine crash occurs.

Confidence Score: 4/5

The telemetry wiring is sound overall; one issue in the crash reporter changes observable behavior for users encountering unhandled errors.

The crash handler replaces Node's default uncaught-exception and unhandled-rejection output without printing anything to stderr, so a genuine crash now exits silently. Users relying on the error message to diagnose a problem will see nothing. This is a regression in error visibility for an edge case that the reporter is specifically designed to capture.

src/utils/crash-reporter.ts — the uncaughtException and unhandledRejection handlers need to emit the error to stderr before exiting.

Important Files Changed

Filename	Overview
src/utils/crash-reporter.ts	New crash reporter that queues telemetry on uncaught exceptions/rejections; silently drops stderr output before exiting, hiding errors from users. Sanitization of sensitive data in stacks is incomplete.
src/utils/telemetry-client.ts	Enhanced flush with splice-based queue protection and 4xx/5xx discrimination; flush() returns false for missing gateway URL which signals retry rather than drop, causing event accumulation.
src/utils/analytics.ts	Adds commandExecuted, replaceLastCommandEvent, recordTermination, and captureUnhandledCrash for full command lifecycle telemetry; env fingerprinting on every event.
src/utils/command-telemetry.ts	Middleware + handler wrapper for command-level telemetry; provisional event queued pre-handler, replaced with real duration/outcome post-handler; alias normalization for metrics consistency.
src/utils/telemetry-store-forward.ts	PID-based store-forward: writes unsent events to tmpdir on exit, recovers and flushes them on next startup. Correct handling of concurrent delete+re-persist on flush failure.
src/utils/telemetry-types.ts	New CommandEvent and CrashEvent types with environment fingerprint attributes; TerminationReason discriminated union replaces the boolean success flag.
src/bin.ts	Startup-time wiring for crash reporter, store-forward, and non-installer analytics init; wrapCommandHandler applied to all inline command handlers.
src/utils/exit-codes.ts	resolveErrorCode and reasonForExitCode map error codes to TerminationReason; exitWithCode/exitWithError now record termination to analytics before calling process.exit.
src/lib/device-id.ts	New persistent device UUID stored at ~/.workos/device-id with 0600 permissions; correct UUID v4 validation regex; session-scoped fallback on any IO failure.
src/lib/api-error-handler.ts	Enriches exitWithError calls with apiContext (status, code, resource) so command telemetry events capture API failure details.

Sequence Diagram

sequenceDiagram
    participant User
    participant bin.ts
    participant Middleware
    participant Handler
    participant Analytics
    participant TelemetryClient
    participant StoreForward

    User->>bin.ts: workos command
    bin.ts->>bin.ts: installCrashReporter()
    bin.ts->>StoreForward: installStoreForward()
    bin.ts->>Analytics: initForNonInstaller()
    bin.ts->>StoreForward: recoverPendingEvents() fire-and-forget
    StoreForward->>TelemetryClient: queueEvents(recovered)
    StoreForward->>TelemetryClient: flush() async
    bin.ts->>Middleware: commandTelemetryMiddleware(rawArgs)
    Middleware->>Analytics: commandExecuted(name, 0, true) provisional
    Analytics->>TelemetryClient: queueEvent(CommandEvent provisional)
    Middleware->>Handler: wrapCommandHandler(handler)(argv)
    alt Normal completion
        Handler->>Analytics: replaceLastCommandEvent(name, duration, true)
        Analytics->>TelemetryClient: replaceLastEventOfType command
        Analytics->>TelemetryClient: queueEvent(CommandEvent real)
        Handler->>Analytics: recordTermination success
        Analytics->>TelemetryClient: patchLastEventOfType command
        Handler->>TelemetryClient: flush()
    else exitWithError or exitWithCode
        Handler->>Analytics: recordTermination(reason, code, apiContext)
        Analytics->>TelemetryClient: patchLastEventOfType command
        Handler->>StoreForward: process.exit triggers exit event
        StoreForward->>TelemetryClient: persistToFile(PID.json)
    else Uncaught exception
        bin.ts->>Analytics: captureUnhandledCrash(error)
        Analytics->>TelemetryClient: queueEvent(CrashEvent)
        bin.ts->>StoreForward: process.exit(1) triggers exit event
        StoreForward->>TelemetryClient: persistToFile(PID.json)
    end

_{Reviews (2): Last reviewed commit: "fix(telemetry): close coverage gaps and ..." | Re-trigger Greptile}

…amps, and new event types Add environment fingerprint fields (OS, Node version, CI detection, shell) to session.start and session.end events. Add startTimestamp to step and agent.tool events for span reconstruction. Define command and crash event types with stub emission methods. Add discriminated union Zod schema validation tests mirroring the API schema.

… persistence Wire up yargs middleware that emits a provisional command event before each handler runs, then replaces it with actual duration/success on completion. This covers the ~25 process.exit() call sites without modifying them. - Command telemetry middleware with canonical name resolution and flag extraction - Crash reporter with sanitized stack traces (sync handlers, no async) - Store-forward: persist unsent events to temp file on exit, recover on next run - Fix flush() to retain events until HTTP success (was clearing before fetch) - Auto-wrap handlers in registerSubcommand() (single change point) - Shared COMMAND_ALIASES map for telemetry and help-json - analytics.initForNonInstaller() sets gatewayUrl + JWT from stored creds

- Enable debug output for non-installer commands via env var - Log telemetry event details (type, name, duration, attributes) on flush - Register in debug command's env var catalog

- Wrap inline command handlers (seed, setup-org, doctor, etc.) with wrapCommandHandler so they report real duration/success - Skip provisional telemetry event for install command (has own session telemetry) - Add claim -> env.claim to canonical alias map - Defer store-forward file deletion until after successful flush

Client errors (401, 403) are permanent failures that won't succeed on retry. Only retain events for 5xx (transient server errors) and network failures where store-forward retry is meaningful.

- flush() returns true (sent/dropped) or false (retryable) so callers can act on the result - Use splice(0, count) instead of clearing all events, protecting events queued concurrently during the fetch - wrapCommandHandler flushes in-process so events are sent immediately instead of always deferring to next invocation via store-forward - Store-forward recovery deletes files after loading into memory (events are re-persisted by exit handler if flush fails) - Skip provisional events for dashboard and $0 (installer entry points) - Add 4xx drop test coverage

Add a section to CLAUDE.md explaining which commands auto-emit telemetry (registerSubcommand) versus which need manual wrapCommandHandler wrapping (inline top-level .command() calls). Add a pointer comment in bin.ts near the workflow commands block. Prevents new top-level commands from silently emitting duration=0 telemetry.

- Add workos.user_id to command and crash events (from stored credentials or unclaimed environment clientId) so dashboards can count unique users - Add cli.version to command and crash events for release adoption tracking - Support claim-token auth path on the telemetry client, so unclaimed environments' telemetry reaches the API (guard accepts this path too) - Rename CrashEvent's installer.version to cli.version (crashes happen outside the installer too) - initForNonInstaller() now wires up user_id and claim-token auth

Closes security-audit finding #1 on PR #122 (telemetry message sanitization). `error.message` was flowing into 4 capture sites unsanitized, leaking homedir paths (and rarely, credentials) to the WorkOS gateway. - Add `sanitizeMessage()` in crash-reporter.ts: homedir strip + Bearer/ sk_*/JWT redaction + 1KB truncation. - Factor secret redaction into shared `redactSecrets()` used by both `sanitizeMessage` and `sanitizeStack` (Node echoes `.message` into the leading `Error.stack` line, so message-only sanitization was insufficient). - Add private `extractErrorFields()` chokepoint on `Analytics`; route all 4 capture sites through it (`captureException`, `stepCompleted`, `commandExecuted`, `captureUnhandledCrash`). `replaceLastCommandEvent` inherits sanitization via its delegation to `commandExecuted`. - `captureUnhandledCrash` now uses `sanitizeStack` instead of inline truncation, providing defense-in-depth for callers that bypass the crash-reporter wrapper. - Add regression guard test (`telemetry-sanitize.spec.ts`): poisons every capture method with homedir + Bearer + sk_live_ + JWT, asserts no marker reaches the serialized queue. Reviewed: ideation:reviewer cycle 1 PASS (0 critical, 0 high).

Introduces two additive telemetry signals requested by the signals spec: - device.id — persistent per-install UUID stored at ~/.workos/device-id. File IO failures fall through to a one-shot session UUID; never throws. - auth.mode — 4-state enum ('jwt' | 'claim_token' | 'api_key' | 'none') derived during initForNonInstaller with JWT > claim_token > api_key priority. Installer flow sets it in run-with-core after credentials resolve. Both fields are injected via getEnvFingerprint so they appear on every event that already carries env context (session.start, session.end, command, crash). No change to downstream gateway — existing deliveries accept the new attrs as optional. Implements Phase 1 of docs/ideation/telemetry-signals/spec-phase-1.md. Gateway-side type mirroring is tracked separately in the gateway repo.

…and events Replace the boolean `command.success` as the primary outcome dimension with a structured `termination.reason` enum (success | cancelled | auth_required | validation_error | api_error | crash) plus a `error.code` string. - `exitWithError`, `exitWithCode`, `exitWithAuthRequired` now patch the provisional command event via `analytics.recordTermination` before calling `process.exit`. Previously the queued provisional event (success true, duration 0) was persisted as-is by store-forward, producing misleading dashboard data. - `exitWithError` now honors the string error code for exit mapping: `auth_required` -> 4, `cancelled` -> 2, `http_*`/`not_found`/`unknown_error` -> 1 (api_error), everything else -> 1 (validation_error). Previously it always exited 1. - New `TelemetryClient.patchLastEventOfType` helper mutates a queued event in place. Unlike `replaceLastEventOfType`, it preserves the event so multiple callers can update fields incrementally. - `wrapCommandHandler` records `reason: 'success'` on clean completion and `reason: 'crash'` with `error.name` on uncaught throw. - `command.success` remains on CommandEvent for backward-compat. - `api.status`/`api.code`/`api.resource` fields added to CommandEvent and plumbed through `recordTermination` for Phase 3 consumption. Review passed (2 medium + 1 low non-blocking; low fixed in same commit for crash-path test symmetry).

…mmand events Wires `createApiErrorHandler` through to the provisional command event via an optional `apiContext` param on `exitWithError`, populating `api.status`, `api.code`, and `api.resource` attributes on API-failure command events. Also treats `apiContext` presence as ground truth for `api_error` termination reason — WorkOS error codes like `rate_limited` or `validation_error` would otherwise fall through `resolveErrorCode` to `validation_error`, hiding legitimate API failures from api_error dashboards. The `CommandEvent` type already had the `api.*` attribute slots (added as forward-compat in Phase 2); only the wiring and tests are new.

- Fix duration=0 on exit-path events: middleware now sets analytics.commandStartTime so recordTermination can compute real duration when exitWithError/exitWithCode bypass the wrapper - Remove not_found/unknown_error from ERROR_CODE_MAP to prevent local config misses (env.ts) from being misclassified as API errors - Resolve auth.mode before sessionStart in installer flow so the session.start event carries the correct credential source - Make recordTermination fully idempotent: clears stale api.* and error.code fields when called without them - Tighten device-id UUID validation to proper RFC 4122 v4 regex - Only override termination reason to api_error when the code-based classification falls to the generic validation_error fallback, preserving more specific reasons like auth_required

Wrap the `api` command handler with `wrapCommandHandler()` so it reports real duration and success/failure instead of the provisional defaults (duration=0, success=true). Replace raw `process.exit()` calls with `exitWithCode`/`exitWithError` across 10 command files so `recordTermination` fires before exit. This ensures `termination.reason` and `error.code` are populated on every command event. Long-running server handlers (dev/emulate SIGINT) are left as-is. Fix help-json.ts alias drift by importing from the shared `COMMAND_ALIASES` map instead of maintaining a private copy. Add a coverage test that scans bin.ts for inline handlers missing `wrapCommandHandler`, catching the class of regression that let the `api` command slip through.

coderabbitai

Actionable comments posted: 2

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e6b7b384-44b3-47fe-9030-070bd92bf44e

📥 Commits

Reviewing files that changed from the base of the PR and between 31829ec and 0475ea8.

📒 Files selected for processing (39)

CLAUDE.md
src/bin.ts
src/commands/api/index.spec.ts
src/commands/api/index.ts
src/commands/api/interactive.spec.ts
src/commands/api/interactive.ts
src/commands/debug.ts
src/commands/dev.ts
src/commands/doctor.ts
src/commands/emulate.ts
src/commands/env.ts
src/commands/install-skill.ts
src/commands/login.ts
src/commands/uninstall-skill.ts
src/lib/api-error-handler.spec.ts
src/lib/api-error-handler.ts
src/lib/command-aliases.ts
src/lib/device-id.spec.ts
src/lib/device-id.ts
src/lib/run-with-core.ts
src/utils/analytics.spec.ts
src/utils/analytics.ts
src/utils/command-telemetry.spec.ts
src/utils/command-telemetry.ts
src/utils/crash-reporter.spec.ts
src/utils/crash-reporter.ts
src/utils/exit-codes.spec.ts
src/utils/exit-codes.ts
src/utils/help-json.ts
src/utils/output.spec.ts
src/utils/output.ts
src/utils/register-subcommand.ts
src/utils/telemetry-client.spec.ts
src/utils/telemetry-client.ts
src/utils/telemetry-sanitize.spec.ts
src/utils/telemetry-schema.spec.ts
src/utils/telemetry-store-forward.spec.ts
src/utils/telemetry-store-forward.ts
src/utils/telemetry-types.ts

✅ Files skipped from review due to trivial changes (1)

src/commands/emulate.ts

🚧 Files skipped from review as they are similar to previous changes (22)

CLAUDE.md
src/lib/command-aliases.ts
src/utils/register-subcommand.ts
src/utils/telemetry-client.spec.ts
src/utils/crash-reporter.spec.ts
src/commands/debug.ts
src/utils/telemetry-schema.spec.ts
src/utils/telemetry-client.ts
src/utils/telemetry-store-forward.spec.ts
src/lib/device-id.spec.ts
src/utils/output.spec.ts
src/utils/telemetry-store-forward.ts
src/utils/command-telemetry.spec.ts
src/lib/api-error-handler.spec.ts
src/lib/device-id.ts
src/lib/api-error-handler.ts
src/utils/telemetry-sanitize.spec.ts
src/utils/crash-reporter.ts
src/utils/analytics.ts
src/utils/telemetry-types.ts
src/utils/exit-codes.ts
src/utils/analytics.spec.ts

coderabbitai · 2026-05-13T03:24:18Z

    if (!options.json) {
      clack.log.error(`Doctor failed: ${error instanceof Error ? error.message : 'Unknown error'}`);
    } else {
      console.error(JSON.stringify({ error: error instanceof Error ? error.message : 'Unknown error' }));
    }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Emit structured JSON errors in doctor JSON mode.

Line 40 currently writes {"error":"..."}. JSON-mode errors should keep a stable error.code and error.message object shape.

Suggested fix

- } else { - console.error(JSON.stringify({ error: error instanceof Error ? error.message : 'Unknown error' })); - } + } else { + console.error( + JSON.stringify({ + error: { + code: 'doctor_failed', + message: error instanceof Error ? error.message : 'Unknown error', + }, + }), + ); + }

As per coding guidelines src/**/*.ts: Return structured JSON error objects to stderr in non-TTY mode: { "error": { "code": "...", "message": "..." } }.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if (!options.json) {

clack.log.error(`Doctor failed: ${error instanceof Error ? error.message : 'Unknown error'}`);

} else {

console.error(JSON.stringify({ error: error instanceof Error ? error.message : 'Unknown error' }));

}

if (!options.json) {

clack.log.error(`Doctor failed: ${error instanceof Error ? error.message : 'Unknown error'}`);

} else {

console.error(

JSON.stringify({

error: {

code: 'doctor_failed',

message: error instanceof Error ? error.message : 'Unknown error',

},

}),

);

}

coderabbitai · 2026-05-13T03:24:18Z

+  const preSessionEnv = getActiveEnvironment();
+  if (preSessionEnv?.clientId && preSessionEnv?.apiKey) {
+    analytics.setAuthMode(isUnclaimedEnvironment(preSessionEnv) ? 'claim_token' : 'api_key');
+  } else if (getAccessToken()) {
+    analytics.setAuthMode('jwt');
+  } else if (process.env.WORKOS_API_KEY) {
+    analytics.setAuthMode('api_key');
+  }


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Include CLI/discovered API key in pre-session auth-mode detection.

At Line 546, pre-session classification only checks process.env.WORKOS_API_KEY. If auth comes from --api-key (or .env.local merged into augmentedOptions.apiKey), session.start can miss auth.mode = "api_key".

Suggested fix

const preSessionEnv = getActiveEnvironment(); if (preSessionEnv?.clientId && preSessionEnv?.apiKey) { analytics.setAuthMode(isUnclaimedEnvironment(preSessionEnv) ? 'claim_token' : 'api_key'); } else if (getAccessToken()) { analytics.setAuthMode('jwt'); - } else if (process.env.WORKOS_API_KEY) { + } else if (augmentedOptions.apiKey || process.env.WORKOS_API_KEY) { analytics.setAuthMode('api_key'); }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

const preSessionEnv = getActiveEnvironment();

if (preSessionEnv?.clientId && preSessionEnv?.apiKey) {

analytics.setAuthMode(isUnclaimedEnvironment(preSessionEnv) ? 'claim_token' : 'api_key');

} else if (getAccessToken()) {

analytics.setAuthMode('jwt');

} else if (process.env.WORKOS_API_KEY) {

analytics.setAuthMode('api_key');

}

const preSessionEnv = getActiveEnvironment();

if (preSessionEnv?.clientId && preSessionEnv?.apiKey) {

analytics.setAuthMode(isUnclaimedEnvironment(preSessionEnv) ? 'claim_token' : 'api_key');

} else if (getAccessToken()) {

analytics.setAuthMode('jwt');

} else if (augmentedOptions.apiKey || process.env.WORKOS_API_KEY) {

analytics.setAuthMode('api_key');

}

greptile-apps · 2026-05-13T03:27:09Z

+  process.on('uncaughtException', (error) => {
+    reportCrashSync(error);
+    process.exit(1);
+  });
+
+  process.on('unhandledRejection', (reason) => {
+    const error = reason instanceof Error ? reason : new Error(String(reason));
+    reportCrashSync(error);
+    process.exit(1);
+  });


Crash handler suppresses all user-visible error output

Both uncaughtException and unhandledRejection handlers call process.exit(1) without writing anything to stderr. Adding your own uncaughtException handler replaces Node's default behavior (print the error + stack trace, then exit), so when a genuine crash occurs the process now exits silently — the user has no way to know what went wrong. The same applies to unhandledRejection in Node ≥15. Even a simple process.stderr.write(error.stack + '\n') before process.exit(1) would restore the expected output.

coderabbitai Bot reviewed May 12, 2026

View reviewed changes

greptile-apps Bot reviewed May 12, 2026

View reviewed changes

Comment thread src/utils/output.ts

Comment thread src/bin.ts

Comment thread src/utils/command-telemetry.ts

Comment thread src/utils/telemetry-client.ts

nicknisi added 15 commits May 12, 2026 15:39

feat: add WORKOS_DEBUG=1 env var for verbose logging on all commands

ff929a8

- Enable debug output for non-installer commands via env var - Log telemetry event details (type, name, duration, attributes) on flush - Register in debug command's env var catalog

fix: drop telemetry events on 4xx responses to prevent accumulation

9ddf5b7

Client errors (401, 403) are permanent failures that won't succeed on retry. Only retain events for 5xx (transient server errors) and network failures where store-forward retry is meaningful.

chore: de-slop

aa7baf6

nicknisi force-pushed the nicknisi/telemetry branch from 31829ec to 0475ea8 Compare May 13, 2026 03:20

coderabbitai Bot reviewed May 13, 2026

View reviewed changes

greptile-apps Bot reviewed May 13, 2026

View reviewed changes

		const topLevelCommand = commandParts[0] ?? '';
		if (SKIP_TELEMETRY_COMMANDS.has(topLevelCommand)) return;

Conversation

nicknisi commented Apr 14, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Possibly related PRs

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

1 participant

nicknisi commented Apr 14, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 12, 2026 •

edited

Loading

greptile-apps Bot commented May 12, 2026 •

edited

Loading