Skip to content

fix(web-analytics): mark liveness-probe traffic so error stats exclude it#60

Open
lukaso-bot wants to merge 1 commit into
mainfrom
fix/exclude-probe-from-error-stats
Open

fix(web-analytics): mark liveness-probe traffic so error stats exclude it#60
lukaso-bot wants to merge 1 commit into
mainfrom
fix/exclude-probe-from-error-stats

Conversation

@lukaso-bot

Copy link
Copy Markdown
Collaborator

What

The maintaining loop's synthetic liveness probe drives a real end-to-end lookup every cycle (CUJ #1) against unauthenticated GitHub. When that lookup trips the soft deadline, the app writes a lookup_timeout event tagged outcome=error — indistinguishable from a real user failure, because analytics deliberately records no user-agent. The loop then reads it back as a SYSTEM error and "watches" it forever, chasing its own shadow. Both lookup_timeout events in the live 7-day window correlate to-the-second with the loop's own observe runs.

How

  • auth.ts: add isLivenessProbe(ua) — matches the probe's existing liveapp-liveness-probe/N user-agent. The UA string itself is never stored; only a boolean is derived from it, so the edge-only privacy posture is unchanged.
  • analytics.ts: map that boolean to blob12 ('1' for probe, '' otherwise) in toDataPoint.
  • index.ts: wire the middleware (already reads the UA for audience) to set the flag — the same established pattern.

The companion prod-errors.mjs query lives in loop state (not this repo) and excludes blob12='1' from SYSTEM/CLIENT counts while keeping volume/cache counts inclusive (the loop's round-trip liveness proof relies on probe traffic showing up), and surfaces the excluded count separately so the exclusion is never silent. The exclusion is forward-looking; the two historical unmarked timeouts age out of the 7-day window naturally.

Tests (TDD, written first)

  • isLivenessProbe UA matching
  • toDataPoint blob12 mapping
  • middleware wiring: probe UA → blob12='1', ordinary UA → ''

Full validate.sh gate green locally (build, typecheck, 477 tests, a11y, biome, shellcheck, actionlint, gitleaks, publint, osv).

Privacy note

blob12 is a single boolean. No user-agent, IP, or other identifying value is persisted — consistent with the project's edge-only privacy posture.

…e it

The loop's synthetic liveness probe drives a real end-to-end lookup every
cycle (CUJ #1) against unauthenticated GitHub. When that lookup trips the
soft deadline it writes a `lookup_timeout` event tagged outcome=error —
indistinguishable from a real user failure, because analytics deliberately
records no user-agent. The maintaining loop reads that back as a SYSTEM
error and "watches" it forever: it's chasing its own shadow. Both
lookup_timeout events in the live 7d window correlate to-the-second with
the loop's own observe runs.

Mark probe traffic with a privacy-safe boolean (blob12='1'), derived from
the probe's existing `liveapp-liveness-probe/N` user-agent. The UA string
itself is never stored — only the boolean — so the edge-only privacy
posture is unchanged. The middleware already reads the UA for `audience`,
so this is the same established pattern.

The companion prod-errors.mjs query (loop state, not in this repo) excludes
blob12='1' from SYSTEM/CLIENT counts while keeping volume/cache counts
inclusive (those are the loop's round-trip liveness proof, which relies on
probe traffic showing up), and surfaces the excluded count separately so
the exclusion is never silent.

Tests first (TDD): isLivenessProbe UA matching, toDataPoint blob12 mapping,
and middleware wiring (probe UA → blob12='1', ordinary UA → '').

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@lukaso-bot lukaso-bot left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In-house adversarial review (Copilot quota-blocked here → no automated review will land)

Per review-policy.md, Copilot is perpetually quota-blocked on this repo, so I ran a skeptical, correctness-first pass with a distinct lens. Concrete findings:

Correctness — the one that would actually bite: Analytics Engine column ordering ✅

AE blobs are positional — every existing query (prod-errors.mjs, pnpm stats, the analytics.test.ts schema test) reads blob1..blob11 by index. This PR adds e.probe ? '1' : '' appended at the end (blobs[11] = blob12), not inserted mid-array, so no existing column shifts and every historical query keeps reading the same data. blob12 is well within AE's 20-blob cap. This is the failure mode that would silently corrupt the very error history the loop reads back, and the diff gets it right. Verified the schema test was updated in lockstep ('', // 12 probe).

Marker is set on every request, not just lookups ✅

app.use('*') sets probe on all events, so homepage//version//healthz probe hits are all markable — correct, since the goal is to exclude all self-traffic from error counts, while volume/cache stays inclusive (the round-trip liveness proof relies on probe traffic appearing). Consistent with the commit message.

Privacy ✅

Only the boolean reaches AE; the UA string is never stored. Same established pattern as audience (isUnfurlBot). Edge-only posture unchanged.

TDD ✅

isLivenessProbe matching (incl. future minor versions + negative cases), toDataPoint blob12 mapping, and middleware wiring (probe UA → '1', ordinary UA → ''). Failing-first discipline is evident in the structure.

Non-blocking observation (not a merge blocker)

isLivenessProbe uses startsWith('liveapp-liveness-probe') without the trailing /, so it also matches a UA like liveapp-liveness-probe-evil, and the marker is client-spoofable. This is analytics-only — not a security boundary — so the worst case is a user hiding their own errors from owner stats; negligible. If you want strictness without losing the future-minor-version forward-compat the comment calls out, startsWith('liveapp-liveness-probe/') (with slash) is marginally tighter. Optional polish.

Verdict: correctness verified (the column-ordering risk in particular); no blocking issues; mergeable once CI is green. Posted as a comment, not an approve — GitHub won't let the PR author self-approve. Real proof is post-merge + deploy: a probe-driven lookup_timeout should carry blob12='1' and drop out of the SYSTEM count.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant