feat(pi): telemetry parity with anthropic + pi/orchestrator clamp middleware#793
feat(pi): telemetry parity with anthropic + pi/orchestrator clamp middleware#793gewenyu99 wants to merge 36 commits into
Conversation
…dleware Closes the pi runner's telemetry gaps ahead of the wizard-runner flag cut, so the wizard-switchboard dashboard compares both runners on equal footing: - pi emits 'agent completed' (duration_ms/duration_seconds/model/num_turns) and 'agent aborted' (durations + model) with the anthropic property names. Cost/token fields deliberately omitted - the gateway's $ai_generation events already track those symmetrically. - pi collects the end-of-run remark: the shared REMARK_INSTRUCTION (extracted from the anthropic Stop hook into signals.ts) is sent as a follow-up prompt and the [WIZARD-REMARK] reply is parsed via AgentOutputSignals. Best-effort, never fails a successful run. - 'setup wizard finished' now carries the tag bag flat (harness, sequence, program_id, ...) so completion-rate breakdowns work without JSON extraction; the nested tags snapshot stays for back-compat. - new switchboard sequence middleware (piLinearClampMw): when the harness axis resolves to pi, flag-driven sequence selection clamps to linear, so wizard-runner=pi + wizard-orchestrator=true can never combine into a crashing cohort (pi has no runTask). CLI --sequence stays above the clamp for dev repro. Covered by new variant-gating tests. Not needed after all: 'bash denied' capture lives inside wizardCanUseTool, which the pi security gate already calls - pi has reported it all along. Generated-By: PostHog Code Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad
🧙 Wizard CIRun the Wizard CI and test your changes against wizard-workbench example apps by replying with a GitHub comment using one of the following commands: Test all apps:
Test all apps in a directory:
Test an individual app:
Show more apps
Results will be posted here when complete. |
…-use-pi-harness Simplifies the rollout surface: one boolean flag. On → pi harness paired with openai/gpt-5-mini; off/missing/non-'true' → binding default (anthropic + sonnet). A failed flag fetch can never opt anyone in. The pi/orchestrator linear clamp keys off the resolved harness, so it is unchanged; comments and tests updated to the new flag key. Generated-By: PostHog Code Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad
Generated-By: PostHog Code Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad
Gateway model registration takes pricing from pi-ai's catalog (bare model name lookup) instead of zeros, so getSessionStats() can cost the run. 'agent completed' now carries total_cost_usd + the four token fields with anthropic's property names, as a cross-check against $ai_generation. Cost is omitted when pricing is unknown rather than reporting $0. Generated-By: PostHog Code Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad
…en fields pi-ai computes cost client-side from its pricing catalog (calculateCost), so total_cost_usd would be a pricing guess, not a measurement. Token counts are API-reported and stay. Cost stays with $ai_generation. Generated-By: PostHog Code Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad
|
/wizard-ci basic-integration/astro |
🧙 Wizard CI ResultsTrigger ID:
Configuration
|
Two bugs from the first flag-driven pi run (a558ca12): - pi scanned every tool call but never recorded it — recordExternalScan in yara-hooks routes pi's scans/violations through the same recordMatch path as the hook-based scans, so 'yara rule matched' and the end-of-run 'yara scan report' now fire for pi. - REMARK_INSTRUCTION ended with the literal placeholder "Your remark here", which gpt-5-mini echoed verbatim; reworded without a literal. Generated-By: PostHog Code Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad
…axis, final pick Middlewares stamp a trace on the switchboard ctx as they assert (cli | flag | pi-clamp | binding per axis). The runner emits one 'wizard: switchboard resolved' event with the inputs (both flags, CLI overrides), the per-axis sources, and the final harness/model/sequence, plus a matching [switchboard] decision log line. Per-axis resolver logs now name their source too. Generated-By: PostHog Code Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad
Generated-By: PostHog Code Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad
…allowlist Shared policy (both harnesses) allowed `install`/`add`/`ci` but not the `i` alias, so `npm i posthog-js` cost a denied turn. Exact-token match — adding 'i' to SAFE_SCRIPTS would startsWith-allow anything i-prefixed (npm init stays blocked; covered by test). Generated-By: PostHog Code Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad
… line Generated-By: PostHog Code Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad
|
/wizard-ci basic-integration/next-js |
🧙 Wizard CI ResultsTrigger ID:
Configuration
|
…llback CI auth already identified the key owner when /api/users/@me/ was readable, but a scope failure was swallowed silently and flags fell back to a fresh anonymous UUID — so email-targeted flags never matched and percentage rollouts rerolled every run. Now: the failure is logged with the consequence, a success is logged too, and when the key can't resolve a user, --email sets a flag-targeting person-property override (no identify/alias — flag evaluation only). Generated-By: PostHog Code Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad
…key can't resolve a user Identity must match what the gateway attributes to the key, so no synthetic/override identities: the key owner from /api/users/@me/ is the one identity. When the key lacks user:read, the CI output now warns that flags evaluate anonymously instead of failing silently. Generated-By: PostHog Code Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad
|
/wizard-ci basic-integration/astro/astro-hybrid-marketing |
🧙 Wizard CI ResultsTrigger ID:
Configuration
|
|
/wizard-ci basic-integration/astro/astro-hybrid-marketing |
🧙 Wizard CI ResultsTrigger ID:
Configuration
|
…nstruction echoes Second field echo: gpt-5-mini replied "[WIZARD-REMARK] followed by the remark itself." — it copies whatever trails the marker in the format clause. The instruction now leads with the format and ends with the ask, and remark() discards any text that is a verbatim substring of the instruction, so no future wording tweak can silently regress this. Generated-By: PostHog Code Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad
Generated-By: PostHog Code Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad
Adds GPT5_4_MODEL ('openai/gpt-5.4') with a reasoning/low capability
entry and points the flag pairing at it. Gateway support unverified —
this run is the test.
Generated-By: PostHog Code
Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad
🧙 Wizard CI ResultsTrigger ID:
Configuration
|
🧙 Wizard CI ResultsTrigger ID:
Configuration
|
wizard-pi-model (variants gpt-5 / gpt-5-4 / gpt-5-mini → gateway ids; unknown/missing → gpt-5.4) selects pi's model; wizard-pi-effort (minimal/low/medium/high/xhigh) overrides the capability matrix for reasoning models. Both ride the switchboard decision event. Generated-By: PostHog Code Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad
There was a problem hiding this comment.
This file mirrors YARA behavior on the anthropic side.
analytics.captureException(err, { step: '<context>' }) is now the first
statement of every try/catch in src/ — 194 sites across 87 files, with
snake_case step slugs derived from the enclosing function. Excluded:
catches that rethrow (outer handler captures), catches already capturing
(directly or via reportFsError / wizardAbort({error})), and analytics'
own runtime deps (debug.ts, ci-flag-overrides.ts — require cycle).
Promise .catch() handlers are unchanged (not try/catch syntax).
Generated-By: PostHog Code
Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad
This reverts commit 7e6254a. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Same-harness parity: run sonnet through pi to compare against the openai tiers without switching harness. Mirrors the gpt variant mapping. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…itles Runs were emitting ~1 status line total and verbose task titles. Add a [STATUS] note (emit often — cheap, the live view between task changes) and tighten the task-list note to a short, high-level map, no file/sub-step detail. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Every run wasted a turn on a blocked direct read of a .env file. Point pi at check_env_keys / set_env_values up front instead of the fence. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Merging brevity into the up-front-creation sentence buried it; the model created the full list but wrote long, specific titles. Pull it out as a standalone rule banning file/framework/event names and parentheticals. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The model created tasks then worked out of order (task-4 completed while task-1..3 pending), so the panel read scrambled. Guide creating the list in execution order and driving it strictly top-to-bottom, and add a TaskReorder tool so it can realign the list when it does deviate. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…task tools Ordering is enforced by working the list top-to-bottom instead. Also make the [STATUS] guidance explicit: it is plain text the harness parses, not a tool. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…skList to pick next Make the control loop explicit: mark a task in_progress BEFORE starting it, and return to TaskList after each completion to choose the next step. Frequent reads of the list are the intended pattern, not waste. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The model was emitting [STATUS] fine (27/run), but the pi message_end handler only collected lines for the remark and never called spinner.message()/pushStatus() the way the anthropic path does. So every status line was silently dropped. Wire it up, last marker per turn wins. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…-match Guards the [STATUS]->spinner wiring so it can't silently regress to dropping status lines again. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…mandments own it The shared commandments already specify task creation/ordering/titles, and anthropic relies on them alone. PI_RUNTIME_NOTES carried a second, conflicting task spec (one note even said 'plan before you touch anything', contradicting the commandment's 'as soon as you understand the work' — so pi planned before reading the skill and dropped the dashboard/report steps). Remove the four duplicating task notes; keep only the pi-specific [STATUS] mechanism note. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
pi never receives the SDK's runtime-bound task-management system prompt that Claude Code has (confirmed: __SYSTEM_PROMPT_DYNAMIC_BOUND placeholder in the bundle), so it needs the substance explicitly. Restore two focused notes: plan AFTER understanding the work (post-skill) covering through dashboard+report and keep the list current; one in_progress, mark completions immediately and one at a time. Fixes the terse/incomplete regression from removing guidance. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…scriptions Add the pieces pi was missing vs the SDK's built-in task prompt: provide activeForm (present-continuous label the panel shows), a completion gate (don't mark done on a failing build / partial step), and the next-task loop (after completing, take the next in id order). Adapted, not copied. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…-in-progress to 'try to' Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ing the constant Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
| '- Try to keep exactly ONE task `in_progress`. `TaskUpdate` it to `in_progress` right before you start that stage, and to `completed` the instant you finish it — one at a time, never batched at the end. Only mark `completed` when the work is genuinely done; if the build fails, a step is partial, or you hit a blocker, keep it `in_progress` and add a task for the fix.', | ||
| '- After you complete a task, take the next one in order (lowest id first — earlier stages set up later ones), mark it `in_progress`, and continue. Driving the list in order top to bottom is how you finish every stage.', | ||
| '- Each task subject is SHORT — a few words naming only the stage of work: "Analyze project", "Install SDK", "Initialize PostHog", "Instrument events", "Set env vars", "Verify", "Create dashboard". No file or directory names, no framework/router/package names, no specific event names, and no parenthetical "(...)" detail. The detail belongs in the work and the `activeForm`, not the subject.', | ||
| '- Status updates are PLAIN TEXT you write in your reply, NOT a tool call — there is no status tool. Whenever you begin a new action, write a line that starts with the literal marker [STATUS] followed by a short present-tense phrase (e.g. "[STATUS] Reading the router entry", "[STATUS] Adding the login capture"). The harness automatically parses any line beginning with [STATUS] and shows it as the live status. Do this OFTEN — several times per task, on every meaningful shift, not once per phase. It is free.', |
There was a problem hiding this comment.
I stole these out of the Claude Code system prompts. It seems to be important and load bearing in terms of having consisten status and task lists
Pi runner: telemetry parity, rollout flags, and switchboard observability. Companion dashboard: wizard-switchboard.
Changes
switchboard resolvedevent + log line showing which rung won each axis, plus the build mode on startup[status], task-list planning, env-via-MCP, and don't-spiral-on-blockersFlags
wizard-use-pi-harnesswizard-pi-modelgpt-5,gpt-5-4,gpt-5-mini,sonnet-4-6,sonnet-5openai/gpt-5.4. sonnet options run anthropic models through pi for same-harness paritywizard-pi-effortminimal,low,medium,high,xhighField verification
harness_source: flag, correct model, full event chain, YARA report, real remarks)low(medium = same 4/5 at ~3× cost/time); mini atmedium(xhigh scored lower)agent abortedwithduration_ms)🤝 Testing handoff (for the local model driving trials)
NEVER toggle, edit, or create feature flags yourself. When a config change requires a flag flip, STOP and ask a human (Vincent) to do it. You may read flag state freely. The three flags (project 2):
wizard-use-pi-harnesswizard-pi-modelgpt-5/gpt-5-4/gpt-5-mini/sonnet-4-6/sonnet-5openai/gpt-5.4wizard-pi-effortminimal→xhighlow, minimedium)Flags are read only by builds at this PR's tip or later. For CI determinism without flags, dev builds accept
WIZARD_CI_FLAG_OVERRIDES='{"wizard-use-pi-harness": true, "wizard-pi-model": "gpt-5"}'.Branches
posthog-code/pi-telemetry-parity(this PR): telemetry parity, flags, switchboard trace.posthog-code/pi-model-tuning(stacked): md prompt notes (src/lib/agent/runner/switchboard/prompts/), completion guard + zero-edits guard + verification nudge, YARA block-reason evidence, remark echo-parser fix. Trials should run this branch's tip.How to run a trial
wizard: switchboard resolved→harness_sourcemust beflag,modelmust match intent. A run withharness_source: binding+flag_use_pi_harness: Nonehit a transient empty flag fetch and silently ran sonnet — exclude it (known issue, ~1/8 frequency observed).posthog-wizard-artifacts-prod-us, prefix<trigger_id>/<app-path-dashes>/posthog-wizard.log.confidence = min(app_sanity, round(avg of 4 dimension pass-rates))— a build/typecheck failure hard-caps the score. Realistic target is consistent 4/5, not 5/5.Baselines (2026-07-03, evaluator-scored, next-js apps)
Test matrix (priority order)
agent abortedcarriesduration_ms(last unchecked box)Per run, check: evaluator confidence;
$ai_generationcost byrun_id;wizard remarktext (real remark, not an instruction echo);yara scan reportcounts;[pi] continuation nudge/verification nudgelines in the verbose log (guards firing = working, firing often = model regression).Related