feat(pi): telemetry parity with anthropic + pi/orchestrator clamp middleware by gewenyu99 · Pull Request #793 · PostHog/wizard

gewenyu99 · 2026-07-03T02:28:57Z

Pi runner: telemetry parity, rollout flags, and switchboard observability. Companion dashboard: wizard-switchboard.

Changes

Adds the run reporting pi was missing next to the anthropic harness — duration, model, token counts, the end-of-run remark, and YARA scans
Traces every routing decision: a switchboard resolved event + log line showing which rung won each axis, plus the build mode on startup
Adds flags to control which model and reasoning effort pi runs (gpt-5 tiers and sonnet)
Brings pi's agent behavior up to Claude Code's built-ins — live [status], task-list planning, env-via-MCP, and don't-spiral-on-blockers
Stops pi + orchestrator flags from ever combining into a crashing config
Identifies users in CI instead of running anonymous under an API key, and reports every wizard catch block to error tracking

Flags

Flag	Type	Behavior
`wizard-use-pi-harness`	boolean	on → pi harness + the paired model; off/missing/fetch-failure → anthropic
`wizard-pi-model`	multivariate: `gpt-5`, `gpt-5-4`, `gpt-5-mini`, `sonnet-4-6`, `sonnet-5`	selects pi's model; unknown/missing → `openai/gpt-5.4`. sonnet options run anthropic models through pi for same-harness parity
`wizard-pi-effort`	multivariate: `minimal`, `low`, `medium`, `high`, `xhigh`	reasoning-effort override for pi's reasoning models; invalid → capability-matrix default
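
The fallbacks in the table can be sketched roughly as below. This is an illustration only, not the code in this PR: the function names are hypothetical, the `gpt-5-4` to `openai/gpt-5.4` mapping is inferred from the fallback id, and the sonnet variants are left out because their gateway ids are not stated here (so this sketch would treat them as unknown).

```typescript
type Effort = "minimal" | "low" | "medium" | "high" | "xhigh";

const EFFORTS: readonly string[] = ["minimal", "low", "medium", "high", "xhigh"];

// Flag variant -> gateway id. Assumed mapping; sonnet variants deliberately omitted.
const MODEL_BY_VARIANT = new Map<string, string>([
  ["gpt-5", "openai/gpt-5"],
  ["gpt-5-4", "openai/gpt-5.4"],
  ["gpt-5-mini", "openai/gpt-5-mini"],
]);

const FALLBACK_MODEL = "openai/gpt-5.4";

// Unknown or missing variant falls back to the default model.
function resolvePiModel(variant: unknown): string {
  if (typeof variant !== "string") return FALLBACK_MODEL;
  return MODEL_BY_VARIANT.get(variant) ?? FALLBACK_MODEL;
}

// Capability default per the flag table: mini runs at medium, flagships at low.
function defaultEffort(model: string): Effort {
  return model === "openai/gpt-5-mini" ? "medium" : "low";
}

// An invalid or missing effort flag falls back to the capability default.
function resolvePiEffort(flagValue: unknown, model: string): Effort {
  if (typeof flagValue === "string" && EFFORTS.indexOf(flagValue) !== -1) {
    return flagValue as Effort;
  }
  return defaultEffort(model);
}
```

Keeping both resolvers total (every input yields a valid config) is what makes a bad flag value harmless.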

Field verification

Flag-driven pi runs verified end-to-end in CI (telemetry: harness_source: flag, correct model, full event chain, YARA report, real remarks)
Effort trials: flagships at low (medium = same 4/5 at ~3× cost/time); mini at medium (xhigh scored lower)
Outstanding: forced-abort field check (agent aborted with duration_ms)

🤝 Testing handoff (for the local model driving trials)

⚠️ Flags: read-only for you

NEVER toggle, edit, or create feature flags yourself. When a config change requires a flag flip, STOP and ask a human (Vincent) to do it. You may read flag state freely. The three flags (project 2):

Flag	Type	Effect	Link
`wizard-use-pi-harness`	boolean	on → pi harness; off/missing/fetch-failure → anthropic + sonnet	744198
`wizard-pi-model`	multivariate `gpt-5` / `gpt-5-4` / `gpt-5-mini` / `sonnet-4-6` / `sonnet-5`	pi's model; unknown/missing → `openai/gpt-5.4`	744631
`wizard-pi-effort`	multivariate `minimal`→`xhigh`	reasoning-effort override; invalid → capability default (flagships `low`, mini `medium`)	744632

Flags are read only by builds at this PR's tip or later. For CI determinism without flags, dev builds accept WIZARD_CI_FLAG_OVERRIDES='{"wizard-use-pi-harness": true, "wizard-pi-model": "gpt-5"}'.
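
For illustration, a dev build could read that variable along these lines. The parser name and the choice to ignore malformed input (rather than fail the run) are assumptions, not the behavior of `ci-flag-overrides.ts`.

```typescript
type FlagValue = boolean | string;

// Hypothetical parser: returns only boolean/string entries, and an empty
// object for missing, malformed, or non-object JSON.
function parseFlagOverrides(raw: string | undefined): Record<string, FlagValue> {
  if (!raw) return {};
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return {};
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return {};
  }
  const source = parsed as Record<string, unknown>;
  const out: Record<string, FlagValue> = {};
  for (const key of Object.keys(source)) {
    const value = source[key];
    if (typeof value === "boolean" || typeof value === "string") {
      out[key] = value;
    }
  }
  return out;
}
```

In a Node process this would be called as `parseFlagOverrides(process.env.WIZARD_CI_FLAG_OVERRIDES)`.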

Branches

posthog-code/pi-telemetry-parity (this PR): telemetry parity, flags, switchboard trace.
posthog-code/pi-model-tuning (stacked): md prompt notes (src/lib/agent/runner/switchboard/prompts/), completion guard + zero-edits guard + verification nudge, YARA block-reason evidence, remark echo-parser fix. Trials should run this branch's tip.

How to run a trial

gh workflow run wizard-ci.yml --repo PostHog/wizard-workbench \
  -f app=basic-integration/next-js/15-app-router-saas \
  -f wizard_ref=<EXACT SHA — never a branch name, jobs re-resolve refs per job> \
  -f notify_slack=false -f trigger_id=<unique-slug>

Verify EVERY run's actual config from telemetry before trusting it (project 2): wizard: switchboard resolved → harness_source must be flag, model must match intent. A run with harness_source: binding + flag_use_pi_harness: None hit a transient empty flag fetch and silently ran sonnet — exclude it (known issue, ~1/8 frequency observed).
Wizard verbose log per run: S3 posthog-wizard-artifacts-prod-us, prefix <trigger_id>/<app-path-dashes>/posthog-wizard.log.
Evaluator is deterministic: confidence = min(app_sanity, round(avg of 4 dimension pass-rates)) — a build/typecheck failure hard-caps the score. Realistic target is consistent 4/5, not 5/5.
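
A minimal sketch of the confidence formula, assuming `app_sanity` and the four dimension values are all expressed on the same 0 to 5 scale (the scale and the rounding mode are assumptions; the evaluator's actual code is not shown in this PR).

```typescript
// confidence = min(app_sanity, round(avg of the 4 dimension values)).
// Assumption: every input is already on the 0..5 scale used for the score.
function confidence(
  appSanity: number,
  dimensions: [number, number, number, number],
): number {
  const total = dimensions.reduce((sum, d) => sum + d, 0);
  const avg = total / dimensions.length;
  return Math.min(appSanity, Math.round(avg));
}
```

Because of the `min`, strong dimension scores cannot lift a run above its `app_sanity` value, which is how a build or typecheck failure hard-caps the result.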

Baselines (2026-07-03, evaluator-scored, next-js apps)

Config	Score	Agent time	LLM cost/run
sonnet (control)	4/5	6–8 min	~$2.40
gpt-5.4 @low + notes + guard	4/5	6.7 min	~$1.13
gpt-5 @low + notes + guard	4/5 (incl. former 1/5 app)	7.5 min	~$0.62
gpt-5 @medium	4/5	22.6 min	$1.67 (dead lever)
gpt-5.4 @medium	4/5	22.6 min	$3.60 (dead lever)
gpt-5-mini @medium	~3/5	~7 min	~$0.15
gpt-5-mini @xhigh	3/5	15.5 min	$0.42 (dead lever)

Test matrix (priority order)

#	Config (ask human for flag flips)	Apps	Purpose
1	gpt-5 @low, tuning tip	all 4 next-js	harden the recommended config's numbers
2	gpt-5.4 @low, tuning tip	all 4 next-js	fallback variant parity
3	gpt-5-mini @medium, tuning tip	app-router-saas	did notes+snippet-feedback fix the fence thrash? decides mini's fate
4	gpt-5 @low	astro, sveltekit, django	framework generalization beyond next-js
5	forced abort (kill a run mid-agent)	any	verify `agent aborted` carries `duration_ms` (last unchecked box)

Per run, check: evaluator confidence; $ai_generation cost by run_id; wizard remark text (real remark, not an instruction echo); yara scan report counts; [pi] continuation nudge / verification nudge lines in the verbose log (guards firing = working, firing often = model regression).

…dleware

Closes the pi runner's telemetry gaps ahead of the wizard-runner flag cut, so the wizard-switchboard dashboard compares both runners on equal footing:

- pi emits 'agent completed' (duration_ms/duration_seconds/model/num_turns) and 'agent aborted' (durations + model) with the anthropic property names. Cost/token fields deliberately omitted - the gateway's $ai_generation events already track those symmetrically.
- pi collects the end-of-run remark: the shared REMARK_INSTRUCTION (extracted from the anthropic Stop hook into signals.ts) is sent as a follow-up prompt and the [WIZARD-REMARK] reply is parsed via AgentOutputSignals. Best-effort, never fails a successful run.
- 'setup wizard finished' now carries the tag bag flat (harness, sequence, program_id, ...) so completion-rate breakdowns work without JSON extraction; the nested tags snapshot stays for back-compat.
- new switchboard sequence middleware (piLinearClampMw): when the harness axis resolves to pi, flag-driven sequence selection clamps to linear, so wizard-runner=pi + wizard-orchestrator=true can never combine into a crashing cohort (pi has no runTask). CLI --sequence stays above the clamp for dev repro. Covered by new variant-gating tests.

Not needed after all: 'bash denied' capture lives inside wizardCanUseTool, which the pi security gate already calls - pi has reported it all along.

Generated-By: PostHog Code Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad
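
The linear clamp described in this commit message (piLinearClampMw) could look roughly like this. The sketch is hypothetical: the `orchestrator` sequence name, the input shape, and the `linear` binding default are assumptions, and the real middleware may only stamp `pi-clamp` when the flag would otherwise have picked the orchestrator.

```typescript
type Harness = "anthropic" | "pi";
type Sequence = "linear" | "orchestrator";
type Source = "cli" | "flag" | "pi-clamp" | "binding";

interface SequenceInputs {
  harness: Harness;          // the already-resolved harness axis
  cliSequence?: Sequence;    // --sequence override, kept above the clamp for dev repro
  orchestratorFlag: boolean; // flag-driven sequence selection
}

// Precedence: CLI, then the pi clamp, then the flag, then the binding default.
function resolveSequence(i: SequenceInputs): { sequence: Sequence; source: Source } {
  if (i.cliSequence !== undefined) {
    return { sequence: i.cliSequence, source: "cli" };
  }
  if (i.harness === "pi") {
    // pi has no runTask, so flag-driven selection can never leave linear.
    return { sequence: "linear", source: "pi-clamp" };
  }
  if (i.orchestratorFlag) {
    return { sequence: "orchestrator", source: "flag" };
  }
  return { sequence: "linear", source: "binding" };
}
```

Keying the clamp off the resolved harness, rather than off a specific flag, means it keeps working when the flag that selects pi is renamed.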

github-actions · 2026-07-03T02:29:09Z

🧙 Wizard CI

Run the Wizard CI and test your changes against wizard-workbench example apps by replying with a GitHub comment using one of the following commands:

Test all apps:

/wizard-ci all

Test all apps in a directory:

/wizard-ci basic-integration
/wizard-ci error-tracking-upload-source-maps
/wizard-ci mcp-analytics
/wizard-ci misc
/wizard-ci revenue

Test an individual app:

/wizard-ci basic-integration/android
/wizard-ci basic-integration/angular
/wizard-ci basic-integration/astro
/wizard-ci basic-integration/django
/wizard-ci basic-integration/fastapi
/wizard-ci basic-integration/flask
/wizard-ci basic-integration/javascript-node
/wizard-ci basic-integration/javascript-web
/wizard-ci basic-integration/laravel
/wizard-ci basic-integration/next-js
/wizard-ci basic-integration/nuxt
/wizard-ci basic-integration/python
/wizard-ci basic-integration/rails
/wizard-ci basic-integration/react-native
/wizard-ci basic-integration/react-router
/wizard-ci basic-integration/sveltekit
/wizard-ci basic-integration/swift
/wizard-ci basic-integration/tanstack-router
/wizard-ci basic-integration/tanstack-start
/wizard-ci basic-integration/vue
/wizard-ci error-tracking-upload-source-maps/android
/wizard-ci error-tracking-upload-source-maps/cicd-docker-node-raw
/wizard-ci error-tracking-upload-source-maps/cicd-github-actions-docker-node-raw
/wizard-ci error-tracking-upload-source-maps/cicd-github-actions-nested-docker-node-raw
/wizard-ci error-tracking-upload-source-maps/cicd-github-actions-node-raw
/wizard-ci error-tracking-upload-source-maps/cicd-github-actions-single-stage-docker-node-raw
/wizard-ci error-tracking-upload-source-maps/cicd-gitlab-node-raw
/wizard-ci error-tracking-upload-source-maps/cicd-monorepo-pnpm-node-react
/wizard-ci error-tracking-upload-source-maps/cicd-monorepo-raw-node-react
/wizard-ci error-tracking-upload-source-maps/cicd-ssh-vps-node-raw
/wizard-ci error-tracking-upload-source-maps/flutter
/wizard-ci error-tracking-upload-source-maps/ios
/wizard-ci error-tracking-upload-source-maps/next
/wizard-ci error-tracking-upload-source-maps/next-no-posthog
/wizard-ci error-tracking-upload-source-maps/node-raw
/wizard-ci error-tracking-upload-source-maps/node-rollup
/wizard-ci error-tracking-upload-source-maps/node-rollup-typescript-plugin
/wizard-ci error-tracking-upload-source-maps/node-webpack
/wizard-ci error-tracking-upload-source-maps/nuxt-3-6
/wizard-ci error-tracking-upload-source-maps/nuxt-4-3
/wizard-ci error-tracking-upload-source-maps/react-native
/wizard-ci error-tracking-upload-source-maps/react-vite
/wizard-ci error-tracking-upload-source-maps/rust
/wizard-ci mcp-analytics/custom-dispatcher
/wizard-ci mcp-analytics/typescript-sdk
/wizard-ci misc/quack-quack
/wizard-ci revenue/stripe

Results will be posted here when complete.

…-use-pi-harness Simplifies the rollout surface: one boolean flag. On → pi harness paired with openai/gpt-5-mini; off/missing/non-'true' → binding default (anthropic + sonnet). A failed flag fetch can never opt anyone in. The pi/orchestrator linear clamp keys off the resolved harness, so it is unchanged; comments and tests updated to the new flag key. Generated-By: PostHog Code Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad
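
The opt-in rule can be sketched as follows. The function name and return shape are hypothetical, and accepting both the boolean `true` and the string `'true'` is an assumption about how the flag value arrives.

```typescript
type Harness = "anthropic" | "pi";

interface HarnessDecision {
  harness: Harness;
  source: "flag" | "binding";
}

// Only an explicit "on" opts in. Missing, failed-fetch (undefined/null), and
// any other value all land on the binding default.
function resolveHarness(flagValue: unknown): HarnessDecision {
  const optedIn = flagValue === true || flagValue === "true";
  if (optedIn) {
    return { harness: "pi", source: "flag" };
  }
  return { harness: "anthropic", source: "binding" };
}
```

Writing the check as an allow-list of "on" values, instead of a deny-list of "off" values, is what guarantees a failed fetch cannot opt anyone in.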

Gateway model registration takes pricing from pi-ai's catalog (bare model name lookup) instead of zeros, so getSessionStats() can cost the run. 'agent completed' now carries total_cost_usd + the four token fields with anthropic's property names, as a cross-check against $ai_generation. Cost is omitted when pricing is unknown rather than reporting $0. Generated-By: PostHog Code Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad

…en fields pi-ai computes cost client-side from its pricing catalog (calculateCost), so total_cost_usd would be a pricing guess, not a measurement. Token counts are API-reported and stay. Cost stays with $ai_generation. Generated-By: PostHog Code Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad

gewenyu99 · 2026-07-03T13:53:57Z

/wizard-ci basic-integration/astro

wizard-ci-bot · 2026-07-03T13:54:21Z

🧙 Wizard CI Results

Trigger ID: 3f7863c
Workflow: View run

App	Confidence	PR	YARA
`basic-integration/astro/astro-hybrid-marketing`	4/5	#2381 (logs)	✓
`basic-integration/astro/astro-ssr-docs`	4/5	#2387 (logs)	✓
`basic-integration/astro/astro-static-marketing`	4/5	#2384 (logs)	✓
`basic-integration/astro/astro-view-transitions-marketing`	4/5	#2386 (logs)	✓

Configuration

Setting	Value
Wizard ref	`posthog-code/pi-telemetry-parity`
Context Mill ref	`main`
PostHog ref	`master`

Search for trigger ID 3f7863c in wizard-workbench PRs.

Two bugs from the first flag-driven pi run (a558ca12):

- pi scanned every tool call but never recorded it; recordExternalScan in yara-hooks routes pi's scans/violations through the same recordMatch path as the hook-based scans, so 'yara rule matched' and the end-of-run 'yara scan report' now fire for pi.
- REMARK_INSTRUCTION ended with the literal placeholder "Your remark here", which gpt-5-mini echoed verbatim; reworded without a literal.

Generated-By: PostHog Code Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad

…axis, final pick Middlewares stamp a trace on the switchboard ctx as they assert (cli | flag | pi-clamp | binding per axis). The runner emits one 'wizard: switchboard resolved' event with the inputs (both flags, CLI overrides), the per-axis sources, and the final harness/model/sequence, plus a matching [switchboard] decision log line. Per-axis resolver logs now name their source too. Generated-By: PostHog Code Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad
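
As a rough picture of the trace stamping, here is a hypothetical context where each assertion records its source per axis, plus a one-line summary. The context shape, the helper names, and the exact log format are all assumptions; only the source labels (cli, flag, pi-clamp, binding) and the [switchboard] prefix come from this commit message.

```typescript
type Axis = "harness" | "model" | "sequence";
type Source = "cli" | "flag" | "pi-clamp" | "binding";

interface SwitchboardCtx {
  values: Partial<Record<Axis, string>>;
  trace: Partial<Record<Axis, Source>>;
}

// A middleware asserts a value for an axis and stamps who asserted it.
// A later assertion overwrites an earlier one, so the trace names the winner.
function assertAxis(ctx: SwitchboardCtx, axis: Axis, value: string, source: Source): void {
  ctx.values[axis] = value;
  ctx.trace[axis] = source;
}

// Hypothetical log format: one line naming the value and source for each axis.
function decisionLine(ctx: SwitchboardCtx): string {
  const axes: Axis[] = ["harness", "model", "sequence"];
  const parts = axes.map(
    (axis) => `${axis}=${ctx.values[axis] ?? "unset"} (${ctx.trace[axis] ?? "none"})`,
  );
  return `[switchboard] ${parts.join(", ")}`;
}
```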

…allowlist Shared policy (both harnesses) allowed `install`/`add`/`ci` but not the `i` alias, so `npm i posthog-js` cost a denied turn. Exact-token match — adding 'i' to SAFE_SCRIPTS would startsWith-allow anything i-prefixed (npm init stays blocked; covered by test). Generated-By: PostHog Code Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad
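
To see why the alias needs an exact-token match, consider this simplified sketch. The list contents, the package-manager set, and the function name are hypothetical; only the prefix-versus-exact distinction is taken from the commit message.

```typescript
// Subcommands matched by prefix (assumed contents).
const SAFE_SCRIPTS = ["install", "add", "ci"];

// Aliases that must match the whole token. Putting "i" in SAFE_SCRIPTS
// instead would prefix-allow "init", "info", and anything else i-prefixed.
const EXACT_ALIASES = new Set(["i"]);

const PACKAGE_MANAGERS = new Set(["npm", "pnpm", "yarn", "bun"]);

function isAllowedPackageCommand(command: string): boolean {
  const tokens = command.trim().split(/\s+/);
  const manager = tokens[0];
  const sub = tokens[1];
  if (manager === undefined || sub === undefined) return false;
  if (!PACKAGE_MANAGERS.has(manager)) return false;
  if (EXACT_ALIASES.has(sub)) return true;
  return SAFE_SCRIPTS.some((safe) => sub.startsWith(safe));
}
```

With this shape, `npm i posthog-js` passes while `npm init` stays blocked.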

gewenyu99 · 2026-07-03T14:25:02Z

/wizard-ci basic-integration/next-js

wizard-ci-bot · 2026-07-03T14:25:28Z

🧙 Wizard CI Results

Trigger ID: 7c47e0c
Workflow: View run

App	Confidence	PR	YARA
`basic-integration/next-js/15-app-router-saas`	4/5	#2390 (logs)	✓
`basic-integration/next-js/15-app-router-todo`	4/5	#2389 (logs)	✓
`basic-integration/next-js/15-pages-router-saas`	4/5	#2391 (logs)	⚠️
`basic-integration/next-js/15-pages-router-todo`	4/5	#2388 (logs)	✓

Configuration

Setting	Value
Wizard ref	`posthog-code/pi-telemetry-parity`
Context Mill ref	`main`
PostHog ref	`master`

Search for trigger ID 7c47e0c in wizard-workbench PRs.

⚠️ YARA Scanner — basic-integration/next-js/15-pages-router-saas

81 tool calls scanned, 1 violation(s) detected
[REVERTED] posthog_pii_in_capture_call (high) — PostToolUse:Edit

…llback CI auth already identified the key owner when /api/users/@me/ was readable, but a scope failure was swallowed silently and flags fell back to a fresh anonymous UUID — so email-targeted flags never matched and percentage rollouts rerolled every run. Now: the failure is logged with the consequence, a success is logged too, and when the key can't resolve a user, --email sets a flag-targeting person-property override (no identify/alias — flag evaluation only). Generated-By: PostHog Code Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad

…key can't resolve a user Identity must match what the gateway attributes to the key, so no synthetic/override identities: the key owner from /api/users/@me/ is the one identity. When the key lacks user:read, the CI output now warns that flags evaluate anonymously instead of failing silently. Generated-By: PostHog Code Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad

gewenyu99 · 2026-07-03T14:57:51Z

/wizard-ci basic-integration/astro/astro-hybrid-marketing

wizard-ci-bot · 2026-07-03T14:58:15Z

🧙 Wizard CI Results

Trigger ID: c826a31
Workflow: View run

App	Confidence	PR	YARA
`basic-integration/astro/astro-hybrid-marketing`	4/5	#2392 (logs)	✓

Configuration

Setting	Value
Wizard ref	`posthog-code/pi-telemetry-parity`
Context Mill ref	`main`
PostHog ref	`master`

Search for trigger ID c826a31 in wizard-workbench PRs.

gewenyu99 · 2026-07-03T15:21:59Z

/wizard-ci basic-integration/astro/astro-hybrid-marketing

wizard-ci-bot · 2026-07-03T15:22:27Z

🧙 Wizard CI Results

Trigger ID: 4eb4c08
Workflow: View run

App	Confidence	PR	YARA
`basic-integration/astro/astro-hybrid-marketing`	4/5	#2393 (logs)	⚠️

Configuration

Setting	Value
Wizard ref	`posthog-code/pi-telemetry-parity`
Context Mill ref	`main`
PostHog ref	`master`

Search for trigger ID 4eb4c08 in wizard-workbench PRs.

⚠️ YARA Scanner — basic-integration/astro/astro-hybrid-marketing

48 tool calls scanned, 4 violation(s) detected
[BLOCKED] hardcoded_posthog_host (high) — PostToolUse:Write
[BLOCKED] pii_in_capture_call (high) — PostToolUse:Edit
[BLOCKED] hardcoded_posthog_host (high) — PostToolUse:Write
[BLOCKED] pii_in_capture_call (high) — PostToolUse:Edit

…nstruction echoes Second field echo: gpt-5-mini replied "[WIZARD-REMARK] followed by the remark itself." — it copies whatever trails the marker in the format clause. The instruction now leads with the format and ends with the ask, and remark() discards any text that is a verbatim substring of the instruction, so no future wording tweak can silently regress this. Generated-By: PostHog Code Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad
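
The echo guard can be pictured like this. It is a sketch under assumptions: the function name, taking everything after the last marker as the remark, and the sample instruction text used in the usage line are all hypothetical; the real parsing lives in AgentOutputSignals and remark().

```typescript
const REMARK_MARKER = "[WIZARD-REMARK]";

// Returns the remark text, or null when there is no marker, no text, or the
// text is a verbatim substring of the instruction (i.e. an echo).
function extractRemark(reply: string, instruction: string): string | null {
  const at = reply.lastIndexOf(REMARK_MARKER);
  if (at === -1) return null;
  const text = reply.slice(at + REMARK_MARKER.length).trim();
  if (text.length === 0) return null;
  if (instruction.indexOf(text) !== -1) return null;
  return text;
}
```

For example, with a made-up instruction such as `"Reply as: [WIZARD-REMARK] followed by the remark itself. What slowed you down?"`, the reply `"[WIZARD-REMARK] followed by the remark itself."` is discarded, while `"[WIZARD-REMARK] The env fence cost two turns."` is kept. Checking against the instruction text itself, rather than a fixed list of known echoes, is what keeps the guard valid across wording tweaks.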

Adds GPT5_4_MODEL ('openai/gpt-5.4') with a reasoning/low capability entry and points the flag pairing at it. Gateway support unverified — this run is the test. Generated-By: PostHog Code Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad

gewenyu99 · 2026-07-03T16:27:38Z

🧙 Wizard CI Results

Trigger ID: gpt5-trial-2
Workflow: View run

App	Confidence	PR	YARA
`basic-integration/next-js/15-app-router-saas`	4/5	#2404 (logs)	✓
`basic-integration/next-js/15-pages-router-saas`	4/5	#2405 (logs)	✓
`basic-integration/next-js/15-pages-router-todo`	1/5	#2402 (logs)	✓
`basic-integration/next-js/15-app-router-todo`	4/5	#2403 (logs)	✓

Configuration

Setting	Value
Wizard ref	`1eb16c8` (pi flag → `openai/gpt-5`)
Context Mill ref	`main`
PostHog ref	`master`

All four runs telemetry-verified as pi + openai/gpt-5 via wizard-use-pi-harness (harness_source: flag).
Search for trigger ID gpt5-trial-2 in wizard-workbench PRs.

gewenyu99 · 2026-07-03T16:34:08Z

🧙 Wizard CI Results

Trigger ID: gpt54-trial (+ gpt54-trial-fill backfill)
Workflow: View run · backfill

App	Confidence	PR	YARA
`basic-integration/next-js/15-app-router-saas`	4/5	#2406 (logs)	✓
`basic-integration/next-js/15-pages-router-saas`	4/5	#2399 (logs)	✓
`basic-integration/next-js/15-pages-router-todo`	4/5	#2396 (logs)	✓
`basic-integration/next-js/15-app-router-todo`	3/5	#2395 (logs)	✓

Configuration

Setting	Value
Wizard ref	`543fbd7` (pi flag → `openai/gpt-5.4`)
Context Mill ref	`main`
PostHog ref	`master`

All four rows telemetry-verified as pi + openai/gpt-5.4 via wizard-use-pi-harness. (The original 15-app-router-saas job fell back to sonnet on an empty flag fetch and was excluded; its row above is the pinned backfill.)
Search for trigger IDs gpt54-trial / gpt54-trial-fill in wizard-workbench PRs.
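
The exclusion rule applied to these rows (count a run only when telemetry shows `harness_source: flag` and the intended model) can be expressed as a small filter. The property names come from the `wizard: switchboard resolved` event as described in this PR; the interface and function are a hypothetical sketch, and a real check would read the events from PostHog.

```typescript
interface SwitchboardResolved {
  run_id: string;
  harness_source: string;              // "flag" when the flag picked the harness
  model: string;                       // e.g. "openai/gpt-5.4"
  flag_use_pi_harness: boolean | null; // null on an empty flag fetch
}

// A run is trusted only if the flag drove the harness and the model matches intent.
function isTrustedRun(event: SwitchboardResolved, intendedModel: string): boolean {
  return event.harness_source === "flag" && event.model === intendedModel;
}

// Splits runs into those to keep and those to exclude and backfill.
function partitionRuns(events: SwitchboardResolved[], intendedModel: string) {
  const trusted: SwitchboardResolved[] = [];
  const excluded: SwitchboardResolved[] = [];
  for (const event of events) {
    (isTrustedRun(event, intendedModel) ? trusted : excluded).push(event);
  }
  return { trusted, excluded };
}
```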

wizard-pi-model (variants gpt-5 / gpt-5-4 / gpt-5-mini → gateway ids; unknown/missing → gpt-5.4) selects pi's model; wizard-pi-effort (minimal/low/medium/high/xhigh) overrides the capability matrix for reasoning models. Both ride the switchboard decision event. Generated-By: PostHog Code Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad

gewenyu99 · 2026-07-03T18:19:20Z

This file mirrors YARA behavior on the anthropic side.

analytics.captureException(err, { step: '<context>' }) is now the first statement of every try/catch in src/ — 194 sites across 87 files, with snake_case step slugs derived from the enclosing function. Excluded: catches that rethrow (outer handler captures), catches already capturing (directly or via reportFsError / wizardAbort({error})), and analytics' own runtime deps (debug.ts, ci-flag-overrides.ts — require cycle). Promise .catch() handlers are unchanged (not try/catch syntax). Generated-By: PostHog Code Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad

This reverts commit 7e6254a. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…itles Runs were emitting ~1 status line total and verbose task titles. Add a [STATUS] note (emit often — cheap, the live view between task changes) and tighten the task-list note to a short, high-level map, no file/sub-step detail. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Merging brevity into the up-front-creation sentence buried it; the model created the full list but wrote long, specific titles. Pull it out as a standalone rule banning file/framework/event names and parentheticals. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The model created tasks then worked out of order (task-4 completed while task-1..3 pending), so the panel read scrambled. Guide creating the list in execution order and driving it strictly top-to-bottom, and add a TaskReorder tool so it can realign the list when it does deviate. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…task tools Ordering is enforced by working the list top-to-bottom instead. Also make the [STATUS] guidance explicit: it is plain text the harness parses, not a tool. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…skList to pick next Make the control loop explicit: mark a task in_progress BEFORE starting it, and return to TaskList after each completion to choose the next step. Frequent reads of the list are the intended pattern, not waste. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The model was emitting [STATUS] fine (27/run), but the pi message_end handler only collected lines for the remark and never called spinner.message()/pushStatus() the way the anthropic path does. So every status line was silently dropped. Wire it up, last marker per turn wins. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
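
A minimal sketch of "last marker per turn wins". The name lastStatusLine appears in this PR's test titles, but the body below is a guess at the behavior, not the actual implementation; the caller would pass the result to spinner.message() or pushStatus().

```typescript
const STATUS_MARKER = "[STATUS]";

// Scans a turn's text line by line and returns the text after the last
// [STATUS] marker, with the marker stripped, or null when no line matches.
function lastStatusLine(text: string): string | null {
  let last: string | null = null;
  for (const line of text.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed.indexOf(STATUS_MARKER) === 0) {
      const body = trimmed.slice(STATUS_MARKER.length).trim();
      if (body.length > 0) last = body;
    }
  }
  return last;
}
```

Matching only at the start of a line keeps prose that merely mentions the marker from being shown as a status.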

…mandments own it The shared commandments already specify task creation/ordering/titles, and anthropic relies on them alone. PI_RUNTIME_NOTES carried a second, conflicting task spec (one note even said 'plan before you touch anything', contradicting the commandment's 'as soon as you understand the work' — so pi planned before reading the skill and dropped the dashboard/report steps). Remove the four duplicating task notes; keep only the pi-specific [STATUS] mechanism note. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

pi never receives the SDK's runtime-bound task-management system prompt that Claude Code has (confirmed: __SYSTEM_PROMPT_DYNAMIC_BOUND placeholder in the bundle), so it needs the substance explicitly. Restore two focused notes: plan AFTER understanding the work (post-skill) covering through dashboard+report and keep the list current; one in_progress, mark completions immediately and one at a time. Fixes the terse/incomplete regression from removing guidance. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…scriptions Add the pieces pi was missing vs the SDK's built-in task prompt: provide activeForm (present-continuous label the panel shows), a completion gate (don't mark done on a failing build / partial step), and the next-task loop (after completing, take the next in id order). Adapted, not copied. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gewenyu99 · 2026-07-03T21:45:37Z

+  '- Try to keep exactly ONE task `in_progress`. `TaskUpdate` it to `in_progress` right before you start that stage, and to `completed` the instant you finish it — one at a time, never batched at the end. Only mark `completed` when the work is genuinely done; if the build fails, a step is partial, or you hit a blocker, keep it `in_progress` and add a task for the fix.',
+  '- After you complete a task, take the next one in order (lowest id first — earlier stages set up later ones), mark it `in_progress`, and continue. Driving the list in order top to bottom is how you finish every stage.',
+  '- Each task subject is SHORT — a few words naming only the stage of work: "Analyze project", "Install SDK", "Initialize PostHog", "Instrument events", "Set env vars", "Verify", "Create dashboard". No file or directory names, no framework/router/package names, no specific event names, and no parenthetical "(...)" detail. The detail belongs in the work and the `activeForm`, not the subject.',
+  '- Status updates are PLAIN TEXT you write in your reply, NOT a tool call — there is no status tool. Whenever you begin a new action, write a line that starts with the literal marker [STATUS] followed by a short present-tense phrase (e.g. "[STATUS] Reading the router entry", "[STATUS] Adding the login capture"). The harness automatically parses any line beginning with [STATUS] and shows it as the live status. Do this OFTEN — several times per task, on every meaningful shift, not once per phase. It is free.',


I stole these out of the Claude Code system prompts. They seem to be important and load-bearing for keeping status lines and task lists consistent.

gewenyu99 added 4 commits July 3, 2026 09:08

style: tighten comments across the pi telemetry changes

157477b

Generated-By: PostHog Code Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad

gewenyu99 added 5 commits July 3, 2026 09:58

test: cover pi scan reporting + remark instruction placeholder guard

dea7249

Generated-By: PostHog Code Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad

feat: log build mode (prod/dev/ci/headless) on the agent-runner START…

0d98257

… line Generated-By: PostHog Code Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad

gewenyu99 added 2 commits July 3, 2026 10:47

gewenyu99 added 3 commits July 3, 2026 11:42

feat: pair the pi flag with gpt-5 (sonnet-class trial)

1eb16c8

Generated-By: PostHog Code Task-Id: 4da0163a-9a94-4518-b072-86f99479a0ad

gewenyu99 and others added 2 commits July 3, 2026 14:44

Revert "feat: every catch block reports to PostHog error tracking"

102ca67

This reverts commit 7e6254a. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gewenyu99 and others added 12 commits July 3, 2026 15:53

feat(pi): wizard-pi-model sonnet options — sonnet-4-6 + sonnet-5

c8e1481

Same-harness parity: run sonnet through pi to compare against the openai tiers without switching harness. Mirrors the gpt variant mapping. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

feat(pi): guide env-file access through the wizard-tools MCP

7fb5ecd

Every run wasted a turn on a blocked direct read of a .env file. Point pi at check_env_keys / set_env_values up front instead of the fence. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

feat(pi): create the whole task list up front before starting work

6682908

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

feat(pi): preface runtime notes as binding harness commandments

ee71f73

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

chore(pi): drop dangling reorder reference in task-list note

f6ad3a2

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

test(pi): pin lastStatusLine — marker strip, mid-block, last-wins, no…

7f40f1e

…-match Guards the [STATUS]->spinner wiring so it can't silently regress to dropping status lines again. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gewenyu99 marked this pull request as ready for review July 3, 2026 21:02

gewenyu99 and others added 5 commits July 3, 2026 17:15

feat(pi): finalize task guidance — add title-brevity note, soften one…

0f301f7

…-in-progress to 'try to' Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

chore(pi): write [STATUS] literally in the note instead of interpolat…

4c5e086

…ing the constant Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gewenyu99 requested a review from a team July 3, 2026 21:37

feat(pi): don't spiral on blockers — note them in the report and move on

8e67673

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
