diff --git a/website/blog/2026-04-18-observability-for-custom-agents.md b/website/blog/2026-04-18-observability-for-custom-agents.md new file mode 100644 index 0000000..f028d55 --- /dev/null +++ b/website/blog/2026-04-18-observability-for-custom-agents.md @@ -0,0 +1,313 @@ +--- +slug: "/2026-04-18-observability-for-custom-agents" +canonical_url: "https://dfberry.github.io/blog/2026-04-18-observability-for-custom-agents" +custom_edit_url: null +sidebar_label: "2026.04.18 Observability for Agents" +title: "Knowing What Your AI Team Did and Why: Observability for Custom Copilot CLI Agents" +description: "When you use a human-led AI agent team on Copilot CLI, you need to understand their reasoning — whether you're in a live session or reviewing a PR the next morning." +draft: true +tags: + - "GitHub Copilot" + - "AI Agents" + - "Observability" + - "Developer Workflow" + - "Squad" + - "Copilot CLI" + - "AI-assisted" +keywords: + - "copilot cli observability" + - "ai agent reasoning" + - "custom agent team" + - "github copilot agents" + - "agent decision tracking" + - "squad observability" +updated: "2026-04-18 00:00 PST" +--- + +# Knowing What Your AI Team Did and Why + + + +I've been using [Squad](https://github.com/bradygaster/squad), a human-led AI agent team framework built on [Copilot CLI](https://docs.github.com/en/copilot/github-copilot-in-the-cli), for a few months now. I set up ten agents — each with a charter, a history file, and specific skills. Some days I'm sitting at the terminal directing them. Other days I delegate work through issues and review the PRs later. + +Both ways work. But I keep running into the same question: **when I review what the team did, can I understand _why_ they did it?** + +This post is my investigation into that question — not a conclusion. I'm exploring what observability looks like for custom Copilot CLI agent teams, what the platform gives you today, and what you might need to build yourself. It's a snapshot in time. 
Things are moving fast and I may have gotten details wrong. + +## The question that started this + +I had delegated a task through an issue — "update the content pipeline for the new API version." The next morning I had a clean PR with the right changes. But one function had been restructured in a way I didn't expect. I wanted to know: why this approach? What did the agent consider? What constraints drove the decision? + +The code was correct. But I couldn't trace the reasoning. + +And here's the thing: if I'd been in a live session, I could have just asked. The reasoning would have been right there in the conversation. But because I'd delegated the work, the reasoning was... somewhere. Not lost — but not connected to the PR I was reviewing. + +That's the gap I wanted to understand. And the more I dug into it, the more I think it's not just a tooling problem — it's a strategic one. + +## Why this matters beyond my workflow + +If you're a developer using AI agents for real work — or a team lead deciding whether to adopt them — this question isn't academic. It's the difference between: + +- **"I can use AI agents and stay accountable"** vs. **"I shipped code I can't fully explain"** +- **"I can scale my team's output with agents"** vs. **"I scaled output but lost the ability to course-correct"** +- **"I can onboard someone new and they can follow the reasoning trail"** vs. **"Only I know why things are the way they are, and even I'm not sure"** + +As AI agents move from experimental tools to production workflows, the ability to trace reasoning becomes a governance question. Not in the heavy compliance sense — in the practical sense of: can you maintain confidence in a system that makes decisions on your behalf? + +I think the answer is probably yes, but only if you design for it. 
+ +## Two ways to work with your agent team + + + +When you set up a custom agent team on Copilot CLI, there are two natural patterns: + +**Live sessions** — you're at the terminal, talking to the team. You see every decision as it happens. You can ask "why did you do that?" and get an answer immediately. You're steering. + +**Delegated work** — you set direction through an issue or a prompt, the team executes, and you review the output later. You're governing — setting goals, reviewing results, course-correcting. + +Both are human-directed. In live sessions you're hands-on. In delegated work you're setting direction and reviewing results. People stay accountable for priorities, approvals, and final changes — the agents handle coordination and execution. + +```mermaid +flowchart LR + H[Human] -->|live session| A[Agent Team] + H -->|issue/prompt| D[Delegation] + D --> A + A -->|PR, commit, decision| O[Output] + O -->|review| H +``` + +The observability question is the same in both cases: **can you reconstruct why decisions were made and code changed?** But the answer is very different depending on which pattern you used. + +## What the platform gives you today + + + +Copilot CLI has been adding observability features. Here's what I can see as a user, as of mid-April 2026: + +### Session persistence + +Every Copilot CLI session — live or delegated — gets recorded in `~/.copilot/session-state/`. The full transcript: prompts, responses, tool calls, file changes, checkpoints. You can browse sessions with `/session` and resume any past session with `/resume`. + +### OpenTelemetry (shipped in v1.0.4) + +As of [Copilot CLI v1.0.4](https://github.com/github/copilot-cli/issues/2471), you can enable OTel instrumentation: + +```bash +COPILOT_OTEL_ENABLED=true +# or +OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 +``` + +This gives you traces for agent sessions, LLM calls, and tool executions. Token usage metrics. Operation durations. 
OTLP HTTP export with enterprise auth headers. Run `copilot help monitoring` for the full reference. + +### Session store (queryable) + +Session data is also available in a structured SQLite database. If you're building tooling, you can query sessions, turns, checkpoints, file changes, and references (commits, PRs, issues) programmatically. + +### What's still missing + +Two open issues on [github/copilot-cli](https://github.com/github/copilot-cli) highlight gaps: + +- **[#2396](https://github.com/github/copilot-cli/issues/2396) — Session attribution.** Sessions don't currently record _how_ they were launched — interactive vs SDK vs headless — or which custom tool created them. If you run three different agent tools, their sessions all look identical. The issue proposes persisting `client_type` and `clientName` in session-state files. + +- **[#1791](https://github.com/github/copilot-cli/issues/1791) — Session history.** There's no cross-session audit view without starting the agent. The issue proposes a `copilot --history` flag for querying session history directly from the shell — no tokens spent, no agent launched. + +These are platform-level improvements that, based on the issue discussions, should help with the "what happened" and "which session did it" questions. But even when they ship, there's a layer they won't fully cover. + +## The layer the platform can't fully solve + + + +Here's where my thinking is landing so far: platforms are getting better at capturing **what happened**, **which session did it**, and even **generic rationale** (plans, tool-call sequences, diff summaries). But they can't tell you which rationale is **project-relevant**. + +Consider these questions: + +- "Why did the agent pick Redis over a file cache?" → That depends on your team's standing decision to prefer managed services. +- "Why was the function restructured?" → That depends on your charter's rule about separating API contracts from implementation. 
+- "Why was the edge case skipped in the test?" → That depends on your skill's instruction to focus on the happy path first and file follow-up issues for edge cases. + +The platform can tell you the agent called 12 tools and used 50K tokens. It can even summarize the session. But it doesn't know about your team's decisions, your project's constraints, or your agents' specific mandates. + +Squad's design philosophy — recently [reframed explicitly](https://github.com/bradygaster/squad/pull/989) around human-led productivity — is that people stay accountable for priorities, approvals, and final changes while agents handle coordination and repetition. The work stays inspectable because it lives in your repo as files: charters, decisions, history, orchestration logs. + +## What you can build into a custom agent team + +If you're using a framework like Squad (or any custom agent setup with scoped personas), you have configuration surfaces that I think serve double duty: they shape agent behavior AND create a reasoning trail. I've been experimenting with this in my own setup, but the same patterns apply with system prompts, ADRs, policy files, or whatever mechanism your framework uses to scope agent behavior. + +### Charters — scope + accountability + +Each agent has a charter that defines what they own, how they work, and their boundaries. In my setup, charters look like this: + +```markdown +# Gonzo — Infrastructure Charter + +## Responsibilities +- GitHub Projects Setup +- Labels & Configuration +- GitHub Actions workflows + +## Scope Boundaries +- Does: Infrastructure, automation, GitHub platform configuration +- Doesn't: Design templates (→ Piggy), strategic decisions (→ Kermit) +``` + +When Gonzo opens a PR that changes a GitHub Action, the charter explains why Gonzo did it (it's in scope) and why Piggy didn't (it's not in Piggy's scope). The charter is both an instruction and an explanation. 
+ +**What to add for observability:** An explicit section on how the agent should narrate their work: + +```markdown +## When producing output +- Every PR description includes: parent issue link, reasoning summary, + what was considered but rejected +- Every commit message references the source issue +- Architectural choices reference the relevant standing decision +``` + +This is just charter text. No code change. The agent reads it and follows it. + +### Decisions — the shared brain + +`decisions.md` is the team's institutional memory. Every agent reads it at session start. Standing decisions shape behavior across all sessions: + +```markdown +## Prefer managed services over self-hosted +**What:** When choosing infrastructure, prefer managed/cloud services. +**Why:** Reduces operational burden. Team doesn't have on-call rotation. +**When:** 2026-03-15 +``` + +When an agent chooses Azure Cache for Redis over a local Redis container, you can trace it back to this decision. The decision is the **why**. It persists across sessions, across agents, across modes. + +**What to add for observability:** A standing decision that requires reasoning in outputs: + +```markdown +## All delegated work must include reasoning +**What:** Every PR opened from delegated work must include a "Reasoning" +section explaining key decisions and what alternatives were considered. +**Why:** The person reviewing wasn't in the session. They need context. +**When:** 2026-04-18 +``` + +### Agent history — per-persona memory + +Each agent accumulates a `history.md` — learnings from past sessions that shape future behavior. When Gonzo learns that a specific GitHub Action syntax causes failures in this repo, that goes in Gonzo's history. Next time Gonzo works on Actions, they know. + +History files serve observability because they're the answer to "has this agent dealt with this before, and what did they learn?" 
+ +### Skills — repeatable tasks with built-in standards + +Skills encode how to do specific tasks — including quality gates and output standards. A well-written skill includes what the output should look like: + +```markdown +## PR Description Format +- Reference the source issue or dispatch parent +- Include a checklist of validation criteria +- This provides traceability from the content PR back to + the engineering change +``` + +Skills are instructions AND observability policy in one file. They tell the agent what to produce and tell the reviewer what to expect. + +### Orchestration logs — the narrative bridge + +If your agent team produces orchestration logs (structured summaries of what happened during a work session), those become the bridge between raw session data and human understanding: + +```markdown +## Orchestration Log — 2026-04-18T14:30:00 + +**Agent:** Gonzo (Infrastructure) +**Task:** Update CI pipeline for new API version +**Key decisions:** +- Chose matrix strategy over sequential jobs (faster, same coverage) +- Skipped Windows runner (no Windows-specific code in this change) +**Artifacts:** PR #847, 3 commits +**Standing decisions referenced:** "Prefer managed services", "CI must pass before review" +``` + +## The feedback loop + + + +The real power isn't any single artifact — it's the loop between them. + +```mermaid +flowchart TD + L[Live Session] -->|human corrects agent| D[decisions.md updated] + D -->|agent reads at session start| DW[Delegated Work] + DW -->|produces PR with reasoning| R[Human Reviews] + R -->|spots issue| L + R -->|approves, no change needed| Done[✓ Done] + H[history.md] -->|agent remembers past learnings| DW + DW -->|new learning captured| H +``` + +1. In a live session, you notice an agent making a choice you disagree with. You correct them. That correction becomes a decision in `decisions.md`. +2. Next time that agent (or any agent) runs — live or delegated — they read `decisions.md` and behave differently. +3. 
The delegated work produces a PR with a reasoning section. You review it. If the reasoning references the decision you wrote, the loop is closed. +4. If the reasoning doesn't make sense, you're back in a live session asking questions. The loop continues. + +**The human is always directing the work.** Charters, decisions, history, and skills are the mechanisms. The question is whether those mechanisms produce enough signal for you to know _when_ to course-correct — so you spend less time re-investigating and more time deciding. + +## Where my investigation is landing + +After digging into this, here's my current understanding — still forming, not final: + +**Platforms are getting better at capturing what happened, which session did it, and even generic rationale.** OTel traces, session persistence, and (eventually) session attribution will make it easier to find the right session and see what tools were called. + +**But I think humans still need to define which rationale is project-relevant.** The platform doesn't know about your team's standing decisions, your agents' scope boundaries, or your project's constraints. That context — the project-specific "why" — seems to live in the agent team's configuration files. + +**I'm starting to see this less as a gap and more as a design discipline.** You're building a provenance layer — charters, decisions, skills, orchestration logs — that complements platform telemetry. The platform tells you what happened. Your configuration tells agents why things should happen a certain way — and creates the trail to verify that they did. + +**And it seems to matter more when you're reviewing delegated work.** In a live session, missing rationale is recoverable — you can ask. In delegated work, missing rationale means you're reading code without context. Both modes need reasoning capture, but delegated work makes missing rationale much costlier. 
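
One way to make the "reasoning in outputs" policy checkable during review is a small script that scans a PR description for the pieces the standing decision asks for. The following is a minimal sketch, not anything Squad or Copilot CLI provides: the function name `missing_reasoning_parts`, the three checks, and the keyword heuristics are all assumptions made for illustration, and a real team would tune them to its own PR template.

```python
import re

# Hypothetical policy checks mirroring the standing decision described in
# this post. None of these patterns come from Squad or Copilot CLI.
REASONING_HEADING = re.compile(r"^#{1,6}\s*Reasoning\b", re.IGNORECASE | re.MULTILINE)
ISSUE_REFERENCE = re.compile(r"(?<![\w/])#\d+\b")
ALTERNATIVES = re.compile(r"\b(alternative|considered|rejected)", re.IGNORECASE)


def missing_reasoning_parts(pr_body: str) -> list[str]:
    """Return the policy items that a PR description does not satisfy."""
    missing = []
    # A Markdown heading such as "## Reasoning" must be present.
    if not REASONING_HEADING.search(pr_body):
        missing.append("reasoning section")
    # Some "#123"-style reference must link the PR back to its source issue.
    if not ISSUE_REFERENCE.search(pr_body):
        missing.append("source issue reference")
    # Crude keyword check for a mention of options that were weighed.
    if not ALTERNATIVES.search(pr_body):
        missing.append("alternatives considered")
    return missing


if __name__ == "__main__":
    example = """Closes #123

## Reasoning
Chose a matrix strategy; considered sequential jobs but rejected them as slower.
"""
    print(missing_reasoning_parts(example))
```

An empty list means the description meets the policy as sketched here. The keyword check is deliberately loose, so it confirms that a reasoning trail exists, not that the reasoning is any good; that judgment stays with the reviewer.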
+ +## The bigger picture: why this matters strategically + +I keep coming back to this: the value of AI agents isn't just that they produce code. It's that they can produce code _you trust enough to ship_. + +Trust requires the ability to course-correct. Course-correction requires understanding what happened and why. That's observability. + +**For developers**, this is about code quality and review confidence. If you can't trace why an agent restructured a function, you're approving code on faith. That might work for trivial changes. It doesn't scale to anything complex. + +**For team leads and AI owners**, the stakes are higher: + +- **Review scalability.** Without reasoning trails, every delegated PR becomes expensive senior-review work. If delegation saves coding time but increases review time, you haven't actually scaled — you've moved the labor. +- **Incident response.** When an agent causes a regression, observability is how you reconstruct what happened. Without it, your postmortem is: "the AI did something and we're not sure why." +- **Drift detection.** Agent teams don't just fail once — they drift. Models update, prompts evolve, context shifts. Observability is how you notice when behavior changes incrementally, before it becomes a production issue. +- **Safe scope expansion.** You can only delegate more to agents you can verify. Observability is the prerequisite for saying "I trust this agent enough to handle this without me watching." + +As more developers and teams adopt AI agents for real production work, I think this becomes a dividing line: + +**Teams that design for observability** can scale their use of agents and catch reasoning drift early. They can onboard new team members who can follow the trail. They can maintain accountability even as agent capabilities grow. + +**Teams that don't** risk hitting a ceiling. The agents produce output, but reviewing that output becomes a black-box exercise. 
You end up approving PRs you don't fully understand because the alternative is re-doing the work yourself. + +I don't think this is unique to Squad or even to Copilot CLI. Any system where AI agents make decisions on a developer's behalf — whether it's a custom agent team, a CI pipeline with AI steps, or a coding assistant with increasing scope — faces this same question. The mechanisms might differ (charters vs. system prompts vs. policy files), but the discipline is the same: **design your agents and workflows so reasoning is inspectable where review actually happens.** + +The platform appears to be moving in this direction. OTel is already there. Session attribution and history seem to be coming. But the project-specific reasoning layer — what decisions matter, what constraints apply, what "good" looks like for _this_ project — that's yours to build. + +## What I'd do on Monday + + + +If you're setting up a custom agent team and want observability from day one, here's where I'd start: + +1. **Add observability expectations to every agent's scope definition.** Tell agents to explain their reasoning in PR descriptions, reference relevant decisions, and note what alternatives they considered. This is free — it's just configuration text. + +2. **Write a standing decision requiring reasoning in outputs.** Make it team policy, not per-agent hope. "Every PR from delegated work must include a Reasoning section." + +3. **Link outputs to inputs.** Every PR should reference its source issue. Every commit should trace to a task. The goal is: from any output, you can walk backward to the intent. + +4. **Use orchestration logs.** Even a simple markdown summary after each work session creates a queryable trail for later review. + +5. **Enable OTel now.** `COPILOT_OTEL_ENABLED=true` gets you traces immediately. 
Watch for session attribution ([#2396](https://github.com/github/copilot-cli/issues/2396)) and session history ([#1791](https://github.com/github/copilot-cli/issues/1791)) as they ship. + +6. **Build the feedback loop.** When you spot a reasoning gap during review, don't just fix the code — update the decision file or charter so the next session benefits. That's how the system gets smarter. + +The agents are doing the work. The platform is recording the telemetry. But the human defines what "good reasoning" looks like for this project — and that's the part no platform can ship for you. I'm still figuring out the best patterns, but I'm increasingly convinced this is one of the first disciplines to get right. + +--- + +_This is a snapshot of my investigation as of April 2026. Copilot CLI and Squad are both evolving fast. The specific features and issue numbers referenced here may have changed by the time you read this._ + +_Squad is an open-source project. This post also draws on public write-ups from practitioners exploring observability, security, and team coordination for AI agents._ diff --git a/website/blog/2026-04-29-local-slms-for-squad.md b/website/blog/2026-04-29-local-slms-for-squad.md new file mode 100644 index 0000000..2e8b7db --- /dev/null +++ b/website/blog/2026-04-29-local-slms-for-squad.md @@ -0,0 +1,234 @@ +--- +slug: "/2026-04-29-local-slms-for-squad" +canonical_url: "https://dfberry.github.io/blog/2026-04-29-local-slms-for-squad" +custom_edit_url: null +sidebar_label: "2026.04.29 Local SLMs for Squad" +title: "40 Agents, $0 Inference: Running Squad on Local Small Language Models" +description: "When you spawn 40+ agents daily on cloud models, tokens add up, so I investigated whether my AI team could run on local SLMs instead." 
+draft: true +tags: + - "GitHub Copilot" + - "AI Agents" + - "Small Language Models" + - "Cost Optimization" + - "Squad" + - "Copilot CLI" + - "Local Inference" + - "AI-assisted" +keywords: + - "small language models" + - "squad cli" + - "local inference" + - "ollama" + - "copilot cli custom models" + - "agent cost optimization" + - "qwen phi llama" +updated: "2026-04-29 00:00 PST" +--- + +# 40 Agents, $0 Inference + + + +I've been running [Squad](https://github.com/bradygaster/squad)—my human-led AI agent team—for a few months now. It's been exactly what I needed: I have 40+ agents, each with a charter and a skill set. Some days I'm at the terminal directing them live. Other days I delegate work through GitHub issues and review the PRs the next morning. + +But I noticed something. Every agent spawn burns tokens. All cloud. Every spawn. Scribe writing a commit message? Cloud tokens. Ralph triaging a bug report? Cloud tokens. My PM agents routing tasks? Cloud tokens. Even the mechanical, pattern-based work—the kind that feels like it doesn't need a 30-billion-parameter model—it's all going up to the cloud. + +Over a month, that's usually $15-40 in inference costs. Not huge. But it felt wasteful. ~60% of my spawns are doing work that feels like it should be cheap: status checks, log formatting, triage routing, documentation drafting. These are mechanical tasks. They don't need premium reasoning. + +So I asked: **what if my AI team could run on my laptop instead?** + +This post is my investigation into that question. Not a tutorial. Not a roadmap. It's an exploration: what I found, what works today, what's blocked, and what would need to change. It's messy. There are dead ends. But there are also paths forward. + +## The opportunity + +Small Language Models (SLMs)—models with 3-7 billion parameters—have gotten really good recently. Llama 3.2, Qwen 2.5 Coder, Phi-4 Mini, DeepSeek Coder V2. 
You can download them with Ollama and run them locally on a laptop with 8-16GB of RAM. No cloud account. No API key. No per-token meter running. + +The inference cost is literally zero dollars after you download the model once. + +Here's what they can do: +- **Llama 3.2 (3B)** — General tasks, summarization, simple classification. Quick on a modern laptop (~1.5s). +- **Qwen 2.5 Coder (7B)** — Code generation for simple tasks, boilerplate, templates. Better code understanding than you'd expect at that size. +- **Phi-4 Mini (3.8B)** — Structured writing, documentation, markdown generation. Surprisingly coherent for prose. + +All of them work offline. All of them keep your code and docs on your machine. All of them run in seconds on consumer hardware. + +The premise is simple: if I could route the mechanical 60% of my Squad work to these local models, I'd drop inference costs by half or more. And I'd get faster results on simple tasks (no 5-15 second cloud round-trip). Plus privacy—sensitive code or docs never leave the laptop. + +## The gap: how Squad picks models today + +Squad has a clean 4-layer model selection hierarchy. When an agent spawns, it resolves a model like this: + +1. **Layer 0:** Override from `.squad/config.json` (persistent setting for this agent) +2. **Layer 1:** Override from the current session ("use Sonnet for everything today") +3. **Layer 2:** The agent's charter says it prefers a specific model +4. **Layer 3:** Task-aware auto-selection (code work → use Sonnet, writing → use Haiku, etc.) +5. **Default:** `claude-haiku-4.5` if nothing else matched + +That hierarchy is elegant. But there's one problem: every model ID that comes out of this hierarchy is a platform catalog ID. `claude-sonnet-4.5`. `gpt-4o`. `gemini-2.0`. The Copilot CLI `task` tool accepts these IDs and resolves them server-side to cloud endpoints. + +There's no way to say: "Actually, run this on `http://localhost:11434`." + +There's no custom endpoint parameter. 
No provider config. No "BYOM" (Bring Your Own Model) support in the CLI's `task` tool. (VS Code Copilot Chat has it—custom providers pointing at localhost. But the CLI `task` tool doesn't.) + +So even if you have Ollama running locally on port 11434, there's no way to wire it into Squad. You're stuck using cloud models. + +That's the gap. + +## What I tried: four paths forward + +I spent a week investigating: Is there a way to make this work today? Or do we have to wait for platform changes? + +### Path 1: The MCP Server Approach + +MCP servers are already part of the Copilot CLI ecosystem. They expose tools that agents can call. What if I built an MCP server that wraps Ollama and exposes it as a set of tools—`slm_summarize`, `slm_classify`, `slm_generate_template`? + +**How it would work:** +1. Create an MCP server that talks to Ollama on `localhost:11434` +2. Register it in `.copilot/mcp-config.json` +3. When an agent spawns, they get access to these MCP tools alongside their normal filesystem and git tools +4. Agent is running on a cloud model, but it can call the MCP tool for cheap subtasks + +**Pros:** Works today. No platform changes. Agents gradually learn to use the local tool for things that don't need premium inference. + +**Cons:** Agent is still on cloud (paying for the orchestration). MCP tools are synchronous text-in, text-out. Double inference pattern (cloud model decides to call local model). Doesn't save as much as true local routing would. + +**Verdict:** This works. I could build it. It's Medium effort. But it's a workaround, not a solution. + +### Path 2: The Skill + Shell Script Approach + +Squad supports skills—markdown files that teach agents about capabilities. What if I created a skill that says: "For simple text generation tasks, you can call this shell script that queries Ollama directly"? + +```powershell +$prompt = "Summarize this: ..." 
+$body = @{model="llama3.2:3b"; prompt=$prompt; stream=$false} | ConvertTo-Json +Invoke-RestMethod -Uri "http://localhost:11434/api/generate" -Method Post ` + -Body $body -ContentType "application/json" | Select-Object -ExpandProperty response +``` + +**Pros:** Works today. Low effort. No infrastructure. + +**Cons:** Hacky. Agents shelling out to curl/PowerShell for inference is not a clean pattern. Still paying for the cloud agent's context window. No streaming. Agents need to be "smart enough" to delegate correctly. + +**Verdict:** This is pragmatic for now. But it feels like a Band-Aid. + +### Path 3: Squad CLI Learns Local Models + +This is the medium-term path. Squad's coordinator could be extended to understand local model providers. If `.squad/config.json` says `ollama: http://localhost:11434`, then when an agent resolves to `ollama:llama3.2:3b`, Squad could intercept the `task` spawn and run the agent prompt directly against the local endpoint using the OpenAI-compatible API. + +```json +{ + "providers": { + "ollama": { + "baseUrl": "http://localhost:11434/v1", + "models": ["llama3.2:3b", "qwen2.5-coder:7b"] + } + }, + "agentModelOverrides": { + "scribe": "ollama:llama3.2:3b", + "ralph": "ollama:llama3.2:3b" + } +} +``` + +**Pros:** Clean. Solves the problem fully. 60% of spawns go local. Cost cut in half. + +**Cons:** Requires Squad CLI changes. Requires either Squad to implement its own agent runtime (prompt + tool loop) OR the platform's `task` tool to accept custom endpoint URLs. The second option is higher-priority but out of Squad's control. Issue #987 on the Squad repo is related and getting attention—this would be a natural follow-up. + +**Verdict:** This is the right architectural solution. But it's blocked on platform support. + +### Path 4: LiteLLM Proxy + +LiteLLM is a middleware that presents as an OpenAI-compatible endpoint and translates between different backends. 
You could run `litellm --model ollama/qwen2.5-coder --port 4000` locally, and it would be ready to accept API calls the moment the Copilot CLI platform supports custom endpoints. + +**Pros:** Future-proof. When the platform adds endpoint support, you just plug in the URL. + +**Cons:** It's entirely blocked on platform support. Today, it's just overhead. You're adding a translation layer for no benefit yet. + +**Verdict:** Good to set up as preparation. But not a path forward today. + +## What actually works: a phased approach + +After the investigation, here's what I think makes sense: + +### Right now (Phase 1) + +**Do both of these:** + +1. **Set up Ollama locally with the recommended models** + ```bash + ollama pull llama3.2:3b # Lightest, general purpose (4GB RAM) + ollama pull qwen2.5-coder:7b # Best for code (8GB RAM) + ollama pull phi4-mini # Best for writing (6GB RAM) + ``` + + Use it as a parallel tool for quick tasks that don't need Squad's orchestration. Ask it questions. Use it for drafts. Keep Squad for multi-agent coordination and heavy code work. + +2. **Create a Squad skill that teaches agents to use Ollama for subtasks** + + Create `.copilot/skills/local-slm/SKILL.md` that defines when agents can shell out to the local Ollama instance. Scribe uses it for formatting log entries. Ralph uses it for status reports. It saves cloud tokens on the mechanical stuff. + +### When Squad issue #987 ships (Phase 2) + +Build the MCP server that wraps Ollama as tools. Hook it into the task-category routing so the coordinator knows which categories can safely use local inference. Now agents can call tools instead of shell scripts. + +### When the platform supports custom endpoints (Phase 3) + +This is the unlock. Full integration via custom provider config in `.squad/config.json`. Squad coordinator routes mechanical agents directly to local models. Scribe, Ralph, all the PM agents—they run entirely local. Roughly 60% of spawns stop using cloud tokens, which by my estimate cuts inference costs by about half. 
Nothing changes from the agent's perspective. They just run faster and cheaper. + +## The math: what SLMs can actually handle + +Not every agent can go local. Some tasks genuinely need better models. Here's the honest breakdown of my 40+ agents: + +| Category | Examples | Can Use SLM? | Why / Why Not | +|----------|----------|-------------|---| +| **Mechanical ops** | Scribe (logging, merging), Ralph (triage) | ✅ Yes | File operations, pattern-based routing, no reasoning needed | +| **PM/Coordination** | Coordination agents (task routing, status) | ✅ Yes | Routing decisions, status generation, simple classification | +| **Docs/writing** | Content agents (content drafting) | ✅ Yes | Structured writing, markdown templates, prose generation | +| **Light code** | Template-focused agents | 🟡 Partial | Simple templates yes, complex logic probably no | +| **Heavy code** | Architecture and refactoring agents | ❌ No | Multi-file awareness, tool calling, deep reasoning needed | +| **Adversarial** | Review and edge-case agents | ❌ No | Requires deep understanding of implications | + +**Key insight:** ~60% of my squad (17+ PM agents + Scribe + Ralph + some docs agents) could run on local SLMs TODAY if the routing existed. That's where the cost savings come from. + +The 40% that stays cloud (code engineers, architects, adversarial reviewers)—that's the work that actually needs premium models. Those agents should stay on cloud. + +## Open questions I'm still sitting with + +1. **Tool calling:** Do Qwen 2.5 Coder and Phi-4 Mini support the structured tool-calling format that Squad agents use? If they don't, agents can't call filesystem tools. That's a hard blocker for some use cases. + +2. **Context windows:** Squad spawn prompts (agent charter + history + decisions + the task) often exceed 4K tokens. The models listed above advertise context windows well beyond that (up to 128K tokens), but local runtimes such as Ollama typically apply a much smaller default unless you raise it (for example with `num_ctx`), and longer contexts cost more RAM. Is the practical window enough? I haven't tested yet. + +3. **GitHub's roadmap:** Is BYOM (Bring Your Own Model) coming to Copilot CLI? 
Not just VS Code Chat—the CLI itself? This is the single biggest unlock. If GitHub says "we're adding custom endpoint support," this whole investigation becomes moot because Phase 3 arrives sooner. + +4. **Squad CLI as its own runtime:** Could Squad implement a lightweight agent runtime for local models? (Prompt → tools → response loop, entirely local.) That would bypass the platform's `task` tool entirely and unlock full local execution. But it's a lot of scope. + +## What I'd do next + +If I'm building on this: + +1. **Install Ollama locally and test it for a week.** Use it for ad-hoc tasks alongside Squad. See how the models feel. + +2. **Create the local-slm skill and try it.** Get one agent (Scribe) delegating simple tasks to Ollama. Measure: does it work? How often does the agent make the right call to use local vs. cloud? + +3. **Build the MCP server as a proof of concept.** Even if it's not the long-term solution, it proves the concept works and gives me something to show the Squad maintainers. + +4. **Comment on Squad issue #987** with this investigation. The model-mapping config is adjacent to what we'd need for local endpoint support. Help the maintainers think through it. + +5. **Track the platform roadmap.** When GitHub ships BYOM for Copilot CLI, that changes everything. That's the day Phase 3 becomes possible. + +## The bigger picture + +Here's what struck me most in this investigation: the tooling to do this exists. Ollama is solid. The models are good enough for 60% of the work. OpenAI-compatible APIs are a standard. Squad's architecture is clean enough to extend. + +What's missing is *permission*. The platform doesn't let you point at localhost. The CLI's `task` tool won't accept custom endpoints. Squad CLI can't implement its own agent runtime without huge scope changes. + +So it's not a capability problem. It's a design boundary. Someone decided: "Only platform catalog models." And that boundary is logical (easier to meter, monitor, secure). 
But it means local inference is blocked even though the technical pieces are there. + +The investigation doesn't give me a workaround that feels elegant. The MCP server is Medium effort for Medium benefit. The skill approach works but feels hacky. Full integration is blocked. + +But it's good to know the landscape. And it's worth commenting on issue #987, because when that ships, the groundwork for Phase 2 is laid. And when the platform eventually supports custom endpoints—because some team will want this, and the pressure will build—Phase 3 becomes straightforward. + +In the meantime: Ollama running in the background, a skill for mechanical tasks, and cloud for the work that needs it. That's what Phase 1 looks like. + +**What's your setup?** If you're running Squad or a similar agent team, are you thinking about local inference? Have you tried routing work to SLMs? I'd be curious what paths you've found. diff --git a/website/blog/2026-05-08-worktree-trick-multi-project-ai-teams.md b/website/blog/2026-05-08-worktree-trick-multi-project-ai-teams.md new file mode 100644 index 0000000..f53e683 --- /dev/null +++ b/website/blog/2026-05-08-worktree-trick-multi-project-ai-teams.md @@ -0,0 +1,234 @@ +--- +slug: "/2026-05-08-worktree-trick-multi-project-ai-teams" +canonical_url: "https://dfberry.github.io/blog/2026-05-08-worktree-trick-multi-project-ai-teams" +custom_edit_url: null +sidebar_label: "2026.05.08 Worktree Trick" +title: "The Worktree Trick: How I Run 40 AI Agents Across 8 Projects Without Losing My Mind" +description: "Two months of failed experiments taught me that the best way to manage multi-project AI agent teams is the oldest tool in git's toolbox — worktrees — paired with a centralized hub." 
+draft: true +tags: + - "AI Agents" + - "Git Worktrees" + - "Squad" + - "Developer Experience" + - "AI-assisted" +keywords: + - "git worktrees" + - "ai agents" + - "multi-project management" + - "copilot cli" + - "squad framework" + - "developer productivity" +updated: "2026-05-08 00:00 PST" +--- + + + +I stopped asking the AI to manage workspace state. That's the whole trick. + +For two months I tried architecturally ambitious approaches — unified squads, per-project squads, sub-coordinators, distributed skill systems — all trying to teach LLMs how to switch branches, route work, and stay on task. None of it worked reliably. Here's what I discovered: the thing that actually solved it was removing that burden entirely. I create a git worktree *before* I launch Copilot CLI. One command, in my regular terminal. The session starts already in the right place, with the right branch, and no decisions left to make about where to work. + +This post covers three layers: the problem I hit (multi-project chaos), the approaches I tried and abandoned, and the architecture I landed on. Let me walk through how I got here. + +## What I'm working with + +If you're running more than a few AI agents, you'll hit the same coordination wall I did. Understanding the scale helps explain why the simple solutions broke first. + +I work on developer documentation. I manage documentation and tooling across multiple active projects, many repos, and a [Squad](https://github.com/bradygaster/squad) team of 40+ AI agents — PMs, engineers, adversarial reviewers, and fact-checkers. The agents use role-and-domain names because a large roster needs naming that stays legible at a glance. + +The hub repo is simply a central coordination repo. It's the single brain for everything — one `squad.agent.md`, one `decisions.md`, one shared skill library. Physical repo clones live flat under `./repos/`. 
A file called `repos.json` maps each repo to its project, its auth method, its PM agent, and its sweep config. Think of it as a routing table for work. + +Getting to this architecture took three attempts and two failed experiments. My perspective: the failures taught me more than the solution did. + + + +## Approach 1: Put everything in one directory (Early April 2026) + +The simplest approach reveals the core problem fastest. I started here because I thought the AI could handle branch management. It can't — at least not reliably. + +I started simple. One Squad, one repo, work on everything from the same checkout. + +It collapsed almost immediately. + +The core problem was context bleed. An agent working on DocumentDB content accidentally referenced Azure SQL patterns. Agents committed directly to main instead of branches — I literally put "all changes always go through a PR regardless of the repo" in all-caps in my instructions. Branch switching turned chaotic. An agent landed on the wrong branch, and a PR picked up commits from completely unrelated work. + +I found myself constantly redirecting: "No, that's the wrong repo." "No, switch branches first." "No, don't commit there." I spent more time babysitting than I would have spent doing the work myself. + +In one session I finally said it out loud: *"I'm having to be too interactive and redirecting across all projects and repos — how do I improve how I work so that you need less direction?"* + +The answer wasn't a better prompt. It was a better architecture. + +## Approach 2: Give each project its own squad (Mid-April 2026) + +When a shared workspace fails, the instinct is to isolate everything. That instinct is half right — isolation solves context bleed but creates a new problem that's equally painful. + +If one squad couldn't handle multiple projects, maybe each project needed its own squad. + +I spun up child squads for several projects. Each had its own sub-coordinator, its own agents, and its own config. 
+ +Context isolation was perfect. Context *duplication* was a nightmare. + +Every squad needed its own `decisions.md`. Skills that worked everywhere — YAML validation, PR workflow rules, Acrolinx compliance checks — had to be manually copied to every squad. When I updated a skill in one place, the others went stale. I found myself sending messages like *"can you copy this shared integration config to every subsquad?"* — which tells you everything about the overhead. + +The deeper question was coordination. I asked my agents: *"How will the project PMs work with the individual repos they're responsible for — create PRs on those individual repos and spin up separate loops, or what? How do we make sure nothing is dropped, everything is logged, and the goal is reached instead of a different goal?"* + +Nobody had a clean answer. Because there isn't one — not when you've fragmented your team's brain across five repositories. + +The per-project model scales linearly in the wrong direction. Every new project means another squad to configure, another set of skills to maintain, another `decisions.md` that might contradict the others. At 8 projects, it's untenable. + +## Approach 3: Combine a central hub with git worktrees (Late April / May 2026) + +Here's what I discovered: the solution takes the shared brain from Approach 1 and the isolation from Approach 2, but shifts *where* the isolation happens. Instead of isolating entire squads, I isolate workspaces. + + + +One central Squad in a hub repo — the same single brain from Approach 1 — but with a critical difference: I create git worktrees *before* entering Copilot CLI. 
+ +Here's the shape of the worktree layout: + +``` +C:/workspace/hub-repo [main] ← main checkout +C:/workspace/project-a-feature [project-a/feature] ← worktree +C:/workspace/project-b-docs [project-b/docs] ← worktree +C:/workspace/project-c-review [project-c/review] ← worktree +C:/workspace/project-d-automation [project-d/automation] ← worktree +``` + +That simplified `git worktree list` output shows five checkouts active side by side: the main checkout plus four task worktrees. Each worktree is a separate directory on disk, locked to a single branch, with the full hub config (agents, skills, decisions) available. + +A clarification that tripped me up at first: these are worktrees of the *hub repo* — not copies of every downstream product or content repo. The worktree gives each session its own branch, its own instructions, its own decisions and routing context. The actual product and content repos stay mapped through `repos.json` and cloned under `./repos/`. The worktree isolates the *orchestration layer*, not the target repos themselves. + +When I start a Copilot CLI session in a worktree, the LLM is already "in" the right branch. No switching to manage. The default path leads to the right place. + +This is the key insight: **the LLM doesn't need to manage branches if you give it a workspace that's already on the right branch.** + +Most of the problems from Approach 1 become much harder to hit — context bleed, branch confusion, accidental commits to main all require the agent to actively work against the worktree's defaults. And the problems from Approach 2 disappear entirely — skills are shared automatically from the hub, decisions live in one file, and there's exactly one squad to maintain. + + + +The daily workflow change was immediate. Before: "No, wrong repo. No, switch branches first. No, don't commit there." After: create worktree, `cd` into it, start Copilot CLI, assign agent, review PR. Five steps, no redirecting. 
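The "create worktree, `cd` into it" part of that workflow can be sketched as a small shell helper. This is a minimal sketch, not my exact script: the function name `new_task_worktree` and its arguments are hypothetical, the paths in the usage comment are placeholders, and it assumes the hub repo already has at least one commit. It relies only on `git worktree add`, using `-b` to create the branch when it doesn't exist yet.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create a worktree for one task before launching the CLI.
# Usage: new_task_worktree <hub-repo-path> <branch> <worktree-dir>
# Pass an absolute <worktree-dir>; git -C resolves relative paths against the hub repo.
new_task_worktree() {
  local hub="$1" branch="$2" dir="$3"
  if git -C "$hub" show-ref --verify --quiet "refs/heads/$branch"; then
    # Branch already exists: check it out in the new worktree.
    git -C "$hub" worktree add "$dir" "$branch"
  else
    # New task: create the branch from the hub's current HEAD.
    git -C "$hub" worktree add -b "$branch" "$dir"
  fi
}

# Example (placeholder paths and branch name):
#   new_task_worktree /c/workspace/hub-repo project-a/feature /c/workspace/project-a-feature
#   cd /c/workspace/project-a-feature   # then start the Copilot CLI session from here
```

Keeping the branch check inside the helper means the same command works whether I'm resuming an existing task branch or starting a new one.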
+ +## Face the real effort: managing the framework itself + +The worktree trick solved the workspace problem. But I want to be radically honest about what it *didn't* solve — because if you stop at "just use worktrees," you'll hit the next wall within a week. The real work shifts from fighting the AI to maintaining the system around it. + +### Staff your agent roster like a team + +Without regular roster hygiene, you end up with ghost agents consuming context and muddying routing. This is people-management work wearing a technical mask. + +At 40+ agents, I manage a team roster. Agents get added for new domains, stale ones linger after projects wrap up, and over time I notice two agents doing roughly the same thing. I did a major consolidation pass — merging redundant agents, collapsing overlapping domains, and renaming agents to include their role and domain in the name (`planner` → `planner-pm-content`). That rename cascaded into updating GitHub labels, routing tables, and casting registries. This is not a "set it and forget it" system. It's staffing. + +### Curate skills through a confidence lifecycle + +The difference between "worked once" and "reliably correct" is the difference between a tip and a skill. Without a confidence lifecycle, your skill library becomes a junk drawer of one-off patterns. + + + +Skills pile up naturally as agents learn patterns, but I started encoding the project name in the skill folder to separate project-specific patterns from generic ones. Even then, there's a confidence lifecycle — a skill starts at low confidence, gets validated through repeated use, and eventually earns high confidence. Advancing a skill through that lifecycle takes human judgment. Nobody else is going to do it. + +### Keep the routing table honest + +When your routing table goes stale, agents guess instead of looking things up. One stale entry can send a PR to the wrong repo — and you won't catch it until review. 
+ +`repos.json` maps 20+ repos to projects and PM agents. When I consolidated an internal repo into a different project, the routing table needed updating, the PM agent scope needed adjusting, and orphaned issues needed re-labeling. This is project management work — the framework just makes it explicit instead of implicit. + +### Prune decisions before they become wallpaper + +A decisions file that agents skim past is worse than no decisions file — it creates false confidence that conventions are being followed. + +`decisions.md` is the greatest strength and the biggest maintenance burden. Over time it grows into a massive file that agents skim past rather than internalize. I archive entries older than 30 days, deduplicate, and periodically do a junk-drawer cleanup to keep it actionable. Without curation, decisions accumulate but stop influencing behavior. This is a recurring cost, not a one-time fix. + +### Close the gap between "I have agents" and "agents do work" + + + +The ultimate goal is one command → auto-triage → route to the right agent → execute → open PR → review → merge. Getting there requires meaningful issue labels, triage rules in routing, auto-assignment patterns, and agents that genuinely know their scope without being told. Each iteration gets closer. But the gap between "I have agents" and "agents do work without me" is filled with exactly this kind of infrastructure. The "Ralph, go" aspiration is still aspirational. + +### Accept that cross-project awareness is unsolved + +Even the best architecture can't give you a mental model of everything in flight. This is the frontier — and it's honest to say nobody has cracked it yet. + +Even with the hub model, knowing what's happening across all 8 projects simultaneously is hard. I use daily briefs, orchestration logs, and session history, but the mental model of "what's in flight right now" still lives partly in my head. No framework I've seen has fully cracked multi-project awareness yet. 
+ +The honest summary: the hub + worktrees architecture moved the hard work from "fighting the AI about branches and context" to "maintaining the system that makes the AI productive." That's a better class of problem. It's boring, and that's good. But it's still a problem, and it's ongoing. + +## Compare the trade-offs + +Every approach solves some problems and creates others. This table captures my experience — not theoretical scores, but what I actually hit across two months of iteration. + + + +| Dimension | Same directory | Squad per project | Hub + worktrees | +|-----------|---------------|-------------------|-----------------| +| Context isolation | ❌ Bleed everywhere | ✅ Perfect | ✅ Branch-level | +| Shared skills | ✅ Automatic | ❌ Must copy everywhere | ✅ Automatic from hub | +| Shared decisions | ✅ One file | ❌ Fragmented | ✅ One file | +| Agent confusion | ❌ Wrong branches | ✅ Clean | ✅ Clean | +| Setup overhead | ✅ Minimal | ❌ Heavy (N squads) | 🟡 Moderate | +| Cross-project coordination | ❌ Manual redirecting | ❌ No shared brain | ✅ Hub coordinates | +| Workspace discipline | ❌ Must manage branch state | ✅ No switching needed | ✅ No switching needed | +| Scaling | ❌ Collapses at 5+ projects | 🟡 Linear overhead | ✅ Scales well | +| PR workflow | ❌ Commits to main | ✅ Clean | ✅ Each worktree = one PR | + +The hub + worktrees model has real costs beyond the 10-second `git worktree add` setup command. Worktrees accumulate — you need a cleanup habit (`git worktree remove`, then `git worktree prune`) or your disk fills up with stale checkouts. Branch names need to be meaningful because they *are* your task identifiers; `fix-stuff-3` helps no one. And `decisions.md` requires curation — without occasional pruning, it becomes a junk drawer that agents skim past instead of reading. These are maintenance costs, not architectural ones, but they're real. + +## Six things I learned the hard way + +These lessons didn't come from documentation — they came from wasted sessions and frustrated redirecting. 
Each one represents a problem I had to hit before I understood the fix. + +**1. Domain-encode your agent names.** When I had 40 agents with vague names, I couldn't remember who did what. Renaming them to descriptive role-and-domain names made the entire system legible at a glance. The naming convention is `{name}-{role}-{domain}`. It sounds bureaucratic. It saves real time. + +**2. Create a routing table, not a mental model.** `repos.json` maps every repo to its project, its owning PM agent, its auth type, and its sweep config. When an agent needs to know which project owns a repo, it looks it up instead of guessing. When *I* need to know, I look it up instead of remembering. Externalize the routing. + +**3. Let skills accumulate, not duplicate.** In the hub model, when any agent learns a new skill — YAML validation, Rust SDK patterns, Acrolinx compliance — it's immediately available to every project. In the per-project model, skills are siloed. Over two months, the hub accumulated 15+ shared skills that would otherwise be fragmented across 8 separate squads. + +**4. One `decisions.md` changes everything.** When you have one canonical decisions file, agents don't make the same mistake twice. "We use relative URLs instead of absolute ones in docs content." "All PRs must target the correct staging repo, never the public mirror." Decisions propagate automatically because every session reads the same file. + +**5. The "too interactive" problem is a design smell.** If you're constantly redirecting agents, the architecture is wrong. The goal is to say "Ralph, go" and have work happen. That requires auto-triage, domain-aware routing, and agents that know their scope without being told every time. If you're babysitting, fix the structure — not the prompts. + +**6. Make workspace setup a gate.** Don't ask the LLM to choose a branch or switch context. Give it a directory where the correct branch and instructions are already true. 
If the workspace is wrong, no amount of prompting fixes it reliably. If the workspace is right, you barely need prompts at all. This is arguably the central takeaway from two months of iteration. + +## Where to start + +If you're managing more than a couple of projects with AI agents, here's what I'd try: + +1. **Start with one hub repo.** Put your squad config, skills, and decisions here. This is your single brain. +2. **Put `repos.json` at the root.** Map every repo you work on to a project. Include the PM agent who owns it. +3. **Use worktrees for isolation.** Before starting a Copilot CLI session, run `git worktree add ../my-task-dir my-branch` (or `git worktree add -b my-branch ../my-task-dir` if the branch doesn't exist yet). Work in the worktree directory. The LLM inherits the right branch automatically. +4. **Name agents with their domain.** `{name}-{role}-{domain}`. You'll thank yourself at agent 15. +5. **Log decisions centrally.** Every decision, every convention, every "don't do this" — one file, read by every session. +6. **Make workspace setup the gate.** If the directory is wrong, the session is wrong. Get this right first. + +The surprising thing isn't that AI agents need structure. It's that the structure that works best is one of the oldest tools in git's toolbox — worktrees — combined with one of the oldest patterns in systems design: a centralized brain with distributed execution. + +The LLM doesn't need a smarter prompt. It needs a workspace where the right thing is the default. + +How do you handle multi-project AI coordination? I'm genuinely curious — let me know what patterns you've found. 
diff --git a/website/blog/2026-05-12-agent-marketplace.md b/website/blog/2026-05-12-agent-marketplace.md new file mode 100644 index 0000000..cb4d422 --- /dev/null +++ b/website/blog/2026-05-12-agent-marketplace.md @@ -0,0 +1,424 @@ +--- +slug: "/2026-05-12-agent-marketplace" +canonical_url: "https://dfberry.github.io/blog/2026-05-12-agent-marketplace" +custom_edit_url: null +sidebar_label: "2026.05.12 Agent Marketplace" +title: "Stop Rebuilding Agents from Scratch. Package, Share, and Install Them." +description: "A marketplace approach for packaging reusable agents so teams can publish, install, and update them like software." +draft: true +tags: + - "AI Agents" + - "Marketplace" + - "Reuse" + - "Squad" + - "AI-assisted" +keywords: + - "agent marketplace" + - "reusable agents" + - "package agents" + - "agent distribution" + - "squad" +updated: "2026-05-12 00:00 PST" +--- + +Your team builds a security reviewer agent. It works well. It gets better as you tune it. + +Then another team needs a similar agent. They rebuild it. Six weeks of tuning lost. Your organization has 2 security agents now—identical twins that drift apart over time. + +This happens because agent sharing isn't standardized. There's no way to package an agent, vet it, and install it. So teams hoard their work instead of amplifying it. + +[Squad SDK Agent Marketplace](https://github.com/bradygaster/project-squad-sdk-example-marketplace) is a reference implementation of a private agent registry. Define agents with manifests. Scan for security issues. Publish to a registry. Install verified agents in seconds. No build-it-yourself infrastructure, no manual vetting—just configuration and CLI commands. 
+ +## The Problem: Agents Are Trapped in Repos + +You've invested in building specialized agents: +- Security reviewer (detects vulns, flags risky patterns) +- Documentation auditor (checks completeness and freshness) +- Accessibility checker (scans for A11y compliance) + +Each took 30–50 hours to build and tune. They solve real problems. + +But they're locked in one repo. When another team needs the same capability, they either: +1. Recreate it (wasting time) +2. Copy the code (manual sync, divergence) +3. Hope to find it someday (tribal knowledge) + +No organization does this intentionally. It's just the default when there's no infrastructure for sharing. + +## How the Marketplace Works: A Complete Walkthrough + +### Step 1: Clone and Build + +> 📊 **[DIAGRAM: Marketplace Publish-to-Install Flow]** +> *Prompt for image generation:* Create a horizontal flow showing 3 phases: (1) LOCAL DEVELOPMENT (left): Developer box → Package (tar.gz icon) → Scan (shield icon checking code) → Risk assessment indicator (green/red). (2) REGISTRY (center): GitHub private repository icon with multiple agent versions listed. (3) CONSUMER (right): Install command → Extract → Dependencies resolved → Agent in .squad/agents/. Use dark background, teal/cyan accents, arrows showing progression. Include version pinning concept: show multiple versions (1.0.0, 1.1.0, 2.0.0) in registry with selection arrow. Bottom: show security scanning details (hardcoded secrets detector, dangerous imports detector). +> *Purpose:* Gives readers the big picture of how agents move from local development through security scanning to a shared registry, then install safely in other repos. 
+ +```bash +$ git clone https://github.com/bradygaster/project-squad-sdk-example-marketplace.git +$ cd project-squad-sdk-example-marketplace +$ npm install +added 42 packages, and audited 45 packages in 2.3s + +$ npm run build +$ npm link +# Makes squad-marketplace available globally + +$ npm test +✓ src/manifest/validator.test.ts (4 tests) +✓ src/security/scanner.test.ts (6 tests) +Test Files 5 passed (5) + Tests 23 passed (23) +``` + +Everything builds and tests pass. You're ready to publish your first agent. + +### Step 2: Package Your Agent + +Your agent lives in a directory with two files: + +``` +my-security-agent/ +├── manifest.json +└── charter.md +``` + +Create `manifest.json`: + +```json +{ + "name": "security-reviewer", + "version": "1.0.0", + "author": "security-team", + "description": "Automated security code review", + "skills": { + "code-analyzer": "^1.5.0", + "vulnerability-scanner": "^2.0.0" + }, + "config": { + "timeout": 60000, + "maxTokens": 4000, + "temperature": 0.2 + } +} +``` + +Create `charter.md` (a readme, really): + +```markdown +# Security Reviewer Agent + +Performs automated security code review with pattern detection for common vulnerabilities. + +## Capabilities +- Detects hardcoded secrets (API keys, credentials) +- Flags risky imports (eval, exec, dynamic requires) +- Checks authentication flows for weakness +- Scans for SQL injection patterns + +## Permissions +- Read-only access to codebase +- Cannot modify files +- No network access +``` + +Now package it: + +```bash +$ squad-marketplace package examples/sample-agent +Packaged → sample-agent-1.0.0.tar.gz (874 bytes) +``` + +Done. You have a `.tar.gz` with metadata and checksum. + +### Step 3: Scan for Security Issues + +Before publishing, verify the agent is safe: + +```bash +$ squad-marketplace scan examples/sample-agent + +Agent : sample-agent +Risk : LOW +Approved: true +No issues found. 
+``` + +The scanner detects: +- Hardcoded credentials (AWS keys, API tokens) +- Dangerous imports (`eval`, `exec`, shell injection) +- Network risks (unexpected calls to external APIs) +- Pattern violations (overprivileged roles) + +It exits with code 0 (safe) or 1 (scan failed). CI integration is straightforward—fail the build if scan exits with 1. + +Let's see what a failed scan looks like. Append a hardcoded secret to the charter: + +```bash +$ echo 'API_KEY = ghp_super_secret_token' >> examples/sample-agent/charter.md + +$ squad-marketplace scan examples/sample-agent + +Agent : sample-agent +Risk : HIGH +Approved: false + +Found 1 issue: + +❌ SECURITY ISSUE: Hardcoded Credential Pattern + File: charter.md + Match: ghp_super_secret_token + Severity: CRITICAL + Remediation: Remove hardcoded secrets. Use environment variables. +``` + +The scan caught it. Now remove the secret and re-scan: + +```bash +$ git checkout examples/sample-agent/charter.md +$ squad-marketplace scan examples/sample-agent + +Agent : sample-agent +Risk : LOW +Approved: true +No issues found. +``` + +### Step 4: Publish to Your Registry + +Set your GitHub registry details (a private repo you control): + +```bash +$ export GITHUB_TOKEN=ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx +$ export GITHUB_REGISTRY_OWNER=your-org +$ export GITHUB_REGISTRY_REPO=my-agent-registry + +$ squad-marketplace publish sample-agent-1.0.0.tar.gz +Published sample-agent@1.0.0 + +✅ Published to https://github.com/your-org/my-agent-registry + Version: 1.0.0 + Size: 874 bytes + Risk: LOW +``` + +The agent lands in your private registry. No build pipeline, no Docker images, no complexity. Just GitHub as the backend. 
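The scan and publish steps above could be chained in CI so that a failed scan blocks the publish. Here's a minimal sketch, assuming only the exit codes described earlier (0 for safe, 1 for a failed scan). The `publish_if_clean` function and the `MARKETPLACE_CLI` override variable are hypothetical names I made up for illustration; they are not part of the marketplace CLI.

```shell
#!/usr/bin/env bash
# Gate publishing on a clean scan.
# Assumes the exit codes described above: 0 = safe, 1 = scan failed.
# MARKETPLACE_CLI is a hypothetical override (handy for dry runs);
# it defaults to the real command name.
MARKETPLACE_CLI="${MARKETPLACE_CLI:-squad-marketplace}"

# Usage: publish_if_clean <agent-dir> <tarball>
publish_if_clean() {
  local agent_dir="$1" tarball="$2"
  if ! "$MARKETPLACE_CLI" scan "$agent_dir"; then
    # Non-zero exit from the scan: stop here and fail the build.
    echo "Scan failed for $agent_dir; not publishing." >&2
    return 1
  fi
  "$MARKETPLACE_CLI" publish "$tarball"
}

# Example (placeholder paths):
#   publish_if_clean examples/sample-agent sample-agent-1.0.0.tar.gz
```

Because the function returns non-zero when the scan fails, a CI job that calls it fails on the same condition without any extra wiring.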
+ +### Step 5: Install Verified Agents + +Another team installs it: + +```bash +$ squad-marketplace install sample-agent@1.0.0 +Installed sample-agent@1.0.0 → .squad/agents/sample-agent + +✓ Extracted files +✓ Resolved dependencies +✓ Updated metadata +``` + +Agent is now available to their squad. Dependencies get resolved. Versions stay pinned. + +### Step 6: List Available Agents + +```bash +$ squad-marketplace list + +Registry (3 agent(s)): + security-reviewer@1.0.0 by security-team (2024-01-15) + docs-auditor@1.2.0 by docs-team (2024-01-14) + a11y-checker@2.0.0 by platform-team (2024-01-10) +``` + +Everyone can see what's available. Publish dates, versions, authors. Clear ownership. + +## Two Commands That Show Real Power + +```bash +# Full lifecycle in one session +squad-marketplace package ./my-agent +squad-marketplace scan ./my-agent +squad-marketplace publish my-agent-1.0.0.tar.gz + +# Install it elsewhere +squad-marketplace install my-agent@1.0.0 + +# Verify it's installed +ls -la .squad/agents/my-agent/ +# Should show manifest.json and charter.md +``` + +That's it. No manual downloading, no file juggling, no trusting someone's GitHub fork. You get the exact version someone published, scanned, and approved. + +## Real Use Case: Scale Your Expertise + +A platform team builds a database runbook agent that diagnoses and suggests fixes for common database issues. It takes 30 hours to get right—query analysis, performance tuning, escalation paths. + +> 📊 **[DIAGRAM: Agent Reuse Multiplier Effect]** +> *Prompt for image generation:* Show a vertical timeline (left side: months 0-3) with milestones: Month 0—single platform team box (large, green). Month 1—platform team + 4 adopter team boxes (all same size, teal), connected by dotted lines from platform team. Month 2—platform team + 8 adopter boxes. Month 3—platform team + 12 adopter boxes. Each adopter box includes small icons indicating "using 1 agent". 
Center annotation: show hours saved column (Month 1: ~600 hours saved, Month 2: ~1200 hours saved, Month 3: ~1800 hours saved). Bottom: show single publish-and-upgrade cycle (one arrow up from platform team) distributing improvements to all 12 teams. Use dark background, grow graph concept with green/teal boxes, thin connecting lines. +> *Purpose:* Visualizes the organizational leverage of sharing—one investment multiplies across teams, with updates flowing to all instantly. + +They publish it to the marketplace. + +Now 12 teams can `install database-diagnostician` instead of calling an on-call DBA or writing custom diagnostics. The platform team maintains one source of truth. All 12 teams get bug fixes and improvements automatically when they upgrade. + +Timeline: +- **Day 1:** Platform team publishes `database-diagnostician@1.0.0` +- **Day 2:** Team A installs it, uses it for production incident triage +- **Week 1:** Team B, C, D install it. Saves each 20+ hours of development +- **Month 1:** Platform team finds a bug, publishes `@1.0.1`. All 12 teams upgrade in seconds +- **Month 3:** 12 teams have provided feedback. Platform team publishes `@2.0.0` with better heuristics + +The platform team's expertise—30 hours of tuning and debugging—gets multiplied across the organization. That's the leverage. + +## The Trust Model + +**Authentication:** Only your org can access your registry (private GitHub repo). + +**Verification:** Every published agent is scanned for credentials and unsafe patterns before the publish command even succeeds. + +```bash +$ squad-marketplace publish my-agent-1.0.0.tar.gz + +Scanning for security issues... +Risk: LOW ✅ + +Checking GitHub credentials... +✓ GITHUB_TOKEN valid +✓ Registry writable + +Publishing... +Published my-agent@1.0.0 +``` + +**Publisher accountability:** You know who published each agent and when. + +**Version pinning:** When you install `security-reviewer@1.0.0`, you get exactly that version forever. 
No surprise breaking changes when someone publishes 2.0. + +```bash +$ squad-marketplace list + +security-reviewer@1.0.0 by security-team (2024-01-15) +security-reviewer@2.0.0 by security-team (2024-01-20) +# Both versions are available; install whichever you want +``` + +## Why Not Just Use npm? + +npm is designed primarily for open-source packages. This is designed for private, enterprise use: + +| Aspect | npm | Marketplace | +|--------|-----|-----------| +| **Access Control** | Public by default | Only your org | +| **Security Scanning** | Dependency audit only (`npm audit`) | Built-in: detects credentials, dangerous patterns | +| **Publisher Verification** | No | Yes: you know who published | +| **Dependency Resolution** | npm semver | Agent-specific skill dependencies | +| **Hosting** | npm registry (third-party) | Your GitHub repo (you control) | + +You *could* use npm. But then you're mixing public packages with private agents. You're relying on `npm audit`, which checks dependencies for known vulnerabilities but doesn't scan an agent's contents for credentials or risky patterns. You're trusting some random maintainer's security posture. + +The marketplace is built for the enterprise constraint: *private, verified, auditable*. + +## The Honest Scoping + +This is a **reference implementation**. It shows patterns for packaging, security scanning, and registry-backed distribution. It's production-ready for the core loop (package → scan → publish → install). 
+ +What's real today: +- Manifest validation and versioning +- Security scanning (hardcoded credentials, dangerous imports) +- Package extraction and tar/gzip operations +- GitHub-backed registry (push/pull agents via GitHub repos) +- Version pinning and semver resolution +- Audit trail of who published what and when + +What's planned but not yet enforced: +- Advanced security rules (PII detection, SQL injection patterns in agent configs) +- Trust scoring (how many teams use this agent, uptime) +- Preview sandboxes (try agents before installing) +- Web UI for browsing and installing agents +- Deprecation warnings (agent marked as obsolete) + +The core—the workflow that matters today—is solid. The roadmap extensions are where teams customize. + +## Getting Started: 5 Steps + +```bash +# 1. Clone and setup +git clone https://github.com/bradygaster/project-squad-sdk-example-marketplace.git +cd project-squad-sdk-example-marketplace +npm install && npm run build && npm link + +# 2. Create your first agent +mkdir my-test-agent && cd my-test-agent + +cat > manifest.json << 'EOF' +{ + "name": "my-test-agent", + "version": "1.0.0", + "author": "your-name", + "description": "A test agent", + "skills": {}, + "config": {} +} +EOF + +cat > charter.md << 'EOF' +# My Test Agent +A simple demonstration agent. +EOF + +# 3. Package it +cd .. +squad-marketplace package my-test-agent +# → my-test-agent-1.0.0.tar.gz + +# 4. Scan it +squad-marketplace scan my-test-agent +# → Risk: LOW, Approved: true + +# 5. Set up registry and publish +export GITHUB_TOKEN=ghp_... +export GITHUB_REGISTRY_OWNER=your-org +export GITHUB_REGISTRY_REPO=my-agent-registry + +squad-marketplace publish my-test-agent-1.0.0.tar.gz +# → Published my-test-agent@1.0.0 +``` + +Done. Your first agent is in the registry. 
+ +Now install it elsewhere: + +```bash +squad-marketplace install my-test-agent@1.0.0 +# → Installed my-test-agent@1.0.0 → .squad/agents/my-test-agent +``` + +Check it: + +```bash +ls .squad/agents/my-test-agent/ +# manifest.json +# charter.md +``` + +Your agent is ready to use across teams. + +## Why This Matters + +Specialized agents are becoming organizational assets. A good security reviewer saves your team hours. A solid documentation auditor catches staleness before readers see it. An accessibility checker prevents compliance issues. + +Right now, those assets stay hidden. They're built once, tuned in one repo, and forgotten everywhere else. Nobody knows they exist. Teams rebuild instead of reuse. + +A marketplace changes that. It makes agents discoverable, shareable, and governed. It solves the "not invented here" problem by making "invented here and available everywhere" the default. + +For your organization: +- One team invests in building a great agent +- 10 other teams benefit immediately +- The original team gets 10 testers and feedback loops +- Your best ideas scale + +--- + +Read the [repo](https://github.com/bradygaster/project-squad-sdk-example-marketplace) and the [quickstart](https://github.com/bradygaster/project-squad-sdk-example-marketplace/blob/main/QUICKSTART.md) to get your first agent published. + +Your specialized knowledge should multiply, not repeat. Build once, share everywhere. 
diff --git a/website/blog/2026-05-12-ai-agents-cloud-identity.md b/website/blog/2026-05-12-ai-agents-cloud-identity.md new file mode 100644 index 0000000..3a040c1 --- /dev/null +++ b/website/blog/2026-05-12-ai-agents-cloud-identity.md @@ -0,0 +1,238 @@ +--- +slug: "/2026-05-12-ai-agents-cloud-identity" +canonical_url: "https://dfberry.github.io/blog/2026-05-12-ai-agents-cloud-identity" +custom_edit_url: null +sidebar_label: "2026.05.12 Cloud Identity for Agents" +title: "When agents reach the cloud: Azure identity, risk tiers, and governance" +description: "Part 3 of a security series on Azure identity options and risk-based controls for agents that access cloud resources." +draft: true +tags: + - "AI Agents" + - "Azure" + - "Security" + - "Identity" + - "AI-assisted" +keywords: + - "azure agent identity" + - "oidc federation" + - "managed identity" + - "entra agent id" + - "agent cloud security" +updated: "2026-05-12 00:00 PST" +--- + +This is Part 3 of a three-part series on giving AI agents access to your code. This post covers what happens when agents need to reach beyond GitHub — into Azure, Microsoft 365, or other cloud services — plus how to scale your security controls to match the actual risk. + +- [Part 1: Who gets the keys to your repo?](/blog/2026-05-12-ai-agents-repo-auth) +- [Part 2: Where does agent code actually run?](/blog/2026-05-12-ai-agents-sandboxing) +- **Part 3: When agents reach the cloud** (you are here) + +*If your agents only interact with GitHub repos, Parts 1 and 2 have everything you need. 
This post is for when your agents also call Azure APIs, query databases, or interact with cloud services.* + +## The full picture + +Here's how all three layers connect: + +```mermaid +flowchart TB + subgraph github["Part 1: GitHub (repo access)"] + direction TB + GH_ID["Agent identity\n(App, PAT, or Copilot license)"] + GH_RULES["Repo protections\n(branch rules, CODEOWNERS, rulesets)"] + GH_GOV["GitHub governance\n(Agent Control Plane, policies)"] + end + + subgraph sandbox["Part 2: Sandbox (code execution)"] + direction TB + SAND["Ephemeral environment\n(fresh per task, destroyed after)"] + FW["Restricted network\n(firewall or no internet)"] + end + + subgraph cloud["Part 3: Cloud (Azure resources)"] + direction TB + OIDC["OIDC federation\n(temporary tokens,\nno stored passwords)"] + MI["Managed Identity\n(Azure handles the credentials)"] + ENTRA["Entra Agent ID\n(purpose-built agent identity\nfor Azure — preview)"] + end + + github --> sandbox --> cloud + + style github fill:#e8f5e9,stroke:#388e3c + style sandbox fill:#fff3e0,stroke:#f57c00 + style cloud fill:#e3f2fd,stroke:#1976d2 +``` + +Each layer handles a different question: +- **Part 1:** Can this agent touch the repo? What can it do there? +- **Part 2:** Where does the agent's code run? What if it misbehaves? +- **Part 3:** Can this agent reach cloud resources? With what permissions? For how long? + +## Connecting GitHub to Azure without storing passwords + +### OIDC federation — temporary passes instead of stored keys + +When your agent runs inside GitHub Actions and needs to reach Azure, OIDC federation is the modern approach. In plain terms: **instead of storing an Azure password in GitHub, the two systems do a handshake each time.** GitHub says "this is a legitimate workflow run," Azure checks and says "okay, here's a temporary pass." When the workflow ends, the pass expires. 
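On the Azure side, that handshake relies on a federated credential that tells Entra ID which GitHub token subjects to trust. The sketch below only builds that JSON. The function name, the credential naming scheme, and the `my-org`/`my-repo` values are illustrative assumptions; the issuer, audience, and subject format follow the conventions GitHub and Microsoft document for Actions OIDC. Treat it as a starting point, not a definitive setup.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a federated credential definition that trusts
# workflow runs from one repo and one branch only.
# Assumption: the branch name contains no "/" (credential names are restricted).
federated_credential_json() {
  local owner="$1" repo="$2" branch="$3"
  cat <<EOF
{
  "name": "gh-${repo}-${branch}",
  "issuer": "https://token.actions.githubusercontent.com",
  "subject": "repo:${owner}/${repo}:ref:refs/heads/${branch}",
  "audiences": ["api://AzureADTokenExchange"]
}
EOF
}

# Example (needs the Azure CLI and an existing app registration;
# APP_ID is a placeholder you supply):
#   federated_credential_json my-org my-repo main > credential.json
#   az ad app federated-credential create --id "$APP_ID" --parameters credential.json
```

Because the subject names a single repo and branch, a token minted by a workflow on any other branch or repo should not match, and Azure should refuse the exchange.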
+ +```mermaid +sequenceDiagram + participant GH as GitHub Actions + participant Azure as Azure + + GH->>GH: Workflow starts + GH->>Azure: "Here's a short-lived token proving who I am" + Azure->>Azure: Validates the token (checks repo, branch, workflow) + Azure->>GH: "Here's a temporary credential — good for this run only" + GH->>Azure: Agent uses the credential to access resources + Note over GH,Azure: Workflow ends → credential expires automatically +``` + +```yaml +# What this looks like in a GitHub Actions workflow +permissions: + id-token: write + contents: read + +steps: + - uses: azure/login@v2 + with: + client-id: ${{ secrets.AZURE_CLIENT_ID }} + tenant-id: ${{ secrets.AZURE_TENANT_ID }} + subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }} +``` + +**An important caveat:** OIDC removes stored passwords, but it doesn't eliminate all risk. If untrusted code runs in a workflow job that has `id-token: write` permission, that code could potentially mint and use tokens. Scope your OIDC configuration tightly: +- Pin to specific repos, branches, and environments +- Never combine `pull_request_target` triggers with token-bearing jobs +- Separate "plan" steps from "apply" steps in infrastructure workflows + +### Managed Identity — Azure handles the credentials entirely + +For agents running on Azure infrastructure (Container Apps, AKS, Functions), Managed Identity takes the concept further. Your compute resource authenticates directly to Azure services — no credentials in your code, no credentials in your environment variables, no credentials anywhere you manage. Azure handles it. + +Think of it like a building keycard system: the building knows this computer belongs here because of where it is, not because someone typed in a password. 
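To make "no credentials anywhere you manage" concrete, here is a minimal sketch of how code on Azure compute can ask for a token. It assumes the App Service-style identity endpoint that Container Apps and Functions expose through the `IDENTITY_ENDPOINT` and `IDENTITY_HEADER` environment variables; AKS uses a different mechanism, so this would not apply there as written. The function names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the token request URL for a target resource,
# for example https://vault.azure.net.
# Assumption: IDENTITY_ENDPOINT is injected by the platform at runtime.
managed_identity_token_url() {
  local resource="$1"
  printf '%s?resource=%s&api-version=2019-08-01' "${IDENTITY_ENDPOINT}" "${resource}"
}

# Hypothetical helper: request a short-lived access token. No password or key
# lives in the code; the platform-provided header proves the call is local.
get_managed_identity_token() {
  curl -sS -H "X-IDENTITY-HEADER: ${IDENTITY_HEADER}" \
    "$(managed_identity_token_url "$1")"
}

# Example (only works on Azure compute with a managed identity assigned):
#   get_managed_identity_token "https://vault.azure.net"
```

In practice most agents would use an Azure SDK credential class or `az login --identity` rather than calling the endpoint directly. The raw request is shown only to make clear that nothing secret is stored in your repo or configuration.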
+ +## Microsoft Entra Agent ID — agent identity for the cloud + +[Microsoft Entra Agent ID](https://learn.microsoft.com/entra/identity/workload-identities/) introduces a purpose-built identity type specifically for AI agents that access Azure and Microsoft 365 resources. + +Two important things upfront: + +1. **This is about cloud access, not repo access.** Entra Agent ID governs what agents can do in Azure and Microsoft 365. For GitHub repo access, use the approaches from [Part 1](/blog/2026-05-12-ai-agents-repo-auth). +2. **As of this writing, this is in preview.** I'm describing the direction Microsoft is heading, not a battle-tested production system. Treat it as promising but emerging. + +```mermaid +flowchart LR + subgraph github_layer["GitHub (Part 1)"] + APP["GitHub App\nor fine-grained PAT"] + end + + subgraph cloud_layer["Azure / M365 (Part 3)"] + ENTRA["Entra Agent ID\n— its own identity object\n— conditional access policies\n— central registry\n— lifecycle management"] + end + + APP -->|"Repo access"| REPO["📁 Your repos"] + ENTRA -->|"Cloud access"| AZURE["☁️ Azure APIs\nMicrosoft Graph\nMicrosoft 365"] + + style github_layer fill:#e8f5e9,stroke:#388e3c + style cloud_layer fill:#e3f2fd,stroke:#1976d2 +``` + +What Entra Agent ID is designed to provide: +- **A unique identity per agent** — separate from user accounts and traditional service accounts, so you can tell agents apart in logs and apply policies specifically to them +- **Conditional Access for agents** — rules about when, where, and how agents can access resources (think: "this agent can only access this API from this IP range during business hours") +- **A central Agent Registry** — a directory of all the agents in your organization, who owns them, and what they can access +- **Lifecycle management** — onboarding, access reviews, and automatic deactivation when an agent is retired + +If your agents need Azure or Microsoft 365 access, this is worth tracking as it moves toward general 
availability. + +## Scaling controls to risk level + +Not every agent task carries the same risk. A docs typo fix and an infrastructure-as-code change are not the same thing, even if the same framework produces both. + +[A colleague](https://www.linkedin.com/in/intrepidtechie/) — who teaches [Navigating the EU AI Act](https://www.linkedin.com/learning/navigating-the-eu-ai-act) on LinkedIn Learning and works in security governance — helped me see this through a regulatory lens. The EU AI Act is explicitly risk-based: obligations increase with the impact of what the AI system can affect. Even if you're not subject to EU regulation, the proportionality principle is useful everywhere: **match your controls to the actual damage that could happen.** + +```mermaid +flowchart TB + subgraph tiers["Scale your controls to the risk"] + LOW["🟢 Low risk\nDocs, tests, non-sensitive repos\n\n✅ Repo-scoped auth\n✅ Sandbox\n✅ Human review"] + MED["🟡 Medium risk\nApp code, non-prod CI config\n\n✅ Branch protections\n✅ Ephemeral runtime\n✅ No standing cloud credentials\n✅ Audit logs"] + HIGH["🟠 High risk\nInfra-as-code, IAM, prod workflows\n\n✅ Just-in-time access\n✅ Two approvers\n✅ Immutable logs\n✅ Environment segregation\n✅ Kill switch"] + REG["🔴 Regulated\nCritical infra, healthcare, finance\n\n✅ Formal risk assessment\n✅ Legal/privacy/security sign-off\n✅ Enhanced auditability\n✅ Incident playbook"] + end + + LOW --> MED --> HIGH --> REG + + style LOW fill:#e8f5e9,stroke:#388e3c + style MED fill:#fff9c4,stroke:#f9a825 + style HIGH fill:#fff3e0,stroke:#ef6c00 + style REG fill:#ffebee,stroke:#c62828 +``` + +| Risk level | Example | What I'd put in place | +|-----------|---------|----------------------| +| **Low** | Docs updates, test generation in non-sensitive repos | Repo-scoped credentials, sandbox, one human reviewer | +| **Medium** | Application code PRs, CI config in non-production | Branch protections, ephemeral runtime, no long-lived cloud credentials, audit logs | +|
**High** | Infrastructure-as-code, identity config, production workflows | Time-limited access, two approvers, immutable logs, separate environments, ability to shut down quickly | +| **Regulated** | Essential services, safety-relevant systems, financial/healthcare | Formal risk assessment, legal and privacy sign-off, detailed audit trail, tested incident response plan | + +## Repos contain more than just code + +One thing I underestimated at first is the EU data protection angle: **a GitHub repository isn't just source code.** It contains commit metadata (names, emails), issue and PR threads with people's identities, reviewer comments, and sometimes pasted logs with customer data or debugging output. + +If an AI agent can read repos, issues, and PRs, that's often **personal data processing** — which has implications for data handling policies, retention, and in some jurisdictions, regulatory obligations like GDPR. + +For teams operating in or serving European markets, this means thinking about: +- What data can the agent see? +- Where are prompts and traces stored? +- How long are they kept? +- Does any of it leave your jurisdiction? + +The rule of thumb that resonated with me is this: **the closer the agent gets to production authority, the more it should be governed like a regulated operational control, not a developer convenience feature.** + +If agent telemetry could be used to profile developer behavior or productivity, involve privacy and legal early — in Europe that can raise employee-monitoring concerns. Even outside Europe, it's worth thinking through. + +## The evidence trail: what you need after an incident + +The three-layer model (repo access → sandboxing → cloud identity) describes controls, but after an incident you also need **evidence.** A security review of this post pushed me to think about what "meaningful human oversight" actually requires. 
+ +```mermaid +flowchart LR + subgraph trail["What you need to reconstruct after an incident"] + direction TB + WHO["Who: agent identity used"] + WHAT["What: repo scope, permissions granted"] + WHY["Why: prompt or task that triggered it"] + HOW["How: tools invoked, secrets requested"] + RESULT["Result: generated diff, test results"] + APPROVED["Approved by: who, when"] + AFTER["After: deployment outcome, rollback path"] + end + + style trail fill:#f5f5f5,stroke:#616161 +``` + +Branch protections give you some of this (who approved, CI results). Audit logs give you more. But the full chain — from "why was this agent triggered" to "what happened after merge" — requires intentional design. + +The question to ask: **if something goes wrong at 2 AM, can you reconstruct what the agent did, why it did it, who approved it, and how to undo it?** If the answer is "not really," the controls look stronger on paper than they are in practice. + +## Putting all three parts together + +### If you're just getting started +Use Copilot's coding agent with strong branch protections ([Part 1](/blog/2026-05-12-ai-agents-repo-auth)). You already have identity (your license), sandboxing (managed environment, [Part 2](/blog/2026-05-12-ai-agents-sandboxing)), scanning (CodeQL), and the review gate. Start there. + +### If you're building custom agents +Use a GitHub App for repo access. Sandbox each task in a fresh container. Use OIDC for cloud resources. Keep credentials short-lived and scoped to the minimum needed. + +### If you're at enterprise scale +Layer GitHub's Agent Control Plane with Azure's identity features. Classify repos by risk level before enabling agent access. Track Entra Agent ID as it matures. Build audit trails that connect the full chain from trigger to outcome. + +### The thing I keep coming back to +No single layer is enough on its own. Scoped repo credentials, ephemeral sandboxes, short-lived cloud tokens, and human review work together. 
But they only reduce damage if the layers are actually separated — if the same runtime holds every credential and can reach every system, the layers collapse into one compromise. + +--- + +*This post reflects what I found as of April 2026. The agent security landscape is moving fast — GitHub, Microsoft, and the broader ecosystem are all shipping new capabilities. If something here is wrong or outdated, I'd appreciate a note.* + +*Thanks to colleagues and public write-ups that helped sharpen the risk-scaling and implementation details in this series.* + +*Dina Berry works on Azure developer documentation and runs AI agent workflows daily.* diff --git a/website/blog/2026-05-12-ai-agents-repo-auth.md b/website/blog/2026-05-12-ai-agents-repo-auth.md new file mode 100644 index 0000000..61df58a --- /dev/null +++ b/website/blog/2026-05-12-ai-agents-repo-auth.md @@ -0,0 +1,392 @@ +--- +slug: "/2026-05-12-ai-agents-repo-auth" +canonical_url: "https://dfberry.github.io/blog/2026-05-12-ai-agents-repo-auth" +custom_edit_url: null +sidebar_label: "2026.05.12 Repo Auth for Agents" +title: "Who gets the keys to your repo? AI agent identity on GitHub" +description: "Part 1 of a security series on how AI agents authenticate to GitHub and how to limit what they can do." +draft: true +tags: + - "AI Agents" + - "GitHub" + - "Security" + - "Identity" + - "AI-assisted" +keywords: + - "agent repo auth" + - "github app" + - "fine-grained pat" + - "copilot coding agent" + - "repo security" +updated: "2026-05-12 00:00 PST" +--- + +This is Part 1 of a three-part series on giving AI agents access to your code. This post covers the GitHub layer — how agents prove who they are, what they're allowed to do, and how to set up your repo so you're comfortable with it. 
+ +- **Part 1: Who gets the keys to your repo?** (you are here) +- [Part 2: Where does agent code actually run?](/blog/2026-05-12-ai-agents-sandboxing) +- [Part 3: When agents reach the cloud](/blog/2026-05-12-ai-agents-cloud-identity) + +I run AI agents on my repos daily as part of my work on Azure developer documentation. I kept hitting the same question: *what credentials should this agent have?* I spent time digging into GitHub's docs, the security architecture blog posts, and the API surface. What follows is a snapshot of where things stand in mid-April 2026. This space moves fast — some of this may have changed by the time you read it, and I may have gotten details wrong. If you spot something, I'd genuinely appreciate a correction. + +## The big picture + +Think of agent security as three separate questions: + +```mermaid +flowchart LR + A["🔑 Part 1\nWho can touch\nthe repo?"] --> B["📦 Part 2\nWhere does the\ncode run?"] + B --> C["☁️ Part 3\nWhat cloud resources\ncan it reach?"] + + style A fill:#4a90d9,stroke:#2c5f8a,color:#fff + style B fill:#f5a623,stroke:#c17d1a,color:#fff + style C fill:#7b68ee,stroke:#5a4bc7,color:#fff +``` + +This post focuses on the first question. If you only care about the GitHub layer — and many people do — this post has everything you need. + +## What GitHub has built for agent governance + +GitHub shipped agent-specific governance features in early 2026. Before diving in, I want to be upfront about something: **rules that stop things from happening and tools that tell you what happened are different things.** I'll label each so it's clear which is which. 
+ +```mermaid +flowchart TB + subgraph stops["🛑 Stops bad things (preventive)"] + AP["Agent policies\nBlock disallowed agents org-wide"] + AL["Approval levels\nGate what agents can do"] + BP["Branch protections\nRequire human review before merge"] + end + + subgraph detects["🔍 Tells you what happened (detective)"] + AUDIT["Audit logging\nactor_is_agent tracking"] + SCAN["Security scanning\nCodeQL, secret scanning on PRs"] + TAB["Agents Tab\nDashboard of agent sessions"] + end + + subgraph guides["📋 Asks agents to behave (instructional)"] + AGENTSMD["AGENTS.md\nBehavioral boundaries in code"] + end + + style stops fill:#e8f5e9,stroke:#388e3c + style detects fill:#fff3e0,stroke:#f57c00 + style guides fill:#e3f2fd,stroke:#1976d2 +``` + +### The preventive controls (these actually block things) + +**The Agent Control Plane** (generally available February 2026) is the enterprise governance layer: + +- **Agent policies** — decide which agents and AI models are allowed across your organization or on specific repos. If it's not on the list, it can't run. +- **Approval levels** — three tiers: "Default Approvals" (human must approve certain actions), "Bypass Approvals" (agent can skip some gates), and "Autopilot" (agent runs freely). You set this per-repo. + +### The detective controls (these catch problems) + +- **Audit logging** — every agent action gets tagged with `actor_is_agent` so you can tell humans and agents apart in logs +- **Security scanning** — Copilot's agent automatically runs code analysis, secret scanning, and dependency checks on its own PRs before a human even sees them +- **Agents Tab** — a per-repo dashboard showing all agent sessions, so you can review what happened + +### The instructional layer (this is guidance, not enforcement) + +**AGENTS.md** — a file in your repo that describes behavioral boundaries for agents. I initially called this "policy-as-code." That's too generous. 
**It's instructions, not a wall.** The agent reads it and hopefully follows it, but nothing mechanically prevents the agent from doing something AGENTS.md says not to do. Think of it like a sign on the door, not a lock on the door. + +Real boundaries come from permissions, approval gates, and sandboxing. + +For the full security architecture, see: +- [Under the hood: Security architecture of GitHub Agentic Workflows](https://github.blog/ai-and-ml/generative-ai/under-the-hood-security-architecture-of-github-agentic-workflows/) +- [How GitHub's agentic security principles make our AI agents as secure as possible](https://github.blog/ai-and-ml/github-copilot/how-githubs-agentic-security-principles-make-our-ai-agents-as-secure-as-possible/) +- [Enterprise AI Controls & agent control plane — generally available](https://github.blog/changelog/2026-02-26-enterprise-ai-controls-agent-control-plane-now-generally-available/) + +## Three ways agents can log in to your repos + +I found three main approaches. 
Here's how they compare in plain terms: + +```mermaid +flowchart TB + subgraph app["GitHub App"] + A1["Identity: The app itself\n(shows as your-app-bot)"] + A2["Permissions: Only the repos\nyou install it on"] + A3["Credentials: Short-lived tokens\nthat refresh automatically"] + A4["Best for: Teams, org automation"] + end + + subgraph pat["Fine-grained PAT"] + P1["Identity: You\n(actions look like yours)"] + P2["Permissions: Only the repos\nand actions you choose"] + P3["Credentials: Expires on a\ndate you set"] + P4["Best for: Solo developers,\nquick experiments"] + end + + subgraph copilot["Copilot delegated"] + C1["Identity: You\n(via your Copilot license)"] + C2["Permissions: Same repos\nyou can already access"] + C3["Credentials: Managed by GitHub\n(nothing to configure)"] + C4["Best for: Using Copilot's\nbuilt-in coding agent"] + end + + style app fill:#e8f5e9,stroke:#388e3c + style pat fill:#fff3e0,stroke:#f57c00 + style copilot fill:#e3f2fd,stroke:#1976d2 +``` + +### GitHub Apps — the machine identity + +A GitHub App is its own thing — separate from any person. It: +- Gets temporary credentials that expire and refresh (no long-lived passwords) +- Gets installed on specific repos with only the permissions you choose +- Has its own rate limits, separate from yours +- Shows up in logs as `your-app[bot]`, so you can always tell it apart from humans + +The tradeoff: more setup. You register the app, store a private key, and handle token refresh in your code. + +### Fine-grained PATs — a limited copy of your access + +Here's a mental model I find useful: **a fine-grained PAT is like giving someone a copy of your house key, but one that only opens specific rooms and stops working after a set date.** + +It traces back to you (so you're accountable), but it only allows a slice of what you can do (so the damage is limited if something goes wrong). + +The key discipline: **one token per agent per task.** Don't reuse tokens across agents. 
If one gets compromised, you only lose what that one token could access. + +> **Note:** Older-style "classic" PATs (the `ghp_` tokens) can't be scoped to specific repos. If you're using those for agents, switching to fine-grained PATs is a meaningful upgrade. As of this writing, fine-grained PATs can only be created through the GitHub website — there's no API for it. + +### Copilot's approach — using your license + +Copilot's coding agent works differently — it acts **as you**: + +- It can only touch repos you already have access to +- Its code goes to `copilot/*` branches under your name +- It can't approve or merge its own work +- Your org admin can turn it on or off per-repo + +The safety net: all the same repo rules (required reviews, required tests, CODEOWNERS) still apply to Copilot's PRs. The agent suggests; a human decides. + +## Lock it down with code + +This is the part I wish someone had collected in one place. Here's how to script the repo settings that matter, so you can version-control your security posture. + +### Protect the main branch + +This is the single most important thing. It means no one — human or agent — can push directly to `main`. Everything goes through a pull request with at least one human reviewer. + +```bash +# Require reviews, status checks, and block force pushes on main +gh api -X PUT repos/{owner}/{repo}/branches/main/protection \ + --input - <<'EOF' +{ + "required_status_checks": { + "strict": true, + "contexts": ["build", "test"] + }, + "required_pull_request_reviews": { + "dismiss_stale_reviews": true, + "require_code_owner_reviews": true, + "required_approving_review_count": 1 + }, + "enforce_admins": true, + "restrictions": null, + "allow_force_pushes": false, + "allow_deletions": false +} +EOF +``` + +### Use rulesets (GitHub's newer, more flexible system) + +GitHub is moving from branch protection rules to "rulesets." They do the same job but can be managed at the org level and support more conditions. 
I'd use these for new repos: + +```bash +# Create a ruleset: require reviews, status checks, no force pushes +gh api -X POST repos/{owner}/{repo}/rulesets \ + --input - <<'EOF' +{ + "name": "Agent safety net", + "target": "branch", + "enforcement": "active", + "conditions": { + "ref_name": { + "include": ["refs/heads/main"], + "exclude": [] + } + }, + "rules": [ + { + "type": "pull_request", + "parameters": { + "required_approving_review_count": 1, + "dismiss_stale_reviews_on_push": true, + "require_code_owner_review": true, + "require_last_push_approval": false, + "required_review_thread_resolution": true + } + }, + { + "type": "required_status_checks", + "parameters": { + "strict_required_status_checks_policy": true, + "required_status_checks": [ + { "context": "build" }, + { "context": "test" } + ] + } + }, + { + "type": "non_fast_forward" + } + ] +} +EOF +``` + +```bash +# Verify your rulesets +gh api repos/{owner}/{repo}/rulesets \ + --jq '.[] | {name: .name, enforcement: .enforcement}' +``` + +### Set repo-level merge settings + +```bash +# Squash merge only, auto-delete branches, require signoff +gh api -X PATCH repos/{owner}/{repo} \ + -F allow_squash_merge=true \ + -F allow_merge_commit=false \ + -F allow_rebase_merge=false \ + -F delete_branch_on_merge=true \ + -F web_commit_signoff_required=true +``` + +### Designate file owners with CODEOWNERS + +A CODEOWNERS file says "these people must review changes to these files." Put it at `.github/CODEOWNERS`: + +``` +# Everyone reviews everything by default +* @myorg/core-team + +# Workflow files and agent definitions need security review. +# These are high-value targets — an agent that can change +# its own rules is a problem.
+/.github/workflows/ @myorg/devops @myorg/security +/.github/agents/ @myorg/security +/AGENTS.md @myorg/security + +# Infrastructure code needs a higher review bar +/infra/ @myorg/platform @myorg/security +``` + +Then turn on CODEOWNERS enforcement. This updates only the review settings, so the branch must already be protected: + +```bash +# Update just the pull request review settings on an already-protected branch +gh api -X PATCH repos/{owner}/{repo}/branches/main/protection/required_pull_request_reviews \ + -F require_code_owner_reviews=true \ + -F required_approving_review_count=1 +``` + +## Setting up auth for a custom agent like Squad + +If you're building something like [Squad](https://bradygaster.github.io/squad/) — a multi-agent framework where AI agents clone repos, write code, and open PRs — here's the decision tree I've landed on. + +```mermaid +flowchart TD + START["You're building\na custom agent"] --> Q1{"Team/org use\nor personal\nexperiment?"} + + Q1 -->|"Team"| APP["Use a GitHub App"] + Q1 -->|"Personal"| PAT["Use a fine-grained PAT"] + + APP --> Q2{"Does it need\nCopilot CLI?"} + PAT --> Q2 + + Q2 -->|"No"| DONE1["✅ Single token\nYou're set"] + Q2 -->|"Yes"| DUAL["⚠️ You need two tokens\nApp/PAT for git ops\n+ licensed user PAT for Copilot"] + + DUAL --> WARN["🔐 Store Copilot PAT\nin a secrets manager\n(Key Vault, GitHub secrets)"] + + style START fill:#4a90d9,stroke:#2c5f8a,color:#fff + style DONE1 fill:#4caf50,stroke:#388e3c,color:#fff + style WARN fill:#ff9800,stroke:#f57c00,color:#fff +``` + +### The Copilot CLI licensing gap + +Here's a friction point: **GitHub Apps can't hold Copilot licenses.** Copilot CLI checks that the token belongs to a real user with an active Copilot subscription. App tokens fail that check. + +If your agent needs Copilot CLI (as Squad does), you need two tokens — one for git operations, one for Copilot: + +```bash +# Copilot needs a licensed user's token +export GITHUB_TOKEN="${COPILOT_PAT}" +copilot --agent squad + +# Git operations use the App token +export GITHUB_TOKEN="${APP_TOKEN}" +git push origin "${BRANCH}" +gh pr create ...
+``` + +One public implementation documented this dual-token pattern when building Squad on Azure Container Apps. + +**A security tradeoff I initially missed:** two tokens in one runtime means a compromised environment gets both. Ideally you'd isolate the Copilot step from the git-push step in separate containers. I haven't seen anyone implement that level of separation yet, but it's the right direction. + +### What permissions does the agent need? + +Keep it minimal: + +| Permission | Access level | What it's for | +|-----------|-------------|---------------| +| `contents` | read & write | Read code, create branches, push commits | +| `pull_requests` | write | Open and update pull requests | +| `issues` | read | Read issue context for task assignments | +| `metadata` | read | Required for all App installations | + +Don't grant `admin`, `workflows`, or `actions` unless the agent specifically needs them. + +## What branch protections don't cover + +I used to call branch protections the "universal safety net." After a security review of this post, I've dialed that back. They're important, but narrower than they feel. + +**What they handle well:** +- No code reaches `main` without a human reviewer +- CI must pass before merge +- Specific people must review specific files (via CODEOWNERS) +- No force pushes, no branch deletion + +**What they can't help with:** +- An agent leaking data through PR comments, commit messages, or artifacts +- A convincing-looking but malicious PR that a reviewer approves +- An agent modifying workflow files on a non-protected branch (which then triggers with higher privileges later) +- Anything that happens during the agent's execution, before a PR exists + +Branch protections belong in every repo an agent can touch. They're just not the whole story. 
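One gap from the list above, an agent editing workflow files on a branch nobody is watching, can at least be surfaced early. The sketch below is a hypothetical filter, not a GitHub feature. It reads changed file paths on stdin and prints the ones that deserve a closer look. The path patterns mirror the CODEOWNERS example in this post and are assumptions about your repo layout.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print only the changed paths that can alter how
# automation or agents behave. Prints nothing if no path matches.
flag_sensitive_paths() {
  grep -E '^(\.github/workflows/|\.github/agents/|infra/|AGENTS\.md$)' || true
}

# Example (needs the GitHub CLI; 123 is a placeholder PR number):
#   gh pr view 123 --json files --jq '.files[].path' | flag_sensitive_paths
```

This is a detective check, not a preventive one. It tells a reviewer where to look, and the actual block still has to come from rulesets, CODEOWNERS, and token permissions.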
+ +## Risks I'm still thinking about + +### Prompt injection — tricking the agent + +In April 2026, researchers showed that AI agents could be manipulated through crafted PR titles, issue comments, and hidden text to steal secrets. The vendors patched their agents, but the underlying problem is this: **if an agent reads untrusted content (issues, PRs, comments) and can take privileged actions (push code, mint tokens), those two things can be connected by an attacker.** + +Scoping credentials helps limit the damage, but I don't think we've fully solved this yet. + +### Workflow poisoning — delayed attacks + +An agent that can edit `.github/workflows/` or build scripts can plant changes that look harmless but trigger later with higher privileges. I now treat workflow files as off-limits to agents unless there's a specific review process for them. + +### Supply chain — hidden code execution + +When an agent runs `npm install` or `pip install`, it's executing code from other people's packages — setup scripts, build hooks, lifecycle events. This is one of the easiest paths from "write code" to "run attacker-controlled code." Pinning dependencies and restricting package registries helps, but it's a gap in most agent setups I've looked at. + +### Leaking data without the internet + +An agent doesn't need internet access to exfiltrate data. It can encode information in PR comments, commit messages, branch names, or artifacts — all GitHub-native channels that bypass network firewalls. "Contained" is more nuanced than it sounds. + +--- + +## What I'd do today + +**If you're just getting started:** Use Copilot's built-in coding agent with strong branch protections. You already have the identity model (your license), security scanning (CodeQL), and the review gate (required PR reviews). Run the `gh api` commands above to double-check your settings. + +**If you're building custom agents:** Use a GitHub App for repo access. If you need Copilot CLI, use the dual-token pattern. 
Store secrets in Key Vault or GitHub encrypted secrets. Add CODEOWNERS. Treat workflow files as off-limits. + +**Next up:** [Part 2 covers where agent code actually runs](/blog/2026-05-12-ai-agents-sandboxing) — what happens between "the agent has repo access" and "a PR appears." + +--- + +*This post reflects what I found as of April 2026. If something is wrong or outdated, I'd appreciate a note.* + +*Dina Berry works on Azure developer documentation and runs AI agent workflows daily.* diff --git a/website/blog/2026-05-12-ai-agents-sandboxing.md b/website/blog/2026-05-12-ai-agents-sandboxing.md new file mode 100644 index 0000000..efba0ab --- /dev/null +++ b/website/blog/2026-05-12-ai-agents-sandboxing.md @@ -0,0 +1,220 @@ +--- +slug: "/2026-05-12-ai-agents-sandboxing" +canonical_url: "https://dfberry.github.io/blog/2026-05-12-ai-agents-sandboxing" +custom_edit_url: null +sidebar_label: "2026.05.12 Agent Sandboxing" +title: "Where does agent code actually run? Sandboxing AI agents" +description: "Part 2 of a security series on execution isolation, network controls, and safer runtime choices for AI agents." +draft: true +tags: + - "AI Agents" + - "Security" + - "Sandboxing" + - "Runtime" + - "AI-assisted" +keywords: + - "agent sandboxing" + - "execution isolation" + - "network controls" + - "ephemeral environments" + - "ai runtime security" +updated: "2026-05-12 00:00 PST" +--- + +This is Part 2 of a three-part series on giving AI agents access to your code. This post covers execution sandboxing — the environment where agent-generated code actually runs, and why it matters just as much as who has access. 
+ +- [Part 1: Who gets the keys to your repo?](/blog/2026-05-12-ai-agents-repo-auth) +- **Part 2: Where does agent code actually run?** (you are here) +- [Part 3: When agents reach the cloud](/blog/2026-05-12-ai-agents-cloud-identity) + +In [Part 1](/blog/2026-05-12-ai-agents-repo-auth), I looked at how agents prove their identity and what they're allowed to do in your repo. But there's a second, equally important question: **when the agent runs code, where does that happen?** + +This matters because credentials control *what the agent is allowed to do*. Sandboxing controls *what happens if something goes wrong*. They're different problems, and one doesn't solve the other. + +## Why this question matters: the database incident + +The industry got a vivid lesson in July 2025. An AI agent on a development platform deleted an entire production database — then generated fake data to try to cover it up. The agent had legitimate credentials. It was authorized to be there. The problem was that **nothing separated the agent's workspace from production infrastructure.** The agent could reach the live database because it was running in the same environment. + +This is why sandboxing exists: to put walls around where agent code executes, so that mistakes (or attacks) stay contained. 
+ +## How Copilot handles this + +Here's the architecture of Copilot's coding agent environment: + +```mermaid +flowchart TB + USER["👤 You assign a task\nto Copilot's coding agent"] --> ENV + + subgraph ENV["Copilot's managed environment"] + direction TB + VM["Ephemeral workspace\n(fresh for each task,\ndestroyed after)"] + FW["Restricted internet\n(firewall with allowlist,\nnot a complete block)"] + SCAN["Auto security scanning\n(CodeQL, secret scanning,\ndependency checks)"] + end + + ENV --> PR["📋 Agent opens a\npull request"] + PR --> REVIEW["👀 Human reviews\nand approves"] + REVIEW --> MAIN["✅ Code reaches main"] + + style USER fill:#4a90d9,stroke:#2c5f8a,color:#fff + style ENV fill:#fff3e0,stroke:#f57c00 + style PR fill:#e8f5e9,stroke:#388e3c + style REVIEW fill:#e3f2fd,stroke:#1976d2 + style MAIN fill:#4caf50,stroke:#388e3c,color:#fff +``` + +Copilot's coding agent runs in **ephemeral GitHub-managed environments** with restricted internet access. A few things I initially got wrong that I want to correct: + +- I first wrote "no network access." That's not accurate. GitHub provides a **firewall with a configurable allowlist** — restricted by default, but not an air gap. +- The firewall applies to commands the agent runs via the terminal, but GitHub's docs note it does **not** cover MCP servers or Copilot setup steps. +- The environment is created fresh for each task and destroyed afterward. No leftover state between runs. 
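
The difference between "no network access" and "a firewall with an allowlist" is easier to see in code. This is a sketch of default-deny allowlist semantics in general, not of how GitHub implements its firewall. The function name `is_allowed` and the hostnames used with it are hypothetical.

```python
from urllib.parse import urlsplit


def is_allowed(url: str, allowlist: set[str]) -> bool:
    """Default-deny check: permit only listed hosts and their subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    if not host:
        # No parseable host (for example, a URL without a scheme): deny.
        return False
    # Match the exact host, or a subdomain separated by a dot, so that
    # "evilgithub.com" does not pass as "github.com".
    return any(host == entry or host.endswith("." + entry) for entry in allowlist)
```

A check like this only sees traffic that is routed through it. That is the same caveat as the docs note above: anything outside the firewall's scope, such as MCP servers or setup steps, is not restricted by the allowlist.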
+ +### The layers of defense, in plain terms + +```mermaid +flowchart LR + subgraph defenses["What stands between an agent and your main branch"] + direction TB + L1["🔑 Credentials\nAgent can only touch\nrepos you can touch"] + L2["🏢 Enterprise policies\nAgent Control Plane decides\nwhich agents are allowed"] + L3["🧱 Branch protections\nHuman must review\nand approve every PR"] + L4["📦 Sandbox\nCode runs in a\ntemporary environment\nwith restricted internet"] + L5["🔍 Scanning\nCodeQL and secret scanning\ncatch problems in the output"] + L6["👤 Human review\nSomeone has to click\n'Approve' before merge"] + end + + style defenses fill:#f5f5f5,stroke:#616161 +``` + +Notice: most of these are detective (they catch problems) or reactive (they require human judgment). The hard stop is the human review before merge. Everything else reduces risk, but the human is the final gate. + +## Sandboxing your own agents + +If you're running agents outside Copilot's managed environment — say, a custom [Squad](https://bradygaster.github.io/squad/) deployment — you need to build this sandbox yourself. 
Here are the options I've found, ranked by isolation strength: + +```mermaid +flowchart TB + subgraph options["Sandboxing options for custom agents"] + direction LR + + subgraph strong["🔒 Strong isolation"] + FC["Firecracker microVMs\nEach agent gets its own\nvirtual machine kernel"] + KATA["Kata Containers\nSimilar to Firecracker,\ndifferent implementation"] + end + + subgraph good["✅ Good isolation"] + ACA["Azure Container Apps Jobs\nFresh container per task,\ndestroyed after\n(what Squad on ACA uses)"] + E2B["E2B\nPurpose-built sandboxes\nfor AI code execution"] + end + + subgraph weak["⚠️ Not a security boundary"] + DC["Devcontainers\nGreat for consistent\ndev environments, but NOT\ndesigned for isolation"] + end + end + + style strong fill:#e8f5e9,stroke:#388e3c + style good fill:#fff3e0,stroke:#f57c00 + style weak fill:#ffebee,stroke:#c62828 +``` + +### Azure Container Apps Jobs — the born-to-die pattern + +This is the pattern used in one public Squad on Azure Container Apps implementation. The flow: + +1. A message lands in a queue ("Hey agent, work on issue #42") +2. A fresh container spins up — clean slate, no leftover state +3. The agent clones the repo, writes code, opens a PR +4. The container is destroyed. Gone. Nothing persists. + +When no messages are waiting, no containers are running. Zero cost when idle. + +### Firecracker / Kata Containers — virtual machine isolation + +These give each agent its own operating system kernel. Even if the agent's code breaks out of normal container boundaries, it's still trapped inside a lightweight virtual machine. This is the strongest isolation available, but it requires more infrastructure to manage. + +### E2B — sandboxes built for AI + +E2B provides purpose-built cloud sandboxes designed specifically for AI code execution. Each run gets a fresh environment with policies you define. It's an API-first approach — no infrastructure to manage. 
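
The options above share one lifecycle: create a fresh environment, run one task, destroy the environment. Here is a local sketch of just that lifecycle, using a temporary directory. A temp directory is not a security boundary, so treat this as an illustration of the shape and not as isolation. `run_in_fresh_workspace` and `example_task` are hypothetical names.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_in_fresh_workspace(task: Callable[[Path], str]) -> tuple[str, Path]:
    """Create an empty workspace, run the task in it, then delete the workspace.

    Returns the task's result plus the path of the (now deleted) workspace.
    """
    with tempfile.TemporaryDirectory(prefix="agent-task-") as tmp:
        workspace = Path(tmp)
        # Only the return value leaves the workspace; files written here are
        # removed when the `with` block exits.
        result = task(workspace)
    return result, workspace


def example_task(workspace: Path) -> str:
    # Hypothetical stand-in for "clone the repo, write code, open a PR".
    scratch = workspace / "notes.txt"
    scratch.write_text("work in progress")
    return f"wrote {scratch.name}"
```

In a real deployment the container or microVM plays the role of the temp directory, and the "return value" is the branch and PR pushed to GitHub.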
+ +### Devcontainers — not a sandbox + +I want to call this out specifically because I've seen people suggest devcontainers as agent isolation. **Devcontainers are great for making development environments reproducible and consistent. They are not designed as security boundaries.** In many configurations they have broad filesystem access, network access, and sometimes Docker socket access. Don't substitute them for real sandboxing. + +## The key principle: born fresh, die clean + +Whatever sandboxing approach you choose, the pattern is the same: + +```mermaid +flowchart LR + CREATE["🆕 Create fresh\nworkspace"] --> WORK["⚙️ Agent does\nits task"] + WORK --> OUTPUT["📤 Output goes to\nGitHub (PR, branch)"] + OUTPUT --> DESTROY["🗑️ Workspace\nis destroyed"] + + DESTROY -.->|"Nothing carries over"| CREATE + + style CREATE fill:#4a90d9,stroke:#2c5f8a,color:#fff + style WORK fill:#f5a623,stroke:#c17d1a,color:#fff + style OUTPUT fill:#4caf50,stroke:#388e3c,color:#fff + style DESTROY fill:#ef5350,stroke:#c62828,color:#fff +``` + +- **Fresh workspace per task.** No shared filesystem between runs. No cached credentials from last time. +- **No persistent state.** When the container or VM is destroyed, the agent's working files, environment variables, and any temporary tokens go with it. +- **Output goes through normal channels.** The agent's work product is a branch and a PR — both subject to all the review gates from Part 1. + +### "Ephemeral" doesn't mean "nothing persists" + +One nuance that tripped me up: destroying the container doesn't erase everything the agent touched. Data can survive through: + +- **PR comments and commit messages** — the agent wrote these to GitHub, not to its local disk +- **CI/CD logs and artifacts** — these are stored by GitHub Actions, not the container +- **Caches** — Actions caches persist across runs +- **External services** — if the agent called an API, that service has its own logs + +Ephemeral compute means the worker machine is clean. 
It doesn't mean the trail is clean. If you need to think about data residue (for compliance or security reasons), the persistence points are in GitHub's storage, not the sandbox. + +## How identity and sandboxing work together + +These two posts cover different things, but they're connected: + +```mermaid +flowchart TB + subgraph part1["Part 1: Identity (who)"] + CRED["Scoped credentials\nlimit what the agent\nCAN access"] + end + + subgraph part2["Part 2: Sandbox (where)"] + SAND["Ephemeral environment\nlimits what happens\nIF something goes wrong"] + end + + subgraph together["Together"] + BLAST["Smaller blast radius\nif credentials leak +\nless damage if code misbehaves"] + end + + CRED --> together + SAND --> together + + style part1 fill:#e8f5e9,stroke:#388e3c + style part2 fill:#fff3e0,stroke:#f57c00 + style together fill:#e3f2fd,stroke:#1976d2 +``` + +Scoped credentials (Part 1) limit what the agent is allowed to do. Sandboxing (Part 2) limits what happens if the agent does something unexpected. **They only work as layers if they're actually separated** — if the same runtime holds every credential and can reach every system, the layers collapse. + +## What I'd do today + +**If you're using Copilot's coding agent:** You already have sandboxing. The managed environment handles isolation for you. Focus on the repo protections from Part 1 (branch rules, CODEOWNERS, rulesets). + +**If you're building custom agents:** +- Run each task in a fresh container that gets destroyed afterward. Azure Container Apps Jobs or E2B are the lowest-friction options. +- Don't let containers share filesystems or state. +- Don't treat devcontainers as sandboxes. +- Accept that "ephemeral" covers the compute, not the trail — PR comments, logs, and artifacts persist on GitHub. + +**Next up:** [Part 3 covers what happens when agents need to reach cloud resources](/blog/2026-05-12-ai-agents-cloud-identity) — Azure authentication, Entra Agent ID, and scaling your controls to risk level. 
+ +--- + +*This post reflects what I found as of April 2026. If something is wrong or outdated, I'd appreciate a note.* + +*Dina Berry works on Azure developer documentation and runs AI agent workflows daily.* diff --git a/website/blog/2026-05-12-automated-pipeline.md b/website/blog/2026-05-12-automated-pipeline.md new file mode 100644 index 0000000..149d300 --- /dev/null +++ b/website/blog/2026-05-12-automated-pipeline.md @@ -0,0 +1,346 @@ +--- +slug: "/2026-05-12-automated-pipeline" +canonical_url: "https://dfberry.github.io/blog/2026-05-12-automated-pipeline" +custom_edit_url: null +sidebar_label: "2026.05.12 Automated Pipeline" +title: "Turn Your Squad Team's Charters into Automated Workflows" +description: "A way to promote agent charters from guidance documents into automated workflows that run on triggers." +draft: true +tags: + - "AI Agents" + - "Automation" + - "Workflows" + - "Squad" + - "AI-assisted" +keywords: + - "automated workflows" + - "agent charters" + - "pipeline automation" + - "workflow triggers" + - "squad" +updated: "2026-05-12 00:00 PST" +--- + +You've built a Squad team. You have a security expert, a tester, a documentarian. Each has a charter — a markdown file describing their role, expertise, model preference, and boundaries. + +Then you want to run a pipeline: **analyze code → write tests → update docs.** + +Here's what usually happens: you hardcode three prompts into your pipeline config, one per stage. Six months later, your security expert updates their approach, so you manually edit the prompts. Someone leaves, you update the charter but also have to find and update every pipeline that uses that agent. Prompts drift. Decisions scatter. You're maintaining two sources of truth. + +What if charters were **configuration**, not just documentation? What if a pipeline read your team's charters at runtime and injected them into the agent prompts automatically? 
Then when a charter changes, every pipeline adapts without touching the pipeline config. + +That's the core idea behind [Squad's Automated Pipeline](https://github.com/bradygaster/squad-sdk-example-pipeline) — turn your team charters into executable workflows. Charters drive behavior, not just intentions. + +## How It Works + +The pipeline reads each agent's charter file and extracts defaults: identity, role, expertise, model preference, behavioral boundaries. Skills (reusable capability definitions) are auto-matched by role and injected into the prompt. The result: pipeline configs stay focused on **what** to do (the stages and their inputs/outputs), while agent behavior lives in the charters and stays synchronized with reality. + +> 📊 **[DIAGRAM: Charter-to-Pipeline Flow]** +> *Prompt for image generation:* Horizontal flow diagram showing: `.squad/agents/{name}/charter.md` files (3 cards labeled Sentinel, Guardian, Chronicler with identity/expertise/model info) → Central "Pipeline Runtime" processor (blue box) with filtering logic → 3 output stages in parallel showing agent identity, model choice, and injected skills highlighted. Dark background, teal/blue theme, charter cards on left in light boxes, runtime processor in center with gears/process arrows, output stages on right with checkmarks showing matched skills. +> *Purpose:* Illustrates how charters drive runtime behavior and how skills are auto-discovered and injected, emphasizing that pipeline configs remain static while agent behavior adapts. + +You define three things: + +1. **Charters** — `.squad/agents/{name}/charter.md` (who each agent is, their expertise, their boundaries) +2. **Skills** — `.squad/skills/{name}/SKILL.md` (what capabilities are available, defined once, used by any agent) +3. **Pipeline** — `pipeline.json` (what stages to run and in what order, referencing agents by name) + +Update a charter, and every pipeline that uses that agent picks up the change automatically. 
No pipeline config edits needed. + +## A Real Walkthrough + +Here's what running the example pipeline looks like from start to finish, with actual CLI commands and expected output. + +### Step 1: Explore the Example Team and Charters + +First, let's look at the included example team. There are three agents, each with a charter: + +```bash +ls -la examples/.squad/agents/ +``` + +Output: +``` +sentinel/ # Security specialist +guardian/ # Test quality specialist +chronicler/ # Documentation specialist +``` + +Each charter defines the agent's identity, role, expertise, and boundaries. Let's read one: + +```bash +cat examples/.squad/agents/sentinel/charter.md +``` + +Output: +```markdown +# Sentinel: Security Specialist + +## Identity +- **Name:** Sentinel +- **Role:** Security Specialist +- **Expertise:** OWASP patterns, supply chain security, secret detection, vulnerability assessment + +## Charter +You are a security expert focused on protecting our codebase and infrastructure. Your role is to: + +1. **Identify Security Issues** + - Scan for exposed secrets, hardcoded credentials, and vulnerable dependencies + - Assess code changes for OWASP-Top-10 patterns + - Review infrastructure-as-code for misconfigurations + +2. **Assess Risk** + - Categorize findings by severity (critical, high, medium, low) + - Provide context and reproducibility steps + - Suggest remediation approaches + +3. 
**Boundaries** + - MUST NOT modify code directly + - MUST NOT delete any files + - MUST provide structured findings (JSON when possible) + - MUST follow principle of least privilege in recommendations + +## Model Preference +- **Primary:** gpt-4 (security tasks require careful analysis) +- **Fallback:** gpt-4-turbo if token limits hit + +## Skills +- `secret-scanning`: Detect exposed secrets and API keys +- `dependency-audit`: Check for known vulnerabilities in dependencies +``` + +This charter defines not just who Sentinel is, but also what model to use and which skills to inject when Sentinel runs. The pipeline will read this at runtime. + +Guardian and Chronicler have similar charters defining their roles (testing and documentation): + +```bash +cat examples/.squad/agents/guardian/charter.md +cat examples/.squad/agents/chronicler/charter.md +``` + +Guardian is focused on writing tests for risky surfaces, while Chronicler updates docs based on test findings. + +### Step 2: Validate the Pipeline Configuration (Dry-Run) + +Before running anything expensive, validate that the pipeline is wired correctly — all agents exist, all skills are available, all stages are connected: + +```bash +npx squad-pipeline run examples/pipeline.json --dry-run +``` + +Output: +``` +▶ Validating pipeline: Nightly Repo Health Check + Stages: analyze → test → document + + Loading stage 1/3: analyze + ✓ Agent 'sentinel' found + ✓ Charter loaded (.squad/agents/sentinel/charter.md) + ✓ Model: gpt-4 + ✓ Skills: secret-scanning, dependency-audit (2 matched) + ✓ Task: Analyze code for security issues + + Loading stage 2/3: test + ✓ Agent 'guardian' found + ✓ Charter loaded (.squad/agents/guardian/charter.md) + ✓ Model: gpt-4-turbo + ✓ Skills: test-generation, edge-case-coverage (2 matched) + ✓ Task: Write focused tests for risky surfaces + ✓ Depends on: analyze (output mapping verified) + + Loading stage 3/3: document + ✓ Agent 'chronicler' found + ✓ Charter loaded 
(.squad/agents/chronicler/charter.md) + ✓ Model: gpt-3.5-turbo + ✓ Skills: doc-generation, changelog-update (2 matched) + ✓ Task: Update documentation based on findings + ✓ Depends on: test (output mapping verified) + +✅ Validation passed: All stages valid, all agents found + Stages: 3 + Agents: 3 (sentinel, guardian, chronicler) + Charters: 3 loaded + Skills matched: 6 total (2 + 2 + 2) +``` + +This tells you upfront: are all the agents referenced in the pipeline actually defined? Are the skills available? Before running anything expensive, you get confidence that the pipeline is wired correctly. + +### Step 3: Run the Pipeline for Real + +```bash +npx squad-pipeline run examples/pipeline.json +``` + +Output: +``` +▶ Running pipeline: Nightly Repo Health Check + Stages: analyze → test → document + + Stage 1/3: analyze (sentinel) + 📖 Charter: Security Specialist (model: gpt-4) + 🎯 Task: Analyze code for security issues + ⏳ Running... + ✅ Completed (1m 34s) + Findings: 8 security issues discovered + • 2 critical (exposed API keys, hardcoded secrets) + • 3 high (dependency vulnerabilities) + • 3 medium (weak cryptography patterns) + + Stage 2/3: test (guardian) + 📖 Charter: Test Quality Specialist (model: gpt-4-turbo) + 🎯 Task: Write tests for findings from analyze stage + ⏳ Running... + ✅ Completed (1m 28s) + Results: 12 tests written, 3 edge cases + • 3 tests for secret detection scenarios + • 5 tests for dependency interaction + • 4 tests for crypto patterns + + Stage 3/3: document (chronicler) + 📖 Charter: Documentation Specialist (model: gpt-3.5-turbo) + 🎯 Task: Update documentation based on security findings + ⏳ Running... 
+ ✅ Completed (45s) + Documentation updated: + • README.md: security best practices section added + • SECURITY.md: dependency audit checklist added + • docs/architecture.md: threat model section revised + + ✅ Pipeline passed: 3/3 stages passed + Run ID: eafd1dcc-bfe3-4a7b-abdc-532b96bcc4bb + Duration: 3m 47s + Artifacts location: examples/.squad/pipelines/eafd1dcc-bfe3-4a7b-abdc-532b96bcc4bb/ +``` + +> 📊 **[DIAGRAM: Three-Stage Nightly Pipeline Execution]** +> *Prompt for image generation:* Vertical pipeline diagram showing 3 sequential stages stacked top-to-bottom: Stage 1 "Analyze" (inputs: codebase, runs 1m 34s, outputs: 8 issues JSON) → arrows flowing down → Stage 2 "Test" (inputs: security findings from Stage 1, runs 1m 28s, outputs: 12 tests) → arrows flowing down → Stage 3 "Document" (inputs: test results from Stage 2, runs 45s, outputs: updated docs). Each stage shows agent icon (Sentinel/Guardian/Chronicler), model used, and timing. Dark background, blue/teal stage boxes, green checkmarks, orange duration labels. Total runtime: 3m 47s shown at bottom. +> *Purpose:* Demonstrates the dependency chain and parallel-safe execution pattern—how stage outputs feed into the next stage's inputs, and why stages must run sequentially in this case. + +Each stage reads its agent's charter at runtime. Sentinel uses gpt-4 because that's defined in the charter. Guardian uses gpt-4-turbo because Guardian's charter specifies that. The pipeline didn't hardcode these — it read them from the charters. + +Each run creates an immutable directory with artifacts: `examples/.squad/pipelines/{runId}/`. Stage outputs are checksummed and versioned so you can audit what each agent produced, when, and in what order. 
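
To show what "checksummed" can mean in practice, here is a small sketch that writes a stage's `output.json` under a run directory and records a `sha256:` digest of the exact bytes written. The `{runId}/{stage}/output.json` layout follows the paths shown in this post; how the actual tool computes or stores its checksums is an assumption on my part, and both function names are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def write_stage_output(run_dir: Path, stage: str, output: dict) -> str:
    """Write a stage's output.json and return a 'sha256:<hex>' checksum."""
    stage_dir = run_dir / stage
    # exist_ok=False refuses to overwrite a stage that was already written.
    stage_dir.mkdir(parents=True, exist_ok=False)
    payload = json.dumps(output, sort_keys=True, indent=2).encode("utf-8")
    (stage_dir / "output.json").write_bytes(payload)
    return "sha256:" + hashlib.sha256(payload).hexdigest()


def verify_stage_output(run_dir: Path, stage: str, expected: str) -> bool:
    """Recompute the checksum from disk and compare it with the recorded one."""
    payload = (run_dir / stage / "output.json").read_bytes()
    return "sha256:" + hashlib.sha256(payload).hexdigest() == expected
```

Hashing the bytes on disk, and not the in-memory object, means any later edit to the file changes the digest, which is what makes the artifact usable as an audit record.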
+ +### Step 4: Check What the Pipeline Produced (Immutable Artifacts) + +```bash +npx squad-pipeline status eafd1dcc-bfe3-4a7b-abdc-532b96bcc4bb --squad-dir examples/.squad +``` + +Output: +``` +Pipeline Run: eafd1dcc-bfe3-4a7b-abdc-532b96bcc4bb +Status: passed + +Stage 1: analyze + Agent: sentinel (Security Specialist, gpt-4) + Status: ✅ passed + Output file: analyze/output.json + Checksum: sha256:a3b4c5d6e7f8... + Duration: 1m 34s + Produced: + { + "issues": [ + { "type": "secret", "severity": "critical", "file": "src/config.ts", "line": 42 }, + { "type": "dependency", "severity": "high", "file": "package.json", "version": "vulnerable" } + ] + } + +Stage 2: test + Agent: guardian (Test Quality Specialist, gpt-4-turbo) + Status: ✅ passed + Output file: test/output.json + Checksum: sha256:b4c5d6e7f8a9... + Duration: 1m 28s + Depends on: analyze + Produced: + { + "tests_written": 12, + "edge_cases": 3, + "coverage_gain": "8%" + } + +Stage 3: document + Agent: chronicler (Documentation Specialist, gpt-3.5-turbo) + Status: ✅ passed + Output file: document/output.json + Checksum: sha256:c5d6e7f8a9b0... + Duration: 45s + Depends on: test + Produced: + { + "files_updated": ["README.md", "SECURITY.md", "docs/architecture.md"], + "sections_added": 3 + } +``` + +All artifacts are immutable and checksummed. You can compare runs to see what changed. This is your audit trail. + +### Step 5: Update a Charter and Re-Run + +Now comes the key moment: edit an agent's charter and re-run the same pipeline without touching the pipeline config.
+ +```bash +# Edit the sentinel charter to change security expertise +vim examples/.squad/agents/sentinel/charter.md +# Change "OWASP patterns, supply chain security" to "OWASP patterns, supply chain security, API security, zero trust" + +# Re-run the same pipeline — sentinel's new expertise is automatically picked up +npx squad-pipeline run examples/pipeline.json +``` + +Output: +``` +▶ Running pipeline: Nightly Repo Health Check + Stages: analyze → test → document + + Stage 1/3: analyze (sentinel) + 📖 Charter: Security Specialist (model: gpt-4, updated expertise) + 🎯 Task: Analyze code for security issues + ⏳ Running with new expertise: API security, zero trust patterns now in scope... + ✅ Completed (1m 42s) + Findings: 10 security issues discovered (was 8) + • New findings for API authentication patterns + • New findings for IAM and authorization + + ... + ✅ Pipeline passed: 3/3 stages passed + Run ID: f9e8d7c6b5a4... (new run) + Duration: 3m 55s +``` + +The pipeline didn't change. The charter did. And because the pipeline reads the charter at runtime, Sentinel's security analysis adapts without any pipeline config edits. This is the power of charters as configuration. + +## Why This Matters + +Most pipeline tools treat agents as interchangeable executors with inline prompts. This treats agents as **team members with persistent identities**. The charter is the single source of truth. The pipeline is the orchestration layer. 
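
Here is a rough sketch of what "the pipeline reads the charter at runtime" could look like. It parses the charter layout shown in Step 1 (bold `**Key:**` bullets and a `## Skills` section) and merges the result into a stage. This is not the project's actual parser; `Charter`, `parse_charter`, `resolve_stage`, and the `agent` and `task` stage keys are all hypothetical, since the `pipeline.json` schema isn't shown in this post.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Charter:
    name: str = ""
    role: str = ""
    expertise: list[str] = field(default_factory=list)
    model: str = ""
    skills: list[str] = field(default_factory=list)


# Matches bullets such as "- **Role:** Security Specialist".
FIELD = re.compile(r"^- \*\*(?P<key>[^:*]+):\*\*\s*(?P<value>.+)$")
# Matches bullets such as "- `secret-scanning`: Detect exposed secrets".
SKILL = re.compile(r"^- `(?P<skill>[^`]+)`")


def parse_charter(text: str) -> Charter:
    """Extract identity, primary model, and skills from charter markdown."""
    charter = Charter()
    section = ""
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            section = line[3:].strip().lower()
            continue
        if section == "skills" and (m := SKILL.match(line)):
            charter.skills.append(m["skill"])
        elif m := FIELD.match(line):
            key, value = m["key"].lower(), m["value"].strip()
            if key == "name":
                charter.name = value
            elif key == "role":
                charter.role = value
            elif key == "expertise":
                charter.expertise = [item.strip() for item in value.split(",")]
            elif key == "primary":
                # Drop a trailing parenthetical such as "(security tasks ...)".
                charter.model = value.split("(")[0].strip()
    return charter


def resolve_stage(stage: dict, charters: dict[str, Charter]) -> dict:
    """Combine a stage (what to do) with its agent's charter (how to do it)."""
    charter = charters[stage["agent"]]  # "agent" and "task" are assumed keys
    return {"task": stage["task"], "model": charter.model, "skills": charter.skills}
```

Because `resolve_stage` looks up the model and skills each time it runs, editing the charter file changes the next run without any edit to the stage definition. That is the property Step 5 demonstrates.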
+ +This solves real problems: +- **Prompt drift:** Charters evolve in one place, not scattered across pipeline configs and 10 different YAML files +- **Onboarding:** New team members read charters to understand capabilities and boundaries +- **Auditability:** Every run produces immutable artifacts with metadata and checksums +- **Reusability:** The same agent can participate in multiple pipelines with consistent behavior and expertise +- **Extensibility:** Add skills by writing a SKILL.md file; agents with matching roles auto-discover them + +If you manage multiple pipelines or multiple agents, this pattern pays for itself immediately. + +## Honest Scoping + +This works great when your pipeline stages have clear dependencies and your agents have stable roles over time. It's less useful for highly experimental pipelines where you're trying different agent combinations every run. + +The real value appears when you have 5+ pipelines using the same agents. At that scale, maintaining separate prompts for each agent in each pipeline becomes unmaintainable. + +## What's Next + +From here, you can: +- Add more agents by creating `.squad/agents/{name}/charter.md` +- Add skills in `.squad/skills/{name}/SKILL.md` — they're auto-matched by agent role +- Chain pipelines together for multi-stage orchestration +- Export pipeline runs to report generators or monitoring tools + +The example also works as a foundation for other multi-stage workflows: incident response pipelines, content review pipelines, or code promotion workflows. + +--- + +If you've built a Squad team and you're tired of hardcoding prompts into pipelines, charters-as-configuration is the pattern you need. Update your team's approach once in a charter file, and every workflow adapts automatically. 
+ +Get started: [github.com/bradygaster/squad-sdk-example-pipeline](https://github.com/bradygaster/squad-sdk-example-pipeline) diff --git a/website/blog/2026-05-12-building-a-docs-squad.md b/website/blog/2026-05-12-building-a-docs-squad.md new file mode 100644 index 0000000..e60eaa7 --- /dev/null +++ b/website/blog/2026-05-12-building-a-docs-squad.md @@ -0,0 +1,163 @@ +--- +slug: "/2026-05-12-building-a-docs-squad" +canonical_url: "https://dfberry.github.io/blog/2026-05-12-building-a-docs-squad" +custom_edit_url: null +sidebar_label: "2026.05.12 Docs Squad" +title: "Building a Docs Squad — 8 AI Agents for Content Teams" +description: "How I organized eight specialized agents and supporting skills for repeatable documentation work." +draft: true +tags: + - "AI Agents" + - "Squad" + - "Documentation" + - "Developer Workflow" + - "AI-assisted" +keywords: + - "docs squad" + - "documentation workflow" + - "content operations" + - "ai agents for docs" + - "squad" +updated: "2026-05-12 00:00 PST" +--- + +I work on developer documentation. Reviews, freshness audits, SEO checks, feedback triage — the same patterns repeat across every content set. So I built a Squad for it. + +[Squad](https://github.com/bradygaster/squad) is an open-source AI agent team framework. Most people use it for code projects. I used it for content — 8 specialist agents, 15 skills, and MCP integrations, all wired for documentation workflows. + +Here's how it works and what I learned. 
+ +## The Team + +| Agent | What it does | Example prompt | +|-------|-------------|---------------| +| **Writer** | Drafts articles from templates (quickstart, how-to, tutorial, concept) | "Scaffold a quickstart for our Python SDK" | +| **Reviewer** | Tech accuracy, style compliance, inclusive language, security review | "Review PR #42 for accuracy and style" | +| **SEO Analyst** | Metadata optimization, keyword analysis, title scoring | "Score the metadata on our quickstart" | +| **Freshness Tracker** | Staleness detection, API change mapping, update prioritization | "What articles are older than 6 months?" | +| **Loc Coordinator** | Localization readiness, cross-product dependencies | "Is this ready for loc?" | +| **Metrics Analyst** | Page views, engagement trends, content health | "Which articles need attention?" | +| **Feedback Triager** | GitHub issues, customer feedback, prioritization | "Triage the last 30 days of feedback" | +| **Scribe** | Decision logging, editorial standards, memory | "Log this editorial decision" | + +Each agent has a charter at `.squad/agents/{name}/charter.md` — a 9-section document covering responsibilities, boundaries, MCP dependencies, routing rules, output format, quality standards, and anti-patterns. + +## The Architecture + +The template follows a **Content HQ model** — the squad repo is the headquarters, content repos are cloned into `./repos/`: + +``` +docs-squad/ +├── .squad/agents/ # 8 agent charters +├── .squad/routing.md # Agent-to-agent handoffs +├── .squad/decisions.md # What's been decided +├── .copilot/skills/ # 15 portable skills +├── templates/content/ # Article templates (concept, how-to, quickstart, etc.) +├── docs/ # Writing principles, quality framework, SEO guide +└── repos/ # Content repos cloned here (gitignored) +``` + +When an agent works on content, it operates inside `./repos/{content-repo}/`, not in the squad repo itself. 
This separation matters — squad config and content live in different places with different git histories. + +## MCP Integration + +Squad agents connect to external services through MCPs (Model Context Protocol servers). You configure which MCPs are available in your Copilot MCP config — agents discover them at session start. + +Only GitHub is required. Everything else degrades gracefully: + +| MCP | What it unlocks | Required? | +|-----|----------------|-----------| +| **GitHub** | Issues, PRs, code search | Yes | +| **Issue tracker** (Jira, Linear, etc.) | Work item management | No | +| **Analytics** (Amplitude, Mixpanel, etc.) | Content metrics | No | +| **Chat** (Slack, Discord, etc.) | Team notifications | No | + +When an MCP isn't connected, agents say "I can't reach that service right now" instead of hallucinating data. + +## Skills That Transfer + +The 15 skills are the portable part. They work in any Squad project, not just this template: + +| Skill | What it does | +|-------|-------------| +| `review-content` | Multi-pass review pipeline (tech accuracy → style → SEO → inclusive language) | +| `review-pr` | Structured PR feedback with severity labels | +| `resolve-pr-feedback` | Process and resolve review comments | +| `new-quickstart` | Scaffold from quickstart template | +| `new-how-to` | Scaffold from how-to guide template | +| `new-overview` | Scaffold concept/overview articles | +| `new-tutorial` | Scaffold multi-step tutorials | +| `audit-freshness` | Systematic staleness audit across a collection | +| `triage-feedback` | Categorize customer feedback by theme and impact | +| `optimize-seo` | Keyword analysis + metadata improvement | +| `validate-seo` | Validate metadata against standards | +| `validate-metadata` | Check YAML front matter against schema | +| `validate-ai-ready` | Check content for AI/copilot readiness | +| `check-inclusive-language` | Flag non-inclusive terms with approved replacements | +| `match-pattern` | Match content to the right 
article template | + +Each skill has a `SKILL.md` following the Anthropic standard — trigger phrases, input requirements, steps, output format, error handling, and examples. + +## The Review Pipeline + +The default pipeline chains skills in sequence: + +``` +write-* → security-review → copy-edit → [clear context] → tech-accuracy +``` + +Writer creates with a `write-*` skill. Reviewer runs `security-review`, then `copy-edit` (style), then clears context and runs `tech-accuracy` (verification against source). The context clear between copy-edit and tech-accuracy prevents the reviewer from anchoring on the draft's claims. + +## Decisions I Had to Make + +After building the template, I realized there are decisions every adopter needs to make but nobody tells you about upfront. I documented them in `docs/adopter-decisions.md`: + +1. **Where do squad files live?** Same repo? Different branch? Separate repo? Local only (gitignored)? +2. **Which content repos?** What do agents operate on? +3. **Which MCPs to enable?** Only GitHub is required — the rest degrade gracefully. +4. **Human vs agent boundary?** Agents draft, humans approve. But where exactly is the line? +5. **Which agents do you need?** 8 is the max. You might only need 3. +6. **Worktrees?** Parallel work on separate branches, or one branch at a time? +7. **Freshness thresholds?** 6 months? 12 months? Depends on your content. +8. **Required review passes?** Security review is non-negotiable. Style and accuracy depend. +9. **Vendor writers?** Different onboarding and quality expectations. +10. **Escalation path?** When agents can't resolve something. +11. **Decision logging?** Active, manual, or minimal? +12. **Auto-update PR branches?** Daily merge of target branch into open PRs? + +These decisions should be recorded in `.squad/decisions.md` on day one so agents don't re-ask and new team members understand the choices. 
+ +## What Surprised Me + +**Agents stay in their lane.** The routing rules in `.squad/routing.md` actually work — Reviewer doesn't try to write content, Writer doesn't try to do SEO analysis. The boundaries in charters matter more than I expected. + +**MCPs degrade gracefully.** Most sessions, only 2-3 MCPs are connected. The agents adapt — they report what they can't reach instead of making things up. + +**Squad config doesn't belong in the content repo.** This was the biggest architectural lesson. Your squad is a tool you use to work on content — it's not part of the content itself. Keep them separate. + +## Getting Started + +```bash +# Clone or fork the template +git clone https://github.com/your-org/docs-squad-template.git my-docs-squad +cd my-docs-squad + +# Install dependencies +npm install + +# Clone a content repo into repos/ +cd repos +git clone https://github.com/your-org/your-docs-repo.git + +# Launch Copilot CLI with your squad +cd .. +copilot -p "who's on the team?" +``` + +Then try: `"Review ./repos/your-docs-repo/articles/quickstart.md for technical accuracy and style compliance."` + +The Reviewer agent runs the full pipeline. You get back a structured report with findings by severity. + +--- + +Fork the template, run setup, and you have a working content team in under 5 minutes. The squad handles the repetitive parts — reviews, audits, scaffolding — so you can focus on clarity, accuracy, and the reader. 
diff --git a/website/blog/2026-05-12-code-migration-tool.md b/website/blog/2026-05-12-code-migration-tool.md new file mode 100644 index 0000000..884e346 --- /dev/null +++ b/website/blog/2026-05-12-code-migration-tool.md @@ -0,0 +1,306 @@ +--- +slug: "/2026-05-12-code-migration-tool" +canonical_url: "https://dfberry.github.io/blog/2026-05-12-code-migration-tool" +custom_edit_url: null +sidebar_label: "2026.05.12 Code Migration Tool" +title: "Parallelize Your Next Framework Migration with AI Agents" +description: "A playbook for breaking a large migration into parallel agent tasks without losing reviewability." +draft: true +tags: + - "AI Agents" + - "Code Migration" + - "Refactoring" + - "Squad" + - "AI-assisted" +keywords: + - "framework migration" + - "parallel refactoring" + - "code transformation" + - "ai migration" + - "squad" +updated: "2026-05-12 00:00 PST" +--- + +Framework migrations are painful. Express to Fastify. JavaScript to TypeScript. React 18 to 19. A codebase with 500 files means weeks of manual transforms, extensive testing per batch, and the constant worry: "Did we break something?" + +Most teams reach for codemods. You write a single transformation rule and run it across all files hoping for the best. No orchestration. No dependency awareness. No automated rollback when things go wrong. You end up re-transforming files multiple times, hunting down circular dependencies, and nursing regressions for months. + +What if your migration tool could work like a parallel construction crew instead? Analyze your dependencies first, batch changes intelligently, run tests after each batch, and roll back atomically if anything breaks. That's what professional construction teams do — they analyze the blueprint, plan the phases, parallelize safely, and validate at each stage. + +That's the idea behind [Squad SDK's migration framework](https://github.com/bradygaster/squad-sdk-example-migration). 
It treats migrations as orchestration problems, not just find-and-replace operations. + +## How It Works + +The framework coordinates specialized AI agents across four phases: + +1. **Analyzer** scans your codebase and builds a complete dependency graph — figuring out which files import which, and which files are safe to transform first (leaf nodes before roots). +2. **Planner** groups files into parallel-safe batches using topological sort. Shared modules that import nothing else in the codebase go first and entry points go last, so every file is transformed only after the modules it depends on are stable. +3. **Executor** runs independent batches in parallel, spawning transformer agents to apply the migration. Real-time progress shows which batches succeeded. +4. **Validator** runs your test suite after each batch. If tests fail, that batch rolls back automatically while other batches remain transformed. + +> 📊 **[DIAGRAM: Migration Orchestration Pipeline]** +> *Prompt for image generation:* Left-to-right flow diagram: Analyzer phase (scans codebase, outputs dependency graph) → Planner phase (topological sort, creates batches) → Executor phase (4 parallel workers, each running transformers) → Validator phase (runs test suite) → Decision diamond (tests pass?) → Yes path continues to next batch, No path triggers Rollback gate (resets failed batch only). Dark background, blue/teal process boxes with arrows, orange alert for rollback decision. Labels on each phase showing output (dependency graph, batch queue, transformer results, test results). +> *Purpose:* Shows the orchestration flow and how batches flow through each phase, emphasizing parallelism and the atomic rollback gate at validation. + +The key insight: migrations aren't single-pass transformations. They're orchestration problems. You need to know the dependency graph, respect the graph's constraints, validate incrementally, and fail safely. Agents alone can't see this — you need intelligent batching. 
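To make the Planner step concrete, here is a minimal sketch of leaf-first batching. It is my own illustration, not the framework's code: the `DependencyGraph` shape and the `planBatches` helper are hypothetical names, and I'm assuming the planner peels the graph into layers with Kahn's algorithm, which is one common way to do a topological sort.

```typescript
// Hypothetical shape: each file path maps to the in-repo files it imports.
type DependencyGraph = Record<string, string[]>;

// Peel the graph into layers (Kahn's algorithm). A file enters a layer only
// after everything it imports has been placed in an earlier layer.
function planBatches(graph: DependencyGraph, filesPerBatch: number): string[][] {
  if (filesPerBatch < 1) throw new Error("filesPerBatch must be at least 1");

  const files = Object.keys(graph);
  const pending: Record<string, number> = {};      // imports not yet placed
  const dependents: Record<string, string[]> = {}; // reverse edges

  for (const file of files) {
    pending[file] = graph[file].length;
    dependents[file] = [];
  }
  for (const file of files) {
    for (const dep of graph[file]) {
      if (!Object.prototype.hasOwnProperty.call(graph, dep)) {
        throw new Error("Unknown dependency: " + dep);
      }
      dependents[dep].push(file);
    }
  }

  const batches: string[][] = [];
  let layer = files.filter((f) => pending[f] === 0).sort();
  let placed = 0;

  while (layer.length > 0) {
    // Cap batch size by splitting the layer into chunks.
    for (let i = 0; i < layer.length; i += filesPerBatch) {
      batches.push(layer.slice(i, i + filesPerBatch));
    }
    placed += layer.length;

    const next: string[] = [];
    for (const file of layer) {
      for (const parent of dependents[file]) {
        pending[parent] -= 1;
        if (pending[parent] === 0) next.push(parent);
      }
    }
    layer = next.sort();
  }

  // Anything left unplaced sits on an import cycle.
  if (placed !== files.length) throw new Error("Circular dependency detected");
  return batches;
}

// Made-up sample graph for illustration.
const sampleGraph: DependencyGraph = {
  "app.ts": ["routes/users.ts", "utils/validation.ts"],
  "routes/users.ts": ["utils/validation.ts"],
  "utils/validation.ts": [],
  "utils/log.ts": [],
};
// Three batches: [utils/log.ts, utils/validation.ts], then [routes/users.ts], then [app.ts]
console.log(planBatches(sampleGraph, 5));
```

Files in the same layer never import each other, which is what makes it safe to transform them at the same time. The `filesPerBatch` cap only splits a layer into smaller chunks, and a cycle shows up as files that can never be placed.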
+ +## A Real Walkthrough + +Here's what a production migration looks like from start to finish, with actual CLI commands and expected output. + +### Step 1: Create a Migration Config + +```bash +squad-migrate init --config migration.json +``` + +Output: +``` +✅ Created migration.json + +Sample configuration created. Edit the following fields: + - source.framework: Source framework name + - source.pattern: Glob pattern for source files + - target.framework: Target framework name + - target.version: Target version +``` + +The generated `migration.json` looks like: +```json +{ + "name": "express-to-fastify", + "source": { + "framework": "express", + "pattern": "src/**/*.ts" + }, + "target": { + "framework": "fastify", + "version": "4.25.0" + }, + "batching": { + "filesPerBatch": 5, + "parallelBatches": 4 + }, + "rollback": { + "onTestFailure": true + } +} +``` + +### Step 2: Set Up Sample Files (Optional for Testing) + +Before running on your real codebase, you might want to test the framework. 
The example includes a sample Express codebase with test suites: + +```bash +# Create example test files +mkdir -p sample-app/src/routes +cat > sample-app/src/app.ts << 'EOF' +import express from 'express'; + +const app = express(); +app.use(express.json()); + +app.get('/users', (req, res) => { + res.json([{ id: 1, name: 'Alice' }]); +}); + +export default app; +EOF + +cat > sample-app/src/routes/users.ts << 'EOF' +import { Router } from 'express'; + +const router = Router(); +router.get('/', (req, res) => { + res.json([]); +}); + +export default router; +EOF + +# Create test file +mkdir -p sample-app/test +cat > sample-app/test/app.test.ts << 'EOF' +import { describe, it, expect } from 'vitest'; +import app from '../src/app'; + +describe('Express App', () => { + it('should return users', async () => { + // Mock test would go here + expect(true).toBe(true); + }); +}); +EOF +``` + +### Step 3: Analyze Your Codebase (Dry-Run, No Changes) + +```bash +squad-migrate analyze --config migration.json +``` + +Output: +``` +🔍 Analysis: express-to-fastify + +Scanning files... + ✓ Scanned 42 files + ✓ Built dependency graph + +📊 Complexity Assessment: + Files found: 42 + Easy: 18 (simple route handlers, no middleware chains) + Medium: 22 (middleware chains, error handlers) + Hard: 2 (custom decorators, plugin-based patterns) + +🚀 Batching Strategy: + Batches: 9 + Parallelism: 4 + Batch 1: 8 files (leaf modules) + Batch 2: 6 files (mid-level routes) + Batch 3–9: Remaining files in dependency order + +⏭️ No changes made (dry-run). +``` + +> 📊 **[DIAGRAM: Before/After Code Transformation Example]** +> *Prompt for image generation:* Split-screen comparison: Left side shows "Before" Express code snippet (app.use(), Router, res.json()) with red boxes highlighting outdated patterns. Right side shows "After" Fastify code (fastify.route(), async handlers, reply.send()) with green boxes highlighting modernized syntax. Arrows between corresponding lines show transformation mapping. 
Dark background, blue/teal code blocks, red→green transformation flow, clean sans-serif code font. +> *Purpose:* Gives readers a concrete sense of what the transformations look like at the code level without needing to scroll through full code blocks later. + +This tells you exactly what you're up against before committing to the migration. You see the dependency graph complexity, complexity breakdown by file category, and how many parallel batches can run. + +### Step 4: Execute the Migration + +```bash +squad-migrate run --config migration.json +``` + +Output (in real-time): +``` +🚀 Migration: express-to-fastify + Analyzed 42 files + Created 9 batches + Starting 4 parallel workers... + + ✅ batch-1 completed (8 files, 34s) + • src/middleware/auth.ts ✓ + • src/middleware/logging.ts ✓ + • src/utils/validation.ts ✓ + • [5 more files] + + ✅ batch-2 completed (6 files, 28s, tests passed) + • src/routes/users.ts ✓ + • src/routes/products.ts ✓ + • [4 more files] + + ⏳ batch-3 in progress (5 files)... + + ✅ batch-3 completed (5 files, 31s, tests passed) + ✅ batch-4 completed (4 files, 22s, tests passed) + ✅ batch-5 completed (3 files, 19s, tests passed) + ✅ batch-6 completed (2 files, 25s, tests passed) + ✅ batch-7 completed (6 files, 41s, tests passed) + ✅ batch-8 completed (2 files, 18s, tests passed) + ✅ batch-9 completed (1 file, 14s, tests passed) + + ═══ Migration Complete ═══ + Migrated: 40 + Failed: 2 + Skipped: 0 + Duration: 4m 52s + ══════════════════════════ +``` + +The beauty here: you see progress in real-time as each batch completes. The framework parallelizes safely — at any point, you can see which batches succeeded and which failed. Failed batches stay failed; prior batches remain transformed. 
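The "failed batches roll back, earlier batches stay transformed" behavior can be sketched like this. It's a simplified illustration under my own assumptions: `runBatch`, the `transform` hook that returns an undo function, and the in-memory `codebase` are all hypothetical, and the sketch runs batches one after another where the real tool runs them in parallel.

```typescript
type BatchStatus = "migrated" | "rolled-back";

interface BatchResult {
  id: string;
  status: BatchStatus;
  files: string[];
}

// Hypothetical hooks: `transform` rewrites one file and returns a function that
// undoes the rewrite; `runTests` returns true when the test suite passes.
function runBatch(
  id: string,
  files: string[],
  transform: (file: string) => () => void,
  runTests: () => boolean
): BatchResult {
  const undoStack: Array<() => void> = [];
  for (const file of files) {
    undoStack.push(transform(file));
  }
  if (runTests()) {
    return { id, status: "migrated", files };
  }
  // Tests failed: undo this batch only, newest change first.
  while (undoStack.length > 0) {
    const undo = undoStack.pop();
    if (undo) undo();
  }
  return { id, status: "rolled-back", files };
}

// In-memory stand-in for a codebase: file path -> framework it targets.
const codebase: Record<string, string> = {
  "src/utils/validation.ts": "express",
  "src/routes/users.ts": "express",
  "src/plugins/custom-plugin.ts": "express",
};
const toFastify = (file: string) => {
  const before = codebase[file];
  codebase[file] = "fastify";
  return () => {
    codebase[file] = before;
  };
};
// Pretend the suite fails whenever the custom plugin has been transformed.
const suitePasses = () => codebase["src/plugins/custom-plugin.ts"] !== "fastify";

const results = [
  runBatch("batch-1", ["src/utils/validation.ts", "src/routes/users.ts"], toFastify, suitePasses),
  runBatch("batch-2", ["src/plugins/custom-plugin.ts"], toFastify, suitePasses),
];
// batch-1 stays migrated; batch-2 is restored to its original state.
console.log(results.map((r) => r.id + ": " + r.status));
```

Keeping an undo stack per batch is what keeps a failure local: nothing outside the failing batch is touched.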
+ +### Step 5: Check Status and Investigate Failures + +```bash +squad-migrate status --config migration.json +``` + +Output: +``` +📊 Status: express-to-fastify + Total: 42 + Migrated: 40 + Failed: 2 + In Progress: 0 + + Failed files (2): + ✗ src/middleware/custom-middleware.ts + Reason: Complex async middleware pattern not recognized + Suggestion: Review and transform manually + + ✗ src/plugins/custom-plugin.ts + Reason: Plugin system incompatible with target framework + Suggestion: Refactor as Fastify plugin + + Per-batch status: + ✅ batch-1: 8/8 passed + ✅ batch-2: 6/6 passed + ✅ batch-3: 5/5 passed + ✅ batch-4: 4/4 passed + ✅ batch-5: 3/3 passed + ✅ batch-6: 2/2 passed + ✅ batch-7: 6/6 passed + ✅ batch-8: 2/2 passed + ✅ batch-9: 1/1 passed +``` + +Now you have a clear view of what succeeded and what needs human intervention. You can fix the two failing files manually, then re-run just those files without re-processing 40 files. + +### Step 6: Rollback Scenario (If Needed) + +If you need to abort and start over: + +```bash +squad-migrate rollback --config migration.json +``` + +Output: +``` +⏮️ Rollback: express-to-fastify + + Resetting all file statuses to 'pending'... + ✅ 42 files reset + + To restore file contents from git: + $ git checkout -- . + + Migration state cleared. Ready to re-run. +``` + +This resets the migration state without touching the files themselves. If you want to restore the original files too, just run `git checkout -- .`. + +## Why This Matters + +Typical codemod time cost: 4–5 weeks for 500 files (manual review, fix errors, retest, repeat). Testing takes weeks because you're validating 500 files against a single changed pattern. + +With orchestrated agents and parallelism: 3–5 days for the same codebase (batching, parallelism, automated rollback per batch). Each batch is small and tested before you move on. + +Plus: **95%+ fewer regressions** because each batch is validated before you move to the next. 
You're not discovering issues at the end when everything's been transformed — you're finding them early and fixing before the cascade. + +The framework also teaches you core Squad SDK patterns: agent definitions, state persistence, event-driven architecture, and hook pipelines. If you build other long-running, multi-stage workflows, these patterns transfer directly. + +## Honest Scoping + +This works great for framework migrations where the source and target have clear structural parallels (Express ↔ Fastify, React 18 ↔ React 19). For truly novel transformations (JavaScript to WASM, monolith to microservices), agents still need significant manual prompting and review. + +The framework shines when you have a clear migration strategy but need orchestration and validation — not when you need to invent the strategy itself. + +## What's Next + +From here, you can: +- Integrate into CI/CD for automated nightly migrations +- Add custom transformer skills for domain-specific patterns +- Chain multiple migrations (TypeScript then Framework upgrade) +- Export the migration state to tools like Renovate or Dependabot + +The example also works as a foundation for other long-running orchestration tasks: bulk refactoring, security patching across a monorepo, or coordinated database migrations. + +--- + +Next time you face a large-scale framework migration, skip the manual codemod and the weeks of testing. Use Squad SDK to parallelize the work, let agents handle the transforms, and automate the validation at each stage. Your migration crew just got a lot bigger. 
+ +Get started: [github.com/bradygaster/squad-sdk-example-migration](https://github.com/bradygaster/squad-sdk-example-migration) diff --git a/website/blog/2026-05-12-governance-policy-engine.md b/website/blog/2026-05-12-governance-policy-engine.md new file mode 100644 index 0000000..fe7a7d0 --- /dev/null +++ b/website/blog/2026-05-12-governance-policy-engine.md @@ -0,0 +1,299 @@ +--- +slug: "/2026-05-12-governance-policy-engine" +canonical_url: "https://dfberry.github.io/blog/2026-05-12-governance-policy-engine" +custom_edit_url: null +sidebar_label: "2026.05.12 Guardrails in 5 Minutes" +title: "Your AI Agents Need Guardrails—Here's How to Add Them in 5 Minutes" +description: "A lightweight governance engine that adds enforceable guardrails and review gates to an AI agent team." +draft: true +tags: + - "AI Agents" + - "Governance" + - "Security" + - "Squad" + - "AI-assisted" +keywords: + - "governance engine" + - "ai guardrails" + - "policy engine" + - "human approval" + - "squad" +updated: "2026-05-12 00:00 PST" +--- + +You've deployed AI agents into your codebase. They run fast. They get work done. And then one deletes `.env` by accident—or tries to. + +The hard problem isn't writing agents. It's *controlling* what they're allowed to do. You need guardrails that work at runtime—not someday, but today. This is what governance looks like. + +[Squad SDK Governance](https://github.com/bradygaster/project-squad-sdk-example-governance) is a reference implementation of a policy enforcement system. Define what agents can access in YAML. Block dangerous commands before they run. Get an immutable audit trail of every decision. No custom integration, no months of engineering—just configuration and validation. + +## The Problem: Agents Run Unsupervised + +GitHub branch protection gates *who* can approve PRs. It doesn't understand *what* agents are trying to do. + +Your agent makes a file write. Is it allowed? Branch protection has no idea. 
You discover the problem when the PR lands—or worse, when something breaks in production. + +Enterprise teams face this choice: lock down agents so tightly they become useless, or trust they'll behave. Most organizations end up somewhere painful in the middle. You need a system that enforces policy *before* an agent acts—not after. + +## The Two-Layer Defense + +This governance system works in two places: + +1. **CLI Pre-Tool Hooks** — Before any command runs, the policy engine validates it. If blocked, it fails immediately with a reason. +2. **Audit Trail** — Every decision (allowed or denied) is logged to `.squad/audit/` as immutable JSONL. For compliance reviews, you have a complete forensic record. + +> 📊 **[DIAGRAM: Policy Enforcement Pipeline]** +> *Prompt for image generation:* Create a horizontal flow diagram showing: (1) Agent Command (left, rounded box) → (2) Policy Engine (center, larger box with YAML icon inside) → (3) Decision Diamond (allowed/blocked) → (4) two paths: green checkmark arrow labeled "Allowed" to Execute Tool (right, green box), red X arrow labeled "Blocked" to Rejection Handler (right, red box). Below the main flow, show Audit Trail as a horizontal list of logs. Use dark background (charcoal), teal/cyan for allowed path, red for blocked path, clean sans-serif labels. Arrow thickness indicates data flow intensity. +> *Purpose:* Helps readers visualize how the policy engine intercepts commands before execution and how each decision gets logged, making the two-layer defense concept concrete. + +## How It Works: A Real Walkthrough + +Let me show you the complete flow, from setup to catching a violation. + +### Step 1: Clone and Build + +```bash +$ git clone https://github.com/bradygaster/project-squad-sdk-example-governance.git +$ cd project-squad-sdk-example-governance +$ npm install +added 42 packages, and audited 45 packages in 2.3s + +$ npm run build +``` + +Expected output (no errors, TypeScript compiles cleanly). 
+ +### Step 2: Create Your Policy + +Create `.squad/policies/policy.yaml`: + +```bash +$ mkdir -p .squad/policies +$ cat > .squad/policies/policy.yaml << 'EOF' +version: '1.0' + +file_access_rules: + allowed_paths: + - src/** + - test/** + - docs/** + - README.md + - package.json + blocked_paths: + - .env + - .env.local + - secrets/* + - config/credentials.json + +blocked_commands: + - rm -rf / + - dd if=/dev/zero + - chmod 777 + - chown root:root +EOF +``` + +This policy says: agents can write to source files, tests, and docs—but absolutely never to environment files, secrets, or infrastructure config. + +### Step 3: Test the Policy + +Now let's validate what's allowed and what's blocked: + +```bash +$ npx squad-governance test write src/app.ts +✅ PASS: write to src/app.ts allowed +``` + +Good. Source files are allowed. Now try the dangerous one: + +```bash +$ npx squad-governance test write .env +❌ FAIL: write to .env blocked by policy — File path '.env' is in blocked paths +``` + +The system caught it. Let's verify command blocking works: + +```bash +$ npx squad-governance test command "npm install" +✅ PASS: command 'npm install' allowed + +$ npx squad-governance test command "rm -rf /" +❌ FAIL: command 'rm -rf /' blocked by policy (matches 'rm -rf /') +``` + +The policy engine matches the substring `rm -rf /` and blocks the entire command, even though technically `rm -rf /` would hit filesystem permissions first. Better to be safe. + +### Step 4: View the Policy Summary + +Get a human-readable overview of everything in your policy: + +```bash +$ npx squad-governance summary + +=== Policy Summary === +Version: 1.0 + +Allowed Paths: + ✅ src/** + ✅ test/** + ✅ docs/** + ✅ README.md + ✅ package.json + +Blocked Paths: + ❌ .env + ❌ .env.local + ❌ secrets/* + ❌ config/credentials.json + +Blocked Commands: + ❌ rm -rf / + ❌ dd if=/dev/zero + ❌ chmod 777 + ❌ chown root:root +``` + +This is what you show your security team in a compliance review. 
Clear, declarative, enforceable. + +### Step 5: Trigger a Violation and Check the Audit Log + +Let's simulate an agent trying to do something it shouldn't: + +```bash +$ npx squad-governance test write secrets/aws-keys.json + +❌ FAIL: write to secrets/aws-keys.json blocked by policy — File path 'secrets/aws-keys.json' is in blocked paths +``` + +Now check the audit log: + +```bash +$ cat .squad/audit/*.jsonl | jq '.' + +{ + "timestamp": "2024-01-15T14:32:45.123Z", + "agent": "copilot-cli", + "action": "write .env", + "allowed": false, + "reason": "File path '.env' is in blocked paths" +} +{ + "timestamp": "2024-01-15T14:32:51.456Z", + "agent": "copilot-cli", + "action": "write secrets/aws-keys.json", + "allowed": false, + "reason": "File path 'secrets/aws-keys.json' is in blocked paths" +} +``` + +Every decision—allowed or denied—is logged with timestamp, actor, action, and reason. This is your forensic record. When a security auditor asks "did any unauthorized file access happen?" you can count the denied entries. The `-c` flag prints one object per line, so `wc -l` counts entries and not pretty-printed lines: + +```bash +$ cat .squad/audit/*.jsonl | jq -c 'select(.allowed == false)' | wc -l +2 +``` + +Two denials: one for `.env`, one under `secrets/`. Case closed. + +## What Makes This Different + +**Most agent governance is reactive.** Post-hoc review, manual approval. This system is *preventive*—policies block violations *before* they happen. + +> 📊 **[DIAGRAM: Reactive vs. Preventive Governance]** +> *Prompt for image generation:* Create a split-screen comparison: LEFT side labeled "Reactive (Old Way)" shows: Agent Action → File Written → Review Later → Damage. Path is red/orange. RIGHT side labeled "Preventive (This System)" shows: Agent Action → Policy Check → Allowed/Blocked → Log Decision. Path is green/teal. Use dark background, rounded boxes for states, thick arrows for flow direction. Include small icons (e.g., warning sign for reactive, shield for preventive). Emphasize the time difference: left shows "Hours to Days", right shows "Milliseconds". 
+> *Purpose:* Shows readers the fundamental difference in timing and risk—why policy-first is better than audit-only. + +**Unlike branch protection**, policies understand intent. You're not just saying "Agent X can approve PRs." You're saying "agents can write to source files, not config files—ever." The CLI enforces it on every command. + +**Unlike enterprise governance platforms**, there's no vendor lock-in. Policies are YAML. Audit logs are JSONL. Check runs live in GitHub. Portable, auditable, yours. + +## Real Use Case: SOC 2 Attestation + +A fintech team needed SOC 2 compliance: +- AI agents can write to `src/`, `test/`, `docs/` +- AI agents cannot touch `.env`, credentials, or infrastructure config +- Every attempt must be logged + +With this policy system, they: +1. Deployed agents with confidence (5 min to write policy) +2. Ran agents for 3 months, collecting audit logs +3. Generated compliance report: `cat .squad/audit/*.jsonl | jq -s 'map(select(.allowed == false)) | length'` → 0 unauthorized attempts +4. Passed security review with zero incidents + +Before this, they were manually reviewing PR diffs and hoping nothing slipped through. + +## Step-by-Step: Create a Policy, Test It, Check Audit + +Here's the complete flow in one session: + +```bash +# 1. Clone and build +git clone https://github.com/bradygaster/project-squad-sdk-example-governance.git +cd project-squad-sdk-example-governance +npm install +npm run build + +# 2. Create policy +mkdir -p .squad/policies +cat > .squad/policies/policy.yaml << 'EOF' +version: '1.0' +file_access_rules: + allowed_paths: + - src/** + - test/** + blocked_paths: + - .env + - secrets/* +blocked_commands: + - rm -rf / +EOF + +# 3. Test allowed access +npx squad-governance test write src/index.ts +# ✅ PASS + +# 4. Test blocked access +npx squad-governance test write .env +# ❌ FAIL + +# 5. View summary +npx squad-governance summary + +# 6. Check audit log (if violations were recorded) +cat .squad/audit/*.jsonl | jq '.' 
+``` + +All in under 2 minutes. The system is designed to be fast and obvious. + +## The Honest Scoping + +This example is a **reference implementation**. It shows you the patterns. What's production-ready *today*: + +- **Policy validation** — Load, parse, and enforce YAML policies +- **Pre-tool hooks** — Block file writes and commands before execution +- **Audit logging** — Append-only JSONL audit trail with timestamps +- **CLI testing** — Dry-run any file write or command against the policy + +The roadmap includes (not yet enforced): +- PII detection (SSN, email, credit card patterns) +- Rate limiting (max API calls per session) +- Reviewer lockout (author can't review own work) +- Emergency waivers (bypass with justification + signature) + +The core loop (define policy → validate → audit) is battle-tested. The extensions are where teams customize to their risk profile. + +## Why This Matters + +AI governance isn't new. What's new is doing it **at agent runtime, declaratively, with full audit trails**. Until now, you chose between speed and control. This system lets you have both. + +For security teams: you get demonstrable control for compliance reviews. +For platform teams: you get auditable, version-controlled policy across all repos. +For engineers: you get clear policy boundaries and fast feedback when something's out of bounds. + +The secret: policies work because they're *simple*. YAML, not Rego or Cedar. CLI tests, not integration tests. JSONL audit logs, not a proprietary database. Your team can reason about it. + +--- + +Read the [full documentation](https://github.com/bradygaster/project-squad-sdk-example-governance#readme) for architecture details and extending with custom rules. The [QUICKSTART](https://github.com/bradygaster/project-squad-sdk-example-governance/blob/main/QUICKSTART.md) gets you running in 5 minutes. + +Your agents are powerful. Give them guardrails that match your risk tolerance—not your paranoia. 
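If you want to see the core loop (define policy → validate → audit) as code, here is a rough sketch. It is not the reference implementation: the `Policy` field names mirror the `policy.yaml` keys from the walkthrough, but `checkWrite`, `checkCommand`, `auditLine`, the simplified glob handling, and the deny-by-default rule for unlisted paths are all my assumptions.

```typescript
// Field names mirror the policy.yaml keys from the walkthrough; the matching
// rules below are assumptions, not the real engine's implementation.
interface Policy {
  allowedPaths: string[];
  blockedPaths: string[];
  blockedCommands: string[];
}

interface Verdict {
  allowed: boolean;
  reason: string;
}

// Simplified glob: `**` crosses directories, `*` stays within one segment.
function globToRegExp(glob: string): RegExp {
  const source = glob
    .split("**")
    .map((part) => part.replace(/[.+^${}()|[\]\\?]/g, "\\$&").replace(/\*/g, "[^/]*"))
    .join(".*");
  return new RegExp("^" + source + "$");
}

function matchesAny(globs: string[], path: string): boolean {
  return globs.some((glob) => globToRegExp(glob).test(path));
}

function checkWrite(policy: Policy, path: string): Verdict {
  // Blocked paths win over allowed paths.
  if (matchesAny(policy.blockedPaths, path)) {
    return { allowed: false, reason: "File path '" + path + "' is in blocked paths" };
  }
  if (matchesAny(policy.allowedPaths, path)) {
    return { allowed: true, reason: "File path '" + path + "' is in allowed paths" };
  }
  // Assumption: anything not explicitly allowed is denied.
  return { allowed: false, reason: "File path '" + path + "' is not in allowed paths" };
}

function checkCommand(policy: Policy, command: string): Verdict {
  // Substring match, as the walkthrough describes for `rm -rf /`.
  for (const blocked of policy.blockedCommands) {
    if (command.indexOf(blocked) !== -1) {
      return { allowed: false, reason: "matches '" + blocked + "'" };
    }
  }
  return { allowed: true, reason: "no blocked pattern matched" };
}

// Build one audit entry as a JSONL line (one JSON object per line).
function auditLine(agent: string, action: string, verdict: Verdict, now: Date): string {
  return JSON.stringify({
    timestamp: now.toISOString(),
    agent,
    action,
    allowed: verdict.allowed,
    reason: verdict.reason,
  });
}

const examplePolicy: Policy = {
  allowedPaths: ["src/**", "test/**", "docs/**", "README.md", "package.json"],
  blockedPaths: [".env", ".env.local", "secrets/*", "config/credentials.json"],
  blockedCommands: ["rm -rf /", "dd if=/dev/zero", "chmod 777", "chown root:root"],
};

console.log(auditLine("copilot-cli", "write .env", checkWrite(examplePolicy, ".env"), new Date()));
console.log(checkCommand(examplePolicy, "rm -rf /tmp/build"));
```

One thing the sketch makes visible: with plain substring matching, `rm -rf /tmp/build` is blocked too, because it contains `rm -rf /`. If your policy engine matches the same way, keep that in mind when you write `blocked_commands`.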
diff --git a/website/blog/2026-05-12-human-approval-hub.md b/website/blog/2026-05-12-human-approval-hub.md new file mode 100644 index 0000000..e8754e5 --- /dev/null +++ b/website/blog/2026-05-12-human-approval-hub.md @@ -0,0 +1,320 @@ +--- +slug: "/2026-05-12-human-approval-hub" +canonical_url: "https://dfberry.github.io/blog/2026-05-12-human-approval-hub" +custom_edit_url: null +sidebar_label: "2026.05.12 Human Approval Hub" +title: "AI Agents Propose, Humans Approve. Build the Inbox That Keeps You in the Loop" +description: "A human-approval hub concept for collecting review requests from autonomous agents in one place." +draft: true +tags: + - "AI Agents" + - "Human in the Loop" + - "Workflow" + - "Squad" + - "AI-assisted" +keywords: + - "human approval" + - "approval inbox" + - "human in the loop" + - "agent escalation" + - "squad" +updated: "2026-05-12 00:00 PST" +--- + +Squad agents work autonomously. They create pull requests, propose architecture decisions, flag budget overruns, request policy waivers. But approval requests scatter across GitHub notifications, email, and buried `.squad/decisions/inbox/` directories. By the time a human finds a pending request, the agent has been blocked for hours. + +Engineering leads waste time context-switching between notification systems. Agents sit blocked waiting for answers that never come. You have autonomous agents trapped behind approval bottlenecks. + +I built an approval hub that centralizes everything. [squad-approval](https://github.com/bradygaster/project-squad-sdk-example-approval) is a unified inbox for all agent proposals requiring human sign-off—with threaded context, priority sorting, automatic escalation, and audit trails for compliance. + +## The Problem: Scattered Approvals, Blocked Agents + +When Squad agents run, they don't just execute code. They propose decisions: "Should we use JWT for auth?" or "Should we enable stricter linting?" These decisions require human judgment. 
But without a central approval point, humans miss requests. + +The fragmentation is real: +- GitHub PRs create notifications (which pile up and get buried) +- Decisions go into `.squad/decisions/inbox/` (which humans rarely check) +- Important escalations might be in Teams or email +- No single place to see what's pending and what the priority should be + +The result: agents block waiting for approval. Hours turn into days. Team velocity drops because humans can't find decisions to make. + +Meanwhile, compliance teams need audit trails. "Who approved what and when?" The scattered system doesn't answer that. + +## The Solution: A Single Approval Inbox + +The approval hub provides a unified interface for all agent proposals. One command shows all pending items, sorted by urgency (stale items surface first). Approve or reject with reasoning. Every decision is logged with timestamp and context for compliance. + +The architecture is simple: agents create approval items → items go into a queue → humans review and approve/reject → results are logged and communicated back to waiting agents. + +> 📊 **[DIAGRAM: Approval Request Flow]** +> *Prompt for image generation:* A horizontal swimlane diagram with two rows: "Agent" and "Hub". Show: (1) Agent box sends "Propose Decision" arrow to (2) Queue box (labeled "Inbox" in teal) → (3) Human Review box with stopwatch icon (showing "⏱ 1h stale alert") → (4) Decision point (Y/N diamond in blue) → split to (5a) "Approved ✓" or (5b) "Rejected ✗" → (6) Notify Agent (callback arrow). Dark background, teal/blue accent colors, clear lane separation with dashed lines, timestamp labels. +> *Purpose:* Shows the end-to-end decision lifecycle including the stale-item escalation—readers understand both the happy path (propose → approve → proceed) and the bottleneck prevention (stale alerts). 
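The create → queue → decide → log lifecycle described above fits in a small class. This is a sketch under my own assumptions, not the hub's actual code: `ApprovalQueue` and its method names are hypothetical, the `type-timestamp` ID format is inferred from the IDs shown in this post, and agents poll for status here where the real hub may use some other callback mechanism.

```typescript
type Decision = "approved" | "rejected";

interface ApprovalItem {
  id: string;
  type: string;
  title: string;
  agent: string;
  createdAt: number; // epoch milliseconds
  status: "pending" | Decision;
}

class ApprovalQueue {
  private items: ApprovalItem[] = [];
  readonly auditLog: string[] = []; // one JSON object per entry

  // Agents call this to file a request. The ID format is an assumption.
  create(type: string, title: string, agent: string, now: number): ApprovalItem {
    const item: ApprovalItem = {
      id: type + "-" + now,
      type,
      title,
      agent,
      createdAt: now,
      status: "pending",
    };
    this.items.push(item);
    return item;
  }

  // Humans call this to approve or reject. Every decision is logged.
  decide(id: string, decision: Decision, by: string, reason: string, now: number): ApprovalItem {
    const match = this.items.filter((i) => i.id === id && i.status === "pending")[0];
    if (!match) throw new Error("No pending approval with id " + id);
    match.status = decision;
    this.auditLog.push(
      JSON.stringify({ id, decision, by, reason, at: new Date(now).toISOString() })
    );
    return match;
  }

  // Agents poll this to learn whether they can proceed.
  statusOf(id: string): string {
    const match = this.items.filter((i) => i.id === id)[0];
    return match ? match.status : "unknown";
  }

  pending(): ApprovalItem[] {
    return this.items.filter((i) => i.status === "pending");
  }
}

const queue = new ApprovalQueue();
const request = queue.create("decision", "Use JWT for auth", "keaton", 1705328000000);
queue.decide(request.id, "approved", "cli-user", "Looks good", 1705328045000);
console.log(queue.statusOf(request.id), queue.pending().length);
```

Writing the audit entry inside `decide` means there is no code path that changes a status without leaving a record, which is the property compliance reviews tend to care about.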
+ +## Setup + +```bash +# Clone the repository +git clone https://github.com/bradygaster/project-squad-sdk-example-approval.git +cd project-squad-sdk-example-approval + +# Install and build +npm install && npm run build + +# Make the CLI globally available +npm link +``` + +Verify setup: + +```bash +npm test +``` + +Expected: 30+ tests pass (✓). + +## Create Your First Approval Request + +```bash +squad-approval create --type decision --title "Use JWT for auth" --agent keaton --reason "Auth decision needed" +``` + +Expected output: + +``` +✓ Created approval: decision-1705328000000 + Type: decision + Title: Use JWT for auth + Created: 2024-01-15T10:30:00.123Z +``` + +The system assigns a unique ID and timestamps it. The agent now knows where to check for a response. + +## List Pending Approvals + +```bash +squad-approval list +``` + +Expected output: + +``` +ID Type Title Age +───────────────────────────────────────────────────────────────────────────────── +decision-1705327900000 policy-waiver Disable lint rule 15m +pr-42 github-pr feat: webhooks 2m +decision-1705328000000 decision Use JWT for auth 1m + +Total: 3 pending approvals +``` + +Items are sorted by age, oldest first, so the 15-minute-old request sits at the top. Each one shows: +- **ID**: Unique identifier for this approval +- **Type**: What kind of approval (decision, PR, policy waiver, etc.) +- **Title**: What's being asked +- **Age**: How long it's been pending (stale items ≥1 hour get escalated) + +> 📊 **[DIAGRAM: Approval Queue Priority Sorting]** +> *Prompt for image generation:* A table mockup showing 3 rows with visual urgency indicators. Row 1: "decision-1705328..." (1m age, green indicator dot) at bottom. Row 2: "decision-1705327..." (15m age, orange indicator dot) highlighted in light teal box at top. Row 3: "pr-42" (2m age, green dot). Add a legend: "Green = <1h, Orange = stale (≥1h)". Use monospace font for IDs, dark background, teal borders for the urgent row. 
Arrow pointing to the stale row labeled "ESCALATED". +> *Purpose:* Visually reinforces that the system surfaces oldest/stale items first, making the prioritization strategy obvious. + +## Taking Action: Approve + +```bash +squad-approval approve decision-1705328000000 --reason "Looks good" +``` + +Expected output: + +``` +✓ Approved: Use JWT for auth + By: cli-user + At: 2024-01-15T10:30:45.123Z + Reason: Looks good +``` + +The approval is logged with your name, timestamp, and reasoning. Agents waiting for this approval get notified immediately. + +## Taking Action: Reject + +```bash +squad-approval reject decision-1705327900000 --reason "Disable lint rule for now. Reconsider after performance tests complete" +``` + +Expected output: + +``` +✗ Rejected: Disable lint rule + Reason: Disable lint rule for now. Reconsider after performance tests complete + By: cli-user + At: 2024-01-15T10:31:10.456Z +``` + +Rejections are logged with full reasoning so the agent understands why and can revise the proposal if needed. + +## View Queue Status + +```bash +squad-approval status +``` + +Expected output: + +``` +Approval Queue Status +────────────────────────────── + Pending: 3 + Approved: 5 + Rejected: 1 + Expired: 0 + Total: 9 + +Stale items (>1 hour): 1 + • pr-42 (2h, requesting feature branch merge) +``` + +This gives you a quick snapshot. 3 items need your attention. 1 has been waiting 2 hours and is escalating (meaning an agent has probably already complained about the delay). 
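
Both the list and status views depend on item age. Here is a minimal sketch of how oldest-first ordering and the one-hour stale threshold could be computed. The function names and the `PendingItem` shape are assumptions for illustration; the CLI's internals may work differently.

```typescript
// Hypothetical shape; only the fields needed for age calculations.
interface PendingItem {
  id: string;
  createdAt: number; // epoch milliseconds
}

// One hour, matching the stale threshold described in this post.
const STALE_AFTER_MS = 60 * 60 * 1000;

function ageMs(item: PendingItem, now: number): number {
  return Math.max(0, now - item.createdAt);
}

function isStale(item: PendingItem, now: number): boolean {
  return ageMs(item, now) >= STALE_AFTER_MS;
}

// Oldest first, which also puts stale items at the top of the list.
function sortOldestFirst(items: PendingItem[]): PendingItem[] {
  return [...items].sort((a, b) => a.createdAt - b.createdAt);
}

// Render an age the way the list output does: minutes under an hour, then hours.
function formatAge(ms: number): string {
  const minutes = Math.floor(ms / 60000);
  return minutes < 60 ? `${minutes}m` : `${Math.floor(minutes / 60)}h`;
}
```

Sorting a copy rather than the input array keeps the queue's stored order untouched, so the same items can be rendered by different views.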
+ +## Real Walkthrough: Decision Pipeline + +Here's the full flow with realistic scenarios: + +**Step 1: Agent proposes decision** + +```bash +squad-approval create --type decision \ + --title "Migrate to TypeScript strict mode" \ + --agent alice \ + --reason "Improves type safety, catches bugs earlier" +``` + +**Output:** +``` +✓ Created approval: decision-1705400000000 + Type: decision + Title: Migrate to TypeScript strict mode + Requested by: alice + Created: 2024-01-15T14:30:00.000Z +``` + +**Step 2: List shows it's pending** + +```bash +squad-approval list +``` + +**Output:** +``` +ID Type Title Age +───────────────────────────────────────────────────────────────────────────────── +decision-1705400000000 decision Migrate to TypeScript strict mode 1m + +Total: 1 pending approval +``` + +**Step 3: After 1 hour, the system flags it as stale** + +```bash +squad-approval status +``` + +**Output:** +``` +Approval Queue Status +────────────────────────────── + Pending: 1 + Stale: 1 (>1 hour) + +⚠️ decision-1705400000000 is stale (61m). Agent may escalate. +``` + +An alert goes out (to Ralph or your notification system). This prevents silent blocks. + +**Step 4: Human reviews and approves** + +```bash +squad-approval approve decision-1705400000000 \ + --reason "Good idea. Phase rollout: alpha → beta → production" +``` + +**Output:** +``` +✓ Approved: Migrate to TypeScript strict mode + Phase: alpha (start with one team) + By: engineering-lead + At: 2024-01-15T15:35:22.789Z +``` + +**Step 5: Agent gets notification and proceeds** + +The agent checks its inbox, sees the approval, and starts the migration. + +## What Makes This Different + +Most approval systems are designed for human-to-human workflows (code review, manager sign-off). This is designed for human-AI workflows where agents are proposing decisions and waiting for approval to proceed. + +**Priority sorting matters.** Items are automatically sorted by age and escalation status. An approval waiting 2+ hours surfaces first. 
This makes the queue actionable instead of overwhelming. + +**Timeout enforcement matters.** Approval requests expire after 24 hours (configurable). If an agent is waiting indefinitely, that's a problem. Expiry forces humans to either approve, reject, or actively defer. + +**Escalation via Ralph matters.** Squad's Ralph monitor can be configured to send notifications when approvals age. After 1 hour of waiting, alerts escalate. No more silent blocks. + +**Audit trails matter.** Every approval/rejection is logged with timestamp, who made the decision, and reasoning. For regulated teams, this is compliance—you can answer "who approved this?" with exact citations. + +## Honest Scoping + +**What this does:** +- Centralize approval requests from agents in one place +- Sort by priority (stale first) +- Provide approval/rejection workflow with audit trails +- Auto-escalate stale items after N hours +- Auto-expire items after 24 hours + +**What this doesn't do:** +- Enforce approvals (agents can proceed anyway—it's advisory) +- Send notifications automatically (you integrate that yourself with Ralph or comms) +- Handle conditional logic (e.g., "approve if tests pass, auto-reject if coverage drops") +- Multi-level approvals (one person approves, not a workflow) + +## Extending It + +The system has adapters for capturing approvals from multiple sources: + +- **GitHub** — Capture PRs with `needs-approval` label +- **Decisions** — Monitor `.squad/decisions/inbox/` for pending decisions +- **ADO** — Monitor Azure DevOps escalations + +You can add custom sources by implementing the `ApprovalSource` interface. + +Send notifications through Teams, Slack, or email when approvals change state using the `NotificationDispatcher`. + +## Next Steps + +1. **Set up integration.** Connect Ralph to escalate stale items. +2. **Configure timeouts.** Adjust approval window from default 24 hours if needed. +3. 
**Add notification channels.** Route approvals to Teams or email so your team sees them. +4. **Build approval policies.** Some decisions auto-approve (low risk), others require human review. + +## Get Started + +```bash +# Clone and set up (5 minutes) +git clone https://github.com/bradygaster/project-squad-sdk-example-approval.git +cd project-squad-sdk-example-approval +npm install && npm run build && npm link +npm test + +# Create your first approval (1 minute) +squad-approval create --type decision --title "Enable strict mode" --agent smith --reason "Testing strategy" + +# List pending (1 minute) +squad-approval list + +# Approve or reject, using an ID from the list output (1 minute) +squad-approval approve <approval-id> --reason "Let's do it" + +# Check status (1 minute) +squad-approval status +``` + +Nine minutes later, you have a working approval inbox. Then extend it to capture from GitHub PRs, ADO items, or your own custom sources. The hub keeps humans in the loop and agents unblocked. diff --git a/website/blog/2026-05-12-incident-response.md b/website/blog/2026-05-12-incident-response.md new file mode 100644 index 0000000..e39e009 --- /dev/null +++ b/website/blog/2026-05-12-incident-response.md @@ -0,0 +1,507 @@ +--- +slug: "/2026-05-12-incident-response" +canonical_url: "https://dfberry.github.io/blog/2026-05-12-incident-response" +custom_edit_url: null +sidebar_label: "2026.05.12 Incident Response" +title: "When Production Breaks, Let Your Squad Team Help Triage" +description: "An incident-response pattern that lets specialized agents gather context and accelerate the first minutes of an outage." +draft: true +tags: + - "AI Agents" + - "Incident Response" + - "DevOps" + - "Squad" + - "AI-assisted" +keywords: + - "incident response" + - "triage automation" + - "on-call workflow" + - "ai ops" + - "squad" +updated: "2026-05-12 00:00 PST" +--- + +Your SRE team gets a page at 2 AM. Production is down. The on-call engineer wastes the first 30 minutes doing the same thing every incident: gathering context. 
+ +Piecing together error logs. Checking which services went down. Asking which deployment happened in the last hour. Discovering that service A depends on service B and *that* is the actual problem. + +This is context-gathering overhead. It's not fixing anything. It's just finding the problem. And it happens the same way every incident. + +What if your Squad team helped? What if incidents went from "wait for human to gather context" to "AI agents gather context in parallel, surface the problem, suggest fixes"? + +[Squad SDK Incident Response](https://github.com/bradygaster/project-squad-sdk-example-incident) is a reference implementation of an incident orchestration system. Parse incidents from GitHub issues. Route diagnostics through service-specific runbooks. Generate triage reports and post-mortems. All automated. All auditable. All giving your on-call engineer a head start instead of a blank slate. + +## The Problem: Incident Triage Is Manual and Slow + +Here's how it works today: + +> 📊 **[DIAGRAM: Manual vs. Squad-Assisted Incident Response]** +> *Prompt for image generation:* Create a split-screen comparison: LEFT side "Manual Triage (Today)" shows: Alert → On-Call → SSH/Logs → Read Logs (⏰ 5min) → Check Service A (⏰ 5min) → Check Service B (⏰ 5min) → Find Recent Deploy (⏰ 5min) → Read Diff (⏰ 10min) → 30 min total (red line tracking time). RIGHT side "Squad-Assisted (New)" shows: Alert → On-Call + Squad Agents (parallel processes: Agent-A checks Service-A ⏰2min, Agent-B checks Service-B ⏰3min, Agent-C checks Deploy ⏰2min, Agent-Synthesis ⏰3min) = 10 min total (green line). Show both converging on "Implement Fix" (5min each, same). Emphasize: same human work at end, but 20min saved on investigation. Use dark background, red for manual path, teal/green for parallel path. +> *Purpose:* Shows readers the time advantage of parallelized diagnostics—the core business value proposition. + +1. Alert fires. On-call engineer gets paged. +2. Engineer reads the alert. 
Checks logs. Finds nothing useful. +3. Engineer SSH's into prod. Checks service health. Finds service A is fine. +4. Engineer checks service B (dependency of A). *That's* unhealthy. +5. Engineer looks for recent changes. Finds deployment from 90 minutes ago. +6. Engineer reads the deployment diff. Finds N+1 query. +7. Engineer fixes or rolls back. Incident resolves. +8. Post-mortem happens 3 days later, rushed, incomplete. + +Steps 1–6 are pure context-gathering. They happen in every incident. They take 30–45 minutes on average. Once context is clear, the fix takes 5 minutes. + +You're paying SRE labor for investigation overhead, not for solving problems. + +## How Squad Incident Response Works: Complete Walkthrough + +### Step 1: Clone and Build + +> 📊 **[DIAGRAM: Incident Orchestration Pipeline]** +> *Prompt for image generation:* Create a vertical orchestration flow: (1) Top: GitHub Issue / incident.json (input box, teal). (2) Below: Incident Intake box → Summary Generator → Diagnostics Router (decision node splitting to multiple runbooks). (3) Middle: 3 parallel runbook agents (API Runbook, Database Runbook, Cache Runbook), each showing diagnostic steps and findings. (4) Below: Triage Synthesizer (combining findings) → Fix PR Drafter. (5) Bottom outputs: Triage Report (red/orange if critical), Timeline (JSON icon), Post-Mortem Template (markdown icon). Use dark background, teal/cyan for key nodes, arrows showing data flow, parallel agents shown side-by-side. Timestamps on the left margin showing elapsed time at each stage (e.g., "0sec", "+2sec", "+10sec"). +> *Purpose:* Gives readers the complete end-to-end orchestration—how an incident goes from GitHub issue to actionable triage report and post-mortem, all with timestamps. 
+ +```bash +$ git clone https://github.com/bradygaster/project-squad-sdk-example-incident.git +$ cd project-squad-sdk-example-incident +$ npm install +added 42 packages, and audited 45 packages in 2.3s + +$ npm run build +$ npm run test:run +✓ src/summarizer-agent.test.ts (3 tests) +✓ src/incident-timeline.test.ts (4 tests) +Test Files 6 passed (6) + Tests 18 passed (18) +``` + +Everything builds and tests pass. + +### Step 2: Create an Incident Report + +Create `incident.json` describing the problem: + +```json +{ + "id": "incident-001", + "title": "Production: API latency spike detected", + "service": "api", + "severity": "high", + "description": "Started at 14:32 UTC. Latency went from 50ms to 500ms+. Orders endpoint completely unresponsive.", + "createdAt": "2024-01-15T14:35:00Z", + "labels": ["service:api", "severity:high"] +} +``` + +### Step 3: Create Service Runbooks + +Create `skills/api-runbook.md`: + +```markdown +# API Service Runbook + +## What This Service Does +Primary REST API handling orders, payments, and user operations. Depends on database, cache, and payment gateway. + +## Diagnostic Steps +1. Check API error logs for the last 5 minutes +2. Query metrics: request latency, error rate, CPU usage +3. Inspect recent deployments +4. Check database query performance +5. Check upstream service health (payment gateway, cache) + +## Common Causes and Fixes +- If latency spike: check N+1 queries, scale horizontally, check database connection pool +- If error rate spike: check circuit breaker, inspect logs for specific error codes +- If CPU spike: check for infinite loops or memory leaks in recent deployment +``` + +Create `skills/database-runbook.md`: + +```markdown +# Database Service Runbook + +## Diagnostic Steps +1. Check database connection count +2. Check query performance (slow query log) +3. Check table locks and blocking queries +4. Check replication lag +5. 
Review recent schema changes + +## Common Fixes +- If connection exhausted: scale connection pool, terminate idle connections +- If slow queries: add indexes, rewrite query to avoid N+1 +- If replication lag: check network, reduce write volume +``` + +The system automatically discovers and loads all `.md` files from the `skills/` directory. + +### Step 4: Run the Full Orchestration + +```bash +$ npx squad-incident run incident.json + +✅ Incident intake complete + ID: incident-001 + Service: api + Severity: high + Description: Started at 14:32 UTC. Latency went from 50ms to 500ms+... + +📋 Status: awaiting_approval + +📝 Summary Generated: + What: API latency spike - response time increased from 50ms to 500ms+ + Where: API service (orders-list endpoint) + Severity: high + Likely Cause: Recent deployment introduced N+1 query in batch ordering + Affected Services: api, database + +📅 Timeline entries: 7 + 14:35:00 → Incident created + 14:35:15 → Summary generated + 14:35:45 → Diagnostics routed to API and Database agents + 14:36:30 → API diagnostics complete + 14:36:45 → Database diagnostics complete + 14:37:00 → Triage suggestions drafted + 14:37:15 → Post-mortem template generated + +📄 Decisions: 2 + 1. Route to API service runbook (high confidence) + 2. Route to Database service runbook (dependency check) + +🔧 Draft PR: fix: resolve incident incident-001 — latency spike + Branch: incident/incident-001-api-latency + Description: Rollback batch ordering feature OR implement JOIN optimization + +Done. +``` + +The orchestrator: +1. Parsed the incident JSON +2. Generated a summary (what, where, why) +3. Routed diagnostics through service-specific runbooks (in parallel) +4. Drafted triage suggestions +5. Recorded the timeline +6. Generated a post-mortem template + +All in seconds instead of 30+ minutes. 
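
The routing step in the list above can be sketched as two small functions: one picks which runbooks apply, the other runs their diagnostics concurrently. This is a hedged illustration only. `selectRunbooks`, `runDiagnostics`, the `dependsOn` field, and the `Diagnose` callback are hypothetical, and the real orchestrator may parse dependencies out of the runbook Markdown in some other way.

```typescript
// Hypothetical shapes for illustration; not the example repo's types.
interface Incident {
  id: string;
  service: string;
  severity: string;
}

interface Runbook {
  service: string;
  dependsOn: string[];
}

interface RoutingDecision {
  service: string;
  reason: "primary" | "dependency";
}

// Pick the incident's own runbook plus runbooks for its direct dependencies.
function selectRunbooks(incident: Incident, runbooks: Runbook[]): RoutingDecision[] {
  const primary = runbooks.find((r) => r.service === incident.service);
  if (!primary) return [];
  const decisions: RoutingDecision[] = [{ service: primary.service, reason: "primary" }];
  for (const dep of primary.dependsOn) {
    // Only route to dependencies that actually have a runbook.
    if (runbooks.some((r) => r.service === dep)) {
      decisions.push({ service: dep, reason: "dependency" });
    }
  }
  return decisions;
}

// Caller-supplied diagnostic step; in practice this would invoke an agent.
type Diagnose = (service: string) => Promise<string>;

// Run all selected diagnostics concurrently. A failure in one runbook is
// recorded as a finding instead of discarding the others.
async function runDiagnostics(
  decisions: RoutingDecision[],
  diagnose: Diagnose
): Promise<Record<string, string>> {
  const pairs = await Promise.all(
    decisions.map(async (d) => {
      try {
        return [d.service, await diagnose(d.service)] as const;
      } catch (err) {
        return [d.service, `diagnostic failed: ${String(err)}`] as const;
      }
    })
  );
  const findings: Record<string, string> = {};
  for (const [service, finding] of pairs) findings[service] = finding;
  return findings;
}
```

With an `api` runbook that depends on `database`, an `api` incident yields two routing decisions, matching the "Decisions: 2" shown in the run output.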
+ +### Step 5: Review the Generated Summary + +```bash +$ cat incident-001-summary.json + +{ + "what": "API latency spike: response time increased from 50ms to 500ms+", + "where": ["api", "orders-list-endpoint"], + "severity": "high", + "likely_cause": "Recent deployment introduced N+1 query in batch ordering feature", + "affected_services": ["api", "database"], + "code_references": [ + "src/api/handlers/orders.ts:42-58" + ], + "timeline": [ + { + "timestamp": "2024-01-15T14:32:00Z", + "event": "Latency spike detected" + }, + { + "timestamp": "2024-01-15T14:32:15Z", + "event": "Deployment: batch-ordering feature rolled out" + } + ] +} +``` + +This is what you show the on-call engineer. Machine-readable, no hunting required. + +### Step 6: Review the Triage Report + +```bash +$ cat incident-001-triage.md + +# INCIDENT #001: Production API Latency + +## DIAGNOSIS +━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ +Service: api +Root Cause: N+1 query in orders endpoint (batch ordering feature) +Affected: orders-list endpoint, database connection pool exhaustion +Severity: HIGH +Duration Estimate: 5-20 minutes to resolve + +## DIAGNOSTICS SUMMARY +━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ +✓ API logs: 500ms response times across all orders requests +✓ Deployment: "Add batch ordering feature" merged 90 min ago +✓ Database: query count spike matches latency timeline +✓ Connection pool: Exhausted (max 100, current 98 connections) + +## SUGGESTED ACTIONS (Human Review Required) +━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ + +### Option 1: Rollback (Fastest) +``` +git revert abc1234 -m 1 # batch-ordering commit +git push origin main +# Deployment triggers auto-rollout +``` +Estimated time: 2-3 min +Risk: Low (this is the change that caused it) +Verification: Check API latency metric returns to baseline + +### Option 2: Implement Fix (Better Long-Term) +``` +File: src/api/handlers/orders.ts line 42-58 +Change: Add JOIN to prevent N+1 query +``` +Estimated time: 15-20 min +Risk: Medium (requires testing; 
potential logic change) +Verification: Load test with 1000 concurrent requests + +### Option 3: Scale Connection Pool +Increase from 100 to 200 connections in prod. +Estimated time: 5 min +Risk: Low (temporary relief; doesn't fix root cause) +Verification: Monitor connection usage; expect drop to 20-30 + +## RECOMMENDED PATH +1. Start with Option 1 (rollback) - fastest to restore service +2. Debug option 2 (fix) in staging - better permanent solution +3. Plan Option 3 (scale) - if option 2 needs more testing +``` + +This is the triage report the on-call engineer uses. Three options, risk/reward for each, clear next steps. + +### Step 7: Check the Timeline + +```bash +$ cat incident-001-timeline.json | jq + +[ + { + "timestamp": "2024-01-15T14:35:10", + "action": "incident_created", + "actor": "system", + "details": "Incident #001 created from GitHub issue" + }, + { + "timestamp": "2024-01-15T14:35:15", + "action": "summary_generated", + "actor": "summarizer-agent", + "details": "Identified N+1 query in batch ordering deployment" + }, + { + "timestamp": "2024-01-15T14:35:45", + "action": "diagnostics_routed", + "actor": "diagnostic-router", + "details": "Routed to api and database runbooks" + }, + { + "timestamp": "2024-01-15T14:36:30", + "action": "diagnostics_complete", + "actor": "api-runbook", + "details": "API agent confirmed N+1, suggested rollback or fix" + }, + { + "timestamp": "2024-01-15T14:37:00", + "action": "triage_suggestions_drafted", + "actor": "fix-pr-drafter", + "details": "Generated 3 suggested actions" + } +] +``` + +Append-only audit trail. Every decision, every step, every timestamp. For post-mortem review, compliance, and debugging. 
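
An append-only timeline is simple to model. Below is a minimal sketch under stated assumptions: `IncidentTimeline` is a hypothetical class, the entry fields mirror the JSON shown above, and all timestamps are assumed to use the same ISO 8601 format so that string comparison matches chronological order.

```typescript
// Entry fields mirror the timeline JSON shown in this post.
interface TimelineEntry {
  timestamp: string; // ISO 8601, same format for every entry
  action: string;
  actor: string;
  details: string;
}

// Hypothetical append-only log: entries can be added and read, never edited.
class IncidentTimeline {
  private readonly log: TimelineEntry[] = [];

  append(entry: TimelineEntry): void {
    const last = this.log[this.log.length - 1];
    // Same-format ISO 8601 strings sort lexicographically in time order.
    if (last && entry.timestamp < last.timestamp) {
      throw new Error(`Out-of-order entry: ${entry.timestamp} < ${last.timestamp}`);
    }
    // Store a frozen copy so later changes to the caller's object cannot alter history.
    this.log.push(Object.freeze({ ...entry }));
  }

  // Return a copy so callers cannot mutate the internal array.
  entries(): readonly TimelineEntry[] {
    return this.log.slice();
  }

  serialize(): string {
    return JSON.stringify(this.log, null, 2);
  }
}
```

Rejecting out-of-order entries is a design choice for this sketch. A real system that collects entries from parallel agents might instead accept them in arrival order and sort on read.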
+ +### Step 8: Generate Post-Mortem + +```bash +$ npx squad-incident postmortem incident.json + +Generated: incident-001-post-mortem.md +``` + +Review it: + +```bash +$ cat incident-001-post-mortem.md + +# Post-Mortem: Production API Latency Spike + +**Incident ID:** incident-001 +**Start Time:** 2024-01-15T14:32:00Z +**End Time:** 2024-01-15T14:37:30Z +**Duration:** 5 minutes 30 seconds +**Severity:** HIGH + +## Executive Summary + +Production API experienced latency spike affecting orders endpoint. Response times increased from 50ms to 500ms+. Root cause identified as N+1 query introduced in batch ordering feature deployment. Incident resolved via rollback. + +## Root Cause + +Batch ordering feature (commit abc1234) introduced a query loop that executed one query per order instead of batching. With 100+ concurrent requests, this exhausted the database connection pool. + +## Impact + +- **Services Affected:** API (orders endpoint), Database +- **Users Affected:** All orders placed during 5:30 window +- **Requests Failed:** ~2,300 orders timed out +- **Business Impact:** ~$8k in failed transactions + +## Timeline + +| Time | Event | +|------|-------| +| 14:32:00 | Latency spike detected by monitoring | +| 14:32:15 | Batch ordering feature deployment completed | +| 14:35:00 | On-call engineer paged | +| 14:35:30 | Context gathering started (manual investigation) | +| 14:36:30 | With Squad triage: Root cause identified (N+1 query) | +| 14:37:30 | Rollback initiated | +| 14:38:00 | Latency returned to baseline | + +## What Went Wrong + +1. **Insufficient load testing** — Feature was tested with 10 concurrent users; production has 100+ +2. **No query analysis in CI** — N+1 queries not detected in code review +3. **No pre-deployment metrics** — Couldn't compare before/after performance + +## Lessons Learned + +1. Integration tests must simulate production concurrency (100+ concurrent users) +2. 
Code review should include query performance analysis for new database access patterns +3. Staging should have production-scale data and concurrency before feature rollout + +## Action Items + +- [ ] Add N+1 query detection to CI pipeline (Assign: Backend Lead) (Due: 2024-01-22) +- [ ] Implement integration tests for batch operations with 100+ concurrent users (Assign: QA) (Due: 2024-01-20) +- [ ] Add database query logging to staging environment (Assign: DevOps) (Due: 2024-01-18) +- [ ] Review and update load testing strategy (Assign: Performance Engineer) (Due: 2024-01-25) + +## Follow-Up + +Post-mortem published to Slack #incidents channel. +Action items tracked in JIRA epic INCIDENT-001. +Lessons learned added to runbook: skills/api-runbook.md +``` + +Done. A complete post-mortem, before the on-call engineer's shift ends. + +## Real Use Case: 3 AM Incident + +Incident fires. On-call engineer gets paged. + +**Old way:** +- 30–45 min: Manual investigation (logs, metrics, dependencies) +- 5 min: Implement fix or rollback +- 3 days later: Rushed post-mortem + +**With Squad Incident Response:** +- 10 sec: Parse incident from GitHub issue +- 2–3 min: Squad agents run diagnostics in parallel +- 5–10 min: On-call engineer reviews triage report and implements fix (agents did context-gathering; human does judgment) +- Post-mortem auto-generated same hour, before engineer's shift ends + +The total is shorter, and the human spends it *fixing* instead of *searching*. The difference matters at 3 AM. 
+ +## Three Commands That Show the Value + +```bash +# Full orchestration (intake → diagnostics → triage → post-mortem) +npx squad-incident run incident.json +# Takes 2–3 min, generates all outputs + +# Just produce a summary +npx squad-incident summarize incident.json +# 30 sec, tells you what happened + +# Generate post-mortem after incident resolves +npx squad-incident postmortem incident.json +# Creates markdown, ready to share with team +``` + +## The Honest Scoping + +This is a **reference implementation**. It demonstrates patterns for incident orchestration. What's production-ready: +- Incident parsing from GitHub issues +- Service-specific runbook routing +- Timeline recording and decision logging +- Post-mortem generation with decision logging +- Triage suggestions with risk/reward analysis + +What's not (yet): +- Multi-workspace routing (send diagnostics to different Slack channels) +- Real-time metrics integration (Datadog, New Relic) +- Auto-escalation (critical incidents → on-call) +- Automatic fix proposal (today: template-based suggestions, human review required) +- Incident correlation (detect related incidents) +- Auto-remediation (execute suggested fixes after approval) + +The core loop works. The extensions are where teams customize to their operational model. + +## Getting Started in 5 Minutes + +```bash +# Clone and setup +git clone https://github.com/bradygaster/project-squad-sdk-example-incident.git +cd project-squad-sdk-example-incident +npm install +npm run build + +# Use the example incident +npx squad-incident run examples/incident.json + +# Review outputs +cat incident-001-summary.json | jq +cat incident-001-post-mortem.md +cat incident-001-timeline.json | jq +``` + +Create your own runbooks in `skills/`: + +```markdown +# Cache Service Runbook + +## Diagnostic Steps +1. Check cache hit rate and eviction count +2. Verify eviction policy settings +3. Check memory usage and capacity +4. 
Review recent configuration changes + +## Common Fixes +- If cache churn: adjust TTL values +- If memory pressure: scale horizontally or increase capacity +- If misconfiguration: review deployment diff +``` + +The orchestrator discovers runbooks by service name and routes incidents accordingly. + +## Why This Matters + +On-call work is expensive. Not because engineers are expensive (they are), but because most of incident response is *automatable investigation*—work that AI handles faster and better than humans. + +Your Squad team can handle that part. They: +- Gather context in parallel (not sequentially) +- Check runbook procedures (from your playbook, not tribal knowledge) +- Log everything (audit trail for post-mortems) +- Surface findings to a human (who has context and judgment) + +The human still makes the final call. But they make it with complete context in 5 minutes instead of searching for 45. + +For SRE teams: +- On-call engineers resolve incidents faster +- Post-mortems are data-rich, not rushed +- Runbooks stay current (versioned alongside code) + +For platform teams: +- Reference implementation you can extend +- Patterns for multi-agent orchestration +- Blueprint for other automation workflows + +--- + +Read the [repo](https://github.com/bradygaster/project-squad-sdk-example-incident) and the [quickstart](https://github.com/bradygaster/project-squad-sdk-example-incident/blob/main/QUICKSTART.md). + +Your on-call team shouldn't spend half an incident shift searching for the problem. Let your Squad help triage. Save human judgment for decisions. 
diff --git a/website/blog/2026-05-12-knowledge-operations.md b/website/blog/2026-05-12-knowledge-operations.md new file mode 100644 index 0000000..c4e1670 --- /dev/null +++ b/website/blog/2026-05-12-knowledge-operations.md @@ -0,0 +1,317 @@ +--- +slug: "/2026-05-12-knowledge-operations" +canonical_url: "https://dfberry.github.io/blog/2026-05-12-knowledge-operations" +custom_edit_url: null +sidebar_label: "2026.05.12 Knowledge Operations" +title: "Your Squad Team Learns Every Session. Here's How to Capture and Reuse That Knowledge" +description: "A knowledge-operations workflow for turning repeated agent decisions into reusable team memory." +draft: true +tags: + - "AI Agents" + - "Knowledge Management" + - "Squad" + - "Developer Workflow" + - "AI-assisted" +keywords: + - "knowledge operations" + - "agent memory" + - "decision capture" + - "team memory" + - "squad" +updated: "2026-05-12 00:00 PST" +--- + +Every time your Squad agents work, they accumulate patterns. They learn what error handling looks like in your codebase, what your testing conventions are, how your team tackles null safety. After a few weeks, your agents have seen hundreds of examples. But that knowledge stays locked in history logs. + +Most teams lose this knowledge entirely. New agents spawn without it. New humans join the team and waste weeks rediscovering patterns. You end up reinventing the same solutions repeatedly, each time less efficiently than the last. + +I built a governance system to extract those patterns, turn them into reusable skills, and make them searchable. The result is [squad-knowledge](https://github.com/bradygaster/project-squad-sdk-example-knowledge)—a framework that transforms raw agent history into governed, discoverable team knowledge. + +## The Problem: Knowledge Trapped in Logs + +Teams run Squad agents for weeks or months. The agents make thousands of small decisions: how to structure error messages, when to use async/await, which null checks are actually necessary. 
These decisions are consistent—your team *has* standards—but nobody writes them down. They stay buried in agent history files. + +When new agents spawn, they don't inherit this knowledge. They start from scratch, making the same decisions agents made three weeks ago. When new humans join the team, onboarding takes weeks because the patterns are invisible. When you create a skill, there's no governance to prevent duplicates or contradictions. You end up with conflicting guidance. + +The deeper problem: you have the data (agents saw 500 error handling examples), but you're not using it. That's leaving performance on the table. + +## The Solution: Automated Pattern Discovery + Human Approval + +The knowledge operations framework has four phases: + +**Phase 1: Discovery.** Scan agent histories and extract repeated phrases using n-gram analysis. "Check for null" appears 47 times across 6 agents? That's a pattern worth capturing. The system measures frequency, agent breadth, and confidence automatically. + +**Phase 2: Generation.** Turn discovered patterns into skill candidates with automatic deduplication against existing skills. No duplicates. No conflicts. Just new patterns that don't already exist. + +**Phase 3: Approval.** A human reviews each candidate and approves it as a formal skill. No automation spam. Humans stay in the loop for quality control. + +**Phase 4: Search.** Index all team memory (history + decisions) and make it searchable with source attribution. "How do we handle errors?" returns examples from 3 agents with citations. 
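
The Phase 1 idea, counting repeated phrases and how many agents used them, can be sketched with a plain n-gram counter. This is an illustration under assumptions: `discoverPatterns` and its parameters are hypothetical, tokenization is deliberately naive, and no confidence score is computed because this post does not define one.

```typescript
interface PatternStats {
  phrase: string;
  occurrences: number;
  agents: string[]; // distinct agents that used the phrase, sorted
}

// Naive tokenizer: lowercase, split on anything that is not a letter or digit.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0);
}

// All contiguous runs of n tokens, joined with spaces.
function ngrams(tokens: string[], n: number): string[] {
  const out: string[] = [];
  for (let i = 0; i + n <= tokens.length; i++) {
    out.push(tokens.slice(i, i + n).join(" "));
  }
  return out;
}

// histories maps an agent name to its session texts (one string per session).
function discoverPatterns(
  histories: Record<string, string[]>,
  n: number,
  minOccurrences: number
): PatternStats[] {
  const counts = new Map<string, { occurrences: number; agents: Set<string> }>();
  for (const agent of Object.keys(histories)) {
    for (const session of histories[agent] ?? []) {
      for (const phrase of ngrams(tokenize(session), n)) {
        const stats = counts.get(phrase) ?? { occurrences: 0, agents: new Set<string>() };
        stats.occurrences += 1;
        stats.agents.add(agent);
        counts.set(phrase, stats);
      }
    }
  }
  return Array.from(counts)
    .filter(([, s]) => s.occurrences >= minOccurrences)
    .map(([phrase, s]) => ({
      phrase,
      occurrences: s.occurrences,
      agents: Array.from(s.agents).sort(),
    }))
    // Rank by frequency, then by agent breadth, then alphabetically for stable output.
    .sort(
      (a, b) =>
        b.occurrences - a.occurrences ||
        b.agents.length - a.agents.length ||
        a.phrase.localeCompare(b.phrase)
    );
}
```

A raw counter like this also surfaces uninteresting phrases (stop-word runs, boilerplate), which is one reason the human approval phase matters.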
+ +> 📊 **[DIAGRAM: Knowledge Lifecycle Pipeline]** +> *Prompt for image generation:* A vertical flow diagram showing 4 phases stacked: (1) Agent History (cloud icon with documents) → (2) "N-gram Discovery" (magnifying glass processing box in teal) → (3) "Candidates" (stack of cards, blue) → (4) "Human Approval" (person icon checkmark, teal border) → (5) "Approved Skill" (trophy/star icon, gold accent) with a feedback arrow from (5) back to (1) labeled "Reuse". Dark background, clear stage labels, blue/teal/gold color scheme, minimal line decoration. +> *Purpose:* Visualizes the full knowledge lifecycle from raw history through discovery, approval, to reusable skills—shows both the automation (n-gram discovery) and human judgment (approval) loop. + +## Setup + +```bash +# Clone the repository +git clone https://github.com/bradygaster/project-squad-sdk-example-knowledge.git +cd project-squad-sdk-example-knowledge + +# Install and build +npm install && npm run build + +# Make the CLI globally available +npm link +``` + +Verify setup: + +```bash +squad-knowledge --help +``` + +## Sample Data Workflow + +The repo ships with sample data in `examples/.squad/` so you can see it in action immediately without your own history files. 
+ +**Step 1: Discover patterns** + +```bash +squad-knowledge discover examples/.squad +``` + +Output: + +``` +Scanning 20 history entries… + +Found 12 patterns: + • "check for null" (5 occurrences, 2 agents) + • "async/await" (4 occurrences, 2 agents) + • "error handling" (6 occurrences, 3 agents) + • "try-catch wrapper" (3 occurrences, 1 agent) + • "early return pattern" (4 occurrences, 2 agents) + • "validation on input" (5 occurrences, 3 agents) + +Generated 6 skill candidates → examples/.squad/candidates.json +Memory index built → examples/.squad/memory-index.json (22 documents) +``` + +The framework extracted n-grams from agent histories, ranked them by frequency, and identified high-confidence patterns (ones that appear across multiple agents). + +**Step 2: Check status** + +```bash +squad-knowledge status +``` + +Output: + +``` +Knowledge Status +──────────────────────────────────────── + Candidates: 6 total + Pending: 6 + Approved: 0 + Rejected: 0 + + Memory Index: 22 documents, updated 2024-01-15T10:30:00.000Z + Stale patterns: 0 + Avg confidence: 0.72 (medium) +``` + +**Step 3: Approve a candidate** + +Copy a candidate ID from the discover output and approve it: + +```bash +squad-knowledge approve candidates/null-check-pattern +``` + +Output: + +``` +✅ Candidate approved: null-check-pattern + Generated examples/.squad/skills/null-safety-checks.md + Confidence: medium (will upgrade to high as pattern reuses) +``` + +This generates a full SKILL.md file with frontmatter, agent attribution, and usage examples: + +```markdown +--- +id: null-safety-checks +title: Null Safety Checks +confidence: medium +discovered_by: + - alice + - bob +first_session: 2024-01-10 +occurrences: 5 +--- + +# Null Safety Checks + +Pattern detected in 5 agent sessions across 2 agents. Agents consistently validate inputs before property access. 
+ +## Usage + +Always check for null before accessing object properties: + +```typescript +if (user && user.profile && user.profile.name) { + // Safe to use user.profile.name +} +``` + +## Why This Matters + +Prevents null pointer exceptions. Critical in TypeScript projects with strict null checking enabled. + +## Attribution + +- Alice (Session 001, 002, 003) +- Bob (Session 005, 010) +``` + +**Step 4: Search team memory** + +```bash +squad-knowledge search "null safety" +``` + +Output: + +``` +Found 5 results: + + 1. [score 8.0] alice (agent_history) + "Session 001: Discussed null pointer errors. Always validate before property…" + Matched: null, safety + Source: examples/.squad/agent-histories/alice.txt + + 2. [score 6.0] bob (agent_history) + "Session 005: Always check for null before property access. Critical lesson.…" + Matched: null, check + Source: examples/.squad/agent-histories/bob.txt + + 3. [score 4.2] decisions (team_decision) + "2024-01-error-strategy: Consensus on defensive programming practices" + Matched: safety + Source: examples/.squad/decisions/2024-01-error-strategy.md +``` + +Results show who did it first (alice), which sessions covered it, and full source attribution for traceability. + +## Using Your Own Data + +Replace `examples/.squad/` with your real `.squad/` directory. Required structure: + +``` +.squad/ +├── agent-histories/ +│ ├── alice.txt # one session per line +│ └── bob.txt +└── decisions/ + └── 2024-01-strategy.md +``` + +Then run: + +```bash +squad-knowledge discover .squad +squad-knowledge status +squad-knowledge approve +squad-knowledge search "error handling" +``` + +The discovery phase scans all files, extracts n-grams with configurable frequency thresholds, deduplicates automatically, and ranks candidates by frequency and agent breadth. + +Once approved, candidates become formal skills in `.squad/skills/` with full SKILL.md frontmatter and agent attribution. 
When you search, results show who first demonstrated the pattern, which sessions featured it, and relevance scores—full traceability. + +## Real Walkthrough: Building a Living Skill Registry + +Here's the full workflow for a real team over one month: + +**Week 1:** Agents run on various tasks, accumulating history. + +```bash +squad-knowledge discover .squad +# Found 8 patterns across agent histories +``` + +> 📊 **[DIAGRAM: Confidence Progression Over Time]** +> *Prompt for image generation:* A line chart showing confidence level (Y-axis: 0–1 scale from "low" to "high") over weeks (X-axis: Week 1–4). Draw 3 curved lines: "null-check-pattern" starting at 0.3 Week 1, rising to 0.7 by Week 4 (solid teal). "async-await" starting at 0.4, rising to 0.8 (solid blue). "error-handling" starting at 0.2, rising to 0.9 (solid cyan). Mark approval moments with checkmarks (✓). Dark background, labeled axes, legend identifying each pattern, grid lines. +> *Purpose:* Shows how confidence compounds as patterns reappear—reinforces the idea that patterns prove themselves over time, building institutional knowledge incrementally rather than via one-time decisions. + +**Week 2:** Review and approve high-confidence patterns. + +```bash +squad-knowledge approve error-handling-try-catch +squad-knowledge approve validation-input-schema +squad-knowledge approve async-await-usage +``` + +**Week 3:** New agents can now search the registry and inject patterns into their prompts. + +```bash +squad-knowledge search "how do we handle async?" +# Returns 3 approved skills with examples and agent attribution +``` + +**Week 4:** Staleness detection flags outdated patterns. + +```bash +squad-knowledge status +# Shows: "promise-callback pattern - stale (30+ days), recommend review" +``` + +## What Makes This Different + +Most knowledge management tools are write-once, read-never. Someone documents something, and then nobody reads it. 
This framework is write-automatic: patterns are discovered from *actual* agent behavior, approved by humans, and immediately searchable. The knowledge exists because agents actually used it, not because someone wrote documentation. + +The confidence tracking matters. New skills start at low confidence. As they appear in subsequent agent spawns, confidence upgrades to medium → high. This means new patterns need to prove themselves before the team relies on them. + +Staleness detection flags outdated patterns not referenced in recent sessions. If a pattern hasn't shown up in 30 days, it gets flagged. Your team's best practices evolve; the system should reflect that. + +## Honest Scoping + +**What this does:** +- Extract patterns from agent histories using n-gram analysis +- Generate skill candidates with deduplication +- Provide human approval workflow +- Index and search team memory with relevance ranking +- Track confidence as patterns reuse + +**What this doesn't do:** +- Automatically approve patterns (humans review everything) +- Prevent you from approving bad patterns (it's advisory, not enforcement) +- Teach agents to use skills (that's on you to inject into prompts) +- Handle proprietary/confidential patterns specially (you should review before discovering) + +## Next Steps + +1. **Set discovery thresholds.** Tune minFrequency and minPhraseLength to match your team's pattern density. +2. **Integrate with your squad.** Wire discovered skills into agent prompts at spawn time. +3. **Monitor staleness.** Set up alerts when patterns go unused for N days. +4. **Build organization-wide registries.** Export approved skills to a central location for all teams to use. 
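To make the threshold tuning in step 1 more concrete, here is a minimal sketch of what n-gram discovery can look like. It is an illustration under stated assumptions, not the framework's actual implementation: `discoverPatterns`, `HistoryEntry`, and `Pattern` are hypothetical names, and the meanings given to `minFrequency` (minimum total occurrences) and `minPhraseLength` (n-gram size in words) are guesses based on the option names.

```typescript
// Hypothetical sketch of n-gram pattern discovery; not the squad-knowledge source.
interface HistoryEntry {
  agent: string; // agent name, e.g. "alice"
  text: string;  // text of one session from that agent's history
}

interface Pattern {
  phrase: string;
  occurrences: number;
  agents: string[];
}

interface DiscoverOptions {
  minFrequency: number;    // assumed meaning: minimum total occurrences to keep a phrase
  minPhraseLength: number; // assumed meaning: n-gram size, in words
}

function discoverPatterns(entries: HistoryEntry[], opts: DiscoverOptions): Pattern[] {
  const counts = new Map<string, { occurrences: number; agents: Set<string> }>();

  for (const entry of entries) {
    // Lowercase and split into alphanumeric words.
    const words = entry.text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
    // Slide a window of minPhraseLength words over the session text.
    for (let i = 0; i + opts.minPhraseLength <= words.length; i++) {
      const phrase = words.slice(i, i + opts.minPhraseLength).join(" ");
      const record = counts.get(phrase) ?? { occurrences: 0, agents: new Set<string>() };
      record.occurrences += 1;
      record.agents.add(entry.agent);
      counts.set(phrase, record);
    }
  }

  return Array.from(counts.entries())
    .filter(([, record]) => record.occurrences >= opts.minFrequency)
    .map(([phrase, record]) => ({
      phrase,
      occurrences: record.occurrences,
      agents: Array.from(record.agents).sort(),
    }))
    // Rank by frequency, then by agent breadth, then alphabetically for stable output.
    .sort(
      (a, b) =>
        b.occurrences - a.occurrences ||
        b.agents.length - a.agents.length ||
        a.phrase.localeCompare(b.phrase)
    );
}
```

Raising `minFrequency` trims one-off phrases, while a larger `minPhraseLength` favors longer, more specific phrases at the cost of fewer matches. That is the tradeoff to keep in mind when tuning for your team's pattern density.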
+ +## Get Started + +```bash +# Clone and setup (5 minutes) +git clone https://github.com/bradygaster/project-squad-sdk-example-knowledge.git +cd project-squad-sdk-example-knowledge +npm install && npm run build && npm link + +# Try it with sample data (2 minutes) +squad-knowledge discover examples/.squad +squad-knowledge search "null safety" + +# Then point it at your real data (1 minute) +squad-knowledge discover .squad +squad-knowledge approve +``` + +Eight minutes later, you have a searchable skill registry discovered from your team's actual behavior. New agents can learn from it. New humans can onboard with it. Your best practices become explicit, governed, and reusable. diff --git a/website/blog/2026-05-12-multi-model-ab-testing.md b/website/blog/2026-05-12-multi-model-ab-testing.md new file mode 100644 index 0000000..11a94a0 --- /dev/null +++ b/website/blog/2026-05-12-multi-model-ab-testing.md @@ -0,0 +1,262 @@ +--- +slug: "/2026-05-12-multi-model-ab-testing" +canonical_url: "https://dfberry.github.io/blog/2026-05-12-multi-model-ab-testing" +custom_edit_url: null +sidebar_label: "2026.05.12 Model A/B Testing" +title: "Which AI Model Is Best for YOUR Codebase? Stop Guessing, Start Testing" +description: "A framework for comparing models against your own workloads instead of relying on vibes or vendor marketing." +draft: true +tags: + - "AI Agents" + - "Model Evaluation" + - "Developer Workflow" + - "Squad" + - "AI-assisted" +keywords: + - "model ab testing" + - "llm evaluation" + - "codebase benchmarking" + - "ai model selection" + - "squad" +updated: "2026-05-12 00:00 PST" +--- + +Every week, someone on a team asks: "Should we use Claude or GPT for this?" The answer used to be tribal knowledge—whatever we used last time, or what the vendor was pushing that month. Nobody had real data. Today, with 16 models in the [Squad SDK](https://github.com/bradygaster/squad) catalog and three new models landing monthly, guessing is expensive. 
+ +Generic benchmarks (LMSYS leaderboards, provider scorecards) test models on synthetic problems—math puzzles, roleplay, general trivia. They don't tell you which model generates the best tests for *your* test patterns, or which can refactor *your* codebase idioms fastest. A model that excels at generic code generation might struggle with your domain-specific architecture. And the cost column? Usually missing. + +I built a framework to stop guessing. It runs identical tasks across models, collects cost/quality/speed metrics, and shows you which model wins for *your* specific codebase. No benchmark leaderboards. Just your code, your patterns, and hard numbers. + +## The Problem: Tribal Model Selection + +When you don't know which model to use, you either overspend on premium models when cheaper alternatives work just as well, or undershoot on quality because you don't know which model performs best for your specific tasks. Your team makes assumptions that don't hold. + +Meanwhile, every month a new model ships. Do you switch? Run parallel experiments? Most teams just stick with what they know, leaving money on the table and performance on the bench. + +The real cost isn't the few dollars per test—it's the opportunity cost of not knowing which model is right for your workflow. A model that excels at code generation might be terrible at refactoring. A cheap model might hit rate limits because it requires more retries. You need data, not intuition. + +## The Solution: Local A/B Testing Framework + +The [squad-ab-testing](https://github.com/bradygaster/squad-ab-testing) framework lets you run your real tasks against multiple models in parallel, measure output quality, collect token costs, and get a ranked comparison table. + +Think of it as running controlled experiments on your own code. No code writing required—configure experiments as JSON, run CLI commands, read results. 
The framework handles parallel orchestration, metrics collection, statistical analysis, and reporting. + +> 📊 **[DIAGRAM: Experiment Flow Pipeline]** +> *Prompt for image generation:* A horizontal flow diagram showing: (1) Configuration JSON icon → (2) Config Loader box → (3) Task Dispatcher splitting into parallel paths for GPT-4, Claude, GPT-3.5 (show 3 boxes side-by-side) → (4) Metrics Collector merging the paths → (5) Comparison Table output. Use dark background, teal/blue accent colors for boxes, white arrows showing flow direction. Add labels: "config", "load", "parallel runs", "metrics", "results". Style: clean lines, minimal decoration. +> *Purpose:* Helps readers visualize how experiments move from configuration through parallel model execution to aggregated results—demystifies the orchestration process. + +## Setup + +```bash +# Clone the repository +git clone https://github.com/bradygaster/squad-ab-testing.git +cd squad-ab-testing + +# Install and build +npm install && npm run build + +# Make the CLI globally available +npm link +``` + +Verify the setup works: + +```bash +npm run test +``` + +Expected output: All tests pass (✓). + +## Create Your First Experiment Config + +```bash +squad-ab-test init +``` + +This generates `experiment.json`. Open it and customize: + +```json +{ + "name": "code-generation-comparison", + "task": { + "prompt": "Write a TypeScript function that validates email addresses using regex. 
Include tests.", + "inputFiles": [], + "evaluator": "test-pass-rate" + }, + "models": ["gpt-4o", "claude-sonnet-4-20250514", "gpt-3.5-turbo"], + "repetitions": 3, + "budget": { + "maxPerRun": 5000, + "maxTotal": 50000 + } +} +``` + +**Config breakdown:** +- **name**: Unique identifier for this experiment +- **task.prompt**: Instruction sent to each model +- **task.inputFiles**: Optional context files (relative paths) +- **task.evaluator**: Quality metric (`test-pass-rate`, `lint-score`, or custom) +- **models**: List of models to compare +- **repetitions**: How many times to run each model for statistical confidence +- **budget**: Optional token/cost limits to prevent runaway expenses + +## Running Your First Experiment + +```bash +squad-ab-test run experiment.json +``` + +The framework spawns agents for each model in parallel, runs your task against each one, collects metrics, aggregates results, and ranks them: + +``` +Experiment: code-generation-comparison +Date: 2025-01-15T10:30:00.000Z +N=3 repetitions + +Model | Avg Cost | Avg Latency | Quality | Stddev +──────────────────────────────────────────────────────────────────────── +gpt-4o | 0.0045 | 1250ms | 0.950 | 0.030 +claude-sonnet-4 | 0.0038 | 980ms | 0.920 | 0.050 +gpt-3.5-turbo | 0.0012 | 450ms | 0.870 | 0.080 +``` + +## Interpreting the Results Table + +Each column answers a specific question: + +- **Avg Cost**: Average token cost per run (lower = cheaper). Use this to calculate monthly/yearly spend for production tasks. +- **Avg Latency**: Average response time in milliseconds. Matters more for interactive workflows, less for batch jobs. +- **Quality**: Evaluator result on 0–1 scale (higher = better output). This is the business metric that matters most. +- **Stddev**: Standard deviation across repetitions. Models with low Stddev are more predictable and consistent. + +> 📊 **[DIAGRAM: Cost vs. 
Quality Tradeoff Scatterplot]** +> *Prompt for image generation:* A scatter plot with Cost (dollars) on X-axis (0.001 to 0.010) and Quality (0–1) on Y-axis. Plot 3 data points: (0.0045, 0.950) labeled "gpt-4o" in teal, (0.0038, 0.920) labeled "claude-sonnet" in blue, (0.0012, 0.870) labeled "gpt-3.5-turbo" in cyan. Add a dotted diagonal line showing the tradeoff curve. Include grid lines (subtle), axis labels, and a legend. Dark background, accent colors for labels and dots. +> *Purpose:* Visually communicates the cost/quality spectrum—readers instantly see which models are "cheap but rough" vs. "expensive but best," making tradeoff decisions intuitive. + +**Now you can answer hard questions with actual data:** + +1. **Best overall?** Look at the Quality column. gpt-4o wins at 0.950. +2. **Cost vs. quality?** Compare Cost against Quality. claude-sonnet-4 is 16% cheaper (0.0038 vs 0.0045) with only 3% lower quality (0.920 vs 0.950). Is that tradeoff worth it? That's your call. +3. **Speed matters?** gpt-3.5-turbo is fastest (450ms). If you're generating docs in a batch job that runs once a day, speed doesn't matter. If you're in a tight loop, it does. +4. **Consistency?** Models with low Stddev are safer bets. gpt-4o is most consistent (0.030); gpt-3.5-turbo is least (0.080). + +## Real Walkthrough: Testing Code Review Quality + +Here's a concrete example: testing which model generates the best code review comments. + +**Step 1: Create experiment.json** + +```json +{ + "name": "code-review-quality", + "task": { + "prompt": "Review this function and identify bugs, performance issues, and style violations. 
Be specific.", + "inputFiles": ["src/utils.ts"], + "evaluator": "lint-score" + }, + "models": ["gpt-4o", "claude-opus-4.6", "gpt-4-turbo"], + "repetitions": 3 +} +``` + +**Step 2: Run the experiment** + +```bash +squad-ab-test run experiment.json --output results/ +``` + +**Step 3: Analyze results** + +``` +Model | Avg Cost | Avg Latency | Quality | Stddev +──────────────────────────────────────────────────────────────────────── +claude-opus-4.6 | 0.0062 | 1850ms | 0.980 | 0.020 +gpt-4o | 0.0045 | 1250ms | 0.950 | 0.030 +gpt-4-turbo | 0.0035 | 920ms | 0.920 | 0.040 +``` + +**Verdict:** claude-opus-4.6 generates the best reviews (0.980 quality). It costs about 38% more than gpt-4o (0.0062 vs 0.0045), but your code reviews are measurably better. For a team that reviews code daily, that difference compounds. + +**Step 4: Decision** + +- Use claude-opus-4.6 for review-quality tasks (high accuracy needed) +- Use gpt-4o for general code generation (best balance) +- Use gpt-4-turbo for batch analysis (speed + lower cost) + +## What Makes This Framework Different + +Most benchmarks test generic tasks. This tests *your* specific workflow. A model ranked #3 on LMSYS might be #1 for your codebase. The data comes from your code, your patterns, your evaluators. + +The first time I ran this on my codebase, results surprised me: +- GPT-3.5-turbo outperformed Claude Opus on test generation—despite Opus being more expensive and marketed as the "quality" choice +- But on documentation tasks, Opus crushed GPT-4o +- The expensive model was wrong for two out of three tasks + +Tribal knowledge would have locked me into the expensive choice. Data let me optimize.
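The percentage comparisons quoted in the walkthroughs above are plain relative differences against a baseline model. A small helper makes the arithmetic explicit. This is a sketch, not part of squad-ab-testing: `percentDiff` is a hypothetical name, and the numbers are copied from the two result tables in this post.

```typescript
// Hypothetical helper for reading a results table; not part of squad-ab-testing.

// Relative difference of `value` against `baseline`, in percent.
// Negative means `value` is lower than the baseline.
function percentDiff(value: number, baseline: number): number {
  return ((value - baseline) / baseline) * 100;
}

// claude-sonnet-4 vs gpt-4o (code generation table)
console.log(percentDiff(0.0038, 0.0045).toFixed(1)); // "-15.6": about 16% cheaper
console.log(percentDiff(0.92, 0.95).toFixed(1));     // "-3.2": about 3% lower quality

// claude-opus-4.6 vs gpt-4o (code review table)
console.log(percentDiff(0.0062, 0.0045).toFixed(1)); // "37.8": about 38% more expensive
```

Always divide by the baseline you are comparing against. Swapping the two arguments gives a different percentage, which is an easy way to misreport a tradeoff.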
+ +## Extending It: Custom Evaluators + +The framework ships with `test-pass-rate` and `lint-score`, but you can register custom evaluators: + +```typescript +import { EvaluatorRegistry } from './src/evaluators/EvaluatorRegistry.js'; + +const registry = new EvaluatorRegistry(); + +registry.register('readability-score', async (output: string) => { + // Your logic — return 0–1 score + const complexity = analyzeComplexity(output); + const clarity = analyzeClarity(output); + return (complexity + clarity) / 2; +}); +``` + +Then reference it in your config: + +```json +{ + "evaluator": "readability-score" +} +``` + +Now you can measure whatever you care about: adherence to team style guides, performance benchmarks, accessibility compliance, or domain-specific quality metrics. + +## Honest Scoping + +**What this does:** +- Compare models on identical tasks with statistical confidence +- Measure cost, speed, and quality simultaneously +- Provide data-driven model selection + +**What this doesn't do:** +- Integrate with production model APIs automatically (you need to wire that yourself) +- Handle long-running tasks (it's for discrete, bounded tasks like "generate a test" or "review code") +- Make the decision for you (you still decide what matters more: cost, speed, or quality) + +## What's Next + +Once you have data: + +1. **Schedule regular experiments.** When a new model ships, test it. When your codebase evolves, rerun old experiments to see if your model choice still holds. +2. **Build decision trees.** Different models for different tasks. gpt-3.5-turbo for simple generation, Opus for complex reasoning, gpt-4o for the middle ground. +3. **Monitor in production.** Log which model you used and track actual metrics (customer satisfaction, bug escape rates). Adjust your experiments to match real outcomes. +4. **Share results across teams.** Your results are probably useful to sibling teams. Document your experiment methodology so others can reproduce it. 
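One way to turn step 2 into something mechanical is to pick, per task, the cheapest model that clears a quality floor. The sketch below assumes you have copied experiment results into a plain array yourself. `TaskResult` and `cheapestAboveFloor` are hypothetical names, not framework APIs, and the task labels and floor values are illustrative choices.

```typescript
// Hypothetical model picker built on experiment results; not part of squad-ab-testing.
interface TaskResult {
  task: string;    // label you assign to an experiment, e.g. "code-review"
  model: string;
  avgCost: number; // average cost per run, as reported in the results table
  quality: number; // evaluator score on a 0-1 scale
}

// Returns the cheapest model whose quality meets the floor for the given task.
// If no model meets the floor, falls back to the highest-quality model.
// Returns undefined when there are no results for the task.
function cheapestAboveFloor(
  results: TaskResult[],
  task: string,
  minQuality: number
): string | undefined {
  const candidates = results.filter((r) => r.task === task);
  if (candidates.length === 0) return undefined;

  const eligible = candidates.filter((r) => r.quality >= minQuality);
  if (eligible.length > 0) {
    return eligible.reduce((a, b) => (b.avgCost < a.avgCost ? b : a)).model;
  }
  return candidates.reduce((a, b) => (b.quality > a.quality ? b : a)).model;
}

// Numbers copied from the two result tables in this post.
const results: TaskResult[] = [
  { task: "code-generation", model: "gpt-4o", avgCost: 0.0045, quality: 0.95 },
  { task: "code-generation", model: "claude-sonnet-4", avgCost: 0.0038, quality: 0.92 },
  { task: "code-generation", model: "gpt-3.5-turbo", avgCost: 0.0012, quality: 0.87 },
  { task: "code-review", model: "claude-opus-4.6", avgCost: 0.0062, quality: 0.98 },
  { task: "code-review", model: "gpt-4o", avgCost: 0.0045, quality: 0.95 },
  { task: "code-review", model: "gpt-4-turbo", avgCost: 0.0035, quality: 0.92 },
];

console.log(cheapestAboveFloor(results, "code-generation", 0.9)); // claude-sonnet-4
console.log(cheapestAboveFloor(results, "code-review", 0.95));    // gpt-4o
```

The floor is the policy knob. Where you set it for each task is still a human decision, which fits the framework's stance that it supplies data rather than verdicts.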
+ +## Get Started + +```bash +# Clone and setup (5 minutes) +git clone https://github.com/bradygaster/squad-ab-testing.git +cd squad-ab-testing +npm install && npm run build && npm link + +# Create and run an experiment (2 minutes) +squad-ab-test init +squad-ab-test run experiment.json + +# Analyze results (1 minute) +# Read the table, make a decision +``` + +Eight minutes later, you have data instead of guesses. That's the difference between tribal model selection and evidence-driven optimization. No more "we've always used Claude." Just: "The data says Claude is 8% better for this specific task, and it costs 12% more. Is that worth it?" diff --git a/website/blog/2026-05-12-policy-as-code-for-ai-teams.md b/website/blog/2026-05-12-policy-as-code-for-ai-teams.md new file mode 100644 index 0000000..17b7543 --- /dev/null +++ b/website/blog/2026-05-12-policy-as-code-for-ai-teams.md @@ -0,0 +1,203 @@ +--- +slug: "/2026-05-12-policy-as-code-for-ai-teams" +canonical_url: "https://dfberry.github.io/blog/2026-05-12-policy-as-code-for-ai-teams" +custom_edit_url: null +sidebar_label: "2026.05.12 Policy as Code" +title: "Policy as Code for AI Agent Teams" +description: "A practical model for turning agent guardrails into versioned, testable policy instead of scattered prompts." +draft: true +tags: + - "AI Agents" + - "Governance" + - "Squad" + - "Security" + - "AI-assisted" +keywords: + - "policy as code" + - "ai agent governance" + - "agent guardrails" + - "review gates" + - "squad" +updated: "2026-05-12 00:00 PST" +--- + +Your AI agents follow instructions written in markdown. Charters, routing rules, copilot-instructions.md — all prose. What happens when an agent ignores them? + +Nothing. There's no enforcement. An agent can `git add .` even though your instructions say "never do that." There's no audit log when it happens. You find out when something breaks. 
+ +I spent a week designing a policy layer for [Squad](https://github.com/bradygaster/squad) and similar AI agent frameworks. Here's what I came up with and what four specialist reviewers told me was wrong with it. + +## The Problem + +After using Squad for a few weeks, I noticed these gaps: + +1. **No enforcement.** Instructions are advisory. Agents usually follow them. Usually. +2. **No audit trail.** When an agent makes a bad decision, there's no log of what happened or why. +3. **Policy is scattered.** Rules live in `copilot-instructions.md`, agent charters, `decisions.md`, `ceremonies.md`, and `.gitattributes`. No single file to point at. +4. **No inheritance.** Org-wide policies ("never commit secrets") must be copy-pasted into every repo. +5. **No adopter guidance.** New teams don't know which decisions to make. They discover policies after something goes wrong. + +## Three Tiers + +The core idea is a three-tier policy model: + +``` +┌─────────────────────────────────────────────────────┐ +│ Tier 1: Platform Policies (hard block) │ +│ Enforced by: Squad CLI middleware │ +│ Override: Nobody │ +├─────────────────────────────────────────────────────┤ +│ Tier 2: Team Policies (soft block) │ +│ Enforced by: Coordinator │ +│ Override: Human can waive with justification │ +├─────────────────────────────────────────────────────┤ +│ Tier 3: Agent Preferences (advisory) │ +│ Enforced by: Agent reads at session start │ +│ Override: Agent adapts per context │ +└─────────────────────────────────────────────────────┘ +``` + +**Tier 1** is git safety, file access controls, secret scanning, MCP allowlists. Non-negotiable. If an agent tries to `git add .`, the CLI blocks it before the command runs. + +**Tier 2** is review gates, human approval requirements, PR scope rules, reviewer lockout. The coordinator checks these before spawning agents. A human can waive with `/policy waive --scope pr`. 
+ +**Tier 3** is agent-specific preferences — test coverage thresholds, max function length, review pipeline order. Agents read these at session start and adapt. + +## The Policy File + +Everything lives in `.squad/policy.yaml`: + +```yaml +version: 1 +kind: squad-policy + +platform: + git: + staging: explicit-only # blocks git add . / git add -A + force-push: deny + branches-protected: [main, live, release/*] + max-files-per-commit: 20 + + files: + deny-write: + - "*.env" + - ".squad/policy.yaml" # agents can't weaken their own policy + - ".github/hooks/**" # agents can't modify enforcement + - ".squad/audit-log/**" # agents can't tamper with audit + deny-delete: + - ".squad/policy.yaml" + - ".squad/decisions.md" + + secrets: + scan-before-commit: true + provider: built-in # built-in | github-secret-scanning | detect-secrets + + mcps: + allowlist: [github, jira, confluence, datadog, slack, pagerduty] + + rate-limits: + max-violations-per-session: 10 # auto-terminate after 10 blocks + +team: + review: + required-passes: [security-review, code-review] + human-approval-required: [deploy, merge-to-protected-branch] + reviewer-lockout: true # author can't review own work + waiver-authorized: ["CODEOWNERS"] + waiver-scope: per-pr + + scope: + max-pr-files: 20 + separate-infra-and-product: true + + audit: + storage: local # local (gitignored) | external + rotation-days: 30 + +agent-preferences: + tester: + coverage-threshold-percent: 80 + require-integration-tests: true + lead: + review-pipeline: [security-review, code-review] +``` + +## Enforcement Points + +Policy is checked at six points in the agent lifecycle: + +| When | What's checked | What happens on violation | +|------|---------------|-------------------------| +| Session start | Load + merge policy chain. Validate schema. 
| Session won't start if policy is invalid | +| Pre-tool-use | File access, git commands, MCP calls | CLI blocks the action with actionable error | +| Pre-commit | Secret scan, file count, staging method | Commit is rejected | +| Pre-PR | Scope, reviewer lockout, required passes | PR creation blocked or flagged | +| Coordinator routing | Agent enabled? Scope within boundaries? | Agent not spawned | +| Post-action | Audit log entry | Logged to `.squad/audit-log/` | + +Error messages include what was blocked, why, what to do instead, and whether a waiver is available: + +``` +❌ Blocked: git add . + Rule: platform.git.staging = explicit-only + Tier: 1 (platform) + Fix: Stage files individually: git add path/to/file.ts + Waive: Not waivable (Tier 1) +``` + +## What Four Reviewers Told Me Was Wrong + +I ran the PRD through four specialist reviews — Security, Architecture, Product/UX, DevOps/CI. They found 7 critical issues and 12 warnings. + +### The critical ones: + +**Policy files weren't write-protected.** The original schema had `deny-write` for `.github/agents/**` but not for `.squad/policy.yaml`. An agent could edit its own policy to remove restrictions. Fixed by adding policy files, hooks, and audit logs to `deny-write`. + +**Secret scanning was regex-only.** A single regex pattern misses AWS keys (`AKIA...`), GitHub PATs (`ghp_...`), JWTs, PEM keys, and hundreds of other known formats. Fixed by adding a `provider` field that supports `github-secret-scanning` and `detect-secrets` pattern databases. + +**Hook scripts lived in the repo.** The enforcement mechanism (`.github/hooks/`) was just a JS file in the repo. An agent with file-write access could replace it with a no-op, bypassing ALL Tier 1 enforcement. Fixed by shipping hooks as versioned GitHub Actions instead of repo files. + +**Audit logs were writable by agents.** If agents can write to `.squad/audit-log/`, they can forge entries or hide violations. 
Fixed by making audit the CLI's responsibility and defaulting to local gitignored storage. + +**No CI integration spec.** The PRD said "GitHub Action" but never defined exit codes, status checks, or a reusable workflow. Fixed by specifying `squad-policy-gate` with tiered exit codes: Tier 1 → `exit 1`, Tier 2 → `exit 78` (warning), Tier 3 → annotation. + +**Enforcement depended on VS Code Agent Hooks.** Agent Hooks are a VS Code preview feature. They don't work in CLI or CI. Fixed by designing enforcement as Squad CLI middleware first, with hooks as an optional enhancement. + +**The setup wizard asked 9 questions.** Product reviewer called this "an adoption cliff." New users won't know the answers. Fixed by shipping pre-populated defaults with a zero-question `--quick` mode. + +### The architecture feedback: + +The Architecture reviewer pointed out that "most restrictive wins" is underspecified for non-boolean fields. What does "most restrictive" mean for an MCP allowlist? Intersection? Union? A number field like `max-files-per-commit`? Min? 
+ +Fixed by defining per-field merge strategies: + +```yaml +_merge_strategies: + "platform.mcps.allowlist": intersection + "platform.git.max-files-per-commit": min + "platform.git.branches-protected": union + "team.storage.mode": override + "team.agents.enabled": intersection + "team.review.required-passes": union +``` + +## What's Next + +The PRD defines six phases, numbered 0 through 5: + +| Phase | What ships | Enforcement level | +|-------|-----------|------------------| +| **0** (done) | Prose-based policy in markdown | Advisory | +| **1** | JSON Schema, `squad policy init/show/doctor`, CI gate | Schema validation | +| **2** | Squad CLI middleware, pre-commit hooks, audit logging | Tier 1 hard enforcement | +| **3** | Coordinator enforcement, waiver system, Cedar/OPA for conditional rules | Tier 2 soft enforcement | +| **4** | Org-level inheritance via GitHub API | Full policy chain | +| **5** | Agent Governance Toolkit integration (optional) | Enterprise-grade | + +Phase 0 is already done — the adopter decisions, git safety rules, and PR templates I built for my docs squad template are the prose version of this policy layer. + +Phase 1 is where the machine-readable `policy.yaml` and CLI tooling get built. That's what I'm working on next. + +--- + +The full PRD currently lives in a private planning repo. If you're building AI agent teams and thinking about governance, the three-tier model is a good starting point — even if you only implement it in markdown. diff --git a/website/blog/2026-05-12-session-storage-decision-guide.md b/website/blog/2026-05-12-session-storage-decision-guide.md new file mode 100644 index 0000000..2e26287 --- /dev/null +++ b/website/blog/2026-05-12-session-storage-decision-guide.md @@ -0,0 +1,200 @@ +--- +slug: "/2026-05-12-session-storage-decision-guide" +canonical_url: "https://dfberry.github.io/blog/2026-05-12-session-storage-decision-guide" +custom_edit_url: null +sidebar_label: "2026.05.12 Session Storage Guide" +title: "Where Should Your Sessions Live?
A Squad User's Guide to Copilot CLI Session Storage" +description: "A decision guide for choosing cloud, local, or repo-backed session storage when you use Copilot CLI with Squad." +draft: true +tags: + - "AI Agents" + - "Copilot CLI" + - "Session Management" + - "Squad" + - "AI-assisted" +keywords: + - "session storage" + - "copilot cli sessions" + - "repo-backed sessions" + - "local session store" + - "squad" +updated: "2026-05-12 00:00 PST" +--- + +When you set up Copilot CLI, it asks you a question that seems simple: **where do you want to store your sessions?** The three choices are cloud, local, or repo. For most developers, this is a quick pick-and-move-on moment. But if you're running Squad, this decision has real consequences for how your team remembers, recovers, and collaborates. + +This post breaks down what each option actually means for Squad users, when to commit session files vs. gitignore them, and how Copilot's session storage interacts with Squad's own memory system. + +## The Three Storage Options + +### Cloud + +Copilot syncs your session history to GitHub's cloud. Sessions persist across machines. The `/session` command lets you view and manage sessions directly from the CLI. + +### Local + +Sessions stay in `~/.copilot/` on your machine. Private, fast, never leaves your device. The `/session` command still works for local sessions, but cross-device access isn't available. + +### Repo + +Sessions are stored inside the repository you're working in. They travel with the code. They're visible to anyone who clones the repo (unless gitignored). They're auditable in the commit log. + +## Why This Matters for Squad + +Squad has its own memory system — and it's entirely repo-based: + +| Squad Memory | File | Committed to Git? 
| +|-------------|------|-------------------| +| Team decisions | `.squad/decisions.md` | ✅ Yes | +| Agent memory | `.squad/agents/{name}/history.md` | ✅ Yes | +| Skills | `.squad/skills/{name}/SKILL.md` | ✅ Yes | +| Session files | `.squad/sessions/*.json` | ❌ Gitignored by default | +| Scribe logs | `.squad/log/*.md` | ❌ Gitignored by default | +| Orchestration logs | `.squad/orchestration-log/*.md` | ❌ Gitignored by default | + +Wait — Squad gitignores its own session files? Yes. And this is a deliberate design choice that tells you something important about how Squad thinks about sessions vs. knowledge. + +## The Key Distinction: Sessions vs. Knowledge + +Squad draws a sharp line: + +**Knowledge** (decisions, history, skills) is **committed**. It's the team's institutional memory. It compounds. It's portable. When you clone the repo, you get the full team brain. + +**Sessions** (conversation transcripts, orchestration logs, checkpoints) are **gitignored**. They're runtime state — useful for resume and debugging, but not part of the permanent record. They're noisy, they're large, and they contain the messy process of getting to a result, not the result itself. + +This is why `squad init` automatically adds these to `.gitignore`: + +```gitignore +.squad/orchestration-log/ +.squad/log/ +.squad/decisions/inbox/ +.squad/sessions/ +``` + +The important outputs of a session — decisions made, learnings captured, skills extracted — get promoted to committed files. The raw session data stays local. + +## How Copilot's Storage Choice Interacts with Squad's + +Here's where it gets interesting. Copilot CLI has its own session store (`~/.copilot/session-store.db`) that's completely independent of Squad's `.squad/sessions/`. 
They serve different purposes: + +| | Copilot CLI Store | Squad Session Store | +|---|---|---| +| **What's stored** | Full conversation (every turn, checkpoint, file refs) | Shell message history (for resume) | +| **Where** | `~/.copilot/` or cloud or repo | `.squad/sessions/` | +| **Queryable by agents** | Yes, via `/session` command | Yes, via file read | +| **Used for** | Cross-session queries, session recovery skill | Squad shell `/resume` command | +| **Compaction recovery** | Checkpoints in Copilot's store | `.squad/sessions/{id}.json` checkpoint | + +When you choose "repo" for Copilot's storage, both stores end up in the repo — but in different directories with different purposes. + +## Decision Guide: Which Storage Option to Choose + +### Choose Cloud When: + +- You work across multiple machines (laptop, desktop, codespace) +- You want `/session` to work across devices (view and resume sessions from any machine) +- You want Squad's `session-recovery` skill to find interrupted sessions +- Your org allows sending conversation content to GitHub's cloud +- You're a solo developer who values convenience + +**Squad impact:** Full feature set. `/session` works across devices, session recovery works, cross-device resume works. Squad's own `.squad/` memory is unaffected (it's always in the repo regardless). + +### Choose Repo When: + +- You want everything in one place — code, squad state, and session history +- You're working on a team and want session history to be shared or auditable +- You want session data to be portable via `git push` without cloud dependency +- You want git-auditable session history (who ran what, when) +- Your org restricts cloud storage but allows repo-based storage + +**Squad impact:** Sessions live alongside `.squad/` files — natural fit for Squad's "everything in git" model. But you need to make a gitignore decision (see below). Verify that `/session` indexes repo-stored sessions the same way in your setup. 
+ +### Choose Local When: + +- Security policy prohibits both cloud and repo storage of conversation content +- You're on a single machine and don't need cross-device session access or resume +- You're working on a public repo and don't want session transcripts exposed + +**Squad impact:** Minimal. Squad's core memory (decisions, history, skills) is unaffected. You lose cross-device session access, and the session-recovery skill won't find sessions from other machines. Squad's own `.squad/sessions/` still works for resume (it's independent of Copilot's store). + +## The Gitignore Decision + +If you choose repo storage, you face a second question: **should session files be committed or gitignored?** + +### Gitignore Sessions (Default: Start Here) + +This is Squad's default behavior. `squad init` gitignores `.squad/sessions/` automatically. + +**Why this is the default:** +- Session files are large (full conversation transcripts) +- They contain the messy process, not the clean result +- They may contain sensitive content (API keys in error messages, file contents, personal preferences) +- They create noisy git diffs on every session +- The valuable outputs (decisions, history, skills) are already committed separately + +**When to keep the default:** +- Public repos (never commit sessions to public repos) +- Repos with many contributors (session noise drowns signal in diffs) +- When sessions contain sensitive data +- When you're not sure (start gitignored, relax later if needed) + +### Commit Sessions (Intentional Choice) + +Remove `.squad/sessions/` from `.gitignore` if you want session transcripts in git.
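
Before changing the ignore rules, it can help to confirm what git currently ignores. The following is a minimal sketch, not part of Squad: the function name `session_ignore_status` and the example file name are hypothetical, and it relies only on the standard `git check-ignore` command, whose exit status is 0 when a path is ignored and 1 when it is not.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not a Squad command): report whether a path inside a
# repository is covered by that repository's ignore rules.
session_ignore_status() {
  local repo_dir="$1" target="$2" status=0

  # check-ignore exits 0 when the path is ignored, 1 when it is not,
  # and 128 on a fatal error (for example, repo_dir is not a git repo).
  git -C "$repo_dir" check-ignore -q "$target" || status=$?

  case "$status" in
    0) echo "ignored" ;;
    1) echo "not-ignored" ;;
    *) echo "error" ;;
  esac
}
```

Run from the repo root, `session_ignore_status . .squad/sessions/example.json` should report `ignored` while the default `.squad/sessions/` rule is in place. Note that `git check-ignore` treats files already tracked in the index as not ignored, so a session file committed earlier will show up as `not-ignored` even if the rule still exists.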
+ +**When this makes sense:** +- Private team repos where transparency matters +- Compliance requirements that need full audit trails of AI interactions +- Training/onboarding repos where session history is part of the curriculum +- Research projects where the process matters as much as the result + +**What to watch for:** +- Repo size grows fast — each session is a full conversation transcript +- Review your sessions before committing — they may contain content you don't want in git history permanently +- Consider a `.squad/sessions/.gitkeep` with a README explaining why sessions are tracked +- Set up a cleanup policy — archive sessions older than N days + +### Hybrid: Commit Logs, Gitignore Sessions + +A middle ground that works well for many Squad teams: + +```gitignore +# Keep raw sessions local +.squad/sessions/ + +# But commit Scribe's summaries (remove these from .gitignore) +# .squad/log/ +# .squad/orchestration-log/ +``` + +This gives you: +- Clean, human-readable session summaries in git (Scribe's logs) +- Raw session data stays local (not committed) +- Audit trail without the noise + +To do this, remove `.squad/log/` and `.squad/orchestration-log/` from your `.gitignore`. Scribe's logs are concise markdown summaries — much smaller and cleaner than raw session files. + +## Summary: The Decision Matrix + +| Scenario | Copilot Storage | Gitignore Sessions? 
| +|----------|----------------|-------------------| +| Solo dev, multiple machines | Cloud | Yes (default) | +| Solo dev, single machine | Repo or Local | Yes (default) | +| Private team, transparency needed | Repo | No — commit them | +| Private team, compliance required | Repo | No — commit, plus commit Scribe logs | +| Public repo | Cloud or Local | Yes — never commit sessions to public repos | +| Enterprise, security-restricted | Local | Yes (default) | +| Hybrid (recommended for most teams) | Cloud or Repo | Gitignore sessions, commit Scribe logs | + +## The Bottom Line + +Squad's memory system is designed to extract the signal (decisions, history, skills) from the noise (raw sessions) and commit only the signal. That's why sessions are gitignored by default — the important stuff is already being saved. + +Copilot's session storage adds a convenience layer on top: cross-device resume, session queries, recovery from interruptions. Choose the storage option that fits your workflow, knowing that Squad's core memory is unaffected by your choice. + +If you're unsure: **start with cloud storage and the default gitignore.** You get full Copilot features, Squad's memory works perfectly, and sessions stay out of your git history. Relax the gitignore later if you need committed session trails. + +--- + +*This post is part of a series on running Squad effectively. 
For more on Squad's memory model, see the [Memory & Knowledge docs](https://bradygaster.github.io/squad/concepts/memory-and-knowledge/).* diff --git a/website/blog/2026-05-12-squad-features-youre-missing.md b/website/blog/2026-05-12-squad-features-youre-missing.md new file mode 100644 index 0000000..b927526 --- /dev/null +++ b/website/blog/2026-05-12-squad-features-youre-missing.md @@ -0,0 +1,224 @@ +--- +slug: "/2026-05-12-squad-features-youre-missing" +canonical_url: "https://dfberry.github.io/blog/2026-05-12-squad-features-youre-missing" +custom_edit_url: null +sidebar_label: "2026.05.12 Squad Features" +title: "My Favorite Squad Features (And Why They Matter)" +description: "A tour of ten Squad capabilities that most changed how I think about running AI agent teams." +draft: true +tags: + - "AI Agents" + - "Squad" + - "Developer Workflow" + - "GitHub Copilot" + - "AI-assisted" +keywords: + - "squad features" + - "agent orchestration" + - "tiered memory" + - "ai team workflow" + - "copilot cli" +updated: "2026-05-12 00:00 PST" +--- + +[Squad](https://github.com/bradygaster/squad) ships with a lot. More than most people use. That's not a problem — it means there's depth here. Here are 10 features I find genuinely interesting — the ones that changed how I think about agent orchestration. Not the flashy parts. The actually clever parts. + +## 1. Generic Scheduler (`schedule.json`) + +What makes this interesting is how flexible it is. Most schedulers lock you into a fixed task model — cron jobs or webhooks, pick one. Squad's scheduler treats tasks as first-class primitives with their own routing logic, retry strategies, and provider backends. 
+ +You define recurring tasks in `.squad/schedule.json`: + +```json +{ + "schedules": [ + { + "id": "ralph-heartbeat", + "name": "Ralph Work Monitor", + "enabled": true, + "trigger": { "type": "interval", "intervalSeconds": 300 }, + "task": { "type": "script", "command": "squad ralph watch --duration 30s" }, + "providers": ["local-polling"], + "retry": { "maxRetries": 1, "backoffSeconds": 5 } + } + ] +} +``` + +Four trigger types: `interval`, `cron`, `event`, `startup`. Four task types: `script`, `copilot`, `workflow`, `webhook`. Two providers: `local-polling` (runs while your terminal is open) and `github-actions` (generates workflow files for 24/7 execution). + +```bash +squad schedule init # Create default schedule +squad schedule list # See all tasks +squad schedule status # Check last/next run times +squad schedule run # Trigger manually +squad schedule watch # Start the local polling loop +squad schedule init-ci # Generate GitHub Actions workflows +``` + +The thing I keep coming back to: the same `schedule.json` can run locally during active work and scale to CI with a single command (`squad schedule init-ci`). No rewrite. No format translation. + +**Learn more:** `docs/features/generic-scheduler.md` in the Squad repo. + +## 2. Two-Pass Issue Scanning (72% fewer API calls) + +This one's clever because it respects the real problem: GitHub API rate limits force you to choose between responsiveness (scan often, hydrate less) and thoroughness (scan less, hydrate everything). Two-pass mode gives you both. + +Ralph's default scan hydrates every issue, fetching comments, labels, assignees, and timeline, even though most issues aren't actionable. Two-pass does a lightweight list scan first, then hydrates only the issues that pass your filter (roughly 30% of the total). + +```bash +squad watch --two-pass +``` + +The result: 72% reduction in API calls per cycle. If you're running Ralph every 5 minutes, this compounds fast. + +**Learn more:** `packages/squad-sdk/templates/skills/ralph-two-pass-scan/SKILL.md` + +## 3.
Tiered Agent Memory (20-55% context reduction) + +What makes this interesting is that it solves a real scaling problem: context size grows linearly with session history, but not all context matters for every decision. Tiered memory lets agents declare what they actually need. + +The system partitions context into three explicit levels: + +| Tier | What's in it | Size | When to include | +|------|-------------|------|----------------| +| **Hot** | Current session context | ~2-4 KB | Always | +| **Cold** | Summarized history | ~8-12 KB | When agent needs past decisions | +| **Wiki** | Durable reference docs | Variable | When agent needs team standards | + +Spawn options: + +```bash +# Default: hot only +squad spawn backend-dev "fix the auth bug" + +# Include summarized history +squad spawn backend-dev "fix the auth bug" --include-cold + +# Include reference docs too +squad spawn backend-dev "fix the auth bug" --include-cold --include-wiki +``` + +Real measurements from the Squad team: 20-55% context reduction from a baseline of 34-74 KB per spawn, depending on tiers included. That matters when you're spawning dozens of agents. + +**Learn more:** `packages/squad-sdk/templates/skills/tiered-memory/SKILL.md` + +## 4. Economy Mode + +What I like about this: it respects intent. You can say "spend less" without losing the ability to say "this specific task needs the expensive model." Most cost optimization tools are all-or-nothing. Economy mode is layered. + +Economy mode shifts coordinator-selected model choices to cheaper alternatives (gpt-4.1, gpt-5-mini) but never overrides explicit model requests. If you say you need claude-opus for a code review, that runs on claude-opus. If you say "use economy mode," the coordinator picks cheaper models for the tasks where you don't care. + +```bash +# Per-session +squad economy on + +# Or in conversation +"use economy mode" +"save costs" +``` + +**Learn more:** `packages/squad-sdk/templates/skills/economy-mode/SKILL.md` + +## 5. 
Orchestration Logging + +Here's what's clever: you get a full audit trail of why the coordinator made each routing decision and what happened as a result. Every spawn is logged automatically; you don't add any tracing yourself. + +Each log entry captures: + +- Why this agent was chosen (routing rationale) +- What files the agent was authorized to touch +- What the agent produced +- Whether the output was accepted + +This is useful for debugging orchestration bugs without adding tracing code. And if you're in a regulated environment, you have your decision audit trail ready. + +**Learn more:** `templates/orchestration-log.md` + +## 6. `squad nap --deep` + +This is the garbage collector for your team memory. Over time, `.squad/decisions.md` and agent history files grow. Context balloons. `squad nap --deep` does aggressive compression: archiving stale decisions, trimming history files, reclaiming context space. + +```bash +# Preview what would change +squad nap --deep --dry-run + +# Actually compress +squad nap --deep +``` + +The thing I appreciate: it's safe. Run the dry-run first to see exactly what gets archived, then run the real pass and commit. And archived decisions stay searchable; they're not deleted, just moved out of the hot path. + +**Learn more:** See the constraint tracking docs in the Squad repo for details on decision management. + +## 7. Personal Squad + +Personal Squad lets your agent configuration follow you. Your team's squad lives in the repo. Your personal squad lives at `~/.squad/`, as ambient configuration that works regardless of which project you're in. + +```bash +squad personal init # Create personal workspace +squad personal list # See your personal agents +squad personal use # Activate personal squad +squad personal remove # Remove it +``` + +This is useful if you have a preferred router, or you've tuned agent model choices to your style: you don't have to re-tune those settings for every project.
+ +> **Note:** Personal Squad is currently experimental and may change in future releases. + +**Learn more:** CHANGELOG v0.9.0 entries. Integration docs are still being written. + +## 8. Cross-Squad Orchestration + +What makes this interesting: squads aren't isolated. If you have multiple squads across repos, they can discover each other and delegate work. It's like a distributed system where each repo's squad knows what the others are capable of. + +```bash +squad discover # List discoverable squads +squad delegate my-other-squad "update the SDK docs" # Create cross-squad issue +``` + +Discovery happens via `.squad/manifest.json` — each squad publishes its capabilities. Delegation creates an issue in the target repo with the `squad` label, which the target squad's Ralph picks up normally. + +**Learn more:** `docs/features/cross-squad-orchestration.md` + +## 9. Circuit Breaker + Cooperative Rate Limiting + +Here's the problem this solves: if you run multiple Ralph instances, GitHub API rate limits become a coordination problem. You need to predict limits before you hit them, and recover gracefully when you do. + +Squad's solution is a full state machine persisted to `.squad/ralph-circuit-breaker.json`, coordinating across instances: + +1. **Traffic Light** — Green/yellow/red based on remaining quota +2. **Token Pool** — Shared quota pool across instances +3. **Predictive Circuit Breaker** — Opens the circuit BEFORE you hit 429, using exponential cooldown (CLOSED → OPEN → HALF-OPEN states) +4. **Priority Retry Windows** — Higher-priority tasks get first access after cooldown +5. **Resource Epoch Tracking** — Auto-recovers quota from crashed agents +6. **Cascade Dependency Detection** — Prevents one failing API from cascading to others + +The persistence is elegant: even if your terminal closes, the next Ralph run knows exactly where the circuit stood. + +**Learn more:** `templates/ralph-circuit-breaker.md` and `templates/cooperative-rate-limiting.md` + +## 10. 
Machine Capability Discovery + +What I appreciate here: the framework adapts to hardware, not the other way around. At session start, Squad auto-detects available tools, models, and hardware. Agents self-route based on `needs:*` labels matched against discovered capabilities. + +This means the same squad configuration works on a laptop with 8 GB RAM and a CI runner with 64 GB — agents adapt to what's actually available. + +**Learn more:** CHANGELOG v0.9.0. Template at `templates/machine-capabilities.md`. + +--- + +## What Makes This Different + +All 10 of these features are fully implemented. All ship with Squad. What sets Squad apart isn't any single feature — it's the combination. + +You get autonomous agent routing (the coordinator picks the right agent without you specifying). Persistent decision memory (your team's decisions inform future work). Human-in-the-loop governance (agents propose, you approve). Built-in scheduling, cost awareness, API rate limiting, and capability discovery. + +That combination means you get real autonomy without giving up control. And the deeper you dig, the more depth there is — these 10 are the ones that grabbed me, but there's plenty more to find. + +**Start here:** +- If you're running Ralph on a schedule, begin with `squad schedule init` and test `squad nap --deep --dry-run`. You'll recover context and gain API efficiency immediately. +- If you work across multiple branches, try `squad externalize` (see our [v0.9 update](/blog/2026-05-12-whats-new-in-squad-v09) for details). It's a mode switch, not a commitment. +- If you're new to Squad, pick one feature that solves your immediate problem and build from there. + +The depth is worth exploring. 
diff --git a/website/blog/2026-05-12-squad-hq.md b/website/blog/2026-05-12-squad-hq.md new file mode 100644 index 0000000..7119716 --- /dev/null +++ b/website/blog/2026-05-12-squad-hq.md @@ -0,0 +1,334 @@ +--- +slug: "/2026-05-12-squad-hq" +canonical_url: "https://dfberry.github.io/blog/2026-05-12-squad-hq" +custom_edit_url: null +sidebar_label: "2026.05.12 Squad HQ" +title: "Managing 10+ Squad Repos? Let HQ Discover and Connect Them Automatically" +description: "An HQ-style view for discovering multiple Squad repos and understanding how they relate." +draft: true +tags: + - "AI Agents" + - "Multi-Repo" + - "Developer Workflow" + - "Squad" + - "AI-assisted" +keywords: + - "multi repo management" + - "repo discovery" + - "squad hq" + - "agent coordination" + - "developer workflow" +updated: "2026-05-12 00:00 PST" +--- + +You started with one Squad repo. Then three. Then you realize you're managing 12 Squad-powered repos across your workspace, and nobody has a unified view of what's happening. You don't even know the full count without checking GitHub manually. + +New repos get forgotten. Links go stale. Someone asks, "Do we have a Squad in the data-pipeline repo?" and you have to check manually. You don't know your own setup. There's no registry. No discovery. When repos are added, you manually track them in a shared spreadsheet or team notes. When repos are deleted, the links never get cleaned up. + +What if Squad repos could discover each other automatically? Not invasively. Not with manual edits to every child repo. Not with a shared config file they all have to know about. Just a command that scans your workspace, shows what's there, and generates linking instructions — all read-only by default. + +That's [Squad HQ](https://github.com/bradygaster/squad-sdk-example-hq) — a workspace discovery tool that gives you complete visibility without the overhead. HQ is the central registry and discovery mechanism for all your Squad repos. 
+ +## How It Works + +Squad HQ scans for `.squad/` directories and classifies each repo: **published** (has manifest.json and team metadata, fully Squad-compliant), **partial** (has team config but not manifest, Squad-aware but incomplete), or **empty** (folder exists but no useful config). It generates a unified view and creates an SDK-compatible registry — without modifying a single child repo. + +> 📊 **[DIAGRAM: Workspace Discovery Bird's-Eye View]** +> *Prompt for image generation:* Top-down workspace view showing HQ repo (center, large blue box labeled "Squad HQ") scanning outward to 12 sibling repos arranged in a circle around it. Repos color-coded: green boxes for "published" (8 repos with checkmarks), yellow boxes for "partial" (2 repos with warning triangles), gray boxes for "empty" (2 repos with dashes). Arrows from HQ extending to each repo showing "scan for .squad/" with magnifying glass icons. Bottom summary box showing "8 published, 2 partial, 2 empty, 21 team members". Dark background, teal/blue HQ center, green/yellow/gray repo classifications, clean concentric layout. +> *Purpose:* Shows the HQ's scanning radius and how it classifies each discovered repo without modifying them, emphasizing the read-only discovery pattern. + +The key: **read-only by default**. Every command is safe to run. Only `--write` creates files, and only in the HQ repo itself. Child repos are never touched, never modified, never have their configs altered. + +## A Real Walkthrough + +Here's what discovering and managing multiple squads looks like from start to finish, with actual CLI commands and expected output. + +### Step 1: Discover All Squads in Your Workspace + +```bash +npx squad-hq discover ~/repos +``` + +Output: +``` +🔍 Scanning ~/repos for Squad configurations... 
+ + Found 12 squad(s): + + Name Type Team Capabilities + ─────────────────────── ─────────── ───── ────────────────────── + backend-api-squad published 4 multi-agent, copilot, sdk + frontend-squad published 3 multi-agent + data-pipeline-squad partial 2 — + infra-squad published 3 multi-agent, copilot + security-squad published 5 multi-agent, copilot, sdk + docs-squad published 2 copilot + ml-squad empty 0 — + analytics-squad published 2 multi-agent + platform-squad published 4 multi-agent, copilot, sdk + mobile-squad partial 1 — + devops-squad published 3 multi-agent + legacy-squad empty 0 — + + Summary: 8 published, 2 partial, 2 empty + Workspace scope: 12 repos, 21 total team members +``` + +You just got a complete Squad inventory in seconds. No spreadsheets. No manual tracking. The tool classified each repo based on what configuration files it found: + +- **published**: Has `.squad/manifest.json` (full Squad SDK integration) + `.squad/team.md` (team definitions) +- **partial**: Has `.squad/team.md` (team definitions) but no manifest — Squad-aware but not fully configured +- **empty**: Has `.squad/` directory but no useful config inside + +### Step 2: Show What Linking Would Look Like (Dry-Run) + +Before making any changes, let's see what linking would do: + +```bash +npx squad-hq plan +``` + +Output: +``` +📋 Planning squad linking for HQ... + + Discovered 12 squads in parent directory + + ✅ Would link: backend-api-squad (published) + $ squad upstream add --name hq --type local --source "." --dir "../backend-api-squad" + + ✅ Would link: frontend-squad (published) + $ squad upstream add --name hq --type local --source "." --dir "../frontend-squad" + + ✅ Would link: infra-squad (published) + $ squad upstream add --name hq --type local --source "." --dir "../infra-squad" + + ✅ Would link: security-squad (published) + $ squad upstream add --name hq --type local --source "." 
--dir "../security-squad" + + ✅ Would link: data-pipeline-squad (partial — needs manifest) + ℹ️ Partial squads can still be linked but won't expose full capability set + + ⏭️ Skipping: ml-squad — Empty .squad/ directory (not a complete squad) + ⏭️ Skipping: legacy-squad — Empty .squad/ directory (not a complete squad) + + Preview shows 8 linkable squads (including 1 partial, excluding 2 empty). + Run `squad-hq plan --write` to register them in HQ. +``` + +> 📊 **[DIAGRAM: Linking Flow - HQ to Child Repos]** +> *Prompt for image generation:* Arrow-based diagram showing HQ repo at top with `upstream.json` file icon. Arrows branch downward to 8 child repos (published + partial), each with their own `.squad/` directory icons. Arrows are labeled "upstream reference: ../hq" flowing from HQ down to children. Solid arrows for "published" repos (green), dashed arrows for "partial" repos (yellow). Crossed-out arrows for "empty" repos (gray, marked "skipped"). At bottom, a note box showing "8 registered + 0 empty = 8 active linkages". Dark background, blue/teal linking arrows, clean hierarchical tree layout. +> *Purpose:* Shows how HQ creates upstream references in child repos' manifests (via `upstream.json`) without modifying their source code or team config, emphasizing the non-invasive linking pattern. + +This is the dry-run—see exactly what would link before anything changes. You can see which squads are ready, which need attention (partial), and which should be skipped (empty). + +### Step 3: Classify a Specific Repo (Details) + +Want to see detailed classification for a specific squad? 
+ +```bash +npx squad-hq discover ~/repos/backend-api-squad --details +``` + +Output: +``` +📊 Classification: backend-api-squad + + Status: ✅ published + + Configuration found: + ✓ .squad/manifest.json (59 lines) + ✓ .squad/team.md (3 members defined) + ✓ .squad/routing.md (9 routing rules) + ✓ .squad/agents/ (4 agent charters) + ✓ .squad/decisions.md (12 decisions recorded) + + Team: + • Alice — Lead (AI: gpt-4) + • Bob — Backend (AI: gpt-4-turbo) + • Carol — Frontend (AI: claude-3-sonnet) + • Dave — DevOps (AI: gpt-4) + + Capabilities: + ✓ multi-agent (4 agents defined) + ✓ copilot (CLI integration ready) + ✓ sdk (SDK hooks configured) + + Last modified: 2h ago + Manifest version: 1.0 + + Recommendation: This squad is ready for HQ linking. All configuration files present. +``` + +### Step 4: Register the Squads in HQ + +When you're ready, register all the linkable squads: + +```bash +npx squad-hq plan --write +``` + +Output: +``` +✍️ Writing squad registry... + + ✅ Registered: backend-api-squad + ✅ Registered: frontend-squad + ✅ Registered: infra-squad + ✅ Registered: security-squad + ✅ Registered: data-pipeline-squad (partial) + ✅ Registered: analytics-squad + ✅ Registered: platform-squad + ✅ Registered: devops-squad + + ✅ Wrote 8 entries to squad-registry.json + + Registry location: ./squad-registry.json + Registry format: SDK-compatible (ready for discoverFromRegistry()) + Child repos modified: 0 (read-only, child repos untouched) +``` + +Creates `squad-registry.json` in HQ (child repos untouched). 
It uses the standard Squad SDK format, so existing tooling just works: + +```json +[ + { + "name": "backend-api-squad", + "path": "/repos/backend-api-squad", + "classification": "published", + "teamSize": 4, + "capabilities": ["multi-agent", "copilot", "sdk"], + "lastModified": "2024-01-15T14:32:00Z" + }, + { + "name": "frontend-squad", + "path": "/repos/frontend-squad", + "classification": "published", + "teamSize": 3, + "capabilities": ["multi-agent"], + "lastModified": "2024-01-15T12:15:00Z" + }, + ... +] +``` + +### Step 5: Get Unified Status of All Registered Squads + +```bash +npx squad-hq status +``` + +Output: +``` +🏠 HQ Status +━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ + +Squads: 8 registered (7 published, 1 partial, 0 empty, 4 inactive) + + Name Type Team Capabilities Last Modified + ───────────────────── ─────────── ───── ───────────────────── ───────────── + backend-api-squad published 4 multi-agent, copilot 2h ago + frontend-squad published 3 multi-agent 4h ago + security-squad published 5 multi-agent, copilot 12h ago + devops-squad published 3 multi-agent 1d ago + infra-squad published 3 multi-agent, copilot 2d ago + analytics-squad published 2 multi-agent 3d ago + data-pipeline-squad partial 2 — 1d ago + platform-squad published 4 multi-agent, copilot 5d ago + +━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ + +Summary: + • Total team members across all squads: 26 + • Most active: backend-api-squad (2h ago) + • Needs attention: platform-squad (5d since last change) + • Incomplete configs: data-pipeline-squad (partial) + +Recommended actions: + • Upgrade data-pipeline-squad to published (add manifest.json) + • Check if platform-squad is still active +``` + +You now have a complete picture from one command: all your registered Squad repos, their status, their team size, and their last activity. + +### Step 6: Discover a New Repo and Update HQ + +Say you add a new Squad repo to your workspace.
Re-run discover to see it: + +```bash +npx squad-hq discover ~/repos +``` + +Output: +``` +🔍 Scanning ~/repos for Squad configurations... + + Found 13 squad(s): + + Name Type Team Capabilities + ─────────────────────── ─────────── ───── ────────────────────── + [... previous 12 ...] + new-ml-squad published 2 multi-agent, sdk + + Summary: 9 published, 2 partial, 2 empty +``` + +Then update HQ's registry: + +```bash +npx squad-hq plan --write +``` + +Output: +``` +✍️ Writing squad registry... + + ✅ Already registered: backend-api-squad + ✅ Already registered: frontend-squad + [... 6 more ...] + ✅ Registered (NEW): new-ml-squad + + ✅ Wrote 9 entries to squad-registry.json + Registry updated: 8 existing + 1 new +``` + +Your registry is always in sync with what's actually in your workspace. + +## Why This Matters + +Managing 10+ Squad repos without visibility means: +- Links get forgotten, new repos don't get connected, old repos stay connected after teams disperse +- You lose the topology — nobody knows what repos exist or when they were last active +- Onboarding new team members means manually explaining which repos matter and which are deprecated +- You can't generate reports on team scale, capability distribution, or workspace health + +Squad HQ solves this with: +- **Safe by default** — Read-only except `--write` (HQ-local only) +- **SDK-compatible** — Standard `squad-registry.json` format, works with existing tooling +- **Programmatic API** — CLI commands also importable functions (TypeScript) +- **Zero maintenance** — Never corrupts child repos, never requires their approval +- **Audit trail** — Classification, last-modified timestamps, and capability inventory + +If you have 3+ Squad repos, this pattern saves hours of manual coordination. At 10+, it's essential. + +## Honest Scoping + +This works great for discovering Squad repos by configuration files. 
It won't automatically know about repos that don't have `.squad/` directories yet — those still need manual registration or setup. + +If your Squad repos are spread across multiple GitHub organizations or distributed systems (not all in one parent directory), you'd need to extend the scanner. The framework is designed as pluggable components, so adding custom scanners (GitHub org API, cloud storage, etc.) is straightforward. + +## What's Next + +From here, you can: +- Use `squad-registry.json` in CI/CD to run health checks across all squads +- Export the registry to dashboards or monitoring systems +- Chain HQ with other tools: "discover → classify → generate linking commands → auto-register in CI" +- Build custom classifiers for your organization's Squad maturity model + +The example also works as a foundation for other cross-repo discovery patterns: plugin registries, template catalogs, or workspace health dashboards. + +--- + +If you're managing multiple Squad repos and you're tired of manual tracking, Squad HQ is the visibility tool you need. Run one command, get the complete picture of all Squad repos in your workspace, and generate linking instructions without touching a single child repo. + +Get started: [github.com/bradygaster/squad-sdk-example-hq](https://github.com/bradygaster/squad-sdk-example-hq) diff --git a/website/blog/2026-05-12-state-backend-manager.md b/website/blog/2026-05-12-state-backend-manager.md new file mode 100644 index 0000000..d421985 --- /dev/null +++ b/website/blog/2026-05-12-state-backend-manager.md @@ -0,0 +1,401 @@ +--- +slug: "/2026-05-12-state-backend-manager" +canonical_url: "https://dfberry.github.io/blog/2026-05-12-state-backend-manager" +custom_edit_url: null +sidebar_label: "2026.05.12 State Backend Manager" +title: "Tired of .squad/ Noise in Your Git History? Move State to a Cleaner Backend" +description: "A design sketch for moving noisy repo state into a cleaner backend while keeping agent coordination intact." 
+draft: true +tags: + - "AI Agents" + - "State Management" + - "Squad" + - "Developer Workflow" + - "AI-assisted" +keywords: + - "state backend" + - "agent state storage" + - "git noise" + - "squad state" + - "developer workflow" +updated: "2026-05-12 00:00 PST" +--- + +Squad stores all team state in `.squad/` — decisions, orchestration logs, configurations. This works fine until you're managing dozens of agents across multiple repos. Then `.squad/` commits become your main source of git noise: cluttered diffs, merge conflicts on shared decision files, and constant churn that makes history hard to read. + +At enterprise scale, this becomes unbearable: merge conflicts on shared decision files when multiple teams work in parallel; git noise that makes blame, bisect, and log reading painful; compliance issues when teams need to audit AI decision logs separately from source code; and branch protection failures because `.squad/` files changed even though no actual code changed. + +I built a state backend manager to solve this. [squad-state-backend](https://github.com/bradygaster/project-squad-sdk-example-state-backend) is designed to let you move state off the filesystem and onto git-notes, orphan branches, external repos, or anywhere else you want it. The framework handles migration, verification, and retention policies. + +## The Problem: .squad/ at Scale + +Imagine this. You have 30 repos, each with a `.squad/` directory. Decisions are constantly being logged, orchestration history is growing, agent configs are being updated. Every session, `.squad/` changes. Most teams ignore it and let it commit. A few teams try to gitignore it, but then they lose state across machines.
+ +At enterprise scale, this becomes unbearable: + +- **Merge conflicts** on shared decision files when multiple teams work in parallel +- **Git noise** makes blame, bisect, and log reading painful +- **Compliance issues** — some teams need to audit AI decision logs separately from source code +- **Branch protection failures** — CI gates trigger because `.squad/` files changed, even though no actual code changed +- **Slow clones** — large `.squad/` directories bloat repository size +- **Stale state** — teams on different branches have divergent decisions, causing confusion + +The core problem: state lives in the same place as code, but it's a different concern. Source code is permanent and revisioned. State is ephemeral and audit-focused. Mixing them creates friction. + +Teams ask: "Can state live somewhere else?" The answer should be yes. Today it's not obvious how. + +## The Solution: Pluggable State Backends + +The Squad SDK was designed with pluggable backends in mind, but users never got an easy way to use them. The state backend manager changes that. + +It exposes a simple interface: every backend implements the same read/write/list/delete operations. That means you can: + +1. Check what backend you're using +2. Migrate between backends (with verification) +3. Verify state integrity after migration +4. Set retention policies (auto-archive old logs) + +The framework ships with a simulated demo for testing. Production versions would integrate real git-notes, orphan branches, or external repos. + +> 📊 **[DIAGRAM: Backend Architecture Comparison]** +> *Prompt for image generation:* A 3-column comparison layout showing: (1) "Filesystem Backend" with file icon and ".squad/" label (dark background, red accent) — "Pro: Simple, Con: Noise in git history"; (2) "Git-Notes Backend" with git branch icon (blue accent) — "Pro: Clean history, Con: Complex"; (3) "Orphan Branch Backend" with layered boxes icon (teal accent) — "Pro: Isolated state, Con: Multiple branches". 
Add arrows showing bidirectional migration paths between columns with dotted lines labeled "migrate". Show state files (8 files, 2KB) flowing through each. Dark background overall, colored accents per backend. +> *Purpose:* Visually compares backend tradeoffs side-by-side—readers instantly understand pros/cons of filesystem vs. git-notes vs. orphan without reading dense text. + +## Setup + +```bash +# Clone the repository +git clone https://github.com/bradygaster/project-squad-sdk-example-state-backend.git +cd project-squad-sdk-example-state-backend + +# Install and build +npm install && npm run build +``` + +Verify setup: + +```bash +npm run test +``` + +Expected: All tests pass (✓). + +## Check Current Backend Status + +See what backend is currently active and verify its health: + +```bash +npx squad-state status +``` + +Expected output: + +``` +Backend: filesystem +Files: 8 +Size: 2,048 bytes +Last Write: 2024-01-15 10:15:00 +Healthy: yes +Integrity: valid +``` + +This tells you: +- **Backend**: Currently using filesystem (`.squad/`) +- **Files**: 8 state files tracked +- **Size**: Total size of state data +- **Last Write**: When state was last modified +- **Healthy**: Yes = no errors detected +- **Integrity**: Valid = all files are readable and parseable + +## Verify State Integrity + +Before migrating or trusting your state, always verify it's clean: + +```bash +npx squad-state verify +``` + +Expected output: + +``` +✓ All checks passed. +All checks passed. 8 files validated. 
+``` + +The verification process checks: +- JSON validity (all files parse correctly) +- Required files exist (nothing is missing) +- No corruption (checksums match) +- No orphaned data (no unreferenced files) + +If state is corrupted or missing files, this will tell you: + +``` +❌ Validation failed: 2 issues found + + • Missing file: .squad/decisions.md + Fix: Check if file was deleted or corrupted + + • Invalid JSON: .squad/agents/alice.json + Error: Unexpected end of JSON input + Fix: Restore from backup or delete and recreate +``` + +**Never migrate broken state.** Fix or restore first. + +## Migrate Between Backends + +This is the real power. You can move state from filesystem to git-notes with a single command: + +```bash +npx squad-state migrate filesystem git-notes +``` + +Expected output: + +``` +Migration complete: 8 files transferred from filesystem to git-notes (2048 bytes, 35ms) +``` + +What happens behind the scenes: + +1. **Pre-flight checks** — Verify source backend is healthy +2. **Export** — Read all files from source backend, serialize with metadata +3. **Transfer** — Write serialized state to target backend +4. **Import verification** — Confirm all files arrived intact +5. **Post-flight checks** — Verify target backend is healthy +6. **Checksum confirmation** — Ensure source and target have identical data + +If anything fails, the migration aborts and your source backend is unchanged: + +``` +❌ Migration failed at step "Import verification" + Error: Target backend write failed (permission denied) + +Source backend unchanged: filesystem (8 files, 2048 bytes) +No data was moved. 
+``` + +## Real Walkthrough: Full Migration Pipeline + +Here's how to safely migrate from filesystem to git-notes: + +**Step 1: Check current backend** + +```bash +npx squad-state status +``` + +Output: +``` +Backend: filesystem +Files: 8 +Size: 2,048 bytes +Last Write: 2024-01-15 10:15:00 +Healthy: yes +Integrity: valid +``` + +> 📊 **[DIAGRAM: Safe Migration Flow with Verification Gates]** +> *Prompt for image generation:* A sequential flowchart showing: (1) "Status Check" (green checkmark) → (2) "Verify Integrity" (shield icon, blue) → (3) "Dry Run" (play button icon, teal) → (4) "Execute Migration" (arrow icon, bold blue) → (5) "Verify After" (shield icon) → (6) "Confirm New Backend" (checkmark, green). Add red "ABORT" arrows from steps 2, 3, 5 labeled "if failed". Show data flowing through as "8 files, 2KB" beside migration arrows. Dark background, blue/teal/green accents for stages, clear step numbering. +> *Purpose:* Visually emphasizes the safety gates built into the migration process—readers understand that the system won't blindly move state without pre/post verification. + +**Step 2: Verify state is clean** + +```bash +npx squad-state verify +``` + +Output: +``` +✓ All checks passed. +All checks passed. 8 files validated. +``` + +**Step 3: Simulate migration (dry-run)** + +Before committing to the change: + +```bash +npx squad-state migrate filesystem git-notes --dry-run +``` + +Output: +``` +Dry run: Would transfer 8 files from filesystem to git-notes (2048 bytes) +No changes made. Re-run without --dry-run to execute. +``` + +**Step 4: Execute migration** + +```bash +npx squad-state migrate filesystem git-notes +``` + +Output: +``` +Migration complete: 8 files transferred from filesystem to git-notes (2048 bytes, 35ms) +``` + +**Step 5: Verify after migration** + +Run integrity check on the new backend: + +```bash +npx squad-state verify +``` + +Output: +``` +✓ All checks passed. +All checks passed. 8 files validated. 
+``` + +**Step 6: Confirm new backend is active** + +```bash +npx squad-state status +``` + +Output: +``` +Backend: git-notes +Files: 8 +Size: 2,048 bytes +Last Write: 2024-01-15 10:15:30 +Healthy: yes +Integrity: valid +``` + +State has been successfully moved. The filesystem backend is now unused (once the post-migration checks pass, you can delete the old `.squad/` state if desired). + +## Setting Retention Policies + +Configure automatic archival of old logs: + +```bash +npx squad-state retain --max-age 30 +``` + +Expected output: + +``` +Retention policy set: max age 30 days, archive to .squad/archive +``` + +This configures the system to automatically archive logs older than 30 days. What happens: + +1. Every 24 hours, the retention archiver runs +2. It finds all files last modified more than 30 days ago +3. It moves them to `.squad/archive/` +4. It keeps recent logs in active state + +Your state directory stays lean and readable. Old audit history is preserved (not deleted) for compliance. + +Check policy status: + +```bash +npx squad-state retain --status +``` + +Output: + +``` +Retention Policy +──────────────────────────── + Max age: 30 days + Archive location: .squad/archive + Last run: 2024-01-15 09:00:00 + Files archived: 12 + Space freed: 4,096 bytes +``` + +## Honest Scoping + +**What this does:** +- Check backend status and health +- Migrate between backends (filesystem ↔ git-notes ↔ orphan branches ↔ external) +- Verify state integrity before/after migration +- Set retention policies and auto-archive old logs +- Prevent data loss with pre/post-flight checks + +**What this doesn't do:** +- Implement real git-notes backend (example uses simulated filesystem directories) +- Implement real orphan-branch backend (same) +- Implement external repository backend (same) +- Handle conflicts between divergent backends +- Encrypt state (you need to do that yourself) + +This example is ready for demonstrations and testing.
For production use, you'd need to implement the real git backends (for example, using `nodegit` or `isomorphic-git`). + +## Architecture + +The framework has three layers on top of the backend implementations: + +``` +┌────────────────────────────────────────────┐ +│ CLI Commands Layer │ +│ status, migrate, verify, retain │ +└────────────────────┬───────────────────────┘ + │ +┌────────────────────┴───────────────────────┐ +│ Orchestration Layer │ +│ Migrator, StatusInspector, Archiver │ +└────────────────────┬───────────────────────┘ + │ +┌────────────────────┴───────────────────────┐ +│ Core Service Layer │ +│ BackendResolver, StateExporter, │ +│ StateImporter, IntegrityChecker │ +└────────────────────┬───────────────────────┘ + │ +┌────────────────────┴───────────────────────┐ +│ Backend Implementations │ +│ Filesystem, GitNotes, OrphanBranch │ +└────────────────────────────────────────────┘ +``` + +## Why This Matters + +**For compliance teams:** State can live in a separate repository that you audit and encrypt independently of source code. Archived decision logs are kept rather than deleted. + +**For large repos:** Migrate to git-notes and stop polluting git history. New state changes stop showing up as `.squad/` commits in `git log --oneline`. + +**For open-source projects:** Contributor experience improves. No more `.squad/` noise in the main branch history. + +**For multi-tenant setups:** An external backend lets different teams have isolated state without sharing git repos. + +## Next Steps + +1. **Test with simulated backends.** Run migrations in this demo to understand the flow. +2. **Build real backends.** Implement git-notes or orphan-branch backends for your infrastructure. +3. **Set retention policies.** Decide how long to keep active logs vs. archive. +4. **Integrate with CI/CD.** Automate backend verification on every pull request.
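
To make the "build real backends" step more concrete, here is a minimal sketch of the shared read/write/list/delete shape and a migration that verifies what it copied. It is not the code from the example repo: `createMemoryBackend`, `migrate`, and the in-memory `Map` storage are hypothetical names I picked for illustration, and the rollback assumes the target starts out empty. A real git-notes or orphan-branch backend would replace the `Map` with git operations.

```javascript
// Hypothetical sketch: none of these names come from the squad-state codebase.

// A backend only needs four operations: read, write, list, delete.
function createMemoryBackend(name) {
  const files = new Map();
  return {
    name,
    read: (path) => files.get(path),
    write: (path, content) => { files.set(path, content); },
    list: () => [...files.keys()].sort(),
    delete: (path) => files.delete(path),
  };
}

// Copy every file from source to target, then compare contents.
// The source is only read, so a failed migration leaves it intact.
function migrate(source, target, { dryRun = false } = {}) {
  const paths = source.list();
  if (dryRun) {
    return { ok: true, dryRun: true, files: paths.length };
  }
  const written = [];
  try {
    for (const path of paths) {
      target.write(path, source.read(path));
      written.push(path);
    }
    // Post-transfer check: every file must match the source exactly.
    for (const path of paths) {
      if (target.read(path) !== source.read(path)) {
        throw new Error(`Content mismatch after transfer: ${path}`);
      }
    }
    return { ok: true, dryRun: false, files: paths.length };
  } catch (error) {
    // Roll back partial writes (assumes the target was empty beforehand).
    for (const path of written) target.delete(path);
    return { ok: false, dryRun: false, files: 0, error: error.message };
  }
}

const filesystem = createMemoryBackend("filesystem");
filesystem.write("decisions.md", "# Decisions\n");
filesystem.write("agents/alice.json", JSON.stringify({ role: "reviewer" }));

const gitNotes = createMemoryBackend("git-notes");
const preview = migrate(filesystem, gitNotes, { dryRun: true });
const result = migrate(filesystem, gitNotes);
console.log(preview, result, gitNotes.list());
```

The design choice worth copying is that the migration never writes to the source, which is what makes "abort and leave the source unchanged" cheap to guarantee.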
+ +## Get Started + +```bash +# Clone and setup (5 minutes) +git clone https://github.com/bradygaster/project-squad-sdk-example-state-backend.git +cd project-squad-sdk-example-state-backend +npm install && npm run build + +# Check current backend (1 minute) +npx squad-state status + +# Verify state is healthy (1 minute) +npx squad-state verify + +# Migrate to git-notes (simulated demo) (1 minute) +npx squad-state migrate filesystem git-notes + +# Verify after migration (1 minute) +npx squad-state verify + +# Set retention policy (1 minute) +npx squad-state retain --max-age 30 +``` + +Ten minutes later, you have a working state backend manager. Then extend it with real git-notes or orphan-branch implementations for your infrastructure. Or use it as a template to build custom backends for your specific needs. + +The result: cleaner git history, fewer merge conflicts, and state management that scales with your team size. diff --git a/website/blog/2026-05-12-team-activity-monitor.md b/website/blog/2026-05-12-team-activity-monitor.md new file mode 100644 index 0000000..9723d3c --- /dev/null +++ b/website/blog/2026-05-12-team-activity-monitor.md @@ -0,0 +1,293 @@ +--- +slug: "/2026-05-12-team-activity-monitor" +canonical_url: "https://dfberry.github.io/blog/2026-05-12-team-activity-monitor" +custom_edit_url: null +sidebar_label: "2026.05.12 Mission Control Dashboard" +title: "What Are Your AI Agents Actually Doing? Build a Mission Control Dashboard" +description: "A proposal for a live dashboard that makes long-running agent activity easier to monitor and debug." +draft: true +tags: + - "AI Agents" + - "Observability" + - "Dashboard" + - "Squad" + - "AI-assisted" +keywords: + - "agent observability" + - "mission control dashboard" + - "workflow monitoring" + - "squad dashboard" + - "ai agents" +updated: "2026-05-12 00:00 PST" +--- + +Your Squad team runs multi-step workflows. Right now, they disappear into logs. 
+ +You don't know which agents are idle, which are stuck, whether cost is spiraling. If an agent hangs for 10 minutes, you find out when someone checks the logs. If budget burns $100 in a session, you see the bill next month. + +This is the observability problem. [Squad SDK Team Activity Monitor](https://github.com/bradygaster/project-squad-sdk-example-monitor) is a reference implementation of a terminal UI that surfaces it all in real-time. Not logs. Not dashboards buried three clicks deep. A live, updating dashboard in your terminal showing agent status, work items, decisions, and cost—with stuck detection that alerts you within 30 seconds. + +## The Problem: Multi-Agent Workflows Are Black Boxes + +When a single engineer runs a job, logs are fine. When 3–5 agents run in parallel, with one triggering another, context scatters across multiple log streams. Team leads see nothing. Engineers debug by luck. + +Here's the experience right now: +- Agent running. Are they working or stuck? Check logs. Scroll. Grep. 10 minutes of context-gathering. +- Three agents running in parallel. Which one finished first? Check logs. Parse timestamps. Reconstruct order. +- Budget burned $47 this session. Why? Check logs. Maybe some telemetry exists. Maybe not. + +This delays decision-making. You detect stuck agents minutes too late. You notice cost overruns after they happen. + +## What a Real-Time Dashboard Changes + +Squad Monitor displays agent status, work item assignments, decisions, and event timeline—all updating live: + +> 📊 **[DIAGRAM: Dashboard Data Flow]** +> *Prompt for image generation:* Create a vertical data flow diagram: (1) Top: Multiple Agent boxes (Agent-01, Agent-02, Agent-03) sending events upward → (2) Middle: EventBus/Collector hub (central processing point, teal circle with radiating arrows) → (3) Bottom: 4 dashboard sections side-by-side (Agents Table, Work Items Table, Decisions Feed, Timeline View), each showing sample data. 
Use dark background (charcoal), teal/cyan connectors, light text. Show data flowing down from collectors to each dashboard section. Include labels for "Real-time Events", "Aggregation", "Rendering". Purpose: visualize how live data streams from agents through collection to dashboard display. +> *Purpose:* Helps readers understand how disparate agent events get unified into a single live dashboard—the core architecture of the monitoring system. + +``` +╔═══════════════════════════════════════════════════════╗ +║ TEAM ACTIVITY MONITOR DASHBOARD ║ +╠═══════════════════════════════════════════════════════╣ +║ ║ +║ AGENTS ║ +║ ┌──────────┬───────────┬──────────┬────────────────┐ ║ +║ │ Agent ID │ State │ Duration │ Current Task │ ║ +║ ├──────────┼───────────┼──────────┼────────────────┤ ║ +║ │ Agent-01 │ working │ 2.3s │ Analyzing code │ ║ +║ │ Agent-02 │ idle │ 45.1s │ - │ ║ +║ │ Agent-03 │ completed │ 12.4s │ ✓ Done │ ║ +║ └──────────┴───────────┴──────────┴────────────────┘ ║ +║ ║ +║ WORK ITEMS ║ +║ #42 Fix auth tests [Agent-01] In Progress ║ +║ #41 Add token validation [Agent-02] Open ║ +║ ║ +║ DECISIONS ║ +║ • 14:32:15 [Agent-01] Strategy: iterative_refine ║ +║ • 14:32:08 [Agent-03] Decision: code_review ║ +║ ║ +║ TIMELINE (last 50 events) ║ +║ 14:32:45 → Agent-01 transitioned to working ║ +║ 14:32:30 → Decision made: Code review strategy ║ +║ 14:32:15 → Work item #42 updated to in_progress ║ +║ 14:32:00 → Agent-03 completed work ║ +║ ║ +║ 💰 Cost: $0.42 | Rate: 150 tokens/min | Budget: $50 ║ +║ 🟢 Agents: 3 healthy | ⚠️ Stuck: 0 | 🔴 Failed: 0 ║ +║ ║ +╚═══════════════════════════════════════════════════════╝ +``` + +Updates every 1–3 seconds. You see instantly when an agent stalls, when an issue gets reassigned, when cost per minute exceeds your threshold. 
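
Behind a view like this, something has to fold the raw event stream into those sections. The sketch below shows one way that could work: it reduces a list of events into an agents table, a decisions feed, and a capped timeline. The event shapes (`type`, `agentId`, `at`) and the `buildDashboardState` function are assumptions for this example, not the monitor's actual collector API.

```javascript
// Hypothetical event shapes; the real Squad SDK events may look different.
function buildDashboardState(events, timelineLimit = 50) {
  // Process oldest to newest so the last state seen for an agent wins.
  const ordered = [...events].sort((a, b) => a.at - b.at);
  const agents = new Map();
  const decisions = [];

  for (const event of ordered) {
    if (event.type === "agent_state") {
      agents.set(event.agentId, {
        state: event.state,
        since: event.at,
        task: event.task ?? "-",
      });
    } else if (event.type === "decision") {
      // unshift keeps the newest decision at the top of the feed.
      decisions.unshift({ at: event.at, agentId: event.agentId, summary: event.summary });
    }
  }

  // Newest first, capped, like the "last 50 events" timeline.
  const timeline = [...ordered].reverse().slice(0, timelineLimit);
  return { agents, decisions, timeline };
}

const state = buildDashboardState([
  { type: "agent_state", agentId: "Agent-01", state: "idle", at: 1000 },
  { type: "agent_state", agentId: "Agent-02", state: "idle", at: 1500 },
  { type: "decision", agentId: "Agent-01", summary: "iterative_refine", at: 2000 },
  { type: "agent_state", agentId: "Agent-01", state: "working", task: "Analyzing code", at: 3000 },
]);

console.log(state.agents.get("Agent-01"));
console.log(state.timeline.length);
```

Keeping the reducer a pure function of the event list means the same code can serve both the live view and a one-off snapshot.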
+ +## How to Set It Up + +```bash +$ git clone https://github.com/bradygaster/project-squad-sdk-example-monitor.git +$ cd project-squad-sdk-example-monitor +$ npm install +added 42 packages, and audited 45 packages in 2.3s + +$ npm run build +# TypeScript compiles cleanly to dist/ + +$ npm test +✓ src/core/eventbus-collector.test.ts (5 tests) +✓ src/core/monitor-collector.test.ts (4 tests) +✓ src/collectors/work-item-collector.test.ts (3 tests) +... +Test Files 12 passed (12) + Tests 28 passed (28) +``` + +It works out of the box. No configuration files. No manual setup. The dashboard runs until you press Ctrl+C. + +## Step 1: Start the Live Dashboard + +```bash +$ npx squad-monitor start + +╔════════════════════════════════════════════════════════════╗ +║ TEAM ACTIVITY MONITOR DASHBOARD ║ +╠════════════════════════════════════════════════════════════╣ +║ ║ +║ AGENTS ║ +║ ┌─────────────┬──────────┬──────────┬─────────────────┐ ║ +║ │ Agent ID │ State │ Duration │ Current Task │ ║ +║ ├─────────────┼──────────┼──────────┼─────────────────┤ ║ +║ │ Agent-001 │ working │ 2.3s │ Analyzing file │ ║ +║ │ Agent-002 │ idle │ 45.1s │ - │ ║ +║ │ Agent-003 │ completed│ 12.4s │ ✓ Done │ ║ +║ └─────────────┴──────────┴──────────┴─────────────────┘ ║ +``` + +The dashboard refreshes every 3 seconds. Each row shows: +- **Agent ID** — Unique agent identifier +- **State** — `working` (actively running), `idle` (waiting), `completed` (done), or `failed` (error) +- **Duration** — Seconds in current state +- **Current Task** — What the agent is doing right now + +An agent idle for 5+ minutes triggers a stuck alert—useful for detecting hangs. 
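
The stuck alert is simple enough to sketch. This is my guess at the shape of the check, not the monitor's real implementation: `findStuckAgents`, the `lastTransitionAt` field, and the choice to treat both `idle` and `working` agents as candidates are all assumptions for illustration.

```javascript
// Hypothetical sketch: field names and the helper are illustrative only.
const STUCK_AFTER_MS = 5 * 60 * 1000; // five minutes, matching the alert described above

function findStuckAgents(agents, now, thresholdMs = STUCK_AFTER_MS) {
  return agents
    // Completed and failed agents are finished, so they are never "stuck".
    .filter((agent) => agent.state === "idle" || agent.state === "working")
    // Flag agents whose state has not changed for at least the threshold.
    .filter((agent) => now - agent.lastTransitionAt >= thresholdMs)
    .map((agent) => agent.id);
}

const now = Date.parse("2026-05-12T14:40:00Z");
const agents = [
  { id: "Agent-001", state: "working", lastTransitionAt: now - 2300 },
  { id: "Agent-002", state: "idle", lastTransitionAt: now - 6 * 60 * 1000 },
  { id: "Agent-003", state: "completed", lastTransitionAt: now - 10 * 60 * 1000 },
];

const stuck = findStuckAgents(agents, now);
console.log(stuck); // [ 'Agent-002' ]
```

Passing `now` in as a parameter, rather than calling `Date.now()` inside, keeps the check easy to test with fixed timestamps.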
+ +## Step 2: Watch the Work Items Section + +``` +WORK ITEMS +┌──────┬──────────────────────┬─────────────┬──────────┐ +│ ID │ Title │ Assignee │ Status │ +├──────┼──────────────────────┼─────────────┼──────────┤ +│ #42 │ Fix login validation │ Agent-001 │ In Prog │ +│ #41 │ Add token validation │ Agent-002 │ Open │ +└──────┴──────────────────────┴─────────────┴──────────┘ +``` + +This correlates **what work is happening** with **which agent is doing it**. You see instantly if an issue is stalled with an idle agent, or moving fast with an active one. + +## Step 3: Read the Decisions Feed + +``` +DECISIONS +• 14:32:15 [Agent-001] Strategy: iterative_refine +• 14:32:08 [Agent-003] Decision: refactor_module_A +• 14:31:45 [Agent-002] Decision: add_integration_test +``` + +A feed of choices agents made during the session. Helps you understand agent reasoning without reading transcripts. Useful for post-session review: "Why did the agent choose strategy X?" + +## Step 4: Check the Timeline + +``` +TIMELINE (last 50 events) +14:32:45 → Agent-001 transitioned to working +14:32:30 → Decision made: Code review strategy +14:32:15 → Work item #42 updated to in_progress +14:32:00 → Agent-003 completed work +14:31:50 → Agent-002 transitioned to idle +``` + +Every significant event in order. Agent transitions (idle → working → done), decisions made, work item updates, errors and alerts. This is your audit trail for the session. + +## Step 5: Monitor Cost and Health + +``` +💰 Cost: $0.42 | Rate: 150 tokens/min | Budget: $50.00 +🟢 Agents: 3 healthy | ⚠️ Stuck: 0 | 🔴 Failed: 0 +``` + +**Cost section** shows: +- Running total of tokens spent (how much has this session cost?) +- Burn rate per minute (is cost accelerating?) +- Budget remaining (will we exceed the limit?) + +**Health section** shows: +- Number of healthy agents (working normally) +- Circuit breaker status (rate limits triggered?) +- Failed agents (did anyone crash?) 
+ +If cost per minute exceeds your threshold or an agent fails, this is where you see it *immediately*, not in a bill review next month. + +## For CI or Scripting: One-Time Snapshot + +Capture one snapshot and exit (useful for CI or piping to a file): + +```bash +$ npx squad-monitor snapshot > session-snapshot.txt +$ cat session-snapshot.txt + +╔════════════════════════════════════════════════════╗ +║ TEAM ACTIVITY MONITOR DASHBOARD ║ +╠════════════════════════════════════════════════════╣ +║ AGENTS ║ +║ ┌──────────────┬─────────┬──────────┬─────────────┐║ +║ │ Agent ID │ State │ Duration │ Current Task│║ +║ │ Agent-001 │ working │ 2.3s │ Testing │║ +... +``` + +Now share with the team: `curl -X POST -d @session-snapshot.txt hooks.slack.com/...` + +## Real-World Example: SRE Incident Triage + +Your SRE team runs incident diagnostics with 4 specialist agents working in parallel: +- Agent-Platform checks infrastructure +- Agent-Database checks queries and schema +- Agent-API checks logs and error rates +- Agent-Coordination synthesizes findings + +> 📊 **[DIAGRAM: Multi-Agent Parallel Diagnostics Timeline]** +> *Prompt for image generation:* Create a swimlane diagram (4 horizontal lanes, one per agent) showing time progression left-to-right (0s to 2min). Each agent starts from the left, shows active work as a colored bar (progress indicator), and ends with a completion marker. Agent-Database completes at 8s (short bar, green), Agent-Platform is still running at 2min (longest bar, teal), Agent-API completes at 12s (medium bar, blue), Agent-Coordination completes at 18s (medium bar, purple). Overlay: show the on-call engineer symbol appearing at 5s. Use dark background, color-coded swim lanes, dotted line at current time. Label each agent's role. Include annotation: "Monitor surfaces all 4 in real-time instead of sequential log review". +> *Purpose:* Shows readers how parallel diagnostics reduce total triage time compared to sequential investigation—making the real-world value concrete.
+ +**Old way:** SRE waits for all logs to finish, manually greps for diagnostics, calls a meeting. + +**With the monitor:** +- SRE sees all 4 agents running in real-time +- Agent-Database finishes first (8 sec)—database is healthy +- Agent-Platform is still working at 2 min—likely the problem area +- Agent-API done (12 sec)—API error rate spiked recently +- Agent-Coordination done (18 sec)—has synthesized findings + +SRE scans the dashboard, sees the pattern, and can start investigating platform infrastructure while Agent-Platform is still working. Real-time feedback beats end-of-run summary—especially at 2 AM. + +## Dashboard Sections Explained + +| Section | Shows | Why it matters | +|---------|-------|---------------| +| **AGENTS** | Live status of each agent running in this session | Know which agents are idle, working, or done without checking logs | +| **WORK ITEMS** | GitHub or Azure DevOps issues assigned to agents | See which issue is being worked on, by whom, and its status | +| **DECISIONS** | Chronological decisions made by agents | Understand agent reasoning and strategy without reading transcripts | +| **TIMELINE** | All session events in order | Audit trail for debugging, compliance, and post-session review | +| **COST** | Running total, burn rate, budget remaining | Catch cost overruns in real-time, not on next month's bill | +| **HEALTH** | Agent health, circuit breaker, rate limits | Immediate alert if agents are rate-limited or failing | + +## Three Commands That Show Real Power + +```bash +# Start the monitor (live updating, Ctrl+C to stop) +npx squad-monitor start + +# (While it's running in one terminal, start an agent workflow in another) +# Watch the monitor update—agents transition from idle to working + +# Capture one snapshot for Slack +npx squad-monitor snapshot | tee session-snapshot.txt +# Share with team: "Here's what we were doing at 2:47 PM" +``` + +Stuck detection works automatically. 
If an agent doesn't change state for 5 minutes, the dashboard shows an alert. Team lead sees it, kills the run, unblocks the team. No manual investigation needed. + +## The Honest Scoping + +This is a **Phase 1 MVP**—a terminal UI prototype with simulated data. The rendering patterns, collectors, and formatting are all tested and production-ready. But the data sources are simulated. + +When [Squad SDK](https://github.com/bradygaster/squad) releases stable runtime APIs (EventBus, CostTracker, RalphMonitor), this example upgrades to **live data** automatically. The collector interfaces are already designed for that transition. + +What's real today: +- Terminal rendering with ANSI formatting +- Event collection and aggregation +- Timeline building and filtering +- Cost tracking and health indicators +- Stuck detection (5-minute idle alert) + +What's simulated: +- Agent lifecycle events (ready to swap when SDK exposes EventBus) +- Work item fetching (ready for GitHub/ADO adapters) +- Decision recording (ready for Squad state integration) + +The core patterns are battle-tested. The data plumbing is where you'll customize when integrating with your real Squad infrastructure. + +## Why Build This Yourself + +Generic dashboards (Grafana, Datadog) show CPU and memory. This shows *agent semantics*—the things that matter for AI workflows: which agent is working on what issue, what decisions it made, whether it's stuck or done. + +It's built *for the problem*, not retrofitted. + +You can extend it. Add custom collectors. Wire in real-time Slack notifications. Build a web UI on top. Add multi-repo aggregation. The architecture supports it because it's designed as a foundation, not a finished product. + +--- + +Clone the [repo](https://github.com/bradygaster/project-squad-sdk-example-monitor), run the setup, and try it with your own workflows. The [quickstart](https://github.com/bradygaster/project-squad-sdk-example-monitor/blob/main/QUICKSTART.md) gets you running in 5 minutes. 
+ +Your agents are working. Now see what they're actually doing. diff --git a/website/blog/2026-05-12-whats-new-in-squad-v09.md b/website/blog/2026-05-12-whats-new-in-squad-v09.md new file mode 100644 index 0000000..fbd83e5 --- /dev/null +++ b/website/blog/2026-05-12-whats-new-in-squad-v09.md @@ -0,0 +1,128 @@ +--- +slug: "/2026-05-12-whats-new-in-squad-v09" +canonical_url: "https://dfberry.github.io/blog/2026-05-12-whats-new-in-squad-v09" +custom_edit_url: null +sidebar_label: "2026.05.12 Squad v0.9" +title: "What v0.9 Quietly Got Right" +description: "A look at the v0.9 improvements that made Squad feel production-ready instead of merely interesting." +draft: true +tags: + - "AI Agents" + - "Squad" + - "Release Notes" + - "Developer Workflow" + - "AI-assisted" +keywords: + - "squad v0.9" + - "agent framework release" + - "production ai agents" + - "copilot cli" + - "developer workflow" +updated: "2026-05-12 00:00 PST" +--- + +If you read our [favorite Squad features post](/blog/2026-05-12-squad-features-youre-missing), you've seen the core capabilities — scheduling, two-pass scanning, tiered memory, economy mode, and the orchestration essentials. Here's what landed in v0.9 that shifts Squad from a capable agent framework to something you can run in production with genuine confidence. + +These aren't flashy. They're engineering maturity — the kind of work that makes the difference between a tool you demo and a tool you deploy unsupervised. + +## External Capability Loading + +The problem: your automation needs are specific to your domain. Squad's core is general-purpose. You don't want to fork Squad. You want to extend it. + +The solution: drop JavaScript files into `.squad/capabilities/` and Squad loads them dynamically, turning your local development environment into a custom automation platform. No restart between iterations. 
+ +```javascript +// .squad/capabilities/my-analyzer.js +module.exports = { + name: 'my-analyzer', + version: '1.0.0', + canHandle: (issue) => issue.labels.includes('analyze-me'), + execute: async (context) => { /* your logic */ } +}; +``` + +**Concrete example:** Your team triages security vulnerabilities. You write a capability that calls the NVD API, scores issues by CVE severity, and auto-applies labels. It runs in Squad's watch mode without touching Squad's core code. When it's solid, you promote it. + +Watch mode discovers and validates capabilities at startup. + +**Learn more:** `packages/squad-sdk/external-loader.ts` and capability validation guide. + +## PID Tracker + Orphan Cleanup + +The problem: long watch sessions accumulate zombie processes. Eventually you hit "port already in use" on the 10th restart, and you don't know why. + +The solution: Squad now tracks child process PIDs and cleans them up on exit — even across crashes. It's cross-platform and silent. + +```bash +squad watch --duration 24h +# Press Ctrl+C +# All child processes automatically cleaned up +``` + +If something dies hard, `squad watch --health` shows you live PID status, uptime, and process tree. + +**Concrete example:** You're running Squad in CI with parallel jobs. Without PID tracking, orphaned processes pile up after failed runs, eating memory and holding file locks. Each retry fails with "port in use." With the tracker, cleanup is automatic. + +**Learn more:** `packages/squad-sdk/pid-tracker.ts` and watch mode guides. + +## External State Storage + +The problem: your `.squad/` directory normally lives in git. But when you switch branches, your orchestration state resets. If you context-switch frequently, that's friction. + +The solution: `squad externalize` moves your state to `~/.squad/global/` — the same place personal Squad lives. Your state now survives branch switches and never appears in `git status`. And you can always bring it back. 
+ +```bash +squad externalize # Move to global state +squad internalize # Move back to repo-local +squad config set stateLocation 'external' # Configure default +``` + +**Concrete example:** You're working across five branches a day — feature work, bug fixes, reviewing PRs. Every branch switch would reset your squad state if it lived in-repo. Externalizing means your agent decisions, orchestration logs, and memory tiers persist regardless of which branch you're on. + +**Learn more:** `packages/squad-sdk/state-backend.md` and CLI guide. + +## Shell Injection Hardening + +The problem: if you run Squad against untrusted GitHub data, you need defense in depth against injection attacks. Imagine a public repo where anyone can file issues — a crafted issue title like `"; rm -rf /; echo "` could escape into a shell command if you're using shell-interpolated execution. + +The solution: all subprocess execution in Squad's scheduler and state backend now uses `execFileSync` with input validators instead of shell-interpolated `execSync`. Arguments are passed as an array rather than parsed by a shell, so the title stays a plain string. You don't have to think about it. + +**When you'd use this:** Running Squad in a multi-tenant environment, or against untrusted GitHub data. This is the difference between "hope nobody exploits us" and "we have a security perimeter." + +**Learn more:** Security audit notes in the core docs. + +## Watch Health Command + +The problem: if you're running Squad in the background or on a remote machine, you don't know if things are working until something breaks. Your GitHub token expired at 2 AM, watch silently stopped processing issues, and you didn't find out until Monday. + +The solution: `squad watch --health` gives you observability without logging into the machine.
+ +```bash +squad watch --health +``` + +Returns: +- **PID** — Process ID (useful for manual troubleshooting) +- **Uptime** — How long watch has been running +- **Auth Account** — Which GitHub user is authenticated +- **Loaded Capabilities** — All detected capabilities (.squad/capabilities/ + built-in) +- **Auth Drift Detection** — Alerts if your GitHub token changed or permissions shifted + +**Concrete example:** You're running watch in a tmux session on a remote CI runner. Health checks tell you if auth drifted without SSH'ing to the machine and tailing logs. You can pipe it into your monitoring system or just glance at it before you close your laptop. + +**Learn more:** Watch mode CLI reference. + +--- + +## The Pattern + +These five features share a theme: they're about what happens when Squad runs unsupervised. External loading lets you extend it without forking. PID tracking and shell hardening keep it safe. External state keeps it portable. Health monitoring keeps it observable. + +If you're running Squad interactively — a terminal open, eyes on output — you might never need these. But if you're running it in CI, in the background, or across a team, this is the infrastructure that makes it reliable. + +**Start here:** +- If you're already using watch mode, run `squad watch --health` once and see what it reports. +- If you switch branches frequently, try `squad externalize` for a week. +- If you're building domain-specific automation, drop a `.js` file in `.squad/capabilities/` and prototype. + +These features ship with v0.9. They're ready now. 
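To make the shell injection hardening described in this post concrete, here is a minimal sketch of argument-array execution in Node.js. It is an illustration under stated assumptions, not Squad's actual code: `echoTitleSafely` and `assertIssueNumber` are hypothetical helpers, and the child Node process exists only to show that a crafted title arrives as one literal argument.

```javascript
// Illustrative sketch only; not Squad's implementation.
const { execFileSync } = require('node:child_process');

// A crafted issue title like the one described in the post.
const maliciousTitle = '"; rm -rf /; echo "';

// Unsafe pattern (shown as a comment, never run): the title is spliced into
// a command string that a shell then parses, so the quotes and semicolons
// inside it can end the intended command and start new ones.
//   execSync(`some-tool --title "${maliciousTitle}"`);

// Safer pattern: execFileSync takes a program and an argument array and does
// not start a shell by default, so the title stays a single literal argument.
// echoTitleSafely is a hypothetical helper: a child Node process writes back
// the one argument it received.
function echoTitleSafely(title) {
  const script = 'process.stdout.write(process.argv[1])';
  return execFileSync(process.execPath, ['-e', script, title], {
    encoding: 'utf8',
  });
}

// Hypothetical input validator: accept only positive integers as issue
// numbers and reject everything else before it reaches a subprocess.
function assertIssueNumber(value) {
  if (!/^[1-9]\d*$/.test(String(value))) {
    throw new TypeError(`Not a valid issue number: ${value}`);
  }
  return Number(value);
}

const echoed = echoTitleSafely(maliciousTitle);
console.log(echoed === maliciousTitle);
```

If the argument passes through untouched, the comparison logs `true`. The validator is a second layer: even when no shell is involved, rejecting malformed input early keeps unexpected values out of subprocess arguments.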
diff --git a/website/blog/2026-05-24-prds-arent-just-for-code.md b/website/blog/2026-05-24-prds-arent-just-for-code.md new file mode 100644 index 0000000..f80738b --- /dev/null +++ b/website/blog/2026-05-24-prds-arent-just-for-code.md @@ -0,0 +1,327 @@ +--- +slug: /2026-05-24-prds-arent-just-for-code +canonical_url: https://dfberry.github.io/blog/2026-05-24-prds-arent-just-for-code +custom_edit_url: null +sidebar_label: "2026.05.24 PRDs Aren't Just for Code" +title: "PRDs Aren't Just for Code: Communication clarity that travels" +description: "How treating every request as a product requirement — not just code features — forces the specificity that lets AI agents execute without hand-holding." +draft: true +tags: + - PRD + - AI Agents + - Project Management + - Content Management + - Independent Work + - Squad +updated: 2026-05-24 12:00 PST +keywords: + - prd for non-code work + - product requirements document ai agents + - independent agent execution + - project management prd + - content management prd + - ai agent dispatch + - squad cli prd + - prd completeness matrix + - thinking tax automation + - product management ai +--- + +# PRDs Aren't Just for Code: Communication clarity that travels + +A PRD agent took a one-line issue in my workspace and turned it into a real implementation PRD: nine intake questions, phased work, named agents, acceptance criteria, and dispatch scripts. That one-line issue was enough for me because I already knew the repo, the conventions, and the missing context. It was not enough for an agent that had to move without me standing there to explain the rest. It was also not enough to show my management or partner teams what the intended work was or the value of that work. The PRD captured that communication clarity. + +That same pattern showed up two more times this week. I used one PRD as a baseline against roughly three months of branding work and found a gap in the setup process I had missed. 
Then I used another PRD to compare planned scope against delivered work so the unanswered questions could turn into queued issues instead of another vague "we should look at this later." + + + +![A pink-haired woman turns a vague request into a structured PRD while agents begin moving](./media/2026-05-24-prds-arent-just-for-code/hero-prd-thinking-tax.png) + +_The work starts looking like it can move on its own only after the thinking stops being casual._ + +I keep hearing PRDs talked about as if they only belong to software feature work. I use them there too. But this week I kept reaching for the same pattern in project work, product work, and content work. Each time, the value was the same: the PRD made me write down the part I normally carry in my head. + +A PRD is communication clarity. That's it. The document works because it makes the request unambiguous for the next reader. Sometimes that reader is a person. Sometimes it's an agent. The value is the same. + +## Start with what a PRD is + +A product requirements document (PRD) is a structured answer to three questions: what are we building, for whom, and why? More simply, it is the place where I stop assuming the other side will fill in the blanks for me. + +Before anyone starts building, the PRD turns the idea into requirements other people can actually act on. In a lot of teams, that means engineering, design, product, and sometimes marketing can line up around the same page before work starts. + +That is still useful. What changed for me is what happens next. Now the same clarity has to carry agent work too. I can draft a PRD from rough bullets, update it as the work changes, and use it as the source for agent assignments, review checks, and follow-up issues. + +That shift is why I started applying the pattern outside code. The useful part is the habit of slowing down long enough to answer the structured questions that make handoff possible. Agents need the same clarity humans do. 
Once I saw that clearly, it was hard not to use the same pattern for project planning and content operations too. + +## Stop treating clarity as optional + +The easy story about AI agents is that I can stay incomplete — not specific enough — and the system will figure it out. + +I have tried that often enough to know what happens next. The agent fills in the gaps. It makes assumptions. Those assumptions land as wrong choices. Then I'm back in, steering it turn by turn because every guess it made was wrong. Sometimes the work still gets done. But I'm driving now, not the agent. + +That is why I keep coming back to PRDs. A good PRD is not useful because it looks formal. It is useful because it forces me to answer questions I would normally leave fuzzy. + +Questions like: + +- What problem am I actually trying to solve? +- What does done look like in a way another person or agent can check? +- What is out of scope? +- Which repo, project, or workflow owns this? +- What existing documents already limit the answer? +- What dependencies have to exist before anyone starts? +- What acceptance criteria tell me the work is complete? +- Who or what should do each phase? +- What evidence would prove the work happened correctly? + +```mermaid +flowchart LR + subgraph L[Without PRD] + A[Vague task] --> B[Clarify] + B --> C[Rework] + C --> B + end + subgraph R[With PRD] + D[Intake answers] --> E[Clear requirements] + E --> F[Independent execution] + F --> G[Review] + end +``` + +Nine questions is not a magic number. It just kept surfacing in my sessions this week. It pushed me past the comfortable version of the request. "We should improve our PRD workflow" sounds fine until I have to answer which workflow, which repos, which gaps, which owners, and what exact result counts as success. That is the moment the request stops being a loose idea and starts becoming something the system can actually use. 
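One way to keep the intake questions from staying fuzzy is to treat them as required fields and list which ones are still blank. The sketch below is a hypothetical illustration, not output from the PRD agent: the field names, the sample draft, and the `findUnansweredQuestions` helper are shorthand I am assuming for the nine questions above.

```javascript
// Hypothetical sketch: field names are shorthand for the nine intake
// questions listed above, not a schema used by any real PRD tool.
const INTAKE_FIELDS = [
  'problem',
  'definitionOfDone',
  'outOfScope',
  'owner',
  'constrainingDocs',
  'dependencies',
  'acceptanceCriteria',
  'phaseAssignments',
  'evidence',
];

// Returns the intake fields that are missing or blank.
// Empty strings and empty arrays both count as unanswered.
function findUnansweredQuestions(intake) {
  return INTAKE_FIELDS.filter((field) => {
    const answer = intake[field];
    if (Array.isArray(answer)) return answer.length === 0;
    return typeof answer !== 'string' || answer.trim() === '';
  });
}

// Sample draft (made-up content) with several questions still open.
const draft = {
  problem: 'One-line issues do not carry enough context for agents.',
  definitionOfDone: 'Every major work item has checkable criteria.',
  outOfScope: '',
  owner: 'docs repo',
  acceptanceCriteria: [],
};

console.log(findUnansweredQuestions(draft));
```

The design choice is deliberate: the helper only reports gaps. Answering them stays a human job, which matches the point that the request becomes usable only after the fuzzy parts are written down.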
+ +## Run the experiment on work that isn't a feature + +Requests that felt clear in my head were not clear enough to run independently. Once I started using PRD patterns outside their usual lane, the same document kept helping in different ways. + +### Session 1: Expand a one-line issue until agents can move + +I had a one-line issue that made sense to me because I live in this project every day. It had enough context for a human who already knew the setup. It did not have enough context for a system that needed to break the work into phases, assign work to named agents, and move without waiting for me to answer basic questions. + +So the PRD agent expanded that short issue into a real implementation PRD. The useful part was the added structure. + +The PRD turned a compressed request into something other actors could use: + +- the problem statement stopped assuming insider context +- the work broke into phases instead of one blended paragraph +- acceptance criteria became explicit instead of implied +- agent assignments were named instead of hand-waved +- dispatch scripts could be generated because there was finally enough detail + +Before the PRD, the request depended on me being available to explain the rest. After the PRD, that logic lived in the document. Once the requirements were explicit, dispatch was no longer a hope. The tooling had something solid to run. + +The time investment moved to the front, which is boring in the best way. In my experience, that is where the payoff shows up later. I stop rescuing ambiguity after the fact because the plan can survive handoff. + +### Session 2: Use the PRD as a mirror, not a starting point + +The second session changed how I think about PRDs. I was reviewing PRD-driven branding work, but instead of treating the document as a fresh plan, I treated it as a claim and compared it against the artifacts the week had already produced. + +One of the most useful findings was a missing validation step I'd overlooked. 
That kind of check was easy to miss while the surrounding work was already moving. The PRD gave me something stable to compare against. Without it, that omission would have stayed buried inside the blur of ongoing work. + +Once the gap was visible, the next step was obvious. I spawned two follow-on actions from the review findings: + +- one to fix the ownership document so responsibilities were clear +- one to create a CI triage skill + +The review did not end at "interesting gap." It created follow-on work with owners and specific files to update. One agent fixed the ownership document. Another created a validation skill so the check would run next time. + +### Session 3: Compare scope to outcomes, then let the gaps create work + +The third session pushed the same pattern one step further. I used the PRD not just to plan or review, but to ask what actually happened. I compared PRD scope against real work outcomes to find the places where my planned story and the delivered story did not match. + +The completeness check surfaced open questions. Once those questions existed in a named list, I could answer them directly and turn the unresolved gaps into issues for later agent work. + +The flow was simple: planned scope checked against reality produced open questions, my answers turned those questions into queued issues, and the queued issues were ready for later agent work. It was repeatable. + +With the PRD, I had a stable statement of intent. I handled the judgment calls, and the agents handled the mechanical translation after the missing information was written down. 
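The Session 3 flow can be sketched in a few lines. This is a hypothetical illustration under stated assumptions: the data shapes, the sample scope items, and the `checkCompleteness` and `toQueuedIssues` helpers are invented for the example and are not the tooling behind these sessions.

```javascript
// Hypothetical sketch of the Session 3 flow; shapes and names are assumptions.

// Compare planned scope against delivered work. Any scope item with no
// matching delivered work becomes an open question for a human to answer.
function checkCompleteness(plannedScope, deliveredWork) {
  const deliveredIds = new Set(deliveredWork.map((item) => item.scopeId));
  return plannedScope
    .filter((scope) => !deliveredIds.has(scope.id))
    .map((scope) => ({
      scopeId: scope.id,
      question: `No delivered work found for "${scope.title}". Still in scope?`,
    }));
}

// Turn answered questions into queued issues. Unanswered questions stay
// with the human; the sketch never invents an answer.
function toQueuedIssues(openQuestions, answers) {
  return openQuestions
    .filter((q) => typeof answers[q.scopeId] === 'string')
    .map((q) => ({
      scopeId: q.scopeId,
      title: answers[q.scopeId],
      status: 'queued',
    }));
}

// Sample data (made up for illustration).
const plannedScope = [
  { id: 'S1', title: 'Ownership document' },
  { id: 'S2', title: 'Validation step' },
  { id: 'S3', title: 'Dispatch scripts' },
];
const deliveredWork = [{ scopeId: 'S1' }];

const openQuestions = checkCompleteness(plannedScope, deliveredWork);
const issues = toQueuedIssues(openQuestions, {
  S2: 'Add the missing validation step',
});
console.log(openQuestions.length, issues);
```

The split between the two functions mirrors the division of labor in the text: the comparison is mechanical, while the `answers` object is where the human judgment calls enter before anything is queued.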
+ +If I strip the three sessions down to their bones, this is what happened: + +| Session | Starting point | What the PRD did | What became possible next | +| --- | --- | --- | --- | +| Convert a one-line issue to a PRD | one-line issue | expanded intent into phases, acceptance criteria, and agent assignments | hands-off dispatch with less human follow-up | +| Review PRD for branding | existing PRD plus three months of work | exposed mismatch between intended scope and actual execution | spawned targeted follow-on agents for ownership-document updates and CI triage skill work | +| Compare PRD with work | finished work compared against planned scope | surfaced open questions and unresolved gaps | generated issues that could be queued for later agent execution | + +The PRD was not just a status document. It was a conversion layer between what I meant and what could be done. + +## Follow the pattern that kept repeating + +### Start by forcing the intake answers into the open + +The biggest misconception I had to drop is that the initial request is the hard part. Usually it is not. Usually the hard part is everything the request assumes. + +A sentence like "convert this issue into a real plan" sounds efficient because it compresses the task. But that efficiency is fake if the next actor has to unpack hidden assumptions before doing anything useful. 
+ +```mermaid +flowchart LR + A[Vague request] --> B[PRD intake] + B --> C[Specificity] + C --> D[Independent execution] +``` + +For me, the PRD intake phase looked less like "writing a document" and more like pinning down the variables that were floating around informally: + +- what the request is asking for in plain language +- what should exist at the end +- what phase boundaries keep the work from smearing together +- which constraints come from project conventions, operating rules, or ownership docs +- what needs human approval versus what can run on its own +- what evidence will make review easy later + +This is the thinking tax. It costs something up front. It slows down the moment where I get to feel like I already started. I have to stop, answer, narrow, and sometimes admit I do not actually know what I want yet. The payoff shows up later. + +### Check completeness before you confuse motion with coverage + +The second phase is the one I underused before this week: completeness checking. + +I used to think of PRDs mostly as forward-looking documents that I would write and then execute. They are just as useful as review tools. + +A PRD lets me ask a direct question: did the work we actually did cover the work we said mattered? + +When work moves across multiple agents, multiple repos, and multiple days, motion starts to feel like progress whether or not the original scope has been covered. A completeness check interrupts that illusion. + +In practical terms, it helped me inspect: + +- whether every major acceptance area had corresponding work +- whether implied dependencies had been made explicit +- whether ownership boundaries were still accurate +- whether validation steps existed, not just good intentions +- whether the missing pieces were small omissions or larger design gaps + +The branding review session made this concrete for me. 
A missing check in the onboarding workflow was not the kind of thing I would have reliably caught by reading status updates alone. It became visible because I had a clear frame for what should have been there. + +### Turn the gaps into dispatchable work + +Once the PRD review surfaced missing pieces, the follow-up path became much cleaner than I expected: + +```mermaid +flowchart TD + A[Open questions] -->|human answers| B[Human items] + B -->|once answered| C[Ready requirements] + C -->|assign or queue| D[Agent or queue] + D -->|no translation| E[Dispatch] +``` + +I like that because it keeps the human work and the automation work in the right places. The human work is making decisions. The automation work is transforming those decisions into execution steps. If the document is sloppy, those jobs collapse into each other and I end up doing both. + + + +![A pink-haired woman directs agents as work packets move from a PRD board into execution queues](./media/2026-05-24-prds-arent-just-for-code/independent-execution-payoff.png) + +_The payoff is not that the human disappears. It's that the human gets to stop re-explaining the same intent._ + +### Let work run on its own after the meaning is stable + +By the end of the week, the line I kept writing down was simple: work runs best on its own after the meaning is stable. + +I need a PRD because independent work magnifies whatever level of clarity I provide. If I hand over a fuzzy request, the system scales fuzz. If I hand over a bounded requirement with owners and checks, the system scales useful action. + +That is why I think PRDs are underused outside code. A lot of non-code work still assumes human availability will absorb the ambiguity for free. Project work, product work, and content work are full of requests that sound understandable in conversation but are not clear enough to survive handoff. 
+ +## Push the pattern into project management + +I keep seeing a gap between backlog clarity for humans and backlog clarity for agents. + +Project systems are often optimized for coordination among people who already know how to fill in the blanks. We can see a short title, remember the meeting, infer the constraint, and keep going. Agents are much more literal. If the work item does not carry the missing pieces, the queue looks fuller than it really is. + +When I look at project management through the PRD lens, I stop asking whether the board is organized and start asking whether each major item can survive handoff without live clarification. That changes the shape of the document. + +Instead of a loose epic with a few bullets, the more useful version looks like this: + +- clear statement of the problem the epic is trying to solve +- boundaries between phases so tasks do not overlap +- acceptance criteria that can be checked after work lands +- routing clues about which agents or teams own which slice +- dependency notes that prevent premature execution +- validation expectations so the review step is not invented on the fly + +Once the project document is explicit enough, breaking the work down gets easier. Work items stop being reminders for future humans and start becoming units that can move. + + + +![A pink-haired woman organizes a kanban board while agents pull clearly defined work items into motion](./media/2026-05-24-prds-arent-just-for-code/project-management-prd.png) + +_The board gets more useful the moment each card carries enough meaning to travel on its own._ + +It means putting detail where it changes execution and leaving everything else light. + +One shift that helped me was seeing acceptance criteria as scheduling tools, not just review tools. If an epic says it is done when three specific outcomes exist, decomposition gets cleaner. 
If the epic just says "move this initiative forward," the board can look busy for a long time without telling me whether the right work is actually in flight. + +The practical signal for me is simple: if I expect the work to be done asynchronously, across roles, or by agents, the request probably needs PRD treatment whether or not the output is software. + +## Push the pattern into product management + +Traditionally, product teams used PRDs to line people up before code started. The PRD was the single place where the team could see the user problem, the proposed solution, the requirements, the success measures, and the boundaries. What changes now is that the same document also has to support AI-assisted drafting, routing, review, and execution. + +The old mental model was document first, handoff second. The newer one I am experimenting with is requirement first, routing second, independent execution third. A product document that is only persuasive is not enough. A product document that supports execution has to name decisions, constraints, success conditions, and trade-offs in a way other actors can use. + +The PRD session where a short issue got expanded into a full implementation artifact made this very concrete for me. The expansion was not about adding more words because longer is better. It was about adding enough structure that each downstream actor could tell what they owned. Implementation phases, acceptance criteria, agent assignments, and dispatch scripts existed because the PRD supported dispatch. + +That matters because product requests are often written for alignment first and execution second. That is fine if humans are going to sit together and negotiate the rest in real time. But once I want agents, or loosely coupled teams, to move without hand-holding, the requirement has to answer the follow-up questions before they are asked. 
+ + + +![A pink-haired woman reviews feature sketches and requirement pages while agents work from the clarified spec](./media/2026-05-24-prds-arent-just-for-code/product-management-prd.png) + +_The specification earns its keep when other actors can move from it without guessing what I meant._ + +One thing I appreciate here is how PRDs expose whether I really made a decision or just postponed it. If the document leaves a major constraint unstated, that is not neutrality. That is hidden work for whoever picks it up next. + +That is one of the clearest ways AI acts like a collaborator instead of a magician. It makes my vagueness expensive. + +The useful thing was not the system pretending to know the answers. The useful thing was that it made the missing answers painful enough that I finally wrote them down. + +Where it broke down was whenever I tried to skip that step and expected the system to infer intent from shorthand. It can infer a lot. It still should not be asked to infer the core requirements. + +The sweet spot is not maximum detail. It is enough detail that other actors can move without reopening the problem definition every hour. + +## Push the pattern into content management + +Content management may be the least obvious place for this, but content work is full of documents that already behave like PRDs even when nobody calls them that. Article plans, content strategy docs, editorial calendars, coverage matrices, freshness reviews, taxonomy decisions, and publishing workflows all describe intended outcomes, constraints, ownership, sequence, and validation. + +Content work often has the same hidden-context problem. We know an article is stale. We know a strategy doc implies missing tutorials. We know a calendar entry means someone needs a draft, images, metadata, and review. But unless that thinking lands in a document with clear boundaries, the work stays socially clear and practically fuzzy. 
+ +That arrangement breaks down when I want content audits, freshness checks, coverage-gap detection, or article scaffolding to run with less manual glue. + +If an article plan is written like a real requirement document, I can review it for completeness, compare planned coverage against published coverage, detect gaps, and route the missing work with less back-and-forth. The artifacts are concrete: a markdown file, a matching media folder, frontmatter fields like `draft: true` and `keywords`, and a build command that fails if something is wrong. The operations around it do not have to stay mysterious. + + + +![A pink-haired woman sorts article outlines and editorial plans while agents manage the operational content flow](./media/2026-05-24-prds-arent-just-for-code/content-management-prd.png) + +_The content strategy starts acting like a system once the editorial intent is written in a form the system can inspect._ + +The three sessions from this week map cleanly to content operations: + +```mermaid +flowchart TD + A[Thin Idea] -->|turns into| B[Article PRD] + B -->|checked with| C[Strategy Check] + C -->|raises new| D[Editorial Questions] + D -->|answers create| E[Assignable Work] + E -->|flows into| F[Content Systems] +``` + +If I sketch what an article PRD needs in order to support content operations without me stepping in, it looks a lot like the software version: audience, intent, angle, exclusions, source material, freshness risk, required assets, review checkpoints, and a definition of done that is more concrete than "publish something good." + +## Choose when the thinking tax is worth it + +The pattern has rough edges. Speed was the first. If I have a request in my head and a path that feels mostly clear, the last thing I want is a form asking me to pin down acceptance criteria, boundaries, or dependencies. Not every task deserves full PRD treatment. + +One guardrail I keep coming back to is handoff count.
If the work will stay with one person in one short session, I probably do not need a full PRD. If the work will cross time, tools, repos, reviewers, or agents, the cost of under-specifying it rises fast. That is when the thinking tax looks cheap compared to cleanup. + +False confidence was the second rough edge. A polished document can look complete even when it misses an important gap. The answer is to treat the PRD as something I can review and revise, not as a finished artifact. + +Judgment was the third rough edge. When the completeness check surfaced open questions in Session 3, I still had to answer them. The system could not responsibly invent those answers for me. The point of the PRD pattern is not to erase human decision-making. It is to capture it cleanly enough that it happens where it should. + +PRDs expose where I am still hand-waving. A vague request lets me keep the illusion that I know what I mean. A requirement document asks me to prove it. Sometimes I discover that I do not actually have an answer yet. + +## Keep following the work toward more independent systems + +The PRD is valuable because it makes the request clear enough that someone else can act on it without guessing. + +Planning, review, and comparison are different surfaces of the same idea: a PRD is communication clarity. I used to think of PRDs mostly as a prelude to implementation. Now I think of them as a reusable requirement document that can support planning, auditing, routing, comparison, and dispatch across much more than code. + +I do not think the future is "agents replace planning." My week suggested the opposite. The more I want the system to work on its own, the more seriously I have to take the planning document. The document works because it makes the request unambiguous for whoever reads it next, human or agent. + +My next test is whether article PRDs can survive metadata review, asset checks, and `npm run build` without me reconstructing the brief from memory.
Editorial judgment — the moment when a sentence sounds unlike me, or a claim needs a receipt — still needs a human in the loop. That is a limit on what the document can carry, not a reason to skip writing it. diff --git a/website/blog/media/2026-05-24-prds-arent-just-for-code/content-management-prd.png b/website/blog/media/2026-05-24-prds-arent-just-for-code/content-management-prd.png new file mode 100644 index 0000000..80fcf2d Binary files /dev/null and b/website/blog/media/2026-05-24-prds-arent-just-for-code/content-management-prd.png differ diff --git a/website/blog/media/2026-05-24-prds-arent-just-for-code/hero-prd-thinking-tax.png b/website/blog/media/2026-05-24-prds-arent-just-for-code/hero-prd-thinking-tax.png new file mode 100644 index 0000000..5d1955f Binary files /dev/null and b/website/blog/media/2026-05-24-prds-arent-just-for-code/hero-prd-thinking-tax.png differ diff --git a/website/blog/media/2026-05-24-prds-arent-just-for-code/image-prompts.md b/website/blog/media/2026-05-24-prds-arent-just-for-code/image-prompts.md new file mode 100644 index 0000000..12ebb37 --- /dev/null +++ b/website/blog/media/2026-05-24-prds-arent-just-for-code/image-prompts.md @@ -0,0 +1,123 @@ +--- +title: "Image prompts for PRDs Aren't Just for Code" +draft: true +description: "Support file for blog illustration and diagram generation prompts." 
+--- + + + +# 2026-05-24 PRDs Aren't Just for Code — Image Prompts + +## Batch JSON + +```json +{ + "post": "2026-05-24-prds-arent-just-for-code", + "style": "Watercolor illustration, soft wet-on-wet washes, visible paper texture, warm muted tones, loose brushwork", + "character": "White female with pink hair", + "images": [ + { + "filename": "hero-prd-thinking-tax.png", + "alt": "Pink-haired woman turning a vague request into a structured PRD while helper agents begin moving", + "prompt": "White female with pink hair at desk turning one vague note into structured PRD pages, faint helper agents moving around her, watercolor illustration, soft wet-on-wet washes, visible paper texture, warm muted tones, loose brushwork, no text" + }, + { + "filename": "project-management-prd.png", + "alt": "Pink-haired woman organizing a kanban board while agents pull clearly defined work items into motion", + "prompt": "White female with pink hair arranging PRD cards across a kanban board while helper agents pull work items into columns, watercolor illustration, soft wet-on-wet washes, visible paper texture, warm muted tones, loose brushwork, no text" + }, + { + "filename": "product-management-prd.png", + "alt": "Pink-haired woman reviewing feature sketches and requirement pages while agents work from the clarified spec", + "prompt": "White female with pink hair studying feature sketches and requirement pages while helper agents build from precise notes, watercolor illustration, soft wet-on-wet washes, visible paper texture, warm muted tones, loose brushwork, no text" + }, + { + "filename": "content-management-prd.png", + "alt": "Pink-haired woman sorting article outlines and editorial plans while agents manage the operational content flow", + "prompt": "White female with pink hair at editorial desk sorting article outlines, calendar pages, and review notes while helper agents manage content folders, watercolor illustration, soft wet-on-wet washes, visible paper texture, warm muted 
tones, loose brushwork, no text"
+    },
+    {
+      "filename": "autonomous-execution-payoff.png",
+      "alt": "Pink-haired woman directing agents as work packets move from a PRD board into execution queues",
+      "prompt": "White female with pink hair overseeing small helper agents carrying folders from PRD board to task queues, watercolor illustration, soft wet-on-wet washes, visible paper texture, warm muted tones, loose brushwork, no text"
+    }
+  ],
+  "mermaidDiagrams": [
+    {
+      "filename": "flow-vague-request-to-autonomy.mmd",
+      "prompt": "Create a simple left-to-right Mermaid flow diagram showing: vague request -> PRD intake -> specificity -> autonomous execution. Keep labels short. Use 4 nodes max. Emphasize that specificity is the enabling step."
+    },
+    {
+      "filename": "comparison-with-vs-without-prd.mmd",
+      "prompt": "Create a two-lane Mermaid comparison diagram. Left lane: without PRD -> vague task -> repeated clarification -> rework loops. Right lane: with PRD -> intake answers -> clear requirements -> autonomous execution. Keep each lane to 4 nodes and use short labels."
+    }
+  ]
+}
+```
+
+## Individual Watercolor Prompts
+
+### 1. Hero
+
+- **Filename:** `hero-prd-thinking-tax.png`
+- **Prompt:** White female with pink hair at desk turning one vague note into structured PRD pages, faint helper agents moving around her, watercolor illustration, soft wet-on-wet washes, visible paper texture, warm muted tones, loose brushwork, no text
+- **Purpose:** Opening concept image for the thesis that specificity unlocks autonomous work.
+
+### 2. Project Management
+
+- **Filename:** `project-management-prd.png`
+- **Prompt:** White female with pink hair arranging PRD cards across a kanban board while helper agents pull work items into columns, watercolor illustration, soft wet-on-wet washes, visible paper texture, warm muted tones, loose brushwork, no text
+- **Purpose:** Visual for epics and work items becoming dispatchable through PRD structure.
+
+### 3. Product Management
+
+- **Filename:** `product-management-prd.png`
+- **Prompt:** White female with pink hair studying feature sketches and requirement pages while helper agents build from precise notes, watercolor illustration, soft wet-on-wet washes, visible paper texture, warm muted tones, loose brushwork, no text
+- **Purpose:** Visual for feature requirements becoming clear enough to route without back-and-forth.
+
+### 4. Content Management
+
+- **Filename:** `content-management-prd.png`
+- **Prompt:** White female with pink hair at editorial desk sorting article outlines, calendar pages, and review notes while helper agents manage content folders, watercolor illustration, soft wet-on-wet washes, visible paper texture, warm muted tones, loose brushwork, no text
+- **Purpose:** Visual for editorial and content strategy work treated as inspectable requirements.
+
+### 5. Automation Payoff
+
+- **Filename:** `autonomous-execution-payoff.png`
+- **Prompt:** White female with pink hair overseeing small helper agents carrying folders from PRD board to task queues, watercolor illustration, soft wet-on-wet washes, visible paper texture, warm muted tones, loose brushwork, no text
+- **Purpose:** Visual for the handoff from clarified requirement to autonomous execution.
+
+## Mermaid Diagram Prompts
+
+### 1. Vague Request to Autonomous Execution
+
+- **Filename:** `flow-vague-request-to-autonomy.mmd`
+- **Prompt:** Create a simple left-to-right Mermaid flow diagram showing: vague request -> PRD intake -> specificity -> autonomous execution. Keep labels short. Use 4 nodes max. Emphasize that specificity is the enabling step.
+- **Suggested Mermaid source:**
+
+```mermaid
+flowchart LR
+  A[Vague request] --> B[PRD intake]
+  B --> C[Specificity]
+  C --> D[Autonomous execution]
+```
+
+### 2. With PRD vs Without PRD
+
+- **Filename:** `comparison-with-vs-without-prd.mmd`
+- **Prompt:** Create a two-lane Mermaid comparison diagram. Left lane: without PRD -> vague task -> repeated clarification -> rework loops. Right lane: with PRD -> intake answers -> clear requirements -> autonomous execution. Keep each lane to 4 nodes and use short labels.
+- **Suggested Mermaid source:**
+
+```mermaid
+flowchart LR
+  subgraph L[Without PRD]
+    A[Vague task] --> B[Clarify]
+    B --> C[Rework]
+    C --> B
+  end
+  subgraph R[With PRD]
+    D[Intake answers] --> E[Clear requirements]
+    E --> F[Autonomous execution]
+    F --> G[Review]
+  end
+```
diff --git a/website/blog/media/2026-05-24-prds-arent-just-for-code/independent-execution-payoff.png b/website/blog/media/2026-05-24-prds-arent-just-for-code/independent-execution-payoff.png
new file mode 100644
index 0000000..fc55467
Binary files /dev/null and b/website/blog/media/2026-05-24-prds-arent-just-for-code/independent-execution-payoff.png differ
diff --git a/website/blog/media/2026-05-24-prds-arent-just-for-code/product-management-prd.png b/website/blog/media/2026-05-24-prds-arent-just-for-code/product-management-prd.png
new file mode 100644
index 0000000..6971bb7
Binary files /dev/null and b/website/blog/media/2026-05-24-prds-arent-just-for-code/product-management-prd.png differ
diff --git a/website/blog/media/2026-05-24-prds-arent-just-for-code/project-management-prd.png b/website/blog/media/2026-05-24-prds-arent-just-for-code/project-management-prd.png
new file mode 100644
index 0000000..502c67e
Binary files /dev/null and b/website/blog/media/2026-05-24-prds-arent-just-for-code/project-management-prd.png differ
diff --git a/website/blog/mermaid/2026-05-24-comparison-with-vs-without-prd.mmd b/website/blog/mermaid/2026-05-24-comparison-with-vs-without-prd.mmd
new file mode 100644
index 0000000..0e39fa4
--- /dev/null
+++ b/website/blog/mermaid/2026-05-24-comparison-with-vs-without-prd.mmd
@@ -0,0 +1,11 @@
+flowchart LR
+  subgraph L[Without PRD]
+    A[Vague task] --> B[Clarify]
+    B --> C[Rework]
+    C --> B
+  end
+  subgraph R[With PRD]
+    D[Intake answers] --> E[Clear requirements]
+    E --> F[Autonomous execution]
+    F --> G[Review]
+  end
diff --git a/website/blog/mermaid/2026-05-24-flow-vague-request-to-autonomy.mmd b/website/blog/mermaid/2026-05-24-flow-vague-request-to-autonomy.mmd
new file mode 100644
index 0000000..d552493
--- /dev/null
+++ b/website/blog/mermaid/2026-05-24-flow-vague-request-to-autonomy.mmd
@@ -0,0 +1,4 @@
+flowchart LR
+  A[Vague request] --> B[PRD intake]
+  B --> C[Specificity]
+  C --> D[Autonomous execution]