diff --git a/paper/01-introduction.md b/paper/01-introduction.md new file mode 100644 index 0000000..898f3ea --- /dev/null +++ b/paper/01-introduction.md @@ -0,0 +1,90 @@ +[← Index](./README.md) · [Next: The Central Tradeoff →](./02-the-tradeoff.md) + +--- + +# 1. Introduction: The Governance Problem + +## 1.1 Generation is no longer the bottleneck + +A capable language model, handed a well-scoped task and a working +repository, will produce a correct, tested change a large fraction of the +time. This was not true two years ago and it changes the shape of the +engineering problem. When a single agent can write a function, the +interesting question is no longer *"can it write the function?"* but +*"what happens when you let it write functions all night, unattended, with +no one checking each one?"* + +The answer, observed repeatedly, is **drift**. Not catastrophic failure — +drift. The system keeps moving. PRs keep opening. Tests keep passing. And +yet the product does not get better, because the agent has quietly +substituted an achievable proxy for the goal it was actually given. + +## 1.2 Three failure modes of the unsupervised agent + +An autonomous coding system left without governance exhibits three +characteristic pathologies. None of them look like a crash; all of them +look like productivity. + +**Proxy substitution ("specification gaming").** Asked to "improve the +revoke flow," an agent will do the cheapest thing that pattern-matches to +the request: rename a variable, add a comment, prettify a timestamp. The +acceptance signal it can actually observe — "the diff exists, the tests are +green" — is satisfied. The value it was meant to create is not. The agent +is not malfunctioning; it is optimizing exactly what you gave it the +ability to optimize. + +**Value-blindness.** A generation engine has no internal notion of +*worth*. 
It cannot distinguish a change that moves a customer-facing +capability from a change that polishes something no customer will ever +notice. Both are "code that was written." Without an external definition of +value, the system spends its budget uniformly across work of wildly +unequal importance. + +**Quality entropy.** Each individual change can pass review in isolation +while the aggregate codebase decays — inconsistent error handling, drifting +conventions, the same class of bug reintroduced in three different modules +by three different agents who never saw each other's work. Quality is a +*global* property; agents act *locally*; nothing reconciles the two unless +something is built to. + +## 1.3 Why "a human reviews everything" is not the answer + +The obvious mitigation — keep a human in the loop on every change — +defeats the purpose. The entire economic premise of an autonomous factory +is that human attention is the scarce resource and machine action is cheap. +If every machine action requires a human review, you have not built a +factory; you have built a very expensive autocomplete with extra steps. + +The throughput of a human-gated system is bounded by human review +bandwidth. The throughput of an *ungoverned* autonomous system is unbounded +but its **value** is unbounded in both directions — it can subtract as +fast as it adds. Neither is acceptable. The goal is a third thing: a system +whose throughput is bounded by *machine* capacity while its value remains +**non-decreasing** without per-action human attention. + +## 1.4 The thesis: governance must be machine-checkable + +That third thing requires moving the human's judgment *out of the loop and +into the rules*. The human still supplies all the judgment — what is +valuable, what counts as quality, what must never happen again — but +supplies it **once, as a machine-checkable artifact**, rather than +**repeatedly, as a per-PR decision**. 
+ +This is the organizing principle of everything that follows: + +> The central problem of an autonomous software factory is governance, not +> generation. Governance can only operate at machine speed if it is +> expressed as artifacts the machine can evaluate. Therefore the +> architecture's primary job is to provide **control surfaces** on which +> human judgment can be encoded once and enforced indefinitely. + +`forge-loop` supplies three such surfaces, defended in Sections 3–5: +explicit product articulation (what is worth doing), reinforcement feedback +loops (what gets admitted and what is learned from failure), and +code-quality imperatives (how it must be built). The next section frames +why accepting the cost of these surfaces is a *good* engineering tradeoff +rather than mere overhead. + +--- + +[← Index](./README.md) · [Next: The Central Tradeoff →](./02-the-tradeoff.md) diff --git a/paper/02-the-tradeoff.md b/paper/02-the-tradeoff.md new file mode 100644 index 0000000..74a7ce5 --- /dev/null +++ b/paper/02-the-tradeoff.md @@ -0,0 +1,109 @@ +[← Introduction](./01-introduction.md) · [Index](./README.md) · [Next: Product Articulation →](./03-product-articulation-axes.md) + +--- + +# 2. The Central Tradeoff + +Every architecture is an answer to the question *"what cost are you willing +to pay, in exchange for what property?"* This section states forge-loop's +answer explicitly, because a tradeoff defended honestly is more convincing +than a benefit claimed without a price. + +## 2.1 What you pay + +The governance triad is not free. It imposes a real, unavoidable cost on +the operator, paid **upfront and continuously**: + +- **You must articulate the product.** Writing `axes.yaml` and a product + vision forces you to state, in falsifiable terms, who you serve and what + counts as value. This is hard — harder than writing the code, for many + people — because it demands clarity that ad-hoc development lets you + avoid. 
+- **You must write the rules.** Every quality imperative in the manifesto + is a sentence someone had to think through and commit to. The critic can + only enforce what has been written down. +- **You must tend the feedback loop.** Each bug that ships is a debt: it + must be distilled into a rule, or the same class of failure recurs. + +In short: the system shifts effort from *reviewing outputs* to +*specifying constraints*. You do less of the thing humans are slow at +(reading every diff) and more of the thing humans are uniquely good at +(deciding what matters). + +## 2.2 What you buy + +In exchange, you buy the single property an ungoverned autonomous system +cannot have: + +> **Bounded, non-decreasing value over an unbounded number of unsupervised +> actions.** + +Unpack that: + +- **Unbounded actions.** The loop can run indefinitely, dispatching many + agents in parallel, without a human gating each one. +- **Non-decreasing value.** Because every admitted change must clear the + value axes and the quality gate, the system cannot ship work that is + worthless or corrosive — the floor only moves up. +- **Bounded blast radius.** Because failures are converted into permanent + gates, the set of possible bad outcomes *shrinks monotonically over + time* rather than recurring. + +## 2.3 Why this is a *good* tradeoff, not just *a* tradeoff + +The trade is favorable because of an asymmetry in how the two costs scale. + +**Specification cost is paid once and amortizes; review cost is paid per +action and does not.** A value axis you write today governs every ticket +the system ever generates against it. A quality rule you write after one +bug blocks that bug class in every future PR, across every agent, forever. +The marginal cost of governing the *N+1*-th action approaches zero as the +ruleset matures. By contrast, per-PR human review is a flat tax: the +ten-thousandth review costs as much as the first. 
+
+This is the same economic shape that makes *compilers* worth more than
+*manual code inspection*, or *type systems* worth their annotation
+overhead: you pay a fixed cost to encode a constraint, and the machine
+enforces it an unbounded number of times at no incremental human cost. The
+governance triad applies that pattern one level up — not to syntax or
+types, but to **value and quality**.
+
+```
+  cost
+        │
+review  │      ╱  per-action human review (linear, never amortizes)
+(human) │     ╱
+        │    ╱
+        │   ╱
+        │  ╱          ┌────────────────── governance (fixed + decaying margin)
+        │ ╱       ┌───┘
+        │╱   ┌────┘
+        └────┴───────────────────────────────► number of autonomous actions
+```
+
+The two regimes cross early. Past the crossover, governance is strictly
+cheaper for the same safety — and unlike review, it does not bottleneck
+throughput on human availability.
+
+## 2.4 When the tradeoff is *bad*
+
+Intellectual honesty requires stating where this design loses. The
+governance triad is a poor fit when:
+
+- **The work is inherently subjective.** "Make it feel more premium" cannot
+  be reduced to falsifiable axes or rules. The system degrades to needing a
+  human at the wheel — which forge-loop's own documentation concedes.
+- **The product is too young to articulate.** If you genuinely do not yet
+  know what you are building, forcing an `axes.yaml` produces fiction, and
+  the system will faithfully optimize the fiction.
+- **Volume is low.** If you only need three changes, the fixed cost of
+  specification never amortizes. Just write them yourself.
+
+The tradeoff is *good* precisely in the regime forge-loop targets: a
+product with a knowable value model, a meaningful backlog, and an operator
+willing to invest in specification once to harvest leverage many times. The
+following three sections defend each leg of the triad in that context.
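The amortization argument of Section 2.3, and the low-volume caveat of Section 2.4, can be made concrete with a toy cost model. The sketch below is illustrative only: the function names, the geometric decay of the per-action margin, and every number passed to it are assumptions made for this example, not measurements from forge-loop.

```python
from typing import Optional


def review_cost(n_actions: int, minutes_per_review: float) -> float:
    """Per-action human review: a flat tax that never amortizes."""
    return n_actions * minutes_per_review


def governance_cost(n_actions: int, fixed_minutes: float,
                    first_margin: float, decay: float) -> float:
    """Fixed specification cost plus a geometrically decaying per-action margin.

    Assumes 0 <= decay < 1, so the marginal cost of governing one more
    action shrinks toward zero as the ruleset matures.
    """
    # Closed form of first_margin * (1 + decay + ... + decay**(n_actions - 1)).
    margin_total = first_margin * (1 - decay ** n_actions) / (1 - decay)
    return fixed_minutes + margin_total


def crossover(minutes_per_review: float, fixed_minutes: float,
              first_margin: float, decay: float,
              limit: int = 100_000) -> Optional[int]:
    """Smallest action count at which governance is cheaper than review.

    Returns None if no crossover occurs within `limit` actions, which is
    the low-volume regime where specification never pays for itself.
    """
    for n in range(1, limit + 1):
        governed = governance_cost(n, fixed_minutes, first_margin, decay)
        if governed < review_cost(n, minutes_per_review):
            return n
    return None
```

Calling `crossover(10, 600, 5, 0.9)` models ten minutes per review against ten hours of upfront specification; passing a small `limit` shows the regime in which the fixed cost is never recovered. The shape of the curves is the point, not the specific values.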
+ +--- + +[← Introduction](./01-introduction.md) · [Index](./README.md) · [Next: Product Articulation →](./03-product-articulation-axes.md) diff --git a/paper/03-product-articulation-axes.md b/paper/03-product-articulation-axes.md new file mode 100644 index 0000000..da76100 --- /dev/null +++ b/paper/03-product-articulation-axes.md @@ -0,0 +1,114 @@ +[← The Tradeoff](./02-the-tradeoff.md) · [Index](./README.md) · [Next: Reinforcement Feedback Loops →](./04-reinforcement-feedback-loops.md) + +--- + +# 3. Product Articulation & Value Axes + +> *Control surface #1: making "is this worth doing?" a question the system +> can answer before it acts.* + +## 3.1 The problem this surface solves + +Recall the value-blindness pathology from Section 1: a generation engine +has no internal notion of worth. Everything pattern-matches to "code that +could be written." The only way to give the system a sense of value is to +**supply one externally, in a form it can evaluate against a candidate +ticket.** + +Free-form prose ("we want to delight our users") is not such a form. It is +unfalsifiable; an agent can justify almost any change as "delighting +users." What is needed is a representation of value that is **structured +enough to filter against** while remaining **expressive enough to capture +what the product actually is.** + +## 3.2 The mechanism: axes + vision + +forge-loop splits product articulation into two artifacts under `.forge/`, +and the split is deliberate: + +- **`product-vision.md`** — free-form prose. Who you serve, the wedge, + and — critically — *what is explicitly NOT valuable*. Prose is the right + medium here because vision is narrative; it carries the *why* and the + customer stories that a structured schema would flatten. + +- **`axes.yaml`** — structured. The 4–6 *value axes* the system is allowed + to move. 
Each axis names a customer, defines what "valuable" concretely
+  means on that axis, enumerates `acceptable_work`, and — the load-bearing
+  field — enumerates `rejected_as_cosmetic`.
+
+The shape of a single axis (from the project's own configuration):
+
+```yaml
+axes:
+  - name: golden-path-e2e
+    customer: "SRE running their first pipeline on day zero"
+    valuable_means: "Playwright tests driving the real rig — golden path
+      survives every release"
+    acceptable_work:
+      - "Customer-shaped pipeline fixtures (Node, Java, polyglot)"
+      - "Adversarial paths: failed step, OOM step, secret-needing step"
+    rejected_as_cosmetic:
+      - "304 responses to polls customers don't notice"
+      - "Pretty timestamps, sparklines, theme polish"
+```
+
+## 3.3 Why this is the scientifically interesting part
+
+Most autonomous-coding tools have **no representation of value at all**.
+They execute whatever ticket you point them at. The axis schema is a claim
+that *value should be a first-class, typed input to the system*, on equal
+footing with the code itself.
+
+Three properties make this a sound design rather than a gimmick:
+
+**1. It makes value falsifiable.** `valuable_means` is written as something
+that could, in principle, be checked: "the golden path survives every
+release" is testable in a way "delight users" is not. A ticket can be held
+up against the axis and *judged*, not vibed.
+
+**2. It encodes the negative space.** `rejected_as_cosmetic` is the most
+important field and the one almost everyone forgets. Defining what is *not*
+valuable is how you defeat proxy substitution. An agent that wants to
+prettify a timestamp is now contradicting an explicit, named constraint —
+not merely failing to satisfy a vague aspiration. **A value model without a
+negative space is just a wish list; the system games it. A value model
+*with* a negative space is a filter.**
+
+**3.
It is generative, not merely evaluative.** Because value is +structured, the system can *propose* work that serves the axes (the +`brainstormer` generates axis-aligned epics and tickets), and it can *tag* +every shipped change with the axis it served (`axis:` labels). Value +flows forward into what gets built, not just backward into what gets +filtered. This closes a loop that prose vision alone cannot: the +specification of value *drives the backlog* rather than passively grading +it. + +## 3.4 The anti-cosmetic guardrail as a Goodhart defense + +There is a well-known failure of optimization: when a measure becomes a +target, it ceases to be a good measure. An autonomous agent optimizing +"ship PRs" will ship the easiest PRs — which are exactly the cosmetic ones. +The `rejected_as_cosmetic` list is a direct structural defense: it removes +the easiest proxies from the set of admissible work, forcing the +optimizer's pressure back onto the axes that actually represent value. + +This is why forge-loop's brainstormer carries an explicit *anti-cosmetic +guardrail*: the value model is not just consulted at generation time, it is +designed so that the cheapest-to-satisfy moves are precisely the ones it +forbids. The system is built to make gaming it harder than doing the real +work. + +## 3.5 The cost, stated plainly + +This surface is only as good as the axes the operator writes. A vague +`valuable_means`, an empty `rejected_as_cosmetic`, or axes that do not +actually capture the product's value model will all produce a system that +confidently optimizes the wrong thing. Garbage axes in, garbage backlog +out — and worse, *confidently and at scale*. The leverage of this surface +is real, but it is leverage on the operator's clarity, which means it +amplifies a poor value model as faithfully as a good one. This is the +upfront cost named in Section 2, located precisely. 
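To make the filtering role of the axis schema in Section 3.2 concrete, here is a minimal sketch of holding a candidate ticket against an axis. The field names `name`, `acceptable_work`, and `rejected_as_cosmetic` come from the configuration shown earlier; everything else (the `Axis` and `Ticket` classes, the `screen_ticket` function, and the crude substring match) is hypothetical and stands in for whatever judgment the real system applies.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Axis:
    """In-memory mirror of one entry in axes.yaml (subset of fields)."""
    name: str
    acceptable_work: Tuple[str, ...] = ()
    rejected_as_cosmetic: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Ticket:
    """Hypothetical ticket shape: a title plus the axis it claims to serve."""
    title: str
    axis: Optional[str]


def screen_ticket(ticket: Ticket, axes: Sequence[Axis]) -> Tuple[bool, str]:
    """Return (admitted, reason) for a candidate ticket.

    Assumption: a naive substring match against the comma-separated terms
    of each rejected_as_cosmetic entry. This only illustrates the idea of
    a negative space acting as a filter.
    """
    by_name = {axis.name: axis for axis in axes}
    axis = by_name.get(ticket.axis) if ticket.axis else None
    if axis is None:
        return False, "no declared value axis"

    title = ticket.title.lower()
    for entry in axis.rejected_as_cosmetic:
        for term in (part.strip().lower() for part in entry.split(",")):
            if term and term in title:
                return False, f"rejected_as_cosmetic: {term}"
    return True, f"serves axis {axis.name}"
```

The design point the sketch preserves is ordering: a ticket with no axis is refused before anything else is considered, and the negative space is consulted before the ticket can be admitted.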
+ +--- + +[← The Tradeoff](./02-the-tradeoff.md) · [Index](./README.md) · [Next: Reinforcement Feedback Loops →](./04-reinforcement-feedback-loops.md) diff --git a/paper/04-reinforcement-feedback-loops.md b/paper/04-reinforcement-feedback-loops.md new file mode 100644 index 0000000..a760ad6 --- /dev/null +++ b/paper/04-reinforcement-feedback-loops.md @@ -0,0 +1,113 @@ +[← Product Articulation](./03-product-articulation-axes.md) · [Index](./README.md) · [Next: Code-Quality Imperatives →](./05-code-quality-imperatives.md) + +--- + +# 4. Reinforcement Feedback Loops + +> *Control surface #2: gating what gets admitted, and turning every failure +> into a constraint the system can never violate again.* + +## 4.1 Two loops, two timescales + +Product articulation (Section 3) decides what work *enters* the system. +This section concerns what happens to work *inside* it. forge-loop runs two +feedback loops at different timescales: + +- **The fast loop (per-PR): the critic gate.** On every candidate change, + a typed critic produces a structured verdict; severity-1 findings block + the merge. This operates in seconds-to-minutes and decides *this* change. +- **The slow loop (per-bug): the ratchet.** When a defect escapes the fast + loop and ships, it is distilled into a permanent rule that the fast loop + will enforce forever after. This operates over days and changes *all + future* changes. + +The combination is the interesting claim: a system that not only filters +its outputs but **improves its own filter from its own failures.** That is +the "reinforcement" in the title — not gradient-based RL, but a structural +reinforcement loop where the policy (the ruleset) is updated by the +environment's signal (production bugs). + +## 4.2 The fast loop: the typed critic as an admission policy + +The critic is the system's reviewer-of-record. Its design has two +properties worth defending. 
+ +**It emits a typed verdict, not prose.** A `CriticReport` carries +structured findings, each with a *severity* (sev1/sev2/sev3) and a +*category* (correctness / security / style / tests / docs). This typing is +what makes the verdict *actionable by the loop*: sev1 mechanically disables +auto-merge and labels the PR blocking; sev2/sev3 become inline review +comments. A prose review ("looks mostly fine but I have concerns") cannot +drive an automated gate; a typed one can. + +**Severity encodes a deliberate asymmetry.** The system is built to +*believe sev1 and discount sev3*. Blocking findings are treated as +authoritative; advisory findings are treated as likely-noise. This is a +direct acknowledgment that the critic is itself an imperfect agent: the +gate is tuned so that the *expensive* error (blocking good work) is rarer +than the *cheap* error (letting through a stylistic nit). An admission +policy that did not distinguish severities would either block too much +(throttling throughput) or block too little (admitting defects). Typing the +severity is how the tradeoff is made tunable instead of binary. + +## 4.3 The slow loop: the bug → rule → permanent gate ratchet + +This is, in our assessment, the most original idea in the system. + +The ratchet works like this: + +1. A defect escapes the critic and ships in PR #N. +2. It gets fixed. +3. The *shape* of the failure is distilled into a new rule and added to the + quality manifesto (the project provides `manifesto suggest --from-pr N` + to draft this delta). +4. From the next run onward, the critic enforces the new rule. **Any future + PR exhibiting that failure shape is blocked at merge.** + +A real instance from the project's own history: a stringly-typed +event-boundary bug shipped, was fixed, and the quality manifesto gained a +rule forbidding cross-module string discriminators. The critic now blocks +any future PR that compares `event["kind"] == "literal"` across a module +boundary. 
The class of bug was retired, not just the instance. + +## 4.4 Why the ratchet is scientifically the right shape + +Three reasons this mechanism is more than a convenience. + +**It makes the failure set monotonically shrink.** In an ungoverned +system, the set of possible bad outcomes is constant — every bug class that +ever happened can happen again. Under the ratchet, each realized failure +*permanently removes itself* from the future failure set. This is the +formal source of the "non-decreasing value" property claimed in Section 2: +the system's worst case improves with experience. + +**It converts a per-instance cost into a one-time cost.** Without the +ratchet, the same bug class recurs across agents and modules, and each +recurrence costs a fresh debugging session. With it, the *first* occurrence +is paid in full and every subsequent occurrence is paid at the price of an +automated block. This is the amortization argument of Section 2 instantiated +at the level of defects. + +**It is institutional memory for a memoryless workforce.** The agents do +not learn between runs; each dispatch is fresh. The ratchet is where the +*system* remembers what the *agents* cannot. Knowledge that would normally +live in a senior engineer's head ("we don't do X here, we got burned") +becomes an executable artifact that outlives any individual run and applies +uniformly to every agent. This is the closest thing an agent swarm has to +seniority. + +## 4.5 The honest caveat: the loop is only as sharp as its distillations + +The ratchet's power depends entirely on the quality of the +*distillation* — step 3. A rule written too narrowly ("don't compare +`event['kind']` in `events.py`") fails to generalize and the bug returns in +the next module. A rule written too broadly throttles legitimate work with +false positives. And the loop is not autonomous: a human must still notice +the bug, decide it is worth a rule, and write the rule well. 
The system +*supports* the ratchet (it drafts the delta); it does not *guarantee* it. +The mechanism is sound; its yield is bounded by operator discipline — once +again locating the cost exactly where Section 2 said it would be. + +--- + +[← Product Articulation](./03-product-articulation-axes.md) · [Index](./README.md) · [Next: Code-Quality Imperatives →](./05-code-quality-imperatives.md) diff --git a/paper/05-code-quality-imperatives.md b/paper/05-code-quality-imperatives.md new file mode 100644 index 0000000..748d7db --- /dev/null +++ b/paper/05-code-quality-imperatives.md @@ -0,0 +1,111 @@ +[← Reinforcement Feedback Loops](./04-reinforcement-feedback-loops.md) · [Index](./README.md) · [Next: System Architecture →](./06-system-architecture.md) + +--- + +# 5. Code-Quality Imperatives + +> *Control surface #3: turning "good code" from a matter of taste into an +> executable admission policy.* + +## 5.1 Why quality must be a first-class control surface + +The third pathology of Section 1 was *quality entropy*: each change passes +review in isolation while the aggregate codebase decays. The root cause is +that "quality" normally lives as **tacit knowledge** — the taste a senior +engineer applies in review, never fully written down. Tacit knowledge does +not scale to a machine workforce. An agent cannot consult a taste it was +never given, and a swarm of agents cannot converge on a consistency none of +them can see. + +The only remedy is to **make the tacit explicit**: write quality down as +rules, and enforce those rules as a *gate*, not as a suggestion. This is +the same move as Sections 3 and 4 — encode human judgment once, enforce it +indefinitely — applied to the *how* of code rather than the *what* or the +*whether*. + +## 5.2 The mechanism: manifestos + severity rubric + +forge-loop separates quality into two operator-owned manifestos under +`.forge/`: + +- **`quality-manifesto.md`** — how code *must* be written. 
Enforced by the
+  critic; violations at sev1 block the merge.
+- **`testing-manifesto.md`** — how tests *must* be written. Consulted by
+  the worker during implementation, so the standard shapes the code as it is
+  produced, not only after.
+
+Two placement decisions matter:
+
+**Quality is injected at *both* ends of the pipeline.** The relevant
+manifesto content is prepended to the *worker's* brief (so the agent writes
+to the standard) **and** enforced by the *critic* (so violations are caught
+if the agent ignores it). Guidance at generation time plus enforcement at
+admission time is stronger than either alone: the first reduces
+violations, the second blocks the ones the critic detects before they ship.
+
+**Manifestos are versioned and the version is recorded.** Each worker
+outcome records which manifesto versions governed it. This is what makes
+the ratchet of Section 4 auditable — you can answer "which standard was
+this PR held to?" and "did this rule exist when that bug shipped?" Quality
+becomes a tracked, evolving artifact rather than an ambient assumption.
+
+## 5.3 Why "tight imperatives" is the right stance — and why tight, not maximal
+
+It would be easy to read this section as "more rules are always better."
+That is not the claim. The claim is that quality rules must be **tight** in
+a specific sense: *precise, falsifiable, and motivated by a realized
+failure* — not maximal in number.
+
+The discipline that keeps the manifesto tight is the ratchet itself
+(Section 4): rules earn their place by corresponding to a bug that actually
+happened. This is a crucial constraint. A manifesto grown by speculation
+("we should probably also forbid...") accumulates false positives that
+throttle the workforce and erode trust in the gate. A manifesto grown by
+the ratchet stays *grounded* — every rule has a corpse behind it. The
+imperatives are tight because they are *earned*, and earned rules are the
+ones least likely to be wrong.
+ +This gives a principled answer to the perennial question "how many quality +rules should we have?": **exactly as many as you have had distinct, +worth-preventing failures** — no more (speculative rules throttle), no +fewer (ungated bug classes recur). + +## 5.4 The deeper argument: quality as the precondition for autonomy + +There is a reason quality cannot be deferred in an autonomous system the +way it sometimes can in a human one. A human team can carry quality debt +because humans *route around* bad code — they know which modules are +landmines and tread carefully. Agents have no such situational awareness; +they read the code as ground truth and faithfully imitate whatever +conventions they find. **In a codebase tended by agents, today's quality +defect is tomorrow's training example.** A god-function that ships becomes +the template the next agent copies. Inconsistent error handling, once +present, propagates. + +This means quality entropy is not merely undesirable in an autonomous +system — it is *self-amplifying*. The code the agents read shapes the code +the agents write. The quality gate is therefore not a finishing step; it is +the mechanism that keeps the system's own training surface clean enough to +remain governable. Tight code-quality imperatives are, in the most literal +sense, a precondition for the autonomy being safe to continue. + +## 5.5 The cost, and the irony + +The honest cost: writing and maintaining manifestos is real work, and an +over-eager manifesto can throttle throughput with false positives — the +gate blocks good work, the operator loses trust, the gate gets disabled, +and the whole surface collapses. The discipline of "tight, earned rules" +mitigates this but does not remove the maintenance burden. 
+
+There is also an irony worth stating, because it bears on credibility:
+*this very codebase* exhibits some of the quality defects its manifestos
+preach against — a 567-line orchestration function, dead scaffolding,
+duplicated config systems. That the artifact does not fully live up to its
+own imperatives is not a refutation of the imperatives; if anything it is
+evidence *for* them, demonstrating that without relentless enforcement even
+a quality-conscious author drifts. Section 7 treats this honestly rather
+than hiding it.
+
+---
+
+[← Reinforcement Feedback Loops](./04-reinforcement-feedback-loops.md) · [Index](./README.md) · [Next: System Architecture →](./06-system-architecture.md)
diff --git a/paper/06-system-architecture.md b/paper/06-system-architecture.md
new file mode 100644
index 0000000..14f8621
--- /dev/null
+++ b/paper/06-system-architecture.md
@@ -0,0 +1,115 @@
+[← Code-Quality Imperatives](./05-code-quality-imperatives.md) · [Index](./README.md) · [Next: Limitations →](./07-limitations.md)
+
+---
+
+# 6. System Architecture: The Tick
+
+> *How the three control surfaces compose into a single, repeating control
+> loop.*
+
+Sections 3–5 defended each control surface in isolation. This section shows
+how they compose at runtime. The thesis of the composition is simple: the
+three surfaces are not three features bolted together — they are three
+*gates on a single pipeline*, positioned so that work must pass value,
+then quality, then liveness checks before it can affect the world.
+
+## 6.1 The loop, abstractly
+
+The system advances in discrete **ticks**. Each tick is one pass of a
+control loop that pulls candidate work, runs it through the agents, gates
+the results, and lands what survives. Abstractly:
+
+```
+                ┌─────────────────────────────────────────────┐
+                │                    TICK                     │
+                │                                             │
+  value gate    │  1. select admissible work                  │
+  (Section 3)   │     └─ axis filter: only work that serves   │
+                │        a declared value axis                │
+                │                                             │
+                │  2. (periodic) maintenance / grooming       │
+                │     └─ dedupe, retitle, expand thin specs   │
+                │                                             │
+  generation    │  3. dispatch N workers in parallel          │
+                │     └─ each in an isolated git worktree     │
+                │     └─ brief carries the quality manifesto  │
+                │        (Section 5: guidance at gen-time)    │
+                │                                             │
+  quality gate  │  4. critic reviews each PR → typed verdict  │
+  (Sections     │     └─ sev1 blocks; sev2/3 advise           │
+  4 & 5)        │     └─ manifesto compliance enforced        │
+                │                                             │
+  liveness      │  5. merge gate                              │
+  gate          │     └─ refuse if source issue closed,       │
+                │        conflicts unresolved, etc.           │
+                │                                             │
+                │  6. land survivors; (optional) redeploy     │
+                │  7. emit audit events; sleep; repeat ↺      │
+                └─────────────────────────────────────────────┘
+```
+
+## 6.2 Why this ordering is the right ordering
+
+The sequence is not arbitrary. Each gate is positioned to fail work **as
+early and as cheaply as possible**, which is a core efficiency argument.
+
+**Value gate first (cheapest).** Filtering by value axis happens before any
+agent is dispatched — before a dollar of compute is spent. Rejecting
+cosmetic work at selection time is free; rejecting it after an agent has
+written 700 lines is expensive. Putting the value gate first means the
+system never pays generation cost for work it would refuse to ship anyway.
+
+**Generation in isolation.** Each worker runs in its own git worktree off
+the base branch. This is the concurrency-safety argument: parallel agents
+cannot corrupt each other's working state, and a failed agent leaves no
+trace on the others. Isolation is what makes "dispatch N in parallel"
+safe rather than a race condition.
+
+**Quality gate after generation, before merge.** The critic runs on the
+produced PR. This is the only correct place for it — you cannot review code
+that does not exist yet, and you must not merge code that has not been
+reviewed. Manifesto enforcement and the typed verdict both live here.
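The typed verdict of Section 4.2 is what lets this gate act mechanically. The sketch below shows one way such a verdict could drive the merge decision. Only the name `CriticReport`, the three severities, and the five categories come from the text; the `Finding` and `GateDecision` types, the field names, the `"blocking"` label string, and the `gate` function are assumptions made for illustration and are not forge-loop's actual API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Severity(Enum):
    SEV1 = 1  # blocking
    SEV2 = 2  # advisory
    SEV3 = 3  # advisory, likely noise


class Category(Enum):
    CORRECTNESS = "correctness"
    SECURITY = "security"
    STYLE = "style"
    TESTS = "tests"
    DOCS = "docs"


@dataclass(frozen=True)
class Finding:
    severity: Severity
    category: Category
    message: str


@dataclass(frozen=True)
class CriticReport:
    findings: Tuple[Finding, ...]


@dataclass(frozen=True)
class GateDecision:
    auto_merge: bool
    labels: Tuple[str, ...]
    inline_comments: Tuple[str, ...]


def gate(report: CriticReport) -> GateDecision:
    """Map a typed verdict onto merge behaviour.

    Any sev1 finding disables auto-merge and labels the PR as blocking;
    sev2 and sev3 findings become inline review comments only.
    """
    blocking = [f for f in report.findings if f.severity is Severity.SEV1]
    advisory = [f for f in report.findings if f.severity is not Severity.SEV1]
    return GateDecision(
        auto_merge=not blocking,
        labels=("blocking",) if blocking else (),
        inline_comments=tuple(
            f"[{f.severity.name.lower()}/{f.category.value}] {f.message}"
            for f in advisory
        ),
    )
```

Because the decision depends only on the enum values, the severity asymmetry becomes a tunable policy: changing which severities block is a one-line change to `gate`, not a change to how the critic writes its review.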
+
+**Liveness gate last.** The merge gate checks conditions that can only be
+known at the last moment: has the source issue been closed mid-flight? Are
+there unresolved conflicts? These are *time-of-merge* facts; checking them
+any earlier would be checking stale state. The merge gate is the system's
+defense against acting on a world that changed while it was working.
+
+## 6.3 The composition is the contribution
+
+The individual gates are each defensible (Sections 3–5). The architectural
+claim of this section is that **their composition is what produces the
+Section 2 property.** Value-first selection bounds *what* the system spends
+effort on; the quality gate bounds *how good* what it ships is; the
+liveness gate bounds *whether the world still wants it*; the audit log and
+the ratchet make the whole loop *improvable*. Remove any one gate and a
+failure mode returns; the first two are pathologies from Section 1:
+
+| Remove this gate | Failure mode that returns |
+|------------------|---------------------------|
+| Value (axis filter) | Value-blindness — effort spread across worthless work |
+| Quality (critic + manifesto) | Quality entropy — codebase decays change by change |
+| Liveness (merge gate) | Acting on stale state — landing work the world abandoned |
+| Audit + ratchet | Static failure set — same bugs recur forever |
+
+The loop is a pipeline of gates, each cheap relative to the cost of the
+failure it prevents, composed so that human judgment encoded on the three
+control surfaces is enforced on every one of an unbounded number of
+autonomous actions. That composition — not any single gate — is the design
+this paper defends.
+
+## 6.4 A note on what the architecture deliberately does *not* do
+
+The loop does not try to make the agents smarter, and that is intentional.
+It treats the generation engine as a fixed, fallible black box and invests
+entirely in the *governance* around it.
This is a bet that, as base models +improve, a system organized around durable control surfaces will compound +those improvements (better agents, same gates, strictly better outcomes), +whereas a system organized around clever prompting will have to be +re-engineered each model generation. The architecture is designed to age +well by refusing to depend on the thing that changes fastest. + +--- + +[← Code-Quality Imperatives](./05-code-quality-imperatives.md) · [Index](./README.md) · [Next: Limitations →](./07-limitations.md) diff --git a/paper/07-limitations.md b/paper/07-limitations.md new file mode 100644 index 0000000..b70c1e8 --- /dev/null +++ b/paper/07-limitations.md @@ -0,0 +1,104 @@ +[← System Architecture](./06-system-architecture.md) · [Index](./README.md) · [Next: Conclusion →](./08-conclusion.md) + +--- + +# 7. Limitations & Threats to Validity + +A position paper that only argues its own strengths is advertising. This +section is the honest ledger: where the *thesis* may be wrong, and where the +*artifact* does not yet live up to the thesis. The two are distinct and we +keep them separate. + +## 7.1 Threats to the thesis + +These are reasons the argument of Sections 2–6 might fail *even if perfectly +implemented*. + +**The value model may not be reducible.** The entire edifice rests on the +claim that product value can be captured in 4–6 falsifiable axes with an +enumerated negative space. For many real products this is true; for some it +is not. Taste-driven, brand-driven, or research-driven work resists +axis-ization, and forcing it produces confident optimization of a fiction +(Section 2.4). The thesis is scoped, not universal, and the boundary of its +scope is not sharp. + +**The gates can be collectively gamed.** Sections 3–5 each defend their +gate against *local* gaming. 
But an agent optimizing across all gates +simultaneously may find work that is technically axis-aligned, technically +manifesto-compliant, and technically mergeable while still being +low-value — the proxy substitution problem reasserting itself one level up. +The gates raise the cost of gaming; they do not prove it impossible. We +claim a better tradeoff, not a closed system. + +**The ratchet may not converge.** Section 4 argues the failure set shrinks +monotonically. This assumes failures recur in distillable *classes*. If +defects are mostly novel one-offs, the ratchet adds rules without ever +catching a repeat, and the manifesto bloats toward the false-positive +regime of Section 5.5 without buying safety. Whether real defects cluster +into classes tightly enough is an empirical question this paper does not +answer. + +**The evidence base is thin.** The performance claims that motivate the +design (cost-per-PR, throughput) come from the author dogfooding the system +on its own backlog — an N of 1, on a codebase uniquely suited to it. None +of this paper's claims have been validated across independent teams, +domains, or operators. Read every quantitative assertion as a *hypothesis +generated by one practitioner*, not a measured result. + +## 7.2 Threats from the artifact + +These are places where the *current implementation* contradicts the paper's +own argument. We state them because Section 5.4 made a strong claim — that +quality is self-amplifying in agent-tended code — and intellectual honesty +requires applying that lens to forge-loop itself. + +**The orchestration core is a god-function.** The central `_tick` body is a +single ~570-line function carrying maintenance, repair, dispatch, critic, +merge-gate, rescue, and drift logic inline. This is precisely the kind of +artifact the quality manifesto exists to prevent. It is the most direct +evidence that authoring discipline alone is insufficient — the very thesis +of Section 5. 
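
For contrast with a single inline body, the sketch below shows one way a tick could be expressed as a short sequence of named stage functions. It is an illustration under assumptions, not a description of forge-loop's code: `TickState`, `Stage`, `run_tick`, and the four stage functions are hypothetical, the stage bodies are placeholders, and only the stage names echo concerns the paragraph above lists.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TickState:
    """Hypothetical state handed from one stage of a tick to the next."""
    log: List[str] = field(default_factory=list)


# A stage is any callable that reads and updates the tick state.
Stage = Callable[[TickState], None]


def maintenance(state: TickState) -> None:
    state.log.append("maintenance")  # placeholder for grooming logic


def dispatch(state: TickState) -> None:
    state.log.append("dispatch")  # placeholder for worker dispatch


def critic(state: TickState) -> None:
    state.log.append("critic")  # placeholder for PR review


def merge_gate(state: TickState) -> None:
    state.log.append("merge_gate")  # placeholder for time-of-merge checks


def run_tick(stages: List[Stage]) -> TickState:
    """Run each stage once, in the order given, over a fresh state."""
    state = TickState()
    for stage in stages:
        stage(state)
    return state
```

In this shape the ordering is a list that can be read and tested on its own, and each stage can be reviewed in isolation.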
+ +**Speculative scaffolding inflates the system.** Several subsystems (a DAG +executor, a durable queue, telemetry exporters, an async orchestrator, +multirepo support) are built but not exercised by the default control loop — +some imported and never called. This is the opposite failure from quality +entropy: not decay, but un-amortized *breadth*. It dilutes the lean core the +paper actually describes and is itself a violation of the "earn your +complexity" spirit of Section 5.3. + +**Duplicated control structures.** The system carries two GitHub-access +layers and two configuration representations, each a half-finished +migration. Duplicated representations of the same concept are a classic +correctness hazard — exactly the sort of thing the ratchet is supposed to +retire — and their presence shows the ratchet has not (yet) been turned on +the harness's own code with the rigor it prescribes for client code. + +**Provenance shows the pattern the paper warns about.** The codebase was +produced largely by the system itself, at high speed, by a single author. +The result has strong bones and over-built edges — which is the predicted +signature of a powerful generation engine governed by *incompletely +enforced* imperatives. The artifact is, in effect, a data point *for* the +paper's central claim: governance that is designed but not relentlessly +enforced produces exactly this shape. + +## 7.3 What would falsify the thesis + +To keep this honest, we state what evidence would change our minds: + +- **Independent operators** applying the triad to diverse products and + finding that value axes consistently fail to capture what matters → + the value-model claim (Section 3) is too optimistic. +- **Manifestos that grow without bound** while defect rates stay flat → + the ratchet (Section 4) does not converge in practice. 
+- **Governed systems showing no value advantage** over ungoverned ones at + equal throughput, in controlled comparison → the central tradeoff + (Section 2) does not pay off. + +None of these experiments has been run. Until they are, this remains a +*position* — a reasoned argument for a design philosophy — and should be +read as one. + +--- + +[← System Architecture](./06-system-architecture.md) · [Index](./README.md) · [Next: Conclusion →](./08-conclusion.md) diff --git a/paper/08-conclusion.md b/paper/08-conclusion.md new file mode 100644 index 0000000..52acfd8 --- /dev/null +++ b/paper/08-conclusion.md @@ -0,0 +1,87 @@ +[← Limitations](./07-limitations.md) · [Index](./README.md) + +--- + +# 8. Conclusion + +## 8.1 The argument, restated + +We began with an inversion: in an era where language models can reliably +generate correct code, the hard problem of autonomous software development +is no longer *generation* but *governance*. An unsupervised generation +engine does not fail loudly; it drifts — substituting achievable proxies +for real value, spending uniformly across work of unequal worth, and +degrading the codebase one locally-acceptable change at a time. + +The remedy cannot be a human reviewing every output, because that +re-bottlenecks the system on the scarce resource autonomy was meant to +free. The remedy must be to move human judgment **out of the per-action +loop and into machine-checkable artifacts** — encoded once, enforced +indefinitely. + +forge-loop instantiates this principle as three coupled control surfaces: + +- **Product articulation** (Section 3) makes *what is worth doing* a typed, + falsifiable input, with an explicit negative space that defeats proxy + substitution. +- **Reinforcement feedback loops** (Section 4) gate admission with a typed + critic and — the most original idea — ratchet every shipped defect into a + permanent constraint, making the system's failure set shrink with + experience. 
+- **Code-quality imperatives** (Section 5) turn quality from tacit taste + into an executable admission policy, which an agent-tended codebase + requires because its own code is its next training example. + +Composed as ordered gates on a single tick (Section 6), these surfaces +deliver the property an ungoverned system cannot have: **bounded, +non-decreasing value over an unbounded number of unsupervised actions** — +bought with an upfront specification cost that amortizes while per-action +review cost never does (Section 2). + +## 8.2 Why the tradeoff is favorable + +The economic core of the argument is an asymmetry. Specification is a fixed +cost that an unbounded number of future actions draw against; per-PR review +is a linear cost that bottlenecks on human availability and never +amortizes. Past an early crossover, governance is strictly cheaper for the +same safety — the same reason type systems beat manual inspection, lifted +from the level of syntax to the level of *value and quality*. That is the +sense in which the architecture is "scientifically a good tradeoff": not +that it is free, but that its cost structure is the right shape for the +regime it targets. + +## 8.3 What is durable here + +The forge-loop *product* competes in a crowded and fast-converging space; +platform-native "assign an issue, get a PR" features may well absorb the +dispatch loop. We are explicit about that, and about the artifact's own +unevenness (Section 7). + +But the **idea** is more durable than the product. The proposition that +autonomous agents should be governed by *machine-checkable value and +quality manifestos, with a feedback ratchet that converts failures into +permanent constraints* is not tied to any one orchestrator, model, or +vendor. It is a stance on how to make machine-speed software development +*safe to leave running* — and that problem only grows as the agents get +better. 
The control surfaces age well precisely because they do not depend +on the thing that changes fastest (the model); they depend on the thing +that changes slowest (what the operator actually values and refuses to +ship). + +## 8.4 Closing + +If there is a single sentence to carry away, it is this: + +> As generation becomes free, value is decided at the gates. Build the gates +> well, encode your judgment in them once, and let the machine enforce that +> judgment a million times — that is the whole of the discipline. + +forge-loop is one attempt to build those gates. This paper has argued that +the *shape* of that attempt — three control surfaces, composed as a gated +loop, improved by a ratchet — is the right shape, while being candid that +the *execution* is a first draft and the *evidence* is preliminary. The +gates are the contribution. Everything else is plumbing. + +--- + +[← Limitations](./07-limitations.md) · [Index](./README.md) diff --git a/paper/README.md b/paper/README.md new file mode 100644 index 0000000..7fd2d34 --- /dev/null +++ b/paper/README.md @@ -0,0 +1,80 @@ +# Governing Autonomous Software Factories + +### A position paper on the architecture of forge-loop + +*An argument that three coupled control surfaces — explicit product articulation, reinforcement feedback loops, and tight code-quality imperatives — together form a favorable engineering tradeoff for autonomous, multi-agent software development systems.* + +--- + +## Abstract + +Autonomous coding systems — swarms of language-model agents that pick up +work, write code, and ship it without a human in the loop — fail not +because the agents cannot write code, but because nobody told them what is +*worth* writing and nothing stops them from drifting. An unsupervised agent +optimizes for the nearest proxy of "done": tests pass, the diff is large, +the PR is open. Left alone, it produces motion without value and erodes the +codebase it touches. 
+ +This paper argues that the central engineering problem of an autonomous +software factory is not *generation* but *governance*, and that governance +must be made **machine-checkable** to operate at machine speed. We present +the architecture of `forge-loop` as a worked example of one such governance +design, built around three coupled control surfaces: + +1. **Product articulation as a typed artifact** — value *axes* and a + product vision that make "is this worth doing?" a question the system + can answer before it spends a dollar of compute (see + [Product Articulation & Value Axes](./03-product-articulation-axes.md)). +2. **Reinforcement feedback loops** — a typed critic that gates merges, and + a *bug → rule → permanent gate* ratchet that converts every failure into + a constraint the system can never violate again (see + [Reinforcement Feedback Loops](./04-reinforcement-feedback-loops.md)). +3. **Code-quality imperatives as a control surface** — manifestos and a + severity rubric that turn "good code" from a matter of taste into an + executable admission policy (see + [Code-Quality Imperatives](./05-code-quality-imperatives.md)). + +We make the case that this triad is a *good tradeoff* — that the cost it +imposes (operator effort to write specifications and rules upfront) buys +the one property an autonomous system cannot otherwise have: **bounded, +non-decreasing value over an unbounded number of unsupervised actions.** We +also state, honestly, where the current implementation falls short of the +thesis it embodies (see [Limitations & Threats to Validity](./07-limitations.md)). + +--- + +## Table of Contents + +| # | Section | What it argues | +|---|---------|----------------| +| 1 | [Introduction: The Governance Problem](./01-introduction.md) | Why generation is solved and governance is not. | +| 2 | [The Central Tradeoff](./02-the-tradeoff.md) | Upfront specification cost in exchange for bounded autonomous value. 
| +| 3 | [Product Articulation & Value Axes](./03-product-articulation-axes.md) | Making "is this worth doing?" machine-checkable. | +| 4 | [Reinforcement Feedback Loops](./04-reinforcement-feedback-loops.md) | The critic gate and the bug→rule→gate ratchet. | +| 5 | [Code-Quality Imperatives](./05-code-quality-imperatives.md) | Quality as an executable admission policy, not taste. | +| 6 | [System Architecture: The Tick](./06-system-architecture.md) | How the three surfaces compose into one control loop. | +| 7 | [Limitations & Threats to Validity](./07-limitations.md) | Where implementation diverges from thesis. | +| 8 | [Conclusion](./08-conclusion.md) | The governance triad as the durable contribution. | + +--- + +## How to read this paper + +Each section stands alone but the argument is cumulative. Section 1 +establishes the problem; Section 2 states the thesis as an explicit +tradeoff; Sections 3–5 defend each of the three control surfaces in turn; +Section 6 shows how they compose at runtime; Section 7 is the honest ledger +of where the artifact does not yet live up to the argument; Section 8 +states what we believe is durable. + +Throughout, claims are grounded in the actual mechanisms of the +`forge-loop` codebase (`.forge/axes.yaml`, `.forge/quality-manifesto.md`, +the `critic` module, the `brainstormer`, the merge gate) rather than in an +idealized system. This is a *position paper*, not a controlled study: it +argues a design philosophy and is explicit about its evidentiary limits. + +--- + +*Status: draft. Authored as an architectural rationale for the forge-loop +project.*