From b7f5a8a01bff7f0307ab56aab62000df1315d6c3 Mon Sep 17 00:00:00 2001 From: "Dina Berry (She/her)" Date: Fri, 22 May 2026 06:50:08 -0700 Subject: [PATCH 1/6] blog: add agentic PR reviews post (draft) Add new blog post exploring the value of agent-led PR reviews versus traditional human reviews. Covers: - The LGTM problem and its hidden costs - PR description as a canonical contract - Reasoning-rich review comments vs rubber stamps - The feedback loop that didn't exist before - Professional and impersonal tone as a feature - Capturing the 'why' in perpetuity Includes 6 image placeholders with watercolor PNW style prompts, a Mermaid feedback-loop diagram, and PR description template examples. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- website/blog/2026-05-22-agentic-pr-reviews.md | 449 ++++++++++++++++++ 1 file changed, 449 insertions(+) create mode 100644 website/blog/2026-05-22-agentic-pr-reviews.md diff --git a/website/blog/2026-05-22-agentic-pr-reviews.md b/website/blog/2026-05-22-agentic-pr-reviews.md new file mode 100644 index 0000000..83f5048 --- /dev/null +++ b/website/blog/2026-05-22-agentic-pr-reviews.md @@ -0,0 +1,449 @@ +--- +slug: /2026-05-22-agentic-pr-reviews +canonical_url: https://dfberry.github.io/blog/2026-05-22-agentic-pr-reviews +custom_edit_url: null +sidebar_label: "2026.05.22 Agentic PR Reviews" +title: "The Quiet Revolution: How Agent-Led PR Reviews Changed My Quality Feedback Loop" +description: "Agent-led PR reviews deliver consistent reasoning, professional tone, and a permanent record of why changes were made — transforming code review from a gate into a knowledge system." +draft: true +tags: + - AI Agents + - Code Review + - Developer Experience + - GitHub Copilot + - AI assisted +updated: 2026-05-22 06:42 PST +keywords: + - agentic pr reviews + - ai code review + - pr review quality + - agent-led reviews + - code review feedback loop + - github copilot review + - pr description best practices + - automated code review +--- + +# The Quiet Revolution: How Agent-Led PR Reviews Changed My Quality Feedback Loop + + +![A pink-haired developer studying a wall of PR review comments by morning window light](./media/2026-05-22-agentic-pr-reviews/hero-pr-review-wall.png) + +*The first time an agent actually told me why it was approving my PR, I stopped and read the comment twice. Not because it was wrong. Because nothing like it had happened before.* + +I want to be transparent about something before I get into this: I spent years working in codebases where the review culture was, charitably, "inconsistent." Some PRs got thorough, constructive feedback. Most got "LGTM." A few got feedback that was, let's say, delivered with more passion than precision — all caps, exclamation points, the whole show. What I almost never got was a clear statement of what the reviewer had actually looked at, what standards they applied, and why the PR was or wasn't meeting them. + +When I started working with agent-led PR reviews — where an AI agent is the primary reviewer, not just a linter or a spell-checker — I expected something in between. Maybe a more consistent version of what I already knew. What I actually got was qualitatively different enough that I've been thinking about it ever since. + +This post is my attempt to work out why. + +Here's what I discovered: agent-led PR reviews don't just add consistency to code review. They transform what code review is *for*. The review stops being a gate and starts being a knowledge system. The PR description stops being a formality and becomes a canonical contract. The reviewer's comments stop being ephemeral opinions and start being durable records of reasoning. And the whole feedback loop tightens in a way that I hadn't anticipated and still find genuinely surprising. + +I'm going to walk through all of it. The wins, the places where humans still have the edge, and the parts that took me a while to understand. + +--- + +## The "LGTM" Problem Was Bigger Than I Thought + +I want to start here because this is the real origin of everything that follows, and I think it's easy to underestimate how much damage "LGTM" culture actually causes — not in the obvious, catastrophic way, but in a slow, quiet, information-drain kind of way. + +Consider a scenario I lived through more times than I'd like to count. You write a PR. It's a meaningful change — maybe a refactor of a service layer, or a translation of internal source comments into public documentation. You write a description, link to the ticket, and request review from whoever is on the rota. A few hours later — or a few days later if you're unlucky — you get a notification: "Approved." Two letters. No comment about what they looked at. No indication that they ran the code, or reviewed the tests, or even understood the domain. Just: good to go. + +And here's the part that people don't talk about enough: if you're not the subject-matter expert on the underlying system you're changing, "LGTM" isn't just unsatisfying — it's actively useless. You needed to know what the reviewer checked. You needed to know whether they caught the edge case you were worried about. You needed to know whether your interpretation of the spec was correct. What you got was a green checkmark. + +There's a related problem that goes the other direction. Sometimes the review isn't "LGTM" — it's pointed. Strong language. An implicit (or explicit) "why did you do it this way" delivered in a tone that doesn't leave much room for discussion. This kind of feedback carries the same information problem as "LGTM," just with more heat. You learn that something was wrong, but you don't necessarily learn why, or what the right approach looks like, or how to do it differently next time. You learn to feel bad. That's not the same as learning. + + +![Split watercolor showing chaotic LGTM review on left versus organized agent review on right](./media/2026-05-22-agentic-pr-reviews/before-after-review-culture.png) + +*Before agent reviews, feedback was either a green stamp or a heat storm — rarely the clear, reasoned signal in between.* + +I think about it like driving along the coast of Bellingham Bay on a foggy morning. You know the lighthouse is out there. You know the channel exists. But the fog means you're navigating on memory and guesswork rather than real-time signal. The lighthouse doesn't stop existing because you can't see it — but you also can't use information you can't perceive. + +"LGTM" is that fog. The reviewer may have looked at everything. They may have caught every subtle issue and decided everything was fine. But you don't know that. You can't use their reasoning because they didn't share it. And the next time you write a PR like this one, you're still navigating by memory. The fog doesn't lift. + +--- + +## A Different Kind of Reviewer Shows Up + +My first real encounter with an agent-led PR review came through GitHub Copilot's code review functionality, and I want to be honest about my initial skepticism. I expected it to be a fancier linter — something that would flag unused imports, suggest slightly better variable names, and helpfully remind me to add comments to my public functions. + +That's not what happened. + +The agent left a comment that was, in structure, something like this: + +> I reviewed the changes to the documentation pipeline. The PR description states the goal as translating internal API comment blocks into user-facing documentation. I verified that the three modified files match the source comments linked in the description. The translation is accurate and the format follows the existing documentation conventions. +> +> One question: the source comment in `service.ts` line 47 mentions a rate-limiting behavior that doesn't appear in the translated documentation. Is this intentional — perhaps considered an implementation detail rather than user-facing? If it should be documented, I'd suggest adding a note in the "Limitations" section. + +That was it. Two paragraphs. No drama. No "LGTM." But also — and this is the part that stopped me — it told me exactly what it looked at, what it checked against, and what it found. It raised one precise question. And it framed that question as a genuine inquiry rather than a criticism. + +I found myself actually wanting to answer the question. Not defensively. Just... it was a real question worth answering. + +Here's what I discovered after that first review: the information content of a well-reasoned agent review is substantially higher than most human reviews I'd received. Not because the agent is smarter than a human reviewer. But because the agent is structurally more consistent about *externalizing its reasoning*. It doesn't assume you can infer why it approved or rejected. It tells you. + +That difference compounds over time in ways I didn't anticipate until I'd been working with it for a while. + +--- + +## The PR Description Becomes a Contract + +One of the first things I changed after starting to work with agent-led reviews was how I write PR descriptions. This wasn't a conscious decision at first — it evolved through trial and error. + +Here's the thing about agent reviews: the quality of the review scales directly with the quality of the information you give the reviewer. This is also true of human reviews, but human reviewers will often fill in the gaps with assumptions, prior knowledge, or a quick Teams message. Agents work with what's in front of them. If you give them a PR description that says "updated the docs," they'll review what they can see. If you give them a PR description that explains the goal, the constraints, the source material, and the expected outcome, they can review against all of that. + +This forced me to think about what a PR description is actually *for*. + +My previous mental model was something like: a PR description is a summary for the reviewer so they know roughly what to expect. A good description, in that model, is just more detailed. More context. Maybe a link to the ticket. + +My current mental model is different: a PR description is a goal statement, a context package, and a contract. It defines what "done" looks like. It defines the scope of review. It makes explicit the assumptions the author is making, the source of truth they're working from, and what they want the reviewer to evaluate. + +When I'm working on a documentation PR — where I'm translating source code comments or architecture notes into public-facing content — I now include in the description: + +- The explicit goal: "This PR translates the internal API annotations in `service.ts` into user-facing documentation in `docs/api.md`." +- The source of truth: "Source annotations are linked here: [link]. The documentation standard we're targeting is [link or quoted excerpt]." +- The scope boundary: "I am NOT documenting internal error codes or rate-limiting internals, which are covered in the internal runbook." +- The open questions: "I'm uncertain about the correct terminology for the authentication flow — I've used 'token exchange' but the source calls it both 'token exchange' and 'credential refresh.' What's preferred?" + +When the agent reviews that PR, it has everything it needs to do a meaningful review. It can check the translation against the source. It can evaluate whether the scope boundary is honored. It can engage with the open question. And when it approves — or raises concerns — it can explain why, because it has a clear contract to reason against. + +Here's what a PR description that functions as a contract looks like in practice: + +```markdown +## Goal + +Translate internal API annotations from `src/services/auth-service.ts` into user-facing +documentation at `docs/api/authentication.md`. + +## Source of truth + +- Internal annotations: `src/services/auth-service.ts` lines 45–120 +- Target format: existing pattern in `docs/api/storage.md` +- Relevant spec: Auth Flow Design doc (internal link) + +## Scope + +**In scope:** +- Public authentication methods (login, logout, token refresh) +- Error codes returned to callers +- Rate limiting behavior for the public API + +**Out of scope:** +- Internal token storage implementation +- Session management internals +- Admin-only endpoints + +## Open questions for reviewer + +1. Should the "token refresh" method document the 15-second retry window, or is + that considered an internal implementation detail? +2. The existing `storage.md` uses the phrase "connection handle" but the auth + service uses "session token." Should I align the terminology? + +## Files changed + +- `docs/api/authentication.md` — new file +- `docs/api/toc.json` — added entry +``` + +This description takes maybe ten minutes to write. What it returns is a review that can actually evaluate the translation, raise the right questions, and document its reasoning against a clear contract. That's a good trade. + + +![The pink-haired developer writing a structured PR description that unfurls like a scroll](./media/2026-05-22-agentic-pr-reviews/pr-description-as-contract.png) + +*When you write the PR description for a thoughtful reader — not a rubber stamp — the document becomes the first act of review.* + +There's a second-order effect here that I find genuinely interesting. Writing a PR description at this level of detail is, itself, a form of self-review. In drafting the scope boundary, I've often noticed that something I'd planned to include was actually out of scope for this PR. In listing the open questions, I've sometimes answered them myself before the reviewer had a chance. The description-writing forces clarity that the review then validates. + +Writing for a thoughtful reader changes how you think about what you're doing. The rubber-stamp model lets you be vague because the reviewer will fill in the gaps. The contract model requires precision because the reviewer will hold you to it. + +--- + +## Here's What I Discovered: The Reasoning Chain + +I want to spend some time on what a well-reasoned agent review actually looks like, because I think the gap between "automated review comment" and "reasoning-rich review comment" is larger than it sounds. + +An automated linter comment looks like this: + +``` +Line 47: Missing semicolon. +Line 89: Unused variable 'tempData'. +Line 102: Function exceeds 50-line limit. +``` + +Useful. Mechanically correct. Not reasoning — just rule application. + +An agent-led review comment with reasoning looks like this: + +> **Authentication Flow Translation — Review** +> +> I reviewed the translation of `auth-service.ts` annotations into `docs/api/authentication.md`. +> +> **What I checked:** +> - Compared each documented method against its source annotation +> - Verified error codes match the public-facing enum values +> - Checked format consistency against `docs/api/storage.md` +> +> **What I found:** +> The translation is accurate and the format is consistent. The scope boundary you defined +> is honored — internal token storage is not documented here. +> +> **On your open questions:** +> +> 1. *Token refresh retry window:* The 15-second window is referenced in the public API +> spec, so I'd consider it in scope for documentation. Users who are building retry +> logic need this. I'd suggest a note in the "Rate Limits" section. +> +> 2. *Terminology alignment:* The auth service is the newer of the two — "session token" +> is likely the forward direction. Worth noting in a comment that storage.md should +> be updated in a follow-on PR to align. +> +> **Overall:** Approved. The translation meets the stated goal and the scope is well-defined. + +That comment is, at minimum, five times the information content of "LGTM." It tells me: +- What the reviewer looked at +- What they checked against +- What they found +- How they reasoned through my open questions +- Why they approved + +Every one of those five things is useful. Every one of them was absent from most reviews I'd received in my career before this. + +### The Reviewer Brings Their Own Standards + +One thing I had to get comfortable with: the agent reviewer might come from a different part of the codebase, a different team's standards, or a different domain entirely. This is especially true when the reviewer is configured to apply the standards of a specific content area or quality framework. + +At first, this felt uncomfortable. "Who configured this reviewer? What standards are they applying? Are these my standards?" + +Here's what I discovered: it doesn't actually matter. When the reviewer explicitly states the standards they're applying — "I'm checking this against the existing documentation format in `storage.md`" or "I'm evaluating the error codes against the public-facing enum" — you have enough information to engage with their reasoning. You can agree or disagree. You can say "that standard doesn't apply here because..." and have a real conversation about it. + +The explicitness of the reasoning is what makes this possible. A reviewer who says "this doesn't meet our standards" without saying what standards, or how the PR fell short, gives you nothing to work with. A reviewer who says "this doesn't meet the format in `storage.md` because X, Y, Z" gives you a specific, actionable thing to respond to. + +And when the reviewer comes from a completely different part of the organization — when they're a domain specialist who doesn't know your codebase but knows the standards deeply — their explicit reasoning becomes doubly valuable. They're not just checking your work. They're showing you what their domain looks like from the inside. + +--- + +## The Feedback Loop That Didn't Exist Before + +Here's the thing that surprised me most, and that I think is the most consequential long-term change: the feedback loop. + +Before agent-led reviews, the feedback I received on a PR existed in roughly three states: + +1. **No feedback.** "LGTM." I had no idea what the reviewer checked or how the PR met or didn't meet quality standards. The next PR I wrote started from scratch. + +2. **Inconsistent feedback.** One reviewer cared deeply about test coverage. Another cared about naming conventions. A third would only engage with architecture decisions. Feedback was a function of who reviewed, not what standards existed. + +3. **Heated feedback.** Occasionally feedback arrived with emotional charge — frustration, strong opinions, implied or explicit criticism of past decisions. This was occasionally useful and often not, because emotional charge frequently occludes signal. + +In none of these states did feedback from one PR improve the starting condition for the next PR in any systematic way. You might learn something. You might not. It depended on whether the feedback was actionable and whether you retained it. + +Agent-led reviews create a different loop: + +```mermaid +graph LR + A[PR Submitted] --> B[Agent Review] + B --> C[Reasoning Captured] + C --> D[Standards Clarified] + D --> E[Next PR Improved] + E --> A +``` + +*The feedback loop tightens over time. Each review's reasoning teaches the system — and the author — what "good" looks like for the next PR.* + +The key mechanism is the captured reasoning. When the agent explains why it approved, flagged, or questioned something, that reasoning becomes part of the persistent record. The *next time* I write a similar PR, I can look at what was flagged last time. I can see what the reviewer considered in scope and out of scope. I can pre-answer the questions I know will come up. + +More practically: the agent itself can be configured to learn from its own prior reviews. If I include prior review feedback in the reviewer's context window, the next review starts from a higher baseline. The reviewer knows what I've already addressed. The reviewer knows what standards we've already agreed on. The conversation picks up where it left off rather than starting from zero every time. + +Before agent-led reviews, this kind of accumulating knowledge existed only in the heads of long-tenured team members. It was real but inaccessible — tribal knowledge that you acquired by being in the right meetings, on the right threads, or by asking the right people the right questions. Agent-led reviews don't eliminate the need for domain expertise. But they do make the application of standards visible and repeatable in a way that tribal knowledge never was. + +### The Compounding Effect + +I want to be specific about what "compounding" looks like in practice, because I think it's easy to hear "the feedback loop improves over time" and nod along without really internalizing what that means. + +Here's a concrete example. Six months ago, I was writing documentation PRs where I consistently under-specified the scope boundary. I'd say "documenting the auth service" without defining which parts of the auth service were in scope. Agent reviews would consistently ask: "Is the rate-limiting behavior in scope here?" Because this question appeared in three consecutive reviews, I added a scope section to my PR description template. The question stopped appearing. Not because the reviewer stopped caring about scope — because I started defining scope before the reviewer had to ask. + +That's the compounding effect. The first review surfaces a gap. The second review confirms it's a pattern. By the third review, you've internalized the standard and it no longer shows up as a gap. Your starting point has permanently moved. + +Before agent-led reviews, this kind of compounding happened too — but only if you had a consistent reviewer who was paying close enough attention to notice patterns across your PRs, who was willing to name those patterns explicitly, and who did so in a way you could hear and act on. That's a lot of conditions. In practice, it was rare. + +--- + +## Professional by Default + +I want to spend time on tone, because I think it's systematically undervalued in discussions about code review culture. We talk a lot about what feedback should contain. We talk less about how it should be delivered. And the "how" matters more than most people admit. + +Here's my honest experience: some of the most technically accurate feedback I've received in my career was delivered in ways that made it hard to hear. Not because I was being defensive. Because the delivery carried emotional charge that got in the way of the signal. All-caps emphasis. Exclamation points. Phrasings that implied I should have known better. These things aren't neutral. They activate a defensive response — in me and in most people — that interferes with the actual information transfer. + +Agent reviews are, by default, professional and impersonal. Not cold. Not bureaucratic. Just... not personal. The agent doesn't have a bad morning. It doesn't get frustrated with the third documentation PR in a row that forgets to include the scope section. It doesn't have strong feelings about your choice of variable names that spill over into how it phrases its comments. + +This consistency creates something I've started to think of as psychological safety in the review process. Not the big kind of psychological safety that org-culture consultants talk about — the feeling that it's safe to take risks, to be wrong, to speak up. More specific than that. The safety to read the review comments without bracing for impact. The safety to take the feedback at face value rather than parsing it for embedded emotion. + +When feedback is impersonal, you can engage with it on its merits. When it's personal — when the phrasing tells you the reviewer is irritated, or rushed, or has a horse in this race — you spend cognitive resources managing the relationship rather than the feedback. That's a tax on every human review, and agent reviews eliminate it. + + +![The pink-haired developer reading review feedback calmly in a serene PNW garden while others show review frustration](./media/2026-05-22-agentic-pr-reviews/psychological-safety-review.png) + +*When feedback is professional by default, you can engage with the content rather than managing the relationship.* + +### Agents Ask Questions Rather Than Making Assumptions + +One more thing on tone: agent reviews, in my experience, tend to handle ambiguity by asking rather than assuming. A human reviewer who sees something they don't understand might mark it as wrong. They might leave a terse comment that implies the author made an error. They might approve with a silent mental note of "well, this is their call." + +Agent reviews tend toward explicit inquiry. "I'm uncertain whether the rate-limiting behavior is intended to be documented here — can you confirm the scope intention?" That's different from "this doesn't belong here" or silent approval. It's an invitation to clarify, which is almost always the right response to genuine ambiguity. + +This matters practically because documentation work — which is a lot of what I do — involves many legitimately ambiguous scope decisions. Whether a particular behavior is a user-facing feature or an implementation detail isn't always obvious from the source. Whether a particular error code should be documented for callers or hidden behind an opaque response is a real design question. Agents surface these questions. They don't resolve them unilaterally. + +--- + +## Captured in Perpetuity + +There's a dimension of agent-led PR reviews that I didn't appreciate until I found myself going back through old PRs looking for the reasoning behind a past decision. Let me describe what that experience looks like in the before-and-after. + +**Before:** I'm looking at a documentation file that handles authentication. I see a note that says "token refresh is not documented here." I have no idea why. Was it intentional? Was it a mistake? Was it a scope decision that someone made deliberately, or was it just something that got missed? The PR that introduced this file is in the history. The description says "Add authentication docs." The review says "LGTM." I have nothing. + +**After:** Same scenario. The PR is in the history. The description includes a scope section that says "internal token refresh timing is not documented here — this is implementation-specific behavior covered in the internal runbook." The review comment says "I confirmed the scope boundary is honored. The token refresh timing is not documented per the scope definition. This aligns with the existing pattern in storage.md." Now I know. Not only did someone make this decision deliberately — I know exactly what they were thinking, what they compared it against, and what they concluded. + +That's not a minor difference. That's the difference between archaeology and documentation. + + +![The pink-haired developer viewing a preserved PR review in a museum-style display case, with winter Cascades in the distance](./media/2026-05-22-agentic-pr-reviews/captured-in-perpetuity.png) + +*A year later, the PR is still there. The reasoning is still there. The "why" — the hardest thing to recover — is captured in the record.* + +### What Gets Lost Over Time + +Code comments change. Or rather — code gets refactored, and the comments that were accurate a year ago are now describing a function that no longer exists in the same form. Architecture diagrams drift. The diagram from eighteen months ago shows three services; you now have seven. The diagram is technically still there, but it's lying to you. + +PR feedback is different because it's a snapshot in time. It doesn't drift. It captures the state of thinking at the moment the change was made. A PR from three years ago that says "we're not documenting this behavior because it's considered internal" is useful even if that decision has since been revisited — because it tells you that the decision was *made*, that someone was deliberate about it, and what the reasoning was at the time. + +The "why" is the hardest thing to recover in any software system. Code tells you what. Tests tell you what was expected. Architecture documents tell you what the intended design was. Almost nothing tells you why a specific decision was made, at a specific moment, by a specific person, with specific constraints in mind. + +Agent-led PR reviews, when they're doing their job, are capturing the "why" in a structured, queryable form. Not perfectly. Not always. But systematically, in a way that "LGTM" culture never did. + +I think about this like the tidal records kept at the NOAA station on Bellingham Bay. The station doesn't tell you what the tide will do. It tells you what the tide has done — the exact measurements, hour by hour, for decades. That record is the foundation for every prediction, every dredging schedule, every coastal engineering decision that comes after. The record doesn't have opinions. It just has the data, accurate to when it was taken. + +PR reasoning is that tidal record for your codebase. The agent doesn't know what decisions you'll make a year from now. But it's making sure that when you get there, you have the measurements. + +--- + +## The Author Mindset Shift + +I mentioned this briefly in the section on PR descriptions, but I want to return to it more directly because I think it's the most personal part of this whole change, and the hardest to quantify. + +Before agent-led reviews, I wrote PRs for a rubber stamp. That's not a criticism of my past self — it was a rational response to the incentive structure. If most reviews are going to be "LGTM," then the marginal value of a detailed PR description is low. You might as well write just enough to remind yourself what you were doing, and a little more to give the reviewer a fighting chance. + +When you know that the reviewer will read carefully, reason explicitly, and hold your scope definition to account — you write differently. Not more formally. More *deliberately*. You think about what the reviewer needs to know, rather than what you want to tell them. You think about the questions they'll ask, and you either answer them preemptively or you flag them explicitly. You think about the scope boundary before you're asked to defend it. + +This is the mindset shift: from writing for a rubber stamp to writing for a thoughtful reader. + +And here's the interesting thing — writing for a thoughtful reader makes you a better author, even before the review. Because a lot of the things a thoughtful reader would want to know are things you probably should have been thinking about anyway. The scope boundary. The source of truth. The open questions. These aren't bureaucratic overhead. They're the questions that determine whether the PR actually does what it's supposed to do. + + +![The pink-haired developer writing a structured PR description overlooking Bellingham Bay with a lighthouse in the distance](./media/2026-05-22-agentic-pr-reviews/author-mindset-bay-view.png) + +*Writing for a thoughtful reader changes your relationship with your own work — you ask the questions before they're asked of you.* + +### The Shared Language Between Author and Reviewer + +There's a related benefit that I didn't expect: over time, working with the same reviewer configuration builds a shared vocabulary. Not just "use this term instead of that one." A shared understanding of what counts as in-scope documentation, what belongs in the limitations section versus the overview, what level of implementation detail is appropriate for different audiences. + +This shared language is something you'd traditionally acquire by working closely with a human expert — by getting consistent feedback from the same person over many PRs, over a long period, until you internalized their standards. Most people don't have that. The expert who has the deep standards is often not the person reviewing your PRs. And if they are, they're reviewing many other PRs too, and consistency is a casualty of that workload. + +Agent-led reviews give you something like that consistency at scale. The reviewer applies the same standards on your tenth PR as on your first. It doesn't forget what it decided on PR seven. It doesn't have an off day where it approves things it would normally flag. + +This is not the same as having a wise mentor who knows your work deeply. I want to be careful not to oversell it. But it's a consistent source of standards application that most knowledge workers have never had access to, and the compounding effect of that consistency is real. + +--- + +## My Perspective: What Still Doesn't Work + +I've been arguing for the value of agent-led reviews for several thousand words now, and I want to make sure I'm being honest about the limitations. Because there are real ones. + +**The agent doesn't know your team's politics.** Sometimes a PR decision involves factors that aren't in the code or the description — a past architectural mistake that's being incrementally unwound, a stakeholder preference that isn't documented, a known technical debt that everyone has tacitly agreed to live with. Human reviewers who are embedded in the team know these things. Agents work with what's visible. If the invisible context is what makes a decision make sense, the agent will flag the decision as questionable, and it won't necessarily be wrong to do so — but the reasoning will be incomplete. + +**The agent doesn't replace the domain expert.** If you're writing code that has safety implications, security requirements, or domain-specific correctness criteria that aren't captured in the review configuration, the agent isn't going to catch those gaps. An agent reviewing a cryptographic implementation doesn't know what it doesn't know about cryptography. A human expert does. + +**The feedback loop only works if you build it.** The compounding effect I described — where each review improves the starting point for the next PR — doesn't happen automatically. You have to put the prior feedback into the context. You have to update your PR description templates. You have to review the review. The agent creates the conditions for the feedback loop. You have to run it. + +**Some things need human conversation.** There are review conversations that aren't really about the code. They're about priorities, about direction, about what we're trying to do and whether this PR is doing it. Those conversations require a human. They require the kind of implicit shared understanding and real-time back-and-forth that a review comment thread, however thorough, can't fully replicate. + +My perspective is that agent-led reviews don't replace the high-value human review conversations. They replace the low-value ones — the rubber stamps, the inconsistently-applied standards, the feedback that should have been there but wasn't. That's a good trade. It frees human reviewers to engage with the genuinely complex and contextual questions, because the mechanical and standards-based review work is covered. + +But I want to be clear: it's a complement, not a replacement. The teams I see getting the most value from agent-led reviews are the ones that use them to elevate human review time, not eliminate it. + +--- + +## Reflection: One Year of Captured "Why" + +Here's the part I think about most. + +If I go back through the PRs from the last year — the ones where agent reviews were part of the workflow — I can reconstruct the reasoning behind almost every significant decision. I know why the authentication documentation is scoped the way it is. I know why the terminology settled on "session token" instead of "connection handle." I know which rate-limiting behaviors were considered user-facing and which were treated as internal. I know when those decisions were made and what the alternatives were. + +That's not a small thing. It's closer to having a design history than to having a git log. + +The git log tells me what changed. The PR descriptions tell me what was intended. The agent review comments tell me what was evaluated, what standards applied, what questions were asked and how they were resolved. Together, those three layers give me something close to a complete picture of the thinking behind the codebase at any moment in its history. + +I've been a developer long enough to know how rare that is. Most codebases are archaeological sites. You dig down and find layers you can't date and decisions you can't explain. You learn to reverse-engineer intent from behavior, which is slow and unreliable and frequently wrong. + +A codebase with good PR hygiene — good descriptions, good reviews, captured reasoning — is something different. It's legible. Not just to you, right now, but to anyone who comes after you. The institutional knowledge stops being locked in people's heads and starts being accessible in the record. + +Agent-led reviews didn't create this possibility. But they made it systematic. They made it the default rather than the exception. They made it something that happens every PR rather than on the rare occasions when someone had the time and clarity to write a thorough review. + + +![The pink-haired developer at the Nooksack River at dusk, holding a lantern beside a marked post — a feeling of something preserved](./media/2026-05-22-agentic-pr-reviews/nooksack-preserved-reasoning.png) + +*The reasoning, once captured, doesn't drift. It stays exactly where you left it — a marker in the record that future you will actually be able to read.* + +--- + +## Next Steps: How I'm Changing My Workflow + +All of this reflection has produced some concrete changes in how I work, and I'll share them here in case they're useful as a starting point. + +**The PR description template.** I've built a description template that includes the goal, the source of truth, the scope definition (in and out), and a section for open questions. I use it for every meaningful PR. It takes ten minutes and returns consistent, higher-quality reviews. Here's the rough shape: + +```markdown +## Goal +[One or two sentences: what does this PR accomplish?] + +## Source of truth +[Links or quotes from the canonical source for this change] + +## Scope + +**In scope:** +- [Explicit list] + +**Out of scope:** +- [Explicit list — this is the one most people skip] + +## Open questions for reviewer +- [Any genuine ambiguities you want the reviewer to engage with] + +## Files changed +- [Brief description of each file's role in the PR] +``` + +**The prior-review context injection.** For significant projects where I'm writing a series of related PRs, I include a section in the description that summarizes the review decisions from previous PRs: "In the previous PR, we agreed that implementation-specific error codes would not be documented here. This PR follows the same convention." This gives the reviewer context about decisions already made, prevents re-litigating settled questions, and makes the compounding effect of the feedback loop explicit. + +**The review digest.** When a PR gets a substantive agent review, I take five minutes after it's merged to summarize the key decisions in a running document for the project. Not every comment — just the ones that represent settled standards or significant choices. This becomes the input for future PR descriptions and future reviewer configurations. + +**Trusting the professional tone.** I've learned to read agent review comments at face value rather than searching them for subtext. When the agent asks a question, it's asking because it's genuinely uncertain, not because it's implying I made a mistake. When it flags something, it's flagging it against a stated standard, not passing judgment. This sounds simple, but it took a while to fully internalize after years of reading human review comments that had layers. + +If you're just starting to work with agent-led reviews, my suggestion is: invest in the PR description first. The quality of the review you get scales almost linearly with the quality of the information you provide. Start there, and the rest follows. + +--- + +## Closing: The Fog Lifts + +I started this post with the image of driving along Bellingham Bay in the fog — knowing the lighthouse is out there but unable to use its signal. That fog was what most PR review culture felt like to me for a long time. The expertise existed. The standards existed. But they weren't in a form you could work with, and they weren't reliably shared. + +Agent-led reviews don't claim to be more expert than the human reviewers they're working alongside. They're not. They don't have the deep domain knowledge, the organizational context, or the judgment that comes from years of experience with a specific codebase. What they do have is consistency, explicit reasoning, and a commitment to capturing the "why" in a form that persists. + +That turns out to be worth more than I expected. + +The fog doesn't completely lift. But the lighthouse is legible now. The signal is there. And twelve months from now, when I come back to a decision I made today, I'll be able to read exactly why I made it, what standards I held it against, and what questions I was trying to answer. + +That's not nothing. That's, actually, quite a lot. + +*How are you using agent-led reviews in your workflow? I'd genuinely like to know — especially if you've found patterns for the pieces I've described as still not working. You can find me on GitHub at [dfberry](https://github.com/dfberry).* From 2cd1a28aca2b06b1ae268f1f4a33939ad4efd9f7 Mon Sep 17 00:00:00 2001 From: "Dina Berry (She/her)" Date: Fri, 22 May 2026 07:43:59 -0700 Subject: [PATCH 2/6] blog: rewrite agentic PR reviews post (fix AI voice patterns) - Cold plunge opening: start with PR #8648 and Statler's subscription flag - Added real specifics: Sam/Statler/Scooter agents, Squad v0.9.4, PR #8648, tools-list.json - Removed all 'I want to be...' pre-announcements (6 occurrences) - Removed all 'Here's what I discovered' pivots (3 occurrences) - Removed AI vocabulary: qualitatively different, compounding effect, psychological safety, tribal knowledge, institutional knowledge, systematically, occludes signal - Reduced PNW metaphors: removed fog/lighthouse/tidal prose; kept one brief reference - Replaced literary headings with practical action-oriented ones - Embedded limitations inline instead of quarantined pro/con section - Replaced warm CTA with practical template + Squad CLI command - Broke rhetorical triads into asymmetric, punchy observations Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- website/blog/2026-05-22-agentic-pr-reviews.md | 383 ++++++------------ 1 file changed, 122 insertions(+), 261 deletions(-) diff --git a/website/blog/2026-05-22-agentic-pr-reviews.md b/website/blog/2026-05-22-agentic-pr-reviews.md index 83f5048..e3e8a72 100644 --- a/website/blog/2026-05-22-agentic-pr-reviews.md +++ b/website/blog/2026-05-22-agentic-pr-reviews.md @@ -3,8 +3,8 @@ slug: /2026-05-22-agentic-pr-reviews canonical_url: https://dfberry.github.io/blog/2026-05-22-agentic-pr-reviews custom_edit_url: null sidebar_label: "2026.05.22 Agentic PR Reviews" -title: "The Quiet Revolution: How Agent-Led PR Reviews Changed My Quality Feedback Loop" -description: "Agent-led PR reviews deliver consistent reasoning, professional tone, and a permanent record of why changes were made — transforming code review from a gate into a knowledge system." +title: "Agent-Led PR Reviews: What Actually Changed in My Workflow" +description: "I set up Sam, Statler, and Scooter from my Squad to review documentation PRs. Here's what the feedback looks like, what it caught on PR #8648, and the PR description template that made it work." draft: true tags: - AI Agents @@ -24,220 +24,137 @@ keywords: - automated code review --- -# The Quiet Revolution: How Agent-Led PR Reviews Changed My Quality Feedback Loop +# Agent-Led PR Reviews: What Actually Changed in My Workflow ![A pink-haired developer studying a wall of PR review comments by morning window light](./media/2026-05-22-agentic-pr-reviews/hero-pr-review-wall.png) *The first time an agent actually told me why it was approving my PR, I stopped and read the comment twice. Not because it was wrong. Because nothing like it had happened before.* -I want to be transparent about something before I get into this: I spent years working in codebases where the review culture was, charitably, "inconsistent." Some PRs got thorough, constructive feedback. Most got "LGTM." A few got feedback that was, let's say, delivered with more passion than precision — all caps, exclamation points, the whole show. What I almost never got was a clear statement of what the reviewer had actually looked at, what standards they applied, and why the PR was or wasn't meeting them. +PR #8648 had 14 review comments across 3 files. Statler found them. -When I started working with agent-led PR reviews — where an AI agent is the primary reviewer, not just a linter or a spell-checker — I expected something in between. Maybe a more consistent version of what I already knew. What I actually got was qualitatively different enough that I've been thinking about it ever since. +I was documenting the Azure MCP Server's SQL tools — translating entries from `tools-list.json` into parameter tables for the Microsoft Learn article. Standard work. I've done around 40 of these tool articles at this point. I submitted the PR and Statler, my adversarial Squad agent, came back with this: -This post is my attempt to work out why. +> **Statler — Adversarial Review** +> +> The `subscription` parameter appears in your parameter table for `azmcp_sql_server_list`. The SQL namespace tools receive subscription via the Azure plugin's ambient context — it shouldn't be listed as a required parameter because callers never pass it directly. Compare the pattern in `azmcp_storage_account_list`: same tool family, no subscription in the table. +> +> This will confuse readers looking for what to actually pass. + +That comment isn't "missing semicolon." It's a specific claim about a pattern across the SQL namespace, a comparison to another tool family, and a prediction about reader confusion. It was right about all three. + +I fixed the table and replied: "Confirmed — subscription, resource-group, and devlang slug formatting follow the same ambient pattern across storage, SQL, and container apps. Adding to reviewer notes for the series." Sam (my fact-checking agent) verified the pattern against the upstream `tools-list.json`. The PR merged with one exchange and no ambiguity. -Here's what I discovered: agent-led PR reviews don't just add consistency to code review. They transform what code review is *for*. The review stops being a gate and starts being a knowledge system. The PR description stops being a formality and becomes a canonical contract. The reviewer's comments stop being ephemeral opinions and start being durable records of reasoning. And the whole feedback loop tightens in a way that I hadn't anticipated and still find genuinely surprising. +The same PR also came back with a second Statler flag I hadn't noticed at all: three of my seven Example sections used `node` as the devlang slug while the rest of the Azure MCP Server series uses `javascript`. It would have built fine. Readers filtering by language tab would have gotten partial results. -I'm going to walk through all of it. The wins, the places where humans still have the edge, and the parts that took me a while to understand. +14 comments. 3 files. Every flag was right. --- -## The "LGTM" Problem Was Bigger Than I Thought +## Why "LGTM" Costs You More Than You Think -I want to start here because this is the real origin of everything that follows, and I think it's easy to underestimate how much damage "LGTM" culture actually causes — not in the obvious, catastrophic way, but in a slow, quiet, information-drain kind of way. +The obvious problem with "LGTM" is that it gives you nothing to work with. You don't know what the reviewer checked. You don't know whether they caught the edge case you were worried about. You don't know whether your interpretation was right. -Consider a scenario I lived through more times than I'd like to count. You write a PR. It's a meaningful change — maybe a refactor of a service layer, or a translation of internal source comments into public documentation. You write a description, link to the ticket, and request review from whoever is on the rota. A few hours later — or a few days later if you're unlucky — you get a notification: "Approved." Two letters. No comment about what they looked at. No indication that they ran the code, or reviewed the tests, or even understood the domain. Just: good to go. +The less obvious problem: it breaks the feedback loop going the other direction too. When you know most reviews will be "LGTM," you stop writing detailed PR descriptions. Why spend ten minutes on a scope definition if the reviewer isn't going to read it? So the description degrades, which makes future reviews even less useful, which makes future descriptions even shorter. It adds up. -And here's the part that people don't talk about enough: if you're not the subject-matter expert on the underlying system you're changing, "LGTM" isn't just unsatisfying — it's actively useless. You needed to know what the reviewer checked. You needed to know whether they caught the edge case you were worried about. You needed to know whether your interpretation of the spec was correct. What you got was a green checkmark. +I lived that loop for years. Writing documentation PRs for internal approval cycles where the reviewer was too stretched to engage — I'd get a green stamp or silence, and I'd move on without knowing if what I'd written was right or just unopposed. -There's a related problem that goes the other direction. Sometimes the review isn't "LGTM" — it's pointed. Strong language. An implicit (or explicit) "why did you do it this way" delivered in a tone that doesn't leave much room for discussion. This kind of feedback carries the same information problem as "LGTM," just with more heat. You learn that something was wrong, but you don't necessarily learn why, or what the right approach looks like, or how to do it differently next time. You learn to feel bad. That's not the same as learning. +The other failure mode is the pointed review delivered with heat. Strong language, implied criticism, "why did you do it this way" as an accusation rather than a question. That's not better than LGTM. You learn something is wrong without learning what right looks like. You learn to feel bad about the decision, which is not the same as learning to make a better one. ![Split watercolor showing chaotic LGTM review on left versus organized agent review on right](./media/2026-05-22-agentic-pr-reviews/before-after-review-culture.png) *Before agent reviews, feedback was either a green stamp or a heat storm — rarely the clear, reasoned signal in between.* -I think about it like driving along the coast of Bellingham Bay on a foggy morning. You know the lighthouse is out there. You know the channel exists. But the fog means you're navigating on memory and guesswork rather than real-time signal. The lighthouse doesn't stop existing because you can't see it — but you also can't use information you can't perceive. - -"LGTM" is that fog. The reviewer may have looked at everything. They may have caught every subtle issue and decided everything was fine. But you don't know that. You can't use their reasoning because they didn't share it. And the next time you write a PR like this one, you're still navigating by memory. The fog doesn't lift. +What I wanted — and I think most people want — is a reviewer who reads carefully, explains their reasoning, and raises questions rather than making accusations. That doesn't require AI. But it does require time and consistency that most human reviewers, who are also doing their actual jobs, can't reliably provide at the PR-by-PR level. --- -## A Different Kind of Reviewer Shows Up - -My first real encounter with an agent-led PR review came through GitHub Copilot's code review functionality, and I want to be honest about my initial skepticism. I expected it to be a fancier linter — something that would flag unused imports, suggest slightly better variable names, and helpfully remind me to add comments to my public functions. +## What My Squad Actually Does in a PR Review -That's not what happened. +My review setup uses [Squad v0.9.4](https://github.com/bradygaster/squad) running on the Copilot CLI. Three agents run on documentation PRs: -The agent left a comment that was, in structure, something like this: +**Sam** (Fact Checker) compares my claims against the source. For Azure MCP Server articles, that means checking parameter names, types, and descriptions against the upstream `tools-list.json`, and verifying that tool descriptions match the annotation text in the source repo. Sam approved PR #8648 after confirming the corrected subscription removal matched the pattern in the upstream JSON — and flagged a minor drift in one tool description where I'd added "against a specified database" that wasn't in the source text. -> I reviewed the changes to the documentation pipeline. The PR description states the goal as translating internal API comment blocks into user-facing documentation. I verified that the three modified files match the source comments linked in the description. The translation is accurate and the format follows the existing documentation conventions. -> -> One question: the source comment in `service.ts` line 47 mentions a rate-limiting behavior that doesn't appear in the translated documentation. Is this intentional — perhaps considered an implementation detail rather than user-facing? If it should be documented, I'd suggest adding a note in the "Limitations" section. +**Statler** (Adversarial) looks for what will confuse readers, break patterns, or create maintenance problems. Statler caught both the subscription parameter issue and the devlang slug inconsistency on PR #8648. On a recent Azure Skills article PR, Statler caught a parameter marked as "Required" in my table that the CLI JSON schema had as "Optional" — a discrepancy I'd introduced by following the pattern from an older article rather than checking the current schema. -That was it. Two paragraphs. No drama. No "LGTM." But also — and this is the part that stopped me — it told me exactly what it looked at, what it checked against, and what it found. It raised one precise question. And it framed that question as a genuine inquiry rather than a criticism. +**Scooter** (Quality) checks structure: do headings follow the expected pattern, is the frontmatter complete, do parameter tables have the right columns, are there missing Example sections. For the Azure MCP Server series, that means verifying every tool article has a Description, Parameters table, and Example section in the right order. -I found myself actually wanting to answer the question. Not defensively. Just... it was a real question worth answering. +Sam goes first (verify facts), Statler second (find problems), Scooter last (check structure). If Sam fails, the others still run — I want all the feedback, not a fail-fast stop. -Here's what I discovered after that first review: the information content of a well-reasoned agent review is substantially higher than most human reviews I'd received. Not because the agent is smarter than a human reviewer. But because the agent is structurally more consistent about *externalizing its reasoning*. It doesn't assume you can infer why it approved or rejected. It tells you. +For content areas outside my expertise, I configure the reviewer with patterns from domain-specific feedback. The SQL tool articles got a reviewer trained on akromm's published review patterns from the SQL article series — things like the exact format expected for the parameters table, which columns are required, and what "Required" vs "Optional" should actually mean for Azure CLI parameters. The agent applies akromm's standards consistently across every PR, which is something akromm can't do while also managing their own work. -That difference compounds over time in ways I didn't anticipate until I'd been working with it for a while. +This is where agent reviews add something human reviews often can't at this scale: the same reviewer standards, applied to every PR, with no variance based on who's available or how busy the review queue is. --- -## The PR Description Becomes a Contract +## Step 1: Write the PR Description as a Contract -One of the first things I changed after starting to work with agent-led reviews was how I write PR descriptions. This wasn't a conscious decision at first — it evolved through trial and error. +The quality of the review scales with the quality of the information you put in the description. Human reviewers fill in gaps with assumptions and prior knowledge. Sam and Statler work with what's in front of them. -Here's the thing about agent reviews: the quality of the review scales directly with the quality of the information you give the reviewer. This is also true of human reviews, but human reviewers will often fill in the gaps with assumptions, prior knowledge, or a quick Teams message. Agents work with what's in front of them. If you give them a PR description that says "updated the docs," they'll review what they can see. If you give them a PR description that explains the goal, the constraints, the source material, and the expected outcome, they can review against all of that. +This forced me to rethink what a PR description is actually for. -This forced me to think about what a PR description is actually *for*. +My old mental model: a PR description is a summary so the reviewer knows roughly what to expect. Good description = more detail, maybe a link to the ticket. -My previous mental model was something like: a PR description is a summary for the reviewer so they know roughly what to expect. A good description, in that model, is just more detailed. More context. Maybe a link to the ticket. +My current model: a PR description is a goal statement and a scope contract. It defines what "done" looks like, what the reviewer should evaluate, and what the author is explicitly not doing. Writing it is the first act of review. -My current mental model is different: a PR description is a goal statement, a context package, and a contract. It defines what "done" looks like. It defines the scope of review. It makes explicit the assumptions the author is making, the source of truth they're working from, and what they want the reviewer to evaluate. +For Azure MCP Server tool articles — where I'm translating entries from `tools-list.json` into structured documentation — I now include: -When I'm working on a documentation PR — where I'm translating source code comments or architecture notes into public-facing content — I now include in the description: - -- The explicit goal: "This PR translates the internal API annotations in `service.ts` into user-facing documentation in `docs/api.md`." -- The source of truth: "Source annotations are linked here: [link]. The documentation standard we're targeting is [link or quoted excerpt]." -- The scope boundary: "I am NOT documenting internal error codes or rate-limiting internals, which are covered in the internal runbook." -- The open questions: "I'm uncertain about the correct terminology for the authentication flow — I've used 'token exchange' but the source calls it both 'token exchange' and 'credential refresh.' What's preferred?" - -When the agent reviews that PR, it has everything it needs to do a meaningful review. It can check the translation against the source. It can evaluate whether the scope boundary is honored. It can engage with the open question. And when it approves — or raises concerns — it can explain why, because it has a clear contract to reason against. - -Here's what a PR description that functions as a contract looks like in practice: - -```markdown -## Goal - -Translate internal API annotations from `src/services/auth-service.ts` into user-facing -documentation at `docs/api/authentication.md`. - -## Source of truth - -- Internal annotations: `src/services/auth-service.ts` lines 45–120 -- Target format: existing pattern in `docs/api/storage.md` -- Relevant spec: Auth Flow Design doc (internal link) - -## Scope - -**In scope:** -- Public authentication methods (login, logout, token refresh) -- Error codes returned to callers -- Rate limiting behavior for the public API - -**Out of scope:** -- Internal token storage implementation -- Session management internals -- Admin-only endpoints - -## Open questions for reviewer +- **Goal**: which tools this PR covers, from which source file, to which output file +- **Source of truth**: a link to the specific commit in `microsoft/mcp` I'm translating from +- **Scope**: which tools are covered, and explicitly which are not (some tools are internal, some belong in a different article) +- **Open questions**: things I'm genuinely uncertain about, asked directly to the reviewer -1. Should the "token refresh" method document the 15-second retry window, or is - that considered an internal implementation detail? -2. The existing `storage.md` uses the phrase "connection handle" but the auth - service uses "session token." Should I align the terminology? - -## Files changed - -- `docs/api/authentication.md` — new file -- `docs/api/toc.json` — added entry -``` - -This description takes maybe ten minutes to write. What it returns is a review that can actually evaluate the translation, raise the right questions, and document its reasoning against a clear contract. That's a good trade. +When Statler reviews that PR, it checks the translation against the stated source. When Sam reviews it, it compares my descriptions against the upstream annotation text. When I have a real question — like whether `subscription` belongs in the params table — Sam looks at the pattern across the namespace and answers it rather than guessing. ![The pink-haired developer writing a structured PR description that unfurls like a scroll](./media/2026-05-22-agentic-pr-reviews/pr-description-as-contract.png) *When you write the PR description for a thoughtful reader — not a rubber stamp — the document becomes the first act of review.* -There's a second-order effect here that I find genuinely interesting. Writing a PR description at this level of detail is, itself, a form of self-review. In drafting the scope boundary, I've often noticed that something I'd planned to include was actually out of scope for this PR. In listing the open questions, I've sometimes answered them myself before the reviewer had a chance. The description-writing forces clarity that the review then validates. +There's a side effect of writing at this level of detail: I catch scope creep before submitting. While writing the scope section for a recent Azure Skills article PR, I noticed I'd edited three files that weren't in the goal statement. Two were legitimate scope expansions I should have documented. One was an accidental change I reverted before submitting. The description-writing does that. Writing for a thoughtful reader changes how you think about what you're doing. The rubber-stamp model lets you be vague because the reviewer will fill in the gaps. The contract model requires precision because the reviewer will hold you to it. --- -## Here's What I Discovered: The Reasoning Chain - -I want to spend some time on what a well-reasoned agent review actually looks like, because I think the gap between "automated review comment" and "reasoning-rich review comment" is larger than it sounds. +## What Sam's Review on PR #8648 Actually Said -An automated linter comment looks like this: +For context: PR #8648 covered the Azure MCP Server SQL tool articles. 7 tools across the SQL namespace, each needing a Description, Parameters table, and Example section. Source was the SQL namespace block in `tools-list.json` plus the tool-specific annotations in the upstream repo. -``` -Line 47: Missing semicolon. -Line 89: Unused variable 'tempData'. -Line 102: Function exceeds 50-line limit. -``` - -Useful. Mechanically correct. Not reasoning — just rule application. +Sam's review (fact-check pass): -An agent-led review comment with reasoning looks like this: - -> **Authentication Flow Translation — Review** -> -> I reviewed the translation of `auth-service.ts` annotations into `docs/api/authentication.md`. -> -> **What I checked:** -> - Compared each documented method against its source annotation -> - Verified error codes match the public-facing enum values -> - Checked format consistency against `docs/api/storage.md` +> **Sam — Fact Check** > -> **What I found:** -> The translation is accurate and the format is consistent. The scope boundary you defined -> is honored — internal token storage is not documented here. +> I compared all 7 tool descriptions against `tools-list.json` (commit `a4f9c31`). Six match exactly. One diverges: `azmcp_sql_database_query` — your description says "executes a SQL query against a specified database" but the upstream annotation says "runs an arbitrary SQL query." "Executes" vs "runs" is a style call, but "against a specified database" is an addition not in the source. Intentional clarification, or drift? > -> **On your open questions:** -> -> 1. *Token refresh retry window:* The 15-second window is referenced in the public API -> spec, so I'd consider it in scope for documentation. Users who are building retry -> logic need this. I'd suggest a note in the "Rate Limits" section. -> -> 2. *Terminology alignment:* The auth service is the newer of the two — "session token" -> is likely the forward direction. Worth noting in a comment that storage.md should -> be updated in a follow-on PR to align. -> -> **Overall:** Approved. The translation meets the stated goal and the scope is well-defined. +> Parameters tables are accurate with the exception flagged by Statler. Example sections are present for all 7 tools. -That comment is, at minimum, five times the information content of "LGTM." It tells me: -- What the reviewer looked at -- What they checked against -- What they found -- How they reasoned through my open questions -- Why they approved +Two things to notice: Sam cited the specific commit it compared against (`a4f9c31`), and it distinguished between minor word choice drift ("executes" vs "runs") and a substantive addition. "Against a specified database" is something I added that's not in the source — Sam is right to flag it as a question rather than a correction, because it might be a legitimate clarification for readers. -Every one of those five things is useful. Every one of them was absent from most reviews I'd received in my career before this. +I answered: yes, intentional — "runs an arbitrary SQL query" is technically accurate but unhelpful for someone trying to understand when to use this tool. The clarification stays. Sam closed the question. -### The Reviewer Brings Their Own Standards +Statler's review ran against a different standard: -One thing I had to get comfortable with: the agent reviewer might come from a different part of the codebase, a different team's standards, or a different domain entirely. This is especially true when the reviewer is configured to apply the standards of a specific content area or quality framework. - -At first, this felt uncomfortable. "Who configured this reviewer? What standards are they applying? Are these my standards?" +> **Statler — Adversarial Review** +> +> The `subscription` parameter appears in your parameter table for `azmcp_sql_server_list`. The SQL namespace tools receive subscription via the Azure plugin's ambient context — it shouldn't be listed as a required parameter because callers never pass it directly. Compare the pattern in `azmcp_storage_account_list`: same tool family, no subscription in the table. This will confuse readers looking for what to actually pass. +> +> Secondary: three of the seven Example sections use `node` as the devlang slug. Your other articles in this series use `javascript`. Pick one — readers filtering by language tab will get partial results otherwise. -Here's what I discovered: it doesn't actually matter. When the reviewer explicitly states the standards they're applying — "I'm checking this against the existing documentation format in `storage.md`" or "I'm evaluating the error codes against the public-facing enum" — you have enough information to engage with their reasoning. You can agree or disagree. You can say "that standard doesn't apply here because..." and have a real conversation about it. +I'd completely missed the devlang inconsistency. I copied an example from a different article in the series that used `node` and didn't notice. Statler caught it in a single pass. -The explicitness of the reasoning is what makes this possible. A reviewer who says "this doesn't meet our standards" without saying what standards, or how the PR fell short, gives you nothing to work with. A reviewer who says "this doesn't meet the format in `storage.md` because X, Y, Z" gives you a specific, actionable thing to respond to. +A linter doesn't catch that. A "LGTM" definitely doesn't. A human reviewer working through 14 articles in a sprint might not either, because the inconsistency breaks across files rather than within one. -And when the reviewer comes from a completely different part of the organization — when they're a domain specialist who doesn't know your codebase but knows the standards deeply — their explicit reasoning becomes doubly valuable. They're not just checking your work. They're showing you what their domain looks like from the inside. +That's what 14 comments across 3 files looks like when the reviewer is actually reading. --- -## The Feedback Loop That Didn't Exist Before +## How the Feedback Compounds -Here's the thing that surprised me most, and that I think is the most consequential long-term change: the feedback loop. +The subscription parameter question came up in three consecutive SQL tool PRs before I fixed it. Not because I forgot — I genuinely wasn't sure whether the pattern I was seeing was intentional or a gap in the existing docs. After the third flag, Sam verified the pattern across the full SQL namespace and I added a rule to my reviewer notes: "subscription, resource-group, and tenant-id are ambient context params in the SQL and Storage namespaces — do not include in parameter tables." -Before agent-led reviews, the feedback I received on a PR existed in roughly three states: - -1. **No feedback.** "LGTM." I had no idea what the reviewer checked or how the PR met or didn't meet quality standards. The next PR I wrote started from scratch. - -2. **Inconsistent feedback.** One reviewer cared deeply about test coverage. Another cared about naming conventions. A third would only engage with architecture decisions. Feedback was a function of who reviewed, not what standards existed. - -3. **Heated feedback.** Occasionally feedback arrived with emotional charge — frustration, strong opinions, implied or explicit criticism of past decisions. This was occasionally useful and often not, because emotional charge frequently occludes signal. - -In none of these states did feedback from one PR improve the starting condition for the next PR in any systematic way. You might learn something. You might not. It depended on whether the feedback was actionable and whether you retained it. - -Agent-led reviews create a different loop: +That note is now in my PR description template for the Azure MCP Server series. Statler stops flagging it. Sam stops verifying it. The question is answered and closed. ```mermaid graph LR @@ -248,202 +165,146 @@ graph LR E --> A ``` -*The feedback loop tightens over time. Each review's reasoning teaches the system — and the author — what "good" looks like for the next PR.* - -The key mechanism is the captured reasoning. When the agent explains why it approved, flagged, or questioned something, that reasoning becomes part of the persistent record. The *next time* I write a similar PR, I can look at what was flagged last time. I can see what the reviewer considered in scope and out of scope. I can pre-answer the questions I know will come up. +*Each review's reasoning teaches the system — and me — what "good" looks like for the next PR.* -More practically: the agent itself can be configured to learn from its own prior reviews. If I include prior review feedback in the reviewer's context window, the next review starts from a higher baseline. The reviewer knows what I've already addressed. The reviewer knows what standards we've already agreed on. The conversation picks up where it left off rather than starting from zero every time. +The loop only works if you maintain it. The agents don't automatically update my template. Roughly every five PRs, I spend fifteen minutes reading through the flags I've closed, identifying the ones that represent settled rules, and adding them to the reviewer notes section of the template. That's the maintenance cost. -Before agent-led reviews, this kind of accumulating knowledge existed only in the heads of long-tenured team members. It was real but inaccessible — tribal knowledge that you acquired by being in the right meetings, on the right threads, or by asking the right people the right questions. Agent-led reviews don't eliminate the need for domain expertise. But they do make the application of standards visible and repeatable in a way that tribal knowledge never was. +It adds up in the right direction, though. Before I had the subscription rule documented, I spent time on every SQL tools PR second-guessing whether I had it right. After? It's a known rule that gets checked automatically. That fifteen-minute update bought back maybe two hours of per-PR uncertainty across the series. I'll take that trade every time. -### The Compounding Effect - -I want to be specific about what "compounding" looks like in practice, because I think it's easy to hear "the feedback loop improves over time" and nod along without really internalizing what that means. - -Here's a concrete example. Six months ago, I was writing documentation PRs where I consistently under-specified the scope boundary. I'd say "documenting the auth service" without defining which parts of the auth service were in scope. Agent reviews would consistently ask: "Is the rate-limiting behavior in scope here?" Because this question appeared in three consecutive reviews, I added a scope section to my PR description template. The question stopped appearing. Not because the reviewer stopped caring about scope — because I started defining scope before the reviewer had to ask. - -That's the compounding effect. The first review surfaces a gap. The second review confirms it's a pattern. By the third review, you've internalized the standard and it no longer shows up as a gap. Your starting point has permanently moved. - -Before agent-led reviews, this kind of compounding happened too — but only if you had a consistent reviewer who was paying close enough attention to notice patterns across your PRs, who was willing to name those patterns explicitly, and who did so in a way you could hear and act on. That's a lot of conditions. In practice, it was rare. +A note on where this doesn't compound cleanly: the agents don't catch conceptual errors. If I've misunderstood what a tool does at the API level, Sam will confirm that my description matches my misunderstanding. That requires a human who knows the Azure SQL namespace to catch — someone like akromm who's actually used these tools. Agent reviews and human reviews are doing different jobs. The agent handles pattern consistency and fact-checking against documented sources; the human catches the gaps between what's documented and what's true. --- ## Professional by Default -I want to spend time on tone, because I think it's systematically undervalued in discussions about code review culture. We talk a lot about what feedback should contain. We talk less about how it should be delivered. And the "how" matters more than most people admit. +Some of the most technically accurate feedback I've received in my career was delivered in ways that made it hard to hear. All-caps emphasis, phrasings that implied I should have known better. That delivery activates something in you that competes with the information transfer. You spend effort managing your response to the tone instead of engaging with the substance. -Here's my honest experience: some of the most technically accurate feedback I've received in my career was delivered in ways that made it hard to hear. Not because I was being defensive. Because the delivery carried emotional charge that got in the way of the signal. All-caps emphasis. Exclamation points. Phrasings that implied I should have known better. These things aren't neutral. They activate a defensive response — in me and in most people — that interferes with the actual information transfer. +Agent reviews are impersonal by default. Not cold — they engage with the content directly. But Statler doesn't get frustrated with the third consecutive PR that has the same devlang issue. Sam doesn't have strong opinions about "executes" vs "runs" that spill into how it phrases the flag. Scooter doesn't passively approve things because it's tired. -Agent reviews are, by default, professional and impersonal. Not cold. Not bureaucratic. Just... not personal. The agent doesn't have a bad morning. It doesn't get frustrated with the third documentation PR in a row that forgets to include the scope section. It doesn't have strong feelings about your choice of variable names that spill over into how it phrases its comments. +The practical effect: I read agent review comments at face value. When Statler raises an issue, it's because a pattern check fired. When Sam flags a divergence, it's because the source doesn't match. There's no subtext to parse. The only question is whether the flag is right. If yes, fix it. If no, explain why. That's the whole exchange. -This consistency creates something I've started to think of as psychological safety in the review process. Not the big kind of psychological safety that org-culture consultants talk about — the feeling that it's safe to take risks, to be wrong, to speak up. More specific than that. The safety to read the review comments without bracing for impact. The safety to take the feedback at face value rather than parsing it for embedded emotion. - -When feedback is impersonal, you can engage with it on its merits. When it's personal — when the phrasing tells you the reviewer is irritated, or rushed, or has a horse in this race — you spend cognitive resources managing the relationship rather than the feedback. That's a tax on every human review, and agent reviews eliminate it. +There's a real gap here, though: agents don't know your team's unwritten context. When a decision involves an architectural mistake being corrected incrementally, or a stakeholder preference that lives in someone's head, an agent will flag it correctly — the pattern diverges — but can't know why. My practice when I'm overriding a flag deliberately: explain it in the PR description. Not for the agent — for the humans who read the PR history later. ![The pink-haired developer reading review feedback calmly in a serene PNW garden while others show review frustration](./media/2026-05-22-agentic-pr-reviews/psychological-safety-review.png) *When feedback is professional by default, you can engage with the content rather than managing the relationship.* -### Agents Ask Questions Rather Than Making Assumptions - -One more thing on tone: agent reviews, in my experience, tend to handle ambiguity by asking rather than assuming. A human reviewer who sees something they don't understand might mark it as wrong. They might leave a terse comment that implies the author made an error. They might approve with a silent mental note of "well, this is their call." - -Agent reviews tend toward explicit inquiry. "I'm uncertain whether the rate-limiting behavior is intended to be documented here — can you confirm the scope intention?" That's different from "this doesn't belong here" or silent approval. It's an invitation to clarify, which is almost always the right response to genuine ambiguity. - -This matters practically because documentation work — which is a lot of what I do — involves many legitimately ambiguous scope decisions. Whether a particular behavior is a user-facing feature or an implementation detail isn't always obvious from the source. Whether a particular error code should be documented for callers or hidden behind an opaque response is a real design question. Agents surface these questions. They don't resolve them unilaterally. +One more thing on tone: agents ask rather than assume when something is ambiguous. A human reviewer who sees something they don't understand might mark it wrong, or leave a comment that implies the author made a mistake, or approve silently with a mental note. Sam says "is this intentional clarification or drift?" That's an invitation to confirm, not a verdict. For documentation work where many scope decisions are legitimately ambiguous — whether a behavior is user-facing or internal, whether a parameter should be in the table or treated as ambient — getting a question is almost always the right response. --- -## Captured in Perpetuity - -There's a dimension of agent-led PR reviews that I didn't appreciate until I found myself going back through old PRs looking for the reasoning behind a past decision. Let me describe what that experience looks like in the before-and-after. - -**Before:** I'm looking at a documentation file that handles authentication. I see a note that says "token refresh is not documented here." I have no idea why. Was it intentional? Was it a mistake? Was it a scope decision that someone made deliberately, or was it just something that got missed? The PR that introduced this file is in the history. The description says "Add authentication docs." The review says "LGTM." I have nothing. - -**After:** Same scenario. The PR is in the history. The description includes a scope section that says "internal token refresh timing is not documented here — this is implementation-specific behavior covered in the internal runbook." The review comment says "I confirmed the scope boundary is honored. The token refresh timing is not documented per the scope definition. This aligns with the existing pattern in storage.md." Now I know. Not only did someone make this decision deliberately — I know exactly what they were thinking, what they compared it against, and what they concluded. - -That's not a minor difference. That's the difference between archaeology and documentation. +## Why I Can Audit Every Decision From the Last Year ![The pink-haired developer viewing a preserved PR review in a museum-style display case, with winter Cascades in the distance](./media/2026-05-22-agentic-pr-reviews/captured-in-perpetuity.png) *A year later, the PR is still there. The reasoning is still there. The "why" — the hardest thing to recover — is captured in the record.* -### What Gets Lost Over Time +Last week I needed to know why the Azure MCP Server Cosmos DB article doesn't document the `query_items_with_partition_key` behavior as a separate tool. I went back to the PR. The scope section said: "Cosmos DB tools in this PR are limited to the publicly-listed namespace entries in `tools-list.json`. The partition key query behavior is exposed through `azmcp_cosmos_query` and is not a separate tool entry." Sam's review confirmed it: "partition key query is a parameter pattern of `azmcp_cosmos_query`, not a separate tool — correct to omit." -Code comments change. Or rather — code gets refactored, and the comments that were accurate a year ago are now describing a function that no longer exists in the same form. Architecture diagrams drift. The diagram from eighteen months ago shows three services; you now have seven. The diagram is technically still there, but it's lying to you. +Thirty seconds to verify. Without that review reasoning, I'd be digging through commit messages and Slack threads trying to reconstruct intent from behavior. -PR feedback is different because it's a snapshot in time. It doesn't drift. It captures the state of thinking at the moment the change was made. A PR from three years ago that says "we're not documenting this behavior because it's considered internal" is useful even if that decision has since been revisited — because it tells you that the decision was *made*, that someone was deliberate about it, and what the reasoning was at the time. +The before/after is stark. **Before:** I'm looking at a documentation file that handles authentication. I see a note that says "token refresh is not documented here." I have no idea why. Was it intentional? Missed? A scope decision? The PR that introduced this file says "Add authentication docs." The review says "LGTM." I have nothing. -The "why" is the hardest thing to recover in any software system. Code tells you what. Tests tell you what was expected. Architecture documents tell you what the intended design was. Almost nothing tells you why a specific decision was made, at a specific moment, by a specific person, with specific constraints in mind. +**After:** Same scenario. The PR description has a scope section: "internal token refresh timing is not documented here — this is implementation-specific behavior covered in the internal runbook." The review says "scope boundary honored. Token refresh timing omitted per scope definition. Aligns with pattern in storage.md." Now I know what was decided, why, and what it was compared against. -Agent-led PR reviews, when they're doing their job, are capturing the "why" in a structured, queryable form. Not perfectly. Not always. But systematically, in a way that "LGTM" culture never did. +That's not a small difference. That's the difference between archaeological guesswork and documentation you can actually use. -I think about this like the tidal records kept at the NOAA station on Bellingham Bay. The station doesn't tell you what the tide will do. It tells you what the tide has done — the exact measurements, hour by hour, for decades. That record is the foundation for every prediction, every dredging schedule, every coastal engineering decision that comes after. The record doesn't have opinions. It just has the data, accurate to when it was taken. +Code comments go stale. Architecture diagrams drift — the diagram from eighteen months ago shows three services; now there are seven. PR reasoning doesn't drift. It captures the state of thinking at the moment the change was made. A PR from eight months ago that says "this scope decision was intentional, here's why" is useful even if the decision has since been revisited, because it tells you the decision was deliberate. -PR reasoning is that tidal record for your codebase. The agent doesn't know what decisions you'll make a year from now. But it's making sure that when you get there, you have the measurements. +The "why" is the hardest thing to recover in any software system. Code tells you what. Tests tell you what was expected. Almost nothing tells you why a specific decision was made, with specific constraints in mind, at a specific moment. Agent-led reviews capture it as a byproduct of doing the review. Not perfectly, not always — but as a default rather than only when someone had time to be thorough. ---- - -## The Author Mindset Shift - -I mentioned this briefly in the section on PR descriptions, but I want to return to it more directly because I think it's the most personal part of this whole change, and the hardest to quantify. - -Before agent-led reviews, I wrote PRs for a rubber stamp. That's not a criticism of my past self — it was a rational response to the incentive structure. If most reviews are going to be "LGTM," then the marginal value of a detailed PR description is low. You might as well write just enough to remind yourself what you were doing, and a little more to give the reviewer a fighting chance. + +![The pink-haired developer at the Nooksack River at dusk, holding a lantern beside a marked post — a feeling of something preserved](./media/2026-05-22-agentic-pr-reviews/nooksack-preserved-reasoning.png) -When you know that the reviewer will read carefully, reason explicitly, and hold your scope definition to account — you write differently. Not more formally. More *deliberately*. You think about what the reviewer needs to know, rather than what you want to tell them. You think about the questions they'll ask, and you either answer them preemptively or you flag them explicitly. You think about the scope boundary before you're asked to defend it. +*The reasoning, once captured, doesn't drift. It stays exactly where you left it — a marker in the record that future you will actually be able to read.* -This is the mindset shift: from writing for a rubber stamp to writing for a thoughtful reader. +--- -And here's the interesting thing — writing for a thoughtful reader makes you a better author, even before the review. Because a lot of the things a thoughtful reader would want to know are things you probably should have been thinking about anyway. The scope boundary. The source of truth. The open questions. These aren't bureaucratic overhead. They're the questions that determine whether the PR actually does what it's supposed to do. +## Writing for a Thoughtful Reviewer Changes How You Think ![The pink-haired developer writing a structured PR description overlooking Bellingham Bay with a lighthouse in the distance](./media/2026-05-22-agentic-pr-reviews/author-mindset-bay-view.png) *Writing for a thoughtful reader changes your relationship with your own work — you ask the questions before they're asked of you.* -### The Shared Language Between Author and Reviewer +Before agent reviews, I wrote PRs for a rubber stamp. That's rational. If most reviews are "LGTM," a detailed description doesn't get you much. You write just enough to remind yourself what you were doing. -There's a related benefit that I didn't expect: over time, working with the same reviewer configuration builds a shared vocabulary. Not just "use this term instead of that one." A shared understanding of what counts as in-scope documentation, what belongs in the limitations section versus the overview, what level of implementation detail is appropriate for different audiences. +When you know Sam will compare your descriptions against `tools-list.json` at commit `a4f9c31`, you check that comparison yourself first. Not every time — that would defeat the purpose. But it shifts what you pay attention to while you write. You think about what the reviewer needs to know rather than what you want to tell them. -This shared language is something you'd traditionally acquire by working closely with a human expert — by getting consistent feedback from the same person over many PRs, over a long period, until you internalized their standards. Most people don't have that. The expert who has the deep standards is often not the person reviewing your PRs. And if they are, they're reviewing many other PRs too, and consistency is a casualty of that workload. +A lot of the things a thoughtful reviewer wants to know are things you should have been thinking about anyway. The scope boundary isn't bureaucratic overhead. It's the question that determines whether the PR does what it's supposed to do. Spelling that out before you submit means catching it yourself rather than in a review comment. -Agent-led reviews give you something like that consistency at scale. The reviewer applies the same standards on your tenth PR as on your first. It doesn't forget what it decided on PR seven. It doesn't have an off day where it approves things it would normally flag. +There's something that builds over time too. Working with the same reviewer configuration across the Azure MCP Server series has created a shared vocabulary — not just terminology, but a shared understanding of what counts as an in-scope parameter, what belongs in a Parameters table versus an Example section, when "Required" means required by the tool vs. required by the namespace context. That vocabulary now lives in the reviewer notes in my PR template. A new article starts from that baseline instead of from zero. -This is not the same as having a wise mentor who knows your work deeply. I want to be careful not to oversell it. But it's a consistent source of standards application that most knowledge workers have never had access to, and the compounding effect of that consistency is real. +You'd traditionally build that kind of shared vocabulary by working closely with a human expert over many PRs, until you internalized their standards. Most people don't have access to that. The expert who has the deep standards isn't always reviewing your PRs. And when they are, consistency is a casualty of a full workload. The same reviewer standards applied to every PR, with no variance based on who's available, builds the vocabulary faster than occasional reviews from domain experts who are also managing six other priorities. --- -## My Perspective: What Still Doesn't Work - -I've been arguing for the value of agent-led reviews for several thousand words now, and I want to make sure I'm being honest about the limitations. Because there are real ones. - -**The agent doesn't know your team's politics.** Sometimes a PR decision involves factors that aren't in the code or the description — a past architectural mistake that's being incrementally unwound, a stakeholder preference that isn't documented, a known technical debt that everyone has tacitly agreed to live with. Human reviewers who are embedded in the team know these things. Agents work with what's visible. If the invisible context is what makes a decision make sense, the agent will flag the decision as questionable, and it won't necessarily be wrong to do so — but the reasoning will be incomplete. +## What Actually Doesn't Work -**The agent doesn't replace the domain expert.** If you're writing code that has safety implications, security requirements, or domain-specific correctness criteria that aren't captured in the review configuration, the agent isn't going to catch those gaps. An agent reviewing a cryptographic implementation doesn't know what it doesn't know about cryptography. A human expert does. +Some things still need a human. -**The feedback loop only works if you build it.** The compounding effect I described — where each review improves the starting point for the next PR — doesn't happen automatically. You have to put the prior feedback into the context. You have to update your PR description templates. You have to review the review. The agent creates the conditions for the feedback loop. You have to run it. +Agents don't catch conceptual errors against the truth. Sam verifies my descriptions match the source text. Sam doesn't know if the source text is wrong. If a tool description in `tools-list.json` is misleading about what the tool actually does at runtime, Sam will confirm that my article matches the misleading description. That requires someone who's actually used the tool. -**Some things need human conversation.** There are review conversations that aren't really about the code. They're about priorities, about direction, about what we're trying to do and whether this PR is doing it. Those conversations require a human. They require the kind of implicit shared understanding and real-time back-and-forth that a review comment thread, however thorough, can't fully replicate. +The feedback loop only works if you maintain it. The agents don't update my template. That's a fifteen-minute task every five or so PRs, but it's manual, and if I skip it for a few weeks the same flags start showing up again. The system runs on the work I put into it. -My perspective is that agent-led reviews don't replace the high-value human review conversations. They replace the low-value ones — the rubber stamps, the inconsistently-applied standards, the feedback that should have been there but wasn't. That's a good trade. It frees human reviewers to engage with the genuinely complex and contextual questions, because the mechanical and standards-based review work is covered. +Some conversations need to happen between people. When a PR decision involves content strategy — whether a new Azure service should get its own article or be folded into an existing one, whether a pattern I've established for the MCP Server series should apply to the Skills articles — that's not a fact-check question or a pattern violation. That's a judgment call about direction and it needs a real conversation. -But I want to be clear: it's a complement, not a replacement. The teams I see getting the most value from agent-led reviews are the ones that use them to elevate human review time, not eliminate it. +My rule of thumb: agent reviews on every documentation PR. Human review on anything involving a judgment call about scope, a new content pattern I haven't done before, or a tool area where I know I'm working outside my knowledge. The agent reviews handle the mechanical consistency work so the human review time goes to what actually needs human judgment. --- -## Reflection: One Year of Captured "Why" - -Here's the part I think about most. - -If I go back through the PRs from the last year — the ones where agent reviews were part of the workflow — I can reconstruct the reasoning behind almost every significant decision. I know why the authentication documentation is scoped the way it is. I know why the terminology settled on "session token" instead of "connection handle." I know which rate-limiting behaviors were considered user-facing and which were treated as internal. I know when those decisions were made and what the alternatives were. - -That's not a small thing. It's closer to having a design history than to having a git log. +## The Template I Actually Use Now -The git log tells me what changed. The PR descriptions tell me what was intended. The agent review comments tell me what was evaluated, what standards applied, what questions were asked and how they were resolved. Together, those three layers give me something close to a complete picture of the thinking behind the codebase at any moment in its history. - -I've been a developer long enough to know how rare that is. Most codebases are archaeological sites. You dig down and find layers you can't date and decisions you can't explain. You learn to reverse-engineer intent from behavior, which is slow and unreliable and frequently wrong. - -A codebase with good PR hygiene — good descriptions, good reviews, captured reasoning — is something different. It's legible. Not just to you, right now, but to anyone who comes after you. The institutional knowledge stops being locked in people's heads and starts being accessible in the record. - -Agent-led reviews didn't create this possibility. But they made it systematic. They made it the default rather than the exception. They made it something that happens every PR rather than on the rare occasions when someone had the time and clarity to write a thorough review. - - -![The pink-haired developer at the Nooksack River at dusk, holding a lantern beside a marked post — a feeling of something preserved](./media/2026-05-22-agentic-pr-reviews/nooksack-preserved-reasoning.png) - -*The reasoning, once captured, doesn't drift. It stays exactly where you left it — a marker in the record that future you will actually be able to read.* - ---- - -## Next Steps: How I'm Changing My Workflow - -All of this reflection has produced some concrete changes in how I work, and I'll share them here in case they're useful as a starting point. - -**The PR description template.** I've built a description template that includes the goal, the source of truth, the scope definition (in and out), and a section for open questions. I use it for every meaningful PR. It takes ten minutes and returns consistent, higher-quality reviews. Here's the rough shape: +This is the PR description template I've built over ~60 documentation PRs in the last six months. It's gone through several iterations. This is the version I stopped changing. ```markdown ## Goal -[One or two sentences: what does this PR accomplish?] + +[One or two sentences: what does this PR accomplish and which source does it translate from?] ## Source of truth -[Links or quotes from the canonical source for this change] + +- Source file: [link to specific commit in upstream repo] +- Target format: [link to an existing article that matches the expected pattern] +- Reviewer notes: [settled decisions from prior PRs in this series, if any] ## Scope **In scope:** -- [Explicit list] +- [Specific tools, sections, or files covered] **Out of scope:** -- [Explicit list — this is the one most people skip] +- [Things intentionally excluded — don't skip this section] ## Open questions for reviewer -- [Any genuine ambiguities you want the reviewer to engage with] - -## Files changed -- [Brief description of each file's role in the PR] -``` -**The prior-review context injection.** For significant projects where I'm writing a series of related PRs, I include a section in the description that summarizes the review decisions from previous PRs: "In the previous PR, we agreed that implementation-specific error codes would not be documented here. This PR follows the same convention." This gives the reviewer context about decisions already made, prevents re-litigating settled questions, and makes the compounding effect of the feedback loop explicit. +- [Specific things I'm uncertain about, with enough context to actually answer them] -**The review digest.** When a PR gets a substantive agent review, I take five minutes after it's merged to summarize the key decisions in a running document for the project. Not every comment — just the ones that represent settled standards or significant choices. This becomes the input for future PR descriptions and future reviewer configurations. +## Files changed -**Trusting the professional tone.** I've learned to read agent review comments at face value rather than searching them for subtext. When the agent asks a question, it's asking because it's genuinely uncertain, not because it's implying I made a mistake. When it flags something, it's flagging it against a stated standard, not passing judgment. This sounds simple, but it took a while to fully internalize after years of reading human review comments that had layers. +- `[filename]` — [one-line description of what changed and why] +``` -If you're just starting to work with agent-led reviews, my suggestion is: invest in the PR description first. The quality of the review you get scales almost linearly with the quality of the information you provide. Start there, and the rest follows. +The "Reviewer notes" field in Source of truth is where settled decisions live. My current notes for the Azure MCP Server series: ---- +> subscription, resource-group, and tenant-id are ambient context params in SQL and Storage namespaces — do not include in parameter tables. Devlang slug is `javascript` not `node` for all JS/TS examples in this series. Tool descriptions should match upstream annotation text exactly; any additions or clarifications require an open question to the reviewer. -## Closing: The Fog Lifts +Those three lines prevent roughly 8–10 agent comments per PR. They came directly from Statler and Sam's flags across the first three SQL tool PRs, turned into documented rules. -I started this post with the image of driving along Bellingham Bay in the fog — knowing the lighthouse is out there but unable to use its signal. That fog was what most PR review culture felt like to me for a long time. The expertise existed. The standards existed. But they weren't in a form you could work with, and they weren't reliably shared. +To run the reviews with Squad v0.9.4 via Copilot CLI: -Agent-led reviews don't claim to be more expert than the human reviewers they're working alongside. They're not. They don't have the deep domain knowledge, the organizational context, or the judgment that comes from years of experience with a specific codebase. What they do have is consistency, explicit reasoning, and a commitment to capturing the "why" in a form that persists. +```bash +copilot --agent squad +``` -That turns out to be worth more than I expected. +Then in the session: -The fog doesn't completely lift. But the lighthouse is legible now. The signal is there. And twelve months from now, when I come back to a decision I made today, I'll be able to read exactly why I made it, what standards I held it against, and what questions I was trying to answer. +``` +Sam, Statler, Scooter — review the PR branch in emu-microsoftdocs-azure-dev-docs-pr +``` -That's not nothing. That's, actually, quite a lot. +The agents run in sequence, post their findings, and I work through the flags before pushing for human review. Agent review plus fixes: about 30 minutes per PR. That's faster than waiting for a human reviewer to have time, and what I bring to the human review is cleaner because the pattern and fact-check issues are already resolved. -*How are you using agent-led reviews in your workflow? I'd genuinely like to know — especially if you've found patterns for the pieces I've described as still not working. You can find me on GitHub at [dfberry](https://github.com/dfberry).* +Before that setup, I was getting back LGTM or silence. Now I'm getting back 14 specific, accurate comments with reasoning. The comments are better than what I was getting from human reviews on most PRs — not because Sam and Statler are smarter than people, but because they read every line, every time, and they tell me what they found. From bb42f5ad2f64be456ab6832443302144dafb0ae6 Mon Sep 17 00:00:00 2001 From: "Dina Berry (She/her)" Date: Fri, 22 May 2026 08:13:13 -0700 Subject: [PATCH 3/6] blog: anonymize Microsoft-specific references Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- website/blog/2026-05-22-agentic-pr-reviews.md | 26 +++++++++---------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/website/blog/2026-05-22-agentic-pr-reviews.md b/website/blog/2026-05-22-agentic-pr-reviews.md index e3e8a72..4c995c3 100644 --- a/website/blog/2026-05-22-agentic-pr-reviews.md +++ b/website/blog/2026-05-22-agentic-pr-reviews.md @@ -33,7 +33,7 @@ keywords: PR #8648 had 14 review comments across 3 files. Statler found them. -I was documenting the Azure MCP Server's SQL tools — translating entries from `tools-list.json` into parameter tables for the Microsoft Learn article. Standard work. I've done around 40 of these tool articles at this point. I submitted the PR and Statler, my adversarial Squad agent, came back with this: +I was documenting the tool server's SQL tools — translating entries from `tools-list.json` into parameter tables for the documentation platform article. Standard work. I've done around 40 of these tool articles at this point. I submitted the PR and Statler, my adversarial Squad agent, came back with this: > **Statler — Adversarial Review** > @@ -45,7 +45,7 @@ That comment isn't "missing semicolon." It's a specific claim about a pattern ac I fixed the table and replied: "Confirmed — subscription, resource-group, and devlang slug formatting follow the same ambient pattern across storage, SQL, and container apps. Adding to reviewer notes for the series." Sam (my fact-checking agent) verified the pattern against the upstream `tools-list.json`. The PR merged with one exchange and no ambiguity. -The same PR also came back with a second Statler flag I hadn't noticed at all: three of my seven Example sections used `node` as the devlang slug while the rest of the Azure MCP Server series uses `javascript`. It would have built fine. Readers filtering by language tab would have gotten partial results. +The same PR also came back with a second Statler flag I hadn't noticed at all: three of my seven Example sections used `node` as the devlang slug while the rest of the tool server documentation series uses `javascript`. It would have built fine. Readers filtering by language tab would have gotten partial results. 14 comments. 3 files. Every flag was right. @@ -74,11 +74,11 @@ What I wanted — and I think most people want — is a reviewer who reads caref My review setup uses [Squad v0.9.4](https://github.com/bradygaster/squad) running on the Copilot CLI. Three agents run on documentation PRs: -**Sam** (Fact Checker) compares my claims against the source. For Azure MCP Server articles, that means checking parameter names, types, and descriptions against the upstream `tools-list.json`, and verifying that tool descriptions match the annotation text in the source repo. Sam approved PR #8648 after confirming the corrected subscription removal matched the pattern in the upstream JSON — and flagged a minor drift in one tool description where I'd added "against a specified database" that wasn't in the source text. +**Sam** (Fact Checker) compares my claims against the source. For the tool server articles, that means checking parameter names, types, and descriptions against the upstream `tools-list.json`, and verifying that tool descriptions match the annotation text in the source repo. Sam approved PR #8648 after confirming the corrected subscription removal matched the pattern in the upstream JSON — and flagged a minor drift in one tool description where I'd added "against a specified database" that wasn't in the source text. **Statler** (Adversarial) looks for what will confuse readers, break patterns, or create maintenance problems. Statler caught both the subscription parameter issue and the devlang slug inconsistency on PR #8648. On a recent Azure Skills article PR, Statler caught a parameter marked as "Required" in my table that the CLI JSON schema had as "Optional" — a discrepancy I'd introduced by following the pattern from an older article rather than checking the current schema. -**Scooter** (Quality) checks structure: do headings follow the expected pattern, is the frontmatter complete, do parameter tables have the right columns, are there missing Example sections. For the Azure MCP Server series, that means verifying every tool article has a Description, Parameters table, and Example section in the right order. +**Scooter** (Quality) checks structure: do headings follow the expected pattern, is the frontmatter complete, do parameter tables have the right columns, are there missing Example sections. For the tool server documentation series, that means verifying every tool article has a Description, Parameters table, and Example section in the right order. Sam goes first (verify facts), Statler second (find problems), Scooter last (check structure). If Sam fails, the others still run — I want all the feedback, not a fail-fast stop. @@ -98,10 +98,10 @@ My old mental model: a PR description is a summary so the reviewer knows roughly My current model: a PR description is a goal statement and a scope contract. It defines what "done" looks like, what the reviewer should evaluate, and what the author is explicitly not doing. Writing it is the first act of review. -For Azure MCP Server tool articles — where I'm translating entries from `tools-list.json` into structured documentation — I now include: +For tool server articles — where I'm translating entries from `tools-list.json` into structured documentation — I now include: - **Goal**: which tools this PR covers, from which source file, to which output file -- **Source of truth**: a link to the specific commit in `microsoft/mcp` I'm translating from +- **Source of truth**: a link to the specific commit in `upstream-repo` I'm translating from - **Scope**: which tools are covered, and explicitly which are not (some tools are internal, some belong in a different article) - **Open questions**: things I'm genuinely uncertain about, asked directly to the reviewer @@ -120,13 +120,13 @@ Writing for a thoughtful reader changes how you think about what you're doing. T ## What Sam's Review on PR #8648 Actually Said -For context: PR #8648 covered the Azure MCP Server SQL tool articles. 7 tools across the SQL namespace, each needing a Description, Parameters table, and Example section. Source was the SQL namespace block in `tools-list.json` plus the tool-specific annotations in the upstream repo. +For context: PR #8648 covered the SQL tool documentation. 7 tools across the SQL namespace, each needing a Description, Parameters table, and Example section. Source was the SQL namespace block in `tools-list.json` plus the tool-specific annotations in the upstream repo. Sam's review (fact-check pass): > **Sam — Fact Check** > -> I compared all 7 tool descriptions against `tools-list.json` (commit `a4f9c31`). Six match exactly. One diverges: `azmcp_sql_database_query` — your description says "executes a SQL query against a specified database" but the upstream annotation says "runs an arbitrary SQL query." "Executes" vs "runs" is a style call, but "against a specified database" is an addition not in the source. Intentional clarification, or drift? +> I compared all 7 tool descriptions against `tools-list.json` (commit `a4f9c31`). Six match exactly. One diverges: `toolserver_sql_database_query` — your description says "executes a SQL query against a specified database" but the upstream annotation says "runs an arbitrary SQL query." "Executes" vs "runs" is a style call, but "against a specified database" is an addition not in the source. Intentional clarification, or drift? > > Parameters tables are accurate with the exception flagged by Statler. Example sections are present for all 7 tools. @@ -154,7 +154,7 @@ That's what 14 comments across 3 files looks like when the reviewer is actually The subscription parameter question came up in three consecutive SQL tool PRs before I fixed it. Not because I forgot — I genuinely wasn't sure whether the pattern I was seeing was intentional or a gap in the existing docs. After the third flag, Sam verified the pattern across the full SQL namespace and I added a rule to my reviewer notes: "subscription, resource-group, and tenant-id are ambient context params in the SQL and Storage namespaces — do not include in parameter tables." -That note is now in my PR description template for the Azure MCP Server series. Statler stops flagging it. Sam stops verifying it. The question is answered and closed. +That note is now in my PR description template for the tool server documentation series. Statler stops flagging it. Sam stops verifying it. The question is answered and closed. ```mermaid graph LR @@ -201,7 +201,7 @@ One more thing on tone: agents ask rather than assume when something is ambiguou *A year later, the PR is still there. The reasoning is still there. The "why" — the hardest thing to recover — is captured in the record.* -Last week I needed to know why the Azure MCP Server Cosmos DB article doesn't document the `query_items_with_partition_key` behavior as a separate tool. I went back to the PR. The scope section said: "Cosmos DB tools in this PR are limited to the publicly-listed namespace entries in `tools-list.json`. The partition key query behavior is exposed through `azmcp_cosmos_query` and is not a separate tool entry." Sam's review confirmed it: "partition key query is a parameter pattern of `azmcp_cosmos_query`, not a separate tool — correct to omit." +Last week I needed to know why the Cosmos DB tool article doesn't document the `query_items_with_partition_key` behavior as a separate tool. I went back to the PR. The scope section said: "Cosmos DB tools in this PR are limited to the publicly-listed namespace entries in `tools-list.json`. The partition key query behavior is exposed through `toolserver_cosmos_query` and is not a separate tool entry." Sam's review confirmed it: "partition key query is a parameter pattern of `toolserver_cosmos_query`, not a separate tool — correct to omit." Thirty seconds to verify. Without that review reasoning, I'd be digging through commit messages and Slack threads trying to reconstruct intent from behavior. @@ -235,7 +235,7 @@ When you know Sam will compare your descriptions against `tools-list.json` at co A lot of the things a thoughtful reviewer wants to know are things you should have been thinking about anyway. The scope boundary isn't bureaucratic overhead. It's the question that determines whether the PR does what it's supposed to do. Spelling that out before you submit means catching it yourself rather than in a review comment. -There's something that builds over time too. Working with the same reviewer configuration across the Azure MCP Server series has created a shared vocabulary — not just terminology, but a shared understanding of what counts as an in-scope parameter, what belongs in a Parameters table versus an Example section, when "Required" means required by the tool vs. required by the namespace context. That vocabulary now lives in the reviewer notes in my PR template. A new article starts from that baseline instead of from zero. +There's something that builds over time too. Working with the same reviewer configuration across the tool server documentation series has created a shared vocabulary — not just terminology, but a shared understanding of what counts as an in-scope parameter, what belongs in a Parameters table versus an Example section, when "Required" means required by the tool vs. required by the namespace context. That vocabulary now lives in the reviewer notes in my PR template. A new article starts from that baseline instead of from zero. You'd traditionally build that kind of shared vocabulary by working closely with a human expert over many PRs, until you internalized their standards. Most people don't have access to that. The expert who has the deep standards isn't always reviewing your PRs. And when they are, consistency is a casualty of a full workload. The same reviewer standards applied to every PR, with no variance based on who's available, builds the vocabulary faster than occasional reviews from domain experts who are also managing six other priorities. @@ -287,7 +287,7 @@ This is the PR description template I've built over ~60 documentation PRs in the - `[filename]` — [one-line description of what changed and why] ``` -The "Reviewer notes" field in Source of truth is where settled decisions live. My current notes for the Azure MCP Server series: +The "Reviewer notes" field in Source of truth is where settled decisions live. My current notes for the tool server documentation series: > subscription, resource-group, and tenant-id are ambient context params in SQL and Storage namespaces — do not include in parameter tables. Devlang slug is `javascript` not `node` for all JS/TS examples in this series. Tool descriptions should match upstream annotation text exactly; any additions or clarifications require an open question to the reviewer. @@ -302,7 +302,7 @@ copilot --agent squad Then in the session: ``` -Sam, Statler, Scooter — review the PR branch in emu-microsoftdocs-azure-dev-docs-pr +Sam, Statler, Scooter — review the PR branch in content-repo ``` The agents run in sequence, post their findings, and I work through the flags before pushing for human review. Agent review plus fixes: about 30 minutes per PR. That's faster than waiting for a human reviewer to have time, and what I bring to the human review is cleaner because the pattern and fact-check issues are already resolved. From 803453ff6d23db4491a935f244c41cc59478cb31 Mon Sep 17 00:00:00 2001 From: "Dina Berry (She/her)" Date: Fri, 22 May 2026 08:16:19 -0700 Subject: [PATCH 4/6] blog: add cross-team agent review section Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- website/blog/2026-05-22-agentic-pr-reviews.md | 43 +++++++++++++++++++ 1 file changed, 43 insertions(+) diff --git a/website/blog/2026-05-22-agentic-pr-reviews.md b/website/blog/2026-05-22-agentic-pr-reviews.md index 4c995c3..6a24945 100644 --- a/website/blog/2026-05-22-agentic-pr-reviews.md +++ b/website/blog/2026-05-22-agentic-pr-reviews.md @@ -194,6 +194,49 @@ One more thing on tone: agents ask rather than assume when something is ambiguou --- +## When Someone Else's Agent Reviews Your PR + +Everything so far has been about my agents reviewing my work. That's one direction. + +The other direction is when your PR lands in a queue owned by a different team — one with their own standards, their own patterns, their own accumulated history of what "correct" means for their content area. Before agent reviews, that came back as "LGTM," or a one-liner with no context, or silence. You had no idea what the reviewer actually checked or what standard you were being held to. + +When another team's agent reviews your PR, you get to see the standards. + +A recent PR went to a platform team for review. Their agent came back with this: + +> **Platform Reviewer — Standards Check** +> +> Parameter table uses "server name" in the Description column. The convention in this content area is "resource name" for all Azure resource identifiers, consistent with the Compute and Storage namespace articles. See: `storage_account_list`, `vm_list`. +> +> This inconsistency will create problems for readers switching between namespaces and may conflict with automated tooling that normalizes on "resource name." + +I didn't know that convention existed. There's no onboarding doc that says "resource name, not server name, for Azure resource identifiers." That standard lived in the platform team's collective head — or, now, in their reviewer notes. + +Before the agent, I'd have gotten "LGTM" or a terse inline comment: "resource name." No explanation. No comparison point. I'd fix it for that PR and still not understand the rule. The next PR would have the same issue. I'd fix it again. We'd repeat this indefinitely, neither side getting anything durable from the exchange. + +The agent gave me the rule. I added it to my template's reviewer notes. That exchange happened once. + +--- + +This is the part that gets undersold in conversations about AI-assisted reviews: the transfer of standards across team boundaries. + +When you're not the subject matter expert and your PR lands with a team that is, you're dependent on their review being informative. "LGTM" from an SME is the worst possible outcome — they looked, they approved, and you learned nothing about whether what you wrote was actually right or just not wrong enough to reject. A correction without context gives you a fix but not the rule. And a review that's professionally marginal — terse, dismissive, delivered with implied frustration — buries whatever useful content it contains under the overhead of processing the delivery. + +An agent from their team applies their standards with full reasoning and no tone variance. No ALL CAPS. No "this is incorrect." Just: here's what we check, here's what doesn't match, here's what it's compared against. You don't need a relationship with that team to get a useful review. You don't need them to have time. You need their agent configured for the work. + +The practical version: PR goes to a team whose standards I don't fully know. Their agent flags 4–6 things, all specific, all grounded in a stated pattern or comparison. I fix the flags. I add the novel standards to my template's reviewer notes. My next PR to that team starts from a documented baseline instead of from zero. + +That's not the same as a review from a human expert on that team. The agent doesn't catch things that aren't in its standards — the things that are wrong because of a recent architectural shift, or a stakeholder preference that lives in one person's head. But it catches everything that's in the standards, consistently, every time. And it tells me what those standards are. Before agent reviews, that information was inconsistent at best. Often it wasn't transmitted at all. + +The feedback loop works the same across teams as it does within your own. First PR: their agent teaches you the standards. Second PR: you apply them. Third PR: those flags don't fire. Not because the standard disappeared. Because you learned it. + + +![Two developers in separate spaces connected by a floating review note — standards transmitted across team boundaries](./media/2026-05-22-agentic-pr-reviews/cross-team-standards-transfer.png) + +*When the review is professional by default, the standard travels cleanly. No relationship required. No bandwidth negotiation. Just the rule, with the reasoning.* + +--- + ## Why I Can Audit Every Decision From the Last Year From b7552a0414db06992490ee6a82f3e5e1595de768 Mon Sep 17 00:00:00 2001 From: "Dina Berry (She/her)" Date: Fri, 22 May 2026 08:26:57 -0700 Subject: [PATCH 5/6] blog: deeper anonymization -- remove all cloud provider identity Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- website/blog/2026-05-22-agentic-pr-reviews.md | 46 +++++++++---------- 1 file changed, 23 insertions(+), 23 deletions(-) diff --git a/website/blog/2026-05-22-agentic-pr-reviews.md b/website/blog/2026-05-22-agentic-pr-reviews.md index 6a24945..ac826b0 100644 --- a/website/blog/2026-05-22-agentic-pr-reviews.md +++ b/website/blog/2026-05-22-agentic-pr-reviews.md @@ -33,17 +33,17 @@ keywords: PR #8648 had 14 review comments across 3 files. Statler found them. -I was documenting the tool server's SQL tools — translating entries from `tools-list.json` into parameter tables for the documentation platform article. Standard work. I've done around 40 of these tool articles at this point. I submitted the PR and Statler, my adversarial Squad agent, came back with this: +I was documenting the tool server's database tools — translating entries from `schema.json` into parameter tables for the documentation platform article. Standard work. I've done around 40 of these tool articles at this point. I submitted the PR and Statler, my adversarial Squad agent, came back with this: > **Statler — Adversarial Review** > -> The `subscription` parameter appears in your parameter table for `azmcp_sql_server_list`. The SQL namespace tools receive subscription via the Azure plugin's ambient context — it shouldn't be listed as a required parameter because callers never pass it directly. Compare the pattern in `azmcp_storage_account_list`: same tool family, no subscription in the table. +> The `subscription` parameter appears in your parameter table for `db_server_list`. The database namespace tools receive subscription via the platform's ambient context — it shouldn't be listed as a required parameter because callers never pass it directly. Compare the pattern in `storage_account_list`: same tool family, no subscription in the table. > > This will confuse readers looking for what to actually pass. -That comment isn't "missing semicolon." It's a specific claim about a pattern across the SQL namespace, a comparison to another tool family, and a prediction about reader confusion. It was right about all three. +That comment isn't "missing semicolon." It's a specific claim about a pattern across the database namespace, a comparison to another tool family, and a prediction about reader confusion. It was right about all three. -I fixed the table and replied: "Confirmed — subscription, resource-group, and devlang slug formatting follow the same ambient pattern across storage, SQL, and container apps. Adding to reviewer notes for the series." Sam (my fact-checking agent) verified the pattern against the upstream `tools-list.json`. The PR merged with one exchange and no ambiguity. +I fixed the table and replied: "Confirmed — subscription, resource-group, and devlang slug formatting follow the same ambient pattern across storage, database, and container apps. Adding to reviewer notes for the series." Sam (my fact-checking agent) verified the pattern against the upstream `schema.json`. The PR merged with one exchange and no ambiguity. The same PR also came back with a second Statler flag I hadn't noticed at all: three of my seven Example sections used `node` as the devlang slug while the rest of the tool server documentation series uses `javascript`. It would have built fine. Readers filtering by language tab would have gotten partial results. @@ -74,15 +74,15 @@ What I wanted — and I think most people want — is a reviewer who reads caref My review setup uses [Squad v0.9.4](https://github.com/bradygaster/squad) running on the Copilot CLI. Three agents run on documentation PRs: -**Sam** (Fact Checker) compares my claims against the source. For the tool server articles, that means checking parameter names, types, and descriptions against the upstream `tools-list.json`, and verifying that tool descriptions match the annotation text in the source repo. Sam approved PR #8648 after confirming the corrected subscription removal matched the pattern in the upstream JSON — and flagged a minor drift in one tool description where I'd added "against a specified database" that wasn't in the source text. +**Sam** (Fact Checker) compares my claims against the source. For the tool server articles, that means checking parameter names, types, and descriptions against the upstream `schema.json`, and verifying that tool descriptions match the annotation text in the source repo. Sam approved PR #8648 after confirming the corrected subscription removal matched the pattern in the upstream JSON — and flagged a minor drift in one tool description where I'd added "against a specified database" that wasn't in the source text. -**Statler** (Adversarial) looks for what will confuse readers, break patterns, or create maintenance problems. Statler caught both the subscription parameter issue and the devlang slug inconsistency on PR #8648. On a recent Azure Skills article PR, Statler caught a parameter marked as "Required" in my table that the CLI JSON schema had as "Optional" — a discrepancy I'd introduced by following the pattern from an older article rather than checking the current schema. +**Statler** (Adversarial) looks for what will confuse readers, break patterns, or create maintenance problems. Statler caught both the subscription parameter issue and the devlang slug inconsistency on PR #8648. On a recent platform skills article PR, Statler caught a parameter marked as "Required" in my table that the CLI JSON schema had as "Optional" — a discrepancy I'd introduced by following the pattern from an older article rather than checking the current schema. **Scooter** (Quality) checks structure: do headings follow the expected pattern, is the frontmatter complete, do parameter tables have the right columns, are there missing Example sections. For the tool server documentation series, that means verifying every tool article has a Description, Parameters table, and Example section in the right order. Sam goes first (verify facts), Statler second (find problems), Scooter last (check structure). If Sam fails, the others still run — I want all the feedback, not a fail-fast stop. -For content areas outside my expertise, I configure the reviewer with patterns from domain-specific feedback. The SQL tool articles got a reviewer trained on akromm's published review patterns from the SQL article series — things like the exact format expected for the parameters table, which columns are required, and what "Required" vs "Optional" should actually mean for Azure CLI parameters. The agent applies akromm's standards consistently across every PR, which is something akromm can't do while also managing their own work. +For content areas outside my expertise, I configure the reviewer with patterns from domain-specific feedback. The database tool articles got a reviewer trained on the domain expert's published review patterns from the database article series — things like the exact format expected for the parameters table, which columns are required, and what "Required" vs "Optional" should actually mean for CLI parameters. The agent applies the domain expert's standards consistently across every PR, which is something the domain expert can't do while also managing their own work. This is where agent reviews add something human reviews often can't at this scale: the same reviewer standards, applied to every PR, with no variance based on who's available or how busy the review queue is. @@ -98,7 +98,7 @@ My old mental model: a PR description is a summary so the reviewer knows roughly My current model: a PR description is a goal statement and a scope contract. It defines what "done" looks like, what the reviewer should evaluate, and what the author is explicitly not doing. Writing it is the first act of review. -For tool server articles — where I'm translating entries from `tools-list.json` into structured documentation — I now include: +For tool server articles — where I'm translating entries from `schema.json` into structured documentation — I now include: - **Goal**: which tools this PR covers, from which source file, to which output file - **Source of truth**: a link to the specific commit in `upstream-repo` I'm translating from @@ -112,7 +112,7 @@ When Statler reviews that PR, it checks the translation against the stated sourc *When you write the PR description for a thoughtful reader — not a rubber stamp — the document becomes the first act of review.* -There's a side effect of writing at this level of detail: I catch scope creep before submitting. While writing the scope section for a recent Azure Skills article PR, I noticed I'd edited three files that weren't in the goal statement. Two were legitimate scope expansions I should have documented. One was an accidental change I reverted before submitting. The description-writing does that. +There's a side effect of writing at this level of detail: I catch scope creep before submitting. While writing the scope section for a recent platform skills article PR, I noticed I'd edited three files that weren't in the goal statement. Two were legitimate scope expansions I should have documented. One was an accidental change I reverted before submitting. The description-writing does that. Writing for a thoughtful reader changes how you think about what you're doing. The rubber-stamp model lets you be vague because the reviewer will fill in the gaps. The contract model requires precision because the reviewer will hold you to it. @@ -120,13 +120,13 @@ Writing for a thoughtful reader changes how you think about what you're doing. T ## What Sam's Review on PR #8648 Actually Said -For context: PR #8648 covered the SQL tool documentation. 7 tools across the SQL namespace, each needing a Description, Parameters table, and Example section. Source was the SQL namespace block in `tools-list.json` plus the tool-specific annotations in the upstream repo. +For context: PR #8648 covered the database tool documentation. 7 tools across the database namespace, each needing a Description, Parameters table, and Example section. Source was the database namespace block in `schema.json` plus the tool-specific annotations in the upstream repo. Sam's review (fact-check pass): > **Sam — Fact Check** > -> I compared all 7 tool descriptions against `tools-list.json` (commit `a4f9c31`). Six match exactly. One diverges: `toolserver_sql_database_query` — your description says "executes a SQL query against a specified database" but the upstream annotation says "runs an arbitrary SQL query." "Executes" vs "runs" is a style call, but "against a specified database" is an addition not in the source. Intentional clarification, or drift? +> I compared all 7 tool descriptions against `schema.json` (commit `a4f9c31`). Six match exactly. One diverges: `db_query` — your description says "executes a SQL query against a specified database" but the upstream annotation says "runs an arbitrary SQL query." "Executes" vs "runs" is a style call, but "against a specified database" is an addition not in the source. Intentional clarification, or drift? > > Parameters tables are accurate with the exception flagged by Statler. Example sections are present for all 7 tools. @@ -138,7 +138,7 @@ Statler's review ran against a different standard: > **Statler — Adversarial Review** > -> The `subscription` parameter appears in your parameter table for `azmcp_sql_server_list`. The SQL namespace tools receive subscription via the Azure plugin's ambient context — it shouldn't be listed as a required parameter because callers never pass it directly. Compare the pattern in `azmcp_storage_account_list`: same tool family, no subscription in the table. This will confuse readers looking for what to actually pass. +> The `subscription` parameter appears in your parameter table for `db_server_list`. The database namespace tools receive subscription via the platform's ambient context — it shouldn't be listed as a required parameter because callers never pass it directly. Compare the pattern in `storage_account_list`: same tool family, no subscription in the table. This will confuse readers looking for what to actually pass. > > Secondary: three of the seven Example sections use `node` as the devlang slug. Your other articles in this series use `javascript`. Pick one — readers filtering by language tab will get partial results otherwise. @@ -152,7 +152,7 @@ That's what 14 comments across 3 files looks like when the reviewer is actually ## How the Feedback Compounds -The subscription parameter question came up in three consecutive SQL tool PRs before I fixed it. Not because I forgot — I genuinely wasn't sure whether the pattern I was seeing was intentional or a gap in the existing docs. After the third flag, Sam verified the pattern across the full SQL namespace and I added a rule to my reviewer notes: "subscription, resource-group, and tenant-id are ambient context params in the SQL and Storage namespaces — do not include in parameter tables." +The subscription parameter question came up in three consecutive database tool PRs before I fixed it. Not because I forgot — I genuinely wasn't sure whether the pattern I was seeing was intentional or a gap in the existing docs. After the third flag, Sam verified the pattern across the full database namespace and I added a rule to my reviewer notes: "subscription, resource-group, and tenant-id are ambient context params in the database and storage namespaces — do not include in parameter tables." That note is now in my PR description template for the tool server documentation series. Statler stops flagging it. Sam stops verifying it. The question is answered and closed. @@ -169,9 +169,9 @@ graph LR The loop only works if you maintain it. The agents don't automatically update my template. Roughly every five PRs, I spend fifteen minutes reading through the flags I've closed, identifying the ones that represent settled rules, and adding them to the reviewer notes section of the template. That's the maintenance cost. -It adds up in the right direction, though. Before I had the subscription rule documented, I spent time on every SQL tools PR second-guessing whether I had it right. After? It's a known rule that gets checked automatically. That fifteen-minute update bought back maybe two hours of per-PR uncertainty across the series. I'll take that trade every time. +It adds up in the right direction, though. Before I had the subscription rule documented, I spent time on every database tools PR second-guessing whether I had it right. After? It's a known rule that gets checked automatically. That fifteen-minute update bought back maybe two hours of per-PR uncertainty across the series. I'll take that trade every time. -A note on where this doesn't compound cleanly: the agents don't catch conceptual errors. If I've misunderstood what a tool does at the API level, Sam will confirm that my description matches my misunderstanding. That requires a human who knows the Azure SQL namespace to catch — someone like akromm who's actually used these tools. Agent reviews and human reviews are doing different jobs. The agent handles pattern consistency and fact-checking against documented sources; the human catches the gaps between what's documented and what's true. +A note on where this doesn't compound cleanly: the agents don't catch conceptual errors. If I've misunderstood what a tool does at the API level, Sam will confirm that my description matches my misunderstanding. That requires a human who knows the database namespace to catch — someone like the domain expert who's actually used these tools. Agent reviews and human reviews are doing different jobs. The agent handles pattern consistency and fact-checking against documented sources; the human catches the gaps between what's documented and what's true. --- @@ -206,11 +206,11 @@ A recent PR went to a platform team for review. Their agent came back with this: > **Platform Reviewer — Standards Check** > -> Parameter table uses "server name" in the Description column. The convention in this content area is "resource name" for all Azure resource identifiers, consistent with the Compute and Storage namespace articles. See: `storage_account_list`, `vm_list`. +> Parameter table uses "server name" in the Description column. The convention in this content area is "resource name" for all resource identifiers, consistent with the compute and storage namespace articles. See: `storage_account_list`, `vm_list`. > > This inconsistency will create problems for readers switching between namespaces and may conflict with automated tooling that normalizes on "resource name." -I didn't know that convention existed. There's no onboarding doc that says "resource name, not server name, for Azure resource identifiers." That standard lived in the platform team's collective head — or, now, in their reviewer notes. +I didn't know that convention existed. There's no onboarding doc that says "resource name, not server name, for resource identifiers." That standard lived in the platform team's collective head — or, now, in their reviewer notes. Before the agent, I'd have gotten "LGTM" or a terse inline comment: "resource name." No explanation. No comparison point. I'd fix it for that PR and still not understand the rule. The next PR would have the same issue. I'd fix it again. We'd repeat this indefinitely, neither side getting anything durable from the exchange. @@ -244,7 +244,7 @@ The feedback loop works the same across teams as it does within your own. First *A year later, the PR is still there. The reasoning is still there. The "why" — the hardest thing to recover — is captured in the record.* -Last week I needed to know why the Cosmos DB tool article doesn't document the `query_items_with_partition_key` behavior as a separate tool. I went back to the PR. The scope section said: "Cosmos DB tools in this PR are limited to the publicly-listed namespace entries in `tools-list.json`. The partition key query behavior is exposed through `toolserver_cosmos_query` and is not a separate tool entry." Sam's review confirmed it: "partition key query is a parameter pattern of `toolserver_cosmos_query`, not a separate tool — correct to omit." +Last week I needed to know why the NoSQL tool article doesn't document the `query_items_with_partition_key` behavior as a separate tool. I went back to the PR. The scope section said: "NoSQL tools in this PR are limited to the publicly-listed namespace entries in `schema.json`. The partition key query behavior is exposed through `nosql_query` and is not a separate tool entry." Sam's review confirmed it: "partition key query is a parameter pattern of `nosql_query`, not a separate tool — correct to omit." Thirty seconds to verify. Without that review reasoning, I'd be digging through commit messages and Slack threads trying to reconstruct intent from behavior. @@ -274,7 +274,7 @@ The "why" is the hardest thing to recover in any software system. Code tells you Before agent reviews, I wrote PRs for a rubber stamp. That's rational. If most reviews are "LGTM," a detailed description doesn't get you much. You write just enough to remind yourself what you were doing. -When you know Sam will compare your descriptions against `tools-list.json` at commit `a4f9c31`, you check that comparison yourself first. Not every time — that would defeat the purpose. But it shifts what you pay attention to while you write. You think about what the reviewer needs to know rather than what you want to tell them. +When you know Sam will compare your descriptions against `schema.json` at commit `a4f9c31`, you check that comparison yourself first. Not every time — that would defeat the purpose. But it shifts what you pay attention to while you write. You think about what the reviewer needs to know rather than what you want to tell them. A lot of the things a thoughtful reviewer wants to know are things you should have been thinking about anyway. The scope boundary isn't bureaucratic overhead. It's the question that determines whether the PR does what it's supposed to do. Spelling that out before you submit means catching it yourself rather than in a review comment. @@ -288,11 +288,11 @@ You'd traditionally build that kind of shared vocabulary by working closely with Some things still need a human. -Agents don't catch conceptual errors against the truth. Sam verifies my descriptions match the source text. Sam doesn't know if the source text is wrong. If a tool description in `tools-list.json` is misleading about what the tool actually does at runtime, Sam will confirm that my article matches the misleading description. That requires someone who's actually used the tool. +Agents don't catch conceptual errors against the truth. Sam verifies my descriptions match the source text. Sam doesn't know if the source text is wrong. If a tool description in `schema.json` is misleading about what the tool actually does at runtime, Sam will confirm that my article matches the misleading description. That requires someone who's actually used the tool. The feedback loop only works if you maintain it. The agents don't update my template. That's a fifteen-minute task every five or so PRs, but it's manual, and if I skip it for a few weeks the same flags start showing up again. The system runs on the work I put into it. -Some conversations need to happen between people. When a PR decision involves content strategy — whether a new Azure service should get its own article or be folded into an existing one, whether a pattern I've established for the MCP Server series should apply to the Skills articles — that's not a fact-check question or a pattern violation. That's a judgment call about direction and it needs a real conversation. +Some conversations need to happen between people. When a PR decision involves content strategy — whether a new service should get its own article or be folded into an existing one, whether a pattern I've established for the tool server series should apply to the platform skills articles — that's not a fact-check question or a pattern violation. That's a judgment call about direction and it needs a real conversation. My rule of thumb: agent reviews on every documentation PR. Human review on anything involving a judgment call about scope, a new content pattern I haven't done before, or a tool area where I know I'm working outside my knowledge. The agent reviews handle the mechanical consistency work so the human review time goes to what actually needs human judgment. @@ -332,9 +332,9 @@ This is the PR description template I've built over ~60 documentation PRs in the The "Reviewer notes" field in Source of truth is where settled decisions live. My current notes for the tool server documentation series: -> subscription, resource-group, and tenant-id are ambient context params in SQL and Storage namespaces — do not include in parameter tables. Devlang slug is `javascript` not `node` for all JS/TS examples in this series. Tool descriptions should match upstream annotation text exactly; any additions or clarifications require an open question to the reviewer. +> subscription, resource-group, and tenant-id are ambient context params in database and storage namespaces — do not include in parameter tables. Devlang slug is `javascript` not `node` for all JS/TS examples in this series. Tool descriptions should match upstream annotation text exactly; any additions or clarifications require an open question to the reviewer. -Those three lines prevent roughly 8–10 agent comments per PR. They came directly from Statler and Sam's flags across the first three SQL tool PRs, turned into documented rules. +Those three lines prevent roughly 8–10 agent comments per PR. They came directly from Statler and Sam's flags across the first three database tool PRs, turned into documented rules. To run the reviews with Squad v0.9.4 via Copilot CLI: From d04c44d1d6fd634071b0ac4706ef83e08df40945 Mon Sep 17 00:00:00 2001 From: "Dina Berry (She/her)" Date: Fri, 22 May 2026 08:59:43 -0700 Subject: [PATCH 6/6] blog: split agentic PR reviews into two posts (own squad + cross-team) - Rewrote 2026-05-22-agentic-pr-reviews.md to focus on own Squad agents (Sam, Statler, Scooter), PR #8648 walkthrough, PR description template, and feedback compounding loop within own workflow - Created 2026-05-22-cross-team-agent-reviews.md for the external/cross-team angle: receiving reviews from agents you don't control, standards transfer across team boundaries, the SME problem, and audit value when cross-team - Both posts are self-contained deep dives with draft: true - Image distribution: cross-team post gets cross-team-standards-transfer, psychological-safety, captured-in-perpetuity, nooksack images; own-squad post keeps hero, before-after, pr-description-as-contract, author-mindset Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- website/blog/2026-05-22-agentic-pr-reviews.md | 216 ++++++-------- .../2026-05-22-cross-team-agent-reviews.md | 270 ++++++++++++++++++ 2 files changed, 360 insertions(+), 126 deletions(-) create mode 100644 website/blog/2026-05-22-cross-team-agent-reviews.md diff --git a/website/blog/2026-05-22-agentic-pr-reviews.md b/website/blog/2026-05-22-agentic-pr-reviews.md index ac826b0..41181f7 100644 --- a/website/blog/2026-05-22-agentic-pr-reviews.md +++ b/website/blog/2026-05-22-agentic-pr-reviews.md @@ -2,9 +2,9 @@ slug: /2026-05-22-agentic-pr-reviews canonical_url: https://dfberry.github.io/blog/2026-05-22-agentic-pr-reviews custom_edit_url: null -sidebar_label: "2026.05.22 Agentic PR Reviews" -title: "Agent-Led PR Reviews: What Actually Changed in My Workflow" -description: "I set up Sam, Statler, and Scooter from my Squad to review documentation PRs. Here's what the feedback looks like, what it caught on PR #8648, and the PR description template that made it work." +sidebar_label: "2026.05.22 My Squad's PR Reviews" +title: "My Squad's PR Reviews: What Sam, Statler, and Scooter Actually Catch" +description: "I set up Sam, Statler, and Scooter from my Squad to review documentation PRs. Here's what the feedback looks like, what it caught on PR #8648, and the PR description template that made the whole loop work." draft: true tags: - AI Agents @@ -12,9 +12,10 @@ tags: - Developer Experience - GitHub Copilot - AI assisted -updated: 2026-05-22 06:42 PST +updated: 2026-05-22 08:35 PST keywords: - agentic pr reviews + - squad agents pr review - ai code review - pr review quality - agent-led reviews @@ -24,12 +25,12 @@ keywords: - automated code review --- -# Agent-Led PR Reviews: What Actually Changed in My Workflow +# My Squad's PR Reviews: What Sam, Statler, and Scooter Actually Catch ![A pink-haired developer studying a wall of PR review comments by morning window light](./media/2026-05-22-agentic-pr-reviews/hero-pr-review-wall.png) -*The first time an agent actually told me why it was approving my PR, I stopped and read the comment twice. Not because it was wrong. Because nothing like it had happened before.* +*The first time an agent review came back with actual reasoning — here's what I checked, here's what I think, here's why — I realized I'd never gotten that from a human reviewer.* PR #8648 had 14 review comments across 3 files. Statler found them. @@ -55,9 +56,7 @@ The same PR also came back with a second Statler flag I hadn't noticed at all: t The obvious problem with "LGTM" is that it gives you nothing to work with. You don't know what the reviewer checked. You don't know whether they caught the edge case you were worried about. You don't know whether your interpretation was right. -The less obvious problem: it breaks the feedback loop going the other direction too. When you know most reviews will be "LGTM," you stop writing detailed PR descriptions. Why spend ten minutes on a scope definition if the reviewer isn't going to read it? So the description degrades, which makes future reviews even less useful, which makes future descriptions even shorter. It adds up. - -I lived that loop for years. Writing documentation PRs for internal approval cycles where the reviewer was too stretched to engage — I'd get a green stamp or silence, and I'd move on without knowing if what I'd written was right or just unopposed. +The less obvious problem: it breaks the feedback loop going the other direction too. When you know most reviews will be "LGTM," you stop writing detailed PR descriptions. Why spend ten minutes on a scope definition if the reviewer isn't going to read it? So the description gets shorter, which makes future reviews even less useful, which makes future descriptions even shorter. I lived that loop for years. Writing documentation PRs for internal approval cycles where the reviewer was too stretched to engage — I'd get a green stamp or silence, and I'd move on without knowing if what I'd written was right or just unopposed. The other failure mode is the pointed review delivered with heat. Strong language, implied criticism, "why did you do it this way" as an accusation rather than a question. That's not better than LGTM. You learn something is wrong without learning what right looks like. You learn to feel bad about the decision, which is not the same as learning to make a better one. @@ -66,7 +65,11 @@ The other failure mode is the pointed review delivered with heat. Strong languag *Before agent reviews, feedback was either a green stamp or a heat storm — rarely the clear, reasoned signal in between.* -What I wanted — and I think most people want — is a reviewer who reads carefully, explains their reasoning, and raises questions rather than making accusations. That doesn't require AI. But it does require time and consistency that most human reviewers, who are also doing their actual jobs, can't reliably provide at the PR-by-PR level. +The LGTM loop is self-reinforcing in a particularly bad way: you start writing worse descriptions because the feedback doesn't reward good ones. The description degrades. The reviewer has even less to work with. The review gets even more shallow. By the time you're four months into a project, you're getting one-word approvals on PRs with three-word descriptions. + +Agent reviews break that loop because they respond to description quality directly. Write a vague scope section and you'll get flags that a precise scope would have answered up front. Write a detailed scope section and you'll get flags only on the things that weren't captured by the scope — the edge cases worth discussing. The quality of what you get back scales with the quality of what you put in. That creates the opposite incentive structure from the LGTM world. + +What I wanted — and still want — is a reviewer who reads carefully, explains their reasoning, and raises questions rather than making accusations. That doesn't require AI. It requires time and consistency that most human reviewers, who are also doing their actual jobs, can't reliably provide at the PR-by-PR level. Sam, Statler, and Scooter have time. They're consistent. They read every line. --- @@ -74,51 +77,64 @@ What I wanted — and I think most people want — is a reviewer who reads caref My review setup uses [Squad v0.9.4](https://github.com/bradygaster/squad) running on the Copilot CLI. Three agents run on documentation PRs: -**Sam** (Fact Checker) compares my claims against the source. For the tool server articles, that means checking parameter names, types, and descriptions against the upstream `schema.json`, and verifying that tool descriptions match the annotation text in the source repo. Sam approved PR #8648 after confirming the corrected subscription removal matched the pattern in the upstream JSON — and flagged a minor drift in one tool description where I'd added "against a specified database" that wasn't in the source text. +**Sam** (Fact Checker) compares my claims against the source. For the tool server articles, that means checking parameter names, types, and descriptions against the upstream `schema.json`, and verifying that tool descriptions match the annotation text in the source repo. + +Sam's output on a good PR is mostly confirmations: "6 of 7 tool descriptions match `schema.json` at commit `a4f9c31`." When something drifts, Sam calls it out specifically: "db_query — your description says 'executes a SQL query against a specified database' but the upstream annotation says 'runs an arbitrary SQL query.' 'Against a specified database' is an addition not in the source. Intentional clarification, or drift?" + +That distinction — between style variance ("executes" vs "runs") and a substantive addition ("against a specified database") — is something Sam catches because it's doing a line-by-line comparison against a specific commit. A reviewer who knows the source well might catch it. A reviewer working from memory or from a general familiarity with the content area won't. + +**Statler** (Adversarial) looks for what will confuse readers, break patterns, or create maintenance problems. Statler's value is in the cross-file, cross-PR pattern work. It's not checking one file against one source — it's checking whether this PR is internally consistent with the rest of the series. + +On PR #8648, Statler caught both the subscription parameter issue and the devlang slug inconsistency in a single pass. The subscription issue required knowing the pattern across the database and storage namespaces. The devlang inconsistency required knowing what I'd used in other articles in the same series. Both of those are things that require scope that extends beyond the current PR — which is exactly the scope Statler operates with. + +On a recent platform skills article PR, Statler caught a parameter marked as "Required" in my table that the CLI JSON schema had as "Optional." I'd introduced the discrepancy by following the pattern from an older article rather than checking the current schema. Statler flagged it with: "Parameter `resource_group` is marked Required in your table but is Optional in the CLI JSON schema at the current version. The older article you may have referenced predates the schema change." -**Statler** (Adversarial) looks for what will confuse readers, break patterns, or create maintenance problems. Statler caught both the subscription parameter issue and the devlang slug inconsistency on PR #8648. On a recent platform skills article PR, Statler caught a parameter marked as "Required" in my table that the CLI JSON schema had as "Optional" — a discrepancy I'd introduced by following the pattern from an older article rather than checking the current schema. +That's a specific flag with a specific reason and a specific call-out of where I probably went wrong. Without that flag, the article would have shipped with incorrect Required/Optional documentation. Sam wouldn't have caught it — Sam checks descriptions against annotations, not parameter status against schema. Statler caught it because Statler is checking different things. -**Scooter** (Quality) checks structure: do headings follow the expected pattern, is the frontmatter complete, do parameter tables have the right columns, are there missing Example sections. For the tool server documentation series, that means verifying every tool article has a Description, Parameters table, and Example section in the right order. +**Scooter** (Quality) checks structure: do headings follow the expected pattern, is the frontmatter complete, do parameter tables have the right columns, are there missing Example sections. For the tool server documentation series, that means verifying every tool article has a Description, Parameters table, and Example section in the right order — and that the frontmatter fields are all populated with the right format. -Sam goes first (verify facts), Statler second (find problems), Scooter last (check structure). If Sam fails, the others still run — I want all the feedback, not a fail-fast stop. +Scooter catches things like: a missing `updated` timestamp in frontmatter, a Parameters table that has the right columns but in the wrong order, an Example section that exists but is empty because I forgot to fill in the sample call. These are the structural checks I'd run manually if I remembered to. Scooter runs them every time. -For content areas outside my expertise, I configure the reviewer with patterns from domain-specific feedback. The database tool articles got a reviewer trained on the domain expert's published review patterns from the database article series — things like the exact format expected for the parameters table, which columns are required, and what "Required" vs "Optional" should actually mean for CLI parameters. The agent applies the domain expert's standards consistently across every PR, which is something the domain expert can't do while also managing their own work. +The sequence matters: Sam goes first (verify facts), Statler second (find problems), Scooter last (check structure). If Sam finds something substantively wrong with my descriptions — a mismatch against the upstream source — I want to know that before Scooter tells me the formatting is fine. The facts come first. If Sam fails, the others still run. I want all the feedback, not a fail-fast stop on the first error category. -This is where agent reviews add something human reviews often can't at this scale: the same reviewer standards, applied to every PR, with no variance based on who's available or how busy the review queue is. +For content areas outside my direct expertise, I configure the reviewer with patterns from domain-specific feedback I've received. The database tool articles got a reviewer trained on the domain expert's published review patterns from the database article series — things like the exact format expected for the parameters table, which columns are required, and what "Required" vs "Optional" should actually mean for CLI parameters. The agent applies those standards consistently across every PR, which is something the domain expert can't do while also managing their own workload. --- -## Step 1: Write the PR Description as a Contract +## Write the PR Description as a Contract -The quality of the review scales with the quality of the information you put in the description. Human reviewers fill in gaps with assumptions and prior knowledge. Sam and Statler work with what's in front of them. +The quality of the review scales with the quality of the information in the description. Human reviewers fill in gaps with assumptions and prior knowledge. Sam and Statler work with what's in front of them. This forced me to rethink what a PR description is actually for. -My old mental model: a PR description is a summary so the reviewer knows roughly what to expect. Good description = more detail, maybe a link to the ticket. +My old mental model: a PR description is a summary so the reviewer knows roughly what to expect. Good description means more detail, maybe a link to the ticket. My current model: a PR description is a goal statement and a scope contract. It defines what "done" looks like, what the reviewer should evaluate, and what the author is explicitly not doing. Writing it is the first act of review. +The shift matters because when you're writing for a thoughtful reader instead of a rubber stamp, you think about the work differently. You ask the questions that the reviewer will ask before the reviewer asks them. You find the edge cases you were going to leave implicit and make them explicit. You notice the scope you've drifted into that isn't in the goal statement. + For tool server articles — where I'm translating entries from `schema.json` into structured documentation — I now include: - **Goal**: which tools this PR covers, from which source file, to which output file -- **Source of truth**: a link to the specific commit in `upstream-repo` I'm translating from -- **Scope**: which tools are covered, and explicitly which are not (some tools are internal, some belong in a different article) +- **Source of truth**: a link to the specific commit in the upstream repo I'm translating from — not "the main branch," the specific commit, so Sam can check the exact same source +- **Scope**: which tools are covered, and explicitly which are not, with reasons for each exclusion - **Open questions**: things I'm genuinely uncertain about, asked directly to the reviewer +- **Reviewer notes**: the settled decisions from prior PRs in this series — the rules I've already learned -When Statler reviews that PR, it checks the translation against the stated source. When Sam reviews it, it compares my descriptions against the upstream annotation text. When I have a real question — like whether `subscription` belongs in the params table — Sam looks at the pattern across the namespace and answers it rather than guessing. +When Statler reviews that PR, it checks the translation against the stated source and compares patterns against the stated context. When Sam reviews it, it compares my descriptions against the annotation text at the specific commit. When I have a real question — like whether `subscription` belongs in the params table for this tool family — Sam looks at the pattern across the namespace and answers it rather than applying a default. ![The pink-haired developer writing a structured PR description that unfurls like a scroll](./media/2026-05-22-agentic-pr-reviews/pr-description-as-contract.png) *When you write the PR description for a thoughtful reader — not a rubber stamp — the document becomes the first act of review.* -There's a side effect of writing at this level of detail: I catch scope creep before submitting. While writing the scope section for a recent platform skills article PR, I noticed I'd edited three files that weren't in the goal statement. Two were legitimate scope expansions I should have documented. One was an accidental change I reverted before submitting. The description-writing does that. +The "Out of scope" section is the one I skipped the most when I first started writing this way, and it's the one that does the most work. Stating what you're deliberately not doing forces you to make a decision about it rather than leaving it implicit. Three times in the last month, writing out what's excluded, I noticed I'd edited files that weren't in my stated goal. Two were legitimate scope expansions I should have documented. One was an accidental change from an earlier edit session that I reverted before submitting. The description-writing caught them before review did. -Writing for a thoughtful reader changes how you think about what you're doing. The rubber-stamp model lets you be vague because the reviewer will fill in the gaps. The contract model requires precision because the reviewer will hold you to it. +The "Open questions" section changed my relationship with uncertainty in the work. Before this format, I'd make judgment calls on the ambiguous cases and hope the reviewer agreed. With a question in the description, I get a clear answer either way: Sam either confirms the pattern or flags the deviation. The exchange closes rather than leaving both sides guessing. --- -## What Sam's Review on PR #8648 Actually Said +## What Sam and Statler's Reviews on PR #8648 Actually Said For context: PR #8648 covered the database tool documentation. 7 tools across the database namespace, each needing a Description, Parameters table, and Example section. Source was the database namespace block in `schema.json` plus the tool-specific annotations in the upstream repo. @@ -130,9 +146,11 @@ Sam's review (fact-check pass): > > Parameters tables are accurate with the exception flagged by Statler. Example sections are present for all 7 tools. -Two things to notice: Sam cited the specific commit it compared against (`a4f9c31`), and it distinguished between minor word choice drift ("executes" vs "runs") and a substantive addition. "Against a specified database" is something I added that's not in the source — Sam is right to flag it as a question rather than a correction, because it might be a legitimate clarification for readers. +Two things to notice: Sam cited the specific commit it compared against (`a4f9c31`), and it separated minor style variance from a substantive addition. "Against a specified database" is something I added that's not in the source. Sam is right to flag it as a question rather than a correction, because it might be a legitimate clarification for readers. The phrasing "runs an arbitrary SQL query" is technically accurate but not particularly helpful to someone trying to understand when to use this tool versus `nosql_query`. + +I answered: yes, intentional — the clarification stays. Sam closed the question with: "Confirmed intentional clarification. Noted for future `db_*` tool descriptions in this series." -I answered: yes, intentional — "runs an arbitrary SQL query" is technically accurate but unhelpful for someone trying to understand when to use this tool. The clarification stays. Sam closed the question. +That last line is Statler's territory in terms of the thing it's tracking — consistency across the series — but Sam's acknowledgment that it's noted means the same question won't come up as drift on the next `db_*` PR. The agents coordinate. Statler's review ran against a different standard: @@ -142,17 +160,19 @@ Statler's review ran against a different standard: > > Secondary: three of the seven Example sections use `node` as the devlang slug. Your other articles in this series use `javascript`. Pick one — readers filtering by language tab will get partial results otherwise. -I'd completely missed the devlang inconsistency. I copied an example from a different article in the series that used `node` and didn't notice. Statler caught it in a single pass. +I'd completely missed the devlang inconsistency. I'd copied an example from a different article in the series that used `node` and didn't notice it was inconsistent. Statler caught it in a single pass because Statler has context on the full series, not just this PR. -A linter doesn't catch that. A "LGTM" definitely doesn't. A human reviewer working through 14 articles in a sprint might not either, because the inconsistency breaks across files rather than within one. +A linter doesn't catch that — it's not a syntax error, it's a cross-article consistency issue. A "LGTM" definitely doesn't. A human reviewer working through 14 articles in a sprint might not, because the inconsistency spans files rather than existing within one file. -That's what 14 comments across 3 files looks like when the reviewer is actually reading. +The subscription issue was more fundamental. I'd included `subscription` because I saw it in other parameter tables in the database namespace — ones that weren't up to date. Statler caught it because Statler knows the pattern, not just the precedent. + +That's 14 comments across 3 files. Every flag was something that needed to change. --- ## How the Feedback Compounds -The subscription parameter question came up in three consecutive database tool PRs before I fixed it. Not because I forgot — I genuinely wasn't sure whether the pattern I was seeing was intentional or a gap in the existing docs. After the third flag, Sam verified the pattern across the full database namespace and I added a rule to my reviewer notes: "subscription, resource-group, and tenant-id are ambient context params in the database and storage namespaces — do not include in parameter tables." +The subscription parameter question came up in three consecutive database tool PRs before I added a rule for it. Not because I forgot — I genuinely wasn't sure whether the pattern I was seeing was intentional or a gap in the existing documentation. After the third flag, Sam verified the pattern across the full database namespace and I added a rule to my reviewer notes: "subscription, resource-group, and tenant-id are ambient context params in the database and storage namespaces — do not include in parameter tables." That note is now in my PR description template for the tool server documentation series. Statler stops flagging it. Sam stops verifying it. The question is answered and closed. @@ -165,142 +185,84 @@ graph LR E --> A ``` -*Each review's reasoning teaches the system — and me — what "good" looks like for the next PR.* - -The loop only works if you maintain it. The agents don't automatically update my template. Roughly every five PRs, I spend fifteen minutes reading through the flags I've closed, identifying the ones that represent settled rules, and adding them to the reviewer notes section of the template. That's the maintenance cost. - -It adds up in the right direction, though. Before I had the subscription rule documented, I spent time on every database tools PR second-guessing whether I had it right. After? It's a known rule that gets checked automatically. That fifteen-minute update bought back maybe two hours of per-PR uncertainty across the series. I'll take that trade every time. - -A note on where this doesn't compound cleanly: the agents don't catch conceptual errors. If I've misunderstood what a tool does at the API level, Sam will confirm that my description matches my misunderstanding. That requires a human who knows the database namespace to catch — someone like the domain expert who's actually used these tools. Agent reviews and human reviews are doing different jobs. The agent handles pattern consistency and fact-checking against documented sources; the human catches the gaps between what's documented and what's true. - ---- - -## Professional by Default - -Some of the most technically accurate feedback I've received in my career was delivered in ways that made it hard to hear. All-caps emphasis, phrasings that implied I should have known better. That delivery activates something in you that competes with the information transfer. You spend effort managing your response to the tone instead of engaging with the substance. +*Each review's reasoning teaches me — and the template — what "good" looks like for the next PR.* -Agent reviews are impersonal by default. Not cold — they engage with the content directly. But Statler doesn't get frustrated with the third consecutive PR that has the same devlang issue. Sam doesn't have strong opinions about "executes" vs "runs" that spill into how it phrases the flag. Scooter doesn't passively approve things because it's tired. +The loop works because the reasoning is preserved. When Statler flags something and I close the question, I can look back at that exchange and extract the rule precisely rather than paraphrasing from memory. The reviewer notes in my template are direct quotes or close paraphrases of resolved flags, not reconstructions. That matters because the precision of the rule determines whether Statler will keep flagging edge cases of the same issue. -The practical effect: I read agent review comments at face value. When Statler raises an issue, it's because a pattern check fired. When Sam flags a divergence, it's because the source doesn't match. There's no subtext to parse. The only question is whether the flag is right. If yes, fix it. If no, explain why. That's the whole exchange. - -There's a real gap here, though: agents don't know your team's unwritten context. When a decision involves an architectural mistake being corrected incrementally, or a stakeholder preference that lives in someone's head, an agent will flag it correctly — the pattern diverges — but can't know why. My practice when I'm overriding a flag deliberately: explain it in the PR description. Not for the agent — for the humans who read the PR history later. +The maintenance cost is real. The agents don't automatically update my template. Roughly every five PRs, I spend fifteen minutes reading through the flags I've closed, identifying the ones that represent settled rules, and adding them to the reviewer notes section of the template. If I skip it for two or three weeks, the same flags start showing up again on new PRs — not because Statler forgot, but because I haven't incorporated the lesson. - -![The pink-haired developer reading review feedback calmly in a serene PNW garden while others show review frustration](./media/2026-05-22-agentic-pr-reviews/psychological-safety-review.png) +It adds up in the right direction. Before I had the subscription rule documented, I spent time on every database tools PR second-guessing whether I had it right. After? It's a known rule that gets verified automatically. That fifteen-minute update bought back roughly two hours of per-PR uncertainty across the series. Not profound math, but it compounds. -*When feedback is professional by default, you can engage with the content rather than managing the relationship.* +By the time I'd completed ten database tool PRs with this loop running, my reviewer notes for the series were substantial: -One more thing on tone: agents ask rather than assume when something is ambiguous. A human reviewer who sees something they don't understand might mark it wrong, or leave a comment that implies the author made a mistake, or approve silently with a mental note. Sam says "is this intentional clarification or drift?" That's an invitation to confirm, not a verdict. For documentation work where many scope decisions are legitimately ambiguous — whether a behavior is user-facing or internal, whether a parameter should be in the table or treated as ambient — getting a question is almost always the right response. - ---- - -## When Someone Else's Agent Reviews Your PR - -Everything so far has been about my agents reviewing my work. That's one direction. - -The other direction is when your PR lands in a queue owned by a different team — one with their own standards, their own patterns, their own accumulated history of what "correct" means for their content area. Before agent reviews, that came back as "LGTM," or a one-liner with no context, or silence. You had no idea what the reviewer actually checked or what standard you were being held to. - -When another team's agent reviews your PR, you get to see the standards. - -A recent PR went to a platform team for review. Their agent came back with this: - -> **Platform Reviewer — Standards Check** +> subscription, resource-group, and tenant-id are ambient context params in the database and storage namespaces — do not include in parameter tables. > -> Parameter table uses "server name" in the Description column. The convention in this content area is "resource name" for all resource identifiers, consistent with the compute and storage namespace articles. See: `storage_account_list`, `vm_list`. +> Devlang slug is `javascript` not `node` for all JS/TS examples in this series. > -> This inconsistency will create problems for readers switching between namespaces and may conflict with automated tooling that normalizes on "resource name." - -I didn't know that convention existed. There's no onboarding doc that says "resource name, not server name, for resource identifiers." That standard lived in the platform team's collective head — or, now, in their reviewer notes. - -Before the agent, I'd have gotten "LGTM" or a terse inline comment: "resource name." No explanation. No comparison point. I'd fix it for that PR and still not understand the rule. The next PR would have the same issue. I'd fix it again. We'd repeat this indefinitely, neither side getting anything durable from the exchange. - -The agent gave me the rule. I added it to my template's reviewer notes. That exchange happened once. - ---- - -This is the part that gets undersold in conversations about AI-assisted reviews: the transfer of standards across team boundaries. - -When you're not the subject matter expert and your PR lands with a team that is, you're dependent on their review being informative. "LGTM" from an SME is the worst possible outcome — they looked, they approved, and you learned nothing about whether what you wrote was actually right or just not wrong enough to reject. A correction without context gives you a fix but not the rule. And a review that's professionally marginal — terse, dismissive, delivered with implied frustration — buries whatever useful content it contains under the overhead of processing the delivery. - -An agent from their team applies their standards with full reasoning and no tone variance. No ALL CAPS. No "this is incorrect." Just: here's what we check, here's what doesn't match, here's what it's compared against. You don't need a relationship with that team to get a useful review. You don't need them to have time. You need their agent configured for the work. - -The practical version: PR goes to a team whose standards I don't fully know. Their agent flags 4–6 things, all specific, all grounded in a stated pattern or comparison. I fix the flags. I add the novel standards to my template's reviewer notes. My next PR to that team starts from a documented baseline instead of from zero. - -That's not the same as a review from a human expert on that team. The agent doesn't catch things that aren't in its standards — the things that are wrong because of a recent architectural shift, or a stakeholder preference that lives in one person's head. But it catches everything that's in the standards, consistently, every time. And it tells me what those standards are. Before agent reviews, that information was inconsistent at best. Often it wasn't transmitted at all. - -The feedback loop works the same across teams as it does within your own. First PR: their agent teaches you the standards. Second PR: you apply them. Third PR: those flags don't fire. Not because the standard disappeared. Because you learned it. +> Tool descriptions should match upstream annotation text exactly; any additions or clarifications require an open question to the reviewer. +> +> "Required" and "Optional" status must match the CLI JSON schema value, not the pattern from adjacent namespace tools. +> +> Example sections use the published production endpoint format, not the development environment format. - -![Two developers in separate spaces connected by a floating review note — standards transmitted across team boundaries](./media/2026-05-22-agentic-pr-reviews/cross-team-standards-transfer.png) +Five rules. Each one came from a Statler or Sam flag on a specific PR. None of them are things I would have invented independently — they came from the review process surfacing where my defaults diverged from the established patterns. -*When the review is professional by default, the standard travels cleanly. No relationship required. No bandwidth negotiation. Just the rule, with the reasoning.* +The agents don't catch conceptual errors. If I've misunderstood what a tool does at the API level, Sam will verify that my description accurately matches my misunderstanding. Sam checks my text against the annotation — if the annotation is wrong, Sam confirms that my wrong description is consistent. That's a real limitation and it's why human review doesn't disappear from the workflow. Agent reviews handle pattern consistency and fact-checking against documented sources. Human review handles the gap between what's documented and what's operationally true. --- -## Why I Can Audit Every Decision From the Last Year - - -![The pink-haired developer viewing a preserved PR review in a museum-style display case, with winter Cascades in the distance](./media/2026-05-22-agentic-pr-reviews/captured-in-perpetuity.png) - -*A year later, the PR is still there. The reasoning is still there. The "why" — the hardest thing to recover — is captured in the record.* - -Last week I needed to know why the NoSQL tool article doesn't document the `query_items_with_partition_key` behavior as a separate tool. I went back to the PR. The scope section said: "NoSQL tools in this PR are limited to the publicly-listed namespace entries in `schema.json`. The partition key query behavior is exposed through `nosql_query` and is not a separate tool entry." Sam's review confirmed it: "partition key query is a parameter pattern of `nosql_query`, not a separate tool — correct to omit." - -Thirty seconds to verify. Without that review reasoning, I'd be digging through commit messages and Slack threads trying to reconstruct intent from behavior. +## Writing for a Thoughtful Reviewer Changes How You Think -The before/after is stark. **Before:** I'm looking at a documentation file that handles authentication. I see a note that says "token refresh is not documented here." I have no idea why. Was it intentional? Missed? A scope decision? The PR that introduced this file says "Add authentication docs." The review says "LGTM." I have nothing. + +![The pink-haired developer writing a structured PR description overlooking Bellingham Bay with a lighthouse in the distance](./media/2026-05-22-agentic-pr-reviews/author-mindset-bay-view.png) -**After:** Same scenario. The PR description has a scope section: "internal token refresh timing is not documented here — this is implementation-specific behavior covered in the internal runbook." The review says "scope boundary honored. Token refresh timing omitted per scope definition. Aligns with pattern in storage.md." Now I know what was decided, why, and what it was compared against. +*Writing for a thoughtful reader changes your relationship with your own work — you ask the questions before they're asked of you.* -That's not a small difference. That's the difference between archaeological guesswork and documentation you can actually use. +Before agent reviews, I wrote PRs for a rubber stamp. That's a rational response to the environment — if most reviews are "LGTM," a detailed description doesn't get you much return. You write just enough to remind yourself what you were doing. -Code comments go stale. Architecture diagrams drift — the diagram from eighteen months ago shows three services; now there are seven. PR reasoning doesn't drift. It captures the state of thinking at the moment the change was made. A PR from eight months ago that says "this scope decision was intentional, here's why" is useful even if the decision has since been revisited, because it tells you the decision was deliberate. +When you know Sam will compare your descriptions against `schema.json` at commit `a4f9c31`, you check that comparison yourself first. Not every time — that would defeat the purpose — but it shifts what you pay attention to while you write. You think about what the reviewer needs to know rather than what you want to tell them. That's a different question. -The "why" is the hardest thing to recover in any software system. Code tells you what. Tests tell you what was expected. Almost nothing tells you why a specific decision was made, with specific constraints in mind, at a specific moment. Agent-led reviews capture it as a byproduct of doing the review. Not perfectly, not always — but as a default rather than only when someone had time to be thorough. +The scope boundary isn't bureaucratic overhead. It's the question that determines whether the PR does what it's supposed to do. Spelling it out before submitting means catching it yourself rather than in a review comment. Stating what you're not doing means making a conscious decision about it rather than leaving it implicit. Both of those move the thinking forward in a way that writing "covered the db namespace tools" doesn't. - -![The pink-haired developer at the Nooksack River at dusk, holding a lantern beside a marked post — a feeling of something preserved](./media/2026-05-22-agentic-pr-reviews/nooksack-preserved-reasoning.png) +There's something that builds over the months too. Working with the same reviewer configuration across the tool server documentation series has created a shared vocabulary — not just terminology, but a shared understanding of what counts as an in-scope parameter, what belongs in a Parameters table versus an Example section, when "Required" means required by the tool versus required by the namespace context. That vocabulary lives in the reviewer notes in my template. A new article starts from the accumulated decisions of the previous forty, not from zero. -*The reasoning, once captured, doesn't drift. It stays exactly where you left it — a marker in the record that future you will actually be able to read.* +You'd traditionally build that kind of shared vocabulary by working closely with a human expert over many PRs, until you internalized their standards. Most people don't have consistent access to that. The expert who has the deep context isn't always reviewing your PRs. When they are, consistency is a casualty of a full workload. The same reviewer standards applied to every PR, with no variance based on who's available, builds the vocabulary faster than occasional thorough reviews from domain experts who are also managing six other priorities. --- -## Writing for a Thoughtful Reviewer Changes How You Think - - -![The pink-haired developer writing a structured PR description overlooking Bellingham Bay with a lighthouse in the distance](./media/2026-05-22-agentic-pr-reviews/author-mindset-bay-view.png) - -*Writing for a thoughtful reader changes your relationship with your own work — you ask the questions before they're asked of you.* +## Professional by Default -Before agent reviews, I wrote PRs for a rubber stamp. That's rational. If most reviews are "LGTM," a detailed description doesn't get you much. You write just enough to remind yourself what you were doing. +Some of the most technically accurate feedback I've gotten in my career was delivered in ways that made it hard to engage with. All-caps emphasis, phrasings that implied I should have known better, "why did you do it this way" as an accusation. That delivery competes with the information transfer. You spend effort managing your response to the tone instead of engaging with the substance. -When you know Sam will compare your descriptions against `schema.json` at commit `a4f9c31`, you check that comparison yourself first. Not every time — that would defeat the purpose. But it shifts what you pay attention to while you write. You think about what the reviewer needs to know rather than what you want to tell them. +Agent reviews are impersonal by default. Not cold — they engage with the content directly. But Statler doesn't get frustrated with the third consecutive PR that has the same devlang issue. Sam doesn't have strong opinions about "executes" vs "runs" that spill into how it phrases a flag. Scooter doesn't passively approve things because it's been a long week. -A lot of the things a thoughtful reviewer wants to know are things you should have been thinking about anyway. The scope boundary isn't bureaucratic overhead. It's the question that determines whether the PR does what it's supposed to do. Spelling that out before you submit means catching it yourself rather than in a review comment. +The practical effect: I read agent review comments at face value. When Statler raises an issue, it's because a pattern check fired. When Sam flags a divergence, it's because the source doesn't match. There's no subtext to parse. The only question is whether the flag is right. If yes, fix it. If no, explain why. That's the whole exchange. -There's something that builds over time too. Working with the same reviewer configuration across the tool server documentation series has created a shared vocabulary — not just terminology, but a shared understanding of what counts as an in-scope parameter, what belongs in a Parameters table versus an Example section, when "Required" means required by the tool vs. required by the namespace context. That vocabulary now lives in the reviewer notes in my PR template. A new article starts from that baseline instead of from zero. +Agents also ask rather than assert when something is genuinely ambiguous. A human reviewer who sees something they're uncertain about might mark it wrong, approve silently, or leave a vague comment. Sam says "is this intentional clarification or drift?" That's an invitation to confirm, not a verdict. For documentation work where many scope decisions are legitimately ambiguous — whether a behavior is user-facing or implementation-specific, whether a parameter should be in the table or treated as ambient context — getting a question is almost always the more useful response. -You'd traditionally build that kind of shared vocabulary by working closely with a human expert over many PRs, until you internalized their standards. Most people don't have access to that. The expert who has the deep standards isn't always reviewing your PRs. And when they are, consistency is a casualty of a full workload. The same reviewer standards applied to every PR, with no variance based on who's available, builds the vocabulary faster than occasional reviews from domain experts who are also managing six other priorities. +When I'm overriding a flag deliberately — when I know the pattern diverges but I have a reason — I explain it in the PR description. Not for Sam or Statler, who will flag the deviation regardless. For the humans who read the PR history later and need to know whether the exception was intentional and why it was made. --- ## What Actually Doesn't Work -Some things still need a human. +Some things still need a person. + +Sam doesn't catch conceptual errors against the truth. If a tool description in `schema.json` is misleading about what the tool actually does at runtime, Sam confirms that my documentation accurately matches the misleading description. It checks descriptions against the source — if the source is wrong, Sam's approval is not a useful signal about correctness. That requires someone who's actually used the tool and knows the gap between the annotation and the behavior. -Agents don't catch conceptual errors against the truth. Sam verifies my descriptions match the source text. Sam doesn't know if the source text is wrong. If a tool description in `schema.json` is misleading about what the tool actually does at runtime, Sam will confirm that my article matches the misleading description. That requires someone who's actually used the tool. +The feedback loop only works if I maintain it. The agents don't update my template. That's a fifteen-minute task every few PRs, but it's manual and optional, and if I skip it the same flags come back. The system runs on the work I put into it. -The feedback loop only works if you maintain it. The agents don't update my template. That's a fifteen-minute task every five or so PRs, but it's manual, and if I skip it for a few weeks the same flags start showing up again. The system runs on the work I put into it. +Some conversations need to happen between people. When a PR decision involves content strategy — whether a new service gets its own article or folds into an existing one, whether a pattern I've established for the tool server series should apply to the platform skills articles — that's not a fact-check question or a pattern violation. It's a direction question. Sam and Statler can tell me whether a PR is consistent with what exists. They can't tell me whether what exists is the right structure going forward. -Some conversations need to happen between people. When a PR decision involves content strategy — whether a new service should get its own article or be folded into an existing one, whether a pattern I've established for the tool server series should apply to the platform skills articles — that's not a fact-check question or a pattern violation. That's a judgment call about direction and it needs a real conversation. +My rule of thumb: agent reviews on every documentation PR. Human review on anything involving a judgment call about scope, a new content pattern I haven't done before, or a tool area where I know I'm working outside my documented knowledge. The agent reviews handle the mechanical consistency work so the human review time goes to what actually needs human judgment. A PR that's already been through Sam and Statler arrives at human review with pattern and fact-check issues resolved. The human reviewer's attention goes to the parts that require their specific knowledge. -My rule of thumb: agent reviews on every documentation PR. Human review on anything involving a judgment call about scope, a new content pattern I haven't done before, or a tool area where I know I'm working outside my knowledge. The agent reviews handle the mechanical consistency work so the human review time goes to what actually needs human judgment. +The agents also aren't a substitute for a domain expert on the first few PRs in a new content area. When I started the database tool articles, I didn't have enough context to write good reviewer notes — I didn't know what I didn't know. The first two or three PRs needed a human with database namespace knowledge to review, because Sam and Statler can only apply the patterns I've captured and I hadn't captured the important ones yet. The agent reviews got useful after I'd built the baseline. --- ## The Template I Actually Use Now -This is the PR description template I've built over ~60 documentation PRs in the last six months. It's gone through several iterations. This is the version I stopped changing. +This is the PR description template I've built over roughly 60 documentation PRs in the last six months. Several iterations. This is the version I stopped changing. ```markdown ## Goal @@ -351,3 +313,5 @@ Sam, Statler, Scooter — review the PR branch in content-repo The agents run in sequence, post their findings, and I work through the flags before pushing for human review. Agent review plus fixes: about 30 minutes per PR. That's faster than waiting for a human reviewer to have time, and what I bring to the human review is cleaner because the pattern and fact-check issues are already resolved. Before that setup, I was getting back LGTM or silence. Now I'm getting back 14 specific, accurate comments with reasoning. The comments are better than what I was getting from human reviews on most PRs — not because Sam and Statler are smarter than people, but because they read every line, every time, and they tell me what they found. + +The sixty-PR baseline is where the template started earning its cost. The first ten PRs, I was still figuring out what to capture in reviewer notes. By PR 20, the template had stable patterns. By PR 40, I stopped changing it. Now each new article in the series starts from a documented set of conventions that took forty PRs to build. That's the actual return on the fifteen-minute maintenance investment per five PRs — the reviewer notes are the artifact that makes every subsequent PR faster than the one before it. diff --git a/website/blog/2026-05-22-cross-team-agent-reviews.md b/website/blog/2026-05-22-cross-team-agent-reviews.md new file mode 100644 index 0000000..457d839 --- /dev/null +++ b/website/blog/2026-05-22-cross-team-agent-reviews.md @@ -0,0 +1,270 @@ +--- +slug: /2026-05-22-cross-team-agent-reviews +canonical_url: https://dfberry.github.io/blog/2026-05-22-cross-team-agent-reviews +custom_edit_url: null +sidebar_label: "2026.05.22 Cross-Team Agent Reviews" +title: "When Someone Else's Agent Reviews Your PR" +description: "Your PR lands with a team whose standards you don't fully know. Their agent reviews it with full reasoning attached. Here's what I've learned about receiving reviews from agents I didn't configure — and why the standards transfer is the part most people miss." +draft: true +tags: + - AI Agents + - Code Review + - Developer Experience + - GitHub Copilot + - AI assisted +updated: 2026-05-22 08:35 PST +keywords: + - cross-team agent reviews + - ai code review + - pr review standards transfer + - team standards visibility + - agentic pr review + - code review feedback loop + - sme review problem + - documentation standards +--- + +# When Someone Else's Agent Reviews Your PR + + +![Two developers in separate spaces connected by a floating review note — standards transmitted across team boundaries](./media/2026-05-22-cross-team-agent-reviews/cross-team-standards-transfer.png) + +*When the review is professional by default, the standard travels cleanly. No relationship required. No bandwidth negotiation. Just the rule, with the reasoning.* + +My PR went to the platform team for a standards review. Their agent came back with 6 comments. Five were naming and formatting patterns I'd been violating across multiple previous PRs — all of which had gotten approved silently. The sixth was a cross-namespace naming convention that doesn't exist in any onboarding documentation, because the platform team built it over three years of working conversations and never formalized it. + +I fixed all six flags. Took about twenty minutes. I added the rules to my reviewer notes for that content area. My next PR to that team had zero flags on those patterns. + +"LGTM" from a subject matter expert across team boundaries is the worst possible outcome — not the neutral one. They reviewed it. They know things I don't. The knowledge doesn't travel. + +When their agent reviews your PR, the knowledge travels with the verdict. + +This is about the other direction of agent-assisted review: not your own agents reviewing your own work, but someone else's agent reviewing your PR when it crosses into their content area. The feedback loop is the same, but what you're learning is different — you're not refining standards you helped develop, you're getting your first clear view of standards that have been operating invisibly. + +--- + +## Why "LGTM" From an SME Is Worst When You're the Outsider + +Within your own team, a "LGTM" approval has context that makes it at least partially interpretable. You know roughly what the reviewer cares about. You have a read on whether they're signaling "this is genuinely fine" or "I skimmed it and it looked okay" or "I don't have bandwidth right now but I trust you." You can ask a clarifying question without much overhead. The relationship absorbs the ambiguity. + +Cross-team, that context disappears entirely. You're submitting to a team whose priorities and internal standards you don't fully know. When they approve with "LGTM," you can't distinguish three meaningfully different outcomes: "this meets our standards fully," "this doesn't meet our standards but nothing is bad enough to block on," or "I didn't have time to check the things that actually matter." The approval looks identical for all three. + +The result is that you keep making the same mistakes across consecutive PRs, because you never find out they're mistakes. You keep using "server name" in parameter tables when the convention is "resource name." You keep including ambient-context parameters in tool parameter tables when the convention is to omit them. You keep using the language slug from an older article that predates the current convention. The patterns go through on every PR that gets a rushed review. Occasionally one lands with a reviewer who has time and inclination to flag it. You fix that instance. You move on without knowing whether you've learned the rule or just fixed the exception. + +The less visible failure mode: "LGTM" from a cross-team SME is uniquely frustrating because you can't even calibrate what it means. Within your own team, you might know this reviewer well enough to know they always check the API surface but skim prose clarity. You compensate. With a cross-team expert you interact with twice a year, you don't have that read. Their "LGTM" is a black box. + +The other failure at cross-team scale is the pointed correction without context. The reviewer is clearly annoyed that you've violated a convention they consider obvious, and they tell you. "This should be `resource name`, not `server name`." The correction is right. The delivery is terse. The context is absent. You fix the instance. You spend ten minutes wondering whether this is a team-wide convention or one person's strong preference. You decide not to ask because they seem busy and you don't want to be the person who needs hand-holding. Next PR: "server name" again, different file, different article. The conversation repeats. + +Repeat the exchange three times and neither side is learning anything. The reviewer is mildly exasperated. The contributor is mildly confused. The standard still lives only in one person's head. + +What actually helps is a reviewer who applies the same standards consistently, explains the reasoning when they flag something, and distinguishes between "this violates a documented convention" and "I have a question about your intent here." That's not a high bar. It's also not something you can reliably get from a cross-team reviewer who is managing their own deliverables and treating your PR as a secondary obligation. + +An agent from their team applies their documented standards to every PR, with the reasoning attached, regardless of queue length or workload. + +--- + +## What Their Agent Actually Said + +The PR in question was a documentation update covering a set of platform tools. I was adding parameter tables to three articles that covered resources across the compute, storage, and database namespaces. I based the table format on older articles in the series I'd found in the repo. Three of those older articles used "server name" in the parameter description column for resource identifier parameters. I followed that pattern — it was the evidence I had. + +Here's what the platform team's agent returned: + +> **Platform Reviewer — Standards Check** +> +> Parameter table uses "server name" in the Description column for `server_name` parameter in `compute_vm_start`, `compute_vm_stop`, and `db_server_list`. The convention in this content area is "resource name" for all resource identifiers, consistent with the compute and storage namespace articles. See: `storage_account_list`, `vm_list`. +> +> This inconsistency will create problems for readers switching between namespaces and may conflict with automated tooling that normalizes on "resource name." + +Four things in that review comment that a "LGTM" never gives me: + +**The rule, stated explicitly.** Not "change server name to resource name." The convention exists across the entire content area for all resource identifiers. Scope: all of them. + +**The comparison points.** `storage_account_list`, `vm_list`. I can go look at those articles right now and see exactly what correct looks like. I'm not making assumptions about what the reviewer means or has to look up. + +**The reason.** Two reasons, actually: readers switching namespaces, and automated tooling normalization. Now I understand why the rule exists. That matters because it tells me when the rule applies and when an exception might be warranted. If I'm documenting a resource type that is literally and only a server — not a general namespace resource — I know whether "server name" might be justified or whether the convention overrides the specificity. + +**The scope.** "This content area" — not just this article, not just this PR. A rule I carry forward to every subsequent submission to this space. + +Those are four things I now know that I didn't know before the review. I added all four to my reviewer notes for the platform content area. My next three PRs to that team had zero "resource name" flags. Not because the agent changed — it was running the same check. Because I learned the rule. + +The older articles I'd based my work on — the ones using "server name" — were pre-convention, written before the platform team formalized the approach. They'd never been updated because the team hadn't had bandwidth to do a full sweep. My PR was perpetuating an outdated pattern without knowing it. A human reviewer looking at my PR while managing four others in their queue might have noticed the discrepancy but not had time to write out the full reasoning, or might have recognized the pre-convention articles as the source and not flagged it, not knowing I'd copied the old pattern thinking it was current. + +The agent checks the current standard against the current PR, every time, with the reasoning attached. The pre-convention articles don't confuse it. + +--- + +## Team Standards Don't Come With Documentation + +Every team with more than six months of active work has accumulated standards that aren't formally documented. Not out of negligence — most conventions don't start as policy statements. They start as decisions made in the flow of normal work. + +The platform team decided at some point that "resource name" was better than "server name" for resource identifier descriptions. That decision almost certainly happened in a PR comment or a quick conversation, not in a standards meeting. The person who made the call had a reason. The people in the thread carried the reasoning. Future articles got reviewed by those people, who applied the standard. The pattern spread through the content area via review rather than via documentation. It became real — consistently applied, consistently correct — without ever being written down as a rule. + +Six months later, someone from outside the platform team's daily orbit submits a cross-team PR. They base their work on articles that exist in the repo, including the pre-convention ones that haven't been updated. They use "server name." Their PR gets reviewed by someone who knows the convention but is behind on their own work. The review is "LGTM" or a terse one-liner. The contributor makes the change (if told to) without understanding the rule. The next PR from that contributor has the same issue. + +This is how standards fail to transmit across team boundaries. It's not anyone's fault. The expert who reviewed is operating rationally given their constraints. The contributor who made the mistake was following the evidence available to them. The system just doesn't have a good mechanism for getting the expert's contextual knowledge into the contributor's hands at the moment they need it. + +The agent is that mechanism. + +When the platform team configures their reviewer with the resource naming convention — rule, comparison points, reason — that knowledge is available at every cross-team PR review, regardless of whether the expert who created the convention is available. It doesn't require the expert's bandwidth. It doesn't require a relationship between the contributor and the team. It requires that someone put the standard into the reviewer notes once. + +This changes the cross-team knowledge transfer problem from "how do we get the expert's knowledge to the contributor in time" to "how do we get the expert's knowledge into the agent's configuration once." The second problem is easier. It happens once, when the standard is established or articulated, rather than repeatedly, for every cross-team contributor who encounters it. + +There's a secondary benefit that the platform team gets, separate from the cross-team contributors: building the agent configuration forces explicit articulation of standards that were previously tacit. When you have to write "resource name for all resource identifiers, consistent across compute and storage namespaces, required for automated tooling normalization," you're making a rule more precise than it was when it lived in people's heads as "we just say resource name." Some of those standards, when written out, turned out to be underdocumented even in the reviewer notes — "Required/Optional per CLI JSON schema, not per usage observation" needed more specification before it was reliable enough to configure the agent on. The process of articulating standards makes them more useful for the team itself, not just for incoming contributors. + +I've been on both sides of this. Submitting cross-team PRs where I got terse corrections I didn't understand, fixing the instance without learning the rule. Reviewing cross-team PRs myself where I had the contextual knowledge but not the time to write out the full reasoning, leaving a one-liner that the contributor probably filed under "that person's preference" rather than "team convention." Both sides rational given the constraints. Both sides producing an exchange that transfers nothing durable. + +The agent on the reviewing side changes the default. + +--- + +## How the Feedback Loop Works Cross-Team + +The first PR I submit to a team whose standards I don't know: their agent flags several things, all specific, all with reasoning. I fix the flags. The part that compounds over time is what I do with those flags after fixing them. + +For each flag I get, I write the underlying rule into the reviewer notes section of my PR description template for that content area. Not a mental note — an actual written rule, so the next PR starts from it. + +The format I use in the reviewer notes: + +> **[rule type]:** [rule statement, including scope]. See: [comparison point]. Reason: [why]. + +After the first PR to the platform team: +> **Resource identifiers:** Use "resource name" for all resource identifier descriptions in parameter tables across compute, storage, and database namespaces. See `storage_account_list`, `vm_list`. Reason: cross-namespace consistency and automated tooling normalization. + +After the second PR: +> **Required/Optional status:** Match the CLI JSON schema value exactly. Do not inherit Required/Optional status from similar tools in adjacent namespaces — check the schema directly for each tool. See PR #[number] for the subscription parameter pattern. Reason: schema is the source of truth; adjacent patterns can diverge without announcement. + +After the third PR: +> **Example section URLs:** Use published production service endpoint URLs. Development environment URLs don't belong in examples regardless of whether they're technically accurate. See the published compute namespace articles for format. Reason: examples document the public API, not internal infrastructure. + +By the fourth PR, the flags from the first three don't fire. The agent is still running the same checks — those standards are part of their configuration regardless of who submits. But those specific patterns are now in my baseline for this content area. I'm a different contributor than I was at PR 1. I bring documented knowledge of their standards into each PR rather than starting from zero context. + +The front-loading is steeper cross-team than within your own workflow. You have more to learn, and you're starting from zero familiarity with the area rather than from working alongside the people who built the standards. But the compounding effect is identical. + +The limit: the loop is bounded by what's in the agent's configuration. If the platform team updated a convention between my last PR and my current one, their agent reflects the new standard. My notes have the old rule. The agent flags it on my next PR. I update the notes. The correction mechanism is reactive — flagging when I violate the new standard — not proactive notification. For content areas where I submit infrequently, I do a manual check on my reviewer notes against their most recent flag summary before submitting, to see if anything obviously changed. + +This doesn't scale perfectly past about three teams with active reviewer configurations. Four teams means four sets of reviewer notes to maintain, and the manual convention-change checks across four areas are about 45 minutes of overhead per sprint. That's still worth it to me — the alternative is making the same convention mistakes repeatedly across all four teams — but it's a real cost that scales linearly with how many cross-team areas you're working in. + +--- + +## No Relationship Required + +Cross-team review at scale has a structural relationship problem. Getting quality feedback from a team you don't work with daily requires one of: a pre-existing relationship where someone prioritizes your PR, a formal assignment process where review responsibility is explicit, or good timing. Most of the time, cross-team PRs get "LGTM" or a terse one-liner because the reviewer doesn't have bandwidth to go deeper on work they didn't create and aren't accountable for. + +When the team runs agent reviews, that structural problem doesn't disappear — but its impact on any individual PR shrinks substantially. The agent applies their documented standards at full depth regardless of whether there's a relationship between the reviewer and the contributor. It doesn't take more time for a first-time cross-team contributor than for someone who's submitted a hundred PRs to that content area. The standards get applied the same way every time. + +This matters most for contributors who are new to a content area — who don't know the conventions, who are making the first-few-PR mistakes that everyone makes. Those contributors are exactly the ones least likely to have a relationship that unlocks a thorough review. The agent inverts that: the thoroughness of the review doesn't scale with the contributor's relationship capital on that team. + + +![The pink-haired developer reading review feedback calmly in a serene PNW garden while others show review frustration](./media/2026-05-22-cross-team-agent-reviews/psychological-safety-review.png) + +*When feedback is professional by default, you can engage with the content rather than managing the relationship.* + +The professional-by-default aspect matters more cross-team than it does within your own workflow. Inside your own team, delivery problems in review comments come with relationship context you can use to separate the signal from the noise. You know this reviewer is generally terse but accurate. You know they've been heads-down on a release and their threshold for patience is low right now. Cross-team, that context doesn't exist. A terse or pointed comment from a team you don't know well is just terse and pointed. You can't easily check whether "this is incorrect" means "this violates a specific documented standard" or "I'm behind and phrasing things badly." + +An agent doesn't have that variance. It doesn't have good days and bad days. When the platform team's agent says "resource name, not server name, per the content area convention," the tone is identical whether I'm a first-time cross-team contributor or their closest collaborator. I read the flag at face value. The only question is whether it's right. If yes, fix it. If I think there's a legitimate exception, I explain in a reply. That's the whole exchange. + +There's an additional thing agents do in ambiguous cases that changes the review dynamic: they ask rather than assert. + +A human reviewer under time pressure hitting something uncertain has limited options — mark it wrong, approve with a mental note, or leave a vague comment that implies a problem without stating one. All rational given their constraints. None of them give the author a useful signal about what's actually expected. + +The platform team's agent on a PR I submitted recently: + +> The tool description for `compute_vm_start` uses "initiate a virtual machine start operation" where the upstream source text says "start a virtual machine." +> +> "Initiate a start operation" is more complex than the source without being more precise in the standard case — but this may be intentional if there's a distinction between initiating and completing the operation in the async context. Is this a deliberate clarification or drift from source text? + +That's a question, not a verdict. I can answer it: yes, intentional, the async distinction matters for readers trying to understand the return value. Or: no, drift, reverting to source text. Either way I understand exactly what the reviewer is checking and why. The exchange closes cleanly. + +The human version of this review would often skip the question. A reviewer with a full queue might approve (the description isn't wrong, just different) or leave "per source text" — which tells me the result without the reason. I'd make the change and still not know what made the distinction matter. + +For documentation work where many of the interesting calls are genuinely ambiguous — is this variation acceptable or a convention violation? — getting questions is valuable because it surfaces the exact edge case that the standard needs to cover. When I explain that the async distinction justifies the more complex phrasing, and the reviewer confirms it, that exception is now documented in the PR exchange. The next contributor who hits the same case has a precedent to point to. + +--- + +## The Audit Record Matters More When You Can't Ask + +When I've configured my own Squad agents to review my PRs, there's a recovery path when something in the review record is unclear: I can reconstruct context. I know what each agent was configured to check. I can look at the reviewer notes that drove a flag. If a question comes up six months later about why something was decided, I can go back to the configuration and understand the reasoning chain. + +With a cross-team agent review, that recovery path doesn't exist. I don't have access to the platform team's reviewer configuration. I can't query their agent about why a particular standard was established. The PR review record — the actual exchange preserved in the PR — is the only permanent record I have of what their standards were at that moment. + +That makes the captured reasoning more important, not less. + + +![The pink-haired developer viewing a preserved PR review in a museum-style display case, with winter Cascades in the distance](./media/2026-05-22-cross-team-agent-reviews/captured-in-perpetuity.png) + +*A year later, the PR is still there. The reasoning is still there. The "why" — the hardest thing to recover — is captured in the record.* + +Here's the scenario. Eight months from now, someone is looking at an article I submitted to the platform team's content area. They notice the parameter tables use "resource name" consistently and are wondering whether I knew that was a convention or happened to get it right. The PR the platform team's agent reviewed is still there. The agent's comment explains the convention, the comparison points, the reason. The exchange shows I fixed the flags and added the rule to my notes. It's clear I learned the standard. + +More commonly: I'm the one looking at an article I wrote eight months ago, trying to remember the platform team's exact standard for Optional parameters in cross-namespace tools. I don't have the reviewer notes from eight months ago in front of me — I may have updated my template since then. I go back to the PR. The agent's flags are there, the reasoning is there, the exchange is preserved. Reoriented in thirty seconds. + +This is the practical difference between a review record and a review memory. The memory fades. "We talked about the resource naming thing at some point" is not the same as the exchange that explains what the convention is, why it exists, and what it was compared against. The record doesn't fade. + +The before/after is stark. **Before:** I'm looking at a parameter table I wrote eight months ago. It mixes "resource name" and "server name" across different rows. I don't know if this was intentional or a convention violation or an editorial call I made in the moment. The PR history says "LGTM." I have nothing. I'm making a judgment call about which pattern to follow with no basis for the decision. + +**After:** I go to the PR. The agent's comment says "resource name per content area convention, see `storage_account_list`, reason: cross-namespace consistency and tooling normalization." My reply confirms I understood and implemented it. If the table mixes conventions, I know which one is correct. I can fix the inconsistency rather than making another undocumented judgment call. + +That's not a small difference. It's the difference between archaeological guesswork and having an actual record. + + +![The pink-haired developer at the Nooksack River at dusk, holding a lantern beside a marked post — a feeling of something preserved](./media/2026-05-22-cross-team-agent-reviews/nooksack-preserved-reasoning.png) + +*The reasoning, once captured, doesn't drift. It stays exactly where you left it — a marker in the record that future you will actually be able to read.* + +Code comments go stale. Architecture diagrams drift — a diagram from eighteen months ago describes three services; the system now has seven. PR reasoning doesn't drift. A PR from eight months ago that says "this naming convention was intentional, here's why, here's what it was compared against" is useful even if the convention has since changed, because it tells you the decision was deliberate, when it was made, and what information was in scope at the time. You need that to understand whether the current state is a refinement of the original decision or a divergence from it. + +For cross-team work specifically, this matters because you can't always Slack the reviewer from eight months ago and ask "what were you checking when you approved this?" The expert who reviewed it might have moved to a different team. The person who "LGTM'd" it might not remember the context. The agent's review is the record. When the reasoning is explicit in the record, the record is usable. When it's not, you're doing archaeology. + +--- + +## Write the PR to Actually Be Reviewed + +Knowing that a team runs agent reviews changes how you write the PR description before submitting. Not because you need to satisfy the agent — it will run its checks regardless of what you write. Because what makes an agent review useful is the same as what makes a human review useful: clear scope, stated source of truth, explicit questions where you're genuinely uncertain. + +When I'm submitting to a team with agent reviews, my PR description for that content area includes: + +**Goal:** what this PR covers, which source material it's based on, which target format it's following. + +**Source of truth:** a specific commit or version reference for the upstream material I'm working from. Not "based on the upstream repo" — the specific commit, so the agent and any human reviewer can check the exact same source. + +**Scope:** what's in this PR and explicitly what's not, with reasons for each exclusion. The out-of-scope section does the most work. When I've made a deliberate decision to omit something — a parameter that's ambient context, a tool that belongs in a different article, a behavior that's implementation-specific — that decision needs to be visible in the PR. + +**Open questions:** things I'm genuinely uncertain about, stated directly. "I treated `query_timeout` as an ambient parameter rather than including it in the table — is this consistent with how it's handled in `storage_read`?" That's a direct question to the reviewer. The agent either confirms the pattern or flags the deviation. Either way I get a clear answer, not a guess. + +**Reviewer notes:** the standards I've already learned from previous PRs in this content area. This tells the agent and any human reviewer what baseline I'm working from, so flags from this PR add to the baseline rather than re-explaining conventions I've already incorporated. + +The "Open questions" section changes the review from "here is my work, please approve" to "here is my work, here are my decisions, here are the places I wasn't certain — confirm or correct." It produces a more useful review because the agent can address a direct question precisely rather than having to detect a silent assumption. + +There's a side effect I didn't anticipate when I started writing PRs this way: the scope and open-questions sections catch my own drift before submission. Three times in the last six weeks, writing out the scope and excluded items, I noticed I'd edited files that weren't in my stated goal. Two were legitimate scope expansions I should have documented. One was an accidental holdover from an earlier edit session that I reverted before submitting. The description-writing does a pre-review. + +Writing for a specific, thoughtful reviewer changes how you think about what you're doing. The rubber-stamp model lets you be vague because the reviewer will fill in gaps. The cross-team agent model requires precision because the agent will check exactly what you said against exactly what the standards expect. + +--- + +## What This Doesn't Replace + +Cross-team agent reviews handle the documented standards consistently. They don't handle the things that aren't documented yet. + +If the platform team made an architectural decision last month that changed how a category of parameters should be documented, and their reviewer notes haven't been updated to reflect it, their agent won't flag violations of the new approach. The PR will pass their agent review and still be wrong against the current thinking. That requires a human reviewer who knows the current state of the architecture, not just the captured state of the standards. + +The gap between "what the agent knows" and "what the team currently thinks" is permanent and real. Teams evolve faster than reviewer notes. The agent is always applying yesterday's captured knowledge. For stable, well-defined content areas with slow-changing conventions, the gap is small. For areas in active architectural change, it can be significant. + +Agent reviews also don't catch conceptual errors. If I've misunderstood what a tool does at the API level, the agent will verify that my documentation accurately describes my misunderstanding. It checks my descriptions against the documented standard, not against the operational reality. Someone who's actually used the tool in production catches the gap between the documentation and the truth. + +My approach: agent review on every cross-team PR. Human review from someone on the receiving team on anything involving a new content pattern I haven't documented before, a judgment call about scope or architecture, or the first few PRs in a content area where I'm establishing my baseline. The agent handles the documented-standards enforcement; the human handles the delta between the documented standards and current reality. Both are necessary. The agent's consistency makes the human review more efficient — by the time a PR reaches human review, the pattern and convention questions are already resolved. + +When I'm making a deliberate exception to a flagged standard — an edge case I think justifies the deviation — I explain it in the PR description. Not for the agent, which will flag the deviation regardless. For the human who reads the PR later and needs to know whether the exception was intentional and why it was made. + +--- + +## What I've Found From Being On the Receiving End + +The first time a cross-team agent gave me 6 flags, my instinct was to feel like I'd done something wrong. Six flags on a PR sounds like a problem. + +It wasn't. It was five rules I hadn't known and one ambiguous call I now had a clear answer to. The PR before it got "LGTM." I'd violated the same five patterns in that PR too. The "LGTM" just didn't tell me. + +After ten PRs to the platform team's content area, my reviewer notes for that area are substantial. Each entry started as a flag from their agent. The flags have slowed — not because I'm submitting lower-risk PRs, but because the conventions I was violating in the first three PRs are now in my documented baseline. A colleague starting on cross-team PRs to that area can start from my reviewer notes rather than making the first-five-PR mistakes I made. That's not the same as having one of the platform team's experts walk them through the standards — it's a starting point, a documented baseline, something to build from rather than nothing. + +The part I'm still working on: a way to be notified when a team updates their reviewer configuration in ways that affect my established rules. Right now I find out on the next PR when their agent flags a pattern I thought was settled. That's fine — it's a correction mechanism — but proactive notification would be better. I'm submitting to three teams with active agent reviews right now and the reactive-correction approach works. If I were at five or six, I'd be spending more time on the catch-up than on the submissions. + +Six flags with reasoning is worth more than a hundred silent approvals.