diff --git a/docs/squash-merge-recovery-plan.md b/docs/squash-merge-recovery-plan.md
new file mode 100644
index 0000000..f7160ee
--- /dev/null
+++ b/docs/squash-merge-recovery-plan.md
@@ -0,0 +1,517 @@
+# Squash-Merge PR Commit Recovery — Implementation Plan
+
+> **Status: Implemented (2026-06-11) on `feature/squash-merge-recovery`.** All five rollout
+> steps (§8) landed as separate commits — Stage 0 (`feat(stage-0)`), the `on_default_branch`
+> column + migration, the scan enricher, the no-code Step 4 confirmation, and per-line
+> `pr-origin` attribution (`feat(stage-2)`). Full suite green (493 tests). The notes below are
+> preserved as the design record.
+> _Original status: Reviewed — scope locked (all §0 decisions folded in) + consistency pass applied._
+> **Consistency pass (2026-06-11):** corrected config placement — the three Stage-0 rendering caps
+> live in `RationaleConfig` (`[rationale]`), not `AnalyzeConfig`, because `RationaleGenerator` is
+> built from `RationaleConfig` (`config.py:256`); only the enrichment gate `pr_origin_min_commits`
+> sits in `AnalyzeConfig`. Fixed the §4.8 per-line trigger (it must fire for *every* enriched squash,
+> not only file-bulk ones). Added skeleton snippets for the two NEW modules (§4.5, §4.8).
+> **Goal:** when a feature PR is squash-merged into the default branch, restore WhyGraph's
+> analyzing power over the *original* feature-branch commits — surface them (and the PR
+> discussion) as evidence, and attribute each line of the squash commit back to the specific
+> original commit that introduced it.
+> **Date:** 2026-06-11
+
+---
+
+## 0. Resolved scope decisions (from the requester)
+
+| Topic | Decision |
+|---|---|
+| Scope | **Stage 0 + Stage 1.** Stage 2 (graph-aware fallback splitter for local-only / GC'd history) is **out of scope** for this plan. |
+| Attribution | **Per-line mapping required** — map each line of the squash `merge_commit_sha` back to the original PR commit that introduced it, not just PR-level evidence. |
+| Provider | **GitHub only** for Stage 1 (the only provider WhyGraph integrates today — `services/github/`). Azure/GitLab `refs/pull` equivalents are a later provider-abstraction concern. |
+| Link storage | **No PR↔commit table and no `pr_id` column.** Reuse the existing `commit_titles`/`_linked_prs` linkage (many-to-many safe); only the `commit.on_default_branch` flag is new (§4.3). |
+| `commit_titles` retention | **Keep for now**, remove in a later separate migration once `_linked_prs` and Stage 0 are repointed. |
+| Stage-0 caps | **Make them `whygraph.toml`-configurable** under **`[rationale]`** (consumed by the rationale generator) and document them in the example config so developers can tune roster/discussion size (§4.2, §4.11). |
+| Enrichment cadence | **Remote `scan` phase only** (forced by the offline-MCP / `--no-remote` invariants); diffs/descriptions stay lazy (§3.1, §4.7). |
+| Enrichment gate | **Balanced:** squash-detected **AND** (`files_changed > large_commit_file_count` **OR** `len(commit_titles) >= pr_origin_min_commits`, default 5, configurable) (§3.3, §4.11). |
+| Fetch strategy | **Targeted batched** — one `git fetch` carrying only the gated candidates' refspecs; local ref count == #candidates, not #all-PRs (§3.1). |
+| Enrichment default | **On by default**; `--no-pr-origins` disables, always skipped under `--no-remote` (§4.6). |
+
+---
+
+## 1. Goals & Non-Goals
+
+### Goals
+- **Stage 0** — stop discarding data WhyGraph *already stores*: inline a squash PR's
+  `commit_titles` and `comments` into the evidence bundle and the rationale prompt.
+- **Stage 1** — fetch each squash-merged PR's **original feature-branch commits** (full message at
+  scan, diff lazily), persist them as `Commit` rows linked to their PR via the existing
+  `commit_titles` (no new relation), and make them first-class evidence.
+- **Per-line attribution** — when a blame hit lands on a squash commit, re-attribute the queried
+  line range to the original PR commits that authored those lines.
+
+### Non-Goals
+- **Stage 2 / LLM or graph-based diff splitting** — *out of scope.* When no provider data exists
+  (local-only history, GC'd branch, non-PR squash), the commit keeps today's bulk-stub behaviour.
+  Reconstructing boundaries from the diff is explicitly deferred.
+- **Non-GitHub providers** — Azure DevOps / GitLab have no `refs/pull/*/head`; their enrichment is
+  deferred. Stage 0 is provider-agnostic and benefits them for free.
+- **Re-describing the squash commit whole-diff** — the bulk-stub path (`scan/analyze_crawler.py`,
+  `analyze/backfill.py`) is **unchanged on purpose**; we add a parallel origin-commit signal, we do
+  not resurrect the expensive whole-squash-diff LLM call.
+- **Changing the first-parent main-branch walk** — `scan/git_crawler.py`'s notion of "a commit on
+  the default branch" is untouched; origin commits are tagged so they never leak into the
+  main-walk-only queries (area-history, refactor-walk).
+
+---
+
+## 2. How the system works today (reference map)
+
+| Concern | Location | Current behaviour |
+|---|---|---|
+| PR ingestion | `services/github/pull_requests.py:43-56` | GraphQL already pulls `commits(first: 250)` (oid, headline, author) **and** `comments(first: 100)`. |
+| PR value object | `services/github/pull_request.py:97-115` | `commits: tuple[CommitSummary,...]`, `comments: tuple[Comment,...]` already parsed. |
+| PR persistence | `scan/github_crawler.py:92-129` | `commit_titles` ← JSON of `{oid, headline, author_*}`; `comments` ← JSON of `{author, body, created_at}`. **Both stored, neither surfaced.** |
+| PR row | `db/models/pull_request.py:33,41-46` | has `merge_commit_sha` (indexed), `head_sha`, `commit_titles`, `comments`. **No merge-method field.** |
+| Squash→PR link | `mcp/evidence.py:81-101` (`_linked_prs`) | matches `merge_commit_sha == sha` OR `head_sha == sha` OR oid in `commit_titles`. A squash commit **is** the `merge_commit_sha`, so the link already works. |
+| Evidence serialization | `mcp/evidence.py:465-476` (`_pr_dict`) | emits only number/title/body/state/merged_at/author/html_url/labels — **drops `commit_titles` and `comments`.** |
+| Rationale prompt | `analyze/rationale_generator.py:93-102` (`_format_pr`) | renders only PR title + body — **drops commits and comments.** |
+| Evidence join | `mcp/evidence.py:200-217` | per blame hunk: `session.get(Commit, hunk.sha)`; drops the hunk if no `Commit` row exists. |
+| Blame at a rev | `services/git/repository.py:190-258` + `commands.py:125-209` | `blame(path, a, b, rev=<sha>)` already supported; **predecessor-blame uses `rev=parent_sha`** (`evidence.py:299-324`). `-w -M -C` always on. |
+| Source labels/priority | `evidence.py:39-44`, `rationale_generator.py:119-124`, `rationale.py:37-62` | `blame > blame-walked > predecessor-blame > area`; each label has a human string. |
+| Commit row | `db/models/commit.py:11-32` | PK `sha`; has `parent_shas`, `files_changed`, `refactor_score`, `llm_description`. **No origin discriminator.** |
+| PR→commit linkage | `mcp/evidence.py:69-101` (`_commit_titles_contain_oid`, `_linked_prs`) | already resolves a commit to its PR(s) by oid-in-`commit_titles`; returns *all* matching PRs (many-to-many safe). Reused instead of a new link table (see §4.3). |
+| Lazy diff/description | `analyze/backfill.py` (`backfill_file_description`, `backfill_all`); `evidence.py:362-449` | per-file diff via `Repository.diff(commit, pathspec=...)`, cached on the `commit_file_change` row. The reuse target for origin-commit diffs. |
+| Model registration | `db/models/__init__.py:31-47` | a new model must be imported here for Alembic autogenerate to see it. |
+
+**Key reuse insight (de-risks per-line attribution):** git already does hunk-matching. A squash
+merge applies the PR's `base..head` diff as one commit, so for a changed file the squash tree
+equals `head_sha`'s tree. Blaming the *same* `path:line_start-line_end` at **`rev=head_sha`** maps
+each line to the original PR commit — exactly the mechanism predecessor-blame already uses with
+`rev=parent_sha`. No bespoke hunk re-diffing is required; we add a branch, not an algorithm.
+
+---
+
+## 3. Issues found / foreseen problems
+
+### 3.1 Original PR commits are not in the local object store (Us / enabling)
+After a normal clone, `refs/pull/*/head` are **not** fetched, so the squashed feature-branch
+commits aren't local — `git blame rev=head_sha` and `Repository.diff` would fail on unknown SHAs.
+- **Mitigation:** during the `scan` GitHub phase (already remote), one **targeted batched** fetch
+  brings only the gated candidates (§3.3): a single `git fetch origin` call carrying one
+  `refs/pull/<N>/head:refs/whygraph/pull/<N>` refspec per candidate PR. One round-trip, and the local
+  ref count equals #candidates — **not** the wildcard `refs/pull/*` (which would litter `.git` with a
+  ref per PR and slow every git op on large repos). Storing under our own `refs/whygraph/pull/*`
+  namespace pins the objects against local GC. All later blame/diff is then **offline** — preserving
+  the MCP-server and `--no-remote` git-hook invariants (no network at query time).
+- **Required from provider:** GitHub retains `refs/pull/N/head` indefinitely (survives branch
+  deletion). True for GitHub; **not** for Azure/GitLab → why they're deferred.
+
+### 3.2 Squash tree may not exactly equal `head_sha` (Mixed / Low)
+"Rebase and merge", or a squash re-applied onto an advanced base, can shift surrounding context.
+Within a changed file the *line content* still matches (same diff), so `git blame`'s own
+move/whitespace-tolerant matching (`-w -M -C`) absorbs the drift.
+- **Mitigation:** per-line re-blame is **best-effort**, mirroring predecessor-blame (`evidence.py:314-323`
+  swallows `GitError` per event). On mismatch we keep PR-level evidence (Stage 1) — never worse than today.
+
+### 3.3 DB volume if every PR's commits are stored (Us / Medium)
+5,000 PRs × ~30 commits = ~150k extra `commit` rows if we enriched every merged PR.
+- **Mitigation — balanced gate (resolved Q2).** Enrich a PR only when it is squash-detected (§3.5)
+  **and** either of the two distinct loss signals fires:
+  - **file-bulk** — `merge_commit_sha`'s `files_changed > analyze.large_commit_file_count` (a "huge
+    commit" — the squash that lost the most *description* fidelity, the stub case); **or**
+  - **commit-rich** — `len(commit_titles) >= analyze.pr_origin_min_commits` (default 5) — a squash
+    that collapsed many commits' worth of *narrative*, even if few files changed.
+  File-count and commit-count are **distinct signals**: the first is about description cost, the
+  second about how much per-commit rationale was destroyed. The `OR` catches both; the threshold knob
+  (§4.11) caps volume. Small single-/few-commit PRs that merged as their own commits already work via
+  the existing `commit_titles` link and are skipped.
+
+### 3.4 Origin commits must not pollute main-walk queries (Us / High)
+`commit` today means "first-parent walk of the default branch". area-history (`path_history.py`) and
+refactor-walk (`_boring_shas_in`, `evidence.py:286-296`) assume that.
+- **Mitigation:** add an `on_default_branch` discriminator to `commit` (default `1`); origin commits
+  insert as `0`. Gate area-history and refactor-walk on `on_default_branch == 1`. The per-line
+  blame-at-`head_sha` path resolves origin commits via `session.get(Commit, sha)` regardless.
+
+### 3.5 No stored merge method (Them→Us / Low)
+GraphQL gives `mergeCommit{oid}` but WhyGraph stores no squash/rebase/merge flag.
+- **Mitigation:** we don't need the method. **Squash detection** = a merged PR whose `commit_titles`
+  oids are **absent from the `commit` table** (i.e. the originals are not on the main walk). That PR
+  then passes the §3.3 balanced gate (file-bulk OR commit-rich) to be enriched. Self-correcting: if
+  the originals *are* on main (a plain merge / rebase), the normal path already wins and we skip.
+
+---
+
+## 4. Detailed changes (file by file)
+
+### Phase 0 (Stage 0) — surface already-stored data
+
+#### 4.1 `mcp/evidence.py` — `_pr_dict` (line 465)
+Add the two dropped fields (decode with the existing `_json_list` helper, line 58):
+```python
+def _pr_dict(pr: PullRequest) -> dict:
+    return {
+        # ...existing keys unchanged...
+        "labels": _json_list(pr.labels),
+        "commit_titles": _json_list(pr.commit_titles),  # NEW
+        "comments": _json_list(pr.comments),             # NEW
+    }
+```
+No new query — both columns are already on the loaded `PullRequest` row (and already detached via
+`session.expunge_all()` at `evidence.py:217`). **Intentionally uncapped:** `_pr_dict` feeds the raw
+`whygraph_evidence_for` JSON, whose consumer is an agent that can handle the full list — the
+context-budget caps (§4.2) apply only to the LLM *rationale prompt*, not to this tool output.
+
+#### 4.2 `analyze/rationale_generator.py` — `_format_pr` (line 93)
+After the body block, append a compact commit roster and the discussion so the LLM sees the
+narrative that the squash destroyed. Reuse `_indent_block` (line 69) and the JSON-decode idiom from
+`_labels_suffix` (line ~57) — add a small local `_json_list` (mirror `evidence.py:58`). The three
+caps are passed in (see threading note below), not read globally:
+```python
+def _format_pr(pr: PullRequest, caps: RationaleConfig) -> list[str]:   # or a small caps tuple
+    # ...existing title/body lines unchanged...
+    titles = _json_list(pr.commit_titles)[: caps.pr_roster_max_commits]
+    if titles:
+        lines.append("    Squashed commits:")
+        for c in titles:
+            lines.append(f"      - {c.get('headline','')}  ({(c.get('oid') or '')[:9]})")
+    comments = _json_list(pr.comments)[: caps.pr_discussion_max_comments]
+    if comments:
+        lines.append("    Discussion:")
+        for cm in comments:
+            who = cm.get("author") or "unknown"
+            body = (cm.get("body") or "").strip()[: caps.pr_comment_max_chars]
+            lines.append(_indent_block(f"[{who}] {body}", "      "))
+    return lines
+```
+**Execution gotcha:** `from_config` today *discards* the `RationaleConfig` after pulling
+`provider`/`model`/`timeout_sec` (`rationale_generator.py:393`). To make caps reach `_format_pr`,
+retain the three cap ints (or the whole `RationaleConfig`) on the generator in `__init__`, then pass
+them into `_format_evidence(evidence, caps)` at the call site (line 436).
+**Config source & threading (corrected in the consistency pass).** The caps come from
+**`RationaleConfig`** (the `[rationale]` table — §4.11), **not** `AnalyzeConfig`: the generator is
+built via `RationaleGenerator.from_config(get_config().rationale)` (`rationale_generator.py:360-393`),
+so `AnalyzeConfig` never reaches it. `_format_pr`/`_format_evidence` are **module-level** functions
+called from `generate()` (`bundle = _format_evidence(evidence)`, line 436). Thread the caps by:
+storing them on the generator in `from_config`/`__init__`, then passing a small `caps` value into
+`_format_evidence(evidence, caps)` → `_format_pr(pr, caps)`. Keeping them as a parameter (not a
+global `get_config()` reach-in) keeps the formatters pure and unit-testable. **Bounding rationale
+(resolved Q4):** the rationale bundle is already unbounded (the documented "evidence-bundle builder"
+TODO in this module), so this stays a prompt-size guard, not a feature — but developers can tune it.
+
+### Phase 1 (Stage 1) — provider enrichment + persistence
+
+#### 4.3 No PR↔commit link table — reuse the existing `commit_titles` linkage
+We do **not** add a join table. `_linked_prs()` (`mcp/evidence.py:81-101`) already resolves a
+commit to its PR(s) by matching the oid inside the PR's `commit_titles` JSON
+(`_commit_titles_contain_oid`, line 69). Recovered origin commits *come from* `commit_titles`, so
+once inserted as `Commit` rows the existing query finds their PR(s) for free — and because it
+returns **all** matching PRs, the many-to-many edge (a commit shared across stacked / backport PRs)
+is handled without a column or a join table. A single `pr_id` column on `commit` was rejected: it
+cannot represent that many-to-many and would silently drop links. Commit ordering is taken from
+`committed_at` (evidence is already sorted that way, `evidence.py:231`), so no `position` field is
+needed either. This is the "don't add the abstraction until a second concrete case forces it" rule
+(CLAUDE.md) applied: nothing in scope queries "all original commits of PR #N" relationally.
+
+#### 4.4 `db/models/commit.py` — origin discriminator (line 32)
+```python
+    on_default_branch: int = Field(
+        default=1, sa_column_kwargs={"server_default": text("1")}
+    )  # 0 = recovered PR-origin commit, not on the first-parent main walk
+```
+`int` 0/1 to keep SQLite INTEGER affinity (same rationale as `PullRequest.draft`,
+`pull_request.py:20-22`).
+
+#### 4.5 NEW: `scan/pr_origin_enricher.py`
+A scan sub-phase (driven from `scan/crawler.py`, after `github_crawler` so PR rows exist).
+**Candidate selection (§3.3, §3.5):** a merged PR is a candidate when its `commit_titles` oids are
+absent from `commit` (squash) **and** (`merge_commit_sha`'s `files_changed > large_commit_file_count`
+**or** `len(commit_titles) >= pr_origin_min_commits`). Then:
+1. Ensure objects are local: **one targeted batched** `git fetch origin <refspec…>` carrying a
+   `refs/pull/<N>/head:refs/whygraph/pull/<N>` refspec for **each candidate PR only** (not the
+   `refs/pull/*` wildcard) — one round-trip, candidate-many local refs.
+2. For each oid in `commit_titles`: insert a `Commit` row (`on_default_branch=0`) — parse via the
+   existing `git log -1 <oid>` path (reuse `GitLogShortstatCmd` shape / `Commit.from_git_log`,
+   `commands.py:250-279`) so full message body + stats come from git, not just the headline.
+3. Leave `llm_description` NULL — diffs/descriptions are lazy (4.7).
+No link row is written — the PR↔commit association is already carried by `commit_titles` and
+resolved at query time by `_linked_prs` (4.3). Idempotent: skip oids already present in `commit`
+(mirror `github_crawler.py:67-75`).
+
+Skeleton (mirrors `GitHubCrawler.work` session/idempotency shape, `github_crawler.py:59-89`):
+```python
+def enrich(repo: Repository, *, min_commits: int, large_commit_file_count: int) -> None:
+    with get_session() as session:
+        existing: set[str] = set(session.exec(select(Commit.sha)).all())
+        candidates = []
+        for pr in session.exec(select(PullRequest).where(col(PullRequest.merged_at).is_not(None))):
+            oids = [c["oid"] for c in _json_list(pr.commit_titles) if c.get("oid")]
+            if not oids or all(o in existing for o in oids):
+                continue  # not a squash (originals already on main) → skip
+            squash = session.get(Commit, pr.merge_commit_sha) if pr.merge_commit_sha else None
+            file_bulk = bool(squash and squash.files_changed > large_commit_file_count)
+            if not (file_bulk or len(oids) >= min_commits):       # balanced gate (§3.3)
+                continue
+            candidates.append((pr, [o for o in oids if o not in existing]))
+        if not candidates:
+            return
+        # ONE batched fetch — only candidate refs, not refs/pull/* wildcard (§3.1):
+        refspecs = [f"refs/pull/{pr.number}/head:refs/whygraph/pull/{pr.number}" for pr, _ in candidates]
+        repo.fetch_refs(refspecs)                                 # NEW thin Repository method (4.5a)
+        for pr, new_oids in candidates:
+            for oid in new_oids:
+                row = repo.commit_metadata(oid)                   # reuse Commit.from_git_log (commands.py:250-279)
+                session.add(_to_commit_row(row, on_default_branch=0))
+```
+
+#### 4.5a `services/git/` — two thin additions
+The enricher needs two small read/fetch helpers; both are new `ShellCommand` pairs in
+`services/git/commands.py` + thin `Repository` methods (mirror `GitDiffCmd`/`Repository.diff`,
+`commands.py:90-122` / `repository.py:146-188`):
+- `fetch_refs(refspecs: list[str])` → `git fetch origin <refspec…>` (one process, many refspecs).
+- `commit_metadata(oid)` → `git log -1 --pretty=… --shortstat <oid>` parsed by the **existing**
+  `Commit.from_git_log` (`commands.py:250-279`); do not write a new parser.
+
+#### 4.6 `scan/crawler.py` + `cli/commands/scan.py`
+Wire the new phase under a `--pr-origins / --no-pr-origins` flag, **default on** (resolved Q1/default):
+squash recovery is the point, so it runs unless opted out. Always **skipped under `--no-remote`** (the
+fetch needs network — like the other remote phases). Mirror the existing `--codegraph/--no-codegraph`
+flag wiring.
+
+#### 4.7 `analyze/backfill.py` + `mcp/evidence.py:backfill_evidence_descriptions`
+Origin commits are normal-sized real commits → reuse `backfill_all` / the whole-diff path
+(`evidence.py:408-449`) as-is; `Repository.diff(commit)` now resolves because the object is local
+(4.5 step 1). **No new descriptor code** — only ensure origin commits flow through the existing
+`normal` branch (they will: `files_changed <= threshold` and `llm_description is None`).
+
+### Phase 2 — per-line attribution
+
+#### 4.8 `mcp/evidence.py` — new `_attribute_squash_origins` signal
+Model on `_predecessor_blame` (line 299) and `_walk_past_boring` (line 244). For each blame hit SHA
+that is the **`merge_commit_sha` of an *enriched* squash PR**, re-blame the same range at the PR's
+`head_sha` and emit each resulting original commit as `source="pr-origin"`.
+
+**Trigger condition (fixed in the consistency pass).** The trigger is **not** "the squash is a bulk
+commit" — that would *skip* commit-rich-but-small squashes that the §3.3 gate enriched (file-bulk
+**OR** commit-rich). The correct, gate-agnostic predicate is: *the blame SHA equals a PR's
+`merge_commit_sha` and that PR has ≥1 origin `Commit` row* (i.e. it was actually enriched, so
+`head_sha`'s objects are local). That naturally covers exactly the PRs Stage 1 enriched.
+```python
+def _attribute_squash_origins(repo, target, *, blame_shas, session) -> list[BlameHunk]:
+    out: list[BlameHunk] = []
+    for pr in _enriched_squash_prs_for(session, blame_shas):   # merge_commit_sha in blame_shas
+                                                               # AND has on_default_branch=0 origins
+        try:
+            hunks = repo.blame(target.path, target.line_start, target.line_end, rev=pr.head_sha)
+        except GitError:
+            continue                                           # best-effort, mirrors line 321
+        out.extend(h for h in hunks if not h.is_uncommitted)
+    return out
+```
+Feed the result into the existing `labeled_hunks` list (line 194) tagged `"pr-origin"`; the dedupe /
+priority / cap machinery (`_should_replace`, `_SOURCE_PRIORITY`, the `limit` slice) then needs no
+further change. Each hunk resolves via the existing `session.get(Commit, sha)` (origin rows exist) and
+`_linked_prs` (via `commit_titles`).
+
+#### 4.9 Source-label plumbing (three small edits)
+- `evidence.py:39-44` — add `"pr-origin": 0.5` to `_SOURCE_PRIORITY` (just below `blame`=0: a real
+  authoring commit reached through the squash is high-precision).
+- `rationale_generator.py:119-124` — add `"pr-origin": "original commit recovered from a squash-merged PR"`.
+- `rationale.py:37-62` — extend the `source` docstring enum (doc-only).
+
+#### 4.10 Gate main-walk-only queries on `on_default_branch` (defensive)
+- `evidence.py:_boring_shas_in` (286-296) — add `.where(col(Commit.on_default_branch) == 1)`.
+- `path_history.py:area_history_commits` (the `Commit`⨝`CommitFileChange` join at `path_history.py:116-121`)
+  — same guard.
+
+**Why "defensive":** origin commits get **no** `commit_file_change` rows (the enricher writes only
+`Commit` rows, §4.5) and default `refactor_score=0`, so they are *already* naturally excluded from
+both area-history (which joins through `commit_file_change`) and refactor-walk (which filters
+`refactor_score >= BORING_THRESHOLD`). These guards make the invariant explicit and protect against a
+future broad `select(Commit)` consumer; they are belt-and-suspenders, not load-bearing — which is why
+acceptance criterion 3 (byte-identical area-history) holds trivially.
+
+#### 4.11 `core/config.py` + `whygraph.example.toml` — new config (resolved Q4)
+Two distinct config homes, by *which component consumes the value* (corrected in the consistency pass):
+
+**(a) Rendering caps → `RationaleConfig` (`config.py:256`), `[rationale]` table.** Consumed by the
+rationale generator (§4.2). Add three fields (mirror `RationaleConfig`'s existing field shape):
+```python
+    pr_roster_max_commits: int = 30        # max squashed-commit headlines rendered into a PR block
+    pr_discussion_max_comments: int = 20   # max PR comments rendered into a PR block
+    pr_comment_max_chars: int = 500        # per-comment body clip
+```
+```toml
+[rationale]
+# pr_roster_max_commits = 30        # squashed-commit headlines shown per PR in rationale evidence
+# pr_discussion_max_comments = 20   # PR comments shown per PR in rationale evidence
+# pr_comment_max_chars = 500        # each PR comment clipped to this length
+```
+
+**(b) Enrichment gate → `AnalyzeConfig` (`config.py:248-252`), `[analyze]` table.** Consumed by the
+scan enricher (§4.5). Add one field next to `large_commit_file_count`:
+```python
+    pr_origin_min_commits: int = 5         # commit-rich half of the Stage-1 enrichment gate (§3.3)
+```
+```toml
+[analyze]
+# pr_origin_min_commits = 5         # enrich a squash PR's original commits once it collapsed >= this many
+```
+`pr_origin_min_commits` is the commit-count half of the balanced gate (§3.3); the file-bulk half
+reuses the existing `large_commit_file_count`. **All four** new fields get the same `>= 1` validation
+that `max_diff_chars`/`large_commit_file_count` already have (extend the validation pass at
+`config.py:469-477`). Document them commented-out in `whygraph.example.toml` alongside the existing
+`# max_diff_chars` / `# large_commit_file_count` lines under `[analyze]` (`whygraph.example.toml:23-28`)
+and under the existing `[rationale]` table (`whygraph.example.toml:33`).
+
+---
+
+## 5. Schema / migration
+
+One Alembic migration (`db/migrations/`) for a single additive change: the `commit.on_default_branch`
+column (4.4). No new table (see 4.3). `commit_titles`/`comments` already exist — Phase 0 needs **no**
+migration. Follow the existing `migrations/versions/` autogenerate flow; the column is additive with a
+server default, so existing rows backfill to `on_default_branch=1` and re-scans are safe.
+
+---
+
+## 6. Acceptance criteria
+
+1. **Stage 0:** `whygraph_evidence_for` on a line owned by a squash commit returns that commit's PR
+   with non-empty `commit_titles` and `comments`; the rationale prompt shows a "Squashed commits"
+   roster and "Discussion". No DB migration required for this criterion.
+2. **Stage 1:** after `whygraph scan` on a repo with a squash-merged PR, the `commit` table contains
+   the PR's original commits with `on_default_branch=0`, and `_linked_prs` resolves each back to the
+   PR via `commit_titles` (no link table).
+3. **Stage 1 isolation:** area-history and refactor-walk results are byte-identical to pre-change for
+   a repo with no squash PRs (origin commits never leak into those paths).
+4. **Per-line:** for a line in a squash commit, the evidence bundle contains a `source="pr-origin"`
+   entry whose commit is the original feature-branch commit that authored that line (verified against
+   `git blame <head_sha>`).
+5. **Offline & `--no-remote`:** the MCP query path and `whygraph scan --no-remote` make **no** network
+   calls; enrichment fetch happens only in the remote scan phase.
+6. **Graceful degrade:** a non-GitHub repo, a GC'd/absent PR ref, or a squash-vs-head mismatch yields
+   today's behaviour (PR-level evidence + bulk stub), never an error.
+
+---
+
+## 7. Testing plan
+
+### Unit
+1. `_pr_dict` includes decoded `commit_titles` + `comments`; malformed JSON → `[]` (via `_json_list`).
+2. `_format_pr` renders roster + discussion; respects the caps; empty inputs add no lines.
+2b. Cap config: the three new `RationaleConfig` fields default correctly, reject `< 1` (mirror
+   `max_diff_chars` validation), and a lowered cap actually truncates the rendered roster/discussion
+   (passed in via the generator, not read from a global).
+3. `commit.on_default_branch` defaults to 1, origin insert sets 0; `_linked_prs` resolves an inserted
+   origin commit to its PR via `commit_titles` (incl. a commit shared across two PRs → both returned).
+4. Balanced gate: squash (oids-absent-from-`commit`) + file-bulk → enrich; squash + commit-count ≥
+   `pr_origin_min_commits` → enrich; squash but below both thresholds → skip; merged-as-own-commits
+   (oids present on main) → skip regardless of size.
+5. `_boring_shas_in` / area-history exclude `on_default_branch=0` rows.
+6. Source priority: a SHA surfacing as both `pr-origin` and `area` is kept as `pr-origin`.
+
+### Integration (fixture repo with a real squash merge)
+7. End-to-end scan → origin `commit` rows populated with `on_default_branch=0`; the targeted batched
+   `git fetch` is mocked to a local bundle (assert it requests only candidate refspecs, not the
+   wildcard) so the test is offline.
+8. `whygraph_evidence_for` on a squash line returns a `pr-origin` entry matching `git blame <head_sha>`.
+8b. Per-line trigger covers a **commit-rich but non-file-bulk** squash (≥ `pr_origin_min_commits`
+   commits, ≤ `large_commit_file_count` files): it was enriched, so its line gets a `pr-origin` entry —
+   guards against the §4.8 trigger regressing to a file-bulk-only check.
+
+### Regression
+9. Repo with **no** PRs / no remote: evidence + rationale output unchanged vs. `main`.
+10. `tests/test_smoke.py` invariants intact (package imports; MCP server named `"whygraph"`).
+
+---
+
+## 8. Rollout order (single-shot recipe)
+
+### Step 1 — Stage 0: surface `commit_titles` + `comments` (+ rendering-cap config)
+Add the three **`RationaleConfig`** cap fields + `>= 1` validation + `[rationale]` example-toml lines
+(4.11a); edit `_pr_dict` (4.1, uncapped) and `_format_pr` (4.2, caps passed in); thread the caps from
+the generator into `_format_evidence`/`_format_pr`; add a local `_json_list` in the generator.
+
+**Model:** Sonnet 4.6 · **Complexity:** Low
+**Why this model:** Well-specified renderers reading columns already on the row + three config fields
+that mirror existing siblings; no schema migration, no queries.
+**Execution notes:** Caps live in **`RationaleConfig`** (`config.py:256`), **not** `AnalyzeConfig` —
+the generator is built from `RationaleConfig` (§4.2). Store them on the generator in `from_config`/
+`__init__` and pass into `_format_evidence(evidence, caps)` → `_format_pr(pr, caps)`; do **not** reach
+into `get_config()` from the module-level formatters. Add a local `_json_list` (mirror `evidence.py:58`)
+and reuse `_indent_block` (`rationale_generator.py:69`). Mirror `max_diff_chars`'s `>= 1` validation
+(`config.py:469-477`). `_pr_dict` stays **uncapped** (§4.1). Do not touch the bulk-stub path. Do **not**
+add `pr_origin_min_commits` here — that field belongs to Step 3 (the enricher). Ships independently.
+**Verify:** `uv run pytest tests/ -k "pr_dict or format_pr or rationale_config"`; manual `whygraph_evidence_for` on a known squash line.
+
+### Step 2 — Schema: `commit.on_default_branch` + migration
+Add the column (4.4) to `db/models/commit.py`; autogenerate one additive Alembic migration (§5). No new
+table — the PR↔commit link is the existing `commit_titles` (4.3).
+
+**Model:** Sonnet 4.6 · **Complexity:** Low
+**Why this model:** A single column add + an additive migration; nothing to design.
+**Execution notes:** Column default uses `server_default=text("1")` (mirror `pull_request.py:28`). Run
+the project's Alembic autogenerate flow; review the generated migration is additive-only (no table
+rewrite). Do **not** add a join table or a `pr_id` column (§4.3 explains why).
+**Verify:** `uv run alembic upgrade head` on a copy of a scanned DB; existing rows show `on_default_branch=1`.
+
+### Step 3 — Enricher: fetch + persist original commits
+Add `AnalyzeConfig.pr_origin_min_commits` + validation + `[analyze]` example-toml line (4.11b). Add the
+two thin git helpers `fetch_refs` / `commit_metadata` (4.5a). Build `scan/pr_origin_enricher.py` (4.5),
+wire the phase + `--pr-origins` flag (4.6), apply the §4.10 query guards.
+
+**Model:** Opus 4.8 · **Complexity:** High
+**Why this model:** New scan phase with network, idempotency, balanced-gate detection, and the
+`on_default_branch` isolation contract — correctness across re-scans and `--no-remote` matters.
+**Execution notes:** **One targeted batched** `git fetch` carrying only the gated candidates' refspecs
+(§3.1) — **not** the `refs/pull/*` wildcard. Apply the balanced gate (§3.3) exactly as in the §4.5
+skeleton: squash (oids-absent-from-`commit`) AND (file-bulk OR `len(commit_titles) >= pr_origin_min_commits`).
+`pr_origin_min_commits` is added **here** (`AnalyzeConfig`), not in Step 1. Default the phase **on**; skip
+entirely under `--no-remote`. Reuse `Commit.from_git_log` for full bodies via the new `commit_metadata`
+helper (4.5a); do **not** invent a new commit parser. Apply the §4.10 guards in the same step or
+area-history could regress for a future consumer.
+**Verify:** integration test 7; unit test 4; regression test 9.
+
+### Step 4 — Lazy diffs for origin commits
+Confirm origin commits flow through the existing `normal` backfill branch (4.7).
+
+**Model:** Haiku 4.5 · **Complexity:** Low
+**Why this model:** Likely zero code — verifying the existing `backfill_all` path already covers them.
+**Execution notes:** Origin commits are `files_changed <= threshold` with `llm_description is None`, so
+`backfill_evidence_descriptions` (`evidence.py:408-412`) already routes them to `backfill_all`. Only add
+code if a test proves they're missed. Do not special-case them.
+**Verify:** integration test 8 shows a populated `llm_description` on a `pr-origin` commit.
+
+### Step 5 — Per-line attribution signal + labels
+Add `_attribute_squash_origins` (4.8) and the three label edits (4.9).
+
+**Model:** Opus 4.8 · **Complexity:** High
+**Why this model:** Novel evidence signal in the dedupe/priority machinery; correctness of the
+`rev=head_sha` re-blame and best-effort error handling is the crux of the feature.
+**Execution notes:** Model the function on `_predecessor_blame` (`evidence.py:299-324`) — same
+`rev=`-blame call, same per-event `GitError` swallow. Feed its hunks into the existing `labeled_hunks`
+list (line 194) with `source="pr-origin"`; the dedupe/priority/cap machinery then needs no change beyond
+the `_SOURCE_PRIORITY` entry. Do not add a second blame implementation.
+**Verify:** unit test 6; integration tests 8 & 8b; acceptance criteria 4 & 5.
+
+---
+
+## 9. Open Questions for the reviewer
+
+**None outstanding** — all six review forks are resolved and folded into §0 (scope, attribution,
+provider, link storage, `commit_titles` retention, Stage-0 caps, enrichment cadence, enrichment gate,
+fetch strategy, enrichment default). The plan is ready for final approval to implement.
+
+---
+
+## 10. Summary
+
+A squash merge doesn't destroy a feature's history — it relocates it to the provider, and WhyGraph
+already ingests most of it (`commit_titles`, `comments`) then drops it at the last serialization step.
+**Stage 0** stops dropping it (no schema, immediate win). **Stage 1** fetches the squash PR's original
+commits once during the remote scan and stores them as `Commit` rows flagged `on_default_branch=0` —
+linked back to their PR through the existing `commit_titles`/`_linked_prs` path (no new table) — so
+they enrich evidence without polluting the main-walk queries. **Per-line
+attribution** reuses the existing `blame(rev=…)` machinery — blaming the queried range at the PR's
+`head_sha` lets git itself map each squashed line to its original commit, mirroring how
+predecessor-blame already crosses renames. The expensive, low-fidelity alternative (LLM/graph diff
+splitting) is explicitly deferred; every degraded path falls back to today's safe behaviour.
diff --git a/src/whygraph/analyze/rationale.py b/src/whygraph/analyze/rationale.py
index 9bfbc0e..44324dd 100644
--- a/src/whygraph/analyze/rationale.py
+++ b/src/whygraph/analyze/rationale.py
@@ -39,6 +39,10 @@ class CommitEvidence:
 
         * ``"blame"`` — line-level attribution from the target's current
           range (highest-precision signal).
+        * ``"pr-origin"`` — an original feature-branch commit recovered
+          from a squash-merged PR: when the queried lines blame to a
+          squash commit, they are re-blamed at the PR's ``head_sha`` so
+          each line maps back to the commit that actually authored it.
         * ``"blame-walked"`` — surfaced only after blame walked past a
           refactor-heavy commit. Still line-level, but one or more boring
           commits were skipped to reach this author.
diff --git a/src/whygraph/analyze/rationale_generator.py b/src/whygraph/analyze/rationale_generator.py
index 7cde31b..341e57d 100644
--- a/src/whygraph/analyze/rationale_generator.py
+++ b/src/whygraph/analyze/rationale_generator.py
@@ -21,6 +21,7 @@
 
 import json
 from collections.abc import Sequence
+from dataclasses import dataclass
 
 from whygraph.core.config import RationaleConfig
 from whygraph.db.models import Commit, Issue, PullRequest
@@ -66,6 +67,53 @@ def _labels_suffix(raw: str) -> str:
     return "  [" + ", ".join(str(label) for label in labels) + "]"
 
 
+def _json_list(raw: str | None) -> list:
+    """Decode a JSON-encoded list column; empty list on anything malformed.
+
+    Mirrors :func:`whygraph.mcp.evidence._json_list` — kept local so the
+    formatters do not depend on the MCP layer.
+    """
+    if not raw:
+        return []
+    try:
+        parsed = json.loads(raw)
+    except (TypeError, json.JSONDecodeError):
+        return []
+    return parsed if isinstance(parsed, list) else []
+
+
+@dataclass(frozen=True, slots=True)
+class _PrRenderCaps:
+    """Size caps for rendering a PR block into the rationale prompt.
+
+    Passed explicitly into the module-level formatters (rather than read
+    from a global :func:`get_config`) so they stay pure and unit-testable.
+    Defaults mirror :class:`~whygraph.core.config.RationaleConfig`.
+
+    Attributes
+    ----------
+    pr_roster_max_commits : int
+        Max squashed-commit headlines rendered into a PR block.
+    pr_discussion_max_comments : int
+        Max PR comments rendered into a PR block.
+    pr_comment_max_chars : int
+        Per-comment body clip.
+    """
+
+    pr_roster_max_commits: int = 30
+    pr_discussion_max_comments: int = 20
+    pr_comment_max_chars: int = 500
+
+    @classmethod
+    def from_config(cls, config: RationaleConfig) -> "_PrRenderCaps":
+        """Project the three rendering caps out of a :class:`RationaleConfig`."""
+        return cls(
+            pr_roster_max_commits=config.pr_roster_max_commits,
+            pr_discussion_max_comments=config.pr_discussion_max_comments,
+            pr_comment_max_chars=config.pr_comment_max_chars,
+        )
+
+
 def _indent_block(text: str, prefix: str) -> str:
     """Indent every line of ``text`` by ``prefix``."""
     return "\n".join(prefix + line for line in text.splitlines())
@@ -90,8 +138,14 @@ def _format_commit(commit: Commit) -> list[str]:
     return lines
 
 
-def _format_pr(pr: PullRequest) -> list[str]:
-    """Render one pull request as the indented lines of an evidence block."""
+def _format_pr(pr: PullRequest, caps: _PrRenderCaps = _PrRenderCaps()) -> list[str]:
+    """Render one pull request as the indented lines of an evidence block.
+
+    Appends the squashed-commit roster and the PR discussion so the LLM
+    sees the narrative a squash merge collapsed. Both are clipped by
+    ``caps`` to bound the prompt size; ``pr.commit_titles`` / ``pr.comments``
+    are JSON-encoded list columns decoded via :func:`_json_list`.
+    """
     when = f"merged {pr.merged_at}" if pr.merged_at else pr.state
     author = f"by {pr.author}" if pr.author else "by unknown"
     lines = [f"  PR #{pr.number}  {author}  {when}{_labels_suffix(pr.labels)}"]
@@ -99,6 +153,23 @@ def _format_pr(pr: PullRequest) -> list[str]:
     if pr.body and pr.body.strip():
         lines.append("    Body:")
         lines.append(_indent_block(pr.body.strip(), "      "))
+    titles = _json_list(pr.commit_titles)[: caps.pr_roster_max_commits]
+    if titles:
+        lines.append("    Squashed commits:")
+        for c in titles:
+            if not isinstance(c, dict):
+                continue
+            oid = (c.get("oid") or "")[:9]
+            lines.append(f"      - {c.get('headline', '')}  ({oid})")
+    comments = _json_list(pr.comments)[: caps.pr_discussion_max_comments]
+    if comments:
+        lines.append("    Discussion:")
+        for cm in comments:
+            if not isinstance(cm, dict):
+                continue
+            who = cm.get("author") or "unknown"
+            body = (cm.get("body") or "").strip()[: caps.pr_comment_max_chars]
+            lines.append(_indent_block(f"[{who}] {body}", "      "))
     return lines
 
 
@@ -118,13 +189,16 @@ def _format_issue(issue: Issue) -> list[str]:
 
 _SOURCE_LABELS = {
     "blame": "line-blame",
+    "pr-origin": "original commit recovered from a squash-merged PR",
     "blame-walked": "line-blame (skipped a refactor commit)",
     "predecessor-blame": "line-blame on a pre-rename predecessor file",
     "area": "area-history (touched the file but not these lines)",
 }
 
 
-def _format_evidence(evidence: Sequence[CommitEvidence]) -> str:
+def _format_evidence(
+    evidence: Sequence[CommitEvidence], caps: _PrRenderCaps = _PrRenderCaps()
+) -> str:
     """Render an evidence bundle as the text payload for the rationale prompt.
 
     Commits are formatted in the order given — the caller controls
@@ -135,6 +209,9 @@ def _format_evidence(evidence: Sequence[CommitEvidence]) -> str:
     the row reached this bundle (line-blame, area-history, etc.), so the
     LLM can weight precision-vs-coverage signals when synthesising the
     rationale.
+
+    ``caps`` bounds the per-PR roster / discussion rendering (see
+    :func:`_format_pr`).
     """
     n_prs = sum(len(item.pull_requests) for item in evidence)
     n_issues = sum(len(item.issues) for item in evidence)
@@ -149,7 +226,7 @@ def _format_evidence(evidence: Sequence[CommitEvidence]) -> str:
         lines.insert(1, f"  Source: {label}")
         for pr in item.pull_requests:
             lines.append("")
-            lines.extend(_format_pr(pr))
+            lines.extend(_format_pr(pr, caps))
         for issue in item.issues:
             lines.append("")
             lines.extend(_format_issue(issue))
@@ -331,6 +408,11 @@ class RationaleGenerator:
         ``task`` should contain the
         :data:`~whygraph.analyze.prompt.RATIONALE_PLACEHOLDER` token. Mostly
         used in tests and one-off overrides.
+    caps : _PrRenderCaps, optional
+        Size caps for rendering each PR's squashed-commit roster and
+        discussion into the prompt. ``None`` (default) uses the
+        :class:`~whygraph.core.config.RationaleConfig` defaults;
+        :meth:`from_config` projects them from the loaded config.
 
     Examples
     --------
@@ -345,9 +427,11 @@ def __init__(
         *,
         timeout_sec: int | None = None,
         rationale_prompt: Prompt | None = None,
+        caps: _PrRenderCaps | None = None,
     ) -> None:
         self._client = client
         self._timeout_sec = timeout_sec
+        self._caps = caps if caps is not None else _PrRenderCaps()
         self._rationale_prompt = (
             rationale_prompt
             if rationale_prompt is not None
@@ -390,7 +474,11 @@ def from_config(
         """
         factory = factory if factory is not None else LlmClientFactory()
         client = factory.make(config.provider, model=config.model)
-        return cls(client, timeout_sec=config.timeout_sec)
+        return cls(
+            client,
+            timeout_sec=config.timeout_sec,
+            caps=_PrRenderCaps.from_config(config),
+        )
 
     def generate(
         self,
@@ -433,7 +521,7 @@ def generate(
 
         # TODO: capping bundle size belongs to the future evidence-bundle
         # builder — the generator neither truncates nor chunks its input.
-        bundle = _format_evidence(evidence)
+        bundle = _format_evidence(evidence, self._caps)
         if symbol_context is not None:
             bundle = f"{_format_symbol_context(symbol_context)}\n\n{bundle}"
         task = render(
diff --git a/src/whygraph/cli/commands/scan.py b/src/whygraph/cli/commands/scan.py
index 2cc0946..adfc5df 100644
--- a/src/whygraph/cli/commands/scan.py
+++ b/src/whygraph/cli/commands/scan.py
@@ -12,7 +12,13 @@
 from rich.table import Table
 from rich.text import Text
 
-from whygraph.scan import CodeGraphCrawler, Crawler, GitCrawler, GitHubCrawler
+from whygraph.scan import (
+    CodeGraphCrawler,
+    Crawler,
+    GitCrawler,
+    GitHubCrawler,
+    PROriginEnricher,
+)
 
 from ..console import console
 
@@ -73,11 +79,25 @@
         "auto-rescan git hooks (`whygraph hooks install`). Default: on."
     ),
 )
+@click.option(
+    "--pr-origins/--no-pr-origins",
+    "enrich_pr_origins",
+    default=True,
+    help=(
+        "Recover a squash-merged PR's original feature-branch commits — "
+        "one targeted `git fetch` of the gated PRs' heads, persisted as "
+        "`commit` rows flagged off the default-branch walk so they enrich "
+        "evidence without polluting area-history / refactor-walk. Needs "
+        "the network, so it always runs in the remote phase and is skipped "
+        "under `--no-remote`. Default: on."
+    ),
+)
 def scan_cmd(
     no_llm_descriptions: bool,
     refresh_codegraph: bool,
     codegraph_image: str | None,
     remote: bool,
+    enrich_pr_origins: bool,
 ) -> None:
     """Run the source crawlers, then describe each commit with the LLM."""
     # Lazy-imported so that --help and other lightweight CLI surfaces
@@ -121,6 +141,7 @@ def scan_cmd(
         analyze_skip=analyze_skip,
         codegraph_enabled=refresh_codegraph,
         remote_enabled=remote,
+        pr_origins_enabled=enrich_pr_origins and github_client is not None,
     )
 
     scan_log_path = db_path.parent / "scan.log"
@@ -142,8 +163,11 @@ def scan_cmd(
         if github_client is not None:
             phase1.append(GitHubCrawler(progress, client=github_client))
 
-        # Phase 2 — the analyzer, started only once phase 1 has joined
-        # (it reads the commits phase 1 persisted).
+        # Phase 2 — started only once phase 1 has joined (it reads the
+        # commits + PRs phase 1 persisted). The analyzer and the PR-origin
+        # enricher run concurrently: analyze only touches main-walk commits,
+        # the enricher only inserts new on_default_branch=0 rows, so they
+        # never contend over the same commit row.
         phase2: list[Crawler] = []
         if descriptor is not None:
             phase2.append(
@@ -155,6 +179,18 @@ def scan_cmd(
                     large_commit_file_count=config.analyze.large_commit_file_count,
                 )
             )
+        # The enricher needs PR rows (the GitHub crawler ran) and the
+        # network for its fetch — so it is gated on a resolved client, which
+        # is itself None under --no-remote.
+        if enrich_pr_origins and github_client is not None:
+            phase2.append(
+                PROriginEnricher(
+                    progress,
+                    repository=repository,
+                    min_commits=config.analyze.pr_origin_min_commits,
+                    large_commit_file_count=config.analyze.large_commit_file_count,
+                )
+            )
 
         if codegraph_crawler is not None:
             codegraph_crawler.start()
@@ -257,6 +293,7 @@ def _render_scan_panel(
     analyze_skip: str | None,
     codegraph_enabled: bool,
     remote_enabled: bool,
+    pr_origins_enabled: bool,
 ) -> None:
     """Print a summary panel of what the upcoming scan will collect.
 
@@ -316,6 +353,14 @@ def _render_scan_panel(
             ("LLM descriptions", Text(f"skipped — {analyze_skip}", style="yellow"))
         )
     rows.append(("Worker threads", str(config.scan_max_workers)))
+    rows.append(
+        (
+            "PR commit recovery",
+            "recover squash-merged PR commits"
+            if pr_origins_enabled
+            else Text("skipped", style="yellow"),
+        )
+    )
 
     grid = Table.grid(padding=(0, 3))
     grid.add_column(style="bold cyan", justify="right", no_wrap=True)
diff --git a/src/whygraph/core/config.py b/src/whygraph/core/config.py
index a5a358c..7b773cd 100644
--- a/src/whygraph/core/config.py
+++ b/src/whygraph/core/config.py
@@ -243,6 +243,12 @@ class AnalyzeConfig:
     timeout_sec : int or None
         Per-call timeout forwarded into :class:`CompletionRequest`.
         ``None`` (default) defers to the bound adapter's default.
+    pr_origin_min_commits : int
+        Commit-rich half of the squash-merge enrichment gate
+        (:mod:`whygraph.scan.pr_origin_enricher`). A squash-merged PR has
+        its original feature-branch commits recovered when it collapsed at
+        least this many commits (the file-bulk half reuses
+        ``large_commit_file_count``). Must be ``>= 1``.
     """
 
     provider: str = "anthropic"
@@ -250,6 +256,7 @@ class AnalyzeConfig:
     max_diff_chars: int = 50_000
     large_commit_file_count: int = 30
     timeout_sec: int | None = None
+    pr_origin_min_commits: int = 5
 
 
 @dataclass(frozen=True, slots=True)
@@ -277,11 +284,24 @@ class RationaleConfig:
     timeout_sec : int or None
         Per-call timeout forwarded into :class:`CompletionRequest`.
         ``None`` (default) defers to the bound adapter's default.
+    pr_roster_max_commits : int
+        Cap on how many squashed-commit headlines are rendered into a
+        single PR block in the rationale prompt. Bounds the prompt size
+        when a squash collapsed a long feature branch. Must be ``>= 1``.
+    pr_discussion_max_comments : int
+        Cap on how many PR comments are rendered into a single PR block
+        in the rationale prompt. Must be ``>= 1``.
+    pr_comment_max_chars : int
+        Per-comment body clip applied before rendering a PR comment into
+        the rationale prompt. Must be ``>= 1``.
     """
 
     provider: str = "anthropic"
     model: str | None = None
     timeout_sec: int | None = None
+    pr_roster_max_commits: int = 30
+    pr_discussion_max_comments: int = 20
+    pr_comment_max_chars: int = 500
 
 
 @dataclass(frozen=True, slots=True)
@@ -449,9 +469,12 @@ def __post_init__(self) -> None:
         ConfigError
             If ``log_level`` is not a known :class:`LogLevel` name, if
             ``scan_max_workers`` is less than ``1``, if ``scan_provider``
-            is not one of ``"off"`` / ``"github"`` / ``"auto"``, or if
-            ``analyze.max_diff_chars`` or ``analyze.large_commit_file_count``
-            is less than ``1``.
+            is not one of ``"off"`` / ``"github"`` / ``"auto"``, if
+            ``analyze.max_diff_chars``, ``analyze.large_commit_file_count``
+            or ``analyze.pr_origin_min_commits`` is less than ``1``, or if
+            any of the ``rationale`` PR-rendering caps
+            (``pr_roster_max_commits``, ``pr_discussion_max_comments``,
+            ``pr_comment_max_chars``) is less than ``1``.
         """
         try:
             LogLevel[self.log_level.upper()]
@@ -476,6 +499,26 @@ def __post_init__(self) -> None:
                 "analyze.large_commit_file_count must be >= 1, "
                 f"got {self.analyze.large_commit_file_count}"
             )
+        if self.analyze.pr_origin_min_commits < 1:
+            raise ConfigError(
+                "analyze.pr_origin_min_commits must be >= 1, "
+                f"got {self.analyze.pr_origin_min_commits}"
+            )
+        if self.rationale.pr_roster_max_commits < 1:
+            raise ConfigError(
+                "rationale.pr_roster_max_commits must be >= 1, "
+                f"got {self.rationale.pr_roster_max_commits}"
+            )
+        if self.rationale.pr_discussion_max_comments < 1:
+            raise ConfigError(
+                "rationale.pr_discussion_max_comments must be >= 1, "
+                f"got {self.rationale.pr_discussion_max_comments}"
+            )
+        if self.rationale.pr_comment_max_chars < 1:
+            raise ConfigError(
+                "rationale.pr_comment_max_chars must be >= 1, "
+                f"got {self.rationale.pr_comment_max_chars}"
+            )
 
     @classmethod
     def from_toml(cls, path: Path) -> Config:
diff --git a/src/whygraph/db/migrations/versions/4e231ec6f0e1_add_on_default_branch_to_commit.py b/src/whygraph/db/migrations/versions/4e231ec6f0e1_add_on_default_branch_to_commit.py
new file mode 100644
index 0000000..40982b8
--- /dev/null
+++ b/src/whygraph/db/migrations/versions/4e231ec6f0e1_add_on_default_branch_to_commit.py
@@ -0,0 +1,53 @@
+"""add on_default_branch to commit
+
+Revision ID: 4e231ec6f0e1
+Revises: a1f7c2e9b4d8
+Create Date: 2026-06-11 16:27:51.724319
+
+Adds the ``on_default_branch`` discriminator on ``commit``. Default 1
+means every existing row is treated as a default-branch (first-parent
+walk) commit; PR-origin commits recovered from a squash-merged PR are
+inserted with 0 so they stay out of the main-walk-only queries
+(area-history, refactor-walk). Additive, server-default backfilled — safe
+to re-run scans against an upgraded DB.
+
+(Autogenerate also surfaced pre-existing server-default / index drift on
+``commit.refactor_score`` and ``commit_file_change``; those are unrelated
+to this change and intentionally omitted to keep the migration additive.)
+"""
+
+from typing import Sequence, Union
+
+from alembic import op
+import sqlalchemy as sa
+
+
+# revision identifiers, used by Alembic.
+revision: str = "4e231ec6f0e1"
+down_revision: Union[str, Sequence[str], None] = "a1f7c2e9b4d8"
+branch_labels: Union[str, Sequence[str], None] = None
+depends_on: Union[str, Sequence[str], None] = None
+
+
+def upgrade() -> None:
+    """Upgrade schema.
+
+    A plain ``add_column`` (native ``ALTER TABLE ADD COLUMN``) rather than a
+    batch recreate: SQLite supports adding a column with a constant server
+    default in place, and recreating ``commit`` would trip the foreign key
+    from ``commit_file_change`` on a populated DB.
+    """
+    op.add_column(
+        "commit",
+        sa.Column(
+            "on_default_branch",
+            sa.Integer(),
+            nullable=False,
+            server_default=sa.text("1"),
+        ),
+    )
+
+
+def downgrade() -> None:
+    """Downgrade schema."""
+    op.drop_column("commit", "on_default_branch")
diff --git a/src/whygraph/db/models/commit.py b/src/whygraph/db/models/commit.py
index 587b820..5edfc6e 100644
--- a/src/whygraph/db/models/commit.py
+++ b/src/whygraph/db/models/commit.py
@@ -2,14 +2,25 @@
 
 from __future__ import annotations
 
-from sqlalchemy import Text
+from sqlalchemy import Text, text
 from sqlmodel import Field
 
 from whygraph.db.base import WhygraphTable
 
 
 class Commit(WhygraphTable, table=True):
-    """One row per scanned Git commit (first-parent walk of the default branch)."""
+    """One row per scanned Git commit (first-parent walk of the default branch).
+
+    Notes
+    -----
+    * ``on_default_branch`` is ``int`` (0/1), not ``bool``, to keep the
+      declared SQLite affinity as INTEGER (same rationale as
+      :attr:`whygraph.db.models.PullRequest.draft`). ``1`` (default) marks a
+      commit on the first-parent main walk; ``0`` marks a PR-origin commit
+      recovered from a squash-merged PR (see ``scan/pr_origin_enricher.py``)
+      that must stay out of the main-walk-only queries (area-history,
+      refactor-walk).
+    """
 
     sha: str = Field(primary_key=True, nullable=True, sa_type=Text)
     parent_shas: str = Field(sa_type=Text)
@@ -30,3 +41,8 @@ class Commit(WhygraphTable, table=True):
     # uses it to drive ``git blame --ignore-rev`` walk-past so older
     # authorship surfaces through commits that would otherwise mask it.
     refactor_score: int = Field(default=0)
+    # 0 = PR-origin commit recovered from a squash-merged PR (not on the
+    # first-parent main walk); 1 = on the default-branch walk (the norm).
+    on_default_branch: int = Field(
+        default=1, sa_column_kwargs={"server_default": text("1")}
+    )
diff --git a/src/whygraph/mcp/evidence.py b/src/whygraph/mcp/evidence.py
index 053d8bc..3dc293a 100644
--- a/src/whygraph/mcp/evidence.py
+++ b/src/whygraph/mcp/evidence.py
@@ -1,8 +1,9 @@
 """The ``whygraph_evidence_for`` MCP tool and its evidence collector.
 
-The collector combines four signals: line-blame at HEAD, line-blame
+The collector combines five signals: line-blame at HEAD, line-blame
 after walking past refactor-heavy commits, line-blame against a rename
-predecessor at its pre-rename location, and area-history from the
+predecessor at its pre-rename location, per-line attribution back to a
+squash-merged PR's original commits, and area-history from the
 ``commit_file_change`` index. Each commit is tagged with a ``source``
 label so the rationale generator can weight precision vs coverage.
 :mod:`whygraph.mcp.rationale` reuses :func:`collect_evidence`; the tool
@@ -35,9 +36,12 @@
 
 # Source ordering for dedupe — a SHA that surfaces from multiple paths
 # is kept with the strongest source label only. ``blame`` beats every
-# other label; ``area`` is the weakest.
+# other label; ``area`` is the weakest. ``pr-origin`` sits just below
+# ``blame`` (0.5): a real authoring commit reached through the squash is
+# high-precision, but a direct HEAD blame hit is still preferred.
 _SOURCE_PRIORITY = {
     "blame": 0,
+    "pr-origin": 0.5,
     "blame-walked": 1,
     "predecessor-blame": 2,
     "area": 3,
@@ -128,16 +132,20 @@ def collect_evidence(target: Target, *, limit: int = 20) -> list[CommitEvidence]
 
     1. ``"blame"`` — line-level attribution from the target's current
        range. The primary signal.
-    2. ``"blame-walked"`` — line-level attribution surfaced by walking
+    2. ``"pr-origin"`` — when a blamed line lands on a squash-merged PR's
+       commit that Stage 1 enriched, the same range is re-blamed at the
+       PR's ``head_sha`` so each line maps back to the original
+       feature-branch commit that authored it.
+    3. ``"blame-walked"`` — line-level attribution surfaced by walking
        past refactor-heavy commits (scored at scan time). Each round
        runs blame again with ``--ignore-rev`` for every boring SHA seen
        so far; bounded by :data:`_MAX_BORING_HOPS`.
-    3. ``"predecessor-blame"`` — for every rename event in the target
+    4. ``"predecessor-blame"`` — for every rename event in the target
        path's lineage (``commit_file_change`` rows with ``change_type``
        ``"R"``), blame the predecessor file at the rename commit's
        parent so authorship for code that has been moved across files
        still surfaces.
-    4. ``"area"`` — drawn from the ``commit_file_change`` index for the
+    5. ``"area"`` — drawn from the ``commit_file_change`` index for the
        target's path and every rename ancestor. Used to fill the cap
        when the line-blame signals are thin.
 
@@ -191,14 +199,24 @@ def _collect_evidence_against_db(
     walked_hunks, _boring = _walk_past_boring(repo, target, initial_hunks=initial)
     predecessor_hunks = _predecessor_blame(repo, target)
 
-    labeled_hunks: list[tuple[BlameHunk, str]] = (
+    base_labeled: list[tuple[BlameHunk, str]] = (
         [(h, "blame") for h in initial]
         + [(h, "blame-walked") for h in walked_hunks]
         + [(h, "predecessor-blame") for h in predecessor_hunks]
     )
+    # SHAs the queried lines actually blamed to — the candidates for being
+    # an enriched squash's merge_commit_sha.
+    blame_shas = {h.sha for h, _ in base_labeled if not h.is_uncommitted}
 
     by_sha: dict[str, CommitEvidence] = {}
     with get_session() as session:
+        # Per-line squash attribution: re-blame at each enriched squash PR's
+        # head_sha so origin commits surface with the existing dedupe /
+        # priority / cap machinery (no special-casing past the label).
+        origin_hunks = _attribute_squash_origins(
+            repo, target, blame_shas=blame_shas, session=session
+        )
+        labeled_hunks = base_labeled + [(h, "pr-origin") for h in origin_hunks]
         for hunk, source in labeled_hunks:
             if hunk.is_uncommitted:
                 continue
@@ -284,7 +302,14 @@ def _walk_past_boring(
 
 
 def _boring_shas_in(shas: set[str]) -> set[str]:
-    """Return the subset of ``shas`` whose ``refactor_score`` is boring."""
+    """Return the subset of ``shas`` whose ``refactor_score`` is boring.
+
+    Restricted to default-branch commits (``on_default_branch == 1``):
+    refactor-walk is a main-walk-only notion, and recovered PR-origin
+    commits (``0``) must never drive a walk-past — they carry no
+    ``commit_file_change`` rows and default ``refactor_score=0``, so the
+    guard is belt-and-suspenders against a future broad consumer.
+    """
     if not shas:
         return set()
     with get_session() as session:
@@ -292,6 +317,7 @@ def _boring_shas_in(shas: set[str]) -> set[str]:
             select(Commit.sha)
             .where(col(Commit.sha).in_(shas))
             .where(col(Commit.refactor_score) >= BORING_THRESHOLD)
+            .where(col(Commit.on_default_branch) == 1)
         ).all()
     return set(rows)
 
@@ -324,6 +350,77 @@ def _predecessor_blame(repo: Repository, target: Target) -> list[BlameHunk]:
     return out
 
 
+def _enriched_squash_prs_for(
+    session: Session, blame_shas: set[str]
+) -> list[PullRequest]:
+    """Squash PRs in ``blame_shas`` that have recovered origin commits.
+
+    A PR qualifies when its ``merge_commit_sha`` is one of the blamed SHAs
+    (so the queried lines land on its squash commit) **and** at least one
+    of its ``commit_titles`` oids exists in ``commit`` as an origin row
+    (``on_default_branch == 0``) — i.e. Stage 1 actually enriched it, so
+    ``head_sha``'s objects are local and a re-blame there will resolve.
+    This gate-agnostic predicate covers exactly the PRs the enricher
+    recovered, whether they were gated as file-bulk or commit-rich.
+    """
+    if not blame_shas:
+        return []
+    prs = session.exec(
+        select(PullRequest).where(col(PullRequest.merge_commit_sha).in_(blame_shas))
+    ).all()
+    enriched: list[PullRequest] = []
+    for pr in prs:
+        oids = [
+            entry["oid"]
+            for entry in _json_list(pr.commit_titles)
+            if isinstance(entry, dict) and entry.get("oid")
+        ]
+        if not oids:
+            continue
+        has_origin = session.exec(
+            select(Commit.sha)
+            .where(col(Commit.sha).in_(oids))
+            .where(col(Commit.on_default_branch) == 0)
+            .limit(1)
+        ).first()
+        if has_origin is not None:
+            enriched.append(pr)
+    return enriched
+
+
+def _attribute_squash_origins(
+    repo: Repository,
+    target: Target,
+    *,
+    blame_shas: set[str],
+    session: Session,
+) -> list[BlameHunk]:
+    """Re-blame the target range at each enriched squash PR's ``head_sha``.
+
+    When a blamed line lands on a squash ``merge_commit_sha``, the squash
+    tree equals the PR's ``head_sha`` tree for that file, so blaming the
+    same range at ``head_sha`` maps each line back to the original
+    feature-branch commit that authored it — the same mechanism
+    predecessor-blame uses with ``rev=parent_sha``. Best-effort: a
+    squash-vs-head mismatch or absent object skips that PR (mirrors
+    :func:`_predecessor_blame`'s per-event ``GitError`` swallow), leaving
+    the PR-level Stage 1 evidence untouched.
+    """
+    out: list[BlameHunk] = []
+    for pr in _enriched_squash_prs_for(session, blame_shas):
+        try:
+            hunks = repo.blame(
+                target.path,
+                target.line_start,
+                target.line_end,
+                rev=pr.head_sha,
+            )
+        except GitError:
+            continue
+        out.extend(h for h in hunks if not h.is_uncommitted)
+    return out
+
+
 def _rename_events_for(path: str) -> list[tuple[str, str]]:
     """Return ``(rename_commit_sha, predecessor_path)`` for every rename in path's lineage."""
     # Lazy import: path_history reuses _linked_prs / _linked_issues from
@@ -463,7 +560,16 @@ def _commit_dict(commit: Commit) -> dict:
 
 
 def _pr_dict(pr: PullRequest) -> dict:
-    """Serialize a pull request to a JSON-ready dict."""
+    """Serialize a pull request to a JSON-ready dict.
+
+    Notes
+    -----
+    ``commit_titles`` and ``comments`` are emitted **uncapped**: the
+    consumer of ``whygraph_evidence_for`` is an agent that can handle the
+    full lists. The size caps in :class:`RationaleConfig` apply only to the
+    LLM *rationale prompt* (see ``analyze.rationale_generator._format_pr``),
+    not to this raw tool output.
+    """
     return {
         "number": pr.number,
         "title": pr.title,
@@ -473,6 +579,8 @@ def _pr_dict(pr: PullRequest) -> dict:
         "author": pr.author,
         "html_url": pr.html_url,
         "labels": _json_list(pr.labels),
+        "commit_titles": _json_list(pr.commit_titles),
+        "comments": _json_list(pr.comments),
     }
 
 
diff --git a/src/whygraph/mcp/path_history.py b/src/whygraph/mcp/path_history.py
index e1a6471..bbe96ff 100644
--- a/src/whygraph/mcp/path_history.py
+++ b/src/whygraph/mcp/path_history.py
@@ -119,6 +119,11 @@ def area_history_commits(
                 col(CommitFileChange.commit_sha) == col(Commit.sha),
             )
             .where(col(CommitFileChange.path).in_(aliases))
+            # Area-history is a main-walk-only view. Recovered PR-origin
+            # commits (on_default_branch=0) carry no commit_file_change
+            # rows so the join already excludes them; this makes the
+            # invariant explicit for a future broad consumer.
+            .where(col(Commit.on_default_branch) == 1)
         )
         if exclude_shas:
             stmt = stmt.where(col(Commit.sha).not_in(exclude_shas))
diff --git a/src/whygraph/scan/__init__.py b/src/whygraph/scan/__init__.py
index 5a45dfd..8860960 100644
--- a/src/whygraph/scan/__init__.py
+++ b/src/whygraph/scan/__init__.py
@@ -3,11 +3,12 @@
 Exposes :class:`Crawler` (the threaded base class) and the concrete
 crawlers — :class:`GitCrawler` for local git history,
 :class:`GitHubCrawler` for GitHub pull requests and issues,
-:class:`CodeGraphCrawler` which refreshes the CodeGraph index, and
+:class:`CodeGraphCrawler` which refreshes the CodeGraph index,
 :class:`AnalyzeCrawler` which describes each commit's diff with an LLM
-(run after :class:`GitCrawler`). The CLI runs the source crawlers (and
-CodeGraph) concurrently, then the analyzer, against the shared SQLite
-database.
+(run after :class:`GitCrawler`), and :class:`PROriginEnricher` which
+recovers a squash-merged PR's original commits. The CLI runs the source
+crawlers (and CodeGraph) concurrently, then the analyzer and enricher,
+against the shared SQLite database.
 """
 
 from whygraph.scan.analyze_crawler import AnalyzeCrawler
@@ -15,6 +16,7 @@
 from whygraph.scan.crawler import Crawler
 from whygraph.scan.git_crawler import GitCrawler
 from whygraph.scan.github_crawler import GitHubCrawler
+from whygraph.scan.pr_origin_enricher import PROriginEnricher
 
 __all__ = [
     "AnalyzeCrawler",
@@ -22,4 +24,5 @@
     "Crawler",
     "GitCrawler",
     "GitHubCrawler",
+    "PROriginEnricher",
 ]
diff --git a/src/whygraph/scan/pr_origin_enricher.py b/src/whygraph/scan/pr_origin_enricher.py
new file mode 100644
index 0000000..37c31e2
--- /dev/null
+++ b/src/whygraph/scan/pr_origin_enricher.py
@@ -0,0 +1,248 @@
+"""PROriginEnricher — recover a squash-merged PR's original commits.
+
+A squash merge collapses a feature branch into one commit on the default
+branch, discarding the per-commit history WhyGraph would otherwise index.
+GitHub still retains the originals under ``refs/pull/<N>/head``, so this
+crawler — run after :class:`~whygraph.scan.github_crawler.GitHubCrawler`,
+once PR rows exist — fetches them once during the remote scan and persists
+them as ``commit`` rows flagged ``on_default_branch=0`` so they enrich
+evidence without leaking into the main-walk-only queries (area-history,
+refactor-walk).
+
+Only the squashes that actually lost history are enriched (the *balanced
+gate*, see :func:`_select_candidates`), and the fetch is one targeted
+batched ``git fetch`` carrying only the gated candidates' refspecs — never
+the ``refs/pull/*`` wildcard. No PR↔commit link row is written: the
+association is already carried by the PR's ``commit_titles`` and resolved
+at query time by ``mcp/evidence.py:_linked_prs``.
+
+The recovered commits' diffs and LLM descriptions stay lazy — this crawler
+writes only the ``commit`` row (full message + stats from ``git log``),
+leaving ``llm_description`` ``NULL`` for the on-read backfill to fill.
+"""
+
+from __future__ import annotations
+
+import json
+import logging
+from dataclasses import dataclass
+from datetime import datetime, timezone
+
+from rich.progress import Progress
+from sqlmodel import col, select
+
+from whygraph.db import get_session
+from whygraph.db.models.commit import Commit as CommitRow
+from whygraph.db.models.pull_request import PullRequest
+from whygraph.services.git import GitError, Repository
+from whygraph.services.git.commit import Commit as CommitDC
+
+from .crawler import Crawler
+
+_log = logging.getLogger(__name__)
+
+# Pin each candidate PR's head under our own ref namespace so the fetched
+# objects survive local GC and later blame/diff stay offline. The source
+# ``refs/pull/<N>/head`` is GitHub's server-side immutable PR ref.
+_PULL_REFSPEC = "refs/pull/{number}/head:refs/whygraph/pull/{number}"
+
+
+def _json_list(raw: str | None) -> list:
+    """Decode a JSON-encoded list column; empty list on anything malformed.
+
+    Mirrors ``mcp/evidence.py:_json_list`` — duplicated rather than
+    imported to keep the scan layer free of an upward dependency on the
+    MCP server layer.
+    """
+    if not raw:
+        return []
+    try:
+        parsed = json.loads(raw)
+    except (TypeError, json.JSONDecodeError):
+        return []
+    return parsed if isinstance(parsed, list) else []
+
+
+@dataclass(frozen=True, slots=True)
+class _Candidate:
+    """A gated squash PR and the subset of its oids not yet in ``commit``.
+
+    Attributes
+    ----------
+    number : int
+        The PR number, used to build its ``refs/pull/<N>/head`` refspec.
+    new_oids : list[str]
+        The PR's ``commit_titles`` oids that are absent from the ``commit``
+        table — exactly the rows this run will materialize.
+    """
+
+    number: int
+    new_oids: list[str]
+
+
+def _select_candidates(
+    session, *, min_commits: int, large_commit_file_count: int
+) -> list[_Candidate]:
+    """Pick the squash-merged PRs whose original commits should be recovered.
+
+    Applies the *balanced gate* (plan §3.3 / §3.5): a merged PR qualifies
+    when its ``commit_titles`` oids are **absent** from the ``commit``
+    table (squash detection — the originals are not on the main walk)
+    **and** either loss signal fires:
+
+    - **file-bulk** — its ``merge_commit_sha`` touched more than
+      ``large_commit_file_count`` files (the squash that lost the most
+      *description* fidelity); **or**
+    - **commit-rich** — it collapsed at least ``min_commits`` commits (lost
+      *narrative*, even if few files changed).
+
+    A PR whose oids are already on the main walk (a plain merge / rebase)
+    is skipped — the normal path already indexed it.
+
+    Parameters
+    ----------
+    session : sqlmodel.Session
+        Open session to read PR / commit rows through.
+    min_commits : int
+        Commit-rich threshold (``analyze.pr_origin_min_commits``).
+    large_commit_file_count : int
+        File-bulk threshold (``analyze.large_commit_file_count``).
+
+    Returns
+    -------
+    list[_Candidate]
+        One entry per gated PR, carrying the oids still to insert.
+    """
+    existing: set[str] = set(session.exec(select(CommitRow.sha)).all())
+    candidates: list[_Candidate] = []
+    merged = session.exec(
+        select(PullRequest).where(col(PullRequest.merged_at).is_not(None))
+    ).all()
+    for pr in merged:
+        oids = [
+            c["oid"]
+            for c in _json_list(pr.commit_titles)
+            if isinstance(c, dict) and c.get("oid")
+        ]
+        if not oids or all(o in existing for o in oids):
+            continue  # not a squash — originals already on the main walk
+        squash = (
+            session.get(CommitRow, pr.merge_commit_sha) if pr.merge_commit_sha else None
+        )
+        file_bulk = bool(squash and squash.files_changed > large_commit_file_count)
+        if not (file_bulk or len(oids) >= min_commits):
+            continue  # below both halves of the balanced gate
+        candidates.append(
+            _Candidate(
+                number=pr.number, new_oids=[o for o in oids if o not in existing]
+            )
+        )
+    return candidates
+
+
+def _to_origin_row(dc: CommitDC, *, scanned_at: str) -> CommitRow:
+    """Build an ``on_default_branch=0`` commit row from a git value object.
+
+    Mirrors ``git_crawler._to_row`` but flags the row as a recovered
+    PR-origin commit and leaves ``refactor_score`` at its default — origin
+    commits carry no ``commit_file_change`` rows, so the refactor-walk
+    never reaches them regardless.
+    """
+    return CommitRow(
+        sha=dc.sha,
+        parent_shas=" ".join(dc.parent_shas),
+        author_name=dc.author_name,
+        author_email=dc.author_email,
+        authored_at=dc.authored_at,
+        committed_at=dc.committed_at,
+        subject=dc.subject,
+        body=dc.body,
+        files_changed=dc.stats.files_changed,
+        insertions=dc.stats.insertions,
+        deletions=dc.stats.deletions,
+        scanned_at=scanned_at,
+        on_default_branch=0,
+    )
+
+
+class PROriginEnricher(Crawler):
+    """Recover and persist squash-merged PRs' original feature-branch commits.
+
+    Runs after the GitHub crawler (so PR rows exist) and the git crawler
+    (so the squash commits and the main walk exist). Selects the gated
+    squash PRs (:func:`_select_candidates`), fetches their PR-head refs in
+    one batched ``git fetch``, then inserts one ``commit`` row per recovered
+    oid with ``on_default_branch=0``. Re-scans are idempotent: oids already
+    in ``commit`` are excluded at selection time, so a second run with no
+    new squashes does nothing.
+
+    Best-effort by design (plan §6.6): a failed fetch (GC'd ref, no
+    network) or an unreadable single oid is logged and skipped rather than
+    failing the scan — the PR keeps its existing PR-level evidence.
+
+    Parameters
+    ----------
+    progress : rich.progress.Progress
+        Shared Progress instance owned by the orchestrator.
+    repository : Repository
+        The git repository to fetch and read commit metadata through.
+    min_commits : int
+        Commit-rich half of the gate (``analyze.pr_origin_min_commits``).
+    large_commit_file_count : int
+        File-bulk half of the gate (``analyze.large_commit_file_count``).
+    """
+
+    def __init__(
+        self,
+        progress: Progress,
+        *,
+        repository: Repository,
+        min_commits: int,
+        large_commit_file_count: int,
+    ) -> None:
+        super().__init__("pr-origins", progress, total=None)
+        self._repository = repository
+        self._min_commits = min_commits
+        self._large_commit_file_count = large_commit_file_count
+
+    def work(self) -> None:
+        with get_session() as session:
+            candidates = _select_candidates(
+                session,
+                min_commits=self._min_commits,
+                large_commit_file_count=self._large_commit_file_count,
+            )
+            self.set_total(len(candidates))
+            if not candidates:
+                return
+
+            # ONE batched fetch — only the gated candidates' refs, never the
+            # refs/pull/* wildcard. A failure here degrades the whole phase
+            # gracefully rather than aborting the scan.
+            refspecs = [_PULL_REFSPEC.format(number=c.number) for c in candidates]
+            try:
+                self._repository.fetch_refs(refspecs)
+            except GitError as exc:
+                _log.warning(
+                    "pr-origin fetch failed; skipping enrichment for %d PR(s): %s",
+                    len(candidates),
+                    exc,
+                )
+                return
+
+            scanned_at = datetime.now(timezone.utc).isoformat()
+            inserted: set[str] = set()
+            for cand in candidates:
+                for oid in cand.new_oids:
+                    # A commit shared across stacked / backport PRs appears
+                    # in two candidates' new_oids; insert it once.
+                    if oid in inserted:
+                        continue
+                    try:
+                        dc = self._repository.commit_metadata(oid)
+                    except GitError as exc:
+                        _log.warning("skipping origin commit %s: %s", oid[:9], exc)
+                        continue
+                    session.add(_to_origin_row(dc, scanned_at=scanned_at))
+                    inserted.add(oid)
+                self.advance(1)
diff --git a/src/whygraph/services/git/commands.py b/src/whygraph/services/git/commands.py
index 4483210..3436039 100644
--- a/src/whygraph/services/git/commands.py
+++ b/src/whygraph/services/git/commands.py
@@ -247,6 +247,74 @@ def parse(self, result: CompletedProcess[str]) -> tuple[FileChange, ...]:
         return FileChange.from_diff_tree(result.stdout)
 
 
+class GitFetchRefsCmd(ShellCommand[None]):
+    """``git fetch <remote> <refspec...>`` — fetch one or more refspecs in one call.
+
+    Carries every refspec in a single ``git fetch`` invocation so the
+    squash-origin enricher pins all its candidate PR refs with one network
+    round-trip rather than one fetch per PR. The parser returns ``None`` —
+    the command is run for its effect (objects + refs land in the local
+    object store), not its stdout.
+
+    Parameters
+    ----------
+    refspecs : tuple[str, ...]
+        One or more ``<src>:<dst>`` refspecs, e.g.
+        ``"refs/pull/12/head:refs/whygraph/pull/12"``. Passed as separate
+        argv tokens, so no shell quoting is involved.
+    remote : str, optional
+        Name of the remote to fetch from. Default ``"origin"``.
+    """
+
+    def __init__(self, *refspecs: str, remote: str = "origin") -> None:
+        self.refspecs = refspecs
+        self.remote = remote
+
+    def argv(self) -> list[str]:
+        return ["git", "fetch", "--no-tags", self.remote, *self.refspecs]
+
+    def parse(self, result: CompletedProcess[str]) -> None:
+        return None
+
+
+class GitLogCommitCmd(ShellCommand[Commit]):
+    """``git log -1 --shortstat --pretty=format:Commit.LOG_FORMAT <ref>``.
+
+    Reads a *single* commit's full metadata (message body + diff stats)
+    and parses it via :meth:`Commit.from_git_log`. The ``-1`` is load
+    bearing: ``git log <ref>`` without it walks ``ref`` and all its
+    ancestors, so the cap restricts output to just ``ref``.
+
+    Used by the squash-origin enricher to materialize a recovered PR
+    commit from its oid once the object is local, reusing the same parser
+    as the main history walk rather than inventing a second one.
+
+    Parameters
+    ----------
+    ref : str
+        Any commit-ish ``git`` accepts (typically a full oid).
+    """
+
+    def __init__(self, ref: str) -> None:
+        self.ref = ref
+
+    def argv(self) -> list[str]:
+        return [
+            "git",
+            "log",
+            "-1",
+            f"--pretty=format:{Commit.LOG_FORMAT}",
+            "--shortstat",
+            self.ref,
+        ]
+
+    def parse(self, result: CompletedProcess[str]) -> Commit:
+        for block in result.stdout.split("\x1e"):
+            if block.strip():
+                return Commit.from_git_log(block)
+        raise ValueError(f"no commit record in git log output for {self.ref!r}")
+
+
 class GitLogShortstatCmd(ShellCommand[Iterator[Commit]]):
     """``git log --shortstat --pretty=format:Commit.LOG_FORMAT <ref>``.
 
diff --git a/src/whygraph/services/git/repository.py b/src/whygraph/services/git/repository.py
index c593a3f..164c6a4 100644
--- a/src/whygraph/services/git/repository.py
+++ b/src/whygraph/services/git/repository.py
@@ -13,6 +13,8 @@
     GitCurrentBranchCmd,
     GitDiffCmd,
     GitDiffTreeFileChangesCmd,
+    GitFetchRefsCmd,
+    GitLogCommitCmd,
     GitRemoteUrlCmd,
 )
 from .commit import Commit
@@ -324,3 +326,70 @@ def diff_range(self, base: str, head: str) -> str:
             return self._shell.run(GitDiffCmd(f"{base}..{head}"), cwd=self.root)
         except ShellError as exc:
             raise GitError(f"failed to diff {base[:7]}..{head[:7]}") from exc
+
+    def fetch_refs(self, refspecs: list[str], *, remote: str | None = None) -> None:
+        """Fetch one or more refspecs from ``remote`` in a single round-trip.
+
+        The only network-touching method on :class:`Repository`. It writes
+        refs and objects into the local store (not the working tree), so
+        the squash-origin enricher can pin a batch of PR refs
+        (``refs/pull/<N>/head:refs/whygraph/pull/<N>``) against local GC
+        with one ``git fetch`` rather than one per PR — keeping later
+        blame/diff offline. ``--no-tags`` keeps the fetch from dragging in
+        the remote's tag set.
+
+        Parameters
+        ----------
+        refspecs : list[str]
+            ``<src>:<dst>`` refspecs to fetch. An empty list is a no-op
+            (no ``git`` is invoked).
+        remote : str or None, optional
+            Remote to fetch from. ``None`` (default) uses the remote this
+            repository was constructed with (``origin_remote``).
+
+        Raises
+        ------
+        GitError
+            If ``git fetch`` fails — an unknown remote, a missing/GC'd
+            source ref, or no network. Callers that treat enrichment as
+            best-effort should catch this.
+        """
+        if not refspecs:
+            return
+        target = remote or self._origin_remote
+        try:
+            self._shell.run(GitFetchRefsCmd(*refspecs, remote=target), cwd=self.root)
+        except ShellError as exc:
+            raise GitError(
+                f"failed to fetch {len(refspecs)} refspec(s) from {target!r}"
+            ) from exc
+
+    def commit_metadata(self, ref: str) -> Commit:
+        """Full :class:`Commit` value object for a single commit-ish.
+
+        Reads one commit's message body and diff stats via
+        ``git log -1 --shortstat`` and parses it with the same
+        :meth:`Commit.from_git_log` the main history walk uses. The object
+        must already be local (the enricher calls :meth:`fetch_refs`
+        first); an unknown ``ref`` surfaces as :class:`GitError`.
+
+        Parameters
+        ----------
+        ref : str
+            Any commit-ish ``git`` accepts (typically a full oid).
+
+        Returns
+        -------
+        Commit
+            The parsed commit value object.
+
+        Raises
+        ------
+        GitError
+            If ``git`` fails — most commonly an unknown ``ref`` whose
+            object is not in the local store.
+        """
+        try:
+            return self._shell.run(GitLogCommitCmd(ref), cwd=self.root)
+        except ShellError as exc:
+            raise GitError(f"failed to read commit metadata for {ref[:7]}") from exc
diff --git a/tests/test_analyze_rationale_generator.py b/tests/test_analyze_rationale_generator.py
index 8a3fe55..e060145 100644
--- a/tests/test_analyze_rationale_generator.py
+++ b/tests/test_analyze_rationale_generator.py
@@ -23,7 +23,9 @@
     RationaleGenerator,
 )
 from whygraph.analyze.rationale_generator import (
+    _PrRenderCaps,
     _format_evidence,
+    _format_pr,
     _format_symbol_context,
 )
 from whygraph.core.config import RationaleConfig
@@ -122,6 +124,8 @@ def _pr(
     number: int = 12,
     title: str = "Cache prompts",
     body: str | None = "Speeds up scans.",
+    commit_titles: str = "[]",
+    comments: str = "[]",
 ) -> PullRequest:
     """A :class:`PullRequest` row with sensible defaults for tests."""
     return PullRequest(
@@ -137,6 +141,8 @@ def _pr(
         author="octocat",
         html_url="https://example.com/pr/12",
         labels='["perf"]',
+        commit_titles=commit_titles,
+        comments=comments,
         fetched_at="2026-05-02T00:00:00Z",
     )
 
@@ -490,6 +496,109 @@ def test_format_evidence_renders_source_label_per_commit() -> None:
     assert "area-history" in bundle
 
 
+# ---- _format_pr roster + discussion --------------------------------------
+
+
+def _commit_titles(*headlines: str) -> str:
+    """JSON-encode a commit_titles roster the way the github crawler stores it."""
+    return json.dumps(
+        [
+            {"oid": f"{i:040x}", "headline": h, "author_name": "Jane"}
+            for i, h in enumerate(headlines, start=1)
+        ]
+    )
+
+
+def _comments(*pairs: tuple[str, str]) -> str:
+    """JSON-encode a PR comments list ([{author, body, created_at}, ...])."""
+    return json.dumps(
+        [
+            {"author": author, "body": body, "created_at": "2026-05-01T00:00:00Z"}
+            for author, body in pairs
+        ]
+    )
+
+
+def test_format_pr_renders_roster_and_discussion() -> None:
+    lines = _format_pr(
+        _pr(
+            commit_titles=_commit_titles("first commit", "second commit"),
+            comments=_comments(("alice", "looks good"), ("bob", "ship it")),
+        )
+    )
+    text = "\n".join(lines)
+
+    assert "Squashed commits:" in text
+    assert "first commit" in text
+    assert "second commit" in text
+    assert "Discussion:" in text
+    assert "[alice] looks good" in text
+    assert "[bob] ship it" in text
+
+
+def test_format_pr_empty_roster_and_discussion_add_no_lines() -> None:
+    lines = _format_pr(_pr())  # default commit_titles="[]", comments="[]"
+    text = "\n".join(lines)
+
+    assert "Squashed commits:" not in text
+    assert "Discussion:" not in text
+
+
+def test_format_pr_respects_caps() -> None:
+    caps = _PrRenderCaps(
+        pr_roster_max_commits=1,
+        pr_discussion_max_comments=1,
+        pr_comment_max_chars=4,
+    )
+    lines = _format_pr(
+        _pr(
+            commit_titles=_commit_titles("kept commit", "dropped commit"),
+            comments=_comments(("alice", "hello world"), ("bob", "dropped")),
+        ),
+        caps,
+    )
+    text = "\n".join(lines)
+
+    assert "kept commit" in text
+    assert "dropped commit" not in text
+    assert "[alice] hell" in text  # clipped to 4 chars
+    assert "hello world" not in text
+    assert "[bob]" not in text  # second comment dropped by the cap
+
+
+def test_format_pr_tolerates_malformed_json() -> None:
+    lines = _format_pr(_pr(commit_titles="not json", comments="{}"))
+    text = "\n".join(lines)
+
+    assert "Squashed commits:" not in text
+    assert "Discussion:" not in text
+
+
+# ---- RationaleConfig PR-rendering caps -----------------------------------
+
+
+def test_rationale_config_cap_defaults() -> None:
+    cfg = RationaleConfig()
+
+    assert cfg.pr_roster_max_commits == 30
+    assert cfg.pr_discussion_max_comments == 20
+    assert cfg.pr_comment_max_chars == 500
+
+
+def test_pr_render_caps_from_config_projects_fields() -> None:
+    caps = _PrRenderCaps.from_config(
+        RationaleConfig(
+            pr_roster_max_commits=3,
+            pr_discussion_max_comments=2,
+            pr_comment_max_chars=10,
+        )
+    )
+
+    assert caps.pr_roster_max_commits == 3
+    assert caps.pr_discussion_max_comments == 2
+    assert caps.pr_comment_max_chars == 10
+
+
 # ---- _format_symbol_context ---------------------------------------------
 
 
diff --git a/tests/test_core_config_analyze.py b/tests/test_core_config_analyze.py
index 4e3801d..1a32f74 100644
--- a/tests/test_core_config_analyze.py
+++ b/tests/test_core_config_analyze.py
@@ -28,6 +28,7 @@ def test_analyze_defaults_when_section_omitted(tmp_path: Path) -> None:
     assert cfg.analyze.provider == "anthropic"
     assert cfg.analyze.max_diff_chars == 50_000
     assert cfg.analyze.large_commit_file_count == 30
+    assert cfg.analyze.pr_origin_min_commits == 5
     assert cfg.analyze.timeout_sec is None
 
 
@@ -99,6 +100,25 @@ def test_analyze_invalid_large_commit_file_count_raises(tmp_path: Path) -> None:
         Config.from_toml(config)
 
 
+def test_analyze_pr_origin_min_commits_parsed(tmp_path: Path) -> None:
+    config = _write(
+        tmp_path / "whygraph.toml",
+        "[analyze]\npr_origin_min_commits = 12\n",
+    )
+    cfg = Config.from_toml(config)
+
+    assert cfg.analyze.pr_origin_min_commits == 12
+
+
+def test_analyze_invalid_pr_origin_min_commits_raises(tmp_path: Path) -> None:
+    config = _write(
+        tmp_path / "whygraph.toml",
+        "[analyze]\npr_origin_min_commits = 0\n",
+    )
+    with pytest.raises(ConfigError, match="pr_origin_min_commits"):
+        Config.from_toml(config)
+
+
 def test_analyze_config_is_frozen() -> None:
     cfg = AnalyzeConfig()
     with pytest.raises(Exception):  # FrozenInstanceError
diff --git a/tests/test_core_config_rationale.py b/tests/test_core_config_rationale.py
index 6571022..915ed05 100644
--- a/tests/test_core_config_rationale.py
+++ b/tests/test_core_config_rationale.py
@@ -13,7 +13,7 @@
 
 import pytest
 
-from whygraph.core.config import Config, RationaleConfig
+from whygraph.core.config import Config, ConfigError, RationaleConfig
 
 
 def _write(path: Path, body: str) -> Path:
@@ -71,6 +71,49 @@ def test_rationale_unknown_key_warns_but_loads(
     assert cfg.rationale.provider == "anthropic"
 
 
+def test_rationale_pr_render_cap_defaults(tmp_path: Path) -> None:
+    config = _write(tmp_path / "whygraph.toml", "")
+    cfg = Config.from_toml(config)
+
+    assert cfg.rationale.pr_roster_max_commits == 30
+    assert cfg.rationale.pr_discussion_max_comments == 20
+    assert cfg.rationale.pr_comment_max_chars == 500
+
+
+def test_rationale_pr_render_caps_parsed(tmp_path: Path) -> None:
+    config = _write(
+        tmp_path / "whygraph.toml",
+        "[rationale]\n"
+        "pr_roster_max_commits = 5\n"
+        "pr_discussion_max_comments = 3\n"
+        "pr_comment_max_chars = 120\n",
+    )
+    cfg = Config.from_toml(config)
+
+    assert cfg.rationale.pr_roster_max_commits == 5
+    assert cfg.rationale.pr_discussion_max_comments == 3
+    assert cfg.rationale.pr_comment_max_chars == 120
+
+
+@pytest.mark.parametrize(
+    "field_name",
+    [
+        "pr_roster_max_commits",
+        "pr_discussion_max_comments",
+        "pr_comment_max_chars",
+    ],
+)
+def test_rationale_pr_render_cap_below_one_raises(
+    tmp_path: Path, field_name: str
+) -> None:
+    config = _write(
+        tmp_path / "whygraph.toml",
+        f"[rationale]\n{field_name} = 0\n",
+    )
+    with pytest.raises(ConfigError, match=field_name):
+        Config.from_toml(config)
+
+
 def test_rationale_config_is_frozen() -> None:
     cfg = RationaleConfig()
     with pytest.raises(Exception):  # FrozenInstanceError
diff --git a/tests/test_db_plumbing.py b/tests/test_db_plumbing.py
index 5568333..be91576 100644
--- a/tests/test_db_plumbing.py
+++ b/tests/test_db_plumbing.py
@@ -77,6 +77,59 @@ def test_alembic_upgrade_on_empty_db(_isolate_config_and_engine: Path) -> None:
     assert _table_names(db_path) == SQLMODEL_TABLES | {"alembic_version"}
 
 
+def _insert_commit(sha: str, **overrides: object) -> None:
+    """Insert one minimal ``commit`` row, applying any field overrides."""
+    from whygraph.db import get_session
+    from whygraph.db.models import Commit
+
+    fields: dict[str, object] = dict(
+        sha=sha,
+        parent_shas="[]",
+        author_name="Jane",
+        author_email="jane@example.com",
+        authored_at="2026-01-01T00:00:00Z",
+        committed_at="2026-01-01T00:00:00Z",
+        subject="s",
+        body="",
+        files_changed=1,
+        insertions=1,
+        deletions=0,
+        scanned_at="2026-01-02T00:00:00Z",
+    )
+    fields.update(overrides)
+    with get_session() as session:
+        session.add(Commit(**fields))
+        session.commit()
+
+
+def test_commit_on_default_branch_default_and_explicit(
+    _isolate_config_and_engine: Path,
+) -> None:
+    """The column defaults to 1; an explicit 0 (PR-origin commit) persists."""
+    from sqlmodel import select
+
+    from whygraph.db import get_session
+    from whygraph.db.models import Commit
+
+    command.upgrade(alembic_config(), "head")
+    _insert_commit("default_sha")  # no on_default_branch → server default
+    _insert_commit("origin_sha", on_default_branch=0)
+
+    with get_session() as session:
+        # Read scalars inside the session: the default value is server-set, so
+        # accessing it on a detached row would trigger a refresh load and raise
+        # DetachedInstanceError.
+        default_on_main = session.get(Commit, "default_sha").on_default_branch
+        origin_on_main = session.get(Commit, "origin_sha").on_default_branch
+        on_main = set(
+            session.exec(select(Commit.sha).where(Commit.on_default_branch == 1)).all()
+        )
+
+    assert default_on_main == 1
+    assert origin_on_main == 0
+    assert on_main == {"default_sha"}
+
+
 def test_connect_pragmas_applied(_isolate_config_and_engine: Path) -> None:
     """The connect listener enables WAL + a busy timeout for concurrent writers."""
     engine = db_engine.get_engine()
diff --git a/tests/test_mcp_evidence.py b/tests/test_mcp_evidence.py
index 058cd32..c013a91 100644
--- a/tests/test_mcp_evidence.py
+++ b/tests/test_mcp_evidence.py
@@ -15,7 +15,7 @@
 from whygraph.db import get_session
 from whygraph.db.models import Commit, CommitFileChange, Issue, PRIssueLink, PullRequest
 from whygraph.mcp.errors import WhyGraphError
-from whygraph.mcp.evidence import whygraph_evidence_for
+from whygraph.mcp.evidence import _pr_dict, whygraph_evidence_for
 from whygraph.scan.refactor_score import BORING_THRESHOLD
 from whygraph.services.git import Repository
 
@@ -73,6 +73,33 @@ def _db_pr(*, number: int, merge_commit_sha: str) -> PullRequest:
     )
 
 
+# ---- _pr_dict serialization ---------------------------------------------
+
+
+def test_pr_dict_includes_decoded_commit_titles_and_comments() -> None:
+    pr = _db_pr(number=1, merge_commit_sha="squashsha")
+    pr.commit_titles = '[{"oid": "abc123", "headline": "first", "author_name": "Jane"}]'
+    pr.comments = '[{"author": "alice", "body": "lgtm", "created_at": "x"}]'
+
+    out = _pr_dict(pr)
+
+    assert out["commit_titles"] == [
+        {"oid": "abc123", "headline": "first", "author_name": "Jane"}
+    ]
+    assert out["comments"] == [{"author": "alice", "body": "lgtm", "created_at": "x"}]
+
+
+def test_pr_dict_malformed_commit_titles_and_comments_yield_empty_lists() -> None:
+    pr = _db_pr(number=1, merge_commit_sha="squashsha")
+    pr.commit_titles = "not json"
+    pr.comments = "{}"  # an object, not a list
+
+    out = _pr_dict(pr)
+
+    assert out["commit_titles"] == []
+    assert out["comments"] == []
+
+
 def _db_issue(*, number: int) -> Issue:
     """A WhyGraph ``issue`` row for tests."""
     return Issue(
diff --git a/tests/test_mcp_evidence_pr_origin.py b/tests/test_mcp_evidence_pr_origin.py
new file mode 100644
index 0000000..8258f74
--- /dev/null
+++ b/tests/test_mcp_evidence_pr_origin.py
@@ -0,0 +1,291 @@
+"""Per-line squash attribution (Stage 2 of the squash-merge recovery plan).
+
+Builds a real squash scenario on disk: a feature branch whose commits
+authored ``sample.py``'s lines, collapsed into one squash commit on
+``main``. With the original commits recovered as ``on_default_branch=0``
+rows and linked through a PR's ``commit_titles``,
+``whygraph_evidence_for`` must re-blame the queried lines at the PR's
+``head_sha`` and surface each original commit as ``source="pr-origin"``.
+"""
+
+from __future__ import annotations
+
+import subprocess
+from pathlib import Path
+
+import pytest
+
+from whygraph.analyze import CommitEvidence
+from whygraph.db import get_session
+from whygraph.db.models import Commit, PullRequest
+from whygraph.mcp.evidence import (
+    _attribute_squash_origins,
+    _should_replace,
+    whygraph_evidence_for,
+)
+from whygraph.mcp.targets import Target
+from whygraph.services.git import Repository
+
+
+def _git(repo: Path, *args: str) -> str:
+    return subprocess.run(
+        ["git", "-C", str(repo), *args],
+        check=True,
+        capture_output=True,
+        text=True,
+    ).stdout
+
+
+def _origin_row(sha: str, *, subject: str, committed_at: str) -> Commit:
+    """A recovered PR-origin commit row (on_default_branch=0)."""
+    return Commit(
+        sha=sha,
+        parent_shas="",
+        author_name="Feature Dev",
+        author_email="dev@example.com",
+        authored_at=committed_at,
+        committed_at=committed_at,
+        subject=subject,
+        body="",
+        files_changed=1,
+        insertions=1,
+        deletions=0,
+        scanned_at="2026-05-01T00:00:00+00:00",
+        on_default_branch=0,
+    )
+
+
+def _squash_row(sha: str, *, committed_at: str) -> Commit:
+    """The squash commit on the main walk (small — not file-bulk, so this
+    fixture also guards the §4.8 'commit-rich but not file-bulk' trigger)."""
+    return Commit(
+        sha=sha,
+        parent_shas="",
+        author_name="Maintainer",
+        author_email="maint@example.com",
+        authored_at=committed_at,
+        committed_at=committed_at,
+        subject="Squash-merge feature",
+        body="",
+        files_changed=1,
+        insertions=3,
+        deletions=0,
+        scanned_at="2026-05-01T00:00:00+00:00",
+        on_default_branch=1,
+    )
+
+
+def _build_squash_repo(root: Path) -> dict[str, str]:
+    """A repo whose ``main`` HEAD is a squash of a 2-commit feature branch.
+
+    ``sample.py`` ends as three lines: feat1 wrote lines 1-2, feat2 added
+    line 3; the squash commit on ``main`` reproduces all three at once.
+    The feature tip is pinned under ``refs/whygraph/pull/1`` (and its
+    branch deleted) to mirror what the enricher leaves behind.
+
+    Returns a dict of ``squash`` / ``feat1`` / ``feat2`` → sha.
+    """
+    root.mkdir(parents=True, exist_ok=True)
+    _git(root, "init", "-q", "-b", "main")
+    _git(root, "config", "user.email", "test@example.com")
+    _git(root, "config", "user.name", "Test User")
+    _git(root, "config", "commit.gpgsign", "false")
+
+    # Base commit on main — no sample.py yet.
+    (root / "README.md").write_text("base\n")
+    _git(root, "add", "README.md")
+    _git(root, "commit", "-q", "-m", "base")
+
+    # Feature branch: two commits authoring sample.py line by line.
+    _git(root, "checkout", "-q", "-b", "feature")
+    (root / "sample.py").write_text("line one\nline two\n")
+    _git(root, "add", "sample.py")
+    _git(root, "commit", "-q", "-m", "feat: first two lines")
+    feat1 = _git(root, "rev-parse", "HEAD").strip()
+    (root / "sample.py").write_text("line one\nline two\nline three\n")
+    _git(root, "add", "sample.py")
+    _git(root, "commit", "-q", "-m", "feat: third line")
+    feat2 = _git(root, "rev-parse", "HEAD").strip()
+
+    # Squash commit on main: the same final sample.py in one commit.
+    _git(root, "checkout", "-q", "main")
+    (root / "sample.py").write_text("line one\nline two\nline three\n")
+    _git(root, "add", "sample.py")
+    _git(root, "commit", "-q", "-m", "Squash-merge feature")
+    squash = _git(root, "rev-parse", "HEAD").strip()
+
+    # Pin the feature tip the way the enricher does, then drop the branch.
+    _git(root, "update-ref", "refs/whygraph/pull/1", feat2)
+    _git(root, "branch", "-q", "-D", "feature")
+
+    return {"squash": squash, "feat1": feat1, "feat2": feat2}
+
+
+def _seed_squash_pr(shas: dict[str, str]) -> None:
+    with get_session() as session:
+        session.add(
+            _squash_row(shas["squash"], committed_at="2026-04-01T00:00:00+00:00")
+        )
+        session.add(
+            _origin_row(
+                shas["feat1"],
+                subject="feat: first two lines",
+                committed_at="2026-03-01T00:00:00+00:00",
+            )
+        )
+        session.add(
+            _origin_row(
+                shas["feat2"],
+                subject="feat: third line",
+                committed_at="2026-03-02T00:00:00+00:00",
+            )
+        )
+        session.add(
+            PullRequest(
+                number=1,
+                title="Feature",
+                state="MERGED",
+                created_at="2026-03-01T00:00:00+00:00",
+                updated_at="2026-04-01T00:00:00+00:00",
+                merged_at="2026-04-01T00:00:00+00:00",
+                merge_commit_sha=shas["squash"],
+                head_sha=shas["feat2"],
+                base_ref="main",
+                html_url="https://example.test/pr/1",
+                labels="[]",
+                fetched_at="2026-04-02T00:00:00+00:00",
+                commit_titles=(
+                    f'[{{"oid": "{shas["feat1"]}", "headline": "feat: first two lines"}},'
+                    f' {{"oid": "{shas["feat2"]}", "headline": "feat: third line"}}]'
+                ),
+            )
+        )
+
+
+def test_evidence_attributes_squash_lines_to_origin_commits(
+    tmp_path: Path,
+    whygraph_db_initialized: Path,
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    """A line owned by the squash commit yields pr-origin entries whose
+    commits match ``git blame <head_sha>`` of the same range."""
+    shas = _build_squash_repo(tmp_path / "repo")
+    _seed_squash_pr(shas)
+
+    monkeypatch.chdir(tmp_path / "repo")
+    result = whygraph_evidence_for(path="sample.py", line_start=1, line_end=3)
+
+    by_source: dict[str, set[str]] = {}
+    for item in result["evidence"]:
+        by_source.setdefault(item["source"], set()).add(item["commit"]["sha"])
+
+    # The squash commit itself surfaces via direct HEAD blame...
+    assert shas["squash"] in by_source.get("blame", set())
+    # ...and the two originals surface via per-line squash attribution.
+    assert by_source.get("pr-origin") == {shas["feat1"], shas["feat2"]}
+
+    # Cross-check against git blame at head_sha directly.
+    blame = Repository(tmp_path / "repo").blame("sample.py", 1, 3, rev=shas["feat2"])
+    blamed = {h.sha for h in blame if not h.is_uncommitted}
+    assert blamed == {shas["feat1"], shas["feat2"]}
+
+
+def test_attribute_squash_origins_degrades_on_bad_head_sha(
+    tmp_path: Path,
+    whygraph_db_initialized: Path,
+) -> None:
+    """An unresolvable head_sha (GC'd ref) is swallowed — no error, no
+    hunks — rather than failing the evidence collection."""
+    shas = _build_squash_repo(tmp_path / "repo")
+    with get_session() as session:
+        session.add(
+            _squash_row(shas["squash"], committed_at="2026-04-01T00:00:00+00:00")
+        )
+        session.add(
+            _origin_row(
+                shas["feat1"], subject="x", committed_at="2026-03-01T00:00:00+00:00"
+            )
+        )
+        session.add(
+            PullRequest(
+                number=2,
+                title="Feature",
+                state="MERGED",
+                created_at="2026-03-01T00:00:00+00:00",
+                updated_at="2026-04-01T00:00:00+00:00",
+                merged_at="2026-04-01T00:00:00+00:00",
+                merge_commit_sha=shas["squash"],
+                head_sha="0" * 40,  # not a real object
+                base_ref="main",
+                html_url="https://example.test/pr/2",
+                labels="[]",
+                fetched_at="2026-04-02T00:00:00+00:00",
+                commit_titles=f'[{{"oid": "{shas["feat1"]}", "headline": "x"}}]',
+            )
+        )
+
+    repo = Repository(tmp_path / "repo")
+    target = Target(path="sample.py", line_start=1, line_end=3, qualified_name=None)
+    with get_session() as session:
+        hunks = _attribute_squash_origins(
+            repo, target, blame_shas={shas["squash"]}, session=session
+        )
+    assert hunks == []
+
+
+def test_unenriched_squash_yields_no_pr_origin(
+    tmp_path: Path,
+    whygraph_db_initialized: Path,
+) -> None:
+    """A squash PR with no recovered origin rows is not attributed (the
+    trigger requires >= 1 on_default_branch=0 commit)."""
+    shas = _build_squash_repo(tmp_path / "repo")
+    with get_session() as session:
+        session.add(
+            _squash_row(shas["squash"], committed_at="2026-04-01T00:00:00+00:00")
+        )
+        # No origin rows inserted for this PR.
+        session.add(
+            PullRequest(
+                number=3,
+                title="Feature",
+                state="MERGED",
+                created_at="2026-03-01T00:00:00+00:00",
+                updated_at="2026-04-01T00:00:00+00:00",
+                merged_at="2026-04-01T00:00:00+00:00",
+                merge_commit_sha=shas["squash"],
+                head_sha=shas["feat2"],
+                base_ref="main",
+                html_url="https://example.test/pr/3",
+                labels="[]",
+                fetched_at="2026-04-02T00:00:00+00:00",
+                commit_titles=f'[{{"oid": "{shas["feat1"]}", "headline": "x"}}]',
+            )
+        )
+
+    repo = Repository(tmp_path / "repo")
+    target = Target(path="sample.py", line_start=1, line_end=3, qualified_name=None)
+    with get_session() as session:
+        hunks = _attribute_squash_origins(
+            repo, target, blame_shas={shas["squash"]}, session=session
+        )
+    assert hunks == []
+
+
+# ---- source priority (plan unit test 6) -------------------------------------
+
+
+def _ev(sha: str, source: str) -> CommitEvidence:
+    return CommitEvidence(
+        _origin_row(sha, subject="x", committed_at="x"), source=source
+    )
+
+
+def test_pr_origin_beats_area_but_loses_to_blame() -> None:
+    # pr-origin supersedes area / blame-walked / predecessor-blame...
+    assert _should_replace(_ev("s", "area"), "pr-origin") is True
+    assert _should_replace(_ev("s", "blame-walked"), "pr-origin") is True
+    # ...but a direct HEAD blame hit wins over pr-origin.
+    assert _should_replace(_ev("s", "blame"), "pr-origin") is False
+    # And pr-origin is not displaced by the weaker labels.
+    assert _should_replace(_ev("s", "pr-origin"), "area") is False
diff --git a/tests/test_scan_pr_origin_enricher.py b/tests/test_scan_pr_origin_enricher.py
new file mode 100644
index 0000000..793ff4c
--- /dev/null
+++ b/tests/test_scan_pr_origin_enricher.py
@@ -0,0 +1,355 @@
+"""Tests for :mod:`whygraph.scan.pr_origin_enricher` (squash-merge recovery).
+
+Covers the balanced enrichment gate, the end-to-end ``work()`` loop with a
+stubbed repository (so no network / real git is needed), and the §4.10
+``on_default_branch`` query guards that keep recovered origin commits out
+of the main-walk-only queries.
+"""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+from rich.progress import Progress
+
+from whygraph.db import get_session
+from whygraph.db.models import Commit, CommitFileChange, PullRequest
+from whygraph.scan.pr_origin_enricher import (
+    PROriginEnricher,
+    _select_candidates,
+)
+from whygraph.services.git.commit import Commit as CommitDC
+from whygraph.services.git.commit import DiffStats
+
+
+def _commit(sha: str, *, files_changed: int = 1, on_default_branch: int = 1) -> Commit:
+    return Commit(
+        sha=sha,
+        parent_shas="",
+        author_name="Test User",
+        author_email="tester@example.com",
+        authored_at="2026-01-01T00:00:00+00:00",
+        committed_at="2026-01-01T00:00:00+00:00",
+        subject=sha,
+        body="",
+        files_changed=files_changed,
+        insertions=1,
+        deletions=0,
+        scanned_at="2026-01-02T00:00:00+00:00",
+        on_default_branch=on_default_branch,
+    )
+
+
+def _pr(
+    number: int,
+    *,
+    oids: list[str],
+    merge_commit_sha: str | None = None,
+    merged: bool = True,
+) -> PullRequest:
+    return PullRequest(
+        number=number,
+        title=f"PR {number}",
+        state="MERGED" if merged else "OPEN",
+        created_at="2026-01-01T00:00:00+00:00",
+        updated_at="2026-01-01T00:00:00+00:00",
+        merged_at="2026-01-02T00:00:00+00:00" if merged else None,
+        merge_commit_sha=merge_commit_sha,
+        head_sha=f"head{number}",
+        base_ref="main",
+        html_url=f"https://example.test/pr/{number}",
+        labels="[]",
+        fetched_at="2026-01-02T00:00:00+00:00",
+        commit_titles=json.dumps([{"oid": o, "headline": o} for o in oids]),
+    )
+
+
+def _dc(oid: str, *, files_changed: int = 2) -> CommitDC:
+    return CommitDC(
+        sha=oid,
+        parent_shas=(),
+        author_name="Feature Dev",
+        author_email="dev@example.com",
+        authored_at="2025-12-01T00:00:00+00:00",
+        committed_at="2025-12-01T00:00:00+00:00",
+        subject=f"original {oid}",
+        body="full body",
+        stats=DiffStats(files_changed=files_changed, insertions=3, deletions=1),
+    )
+
+
+class _StubRepo:
+    """Duck-typed stand-in for :class:`Repository` — no network, no git.
+
+    Records the refspecs passed to :meth:`fetch_refs` and serves
+    :meth:`commit_metadata` from a preloaded oid → :class:`CommitDC` map.
+    An oid absent from the map raises, mirroring an unknown/GC'd object.
+    """
+
+    def __init__(self, metadata: dict[str, CommitDC]) -> None:
+        self._metadata = metadata
+        self.fetched: list[list[str]] = []
+
+    def fetch_refs(self, refspecs: list[str], *, remote: str | None = None) -> None:
+        self.fetched.append(list(refspecs))
+
+    def commit_metadata(self, ref: str) -> CommitDC:
+        from whygraph.services.git import GitError
+
+        try:
+            return self._metadata[ref]
+        except KeyError as exc:  # pragma: no cover - defensive
+            raise GitError(f"unknown commit {ref}") from exc
+
+
+# --- balanced gate (plan §3.3 / unit test 4) ---------------------------------
+
+
+def test_gate_enriches_file_bulk_squash(whygraph_db_initialized: Path) -> None:
+    """Squash (oids absent) + file-bulk merge commit → enrich."""
+    with get_session() as session:
+        # The squash commit IS on the main walk and is file-bulk (> threshold).
+        session.add(_commit("squash1", files_changed=40))
+        session.add(_pr(1, oids=["o1", "o2"], merge_commit_sha="squash1"))
+        session.commit()
+        candidates = _select_candidates(
+            session, min_commits=5, large_commit_file_count=30
+        )
+
+    assert len(candidates) == 1
+    assert candidates[0].number == 1
+    assert sorted(candidates[0].new_oids) == ["o1", "o2"]
+
+
+def test_gate_enriches_commit_rich_squash(whygraph_db_initialized: Path) -> None:
+    """Squash + many collapsed commits (>= min_commits) → enrich, even when
+    the merge commit touched few files."""
+    with get_session() as session:
+        session.add(_commit("squash2", files_changed=2))  # not file-bulk
+        session.add(
+            _pr(2, oids=[f"c{i}" for i in range(5)], merge_commit_sha="squash2")
+        )
+        session.commit()
+        candidates = _select_candidates(
+            session, min_commits=5, large_commit_file_count=30
+        )
+
+    assert len(candidates) == 1
+    assert candidates[0].number == 2
+
+
+def test_gate_skips_below_both_thresholds(whygraph_db_initialized: Path) -> None:
+    """Squash but small (few files, few commits) → skip."""
+    with get_session() as session:
+        session.add(_commit("squash3", files_changed=2))
+        session.add(_pr(3, oids=["x1", "x2"], merge_commit_sha="squash3"))
+        session.commit()
+        candidates = _select_candidates(
+            session, min_commits=5, large_commit_file_count=30
+        )
+
+    assert candidates == []
+
+
+def test_gate_skips_merged_as_own_commits(whygraph_db_initialized: Path) -> None:
+    """A plain merge / rebase — the PR's oids are already on the main walk —
+    is skipped regardless of size (no history was lost)."""
+    with get_session() as session:
+        # All of the PR's oids are present in `commit` on the main walk.
+        for oid in (f"m{i}" for i in range(8)):
+            session.add(_commit(oid, files_changed=1))
+        session.add(_commit("merge4", files_changed=50))
+        session.add(_pr(4, oids=[f"m{i}" for i in range(8)], merge_commit_sha="merge4"))
+        session.commit()
+        candidates = _select_candidates(
+            session, min_commits=5, large_commit_file_count=30
+        )
+
+    assert candidates == []
+
+
+def test_gate_ignores_unmerged_prs(whygraph_db_initialized: Path) -> None:
+    """An open PR is never a candidate even if it is large."""
+    with get_session() as session:
+        session.add(_commit("tip5", files_changed=40))
+        session.add(
+            _pr(
+                5, oids=[f"u{i}" for i in range(9)], merge_commit_sha=None, merged=False
+            )
+        )
+        session.commit()
+        candidates = _select_candidates(
+            session, min_commits=5, large_commit_file_count=30
+        )
+
+    assert candidates == []
+
+
+# --- end-to-end work() (plan integration test 7) -----------------------------
+
+
+def test_work_inserts_origin_rows_and_fetches_only_candidates(
+    whygraph_db_initialized: Path,
+) -> None:
+    """One batched fetch of only the candidate refspec; origin rows land
+    with on_default_branch=0 and full git metadata."""
+    with get_session() as session:
+        session.add(_commit("squash6", files_changed=40))
+        session.add(_pr(6, oids=["a6", "b6"], merge_commit_sha="squash6"))
+        session.commit()
+
+    repo = _StubRepo({"a6": _dc("a6"), "b6": _dc("b6")})
+    enricher = PROriginEnricher(
+        Progress(), repository=repo, min_commits=5, large_commit_file_count=30
+    )
+    enricher.run()
+
+    assert enricher.error is None
+    # Exactly one fetch, carrying only PR #6's targeted refspec (no wildcard).
+    assert repo.fetched == [["refs/pull/6/head:refs/whygraph/pull/6"]]
+
+    with get_session() as session:
+        a6 = session.get(Commit, "a6")
+        b6 = session.get(Commit, "b6")
+        a6_flag = a6.on_default_branch
+        a6_body = a6.body
+        a6_files = a6.files_changed
+        b6_flag = b6.on_default_branch
+        squash_flag = session.get(Commit, "squash6").on_default_branch
+
+    assert a6_flag == 0 and b6_flag == 0
+    assert a6_body == "full body"  # full message from git, not just headline
+    assert a6_files == 2
+    assert squash_flag == 1  # the squash commit stays on the main walk
+
+
+def test_work_no_candidates_makes_no_fetch(whygraph_db_initialized: Path) -> None:
+    """When nothing is gated, the enricher never touches the network."""
+    with get_session() as session:
+        session.add(_commit("squash7", files_changed=2))
+        session.add(_pr(7, oids=["s1", "s2"], merge_commit_sha="squash7"))
+        session.commit()
+
+    repo = _StubRepo({})
+    enricher = PROriginEnricher(
+        Progress(), repository=repo, min_commits=5, large_commit_file_count=30
+    )
+    enricher.run()
+
+    assert enricher.error is None
+    assert repo.fetched == []
+
+
+def test_work_dedups_oid_shared_across_prs(whygraph_db_initialized: Path) -> None:
+    """An oid attached to two squash PRs is inserted exactly once (no PK
+    collision) — the many-to-many edge §4.3 relies on."""
+    with get_session() as session:
+        session.add(_commit("squashA", files_changed=40))
+        session.add(_commit("squashB", files_changed=40))
+        session.add(_pr(8, oids=["shared", "onlyA"], merge_commit_sha="squashA"))
+        session.add(_pr(9, oids=["shared", "onlyB"], merge_commit_sha="squashB"))
+        session.commit()
+
+    repo = _StubRepo(
+        {"shared": _dc("shared"), "onlyA": _dc("onlyA"), "onlyB": _dc("onlyB")}
+    )
+    enricher = PROriginEnricher(
+        Progress(), repository=repo, min_commits=5, large_commit_file_count=30
+    )
+    enricher.run()
+
+    assert enricher.error is None
+    with get_session() as session:
+        from sqlmodel import func, select
+
+        shared_count = session.exec(
+            select(func.count(Commit.sha)).where(Commit.sha == "shared")
+        ).one()
+    assert shared_count == 1
+
+
+def test_work_skips_scan_when_fetch_fails(whygraph_db_initialized: Path) -> None:
+    """A fetch failure (GC'd ref / no network) degrades gracefully — the
+    scan does not error and no origin rows are written."""
+    from whygraph.services.git import GitError
+
+    class _FailingRepo(_StubRepo):
+        def fetch_refs(self, refspecs, *, remote=None):  # type: ignore[override]
+            raise GitError("network down")
+
+    with get_session() as session:
+        session.add(_commit("squashF", files_changed=40))
+        session.add(_pr(10, oids=["f1", "f2"], merge_commit_sha="squashF"))
+        session.commit()
+
+    repo = _FailingRepo({"f1": _dc("f1"), "f2": _dc("f2")})
+    enricher = PROriginEnricher(
+        Progress(), repository=repo, min_commits=5, large_commit_file_count=30
+    )
+    enricher.run()
+
+    assert enricher.error is None
+    with get_session() as session:
+        assert session.get(Commit, "f1") is None
+        assert session.get(Commit, "f2") is None
+
+
+# --- §4.10 main-walk-only query guards (unit test 5) -------------------------
+
+
+def test_boring_shas_excludes_origin_commits(whygraph_db_initialized: Path) -> None:
+    """``_boring_shas_in`` ignores on_default_branch=0 rows even when their
+    refactor_score crosses the boring threshold."""
+    from whygraph.mcp.evidence import BORING_THRESHOLD, _boring_shas_in
+
+    with get_session() as session:
+        main = _commit("boring_main", on_default_branch=1)
+        main.refactor_score = BORING_THRESHOLD + 5
+        origin = _commit("boring_origin", on_default_branch=0)
+        origin.refactor_score = BORING_THRESHOLD + 5
+        session.add(main)
+        session.add(origin)
+        session.commit()
+
+    result = _boring_shas_in({"boring_main", "boring_origin"})
+    assert result == {"boring_main"}
+
+
+def test_area_history_excludes_origin_commits(whygraph_db_initialized: Path) -> None:
+    """area-history excludes on_default_branch=0 commits even if a
+    commit_file_change row points at the queried path (defensive guard)."""
+    from whygraph.mcp.path_history import area_history_commits
+
+    with get_session() as session:
+        session.add(_commit("area_main", on_default_branch=1))
+        session.add(_commit("area_origin", on_default_branch=0))
+        session.add(
+            CommitFileChange(
+                commit_sha="area_main",
+                path="src/x.py",
+                change_type="M",
+                renamed_from=None,
+                similarity=None,
+                lines_added=1,
+                lines_deleted=0,
+            )
+        )
+        # Artificial: origin commits normally get no file-change rows; this
+        # forces the join to reach one so the explicit guard is exercised.
+        session.add(
+            CommitFileChange(
+                commit_sha="area_origin",
+                path="src/x.py",
+                change_type="M",
+                renamed_from=None,
+                similarity=None,
+                lines_added=1,
+                lines_deleted=0,
+            )
+        )
+        session.commit()
+
+    items = area_history_commits("src/x.py", include_renames=False)
+    shas = {item.commit.sha for item in items}
+    assert shas == {"area_main"}
diff --git a/tests/test_services_git_fetch_metadata.py b/tests/test_services_git_fetch_metadata.py
new file mode 100644
index 0000000..6b4bf1f
--- /dev/null
+++ b/tests/test_services_git_fetch_metadata.py
@@ -0,0 +1,108 @@
+"""Tests for :meth:`Repository.fetch_refs` and :meth:`Repository.commit_metadata`.
+
+Both run the real ``git`` binary against repos on disk (same pattern as
+``test_services_git_diff.py``). ``fetch_refs`` is exercised against a local
+bare "remote" so no network is involved; ``commit_metadata`` parses a single
+commit's full message + stats.
+"""
+
+from __future__ import annotations
+
+import subprocess
+from pathlib import Path
+
+import pytest
+
+from whygraph.services.git import GitError, Repository
+
+
+def _git(cwd: Path, *args: str) -> str:
+    return subprocess.run(
+        ["git", "-C", str(cwd), *args],
+        check=True,
+        capture_output=True,
+        text=True,
+    ).stdout
+
+
+def _init(root: Path) -> None:
+    root.mkdir(parents=True, exist_ok=True)
+    _git(root, "init", "-q", "-b", "main")
+    _git(root, "config", "user.email", "test@example.com")
+    _git(root, "config", "user.name", "Test User")
+    _git(root, "config", "commit.gpgsign", "false")
+
+
+def test_commit_metadata_reads_single_commit_body_and_stats(tmp_path: Path) -> None:
+    root = tmp_path / "repo"
+    _init(root)
+    (root / "a.txt").write_text("one\ntwo\n")
+    _git(root, "add", "a.txt")
+    _git(root, "commit", "-q", "-m", "first")
+    (root / "a.txt").write_text("one\ntwo\nthree\n")
+    _git(root, "add", "a.txt")
+    _git(root, "commit", "-q", "-m", "second subject\n\nA longer body line.")
+    sha = _git(root, "rev-parse", "HEAD").strip()
+
+    commit = Repository(root).commit_metadata(sha)
+
+    assert commit.sha == sha
+    assert commit.subject == "second subject"
+    assert "A longer body line." in commit.body
+    # Only the single commit's stats — not its ancestors'.
+    assert commit.stats.files_changed == 1
+    assert commit.stats.insertions == 1
+
+
+def test_commit_metadata_unknown_ref_raises_git_error(tmp_path: Path) -> None:
+    root = tmp_path / "repo"
+    _init(root)
+    (root / "a.txt").write_text("x\n")
+    _git(root, "add", "a.txt")
+    _git(root, "commit", "-q", "-m", "only")
+
+    with pytest.raises(GitError):
+        Repository(root).commit_metadata("0" * 40)
+
+
+def test_fetch_refs_brings_remote_commit_local(tmp_path: Path) -> None:
+    """A refspec fetch pins a commit that is not reachable from any local
+    branch under our own ref namespace, so it becomes readable offline."""
+    remote = tmp_path / "remote"
+    _init(remote)
+    (remote / "f.txt").write_text("remote work\n")
+    _git(remote, "add", "f.txt")
+    _git(remote, "commit", "-q", "-m", "remote commit")
+    remote_sha = _git(remote, "rev-parse", "HEAD").strip()
+    # Park the commit under a non-branch ref the clone won't pull by default.
+    _git(remote, "update-ref", "refs/pull/1/head", remote_sha)
+
+    local = tmp_path / "local"
+    _init(local)
+    (local / "g.txt").write_text("local\n")
+    _git(local, "add", "g.txt")
+    _git(local, "commit", "-q", "-m", "local commit")
+    _git(local, "remote", "add", "origin", str(remote))
+
+    repo = Repository(local)
+    # Before the fetch the object is absent — metadata read fails.
+    with pytest.raises(GitError):
+        repo.commit_metadata(remote_sha)
+
+    repo.fetch_refs(["refs/pull/1/head:refs/whygraph/pull/1"])
+
+    # Now it resolves, and the local ref lives in our namespace.
+    fetched = repo.commit_metadata(remote_sha)
+    assert fetched.sha == remote_sha
+    assert fetched.subject == "remote commit"
+    assert _git(local, "rev-parse", "refs/whygraph/pull/1").strip() == remote_sha
+
+
+def test_fetch_refs_empty_is_noop(tmp_path: Path) -> None:
+    root = tmp_path / "repo"
+    _init(root)
+    (root / "a.txt").write_text("x\n")
+    _git(root, "add", "a.txt")
+    _git(root, "commit", "-q", "-m", "only")
+    # No remote configured: an empty refspec list must not invoke git at all.
+    Repository(root).fetch_refs([])
diff --git a/whygraph.example.toml b/whygraph.example.toml
index 21904dc..906da0d 100644
--- a/whygraph.example.toml
+++ b/whygraph.example.toml
@@ -28,6 +28,8 @@ provider = "anthropic"        # which [llm.*] adapter to use
 # large_commit_file_count = 30  # commits touching more files than this are described
                               # per-file on demand (not whole-diff) — keeps a bulk
                               # import / squash commit from costing a repo-wide pass
+# pr_origin_min_commits = 5   # recover a squash-merged PR's original commits once it
+                              # collapsed >= this many (file-bulk squashes recover too)
 # timeout_sec = 60            # per-call timeout; default: the adapter's own
 
 [rationale]
@@ -35,6 +37,11 @@ provider = "anthropic"        # which [llm.*] adapter to use
 provider = "anthropic"        # which [llm.*] adapter to use
 # model = "claude-haiku-4-5"  # override the provider's model for rationale only
 # timeout_sec = 60            # per-call timeout; default: the adapter's own
+# Caps on how much of a squash-merged PR is rendered into the rationale prompt
+# (the raw whygraph_evidence_for tool output is uncapped):
+# pr_roster_max_commits = 30      # squashed-commit headlines shown per PR
+# pr_discussion_max_comments = 20 # PR comments shown per PR
+# pr_comment_max_chars = 500      # each PR comment clipped to this length
 
 # Uncomment to override default DB locations (resolved relative to this file):
 # whygraph_db  = ".whygraph/whygraph.db"