diff --git a/docs/squash-merge-recovery-plan.md b/docs/squash-merge-recovery-plan.md new file mode 100644 index 0000000..f7160ee --- /dev/null +++ b/docs/squash-merge-recovery-plan.md @@ -0,0 +1,517 @@ +# Squash-Merge PR Commit Recovery — Implementation Plan + +> **Status: Implemented (2026-06-11) on `feature/squash-merge-recovery`.** All five rollout +> steps (§8) landed as separate commits — Stage 0 (`feat(stage-0)`), the `on_default_branch` +> column + migration, the scan enricher, the no-code Step 4 confirmation, and per-line +> `pr-origin` attribution (`feat(stage-2)`). Full suite green (493 tests). The notes below are +> preserved as the design record. +> _Original status: Reviewed — scope locked (all §0 decisions folded in) + consistency pass applied._ +> **Consistency pass (2026-06-11):** corrected config placement — the three Stage-0 rendering caps +> live in `RationaleConfig` (`[rationale]`), not `AnalyzeConfig`, because `RationaleGenerator` is +> built from `RationaleConfig` (`config.py:256`); only the enrichment gate `pr_origin_min_commits` +> sits in `AnalyzeConfig`. Fixed the §4.8 per-line trigger (it must fire for *every* enriched squash, +> not only file-bulk ones). Added skeleton snippets for the two NEW modules (§4.5, §4.8). +> **Goal:** when a feature PR is squash-merged into the default branch, restore WhyGraph's +> analyzing power over the *original* feature-branch commits — surface them (and the PR +> discussion) as evidence, and attribute each line of the squash commit back to the specific +> original commit that introduced it. +> **Date:** 2026-06-11 + +--- + +## 0. Resolved scope decisions (from the requester) + +| Topic | Decision | +|---|---| +| Scope | **Stage 0 + Stage 1.** Stage 2 (graph-aware fallback splitter for local-only / GC'd history) is **out of scope** for this plan. | +| Attribution | **Per-line mapping required** — map each line of the squash `merge_commit_sha` back to the original PR commit that introduced it, not just PR-level evidence. | +| Provider | **GitHub only** for Stage 1 (the only provider WhyGraph integrates today — `services/github/`). Azure/GitLab `refs/pull` equivalents are a later provider-abstraction concern. | +| Link storage | **No PR↔commit table and no `pr_id` column.** Reuse the existing `commit_titles`/`_linked_prs` linkage (many-to-many safe); only the `commit.on_default_branch` flag is new (§4.3). | +| `commit_titles` retention | **Keep for now**, remove in a later separate migration once `_linked_prs` and Stage 0 are repointed. | +| Stage-0 caps | **Make them `whygraph.toml`-configurable** under **`[rationale]`** (consumed by the rationale generator) and document them in the example config so developers can tune roster/discussion size (§4.2, §4.11). | +| Enrichment cadence | **Remote `scan` phase only** (forced by the offline-MCP / `--no-remote` invariants); diffs/descriptions stay lazy (§3.1, §4.7). | +| Enrichment gate | **Balanced:** squash-detected **AND** (`files_changed > large_commit_file_count` **OR** `len(commit_titles) >= pr_origin_min_commits`, default 5, configurable) (§3.3, §4.11). | +| Fetch strategy | **Targeted batched** — one `git fetch` carrying only the gated candidates' refspecs; local ref count == #candidates, not #all-PRs (§3.1). | +| Enrichment default | **On by default**; `--no-pr-origins` disables, always skipped under `--no-remote` (§4.6). | + +--- + +## 1. Goals & Non-Goals + +### Goals +- **Stage 0** — stop discarding data WhyGraph *already stores*: inline a squash PR's + `commit_titles` and `comments` into the evidence bundle and the rationale prompt. +- **Stage 1** — fetch each squash-merged PR's **original feature-branch commits** (full message at + scan, diff lazily), persist them as `Commit` rows linked to their PR via the existing + `commit_titles` (no new relation), and make them first-class evidence. +- **Per-line attribution** — when a blame hit lands on a squash commit, re-attribute the queried + line range to the original PR commits that authored those lines. + +### Non-Goals +- **Stage 2 / LLM or graph-based diff splitting** — *out of scope.* When no provider data exists + (local-only history, GC'd branch, non-PR squash), the commit keeps today's bulk-stub behaviour. + Reconstructing boundaries from the diff is explicitly deferred. +- **Non-GitHub providers** — Azure DevOps / GitLab have no `refs/pull/*/head`; their enrichment is + deferred. Stage 0 is provider-agnostic and benefits them for free. +- **Re-describing the squash commit whole-diff** — the bulk-stub path (`scan/analyze_crawler.py`, + `analyze/backfill.py`) is **unchanged on purpose**; we add a parallel origin-commit signal, we do + not resurrect the expensive whole-squash-diff LLM call. +- **Changing the first-parent main-branch walk** — `scan/git_crawler.py`'s notion of "a commit on + the default branch" is untouched; origin commits are tagged so they never leak into the + main-walk-only queries (area-history, refactor-walk). + +--- + +## 2. How the system works today (reference map) + +| Concern | Location | Current behaviour | +|---|---|---| +| PR ingestion | `services/github/pull_requests.py:43-56` | GraphQL already pulls `commits(first: 250)` (oid, headline, author) **and** `comments(first: 100)`. | +| PR value object | `services/github/pull_request.py:97-115` | `commits: tuple[CommitSummary,...]`, `comments: tuple[Comment,...]` already parsed. | +| PR persistence | `scan/github_crawler.py:92-129` | `commit_titles` ← JSON of `{oid, headline, author_*}`; `comments` ← JSON of `{author, body, created_at}`. **Both stored, neither surfaced.** | +| PR row | `db/models/pull_request.py:33,41-46` | has `merge_commit_sha` (indexed), `head_sha`, `commit_titles`, `comments`. **No merge-method field.** | +| Squash→PR link | `mcp/evidence.py:81-101` (`_linked_prs`) | matches `merge_commit_sha == sha` OR `head_sha == sha` OR oid in `commit_titles`. A squash commit **is** the `merge_commit_sha`, so the link already works. | +| Evidence serialization | `mcp/evidence.py:465-476` (`_pr_dict`) | emits only number/title/body/state/merged_at/author/html_url/labels — **drops `commit_titles` and `comments`.** | +| Rationale prompt | `analyze/rationale_generator.py:93-102` (`_format_pr`) | renders only PR title + body — **drops commits and comments.** | +| Evidence join | `mcp/evidence.py:200-217` | per blame hunk: `session.get(Commit, hunk.sha)`; drops the hunk if no `Commit` row exists. | +| Blame at a rev | `services/git/repository.py:190-258` + `commands.py:125-209` | `blame(path, a, b, rev=)` already supported; **predecessor-blame uses `rev=parent_sha`** (`evidence.py:299-324`). `-w -M -C` always on. | +| Source labels/priority | `evidence.py:39-44`, `rationale_generator.py:119-124`, `rationale.py:37-62` | `blame > blame-walked > predecessor-blame > area`; each label has a human string. | +| Commit row | `db/models/commit.py:11-32` | PK `sha`; has `parent_shas`, `files_changed`, `refactor_score`, `llm_description`. **No origin discriminator.** | +| PR→commit linkage | `mcp/evidence.py:69-101` (`_commit_titles_contain_oid`, `_linked_prs`) | already resolves a commit to its PR(s) by oid-in-`commit_titles`; returns *all* matching PRs (many-to-many safe). Reused instead of a new link table (see §4.3). | +| Lazy diff/description | `analyze/backfill.py` (`backfill_file_description`, `backfill_all`); `evidence.py:362-449` | per-file diff via `Repository.diff(commit, pathspec=...)`, cached on the `commit_file_change` row. The reuse target for origin-commit diffs. | +| Model registration | `db/models/__init__.py:31-47` | a new model must be imported here for Alembic autogenerate to see it. | + +**Key reuse insight (de-risks per-line attribution):** git already does hunk-matching. A squash +merge applies the PR's `base..head` diff as one commit, so for a changed file the squash tree +equals `head_sha`'s tree. Blaming the *same* `path:line_start-line_end` at **`rev=head_sha`** maps +each line to the original PR commit — exactly the mechanism predecessor-blame already uses with +`rev=parent_sha`. No bespoke hunk re-diffing is required; we add a branch, not an algorithm. + +--- + +## 3. Issues found / foreseen problems + +### 3.1 Original PR commits are not in the local object store (Us / enabling) +After a normal clone, `refs/pull/*/head` are **not** fetched, so the squashed feature-branch +commits aren't local — `git blame rev=head_sha` and `Repository.diff` would fail on unknown SHAs. +- **Mitigation:** during the `scan` GitHub phase (already remote), one **targeted batched** fetch + brings only the gated candidates (§3.3): a single `git fetch origin` call carrying one + `refs/pull//head:refs/whygraph/pull/` refspec per candidate PR. One round-trip, and the local + ref count equals #candidates — **not** the wildcard `refs/pull/*` (which would litter `.git` with a + ref per PR and slow every git op on large repos). Storing under our own `refs/whygraph/pull/*` + namespace pins the objects against local GC. All later blame/diff is then **offline** — preserving + the MCP-server and `--no-remote` git-hook invariants (no network at query time). +- **Required from provider:** GitHub retains `refs/pull/N/head` indefinitely (survives branch + deletion). True for GitHub; **not** for Azure/GitLab → why they're deferred. + +### 3.2 Squash tree may not exactly equal `head_sha` (Mixed / Low) +"Rebase and merge", or a squash re-applied onto an advanced base, can shift surrounding context. +Within a changed file the *line content* still matches (same diff), so `git blame`'s own +move/whitespace-tolerant matching (`-w -M -C`) absorbs the drift. +- **Mitigation:** per-line re-blame is **best-effort**, mirroring predecessor-blame (`evidence.py:314-323` + swallows `GitError` per event). On mismatch we keep PR-level evidence (Stage 1) — never worse than today. + +### 3.3 DB volume if every PR's commits are stored (Us / Medium) +5,000 PRs × ~30 commits = ~150k extra `commit` rows if we enriched every merged PR. +- **Mitigation — balanced gate (resolved Q2).** Enrich a PR only when it is squash-detected (§3.5) + **and** either of the two distinct loss signals fires: + - **file-bulk** — `merge_commit_sha`'s `files_changed > analyze.large_commit_file_count` (a "huge + commit" — the squash that lost the most *description* fidelity, the stub case); **or** + - **commit-rich** — `len(commit_titles) >= analyze.pr_origin_min_commits` (default 5) — a squash + that collapsed many commits' worth of *narrative*, even if few files changed. + File-count and commit-count are **distinct signals**: the first is about description cost, the + second about how much per-commit rationale was destroyed. The `OR` catches both; the threshold knob + (§4.11) caps volume. Small single-/few-commit PRs that merged as their own commits already work via + the existing `commit_titles` link and are skipped. + +### 3.4 Origin commits must not pollute main-walk queries (Us / High) +`commit` today means "first-parent walk of the default branch". area-history (`path_history.py`) and +refactor-walk (`_boring_shas_in`, `evidence.py:286-296`) assume that. +- **Mitigation:** add an `on_default_branch` discriminator to `commit` (default `1`); origin commits + insert as `0`. Gate area-history and refactor-walk on `on_default_branch == 1`. The per-line + blame-at-`head_sha` path resolves origin commits via `session.get(Commit, sha)` regardless. + +### 3.5 No stored merge method (Them→Us / Low) +GraphQL gives `mergeCommit{oid}` but WhyGraph stores no squash/rebase/merge flag. +- **Mitigation:** we don't need the method. **Squash detection** = a merged PR whose `commit_titles` + oids are **absent from the `commit` table** (i.e. the originals are not on the main walk). That PR + then passes the §3.3 balanced gate (file-bulk OR commit-rich) to be enriched. Self-correcting: if + the originals *are* on main (a plain merge / rebase), the normal path already wins and we skip. + +--- + +## 4. Detailed changes (file by file) + +### Phase 0 (Stage 0) — surface already-stored data + +#### 4.1 `mcp/evidence.py` — `_pr_dict` (line 465) +Add the two dropped fields (decode with the existing `_json_list` helper, line 58): +```python +def _pr_dict(pr: PullRequest) -> dict: + return { + # ...existing keys unchanged... + "labels": _json_list(pr.labels), + "commit_titles": _json_list(pr.commit_titles), # NEW + "comments": _json_list(pr.comments), # NEW + } +``` +No new query — both columns are already on the loaded `PullRequest` row (and already detached via +`session.expunge_all()` at `evidence.py:217`). **Intentionally uncapped:** `_pr_dict` feeds the raw +`whygraph_evidence_for` JSON, whose consumer is an agent that can handle the full list — the +context-budget caps (§4.2) apply only to the LLM *rationale prompt*, not to this tool output. + +#### 4.2 `analyze/rationale_generator.py` — `_format_pr` (line 93) +After the body block, append a compact commit roster and the discussion so the LLM sees the +narrative that the squash destroyed. Reuse `_indent_block` (line 69) and the JSON-decode idiom from +`_labels_suffix` (line ~57) — add a small local `_json_list` (mirror `evidence.py:58`). The three +caps are passed in (see threading note below), not read globally: +```python +def _format_pr(pr: PullRequest, caps: RationaleConfig) -> list[str]: # or a small caps tuple + # ...existing title/body lines unchanged... + titles = _json_list(pr.commit_titles)[: caps.pr_roster_max_commits] + if titles: + lines.append(" Squashed commits:") + for c in titles: + lines.append(f" - {c.get('headline','')} ({(c.get('oid') or '')[:9]})") + comments = _json_list(pr.comments)[: caps.pr_discussion_max_comments] + if comments: + lines.append(" Discussion:") + for cm in comments: + who = cm.get("author") or "unknown" + body = (cm.get("body") or "").strip()[: caps.pr_comment_max_chars] + lines.append(_indent_block(f"[{who}] {body}", " ")) + return lines +``` +**Execution gotcha:** `from_config` today *discards* the `RationaleConfig` after pulling +`provider`/`model`/`timeout_sec` (`rationale_generator.py:393`). To make caps reach `_format_pr`, +retain the three cap ints (or the whole `RationaleConfig`) on the generator in `__init__`, then pass +them into `_format_evidence(evidence, caps)` at the call site (line 436). +**Config source & threading (corrected in the consistency pass).** The caps come from +**`RationaleConfig`** (the `[rationale]` table — §4.11), **not** `AnalyzeConfig`: the generator is +built via `RationaleGenerator.from_config(get_config().rationale)` (`rationale_generator.py:360-393`), +so `AnalyzeConfig` never reaches it. `_format_pr`/`_format_evidence` are **module-level** functions +called from `generate()` (`bundle = _format_evidence(evidence)`, line 436). Thread the caps by: +storing them on the generator in `from_config`/`__init__`, then passing a small `caps` value into +`_format_evidence(evidence, caps)` → `_format_pr(pr, caps)`. Keeping them as a parameter (not a +global `get_config()` reach-in) keeps the formatters pure and unit-testable. **Bounding rationale +(resolved Q4):** the rationale bundle is already unbounded (the documented "evidence-bundle builder" +TODO in this module), so this stays a prompt-size guard, not a feature — but developers can tune it. + +### Phase 1 (Stage 1) — provider enrichment + persistence + +#### 4.3 No PR↔commit link table — reuse the existing `commit_titles` linkage +We do **not** add a join table. `_linked_prs()` (`mcp/evidence.py:81-101`) already resolves a +commit to its PR(s) by matching the oid inside the PR's `commit_titles` JSON +(`_commit_titles_contain_oid`, line 69). Recovered origin commits *come from* `commit_titles`, so +once inserted as `Commit` rows the existing query finds their PR(s) for free — and because it +returns **all** matching PRs, the many-to-many edge (a commit shared across stacked / backport PRs) +is handled without a column or a join table. A single `pr_id` column on `commit` was rejected: it +cannot represent that many-to-many and would silently drop links. Commit ordering is taken from +`committed_at` (evidence is already sorted that way, `evidence.py:231`), so no `position` field is +needed either. This is the "don't add the abstraction until a second concrete case forces it" rule +(CLAUDE.md) applied: nothing in scope queries "all original commits of PR #N" relationally. + +#### 4.4 `db/models/commit.py` — origin discriminator (line 32) +```python + on_default_branch: int = Field( + default=1, sa_column_kwargs={"server_default": text("1")} + ) # 0 = recovered PR-origin commit, not on the first-parent main walk +``` +`int` 0/1 to keep SQLite INTEGER affinity (same rationale as `PullRequest.draft`, +`pull_request.py:20-22`). + +#### 4.5 NEW: `scan/pr_origin_enricher.py` +A scan sub-phase (driven from `scan/crawler.py`, after `github_crawler` so PR rows exist). +**Candidate selection (§3.3, §3.5):** a merged PR is a candidate when its `commit_titles` oids are +absent from `commit` (squash) **and** (`merge_commit_sha`'s `files_changed > large_commit_file_count` +**or** `len(commit_titles) >= pr_origin_min_commits`). Then: +1. Ensure objects are local: **one targeted batched** `git fetch origin ` carrying a + `refs/pull//head:refs/whygraph/pull/` refspec for **each candidate PR only** (not the + `refs/pull/*` wildcard) — one round-trip, candidate-many local refs. +2. For each oid in `commit_titles`: insert a `Commit` row (`on_default_branch=0`) — parse via the + existing `git log -1 ` path (reuse `GitLogShortstatCmd` shape / `Commit.from_git_log`, + `commands.py:250-279`) so full message body + stats come from git, not just the headline. +3. Leave `llm_description` NULL — diffs/descriptions are lazy (4.7). +No link row is written — the PR↔commit association is already carried by `commit_titles` and +resolved at query time by `_linked_prs` (4.3). Idempotent: skip oids already present in `commit` +(mirror `github_crawler.py:67-75`). + +Skeleton (mirrors `GitHubCrawler.work` session/idempotency shape, `github_crawler.py:59-89`): +```python +def enrich(repo: Repository, *, min_commits: int, large_commit_file_count: int) -> None: + with get_session() as session: + existing: set[str] = set(session.exec(select(Commit.sha)).all()) + candidates = [] + for pr in session.exec(select(PullRequest).where(col(PullRequest.merged_at).is_not(None))): + oids = [c["oid"] for c in _json_list(pr.commit_titles) if c.get("oid")] + if not oids or all(o in existing for o in oids): + continue # not a squash (originals already on main) → skip + squash = session.get(Commit, pr.merge_commit_sha) if pr.merge_commit_sha else None + file_bulk = bool(squash and squash.files_changed > large_commit_file_count) + if not (file_bulk or len(oids) >= min_commits): # balanced gate (§3.3) + continue + candidates.append((pr, [o for o in oids if o not in existing])) + if not candidates: + return + # ONE batched fetch — only candidate refs, not refs/pull/* wildcard (§3.1): + refspecs = [f"refs/pull/{pr.number}/head:refs/whygraph/pull/{pr.number}" for pr, _ in candidates] + repo.fetch_refs(refspecs) # NEW thin Repository method (4.5a) + for pr, new_oids in candidates: + for oid in new_oids: + row = repo.commit_metadata(oid) # reuse Commit.from_git_log (commands.py:250-279) + session.add(_to_commit_row(row, on_default_branch=0)) +``` + +#### 4.5a `services/git/` — two thin additions +The enricher needs two small read/fetch helpers; both are new `ShellCommand` pairs in +`services/git/commands.py` + thin `Repository` methods (mirror `GitDiffCmd`/`Repository.diff`, +`commands.py:90-122` / `repository.py:146-188`): +- `fetch_refs(refspecs: list[str])` → `git fetch origin ` (one process, many refspecs). +- `commit_metadata(oid)` → `git log -1 --pretty=… --shortstat ` parsed by the **existing** + `Commit.from_git_log` (`commands.py:250-279`); do not write a new parser. + +#### 4.6 `scan/crawler.py` + `cli/commands/scan.py` +Wire the new phase under a `--pr-origins / --no-pr-origins` flag, **default on** (resolved Q1/default): +squash recovery is the point, so it runs unless opted out. Always **skipped under `--no-remote`** (the +fetch needs network — like the other remote phases). Mirror the existing `--codegraph/--no-codegraph` +flag wiring. + +#### 4.7 `analyze/backfill.py` + `mcp/evidence.py:backfill_evidence_descriptions` +Origin commits are normal-sized real commits → reuse `backfill_all` / the whole-diff path +(`evidence.py:408-449`) as-is; `Repository.diff(commit)` now resolves because the object is local +(4.5 step 1). **No new descriptor code** — only ensure origin commits flow through the existing +`normal` branch (they will: `files_changed <= threshold` and `llm_description is None`). + +### Phase 2 — per-line attribution + +#### 4.8 `mcp/evidence.py` — new `_attribute_squash_origins` signal +Model on `_predecessor_blame` (line 299) and `_walk_past_boring` (line 244). For each blame hit SHA +that is the **`merge_commit_sha` of an *enriched* squash PR**, re-blame the same range at the PR's +`head_sha` and emit each resulting original commit as `source="pr-origin"`. + +**Trigger condition (fixed in the consistency pass).** The trigger is **not** "the squash is a bulk +commit" — that would *skip* commit-rich-but-small squashes that the §3.3 gate enriched (file-bulk +**OR** commit-rich). The correct, gate-agnostic predicate is: *the blame SHA equals a PR's +`merge_commit_sha` and that PR has ≥1 origin `Commit` row* (i.e. it was actually enriched, so +`head_sha`'s objects are local). That naturally covers exactly the PRs Stage 1 enriched. +```python +def _attribute_squash_origins(repo, target, *, blame_shas, session) -> list[BlameHunk]: + out: list[BlameHunk] = [] + for pr in _enriched_squash_prs_for(session, blame_shas): # merge_commit_sha in blame_shas + # AND has on_default_branch=0 origins + try: + hunks = repo.blame(target.path, target.line_start, target.line_end, rev=pr.head_sha) + except GitError: + continue # best-effort, mirrors line 321 + out.extend(h for h in hunks if not h.is_uncommitted) + return out +``` +Feed the result into the existing `labeled_hunks` list (line 194) tagged `"pr-origin"`; the dedupe / +priority / cap machinery (`_should_replace`, `_SOURCE_PRIORITY`, the `limit` slice) then needs no +further change. Each hunk resolves via the existing `session.get(Commit, sha)` (origin rows exist) and +`_linked_prs` (via `commit_titles`). + +#### 4.9 Source-label plumbing (three small edits) +- `evidence.py:39-44` — add `"pr-origin": 0.5` to `_SOURCE_PRIORITY` (just below `blame`=0: a real + authoring commit reached through the squash is high-precision). +- `rationale_generator.py:119-124` — add `"pr-origin": "original commit recovered from a squash-merged PR"`. +- `rationale.py:37-62` — extend the `source` docstring enum (doc-only). + +#### 4.10 Gate main-walk-only queries on `on_default_branch` (defensive) +- `evidence.py:_boring_shas_in` (286-296) — add `.where(col(Commit.on_default_branch) == 1)`. +- `path_history.py:area_history_commits` (the `Commit`⨝`CommitFileChange` join at `path_history.py:116-121`) + — same guard. + +**Why "defensive":** origin commits get **no** `commit_file_change` rows (the enricher writes only +`Commit` rows, §4.5) and default `refactor_score=0`, so they are *already* naturally excluded from +both area-history (which joins through `commit_file_change`) and refactor-walk (which filters +`refactor_score >= BORING_THRESHOLD`). These guards make the invariant explicit and protect against a +future broad `select(Commit)` consumer; they are belt-and-suspenders, not load-bearing — which is why +acceptance criterion 3 (byte-identical area-history) holds trivially. + +#### 4.11 `core/config.py` + `whygraph.example.toml` — new config (resolved Q4) +Two distinct config homes, by *which component consumes the value* (corrected in the consistency pass): + +**(a) Rendering caps → `RationaleConfig` (`config.py:256`), `[rationale]` table.** Consumed by the +rationale generator (§4.2). Add three fields (mirror `RationaleConfig`'s existing field shape): +```python + pr_roster_max_commits: int = 30 # max squashed-commit headlines rendered into a PR block + pr_discussion_max_comments: int = 20 # max PR comments rendered into a PR block + pr_comment_max_chars: int = 500 # per-comment body clip +``` +```toml +[rationale] +# pr_roster_max_commits = 30 # squashed-commit headlines shown per PR in rationale evidence +# pr_discussion_max_comments = 20 # PR comments shown per PR in rationale evidence +# pr_comment_max_chars = 500 # each PR comment clipped to this length +``` + +**(b) Enrichment gate → `AnalyzeConfig` (`config.py:248-252`), `[analyze]` table.** Consumed by the +scan enricher (§4.5). Add one field next to `large_commit_file_count`: +```python + pr_origin_min_commits: int = 5 # commit-rich half of the Stage-1 enrichment gate (§3.3) +``` +```toml +[analyze] +# pr_origin_min_commits = 5 # enrich a squash PR's original commits once it collapsed >= this many +``` +`pr_origin_min_commits` is the commit-count half of the balanced gate (§3.3); the file-bulk half +reuses the existing `large_commit_file_count`. **All four** new fields get the same `>= 1` validation +that `max_diff_chars`/`large_commit_file_count` already have (extend the validation pass at +`config.py:469-477`). Document them commented-out in `whygraph.example.toml` alongside the existing +`# max_diff_chars` / `# large_commit_file_count` lines under `[analyze]` (`whygraph.example.toml:23-28`) +and under the existing `[rationale]` table (`whygraph.example.toml:33`). + +--- + +## 5. Schema / migration + +One Alembic migration (`db/migrations/`) for a single additive change: the `commit.on_default_branch` +column (4.4). No new table (see 4.3). `commit_titles`/`comments` already exist — Phase 0 needs **no** +migration. Follow the existing `migrations/versions/` autogenerate flow; the column is additive with a +server default, so existing rows backfill to `on_default_branch=1` and re-scans are safe. + +--- + +## 6. Acceptance criteria + +1. **Stage 0:** `whygraph_evidence_for` on a line owned by a squash commit returns that commit's PR + with non-empty `commit_titles` and `comments`; the rationale prompt shows a "Squashed commits" + roster and "Discussion". No DB migration required for this criterion. +2. **Stage 1:** after `whygraph scan` on a repo with a squash-merged PR, the `commit` table contains + the PR's original commits with `on_default_branch=0`, and `_linked_prs` resolves each back to the + PR via `commit_titles` (no link table). +3. **Stage 1 isolation:** area-history and refactor-walk results are byte-identical to pre-change for + a repo with no squash PRs (origin commits never leak into those paths). +4. **Per-line:** for a line in a squash commit, the evidence bundle contains a `source="pr-origin"` + entry whose commit is the original feature-branch commit that authored that line (verified against + `git blame `). +5. **Offline & `--no-remote`:** the MCP query path and `whygraph scan --no-remote` make **no** network + calls; enrichment fetch happens only in the remote scan phase. +6. **Graceful degrade:** a non-GitHub repo, a GC'd/absent PR ref, or a squash-vs-head mismatch yields + today's behaviour (PR-level evidence + bulk stub), never an error. + +--- + +## 7. Testing plan + +### Unit +1. `_pr_dict` includes decoded `commit_titles` + `comments`; malformed JSON → `[]` (via `_json_list`). +2. `_format_pr` renders roster + discussion; respects the caps; empty inputs add no lines. +2b. Cap config: the three new `RationaleConfig` fields default correctly, reject `< 1` (mirror + `max_diff_chars` validation), and a lowered cap actually truncates the rendered roster/discussion + (passed in via the generator, not read from a global). +3. `commit.on_default_branch` defaults to 1, origin insert sets 0; `_linked_prs` resolves an inserted + origin commit to its PR via `commit_titles` (incl. a commit shared across two PRs → both returned). +4. Balanced gate: squash (oids-absent-from-`commit`) + file-bulk → enrich; squash + commit-count ≥ + `pr_origin_min_commits` → enrich; squash but below both thresholds → skip; merged-as-own-commits + (oids present on main) → skip regardless of size. +5. `_boring_shas_in` / area-history exclude `on_default_branch=0` rows. +6. Source priority: a SHA surfacing as both `pr-origin` and `area` is kept as `pr-origin`. + +### Integration (fixture repo with a real squash merge) +7. End-to-end scan → origin `commit` rows populated with `on_default_branch=0`; the targeted batched + `git fetch` is mocked to a local bundle (assert it requests only candidate refspecs, not the + wildcard) so the test is offline. +8. `whygraph_evidence_for` on a squash line returns a `pr-origin` entry matching `git blame `. +8b. Per-line trigger covers a **commit-rich but non-file-bulk** squash (≥ `pr_origin_min_commits` + commits, ≤ `large_commit_file_count` files): it was enriched, so its line gets a `pr-origin` entry — + guards against the §4.8 trigger regressing to a file-bulk-only check. + +### Regression +9. Repo with **no** PRs / no remote: evidence + rationale output unchanged vs. `main`. +10. `tests/test_smoke.py` invariants intact (package imports; MCP server named `"whygraph"`). + +--- + +## 8. Rollout order (single-shot recipe) + +### Step 1 — Stage 0: surface `commit_titles` + `comments` (+ rendering-cap config) +Add the three **`RationaleConfig`** cap fields + `>= 1` validation + `[rationale]` example-toml lines +(4.11a); edit `_pr_dict` (4.1, uncapped) and `_format_pr` (4.2, caps passed in); thread the caps from +the generator into `_format_evidence`/`_format_pr`; add a local `_json_list` in the generator. + +**Model:** Sonnet 4.6 · **Complexity:** Low +**Why this model:** Well-specified renderers reading columns already on the row + three config fields +that mirror existing siblings; no schema migration, no queries. +**Execution notes:** Caps live in **`RationaleConfig`** (`config.py:256`), **not** `AnalyzeConfig` — +the generator is built from `RationaleConfig` (§4.2). Store them on the generator in `from_config`/ +`__init__` and pass into `_format_evidence(evidence, caps)` → `_format_pr(pr, caps)`; do **not** reach +into `get_config()` from the module-level formatters. Add a local `_json_list` (mirror `evidence.py:58`) +and reuse `_indent_block` (`rationale_generator.py:69`). Mirror `max_diff_chars`'s `>= 1` validation +(`config.py:469-477`). `_pr_dict` stays **uncapped** (§4.1). Do not touch the bulk-stub path. Do **not** +add `pr_origin_min_commits` here — that field belongs to Step 3 (the enricher). Ships independently. +**Verify:** `uv run pytest tests/ -k "pr_dict or format_pr or rationale_config"`; manual `whygraph_evidence_for` on a known squash line. + +### Step 2 — Schema: `commit.on_default_branch` + migration +Add the column (4.4) to `db/models/commit.py`; autogenerate one additive Alembic migration (§5). No new +table — the PR↔commit link is the existing `commit_titles` (4.3). + +**Model:** Sonnet 4.6 · **Complexity:** Low +**Why this model:** A single column add + an additive migration; nothing to design. +**Execution notes:** Column default uses `server_default=text("1")` (mirror `pull_request.py:28`). Run +the project's Alembic autogenerate flow; review the generated migration is additive-only (no table +rewrite). Do **not** add a join table or a `pr_id` column (§4.3 explains why). +**Verify:** `uv run alembic upgrade head` on a copy of a scanned DB; existing rows show `on_default_branch=1`. + +### Step 3 — Enricher: fetch + persist original commits +Add `AnalyzeConfig.pr_origin_min_commits` + validation + `[analyze]` example-toml line (4.11b). Add the +two thin git helpers `fetch_refs` / `commit_metadata` (4.5a). Build `scan/pr_origin_enricher.py` (4.5), +wire the phase + `--pr-origins` flag (4.6), apply the §4.10 query guards. + +**Model:** Opus 4.8 · **Complexity:** High +**Why this model:** New scan phase with network, idempotency, balanced-gate detection, and the +`on_default_branch` isolation contract — correctness across re-scans and `--no-remote` matters. +**Execution notes:** **One targeted batched** `git fetch` carrying only the gated candidates' refspecs +(§3.1) — **not** the `refs/pull/*` wildcard. Apply the balanced gate (§3.3) exactly as in the §4.5 +skeleton: squash (oids-absent-from-`commit`) AND (file-bulk OR `len(commit_titles) >= pr_origin_min_commits`). +`pr_origin_min_commits` is added **here** (`AnalyzeConfig`), not in Step 1. Default the phase **on**; skip +entirely under `--no-remote`. Reuse `Commit.from_git_log` for full bodies via the new `commit_metadata` +helper (4.5a); do **not** invent a new commit parser. Apply the §4.10 guards in the same step or +area-history could regress for a future consumer. +**Verify:** integration test 7; unit test 4; regression test 9. + +### Step 4 — Lazy diffs for origin commits +Confirm origin commits flow through the existing `normal` backfill branch (4.7). + +**Model:** Haiku 4.5 · **Complexity:** Low +**Why this model:** Likely zero code — verifying the existing `backfill_all` path already covers them. +**Execution notes:** Origin commits are `files_changed <= threshold` with `llm_description is None`, so +`backfill_evidence_descriptions` (`evidence.py:408-412`) already routes them to `backfill_all`. Only add +code if a test proves they're missed. Do not special-case them. +**Verify:** integration test 8 shows a populated `llm_description` on a `pr-origin` commit. + +### Step 5 — Per-line attribution signal + labels +Add `_attribute_squash_origins` (4.8) and the three label edits (4.9). + +**Model:** Opus 4.8 · **Complexity:** High +**Why this model:** Novel evidence signal in the dedupe/priority machinery; correctness of the +`rev=head_sha` re-blame and best-effort error handling is the crux of the feature. +**Execution notes:** Model the function on `_predecessor_blame` (`evidence.py:299-324`) — same +`rev=`-blame call, same per-event `GitError` swallow. Feed its hunks into the existing `labeled_hunks` +list (line 194) with `source="pr-origin"`; the dedupe/priority/cap machinery then needs no change beyond +the `_SOURCE_PRIORITY` entry. Do not add a second blame implementation. +**Verify:** unit test 6; integration tests 8 & 8b; acceptance criteria 4 & 5. + +--- + +## 9. Open Questions for the reviewer + +**None outstanding** — all six review forks are resolved and folded into §0 (scope, attribution, +provider, link storage, `commit_titles` retention, Stage-0 caps, enrichment cadence, enrichment gate, +fetch strategy, enrichment default). The plan is ready for final approval to implement. + +--- + +## 10. Summary + +A squash merge doesn't destroy a feature's history — it relocates it to the provider, and WhyGraph +already ingests most of it (`commit_titles`, `comments`) then drops it at the last serialization step. +**Stage 0** stops dropping it (no schema, immediate win). **Stage 1** fetches the squash PR's original +commits once during the remote scan and stores them as `Commit` rows flagged `on_default_branch=0` — +linked back to their PR through the existing `commit_titles`/`_linked_prs` path (no new table) — so +they enrich evidence without polluting the main-walk queries. **Per-line +attribution** reuses the existing `blame(rev=…)` machinery — blaming the queried range at the PR's +`head_sha` lets git itself map each squashed line to its original commit, mirroring how +predecessor-blame already crosses renames. The expensive, low-fidelity alternative (LLM/graph diff +splitting) is explicitly deferred; every degraded path falls back to today's safe behaviour. diff --git a/src/whygraph/analyze/rationale.py b/src/whygraph/analyze/rationale.py index 9bfbc0e..44324dd 100644 --- a/src/whygraph/analyze/rationale.py +++ b/src/whygraph/analyze/rationale.py @@ -39,6 +39,10 @@ class CommitEvidence: * ``"blame"`` — line-level attribution from the target's current range (highest-precision signal). + * ``"pr-origin"`` — an original feature-branch commit recovered + from a squash-merged PR: when the queried lines blame to a + squash commit, they are re-blamed at the PR's ``head_sha`` so + each line maps back to the commit that actually authored it. * ``"blame-walked"`` — surfaced only after blame walked past a refactor-heavy commit. Still line-level, but one or more boring commits were skipped to reach this author. diff --git a/src/whygraph/analyze/rationale_generator.py b/src/whygraph/analyze/rationale_generator.py index 7cde31b..341e57d 100644 --- a/src/whygraph/analyze/rationale_generator.py +++ b/src/whygraph/analyze/rationale_generator.py @@ -21,6 +21,7 @@ import json from collections.abc import Sequence +from dataclasses import dataclass from whygraph.core.config import RationaleConfig from whygraph.db.models import Commit, Issue, PullRequest @@ -66,6 +67,53 @@ def _labels_suffix(raw: str) -> str: return " [" + ", ".join(str(label) for label in labels) + "]" +def _json_list(raw: str | None) -> list: + """Decode a JSON-encoded list column; empty list on anything malformed. + + Mirrors :func:`whygraph.mcp.evidence._json_list` — kept local so the + formatters do not depend on the MCP layer. + """ + if not raw: + return [] + try: + parsed = json.loads(raw) + except (TypeError, json.JSONDecodeError): + return [] + return parsed if isinstance(parsed, list) else [] + + +@dataclass(frozen=True, slots=True) +class _PrRenderCaps: + """Size caps for rendering a PR block into the rationale prompt. + + Passed explicitly into the module-level formatters (rather than read + from a global :func:`get_config`) so they stay pure and unit-testable. + Defaults mirror :class:`~whygraph.core.config.RationaleConfig`. + + Attributes + ---------- + pr_roster_max_commits : int + Max squashed-commit headlines rendered into a PR block. + pr_discussion_max_comments : int + Max PR comments rendered into a PR block. + pr_comment_max_chars : int + Per-comment body clip. + """ + + pr_roster_max_commits: int = 30 + pr_discussion_max_comments: int = 20 + pr_comment_max_chars: int = 500 + + @classmethod + def from_config(cls, config: RationaleConfig) -> "_PrRenderCaps": + """Project the three rendering caps out of a :class:`RationaleConfig`.""" + return cls( + pr_roster_max_commits=config.pr_roster_max_commits, + pr_discussion_max_comments=config.pr_discussion_max_comments, + pr_comment_max_chars=config.pr_comment_max_chars, + ) + + def _indent_block(text: str, prefix: str) -> str: """Indent every line of ``text`` by ``prefix``.""" return "\n".join(prefix + line for line in text.splitlines()) @@ -90,8 +138,14 @@ def _format_commit(commit: Commit) -> list[str]: return lines -def _format_pr(pr: PullRequest) -> list[str]: - """Render one pull request as the indented lines of an evidence block.""" +def _format_pr(pr: PullRequest, caps: _PrRenderCaps = _PrRenderCaps()) -> list[str]: + """Render one pull request as the indented lines of an evidence block. + + Appends the squashed-commit roster and the PR discussion so the LLM + sees the narrative a squash merge collapsed. Both are clipped by + ``caps`` to bound the prompt size; ``pr.commit_titles`` / ``pr.comments`` + are JSON-encoded list columns decoded via :func:`_json_list`. + """ when = f"merged {pr.merged_at}" if pr.merged_at else pr.state author = f"by {pr.author}" if pr.author else "by unknown" lines = [f" PR #{pr.number} {author} {when}{_labels_suffix(pr.labels)}"] @@ -99,6 +153,23 @@ def _format_pr(pr: PullRequest) -> list[str]: if pr.body and pr.body.strip(): lines.append(" Body:") lines.append(_indent_block(pr.body.strip(), " ")) + titles = _json_list(pr.commit_titles)[: caps.pr_roster_max_commits] + if titles: + lines.append(" Squashed commits:") + for c in titles: + if not isinstance(c, dict): + continue + oid = (c.get("oid") or "")[:9] + lines.append(f" - {c.get('headline', '')} ({oid})") + comments = _json_list(pr.comments)[: caps.pr_discussion_max_comments] + if comments: + lines.append(" Discussion:") + for cm in comments: + if not isinstance(cm, dict): + continue + who = cm.get("author") or "unknown" + body = (cm.get("body") or "").strip()[: caps.pr_comment_max_chars] + lines.append(_indent_block(f"[{who}] {body}", " ")) return lines @@ -118,13 +189,16 @@ def _format_issue(issue: Issue) -> list[str]: _SOURCE_LABELS = { "blame": "line-blame", + "pr-origin": "original commit recovered from a squash-merged PR", "blame-walked": "line-blame (skipped a refactor commit)", "predecessor-blame": "line-blame on a pre-rename predecessor file", "area": "area-history (touched the file but not these lines)", } -def _format_evidence(evidence: Sequence[CommitEvidence]) -> str: +def _format_evidence( + evidence: Sequence[CommitEvidence], caps: _PrRenderCaps = _PrRenderCaps() +) -> str: """Render an evidence bundle as the text payload for the rationale prompt. Commits are formatted in the order given — the caller controls @@ -135,6 +209,9 @@ def _format_evidence(evidence: Sequence[CommitEvidence]) -> str: the row reached this bundle (line-blame, area-history, etc.), so the LLM can weight precision-vs-coverage signals when synthesising the rationale. + + ``caps`` bounds the per-PR roster / discussion rendering (see + :func:`_format_pr`). """ n_prs = sum(len(item.pull_requests) for item in evidence) n_issues = sum(len(item.issues) for item in evidence) @@ -149,7 +226,7 @@ def _format_evidence(evidence: Sequence[CommitEvidence]) -> str: lines.insert(1, f" Source: {label}") for pr in item.pull_requests: lines.append("") - lines.extend(_format_pr(pr)) + lines.extend(_format_pr(pr, caps)) for issue in item.issues: lines.append("") lines.extend(_format_issue(issue)) @@ -331,6 +408,11 @@ class RationaleGenerator: ``task`` should contain the :data:`~whygraph.analyze.prompt.RATIONALE_PLACEHOLDER` token. Mostly used in tests and one-off overrides. + caps : _PrRenderCaps, optional + Size caps for rendering each PR's squashed-commit roster and + discussion into the prompt. ``None`` (default) uses the + :class:`~whygraph.core.config.RationaleConfig` defaults; + :meth:`from_config` projects them from the loaded config. Examples -------- @@ -345,9 +427,11 @@ def __init__( *, timeout_sec: int | None = None, rationale_prompt: Prompt | None = None, + caps: _PrRenderCaps | None = None, ) -> None: self._client = client self._timeout_sec = timeout_sec + self._caps = caps if caps is not None else _PrRenderCaps() self._rationale_prompt = ( rationale_prompt if rationale_prompt is not None @@ -390,7 +474,11 @@ def from_config( """ factory = factory if factory is not None else LlmClientFactory() client = factory.make(config.provider, model=config.model) - return cls(client, timeout_sec=config.timeout_sec) + return cls( + client, + timeout_sec=config.timeout_sec, + caps=_PrRenderCaps.from_config(config), + ) def generate( self, @@ -433,7 +521,7 @@ def generate( # TODO: capping bundle size belongs to the future evidence-bundle # builder — the generator neither truncates nor chunks its input. - bundle = _format_evidence(evidence) + bundle = _format_evidence(evidence, self._caps) if symbol_context is not None: bundle = f"{_format_symbol_context(symbol_context)}\n\n{bundle}" task = render( diff --git a/src/whygraph/cli/commands/scan.py b/src/whygraph/cli/commands/scan.py index 2cc0946..adfc5df 100644 --- a/src/whygraph/cli/commands/scan.py +++ b/src/whygraph/cli/commands/scan.py @@ -12,7 +12,13 @@ from rich.table import Table from rich.text import Text -from whygraph.scan import CodeGraphCrawler, Crawler, GitCrawler, GitHubCrawler +from whygraph.scan import ( + CodeGraphCrawler, + Crawler, + GitCrawler, + GitHubCrawler, + PROriginEnricher, +) from ..console import console @@ -73,11 +79,25 @@ "auto-rescan git hooks (`whygraph hooks install`). Default: on." ), ) +@click.option( + "--pr-origins/--no-pr-origins", + "enrich_pr_origins", + default=True, + help=( + "Recover a squash-merged PR's original feature-branch commits — " + "one targeted `git fetch` of the gated PRs' heads, persisted as " + "`commit` rows flagged off the default-branch walk so they enrich " + "evidence without polluting area-history / refactor-walk. Needs " + "the network, so it always runs in the remote phase and is skipped " + "under `--no-remote`. Default: on." + ), +) def scan_cmd( no_llm_descriptions: bool, refresh_codegraph: bool, codegraph_image: str | None, remote: bool, + enrich_pr_origins: bool, ) -> None: """Run the source crawlers, then describe each commit with the LLM.""" # Lazy-imported so that --help and other lightweight CLI surfaces @@ -121,6 +141,7 @@ def scan_cmd( analyze_skip=analyze_skip, codegraph_enabled=refresh_codegraph, remote_enabled=remote, + pr_origins_enabled=enrich_pr_origins and github_client is not None, ) scan_log_path = db_path.parent / "scan.log" @@ -142,8 +163,11 @@ def scan_cmd( if github_client is not None: phase1.append(GitHubCrawler(progress, client=github_client)) - # Phase 2 — the analyzer, started only once phase 1 has joined - # (it reads the commits phase 1 persisted). + # Phase 2 — started only once phase 1 has joined (it reads the + # commits + PRs phase 1 persisted). The analyzer and the PR-origin + # enricher run concurrently: analyze only touches main-walk commits, + # the enricher only inserts new on_default_branch=0 rows, so they + # never contend over the same commit row. phase2: list[Crawler] = [] if descriptor is not None: phase2.append( @@ -155,6 +179,18 @@ def scan_cmd( large_commit_file_count=config.analyze.large_commit_file_count, ) ) + # The enricher needs PR rows (the GitHub crawler ran) and the + # network for its fetch — so it is gated on a resolved client, which + # is itself None under --no-remote. + if enrich_pr_origins and github_client is not None: + phase2.append( + PROriginEnricher( + progress, + repository=repository, + min_commits=config.analyze.pr_origin_min_commits, + large_commit_file_count=config.analyze.large_commit_file_count, + ) + ) if codegraph_crawler is not None: codegraph_crawler.start() @@ -257,6 +293,7 @@ def _render_scan_panel( analyze_skip: str | None, codegraph_enabled: bool, remote_enabled: bool, + pr_origins_enabled: bool, ) -> None: """Print a summary panel of what the upcoming scan will collect. @@ -316,6 +353,14 @@ def _render_scan_panel( ("LLM descriptions", Text(f"skipped — {analyze_skip}", style="yellow")) ) rows.append(("Worker threads", str(config.scan_max_workers))) + rows.append( + ( + "PR commit recovery", + "recover squash-merged PR commits" + if pr_origins_enabled + else Text("skipped", style="yellow"), + ) + ) grid = Table.grid(padding=(0, 3)) grid.add_column(style="bold cyan", justify="right", no_wrap=True) diff --git a/src/whygraph/core/config.py b/src/whygraph/core/config.py index a5a358c..7b773cd 100644 --- a/src/whygraph/core/config.py +++ b/src/whygraph/core/config.py @@ -243,6 +243,12 @@ class AnalyzeConfig: timeout_sec : int or None Per-call timeout forwarded into :class:`CompletionRequest`. ``None`` (default) defers to the bound adapter's default. + pr_origin_min_commits : int + Commit-rich half of the squash-merge enrichment gate + (:mod:`whygraph.scan.pr_origin_enricher`). A squash-merged PR has + its original feature-branch commits recovered when it collapsed at + least this many commits (the file-bulk half reuses + ``large_commit_file_count``). Must be ``>= 1``. """ provider: str = "anthropic" @@ -250,6 +256,7 @@ class AnalyzeConfig: max_diff_chars: int = 50_000 large_commit_file_count: int = 30 timeout_sec: int | None = None + pr_origin_min_commits: int = 5 @dataclass(frozen=True, slots=True) @@ -277,11 +284,24 @@ class RationaleConfig: timeout_sec : int or None Per-call timeout forwarded into :class:`CompletionRequest`. ``None`` (default) defers to the bound adapter's default. + pr_roster_max_commits : int + Cap on how many squashed-commit headlines are rendered into a + single PR block in the rationale prompt. Bounds the prompt size + when a squash collapsed a long feature branch. Must be ``>= 1``. + pr_discussion_max_comments : int + Cap on how many PR comments are rendered into a single PR block + in the rationale prompt. Must be ``>= 1``. + pr_comment_max_chars : int + Per-comment body clip applied before rendering a PR comment into + the rationale prompt. Must be ``>= 1``. """ provider: str = "anthropic" model: str | None = None timeout_sec: int | None = None + pr_roster_max_commits: int = 30 + pr_discussion_max_comments: int = 20 + pr_comment_max_chars: int = 500 @dataclass(frozen=True, slots=True) @@ -449,9 +469,12 @@ def __post_init__(self) -> None: ConfigError If ``log_level`` is not a known :class:`LogLevel` name, if ``scan_max_workers`` is less than ``1``, if ``scan_provider`` - is not one of ``"off"`` / ``"github"`` / ``"auto"``, or if - ``analyze.max_diff_chars`` or ``analyze.large_commit_file_count`` - is less than ``1``. + is not one of ``"off"`` / ``"github"`` / ``"auto"``, if + ``analyze.max_diff_chars``, ``analyze.large_commit_file_count`` + or ``analyze.pr_origin_min_commits`` is less than ``1``, or if + any of the ``rationale`` PR-rendering caps + (``pr_roster_max_commits``, ``pr_discussion_max_comments``, + ``pr_comment_max_chars``) is less than ``1``. """ try: LogLevel[self.log_level.upper()] @@ -476,6 +499,26 @@ def __post_init__(self) -> None: "analyze.large_commit_file_count must be >= 1, " f"got {self.analyze.large_commit_file_count}" ) + if self.analyze.pr_origin_min_commits < 1: + raise ConfigError( + "analyze.pr_origin_min_commits must be >= 1, " + f"got {self.analyze.pr_origin_min_commits}" + ) + if self.rationale.pr_roster_max_commits < 1: + raise ConfigError( + "rationale.pr_roster_max_commits must be >= 1, " + f"got {self.rationale.pr_roster_max_commits}" + ) + if self.rationale.pr_discussion_max_comments < 1: + raise ConfigError( + "rationale.pr_discussion_max_comments must be >= 1, " + f"got {self.rationale.pr_discussion_max_comments}" + ) + if self.rationale.pr_comment_max_chars < 1: + raise ConfigError( + "rationale.pr_comment_max_chars must be >= 1, " + f"got {self.rationale.pr_comment_max_chars}" + ) @classmethod def from_toml(cls, path: Path) -> Config: diff --git a/src/whygraph/db/migrations/versions/4e231ec6f0e1_add_on_default_branch_to_commit.py b/src/whygraph/db/migrations/versions/4e231ec6f0e1_add_on_default_branch_to_commit.py new file mode 100644 index 0000000..40982b8 --- /dev/null +++ b/src/whygraph/db/migrations/versions/4e231ec6f0e1_add_on_default_branch_to_commit.py @@ -0,0 +1,53 @@ +"""add on_default_branch to commit + +Revision ID: 4e231ec6f0e1 +Revises: a1f7c2e9b4d8 +Create Date: 2026-06-11 16:27:51.724319 + +Adds the ``on_default_branch`` discriminator on ``commit``. Default 1 +means every existing row is treated as a default-branch (first-parent +walk) commit; PR-origin commits recovered from a squash-merged PR are +inserted with 0 so they stay out of the main-walk-only queries +(area-history, refactor-walk). Additive, server-default backfilled — safe +to re-run scans against an upgraded DB. + +(Autogenerate also surfaced pre-existing server-default / index drift on +``commit.refactor_score`` and ``commit_file_change``; those are unrelated +to this change and intentionally omitted to keep the migration additive.) +""" + +from typing import Sequence, Union + +from alembic import op +import sqlalchemy as sa + + +# revision identifiers, used by Alembic. +revision: str = "4e231ec6f0e1" +down_revision: Union[str, Sequence[str], None] = "a1f7c2e9b4d8" +branch_labels: Union[str, Sequence[str], None] = None +depends_on: Union[str, Sequence[str], None] = None + + +def upgrade() -> None: + """Upgrade schema. + + A plain ``add_column`` (native ``ALTER TABLE ADD COLUMN``) rather than a + batch recreate: SQLite supports adding a column with a constant server + default in place, and recreating ``commit`` would trip the foreign key + from ``commit_file_change`` on a populated DB. + """ + op.add_column( + "commit", + sa.Column( + "on_default_branch", + sa.Integer(), + nullable=False, + server_default=sa.text("1"), + ), + ) + + +def downgrade() -> None: + """Downgrade schema.""" + op.drop_column("commit", "on_default_branch") diff --git a/src/whygraph/db/models/commit.py b/src/whygraph/db/models/commit.py index 587b820..5edfc6e 100644 --- a/src/whygraph/db/models/commit.py +++ b/src/whygraph/db/models/commit.py @@ -2,14 +2,25 @@ from __future__ import annotations -from sqlalchemy import Text +from sqlalchemy import Text, text from sqlmodel import Field from whygraph.db.base import WhygraphTable class Commit(WhygraphTable, table=True): - """One row per scanned Git commit (first-parent walk of the default branch).""" + """One row per scanned Git commit (first-parent walk of the default branch). + + Notes + ----- + * ``on_default_branch`` is ``int`` (0/1), not ``bool``, to keep the + declared SQLite affinity as INTEGER (same rationale as + :attr:`whygraph.db.models.PullRequest.draft`). ``1`` (default) marks a + commit on the first-parent main walk; ``0`` marks a PR-origin commit + recovered from a squash-merged PR (see ``scan/pr_origin_enricher.py``) + that must stay out of the main-walk-only queries (area-history, + refactor-walk). + """ sha: str = Field(primary_key=True, nullable=True, sa_type=Text) parent_shas: str = Field(sa_type=Text) @@ -30,3 +41,8 @@ class Commit(WhygraphTable, table=True): # uses it to drive ``git blame --ignore-rev`` walk-past so older # authorship surfaces through commits that would otherwise mask it. refactor_score: int = Field(default=0) + # 0 = PR-origin commit recovered from a squash-merged PR (not on the + # first-parent main walk); 1 = on the default-branch walk (the norm). + on_default_branch: int = Field( + default=1, sa_column_kwargs={"server_default": text("1")} + ) diff --git a/src/whygraph/mcp/evidence.py b/src/whygraph/mcp/evidence.py index 053d8bc..3dc293a 100644 --- a/src/whygraph/mcp/evidence.py +++ b/src/whygraph/mcp/evidence.py @@ -1,8 +1,9 @@ """The ``whygraph_evidence_for`` MCP tool and its evidence collector. -The collector combines four signals: line-blame at HEAD, line-blame +The collector combines five signals: line-blame at HEAD, line-blame after walking past refactor-heavy commits, line-blame against a rename -predecessor at its pre-rename location, and area-history from the +predecessor at its pre-rename location, per-line attribution back to a +squash-merged PR's original commits, and area-history from the ``commit_file_change`` index. Each commit is tagged with a ``source`` label so the rationale generator can weight precision vs coverage. :mod:`whygraph.mcp.rationale` reuses :func:`collect_evidence`; the tool @@ -35,9 +36,12 @@ # Source ordering for dedupe — a SHA that surfaces from multiple paths # is kept with the strongest source label only. ``blame`` beats every -# other label; ``area`` is the weakest. +# other label; ``area`` is the weakest. ``pr-origin`` sits just below +# ``blame`` (0.5): a real authoring commit reached through the squash is +# high-precision, but a direct HEAD blame hit is still preferred. _SOURCE_PRIORITY = { "blame": 0, + "pr-origin": 0.5, "blame-walked": 1, "predecessor-blame": 2, "area": 3, @@ -128,16 +132,20 @@ def collect_evidence(target: Target, *, limit: int = 20) -> list[CommitEvidence] 1. ``"blame"`` — line-level attribution from the target's current range. The primary signal. - 2. ``"blame-walked"`` — line-level attribution surfaced by walking + 2. ``"pr-origin"`` — when a blamed line lands on a squash-merged PR's + commit that Stage 1 enriched, the same range is re-blamed at the + PR's ``head_sha`` so each line maps back to the original + feature-branch commit that authored it. + 3. ``"blame-walked"`` — line-level attribution surfaced by walking past refactor-heavy commits (scored at scan time). Each round runs blame again with ``--ignore-rev`` for every boring SHA seen so far; bounded by :data:`_MAX_BORING_HOPS`. - 3. ``"predecessor-blame"`` — for every rename event in the target + 4. ``"predecessor-blame"`` — for every rename event in the target path's lineage (``commit_file_change`` rows with ``change_type`` ``"R"``), blame the predecessor file at the rename commit's parent so authorship for code that has been moved across files still surfaces. - 4. ``"area"`` — drawn from the ``commit_file_change`` index for the + 5. ``"area"`` — drawn from the ``commit_file_change`` index for the target's path and every rename ancestor. Used to fill the cap when the line-blame signals are thin. @@ -191,14 +199,24 @@ def _collect_evidence_against_db( walked_hunks, _boring = _walk_past_boring(repo, target, initial_hunks=initial) predecessor_hunks = _predecessor_blame(repo, target) - labeled_hunks: list[tuple[BlameHunk, str]] = ( + base_labeled: list[tuple[BlameHunk, str]] = ( [(h, "blame") for h in initial] + [(h, "blame-walked") for h in walked_hunks] + [(h, "predecessor-blame") for h in predecessor_hunks] ) + # SHAs the queried lines actually blamed to — the candidates for being + # an enriched squash's merge_commit_sha. + blame_shas = {h.sha for h, _ in base_labeled if not h.is_uncommitted} by_sha: dict[str, CommitEvidence] = {} with get_session() as session: + # Per-line squash attribution: re-blame at each enriched squash PR's + # head_sha so origin commits surface with the existing dedupe / + # priority / cap machinery (no special-casing past the label). + origin_hunks = _attribute_squash_origins( + repo, target, blame_shas=blame_shas, session=session + ) + labeled_hunks = base_labeled + [(h, "pr-origin") for h in origin_hunks] for hunk, source in labeled_hunks: if hunk.is_uncommitted: continue @@ -284,7 +302,14 @@ def _walk_past_boring( def _boring_shas_in(shas: set[str]) -> set[str]: - """Return the subset of ``shas`` whose ``refactor_score`` is boring.""" + """Return the subset of ``shas`` whose ``refactor_score`` is boring. + + Restricted to default-branch commits (``on_default_branch == 1``): + refactor-walk is a main-walk-only notion, and recovered PR-origin + commits (``0``) must never drive a walk-past — they carry no + ``commit_file_change`` rows and default ``refactor_score=0``, so the + guard is belt-and-suspenders against a future broad consumer. + """ if not shas: return set() with get_session() as session: @@ -292,6 +317,7 @@ def _boring_shas_in(shas: set[str]) -> set[str]: select(Commit.sha) .where(col(Commit.sha).in_(shas)) .where(col(Commit.refactor_score) >= BORING_THRESHOLD) + .where(col(Commit.on_default_branch) == 1) ).all() return set(rows) @@ -324,6 +350,77 @@ def _predecessor_blame(repo: Repository, target: Target) -> list[BlameHunk]: return out +def _enriched_squash_prs_for( + session: Session, blame_shas: set[str] +) -> list[PullRequest]: + """Squash PRs in ``blame_shas`` that have recovered origin commits. + + A PR qualifies when its ``merge_commit_sha`` is one of the blamed SHAs + (so the queried lines land on its squash commit) **and** at least one + of its ``commit_titles`` oids exists in ``commit`` as an origin row + (``on_default_branch == 0``) — i.e. Stage 1 actually enriched it, so + ``head_sha``'s objects are local and a re-blame there will resolve. + This gate-agnostic predicate covers exactly the PRs the enricher + recovered, whether they were gated as file-bulk or commit-rich. + """ + if not blame_shas: + return [] + prs = session.exec( + select(PullRequest).where(col(PullRequest.merge_commit_sha).in_(blame_shas)) + ).all() + enriched: list[PullRequest] = [] + for pr in prs: + oids = [ + entry["oid"] + for entry in _json_list(pr.commit_titles) + if isinstance(entry, dict) and entry.get("oid") + ] + if not oids: + continue + has_origin = session.exec( + select(Commit.sha) + .where(col(Commit.sha).in_(oids)) + .where(col(Commit.on_default_branch) == 0) + .limit(1) + ).first() + if has_origin is not None: + enriched.append(pr) + return enriched + + +def _attribute_squash_origins( + repo: Repository, + target: Target, + *, + blame_shas: set[str], + session: Session, +) -> list[BlameHunk]: + """Re-blame the target range at each enriched squash PR's ``head_sha``. + + When a blamed line lands on a squash ``merge_commit_sha``, the squash + tree equals the PR's ``head_sha`` tree for that file, so blaming the + same range at ``head_sha`` maps each line back to the original + feature-branch commit that authored it — the same mechanism + predecessor-blame uses with ``rev=parent_sha``. Best-effort: a + squash-vs-head mismatch or absent object skips that PR (mirrors + :func:`_predecessor_blame`'s per-event ``GitError`` swallow), leaving + the PR-level Stage 1 evidence untouched. + """ + out: list[BlameHunk] = [] + for pr in _enriched_squash_prs_for(session, blame_shas): + try: + hunks = repo.blame( + target.path, + target.line_start, + target.line_end, + rev=pr.head_sha, + ) + except GitError: + continue + out.extend(h for h in hunks if not h.is_uncommitted) + return out + + def _rename_events_for(path: str) -> list[tuple[str, str]]: """Return ``(rename_commit_sha, predecessor_path)`` for every rename in path's lineage.""" # Lazy import: path_history reuses _linked_prs / _linked_issues from @@ -463,7 +560,16 @@ def _commit_dict(commit: Commit) -> dict: def _pr_dict(pr: PullRequest) -> dict: - """Serialize a pull request to a JSON-ready dict.""" + """Serialize a pull request to a JSON-ready dict. + + Notes + ----- + ``commit_titles`` and ``comments`` are emitted **uncapped**: the + consumer of ``whygraph_evidence_for`` is an agent that can handle the + full lists. The size caps in :class:`RationaleConfig` apply only to the + LLM *rationale prompt* (see ``analyze.rationale_generator._format_pr``), + not to this raw tool output. + """ return { "number": pr.number, "title": pr.title, @@ -473,6 +579,8 @@ def _pr_dict(pr: PullRequest) -> dict: "author": pr.author, "html_url": pr.html_url, "labels": _json_list(pr.labels), + "commit_titles": _json_list(pr.commit_titles), + "comments": _json_list(pr.comments), } diff --git a/src/whygraph/mcp/path_history.py b/src/whygraph/mcp/path_history.py index e1a6471..bbe96ff 100644 --- a/src/whygraph/mcp/path_history.py +++ b/src/whygraph/mcp/path_history.py @@ -119,6 +119,11 @@ def area_history_commits( col(CommitFileChange.commit_sha) == col(Commit.sha), ) .where(col(CommitFileChange.path).in_(aliases)) + # Area-history is a main-walk-only view. Recovered PR-origin + # commits (on_default_branch=0) carry no commit_file_change + # rows so the join already excludes them; this makes the + # invariant explicit for a future broad consumer. + .where(col(Commit.on_default_branch) == 1) ) if exclude_shas: stmt = stmt.where(col(Commit.sha).not_in(exclude_shas)) diff --git a/src/whygraph/scan/__init__.py b/src/whygraph/scan/__init__.py index 5a45dfd..8860960 100644 --- a/src/whygraph/scan/__init__.py +++ b/src/whygraph/scan/__init__.py @@ -3,11 +3,12 @@ Exposes :class:`Crawler` (the threaded base class) and the concrete crawlers — :class:`GitCrawler` for local git history, :class:`GitHubCrawler` for GitHub pull requests and issues, -:class:`CodeGraphCrawler` which refreshes the CodeGraph index, and +:class:`CodeGraphCrawler` which refreshes the CodeGraph index, :class:`AnalyzeCrawler` which describes each commit's diff with an LLM -(run after :class:`GitCrawler`). The CLI runs the source crawlers (and -CodeGraph) concurrently, then the analyzer, against the shared SQLite -database. +(run after :class:`GitCrawler`), and :class:`PROriginEnricher` which +recovers a squash-merged PR's original commits. The CLI runs the source +crawlers (and CodeGraph) concurrently, then the analyzer and enricher, +against the shared SQLite database. """ from whygraph.scan.analyze_crawler import AnalyzeCrawler @@ -15,6 +16,7 @@ from whygraph.scan.crawler import Crawler from whygraph.scan.git_crawler import GitCrawler from whygraph.scan.github_crawler import GitHubCrawler +from whygraph.scan.pr_origin_enricher import PROriginEnricher __all__ = [ "AnalyzeCrawler", @@ -22,4 +24,5 @@ "Crawler", "GitCrawler", "GitHubCrawler", + "PROriginEnricher", ] diff --git a/src/whygraph/scan/pr_origin_enricher.py b/src/whygraph/scan/pr_origin_enricher.py new file mode 100644 index 0000000..37c31e2 --- /dev/null +++ b/src/whygraph/scan/pr_origin_enricher.py @@ -0,0 +1,248 @@ +"""PROriginEnricher — recover a squash-merged PR's original commits. + +A squash merge collapses a feature branch into one commit on the default +branch, discarding the per-commit history WhyGraph would otherwise index. +GitHub still retains the originals under ``refs/pull//head``, so this +crawler — run after :class:`~whygraph.scan.github_crawler.GitHubCrawler`, +once PR rows exist — fetches them once during the remote scan and persists +them as ``commit`` rows flagged ``on_default_branch=0`` so they enrich +evidence without leaking into the main-walk-only queries (area-history, +refactor-walk). + +Only the squashes that actually lost history are enriched (the *balanced +gate*, see :func:`_select_candidates`), and the fetch is one targeted +batched ``git fetch`` carrying only the gated candidates' refspecs — never +the ``refs/pull/*`` wildcard. No PR↔commit link row is written: the +association is already carried by the PR's ``commit_titles`` and resolved +at query time by ``mcp/evidence.py:_linked_prs``. + +The recovered commits' diffs and LLM descriptions stay lazy — this crawler +writes only the ``commit`` row (full message + stats from ``git log``), +leaving ``llm_description`` ``NULL`` for the on-read backfill to fill. +""" + +from __future__ import annotations + +import json +import logging +from dataclasses import dataclass +from datetime import datetime, timezone + +from rich.progress import Progress +from sqlmodel import col, select + +from whygraph.db import get_session +from whygraph.db.models.commit import Commit as CommitRow +from whygraph.db.models.pull_request import PullRequest +from whygraph.services.git import GitError, Repository +from whygraph.services.git.commit import Commit as CommitDC + +from .crawler import Crawler + +_log = logging.getLogger(__name__) + +# Pin each candidate PR's head under our own ref namespace so the fetched +# objects survive local GC and later blame/diff stay offline. The source +# ``refs/pull//head`` is GitHub's server-side immutable PR ref. +_PULL_REFSPEC = "refs/pull/{number}/head:refs/whygraph/pull/{number}" + + +def _json_list(raw: str | None) -> list: + """Decode a JSON-encoded list column; empty list on anything malformed. + + Mirrors ``mcp/evidence.py:_json_list`` — duplicated rather than + imported to keep the scan layer free of an upward dependency on the + MCP server layer. + """ + if not raw: + return [] + try: + parsed = json.loads(raw) + except (TypeError, json.JSONDecodeError): + return [] + return parsed if isinstance(parsed, list) else [] + + +@dataclass(frozen=True, slots=True) +class _Candidate: + """A gated squash PR and the subset of its oids not yet in ``commit``. + + Attributes + ---------- + number : int + The PR number, used to build its ``refs/pull//head`` refspec. + new_oids : list[str] + The PR's ``commit_titles`` oids that are absent from the ``commit`` + table — exactly the rows this run will materialize. + """ + + number: int + new_oids: list[str] + + +def _select_candidates( + session, *, min_commits: int, large_commit_file_count: int +) -> list[_Candidate]: + """Pick the squash-merged PRs whose original commits should be recovered. + + Applies the *balanced gate* (plan §3.3 / §3.5): a merged PR qualifies + when its ``commit_titles`` oids are **absent** from the ``commit`` + table (squash detection — the originals are not on the main walk) + **and** either loss signal fires: + + - **file-bulk** — its ``merge_commit_sha`` touched more than + ``large_commit_file_count`` files (the squash that lost the most + *description* fidelity); **or** + - **commit-rich** — it collapsed at least ``min_commits`` commits (lost + *narrative*, even if few files changed). + + A PR whose oids are already on the main walk (a plain merge / rebase) + is skipped — the normal path already indexed it. + + Parameters + ---------- + session : sqlmodel.Session + Open session to read PR / commit rows through. + min_commits : int + Commit-rich threshold (``analyze.pr_origin_min_commits``). + large_commit_file_count : int + File-bulk threshold (``analyze.large_commit_file_count``). + + Returns + ------- + list[_Candidate] + One entry per gated PR, carrying the oids still to insert. + """ + existing: set[str] = set(session.exec(select(CommitRow.sha)).all()) + candidates: list[_Candidate] = [] + merged = session.exec( + select(PullRequest).where(col(PullRequest.merged_at).is_not(None)) + ).all() + for pr in merged: + oids = [ + c["oid"] + for c in _json_list(pr.commit_titles) + if isinstance(c, dict) and c.get("oid") + ] + if not oids or all(o in existing for o in oids): + continue # not a squash — originals already on the main walk + squash = ( + session.get(CommitRow, pr.merge_commit_sha) if pr.merge_commit_sha else None + ) + file_bulk = bool(squash and squash.files_changed > large_commit_file_count) + if not (file_bulk or len(oids) >= min_commits): + continue # below both halves of the balanced gate + candidates.append( + _Candidate( + number=pr.number, new_oids=[o for o in oids if o not in existing] + ) + ) + return candidates + + +def _to_origin_row(dc: CommitDC, *, scanned_at: str) -> CommitRow: + """Build an ``on_default_branch=0`` commit row from a git value object. + + Mirrors ``git_crawler._to_row`` but flags the row as a recovered + PR-origin commit and leaves ``refactor_score`` at its default — origin + commits carry no ``commit_file_change`` rows, so the refactor-walk + never reaches them regardless. + """ + return CommitRow( + sha=dc.sha, + parent_shas=" ".join(dc.parent_shas), + author_name=dc.author_name, + author_email=dc.author_email, + authored_at=dc.authored_at, + committed_at=dc.committed_at, + subject=dc.subject, + body=dc.body, + files_changed=dc.stats.files_changed, + insertions=dc.stats.insertions, + deletions=dc.stats.deletions, + scanned_at=scanned_at, + on_default_branch=0, + ) + + +class PROriginEnricher(Crawler): + """Recover and persist squash-merged PRs' original feature-branch commits. + + Runs after the GitHub crawler (so PR rows exist) and the git crawler + (so the squash commits and the main walk exist). Selects the gated + squash PRs (:func:`_select_candidates`), fetches their PR-head refs in + one batched ``git fetch``, then inserts one ``commit`` row per recovered + oid with ``on_default_branch=0``. Re-scans are idempotent: oids already + in ``commit`` are excluded at selection time, so a second run with no + new squashes does nothing. + + Best-effort by design (plan §6.6): a failed fetch (GC'd ref, no + network) or an unreadable single oid is logged and skipped rather than + failing the scan — the PR keeps its existing PR-level evidence. + + Parameters + ---------- + progress : rich.progress.Progress + Shared Progress instance owned by the orchestrator. + repository : Repository + The git repository to fetch and read commit metadata through. + min_commits : int + Commit-rich half of the gate (``analyze.pr_origin_min_commits``). + large_commit_file_count : int + File-bulk half of the gate (``analyze.large_commit_file_count``). + """ + + def __init__( + self, + progress: Progress, + *, + repository: Repository, + min_commits: int, + large_commit_file_count: int, + ) -> None: + super().__init__("pr-origins", progress, total=None) + self._repository = repository + self._min_commits = min_commits + self._large_commit_file_count = large_commit_file_count + + def work(self) -> None: + with get_session() as session: + candidates = _select_candidates( + session, + min_commits=self._min_commits, + large_commit_file_count=self._large_commit_file_count, + ) + self.set_total(len(candidates)) + if not candidates: + return + + # ONE batched fetch — only the gated candidates' refs, never the + # refs/pull/* wildcard. A failure here degrades the whole phase + # gracefully rather than aborting the scan. + refspecs = [_PULL_REFSPEC.format(number=c.number) for c in candidates] + try: + self._repository.fetch_refs(refspecs) + except GitError as exc: + _log.warning( + "pr-origin fetch failed; skipping enrichment for %d PR(s): %s", + len(candidates), + exc, + ) + return + + scanned_at = datetime.now(timezone.utc).isoformat() + inserted: set[str] = set() + for cand in candidates: + for oid in cand.new_oids: + # A commit shared across stacked / backport PRs appears + # in two candidates' new_oids; insert it once. + if oid in inserted: + continue + try: + dc = self._repository.commit_metadata(oid) + except GitError as exc: + _log.warning("skipping origin commit %s: %s", oid[:9], exc) + continue + session.add(_to_origin_row(dc, scanned_at=scanned_at)) + inserted.add(oid) + self.advance(1) diff --git a/src/whygraph/services/git/commands.py b/src/whygraph/services/git/commands.py index 4483210..3436039 100644 --- a/src/whygraph/services/git/commands.py +++ b/src/whygraph/services/git/commands.py @@ -247,6 +247,74 @@ def parse(self, result: CompletedProcess[str]) -> tuple[FileChange, ...]: return FileChange.from_diff_tree(result.stdout) +class GitFetchRefsCmd(ShellCommand[None]): + """``git fetch `` — fetch one or more refspecs in one call. + + Carries every refspec in a single ``git fetch`` invocation so the + squash-origin enricher pins all its candidate PR refs with one network + round-trip rather than one fetch per PR. The parser returns ``None`` — + the command is run for its effect (objects + refs land in the local + object store), not its stdout. + + Parameters + ---------- + refspecs : tuple[str, ...] + One or more ``:`` refspecs, e.g. + ``"refs/pull/12/head:refs/whygraph/pull/12"``. Passed as separate + argv tokens, so no shell quoting is involved. + remote : str, optional + Name of the remote to fetch from. Default ``"origin"``. + """ + + def __init__(self, *refspecs: str, remote: str = "origin") -> None: + self.refspecs = refspecs + self.remote = remote + + def argv(self) -> list[str]: + return ["git", "fetch", "--no-tags", self.remote, *self.refspecs] + + def parse(self, result: CompletedProcess[str]) -> None: + return None + + +class GitLogCommitCmd(ShellCommand[Commit]): + """``git log -1 --shortstat --pretty=format:Commit.LOG_FORMAT ``. + + Reads a *single* commit's full metadata (message body + diff stats) + and parses it via :meth:`Commit.from_git_log`. The ``-1`` is load + bearing: ``git log `` without it walks ``ref`` and all its + ancestors, so the cap restricts output to just ``ref``. + + Used by the squash-origin enricher to materialize a recovered PR + commit from its oid once the object is local, reusing the same parser + as the main history walk rather than inventing a second one. + + Parameters + ---------- + ref : str + Any commit-ish ``git`` accepts (typically a full oid). + """ + + def __init__(self, ref: str) -> None: + self.ref = ref + + def argv(self) -> list[str]: + return [ + "git", + "log", + "-1", + f"--pretty=format:{Commit.LOG_FORMAT}", + "--shortstat", + self.ref, + ] + + def parse(self, result: CompletedProcess[str]) -> Commit: + for block in result.stdout.split("\x1e"): + if block.strip(): + return Commit.from_git_log(block) + raise ValueError(f"no commit record in git log output for {self.ref!r}") + + class GitLogShortstatCmd(ShellCommand[Iterator[Commit]]): """``git log --shortstat --pretty=format:Commit.LOG_FORMAT ``. diff --git a/src/whygraph/services/git/repository.py b/src/whygraph/services/git/repository.py index c593a3f..164c6a4 100644 --- a/src/whygraph/services/git/repository.py +++ b/src/whygraph/services/git/repository.py @@ -13,6 +13,8 @@ GitCurrentBranchCmd, GitDiffCmd, GitDiffTreeFileChangesCmd, + GitFetchRefsCmd, + GitLogCommitCmd, GitRemoteUrlCmd, ) from .commit import Commit @@ -324,3 +326,70 @@ def diff_range(self, base: str, head: str) -> str: return self._shell.run(GitDiffCmd(f"{base}..{head}"), cwd=self.root) except ShellError as exc: raise GitError(f"failed to diff {base[:7]}..{head[:7]}") from exc + + def fetch_refs(self, refspecs: list[str], *, remote: str | None = None) -> None: + """Fetch one or more refspecs from ``remote`` in a single round-trip. + + The only network-touching method on :class:`Repository`. It writes + refs and objects into the local store (not the working tree), so + the squash-origin enricher can pin a batch of PR refs + (``refs/pull//head:refs/whygraph/pull/``) against local GC + with one ``git fetch`` rather than one per PR — keeping later + blame/diff offline. ``--no-tags`` keeps the fetch from dragging in + the remote's tag set. + + Parameters + ---------- + refspecs : list[str] + ``:`` refspecs to fetch. An empty list is a no-op + (no ``git`` is invoked). + remote : str or None, optional + Remote to fetch from. ``None`` (default) uses the remote this + repository was constructed with (``origin_remote``). + + Raises + ------ + GitError + If ``git fetch`` fails — an unknown remote, a missing/GC'd + source ref, or no network. Callers that treat enrichment as + best-effort should catch this. + """ + if not refspecs: + return + target = remote or self._origin_remote + try: + self._shell.run(GitFetchRefsCmd(*refspecs, remote=target), cwd=self.root) + except ShellError as exc: + raise GitError( + f"failed to fetch {len(refspecs)} refspec(s) from {target!r}" + ) from exc + + def commit_metadata(self, ref: str) -> Commit: + """Full :class:`Commit` value object for a single commit-ish. + + Reads one commit's message body and diff stats via + ``git log -1 --shortstat`` and parses it with the same + :meth:`Commit.from_git_log` the main history walk uses. The object + must already be local (the enricher calls :meth:`fetch_refs` + first); an unknown ``ref`` surfaces as :class:`GitError`. + + Parameters + ---------- + ref : str + Any commit-ish ``git`` accepts (typically a full oid). + + Returns + ------- + Commit + The parsed commit value object. + + Raises + ------ + GitError + If ``git`` fails — most commonly an unknown ``ref`` whose + object is not in the local store. + """ + try: + return self._shell.run(GitLogCommitCmd(ref), cwd=self.root) + except ShellError as exc: + raise GitError(f"failed to read commit metadata for {ref[:7]}") from exc diff --git a/tests/test_analyze_rationale_generator.py b/tests/test_analyze_rationale_generator.py index 8a3fe55..e060145 100644 --- a/tests/test_analyze_rationale_generator.py +++ b/tests/test_analyze_rationale_generator.py @@ -23,7 +23,9 @@ RationaleGenerator, ) from whygraph.analyze.rationale_generator import ( + _PrRenderCaps, _format_evidence, + _format_pr, _format_symbol_context, ) from whygraph.core.config import RationaleConfig @@ -122,6 +124,8 @@ def _pr( number: int = 12, title: str = "Cache prompts", body: str | None = "Speeds up scans.", + commit_titles: str = "[]", + comments: str = "[]", ) -> PullRequest: """A :class:`PullRequest` row with sensible defaults for tests.""" return PullRequest( @@ -137,6 +141,8 @@ def _pr( author="octocat", html_url="https://example.com/pr/12", labels='["perf"]', + commit_titles=commit_titles, + comments=comments, fetched_at="2026-05-02T00:00:00Z", ) @@ -490,6 +496,109 @@ def test_format_evidence_renders_source_label_per_commit() -> None: assert "area-history" in bundle +# ---- _format_pr roster + discussion -------------------------------------- + + +def _commit_titles(*headlines: str) -> str: + """JSON-encode a commit_titles roster the way the github crawler stores it.""" + return json.dumps( + [ + {"oid": f"{i:040x}", "headline": h, "author_name": "Jane"} + for i, h in enumerate(headlines, start=1) + ] + ) + + +def _comments(*pairs: tuple[str, str]) -> str: + """JSON-encode a PR comments list ([{author, body, created_at}, ...]).""" + return json.dumps( + [ + {"author": author, "body": body, "created_at": "2026-05-01T00:00:00Z"} + for author, body in pairs + ] + ) + + +def test_format_pr_renders_roster_and_discussion() -> None: + lines = _format_pr( + _pr( + commit_titles=_commit_titles("first commit", "second commit"), + comments=_comments(("alice", "looks good"), ("bob", "ship it")), + ) + ) + text = "\n".join(lines) + + assert "Squashed commits:" in text + assert "first commit" in text + assert "second commit" in text + assert "Discussion:" in text + assert "[alice] looks good" in text + assert "[bob] ship it" in text + + +def test_format_pr_empty_roster_and_discussion_add_no_lines() -> None: + lines = _format_pr(_pr()) # default commit_titles="[]", comments="[]" + text = "\n".join(lines) + + assert "Squashed commits:" not in text + assert "Discussion:" not in text + + +def test_format_pr_respects_caps() -> None: + caps = _PrRenderCaps( + pr_roster_max_commits=1, + pr_discussion_max_comments=1, + pr_comment_max_chars=4, + ) + lines = _format_pr( + _pr( + commit_titles=_commit_titles("kept commit", "dropped commit"), + comments=_comments(("alice", "hello world"), ("bob", "dropped")), + ), + caps, + ) + text = "\n".join(lines) + + assert "kept commit" in text + assert "dropped commit" not in text + assert "[alice] hell" in text # clipped to 4 chars + assert "hello world" not in text + assert "[bob]" not in text # second comment dropped by the cap + + +def test_format_pr_tolerates_malformed_json() -> None: + lines = _format_pr(_pr(commit_titles="not json", comments="{}")) + text = "\n".join(lines) + + assert "Squashed commits:" not in text + assert "Discussion:" not in text + + +# ---- RationaleConfig PR-rendering caps ----------------------------------- + + +def test_rationale_config_cap_defaults() -> None: + cfg = RationaleConfig() + + assert cfg.pr_roster_max_commits == 30 + assert cfg.pr_discussion_max_comments == 20 + assert cfg.pr_comment_max_chars == 500 + + +def test_pr_render_caps_from_config_projects_fields() -> None: + caps = _PrRenderCaps.from_config( + RationaleConfig( + pr_roster_max_commits=3, + pr_discussion_max_comments=2, + pr_comment_max_chars=10, + ) + ) + + assert caps.pr_roster_max_commits == 3 + assert caps.pr_discussion_max_comments == 2 + assert caps.pr_comment_max_chars == 10 + + # ---- _format_symbol_context --------------------------------------------- diff --git a/tests/test_core_config_analyze.py b/tests/test_core_config_analyze.py index 4e3801d..1a32f74 100644 --- a/tests/test_core_config_analyze.py +++ b/tests/test_core_config_analyze.py @@ -28,6 +28,7 @@ def test_analyze_defaults_when_section_omitted(tmp_path: Path) -> None: assert cfg.analyze.provider == "anthropic" assert cfg.analyze.max_diff_chars == 50_000 assert cfg.analyze.large_commit_file_count == 30 + assert cfg.analyze.pr_origin_min_commits == 5 assert cfg.analyze.timeout_sec is None @@ -99,6 +100,25 @@ def test_analyze_invalid_large_commit_file_count_raises(tmp_path: Path) -> None: Config.from_toml(config) +def test_analyze_pr_origin_min_commits_parsed(tmp_path: Path) -> None: + config = _write( + tmp_path / "whygraph.toml", + "[analyze]\npr_origin_min_commits = 12\n", + ) + cfg = Config.from_toml(config) + + assert cfg.analyze.pr_origin_min_commits == 12 + + +def test_analyze_invalid_pr_origin_min_commits_raises(tmp_path: Path) -> None: + config = _write( + tmp_path / "whygraph.toml", + "[analyze]\npr_origin_min_commits = 0\n", + ) + with pytest.raises(ConfigError, match="pr_origin_min_commits"): + Config.from_toml(config) + + def test_analyze_config_is_frozen() -> None: cfg = AnalyzeConfig() with pytest.raises(Exception): # FrozenInstanceError diff --git a/tests/test_core_config_rationale.py b/tests/test_core_config_rationale.py index 6571022..915ed05 100644 --- a/tests/test_core_config_rationale.py +++ b/tests/test_core_config_rationale.py @@ -13,7 +13,7 @@ import pytest -from whygraph.core.config import Config, RationaleConfig +from whygraph.core.config import Config, ConfigError, RationaleConfig def _write(path: Path, body: str) -> Path: @@ -71,6 +71,49 @@ def test_rationale_unknown_key_warns_but_loads( assert cfg.rationale.provider == "anthropic" +def test_rationale_pr_render_cap_defaults(tmp_path: Path) -> None: + config = _write(tmp_path / "whygraph.toml", "") + cfg = Config.from_toml(config) + + assert cfg.rationale.pr_roster_max_commits == 30 + assert cfg.rationale.pr_discussion_max_comments == 20 + assert cfg.rationale.pr_comment_max_chars == 500 + + +def test_rationale_pr_render_caps_parsed(tmp_path: Path) -> None: + config = _write( + tmp_path / "whygraph.toml", + "[rationale]\n" + "pr_roster_max_commits = 5\n" + "pr_discussion_max_comments = 3\n" + "pr_comment_max_chars = 120\n", + ) + cfg = Config.from_toml(config) + + assert cfg.rationale.pr_roster_max_commits == 5 + assert cfg.rationale.pr_discussion_max_comments == 3 + assert cfg.rationale.pr_comment_max_chars == 120 + + +@pytest.mark.parametrize( + "field_name", + [ + "pr_roster_max_commits", + "pr_discussion_max_comments", + "pr_comment_max_chars", + ], +) +def test_rationale_pr_render_cap_below_one_raises( + tmp_path: Path, field_name: str +) -> None: + config = _write( + tmp_path / "whygraph.toml", + f"[rationale]\n{field_name} = 0\n", + ) + with pytest.raises(ConfigError, match=field_name): + Config.from_toml(config) + + def test_rationale_config_is_frozen() -> None: cfg = RationaleConfig() with pytest.raises(Exception): # FrozenInstanceError diff --git a/tests/test_db_plumbing.py b/tests/test_db_plumbing.py index 5568333..be91576 100644 --- a/tests/test_db_plumbing.py +++ b/tests/test_db_plumbing.py @@ -77,6 +77,59 @@ def test_alembic_upgrade_on_empty_db(_isolate_config_and_engine: Path) -> None: assert _table_names(db_path) == SQLMODEL_TABLES | {"alembic_version"} +def _insert_commit(sha: str, **overrides: object) -> None: + """Insert one minimal ``commit`` row, applying any field overrides.""" + from whygraph.db import get_session + from whygraph.db.models import Commit + + fields: dict[str, object] = dict( + sha=sha, + parent_shas="[]", + author_name="Jane", + author_email="jane@example.com", + authored_at="2026-01-01T00:00:00Z", + committed_at="2026-01-01T00:00:00Z", + subject="s", + body="", + files_changed=1, + insertions=1, + deletions=0, + scanned_at="2026-01-02T00:00:00Z", + ) + fields.update(overrides) + with get_session() as session: + session.add(Commit(**fields)) + session.commit() + + +def test_commit_on_default_branch_default_and_explicit( + _isolate_config_and_engine: Path, +) -> None: + """The column defaults to 1; an explicit 0 (PR-origin commit) persists.""" + from sqlmodel import select + + from whygraph.db import get_session + from whygraph.db.models import Commit + + command.upgrade(alembic_config(), "head") + _insert_commit("default_sha") # no on_default_branch → server default + _insert_commit("origin_sha", on_default_branch=0) + + with get_session() as session: + # Read scalars inside the session: the default value is server-set, so + # accessing it on a detached row would trigger a refresh load and raise + # DetachedInstanceError. + default_on_main = session.get(Commit, "default_sha").on_default_branch + origin_on_main = session.get(Commit, "origin_sha").on_default_branch + on_main = set( + session.exec(select(Commit.sha).where(Commit.on_default_branch == 1)).all() + ) + + assert default_on_main == 1 + assert origin_on_main == 0 + assert on_main == {"default_sha"} + + def test_connect_pragmas_applied(_isolate_config_and_engine: Path) -> None: """The connect listener enables WAL + a busy timeout for concurrent writers.""" engine = db_engine.get_engine() diff --git a/tests/test_mcp_evidence.py b/tests/test_mcp_evidence.py index 058cd32..c013a91 100644 --- a/tests/test_mcp_evidence.py +++ b/tests/test_mcp_evidence.py @@ -15,7 +15,7 @@ from whygraph.db import get_session from whygraph.db.models import Commit, CommitFileChange, Issue, PRIssueLink, PullRequest from whygraph.mcp.errors import WhyGraphError -from whygraph.mcp.evidence import whygraph_evidence_for +from whygraph.mcp.evidence import _pr_dict, whygraph_evidence_for from whygraph.scan.refactor_score import BORING_THRESHOLD from whygraph.services.git import Repository @@ -73,6 +73,33 @@ def _db_pr(*, number: int, merge_commit_sha: str) -> PullRequest: ) +# ---- _pr_dict serialization --------------------------------------------- + + +def test_pr_dict_includes_decoded_commit_titles_and_comments() -> None: + pr = _db_pr(number=1, merge_commit_sha="squashsha") + pr.commit_titles = '[{"oid": "abc123", "headline": "first", "author_name": "Jane"}]' + pr.comments = '[{"author": "alice", "body": "lgtm", "created_at": "x"}]' + + out = _pr_dict(pr) + + assert out["commit_titles"] == [ + {"oid": "abc123", "headline": "first", "author_name": "Jane"} + ] + assert out["comments"] == [{"author": "alice", "body": "lgtm", "created_at": "x"}] + + +def test_pr_dict_malformed_commit_titles_and_comments_yield_empty_lists() -> None: + pr = _db_pr(number=1, merge_commit_sha="squashsha") + pr.commit_titles = "not json" + pr.comments = "{}" # an object, not a list + + out = _pr_dict(pr) + + assert out["commit_titles"] == [] + assert out["comments"] == [] + + def _db_issue(*, number: int) -> Issue: """A WhyGraph ``issue`` row for tests.""" return Issue( diff --git a/tests/test_mcp_evidence_pr_origin.py b/tests/test_mcp_evidence_pr_origin.py new file mode 100644 index 0000000..8258f74 --- /dev/null +++ b/tests/test_mcp_evidence_pr_origin.py @@ -0,0 +1,291 @@ +"""Per-line squash attribution (Stage 2 of the squash-merge recovery plan). + +Builds a real squash scenario on disk: a feature branch whose commits +authored ``sample.py``'s lines, collapsed into one squash commit on +``main``. With the original commits recovered as ``on_default_branch=0`` +rows and linked through a PR's ``commit_titles``, +``whygraph_evidence_for`` must re-blame the queried lines at the PR's +``head_sha`` and surface each original commit as ``source="pr-origin"``. +""" + +from __future__ import annotations + +import subprocess +from pathlib import Path + +import pytest + +from whygraph.analyze import CommitEvidence +from whygraph.db import get_session +from whygraph.db.models import Commit, PullRequest +from whygraph.mcp.evidence import ( + _attribute_squash_origins, + _should_replace, + whygraph_evidence_for, +) +from whygraph.mcp.targets import Target +from whygraph.services.git import Repository + + +def _git(repo: Path, *args: str) -> str: + return subprocess.run( + ["git", "-C", str(repo), *args], + check=True, + capture_output=True, + text=True, + ).stdout + + +def _origin_row(sha: str, *, subject: str, committed_at: str) -> Commit: + """A recovered PR-origin commit row (on_default_branch=0).""" + return Commit( + sha=sha, + parent_shas="", + author_name="Feature Dev", + author_email="dev@example.com", + authored_at=committed_at, + committed_at=committed_at, + subject=subject, + body="", + files_changed=1, + insertions=1, + deletions=0, + scanned_at="2026-05-01T00:00:00+00:00", + on_default_branch=0, + ) + + +def _squash_row(sha: str, *, committed_at: str) -> Commit: + """The squash commit on the main walk (small — not file-bulk, so this + fixture also guards the §4.8 'commit-rich but not file-bulk' trigger).""" + return Commit( + sha=sha, + parent_shas="", + author_name="Maintainer", + author_email="maint@example.com", + authored_at=committed_at, + committed_at=committed_at, + subject="Squash-merge feature", + body="", + files_changed=1, + insertions=3, + deletions=0, + scanned_at="2026-05-01T00:00:00+00:00", + on_default_branch=1, + ) + + +def _build_squash_repo(root: Path) -> dict[str, str]: + """A repo whose ``main`` HEAD is a squash of a 2-commit feature branch. + + ``sample.py`` ends as three lines: feat1 wrote lines 1-2, feat2 added + line 3; the squash commit on ``main`` reproduces all three at once. + The feature tip is pinned under ``refs/whygraph/pull/1`` (and its + branch deleted) to mirror what the enricher leaves behind. + + Returns a dict of ``squash`` / ``feat1`` / ``feat2`` → sha. + """ + root.mkdir(parents=True, exist_ok=True) + _git(root, "init", "-q", "-b", "main") + _git(root, "config", "user.email", "test@example.com") + _git(root, "config", "user.name", "Test User") + _git(root, "config", "commit.gpgsign", "false") + + # Base commit on main — no sample.py yet. + (root / "README.md").write_text("base\n") + _git(root, "add", "README.md") + _git(root, "commit", "-q", "-m", "base") + + # Feature branch: two commits authoring sample.py line by line. + _git(root, "checkout", "-q", "-b", "feature") + (root / "sample.py").write_text("line one\nline two\n") + _git(root, "add", "sample.py") + _git(root, "commit", "-q", "-m", "feat: first two lines") + feat1 = _git(root, "rev-parse", "HEAD").strip() + (root / "sample.py").write_text("line one\nline two\nline three\n") + _git(root, "add", "sample.py") + _git(root, "commit", "-q", "-m", "feat: third line") + feat2 = _git(root, "rev-parse", "HEAD").strip() + + # Squash commit on main: the same final sample.py in one commit. + _git(root, "checkout", "-q", "main") + (root / "sample.py").write_text("line one\nline two\nline three\n") + _git(root, "add", "sample.py") + _git(root, "commit", "-q", "-m", "Squash-merge feature") + squash = _git(root, "rev-parse", "HEAD").strip() + + # Pin the feature tip the way the enricher does, then drop the branch. + _git(root, "update-ref", "refs/whygraph/pull/1", feat2) + _git(root, "branch", "-q", "-D", "feature") + + return {"squash": squash, "feat1": feat1, "feat2": feat2} + + +def _seed_squash_pr(shas: dict[str, str]) -> None: + with get_session() as session: + session.add( + _squash_row(shas["squash"], committed_at="2026-04-01T00:00:00+00:00") + ) + session.add( + _origin_row( + shas["feat1"], + subject="feat: first two lines", + committed_at="2026-03-01T00:00:00+00:00", + ) + ) + session.add( + _origin_row( + shas["feat2"], + subject="feat: third line", + committed_at="2026-03-02T00:00:00+00:00", + ) + ) + session.add( + PullRequest( + number=1, + title="Feature", + state="MERGED", + created_at="2026-03-01T00:00:00+00:00", + updated_at="2026-04-01T00:00:00+00:00", + merged_at="2026-04-01T00:00:00+00:00", + merge_commit_sha=shas["squash"], + head_sha=shas["feat2"], + base_ref="main", + html_url="https://example.test/pr/1", + labels="[]", + fetched_at="2026-04-02T00:00:00+00:00", + commit_titles=( + f'[{{"oid": "{shas["feat1"]}", "headline": "feat: first two lines"}},' + f' {{"oid": "{shas["feat2"]}", "headline": "feat: third line"}}]' + ), + ) + ) + + +def test_evidence_attributes_squash_lines_to_origin_commits( + tmp_path: Path, + whygraph_db_initialized: Path, + monkeypatch: pytest.MonkeyPatch, +) -> None: + """A line owned by the squash commit yields pr-origin entries whose + commits match ``git blame `` of the same range.""" + shas = _build_squash_repo(tmp_path / "repo") + _seed_squash_pr(shas) + + monkeypatch.chdir(tmp_path / "repo") + result = whygraph_evidence_for(path="sample.py", line_start=1, line_end=3) + + by_source: dict[str, set[str]] = {} + for item in result["evidence"]: + by_source.setdefault(item["source"], set()).add(item["commit"]["sha"]) + + # The squash commit itself surfaces via direct HEAD blame... + assert shas["squash"] in by_source.get("blame", set()) + # ...and the two originals surface via per-line squash attribution. + assert by_source.get("pr-origin") == {shas["feat1"], shas["feat2"]} + + # Cross-check against git blame at head_sha directly. + blame = Repository(tmp_path / "repo").blame("sample.py", 1, 3, rev=shas["feat2"]) + blamed = {h.sha for h in blame if not h.is_uncommitted} + assert blamed == {shas["feat1"], shas["feat2"]} + + +def test_attribute_squash_origins_degrades_on_bad_head_sha( + tmp_path: Path, + whygraph_db_initialized: Path, +) -> None: + """An unresolvable head_sha (GC'd ref) is swallowed — no error, no + hunks — rather than failing the evidence collection.""" + shas = _build_squash_repo(tmp_path / "repo") + with get_session() as session: + session.add( + _squash_row(shas["squash"], committed_at="2026-04-01T00:00:00+00:00") + ) + session.add( + _origin_row( + shas["feat1"], subject="x", committed_at="2026-03-01T00:00:00+00:00" + ) + ) + session.add( + PullRequest( + number=2, + title="Feature", + state="MERGED", + created_at="2026-03-01T00:00:00+00:00", + updated_at="2026-04-01T00:00:00+00:00", + merged_at="2026-04-01T00:00:00+00:00", + merge_commit_sha=shas["squash"], + head_sha="0" * 40, # not a real object + base_ref="main", + html_url="https://example.test/pr/2", + labels="[]", + fetched_at="2026-04-02T00:00:00+00:00", + commit_titles=f'[{{"oid": "{shas["feat1"]}", "headline": "x"}}]', + ) + ) + + repo = Repository(tmp_path / "repo") + target = Target(path="sample.py", line_start=1, line_end=3, qualified_name=None) + with get_session() as session: + hunks = _attribute_squash_origins( + repo, target, blame_shas={shas["squash"]}, session=session + ) + assert hunks == [] + + +def test_unenriched_squash_yields_no_pr_origin( + tmp_path: Path, + whygraph_db_initialized: Path, +) -> None: + """A squash PR with no recovered origin rows is not attributed (the + trigger requires >= 1 on_default_branch=0 commit).""" + shas = _build_squash_repo(tmp_path / "repo") + with get_session() as session: + session.add( + _squash_row(shas["squash"], committed_at="2026-04-01T00:00:00+00:00") + ) + # No origin rows inserted for this PR. + session.add( + PullRequest( + number=3, + title="Feature", + state="MERGED", + created_at="2026-03-01T00:00:00+00:00", + updated_at="2026-04-01T00:00:00+00:00", + merged_at="2026-04-01T00:00:00+00:00", + merge_commit_sha=shas["squash"], + head_sha=shas["feat2"], + base_ref="main", + html_url="https://example.test/pr/3", + labels="[]", + fetched_at="2026-04-02T00:00:00+00:00", + commit_titles=f'[{{"oid": "{shas["feat1"]}", "headline": "x"}}]', + ) + ) + + repo = Repository(tmp_path / "repo") + target = Target(path="sample.py", line_start=1, line_end=3, qualified_name=None) + with get_session() as session: + hunks = _attribute_squash_origins( + repo, target, blame_shas={shas["squash"]}, session=session + ) + assert hunks == [] + + +# ---- source priority (plan unit test 6) ------------------------------------- + + +def _ev(sha: str, source: str) -> CommitEvidence: + return CommitEvidence( + _origin_row(sha, subject="x", committed_at="x"), source=source + ) + + +def test_pr_origin_beats_area_but_loses_to_blame() -> None: + # pr-origin supersedes area / blame-walked / predecessor-blame... + assert _should_replace(_ev("s", "area"), "pr-origin") is True + assert _should_replace(_ev("s", "blame-walked"), "pr-origin") is True + # ...but a direct HEAD blame hit wins over pr-origin. + assert _should_replace(_ev("s", "blame"), "pr-origin") is False + # And pr-origin is not displaced by the weaker labels. + assert _should_replace(_ev("s", "pr-origin"), "area") is False diff --git a/tests/test_scan_pr_origin_enricher.py b/tests/test_scan_pr_origin_enricher.py new file mode 100644 index 0000000..793ff4c --- /dev/null +++ b/tests/test_scan_pr_origin_enricher.py @@ -0,0 +1,355 @@ +"""Tests for :mod:`whygraph.scan.pr_origin_enricher` (squash-merge recovery). + +Covers the balanced enrichment gate, the end-to-end ``work()`` loop with a +stubbed repository (so no network / real git is needed), and the §4.10 +``on_default_branch`` query guards that keep recovered origin commits out +of the main-walk-only queries. +""" + +from __future__ import annotations + +import json +from pathlib import Path + +from rich.progress import Progress + +from whygraph.db import get_session +from whygraph.db.models import Commit, CommitFileChange, PullRequest +from whygraph.scan.pr_origin_enricher import ( + PROriginEnricher, + _select_candidates, +) +from whygraph.services.git.commit import Commit as CommitDC +from whygraph.services.git.commit import DiffStats + + +def _commit(sha: str, *, files_changed: int = 1, on_default_branch: int = 1) -> Commit: + return Commit( + sha=sha, + parent_shas="", + author_name="Test User", + author_email="tester@example.com", + authored_at="2026-01-01T00:00:00+00:00", + committed_at="2026-01-01T00:00:00+00:00", + subject=sha, + body="", + files_changed=files_changed, + insertions=1, + deletions=0, + scanned_at="2026-01-02T00:00:00+00:00", + on_default_branch=on_default_branch, + ) + + +def _pr( + number: int, + *, + oids: list[str], + merge_commit_sha: str | None = None, + merged: bool = True, +) -> PullRequest: + return PullRequest( + number=number, + title=f"PR {number}", + state="MERGED" if merged else "OPEN", + created_at="2026-01-01T00:00:00+00:00", + updated_at="2026-01-01T00:00:00+00:00", + merged_at="2026-01-02T00:00:00+00:00" if merged else None, + merge_commit_sha=merge_commit_sha, + head_sha=f"head{number}", + base_ref="main", + html_url=f"https://example.test/pr/{number}", + labels="[]", + fetched_at="2026-01-02T00:00:00+00:00", + commit_titles=json.dumps([{"oid": o, "headline": o} for o in oids]), + ) + + +def _dc(oid: str, *, files_changed: int = 2) -> CommitDC: + return CommitDC( + sha=oid, + parent_shas=(), + author_name="Feature Dev", + author_email="dev@example.com", + authored_at="2025-12-01T00:00:00+00:00", + committed_at="2025-12-01T00:00:00+00:00", + subject=f"original {oid}", + body="full body", + stats=DiffStats(files_changed=files_changed, insertions=3, deletions=1), + ) + + +class _StubRepo: + """Duck-typed stand-in for :class:`Repository` — no network, no git. + + Records the refspecs passed to :meth:`fetch_refs` and serves + :meth:`commit_metadata` from a preloaded oid → :class:`CommitDC` map. + An oid absent from the map raises, mirroring an unknown/GC'd object. + """ + + def __init__(self, metadata: dict[str, CommitDC]) -> None: + self._metadata = metadata + self.fetched: list[list[str]] = [] + + def fetch_refs(self, refspecs: list[str], *, remote: str | None = None) -> None: + self.fetched.append(list(refspecs)) + + def commit_metadata(self, ref: str) -> CommitDC: + from whygraph.services.git import GitError + + try: + return self._metadata[ref] + except KeyError as exc: # pragma: no cover - defensive + raise GitError(f"unknown commit {ref}") from exc + + +# --- balanced gate (plan §3.3 / unit test 4) --------------------------------- + + +def test_gate_enriches_file_bulk_squash(whygraph_db_initialized: Path) -> None: + """Squash (oids absent) + file-bulk merge commit → enrich.""" + with get_session() as session: + # The squash commit IS on the main walk and is file-bulk (> threshold). + session.add(_commit("squash1", files_changed=40)) + session.add(_pr(1, oids=["o1", "o2"], merge_commit_sha="squash1")) + session.commit() + candidates = _select_candidates( + session, min_commits=5, large_commit_file_count=30 + ) + + assert len(candidates) == 1 + assert candidates[0].number == 1 + assert sorted(candidates[0].new_oids) == ["o1", "o2"] + + +def test_gate_enriches_commit_rich_squash(whygraph_db_initialized: Path) -> None: + """Squash + many collapsed commits (>= min_commits) → enrich, even when + the merge commit touched few files.""" + with get_session() as session: + session.add(_commit("squash2", files_changed=2)) # not file-bulk + session.add( + _pr(2, oids=[f"c{i}" for i in range(5)], merge_commit_sha="squash2") + ) + session.commit() + candidates = _select_candidates( + session, min_commits=5, large_commit_file_count=30 + ) + + assert len(candidates) == 1 + assert candidates[0].number == 2 + + +def test_gate_skips_below_both_thresholds(whygraph_db_initialized: Path) -> None: + """Squash but small (few files, few commits) → skip.""" + with get_session() as session: + session.add(_commit("squash3", files_changed=2)) + session.add(_pr(3, oids=["x1", "x2"], merge_commit_sha="squash3")) + session.commit() + candidates = _select_candidates( + session, min_commits=5, large_commit_file_count=30 + ) + + assert candidates == [] + + +def test_gate_skips_merged_as_own_commits(whygraph_db_initialized: Path) -> None: + """A plain merge / rebase — the PR's oids are already on the main walk — + is skipped regardless of size (no history was lost).""" + with get_session() as session: + # All of the PR's oids are present in `commit` on the main walk. + for oid in (f"m{i}" for i in range(8)): + session.add(_commit(oid, files_changed=1)) + session.add(_commit("merge4", files_changed=50)) + session.add(_pr(4, oids=[f"m{i}" for i in range(8)], merge_commit_sha="merge4")) + session.commit() + candidates = _select_candidates( + session, min_commits=5, large_commit_file_count=30 + ) + + assert candidates == [] + + +def test_gate_ignores_unmerged_prs(whygraph_db_initialized: Path) -> None: + """An open PR is never a candidate even if it is large.""" + with get_session() as session: + session.add(_commit("tip5", files_changed=40)) + session.add( + _pr( + 5, oids=[f"u{i}" for i in range(9)], merge_commit_sha=None, merged=False + ) + ) + session.commit() + candidates = _select_candidates( + session, min_commits=5, large_commit_file_count=30 + ) + + assert candidates == [] + + +# --- end-to-end work() (plan integration test 7) ----------------------------- + + +def test_work_inserts_origin_rows_and_fetches_only_candidates( + whygraph_db_initialized: Path, +) -> None: + """One batched fetch of only the candidate refspec; origin rows land + with on_default_branch=0 and full git metadata.""" + with get_session() as session: + session.add(_commit("squash6", files_changed=40)) + session.add(_pr(6, oids=["a6", "b6"], merge_commit_sha="squash6")) + session.commit() + + repo = _StubRepo({"a6": _dc("a6"), "b6": _dc("b6")}) + enricher = PROriginEnricher( + Progress(), repository=repo, min_commits=5, large_commit_file_count=30 + ) + enricher.run() + + assert enricher.error is None + # Exactly one fetch, carrying only PR #6's targeted refspec (no wildcard). + assert repo.fetched == [["refs/pull/6/head:refs/whygraph/pull/6"]] + + with get_session() as session: + a6 = session.get(Commit, "a6") + b6 = session.get(Commit, "b6") + a6_flag = a6.on_default_branch + a6_body = a6.body + a6_files = a6.files_changed + b6_flag = b6.on_default_branch + squash_flag = session.get(Commit, "squash6").on_default_branch + + assert a6_flag == 0 and b6_flag == 0 + assert a6_body == "full body" # full message from git, not just headline + assert a6_files == 2 + assert squash_flag == 1 # the squash commit stays on the main walk + + +def test_work_no_candidates_makes_no_fetch(whygraph_db_initialized: Path) -> None: + """When nothing is gated, the enricher never touches the network.""" + with get_session() as session: + session.add(_commit("squash7", files_changed=2)) + session.add(_pr(7, oids=["s1", "s2"], merge_commit_sha="squash7")) + session.commit() + + repo = _StubRepo({}) + enricher = PROriginEnricher( + Progress(), repository=repo, min_commits=5, large_commit_file_count=30 + ) + enricher.run() + + assert enricher.error is None + assert repo.fetched == [] + + +def test_work_dedups_oid_shared_across_prs(whygraph_db_initialized: Path) -> None: + """An oid attached to two squash PRs is inserted exactly once (no PK + collision) — the many-to-many edge §4.3 relies on.""" + with get_session() as session: + session.add(_commit("squashA", files_changed=40)) + session.add(_commit("squashB", files_changed=40)) + session.add(_pr(8, oids=["shared", "onlyA"], merge_commit_sha="squashA")) + session.add(_pr(9, oids=["shared", "onlyB"], merge_commit_sha="squashB")) + session.commit() + + repo = _StubRepo( + {"shared": _dc("shared"), "onlyA": _dc("onlyA"), "onlyB": _dc("onlyB")} + ) + enricher = PROriginEnricher( + Progress(), repository=repo, min_commits=5, large_commit_file_count=30 + ) + enricher.run() + + assert enricher.error is None + with get_session() as session: + from sqlmodel import func, select + + shared_count = session.exec( + select(func.count(Commit.sha)).where(Commit.sha == "shared") + ).one() + assert shared_count == 1 + + +def test_work_skips_scan_when_fetch_fails(whygraph_db_initialized: Path) -> None: + """A fetch failure (GC'd ref / no network) degrades gracefully — the + scan does not error and no origin rows are written.""" + from whygraph.services.git import GitError + + class _FailingRepo(_StubRepo): + def fetch_refs(self, refspecs, *, remote=None): # type: ignore[override] + raise GitError("network down") + + with get_session() as session: + session.add(_commit("squashF", files_changed=40)) + session.add(_pr(10, oids=["f1", "f2"], merge_commit_sha="squashF")) + session.commit() + + repo = _FailingRepo({"f1": _dc("f1"), "f2": _dc("f2")}) + enricher = PROriginEnricher( + Progress(), repository=repo, min_commits=5, large_commit_file_count=30 + ) + enricher.run() + + assert enricher.error is None + with get_session() as session: + assert session.get(Commit, "f1") is None + assert session.get(Commit, "f2") is None + + +# --- §4.10 main-walk-only query guards (unit test 5) ------------------------- + + +def test_boring_shas_excludes_origin_commits(whygraph_db_initialized: Path) -> None: + """``_boring_shas_in`` ignores on_default_branch=0 rows even when their + refactor_score crosses the boring threshold.""" + from whygraph.mcp.evidence import BORING_THRESHOLD, _boring_shas_in + + with get_session() as session: + main = _commit("boring_main", on_default_branch=1) + main.refactor_score = BORING_THRESHOLD + 5 + origin = _commit("boring_origin", on_default_branch=0) + origin.refactor_score = BORING_THRESHOLD + 5 + session.add(main) + session.add(origin) + session.commit() + + result = _boring_shas_in({"boring_main", "boring_origin"}) + assert result == {"boring_main"} + + +def test_area_history_excludes_origin_commits(whygraph_db_initialized: Path) -> None: + """area-history excludes on_default_branch=0 commits even if a + commit_file_change row points at the queried path (defensive guard).""" + from whygraph.mcp.path_history import area_history_commits + + with get_session() as session: + session.add(_commit("area_main", on_default_branch=1)) + session.add(_commit("area_origin", on_default_branch=0)) + session.add( + CommitFileChange( + commit_sha="area_main", + path="src/x.py", + change_type="M", + renamed_from=None, + similarity=None, + lines_added=1, + lines_deleted=0, + ) + ) + # Artificial: origin commits normally get no file-change rows; this + # forces the join to reach one so the explicit guard is exercised. + session.add( + CommitFileChange( + commit_sha="area_origin", + path="src/x.py", + change_type="M", + renamed_from=None, + similarity=None, + lines_added=1, + lines_deleted=0, + ) + ) + session.commit() + + items = area_history_commits("src/x.py", include_renames=False) + shas = {item.commit.sha for item in items} + assert shas == {"area_main"} diff --git a/tests/test_services_git_fetch_metadata.py b/tests/test_services_git_fetch_metadata.py new file mode 100644 index 0000000..6b4bf1f --- /dev/null +++ b/tests/test_services_git_fetch_metadata.py @@ -0,0 +1,108 @@ +"""Tests for :meth:`Repository.fetch_refs` and :meth:`Repository.commit_metadata`. + +Both run the real ``git`` binary against repos on disk (same pattern as +``test_services_git_diff.py``). ``fetch_refs`` is exercised against a local +bare "remote" so no network is involved; ``commit_metadata`` parses a single +commit's full message + stats. +""" + +from __future__ import annotations + +import subprocess +from pathlib import Path + +import pytest + +from whygraph.services.git import GitError, Repository + + +def _git(cwd: Path, *args: str) -> str: + return subprocess.run( + ["git", "-C", str(cwd), *args], + check=True, + capture_output=True, + text=True, + ).stdout + + +def _init(root: Path) -> None: + root.mkdir(parents=True, exist_ok=True) + _git(root, "init", "-q", "-b", "main") + _git(root, "config", "user.email", "test@example.com") + _git(root, "config", "user.name", "Test User") + _git(root, "config", "commit.gpgsign", "false") + + +def test_commit_metadata_reads_single_commit_body_and_stats(tmp_path: Path) -> None: + root = tmp_path / "repo" + _init(root) + (root / "a.txt").write_text("one\ntwo\n") + _git(root, "add", "a.txt") + _git(root, "commit", "-q", "-m", "first") + (root / "a.txt").write_text("one\ntwo\nthree\n") + _git(root, "add", "a.txt") + _git(root, "commit", "-q", "-m", "second subject\n\nA longer body line.") + sha = _git(root, "rev-parse", "HEAD").strip() + + commit = Repository(root).commit_metadata(sha) + + assert commit.sha == sha + assert commit.subject == "second subject" + assert "A longer body line." in commit.body + # Only the single commit's stats — not its ancestors'. + assert commit.stats.files_changed == 1 + assert commit.stats.insertions == 1 + + +def test_commit_metadata_unknown_ref_raises_git_error(tmp_path: Path) -> None: + root = tmp_path / "repo" + _init(root) + (root / "a.txt").write_text("x\n") + _git(root, "add", "a.txt") + _git(root, "commit", "-q", "-m", "only") + + with pytest.raises(GitError): + Repository(root).commit_metadata("0" * 40) + + +def test_fetch_refs_brings_remote_commit_local(tmp_path: Path) -> None: + """A refspec fetch pins a commit that is not reachable from any local + branch under our own ref namespace, so it becomes readable offline.""" + remote = tmp_path / "remote" + _init(remote) + (remote / "f.txt").write_text("remote work\n") + _git(remote, "add", "f.txt") + _git(remote, "commit", "-q", "-m", "remote commit") + remote_sha = _git(remote, "rev-parse", "HEAD").strip() + # Park the commit under a non-branch ref the clone won't pull by default. + _git(remote, "update-ref", "refs/pull/1/head", remote_sha) + + local = tmp_path / "local" + _init(local) + (local / "g.txt").write_text("local\n") + _git(local, "add", "g.txt") + _git(local, "commit", "-q", "-m", "local commit") + _git(local, "remote", "add", "origin", str(remote)) + + repo = Repository(local) + # Before the fetch the object is absent — metadata read fails. + with pytest.raises(GitError): + repo.commit_metadata(remote_sha) + + repo.fetch_refs(["refs/pull/1/head:refs/whygraph/pull/1"]) + + # Now it resolves, and the local ref lives in our namespace. + fetched = repo.commit_metadata(remote_sha) + assert fetched.sha == remote_sha + assert fetched.subject == "remote commit" + assert _git(local, "rev-parse", "refs/whygraph/pull/1").strip() == remote_sha + + +def test_fetch_refs_empty_is_noop(tmp_path: Path) -> None: + root = tmp_path / "repo" + _init(root) + (root / "a.txt").write_text("x\n") + _git(root, "add", "a.txt") + _git(root, "commit", "-q", "-m", "only") + # No remote configured: an empty refspec list must not invoke git at all. + Repository(root).fetch_refs([]) diff --git a/whygraph.example.toml b/whygraph.example.toml index 21904dc..906da0d 100644 --- a/whygraph.example.toml +++ b/whygraph.example.toml @@ -28,6 +28,8 @@ provider = "anthropic" # which [llm.*] adapter to use # large_commit_file_count = 30 # commits touching more files than this are described # per-file on demand (not whole-diff) — keeps a bulk # import / squash commit from costing a repo-wide pass +# pr_origin_min_commits = 5 # recover a squash-merged PR's original commits once it + # collapsed >= this many (file-bulk squashes recover too) # timeout_sec = 60 # per-call timeout; default: the adapter's own [rationale] @@ -35,6 +37,11 @@ provider = "anthropic" # which [llm.*] adapter to use provider = "anthropic" # which [llm.*] adapter to use # model = "claude-haiku-4-5" # override the provider's model for rationale only # timeout_sec = 60 # per-call timeout; default: the adapter's own +# Caps on how much of a squash-merged PR is rendered into the rationale prompt +# (the raw whygraph_evidence_for tool output is uncapped): +# pr_roster_max_commits = 30 # squashed-commit headlines shown per PR +# pr_discussion_max_comments = 20 # PR comments shown per PR +# pr_comment_max_chars = 500 # each PR comment clipped to this length # Uncomment to override default DB locations (resolved relative to this file): # whygraph_db = ".whygraph/whygraph.db"