feat: scrape quality gates — classify outcome and roll back unusable indexes#439

Draft
walterfrey wants to merge 15 commits into arabold:main from NEXUZ-SYS:upstream-pr/scrape-quality-gates
@walterfrey

feat: scrape quality gates — classify outcome and roll back unusable indexes

Problem

Today a scrape job is marked completed as soon as the scraper returns without
throwing (PipelineManager success path). There is no quality gate: the server
conflates "the crawl finished" with "I indexed usable documentation". This produces
three failure modes that all report completed, verified against the real
google/generative-ai-docs repository:

| # | Failure mode | Verified root cause | Today |
|---|---|---|---|
| FM-1 | Empty / hostile host (SPA, ?hl= locale redirect) | No locale normalization; a JS-gated or redirect-walled page yields 0 docs with no signal. Redirect cap exists but there is no cyclic-Location detection and no locale stripping. | completed |
| FM-2 | Wrong content: crawler indexes an irrelevant subtree | A repo-root scrape legitimately includes demos/ as descendants. Measured: 679 of 1045 tree entries are under demos/, drowning the ~50 real .md docs. Path scope can't help (demos are descendants of the root). | completed (non-empty!) |
| FM-3 | Silent no-op: GitHub /tree/<branch>/<subPath> indexes nothing | The GitHub tree API returns HTTP 200 with the entire tree regardless of subPath; the subPath is filtered client-side. A nonexistent/mistyped/wrong-case subPath matches 0 files silently → only the wiki link is queued (404) → 0 docs → completed in ~0.4s. | completed |

Empirical evidence (real repo, not truncated, 1045 entries):

/tree/main/docs                     -> 0 files   (dir doesn't exist)
/tree/main/gemini-api/docs          -> 0 files   (wrong path)
/tree/main/Site/en                  -> 0 files   (wrong case)
/tree/main/site/en/gemini-api/docs  -> 8 files   (correct path)
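
To make the FM-3 mechanics concrete, here is a minimal sketch of a client-side subPath filter with a zero-match guard. It is illustrative only and not the code in this PR: `TreeEntry`, `SubPathNotFoundError`, and `filterBySubPath` are hypothetical names, and the case-sensitive prefix match is an assumption inferred from the `Site/en` result above.

```typescript
// Hypothetical shape of one entry in a repository tree listing.
interface TreeEntry {
  path: string;
  type: "blob" | "tree";
}

// Hypothetical error type carrying a machine-readable code.
class SubPathNotFoundError extends Error {
  readonly code = "GITHUB_SUBPATH_NOT_FOUND";
  constructor(subPath: string, topLevelDirs: string[]) {
    super(
      `subPath "${subPath}" matched 0 files. ` +
        `Top-level directories: ${topLevelDirs.join(", ")}`,
    );
    this.name = "SubPathNotFoundError";
  }
}

// Keep only files under subPath (assumed case-sensitive prefix match).
// A zero-match result throws instead of silently returning an empty list.
function filterBySubPath(entries: TreeEntry[], subPath: string): TreeEntry[] {
  const blobs = entries.filter((e) => e.type === "blob");
  const normalized = subPath.replace(/^\/+|\/+$/g, "");
  if (normalized === "") return blobs; // no subPath: whole repository

  const prefix = normalized + "/";
  const files = blobs.filter((e) => e.path.startsWith(prefix));
  if (files.length === 0) {
    // List real top-level directories so the caller can correct the path.
    const topLevelDirs = entries
      .filter((e) => e.type === "tree" && !e.path.includes("/"))
      .map((e) => e.path)
      .sort();
    throw new SubPathNotFoundError(normalized, topLevelDirs);
  }
  return files;
}
```

With a guard like this, a wrong-case path such as `Site/en` fails fast with the valid top-level directories in the message, rather than completing with 0 docs.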

Solution

Promote a (library, version) to searchable only if it passes quality gates.
Otherwise, discard what was indexed and return a typed error code with remediation.

  • Outcome classification: new ScrapeOutcome enum (indexed | empty | thin | degenerate | failed) computed at job end from counters that already exist.
  • Gate-then-rollback (hard-fail): at the single seam where the manager would mark
    COMPLETED, a failing verdict calls the existing removeVersion() to discard the
    staged docs and marks the job FAILED with a typed errorCode. No half-indexed
    garbage stays searchable.
  • FM-1: localeStrategy (pin-en default) sends Accept-Language: en and strips
    hl/lang/locale query params; cyclic-Location detection within the existing
    redirect cap raises LOCALE_REDIRECT_LOOP.
  • FM-2: denyPaths (default demos/examples) trims irrelevant subtrees; an
    opt-in relevance gate (expectTerms) samples indexed chunks and flags OFF_TOPIC,
    and a path/host coherence check flags SCOPE_DRIFT.
  • FM-3: a /tree/ subPath matching zero files raises GITHUB_SUBPATH_NOT_FOUND,
    whose message lists the repo's real top-level directories.
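
A rough sketch of how classification and gate-then-rollback fit together, for readers who want the control flow at a glance. This is not the PR's implementation: `classify`, `applyGate`, `Metrics`, `Job`, `VersionStore`, and the `minDocuments` knob are hypothetical, and the store call is shown as synchronous purely to keep the example short.

```typescript
// Outcome values and error codes are taken from this PR description.
type ScrapeOutcome = "indexed" | "empty" | "thin" | "degenerate" | "failed";
type ScrapeErrorCode = "EMPTY_RESULT" | "THIN_RESULT";

// Hypothetical shapes, for illustration only.
interface Metrics {
  documentCount: number;
}
interface Verdict {
  outcome: ScrapeOutcome;
  errorCode?: ScrapeErrorCode;
}
interface Job {
  library: string;
  version: string;
  isRefresh: boolean;
  status: "RUNNING" | "COMPLETED" | "FAILED";
  outcome?: ScrapeOutcome;
  errorCode?: ScrapeErrorCode;
}
interface VersionStore {
  // Assumed synchronous here; a real store call is likely async.
  removeVersion(library: string, version: string): void;
}

// Pure classifier. minDocuments is an assumed knob; with the default of 1,
// only a 0-document scrape fails.
function classify(m: Metrics, minDocuments = 1): Verdict {
  if (m.documentCount === 0) return { outcome: "empty", errorCode: "EMPTY_RESULT" };
  if (m.documentCount < minDocuments) return { outcome: "thin", errorCode: "THIN_RESULT" };
  return { outcome: "indexed" };
}

// The seam: runs where the job would otherwise be marked COMPLETED.
function applyGate(job: Job, m: Metrics, store: VersionStore): void {
  if (job.isRefresh) {
    // Refresh jobs skip the gate so a good existing version is never deleted.
    job.status = "COMPLETED";
    return;
  }
  const verdict = classify(m);
  job.outcome = verdict.outcome;
  if (verdict.errorCode !== undefined) {
    store.removeVersion(job.library, job.version); // roll back staged docs
    job.status = "FAILED";
    job.errorCode = verdict.errorCode;
    return;
  }
  job.status = "COMPLETED";
}
```

Keeping the classifier pure and the rollback in a single seam means the verdict can be unit-tested without a store, and the destructive call happens in exactly one place.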

Error codes (surfaced in get_job_info / list_jobs)

EMPTY_RESULT, THIN_RESULT, OFF_TOPIC, SCOPE_DRIFT, LOCALE_REDIRECT_LOOP,
GITHUB_SUBPATH_NOT_FOUND.

Backward compatibility

All public changes are additive:

  • New optional scrape_docs params: expectTerms, denyPaths, localeStrategy
    (bounded in the zod schema: ≤50 items, ≤200 chars each).
  • New optional fields on ScraperOptions, FetchOptions, JobInfo, PipelineJob.
  • Conservative defaults: only a 0-document scrape is hard-failed by default; the
    relevance axes (OFF_TOPIC/SCOPE_DRIFT) are opt-in via expectTerms.
  • Refresh / clean=false jobs skip the gate, so a transient thin re-index can never
    delete a pre-existing good version.
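
The list bounds above (≤50 items, ≤200 chars each) could be checked roughly as follows. The PR states the bounds live in the zod schema; this sketch deliberately uses plain TypeScript instead so it has no dependencies, and `checkBoundedList` is a hypothetical helper, not something in the codebase.

```typescript
// Bounds quoted from the PR description.
const MAX_ITEMS = 50;
const MAX_CHARS = 200;

// Validate an optional list parameter such as expectTerms or denyPaths.
// Returns undefined when the parameter is omitted (it is optional).
function checkBoundedList(name: string, value: unknown): string[] | undefined {
  if (value === undefined) return undefined;
  if (!Array.isArray(value)) {
    throw new TypeError(`${name} must be an array of strings`);
  }
  if (value.length > MAX_ITEMS) {
    throw new RangeError(`${name} accepts at most ${MAX_ITEMS} items`);
  }
  return value.map((item: unknown, i: number): string => {
    if (typeof item !== "string") {
      throw new TypeError(`${name}[${i}] must be a string`);
    }
    if (item.length > MAX_CHARS) {
      throw new RangeError(`${name}[${i}] exceeds ${MAX_CHARS} characters`);
    }
    return item;
  });
}
```

Because omitted parameters pass through as undefined, existing callers that never send these fields are unaffected.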

Implementation walkthrough (one commit per step)

  • M1: ScrapeOutcome/ScrapeErrorCode/JobOutcomeMetrics types; pure
    evaluateOutcome() classifier; getVersionMetrics() store accessor.
  • M2: applyQualityGate() seam in PipelineManager + gate-then-rollback; E2E
    asserting a failed gate leaves the store empty.
  • M3: real GitHub tree fixture; GITHUB_SUBPATH_NOT_FOUND guard; locale
    normalization + LOCALE_REDIRECT_LOOP.
  • M4: denyPaths in GitHub + web crawl; relevanceGate (toScopeKey ref-agnostic
    scope, expectTerms sampling); wiring into the gate (opt-in).
  • M5: expose outcome/errorCode in job tools; plumb new params through
    scrape_docs; docs.
Tests

New deterministic coverage (no network): outcomeGate.test.ts, relevanceGate.test.ts,
quality-gate-e2e.test.ts, plus additions to PipelineManager, HttpFetcher,
GitHubScraperStrategy, DocumentManagementService, GetJobInfoTool, ScrapeTool tests.
FM-2/FM-3 use a committed snapshot of the real google/generative-ai-docs tree.
npm test, npm run lint, and npm run typecheck are all green, excluding a pre-existing
locale-dependent cli-e2e case and an environment-dependent vector-persistence case,
both of which fail identically on main.

Out of scope (intentionally deferred)

  • discover_source(library) tool.
  • A real staging-version with pointer swap (this PR uses gate-then-rollback).
  • Typed signal + pagination for truncated GitHub trees (still logger.warn).

walterfrey added a commit to NEXUZ-SYS/docs-mcp-server that referenced this pull request Jun 13, 2026
Promote a scraped (library, version) to searchable only when it indexed
usable docs. Adds a ScrapeOutcome classifier + gate-then-rollback hard-fail
at the pipeline seam, fixing three failure modes that previously reported
"completed": empty/hostile host (FM-1, locale pin-en + LOCALE_REDIRECT_LOOP),
off-topic subtree (FM-2, denyPaths + opt-in expectTerms/scope relevance),
and silent no-op on a nonexistent GitHub subpath (FM-3, GITHUB_SUBPATH_NOT_FOUND).
Additive API (expectTerms/denyPaths/localeStrategy + outcome/errorCode);
refresh/clean=false skip the gate to avoid data loss.

Upstream PR (clean branch): arabold#439

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
arabold requested a review from Copilot on June 28, 2026 01:45

Copilot AI left a comment

Pull request overview

This PR adds “scrape quality gates” to ensure a (library, version) becomes searchable only when the scrape produced usable documentation, and otherwise fails the job with a typed errorCode and rolls back the staged index to avoid leaving garbage searchable.

Changes:

  • Introduces ScrapeOutcome / ScrapeErrorCode and a pure evaluateOutcome() classifier, plus relevance helpers (computeInScopeRatio, sampleExpectTermsMatch).
  • Adds a quality-gate seam in PipelineManager that can gate-then-rollback via removeVersion(), and plumbs outcome/errorCode through job tools.
  • Adds opt-in relevance gating (expectTerms), default path denylists (denyPaths), and locale handling (localeStrategy) with tests and docs.
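
For intuition, the expectTerms check can be thought of as a hit ratio over sampled chunks. The sketch below is an assumption about the general shape only: it does not reproduce the signature of sampleExpectTermsMatch, and `expectTermsMatchRatio`, `isOffTopic`, and the 0.2 threshold are hypothetical.

```typescript
// Fraction of sampled chunks that mention at least one expected term
// (case-insensitive substring match).
function expectTermsMatchRatio(chunks: string[], expectTerms: string[]): number {
  if (chunks.length === 0 || expectTerms.length === 0) return 0;
  const terms = expectTerms.map((t) => t.toLowerCase());
  const hits = chunks.filter((chunk) => {
    const text = chunk.toLowerCase();
    return terms.some((t) => text.includes(t));
  }).length;
  return hits / chunks.length;
}

// Opt-in: with no expectTerms the relevance check never fires.
// minRatio is an assumed threshold, chosen here only for illustration.
function isOffTopic(chunks: string[], expectTerms: string[], minRatio = 0.2): boolean {
  if (expectTerms.length === 0) return false;
  return expectTermsMatchRatio(chunks, expectTerms) < minRatio;
}
```

Under this model, a scrape dominated by demo code would score a low ratio for terms like the library name and be flagged OFF_TOPIC, while jobs that omit expectTerms behave exactly as before.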

Reviewed changes

Copilot reviewed 30 out of 32 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| test/quality-gate-e2e.test.ts | E2E test asserting gate failure rolls back the staged version and reports EMPTY_RESULT. |
| test/fixtures/empty-fixture/empty.png | Fixture used to produce a 0-indexable-docs scrape outcome in the E2E test. |
| src/utils/errors.ts | Extends ScraperError with an optional machine-readable code. |
| src/utils/config.ts | Adds DEFAULT_DENY_PATHS for demos/examples subtree exclusion. |
| src/tools/ScrapeTool.ts | Plumbs expectTerms, denyPaths, localeStrategy through to the pipeline manager. |
| src/tools/ScrapeTool.test.ts | Verifies new scrape_docs options are forwarded to enqueue calls. |
| src/tools/ListJobsTool.ts | Exposes outcome and errorCode in list_jobs. |
| src/tools/GetJobInfoTool.ts | Exposes outcome and errorCode in get_job_info. |
| src/tools/GetJobInfoTool.test.ts | Adds assertions for outcome/errorCode on gate-failed jobs. |
| src/store/DocumentStore.ts | Adds statements for sampling chunk contents and listing indexed page URLs for gating. |
| src/store/DocumentManagementService.ts | Adds metrics/sampling/url-list helpers used by the quality gate. |
| src/store/DocumentManagementService.test.ts | Adds unit tests for getVersionMetrics normalization/behavior. |
| src/scraper/types.ts | Extends ScraperOptions with denyPaths, expectTerms, localeStrategy. |
| src/scraper/strategies/WebScraperStrategy.ts | Passes localeStrategy into fetch options. |
| src/scraper/strategies/GitHubScraperStrategy.ts | Adds default denyPaths filtering and a /tree/<subPath> zero-match hard error with code. |
| src/scraper/strategies/GitHubScraperStrategy.test.ts | Tests GITHUB_SUBPATH_NOT_FOUND and denyPaths exclusion behavior. |
| src/scraper/strategies/BaseScraperStrategy.ts | Adds denyPaths filtering for non-GitHub strategies based on URL pathname. |
| src/scraper/fetcher/types.ts | Extends FetchOptions with localeStrategy. |
| src/scraper/fetcher/HttpFetcher.ts | Adds locale query stripping, Accept-Language pinning, and redirect-loop detection with typed error code. |
| src/scraper/fetcher/HttpFetcher.test.ts | Tests locale redirect loop detection and Accept-Language pin + hl stripping. |
| src/pipeline/types.ts | Adds ScrapeOutcome, ScrapeErrorCode, and gate metrics fields on jobs. |
| src/pipeline/relevanceGate.ts | Adds pure helpers for scope coherence and expect-terms matching. |
| src/pipeline/relevanceGate.test.ts | Unit tests for scope-key alignment and expect-terms sampling. |
| src/pipeline/PipelineManager.ts | Implements the gate seam (gate-then-rollback) and surfaces outcome/errorCode. |
| src/pipeline/PipelineManager.test.ts | Adds tests for rollback behavior, refresh safety, and OFF_TOPIC failures. |
| src/pipeline/outcomeGate.ts | Adds evaluateOutcome() pure classifier with thresholds and remediation hints. |
| src/pipeline/outcomeGate.test.ts | Unit tests for outcome classification paths. |
| src/pipeline/errors.ts | Adds QualityGateError carrying the full verdict. |
| src/mcp/mcpServer.ts | Extends the MCP schema/tool plumbing for new scrape options. |
| README.md | Documents quality gates, new parameters, and new outcome/errorCode fields. |
| ARCHITECTURE.md | Documents the quality-gate seam and gate-then-rollback behavior. |

Comment on lines 761 to +765
// Handle other errors
const errorMessage = error instanceof Error ? error.message : String(error);
if (error instanceof ScraperError && error.code) {
  job.errorCode = error.code as ScrapeErrorCode;
}

Comment on lines 227 to +231
const redirectUrl = new URL(location, currentUrl).href;

// Detect a cyclic redirect (commonly a locale ping-pong, e.g.
// ?hl=pt -> ?hl=en -> ?hl=pt) before exhausting the redirect budget.
if (seenLocations.has(redirectUrl)) {

try {
  const u = new URL(rawUrl);
  let path = u.pathname;
  if (u.hostname.endsWith("github.com")) {

Comment on lines +214 to +218
const normalizedVersion = this.normalizeVersion(version);
const summaries = await this.store.queryLibraryVersions();
const entry = summaries
  .get(library.toLowerCase())
  ?.find((v) => (v.version ?? "") === normalizedVersion);