feat: scrape quality gates — classify outcome and roll back unusable indexes#439
Draft
walterfrey wants to merge 15 commits into
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
walterfrey added a commit to NEXUZ-SYS/docs-mcp-server that referenced this pull request on Jun 13, 2026
Promote a scraped (library, version) to searchable only when it indexed usable docs. Adds a ScrapeOutcome classifier + gate-then-rollback hard-fail at the pipeline seam, fixing three failure modes that previously reported "completed": empty/hostile host (FM-1, locale pin-en + LOCALE_REDIRECT_LOOP), off-topic subtree (FM-2, denyPaths + opt-in expectTerms/scope relevance), and silent no-op on a nonexistent GitHub subpath (FM-3, GITHUB_SUBPATH_NOT_FOUND). Additive API (expectTerms/denyPaths/localeStrategy + outcome/errorCode); refresh/clean=false skip the gate to avoid data loss. Upstream PR (clean branch): arabold#439 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pull request overview
This PR adds “scrape quality gates” to ensure a (library, version) becomes searchable only when the scrape produced usable documentation, and otherwise fails the job with a typed errorCode and rolls back the staged index to avoid leaving garbage searchable.
Changes:
- Introduces `ScrapeOutcome`/`ScrapeErrorCode` and a pure `evaluateOutcome()` classifier, plus relevance helpers (`computeInScopeRatio`, `sampleExpectTermsMatch`).
- Adds a quality-gate seam in `PipelineManager` that can gate-then-rollback via `removeVersion()`, and plumbs `outcome`/`errorCode` through job tools.
- Adds opt-in relevance gating (`expectTerms`), default path denylists (`denyPaths`), and locale handling (`localeStrategy`) with tests and docs.
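For readers skimming the overview, the two relevance helpers named above could look roughly like the sketch below. The signatures are assumptions made for illustration (the PR's actual helpers may take different arguments, and its scope key may be richer than a path prefix); only the helper names come from the summary.

```typescript
// Hypothetical signatures: only the function names are taken from the PR summary.

// Fraction of indexed page URLs whose pathname starts with the expected scope prefix.
function computeInScopeRatio(urls: string[], scopePrefix: string): number {
  if (urls.length === 0) return 0;
  const inScope = urls.filter((u) => new URL(u).pathname.startsWith(scopePrefix));
  return inScope.length / urls.length;
}

// Fraction of sampled chunks that mention at least one expected term (case-insensitive).
function sampleExpectTermsMatch(chunks: string[], expectTerms: string[]): number {
  if (chunks.length === 0 || expectTerms.length === 0) return 0;
  const terms = expectTerms.map((t) => t.toLowerCase());
  const hits = chunks.filter((chunk) => {
    const text = chunk.toLowerCase();
    return terms.some((t) => text.includes(t));
  });
  return hits.length / chunks.length;
}
```

A gate built on these would compare each ratio against a threshold; which thresholds the PR uses is not stated in this overview.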
Reviewed changes
Copilot reviewed 30 out of 32 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| test/quality-gate-e2e.test.ts | E2E test asserting gate failure rolls back the staged version and reports EMPTY_RESULT. |
| test/fixtures/empty-fixture/empty.png | Fixture used to produce a 0-indexable-docs scrape outcome in the E2E test. |
| src/utils/errors.ts | Extends ScraperError with an optional machine-readable code. |
| src/utils/config.ts | Adds DEFAULT_DENY_PATHS for demos/examples subtree exclusion. |
| src/tools/ScrapeTool.ts | Plumbs expectTerms, denyPaths, localeStrategy through to the pipeline manager. |
| src/tools/ScrapeTool.test.ts | Verifies new scrape_docs options are forwarded to enqueue calls. |
| src/tools/ListJobsTool.ts | Exposes outcome and errorCode in list_jobs. |
| src/tools/GetJobInfoTool.ts | Exposes outcome and errorCode in get_job_info. |
| src/tools/GetJobInfoTool.test.ts | Adds assertions for outcome/errorCode on gate-failed jobs. |
| src/store/DocumentStore.ts | Adds statements for sampling chunk contents and listing indexed page URLs for gating. |
| src/store/DocumentManagementService.ts | Adds metrics/sampling/url-list helpers used by the quality gate. |
| src/store/DocumentManagementService.test.ts | Adds unit tests for getVersionMetrics normalization/behavior. |
| src/scraper/types.ts | Extends ScraperOptions with denyPaths, expectTerms, localeStrategy. |
| src/scraper/strategies/WebScraperStrategy.ts | Passes localeStrategy into fetch options. |
| src/scraper/strategies/GitHubScraperStrategy.ts | Adds default denyPaths filtering and a /tree/<subPath> zero-match hard error with code. |
| src/scraper/strategies/GitHubScraperStrategy.test.ts | Tests GITHUB_SUBPATH_NOT_FOUND and denyPaths exclusion behavior. |
| src/scraper/strategies/BaseScraperStrategy.ts | Adds denyPaths filtering for non-GitHub strategies based on URL pathname. |
| src/scraper/fetcher/types.ts | Extends FetchOptions with localeStrategy. |
| src/scraper/fetcher/HttpFetcher.ts | Adds locale query stripping, Accept-Language pinning, and redirect-loop detection with typed error code. |
| src/scraper/fetcher/HttpFetcher.test.ts | Tests locale redirect loop detection and Accept-Language pin + hl stripping. |
| src/pipeline/types.ts | Adds ScrapeOutcome, ScrapeErrorCode, and gate metrics fields on jobs. |
| src/pipeline/relevanceGate.ts | Adds pure helpers for scope coherence and expect-terms matching. |
| src/pipeline/relevanceGate.test.ts | Unit tests for scope-key alignment and expect-terms sampling. |
| src/pipeline/PipelineManager.ts | Implements the gate seam (gate-then-rollback) and surfaces outcome/errorCode. |
| src/pipeline/PipelineManager.test.ts | Adds tests for rollback behavior, refresh safety, and OFF_TOPIC failures. |
| src/pipeline/outcomeGate.ts | Adds evaluateOutcome() pure classifier with thresholds and remediation hints. |
| src/pipeline/outcomeGate.test.ts | Unit tests for outcome classification paths. |
| src/pipeline/errors.ts | Adds QualityGateError carrying the full verdict. |
| src/mcp/mcpServer.ts | Extends the MCP schema/tool plumbing for new scrape options. |
| README.md | Documents quality gates, new parameters, and new outcome/errorCode fields. |
| ARCHITECTURE.md | Documents the quality-gate seam and gate-then-rollback behavior. |
Comment on lines 761 to +765

    // Handle other errors
    const errorMessage = error instanceof Error ? error.message : String(error);
    if (error instanceof ScraperError && error.code) {
      job.errorCode = error.code as ScrapeErrorCode;
    }
Comment on lines 227 to +231

    const redirectUrl = new URL(location, currentUrl).href;

    // Detect a cyclic redirect (commonly a locale ping-pong, e.g.
    // ?hl=pt -> ?hl=en -> ?hl=pt) before exhausting the redirect budget.
    if (seenLocations.has(redirectUrl)) {
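As a standalone illustration of the check excerpted above, here is a minimal sketch of locale-parameter stripping plus cyclic-redirect detection. `stripLocaleParams`, `followRedirects`, the `nextLocation` callback (a stand-in for an HTTP request that returns the `Location` header or `null`), and the redirect cap of 5 are all hypothetical; the PR's `HttpFetcher` is not reproduced here.

```typescript
const LOCALE_PARAMS = ["hl", "lang", "locale"];

// Remove locale query parameters so the request is not steered to a localized page.
function stripLocaleParams(rawUrl: string): string {
  const url = new URL(rawUrl);
  for (const name of LOCALE_PARAMS) url.searchParams.delete(name);
  return url.href;
}

// Follow a redirect chain, failing fast when a Location repeats.
// `nextLocation` is a hypothetical stand-in for the real HTTP call.
function followRedirects(
  startUrl: string,
  nextLocation: (url: string) => string | null,
  maxRedirects = 5, // assumed cap, for illustration only
): string {
  const seenLocations = new Set<string>([startUrl]);
  let currentUrl = startUrl;
  for (let i = 0; i < maxRedirects; i++) {
    const location = nextLocation(currentUrl);
    if (location === null) return currentUrl; // no redirect: final URL reached
    const redirectUrl = new URL(location, currentUrl).href;
    if (seenLocations.has(redirectUrl)) {
      // A repeated URL means a cycle, e.g. ?hl=pt -> ?hl=en -> ?hl=pt.
      throw new Error(`LOCALE_REDIRECT_LOOP: ${currentUrl} -> ${redirectUrl}`);
    }
    seenLocations.add(redirectUrl);
    currentUrl = redirectUrl;
  }
  throw new Error("Too many redirects");
}
```

Tracking visited URLs in a set lets the loop be reported with a specific code before the redirect budget is exhausted, instead of surfacing as a generic "too many redirects" failure.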
    try {
      const u = new URL(rawUrl);
      let path = u.pathname;
      if (u.hostname.endsWith("github.com")) {
Comment on lines +214 to +218

    const normalizedVersion = this.normalizeVersion(version);
    const summaries = await this.store.queryLibraryVersions();
    const entry = summaries
      .get(library.toLowerCase())
      ?.find((v) => (v.version ?? "") === normalizedVersion);
feat: scrape quality gates — classify outcome and roll back unusable indexes
Problem
Today a scrape job is marked `completed` as soon as the scraper returns without throwing (`PipelineManager` success path). There is no quality gate: the server conflates "the crawl finished" with "I indexed usable documentation". This produces three failure modes that all report `completed`, verified against the real `google/generative-ai-docs` repository:

| Failure mode | Cause | Reported status |
|---|---|---|
| FM-1: empty/hostile host (`?hl=` locale redirect) | No cyclic-`Location` detection and no locale stripping. | `completed` |
| FM-2: off-topic subtree | The crawl indexes `demos/` as descendants. Measured: 679 of 1045 tree entries are under `demos/`, drowning the ~50 real `.md` docs. Path scope can't help (demos are descendants of the root). | `completed` (non-empty!) |
| FM-3: nonexistent GitHub subpath | `/tree/<branch>/<subPath>` indexes nothing and reports `completed` in ~0.4s. | `completed` |

Empirical evidence (real repo, not truncated, 1045 entries):
Solution
Promote a `(library, version)` to searchable only if it passes quality gates. Otherwise, discard what was indexed and return a typed error code with remediation.

- `ScrapeOutcome` enum (`indexed | empty | thin | degenerate | failed`) computed at job end from counters that already exist.
- Gate-then-rollback at the pipeline seam: before a job is marked `COMPLETED`, a failing verdict calls the existing `removeVersion()` to discard the staged docs and marks the job `FAILED` with a typed `errorCode`. No half-indexed garbage stays searchable.
- FM-1: `localeStrategy` (`pin-en` default) sends `Accept-Language: en` and strips the `hl`/`lang`/`locale` query params; cyclic-`Location` detection within the existing redirect cap raises `LOCALE_REDIRECT_LOOP`.
- FM-2: `denyPaths` (default `demos`/`examples`) trims irrelevant subtrees; an opt-in relevance gate (`expectTerms`) samples indexed chunks and flags `OFF_TOPIC`, and a path/host coherence check flags `SCOPE_DRIFT`.
- FM-3: a `/tree/` subPath matching zero files raises `GITHUB_SUBPATH_NOT_FOUND`, whose message lists the repo's real top-level directories.
Error codes (surfaced in `get_job_info` / `list_jobs`)
`EMPTY_RESULT`, `THIN_RESULT`, `OFF_TOPIC`, `SCOPE_DRIFT`, `LOCALE_REDIRECT_LOOP`, `GITHUB_SUBPATH_NOT_FOUND`.
Backward compatibility
All public changes are additive:

- New optional `scrape_docs` params: `expectTerms`, `denyPaths`, `localeStrategy` (bounded in the zod schema: ≤50 items, ≤200 chars each).
- New optional fields on `ScraperOptions`, `FetchOptions`, `JobInfo`, `PipelineJob`.
- The relevance axes (`OFF_TOPIC`/`SCOPE_DRIFT`) are opt-in via `expectTerms`.
- Refresh and `clean=false` jobs skip the gate, so a transient thin re-index can never delete a pre-existing good version.
Implementation walkthrough (one commit per step)

1. `ScrapeOutcome` / `ScrapeErrorCode` / `JobOutcomeMetrics` types; pure `evaluateOutcome()` classifier; `getVersionMetrics()` store accessor.
2. `applyQualityGate()` seam in `PipelineManager` + gate-then-rollback; E2E asserting a failed gate leaves the store empty.
3. `GITHUB_SUBPATH_NOT_FOUND` guard; locale normalization + `LOCALE_REDIRECT_LOOP`.
4. `denyPaths` in GitHub + web crawl; `relevanceGate` (`toScopeKey` ref-agnostic scope, `expectTerms` sampling); wiring into the gate (opt-in).
5. `outcome`/`errorCode` in job tools; plumb new params through `scrape_docs`; docs.

Tests
New deterministic coverage (no network): `outcomeGate.test.ts`, `relevanceGate.test.ts`, `quality-gate-e2e.test.ts`, plus additions to the `PipelineManager`, `HttpFetcher`, `GitHubScraperStrategy`, `DocumentManagementService`, `GetJobInfoTool`, and `ScrapeTool` tests.

FM-2/FM-3 use a committed snapshot of the real `google/generative-ai-docs` tree.

`npm test`, `npm run lint`, and `npm run typecheck` are all green (a pre-existing locale-dependent `cli-e2e` case and an environment-dependent `vector-persistence` case are excluded; both fail identically on `main`).

Out of scope (intentionally deferred)
- `discover_source(library)` tool.
- […] `logger.warn`).