fix(scraper): make browser fetcher robust to non-navigable urls and idle hangs by evilhamsterman · Pull Request #442 · arabold/docs-mcp-server

evilhamsterman · 2026-06-16T18:00:07Z

Fixes #441.

Summary

Makes BrowserFetcher robust to the two failure modes described in #441:

networkidle → load. page.goto now gates on "load" and waits for networkidle only on a best-effort basis (with a short timeout, swallowed on failure), matching what HtmlPlaywrightMiddleware already does. Sites that never go network-idle (Cloudflare telemetry, analytics, websockets) no longer time out the navigation.
Non-navigable resources fall back to the request API. When page.goto rejects with ERR_INVALID_ARGUMENT / ERR_ABORTED (Markdown/JSON/plain-text resources the browser tries to download), the fetcher loads the origin first (so any JS/anti-bot challenge is solved and clearance cookies are set on the context) and then retrieves the bytes via page.request.get, which reuses those cookies but does not try to render the response. A still-blocked fetch surfaces as a retryable ScraperError rather than crashing the job.

Why

The llms.txt feature seeds Markdown (.md) URL variants at depth 0. Behind a Cloudflare Managed Challenge these get routed to the browser fallback, where page.goto on a Markdown URL throws ERR_INVALID_ARGUMENT; since depth-0 failures are fatal in BaseScraperStrategy, one such seed aborts the whole scrape. The networkidle gate independently caused navigation timeouts on the same class of sites.

Changes

src/scraper/fetcher/BrowserFetcher.ts — load gate + best-effort networkidle; new fetchViaRequest() fallback for non-navigable URLs.
src/scraper/fetcher/BrowserFetcher.test.ts — refactored mocks into a mockBrowser() helper; added tests for the request-API fallback (success path) and for a still-blocked fallback raising a ScraperError.

Testing

npx vitest run src/scraper/fetcher/BrowserFetcher.test.ts — passes (incl. new cases).
npx tsc --noEmit and biome check — clean.
Verified end-to-end against a Cloudflare-challenged Read-the-Docs site (docs.vyos.io): the scrape that previously aborted in ~30s now completes the full tree with no ERR_INVALID_ARGUMENT or networkidle timeouts; pages that genuinely can't be cleared fail individually (non-fatal) instead of killing the job.

Notes

The depth-0-seed-fatal behavior in BaseScraperStrategy (any single llms.txt seed failure aborting the whole job) is a related but separate issue; this PR makes the browser path degrade gracefully so it no longer triggers that path, but the underlying strategy behavior is left for a follow-up.

…dle hangs BrowserFetcher gated page.goto on "networkidle", but many sites (analytics, Cloudflare telemetry, websockets) never reach network idle, so navigation timed out after the full browser timeout even though the document was ready. Gate on "load" instead, then wait for networkidle only on a best-effort basis. Also handle resources the browser cannot navigate to (Markdown, JSON, plain text): page.goto rejects them with ERR_INVALID_ARGUMENT / ERR_ABORTED because the browser starts a download. These now fall back to the browser context's request API, which reuses cookies (so any anti-bot clearance carries over) but does not try to render the response. A still-blocked fallback surfaces as a retryable ScraperError rather than crashing the job. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR addresses the two failure modes described in #441 by making the Playwright-based BrowserFetcher more resilient when navigating to pages that never reach networkidle and when encountering non-navigable resources (e.g., .md) that cause page.goto() to reject.

Changes:

Switch page.goto() gating from "networkidle" to "load" and treat "networkidle" as best-effort with a short timeout.
Add a fetchViaRequest() fallback that uses page.request.get() after loading the origin to reuse any solved anti-bot cookies.
Refactor Playwright mocks in BrowserFetcher tests and add cases covering the request-API fallback paths.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
src/scraper/fetcher/BrowserFetcher.ts	Changes navigation wait strategy and introduces `fetchViaRequest()` fallback for non-navigable URLs.
src/scraper/fetcher/BrowserFetcher.test.ts	Refactors Playwright mocks and adds tests for the request-API fallback behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    const response = await page.request.get(source, {
+      headers: options?.headers,
+      maxRedirects: options?.followRedirects === false ? 0 : undefined,
+      timeout,
+    });


+    const finalUrl = source;
+    await this.accessPolicy.assertNetworkUrlAllowed(finalUrl);
+
+    if (!response.ok()) {
+      throw new ScraperError(
+        `Browser request for ${source} returned status ${response.status()}`,
+        true,
+      );
+    }


+    if (!response.ok()) {
+      throw new ScraperError(
+        `Browser request for ${source} returned status ${response.status()}`,
+        true,
+      );


+    const fetcher = new BrowserFetcher(loadConfig().scraper);
+    await expect(
+      fetcher.fetch("https://example.com/automation/index.md"),
+    ).rejects.toBeInstanceOf(ScraperError);
+  });


arabold · 2026-06-28T13:53:08Z

Thanks for the PR — this is a clean, well-documented fix and the problem analysis in the description is excellent. 🙏 The load gate + best-effort networkidle change is exactly right, and the test refactor into mockBrowser() is a nice cleanup.

Two issues in the new fetchViaRequest() path before this is ready to merge (both also independently flagged by the Copilot review, so consider these confirmed):

The retryable error never reaches callers. fetchViaRequest() throws new ScraperError(..., true), but because it's called from inside fetch()'s try block, the outercatch re-wraps every error asnew ScraperError(..., false)` - so the retryable signal is dropped. The "still-blocked fallback is retryable" behavior described in the PR doesn't actually happen.
The request-API fallback bypasses the SSRF guards. page.request.get() does not trigger the page.route("**/*") access-policy gate that protects navigations, and it follows redirects by default. Combined with finalUrl being hard-coded to source, a redirect to a private/blocked host would be fetched and returned without assertNetworkUrlAllowed() ever validating the real target, and RawContent.source/etag would be mislabeled as the original URL. The actual final URL after redirects needs re-validation. While here: when followRedirects === false, a 3xx currently surfaces as a retryable "status X" error rather than the non-retryable "Redirect blocked" error the main navigation path produces.

Everything else looks good. Thanks again!!

arabold requested a review from Copilot June 28, 2026 01:46

Copilot started reviewing on behalf of arabold June 28, 2026 01:46 View session

Copilot AI reviewed Jun 28, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(scraper): make browser fetcher robust to non-navigable urls and idle hangs#442

fix(scraper): make browser fetcher robust to non-navigable urls and idle hangs#442
evilhamsterman wants to merge 1 commit into
arabold:mainfrom
evilhamsterman:fix/browser-fetcher-nonnavigable

evilhamsterman commented Jun 16, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

arabold commented Jun 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

evilhamsterman commented Jun 16, 2026

Summary

Why

Changes

Testing

Notes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

arabold commented Jun 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants