fix(scraper): make browser fetcher robust to non-navigable urls and idle hangs #442
evilhamsterman wants to merge 1 commit
…dle hangs

BrowserFetcher gated page.goto on "networkidle", but many sites (analytics, Cloudflare telemetry, websockets) never reach network idle, so navigation timed out after the full browser timeout even though the document was ready. Gate on "load" instead, then wait for networkidle only on a best-effort basis.

Also handle resources the browser cannot navigate to (Markdown, JSON, plain text): page.goto rejects them with ERR_INVALID_ARGUMENT / ERR_ABORTED because the browser starts a download. These now fall back to the browser context's request API, which reuses cookies (so any anti-bot clearance carries over) but does not try to render the response. A still-blocked fallback surfaces as a retryable ScraperError rather than crashing the job.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
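The "gate on load, then best-effort networkidle" strategy from the commit message can be sketched as follows. This is a minimal illustration, not the PR's code: NavigablePage is a hypothetical structural stand-in for the two Playwright Page methods involved (page.goto and page.waitForLoadState), navigateWithBestEffortIdle is a made-up helper name, and the 5-second idle timeout is an assumed value.

```typescript
// Hypothetical stand-in for the two Playwright Page methods used below.
// A real Playwright Page should satisfy this shape structurally.
interface NavigablePage {
  goto(
    url: string,
    options: { waitUntil: "load"; timeout: number },
  ): Promise<unknown>;
  waitForLoadState(
    state: "networkidle",
    options: { timeout: number },
  ): Promise<void>;
}

// Gate navigation on "load", then give the network a short, bounded chance
// to go idle. Resolves to true if idle was reached, false otherwise.
async function navigateWithBestEffortIdle(
  page: NavigablePage,
  url: string,
  navigationTimeoutMs: number,
  idleTimeoutMs = 5_000, // assumed value, not taken from the PR
): Promise<boolean> {
  // A rejection here is a genuine navigation failure and propagates.
  await page.goto(url, { waitUntil: "load", timeout: navigationTimeoutMs });
  try {
    await page.waitForLoadState("networkidle", { timeout: idleTimeoutMs });
    return true;
  } catch {
    // Pages with long-lived connections may never go idle. The document
    // has already loaded, so the timeout is swallowed.
    return false;
  }
}
```

The design choice is that only the "load" gate can fail the fetch; the idle wait can only delay it by at most idleTimeoutMs.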
Pull request overview
This PR addresses the two failure modes described in #441 by making the Playwright-based BrowserFetcher more resilient when navigating to pages that never reach networkidle and when encountering non-navigable resources (e.g., .md) that cause page.goto() to reject.
Changes:
- Switch page.goto() gating from "networkidle" to "load" and treat "networkidle" as best-effort with a short timeout.
- Add a fetchViaRequest() fallback that uses page.request.get() after loading the origin to reuse any solved anti-bot cookies.
- Refactor Playwright mocks in BrowserFetcher tests and add cases covering the request-API fallback paths.
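The request-API fallback in the list above might look roughly like this sketch. It is not the PR's fetchViaRequest() implementation: FallbackPage and ApiResponseLike are hypothetical structural stand-ins for the Playwright Page and APIResponse members used, fetchWithRequestFallback and isNonNavigableError are made-up names, and detecting the error codes by substring match is an assumption.

```typescript
// Hypothetical stand-ins for the Playwright members used in this sketch.
interface ApiResponseLike {
  ok(): boolean;
  status(): number;
  body(): Promise<Uint8Array>;
}

interface FallbackPage {
  goto(
    url: string,
    options?: { waitUntil?: "load"; timeout?: number },
  ): Promise<unknown>;
  request: {
    get(url: string, options?: { timeout?: number }): Promise<ApiResponseLike>;
  };
}

type FetchOutcome =
  | { via: "navigation" }
  | { via: "request"; body: Uint8Array };

// Assumption: the two error codes named in the PR appear in the message.
function isNonNavigableError(error: unknown): boolean {
  const message = error instanceof Error ? error.message : String(error);
  return (
    message.includes("ERR_INVALID_ARGUMENT") || message.includes("ERR_ABORTED")
  );
}

async function fetchWithRequestFallback(
  page: FallbackPage,
  source: string,
  timeout: number,
): Promise<FetchOutcome> {
  try {
    await page.goto(source, { waitUntil: "load", timeout });
    return { via: "navigation" };
  } catch (error) {
    // Anything other than a non-navigable resource is a real failure.
    if (!isNonNavigableError(error)) throw error;
  }
  // Load the origin first so any challenge cookies land on the context.
  await page.goto(new URL(source).origin, { waitUntil: "load", timeout });
  // The request API shares the context's cookies but does not render.
  const response = await page.request.get(source, { timeout });
  if (!response.ok()) {
    throw new Error(
      `Browser request for ${source} returned status ${response.status()}`,
    );
  }
  return { via: "request", body: await response.body() };
}
```

For a source of https://example.com/automation/index.md, the sketch navigates to https://example.com before issuing the request, which is where the cookie reuse comes from.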
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/scraper/fetcher/BrowserFetcher.ts | Changes navigation wait strategy and introduces fetchViaRequest() fallback for non-navigable URLs. |
| src/scraper/fetcher/BrowserFetcher.test.ts | Refactors Playwright mocks and adds tests for the request-API fallback behavior. |

    const response = await page.request.get(source, {
      headers: options?.headers,
      maxRedirects: options?.followRedirects === false ? 0 : undefined,
      timeout,
    });

    const finalUrl = source;
    await this.accessPolicy.assertNetworkUrlAllowed(finalUrl);


    if (!response.ok()) {
      throw new ScraperError(
        `Browser request for ${source} returned status ${response.status()}`,
        true,
      );
    }

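To illustrate why it matters that the status check above raises a retryable error, here is a small sketch of a caller that retries only errors flagged as retryable. Everything in it is hypothetical: RetryableError is a stand-in modelling only the boolean flag (it is not the project's ScraperError), and assertOk, isRetryable, and withRetries are made-up helpers. How the project actually consumes the flag is not shown in this PR.

```typescript
// Hypothetical stand-in: models only a message plus a retryable flag.
class RetryableError extends Error {
  readonly isRetryable: boolean;
  constructor(message: string, isRetryable: boolean) {
    super(message);
    this.name = "RetryableError";
    this.isRetryable = isRetryable;
  }
}

// Mirrors the quoted check: any status outside 200-299 becomes retryable.
function assertOk(source: string, status: number): void {
  if (status < 200 || status > 299) {
    throw new RetryableError(
      `Browser request for ${source} returned status ${status}`,
      true,
    );
  }
}

// Duck-typed check so it does not depend on instanceof behaving correctly
// for Error subclasses under older compile targets.
function isRetryable(error: unknown): boolean {
  return (
    typeof error === "object" &&
    error !== null &&
    (error as { isRetryable?: unknown }).isRetryable === true
  );
}

// Retries the attempt while it fails with a retryable error.
// Assumes maxAttempts is at least 1.
async function withRetries<T>(
  attempt: () => Promise<T>,
  maxAttempts: number,
): Promise<T> {
  let lastError: unknown;
  for (let i = 1; i <= maxAttempts; i++) {
    try {
      return await attempt();
    } catch (error) {
      lastError = error;
      // Non-retryable failures surface immediately.
      if (!isRetryable(error)) throw error;
    }
  }
  throw lastError;
}
```

With this shape, a temporarily blocked page costs a few extra attempts, while an error that is not flagged as retryable fails on the first attempt.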

    const fetcher = new BrowserFetcher(loadConfig().scraper);
    await expect(
      fetcher.fetch("https://example.com/automation/index.md"),
    ).rejects.toBeInstanceOf(ScraperError);
    });

Thanks for the PR — this is a clean, well-documented fix and the problem analysis in the description is excellent. 🙏 The Two issues in the new
Everything else looks good. Thanks again!!
Fixes #441.
Summary

Makes BrowserFetcher robust to the two failure modes described in #441:

- networkidle → load. page.goto now gates on "load" and waits for networkidle only on a best-effort basis (with a short timeout, swallowed on failure), matching what HtmlPlaywrightMiddleware already does. Sites that never go network-idle (Cloudflare telemetry, analytics, websockets) no longer time out the navigation.
- When page.goto rejects with ERR_INVALID_ARGUMENT / ERR_ABORTED (Markdown/JSON/plain-text resources the browser tries to download), the fetcher loads the origin first (so any JS/anti-bot challenge is solved and clearance cookies are set on the context) and then retrieves the bytes via page.request.get, which reuses those cookies but does not try to render the response. A still-blocked fetch surfaces as a retryable ScraperError rather than crashing the job.

Why

The llms.txt feature seeds Markdown (.md) URL variants at depth 0. Behind a Cloudflare Managed Challenge these get routed to the browser fallback, where page.goto on a Markdown URL throws ERR_INVALID_ARGUMENT; since depth-0 failures are fatal in BaseScraperStrategy, one such seed aborts the whole scrape. The networkidle gate independently caused navigation timeouts on the same class of sites.

Changes

- src/scraper/fetcher/BrowserFetcher.ts: load gate + best-effort networkidle; new fetchViaRequest() fallback for non-navigable URLs.
- src/scraper/fetcher/BrowserFetcher.test.ts: refactored mocks into a mockBrowser() helper; added tests for the request-API fallback (success path) and for a still-blocked fallback raising a ScraperError.

Testing

- npx vitest run src/scraper/fetcher/BrowserFetcher.test.ts: passes (incl. new cases).
- npx tsc --noEmit and biome check: clean.
- docs.vyos.io: the scrape that previously aborted in ~30s now completes the full tree with no ERR_INVALID_ARGUMENT or networkidle timeouts; pages that genuinely can't be cleared fail individually (non-fatal) instead of killing the job.

Notes

The depth-0-seed-fatal behavior in BaseScraperStrategy (any single llms.txt seed failure aborting the whole job) is a related but separate issue; this PR makes the browser path degrade gracefully so it no longer triggers that path, but the underlying strategy behavior is left for a follow-up.