Claude/vibrant wozniak 284617#413

Merged
mnot merged 2 commits into main from claude/vibrant-wozniak-284617
May 19, 2026

@mnot mnot commented May 19, 2026

Summary

  • Cap the HTML link parser at 2 MB of input per response to avoid an event-loop stall (and systemd watchdog kill) caused by Python's html.parser going quadratic inside large inline <script>/<style> bodies — e.g. SPA SSR/hydration blobs.
  • Fix a latent bug where the link parser was registered on response_content_processors (raw network bytes): for gzip/brotli responses it received compressed bytes, decoded them as UTF-8 with errors=ignore, and silently extracted no links. The parser is now registered on response.decoded.processors so it sees the decompressed stream.
  • Added a regression test that exercises the wiring end-to-end with a gzipped body.
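
As a rough illustration of the cap in the first bullet, here is a minimal sketch, not the code from this PR. `CappedLinkParser`, `MAX_PARSE_BYTES`, and the rule of collecting every `href`/`src` value are hypothetical; only the 2 MB figure and the `feed_bytes` name come from the PR. The sketch also truncates the chunk that crosses the limit, which may differ from what the merged change does.

```python
from html.parser import HTMLParser

# 2 MB budget taken from the PR text; the constant name is hypothetical.
MAX_PARSE_BYTES = 2 * 1024 * 1024


class CappedLinkParser(HTMLParser):
    """Hypothetical link collector that stops parsing after a byte budget."""

    def __init__(self, max_bytes=MAX_PARSE_BYTES):
        super().__init__(convert_charrefs=True)
        self.max_bytes = max_bytes
        self.seen_bytes = 0
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Assumption: collect any href/src value; a real parser is
        # probably more selective about which tags it follows.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

    def feed_bytes(self, chunk):
        # Once the budget is spent, drop further input so html.parser
        # never gets to rescan a huge inline <script>/<style> body.
        remaining = self.max_bytes - self.seen_bytes
        if remaining <= 0:
            return
        self.seen_bytes += len(chunk)
        # Decoding per chunk can drop a character split across chunks;
        # acceptable for a sketch that only looks for ASCII attribute names.
        self.feed(chunk[:remaining].decode("utf-8", errors="ignore"))


# Usage: with a tiny 40-byte budget, the third chunk is ignored.
parser = CappedLinkParser(max_bytes=40)
for piece in (b'<a href="/one">x</a>', b'<a href="/two">y</a>',
              b'<a href="/three">z</a>'):
    parser.feed_bytes(piece)
print(parser.links)
```

Tracking the budget in bytes rather than in parsed tags keeps the check cheap and independent of what the parser does with the input.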

Test plan

  • make typecheck
  • make lint
  • .venv/bin/python -m pytest test/test_link_parse.py (3 passed, including the new gzip regression test)
  • Confirmed the regression test fails when the wiring change is reverted
  • Manual: run a descend against a heavy SPA page (e.g. a TikTok profile) and confirm it completes without tripping the systemd watchdog
  • Manual: run a descend against a gzip-encoded HTML page and confirm linked sub-resources are now fetched

mnot added 2 commits May 19, 2026 10:22
Python html.parser exhibits quadratic behaviour inside <script> and
<style> bodies because it rescans the accumulated rawdata buffer for
the closing tag on every feed() call. With descend enabled, a single
large SSR/hydration blob (tens of MB of inline <script>) is enough to
block the event loop past the systemd watchdog timeout and have the
process killed.

Stop feeding the parser once a response has contributed more than 2 MB
of body, which keeps the worst case well under the watchdog limit while
still covering the head and early body where the relevant link tags
live.

The HTML link parser was registered on response_content_processors,
which receives raw network bytes. For any gzip- or brotli-encoded
response the parser saw compressed bytes, decoded them as UTF-8 with
errors=ignore (producing near-empty output), and silently extracted no
links. Descend was effectively broken on most of the modern web.

Register feed_bytes on response.decoded.processors instead, mirroring
how sample_decoded is already wired, so the parser sees the
decompressed stream. Adds a regression test that exercises the wiring
end-to-end with a gzip body.
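
The failure mode described in this commit message can be reproduced outside the codebase with the standard library alone. The sketch below is an illustration under stated assumptions, not the regression test added in this PR: the sample HTML and all variable names are hypothetical, and it simply compares the text a processor would see on the raw (compressed) bytes with the text it would see on the decoded stream.

```python
import gzip

# Hypothetical page: 50 identical stylesheet links in the head.
LINK_TAG = b'<link rel="stylesheet" href="/site.css">'
html = b"<html><head>" + LINK_TAG * 50 + b"</head></html>"

# What arrives on the wire for a gzip-encoded response.
raw = gzip.compress(html)

# A processor hooked to the raw network bytes decodes compressed data.
text_from_raw = raw.decode("utf-8", errors="ignore")

# A processor hooked to the decoded stream sees the original markup.
text_from_decoded = gzip.decompress(raw).decode("utf-8", errors="ignore")

raw_hits = text_from_raw.count('href="')
decoded_hits = text_from_decoded.count('href="')

print("bytes on the wire:", len(raw), "of", len(html))
print("href attributes visible in raw bytes:", raw_hits)
print("href attributes visible after decoding:", decoded_hits)
```

Because errors=ignore discards invalid byte sequences instead of raising, the raw path fails silently, which is consistent with the bug going unnoticed until the wiring was checked end-to-end.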
@mnot mnot merged commit b36367f into main May 19, 2026
7 checks passed
@mnot mnot deleted the claude/vibrant-wozniak-284617 branch May 19, 2026 00:36