Claude/vibrant wozniak 284617 by mnot · Pull Request #413 · mnot/redbot

mnot · 2026-05-19T00:31:24Z

Summary

Cap the HTML link parser at 2 MB of input per response to avoid an event-loop stall (and systemd watchdog kill) caused by Python's html.parser going quadratic inside large inline <script>/<style> bodies, e.g. SPA SSR/hydration blobs.
Fix a latent bug where the link parser was registered on response_content_processors (raw network bytes), so for gzip/brotli responses it received compressed bytes, decoded them as UTF-8 with errors=ignore, and silently extracted no links. Move it to response.decoded.processors so it sees the decompressed stream.
Add a regression test that exercises the wiring end-to-end with a gzipped body.

Test plan

make typecheck
make lint
.venv/bin/python -m pytest test/test_link_parse.py (3 passed, including the new gzip regression test)
Confirmed the regression test fails when the wiring change is reverted
Manual: run a descend against a heavy SPA page (e.g. a TikTok profile) and confirm it completes without tripping the systemd watchdog
Manual: run a descend against a gzip-encoded HTML page and confirm linked sub-resources are now fetched

Python html.parser exhibits quadratic behaviour inside <script> and <style> bodies because it rescans the accumulated rawdata buffer for the closing tag on every feed() call. With descend enabled, a single large SSR/hydration blob (tens of MB of inline <script>) is enough to block the event loop past the systemd watchdog timeout and have the process killed. Stop feeding the parser once a response has contributed more than 2 MB of body, which keeps the worst case well under the watchdog limit while still covering the head and early body where the relevant link tags live.
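The cap described above can be sketched roughly as follows. This is a minimal illustration, not redbot's actual code: the class name `CappedLinkParser`, the constant `MAX_PARSE_BYTES`, and the choice to truncate the chunk that crosses the limit are all assumptions made for the example. Only `feed_bytes` and the 2 MB figure come from the PR text.

```python
from html.parser import HTMLParser

# Hypothetical constant name; the 2 MB figure is taken from the PR description.
MAX_PARSE_BYTES = 2 * 1024 * 1024


class CappedLinkParser(HTMLParser):
    """Collect href/src attribute values, ignoring input past a byte budget."""

    def __init__(self, max_bytes: int = MAX_PARSE_BYTES) -> None:
        super().__init__()
        self.max_bytes = max_bytes
        self.bytes_seen = 0
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

    def feed_bytes(self, chunk: bytes) -> None:
        remaining = self.max_bytes - self.bytes_seen
        if remaining <= 0:
            # Budget exhausted: drop the chunk without touching the parser,
            # so a huge inline <script> cannot keep growing its buffer.
            return
        self.bytes_seen += len(chunk)
        # Assumption: feed only the part of the chunk that fits the budget.
        self.feed(chunk[:remaining].decode("utf-8", "ignore"))
```

Links that appear after the budget is spent are simply not seen, which is the trade-off the commit message accepts: the head and early body are covered, and the worst-case parse time is bounded.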

The HTML link parser was registered on response_content_processors, which receives raw network bytes. For any gzip- or brotli-encoded response the parser saw compressed bytes, decoded them as UTF-8 with errors=ignore (producing near-empty output), and silently extracted no links. Descend was effectively broken on most of the modern web. Register feed_bytes on response.decoded.processors instead, mirroring how sample_decoded is already wired, so the parser sees the decompressed stream. Adds a regression test that exercises the wiring end-to-end with a gzip body.
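To illustrate why the registration point matters, the sketch below feeds the same gzip-encoded body to a link parser twice: once as raw wire bytes and once after streaming decompression. It is a standalone approximation using only the standard library. `LinkCollector`, `extract_links`, and `gunzip_stream` are hypothetical stand-ins for the example and do not reproduce redbot's processor API.

```python
import gzip
import zlib
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical stand-in for a link parser: records href values."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(chunks) -> list[str]:
    """Decode each chunk as UTF-8 (ignoring errors) and parse it for links."""
    parser = LinkCollector()
    for chunk in chunks:
        parser.feed(chunk.decode("utf-8", "ignore"))
    return parser.links


def gunzip_stream(chunks):
    """Incrementally decompress gzip-framed chunks."""
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16+ selects gzip framing
    for chunk in chunks:
        out = decompressor.decompress(chunk)
        if out:
            yield out
    tail = decompressor.flush()
    if tail:
        yield tail


html = b'<html><head><link rel="stylesheet" href="/style.css"></head></html>'
wire = gzip.compress(html)
wire_chunks = [wire[i:i + 16] for i in range(0, len(wire), 16)]

# Raw wire bytes: the parser sees compressed data and does not find the link.
raw_links = extract_links(wire_chunks)

# Decoded stream: the parser sees the original HTML.
decoded_links = extract_links(gunzip_stream(wire_chunks))
```

Here `decoded_links` contains `/style.css`, while `raw_links` does not, which mirrors the failure mode described above: nothing raises, the links are just silently missing.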

mnot added 2 commits May 19, 2026 10:22

mnot merged commit b36367f into main May 19, 2026
7 checks passed

mnot deleted the claude/vibrant-wozniak-284617 branch May 19, 2026 00:36

mnot merged 2 commits into main from claude/vibrant-wozniak-284617
