fix(links): repair rotted citations and harden the daily link sweep #53
Merged
…sweep

The scheduled External links sweep was failing on 211 "broken" links, but almost all were false positives: hosts that block or rate-limit any headless checker (W3C/securityheaders behind a Cloudflare JS challenge → 403, GitHub per-page edit/self-links → 429, developers.facebook.com → 400, the a2a endpoint is POST-only → 405). Underneath were ~14 genuinely dead/moved URLs.

Citations fixed (each verified 200, on the same topic):
- web-bot-auth: draft renamed → draft-meunier-http-message-signatures-directory
- speculation-rules: No-Vary-Search → MDN reference
- bfcache: Chrome docs page → DevTools back/forward-cache page
- caa-records: dropped MDN (deleted) → RFC 8657 (CAA ACME extensions)
- privacy-policy: EDPB transparency guidelines → current slug
- content-signals: IAB group renamed → Content Monetization Protocols (CoMP)
- data-minimization: ICO dropped /the-principles/ path segment
- script-loading, critical-css: render-blocking → Chrome for Developers
- scrollbar-gutter: web.dev article → Baseline scrollbar-props post
- css-containment: web.dev learn (deleted) → web.dev content-visibility
- accessibility-overlays: WebAIM overlay survey → Practitioners Survey #3
- view-transitions: WebKit blog 16557 → 16967
- cookie-consent: CNIL cookies → current "new guidelines" page
- nlweb: docs/nlweb-rest.md → docs/nlweb-rest-api.md

Workflow hardening (linkinator.config.json):
- retry / retryErrors so transient 429s and 5xx don't fail the run
- concurrency 25 + 30s timeout for a gentler crawl
- skip[] only hosts that hard-block any headless checker (documented in links.yml), so a red run now means real rot, not bot-blocking

Local full crawl after the changes: 211 → 0 real failures (1229 links scanned).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Deploying specification-website with
| Latest commit: | 3920664 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://2cfad0ca.specification-website.pages.dev |
| Branch Preview URL: | https://fix-dead-links-and-link-swee.specification-website.pages.dev |
GitHub's burst-limit 429s carry no retry-after header, so linkinator's own --retry can't catch them, and they hit a random third-party github.com blob link each run. Wrap the whole crawl in a 3-attempt loop: genuinely dead URLs fail every attempt and stay red; transient flakes clear on a re-run. Keeps real-rot detection on third-party github citations instead of skipping them.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
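The outer retry described in this commit can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the workflow step itself (which is not shown in this conversation): `run_crawl` is a hypothetical stand-in for one full linkinator pass that returns the set of failing URLs, and `make_fake_crawl` is a made-up helper used only to exercise the loop.

```python
from typing import Callable, Set


def crawl_with_attempts(run_crawl: Callable[[], Set[str]], attempts: int = 3) -> Set[str]:
    """Re-run a whole crawl until one pass is clean or attempts run out.

    `run_crawl` is a hypothetical stand-in for one full link-checker pass;
    it returns the set of URLs that failed in that pass.
    """
    failures: Set[str] = set()
    for _ in range(attempts):
        failures = run_crawl()
        if not failures:
            break  # clean pass: the sweep is green, no need to retry
    return failures  # non-empty only if every attempt had failures


def make_fake_crawl(dead: Set[str], flaky: Set[str]) -> Callable[[], Set[str]]:
    """Fake crawler for illustration: flaky URLs fail on the first pass only."""
    calls = 0

    def fake() -> Set[str]:
        nonlocal calls
        calls += 1
        return dead | flaky if calls == 1 else set(dead)

    return fake
```

With the fake crawler, a URL that is dead on every pass keeps the result non-empty, while a URL that only fails on the first pass clears on the re-run. The cost of this design is wall-clock time: every extra attempt repeats the entire crawl.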
GitHub 429s the shared GitHub Actions runner IP for github.com web requests regardless of link validity, and the limit window outlasts the retry loop, so a valid citation (e.g. the NLWeb docs file) fails every attempt. Skip the whole host rather than ship a red-by-default sweep; github citations are verified by hand when added. The retry loop stays as a net for other transient hosts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The outer 3× crawl loop compounded with linkinator's retry-after waits and could stall the step for 10+ minutes. Its original purpose (surviving GitHub's header-less 429s) is moot now that github.com is skipped. Revert to a single linkinator pass and add timeout-minutes: 10 as a fail-fast backstop against a single upstream returning a large retry-after.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
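In the workflow the backstop is the `timeout-minutes: 10` setting. The same "single pass with a hard deadline" idea can be sketched outside GitHub Actions with Python's standard library. This is an analogy, not the workflow's code, and `run_once_with_deadline` is a hypothetical helper name.

```python
import subprocess
import sys
from typing import Sequence


def run_once_with_deadline(cmd: Sequence[str], deadline_seconds: float) -> str:
    """Run `cmd` exactly once; report 'passed', 'failed', or 'timed out'."""
    try:
        completed = subprocess.run(cmd, timeout=deadline_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so nothing lingers.
        return "timed out"
    return "passed" if completed.returncode == 0 else "failed"


if __name__ == "__main__":
    # Stand-in command that sleeps longer than the deadline allows.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_once_with_deadline(slow, deadline_seconds=0.5))
```

A deadline turns a stalled run into a quick, visible failure, where an outer retry loop would multiply any long retry-after wait by the number of attempts.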
Why
The scheduled External links sweep (`links.yml`) failed, reporting 211 broken links / 1468. Almost all were false positives from hosts that block or rate-limit any headless link checker:

- `429` (×210)
- `403` (×174): `www.w3.org` & co. behind a Cloudflare "Just a moment…" JS challenge
- `400` (×4): `developers.facebook.com` bot-blocking
- `405` (×2): `a2a/v1` endpoint is POST-only
- `503` (×2): `chromium.googlesource.com` rate limit

Underneath the noise were ~14 genuinely dead/moved citation URLs.
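A breakdown like the one above can be produced by tallying broken links per status code and host. The sketch below assumes a report whose entries carry `url`, `status`, and `state` fields (the shape I believe linkinator's JSON output uses; treat the field names as assumptions). `SAMPLE_LINKS` is made-up illustrative data, not the real run's output.

```python
from collections import Counter
from typing import Dict, List
from urllib.parse import urlparse

# Made-up sample in the assumed report shape; not data from the real run.
SAMPLE_LINKS: List[Dict[str, object]] = [
    {"url": "https://www.w3.org/TR/example/", "status": 403, "state": "BROKEN"},
    {"url": "https://www.w3.org/TR/other/", "status": 403, "state": "BROKEN"},
    {"url": "https://developers.facebook.com/docs/example", "status": 400, "state": "BROKEN"},
    {"url": "https://example.org/moved-page", "status": 404, "state": "BROKEN"},
    {"url": "https://example.org/fine", "status": 200, "state": "OK"},
]


def broken_by_status_and_host(links) -> Counter:
    """Count broken links per (status code, host) pair."""
    tally: Counter = Counter()
    for link in links:
        if link["state"] != "BROKEN":
            continue  # only failures are interesting for triage
        host = urlparse(str(link["url"])).hostname or "(no host)"
        tally[(int(link["status"]), host)] += 1
    return tally


if __name__ == "__main__":
    for (status, host), count in broken_by_status_and_host(SAMPLE_LINKS).most_common():
        print(f"{status} (x{count}) {host}")
```

Grouping by host as well as status makes bot-blocking stand out: many identical 403s or 429s from one host point at the checker being blocked, whereas an isolated 404 is more likely real rot.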
What this does
1. Fixes every real dead link (each replacement verified `200` and on-topic):
   - `web-bot-auth`: IETF draft renamed → `draft-meunier-http-message-signatures-directory`
   - `speculation-rules`: No-Vary-Search → MDN reference
   - `bfcache`: dead Chrome docs page → DevTools back/forward-cache page
   - `caa-records`: MDN entry deleted → RFC 8657 (CAA ACME extensions; keeps it standards-led)
   - `privacy-policy`: EDPB transparency guidelines → current slug
   - `content-signals`: IAB group renamed → Content Monetization Protocols (CoMP) for AI
   - `data-minimization`: ICO dropped the `/the-principles/` path segment
   - `script-loading` + `critical-css`: render-blocking → Chrome for Developers
   - `scrollbar-gutter`: web.dev article → Baseline scrollbar-props post
   - `css-containment`: web.dev learn page deleted → web.dev content-visibility
   - `accessibility-overlays`: WebAIM overlay survey → Practitioners Survey #3 (body sentence reworded to match the source)
   - `view-transitions`: WebKit blog 16557 → 16967
   - `cookie-consent`: CNIL cookies → current "new guidelines" page
   - `nlweb`: `docs/nlweb-rest.md` → `docs/nlweb-rest-api.md`

2. Hardens the sweep so it stops crying wolf (`linkinator.config.json`):
   - `retry` + `retryErrors`: re-attempt transient 429s / 5xx
   - `concurrency: 25` + 30s timeout: gentler crawl, fewer self-inflicted 429s
   - `skip[]`: only hosts that hard-block any headless checker (W3C/validator/securityheaders behind Cloudflare, facebook devs, our own repo chrome + `/edit/` links, the POST-only a2a endpoint, `developer.android.com`, the reserved `example.com`). Third-party citations stay checked. Rationale documented inline in `links.yml`.

Verification

Local full crawl with the new config: 211 → 0 real failures (1229 links scanned). `astro check` clean (0 errors), lint/format gate green.

🤖 Generated with Claude Code
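For readers who want a feel for the hardened settings, here is a rough Python sketch that builds a config of the shape described above and shows how skip patterns would filter URLs. The real `linkinator.config.json` is not reproduced in this PR description, so everything except `concurrency: 25` and the 30s timeout is an assumption: the skip patterns are illustrative guesses for a few of the hosts named above, and `is_skipped` and `render_config` are hypothetical helpers.

```python
import json
import re
from typing import Any, Dict

# Keys mirror the options named in the PR; the skip patterns are illustrative
# guesses, not the contents of the real linkinator.config.json.
CONFIG: Dict[str, Any] = {
    "retry": True,        # honour retry-after on 429 responses
    "retryErrors": True,  # re-attempt 5xx and network errors
    "concurrency": 25,
    "timeout": 30000,     # milliseconds, i.e. the 30s from the PR
    "skip": [
        r"^https://www\.w3\.org/",
        r"^https://developers\.facebook\.com/",
        r"^https://developer\.android\.com/",
        r"^https?://example\.com",
    ],
}


def is_skipped(url: str, config: Dict[str, Any] = CONFIG) -> bool:
    """True if any skip pattern matches the URL (patterns treated as regexes)."""
    return any(re.search(pattern, url) for pattern in config["skip"])


def render_config(config: Dict[str, Any] = CONFIG) -> str:
    """Serialise the settings as the JSON a config file would hold."""
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    print(render_config())
```

Anchoring each pattern to a full host keeps the skip list narrow, in line with the stated goal that third-party citations stay checked.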