feat(archive): add Internet Archive read-only adapter #1969
Merged
jackwener merged 5 commits intoJun 18, 2026
Native fetch against archive.org public REST APIs: advancedsearch.php, /metadata/, Wayback /available, and the CDX history endpoint. Four commands cover the most common item / snapshot lookups for AI agents. Closes jackwener#1896
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a new Internet Archive adapter to OpenCLI, including documentation, sidebar navigation, CLI commands, and registry/manifest entries.
Changes:
- Added `archive` browser adapter docs and linked it from the adapters index + VitePress sidebar.
- Implemented new CLI commands: `archive search`, `archive item`, `archive wayback`, `archive snapshots`.
- Added Vitest registry contract tests and updated `cli-manifest.json` to include the new commands.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| docs/adapters/index.md | Adds archive to the adapters registry table. |
| docs/adapters/browser/archive.md | New documentation page for Internet Archive commands. |
| docs/.vitepress/config.mts | Adds “Internet Archive” to the docs sidebar nav. |
| clis/archive/wayback.js | Implements closest Wayback snapshot lookup. |
| clis/archive/snapshots.js | Implements CDX snapshot history listing. |
| clis/archive/search.js | Implements Archive advanced search. |
| clis/archive/item.js | Implements item metadata lookup by identifier. |
| clis/archive/archive.test.js | Adds basic registry contract tests for the new commands. |
| cli-manifest.json | Registers the new archive/* commands in the CLI manifest. |

    strategy: Strategy.PUBLIC,
    browser: false,
    args: [
        { name: 'url', positional: true, required: true, help: 'URL to look up (with or without scheme).' },
    ],
    columns: ['original_url', 'requested_timestamp', 'snapshot_timestamp', 'snapshot_url', 'status'],
    func: async (args) => {
        const target = String(args.url ?? '').trim();

Comment on lines +39 to +40

    const apiUrl = new URL('https://archive.org/wayback/available');
    apiUrl.searchParams.set('url', target);

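For readers unfamiliar with the availability endpoint, the sketch below shows how its JSON response could be flattened into one row using the `columns` declared for `archive wayback`. The helper names `buildAvailabilityUrl` and `toWaybackRow`, the optional `timestamp` handling, and the empty-string fallbacks are assumptions for illustration; they are not taken from the PR.

```javascript
// Hypothetical helper: build the request URL the same way the diff does,
// optionally adding the `timestamp` query parameter.
function buildAvailabilityUrl(target, timestamp) {
  const apiUrl = new URL('https://archive.org/wayback/available');
  apiUrl.searchParams.set('url', target);
  if (timestamp) apiUrl.searchParams.set('timestamp', timestamp);
  return apiUrl;
}

// Hypothetical helper: map a parsed response body to one output row.
// Returns null when the API reports no snapshot (empty `archived_snapshots`).
function toWaybackRow(target, requestedTimestamp, payload) {
  const closest = payload?.archived_snapshots?.closest;
  if (!closest || !closest.available) return null;
  return {
    original_url: target,
    requested_timestamp: requestedTimestamp ?? '',
    snapshot_timestamp: String(closest.timestamp ?? ''),
    snapshot_url: String(closest.url ?? ''),
    status: String(closest.status ?? ''),
  };
}
```

Returning `null` for the no-snapshot case is only one option; the adapter may instead raise a typed error, which this sketch does not attempt to reproduce.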

    strategy: Strategy.PUBLIC,
    browser: false,
    args: [
        { name: 'url', positional: true, required: true, help: 'URL to look up (with or without scheme).' },
    ],
    columns: ['timestamp', 'snapshot_url', 'status', 'mimetype', 'original_url'],
    func: async (args) => {
        const target = String(args.url ?? '').trim();

        // Wayback CDX is served on HTTP only; the HTTPS endpoint returns 503.
        const apiUrl = new URL('http://web.archive.org/cdx/search/cdx');
        apiUrl.searchParams.set('url', target);

Comment on lines +48 to +49

    // Wayback CDX is served on HTTP only; the HTTPS endpoint returns 503.
    const apiUrl = new URL('http://web.archive.org/cdx/search/cdx');

Comment on lines +13 to +15

    if (!/^\d{4,14}$/.test(digits) || digits.length % 2 !== 0 && digits.length !== 4) {
        throw new ArgumentError('archive wayback timestamp must be YYYY[MM[DD[hh[mm[ss]]]]] or an ISO date');
    }

Comment on lines +81 to +95

    const [header, ...rows] = data;
    const cols = {};
    header.forEach((name, i) => { cols[name] = i; });

    return rows.slice(0, limit).map(row => {
        const timestamp = String(row[cols.timestamp] ?? '');
        const original = String(row[cols.original] ?? '');
        return {
            timestamp,
            snapshot_url: buildWaybackUrl(timestamp, original),
            status: String(row[cols.statuscode] ?? ''),
            mimetype: String(row[cols.mimetype] ?? ''),
            original_url: original,
        };
    });

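To make the header-indexed mapping above concrete, here is a self-contained sketch run against a small CDX-style payload. The sample rows are invented for illustration, `mapCdxRows` is a hypothetical wrapper name, and `buildWaybackUrl` is a stand-in that assumes the usual `https://web.archive.org/web/<timestamp>/<original>` replay URL shape, since the PR's own helper is not shown in this hunk.

```javascript
// Stand-in for the PR's helper (not shown in the diff); the URL shape is an assumption.
function buildWaybackUrl(timestamp, original) {
  return `https://web.archive.org/web/${timestamp}/${original}`;
}

// Hypothetical wrapper around the mapping logic from the diff.
function mapCdxRows(data, limit) {
  if (!Array.isArray(data) || data.length === 0) return [];
  const [header, ...rows] = data;
  // Index columns by name so the mapping does not depend on field order.
  const cols = {};
  header.forEach((name, i) => { cols[name] = i; });

  return rows.slice(0, limit).map(row => {
    const timestamp = String(row[cols.timestamp] ?? '');
    const original = String(row[cols.original] ?? '');
    return {
      timestamp,
      snapshot_url: buildWaybackUrl(timestamp, original),
      status: String(row[cols.statuscode] ?? ''),
      mimetype: String(row[cols.mimetype] ?? ''),
      original_url: original,
    };
  });
}

// Invented sample in the JSON layout CDX uses: first row is the header.
const sample = [
  ['urlkey', 'timestamp', 'original', 'mimetype', 'statuscode', 'digest', 'length'],
  ['com,example)/', '20200101000000', 'http://example.com/', 'text/html', '200', 'AAAA', '1234'],
  ['com,example)/', '20210101000000', 'http://example.com/', 'text/html', '301', 'BBBB', '567'],
];

console.log(mapCdxRows(sample, 1));
```

Looking columns up by header name rather than by fixed position keeps the mapping stable if the requested field list changes.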
Comment on lines +9 to +17

    function normalizeTimestamp(raw) {
        // Accept YYYY, YYYYMM, YYYYMMDD, YYYYMMDDhh, YYYYMMDDhhmm, YYYYMMDDhhmmss,
        // YYYY-MM-DD, or YYYY-MM-DDThh:mm:ss. Strip non-digits and validate length.
        const digits = String(raw).replace(/[^0-9]/g, '');
        if (!/^\d{4,14}$/.test(digits) || digits.length % 2 !== 0 && digits.length !== 4) {
            throw new ArgumentError('archive wayback timestamp must be YYYY[MM[DD[hh[mm[ss]]]]] or an ISO date');
        }
        return digits;
    }

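The validation condition is easier to read with its operator grouping made explicit: `&&` binds tighter than `||`, so the check is "not 4 to 14 digits, or (odd length and not length 4)". The sketch below reproduces the function with that grouping written out and a local stand-in for `ArgumentError` (the real class lives in the project and is not shown here), then exercises a few inputs.

```javascript
// Stand-in for the project's ArgumentError, defined only so this sketch runs alone.
class ArgumentError extends Error {}

function normalizeTimestamp(raw) {
  const digits = String(raw).replace(/[^0-9]/g, '');
  // Same condition as the diff, with the implicit precedence parenthesised.
  // An odd length can never equal 4, so the second clause reduces to "odd length".
  if (!/^\d{4,14}$/.test(digits) || (digits.length % 2 !== 0 && digits.length !== 4)) {
    throw new ArgumentError('archive wayback timestamp must be YYYY[MM[DD[hh[mm[ss]]]]] or an ISO date');
  }
  return digits;
}

// Helper for the demo: report whether an input is accepted.
function accepts(raw) {
  try { normalizeTimestamp(raw); return true; } catch { return false; }
}

console.log(normalizeTimestamp('2024'));                // 2024
console.log(normalizeTimestamp('2024-01-15'));          // 20240115
console.log(normalizeTimestamp('2024-01-15T10:30:00')); // 20240115103000
console.log(accepts('20241'));                          // false (odd length)
console.log(accepts('2024-1-5'));                       // true (digits collapse to 202415)
```

The last case suggests that stripping non-digits lets non-zero-padded dates through as a different timestamp, which may be worth a stricter check.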
Description
Adds a native read-only adapter for the Internet Archive (archive.org). Four commands wrap the four public REST endpoints that cover the most common agent use cases: full-text search across all mediatypes, per-item metadata, the Wayback Machine closest-snapshot lookup, and the Wayback CDX history. No login, no browser, no external CLI wrap. Follows the same `Strategy.PUBLIC` pattern as `hf`, `smzdm`, and the other read-only adapters.

Related issue: Closes #1896.
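As a rough illustration of the two item-level endpoints named in the description, the sketch below builds request URLs for `advancedsearch.php` and `/metadata/`. The function names, the chosen `fl[]` fields, and the default row count are assumptions for illustration; the query parameters (`q`, `fl[]`, `rows`, `output`) reflect the public archive.org interface as commonly documented, not code copied from this PR.

```javascript
// Hypothetical builder for an advanced search request returning JSON.
function buildSearchUrl(query, rows = 10) {
  const apiUrl = new URL('https://archive.org/advancedsearch.php');
  apiUrl.searchParams.set('q', query);
  // fl[] may repeat; each entry names one field to return (field choice is an assumption).
  apiUrl.searchParams.append('fl[]', 'identifier');
  apiUrl.searchParams.append('fl[]', 'title');
  apiUrl.searchParams.set('rows', String(rows));
  apiUrl.searchParams.set('output', 'json');
  return apiUrl;
}

// Hypothetical builder for a per-item metadata request.
function buildMetadataUrl(identifier) {
  return new URL(`https://archive.org/metadata/${encodeURIComponent(identifier)}`);
}

console.log(buildSearchUrl('wayback machine', 5).toString());
console.log(buildMetadataUrl('example-item').toString());
```

Both URLs can be passed straight to the global `fetch` in current Node.js releases, which matches the "native fetch, no browser" approach the description mentions.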
Type of Change
Checklist
Documentation (if adding/modifying an adapter)
- `docs/adapters/` (if new adapter)
- `docs/adapters/index.md` table (if new adapter)
- `docs/.vitepress/config.mts` (if new adapter)
- `README.md` / `README.zh-CN.md` when command discoverability changed
- `CliError` subclasses instead of raw `Error`

Screenshots / Output
Commands:
Live runs against archive.org (real session, no auth):
Typed-error paths also exercised:
Notes for review:
- `snapshots.js` documents that next to the URL.
- `cli-manifest.json` regenerated; the diff contains only the four new `archive/*` entries.
- `clis/archive/archive.test.js` passes 3 / 3 locally; `npm run check:typed-error-lint` and `npm run check:silent-column-drop` both report `new=0`.