feat(archive): add Internet Archive read-only adapter by Benjamin-eecs · Pull Request #1969 · jackwener/OpenCLI

Benjamin-eecs · 2026-06-17T14:56:25Z

Description

Adds a native read-only adapter for the Internet Archive (archive.org). Four commands wrap the four public REST endpoints that cover the most common agent use cases: full-text search across all mediatypes, per-item metadata, the Wayback Machine closest-snapshot lookup, and the Wayback CDX history. No login, no browser, no external CLI wrap. Follows the same Strategy.PUBLIC pattern as hf, smzdm, and the other read-only adapters.

Related issue: Closes #1896.

Type of Change

Checklist

I ran the checks relevant to this PR
I updated tests or docs if needed
I included output or screenshots when useful

Documentation (if adding/modifying an adapter)

Added doc page under docs/adapters/ (if new adapter)
Updated docs/adapters/index.md table (if new adapter)
Updated sidebar in docs/.vitepress/config.mts (if new adapter)
Updated README.md / README.zh-CN.md when command discoverability changed
Used positional args for the command's primary subject unless a named flag is clearly better
Normalized expected adapter failures to CliError subclasses instead of raw Error

Screenshots / Output

Commands:

archive search <query>      [--mediatype texts|movies|audio|software|image|web|data|collection] [--sort downloads|date|addeddate|week|title] [--limit N]
archive item <identifier>
archive wayback <url>        [--timestamp YYYY[MM[DD[hh[mm[ss]]]]] | ISO date]
archive snapshots <url>      [--from YYYY...] [--to YYYY...] [--limit N]
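
For reviewers unfamiliar with advancedsearch.php, the `archive search` arguments above map onto that endpoint's query parameters roughly as follows. This is a minimal sketch, not the code in search.js: `buildSearchUrl` is a hypothetical name, the field list and the `mediatype:` query clause are assumptions, `--sort` is left out because its mapping is not shown here, and a plain `Error` stands in for the typed ArgumentError.

```javascript
// Hypothetical helper; the real clis/archive/search.js is not reproduced in this PR page.
const MEDIATYPES = ['texts', 'movies', 'audio', 'software', 'image', 'web', 'data', 'collection'];

function buildSearchUrl(query, { mediatype, limit = 10 } = {}) {
    if (mediatype !== undefined && !MEDIATYPES.includes(mediatype)) {
        // Stand-in for the typed ArgumentError the adapter raises.
        throw new Error(`archive search mediatype must be one of ${MEDIATYPES.join(', ')}`);
    }
    const url = new URL('https://archive.org/advancedsearch.php');
    // Assumption: the mediatype filter is expressed as a field clause inside q.
    const q = mediatype ? `(${query}) AND mediatype:${mediatype}` : query;
    url.searchParams.set('q', q);
    // Assumption: request only the fields visible in the sample output.
    for (const field of ['identifier', 'title', 'creator', 'mediatype', 'downloads']) {
        url.searchParams.append('fl[]', field);
    }
    url.searchParams.set('rows', String(limit));
    url.searchParams.set('output', 'json');
    return url;
}

console.log(buildSearchUrl('machine learning', { mediatype: 'texts', limit: 3 }).toString());
```

Validating `mediatype` before building the URL means a bad flag never reaches the network, which matches the ARGUMENT error path shown further down.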

Live runs against archive.org (real session, no auth):

$ opencli archive search "machine learning" --limit 3 -f json
[
  { "rank": 1, "identifier": "open-syllabus", "title": "Open Syllabus", "mediatype": "collection", "downloads": 129540246, "url": "https://archive.org/details/open-syllabus", ... },
  { "rank": 2, "identifier": "FinalFantasy2_356", "title": "Final Fantasy II (SNES) - 3:56 - Kevin Juang", "creator": "Kevin 'Enhasa' Juang", "mediatype": "movies", "downloads": 6855428, ... },
  ...
]

$ opencli archive item open-syllabus -f json
[ { "identifier": "open-syllabus", "title": "Open Syllabus", "mediatype": "collection", "file_count": 7, "url": "https://archive.org/details/open-syllabus", ... } ]

$ opencli archive wayback wikipedia.org --timestamp 2015 -f json
[ { "original_url": "wikipedia.org", "requested_timestamp": "2015", "snapshot_timestamp": "20151231235819", "snapshot_url": "http://web.archive.org/web/20151231235819/https://www.wikipedia.org/", "status": "200" } ]

$ opencli archive snapshots wikipedia.org --limit 3 -f json
[
  { "timestamp": "20010727112808", "snapshot_url": "https://web.archive.org/web/20010727112808/http://www.wikipedia.org:80/", "status": "200", "mimetype": "text/html", ... },
  ...
]
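
The `archive item` row shown above can be derived from the `/metadata/<identifier>` response along these lines. A minimal sketch under assumptions: `toItemRow` is a hypothetical name, only the columns visible in the sample output are mapped, and the empty-payload check reflects the endpoint typically answering with an empty object for an unknown identifier.

```javascript
// Hypothetical mapper; the real clis/archive/item.js may differ.
function toItemRow(identifier, payload) {
    // Fail closed when the payload has no metadata block (e.g. unknown identifier).
    if (!payload || typeof payload !== 'object' || !payload.metadata) {
        throw new Error(`archive item returned no data for "${identifier}"`);
    }
    const files = Array.isArray(payload.files) ? payload.files : [];
    return {
        identifier,
        title: String(payload.metadata.title ?? ''),
        mediatype: String(payload.metadata.mediatype ?? ''),
        file_count: files.length,
        url: `https://archive.org/details/${identifier}`,
    };
}

// Illustrative payload shaped like a Metadata API response; values are sample data.
const sample = {
    metadata: { title: 'Open Syllabus', mediatype: 'collection' },
    files: new Array(7).fill({}),
};
console.log(toItemRow('open-syllabus', sample));
```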

Typed-error paths also exercised:

$ opencli archive search "x" --mediatype bogus
ok: false
error:
  code: ARGUMENT
  message: archive search mediatype must be one of texts, movies, audio, software, image, web, data, collection

$ opencli archive search "asdkjfhasdkfjhasdkjfh"
ok: false
error:
  code: EMPTY_RESULT
  message: archive search returned no data
  help: No items match "asdkjfhasdkfjhasdkjfh" on archive.org.

$ opencli archive item "bad id with spaces"
ok: false
error:
  code: ARGUMENT
  message: archive item identifier may only contain letters, digits, ".", "_", "-"
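
The identifier rule in the last error can be pictured as a single character-class check that runs before any request is made. This is a sketch only: the `ArgumentError` class below is a local stand-in for OpenCLI's own CliError subclass, and `assertIdentifier` is a hypothetical name.

```javascript
// Local stand-in for OpenCLI's ArgumentError; the real class lives in the OpenCLI codebase.
class ArgumentError extends Error {
    constructor(message) {
        super(message);
        this.name = 'ArgumentError';
        this.code = 'ARGUMENT';
    }
}

// Hypothetical validator: accept only letters, digits, ".", "_" and "-".
function assertIdentifier(raw) {
    const id = String(raw ?? '').trim();
    if (!/^[A-Za-z0-9._-]+$/.test(id)) {
        throw new ArgumentError('archive item identifier may only contain letters, digits, ".", "_", "-"');
    }
    return id;
}

console.log(assertIdentifier('open-syllabus'));
try {
    assertIdentifier('bad id with spaces');
} catch (err) {
    console.log(err.code, err.message);
}
```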

Notes for review:

CDX is served over HTTP only (the HTTPS endpoint returns 503 in practice); snapshots.js documents that next to the URL.
cli-manifest.json regenerated; the diff contains only the four new archive/* entries.

clis/archive/archive.test.js passes 3 / 3 locally; npm run check:typed-error-lint and npm run check:silent-column-drop both report new=0.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a new Internet Archive adapter to OpenCLI, including documentation, sidebar navigation, CLI commands, and registry/manifest entries.

Changes:

Added archive browser adapter docs and linked it from the adapters index + VitePress sidebar.
Implemented new CLI commands: archive search, archive item, archive wayback, archive snapshots.
Added Vitest registry contract tests and updated cli-manifest.json to include the new commands.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| docs/adapters/index.md | Adds `archive` to the adapters registry table. |
| docs/adapters/browser/archive.md | New documentation page for Internet Archive commands. |
| docs/.vitepress/config.mts | Adds “Internet Archive” to the docs sidebar nav. |
| clis/archive/wayback.js | Implements closest Wayback snapshot lookup. |
| clis/archive/snapshots.js | Implements CDX snapshot history listing. |
| clis/archive/search.js | Implements Archive advanced search. |
| clis/archive/item.js | Implements item metadata lookup by identifier. |
| clis/archive/archive.test.js | Adds basic registry contract tests for the new commands. |
| cli-manifest.json | Registers the new `archive/*` commands in the CLI manifest. |

+    strategy: Strategy.PUBLIC,
+    browser: false,
+    args: [
+        { name: 'url', positional: true, required: true, help: 'URL to look up (with or without scheme).' },


+    ],
+    columns: ['original_url', 'requested_timestamp', 'snapshot_timestamp', 'snapshot_url', 'status'],
+    func: async (args) => {
+        const target = String(args.url ?? '').trim();


+        const apiUrl = new URL('https://archive.org/wayback/available');
+        apiUrl.searchParams.set('url', target);
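
For context on the hunk above: the Wayback availability endpoint answers with an `archived_snapshots.closest` object, which can be mapped onto the declared columns roughly like this. A minimal sketch, assuming a hypothetical `toWaybackRow` helper and a plain `Error` in place of the adapter's typed errors.

```javascript
// Hypothetical mapper from the /wayback/available payload to the adapter's columns.
function toWaybackRow(target, requestedTimestamp, payload) {
    const closest = payload?.archived_snapshots?.closest;
    // Fail closed when no usable snapshot is reported.
    if (!closest || !closest.available || !closest.url || !closest.timestamp) {
        throw new Error(`archive wayback found no snapshot for "${target}"`);
    }
    return {
        original_url: target,
        requested_timestamp: requestedTimestamp ?? '',
        snapshot_timestamp: String(closest.timestamp),
        snapshot_url: String(closest.url),
        status: String(closest.status ?? ''),
    };
}

// Sample payload built from the values in the PR description's wayback run.
const payload = {
    url: 'wikipedia.org',
    archived_snapshots: {
        closest: {
            status: '200',
            available: true,
            url: 'http://web.archive.org/web/20151231235819/https://www.wikipedia.org/',
            timestamp: '20151231235819',
        },
    },
};
console.log(toWaybackRow('wikipedia.org', '2015', payload));
```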


+    strategy: Strategy.PUBLIC,
+    browser: false,
+    args: [
+        { name: 'url', positional: true, required: true, help: 'URL to look up (with or without scheme).' },


+    ],
+    columns: ['timestamp', 'snapshot_url', 'status', 'mimetype', 'original_url'],
+    func: async (args) => {
+        const target = String(args.url ?? '').trim();


+
+        // Wayback CDX is served on HTTP only; the HTTPS endpoint returns 503.
+        const apiUrl = new URL('http://web.archive.org/cdx/search/cdx');
+        apiUrl.searchParams.set('url', target);
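
The remaining `archive snapshots` flags would plausibly be forwarded as CDX query parameters. The sketch below assumes a hypothetical `buildCdxUrl` helper and that `--from`, `--to`, and `--limit` map directly to the CDX server's `from`, `to`, and `limit` parameters with `output=json`; the hunk above only confirms the base URL and the `url` parameter.

```javascript
// Hypothetical helper; only the base URL and the `url` parameter are confirmed by the diff.
function buildCdxUrl(target, { from, to, limit } = {}) {
    // The PR notes that CDX is queried over HTTP because HTTPS returned 503 for the author.
    const url = new URL('http://web.archive.org/cdx/search/cdx');
    url.searchParams.set('url', target);
    url.searchParams.set('output', 'json');
    if (from) url.searchParams.set('from', from);
    if (to) url.searchParams.set('to', to);
    if (limit) url.searchParams.set('limit', String(limit));
    return url;
}

console.log(buildCdxUrl('wikipedia.org', { from: '2001', limit: 3 }).toString());
```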



+        const [header, ...rows] = data;
+        const cols = {};
+        header.forEach((name, i) => { cols[name] = i; });
+
+        return rows.slice(0, limit).map(row => {
+            const timestamp = String(row[cols.timestamp] ?? '');
+            const original = String(row[cols.original] ?? '');
+            return {
+                timestamp,
+                snapshot_url: buildWaybackUrl(timestamp, original),
+                status: String(row[cols.statuscode] ?? ''),
+                mimetype: String(row[cols.mimetype] ?? ''),
+                original_url: original,
+            };
+        });
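
To make the header-indexed mapping above concrete, here is a self-contained sketch run against a sample CDX JSON payload (first row is the header, later rows are captures). `parseCdx` is a hypothetical wrapper, the `buildWaybackUrl` body is inferred from the snapshot URLs in the PR description, and the shape checks are only a guess at what "fail closed" means in the follow-up commits.

```javascript
// Assumption: URL shape inferred from the sample output; the real helper is not shown.
function buildWaybackUrl(timestamp, original) {
    return `https://web.archive.org/web/${timestamp}/${original}`;
}

// Hypothetical wrapper around the mapping in the hunk above, with added shape checks.
function parseCdx(data, limit) {
    if (!Array.isArray(data)) throw new Error('CDX payload is not an array');
    if (data.length === 0) return [];
    const [header, ...rows] = data;
    if (!Array.isArray(header)) throw new Error('CDX header row is malformed');
    const cols = {};
    header.forEach((name, i) => { cols[name] = i; });
    for (const required of ['timestamp', 'original']) {
        if (!(required in cols)) throw new Error(`CDX header lacks "${required}"`);
    }
    return rows.slice(0, limit).map(row => {
        if (!Array.isArray(row)) throw new Error('CDX row is malformed');
        const timestamp = String(row[cols.timestamp] ?? '');
        const original = String(row[cols.original] ?? '');
        return {
            timestamp,
            snapshot_url: buildWaybackUrl(timestamp, original),
            status: String(row[cols.statuscode] ?? ''),
            mimetype: String(row[cols.mimetype] ?? ''),
            original_url: original,
        };
    });
}

// Sample payload; digest and length are placeholders, not real capture data.
const sampleCdx = [
    ['urlkey', 'timestamp', 'original', 'mimetype', 'statuscode', 'digest', 'length'],
    ['org,wikipedia)/', '20010727112808', 'http://www.wikipedia.org:80/', 'text/html', '200', 'PLACEHOLDER', '0'],
];
console.log(parseCdx(sampleCdx, 3));
```

Indexing by header name rather than by fixed position keeps the mapping stable if the CDX server returns columns in a different order.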


+function normalizeTimestamp(raw) {
+    // Accept YYYY, YYYYMM, YYYYMMDD, YYYYMMDDhh, YYYYMMDDhhmm, YYYYMMDDhhmmss,
+    // YYYY-MM-DD, or YYYY-MM-DDThh:mm:ss. Strip non-digits and validate length.
+    const digits = String(raw).replace(/[^0-9]/g, '');
+    if (!/^\d{4,14}$/.test(digits) || digits.length % 2 !== 0 && digits.length !== 4) {
+        throw new ArgumentError('archive wayback timestamp must be YYYY[MM[DD[hh[mm[ss]]]]] or an ISO date');
+    }
+    return digits;
+}
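
A note on the length guard in `normalizeTimestamp`: because `&&` binds tighter than `||`, the condition reads as "not 4 to 14 digits, or (odd length and length is not 4)". Since 4 is even, the `digits.length !== 4` clause never changes the result, so the accepted lengths are exactly 4, 6, 8, 10, 12, and 14. The sketch below states that rule explicitly; `normalizeTimestampSketch` is a hypothetical name and `RangeError` stands in for ArgumentError.

```javascript
// Equivalent formulation of the guard: only even lengths from 4 to 14 are accepted.
const VALID_LENGTHS = new Set([4, 6, 8, 10, 12, 14]);

function normalizeTimestampSketch(raw) {
    // Strip separators such as "-", ":" and "T" so ISO dates collapse to digits.
    const digits = String(raw).replace(/[^0-9]/g, '');
    if (!VALID_LENGTHS.has(digits.length)) {
        // Stand-in for the adapter's ArgumentError.
        throw new RangeError('archive wayback timestamp must be YYYY[MM[DD[hh[mm[ss]]]]] or an ISO date');
    }
    return digits;
}

for (const input of ['2015', '2015-12-31', '2015-12-31T23:58:19', '20151']) {
    try {
        console.log(input, '->', normalizeTimestampSketch(input));
    } catch (err) {
        console.log(input, '-> rejected');
    }
}
```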


Benjamin-eecs added 2 commits June 17, 2026 23:55

feat(archive): add Internet Archive read-only adapter

220851e

Native fetch against archive.org public REST APIs: advancedsearch.php, /metadata/, Wayback /available, and the CDX history endpoint. Four commands cover the most common item / snapshot lookups for AI agents. Closes jackwener#1896

docs(archive): wire adapter into docs/sidebar/index for doc-coverage gate

cdd4d77

Benjamin-eecs marked this pull request as ready for review June 17, 2026 15:03

Copilot AI review requested due to automatic review settings June 17, 2026 15:03

Copilot AI reviewed Jun 17, 2026

Benjamin-eecs and others added 3 commits June 18, 2026 00:08

refactor(archive): match hf/paper ArgumentError hint style for empty/invalid id paths

1449663

fix(archive): fail closed on malformed API payloads

cede7ed

fix(archive): fail closed on malformed CDX rows

5f5661c

jackwener merged commit d2abdcc into jackwener:main Jun 18, 2026
11 checks passed

jackwener merged 5 commits into jackwener:main from Benjamin-eecs:feat/archive-org-adapter
