feat(archive): add Internet Archive read-only adapter#1969

Merged
jackwener merged 5 commits into jackwener:main from Benjamin-eecs:feat/archive-org-adapter
Jun 18, 2026

Conversation

Benjamin-eecs (Contributor) commented Jun 17, 2026

Description

Adds a native read-only adapter for the Internet Archive (archive.org). Four commands wrap the four public REST endpoints that cover the most common agent use cases: full-text search across all mediatypes, per-item metadata, the Wayback Machine closest-snapshot lookup, and the Wayback CDX history. No login, no browser, and no wrapped external CLI. Follows the same Strategy.PUBLIC pattern as hf, smzdm, and the other read-only adapters.

Related issue: Closes #1896.

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 🌐 New site adapter
  • 📝 Documentation
  • ♻️ Refactor
  • 🔧 CI / build / tooling

Checklist

  • I ran the checks relevant to this PR
  • I updated tests or docs if needed
  • I included output or screenshots when useful

Documentation (if adding/modifying an adapter)

  • Added doc page under docs/adapters/ (if new adapter)
  • Updated docs/adapters/index.md table (if new adapter)
  • Updated sidebar in docs/.vitepress/config.mts (if new adapter)
  • Updated README.md / README.zh-CN.md when command discoverability changed
  • Used positional args for the command's primary subject unless a named flag is clearly better
  • Normalized expected adapter failures to CliError subclasses instead of raw Error

Screenshots / Output

Commands:

archive search <query>      [--mediatype texts|movies|audio|software|image|web|data|collection] [--sort downloads|date|addeddate|week|title] [--limit N]
archive item <identifier>
archive wayback <url>        [--timestamp YYYY[MM[DD[hh[mm[ss]]]]] | ISO date]
archive snapshots <url>      [--from YYYY...] [--to YYYY...] [--limit N]
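
As a rough illustration of how the `archive search` flags could map onto advancedsearch.php, here is a minimal sketch. The helper name `buildSearchUrl`, the returned field list, and the descending sort direction are assumptions for illustration; the actual code in clis/archive/search.js may differ.

```javascript
// Hypothetical helper: builds an advancedsearch.php URL from CLI-style options.
// The field list and the "desc" sort direction are assumptions, not taken from the PR.
function buildSearchUrl(query, { mediatype, sort = 'downloads', limit = 10 } = {}) {
  const url = new URL('https://archive.org/advancedsearch.php');
  // Restrict by mediatype inside the query string when one is given.
  const q = mediatype ? `(${query}) AND mediatype:${mediatype}` : query;
  url.searchParams.set('q', q);
  // advancedsearch.php takes repeated fl[] parameters for the fields to return.
  for (const field of ['identifier', 'title', 'creator', 'mediatype', 'downloads']) {
    url.searchParams.append('fl[]', field);
  }
  url.searchParams.append('sort[]', `${sort} desc`);
  url.searchParams.set('rows', String(limit));
  url.searchParams.set('output', 'json');
  return url;
}

const searchUrl = buildSearchUrl('machine learning', { mediatype: 'texts', limit: 3 });
console.log(searchUrl.toString());
```

Building the URL with `URL` and `searchParams` keeps query encoding out of the command logic, which matches how the wayback and snapshots commands construct their requests.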

Live runs against archive.org (real session, no auth):

$ opencli archive search "machine learning" --limit 3 -f json
[
  { "rank": 1, "identifier": "open-syllabus", "title": "Open Syllabus", "mediatype": "collection", "downloads": 129540246, "url": "https://archive.org/details/open-syllabus", ... },
  { "rank": 2, "identifier": "FinalFantasy2_356", "title": "Final Fantasy II (SNES) - 3:56 - Kevin Juang", "creator": "Kevin 'Enhasa' Juang", "mediatype": "movies", "downloads": 6855428, ... },
  ...
]
$ opencli archive item open-syllabus -f json
[ { "identifier": "open-syllabus", "title": "Open Syllabus", "mediatype": "collection", "file_count": 7, "url": "https://archive.org/details/open-syllabus", ... } ]
$ opencli archive wayback wikipedia.org --timestamp 2015 -f json
[ { "original_url": "wikipedia.org", "requested_timestamp": "2015", "snapshot_timestamp": "20151231235819", "snapshot_url": "http://web.archive.org/web/20151231235819/https://www.wikipedia.org/", "status": "200" } ]
$ opencli archive snapshots wikipedia.org --limit 3 -f json
[
  { "timestamp": "20010727112808", "snapshot_url": "https://web.archive.org/web/20010727112808/http://www.wikipedia.org:80/", "status": "200", "mimetype": "text/html", ... },
  ...
]
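
The `archive wayback` row above can be derived from the Wayback availability response roughly as follows. This is a sketch, not the PR's code: `toWaybackRow` is a hypothetical name, and the sample payload is assembled from the live output shown above in the shape the availability API returns (`archived_snapshots.closest`).

```javascript
// Hypothetical mapper from the Wayback availability payload to the command's columns.
function toWaybackRow(originalUrl, requestedTimestamp, payload) {
  const closest = payload?.archived_snapshots?.closest;
  // No closest snapshot (or one flagged unavailable) means there is nothing to report.
  if (!closest || !closest.available) return null;
  return {
    original_url: originalUrl,
    requested_timestamp: requestedTimestamp,
    snapshot_timestamp: String(closest.timestamp ?? ''),
    snapshot_url: String(closest.url ?? ''),
    status: String(closest.status ?? ''),
  };
}

// Sample payload built from the values in the live run above.
const samplePayload = {
  url: 'wikipedia.org',
  archived_snapshots: {
    closest: {
      status: '200',
      available: true,
      url: 'http://web.archive.org/web/20151231235819/https://www.wikipedia.org/',
      timestamp: '20151231235819',
    },
  },
};

console.log(toWaybackRow('wikipedia.org', '2015', samplePayload));
```

Returning `null` for a missing snapshot leaves the caller free to raise a typed empty-result error instead of emitting a blank row.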

Typed-error paths also exercised:

$ opencli archive search "x" --mediatype bogus
ok: false
error:
  code: ARGUMENT
  message: archive search mediatype must be one of texts, movies, audio, software, image, web, data, collection
$ opencli archive search "asdkjfhasdkfjhasdkjfh"
ok: false
error:
  code: EMPTY_RESULT
  message: archive search returned no data
  help: No items match "asdkjfhasdkfjhasdkjfh" on archive.org.
$ opencli archive item "bad id with spaces"
ok: false
error:
  code: ARGUMENT
  message: archive item identifier may only contain letters, digits, ".", "_", "-"
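
The identifier check behind the last error could look something like the sketch below. The `ArgumentError` class here is a stand-in defined locally for illustration (OpenCLI ships its own CliError subclasses), and the regex is inferred from the error message rather than copied from clis/archive/item.js.

```javascript
// Stand-in for the project's typed error; defined here only to keep the sketch self-contained.
class ArgumentError extends Error {
  constructor(message) {
    super(message);
    this.name = 'ArgumentError';
    this.code = 'ARGUMENT';
  }
}

// Inferred from the message: letters, digits, ".", "_", "-" only.
const IDENTIFIER_RE = /^[A-Za-z0-9._-]+$/;

function assertIdentifier(raw) {
  const id = String(raw ?? '').trim();
  if (!IDENTIFIER_RE.test(id)) {
    throw new ArgumentError('archive item identifier may only contain letters, digits, ".", "_", "-"');
  }
  return id;
}

console.log(assertIdentifier('open-syllabus'));
try {
  assertIdentifier('bad id with spaces');
} catch (err) {
  console.log(err.code, err.message);
}
```

Validating before any network call means a malformed identifier fails fast with code ARGUMENT instead of surfacing as an opaque HTTP error.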

Notes for review:

  • CDX is served over HTTP only (the HTTPS endpoint returns 503 in practice); snapshots.js documents that next to the URL.
  • cli-manifest.json regenerated; the diff contains only the four new archive/* entries.
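
For reference, the CDX request described in the first note might be assembled along these lines. `buildCdxUrl` and its default limit are hypothetical; only the endpoint URL and the `url` parameter appear in the diff, while `output`, `limit`, `from`, and `to` are standard CDX query parameters assumed here to match the command's flags.

```javascript
// Hypothetical builder for the CDX history request used by `archive snapshots`.
function buildCdxUrl(target, { from, to, limit = 20 } = {}) {
  // Plain HTTP on purpose: per the PR notes, the HTTPS endpoint returned 503 in practice.
  const url = new URL('http://web.archive.org/cdx/search/cdx');
  url.searchParams.set('url', target);
  url.searchParams.set('output', 'json');
  url.searchParams.set('limit', String(limit));
  // Only send the optional range bounds when the caller supplied them.
  if (from) url.searchParams.set('from', from);
  if (to) url.searchParams.set('to', to);
  return url;
}

console.log(buildCdxUrl('wikipedia.org', { from: '2010', limit: 3 }).toString());
```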

clis/archive/archive.test.js passes 3 / 3 locally; npm run check:typed-error-lint and npm run check:silent-column-drop both report new=0.

Native fetch against archive.org public REST APIs: advancedsearch.php, /metadata/, Wayback /available, and the CDX history endpoint. Four commands cover the most common item / snapshot lookups for AI agents.

Closes jackwener#1896
Benjamin-eecs marked this pull request as ready for review June 17, 2026 15:03
Copilot AI review requested due to automatic review settings June 17, 2026 15:03

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a new Internet Archive adapter to OpenCLI, including documentation, sidebar navigation, CLI commands, and registry/manifest entries.

Changes:

  • Added archive browser adapter docs and linked it from the adapters index + VitePress sidebar.
  • Implemented new CLI commands: archive search, archive item, archive wayback, archive snapshots.
  • Added Vitest registry contract tests and updated cli-manifest.json to include the new commands.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| docs/adapters/index.md | Adds archive to the adapters registry table. |
| docs/adapters/browser/archive.md | New documentation page for Internet Archive commands. |
| docs/.vitepress/config.mts | Adds “Internet Archive” to the docs sidebar nav. |
| clis/archive/wayback.js | Implements closest Wayback snapshot lookup. |
| clis/archive/snapshots.js | Implements CDX snapshot history listing. |
| clis/archive/search.js | Implements Archive advanced search. |
| clis/archive/item.js | Implements item metadata lookup by identifier. |
| clis/archive/archive.test.js | Adds basic registry contract tests for the new commands. |
| cli-manifest.json | Registers the new archive/* commands in the CLI manifest. |

Comment thread clis/archive/wayback.js
strategy: Strategy.PUBLIC,
browser: false,
args: [
{ name: 'url', positional: true, required: true, help: 'URL to look up (with or without scheme).' },
Comment thread clis/archive/wayback.js
],
columns: ['original_url', 'requested_timestamp', 'snapshot_timestamp', 'snapshot_url', 'status'],
func: async (args) => {
const target = String(args.url ?? '').trim();
Comment thread clis/archive/wayback.js
Comment on lines +39 to +40
const apiUrl = new URL('https://archive.org/wayback/available');
apiUrl.searchParams.set('url', target);
Comment thread clis/archive/snapshots.js
strategy: Strategy.PUBLIC,
browser: false,
args: [
{ name: 'url', positional: true, required: true, help: 'URL to look up (with or without scheme).' },
Comment thread clis/archive/snapshots.js
],
columns: ['timestamp', 'snapshot_url', 'status', 'mimetype', 'original_url'],
func: async (args) => {
const target = String(args.url ?? '').trim();
Comment thread clis/archive/snapshots.js

// Wayback CDX is served on HTTP only; the HTTPS endpoint returns 503.
const apiUrl = new URL('http://web.archive.org/cdx/search/cdx');
apiUrl.searchParams.set('url', target);
Comment thread clis/archive/snapshots.js
Comment on lines +48 to +49
// Wayback CDX is served on HTTP only; the HTTPS endpoint returns 503.
const apiUrl = new URL('http://web.archive.org/cdx/search/cdx');
Comment thread clis/archive/wayback.js
Comment on lines +13 to +15
  if (!/^\d{4,14}$/.test(digits) || digits.length % 2 !== 0 && digits.length !== 4) {
    throw new ArgumentError('archive wayback timestamp must be YYYY[MM[DD[hh[mm[ss]]]]] or an ISO date');
  }
Comment thread clis/archive/snapshots.js
Comment on lines +81 to +95
    const [header, ...rows] = data;
    const cols = {};
    header.forEach((name, i) => { cols[name] = i; });

    return rows.slice(0, limit).map(row => {
      const timestamp = String(row[cols.timestamp] ?? '');
      const original = String(row[cols.original] ?? '');
      return {
        timestamp,
        snapshot_url: buildWaybackUrl(timestamp, original),
        status: String(row[cols.statuscode] ?? ''),
        mimetype: String(row[cols.mimetype] ?? ''),
        original_url: original,
      };
    });
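
The header-to-index mapping in this hunk can be exercised in isolation. The sketch below is a standalone variant under stated assumptions: `parseCdxRows` is a hypothetical name, the `buildWaybackUrl` body is guessed from the snapshot URL format in the PR's sample output, and the sample uses a reduced header (as if the field list had been narrowed) with values taken from the live run shown earlier.

```javascript
// Assumed shape of the helper referenced in the diff; its real body is not shown in this PR view.
function buildWaybackUrl(timestamp, original) {
  return `https://web.archive.org/web/${timestamp}/${original}`;
}

// Hypothetical standalone version of the row mapping, with a guard for empty responses.
function parseCdxRows(data, limit) {
  // CDX JSON output is an array of arrays whose first entry is the header row.
  if (!Array.isArray(data) || data.length < 2) return [];
  const [header, ...rows] = data;
  // Look columns up by name so the mapping does not depend on column order.
  const cols = Object.fromEntries(header.map((name, i) => [name, i]));
  return rows.slice(0, limit).map((row) => {
    const timestamp = String(row[cols.timestamp] ?? '');
    const original = String(row[cols.original] ?? '');
    return {
      timestamp,
      snapshot_url: buildWaybackUrl(timestamp, original),
      status: String(row[cols.statuscode] ?? ''),
      mimetype: String(row[cols.mimetype] ?? ''),
      original_url: original,
    };
  });
}

const sampleCdx = [
  ['timestamp', 'original', 'mimetype', 'statuscode'],
  ['20010727112808', 'http://www.wikipedia.org:80/', 'text/html', '200'],
];
console.log(parseCdxRows(sampleCdx, 3));
```

The early return covers a response with only a header or no rows at all, a case the destructuring in the hunk would otherwise have to handle upstream.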
Comment thread clis/archive/wayback.js
Comment on lines +9 to +17
function normalizeTimestamp(raw) {
  // Accept YYYY, YYYYMM, YYYYMMDD, YYYYMMDDhh, YYYYMMDDhhmm, YYYYMMDDhhmmss,
  // YYYY-MM-DD, or YYYY-MM-DDThh:mm:ss. Strip non-digits and validate length.
  const digits = String(raw).replace(/[^0-9]/g, '');
  if (!/^\d{4,14}$/.test(digits) || digits.length % 2 !== 0 && digits.length !== 4) {
    throw new ArgumentError('archive wayback timestamp must be YYYY[MM[DD[hh[mm[ss]]]]] or an ISO date');
  }
  return digits;
}
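
Because `&&` binds tighter than `||`, the condition in this hunk rejects anything shorter than 4 digits, longer than 14, or of odd length, so the accepted lengths are 4, 6, 8, 10, 12, and 14. One possible way to state that more directly is sketched below; it is an alternative for discussion, not the merged code, and it throws a plain `RangeError` in place of the project's `ArgumentError` so that it runs standalone.

```javascript
// Lengths of YYYY, YYYYMM, YYYYMMDD, YYYYMMDDhh, YYYYMMDDhhmm, YYYYMMDDhhmmss.
const VALID_TIMESTAMP_LENGTHS = new Set([4, 6, 8, 10, 12, 14]);

// Alternative sketch: same accepted lengths as the hunk above, expressed as an explicit set.
function normalizeTimestamp(raw) {
  // Strip separators such as "-", ":" and "T" so ISO-style dates reduce to digits.
  const digits = String(raw).replace(/\D/g, '');
  if (!VALID_TIMESTAMP_LENGTHS.has(digits.length)) {
    // RangeError stands in for the project's ArgumentError in this sketch.
    throw new RangeError('archive wayback timestamp must be YYYY[MM[DD[hh[mm[ss]]]]] or an ISO date');
  }
  return digits;
}

console.log(normalizeTimestamp('2015'));                // 2015
console.log(normalizeTimestamp('2015-06-17'));          // 20150617
console.log(normalizeTimestamp('2015-06-17T10:30:00')); // 20150617103000
```

Like the original, this only checks length, so an out-of-range value such as month 13 still passes; rejecting those would need a separate check.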
jackwener merged commit d2abdcc into jackwener:main Jun 18, 2026
11 checks passed
Development

Successfully merging this pull request may close these issues.

[Feature]: archive.org How could such an important website not be included in this tool?

3 participants