Basic CLI by gaurav · Pull Request #1 · TranslatorSRI/babel-explorer

gaurav · 2025-12-03T07:28:19Z

Introduces babel-explorer, a CLI tool to query Babel intermediate files (Parquet) via DuckDB and NodeNorm. BabelDownloader handles caching/freshness, BabelXRefs handles querying, NodeNorm handles label enrichment, and cli.py wires them together with Click.

- Add IdentifierRecord dataclass to babel_xrefs.py (resolves TODO)
- Add 89 tests across 3 files: test_downloader (26), test_babel_xrefs (31), test_nodenorm (23)
- Unit tests (71) use mocks and run without network; integration tests (18) use real downloads/APIs
- Add session-scoped fixtures in conftest.py for shared Parquet file downloads
- Parametrize integration tests over tests/data/valid_curies.txt for easy expansion
- Add integration and slow pytest markers to pyproject.toml
- Update CLAUDE.md and README.md with testing documentation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

This pull request implements a basic version of babel-explorer in Python using the uv package manager. It's a tool for querying Babel intermediate files to understand why biological/chemical identifiers are considered equivalent. The implementation includes a downloader for large Parquet files with MD5 validation and resume support, NodeNorm API integration for label enrichment, DuckDB-based cross-reference querying, and a Click-based CLI.

Changes:

- Initial project structure with uv-based package management (pyproject.toml, Python 3.11+)
- Core functionality: BabelDownloader with streaming downloads and MD5 validation, NodeNorm API client with LRU caching, BabelXRefs for DuckDB-based Parquet queries
- CLI with three commands: xrefs, ids, and test-concord
- Comprehensive test suite with 80 tests split between unit tests (mocked) and integration tests (real network calls)

Reviewed changes

Copilot reviewed 15 out of 19 changed files in this pull request and generated 9 comments.


| File | Description |
| --- | --- |
| pyproject.toml | Project configuration with dependencies (click, duckdb, requests, tqdm) and pytest markers |
| .python-version | Specifies Python 3.11 requirement |
| .gitignore | Excludes /data directory for downloaded files |
| README.md | User documentation with setup, usage examples, and testing instructions |
| CLAUDE.md | AI assistant guidance documentation (contains outdated wget reference) |
| src/babel_explorer/cli.py | Click-based CLI with xrefs, ids, and test-concord commands |
| src/babel_explorer/core/downloader.py | Streaming file downloader with MD5 validation and resume capability |
| src/babel_explorer/core/nodenorm.py | NodeNorm API client for identifier normalization |
| src/babel_explorer/core/babel_xrefs.py | DuckDB-based cross-reference query engine (has frozen dataclass bug) |
| tests/conftest.py | Session-scoped pytest fixtures for shared test resources |
| tests/constants.py | Shared test constants and CURIE loader utility |
| tests/data/valid_curies.txt | Parametrized test data (one CURIE) |
| tests/test_downloader.py | 26 tests for BabelDownloader (22 unit, 3 integration, 1 slow) |
| tests/test_nodenorm.py | 23 tests for NodeNorm (18 unit, 5 integration) |
| tests/test_babel_xrefs.py | 31 tests for BabelXRefs (22 unit, 8 integration, 1 slow) |


Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

- Remove _calculate_md5/_fetch_remote_md5 (too slow on 2.5-3.9 GB files)
- Add sidecar .meta JSON files (ETag, Last-Modified, Content-Length, last_checked)
- Three-tier logic: freshness window → HEAD/ETag check → full re-download
- Add freshness_seconds param to BabelDownloader (default 3h)
- Add --check-download CLI option to xrefs and ids commands (e.g. 3h, never)
- Update tests: replace MD5 test classes with meta/ETag/tier coverage

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
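The three-tier logic in this commit can be sketched as a pure decision function. This is only an illustration under assumptions: `choose_tier`, the shape of the `meta` dict, and the `fetch_remote_etag` callable (standing in for the HEAD request) are hypothetical names, not the code in downloader.py.

```python
from typing import Callable, Optional


def choose_tier(
    meta: Optional[dict],
    now: float,
    freshness_seconds: float,
    fetch_remote_etag: Callable[[], Optional[str]],
) -> str:
    """Return 'fresh', 'unchanged', or 'download' for a cached file.

    meta is the parsed sidecar .meta content (assumed keys: 'etag',
    'last_checked'), or None when no sidecar exists.
    """
    # No sidecar at all: nothing to compare against, so download.
    if meta is None:
        return "download"

    # Tier 1: checked recently enough, skip the network entirely.
    if now - meta["last_checked"] <= freshness_seconds:
        return "fresh"

    # Tier 2: stale, so ask the server (HEAD) and compare ETags.
    remote_etag = fetch_remote_etag()
    if remote_etag is not None and remote_etag == meta.get("etag"):
        return "unchanged"

    # Tier 3: ETag changed or unavailable, so re-download.
    return "download"
```

Passing the HEAD request in as a callable keeps the network call lazy: it only runs when the freshness window has expired.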

Copilot

Pull request overview

Copilot reviewed 17 out of 22 changed files in this pull request and generated 2 comments.


Copilot · 2026-04-01T05:33:35Z

+        return LabeledCrossReference(
+            subj=xref.subj,
+            obj=xref.obj,
+            filename=xref.filename,
+            pred=xref.pred,
+            subj_label=subj_ident.label,
+            subj_biolink_type=subj_ident.biolink_type,
+            obj_label=obj_ident.label,
+            obj_biolink_type=obj_ident.biolink_type,


LabeledCrossReference declares subj_biolink_type/obj_biolink_type as str, but _to_labeled_xref() passes Identifier.biolink_type, which is a list[str] from NodeNorm. This makes the runtime type inconsistent with the dataclass contract and with unit tests that construct LabeledCrossReference using strings. Consider either changing these fields to list[str] (and updating the name/docs accordingly) or serializing the list (e.g., join with ', ' or select a primary type) before constructing the labeled xref.

Suggested change

-        return LabeledCrossReference(
-            subj=xref.subj,
-            obj=xref.obj,
-            filename=xref.filename,
-            pred=xref.pred,
-            subj_label=subj_ident.label,
-            subj_biolink_type=subj_ident.biolink_type,
-            obj_label=obj_ident.label,
-            obj_biolink_type=obj_ident.biolink_type,
+        # NodeNorm.Identifier.biolink_type is a list[str]; LabeledCrossReference expects str.
+        # Normalize by selecting the primary (first) type when a list is provided.
+        subj_biolink_type = (
+            subj_ident.biolink_type[0]
+            if isinstance(subj_ident.biolink_type, list) and subj_ident.biolink_type
+            else subj_ident.biolink_type
+        )
+        obj_biolink_type = (
+            obj_ident.biolink_type[0]
+            if isinstance(obj_ident.biolink_type, list) and obj_ident.biolink_type
+            else obj_ident.biolink_type
+        )
+        return LabeledCrossReference(
+            subj=xref.subj,
+            obj=xref.obj,
+            filename=xref.filename,
+            pred=xref.pred,
+            subj_label=subj_ident.label,
+            subj_biolink_type=subj_biolink_type,
+            obj_label=obj_ident.label,
+            obj_biolink_type=obj_biolink_type,

Copilot · 2026-04-01T05:33:35Z

+    @functools.lru_cache(maxsize=None)
+    def get_downloaded_file(self, dirpath: str, chunk_size: int = 1024 * 1024):
+        """
+        Download a file from the Babel server to local storage with ETag-based caching.
+
+        Three-tier freshness logic:
+        1. If .meta exists and last_checked is within freshness window → return immediately
+        2. If .meta exists but stale → HEAD request to compare ETag; return if unchanged
+        3. If ETag changed or no .meta → full re-download
+


get_downloaded_file() implements time/ETag-based freshness checks, but it is wrapped in @lru_cache. After the first call for a given (dirpath, chunk_size), subsequent calls return from the cache and will never re-evaluate the freshness window or perform the intended HEAD/GET logic in the same process. If the goal is to honor freshness_seconds on each call, consider removing the lru_cache here (or caching only path computation separately) so the three-tier logic can run as documented.
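The effect described in this comment is easy to reproduce in isolation. The sketch below uses a hypothetical `Downloader` class with a counter standing in for the freshness check; it is not the BabelDownloader code, only a demonstration that an `lru_cache`-wrapped method body runs once per distinct argument set.

```python
import functools


class Downloader:
    """Hypothetical stand-in: counts how often the 'freshness check' runs."""

    def __init__(self):
        self.freshness_checks = 0

    @functools.lru_cache(maxsize=None)
    def get_cached(self, dirpath):
        # With lru_cache, this body runs only on the first call per argument.
        self.freshness_checks += 1
        return f"/data/{dirpath}"

    def get_uncached(self, dirpath):
        # Without the cache, the check is re-evaluated on every call.
        self.freshness_checks += 1
        return f"/data/{dirpath}"


d = Downloader()
d.get_cached("duckdb/file")
d.get_cached("duckdb/file")
checks_after_cached_calls = d.freshness_checks

d.get_uncached("duckdb/file")
d.get_uncached("duckdb/file")
checks_after_uncached_calls = d.freshness_checks
```

Two cached calls trigger a single check, while each uncached call triggers its own, which is why the freshness window cannot be honored from behind the cache.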

…map to listcomp

- babel_xrefs.py: subj_biolink_type/obj_biolink_type str→list[str] to match Identifier.biolink_type after the nodenorm.py type change
- nodenorm.py: replace map(lambda) with list comprehension in get_clique_identifiers
- tests: update LabeledCrossReference construction to use list values

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…D, type fixes

- nodenorm.py: Identifier is now frozen=True; rewrite from_dict as one-shot constructor to avoid post-construction mutation of lru_cache'd objects
- nodenorm.py: remove **kwargs from get_clique_identifiers (unhashable and unused; would raise TypeError if any kwarg was ever passed)
- downloader.py: download to .tmp then os.replace() so the final file is never partially written; clean up .tmp on failure
- downloader.py: _etag_matches returns True (fail open) on HEAD network error instead of False, avoiding spurious 2GB re-downloads on transient failures
- cli.py: add nodenorm_url: str annotation in xrefs and test_concord; move test_concord inline comment to docstring
- tests: update test_returns_false_on_request_error → test_returns_true_on_request_error
- FUTURE.md: track CLI option deduplication refactor

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
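The download-to-.tmp pattern from the downloader.py item can be sketched with the standard library alone. `write_atomically` and its `chunks` argument are hypothetical names for illustration; the real method streams an HTTP response rather than an arbitrary iterable.

```python
import os


def write_atomically(final_path, chunks):
    """Write chunks to final_path + '.tmp', then swap it into place.

    os.replace() is atomic when both paths are on the same filesystem, so
    readers see either the old complete file or the new complete file.
    """
    tmp_path = final_path + ".tmp"
    try:
        with open(tmp_path, "wb") as f:
            for chunk in chunks:
                f.write(chunk)
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the partial .tmp file, then re-raise the original error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the chunk source fails halfway through, the previous copy of the final file is left untouched and the partial temporary file is deleted.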

- Fix --expand → --recurse (the actual flag name) in Data Flow and Key Design Patterns
- BabelXRefs: remove false claim about writing DuckDB databases to disk; all connections are in-memory (duckdb.connect() with no path)
- Remove 'Generated DuckDB databases' entry from File Locations (nothing on disk)
- Update test count table: numbers were stale and test_cli.py was missing entirely
- Add Identifier to Key Dataclasses (now frozen=True as of recent fix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…/babel-explorer into basic-implementation-in-uv

- list→tuple on Identifier and LabeledCrossReference fields so frozen dataclasses are hashable (was a TypeError crash in get_curie_xrefs)
- NodeNorm(''): add early return in normalize_curie so empty URL truly skips all network calls as documented
- BabelDownloader: auto-append trailing slash to url_base so urljoin can't silently drop path segments
- CI: fix push trigger branch master → main
- Remove dead get_downloaded_dir method (lru_cache + NotImplementedError)
- parse_duration: reject negative values with a clear BadParameter error

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
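The list→tuple item follows from how frozen dataclasses generate `__hash__`: the hash is computed from the field values, so one list-valued field makes the whole object unhashable. A minimal sketch, with `WithList`, `WithTuple`, and `is_hashable` as hypothetical names rather than the project's classes:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WithList:
    curie: str
    biolink_type: list  # list field: hash(instance) raises TypeError


@dataclass(frozen=True)
class WithTuple:
    curie: str
    biolink_type: tuple = ()  # tuple of strings: hashable


def is_hashable(obj):
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
        return True
    except TypeError:
        return False


list_backed = WithList("EXAMPLE:1", ["biolink:Gene"])
tuple_backed = WithTuple("EXAMPLE:1", ("biolink:Gene",))
```

Only the tuple-backed variant can be placed in a set or used as a dict or cache key.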

Introduces a central formatting.py module (write_records + _record_to_dict) that serialises any dataclass to text, JSON, TSV, or CSV without touching domain objects. A format_option decorator wires --format and --json-indent onto xrefs, ids, and test-concord. test-concord injects a query_curie column for non-text formats. 30 new unit tests in test_formatting.py; 7 CLI format tests added to test_cli.py. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
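One way to serialise arbitrary dataclasses without touching the domain objects is to go through `dataclasses.asdict`. The sketch below is an assumption about the general approach, not the contents of formatting.py: `serialise` and `Row` are hypothetical names, and the console/text format is omitted.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Row:
    """Hypothetical record type standing in for a cross-reference."""

    subj: str
    pred: str
    obj: str


def serialise(records, fmt, json_indent=None):
    """Render dataclass instances as JSON, TSV, or CSV text."""
    rows = [asdict(r) for r in records]
    if fmt == "json":
        return json.dumps(rows, indent=json_indent)
    if fmt in ("tsv", "csv"):
        if not rows:
            return ""
        buf = io.StringIO()
        writer = csv.DictWriter(
            buf,
            fieldnames=list(rows[0]),  # column order follows field order
            delimiter="\t" if fmt == "tsv" else ",",
            lineterminator="\n",
        )
        writer.writeheader()
        writer.writerows(rows)
        return buf.getvalue()
    raise ValueError(f"unsupported format: {fmt}")
```

Because the function only relies on the dataclass protocol, a new record type needs no formatting code of its own.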

Replaces the 'text' default format with 'console', backed by the rich library. xrefs and test-concord highlight query CURIEs in bold cyan wherever they appear as subject or object; rich auto-strips markup when output is piped. ids uses console.print(str(record)) for TTY-aware plain output. formatting.py gains make_console() and hl_curie() utilities for new commands to reuse. LabeledCrossReference labels appear in parentheses next to CURIEs in console output. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

tuple() on a bare string iterates its characters, so biolink_type, taxa, and description would become ('b','i','o',...) when NodeNorm returns them as strings rather than lists. _to_tuple() now wraps a bare string in a 1-tuple. Four new unit tests cover the string case for each field. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
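A minimal sketch of the wrapping behaviour described here. The function name `to_tuple` and the handling of `None` are assumptions for illustration; the project's `_to_tuple()` may differ in detail.

```python
def to_tuple(value):
    """Coerce a NodeNorm-style field into a tuple of strings.

    A bare string becomes a 1-tuple instead of a tuple of characters.
    Treating None as an empty tuple is an assumption of this sketch.
    """
    if value is None:
        return ()
    if isinstance(value, str):
        return (value,)
    return tuple(value)


# The pitfall: tuple() iterates a string character by character.
split_into_chars = tuple("bio")
wrapped = to_tuple("biolink:Gene")
from_list = to_tuple(["biolink:Gene", "biolink:NamedThing"])
```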

Extract babel_options decorator covering --local-dir, --babel-url, and --check-download so the three options are defined once instead of being copy-pasted into xrefs and ids. Move logging.basicConfig into the cli group callback so it fires once rather than inside each subcommand. Replace the file-level # comment with a proper module docstring. Update FUTURE.md: mark CLI deduplication resolved; add two new items for batch NodeNorm lookups and DuckDB connection reuse. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

babel_xrefs: extract _require_nodenorm() to replace the identical guard duplicated in get_curie_xref and _get_curie_xrefs_recursive. Delete LabeledCrossReference.__str__; the dataclass-generated __repr__ already includes all fields. nodenorm: drop the five unused keyword parameters from normalize_curie (no caller ever varies them; values hardcoded inline). Collapse the sequential double-guard in get_clique_identifiers into a single condition. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Drop the redundant os.path.exists() call in the elif branch of __init__ (the if-not-exists branch above guarantees it is True at that point; only the isdir check is needed). Delete self-evident comments from _stream_download and _download_with_retry; keep the non-obvious notes about connection-only timeout and the three-tier freshness logic. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 20 out of 25 changed files in this pull request and generated 4 comments.

+        for attempt in range(1, self.retries + 1):
+            try:
+                resume_byte_pos = 0
+                if os.path.exists(local_path):
+                    resume_byte_pos = os.path.getsize(local_path)
+
+                headers = {}
+                if resume_byte_pos > 0:
+                    headers["Range"] = f"bytes={resume_byte_pos}-"
+                    self.logger.info(f"Resuming download from byte {resume_byte_pos}")
+
+                # timeout applies to connection only, not total transfer time
+                with requests.get(
+                    url, headers=headers, stream=True, timeout=self.timeout
+                ) as response:
+                    if response.status_code == 416:
+                        self.logger.info(f"File already complete: {local_path}")
+                        return response.headers
+                    elif response.status_code == 206:
+                        self.logger.info("Resuming download (HTTP 206)")
+                    elif response.status_code == 200:
+                        if resume_byte_pos > 0:
+                            self.logger.warning(
+                                "Server doesn't support resume, restarting from beginning"
+                            )
+                            resume_byte_pos = 0
+                            if os.path.exists(local_path):
+                                os.remove(local_path)
+                    else:
+                        response.raise_for_status()
+
+                    self._stream_download(
+                        response, local_path, resume_byte_pos, chunk_size
+                    )
+                    return response.headers
+
+            except (requests.RequestException, IOError) as e:
+                self.logger.warning(
+                    f"Download attempt {attempt}/{self.retries} failed: {e}"
+                )
+
+                if attempt < self.retries:
+                    wait_time = min(2**attempt, 60)
+                    self.logger.info(f"Retrying in {wait_time} seconds...")
+                    time.sleep(wait_time)
+                else:
+                    raise RuntimeError(
+                        f"Failed to download {url} after {self.retries} attempts: {e}"
+                    )
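The wait times produced by `min(2**attempt, 60)` in the hunk above can be listed with a small helper. `backoff_schedule` is a hypothetical name; it mirrors the fact that the loop sleeps after every failed attempt except the last one.

```python
def backoff_schedule(retries, base=2, cap=60):
    """Seconds slept after each failed attempt; the final attempt raises instead."""
    return [min(base**attempt, cap) for attempt in range(1, retries)]


# With 8 retries the exponential growth is capped from the sixth wait onward.
waits_for_eight_retries = backoff_schedule(8)
```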


+    def from_tuple(tuple: tuple[str, str, str, str]):
+        """Construct from a ``(filename, subj, pred, obj)`` database row tuple."""
+        return CrossReference(
+            filename=tuple[0], subj=tuple[1], pred=tuple[2], obj=tuple[3]


+    taxa: tuple[str, ...] = ()
+    description: tuple[str, ...] = ()
+
+    def __lt__(self, other):


+                        f"Failed to download {url} after {self.retries} attempts: {e}"
+                    )
+
+    @functools.lru_cache(maxsize=None)


Replace @functools.lru_cache on the get_curie_xref instance method with an instance-level dict cache (self._xref_cache). lru_cache on instance methods holds a strong reference to self via the cache key, preventing garbage collection for the process lifetime. Also change the get_curie_ids IN-clause from `IN $1` to `IN (SELECT unnest($1::VARCHAR[]))`, consistent with the pattern already used in the recursive query and guaranteed to work across DuckDB versions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

urllib.parse.urljoin resolves relative paths against the base URL with its last path segment removed, not against the full path, so a multi-level url_base that lacks a trailing slash, such as https://host/a/b joined with 'duckdb/file', would produce https://host/a/duckdb/file. Replace with simple string concatenation (url_base + dirpath), which is safe and unambiguous because the constructor already enforces a trailing /. Also drop the now-unused urllib.parse import. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
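A quick check of how `urljoin` treats the base URL's last path segment, next to plain concatenation. The host and paths are placeholders, not the real Babel server.

```python
from urllib.parse import urljoin

# Base without a trailing slash: the last segment ("b") is replaced.
no_slash = urljoin("https://host/a/b", "duckdb/file")

# Base with a trailing slash: the relative path is appended as expected.
with_slash = urljoin("https://host/a/b/", "duckdb/file")

# A leading slash on the relative part discards the whole base path.
leading_slash = urljoin("https://host/a/b/", "/duckdb/file")

# Plain concatenation, given a base that is guaranteed to end in "/".
concatenated = "https://host/a/b/" + "duckdb/file"
```

Concatenation has no such special cases, at the cost of requiring the caller to guarantee the trailing slash on the base.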

Making src/ itself a Python package can interfere with import resolution when the project is installed via hatchling (which uses src layout). Only src/babel_explorer/__init__.py and src/babel_explorer/core/__init__.py are needed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- README.md: add missing trailing newline
- tests/__init__.py: remove stray comment (test __init__ files should be empty)
- FUTURE.md: replace duplicated text with links to GitHub issues #12 and #13 where the two known performance improvements are now tracked

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

No logic changes — ruff wrapped long import lists and long lines to meet the project's line-length limit. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 19 out of 24 changed files in this pull request and generated 3 comments.

+@pytest.mark.parametrize("curie", VALID_CURIES)
+def test_get_curie_xref(babel_xrefs, curie):
+    """get_curie_xref returns non-empty CrossReferences with the queried CURIE."""
+    babel_xrefs.get_curie_xref.cache_clear()


Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

…ution Replaces the implicit DuckDB pattern (assign a relation to a Python variable, then reference that name in a SQL string) with explicit read_parquet($N) parameterised calls. The recursive query wraps the scan in a MATERIALIZED CTE so the parquet is read once regardless of how many CTEs reference it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

lru_cache on get_downloaded_file caused a latent bug: once it cached a local path, subsequent calls bypassed the os.path.exists() guard — returning a stale path if the file was deleted mid-run. The three-tier freshness logic (meta + ETag) already prevents redundant network calls, so the cache adds no benefit and only introduces this risk. Also corrects the misleading "connection only" comment on the requests timeout: it is a per-read idle timeout, not a total-transfer limit. Tests updated to remove cache_clear() calls and rename the caching test to reflect that the freshness window is now the deduplication mechanism. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Fixes #14. lru_cache on instance methods holds a reference to self in the cache key, preventing garbage collection of NodeNorm instances for the lifetime of the process. Each of the three methods now checks and populates a dedicated dict (_normalize_cache, _identifier_cache, _clique_cache) on the instance. This makes caching scope explicit: the cache lives and dies with the object, and callers who need fresh results simply instantiate a new NodeNorm. HTTP errors in normalize_curie are intentionally not cached so a transient failure does not permanently suppress retries. Tests updated to remove cache_clear() calls — unit tests already construct a fresh NodeNorm per test case via _make_nn(), and integration tests are parametrized per-CURIE so cached results do not interfere. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
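The instance-level cache described in this commit can be sketched as follows. `Client` and its injected `fetch` callable are hypothetical stand-ins for NodeNorm and its HTTP call; only the `_normalize_cache` attribute name comes from the commit message.

```python
class Client:
    """Hypothetical stand-in for a NodeNorm-like client with a per-instance cache."""

    def __init__(self, fetch):
        self._fetch = fetch          # callable doing the (possibly failing) lookup
        self._normalize_cache = {}   # lives and dies with this instance

    def normalize_curie(self, curie):
        if curie in self._normalize_cache:
            return self._normalize_cache[curie]
        # If fetch raises, nothing is stored, so a later call retries.
        result = self._fetch(curie)
        self._normalize_cache[curie] = result
        return result
```

Callers who need fresh results create a new `Client`, and no module-level cache keeps old instances alive.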

gaurav and others added 19 commits December 2, 2025 15:38

This initializes a uv package in this repository.

fcc27c2

Added basic CLI.

876353d

Add /data to the .gitignore.

ec1d1f0

Initial implementation of a basic xref query-er.

eff8f26

Added a method to look up a particular identifier.

4d04e2a

Added CURIE expansion/recursive lookup.

8531cb7

Added a basic ConcordTester.

a1aeec6

Added labels via NodeNorm.

bb1eb99

Midnight commit: attempting to improve expansion.

40c3338

Added some improvements.

8c41112

Added a CLAUDE.md by Claude.ai.

239c89f

Reorganized file slightly.

8132fe1

Claude wrote some tests.

bd00972

Improved downloader using Claude.

9cc06bc

Added MD5 download functionality.

da8bb0c

Removed empty model file.

8f36b74

Attempted to rename this package to babel-explorer.

0534fd8

Merge branch 'main' into basic-implementation-in-uv

8535202

gaurav requested a review from Copilot February 17, 2026 23:11

Copilot started reviewing on behalf of gaurav February 17, 2026 23:11

Copilot AI reviewed Feb 17, 2026

This was referenced Feb 25, 2026

Deconflation endpoint NCATSTranslator/NodeNormalization#340

Open

Publish intermediate files or other information to explain how we came to our cliquing decisions NCATSTranslator/Babel#319

Closed

gaurav and others added 6 commits March 2, 2026 17:35

Added uv.lock (not sure why it wasn't added previously).

ff0dacc

Update CLAUDE.md

bacc72d

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update pyproject.toml

96d9609

Update src/babel_explorer/core/babel_xrefs.py

0c33e7e

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update src/babel_explorer/core/nodenorm.py

1aff013

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot started reviewing on behalf of gaurav April 1, 2026 05:30

Copilot AI reviewed Apr 1, 2026

gaurav and others added 11 commits April 1, 2026 01:40

Merge branch 'basic-implementation-in-uv' of github.com:TranslatorSRI…

46f0863

…/babel-explorer into basic-implementation-in-uv

gaurav requested a review from Copilot May 17, 2026 19:09

Copilot started reviewing on behalf of gaurav May 17, 2026 19:10

Copilot AI reviewed May 17, 2026

gaurav and others added 5 commits May 17, 2026 17:32

Apply ruff formatting to cli, formatting, and test files

a4a2ed1


gaurav requested a review from Copilot May 18, 2026 00:07

Copilot started reviewing on behalf of gaurav May 18, 2026 00:07

Copilot AI reviewed May 18, 2026

gaurav and others added 2 commits May 17, 2026 23:14

Potential fix for pull request finding

a5eb3b4

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

a2fb574

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

gaurav mentioned this pull request May 18, 2026

NodeNorm: lru_cache on instance methods pins self and leaks memory #14

Open

gaurav and others added 3 commits May 17, 2026 23:52


Basic CLI #1: gaurav wants to merge 81 commits into main from basic-implementation-in-uv


2 participants
