Basic CLI #1
- Add IdentifierRecord dataclass to babel_xrefs.py (resolves TODO)
- Add 89 tests across 3 files: test_downloader (26), test_babel_xrefs (31), test_nodenorm (23)
- Unit tests (71) use mocks and run without network; integration tests (18) use real downloads/APIs
- Add session-scoped fixtures in conftest.py for shared Parquet file downloads
- Parametrize integration tests over tests/data/valid_curies.txt for easy expansion
- Add integration and slow pytest markers to pyproject.toml
- Update CLAUDE.md and README.md with testing documentation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
This pull request implements a basic version of babel-explorer in Python using the uv package manager. It's a tool for querying Babel intermediate files to understand why biological/chemical identifiers are considered equivalent. The implementation includes a downloader for large Parquet files with MD5 validation and resume support, NodeNorm API integration for label enrichment, DuckDB-based cross-reference querying, and a Click-based CLI.
Changes:
- Initial project structure with uv-based package management (pyproject.toml, Python 3.11+)
- Core functionality: BabelDownloader with streaming downloads and MD5 validation, NodeNorm API client with LRU caching, BabelXRefs for DuckDB-based Parquet queries
- CLI with three commands: xrefs, ids, and test-concord
- Comprehensive test suite with 80 tests split between unit tests (mocked) and integration tests (real network calls)
Reviewed changes
Copilot reviewed 15 out of 19 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| pyproject.toml | Project configuration with dependencies (click, duckdb, requests, tqdm) and pytest markers |
| .python-version | Specifies Python 3.11 requirement |
| .gitignore | Excludes /data directory for downloaded files |
| README.md | User documentation with setup, usage examples, and testing instructions |
| CLAUDE.md | AI assistant guidance documentation (contains outdated wget reference) |
| src/babel_explorer/cli.py | Click-based CLI with xrefs, ids, and test-concord commands |
| src/babel_explorer/core/downloader.py | Streaming file downloader with MD5 validation and resume capability |
| src/babel_explorer/core/nodenorm.py | NodeNorm API client for identifier normalization |
| src/babel_explorer/core/babel_xrefs.py | DuckDB-based cross-reference query engine (has frozen dataclass bug) |
| tests/conftest.py | Session-scoped pytest fixtures for shared test resources |
| tests/constants.py | Shared test constants and CURIE loader utility |
| tests/data/valid_curies.txt | Parametrized test data (one CURIE) |
| tests/test_downloader.py | 26 tests for BabelDownloader (22 unit, 3 integration, 1 slow) |
| tests/test_nodenorm.py | 23 tests for NodeNorm (18 unit, 5 integration) |
| tests/test_babel_xrefs.py | 31 tests for BabelXRefs (22 unit, 8 integration, 1 slow) |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Remove _calculate_md5/_fetch_remote_md5 (too slow on 2.5-3.9 GB files)
- Add sidecar .meta JSON files (ETag, Last-Modified, Content-Length, last_checked)
- Three-tier logic: freshness window → HEAD/ETag check → full re-download
- Add freshness_seconds param to BabelDownloader (default 3h)
- Add --check-download CLI option to xrefs and ids commands (e.g. 3h, never)
- Update tests: replace MD5 test classes with meta/ETag/tier coverage
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
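
The three-tier freshness logic described in this commit can be sketched as a small decision function. This is only an illustration under assumptions: the function name `choose_action`, the `meta` dictionary keys, and the `fetch_remote_etag` callable are hypothetical and are not taken from the actual `BabelDownloader` code.

```python
import time


def choose_action(meta, freshness_seconds, fetch_remote_etag, now=None):
    """Decide whether a local file can be reused (hypothetical sketch).

    meta: dict loaded from a sidecar .meta file, or None if it is missing.
          Assumed keys: "etag" and "last_checked" (a Unix timestamp).
    fetch_remote_etag: callable standing in for a HEAD request.
    """
    now = time.time() if now is None else now

    # No metadata at all: nothing to compare against, so download.
    if meta is None:
        return "download"

    # Tier 1: checked recently enough, so skip the network entirely.
    if now - meta["last_checked"] < freshness_seconds:
        return "use_local"

    # Tier 2: stale, so ask the server for its ETag and compare.
    if fetch_remote_etag() == meta["etag"]:
        return "use_local"

    # Tier 3: the ETag changed, so a full re-download is needed.
    return "download"
```

Keeping the decision separate from the I/O makes each tier easy to unit test with a fixed `now` value and a fake ETag callable.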
Pull request overview
Copilot reviewed 17 out of 22 changed files in this pull request and generated 2 comments.

    return LabeledCrossReference(
        subj=xref.subj,
        obj=xref.obj,
        filename=xref.filename,
        pred=xref.pred,
        subj_label=subj_ident.label,
        subj_biolink_type=subj_ident.biolink_type,
        obj_label=obj_ident.label,
        obj_biolink_type=obj_ident.biolink_type,

LabeledCrossReference declares subj_biolink_type/obj_biolink_type as str, but _to_labeled_xref() passes Identifier.biolink_type, which is a list[str] from NodeNorm. This makes the runtime type inconsistent with the dataclass contract and with unit tests that construct LabeledCrossReference using strings. Consider either changing these fields to list[str] (and updating the name/docs accordingly) or serializing the list (e.g., join with ', ' or select a primary type) before constructing the labeled xref.
Suggested change:

    - return LabeledCrossReference(
    -     subj=xref.subj,
    -     obj=xref.obj,
    -     filename=xref.filename,
    -     pred=xref.pred,
    -     subj_label=subj_ident.label,
    -     subj_biolink_type=subj_ident.biolink_type,
    -     obj_label=obj_ident.label,
    -     obj_biolink_type=obj_ident.biolink_type,
    + # NodeNorm.Identifier.biolink_type is a list[str]; LabeledCrossReference expects str.
    + # Normalize by selecting the primary (first) type when a list is provided.
    + subj_biolink_type = (
    +     subj_ident.biolink_type[0]
    +     if isinstance(subj_ident.biolink_type, list) and subj_ident.biolink_type
    +     else subj_ident.biolink_type
    + )
    + obj_biolink_type = (
    +     obj_ident.biolink_type[0]
    +     if isinstance(obj_ident.biolink_type, list) and obj_ident.biolink_type
    +     else obj_ident.biolink_type
    + )
    + return LabeledCrossReference(
    +     subj=xref.subj,
    +     obj=xref.obj,
    +     filename=xref.filename,
    +     pred=xref.pred,
    +     subj_label=subj_ident.label,
    +     subj_biolink_type=subj_biolink_type,
    +     obj_label=obj_ident.label,
    +     obj_biolink_type=obj_biolink_type,


    @functools.lru_cache(maxsize=None)
    def get_downloaded_file(self, dirpath: str, chunk_size: int = 1024 * 1024):
        """
        Download a file from the Babel server to local storage with ETag-based caching.

        Three-tier freshness logic:
        1. If .meta exists and last_checked is within freshness window → return immediately
        2. If .meta exists but stale → HEAD request to compare ETag; return if unchanged
        3. If ETag changed or no .meta → full re-download

get_downloaded_file() implements time/ETag-based freshness checks, but it is wrapped in @lru_cache. After the first call for a given (dirpath, chunk_size), subsequent calls return from the cache and will never re-evaluate the freshness window or perform the intended HEAD/GET logic in the same process. If the goal is to honor freshness_seconds on each call, consider removing the lru_cache here (or caching only path computation separately) so the three-tier logic can run as documented.
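
The effect described in this comment can be reproduced with a toy class. The sketch below is a hypothetical stand-in, not the real `BabelDownloader`: a counter takes the place of the freshness logic, so it is easy to see that a memoized method runs its body only once per argument set.

```python
import functools


class ToyDownloader:
    """Hypothetical stand-in; the counter represents the meta/ETag checks."""

    def __init__(self):
        self.freshness_checks = 0

    @functools.lru_cache(maxsize=None)
    def get_cached(self, dirpath):
        # With lru_cache, this body runs once per distinct (self, dirpath).
        self.freshness_checks += 1
        return f"/data/{dirpath}"

    def get_uncached(self, dirpath):
        # Without the cache, the check runs on every call.
        self.freshness_checks += 1
        return f"/data/{dirpath}"


cached = ToyDownloader()
cached.get_cached("duckdb/file")
cached.get_cached("duckdb/file")
cached_checks = cached.freshness_checks

uncached = ToyDownloader()
uncached.get_uncached("duckdb/file")
uncached.get_uncached("duckdb/file")
uncached_checks = uncached.freshness_checks
```

In this sketch the second `get_cached` call never reaches the method body, which is why a freshness window placed inside a memoized method cannot expire within a single process.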
…map to listcomp
- babel_xrefs.py: subj_biolink_type/obj_biolink_type str→list[str] to match Identifier.biolink_type after the nodenorm.py type change
- nodenorm.py: replace map(lambda) with list comprehension in get_clique_identifiers
- tests: update LabeledCrossReference construction to use list values
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…D, type fixes
- nodenorm.py: Identifier is now frozen=True; rewrite from_dict as one-shot constructor to avoid post-construction mutation of lru_cache'd objects
- nodenorm.py: remove **kwargs from get_clique_identifiers (unhashable and unused; would raise TypeError if any kwarg was ever passed)
- downloader.py: download to .tmp then os.replace() so the final file is never partially written; clean up .tmp on failure
- downloader.py: _etag_matches returns True (fail open) on HEAD network error instead of False, avoiding spurious 2GB re-downloads on transient failures
- cli.py: add nodenorm_url: str annotation in xrefs and test_concord; move test_concord inline comment to docstring
- tests: update test_returns_false_on_request_error → test_returns_true_on_request_error
- FUTURE.md: track CLI option deduplication refactor
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
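
The "download to .tmp then os.replace()" pattern from this commit can be sketched as follows. The helper name `write_atomically` and the `chunks` iterable are assumptions made for illustration; the real downloader streams from an HTTP response instead.

```python
import os


def write_atomically(final_path, chunks):
    """Write chunks to final_path without ever exposing a partial file.

    Hypothetical helper: data goes to a sibling .tmp file first and is
    moved into place only after every chunk has been written.
    """
    tmp_path = final_path + ".tmp"
    try:
        with open(tmp_path, "wb") as handle:
            for chunk in chunks:
                handle.write(chunk)
        # os.replace overwrites the destination in a single rename step.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the partial temporary file, then re-raise the error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If writing fails midway, any previously downloaded copy at `final_path` is left untouched, so readers never see a truncated file.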
- Fix --expand → --recurse (the actual flag name) in Data Flow and Key Design Patterns
- BabelXRefs: remove false claim about writing DuckDB databases to disk; all connections are in-memory (duckdb.connect() with no path)
- Remove 'Generated DuckDB databases' entry from File Locations (nothing on disk)
- Update test count table: numbers were stale and test_cli.py was missing entirely
- Add Identifier to Key Dataclasses (now frozen=True as of recent fix)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…/babel-explorer into basic-implementation-in-uv
- list→tuple on Identifier and LabeledCrossReference fields so frozen
dataclasses are hashable (was a TypeError crash in get_curie_xrefs)
- NodeNorm(''): add early return in normalize_curie so empty URL truly
skips all network calls as documented
- BabelDownloader: auto-append trailing slash to url_base so urljoin
can't silently drop path segments
- CI: fix push trigger branch master → main
- Remove dead get_downloaded_dir method (lru_cache + NotImplementedError)
- parse_duration: reject negative values with a clear BadParameter error
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduces a central formatting.py module (write_records + _record_to_dict) that serialises any dataclass to text, JSON, TSV, or CSV without touching domain objects. A format_option decorator wires --format and --json-indent onto xrefs, ids, and test-concord. test-concord injects a query_curie column for non-text formats. 30 new unit tests in test_formatting.py; 7 CLI format tests added to test_cli.py. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
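
A serializer of this kind can be sketched with the standard library alone. The names below (`ExampleRecord`, `render_records`) are hypothetical and do not reproduce the actual `write_records` signature; the sketch only shows how `dataclasses.asdict` lets one function handle JSON, TSV, and CSV without changing the domain objects.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExampleRecord:
    """Hypothetical record type used only for this illustration."""
    subj: str
    pred: str
    obj: str


def render_records(records, fmt, json_indent=None):
    """Serialize dataclass instances to a JSON, TSV, or CSV string."""
    rows = [asdict(record) for record in records]
    if fmt == "json":
        return json.dumps(rows, indent=json_indent)
    if fmt in ("tsv", "csv"):
        buffer = io.StringIO()
        # Column order follows the dataclass field order of the first row.
        fieldnames = list(rows[0]) if rows else []
        writer = csv.DictWriter(
            buffer,
            fieldnames=fieldnames,
            delimiter="\t" if fmt == "tsv" else ",",
            lineterminator="\n",
        )
        writer.writeheader()
        writer.writerows(rows)
        return buffer.getvalue()
    raise ValueError(f"Unsupported format: {fmt}")
```

For example, `render_records([ExampleRecord("EX:1", "skos:exactMatch", "EX:2")], "tsv")` yields a header line followed by one tab-separated row.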
Replaces the 'text' default format with 'console', backed by the rich library. xrefs and test-concord highlight query CURIEs in bold cyan wherever they appear as subject or object; rich auto-strips markup when output is piped. ids uses console.print(str(record)) for TTY-aware plain output. formatting.py gains make_console() and hl_curie() utilities for new commands to reuse. LabeledCrossReference labels appear in parentheses next to CURIEs in console output. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
tuple() on a bare string iterates its characters, so biolink_type,
taxa, and description would become ('b','i','o',...) when NodeNorm
returns them as strings rather than lists. _to_tuple() now wraps
a bare string in a 1-tuple. Four new unit tests cover the string
case for each field.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
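
The pitfall and the fix can be shown in a few lines. This is a minimal sketch: the function name `to_tuple` mirrors the `_to_tuple()` helper mentioned above, but the body is an assumption, and the handling of `None` in particular is not described in the commit message.

```python
def to_tuple(value):
    """Normalize a field to a tuple of strings (hypothetical sketch)."""
    if value is None:
        # Assumption: a missing field becomes an empty tuple.
        return ()
    if isinstance(value, str):
        # A bare string is wrapped whole instead of split into characters.
        return (value,)
    return tuple(value)


# Plain tuple() iterates over the characters of a string.
naive = tuple("bio")

wrapped = to_tuple("biolink:Gene")
from_list = to_tuple(["biolink:Gene", "biolink:NamedThing"])
```

Here `naive` is `('b', 'i', 'o')`, while `wrapped` is the 1-tuple `('biolink:Gene',)`.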
Extract babel_options decorator covering --local-dir, --babel-url, and --check-download so the three options are defined once instead of being copy-pasted into xrefs and ids. Move logging.basicConfig into the cli group callback so it fires once rather than inside each subcommand. Replace the file-level # comment with a proper module docstring. Update FUTURE.md: mark CLI deduplication resolved; add two new items for batch NodeNorm lookups and DuckDB connection reuse. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
babel_xrefs: extract _require_nodenorm() to replace the identical guard duplicated in get_curie_xref and _get_curie_xrefs_recursive. Delete LabeledCrossReference.__str__; the dataclass-generated __repr__ already includes all fields. nodenorm: drop the five unused keyword parameters from normalize_curie (no caller ever varies them; values hardcoded inline). Collapse the sequential double-guard in get_clique_identifiers into a single condition. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Drop the redundant os.path.exists() call in the elif branch of __init__ (the if-not-exists branch above guarantees it is True at that point; only the isdir check is needed). Delete self-evident comments from _stream_download and _download_with_retry; keep the non-obvious notes about connection-only timeout and the three-tier freshness logic. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

    for attempt in range(1, self.retries + 1):
        try:
            resume_byte_pos = 0
            if os.path.exists(local_path):
                resume_byte_pos = os.path.getsize(local_path)

            headers = {}
            if resume_byte_pos > 0:
                headers["Range"] = f"bytes={resume_byte_pos}-"
                self.logger.info(f"Resuming download from byte {resume_byte_pos}")

            # timeout applies to connection only, not total transfer time
            with requests.get(
                url, headers=headers, stream=True, timeout=self.timeout
            ) as response:
                if response.status_code == 416:
                    self.logger.info(f"File already complete: {local_path}")
                    return response.headers
                elif response.status_code == 206:
                    self.logger.info("Resuming download (HTTP 206)")
                elif response.status_code == 200:
                    if resume_byte_pos > 0:
                        self.logger.warning(
                            "Server doesn't support resume, restarting from beginning"
                        )
                        resume_byte_pos = 0
                        if os.path.exists(local_path):
                            os.remove(local_path)
                else:
                    response.raise_for_status()

                self._stream_download(
                    response, local_path, resume_byte_pos, chunk_size
                )
                return response.headers

        except (requests.RequestException, IOError) as e:
            self.logger.warning(
                f"Download attempt {attempt}/{self.retries} failed: {e}"
            )

            if attempt < self.retries:
                wait_time = min(2**attempt, 60)
                self.logger.info(f"Retrying in {wait_time} seconds...")
                time.sleep(wait_time)
            else:
                raise RuntimeError(
                    f"Failed to download {url} after {self.retries} attempts: {e}"
                )


    def from_tuple(tuple: tuple[str, str, str, str]):
        """Construct from a ``(filename, subj, pred, obj)`` database row tuple."""
        return CrossReference(
            filename=tuple[0], subj=tuple[1], pred=tuple[2], obj=tuple[3]


        taxa: tuple[str, ...] = ()
        description: tuple[str, ...] = ()

        def __lt__(self, other):


                    f"Failed to download {url} after {self.retries} attempts: {e}"
                )

    @functools.lru_cache(maxsize=None)

Replace @functools.lru_cache on the get_curie_xref instance method with an instance-level dict cache (self._xref_cache). lru_cache on instance methods holds a strong reference to self via the cache key, preventing garbage collection for the process lifetime. Also change the get_curie_ids IN-clause from `IN $1` to `IN (SELECT unnest($1::VARCHAR[]))`, consistent with the pattern already used in the recursive query and guaranteed to work across DuckDB versions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
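
The garbage-collection concern can be checked with `weakref`. The two classes below are hypothetical miniatures, not the real `BabelXRefs`: one memoizes a method with `functools.lru_cache`, the other keeps a plain dict on the instance, and a helper reports whether an instance is reclaimed after its last external reference is dropped.

```python
import functools
import gc
import weakref


class LruCached:
    @functools.lru_cache(maxsize=None)
    def lookup(self, curie):
        # The cache key includes self, so the cache keeps self alive.
        return curie.upper()


class DictCached:
    def __init__(self):
        self._xref_cache = {}

    def lookup(self, curie):
        # The cache lives on the instance and is freed together with it.
        if curie not in self._xref_cache:
            self._xref_cache[curie] = curie.upper()
        return self._xref_cache[curie]


def is_collected_after_use(cls):
    """Return True if an instance is reclaimed once it goes out of scope."""
    obj = cls()
    obj.lookup("ex:1")
    ref = weakref.ref(obj)
    del obj
    gc.collect()
    return ref() is None


lru_collected = is_collected_after_use(LruCached)
dict_collected = is_collected_after_use(DictCached)
```

Under CPython, `lru_collected` is `False` because the class-level cache still references the instance, while `dict_collected` is `True`.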
urllib.parse.urljoin resolves a relative path against the base URL with its last path segment removed, not against the full path, so a multi-level url_base without a trailing slash, such as https://host/a/b + 'duckdb/file', would produce https://host/a/duckdb/file. Replace with simple string concatenation (url_base + dirpath), which is safe and unambiguous because the constructor already enforces a trailing /. Also drop the now-unused urllib.parse import. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
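
Two ways that `urljoin` can drop path segments are easy to reproduce with the standard library. In the sketch below, the helper `build_url` is a hypothetical illustration of concatenation with an enforced trailing slash, not the project's actual code.

```python
from urllib.parse import urljoin

# A base without a trailing slash loses its last path segment.
no_trailing_slash = urljoin("https://host/a/b", "duckdb/file")

# A relative path that starts with "/" replaces the whole base path.
leading_slash = urljoin("https://host/a/b/", "/duckdb/file")


def build_url(url_base, dirpath):
    """Concatenate after enforcing a trailing slash (hypothetical helper)."""
    if not url_base.endswith("/"):
        url_base += "/"
    return url_base + dirpath


concatenated = build_url("https://host/a/b", "duckdb/file")
```

Here `no_trailing_slash` is `https://host/a/duckdb/file`, `leading_slash` is `https://host/duckdb/file`, and `concatenated` keeps every segment: `https://host/a/b/duckdb/file`.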
Making src/ itself a Python package can interfere with import resolution when the project is installed via hatchling (which uses src layout). Only src/babel_explorer/__init__.py and src/babel_explorer/core/__init__.py are needed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- README.md: add missing trailing newline
- tests/__init__.py: remove stray comment (test __init__ files should be empty)
- FUTURE.md: replace duplicated text with links to GitHub issues #12 and #13 where the two known performance improvements are now tracked
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
No logic changes — ruff wrapped long import lists and long lines to meet the project's line-length limit. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

    @pytest.mark.parametrize("curie", VALID_CURIES)
    def test_get_curie_xref(babel_xrefs, curie):
        """get_curie_xref returns non-empty CrossReferences with the queried CURIE."""
        babel_xrefs.get_curie_xref.cache_clear()

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…ution Replaces the implicit DuckDB pattern (assign a relation to a Python variable, then reference that name in a SQL string) with explicit read_parquet($N) parameterised calls. The recursive query wraps the scan in a MATERIALIZED CTE so the parquet is read once regardless of how many CTEs reference it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
lru_cache on get_downloaded_file caused a latent bug: once it cached a local path, subsequent calls bypassed the os.path.exists() guard — returning a stale path if the file was deleted mid-run. The three-tier freshness logic (meta + ETag) already prevents redundant network calls, so the cache adds no benefit and only introduces this risk. Also corrects the misleading "connection only" comment on the requests timeout: it is a per-read idle timeout, not a total-transfer limit. Tests updated to remove cache_clear() calls and rename the caching test to reflect that the freshness window is now the deduplication mechanism. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes #14. lru_cache on instance methods holds a reference to self in the cache key, preventing garbage collection of NodeNorm instances for the lifetime of the process. Each of the three methods now checks and populates a dedicated dict (_normalize_cache, _identifier_cache, _clique_cache) on the instance. This makes caching scope explicit: the cache lives and dies with the object, and callers who need fresh results simply instantiate a new NodeNorm. HTTP errors in normalize_curie are intentionally not cached so a transient failure does not permanently suppress retries. Tests updated to remove cache_clear() calls — unit tests already construct a fresh NodeNorm per test case via _make_nn(), and integration tests are parametrized per-CURIE so cached results do not interfere. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
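
The instance-level caching described here, including the rule that failures are not cached, can be sketched as below. `ToyClient` and its injected `fetch` callable are hypothetical; only the `_normalize_cache` attribute name comes from the commit message.

```python
class ToyClient:
    """Hypothetical stand-in for NodeNorm with a per-instance cache."""

    def __init__(self, fetch):
        self._fetch = fetch          # callable that may raise on failure
        self._normalize_cache = {}   # lives and dies with this object

    def normalize_curie(self, curie):
        if curie in self._normalize_cache:
            return self._normalize_cache[curie]
        # If fetch raises, nothing is stored, so a later call retries.
        result = self._fetch(curie)
        self._normalize_cache[curie] = result
        return result


calls = []


def flaky_fetch(curie):
    """Fail on the first call only, to mimic a transient network error."""
    calls.append(curie)
    if len(calls) == 1:
        raise ConnectionError("transient failure")
    return {"id": curie}


client = ToyClient(flaky_fetch)
try:
    client.normalize_curie("EX:1")
except ConnectionError:
    pass  # the failure was not cached

first = client.normalize_curie("EX:1")   # retried, succeeds, and is cached
second = client.normalize_curie("EX:1")  # served from the instance cache
```

After these three calls the fetch function has run twice: once for the failure and once for the successful retry.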
Introduces babel-explorer, a CLI tool to query Babel intermediate files (Parquet) via DuckDB and NodeNorm. BabelDownloader handles caching/freshness, BabelXRefs handles querying, NodeNorm handles label enrichment, and cli.py wires them together with Click.