
Basic CLI #1

Open
gaurav wants to merge 81 commits into main from basic-implementation-in-uv

Conversation

gaurav (Collaborator) commented Dec 3, 2025

Introduces babel-explorer, a CLI tool to query Babel intermediate files (Parquet) via DuckDB and NodeNorm. BabelDownloader handles caching/freshness, BabelXRefs handles querying, NodeNorm handles label enrichment, and cli.py wires them together with Click.

gaurav and others added 19 commits December 2, 2025 15:38
- Add IdentifierRecord dataclass to babel_xrefs.py (resolves TODO)
- Add 80 tests across 3 files: test_downloader (26), test_babel_xrefs (31), test_nodenorm (23)
- Unit tests (62) use mocks and run without network; integration tests (18) use real downloads/APIs
- Add session-scoped fixtures in conftest.py for shared Parquet file downloads
- Parametrize integration tests over tests/data/valid_curies.txt for easy expansion
- Add integration and slow pytest markers to pyproject.toml
- Update CLAUDE.md and README.md with testing documentation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

This pull request implements a basic version of babel-explorer in Python using the uv package manager. It's a tool for querying Babel intermediate files to understand why biological/chemical identifiers are considered equivalent. The implementation includes a downloader for large Parquet files with MD5 validation and resume support, NodeNorm API integration for label enrichment, DuckDB-based cross-reference querying, and a Click-based CLI.

Changes:

  • Initial project structure with uv-based package management (pyproject.toml, Python 3.11+)
  • Core functionality: BabelDownloader with streaming downloads and MD5 validation, NodeNorm API client with LRU caching, BabelXRefs for DuckDB-based Parquet queries
  • CLI with three commands: xrefs, ids, and test-concord
  • Comprehensive test suite with 80 tests split between unit tests (mocked) and integration tests (real network calls)

Reviewed changes

Copilot reviewed 15 out of 19 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| pyproject.toml | Project configuration with dependencies (click, duckdb, requests, tqdm) and pytest markers |
| .python-version | Specifies Python 3.11 requirement |
| .gitignore | Excludes /data directory for downloaded files |
| README.md | User documentation with setup, usage examples, and testing instructions |
| CLAUDE.md | AI assistant guidance documentation (contains outdated wget reference) |
| src/babel_explorer/cli.py | Click-based CLI with xrefs, ids, and test-concord commands |
| src/babel_explorer/core/downloader.py | Streaming file downloader with MD5 validation and resume capability |
| src/babel_explorer/core/nodenorm.py | NodeNorm API client for identifier normalization |
| src/babel_explorer/core/babel_xrefs.py | DuckDB-based cross-reference query engine (has frozen dataclass bug) |
| tests/conftest.py | Session-scoped pytest fixtures for shared test resources |
| tests/constants.py | Shared test constants and CURIE loader utility |
| tests/data/valid_curies.txt | Parametrized test data (one CURIE) |
| tests/test_downloader.py | 26 tests for BabelDownloader (22 unit, 3 integration, 1 slow) |
| tests/test_nodenorm.py | 23 tests for NodeNorm (18 unit, 5 integration) |
| tests/test_babel_xrefs.py | 31 tests for BabelXRefs (22 unit, 8 integration, 1 slow) |


Comment thread src/babel_explorer/core/nodenorm.py Outdated
Comment thread src/babel_explorer/core/downloader.py Outdated
Comment thread src/babel_explorer/cli.py Outdated
Comment thread pyproject.toml Outdated
Comment thread src/babel_explorer/core/downloader.py Outdated
Comment thread src/babel_explorer/core/babel_xrefs.py Outdated
Comment thread src/babel_explorer/core/babel_xrefs.py Outdated
Comment thread src/babel_explorer/core/nodenorm.py Outdated
Comment thread CLAUDE.md Outdated
gaurav and others added 6 commits March 2, 2026 17:35
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Remove _calculate_md5/_fetch_remote_md5 (too slow on 2.5-3.9 GB files)
- Add sidecar .meta JSON files (ETag, Last-Modified, Content-Length, last_checked)
- Three-tier logic: freshness window → HEAD/ETag check → full re-download
- Add freshness_seconds param to BabelDownloader (default 3h)
- Add --check-download CLI option to xrefs and ids commands (e.g. 3h, never)
- Update tests: replace MD5 test classes with meta/ETag/tier coverage

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
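
The three-tier logic described in this commit can be sketched as a small decision function. This is only an illustration under stated assumptions, not the code in downloader.py: the function name `choose_tier`, the `etag` and `last_checked` key names, and the `fetch_remote_etag` callable (standing in for the HEAD request) are all hypothetical.

```python
import time


def choose_tier(meta, fetch_remote_etag, freshness_seconds, now=None):
    """Return 1, 2 or 3 for the tier that decides how the file is served.

    meta: dict parsed from the sidecar .meta file, or None if it is missing
          (key names here are assumptions).
    fetch_remote_etag: zero-argument callable standing in for the HEAD request.
    """
    now = time.time() if now is None else now
    if meta is None:
        return 3  # no metadata on disk: full download
    if now - meta["last_checked"] < freshness_seconds:
        return 1  # inside the freshness window: no network call at all
    if fetch_remote_etag() == meta["etag"]:
        return 2  # window expired but ETag unchanged: keep the local file
    return 3  # ETag changed: full re-download
```

The point of the ordering is that the HEAD request is only issued once the freshness window has expired, so repeated CLI invocations within the window touch neither the network nor the multi-gigabyte files.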

Copilot AI left a comment


Pull request overview

Copilot reviewed 17 out of 22 changed files in this pull request and generated 2 comments.



Comment on lines +127 to +135
return LabeledCrossReference(
    subj=xref.subj,
    obj=xref.obj,
    filename=xref.filename,
    pred=xref.pred,
    subj_label=subj_ident.label,
    subj_biolink_type=subj_ident.biolink_type,
    obj_label=obj_ident.label,
    obj_biolink_type=obj_ident.biolink_type,

Copilot AI Apr 1, 2026


LabeledCrossReference declares subj_biolink_type/obj_biolink_type as str, but _to_labeled_xref() passes Identifier.biolink_type, which is a list[str] from NodeNorm. This makes the runtime type inconsistent with the dataclass contract and with unit tests that construct LabeledCrossReference using strings. Consider either changing these fields to list[str] (and updating the name/docs accordingly) or serializing the list (e.g., join with ', ' or select a primary type) before constructing the labeled xref.

Suggested change
- return LabeledCrossReference(
-     subj=xref.subj,
-     obj=xref.obj,
-     filename=xref.filename,
-     pred=xref.pred,
-     subj_label=subj_ident.label,
-     subj_biolink_type=subj_ident.biolink_type,
-     obj_label=obj_ident.label,
-     obj_biolink_type=obj_ident.biolink_type,
+ # NodeNorm.Identifier.biolink_type is a list[str]; LabeledCrossReference expects str.
+ # Normalize by selecting the primary (first) type when a list is provided.
+ subj_biolink_type = (
+     subj_ident.biolink_type[0]
+     if isinstance(subj_ident.biolink_type, list) and subj_ident.biolink_type
+     else subj_ident.biolink_type
+ )
+ obj_biolink_type = (
+     obj_ident.biolink_type[0]
+     if isinstance(obj_ident.biolink_type, list) and obj_ident.biolink_type
+     else obj_ident.biolink_type
+ )
+ return LabeledCrossReference(
+     subj=xref.subj,
+     obj=xref.obj,
+     filename=xref.filename,
+     pred=xref.pred,
+     subj_label=subj_ident.label,
+     subj_biolink_type=subj_biolink_type,
+     obj_label=obj_ident.label,
+     obj_biolink_type=obj_biolink_type,

Comment thread src/babel_explorer/core/downloader.py Outdated
Comment on lines +280 to +289
@functools.lru_cache(maxsize=None)
def get_downloaded_file(self, dirpath: str, chunk_size: int = 1024 * 1024):
    """
    Download a file from the Babel server to local storage with ETag-based caching.

    Three-tier freshness logic:
    1. If .meta exists and last_checked is within freshness window → return immediately
    2. If .meta exists but stale → HEAD request to compare ETag; return if unchanged
    3. If ETag changed or no .meta → full re-download


Copilot AI Apr 1, 2026


get_downloaded_file() implements time/ETag-based freshness checks, but it is wrapped in @lru_cache. After the first call for a given (dirpath, chunk_size), subsequent calls return from the cache and will never re-evaluate the freshness window or perform the intended HEAD/GET logic in the same process. If the goal is to honor freshness_seconds on each call, consider removing the lru_cache here (or caching only path computation separately) so the three-tier logic can run as documented.
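
A minimal reproduction of the effect described here, using a hypothetical stand-in class rather than the project's BabelDownloader. The counter marks where the freshness logic would run; the path is a placeholder.

```python
import functools


class Downloader:
    """Hypothetical stand-in, not the project's BabelDownloader."""

    def __init__(self):
        self.freshness_checks = 0

    @functools.lru_cache(maxsize=None)
    def get_downloaded_file(self, dirpath):
        # In the real method, the three-tier freshness logic would run here.
        self.freshness_checks += 1
        return f"/tmp/cache/{dirpath}"


d = Downloader()
first = d.get_downloaded_file("duckdb/file.parquet")
second = d.get_downloaded_file("duckdb/file.parquet")  # served from the cache
# The method body ran only once, so no second freshness check happened.
```

After both calls `d.freshness_checks` is 1: the second call never enters the method body, which is why a freshness window cannot be honoured within a single process while the decorator is in place.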

gaurav and others added 11 commits April 1, 2026 01:40
…map to listcomp

- babel_xrefs.py: subj_biolink_type/obj_biolink_type str→list[str] to match
  Identifier.biolink_type after the nodenorm.py type change
- nodenorm.py: replace map(lambda) with list comprehension in get_clique_identifiers
- tests: update LabeledCrossReference construction to use list values

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…D, type fixes

- nodenorm.py: Identifier is now frozen=True; rewrite from_dict as one-shot
  constructor to avoid post-construction mutation of lru_cache'd objects
- nodenorm.py: remove **kwargs from get_clique_identifiers — unhashable and unused,
  would raise TypeError if any kwarg was ever passed
- downloader.py: download to .tmp then os.replace() so the final file is never
  partially written; clean up .tmp on failure
- downloader.py: _etag_matches returns True (fail open) on HEAD network error
  instead of False, avoiding spurious 2GB re-downloads on transient failures
- cli.py: add nodenorm_url: str annotation in xrefs and test_concord; move
  test_concord inline comment to docstring
- tests: update test_returns_false_on_request_error → test_returns_true_on_request_error
- FUTURE.md: track CLI option deduplication refactor

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
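
The download-to-.tmp-then-rename pattern mentioned for downloader.py can be sketched with the standard library alone. The helper name `write_atomically` and the chunk iterables are hypothetical; the project's actual method streams an HTTP response instead.

```python
import os
import tempfile


def write_atomically(final_path, chunks):
    """Write chunks to final_path + '.tmp', then rename into place."""
    tmp_path = final_path + ".tmp"
    try:
        with open(tmp_path, "wb") as f:
            for chunk in chunks:
                f.write(chunk)
        # os.replace is atomic when both paths are on the same filesystem.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Remove the partial file so the final path is never half-written.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def failing_chunks():
    yield b"partial"
    raise IOError("connection dropped")


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "file.parquet")
write_atomically(target, [b"abc", b"def"])
try:
    write_atomically(target, failing_chunks())  # fails midway
except IOError:
    pass
```

After the failed second call, `target` still holds the bytes from the first successful write and no `.tmp` file is left behind.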
- Fix --expand → --recurse (the actual flag name) in Data Flow and Key Design Patterns
- BabelXRefs: remove false claim about writing DuckDB databases to disk;
  all connections are in-memory (duckdb.connect() with no path)
- Remove 'Generated DuckDB databases' entry from File Locations (nothing on disk)
- Update test count table: numbers were stale and test_cli.py was missing entirely
- Add Identifier to Key Dataclasses (now frozen=True as of recent fix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…/babel-explorer into basic-implementation-in-uv
- list→tuple on Identifier and LabeledCrossReference fields so frozen
  dataclasses are hashable (was a TypeError crash in get_curie_xrefs)
- NodeNorm(''): add early return in normalize_curie so empty URL truly
  skips all network calls as documented
- BabelDownloader: auto-append trailing slash to url_base so urljoin
  can't silently drop path segments
- CI: fix push trigger branch master → main
- Remove dead get_downloaded_dir method (lru_cache + NotImplementedError)
- parse_duration: reject negative values with a clear BadParameter error

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduces a central formatting.py module (write_records + _record_to_dict)
that serialises any dataclass to text, JSON, TSV, or CSV without touching
domain objects. A format_option decorator wires --format and --json-indent
onto xrefs, ids, and test-concord. test-concord injects a query_curie column
for non-text formats. 30 new unit tests in test_formatting.py; 7 CLI format
tests added to test_cli.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces the 'text' default format with 'console', backed by the rich
library. xrefs and test-concord highlight query CURIEs in bold cyan
wherever they appear as subject or object; rich auto-strips markup when
output is piped. ids uses console.print(str(record)) for TTY-aware plain
output. formatting.py gains make_console() and hl_curie() utilities for
new commands to reuse. LabeledCrossReference labels appear in parentheses
next to CURIEs in console output.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
tuple() on a bare string iterates its characters, so biolink_type,
taxa, and description would become ('b','i','o',...) when NodeNorm
returns them as strings rather than lists. _to_tuple() now wraps
a bare string in a 1-tuple. Four new unit tests cover the string
case for each field.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
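
The pitfall and the fix can be shown in a few lines. This is a sketch of the idea only: `to_tuple_sketch` is a hypothetical name, and treating `None` as an empty tuple is an assumption rather than something the commit states.

```python
def to_tuple_sketch(value):
    """Coerce a NodeNorm-style field to a tuple of strings."""
    if value is None:
        return ()  # assumption: a missing field becomes an empty tuple
    if isinstance(value, str):
        return (value,)  # wrap the string instead of iterating its characters
    return tuple(value)


naive = tuple("biolink:Gene")            # ('b', 'i', 'o', ...)
wrapped = to_tuple_sketch("biolink:Gene")  # ('biolink:Gene',)
from_list = to_tuple_sketch(["biolink:Gene", "biolink:NamedThing"])
```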
Extract babel_options decorator covering --local-dir, --babel-url, and
--check-download so the three options are defined once instead of being
copy-pasted into xrefs and ids. Move logging.basicConfig into the cli
group callback so it fires once rather than inside each subcommand.
Replace the file-level # comment with a proper module docstring.

Update FUTURE.md: mark CLI deduplication resolved; add two new items
for batch NodeNorm lookups and DuckDB connection reuse.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
babel_xrefs: extract _require_nodenorm() to replace the identical guard
duplicated in get_curie_xref and _get_curie_xrefs_recursive. Delete
LabeledCrossReference.__str__; the dataclass-generated __repr__ already
includes all fields.

nodenorm: drop the five unused keyword parameters from normalize_curie
(no caller ever varies them; values hardcoded inline). Collapse the
sequential double-guard in get_clique_identifiers into a single condition.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Drop the redundant os.path.exists() call in the elif branch of __init__
(the if-not-exists branch above guarantees it is True at that point;
only the isdir check is needed). Delete self-evident comments from
_stream_download and _download_with_retry; keep the non-obvious notes
about connection-only timeout and the three-tier freshness logic.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 20 out of 25 changed files in this pull request and generated 4 comments.

Comment on lines +232 to +280
for attempt in range(1, self.retries + 1):
    try:
        resume_byte_pos = 0
        if os.path.exists(local_path):
            resume_byte_pos = os.path.getsize(local_path)

        headers = {}
        if resume_byte_pos > 0:
            headers["Range"] = f"bytes={resume_byte_pos}-"
            self.logger.info(f"Resuming download from byte {resume_byte_pos}")

        # timeout applies to connection only, not total transfer time
        with requests.get(
            url, headers=headers, stream=True, timeout=self.timeout
        ) as response:
            if response.status_code == 416:
                self.logger.info(f"File already complete: {local_path}")
                return response.headers
            elif response.status_code == 206:
                self.logger.info("Resuming download (HTTP 206)")
            elif response.status_code == 200:
                if resume_byte_pos > 0:
                    self.logger.warning(
                        "Server doesn't support resume, restarting from beginning"
                    )
                    resume_byte_pos = 0
                    if os.path.exists(local_path):
                        os.remove(local_path)
            else:
                response.raise_for_status()

            self._stream_download(
                response, local_path, resume_byte_pos, chunk_size
            )
            return response.headers

    except (requests.RequestException, IOError) as e:
        self.logger.warning(
            f"Download attempt {attempt}/{self.retries} failed: {e}"
        )

        if attempt < self.retries:
            wait_time = min(2**attempt, 60)
            self.logger.info(f"Retrying in {wait_time} seconds...")
            time.sleep(wait_time)
        else:
            raise RuntimeError(
                f"Failed to download {url} after {self.retries} attempts: {e}"
            )
Comment thread src/babel_explorer/core/babel_xrefs.py Outdated
Comment on lines +27 to +30
def from_tuple(tuple: tuple[str, str, str, str]):
    """Construct from a ``(filename, subj, pred, obj)`` database row tuple."""
    return CrossReference(
        filename=tuple[0], subj=tuple[1], pred=tuple[2], obj=tuple[3]
taxa: tuple[str, ...] = ()
description: tuple[str, ...] = ()

def __lt__(self, other):
Comment thread src/babel_explorer/core/downloader.py Outdated
f"Failed to download {url} after {self.retries} attempts: {e}"
)

@functools.lru_cache(maxsize=None)
gaurav and others added 5 commits May 17, 2026 17:32
Replace @functools.lru_cache on the get_curie_xref instance method with an
instance-level dict cache (self._xref_cache). lru_cache on instance methods
holds a strong reference to self via the cache key, preventing garbage
collection for the process lifetime.

Also change the get_curie_ids IN-clause from `IN $1` to
`IN (SELECT unnest($1::VARCHAR[]))`, consistent with the pattern already used
in the recursive query and guaranteed to work across DuckDB versions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
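
The garbage-collection difference between the two caching styles can be observed with a weak reference. The two classes below are hypothetical stand-ins, not BabelXRefs; `lookup` and the placeholder CURIE exist only for the demonstration.

```python
import functools
import gc
import weakref


class WithLruCache:
    @functools.lru_cache(maxsize=None)
    def lookup(self, curie):
        # The cache lives on the function and keys on (self, curie).
        return curie.upper()


class WithDictCache:
    def __init__(self):
        self._cache = {}  # the cache is owned by the instance

    def lookup(self, curie):
        if curie not in self._cache:
            self._cache[curie] = curie.upper()
        return self._cache[curie]


def survives_deletion(cls):
    """Call lookup once, drop the only reference, report if the object is still alive."""
    obj = cls()
    obj.lookup("example:1")
    ref = weakref.ref(obj)
    del obj
    gc.collect()
    return ref() is not None


lru_alive = survives_deletion(WithLruCache)    # instance kept alive by the cache key
dict_alive = survives_deletion(WithDictCache)  # instance and its cache are freed
```

With the decorator, the instance stays reachable through the cache until `cache_clear()` is called; with the instance-level dict, the cache is released together with the object.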
urllib.parse.urljoin resolves relative paths by replacing the base URL's last
path segment rather than appending to the full path, so a multi-level url_base
without a trailing slash, such as https://host/a/b + 'duckdb/file', would
produce https://host/a/duckdb/file.

Replace with simple string concatenation (url_base + dirpath), which is
safe and unambiguous because the constructor already enforces a trailing /.
Also drop the now-unused urllib.parse import.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
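
For reference, a quick check of how urljoin treats a multi-level base compared with plain concatenation. The host and paths are placeholders. A path segment is lost when the base has no trailing slash or when the relative part starts with a slash; concatenation onto a slash-terminated base has neither edge case.

```python
from urllib.parse import urljoin

# Base without a trailing slash: the last segment "b" is replaced.
no_slash = urljoin("https://host/a/b", "duckdb/file")
# -> "https://host/a/duckdb/file"

# Base with a trailing slash: the full base path is kept.
with_slash = urljoin("https://host/a/b/", "duckdb/file")
# -> "https://host/a/b/duckdb/file"

# Relative part with a leading slash: the whole base path is discarded.
leading_slash = urljoin("https://host/a/b/", "/duckdb/file")
# -> "https://host/duckdb/file"

# Plain concatenation onto a base that is known to end in "/".
concatenated = "https://host/a/b/" + "duckdb/file"
# -> "https://host/a/b/duckdb/file"
```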
Making src/ itself a Python package can interfere with import resolution
when the project is installed via hatchling (which uses src layout). Only
src/babel_explorer/__init__.py and src/babel_explorer/core/__init__.py are
needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- README.md: add missing trailing newline
- tests/__init__.py: remove stray comment (test __init__ files should be empty)
- FUTURE.md: replace duplicated text with links to GitHub issues #12 and #13
  where the two known performance improvements are now tracked

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
No logic changes — ruff wrapped long import lists and long lines to meet
the project's line-length limit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 19 out of 24 changed files in this pull request and generated 3 comments.

Comment thread tests/test_babel_xrefs.py
@pytest.mark.parametrize("curie", VALID_CURIES)
def test_get_curie_xref(babel_xrefs, curie):
    """get_curie_xref returns non-empty CrossReferences with the queried CURIE."""
    babel_xrefs.get_curie_xref.cache_clear()
Comment thread src/babel_explorer/cli.py Outdated
Comment thread src/babel_explorer/core/babel_xrefs.py Outdated
gaurav and others added 2 commits May 17, 2026 23:14
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
gaurav and others added 3 commits May 17, 2026 23:52
…ution

Replaces the implicit DuckDB pattern (assign a relation to a Python
variable, then reference that name in a SQL string) with explicit
read_parquet($N) parameterised calls. The recursive query wraps the
scan in a MATERIALIZED CTE so the parquet is read once regardless of
how many CTEs reference it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
lru_cache on get_downloaded_file caused a latent bug: once it cached a
local path, subsequent calls bypassed the os.path.exists() guard —
returning a stale path if the file was deleted mid-run. The three-tier
freshness logic (meta + ETag) already prevents redundant network calls,
so the cache adds no benefit and only introduces this risk.

Also corrects the misleading "connection only" comment on the requests
timeout: it is a per-read idle timeout, not a total-transfer limit.

Tests updated to remove cache_clear() calls and rename the caching test
to reflect that the freshness window is now the deduplication mechanism.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes #14. lru_cache on instance methods holds a reference to self in
the cache key, preventing garbage collection of NodeNorm instances for
the lifetime of the process.

Each of the three methods now checks and populates a dedicated dict
(_normalize_cache, _identifier_cache, _clique_cache) on the instance.
This makes caching scope explicit: the cache lives and dies with the
object, and callers who need fresh results simply instantiate a new
NodeNorm. HTTP errors in normalize_curie are intentionally not cached
so a transient failure does not permanently suppress retries.

Tests updated to remove cache_clear() calls — unit tests already
construct a fresh NodeNorm per test case via _make_nn(), and integration
tests are parametrized per-CURIE so cached results do not interfere.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
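
A per-instance cache that skips failed lookups might look roughly like the sketch below. `CachedClient` and `flaky_fetch` are hypothetical; the real NodeNorm class issues HTTP requests and may signal errors differently (for example by returning a value rather than raising).

```python
class CachedClient:
    """Hypothetical client; fetch is any callable that may raise on failure."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._normalize_cache = {}  # lives and dies with this instance

    def normalize_curie(self, curie):
        if curie in self._normalize_cache:
            return self._normalize_cache[curie]
        result = self._fetch(curie)  # an exception propagates; nothing is stored
        self._normalize_cache[curie] = result
        return result


calls = []


def flaky_fetch(curie):
    calls.append(curie)
    if len(calls) == 1:
        raise ConnectionError("transient failure")
    return {"id": curie}


client = CachedClient(flaky_fetch)
try:
    client.normalize_curie("EXAMPLE:1")  # first attempt fails, not cached
except ConnectionError:
    pass
second = client.normalize_curie("EXAMPLE:1")  # retried and cached
third = client.normalize_curie("EXAMPLE:1")   # served from the dict
```

Because only successful results are stored, the failed first attempt is retried on the next call, while the third call makes no further request.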