Add vocab_labels.parquet builder (#148 step 1) #149
Builds `vocab_labels.parquet` from the SKOS TTL vocabularies that `generate_vocab_docs.sh` already pulls (10 TTLs across 4 vocabulary repos: core, Earth Science, Archaeology/OpenContext, Biology). Produces 537 rows / 535 unique URIs / 10 schemes / `en` + `de`.

Output columns: `uri`, `uri_form`, `pref_label`, `lang`, `scheme`, `definition`, `alt_labels`, `source_ttl`.

The artifact is intended to be consumed by:

- the Explorer (Quarto/OJS) — JOIN facet URIs onto `pref_label`
- isamples-python notebooks — enrich queries on `IdentifiedConcept`
- pqg facet-summaries — bake labels into facet summaries at build time
- any future React UI — a small JSON dump from the same source

First step of isamplesorg#148. This commit only adds the builder; publishing the parquet to data.isamples.org and wiring up consumers is follow-up work.

Two real-world wrinkles handled in the builder, both flagged on the issue:

1. **`/1.0/` URI mismatch.** TTLs declare concepts at e.g. `.../materialsampleobjecttype/wholeorganism`, but iSamples export records (and every downstream parquet) carry a `/1.0/` version segment in the URI: `.../materialsampleobjecttype/1.0/wholeorganism`. The convention comes from the export itself, not from pqg. The builder emits dual-form rows tagged with a `uri_form` column (`"vocab"` vs `"data_v1"`) so consumers can JOIN on either form. It also handles per-prefix oddities: OpenContext uses `/0.1/` instead of `/1.0/`, and biology data has inconsistent slug casing (`Animalia`/`Fungi`/`Plantae` but `bacteria`/`protozoa`) — both casings are emitted as aliases.
2. **Cross-vocab redeclarations.** 17 material URIs are declared in 2-3 different TTLs with slightly different labels (whitespace, casing). The builder dedupes on `(uri, lang)` keys, preferring the TTL whose URL matches the concept's expected canonical owner, and preserves the losing labels in `alt_labels` so no information is lost.
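The dual-form emission can be sketched as below. This is a hedged illustration, not the builder's actual code: `data_form` and `emit_rows` are hypothetical names, and the example URI is a placeholder.

```python
# Sketch of the dual-form URI emission described above.
# data_form/emit_rows are illustrative names; the real builder may differ.

def data_form(vocab_uri: str, version: str = "1.0") -> str:
    """Insert the export-side version segment before the final slug:
    .../materialsampleobjecttype/wholeorganism
    -> .../materialsampleobjecttype/1.0/wholeorganism
    (An OpenContext prefix would pass version="0.1" instead.)"""
    head, _, slug = vocab_uri.rpartition("/")
    return f"{head}/{version}/{slug}"

def emit_rows(uri: str, pref_label: str, lang: str, scheme: str,
              version: str = "1.0"):
    """Yield one row per URI form so consumers can JOIN on either."""
    base = {"pref_label": pref_label, "lang": lang, "scheme": scheme}
    yield {"uri": uri, "uri_form": "vocab", **base}
    yield {"uri": data_form(uri, version), "uri_form": "data_v1", **base}
```

Casing aliases (the `Animalia`/`bacteria` wrinkle) would add further rows with the alternate slug under the same `(uri_form, lang)` scheme.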
Coverage against the 55 distinct `IdentifiedConcept` URIs in the current wide parquet: 51/55 (93%) by URI, >99.99% sample-weighted on every facet (material, context, object_type). The 4 residuals are upstream data-quality issues — concept URIs in source data that no TTL declares (`opencontext/material/0.1/organicanimalproduct`, `.../plantmaterial`, `vocabulary/specimentype/1.0/othersolidobject`, `.../physicalspecimen`) — not something this artifact can fix.

Run:

```
python scripts/build_vocab_labels.py [-o out.parquet] [--also-csv]
```

Deps: rdflib, pandas, pyarrow (myenv: rdflib==6.3.2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… + pin deps

- Default behavior is now to fail loud (exit 3) if any TTL source fails to fetch or parse; add `--allow-partial` to override. Since this artifact is intended for publishing, a partial parquet should not be silently produced.
- Add `scripts/requirements.txt` pinning the script's runtime deps (rdflib, pandas, pyarrow). Kept separate from the site-build `../requirements.txt` because these scripts are not run in CI; the file just lets a fresh checkout run them.

Verified: injecting a bogus TTL URL yields exit 3 by default, exit 0 with `--allow-partial`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
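The fail-loud contract could look roughly like this. A minimal sketch under stated assumptions: `load_sources` and the injected `fetch` callable are hypothetical, only the exit-3 / `--allow-partial` behavior comes from the commit message.

```python
import sys

def load_sources(urls, fetch, allow_partial=False):
    """Fetch/parse every TTL source; fetch(url) raises on failure.

    Hypothetical helper: collects all failures rather than stopping at
    the first, then aborts with exit code 3 unless allow_partial is set.
    """
    results, failures = [], []
    for url in urls:
        try:
            results.append(fetch(url))
        except Exception as exc:
            failures.append((url, exc))
    if failures and not allow_partial:
        for url, exc in failures:
            print(f"FAILED: {url}: {exc}", file=sys.stderr)
        # Fail loud: a partial parquet must not be silently produced.
        sys.exit(3)
    return results
```

Collecting all failures before exiting means one run reports every broken source, which matters when 10 TTLs are fetched per build.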
Both Codex findings addressed in 30b3c81:

- P2 #1 — partial-fetch failures producing a successful artifact. Verified by injecting a bogus URL into …
- P2 #2 — script deps not in repo environment. Run: …

Thanks for the review.
First step of #148.
Summary
What it doesn't do (yet)
Publishing the parquet and wiring consumers: both follow-ups, in their own PRs once we agree on the artifact shape.
Two surprises handled (full write-up on the issue)
Residual unresolved (4 URIs, ~169 / ~6M samples)
Upstream data-quality issues — concept URIs in source data that no TTL declares:
Not fixable in this artifact — flagged for #148 as design debt.
Run / deps
```
python scripts/build_vocab_labels.py [-o out.parquet] [--also-csv]
```
Deps: `rdflib pandas pyarrow`. Not run in site CI; this is a one-shot artifact builder.
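For consumers, the intended JOIN is straightforward. A minimal pandas sketch with a tiny in-memory stand-in for the parquet — a real consumer would `pd.read_parquet` the published artifact, and the URIs here are placeholders:

```python
import pandas as pd

# Tiny stand-in for vocab_labels.parquet (columns per the artifact schema).
labels = pd.DataFrame([
    {"uri": "https://example.org/materialsampleobjecttype/wholeorganism",
     "uri_form": "vocab", "pref_label": "Whole organism", "lang": "en"},
    {"uri": "https://example.org/materialsampleobjecttype/1.0/wholeorganism",
     "uri_form": "data_v1", "pref_label": "Whole organism", "lang": "en"},
])

# Export-side records carry the /1.0/ segment, so join on the data_v1 form.
en = labels[(labels["lang"] == "en") & (labels["uri_form"] == "data_v1")]
facets = pd.DataFrame({"material":
    ["https://example.org/materialsampleobjecttype/1.0/wholeorganism"]})
labeled = facets.merge(en[["uri", "pref_label"]],
                       left_on="material", right_on="uri", how="left")
print(labeled["pref_label"].tolist())  # ['Whole organism']
```

A left join keeps unlabeled facet rows (the 4 residual URIs above) visible as `NaN` rather than dropping them.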
Test plan
🤖 Generated with Claude Code