
Add vocab_labels.parquet builder (#148 step 1) #149

Merged
rdhyee merged 2 commits into isamplesorg:main from rdhyee:feature/vocab-labels
Apr 29, 2026

Conversation

Contributor

@rdhyee rdhyee commented Apr 28, 2026

First step of #148.

Summary

  • Adds `scripts/build_vocab_labels.py` — parses the same 10 SKOS TTLs that `scripts/generate_vocab_docs.sh` already pulls (core + Earth Sci + OpenContext + Biology), emits a single `vocab_labels.parquet` with columns: `uri, uri_form, pref_label, lang, scheme, definition, alt_labels, source_ttl`.
  • 537 rows / 535 unique URIs / 10 schemes / `en` + `de`.
  • Coverage against the 55 distinct `IdentifiedConcept` URIs in the current wide parquet: 51/55 (93%) by URI, >99.99% sample-weighted on every facet (material, context, object_type).
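
The two coverage figures measure different things: by-URI coverage counts distinct concept URIs that resolve to a label, while sample-weighted coverage weights each URI by the number of samples that carry it. A minimal sketch of that arithmetic follows. The `coverage` helper, the `ex:` URIs, and the counts are made up for illustration and are not taken from the builder or the real parquet.

```python
def coverage(sample_counts, labelled_uris):
    """Return (by_uri, sample_weighted) coverage fractions.

    sample_counts: dict mapping concept URI -> number of samples using it
    labelled_uris: set of URIs that have a row in the labels table
    """
    total_uris = len(sample_counts)
    total_samples = sum(sample_counts.values())
    hit_uris = [u for u in sample_counts if u in labelled_uris]
    hit_samples = sum(sample_counts[u] for u in hit_uris)
    return len(hit_uris) / total_uris, hit_samples / total_samples


# Illustrative numbers only: three common URIs resolve, one rare URI does not.
counts = {"ex:a": 500_000, "ex:b": 300_000, "ex:c": 199_990, "ex:rare": 10}
by_uri, weighted = coverage(counts, {"ex:a", "ex:b", "ex:c"})
print(f"{by_uri:.0%} by URI, {weighted:.4%} sample-weighted")
```

This is why a handful of unresolved URIs can leave by-URI coverage visibly below 100% while barely moving the sample-weighted figure: the unresolved URIs are attached to very few samples.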

What it doesn't do (yet)

  • Publish to `data.isamples.org`.
  • Wire any consumer (Explorer, notebooks, pqg facet-summaries) to use it.

Both are follow-ups, each in its own PR once we agree on the artifact shape.

Two surprises handled (full write-up on the issue)

  1. `/1.0/` URI mismatch. TTLs declare concepts without a version segment, but iSamples export records carry `/1.0/` in the URI. Confirmed: the version segment is in the export itself, propagated faithfully by pqg. The builder emits dual-form rows tagged `uri_form = "vocab" | "data_v1"` so JOINs work on either form. Per-prefix oddities (OpenContext uses `/0.1/`; biology data has inconsistent casing — `Animalia/Fungi/Plantae` but `bacteria/protozoa`) are handled.
  2. Cross-vocab redeclarations. 17 material URIs declared in 2-3 TTLs each with slightly different labels (whitespace, casing). Builder dedupes `(uri, lang)` preferring the canonical TTL; losers' labels preserved in `alt_labels`.
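
As a rough illustration of the dual-form idea in item 1, the sketch below emits both a `vocab` and a `data_v1` row for one declared URI and builds a lookup that resolves either form. It is an assumption-laden sketch, not the builder's code: `uri_forms`, the `example.org` base URI, and the label text are hypothetical, and the real script's handling of the `/0.1/` prefix and casing aliases is not reproduced here.

```python
def uri_forms(vocab_uri, version="1.0"):
    """Yield (uri, uri_form) pairs for one TTL-declared concept URI.

    The versioned form inserts the version segment before the final
    path segment: .../objecttype/slug -> .../objecttype/1.0/slug
    """
    prefix, _, slug = vocab_uri.rpartition("/")
    yield vocab_uri, "vocab"
    yield f"{prefix}/{version}/{slug}", "data_v1"


# Hypothetical base URI and label; real values come from the TTLs.
declared = {"https://example.org/vocab/objecttype/wholeorganism": "example label"}

# Key the lookup by every emitted form so a JOIN works on either one.
labels = {}
for uri, label in declared.items():
    for form_uri, form in uri_forms(uri):
        labels[form_uri] = (label, form)

print(labels["https://example.org/vocab/objecttype/1.0/wholeorganism"])
```

Emitting both forms as rows, rather than normalizing URIs at query time, keeps consumers to a plain equality JOIN at the cost of roughly doubling the row count for affected concepts.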

Residual unresolved (4 URIs, ~169 / ~6M samples)

Upstream data-quality issues — concept URIs in source data that no TTL declares:

  • `opencontext/material/0.1/organicanimalproduct`
  • `opencontext/material/0.1/plantmaterial`
  • `vocabulary/specimentype/1.0/othersolidobject` (declared as `msot:othersolidobject` in the wrong namespace)
  • `vocabulary/specimentype/1.0/physicalspecimen` (likewise)

Not fixable in this artifact — flagged for #148 as design debt.

Run / deps

```
python scripts/build_vocab_labels.py [-o out.parquet] [--also-csv]
```

Deps: `rdflib pandas pyarrow`. Not run in site CI; this is a one-shot artifact builder.

Test plan

  • `python scripts/build_vocab_labels.py -o /tmp/v.parquet` succeeds with no warnings beyond the documented dedup.
  • Sanity-check resolved labels for representative facet URIs (`Animalia`, `Fungi`, `Whole organism material sample`, `Past human occupation site`, `Gaseous material`).
  • Coverage check against current wide parquet: 51/55 URIs, >99.99% sample-weighted.

🤖 Generated with Claude Code

rdhyee and others added 2 commits April 28, 2026 16:19
…ilder

Builds vocab_labels.parquet from the SKOS TTL vocabularies that
generate_vocab_docs.sh already pulls (10 TTLs across 4 vocabulary repos:
core, Earth Science, Archaeology/OpenContext, Biology). Produces
537 rows / 535 unique URIs / 10 schemes / en + de.

Output columns: uri, uri_form, pref_label, lang, scheme, definition,
alt_labels, source_ttl. The artifact is intended to be consumed by:
  - the Explorer (Quarto/OJS) — JOIN facet URIs onto pref_label
  - isamples-python notebooks — enrich queries on IdentifiedConcept
  - pqg facet-summaries — bake labels into facet summaries at build time
  - any future React UI — small JSON dump from the same source

First step of isamplesorg#148. This commit only adds the builder; publishing the
parquet to data.isamples.org and wiring consumers is follow-up work.

Two real-world wrinkles handled in the builder, both flagged on the issue:

1. /1.0/ URI mismatch. TTLs declare concepts at e.g.
   .../materialsampleobjecttype/wholeorganism, but iSamples export
   records (and every downstream parquet) carry a /1.0/ version segment
   in the URI: .../materialsampleobjecttype/1.0/wholeorganism. The
   convention is in the export itself, not added by pqg. The builder
   emits dual-form rows tagged with a uri_form column ("vocab" vs
   "data_v1") so consumers can JOIN on either form. Also handles per-
   prefix oddities: OpenContext uses /0.1/ instead of /1.0/, and
   biology data has inconsistent slug casing (Animalia/Fungi/Plantae
   but bacteria/protozoa) — both casings emitted as aliases.

2. Cross-vocab redeclarations. 17 material URIs are declared in 2-3
   different TTLs with slightly different labels (whitespace, casing).
   The builder dedupes (uri, lang) keys, preferring the TTL whose URL
   matches the concept's expected canonical owner, and preserves the
   losers' labels in alt_labels so no information is lost.
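
The dedup rule in item 2 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the builder's implementation: `dedupe`, `canonical_ttl_for`, the `ex:` URI, and the TTL file names are all hypothetical.

```python
def dedupe(rows, canonical_ttl_for):
    """Collapse rows sharing (uri, lang) into one, keeping losers' labels.

    rows: iterable of dicts with keys uri, lang, pref_label, source_ttl
    canonical_ttl_for: callable mapping a URI to its preferred source_ttl
    """
    grouped = {}
    for row in rows:
        grouped.setdefault((row["uri"], row["lang"]), []).append(row)

    out = []
    for (uri, _lang), group in grouped.items():
        preferred = canonical_ttl_for(uri)
        # Stable sort: rows from the canonical TTL first, input order otherwise.
        group = sorted(group, key=lambda r: r["source_ttl"] != preferred)
        winner = dict(group[0])
        # Keep each distinct losing label (exact string comparison).
        alts = []
        for r in group[1:]:
            lbl = r["pref_label"]
            if lbl != winner["pref_label"] and lbl not in alts:
                alts.append(lbl)
        winner["alt_labels"] = alts
        out.append(winner)
    return out


rows = [
    {"uri": "ex:rock", "lang": "en", "pref_label": "rock ", "source_ttl": "earthsci.ttl"},
    {"uri": "ex:rock", "lang": "en", "pref_label": "Rock", "source_ttl": "core.ttl"},
]
result = dedupe(rows, canonical_ttl_for=lambda uri: "core.ttl")
print(result)
```
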

Coverage against the 55 distinct IdentifiedConcept URIs in the current
wide parquet: 51/55 (93%) by URI, >99.99% sample-weighted on every
facet (material, context, object_type). The 4 residuals are upstream
data-quality issues — concept URIs in source data that no TTL declares
(opencontext/material/0.1/organicanimalproduct, .../plantmaterial,
vocabulary/specimentype/1.0/othersolidobject, .../physicalspecimen) —
not something this artifact can fix.

Run: python scripts/build_vocab_labels.py [-o out.parquet] [--also-csv]
Deps: rdflib, pandas, pyarrow (myenv: rdflib==6.3.2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… + pin deps

- Default behavior is now to fail loud (exit 3) if any TTL source
  fails to fetch or parse. Add --allow-partial to override. Since
  this artifact is intended for publishing, a partial parquet should
  not be silently produced.
- Add scripts/requirements.txt pinning the script's runtime deps
  (rdflib, pandas, pyarrow). Kept separate from the site-build
  ../requirements.txt because these scripts are not run in CI; this
  file is just so a fresh checkout can run them.

Verified: injecting a bogus TTL URL yields exit 3 by default, exit 0
with --allow-partial.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor Author

rdhyee commented Apr 28, 2026

Both Codex findings addressed in 30b3c81:

P2 #1 — partial-fetch failures producing a successful artifact
Default behavior is now fail-loud (exit 3) if any TTL source fails to fetch/parse. Since this artifact is intended for publishing, silently emitting a partial parquet is the wrong default. Added --allow-partial for explicit override (useful during local debugging).

Verified by injecting a bogus URL into VOCAB_TTLS:

  • without flag → exit 3, error listed on stderr
  • with --allow-partial → exit 0, parquet written
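
The fail-loud pattern described above can be sketched as follows. Only the `--allow-partial` flag and the exit code 3 come from this PR; `fetch_all`, `main`'s signature, `fake_fetch`, and the `example.org` URLs are hypothetical stand-ins rather than the script's actual structure.

```python
import argparse
import sys


def fetch_all(sources, fetch):
    """Try every source; return (results, errors) instead of raising."""
    results, errors = {}, {}
    for url in sources:
        try:
            results[url] = fetch(url)
        except Exception as exc:  # collect every failure, report them together
            errors[url] = exc
    return results, errors


def main(argv, sources, fetch):
    parser = argparse.ArgumentParser()
    parser.add_argument("--allow-partial", action="store_true")
    args = parser.parse_args(argv)

    results, errors = fetch_all(sources, fetch)
    for url, exc in errors.items():
        print(f"ERROR: {url}: {exc}", file=sys.stderr)
    if errors and not args.allow_partial:
        return 3  # fail loud: stop before any artifact is written
    # A real builder would assemble and write the parquet from `results` here.
    return 0


def fake_fetch(url):
    # Stand-in for fetching and parsing a TTL; fails on the bogus URL.
    if "bogus" in url:
        raise IOError("cannot fetch")
    return "parsed"


urls = ["https://example.org/a.ttl", "https://example.org/bogus.ttl"]
print(main([], urls, fake_fetch), main(["--allow-partial"], urls, fake_fetch))
```

Collecting all errors before deciding lets a single run report every failing source, instead of stopping at the first one.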

P2 #2 — script deps not in repo environment
Added scripts/requirements.txt declaring minimum versions: rdflib>=6.3, pandas>=2.0, pyarrow>=14. Kept separate from the site-build ../requirements.txt because the scripts here are one-shot artifact builders that are not run in CI; pulling them into every Quarto render would be noise. Updated the docstring to reference it.

Run:

pip install -r scripts/requirements.txt
python scripts/build_vocab_labels.py -o vocab_labels.parquet

Thanks for the review.

@rdhyee rdhyee merged commit d869ccf into isamplesorg:main Apr 29, 2026
1 check passed