Add vocab_labels.parquet builder (#148 step 1) #149
Builds `vocab_labels.parquet` from the SKOS TTL vocabularies that `generate_vocab_docs.sh` already pulls (10 TTLs across 4 vocabulary repos: core, Earth Science, Archaeology/OpenContext, Biology). Produces 537 rows / 535 unique URIs / 10 schemes / `en` + `de`.

Output columns: `uri`, `uri_form`, `pref_label`, `lang`, `scheme`, `definition`, `alt_labels`, `source_ttl`.

The artifact is intended to be consumed by:

- the Explorer (Quarto/OJS) — JOIN facet URIs onto `pref_label`
- isamples-python notebooks — enrich queries on `IdentifiedConcept`
- pqg facet-summaries — bake labels into facet summaries at build time
- any future React UI — a small JSON dump from the same source

First step of isamplesorg#148. This commit only adds the builder; publishing the parquet to data.isamples.org and wiring up consumers is follow-up work.

Two real-world wrinkles handled in the builder, both flagged on the issue:

1. **`/1.0/` URI mismatch.** TTLs declare concepts at e.g. `.../materialsampleobjecttype/wholeorganism`, but iSamples export records (and every downstream parquet) carry a `/1.0/` version segment in the URI: `.../materialsampleobjecttype/1.0/wholeorganism`. The convention comes from the export itself, not from pqg. The builder emits dual-form rows tagged with a `uri_form` column (`"vocab"` vs `"data_v1"`) so consumers can JOIN on either form. It also handles per-prefix oddities: OpenContext uses `/0.1/` instead of `/1.0/`, and biology data has inconsistent slug casing (`Animalia`/`Fungi`/`Plantae` but `bacteria`/`protozoa`) — both casings are emitted as aliases.
2. **Cross-vocab redeclarations.** 17 material URIs are declared in 2-3 different TTLs with slightly different labels (whitespace, casing). The builder dedupes on `(uri, lang)` keys, preferring the TTL whose URL matches the concept's expected canonical owner, and preserves the losing labels in `alt_labels` so no information is lost.
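The dual-form emission can be sketched as below. This is a hedged illustration, not the builder's actual code: `data_form` and `emit_rows` are hypothetical names, and the example URI is a placeholder.

```python
# Sketch of the dual-form URI emission described above.
# data_form/emit_rows are illustrative names; the real builder may differ.

def data_form(vocab_uri: str, version: str = "1.0") -> str:
    """Insert the export-side version segment before the final slug:
    .../materialsampleobjecttype/wholeorganism
    -> .../materialsampleobjecttype/1.0/wholeorganism
    (An OpenContext prefix would pass version="0.1" instead.)"""
    head, _, slug = vocab_uri.rpartition("/")
    return f"{head}/{version}/{slug}"

def emit_rows(uri: str, pref_label: str, lang: str, scheme: str,
              version: str = "1.0"):
    """Yield one row per URI form so consumers can JOIN on either."""
    base = {"pref_label": pref_label, "lang": lang, "scheme": scheme}
    yield {"uri": uri, "uri_form": "vocab", **base}
    yield {"uri": data_form(uri, version), "uri_form": "data_v1", **base}
```

Casing aliases (the `Animalia`/`bacteria` wrinkle) would add further rows with the alternate slug under the same `(uri_form, lang)` scheme.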
Coverage against the 55 distinct `IdentifiedConcept` URIs in the current wide parquet: 51/55 (93%) by URI, >99.99% sample-weighted on every facet (material, context, object_type). The 4 residuals are upstream data-quality issues — concept URIs in source data that no TTL declares (`opencontext/material/0.1/organicanimalproduct`, `.../plantmaterial`, `vocabulary/specimentype/1.0/othersolidobject`, `.../physicalspecimen`) — not something this artifact can fix.

Run:

```
python scripts/build_vocab_labels.py [-o out.parquet] [--also-csv]
```

Deps: rdflib, pandas, pyarrow (myenv: rdflib==6.3.2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… + pin deps

- Default behavior is now to fail loud (exit 3) if any TTL source fails to fetch or parse; add `--allow-partial` to override. Since this artifact is intended for publishing, a partial parquet should not be silently produced.
- Add `scripts/requirements.txt` pinning the script's runtime deps (rdflib, pandas, pyarrow). Kept separate from the site-build `../requirements.txt` because these scripts are not run in CI; the file just lets a fresh checkout run them.

Verified: injecting a bogus TTL URL yields exit 3 by default, exit 0 with `--allow-partial`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
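The fail-loud contract could look roughly like this. A minimal sketch under stated assumptions: `load_sources` and the injected `fetch` callable are hypothetical, only the exit-3 / `--allow-partial` behavior comes from the commit message.

```python
import sys

def load_sources(urls, fetch, allow_partial=False):
    """Fetch/parse every TTL source; fetch(url) raises on failure.

    Hypothetical helper: collects all failures rather than stopping at
    the first, then aborts with exit code 3 unless allow_partial is set.
    """
    results, failures = [], []
    for url in urls:
        try:
            results.append(fetch(url))
        except Exception as exc:
            failures.append((url, exc))
    if failures and not allow_partial:
        for url, exc in failures:
            print(f"FAILED: {url}: {exc}", file=sys.stderr)
        # Fail loud: a partial parquet must not be silently produced.
        sys.exit(3)
    return results
```

Collecting all failures before exiting means one run reports every broken source, which matters when 10 TTLs are fetched per build.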
Both Codex findings addressed in 30b3c81:

- P2 #1 — partial-fetch failures producing a successful artifact. Verified by injecting a bogus URL into …
- P2 #2 — script deps not in repo environment. Run: …

Thanks for the review.
First step of #148.
Summary
What it doesn't do (yet)
Publishing the parquet and wiring consumers: both follow-ups, in their own PRs once we agree on the artifact shape.
Two surprises handled (full write-up on the issue)
Residual unresolved (4 URIs, ~169 / ~6M samples)
Upstream data-quality issues — concept URIs in source data that no TTL declares:
Not fixable in this artifact — flagged for #148 as design debt.
Run / deps
```
python scripts/build_vocab_labels.py [-o out.parquet] [--also-csv]
```
Deps: `rdflib pandas pyarrow`. Not run in site CI; this is a one-shot artifact builder.
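For consumers, the intended JOIN is straightforward. A minimal pandas sketch with a tiny in-memory stand-in for the parquet — a real consumer would `pd.read_parquet` the published artifact, and the URIs here are placeholders:

```python
import pandas as pd

# Tiny stand-in for vocab_labels.parquet (columns per the artifact schema).
labels = pd.DataFrame([
    {"uri": "https://example.org/materialsampleobjecttype/wholeorganism",
     "uri_form": "vocab", "pref_label": "Whole organism", "lang": "en"},
    {"uri": "https://example.org/materialsampleobjecttype/1.0/wholeorganism",
     "uri_form": "data_v1", "pref_label": "Whole organism", "lang": "en"},
])

# Export-side records carry the /1.0/ segment, so join on the data_v1 form.
en = labels[(labels["lang"] == "en") & (labels["uri_form"] == "data_v1")]
facets = pd.DataFrame({"material":
    ["https://example.org/materialsampleobjecttype/1.0/wholeorganism"]})
labeled = facets.merge(en[["uri", "pref_label"]],
                       left_on="material", right_on="uri", how="left")
print(labeled["pref_label"].tolist())  # ['Whole organism']
```

A left join keeps unlabeled facet rows (the 4 residual URIs above) visible as `NaN` rather than dropping them.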
Test plan
🤖 Generated with Claude Code