
notebooks: render vocabulary URIs as SKOS prefLabels (#148) #5

Open
rdhyee wants to merge 15 commits into isamplesorg:main from rdhyee:notebooks-vocab-labels

Conversation

Contributor

@rdhyee rdhyee commented Apr 29, 2026

Wires the canonical `vocab_labels.parquet` artifact (60 KB, now live at `data.isamples.org`, built by isamplesorg/isamplesorg.github.io#149) into the example notebooks so they render concept URIs as human-readable prefLabels. Companion to isamplesorg.github.io#150 (Explorer wiring).

Tracks isamplesorg/isamplesorg.github.io#148.

Summary

  • `examples/basic/vocab_labels.py` — new helper module:
    • `load_vocab_labels(url, lang='en')` — one HTTP fetch (~60 KB), returns `{uri: pref_label}`.
    • `pretty_label(uri, labels)` — returns the prefLabel, falling back to the URI tail for the 4 known unresolved URIs (~169 / 6M samples; flagged as upstream design debt in #148).
  • `examples/basic/geoparquet.ipynb` — adds a self-contained demo cell after the existing material-category section that lists 15 IdentifiedConcept material URIs from the wide parquet alongside their resolved prefLabels.
  • `examples/basic/pqg_demo.ipynb` — adds a follow-on cell to Example 4 (Material Type Distribution) that re-prints the same counts with prefLabels.
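A minimal sketch of what the helper's shape might be. The duckdb-based loader, parquet column names (`uri`, `pref_label`, `lang`), and internals here are illustrative assumptions, not the actual module:

```python
def load_vocab_labels(url: str, lang: str = "en") -> dict[str, str]:
    """One HTTP fetch of the ~60 KB parquet; returns {uri: pref_label}.
    Column names are assumed, not confirmed against the artifact schema."""
    import duckdb  # deferred import; pretty_label() below stays dependency-free
    rows = duckdb.execute(
        "SELECT uri, pref_label FROM read_parquet(?) WHERE lang = ?",
        [url, lang],
    ).fetchall()
    return dict(rows)


def pretty_label(uri: str, labels: dict[str, str]) -> str:
    """prefLabel when known; otherwise fall back to the URI tail
    (covers the 4 known unresolved URIs)."""
    return labels.get(uri) or uri.rstrip("/").rsplit("/", 1)[-1]
```

The fallback keeps the notebooks rendering something readable even for concepts missing from the lookup.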

Verification

End-to-end smoke against live `data.isamples.org/current/wide.parquet`:

```
Loaded 535 URI -> prefLabel entries
pref_label              uri
organicanimalproduct    https://w3id.org/isample/opencontext/material/0.1/organicanimalproduct  ← known unresolved (#148)
Organic material        https://w3id.org/isample/vocabulary/material/1.0/organicmaterial
Natural Solid Material  https://w3id.org/isample/vocabulary/material/1.0/earthmaterial
```

Why client-side

Temporary workaround. The proper fix is populating `IdentifiedConcept.label` with prefLabels at PQG-build time (upstream `pqg` work). Until then, every consumer (Explorer, notebooks, future detail panels) JOINs the same lookup file.
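The "every consumer JOINs the same lookup" pattern could look like the sketch below. The artifact URL and the `material_counts` relation are hypothetical placeholders; only the LEFT JOIN + COALESCE shape is the point:

```python
# Hypothetical shape of the shared-lookup JOIN; the exact artifact path on
# data.isamples.org and the column names (uri, pref_label) are assumptions.
VOCAB_LABELS_URL = "https://data.isamples.org/vocab_labels.parquet"  # assumed path

LABELED_MATERIAL_COUNTS_SQL = f"""
SELECT COALESCE(v.pref_label, m.material_uri) AS material, m.n
FROM material_counts m
LEFT JOIN read_parquet('{VOCAB_LABELS_URL}') v ON v.uri = m.material_uri
ORDER BY m.n DESC
"""
```

COALESCE gives the same URI-tail-style fallback behavior in SQL that `pretty_label()` gives in Python.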

claude and others added 15 commits February 11, 2026 01:40
- geoparquet.ipynb: Add H3-accelerated bbox filtering benchmark, H3 cell
  distribution stats, and Lonboard visualization using H3-indexed parquet
- isamples_explorer.ipynb: Load 2KB facet summaries at startup for instant
  widget population; add context and object_type dropdown filters
- h3_clustering.ipynb: New notebook demonstrating H3 clustering with
  Lonboard visualization, multi-resolution comparison, performance
  benchmarks, and hierarchical drill-down
- pqg_demo.ipynb: Add wide format shortcut section comparing graph
  traversal vs H3 spatial queries with performance timing

Closes #2, closes #3, closes #4, closes #5

https://claude.ai/code/session_01ADUWKdT6dM7gqmauf6TqWB
1. H3 bbox coverage (Critical): Replace corners+center sampling with
   full data query to find all res4 cells within bbox. The old approach
   missed edge cells for large bounding boxes, causing false negatives.
   Fixed in geoparquet.ipynb and pqg_demo.ipynb.

2. Empty covering list guard (High): Add check for empty cell list
   before building IN () SQL clause, which would be invalid SQL.
   Fixed in geoparquet.ipynb and pqg_demo.ipynb.

3. Material URI suffix collision (Critical): get_all_material_counts()
   now keys by (scheme, suffix) internally before collapsing to suffix,
   keeping the highest-count entry when different vocabularies share a
   suffix (e.g., "rock" in isample/vocabulary vs isample/opencontext).
   Fixed in isamples_explorer.ipynb.

4. N+1 rollup queries (Medium): Replaced per-suffix COUNT(DISTINCT)
   loop with single batch UNION ALL query in compute_accurate_rollup_counts().
   Fixed in isamples_explorer.ipynb.

5. Lazy-load heavy queries (High): Deferred get_all_material_counts()
   and get_year_range_stats() to first use instead of eager startup,
   preserving the "instant" facet experience from pre-computed summaries.
   Fixed in isamples_explorer.ipynb.

6. Graph traversal no-op (Low): Replaced bare `pass` with explanatory
   comment noting it's pseudocode. Fixed in pqg_demo.ipynb.
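The item-2 guard can be a short-circuit to a never-matching predicate before the clause is built; a sketch (the `h3_res4` column name is an assumption):

```python
def h3_in_clause(cells: list[int], column: str = "h3_res4") -> str:
    """SQL predicate restricting rows to the covering cells. An empty list
    would otherwise produce `IN ()`, which is invalid SQL, so short-circuit
    to a predicate that matches nothing."""
    if not cells:
        return "FALSE"
    return f"{column} IN ({', '.join(str(c) for c in cells)})"
```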

Addresses review: PR #6 comment #3882037312

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. H3 bbox coverage: Use h3 Python library (h3.geo_to_cells) to compute
   covering cells mathematically instead of scanning the data. This makes
   the benchmark meaningful — cell computation is O(1) relative to data
   size, pure geometry with no I/O. (Codex Critical + Gemini Performance)

2. Filter parity: All H3 queries now include otype='MaterialSampleRecord'
   to match baseline queries exactly. (Codex Medium)

3. SQL quoting: Fixed batch rollup query — was producing escaped quotes
   (\'rock\') instead of proper SQL quotes ('rock'). (Codex High)

4. Actually lazy startup: Deferred get_all_material_counts() and
   get_year_range_stats() for real — they now run on first accordion
   open, not at module import. Startup only loads 2KB summary parquet.
   (Codex High)

5. Scheme-aware rollup: expand_material_filters_with_rollup() now
   matches children within the same vocabulary scheme prefix, preventing
   cross-vocabulary suffix collisions during expansion. (Codex Critical)

6. Notebook JSON format: Restored list-of-strings source format in
   explorer notebook for clean git diffs. (Gemini Style)
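Item 1's data-free covering computation might be sketched as follows. It assumes h3-py >= 4, whose `geo_to_cells` accepts GeoJSON-like geometries; the bbox-to-GeoJSON helper is illustrative:

```python
def bbox_geojson(min_lng: float, min_lat: float,
                 max_lng: float, max_lat: float) -> dict:
    """GeoJSON Polygon for the bounding box (GeoJSON uses lng, lat order)."""
    ring = [
        (min_lng, min_lat), (max_lng, min_lat),
        (max_lng, max_lat), (min_lng, max_lat),
        (min_lng, min_lat),  # closed ring
    ]
    return {"type": "Polygon", "coordinates": [ring]}


def bbox_covering_cells(min_lng, min_lat, max_lng, max_lat, res: int = 4):
    """All H3 cells at `res` covering the bbox: pure geometry, no data scan,
    so cost is independent of data size."""
    import h3  # h3-py >= 4
    return h3.geo_to_cells(bbox_geojson(min_lng, min_lat, max_lng, max_lat), res)
```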

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Time accordion index: Changed idx==2 to idx==4 to match actual
   accordion children order (Sources=0, Material=1, Context=2,
   Object Type=3, Time=4). Also rebuilds decade checkboxes and
   slider bounds after lazy loading year stats. (Codex High)

2. Scheme-keyed material map: get_all_material_counts() now stores
   all_uris dict mapping {scheme_prefix: uri} per suffix. Rollup
   expansion looks up children by same scheme first, falls back to
   primary URI. Prevents cross-vocabulary collisions. (Codex High)

3. Signed BIGINT normalization: h3_to_signed() converts h3.str_to_int()
   output to signed int64 (val - 2**64 if val >= 2**63) to match
   DuckDB BIGINT storage. All current data is positive, but this
   guards against future cells with high bit set. (Codex Medium)
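The item-3 normalization, stated above as a formula, is a standard unsigned-to-two's-complement reinterpretation; as a pure function:

```python
def to_signed_int64(val: int) -> int:
    """Reinterpret an unsigned 64-bit value as signed int64 (two's
    complement), matching DuckDB BIGINT storage. Pair with h3.str_to_int()
    to get the stored form of an H3 cell id."""
    return val - 2**64 if val >= 2**63 else val
```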

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add H3 spatial optimization and facet summaries
Pins the current state of examples/basic/isamples_explorer.ipynb
before porting in the H3-tier strategy from the Cesium
progressive_globe frontend. This is the "crude sampling" version:

- load_samples() uses ROW_NUMBER() OVER (PARTITION BY source
  ORDER BY RANDOM()) with max_per_source=12500 → ~50K initial points
- adaptive_sample_size(zoom) caps per-source sample by zoom tier but
  still renders individual points at every zoom (no H3 aggregation,
  no pre-computed tiers)
- Every viewport change re-queries the 282 MB wide parquet

Landmark commit so we can return to this baseline if the H3-tier
port regresses UX.
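Per the description above, the crude sampler's query shape would be roughly the following; the parquet URL is the live one cited earlier in the PR, while the surrounding projection is an assumption:

```python
MAX_PER_SOURCE = 12500  # per the commit: yields ~50K initial points across sources

CRUDE_SAMPLE_SQL = f"""
SELECT * FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY source ORDER BY RANDOM()) AS rn
    FROM read_parquet('https://data.isamples.org/current/wide.parquet')
)
WHERE rn <= {MAX_PER_SOURCE}
"""
```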
Inserts 6 new cells after the existing load_samples() cell to
demonstrate the port of the Cesium progressive_globe strategy:

- Three pre-computed H3 summary parquets (res 4/6/8) from R2, ~4 MB
  total vs the 282 MB wide-parquet round-trip the crude sampler does.
- load_h3_tier(zoom, source_filters, bbox) returns one row per H3
  cell with sample_count, centroid, and dominant_source.
- make_h3_tier_layer() builds a lonboard ScatterplotLayer with radius
  scaled by log(sample_count) and colors from SOURCE_COLORS.
- Demo cell loads all three tiers and one mid-tier standalone map.

Scaffold ONLY — not yet wired into the widget viewport observer.
The existing load_samples() path and adaptive_sample_size() remain
untouched so the crude and tier strategies can be compared side by
side before integration.

Smoke-tested against the live R2 files:
- zoom 1.5 → res 4, 38,406 cells, ~500 ms cold
- zoom 5.0 → res 6, 111,681 cells, ~500 ms cold
- zoom 8.5 → res 8, 175,653 cells, ~500 ms cold
- source_filter and bbox predicates verified

Baseline pinned at snapshot/explorer-crude-sampling-2026-04-24.
Two bugs in the scaffold layer builder and demo:

1. make_h3_tier_layer() called hex_to_rgb on SOURCE_COLORS values,
   but SOURCE_COLORS (cell 3) already holds RGBA lists — not hex
   strings. AttributeError on .lstrip. Use the values directly and
   fall back to DEFAULT_COLOR for unknown sources.

2. Demo cell referenced BASEMAP_VOYAGER, which is defined later in
   cell 15 (Map Component). Scaffold cell 10 runs before that, so
   NameError. Define a local _TIER_BASEMAP inline from the same
   MaplibreBasemap + CartoStyle.Voyager.

Verified: `nbclient` executes all 28 cells to completion.
Adds a second toggle button next to Viewport Mode. When ON:

- load_viewport_data() branches at the top: if zoom_to_h3_resolution(zoom)
  is not None AND no tier-incompatible filter is active (material, year,
  search), load_h3_tier() returns ~100K aggregate rows from the
  pre-computed R2 parquet (a few MB, not hundreds of MB).
- _update_map_and_table_tier() replaces sample_map.layers with the
  ScatterplotLayer built from the tier centroids.
- Table shows aggregate rows (source / "N samples (H3 res R)" / lat / lng).
- Source facet still works — tier query filters on dominant_source.
- Material / year / search filters auto-fall-back to the crude sampler
  because the tier files don't carry those dimensions.

At zoom >= 10, zoom_to_h3_resolution returns None → crude sampler takes
over automatically. That's the handoff point to the individual-points
tier (item 5 in the scaffold's "still to do" list, not yet implemented).

Verified: nbclient executes all 28 cells clean. Actual widget interaction
needs to be tested in a running Jupyter session.

State changes:
- ExplorerState.h3_tier_mode flag (default False)
- h3_tier_toggle widget added to controls_row1 in cell 21

No changes to the crude-sampler code path.
Defaults were way too big:
- radius_scale=20000 m/log-unit meant a 10K-sample cell projected to
  ~184 km; at country zoom that saturated radius_max_pixels=40, so
  every high-count cell rendered as a 40-px blob.
- min_radius=5000 m forced even 1-sample cells into a visible size.

New defaults:
- radius_scale=3000  (log1p(100)·3000 ≈ 14 km)
- radius_min_pixels=2, radius_max_pixels=12
- drop the meters-min floor; pixel-min handles the low end.

Density now reads as density, not "big vs huge". All three tunables
are parameters so they can be overridden if Raymond wants different
behavior at specific zooms.
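The new log scaling is a one-liner; a sketch with the commit's defaults (the function name is illustrative):

```python
import math

def tier_radius_m(sample_count: int, radius_scale: float = 3000.0) -> float:
    """Meters radius for an H3 tier centroid, log-scaled by sample count.
    With the new default, log1p(100) * 3000 ≈ 13.8 km, matching the
    ≈14 km figure above; pixel min/max clamping happens in the layer."""
    return math.log1p(sample_count) * radius_scale
```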

Verified: nbclient runs all 28 cells clean.
Previously two independent files hard-coded different paths:
- cell 2 (wide parquet): had a one-off `os.path.exists` check
- cell 8 (H3 tier constants, from the scaffold): REMOTE URLs only

Both now go through a shared resolve_data_url(local_filename,
remote_path) helper defined in cell 2. Raymond's MBP with the 202604
wide parquet cached under ~/Data/iSample/pqg_refining/ picks local.
A fresh mybinder / Colab / clone picks the matching remote URL.
DuckDB's read_parquet() accepts either — downstream SQL is identical.
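A sketch of the shared resolver's shape; the cache directory follows the description above, and the remote base URL is an assumption:

```python
import os

LOCAL_DIR = os.path.expanduser("~/Data/iSample/pqg_refining")  # assumed cache dir
REMOTE_BASE = "https://data.isamples.org"                      # assumed base URL

def resolve_data_url(local_filename: str, remote_path: str) -> str:
    """Prefer a locally cached copy when present; otherwise the remote URL.
    DuckDB's read_parquet() accepts either form, so downstream SQL is
    identical on a laptop, Binder, or Colab."""
    local = os.path.join(LOCAL_DIR, local_filename)
    return local if os.path.exists(local) else f"{REMOTE_BASE}/{remote_path}"
```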

Other portability touches:
- Explicit INSTALL/LOAD httpfs in cell 2 so DuckDB speaks HTTPS on
  fresh containers where the extension isn't pre-cached.
- New markdown cell near the top explaining Binder/Colab/local paths
  and expected data-transfer volumes.
- Optional Colab install cell (commented out by default).
- binder/requirements.txt fleshed out with lonboard, ipydatagrid,
  ipywidgets, shapely, pyarrow — matches what the notebook imports.

Verified: nbclient executes all 30 cells clean. Runtime output shows
local path taken for wide parquet, remote for facet summaries.

Binder launch URL (once this branch merges to main):
  https://mybinder.org/v2/gh/isamplesorg/isamples-python/main?filepath=examples/basic/isamples_explorer.ipynb
Addresses item 5 in the scaffold's "still to do" list. At zoom >= 10
the H3 tiers are too coarse, so load_viewport_data() previously fell
through to the crude sampler, which hits the 282 MB wide parquet.

Adds a lite-parquet loader (`load_samples_from_lite()`) that reads
from the 60 MB `samples_map_lite.parquet` projection instead. When:

  - H3 Tier Mode is ON, AND
  - zoom >= 10 (zoom_to_h3_resolution returns None), AND
  - no material/year/search filter is active
    (lite doesn't carry description, material, or facet URIs)

load_viewport_data now calls the lite loader. Reuses the existing
update_map_and_table() path — lite columns (pid, label, source,
latitude, longitude, place_name, result_time) are a compatible
subset of what the map layer + table expect; description is
supplied as empty, search_score as 0.

Trade-off: the lite path can't honor text search, material filter,
or year-range filter — the existing _tier_compatible_with_filters()
check already rules those out. When such a filter is active we
gracefully fall back to the crude sampler (wide parquet).
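The three-way condition above reduces to a small predicate; a sketch (parameter names are assumptions mirroring the description):

```python
def use_lite_path(zoom: float, h3_tier_mode: bool, material_filter=None,
                  year_range=None, search_text: str = "") -> bool:
    """True when the 60 MB lite parquet can serve the viewport: tier mode
    is ON, zoom is past the coarsest-tier cutoff (>= 10), and no filter
    the lite projection can't honor (material/year/search) is active."""
    tier_incompatible = bool(material_filter or year_range or search_text)
    return h3_tier_mode and zoom >= 10 and not tier_incompatible
```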

Status bar shows "Lite: N samples (zoom X, 60 MB lite parquet,
no description/material filter)" so the active data source is
visible.

Verified: nbclient executes all 31 cells clean. Actual widget
interaction needs a live Jupyter session.
…guards

Addresses 3 Codex findings on examples#3:

1. **Notebook bloat (HIGH)**: the notebook was committed with persisted
   outputs (widget state + lonboard/MapLibre JS/CSS payloads) driving a
   ~109k-line diff and a 20 MB file. Stripped all cell outputs and the
   `metadata.widgets` block. File is now 137 KB. Commits going forward
   should strip outputs before pushing.

2. **H3 source-filter imprecision (HIGH)**: load_h3_tier() filters on
   `dominant_source` but returns cell-total `sample_count`. Selecting
   source X *under*counts (drops cells with some X samples but dominated
   by another source) AND *over*counts (shows cell totals for X-dominant
   cells that include other sources). The displayed number is an
   upper bound for the sum of source-filtered counts, not accurate per
   source.

   Expanded the docstring with the full accuracy caveat. Added a
   "⚠️ source filter is dominant-source only" suffix to the status
   bar whenever tier mode is active with a source filter, so the
   imprecision is visible in the UI.

   Accurate per-source tier aggregates would require a per-(cell, source)
   parquet (the shape of frontend_bundle_v2/h3_cache.parquet which is not
   on R2). Future work if source-accurate filtering is needed.

3. **Empty tier_df crash (MEDIUM)**: _make_tier_table_df() read
   `tier_df['resolution'].iloc[0]` without a length check, so a
   bbox or source filter producing 0 cells raised IndexError instead
   of showing an empty map. Guarded both _make_tier_table_df and
   _update_map_and_table_tier; empty case now clears layers + table
   and shows "0 cells in viewport" status.

Verified:
- All 31 cells compile (syntax check).
- nbclient executes all cells clean end-to-end (in-memory; outputs
  intentionally not saved to keep the file small).
Wires the canonical vocab_labels.parquet artifact (60 KB, hosted at
data.isamples.org, built from isamplesorg/vocabularies TTLs) into the
example notebooks so they render concept URIs as human-readable
prefLabels — see isamplesorg/isamplesorg.github.io#148.

- examples/basic/vocab_labels.py — new helper (load_vocab_labels(),
  pretty_label(uri, labels)). One HTTP fetch per session; falls back to
  the URI tail for the 4 known unresolved URIs (~169 / 6M samples).
- examples/basic/geoparquet.ipynb — adds a self-contained demo cell
  that lists 15 IdentifiedConcept material URIs from the wide parquet
  alongside their resolved prefLabels.
- examples/basic/pqg_demo.ipynb — adds a follow-on cell to Example 4
  (Material Type Distribution) that re-prints the counts with
  prefLabels.

This is a temporary client-side workaround. The proper fix is
populating IdentifiedConcept.label with prefLabels at PQG-build time,
which is upstream pqg work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>