notebooks: render vocabulary URIs as SKOS prefLabels (#148) #5
Open
rdhyee wants to merge 15 commits into isamplesorg:main from
Conversation
- geoparquet.ipynb: Add H3-accelerated bbox filtering benchmark, H3 cell distribution stats, and Lonboard visualization using H3-indexed parquet
- isamples_explorer.ipynb: Load 2 KB facet summaries at startup for instant widget population; add context and object_type dropdown filters
- h3_clustering.ipynb: New notebook demonstrating H3 clustering with Lonboard visualization, multi-resolution comparison, performance benchmarks, and hierarchical drill-down
- pqg_demo.ipynb: Add wide-format shortcut section comparing graph traversal vs. H3 spatial queries with performance timing

Closes #2, closes #3, closes #4, closes #5

https://claude.ai/code/session_01ADUWKdT6dM7gqmauf6TqWB
1. H3 bbox coverage (Critical): Replace corners+center sampling with a full data query to find all res-4 cells within the bbox. The old approach missed edge cells for large bounding boxes, causing false negatives. Fixed in geoparquet.ipynb and pqg_demo.ipynb.
2. Empty covering list guard (High): Add a check for an empty cell list before building an IN () SQL clause, which would be invalid SQL. Fixed in geoparquet.ipynb and pqg_demo.ipynb.
3. Material URI suffix collision (Critical): get_all_material_counts() now keys by (scheme, suffix) internally before collapsing to suffix, keeping the highest-count entry when different vocabularies share a suffix (e.g., "rock" in isample/vocabulary vs. isample/opencontext). Fixed in isamples_explorer.ipynb.
4. N+1 rollup queries (Medium): Replaced the per-suffix COUNT(DISTINCT) loop with a single batch UNION ALL query in compute_accurate_rollup_counts(). Fixed in isamples_explorer.ipynb.
5. Lazy-load heavy queries (High): Deferred get_all_material_counts() and get_year_range_stats() to first use instead of eager startup, preserving the "instant" facet experience from pre-computed summaries. Fixed in isamples_explorer.ipynb.
6. Graph traversal no-op (Low): Replaced a bare `pass` with an explanatory comment noting it is pseudocode. Fixed in pqg_demo.ipynb.

Addresses review: PR #6 comment #3882037312

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
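The (scheme, suffix) de-duplication in fix 3 can be sketched as plain Python. This is a hypothetical reconstruction, not the notebook's actual code: `collapse_material_counts` and its row shape are illustrative names, and the "scheme" here is simply everything before the last URI path segment.

```python
def collapse_material_counts(rows):
    """Key material counts by (scheme, suffix), then collapse to suffix,
    keeping the highest-count entry when vocabularies share a suffix.

    rows: iterable of (uri, count) pairs.
    Returns: {suffix: (scheme, count)}.
    """
    by_scheme_suffix = {}
    for uri, count in rows:
        scheme, _, suffix = uri.rpartition("/")
        key = (scheme, suffix)
        by_scheme_suffix[key] = by_scheme_suffix.get(key, 0) + count

    # Collapse: a suffix like "rock" may appear under several schemes;
    # keep only the scheme with the largest count.
    by_suffix = {}
    for (scheme, suffix), count in by_scheme_suffix.items():
        if suffix not in by_suffix or count > by_suffix[suffix][1]:
            by_suffix[suffix] = (scheme, count)
    return by_suffix
```

With "rock" present in both isample/vocabulary (500 samples) and isample/opencontext (20 samples), the collapsed map keeps the vocabulary entry, which is the collision behavior the fix describes.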
1. H3 bbox coverage: Use h3 Python library (h3.geo_to_cells) to compute
covering cells mathematically instead of scanning the data. This makes
the benchmark meaningful — cell computation is O(1) relative to data
size, pure geometry with no I/O. (Codex Critical + Gemini Performance)
2. Filter parity: All H3 queries now include otype='MaterialSampleRecord'
to match baseline queries exactly. (Codex Medium)
3. SQL quoting: Fixed batch rollup query — was producing escaped quotes
(\'rock\') instead of proper SQL quotes ('rock'). (Codex High)
4. Actually lazy startup: Deferred get_all_material_counts() and
get_year_range_stats() for real — they now run on first accordion
open, not at module import. Startup only loads 2KB summary parquet.
(Codex High)
5. Scheme-aware rollup: expand_material_filters_with_rollup() now
matches children within the same vocabulary scheme prefix, preventing
cross-vocabulary suffix collisions during expansion. (Codex Critical)
6. Notebook JSON format: Restored list-of-strings source format in
explorer notebook for clean git diffs. (Gemini Style)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
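The SQL quoting fix (item 3) and the empty-list guard from the earlier review round combine naturally into one helper. A minimal sketch, assuming a hypothetical `sql_in_clause` name; the point is plain SQL single quotes (doubled when embedded) rather than backslash-escaped ones, and `None` instead of an invalid empty `IN ()`:

```python
def sql_in_clause(column, values):
    """Build `column IN ('a', 'b', ...)` with proper SQL quoting.

    Returns None for an empty value list, since `IN ()` is invalid SQL;
    the caller should skip the predicate entirely in that case.
    """
    if not values:
        return None
    # Standard SQL escaping: double any embedded single quote, then wrap
    # in plain quotes -- producing 'rock', never \'rock\'.
    quoted = ", ".join("'" + str(v).replace("'", "''") + "'" for v in values)
    return f"{column} IN ({quoted})"
```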
1. Time accordion index: Changed idx==2 to idx==4 to match actual
accordion children order (Sources=0, Material=1, Context=2,
Object Type=3, Time=4). Also rebuilds decade checkboxes and
slider bounds after lazy loading year stats. (Codex High)
2. Scheme-keyed material map: get_all_material_counts() now stores
all_uris dict mapping {scheme_prefix: uri} per suffix. Rollup
expansion looks up children by same scheme first, falls back to
primary URI. Prevents cross-vocabulary collisions. (Codex High)
3. Signed BIGINT normalization: h3_to_signed() converts h3.str_to_int()
output to signed int64 (val - 2**64 if val >= 2**63) to match
DuckDB BIGINT storage. All current data is positive, but this
guards against future cells with high bit set. (Codex Medium)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
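The signed-BIGINT normalization in item 3 is fully specified by the commit message, so it can be written out directly:

```python
def h3_to_signed(val):
    """Map an unsigned 64-bit H3 cell index onto the signed int64 range,
    matching DuckDB BIGINT storage: values with the high bit set wrap
    to negative (val - 2**64)."""
    return val - 2**64 if val >= 2**63 else val
```

All current cell indices are positive, so `h3_to_signed` is a no-op today; the branch only matters for hypothetical future cells with the high bit set.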
Add H3 spatial optimization and facet summaries
Pins the current state of examples/basic/isamples_explorer.ipynb before porting in the H3-tier strategy from the Cesium progressive_globe frontend. This is the "crude sampling" version:
- load_samples() uses ROW_NUMBER() OVER (PARTITION BY source ORDER BY RANDOM()) with max_per_source=12500 → ~50K initial points
- adaptive_sample_size(zoom) caps the per-source sample by zoom tier but still renders individual points at every zoom (no H3 aggregation, no pre-computed tiers)
- Every viewport change re-queries the 282 MB wide parquet

Landmark commit so we can return to this baseline if the H3-tier port regresses UX.
Inserts 6 new cells after the existing load_samples() cell to demonstrate the port of the Cesium progressive_globe strategy:
- Three pre-computed H3 summary parquets (res 4/6/8) from R2, ~4 MB total vs. the 282 MB wide-parquet round-trip the crude sampler does.
- load_h3_tier(zoom, source_filters, bbox) returns one row per H3 cell with sample_count, centroid, and dominant_source.
- make_h3_tier_layer() builds a lonboard ScatterplotLayer with radius scaled by log(sample_count) and colors from SOURCE_COLORS.
- Demo cell loads all three tiers and one mid-tier standalone map.

Scaffold ONLY — not yet wired into the widget viewport observer. The existing load_samples() path and adaptive_sample_size() remain untouched so the crude and tier strategies can be compared side by side before integration.

Smoke-tested against the live R2 files:
- zoom 1.5 → res 4, 38,406 cells, ~500 ms cold
- zoom 5.0 → res 6, 111,681 cells, ~500 ms cold
- zoom 8.5 → res 8, 175,653 cells, ~500 ms cold
- source_filter and bbox predicates verified

Baseline pinned at snapshot/explorer-crude-sampling-2026-04-24.
Two bugs in the scaffold layer builder and demo:
1. make_h3_tier_layer() called hex_to_rgb on SOURCE_COLORS values, but SOURCE_COLORS (cell 3) already holds RGBA lists, not hex strings — AttributeError on .lstrip. Use the values directly and fall back to DEFAULT_COLOR for unknown sources.
2. The demo cell referenced BASEMAP_VOYAGER, which is defined later in cell 15 (Map Component). Scaffold cell 10 runs before that, so NameError. Define a local _TIER_BASEMAP inline from the same MaplibreBasemap + CartoStyle.Voyager.

Verified: `nbclient` executes all 28 cells to completion.
Adds a second toggle button next to Viewport Mode. When ON:
- load_viewport_data() branches at the top: if zoom_to_h3_resolution(zoom) is not None AND no tier-incompatible filter is active (material, year, search), load_h3_tier() returns ~100K aggregate rows from the pre-computed R2 parquet (a few MB, not hundreds of MB).
- _update_map_and_table_tier() replaces sample_map.layers with the ScatterplotLayer built from the tier centroids.
- The table shows aggregate rows (source / "N samples (H3 res R)" / lat / lng).
- The source facet still works — the tier query filters on dominant_source.
- Material / year / search filters auto-fall back to the crude sampler because the tier files don't carry those dimensions.

At zoom >= 10, zoom_to_h3_resolution returns None → the crude sampler takes over automatically. That's the handoff point to the individual-points tier (item 5 in the scaffold's "still to do" list, not yet implemented).

Verified: nbclient executes all 28 cells clean. Actual widget interaction needs to be tested in a running Jupyter session.

State changes:
- ExplorerState.h3_tier_mode flag (default False)
- h3_tier_toggle widget added to controls_row1 in cell 21

No changes to the crude-sampler code path.
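The tier-compatibility branch can be sketched as a small predicate. This is a hypothetical shape (a dict of active filters), not the notebook's real `_tier_compatible_with_filters()` signature:

```python
def tier_compatible_with_filters(active_filters):
    """True when the pre-computed H3 tier parquets can serve the request.

    The tier files only carry (cell, sample_count, dominant_source), so
    any filter on a dimension they lack -- material, year range, or text
    search -- forces a fall-back to the crude sampler.
    """
    incompatible = ("material", "year_range", "search")
    return not any(active_filters.get(k) for k in incompatible)
```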
Defaults were way too big:
- radius_scale=20000 m/log-unit meant a 10K-sample cell projected to ~184 km; at country zoom that saturated radius_max_pixels=40, so every high-count cell rendered as a 40 px blob.
- min_radius=5000 m forced even 1-sample cells to a visible size.

New defaults:
- radius_scale=3000 (log1p(100)·3000 ≈ 14 km)
- radius_min_pixels=2, radius_max_pixels=12
- drop the meters-min floor; the pixel minimum handles the low end.

Density now reads as density, not "big vs. huge". All three tunables are parameters so they can be overridden if Raymond wants different behavior at specific zooms.

Verified: nbclient runs all 28 cells clean.
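The radius math above checks out directly; a small worked sketch (the function name is illustrative, the numbers come from the commit message):

```python
import math

def cell_radius_m(sample_count, radius_scale=3000):
    """Meters-radius for an H3 cell before the per-pixel clamps
    (radius_min_pixels / radius_max_pixels) applied by the layer."""
    return radius_scale * math.log1p(sample_count)

# Old default: 20000 m/log-unit -> a 10K-sample cell at ~184 km.
# New default: 3000 m/log-unit  -> a 100-sample cell at ~14 km.
```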
Previously two independent cells hard-coded different paths:
- cell 2 (wide parquet): had a one-off `os.path.exists` check
- cell 8 (H3 tier constants, from the scaffold): remote URLs only

Both now go through a shared resolve_data_url(local_filename, remote_path) helper defined in cell 2. Raymond's MBP, with the 202604 wide parquet cached under ~/Data/iSample/pqg_refining/, picks the local copy; a fresh mybinder / Colab / clone picks the matching remote URL. DuckDB's read_parquet() accepts either — downstream SQL is identical.

Other portability touches:
- Explicit INSTALL/LOAD httpfs in cell 2 so DuckDB speaks HTTPS on fresh containers where the extension isn't pre-cached.
- New markdown cell near the top explaining Binder/Colab/local paths and expected data-transfer volumes.
- Optional Colab install cell (commented out by default).
- binder/requirements.txt fleshed out with lonboard, ipydatagrid, ipywidgets, shapely, pyarrow — matches what the notebook imports.

Verified: nbclient executes all 30 cells clean. Runtime output shows the local path taken for the wide parquet, remote for the facet summaries.

Binder launch URL (once this branch merges to main):
https://mybinder.org/v2/gh/isamplesorg/isamples-python/main?filepath=examples/basic/isamples_explorer.ipynb
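A resolver with this behavior is only a few lines. A minimal sketch, assuming the cache directory named above; the exact signature of the notebook's resolve_data_url may differ:

```python
import os

# Assumed local cache directory from the commit message.
DEFAULT_LOCAL_DIR = os.path.expanduser("~/Data/iSample/pqg_refining")

def resolve_data_url(local_filename, remote_path, local_dir=DEFAULT_LOCAL_DIR):
    """Prefer a cached local copy of the parquet; otherwise return the
    remote URL. DuckDB's read_parquet() accepts either form (with the
    httpfs extension loaded for HTTPS), so downstream SQL is identical."""
    local = os.path.join(local_dir, local_filename)
    return local if os.path.exists(local) else remote_path
```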
Addresses item 5 in the scaffold's "still to do" list. At zoom >= 10
the H3 tiers are too coarse so load_viewport_data() previously fell
through to the crude sampler, which hits the 282 MB wide parquet.
Adds a lite-parquet loader (`load_samples_from_lite()`) that reads
from the 60 MB `samples_map_lite.parquet` projection instead. When:
- H3 Tier Mode is ON, AND
- zoom >= 10 (zoom_to_h3_resolution returns None), AND
- no material/year/search filter is active
(lite doesn't carry description, material, or facet URIs)
load_viewport_data now calls the lite loader. Reuses the existing
update_map_and_table() path — lite columns (pid, label, source,
latitude, longitude, place_name, result_time) are a compatible
subset of what the map layer + table expect; description is
supplied as empty, search_score as 0.
Trade-off: the lite path can't honor text search, material filter,
or year-range filter — the existing _tier_compatible_with_filters()
check already rules those out. When such a filter is active we
gracefully fall back to the crude sampler (wide parquet).
Status bar shows "Lite: N samples (zoom X, 60 MB lite parquet,
no description/material filter)" so the active data source is
visible.
Verified: nbclient executes all 31 cells clean. Actual widget
interaction needs a live Jupyter session.
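The zoom-to-resolution handoff described above can be sketched as a step function. The exact thresholds are an assumption; they are chosen only to be consistent with the smoke tests quoted earlier (zoom 1.5 → res 4, 5.0 → res 6, 8.5 → res 8, >= 10 → None):

```python
def zoom_to_h3_resolution(zoom):
    """Map a map zoom level to a pre-computed H3 tier resolution.

    Returns None at zoom >= 10, which is the handoff point where the
    tiers are too coarse and the individual-points path (lite parquet
    or crude sampler) takes over. Threshold values below are
    hypothetical, not the notebook's actual cutoffs.
    """
    if zoom >= 10:
        return None   # too zoomed-in for aggregate cells
    if zoom >= 7:
        return 8
    if zoom >= 4:
        return 6
    return 4
```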
…guards

Addresses 3 Codex findings on examples#3:
1. **Notebook bloat (HIGH)**: the notebook was committed with persisted outputs (widget state + lonboard/MapLibre JS/CSS payloads), driving a ~109k-line diff and a 20 MB file. Stripped all cell outputs and the `metadata.widgets` block. The file is now 137 KB. Commits going forward should strip outputs before pushing.
2. **H3 source-filter imprecision (HIGH)**: load_h3_tier() filters on `dominant_source` but returns cell-total `sample_count`. Selecting source X *under*counts (drops cells with some X samples but dominated by another source) AND *over*counts (shows cell totals for X-dominant cells that include other sources). The displayed number is an upper bound for the sum of source-filtered counts, not accurate per source. Expanded the docstring with the full accuracy caveat. Added a "⚠️ source filter is dominant-source only" suffix to the status bar whenever tier mode is active with a source filter, so the imprecision is visible in the UI. Accurate per-source tier aggregates would require a per-(cell, source) parquet (the shape of frontend_bundle_v2/h3_cache.parquet, which is not on R2). Future work if source-accurate filtering is needed.
3. **Empty tier_df crash (MEDIUM)**: _make_tier_table_df() read `tier_df['resolution'].iloc[0]` without a length check, so a bbox or source filter producing 0 cells raised IndexError instead of showing an empty map. Guarded both _make_tier_table_df and _update_map_and_table_tier; the empty case now clears layers + table and shows a "0 cells in viewport" status.

Verified:
- All 31 cells compile (syntax check).
- nbclient executes all cells clean end-to-end (in-memory; outputs intentionally not saved to keep the file small).
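The empty-result guard in finding 3 follows a general pattern: check length before indexing row 0. A sketch over a plain list of dicts rather than the notebook's DataFrame (function and field names are illustrative):

```python
def make_tier_table_rows(cells):
    """Build (source, label, lat, lng) table rows from tier query results.

    A bbox or source filter can legitimately return zero cells; reading
    cells[0] unguarded (the analogue of tier_df['resolution'].iloc[0])
    would raise IndexError, so the empty case returns an empty table and
    a "0 cells in viewport" status instead.
    """
    if not cells:
        return [], "0 cells in viewport"
    resolution = cells[0]["resolution"]
    rows = [
        (c["dominant_source"],
         f"{c['sample_count']} samples (H3 res {resolution})",
         c["lat"], c["lng"])
        for c in cells
    ]
    return rows, f"{len(cells)} cells in viewport"
```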
Wires the canonical vocab_labels.parquet artifact (60 KB, hosted at data.isamples.org, built from isamplesorg/vocabularies TTLs) into the example notebooks so they render concept URIs as human-readable prefLabels — see isamplesorg/isamplesorg.github.io#148.
- examples/basic/vocab_labels.py — new helper (load_vocab_labels(), pretty_label(uri, labels)). One HTTP fetch per session; falls back to the URI tail for the 4 known unresolved URIs (~169 / 6M samples).
- examples/basic/geoparquet.ipynb — adds a self-contained demo cell that lists 15 IdentifiedConcept material URIs from the wide parquet alongside their resolved prefLabels.
- examples/basic/pqg_demo.ipynb — adds a follow-on cell to Example 4 (Material Type Distribution) that re-prints the counts with prefLabels.

This is a temporary client-side workaround. The proper fix is populating IdentifiedConcept.label with prefLabels at PQG-build time, which is upstream pqg work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
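The URI-tail fallback behavior of pretty_label(uri, labels) can be sketched in a couple of lines. This is a plausible reconstruction of the described behavior, not the file's verbatim contents (the real load_vocab_labels() fetches the parquet over HTTP):

```python
def pretty_label(uri, labels):
    """Return the SKOS prefLabel for a concept URI, falling back to the
    URI's last path segment for the handful of known unresolved URIs."""
    return labels.get(uri, uri.rstrip("/").rsplit("/", 1)[-1])
```

Matching the verification output later in this PR: a resolved URI renders as "Organic material", while an unresolved opencontext URI falls back to its tail, "organicanimalproduct".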
Wires the canonical `vocab_labels.parquet` artifact (60 KB, now live at `data.isamples.org`, built by isamplesorg/isamplesorg.github.io#149) into the example notebooks so they render concept URIs as human-readable prefLabels. Companion to isamplesorg.github.io#150 (Explorer wiring).
Tracks isamplesorg/isamplesorg.github.io#148.
Summary
Verification
End-to-end smoke against live `data.isamples.org/current/wide.parquet`:
```
Loaded 535 URI -> prefLabel entries
pref_label uri
organicanimalproduct https://w3id.org/isample/opencontext/material/0.1/organicanimalproduct ← known unresolved (#148)
Organic material https://w3id.org/isample/vocabulary/material/1.0/organicmaterial
Natural Solid Material https://w3id.org/isample/vocabulary/material/1.0/earthmaterial
```
Why client-side
Temporary workaround. The proper fix is populating `IdentifiedConcept.label` with prefLabels at PQG-build time (upstream `pqg` work). Until then, every consumer (Explorer, notebooks, future detail panels) JOINs the same lookup file.