21 changes: 14 additions & 7 deletions binder/requirements.txt
@@ -1,10 +1,17 @@
# Binder environment requirements
# Pin lonboard to version that supports _height parameter
lonboard>=0.10.0
ipydatagrid
geopandas
duckdb
# Binder environment for examples/basic/isamples_explorer.ipynb
# (and the adjacent basic/ notebooks). Tracks what the notebooks
# actually import. Keep minimal — Binder's base image provides
# JupyterLab, ipykernel, IPython.

# Data + query
duckdb>=0.10
pandas>=2.0.0
numpy
ipywidgets
pyarrow>=12.0.0
geopandas
shapely

# Visualization
lonboard>=0.10.0 # _height + MaplibreBasemap
ipydatagrid
ipywidgets
281 changes: 276 additions & 5 deletions examples/basic/geoparquet.ipynb
@@ -534,7 +534,7 @@
"\n",
"* [iSamples Complete Export Dataset - April 2025](https://zenodo.org/records/15278211)\n",
"\n",
"* [Open Context Database SQL Dump and Parquet Exports](https://zenodo.org/records/15732000) -- [https://zenodo.org/records/15732000](https://zenodo.org/records/15732000) \n",
"* [Open Context Database SQL Dump and Parquet Exports](https://zenodo.org/records/15732000) -- [https://zenodo.org/records/15732000](https://zenodo.org/records/15732000)\n",
"\n"
]
},
@@ -1479,6 +1479,59 @@
"print(result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Joining vocabulary URIs to human-readable labels\n",
"\n",
"iSamples concepts are referenced as SKOS URIs throughout the wide and\n",
"narrow parquets \u2014 `IdentifiedConcept.label` currently holds the URI\n",
"(e.g. `.../material/1.0/earthmaterial`) rather than the prefLabel.\n",
"We publish a small lookup at\n",
"`data.isamples.org/vocab_labels.parquet` (~60 KB, built from the\n",
"canonical TTLs at `isamplesorg/vocabularies`) so notebooks can render\n",
"`Natural Solid Material` instead. The `vocab_labels.py` helper in this\n",
"directory wraps it.\n",
"\n",
"See [issue #148](https://github.com/isamplesorg/isamplesorg.github.io/issues/148)\n",
"for background \u2014 this is a temporary client-side workaround until\n",
"`IdentifiedConcept.label` is populated with prefLabels at PQG-build time.\n",
"URIs that no TTL declares (a known set of 4 URIs, affecting ~169 of the\n",
"~6M samples) fall back to the last URL segment.\n"
]
},
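The `vocab_labels.py` helper the markdown cell references is not part of this diff. A minimal sketch of the lookup-plus-fallback behavior it describes (URI to prefLabel dict from the published parquet, last-URL-segment fallback) could look like the following; the function names mirror the notebook's imports, but the column names (`uri`, `pref_label`) and implementation are assumptions:

```python
# Hypothetical sketch of the vocab_labels.py helper referenced above.
# Assumes the published parquet exposes `uri` and `pref_label` columns;
# the real helper and schema are not shown in this diff.

def load_vocab_labels(url="https://data.isamples.org/vocab_labels.parquet"):
    """Fetch the URI -> prefLabel lookup as a plain dict (one HTTP request)."""
    import duckdb  # imported lazily so pretty_label works without duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs; LOAD httpfs;")
    df = con.sql(f"SELECT uri, pref_label FROM read_parquet('{url}')").df()
    return dict(zip(df["uri"], df["pref_label"]))

def pretty_label(uri, labels):
    """Return the prefLabel if known, else fall back to the last URL segment."""
    if uri in labels:
        return labels[uri]
    return uri.rstrip("/").rsplit("/", 1)[-1]
```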
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from vocab_labels import load_vocab_labels, pretty_label\n",
"import duckdb, pandas as pd\n",
"\n",
"labels = load_vocab_labels() # one HTTP fetch, ~60 KB\n",
"print(f'Loaded {len(labels):,} URI -> prefLabel entries')\n",
"\n",
"# Count how many concept entities exist for each material URI in the\n",
"# IdentifiedConcept slice of the wide parquet, then join to prefLabels.\n",
"# (The wide parquet stores concept URIs in IdentifiedConcept.label \u2014\n",
"# the #148 data-quality issue.)\n",
"url = 'https://data.isamples.org/current/wide.parquet'\n",
"con = duckdb.connect(); con.execute('INSTALL httpfs; LOAD httpfs;')\n",
"concept_uris = con.sql(f'''\n",
" SELECT label AS uri\n",
"    FROM read_parquet('{url}')\n",
"    WHERE otype = 'IdentifiedConcept'\n",
"      AND label LIKE '%/material/%'\n",
" LIMIT 15\n",
"''').df()\n",
"concept_uris['pref_label'] = concept_uris['uri'].map(\n",
" lambda u: pretty_label(u, labels)\n",
")\n",
"concept_uris[['pref_label', 'uri']]\n"
]
},
{
"cell_type": "code",
"execution_count": 19,
@@ -1717,7 +1770,7 @@
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>6680932 rows × 3 columns</p>\n",
"<p>6680932 rows \u00d7 3 columns</p>\n",
"</div>"
],
"text/plain": [
@@ -1957,7 +2010,7 @@
"version_minor": 1
},
"text/plain": [
"Map(custom_attribution='', layers=(BitmapTileLayer(data='https://tile.openstreetmap.org/{z}/{x}/{y}.png', max_"
"Map(custom_attribution='', layers=(BitmapTileLayer(data='https://tile.openstreetmap.org/{z}/{x}/{y}.png', max_\u2026"
]
},
"metadata": {},
@@ -2805,7 +2858,7 @@
"\n",
"The central idea is to create a \"control panel\" of widgets that are dynamically generated based on the schema of your Ibis table. This panel allows a user to build up a complex filter expression interactively, and then Ibis executes the final, filtered query.\n",
"\n",
"Here’s a step-by-step approach to implementing your vision:\n",
"Here\u2019s a step-by-step approach to implementing your vision:\n",
"\n",
"---\n",
"\n",
@@ -2946,6 +2999,224 @@
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "h3-acceleration-header",
"metadata": {},
"source": [
"## H3-Accelerated Spatial Filtering\n",
"\n",
"The [H3 geospatial indexing system](https://h3geo.org/) partitions the Earth into hexagonal cells at\n",
"multiple resolutions. By pre-computing H3 cell indices for each sample's coordinates, we can\n",
"replace expensive lat/lon range scans with fast integer lookups.\n",
"\n",
"The iSamples wide parquet file with H3 indices adds three BIGINT columns \u2014 `h3_res4`, `h3_res6`,\n",
"`h3_res8` \u2014 covering ~11.96M of 20.7M rows (those with valid coordinates).\n",
"\n",
"Below we benchmark **baseline lat/lon filtering** vs **H3 res4 pre-filtering** for a bounding-box\n",
"query, show H3 cell statistics, and render the H3-indexed data on a Lonboard map."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "h3-setup-cell",
"metadata": {},
"outputs": [],
"source": [
"import duckdb\n",
"import time\n",
"\n",
"# Data URLs\n",
"WIDE_H3_URL = \"https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_wide_h3.parquet\"\n",
"WIDE_URL = \"https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_wide.parquet\"\n",
"\n",
"con_h3 = duckdb.connect()\n",
"con_h3.execute(\"INSTALL h3 FROM community; LOAD h3;\")\n",
"print(\"DuckDB H3 extension loaded.\")"
]
},
{
"cell_type": "markdown",
"id": "h3-stats-header",
"metadata": {},
"source": [
"### H3 Cell Distribution Statistics\n",
"\n",
"How many distinct hexagonal cells exist at each resolution, and what fraction of rows carry H3 values?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "h3-stats-cell",
"metadata": {},
"outputs": [],
"source": [
"h3_stats = con_h3.sql(f\"\"\"\n",
" SELECT\n",
" COUNT(*) AS total_rows,\n",
" COUNT(h3_res4) AS rows_with_h3,\n",
" ROUND(100.0 * COUNT(h3_res4) / COUNT(*), 1) AS pct_with_h3,\n",
" COUNT(DISTINCT h3_res4) AS distinct_res4,\n",
" COUNT(DISTINCT h3_res6) AS distinct_res6,\n",
" COUNT(DISTINCT h3_res8) AS distinct_res8\n",
" FROM read_parquet('{WIDE_H3_URL}')\n",
"\"\"\").df()\n",
"\n",
"print(\"H3 Cell Distribution\")\n",
"print(\"=\" * 50)\n",
"print(f\"Total rows: {h3_stats['total_rows'].iloc[0]:>12,}\")\n",
"print(f\"Rows with H3: {h3_stats['rows_with_h3'].iloc[0]:>12,} ({h3_stats['pct_with_h3'].iloc[0]}%)\")\n",
"print(f\"Distinct res4 cells: {h3_stats['distinct_res4'].iloc[0]:>12,} (~1,770 km\u00b2 cell area)\")\n",
"print(f\"Distinct res6 cells: {h3_stats['distinct_res6'].iloc[0]:>12,} (~3.2 km edge length)\")\n",
"print(f\"Distinct res8 cells: {h3_stats['distinct_res8'].iloc[0]:>12,} (~0.46 km edge length)\")"
]
},
{
"cell_type": "markdown",
"id": "h3-benchmark-header",
"metadata": {},
"source": [
"### Bbox Benchmark: Lat/Lon Range Scan vs H3 Pre-Filter\n",
"\n",
"We query samples inside a bounding box (San Francisco Bay Area) two ways:\n",
"\n",
"1. **Baseline** \u2014 filter on `latitude BETWEEN ... AND longitude BETWEEN ...`\n",
"2. **H3 pre-filter** \u2014 find the set of res4 cells that overlap the bbox, then filter by those cells\n",
" before applying the exact lat/lon check.\n",
"\n",
"The H3 approach narrows the scan to a small number of hexagonal cells, reducing I/O on remote\n",
"parquet files."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "h3-benchmark-cell",
"metadata": {},
"outputs": [],
"source": [
"import h3\n",
"\n",
"# Bay Area bounding box\n",
"BBOX = dict(min_lat=37.2, max_lat=37.9, min_lon=-122.6, max_lon=-121.8)\n",
"\n",
"# --- Baseline: raw lat/lon range scan ---\n",
"t0 = time.time()\n",
"baseline = con_h3.sql(f\"\"\"\n",
" SELECT COUNT(*) AS n\n",
" FROM read_parquet('{WIDE_H3_URL}')\n",
" WHERE otype = 'MaterialSampleRecord'\n",
" AND latitude BETWEEN {BBOX['min_lat']} AND {BBOX['max_lat']}\n",
" AND longitude BETWEEN {BBOX['min_lon']} AND {BBOX['max_lon']}\n",
"\"\"\").df()\n",
"baseline_ms = (time.time() - t0) * 1000\n",
"\n",
"# --- H3 pre-filter: compute covering cells mathematically (no data scan) ---\n",
"t0 = time.time()\n",
"\n",
"# Use h3 Python library to compute all res4 cells covering the bbox.\n",
"# This is O(1) relative to dataset size \u2014 pure geometry, no I/O.\n",
"bbox_polygon = h3.LatLngPoly([\n",
" (BBOX['min_lat'], BBOX['min_lon']),\n",
" (BBOX['min_lat'], BBOX['max_lon']),\n",
" (BBOX['max_lat'], BBOX['max_lon']),\n",
" (BBOX['max_lat'], BBOX['min_lon']),\n",
"])\n",
"covering_cells = h3.geo_to_cells(bbox_polygon, res=4)\n",
"# Convert to signed int64 to match DuckDB BIGINT storage\n",
"def h3_to_signed(cell_hex):\n",
" val = h3.str_to_int(cell_hex)\n",
" return val if val < 2**63 else val - 2**64\n",
"\n",
"cell_list = [str(h3_to_signed(c)) for c in covering_cells]\n",
"print(f'Bbox covered by {len(cell_list)} res4 cells (computed mathematically)')\n",
"\n",
"if not cell_list:\n",
" print('No H3 cells cover this bbox.')\n",
" h3_ms = 0\n",
" h3_result = None\n",
"else:\n",
" cell_sql = ', '.join(cell_list)\n",
" h3_result = con_h3.sql(f\"\"\"\n",
" SELECT COUNT(*) AS n\n",
" FROM read_parquet('{WIDE_H3_URL}')\n",
" WHERE otype = 'MaterialSampleRecord'\n",
" AND h3_res4 IN ({cell_sql})\n",
" AND latitude BETWEEN {BBOX['min_lat']} AND {BBOX['max_lat']}\n",
" AND longitude BETWEEN {BBOX['min_lon']} AND {BBOX['max_lon']}\n",
" \"\"\").df()\n",
" h3_ms = (time.time() - t0) * 1000\n",
"\n",
"# --- Results ---\n",
"print('Bounding-Box Query Benchmark (Bay Area)')\n",
"print('=' * 50)\n",
"print(f'Baseline (lat/lon scan): {baseline_ms:>8.0f} ms | {baseline[\"n\"].iloc[0]:,} rows')\n",
"if h3_result is not None:\n",
" print(f'H3 res4 pre-filter: {h3_ms:>8.0f} ms | {h3_result[\"n\"].iloc[0]:,} rows')\n",
" speedup = baseline_ms / h3_ms if h3_ms > 0 else float('inf')\n",
" print(f'Speedup: {speedup:>7.1f}x')\n",
" print(f'\\nRow counts match: {baseline[\"n\"].iloc[0] == h3_result[\"n\"].iloc[0]}')\n",
"else:\n",
" print('H3 result: no covering cells')\n"
]
},
{
"cell_type": "markdown",
"id": "h3-lonboard-header",
"metadata": {},
"source": [
"### Lonboard Visualization with H3-Indexed Data\n",
"\n",
"Render a sample of the H3-indexed dataset to confirm the visualization pipeline works with\n",
"the enriched file."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "h3-lonboard-cell",
"metadata": {},
"outputs": [],
"source": [
"from lonboard import Map, ScatterplotLayer, BitmapTileLayer\n",
"import geopandas as gpd\n",
"import numpy as np\n",
"\n",
"# Sample 100K rows with coordinates from the H3 file\n",
"sample_df = con_h3.sql(f\"\"\"\n",
" SELECT latitude, longitude, n AS source, h3_res4, h3_res6\n",
" FROM read_parquet('{WIDE_H3_URL}')\n",
" WHERE latitude IS NOT NULL AND longitude IS NOT NULL\n",
" USING SAMPLE 100000\n",
"\"\"\").df()\n",
"\n",
"gdf_h3 = gpd.GeoDataFrame(\n",
" sample_df,\n",
" geometry=gpd.points_from_xy(sample_df.longitude, sample_df.latitude),\n",
" crs=\"EPSG:4326\"\n",
")\n",
"\n",
"# Color by source\n",
"h3_color_map = {\n",
" 'SESAR': [51, 102, 204, 200],\n",
" 'OPENCONTEXT': [220, 57, 18, 200],\n",
" 'GEOME': [16, 150, 24, 200],\n",
" 'SMITHSONIAN': [255, 153, 0, 200],\n",
"}\n",
"default_c = [128, 128, 128, 200]\n",
"colors_arr = np.array([h3_color_map.get(s, default_c) for s in gdf_h3['source']], dtype=np.uint8)\n",
"\n",
"layer = ScatterplotLayer.from_geopandas(gdf_h3, get_fill_color=colors_arr, get_radius=500)\n",
"tile_layer = BitmapTileLayer(\n",
" data=\"https://tile.openstreetmap.org/{z}/{x}/{y}.png\",\n",
" min_zoom=0, max_zoom=19,\n",
")\n",
"\n",
"m = Map([tile_layer, layer], view_state={\"zoom\": 2, \"latitude\": 20, \"longitude\": 0})\n",
"print(f\"Rendering {len(gdf_h3):,} points from H3-indexed file\")\n",
"m"
]
}
],
"metadata": {
@@ -2969,4 +3240,4 @@
},
"nbformat": 4,
"nbformat_minor": 5
}
}