21 changes: 14 additions & 7 deletions binder/requirements.txt
@@ -1,10 +1,17 @@
# Binder environment requirements
# Pin lonboard to version that supports _height parameter
lonboard>=0.10.0
ipydatagrid
geopandas
duckdb
# Binder environment for examples/basic/isamples_explorer.ipynb
# (and the adjacent basic/ notebooks). Tracks what the notebooks
# actually import. Keep minimal — Binder's base image provides
# JupyterLab, ipykernel, IPython.

# Data + query
duckdb>=0.10
pandas>=2.0.0
numpy
ipywidgets
pyarrow>=12.0.0
geopandas
shapely

# Visualization
lonboard>=0.10.0 # _height + MaplibreBasemap
ipydatagrid
ipywidgets
281 changes: 276 additions & 5 deletions examples/basic/geoparquet.ipynb
@@ -534,7 +534,7 @@
"\n",
"* [iSamples Complete Export Dataset - April 2025](https://zenodo.org/records/15278211)\n",
"\n",
"* [Open Context Database SQL Dump and Parquet Exports](https://zenodo.org/records/15732000) -- [https://zenodo.org/records/15732000](https://zenodo.org/records/15732000) \n",
"* [Open Context Database SQL Dump and Parquet Exports](https://zenodo.org/records/15732000) -- [https://zenodo.org/records/15732000](https://zenodo.org/records/15732000)\n",
"\n"
]
},
@@ -1479,6 +1479,59 @@
"print(result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Joining vocabulary URIs to human-readable labels\n",
"\n",
"iSamples concepts are referenced as SKOS URIs throughout the wide and\n",
"narrow parquets \u2014 `IdentifiedConcept.label` currently holds the URI\n",
"(e.g. `.../material/1.0/earthmaterial`) rather than the prefLabel.\n",
"We publish a small lookup at\n",
"`data.isamples.org/vocab_labels.parquet` (~60 KB, built from the\n",
"canonical TTLs at `isamplesorg/vocabularies`) so notebooks can render\n",
"`Natural Solid Material` instead. The `vocab_labels.py` helper in this\n",
"directory wraps it.\n",
"\n",
"See [issue #148](https://github.com/isamplesorg/isamplesorg.github.io/issues/148)\n",
"for background \u2014 this is a temporary client-side workaround until\n",
"`IdentifiedConcept.label` is populated with prefLabels at PQG-build time.\n",
"URIs that no TTL declares (a known set of 4 URIs, affecting ~169 of the\n",
"~6M samples) fall back to the last URL segment.\n"
]
},
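The `vocab_labels.py` helper the markdown cell references is not part of this diff. A minimal sketch of the lookup-plus-fallback behavior it describes (URI to prefLabel dict from the published parquet, last-URL-segment fallback) could look like the following; the function names mirror the notebook's imports, but the column names (`uri`, `pref_label`) and implementation are assumptions:

```python
# Hypothetical sketch of the vocab_labels.py helper referenced above.
# Assumes the published parquet exposes `uri` and `pref_label` columns;
# the real helper and schema are not shown in this diff.

def load_vocab_labels(url="https://data.isamples.org/vocab_labels.parquet"):
    """Fetch the URI -> prefLabel lookup as a plain dict (one HTTP request)."""
    import duckdb  # imported lazily so pretty_label works without duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs; LOAD httpfs;")
    df = con.sql(f"SELECT uri, pref_label FROM read_parquet('{url}')").df()
    return dict(zip(df["uri"], df["pref_label"]))

def pretty_label(uri, labels):
    """Return the prefLabel if known, else fall back to the last URL segment."""
    if uri in labels:
        return labels[uri]
    return uri.rstrip("/").rsplit("/", 1)[-1]
```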
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from vocab_labels import load_vocab_labels, pretty_label\n",
"import duckdb, pandas as pd\n",
"\n",
"labels = load_vocab_labels() # one HTTP fetch, ~60 KB\n",
"print(f'Loaded {len(labels):,} URI -> prefLabel entries')\n",
"\n",
"# Count how many concept entities exist for each material URI in the\n",
"# IdentifiedConcept slice of the wide parquet, then join to prefLabels.\n",
"# (The wide parquet stores concept URIs in IdentifiedConcept.label \u2014\n",
"# the #148 data-quality issue.)\n",
"url = 'https://data.isamples.org/current/wide.parquet'\n",
"con = duckdb.connect(); con.execute('INSTALL httpfs; LOAD httpfs;')\n",
"concept_uris = con.sql(f'''\n",
" SELECT label AS uri\n",
"    FROM read_parquet('{url}')\n",
"    WHERE otype = 'IdentifiedConcept'\n",
"      AND label LIKE '%/material/%'\n",
" LIMIT 15\n",
"''').df()\n",
"concept_uris['pref_label'] = concept_uris['uri'].map(\n",
" lambda u: pretty_label(u, labels)\n",
")\n",
"concept_uris[['pref_label', 'uri']]\n"
]
},
{
"cell_type": "code",
"execution_count": 19,
@@ -1717,7 +1770,7 @@
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>6680932 rows × 3 columns</p>\n",
"<p>6680932 rows \u00d7 3 columns</p>\n",
"</div>"
],
"text/plain": [
@@ -1957,7 +2010,7 @@
"version_minor": 1
},
"text/plain": [
"Map(custom_attribution='', layers=(BitmapTileLayer(data='https://tile.openstreetmap.org/{z}/{x}/{y}.png', max_"
"Map(custom_attribution='', layers=(BitmapTileLayer(data='https://tile.openstreetmap.org/{z}/{x}/{y}.png', max_\u2026"
]
},
"metadata": {},
@@ -2805,7 +2858,7 @@
"\n",
"The central idea is to create a \"control panel\" of widgets that are dynamically generated based on the schema of your Ibis table. This panel allows a user to build up a complex filter expression interactively, and then Ibis executes the final, filtered query.\n",
"\n",
"Here’s a step-by-step approach to implementing your vision:\n",
"Here\u2019s a step-by-step approach to implementing your vision:\n",
"\n",
"---\n",
"\n",
@@ -2946,6 +2999,224 @@
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "h3-acceleration-header",
"metadata": {},
"source": [
"## H3-Accelerated Spatial Filtering\n",
"\n",
"The [H3 geospatial indexing system](https://h3geo.org/) partitions the Earth into hexagonal cells at\n",
"multiple resolutions. By pre-computing H3 cell indices for each sample's coordinates, we can\n",
"replace expensive lat/lon range scans with fast integer lookups.\n",
"\n",
"The iSamples wide parquet file with H3 indices adds three BIGINT columns \u2014 `h3_res4`, `h3_res6`,\n",
"`h3_res8` \u2014 covering ~11.96M of 20.7M rows (those with valid coordinates).\n",
"\n",
"Below we benchmark **baseline lat/lon filtering** vs **H3 res4 pre-filtering** for a bounding-box\n",
"query, show H3 cell statistics, and render the H3-indexed data on a Lonboard map."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "h3-setup-cell",
"metadata": {},
"outputs": [],
"source": [
"import duckdb\n",
"import time\n",
"\n",
"# Data URLs\n",
"WIDE_H3_URL = \"https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_wide_h3.parquet\"\n",
"WIDE_URL = \"https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_wide.parquet\"\n",
"\n",
"con_h3 = duckdb.connect()\n",
"con_h3.execute(\"INSTALL h3 FROM community; LOAD h3;\")\n",
"print(\"DuckDB H3 extension loaded.\")"
]
},
{
"cell_type": "markdown",
"id": "h3-stats-header",
"metadata": {},
"source": [
"### H3 Cell Distribution Statistics\n",
"\n",
"How many distinct hexagonal cells exist at each resolution, and what fraction of rows carry H3 values?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "h3-stats-cell",
"metadata": {},
"outputs": [],
"source": [
"h3_stats = con_h3.sql(f\"\"\"\n",
" SELECT\n",
" COUNT(*) AS total_rows,\n",
" COUNT(h3_res4) AS rows_with_h3,\n",
" ROUND(100.0 * COUNT(h3_res4) / COUNT(*), 1) AS pct_with_h3,\n",
" COUNT(DISTINCT h3_res4) AS distinct_res4,\n",
" COUNT(DISTINCT h3_res6) AS distinct_res6,\n",
" COUNT(DISTINCT h3_res8) AS distinct_res8\n",
" FROM read_parquet('{WIDE_H3_URL}')\n",
"\"\"\").df()\n",
"\n",
"print(\"H3 Cell Distribution\")\n",
"print(\"=\" * 50)\n",
"print(f\"Total rows: {h3_stats['total_rows'].iloc[0]:>12,}\")\n",
"print(f\"Rows with H3: {h3_stats['rows_with_h3'].iloc[0]:>12,} ({h3_stats['pct_with_h3'].iloc[0]}%)\")\n",
"print(f\"Distinct res4 cells: {h3_stats['distinct_res4'].iloc[0]:>12,} (~1,770 km\u00b2 cell area)\")\n",
"print(f\"Distinct res6 cells: {h3_stats['distinct_res6'].iloc[0]:>12,} (~3.2 km edge length)\")\n",
"print(f\"Distinct res8 cells: {h3_stats['distinct_res8'].iloc[0]:>12,} (~0.46 km edge length)\")"
]
},
{
"cell_type": "markdown",
"id": "h3-benchmark-header",
"metadata": {},
"source": [
"### Bbox Benchmark: Lat/Lon Range Scan vs H3 Pre-Filter\n",
"\n",
"We query samples inside a bounding box (San Francisco Bay Area) two ways:\n",
"\n",
"1. **Baseline** \u2014 filter on `latitude BETWEEN ... AND longitude BETWEEN ...`\n",
"2. **H3 pre-filter** \u2014 find the set of res4 cells that overlap the bbox, then filter by those cells\n",
" before applying the exact lat/lon check.\n",
"\n",
"The H3 approach narrows the scan to a small number of hexagonal cells, reducing I/O on remote\n",
"parquet files."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "h3-benchmark-cell",
"metadata": {},
"outputs": [],
"source": [
"import h3\n",
"\n",
"# Bay Area bounding box\n",
"BBOX = dict(min_lat=37.2, max_lat=37.9, min_lon=-122.6, max_lon=-121.8)\n",
"\n",
"# --- Baseline: raw lat/lon range scan ---\n",
"t0 = time.time()\n",
"baseline = con_h3.sql(f\"\"\"\n",
" SELECT COUNT(*) AS n\n",
" FROM read_parquet('{WIDE_H3_URL}')\n",
" WHERE otype = 'MaterialSampleRecord'\n",
" AND latitude BETWEEN {BBOX['min_lat']} AND {BBOX['max_lat']}\n",
" AND longitude BETWEEN {BBOX['min_lon']} AND {BBOX['max_lon']}\n",
"\"\"\").df()\n",
"baseline_ms = (time.time() - t0) * 1000\n",
"\n",
"# --- H3 pre-filter: compute covering cells mathematically (no data scan) ---\n",
"t0 = time.time()\n",
"\n",
"# Use h3 Python library to compute all res4 cells covering the bbox.\n",
"# This is O(1) relative to dataset size \u2014 pure geometry, no I/O.\n",
"bbox_polygon = h3.LatLngPoly([\n",
" (BBOX['min_lat'], BBOX['min_lon']),\n",
" (BBOX['min_lat'], BBOX['max_lon']),\n",
" (BBOX['max_lat'], BBOX['max_lon']),\n",
" (BBOX['max_lat'], BBOX['min_lon']),\n",
"])\n",
"covering_cells = h3.geo_to_cells(bbox_polygon, res=4)\n",
"# Convert to signed int64 to match DuckDB BIGINT storage\n",
"def h3_to_signed(cell_hex):\n",
" val = h3.str_to_int(cell_hex)\n",
" return val if val < 2**63 else val - 2**64\n",
"\n",
"cell_list = [str(h3_to_signed(c)) for c in covering_cells]\n",
"print(f'Bbox covered by {len(cell_list)} res4 cells (computed mathematically)')\n",
"\n",
"if not cell_list:\n",
" print('No H3 cells cover this bbox.')\n",
" h3_ms = 0\n",
" h3_result = None\n",
"else:\n",
" cell_sql = ', '.join(cell_list)\n",
" h3_result = con_h3.sql(f\"\"\"\n",
" SELECT COUNT(*) AS n\n",
" FROM read_parquet('{WIDE_H3_URL}')\n",
" WHERE otype = 'MaterialSampleRecord'\n",
" AND h3_res4 IN ({cell_sql})\n",
" AND latitude BETWEEN {BBOX['min_lat']} AND {BBOX['max_lat']}\n",
" AND longitude BETWEEN {BBOX['min_lon']} AND {BBOX['max_lon']}\n",
" \"\"\").df()\n",
" h3_ms = (time.time() - t0) * 1000\n",
"\n",
"# --- Results ---\n",
"print('Bounding-Box Query Benchmark (Bay Area)')\n",
"print('=' * 50)\n",
"print(f'Baseline (lat/lon scan): {baseline_ms:>8.0f} ms | {baseline[\"n\"].iloc[0]:,} rows')\n",
"if h3_result is not None:\n",
" print(f'H3 res4 pre-filter: {h3_ms:>8.0f} ms | {h3_result[\"n\"].iloc[0]:,} rows')\n",
" speedup = baseline_ms / h3_ms if h3_ms > 0 else float('inf')\n",
" print(f'Speedup: {speedup:>7.1f}x')\n",
" print(f'\\nRow counts match: {baseline[\"n\"].iloc[0] == h3_result[\"n\"].iloc[0]}')\n",
"else:\n",
" print('H3 result: no covering cells')\n"
]
},
{
"cell_type": "markdown",
"id": "h3-lonboard-header",
"metadata": {},
"source": [
"### Lonboard Visualization with H3-Indexed Data\n",
"\n",
"Render a sample of the H3-indexed dataset to confirm the visualization pipeline works with\n",
"the enriched file."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "h3-lonboard-cell",
"metadata": {},
"outputs": [],
"source": [
"from lonboard import Map, ScatterplotLayer, BitmapTileLayer\n",
"import geopandas as gpd\n",
"import numpy as np\n",
"\n",
"# Sample 100K rows with coordinates from the H3 file\n",
"sample_df = con_h3.sql(f\"\"\"\n",
" SELECT latitude, longitude, n AS source, h3_res4, h3_res6\n",
" FROM read_parquet('{WIDE_H3_URL}')\n",
" WHERE latitude IS NOT NULL AND longitude IS NOT NULL\n",
" USING SAMPLE 100000\n",
"\"\"\").df()\n",
"\n",
"gdf_h3 = gpd.GeoDataFrame(\n",
" sample_df,\n",
" geometry=gpd.points_from_xy(sample_df.longitude, sample_df.latitude),\n",
" crs=\"EPSG:4326\"\n",
")\n",
"\n",
"# Color by source\n",
"h3_color_map = {\n",
" 'SESAR': [51, 102, 204, 200],\n",
" 'OPENCONTEXT': [220, 57, 18, 200],\n",
" 'GEOME': [16, 150, 24, 200],\n",
" 'SMITHSONIAN': [255, 153, 0, 200],\n",
"}\n",
"default_c = [128, 128, 128, 200]\n",
"colors_arr = np.array([h3_color_map.get(s, default_c) for s in gdf_h3['source']], dtype=np.uint8)\n",
"\n",
"layer = ScatterplotLayer.from_geopandas(gdf_h3, get_fill_color=colors_arr, get_radius=500)\n",
"tile_layer = BitmapTileLayer(\n",
" data=\"https://tile.openstreetmap.org/{z}/{x}/{y}.png\",\n",
" min_zoom=0, max_zoom=19,\n",
")\n",
"\n",
"m = Map([tile_layer, layer], view_state={\"zoom\": 2, \"latitude\": 20, \"longitude\": 0})\n",
"print(f\"Rendering {len(gdf_h3):,} points from H3-indexed file\")\n",
"m"
]
}
],
"metadata": {
@@ -2969,4 +3240,4 @@
},
"nbformat": 4,
"nbformat_minor": 5
}
}