Jupyter-notebook equivalent of ../topic-modeling.faculytics/. Same multilingual topic modeling pipeline (LaBSE → UMAP → HDBSCAN → BERTopic + KeyBERTInspired) for Philippine student feedback (English, Cebuano, Tagalog, code-switched), reorganized as two interactive notebooks with rich visualizations.
| File | Purpose |
|---|---|
topic_modeling.ipynb |
Main end-to-end run: load → preprocess → embed → BERTopic → evaluate → save artifacts |
grid_search.ipynb |
Hyperparameter sweep across 4 configurations (equivalent of auto_tune.py) |
lib.py |
Helper module — exact regex patterns, multilingual stop words, BERTopic factory, 5 metric functions, artifact saver |
data/ |
Drop dataset files here (gitignored). See "Data" below |
runs/ |
Per-run artifacts: params.json, metrics.json, topics.md, info.json, bertopic_model/ |
figures/ |
Saved HTML visualizations per run |
cd experiments/topic-modeling-notebook
uv venv .venv --python 3.11
source .venv/bin/activate
# CUDA 12.1 (skip if CPU-only)
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
uv pip install -r requirements.txt
# Register kernel for Jupyter
python -m ipykernel install --user --name topic-modeling-notebook
jupyter labDrop the dataset files into data/. The same files used by the original CLI experiment:
| File | Description | Notes |
|---|---|---|
feedback_augmented_v1.json |
~7k curated multilingual entries {text, label, lang_type} |
Best for baseline & tuning |
uc_dataset_20krows1.csv |
~34k raw UC student feedback with comment, label columns |
Real-world data |
uc_dataset_filtered.json |
Sentiment-gated real data (positive < 10 words excluded) | Production baseline (Run 013 in CLI version) |
If you already have these elsewhere on disk (e.g. inside ../topic-modeling.faculytics/data/), the main notebook contains an optional helper cell that copies from a SOURCE_DIR you specify — uncomment, set the path, run once.
uc_dataset_filtered.json is generated from uc_dataset_20krows1.csv by a sentiment-gating cell inside topic_modeling.ipynb (between the data presence check and the LaBSE device cell). It auto-runs when DATASET = "real_filtered" and the file is missing; call build_sentiment_gated(overwrite=True, min_words=...) to regenerate or sweep the threshold.
- Open the notebook, run cells top-to-bottom.
- Adjust the run config in cell 4 (
RUN_ID,DATASET). - Tweak hyperparameters in cell 21 (
paramsdict). - Re-run from cell 21 onward to test new params (embeddings stay cached).
Artifacts land in runs/run_<RUN_ID>/. Visualizations land in figures/run_<RUN_ID>/.
Runs the same 4-config sweep as the CLI's auto_tune.py (run_004…run_007 configs). Produces a ranked dataframe of results and highlights the best run.
The pipeline is configured to match the original CLI exactly:
- Seeds:
random.seed(42),np.random.seed(42), UMAPrandom_state=42 - LaBSE embeddings normalized, batch_size=64, cached per dataset
- UMAP
n_neighbors=15, n_components=5, min_dist=0.0, metric=cosine - HDBSCAN
min_cluster_size=15, min_samples=5, metric=euclidean, cluster_selection_method=eom - KeyBERTInspired representation (semantic keyword extraction; replaces c-TF-IDF as of CLI Run 011)
- 160-entry multilingual stop word list (English + Cebuano + Tagalog function/role/filler words)
- 5 metrics: embedding coherence (primary), NPMI (reference only on multilingual), topic diversity, outlier ratio, silhouette
Notebook RUN_IDs use the nb_* namespace (nb_001, nb_grid_001, ...) so they don't collide with the original CLI runs (001–013).
| Metric | Target | Notes |
|---|---|---|
| Embedding coherence | > 0.5 | Primary metric; language-agnostic |
| Topic diversity | > 0.7 | Non-redundant topics |
| Outlier ratio | < 20% | Often 33–52% on real data — accepted limitation per CLI Run 013 analysis |
| Num topics | 10–25 | Educator-interpretable range |
| NPMI coherence | > 0.1 | Reference only; systematically fails on multilingual corpora (Hoyle et al. 2021) |
- Two notebooks instead of four CLI scripts.
run_eval.py→topic_modeling.ipynb,auto_tune.py→grid_search.ipynb.prepare_dataset.py's sentiment gating is reproduced as a cell intopic_modeling.ipynb;generate_report.pyis not reproduced (run from the original repo if needed). - No Discord notifications. Visualizations are inline in the notebook — no need for asynchronous review.
- Richer visualizations. Drop-reason audit, pre/post word-count histograms, 2D embedding scatter, BERTopic intertopic-distance map, document map, hierarchical merge tree, similarity heatmap, metrics-vs-target bar chart, per-topic quality table.
runs/instead ofexperiments/for per-run artifacts (avoids confusing nesting under the parentexperiments/folder). Same artifact layout, so the originalgenerate_report.pyworks on these.