Automatic keyword extraction engine for structured documents. The core is written in C++20 and exposes native Python bindings via pybind11 for interactive analytics in Jupyter notebooks.
The pipeline parses Markdown documents into a structured YAML representation, extracts tokens with stop-word filtering, and runs three independent scoring algorithms. Results are compared side-by-side to identify consensus keywords — terms that all methods agree are the most important.
Roadmap: The YAML intermediate format is a PoC stepping stone. The architecture is designed to transition to Apache Arrow / Parquet for high-scale knowledge bases and integration with SVM re-tuning on spectral and Grassmannian manifolds.
TF-IDF: A classical statistical measure. Each token's score is its relative frequency within the document. In the single-document context, this reduces to normalized term frequency: tokens that appear more often score higher.
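Because only one document is scored at a time, this collapses to plain term frequency. A minimal sketch of the idea in Python (illustrative only; the project's actual implementation is the C++ in `tf_idf.cpp`):

```python
from collections import Counter

def term_frequency(tokens: list[str]) -> dict[str, float]:
    """Score each token by its relative frequency in the document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {tok: n / total for tok, n in counts.items()}

scores = term_frequency(["hamming", "code", "code", "distance"])
# "code" appears twice out of four tokens, so it scores 0.5
```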
TextRank: A graph-based unsupervised ranking algorithm adapted from PageRank. A co-occurrence graph is built using a sliding window (default: 5 tokens). Each unique token becomes a node; edges connect tokens that co-occur within the window. PageRank is then run with a uniform teleportation bias, giving every node equal random-walk restart probability.
This identifies terms that are structurally central in the document's word graph — i.e., words that bridge many co-occurrence clusters.
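Under the stated assumptions (window of 5 tokens, uniform teleportation), the method can be sketched with networkx, here standing in for the C++ implementation in `graph_rank.cpp`:

```python
import networkx as nx

def textrank(tokens: list[str], window: int = 5) -> dict[str, float]:
    """Rank tokens by PageRank over a sliding-window co-occurrence graph."""
    G = nx.Graph()
    for i, tok in enumerate(tokens):
        # Link tok to every distinct token within the same 5-token window.
        for other in tokens[i + 1 : i + window]:
            if other != tok:
                # Accumulate co-occurrence counts as edge weights.
                w = G.get_edge_data(tok, other, {"weight": 0})["weight"]
                G.add_edge(tok, other, weight=w + 1)
    # personalization=None means a uniform teleportation vector:
    # every node gets equal random-walk restart probability.
    return nx.pagerank(G, alpha=0.85, weight="weight")
```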
PositionRank: A position-aware variant of TextRank. It uses the same co-occurrence graph but biases the PageRank teleportation vector towards words that appear earlier in the document (title, abstract, opening paragraphs). Each word's bias weight is 1 / (first_position + 1), normalized to sum to 1.
This captures the intuition that authors front-load the most important terminology.
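A sketch of the biased teleportation vector, again using networkx as an illustrative stand-in for the C++ engine:

```python
import networkx as nx

def positionrank(tokens: list[str], window: int = 5) -> dict[str, float]:
    """PageRank with teleportation biased towards early-appearing words."""
    G = nx.Graph()
    for i, tok in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != tok:
                w = G.get_edge_data(tok, other, {"weight": 0})["weight"]
                G.add_edge(tok, other, weight=w + 1)
    # Record each token's first position in the stream.
    first = {}
    for pos, tok in enumerate(tokens):
        first.setdefault(tok, pos)
    # Bias weight 1 / (first_position + 1), normalized to sum to 1.
    bias = {tok: 1.0 / (first[tok] + 1) for tok in G}
    total = sum(bias.values())
    personalization = {tok: b / total for tok, b in bias.items()}
    return nx.pagerank(G, alpha=0.85, weight="weight",
                       personalization=personalization)
```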
Benchmarked on a Hamming metric coding theory document. All three methods were run on the same token stream extracted from the parsed Markdown.
Key observations:
- TF-IDF and TextRank produce near-identical rankings (Spearman ρ = 0.986, p ≈ 0), confirming that in single-document analysis, frequency and graph centrality strongly correlate.
- PositionRank diverges significantly (ρ ≈ 0.13 vs both others), surfacing early-document terms like conspectus, develops, and fundamental that the other methods rank low.
- 12 out of 20 keywords are consensus across all three:
code, codes, hamming, distance, linear, minimum, matrix, codeword, vector, block, bound, definition.
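The overlap analysis boils down to a rank correlation plus a top-k set intersection. A sketch of that logic (helper names are hypothetical; the project's own version lives in `overlap.py`):

```python
from scipy.stats import spearmanr

def consensus(score_maps: dict[str, dict[str, float]], k: int = 20) -> set[str]:
    """Tokens that every scoring method ranks in its top-k."""
    tops = [
        set(sorted(s, key=s.get, reverse=True)[:k])
        for s in score_maps.values()
    ]
    return set.intersection(*tops)

def spearman(a: dict[str, float], b: dict[str, float]) -> float:
    """Spearman rho between two methods' scores on their shared tokens."""
    shared = sorted(set(a) & set(b))
    rho, _ = spearmanr([a[t] for t in shared], [b[t] for t in shared])
    return rho
```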
The left panel shows the tight TF-IDF ↔ TextRank correlation; the center and right panels illustrate how PositionRank reorders the rankings based on document position.
Nodes are colored by which scoring methods placed them in the top 20:
- 🟡 Gold — Consensus (all three methods agree)
- 🟣 Purple — TF-IDF ∩ TextRank only
- 🩷 Pink — PositionRank only
Node size reflects aggregate score. The central cluster around code, hamming, distance represents the document's core topic.
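The legend above maps directly to a set-membership test. A hypothetical helper sketching that classification (the fallback category for tokens outside the legend is an assumption, not part of the chart module):

```python
def color_category(token: str, tf_top: set, tr_top: set, pr_top: set) -> str:
    """Map a token to the legend category used by the graph chart."""
    if token in tf_top and token in tr_top and token in pr_top:
        return "gold"    # consensus: all three methods agree
    if token in tf_top and token in tr_top:
        return "purple"  # TF-IDF and TextRank only
    if token in pr_top:
        return "pink"    # PositionRank only
    return "gray"        # assumed default for everything else
```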
```
SpectralTDAKeywordIndexer/
├── src/
│   ├── bindings.cpp              # pybind11 Python bindings
│   └── scores/
│       ├── tf_idf.cpp            # Token extraction + TF-IDF
│       └── graph_rank.cpp        # CooccurrenceGraph, TextRank, PositionRank
├── include/                      # C++ headers (scores.h, graph.h)
├── utils/md_to_yaml/             # Markdown → Document → YAML parser (Arrow mock)
├── analytics/
│   ├── analysis.py               # CLI wrapper for batch chart generation
│   ├── analysis.ipynb            # Interactive Jupyter notebook
│   └── spectral_analytics/       # Reusable Python package
│       ├── loader.py             # process_markdown() — calls C++ directly
│       ├── overlap.py            # Consensus sets, Spearman correlation
│       ├── config.py             # Style & color palette
│       ├── report.py             # Markdown report generator
│       ├── charts/               # Individual chart modules
│       └── spectraltda_cpp.so    # Compiled C++ extension
├── data/
│   ├── input/                    # Source .md documents
│   └── output/                   # Generated YAML, charts, reports
└── Makefile
```
- C++20 compiler (clang++-10 or later)
- System libraries: yaml-cpp, md4c (via pkg-config)
- Python 3.9+ with pybind11 installed (`pip install pybind11`)
- `make all`
- `make venv`: creates `.venv/` and installs matplotlib, seaborn, pandas, scipy, networkx, jupyter, and nbconvert.
- `make analysis`: builds the C++ bindings, processes all `.md` files in `data/input/`, generates charts to `data/output/charts/`, and writes a Markdown analysis report.
- `make notebooks`: runs `analytics/analysis.ipynb` via nbconvert and exports `analysis.html`.

To work interactively:

```
source .venv/bin/activate
cd analytics && jupyter notebook analysis.ipynb
```

Cleanup targets:

```
make clean       # Remove C++ build artifacts
make clean-py    # Remove Python caches and .so
make clean-data  # Remove generated charts and YAML
make clean-all   # All of the above
```

The C++ engine is directly importable from Python:
```python
from spectral_analytics.loader import process_markdown

doc_name, df_all, df_tf, df_tr, df_pr, G = process_markdown("data/input/01-hamming_metric.md")

# Optional: export to CSV from notebook
df_all.to_csv("scores.csv", index=False)
```

The low-level C++ functions are also accessible:
```python
from spectral_analytics import spectraltda_cpp as cpp

tokens = cpp.extract_tokens_from_file("data/input/01-hamming_metric.md")
tfidf  = cpp.calculate_tf_idf(tokens)          # dict[str, float]
tr     = cpp.textrank(tokens)                  # dict[str, float]
pr     = cpp.positionrank(tokens)              # dict[str, float]
edges  = cpp.get_cooccurrence_edges(tokens)    # list[(str, str, float)]
```

