Automatic keyword extraction engine for structured documents. The core is written in C++20 and exposes native Python bindings via pybind11 for interactive analytics in Jupyter notebooks.
The pipeline parses Markdown documents into a structured YAML representation, extracts tokens with stop-word filtering, and runs three independent scoring algorithms. Results are compared side-by-side to identify consensus keywords — terms that all methods agree are the most important.
Roadmap: The YAML intermediate format is a PoC stepping stone. The architecture is designed to transition to Apache Arrow / Parquet for high-scale knowledge bases and integration with SVM re-tuning on spectral and Grassmannian manifolds.
TF-IDF: A classical statistical measure. Each token's score is its relative frequency within the document. In the single-document context, this reduces to normalized term frequency: tokens that appear more often score higher.
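Because only one document is scored at a time, this collapses to plain term frequency. A minimal sketch of the idea in Python (illustrative only; the project's actual implementation is the C++ in `tf_idf.cpp`):

```python
from collections import Counter

def term_frequency(tokens: list[str]) -> dict[str, float]:
    """Score each token by its relative frequency in the document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {tok: n / total for tok, n in counts.items()}

scores = term_frequency(["hamming", "code", "code", "distance"])
# "code" appears twice out of four tokens, so it scores 0.5
```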
TextRank: A graph-based unsupervised ranking algorithm adapted from PageRank. A co-occurrence graph is built using a sliding window (default: 5 tokens). Each unique token becomes a node; edges connect tokens that co-occur within the window. PageRank is then run with a uniform teleportation bias, giving every node equal random-walk restart probability.
This identifies terms that are structurally central in the document's word graph — i.e., words that bridge many co-occurrence clusters.
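Under the stated assumptions (window of 5 tokens, uniform teleportation), the method can be sketched with networkx, here standing in for the C++ implementation in `graph_rank.cpp`:

```python
import networkx as nx

def textrank(tokens: list[str], window: int = 5) -> dict[str, float]:
    """Rank tokens by PageRank over a sliding-window co-occurrence graph."""
    G = nx.Graph()
    for i, tok in enumerate(tokens):
        # Link tok to every distinct token within the same 5-token window.
        for other in tokens[i + 1 : i + window]:
            if other != tok:
                # Accumulate co-occurrence counts as edge weights.
                w = G.get_edge_data(tok, other, {"weight": 0})["weight"]
                G.add_edge(tok, other, weight=w + 1)
    # personalization=None means a uniform teleportation vector:
    # every node gets equal random-walk restart probability.
    return nx.pagerank(G, alpha=0.85, weight="weight")
```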
PositionRank: A position-aware variant of TextRank. It uses the same co-occurrence graph but biases the PageRank teleportation vector towards words that appear earlier in the document (title, abstract, opening paragraphs). Each word's bias weight is 1 / (first_position + 1), normalized to sum to 1.
This captures the intuition that authors front-load the most important terminology.
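A sketch of the biased teleportation vector, again using networkx as an illustrative stand-in for the C++ engine:

```python
import networkx as nx

def positionrank(tokens: list[str], window: int = 5) -> dict[str, float]:
    """PageRank with teleportation biased towards early-appearing words."""
    G = nx.Graph()
    for i, tok in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != tok:
                w = G.get_edge_data(tok, other, {"weight": 0})["weight"]
                G.add_edge(tok, other, weight=w + 1)
    # Record each token's first position in the stream.
    first = {}
    for pos, tok in enumerate(tokens):
        first.setdefault(tok, pos)
    # Bias weight 1 / (first_position + 1), normalized to sum to 1.
    bias = {tok: 1.0 / (first[tok] + 1) for tok in G}
    total = sum(bias.values())
    personalization = {tok: b / total for tok, b in bias.items()}
    return nx.pagerank(G, alpha=0.85, weight="weight",
                       personalization=personalization)
```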
Benchmarked on a Hamming metric coding theory document. All three methods were run on the same token stream extracted from the parsed Markdown.
Key observations:
- TF-IDF and TextRank produce near-identical rankings (Spearman ρ = 0.986, p ≈ 0), confirming that in single-document analysis, frequency and graph centrality strongly correlate.
- PositionRank diverges significantly (ρ ≈ 0.13 vs both others), surfacing early-document terms like conspectus, develops, and fundamental that the other methods rank low.
- 12 out of 20 keywords are consensus across all three:
code, codes, hamming, distance, linear, minimum, matrix, codeword, vector, block, bound, definition.
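The overlap analysis boils down to a rank correlation plus a top-k set intersection. A sketch of that logic (helper names are hypothetical; the project's own version lives in `overlap.py`):

```python
from scipy.stats import spearmanr

def consensus(score_maps: dict[str, dict[str, float]], k: int = 20) -> set[str]:
    """Tokens that every scoring method ranks in its top-k."""
    tops = [
        set(sorted(s, key=s.get, reverse=True)[:k])
        for s in score_maps.values()
    ]
    return set.intersection(*tops)

def spearman(a: dict[str, float], b: dict[str, float]) -> float:
    """Spearman rho between two methods' scores on their shared tokens."""
    shared = sorted(set(a) & set(b))
    rho, _ = spearmanr([a[t] for t in shared], [b[t] for t in shared])
    return rho
```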
The left panel shows the tight TF-IDF ↔ TextRank correlation; the center and right panels illustrate how PositionRank reorders the rankings based on document position.
Nodes are colored by which scoring methods placed them in the top 20:
- 🟡 Gold — Consensus (all three methods agree)
- 🟣 Purple — TF-IDF ∩ TextRank only
- 🩷 Pink — PositionRank only
Node size reflects aggregate score. The central cluster around code, hamming, distance represents the document's core topic.
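The legend above maps directly to a set-membership test. A hypothetical helper sketching that classification (the fallback category for tokens outside the legend is an assumption, not part of the chart module):

```python
def color_category(token: str, tf_top: set, tr_top: set, pr_top: set) -> str:
    """Map a token to the legend category used by the graph chart."""
    if token in tf_top and token in tr_top and token in pr_top:
        return "gold"    # consensus: all three methods agree
    if token in tf_top and token in tr_top:
        return "purple"  # TF-IDF and TextRank only
    if token in pr_top:
        return "pink"    # PositionRank only
    return "gray"        # assumed default for everything else
```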
```
SpectralTDAKeywordIndexer/
├── src/
│   ├── bindings.cpp              # pybind11 Python bindings
│   └── scores/
│       ├── tf_idf.cpp            # Token extraction + TF-IDF
│       └── graph_rank.cpp        # CooccurrenceGraph, TextRank, PositionRank
├── include/                      # C++ headers (scores.h, graph.h)
├── utils/md_to_yaml/             # Markdown → Document → YAML parser (Arrow mock)
├── analytics/
│   ├── analysis.py               # CLI wrapper for batch chart generation
│   ├── analysis.ipynb            # Interactive Jupyter notebook
│   └── spectral_analytics/       # Reusable Python package
│       ├── loader.py             # process_markdown() — calls C++ directly
│       ├── overlap.py            # Consensus sets, Spearman correlation
│       ├── config.py             # Style & color palette
│       ├── report.py             # Markdown report generator
│       ├── charts/               # Individual chart modules
│       └── spectraltda_cpp.so    # Compiled C++ extension
├── data/
│   ├── input/                    # Source .md documents
│   └── output/                   # Generated YAML, charts, reports
└── Makefile
```
- C++20 compiler (clang++-10 or later)
- System libraries: yaml-cpp, md4c (via pkg-config)
- Python 3.9+ with pybind11 installed (`pip install pybind11`)
- `make all`
- `make venv`: creates `.venv/` and installs matplotlib, seaborn, pandas, scipy, networkx, jupyter, and nbconvert.
- `make analysis`: builds the C++ bindings, processes all `.md` files in `data/input/`, generates charts to `data/output/charts/`, and writes a Markdown analysis report.
- `make notebooks`: runs `analytics/analysis.ipynb` via nbconvert and exports `analysis.html`.

To work interactively:

```
source .venv/bin/activate
cd analytics && jupyter notebook analysis.ipynb
```

Cleanup targets:

```
make clean       # Remove C++ build artifacts
make clean-py    # Remove Python caches and .so
make clean-data  # Remove generated charts and YAML
make clean-all   # All of the above
```

The C++ engine is directly importable from Python:
```python
from spectral_analytics.loader import process_markdown

doc_name, df_all, df_tf, df_tr, df_pr, G = process_markdown("data/input/01-hamming_metric.md")

# Optional: export to CSV from notebook
df_all.to_csv("scores.csv", index=False)
```

The low-level C++ functions are also accessible:
```python
from spectral_analytics import spectraltda_cpp as cpp

tokens = cpp.extract_tokens_from_file("data/input/01-hamming_metric.md")
tfidf  = cpp.calculate_tf_idf(tokens)          # dict[str, float]
tr     = cpp.textrank(tokens)                  # dict[str, float]
pr     = cpp.positionrank(tokens)              # dict[str, float]
edges  = cpp.get_cooccurrence_edges(tokens)    # list[(str, str, float)]
```

