
SpectralTDA Keyword Indexer

Automatic keyword extraction engine for structured documents. The core is written in C++20 and exposes native Python bindings via pybind11 for interactive analytics in Jupyter notebooks.

The pipeline parses Markdown documents into a structured YAML representation, extracts tokens with stop-word filtering, and runs three independent scoring algorithms. Results are compared side-by-side to identify consensus keywords — terms that all methods agree are the most important.

Roadmap: The YAML intermediate format is a PoC stepping stone. The architecture is designed to transition to Apache Arrow / Parquet for high-scale knowledge bases and integration with SVM re-tuning on spectral and Grassmannian manifolds.


Scoring Algorithms

TF-IDF (Term Frequency–Inverse Document Frequency)

A classical statistical measure. With only one document, the inverse-document-frequency factor is constant, so each token's score reduces to its normalized term frequency — tokens that appear more often score higher.

$$\text{tf}(t) = \frac{\text{count}(t)}{|D|}$$
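In Python terms, this single-document reduction is just a normalized `Counter`. A minimal sketch (illustrative only — the engine's actual implementation lives in `src/scores/tf_idf.cpp`):

```python
from collections import Counter

def tf(tokens):
    """Normalized term frequency: count(t) / |D|, where |D| is the token count."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: c / total for term, c in counts.items()}
```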

TextRank

A graph-based unsupervised ranking algorithm adapted from PageRank. A co-occurrence graph is built using a sliding window (default: 5 tokens). Each unique token becomes a node; edges connect tokens that co-occur within the window. PageRank is then run with a uniform teleportation bias, giving every node equal random-walk restart probability.

$$S(v_i) = (1 - d) \cdot \frac{1}{N} + d \sum_{v_j \in \text{In}(v_i)} \frac{w_{ji}}{\sum_{v_k} w_{jk}} S(v_j)$$

This identifies terms that are structurally central in the document's word graph — i.e., words that bridge many co-occurrence clusters.
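The construction can be sketched in pure Python (the repo's implementation is the C++ `graph_rank.cpp`; the names and the fixed-iteration power loop here are illustrative, not the actual API):

```python
from collections import defaultdict

def textrank(tokens, window=5, d=0.85, iters=50):
    """Sliding-window co-occurrence graph + PageRank with uniform teleportation."""
    weight = defaultdict(float)                  # symmetric edge weights
    for i, t in enumerate(tokens):
        for u in tokens[i + 1 : i + window]:     # co-occurrence within the window
            if u != t:
                weight[(t, u)] += 1.0
                weight[(u, t)] += 1.0
    nodes = sorted({t for t, _ in weight})
    out = {t: sum(w for (a, _), w in weight.items() if a == t) for t in nodes}
    score = {t: 1.0 / len(nodes) for t in nodes}
    for _ in range(iters):
        # Uniform (1 - d) / N restart plus damped, weight-normalized inflow
        score = {v: (1 - d) / len(nodes) + d * sum(
                     weight[(u, v)] / out[u] * score[u]
                     for u in nodes if weight.get((u, v)))
                 for v in nodes}
    return score
```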

PositionRank

A position-aware variant of TextRank. Uses the same co-occurrence graph but biases the PageRank teleportation vector towards words that appear earlier in the document (title, abstract, opening paragraphs). Each word's bias weight is 1 / (first_position + 1), normalized to sum to 1.

$$b(v_i) = \frac{1}{\text{pos}(v_i) + 1}$$

This captures the intuition that authors front-load the most important terminology.
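A sketch of the biased teleportation, reusing the same toy graph construction as the TextRank sketch (illustrative only; the real code is in `graph_rank.cpp`):

```python
from collections import defaultdict

def positionrank(tokens, window=5, d=0.85, iters=50):
    """TextRank, but the restart distribution favors early first occurrences."""
    weight = defaultdict(float)
    for i, t in enumerate(tokens):
        for u in tokens[i + 1 : i + window]:
            if u != t:
                weight[(t, u)] += 1.0
                weight[(u, t)] += 1.0
    nodes = sorted({t for t, _ in weight})
    out = {t: sum(w for (a, _), w in weight.items() if a == t) for t in nodes}
    first = {}
    for i, t in enumerate(tokens):
        first.setdefault(t, i)                       # first occurrence position
    bias = {t: 1.0 / (first[t] + 1) for t in nodes}  # b(v) = 1 / (pos(v) + 1)
    z = sum(bias.values())
    bias = {t: b / z for t, b in bias.items()}       # normalize to sum to 1
    score = dict(bias)
    for _ in range(iters):
        score = {v: (1 - d) * bias[v] + d * sum(
                     weight[(u, v)] / out[u] * score[u]
                     for u in nodes if weight.get((u, v)))
                 for v in nodes}
    return score
```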


Results

Benchmarked on a coding-theory document covering the Hamming metric. All three methods were run on the same token stream extracted from the parsed Markdown.

Score Comparison (Top 20 Keywords)

Grouped bar chart comparing normalized scores across three methods

Key observations:

  • TF-IDF and TextRank produce near-identical rankings (Spearman ρ = 0.986, p ≈ 0), confirming that in single-document analysis, frequency and graph centrality strongly correlate.
  • PositionRank diverges significantly (ρ ≈ 0.13 vs both others), surfacing early-document terms like conspectus, develops, and fundamental that the other methods rank low.
  • 12 out of 20 keywords are consensus across all three: code, codes, hamming, distance, linear, minimum, matrix, codeword, vector, block, bound, definition.
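The consensus set and rank correlation can be reproduced in a few lines of Python. The packaged version lives in `spectral_analytics/overlap.py`; this standalone sketch avoids the scipy dependency and assumes tie-free scores:

```python
def rank(values):
    """1-based ranks for a tie-free list of scores."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for pos, i in enumerate(order, start=1):
        ranks[i] = float(pos)
    return ranks

def spearman(a, b):
    """Spearman rho as the Pearson correlation of ranks (no ties assumed)."""
    ra, rb = rank(a), rank(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)

def consensus(k, *score_dicts):
    """Terms that every method places in its top-k."""
    tops = [set(sorted(d, key=d.get, reverse=True)[:k]) for d in score_dicts]
    return set.intersection(*tops)
```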

Rank Correlation

Scatter plots showing pairwise rank correlations between methods

The left panel shows the tight TF-IDF ↔ TextRank correlation; the center and right panels illustrate how PositionRank reorders the rankings based on document position.

Co-occurrence Knowledge Graph

Knowledge graph colored by method membership

Nodes are colored by which scoring methods placed them in the top 20:

  • 🟡 Gold — Consensus (all three methods agree)
  • 🟣 Purple — TF-IDF ∩ TextRank only
  • 🩷 Pink — PositionRank only

Node size reflects aggregate score. The central cluster around code, hamming, distance represents the document's core topic.
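The legend reduces to a simple membership test over the three top-20 sets. A sketch (the handling of intersections outside the legend — e.g. TF-IDF ∩ PositionRank only — is an assumption here, collapsed to gray):

```python
def node_color(term, tfidf_top, textrank_top, posrank_top):
    """Map a keyword to the legend used in the knowledge-graph figure."""
    if term in tfidf_top and term in textrank_top and term in posrank_top:
        return "gold"      # consensus: all three methods agree
    if term in tfidf_top and term in textrank_top:
        return "purple"    # TF-IDF ∩ TextRank only
    if term in posrank_top:
        return "pink"      # PositionRank only
    return "gray"          # assumption: anything else is uncolored
```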


Project Structure

SpectralTDAKeywordIndexer/
├── src/
│   ├── bindings.cpp              # pybind11 Python bindings
│   └── scores/
│       ├── tf_idf.cpp            # Token extraction + TF-IDF
│       └── graph_rank.cpp        # CooccurrenceGraph, TextRank, PositionRank
├── include/                      # C++ headers (scores.h, graph.h)
├── utils/md_to_yaml/             # Markdown → Document → YAML parser (Arrow mock)
├── analytics/
│   ├── analysis.py               # CLI wrapper for batch chart generation
│   ├── analysis.ipynb            # Interactive Jupyter notebook
│   └── spectral_analytics/       # Reusable Python package
│       ├── loader.py             # process_markdown() — calls C++ directly
│       ├── overlap.py            # Consensus sets, Spearman correlation
│       ├── config.py             # Style & color palette
│       ├── report.py             # Markdown report generator
│       ├── charts/               # Individual chart modules
│       └── spectraltda_cpp.so    # Compiled C++ extension
├── data/
│   ├── input/                    # Source .md documents
│   └── output/                   # Generated YAML, charts, reports
└── Makefile

Prerequisites

  • C++20 compiler (clang++-10 or later)
  • System libraries: yaml-cpp, md4c (via pkg-config)
  • Python 3.9+ with pybind11 installed (pip install pybind11)

Build & Run

Build everything (C++ binary + Python bindings + YAML parsing)

make all

Bootstrap Python virtual environment

make venv

This creates .venv/ and installs matplotlib, seaborn, pandas, scipy, networkx, jupyter, and nbconvert.

Run the full analysis pipeline

make analysis

Builds the C++ bindings, processes all .md files in data/input/, generates charts to data/output/charts/, and writes a Markdown analysis report.

Execute the Jupyter notebook headlessly

make notebooks

Runs analytics/analysis.ipynb via nbconvert and exports analysis.html.

Open the notebook interactively

source .venv/bin/activate
cd analytics && jupyter notebook analysis.ipynb

Clean targets

make clean        # Remove C++ build artifacts
make clean-py     # Remove Python caches and .so
make clean-data   # Remove generated charts and YAML
make clean-all    # All of the above

Python Bindings API

The C++ engine is directly importable from Python:

from spectral_analytics.loader import process_markdown

doc_name, df_all, df_tf, df_tr, df_pr, G = process_markdown("data/input/01-hamming_metric.md")

# Optional: export to CSV from notebook
df_all.to_csv("scores.csv", index=False)

The low-level C++ functions are also accessible:

from spectral_analytics import spectraltda_cpp as cpp

tokens = cpp.extract_tokens_from_file("data/input/01-hamming_metric.md")
tfidf  = cpp.calculate_tf_idf(tokens)         # dict[str, float]
tr     = cpp.textrank(tokens)                  # dict[str, float]
pr     = cpp.positionrank(tokens)              # dict[str, float]
edges  = cpp.get_cooccurrence_edges(tokens)    # list[(str, str, float)]

About

An automatic smart indexer for keyword discovery. Integrates with Parquet indices and SVM re-tuning, built on TDA over spectral and Grassmannian manifolds.
