Engrammatic Named Entity Inference

What's This About?

Imagine you're reading a history book where every person's name has been covered up with a sticky note that says [REDACTED:PERSON]. Your job is to guess who it was based on everything else written around it. That's essentially what this project teaches an AI to do — but with more sophistication.

The core question we're asking: Can a language model learn to reconstruct missing named entities (people, places, organizations) just from context? And more importantly, what does it tell us about how well the model actually understands the world?

Why This Matters

Language models like GPT, Claude, and others are deployed everywhere: answering questions, writing code, summarizing documents. But we don't fully understand what they actually "know" versus what they merely memorize or hallucinate. If we can train a model to successfully fill in redacted names, we can probe its real knowledge of historical actors, geography, and relationships. This is a window into the model's internal understanding and a potential detector for hallucinations.

Our specific angle: using World War I history as a test bed. WWI is rich with named entities (generals, diplomats, nations, battles, treaties), well-documented, and challenging enough that it reveals gaps in a model's knowledge.

A key insight from Meesum: Instead of just masking entity names with a generic [MASK] token, we embed the entity type in the mask itself — [REDACTED:PERSON], [REDACTED:GPE], etc. This gives the model a strong prior: predicting a country name, a person name, and a date are very different tasks. It's a small detail with big implications for what the model can learn.


The Idea

Here's the approach:

  1. Collect historical texts about World War I from Wikipedia, archives, and historical documents
  2. Automatically identify named entities (using a standard NER tagger) — people, places, organizations, events
  3. Redact some of them by replacing the name with a token that says the type: [REDACTED:PERSON], [REDACTED:GPE] (geographic), etc.
  4. Train a language model to predict the original name from the surrounding context
  5. Test the model and see how well it recovers the redacted information

The trick: by telling the model "this is definitely a person's name, not a country," we give it a strong prior that makes the task well-defined. It's not just guessing missing words; it's working out which entity of a known type fits the context.
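
To make the idea concrete, here is a minimal sketch of typed redaction built on spaCy. This is illustrative rather than the repo's actual scripts/redactor.py; the function name and the set of kept labels are assumptions.

# Sketch of typed redaction with spaCy (illustrative; the real logic
# lives in scripts/redactor.py, and the KEEP set is an assumption).
import spacy

nlp = spacy.load("en_core_web_sm")
KEEP = {"PERSON", "GPE", "LOC", "ORG", "NORP", "EVENT", "FAC"}

def redact(text):
    """Replace named entities with typed [REDACTED:<LABEL>] tokens."""
    doc = nlp(text)
    redactions, parts, last = [], [], 0
    for ent in doc.ents:
        if ent.label_ not in KEEP:
            continue
        redactions.append({"original": ent.text, "label": ent.label_,
                           "start": ent.start_char, "end": ent.end_char})
        parts.append(text[last:ent.start_char])
        parts.append(f"[REDACTED:{ent.label_}]")
        last = ent.end_char
    parts.append(text[last:])
    return "".join(parts), redactions

masked, spans = redact("Gavrilo Princip assassinated Franz Ferdinand in Sarajevo.")
print(masked)  # e.g. "[REDACTED:PERSON] assassinated [REDACTED:PERSON] in [REDACTED:GPE]."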

Current Status (April 2026)

Latest Results (April 29, zero-OOV test set, 5,326 redactions):

Model                                               Overall  NORP    EVENT   GPE     LOC     FAC     ORG    PERSON
Claude Sonnet 4.6                                   56.3%    79.5%   58.1%   57.7%   14.3%   34.7%   56.8%  —
Claude Haiku 4.5                                    44.0%    68.0%   35.5%   52.6%   14.3%   25.3%   29.5%  —
DistilBERT (6,575 articles, 7 epochs, constrained)  9.97%    13.21%  13.79%  9.29%   10.88%  8.51%   8.13%  7.83%
DistilBERT (4,299 articles, baseline)               6.6%     12.7%   10.3%   —       —       —       —      —

The gap between frontier and fine-tuned models (44–56% vs. ~10%) means the frontier models recover roughly 4–6× as many redacted entities as fine-tuned DistilBERT, and they do it without any task-specific fine-tuning. Corpus expansion and hyperparameter tuning are in progress; Claude Opus and OSS-model toplines are pending.

Zero-OOV evaluation: data/test_zero_oov.jsonl strips OOV redactions so every scored entity appears in the training candidate set — the cleanest signal for the hallucination-probe thesis. This is now the primary evaluation target.
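
The filtering idea itself is simple. A minimal sketch follows, assuming the record layout described under Data below (the top-level "redactions" key is an assumption); the real implementation is scripts/filter_zero_oov.py.

# Sketch: keep only test redactions whose entity also appears somewhere
# in the training candidates. (Illustrative; see scripts/filter_zero_oov.py.)
import json

def candidates(train_path):
    with open(train_path) as f:
        return {r["original"] for line in f
                for r in json.loads(line)["redactions"]}

def filter_zero_oov(test_path, out_path, known):
    with open(test_path) as f, open(out_path, "w") as out:
        for line in f:
            rec = json.loads(line)
            rec["redactions"] = [r for r in rec["redactions"]
                                 if r["original"] in known]
            if rec["redactions"]:
                out.write(json.dumps(rec) + "\n")

known = candidates("data/train_redacted_curated.jsonl")
filter_zero_oov("data/test_redacted_curated.jsonl", "data/test_zero_oov.jsonl", known)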

DistilBERT baseline: models/zero_oov (6,575-article corpus, 9 epochs, April 24) — 6.61% constrained accuracy, 37% OOV on the full test set.


How We Got Here

Our earlier baseline recovered ~6.6% of redacted names when the model chose from a list of known candidates; with free generation it dropped to ~3–4%. These are modest numbers, and that's actually useful data: it tells us that even with context, reconstructing specific historical names is genuinely hard.

The bigger insight: 37% of the entities the model is tested on have never appeared in its training data at all. This is a major bottleneck. It's not that the model can't learn to recognize patterns — it's that some entities are simply absent from its training corpus. We're systematically eliminating this gap through targeted corpus expansion.


What If We Succeed?

Frontier models already hit 44–56% on this task. The question is now whether a fine-tuned model can close that gap. If we could push the fine-tuned baseline to 20–30%, it would mean:

  • Better hallucination detection: We'd have a probe for when models confabulate names or facts
  • Understanding knowledge gaps: We'd know which historical actors and events a model genuinely doesn't know about
  • Smarter language models: Techniques learned here could improve how models handle factual entities in general
  • A better measure of "understanding": Instead of just asking "does the model pass a test?", we'd ask "does it actually understand context and relationships?"

This could become a standard benchmark for evaluating frontier models — a way to ask not just "is this model smart?" but "does this model actually know what it claims to know?"


The Technical Side (For Those Who Want It)

The model: We use DistilBERT, a lighter version of BERT (~66 million parameters). It's fast to train and good at understanding context without being enormous.

The data: 6,575 World War I articles from Wikipedia (as of April 2026), with careful curation to remove noisy redactions (numbers, malformed tokens, etc.). We split 80/20 into training and test sets. OOV is currently 37% and dropping as targeted entity fetching continues.

The training: We show the model passages where some named entities are hidden behind typed [REDACTED:TYPE] tokens. It learns to predict what was hidden from the surrounding words plus the type hint (e.g., "this is a person").
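
One practical detail: for the model to treat each typed marker as a single unit, the markers have to be registered with the tokenizer. Here is a minimal sketch using the Hugging Face transformers API; whether src/ner_recovery/train.py does exactly this is an assumption.

# Sketch: register the typed redaction markers as special tokens so the
# tokenizer keeps them whole instead of splitting them into subwords.
# (Illustrative; the repo's training entry point is src/ner_recovery/train.py.)
from transformers import AutoTokenizer, AutoModelForMaskedLM

LABELS = ["PERSON", "GPE", "LOC", "ORG", "NORP", "EVENT", "FAC"]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")

tokenizer.add_special_tokens(
    {"additional_special_tokens": [f"[REDACTED:{label}]" for label in LABELS]}
)
model.resize_token_embeddings(len(tokenizer))  # make room for the new tokens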

Two evaluation modes:

  • Constrained: The model picks from a list of known entities. More realistic for a probe, and it gives better numbers (see the scoring sketch after this list).
  • Free: The model generates any name. This shows what it actually learned, and is usually much harder.
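
Here is a minimal sketch of the constrained mode, scoring single-token candidates at a masked position with DistilBERT's masked-language-model head. It is illustrative, not the repo's eval.py: real entity names usually span multiple subword tokens and need a per-token scoring loop.

# Sketch: constrained evaluation. Score each candidate at the masked
# slot and return the highest-scoring one. Assumes each candidate is a
# single vocabulary token; multi-token names need more machinery.
# (Illustrative; the repo's evaluator is src/ner_recovery/eval.py.)
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")
model.eval()

def pick_candidate(context, candidates):
    inputs = tokenizer(context, return_tensors="pt")
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    return max(candidates,
               key=lambda c: logits[tokenizer.convert_tokens_to_ids(c)].item())

print(pick_candidate("The treaty was signed in [MASK] in 1919.",
                     ["paris", "london", "berlin"]))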

How to Use This

Quick Start

# Install dependencies
uv sync
python -m spacy download en_core_web_sm

# Data pipeline (if starting from scratch)
uv run python scripts/combiner.py          # Combine sources
uv run python scripts/splitter.py          # 80/20 split
uv run python scripts/redactor.py --input data/train_clean.jsonl --output data/train_redacted.jsonl --mode train
uv run python scripts/redactor.py --input data/test_clean.jsonl --output data/test_redacted.jsonl --mode test

# Data curation (removes noisy redactions — strongly recommended)
uv run python src/ner_recovery/curator.py --input data/train_redacted.jsonl --output data/train_redacted_curated.jsonl
uv run python src/ner_recovery/curator.py --input data/test_redacted.jsonl --output data/test_redacted_curated.jsonl

# Train the model (with curated data, 7 epochs recommended)
uv run train --epochs 7 --output-dir models/current

# Evaluate (constrained and free modes)
uv run evaluate --model-dir models/current/final --data data/test_redacted_curated.jsonl --train data/train_redacted_curated.jsonl
uv run evaluate --model-dir models/current/final --data data/test_redacted_curated.jsonl --train data/train_redacted_curated.jsonl --mode free

# Analyze OOV (out-of-vocabulary entities)
uv run python src/ner_recovery/oov_analysis.py --train data/train_redacted_curated.jsonl --test data/test_redacted_curated.jsonl

Data

Raw corpora live in data/:

  • wwi_extended.jsonl — ~6,575 Wikipedia articles, cleaned plaintext
  • train_redacted_curated.jsonl — curated training set with redacted entities
  • test_redacted_curated.jsonl — curated test set for evaluation
  • test_zero_oov.jsonl — test set filtered so every redaction's entity appears in training candidates

Each redaction record contains:

  • The article title and text
  • A list of redactions: {"original": "Serbia", "label": "GPE", "start": 1107, "end": 1119}
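
To make the format concrete, here is one way to read a record. The top-level "title", "text", and "redactions" keys are assumptions based on the list above, as is the question of whether the offsets index the original or the redacted text.

# Sketch: read the first JSONL record and inspect its redactions.
# (Field names are assumptions; whether start/end index the original or
# the redacted text should be checked against the actual data.)
import json

with open("data/train_redacted_curated.jsonl") as f:
    record = json.loads(next(f))

for r in record["redactions"]:
    print(f"{record['title']}: {r['label']} -> {r['original']!r} "
          f"(chars {r['start']}-{r['end']})")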

Model Outputs

Trained models save to models/:

  • models/current/ — latest training run
  • models/final/ — stable checkpoint

Each contains the full model weights and tokenizer.


Key Files

File                              What It Does
src/ner_recovery/train.py         Fine-tunes DistilBERT on the redaction task; supports --epochs N and --output-dir PATH
src/ner_recovery/eval.py          Evaluates the model in constrained and free modes; accepts --model-dir, --data, --train, --mode
src/ner_recovery/curator.py       Filters noisy redactions (numbers, malformed tokens, low-confidence labels) for clean training
src/ner_recovery/oov_analysis.py  Reports which test entities are missing from training candidates; identifies corpus coverage gaps
scripts/redactor.py               Uses spaCy to redact entities in text; supports --mode train (weighted) or --mode test (uniform)
scripts/combiner.py               Combines all *_clean.jsonl sources into all_clean.jsonl (deduplicates by pageid)
scripts/splitter.py               Splits all_clean.jsonl into 80/20 train/test with seed=42
scripts/fetch_oov_entities.py     Fetches Wikipedia articles for OOV entities and appends them to wwi_extended.jsonl
scripts/filter_zero_oov.py        Strips OOV redactions from the test set; produces test_zero_oov.jsonl for clean evaluation
scripts/eval_claude.py            Frontier-model topline via the Claude API (batch or sequential); saves JSON + txt to evals/
scripts/benchmark.py              Submits multiple Claude models as parallel batches; prints a comparison table
scripts/benchmark_table.py        Aggregates all saved JSON results from evals/ into a ranked comparison table
scripts/eval_lmstudio.py          Evaluates any local/OSS model via LM Studio's OpenAI-compatible server

The Summit: What We're Aiming For

If this work succeeds, it becomes a tool for understanding what language models really know. Instead of asking "is this model intelligent?", we ask "does this model understand the structure of history, geography, and human relationships?" That's a much harder question, and a much more useful one.

Imagine a future where every large language model comes with an "understanding score" — not just accuracy on a benchmark, but a measure of how well it genuinely knows entities, facts, and their relationships. This project is a step toward that.

We're not there yet. But the road is clear, and the question is worth asking.


References

Key papers we're building on:

  • Engram: Conditional lookup + constrained decoding for factual generation
  • BERT: The foundation for masked language model pretraining
  • Hallucination research: Understanding when and why models make things up
  • Entity understanding: How well do models actually track named entities?

Full references in refs/.


Questions?

This is research in progress. If you want to understand more or contribute, reach out.
