A language model's vocabulary, rendered as color and sound.
ColorGPT takes the input-token embedding matrix of a small language model — Qwen 2.5 0.5B, ~152k tokens × 896 dimensions — projects it to three dimensions with UMAP, and uses those coordinates as both a perceptual color space (OKLab → sRGB) and a sonic one (Web Audio). Each of the model's tokens gets a fixed color and a fixed timbre. Generation, reading, and dialogue then become chromatic and acoustic events, played out in real time.
The piece is an attempt to render the model's mental geography of language — not what words mean to us, but how a network with no eyes, no ears, and no body has learned to organize symbols among themselves.
This is not visual synesthesia. The token blue is not blue. scarlet is not red.
What you see is the geometry of the model's input-embedding space, projected into three dimensions and assigned to color and sound. Tokens that are close in that space are close in color and timbre. So blue / azure / navy will cluster — not because they describe blue things, but because the model has learned they appear in similar contexts. The same is true for Monday / Tuesday / Wednesday, for inflections of a single verb, and for the BPE fragments ing, ed, ly.
A second commitment: ColorGPT renders subword tokens, not words. The model does not see "scarlet"; it sees scar followed by let, with two unrelated colors. Word-averaging would smooth this away and lie about how the model perceives text. The fragmentation is the point.

    Qwen 2.5 0.5B input embeddings (vocab × 896)
            ↓ UMAP (cosine metric, 3D)
    3D coordinates per token
            ↓ axis 0 → OKLab L                                   ↓ all three axes
            ↓ axis 1 → OKLab a → sRGB → normalized [0,255]³
            ↓ axis 2 → OKLab b                                   driving Web Audio synth
                                                                 (pitch, timbre, pan)

The same UMAP coordinates drive both modalities. A token's color and its sound are coupled — they are two readings of the same vector. Dialogue mode hard-pans speaker A to the left channel and speaker B to the right; otherwise the audio is a function of the token alone, not who said it.
OKLab is used because it is approximately perceptually uniform: equal Euclidean distances in OKLab correspond closely to equal perceived color differences. Small movements in the projected space therefore produce small movements in apparent color, and large movements produce large ones, so the color mapping is an honest one rather than an artifact of sRGB's well-known non-uniformities.
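The OKLab to sRGB step can be sketched in plain Python using the constants from Björn Ottosson's published reference conversion. This is a minimal illustration, not the code in build_lut.py: the function name `oklab_to_srgb8` is hypothetical, and the simple per-channel clamp for out-of-gamut colors is an assumption.

```python
def oklab_to_srgb8(L: float, a: float, b: float) -> tuple[int, int, int]:
    """Convert one OKLab color to 8-bit sRGB (hypothetical helper)."""
    # OKLab -> nonlinear LMS, using Ottosson's published inverse matrix.
    l_ = L + 0.3963377774 * a + 0.2158037573 * b
    m_ = L - 0.1055613458 * a - 0.0638541728 * b
    s_ = L - 0.0894841775 * a - 1.2914855480 * b
    # Undo the cube-root nonlinearity.
    l, m, s = l_ ** 3, m_ ** 3, s_ ** 3
    # LMS -> linear sRGB.
    linear = (
        4.0767416621 * l - 3.3077115913 * m + 0.2309699292 * s,
        -1.2684380046 * l + 2.6097574011 * m - 0.3413193965 * s,
        -0.0041960863 * l - 0.7034186147 * m + 1.7076147010 * s,
    )

    def encode(x: float) -> int:
        # Assumption: out-of-gamut values are simply clamped per channel.
        x = min(1.0, max(0.0, x))
        # Standard sRGB transfer function (linear -> gamma-encoded).
        v = 12.92 * x if x <= 0.0031308 else 1.055 * x ** (1 / 2.4) - 0.055
        return round(255 * v)

    return tuple(encode(c) for c in linear)


white = oklab_to_srgb8(1.0, 0.0, 0.0)
black = oklab_to_srgb8(0.0, 0.0, 0.0)
gray = oklab_to_srgb8(0.5, 0.0, 0.0)
```

With a = b = 0 the result is a neutral gray whose three channels are equal, which is a quick sanity check that lightness alone is driving the output.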
All three modes (generation, reading, and dialogue) are streamed over Server-Sent Events. The browser receives {id, text, rgb, u, source, hold_ms} per token and paints / sounds it.
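As a rough sketch of what one such event could look like on the wire: an SSE message is a `data:` line followed by a blank line. The field names come from the list above, but every value below is illustrative, the meaning of `u` (taken here as the token's normalized 3D coordinates) and the `source` label are assumptions, and `sse_frame` is a hypothetical helper rather than a function from server.py.

```python
import json


def sse_frame(event: dict) -> str:
    """Serialize one token event as a single Server-Sent Events message."""
    # json.dumps escapes newlines inside strings, so the payload stays on one
    # "data:" line; the trailing blank line terminates the message.
    return f"data: {json.dumps(event, ensure_ascii=False)}\n\n"


event = {
    "id": 13,                 # made-up token id
    "text": ".",
    "rgb": [0, 143, 87],
    "u": [0.5, 0.5, 0.5],     # assumed: normalized 3D coordinates
    "source": "generate",     # assumed label for the emitting mode
    "hold_ms": 1200,          # illustrative: 500 ms base + 700 ms pause
}
frame = sse_frame(event)
```

On the browser side, an `EventSource` would hand the text after `data: ` to `JSON.parse`.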
A single instance of Qwen autoregresses from a prompt. A literary-prose prefix is prepended to nudge the model away from QA-style continuations. Non-Latin tokens are suppressed via bad_words_ids — we want the script the LUT was tuned for.
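The `bad_words_ids` argument of Hugging Face `generate` takes a list of token-id lists. One way such a list might be assembled is sketched below using only the standard library. The Latin-script test via Unicode character names is an assumption about how "non-Latin" could be decided, and both helper names and the toy vocabulary are hypothetical.

```python
import unicodedata


def is_non_latin(text: str) -> bool:
    """True if any alphabetic character in `text` is outside the Latin script."""
    return any(
        ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN")
        for ch in text
    )


def build_bad_words_ids(decoded_vocab: dict[int, str]) -> list[list[int]]:
    """Wrap each non-Latin token id in its own list (one banned sequence each)."""
    return [[tid] for tid, text in sorted(decoded_vocab.items()) if is_non_latin(text)]


# Toy vocabulary with made-up ids; a real run would decode every tokenizer id.
toy_vocab = {0: " the", 1: "é", 2: "日本", 3: "1984", 4: "да", 5: "."}
bad = build_bad_words_ids(toy_vocab)
```

Digits and punctuation pass through untouched because only alphabetic characters are tested.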
A corpus file is tokenized and emitted in order. No generation, no sampling — pure transcription, the corpus passed through the model's vocabulary as colored cells. Provided corpora are public-domain English: the King James Bible (Genesis 1, 1 Corinthians 13, Ecclesiastes 3, John 1) and Conrad's Heart of Darkness.
Two Qwen instances pass the rolling transcript back and forth. Even turns are speaker A, odd turns are speaker B. Same weights, separate KV state per turn. In the audio, A is panned hard left and B hard right, so the room becomes a stereo conversation.
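The turn-to-speaker rule is small enough to write down directly. The sketch below assumes zero-based turn indices and uses -1.0 and +1.0, the extremes of the Web Audio `StereoPannerNode` pan range, for hard left and hard right. `speaker_for_turn` is a hypothetical name.

```python
def speaker_for_turn(turn: int) -> tuple[str, float]:
    """Return (speaker, pan) for a zero-based turn index."""
    # Even turns are speaker A (hard left), odd turns are speaker B (hard right).
    return ("A", -1.0) if turn % 2 == 0 else ("B", 1.0)


schedule = [speaker_for_turn(t) for t in range(4)]
```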
Pacing lives in pacing.py, downstream of the streams. Every token gets a base hold of 1 / base_tps (default 500 ms). On top of that, tokens ending in punctuation receive a tiered additional pause:
| punctuation | pause (ms) | role |
|---|---|---|
| . ! ? . 。 | 700 | sentence-end |
| … | 900 | trailing-off |
| ; | 400 | clause |
| : | 350 | clause |
| , — – , | 220 | phrase |
| \n | 450 | line break |
| closers ) ] " ' ” ’ | 140–180 | release |
A . is always rendered as rgb(0, 143, 87) and receives the 700 ms sentence-end pause on top of its base hold. The piece has a heartbeat: punctuation marks become visual and sonic punctuation, identical across modes.
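The timing rule can be sketched as a base hold plus a lookup on the token's last character. This is an illustration rather than the contents of pacing.py: `base_tps=2.0` is inferred from the 500 ms default, the `hold_ms` function name is hypothetical, and the closers row is left out because the table gives only a range for it.

```python
# (characters, extra pause in ms), a subset of the rows in the table above.
PAUSES = [
    (".!?。", 700),   # sentence-end
    ("…", 900),       # trailing-off
    (";", 400),       # clause
    (":", 350),       # clause
    (",—–", 220),     # phrase
    ("\n", 450),      # line break
]


def hold_ms(text: str, base_tps: float = 2.0) -> float:
    """Base hold of 1 / base_tps seconds, plus a pause if the token ends in punctuation."""
    base = 1000.0 / base_tps
    for chars, pause in PAUSES:
        if text and text[-1] in chars:
            return base + pause
    return base


plain = hold_ms(" word")    # base hold only
full_stop = hold_ms("end.") # base hold + sentence-end pause
```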
histogram.py renders four image types from any text. All are produced as PNGs intended for print or wall display.
| render | what it shows |
|---|---|
| transcript | Every content token of a corpus packed into a perfect ⌈√N⌉ × ⌈√N⌉ grid, with a hue-sorted palette strip below. Verse structure is deliberately dropped — the corpus collapses into a square of color. |
| palette | Frequency-weighted bars sorted by hue. The corpus's chromatic fingerprint as a Pantone fan. |
| reading-map | A per-corpus grid of unique tokens, each cell labeled with its decoded text, sized for printing as a handout (25 mm cells) or wall card (15 mm cells). The Rosetta stone for a corpus. |
| vocab atlas | The whole 151,936-token vocabulary as a √V × √V chromatic atlas. Sortable by token ID (training-derived structure: common merges first, then rarer / CJK-byte-fallback tokens) or by hue (chromatic distribution). |
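The square packing used by the transcript and atlas renders comes down to a ceiling square root and a row-major index. The sketch below shows that arithmetic along with one possible hue sort. The helper names are hypothetical, and using HSV hue from `colorsys` is an assumption, since the renders may define hue differently (for example in OKLab terms).

```python
import colorsys
import math


def grid_side(n: int) -> int:
    """Smallest side s with s * s >= n, i.e. ceil(sqrt(n)), in exact integer math."""
    s = math.isqrt(n)
    return s if s * s == n else s + 1


def cell_position(index: int, side: int) -> tuple[int, int]:
    """Row-major (row, column) of the index-th token in the square."""
    return divmod(index, side)


def hue_sorted(colors: list[tuple[int, int, int]]) -> list[tuple[int, int, int]]:
    """Sort 8-bit RGB triples by HSV hue (assumed hue definition)."""
    return sorted(colors, key=lambda c: colorsys.rgb_to_hsv(*(v / 255 for v in c))[0])


atlas_side = grid_side(151_936)   # full vocabulary
john_side = grid_side(146)        # unique tokens in John 1
genesis_side = grid_side(187)     # unique tokens in Genesis 1
```

For the full vocabulary this gives a 390 × 390 square, leaving 164 cells of the final row empty.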
Every content token of the passage as a colored cell, packed into a square. The hue-sorted palette underneath is the corpus's chromatic fingerprint.
The whole vocabulary of Conrad's novella as a labeled grid — the Rosetta stone for a corpus, sized for a wall card.
The atlas restricted to the tokens present in a single corpus — a chromatic fingerprint at the vocabulary level. John 1 (146 unique tokens) on the left, Genesis 1 (187) on the right.
| John 1 | Genesis 1 |
|---|---|
| ![]() | ![]() |
Generate your own with the snippet under Run below.
demo.mp4
Requires Python 3.11+. First-time setup, in PowerShell:

    python -m venv .venv
    .\.venv\Scripts\Activate.ps1
    pip install -r requirements.txt
    python build_lut.py # ~6 minutes, one-time precompute
    python server.py # http://127.0.0.1:5000

build_lut.py produces three files: lut.bin (the token → RGB lookup, ~445 KB), lut_meta.json (vocab size, model id, parameters), and umap_3d.npy (cached 3D coords, so OKLab tuning can be re-rendered without re-running UMAP). All three are regenerable and gitignored.
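The ~445 KB size of lut.bin matches 151,936 × 3 bytes (455,808 bytes, about 445 KiB). That suggests a flat array of 8-bit RGB triples indexed by token id, although the layout is not stated here. Under that assumption, a lookup would be a three-byte slice. `lut_rgb` is a hypothetical helper and the bytes below are a stand-in for the real file.

```python
VOCAB_SIZE = 151_936
EXPECTED_BYTES = 3 * VOCAB_SIZE   # one R, G, B byte per token


def lut_rgb(lut: bytes, token_id: int) -> tuple[int, int, int]:
    """Read one token's color from a flat RGB byte table (assumed layout)."""
    offset = 3 * token_id
    if token_id < 0 or offset + 3 > len(lut):
        raise IndexError(f"token id {token_id} is outside the table")
    return lut[offset], lut[offset + 1], lut[offset + 2]


# In-memory stand-in for lut.bin holding three made-up tokens.
toy_lut = bytes([255, 0, 0, 0, 143, 87, 10, 20, 30])
color = lut_rgb(toy_lut, 1)
```

With the real file, `open("lut.bin", "rb").read()` would take the place of `toy_lut`, provided the layout assumption holds.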
Press F for fullscreen, or open ?fullscreen=1 directly. Audio starts on first user interaction (browser policy).
For static renders without the server:

    python visualize.py "the quick brown fox" # one-off colored strip → out.html
    python -c "from histogram import render_transcript; render_transcript(open('corpus/john1.txt').read()).save('john1.png')"

| file | role |
|---|---|
| build_lut.py | One-time LUT precompute (embeddings → UMAP → OKLab → sRGB). |
| engine.py | Lazy-loaded shared state: model, tokenizer, color LUT, audio LUT. |
| streams.py | The three modes. Each yields token events; bounded Queue(maxsize=1) for backpressure. |
| pacing.py | Base TPS + tiered punctuation pauses. Single source of truth for timing. |
| server.py | Flask + SSE. Serves the UI, three streams, corpus uploads, static renders, filter bitmaps. |
| histogram.py | Static PNG renders (transcript, palette, reading-map, vocab atlas). |
| visualize.py | Standalone colored-strip HTML for arbitrary text. |
| templates/index.html | UI + Web Audio synth + client-side vocab canvas. |
| corpus/ | Public-domain text (KJV passages, Heart of Darkness). |
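The backpressure idea noted for streams.py can be illustrated with the standard library: with `Queue(maxsize=1)`, the producer's `put` blocks until the consumer has taken the previous token, so generation cannot run ahead of playback. This is a sketch of the pattern only. The function names, the `None` sentinel, and the toy token strings are assumptions.

```python
import queue
import threading


def produce(q: queue.Queue, tokens: list[str]) -> None:
    for tok in tokens:
        q.put(tok)   # blocks while the single slot is still occupied
    q.put(None)      # sentinel: the stream is finished


def consume(q: queue.Queue) -> list[str]:
    out = []
    while (tok := q.get()) is not None:
        out.append(tok)   # a real consumer would pace and emit the event here
    return out


q = queue.Queue(maxsize=1)
tokens = ["In", " the", " beginning"]
worker = threading.Thread(target=produce, args=(q, tokens), daemon=True)
worker.start()
received = consume(q)
worker.join()
```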
A physical instantiation of ColorGPT is in development.
ColorGPT uses Qwen 2.5 0.5B (Apache 2.0) for both the embedding source and the live generation. UMAP is umap-learn. OKLab is Björn Ottosson's perceptual color space (2020). Provided corpora are public domain.
See CITATION.cff — GitHub renders a "Cite this repository" widget from it. If you write about the work, the dual license below applies.
Dual-licensed:
- Code (everything except README.md, CITATION.cff, and any future docs/): Apache License 2.0. See also NOTICE.
- Writeup (README.md and any future docs/*.md): CC BY 4.0.
The split exists because the code is meant to be reused and the prose is meant to be cited. Apache 2.0 carries an explicit patent grant and a retaliation clause — defensive cover for a project working in a domain where patent activity is increasing. CC BY 4.0 preserves the conceptual claim: anyone may quote, adapt, or translate the framing of ColorGPT, but must credit the original.





