A local-first web retrieval engine that returns a cited, LLM-synthesized answer — not a wall of blue links.
You ask a question. OLLA searches the web, fetches and cleans the content, ranks
it, embeds it into a growing knowledge graph, asks a local LLM to write a direct
answer with inline [n] citations, and hands the whole thing back in one call —
with a per-stage trace so you can see exactly what happened.
# Zero to first answer:
git clone <your-repo-url> olla
cd olla
make setup && make migrate && make dev
# Ask something:
curl -X POST http://localhost:8000/api/v1/search \
-H "Content-Type: application/json" \
-d '{"query": "how does pgvector work"}'Three ways to drive it: the HTTP API, the OLLA CLI (python cli.py), and
the web console (frontend/). It also speaks MCP, so Claude Desktop can
call it as a native tool.
- What OLLA Does
- The Pipeline
- Architecture
- Setup
- Environment Variables
- API Reference
- The OLLA CLI
- The Web Console
- Knowledge Graph
- Feedback & Learning
- MCP Integration
- Project Structure
- Running Tests
- Troubleshooting
Given a natural-language query, OLLA:
- Searches the web (DuckDuckGo, multi-backend with fallback) for candidate URLs.
- Fetches clean content for each URL through a fetch waterfall (Jina Reader → direct scrape → snippet).
- Cleans, sanitizes and chunks the text — strips boilerplate, removes prompt-injection attempts, splits into RAG-ready chunks.
- Ranks results with TF-IDF + content density, then blends in a learned source-trust score.
- Synthesizes an answer — a local LLM (via Ollama) reads the ranked sources and writes a direct answer with inline
[n]citations. - Stores the query, results and chunks in PostgreSQL.
- Embeds the new chunks locally (384-dim BGE vectors — no API key needed).
- Builds the knowledge graph — links chunks by semantic similarity so OLLA's memory expands with every search.
Every stage is wrapped in a trace span (success / fallback / skipped / failed)
returned on the response and persisted for the dashboard. Non-fatal stages can fail
without taking down the request — the response is just marked degraded.
On top of the core pipeline OLLA adds confidence-routed hybrid retrieval (cache → local memory → fresh web crawl), knowledge-graph traversal, feedback-driven ranking, API-key auth, and MCP.
Stack: FastAPI · PostgreSQL + pgvector · Redis · BAAI/bge-small-en-v1.5
(local embeddings) · Ollama (local answer LLM) · React + Vite frontend · Docker Compose.
query
│
▼ ┌── cache ──────────────────────────────────────────────┐
│ Redis hit? return immediately (< 50 ms) │
└────────────────────────────────────────────────────────┘
│ (miss)
▼
search → fetch → clean → rank → answer → store → embed → graph
│ │ │ │ │ │ │ │
│ │ │ │ │ │ │ └─ link chunks by
│ │ │ │ │ │ │ similarity (graph)
│ │ │ │ │ │ └─ vectorise new chunks
│ │ │ │ │ └─ persist query + results + chunks
│ │ │ │ └─ local LLM writes a cited answer
│ │ │ └─ TF-IDF + density + source-trust ranking
│ │ └─ clean + sanitize + chunk
│ └─ Jina → direct → snippet fetch waterfall
└─ DuckDuckGo multi-backend search
search and fetch are critical — if they fail the request fails. The
enrichment stages (store, embed, graph) are non-fatal: a failure there
marks the response degraded but still returns results and an answer.
The fastest path. Brings up the API, PostgreSQL (pgvector) and Redis together.
cp .env.example .env # adjust if you like
make docker-up # build + start api, db, cache
make migrate # run Alembic migrations (inside the api container)Or directly:
cd docker
docker-compose up --build -d
docker-compose exec api alembic upgrade headThe API is then at http://localhost:8000 · interactive docs at /docs.
# 1. virtualenv + dependencies
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
# 2. PostgreSQL (pgvector) + Redis — easiest via Docker
make docker-up # starts just db + cache, or use your own
# 3. configure + migrate
cp .env.example .env
make migrate # alembic upgrade head
# 4. run the API
make dev # uvicorn app.main:app --reloadLocal answer synthesis (optional but recommended). OLLA writes its answers with a local LLM through Ollama. Without it, searches still return ranked sources — just no synthesized answer.
ollama serve
ollama pull qwen2.5:1.5b # small + fast; set OLLAMA_MODEL in .envVerify the LLM wiring any time with python cli.py --test-llm.
A React + Vite single-page console that drives every API feature.
cd frontend
npm install
npm run dev # dev server on http://localhost:5173 (proxies /api → :8000)
npm run build # production build → frontend/dist/The dev server proxies /api/* to http://127.0.0.1:8000, so just keep the
backend running alongside it.
Copy .env.example to .env. The most relevant settings:
# ── Core ─────────────────────────────────────────────────────────
DEBUG=false
# ── Search & fetch ───────────────────────────────────────────────
MAX_SEARCH_RESULTS=5
MAX_CHARS_PER_PAGE=8000
FETCH_TIMEOUT_SECONDS=15
DEFAULT_CHUNK_SIZE=500
DEFAULT_CHUNK_OVERLAP=50
# ── PostgreSQL ───────────────────────────────────────────────────
DATABASE_URL="postgresql+asyncpg://postgres:password@localhost:5433/hybriddb"
# ── Redis ────────────────────────────────────────────────────────
REDIS_URL="redis://localhost:6379"
CACHE_TTL_SECONDS=3600
# ── Embeddings (local by default — no API key needed) ────────────
USE_LOCAL_EMBEDDINGS=true
OPENAI_API_KEY="" # only if USE_LOCAL_EMBEDDINGS=false
# ── Knowledge graph ──────────────────────────────────────────────
ENABLE_KNOWLEDGE_GRAPH=true # embed + graph stages on every search
GRAPH_SIMILARITY_THRESHOLD=0.85 # min cosine similarity to draw an edge
# ── Answer synthesis (local LLM via Ollama) ──────────────────────
ENABLE_ANSWER_SYNTHESIS=true
OLLAMA_BASE_URL="http://localhost:11434"
OLLAMA_MODEL="qwen2.5:1.5b"
# ── Auth ─────────────────────────────────────────────────────────
REQUIRE_AUTH=false
API_KEYS="key1,key2" # comma-separated; used when REQUIRE_AUTH=truePrivacy mode: set
LOCAL_ONLY=trueto forbid any request from reaching an external AI provider — local embeddings are forced and external keys cleared.
All endpoints are under /api/v1. Interactive docs at /docs.
Deep health check — overall status plus per-component status and latency (database, cache, …). Always returns 200, exempt from auth.
The main pipeline. Body:
| Field | Type | Default | Notes |
|---|---|---|---|
query |
string | required | 3–500 chars |
max_results |
int | 5 | 1–10 — pages to fetch |
max_chars_per_page |
int | 8000 | per-page character cap |
chunk_size |
int | 500 | target chars per chunk |
chunk_overlap |
int | 50 | overlap between chunks |
min_score |
float | 0.0 | minimum score to keep a result |
llm_model |
string | — | override the Ollama answer model |
Returns a SearchResponse:
{
"query": "how does pgvector work",
"total_results": 5,
"processing_time_ms": 2140,
"results": [
{ "rank": 1, "title": "...", "url": "...", "content": "...",
"score": 0.91, "chunks": [ ... ] }
],
"answer": "pgvector stores embeddings as a native column type [1] ...",
"answer_model": "qwen2.5:1.5b",
"trace": [ { "stage": "search", "status": "success", "duration_ms": 410, "detail": "..." }, ... ],
"degraded": false,
"cache_hit": false,
"query_id": "f1e2…",
"citations_markdown": "...",
"citations_json": [ ... ]
}
query_idis the UUID of the stored query — pass it back asquery_idwhen submitting answer-level feedback.
Confidence-routed retrieval: checks cache and local semantic memory first, and only crawls the web when memory confidence is low or the query is recency-sensitive.
| Field | Type | Default | Notes |
|---|---|---|---|
query |
string | required | |
mode |
string | auto |
auto · fast · fresh · hybrid · deep |
max_results |
int | 6 |
The response includes retrieval_mode, query_class, confidence,
from_memory, and a human-readable routing_trace.
Knowledge-graph traversal. Finds seed chunks by vector similarity to the query,
then walks chunk_edges outward. Body: { query, hops, seed_k, top_k }.
Returns seed_chunks, connected_chunks, total_chunks, hops.
Pure vector similarity over stored chunks. Body: { query, top_k, min_similarity }.
Backfills embeddings for any stored chunks that don't have one yet. With the knowledge graph enabled this runs automatically on every search, so it's mostly a manual catch-up tool.
Records feedback and updates ranking signals. Body:
| Field | Type | Notes |
|---|---|---|
level |
string | answer · citation · chunk · source |
feedback_type |
string | useful · not_useful · incorrect · outdated · bad_source · missing_context |
query_id |
string | required for answer-level feedback |
source_url |
string | required for source/citation-level feedback |
chunk_id |
string | required for chunk-level feedback |
comment |
string | optional free-text note |
Aggregate analytics — totals, satisfaction rate, top/low-trust sources.
Public — register an email and receive a free-tier API key (shown once).
List, create and revoke API keys (authenticated).
cli.py is a polished terminal client. Run it with no arguments for the
interactive shell, or pass a query for one-shot mode.
python cli.py # interactive OLLA shell
python cli.py "how does pgvector work" # one-shot search
python cli.py "latest AI news" --hybrid --mode fresh
python cli.py "vector similarity" --graph
python cli.py --health
python cli.py --test-llm # diagnose the local Ollama LLM
python cli.py --feedback-statspython cli.py opens a welcome banner (OLLA wordmark, getting-started panel,
live API status) and a prompt. Just type a question to search, or use a command:
| Command | Action |
|---|---|
<your question> / /ask <q> |
search and get a synthesized answer |
/hybrid on|off |
toggle confidence-routed hybrid retrieval |
/mode <name> |
hybrid mode: auto fast fresh hybrid deep |
/graph <query> |
explore the knowledge graph |
/feedback |
rate the last answer or one of its sources |
/health |
API health + components |
/llm |
diagnose the local Ollama LLM |
/stats |
feedback analytics |
/max <n> · /sources on|off · /json on|off |
tune output |
/host · /ollama · /model |
point at different hosts/models |
/set |
show current settings |
/help · /clear · /exit |
— |
One-shot flags include --host, --max, --graph, --hybrid, --mode,
--json, --no-sources, --health, --test-llm, --ollama, --model,
--feedback (with --level / --type), and --feedback-stats.
The CLI uses rich for its terminal UI — installed with the other requirements.
frontend/ is a React + Vite single-page app, themed to match the CLI
(dark, monospace, cyan→violet). It has a hero section and a tabbed console where
every tab calls the real API:
- Search — cited answer, sources, animated pipeline trace.
- Hybrid — mode selector + routing/confidence metadata.
- Graph — seed chunks and connected (hop) context.
- Health — live per-component status.
- Feedback — rate the last answer or a specific source it used.
- Analytics — satisfaction gauge, feedback-by-type, source trust.
- Keys — mint a free-tier API key.
Run npm run dev in frontend/ with the backend up; build with npm run build.
A legacy self-contained dashboard is also served by the backend at
/dashboard.
When ENABLE_KNOWLEDGE_GRAPH=true (the default), every search runs two extra
pipeline stages after store:
- embed — vectorises the chunks just stored (local BGE model, cached in-process).
- graph —
GraphService.build_edges()drawschunk_edgesbetween chunks whose cosine similarity is at or aboveGRAPH_SIMILARITY_THRESHOLD.
So the graph grows continuously: each search both contributes new content and
links it to everything already stored. POST /api/v1/search/graph then does
multi-hop traversal — seed by similarity, walk the edges — to surface connected
context that plain vector search would miss.
Feedback is first-class. You can rate four things — the synthesized answer,
a citation, a retrieved chunk, or a source domain — via
POST /api/v1/feedback, the CLI's /feedback, or the console's Feedback tab.
Source-level feedback feeds the source-trust score, which the ranking stage
blends into result scoring (final = relevance × 0.7 + source_trust × 0.3), so
results measurably improve as feedback accumulates. GET /api/v1/feedback/stats
exposes the aggregate picture.
mcp_server/server.py exposes OLLA as a Model Context Protocol tool over stdio,
so Claude Desktop can call web search as a native tool. Add to
claude_desktop_config.json:
{
"mcpServers": {
"olla": {
"command": "python",
"args": ["-m", "mcp_server.server"],
"cwd": "/absolute/path/to/olla"
}
}
}The FastAPI server must be running on port 8000 before Claude Desktop starts it.
olla/
├── cli.py ← OLLA CLI (interactive shell + one-shot)
├── app/
│ ├── main.py ← FastAPI app factory, middleware, routers
│ ├── config.py ← Pydantic settings (all env vars)
│ ├── api/
│ │ ├── middleware/ ← auth · rate-limit · tracing · usage-meter
│ │ └── routes/ ← health · search · semantic · hybrid · graph ·
│ │ feedback · keys · sources · workspaces · …
│ ├── models/
│ │ ├── request.py ← request schemas
│ │ ├── response.py ← response schemas
│ │ └── db/ ← SQLAlchemy ORM models (query, result, chunk,
│ │ chunk_edge, feedback, source_trust, …)
│ ├── services/
│ │ ├── pipeline.py ← staged, traced search orchestrator
│ │ ├── retrieval_router.py← confidence-routed hybrid retrieval
│ │ ├── search_service.py · fetch_service.py · clean_service.py
│ │ ├── chunk_service.py · rank_service.py · answer_service.py
│ │ ├── embed_service.py · graph_service.py · store_service.py
│ │ ├── feedback_service.py· source_trust_service.py
│ │ ├── cache_service.py · citation_service.py · …
│ └── db/
│ ├── session.py ← async engine + session factory
│ └── migrations/ ← Alembic (001 … 008)
├── mcp_server/server.py ← MCP stdio server
├── frontend/ ← React + Vite web console
│ ├── index.html
│ └── src/ (App.tsx, App.css, index.css, main.tsx)
├── docker/ ← Dockerfile + docker-compose.yml
├── eval/ ← retrieval-quality benchmark harness
├── tests/ ← pytest suite
├── Makefile · alembic.ini · requirements.txt · pyproject.toml
└── .env.example
make test # full pytest suite
make test-fast # unit tests only (no network / DB)
make test-coverage # with a coverage report
make smoke # quick end-to-end smoke check
make lint # ruff
make fmt # ruff formatOr call pytest directly, e.g. pytest tests/ -v.
No synthesized answer in results / answer is empty.
The local LLM isn't reachable. Start Ollama (ollama serve) and pull the model
(ollama pull qwen2.5:1.5b). Diagnose with python cli.py --test-llm. Search
and sources still work without it — only the synthesized answer needs Ollama.
graph stage shows skipped.
The graph builds only when chunks have embeddings. Confirm
ENABLE_KNOWLEDGE_GRAPH=true and that sentence-transformers is installed; the
first embed loads the BGE model (~130 MB download, one time).
First search is very slow. The local embedding model loads on first use. It's cached in-process afterwards, so subsequent searches are fast.
asyncpg … UndefinedTableError.
Migrations haven't run — make migrate (or alembic upgrade head).
redis … ConnectionError.
Redis isn't running or REDIS_URL is wrong. Start it (make docker-up).
422 Unprocessable Entity on /search.
The body failed validation — most often a query shorter than 3 characters.
Frontend can't reach the API.
Run the backend on port 8000 and use the Vite dev server (npm run dev), which
proxies /api to it. A production build (dist/) needs its own reverse proxy
for /api.
Cannot connect to localhost / proxy issues.
An HTTP_PROXY / ALL_PROXY env var can black-hole localhost requests. The CLI
already disables proxies for its calls; for other clients, unset those vars.
OLLA — Hybrid Search. Smarter Answers.

