Project 5 of the GenAI Developer Roadmap 2026.
A monitoring and evaluation platform for LLM applications. Traces every LLM call (inputs, outputs, latency, token usage, cost), runs automated quality evaluations, detects drift and hallucinations, triggers alerts, supports A/B testing, and presents everything in a real-time Streamlit dashboard.
ObserveAI/
├── week-1/ Tracing & Logging (99 tests)
├── week-2/ Evaluation & Drift (167 tests)
└── week-3/ Dashboard & A/B Testing (191 tests)
Total: 457 tests across 3 weeks.
Transparent tracing layer that wraps any LLM call and records full metadata.
| Component | Description |
|---|---|
| src/tracing/models.py | TraceRecord, TokenUsage, TraceStatus, TraceStats dataclasses |
| src/tracing/decorator.py | @trace_llm_call decorator + trace_llm_call_manual() for manual emission |
| src/tracing/cost.py | Per-model cost calculation with configurable rates |
| src/tracing/stats.py | Aggregate statistics: latency percentiles, cost/tokens by model, success rate |
| src/storage/sqlite_store.py | SQLite storage with full CRUD + filtered queries |
| src/storage/export.py | Export to JSONL/CSV, import from JSONL |
| src/cache/response_cache.py | LRU response cache with TTL expiration and hit/miss stats |
| src/llm/client.py | Multi-provider LLM client with automatic tracing and caching |
cd week-1
python main.py trace # Show recent traces
python main.py stats # Aggregated statistics
python main.py export jsonl # Export to JSONL
python main.py export csv # Export to CSV

Quality scoring, drift detection, and alert rules engine.
| Component | Description |
|---|---|
| src/evaluation/sampler.py | Random sampling with model stratification |
| src/evaluation/quality.py | Heuristic quality scoring: relevance, completeness, coherence, hallucination risk |
| src/evaluation/drift.py | Compare baseline vs current: latency, cost, error rate, quality changes |
| src/alerts/rules.py | Configurable alert rules: error rate, latency, cost, cache rate thresholds |
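A threshold-based alert rule can be sketched as a small dataclass checked against a dictionary of aggregate stats. This is an illustration only: `AlertRuleSketch`, `check_rules`, the metric keys, and the threshold values are all assumptions, not the contents of `src/alerts/rules.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRuleSketch:
    """Hypothetical rule: fire when a metric crosses a threshold."""
    name: str
    metric: str                # key looked up in the stats dict
    threshold: float
    direction: str = "above"   # "above" or "below"

    def is_triggered(self, stats: dict[str, float]) -> bool:
        value = stats.get(self.metric)
        if value is None:
            return False  # a missing metric never fires in this sketch
        if self.direction == "above":
            return value > self.threshold
        return value < self.threshold


def check_rules(rules: list[AlertRuleSketch], stats: dict[str, float]) -> list[str]:
    """Return the names of all rules triggered by the given stats."""
    return [rule.name for rule in rules if rule.is_triggered(stats)]


# Illustrative thresholds only; not the project's defaults.
RULES = [
    AlertRuleSketch("high_error_rate", "error_rate", 0.05),
    AlertRuleSketch("slow_p95", "p95_latency_ms", 2000.0),
    AlertRuleSketch("low_cache_hit_rate", "cache_hit_rate", 0.10, direction="below"),
]
```

Keeping rules as plain data makes them easy to load from configuration and to evaluate both from the CLI and from a dashboard.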
Quality Metrics:
- Relevance — keyword overlap between prompt and response
- Completeness — response length and structure heuristics
- Coherence — sentence structure, punctuation, formatting
- Hallucination Risk — overconfidence patterns, fabricated specifics
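As an example of how a heuristic like the relevance metric might work, the sketch below scores keyword overlap as the fraction of prompt keywords that reappear in the response. The function names, the tokenization, and the tiny stopword list are assumptions made for illustration; the scoring in `src/evaluation/quality.py` may differ.

```python
import re

# Small illustrative stopword list; a real scorer would likely use a larger one.
STOPWORDS = frozenset(
    {"a", "an", "the", "is", "are", "of", "to", "in", "and", "or", "what", "how", "for"}
)


def keywords(text: str) -> set[str]:
    """Lowercase the text, split on non-alphanumerics, and drop stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def relevance_score(prompt: str, response: str) -> float:
    """Fraction of prompt keywords that also appear in the response (0.0 to 1.0)."""
    prompt_terms = keywords(prompt)
    if not prompt_terms:
        return 0.0  # nothing to match against
    return len(prompt_terms & keywords(response)) / len(prompt_terms)
```

A score like this costs nothing to compute but only measures lexical overlap, so a fluent off-topic answer that reuses the prompt's words would still score well.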
Drift Detection: Splits traces into baseline (older half) and current (newer half), compares P50 latency, avg cost per call, error rate, and quality score. Alerts when changes exceed configurable thresholds.
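The split-and-compare step can be sketched for a single metric, P50 latency. The function name, the return shape, and the 20% default threshold are assumptions chosen for the example; the input is assumed to be ordered oldest to newest.

```python
from statistics import median


def detect_latency_drift(latencies_ms: list[float], threshold_pct: float = 20.0) -> dict:
    """Compare P50 latency of the older half against the newer half."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two traces to compare windows")
    mid = len(latencies_ms) // 2
    baseline, current = latencies_ms[:mid], latencies_ms[mid:]
    base_p50 = median(baseline)
    curr_p50 = median(current)
    # Guard against division by zero when the baseline median is 0.
    change_pct = (curr_p50 - base_p50) / base_p50 * 100 if base_p50 else 0.0
    return {
        "baseline_p50": base_p50,
        "current_p50": curr_p50,
        "change_pct": change_pct,
        "drifted": abs(change_pct) > threshold_pct,
    }
```

For example, latencies of `[100, 110, 120, 150, 160, 170]` give a baseline P50 of 110 and a current P50 of 160, a change of roughly 45%, which exceeds the 20% threshold. The same pattern applies to average cost, error rate, and quality score.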
cd week-2
python main.py evaluate # Score quality on sampled traces
python main.py drift # Detect drift between time windows
python main.py alerts # Check alert rules

Streamlit dashboard with 5 tabs + A/B testing framework.
| Component | Description |
|---|---|
| src/frontend/app.py | Streamlit app: Overview, Latency, Quality, Alerts & Drift, Trace Browser |
| src/abtest/framework.py | A/B test assignment, analysis, and reporting |
Dashboard Tabs:
- Overview — Total calls, cost, success rate, cache hit rate, calls/cost by model
- Latency — Histogram + box plot by model
- Quality — Heuristic scoring with per-model breakdown
- Alerts & Drift — Live alert evaluation + drift comparison
- Trace Browser — Filterable table of recent traces
A/B Testing:
- Weighted variant assignment
- Quality + latency + cost comparison
- Confidence levels based on sample size
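One way to implement weighted variant assignment is to hash a stable identifier into the unit interval and walk the cumulative weights. The sketch below assumes deterministic, per-user assignment; the function name `assign_variant` and the use of SHA-256 are illustrative choices, not necessarily what `src/abtest/framework.py` does.

```python
import hashlib


def assign_variant(user_id: str, weights: dict[str, float]) -> str:
    """Deterministically map a user to a variant in proportion to the weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Use 48 bits so the value is exactly representable and strictly below 1.0.
    point = int.from_bytes(digest[:6], "big") / 2**48
    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight / total
        if point < cumulative:
            return name
    return name  # fallback for floating-point rounding: last variant
```

Because the assignment depends only on the identifier and the weights, the same user always sees the same variant without any stored state. Changing the weights mid-experiment can reassign users, which is worth keeping in mind when analyzing results.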
cd week-3
python main.py dashboard # Launch Streamlit dashboard
python main.py ab-test # Run demo A/B test analysis

- SQLite for portability: zero-config, embedded database; swap to PostgreSQL for production via the store interface
- Decorator + manual API: @trace_llm_call for automatic tracing, trace_llm_call_manual() for integration with existing pipelines
- Callback-based architecture: trace records flow through registered callbacks (store, export, alerting)
- Heuristic quality scoring: zero-cost quality assessment; LLM-as-judge available for higher accuracy
- Progressive weekly structure: each week copies the previous week's code and extends it
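The callback-based flow can be pictured as a small registry that fans each trace record out to every registered consumer. This is a sketch under assumptions: `CallbackRegistrySketch` and its methods are hypothetical names, and records are plain dictionaries here rather than the project's `TraceRecord`.

```python
from typing import Any, Callable

TraceCallback = Callable[[dict[str, Any]], None]


class CallbackRegistrySketch:
    """Hypothetical fan-out of trace records to registered callbacks."""

    def __init__(self) -> None:
        self._callbacks: list[TraceCallback] = []

    def register(self, callback: TraceCallback) -> None:
        self._callbacks.append(callback)

    def emit(self, record: dict[str, Any]) -> list[Exception]:
        """Deliver the record to every callback; collect errors instead of raising."""
        errors: list[Exception] = []
        for callback in self._callbacks:
            try:
                callback(record)
            except Exception as exc:
                # One failing consumer should not block the others.
                errors.append(exc)
        return errors


# Example consumers: an in-memory "store" and a simple latency alert.
stored: list[dict[str, Any]] = []
alerts: list[str] = []


def alert_on_slow_call(record: dict[str, Any]) -> None:
    if record.get("latency_ms", 0) > 2000:  # illustrative threshold
        alerts.append(f"slow call: {record.get('model')}")


registry = CallbackRegistrySketch()
registry.register(stored.append)
registry.register(alert_on_slow_call)
```

With this shape, adding a new consumer such as an exporter is a single `register` call, and the tracing code never needs to know which consumers exist.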
Built-in pricing for:
- GPT-4o, GPT-4o-mini, GPT-4-turbo
- Claude Sonnet 4, Claude Haiku 3.5
- Mistral Small
Configurable via environment variables or MODEL_COSTS dict.
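A rate table of this kind typically maps a model name to input and output prices, from which the cost of a call follows directly. In the sketch below the dictionary layout, the per-million-token unit, the model names, and every rate are placeholders invented for illustration; they are not the project's `MODEL_COSTS` values or any provider's real pricing.

```python
# Placeholder rates in USD per 1M tokens; NOT real provider pricing.
MODEL_COSTS_SKETCH: dict[str, dict[str, float]] = {
    "example-large": {"input": 2.00, "output": 8.00},
    "example-small": {"input": 0.20, "output": 0.80},
}


def estimate_cost(
    model: str,
    input_tokens: int,
    output_tokens: int,
    rates: dict[str, dict[str, float]] = MODEL_COSTS_SKETCH,
) -> float:
    """Return the estimated cost in USD for one call."""
    if model not in rates:
        raise KeyError(f"no rates configured for model {model!r}")
    rate = rates[model]
    return (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1_000_000
```

Passing a different `rates` dictionary overrides the defaults, which mirrors the idea of making pricing configurable rather than hard-coded.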
# Run any week's tests
cd week-N
python -m pytest tests/ -v

- Python 3.12+
- See each week's requirements.txt
- Streamlit + Plotly for dashboard (Week 3)
- No external services required — all storage is local SQLite