ObserveAI — LLM Monitoring & Evaluation Dashboard

Project 5 of the GenAI Developer Roadmap 2026.

A monitoring and evaluation platform for LLM applications. It traces every LLM call (inputs, outputs, latency, token usage, cost), runs automated quality evaluations, detects drift, flags hallucination risk, triggers alerts, supports A/B testing, and presents the results in a real-time Streamlit dashboard.

Architecture

ObserveAI/
├── week-1/   Tracing & Logging         (99 tests)
├── week-2/   Evaluation & Drift        (167 tests)
└── week-3/   Dashboard & A/B Testing   (191 tests)

Total: 457 tests across 3 weeks.

Week 1 — Tracing & Logging

Transparent tracing layer that wraps any LLM call and records full metadata.

Component	Description
`src/tracing/models.py`	`TraceRecord`, `TokenUsage`, `TraceStatus`, `TraceStats` dataclasses
`src/tracing/decorator.py`	`@trace_llm_call` decorator + `trace_llm_call_manual()` for manual emission
`src/tracing/cost.py`	Per-model cost calculation with configurable rates
`src/tracing/stats.py`	Aggregate statistics: latency percentiles, cost/tokens by model, success rate
`src/storage/sqlite_store.py`	SQLite storage with full CRUD + filtered queries
`src/storage/export.py`	Export to JSONL/CSV, import from JSONL
`src/cache/response_cache.py`	LRU response cache with TTL expiration and hit/miss stats
`src/llm/client.py`	Multi-provider LLM client with automatic tracing and caching

cd week-1
python main.py trace         # Show recent traces
python main.py stats         # Aggregated statistics
python main.py export jsonl  # Export to JSONL
python main.py export csv    # Export to CSV
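
The response cache listed above combines LRU eviction with TTL expiration. A minimal sketch of that combination, assuming a hypothetical `TTLCache` class (the actual interface of `response_cache.py` is not shown in this README), could look like this:

```python
import time
from collections import OrderedDict


class TTLCache:
    """Hypothetical LRU cache with per-entry TTL and hit/miss counters."""

    def __init__(self, max_size=128, ttl_seconds=300.0, clock=time.monotonic):
        self.max_size = max_size
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = OrderedDict()  # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            self.misses += 1
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._entries[key]  # expired entries count as misses
            self.misses += 1
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        self.hits += 1
        return value

    def put(self, key, value):
        self._entries[key] = (self.clock() + self.ttl_seconds, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict least recently used

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

For LLM responses, the cache key would typically be derived from the model name and the prompt, though that choice is an assumption here.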

Week 2 — Automated Evaluation & Drift Detection

Quality scoring, drift detection, and alert rules engine.

Component	Description
`src/evaluation/sampler.py`	Random sampling with model stratification
`src/evaluation/quality.py`	Heuristic quality scoring: relevance, completeness, coherence, hallucination risk
`src/evaluation/drift.py`	Compare baseline vs current: latency, cost, error rate, quality changes
`src/alerts/rules.py`	Configurable alert rules: error rate, latency, cost, cache hit rate thresholds
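
Random sampling with model stratification can be sketched as follows. The function name `stratified_sample`, the dict-shaped traces, and the "at least one per model" rule are assumptions made for the example, not details taken from `sampler.py`.

```python
import random
from collections import defaultdict


def stratified_sample(traces, fraction, seed=None):
    """Sample roughly `fraction` of the traces from each model, at least one per model."""
    rng = random.Random(seed)
    by_model = defaultdict(list)
    for trace in traces:
        by_model[trace["model"]].append(trace)

    sample = []
    for model_traces in by_model.values():
        k = max(1, round(len(model_traces) * fraction))
        sample.extend(rng.sample(model_traces, k))
    return sample


traces = [{"model": "a", "id": i} for i in range(90)]
traces += [{"model": "b", "id": i} for i in range(10)]
sample = stratified_sample(traces, fraction=0.1, seed=42)
```

Sampling per model keeps low-traffic models represented in the evaluation set, which plain random sampling does not guarantee.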

Quality Metrics:

Relevance — keyword overlap between prompt and response
Completeness — response length and structure heuristics
Coherence — sentence structure, punctuation, formatting
Hallucination Risk — overconfidence patterns, fabricated specifics
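
As an example of the heuristic style, the relevance metric (keyword overlap between prompt and response) might be computed roughly like this. The tokenization, the stop-word list, and the function names are hypothetical; the project's actual scoring rules may differ.

```python
import re

# Tiny, hypothetical stop-word list; a real scorer would likely use a larger one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "what", "how"}


def keywords(text):
    """Lowercase word tokens with stop words removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS}


def relevance_score(prompt, response):
    """Fraction of prompt keywords that also appear in the response (0.0 to 1.0)."""
    prompt_kw = keywords(prompt)
    if not prompt_kw:
        return 0.0
    return len(prompt_kw & keywords(response)) / len(prompt_kw)


on_topic = relevance_score("What is the capital of France?", "Paris is the capital of France.")
off_topic = relevance_score("What is the capital of France?", "I like turtles.")
```

A score like this costs nothing to compute but only measures lexical overlap, so a paraphrased yet correct answer can score low.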

Drift Detection: Splits traces into baseline (older half) and current (newer half), compares P50 latency, avg cost per call, error rate, and quality score. Alerts when changes exceed configurable thresholds.
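
The baseline/current split described above can be sketched for a single metric, P50 latency. The function name, the return shape, and the 20% default threshold are assumptions for illustration only.

```python
from statistics import median


def detect_latency_drift(latencies_ms, threshold_pct=20.0):
    """Compare P50 latency of the older half against the newer half.

    `latencies_ms` must be ordered oldest to newest.
    """
    if len(latencies_ms) < 2:
        raise ValueError("need at least two traces to split into two windows")
    mid = len(latencies_ms) // 2
    baseline_p50 = median(latencies_ms[:mid])
    current_p50 = median(latencies_ms[mid:])
    # A zero baseline makes a relative change undefined; report 0.0 in that case.
    change_pct = (current_p50 - baseline_p50) / baseline_p50 * 100 if baseline_p50 else 0.0
    return {
        "baseline_p50": baseline_p50,
        "current_p50": current_p50,
        "change_pct": change_pct,
        "drifted": abs(change_pct) > threshold_pct,
    }


report = detect_latency_drift([100, 110, 90, 100, 150, 160, 140, 150])
```

Here the baseline median is 100 ms and the current median is 150 ms, a 50% increase, so the example is flagged as drifted. The same comparison would apply to average cost, error rate, and quality score.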

cd week-2
python main.py evaluate      # Score quality on sampled traces
python main.py drift         # Detect drift between time windows
python main.py alerts        # Check alert rules
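
A threshold-based alert rule engine can be quite small. The sketch below assumes a hypothetical `AlertRule` dataclass and metric names; the rule format used by `src/alerts/rules.py` is not documented here, and the thresholds are arbitrary.

```python
import operator
from dataclasses import dataclass

COMPARATORS = {">": operator.gt, "<": operator.lt}


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical rule: fire when `metric` crosses `threshold`."""
    name: str
    metric: str
    comparator: str  # ">" or "<"
    threshold: float


def check_rules(rules, metrics):
    """Return the names of rules whose condition holds for the given metrics."""
    fired = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is None:
            continue  # metric not reported; skip rather than guess
        if COMPARATORS[rule.comparator](value, rule.threshold):
            fired.append(rule.name)
    return fired


rules = [
    AlertRule("high_error_rate", "error_rate", ">", 0.05),
    AlertRule("slow_p95", "p95_latency_ms", ">", 2000),
    AlertRule("low_cache_hit_rate", "cache_hit_rate", "<", 0.20),
]
fired = check_rules(rules, {"error_rate": 0.08, "p95_latency_ms": 1200, "cache_hit_rate": 0.10})
```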

Week 3 — Dashboard & A/B Testing

Streamlit dashboard with 5 tabs + A/B testing framework.

Component	Description
`src/frontend/app.py`	Streamlit app: Overview, Latency, Quality, Alerts & Drift, Trace Browser
`src/abtest/framework.py`	A/B test assignment, analysis, and reporting

Dashboard Tabs:

Overview — Total calls, cost, success rate, cache hit rate, calls/cost by model
Latency — Histogram + box plot by model
Quality — Heuristic scoring with per-model breakdown
Alerts & Drift — Live alert evaluation + drift comparison
Trace Browser — Filterable table of recent traces

A/B Testing:

Weighted variant assignment
Quality + latency + cost comparison
Confidence levels based on sample size
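
One common way to implement weighted variant assignment is to hash a stable identifier so the same caller always lands in the same variant. The sketch below takes that approach; whether `src/abtest/framework.py` uses hashing or random draws is not stated, and the function names and sample-size cutoffs are hypothetical.

```python
import hashlib


def assign_variant(unit_id, weights):
    """Deterministically map `unit_id` to a variant in proportion to `weights`."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    digest = hashlib.sha256(unit_id.encode("utf-8")).digest()
    # First 8 bytes as an integer, scaled into the range of the total weight.
    point = int.from_bytes(digest[:8], "big") / 2**64 * total
    cumulative = 0.0
    for variant, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return variant
    return variant  # guard against floating-point rounding at the upper edge


def confidence_label(n_a, n_b, medium_at=30, high_at=100):
    """Label confidence from the smaller arm's sample size (cutoffs are illustrative)."""
    smallest = min(n_a, n_b)
    if smallest >= high_at:
        return "high"
    if smallest >= medium_at:
        return "medium"
    return "low"


variant = assign_variant("user-123", {"control": 0.8, "treatment": 0.2})
```

A sample-size label is a coarse proxy; a significance test on the compared metric would give a stronger basis for picking a winner.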

cd week-3
python main.py dashboard     # Launch Streamlit dashboard
python main.py ab-test       # Run demo A/B test analysis

Key Design Decisions

SQLite for portability — zero-config, embedded database; swap to PostgreSQL for production via the store interface
Decorator + manual API — @trace_llm_call for automatic tracing, trace_llm_call_manual() for integration with existing pipelines
Callback-based architecture — trace records flow through registered callbacks (store, export, alerting)
Heuristic quality scoring — zero-cost quality assessment; LLM-as-judge available for higher accuracy
Progressive weekly structure — each week copies the previous week's code and extends it
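
The callback-based flow mentioned above can be sketched as a small dispatcher that fans each trace record out to every registered sink. `TraceEmitter` and its methods are hypothetical names, and the error handling policy is an assumption.

```python
class TraceEmitter:
    """Hypothetical dispatcher: fans each trace record out to registered callbacks."""

    def __init__(self):
        self._callbacks = []
        self.errors = []

    def register(self, callback):
        self._callbacks.append(callback)

    def emit(self, record):
        for callback in self._callbacks:
            try:
                callback(record)
            except Exception as exc:
                # One failing sink should not block the others.
                self.errors.append((callback, exc))


stored, exported = [], []
emitter = TraceEmitter()
emitter.register(stored.append)  # stand-in for a store callback
emitter.register(lambda rec: exported.append(f"{rec['model']},{rec['latency_ms']}"))  # stand-in for an exporter
emitter.emit({"model": "demo-model", "latency_ms": 120})
```

Adding a new consumer, such as an alerting hook, then means registering one more callback rather than changing the tracing code.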

Cost Tracking

Built-in pricing for:

GPT-4o, GPT-4o-mini, GPT-4-turbo
Claude Sonnet 4, Claude Haiku 3.5
Mistral Small

Configurable via environment variables or MODEL_COSTS dict.
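
Per-model cost calculation usually reduces to token counts multiplied by per-token rates. The sketch below assumes a dict keyed by model name with separate input and output rates per one million tokens; the real layout of `MODEL_COSTS` is not shown in this README, and the model names and rates here are placeholders, not actual provider prices.

```python
# Placeholder rates in USD per 1M tokens. These numbers are made up for the
# example and are NOT real provider prices.
MODEL_COSTS = {
    "example-large": {"input": 2.00, "output": 8.00},
    "example-small": {"input": 0.10, "output": 0.40},
}


def calculate_cost(model, input_tokens, output_tokens, costs=MODEL_COSTS):
    """Return the USD cost of one call, or 0.0 for models without a rate."""
    rates = costs.get(model)
    if rates is None:
        return 0.0
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000


cost = calculate_cost("example-large", input_tokens=1_000, output_tokens=500)
```

With the placeholder rates, 1,000 input tokens and 500 output tokens cost (1,000 × 2.00 + 500 × 8.00) / 1,000,000 = 0.006 USD.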

Testing

# Run any week's tests (replace N with 1, 2, or 3)
cd week-N
python -m pytest tests/ -v

Requirements

Python 3.12+
See each week's requirements.txt
Streamlit + Plotly for dashboard (Week 3)
No external services required — all storage is local SQLite
