aditya2425/GuardianAI

ObserveAI — LLM Monitoring & Evaluation Dashboard

Project 5 of the GenAI Developer Roadmap 2026.

A monitoring and evaluation platform for LLM applications. It traces every LLM call (inputs, outputs, latency, token usage, cost), runs automated quality evaluations, detects drift and hallucinations, triggers alerts, supports A/B testing, and presents everything in a real-time Streamlit dashboard.

Architecture

ObserveAI/
├── week-1/   Tracing & Logging         (99 tests)
├── week-2/   Evaluation & Drift        (167 tests)
└── week-3/   Dashboard & A/B Testing   (191 tests)

Total: 457 tests across 3 weeks.

Week 1 — Tracing & Logging

Transparent tracing layer that wraps any LLM call and records full metadata.
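
To make the idea of a wrapping tracer concrete, here is a minimal sketch of a tracing decorator. It reuses the names `trace_llm_call` and `TraceRecord` from the component list, but the fields, the in-memory `TRACES` sink, and the `fake_llm` function are assumptions made for illustration and may differ from the project's actual code.

```python
import functools
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraceRecord:
    """Simplified, hypothetical stand-in for the project's TraceRecord."""
    function: str
    prompt: str
    response: Optional[str]
    latency_ms: float
    status: str  # "success" or "error"
    error: Optional[str] = None


# In-memory sink used only by this sketch; the project stores traces in SQLite.
TRACES: list = []


def trace_llm_call(func):
    """Record prompt, response, latency, and status for each call to func."""
    @functools.wraps(func)
    def wrapper(prompt, *args, **kwargs):
        start = time.perf_counter()
        try:
            response = func(prompt, *args, **kwargs)
        except Exception as exc:
            latency_ms = (time.perf_counter() - start) * 1000
            TRACES.append(
                TraceRecord(func.__name__, prompt, None, latency_ms, "error", str(exc))
            )
            raise  # the tracer stays transparent: errors still reach the caller
        latency_ms = (time.perf_counter() - start) * 1000
        TRACES.append(
            TraceRecord(func.__name__, prompt, str(response), latency_ms, "success")
        )
        return response
    return wrapper


@trace_llm_call
def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real provider call."""
    return prompt.upper()
```

Calling `fake_llm("hello")` returns the result unchanged and appends one record to `TRACES`; a failing call is recorded with status `"error"` and the exception is re-raised.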

Component | Description
--- | ---
src/tracing/models.py | TraceRecord, TokenUsage, TraceStatus, TraceStats dataclasses
src/tracing/decorator.py | @trace_llm_call decorator + trace_llm_call_manual() for manual emission
src/tracing/cost.py | Per-model cost calculation with configurable rates
src/tracing/stats.py | Aggregate statistics: latency percentiles, cost/tokens by model, success rate
src/storage/sqlite_store.py | SQLite storage with full CRUD + filtered queries
src/storage/export.py | Export to JSONL/CSV, import from JSONL
src/cache/response_cache.py | LRU response cache with TTL expiration and hit/miss stats
src/llm/client.py | Multi-provider LLM client with automatic tracing and caching

cd week-1
python main.py trace         # Show recent traces
python main.py stats         # Aggregated statistics
python main.py export jsonl  # Export to JSONL
python main.py export csv    # Export to CSV
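
The LRU response cache with TTL expiration and hit/miss stats listed among the Week 1 components could look roughly like the sketch below. The class name `ResponseCache`, its parameters, and the injectable `clock` are assumptions for illustration, not the project's actual interface.

```python
import time
from collections import OrderedDict


class ResponseCache:
    """Hypothetical LRU cache with per-entry TTL and hit/miss counters."""

    def __init__(self, max_size=128, ttl_seconds=300.0, clock=time.monotonic):
        self._data = OrderedDict()  # key -> (expires_at, value), oldest first
        self.max_size = max_size
        self.ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self.hits = 0
        self.misses = 0

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            self.misses += 1
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired entries count as misses and are dropped.
            del self._data[key]
            self.misses += 1
            return None
        self._data.move_to_end(key)  # mark as most recently used
        self.hits += 1
        return value

    def put(self, key, value):
        self._data[key] = (self._clock() + self.ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict the least recently used entry
```

A natural cache key would be a hash of the model name and prompt, although how the project derives its keys is not stated here.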

Week 2 — Automated Evaluation & Drift Detection

Quality scoring, drift detection, and alert rules engine.

Component | Description
--- | ---
src/evaluation/sampler.py | Random sampling with model stratification
src/evaluation/quality.py | Heuristic quality scoring: relevance, completeness, coherence, hallucination risk
src/evaluation/drift.py | Compare baseline vs current: latency, cost, error rate, quality changes
src/alerts/rules.py | Configurable alert rules: error rate, latency, cost, cache rate thresholds

Quality Metrics:

  • Relevance — keyword overlap between prompt and response
  • Completeness — response length and structure heuristics
  • Coherence — sentence structure, punctuation, formatting
  • Hallucination Risk — overconfidence patterns, fabricated specifics
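
As an illustration of the relevance metric above, the following sketch scores keyword overlap as the fraction of prompt keywords that reappear in the response. The stopword list, the minimum word length, and the function names are assumptions; the project's heuristic may weight things differently.

```python
import re

# Small illustrative stopword list (an assumption, not the project's list).
STOPWORDS = {"the", "and", "what", "for", "with", "that", "this", "are", "was"}


def keywords(text: str) -> set:
    """Lowercase alphanumeric tokens longer than two characters, minus stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if len(t) > 2 and t not in STOPWORDS}


def relevance_score(prompt: str, response: str) -> float:
    """Fraction of prompt keywords that also appear in the response (0.0 to 1.0)."""
    prompt_kw = keywords(prompt)
    if not prompt_kw:
        return 0.0  # nothing to match against
    return len(prompt_kw & keywords(response)) / len(prompt_kw)
```

A score like this costs nothing to compute but only captures lexical overlap, so a paraphrased yet correct answer can score low.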

Drift Detection: Splits traces chronologically into a baseline (older half) and a current window (newer half), then compares P50 latency, average cost per call, error rate, and quality score. An alert fires when a change exceeds its configurable threshold.
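
The baseline-versus-current comparison can be sketched for a single metric, P50 latency, as shown below. The trace dictionary keys (`timestamp`, `latency_ms`) and the 25% default threshold are assumptions chosen for the example.

```python
from statistics import median


def split_baseline_current(traces):
    """Sort by timestamp and split into older half (baseline) and newer half (current)."""
    ordered = sorted(traces, key=lambda t: t["timestamp"])
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]


def relative_change(baseline: float, current: float) -> float:
    """Relative change from baseline to current; handles a zero baseline."""
    if baseline == 0:
        return 0.0 if current == 0 else float("inf")
    return (current - baseline) / baseline


def detect_latency_drift(traces, threshold=0.25):
    """Flag drift when P50 latency changes by more than `threshold` (relative)."""
    baseline, current = split_baseline_current(traces)
    if not baseline or not current:
        return {"drifted": False, "change": 0.0}
    baseline_p50 = median(t["latency_ms"] for t in baseline)
    current_p50 = median(t["latency_ms"] for t in current)
    change = relative_change(baseline_p50, current_p50)
    return {
        "baseline_p50": baseline_p50,
        "current_p50": current_p50,
        "change": change,
        "drifted": abs(change) > threshold,
    }
```

The same pattern would apply to average cost, error rate, and quality score by swapping the aggregation function.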

cd week-2
python main.py evaluate      # Score quality on sampled traces
python main.py drift         # Detect drift between time windows
python main.py alerts        # Check alert rules
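
A configurable alert rule can be modelled as a metric name, a threshold, and a comparison direction. The sketch below assumes hypothetical names (`AlertRule`, `evaluate_rules`) and metric keys; it is not the project's rules engine.

```python
import operator
from dataclasses import dataclass

# Supported comparison directions for a rule.
_OPS = {">": operator.gt, "<": operator.lt}


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical rule: fire when `metric` compares against `threshold`."""
    name: str
    metric: str
    threshold: float
    comparison: str = ">"  # ">" fires when above the threshold, "<" when below


def evaluate_rules(rules, metrics):
    """Return a message for every rule whose condition holds for `metrics`."""
    fired = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is None:
            continue  # metric not available in this window, so skip the rule
        if _OPS[rule.comparison](value, rule.threshold):
            fired.append(
                f"{rule.name}: {rule.metric}={value} {rule.comparison} {rule.threshold}"
            )
    return fired
```

For example, an error-rate rule would use `">"` while a cache hit rate rule would use `"<"`, since a low hit rate is the condition worth flagging.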

Week 3 — Dashboard & A/B Testing

Streamlit dashboard with 5 tabs + A/B testing framework.

Component | Description
--- | ---
src/frontend/app.py | Streamlit app: Overview, Latency, Quality, Alerts & Drift, Trace Browser
src/abtest/framework.py | A/B test assignment, analysis, and reporting

Dashboard Tabs:

  1. Overview — Total calls, cost, success rate, cache hit rate, calls/cost by model
  2. Latency — Histogram + box plot by model
  3. Quality — Heuristic scoring with per-model breakdown
  4. Alerts & Drift — Live alert evaluation + drift comparison
  5. Trace Browser — Filterable table of recent traces

A/B Testing:

  • Weighted variant assignment
  • Quality + latency + cost comparison
  • Confidence levels based on sample size

cd week-3
python main.py dashboard     # Launch Streamlit dashboard
python main.py ab-test       # Run demo A/B test analysis
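
One way to implement weighted variant assignment and sample-size-based confidence levels is sketched below. The hash-based (sticky) assignment and the sample-size cut-offs of 30 and 100 are assumptions for illustration; the project may instead assign randomly or use different thresholds.

```python
import hashlib


def assign_variant(user_id: str, weights: dict) -> str:
    """Deterministically map user_id to a variant, proportionally to its weight."""
    total = sum(weights.values())
    if not weights or total <= 0:
        raise ValueError("weights must contain at least one positive weight")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a point in [0, total).
    point = int.from_bytes(digest[:8], "big") / 2**64 * total
    cumulative = 0.0
    chosen = None
    for name, weight in weights.items():
        cumulative += weight
        chosen = name
        if point < cumulative:
            break
    return chosen  # the last variant also covers floating-point rounding


def confidence_label(samples_per_variant: int) -> str:
    """Coarse confidence level from sample size (cut-offs are assumed)."""
    if samples_per_variant >= 100:
        return "high"
    if samples_per_variant >= 30:
        return "medium"
    return "low"
```

Hashing the identifier keeps each user in the same variant across calls without storing assignments, which makes later quality, latency, and cost comparisons easier to attribute.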

Key Design Decisions

  • SQLite for portability — zero-config, embedded database; swap to PostgreSQL for production via the store interface
  • Decorator + manual API — @trace_llm_call for automatic tracing, trace_llm_call_manual() for integration with existing pipelines
  • Callback-based architecture — trace records flow through registered callbacks (store, export, alerting)
  • Heuristic quality scoring — zero-cost quality assessment; LLM-as-judge available for higher accuracy
  • Progressive weekly structure — each week's directory copies the previous week's code and extends it
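
The callback-based flow mentioned above can be sketched as a small dispatcher that fans each trace record out to registered handlers. `TraceDispatcher` and its methods are hypothetical names; the design choice shown here (one failing callback does not block the others) is also an assumption.

```python
from typing import Callable


class TraceDispatcher:
    """Hypothetical fan-out of trace records to registered callbacks."""

    def __init__(self):
        self._callbacks: list = []

    def register(self, callback: Callable[[dict], None]):
        """Add a callback; returns it so this can also be used as a decorator."""
        self._callbacks.append(callback)
        return callback

    def emit(self, record: dict) -> list:
        """Send the record to every callback and collect any exceptions raised."""
        errors = []
        for callback in self._callbacks:
            try:
                callback(record)
            except Exception as exc:
                # Isolate failures so storage still runs if alerting breaks.
                errors.append(exc)
        return errors
```

With this shape, storage, export, and alerting can each be a plain function registered once at startup.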

Cost Tracking

Built-in pricing for:

  • GPT-4o, GPT-4o-mini, GPT-4-turbo
  • Claude Sonnet 4, Claude Haiku 3.5
  • Mistral Small

Configurable via environment variables or the MODEL_COSTS dict.
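
The sketch below shows how a `MODEL_COSTS` dict with environment-variable overrides might be used to price a call. The rates are placeholders rather than real prices, and the model name `example-model`, the per-million-token unit, and the `COST_<MODEL>_<INPUT|OUTPUT>` variable naming are all assumptions.

```python
import os

# Placeholder rates in USD per 1M tokens. These are NOT real prices.
MODEL_COSTS = {
    "example-model": {"input": 1.00, "output": 2.00},
}


def get_rates(model: str) -> dict:
    """Look up rates for a model, letting environment variables override them."""
    rates = dict(MODEL_COSTS.get(model, {"input": 0.0, "output": 0.0}))
    prefix = "COST_" + model.upper().replace("-", "_")  # assumed naming scheme
    for kind in ("input", "output"):
        override = os.environ.get(f"{prefix}_{kind.upper()}")
        if override is not None:
            rates[kind] = float(override)
    return rates


def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one call, given token counts and per-1M-token rates."""
    rates = get_rates(model)
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000
```

In this sketch, unknown models are priced at zero so that tracing never fails on a missing rate; raising an error instead would be the stricter alternative.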

Testing

# Run any week's tests
cd week-N
python -m pytest tests/ -v

Requirements

  • Python 3.12+
  • See each week's requirements.txt
  • Streamlit + Plotly for dashboard (Week 3)
  • No external services required — all storage is local SQLite

About

LLM safety & observability: content filtering, PII detection, token tracking, cost analytics & real-time Streamlit dashboard
