A self-learning, multi-agent debt collections system built for "Riverline" (the regulated-collections company in the problem statement). Three AI agents (two chat, one voice) operate behind a single continuous borrower experience, orchestrated by Temporal. Each agent autonomously improves its prompt via a GEPA-driven self-learning loop with a Darwin–Gödel meta-evaluation layer.
Status: planning complete; implementation begins Day 1. See `docs/` for decision rationale.
A borrower has defaulted on a loan. The system works through up to three stages, each handled by its own agent:
- Agent 1 (Assessment, chat) — cold and clinical. Verifies identity, gathers financial situation.
- Agent 2 (Resolution, voice call) — transactional dealmaker. Calls the borrower on the phone, negotiates settlement.
- Agent 3 (Final Notice, chat) — closer. States consequences, makes one last offer with a hard expiry.
The borrower never feels the handoff between stages or modalities.
flowchart TB
subgraph "Borrower-facing"
UI[Next.js Chat UI]
Phone[Phone — Twilio]
end
subgraph "API"
FAPI[FastAPI]
end
subgraph "Temporal"
WF[CollectionsWorkflow]
ACT1[Assessment Activity]
ACT2[Resolution Activity]
ACT3[Final Notice Activity]
SUM[Handoff Summarizer]
end
subgraph "Agents"
A1[Agent 1<br/>Sonnet 4.6 + DSPy]
A2[Agent 2<br/>Pipecat + Sonnet 4.6]
A3[Agent 3<br/>Sonnet 4.6 + DSPy]
end
subgraph "Storage"
PG[(Postgres)]
Redis[(Redis)]
end
subgraph "Self-Learning"
GEPA[GEPA Optimizer]
Judge[Multi-rubric Judge]
Meta[Meta-Evaluator]
end
UI --> FAPI --> WF
Phone --> FAPI --> WF
WF --> ACT1 --> A1
WF --> ACT2 --> A2
WF --> ACT3 --> A3
ACT1 --> SUM
SUM --> ACT2
ACT2 --> SUM
SUM --> ACT3
A1 --> PG
A2 --> Redis
A3 --> PG
GEPA --> Judge --> Meta --> PG
flowchart TD
A[Borrower enters post-default pipeline] --> B[Agent 1: Assessment chat]
B -->|situation assessed| C[Agent 2: Resolution voice call]
B -->|no response| D{Retry? max 3}
D -->|yes| B
D -->|exhausted| C
C -->|deal agreed| E[EXIT: Log agreement]
C -->|no deal| F[Agent 3: Final Notice chat]
F -->|resolved| G[EXIT: Log resolution]
F -->|no resolution| H[EXIT: Flag for legal/write-off]
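The routing in the flowchart above can be sketched as a pure transition function. This is a minimal illustration under stated assumptions, not the project's `CollectionsWorkflow`: the `Stage` and `Outcome` enums and the `next_stage` helper are hypothetical names, and only the edges and the retry limit of 3 come from the diagram.

```python
from enum import Enum


class Stage(Enum):
    ASSESSMENT = "agent_1_assessment_chat"
    RESOLUTION = "agent_2_resolution_voice"
    FINAL_NOTICE = "agent_3_final_notice_chat"
    EXIT_AGREEMENT = "exit_log_agreement"
    EXIT_RESOLUTION = "exit_log_resolution"
    EXIT_LEGAL = "exit_flag_legal_or_write_off"


class Outcome(Enum):
    ASSESSED = "situation assessed"
    NO_RESPONSE = "no response"
    DEAL_AGREED = "deal agreed"
    NO_DEAL = "no deal"
    RESOLVED = "resolved"
    NO_RESOLUTION = "no resolution"


MAX_ASSESSMENT_RETRIES = 3  # "Retry? max 3" in the flowchart


def next_stage(stage: Stage, outcome: Outcome, retries_used: int = 0) -> Stage:
    """Return the stage that follows `stage` given `outcome` (hypothetical helper)."""
    if stage is Stage.ASSESSMENT:
        if outcome is Outcome.ASSESSED:
            return Stage.RESOLUTION
        if outcome is Outcome.NO_RESPONSE:
            # Retry the chat until the retry budget is spent, then escalate to voice.
            if retries_used < MAX_ASSESSMENT_RETRIES:
                return Stage.ASSESSMENT
            return Stage.RESOLUTION
    elif stage is Stage.RESOLUTION:
        if outcome is Outcome.DEAL_AGREED:
            return Stage.EXIT_AGREEMENT
        if outcome is Outcome.NO_DEAL:
            return Stage.FINAL_NOTICE
    elif stage is Stage.FINAL_NOTICE:
        if outcome is Outcome.RESOLVED:
            return Stage.EXIT_RESOLUTION
        if outcome is Outcome.NO_RESOLUTION:
            return Stage.EXIT_LEGAL
    raise ValueError(f"no transition from {stage.name} on {outcome.name}")
```

Keeping the routing free of I/O makes it easy to unit-test separately from the orchestration layer; in a Temporal workflow the same decisions would sit between activity calls.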
sequenceDiagram
participant B as Borrower
participant A1 as Agent 1 (chat)
participant S as Summarizer (≤500 tok)
participant A2 as Agent 2 (voice)
participant A3 as Agent 3 (chat)
B->>A1: chat conversation
A1->>S: full transcript
S->>A2: HandoffContext (≤500 tok)
Note over A2: Phone call<br/>Live transcript extraction<br/>Offers + objections → Redis
A2->>S: voice transcript + structured events
S->>A3: HandoffContext (≤500 tok, both stages)
A3->>B: "On our call earlier, you mentioned..."
Each agent operates under a hard cap:
- 2000 tokens total per agent (system prompt + handoff context + conversation)
- ≤500 tokens of that for handoff context from prior stages
- Enforced in code by `apps/worker/context_budget.py` using `tiktoken`. See `tests/test_context_budget.py` for the verifying test.
| Agent | System prompt + conversation budget (tokens) | Handoff context (tokens) | Total cap (tokens) |
|---|---|---|---|
| Agent 1 | 2000 | 0 | 2000 |
| Agent 2 | 1500 | ≤500 (Agent 1 summary) | 2000 |
| Agent 3 | 1500 | ≤500 (Agent 1+2 summary) | 2000 |
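The two caps above (2000 tokens total, at most 500 of them for handoff context) can be checked with a few lines of Python. This is a sketch, not the contents of `apps/worker/context_budget.py`: `check_budget`, `ContextBudgetExceeded`, and the whitespace counter are hypothetical, and the counter is only a stand-in so the example runs without dependencies.

```python
from typing import Callable

TOTAL_CAP = 2000   # system prompt + handoff context + conversation
HANDOFF_CAP = 500  # handoff context from prior stages


class ContextBudgetExceeded(ValueError):
    """Raised when a prompt would exceed one of the hard caps."""


def whitespace_token_count(text: str) -> int:
    """Stand-in counter (NOT a real tokenizer); swap in a tiktoken-based one."""
    return len(text.split())


def check_budget(
    system_prompt: str,
    handoff: str,
    conversation: list[str],
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> dict[str, int]:
    """Return per-part token usage, or raise if either cap is exceeded."""
    usage = {
        "system_prompt": count_tokens(system_prompt),
        "handoff": count_tokens(handoff),
        "conversation": sum(count_tokens(turn) for turn in conversation),
    }
    usage["total"] = sum(usage.values())  # right-hand side sums the three parts only
    if usage["handoff"] > HANDOFF_CAP:
        raise ContextBudgetExceeded(
            f"handoff uses {usage['handoff']} tokens, cap is {HANDOFF_CAP}"
        )
    if usage["total"] > TOTAL_CAP:
        raise ContextBudgetExceeded(
            f"context uses {usage['total']} tokens, cap is {TOTAL_CAP}"
        )
    return usage
```

With `tiktoken` installed, a counter such as `lambda s: len(tiktoken.get_encoding("cl100k_base").encode(s))` can be passed as `count_tokens`. The encoding name is an assumption here; `cl100k_base` is an OpenAI encoding and only approximates token counts for other model families.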
flowchart LR
subgraph "Each Iteration"
Personas[5 personas:<br/>cooperative, combative,<br/>evasive, confused, distressed]
Sims[Borrower sim<br/>Haiku 4.5]
Run[90 paired trials<br/>5 personas × 6 conv × 3 seeds]
Judge[3-rubric judge<br/>Haiku 4.5]
Stats[Hierarchical bootstrap<br/>+ McNemar + Wilcoxon]
Decide{All gates pass?}
Mutate[GEPA reflective mutation<br/>Opus 4.7]
end
Personas --> Sims --> Run --> Judge --> Stats --> Decide
Decide -->|yes| Champion[New champion prompt]
Decide -->|no| Mutate --> Run
subgraph "Meta-Eval (DGM)"
Disagree[Inter-rubric<br/>Pearson r < 0.7?]
Reflect[Opus reflects on<br/>disagreement examples]
UpdateRubric[Update rubric set]
end
Judge --> Disagree --> Reflect --> UpdateRubric --> Judge
See [docs/architecture/03-learning-loop.md](docs/architecture/03-learning-loop.md) and [docs/architecture/04-eval-methodology.md](docs/architecture/04-eval-methodology.md) for the statistical gate, meta-evaluation mechanism, and budget breakdown.
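As a rough illustration of two of the tests named in the loop diagram, the sketch below computes a hierarchical (cluster) bootstrap confidence interval on paired score differences and an exact two-sided McNemar p-value. It uses only the standard library. Clustering by persona, the function names, and the resampling parameters are assumptions, and the Wilcoxon test and the actual gate thresholds are left to the linked documents.

```python
import random
from math import comb


def hierarchical_bootstrap_ci(
    diffs_by_cluster: dict[str, list[float]],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile CI for the mean paired difference (candidate minus champion).

    Resamples clusters with replacement, then trials within each drawn cluster,
    so correlation between trials of the same cluster is respected.
    """
    rng = random.Random(seed)
    clusters = list(diffs_by_cluster)
    means = []
    for _ in range(n_boot):
        sample: list[float] = []
        for name in rng.choices(clusters, k=len(clusters)):
            diffs = diffs_by_cluster[name]
            sample.extend(rng.choices(diffs, k=len(diffs)))
        means.append(sum(sample) / len(sample))
    means.sort()
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def mcnemar_exact_p(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value from the discordant pair counts.

    b: trials where only the champion succeeded; c: only the candidate succeeded.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2**n
    return min(1.0, 2 * tail)
```

One plausible gate condition, stated as an assumption rather than the project's rule, is to require the lower bound of the interval to be above zero and the McNemar p-value to be below a chosen significance level.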
| Concern | Choice | Why |
|---|---|---|
| Language | Python 3.12 | Temporal + Pipecat + DSPy all Python-first |
| Orchestration | Temporal | Required by spec |
| Chat agents | Sonnet 4.6 + DSPy | Tone control + clean abstraction for prompt swap |
| Voice framework | Pipecat | Mid-call transcript instrumentation |
| Voice STT | Deepgram Nova-3 | Sub-300ms latency, strong on phone audio |
| Voice TTS | Rime Mist v2 | Sociolinguistics-trained on real call-center audio |
| Telephony | Twilio | Reliable Pipecat SIP integration |
| Self-learning | DSPy + GEPA | Up to 35× fewer rollouts than RL (GRPO), as reported by the GEPA authors |
| Judge | Haiku 4.5 with 3-rubric setup | Cost-efficient, multi-rubric for meta-eval |
| Mutation | Opus 4.7 | Frontier reasoning for prompt rewriting |
| Statistics | Hierarchical bootstrap + McNemar + Wilcoxon | 2026 best practice for clustered LLM eval |
| Compliance | Two-layer (rules + LLM judge) + Garak/PyRIT red-team | CFPB-aligned, neurosymbolic |
| Chat UI | Next.js | Demo polish |
| State | Postgres + Redis | Persistent state + live transcript stream |
Decision rationale lives in docs/:
- [docs/architecture/01-llm-tiering.md](docs/architecture/01-llm-tiering.md): Anthropic Haiku/Sonnet/Opus tiering
- [docs/architecture/02-voice-stack.md](docs/architecture/02-voice-stack.md): Pipecat over hosted alternatives
- [docs/architecture/03-learning-loop.md](docs/architecture/03-learning-loop.md): DSPy + GEPA + meta-evaluation
- [docs/architecture/04-eval-methodology.md](docs/architecture/04-eval-methodology.md): hierarchical bootstrap + paired tests
- [docs/architecture/05-stt-tts.md](docs/architecture/05-stt-tts.md): Deepgram + Rime
- [docs/architecture/06-compliance.md](docs/architecture/06-compliance.md): two-layer FDCPA-aligned guardrails
- [docs/architecture/07-decision-agnosticism.md](docs/architecture/07-decision-agnosticism.md): swap-cost matrix; how hard each decision would be to change
- [docs/engineering-log/README.md](docs/engineering-log/README.md): chronological log of every non-trivial bug hit and how it was fixed (Twilio 15s timeout, premature DEAL_AGREED, Pipecat 1.1 import rename, Indian-carrier rate limiting, etc.)
git clone git@github.com:teetangh/defaultline.git
cd defaultline
cp .env.example .env # fill in ANTHROPIC_API_KEY at minimum
pip install -e '.[learning,dev]' # core deps for learning loop (no voice yet)

The eval harness runs the full chat→voice→chat pipeline in-process using the mock voice provider, so the entire learning loop is reproducible without Twilio/Pipecat:

bash scripts/run_eval.sh 42 # seed=42, default 2 cycles. Adjust CYCLES env var.

Outputs:

- `data/eval_runs/<cycle_id>.json`: per-trial scores (90 rows per cycle)
- `data/eval_runs/decisions.jsonl`: promotion gate evidence
- `data/eval_runs/cost_log.jsonl`: every LLM call with token counts
- `data/eval_runs/meta_log.jsonl`: meta-evaluator triggers + proposals

docker compose up
# Temporal UI: http://localhost:8080
# API: http://localhost:8000/health

Trigger one workflow:

python scripts/trigger_conversation.py --borrower-id demo-001 \
  --account-last-four 1234 --amount-owed 12500

To switch from the mock voice provider to the real one, set `settings.voice.provider: pipecat` plus the voice env vars listed in `.env.example`. See `apps/voice/pipecat_provider.py` for the pipeline.

- Seeds: every test conversation uses an explicit seed in `data/seeds/`
- Single-command rerun: `bash scripts/run_eval.sh --seed 42`
- Raw data: `data/eval_runs/*.json` contains per-conversation scores
- Reported numbers in `docs/deliverables/evolution-report.md` will match a rerun within ±2% absolute (LLM nondeterminism bounded with `temperature=0` and seed control where supported)
After running the full self-learning loop:
python scripts/cost_report.py

Outputs total spend with per-model and per-activity breakdown. Target: under $20 total.

All three agents preserve compliance after every prompt update. Enforcement has two layers:
- Rule-based (`learning/compliance.py`): regex + structured checks for hard violations
- LLM-based: judge rubric scores nuanced violations
Any compliance violation is a hard veto in the promotion gate — the prompt is rejected regardless of resolution rate.
See [docs/architecture/06-compliance.md](docs/architecture/06-compliance.md).
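A minimal sketch of how the two layers and the hard veto might compose is shown below. It is not the contents of `learning/compliance.py`: the regex patterns are illustrative placeholders rather than a real rule set or legal guidance, and `GateInput`, `rule_violations`, and `should_promote` are hypothetical names.

```python
import re
from dataclasses import dataclass

# Illustrative patterns only (assumption): a real rule set would be far broader.
HARD_VIOLATION_PATTERNS = {
    "threat_of_arrest": re.compile(r"\b(arrest(ed)?|jail|prison)\b", re.IGNORECASE),
    "threat_of_harm": re.compile(r"\b(hurt|harm)\s+you\b", re.IGNORECASE),
}


@dataclass(frozen=True)
class GateInput:
    transcripts: list[str]       # agent-side text from the evaluation trials
    judge_compliance_ok: bool    # layer 2: verdict from the LLM judge rubric
    stats_gates_pass: bool       # outcome of the statistical promotion gates


def rule_violations(text: str) -> list[str]:
    """Layer 1: names of the hard-violation patterns that match `text`."""
    return [name for name, pattern in HARD_VIOLATION_PATTERNS.items() if pattern.search(text)]


def should_promote(gate: GateInput) -> bool:
    """Promote only if both compliance layers are clean AND the stats gates pass."""
    for transcript in gate.transcripts:
        if rule_violations(transcript):
            return False  # hard veto, regardless of resolution rate
    if not gate.judge_compliance_ok:
        return False  # hard veto from the LLM judge
    return gate.stats_gates_pass
```

Checking compliance before the statistical result keeps the veto unconditional: a candidate prompt with a better resolution rate is still rejected if any trial trips a rule.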
(To be filled in during the technical writeup pass — what doesn't work well, what we'd improve with more time.)
[docs/deliverables/decision-journal.md](docs/deliverables/decision-journal.md) — handwritten by author, not LLM-generated. Required by the assignment.