defaultline

A self-learning, multi-agent debt collections system built for "Riverline" (the regulated-collections company in the problem statement). Three AI agents (two chat, one voice) operate behind a single continuous borrower experience, orchestrated by Temporal. Each agent autonomously improves its prompt via a GEPA-driven self-learning loop with a Darwin–Gödel meta-evaluation layer.

Status: planning complete; implementation begins Day 1. See docs/ for decision rationale.

What it does

A borrower has defaulted on a loan. The system:

  1. Agent 1 (Assessment, chat) — cold and clinical. Verifies identity, gathers financial situation.
  2. Agent 2 (Resolution, voice call) — transactional dealmaker. Calls the borrower on the phone, negotiates settlement.
  3. Agent 3 (Final Notice, chat) — closer. States consequences, makes one last offer with a hard expiry.

The borrower never feels the handoff between stages or modalities.

Architecture

flowchart TB
    subgraph "Borrower-facing"
        UI[Next.js Chat UI]
        Phone[Phone — Twilio]
    end
    subgraph "API"
        FAPI[FastAPI]
    end
    subgraph "Temporal"
        WF[CollectionsWorkflow]
        ACT1[Assessment Activity]
        ACT2[Resolution Activity]
        ACT3[Final Notice Activity]
        SUM[Handoff Summarizer]
    end
    subgraph "Agents"
        A1[Agent 1<br/>Sonnet 4.6 + DSPy]
        A2[Agent 2<br/>Pipecat + Sonnet 4.6]
        A3[Agent 3<br/>Sonnet 4.6 + DSPy]
    end
    subgraph "Storage"
        PG[(Postgres)]
        Redis[(Redis)]
    end
    subgraph "Self-Learning"
        GEPA[GEPA Optimizer]
        Judge[Multi-rubric Judge]
        Meta[Meta-Evaluator]
    end

    UI --> FAPI --> WF
    Phone --> FAPI --> WF
    WF --> ACT1 --> A1
    WF --> ACT2 --> A2
    WF --> ACT3 --> A3
    ACT1 --> SUM
    SUM --> ACT2
    ACT2 --> SUM
    SUM --> ACT3
    A1 --> PG
    A2 --> Redis
    A3 --> PG
    GEPA --> Judge --> Meta --> PG

Pipeline flow

flowchart TD
    A[Borrower enters post-default pipeline] --> B[Agent 1: Assessment chat]
    B -->|situation assessed| C[Agent 2: Resolution voice call]
    B -->|no response| D{Retry? max 3}
    D -->|yes| B
    D -->|exhausted| C
    C -->|deal agreed| E[EXIT: Log agreement]
    C -->|no deal| F[Agent 3: Final Notice chat]
    F -->|resolved| G[EXIT: Log resolution]
    F -->|no resolution| H[EXIT: Flag for legal/write-off]
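
The routing in the flowchart above can be read as a small state machine. The following is a minimal sketch under stated assumptions: the `Stage` names, the outcome strings, and the meaning of `retries_used` (retries already spent, with three allowed) are illustrative and are not taken from the repository's workflow code.

```python
from enum import Enum


class Stage(Enum):
    ASSESSMENT = "agent_1_assessment_chat"
    RESOLUTION = "agent_2_resolution_voice"
    FINAL_NOTICE = "agent_3_final_notice_chat"
    EXIT_AGREEMENT = "exit_log_agreement"
    EXIT_RESOLUTION = "exit_log_resolution"
    EXIT_LEGAL = "exit_flag_for_legal_or_write_off"


MAX_ASSESSMENT_RETRIES = 3  # "Retry? max 3" in the diagram

# Edges of the flowchart, keyed by (current stage, outcome label).
# Outcome labels are hypothetical names that mirror the edge text.
TRANSITIONS = {
    (Stage.ASSESSMENT, "situation_assessed"): Stage.RESOLUTION,
    (Stage.RESOLUTION, "deal_agreed"): Stage.EXIT_AGREEMENT,
    (Stage.RESOLUTION, "no_deal"): Stage.FINAL_NOTICE,
    (Stage.FINAL_NOTICE, "resolved"): Stage.EXIT_RESOLUTION,
    (Stage.FINAL_NOTICE, "no_resolution"): Stage.EXIT_LEGAL,
}


def next_stage(stage: Stage, outcome: str, retries_used: int = 0) -> Stage:
    """Return the stage that follows `stage` for the given outcome."""
    if stage is Stage.ASSESSMENT and outcome == "no_response":
        # Retry the chat until the retry budget is spent, then move on to voice.
        if retries_used < MAX_ASSESSMENT_RETRIES:
            return Stage.ASSESSMENT
        return Stage.RESOLUTION
    try:
        return TRANSITIONS[(stage, outcome)]
    except KeyError:
        raise ValueError(f"no edge for {stage.name} with outcome {outcome!r}") from None
```

Keeping the edges in one table makes it easy to compare the code against the diagram line by line.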

Cross-modal handoff

sequenceDiagram
    participant B as Borrower
    participant A1 as Agent 1 (chat)
    participant S as Summarizer (≤500 tok)
    participant A2 as Agent 2 (voice)
    participant A3 as Agent 3 (chat)

    B->>A1: chat conversation
    A1->>S: full transcript
    S->>A2: HandoffContext (≤500 tok)
    Note over A2: Phone call<br/>Live transcript extraction<br/>Offers + objections → Redis
    A2->>S: voice transcript + structured events
    S->>A3: HandoffContext (≤500 tok, both stages)
    A3->>B: "On our call earlier, you mentioned..."
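
One way to picture the `HandoffContext` in the sequence above is as a small record that is rendered to text and trimmed until it fits the 500-token cap. This is only a sketch: the field names, the trimming order, and the word-count tokenizer are assumptions made for illustration, not the repository's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class HandoffContext:
    """Hypothetical shape of the summary passed between stages."""
    stage_summaries: list[str] = field(default_factory=list)  # one per completed stage
    offers: list[str] = field(default_factory=list)
    objections: list[str] = field(default_factory=list)

    def render(self) -> str:
        lines = [f"Summary: {s}" for s in self.stage_summaries]
        lines += [f"Offer: {o}" for o in self.offers]
        lines += [f"Objection: {o}" for o in self.objections]
        return "\n".join(lines)


def fit_to_budget(ctx, max_tokens=500, count_tokens=lambda text: len(text.split())):
    """Return a copy of ctx trimmed to max_tokens.

    Drops the oldest objections first, then the oldest offers. The default
    counter is a whitespace word count standing in for a real tokenizer.
    """
    trimmed = HandoffContext(
        list(ctx.stage_summaries), list(ctx.offers), list(ctx.objections)
    )
    while count_tokens(trimmed.render()) > max_tokens:
        if trimmed.objections:
            trimmed.objections.pop(0)
        elif trimmed.offers:
            trimmed.offers.pop(0)
        else:
            raise ValueError("stage summaries alone exceed the handoff budget")
    return trimmed
```

Trimming a copy leaves the full record available for logging while the next agent only sees the bounded version.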

Token budget enforcement

Each agent operates under a hard cap:

  • 2000 tokens total per agent (system prompt + handoff context + conversation)
  • ≤500 tokens of that for handoff context from prior stages
  • Enforced in code by apps/worker/context_budget.py using tiktoken. See tests/test_context_budget.py for the verifying test.

| Agent | System prompt budget (tokens) | Handoff context (tokens) | Available for conversation |
|---|---|---|---|
| Agent 1 | 2000 | 0 | inside the 2000 |
| Agent 2 | 1500 | ≤500 (Agent 1 summary) | inside the 2000 |
| Agent 3 | 1500 | ≤500 (Agent 1+2 summary) | inside the 2000 |
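
The two caps above can be checked with a few lines of code. The sketch below is not the contents of apps/worker/context_budget.py; `check_budget`, `BudgetExceeded`, and the word-count stand-in tokenizer are hypothetical names chosen so the example runs without extra dependencies.

```python
TOTAL_BUDGET = 2000    # hard cap per agent: system prompt + handoff + conversation
HANDOFF_BUDGET = 500   # cap on handoff context inside that total


class BudgetExceeded(ValueError):
    """Raised when a prompt would exceed one of the caps."""


def approx_token_count(text: str) -> int:
    # Stand-in counter (whitespace-separated words); swap in a real tokenizer.
    return len(text.split())


def check_budget(system_prompt, handoff, conversation, count_tokens=approx_token_count):
    """Validate both caps and return the token accounting for one agent call."""
    handoff_tokens = count_tokens(handoff)
    total = count_tokens(system_prompt) + handoff_tokens + count_tokens(conversation)
    if handoff_tokens > HANDOFF_BUDGET:
        raise BudgetExceeded(f"handoff uses {handoff_tokens} > {HANDOFF_BUDGET} tokens")
    if total > TOTAL_BUDGET:
        raise BudgetExceeded(f"prompt uses {total} > {TOTAL_BUDGET} tokens")
    return {"handoff": handoff_tokens, "total": total, "remaining": TOTAL_BUDGET - total}
```

With tiktoken installed, a counter such as `lambda s: len(tiktoken.get_encoding("cl100k_base").encode(s))` could be passed as `count_tokens`; which encoding the repository actually uses is not stated here.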

Self-learning loop

flowchart LR
    subgraph "Each Iteration"
        Personas[5 personas:<br/>cooperative, combative,<br/>evasive, confused, distressed]
        Sims[Borrower sim<br/>Haiku 4.5]
        Run[90 paired trials<br/>5 personas × 6 conv × 3 seeds]
        Judge[3-rubric judge<br/>Haiku 4.5]
        Stats[Hierarchical bootstrap<br/>+ McNemar + Wilcoxon]
        Decide{All gates pass?}
        Mutate[GEPA reflective mutation<br/>Opus 4.7]
    end
    Personas --> Sims --> Run --> Judge --> Stats --> Decide
    Decide -->|yes| Champion[New champion prompt]
    Decide -->|no| Mutate --> Run

    subgraph "Meta-Eval (DGM)"
        Disagree[Inter-rubric<br/>Pearson r < 0.7?]
        Reflect[Opus reflects on<br/>disagreement examples]
        UpdateRubric[Update rubric set]
    end
    Judge --> Disagree --> Reflect --> UpdateRubric --> Judge

See [docs/architecture/03-learning-loop.md](docs/architecture/03-learning-loop.md) and [docs/architecture/04-eval-methodology.md](docs/architecture/04-eval-methodology.md) for the statistical gate, meta-evaluation mechanism, and budget breakdown.
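
As a rough illustration of two of the tests named in the loop, the sketch below computes an exact two-sided McNemar p-value from discordant pair counts and a hierarchical bootstrap confidence interval that resamples personas first and trials within each persona second. It is a generic textbook version under assumed inputs; the thresholds, the Wilcoxon test, and the exact resampling scheme used by the project are described in the linked docs and are not reproduced here.

```python
import random
from math import comb
from statistics import mean


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: paired trials the challenger resolved and the champion did not.
    c: paired trials the champion resolved and the challenger did not.
    """
    n = b + c
    if n == 0:
        return 1.0
    # Binomial(n, 0.5) tail probability at or below the smaller count, doubled.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def hierarchical_bootstrap_ci(diffs_by_persona, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean paired score difference (challenger minus champion).

    diffs_by_persona maps a persona name to its list of paired differences.
    Personas are resampled with replacement, then differences within each
    drawn persona, so the clustering of trials by persona is respected.
    """
    rng = random.Random(seed)
    personas = list(diffs_by_persona)
    estimates = []
    for _ in range(n_boot):
        sample = []
        for persona in rng.choices(personas, k=len(personas)):
            diffs = diffs_by_persona[persona]
            sample.extend(rng.choices(diffs, k=len(diffs)))
        estimates.append(mean(sample))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

A gate built on these would typically require the interval's lower bound to sit above zero and the McNemar p-value to fall below a chosen significance level.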

Tech stack

| Concern | Choice | Why |
|---|---|---|
| Language | Python 3.12 | Temporal + Pipecat + DSPy all Python-first |
| Orchestration | Temporal | Required by spec |
| Chat agents | Sonnet 4.6 + DSPy | Tone control + clean abstraction for prompt swap |
| Voice framework | Pipecat | Mid-call transcript instrumentation |
| Voice STT | Deepgram Nova-3 | Sub-300ms latency, strong on phone audio |
| Voice TTS | Rime Mist v2 | Sociolinguistics-trained on real call-center audio |
| Telephony | Twilio | Reliable Pipecat SIP integration |
| Self-learning | DSPy + GEPA | 35× fewer rollouts than RL |
| Judge | Haiku 4.5 with 3-rubric setup | Cost-efficient, multi-rubric for meta-eval |
| Mutation | Opus 4.7 | Frontier reasoning for prompt rewriting |
| Statistics | Hierarchical bootstrap + McNemar + Wilcoxon | 2026 best practice for clustered LLM eval |
| Compliance | Two-layer (rules + LLM judge) + Garak/PyRIT red-team | CFPB-aligned, neurosymbolic |
| Chat UI | Next.js | Demo polish |
| State | Postgres + Redis | Persistent state + live transcript stream |

Decision rationale lives in docs/:

  • [docs/architecture/01-llm-tiering.md](docs/architecture/01-llm-tiering.md) — Anthropic Haiku/Sonnet/Opus tiering
  • [docs/architecture/02-voice-stack.md](docs/architecture/02-voice-stack.md) — Pipecat over hosted alternatives
  • [docs/architecture/03-learning-loop.md](docs/architecture/03-learning-loop.md) — DSPy + GEPA + meta-evaluation
  • [docs/architecture/04-eval-methodology.md](docs/architecture/04-eval-methodology.md) — hierarchical bootstrap + paired tests
  • [docs/architecture/05-stt-tts.md](docs/architecture/05-stt-tts.md) — Deepgram + Rime
  • [docs/architecture/06-compliance.md](docs/architecture/06-compliance.md) — two-layer FDCPA-aligned guardrails
  • [docs/architecture/07-decision-agnosticism.md](docs/architecture/07-decision-agnosticism.md) — swap-cost matrix; how hard each decision would be to change
  • [docs/engineering-log/README.md](docs/engineering-log/README.md) — chronological log of every non-trivial bug hit and how it was fixed (Twilio 15s timeout, premature DEAL_AGREED, Pipecat 1.1 import rename, Indian-carrier rate limiting, etc.)

Quickstart

git clone git@github.com:teetangh/defaultline.git
cd defaultline
cp .env.example .env             # fill in ANTHROPIC_API_KEY at minimum
pip install -e '.[learning,dev]' # core deps for learning loop (no voice yet)

Run the learning loop (no voice, no Temporal cluster needed)

The eval harness runs the full chat→voice→chat pipeline in-process using the mock voice provider so the entire learning loop is reproducible without Twilio/Pipecat:

bash scripts/run_eval.sh 42       # seed=42, default 2 cycles. Adjust CYCLES env var.

Outputs:

  • data/eval_runs/<cycle_id>.json — per-trial scores (90 rows per cycle)
  • data/eval_runs/decisions.jsonl — promotion gate evidence
  • data/eval_runs/cost_log.jsonl — every LLM call with token counts
  • data/eval_runs/meta_log.jsonl — meta-evaluator triggers + proposals

Run the live system (Temporal + Postgres + Redis + worker + API)

docker compose up
# Temporal UI: http://localhost:8080
# API:         http://localhost:8000/health

Trigger one workflow:

python scripts/trigger_conversation.py --borrower-id demo-001 \
    --account-last-four 1234 --amount-owed 12500

Run voice (Day 3+; Pipecat + Twilio required)

Set settings.voice.provider to pipecat and fill in the voice env vars listed in .env.example. See apps/voice/pipecat_provider.py for the pipeline.

Reproducibility

  • Seeds: every test conversation uses an explicit seed in data/seeds/
  • Single-command rerun: bash scripts/run_eval.sh --seed 42
  • Raw data: data/eval_runs/*.json contains per-conversation scores
  • Reported numbers in docs/deliverables/evolution-report.md should match a rerun within ±2 percentage points (LLM nondeterminism is bounded by temperature=0 and seed control where supported)
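
A rerun can be compared against the reported numbers with a small tolerance check. The sketch assumes metrics are stored as fractions between 0 and 1, so a 2-point absolute tolerance is 0.02; the function name and the metric names in the usage note are hypothetical.

```python
def within_tolerance(reported: dict[str, float], rerun: dict[str, float], tol: float = 0.02):
    """Map each reported metric to True if the rerun value is within tol (absolute).

    Metrics missing from the rerun are reported as False rather than raising.
    """
    return {
        name: name in rerun and abs(value - rerun[name]) <= tol
        for name, value in reported.items()
    }
```

For example, `within_tolerance({"resolution_rate": 0.40}, {"resolution_rate": 0.41})` marks the metric as matching, while a rerun value of 0.45 would not.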

Cost report

After running the full self-learning loop:

python scripts/cost_report.py

Outputs total spend with per-model and per-activity breakdown. Target: under $20 total.
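
The breakdown described above amounts to a group-by over the cost log. The following sketch is not scripts/cost_report.py: it assumes each line of data/eval_runs/cost_log.jsonl is a JSON object with `model`, `activity`, and `cost_usd` fields, which are hypothetical field names.

```python
import json
from collections import defaultdict

BUDGET_USD = 20.0  # target from the text above


def summarize_costs(jsonl_lines):
    """Aggregate spend per model and per activity from JSONL log lines."""
    per_model = defaultdict(float)
    per_activity = defaultdict(float)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        row = json.loads(line)
        per_model[row["model"]] += row["cost_usd"]
        per_activity[row["activity"]] += row["cost_usd"]
    total = sum(per_model.values())
    return {
        "total_usd": total,
        "per_model": dict(per_model),
        "per_activity": dict(per_activity),
        "under_budget": total < BUDGET_USD,
    }
```

Passing an open file handle works as well as a list of strings, since the function only iterates over lines.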

Compliance

All three agents preserve compliance after every prompt update. Enforcement has two layers:

  • Rule-based (learning/compliance.py): regex + structured checks for hard violations
  • LLM-based: judge rubric scores nuanced violations

Any compliance violation is a hard veto in the promotion gate — the prompt is rejected regardless of resolution rate.

See [docs/architecture/06-compliance.md](docs/architecture/06-compliance.md).
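
The hard-veto behaviour can be sketched as a rule layer feeding a promotion decision. The patterns below are illustrative placeholders, not the rules in learning/compliance.py and not a legal rule set; `rule_violations` and `promote` are hypothetical names.

```python
import re

# Illustrative hard-violation patterns only; the real rule set lives in the repository.
HARD_VIOLATION_PATTERNS = {
    "threat_of_arrest": re.compile(r"\b(?:arrest(?:ed)?|jail|prison)\b", re.IGNORECASE),
    "threat_of_harm": re.compile(r"\b(?:hurt|harm)\s+you\b", re.IGNORECASE),
}


def rule_violations(agent_turns):
    """Return the sorted names of every rule matched by any agent turn."""
    hits = {
        name
        for turn in agent_turns
        for name, pattern in HARD_VIOLATION_PATTERNS.items()
        if pattern.search(turn)
    }
    return sorted(hits)


def promote(agent_turns, llm_judge_violation: bool, stats_gate_passed: bool) -> bool:
    """Decide whether a candidate prompt may become champion.

    Any violation from either layer vetoes promotion, regardless of how the
    candidate performed on the statistical gate.
    """
    if rule_violations(agent_turns) or llm_judge_violation:
        return False
    return stats_gate_passed
```

Checking compliance before the performance result makes the veto unconditional: a candidate with a violation is rejected even when `stats_gate_passed` is true.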

Limitations

(To be filled in during the technical writeup pass — what doesn't work well, what we'd improve with more time.)

Decision journal

[docs/deliverables/decision-journal.md](docs/deliverables/decision-journal.md) — handwritten by author, not LLM-generated. Required by the assignment.

About

Self-learning multi-agent debt-collections system. Chat→voice→chat pipeline orchestrated by Temporal, with deterministic policy validators, structured tool-call compliance gates, GEPA prompt evolution, and Darwin-Gödel meta-evaluation. $20 LLM budget. Built for the Riverline assignment.
