defaultline

A self-learning, multi-agent debt collections system built for "Riverline" (the regulated-collections company in the problem statement). Three AI agents (two chat, one voice) operate behind a single continuous borrower experience, orchestrated by Temporal. Each agent autonomously improves its prompt via a GEPA-driven self-learning loop with a Darwin–Gödel meta-evaluation layer.

Status: planning complete; implementation begins Day 1. See docs/ for decision rationale.

What it does

A borrower has defaulted on a loan. The system moves them through three stages, each handled by its own agent:

Agent 1 (Assessment, chat) — cold and clinical. Verifies identity, gathers financial situation.
Agent 2 (Resolution, voice call) — transactional dealmaker. Calls the borrower on the phone, negotiates settlement.
Agent 3 (Final Notice, chat) — closer. States consequences, makes one last offer with a hard expiry.

The borrower never feels the handoff between stages or modalities.

Architecture

flowchart TB
    subgraph "Borrower-facing"
        UI[Next.js Chat UI]
        Phone[Phone — Twilio]
    end
    subgraph "API"
        FAPI[FastAPI]
    end
    subgraph "Temporal"
        WF[CollectionsWorkflow]
        ACT1[Assessment Activity]
        ACT2[Resolution Activity]
        ACT3[Final Notice Activity]
        SUM[Handoff Summarizer]
    end
    subgraph "Agents"
        A1[Agent 1<br/>Sonnet 4.6 + DSPy]
        A2[Agent 2<br/>Pipecat + Sonnet 4.6]
        A3[Agent 3<br/>Sonnet 4.6 + DSPy]
    end
    subgraph "Storage"
        PG[(Postgres)]
        Redis[(Redis)]
    end
    subgraph "Self-Learning"
        GEPA[GEPA Optimizer]
        Judge[Multi-rubric Judge]
        Meta[Meta-Evaluator]
    end

    UI --> FAPI --> WF
    Phone --> FAPI --> WF
    WF --> ACT1 --> A1
    WF --> ACT2 --> A2
    WF --> ACT3 --> A3
    ACT1 --> SUM
    SUM --> ACT2
    ACT2 --> SUM
    SUM --> ACT3
    A1 --> PG
    A2 --> Redis
    A3 --> PG
    GEPA --> Judge --> Meta --> PG

Pipeline flow

flowchart TD
    A[Borrower enters post-default pipeline] --> B[Agent 1: Assessment chat]
    B -->|situation assessed| C[Agent 2: Resolution voice call]
    B -->|no response| D{Retry? max 3}
    D -->|yes| B
    D -->|exhausted| C
    C -->|deal agreed| E[EXIT: Log agreement]
    C -->|no deal| F[Agent 3: Final Notice chat]
    F -->|resolved| G[EXIT: Log resolution]
    F -->|no resolution| H[EXIT: Flag for legal/write-off]
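
The branching in the flowchart above can be sketched as a small driver function. This is a minimal illustration rather than the Temporal workflow itself: `run_pipeline`, `Outcome`, and the three callables are hypothetical names, and "Retry? max 3" is read here as three assessment attempts in total.

```python
from enum import Enum

# Assumption: "max 3" means three assessment attempts in total.
MAX_ASSESSMENT_ATTEMPTS = 3


class Outcome(Enum):
    AGREEMENT_LOGGED = "log agreement"
    RESOLUTION_LOGGED = "log resolution"
    FLAGGED = "flag for legal/write-off"


def run_pipeline(assess, resolve, final_notice):
    """Walk one borrower through the three stages.

    assess()       -> True if the borrower responded and was assessed
    resolve()      -> True if a deal was agreed on the voice call
    final_notice() -> True if the final notice produced a resolution
    """
    for _ in range(MAX_ASSESSMENT_ATTEMPTS):
        if assess():
            break
    # "Situation assessed" and "retries exhausted" both lead to the voice call.
    if resolve():
        return Outcome.AGREEMENT_LOGGED
    if final_notice():
        return Outcome.RESOLUTION_LOGGED
    return Outcome.FLAGGED
```

Passing the stages in as callables keeps the control flow testable without any agent, which is roughly what Temporal activities would provide in the real workflow.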

Cross-modal handoff

sequenceDiagram
    participant B as Borrower
    participant A1 as Agent 1 (chat)
    participant S as Summarizer (≤500 tok)
    participant A2 as Agent 2 (voice)
    participant A3 as Agent 3 (chat)

    B->>A1: chat conversation
    A1->>S: full transcript
    S->>A2: HandoffContext (≤500 tok)
    Note over A2: Phone call<br/>Live transcript extraction<br/>Offers + objections → Redis
    A2->>S: voice transcript + structured events
    S->>A3: HandoffContext (≤500 tok, both stages)
    A3->>B: "On our call earlier, you mentioned..."
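
One way to picture the `HandoffContext` passed between stages is a small structured record that renders to text and trims itself to the 500-token cap. The field names, the trimming order, and the whitespace token counter below are all assumptions made for illustration; the repository counts tokens with tiktoken and may structure the summary differently.

```python
from dataclasses import dataclass, field

HANDOFF_TOKEN_CAP = 500  # from the diagram above


def count_tokens(text: str) -> int:
    # Stand-in counter (whitespace split). Assumption: the real code uses tiktoken.
    return len(text.split())


@dataclass
class HandoffContext:
    # Hypothetical fields; the real schema is not shown in this README.
    identity_verified: bool
    financial_summary: str
    offers: list[str] = field(default_factory=list)
    objections: list[str] = field(default_factory=list)

    def render(self) -> str:
        lines = [
            f"identity_verified: {self.identity_verified}",
            f"financial_summary: {self.financial_summary}",
        ]
        lines += [f"offer: {o}" for o in self.offers]
        lines += [f"objection: {o}" for o in self.objections]
        return "\n".join(lines)

    def render_capped(self, cap: int = HANDOFF_TOKEN_CAP) -> str:
        """Drop the oldest objections, then the oldest offers, until under cap."""
        ctx = HandoffContext(self.identity_verified, self.financial_summary,
                             list(self.offers), list(self.objections))
        while count_tokens(ctx.render()) > cap and (ctx.objections or ctx.offers):
            (ctx.objections or ctx.offers).pop(0)
        text = ctx.render()
        if count_tokens(text) > cap:
            raise ValueError("handoff context exceeds cap even after trimming")
        return text
```

Trimming objections before offers is a design guess: offers already made are the facts the next agent most needs to stay consistent with.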

Token budget enforcement

Each agent operates under a hard cap:

2000 tokens total per agent (system prompt + handoff context + conversation)
≤500 tokens of that for handoff context from prior stages
Enforced in code by apps/worker/context_budget.py using tiktoken. See tests/test_context_budget.py for the verifying test.

| Agent | System prompt budget (tokens) | Handoff context (tokens) | Available for conversation |
| --- | --- | --- | --- |
| Agent 1 | 2000 | 0 | inside the 2000 |
| Agent 2 | 1500 | ≤500 (Agent 1 summary) | inside the 2000 |
| Agent 3 | 1500 | ≤500 (Agent 1+2 summary) | inside the 2000 |
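
The two caps can be checked with a few lines of code. The sketch below is not the contents of apps/worker/context_budget.py: `check_budget`, `BudgetExceeded`, and the whitespace-based `approx_tokens` counter are hypothetical, and the counter is injectable so a tiktoken-based one could be swapped in.

```python
TOTAL_CAP = 2000    # system prompt + handoff context + conversation
HANDOFF_CAP = 500   # portion of the total reserved for prior-stage context


class BudgetExceeded(ValueError):
    """Raised when either cap is violated."""


def approx_tokens(text: str) -> int:
    # Stand-in counter (whitespace split), used only to keep the sketch dependency-free.
    return len(text.split())


def check_budget(system_prompt, handoff, conversation, count=approx_tokens):
    """Return the tokens still available, or raise if a cap is exceeded."""
    handoff_tokens = count(handoff)
    if handoff_tokens > HANDOFF_CAP:
        raise BudgetExceeded(f"handoff uses {handoff_tokens} > {HANDOFF_CAP} tokens")
    total = count(system_prompt) + handoff_tokens + sum(count(t) for t in conversation)
    if total > TOTAL_CAP:
        raise BudgetExceeded(f"context uses {total} > {TOTAL_CAP} tokens")
    return TOTAL_CAP - total
```

Calling this before every model request, and trimming the oldest conversation turns when it raises, would be one way to keep the cap hard rather than advisory.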

Self-learning loop

flowchart LR
    subgraph "Each Iteration"
        Personas[5 personas:<br/>cooperative, combative,<br/>evasive, confused, distressed]
        Sims[Borrower sim<br/>Haiku 4.5]
        Run[90 paired trials<br/>5 personas × 6 conv × 3 seeds]
        Judge[3-rubric judge<br/>Haiku 4.5]
        Stats[Hierarchical bootstrap<br/>+ McNemar + Wilcoxon]
        Decide{All gates pass?}
        Mutate[GEPA reflective mutation<br/>Opus 4.7]
    end
    Personas --> Sims --> Run --> Judge --> Stats --> Decide
    Decide -->|yes| Champion[New champion prompt]
    Decide -->|no| Mutate --> Run

    subgraph "Meta-Eval (DGM)"
        Disagree[Inter-rubric<br/>Pearson r < 0.7?]
        Reflect[Opus reflects on<br/>disagreement examples]
        UpdateRubric[Update rubric set]
    end
    Judge --> Disagree --> Reflect --> UpdateRubric --> Judge

See [docs/architecture/03-learning-loop.md](docs/architecture/03-learning-loop.md) and [docs/architecture/04-eval-methodology.md](docs/architecture/04-eval-methodology.md) for the statistical gate, meta-evaluation mechanism, and budget breakdown.
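
As a rough illustration of the numbers in the loop diagram, the sketch below enumerates the 90-trial grid and runs an exact McNemar test on paired binary outcomes. It assumes each trial yields a pass/fail outcome per prompt; the function names are hypothetical, and the hierarchical bootstrap and Wilcoxon parts of the gate are not shown.

```python
from itertools import product
from math import comb

PERSONAS = ["cooperative", "combative", "evasive", "confused", "distressed"]
CONVERSATIONS_PER_PERSONA = 6
SEEDS_PER_CONVERSATION = 3


def trial_grid():
    """All (persona, conversation index, seed index) triples: 5 x 6 x 3 = 90."""
    return list(product(PERSONAS,
                        range(CONVERSATIONS_PER_PERSONA),
                        range(SEEDS_PER_CONVERSATION)))


def mcnemar_exact(champion, challenger):
    """Two-sided exact McNemar p-value for paired pass/fail outcomes."""
    pairs = list(zip(champion, challenger))
    b = sum(1 for old, new in pairs if not old and new)  # challenger-only passes
    c = sum(1 for old, new in pairs if old and not new)  # champion-only passes
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence either way
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Only discordant pairs carry information in McNemar's test, which is why running the champion and challenger on the same persona, conversation, and seed matters.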

Tech stack

| Concern | Choice | Why |
| --- | --- | --- |
| Language | Python 3.12 | Temporal, Pipecat, and DSPy are all Python-first |
| Orchestration | Temporal | Required by spec |
| Chat agents | Sonnet 4.6 + DSPy | Tone control + clean abstraction for prompt swap |
| Voice framework | Pipecat | Mid-call transcript instrumentation |
| Voice STT | Deepgram Nova-3 | Sub-300ms latency, strong on phone audio |
| Voice TTS | Rime Mist v2 | Sociolinguistics-trained on real call-center audio |
| Telephony | Twilio | Reliable Pipecat SIP integration |
| Self-learning | DSPy + GEPA | Up to 35× fewer rollouts than RL (GRPO), as reported in the GEPA paper |
| Judge | Haiku 4.5 with 3-rubric setup | Cost-efficient, multi-rubric for meta-eval |
| Mutation | Opus 4.7 | Frontier reasoning for prompt rewriting |
| Statistics | Hierarchical bootstrap + McNemar + Wilcoxon | 2026 best practice for clustered LLM eval |
| Compliance | Two-layer (rules + LLM judge) + Garak/PyRIT red-team | CFPB-aligned, neurosymbolic |
| Chat UI | Next.js | Demo polish |
| State | Postgres + Redis | Persistent state + live transcript stream |

Decision rationale lives in docs/:

[docs/architecture/01-llm-tiering.md](docs/architecture/01-llm-tiering.md) — Anthropic Haiku/Sonnet/Opus tiering
[docs/architecture/02-voice-stack.md](docs/architecture/02-voice-stack.md) — Pipecat over hosted alternatives
[docs/architecture/03-learning-loop.md](docs/architecture/03-learning-loop.md) — DSPy + GEPA + meta-evaluation
[docs/architecture/04-eval-methodology.md](docs/architecture/04-eval-methodology.md) — hierarchical bootstrap + paired tests
[docs/architecture/05-stt-tts.md](docs/architecture/05-stt-tts.md) — Deepgram + Rime
[docs/architecture/06-compliance.md](docs/architecture/06-compliance.md) — two-layer FDCPA-aligned guardrails
[docs/architecture/07-decision-agnosticism.md](docs/architecture/07-decision-agnosticism.md) — swap-cost matrix; how hard each decision would be to change
[docs/engineering-log/README.md](docs/engineering-log/README.md) — chronological log of every non-trivial bug hit and how it was fixed (Twilio 15s timeout, premature DEAL_AGREED, Pipecat 1.1 import rename, Indian-carrier rate limiting, etc.)

Quickstart

git clone git@github.com:teetangh/defaultline.git
cd defaultline
cp .env.example .env             # fill in ANTHROPIC_API_KEY at minimum
pip install -e '.[learning,dev]' # core deps for learning loop (no voice yet)

Run the learning loop (no voice, no Temporal cluster needed)

The eval harness runs the full chat→voice→chat pipeline in-process using the mock voice provider so the entire learning loop is reproducible without Twilio/Pipecat:

bash scripts/run_eval.sh 42       # seed=42, default 2 cycles. Adjust CYCLES env var.

Outputs:

data/eval_runs/<cycle_id>.json — per-trial scores (90 rows per cycle)
data/eval_runs/decisions.jsonl — promotion gate evidence
data/eval_runs/cost_log.jsonl — every LLM call with token counts
data/eval_runs/meta_log.jsonl — meta-evaluator triggers + proposals

Run the live system (Temporal + Postgres + Redis + worker + API)

docker compose up
# Temporal UI: http://localhost:8080
# API:         http://localhost:8000/health

Trigger one workflow:

python scripts/trigger_conversation.py --borrower-id demo-001 \
    --account-last-four 1234 --amount-owed 12500

Run voice (Day 3+; Pipecat + Twilio required)

Set settings.voice.provider to pipecat and fill in the voice env vars listed in .env.example. See apps/voice/pipecat_provider.py for the pipeline.

Reproducibility

Seeds: every test conversation uses an explicit seed in data/seeds/
Single-command rerun: bash scripts/run_eval.sh 42 (same invocation as in the Quickstart; the argument is the seed)
Raw data: data/eval_runs/*.json contains per-conversation scores
Reported numbers in docs/deliverables/evolution-report.md will match a rerun within ±2 percentage points (LLM nondeterminism is bounded with temperature=0 and seed control where supported)

Cost report

After running the full self-learning loop:

python scripts/cost_report.py

Outputs total spend with per-model and per-activity breakdown. Target: under $20 total.
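
The kind of aggregation such a report performs can be sketched as below. This is not scripts/cost_report.py: the row fields `model`, `activity`, and `cost_usd` are a hypothetical schema for data/eval_runs/cost_log.jsonl, which this README only describes as holding every LLM call with token counts.

```python
import json
from collections import defaultdict

BUDGET_USD = 20.0  # target stated above


def summarize_costs(jsonl_lines):
    """Sum spend per model and per activity from JSONL rows.

    Assumes each row carries "model", "activity", and "cost_usd" keys
    (hypothetical field names).
    """
    per_model = defaultdict(float)
    per_activity = defaultdict(float)
    for line in jsonl_lines:
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        per_model[row["model"]] += row["cost_usd"]
        per_activity[row["activity"]] += row["cost_usd"]
    total = sum(per_model.values())
    return {
        "total": total,
        "per_model": dict(per_model),
        "per_activity": dict(per_activity),
        "under_budget": total < BUDGET_USD,
    }
```

Under that schema, `summarize_costs(open("data/eval_runs/cost_log.jsonl"))` would produce the totals in one pass.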

Compliance

All three agents preserve compliance after every prompt update. Enforcement has two layers:

Rule-based (learning/compliance.py): regex + structured checks for hard violations
LLM-based: judge rubric scores nuanced violations

Any compliance violation is a hard veto in the promotion gate — the prompt is rejected regardless of resolution rate.
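
A minimal sketch of how the hard veto could sit in front of the promotion decision follows. The regex patterns, `CandidateResult`, and the plain rate comparison are illustrative assumptions: the real rules live in learning/compliance.py, and the real gate uses the statistical tests described in the self-learning section rather than a simple comparison.

```python
import re
from dataclasses import dataclass

# Hypothetical hard-violation patterns, for illustration only.
HARD_VIOLATION_PATTERNS = {
    "threat_of_arrest": re.compile(r"\b(arrest(ed)?|jail|prison)\b", re.IGNORECASE),
    "abusive_language": re.compile(r"\b(deadbeat|loser)\b", re.IGNORECASE),
}


def rule_violations(agent_turns):
    """Layer 1: names of every hard rule matched by any agent turn."""
    return sorted({
        name
        for turn in agent_turns
        for name, pattern in HARD_VIOLATION_PATTERNS.items()
        if pattern.search(turn)
    })


@dataclass
class CandidateResult:
    resolution_rate: float
    agent_turns: list[str]
    judge_flagged: bool  # Layer 2: outcome of the LLM compliance rubric


def passes_promotion_gate(candidate, champion_rate):
    # Hard veto: any violation rejects the prompt regardless of resolution rate.
    if rule_violations(candidate.agent_turns) or candidate.judge_flagged:
        return False
    # Simplification: the real gate applies statistical tests here.
    return candidate.resolution_rate > champion_rate
```

Checking compliance before looking at the resolution rate makes the veto structural: a violating prompt never reaches the comparison at all.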

See [docs/architecture/06-compliance.md](docs/architecture/06-compliance.md).

Limitations

(To be filled in during the technical writeup pass — what doesn't work well, what we'd improve with more time.)

Decision journal

[docs/deliverables/decision-journal.md](docs/deliverables/decision-journal.md) — handwritten by author, not LLM-generated. Required by the assignment.
