laurasot/speakflow-api

SpeakFlow API

Backend service that receives real-time audio from multiple clients (microphone + system audio as separate streams), forwards it to a configurable Speech-to-Text provider, and returns normalized transcripts over WebSocket.

Designed to pair with SpeakFlow Desktop, the Electron app that captures dual audio and streams PCM chunks to this API.

Features

  • Dual-stream sessions: isolated microphone and system channels per session
  • Provider abstraction: switch STT vendors via SPEECH_PROVIDER (no code changes)
  • Binary PCM protocol: JSON metadata + raw PCM frames (not base64)
  • Session isolation: concurrent users never share audio queues or provider connections
  • Persistent provider connections: one WebSocket per (session_id, source), not per chunk
  • Fault tolerance: automatic reconnection with exponential backoff per provider
  • Normalized output: same transcript schema regardless of Deepgram, AssemblyAI, AWS, or Whisper
  • Observability: structured JSON logs + metrics for provider comparison
  • Optional LangChain: post-processing on final transcripts (punctuation cleanup, etc.)
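
As an illustration of the exponential backoff mentioned in the list, the following sketch generates a reconnection delay schedule. The helper name `backoff_delays` and every default value (base delay, growth factor, cap, attempt count) are assumptions made for this example; the README does not state the values the service actually uses.

```python
from collections.abc import Iterator


def backoff_delays(
    base: float = 0.5,
    factor: float = 2.0,
    cap: float = 30.0,
    attempts: int = 6,
) -> Iterator[float]:
    """Yield the wait in seconds before each reconnection attempt.

    All defaults are illustrative, not values taken from the project.
    """
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)  # never wait longer than the cap
        delay *= factor        # grow the wait geometrically


if __name__ == "__main__":
    print(list(backoff_delays()))  # [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
```

A real reconnect loop would typically also add random jitter so that many streams do not retry at the same instant; that is left out here to keep the schedule easy to read.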

Architecture

Roughly every 500 ms, the client sends two WebSocket frames per audio chunk:

  1. Text frame: JSON metadata (session_id, source, timestamp, size)
  2. Binary frame: raw PCM16 LE mono at 16 kHz (about 16 KB per chunk)

┌─────────────────┐     JSON + PCM      ┌──────────────────┐
│ SpeakFlow       │ ──────────────────► │ WebSocket        │
│ Desktop         │   /v1/stt/stream    │ (thin router)    │
└─────────────────┘                     └────────┬─────────┘
                                                 │
                                                 ▼
                                        ┌──────────────────┐
                                        │ SessionManager   │
                                        │  ├─ mic queue    │
                                        │  └─ system queue │
                                        └────────┬─────────┘
                                                 │
                      ┌──────────────────────────┼──────────────────────────┐
                      ▼                          ▼                          ▼
               ┌────────────┐             ┌────────────┐             ┌────────────┐
               │ Deepgram   │             │ AssemblyAI │             │ AWS / etc. │
               └────────────┘             └────────────┘             └────────────┘
                      │                          │                          │
                      └──────────────────────────┼──────────────────────────┘
                                                 ▼
                                        ┌──────────────────┐
                                        │ Normalized       │
                                        │ transcript JSON  │
                                        └──────────────────┘

Golden rule: audio from one user/session must never mix with another. Each (session_id, source) gets its own asyncio.Queue and dedicated provider connection.
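
The isolation rule can be sketched with one asyncio.Queue per (session_id, source) key. `StreamRegistry` below is a hypothetical stand-in for the queue bookkeeping inside SessionManager, not the project's implementation; only the `route_audio` name and the idea of locking on create/close come from this README.

```python
import asyncio

StreamKey = tuple[str, str]  # (session_id, source)


class StreamRegistry:
    """Hypothetical stand-in for SessionManager's per-stream queues."""

    def __init__(self) -> None:
        self._queues: dict[StreamKey, asyncio.Queue] = {}
        self._lock = asyncio.Lock()  # taken only when streams are created or closed

    async def open_stream(self, session_id: str, source: str) -> None:
        async with self._lock:
            self._queues.setdefault((session_id, source), asyncio.Queue())

    def route_audio(self, session_id: str, source: str, pcm: bytes) -> None:
        # Hot path: a plain dict lookup, no lock taken.
        self._queues[(session_id, source)].put_nowait(pcm)

    async def close_stream(self, session_id: str, source: str) -> None:
        async with self._lock:
            self._queues.pop((session_id, source), None)

    def pending(self, session_id: str, source: str) -> int:
        return self._queues[(session_id, source)].qsize()


async def demo() -> dict[StreamKey, int]:
    registry = StreamRegistry()
    for session in ("session-a", "session-b"):
        for source in ("microphone", "system"):
            await registry.open_stream(session, source)
    registry.route_audio("session-a", "microphone", b"\x00\x01")
    registry.route_audio("session-a", "microphone", b"\x02\x03")
    registry.route_audio("session-b", "system", b"\x04\x05")
    # Each key reports only the chunks routed to that exact stream.
    return {key: registry.pending(*key) for key in registry._queues}


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because every chunk is addressed by its full key, audio for session-a can never land in a queue that belongs to session-b, and the two sources of one session stay apart as well.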

Tech Stack

| Layer | Technology |
| --- | --- |
| API | FastAPI + Uvicorn |
| Validation | Pydantic v2 + pydantic-settings |
| Async I/O | asyncio, websockets |
| STT providers | Deepgram, AssemblyAI, AWS Transcribe, Whisper (local) |
| Post-processing | LangChain + LangChain OpenAI (optional) |
| Package manager | uv |
| Tests | pytest + pytest-asyncio |
| Linting | ruff, mypy |

Getting Started

Prerequisites

  • Python 3.11+
  • uv 0.4+
  • API key for your chosen STT provider (e.g. Deepgram)

Installation

git clone https://github.com/laurasot/speakflow-api.git
cd speakflow-api
uv sync

Configure

Copy the example env file and fill in your secrets:

cp .env.example .env

Minimum for Deepgram:

SPEECH_PROVIDER=deepgram
DEEPGRAM_API_KEY=your_api_key_here
LOG_LEVEL=INFO

Never commit .env; it is listed in .gitignore.

Run

uv run uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

| Endpoint | URL |
| --- | --- |
| Health | http://localhost:8000/v1/health |
| API docs | http://localhost:8000/docs |
| WebSocket STT | ws://localhost:8000/v1/stt/stream |

Quick health check:

curl http://localhost:8000/v1/health

Expected response:

{"status":"ok","active_sessions":0}

Connect from SpeakFlow Desktop

In Desktop Settings, set:

| Field | Value |
| --- | --- |
| User ID | your identifier |
| Backend WebSocket URL | ws://localhost:8000/v1/stt/stream |

The desktop app must send the header X-User-Id on the WebSocket handshake (same value as User ID).

WebSocket Protocol (summary)

Authentication

GET /v1/stt/stream HTTP/1.1
X-User-Id: user123

Connections without an X-User-Id header are closed with WebSocket close code 1008 (policy violation).

Client → Server

| Message | Description |
| --- | --- |
| start_session | Opens provider connections for each source |
| audio_chunk | Metadata JSON, then binary PCM in the next frame |
| stop_session | Graceful shutdown |

start_session example:

{
  "type": "start_session",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "user_id": "user123",
  "sources": ["microphone", "system"],
  "audio_config": {
    "sample_rate": 16000,
    "channels": 1,
    "encoding": "pcm16le"
  }
}

audio_chunk (two frames):

{
  "type": "audio_chunk",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "source": "microphone",
  "timestamp": 1717000000123,
  "size": 16000
}

This metadata frame is immediately followed by a binary frame carrying 16,000 bytes of PCM16 LE audio.

Server → Client

| Message | Description |
| --- | --- |
| session_started | Session and provider streams are ready |
| transcript | Partial or final transcription |
| session_ended | Clean close confirmed |
| error | Validation, provider, or session errors |

transcript example:

{
  "type": "transcript",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "source": "microphone",
  "text": "Hello, how are you?",
  "is_final": false,
  "timestamp": 1717000001234,
  "provider": "deepgram",
  "language": "en",
  "start_time": 0.0,
  "end_time": 0.5
}
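
One way to picture the normalization step is a single dataclass that every provider adapter fills in. The field names below mirror the transcript message; the raw payload keys (`utterance`, `final`, `lang`, `offset_s`, `length_s`) and the `normalize` helper are invented for illustration and do not correspond to any real vendor's response format.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Transcript:
    session_id: str
    source: str
    text: str
    is_final: bool
    timestamp: int
    provider: str
    language: str
    start_time: float
    end_time: float
    type: str = "transcript"


def normalize(
    raw: dict, *, session_id: str, source: str, provider: str, timestamp: int
) -> Transcript:
    """Map a hypothetical vendor payload onto the shared schema."""
    start = float(raw["offset_s"])
    return Transcript(
        session_id=session_id,
        source=source,
        text=raw["utterance"].strip(),
        is_final=bool(raw.get("final", False)),
        timestamp=timestamp,
        provider=provider,
        language=raw.get("lang", "unknown"),
        start_time=start,
        end_time=start + float(raw["length_s"]),  # vendor reports a duration
    )


example = normalize(
    {"utterance": " Hello, how are you? ", "lang": "en", "offset_s": 0.0, "length_s": 0.5},
    session_id="550e8400-e29b-41d4-a716-446655440000",
    source="microphone",
    provider="example_vendor",
    timestamp=1717000001234,
)
message = asdict(example)  # plain dict, ready to serialise as JSON
```

Each adapter would own its own mapping function, so the WebSocket layer only ever sees `Transcript` objects.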

Switch STT Provider

Change one line in .env (no code changes):

SPEECH_PROVIDER=deepgram        # default
SPEECH_PROVIDER=assemblyai
SPEECH_PROVIDER=aws_transcribe
SPEECH_PROVIDER=whisper_local   # requires: uv add openai-whisper

Restart the server after changing provider.

Project Structure

speakflow-api/
├── app/
│   ├── main.py                    # FastAPI app, CORS, lifespan
│   ├── routers/v1/
│   │   ├── health.py              # GET /v1/health
│   │   └── websocket_stt.py       # WS  /v1/stt/stream
│   ├── core/
│   │   ├── config.py              # Environment settings
│   │   ├── logging.py             # Structured JSON logging
│   │   └── dependencies.py        # DI (SessionManager singleton)
│   ├── schemas/
│   │   ├── audio.py               # Incoming message models
│   │   └── transcript.py          # Outgoing message models
│   ├── services/
│   │   ├── session_manager.py     # Session + stream isolation
│   │   ├── speech_service.py      # Transcript pipeline
│   │   └── transcript_processor.py  # LangChain post-processing
│   ├── providers/
│   │   ├── base.py                # SpeechProvider protocol
│   │   ├── factory.py             # Provider registry
│   │   ├── deepgram/
│   │   ├── assemblyai/
│   │   ├── aws_transcribe/
│   │   └── whisper_local/
│   └── infrastructure/
│       └── metrics.py             # Provider comparison metrics
├── tests/
│   ├── unit/
│   └── integration/
├── pyproject.toml
├── .env.example
└── README.md

Tests

uv run pytest tests/ -v

Critical coverage:

  • Concurrent sessions use separate provider instances
  • Audio routing does not mix bytes between sessions
  • WebSocket rejects missing X-User-Id
  • Invalid messages return error without crashing the server

Design Decisions

| Decision | Rationale |
| --- | --- |
| Binary PCM, not base64 | ~25% less bandwidth (base64 inflates payloads by about a third); direct compatibility with STT streaming APIs |
| Two frames per chunk | WebSocket natively separates text vs binary; metadata stays JSON |
| Separate sources | Backend can attribute speech to user vs meeting without client-side diarization |
| Provider protocol | Swap vendors for quality, latency, cost, and language support benchmarks |
| Lock only on create/close | Hot-path route_audio has zero lock contention between sessions |
| LangChain on finals only | Avoids LLM latency on every partial transcript |
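
The bandwidth figure for binary PCM versus base64 can be checked with a few lines of standard-library Python, using the 16,000-byte chunk size from the protocol section. The variable names are only for this example.

```python
import base64

chunk = bytes(16_000)              # one 500 ms PCM16 mono chunk at 16 kHz
encoded = base64.b64encode(chunk)  # what a JSON-embedded payload would carry

overhead = len(encoded) / len(chunk) - 1  # growth caused by base64
saving = 1 - len(chunk) / len(encoded)    # reduction from sending raw bytes

print(len(chunk), len(encoded))           # 16000 21336
```

Base64 emits 4 output bytes for every 3 input bytes, so the encoded chunk is about one third larger, which is the same as saying the raw chunk is about one quarter smaller.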

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| SPEECH_PROVIDER | deepgram | Active STT provider |
| DEEPGRAM_API_KEY | (none) | Deepgram API token |
| ASSEMBLYAI_API_KEY | (none) | AssemblyAI API key |
| AWS_REGION | us-east-1 | AWS region for Transcribe |
| AWS_ACCESS_KEY_ID | (none) | AWS credentials |
| AWS_SECRET_ACCESS_KEY | (none) | AWS credentials |
| OPENAI_API_KEY | (none) | Optional LangChain post-processing |
| LOG_LEVEL | INFO | Logging verbosity |
| CORS_ORIGINS | ["http://localhost:3000"] | Allowed CORS origins |
| PROVIDER_CONNECT_TIMEOUT | 10 | Provider connect timeout (seconds) |
| PROVIDER_RESPONSE_TIMEOUT | 30 | Provider response timeout (seconds) |

Contributing

git checkout -b feature/my-feature
# ... changes ...
uv run ruff check app tests
uv run pytest tests/
git commit -m "feat: description"
# Push + PR

Related Projects

| Project | Role |
| --- | --- |
| SpeakFlow Desktop | Captures mic + system audio, streams to this API |

License

This project is licensed under the PolyForm Noncommercial License 1.0.0; see LICENSE.

| Allowed | Not allowed (for third parties) |
| --- | --- |
| Personal use, learning, research | Commercial use |
| Modify and share (noncommercial) | Selling the software or derivatives |
| Internal use in nonprofits/education | Commercial SaaS or paid services built on this code |

The copyright holder may use the software commercially (e.g. in the SpeakFlow product). Everyone else needs a separate commercial license.

For commercial licensing inquiries: contact the repository owner.

Why not CC BY-NC?

Creative Commons targets creative works and documentation, not application source code. PolyForm Noncommercial is written for software and is clearer on SaaS and redistribution.
