Observability for AI inference workloads. Track every inference call with GPU-level metrics, latency breakdown, and cost tracking.
Axonize sits between infrastructure monitoring (Grafana/Prometheus) and LLM service tracing (Langfuse/LangSmith), giving you inference-level visibility into how your models use GPU resources.
- Low-overhead tracing — Lock-free ring buffer designed for <1μs overhead per span
- Multi-vendor GPU support — Auto-detects NVIDIA (via pynvml) and Apple Silicon (via IOKit)
- 3-layer GPU identity — Physical GPU → Compute Resource → Runtime Context (MIG-aware)
- LLM Metrics — TTFT, tokens/sec, and token-level streaming tracking out of the box
- OpenTelemetry Compatible — OTLP export with ai.*, gpu.*, cost.* semantic conventions
- Full-stack Dashboard — Trace timeline (Gantt), GPU monitoring, latency analytics
- Multi-tenant ready — API key authentication and usage tracking for cloud deployments
Python SDK (axonize)        Go Server               Database
=====================       ==================      ==================
span() / llm_span()         gRPC Ingest
         |                       |
    Ring Buffer             Batch Writer   ---->    PostgreSQL
         |                       |                  (spans, gpu_metrics,
Background Processor        GPU Registry             physical_gpus,
         |                       |                   compute_resources)
 OTLP gRPC Export           REST API
                                 |
                            Dashboard (React)
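
The SDK column of the diagram (ring buffer, background processor, batched export) can be pictured with a small sketch. This is an illustration only, under loud assumptions: `SpanBuffer`, `push`, and `drain_batch` are hypothetical names, a `collections.deque` with `maxlen` stands in for the lock-free ring buffer, and none of it is taken from the Axonize source.

```python
from collections import deque
from typing import Any, Deque, Dict, List


class SpanBuffer:
    """Bounded buffer that drops the oldest span when full (hypothetical sketch)."""

    def __init__(self, capacity: int) -> None:
        # deque(maxlen=...) discards the oldest item on overflow, which mimics
        # ring-buffer behaviour. It is NOT the SDK's lock-free implementation.
        self._items: Deque[Dict[str, Any]] = deque(maxlen=capacity)

    def push(self, span: Dict[str, Any]) -> None:
        """Hot path: a single append, no I/O."""
        self._items.append(span)

    def drain_batch(self, batch_size: int) -> List[Dict[str, Any]]:
        """Background path: remove up to batch_size spans, oldest first."""
        batch: List[Dict[str, Any]] = []
        while self._items and len(batch) < batch_size:
            batch.append(self._items.popleft())
        return batch

    def __len__(self) -> int:
        return len(self._items)


buffer = SpanBuffer(capacity=4)
for i in range(6):  # two more spans than the buffer can hold
    buffer.push({"name": f"span-{i}"})

first_batch = buffer.drain_batch(batch_size=3)   # span-2, span-3, span-4
second_batch = buffer.drain_batch(batch_size=3)  # span-5
```

In a design like this, a background worker would call `drain_batch` on a timer and hand each batch to the exporter, so the traced code path never waits on the network.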
git clone https://github.com/Streamize-llc/axonize.git
cd axonize
# Start PostgreSQL + Server
docker compose up -d
# Apply database migrations (waits for databases to be ready)
make migrate

The dashboard is available at http://localhost:3000.
The server generates a default API key on first startup. Get it from the logs:
docker compose logs axonize-server | grep "API key"
# Output: Generated API key: axon_...

Or set your own key in .env:
AXONIZE_API_KEY=your-secret-key

# Basic installation (includes Apple Silicon GPU support)
pip install axonize
# For NVIDIA GPUs (requires CUDA drivers)
pip install axonize[nvidia]
# All optional dependencies
pip install axonize[all]

import os
import axonize

axonize.init(
    endpoint="localhost:4317",
    service_name="my-inference-service",
    api_key=os.getenv("AXONIZE_API_KEY"),  # Required for authentication
)

# Basic span
with axonize.span("image-generation") as s:
    s.set_attribute("ai.model.name", "stable-diffusion-xl")
    s.set_gpus(["cuda:0"])  # or ["mps:0"] for Apple Silicon
    result = model.generate(prompt)

# LLM span with streaming token tracking
# (place this inside a generator function, since it yields tokens)
with axonize.llm_span("generate", model="llama-3-70b") as s:
    s.set_tokens_input(len(prompt_tokens))
    for token in model.generate_stream(prompt):
        s.record_token()
        yield token
    # TTFT and tokens/sec are calculated automatically

axonize.shutdown()

Open http://localhost:3000 to see your traces, GPU metrics, and performance analytics.
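
The quick start above notes that TTFT and tokens/sec are calculated automatically. To make those two numbers concrete, here is a rough sketch of how they can be derived from timestamps. It assumes TTFT is the delay from span start to the first recorded token and throughput is output tokens divided by the time from span start to the last token; the SDK's exact definitions may differ, and `llm_metrics` is a hypothetical helper, not part of axonize.

```python
from typing import Dict, List


def llm_metrics(start_s: float, token_times_s: List[float]) -> Dict[str, float]:
    """Derive TTFT and throughput from a span start time and per-token timestamps.

    Assumed definitions (not confirmed by the Axonize docs):
      ttft_ms           = first token time - span start, in milliseconds
      tokens_per_second = output tokens / (last token time - span start)
    """
    if not token_times_s:
        # No tokens recorded: report zeros rather than dividing by zero.
        return {"ttft_ms": 0.0, "tokens_per_second": 0.0, "tokens_output": 0.0}
    ttft_ms = (token_times_s[0] - start_s) * 1000.0
    elapsed_s = token_times_s[-1] - start_s
    tokens_per_second = len(token_times_s) / elapsed_s if elapsed_s > 0 else 0.0
    return {
        "ttft_ms": ttft_ms,
        "tokens_per_second": tokens_per_second,
        "tokens_output": float(len(token_times_s)),
    }


# Span starts at t=10.0 s; four tokens arrive at 10.25, 10.5, 10.75 and 11.0 s.
metrics = llm_metrics(10.0, [10.25, 10.5, 10.75, 11.0])
```

With these inputs the first token arrives 250 ms after the span starts and four tokens are produced in one second.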
axonize.init(
    endpoint="localhost:4317",   # gRPC endpoint
    service_name="my-service",   # Service identifier
    api_key="your-api-key",      # Authentication (required if server has AXONIZE_API_KEY set)
    environment="production",    # Environment tag
    gpu_profiling=True,          # Enable GPU profiling (auto-detects vendor)
    batch_size=512,              # Spans per export batch
    flush_interval_ms=5000,      # Max time between flushes
    sampling_rate=1.0,           # Fraction of spans to keep (0.0-1.0)
)

# Context manager
with axonize.span("operation-name") as s:
    s.set_attribute("key", "value")
    s.set_gpus(["cuda:0", "cuda:1"])
    s.set_status(axonize.SpanStatus.OK)

# Decorator
@axonize.trace
def my_function():
    pass

@axonize.trace(name="custom-name")
def another_function():
    pass

with axonize.llm_span("generate", model="llama-3-70b") as s:
    s.set_tokens_input(128)        # Prompt token count
    s.set_model("llama-3", "70b")  # Model name + version
    for token in stream:
        s.record_token()           # Tracks each output token

# On exit, automatically calculates:
#   ai.llm.ttft_ms — Time to first token
#   ai.llm.tokens_per_second — Generation throughput
#   ai.llm.tokens.output — Total output tokens

When gpu_profiling=True, the SDK automatically:
- Detects the GPU backend: NVIDIA (pynvml) or Apple Silicon (IOKit)
- Discovers available GPUs (including NVIDIA MIG instances)
- Collects metrics every 100ms in a background thread (utilization, memory, power, temperature)
- Attaches GPU attribution when you call span.set_gpus()
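
The background collection step in the list above can be sketched with the standard library alone. This is a minimal illustration, not the SDK's collector: `PeriodicSampler` and `fake_gpu_reader` are hypothetical names, and the reader returns fixed stand-in values where a real one would query the vendor backend.

```python
import threading
import time
from typing import Callable, Dict, List

Sample = Dict[str, float]


class PeriodicSampler:
    """Calls read_fn every interval_s seconds on a daemon thread (hypothetical)."""

    def __init__(self, read_fn: Callable[[], Sample], interval_s: float = 0.1) -> None:
        self._read_fn = read_fn
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._lock = threading.Lock()
        self._samples: List[Sample] = []
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        # Take one sample immediately, then one per interval until stopped.
        while True:
            sample = self._read_fn()
            with self._lock:
                self._samples.append(sample)
            # Event.wait returns True as soon as stop() sets the event.
            if self._stop.wait(self._interval_s):
                break

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> List[Sample]:
        """Stop the thread and return a copy of everything collected."""
        self._stop.set()
        self._thread.join()
        with self._lock:
            return list(self._samples)


def fake_gpu_reader() -> Sample:
    # Stand-in values; a real reader would query the GPU backend instead.
    return {
        "utilization_pct": 42.0,
        "memory_used_mb": 1024.0,
        "power_w": 150.0,
        "temperature_c": 60.0,
    }


sampler = PeriodicSampler(fake_gpu_reader, interval_s=0.01)
sampler.start()
time.sleep(0.05)
samples = sampler.stop()
```

Keeping the reader behind a plain callable is one way to let the same loop serve different vendors; whether Axonize structures its collector this way is not stated here.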
Device labels:
- NVIDIA: cuda:0, cuda:1, etc.
- Apple Silicon: mps:0
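
If you build device labels dynamically, it can help to validate them before passing them to set_gpus(). The helper below is a hypothetical sketch that only encodes the two label forms listed above; it is not part of the SDK, and the SDK may accept or reject labels differently.

```python
from typing import Tuple

# The two label prefixes documented above.
KNOWN_BACKENDS = ("cuda", "mps")


def parse_device_label(label: str) -> Tuple[str, int]:
    """Split a label such as "cuda:1" into ("cuda", 1), or raise ValueError."""
    backend, sep, index = label.partition(":")
    if sep != ":" or backend not in KNOWN_BACKENDS or not index.isdigit():
        raise ValueError(f"unrecognized device label: {label!r}")
    return backend, int(index)


parsed = [parse_device_label(label) for label in ["cuda:0", "cuda:1", "mps:0"]]
```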
Installation:
- Apple Silicon support is built-in (no extra dependencies)
- NVIDIA requires: pip install axonize[nvidia] (installs pynvml)
axonize.init(endpoint="...", service_name="...", gpu_profiling=True)

with axonize.span("inference") as s:
    # NVIDIA
    s.set_gpus(["cuda:0"])
    # Apple Silicon
    # s.set_gpus(["mps:0"])
    # GPU metrics (utilization, memory, power, temp) are
    # automatically attached to this span with vendor-specific values

Authentication: All /api/v1/* endpoints require Bearer token authentication (except health checks):
curl -H "Authorization: Bearer <your-api-key>" \
  http://localhost:8080/api/v1/traces

| Endpoint | Description |
|---|---|
| GET /healthz | Health check (no auth) |
| GET /readyz | Readiness check - DB connectivity (no auth) |
| Endpoint | Description |
|---|---|
| GET /api/v1/traces | List traces (filterable, paginated) |
| GET /api/v1/traces/{id} | Trace detail with span tree |
| GET /api/v1/gpus | GPU registry list |
| GET /api/v1/gpus/{uuid} | GPU detail |
| GET /api/v1/gpus/{uuid}/metrics | GPU metric time series |
| GET /api/v1/analytics/overview | Dashboard analytics (throughput, latency, errors) |
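
The same authenticated call can be made from Python with the standard library. This is a minimal sketch: `build_api_request` and `fetch_json` are hypothetical helpers, and it assumes the endpoints return JSON, which the table above does not state.

```python
import json
import urllib.request
from typing import Any


def build_api_request(base_url: str, path: str, api_key: str) -> urllib.request.Request:
    """Build a GET request for an /api/v1/* endpoint with a Bearer token."""
    url = base_url.rstrip("/") + path
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def fetch_json(request: urllib.request.Request, timeout_s: float = 10.0) -> Any:
    """Send the request and decode the body (assumes a JSON response)."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_api_request("http://localhost:8080", "/api/v1/traces", "your-api-key")
# traces = fetch_json(request)  # uncomment with the server running
```

Building the request separately from sending it makes the header easy to inspect without a running server.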
Admin endpoints are available when AXONIZE_AUTH_MODE=multi_tenant and require Authorization: Bearer <AXONIZE_ADMIN_KEY>.
| Endpoint | Description |
|---|---|
| POST /api/v1/admin/tenants | Create new tenant |
| GET /api/v1/admin/tenants | List all tenants |
| POST /api/v1/admin/tenants/{id}/keys | Create API key for tenant |
| DELETE /api/v1/admin/tenants/{id}/keys/{prefix} | Revoke API key |
| GET /api/v1/admin/tenants/{id}/usage | Get tenant usage (spans + GPU seconds) |
See docs/server-deployment.md for multi-tenant setup and billing integration.
All server configuration is via environment variables. See .env.example for full reference. Key variables:
| Variable | Default | Description |
|---|---|---|
| AXONIZE_API_KEY | (generated) | Static API key for single-tenant mode |
| AXONIZE_AUTH_MODE | static | static (single tenant) or multi_tenant |
| AXONIZE_ADMIN_KEY | (none) | Admin API key (multi-tenant mode only) |
| POSTGRES_HOST | postgres | PostgreSQL host |
| GRPC_PORT | 4317 | gRPC ingest port |
| HTTP_PORT | 8080 | HTTP API port |
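
Scripts that talk to the server often need the same values. The sketch below mirrors the defaults from the table in Python so a client script can resolve them the same way. It is an assumption-laden illustration: the server itself is written in Go, `ServerSettings` and `load_settings` are hypothetical names, and the real server may validate these variables differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ServerSettings:
    auth_mode: str
    api_key: Optional[str]
    postgres_host: str
    grpc_port: int
    http_port: int


def load_settings(env: Mapping[str, str] = os.environ) -> ServerSettings:
    """Read the documented variables, falling back to the documented defaults."""
    auth_mode = env.get("AXONIZE_AUTH_MODE", "static")
    if auth_mode not in ("static", "multi_tenant"):
        raise ValueError(f"unsupported AXONIZE_AUTH_MODE: {auth_mode!r}")
    return ServerSettings(
        auth_mode=auth_mode,
        # None means no key was supplied; the server then generates one.
        api_key=env.get("AXONIZE_API_KEY"),
        postgres_host=env.get("POSTGRES_HOST", "postgres"),
        grpc_port=int(env.get("GRPC_PORT", "4317")),
        http_port=int(env.get("HTTP_PORT", "8080")),
    )


defaults = load_settings({})
overridden = load_settings({"HTTP_PORT": "9090", "AXONIZE_AUTH_MODE": "multi_tenant"})
```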
Data retention (server-side cleanup):
- Spans: 30 days
- GPU metrics: 7 days
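
To see what these windows mean for a given moment, the cutoff timestamps can be computed directly. This sketch only does the date arithmetic for the two retention periods above; `retention_cutoff` is a hypothetical helper, and how the server actually schedules or performs cleanup is not described here.

```python
from datetime import datetime, timedelta, timezone

# Retention windows as documented above.
RETENTION = {
    "spans": timedelta(days=30),
    "gpu_metrics": timedelta(days=7),
}


def retention_cutoff(kind: str, now: datetime) -> datetime:
    """Rows older than the returned timestamp fall outside the retention window."""
    return now - RETENTION[kind]


now = datetime(2024, 3, 31, tzinfo=timezone.utc)
span_cutoff = retention_cutoff("spans", now)        # 2024-03-01 00:00 UTC
gpu_cutoff = retention_cutoff("gpu_metrics", now)   # 2024-03-24 00:00 UTC
```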
# Start dev database (PostgreSQL only)
make dev
# Run SDK tests
make test-sdk
# Run all linters
make lint
# Start dashboard dev server (hot-reload on port 3000)
make dev-dashboard
# Run E2E tests (requires full stack)
make dev-all && make migrate && make test-e2e

Prerequisites: All examples require the server to be running (docker compose up -d && make migrate).
See the examples/ directory for integration guides:
- quickstart.py — Minimal setup
- vllm_integration.py — vLLM with streaming tokens
- ollama_integration.py — Ollama chat with TTFT tracking
- diffusers_integration.py — HuggingFace Diffusers pipeline
- custom_model.py — General-purpose integration pattern
axonize/
├── sdk-py/ Python SDK (pip install axonize)
├── server/ Go ingest + query server
│ ├── internal/ingest/ OTLP gRPC handler
│ ├── internal/api/ REST API endpoints
│ ├── internal/store/ PostgreSQL store
│ └── internal/tenant/ Multi-tenant auth + usage tracking
├── dashboard/ React + Vite dashboard
├── migrations/ Database schemas
│ └── postgres/ All tables (spans, gpu_metrics, GPU registry, tenants)
├── examples/ Integration examples
├── tests/ E2E + load tests
├── docs/ Documentation
├── .env.example Configuration reference
└── config.yaml Server config (overridden by env vars)
- Getting Started — Step-by-step setup guide
- SDK API Reference — Complete SDK documentation
- Server Deployment — Production deployment guide
- Troubleshooting — Common issues and solutions
- Architecture — Technical design and data model
Apache-2.0