docs(grafana): full dashboard with HTTP histogram + cascade indicators by JasonLovesDoggo · Pull Request #24 · JasonLovesDoggo/abacus

JasonLovesDoggo · 2026-05-19T16:30:57Z

What this is

A drop-in docs/grafana-dashboard.json that replaces the older partial dashboard with one that covers everything /metrics currently emits, plus first-class panels for the new HTTP request histogram from #22.

You can upload it right now while the current spike is live: it's a static file with no code change. In Grafana, go to Dashboards → New → Import, paste the JSON, and pick your Prometheus datasource for ${DS_PROMETHEUS}.

Design choices worth knowing

Top row is the spike-diagnosis bar. Six big stat tiles, threshold-colored:

RPS (global)
p99 HTTP (global)
Pool timeout rate (red on any non-zero)
Total goroutines (yellow >500, red >1500)
5xx rate (red on any non-zero)
EXPIRE gate skip ratio (green >0.9)

If any of these turn red, the rest of the dashboard tells you why.
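For reference, a minimal sketch of how a tile's thresholds map to colors, using the goroutine tile as the example. The dict mirrors the general shape of Grafana's `thresholds` field config; the helper names `threshold_steps` and `color_for` are hypothetical and not taken from the dashboard JSON. Each step is treated as a lower bound, so a value of exactly 500 comes out yellow here, which may differ slightly from the strict ">500" above.

```python
def threshold_steps(yellow, red):
    """Build an absolute-threshold config: green base, then yellow, then red."""
    return {
        "mode": "absolute",
        "steps": [
            {"color": "green", "value": None},   # base step, no lower bound
            {"color": "yellow", "value": yellow},
            {"color": "red", "value": red},
        ],
    }


def color_for(value, thresholds):
    """Return the color of the highest step whose lower bound the value reaches."""
    steps = thresholds["steps"]
    color = steps[0]["color"]
    for step in steps[1:]:
        if value >= step["value"]:
            color = step["color"]
    return color


# Goroutine tile from the list above: yellow at 500, red at 1500.
goroutine_thresholds = threshold_steps(500, 1500)
```

With these numbers, the 2k+ goroutine counts seen in the earlier cascades would land in the red band.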

Goroutine count gets its own dedicated panel with a red threshold line at 1500. In both cascades we've seen, goroutines spiked to 2k+ before latency moved, which makes the count the leading indicator. The panel is per-instance so we can spot one bad machine.

HTTP per-endpoint section uses the new histogram from #22. p50, p95, p99 each get their own timeseries grouped by route. Combined with the existing Redis per-cmd quantiles, you can finally answer "is the slowness in the handler or in Redis" by comparing both panels side-by-side.
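A sketch of the kind of query behind the per-route quantile panels. The exact expressions live in the dashboard JSON; this only illustrates the usual `histogram_quantile` pattern. The bucket series name (`abacus_http_request_duration_bucket`), the `route` label, and the `5m` window are assumptions here, and the helper `quantile_by_route` is hypothetical.

```python
def quantile_by_route(q, metric="abacus_http_request_duration",
                      label="route", window="5m"):
    """Build a PromQL expression for the q-quantile of a histogram, per label value.

    Assumes the histogram exposes a '<metric>_bucket' series with an 'le' label,
    which is the standard Prometheus histogram layout.
    """
    if not 0 < q < 1:
        raise ValueError("q must be strictly between 0 and 1")
    return (
        f"histogram_quantile({q}, "
        f"sum by (le, {label}) (rate({metric}_bucket[{window}])))"
    )


# One expression per timeseries panel: p50, p95, p99.
panel_queries = {name: quantile_by_route(q)
                 for name, q in (("p50", 0.5), ("p95", 0.95), ("p99", 0.99))}
```

Keeping `le` in the `sum by` clause matters: `histogram_quantile` needs the bucket boundaries to interpolate, so dropping that label breaks the estimate.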

Pool timeouts are shown as a RATE, not absolute count. The absolute count is a counter that only ever goes up; the rate makes the cascade onset visible as a step change.
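To make the counter-vs-rate point concrete, here is a simplified sketch that turns counter samples into per-second rates. It is not how PromQL's `rate()` is implemented (that also extrapolates to the window edges); the function name and the sample values are made up for illustration.

```python
def per_second_rates(samples):
    """Convert (timestamp_seconds, counter_value) samples into per-second rates.

    A drop in the counter is treated as a process restart, so the new value
    is taken as the increase since the reset.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        delta = v1 - v0
        if delta < 0:  # counter reset
            delta = v1
        rates.append(delta / (t1 - t0))
    return rates


# Hypothetical scrape every 30s: no timeouts for a minute, then 1/s.
samples = [(0, 0), (30, 0), (60, 0), (90, 30), (120, 60)]
rates = per_second_rates(samples)
```

The raw counter just keeps climbing (0, 0, 0, 30, 60), while the rate series jumps from 0 to 1 per second at the onset, which is the step a threshold can catch.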

What's still missing

This dashboard reads exclusively from app-side metrics. That leaves the two blind spots that caused the recent spikes:

Redis-internal state — BGSAVE / AOF rewrite / mem fragmentation / connected_clients. Would need redis_exporter running next to your Redis box.
Network path latency — Fly yyz ↔ self-hosted Redis. Would need a ping/probe metric. We can derive it indirectly from dial time in go-redis but it's not exposed today.

Both are separate work, not in this PR.
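For whoever picks up the network-path item, a rough sketch of what a probe could measure: the time to complete a TCP connect to the Redis host. This is an illustration only, not what go-redis does internally and not part of this PR; `dial_time` is a hypothetical helper, and the host and port would be whatever your Redis box listens on.

```python
import socket
import time


def dial_time(host, port, timeout=2.0):
    """Return the seconds taken to establish a TCP connection to host:port.

    Raises OSError (including timeouts) if the connection cannot be made.
    """
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        return time.perf_counter() - start
```

Exporting this value on a timer as a gauge or histogram would give the dashboard a direct view of the Fly-to-Redis path, separate from command latency.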

Test plan

Valid JSON, 33 panels, 7 rows
All PromQL queries reference metrics actually exposed by current main
Manual: import into the live Grafana, confirm all panels populate within one scrape interval
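The first check can be reproduced with a few lines. This is a sketch, assuming the standard Grafana layout where rows are panels of type `row` and collapsed rows nest their children under their own `panels` key. Whether the "33 panels" figure includes the row headers is not stated here, so the function reports the two counts separately.

```python
def count_panels(dashboard):
    """Return (non_row_panels, rows) for a parsed Grafana dashboard dict."""
    rows = 0
    panels = 0
    for panel in dashboard.get("panels", []):
        if panel.get("type") == "row":
            rows += 1
            # Collapsed rows carry their child panels inline.
            panels += len(panel.get("panels", []))
        else:
            panels += 1
    return panels, rows
```

To run it against the file, parse it first with `json.load` on `docs/grafana-dashboard.json`; a parse error there also covers the "valid JSON" item.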

🤖 Generated with Claude Code

…ascade indicators

Replaces the prior dashboard with one that covers everything currently emitted by /metrics:

- Top 'is it on fire' row: RPS, HTTP p99, pool-timeout rate, total goroutines, 5xx rate, EXPIRE gate skip ratio. All threshold-colored so the dashboard turns red when something's wrong.
- HTTP per-endpoint section using the new abacus_http_request_duration histogram from #22: p50/p95/p99 per route, request rate per route, status class breakdown (2xx/4xx/5xx), error rate as % of traffic per endpoint.
- Redis per-cmd quantiles section unchanged but joined by an avg-latency bargauge for quick read.
- Redis pool section emphasizes timeout RATE (not absolute) since the rate is the cascade indicator.
- New EXPIRE coalescer section showing refreshed/s vs skipped/s and cache size per instance.
- Goroutines panel marked as the cascade signal, with red threshold line at 1500. Per-instance so we can spot a single bad machine.
- CPU rate panel calibrated to the 2 vCPU upper bound.

Datasource template variable so Grafana prompts on import. Refresh 30s. Tags: abacus, fly, performance. Total 33 panels across 7 rows.

Cardinality budget per panel is inherited from the underlying metric labels; nothing here introduces new label dimensions.

JasonLovesDoggo wants to merge 1 commit into main from docs/grafana-dashboard-v2
