docs(grafana): full dashboard with HTTP histogram + cascade indicators #24

Open
JasonLovesDoggo wants to merge 1 commit into main from docs/grafana-dashboard-v2

@JasonLovesDoggo

Owner

What this is

A drop-in docs/grafana-dashboard.json that replaces the older partial dashboard with one that covers everything /metrics currently emits, plus first-class panels for the new HTTP request histogram from #22.

You can upload it right now while the current spike is live — it's a static file, no code change. Grafana → Dashboards → New → Import → paste the JSON → pick your Prometheus datasource for ${DS_PROMETHEUS}.

Design choices worth knowing

Top row is the spike-diagnosis bar. Six big stat tiles, threshold-colored:

  • RPS (global)
  • p99 HTTP (global)
  • Pool timeout rate (red on any non-zero)
  • Total goroutines (yellow >500, red >1500)
  • 5xx rate (red on any non-zero)
  • EXPIRE gate skip ratio (green >0.9)

If any of these turn red, the rest of the dashboard tells you why.
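As a rough illustration of how the threshold coloring on these tiles resolves, here is a minimal sketch for the goroutine tile only. The step layout follows the shape Grafana stores under `fieldConfig.defaults.thresholds`; the `color_for` helper is hypothetical and only mimics how a reading maps to a color. It is not taken from the dashboard file itself.

```python
import json

# Threshold steps for the "Total goroutines" tile (yellow at 500, red at 1500).
# A step applies from its value upward; the base step uses None (JSON null).
GOROUTINE_THRESHOLDS = {
    "mode": "absolute",
    "steps": [
        {"color": "green", "value": None},
        {"color": "yellow", "value": 500},
        {"color": "red", "value": 1500},
    ],
}


def color_for(thresholds, value):
    """Return the color of the highest step whose value is <= the reading.

    Hypothetical helper: assumes the steps are sorted in ascending order.
    """
    color = None
    for step in thresholds["steps"]:
        if step["value"] is None or value >= step["value"]:
            color = step["color"]
    return color


# The same structure serialized the way it would sit inside a panel definition.
THRESHOLDS_JSON = json.dumps(GOROUTINE_THRESHOLDS, indent=2)

for reading in (120, 800, 2000):
    print(reading, color_for(GOROUTINE_THRESHOLDS, reading))
```

Note that a step takes effect at its value, so a reading of exactly 500 already counts as yellow in this sketch.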

Goroutine count gets its own dedicated panel with a red threshold line at 1500. In both cascades we've seen, goroutines spiked to 2k+ before latency moved, so the count is our leading indicator. The panel is broken out per instance so we can spot one bad machine.

HTTP per-endpoint section uses the new histogram from #22. p50, p95, p99 each get their own timeseries grouped by route. Combined with the existing Redis per-cmd quantiles, you can finally answer "is the slowness in the handler or in Redis" by comparing both panels side-by-side.
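For readers unfamiliar with how p50/p95/p99 come out of a bucketed histogram, the sketch below shows a simplified version of the linear interpolation that a quantile-over-buckets estimate relies on. The bucket bounds and counts are made up for illustration, and the function is a hypothetical stand-in, not the query the dashboard runs.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch: buckets must be sorted by upper bound, hold at least
    two entries, and end with a +Inf bucket carrying the total count.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank lands in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            if count == prev_count:
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-2][0]


# Hypothetical cumulative buckets (seconds) for one route.
ROUTE_BUCKETS = [(0.05, 50), (0.1, 90), (0.5, 99), (math.inf, 100)]

for q in (0.5, 0.95, 0.99):
    print(f"p{int(q * 100)}", round(histogram_quantile(q, ROUTE_BUCKETS), 4))
```

The estimate is only as fine as the bucket boundaries, which is worth keeping in mind when comparing the HTTP panels against the Redis per-cmd quantiles.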

Pool timeouts are shown as a RATE, not an absolute count. The underlying counter only ever goes up, so the raw value hides when the timeouts actually happened; the rate makes the cascade onset visible as a step change.
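To make the counter-versus-rate point concrete, here is a small sketch that turns scraped counter samples into per-second rates. The sample values and the 30-second spacing are invented, and the calculation is deliberately simpler than what a Prometheus range query does (no window extrapolation); it only shows why the onset appears as a step.

```python
def per_second_rates(samples):
    """Convert (timestamp_seconds, counter_value) samples into rates.

    Hypothetical helper: a drop in the counter is treated as a process
    restart, so the new value is counted from zero.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        delta = v1 - v0 if v1 >= v0 else v1
        rates.append(delta / (t1 - t0))
    return rates


# Invented pool-timeout counter: flat at 12, then climbing 30 per scrape.
SAMPLES = [(0, 12), (30, 12), (60, 12), (90, 42), (120, 72)]

print(per_second_rates(SAMPLES))  # [0.0, 0.0, 1.0, 1.0]
```

The raw counter reads 12 during the quiet period, which looks like a problem but is not; the rate stays at zero until the cascade starts.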

What's still missing

This dashboard reads exclusively from app-side metrics. That leaves two blind spots, both in the areas that caused the recent spikes:

  • Redis-internal state — BGSAVE / AOF rewrite / mem fragmentation / connected_clients. Would need redis_exporter running next to your Redis box.
  • Network path latency — Fly yyz ↔ self-hosted Redis. Would need a ping/probe metric. We can derive it indirectly from dial time in go-redis but it's not exposed today.

Both are separate work, not in this PR.

Test plan

  • Valid JSON, 33 panels, 7 rows
  • All PromQL queries reference metrics actually exposed by current main
  • Manual: import into the live Grafana, confirm all panels populate within one scrape interval
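The first two checks could be scripted along these lines. This is a sketch under assumptions: it treats the "33 panels" figure as excluding row panels, reads queries from each target's `expr` field, and uses a tiny inline dashboard (with placeholder queries) instead of the real file.

```python
import json


def iter_panels(panels):
    """Yield every panel, descending into rows that nest their children."""
    for panel in panels:
        yield panel
        # Collapsed rows carry their children in a nested "panels" list.
        yield from iter_panels(panel.get("panels", []))


def summarize(dashboard):
    """Return (panel_count, row_count, queries) for a dashboard dict."""
    rows, panels, exprs = 0, 0, []
    for panel in iter_panels(dashboard.get("panels", [])):
        if panel.get("type") == "row":
            rows += 1
        else:
            panels += 1
            exprs.extend(t["expr"] for t in panel.get("targets", []) if "expr" in t)
    return panels, rows, exprs


# Hypothetical miniature dashboard; json.loads fails loudly on invalid JSON.
SAMPLE = json.loads("""
{
  "title": "sample",
  "panels": [
    {"type": "row", "title": "Overview", "panels": []},
    {"type": "stat", "title": "Goroutines",
     "targets": [{"expr": "sum(go_goroutines)"}]},
    {"type": "row", "title": "HTTP", "collapsed": true, "panels": [
      {"type": "timeseries", "title": "p99", "targets": [{"expr": "up"}]}
    ]}
  ]
}
""")

print(summarize(SAMPLE))
```

Pointing `summarize` at the parsed contents of docs/grafana-dashboard.json should report the panel and row counts from the test plan, and the returned query list can then be compared by hand against the metric names on /metrics.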

🤖 Generated with Claude Code

…ascade indicators

Replaces the prior dashboard with one that covers everything currently
emitted by /metrics:

  - Top 'is it on fire' row: RPS, HTTP p99, pool-timeout rate, total
    goroutines, 5xx rate, EXPIRE gate skip ratio. All threshold-colored
    so the dashboard turns red when something's wrong.

  - HTTP per-endpoint section using the new abacus_http_request_duration
    histogram from #22: p50/p95/p99 per route, request rate per route,
    status class breakdown (2xx/4xx/5xx), error rate as % of traffic
    per endpoint.

  - Redis per-cmd quantiles section unchanged but joined by an avg-
    latency bargauge for quick read.

  - Redis pool section emphasizes timeout RATE (not absolute) since the
    rate is the cascade indicator.

  - New EXPIRE coalescer section showing refreshed/s vs skipped/s and
    cache size per instance.

  - Goroutines panel marked as the cascade signal, with red threshold
    line at 1500. Per-instance so we can spot a single bad machine.

  - CPU rate panel calibrated to the 2 vCPU upper bound.

Datasource template variable so Grafana prompts on import. Refresh
30s. Tags: abacus, fly, performance.

Total 33 panels across 7 rows. Cardinality budget per panel is
inherited from the underlying metric labels — nothing here introduces
new label dimensions.