docs(grafana): full dashboard with HTTP histogram + cascade indicators #24
Open
JasonLovesDoggo wants to merge 1 commit into
Replaces the prior dashboard with one that covers everything currently
emitted by /metrics:
- Top 'is it on fire' row: RPS, HTTP p99, pool-timeout rate, total
goroutines, 5xx rate, EXPIRE gate skip ratio. All threshold-colored
so the dashboard turns red when something's wrong.
- HTTP per-endpoint section using the new abacus_http_request_duration
histogram from #22: p50/p95/p99 per route, request rate per route,
status class breakdown (2xx/4xx/5xx), error rate as % of traffic
per endpoint.
- Redis per-cmd quantiles section unchanged but joined by an avg-
latency bargauge for quick read.
- Redis pool section emphasizes timeout RATE (not absolute) since the
rate is the cascade indicator.
- New EXPIRE coalescer section showing refreshed/s vs skipped/s and
cache size per instance.
- Goroutines panel marked as the cascade signal, with red threshold
line at 1500. Per-instance so we can spot a single bad machine.
- CPU rate panel calibrated to the 2 vCPU upper bound.
Datasource template variable so Grafana prompts on import. Refresh
30s. Tags: abacus, fly, performance.
Total 33 panels across 7 rows. Cardinality budget per panel is
inherited from the underlying metric labels — nothing here introduces
new label dimensions.
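The row and panel counts, and the datasource placeholder, can be sanity-checked before importing. The sketch below is only an illustration: it assumes the file follows Grafana's usual export layout (a top-level `__inputs` list and panels optionally nested inside collapsed rows), and the embedded sample is a made-up, heavily trimmed stand-in for `docs/grafana-dashboard.json`, not its real contents. Rows are counted separately from panels here, which may not be the convention behind the totals above.

```python
import json

# Hypothetical, heavily trimmed stand-in for docs/grafana-dashboard.json.
# Panel titles and structure are invented for illustration only.
SAMPLE_DASHBOARD = json.loads("""
{
  "__inputs": [
    {"name": "DS_PROMETHEUS", "type": "datasource", "pluginId": "prometheus"}
  ],
  "refresh": "30s",
  "tags": ["abacus", "fly", "performance"],
  "panels": [
    {"type": "row", "title": "Overview", "panels": []},
    {"type": "stat", "title": "RPS",
     "datasource": {"type": "prometheus", "uid": "${DS_PROMETHEUS}"}},
    {"type": "row", "title": "Goroutines", "collapsed": true, "panels": [
      {"type": "timeseries", "title": "Goroutines per instance",
       "datasource": {"type": "prometheus", "uid": "${DS_PROMETHEUS}"}}
    ]}
  ]
}
""")


def datasource_ref(panel):
    """Return the panel's datasource reference (older exports use a bare string)."""
    ds = panel.get("datasource")
    return ds.get("uid") if isinstance(ds, dict) else ds


def summarize(dashboard):
    """Count rows and panels, and list panels not bound to ${DS_PROMETHEUS}."""
    rows = 0
    panels = []
    for item in dashboard.get("panels", []):
        if item.get("type") == "row":
            rows += 1
            # Collapsed rows carry their panels in a nested list.
            panels.extend(item.get("panels", []))
        else:
            panels.append(item)
    return {
        "rows": rows,
        "panels": len(panels),
        "inputs": [i["name"] for i in dashboard.get("__inputs", [])],
        "unbound": [p.get("title") for p in panels
                    if datasource_ref(p) != "${DS_PROMETHEUS}"],
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_DASHBOARD))
```

To check the real file, replace the sample with `json.load(open("docs/grafana-dashboard.json"))`; a non-empty `unbound` list would point at panels that will not follow the datasource chosen at import.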
What this is
A drop-in docs/grafana-dashboard.json that replaces the older partial dashboard with one that covers everything /metrics currently emits, plus first-class panels for the new HTTP request histogram from #22.
You can upload it right now while the current spike is live: it's a static file, no code change. Grafana → Dashboards → New → Import → paste the JSON → pick your Prometheus datasource for ${DS_PROMETHEUS}.
Design choices worth knowing
Top row is the spike-diagnosis bar. Six big stat tiles, threshold-colored: RPS, HTTP p99, pool-timeout rate, total goroutines, 5xx rate, and EXPIRE gate skip ratio.
If any of these turn red, the rest of the dashboard tells you why.
Goroutine count gets its own dedicated panel with a red threshold line at 1500. The two cascades we've seen both showed goroutines spiking to 2k+ before latency moved — it's the leading indicator. Per-instance so we can spot one bad machine.
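As a rough illustration of why the panel is per-instance, the sketch below applies the 1500 threshold to each instance separately. The instance names and counts are hypothetical; only the threshold value comes from the panel description.

```python
GOROUTINE_THRESHOLD = 1500  # red threshold line on the goroutine panel


def instances_over_threshold(samples, threshold=GOROUTINE_THRESHOLD):
    """Return the names of instances whose goroutine count exceeds the threshold."""
    return sorted(name for name, count in samples.items() if count > threshold)


# Hypothetical snapshot: one machine is cascading, the other two are healthy.
samples = {"instance-a": 310, "instance-b": 2140, "instance-c": 295}

# A fleet-wide average stays under the threshold and would hide the bad machine.
fleet_mean = sum(samples.values()) / len(samples)

if __name__ == "__main__":
    print("over threshold:", instances_over_threshold(samples))
    print("fleet mean:", fleet_mean)
```

With this snapshot the mean stays below 1500 even though one instance is well past it, which is the case a single aggregated series would miss.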
HTTP per-endpoint section uses the new histogram from #22. p50, p95, p99 each get their own timeseries grouped by route. Combined with the existing Redis per-cmd quantiles, you can finally answer "is the slowness in the handler or in Redis" by comparing both panels side-by-side.
Pool timeouts are shown as a RATE, not absolute count. Absolute timeouts only ever go up; the rate makes the cascade onset visible as a step change.
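The difference between the absolute count and the rate can be sketched with a few made-up scrapes. This is a simplified stand-in for what a PromQL `rate()` query does (the real function also extrapolates across the query window); the 30-second scrape interval and the sample values are assumptions for illustration.

```python
def per_second_rate(samples):
    """Convert cumulative counter samples into per-second rates.

    samples: list of (timestamp_seconds, cumulative_count), sorted by time.
    Returns one (timestamp, rate) pair per consecutive pair of samples.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        delta = v1 - v0
        if delta < 0:
            # Counter reset (e.g. process restart): count from zero again.
            delta = v1
        rates.append((t1, delta / (t1 - t0)))
    return rates


# Hypothetical pool-timeout counter scraped every 30 seconds.
# The counter is flat at 100, then starts climbing when the cascade begins.
samples = [(0, 100), (30, 100), (60, 100), (90, 160), (120, 220)]

if __name__ == "__main__":
    for ts, rate in per_second_rate(samples):
        print(f"t={ts:>3}s  rate={rate:.1f}/s")
```

The raw counter only shows a line that was already at 100 bending upward, while the rate goes from 0 to 2 timeouts per second in a single step at the onset.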
What's still missing
This dashboard reads exclusively from app-side metrics. The two blind spots that caused the recent spikes:
- redis_exporter running next to your Redis box.
- dial time in go-redis but it's not exposed today.
Both are separate work, not in this PR.
Test plan
🤖 Generated with Claude Code