Monitoring Stack

A Prometheus + Grafana + Alertmanager observability stack you can clone and exercise end-to-end in under a minute. Built as a DevOps/SRE portfolio piece — production-shaped, not production-ready.

What's inside

Instrumented Go service with /api/{fast,slow,error,flaky} endpoints, a histogram, a counter, and an in-flight gauge — enough variety to make dashboards interesting and fire a real burn-rate alert.
Docker Compose stack (Prometheus 2.55, Grafana 11.2, Alertmanager 0.27, node-exporter, cadvisor, app) — every image pinned, every service healthchecked.
Three hand-built Grafana dashboards provisioned from JSON — app RED, host + containers, Prometheus self-monitoring. No copy-pasted community imports.
About 10 alert rules across host, container, and app categories, including a multi-window, multi-burn-rate SLO alert modeled on the Google SRE Workbook.
Alertmanager config with severity-based routing and an AppDown-inhibits-derivatives rule.
GitHub Actions smoke test — every PR boots the entire stack and asserts Prometheus targets are UP, rule groups loaded, Grafana healthy.
Three deployment paths — docker compose (default), Kubernetes via kube-prometheus-stack (k8s/), or single-instance AWS EC2 via Terraform (terraform/).
Runbooks for every alert (docs/RUNBOOK.md) — the SRE artifact most demo repos skip.
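
The multi-window multi-burn-rate alert in the list above comes down to a small piece of arithmetic. Below is a minimal Python sketch of that check, not the repo's actual rule. It assumes the 99% availability SLO described in docs/SLOs.md and the SRE Workbook's fast-burn defaults (a 14.4x burn-rate threshold, with a 1h long window and a 5m short window). The repo's real windows and threshold may differ, and the function names are hypothetical.

```python
def burn_rate(error_ratio: float, slo: float = 0.99) -> float:
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    error_budget = 1.0 - slo  # a 99% SLO leaves 1% of requests as budget
    return error_ratio / error_budget


def fast_burn_fires(long_window_error_ratio: float,
                    short_window_error_ratio: float,
                    slo: float = 0.99,
                    threshold: float = 14.4) -> bool:
    """Fire only when BOTH windows burn faster than the threshold.

    The long window (assumed 1h) shows the burn is significant; the short
    window (assumed 5m) shows it is still happening, so the alert resets
    quickly once the errors stop.
    """
    return (burn_rate(long_window_error_ratio, slo) > threshold
            and burn_rate(short_window_error_ratio, slo) > threshold)


print(fast_burn_fires(0.20, 0.20))  # True: both windows at 20% errors
print(fast_burn_fires(0.20, 0.05))  # False: the short window has recovered
```

With a 1% budget, a 14.4x burn rate corresponds to an error ratio of roughly 14.4%, which is why a sustained stream of failing requests is enough to trip the alert.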

Architecture

flowchart LR
  App[sample-app:8080<br/>/metrics] -- scrape --> Prom
  NE[node-exporter] -- scrape --> Prom
  CA[cadvisor] -- scrape --> Prom
  Prom[Prometheus<br/>rules + TSDB] -- alerts --> AM[Alertmanager]
  AM -- webhook --> Receiver[(External receiver)]
  Prom -- query --> Graf[Grafana<br/>provisioned dashboards]

Full diagram with component table: docs/ARCHITECTURE.md.

30-second quick start

git clone https://github.com/yashyaadav/monitoring_stack.git
cd monitoring_stack
cp .env.example .env
make up                # docker compose up -d --wait
make smoke             # asserts targets UP, rules loaded, Grafana healthy

URL	What
http://localhost:3000	Grafana (admin / admin)
http://localhost:9090	Prometheus
http://localhost:9093	Alertmanager
http://localhost:8080/metrics	Sample app metrics
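
How make smoke is implemented is not shown here, but the "targets UP" assertion can be expressed against the Prometheus HTTP API, whose /api/v1/targets endpoint returns each active target with a health field. The Python sketch below is one way to do it, under assumptions: the function name, the canned response, and the scrape URLs in it are illustrative, not taken from the repo.

```python
import json


def down_targets(targets_response: str) -> list[str]:
    """Return the scrape URLs of active targets whose health is not 'up'."""
    payload = json.loads(targets_response)
    if payload.get("status") != "success":
        raise ValueError("Prometheus API did not return status=success")
    active = payload["data"]["activeTargets"]
    return [t.get("scrapeUrl", "<unknown>") for t in active if t.get("health") != "up"]


# Canned body shaped like GET http://localhost:9090/api/v1/targets.
# The target URLs are hypothetical placeholders.
sample = json.dumps({
    "status": "success",
    "data": {
        "activeTargets": [
            {"scrapeUrl": "http://app:8080/metrics", "health": "up"},
            {"scrapeUrl": "http://node-exporter:9100/metrics", "health": "down"},
        ],
        "droppedTargets": [],
    },
})

failing = down_targets(sample)
print(failing)  # a smoke test would fail the build if this list is non-empty
```

In a real check the response body would come from an HTTP request against the running stack; keeping the parsing separate from the request makes it easy to test offline.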

Fire an alert (the money shot)

make load-error   # ./scripts/load.sh error 300

After ~5 minutes, AppHighErrorRate transitions from Pending to Firing in Prometheus (http://localhost:9090/alerts) and then shows up in Alertmanager (http://localhost:9093). Stop the app entirely (docker compose stop app) and AppDown fires within 2 minutes, while the inhibition rule suppresses AppHighErrorRate, AppHighLatencyP95, and AppSLOBurnRateFast.
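
To make the suppression behavior concrete, here is a small Python model of how one Alertmanager inhibit rule decides which alerts to mute. It is a sketch of the semantics, not Alertmanager's code or this repo's config. The job label used for matching and the HostHighCPU alert are assumptions added for illustration.

```python
def suppressed_alerts(firing: list[dict], source: str, targets: set[str],
                      equal: tuple[str, ...] = ("job",)) -> list[dict]:
    """Model one inhibit rule: a firing `source` alert mutes `targets`
    that carry the same values for every label listed in `equal`."""
    sources = [a for a in firing if a["alertname"] == source]
    muted = []
    for alert in firing:
        if alert["alertname"] not in targets:
            continue
        # Muted when at least one firing source agrees on all `equal` labels
        # (a missing label is treated like an empty value).
        if any(all(alert.get(l, "") == s.get(l, "") for l in equal) for s in sources):
            muted.append(alert)
    return muted


firing = [
    {"alertname": "AppDown", "job": "app"},
    {"alertname": "AppHighErrorRate", "job": "app"},
    {"alertname": "AppSLOBurnRateFast", "job": "app"},
    {"alertname": "HostHighCPU", "job": "node"},  # hypothetical unrelated alert
]
derivatives = {"AppHighErrorRate", "AppHighLatencyP95", "AppSLOBurnRateFast"}

muted = suppressed_alerts(firing, "AppDown", derivatives)
print([a["alertname"] for a in muted])  # ['AppHighErrorRate', 'AppSLOBurnRateFast']
```

Only AppDown and the unrelated host alert would still notify, which is the single-page outcome the inhibition rule is meant to produce.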

Docs

Doc	What's in it
ARCHITECTURE.md	Component table, full data-flow diagram, scope decisions.
INSTALLATION.md	All three deployment paths step-by-step.
CONFIGURATION.md	"I want to change X" → which file to edit.
ALERTING.md	Rule catalog, routing, inhibition, wiring real receivers.
SLOs.md	The 99% availability SLO + multi-window burn-rate alert math.
RUNBOOK.md	One section per alert: symptoms / causes / commands / remediation.
TROUBLESHOOTING.md	Common local-setup pitfalls.
CONTRIBUTING.md	Lint/test commands and the PR checklist for adding new rules.

Other deployment paths

Kubernetes — k8s/ ships a kube-prometheus-stack values file plus a PrometheusRule CR for the app. One helm upgrade --install.
AWS EC2 — terraform/ is a single-file Terraform module that provisions one EC2 instance, installs Docker, and runs the Compose stack via user_data. Demo-grade; explicitly not multi-env.

Why this exists / what I learned

I built this because the previous version of this repo was a Helm-install transcript: useful as a tutorial, not as a portfolio piece. The rebuild forced me to make opinionated calls a tutorial avoids:

What does "production-shaped" actually look like for a demo? Pinned image versions, healthchecks everywhere, alerts paired with runbooks, an SLO with budget math rather than a flat threshold. Not production-ready — there's no remote-write, no Loki/Tempo, no auth in front of Grafana — but every artifact is the same shape as its production counterpart.
Dashboards-as-code beats imported dashboards. Three hand-built JSON files that I can read in a diff are more useful than thirty community imports I can't reason about.
Inhibition is the alert design choice that matters most. A naive ruleset pages you 3× when one thing dies (AppDown + AppHighErrorRate + AppSLOBurnRateFast). The inhibit rule is the difference between an SRE setup and an alert-spam machine.
The runbook is the highest-signal artifact in the whole repo. I'd rather ship 10 alerts with runbooks than 50 without.

What I'd add next

Loki + Promtail for the logs plane; correlate from a Grafana alert into the relevant log slice.
Tempo for traces, with the sample app emitting OpenTelemetry spans on /api/slow to demonstrate trace exemplars in Prometheus histograms.
Recording rules for the SLO SLI, mirrored in Grafana panels so dashboards and alerts read the same materialized series.
A slow-burn companion to AppSLOBurnRateFast (24h / 3d windows) once the service has enough history to make slow burn meaningful.

License

MIT.
