Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 7 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -35,3 +35,10 @@ frontend/dist/
# OS
.DS_Store
CLAUDE.md.bak

# Local (private) service registry — overrides the committed example so the
# real organization inventory never has to live in version control.
backend/config/*.local.yaml

# uv lockfile (runtime deps are pinned in backend/requirements.txt)
backend/uv.lock
220 changes: 220 additions & 0 deletions CASE-STUDY.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,220 @@
# Case Study: A Real-Time SaaS Status Dashboard for Enterprise IT

> **What it is:** A self-hosted service that polls the public status pages of ~30 enterprise SaaS
> tools every 60 seconds, normalizes a dozen incompatible vendor formats into one status model,
> detects changes, reasons about downstream impact, and alerts the operations team before the
> first user ticket lands.
>
> **Status:** v1 and v2 shipped. **356 tests passing.** ~16k LOC. Python 3.13 / FastAPI backend,
> React + Vite frontend, single-file SQLite datastore, runs on one Mac mini.

---

## The problem

In most IT organizations, the news that a critical SaaS tool is down travels in exactly the wrong
direction: a user hits a broken login, files a ticket, the ticket sits in a queue, and only then —
ten, twenty, sixty minutes later — does someone in IT realize the identity provider has been
degraded the whole time. The team learns about outages from the people it's supposed to be
protecting.

The information was public the entire time. Nearly every major SaaS vendor publishes a machine-readable
status page. The gap isn't data availability — it's that nobody is *watching* thirty status pages at
once, in thirty different formats, and connecting "Vendor X is degraded" to "therefore these internal
workflows are about to break."

This project closes that gap. It turns a reactive, ticket-driven posture into a proactive one: when a
vendor's own status page flips, the operations channel knows within one poll cycle — with an impact
statement attached, not just a raw status code.

---

## The solution

A small, self-contained monitoring service with five responsibilities, each isolated into its own
module so they can be tested and reasoned about independently:

```mermaid
flowchart TD
subgraph upstream["Upstream status sources (heterogeneous)"]
A1["Statuspage.io-powered pages<br/>(~half the fleet)"]
A2["Vendor-specific status APIs<br/>(collaboration, CRM, support, productivity suites)"]
A3["Operator-pushed manual updates<br/>(tools with no machine-readable status)"]
end

A1 & A2 & A3 --> P["<b>Poll Orchestrator</b><br/>APScheduler · 60s cycle<br/>coalesce · no overlap · misfire-aware"]
P --> R["<b>Resilience layer</b><br/>retry + backoff (stamina)<br/>per-host circuit breaker (purgatory)"]
R --> N["<b>Status Normalizer</b><br/>vendor strings → 5-state enum<br/>operational · degraded · partial · major · unknown"]
N --> C["<b>Change Detector</b><br/>diff vs. DB · two-axis health<br/>(is the vendor down? is the poller blind?)"]
C --> I["<b>Impact Engine</b><br/>service dependency graph<br/>→ 'who is affected downstream'"]
C --> AL["<b>Alert Router</b><br/>dedup · maintenance suppression<br/>flap suppression · tier routing"]
I --> AL
AL --> CH["Chat alerting<br/>(tiered: page / notify / dashboard-only)"]
C --> DB[("SQLite<br/>WAL · retention · snapshots")]
DB --> API["FastAPI REST<br/>/services /timeline /summary /metrics"]
API --> UI["React dashboard<br/>Executive + Engineer views · PWA"]
```

**Flow in one sentence:** poll many formats → make them resilient and uniform → detect what
*changed* → decide who it *affects* and whether it's worth interrupting a human → store it and show it.

---

## Engineering highlights

These are the parts that are non-obvious — the places where the naive version is easy and the
correct version takes real design.

### 1. One status model out of a dozen vendor dialects

Every vendor describes "things are broken" differently. Statuspage.io alone has two distinct
vocabularies — a page-level *indicator* (`none` / `minor` / `major` / `critical`) and per-component
*status* strings (`operational` / `degraded_performance` / `partial_outage` / `under_maintenance`) —
and they don't line up. Other vendors ship entirely bespoke JSON. Manual operator updates use a
third shape.

The normalizer collapses all of it into a single five-state enum
(`operational · degraded · partial_outage · major_outage · unknown`) via explicit per-source mapping
tables. The design decision that matters: **an unrecognized vendor string maps to `unknown` and emits
a warning** — it never silently guesses `operational`. A new value a vendor introduces tomorrow shows
up as a logged anomaly instead of a false all-clear. (`under_maintenance`, notably, maps to `degraded`
rather than a fake outage — maintenance is expected, not an incident.)

### 2. Two-axis health: "is the vendor down?" vs. "is my poller blind?"

This is the insight that separates a toy from a tool. A status of `unknown` is dangerously ambiguous —
it could mean the vendor is genuinely in trouble, or it could mean *our own fetch failed* and we have
no idea what's going on. Conflating the two means every network hiccup looks like an outage.

So the system tracks **two orthogonal axes**:

- **Service status** — what the vendor reports (the 5-state enum above).
- **Poller health** — `healthy → degraded → broken`, driven by a pure state machine: a successful
poll resets to `healthy`; failures short of a configurable threshold are `degraded`; sustained
failure past the threshold flips to `broken`.

When a poller goes `broken`, that's routed as a *distinct* signal — "we've gone blind on this service" —
separate from a vendor-outage alert, because they demand different human responses. The UI renders a
broken poller differently from a down vendor. You always know whether you're looking at reality or at
a gap in your own instrumentation.

### 3. Resilience: heal fast, fail loud, don't hammer the dead

Every outbound request goes through one resilience layer with two complementary mechanisms:

- **Retry with backoff + jitter** (via `stamina`) for *transient* trouble — network errors, timeouts,
and HTTP `408` / `429` / `5xx`. These self-heal, so the system gives them a few capped, jittered
attempts before giving up.
- **Per-host circuit breaker** (via `purgatory`) for *persistent* trouble. After N consecutive
failures a host's breaker opens and subsequent calls fast-fail for a TTL (default 5 minutes) before
probing again — avoiding the classic "re-probe a dead host every cycle and burn the whole poll
window" antipattern.

The sharp distinction: **HTTP `4xx` errors other than `408`/`429` are treated as hard failures and
surface immediately** — a `404` or `401` means a URL moved or auth changed; retrying can't fix config
rot, so it shouldn't mask it. And a tripped breaker is reported as *poller-unhealthy*, not
*vendor-down* — feeding straight back into the two-axis model above.

### 4. Alert quality: earn the interruption

A monitor that cries wolf gets muted, and a muted monitor is worthless. Several layers cooperate so
that a human is interrupted only when it's warranted:

- **Deduplication keyed on `(service, vendor_incident_id)`** — with a `(service, status, day)`
fallback when no vendor incident ID exists. Critically, **dedup never keys on message text**:
vendors edit incident titles mid-flight, and a text key would leak a fresh alert on every wording
change.
- **Maintenance-window suppression** — a scheduled maintenance records the state transition but does
not page anyone. Expected ≠ alarming.
- **Flap suppression** — a status must persist across a configurable number of confirming polls before
a worsening alert fires, and recover across another threshold before the all-clear, so a vendor
bouncing between states doesn't machine-gun the channel.
- **Tiered routing** — every service has a tier: `critical` pages with an `@here` mention,
`important` notifies without the mention, `informational` updates the dashboard and sends nothing.
- **Dependency correlation** — when one upstream failure knocks out several dependents, the router can
emit a *single aggregated* upstream alert instead of N separate downstream ones.

Every decision — sent or suppressed, and why — is written to a durable audit log. There's always an
answer to "what did we tell operators, and what did we hold back?"

### 5. From "Vendor X is degraded" to "here's who it hurts"

Raw status is low-value; *impact* is what an on-call human actually needs. A service dependency graph
(stored relationally, queried both upstream and downstream and ordered by severity) lets the impact
engine turn a single vendor event into a downstream blast-radius statement — "identity provider
degraded → these dependent workflows are at risk" — so the alert leads with consequences, not codes.

### 6. Scheduler discipline and observability

The poll loop is built to stay honest under load and to be debuggable in production:

- **Scheduler safety**: cycles `coalesce` (a late wake-up runs once, not as a backlog stampede),
`max_instances=1` (a slow cycle is skipped, never overlapped), and missed runs are logged and
counted rather than silently dropped.
- **Trace-without-tracing**: each cycle binds a fresh `poll_cycle_id` into structured logs, so every
line from one cycle is correlatable without a full distributed-tracing stack.
- **Operational signals**: a Prometheus `/metrics` endpoint (poll counts, durations, circuit-breaker
state, alert sent/suppressed counters), optional Sentry error tracking, and a Healthchecks.io
dead-man's-switch heartbeat that screams if the *monitor itself* goes dark.

### 7. Boring, durable data lifecycle

A single SQLite file in WAL mode is the whole datastore — deliberately. Around it: production pragmas,
automatic retention purges for event and alert-log tables, a daily `VACUUM INTO` snapshot, and
optional Litestream continuous replication. No database server to operate; full point-in-time recovery
if the host dies.

---

## Results / status

- **v1 (demo-ready) — shipped.** Polling, normalization, change detection, chat alerting, the React
UI, dependency graph, timeline, SLA tracking, incident clustering, and automated reports.
- **v2 (production-ready) — shipped.** Bearer-token auth on admin endpoints; the full resilience
layer (retry + circuit breaker); alert-quality stack (flap suppression, dedup, tier routing,
dependency correlation, maintenance windows); observability (structured logging, Prometheus, Sentry,
dead-man's switch); data lifecycle (retention, snapshots, replication); a productionized UI with a
severity-sorted grid, an Executive/Engineer view toggle, accessibility + keyboard navigation, and
PWA support; and platform polish (CI, pre-commit hooks, a hardened launchd service, a Caddy reverse
proxy, and OS-keychain-backed secrets).
- **Quality gate:** **356 automated tests passing**, covering the normalizer, resilience layer,
change detector, alert routing, dependency graph, SLA math, the REST API, and a full end-to-end
pipeline test.
- **Footprint:** runs comfortably on a single Mac mini. One Python process, one SQLite file, one
static React bundle served by the same app.
- **What it watches:** ~30 enterprise SaaS tools spanning identity & access, productivity & content,
collaboration, engineering & ITSM, HR & people, finance, CRM, marketing, network/VPN, and support —
a representative cross-section of a modern enterprise SaaS estate.

A set of features (inbound vendor webhooks, a chat acknowledgement flow, auto-drafted SRE-style
postmortems on recovery, and multi-burn-rate SLO alerting) is built and tested but kept behind feature
flags, defaulting off until their deployment prerequisites are in place — shipped code, deliberately
dark.

---

## Skills demonstrated

Framed for a Platform / DX / AI-infrastructure audience:

- **Distributed-systems resilience as a first-class concern, not an afterthought.** Retries with
backoff + jitter, per-host circuit breaking, and a deliberate transient-vs-permanent failure
taxonomy — the same patterns that keep a platform's outbound integrations from amplifying a
dependency's bad day.
- **Designing the right abstraction over messy reality.** Collapsing a dozen incompatible vendor
formats into one clean status model — with a fail-safe `unknown` path — is the everyday work of
platform and integration engineering.
- **Signal quality over signal volume.** The dedup / suppression / tiering / correlation stack is an
alerting-discipline story: respect the human on the other end, and the system stays trusted instead
of muted.
- **Observability built in from the start.** Structured logs with cycle correlation, Prometheus
metrics, error tracking, and a dead-man's switch — designed to be operated, not just run.
- **Production data discipline at small scale.** WAL, retention, snapshots, and replication on a
single-file datastore: maximum durability for minimum operational surface.
- **Test rigor.** 356 tests including pure-function state-machine coverage and an end-to-end pipeline
test — the difference between "it worked on my machine" and "it's safe to change."
- **Shipping judgment.** Feature-flagged, default-off capabilities show the discipline to merge
complete-but-not-yet-deployable work without destabilizing what's live.

The throughline: this is enterprise IT pain, solved with platform-engineering tools — turning a
reactive, ticket-driven workflow into a proactive, observable, self-healing system.
18 changes: 9 additions & 9 deletions CLAUDE.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# IT Service Health Dashboard

Internal web dashboard aggregating real-time health of ~30 SaaS services at Box. Polls Statuspage.io JSON API, Google Workspace JSON feed, Slack native status API, and RSS/Atom feeds. Enriches with dependency mapping and templated impact statements. Displays a unified status board with timeline view, posts alerts to Slack. Deployed on a Mac Mini behind corporate VPN.
Internal web dashboard aggregating real-time health of ~30 SaaS services in an enterprise IT environment. Polls Statuspage.io JSON API, a cloud productivity suite's JSON feed, a chat vendor's native status API, and RSS/Atom feeds. Enriches with dependency mapping and templated impact statements. Displays a unified status board with timeline view, posts alerts to Slack. Deployed on a Mac Mini on the internal network.

## Roadmap

Expand Down Expand Up @@ -35,7 +35,7 @@ cd backend && python run.py

Open `http://localhost:8000`.

**CI:** GitHub Actions — `uv`, `ruff`, `mypy --strict`, `pytest`; CodeQL analysis. 356 tests passing.
**CI:** GitHub Actions — `uv`, `ruff`, `mypy --strict`, `pytest`; CodeQL analysis. 378 tests passing.

## Conventions

Expand All @@ -55,11 +55,11 @@ Open `http://localhost:8000`.
| Decision | Choice | Rationale |
|----------|--------|-----------|
| Primary data source | Statuspage.io `/api/v2/summary.json` | Most vendors use Statuspage.io; JSON, no auth, not rate-limited |
| Google Workspace | Google custom JSON feed + RSS | Google has its own status dashboard, not Statuspage.io |
| Slack status | `slack-status.com/api/v2.0.0/current` | Dedicated JSON status API |
| Cloud productivity suite | Custom JSON feed + RSS | Has its own status dashboard, not Statuspage.io |
| Chat vendor status | Vendor JSON status endpoint | Dedicated JSON status API |
| Database | SQLite + Litestream | Demo-scale + ~1s RPO; Postgres deferred to >100 writes/s |
| Auth | Bearer token on admin endpoints; VPN-only for reads | Bearer token required for write endpoints — VPN alone is insufficient |
| Hosting | Mac Mini + Caddy | Always-on, VPN-accessible; Caddy adds HTTPS + header auth |
| Auth | Bearer token on admin endpoints; internal-network-only for reads | Bearer token required for write endpoints — the internal network alone is insufficient |
| Hosting | Mac Mini + Caddy | Always-on, internal-network-accessible; Caddy adds HTTPS + header auth |
| Dep graph layout | Force-directed (react-force-graph-2d) | Dagre hierarchical layout is deferred; force-directed is current default |
| LLM layer | Deferred (post-Phase-7) | Template-based summaries sufficient for v2 |

Expand All @@ -84,11 +84,11 @@ All new work must map to an active phase in PRODUCTION-ROADMAP.md. Splunk, Thous

## What This Project Is

Internal web dashboard that aggregates real-time health status of ~30 SaaS services IT supports at Box. Polls vendor status pages via Statuspage.io JSON API, Google Workspace JSON feed, Slack's native status API, and RSS/Atom feeds. Enriches with dependency mapping and templated impact statements. Displays a unified status board with timeline view and posts alerts to Slack. Deployed on a Mac Mini behind corporate VPN. Designed for IT engineers (deep triage) and IT leadership / company-wide visibility (situational awareness).
Internal web dashboard that aggregates real-time health status of ~30 SaaS services supported by an enterprise IT team. Polls vendor status pages via Statuspage.io JSON API, a cloud productivity suite's JSON feed, a chat vendor's native status API, and RSS/Atom feeds. Enriches with dependency mapping and templated impact statements. Displays a unified status board with timeline view and posts alerts to Slack. Deployed on a Mac Mini on the internal network. Designed for IT engineers (deep triage) and IT leadership / company-wide visibility (situational awareness).

## Current State

**v2 SHIPPED — Phases 0–6 complete; Phase 2B + Phase 7 (Statuspage inbound webhook + Slack ack) in tree, gated off by default.** Auth, vendor resilience, alert quality, observability, data lifecycle, UX productionization, and platform polish all landed. **356 tests passing.** The dashboard is production-grade; a mature IT team can rely on it. See PRODUCTION-ROADMAP.md for the exit-criteria detail on each phase.
**v2 SHIPPED — Phases 0–6 complete; Phase 2B + Phase 7 (Statuspage inbound webhook + Slack ack) in tree, gated off by default.** Auth, vendor resilience, alert quality, observability, data lifecycle, UX productionization, and platform polish all landed. **378 tests passing.** The dashboard is production-grade; a mature IT team can rely on it. See PRODUCTION-ROADMAP.md for the exit-criteria detail on each phase.

Main also includes a parallel UX sprint that shipped alongside Phase 5:
- **Executive / Engineer view toggle** — `ViewContext` gates the grid vs category summary and engineer-only affordances (graph, timeline, shortcuts).
Expand Down Expand Up @@ -144,7 +144,7 @@ Open `http://localhost:8000` in your browser.
- Do not start work that isn't in a PRODUCTION-ROADMAP.md phase. If it doesn't fit, discuss first.
- Do not integrate Splunk, ThousandEyes, Datadog, or JSM — those are Phase 7+.
- Do not build an LLM integration yet — post-Phase-7.
- Do not remove the bearer-token auth on admin endpoints once added. VPN is not sufficient for write endpoints.
- Do not remove the bearer-token auth on admin endpoints once added. The internal network is not sufficient for write endpoints.
- Do not use synchronous I/O — all network calls must be async.
- Do not hardcode service definitions in Python — they live in services.yaml.
- Do not use slack-sdk — use raw httpx POST for webhook simplicity.
Expand Down
Loading
Loading