diff --git a/.gitignore b/.gitignore
index 0029c8d..0b875fe 100644
--- a/.gitignore
+++ b/.gitignore
@@ -35,3 +35,10 @@ frontend/dist/
# OS
.DS_Store
CLAUDE.md.bak
+
+# Local (private) service registry — overrides the committed example so the
+# real organization inventory never has to live in version control.
+backend/config/*.local.yaml
+
+# uv lockfile (runtime deps are pinned in backend/requirements.txt)
+backend/uv.lock
diff --git a/CASE-STUDY.md b/CASE-STUDY.md
new file mode 100644
index 0000000..60af402
--- /dev/null
+++ b/CASE-STUDY.md
@@ -0,0 +1,220 @@
+# Case Study: A Real-Time SaaS Status Dashboard for Enterprise IT
+
+> **What it is:** A self-hosted service that polls the public status pages of ~30 enterprise SaaS
+> tools every 60 seconds, normalizes a dozen incompatible vendor formats into one status model,
+> detects changes, reasons about downstream impact, and alerts the operations team before the
+> first user ticket lands.
+>
+> **Status:** v1 and v2 shipped. **356 tests passing.** ~16k LOC. Python 3.13 / FastAPI backend,
+> React + Vite frontend, single-file SQLite datastore, runs on one Mac mini.
+
+---
+
+## The problem
+
+In most IT organizations, the news that a critical SaaS tool is down travels in exactly the wrong
+direction: a user hits a broken login, files a ticket, the ticket sits in a queue, and only then —
+ten, twenty, sixty minutes later — does someone in IT realize the identity provider has been
+degraded the whole time. The team learns about outages from the people it's supposed to be
+protecting.
+
+The information was public the entire time. Nearly every major SaaS vendor publishes a machine-readable
+status page. The gap isn't data availability — it's that nobody is *watching* thirty status pages at
+once, in thirty different formats, and connecting "Vendor X is degraded" to "therefore these internal
+workflows are about to break."
+
+This project closes that gap. It turns a reactive, ticket-driven posture into a proactive one: when a
+vendor's own status page flips, the operations channel knows within one poll cycle — with an impact
+statement attached, not just a raw status code.
+
+---
+
+## The solution
+
+A small, self-contained monitoring service with five responsibilities, each isolated into its own
+module so they can be tested and reasoned about independently:
+
+```mermaid
+flowchart TD
+ subgraph upstream["Upstream status sources (heterogeneous)"]
+ A1["Statuspage.io-powered pages<br/>(~half the fleet)"]
+ A2["Vendor-specific status APIs<br/>(collaboration, CRM, support, productivity suites)"]
+ A3["Operator-pushed manual updates<br/>(tools with no machine-readable status)"]
+ end
+
+ A1 & A2 & A3 --> P["Poll Orchestrator<br/>APScheduler · 60s cycle<br/>coalesce · no overlap · misfire-aware"]
+ P --> R["Resilience layer<br/>retry + backoff (stamina)<br/>per-host circuit breaker (purgatory)"]
+ R --> N["Status Normalizer<br/>vendor strings → 5-state enum<br/>operational · degraded · partial · major · unknown"]
+ N --> C["Change Detector<br/>diff vs. DB · two-axis health<br/>(is the vendor down? is the poller blind?)"]
+ C --> I["Impact Engine<br/>service dependency graph<br/>→ 'who is affected downstream'"]
+ C --> AL["Alert Router<br/>dedup · maintenance suppression<br/>flap suppression · tier routing"]
+ I --> AL
+ AL --> CH["Chat alerting<br/>(tiered: page / notify / dashboard-only)"]
+ C --> DB[("SQLite<br/>WAL · retention · snapshots")]
+ DB --> API["FastAPI REST<br/>/services /timeline /summary /metrics"]
+ API --> UI["React dashboard<br/>Executive + Engineer views · PWA"]
+```
+
+**Flow in one sentence:** poll many formats → make them resilient and uniform → detect what
+*changed* → decide who it *affects* and whether it's worth interrupting a human → store it and show it.
+
+---
+
+## Engineering highlights
+
+These are the parts that are non-obvious — the places where the naive version is easy and the
+correct version takes real design.
+
+### 1. One status model out of a dozen vendor dialects
+
+Every vendor describes "things are broken" differently. Statuspage.io alone has two distinct
+vocabularies — a page-level *indicator* (`none` / `minor` / `major` / `critical`) and per-component
+*status* strings (`operational` / `degraded_performance` / `partial_outage` / `under_maintenance`) —
+and they don't line up. Other vendors ship entirely bespoke JSON. Manual operator updates use a
+third shape.
+
+The normalizer collapses all of it into a single five-state enum
+(`operational · degraded · partial_outage · major_outage · unknown`) via explicit per-source mapping
+tables. The design decision that matters: **an unrecognized vendor string maps to `unknown` and emits
+a warning** — it never silently guesses `operational`. A new value a vendor introduces tomorrow shows
+up as a logged anomaly instead of a false all-clear. (`under_maintenance`, notably, maps to `degraded`
+rather than a fake outage — maintenance is expected, not an incident.)
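
A minimal sketch of the mapping-table idea described in section 1, assuming a `normalize` helper and table names that are hypothetical. The indicator-to-enum pairings (`minor` to `degraded`, and so on) and the `major_outage` component row are illustrative assumptions, not the project's actual tables. What it is meant to show is the fallback: an unmapped string logs a warning and becomes `unknown`.

```python
import logging
from enum import Enum

logger = logging.getLogger("normalizer")


class ServiceStatus(str, Enum):
    OPERATIONAL = "operational"
    DEGRADED = "degraded"
    PARTIAL_OUTAGE = "partial_outage"
    MAJOR_OUTAGE = "major_outage"
    UNKNOWN = "unknown"


# Hypothetical per-source tables. The exact pairings are assumptions for illustration.
STATUSPAGE_INDICATOR = {
    "none": ServiceStatus.OPERATIONAL,
    "minor": ServiceStatus.DEGRADED,
    "major": ServiceStatus.PARTIAL_OUTAGE,
    "critical": ServiceStatus.MAJOR_OUTAGE,
}
STATUSPAGE_COMPONENT = {
    "operational": ServiceStatus.OPERATIONAL,
    "degraded_performance": ServiceStatus.DEGRADED,
    "partial_outage": ServiceStatus.PARTIAL_OUTAGE,
    "major_outage": ServiceStatus.MAJOR_OUTAGE,
    "under_maintenance": ServiceStatus.DEGRADED,  # expected work, not an incident
}
MAPPINGS = {
    "statuspage_indicator": STATUSPAGE_INDICATOR,
    "statuspage_component": STATUSPAGE_COMPONENT,
}


def normalize(source: str, raw: str) -> ServiceStatus:
    """Map a vendor string to the shared enum; never guess 'operational'."""
    table = MAPPINGS.get(source, {})
    status = table.get(raw.strip().lower())
    if status is None:
        logger.warning("unrecognized status %r from source %r", raw, source)
        return ServiceStatus.UNKNOWN
    return status
```

Keeping one table per source means a vendor adding a new value only ever produces a logged `unknown`, and the fix is a one-line table entry.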
+
+### 2. Two-axis health: "is the vendor down?" vs. "is my poller blind?"
+
+This is the insight that separates a toy from a tool. A status of `unknown` is dangerously ambiguous —
+it could mean the vendor is genuinely in trouble, or it could mean *our own fetch failed* and we have
+no idea what's going on. Conflating the two means every network hiccup looks like an outage.
+
+So the system tracks **two orthogonal axes**:
+
+- **Service status** — what the vendor reports (the 5-state enum above).
+- **Poller health** — `healthy → degraded → broken`, driven by a pure state machine: a successful
+ poll resets to `healthy`; failures short of a configurable threshold are `degraded`; sustained
+ failure past the threshold flips to `broken`.
+
+When a poller goes `broken`, that's routed as a *distinct* signal — "we've gone blind on this service" —
+separate from a vendor-outage alert, because they demand different human responses. The UI renders a
+broken poller differently from a down vendor. You always know whether you're looking at reality or at
+a gap in your own instrumentation.
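
The poller-health axis in section 2 can be sketched as a pure function. The name `next_health`, the default threshold of 5, and the choice to flip to `broken` exactly when the failure count reaches the threshold are assumptions made for illustration; the text only says the threshold is configurable.

```python
from enum import Enum


class PollerHealth(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    BROKEN = "broken"


def next_health(consecutive_failures: int, poll_succeeded: bool, broken_threshold: int = 5):
    """Return (new_health, new_failure_count) after one poll attempt.

    Pure function: no I/O and no clock, so every transition is trivially testable.
    """
    if poll_succeeded:
        # Any success resets the poller to healthy.
        return PollerHealth.HEALTHY, 0
    failures = consecutive_failures + 1
    if failures >= broken_threshold:
        # Sustained failure: the poller is blind on this service.
        return PollerHealth.BROKEN, failures
    return PollerHealth.DEGRADED, failures
```

Because the function takes the previous failure count and returns the new one, the caller owns persistence and the state machine itself stays free of side effects.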
+
+### 3. Resilience: heal fast, fail loud, don't hammer the dead
+
+Every outbound request goes through one resilience layer with two complementary mechanisms:
+
+- **Retry with backoff + jitter** (via `stamina`) for *transient* trouble — network errors, timeouts,
+ and HTTP `408` / `429` / `5xx`. These self-heal, so the system gives them a few capped, jittered
+ attempts before giving up.
+- **Per-host circuit breaker** (via `purgatory`) for *persistent* trouble. After N consecutive
+ failures a host's breaker opens and subsequent calls fast-fail for a TTL (default 5 minutes) before
+ probing again — avoiding the classic "re-probe a dead host every cycle and burn the whole poll
+ window" antipattern.
+
+The sharp distinction: **HTTP `4xx` errors other than `408`/`429` are treated as hard failures and
+surface immediately** — a `404` or `401` means a URL moved or auth changed; retrying can't fix config
+rot, so it shouldn't mask it. And a tripped breaker is reported as *poller-unhealthy*, not
+*vendor-down* — feeding straight back into the two-axis model above.
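
To make the transient-versus-hard split and the breaker TTL in section 3 concrete, here is a standard-library sketch. It deliberately does not use the `stamina` or `purgatory` APIs; `is_transient` and `HostBreaker` are hypothetical stand-ins, and the threshold of 3 is an assumed value (only the 5-minute TTL default comes from the text).

```python
import time


def is_transient(status_code: int) -> bool:
    """True for HTTP statuses worth retrying: 408, 429, and any 5xx."""
    return status_code in (408, 429) or 500 <= status_code <= 599


class HostBreaker:
    """Toy per-host breaker: opens after `threshold` consecutive failures, for `ttl` seconds."""

    def __init__(self, threshold: int = 3, ttl: float = 300.0, clock=time.monotonic):
        self.threshold = threshold
        self.ttl = ttl
        self.clock = clock  # injectable so tests do not need to sleep
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        """False while the breaker is open and the TTL has not yet elapsed."""
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.ttl

    def record(self, ok: bool) -> None:
        """Feed the outcome of a call back into the breaker."""
        if ok:
            self.failures = 0
            self.opened_at = None
            return
        self.failures += 1
        if self.failures >= self.threshold:
            # Open, or re-open after a failed probe, and restart the TTL.
            self.opened_at = self.clock()
```

A caller would check `allow()` before each request and treat a `False` as a poller-health failure rather than as a vendor status.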
+
+### 4. Alert quality: earn the interruption
+
+A monitor that cries wolf gets muted, and a muted monitor is worthless. Several layers cooperate so
+that a human is interrupted only when it's warranted:
+
+- **Deduplication keyed on `(service, vendor_incident_id)`** — with a `(service, status, day)`
+ fallback when no vendor incident ID exists. Critically, **dedup never keys on message text**:
+ vendors edit incident titles mid-flight, and a text key would leak a fresh alert on every wording
+ change.
+- **Maintenance-window suppression** — a scheduled maintenance records the state transition but does
+ not page anyone. Expected ≠ alarming.
+- **Flap suppression** — a status must persist across a configurable number of confirming polls before
+ a worsening alert fires, and recover across another threshold before the all-clear, so a vendor
+ bouncing between states doesn't machine-gun the channel.
+- **Tiered routing** — every service has a tier: `critical` pages with an `@here` mention,
+ `important` notifies without the mention, `informational` updates the dashboard and sends nothing.
+- **Dependency correlation** — when one upstream failure knocks out several dependents, the router can
+ emit a *single aggregated* upstream alert instead of N separate downstream ones.
+
+Every decision — sent or suppressed, and why — is written to a durable audit log. There's always an
+answer to "what did we tell operators, and what did we hold back?"
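
The dedup key and tier routing from section 4 could look roughly like this. The function names, the returned action strings, and the choice to treat an unrecognized tier as dashboard-only are assumptions for the sketch; flap suppression and dependency correlation are left out to keep it short.

```python
from datetime import date
from typing import Optional


def dedup_key(service: str, status: str, day: date,
              vendor_incident_id: Optional[str] = None) -> tuple:
    """Prefer the vendor incident ID; fall back to (service, status, day).

    Message text is intentionally not part of either key.
    """
    if vendor_incident_id:
        return (service, vendor_incident_id)
    return (service, status, day.isoformat())


# Hypothetical action names for the three tiers named in the text.
TIER_ACTIONS = {
    "critical": "page",
    "important": "notify",
    "informational": "dashboard_only",
}


def route(tier: str, key: tuple, already_sent: set, in_maintenance: bool = False) -> str:
    """Decide what happens to one status change; the result doubles as an audit reason."""
    if in_maintenance:
        return "suppressed:maintenance"
    if key in already_sent:
        return "suppressed:duplicate"
    action = TIER_ACTIONS.get(tier, "dashboard_only")
    if action != "dashboard_only":
        already_sent.add(key)  # remember only alerts that actually reached a human
    return action
```

Returning a reason string for every branch makes it straightforward to write each decision, sent or suppressed, to an audit table.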
+
+### 5. From "Vendor X is degraded" to "here's who it hurts"
+
+Raw status is low-value; *impact* is what an on-call human actually needs. A service dependency graph
+(stored relationally, queried both upstream and downstream and ordered by severity) lets the impact
+engine turn a single vendor event into a downstream blast-radius statement — "identity provider
+degraded → these dependent workflows are at risk" — so the alert leads with consequences, not codes.
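
A small sketch of the downstream query behind section 5. The case study says the graph is stored relationally; the in-memory dict here, the `downstream_of` name, and the service slugs are stand-ins for illustration, and severity ordering is omitted.

```python
from collections import deque


def downstream_of(service: str, dependents: dict) -> list:
    """Breadth-first walk: every service that directly or transitively depends on `service`.

    `dependents` maps a service slug to the slugs that depend on it.
    """
    seen = {service}
    order = []
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for dep in dependents.get(current, []):
            if dep not in seen:  # guards against cycles in the graph
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order
```

For example, with `{"identity-provider": ["chat-platform", "crm"], "crm": ["partner-portal"]}`, a degraded identity provider yields the chat platform, the CRM, and the partner portal as the at-risk set.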
+
+### 6. Scheduler discipline and observability
+
+The poll loop is built to stay honest under load and to be debuggable in production:
+
+- **Scheduler safety**: cycles `coalesce` (a late wake-up runs once, not as a backlog stampede),
+ `max_instances=1` (while a slow cycle is still running, the next one is skipped, never overlapped),
+ and missed runs are logged and counted rather than silently dropped.
+- **Trace-without-tracing**: each cycle binds a fresh `poll_cycle_id` into structured logs, so every
+ line from one cycle is correlatable without a full distributed-tracing stack.
+- **Operational signals**: a Prometheus `/metrics` endpoint (poll counts, durations, circuit-breaker
+ state, alert sent/suppressed counters), optional Sentry error tracking, and a Healthchecks.io
+ dead-man's-switch heartbeat that screams if the *monitor itself* goes dark.
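
One way to get the per-cycle correlation described in section 6 with only the standard library is a context variable plus a logging filter. This is an assumption-laden sketch: the project's actual structured-logging setup is not shown in the text, and `CycleIdFilter` and `start_cycle` are hypothetical names.

```python
import contextvars
import logging
import uuid

# Holds the ID of the poll cycle currently running in this context.
poll_cycle_id = contextvars.ContextVar("poll_cycle_id", default="-")


class CycleIdFilter(logging.Filter):
    """Stamp every log record with the current cycle ID."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.poll_cycle_id = poll_cycle_id.get()
        return True  # never drop the record, only annotate it


def start_cycle() -> str:
    """Bind a fresh ID at the top of each poll cycle and return it."""
    cycle_id = uuid.uuid4().hex[:12]
    poll_cycle_id.set(cycle_id)
    return cycle_id
```

With the filter attached to a handler and `%(poll_cycle_id)s` in the format string, every line emitted during one cycle carries the same ID, including lines from async tasks spawned inside that cycle, since tasks copy the current context.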
+
+### 7. Boring, durable data lifecycle
+
+A single SQLite file in WAL mode is the whole datastore — deliberately. Around it: production pragmas,
+automatic retention purges for event and alert-log tables, a daily `VACUUM INTO` snapshot, and
+optional Litestream continuous replication. No database server to operate; with replication enabled,
+a dead host restores to a recent point in time, and without it to the latest daily snapshot.
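
The lifecycle pieces in section 7 map onto a few `sqlite3` calls. This sketch assumes a `status_events` table with a `created_at` column, and the specific pragmas beyond WAL are illustrative guesses at what "production pragmas" might include. `VACUUM INTO` needs SQLite 3.27 or newer.

```python
import sqlite3
from pathlib import Path


def open_db(path: str) -> sqlite3.Connection:
    """Open the single-file datastore in WAL mode with a few conservative pragmas."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA synchronous=NORMAL")  # assumed setting, common alongside WAL
    conn.execute("PRAGMA busy_timeout=5000")   # assumed: wait up to 5s on a locked db
    return conn


def purge_old_events(conn: sqlite3.Connection, days: int) -> int:
    """Delete events older than `days`; returns the number of rows removed."""
    cur = conn.execute(
        "DELETE FROM status_events WHERE created_at < datetime('now', ?)",
        (f"-{days} days",),
    )
    conn.commit()
    return cur.rowcount


def snapshot(conn: sqlite3.Connection, dest: str) -> None:
    """Write a compact, consistent copy of the live database to `dest`."""
    Path(dest).unlink(missing_ok=True)  # VACUUM INTO refuses to overwrite a non-empty file
    conn.commit()                       # VACUUM cannot run inside an open transaction
    conn.execute("VACUUM INTO ?", (dest,))
```

A daily job could call `purge_old_events` and then `snapshot` with a dated filename; the snapshot is an ordinary SQLite file that opens directly for restore or inspection.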
+
+---
+
+## Results / status
+
+- **v1 (demo-ready) — shipped.** Polling, normalization, change detection, chat alerting, the React
+ UI, dependency graph, timeline, SLA tracking, incident clustering, and automated reports.
+- **v2 (production-ready) — shipped.** Bearer-token auth on admin endpoints; the full resilience
+ layer (retry + circuit breaker); alert-quality stack (flap suppression, dedup, tier routing,
+ dependency correlation, maintenance windows); observability (structured logging, Prometheus, Sentry,
+ dead-man's switch); data lifecycle (retention, snapshots, replication); a productionized UI with a
+ severity-sorted grid, an Executive/Engineer view toggle, accessibility + keyboard navigation, and
+ PWA support; and platform polish (CI, pre-commit hooks, a hardened launchd service, a Caddy reverse
+ proxy, and OS-keychain-backed secrets).
+- **Quality gate:** **356 automated tests passing**, covering the normalizer, resilience layer,
+ change detector, alert routing, dependency graph, SLA math, the REST API, and a full end-to-end
+ pipeline test.
+- **Footprint:** runs comfortably on a single Mac mini. One Python process, one SQLite file, one
+ static React bundle served by the same app.
+- **What it watches:** ~30 enterprise SaaS tools spanning identity & access, productivity & content,
+ collaboration, engineering & ITSM, HR & people, finance, CRM, marketing, network/VPN, and support —
+ a representative cross-section of a modern enterprise SaaS estate.
+
+A set of features (inbound vendor webhooks, a chat acknowledgement flow, auto-drafted SRE-style
+postmortems on recovery, and multi-burn-rate SLO alerting) is built and tested but kept behind feature
+flags, defaulting off until their deployment prerequisites are in place — shipped code, deliberately
+dark.
+
+---
+
+## Skills demonstrated
+
+Framed for a Platform / DX / AI-infrastructure audience:
+
+- **Distributed-systems resilience as a first-class concern, not an afterthought.** Retries with
+ backoff + jitter, per-host circuit breaking, and a deliberate transient-vs-permanent failure
+ taxonomy — the same patterns that keep a platform's outbound integrations from amplifying a
+ dependency's bad day.
+- **Designing the right abstraction over messy reality.** Collapsing a dozen incompatible vendor
+ formats into one clean status model — with a fail-safe `unknown` path — is the everyday work of
+ platform and integration engineering.
+- **Signal quality over signal volume.** The dedup / suppression / tiering / correlation stack is an
+ alerting-discipline story: respect the human on the other end, and the system stays trusted instead
+ of muted.
+- **Observability built in from the start.** Structured logs with cycle correlation, Prometheus
+ metrics, error tracking, and a dead-man's switch — designed to be operated, not just run.
+- **Production data discipline at small scale.** WAL, retention, snapshots, and replication on a
+ single-file datastore: maximum durability for minimum operational surface.
+- **Test rigor.** 356 tests including pure-function state-machine coverage and an end-to-end pipeline
+ test — the difference between "it worked on my machine" and "it's safe to change."
+- **Shipping judgment.** Feature-flagged, default-off capabilities show the discipline to merge
+ complete-but-not-yet-deployable work without destabilizing what's live.
+
+The throughline: this is enterprise IT pain, solved with platform-engineering tools — turning a
+reactive, ticket-driven workflow into a proactive, observable, self-healing system.
diff --git a/CLAUDE.md b/CLAUDE.md
index fe7e60c..a00e398 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -1,6 +1,6 @@
# IT Service Health Dashboard
-Internal web dashboard aggregating real-time health of ~30 SaaS services at Box. Polls Statuspage.io JSON API, Google Workspace JSON feed, Slack native status API, and RSS/Atom feeds. Enriches with dependency mapping and templated impact statements. Displays a unified status board with timeline view, posts alerts to Slack. Deployed on a Mac Mini behind corporate VPN.
+Internal web dashboard aggregating real-time health of ~30 SaaS services in an enterprise IT environment. Polls Statuspage.io JSON API, a cloud productivity suite's JSON feed, a chat vendor's native status API, and RSS/Atom feeds. Enriches with dependency mapping and templated impact statements. Displays a unified status board with timeline view, posts alerts to Slack. Deployed on a Mac Mini on the internal network.
## Roadmap
@@ -35,7 +35,7 @@ cd backend && python run.py
Open `http://localhost:8000`.
-**CI:** GitHub Actions — `uv`, `ruff`, `mypy --strict`, `pytest`; CodeQL analysis. 356 tests passing.
+**CI:** GitHub Actions — `uv`, `ruff`, `mypy --strict`, `pytest`; CodeQL analysis. 378 tests passing.
## Conventions
@@ -55,11 +55,11 @@ Open `http://localhost:8000`.
| Decision | Choice | Rationale |
|----------|--------|-----------|
| Primary data source | Statuspage.io `/api/v2/summary.json` | Most vendors use Statuspage.io; JSON, no auth, not rate-limited |
-| Google Workspace | Google custom JSON feed + RSS | Google has its own status dashboard, not Statuspage.io |
-| Slack status | `slack-status.com/api/v2.0.0/current` | Dedicated JSON status API |
+| Cloud productivity suite | Custom JSON feed + RSS | Has its own status dashboard, not Statuspage.io |
+| Chat vendor status | Vendor JSON status endpoint | Dedicated JSON status API |
| Database | SQLite + Litestream | Demo-scale + ~1s RPO; Postgres deferred to >100 writes/s |
-| Auth | Bearer token on admin endpoints; VPN-only for reads | Bearer token required for write endpoints — VPN alone is insufficient |
-| Hosting | Mac Mini + Caddy | Always-on, VPN-accessible; Caddy adds HTTPS + header auth |
+| Auth | Bearer token on admin endpoints; internal-network-only for reads | Bearer token required for write endpoints — the internal network alone is insufficient |
+| Hosting | Mac Mini + Caddy | Always-on, internal-network-accessible; Caddy adds HTTPS + header auth |
| Dep graph layout | Force-directed (react-force-graph-2d) | Dagre hierarchical layout is deferred; force-directed is current default |
| LLM layer | Deferred (post-Phase-7) | Template-based summaries sufficient for v2 |
@@ -84,11 +84,11 @@ All new work must map to an active phase in PRODUCTION-ROADMAP.md. Splunk, Thous
## What This Project Is
-Internal web dashboard that aggregates real-time health status of ~30 SaaS services IT supports at Box. Polls vendor status pages via Statuspage.io JSON API, Google Workspace JSON feed, Slack's native status API, and RSS/Atom feeds. Enriches with dependency mapping and templated impact statements. Displays a unified status board with timeline view and posts alerts to Slack. Deployed on a Mac Mini behind corporate VPN. Designed for IT engineers (deep triage) and IT leadership / company-wide visibility (situational awareness).
+Internal web dashboard that aggregates real-time health status of ~30 SaaS services supported by an enterprise IT team. Polls vendor status pages via Statuspage.io JSON API, a cloud productivity suite's JSON feed, a chat vendor's native status API, and RSS/Atom feeds. Enriches with dependency mapping and templated impact statements. Displays a unified status board with timeline view and posts alerts to Slack. Deployed on a Mac Mini on the internal network. Designed for IT engineers (deep triage) and IT leadership / company-wide visibility (situational awareness).
## Current State
-**v2 SHIPPED — Phases 0–6 complete; Phase 2B + Phase 7 (Statuspage inbound webhook + Slack ack) in tree, gated off by default.** Auth, vendor resilience, alert quality, observability, data lifecycle, UX productionization, and platform polish all landed. **356 tests passing.** The dashboard is production-grade; a mature IT team can rely on it. See PRODUCTION-ROADMAP.md for the exit-criteria detail on each phase.
+**v2 SHIPPED — Phases 0–6 complete; Phase 2B + Phase 7 (Statuspage inbound webhook + Slack ack) in tree, gated off by default.** Auth, vendor resilience, alert quality, observability, data lifecycle, UX productionization, and platform polish all landed. **378 tests passing.** The dashboard is production-grade; a mature IT team can rely on it. See PRODUCTION-ROADMAP.md for the exit-criteria detail on each phase.
Main also includes a parallel UX sprint that shipped alongside Phase 5:
- **Executive / Engineer view toggle** — `ViewContext` gates the grid vs category summary and engineer-only affordances (graph, timeline, shortcuts).
@@ -144,7 +144,7 @@ Open `http://localhost:8000` in your browser.
- Do not start work that isn't in a PRODUCTION-ROADMAP.md phase. If it doesn't fit, discuss first.
- Do not integrate Splunk, ThousandEyes, Datadog, or JSM — those are Phase 7+.
- Do not build an LLM integration yet — post-Phase-7.
-- Do not remove the bearer-token auth on admin endpoints once added. VPN is not sufficient for write endpoints.
+- Do not remove the bearer-token auth on admin endpoints once added. The internal network is not sufficient for write endpoints.
- Do not use synchronous I/O — all network calls must be async.
- Do not hardcode service definitions in Python — they live in services.yaml.
- Do not use slack-sdk — use raw httpx POST for webhook simplicity.
diff --git a/IMPLEMENTATION-ROADMAP.md b/IMPLEMENTATION-ROADMAP.md
index 0f7118c..95f266a 100644
--- a/IMPLEMENTATION-ROADMAP.md
+++ b/IMPLEMENTATION-ROADMAP.md
@@ -12,8 +12,8 @@
```
[Vendor Status Pages]
├── Statuspage.io JSON API (/api/v2/summary.json) — ~20 services
- ├── Google Workspace JSON feed + RSS — 2 services (Gmail, Calendar)
- ├── Slack Status API (slack-status.com/api/v2.0.0/current) — 1 service
+ ├── Cloud productivity suite JSON feed + RSS — 2 services (Mail, Calendar)
+ ├── Chat-platform status API (chat-status.example.com/api/v2.0.0/current) — 1 service
└── Manual updates via POST /api/admin/status — ~10 services
↓ (async poll every 60s via APScheduler)
[Polling Workers]
@@ -24,7 +24,7 @@
↓ (on change)
┌──────────────────────┐
│ [Template Engine] │ — generate impact statements from dependency graph
- │ [Slack Alerter] │ — POST Block Kit message to #service-validation webhook
+ │ [Slack Alerter] │ — POST Block Kit message to the ops-alert channel webhook
│ [SQLite Writer] │ — insert status_events row, update services row
└──────────────────────┘
↓
@@ -53,8 +53,8 @@ it-service-health/
│ │ │ ├── __init__.py
│ │ │ ├── scheduler.py # APScheduler async setup, 60s interval, error handling
│ │ │ ├── statuspage_poller.py # Statuspage.io JSON API poller (handles ~20 services)
-│ │ │ ├── google_poller.py # Google Workspace status poller (JSON feed + RSS)
-│ │ │ ├── slack_poller.py # Slack Status API poller (slack-status.com)
+│ │ │ ├── product_feed_poller.py # Cloud productivity suite status poller (JSON feed + RSS)
+│ │ │ ├── current_status_poller.py # Chat-platform status API poller
│ │ │ ├── rss_poller.py # Fallback RSS/Atom poller for services without JSON API
│ │ │ ├── normalizer.py # Vendor status string → ServiceStatus enum mapping
│ │ │ └── change_detector.py # Diff current vs stored state, emit change events
@@ -117,10 +117,10 @@ it-service-health/
```sql
-- Service registry: static + current state for each monitored service
CREATE TABLE services (
- id TEXT PRIMARY KEY, -- slug: "okta", "google-mail", "slack"
- display_name TEXT NOT NULL, -- "Okta", "Google Mail", "Slack"
+ id TEXT PRIMARY KEY, -- slug: "identity-provider", "cloud-mail", "chat-platform"
+ display_name TEXT NOT NULL, -- "Identity Provider", "Cloud Mail", "Chat Platform"
category TEXT NOT NULL, -- see categories below
- poll_type TEXT NOT NULL DEFAULT 'manual', -- "statuspage_json", "google_json", "slack_api", "rss", "manual"
+ poll_type TEXT NOT NULL DEFAULT 'manual', -- "statuspage_json", "product_feed_json", "current_status_api", "rss", "manual"
poll_url TEXT, -- API/feed URL to poll (NULL if manual)
statuspage_component_name TEXT, -- for statuspage_json: match this component name in API response
status_page_url TEXT, -- vendor public status page URL for linking
@@ -140,7 +140,7 @@ CREATE TABLE status_events (
vendor_title TEXT, -- incident title from vendor
vendor_detail TEXT, -- incident description/body from vendor
impact_statement TEXT, -- generated template-based impact text
- source TEXT NOT NULL DEFAULT 'statuspage_json', -- "statuspage_json", "google_json", "slack_api", "rss", "manual"
+ source TEXT NOT NULL DEFAULT 'statuspage_json', -- "statuspage_json", "product_feed_json", "current_status_api", "rss", "manual"
vendor_incident_id TEXT, -- vendor's incident ID for deduplication
created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
@@ -201,33 +201,33 @@ These vendors use Atlassian Statuspage. Poll their `/api/v2/summary.json` endpoi
| Service | Status Page Base URL | summary.json URL | Component to Match | Category |
|---------|---------------------|-------------------|--------------------|----------|
-| Box | status.box.com | `https://status.box.com/api/v2/summary.json` | Match overall page status or relevant component | productivity |
-| Okta | status.okta.com | `https://status.okta.com/api/v2/summary.json` | Use page-level status (Okta has cell-specific pages; use main) | identity |
-| Duo | status.duo.com | `https://status.duo.com/api/v2/summary.json` | "Duo Security" or page-level | identity |
-| DocuSign | status.docusign.com | `https://status.docusign.com/api/v2/summary.json` | Page-level or "eSignature" component | productivity |
-| Zoom | status.zoom.us | `https://status.zoom.us/api/v2/summary.json` | "Zoom Meetings", "Zoom Phone", or page-level | collaboration |
-| Concur | open.concur.com (verify) | `https://open.concur.com/api/v2/summary.json` | Page-level | finance |
-| Conga | (verify Statuspage URL) | Verify at runtime — may use custom domain | Page-level | productivity |
-| SnapLogic | status.snaplogic.com | `https://status.snaplogic.com/api/v2/summary.json` | Page-level | engineering |
-| Zuora | status.zuora.com | `https://status.zuora.com/api/v2/summary.json` | Page-level | finance |
-| Cornerstone | status.csod.com (verify) | Verify at runtime | Page-level | hr |
-| Iterable | status.iterable.com | `https://status.iterable.com/api/v2/summary.json` | Page-level | marketing |
-| Marketo | status.adobe.com (verify) | Verify — Marketo may be under Adobe's status page | "Marketo Engage" component | marketing |
-| Greenhouse | status.greenhouse.io (verify) | Verify at runtime | Page-level | hr |
-| Teem (iOFFICE) | (verify) | Verify — Teem was acquired by iOFFICE | Page-level | productivity |
-| Salesforce | status.salesforce.com | `https://status.salesforce.com/api/v2/summary.json` | Page-level or instance-specific | sales |
-| Zendesk | status.zendesk.com | `https://status.zendesk.com/api/v2/summary.json` | Page-level | support |
-
-**IMPORTANT: Claude Code must verify every URL during Phase 0 by running `curl ` and confirming valid JSON is returned. Some URLs may have changed or use custom Statuspage domains. If a URL fails, search for ` status page` and find the correct Statuspage.io URL, then check if `/api/v2/summary.json` is accessible.**
-
-### Poll Type: `slack_api`
-Slack has its own dedicated status API, not Statuspage.
+| Content platform | (vendor status domain — verify at runtime) | `https:///api/v2/summary.json` | Match overall page status or relevant component | productivity |
+| Identity provider (SSO) | (vendor status domain — verify at runtime) | `https:///api/v2/summary.json` | Use page-level status (vendor has cell-specific pages; use main) | identity |
+| MFA | (vendor status domain — verify at runtime) | `https:///api/v2/summary.json` | "MFA Security" or page-level | identity |
+| E-signature tool | (vendor status domain — verify at runtime) | `https:///api/v2/summary.json` | Page-level or "eSignature" component | productivity |
+| Video conferencing | (vendor status domain — verify at runtime) | `https:///api/v2/summary.json` | "Meetings", "Phone", or page-level | collaboration |
+| Finance tools (expense) | (vendor status domain — verify at runtime) | `https:///api/v2/summary.json` | Page-level | finance |
+| Document automation | (verify Statuspage URL) | Verify at runtime — may use custom domain | Page-level | productivity |
+| Integration platform | (vendor status domain — verify at runtime) | `https:///api/v2/summary.json` | Page-level | engineering |
+| Finance tools (billing) | (vendor status domain — verify at runtime) | `https:///api/v2/summary.json` | Page-level | finance |
+| HR tools (LMS) | (vendor status domain — verify at runtime) | Verify at runtime | Page-level | hr |
+| Marketing tools (email) | (vendor status domain — verify at runtime) | `https:///api/v2/summary.json` | Page-level | marketing |
+| Marketing tools (automation) | (vendor status domain — verify at runtime) | Verify — may be under a parent vendor's status page | "Marketing Automation" component | marketing |
+| HR tools (ATS) | (vendor status domain — verify at runtime) | Verify at runtime | Page-level | hr |
+| Space management | (verify) | Verify — may have been acquired | Page-level | productivity |
+| CRM | (vendor status domain — verify at runtime) | `https:///api/v2/summary.json` | Page-level or instance-specific | sales |
+| Support platform | (vendor status domain — verify at runtime) | `https:///api/v2/summary.json` | Page-level | support |
+
+**IMPORTANT: Claude Code must verify every URL during Phase 0 by running `curl ` and confirming valid JSON is returned. Some URLs may have changed or use custom Statuspage domains. If a URL fails, search for the vendor's status page and find the correct Statuspage.io URL, then check if `/api/v2/summary.json` is accessible.**
+
+### Poll Type: `current_status_api`
+Some collaboration vendors expose a dedicated current-status API, not Statuspage.
| Service | API URL | Category |
|---------|---------|----------|
-| Slack | `https://slack-status.com/api/v2.0.0/current` | collaboration |
+| Chat platform | `https://chat-status.example.com/api/v2.0.0/current` | collaboration |
-**Slack Status API response shape:**
+**Current-status API response shape:**
```json
{
"status": "ok|active",
@@ -241,7 +241,7 @@ Slack has its own dedicated status API, not Statuspage.
"title": "Some customers may experience...",
"type": "incident|notice|maintenance",
"status": "active|resolved",
- "url": "https://status.slack.com/2024-01/...",
+ "url": "https://status.example.com/2024-01/...",
"services": ["Login/SSO", "Messaging", "Connections", ...],
"notes": [{ "date_created": "...", "body": "..." }]
}
@@ -250,55 +250,55 @@ Slack has its own dedicated status API, not Statuspage.
```
When `status` is `"ok"` and `active_incidents` is empty → operational. Otherwise map by incident type/impact.
-### Poll Type: `google_json`
-Google Workspace uses its own status dashboard with a JSON feed and RSS.
+### Poll Type: `product_feed_json`
+The cloud productivity suite runs its own status dashboard, which exposes a JSON feed and an RSS feed.
| Service | JSON URL | RSS URL | Category |
|---------|----------|---------|----------|
-| Google Mail | `https://www.google.com/appsstatus/dashboard/incidents.json` | `https://www.google.com/appsstatus/rss/en` | productivity |
-| Google Calendar | (same JSON feed — filter by product) | (same RSS feed) | productivity |
+| Cloud Mail | `https://feed.example.com/incidents.json` | `https://feed.example.com/rss` | productivity |
+| Cloud Calendar | (same JSON feed — filter by product) | (same RSS feed) | productivity |
-**Google Workspace JSON feed:** Contains incidents for ALL Google Workspace products. Filter by `service_name` field matching "Gmail" or "Google Calendar". The feed provides `most_recent_update.status` which maps to severity. RSS feed at `https://www.google.com/appsstatus/rss/en` is a fallback.
+**Cloud productivity suite JSON feed:** Contains incidents for ALL products in the suite. Filter by `service_name` field matching the relevant products. The feed provides `most_recent_update.status` which maps to severity. RSS feed at `https://feed.example.com/rss` is a fallback.
-**Note:** Google's JSON feed returns incident *history*, not a real-time component status like Statuspage. For current status: if no active (non-resolved) incidents exist for the product → operational. If active incidents exist → map severity.
+**Note:** The cloud productivity suite's JSON feed returns incident *history*, not a real-time component status like Statuspage. For current status: if no active (non-resolved) incidents exist for the product → operational. If active incidents exist → map severity.
### Poll Type: `rss`
Fallback for any service that has an RSS/Atom feed but not a known JSON API.
| Service | Feed URL | Category |
|---------|----------|----------|
-| RingCentral | (find RSS URL from status page) | collaboration |
+| Telephony | (find RSS URL from status page) | collaboration |
### Poll Type: `manual`
These services have no automated monitoring feeds. IT engineers update status via `curl POST`.
| Service | Status Page URL (for linking) | Category |
|---------|-------------------------------|----------|
-| Confluence | status.atlassian.com (has JSON API — consider upgrading to statuspage_json) | engineering |
-| Jira | status.atlassian.com (same — consider statuspage_json with component filter) | engineering |
-| ServiceDesk (JSM) | status.atlassian.com (same) | engineering |
-| Coupa | (no known status page) | finance |
-| Juniper VPN | (no public status page) | networking |
-| Lithium | (no known status page) | other |
-| Netsuite | (verify — Oracle may have a status page) | finance |
-| Workday | (verify — Workday has a trust site) | hr |
-| Partnerportal | (Salesforce instance — use Salesforce status) | sales |
-
-**NOTE: Confluence, Jira, and ServiceDesk are all Atlassian products. Atlassian has a Statuspage.io-based status page at `https://status.atlassian.com/api/v2/summary.json`. Claude Code should verify this API and if accessible, upgrade these from `manual` to `statuspage_json` with appropriate component name filters. This would reduce manual services from ~10 to ~7.**
+| Team wiki | status.atlassian.com (has JSON API — consider upgrading to statuspage_json) | engineering |
+| Ticketing / ITSM system | status.atlassian.com (same — consider statuspage_json with component filter) | engineering |
+| ServiceDesk | status.atlassian.com (same) | engineering |
+| Finance tools (procurement) | (no known status page) | finance |
+| VPN | (no public status page) | networking |
+| Community platform | (no known status page) | other |
+| Finance tools (ERP) | (verify — vendor may have a status page) | finance |
+| HR system | (verify — vendor has a trust site) | hr |
+| Partner portal | (CRM instance — use CRM status) | sales |
+
+**NOTE: The team wiki, ticketing / ITSM system, and ServiceDesk are all products from the same vendor (Atlassian). Atlassian has a Statuspage.io-based status page at `https://status.atlassian.com/api/v2/summary.json`. Claude Code should verify this API and if accessible, upgrade these from `manual` to `statuspage_json` with appropriate component name filters. This would reduce manual services from ~10 to ~7.**
### Service Categories
```yaml
categories:
- identity: "Identity & Access" # Okta, Duo
- productivity: "Productivity" # Box, DocuSign, Google Mail, Google Calendar, Teem, Conga
- collaboration: "Collaboration" # Slack, Zoom, RingCentral
- engineering: "Engineering" # Jira, Confluence, ServiceDesk, SnapLogic
- hr: "HR & People" # Greenhouse, Workday, Cornerstone
- finance: "Finance" # Concur, Coupa, Netsuite, Zuora
- sales: "Sales & CRM" # Salesforce, Partnerportal
- marketing: "Marketing" # Iterable, Marketo
- networking: "Network & VPN" # Juniper VPN
- support: "Support" # Zendesk
+ identity: "Identity & Access" # identity provider (SSO), MFA
+ productivity: "Productivity" # content platform, e-signature, cloud mail, cloud calendar, space management, document automation
+ collaboration: "Collaboration" # chat platform, video conferencing, telephony
+ engineering: "Engineering" # ticketing / ITSM system, team wiki, ServiceDesk, integration platform
+ hr: "HR & People" # HR tools (ATS), HR system, HR tools (LMS)
+ finance: "Finance" # finance tools (expense), finance tools (procurement), finance tools (ERP), finance tools (billing)
+ sales: "Sales & CRM" # CRM, partner portal
+ marketing: "Marketing" # marketing tools (email), marketing tools (automation)
+ networking: "Network & VPN" # VPN
+ support: "Support" # support platform
```
---
@@ -338,9 +338,9 @@ STATUSPAGE_INDICATOR_MAP = {
}
```
-### Slack Status API Mapping
+### Current-status API mapping
```python
-def normalize_slack_status(response: dict) -> ServiceStatus:
+def normalize_current_status(response: dict) -> ServiceStatus:
if response["status"] == "ok" and not response.get("active_incidents"):
return ServiceStatus.OPERATIONAL
incidents = response.get("active_incidents", [])
@@ -353,9 +353,9 @@ def normalize_slack_status(response: dict) -> ServiceStatus:
return ServiceStatus.DEGRADED # default if active but unknown type
```
-### Google Workspace Mapping
+### Product-feed mapping
```python
-# Google incidents have severity levels in updates
+# Cloud productivity suite incidents have severity levels in updates
# If no active incident for the product → OPERATIONAL
# If active incident exists → DEGRADED (default), escalate based on description keywords
```
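A minimal sketch of the product-feed mapping outlined in the comments above. It assumes each incident is a dict with a `service_name`, an `end` timestamp that stays empty while the incident is active, and a free-text `description`. The `end` and `description` field names, the `normalize_product_feed` helper, and the keyword table are illustrative assumptions, not confirmed fields of the vendor feed or names from this codebase.

```python
from enum import Enum


class ServiceStatus(Enum):
    OPERATIONAL = "operational"
    DEGRADED = "degraded"
    PARTIAL_OUTAGE = "partial_outage"
    MAJOR_OUTAGE = "major_outage"


# Least to most severe; used to pick the worst status across incidents.
_SEVERITY_ORDER = [
    ServiceStatus.OPERATIONAL,
    ServiceStatus.DEGRADED,
    ServiceStatus.PARTIAL_OUTAGE,
    ServiceStatus.MAJOR_OUTAGE,
]

# Hypothetical escalation keywords; the real table is not specified here.
ESCALATION_KEYWORDS = {
    "outage": ServiceStatus.MAJOR_OUTAGE,
    "unavailable": ServiceStatus.MAJOR_OUTAGE,
    "disruption": ServiceStatus.PARTIAL_OUTAGE,
}


def normalize_product_feed(incidents: list[dict], product: str) -> ServiceStatus:
    """Derive a current status for one product from an incident-history feed."""
    # An incident counts as active when it matches the product and has no end time.
    active = [
        inc for inc in incidents
        if inc.get("service_name") == product and not inc.get("end")
    ]
    if not active:
        return ServiceStatus.OPERATIONAL

    worst = ServiceStatus.DEGRADED  # default for any active incident
    for inc in active:
        text = (inc.get("description") or "").lower()
        for keyword, status in ESCALATION_KEYWORDS.items():
            if keyword in text and (
                _SEVERITY_ORDER.index(status) > _SEVERITY_ORDER.index(worst)
            ):
                worst = status
    return worst
```

Keeping the keyword table separate from the function means the escalation rules can be tuned without touching the filtering logic.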
@@ -378,69 +378,69 @@ RSS_TITLE_KEYWORDS = {
# Format: upstream → downstream (when upstream breaks, downstream is impacted)
dependencies:
# Identity & Access — highest blast radius
- okta:
- - service: box
- impact: "Box SSO login unavailable"
+ identity-provider:
+ - service: content-platform
+ impact: "Content platform SSO login unavailable"
severity: critical
- - service: slack
- impact: "Slack SSO login may fail for new sessions"
+ - service: chat-platform
+ impact: "Chat platform SSO login may fail for new sessions"
severity: critical
- - service: zoom
- impact: "Zoom SSO login unavailable"
+ - service: video-conferencing
+ impact: "Video conferencing SSO login unavailable"
severity: critical
- - service: salesforce
- impact: "Salesforce SSO login unavailable"
+ - service: crm
+ impact: "CRM SSO login unavailable"
severity: critical
- - service: jira
- impact: "Jira SSO login unavailable"
+ - service: itsm
+ impact: "Ticketing / ITSM system SSO login unavailable"
severity: high
- - service: confluence
- impact: "Confluence SSO login unavailable"
+ - service: team-wiki
+ impact: "Team wiki SSO login unavailable"
severity: high
- - service: concur
- impact: "Concur SSO login unavailable"
+ - service: finance-expense
+ impact: "Finance tools (expense) SSO login unavailable"
severity: high
- - service: workday
- impact: "Workday SSO login unavailable"
+ - service: hr-system
+ impact: "HR system SSO login unavailable"
severity: high
- - service: greenhouse
- impact: "Greenhouse SSO login unavailable"
+ - service: hr-ats
+ impact: "HR tools (ATS) SSO login unavailable"
severity: medium
- - service: docusign
- impact: "DocuSign SSO login unavailable"
+ - service: esignature
+ impact: "E-signature tool SSO login unavailable"
severity: medium
- - service: zendesk
- impact: "Zendesk SSO login unavailable"
+ - service: support-platform
+ impact: "Support platform SSO login unavailable"
severity: medium
- - service: netsuite
- impact: "NetSuite SSO login unavailable"
+ - service: finance-erp
+ impact: "Finance tools (ERP) SSO login unavailable"
severity: medium
- duo:
- - service: okta
- impact: "MFA push notifications unavailable; Okta login may require fallback methods"
+ mfa:
+ - service: identity-provider
+ impact: "MFA push notifications unavailable; identity provider login may require fallback methods"
severity: critical
# Collaboration dependencies
- slack:
+ chat-platform:
- service: servicedesk
- impact: "Slack-based IT support channel and Aisera bot unavailable"
+ impact: "Chat-based IT support channel and bot unavailable"
severity: high
- # Google Workspace
- google-mail:
- - service: google-calendar
+ # Cloud productivity suite
+ cloud-mail:
+ - service: cloud-calendar
impact: "Calendar notifications and email invites may be delayed"
severity: medium
# Sales & CRM
- salesforce:
- - service: partnerportal
- impact: "Partner Portal is hosted on Salesforce — full outage expected"
+ crm:
+ - service: partner-portal
+ impact: "Partner Portal is hosted on the CRM — full outage expected"
severity: critical
# Network
- juniper-vpn:
+ vpn:
- service: all_internal
impact: "VPN outage affects remote access to all internal services"
severity: critical
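The upstream → downstream mapping above could be queried along these lines. This is a sketch only: it inlines a small excerpt of the graph as a Python dict rather than loading `dependencies.yaml`, the impact strings are shortened, and the `transitive` flag is an assumption, since the roadmap does not say whether `get_downstream` follows indirect dependencies.

```python
from collections import deque

# Excerpt of the dependency graph, in the same upstream -> downstream shape.
DEPENDENCIES = {
    "mfa": [
        {"service": "identity-provider", "impact": "MFA push unavailable", "severity": "critical"},
    ],
    "identity-provider": [
        {"service": "crm", "impact": "CRM SSO login unavailable", "severity": "critical"},
        {"service": "team-wiki", "impact": "Team wiki SSO login unavailable", "severity": "high"},
    ],
    "crm": [
        {"service": "partner-portal", "impact": "Full outage expected", "severity": "critical"},
    ],
}


def get_downstream(service_id: str, graph: dict = DEPENDENCIES,
                   transitive: bool = False) -> list[dict]:
    """Return downstream edges for a service, optionally following indirect ones."""
    results = []
    seen = {service_id}  # guards against cycles in the graph
    queue = deque([service_id])
    while queue:
        current = queue.popleft()
        for edge in graph.get(current, []):
            if edge["service"] in seen:
                continue
            seen.add(edge["service"])
            results.append(edge)
            if transitive:
                queue.append(edge["service"])
    return results
```

With `transitive=True`, an MFA outage would also surface the services that sit behind the identity provider, which may or may not be the behavior wanted for alert text.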
@@ -467,17 +467,17 @@ TEMPLATES = {
"with_downstream": (
" This may impact: {downstream_list}."
),
- "okta_degraded": (
- "Okta is reporting degraded performance. SSO authentication "
+ "sso_degraded": (
+ "The identity provider (SSO) is reporting degraded performance. SSO authentication "
"for all SaaS applications may be affected. Impacted services: {downstream_list}."
),
- "okta_outage": (
- "⚠️ Okta is experiencing an outage. SSO authentication is unavailable. "
+ "sso_outage": (
+ "⚠️ The identity provider (SSO) is experiencing an outage. SSO authentication is unavailable. "
"Users cannot log into: {downstream_list}. "
"Advise users with active sessions to avoid logging out."
),
"vpn_outage": (
- "⚠️ VPN (Juniper) is experiencing an outage. "
+ "⚠️ VPN is experiencing an outage. "
"Remote users cannot access internal services. "
"On-site users are not affected."
),
@@ -493,7 +493,7 @@ TEMPLATES = {
## Slack Alert Format (Block Kit)
-Alerts posted to #service-validation use Slack Block Kit for rich formatting:
+Alerts posted to the ops-alert channel use Slack Block Kit for rich formatting:
```python
def build_slack_alert(service_name: str, old_status: str, new_status: str,
@@ -702,14 +702,14 @@ npm install -D tailwindcss @tailwindcss/vite
**In scope (v1 demo):**
- Unified status board for all ~30 services
- Statuspage.io JSON API polling for ~20 services
-- Slack Status API polling
-- Google Workspace JSON/RSS polling
+- Chat-platform status API polling
+- Cloud productivity suite JSON/RSS polling
- Manual status update API for remaining services
- Service dependency graph with impact statement templates
- Timeline view of recent status changes
- Situation banner with template-generated summary
- Scheduled maintenance tracking and display
-- Slack Block Kit alerts to #service-validation on status changes
+- Slack Block Kit alerts to the ops-alert channel on status changes
- Auto-refresh dashboard (30s polling)
- Service categorization and grouped display
- "Last updated" indicator showing poll freshness
@@ -739,7 +739,7 @@ npm install -D tailwindcss @tailwindcss/vite
- **Slack webhook URL:** env var `SLACK_WEBHOOK_URL` — never in git
- **All vendor status APIs:** public, no auth
- **No user data collected:** dashboard is read-only, no PII
-- **Network boundary:** Mac Mini on corporate network, VPN-only access
+- **Network boundary:** Mac Mini on the internal network, internal-network-only access
- **SQLite:** local file on Mac Mini, not exposed
- `.env` file in `.gitignore`, `.env.example` committed as template
@@ -758,7 +758,7 @@ npm install -D tailwindcss @tailwindcss/vite
5. Create `config/services.yaml` with all ~30 services — **Acceptance:** File contains every service from the catalog above. For each statuspage_json service, the `poll_url` has been verified by running `curl ` and confirming valid JSON response. Services with unverified URLs are noted with comments.
6. Create `config/dependencies.yaml` — **Acceptance:** Contains the full dependency graph from this roadmap
7. Implement YAML loader + DB seeder — **Acceptance:** Run seeder → `SELECT count(*) FROM services` returns correct count (≥28)
-8. Implement `statuspage_poller.py` — poll ONE service (Okta) via `https://status.okta.com/api/v2/summary.json` — **Acceptance:** Returns parsed status, prints component statuses to console
+8. Implement `statuspage_poller.py` — poll ONE service (identity provider) via its Statuspage.io summary URL — **Acceptance:** Returns parsed status, prints component statuses to console
9. Implement `normalizer.py` with all mapping tables — **Acceptance:** `pytest tests/test_normalizer.py` passes with tests for every vendor mapping
**Verification checklist:**
@@ -767,11 +767,11 @@ npm install -D tailwindcss @tailwindcss/vite
- [ ] `sqlite3 data.db "SELECT count(*) FROM services"` → ≥28
- [ ] `sqlite3 data.db "SELECT id, poll_type FROM services WHERE poll_type='statuspage_json'"` → ~16-20 rows
- [ ] `pytest tests/test_normalizer.py` → all pass
-- [ ] Manual test: `python -c "import asyncio; from app.poller.statuspage_poller import poll_service; asyncio.run(poll_service('okta'))"` → prints Okta status
+- [ ] Manual test: `python -c "import asyncio; from app.poller.statuspage_poller import poll_service; asyncio.run(poll_service('identity-provider'))"` → prints identity provider status
**Risks:**
- Some Statuspage.io URLs may have changed or use custom domains → **Mitigation:** Phase 0 Task 5 explicitly requires verifying each URL. Document any failures and fall back to RSS or manual.
-- Google's JSON feed URL may not be publicly documented and could change → **Mitigation:** Fall back to RSS at `https://www.google.com/appsstatus/rss/en`
+- The cloud productivity suite JSON feed URL may not be publicly documented and could change → **Mitigation:** Fall back to RSS at `https://feed.example.com/rss`
---
@@ -782,14 +782,14 @@ npm install -D tailwindcss @tailwindcss/vite
**Tasks:**
1. Extend `statuspage_poller.py` to handle ALL statuspage_json services in a single poll cycle — **Acceptance:** One async function iterates `services.yaml`, polls each statuspage_json service, handles errors per-service (one failure doesn't stop others)
-2. Implement `slack_poller.py` for Slack Status API — **Acceptance:** Polls `https://slack-status.com/api/v2.0.0/current`, normalizes response to ServiceStatus
-3. Implement `google_poller.py` for Google Workspace — **Acceptance:** Fetches Google JSON/RSS feed, filters for Gmail and Calendar, returns per-product status
+2. Implement `current_status_poller.py` for the chat-platform status API — **Acceptance:** Polls `https://chat-status.example.com/api/v2.0.0/current`, normalizes response to ServiceStatus
+3. Implement `product_feed_poller.py` for the cloud productivity suite — **Acceptance:** Fetches JSON/RSS feed, filters for Cloud Mail and Cloud Calendar, returns per-product status
4. Implement `change_detector.py` — **Acceptance:** Compares poll result against `services.current_status` in DB; on change: inserts `status_events` row, updates `services` row, returns list of changes
-5. Implement `dependencies/graph.py` — **Acceptance:** `test_dependencies.py` passes; `get_downstream("okta")` returns all SSO-dependent services with impact descriptions
-6. Implement `alerting/templates.py` — **Acceptance:** `test_templates.py` passes; generates correct impact statements for Okta outage, VPN outage, generic service degradation
-7. Implement `alerting/slack.py` with Block Kit formatting — **Acceptance:** Trigger a test change → message appears in #service-validation with header, status fields, impact text, and button linking to vendor status page
+5. Implement `dependencies/graph.py` — **Acceptance:** `test_dependencies.py` passes; `get_downstream("identity-provider")` returns all SSO-dependent services with impact descriptions
+6. Implement `alerting/templates.py` — **Acceptance:** `test_templates.py` passes; generates correct impact statements for identity provider outage, VPN outage, generic service degradation
+7. Implement `alerting/slack.py` with Block Kit formatting — **Acceptance:** Trigger a test change → message appears in the ops-alert channel with header, status fields, impact text, and button linking to vendor status page
8. Implement `scheduler.py` tying it all together — **Acceptance:** On app startup, scheduler begins 60s poll cycle. Logs show all services polled. No unhandled exceptions.
-9. Implement `router_admin.py` POST `/api/admin/status` — **Acceptance:** `curl -X POST localhost:8000/api/admin/status -H 'Content-Type: application/json' -d '{"service_id":"workday","new_status":"degraded","detail":"Slow response times"}'` → returns updated service, creates status_event, triggers Slack alert
+9. Implement `router_admin.py` POST `/api/admin/status` — **Acceptance:** `curl -X POST localhost:8000/api/admin/status -H 'Content-Type: application/json' -d '{"service_id":"hr-system","new_status":"degraded","detail":"Slow response times"}'` → returns updated service, creates status_event, triggers Slack alert
10. Implement all GET API endpoints (services, timeline, summary, maintenance) — **Acceptance:** Each returns correct JSON matching the Pydantic response models
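The change-detection acceptance criterion in task 4 can be sketched as follows. The `services.current_status` column comes from the task description; the `status_events` column names and the `detect_changes_sketch` name are assumptions made for illustration.

```python
import sqlite3
from datetime import datetime, timezone


def detect_changes_sketch(conn: sqlite3.Connection,
                          poll_results: dict[str, str]) -> list[dict]:
    """Compare polled statuses with the DB and record any differences."""
    changes = []
    now = datetime.now(timezone.utc).isoformat()
    for service_id, new_status in poll_results.items():
        row = conn.execute(
            "SELECT current_status FROM services WHERE id = ?", (service_id,)
        ).fetchone()
        # Skip unknown services and unchanged statuses.
        if row is None or row[0] == new_status:
            continue
        old_status = row[0]
        # Assumed status_events schema: service_id, old_status, new_status, created_at.
        conn.execute(
            "INSERT INTO status_events (service_id, old_status, new_status, created_at) "
            "VALUES (?, ?, ?, ?)",
            (service_id, old_status, new_status, now),
        )
        conn.execute(
            "UPDATE services SET current_status = ? WHERE id = ?",
            (new_status, service_id),
        )
        changes.append(
            {"service_id": service_id, "old_status": old_status, "new_status": new_status}
        )
    conn.commit()
    return changes
```

Returning the list of changes lets the caller decide separately whether each one should trigger an alert.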
**Verification checklist:**
@@ -797,7 +797,7 @@ npm install -D tailwindcss @tailwindcss/vite
- [ ] `curl localhost:8000/api/timeline` → events (test with manual status change if no real incidents)
- [ ] `curl localhost:8000/api/summary` → correct counts, active incidents list, maintenance list
- [ ] `curl localhost:8000/api/maintenance` → upcoming maintenances from Statuspage.io
-- [ ] POST a degraded status manually → Slack Block Kit message appears in #service-validation within 5 seconds
+- [ ] POST a degraded status manually → Slack Block Kit message appears in the ops-alert channel within 5 seconds
- [ ] Wait 2 minutes → see at least 2 poll cycles in logs, no errors
- [ ] `pytest tests/` → all tests pass
@@ -852,15 +852,15 @@ npm install -D tailwindcss @tailwindcss/vite
2. Configure FastAPI to serve static frontend — **Acceptance:** Mount `dist/` as static files at `/`; `curl localhost:8000/` returns `index.html`
3. Create `scripts/seed_demo_data.py` — **Acceptance:** Seeds 5-7 historical incidents over the past 7 days across different services with realistic timestamps, status progressions (investigating → identified → monitoring → resolved), and impact statements. Timeline view looks populated, not empty.
4. Create `.env.example` — **Acceptance:** Contains `SLACK_WEBHOOK_URL=`, `DATABASE_PATH=./data.db`, `POLL_INTERVAL_SECONDS=60`, `HOST=0.0.0.0`, `PORT=8000`
-5. Create `com.box.it-health-dashboard.plist` launchd service file — **Acceptance:** Starts on boot, restarts on crash, logs to `/var/log/it-health-dashboard.log`
-6. Deploy to Mac Mini — **Acceptance:** Clone repo, install deps, configure `.env`, load launchd plist, verify `curl :8000/api/health` from another machine on VPN
-7. Open macOS firewall for port 8000 — **Acceptance:** Dashboard accessible from another laptop on VPN
-8. End-to-end smoke test — **Acceptance:** From a different machine: load dashboard, see live statuses, verify at least 16+ services show non-"unknown" statuses, trigger manual status change, see Slack alert + dashboard update within 60s
+5. Create `com.company.it-health-dashboard.plist` launchd service file — **Acceptance:** Starts on boot, restarts on crash, logs to `/var/log/it-health-dashboard.log`
+6. Deploy to Mac Mini — **Acceptance:** Clone repo, install deps, configure `.env`, load launchd plist, verify `curl :8000/api/health` from another machine on the internal network
+7. Open macOS firewall for port 8000 — **Acceptance:** Dashboard accessible from another laptop on the internal network
+8. End-to-end smoke test — **Acceptance:** From a different machine on the internal network: load dashboard, see live statuses, verify at least 16 services show non-"unknown" statuses, trigger manual status change, see Slack alert + dashboard update within 60s
9. Write `README.md` — **Acceptance:** Contains: project overview, architecture diagram (text), how to access (URL), what it shows, how to manually update services (curl examples), environment setup instructions, what's planned next (v2 features)
-10. Prepare demo script — **Acceptance:** 2-3 talking points for Mark: (a) show live dashboard with real statuses, (b) trigger a simulated incident and show Slack alert + dashboard update, (c) click a service to show dependency mapping
+10. Prepare demo script — **Acceptance:** 2-3 talking points: (a) show live dashboard with real statuses, (b) trigger a simulated incident and show Slack alert + dashboard update, (c) click a service to show dependency mapping
**Verification checklist:**
-- [ ] From another laptop on VPN: `http://:8000` → dashboard loads
+- [ ] From another laptop on the internal network: `http://:8000` → dashboard loads
- [ ] ≥16 services show live statuses (not all "unknown")
- [ ] Manual-only services show "unknown" or manually-set statuses
- [ ] Timeline shows seeded + any real events
@@ -870,7 +870,7 @@ npm install -D tailwindcss @tailwindcss/vite
**Risks:**
- Mac Mini firewall blocking inbound → **Mitigation:** Run `sudo /usr/libexec/ApplicationFirewall/socketfilterfw --add /usr/local/bin/python3` or disable firewall for port 8000 specifically. Test from another machine early in Phase 3.
-- DNS/hostname on VPN → **Mitigation:** Use raw IP for demo; request DNS alias from network team if project continues
+- DNS/hostname on the internal network → **Mitigation:** Use raw IP for demo; request DNS alias from network team if project continues
- Stale data after Mac Mini sleep → **Mitigation:** Disable sleep in System Preferences → Energy Saver. Verify poll cycle resumes after network reconnect.
---
@@ -884,11 +884,11 @@ npm install -D tailwindcss @tailwindcss/vite
**v3 — Internal Signal Correlation (Weeks 5-8):**
- Splunk integration: pull auth failure logs, network errors, app-specific events
-- JSM ticket correlation: count open tickets mentioning affected service names
-- Dashboard enrichment: "Okta degraded + 47 SSO tickets in last 30min + Splunk showing auth failures"
+- ITSM ticket correlation: count open tickets mentioning affected service names
+- Dashboard enrichment: "Identity provider (SSO) degraded + 47 SSO tickets in last 30min + Splunk showing auth failures"
**v4 — Proactive Detection + Slack Bot (Weeks 9-12):**
- ThousandEyes + Datadog integration for network and APM signals
- Anomaly detection: alert when ticket volume spikes before vendor status page updates
-- Slack bot: `@it-agent what's going on with Okta?` → returns correlated intelligence
+- Slack bot: `@it-agent what's going on with the identity provider?` → returns correlated intelligence
- GitHub change correlation: "These 3 config changes were deployed in the last hour"
diff --git a/PRODUCTION-ROADMAP.md b/PRODUCTION-ROADMAP.md
index bf0b3fb..5f810b9 100644
--- a/PRODUCTION-ROADMAP.md
+++ b/PRODUCTION-ROADMAP.md
@@ -110,7 +110,7 @@ Alert fatigue is the #1 killer of status dashboards. Current pipeline fires on e
### Severity routing (config-driven)
- [x] `tier` (`critical | important | informational`) + `slack_channel_override` added to `services.yaml`, `ServiceConfig`, and the DB via migration 0006.
-- [x] `okta`, `duo`, `slack` tagged `critical`; everything else defaults to `important` so operators explicitly elect services into the `@here` tier.
+- [x] The identity provider (SSO), MFA, and chat platform tagged `critical`; everything else defaults to `important` so operators explicitly elect services into the `@here` tier.
- [x] `route_status_change()` applies routing:
- `critical` → Slack + `` mention
- `important` → Slack, no mention
@@ -218,7 +218,7 @@ If the app goes down, nobody knows. Fix meta-monitoring.
### Backup: Litestream
- [x] `deploy/litestream.yml.example` — template supporting local-file, S3, and SFTP replicas (operator picks one).
-- [x] `deploy/com.box.it-health-dashboard-litestream.plist.example` — sidecar launchd daemon with `KeepAlive` dict form, `ThrottleInterval`, macOS Keychain-sourced credentials (never hardcoded).
+- [x] `deploy/com.company.it-health-dashboard-litestream.plist.example` — sidecar launchd daemon with `KeepAlive` dict form, `ThrottleInterval`, macOS Keychain-sourced credentials (never hardcoded).
- [x] README "Backup & Disaster Recovery" section documents install → validate → replicate → restore with exact commands.
### Retention
@@ -316,7 +316,7 @@ If the app goes down, nobody knows. Fix meta-monitoring.
- [x] End-to-end integration — new `tests/test_e2e_pipeline.py::test_poll_change_db_alert_pipeline`. Drives three `respx`-mocked polls through `poll_statuspage → detect_changes → process_changes → Slack POST`, asserts flap suppression holds the first major_outage reading, change emits on the second, Slack webhook receives one POST with ``, and `alerts_sent_total{kind=status_change,severity=critical}` increments exactly once.
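The behavior the test asserts (hold the first `major_outage` reading, emit on the second) matches a consecutive-reading confirmation scheme. The sketch below shows one way such a scheme could work; the project's actual suppression logic is not shown in this diff, and the `FlapSuppressor` class and its parameters are hypothetical.

```python
from typing import Optional


class FlapSuppressor:
    """Confirm a status change only after N consecutive identical readings."""

    def __init__(self, required_readings: int = 2):
        self.required = required_readings
        # service_id -> (candidate status, consecutive count)
        self._pending: dict[str, tuple[str, int]] = {}

    def observe(self, service_id: str, confirmed_status: str,
                reading: str) -> Optional[str]:
        """Return the new status once confirmed, otherwise None."""
        if reading == confirmed_status:
            # Back to the confirmed status: drop any pending candidate.
            self._pending.pop(service_id, None)
            return None
        status, count = self._pending.get(service_id, (reading, 0))
        # Restart the count when the candidate status changes.
        count = count + 1 if status == reading else 1
        if count >= self.required:
            self._pending.pop(service_id, None)
            return reading
        self._pending[service_id] = (reading, count)
        return None
```

With a 60-second poll interval and `required_readings=2`, this trades roughly one extra poll cycle of latency for fewer alerts on transient blips.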
### launchd hardening
-- [x] `com.box.it-health-dashboard.plist` rewritten with dict-form `KeepAlive` (`SuccessfulExit=false`, `Crashed=true`) so deliberate stops stick.
+- [x] `com.company.it-health-dashboard.plist` rewritten with dict-form `KeepAlive` (`SuccessfulExit=false`, `Crashed=true`) so deliberate stops stick.
- [x] `ThrottleInterval=30` prevents crash-loop pegging on bad config.
- [x] `PYTHONUNBUFFERED=1` so stdout reaches the log file in real time.
- [x] `ProcessType=Background`, `SoftResourceLimits.NumberOfFiles=4096`.
@@ -353,7 +353,7 @@ If the app goes down, nobody knows. Fix meta-monitoring.
- **Postmortem automation** — Google-SRE-template Markdown per incident, committed to a repo (Summary → Impact → Root Cause → Timeline → What Went Well/Poorly/Lucky → Action Items categorized Prevent/Mitigate/Detect/Repair).
- **SLO view** — Grafana-style fuel gauge (remaining error budget) + burn-rate line with 1× / 6× / 14.4× thresholds per tier.
- **Multi-burn-rate alerting** — Google SRE canonical pattern: require both long and short window to breach before paging.
-- **Slack bot** — `/itstatus okta` slash command, natural-language deferred to post-LLM phase.
+- **Slack bot** — `/itstatus` slash command (e.g. query by service name), natural-language deferred to post-LLM phase.
---
diff --git a/README.md b/README.md
index 856a0a3..aefadc4 100644
--- a/README.md
+++ b/README.md
@@ -1,13 +1,13 @@
# IT Service Health Dashboard
-Real-time status monitoring dashboard for ~30 SaaS services used by Box IT. Polls vendor status pages every 60 seconds, detects changes, generates impact statements using a service dependency graph, posts Slack alerts, and displays a unified dark-themed operations dashboard.
+Real-time status monitoring dashboard for ~30 SaaS services used across an enterprise IT environment. Polls vendor status pages every 60 seconds, detects changes, generates impact statements using a service dependency graph, posts Slack alerts, and displays a unified dark-themed operations dashboard.
## Project status
- **v1 (demo-ready) — SHIPPED.** All original spec delivered: polling, normalization, change detection, Slack alerting, React UI, dependency graph, timeline, SLA tracking, incident clustering, auto reports.
-- **v2 (production-ready) — SHIPPED.** Phases 0–6 of the production roadmap complete: bearer-token auth, vendor resilience (stamina + purgatory), alert quality (flap suppression, dedup, tier routing, dependency correlation, maintenance windows, flapping-badge UI), observability (structlog, Prometheus `/metrics`, Sentry, Healthchecks.io dead-man's switch), data lifecycle (production pragmas, retention, Litestream streaming + daily `VACUUM INTO` snapshot), UX productionization (severity-sorted grid, distinct poller-broken state, a11y + keyboard nav, Executive/Engineer view toggle, PWA, `recharts` SLA trend), and platform polish (CI, pre-commit, hardened launchd plist, Caddy, Keychain secrets). **356 tests passing.**
+- **v2 (production-ready) — SHIPPED.** Phases 0–6 of the production roadmap complete: bearer-token auth, vendor resilience (stamina + purgatory), alert quality (flap suppression, dedup, tier routing, dependency correlation, maintenance windows, flapping-badge UI), observability (structlog, Prometheus `/metrics`, Sentry, Healthchecks.io dead-man's switch), data lifecycle (production pragmas, retention, Litestream streaming + daily `VACUUM INTO` snapshot), UX productionization (severity-sorted grid, distinct poller-broken state, a11y + keyboard nav, Executive/Engineer view toggle, PWA, `recharts` SLA trend), and platform polish (CI, pre-commit, hardened launchd plist, Caddy, Keychain secrets). **378 tests passing.**
- **v2 Phase 2B + Phase 7 — in tree, gated off.** Statuspage inbound webhook receiver (`WEBHOOKS_ENABLED`), Slack ack flow (`SLACK_ACK_ENABLED`), postmortem drafts (`POSTMORTEMS_ENABLED`), SLO fuel-gauge + multi-burn-rate alerting (`SLO_BURN_RATE_ENABLED`), and Slack `/itstatus` slash command (`SLACK_SLASH_ENABLED`) all shipped with tests but default off. Flip each flag once its prerequisites are in place (public endpoint for Slack features; postmortems need only a writable `POSTMORTEMS_DIR`).
-- **v2 Phase 7 remainder — optional.** LLM-layer impact statements, Splunk/JSM/ThousandEyes integration. Not on a fixed schedule; add as demand emerges.
+- **v2 Phase 7 remainder — optional.** LLM-layer impact statements; log-aggregation / ITSM / synthetic-monitoring integrations. Not on a fixed schedule; add as demand emerges.
**Active roadmap:** [PRODUCTION-ROADMAP.md](./PRODUCTION-ROADMAP.md) — exit-criteria detail for every phase.
**Historical spec:** [IMPLEMENTATION-ROADMAP.md](./IMPLEMENTATION-ROADMAP.md) — archived; v1 is complete.
@@ -17,8 +17,8 @@ Real-time status monitoring dashboard for ~30 SaaS services used by Box IT. Poll
```
[Vendor Status Pages]
|-- Statuspage.io JSON API (15 services)
- |-- Slack Status API (1 service)
- |-- Google Workspace JSON feed (2 services)
+ |-- Chat vendor status API (1 service)
+ |-- Productivity suite JSON feed (2 services)
|-- Manual updates via POST /api/admin/status (11 services)
| (async poll every 60s)
[Poll Orchestrator]
@@ -28,7 +28,7 @@ Real-time status monitoring dashboard for ~30 SaaS services used by Box IT. Poll
[Change Detector] --> diff against DB, write status_events
|
[Impact Statement Engine] --> dependency graph + templates
- [Slack Alerter] --> Block Kit message to #service-validation
+ [Slack Alerter] --> Block Kit message to the ops-alert channel
[SQLite Writer] --> update services, insert events
|
[FastAPI REST API] --> /api/services, /api/timeline, /api/summary
@@ -60,32 +60,40 @@ Open `http://localhost:8000` in your browser.
## Accessing the Dashboard
-The dashboard runs on a Mac Mini on the corporate network. Access via VPN at:
+The dashboard runs on a Mac Mini on the internal network. Access it at:
```
-http://:8000
+http://:8000
```
-No authentication required — VPN access is the security boundary.
+No authentication is required to view the dashboard; internal-network access is the security boundary.
## Service Categories
-| Category | Services |
-|----------|----------|
-| Identity & Access | Okta, Duo |
-| Productivity | Box, DocuSign, Google Mail, Google Calendar, Conga, Eptura |
-| Collaboration | Slack, Zoom, RingCentral |
-| Engineering | Confluence, Jira, Jira Service Management, SnapLogic |
-| HR & People | Greenhouse, Workday, Cornerstone |
-| Finance | SAP Concur, Coupa, NetSuite, Zuora |
-| Sales & CRM | Salesforce, Partner Portal |
-| Marketing | Iterable, Marketo |
-| Network & VPN | Juniper VPN |
-| Support | Zendesk, Lithium |
+Services are organized into ten categories. The committed example registry
+(`backend/config/services.yaml`) ships a generic, runnable set that monitors
+public developer-tool status pages, so the dashboard works immediately after
+clone:
+
+| Category | Example services |
+|----------|------------------|
+| Identity & Access | Identity provider (SSO) |
+| Engineering | GitHub, npm, PyPI, Sentry |
+| Productivity | Dropbox |
+| Collaboration | Discord |
+| Network & VPN | Cloudflare |
+| Support | Ticketing / ITSM |
+| Other | Datadog |
+
+To monitor your own organization's services, copy the example to a gitignored
+`backend/config/services.local.yaml` (the loader prefers it when present) and
+list your real registry there — see that file's header for the schema and the
+full category list (identity, productivity, collaboration, engineering, HR,
+finance, sales, marketing, networking, support).
## Manual Status Updates
-For services without automated polling (Okta, Workday, Concur, etc.), update status via curl. **Admin endpoints require a bearer token** (set `ADMIN_API_TOKEN` in your env).
+For services without automated polling (e.g. an identity provider, an HR system, or any service with no public status API), update status via curl. **Admin endpoints require a bearer token** (set `ADMIN_API_TOKEN` in your env).
```bash
export TOKEN="your-admin-token"
@@ -94,19 +102,19 @@ export TOKEN="your-admin-token"
curl -X POST http://localhost:8000/api/admin/status \
-H "Authorization: Bearer $TOKEN" \
-H 'Content-Type: application/json' \
- -d '{"service_id": "workday", "new_status": "degraded", "detail": "Slow login page", "reason": "Reported by user in #it-help"}'
+ -d '{"service_id": "hr-system", "new_status": "degraded", "detail": "Slow login page", "reason": "Reported by user in the help channel"}'
# Set to major outage
curl -X POST http://localhost:8000/api/admin/status \
-H "Authorization: Bearer $TOKEN" \
-H 'Content-Type: application/json' \
- -d '{"service_id": "okta", "new_status": "major_outage", "detail": "SSO completely unavailable", "reason": "Confirmed with vendor"}'
+ -d '{"service_id": "identity-provider", "new_status": "major_outage", "detail": "SSO completely unavailable", "reason": "Confirmed with vendor"}'
# Resolve (set back to operational)
curl -X POST http://localhost:8000/api/admin/status \
-H "Authorization: Bearer $TOKEN" \
-H 'Content-Type: application/json' \
- -d '{"service_id": "okta", "new_status": "operational", "reason": "Vendor posted recovery"}'
+ -d '{"service_id": "identity-provider", "new_status": "operational", "reason": "Vendor posted recovery"}'
```
Valid statuses: `operational`, `degraded`, `partial_outage`, `major_outage`, `unknown`. The `reason` field is required for audit trail.
@@ -115,7 +123,7 @@ Valid statuses: `operational`, `degraded`, `partial_outage`, `major_outage`, `un
| Variable | Default | Description |
|----------|---------|-------------|
-| `SLACK_WEBHOOK_URL` | _(none)_ | Slack incoming webhook URL for #service-validation alerts |
+| `SLACK_WEBHOOK_URL` | _(none)_ | Slack incoming webhook URL for ops-alert channel notifications |
| `DATABASE_PATH` | `data.db` | SQLite database file path |
| `POLL_INTERVAL_SECONDS` | `60` | How often to poll vendor status pages (1–3600) |
| `HOST` | `127.0.0.1` | Server bind address (`0.0.0.0` for network access) |
@@ -192,13 +200,13 @@ cp .env.example backend/.env
# Edit backend/.env: set HOST=0.0.0.0, SLACK_WEBHOOK_URL=
# 3. Update plist paths
-# Edit com.box.it-health-dashboard.plist:
+# Edit com.company.it-health-dashboard.plist:
# - Replace /path/to/ with actual project path
# - Add SLACK_WEBHOOK_URL
# 4. Install launchd service
-sudo cp com.box.it-health-dashboard.plist /Library/LaunchDaemons/
-sudo launchctl bootstrap system /Library/LaunchDaemons/com.box.it-health-dashboard.plist
+sudo cp com.company.it-health-dashboard.plist /Library/LaunchDaemons/
+sudo launchctl bootstrap system /Library/LaunchDaemons/com.company.it-health-dashboard.plist
# 5. Verify
curl http://localhost:8000/api/health
@@ -210,10 +218,10 @@ sudo /usr/libexec/ApplicationFirewall/socketfilterfw --add $(which python3)
Manage the service:
```bash
# Stop
-sudo launchctl bootout system/com.box.it-health-dashboard
+sudo launchctl bootout system/com.company.it-health-dashboard
# Start
-sudo launchctl bootstrap system /Library/LaunchDaemons/com.box.it-health-dashboard.plist
+sudo launchctl bootstrap system /Library/LaunchDaemons/com.company.it-health-dashboard.plist
# View logs
tail -f /var/log/it-health-dashboard.log
@@ -237,9 +245,9 @@ $EDITOR /opt/it-health/deploy/litestream.yml
litestream validate -config /opt/it-health/deploy/litestream.yml
# 4. Install the sidecar launchd daemon
-cp deploy/com.box.it-health-dashboard-litestream.plist.example \
- /Library/LaunchDaemons/com.box.it-health-dashboard-litestream.plist
-sudo launchctl bootstrap system /Library/LaunchDaemons/com.box.it-health-dashboard-litestream.plist
+cp deploy/com.company.it-health-dashboard-litestream.plist.example \
+ /Library/LaunchDaemons/com.company.it-health-dashboard-litestream.plist
+sudo launchctl bootstrap system /Library/LaunchDaemons/com.company.it-health-dashboard-litestream.plist
# 5. Confirm replication is working
litestream snapshots -config /opt/it-health/deploy/litestream.yml
@@ -251,7 +259,7 @@ Litestream RPO is ~1 second — after the initial snapshot, every WAL frame ship
```bash
# 1. Stop the main app so the DB isn't being written to
-sudo launchctl bootout system/com.box.it-health-dashboard
+sudo launchctl bootout system/com.company.it-health-dashboard
# 2. Restore from replica (picks up the latest snapshot + WAL frames)
litestream restore -config /opt/it-health/deploy/litestream.yml \
@@ -259,7 +267,7 @@ litestream restore -config /opt/it-health/deploy/litestream.yml \
/opt/it-health/data.db
# 3. Start the app — it applies pending migrations on boot and resumes polling
-sudo launchctl bootstrap system /Library/LaunchDaemons/com.box.it-health-dashboard.plist
+sudo launchctl bootstrap system /Library/LaunchDaemons/com.company.it-health-dashboard.plist
```
### Data retention
@@ -300,4 +308,4 @@ The retention job runs every `RETENTION_INTERVAL_HOURS` (default 168 = weekly) a
All production phases (0–6) and the primary Phase 7 reach features are complete. Full exit-criteria history is in [PRODUCTION-ROADMAP.md](./PRODUCTION-ROADMAP.md). Remaining optional work:
- **Phase 7 — LLM layer:** Natural-language impact statements; deferred post-Phase-7.
-- **Phase 7 — Integrations:** Splunk, ThousandEyes, Datadog, JSM — deferred to org demand.
+- **Phase 7 — Integrations:** log aggregation, synthetic monitoring, metrics, and ITSM platforms — deferred to demand.
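Reviewer note: the manual status update shown in the README hunk above can also be scripted. The following is a minimal sketch using only the Python standard library. The endpoint path, header, status values, and payload fields are taken from the curl examples; `build_status_request` is a hypothetical helper, not part of the project.

```python
import json
import urllib.request

# Status values listed in the README.
VALID_STATUSES = {"operational", "degraded", "partial_outage", "major_outage", "unknown"}


def build_status_request(base_url, token, service_id, new_status, reason, detail=None):
    """Build (but do not send) the POST request for /api/admin/status.

    Hypothetical helper for illustration only.
    """
    if new_status not in VALID_STATUSES:
        raise ValueError(f"invalid status: {new_status!r}")
    if not reason:
        raise ValueError("reason is required for the audit trail")
    payload = {"service_id": service_id, "new_status": new_status, "reason": reason}
    if detail is not None:
        payload["detail"] = detail
    return urllib.request.Request(
        f"{base_url}/api/admin/status",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_status_request(
    "http://localhost:8000",
    "your-admin-token",
    "hr-system",
    "degraded",
    reason="Reported by user in the help channel",
    detail="Slow login page",
)
# Send with urllib.request.urlopen(req) once the server is running.
```

Validating the status and the `reason` field client-side only mirrors what the README says the server requires; the server remains the authority.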
diff --git a/backend/app/alerting/routing.py b/backend/app/alerting/routing.py
index 84aff02..8970612 100644
--- a/backend/app/alerting/routing.py
+++ b/backend/app/alerting/routing.py
@@ -43,10 +43,10 @@ class RoutingDecision:
should_send: bool
webhook_url: str | None
- channel_mention: str | None # "" for critical tier, else None
+ channel_mention: str | None # "" for critical tier, else None
dedup_key: str
tier: str
- suppressed_by: str | None # None if sent, else reason code
+ suppressed_by: str | None # None if sent, else reason code
# If this change was consolidated into an aggregated upstream alert,
# name + status of the upstream; caller suppresses the individual alert.
aggregated_under: str | None = None
@@ -115,7 +115,8 @@ async def was_recently_alerted(
async def get_service_tier(
- db: aiosqlite.Connection, service_id: str,
+ db: aiosqlite.Connection,
+ service_id: str,
) -> tuple[str, str | None]:
"""Return (tier, slack_channel_override) for a service."""
cursor = await db.execute(
@@ -167,7 +168,9 @@ async def route_status_change(
# Recoveries to 'operational' skip dedup — users always want to know
# "it's back", even if they just saw the outage alert minutes ago.
if change.new_status != "operational" and await was_recently_alerted(
- db, dedup_key, settings.alert_dedup_window_seconds,
+ db,
+ dedup_key,
+ settings.alert_dedup_window_seconds,
):
return RoutingDecision(
should_send=False,
@@ -249,11 +252,13 @@ async def record_alert(
# Mirror into Prometheus so operators can scrape alert hygiene trends
if decision.suppressed_by:
ALERTS_SUPPRESSED_TOTAL.labels(
- kind=alert_kind, reason=decision.suppressed_by,
+ kind=alert_kind,
+ reason=decision.suppressed_by,
).inc()
else:
ALERTS_SENT_TOTAL.labels(
- kind=alert_kind, severity=decision.tier,
+ kind=alert_kind,
+ severity=decision.tier,
).inc()
@@ -263,7 +268,7 @@ async def record_alert(
def build_slo_burn_rate_dedup_key(service_id: str, severity: str) -> str:
- """e.g. 'slo_burn:slack_api:fast' — used by alert_sent_log for dedup."""
+ """e.g. 'slo_burn:identity-provider:fast' — used by alert_sent_log for dedup."""
return f"slo_burn:{service_id}:{severity}"
@@ -351,16 +356,19 @@ async def record_slo_alert(
if decision.suppressed_by:
ALERTS_SUPPRESSED_TOTAL.labels(
- kind="slo_burn_rate", reason=decision.suppressed_by,
+ kind="slo_burn_rate",
+ reason=decision.suppressed_by,
).inc()
else:
ALERTS_SENT_TOTAL.labels(
- kind="slo_burn_rate", severity=decision.tier,
+ kind="slo_burn_rate",
+ severity=decision.tier,
).inc()
# ── Dependency correlation ──────────────────────────────────────────
+
async def find_aggregation_candidates(
db: aiosqlite.Connection,
changes: list[StatusChange],
@@ -382,7 +390,8 @@ async def find_aggregation_candidates(
# Build quick lookup: which service_ids in this batch are going non-operational?
affected_ids = {
- c.service_id for c in changes
+ c.service_id
+ for c in changes
if c.new_status != "operational" and c.previous_status == "operational"
}
if not affected_ids:
@@ -407,7 +416,8 @@ async def find_aggregation_candidates(
declared_downstream = {row[0] for row in await cursor.fetchall()}
consolidated = [
- c for c in changes
+ c
+ for c in changes
if c.service_id != upstream_change.service_id
and c.service_id in declared_downstream
and c.service_id in affected_ids
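Reviewer note: the dedup rule in `route_status_change` (recoveries to `operational` always go out, everything else is suppressed inside the dedup window) can be illustrated without a database. This is a sketch under stated assumptions: `should_suppress` and the in-memory `last_alerted` dict are hypothetical stand-ins for `was_recently_alerted` and the alert log table.

```python
import time


def should_suppress(new_status, dedup_key, last_alerted, window_seconds, now=None):
    """Return True when an alert repeats one sent inside the dedup window.

    `last_alerted` maps dedup_key -> unix timestamp of the last alert sent
    (a hypothetical in-memory stand-in for the real alert log).
    """
    # Recoveries skip dedup: operators always want to hear "it's back".
    if new_status == "operational":
        return False
    now = time.time() if now is None else now
    last = last_alerted.get(dedup_key)
    return last is not None and (now - last) < window_seconds


log = {"hr-system:degraded": 1000.0}
print(should_suppress("degraded", "hr-system:degraded", log, 300, now=1100.0))
print(should_suppress("operational", "hr-system:degraded", log, 300, now=1100.0))
```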
diff --git a/backend/app/alerting/templates.py b/backend/app/alerting/templates.py
index b78b4fa..046fe6f 100644
--- a/backend/app/alerting/templates.py
+++ b/backend/app/alerting/templates.py
@@ -4,40 +4,32 @@
and the service dependency graph. Uses simple string templates.
"""
+from app.config import settings
from app.poller.change_detector import StatusChange
TEMPLATES = {
"single_service_degraded": (
- "{service_name} is reporting degraded performance. "
- "{vendor_detail}"
- ),
- "single_service_partial": (
- "{service_name} is experiencing a partial outage. "
- "{vendor_detail}"
+ "{service_name} is reporting degraded performance. {vendor_detail}"
),
+ "single_service_partial": ("{service_name} is experiencing a partial outage. {vendor_detail}"),
"single_service_major": (
- "\u26a0\ufe0f {service_name} is experiencing a MAJOR OUTAGE. "
- "{vendor_detail}"
- ),
- "with_downstream": (
- " This may impact: {downstream_list}."
+ "\u26a0\ufe0f {service_name} is experiencing a MAJOR OUTAGE. {vendor_detail}"
),
- "okta_degraded": (
- "Okta is reporting degraded performance. SSO authentication "
+ "with_downstream": (" This may impact: {downstream_list}."),
+ "sso_degraded": (
+ "The identity provider (SSO) is reporting degraded performance. SSO authentication "
"for all SaaS applications may be affected. Impacted services: {downstream_list}."
),
- "okta_outage": (
- "\u26a0\ufe0f Okta is experiencing an outage. SSO authentication is unavailable. "
+ "sso_outage": (
+ "\u26a0\ufe0f The identity provider (SSO) is experiencing an outage. "
+ "SSO authentication is unavailable. "
"Users cannot log into: {downstream_list}. "
"Advise users with active sessions to avoid logging out."
),
- "recovery": (
- "{service_name} has recovered and is now operational."
- ),
+ "recovery": ("{service_name} has recovered and is now operational."),
"overall_healthy": "All {total} monitored services are operational.",
"overall_incidents": (
- "{incident_count} active incident(s) across {total} monitored services. "
- "{incident_summary}"
+ "{incident_count} active incident(s) across {total} monitored services. {incident_summary}"
),
}
@@ -60,12 +52,14 @@ def generate_impact_statement(
if change.new_status == "operational":
return TEMPLATES["recovery"].replace("{service_name}", change.service_display_name)
- # Special case: Okta
- if change.service_id == "okta":
+ # Special case: the SSO / identity broker (configurable via
+ # SSO_BROKER_SERVICE_ID). An identity-provider outage blocks login to
+ # everything downstream, so it gets dedicated impact wording.
+ if settings.sso_broker_service_id and change.service_id == settings.sso_broker_service_id:
if change.new_status in ("major_outage", "partial_outage"):
- return TEMPLATES["okta_outage"].replace("{downstream_list}", downstream_list)
+ return TEMPLATES["sso_outage"].replace("{downstream_list}", downstream_list)
if change.new_status == "degraded":
- return TEMPLATES["okta_degraded"].replace("{downstream_list}", downstream_list)
+ return TEMPLATES["sso_degraded"].replace("{downstream_list}", downstream_list)
# Generic path: pick template by severity
template_key = {
@@ -85,7 +79,8 @@ def generate_impact_statement(
if statement and not statement.endswith((".", "!", "?")):
statement += "."
statement += TEMPLATES["with_downstream"].replace(
- "{downstream_list}", downstream_list,
+ "{downstream_list}",
+ downstream_list,
)
return statement
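Reviewer note: the template selection after this change can be sketched as a pure function. The SSO branch follows the hunk above. The generic severity mapping and its fallback are assumptions, because the real `template_key` dict is not visible in this diff, and `pick_template_key` is a hypothetical name.

```python
def pick_template_key(service_id, new_status, sso_broker_service_id=None):
    """Choose which TEMPLATES key applies to a status change (illustrative)."""
    if new_status == "operational":
        return "recovery"
    # The SSO special case only fires when a broker id is configured;
    # None or an empty string disables it.
    if sso_broker_service_id and service_id == sso_broker_service_id:
        if new_status in ("major_outage", "partial_outage"):
            return "sso_outage"
        if new_status == "degraded":
            return "sso_degraded"
    # Assumed generic mapping; the actual dict is elided in the hunk.
    generic = {
        "degraded": "single_service_degraded",
        "partial_outage": "single_service_partial",
        "major_outage": "single_service_major",
    }
    return generic.get(new_status, "single_service_degraded")


print(pick_template_key("identity-provider", "major_outage", "identity-provider"))
print(pick_template_key("identity-provider", "major_outage", None))
```

With the broker id unset, the identity provider falls through to the generic wording, which is the behavior change this hunk introduces relative to the hard-coded `okta` check.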
diff --git a/backend/app/config.py b/backend/app/config.py
index 42f2903..72387fb 100644
--- a/backend/app/config.py
+++ b/backend/app/config.py
@@ -83,6 +83,13 @@ class Settings(BaseSettings):
# changes state, emit one aggregated alert instead of one per dependent.
dependency_correlation_threshold: int = Field(default=3, gt=0, le=100)
+ # The service id that acts as the SSO / identity broker. When a service
+ # with this id changes state, impact statements use the dedicated SSO
+ # template (an identity-provider outage blocks login to everything
+ # downstream). Leave unset, or set SSO_BROKER_SERVICE_ID= (empty) in the
+ # environment, to disable the special case.
+ sso_broker_service_id: str | None = None
+
# Observability (Phase 3)
# Pretty console output in dev, JSON in prod. JSON is cheap to parse
# and preserves contextvars (poll_cycle_id etc.) as first-class fields.
@@ -147,13 +154,22 @@ def sentry_dsn_str(self) -> str | None:
def healthcheck_ping_url_str(self) -> str | None:
return str(self.healthcheck_ping_url) if self.healthcheck_ping_url else None
+ @property
+ def _config_dir(self) -> Path:
+ return Path(__file__).parent.parent / "config"
+
@property
def services_yaml_path(self) -> Path:
- return Path(__file__).parent.parent / "config" / "services.yaml"
+ # Prefer a gitignored services.local.yaml (the operator's real
+ # registry) when present; otherwise fall back to the committed
+ # generic example.
+ local = self._config_dir / "services.local.yaml"
+ return local if local.exists() else self._config_dir / "services.yaml"
@property
def dependencies_yaml_path(self) -> Path:
- return Path(__file__).parent.parent / "config" / "dependencies.yaml"
+ local = self._config_dir / "dependencies.local.yaml"
+ return local if local.exists() else self._config_dir / "dependencies.yaml"
@property
def migrations_dir(self) -> Path:
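Reviewer note: the local-override lookup added to `services_yaml_path` and `dependencies_yaml_path` follows one pattern, sketched here against a temporary directory. `resolve_config_path` is a hypothetical helper name; the file names mirror the ones in the diff.

```python
import tempfile
from pathlib import Path


def resolve_config_path(config_dir: Path, stem: str) -> Path:
    """Prefer <stem>.local.yaml when it exists, else fall back to <stem>.yaml."""
    local = config_dir / f"{stem}.local.yaml"
    return local if local.exists() else config_dir / f"{stem}.yaml"


with tempfile.TemporaryDirectory() as tmp:
    config_dir = Path(tmp)
    (config_dir / "services.yaml").write_text("services: []\n")
    before = resolve_config_path(config_dir, "services").name

    # Dropping in a gitignored local file changes the result of the next lookup.
    (config_dir / "services.local.yaml").write_text("services: []\n")
    after = resolve_config_path(config_dir, "services").name

print(before, after)
```

Because the check runs on every property access, the result depends on the filesystem at call time. Whether the loader re-reads the path after startup is not shown in this diff.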
diff --git a/backend/app/poller/zendesk_poller.py b/backend/app/poller/active_incidents_poller.py
similarity index 78%
rename from backend/app/poller/zendesk_poller.py
rename to backend/app/poller/active_incidents_poller.py
index 30006d4..f2aef54 100644
--- a/backend/app/poller/zendesk_poller.py
+++ b/backend/app/poller/active_incidents_poller.py
@@ -1,7 +1,8 @@
-"""Zendesk Status API poller.
+"""Active incidents API poller.
-Fetches active incidents from status.zendesk.com/api/incidents/active
-and maps to our status model.
+Fetches active incidents from a JSON incidents endpoint and maps to
+our status model. The response envelope wraps incident objects under
+a "data" key; an empty array means operational.
"""
import logging
@@ -15,11 +16,11 @@
logger = logging.getLogger(__name__)
-async def poll_zendesk(
+async def poll_active_incidents(
client: httpx.AsyncClient,
poll_url: str,
) -> PollResult:
- """Poll the Zendesk Status API for active incidents.
+ """Poll an active-incidents JSON endpoint.
Returns { data: [...incidents], included: [...] }.
Empty data array means operational.
@@ -29,7 +30,7 @@ async def poll_zendesk(
body = response.json()
except Exception as e:
detail, reason = describe_fetch_error(e)
- logger.warning("Zendesk poll failed: %s (%s)", detail, reason)
+ logger.warning("Active-incidents poll failed: %s (%s)", detail, reason)
return PollResult(
status=ServiceStatus.UNKNOWN,
status_detail=detail,
@@ -41,7 +42,7 @@ async def poll_zendesk(
if not incidents:
return PollResult(
status=ServiceStatus.OPERATIONAL,
- page_name="Zendesk",
+ page_name="Active Incidents",
incidents=[],
)
@@ -68,6 +69,6 @@ async def poll_zendesk(
return PollResult(
status=severity,
status_detail=status_detail,
- page_name="Zendesk",
+ page_name="Active Incidents",
incidents=incidents,
)
diff --git a/backend/app/poller/slack_poller.py b/backend/app/poller/current_status_poller.py
similarity index 70%
rename from backend/app/poller/slack_poller.py
rename to backend/app/poller/current_status_poller.py
index 7197cfd..896034e 100644
--- a/backend/app/poller/slack_poller.py
+++ b/backend/app/poller/current_status_poller.py
@@ -1,9 +1,9 @@
-"""Slack Status API poller.
+"""Current-status API poller.
-Fetches current status from slack-status.com/api/v2.0.0/current
-and returns normalized status data.
+Fetches the current status from a JSON status endpoint and returns
+normalized status data.
-The Slack API can return two formats:
+The endpoint can return two formats:
- Dict: the normal /current response with {status, active_incidents, ...}
- List: a redirect to the history endpoint returning incident objects.
Each incident has {id, status, type, title, ...} where status is
@@ -15,14 +15,14 @@
import httpx
-from app.poller.normalizer import ServiceStatus, normalize_slack_status
+from app.poller.normalizer import ServiceStatus, normalize_current_status
from app.poller.resilience import describe_fetch_error, resilient_fetch
from app.poller.statuspage_poller import PollResult
logger = logging.getLogger(__name__)
-# Slack incident type → severity rank (higher = worse)
-_SLACK_TYPE_RANK = {
+# Incident type → ServiceStatus (compared by severity when picking the worst)
+_STATUS_TYPE_RANK = {
"outage": ServiceStatus.MAJOR_OUTAGE,
"incident": ServiceStatus.PARTIAL_OUTAGE,
"notice": ServiceStatus.DEGRADED,
@@ -30,30 +30,32 @@
}
-async def poll_slack(
+async def poll_current_status(
client: httpx.AsyncClient,
poll_url: str,
) -> PollResult:
- """Poll the Slack Status API.
+ """Poll a current-status JSON endpoint.
Args:
client: Shared httpx AsyncClient.
- poll_url: Slack status API URL.
+ poll_url: Status API URL.
Returns:
PollResult with normalized status.
"""
try:
# resilient_fetch handles retries + per-host breaker. The explicit
- # Accept header is from the origin branch so slack-status.com
- # returns JSON not HTML when redirects land us on a different doc.
+ # Accept header ensures the endpoint returns JSON rather than HTML
+ # when redirects land on a different document.
response = await resilient_fetch(
- client, poll_url, headers={"Accept": "application/json"},
+ client,
+ poll_url,
+ headers={"Accept": "application/json"},
)
data = response.json()
except Exception as e:
detail, reason = describe_fetch_error(e)
- logger.warning("Slack poll failed: %s (%s)", detail, reason)
+ logger.warning("Current-status poll failed: %s (%s)", detail, reason)
return PollResult(
status=ServiceStatus.UNKNOWN,
status_detail=detail,
@@ -62,7 +64,7 @@ async def poll_slack(
# Normal dict response — use the existing normalizer
if isinstance(data, dict):
- status = normalize_slack_status(data)
+ status = normalize_current_status(data)
status_detail = None
active = data.get("active_incidents", [])
if active and isinstance(active[0], dict):
@@ -70,7 +72,7 @@ async def poll_slack(
return PollResult(
status=status,
status_detail=status_detail,
- page_name="Slack",
+ page_name="Current Status",
incidents=active,
scheduled_maintenances=[],
)
@@ -79,15 +81,14 @@ async def poll_slack(
# with a "status" field (active/resolved/etc) and "type" (outage/incident/etc).
if isinstance(data, list):
active_incidents = [
- item for item in data
- if isinstance(item, dict) and item.get("status") == "active"
+ item for item in data if isinstance(item, dict) and item.get("status") == "active"
]
if not active_incidents:
return PollResult(
status=ServiceStatus.OPERATIONAL,
status_detail=None,
- page_name="Slack",
+ page_name="Current Status",
incidents=[],
scheduled_maintenances=[],
)
@@ -99,7 +100,7 @@ async def poll_slack(
for inc in active_incidents:
inc_type = inc.get("type", "")
- mapped = _SLACK_TYPE_RANK.get(inc_type, ServiceStatus.DEGRADED)
+ mapped = _STATUS_TYPE_RANK.get(inc_type, ServiceStatus.DEGRADED)
if severity_rank.get(mapped.value, 0) > severity_rank.get(worst.value, 0):
worst = mapped
worst_title = inc.get("title")
@@ -108,18 +109,20 @@ async def poll_slack(
worst_title = active_incidents[0].get("title")
logger.info(
- "Slack API returned history list (%d items, %d active) — status: %s",
- len(data), len(active_incidents), worst.value,
+ "Current-status API returned history list (%d items, %d active) — status: %s",
+ len(data),
+ len(active_incidents),
+ worst.value,
)
return PollResult(
status=worst,
status_detail=worst_title,
- page_name="Slack",
+ page_name="Current Status",
incidents=active_incidents,
scheduled_maintenances=[],
)
# Unexpected type
- logger.warning("Slack API returned unexpected type: %s", type(data).__name__)
+ logger.warning("Current-status API returned unexpected type: %s", type(data).__name__)
return PollResult(status=ServiceStatus.UNKNOWN, status_detail="Unexpected response format")
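Reviewer note: the list-format branch of `poll_current_status` reduces to "keep active items, map each type to a status, pick the worst". The sketch below is a standalone approximation. The numeric ranks in `SEVERITY_RANK`, the starting value of `worst`, and the `worst_active` name are assumptions, since those details are elided in the hunks above. `ServiceStatus` is redefined locally with `str, Enum` so the block runs on its own.

```python
from enum import Enum


class ServiceStatus(str, Enum):
    OPERATIONAL = "operational"
    DEGRADED = "degraded"
    PARTIAL_OUTAGE = "partial_outage"
    MAJOR_OUTAGE = "major_outage"


# Incident type -> status, as in _STATUS_TYPE_RANK.
TYPE_TO_STATUS = {
    "outage": ServiceStatus.MAJOR_OUTAGE,
    "incident": ServiceStatus.PARTIAL_OUTAGE,
    "notice": ServiceStatus.DEGRADED,
}

# Assumed ordering; higher means worse.
SEVERITY_RANK = {"operational": 0, "degraded": 1, "partial_outage": 2, "major_outage": 3}


def worst_active(items):
    """Return (status, title) for the most severe active incident."""
    active = [i for i in items if isinstance(i, dict) and i.get("status") == "active"]
    if not active:
        return ServiceStatus.OPERATIONAL, None
    # Start from the first active item so a title is always available.
    worst, title = ServiceStatus.DEGRADED, active[0].get("title")
    for inc in active:
        mapped = TYPE_TO_STATUS.get(inc.get("type", ""), ServiceStatus.DEGRADED)
        if SEVERITY_RANK[mapped.value] > SEVERITY_RANK[worst.value]:
            worst, title = mapped, inc.get("title")
    return worst, title


history = [
    {"status": "resolved", "type": "outage", "title": "Old outage"},
    {"status": "active", "type": "notice", "title": "Slow search"},
    {"status": "active", "type": "incident", "title": "Login errors"},
]
status, title = worst_active(history)
print(status.value, title)
```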
diff --git a/backend/app/poller/google_poller.py b/backend/app/poller/google_poller.py
deleted file mode 100644
index ac79408..0000000
--- a/backend/app/poller/google_poller.py
+++ /dev/null
@@ -1,96 +0,0 @@
-"""Google Workspace status poller.
-
-Fetches the incidents.json feed from Google's status dashboard.
-One HTTP call serves both Google Mail and Google Calendar.
-"""
-
-import logging
-
-import httpx
-
-from app.poller.normalizer import (
- GOOGLE_PRODUCT_NAMES,
- ServiceStatus,
- normalize_google_status,
-)
-from app.poller.resilience import describe_fetch_error, resilient_fetch
-from app.poller.statuspage_poller import PollResult
-
-logger = logging.getLogger(__name__)
-
-
-async def poll_google(
- client: httpx.AsyncClient,
- poll_url: str,
- services: list[dict],
-) -> list[tuple[str, PollResult]]:
- """Poll Google Workspace status for multiple products in one call.
-
- Args:
- client: Shared httpx AsyncClient.
- poll_url: Google incidents.json URL.
- services: DB rows with key 'id' (e.g., "google-mail", "google-calendar").
-
- Returns:
- List of (service_id, PollResult) tuples.
- """
- try:
- response = await resilient_fetch(client, poll_url)
- incidents = response.json()
- except Exception as e:
- detail, reason = describe_fetch_error(e)
- logger.warning("Google poll failed: %s (%s)", detail, reason)
- return [
- (svc["id"], PollResult(
- status=ServiceStatus.UNKNOWN,
- status_detail=detail,
- poll_failure_reason=reason,
- ))
- for svc in services
- ]
-
- if not isinstance(incidents, list):
- logger.warning("Google incidents.json returned non-list: %s", type(incidents))
- return [
- (svc["id"], PollResult(
- status=ServiceStatus.UNKNOWN,
- status_detail="Unexpected response format",
- poll_failure_reason=f"parse_error: expected list, got {type(incidents).__name__}",
- ))
- for svc in services
- ]
-
- results: list[tuple[str, PollResult]] = []
- for svc in services:
- service_id = svc["id"]
- status = normalize_google_status(incidents, service_id)
-
- # Find status detail from most recent active incident for this product
- status_detail = None
- product_names = GOOGLE_PRODUCT_NAMES.get(service_id, [])
- for incident in incidents:
- if incident.get("end"):
- continue
- affected = incident.get("affected_products", [])
- if any(p.get("title") in product_names for p in affected):
- status_detail = incident.get("external_desc", "")[:200]
- break
-
- results.append((
- service_id,
- PollResult(
- status=status,
- status_detail=status_detail,
- page_name="Google Workspace",
- incidents=[
- inc for inc in incidents
- if not inc.get("end") and any(
- p.get("title") in product_names
- for p in inc.get("affected_products", [])
- )
- ],
- scheduled_maintenances=[],
- ),
- ))
-
- return results
diff --git a/backend/app/poller/normalizer.py b/backend/app/poller/normalizer.py
index 05aed9a..9046f1f 100644
--- a/backend/app/poller/normalizer.py
+++ b/backend/app/poller/normalizer.py
@@ -1,4 +1,4 @@
-"""Normalize vendor-specific status strings to our unified 5-state enum."""
+"""Normalize poll-format status strings to our unified 5-state enum."""
import logging
from enum import StrEnum
@@ -55,16 +55,18 @@ def normalize_statuspage_indicator(indicator: str) -> ServiceStatus:
mapped = STATUSPAGE_INDICATOR_MAP.get(key)
if mapped is None:
logger.warning(
- "Unmapped Statuspage indicator %r — returning UNKNOWN.", indicator,
+ "Unmapped Statuspage indicator %r — returning UNKNOWN.",
+ indicator,
)
return ServiceStatus.UNKNOWN
return mapped
-# ── Slack Status API ───────────────────────────────────────────────
+# ── Current Status API ────────────────────────────────────────────
-def normalize_slack_status(response: dict) -> ServiceStatus:
- """Map Slack Status API response to ServiceStatus.
+
+def normalize_current_status(response: dict) -> ServiceStatus:
+ """Map a current-status API dict response to ServiceStatus.
When status is "ok" and no active incidents → OPERATIONAL.
Otherwise, map by incident type.
@@ -108,23 +110,24 @@ def normalize_slack_status(response: dict) -> ServiceStatus:
return ServiceStatus.DEGRADED
-# ── Google Workspace ───────────────────────────────────────────────
+# ── Product Feed ───────────────────────────────────────────────────
-# Google product name mappings for filtering the incident feed
-GOOGLE_PRODUCT_NAMES: dict[str, list[str]] = {
- "google-mail": ["Gmail", "Google Mail"],
- "google-calendar": ["Google Calendar"],
+# Maps a service id to the product-title strings that identify it in a
+# multi-product status feed. Populate per your feed adapter's payload.
+PRODUCT_FEED_NAMES: dict[str, list[str]] = {
+ "feed-product-a": ["Product A", "Service A"],
+ "feed-product-b": ["Product B"],
}
-def normalize_google_status(incidents: list[dict], service_id: str) -> ServiceStatus:
- """Map Google Workspace incident feed to ServiceStatus for a specific product.
+def normalize_product_feed_status(incidents: list[dict], service_id: str) -> ServiceStatus:
+ """Map a multi-product incident feed to ServiceStatus for a specific product.
- The incidents.json feed contains incidents for ALL Google Workspace products.
- Filter by matching product names for the given service_id.
+ The feed contains incidents for all products; filter by matching product
+ names for the given service_id.
If no active (non-resolved) incidents exist for the product → OPERATIONAL.
"""
- product_names = GOOGLE_PRODUCT_NAMES.get(service_id, [])
+ product_names = PRODUCT_FEED_NAMES.get(service_id, [])
if not product_names:
return ServiceStatus.UNKNOWN
@@ -165,16 +168,27 @@ def normalize_google_status(incidents: list[dict], service_id: str) -> ServiceSt
RSS_SEVERITY_KEYWORDS: dict[ServiceStatus, list[str]] = {
ServiceStatus.MAJOR_OUTAGE: [
- "major outage", "service outage", "completely unavailable",
+ "major outage",
+ "service outage",
+ "completely unavailable",
],
ServiceStatus.PARTIAL_OUTAGE: [
- "partial outage", "partial disruption", "some users",
+ "partial outage",
+ "partial disruption",
+ "some users",
],
ServiceStatus.DEGRADED: [
- "degraded", "performance issue", "intermittent", "delays", "investigating",
+ "degraded",
+ "performance issue",
+ "intermittent",
+ "delays",
+ "investigating",
],
ServiceStatus.OPERATIONAL: [
- "resolved", "operational", "recovered", "fix implemented",
+ "resolved",
+ "operational",
+ "recovered",
+ "fix implemented",
],
}
diff --git a/backend/app/poller/product_feed_poller.py b/backend/app/poller/product_feed_poller.py
new file mode 100644
index 0000000..c642a05
--- /dev/null
+++ b/backend/app/poller/product_feed_poller.py
@@ -0,0 +1,107 @@
+"""Product-feed status poller.
+
+Parses a multi-product incident feed where one HTTP call serves multiple
+product IDs. Each entry in the feed represents an incident and carries an
+affected_products list; entries without an "end" timestamp are active.
+"""
+
+import logging
+
+import httpx
+
+from app.poller.normalizer import (
+    PRODUCT_FEED_NAMES,
+    ServiceStatus,
+    normalize_product_feed_status,
+)
+from app.poller.resilience import describe_fetch_error, resilient_fetch
+from app.poller.statuspage_poller import PollResult
+
+logger = logging.getLogger(__name__)
+
+
+async def poll_product_feed(
+    client: httpx.AsyncClient,
+    poll_url: str,
+    services: list[dict],
+) -> list[tuple[str, PollResult]]:
+    """Poll a multi-product incident feed for multiple services in one call.
+
+    Args:
+        client: Shared httpx AsyncClient.
+        poll_url: Incident feed URL (returns a JSON array of incident objects).
+        services: DB rows with key 'id' matching entries in PRODUCT_FEED_NAMES.
+
+    Returns:
+        List of (service_id, PollResult) tuples.
+    """
+    try:
+        response = await resilient_fetch(client, poll_url)
+        incidents = response.json()
+    except Exception as e:
+        detail, reason = describe_fetch_error(e)
+        logger.warning("Product-feed poll failed: %s (%s)", detail, reason)
+        return [
+            (
+                svc["id"],
+                PollResult(
+                    status=ServiceStatus.UNKNOWN,
+                    status_detail=detail,
+                    poll_failure_reason=reason,
+                ),
+            )
+            for svc in services
+        ]
+
+    if not isinstance(incidents, list):
+        logger.warning("Product feed returned non-list: %s", type(incidents))
+        return [
+            (
+                svc["id"],
+                PollResult(
+                    status=ServiceStatus.UNKNOWN,
+                    status_detail="Unexpected response format",
+                    poll_failure_reason=f"parse_error: expected list, got {type(incidents).__name__}",
+                ),
+            )
+            for svc in services
+        ]
+
+    results: list[tuple[str, PollResult]] = []
+    for svc in services:
+        service_id = svc["id"]
+        status = normalize_product_feed_status(incidents, service_id)
+
+        # Find status detail from most recent active incident for this product
+        status_detail = None
+        product_names = PRODUCT_FEED_NAMES.get(service_id, [])
+        for incident in incidents:
+            if incident.get("end"):
+                continue
+            affected = incident.get("affected_products", [])
+            if any(p.get("title") in product_names for p in affected):
+                status_detail = incident.get("external_desc", "")[:200]
+                break
+
+        results.append(
+            (
+                service_id,
+                PollResult(
+                    status=status,
+                    status_detail=status_detail,
+                    page_name="Product Feed",
+                    incidents=[
+                        inc
+                        for inc in incidents
+                        if not inc.get("end")
+                        and any(
+                            p.get("title") in product_names
+                            for p in inc.get("affected_products", [])
+                        )
+                    ],
+                    scheduled_maintenances=[],
+                ),
+            )
+        )
+
+    return results
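Reviewer note: the per-product filter at the heart of `poll_product_feed` can be exercised on a small in-memory feed. This is a sketch: `active_incidents_for` is a hypothetical helper that mirrors the list comprehension in the new file, and the sample feed is invented for illustration.

```python
# Same shape as PRODUCT_FEED_NAMES in normalizer.py.
PRODUCT_FEED_NAMES = {
    "feed-product-a": ["Product A", "Service A"],
    "feed-product-b": ["Product B"],
}


def active_incidents_for(incidents, service_id):
    """Return feed entries that are still open and affect the given service."""
    names = PRODUCT_FEED_NAMES.get(service_id, [])
    return [
        inc
        for inc in incidents
        # An entry without an "end" timestamp is treated as active.
        if not inc.get("end")
        and any(p.get("title") in names for p in inc.get("affected_products", []))
    ]


# Invented sample feed: one open incident, one resolved, one for another product.
feed = [
    {"id": "1", "affected_products": [{"title": "Service A"}]},
    {"id": "2", "end": "2024-01-01T00:00:00Z", "affected_products": [{"title": "Product A"}]},
    {"id": "3", "affected_products": [{"title": "Product B"}]},
]

open_for_a = active_incidents_for(feed, "feed-product-a")
print([inc["id"] for inc in open_for_a])
```

A service id missing from `PRODUCT_FEED_NAMES` yields an empty list here, while `normalize_product_feed_status` reports `UNKNOWN` for the same case, so callers should not read an empty list alone as "operational".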
diff --git a/backend/app/poller/resilience.py b/backend/app/poller/resilience.py
index 099bf21..cf7dd7f 100644
--- a/backend/app/poller/resilience.py
+++ b/backend/app/poller/resilience.py
@@ -136,7 +136,7 @@ async def resilient_fetch(
`headers` forwards through to ``client.get`` so pollers that need
vendor-specific headers (e.g., ``Accept: application/json`` for
- Slack's redirect-happy status host) can pass them without
+ some vendors' redirect-happy status hosts) can pass them without
bypassing the resilience layer.
Raises:
diff --git a/backend/app/poller/scheduler.py b/backend/app/poller/scheduler.py
index 521e358..c2be5ba 100644
--- a/backend/app/poller/scheduler.py
+++ b/backend/app/poller/scheduler.py
@@ -1,7 +1,7 @@
"""Poll scheduler: runs all pollers on a 60-second cycle via APScheduler.
-Orchestrates statuspage, Slack, and Google pollers, feeds results through
-the change detector, and logs status changes.
+Orchestrates all poller types, feeds results through the change detector,
+and logs status changes.
Phase 3 observability hooks:
- Each poll cycle binds a fresh `poll_cycle_id` contextvar so every
@@ -50,12 +50,15 @@ def _on_scheduler_event(event) -> None:
"""Bridge APScheduler events into our logs + metrics."""
if event.code == EVENT_JOB_ERROR:
logger.error(
- "APScheduler job %s raised: %s", event.job_id, event.exception,
+ "APScheduler job %s raised: %s",
+ event.job_id,
+ event.exception,
)
elif event.code == EVENT_JOB_MISSED:
logger.warning(
"APScheduler job %s missed its run time at %s",
- event.job_id, event.scheduled_run_time,
+ event.job_id,
+ event.scheduled_run_time,
)
@@ -97,6 +100,7 @@ def start_scheduler(app) -> None:
# WAL checkpoint — runs more often than retention so the -wal sidecar
# file doesn't grow without bound between weekly retention passes.
from app.retention import scheduled_retention_tick, scheduled_wal_checkpoint_tick
+
scheduler.add_job(
scheduled_wal_checkpoint_tick,
"interval",
@@ -119,6 +123,7 @@ def start_scheduler(app) -> None:
# as a belt-and-suspenders snapshot for operators who don't want to
# set up Litestream's continuous WAL shipping.
from app.backup import run_backup
+
scheduler.add_job(
run_backup,
"cron",
@@ -133,6 +138,7 @@ def start_scheduler(app) -> None:
# SLO burn-rate alerting — gated by feature flag (default off)
if settings.slo_burn_rate_enabled:
from app.alerting.burn_rate import run_slo_burn_rate_cycle
+
scheduler.add_job(
run_slo_burn_rate_cycle,
"interval",
@@ -198,21 +204,23 @@ async def run_poll_cycle(app) -> None:
services_by_type.setdefault(svc["poll_type"], []).append(svc)
# Dispatch all poller groups concurrently
- from app.poller.google_poller import poll_google
- from app.poller.ringcentral_poller import poll_ringcentral
- from app.poller.salesforce_poller import poll_salesforce
- from app.poller.slack_poller import poll_slack
+ from app.poller.active_incidents_poller import poll_active_incidents
+ from app.poller.current_status_poller import poll_current_status
+ from app.poller.product_feed_poller import poll_product_feed
+ from app.poller.service_array_poller import poll_service_array
from app.poller.statuspage_poller import poll_all_statuspage
- from app.poller.zendesk_poller import poll_zendesk
+ from app.poller.trust_incidents_poller import poll_trust_incidents
tasks = []
task_labels = []
def _timed(poll_type: str, coro):
"""Wrap a poller coroutine to record its wall-clock duration."""
+
async def _runner():
with POLL_DURATION_SECONDS.labels(poll_type=poll_type).time():
return await coro
+
return _runner()
statuspage_svcs = services_by_type.get("statuspage_json", [])
@@ -220,33 +228,37 @@ async def _runner():
tasks.append(_timed("statuspage_json", poll_all_statuspage(client, statuspage_svcs)))
task_labels.append(f"statuspage ({len(statuspage_svcs)} services)")
- slack_svcs = services_by_type.get("slack_api", [])
- if slack_svcs:
- svc = slack_svcs[0]
+ current_status_svcs = services_by_type.get("current_status_api", [])
+ if current_status_svcs:
+ svc = current_status_svcs[0]
- async def _poll_slack():
- result = await poll_slack(client, svc["poll_url"])
+ async def _poll_current_status():
+ result = await poll_current_status(client, svc["poll_url"])
return [(svc["id"], result)]
- tasks.append(_timed("slack_api", _poll_slack()))
- task_labels.append("slack (1 service)")
+ tasks.append(_timed("current_status_api", _poll_current_status()))
+ task_labels.append("current_status (1 service)")
- google_svcs = services_by_type.get("google_json", [])
- if google_svcs:
- url = google_svcs[0]["poll_url"]
- tasks.append(_timed("google_json", poll_google(client, url, google_svcs)))
- task_labels.append(f"google ({len(google_svcs)} services)")
+ product_feed_svcs = services_by_type.get("product_feed_json", [])
+ if product_feed_svcs:
+ url = product_feed_svcs[0]["poll_url"]
+ tasks.append(
+ _timed("product_feed_json", poll_product_feed(client, url, product_feed_svcs))
+ )
+ task_labels.append(f"product_feed ({len(product_feed_svcs)} services)")
# Single-service custom pollers
for poll_type, poller_fn in [
- ("salesforce_trust", poll_salesforce),
- ("zendesk_api", poll_zendesk),
- ("ringcentral_api", poll_ringcentral),
+ ("trust_incidents_api", poll_trust_incidents),
+ ("active_incidents_api", poll_active_incidents),
+ ("service_array_json", poll_service_array),
]:
for svc in services_by_type.get(poll_type, []):
+
async def _poll_single(s=svc, fn=poller_fn):
result = await fn(client, s["poll_url"])
return [(s["id"], result)]
+
tasks.append(_timed(poll_type, _poll_single()))
task_labels.append(f"{poll_type} ({svc['id']})")
@@ -274,20 +286,25 @@ async def _poll_single(s=svc, fn=poller_fn):
# Process vendor-outage changes: impact statements + Slack alerts
if changes:
from app.alerting.engine import process_changes
+
await process_changes(db, write_lock, changes, http_client=client)
# Process poller-health changes: alert on the separate poller-health
# webhook so operators can tell "we're blind" from "vendor is down".
if health_changes:
from app.alerting.engine import process_poller_health_changes
+
await process_poller_health_changes(
- health_changes, http_client=client,
+ health_changes,
+ http_client=client,
)
# Log summary
logger.info(
"Poll cycle complete: %d services polled, %d changes, %d health",
- len(all_results), len(changes), len(health_changes),
+ len(all_results),
+ len(changes),
+ len(health_changes),
)
for change in changes:
logger.info(
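Review note on the `_poll_single(s=svc, fn=poller_fn)` pattern above: the coroutine is created inside the loop, but its body only runs later, when the tasks are gathered. Binding the loop variables as default arguments freezes their values per iteration. The sketch below is a standalone illustration of that behavior, not the project's code; `dispatch` and the sample `SERVICES` list are hypothetical names, and the timing wrapper is left out.

```python
import asyncio

# Hypothetical sample registry rows, shaped like the dicts used in the diff.
SERVICES = [{"id": "a"}, {"id": "b"}, {"id": "c"}]


async def dispatch(services: list[dict], bind: bool) -> list[str]:
    """Create one coroutine per service in a loop, then await them together."""
    tasks = []
    for svc in services:
        if bind:
            # Default argument captures the current value of svc.
            async def _poll_single(s=svc):
                return s["id"]
        else:
            # Closure looks up svc when the body runs, i.e. after the loop ends.
            async def _poll_single():
                return svc["id"]

        tasks.append(_poll_single())
    return list(await asyncio.gather(*tasks))
```

With `bind=True` each coroutine reports its own service id; with `bind=False` every coroutine sees the last value of `svc`, which is the failure mode the default-argument form avoids.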
diff --git a/backend/app/poller/ringcentral_poller.py b/backend/app/poller/service_array_poller.py
similarity index 77%
rename from backend/app/poller/ringcentral_poller.py
rename to backend/app/poller/service_array_poller.py
index d031320..425351b 100644
--- a/backend/app/poller/ringcentral_poller.py
+++ b/backend/app/poller/service_array_poller.py
@@ -1,7 +1,8 @@
-"""RingCentral Status API poller.
+"""Service-array status poller.
-Fetches service status from status.ringcentral.com/status.json
-which returns an array of 75 service status objects.
+Fetches status from an endpoint that returns an array of per-service
+status objects, each carrying a level field and optional alerts list.
+Overall status is derived from the worst level across all entries.
"""
import logging
@@ -14,7 +15,7 @@
logger = logging.getLogger(__name__)
-# RingCentral level values → our status
+# Level field values → our status
LEVEL_MAP = {
"Good": ServiceStatus.OPERATIONAL,
"Informational": ServiceStatus.OPERATIONAL,
@@ -24,24 +25,24 @@
}
-async def poll_ringcentral(
+async def poll_service_array(
client: httpx.AsyncClient,
poll_url: str,
) -> PollResult:
- """Poll RingCentral status API.
+ """Poll a service-array status endpoint.
- Returns an array of objects like:
+ The endpoint returns an array of objects like:
{ "category": "Core Services", "service": "Calling - Inbound",
"region": "Americas", "level": "Good", "alerts": [] }
- We compute overall status from the worst level across all services.
+ Overall status is computed from the worst level across all entries.
"""
try:
response = await resilient_fetch(client, poll_url)
services = response.json()
except Exception as e:
detail, reason = describe_fetch_error(e)
- logger.warning("RingCentral poll failed: %s (%s)", detail, reason)
+ logger.warning("Service-array poll failed: %s (%s)", detail, reason)
return PollResult(
status=ServiceStatus.UNKNOWN,
status_detail=detail,
@@ -49,7 +50,7 @@ async def poll_ringcentral(
)
if not isinstance(services, list):
- return PollResult(status=ServiceStatus.OPERATIONAL, page_name="RingCentral")
+ return PollResult(status=ServiceStatus.OPERATIONAL, page_name="Service Array")
# Compute worst status across all services
worst = ServiceStatus.OPERATIONAL
@@ -79,6 +80,6 @@ async def poll_ringcentral(
return PollResult(
status=worst,
status_detail=worst_detail if worst != ServiceStatus.OPERATIONAL else None,
- page_name="RingCentral",
+ page_name="Service Array",
incidents=active_alerts,
)
diff --git a/backend/app/poller/statuspage_poller.py b/backend/app/poller/statuspage_poller.py
index 752284e..813273f 100644
--- a/backend/app/poller/statuspage_poller.py
+++ b/backend/app/poller/statuspage_poller.py
@@ -158,14 +158,16 @@ async def _fetch_one(url: str) -> tuple[str, dict | Exception]:
detail, reason = describe_fetch_error(data_or_error)
logger.warning("Failed to fetch %s: %s (%s)", url, detail, reason)
for svc in svcs:
- results.append((
- svc["id"],
- PollResult(
- status=ServiceStatus.UNKNOWN,
- status_detail=detail,
- poll_failure_reason=reason,
- ),
- ))
+ results.append(
+ (
+ svc["id"],
+ PollResult(
+ status=ServiceStatus.UNKNOWN,
+ status_detail=detail,
+ poll_failure_reason=reason,
+ ),
+ )
+ )
else:
data = data_or_error
for svc in svcs:
@@ -183,7 +185,7 @@ async def _demo_poll() -> None:
) as client:
result = await poll_statuspage(
client,
- "https://status.box.com/api/v2/summary.json",
+ "https://status.example.com/api/v2/summary.json",
)
print(f"Page: {result.page_name}")
print(f"Status: {result.status.value}")
@@ -202,4 +204,5 @@ async def _demo_poll() -> None:
if __name__ == "__main__":
import asyncio as _asyncio
+
_asyncio.run(_demo_poll())
diff --git a/backend/app/poller/salesforce_poller.py b/backend/app/poller/trust_incidents_poller.py
similarity index 76%
rename from backend/app/poller/salesforce_poller.py
rename to backend/app/poller/trust_incidents_poller.py
index a1be1e1..a97ab98 100644
--- a/backend/app/poller/salesforce_poller.py
+++ b/backend/app/poller/trust_incidents_poller.py
@@ -1,7 +1,8 @@
-"""Salesforce Trust API poller.
+"""Trust incidents API poller.
-Fetches active incidents from api.status.salesforce.com/v1/incidents
-and maps to our status model.
+Fetches active incidents from a trust/incidents JSON endpoint and maps
+to our status model. The endpoint returns a list of incident objects;
+entries whose isResolved flag is false are considered active.
"""
import logging
@@ -15,13 +16,13 @@
logger = logging.getLogger(__name__)
-async def poll_salesforce(
+async def poll_trust_incidents(
client: httpx.AsyncClient,
poll_url: str,
) -> PollResult:
- """Poll the Salesforce Trust API for active incidents.
+ """Poll a trust-incidents JSON endpoint for active incidents.
- The API returns a list of incident objects. If any are active
+ The endpoint returns a list of incident objects. If any are active
(no resolvedAt), the service is degraded/outaged.
"""
try:
@@ -29,7 +30,7 @@ async def poll_salesforce(
incidents = response.json()
except Exception as e:
detail, reason = describe_fetch_error(e)
- logger.warning("Salesforce poll failed: %s (%s)", detail, reason)
+ logger.warning("Trust-incidents poll failed: %s (%s)", detail, reason)
return PollResult(
status=ServiceStatus.UNKNOWN,
status_detail=detail,
@@ -37,7 +38,7 @@ async def poll_salesforce(
)
if not isinstance(incidents, list):
- return PollResult(status=ServiceStatus.OPERATIONAL, page_name="Salesforce")
+ return PollResult(status=ServiceStatus.OPERATIONAL, page_name="Trust Incidents")
# Filter to active incidents (those without a resolved timestamp)
active = [inc for inc in incidents if not inc.get("isResolved", True)]
@@ -45,7 +46,7 @@ async def poll_salesforce(
if not active:
return PollResult(
status=ServiceStatus.OPERATIONAL,
- page_name="Salesforce",
+ page_name="Trust Incidents",
incidents=[],
)
@@ -72,6 +73,6 @@ async def poll_salesforce(
return PollResult(
status=severity,
status_detail=status_detail,
- page_name="Salesforce",
+ page_name="Trust Incidents",
incidents=active,
)
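Review note: the active-incident filter in this file keys on an `isResolved` flag with a default of `True`, so an entry that lacks the flag is treated as resolved. The sketch below isolates that rule and contrasts it with a timestamp-based rule; `active_by_flag`, `active_by_timestamp`, and the sample payload are hypothetical, and the `resolvedAt` field name is taken from the docstring rather than from a confirmed API schema.

```python
def active_by_flag(incidents: list[dict]) -> list[dict]:
    """Same predicate as the diff: missing isResolved defaults to resolved."""
    return [inc for inc in incidents if not inc.get("isResolved", True)]


def active_by_timestamp(incidents: list[dict]) -> list[dict]:
    """Alternative rule (assumed): active means no resolvedAt value."""
    return [inc for inc in incidents if inc.get("resolvedAt") is None]


# Hypothetical payload covering the three interesting cases.
SAMPLE = [
    {"id": "open", "isResolved": False},
    {"id": "closed", "isResolved": True, "resolvedAt": "2026-01-01T00:00:00Z"},
    {"id": "no-fields"},
]
```

The two rules disagree on the entry with neither field: the flag rule drops it, the timestamp rule keeps it. Which default is safer depends on the real payload.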
diff --git a/backend/app/seed.py b/backend/app/seed.py
index b9ee610..84966f5 100644
--- a/backend/app/seed.py
+++ b/backend/app/seed.py
@@ -42,13 +42,28 @@ def _expand_env_var(value: str | None) -> str | None:
VALID_CATEGORIES = Literal[
- "identity", "productivity", "collaboration", "engineering",
- "hr", "finance", "sales", "marketing", "networking", "support", "other",
+ "identity",
+ "productivity",
+ "collaboration",
+ "engineering",
+ "hr",
+ "finance",
+ "sales",
+ "marketing",
+ "networking",
+ "support",
+ "other",
]
VALID_POLL_TYPES = Literal[
- "statuspage_json", "google_json", "slack_api", "rss", "manual",
- "salesforce_trust", "zendesk_api", "ringcentral_api",
+ "statuspage_json",
+ "product_feed_json",
+ "current_status_api",
+ "rss",
+ "manual",
+ "trust_incidents_api",
+ "active_incidents_api",
+ "service_array_json",
]
VALID_TIERS = Literal["critical", "important", "informational"]
@@ -106,9 +121,7 @@ def load_services(path: Path | None = None) -> list[ServiceConfig]:
errors.append(f" Service #{i + 1} ({raw.get('id', '?')}): {e}")
if errors:
- raise ValueError(
- f"Validation failed for {len(errors)} service(s):\n" + "\n".join(errors)
- )
+ raise ValueError(f"Validation failed for {len(errors)} service(s):\n" + "\n".join(errors))
return services
@@ -138,9 +151,7 @@ def load_dependencies(
errors: list[str] = []
for upstream, targets in deps.items():
if upstream not in known_service_ids:
- errors.append(
- f" Unknown upstream service '{upstream}' (not in services.yaml)"
- )
+ errors.append(f" Unknown upstream service '{upstream}' (not in services.yaml)")
for target in targets:
if target.service == "all_internal":
continue
@@ -218,9 +229,7 @@ async def seed_dependencies(
for target in targets:
# Expand "all_internal" to all services except the upstream itself
if target.service == "all_internal":
- downstream_ids = [
- sid for sid in all_service_ids if sid != upstream
- ]
+ downstream_ids = [sid for sid in all_service_ids if sid != upstream]
else:
downstream_ids = [target.service]
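Review note: the validation and `all_internal` expansion touched in this file can be summarized with a small standalone sketch. It uses plain dicts instead of the project's models, the function names are hypothetical, and the downstream-id check is an assumption (only the upstream check and the sentinel skip are visible in the diff).

```python
ALL_INTERNAL = "all_internal"  # sentinel named in the diff


def validate_dependencies(deps: dict[str, list[dict]], known_ids: set[str]) -> list[str]:
    """Collect one error line per unknown service id."""
    errors = []
    for upstream, targets in deps.items():
        if upstream not in known_ids:
            errors.append(f"Unknown upstream service '{upstream}'")
        for target in targets:
            if target["service"] == ALL_INTERNAL:
                continue  # sentinel is expanded at seed time, not validated
            if target["service"] not in known_ids:  # assumed check
                errors.append(f"Unknown downstream service '{target['service']}'")
    return errors


def expand_downstream(upstream: str, target_service: str, all_ids: list[str]) -> list[str]:
    """Expand the sentinel to every service except the upstream itself."""
    if target_service == ALL_INTERNAL:
        return [sid for sid in all_ids if sid != upstream]
    return [target_service]
```

Collecting all errors before raising, as the loader does, lets an operator fix a whole config file in one pass.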
diff --git a/backend/config/dependencies.yaml b/backend/config/dependencies.yaml
index d44a486..1e691b7 100644
--- a/backend/config/dependencies.yaml
+++ b/backend/config/dependencies.yaml
@@ -1,60 +1,49 @@
-# IT Service Health Dashboard — Service Dependency Graph
-# Format: upstream → downstream (when upstream breaks, downstream is impacted)
+# Service dependency graph (EXAMPLE)
+#
+# Maps an upstream service id to the downstream services impacted when it
+# degrades. Used to generate impact statements ("X is down — this affects Y").
+# Every id here must exist in services.yaml. The sentinel `all_internal`
+# expands to every service except the upstream itself.
+#
+# For a real deployment, copy this to `dependencies.local.yaml` (gitignored)
+# alongside `services.local.yaml`; the loader prefers the .local files.
+#
+# Edge schema: { service: <service id>, impact: <impact text>, severity: critical|high|medium|low }
dependencies:
- # Identity & Access — highest blast radius
- okta:
- - service: box
- impact: "Box SSO login unavailable"
+ # The identity provider is the login broker — an outage blocks SSO access
+ # to everything that authenticates through it.
+ identity-provider:
+ - service: github
+ impact: "SSO login to source control unavailable"
severity: critical
- - service: slack
- impact: "Slack SSO login may fail for new sessions"
- severity: critical
- - service: zoom
- impact: "Zoom SSO login unavailable"
- severity: critical
- - service: salesforce
- impact: "Salesforce SSO login unavailable"
- severity: critical
- - service: jira
- impact: "Jira SSO login unavailable"
+ - service: npm
+ impact: "SSO login to the package registry unavailable"
severity: high
- - service: confluence
- impact: "Confluence SSO login unavailable"
+ - service: pypi
+ impact: "SSO login to the package index unavailable"
severity: high
- - service: concur
- impact: "Concur SSO login unavailable"
- severity: high
- - service: workday
- impact: "Workday SSO login unavailable"
- severity: high
- - service: greenhouse
- impact: "Greenhouse SSO login unavailable"
- severity: medium
- - service: docusign
- impact: "DocuSign SSO login unavailable"
+ - service: sentry
+ impact: "SSO login to error tracking unavailable"
severity: medium
- - service: zendesk
- impact: "Zendesk SSO login unavailable"
+ - service: dropbox
+ impact: "SSO login to file storage unavailable"
+ severity: high
+ - service: discord
+ impact: "SSO login to team chat unavailable"
severity: medium
- - service: netsuite
- impact: "NetSuite SSO login unavailable"
+ - service: datadog
+ impact: "SSO login to observability unavailable"
severity: medium
-
- duo:
- - service: okta
- impact: "MFA push notifications unavailable; Okta login may require fallback methods"
- severity: critical
-
- # Collaboration dependencies
- slack:
- - service: servicedesk
- impact: "Slack-based IT support channel and Aisera bot unavailable"
+ - service: ticketing
+ impact: "SSO login to the ITSM/ticketing system unavailable"
severity: high
- # Google Workspace
- google-mail:
- - service: google-calendar
- impact: "Calendar notifications and email invites may be delayed"
+ # Edge/CDN provider — degradation ripples to web-facing services behind it.
+ cloudflare:
+ - service: github
+ impact: "Elevated latency reaching source-control web UI"
severity: medium
-
+ - service: sentry
+ impact: "Elevated latency reaching the error-tracking UI"
+ severity: low
diff --git a/backend/config/services.yaml b/backend/config/services.yaml
index d5da019..478e8c6 100644
--- a/backend/config/services.yaml
+++ b/backend/config/services.yaml
@@ -1,6 +1,22 @@
-# IT Service Health Dashboard — Service Registry
-# All poll_urls verified via curl on 2026-04-07
-# Services with no public JSON API are set to poll_type: manual
+# IT Service Health Dashboard — Service Registry (EXAMPLE)
+#
+# This is a generic, public example registry. It monitors well-known public
+# developer-tool status pages so the dashboard works immediately for a demo.
+#
+# For a real deployment, copy this file to `services.local.yaml` (gitignored)
+# and replace these entries with your own organization's services. The loader
+# prefers `services.local.yaml` when present, so your real registry never has
+# to live in version control.
+#
+# Schema per entry:
+# id: stable slug, referenced by dependencies.yaml and the API
+# display_name: human-facing name shown in the UI
+# category: one of identity|productivity|collaboration|engineering|hr|
+# finance|sales|marketing|networking|support|other
+# poll_type: statuspage_json | manual | (see app/seed.py for the full set)
+# poll_url: required unless poll_type is manual
+# status_page_url: human-facing status page link
+# tier: critical | important | informational (default: important)
categories:
identity: "Identity & Access"
@@ -16,194 +32,82 @@ categories:
services:
# ── Identity & Access ─────────────────────────────────────────────
- - id: okta
- display_name: Okta
+ # The SSO / identity broker. Point SSO_BROKER_SERVICE_ID at this id to
+ # enable the dedicated identity-provider impact wording. Many SSO vendors
+ # gate their status API, so this is modeled as a manual-update service.
+ - id: identity-provider
+ display_name: Identity Provider (SSO)
category: identity
- poll_type: manual # status.okta.com returns 401 — requires Salesforce login, no public API
- status_page_url: https://status.okta.com
- tier: critical # SSO broker — outage blocks access to almost every app
+ poll_type: manual
+ status_page_url: https://example.com/status
+ tier: critical # SSO broker — an outage blocks login to everything downstream
- - id: duo
- display_name: Duo Security
- category: identity
+ # ── Engineering ───────────────────────────────────────────────────
+ - id: github
+ display_name: GitHub
+ category: engineering
poll_type: statuspage_json
- poll_url: https://status.duo.com/api/v2/summary.json
- status_page_url: https://status.duo.com
- tier: critical # MFA — outage blocks every new login
+ poll_url: https://www.githubstatus.com/api/v2/summary.json
+ status_page_url: https://www.githubstatus.com
+ tier: important
- # ── Productivity ──────────────────────────────────────────────────
- - id: box
- display_name: Box
- category: productivity
+ - id: npm
+ display_name: npm
+ category: engineering
poll_type: statuspage_json
- poll_url: https://status.box.com/api/v2/summary.json
- status_page_url: https://status.box.com
+ poll_url: https://status.npmjs.org/api/v2/summary.json
+ status_page_url: https://status.npmjs.org
- - id: docusign
- display_name: DocuSign
- category: productivity
+ - id: pypi
+ display_name: PyPI
+ category: engineering
poll_type: statuspage_json
- poll_url: https://status.docusign.com/api/v2/summary.json
- status_page_url: https://status.docusign.com
+ poll_url: https://status.python.org/api/v2/summary.json
+ status_page_url: https://status.python.org
- - id: google-mail
- display_name: Google Mail
- category: productivity
- poll_type: google_json
- poll_url: https://www.google.com/appsstatus/dashboard/incidents.json
- status_page_url: https://www.google.com/appsstatus/dashboard
-
- - id: google-calendar
- display_name: Google Calendar
- category: productivity
- poll_type: google_json
- poll_url: https://www.google.com/appsstatus/dashboard/incidents.json
- status_page_url: https://www.google.com/appsstatus/dashboard
-
- - id: conga
- display_name: Conga
- category: productivity
+ - id: sentry
+ display_name: Sentry
+ category: engineering
poll_type: statuspage_json
- poll_url: https://status.conga.com/api/v2/summary.json
- status_page_url: https://status.conga.com
+ poll_url: https://status.sentry.io/api/v2/summary.json
+ status_page_url: https://status.sentry.io
- - id: eptura
- display_name: Eptura (Teem)
+ # ── Productivity ──────────────────────────────────────────────────
+ - id: dropbox
+ display_name: Dropbox
category: productivity
poll_type: statuspage_json
- poll_url: https://status.eptura.com/api/v2/summary.json
- status_page_url: https://status.eptura.com
+ poll_url: https://status.dropbox.com/api/v2/summary.json
+ status_page_url: https://status.dropbox.com
# ── Collaboration ─────────────────────────────────────────────────
- - id: slack
- display_name: Slack
- category: collaboration
- poll_type: slack_api
- poll_url: https://slack-status.com/api/v2.0.0/current
- status_page_url: https://status.slack.com
- tier: critical # Primary async comms — outage halts incident response itself
-
- - id: zoom
- display_name: Zoom
+ - id: discord
+ display_name: Discord
category: collaboration
poll_type: statuspage_json
- poll_url: https://status.zoom.us/api/v2/summary.json # returns 302 — httpx needs follow_redirects=True
- status_page_url: https://status.zoom.us
-
- - id: ringcentral
- display_name: RingCentral
- category: collaboration
- poll_type: ringcentral_api
- poll_url: https://status.ringcentral.com/status.json
- status_page_url: https://status.ringcentral.com
-
- # ── Engineering ───────────────────────────────────────────────────
- - id: confluence
- display_name: Confluence
- category: engineering
- poll_type: statuspage_json # page-level status — components hidden when healthy
- poll_url: https://status.atlassian.com/api/v2/summary.json
- statuspage_component_name: Confluence
- status_page_url: https://status.atlassian.com
-
- - id: jira
- display_name: Jira
- category: engineering
- poll_type: statuspage_json # page-level status — components hidden when healthy
- poll_url: https://status.atlassian.com/api/v2/summary.json
- statuspage_component_name: Jira
- status_page_url: https://status.atlassian.com
-
- - id: servicedesk
- display_name: Jira Service Management
- category: engineering
- poll_type: statuspage_json # page-level status — components hidden when healthy
- poll_url: https://status.atlassian.com/api/v2/summary.json
- statuspage_component_name: Jira Service Management
- status_page_url: https://status.atlassian.com
-
- - id: snaplogic
- display_name: SnapLogic
- category: engineering
- poll_type: statuspage_json
- poll_url: https://trust.snaplogic.com/api/v2/summary.json # corrected: trust.snaplogic.com, not status.snaplogic.com
- status_page_url: https://trust.snaplogic.com
-
- # ── HR & People ──────────────────────────────────────────────────
- - id: greenhouse
- display_name: Greenhouse
- category: hr
- poll_type: statuspage_json
- poll_url: https://status.greenhouse.io/api/v2/summary.json
- status_page_url: https://status.greenhouse.io
-
- - id: workday
- display_name: Workday
- category: hr
- poll_type: manual # trust site requires Workday Community login — no public API
- status_page_url: https://community.workday.com/trust/status
-
- - id: cornerstone
- display_name: Cornerstone OnDemand
- category: hr
+ poll_url: https://discordstatus.com/api/v2/summary.json
+ status_page_url: https://discordstatus.com
+ tier: critical # primary async comms in this example deployment
+
+ # ── Networking ────────────────────────────────────────────────────
+ - id: cloudflare
+ display_name: Cloudflare
+ category: networking
poll_type: statuspage_json
- poll_url: https://status.csod.com/api/v2/summary.json
- status_page_url: https://status.csod.com
+ poll_url: https://www.cloudflarestatus.com/api/v2/summary.json
+ status_page_url: https://www.cloudflarestatus.com
- # ── Finance ──────────────────────────────────────────────────────
- - id: concur
- display_name: SAP Concur
- category: finance
- poll_type: manual # open.concur.com is a React SPA — no JSON API
- status_page_url: https://open.concur.com
-
- - id: coupa
- display_name: Coupa
- category: finance
- poll_type: manual
- status_page_url: null
-
- - id: netsuite
- display_name: NetSuite
- category: finance
+ # ── Other ─────────────────────────────────────────────────────────
+ - id: datadog
+ display_name: Datadog
+ category: other
poll_type: statuspage_json
- poll_url: https://status.netsuite.com/api/v2/summary.json
- status_page_url: https://status.netsuite.com
-
- - id: zuora
- display_name: Zuora
- category: finance
- poll_type: statuspage_json
- poll_url: https://trust.zuora.com/api/v2/summary.json # corrected: trust.zuora.com, not status.zuora.com
- status_page_url: https://trust.zuora.com
-
- # ── Sales & CRM ─────────────────────────────────────────────────
- - id: salesforce
- display_name: Salesforce
- category: sales
- poll_type: salesforce_trust
- poll_url: https://api.status.salesforce.com/v1/incidents
- status_page_url: https://status.salesforce.com
+ poll_url: https://status.datadoghq.com/api/v2/summary.json
+ status_page_url: https://status.datadoghq.com
- # ── Marketing ────────────────────────────────────────────────────
- - id: iterable
- display_name: Iterable
- category: marketing
- poll_type: statuspage_json
- poll_url: https://status.iterable.com/api/v2/summary.json
- status_page_url: https://status.iterable.com
-
- - id: marketo
- display_name: Marketo (Adobe)
- category: marketing
- poll_type: manual # status.adobe.com requires Adobe I/O credentials — no public API
- status_page_url: https://status.adobe.com
-
- # ── Support ──────────────────────────────────────────────────────
- - id: zendesk
- display_name: Zendesk
+ # ── Support ───────────────────────────────────────────────────────
+ - id: ticketing
+ display_name: Ticketing / ITSM
category: support
- poll_type: zendesk_api
- poll_url: https://status.zendesk.com/api/incidents/active
- status_page_url: https://status.zendesk.com
-
+ poll_type: manual
+ status_page_url: https://example.com/status
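Review note: both example configs say the loader prefers a `.local.yaml` sibling when one exists. The loader itself is not part of this diff, so the following is only a sketch of how such a preference could be resolved; `resolve_config` and its parameters are hypothetical.

```python
from pathlib import Path


def resolve_config(config_dir: Path, stem: str) -> Path:
    """Return <stem>.local.yaml if it exists, else the committed <stem>.yaml."""
    local = config_dir / f"{stem}.local.yaml"
    if local.is_file():
        return local
    return config_dir / f"{stem}.yaml"
```

For example, `resolve_config(Path("backend/config"), "services")` would pick `services.local.yaml` on a machine that has one and fall back to the committed example everywhere else.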
diff --git a/backend/pyproject.toml b/backend/pyproject.toml
index 485a441..175d064 100644
--- a/backend/pyproject.toml
+++ b/backend/pyproject.toml
@@ -1,7 +1,7 @@
[project]
name = "it-service-health-dashboard"
version = "0.1.0"
-description = "Internal dashboard that aggregates SaaS vendor health for Box IT"
+description = "Internal dashboard that aggregates SaaS vendor health for enterprise IT"
requires-python = ">=3.12"
# Runtime deps are pinned in requirements.txt for reproducible CI;
# this file covers tool configs (ruff, mypy, pytest) + packaging metadata.
diff --git a/backend/tests/test_admin_api.py b/backend/tests/test_admin_api.py
index 900f9bf..dd022c1 100644
--- a/backend/tests/test_admin_api.py
+++ b/backend/tests/test_admin_api.py
@@ -23,14 +23,21 @@ async def seeded_app(tmp_path):
db_path = str(tmp_path / "test.db")
conn = await init_db(db_path)
- services = load_services()
- from tests.test_seeder import seed_deps_with_db, seed_services_with_db
+ from tests.test_seeder import (
+ _DEPENDENCIES_YAML,
+ _SERVICES_YAML,
+ seed_deps_with_db,
+ seed_services_with_db,
+ )
+
+ services = load_services(path=_SERVICES_YAML)
await seed_services_with_db(conn, services)
- deps = load_dependencies(known_service_ids={s.id for s in services})
+ deps = load_dependencies(path=_DEPENDENCIES_YAML, known_service_ids={s.id for s in services})
await seed_deps_with_db(conn, deps, [s.id for s in services])
# Import app after DB is initialized
from app.main import app
+
yield app
await close_db()
@@ -48,18 +55,21 @@ async def client(seeded_app):
class TestAdminAuth:
async def test_missing_token_rejected(self, client):
- resp = await client.post("/api/admin/status", json={
- "service_id": "okta",
- "new_status": "degraded",
- "reason": "test",
- })
+ resp = await client.post(
+ "/api/admin/status",
+ json={
+ "service_id": "identity-provider",
+ "new_status": "degraded",
+ "reason": "test",
+ },
+ )
assert resp.status_code == 401
async def test_wrong_token_rejected(self, client):
resp = await client.post(
"/api/admin/status",
json={
- "service_id": "okta",
+ "service_id": "identity-provider",
"new_status": "degraded",
"reason": "test",
},
@@ -72,7 +82,7 @@ async def test_unset_token_returns_503(self, client, monkeypatch):
resp = await client.post(
"/api/admin/status",
json={
- "service_id": "okta",
+ "service_id": "identity-provider",
"new_status": "degraded",
"reason": "test",
},
@@ -84,7 +94,7 @@ async def test_non_bearer_scheme_rejected(self, client):
resp = await client.post(
"/api/admin/status",
json={
- "service_id": "okta",
+ "service_id": "identity-provider",
"new_status": "degraded",
"reason": "test",
},
@@ -98,10 +108,10 @@ async def test_update_valid_service(self, client):
resp = await client.post(
"/api/admin/status",
json={
- "service_id": "okta",
+ "service_id": "identity-provider",
"new_status": "degraded",
"detail": "SSO slow",
- "reason": "User reported in #it-help",
+ "reason": "User reported in the help channel",
},
headers=AUTH_HEADERS,
)
@@ -129,7 +139,7 @@ async def test_update_invalid_status(self, client):
resp = await client.post(
"/api/admin/status",
json={
- "service_id": "okta",
+ "service_id": "identity-provider",
"new_status": "invalid_status",
"reason": "test",
},
@@ -141,7 +151,7 @@ async def test_update_missing_reason(self, client):
resp = await client.post(
"/api/admin/status",
json={
- "service_id": "okta",
+ "service_id": "identity-provider",
"new_status": "degraded",
},
headers=AUTH_HEADERS,
@@ -152,7 +162,7 @@ async def test_update_creates_status_event_with_audit(self, client):
resp = await client.post(
"/api/admin/status",
json={
- "service_id": "workday",
+ "service_id": "ticketing",
"new_status": "major_outage",
"detail": "Down",
"reason": "Confirmed with vendor support",
@@ -162,10 +172,11 @@ async def test_update_creates_status_event_with_audit(self, client):
assert resp.status_code == 200
from app.database import get_db
+
db = await get_db()
cursor = await db.execute(
"""SELECT source, new_status, updated_by, reason, client_ip
- FROM status_events WHERE service_id='workday'"""
+ FROM status_events WHERE service_id='ticketing'"""
)
row = dict(await cursor.fetchone())
assert row["source"] == "manual"
@@ -179,7 +190,7 @@ async def test_update_same_status_no_event(self, client):
await client.post(
"/api/admin/status",
json={
- "service_id": "concur",
+ "service_id": "datadog",
"new_status": "degraded",
"reason": "first update",
},
@@ -188,7 +199,7 @@ async def test_update_same_status_no_event(self, client):
resp = await client.post(
"/api/admin/status",
json={
- "service_id": "concur",
+ "service_id": "datadog",
"new_status": "degraded",
"detail": "Still slow",
"reason": "follow-up",
@@ -200,17 +211,16 @@ async def test_update_same_status_no_event(self, client):
assert body["meta"]["status_changed"] is False
from app.database import get_db
+
db = await get_db()
- cursor = await db.execute(
- "SELECT count(*) FROM status_events WHERE service_id='concur'"
- )
+ cursor = await db.execute("SELECT count(*) FROM status_events WHERE service_id='datadog'")
assert (await cursor.fetchone())[0] == 1
async def test_response_envelope_structure(self, client):
resp = await client.post(
"/api/admin/status",
json={
- "service_id": "okta",
+ "service_id": "identity-provider",
"new_status": "operational",
"reason": "resolved",
},
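Review note for the burn-rate tests in the next file: their docstrings and inline comments rely on the arithmetic "failure rate divided by error budget", with a breach requiring both the short and the long window to exceed the threshold. The sketch below reproduces that arithmetic so the expected numbers can be checked by hand. The function names and the use of `>=` for the threshold comparison are assumptions; the 99.9% SLO and the 14.4 and 6.0 thresholds are taken from the test comments.

```python
def burn_rate(uptime_percent: float, slo_percent: float = 99.9) -> float:
    """Observed failure rate expressed as a multiple of the error budget."""
    error_budget = 100.0 - slo_percent
    return (100.0 - uptime_percent) / error_budget


def is_breach(short_uptime: float, long_uptime: float, threshold: float) -> bool:
    """Both windows must burn at or above the threshold (comparison assumed)."""
    return burn_rate(short_uptime) >= threshold and burn_rate(long_uptime) >= threshold
```

With these definitions, 97.12% uptime gives roughly 28.8x burn, which exceeds the 14.4 fast threshold, while 99.5% gives roughly 5x and 99.8% roughly 2x, matching the values quoted in the tests.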
diff --git a/backend/tests/test_burn_rate.py b/backend/tests/test_burn_rate.py
index 8c1a91c..63f7022 100644
--- a/backend/tests/test_burn_rate.py
+++ b/backend/tests/test_burn_rate.py
@@ -42,7 +42,9 @@
)
-async def _seed_service(db: aiosqlite.Connection, service_id: str = _SVC, name: str = _SVC_NAME) -> None:
+async def _seed_service(
+ db: aiosqlite.Connection, service_id: str = _SVC, name: str = _SVC_NAME
+) -> None:
await db.execute(_CREATE_SVC, (service_id, name))
await db.commit()
@@ -149,10 +151,14 @@ async def _fake(
for ws, pct in window_to_uptime.items():
if abs((window - ws).total_seconds()) < 2.0:
if pct is None:
- return WindowUptime(operational_seconds=0.0, tracked_seconds=0.0, uptime_percent=None)
+ return WindowUptime(
+ operational_seconds=0.0, tracked_seconds=0.0, uptime_percent=None
+ )
total = ws.total_seconds()
op = total * (pct / 100.0)
- return WindowUptime(operational_seconds=op, tracked_seconds=total, uptime_percent=pct)
+ return WindowUptime(
+ operational_seconds=op, tracked_seconds=total, uptime_percent=pct
+ )
return WindowUptime(operational_seconds=0.0, tracked_seconds=0.0, uptime_percent=None)
monkeypatch.setattr(br_module, "compute_uptime", _fake)
@@ -227,16 +233,21 @@ async def test_burn_rate_zero_downtime_returns_zero(self, db_with_svc: aiosqlite
@pytest.mark.asyncio
async def test_burn_rate_math_matches_hand_calc(
- self, db_with_svc: aiosqlite.Connection, monkeypatch: pytest.MonkeyPatch,
+ self,
+ db_with_svc: aiosqlite.Connection,
+ monkeypatch: pytest.MonkeyPatch,
):
"""97.12% uptime -> 2.88% failure rate -> 28.8x burn at 99.9% SLO -> fast breach."""
- _stub_compute_uptime(monkeypatch, {
- timedelta(minutes=5): 97.12,
- timedelta(minutes=30): 100.0,
- timedelta(hours=1): 97.12,
- timedelta(hours=6): 100.0,
- timedelta(days=30): 99.95,
- })
+ _stub_compute_uptime(
+ monkeypatch,
+ {
+ timedelta(minutes=5): 97.12,
+ timedelta(minutes=30): 100.0,
+ timedelta(hours=1): 97.12,
+ timedelta(hours=6): 100.0,
+ timedelta(days=30): 99.95,
+ },
+ )
breaches = await evaluate_burn_rate(db_with_svc, _SVC, _SVC_NAME, datetime.now(UTC))
fast_breaches = [b for b in breaches if b.severity == "fast"]
@@ -247,31 +258,41 @@ async def test_burn_rate_math_matches_hand_calc(
@pytest.mark.asyncio
async def test_fast_breach_requires_both_windows(
- self, db_with_svc: aiosqlite.Connection, monkeypatch: pytest.MonkeyPatch,
+ self,
+ db_with_svc: aiosqlite.Connection,
+ monkeypatch: pytest.MonkeyPatch,
):
"""5m high burn but 1h low burn -> no fast breach."""
- _stub_compute_uptime(monkeypatch, {
- timedelta(minutes=5): 97.0, # ~30x burn
- timedelta(minutes=30): 100.0,
- timedelta(hours=1): 99.5, # ~5x burn, below 14.4 fast threshold
- timedelta(hours=6): 100.0,
- timedelta(days=30): 100.0,
- })
+ _stub_compute_uptime(
+ monkeypatch,
+ {
+ timedelta(minutes=5): 97.0, # ~30x burn
+ timedelta(minutes=30): 100.0,
+ timedelta(hours=1): 99.5, # ~5x burn, below 14.4 fast threshold
+ timedelta(hours=6): 100.0,
+ timedelta(days=30): 100.0,
+ },
+ )
breaches = await evaluate_burn_rate(db_with_svc, _SVC, _SVC_NAME, datetime.now(UTC))
assert [b for b in breaches if b.severity == "fast"] == []
@pytest.mark.asyncio
async def test_slow_breach_requires_both_windows(
- self, db_with_svc: aiosqlite.Connection, monkeypatch: pytest.MonkeyPatch,
+ self,
+ db_with_svc: aiosqlite.Connection,
+ monkeypatch: pytest.MonkeyPatch,
):
"""30m high burn but 6h low burn -> no slow breach."""
- _stub_compute_uptime(monkeypatch, {
- timedelta(minutes=5): 100.0,
- timedelta(minutes=30): 99.0, # 10x burn
- timedelta(hours=1): 100.0,
- timedelta(hours=6): 99.8, # 2x burn, below 6.0 slow threshold
- timedelta(days=30): 100.0,
- })
+ _stub_compute_uptime(
+ monkeypatch,
+ {
+ timedelta(minutes=5): 100.0,
+ timedelta(minutes=30): 99.0, # 10x burn
+ timedelta(hours=1): 100.0,
+ timedelta(hours=6): 99.8, # 2x burn, below 6.0 slow threshold
+ timedelta(days=30): 100.0,
+ },
+ )
breaches = await evaluate_burn_rate(db_with_svc, _SVC, _SVC_NAME, datetime.now(UTC))
assert [b for b in breaches if b.severity == "slow"] == []
@@ -294,16 +315,21 @@ async def test_unknown_dominated_window_does_not_alert(self, db: aiosqlite.Conne
@pytest.mark.asyncio
async def test_both_fast_and_slow_can_fire_simultaneously(
- self, db_with_svc: aiosqlite.Connection, monkeypatch: pytest.MonkeyPatch,
+ self,
+ db_with_svc: aiosqlite.Connection,
+ monkeypatch: pytest.MonkeyPatch,
):
"""All four windows at high burn -> 2 breaches returned (fast + slow)."""
- _stub_compute_uptime(monkeypatch, {
- timedelta(minutes=5): 97.0,
- timedelta(minutes=30): 97.0,
- timedelta(hours=1): 97.0,
- timedelta(hours=6): 97.0,
- timedelta(days=30): 99.95,
- })
+ _stub_compute_uptime(
+ monkeypatch,
+ {
+ timedelta(minutes=5): 97.0,
+ timedelta(minutes=30): 97.0,
+ timedelta(hours=1): 97.0,
+ timedelta(hours=6): 97.0,
+ timedelta(days=30): 99.95,
+ },
+ )
breaches = await evaluate_burn_rate(db_with_svc, _SVC, _SVC_NAME, datetime.now(UTC))
severities = {b.severity for b in breaches}
assert "fast" in severities
@@ -312,48 +338,63 @@ async def test_both_fast_and_slow_can_fire_simultaneously(
@pytest.mark.asyncio
async def test_error_budget_remaining_pct_from_30d_uptime(
- self, db_with_svc: aiosqlite.Connection, monkeypatch: pytest.MonkeyPatch,
+ self,
+ db_with_svc: aiosqlite.Connection,
+ monkeypatch: pytest.MonkeyPatch,
):
"""30d uptime of 99.95% -> half budget used -> ~50% remaining."""
- _stub_compute_uptime(monkeypatch, {
- timedelta(minutes=5): 97.0,
- timedelta(minutes=30): 97.0,
- timedelta(hours=1): 97.0,
- timedelta(hours=6): 97.0,
- timedelta(days=30): 99.95,
- })
+ _stub_compute_uptime(
+ monkeypatch,
+ {
+ timedelta(minutes=5): 97.0,
+ timedelta(minutes=30): 97.0,
+ timedelta(hours=1): 97.0,
+ timedelta(hours=6): 97.0,
+ timedelta(days=30): 99.95,
+ },
+ )
breaches = await evaluate_burn_rate(db_with_svc, _SVC, _SVC_NAME, datetime.now(UTC))
fast = next(b for b in breaches if b.severity == "fast")
assert abs(fast.error_budget_remaining_pct - 50.0) < 2.0
@pytest.mark.asyncio
async def test_error_budget_remaining_pct_fully_consumed(
- self, db_with_svc: aiosqlite.Connection, monkeypatch: pytest.MonkeyPatch,
+ self,
+ db_with_svc: aiosqlite.Connection,
+ monkeypatch: pytest.MonkeyPatch,
):
"""30d uptime exactly at 99.9% target -> 0% budget remaining."""
- _stub_compute_uptime(monkeypatch, {
- timedelta(minutes=5): 97.0,
- timedelta(minutes=30): 97.0,
- timedelta(hours=1): 97.0,
- timedelta(hours=6): 97.0,
- timedelta(days=30): 99.9,
- })
+ _stub_compute_uptime(
+ monkeypatch,
+ {
+ timedelta(minutes=5): 97.0,
+ timedelta(minutes=30): 97.0,
+ timedelta(hours=1): 97.0,
+ timedelta(hours=6): 97.0,
+ timedelta(days=30): 99.9,
+ },
+ )
breaches = await evaluate_burn_rate(db_with_svc, _SVC, _SVC_NAME, datetime.now(UTC))
fast = next(b for b in breaches if b.severity == "fast")
assert fast.error_budget_remaining_pct <= 1.0
@pytest.mark.asyncio
async def test_error_budget_remaining_pct_full_when_no_data(
- self, db_with_svc: aiosqlite.Connection, monkeypatch: pytest.MonkeyPatch,
+ self,
+ db_with_svc: aiosqlite.Connection,
+ monkeypatch: pytest.MonkeyPatch,
):
"""30d window with no tracked data (uptime_percent=None) -> 100% remaining."""
- _stub_compute_uptime(monkeypatch, {
- timedelta(minutes=5): 97.0,
- timedelta(minutes=30): 97.0,
- timedelta(hours=1): 97.0,
- timedelta(hours=6): 97.0,
- timedelta(days=30): None,
- })
+ _stub_compute_uptime(
+ monkeypatch,
+ {
+ timedelta(minutes=5): 97.0,
+ timedelta(minutes=30): 97.0,
+ timedelta(hours=1): 97.0,
+ timedelta(hours=6): 97.0,
+ timedelta(days=30): None,
+ },
+ )
breaches = await evaluate_burn_rate(db_with_svc, _SVC, _SVC_NAME, datetime.now(UTC))
fast = next(b for b in breaches if b.severity == "fast")
assert fast.error_budget_remaining_pct == 100.0
@@ -417,9 +458,7 @@ async def test_route_suppressed_by_dedup(self, db_with_svc: aiosqlite.Connection
)
await db.commit()
- decision = await route_slo_burn_rate_alert(
- db, breach, "https://hooks.slack.com/test", now
- )
+ decision = await route_slo_burn_rate_alert(db, breach, "https://hooks.slack.com/test", now)
assert decision.should_send is False
assert decision.suppressed_by == "dedup"
@@ -441,9 +480,7 @@ async def test_route_suppressed_by_maintenance(self, db_with_svc: aiosqlite.Conn
)
await db.commit()
- decision = await route_slo_burn_rate_alert(
- db, breach, "https://hooks.slack.com/test", now
- )
+ decision = await route_slo_burn_rate_alert(db, breach, "https://hooks.slack.com/test", now)
assert decision.should_send is False
assert decision.suppressed_by == "maintenance_window"
@@ -454,16 +491,17 @@ async def test_route_suppressed_when_no_webhook(self, db_with_svc: aiosqlite.Con
breach = _make_fast_breach()
now = datetime.now(UTC)
- decision = await route_slo_burn_rate_alert(
- db_with_svc, breach, None, now
- )
+ decision = await route_slo_burn_rate_alert(db_with_svc, breach, None, now)
assert decision.should_send is False
assert decision.suppressed_by == "webhook_not_configured"
def test_build_dedup_key_format(self):
"""build_slo_burn_rate_dedup_key produces expected format."""
- assert build_slo_burn_rate_dedup_key("slack_api", "fast") == "slo_burn:slack_api:fast"
+ assert (
+ build_slo_burn_rate_dedup_key("identity-provider", "fast")
+ == "slo_burn:identity-provider:fast"
+ )
assert build_slo_burn_rate_dedup_key("github", "slow") == "slo_burn:github:slow"
@@ -481,6 +519,7 @@ async def test_record_alert_writes_row(self, db_with_svc: aiosqlite.Connection):
dedup_key = build_slo_burn_rate_dedup_key(breach.service_id, breach.severity)
from app.alerting.routing import RoutingDecision
+
decision = RoutingDecision(
should_send=True,
webhook_url="https://hooks.slack.com/test",
@@ -493,9 +532,7 @@ async def test_record_alert_writes_row(self, db_with_svc: aiosqlite.Connection):
await record_slo_alert(db, breach, decision)
await db.commit()
- cursor = await db.execute(
- "SELECT * FROM alert_sent_log WHERE dedup_key = ?", (dedup_key,)
- )
+ cursor = await db.execute("SELECT * FROM alert_sent_log WHERE dedup_key = ?", (dedup_key,))
row = await cursor.fetchone()
assert row is not None
assert row["alert_kind"] == "slo_burn_rate"
@@ -511,6 +548,7 @@ async def test_record_alert_records_suppression(self, db_with_svc: aiosqlite.Con
dedup_key = build_slo_burn_rate_dedup_key(breach.service_id, breach.severity)
from app.alerting.routing import RoutingDecision
+
decision = RoutingDecision(
should_send=False,
webhook_url=None,
@@ -654,7 +692,8 @@ def test_payload_omits_channel_mention_context_when_empty(self, monkeypatch):
# The mention context block is only appended when channel_mention is truthy
mention_contexts = [
- block for block in payload.get("blocks", [])
+ block
+ for block in payload.get("blocks", [])
if block.get("type") == "context"
and any("here" in str(e) for e in block.get("elements", []))
]
@@ -678,9 +717,7 @@ async def _spy(*args: Any, **kwargs: Any) -> list:
called.append(True)
return []
- monkeypatch.setattr(
- "app.alerting.burn_rate.evaluate_burn_rate", _spy
- )
+ monkeypatch.setattr("app.alerting.burn_rate.evaluate_burn_rate", _spy)
app_mock = MagicMock()
await run_slo_burn_rate_cycle(app_mock)
@@ -689,7 +726,9 @@ async def _spy(*args: Any, **kwargs: Any) -> list:
@pytest.mark.asyncio
async def test_cycle_routes_and_records_for_each_breach(
- self, db_with_svc: aiosqlite.Connection, monkeypatch: pytest.MonkeyPatch,
+ self,
+ db_with_svc: aiosqlite.Connection,
+ monkeypatch: pytest.MonkeyPatch,
):
"""Enable flag, stub a fast breach, mock Slack send → alert_sent_log row written."""
monkeypatch.setattr(settings, "slo_burn_rate_enabled", True)
@@ -711,6 +750,7 @@ async def _fake_evaluate(
# get_db is imported lazily inside run_slo_burn_rate_cycle — patch at source.
async def _fake_get_db() -> aiosqlite.Connection:
return db
+
monkeypatch.setattr("app.database.get_db", _fake_get_db)
# Patch the Slack send hook wherever burn_rate.py imports it from.
@@ -722,6 +762,7 @@ async def _fake_send(*args: Any, **kwargs: Any) -> bool:
if isinstance(payload, dict):
send_calls.append(payload)
return True
+
# Try common names; set whichever exists on the module.
for attr in ("send_slack_alert", "send_slack_webhook", "send_slack"):
if hasattr(br_module, attr):
@@ -739,7 +780,9 @@ async def _fake_send(*args: Any, **kwargs: Any) -> bool:
@pytest.mark.asyncio
async def test_cycle_logs_duration_no_error(
- self, db_with_svc: aiosqlite.Connection, monkeypatch: pytest.MonkeyPatch,
+ self,
+ db_with_svc: aiosqlite.Connection,
+ monkeypatch: pytest.MonkeyPatch,
):
"""Cycle completes without raising even when no breaches fire."""
monkeypatch.setattr(settings, "slo_burn_rate_enabled", True)
@@ -749,13 +792,18 @@ async def test_cycle_logs_duration_no_error(
async def _fake_get_db() -> aiosqlite.Connection:
return db
+
monkeypatch.setattr("app.database.get_db", _fake_get_db)
# No breaches so no Slack send path is exercised.
async def _no_breaches(
- _db: aiosqlite.Connection, _sid: str, _sname: str, _now: datetime,
+ _db: aiosqlite.Connection,
+ _sid: str,
+ _sname: str,
+ _now: datetime,
) -> list[BurnRateBreach]:
return []
+
monkeypatch.setattr(br_module, "evaluate_burn_rate", _no_breaches)
app_mock = MagicMock()
diff --git a/backend/tests/test_graph.py b/backend/tests/test_graph.py
index 491c095..0837623 100644
--- a/backend/tests/test_graph.py
+++ b/backend/tests/test_graph.py
@@ -4,7 +4,12 @@
from app.dependencies.graph import get_downstream, get_upstream
from app.seed import DependencyTarget, load_dependencies, load_services
-from tests.test_seeder import seed_deps_with_db, seed_services_with_db
+from tests.test_seeder import (
+ _DEPENDENCIES_YAML,
+ _SERVICES_YAML,
+ seed_deps_with_db,
+ seed_services_with_db,
+)
async def _insert_service(db, sid):
@@ -31,48 +36,50 @@ async def _insert_edge(db, upstream, downstream, severity="high"):
@pytest.fixture
async def seeded_db(db):
- """DB with services and dependencies seeded."""
- services = load_services()
+ """DB with services and dependencies seeded from the committed example."""
+ services = load_services(path=_SERVICES_YAML)
await seed_services_with_db(db, services)
- deps = load_dependencies()
+ deps = load_dependencies(path=_DEPENDENCIES_YAML)
await seed_deps_with_db(db, deps, [s.id for s in services])
return db
class TestGetDownstream:
- async def test_okta_has_12_downstream(self, seeded_db):
- results = await get_downstream(seeded_db, "okta")
- assert len(results) == 12
+ async def test_identity_provider_has_8_downstream(self, seeded_db):
+ results = await get_downstream(seeded_db, "identity-provider")
+ assert len(results) == 8
- async def test_okta_downstream_includes_box(self, seeded_db):
- results = await get_downstream(seeded_db, "okta")
+ async def test_idp_downstream_includes_known(self, seeded_db):
+ results = await get_downstream(seeded_db, "identity-provider")
ids = [r["service_id"] for r in results]
- assert "box" in ids
- assert "slack" in ids
- assert "zoom" in ids
+ assert "github" in ids
+ assert "dropbox" in ids
+ assert "ticketing" in ids
async def test_downstream_ordered_by_severity(self, seeded_db):
- results = await get_downstream(seeded_db, "okta")
+ results = await get_downstream(seeded_db, "identity-provider")
severities = [r["severity"] for r in results]
# Assert at least one 'critical' row exists before locating its last index.
# This test protects against regressions in the SEVERITY ORDER BY sort:
# if a non-critical entry sneaks between two critical entries we'd
# catch it via `last_critical < first_high` below.
assert "critical" in severities
- last_critical = len(severities) - 1 - next(
- i for i, s in enumerate(reversed(severities)) if s == "critical"
+ last_critical = (
+ len(severities)
+ - 1
+ - next(i for i, s in enumerate(reversed(severities)) if s == "critical")
)
first_high = next((i for i, s in enumerate(severities) if s == "high"), len(severities))
assert last_critical < first_high or first_high == len(severities)
async def test_downstream_includes_current_status(self, seeded_db):
- results = await get_downstream(seeded_db, "okta")
+ results = await get_downstream(seeded_db, "identity-provider")
for r in results:
assert "current_status" in r
assert r["current_status"] is not None
async def test_no_downstream(self, seeded_db):
- results = await get_downstream(seeded_db, "coupa")
+ results = await get_downstream(seeded_db, "npm")
assert results == []
async def test_nonexistent_service(self, seeded_db):
@@ -81,18 +88,18 @@ async def test_nonexistent_service(self, seeded_db):
class TestGetUpstream:
- async def test_box_upstream_includes_okta(self, seeded_db):
- results = await get_upstream(seeded_db, "box")
+ async def test_github_upstream_includes_idp(self, seeded_db):
+ results = await get_upstream(seeded_db, "github")
ids = [r["service_id"] for r in results]
- assert "okta" in ids
+ assert "identity-provider" in ids
- async def test_okta_upstream_includes_duo(self, seeded_db):
- results = await get_upstream(seeded_db, "okta")
+ async def test_github_upstream_includes_cloudflare(self, seeded_db):
+ results = await get_upstream(seeded_db, "github")
ids = [r["service_id"] for r in results]
- assert "duo" in ids
+ assert "cloudflare" in ids
async def test_no_upstream(self, seeded_db):
- results = await get_upstream(seeded_db, "duo")
+ results = await get_upstream(seeded_db, "identity-provider")
assert results == []
@@ -142,12 +149,15 @@ async def test_seed_allows_cycle_without_error(self, tmp_path):
the guard is about orphan references, not acyclicity. Document
this by asserting it doesn't throw."""
import yaml
- cycle_yaml = yaml.safe_dump({
- "dependencies": {
- "a": [{"service": "b", "impact": "x", "severity": "high"}],
- "b": [{"service": "a", "impact": "x", "severity": "high"}],
- },
- })
+
+ cycle_yaml = yaml.safe_dump(
+ {
+ "dependencies": {
+ "a": [{"service": "b", "impact": "x", "severity": "high"}],
+ "b": [{"service": "a", "impact": "x", "severity": "high"}],
+ },
+ }
+ )
path = tmp_path / "cycle_deps.yaml"
path.write_text(cycle_yaml)
deps = load_dependencies(path=path, known_service_ids={"a", "b"})
diff --git a/backend/tests/test_normalizer.py b/backend/tests/test_normalizer.py
index 7597a48..7efe218 100644
--- a/backend/tests/test_normalizer.py
+++ b/backend/tests/test_normalizer.py
@@ -1,18 +1,19 @@
-"""Tests for status normalizer — all vendor mappings + edge cases."""
+"""Tests for status normalizer — all format mappings + edge cases."""
import pytest
from app.poller.normalizer import (
ServiceStatus,
- normalize_google_status,
+ normalize_current_status,
+ normalize_product_feed_status,
normalize_rss_title,
- normalize_slack_status,
normalize_statuspage_component,
normalize_statuspage_indicator,
)
# ── Statuspage.io Component Status ──────────────────────────────────
+
class TestStatuspageComponent:
@pytest.mark.parametrize(
"input_status,expected",
@@ -43,6 +44,7 @@ def test_whitespace_stripped(self):
# ── Statuspage.io Page-Level Indicator ──────────────────────────────
+
class TestStatuspageIndicator:
@pytest.mark.parametrize(
"indicator,expected",
@@ -67,16 +69,17 @@ def test_case_insensitive(self):
assert normalize_statuspage_indicator("Critical") == ServiceStatus.MAJOR_OUTAGE
-# ── Slack Status API ───────────────────────────────────────────────
+# ── Current Status API ────────────────────────────────────────────
+
-class TestSlackStatus:
+class TestCurrentStatus:
def test_ok_no_incidents(self):
response = {"status": "ok", "active_incidents": []}
- assert normalize_slack_status(response) == ServiceStatus.OPERATIONAL
+ assert normalize_current_status(response) == ServiceStatus.OPERATIONAL
def test_ok_missing_incidents_key(self):
response = {"status": "ok"}
- assert normalize_slack_status(response) == ServiceStatus.OPERATIONAL
+ assert normalize_current_status(response) == ServiceStatus.OPERATIONAL
def test_outage_incident(self):
response = {
@@ -85,7 +88,7 @@ def test_outage_incident(self):
{"type": "outage", "title": "Major outage"},
],
}
- assert normalize_slack_status(response) == ServiceStatus.MAJOR_OUTAGE
+ assert normalize_current_status(response) == ServiceStatus.MAJOR_OUTAGE
def test_incident_type(self):
response = {
@@ -94,7 +97,7 @@ def test_incident_type(self):
{"type": "incident", "title": "Some users affected"},
],
}
- assert normalize_slack_status(response) == ServiceStatus.PARTIAL_OUTAGE
+ assert normalize_current_status(response) == ServiceStatus.PARTIAL_OUTAGE
def test_notice_type(self):
response = {
@@ -103,7 +106,7 @@ def test_notice_type(self):
{"type": "notice", "title": "Planned maintenance"},
],
}
- assert normalize_slack_status(response) == ServiceStatus.DEGRADED
+ assert normalize_current_status(response) == ServiceStatus.DEGRADED
def test_maintenance_type(self):
response = {
@@ -112,7 +115,7 @@ def test_maintenance_type(self):
{"type": "maintenance", "title": "Scheduled maintenance"},
],
}
- assert normalize_slack_status(response) == ServiceStatus.DEGRADED
+ assert normalize_current_status(response) == ServiceStatus.DEGRADED
def test_multiple_incidents_most_severe_wins(self):
response = {
@@ -123,7 +126,7 @@ def test_multiple_incidents_most_severe_wins(self):
{"type": "incident", "title": "Some issue"},
],
}
- assert normalize_slack_status(response) == ServiceStatus.MAJOR_OUTAGE
+ assert normalize_current_status(response) == ServiceStatus.MAJOR_OUTAGE
def test_unknown_incident_type(self):
response = {
@@ -132,71 +135,82 @@ def test_unknown_incident_type(self):
{"type": "something_new", "title": "Unknown type"},
],
}
- assert normalize_slack_status(response) == ServiceStatus.DEGRADED
+ assert normalize_current_status(response) == ServiceStatus.DEGRADED
def test_non_ok_no_incidents(self):
response = {"status": "active", "active_incidents": []}
- assert normalize_slack_status(response) == ServiceStatus.DEGRADED
+ assert normalize_current_status(response) == ServiceStatus.DEGRADED
-# ── Google Workspace ───────────────────────────────────────────────
+# ── Product Feed ──────────────────────────────────────────────────
-class TestGoogleStatus:
+
+class TestProductFeedStatus:
def test_no_incidents_operational(self):
- assert normalize_google_status([], "google-mail") == ServiceStatus.OPERATIONAL
+ assert normalize_product_feed_status([], "feed-product-a") == ServiceStatus.OPERATIONAL
def test_active_incident_for_product(self):
incidents = [
{
- "affected_products": [{"title": "Gmail"}],
+ "affected_products": [{"title": "Product A"}],
"most_recent_update": {"status": "SERVICE_DISRUPTION"},
},
]
- assert normalize_google_status(incidents, "google-mail") == ServiceStatus.PARTIAL_OUTAGE
+ assert (
+ normalize_product_feed_status(incidents, "feed-product-a")
+ == ServiceStatus.PARTIAL_OUTAGE
+ )
def test_active_outage_for_product(self):
incidents = [
{
- "affected_products": [{"title": "Gmail"}],
+ "affected_products": [{"title": "Product A"}],
"most_recent_update": {"status": "SERVICE_OUTAGE"},
},
]
- assert normalize_google_status(incidents, "google-mail") == ServiceStatus.MAJOR_OUTAGE
+ assert (
+ normalize_product_feed_status(incidents, "feed-product-a") == ServiceStatus.MAJOR_OUTAGE
+ )
def test_resolved_incident_is_operational(self):
incidents = [
{
- "affected_products": [{"title": "Gmail"}],
+ "affected_products": [{"title": "Product A"}],
"end": "2026-04-01T00:00:00Z",
"most_recent_update": {"status": "SERVICE_DISRUPTION"},
},
]
- assert normalize_google_status(incidents, "google-mail") == ServiceStatus.OPERATIONAL
+ assert (
+ normalize_product_feed_status(incidents, "feed-product-a") == ServiceStatus.OPERATIONAL
+ )
def test_incident_for_different_product(self):
incidents = [
{
- "affected_products": [{"title": "Google Drive"}],
+ "affected_products": [{"title": "Unmapped Product"}],
"most_recent_update": {"status": "SERVICE_OUTAGE"},
},
]
- assert normalize_google_status(incidents, "google-mail") == ServiceStatus.OPERATIONAL
+ assert (
+ normalize_product_feed_status(incidents, "feed-product-a") == ServiceStatus.OPERATIONAL
+ )
def test_calendar_product(self):
incidents = [
{
- "affected_products": [{"title": "Google Calendar"}],
+ "affected_products": [{"title": "Product B"}],
"most_recent_update": {"status": "degraded"},
},
]
- assert normalize_google_status(incidents, "google-calendar") == ServiceStatus.DEGRADED
+ assert normalize_product_feed_status(incidents, "feed-product-b") == ServiceStatus.DEGRADED
def test_unknown_service_id(self):
- assert normalize_google_status([], "google-drive") == ServiceStatus.UNKNOWN
+ assert normalize_product_feed_status([], "feed-product-unknown") == ServiceStatus.UNKNOWN
# ── RSS Feed ──────────────────────────────────────────────────────
+
class TestRSSTitle:
@pytest.mark.parametrize(
"title,expected",
diff --git a/backend/tests/test_poller_integration.py b/backend/tests/test_poller_integration.py
index b61d280..3963e74 100644
--- a/backend/tests/test_poller_integration.py
+++ b/backend/tests/test_poller_integration.py
@@ -1,6 +1,6 @@
"""End-to-end poller tests using respx for httpx mocking.
-These tests exercise each vendor poller against realistic mocked responses
+These tests exercise each poller against realistic mocked responses
to confirm they produce the right PollResult shape for both happy paths
and failure modes (timeouts, 5xx, malformed JSON).
"""
@@ -9,14 +9,14 @@
import pytest
import respx
-from app.poller.google_poller import poll_google
+from app.poller.active_incidents_poller import poll_active_incidents
+from app.poller.current_status_poller import poll_current_status
from app.poller.normalizer import ServiceStatus
+from app.poller.product_feed_poller import poll_product_feed
from app.poller.resilience import configure_breakers
-from app.poller.ringcentral_poller import poll_ringcentral
-from app.poller.salesforce_poller import poll_salesforce
-from app.poller.slack_poller import poll_slack
+from app.poller.service_array_poller import poll_service_array
from app.poller.statuspage_poller import poll_all_statuspage, poll_statuspage
-from app.poller.zendesk_poller import poll_zendesk
+from app.poller.trust_incidents_poller import poll_trust_incidents
@pytest.fixture(autouse=True)
@@ -27,17 +27,20 @@ def _reset_breakers():
class TestStatuspagePoller:
@respx.mock
async def test_happy_path_operational(self):
- respx.get("https://status.box.com/api/v2/summary.json").mock(
- return_value=httpx.Response(200, json={
- "page": {"name": "Box"},
- "status": {"indicator": "none", "description": "All Systems Operational"},
- "components": [],
- "incidents": [],
- "scheduled_maintenances": [],
- })
+ respx.get("https://status.example.com/api/v2/summary.json").mock(
+ return_value=httpx.Response(
+ 200,
+ json={
+ "page": {"name": "Example Service"},
+ "status": {"indicator": "none", "description": "All Systems Operational"},
+ "components": [],
+ "incidents": [],
+ "scheduled_maintenances": [],
+ },
+ )
)
async with httpx.AsyncClient() as client:
- result = await poll_statuspage(client, "https://status.box.com/api/v2/summary.json")
+ result = await poll_statuspage(client, "https://status.example.com/api/v2/summary.json")
assert result.status == ServiceStatus.OPERATIONAL
assert result.poll_failure_reason is None
@@ -65,153 +68,186 @@ async def test_404_produces_http_failure_reason(self):
@respx.mock
async def test_batch_polling_dedupes_urls(self):
"""Two services sharing one poll_url should cost one HTTP call."""
- route = respx.get("https://status.atlassian.com/api/v2/summary.json").mock(
- return_value=httpx.Response(200, json={
- "page": {"name": "Atlassian"},
- "status": {"indicator": "none"},
- "components": [
- {"name": "Jira", "status": "operational", "description": None},
- {"name": "Confluence", "status": "degraded_performance", "description": "latency"},
- ],
- "incidents": [],
- "scheduled_maintenances": [],
- })
+ route = respx.get("https://status.example.org/api/v2/summary.json").mock(
+ return_value=httpx.Response(
+ 200,
+ json={
+ "page": {"name": "Example Org"},
+ "status": {"indicator": "none"},
+ "components": [
+ {"name": "Issue Tracker", "status": "operational", "description": None},
+ {
+ "name": "Wiki",
+ "status": "degraded_performance",
+ "description": "latency",
+ },
+ ],
+ "incidents": [],
+ "scheduled_maintenances": [],
+ },
+ )
)
services = [
{
- "id": "jira",
- "poll_url": "https://status.atlassian.com/api/v2/summary.json",
- "statuspage_component_name": "Jira",
+ "id": "issue-tracker",
+ "poll_url": "https://status.example.org/api/v2/summary.json",
+ "statuspage_component_name": "Issue Tracker",
},
{
- "id": "confluence",
- "poll_url": "https://status.atlassian.com/api/v2/summary.json",
- "statuspage_component_name": "Confluence",
+ "id": "wiki",
+ "poll_url": "https://status.example.org/api/v2/summary.json",
+ "statuspage_component_name": "Wiki",
},
]
async with httpx.AsyncClient() as client:
results = await poll_all_statuspage(client, services)
assert route.call_count == 1
result_map = dict(results)
- assert result_map["jira"].status == ServiceStatus.OPERATIONAL
- assert result_map["confluence"].status == ServiceStatus.DEGRADED
+ assert result_map["issue-tracker"].status == ServiceStatus.OPERATIONAL
+ assert result_map["wiki"].status == ServiceStatus.DEGRADED
-class TestSlackPoller:
+class TestCurrentStatusPoller:
@respx.mock
async def test_happy_path(self):
- respx.get("https://slack-status.com/api/v2.0.0/current").mock(
- return_value=httpx.Response(200, json={
- "status": "ok",
- "active_incidents": [],
- })
+ respx.get("https://chat-status.example.com/api/v2.0.0/current").mock(
+ return_value=httpx.Response(
+ 200,
+ json={
+ "status": "ok",
+ "active_incidents": [],
+ },
+ )
)
async with httpx.AsyncClient() as client:
- result = await poll_slack(client, "https://slack-status.com/api/v2.0.0/current")
+ result = await poll_current_status(
+ client, "https://chat-status.example.com/api/v2.0.0/current"
+ )
assert result.status == ServiceStatus.OPERATIONAL
@respx.mock
async def test_list_response_no_active_is_operational(self):
- """slack-status.com redirects /current to /history which returns a
- list of incident objects. An empty list or a list with only
- resolved/completed incidents means OPERATIONAL."""
- respx.get("https://slack-status.com/api/v2.0.0/current").mock(
- return_value=httpx.Response(200, json=[
- {"id": 1, "status": "resolved", "type": "incident", "title": "old"},
- ])
+ """Endpoint redirects /current to /history which returns a list of incident
+ objects. An empty list or a list with only resolved/completed incidents
+ means OPERATIONAL."""
+ respx.get("https://chat-status.example.com/api/v2.0.0/current").mock(
+ return_value=httpx.Response(
+ 200,
+ json=[
+ {"id": 1, "status": "resolved", "type": "incident", "title": "old"},
+ ],
+ )
)
async with httpx.AsyncClient() as client:
- result = await poll_slack(client, "https://slack-status.com/api/v2.0.0/current")
+ result = await poll_current_status(
+ client, "https://chat-status.example.com/api/v2.0.0/current"
+ )
assert result.status == ServiceStatus.OPERATIONAL
@respx.mock
async def test_list_response_active_incident_maps_type(self):
"""Active incidents in the list response are mapped by `type`."""
- respx.get("https://slack-status.com/api/v2.0.0/current").mock(
- return_value=httpx.Response(200, json=[
- {"id": 2, "status": "active", "type": "outage", "title": "Big one"},
- ])
+ respx.get("https://chat-status.example.com/api/v2.0.0/current").mock(
+ return_value=httpx.Response(
+ 200,
+ json=[
+ {"id": 2, "status": "active", "type": "outage", "title": "Big one"},
+ ],
+ )
)
async with httpx.AsyncClient() as client:
- result = await poll_slack(client, "https://slack-status.com/api/v2.0.0/current")
+ result = await poll_current_status(
+ client, "https://chat-status.example.com/api/v2.0.0/current"
+ )
assert result.status == ServiceStatus.MAJOR_OUTAGE
assert result.status_detail == "Big one"
@respx.mock
async def test_unexpected_type_returns_unknown(self):
"""Neither dict nor list (e.g., a bare string) still UNKNOWNs out."""
- respx.get("https://slack-status.com/api/v2.0.0/current").mock(
+ respx.get("https://chat-status.example.com/api/v2.0.0/current").mock(
return_value=httpx.Response(200, json="not-a-dict-or-list")
)
async with httpx.AsyncClient() as client:
- result = await poll_slack(client, "https://slack-status.com/api/v2.0.0/current")
+ result = await poll_current_status(
+ client, "https://chat-status.example.com/api/v2.0.0/current"
+ )
assert result.status == ServiceStatus.UNKNOWN
-class TestSalesforcePoller:
+class TestTrustIncidentsPoller:
@respx.mock
async def test_no_active_incidents(self):
- respx.get("https://api.status.salesforce.com/v1/incidents").mock(
+ respx.get("https://trust.example.com/v1/incidents").mock(
return_value=httpx.Response(200, json=[])
)
async with httpx.AsyncClient() as client:
- result = await poll_salesforce(
- client, "https://api.status.salesforce.com/v1/incidents",
+ result = await poll_trust_incidents(
+ client,
+ "https://trust.example.com/v1/incidents",
)
assert result.status == ServiceStatus.OPERATIONAL
@respx.mock
async def test_network_error(self):
- respx.get("https://api.status.salesforce.com/v1/incidents").mock(
+ respx.get("https://trust.example.com/v1/incidents").mock(
side_effect=httpx.ConnectError("DNS fail")
)
async with httpx.AsyncClient() as client:
- result = await poll_salesforce(
- client, "https://api.status.salesforce.com/v1/incidents",
+ result = await poll_trust_incidents(
+ client,
+ "https://trust.example.com/v1/incidents",
)
assert result.status == ServiceStatus.UNKNOWN
assert result.poll_failure_reason is not None
assert "request_error" in result.poll_failure_reason
-class TestZendeskPoller:
+class TestActiveIncidentsPoller:
@respx.mock
async def test_happy_path(self):
- respx.get("https://status.zendesk.com/api/incidents/active").mock(
+ respx.get("https://support.example.com/api/incidents/active").mock(
return_value=httpx.Response(200, json={"data": []})
)
async with httpx.AsyncClient() as client:
- result = await poll_zendesk(
- client, "https://status.zendesk.com/api/incidents/active",
+ result = await poll_active_incidents(
+ client,
+ "https://support.example.com/api/incidents/active",
)
assert result.status == ServiceStatus.OPERATIONAL
-class TestRingCentralPoller:
+class TestServiceArrayPoller:
@respx.mock
async def test_all_good(self):
- respx.get("https://status.ringcentral.com/status.json").mock(
- return_value=httpx.Response(200, json=[
- {"service": "Calling", "region": "US", "level": "Good", "alerts": []},
- ])
+ respx.get("https://status.example.net/status.json").mock(
+ return_value=httpx.Response(
+ 200,
+ json=[
+ {"service": "Calling", "region": "US", "level": "Good", "alerts": []},
+ ],
+ )
)
async with httpx.AsyncClient() as client:
- result = await poll_ringcentral(
- client, "https://status.ringcentral.com/status.json",
+ result = await poll_service_array(
+ client,
+ "https://status.example.net/status.json",
)
assert result.status == ServiceStatus.OPERATIONAL
-class TestGooglePoller:
+class TestProductFeedPoller:
@respx.mock
async def test_operational(self):
- respx.get("https://www.google.com/appsstatus/incidents.json").mock(
+ respx.get("https://feed.example.com/incidents.json").mock(
return_value=httpx.Response(200, json=[])
)
- services = [{"id": "google-mail"}, {"id": "google-calendar"}]
+ services = [{"id": "feed-product-a"}, {"id": "feed-product-b"}]
async with httpx.AsyncClient() as client:
- results = await poll_google(
- client, "https://www.google.com/appsstatus/incidents.json", services,
+ results = await poll_product_feed(
+ client,
+ "https://feed.example.com/incidents.json",
+ services,
)
assert len(results) == 2
for _, r in results:
@@ -219,13 +255,13 @@ async def test_operational(self):
@respx.mock
async def test_error_propagates_to_each_service(self):
- respx.get("https://www.google.com/appsstatus/incidents.json").mock(
- return_value=httpx.Response(500)
- )
- services = [{"id": "google-mail"}, {"id": "google-calendar"}]
+ respx.get("https://feed.example.com/incidents.json").mock(return_value=httpx.Response(500))
+ services = [{"id": "feed-product-a"}, {"id": "feed-product-b"}]
async with httpx.AsyncClient() as client:
- results = await poll_google(
- client, "https://www.google.com/appsstatus/incidents.json", services,
+ results = await poll_product_feed(
+ client,
+ "https://feed.example.com/incidents.json",
+ services,
)
assert len(results) == 2
for _, r in results:
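Review note: the two product-feed tests above pin down a fan-out contract: one fetch, one result per configured service, and a failed fetch marks every service UNKNOWN. The sketch below illustrates that contract only. `fan_out_feed`, the trimmed `ServiceStatus` enum, and the `products` field on each incident are assumptions for illustration; the real `poll_product_feed` and the vendor payload schema are not shown in this diff.

```python
from enum import Enum
from typing import Any


class ServiceStatus(str, Enum):
    # Trimmed stand-in for the project's enum (assumption).
    OPERATIONAL = "operational"
    DEGRADED = "degraded"
    UNKNOWN = "unknown"


def fan_out_feed(payload: Any, services: list[dict]) -> list[tuple[str, ServiceStatus]]:
    """Map one shared feed payload onto every configured service.

    `payload` is the decoded JSON body, or None when the fetch failed.
    """
    if not isinstance(payload, list):
        # Fetch error or unexpected shape: every service shares the uncertainty.
        return [(svc["id"], ServiceStatus.UNKNOWN) for svc in services]

    # Hypothetical schema: each incident lists the product ids it affects.
    affected: set[str] = set()
    for incident in payload:
        if isinstance(incident, dict):
            affected.update(incident.get("products", []))

    return [
        (
            svc["id"],
            ServiceStatus.DEGRADED if svc["id"] in affected else ServiceStatus.OPERATIONAL,
        )
        for svc in services
    ]
```

Returning one tuple per service, even on failure, keeps the caller's loop identical for the healthy and the failed case, which is what the `len(results) == 2` assertions rely on.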
diff --git a/backend/tests/test_postmortems.py b/backend/tests/test_postmortems.py
index 0866a57..6387985 100644
--- a/backend/tests/test_postmortems.py
+++ b/backend/tests/test_postmortems.py
@@ -20,18 +20,18 @@
_BASE_EVENTS = [
{
"id": 1,
- "service_id": "okta",
+ "service_id": "identity-provider",
"previous_status": "operational",
"new_status": "degraded",
"vendor_title": "Elevated error rates",
"vendor_detail": "Users experiencing login failures",
- "impact_statement": "Okta is degraded",
+ "impact_statement": "Identity Provider is degraded",
"source": "statuspage_json",
"created_at": "2026-04-24T10:00:00Z",
},
{
"id": 2,
- "service_id": "okta",
+ "service_id": "identity-provider",
"previous_status": "degraded",
"new_status": "major_outage",
"vendor_title": "Complete SSO failure",
@@ -42,12 +42,12 @@
},
{
"id": 3,
- "service_id": "okta",
+ "service_id": "identity-provider",
"previous_status": "major_outage",
"new_status": "operational",
"vendor_title": None,
"vendor_detail": None,
- "impact_statement": "Okta has recovered",
+ "impact_statement": "Identity Provider has recovered",
"source": "statuspage_json",
"created_at": "2026-04-24T11:00:00Z",
},
@@ -57,17 +57,17 @@
def _sample_report(**overrides) -> dict:
"""Return a realistic report dict with all required keys."""
base = {
- "service_id": "okta",
- "service_name": "Okta",
+ "service_id": "identity-provider",
+ "service_name": "Identity Provider",
"started_at": "2026-04-24T10:00:00Z",
"resolved_at": "2026-04-24T11:00:00Z",
"duration_seconds": 3600,
"duration_human": "1h",
"peak_severity": "major_outage",
- "affected_downstream": ["Box Web", "Box Mobile"],
+ "affected_downstream": ["Content Platform Web", "Content Platform Mobile"],
"event_count": 3,
"events": list(_BASE_EVENTS),
- "impact_summary": "Okta experienced major outage for 1h.",
+ "impact_summary": "Identity Provider experienced major outage for 1h.",
}
base.update(overrides)
return base
@@ -92,24 +92,22 @@ def test_render_includes_all_eight_sections_in_order(self):
"## Action Items",
]
positions = [md.index(s) for s in expected_sections]
- assert positions == sorted(positions), (
- "Sections are not in the expected order"
- )
+ assert positions == sorted(positions), "Sections are not in the expected order"
def test_render_auto_fills_summary_with_impact_summary(self):
- report = _sample_report(impact_summary="Okta was down for exactly 1h.")
+ report = _sample_report(impact_summary="Identity Provider was down for exactly 1h.")
md = render_markdown(report)
summary_start = md.index("## Summary")
impact_start = md.index("## Impact")
summary_body = md[summary_start:impact_start]
- assert "Okta was down for exactly 1h." in summary_body
+ assert "Identity Provider was down for exactly 1h." in summary_body
def test_render_impact_lists_peak_severity_duration_count_affected(self):
report = _sample_report(
peak_severity="major_outage",
duration_human="1h",
event_count=3,
- affected_downstream=["Box Web", "Box Mobile"],
+ affected_downstream=["Content Platform Web", "Content Platform Mobile"],
)
md = render_markdown(report)
impact_start = md.index("## Impact")
@@ -119,8 +117,8 @@ def test_render_impact_lists_peak_severity_duration_count_affected(self):
assert "major_outage" in impact_body
assert "1h" in impact_body
assert "3" in impact_body
- assert "Box Web" in impact_body
- assert "Box Mobile" in impact_body
+ assert "Content Platform Web" in impact_body
+ assert "Content Platform Mobile" in impact_body
def test_render_impact_handles_empty_affected_downstream(self):
report = _sample_report(affected_downstream=[])
@@ -140,6 +138,7 @@ def test_render_timeline_renders_events_chronologically(self):
assert len(bullets) == 3
import re
+
time_pattern = re.compile(r"\d{2}:\d{2}:\d{2} UTC")
arrow_pattern = re.compile(r"\w+ → \w+")
for bullet in bullets:
@@ -183,13 +182,17 @@ def test_render_timeline_prefers_vendor_title_over_detail_and_impact(self):
# vendor_detail fallback (no vendor_title)
event_detail = dict(event_all, vendor_title=None)
md_detail = render_markdown(_sample_report(events=[event_detail]))
- body_detail = md_detail[md_detail.index("## Timeline"):md_detail.index("## What Went Well")]
+ body_detail = md_detail[
+ md_detail.index("## Timeline") : md_detail.index("## What Went Well")
+ ]
assert "the detail" in body_detail
# impact_statement fallback (no vendor_title, no vendor_detail)
event_impact = dict(event_all, vendor_title=None, vendor_detail=None)
md_impact = render_markdown(_sample_report(events=[event_impact]))
- body_impact = md_impact[md_impact.index("## Timeline"):md_impact.index("## What Went Well")]
+ body_impact = md_impact[
+ md_impact.index("## Timeline") : md_impact.index("## What Went Well")
+ ]
assert "the impact" in body_impact
def test_render_preserves_all_todo_placeholders(self):
@@ -213,31 +216,41 @@ def test_render_frontmatter_is_valid_yaml(self):
fm = yaml.safe_load(frontmatter_text)
assert isinstance(fm, dict)
required_keys = {
- "service", "service_name", "started_at", "resolved_at",
- "duration", "peak_severity", "affected_downstream",
- "event_count", "status",
+ "service",
+ "service_name",
+ "started_at",
+ "resolved_at",
+ "duration",
+ "peak_severity",
+ "affected_downstream",
+ "event_count",
+ "status",
}
assert required_keys <= fm.keys()
assert fm["status"] == "draft"
def test_render_frontmatter_escapes_yaml_special_chars(self):
# Service name with a colon — yaml.safe_dump must quote it properly
- report = _sample_report(service_name="Okta: identity", service_id="okta-identity")
+ report = _sample_report(
+ service_name="Identity Provider: SSO", service_id="identity-provider-sso"
+ )
md = render_markdown(report)
lines = md.splitlines()
assert lines[0] == "---"
end_idx = lines.index("---", 1)
frontmatter_text = "\n".join(lines[1:end_idx])
fm = yaml.safe_load(frontmatter_text)
- assert fm["service_name"] == "Okta: identity"
+ assert fm["service_name"] == "Identity Provider: SSO"
# Service name with a leading dash
- report2 = _sample_report(service_name="- Okta primary", service_id="okta2")
+ report2 = _sample_report(
+ service_name="- Identity Provider primary", service_id="identity-provider2"
+ )
md2 = render_markdown(report2)
lines2 = md2.splitlines()
end_idx2 = lines2.index("---", 1)
fm2 = yaml.safe_load("\n".join(lines2[1:end_idx2]))
- assert fm2["service_name"] == "- Okta primary"
+ assert fm2["service_name"] == "- Identity Provider primary"
# ---------------------------------------------------------------------------
@@ -257,11 +270,12 @@ async def test_write_creates_file_with_expected_filename(self, tmp_path):
started_at = report["started_at"]
resolved_at = report["resolved_at"]
sha = hashlib.sha1(
- f"{started_at}|{resolved_at}".encode(), usedforsecurity=False,
+ f"{started_at}|{resolved_at}".encode(),
+ usedforsecurity=False,
).hexdigest()[:6]
dt = datetime.fromisoformat(started_at.replace("Z", "+00:00")).astimezone(UTC)
compact = dt.strftime("%Y%m%dT%H%M%SZ")
- expected_name = f"okta-{compact}-{sha}.md"
+ expected_name = f"identity-provider-{compact}-{sha}.md"
assert result.name == expected_name
@pytest.mark.asyncio
@@ -350,7 +364,7 @@ def _raise(*args, **kwargs):
# ---------------------------------------------------------------------------
-async def _seed_recovery_scenario(db, write_lock, service_id: str = "okta") -> None:
+async def _seed_recovery_scenario(db, write_lock, service_id: str = "identity-provider") -> None:
"""Seed the DB so generate_incident_report finds a complete incident window.
Inserts a service (current_status=operational) and two status_events
@@ -359,7 +373,7 @@ async def _seed_recovery_scenario(db, write_lock, service_id: str = "okta") -> N
await db.execute(
"""INSERT OR REPLACE INTO services
(id, display_name, category, poll_type, current_status)
- VALUES (?, 'Okta', 'identity', 'statuspage_json', 'operational')""",
+ VALUES (?, 'Identity Provider', 'identity', 'statuspage_json', 'operational')""",
(service_id,),
)
# Event 1: started the incident (operational → degraded, >60s ago)
@@ -381,10 +395,9 @@ async def _seed_recovery_scenario(db, write_lock, service_id: str = "okta") -> N
class TestAlertingEngineIntegration:
@pytest.mark.asyncio
- async def test_engine_calls_write_postmortem_when_enabled(
- self, db, tmp_path, monkeypatch
- ):
+ async def test_engine_calls_write_postmortem_when_enabled(self, db, tmp_path, monkeypatch):
from app.config import settings as real_settings
+
monkeypatch.setattr(real_settings, "postmortems_enabled", True)
monkeypatch.setattr(real_settings, "postmortems_dir", str(tmp_path))
# Suppress slack alerting — no webhook configured
@@ -394,11 +407,11 @@ async def test_engine_calls_write_postmortem_when_enabled(
monkeypatch.setattr(real_settings, "alert_dedup_window_seconds", 1)
write_lock = asyncio.Lock()
- await _seed_recovery_scenario(db, write_lock, service_id="okta")
+ await _seed_recovery_scenario(db, write_lock, service_id="identity-provider")
recovery_change = StatusChange(
- service_id="okta",
- service_display_name="Okta",
+ service_id="identity-provider",
+ service_display_name="Identity Provider",
previous_status="degraded",
new_status="operational",
status_detail=None,
@@ -408,7 +421,7 @@ async def test_engine_calls_write_postmortem_when_enabled(
await process_changes(db, write_lock, [recovery_change])
- md_files = list(tmp_path.glob("okta-*.md"))
+ md_files = list(tmp_path.glob("identity-provider-*.md"))
assert len(md_files) == 1, f"Expected 1 postmortem file, found: {md_files}"
@pytest.mark.asyncio
@@ -416,6 +429,7 @@ async def test_engine_does_not_call_write_postmortem_when_disabled(
self, db, tmp_path, monkeypatch
):
from app.config import settings as real_settings
+
monkeypatch.setattr(real_settings, "postmortems_enabled", False)
monkeypatch.setattr(real_settings, "postmortems_dir", str(tmp_path))
monkeypatch.setattr(real_settings, "slack_webhook_url", None)
@@ -423,11 +437,11 @@ async def test_engine_does_not_call_write_postmortem_when_disabled(
monkeypatch.setattr(real_settings, "alert_dedup_window_seconds", 1)
write_lock = asyncio.Lock()
- await _seed_recovery_scenario(db, write_lock, service_id="okta")
+ await _seed_recovery_scenario(db, write_lock, service_id="identity-provider")
recovery_change = StatusChange(
- service_id="okta",
- service_display_name="Okta",
+ service_id="identity-provider",
+ service_display_name="Identity Provider",
previous_status="degraded",
new_status="operational",
status_detail=None,
@@ -462,11 +476,11 @@ async def _failing_write(report, *, out_dir):
monkeypatch.setattr(pm_module, "write_postmortem", _failing_write)
write_lock = asyncio.Lock()
- await _seed_recovery_scenario(db, write_lock, service_id="okta")
+ await _seed_recovery_scenario(db, write_lock, service_id="identity-provider")
recovery_change = StatusChange(
- service_id="okta",
- service_display_name="Okta",
+ service_id="identity-provider",
+ service_display_name="Identity Provider",
previous_status="degraded",
new_status="operational",
status_detail=None,
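Review note: `test_write_creates_file_with_expected_filename` recomputes the postmortem filename from the report, so the naming scheme can be read straight off the test. The helper below restates that scheme as a standalone function. The name `postmortem_filename` is hypothetical; how `write_postmortem` builds the name internally is not visible in this diff.

```python
import hashlib
from datetime import datetime, timezone


def postmortem_filename(service_id: str, started_at: str, resolved_at: str) -> str:
    """Build `<service>-<compact UTC start>-<6 hex chars>.md`, as the test expects."""
    # Short, non-cryptographic fingerprint of the incident window.
    digest = hashlib.sha1(
        f"{started_at}|{resolved_at}".encode(),
        usedforsecurity=False,
    ).hexdigest()[:6]
    # Accept a trailing "Z" and normalize to UTC before formatting.
    started = datetime.fromisoformat(started_at.replace("Z", "+00:00")).astimezone(
        timezone.utc
    )
    compact = started.strftime("%Y%m%dT%H%M%SZ")
    return f"{service_id}-{compact}-{digest}.md"


if __name__ == "__main__":
    print(
        postmortem_filename(
            "identity-provider", "2026-04-24T10:00:00Z", "2026-04-24T11:00:00Z"
        )
    )
```

Because the digest covers both timestamps, two incidents for the same service that start in the same second but resolve at different times should still get distinct filenames.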
diff --git a/backend/tests/test_resilience.py b/backend/tests/test_resilience.py
index 1fdb159..36a26dd 100644
--- a/backend/tests/test_resilience.py
+++ b/backend/tests/test_resilience.py
@@ -26,7 +26,7 @@ def _fast_and_isolated_breakers():
class TestHostOf:
def test_extracts_host(self):
- assert host_of("https://status.box.com/api/v2/summary.json") == "status.box.com"
+ assert host_of("https://status.example.com/api/v2/summary.json") == "status.example.com"
def test_bare_url_fallback(self):
assert host_of("not-a-url") == "not-a-url"
@@ -53,7 +53,10 @@ async def test_retries_transient_5xx_then_succeeds(self):
)
async with httpx.AsyncClient() as client:
resp = await resilient_fetch(
- client, "https://example.com/flaky", attempts=3, timeout=5.0,
+ client,
+ "https://example.com/flaky",
+ attempts=3,
+ timeout=5.0,
)
assert resp.status_code == 200
assert route.call_count == 2
@@ -68,58 +71,70 @@ async def test_retries_429(self):
)
async with httpx.AsyncClient() as client:
resp = await resilient_fetch(
- client, "https://example.com/ratelimited", attempts=3, timeout=5.0,
+ client,
+ "https://example.com/ratelimited",
+ attempts=3,
+ timeout=5.0,
)
assert resp.status_code == 200
assert route.call_count == 2
@respx.mock
async def test_does_not_retry_404(self):
- route = respx.get("https://example.com/missing").mock(
- return_value=httpx.Response(404)
- )
+ route = respx.get("https://example.com/missing").mock(return_value=httpx.Response(404))
async with httpx.AsyncClient() as client:
with pytest.raises(httpx.HTTPStatusError):
await resilient_fetch(
- client, "https://example.com/missing", attempts=3, timeout=5.0,
+ client,
+ "https://example.com/missing",
+ attempts=3,
+ timeout=5.0,
)
# 404 is a hard failure — no retries
assert route.call_count == 1
@respx.mock
async def test_all_retries_exhausted_raises_transient(self):
- respx.get("https://example.com/dead").mock(
- return_value=httpx.Response(503)
- )
+ respx.get("https://example.com/dead").mock(return_value=httpx.Response(503))
async with httpx.AsyncClient() as client:
with pytest.raises(TransientHTTPError):
await resilient_fetch(
- client, "https://example.com/dead", attempts=2, timeout=5.0,
+ client,
+ "https://example.com/dead",
+ attempts=2,
+ timeout=5.0,
)
@respx.mock
async def test_breaker_opens_after_threshold(self):
"""Two consecutive hard failures trip the breaker for this host.
Subsequent calls raise CircuitBreakerOpen without hitting the network."""
- route = respx.get("https://trip.example.com/x").mock(
- return_value=httpx.Response(500)
- )
+ route = respx.get("https://trip.example.com/x").mock(return_value=httpx.Response(500))
async with httpx.AsyncClient() as client:
# First attempt: exhausts retries, counts as 1 failure
with pytest.raises(TransientHTTPError):
await resilient_fetch(
- client, "https://trip.example.com/x", attempts=1, timeout=5.0,
+ client,
+ "https://trip.example.com/x",
+ attempts=1,
+ timeout=5.0,
)
# Second attempt: also fails, tripping the threshold-2 breaker
with pytest.raises(TransientHTTPError):
await resilient_fetch(
- client, "https://trip.example.com/x", attempts=1, timeout=5.0,
+ client,
+ "https://trip.example.com/x",
+ attempts=1,
+ timeout=5.0,
)
# Third attempt: breaker is open, fast-fails without hitting network
calls_before = route.call_count
with pytest.raises(CircuitBreakerOpen):
await resilient_fetch(
- client, "https://trip.example.com/x", attempts=1, timeout=5.0,
+ client,
+ "https://trip.example.com/x",
+ attempts=1,
+ timeout=5.0,
)
assert route.call_count == calls_before # no new network call
@@ -136,11 +151,17 @@ async def test_breaker_isolates_hosts(self):
for _ in range(2):
with pytest.raises(TransientHTTPError):
await resilient_fetch(
- client, "https://bad.example.com/x", attempts=1, timeout=5.0,
+ client,
+ "https://bad.example.com/x",
+ attempts=1,
+ timeout=5.0,
)
# Good host should still succeed cleanly
resp = await resilient_fetch(
- client, "https://good.example.com/x", attempts=1, timeout=5.0,
+ client,
+ "https://good.example.com/x",
+ attempts=1,
+ timeout=5.0,
)
assert resp.status_code == 200
@@ -159,17 +180,26 @@ async def test_breaker_recovers_after_ttl(self):
for _ in range(2):
with pytest.raises(TransientHTTPError):
await resilient_fetch(
- client, "https://heal.example.com/x", attempts=1, timeout=5.0,
+ client,
+ "https://heal.example.com/x",
+ attempts=1,
+ timeout=5.0,
)
with pytest.raises(CircuitBreakerOpen):
await resilient_fetch(
- client, "https://heal.example.com/x", attempts=1, timeout=5.0,
+ client,
+ "https://heal.example.com/x",
+ attempts=1,
+ timeout=5.0,
)
# Wait for TTL to elapse, then probe succeeds
await asyncio.sleep(0.6)
resp = await resilient_fetch(
- client, "https://heal.example.com/x", attempts=1, timeout=5.0,
+ client,
+ "https://heal.example.com/x",
+ attempts=1,
+ timeout=5.0,
)
assert resp.status_code == 200
assert route.call_count == 3
@@ -189,9 +219,7 @@ def test_timeout(self):
assert reason == "timeout"
def test_transient_http(self):
- detail, reason = describe_fetch_error(
- TransientHTTPError(503, "https://example.com")
- )
+ detail, reason = describe_fetch_error(TransientHTTPError(503, "https://example.com"))
assert detail == "HTTP 503"
assert reason == "transient_http_503"
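Review note: the breaker tests in this file describe three behaviors: the breaker opens after a threshold of consecutive failures, it is keyed per host, and it lets a probe through once a TTL has elapsed. The sketch below shows one way to get those behaviors with an injectable clock. `HostBreakers` and `host_key` are hypothetical names, and the half-open handling is an assumption; the real `resilient_fetch` internals are not part of this diff.

```python
import time
from urllib.parse import urlsplit


class CircuitBreakerOpen(Exception):
    """Raised when a host's breaker is open and the call should be skipped."""


def host_key(url: str) -> str:
    # Fall back to the raw string when there is no parseable host.
    return urlsplit(url).hostname or url


class HostBreakers:
    def __init__(self, threshold: int = 2, ttl: float = 0.5, clock=time.monotonic):
        self._threshold = threshold
        self._ttl = ttl
        self._clock = clock
        self._failures: dict[str, int] = {}
        self._opened_at: dict[str, float] = {}

    def check(self, url: str) -> str:
        """Return the host key, or raise if that host's breaker is still open."""
        host = host_key(url)
        opened = self._opened_at.get(host)
        if opened is not None:
            if self._clock() - opened < self._ttl:
                raise CircuitBreakerOpen(host)
            # TTL elapsed: allow one probe. The failure count is kept, so a
            # failed probe re-opens the breaker immediately.
            del self._opened_at[host]
        return host

    def record_failure(self, host: str) -> None:
        self._failures[host] = self._failures.get(host, 0) + 1
        if self._failures[host] >= self._threshold:
            self._opened_at[host] = self._clock()

    def record_success(self, host: str) -> None:
        self._failures.pop(host, None)
        self._opened_at.pop(host, None)
```

Injecting the clock lets a test advance time without `asyncio.sleep`, which may be worth considering if `test_breaker_recovers_after_ttl` ever becomes flaky.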
diff --git a/backend/tests/test_routing.py b/backend/tests/test_routing.py
index d7ac966..e61d5d2 100644
--- a/backend/tests/test_routing.py
+++ b/backend/tests/test_routing.py
@@ -15,7 +15,11 @@
async def _insert_service(
- db, sid, status="operational", tier="important", override=None,
+ db,
+ sid,
+ status="operational",
+ tier="important",
+ override=None,
):
await db.execute(
"""INSERT OR REPLACE INTO services
@@ -43,16 +47,16 @@ def _make_change(service_id, new_status="degraded", prev_status="operational", e
class TestDedupKey:
def test_vendor_id_preferred(self):
- key = build_dedup_key("box", "degraded", vendor_incident_id="inc-123")
- assert key == "vendor:box:inc-123"
+ key = build_dedup_key("content-platform", "degraded", vendor_incident_id="inc-123")
+ assert key == "vendor:content-platform:inc-123"
def test_fallback_when_no_vendor_id(self):
- key = build_dedup_key("box", "degraded", vendor_incident_id=None)
- assert key.startswith("fallback:box:degraded:")
+ key = build_dedup_key("content-platform", "degraded", vendor_incident_id=None)
+ assert key.startswith("fallback:content-platform:degraded:")
def test_different_statuses_different_keys(self):
- a = build_dedup_key("box", "degraded", None)
- b = build_dedup_key("box", "major_outage", None)
+ a = build_dedup_key("content-platform", "degraded", None)
+ b = build_dedup_key("content-platform", "major_outage", None)
assert a != b
@@ -91,69 +95,70 @@ async def test_future_window_returns_false(self, db):
class TestWasRecentlyAlerted:
async def test_fresh_dedup_key_returns_false(self, db):
- assert not await was_recently_alerted(db, "vendor:box:new", 3600)
+ assert not await was_recently_alerted(db, "vendor:content-platform:new", 3600)
async def test_recent_same_key_returns_true(self, db):
- await _insert_service(db, "box")
+ await _insert_service(db, "content-platform")
await db.execute(
"""INSERT INTO alert_sent_log
(dedup_key, service_id, severity, new_status, alert_kind,
first_sent_at, last_updated_at)
- VALUES ('vendor:box:inc-1', 'box', 'important', 'degraded',
+ VALUES ('vendor:content-platform:inc-1', 'content-platform', 'important', 'degraded',
'status_change', datetime('now', '-2 minutes'),
datetime('now', '-2 minutes'))"""
)
await db.commit()
- assert await was_recently_alerted(db, "vendor:box:inc-1", 3600)
+ assert await was_recently_alerted(db, "vendor:content-platform:inc-1", 3600)
async def test_outside_window_returns_false(self, db):
- await _insert_service(db, "box")
+ await _insert_service(db, "content-platform")
await db.execute(
"""INSERT INTO alert_sent_log
(dedup_key, service_id, severity, new_status, alert_kind,
first_sent_at, last_updated_at)
- VALUES ('vendor:box:old', 'box', 'important', 'degraded',
+ VALUES ('vendor:content-platform:old', 'content-platform', 'important', 'degraded',
'status_change', datetime('now', '-25 hours'),
datetime('now', '-25 hours'))"""
)
await db.commit()
- assert not await was_recently_alerted(db, "vendor:box:old", 86400)
+ assert not await was_recently_alerted(db, "vendor:content-platform:old", 86400)
async def test_suppressed_rows_ignored(self, db):
"""A suppressed alert doesn't count toward dedup — it didn't actually fire."""
- await _insert_service(db, "box")
+ await _insert_service(db, "content-platform")
await db.execute(
"""INSERT INTO alert_sent_log
(dedup_key, service_id, severity, new_status, alert_kind,
suppressed_by, first_sent_at, last_updated_at)
- VALUES ('vendor:box:supp', 'box', 'important', 'degraded',
+ VALUES ('vendor:content-platform:supp', 'content-platform', 'important', 'degraded',
'status_change', 'maintenance_window',
datetime('now', '-2 minutes'),
datetime('now', '-2 minutes'))"""
)
await db.commit()
- assert not await was_recently_alerted(db, "vendor:box:supp", 3600)
+ assert not await was_recently_alerted(db, "vendor:content-platform:supp", 3600)
class TestRouteStatusChange:
@pytest.fixture(autouse=True)
def _webhook_set(self, monkeypatch):
monkeypatch.setattr(
- settings, "slack_webhook_url",
+ settings,
+ "slack_webhook_url",
"https://hooks.slack.com/services/x/y/z",
)
async def test_critical_tier_adds_here_mention(self, db):
- await _insert_service(db, "okta", tier="critical")
- decision = await route_status_change(db, _make_change("okta"))
+ await _insert_service(db, "identity-provider", tier="critical")
+ decision = await route_status_change(db, _make_change("identity-provider"))
assert decision.should_send
assert decision.channel_mention == ""
assert decision.tier == "critical"
assert decision.suppressed_by is None
async def test_important_tier_no_mention(self, db):
- await _insert_service(db, "box", tier="important")
- decision = await route_status_change(db, _make_change("box"))
+ await _insert_service(db, "content-platform", tier="important")
+ decision = await route_status_change(db, _make_change("content-platform"))
assert decision.should_send
assert decision.channel_mention is None
assert decision.tier == "important"
@@ -219,11 +224,13 @@ async def test_recovery_bypasses_dedup(self, db):
async def test_aggregated_under_suppresses(self, db):
await _insert_service(db, "dep", tier="critical")
decision = await route_status_change(
- db, _make_change("dep"), aggregated_under="Okta",
+ db,
+ _make_change("dep"),
+ aggregated_under="Identity Provider",
)
assert not decision.should_send
assert decision.suppressed_by == "aggregated_under_upstream"
- assert decision.aggregated_under == "Okta"
+ assert decision.aggregated_under == "Identity Provider"
async def test_no_webhook_is_recorded_as_suppressed(self, db, monkeypatch):
monkeypatch.setattr(settings, "slack_webhook_url", None)
@@ -248,29 +255,33 @@ async def test_webhook_override_null_falls_back_to_global(self, db):
async def test_vendor_incident_id_drives_dedup_key(self, db):
await _insert_service(db, "dedup-svc", tier="important")
decision = await route_status_change(
- db, _make_change("dedup-svc"), vendor_incident_id="abc123",
+ db,
+ _make_change("dedup-svc"),
+ vendor_incident_id="abc123",
)
assert decision.dedup_key == "vendor:dedup-svc:abc123"
class TestRecordAlert:
async def test_records_fired_alert(self, db):
- await _insert_service(db, "box")
- change = _make_change("box")
+ await _insert_service(db, "content-platform")
+ change = _make_change("content-platform")
from app.alerting.routing import RoutingDecision
+
decision = RoutingDecision(
should_send=True,
webhook_url="https://example.com",
channel_mention=None,
- dedup_key="vendor:box:x",
+ dedup_key="vendor:content-platform:x",
tier="important",
suppressed_by=None,
)
await record_alert(db, change, decision)
await db.commit()
cursor = await db.execute(
- "SELECT suppressed_by, tier_col FROM alert_sent_log WHERE dedup_key='vendor:box:x'"
- .replace("tier_col", "severity") # table uses `severity` col name
+ # The table stores the routing tier in its `severity` column.
+ "SELECT suppressed_by, severity FROM alert_sent_log "
+ "WHERE dedup_key='vendor:content-platform:x'"
)
row = await cursor.fetchone()
assert row is not None
@@ -278,21 +289,22 @@ async def test_records_fired_alert(self, db):
assert dict(row)["suppressed_by"] is None
async def test_records_suppressed_alert(self, db):
- await _insert_service(db, "box")
- change = _make_change("box")
+ await _insert_service(db, "content-platform")
+ change = _make_change("content-platform")
from app.alerting.routing import RoutingDecision
+
decision = RoutingDecision(
should_send=False,
webhook_url=None,
channel_mention=None,
- dedup_key="vendor:box:y",
+ dedup_key="vendor:content-platform:y",
tier="informational",
suppressed_by="tier_informational",
)
await record_alert(db, change, decision)
await db.commit()
cursor = await db.execute(
- "SELECT suppressed_by FROM alert_sent_log WHERE dedup_key='vendor:box:y'"
+ "SELECT suppressed_by FROM alert_sent_log WHERE dedup_key='vendor:content-platform:y'"
)
row = dict(await cursor.fetchone())
assert row["suppressed_by"] == "tier_informational"
@@ -309,48 +321,48 @@ async def _insert_dep(self, db, upstream, downstream, severity="high"):
await db.commit()
async def test_aggregates_when_threshold_met(self, db):
- for sid in ["okta", "a", "b", "c", "d"]:
+ for sid in ["identity-provider", "a", "b", "c", "d"]:
await _insert_service(db, sid)
for dep in ["a", "b", "c", "d"]:
- await self._insert_dep(db, "okta", dep)
+ await self._insert_dep(db, "identity-provider", dep)
changes = [
- _make_change("okta", "major_outage"),
+ _make_change("identity-provider", "major_outage"),
_make_change("a", "degraded"),
_make_change("b", "degraded"),
_make_change("c", "degraded"),
_make_change("d", "operational"), # not affected → ignored
]
grouped = await find_aggregation_candidates(db, changes, threshold=3)
- assert "okta" in grouped
- assert len(grouped["okta"]) == 3
- assert {c.service_id for c in grouped["okta"]} == {"a", "b", "c"}
+ assert "identity-provider" in grouped
+ assert len(grouped["identity-provider"]) == 3
+ assert {c.service_id for c in grouped["identity-provider"]} == {"a", "b", "c"}
async def test_no_aggregation_when_below_threshold(self, db):
- for sid in ["okta", "a", "b"]:
+ for sid in ["identity-provider", "a", "b"]:
await _insert_service(db, sid)
- await self._insert_dep(db, "okta", "a")
- await self._insert_dep(db, "okta", "b")
+ await self._insert_dep(db, "identity-provider", "a")
+ await self._insert_dep(db, "identity-provider", "b")
changes = [
- _make_change("okta", "degraded"),
+ _make_change("identity-provider", "degraded"),
_make_change("a", "degraded"),
]
grouped = await find_aggregation_candidates(db, changes, threshold=3)
assert grouped == {}
async def test_upstream_recovering_does_not_aggregate(self, db):
- for sid in ["okta", "a", "b", "c"]:
+ for sid in ["identity-provider", "a", "b", "c"]:
await _insert_service(db, sid)
for dep in ["a", "b", "c"]:
- await self._insert_dep(db, "okta", dep)
+ await self._insert_dep(db, "identity-provider", dep)
changes = [
- _make_change("okta", new_status="operational", prev_status="major_outage"),
+ _make_change("identity-provider", new_status="operational", prev_status="major_outage"),
_make_change("a", "degraded"),
_make_change("b", "degraded"),
_make_change("c", "degraded"),
]
grouped = await find_aggregation_candidates(db, changes, threshold=3)
# Upstream going back to operational isn't an outage event to aggregate
- assert "okta" not in grouped
+ assert "identity-provider" not in grouped
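Review note: the three aggregation tests above imply a simple rule: group downstream changes under an upstream only when the upstream itself moved to a non-operational status in the same batch and at least `threshold` of its dependents did too. The pure-function sketch below restates that rule without a database. `Change`, `group_under_upstream`, and the `downstream_of` mapping are hypothetical stand-ins; the real `find_aggregation_candidates` reads dependency edges from SQLite and may differ in detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    # Minimal stand-in for the project's StatusChange (assumption).
    service_id: str
    new_status: str


def group_under_upstream(
    changes: list[Change],
    downstream_of: dict[str, list[str]],
    threshold: int = 3,
) -> dict[str, list[Change]]:
    """Return {upstream_id: [affected downstream changes]} for this batch."""
    # Only non-operational transitions count as "affected".
    affected = {c.service_id: c for c in changes if c.new_status != "operational"}

    grouped: dict[str, list[Change]] = {}
    for upstream, dependents in downstream_of.items():
        if upstream not in affected:
            # Upstream is healthy or recovering: nothing to aggregate under it.
            continue
        hits = [affected[d] for d in dependents if d in affected]
        if len(hits) >= threshold:
            grouped[upstream] = hits
    return grouped
```

Keeping the rule in a pure function like this would let the threshold logic be tested without seeding `services` and dependency rows, leaving the database-backed tests to cover the edge lookup.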
diff --git a/backend/tests/test_seeder.py b/backend/tests/test_seeder.py
index 3099c42..10e7dab 100644
--- a/backend/tests/test_seeder.py
+++ b/backend/tests/test_seeder.py
@@ -1,5 +1,11 @@
-"""Tests for the YAML config loader and database seeder."""
+"""Tests for the YAML config loader and database seeder.
+These tests load the committed example config explicitly (not the
+settings-resolved path) so they stay deterministic even when an operator
+has a gitignored services.local.yaml / dependencies.local.yaml present.
+"""
+
+from pathlib import Path
from urllib.parse import urlsplit
import pytest
@@ -11,27 +17,31 @@
load_services,
)
+_CONFIG_DIR = Path(__file__).resolve().parent.parent / "config"
+_SERVICES_YAML = _CONFIG_DIR / "services.yaml"
+_DEPENDENCIES_YAML = _CONFIG_DIR / "dependencies.yaml"
+
class TestServiceConfig:
def test_valid_manual_service(self):
svc = ServiceConfig(
- id="okta",
- display_name="Okta",
+ id="identity-provider",
+ display_name="Identity Provider (SSO)",
category="identity",
poll_type="manual",
)
- assert svc.id == "okta"
+ assert svc.id == "identity-provider"
assert svc.poll_url is None
def test_valid_polled_service(self):
svc = ServiceConfig(
- id="box",
- display_name="Box",
- category="productivity",
+ id="github",
+ display_name="GitHub",
+ category="engineering",
poll_type="statuspage_json",
- poll_url="https://status.box.com/api/v2/summary.json",
+ poll_url="https://www.githubstatus.com/api/v2/summary.json",
)
- assert svc.poll_url == "https://status.box.com/api/v2/summary.json"
+ assert svc.poll_url == "https://www.githubstatus.com/api/v2/summary.json"
def test_polled_service_without_url_fails(self):
with pytest.raises(ValueError, match="requires a poll_url"):
@@ -63,96 +73,109 @@ def test_invalid_poll_type_fails(self):
class TestLoadServices:
def test_loads_all_services(self):
- services = load_services()
- assert len(services) >= 25
+ services = load_services(path=_SERVICES_YAML)
+ assert len(services) == 10
def test_service_types(self):
- services = load_services()
+ services = load_services(path=_SERVICES_YAML)
poll_types = {s.poll_type for s in services}
assert "statuspage_json" in poll_types
assert "manual" in poll_types
- assert "google_json" in poll_types
- assert "slack_api" in poll_types
- def test_okta_is_manual(self):
- services = load_services()
- okta = next(s for s in services if s.id == "okta")
- assert okta.poll_type == "manual"
+ def test_identity_provider_is_manual(self):
+ services = load_services(path=_SERVICES_YAML)
+ idp = next(s for s in services if s.id == "identity-provider")
+ assert idp.poll_type == "manual"
- def test_box_has_poll_url(self):
- services = load_services()
- box = next(s for s in services if s.id == "box")
- assert box.poll_type == "statuspage_json"
- assert urlsplit(str(box.poll_url)).hostname == "status.box.com"
+ def test_github_has_poll_url(self):
+ services = load_services(path=_SERVICES_YAML)
+ gh = next(s for s in services if s.id == "github")
+ assert gh.poll_type == "statuspage_json"
+ assert urlsplit(str(gh.poll_url)).hostname == "www.githubstatus.com"
class TestLoadDependencies:
def test_loads_dependencies(self):
- deps = load_dependencies()
- assert "okta" in deps
- assert len(deps["okta"]) >= 10
+ deps = load_dependencies(path=_DEPENDENCIES_YAML)
+ assert "identity-provider" in deps
+ assert len(deps["identity-provider"]) == 8
- def test_okta_downstream_services(self):
- deps = load_dependencies()
- okta_targets = {t.service for t in deps["okta"]}
- assert "box" in okta_targets
- assert "slack" in okta_targets
+ def test_sso_downstream_services(self):
+ deps = load_dependencies(path=_DEPENDENCIES_YAML)
+ targets = {t.service for t in deps["identity-provider"]}
+ assert "github" in targets
+ assert "dropbox" in targets
- def test_okta_downstream_count(self):
- deps = load_dependencies()
- assert len(deps["okta"]) >= 10
+ def test_sso_downstream_count(self):
+ deps = load_dependencies(path=_DEPENDENCIES_YAML)
+ assert len(deps["identity-provider"]) == 8
def test_cross_validation_accepts_matching_services(self):
- services = load_services()
+ services = load_services(path=_SERVICES_YAML)
ids = {s.id for s in services}
- # Should not raise
- deps = load_dependencies(known_service_ids=ids)
- assert "okta" in deps
+ # Should not raise — every edge references a known service id.
+ deps = load_dependencies(path=_DEPENDENCIES_YAML, known_service_ids=ids)
+ assert "identity-provider" in deps
def test_cross_validation_rejects_unknown_upstream(self, tmp_path):
import yaml
+
bad = tmp_path / "bad_deps.yaml"
- bad.write_text(yaml.safe_dump({
- "dependencies": {
- "ghost_service": [
- {"service": "box", "impact": "x", "severity": "high"},
- ],
- },
- }))
+ bad.write_text(
+ yaml.safe_dump(
+ {
+ "dependencies": {
+ "ghost_service": [
+ {"service": "github", "impact": "x", "severity": "high"},
+ ],
+ },
+ }
+ )
+ )
with pytest.raises(ValueError, match="Unknown upstream service 'ghost_service'"):
- load_dependencies(path=bad, known_service_ids={"box"})
+ load_dependencies(path=bad, known_service_ids={"github"})
def test_cross_validation_rejects_unknown_downstream(self, tmp_path):
import yaml
+
bad = tmp_path / "bad_deps.yaml"
- bad.write_text(yaml.safe_dump({
- "dependencies": {
- "okta": [
- {"service": "phantom_app", "impact": "x", "severity": "high"},
- ],
- },
- }))
+ bad.write_text(
+ yaml.safe_dump(
+ {
+ "dependencies": {
+ "identity-provider": [
+ {"service": "phantom_app", "impact": "x", "severity": "high"},
+ ],
+ },
+ }
+ )
+ )
with pytest.raises(ValueError, match="Unknown downstream service 'phantom_app'"):
- load_dependencies(path=bad, known_service_ids={"okta"})
+ load_dependencies(path=bad, known_service_ids={"identity-provider"})
def test_cross_validation_allows_all_internal_sentinel(self, tmp_path):
import yaml
+
good = tmp_path / "deps.yaml"
- good.write_text(yaml.safe_dump({
- "dependencies": {
- "okta": [
- {"service": "all_internal", "impact": "x", "severity": "high"},
- ],
- },
- }))
+ good.write_text(
+ yaml.safe_dump(
+ {
+ "dependencies": {
+ "identity-provider": [
+ {"service": "all_internal", "impact": "x", "severity": "high"},
+ ],
+ },
+ }
+ )
+ )
# Should not raise even though "all_internal" isn't in the id set
- deps = load_dependencies(path=good, known_service_ids={"okta"})
- assert deps["okta"][0].service == "all_internal"
+ deps = load_dependencies(path=good, known_service_ids={"identity-provider"})
+ assert deps["identity-provider"][0].service == "all_internal"
class TestSeedDatabase:
async def test_seed_services(self, db):
- services = load_services()
+ services = load_services(path=_SERVICES_YAML)
count = await seed_services_with_db(db, services)
assert count == len(services)
@@ -161,7 +184,7 @@ async def test_seed_services(self, db):
assert row[0] == len(services)
async def test_seed_services_idempotent(self, db):
- services = load_services()
+ services = load_services(path=_SERVICES_YAML)
await seed_services_with_db(db, services)
await seed_services_with_db(db, services)
@@ -170,31 +193,32 @@ async def test_seed_services_idempotent(self, db):
assert row[0] == len(services) # Same count, not doubled
async def test_seed_dependencies(self, db):
- services = load_services()
+ services = load_services(path=_SERVICES_YAML)
await seed_services_with_db(db, services)
- deps = load_dependencies()
+ deps = load_dependencies(path=_DEPENDENCIES_YAML)
all_ids = [s.id for s in services]
count = await seed_deps_with_db(db, deps, all_ids)
- assert count >= 14
+ assert count == 10 # identity-provider (8 edges) + cloudflare (2 edges)
cursor = await db.execute("SELECT count(*) FROM service_dependencies")
row = await cursor.fetchone()
- assert row[0] >= 14
+ assert row[0] == 10
- async def test_okta_deps_seeded(self, db):
- services = load_services()
+ async def test_sso_deps_seeded(self, db):
+ services = load_services(path=_SERVICES_YAML)
await seed_services_with_db(db, services)
- deps = load_dependencies()
+ deps = load_dependencies(path=_DEPENDENCIES_YAML)
all_ids = [s.id for s in services]
await seed_deps_with_db(db, deps, all_ids)
cursor = await db.execute(
- "SELECT count(*) FROM service_dependencies WHERE upstream_service_id='okta'"
+ "SELECT count(*) FROM service_dependencies "
+ "WHERE upstream_service_id='identity-provider'"
)
row = await cursor.fetchone()
- assert row[0] == 12
+ assert row[0] == 8
# Helper functions that operate on a given db connection instead of the global one
@@ -206,17 +230,20 @@ async def seed_services_with_db(db, services: list[ServiceConfig]) -> int:
statuspage_component_name, status_page_url, current_status)
VALUES (?, ?, ?, ?, ?, ?, ?, 'unknown')""",
(
- svc.id, svc.display_name, svc.category, svc.poll_type,
- svc.poll_url, svc.statuspage_component_name, svc.status_page_url,
+ svc.id,
+ svc.display_name,
+ svc.category,
+ svc.poll_type,
+ svc.poll_url,
+ svc.statuspage_component_name,
+ svc.status_page_url,
),
)
await db.commit()
return len(services)
-async def seed_deps_with_db(
- db, deps: dict[str, list[DependencyTarget]], all_ids: list[str]
-) -> int:
+async def seed_deps_with_db(db, deps: dict[str, list[DependencyTarget]], all_ids: list[str]) -> int:
await db.execute("DELETE FROM service_dependencies")
count = 0
for upstream, targets in deps.items():
diff --git a/backend/tests/test_services_api.py b/backend/tests/test_services_api.py
index 9bdd86e..a6b08ca 100644
--- a/backend/tests/test_services_api.py
+++ b/backend/tests/test_services_api.py
@@ -17,11 +17,16 @@ async def seeded_app(tmp_path):
db_path = str(tmp_path / "test.db")
conn = await init_db(db_path)
- services = load_services()
- from tests.test_seeder import seed_deps_with_db, seed_services_with_db
-
+ from tests.test_seeder import (
+ _DEPENDENCIES_YAML,
+ _SERVICES_YAML,
+ seed_deps_with_db,
+ seed_services_with_db,
+ )
+
+ services = load_services(path=_SERVICES_YAML)
await seed_services_with_db(conn, services)
- deps = load_dependencies(known_service_ids={s.id for s in services})
+ deps = load_dependencies(path=_DEPENDENCIES_YAML, known_service_ids={s.id for s in services})
await seed_deps_with_db(conn, deps, [s.id for s in services])
from app.main import app
@@ -79,7 +84,7 @@ async def test_list_services_pending_status_null_by_default(self, client):
class TestServiceDetailShape:
async def test_detail_includes_pending_status_fields(self, client):
- resp = await client.get("/api/services/okta")
+ resp = await client.get("/api/services/identity-provider")
assert resp.status_code == 200
body = resp.json()
svc = body["data"]["service"]
diff --git a/backend/tests/test_slack_ack.py b/backend/tests/test_slack_ack.py
index e4cc499..438dd59 100644
--- a/backend/tests/test_slack_ack.py
+++ b/backend/tests/test_slack_ack.py
@@ -120,6 +120,7 @@ async def ack_app(tmp_path, monkeypatch):
monkeypatch.setattr(settings, "slack_signing_secret", SecretStr(SIGNING_SECRET))
from app.main import app
+
yield app, conn
await close_db()
@@ -174,7 +175,8 @@ async def test_valid_ack_updates_db_and_calls_response_url(ack_client):
blocks = posted_body.get("blocks", [])
context_texts = [
elem.get("text", "")
- for b in blocks if b.get("type") == "context"
+ for b in blocks
+ if b.get("type") == "context"
for elem in b.get("elements", [])
if isinstance(elem.get("text"), str)
]
@@ -244,7 +246,8 @@ async def test_disabled_returns_404(tmp_path, monkeypatch):
sig = _slack_sign(body, SIGNING_SECRET, ts)
async with AsyncClient(
- transport=ASGITransport(app=app), base_url="http://test",
+ transport=ASGITransport(app=app),
+ base_url="http://test",
) as client:
resp = await client.post(
"/api/slack/interactivity",
@@ -273,7 +276,8 @@ async def test_signing_secret_not_configured_returns_503(tmp_path, monkeypatch):
sig = _slack_sign(body, SIGNING_SECRET, ts)
async with AsyncClient(
- transport=ASGITransport(app=app), base_url="http://test",
+ transport=ASGITransport(app=app),
+ base_url="http://test",
) as client:
resp = await client.post(
"/api/slack/interactivity",
@@ -320,7 +324,8 @@ def test_ack_button_present_when_ack_enabled(monkeypatch):
)
ack_actions = [
- b for b in payload["blocks"]
+ b
+ for b in payload["blocks"]
if b.get("type") == "actions"
and any(e.get("action_id") == "ack_alert" for e in b.get("elements", []))
]
@@ -346,7 +351,8 @@ def test_ack_button_absent_when_ack_disabled(monkeypatch):
)
ack_actions = [
- b for b in payload["blocks"]
+ b
+ for b in payload["blocks"]
if b.get("type") == "actions"
and any(e.get("action_id") == "ack_alert" for e in b.get("elements", []))
]
@@ -369,7 +375,8 @@ def test_ack_button_absent_when_no_dedup_key(monkeypatch):
)
ack_actions = [
- b for b in payload["blocks"]
+ b
+ for b in payload["blocks"]
if b.get("type") == "actions"
and any(e.get("action_id") == "ack_alert" for e in b.get("elements", []))
]
@@ -384,8 +391,8 @@ def test_aggregated_alert_has_ack_button_when_enabled(monkeypatch):
monkeypatch.setattr(settings, "slack_ack_enabled", True)
upstream = StatusChange(
- service_id="okta",
- service_display_name="Okta",
+ service_id="identity-provider",
+ service_display_name="Identity Provider",
previous_status="operational",
new_status="major_outage",
status_detail=None,
@@ -393,8 +400,8 @@ def test_aggregated_alert_has_ack_button_when_enabled(monkeypatch):
status_page_url=None,
)
dependent = StatusChange(
- service_id="box",
- service_display_name="Box",
+ service_id="content-platform",
+ service_display_name="Content Platform",
previous_status="operational",
new_status="major_outage",
status_detail=None,
@@ -405,12 +412,13 @@ def test_aggregated_alert_has_ack_button_when_enabled(monkeypatch):
payload = build_aggregated_upstream_alert(
upstream_change=upstream,
dependents=[dependent],
- impact_statement="Okta is down",
- dedup_key="vendor:okta:inc-999",
+ impact_statement="Identity Provider is down",
+ dedup_key="vendor:identity-provider:inc-999",
)
ack_actions = [
- b for b in payload["blocks"]
+ b
+ for b in payload["blocks"]
if b.get("type") == "actions"
and any(e.get("action_id") == "ack_alert" for e in b.get("elements", []))
]
@@ -425,8 +433,8 @@ def test_aggregated_alert_no_ack_button_when_disabled(monkeypatch):
monkeypatch.setattr(settings, "slack_ack_enabled", False)
upstream = StatusChange(
- service_id="okta",
- service_display_name="Okta",
+ service_id="identity-provider",
+ service_display_name="Identity Provider",
previous_status="operational",
new_status="major_outage",
status_detail=None,
@@ -437,12 +445,13 @@ def test_aggregated_alert_no_ack_button_when_disabled(monkeypatch):
payload = build_aggregated_upstream_alert(
upstream_change=upstream,
dependents=[],
- impact_statement="Okta is down",
- dedup_key="vendor:okta:inc-999",
+ impact_statement="Identity Provider is down",
+ dedup_key="vendor:identity-provider:inc-999",
)
ack_actions = [
- b for b in payload["blocks"]
+ b
+ for b in payload["blocks"]
if b.get("type") == "actions"
and any(e.get("action_id") == "ack_alert" for e in b.get("elements", []))
]
diff --git a/backend/tests/test_slack_slash.py b/backend/tests/test_slack_slash.py
index 492a5f2..9a8a9a0 100644
--- a/backend/tests/test_slack_slash.py
+++ b/backend/tests/test_slack_slash.py
@@ -80,11 +80,25 @@ async def slash_app(tmp_path, monkeypatch):
conn = await init_db(db_path)
services = [
- ("okta", "Okta", "identity", "critical", "operational", "healthy"),
- ("zoom", "Zoom", "collaboration", "important", "degraded", "healthy"),
- ("jira_sm", "Jira Service Management", "itsm", "important", "operational", "healthy"),
- ("slack_api", "Slack API", "collaboration", "critical", "operational", "healthy"),
- ("slack_bot", "Slack Bot", "collaboration", "important", "operational", "healthy"),
+ (
+ "identity-provider",
+ "Identity Provider",
+ "identity",
+ "critical",
+ "operational",
+ "healthy",
+ ),
+ (
+ "video-conferencing",
+ "Video Conferencing",
+ "collaboration",
+ "important",
+ "degraded",
+ "healthy",
+ ),
+ ("ticketing", "Ticketing", "itsm", "important", "operational", "healthy"),
+ ("chat-platform", "Chat Platform", "collaboration", "critical", "operational", "healthy"),
+ ("chat-bot", "Chat Bot", "collaboration", "important", "operational", "healthy"),
("broken_svc", "Broken Service", "other", "low", "operational", "broken"),
]
@@ -125,7 +139,7 @@ async def slash_client(slash_app):
async def test_missing_signature_header(slash_client):
client, _ = slash_client
- body = _form_body_for_slash({"command": "/itstatus", "text": "okta"})
+ body = _form_body_for_slash({"command": "/itstatus", "text": "identity-provider"})
ts = _ts_now()
resp = await client.post(
"/api/slack/slash",
@@ -144,7 +158,7 @@ async def test_missing_signature_header(slash_client):
async def test_bad_signature(slash_client):
client, _ = slash_client
- body = _form_body_for_slash({"command": "/itstatus", "text": "okta"})
+ body = _form_body_for_slash({"command": "/itstatus", "text": "identity-provider"})
resp = await client.post(
"/api/slack/slash",
content=body,
@@ -158,7 +172,7 @@ async def test_bad_signature(slash_client):
async def test_stale_timestamp(slash_client):
client, _ = slash_client
- body = _form_body_for_slash({"command": "/itstatus", "text": "okta"})
+ body = _form_body_for_slash({"command": "/itstatus", "text": "identity-provider"})
resp = await client.post(
"/api/slack/slash",
content=body,
@@ -179,11 +193,9 @@ async def test_feature_disabled(tmp_path, monkeypatch):
from app.main import app
- body = _form_body_for_slash({"command": "/itstatus", "text": "okta"})
+ body = _form_body_for_slash({"command": "/itstatus", "text": "identity-provider"})
- async with AsyncClient(
- transport=ASGITransport(app=app), base_url="http://test"
- ) as client:
+ async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client:
resp = await client.post(
"/api/slack/slash",
content=body,
@@ -207,13 +219,11 @@ async def test_signing_secret_unset(tmp_path, monkeypatch):
from app.main import app
- body = _form_body_for_slash({"command": "/itstatus", "text": "okta"})
+ body = _form_body_for_slash({"command": "/itstatus", "text": "identity-provider"})
ts = _ts_now()
sig = _slack_sign(body, SIGNING_SECRET, ts)
- async with AsyncClient(
- transport=ASGITransport(app=app), base_url="http://test"
- ) as client:
+ async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client:
resp = await client.post(
"/api/slack/slash",
content=body,
@@ -234,20 +244,14 @@ async def test_signing_secret_unset(tmp_path, monkeypatch):
async def test_exact_id_match(slash_client):
client, _ = slash_client
- body = _form_body_for_slash({"command": "/itstatus", "text": "okta"})
- resp = await client.post(
- "/api/slack/slash", content=body, headers=_headers(body)
- )
+ body = _form_body_for_slash({"command": "/itstatus", "text": "identity-provider"})
+ resp = await client.post("/api/slack/slash", content=body, headers=_headers(body))
assert resp.status_code == 200
data = resp.json()
assert data["response_type"] == "ephemeral"
# Header block should contain the display name
- header_texts = [
- b["text"]["text"]
- for b in data["blocks"]
- if b.get("type") == "header"
- ]
- assert any("Okta" in t for t in header_texts)
+ header_texts = [b["text"]["text"] for b in data["blocks"] if b.get("type") == "header"]
+ assert any("Identity Provider" in t for t in header_texts)
# Status is operational → green check emoji
assert "✅" in data["text"] or "Operational" in data["text"]
@@ -257,55 +261,41 @@ async def test_exact_id_match(slash_client):
async def test_case_insensitive_display_name_match(slash_client):
client, _ = slash_client
- body = _form_body_for_slash({"command": "/itstatus", "text": "Okta"})
- resp = await client.post(
- "/api/slack/slash", content=body, headers=_headers(body)
- )
+ body = _form_body_for_slash({"command": "/itstatus", "text": "Identity Provider"})
+ resp = await client.post("/api/slack/slash", content=body, headers=_headers(body))
assert resp.status_code == 200
data = resp.json()
- header_texts = [
- b["text"]["text"]
- for b in data["blocks"]
- if b.get("type") == "header"
- ]
- assert any("Okta" in t for t in header_texts)
+ header_texts = [b["text"]["text"] for b in data["blocks"] if b.get("type") == "header"]
+ assert any("Identity Provider" in t for t in header_texts)
# ── 8. Unique substring match → found ─────────────────────────────────────────
async def test_substring_match_unique(slash_client):
- """'jir' matches only 'Jira Service Management' — unique → found."""
+ """'ticket' matches only 'Ticketing' — unique → found."""
client, _ = slash_client
- body = _form_body_for_slash({"command": "/itstatus", "text": "jir"})
- resp = await client.post(
- "/api/slack/slash", content=body, headers=_headers(body)
- )
+ body = _form_body_for_slash({"command": "/itstatus", "text": "ticket"})
+ resp = await client.post("/api/slack/slash", content=body, headers=_headers(body))
assert resp.status_code == 200
data = resp.json()
- header_texts = [
- b["text"]["text"]
- for b in data["blocks"]
- if b.get("type") == "header"
- ]
- assert any("Jira" in t for t in header_texts)
+ header_texts = [b["text"]["text"] for b in data["blocks"] if b.get("type") == "header"]
+ assert any("Ticketing" in t for t in header_texts)
# ── 9. Ambiguous substring → disambiguation text ──────────────────────────────
async def test_substring_match_ambiguous(slash_client):
- """'slac' matches both 'Slack API' and 'Slack Bot' but neither id exactly."""
+ """'chat' matches both 'Chat Platform' and 'Chat Bot' but neither id exactly."""
client, _ = slash_client
- body = _form_body_for_slash({"command": "/itstatus", "text": "slac"})
- resp = await client.post(
- "/api/slack/slash", content=body, headers=_headers(body)
- )
+ body = _form_body_for_slash({"command": "/itstatus", "text": "chat"})
+ resp = await client.post("/api/slack/slash", content=body, headers=_headers(body))
assert resp.status_code == 200
data = resp.json()
text = data["text"]
# Must mention both candidates
- assert "Slack API" in text or "Slack Bot" in text
+ assert "Chat Platform" in text or "Chat Bot" in text
assert "more specific" in text or "Multiple" in text
@@ -315,9 +305,7 @@ async def test_substring_match_ambiguous(slash_client):
async def test_no_match(slash_client):
client, _ = slash_client
body = _form_body_for_slash({"command": "/itstatus", "text": "notaservice"})
- resp = await client.post(
- "/api/slack/slash", content=body, headers=_headers(body)
- )
+ resp = await client.post("/api/slack/slash", content=body, headers=_headers(body))
assert resp.status_code == 200
data = resp.json()
assert "No service matches" in data["text"]
@@ -330,9 +318,7 @@ async def test_no_match(slash_client):
async def test_empty_text(slash_client):
client, _ = slash_client
body = _form_body_for_slash({"command": "/itstatus", "text": ""})
- resp = await client.post(
- "/api/slack/slash", content=body, headers=_headers(body)
- )
+ resp = await client.post("/api/slack/slash", content=body, headers=_headers(body))
assert resp.status_code == 200
data = resp.json()
assert "Usage" in data["text"]
@@ -346,9 +332,7 @@ async def test_poller_broken_surfaces_as_unknown(slash_client):
client, _ = slash_client
# broken_svc has current_status=operational but poller_health=broken
body = _form_body_for_slash({"command": "/itstatus", "text": "broken_svc"})
- resp = await client.post(
- "/api/slack/slash", content=body, headers=_headers(body)
- )
+ resp = await client.post("/api/slack/slash", content=body, headers=_headers(body))
assert resp.status_code == 200
data = resp.json()
# Should show Unknown, not Operational
@@ -365,9 +349,7 @@ async def test_wrong_command(slash_client):
"""Slack expects 200 even for unrecognised command names."""
client, _ = slash_client
body = _form_body_for_slash({"command": "/something-else", "text": ""})
- resp = await client.post(
- "/api/slack/slash", content=body, headers=_headers(body)
- )
+ resp = await client.post("/api/slack/slash", content=body, headers=_headers(body))
assert resp.status_code == 200
data = resp.json()
assert "Unknown slash command" in data["text"]
diff --git a/backend/tests/test_templates.py b/backend/tests/test_templates.py
index e5777b9..415ff13 100644
--- a/backend/tests/test_templates.py
+++ b/backend/tests/test_templates.py
@@ -1,12 +1,19 @@
"""Tests for impact statement templates."""
from app.alerting.templates import generate_impact_statement, generate_summary_text
+from app.config import settings
from app.poller.change_detector import StatusChange
-def _make_change(service_id="test-svc", display_name="Test Service",
- previous="operational", new="degraded", detail=None,
- poll_type="statuspage_json", url=None):
+def _make_change(
+ service_id="test-svc",
+ display_name="Test Service",
+ previous="operational",
+ new="degraded",
+ detail=None,
+ poll_type="statuspage_json",
+ url=None,
+):
return StatusChange(
service_id=service_id,
service_display_name=display_name,
@@ -19,9 +26,16 @@ def _make_change(service_id="test-svc", display_name="Test Service",
def _make_downstream(names):
- return [{"service_name": n, "service_id": n.lower(), "severity": "high",
- "impact_description": f"{n} impacted", "current_status": "operational"}
- for n in names]
+ return [
+ {
+ "service_name": n,
+ "service_id": n.lower(),
+ "severity": "high",
+ "impact_description": f"{n} impacted",
+ "current_status": "operational",
+ }
+ for n in names
+ ]
class TestGenerateImpactStatement:
@@ -43,11 +57,11 @@ def test_generic_major_outage(self):
def test_with_downstream(self):
change = _make_change(new="degraded", detail="Slow")
- downstream = _make_downstream(["Jira", "Confluence"])
+ downstream = _make_downstream(["Ticketing", "Team Wiki"])
result = generate_impact_statement(change, downstream)
assert "may impact" in result
- assert "Jira" in result
- assert "Confluence" in result
+ assert "Ticketing" in result
+ assert "Team Wiki" in result
def test_recovery(self):
change = _make_change(new="operational", previous="degraded")
@@ -55,29 +69,51 @@ def test_recovery(self):
assert "recovered" in result
assert "operational" in result
- def test_okta_outage(self):
- change = _make_change(service_id="okta", display_name="Okta", new="major_outage")
- downstream = _make_downstream(["Box", "Slack", "Zoom"])
+ def test_sso_broker_outage(self, monkeypatch):
+ monkeypatch.setattr(settings, "sso_broker_service_id", "identity-provider")
+ change = _make_change(
+ service_id="identity-provider", display_name="Identity Provider", new="major_outage"
+ )
+ downstream = _make_downstream(["Service A", "Service B", "Service C"])
result = generate_impact_statement(change, downstream)
assert "SSO authentication is unavailable" in result
- assert "Box" in result
+ assert "Service A" in result
assert "avoid logging out" in result
- def test_okta_degraded(self):
- change = _make_change(service_id="okta", display_name="Okta", new="degraded")
- downstream = _make_downstream(["Box", "Slack"])
+ def test_sso_broker_degraded(self, monkeypatch):
+ monkeypatch.setattr(settings, "sso_broker_service_id", "identity-provider")
+ change = _make_change(
+ service_id="identity-provider", display_name="Identity Provider", new="degraded"
+ )
+ downstream = _make_downstream(["Service A", "Service B"])
result = generate_impact_statement(change, downstream)
assert "SSO authentication" in result
assert "may be affected" in result
- def test_okta_partial_uses_outage_template(self):
- change = _make_change(service_id="okta", display_name="Okta", new="partial_outage")
- downstream = _make_downstream(["Box"])
+ def test_sso_broker_partial_uses_outage_template(self, monkeypatch):
+ monkeypatch.setattr(settings, "sso_broker_service_id", "identity-provider")
+ change = _make_change(
+ service_id="identity-provider", display_name="Identity Provider", new="partial_outage"
+ )
+ downstream = _make_downstream(["Service A"])
result = generate_impact_statement(change, downstream)
assert "SSO authentication is unavailable" in result
+ def test_sso_broker_unset_uses_generic_template(self, monkeypatch):
+ # With no broker configured, even the identity service gets the
+ # generic severity template — no hardcoded vendor special-casing.
+ monkeypatch.setattr(settings, "sso_broker_service_id", None)
+ change = _make_change(
+ service_id="identity-provider", display_name="Identity Provider", new="major_outage"
+ )
+ result = generate_impact_statement(change, [])
+ assert "MAJOR OUTAGE" in result
+ assert "SSO authentication" not in result
+
def test_generic_major_outage_service(self):
- change = _make_change(service_id="some-service", display_name="Some Service", new="major_outage")
+ change = _make_change(
+ service_id="some-service", display_name="Some Service", new="major_outage"
+ )
result = generate_impact_statement(change, [])
assert "Some Service" in result
assert "MAJOR OUTAGE" in result
@@ -101,12 +137,12 @@ def test_all_healthy(self):
assert "operational" in result
def test_with_incidents(self):
- result = generate_summary_text(29, 2, ["Okta", "Slack"])
+ result = generate_summary_text(29, 2, ["Identity Provider", "Chat Platform"])
assert "2 active incident" in result
assert "29" in result
- assert "Okta" in result
- assert "Slack" in result
+ assert "Identity Provider" in result
+ assert "Chat Platform" in result
def test_single_incident(self):
- result = generate_summary_text(29, 1, ["Box"])
+ result = generate_summary_text(29, 1, ["Content Platform"])
assert "1 active incident" in result
diff --git a/com.box.it-health-dashboard.plist b/com.company.it-health-dashboard.plist
similarity index 94%
rename from com.box.it-health-dashboard.plist
rename to com.company.it-health-dashboard.plist
index 131565f..141f2ba 100644
--- a/com.box.it-health-dashboard.plist
+++ b/com.company.it-health-dashboard.plist
@@ -4,10 +4,10 @@
IT Service Health Dashboard — launchd daemon (Phase 6 hardened).
Install:
- cp com.box.it-health-dashboard.plist \
- /Library/LaunchDaemons/com.box.it-health-dashboard.plist
+ cp com.company.it-health-dashboard.plist \
+ /Library/LaunchDaemons/com.company.it-health-dashboard.plist
# Edit the /path/to/ placeholders below first.
- sudo launchctl bootstrap system /Library/LaunchDaemons/com.box.it-health-dashboard.plist
+ sudo launchctl bootstrap system /Library/LaunchDaemons/com.company.it-health-dashboard.plist
Key hardening decisions (see PRODUCTION-ROADMAP.md Phase 6):
- KeepAlive is dict-form: restart on crash, NOT on clean exit.
@@ -23,7 +23,7 @@
<key>Label</key>
- <string>com.box.it-health-dashboard</string>
+ <string>com.company.it-health-dashboard</string>
<key>ProgramArguments</key>
diff --git a/deploy/Caddyfile.example b/deploy/Caddyfile.example
index 68bf68b..8f18933 100644
--- a/deploy/Caddyfile.example
+++ b/deploy/Caddyfile.example
@@ -14,12 +14,12 @@
# sudo caddy validate --config /opt/it-health/deploy/Caddyfile
# sudo caddy run --config /opt/it-health/deploy/Caddyfile # or via brew services
#
-# For production: set up a com.box.it-health-caddy.plist daemon
+# For production: set up a com.company.it-health-caddy.plist daemon
# mirroring the Litestream sidecar plist in this directory.
{
# Global options
- email ops@box.example # ACME registration — change to your ops alias
+ email ops@example.com # ACME registration — change to your ops alias
admin off # don't expose the Caddy admin API
}
@@ -27,7 +27,7 @@
# Using `tls internal` issues a cert from Caddy's local CA, which your
# clients will need to trust. For a publicly-routable host, swap for a
# real domain and ACME handles cert issuance automatically.
-health.box.corp {
+health.example.internal {
tls internal
# Hardening — keep these even for VPN-internal deploys. Attackers
@@ -78,7 +78,7 @@ health.box.corp {
health_status 2xx
}
- # Structured access log — JSON so it ships into whatever Box uses.
+ # Structured access log — JSON so it ships into whatever log aggregator you use.
log {
output file /var/log/it-health/caddy-access.log {
roll_size 10mb
diff --git a/deploy/com.box.it-health-dashboard-litestream.plist.example b/deploy/com.company.it-health-dashboard-litestream.plist.example
similarity index 90%
rename from deploy/com.box.it-health-dashboard-litestream.plist.example
rename to deploy/com.company.it-health-dashboard-litestream.plist.example
index 9d25a86..50c6b62 100644
--- a/deploy/com.box.it-health-dashboard-litestream.plist.example
+++ b/deploy/com.company.it-health-dashboard-litestream.plist.example
@@ -1,10 +1,10 @@