diff --git a/.gitignore b/.gitignore index 0029c8d..0b875fe 100644 --- a/.gitignore +++ b/.gitignore @@ -35,3 +35,10 @@ frontend/dist/ # OS .DS_Store CLAUDE.md.bak + +# Local (private) service registry — overrides the committed example so the +# real organization inventory never has to live in version control. +backend/config/*.local.yaml + +# uv lockfile (runtime deps are pinned in backend/requirements.txt) +backend/uv.lock diff --git a/CASE-STUDY.md b/CASE-STUDY.md new file mode 100644 index 0000000..60af402 --- /dev/null +++ b/CASE-STUDY.md @@ -0,0 +1,220 @@ +# Case Study: A Real-Time SaaS Status Dashboard for Enterprise IT + +> **What it is:** A self-hosted service that polls the public status pages of ~30 enterprise SaaS +> tools every 60 seconds, normalizes a dozen incompatible vendor formats into one status model, +> detects changes, reasons about downstream impact, and alerts the operations team before the +> first user ticket lands. +> +> **Status:** v1 and v2 shipped. **356 tests passing.** ~16k LOC. Python 3.13 / FastAPI backend, +> React + Vite frontend, single-file SQLite datastore, runs on one Mac mini. + +--- + +## The problem + +In most IT organizations, the news that a critical SaaS tool is down travels in exactly the wrong +direction: a user hits a broken login, files a ticket, the ticket sits in a queue, and only then — +ten, twenty, sixty minutes later — does someone in IT realize the identity provider has been +degraded the whole time. The team learns about outages from the people it's supposed to be +protecting. + +The information was public the entire time. Nearly every major SaaS vendor publishes a machine-readable +status page. The gap isn't data availability — it's that nobody is *watching* thirty status pages at +once, in thirty different formats, and connecting "Vendor X is degraded" to "therefore these internal +workflows are about to break." + +This project closes that gap. 
It turns a reactive, ticket-driven posture into a proactive one: when a +vendor's own status page flips, the operations channel knows within one poll cycle — with an impact +statement attached, not just a raw status code. + +--- + +## The solution + +A small, self-contained monitoring service with five responsibilities, each isolated into its own +module so they can be tested and reasoned about independently: + +```mermaid +flowchart TD + subgraph upstream["Upstream status sources (heterogeneous)"] + A1["Statuspage.io-powered pages<br/>(~half the fleet)"] + A2["Vendor-specific status APIs<br/>(collaboration, CRM, support, productivity suites)"] + A3["Operator-pushed manual updates<br/>(tools with no machine-readable status)"] + end + + A1 & A2 & A3 --> P["Poll Orchestrator<br/>APScheduler · 60s cycle<br/>coalesce · no overlap · misfire-aware"] + P --> R["Resilience layer<br/>retry + backoff (stamina)<br/>per-host circuit breaker (purgatory)"] + R --> N["Status Normalizer<br/>vendor strings → 5-state enum<br/>operational · degraded · partial · major · unknown"] + N --> C["Change Detector<br/>diff vs. DB · two-axis health<br/>(is the vendor down? is the poller blind?)"] + C --> I["Impact Engine<br/>service dependency graph<br/>→ 'who is affected downstream'"] + C --> AL["Alert Router<br/>dedup · maintenance suppression<br/>flap suppression · tier routing"] + I --> AL + AL --> CH["Chat alerting<br/>(tiered: page / notify / dashboard-only)"] + C --> DB[("SQLite<br/>WAL · retention · snapshots")] + DB --> API["FastAPI REST<br/>/services /timeline /summary /metrics"] + API --> UI["React dashboard<br/>Executive + Engineer views · PWA"] +``` + +**Flow in one sentence:** poll many formats → make them resilient and uniform → detect what +*changed* → decide who it *affects* and whether it's worth interrupting a human → store it and show it. + +--- + +## Engineering highlights + +These are the parts that are non-obvious — the places where the naive version is easy and the +correct version takes real design. + +### 1. One status model out of a dozen vendor dialects + +Every vendor describes "things are broken" differently. Statuspage.io alone has two distinct +vocabularies — a page-level *indicator* (`none` / `minor` / `major` / `critical`) and per-component +*status* strings (`operational` / `degraded_performance` / `partial_outage` / `under_maintenance`) — +and they don't line up. Other vendors ship entirely bespoke JSON. Manual operator updates use a +third shape. + +The normalizer collapses all of it into a single five-state enum +(`operational · degraded · partial_outage · major_outage · unknown`) via explicit per-source mapping +tables. The design decision that matters: **an unrecognized vendor string maps to `unknown` and emits +a warning** — it never silently guesses `operational`. A new value a vendor introduces tomorrow shows +up as a logged anomaly instead of a false all-clear. (`under_maintenance`, notably, maps to `degraded` +rather than a fake outage — maintenance is expected, not an incident.) + +### 2. Two-axis health: "is the vendor down?" vs. "is my poller blind?" + +This is the insight that separates a toy from a tool. A status of `unknown` is dangerously ambiguous — +it could mean the vendor is genuinely in trouble, or it could mean *our own fetch failed* and we have +no idea what's going on. Conflating the two means every network hiccup looks like an outage. + +So the system tracks **two orthogonal axes**: + +- **Service status** — what the vendor reports (the 5-state enum above).
+- **Poller health** — `healthy → degraded → broken`, driven by a pure state machine: a successful + poll resets to `healthy`; failures short of a configurable threshold are `degraded`; sustained + failure past the threshold flips to `broken`. + +When a poller goes `broken`, that's routed as a *distinct* signal — "we've gone blind on this service" — +separate from a vendor-outage alert, because they demand different human responses. The UI renders a +broken poller differently from a down vendor. You always know whether you're looking at reality or at +a gap in your own instrumentation. + +### 3. Resilience: heal fast, fail loud, don't hammer the dead + +Every outbound request goes through one resilience layer with two complementary mechanisms: + +- **Retry with backoff + jitter** (via `stamina`) for *transient* trouble — network errors, timeouts, + and HTTP `408` / `429` / `5xx`. These self-heal, so the system gives them a few capped, jittered + attempts before giving up. +- **Per-host circuit breaker** (via `purgatory`) for *persistent* trouble. After N consecutive + failures a host's breaker opens and subsequent calls fast-fail for a TTL (default 5 minutes) before + probing again — avoiding the classic "re-probe a dead host every cycle and burn the whole poll + window" antipattern. + +The sharp distinction: **HTTP `4xx` errors other than `408`/`429` are treated as hard failures and +surface immediately** — a `404` or `401` means a URL moved or auth changed; retrying can't fix config +rot, so it shouldn't mask it. And a tripped breaker is reported as *poller-unhealthy*, not +*vendor-down* — feeding straight back into the two-axis model above. + +### 4. Alert quality: earn the interruption + +A monitor that cries wolf gets muted, and a muted monitor is worthless. 
Several layers cooperate so +that a human is interrupted only when it's warranted: + +- **Deduplication keyed on `(service, vendor_incident_id)`** — with a `(service, status, day)` + fallback when no vendor incident ID exists. Critically, **dedup never keys on message text**: + vendors edit incident titles mid-flight, and a text key would leak a fresh alert on every wording + change. +- **Maintenance-window suppression** — a scheduled maintenance records the state transition but does + not page anyone. Expected ≠ alarming. +- **Flap suppression** — a status must persist across a configurable number of confirming polls before + a worsening alert fires, and recover across another threshold before the all-clear, so a vendor + bouncing between states doesn't machine-gun the channel. +- **Tiered routing** — every service has a tier: `critical` pages with an `@here` mention, + `important` notifies without the mention, `informational` updates the dashboard and sends nothing. +- **Dependency correlation** — when one upstream failure knocks out several dependents, the router can + emit a *single aggregated* upstream alert instead of N separate downstream ones. + +Every decision — sent or suppressed, and why — is written to a durable audit log. There's always an +answer to "what did we tell operators, and what did we hold back?" + +### 5. From "Vendor X is degraded" to "here's who it hurts" + +Raw status is low-value; *impact* is what an on-call human actually needs. A service dependency graph +(stored relationally, queried both upstream and downstream and ordered by severity) lets the impact +engine turn a single vendor event into a downstream blast-radius statement — "identity provider +degraded → these dependent workflows are at risk" — so the alert leads with consequences, not codes. + +### 6. 
Scheduler discipline and observability + +The poll loop is built to stay honest under load and to be debuggable in production: + +- **Scheduler safety**: cycles `coalesce` (a late wake-up runs once, not as a backlog stampede), + `max_instances=1` (a slow cycle is skipped, never overlapped), and missed runs are logged and + counted rather than silently dropped. +- **Trace-without-tracing**: each cycle binds a fresh `poll_cycle_id` into structured logs, so every + line from one cycle is correlatable without a full distributed-tracing stack. +- **Operational signals**: a Prometheus `/metrics` endpoint (poll counts, durations, circuit-breaker + state, alert sent/suppressed counters), optional Sentry error tracking, and a Healthchecks.io + dead-man's-switch heartbeat that screams if the *monitor itself* goes dark. + +### 7. Boring, durable data lifecycle + +A single SQLite file in WAL mode is the whole datastore — deliberately. Around it: production pragmas, +automatic retention purges for event and alert-log tables, a daily `VACUUM INTO` snapshot, and +optional Litestream continuous replication. No database server to operate; full point-in-time recovery +if the host dies. + +--- + +## Results / status + +- **v1 (demo-ready) — shipped.** Polling, normalization, change detection, chat alerting, the React + UI, dependency graph, timeline, SLA tracking, incident clustering, and automated reports. 
+- **v2 (production-ready) — shipped.** Bearer-token auth on admin endpoints; the full resilience + layer (retry + circuit breaker); alert-quality stack (flap suppression, dedup, tier routing, + dependency correlation, maintenance windows); observability (structured logging, Prometheus, Sentry, + dead-man's switch); data lifecycle (retention, snapshots, replication); a productionized UI with a + severity-sorted grid, an Executive/Engineer view toggle, accessibility + keyboard navigation, and + PWA support; and platform polish (CI, pre-commit hooks, a hardened launchd service, a Caddy reverse + proxy, and OS-keychain-backed secrets). +- **Quality gate:** **356 automated tests passing**, covering the normalizer, resilience layer, + change detector, alert routing, dependency graph, SLA math, the REST API, and a full end-to-end + pipeline test. +- **Footprint:** runs comfortably on a single Mac mini. One Python process, one SQLite file, one + static React bundle served by the same app. +- **What it watches:** ~30 enterprise SaaS tools spanning identity & access, productivity & content, + collaboration, engineering & ITSM, HR & people, finance, CRM, marketing, network/VPN, and support — + a representative cross-section of a modern enterprise SaaS estate. + +A set of features (inbound vendor webhooks, a chat acknowledgement flow, auto-drafted SRE-style +postmortems on recovery, and multi-burn-rate SLO alerting) is built and tested but kept behind feature +flags, defaulting off until their deployment prerequisites are in place — shipped code, deliberately +dark. + +--- + +## Skills demonstrated + +Framed for a Platform / DX / AI-infrastructure audience: + +- **Distributed-systems resilience as a first-class concern, not an afterthought.** Retries with + backoff + jitter, per-host circuit breaking, and a deliberate transient-vs-permanent failure + taxonomy — the same patterns that keep a platform's outbound integrations from amplifying a + dependency's bad day. 
+- **Designing the right abstraction over messy reality.** Collapsing a dozen incompatible vendor + formats into one clean status model — with a fail-safe `unknown` path — is the everyday work of + platform and integration engineering. +- **Signal quality over signal volume.** The dedup / suppression / tiering / correlation stack is an + alerting-discipline story: respect the human on the other end, and the system stays trusted instead + of muted. +- **Observability built in from the start.** Structured logs with cycle correlation, Prometheus + metrics, error tracking, and a dead-man's switch — designed to be operated, not just run. +- **Production data discipline at small scale.** WAL, retention, snapshots, and replication on a + single-file datastore: maximum durability for minimum operational surface. +- **Test rigor.** 356 tests including pure-function state-machine coverage and an end-to-end pipeline + test — the difference between "it worked on my machine" and "it's safe to change." +- **Shipping judgment.** Feature-flagged, default-off capabilities show the discipline to merge + complete-but-not-yet-deployable work without destabilizing what's live. + +The throughline: this is enterprise IT pain, solved with platform-engineering tools — turning a +reactive, ticket-driven workflow into a proactive, observable, self-healing system. diff --git a/CLAUDE.md b/CLAUDE.md index fe7e60c..a00e398 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -1,6 +1,6 @@ # IT Service Health Dashboard -Internal web dashboard aggregating real-time health of ~30 SaaS services at Box. Polls Statuspage.io JSON API, Google Workspace JSON feed, Slack native status API, and RSS/Atom feeds. Enriches with dependency mapping and templated impact statements. Displays a unified status board with timeline view, posts alerts to Slack. Deployed on a Mac Mini behind corporate VPN. +Internal web dashboard aggregating real-time health of ~30 SaaS services in an enterprise IT environment. 
Polls Statuspage.io JSON API, a cloud productivity suite's JSON feed, a chat vendor's native status API, and RSS/Atom feeds. Enriches with dependency mapping and templated impact statements. Displays a unified status board with timeline view, posts alerts to Slack. Deployed on a Mac Mini on the internal network. ## Roadmap @@ -35,7 +35,7 @@ cd backend && python run.py Open `http://localhost:8000`. -**CI:** GitHub Actions — `uv`, `ruff`, `mypy --strict`, `pytest`; CodeQL analysis. 356 tests passing. +**CI:** GitHub Actions — `uv`, `ruff`, `mypy --strict`, `pytest`; CodeQL analysis. 378 tests passing. ## Conventions @@ -55,11 +55,11 @@ Open `http://localhost:8000`. | Decision | Choice | Rationale | |----------|--------|-----------| | Primary data source | Statuspage.io `/api/v2/summary.json` | Most vendors use Statuspage.io; JSON, no auth, not rate-limited | -| Google Workspace | Google custom JSON feed + RSS | Google has its own status dashboard, not Statuspage.io | -| Slack status | `slack-status.com/api/v2.0.0/current` | Dedicated JSON status API | +| Cloud productivity suite | Custom JSON feed + RSS | Has its own status dashboard, not Statuspage.io | +| Chat vendor status | Vendor JSON status endpoint | Dedicated JSON status API | | Database | SQLite + Litestream | Demo-scale + ~1s RPO; Postgres deferred to >100 writes/s | -| Auth | Bearer token on admin endpoints; VPN-only for reads | Bearer token required for write endpoints — VPN alone is insufficient | -| Hosting | Mac Mini + Caddy | Always-on, VPN-accessible; Caddy adds HTTPS + header auth | +| Auth | Bearer token on admin endpoints; internal-network-only for reads | Bearer token required for write endpoints — the internal network alone is insufficient | +| Hosting | Mac Mini + Caddy | Always-on, internal-network-accessible; Caddy adds HTTPS + header auth | | Dep graph layout | Force-directed (react-force-graph-2d) | Dagre hierarchical layout is deferred; force-directed is current default | | LLM layer | 
Deferred (post-Phase-7) | Template-based summaries sufficient for v2 | @@ -84,11 +84,11 @@ All new work must map to an active phase in PRODUCTION-ROADMAP.md. Splunk, Thous ## What This Project Is -Internal web dashboard that aggregates real-time health status of ~30 SaaS services IT supports at Box. Polls vendor status pages via Statuspage.io JSON API, Google Workspace JSON feed, Slack's native status API, and RSS/Atom feeds. Enriches with dependency mapping and templated impact statements. Displays a unified status board with timeline view and posts alerts to Slack. Deployed on a Mac Mini behind corporate VPN. Designed for IT engineers (deep triage) and IT leadership / company-wide visibility (situational awareness). +Internal web dashboard that aggregates real-time health status of ~30 SaaS services supported by an enterprise IT team. Polls vendor status pages via Statuspage.io JSON API, a cloud productivity suite's JSON feed, a chat vendor's native status API, and RSS/Atom feeds. Enriches with dependency mapping and templated impact statements. Displays a unified status board with timeline view and posts alerts to Slack. Deployed on a Mac Mini on the internal network. Designed for IT engineers (deep triage) and IT leadership / company-wide visibility (situational awareness). ## Current State -**v2 SHIPPED — Phases 0–6 complete; Phase 2B + Phase 7 (Statuspage inbound webhook + Slack ack) in tree, gated off by default.** Auth, vendor resilience, alert quality, observability, data lifecycle, UX productionization, and platform polish all landed. **356 tests passing.** The dashboard is production-grade; a mature IT team can rely on it. See PRODUCTION-ROADMAP.md for the exit-criteria detail on each phase. +**v2 SHIPPED — Phases 0–6 complete; Phase 2B + Phase 7 (Statuspage inbound webhook + Slack ack) in tree, gated off by default.** Auth, vendor resilience, alert quality, observability, data lifecycle, UX productionization, and platform polish all landed. 
**378 tests passing.** The dashboard is production-grade; a mature IT team can rely on it. See PRODUCTION-ROADMAP.md for the exit-criteria detail on each phase. Main also includes a parallel UX sprint that shipped alongside Phase 5: - **Executive / Engineer view toggle** — `ViewContext` gates the grid vs category summary and engineer-only affordances (graph, timeline, shortcuts). @@ -144,7 +144,7 @@ Open `http://localhost:8000` in your browser. - Do not start work that isn't in a PRODUCTION-ROADMAP.md phase. If it doesn't fit, discuss first. - Do not integrate Splunk, ThousandEyes, Datadog, or JSM — those are Phase 7+. - Do not build an LLM integration yet — post-Phase-7. -- Do not remove the bearer-token auth on admin endpoints once added. VPN is not sufficient for write endpoints. +- Do not remove the bearer-token auth on admin endpoints once added. The internal network is not sufficient for write endpoints. - Do not use synchronous I/O — all network calls must be async. - Do not hardcode service definitions in Python — they live in services.yaml. - Do not use slack-sdk — use raw httpx POST for webhook simplicity. 
diff --git a/IMPLEMENTATION-ROADMAP.md b/IMPLEMENTATION-ROADMAP.md index 0f7118c..95f266a 100644 --- a/IMPLEMENTATION-ROADMAP.md +++ b/IMPLEMENTATION-ROADMAP.md @@ -12,8 +12,8 @@ ``` [Vendor Status Pages] ├── Statuspage.io JSON API (/api/v2/summary.json) — ~20 services - ├── Google Workspace JSON feed + RSS — 2 services (Gmail, Calendar) - ├── Slack Status API (slack-status.com/api/v2.0.0/current) — 1 service + ├── A cloud productivity suite JSON feed + RSS — 2 services (Mail, Calendar) + ├── the chat-platform status API (chat-status.example.com/api/v2.0.0/current) — 1 service └── Manual updates via POST /api/admin/status — ~10 services ↓ (async poll every 60s via APScheduler) [Polling Workers] @@ -24,7 +24,7 @@ ↓ (on change) ┌──────────────────────┐ │ [Template Engine] │ — generate impact statements from dependency graph - │ [Slack Alerter] │ — POST Block Kit message to #service-validation webhook + │ [Slack Alerter] │ — POST Block Kit message to the ops-alert channel webhook │ [SQLite Writer] │ — insert status_events row, update services row └──────────────────────┘ ↓ @@ -53,8 +53,8 @@ it-service-health/ │ │ │ ├── __init__.py │ │ │ ├── scheduler.py # APScheduler async setup, 60s interval, error handling │ │ │ ├── statuspage_poller.py # Statuspage.io JSON API poller (handles ~20 services) -│ │ │ ├── google_poller.py # Google Workspace status poller (JSON feed + RSS) -│ │ │ ├── slack_poller.py # Slack Status API poller (slack-status.com) +│ │ │ ├── product_feed_poller.py # Cloud productivity suite status poller (JSON feed + RSS) +│ │ │ ├── current_status_poller.py # Chat-platform status API poller │ │ │ ├── rss_poller.py # Fallback RSS/Atom poller for services without JSON API │ │ │ ├── normalizer.py # Vendor status string → ServiceStatus enum mapping │ │ │ └── change_detector.py # Diff current vs stored state, emit change events @@ -117,10 +117,10 @@ it-service-health/ ```sql -- Service registry: static + current state for each monitored service CREATE TABLE 
services ( - id TEXT PRIMARY KEY, -- slug: "okta", "google-mail", "slack" - display_name TEXT NOT NULL, -- "Okta", "Google Mail", "Slack" + id TEXT PRIMARY KEY, -- slug: "identity-provider", "cloud-mail", "chat-platform" + display_name TEXT NOT NULL, -- "Identity Provider", "Cloud Mail", "Chat Platform" category TEXT NOT NULL, -- see categories below - poll_type TEXT NOT NULL DEFAULT 'manual', -- "statuspage_json", "google_json", "slack_api", "rss", "manual" + poll_type TEXT NOT NULL DEFAULT 'manual', -- "statuspage_json", "product_feed_json", "current_status_api", "rss", "manual" poll_url TEXT, -- API/feed URL to poll (NULL if manual) statuspage_component_name TEXT, -- for statuspage_json: match this component name in API response status_page_url TEXT, -- vendor public status page URL for linking @@ -140,7 +140,7 @@ CREATE TABLE status_events ( vendor_title TEXT, -- incident title from vendor vendor_detail TEXT, -- incident description/body from vendor impact_statement TEXT, -- generated template-based impact text - source TEXT NOT NULL DEFAULT 'statuspage_json', -- "statuspage_json", "google_json", "slack_api", "rss", "manual" + source TEXT NOT NULL DEFAULT 'statuspage_json', -- "statuspage_json", "product_feed_json", "current_status_api", "rss", "manual" vendor_incident_id TEXT, -- vendor's incident ID for deduplication created_at DATETIME DEFAULT CURRENT_TIMESTAMP ); @@ -201,33 +201,33 @@ These vendors use Atlassian Statuspage. 
Poll their `/api/v2/summary.json` endpoi | Service | Status Page Base URL | summary.json URL | Component to Match | Category | |---------|---------------------|-------------------|--------------------|----------| -| Box | status.box.com | `https://status.box.com/api/v2/summary.json` | Match overall page status or relevant component | productivity | -| Okta | status.okta.com | `https://status.okta.com/api/v2/summary.json` | Use page-level status (Okta has cell-specific pages; use main) | identity | -| Duo | status.duo.com | `https://status.duo.com/api/v2/summary.json` | "Duo Security" or page-level | identity | -| DocuSign | status.docusign.com | `https://status.docusign.com/api/v2/summary.json` | Page-level or "eSignature" component | productivity | -| Zoom | status.zoom.us | `https://status.zoom.us/api/v2/summary.json` | "Zoom Meetings", "Zoom Phone", or page-level | collaboration | -| Concur | open.concur.com (verify) | `https://open.concur.com/api/v2/summary.json` | Page-level | finance | -| Conga | (verify Statuspage URL) | Verify at runtime — may use custom domain | Page-level | productivity | -| SnapLogic | status.snaplogic.com | `https://status.snaplogic.com/api/v2/summary.json` | Page-level | engineering | -| Zuora | status.zuora.com | `https://status.zuora.com/api/v2/summary.json` | Page-level | finance | -| Cornerstone | status.csod.com (verify) | Verify at runtime | Page-level | hr | -| Iterable | status.iterable.com | `https://status.iterable.com/api/v2/summary.json` | Page-level | marketing | -| Marketo | status.adobe.com (verify) | Verify — Marketo may be under Adobe's status page | "Marketo Engage" component | marketing | -| Greenhouse | status.greenhouse.io (verify) | Verify at runtime | Page-level | hr | -| Teem (iOFFICE) | (verify) | Verify — Teem was acquired by iOFFICE | Page-level | productivity | -| Salesforce | status.salesforce.com | `https://status.salesforce.com/api/v2/summary.json` | Page-level or instance-specific | sales | -| Zendesk | 
status.zendesk.com | `https://status.zendesk.com/api/v2/summary.json` | Page-level | support | - -**IMPORTANT: Claude Code must verify every URL during Phase 0 by running `curl ` and confirming valid JSON is returned. Some URLs may have changed or use custom Statuspage domains. If a URL fails, search for ` status page` and find the correct Statuspage.io URL, then check if `/api/v2/summary.json` is accessible.** - -### Poll Type: `slack_api` -Slack has its own dedicated status API, not Statuspage. +| Content platform | (vendor status domain — verify at runtime) | `https:///api/v2/summary.json` | Match overall page status or relevant component | productivity | +| Identity provider (SSO) | (vendor status domain — verify at runtime) | `https:///api/v2/summary.json` | Use page-level status (vendor has cell-specific pages; use main) | identity | +| MFA | (vendor status domain — verify at runtime) | `https:///api/v2/summary.json` | "MFA Security" or page-level | identity | +| E-signature tool | (vendor status domain — verify at runtime) | `https:///api/v2/summary.json` | Page-level or "eSignature" component | productivity | +| Video conferencing | (vendor status domain — verify at runtime) | `https:///api/v2/summary.json` | "Meetings", "Phone", or page-level | collaboration | +| Finance tools (expense) | (vendor status domain — verify at runtime) | `https:///api/v2/summary.json` | Page-level | finance | +| Document automation | (verify Statuspage URL) | Verify at runtime — may use custom domain | Page-level | productivity | +| Integration platform | (vendor status domain — verify at runtime) | `https:///api/v2/summary.json` | Page-level | engineering | +| Finance tools (billing) | (vendor status domain — verify at runtime) | `https:///api/v2/summary.json` | Page-level | finance | +| HR tools (LMS) | (vendor status domain — verify at runtime) | Verify at runtime | Page-level | hr | +| Marketing tools (email) | (vendor status domain — verify at runtime) | 
`https:///api/v2/summary.json` | Page-level | marketing | +| Marketing tools (automation) | (vendor status domain — verify at runtime) | Verify — may be under a parent vendor's status page | "Marketing Automation" component | marketing | +| HR tools (ATS) | (vendor status domain — verify at runtime) | Verify at runtime | Page-level | hr | +| Space management | (verify) | Verify — may have been acquired | Page-level | productivity | +| CRM | (vendor status domain — verify at runtime) | `https:///api/v2/summary.json` | Page-level or instance-specific | sales | +| Support platform | (vendor status domain — verify at runtime) | `https:///api/v2/summary.json` | Page-level | support | + +**IMPORTANT: Claude Code must verify every URL during Phase 0 by running `curl ` and confirming valid JSON is returned. Some URLs may have changed or use custom Statuspage domains. If a URL fails, search for the vendor's status page and find the correct Statuspage.io URL, then check if `/api/v2/summary.json` is accessible.** + +### Poll Type: `current_status_api` +Some collaboration vendors expose a dedicated current-status API, not Statuspage. | Service | API URL | Category | |---------|---------|----------| -| Slack | `https://slack-status.com/api/v2.0.0/current` | collaboration | +| Chat platform | `https://chat-status.example.com/api/v2.0.0/current` | collaboration | -**Slack Status API response shape:** +**Current-status API response shape:** ```json { "status": "ok|active", @@ -241,7 +241,7 @@ Slack has its own dedicated status API, not Statuspage. "title": "Some customers may experience...", "type": "incident|notice|maintenance", "status": "active|resolved", - "url": "https://status.slack.com/2024-01/...", + "url": "https://status.example.com/2024-01/...", "services": ["Login/SSO", "Messaging", "Connections", ...], "notes": [{ "date_created": "...", "body": "..." }] } @@ -250,55 +250,55 @@ Slack has its own dedicated status API, not Statuspage. 
``` When `status` is `"ok"` and `active_incidents` is empty → operational. Otherwise map by incident type/impact. -### Poll Type: `google_json` -Google Workspace uses its own status dashboard with a JSON feed and RSS. +### Poll Type: `product_feed_json` +A cloud productivity suite uses its own status dashboard with a JSON feed and RSS. | Service | JSON URL | RSS URL | Category | |---------|----------|---------|----------| -| Google Mail | `https://www.google.com/appsstatus/dashboard/incidents.json` | `https://www.google.com/appsstatus/rss/en` | productivity | -| Google Calendar | (same JSON feed — filter by product) | (same RSS feed) | productivity | +| Cloud Mail | `https://feed.example.com/incidents.json` | `https://feed.example.com/rss` | productivity | +| Cloud Calendar | (same JSON feed — filter by product) | (same RSS feed) | productivity | -**Google Workspace JSON feed:** Contains incidents for ALL Google Workspace products. Filter by `service_name` field matching "Gmail" or "Google Calendar". The feed provides `most_recent_update.status` which maps to severity. RSS feed at `https://www.google.com/appsstatus/rss/en` is a fallback. +**Cloud productivity suite JSON feed:** Contains incidents for ALL products in the suite. Filter by `service_name` field matching the relevant products. The feed provides `most_recent_update.status` which maps to severity. RSS feed at `https://feed.example.com/rss` is a fallback. -**Note:** Google's JSON feed returns incident *history*, not a real-time component status like Statuspage. For current status: if no active (non-resolved) incidents exist for the product → operational. If active incidents exist → map severity. +**Note:** The cloud productivity suite's JSON feed returns incident *history*, not a real-time component status like Statuspage. For current status: if no active (non-resolved) incidents exist for the product → operational. If active incidents exist → map severity. 
### Poll Type: `rss` Fallback for any service that has an RSS/Atom feed but not a known JSON API. | Service | Feed URL | Category | |---------|----------|----------| -| RingCentral | (find RSS URL from status page) | collaboration | +| Telephony | (find RSS URL from status page) | collaboration | ### Poll Type: `manual` These services have no automated monitoring feeds. IT engineers update status via `curl POST`. | Service | Status Page URL (for linking) | Category | |---------|-------------------------------|----------| -| Confluence | status.atlassian.com (has JSON API — consider upgrading to statuspage_json) | engineering | -| Jira | status.atlassian.com (same — consider statuspage_json with component filter) | engineering | -| ServiceDesk (JSM) | status.atlassian.com (same) | engineering | -| Coupa | (no known status page) | finance | -| Juniper VPN | (no public status page) | networking | -| Lithium | (no known status page) | other | -| Netsuite | (verify — Oracle may have a status page) | finance | -| Workday | (verify — Workday has a trust site) | hr | -| Partnerportal | (Salesforce instance — use Salesforce status) | sales | - -**NOTE: Confluence, Jira, and ServiceDesk are all Atlassian products. Atlassian has a Statuspage.io-based status page at `https://status.atlassian.com/api/v2/summary.json`. Claude Code should verify this API and if accessible, upgrade these from `manual` to `statuspage_json` with appropriate component name filters. 
This would reduce manual services from ~10 to ~7.** +| Team wiki | status.atlassian.com (has JSON API — consider upgrading to statuspage_json) | engineering | +| Ticketing / ITSM system | status.atlassian.com (same — consider statuspage_json with component filter) | engineering | +| ServiceDesk | status.atlassian.com (same) | engineering | +| Finance tools (procurement) | (no known status page) | finance | +| VPN | (no public status page) | networking | +| Community platform | (no known status page) | other | +| Finance tools (ERP) | (verify — vendor may have a status page) | finance | +| HR system | (verify — vendor has a trust site) | hr | +| Partner portal | (CRM instance — use CRM status) | sales | + +**NOTE: The team wiki, ticketing / ITSM system, and ServiceDesk are all products from the same vendor (Atlassian). Atlassian has a Statuspage.io-based status page at `https://status.atlassian.com/api/v2/summary.json`. Claude Code should verify this API and if accessible, upgrade these from `manual` to `statuspage_json` with appropriate component name filters. 
This would reduce manual services from ~10 to ~7.** ### Service Categories ```yaml categories: - identity: "Identity & Access" # Okta, Duo - productivity: "Productivity" # Box, DocuSign, Google Mail, Google Calendar, Teem, Conga - collaboration: "Collaboration" # Slack, Zoom, RingCentral - engineering: "Engineering" # Jira, Confluence, ServiceDesk, SnapLogic - hr: "HR & People" # Greenhouse, Workday, Cornerstone - finance: "Finance" # Concur, Coupa, Netsuite, Zuora - sales: "Sales & CRM" # Salesforce, Partnerportal - marketing: "Marketing" # Iterable, Marketo - networking: "Network & VPN" # Juniper VPN - support: "Support" # Zendesk + identity: "Identity & Access" # identity provider (SSO), MFA + productivity: "Productivity" # content platform, e-signature, cloud mail, cloud calendar, space management, document automation + collaboration: "Collaboration" # chat platform, video conferencing, telephony + engineering: "Engineering" # ticketing / ITSM system, team wiki, ServiceDesk, integration platform + hr: "HR & People" # HR tools (ATS), HR system, HR tools (LMS) + finance: "Finance" # finance tools (expense), finance tools (procurement), finance tools (ERP), finance tools (billing) + sales: "Sales & CRM" # CRM, partner portal + marketing: "Marketing" # marketing tools (email), marketing tools (automation) + networking: "Network & VPN" # VPN + support: "Support" # support platform ``` --- @@ -338,9 +338,9 @@ STATUSPAGE_INDICATOR_MAP = { } ``` -### Slack Status API Mapping +### Current-status API mapping ```python -def normalize_slack_status(response: dict) -> ServiceStatus: +def normalize_current_status(response: dict) -> ServiceStatus: if response["status"] == "ok" and not response.get("active_incidents"): return ServiceStatus.OPERATIONAL incidents = response.get("active_incidents", []) @@ -353,9 +353,9 @@ def normalize_slack_status(response: dict) -> ServiceStatus: return ServiceStatus.DEGRADED # default if active but unknown type ``` -### Google Workspace Mapping 
+### Product-feed mapping ```python -# Google incidents have severity levels in updates +# Cloud productivity suite incidents have severity levels in updates # If no active incident for the product → OPERATIONAL # If active incident exists → DEGRADED (default), escalate based on description keywords ``` @@ -378,69 +378,69 @@ RSS_TITLE_KEYWORDS = { # Format: upstream → downstream (when upstream breaks, downstream is impacted) dependencies: # Identity & Access — highest blast radius - okta: - - service: box - impact: "Box SSO login unavailable" + identity-provider: + - service: content-platform + impact: "Content platform SSO login unavailable" severity: critical - - service: slack - impact: "Slack SSO login may fail for new sessions" + - service: chat-platform + impact: "Chat platform SSO login may fail for new sessions" severity: critical - - service: zoom - impact: "Zoom SSO login unavailable" + - service: video-conferencing + impact: "Video conferencing SSO login unavailable" severity: critical - - service: salesforce - impact: "Salesforce SSO login unavailable" + - service: crm + impact: "CRM SSO login unavailable" severity: critical - - service: jira - impact: "Jira SSO login unavailable" + - service: itsm + impact: "Ticketing / ITSM system SSO login unavailable" severity: high - - service: confluence - impact: "Confluence SSO login unavailable" + - service: team-wiki + impact: "Team wiki SSO login unavailable" severity: high - - service: concur - impact: "Concur SSO login unavailable" + - service: finance-expense + impact: "Finance tools (expense) SSO login unavailable" severity: high - - service: workday - impact: "Workday SSO login unavailable" + - service: hr-system + impact: "HR system SSO login unavailable" severity: high - - service: greenhouse - impact: "Greenhouse SSO login unavailable" + - service: hr-ats + impact: "HR tools (ATS) SSO login unavailable" severity: medium - - service: docusign - impact: "DocuSign SSO login unavailable" + - service: 
esignature + impact: "E-signature tool SSO login unavailable" severity: medium - - service: zendesk - impact: "Zendesk SSO login unavailable" + - service: support-platform + impact: "Support platform SSO login unavailable" severity: medium - - service: netsuite - impact: "NetSuite SSO login unavailable" + - service: finance-erp + impact: "Finance tools (ERP) SSO login unavailable" severity: medium - duo: - - service: okta - impact: "MFA push notifications unavailable; Okta login may require fallback methods" + mfa: + - service: identity-provider + impact: "MFA push notifications unavailable; identity provider login may require fallback methods" severity: critical # Collaboration dependencies - slack: + chat-platform: - service: servicedesk - impact: "Slack-based IT support channel and Aisera bot unavailable" + impact: "Chat-based IT support channel and bot unavailable" severity: high - # Google Workspace - google-mail: - - service: google-calendar + # Cloud productivity suite + cloud-mail: + - service: cloud-calendar impact: "Calendar notifications and email invites may be delayed" severity: medium # Sales & CRM - salesforce: - - service: partnerportal - impact: "Partner Portal is hosted on Salesforce — full outage expected" + crm: + - service: partner-portal + impact: "Partner Portal is hosted on the CRM — full outage expected" severity: critical # Network - juniper-vpn: + vpn: - service: all_internal impact: "VPN outage affects remote access to all internal services" severity: critical @@ -467,17 +467,17 @@ TEMPLATES = { "with_downstream": ( " This may impact: {downstream_list}." ), - "okta_degraded": ( - "Okta is reporting degraded performance. SSO authentication " + "sso_degraded": ( + "The identity provider (SSO) is reporting degraded performance. SSO authentication " "for all SaaS applications may be affected. Impacted services: {downstream_list}." ), - "okta_outage": ( - "⚠️ Okta is experiencing an outage. SSO authentication is unavailable. 
" + "sso_outage": ( + "⚠️ The identity provider (SSO) is experiencing an outage. SSO authentication is unavailable. " "Users cannot log into: {downstream_list}. " "Advise users with active sessions to avoid logging out." ), "vpn_outage": ( - "⚠️ VPN (Juniper) is experiencing an outage. " + "⚠️ VPN is experiencing an outage. " "Remote users cannot access internal services. " "On-site users are not affected." ), @@ -493,7 +493,7 @@ TEMPLATES = { ## Slack Alert Format (Block Kit) -Alerts posted to #service-validation use Slack Block Kit for rich formatting: +Alerts posted to the ops-alert channel use Slack Block Kit for rich formatting: ```python def build_slack_alert(service_name: str, old_status: str, new_status: str, @@ -702,14 +702,14 @@ npm install -D tailwindcss @tailwindcss/vite **In scope (v1 demo):** - Unified status board for all ~30 services - Statuspage.io JSON API polling for ~20 services -- Slack Status API polling -- Google Workspace JSON/RSS polling +- Chat-platform status API polling +- Cloud productivity suite JSON/RSS polling - Manual status update API for remaining services - Service dependency graph with impact statement templates - Timeline view of recent status changes - Situation banner with template-generated summary - Scheduled maintenance tracking and display -- Slack Block Kit alerts to #service-validation on status changes +- Slack Block Kit alerts to the ops-alert channel on status changes - Auto-refresh dashboard (30s polling) - Service categorization and grouped display - "Last updated" indicator showing poll freshness @@ -739,7 +739,7 @@ npm install -D tailwindcss @tailwindcss/vite - **Slack webhook URL:** env var `SLACK_WEBHOOK_URL` — never in git - **All vendor status APIs:** public, no auth - **No user data collected:** dashboard is read-only, no PII -- **Network boundary:** Mac Mini on corporate network, VPN-only access +- **Network boundary:** Mac Mini on the internal network, internal-network-only access - **SQLite:** local file 
on Mac Mini, not exposed - `.env` file in `.gitignore`, `.env.example` committed as template @@ -758,7 +758,7 @@ npm install -D tailwindcss @tailwindcss/vite 5. Create `config/services.yaml` with all ~30 services — **Acceptance:** File contains every service from the catalog above. For each statuspage_json service, the `poll_url` has been verified by running `curl ` and confirming valid JSON response. Services with unverified URLs are noted with comments. 6. Create `config/dependencies.yaml` — **Acceptance:** Contains the full dependency graph from this roadmap 7. Implement YAML loader + DB seeder — **Acceptance:** Run seeder → `SELECT count(*) FROM services` returns correct count (≥28) -8. Implement `statuspage_poller.py` — poll ONE service (Okta) via `https://status.okta.com/api/v2/summary.json` — **Acceptance:** Returns parsed status, prints component statuses to console +8. Implement `statuspage_poller.py` — poll ONE service (identity provider) via its Statuspage.io summary URL — **Acceptance:** Returns parsed status, prints component statuses to console 9. 
Implement `normalizer.py` with all mapping tables — **Acceptance:** `pytest tests/test_normalizer.py` passes with tests for every vendor mapping **Verification checklist:** @@ -767,11 +767,11 @@ npm install -D tailwindcss @tailwindcss/vite - [ ] `sqlite3 data.db "SELECT count(*) FROM services"` → ≥28 - [ ] `sqlite3 data.db "SELECT id, poll_type FROM services WHERE poll_type='statuspage_json'"` → ~16-20 rows - [ ] `pytest tests/test_normalizer.py` → all pass -- [ ] Manual test: `python -c "import asyncio; from app.poller.statuspage_poller import poll_service; asyncio.run(poll_service('okta'))"` → prints Okta status +- [ ] Manual test: `python -c "import asyncio; from app.poller.statuspage_poller import poll_service; asyncio.run(poll_service('identity-provider'))"` → prints identity provider status **Risks:** - Some Statuspage.io URLs may have changed or use custom domains → **Mitigation:** Phase 0 Task 5 explicitly requires verifying each URL. Document any failures and fall back to RSS or manual. -- Google's JSON feed URL may not be publicly documented and could change → **Mitigation:** Fall back to RSS at `https://www.google.com/appsstatus/rss/en` +- The cloud productivity suite JSON feed URL may not be publicly documented and could change → **Mitigation:** Fall back to RSS at `https://feed.example.com/rss` --- @@ -782,14 +782,14 @@ npm install -D tailwindcss @tailwindcss/vite **Tasks:** 1. Extend `statuspage_poller.py` to handle ALL statuspage_json services in a single poll cycle — **Acceptance:** One async function iterates `services.yaml`, polls each statuspage_json service, handles errors per-service (one failure doesn't stop others) -2. Implement `slack_poller.py` for Slack Status API — **Acceptance:** Polls `https://slack-status.com/api/v2.0.0/current`, normalizes response to ServiceStatus -3. Implement `google_poller.py` for Google Workspace — **Acceptance:** Fetches Google JSON/RSS feed, filters for Gmail and Calendar, returns per-product status +2. 
Implement `current_status_poller.py` for the chat-platform status API — **Acceptance:** Polls `https://chat-status.example.com/api/v2.0.0/current`, normalizes response to ServiceStatus +3. Implement `product_feed_poller.py` for the cloud productivity suite — **Acceptance:** Fetches JSON/RSS feed, filters for Cloud Mail and Cloud Calendar, returns per-product status 4. Implement `change_detector.py` — **Acceptance:** Compares poll result against `services.current_status` in DB; on change: inserts `status_events` row, updates `services` row, returns list of changes -5. Implement `dependencies/graph.py` — **Acceptance:** `test_dependencies.py` passes; `get_downstream("okta")` returns all SSO-dependent services with impact descriptions -6. Implement `alerting/templates.py` — **Acceptance:** `test_templates.py` passes; generates correct impact statements for Okta outage, VPN outage, generic service degradation -7. Implement `alerting/slack.py` with Block Kit formatting — **Acceptance:** Trigger a test change → message appears in #service-validation with header, status fields, impact text, and button linking to vendor status page +5. Implement `dependencies/graph.py` — **Acceptance:** `test_dependencies.py` passes; `get_downstream("identity-provider")` returns all SSO-dependent services with impact descriptions +6. Implement `alerting/templates.py` — **Acceptance:** `test_templates.py` passes; generates correct impact statements for identity provider outage, VPN outage, generic service degradation +7. Implement `alerting/slack.py` with Block Kit formatting — **Acceptance:** Trigger a test change → message appears in the ops-alert channel with header, status fields, impact text, and button linking to vendor status page 8. Implement `scheduler.py` tying it all together — **Acceptance:** On app startup, scheduler begins 60s poll cycle. Logs show all services polled. No unhandled exceptions. -9. 
Implement `router_admin.py` POST `/api/admin/status` — **Acceptance:** `curl -X POST localhost:8000/api/admin/status -H 'Content-Type: application/json' -d '{"service_id":"workday","new_status":"degraded","detail":"Slow response times"}'` → returns updated service, creates status_event, triggers Slack alert +9. Implement `router_admin.py` POST `/api/admin/status` — **Acceptance:** `curl -X POST localhost:8000/api/admin/status -H 'Content-Type: application/json' -d '{"service_id":"hr-system","new_status":"degraded","detail":"Slow response times"}'` → returns updated service, creates status_event, triggers Slack alert 10. Implement all GET API endpoints (services, timeline, summary, maintenance) — **Acceptance:** Each returns correct JSON matching the Pydantic response models **Verification checklist:** @@ -797,7 +797,7 @@ npm install -D tailwindcss @tailwindcss/vite - [ ] `curl localhost:8000/api/timeline` → events (test with manual status change if no real incidents) - [ ] `curl localhost:8000/api/summary` → correct counts, active incidents list, maintenance list - [ ] `curl localhost:8000/api/maintenance` → upcoming maintenances from Statuspage.io -- [ ] POST a degraded status manually → Slack Block Kit message appears in #service-validation within 5 seconds +- [ ] POST a degraded status manually → Slack Block Kit message appears in the ops-alert channel within 5 seconds - [ ] Wait 2 minutes → see at least 2 poll cycles in logs, no errors - [ ] `pytest tests/` → all tests pass @@ -852,15 +852,15 @@ npm install -D tailwindcss @tailwindcss/vite 2. Configure FastAPI to serve static frontend — **Acceptance:** Mount `dist/` as static files at `/`; `curl localhost:8000/` returns `index.html` 3. Create `scripts/seed_demo_data.py` — **Acceptance:** Seeds 5-7 historical incidents over the past 7 days across different services with realistic timestamps, status progressions (investigating → identified → monitoring → resolved), and impact statements. 
Timeline view looks populated, not empty. 4. Create `.env.example` — **Acceptance:** Contains `SLACK_WEBHOOK_URL=`, `DATABASE_PATH=./data.db`, `POLL_INTERVAL_SECONDS=60`, `HOST=0.0.0.0`, `PORT=8000` -5. Create `com.box.it-health-dashboard.plist` launchd service file — **Acceptance:** Starts on boot, restarts on crash, logs to `/var/log/it-health-dashboard.log` -6. Deploy to Mac Mini — **Acceptance:** Clone repo, install deps, configure `.env`, load launchd plist, verify `curl :8000/api/health` from another machine on VPN -7. Open macOS firewall for port 8000 — **Acceptance:** Dashboard accessible from another laptop on VPN -8. End-to-end smoke test — **Acceptance:** From a different machine: load dashboard, see live statuses, verify at least 16+ services show non-"unknown" statuses, trigger manual status change, see Slack alert + dashboard update within 60s +5. Create `com.company.it-health-dashboard.plist` launchd service file — **Acceptance:** Starts on boot, restarts on crash, logs to `/var/log/it-health-dashboard.log` +6. Deploy to Mac Mini — **Acceptance:** Clone repo, install deps, configure `.env`, load launchd plist, verify `curl :8000/api/health` from another machine on the internal network +7. Open macOS firewall for port 8000 — **Acceptance:** Dashboard accessible from another laptop on the internal network +8. End-to-end smoke test — **Acceptance:** From a different machine on the internal network: load dashboard, see live statuses, verify at least 16+ services show non-"unknown" statuses, trigger manual status change, see Slack alert + dashboard update within 60s 9. Write `README.md` — **Acceptance:** Contains: project overview, architecture diagram (text), how to access (URL), what it shows, how to manually update services (curl examples), environment setup instructions, what's planned next (v2 features) -10. 
Prepare demo script — **Acceptance:** 2-3 talking points for Mark: (a) show live dashboard with real statuses, (b) trigger a simulated incident and show Slack alert + dashboard update, (c) click a service to show dependency mapping +10. Prepare demo script — **Acceptance:** 2-3 talking points: (a) show live dashboard with real statuses, (b) trigger a simulated incident and show Slack alert + dashboard update, (c) click a service to show dependency mapping **Verification checklist:** -- [ ] From another laptop on VPN: `http://:8000` → dashboard loads +- [ ] From another laptop on the internal network: `http://:8000` → dashboard loads - [ ] ≥16 services show live statuses (not all "unknown") - [ ] Manual-only services show "unknown" or manually-set statuses - [ ] Timeline shows seeded + any real events @@ -870,7 +870,7 @@ npm install -D tailwindcss @tailwindcss/vite **Risks:** - Mac Mini firewall blocking inbound → **Mitigation:** Run `sudo /usr/libexec/ApplicationFirewall/socketfilterfw --add /usr/local/bin/python3` or disable firewall for port 8000 specifically. Test from another machine early in Phase 3. -- DNS/hostname on VPN → **Mitigation:** Use raw IP for demo; request DNS alias from network team if project continues +- DNS/hostname on the internal network → **Mitigation:** Use raw IP for demo; request DNS alias from network team if project continues - stale data after Mac Mini sleep → **Mitigation:** Disable sleep in System Preferences → Energy Saver. Verify poll cycle resumes after network reconnect. 
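One way to check the stale-data risk above is to compare the last successful poll time against the poll interval. The sketch below is only an illustration: the function name, the `missed_cycles` threshold, and the idea of storing a per-service last-poll timestamp are assumptions, not something this roadmap specifies.

```python
from datetime import datetime, timedelta, timezone


def is_stale(
    last_polled_at: datetime,
    now: datetime,
    poll_interval_seconds: int = 60,
    missed_cycles: int = 3,
) -> bool:
    """Return True when more than `missed_cycles` poll intervals have elapsed."""
    allowed_gap = timedelta(seconds=poll_interval_seconds * missed_cycles)
    return now - last_polled_at > allowed_gap


# Example: a host that slept for ten minutes shows up as stale.
woke_at = datetime(2024, 1, 1, 12, 10, tzinfo=timezone.utc)
last_poll = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(is_stale(last_poll, woke_at))
```

Passing `now` in explicitly keeps the check deterministic in tests and avoids mixing naive and timezone-aware timestamps.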
--- @@ -884,11 +884,11 @@ npm install -D tailwindcss @tailwindcss/vite **v3 — Internal Signal Correlation (Weeks 5-8):** - Splunk integration: pull auth failure logs, network errors, app-specific events -- JSM ticket correlation: count open tickets mentioning affected service names -- Dashboard enrichment: "Okta degraded + 47 SSO tickets in last 30min + Splunk showing auth failures" +- ITSM ticket correlation: count open tickets mentioning affected service names +- Dashboard enrichment: "Identity provider (SSO) degraded + 47 SSO tickets in last 30min + Splunk showing auth failures" **v4 — Proactive Detection + Slack Bot (Weeks 9-12):** - ThousandEyes + Datadog integration for network and APM signals - Anomaly detection: alert when ticket volume spikes before vendor status page updates -- Slack bot: `@it-agent what's going on with Okta?` → returns correlated intelligence +- Slack bot: `@it-agent what's going on with the identity provider?` → returns correlated intelligence - GitHub change correlation: "These 3 config changes were deployed in the last hour" diff --git a/PRODUCTION-ROADMAP.md b/PRODUCTION-ROADMAP.md index bf0b3fb..5f810b9 100644 --- a/PRODUCTION-ROADMAP.md +++ b/PRODUCTION-ROADMAP.md @@ -110,7 +110,7 @@ Alert fatigue is the #1 killer of status dashboards. Current pipeline fires on e ### Severity routing (config-driven) - [x] `tier` (`critical | important | informational`) + `slack_channel_override` added to `services.yaml`, `ServiceConfig`, and the DB via migration 0006. -- [x] `okta`, `duo`, `slack` tagged `critical`; everything else defaults to `important` so operators explicitly elect services into the `@here` tier. +- [x] The identity provider (SSO), MFA, and chat platform tagged `critical`; everything else defaults to `important` so operators explicitly elect services into the `@here` tier. 
- [x] `route_status_change()` applies routing: - `critical` → Slack + `` mention - `important` → Slack, no mention @@ -218,7 +218,7 @@ If the app goes down, nobody knows. Fix meta-monitoring. ### Backup: Litestream - [x] `deploy/litestream.yml.example` — template supporting local-file, S3, and SFTP replicas (operator picks one). -- [x] `deploy/com.box.it-health-dashboard-litestream.plist.example` — sidecar launchd daemon with `KeepAlive` dict form, `ThrottleInterval`, macOS Keychain-sourced credentials (never hardcoded). +- [x] `deploy/com.company.it-health-dashboard-litestream.plist.example` — sidecar launchd daemon with `KeepAlive` dict form, `ThrottleInterval`, macOS Keychain-sourced credentials (never hardcoded). - [x] README "Backup & Disaster Recovery" section documents install → validate → replicate → restore with exact commands. ### Retention @@ -316,7 +316,7 @@ If the app goes down, nobody knows. Fix meta-monitoring. - [x] End-to-end integration — new `tests/test_e2e_pipeline.py::test_poll_change_db_alert_pipeline`. Drives three `respx`-mocked polls through `poll_statuspage → detect_changes → process_changes → Slack POST`, asserts flap suppression holds the first major_outage reading, change emits on the second, Slack webhook receives one POST with ``, and `alerts_sent_total{kind=status_change,severity=critical}` increments exactly once. ### launchd hardening -- [x] `com.box.it-health-dashboard.plist` rewritten with dict-form `KeepAlive` (`SuccessfulExit=false`, `Crashed=true`) so deliberate stops stick. +- [x] `com.company.it-health-dashboard.plist` rewritten with dict-form `KeepAlive` (`SuccessfulExit=false`, `Crashed=true`) so deliberate stops stick. - [x] `ThrottleInterval=30` prevents crash-loop pegging on bad config. - [x] `PYTHONUNBUFFERED=1` so stdout reaches the log file in real time. - [x] `ProcessType=Background`, `SoftResourceLimits.NumberOfFiles=4096`. @@ -353,7 +353,7 @@ If the app goes down, nobody knows. Fix meta-monitoring. 
- **Postmortem automation** — Google-SRE-template Markdown per incident, committed to a repo (Summary → Impact → Root Cause → Timeline → What Went Well/Poorly/Lucky → Action Items categorized Prevent/Mitigate/Detect/Repair). - **SLO view** — Grafana-style fuel gauge (remaining error budget) + burn-rate line with 1× / 6× / 14.4× thresholds per tier. - **Multi-burn-rate alerting** — Google SRE canonical pattern: require both long and short window to breach before paging. -- **Slack bot** — `/itstatus okta` slash command, natural-language deferred to post-LLM phase. +- **Slack bot** — `/itstatus` slash command (e.g. query by service name), natural-language deferred to post-LLM phase. --- diff --git a/README.md b/README.md index 856a0a3..aefadc4 100644 --- a/README.md +++ b/README.md @@ -1,13 +1,13 @@ # IT Service Health Dashboard -Real-time status monitoring dashboard for ~30 SaaS services used by Box IT. Polls vendor status pages every 60 seconds, detects changes, generates impact statements using a service dependency graph, posts Slack alerts, and displays a unified dark-themed operations dashboard. +Real-time status monitoring dashboard for ~30 SaaS services used across an enterprise IT environment. Polls vendor status pages every 60 seconds, detects changes, generates impact statements using a service dependency graph, posts Slack alerts, and displays a unified dark-themed operations dashboard. ## Project status - **v1 (demo-ready) — SHIPPED.** All original spec delivered: polling, normalization, change detection, Slack alerting, React UI, dependency graph, timeline, SLA tracking, incident clustering, auto reports. 
-- **v2 (production-ready) — SHIPPED.** Phases 0–6 of the production roadmap complete: bearer-token auth, vendor resilience (stamina + purgatory), alert quality (flap suppression, dedup, tier routing, dependency correlation, maintenance windows, flapping-badge UI), observability (structlog, Prometheus `/metrics`, Sentry, Healthchecks.io dead-man's switch), data lifecycle (production pragmas, retention, Litestream streaming + daily `VACUUM INTO` snapshot), UX productionization (severity-sorted grid, distinct poller-broken state, a11y + keyboard nav, Executive/Engineer view toggle, PWA, `recharts` SLA trend), and platform polish (CI, pre-commit, hardened launchd plist, Caddy, Keychain secrets). **356 tests passing.** +- **v2 (production-ready) — SHIPPED.** Phases 0–6 of the production roadmap complete: bearer-token auth, vendor resilience (stamina + purgatory), alert quality (flap suppression, dedup, tier routing, dependency correlation, maintenance windows, flapping-badge UI), observability (structlog, Prometheus `/metrics`, Sentry, Healthchecks.io dead-man's switch), data lifecycle (production pragmas, retention, Litestream streaming + daily `VACUUM INTO` snapshot), UX productionization (severity-sorted grid, distinct poller-broken state, a11y + keyboard nav, Executive/Engineer view toggle, PWA, `recharts` SLA trend), and platform polish (CI, pre-commit, hardened launchd plist, Caddy, Keychain secrets). **378 tests passing.** - **v2 Phase 2B + Phase 7 — in tree, gated off.** Statuspage inbound webhook receiver (`WEBHOOKS_ENABLED`), Slack ack flow (`SLACK_ACK_ENABLED`), postmortem drafts (`POSTMORTEMS_ENABLED`), SLO fuel-gauge + multi-burn-rate alerting (`SLO_BURN_RATE_ENABLED`), and Slack `/itstatus` slash command (`SLACK_SLASH_ENABLED`) all shipped with tests but default off. Flip each flag once its prerequisites are in place (public endpoint for Slack features; postmortems need only a writable `POSTMORTEMS_DIR`). 
-- **v2 Phase 7 remainder — optional.** LLM-layer impact statements, Splunk/JSM/ThousandEyes integration. Not on a fixed schedule; add as demand emerges. +- **v2 Phase 7 remainder — optional.** LLM-layer impact statements; log-aggregation / ITSM / synthetic-monitoring integrations. Not on a fixed schedule; add as demand emerges. **Active roadmap:** [PRODUCTION-ROADMAP.md](./PRODUCTION-ROADMAP.md) — exit-criteria detail for every phase. **Historical spec:** [IMPLEMENTATION-ROADMAP.md](./IMPLEMENTATION-ROADMAP.md) — archived; v1 is complete. @@ -17,8 +17,8 @@ Real-time status monitoring dashboard for ~30 SaaS services used by Box IT. Poll ``` [Vendor Status Pages] |-- Statuspage.io JSON API (15 services) - |-- Slack Status API (1 service) - |-- Google Workspace JSON feed (2 services) + |-- Chat vendor status API (1 service) + |-- Productivity suite JSON feed (2 services) |-- Manual updates via POST /api/admin/status (11 services) | (async poll every 60s) [Poll Orchestrator] @@ -28,7 +28,7 @@ Real-time status monitoring dashboard for ~30 SaaS services used by Box IT. Poll [Change Detector] --> diff against DB, write status_events | [Impact Statement Engine] --> dependency graph + templates - [Slack Alerter] --> Block Kit message to #service-validation + [Slack Alerter] --> Block Kit message to the ops-alert channel [SQLite Writer] --> update services, insert events | [FastAPI REST API] --> /api/services, /api/timeline, /api/summary @@ -60,32 +60,40 @@ Open `http://localhost:8000` in your browser. ## Accessing the Dashboard -The dashboard runs on a Mac Mini on the corporate network. Access via VPN at: +The dashboard runs on a Mac Mini on the internal network. Access it at: ``` -http://:8000 +http://:8000 ``` -No authentication required — VPN access is the security boundary. +No authentication required — internal-network access is the security boundary. 
## Service Categories -| Category | Services | -|----------|----------| -| Identity & Access | Okta, Duo | -| Productivity | Box, DocuSign, Google Mail, Google Calendar, Conga, Eptura | -| Collaboration | Slack, Zoom, RingCentral | -| Engineering | Confluence, Jira, Jira Service Management, SnapLogic | -| HR & People | Greenhouse, Workday, Cornerstone | -| Finance | SAP Concur, Coupa, NetSuite, Zuora | -| Sales & CRM | Salesforce, Partner Portal | -| Marketing | Iterable, Marketo | -| Network & VPN | Juniper VPN | -| Support | Zendesk, Lithium | +Services are organized into ten categories. The committed example registry +(`backend/config/services.yaml`) ships a generic, runnable set that monitors +public developer-tool status pages, so the dashboard works immediately after +clone: + +| Category | Example services | +|----------|------------------| +| Identity & Access | Identity provider (SSO) | +| Engineering | GitHub, npm, PyPI, Sentry | +| Productivity | Dropbox | +| Collaboration | Discord | +| Network & VPN | Cloudflare | +| Support | Ticketing / ITSM | +| Other | Datadog | + +To monitor your own organization's services, copy the example to a gitignored +`backend/config/services.local.yaml` (the loader prefers it when present) and +list your real registry there — see that file's header for the schema and the +full category list (identity, productivity, collaboration, engineering, HR, +finance, sales, marketing, networking, support). ## Manual Status Updates -For services without automated polling (Okta, Workday, Concur, etc.), update status via curl. **Admin endpoints require a bearer token** (set `ADMIN_API_TOKEN` in your env). +For services without automated polling (e.g. an identity provider, an HR system, or any service with no public status API), update status via curl. **Admin endpoints require a bearer token** (set `ADMIN_API_TOKEN` in your env). 
```bash export TOKEN="your-admin-token" @@ -94,19 +102,19 @@ export TOKEN="your-admin-token" curl -X POST http://localhost:8000/api/admin/status \ -H "Authorization: Bearer $TOKEN" \ -H 'Content-Type: application/json' \ - -d '{"service_id": "workday", "new_status": "degraded", "detail": "Slow login page", "reason": "Reported by user in #it-help"}' + -d '{"service_id": "hr-system", "new_status": "degraded", "detail": "Slow login page", "reason": "Reported by user in the help channel"}' # Set to major outage curl -X POST http://localhost:8000/api/admin/status \ -H "Authorization: Bearer $TOKEN" \ -H 'Content-Type: application/json' \ - -d '{"service_id": "okta", "new_status": "major_outage", "detail": "SSO completely unavailable", "reason": "Confirmed with vendor"}' + -d '{"service_id": "identity-provider", "new_status": "major_outage", "detail": "SSO completely unavailable", "reason": "Confirmed with vendor"}' # Resolve (set back to operational) curl -X POST http://localhost:8000/api/admin/status \ -H "Authorization: Bearer $TOKEN" \ -H 'Content-Type: application/json' \ - -d '{"service_id": "okta", "new_status": "operational", "reason": "Vendor posted recovery"}' + -d '{"service_id": "identity-provider", "new_status": "operational", "reason": "Vendor posted recovery"}' ``` Valid statuses: `operational`, `degraded`, `partial_outage`, `major_outage`, `unknown`. The `reason` field is required for audit trail. 
@@ -115,7 +123,7 @@ Valid statuses: `operational`, `degraded`, `partial_outage`, `major_outage`, `un | Variable | Default | Description | |----------|---------|-------------| -| `SLACK_WEBHOOK_URL` | _(none)_ | Slack incoming webhook URL for #service-validation alerts | +| `SLACK_WEBHOOK_URL` | _(none)_ | Slack incoming webhook URL for ops-alert channel notifications | | `DATABASE_PATH` | `data.db` | SQLite database file path | | `POLL_INTERVAL_SECONDS` | `60` | How often to poll vendor status pages (1–3600) | | `HOST` | `127.0.0.1` | Server bind address (`0.0.0.0` for network access) | @@ -192,13 +200,13 @@ cp .env.example backend/.env # Edit backend/.env: set HOST=0.0.0.0, SLACK_WEBHOOK_URL= # 3. Update plist paths -# Edit com.box.it-health-dashboard.plist: +# Edit com.company.it-health-dashboard.plist: # - Replace /path/to/ with actual project path # - Add SLACK_WEBHOOK_URL # 4. Install launchd service -sudo cp com.box.it-health-dashboard.plist /Library/LaunchDaemons/ -sudo launchctl bootstrap system /Library/LaunchDaemons/com.box.it-health-dashboard.plist +sudo cp com.company.it-health-dashboard.plist /Library/LaunchDaemons/ +sudo launchctl bootstrap system /Library/LaunchDaemons/com.company.it-health-dashboard.plist # 5. Verify curl http://localhost:8000/api/health @@ -210,10 +218,10 @@ sudo /usr/libexec/ApplicationFirewall/socketfilterfw --add $(which python3) Manage the service: ```bash # Stop -sudo launchctl bootout system/com.box.it-health-dashboard +sudo launchctl bootout system/com.company.it-health-dashboard # Start -sudo launchctl bootstrap system /Library/LaunchDaemons/com.box.it-health-dashboard.plist +sudo launchctl bootstrap system /Library/LaunchDaemons/com.company.it-health-dashboard.plist # View logs tail -f /var/log/it-health-dashboard.log @@ -237,9 +245,9 @@ $EDITOR /opt/it-health/deploy/litestream.yml litestream validate -config /opt/it-health/deploy/litestream.yml # 4. 
Install the sidecar launchd daemon -cp deploy/com.box.it-health-dashboard-litestream.plist.example \ - /Library/LaunchDaemons/com.box.it-health-dashboard-litestream.plist -sudo launchctl bootstrap system /Library/LaunchDaemons/com.box.it-health-dashboard-litestream.plist +cp deploy/com.company.it-health-dashboard-litestream.plist.example \ + /Library/LaunchDaemons/com.company.it-health-dashboard-litestream.plist +sudo launchctl bootstrap system /Library/LaunchDaemons/com.company.it-health-dashboard-litestream.plist # 5. Confirm replication is working litestream snapshots -config /opt/it-health/deploy/litestream.yml @@ -251,7 +259,7 @@ Litestream RPO is ~1 second — after the initial snapshot, every WAL frame ship ```bash # 1. Stop the main app so the DB isn't being written to -sudo launchctl bootout system/com.box.it-health-dashboard +sudo launchctl bootout system/com.company.it-health-dashboard # 2. Restore from replica (picks up the latest snapshot + WAL frames) litestream restore -config /opt/it-health/deploy/litestream.yml \ @@ -259,7 +267,7 @@ litestream restore -config /opt/it-health/deploy/litestream.yml \ /opt/it-health/data.db # 3. Start the app — it applies pending migrations on boot and resumes polling -sudo launchctl bootstrap system /Library/LaunchDaemons/com.box.it-health-dashboard.plist +sudo launchctl bootstrap system /Library/LaunchDaemons/com.company.it-health-dashboard.plist ``` ### Data retention @@ -300,4 +308,4 @@ The retention job runs every `RETENTION_INTERVAL_HOURS` (default 168 = weekly) a All production phases (0–6) and the primary Phase 7 reach features are complete. Full exit-criteria history is in [PRODUCTION-ROADMAP.md](./PRODUCTION-ROADMAP.md). Remaining optional work: - **Phase 7 — LLM layer:** Natural-language impact statements; deferred post-Phase-7. -- **Phase 7 — Integrations:** Splunk, ThousandEyes, Datadog, JSM — deferred to org demand. 
+- **Phase 7 — Integrations:** log aggregation, synthetic monitoring, metrics, and ITSM platforms — deferred to demand. diff --git a/backend/app/alerting/routing.py b/backend/app/alerting/routing.py index 84aff02..8970612 100644 --- a/backend/app/alerting/routing.py +++ b/backend/app/alerting/routing.py @@ -43,10 +43,10 @@ class RoutingDecision: should_send: bool webhook_url: str | None - channel_mention: str | None # "" for critical tier, else None + channel_mention: str | None # "" for critical tier, else None dedup_key: str tier: str - suppressed_by: str | None # None if sent, else reason code + suppressed_by: str | None # None if sent, else reason code # If this change was consolidated into an aggregated upstream alert, # name + status of the upstream; caller suppresses the individual alert. aggregated_under: str | None = None @@ -115,7 +115,8 @@ async def was_recently_alerted( async def get_service_tier( - db: aiosqlite.Connection, service_id: str, + db: aiosqlite.Connection, + service_id: str, ) -> tuple[str, str | None]: """Return (tier, slack_channel_override) for a service.""" cursor = await db.execute( @@ -167,7 +168,9 @@ async def route_status_change( # Recoveries to 'operational' skip dedup — users always want to know # "it's back", even if they just saw the outage alert minutes ago. 
if change.new_status != "operational" and await was_recently_alerted( - db, dedup_key, settings.alert_dedup_window_seconds, + db, + dedup_key, + settings.alert_dedup_window_seconds, ): return RoutingDecision( should_send=False, @@ -249,11 +252,13 @@ async def record_alert( # Mirror into Prometheus so operators can scrape alert hygiene trends if decision.suppressed_by: ALERTS_SUPPRESSED_TOTAL.labels( - kind=alert_kind, reason=decision.suppressed_by, + kind=alert_kind, + reason=decision.suppressed_by, ).inc() else: ALERTS_SENT_TOTAL.labels( - kind=alert_kind, severity=decision.tier, + kind=alert_kind, + severity=decision.tier, ).inc() @@ -263,7 +268,7 @@ async def record_alert( def build_slo_burn_rate_dedup_key(service_id: str, severity: str) -> str: - """e.g. 'slo_burn:slack_api:fast' — used by alert_sent_log for dedup.""" + """e.g. 'slo_burn:identity-provider:fast' — used by alert_sent_log for dedup.""" return f"slo_burn:{service_id}:{severity}" @@ -351,16 +356,19 @@ async def record_slo_alert( if decision.suppressed_by: ALERTS_SUPPRESSED_TOTAL.labels( - kind="slo_burn_rate", reason=decision.suppressed_by, + kind="slo_burn_rate", + reason=decision.suppressed_by, ).inc() else: ALERTS_SENT_TOTAL.labels( - kind="slo_burn_rate", severity=decision.tier, + kind="slo_burn_rate", + severity=decision.tier, ).inc() # ── Dependency correlation ────────────────────────────────────────── + async def find_aggregation_candidates( db: aiosqlite.Connection, changes: list[StatusChange], @@ -382,7 +390,8 @@ async def find_aggregation_candidates( # Build quick lookup: which service_ids in this batch are going non-operational? 
affected_ids = { - c.service_id for c in changes + c.service_id + for c in changes if c.new_status != "operational" and c.previous_status == "operational" } if not affected_ids: @@ -407,7 +416,8 @@ async def find_aggregation_candidates( declared_downstream = {row[0] for row in await cursor.fetchall()} consolidated = [ - c for c in changes + c + for c in changes if c.service_id != upstream_change.service_id and c.service_id in declared_downstream and c.service_id in affected_ids diff --git a/backend/app/alerting/templates.py b/backend/app/alerting/templates.py index b78b4fa..046fe6f 100644 --- a/backend/app/alerting/templates.py +++ b/backend/app/alerting/templates.py @@ -4,40 +4,32 @@ and the service dependency graph. Uses simple string templates. """ +from app.config import settings from app.poller.change_detector import StatusChange TEMPLATES = { "single_service_degraded": ( - "{service_name} is reporting degraded performance. " - "{vendor_detail}" - ), - "single_service_partial": ( - "{service_name} is experiencing a partial outage. " - "{vendor_detail}" + "{service_name} is reporting degraded performance. {vendor_detail}" ), + "single_service_partial": ("{service_name} is experiencing a partial outage. {vendor_detail}"), "single_service_major": ( - "\u26a0\ufe0f {service_name} is experiencing a MAJOR OUTAGE. " - "{vendor_detail}" - ), - "with_downstream": ( - " This may impact: {downstream_list}." + "\u26a0\ufe0f {service_name} is experiencing a MAJOR OUTAGE. {vendor_detail}" ), - "okta_degraded": ( - "Okta is reporting degraded performance. SSO authentication " + "with_downstream": (" This may impact: {downstream_list}."), + "sso_degraded": ( + "The identity provider (SSO) is reporting degraded performance. SSO authentication " "for all SaaS applications may be affected. Impacted services: {downstream_list}." ), - "okta_outage": ( - "\u26a0\ufe0f Okta is experiencing an outage. SSO authentication is unavailable. 
" + "sso_outage": ( + "\u26a0\ufe0f The identity provider (SSO) is experiencing an outage. " + "SSO authentication is unavailable. " "Users cannot log into: {downstream_list}. " "Advise users with active sessions to avoid logging out." ), - "recovery": ( - "{service_name} has recovered and is now operational." - ), + "recovery": ("{service_name} has recovered and is now operational."), "overall_healthy": "All {total} monitored services are operational.", "overall_incidents": ( - "{incident_count} active incident(s) across {total} monitored services. " - "{incident_summary}" + "{incident_count} active incident(s) across {total} monitored services. {incident_summary}" ), } @@ -60,12 +52,14 @@ def generate_impact_statement( if change.new_status == "operational": return TEMPLATES["recovery"].replace("{service_name}", change.service_display_name) - # Special case: Okta - if change.service_id == "okta": + # Special case: the SSO / identity broker (configurable via + # SSO_BROKER_SERVICE_ID). An identity-provider outage blocks login to + # everything downstream, so it gets dedicated impact wording. + if settings.sso_broker_service_id and change.service_id == settings.sso_broker_service_id: if change.new_status in ("major_outage", "partial_outage"): - return TEMPLATES["okta_outage"].replace("{downstream_list}", downstream_list) + return TEMPLATES["sso_outage"].replace("{downstream_list}", downstream_list) if change.new_status == "degraded": - return TEMPLATES["okta_degraded"].replace("{downstream_list}", downstream_list) + return TEMPLATES["sso_degraded"].replace("{downstream_list}", downstream_list) # Generic path: pick template by severity template_key = { @@ -85,7 +79,8 @@ def generate_impact_statement( if statement and not statement.endswith((".", "!", "?")): statement += "." 
statement += TEMPLATES["with_downstream"].replace( - "{downstream_list}", downstream_list, + "{downstream_list}", + downstream_list, ) return statement diff --git a/backend/app/config.py b/backend/app/config.py index 42f2903..72387fb 100644 --- a/backend/app/config.py +++ b/backend/app/config.py @@ -83,6 +83,13 @@ class Settings(BaseSettings): # changes state, emit one aggregated alert instead of one per dependent. dependency_correlation_threshold: int = Field(default=3, gt=0, le=100) + # The service id that acts as the SSO / identity broker. When a service + # with this id changes state, impact statements use the dedicated SSO + # template (an identity-provider outage blocks login to everything + # downstream). Leave unset to disable the special case — e.g. set + # SSO_BROKER_SERVICE_ID= in the environment. + sso_broker_service_id: str | None = None + # Observability (Phase 3) # Pretty console output in dev, JSON in prod. JSON is cheap to parse # and preserves contextvars (poll_cycle_id etc.) as first-class fields. @@ -147,13 +154,22 @@ def sentry_dsn_str(self) -> str | None: def healthcheck_ping_url_str(self) -> str | None: return str(self.healthcheck_ping_url) if self.healthcheck_ping_url else None + @property + def _config_dir(self) -> Path: + return Path(__file__).parent.parent / "config" + @property def services_yaml_path(self) -> Path: - return Path(__file__).parent.parent / "config" / "services.yaml" + # Prefer a gitignored services.local.yaml (the operator's real + # registry) when present; otherwise fall back to the committed + # generic example. 
+ local = self._config_dir / "services.local.yaml" + return local if local.exists() else self._config_dir / "services.yaml" @property def dependencies_yaml_path(self) -> Path: - return Path(__file__).parent.parent / "config" / "dependencies.yaml" + local = self._config_dir / "dependencies.local.yaml" + return local if local.exists() else self._config_dir / "dependencies.yaml" @property def migrations_dir(self) -> Path: diff --git a/backend/app/poller/zendesk_poller.py b/backend/app/poller/active_incidents_poller.py similarity index 78% rename from backend/app/poller/zendesk_poller.py rename to backend/app/poller/active_incidents_poller.py index 30006d4..f2aef54 100644 --- a/backend/app/poller/zendesk_poller.py +++ b/backend/app/poller/active_incidents_poller.py @@ -1,7 +1,8 @@ -"""Zendesk Status API poller. +"""Active incidents API poller. -Fetches active incidents from status.zendesk.com/api/incidents/active -and maps to our status model. +Fetches active incidents from a JSON incidents endpoint and maps to +our status model. The response envelope wraps incident objects under +a "data" key; an empty array means operational. """ import logging @@ -15,11 +16,11 @@ logger = logging.getLogger(__name__) -async def poll_zendesk( +async def poll_active_incidents( client: httpx.AsyncClient, poll_url: str, ) -> PollResult: - """Poll the Zendesk Status API for active incidents. + """Poll an active-incidents JSON endpoint. Returns { data: [...incidents], included: [...] }. Empty data array means operational. 
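The envelope rule in the docstring above (incidents wrapped under a `data` key, empty array means operational) can be sketched as follows. This is an illustration under stated assumptions: `extract_active_incidents` and `is_operational` are hypothetical names, treating a missing `data` key as an empty list is an assumption, and the severity mapping for non-empty responses is deliberately left out because it is not visible in this diff.

```python
def extract_active_incidents(body):
    """Return the incident objects from a {"data": [...]} envelope.

    Assumption: a missing "data" key is treated like an empty array.
    """
    if not isinstance(body, dict):
        raise ValueError(f"expected a JSON object, got {type(body).__name__}")
    data = body.get("data", [])
    if not isinstance(data, list):
        raise ValueError(f"expected 'data' to be a list, got {type(data).__name__}")
    # Ignore any non-object entries rather than failing the whole poll.
    return [item for item in data if isinstance(item, dict)]


def is_operational(body):
    """An empty incident list means the service is operational."""
    return not extract_active_incidents(body)
```

Raising on a malformed envelope, instead of silently returning an empty list, keeps a broken response from being reported as "operational"; a caller would map that error to the unknown status.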
@@ -29,7 +30,7 @@ async def poll_zendesk( body = response.json() except Exception as e: detail, reason = describe_fetch_error(e) - logger.warning("Zendesk poll failed: %s (%s)", detail, reason) + logger.warning("Active-incidents poll failed: %s (%s)", detail, reason) return PollResult( status=ServiceStatus.UNKNOWN, status_detail=detail, @@ -41,7 +42,7 @@ async def poll_zendesk( if not incidents: return PollResult( status=ServiceStatus.OPERATIONAL, - page_name="Zendesk", + page_name="Active Incidents", incidents=[], ) @@ -68,6 +69,6 @@ async def poll_zendesk( return PollResult( status=severity, status_detail=status_detail, - page_name="Zendesk", + page_name="Active Incidents", incidents=incidents, ) diff --git a/backend/app/poller/slack_poller.py b/backend/app/poller/current_status_poller.py similarity index 70% rename from backend/app/poller/slack_poller.py rename to backend/app/poller/current_status_poller.py index 7197cfd..896034e 100644 --- a/backend/app/poller/slack_poller.py +++ b/backend/app/poller/current_status_poller.py @@ -1,9 +1,9 @@ -"""Slack Status API poller. +"""Current-status API poller. -Fetches current status from slack-status.com/api/v2.0.0/current -and returns normalized status data. +Fetches the current status from a JSON status endpoint and returns +normalized status data. -The Slack API can return two formats: +The endpoint can return two formats: - Dict: the normal /current response with {status, active_incidents, ...} - List: a redirect to the history endpoint returning incident objects. 
Each incident has {id, status, type, title, ...} where status is @@ -15,14 +15,14 @@ import httpx -from app.poller.normalizer import ServiceStatus, normalize_slack_status +from app.poller.normalizer import ServiceStatus, normalize_current_status from app.poller.resilience import describe_fetch_error, resilient_fetch from app.poller.statuspage_poller import PollResult logger = logging.getLogger(__name__) -# Slack incident type → severity rank (higher = worse) -_SLACK_TYPE_RANK = { +# Incident type → severity mapping (higher = worse) +_STATUS_TYPE_RANK = { "outage": ServiceStatus.MAJOR_OUTAGE, "incident": ServiceStatus.PARTIAL_OUTAGE, "notice": ServiceStatus.DEGRADED, @@ -30,30 +30,32 @@ } -async def poll_slack( +async def poll_current_status( client: httpx.AsyncClient, poll_url: str, ) -> PollResult: - """Poll the Slack Status API. + """Poll a current-status JSON endpoint. Args: client: Shared httpx AsyncClient. - poll_url: Slack status API URL. + poll_url: Status API URL. Returns: PollResult with normalized status. """ try: # resilient_fetch handles retries + per-host breaker. The explicit - # Accept header is from the origin branch so slack-status.com - # returns JSON not HTML when redirects land us on a different doc. + # Accept header ensures the endpoint returns JSON rather than HTML + # when redirects land on a different document. 
response = await resilient_fetch( - client, poll_url, headers={"Accept": "application/json"}, + client, + poll_url, + headers={"Accept": "application/json"}, ) data = response.json() except Exception as e: detail, reason = describe_fetch_error(e) - logger.warning("Slack poll failed: %s (%s)", detail, reason) + logger.warning("Current-status poll failed: %s (%s)", detail, reason) return PollResult( status=ServiceStatus.UNKNOWN, status_detail=detail, @@ -62,7 +64,7 @@ async def poll_slack( # Normal dict response — use the existing normalizer if isinstance(data, dict): - status = normalize_slack_status(data) + status = normalize_current_status(data) status_detail = None active = data.get("active_incidents", []) if active and isinstance(active[0], dict): @@ -70,7 +72,7 @@ async def poll_slack( return PollResult( status=status, status_detail=status_detail, - page_name="Slack", + page_name="Current Status", incidents=active, scheduled_maintenances=[], ) @@ -79,15 +81,14 @@ async def poll_slack( # with a "status" field (active/resolved/etc) and "type" (outage/incident/etc). 
if isinstance(data, list): active_incidents = [ - item for item in data - if isinstance(item, dict) and item.get("status") == "active" + item for item in data if isinstance(item, dict) and item.get("status") == "active" ] if not active_incidents: return PollResult( status=ServiceStatus.OPERATIONAL, status_detail=None, - page_name="Slack", + page_name="Current Status", incidents=[], scheduled_maintenances=[], ) @@ -99,7 +100,7 @@ async def poll_slack( for inc in active_incidents: inc_type = inc.get("type", "") - mapped = _SLACK_TYPE_RANK.get(inc_type, ServiceStatus.DEGRADED) + mapped = _STATUS_TYPE_RANK.get(inc_type, ServiceStatus.DEGRADED) if severity_rank.get(mapped.value, 0) > severity_rank.get(worst.value, 0): worst = mapped worst_title = inc.get("title") @@ -108,18 +109,20 @@ async def poll_slack( worst_title = active_incidents[0].get("title") logger.info( - "Slack API returned history list (%d items, %d active) — status: %s", - len(data), len(active_incidents), worst.value, + "Current-status API returned history list (%d items, %d active) — status: %s", + len(data), + len(active_incidents), + worst.value, ) return PollResult( status=worst, status_detail=worst_title, - page_name="Slack", + page_name="Current Status", incidents=active_incidents, scheduled_maintenances=[], ) # Unexpected type - logger.warning("Slack API returned unexpected type: %s", type(data).__name__) + logger.warning("Current-status API returned unexpected type: %s", type(data).__name__) return PollResult(status=ServiceStatus.UNKNOWN, status_detail="Unexpected response format") diff --git a/backend/app/poller/google_poller.py b/backend/app/poller/google_poller.py deleted file mode 100644 index ac79408..0000000 --- a/backend/app/poller/google_poller.py +++ /dev/null @@ -1,96 +0,0 @@ -"""Google Workspace status poller. - -Fetches the incidents.json feed from Google's status dashboard. -One HTTP call serves both Google Mail and Google Calendar. 
-""" - -import logging - -import httpx - -from app.poller.normalizer import ( - GOOGLE_PRODUCT_NAMES, - ServiceStatus, - normalize_google_status, -) -from app.poller.resilience import describe_fetch_error, resilient_fetch -from app.poller.statuspage_poller import PollResult - -logger = logging.getLogger(__name__) - - -async def poll_google( - client: httpx.AsyncClient, - poll_url: str, - services: list[dict], -) -> list[tuple[str, PollResult]]: - """Poll Google Workspace status for multiple products in one call. - - Args: - client: Shared httpx AsyncClient. - poll_url: Google incidents.json URL. - services: DB rows with key 'id' (e.g., "google-mail", "google-calendar"). - - Returns: - List of (service_id, PollResult) tuples. - """ - try: - response = await resilient_fetch(client, poll_url) - incidents = response.json() - except Exception as e: - detail, reason = describe_fetch_error(e) - logger.warning("Google poll failed: %s (%s)", detail, reason) - return [ - (svc["id"], PollResult( - status=ServiceStatus.UNKNOWN, - status_detail=detail, - poll_failure_reason=reason, - )) - for svc in services - ] - - if not isinstance(incidents, list): - logger.warning("Google incidents.json returned non-list: %s", type(incidents)) - return [ - (svc["id"], PollResult( - status=ServiceStatus.UNKNOWN, - status_detail="Unexpected response format", - poll_failure_reason=f"parse_error: expected list, got {type(incidents).__name__}", - )) - for svc in services - ] - - results: list[tuple[str, PollResult]] = [] - for svc in services: - service_id = svc["id"] - status = normalize_google_status(incidents, service_id) - - # Find status detail from most recent active incident for this product - status_detail = None - product_names = GOOGLE_PRODUCT_NAMES.get(service_id, []) - for incident in incidents: - if incident.get("end"): - continue - affected = incident.get("affected_products", []) - if any(p.get("title") in product_names for p in affected): - status_detail = 
incident.get("external_desc", "")[:200] - break - - results.append(( - service_id, - PollResult( - status=status, - status_detail=status_detail, - page_name="Google Workspace", - incidents=[ - inc for inc in incidents - if not inc.get("end") and any( - p.get("title") in product_names - for p in inc.get("affected_products", []) - ) - ], - scheduled_maintenances=[], - ), - )) - - return results diff --git a/backend/app/poller/normalizer.py b/backend/app/poller/normalizer.py index 05aed9a..9046f1f 100644 --- a/backend/app/poller/normalizer.py +++ b/backend/app/poller/normalizer.py @@ -1,4 +1,4 @@ -"""Normalize vendor-specific status strings to our unified 5-state enum.""" +"""Normalize poll-format status strings to our unified 5-state enum.""" import logging from enum import StrEnum @@ -55,16 +55,18 @@ def normalize_statuspage_indicator(indicator: str) -> ServiceStatus: mapped = STATUSPAGE_INDICATOR_MAP.get(key) if mapped is None: logger.warning( - "Unmapped Statuspage indicator %r — returning UNKNOWN.", indicator, + "Unmapped Statuspage indicator %r — returning UNKNOWN.", + indicator, ) return ServiceStatus.UNKNOWN return mapped -# ── Slack Status API ─────────────────────────────────────────────── +# ── Current Status API ──────────────────────────────────────────── -def normalize_slack_status(response: dict) -> ServiceStatus: - """Map Slack Status API response to ServiceStatus. + +def normalize_current_status(response: dict) -> ServiceStatus: + """Map a current-status API dict response to ServiceStatus. When status is "ok" and no active incidents → OPERATIONAL. Otherwise, map by incident type. 
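The "map by incident type" step, together with the worst-incident selection in the list branch of `poll_current_status`, can be sketched in isolation. This is a simplified stand-in rather than the project's code: the enum values and the `SEVERITY_RANK` ordering are assumptions (the real rank table is not shown in this diff), `worst_active` is a hypothetical name, and a `str`-mixin `Enum` is used here in place of the project's `StrEnum` so the sketch also runs on older Python versions.

```python
from enum import Enum


class ServiceStatus(str, Enum):
    # Assumed to match the five status strings listed in the README.
    OPERATIONAL = "operational"
    DEGRADED = "degraded"
    PARTIAL_OUTAGE = "partial_outage"
    MAJOR_OUTAGE = "major_outage"
    UNKNOWN = "unknown"


# Incident type -> status, limited to the entries visible in the diff.
TYPE_TO_STATUS = {
    "outage": ServiceStatus.MAJOR_OUTAGE,
    "incident": ServiceStatus.PARTIAL_OUTAGE,
    "notice": ServiceStatus.DEGRADED,
}

# Assumed ordering (higher = worse); not taken from the source.
SEVERITY_RANK = {"operational": 0, "degraded": 1, "partial_outage": 2, "major_outage": 3}


def worst_active(items):
    """Return (status, title) for the most severe active incident."""
    active = [i for i in items if isinstance(i, dict) and i.get("status") == "active"]
    if not active:
        return ServiceStatus.OPERATIONAL, None
    worst, title = ServiceStatus.OPERATIONAL, None
    for inc in active:
        # Unrecognized incident types fall back to DEGRADED, as in the diff.
        mapped = TYPE_TO_STATUS.get(inc.get("type", ""), ServiceStatus.DEGRADED)
        if SEVERITY_RANK.get(mapped.value, 0) > SEVERITY_RANK.get(worst.value, 0):
            worst, title = mapped, inc.get("title")
    return worst, title
```

Resolved entries are filtered out before ranking, so a past outage in the history list cannot outrank a currently active notice.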
@@ -108,23 +110,24 @@ def normalize_slack_status(response: dict) -> ServiceStatus: return ServiceStatus.DEGRADED -# ── Google Workspace ─────────────────────────────────────────────── +# ── Product Feed ─────────────────────────────────────────────────── -# Google product name mappings for filtering the incident feed -GOOGLE_PRODUCT_NAMES: dict[str, list[str]] = { - "google-mail": ["Gmail", "Google Mail"], - "google-calendar": ["Google Calendar"], +# Maps a service id to the product-title strings that identify it in a +# multi-product status feed. Populate per your feed adapter's payload. +PRODUCT_FEED_NAMES: dict[str, list[str]] = { + "feed-product-a": ["Product A", "Service A"], + "feed-product-b": ["Product B"], } -def normalize_google_status(incidents: list[dict], service_id: str) -> ServiceStatus: - """Map Google Workspace incident feed to ServiceStatus for a specific product. +def normalize_product_feed_status(incidents: list[dict], service_id: str) -> ServiceStatus: + """Map a multi-product incident feed to ServiceStatus for a specific product. - The incidents.json feed contains incidents for ALL Google Workspace products. - Filter by matching product names for the given service_id. + The feed contains incidents for all products; filter by matching product + names for the given service_id. If no active (non-resolved) incidents exist for the product → OPERATIONAL. 
""" - product_names = GOOGLE_PRODUCT_NAMES.get(service_id, []) + product_names = PRODUCT_FEED_NAMES.get(service_id, []) if not product_names: return ServiceStatus.UNKNOWN @@ -165,16 +168,27 @@ def normalize_google_status(incidents: list[dict], service_id: str) -> ServiceSt RSS_SEVERITY_KEYWORDS: dict[ServiceStatus, list[str]] = { ServiceStatus.MAJOR_OUTAGE: [ - "major outage", "service outage", "completely unavailable", + "major outage", + "service outage", + "completely unavailable", ], ServiceStatus.PARTIAL_OUTAGE: [ - "partial outage", "partial disruption", "some users", + "partial outage", + "partial disruption", + "some users", ], ServiceStatus.DEGRADED: [ - "degraded", "performance issue", "intermittent", "delays", "investigating", + "degraded", + "performance issue", + "intermittent", + "delays", + "investigating", ], ServiceStatus.OPERATIONAL: [ - "resolved", "operational", "recovered", "fix implemented", + "resolved", + "operational", + "recovered", + "fix implemented", ], } diff --git a/backend/app/poller/product_feed_poller.py b/backend/app/poller/product_feed_poller.py new file mode 100644 index 0000000..c642a05 --- /dev/null +++ b/backend/app/poller/product_feed_poller.py @@ -0,0 +1,107 @@ +"""Product-feed status poller. + +Parses a multi-product incident feed where one HTTP call serves multiple +product IDs. Each entry in the feed represents an incident and carries an +affected_products list; entries without an "end" timestamp are active. 
+""" + +import logging + +import httpx + +from app.poller.normalizer import ( + PRODUCT_FEED_NAMES, + ServiceStatus, + normalize_product_feed_status, +) +from app.poller.resilience import describe_fetch_error, resilient_fetch +from app.poller.statuspage_poller import PollResult + +logger = logging.getLogger(__name__) + + +async def poll_product_feed( + client: httpx.AsyncClient, + poll_url: str, + services: list[dict], +) -> list[tuple[str, PollResult]]: + """Poll a multi-product incident feed for multiple services in one call. + + Args: + client: Shared httpx AsyncClient. + poll_url: Incident feed URL (returns a JSON array of incident objects). + services: DB rows with key 'id' matching entries in PRODUCT_FEED_NAMES. + + Returns: + List of (service_id, PollResult) tuples. + """ + try: + response = await resilient_fetch(client, poll_url) + incidents = response.json() + except Exception as e: + detail, reason = describe_fetch_error(e) + logger.warning("Product-feed poll failed: %s (%s)", detail, reason) + return [ + ( + svc["id"], + PollResult( + status=ServiceStatus.UNKNOWN, + status_detail=detail, + poll_failure_reason=reason, + ), + ) + for svc in services + ] + + if not isinstance(incidents, list): + logger.warning("Product feed returned non-list: %s", type(incidents)) + return [ + ( + svc["id"], + PollResult( + status=ServiceStatus.UNKNOWN, + status_detail="Unexpected response format", + poll_failure_reason=f"parse_error: expected list, got {type(incidents).__name__}", + ), + ) + for svc in services + ] + + results: list[tuple[str, PollResult]] = [] + for svc in services: + service_id = svc["id"] + status = normalize_product_feed_status(incidents, service_id) + + # Find status detail from most recent active incident for this product + status_detail = None + product_names = PRODUCT_FEED_NAMES.get(service_id, []) + for incident in incidents: + if incident.get("end"): + continue + affected = incident.get("affected_products", []) + if any(p.get("title") in 
product_names for p in affected): + status_detail = incident.get("external_desc", "")[:200] + break + + results.append( + ( + service_id, + PollResult( + status=status, + status_detail=status_detail, + page_name="Product Feed", + incidents=[ + inc + for inc in incidents + if not inc.get("end") + and any( + p.get("title") in product_names + for p in inc.get("affected_products", []) + ) + ], + scheduled_maintenances=[], + ), + ) + ) + + return results diff --git a/backend/app/poller/resilience.py b/backend/app/poller/resilience.py index 099bf21..cf7dd7f 100644 --- a/backend/app/poller/resilience.py +++ b/backend/app/poller/resilience.py @@ -136,7 +136,7 @@ async def resilient_fetch( `headers` forwards through to ``client.get`` so pollers that need vendor-specific headers (e.g., ``Accept: application/json`` for - Slack's redirect-happy status host) can pass them without + some vendors' redirect-happy status hosts) can pass them without bypassing the resilience layer. Raises: diff --git a/backend/app/poller/scheduler.py b/backend/app/poller/scheduler.py index 521e358..c2be5ba 100644 --- a/backend/app/poller/scheduler.py +++ b/backend/app/poller/scheduler.py @@ -1,7 +1,7 @@ """Poll scheduler: runs all pollers on a 60-second cycle via APScheduler. -Orchestrates statuspage, Slack, and Google pollers, feeds results through -the change detector, and logs status changes. +Orchestrates all poller types, feeds results through the change detector, +and logs status changes. 
Phase 3 observability hooks: - Each poll cycle binds a fresh `poll_cycle_id` contextvar so every @@ -50,12 +50,15 @@ def _on_scheduler_event(event) -> None: """Bridge APScheduler events into our logs + metrics.""" if event.code == EVENT_JOB_ERROR: logger.error( - "APScheduler job %s raised: %s", event.job_id, event.exception, + "APScheduler job %s raised: %s", + event.job_id, + event.exception, ) elif event.code == EVENT_JOB_MISSED: logger.warning( "APScheduler job %s missed its run time at %s", - event.job_id, event.scheduled_run_time, + event.job_id, + event.scheduled_run_time, ) @@ -97,6 +100,7 @@ def start_scheduler(app) -> None: # WAL checkpoint — runs more often than retention so the -wal sidecar # file doesn't grow without bound between weekly retention passes. from app.retention import scheduled_retention_tick, scheduled_wal_checkpoint_tick + scheduler.add_job( scheduled_wal_checkpoint_tick, "interval", @@ -119,6 +123,7 @@ def start_scheduler(app) -> None: # as a belt-and-suspenders snapshot for operators who don't want to # set up Litestream's continuous WAL shipping. 
from app.backup import run_backup + scheduler.add_job( run_backup, "cron", @@ -133,6 +138,7 @@ def start_scheduler(app) -> None: # SLO burn-rate alerting — gated by feature flag (default off) if settings.slo_burn_rate_enabled: from app.alerting.burn_rate import run_slo_burn_rate_cycle + scheduler.add_job( run_slo_burn_rate_cycle, "interval", @@ -198,21 +204,23 @@ async def run_poll_cycle(app) -> None: services_by_type.setdefault(svc["poll_type"], []).append(svc) # Dispatch all poller groups concurrently - from app.poller.google_poller import poll_google - from app.poller.ringcentral_poller import poll_ringcentral - from app.poller.salesforce_poller import poll_salesforce - from app.poller.slack_poller import poll_slack + from app.poller.active_incidents_poller import poll_active_incidents + from app.poller.current_status_poller import poll_current_status + from app.poller.product_feed_poller import poll_product_feed + from app.poller.service_array_poller import poll_service_array from app.poller.statuspage_poller import poll_all_statuspage - from app.poller.zendesk_poller import poll_zendesk + from app.poller.trust_incidents_poller import poll_trust_incidents tasks = [] task_labels = [] def _timed(poll_type: str, coro): """Wrap a poller coroutine to record its wall-clock duration.""" + async def _runner(): with POLL_DURATION_SECONDS.labels(poll_type=poll_type).time(): return await coro + return _runner() statuspage_svcs = services_by_type.get("statuspage_json", []) @@ -220,33 +228,37 @@ async def _runner(): tasks.append(_timed("statuspage_json", poll_all_statuspage(client, statuspage_svcs))) task_labels.append(f"statuspage ({len(statuspage_svcs)} services)") - slack_svcs = services_by_type.get("slack_api", []) - if slack_svcs: - svc = slack_svcs[0] + current_status_svcs = services_by_type.get("current_status_api", []) + if current_status_svcs: + svc = current_status_svcs[0] - async def _poll_slack(): - result = await poll_slack(client, svc["poll_url"]) + async def 
_poll_current_status(): + result = await poll_current_status(client, svc["poll_url"]) return [(svc["id"], result)] - tasks.append(_timed("slack_api", _poll_slack())) - task_labels.append("slack (1 service)") + tasks.append(_timed("current_status_api", _poll_current_status())) + task_labels.append("current_status (1 service)") - google_svcs = services_by_type.get("google_json", []) - if google_svcs: - url = google_svcs[0]["poll_url"] - tasks.append(_timed("google_json", poll_google(client, url, google_svcs))) - task_labels.append(f"google ({len(google_svcs)} services)") + product_feed_svcs = services_by_type.get("product_feed_json", []) + if product_feed_svcs: + url = product_feed_svcs[0]["poll_url"] + tasks.append( + _timed("product_feed_json", poll_product_feed(client, url, product_feed_svcs)) + ) + task_labels.append(f"product_feed ({len(product_feed_svcs)} services)") # Single-service custom pollers for poll_type, poller_fn in [ - ("salesforce_trust", poll_salesforce), - ("zendesk_api", poll_zendesk), - ("ringcentral_api", poll_ringcentral), + ("trust_incidents_api", poll_trust_incidents), + ("active_incidents_api", poll_active_incidents), + ("service_array_json", poll_service_array), ]: for svc in services_by_type.get(poll_type, []): + async def _poll_single(s=svc, fn=poller_fn): result = await fn(client, s["poll_url"]) return [(s["id"], result)] + tasks.append(_timed(poll_type, _poll_single())) task_labels.append(f"{poll_type} ({svc['id']})") @@ -274,20 +286,25 @@ async def _poll_single(s=svc, fn=poller_fn): # Process vendor-outage changes: impact statements + Slack alerts if changes: from app.alerting.engine import process_changes + await process_changes(db, write_lock, changes, http_client=client) # Process poller-health changes: alert on the separate poller-health # webhook so operators can tell "we're blind" from "vendor is down". 
if health_changes: from app.alerting.engine import process_poller_health_changes + await process_poller_health_changes( - health_changes, http_client=client, + health_changes, + http_client=client, ) # Log summary logger.info( "Poll cycle complete: %d services polled, %d changes, %d health", - len(all_results), len(changes), len(health_changes), + len(all_results), + len(changes), + len(health_changes), ) for change in changes: logger.info( diff --git a/backend/app/poller/ringcentral_poller.py b/backend/app/poller/service_array_poller.py similarity index 77% rename from backend/app/poller/ringcentral_poller.py rename to backend/app/poller/service_array_poller.py index d031320..425351b 100644 --- a/backend/app/poller/ringcentral_poller.py +++ b/backend/app/poller/service_array_poller.py @@ -1,7 +1,8 @@ -"""RingCentral Status API poller. +"""Service-array status poller. -Fetches service status from status.ringcentral.com/status.json -which returns an array of 75 service status objects. +Fetches status from an endpoint that returns an array of per-service +status objects, each carrying a level field and optional alerts list. +Overall status is derived from the worst level across all entries. """ import logging @@ -14,7 +15,7 @@ logger = logging.getLogger(__name__) -# RingCentral level values → our status +# Level field values → our status LEVEL_MAP = { "Good": ServiceStatus.OPERATIONAL, "Informational": ServiceStatus.OPERATIONAL, @@ -24,24 +25,24 @@ } -async def poll_ringcentral( +async def poll_service_array( client: httpx.AsyncClient, poll_url: str, ) -> PollResult: - """Poll RingCentral status API. + """Poll a service-array status endpoint. - Returns an array of objects like: + The endpoint returns an array of objects like: { "category": "Core Services", "service": "Calling - Inbound", "region": "Americas", "level": "Good", "alerts": [] } - We compute overall status from the worst level across all services. 
+ Overall status is computed from the worst level across all entries. """ try: response = await resilient_fetch(client, poll_url) services = response.json() except Exception as e: detail, reason = describe_fetch_error(e) - logger.warning("RingCentral poll failed: %s (%s)", detail, reason) + logger.warning("Service-array poll failed: %s (%s)", detail, reason) return PollResult( status=ServiceStatus.UNKNOWN, status_detail=detail, @@ -49,7 +50,7 @@ async def poll_ringcentral( ) if not isinstance(services, list): - return PollResult(status=ServiceStatus.OPERATIONAL, page_name="RingCentral") + return PollResult(status=ServiceStatus.OPERATIONAL, page_name="Service Array") # Compute worst status across all services worst = ServiceStatus.OPERATIONAL @@ -79,6 +80,6 @@ async def poll_ringcentral( return PollResult( status=worst, status_detail=worst_detail if worst != ServiceStatus.OPERATIONAL else None, - page_name="RingCentral", + page_name="Service Array", incidents=active_alerts, ) diff --git a/backend/app/poller/statuspage_poller.py b/backend/app/poller/statuspage_poller.py index 752284e..813273f 100644 --- a/backend/app/poller/statuspage_poller.py +++ b/backend/app/poller/statuspage_poller.py @@ -158,14 +158,16 @@ async def _fetch_one(url: str) -> tuple[str, dict | Exception]: detail, reason = describe_fetch_error(data_or_error) logger.warning("Failed to fetch %s: %s (%s)", url, detail, reason) for svc in svcs: - results.append(( - svc["id"], - PollResult( - status=ServiceStatus.UNKNOWN, - status_detail=detail, - poll_failure_reason=reason, - ), - )) + results.append( + ( + svc["id"], + PollResult( + status=ServiceStatus.UNKNOWN, + status_detail=detail, + poll_failure_reason=reason, + ), + ) + ) else: data = data_or_error for svc in svcs: @@ -183,7 +185,7 @@ async def _demo_poll() -> None: ) as client: result = await poll_statuspage( client, - "https://status.box.com/api/v2/summary.json", + "https://status.example.com/api/v2/summary.json", ) print(f"Page: 
{result.page_name}") print(f"Status: {result.status.value}") @@ -202,4 +204,5 @@ async def _demo_poll() -> None: if __name__ == "__main__": import asyncio as _asyncio + _asyncio.run(_demo_poll()) diff --git a/backend/app/poller/salesforce_poller.py b/backend/app/poller/trust_incidents_poller.py similarity index 76% rename from backend/app/poller/salesforce_poller.py rename to backend/app/poller/trust_incidents_poller.py index a1be1e1..a97ab98 100644 --- a/backend/app/poller/salesforce_poller.py +++ b/backend/app/poller/trust_incidents_poller.py @@ -1,7 +1,8 @@ -"""Salesforce Trust API poller. +"""Trust incidents API poller. -Fetches active incidents from api.status.salesforce.com/v1/incidents -and maps to our status model. +Fetches active incidents from a trust/incidents JSON endpoint and maps +to our status model. The endpoint returns a list of incident objects; +entries without a resolved timestamp are considered active. """ import logging @@ -15,13 +16,13 @@ logger = logging.getLogger(__name__) -async def poll_salesforce( +async def poll_trust_incidents( client: httpx.AsyncClient, poll_url: str, ) -> PollResult: - """Poll the Salesforce Trust API for active incidents. + """Poll a trust-incidents JSON endpoint for active incidents. - The API returns a list of incident objects. If any are active + The endpoint returns a list of incident objects. If any are active (no resolvedAt), the service is degraded/outaged. 
""" try: @@ -29,7 +30,7 @@ async def poll_salesforce( incidents = response.json() except Exception as e: detail, reason = describe_fetch_error(e) - logger.warning("Salesforce poll failed: %s (%s)", detail, reason) + logger.warning("Trust-incidents poll failed: %s (%s)", detail, reason) return PollResult( status=ServiceStatus.UNKNOWN, status_detail=detail, @@ -37,7 +38,7 @@ async def poll_salesforce( ) if not isinstance(incidents, list): - return PollResult(status=ServiceStatus.OPERATIONAL, page_name="Salesforce") + return PollResult(status=ServiceStatus.OPERATIONAL, page_name="Trust Incidents") # Filter to active incidents (those without a resolved timestamp) active = [inc for inc in incidents if not inc.get("isResolved", True)] @@ -45,7 +46,7 @@ async def poll_salesforce( if not active: return PollResult( status=ServiceStatus.OPERATIONAL, - page_name="Salesforce", + page_name="Trust Incidents", incidents=[], ) @@ -72,6 +73,6 @@ async def poll_salesforce( return PollResult( status=severity, status_detail=status_detail, - page_name="Salesforce", + page_name="Trust Incidents", incidents=active, ) diff --git a/backend/app/seed.py b/backend/app/seed.py index b9ee610..84966f5 100644 --- a/backend/app/seed.py +++ b/backend/app/seed.py @@ -42,13 +42,28 @@ def _expand_env_var(value: str | None) -> str | None: VALID_CATEGORIES = Literal[ - "identity", "productivity", "collaboration", "engineering", - "hr", "finance", "sales", "marketing", "networking", "support", "other", + "identity", + "productivity", + "collaboration", + "engineering", + "hr", + "finance", + "sales", + "marketing", + "networking", + "support", + "other", ] VALID_POLL_TYPES = Literal[ - "statuspage_json", "google_json", "slack_api", "rss", "manual", - "salesforce_trust", "zendesk_api", "ringcentral_api", + "statuspage_json", + "product_feed_json", + "current_status_api", + "rss", + "manual", + "trust_incidents_api", + "active_incidents_api", + "service_array_json", ] VALID_TIERS = Literal["critical", 
"important", "informational"] @@ -106,9 +121,7 @@ def load_services(path: Path | None = None) -> list[ServiceConfig]: errors.append(f" Service #{i + 1} ({raw.get('id', '?')}): {e}") if errors: - raise ValueError( - f"Validation failed for {len(errors)} service(s):\n" + "\n".join(errors) - ) + raise ValueError(f"Validation failed for {len(errors)} service(s):\n" + "\n".join(errors)) return services @@ -138,9 +151,7 @@ def load_dependencies( errors: list[str] = [] for upstream, targets in deps.items(): if upstream not in known_service_ids: - errors.append( - f" Unknown upstream service '{upstream}' (not in services.yaml)" - ) + errors.append(f" Unknown upstream service '{upstream}' (not in services.yaml)") for target in targets: if target.service == "all_internal": continue @@ -218,9 +229,7 @@ async def seed_dependencies( for target in targets: # Expand "all_internal" to all services except the upstream itself if target.service == "all_internal": - downstream_ids = [ - sid for sid in all_service_ids if sid != upstream - ] + downstream_ids = [sid for sid in all_service_ids if sid != upstream] else: downstream_ids = [target.service] diff --git a/backend/config/dependencies.yaml b/backend/config/dependencies.yaml index d44a486..1e691b7 100644 --- a/backend/config/dependencies.yaml +++ b/backend/config/dependencies.yaml @@ -1,60 +1,49 @@ -# IT Service Health Dashboard — Service Dependency Graph -# Format: upstream → downstream (when upstream breaks, downstream is impacted) +# Service dependency graph (EXAMPLE) +# +# Maps an upstream service id to the downstream services impacted when it +# degrades. Used to generate impact statements ("X is down — this affects Y"). +# Every id here must exist in services.yaml. The sentinel `all_internal` +# expands to every service except the upstream itself. +# +# For a real deployment, copy this to `dependencies.local.yaml` (gitignored) +# alongside `services.local.yaml`; the loader prefers the .local files. 
+# +# Edge schema: { service: , impact: , severity: critical|high|medium|low } dependencies: - # Identity & Access — highest blast radius - okta: - - service: box - impact: "Box SSO login unavailable" + # The identity provider is the login broker — an outage blocks SSO access + # to everything that authenticates through it. + identity-provider: + - service: github + impact: "SSO login to source control unavailable" severity: critical - - service: slack - impact: "Slack SSO login may fail for new sessions" - severity: critical - - service: zoom - impact: "Zoom SSO login unavailable" - severity: critical - - service: salesforce - impact: "Salesforce SSO login unavailable" - severity: critical - - service: jira - impact: "Jira SSO login unavailable" + - service: npm + impact: "SSO login to the package registry unavailable" severity: high - - service: confluence - impact: "Confluence SSO login unavailable" + - service: pypi + impact: "SSO login to the package index unavailable" severity: high - - service: concur - impact: "Concur SSO login unavailable" - severity: high - - service: workday - impact: "Workday SSO login unavailable" - severity: high - - service: greenhouse - impact: "Greenhouse SSO login unavailable" - severity: medium - - service: docusign - impact: "DocuSign SSO login unavailable" + - service: sentry + impact: "SSO login to error tracking unavailable" severity: medium - - service: zendesk - impact: "Zendesk SSO login unavailable" + - service: dropbox + impact: "SSO login to file storage unavailable" + severity: high + - service: discord + impact: "SSO login to team chat unavailable" severity: medium - - service: netsuite - impact: "NetSuite SSO login unavailable" + - service: datadog + impact: "SSO login to observability unavailable" severity: medium - - duo: - - service: okta - impact: "MFA push notifications unavailable; Okta login may require fallback methods" - severity: critical - - # Collaboration dependencies - slack: - - service: servicedesk - 
impact: "Slack-based IT support channel and Aisera bot unavailable" + - service: ticketing + impact: "SSO login to the ITSM/ticketing system unavailable" severity: high - # Google Workspace - google-mail: - - service: google-calendar - impact: "Calendar notifications and email invites may be delayed" + # Edge/CDN provider — degradation ripples to web-facing services behind it. + cloudflare: + - service: github + impact: "Elevated latency reaching source-control web UI" severity: medium - + - service: sentry + impact: "Elevated latency reaching the error-tracking UI" + severity: low diff --git a/backend/config/services.yaml b/backend/config/services.yaml index d5da019..478e8c6 100644 --- a/backend/config/services.yaml +++ b/backend/config/services.yaml @@ -1,6 +1,22 @@ -# IT Service Health Dashboard — Service Registry -# All poll_urls verified via curl on 2026-04-07 -# Services with no public JSON API are set to poll_type: manual +# IT Service Health Dashboard — Service Registry (EXAMPLE) +# +# This is a generic, public example registry. It monitors well-known public +# developer-tool status pages so the dashboard works immediately for a demo. +# +# For a real deployment, copy this file to `services.local.yaml` (gitignored) +# and replace these entries with your own organization's services. The loader +# prefers `services.local.yaml` when present, so your real registry never has +# to live in version control. 
+# +# Schema per entry: +# id: stable slug, referenced by dependencies.yaml and the API +# display_name: human-facing name shown in the UI +# category: one of identity|productivity|collaboration|engineering|hr| +# finance|sales|marketing|networking|support|other +# poll_type: statuspage_json | manual | (see app/seed.py for the full set) +# poll_url: required unless poll_type is manual +# status_page_url: human-facing status page link +# tier: critical | important | informational (default: important) categories: identity: "Identity & Access" @@ -16,194 +32,82 @@ categories: services: # ── Identity & Access ───────────────────────────────────────────── - - id: okta - display_name: Okta + # The SSO / identity broker. Point SSO_BROKER_SERVICE_ID at this id to + # enable the dedicated identity-provider impact wording. Many SSO vendors + # gate their status API, so this is modeled as a manual-update service. + - id: identity-provider + display_name: Identity Provider (SSO) category: identity - poll_type: manual # status.okta.com returns 401 — requires Salesforce login, no public API - status_page_url: https://status.okta.com - tier: critical # SSO broker — outage blocks access to almost every app + poll_type: manual + status_page_url: https://example.com/status + tier: critical # SSO broker — an outage blocks login to everything downstream - - id: duo - display_name: Duo Security - category: identity + # ── Engineering ─────────────────────────────────────────────────── + - id: github + display_name: GitHub + category: engineering poll_type: statuspage_json - poll_url: https://status.duo.com/api/v2/summary.json - status_page_url: https://status.duo.com - tier: critical # MFA — outage blocks every new login + poll_url: https://www.githubstatus.com/api/v2/summary.json + status_page_url: https://www.githubstatus.com + tier: important - # ── Productivity ────────────────────────────────────────────────── - - id: box - display_name: Box - category: productivity + - id: npm + 
display_name: npm + category: engineering poll_type: statuspage_json - poll_url: https://status.box.com/api/v2/summary.json - status_page_url: https://status.box.com + poll_url: https://status.npmjs.org/api/v2/summary.json + status_page_url: https://status.npmjs.org - - id: docusign - display_name: DocuSign - category: productivity + - id: pypi + display_name: PyPI + category: engineering poll_type: statuspage_json - poll_url: https://status.docusign.com/api/v2/summary.json - status_page_url: https://status.docusign.com + poll_url: https://status.python.org/api/v2/summary.json + status_page_url: https://status.python.org - - id: google-mail - display_name: Google Mail - category: productivity - poll_type: google_json - poll_url: https://www.google.com/appsstatus/dashboard/incidents.json - status_page_url: https://www.google.com/appsstatus/dashboard - - - id: google-calendar - display_name: Google Calendar - category: productivity - poll_type: google_json - poll_url: https://www.google.com/appsstatus/dashboard/incidents.json - status_page_url: https://www.google.com/appsstatus/dashboard - - - id: conga - display_name: Conga - category: productivity + - id: sentry + display_name: Sentry + category: engineering poll_type: statuspage_json - poll_url: https://status.conga.com/api/v2/summary.json - status_page_url: https://status.conga.com + poll_url: https://status.sentry.io/api/v2/summary.json + status_page_url: https://status.sentry.io - - id: eptura - display_name: Eptura (Teem) + # ── Productivity ────────────────────────────────────────────────── + - id: dropbox + display_name: Dropbox category: productivity poll_type: statuspage_json - poll_url: https://status.eptura.com/api/v2/summary.json - status_page_url: https://status.eptura.com + poll_url: https://status.dropbox.com/api/v2/summary.json + status_page_url: https://status.dropbox.com # ── Collaboration ───────────────────────────────────────────────── - - id: slack - display_name: Slack - category: 
collaboration - poll_type: slack_api - poll_url: https://slack-status.com/api/v2.0.0/current - status_page_url: https://status.slack.com - tier: critical # Primary async comms — outage halts incident response itself - - - id: zoom - display_name: Zoom + - id: discord + display_name: Discord category: collaboration poll_type: statuspage_json - poll_url: https://status.zoom.us/api/v2/summary.json # returns 302 — httpx needs follow_redirects=True - status_page_url: https://status.zoom.us - - - id: ringcentral - display_name: RingCentral - category: collaboration - poll_type: ringcentral_api - poll_url: https://status.ringcentral.com/status.json - status_page_url: https://status.ringcentral.com - - # ── Engineering ─────────────────────────────────────────────────── - - id: confluence - display_name: Confluence - category: engineering - poll_type: statuspage_json # page-level status — components hidden when healthy - poll_url: https://status.atlassian.com/api/v2/summary.json - statuspage_component_name: Confluence - status_page_url: https://status.atlassian.com - - - id: jira - display_name: Jira - category: engineering - poll_type: statuspage_json # page-level status — components hidden when healthy - poll_url: https://status.atlassian.com/api/v2/summary.json - statuspage_component_name: Jira - status_page_url: https://status.atlassian.com - - - id: servicedesk - display_name: Jira Service Management - category: engineering - poll_type: statuspage_json # page-level status — components hidden when healthy - poll_url: https://status.atlassian.com/api/v2/summary.json - statuspage_component_name: Jira Service Management - status_page_url: https://status.atlassian.com - - - id: snaplogic - display_name: SnapLogic - category: engineering - poll_type: statuspage_json - poll_url: https://trust.snaplogic.com/api/v2/summary.json # corrected: trust.snaplogic.com, not status.snaplogic.com - status_page_url: https://trust.snaplogic.com - - # ── HR & People 
────────────────────────────────────────────────── - - id: greenhouse - display_name: Greenhouse - category: hr - poll_type: statuspage_json - poll_url: https://status.greenhouse.io/api/v2/summary.json - status_page_url: https://status.greenhouse.io - - - id: workday - display_name: Workday - category: hr - poll_type: manual # trust site requires Workday Community login — no public API - status_page_url: https://community.workday.com/trust/status - - - id: cornerstone - display_name: Cornerstone OnDemand - category: hr + poll_url: https://discordstatus.com/api/v2/summary.json + status_page_url: https://discordstatus.com + tier: critical # primary async comms in this example deployment + + # ── Networking ──────────────────────────────────────────────────── + - id: cloudflare + display_name: Cloudflare + category: networking poll_type: statuspage_json - poll_url: https://status.csod.com/api/v2/summary.json - status_page_url: https://status.csod.com + poll_url: https://www.cloudflarestatus.com/api/v2/summary.json + status_page_url: https://www.cloudflarestatus.com - # ── Finance ────────────────────────────────────────────────────── - - id: concur - display_name: SAP Concur - category: finance - poll_type: manual # open.concur.com is a React SPA — no JSON API - status_page_url: https://open.concur.com - - - id: coupa - display_name: Coupa - category: finance - poll_type: manual - status_page_url: null - - - id: netsuite - display_name: NetSuite - category: finance + # ── Other ───────────────────────────────────────────────────────── + - id: datadog + display_name: Datadog + category: other poll_type: statuspage_json - poll_url: https://status.netsuite.com/api/v2/summary.json - status_page_url: https://status.netsuite.com - - - id: zuora - display_name: Zuora - category: finance - poll_type: statuspage_json - poll_url: https://trust.zuora.com/api/v2/summary.json # corrected: trust.zuora.com, not status.zuora.com - status_page_url: https://trust.zuora.com - - # ── 
Sales & CRM ───────────────────────────────────────────────── - - id: salesforce - display_name: Salesforce - category: sales - poll_type: salesforce_trust - poll_url: https://api.status.salesforce.com/v1/incidents - status_page_url: https://status.salesforce.com + poll_url: https://status.datadoghq.com/api/v2/summary.json + status_page_url: https://status.datadoghq.com - # ── Marketing ──────────────────────────────────────────────────── - - id: iterable - display_name: Iterable - category: marketing - poll_type: statuspage_json - poll_url: https://status.iterable.com/api/v2/summary.json - status_page_url: https://status.iterable.com - - - id: marketo - display_name: Marketo (Adobe) - category: marketing - poll_type: manual # status.adobe.com requires Adobe I/O credentials — no public API - status_page_url: https://status.adobe.com - - # ── Support ────────────────────────────────────────────────────── - - id: zendesk - display_name: Zendesk + # ── Support ─────────────────────────────────────────────────────── + - id: ticketing + display_name: Ticketing / ITSM category: support - poll_type: zendesk_api - poll_url: https://status.zendesk.com/api/incidents/active - status_page_url: https://status.zendesk.com - + poll_type: manual + status_page_url: https://example.com/status diff --git a/backend/pyproject.toml b/backend/pyproject.toml index 485a441..175d064 100644 --- a/backend/pyproject.toml +++ b/backend/pyproject.toml @@ -1,7 +1,7 @@ [project] name = "it-service-health-dashboard" version = "0.1.0" -description = "Internal dashboard that aggregates SaaS vendor health for Box IT" +description = "Internal dashboard that aggregates SaaS vendor health for enterprise IT" requires-python = ">=3.12" # Runtime deps are pinned in requirements.txt for reproducible CI; # this file covers tool configs (ruff, mypy, pytest) + packaging metadata. 
diff --git a/backend/tests/test_admin_api.py b/backend/tests/test_admin_api.py index 900f9bf..dd022c1 100644 --- a/backend/tests/test_admin_api.py +++ b/backend/tests/test_admin_api.py @@ -23,14 +23,21 @@ async def seeded_app(tmp_path): db_path = str(tmp_path / "test.db") conn = await init_db(db_path) - services = load_services() - from tests.test_seeder import seed_deps_with_db, seed_services_with_db + from tests.test_seeder import ( + _DEPENDENCIES_YAML, + _SERVICES_YAML, + seed_deps_with_db, + seed_services_with_db, + ) + + services = load_services(path=_SERVICES_YAML) await seed_services_with_db(conn, services) - deps = load_dependencies(known_service_ids={s.id for s in services}) + deps = load_dependencies(path=_DEPENDENCIES_YAML, known_service_ids={s.id for s in services}) await seed_deps_with_db(conn, deps, [s.id for s in services]) # Import app after DB is initialized from app.main import app + yield app await close_db() @@ -48,18 +55,21 @@ async def client(seeded_app): class TestAdminAuth: async def test_missing_token_rejected(self, client): - resp = await client.post("/api/admin/status", json={ - "service_id": "okta", - "new_status": "degraded", - "reason": "test", - }) + resp = await client.post( + "/api/admin/status", + json={ + "service_id": "identity-provider", + "new_status": "degraded", + "reason": "test", + }, + ) assert resp.status_code == 401 async def test_wrong_token_rejected(self, client): resp = await client.post( "/api/admin/status", json={ - "service_id": "okta", + "service_id": "identity-provider", "new_status": "degraded", "reason": "test", }, @@ -72,7 +82,7 @@ async def test_unset_token_returns_503(self, client, monkeypatch): resp = await client.post( "/api/admin/status", json={ - "service_id": "okta", + "service_id": "identity-provider", "new_status": "degraded", "reason": "test", }, @@ -84,7 +94,7 @@ async def test_non_bearer_scheme_rejected(self, client): resp = await client.post( "/api/admin/status", json={ - "service_id": "okta", + 
"service_id": "identity-provider", "new_status": "degraded", "reason": "test", }, @@ -98,10 +108,10 @@ async def test_update_valid_service(self, client): resp = await client.post( "/api/admin/status", json={ - "service_id": "okta", + "service_id": "identity-provider", "new_status": "degraded", "detail": "SSO slow", - "reason": "User reported in #it-help", + "reason": "User reported in the help channel", }, headers=AUTH_HEADERS, ) @@ -129,7 +139,7 @@ async def test_update_invalid_status(self, client): resp = await client.post( "/api/admin/status", json={ - "service_id": "okta", + "service_id": "identity-provider", "new_status": "invalid_status", "reason": "test", }, @@ -141,7 +151,7 @@ async def test_update_missing_reason(self, client): resp = await client.post( "/api/admin/status", json={ - "service_id": "okta", + "service_id": "identity-provider", "new_status": "degraded", }, headers=AUTH_HEADERS, @@ -152,7 +162,7 @@ async def test_update_creates_status_event_with_audit(self, client): resp = await client.post( "/api/admin/status", json={ - "service_id": "workday", + "service_id": "ticketing", "new_status": "major_outage", "detail": "Down", "reason": "Confirmed with vendor support", @@ -162,10 +172,11 @@ async def test_update_creates_status_event_with_audit(self, client): assert resp.status_code == 200 from app.database import get_db + db = await get_db() cursor = await db.execute( """SELECT source, new_status, updated_by, reason, client_ip - FROM status_events WHERE service_id='workday'""" + FROM status_events WHERE service_id='ticketing'""" ) row = dict(await cursor.fetchone()) assert row["source"] == "manual" @@ -179,7 +190,7 @@ async def test_update_same_status_no_event(self, client): await client.post( "/api/admin/status", json={ - "service_id": "concur", + "service_id": "datadog", "new_status": "degraded", "reason": "first update", }, @@ -188,7 +199,7 @@ async def test_update_same_status_no_event(self, client): resp = await client.post( "/api/admin/status", 
json={ - "service_id": "concur", + "service_id": "datadog", "new_status": "degraded", "detail": "Still slow", "reason": "follow-up", @@ -200,17 +211,16 @@ async def test_update_same_status_no_event(self, client): assert body["meta"]["status_changed"] is False from app.database import get_db + db = await get_db() - cursor = await db.execute( - "SELECT count(*) FROM status_events WHERE service_id='concur'" - ) + cursor = await db.execute("SELECT count(*) FROM status_events WHERE service_id='datadog'") assert (await cursor.fetchone())[0] == 1 async def test_response_envelope_structure(self, client): resp = await client.post( "/api/admin/status", json={ - "service_id": "okta", + "service_id": "identity-provider", "new_status": "operational", "reason": "resolved", }, diff --git a/backend/tests/test_burn_rate.py b/backend/tests/test_burn_rate.py index 8c1a91c..63f7022 100644 --- a/backend/tests/test_burn_rate.py +++ b/backend/tests/test_burn_rate.py @@ -42,7 +42,9 @@ ) -async def _seed_service(db: aiosqlite.Connection, service_id: str = _SVC, name: str = _SVC_NAME) -> None: +async def _seed_service( + db: aiosqlite.Connection, service_id: str = _SVC, name: str = _SVC_NAME +) -> None: await db.execute(_CREATE_SVC, (service_id, name)) await db.commit() @@ -149,10 +151,14 @@ async def _fake( for ws, pct in window_to_uptime.items(): if abs((window - ws).total_seconds()) < 2.0: if pct is None: - return WindowUptime(operational_seconds=0.0, tracked_seconds=0.0, uptime_percent=None) + return WindowUptime( + operational_seconds=0.0, tracked_seconds=0.0, uptime_percent=None + ) total = ws.total_seconds() op = total * (pct / 100.0) - return WindowUptime(operational_seconds=op, tracked_seconds=total, uptime_percent=pct) + return WindowUptime( + operational_seconds=op, tracked_seconds=total, uptime_percent=pct + ) return WindowUptime(operational_seconds=0.0, tracked_seconds=0.0, uptime_percent=None) monkeypatch.setattr(br_module, "compute_uptime", _fake) @@ -227,16 +233,21 @@ async 
def test_burn_rate_zero_downtime_returns_zero(self, db_with_svc: aiosqlite @pytest.mark.asyncio async def test_burn_rate_math_matches_hand_calc( - self, db_with_svc: aiosqlite.Connection, monkeypatch: pytest.MonkeyPatch, + self, + db_with_svc: aiosqlite.Connection, + monkeypatch: pytest.MonkeyPatch, ): """97.12% uptime -> 2.88% failure rate -> 28.8x burn at 99.9% SLO -> fast breach.""" - _stub_compute_uptime(monkeypatch, { - timedelta(minutes=5): 97.12, - timedelta(minutes=30): 100.0, - timedelta(hours=1): 97.12, - timedelta(hours=6): 100.0, - timedelta(days=30): 99.95, - }) + _stub_compute_uptime( + monkeypatch, + { + timedelta(minutes=5): 97.12, + timedelta(minutes=30): 100.0, + timedelta(hours=1): 97.12, + timedelta(hours=6): 100.0, + timedelta(days=30): 99.95, + }, + ) breaches = await evaluate_burn_rate(db_with_svc, _SVC, _SVC_NAME, datetime.now(UTC)) fast_breaches = [b for b in breaches if b.severity == "fast"] @@ -247,31 +258,41 @@ async def test_burn_rate_math_matches_hand_calc( @pytest.mark.asyncio async def test_fast_breach_requires_both_windows( - self, db_with_svc: aiosqlite.Connection, monkeypatch: pytest.MonkeyPatch, + self, + db_with_svc: aiosqlite.Connection, + monkeypatch: pytest.MonkeyPatch, ): """5m high burn but 1h low burn -> no fast breach.""" - _stub_compute_uptime(monkeypatch, { - timedelta(minutes=5): 97.0, # ~30x burn - timedelta(minutes=30): 100.0, - timedelta(hours=1): 99.5, # ~5x burn, below 14.4 fast threshold - timedelta(hours=6): 100.0, - timedelta(days=30): 100.0, - }) + _stub_compute_uptime( + monkeypatch, + { + timedelta(minutes=5): 97.0, # ~30x burn + timedelta(minutes=30): 100.0, + timedelta(hours=1): 99.5, # ~5x burn, below 14.4 fast threshold + timedelta(hours=6): 100.0, + timedelta(days=30): 100.0, + }, + ) breaches = await evaluate_burn_rate(db_with_svc, _SVC, _SVC_NAME, datetime.now(UTC)) assert [b for b in breaches if b.severity == "fast"] == [] @pytest.mark.asyncio async def test_slow_breach_requires_both_windows( - self, 
db_with_svc: aiosqlite.Connection, monkeypatch: pytest.MonkeyPatch, + self, + db_with_svc: aiosqlite.Connection, + monkeypatch: pytest.MonkeyPatch, ): """30m high burn but 6h low burn -> no slow breach.""" - _stub_compute_uptime(monkeypatch, { - timedelta(minutes=5): 100.0, - timedelta(minutes=30): 99.0, # 10x burn - timedelta(hours=1): 100.0, - timedelta(hours=6): 99.8, # 2x burn, below 6.0 slow threshold - timedelta(days=30): 100.0, - }) + _stub_compute_uptime( + monkeypatch, + { + timedelta(minutes=5): 100.0, + timedelta(minutes=30): 99.0, # 10x burn + timedelta(hours=1): 100.0, + timedelta(hours=6): 99.8, # 2x burn, below 6.0 slow threshold + timedelta(days=30): 100.0, + }, + ) breaches = await evaluate_burn_rate(db_with_svc, _SVC, _SVC_NAME, datetime.now(UTC)) assert [b for b in breaches if b.severity == "slow"] == [] @@ -294,16 +315,21 @@ async def test_unknown_dominated_window_does_not_alert(self, db: aiosqlite.Conne @pytest.mark.asyncio async def test_both_fast_and_slow_can_fire_simultaneously( - self, db_with_svc: aiosqlite.Connection, monkeypatch: pytest.MonkeyPatch, + self, + db_with_svc: aiosqlite.Connection, + monkeypatch: pytest.MonkeyPatch, ): """All four windows at high burn -> 2 breaches returned (fast + slow).""" - _stub_compute_uptime(monkeypatch, { - timedelta(minutes=5): 97.0, - timedelta(minutes=30): 97.0, - timedelta(hours=1): 97.0, - timedelta(hours=6): 97.0, - timedelta(days=30): 99.95, - }) + _stub_compute_uptime( + monkeypatch, + { + timedelta(minutes=5): 97.0, + timedelta(minutes=30): 97.0, + timedelta(hours=1): 97.0, + timedelta(hours=6): 97.0, + timedelta(days=30): 99.95, + }, + ) breaches = await evaluate_burn_rate(db_with_svc, _SVC, _SVC_NAME, datetime.now(UTC)) severities = {b.severity for b in breaches} assert "fast" in severities @@ -312,48 +338,63 @@ async def test_both_fast_and_slow_can_fire_simultaneously( @pytest.mark.asyncio async def test_error_budget_remaining_pct_from_30d_uptime( - self, db_with_svc: aiosqlite.Connection, 
monkeypatch: pytest.MonkeyPatch, + self, + db_with_svc: aiosqlite.Connection, + monkeypatch: pytest.MonkeyPatch, ): """30d uptime of 99.95% -> half budget used -> ~50% remaining.""" - _stub_compute_uptime(monkeypatch, { - timedelta(minutes=5): 97.0, - timedelta(minutes=30): 97.0, - timedelta(hours=1): 97.0, - timedelta(hours=6): 97.0, - timedelta(days=30): 99.95, - }) + _stub_compute_uptime( + monkeypatch, + { + timedelta(minutes=5): 97.0, + timedelta(minutes=30): 97.0, + timedelta(hours=1): 97.0, + timedelta(hours=6): 97.0, + timedelta(days=30): 99.95, + }, + ) breaches = await evaluate_burn_rate(db_with_svc, _SVC, _SVC_NAME, datetime.now(UTC)) fast = next(b for b in breaches if b.severity == "fast") assert abs(fast.error_budget_remaining_pct - 50.0) < 2.0 @pytest.mark.asyncio async def test_error_budget_remaining_pct_fully_consumed( - self, db_with_svc: aiosqlite.Connection, monkeypatch: pytest.MonkeyPatch, + self, + db_with_svc: aiosqlite.Connection, + monkeypatch: pytest.MonkeyPatch, ): """30d uptime exactly at 99.9% target -> 0% budget remaining.""" - _stub_compute_uptime(monkeypatch, { - timedelta(minutes=5): 97.0, - timedelta(minutes=30): 97.0, - timedelta(hours=1): 97.0, - timedelta(hours=6): 97.0, - timedelta(days=30): 99.9, - }) + _stub_compute_uptime( + monkeypatch, + { + timedelta(minutes=5): 97.0, + timedelta(minutes=30): 97.0, + timedelta(hours=1): 97.0, + timedelta(hours=6): 97.0, + timedelta(days=30): 99.9, + }, + ) breaches = await evaluate_burn_rate(db_with_svc, _SVC, _SVC_NAME, datetime.now(UTC)) fast = next(b for b in breaches if b.severity == "fast") assert fast.error_budget_remaining_pct <= 1.0 @pytest.mark.asyncio async def test_error_budget_remaining_pct_full_when_no_data( - self, db_with_svc: aiosqlite.Connection, monkeypatch: pytest.MonkeyPatch, + self, + db_with_svc: aiosqlite.Connection, + monkeypatch: pytest.MonkeyPatch, ): """30d window with no tracked data (uptime_percent=None) -> 100% remaining.""" - _stub_compute_uptime(monkeypatch, 
{ - timedelta(minutes=5): 97.0, - timedelta(minutes=30): 97.0, - timedelta(hours=1): 97.0, - timedelta(hours=6): 97.0, - timedelta(days=30): None, - }) + _stub_compute_uptime( + monkeypatch, + { + timedelta(minutes=5): 97.0, + timedelta(minutes=30): 97.0, + timedelta(hours=1): 97.0, + timedelta(hours=6): 97.0, + timedelta(days=30): None, + }, + ) breaches = await evaluate_burn_rate(db_with_svc, _SVC, _SVC_NAME, datetime.now(UTC)) fast = next(b for b in breaches if b.severity == "fast") assert fast.error_budget_remaining_pct == 100.0 @@ -417,9 +458,7 @@ async def test_route_suppressed_by_dedup(self, db_with_svc: aiosqlite.Connection ) await db.commit() - decision = await route_slo_burn_rate_alert( - db, breach, "https://hooks.slack.com/test", now - ) + decision = await route_slo_burn_rate_alert(db, breach, "https://hooks.slack.com/test", now) assert decision.should_send is False assert decision.suppressed_by == "dedup" @@ -441,9 +480,7 @@ async def test_route_suppressed_by_maintenance(self, db_with_svc: aiosqlite.Conn ) await db.commit() - decision = await route_slo_burn_rate_alert( - db, breach, "https://hooks.slack.com/test", now - ) + decision = await route_slo_burn_rate_alert(db, breach, "https://hooks.slack.com/test", now) assert decision.should_send is False assert decision.suppressed_by == "maintenance_window" @@ -454,16 +491,17 @@ async def test_route_suppressed_when_no_webhook(self, db_with_svc: aiosqlite.Con breach = _make_fast_breach() now = datetime.now(UTC) - decision = await route_slo_burn_rate_alert( - db_with_svc, breach, None, now - ) + decision = await route_slo_burn_rate_alert(db_with_svc, breach, None, now) assert decision.should_send is False assert decision.suppressed_by == "webhook_not_configured" def test_build_dedup_key_format(self): """build_slo_burn_rate_dedup_key produces expected format.""" - assert build_slo_burn_rate_dedup_key("slack_api", "fast") == "slo_burn:slack_api:fast" + assert ( + 
build_slo_burn_rate_dedup_key("identity-provider", "fast") + == "slo_burn:identity-provider:fast" + ) assert build_slo_burn_rate_dedup_key("github", "slow") == "slo_burn:github:slow" @@ -481,6 +519,7 @@ async def test_record_alert_writes_row(self, db_with_svc: aiosqlite.Connection): dedup_key = build_slo_burn_rate_dedup_key(breach.service_id, breach.severity) from app.alerting.routing import RoutingDecision + decision = RoutingDecision( should_send=True, webhook_url="https://hooks.slack.com/test", @@ -493,9 +532,7 @@ async def test_record_alert_writes_row(self, db_with_svc: aiosqlite.Connection): await record_slo_alert(db, breach, decision) await db.commit() - cursor = await db.execute( - "SELECT * FROM alert_sent_log WHERE dedup_key = ?", (dedup_key,) - ) + cursor = await db.execute("SELECT * FROM alert_sent_log WHERE dedup_key = ?", (dedup_key,)) row = await cursor.fetchone() assert row is not None assert row["alert_kind"] == "slo_burn_rate" @@ -511,6 +548,7 @@ async def test_record_alert_records_suppression(self, db_with_svc: aiosqlite.Con dedup_key = build_slo_burn_rate_dedup_key(breach.service_id, breach.severity) from app.alerting.routing import RoutingDecision + decision = RoutingDecision( should_send=False, webhook_url=None, @@ -654,7 +692,8 @@ def test_payload_omits_channel_mention_context_when_empty(self, monkeypatch): # The mention context block is only appended when channel_mention is truthy mention_contexts = [ - block for block in payload.get("blocks", []) + block + for block in payload.get("blocks", []) if block.get("type") == "context" and any("here" in str(e) for e in block.get("elements", [])) ] @@ -678,9 +717,7 @@ async def _spy(*args: Any, **kwargs: Any) -> list: called.append(True) return [] - monkeypatch.setattr( - "app.alerting.burn_rate.evaluate_burn_rate", _spy - ) + monkeypatch.setattr("app.alerting.burn_rate.evaluate_burn_rate", _spy) app_mock = MagicMock() await run_slo_burn_rate_cycle(app_mock) @@ -689,7 +726,9 @@ async def _spy(*args: 
Any, **kwargs: Any) -> list: @pytest.mark.asyncio async def test_cycle_routes_and_records_for_each_breach( - self, db_with_svc: aiosqlite.Connection, monkeypatch: pytest.MonkeyPatch, + self, + db_with_svc: aiosqlite.Connection, + monkeypatch: pytest.MonkeyPatch, ): """Enable flag, stub a fast breach, mock Slack send → alert_sent_log row written.""" monkeypatch.setattr(settings, "slo_burn_rate_enabled", True) @@ -711,6 +750,7 @@ async def _fake_evaluate( # get_db is imported lazily inside run_slo_burn_rate_cycle — patch at source. async def _fake_get_db() -> aiosqlite.Connection: return db + monkeypatch.setattr("app.database.get_db", _fake_get_db) # Patch the Slack send hook wherever burn_rate.py imports it from. @@ -722,6 +762,7 @@ async def _fake_send(*args: Any, **kwargs: Any) -> bool: if isinstance(payload, dict): send_calls.append(payload) return True + # Try common names; set whichever exists on the module. for attr in ("send_slack_alert", "send_slack_webhook", "send_slack"): if hasattr(br_module, attr): @@ -739,7 +780,9 @@ async def _fake_send(*args: Any, **kwargs: Any) -> bool: @pytest.mark.asyncio async def test_cycle_logs_duration_no_error( - self, db_with_svc: aiosqlite.Connection, monkeypatch: pytest.MonkeyPatch, + self, + db_with_svc: aiosqlite.Connection, + monkeypatch: pytest.MonkeyPatch, ): """Cycle completes without raising even when no breaches fire.""" monkeypatch.setattr(settings, "slo_burn_rate_enabled", True) @@ -749,13 +792,18 @@ async def test_cycle_logs_duration_no_error( async def _fake_get_db() -> aiosqlite.Connection: return db + monkeypatch.setattr("app.database.get_db", _fake_get_db) # No breaches so no Slack send path is exercised. 
async def _no_breaches( - _db: aiosqlite.Connection, _sid: str, _sname: str, _now: datetime, + _db: aiosqlite.Connection, + _sid: str, + _sname: str, + _now: datetime, ) -> list[BurnRateBreach]: return [] + monkeypatch.setattr(br_module, "evaluate_burn_rate", _no_breaches) app_mock = MagicMock() diff --git a/backend/tests/test_graph.py b/backend/tests/test_graph.py index 491c095..0837623 100644 --- a/backend/tests/test_graph.py +++ b/backend/tests/test_graph.py @@ -4,7 +4,12 @@ from app.dependencies.graph import get_downstream, get_upstream from app.seed import DependencyTarget, load_dependencies, load_services -from tests.test_seeder import seed_deps_with_db, seed_services_with_db +from tests.test_seeder import ( + _DEPENDENCIES_YAML, + _SERVICES_YAML, + seed_deps_with_db, + seed_services_with_db, +) async def _insert_service(db, sid): @@ -31,48 +36,50 @@ async def _insert_edge(db, upstream, downstream, severity="high"): @pytest.fixture async def seeded_db(db): - """DB with services and dependencies seeded.""" - services = load_services() + """DB with services and dependencies seeded from the committed example.""" + services = load_services(path=_SERVICES_YAML) await seed_services_with_db(db, services) - deps = load_dependencies() + deps = load_dependencies(path=_DEPENDENCIES_YAML) await seed_deps_with_db(db, deps, [s.id for s in services]) return db class TestGetDownstream: - async def test_okta_has_12_downstream(self, seeded_db): - results = await get_downstream(seeded_db, "okta") - assert len(results) == 12 + async def test_identity_provider_has_8_downstream(self, seeded_db): + results = await get_downstream(seeded_db, "identity-provider") + assert len(results) == 8 - async def test_okta_downstream_includes_box(self, seeded_db): - results = await get_downstream(seeded_db, "okta") + async def test_idp_downstream_includes_known(self, seeded_db): + results = await get_downstream(seeded_db, "identity-provider") ids = [r["service_id"] for r in results] - assert 
"box" in ids - assert "slack" in ids - assert "zoom" in ids + assert "github" in ids + assert "dropbox" in ids + assert "ticketing" in ids async def test_downstream_ordered_by_severity(self, seeded_db): - results = await get_downstream(seeded_db, "okta") + results = await get_downstream(seeded_db, "identity-provider") severities = [r["severity"] for r in results] # Assert at least one 'critical' row exists before locating its last index. # This test protects against regressions in the SEVERITY ORDER BY sort: # if a non-critical entry sneaks between two critical entries we'd # catch it via `last_critical < first_high` below. assert "critical" in severities - last_critical = len(severities) - 1 - next( - i for i, s in enumerate(reversed(severities)) if s == "critical" + last_critical = ( + len(severities) + - 1 + - next(i for i, s in enumerate(reversed(severities)) if s == "critical") ) first_high = next((i for i, s in enumerate(severities) if s == "high"), len(severities)) assert last_critical < first_high or first_high == len(severities) async def test_downstream_includes_current_status(self, seeded_db): - results = await get_downstream(seeded_db, "okta") + results = await get_downstream(seeded_db, "identity-provider") for r in results: assert "current_status" in r assert r["current_status"] is not None async def test_no_downstream(self, seeded_db): - results = await get_downstream(seeded_db, "coupa") + results = await get_downstream(seeded_db, "npm") assert results == [] async def test_nonexistent_service(self, seeded_db): @@ -81,18 +88,18 @@ async def test_nonexistent_service(self, seeded_db): class TestGetUpstream: - async def test_box_upstream_includes_okta(self, seeded_db): - results = await get_upstream(seeded_db, "box") + async def test_github_upstream_includes_idp(self, seeded_db): + results = await get_upstream(seeded_db, "github") ids = [r["service_id"] for r in results] - assert "okta" in ids + assert "identity-provider" in ids - async def 
test_okta_upstream_includes_duo(self, seeded_db): - results = await get_upstream(seeded_db, "okta") + async def test_github_upstream_includes_cloudflare(self, seeded_db): + results = await get_upstream(seeded_db, "github") ids = [r["service_id"] for r in results] - assert "duo" in ids + assert "cloudflare" in ids async def test_no_upstream(self, seeded_db): - results = await get_upstream(seeded_db, "duo") + results = await get_upstream(seeded_db, "identity-provider") assert results == [] @@ -142,12 +149,15 @@ async def test_seed_allows_cycle_without_error(self, tmp_path): the guard is about orphan references, not acyclicity. Document this by asserting it doesn't throw.""" import yaml - cycle_yaml = yaml.safe_dump({ - "dependencies": { - "a": [{"service": "b", "impact": "x", "severity": "high"}], - "b": [{"service": "a", "impact": "x", "severity": "high"}], - }, - }) + + cycle_yaml = yaml.safe_dump( + { + "dependencies": { + "a": [{"service": "b", "impact": "x", "severity": "high"}], + "b": [{"service": "a", "impact": "x", "severity": "high"}], + }, + } + ) path = tmp_path / "cycle_deps.yaml" path.write_text(cycle_yaml) deps = load_dependencies(path=path, known_service_ids={"a", "b"}) diff --git a/backend/tests/test_normalizer.py b/backend/tests/test_normalizer.py index 7597a48..7efe218 100644 --- a/backend/tests/test_normalizer.py +++ b/backend/tests/test_normalizer.py @@ -1,18 +1,19 @@ -"""Tests for status normalizer — all vendor mappings + edge cases.""" +"""Tests for status normalizer — all format mappings + edge cases.""" import pytest from app.poller.normalizer import ( ServiceStatus, - normalize_google_status, + normalize_current_status, + normalize_product_feed_status, normalize_rss_title, - normalize_slack_status, normalize_statuspage_component, normalize_statuspage_indicator, ) # ── Statuspage.io Component Status ────────────────────────────────── + class TestStatuspageComponent: @pytest.mark.parametrize( "input_status,expected", @@ -43,6 +44,7 @@ def 
test_whitespace_stripped(self): # ── Statuspage.io Page-Level Indicator ────────────────────────────── + class TestStatuspageIndicator: @pytest.mark.parametrize( "indicator,expected", @@ -67,16 +69,17 @@ def test_case_insensitive(self): assert normalize_statuspage_indicator("Critical") == ServiceStatus.MAJOR_OUTAGE -# ── Slack Status API ─────────────────────────────────────────────── +# ── Current Status API ──────────────────────────────────────────── + -class TestSlackStatus: +class TestCurrentStatus: def test_ok_no_incidents(self): response = {"status": "ok", "active_incidents": []} - assert normalize_slack_status(response) == ServiceStatus.OPERATIONAL + assert normalize_current_status(response) == ServiceStatus.OPERATIONAL def test_ok_missing_incidents_key(self): response = {"status": "ok"} - assert normalize_slack_status(response) == ServiceStatus.OPERATIONAL + assert normalize_current_status(response) == ServiceStatus.OPERATIONAL def test_outage_incident(self): response = { @@ -85,7 +88,7 @@ def test_outage_incident(self): {"type": "outage", "title": "Major outage"}, ], } - assert normalize_slack_status(response) == ServiceStatus.MAJOR_OUTAGE + assert normalize_current_status(response) == ServiceStatus.MAJOR_OUTAGE def test_incident_type(self): response = { @@ -94,7 +97,7 @@ def test_incident_type(self): {"type": "incident", "title": "Some users affected"}, ], } - assert normalize_slack_status(response) == ServiceStatus.PARTIAL_OUTAGE + assert normalize_current_status(response) == ServiceStatus.PARTIAL_OUTAGE def test_notice_type(self): response = { @@ -103,7 +106,7 @@ def test_notice_type(self): {"type": "notice", "title": "Planned maintenance"}, ], } - assert normalize_slack_status(response) == ServiceStatus.DEGRADED + assert normalize_current_status(response) == ServiceStatus.DEGRADED def test_maintenance_type(self): response = { @@ -112,7 +115,7 @@ def test_maintenance_type(self): {"type": "maintenance", "title": "Scheduled maintenance"}, ], } - assert 
normalize_slack_status(response) == ServiceStatus.DEGRADED + assert normalize_current_status(response) == ServiceStatus.DEGRADED def test_multiple_incidents_most_severe_wins(self): response = { @@ -123,7 +126,7 @@ def test_multiple_incidents_most_severe_wins(self): {"type": "incident", "title": "Some issue"}, ], } - assert normalize_slack_status(response) == ServiceStatus.MAJOR_OUTAGE + assert normalize_current_status(response) == ServiceStatus.MAJOR_OUTAGE def test_unknown_incident_type(self): response = { @@ -132,71 +135,82 @@ def test_unknown_incident_type(self): {"type": "something_new", "title": "Unknown type"}, ], } - assert normalize_slack_status(response) == ServiceStatus.DEGRADED + assert normalize_current_status(response) == ServiceStatus.DEGRADED def test_non_ok_no_incidents(self): response = {"status": "active", "active_incidents": []} - assert normalize_slack_status(response) == ServiceStatus.DEGRADED + assert normalize_current_status(response) == ServiceStatus.DEGRADED -# ── Google Workspace ─────────────────────────────────────────────── +# ── Product Feed ────────────────────────────────────────────────── -class TestGoogleStatus: + +class TestProductFeedStatus: def test_no_incidents_operational(self): - assert normalize_google_status([], "google-mail") == ServiceStatus.OPERATIONAL + assert normalize_product_feed_status([], "feed-product-a") == ServiceStatus.OPERATIONAL def test_active_incident_for_product(self): incidents = [ { - "affected_products": [{"title": "Gmail"}], + "affected_products": [{"title": "Product A"}], "most_recent_update": {"status": "SERVICE_DISRUPTION"}, }, ] - assert normalize_google_status(incidents, "google-mail") == ServiceStatus.PARTIAL_OUTAGE + assert ( + normalize_product_feed_status(incidents, "feed-product-a") + == ServiceStatus.PARTIAL_OUTAGE + ) def test_active_outage_for_product(self): incidents = [ { - "affected_products": [{"title": "Gmail"}], + "affected_products": [{"title": "Product A"}], "most_recent_update": 
{"status": "SERVICE_OUTAGE"}, }, ] - assert normalize_google_status(incidents, "google-mail") == ServiceStatus.MAJOR_OUTAGE + assert ( + normalize_product_feed_status(incidents, "feed-product-a") == ServiceStatus.MAJOR_OUTAGE + ) def test_resolved_incident_is_operational(self): incidents = [ { - "affected_products": [{"title": "Gmail"}], + "affected_products": [{"title": "Product A"}], "end": "2026-04-01T00:00:00Z", "most_recent_update": {"status": "SERVICE_DISRUPTION"}, }, ] - assert normalize_google_status(incidents, "google-mail") == ServiceStatus.OPERATIONAL + assert ( + normalize_product_feed_status(incidents, "feed-product-a") == ServiceStatus.OPERATIONAL + ) def test_incident_for_different_product(self): incidents = [ { - "affected_products": [{"title": "Google Drive"}], + "affected_products": [{"title": "Unmapped Product"}], "most_recent_update": {"status": "SERVICE_OUTAGE"}, }, ] - assert normalize_google_status(incidents, "google-mail") == ServiceStatus.OPERATIONAL + assert ( + normalize_product_feed_status(incidents, "feed-product-a") == ServiceStatus.OPERATIONAL + ) def test_calendar_product(self): incidents = [ { - "affected_products": [{"title": "Google Calendar"}], + "affected_products": [{"title": "Product B"}], "most_recent_update": {"status": "degraded"}, }, ] - assert normalize_google_status(incidents, "google-calendar") == ServiceStatus.DEGRADED + assert normalize_product_feed_status(incidents, "feed-product-b") == ServiceStatus.DEGRADED def test_unknown_service_id(self): - assert normalize_google_status([], "google-drive") == ServiceStatus.UNKNOWN + assert normalize_product_feed_status([], "feed-product-unknown") == ServiceStatus.UNKNOWN # ── RSS Feed ────────────────────────────────────────────────────── + class TestRSSTitle: @pytest.mark.parametrize( "title,expected", diff --git a/backend/tests/test_poller_integration.py b/backend/tests/test_poller_integration.py index b61d280..3963e74 100644 --- a/backend/tests/test_poller_integration.py +++ 
b/backend/tests/test_poller_integration.py @@ -1,6 +1,6 @@ """End-to-end poller tests using respx for httpx mocking. -These tests exercise each vendor poller against realistic mocked responses +These tests exercise each poller against realistic mocked responses to confirm they produce the right PollResult shape for both happy paths and failure modes (timeouts, 5xx, malformed JSON). """ @@ -9,14 +9,14 @@ import pytest import respx -from app.poller.google_poller import poll_google +from app.poller.active_incidents_poller import poll_active_incidents +from app.poller.current_status_poller import poll_current_status from app.poller.normalizer import ServiceStatus +from app.poller.product_feed_poller import poll_product_feed from app.poller.resilience import configure_breakers -from app.poller.ringcentral_poller import poll_ringcentral -from app.poller.salesforce_poller import poll_salesforce -from app.poller.slack_poller import poll_slack +from app.poller.service_array_poller import poll_service_array from app.poller.statuspage_poller import poll_all_statuspage, poll_statuspage -from app.poller.zendesk_poller import poll_zendesk +from app.poller.trust_incidents_poller import poll_trust_incidents @pytest.fixture(autouse=True) @@ -27,17 +27,20 @@ def _reset_breakers(): class TestStatuspagePoller: @respx.mock async def test_happy_path_operational(self): - respx.get("https://status.box.com/api/v2/summary.json").mock( - return_value=httpx.Response(200, json={ - "page": {"name": "Box"}, - "status": {"indicator": "none", "description": "All Systems Operational"}, - "components": [], - "incidents": [], - "scheduled_maintenances": [], - }) + respx.get("https://status.example.com/api/v2/summary.json").mock( + return_value=httpx.Response( + 200, + json={ + "page": {"name": "Example Service"}, + "status": {"indicator": "none", "description": "All Systems Operational"}, + "components": [], + "incidents": [], + "scheduled_maintenances": [], + }, + ) ) async with httpx.AsyncClient() 
as client: - result = await poll_statuspage(client, "https://status.box.com/api/v2/summary.json") + result = await poll_statuspage(client, "https://status.example.com/api/v2/summary.json") assert result.status == ServiceStatus.OPERATIONAL assert result.poll_failure_reason is None @@ -65,153 +68,186 @@ async def test_404_produces_http_failure_reason(self): @respx.mock async def test_batch_polling_dedupes_urls(self): """Two services sharing one poll_url should cost one HTTP call.""" - route = respx.get("https://status.atlassian.com/api/v2/summary.json").mock( - return_value=httpx.Response(200, json={ - "page": {"name": "Atlassian"}, - "status": {"indicator": "none"}, - "components": [ - {"name": "Jira", "status": "operational", "description": None}, - {"name": "Confluence", "status": "degraded_performance", "description": "latency"}, - ], - "incidents": [], - "scheduled_maintenances": [], - }) + route = respx.get("https://status.example.org/api/v2/summary.json").mock( + return_value=httpx.Response( + 200, + json={ + "page": {"name": "Example Org"}, + "status": {"indicator": "none"}, + "components": [ + {"name": "Issue Tracker", "status": "operational", "description": None}, + { + "name": "Wiki", + "status": "degraded_performance", + "description": "latency", + }, + ], + "incidents": [], + "scheduled_maintenances": [], + }, + ) ) services = [ { - "id": "jira", - "poll_url": "https://status.atlassian.com/api/v2/summary.json", - "statuspage_component_name": "Jira", + "id": "issue-tracker", + "poll_url": "https://status.example.org/api/v2/summary.json", + "statuspage_component_name": "Issue Tracker", }, { - "id": "confluence", - "poll_url": "https://status.atlassian.com/api/v2/summary.json", - "statuspage_component_name": "Confluence", + "id": "wiki", + "poll_url": "https://status.example.org/api/v2/summary.json", + "statuspage_component_name": "Wiki", }, ] async with httpx.AsyncClient() as client: results = await poll_all_statuspage(client, services) assert 
route.call_count == 1 result_map = dict(results) - assert result_map["jira"].status == ServiceStatus.OPERATIONAL - assert result_map["confluence"].status == ServiceStatus.DEGRADED + assert result_map["issue-tracker"].status == ServiceStatus.OPERATIONAL + assert result_map["wiki"].status == ServiceStatus.DEGRADED -class TestSlackPoller: +class TestCurrentStatusPoller: @respx.mock async def test_happy_path(self): - respx.get("https://slack-status.com/api/v2.0.0/current").mock( - return_value=httpx.Response(200, json={ - "status": "ok", - "active_incidents": [], - }) + respx.get("https://chat-status.example.com/api/v2.0.0/current").mock( + return_value=httpx.Response( + 200, + json={ + "status": "ok", + "active_incidents": [], + }, + ) ) async with httpx.AsyncClient() as client: - result = await poll_slack(client, "https://slack-status.com/api/v2.0.0/current") + result = await poll_current_status( + client, "https://chat-status.example.com/api/v2.0.0/current" + ) assert result.status == ServiceStatus.OPERATIONAL @respx.mock async def test_list_response_no_active_is_operational(self): - """slack-status.com redirects /current to /history which returns a - list of incident objects. An empty list or a list with only - resolved/completed incidents means OPERATIONAL.""" - respx.get("https://slack-status.com/api/v2.0.0/current").mock( - return_value=httpx.Response(200, json=[ - {"id": 1, "status": "resolved", "type": "incident", "title": "old"}, - ]) + """Endpoint redirects /current to /history which returns a list of incident + objects. 
An empty list or a list with only resolved/completed incidents + means OPERATIONAL.""" + respx.get("https://chat-status.example.com/api/v2.0.0/current").mock( + return_value=httpx.Response( + 200, + json=[ + {"id": 1, "status": "resolved", "type": "incident", "title": "old"}, + ], + ) ) async with httpx.AsyncClient() as client: - result = await poll_slack(client, "https://slack-status.com/api/v2.0.0/current") + result = await poll_current_status( + client, "https://chat-status.example.com/api/v2.0.0/current" + ) assert result.status == ServiceStatus.OPERATIONAL @respx.mock async def test_list_response_active_incident_maps_type(self): """Active incidents in the list response are mapped by `type`.""" - respx.get("https://slack-status.com/api/v2.0.0/current").mock( - return_value=httpx.Response(200, json=[ - {"id": 2, "status": "active", "type": "outage", "title": "Big one"}, - ]) + respx.get("https://chat-status.example.com/api/v2.0.0/current").mock( + return_value=httpx.Response( + 200, + json=[ + {"id": 2, "status": "active", "type": "outage", "title": "Big one"}, + ], + ) ) async with httpx.AsyncClient() as client: - result = await poll_slack(client, "https://slack-status.com/api/v2.0.0/current") + result = await poll_current_status( + client, "https://chat-status.example.com/api/v2.0.0/current" + ) assert result.status == ServiceStatus.MAJOR_OUTAGE assert result.status_detail == "Big one" @respx.mock async def test_unexpected_type_returns_unknown(self): """Neither dict nor list (e.g., a bare string) still UNKNOWNs out.""" - respx.get("https://slack-status.com/api/v2.0.0/current").mock( + respx.get("https://chat-status.example.com/api/v2.0.0/current").mock( return_value=httpx.Response(200, json="not-a-dict-or-list") ) async with httpx.AsyncClient() as client: - result = await poll_slack(client, "https://slack-status.com/api/v2.0.0/current") + result = await poll_current_status( + client, "https://chat-status.example.com/api/v2.0.0/current" + ) assert result.status 
== ServiceStatus.UNKNOWN -class TestSalesforcePoller: +class TestTrustIncidentsPoller: @respx.mock async def test_no_active_incidents(self): - respx.get("https://api.status.salesforce.com/v1/incidents").mock( + respx.get("https://trust.example.com/v1/incidents").mock( return_value=httpx.Response(200, json=[]) ) async with httpx.AsyncClient() as client: - result = await poll_salesforce( - client, "https://api.status.salesforce.com/v1/incidents", + result = await poll_trust_incidents( + client, + "https://trust.example.com/v1/incidents", ) assert result.status == ServiceStatus.OPERATIONAL @respx.mock async def test_network_error(self): - respx.get("https://api.status.salesforce.com/v1/incidents").mock( + respx.get("https://trust.example.com/v1/incidents").mock( side_effect=httpx.ConnectError("DNS fail") ) async with httpx.AsyncClient() as client: - result = await poll_salesforce( - client, "https://api.status.salesforce.com/v1/incidents", + result = await poll_trust_incidents( + client, + "https://trust.example.com/v1/incidents", ) assert result.status == ServiceStatus.UNKNOWN assert result.poll_failure_reason is not None assert "request_error" in result.poll_failure_reason -class TestZendeskPoller: +class TestActiveIncidentsPoller: @respx.mock async def test_happy_path(self): - respx.get("https://status.zendesk.com/api/incidents/active").mock( + respx.get("https://support.example.com/api/incidents/active").mock( return_value=httpx.Response(200, json={"data": []}) ) async with httpx.AsyncClient() as client: - result = await poll_zendesk( - client, "https://status.zendesk.com/api/incidents/active", + result = await poll_active_incidents( + client, + "https://support.example.com/api/incidents/active", ) assert result.status == ServiceStatus.OPERATIONAL -class TestRingCentralPoller: +class TestServiceArrayPoller: @respx.mock async def test_all_good(self): - respx.get("https://status.ringcentral.com/status.json").mock( - return_value=httpx.Response(200, json=[ - 
{"service": "Calling", "region": "US", "level": "Good", "alerts": []}, - ]) + respx.get("https://status.example.net/status.json").mock( + return_value=httpx.Response( + 200, + json=[ + {"service": "Calling", "region": "US", "level": "Good", "alerts": []}, + ], + ) ) async with httpx.AsyncClient() as client: - result = await poll_ringcentral( - client, "https://status.ringcentral.com/status.json", + result = await poll_service_array( + client, + "https://status.example.net/status.json", ) assert result.status == ServiceStatus.OPERATIONAL -class TestGooglePoller: +class TestProductFeedPoller: @respx.mock async def test_operational(self): - respx.get("https://www.google.com/appsstatus/incidents.json").mock( + respx.get("https://feed.example.com/incidents.json").mock( return_value=httpx.Response(200, json=[]) ) - services = [{"id": "google-mail"}, {"id": "google-calendar"}] + services = [{"id": "feed-product-a"}, {"id": "feed-product-b"}] async with httpx.AsyncClient() as client: - results = await poll_google( - client, "https://www.google.com/appsstatus/incidents.json", services, + results = await poll_product_feed( + client, + "https://feed.example.com/incidents.json", + services, ) assert len(results) == 2 for _, r in results: @@ -219,13 +255,13 @@ async def test_operational(self): @respx.mock async def test_error_propagates_to_each_service(self): - respx.get("https://www.google.com/appsstatus/incidents.json").mock( - return_value=httpx.Response(500) - ) - services = [{"id": "google-mail"}, {"id": "google-calendar"}] + respx.get("https://feed.example.com/incidents.json").mock(return_value=httpx.Response(500)) + services = [{"id": "feed-product-a"}, {"id": "feed-product-b"}] async with httpx.AsyncClient() as client: - results = await poll_google( - client, "https://www.google.com/appsstatus/incidents.json", services, + results = await poll_product_feed( + client, + "https://feed.example.com/incidents.json", + services, ) assert len(results) == 2 for _, r in results: 
diff --git a/backend/tests/test_postmortems.py b/backend/tests/test_postmortems.py index 0866a57..6387985 100644 --- a/backend/tests/test_postmortems.py +++ b/backend/tests/test_postmortems.py @@ -20,18 +20,18 @@ _BASE_EVENTS = [ { "id": 1, - "service_id": "okta", + "service_id": "identity-provider", "previous_status": "operational", "new_status": "degraded", "vendor_title": "Elevated error rates", "vendor_detail": "Users experiencing login failures", - "impact_statement": "Okta is degraded", + "impact_statement": "Identity Provider is degraded", "source": "statuspage_json", "created_at": "2026-04-24T10:00:00Z", }, { "id": 2, - "service_id": "okta", + "service_id": "identity-provider", "previous_status": "degraded", "new_status": "major_outage", "vendor_title": "Complete SSO failure", @@ -42,12 +42,12 @@ }, { "id": 3, - "service_id": "okta", + "service_id": "identity-provider", "previous_status": "major_outage", "new_status": "operational", "vendor_title": None, "vendor_detail": None, - "impact_statement": "Okta has recovered", + "impact_statement": "Identity Provider has recovered", "source": "statuspage_json", "created_at": "2026-04-24T11:00:00Z", }, @@ -57,17 +57,17 @@ def _sample_report(**overrides) -> dict: """Return a realistic report dict with all required keys.""" base = { - "service_id": "okta", - "service_name": "Okta", + "service_id": "identity-provider", + "service_name": "Identity Provider", "started_at": "2026-04-24T10:00:00Z", "resolved_at": "2026-04-24T11:00:00Z", "duration_seconds": 3600, "duration_human": "1h", "peak_severity": "major_outage", - "affected_downstream": ["Box Web", "Box Mobile"], + "affected_downstream": ["Content Platform Web", "Content Platform Mobile"], "event_count": 3, "events": list(_BASE_EVENTS), - "impact_summary": "Okta experienced major outage for 1h.", + "impact_summary": "Identity Provider experienced major outage for 1h.", } base.update(overrides) return base @@ -92,24 +92,22 @@ def 
test_render_includes_all_eight_sections_in_order(self): "## Action Items", ] positions = [md.index(s) for s in expected_sections] - assert positions == sorted(positions), ( - "Sections are not in the expected order" - ) + assert positions == sorted(positions), "Sections are not in the expected order" def test_render_auto_fills_summary_with_impact_summary(self): - report = _sample_report(impact_summary="Okta was down for exactly 1h.") + report = _sample_report(impact_summary="Identity Provider was down for exactly 1h.") md = render_markdown(report) summary_start = md.index("## Summary") impact_start = md.index("## Impact") summary_body = md[summary_start:impact_start] - assert "Okta was down for exactly 1h." in summary_body + assert "Identity Provider was down for exactly 1h." in summary_body def test_render_impact_lists_peak_severity_duration_count_affected(self): report = _sample_report( peak_severity="major_outage", duration_human="1h", event_count=3, - affected_downstream=["Box Web", "Box Mobile"], + affected_downstream=["Content Platform Web", "Content Platform Mobile"], ) md = render_markdown(report) impact_start = md.index("## Impact") @@ -119,8 +117,8 @@ def test_render_impact_lists_peak_severity_duration_count_affected(self): assert "major_outage" in impact_body assert "1h" in impact_body assert "3" in impact_body - assert "Box Web" in impact_body - assert "Box Mobile" in impact_body + assert "Content Platform Web" in impact_body + assert "Content Platform Mobile" in impact_body def test_render_impact_handles_empty_affected_downstream(self): report = _sample_report(affected_downstream=[]) @@ -140,6 +138,7 @@ def test_render_timeline_renders_events_chronologically(self): assert len(bullets) == 3 import re + time_pattern = re.compile(r"\d{2}:\d{2}:\d{2} UTC") arrow_pattern = re.compile(r"\w+ → \w+") for bullet in bullets: @@ -183,13 +182,17 @@ def test_render_timeline_prefers_vendor_title_over_detail_and_impact(self): # vendor_detail fallback (no 
vendor_title) event_detail = dict(event_all, vendor_title=None) md_detail = render_markdown(_sample_report(events=[event_detail])) - body_detail = md_detail[md_detail.index("## Timeline"):md_detail.index("## What Went Well")] + body_detail = md_detail[ + md_detail.index("## Timeline") : md_detail.index("## What Went Well") + ] assert "the detail" in body_detail # impact_statement fallback (no vendor_title, no vendor_detail) event_impact = dict(event_all, vendor_title=None, vendor_detail=None) md_impact = render_markdown(_sample_report(events=[event_impact])) - body_impact = md_impact[md_impact.index("## Timeline"):md_impact.index("## What Went Well")] + body_impact = md_impact[ + md_impact.index("## Timeline") : md_impact.index("## What Went Well") + ] assert "the impact" in body_impact def test_render_preserves_all_todo_placeholders(self): @@ -213,31 +216,41 @@ def test_render_frontmatter_is_valid_yaml(self): fm = yaml.safe_load(frontmatter_text) assert isinstance(fm, dict) required_keys = { - "service", "service_name", "started_at", "resolved_at", - "duration", "peak_severity", "affected_downstream", - "event_count", "status", + "service", + "service_name", + "started_at", + "resolved_at", + "duration", + "peak_severity", + "affected_downstream", + "event_count", + "status", } assert required_keys <= fm.keys() assert fm["status"] == "draft" def test_render_frontmatter_escapes_yaml_special_chars(self): # Service name with a colon — yaml.safe_dump must quote it properly - report = _sample_report(service_name="Okta: identity", service_id="okta-identity") + report = _sample_report( + service_name="Identity Provider: SSO", service_id="identity-provider-sso" + ) md = render_markdown(report) lines = md.splitlines() assert lines[0] == "---" end_idx = lines.index("---", 1) frontmatter_text = "\n".join(lines[1:end_idx]) fm = yaml.safe_load(frontmatter_text) - assert fm["service_name"] == "Okta: identity" + assert fm["service_name"] == "Identity Provider: SSO" # Service 
name with a leading dash - report2 = _sample_report(service_name="- Okta primary", service_id="okta2") + report2 = _sample_report( + service_name="- Identity Provider primary", service_id="identity-provider2" + ) md2 = render_markdown(report2) lines2 = md2.splitlines() end_idx2 = lines2.index("---", 1) fm2 = yaml.safe_load("\n".join(lines2[1:end_idx2])) - assert fm2["service_name"] == "- Okta primary" + assert fm2["service_name"] == "- Identity Provider primary" # --------------------------------------------------------------------------- @@ -257,11 +270,12 @@ async def test_write_creates_file_with_expected_filename(self, tmp_path): started_at = report["started_at"] resolved_at = report["resolved_at"] sha = hashlib.sha1( - f"{started_at}|{resolved_at}".encode(), usedforsecurity=False, + f"{started_at}|{resolved_at}".encode(), + usedforsecurity=False, ).hexdigest()[:6] dt = datetime.fromisoformat(started_at.replace("Z", "+00:00")).astimezone(UTC) compact = dt.strftime("%Y%m%dT%H%M%SZ") - expected_name = f"okta-{compact}-{sha}.md" + expected_name = f"identity-provider-{compact}-{sha}.md" assert result.name == expected_name @pytest.mark.asyncio @@ -350,7 +364,7 @@ def _raise(*args, **kwargs): # --------------------------------------------------------------------------- -async def _seed_recovery_scenario(db, write_lock, service_id: str = "okta") -> None: +async def _seed_recovery_scenario(db, write_lock, service_id: str = "identity-provider") -> None: """Seed the DB so generate_incident_report finds a complete incident window. 
Inserts a service (current_status=operational) and two status_events @@ -359,7 +373,7 @@ async def _seed_recovery_scenario(db, write_lock, service_id: str = "okta") -> N await db.execute( """INSERT OR REPLACE INTO services (id, display_name, category, poll_type, current_status) - VALUES (?, 'Okta', 'identity', 'statuspage_json', 'operational')""", + VALUES (?, 'Identity Provider', 'identity', 'statuspage_json', 'operational')""", (service_id,), ) # Event 1: started the incident (operational → degraded, >60s ago) @@ -381,10 +395,9 @@ async def _seed_recovery_scenario(db, write_lock, service_id: str = "okta") -> N class TestAlertingEngineIntegration: @pytest.mark.asyncio - async def test_engine_calls_write_postmortem_when_enabled( - self, db, tmp_path, monkeypatch - ): + async def test_engine_calls_write_postmortem_when_enabled(self, db, tmp_path, monkeypatch): from app.config import settings as real_settings + monkeypatch.setattr(real_settings, "postmortems_enabled", True) monkeypatch.setattr(real_settings, "postmortems_dir", str(tmp_path)) # Suppress slack alerting — no webhook configured @@ -394,11 +407,11 @@ async def test_engine_calls_write_postmortem_when_enabled( monkeypatch.setattr(real_settings, "alert_dedup_window_seconds", 1) write_lock = asyncio.Lock() - await _seed_recovery_scenario(db, write_lock, service_id="okta") + await _seed_recovery_scenario(db, write_lock, service_id="identity-provider") recovery_change = StatusChange( - service_id="okta", - service_display_name="Okta", + service_id="identity-provider", + service_display_name="Identity Provider", previous_status="degraded", new_status="operational", status_detail=None, @@ -408,7 +421,7 @@ async def test_engine_calls_write_postmortem_when_enabled( await process_changes(db, write_lock, [recovery_change]) - md_files = list(tmp_path.glob("okta-*.md")) + md_files = list(tmp_path.glob("identity-provider-*.md")) assert len(md_files) == 1, f"Expected 1 postmortem file, found: {md_files}" 
@pytest.mark.asyncio @@ -416,6 +429,7 @@ async def test_engine_does_not_call_write_postmortem_when_disabled( self, db, tmp_path, monkeypatch ): from app.config import settings as real_settings + monkeypatch.setattr(real_settings, "postmortems_enabled", False) monkeypatch.setattr(real_settings, "postmortems_dir", str(tmp_path)) monkeypatch.setattr(real_settings, "slack_webhook_url", None) @@ -423,11 +437,11 @@ async def test_engine_does_not_call_write_postmortem_when_disabled( monkeypatch.setattr(real_settings, "alert_dedup_window_seconds", 1) write_lock = asyncio.Lock() - await _seed_recovery_scenario(db, write_lock, service_id="okta") + await _seed_recovery_scenario(db, write_lock, service_id="identity-provider") recovery_change = StatusChange( - service_id="okta", - service_display_name="Okta", + service_id="identity-provider", + service_display_name="Identity Provider", previous_status="degraded", new_status="operational", status_detail=None, @@ -462,11 +476,11 @@ async def _failing_write(report, *, out_dir): monkeypatch.setattr(pm_module, "write_postmortem", _failing_write) write_lock = asyncio.Lock() - await _seed_recovery_scenario(db, write_lock, service_id="okta") + await _seed_recovery_scenario(db, write_lock, service_id="identity-provider") recovery_change = StatusChange( - service_id="okta", - service_display_name="Okta", + service_id="identity-provider", + service_display_name="Identity Provider", previous_status="degraded", new_status="operational", status_detail=None, diff --git a/backend/tests/test_resilience.py b/backend/tests/test_resilience.py index 1fdb159..36a26dd 100644 --- a/backend/tests/test_resilience.py +++ b/backend/tests/test_resilience.py @@ -26,7 +26,7 @@ def _fast_and_isolated_breakers(): class TestHostOf: def test_extracts_host(self): - assert host_of("https://status.box.com/api/v2/summary.json") == "status.box.com" + assert host_of("https://status.example.com/api/v2/summary.json") == "status.example.com" def 
test_bare_url_fallback(self): assert host_of("not-a-url") == "not-a-url" @@ -53,7 +53,10 @@ async def test_retries_transient_5xx_then_succeeds(self): ) async with httpx.AsyncClient() as client: resp = await resilient_fetch( - client, "https://example.com/flaky", attempts=3, timeout=5.0, + client, + "https://example.com/flaky", + attempts=3, + timeout=5.0, ) assert resp.status_code == 200 assert route.call_count == 2 @@ -68,58 +71,70 @@ async def test_retries_429(self): ) async with httpx.AsyncClient() as client: resp = await resilient_fetch( - client, "https://example.com/ratelimited", attempts=3, timeout=5.0, + client, + "https://example.com/ratelimited", + attempts=3, + timeout=5.0, ) assert resp.status_code == 200 assert route.call_count == 2 @respx.mock async def test_does_not_retry_404(self): - route = respx.get("https://example.com/missing").mock( - return_value=httpx.Response(404) - ) + route = respx.get("https://example.com/missing").mock(return_value=httpx.Response(404)) async with httpx.AsyncClient() as client: with pytest.raises(httpx.HTTPStatusError): await resilient_fetch( - client, "https://example.com/missing", attempts=3, timeout=5.0, + client, + "https://example.com/missing", + attempts=3, + timeout=5.0, ) # 404 is a hard failure — no retries assert route.call_count == 1 @respx.mock async def test_all_retries_exhausted_raises_transient(self): - respx.get("https://example.com/dead").mock( - return_value=httpx.Response(503) - ) + respx.get("https://example.com/dead").mock(return_value=httpx.Response(503)) async with httpx.AsyncClient() as client: with pytest.raises(TransientHTTPError): await resilient_fetch( - client, "https://example.com/dead", attempts=2, timeout=5.0, + client, + "https://example.com/dead", + attempts=2, + timeout=5.0, ) @respx.mock async def test_breaker_opens_after_threshold(self): """Two consecutive hard failures trip the breaker for this host. 
Subsequent calls raise CircuitBreakerOpen without hitting the network.""" - route = respx.get("https://trip.example.com/x").mock( - return_value=httpx.Response(500) - ) + route = respx.get("https://trip.example.com/x").mock(return_value=httpx.Response(500)) async with httpx.AsyncClient() as client: # First attempt: exhausts retries, counts as 1 failure with pytest.raises(TransientHTTPError): await resilient_fetch( - client, "https://trip.example.com/x", attempts=1, timeout=5.0, + client, + "https://trip.example.com/x", + attempts=1, + timeout=5.0, ) # Second attempt: also fails, tripping the threshold-2 breaker with pytest.raises(TransientHTTPError): await resilient_fetch( - client, "https://trip.example.com/x", attempts=1, timeout=5.0, + client, + "https://trip.example.com/x", + attempts=1, + timeout=5.0, ) # Third attempt: breaker is open, fast-fails without hitting network calls_before = route.call_count with pytest.raises(CircuitBreakerOpen): await resilient_fetch( - client, "https://trip.example.com/x", attempts=1, timeout=5.0, + client, + "https://trip.example.com/x", + attempts=1, + timeout=5.0, ) assert route.call_count == calls_before # no new network call @@ -136,11 +151,17 @@ async def test_breaker_isolates_hosts(self): for _ in range(2): with pytest.raises(TransientHTTPError): await resilient_fetch( - client, "https://bad.example.com/x", attempts=1, timeout=5.0, + client, + "https://bad.example.com/x", + attempts=1, + timeout=5.0, ) # Good host should still succeed cleanly resp = await resilient_fetch( - client, "https://good.example.com/x", attempts=1, timeout=5.0, + client, + "https://good.example.com/x", + attempts=1, + timeout=5.0, ) assert resp.status_code == 200 @@ -159,17 +180,26 @@ async def test_breaker_recovers_after_ttl(self): for _ in range(2): with pytest.raises(TransientHTTPError): await resilient_fetch( - client, "https://heal.example.com/x", attempts=1, timeout=5.0, + client, + "https://heal.example.com/x", + attempts=1, + timeout=5.0, ) 
with pytest.raises(CircuitBreakerOpen): await resilient_fetch( - client, "https://heal.example.com/x", attempts=1, timeout=5.0, + client, + "https://heal.example.com/x", + attempts=1, + timeout=5.0, ) # Wait for TTL to elapse, then probe succeeds await asyncio.sleep(0.6) resp = await resilient_fetch( - client, "https://heal.example.com/x", attempts=1, timeout=5.0, + client, + "https://heal.example.com/x", + attempts=1, + timeout=5.0, ) assert resp.status_code == 200 assert route.call_count == 3 @@ -189,9 +219,7 @@ def test_timeout(self): assert reason == "timeout" def test_transient_http(self): - detail, reason = describe_fetch_error( - TransientHTTPError(503, "https://example.com") - ) + detail, reason = describe_fetch_error(TransientHTTPError(503, "https://example.com")) assert detail == "HTTP 503" assert reason == "transient_http_503" diff --git a/backend/tests/test_routing.py b/backend/tests/test_routing.py index d7ac966..e61d5d2 100644 --- a/backend/tests/test_routing.py +++ b/backend/tests/test_routing.py @@ -15,7 +15,11 @@ async def _insert_service( - db, sid, status="operational", tier="important", override=None, + db, + sid, + status="operational", + tier="important", + override=None, ): await db.execute( """INSERT OR REPLACE INTO services @@ -43,16 +47,16 @@ def _make_change(service_id, new_status="degraded", prev_status="operational", e class TestDedupKey: def test_vendor_id_preferred(self): - key = build_dedup_key("box", "degraded", vendor_incident_id="inc-123") - assert key == "vendor:box:inc-123" + key = build_dedup_key("content-platform", "degraded", vendor_incident_id="inc-123") + assert key == "vendor:content-platform:inc-123" def test_fallback_when_no_vendor_id(self): - key = build_dedup_key("box", "degraded", vendor_incident_id=None) - assert key.startswith("fallback:box:degraded:") + key = build_dedup_key("content-platform", "degraded", vendor_incident_id=None) + assert key.startswith("fallback:content-platform:degraded:") def 
test_different_statuses_different_keys(self): - a = build_dedup_key("box", "degraded", None) - b = build_dedup_key("box", "major_outage", None) + a = build_dedup_key("content-platform", "degraded", None) + b = build_dedup_key("content-platform", "major_outage", None) assert a != b @@ -91,69 +95,70 @@ async def test_future_window_returns_false(self, db): class TestWasRecentlyAlerted: async def test_fresh_dedup_key_returns_false(self, db): - assert not await was_recently_alerted(db, "vendor:box:new", 3600) + assert not await was_recently_alerted(db, "vendor:content-platform:new", 3600) async def test_recent_same_key_returns_true(self, db): - await _insert_service(db, "box") + await _insert_service(db, "content-platform") await db.execute( """INSERT INTO alert_sent_log (dedup_key, service_id, severity, new_status, alert_kind, first_sent_at, last_updated_at) - VALUES ('vendor:box:inc-1', 'box', 'important', 'degraded', + VALUES ('vendor:content-platform:inc-1', 'content-platform', 'important', 'degraded', 'status_change', datetime('now', '-2 minutes'), datetime('now', '-2 minutes'))""" ) await db.commit() - assert await was_recently_alerted(db, "vendor:box:inc-1", 3600) + assert await was_recently_alerted(db, "vendor:content-platform:inc-1", 3600) async def test_outside_window_returns_false(self, db): - await _insert_service(db, "box") + await _insert_service(db, "content-platform") await db.execute( """INSERT INTO alert_sent_log (dedup_key, service_id, severity, new_status, alert_kind, first_sent_at, last_updated_at) - VALUES ('vendor:box:old', 'box', 'important', 'degraded', + VALUES ('vendor:content-platform:old', 'content-platform', 'important', 'degraded', 'status_change', datetime('now', '-25 hours'), datetime('now', '-25 hours'))""" ) await db.commit() - assert not await was_recently_alerted(db, "vendor:box:old", 86400) + assert not await was_recently_alerted(db, "vendor:content-platform:old", 86400) async def test_suppressed_rows_ignored(self, db): """A 
suppressed alert doesn't count toward dedup — it didn't actually fire.""" - await _insert_service(db, "box") + await _insert_service(db, "content-platform") await db.execute( """INSERT INTO alert_sent_log (dedup_key, service_id, severity, new_status, alert_kind, suppressed_by, first_sent_at, last_updated_at) - VALUES ('vendor:box:supp', 'box', 'important', 'degraded', + VALUES ('vendor:content-platform:supp', 'content-platform', 'important', 'degraded', 'status_change', 'maintenance_window', datetime('now', '-2 minutes'), datetime('now', '-2 minutes'))""" ) await db.commit() - assert not await was_recently_alerted(db, "vendor:box:supp", 3600) + assert not await was_recently_alerted(db, "vendor:content-platform:supp", 3600) class TestRouteStatusChange: @pytest.fixture(autouse=True) def _webhook_set(self, monkeypatch): monkeypatch.setattr( - settings, "slack_webhook_url", + settings, + "slack_webhook_url", "https://hooks.slack.com/services/x/y/z", ) async def test_critical_tier_adds_here_mention(self, db): - await _insert_service(db, "okta", tier="critical") - decision = await route_status_change(db, _make_change("okta")) + await _insert_service(db, "identity-provider", tier="critical") + decision = await route_status_change(db, _make_change("identity-provider")) assert decision.should_send assert decision.channel_mention == "" assert decision.tier == "critical" assert decision.suppressed_by is None async def test_important_tier_no_mention(self, db): - await _insert_service(db, "box", tier="important") - decision = await route_status_change(db, _make_change("box")) + await _insert_service(db, "content-platform", tier="important") + decision = await route_status_change(db, _make_change("content-platform")) assert decision.should_send assert decision.channel_mention is None assert decision.tier == "important" @@ -219,11 +224,13 @@ async def test_recovery_bypasses_dedup(self, db): async def test_aggregated_under_suppresses(self, db): await _insert_service(db, "dep", 
tier="critical") decision = await route_status_change( - db, _make_change("dep"), aggregated_under="Okta", + db, + _make_change("dep"), + aggregated_under="Identity Provider", ) assert not decision.should_send assert decision.suppressed_by == "aggregated_under_upstream" - assert decision.aggregated_under == "Okta" + assert decision.aggregated_under == "Identity Provider" async def test_no_webhook_is_recorded_as_suppressed(self, db, monkeypatch): monkeypatch.setattr(settings, "slack_webhook_url", None) @@ -248,29 +255,33 @@ async def test_webhook_override_null_falls_back_to_global(self, db): async def test_vendor_incident_id_drives_dedup_key(self, db): await _insert_service(db, "dedup-svc", tier="important") decision = await route_status_change( - db, _make_change("dedup-svc"), vendor_incident_id="abc123", + db, + _make_change("dedup-svc"), + vendor_incident_id="abc123", ) assert decision.dedup_key == "vendor:dedup-svc:abc123" class TestRecordAlert: async def test_records_fired_alert(self, db): - await _insert_service(db, "box") - change = _make_change("box") + await _insert_service(db, "content-platform") + change = _make_change("content-platform") from app.alerting.routing import RoutingDecision + decision = RoutingDecision( should_send=True, webhook_url="https://example.com", channel_mention=None, - dedup_key="vendor:box:x", + dedup_key="vendor:content-platform:x", tier="important", suppressed_by=None, ) await record_alert(db, change, decision) await db.commit() cursor = await db.execute( - "SELECT suppressed_by, tier_col FROM alert_sent_log WHERE dedup_key='vendor:box:x'" - .replace("tier_col", "severity") # table uses `severity` col name + "SELECT suppressed_by, tier_col FROM alert_sent_log WHERE dedup_key='vendor:content-platform:x'".replace( + "tier_col", "severity" + ) # table uses `severity` col name ) row = await cursor.fetchone() assert row is not None @@ -278,21 +289,22 @@ async def test_records_fired_alert(self, db): assert dict(row)["suppressed_by"] is 
None async def test_records_suppressed_alert(self, db): - await _insert_service(db, "box") - change = _make_change("box") + await _insert_service(db, "content-platform") + change = _make_change("content-platform") from app.alerting.routing import RoutingDecision + decision = RoutingDecision( should_send=False, webhook_url=None, channel_mention=None, - dedup_key="vendor:box:y", + dedup_key="vendor:content-platform:y", tier="informational", suppressed_by="tier_informational", ) await record_alert(db, change, decision) await db.commit() cursor = await db.execute( - "SELECT suppressed_by FROM alert_sent_log WHERE dedup_key='vendor:box:y'" + "SELECT suppressed_by FROM alert_sent_log WHERE dedup_key='vendor:content-platform:y'" ) row = dict(await cursor.fetchone()) assert row["suppressed_by"] == "tier_informational" @@ -309,48 +321,48 @@ async def _insert_dep(self, db, upstream, downstream, severity="high"): await db.commit() async def test_aggregates_when_threshold_met(self, db): - for sid in ["okta", "a", "b", "c", "d"]: + for sid in ["identity-provider", "a", "b", "c", "d"]: await _insert_service(db, sid) for dep in ["a", "b", "c", "d"]: - await self._insert_dep(db, "okta", dep) + await self._insert_dep(db, "identity-provider", dep) changes = [ - _make_change("okta", "major_outage"), + _make_change("identity-provider", "major_outage"), _make_change("a", "degraded"), _make_change("b", "degraded"), _make_change("c", "degraded"), _make_change("d", "operational"), # not affected → ignored ] grouped = await find_aggregation_candidates(db, changes, threshold=3) - assert "okta" in grouped - assert len(grouped["okta"]) == 3 - assert {c.service_id for c in grouped["okta"]} == {"a", "b", "c"} + assert "identity-provider" in grouped + assert len(grouped["identity-provider"]) == 3 + assert {c.service_id for c in grouped["identity-provider"]} == {"a", "b", "c"} async def test_no_aggregation_when_below_threshold(self, db): - for sid in ["okta", "a", "b"]: + for sid in 
["identity-provider", "a", "b"]: await _insert_service(db, sid) - await self._insert_dep(db, "okta", "a") - await self._insert_dep(db, "okta", "b") + await self._insert_dep(db, "identity-provider", "a") + await self._insert_dep(db, "identity-provider", "b") changes = [ - _make_change("okta", "degraded"), + _make_change("identity-provider", "degraded"), _make_change("a", "degraded"), ] grouped = await find_aggregation_candidates(db, changes, threshold=3) assert grouped == {} async def test_upstream_recovering_does_not_aggregate(self, db): - for sid in ["okta", "a", "b", "c"]: + for sid in ["identity-provider", "a", "b", "c"]: await _insert_service(db, sid) for dep in ["a", "b", "c"]: - await self._insert_dep(db, "okta", dep) + await self._insert_dep(db, "identity-provider", dep) changes = [ - _make_change("okta", new_status="operational", prev_status="major_outage"), + _make_change("identity-provider", new_status="operational", prev_status="major_outage"), _make_change("a", "degraded"), _make_change("b", "degraded"), _make_change("c", "degraded"), ] grouped = await find_aggregation_candidates(db, changes, threshold=3) # Upstream going back to operational isn't an outage event to aggregate - assert "okta" not in grouped + assert "identity-provider" not in grouped diff --git a/backend/tests/test_seeder.py b/backend/tests/test_seeder.py index 3099c42..10e7dab 100644 --- a/backend/tests/test_seeder.py +++ b/backend/tests/test_seeder.py @@ -1,5 +1,11 @@ -"""Tests for the YAML config loader and database seeder.""" +"""Tests for the YAML config loader and database seeder. +These tests load the committed example config explicitly (not the +settings-resolved path) so they stay deterministic even when an operator +has a gitignored services.local.yaml / dependencies.local.yaml present. 
+""" + +from pathlib import Path from urllib.parse import urlsplit import pytest @@ -11,27 +17,31 @@ load_services, ) +_CONFIG_DIR = Path(__file__).resolve().parent.parent / "config" +_SERVICES_YAML = _CONFIG_DIR / "services.yaml" +_DEPENDENCIES_YAML = _CONFIG_DIR / "dependencies.yaml" + class TestServiceConfig: def test_valid_manual_service(self): svc = ServiceConfig( - id="okta", - display_name="Okta", + id="identity-provider", + display_name="Identity Provider (SSO)", category="identity", poll_type="manual", ) - assert svc.id == "okta" + assert svc.id == "identity-provider" assert svc.poll_url is None def test_valid_polled_service(self): svc = ServiceConfig( - id="box", - display_name="Box", - category="productivity", + id="github", + display_name="GitHub", + category="engineering", poll_type="statuspage_json", - poll_url="https://status.box.com/api/v2/summary.json", + poll_url="https://www.githubstatus.com/api/v2/summary.json", ) - assert svc.poll_url == "https://status.box.com/api/v2/summary.json" + assert svc.poll_url == "https://www.githubstatus.com/api/v2/summary.json" def test_polled_service_without_url_fails(self): with pytest.raises(ValueError, match="requires a poll_url"): @@ -63,96 +73,109 @@ def test_invalid_poll_type_fails(self): class TestLoadServices: def test_loads_all_services(self): - services = load_services() - assert len(services) >= 25 + services = load_services(path=_SERVICES_YAML) + assert len(services) == 10 def test_service_types(self): - services = load_services() + services = load_services(path=_SERVICES_YAML) poll_types = {s.poll_type for s in services} assert "statuspage_json" in poll_types assert "manual" in poll_types - assert "google_json" in poll_types - assert "slack_api" in poll_types - def test_okta_is_manual(self): - services = load_services() - okta = next(s for s in services if s.id == "okta") - assert okta.poll_type == "manual" + def test_identity_provider_is_manual(self): + services = load_services(path=_SERVICES_YAML) + 
idp = next(s for s in services if s.id == "identity-provider") + assert idp.poll_type == "manual" - def test_box_has_poll_url(self): - services = load_services() - box = next(s for s in services if s.id == "box") - assert box.poll_type == "statuspage_json" - assert urlsplit(str(box.poll_url)).hostname == "status.box.com" + def test_github_has_poll_url(self): + services = load_services(path=_SERVICES_YAML) + gh = next(s for s in services if s.id == "github") + assert gh.poll_type == "statuspage_json" + assert urlsplit(str(gh.poll_url)).hostname == "www.githubstatus.com" class TestLoadDependencies: def test_loads_dependencies(self): - deps = load_dependencies() - assert "okta" in deps - assert len(deps["okta"]) >= 10 + deps = load_dependencies(path=_DEPENDENCIES_YAML) + assert "identity-provider" in deps + assert len(deps["identity-provider"]) == 8 - def test_okta_downstream_services(self): - deps = load_dependencies() - okta_targets = {t.service for t in deps["okta"]} - assert "box" in okta_targets - assert "slack" in okta_targets + def test_sso_downstream_services(self): + deps = load_dependencies(path=_DEPENDENCIES_YAML) + targets = {t.service for t in deps["identity-provider"]} + assert "github" in targets + assert "dropbox" in targets - def test_okta_downstream_count(self): - deps = load_dependencies() - assert len(deps["okta"]) >= 10 + def test_sso_downstream_count(self): + deps = load_dependencies(path=_DEPENDENCIES_YAML) + assert len(deps["identity-provider"]) == 8 def test_cross_validation_accepts_matching_services(self): - services = load_services() + services = load_services(path=_SERVICES_YAML) ids = {s.id for s in services} - # Should not raise - deps = load_dependencies(known_service_ids=ids) - assert "okta" in deps + # Should not raise — every edge references a known service id. 
+ deps = load_dependencies(path=_DEPENDENCIES_YAML, known_service_ids=ids) + assert "identity-provider" in deps def test_cross_validation_rejects_unknown_upstream(self, tmp_path): import yaml + bad = tmp_path / "bad_deps.yaml" - bad.write_text(yaml.safe_dump({ - "dependencies": { - "ghost_service": [ - {"service": "box", "impact": "x", "severity": "high"}, - ], - }, - })) + bad.write_text( + yaml.safe_dump( + { + "dependencies": { + "ghost_service": [ + {"service": "github", "impact": "x", "severity": "high"}, + ], + }, + } + ) + ) with pytest.raises(ValueError, match="Unknown upstream service 'ghost_service'"): - load_dependencies(path=bad, known_service_ids={"box"}) + load_dependencies(path=bad, known_service_ids={"github"}) def test_cross_validation_rejects_unknown_downstream(self, tmp_path): import yaml + bad = tmp_path / "bad_deps.yaml" - bad.write_text(yaml.safe_dump({ - "dependencies": { - "okta": [ - {"service": "phantom_app", "impact": "x", "severity": "high"}, - ], - }, - })) + bad.write_text( + yaml.safe_dump( + { + "dependencies": { + "identity-provider": [ + {"service": "phantom_app", "impact": "x", "severity": "high"}, + ], + }, + } + ) + ) with pytest.raises(ValueError, match="Unknown downstream service 'phantom_app'"): - load_dependencies(path=bad, known_service_ids={"okta"}) + load_dependencies(path=bad, known_service_ids={"identity-provider"}) def test_cross_validation_allows_all_internal_sentinel(self, tmp_path): import yaml + good = tmp_path / "deps.yaml" - good.write_text(yaml.safe_dump({ - "dependencies": { - "okta": [ - {"service": "all_internal", "impact": "x", "severity": "high"}, - ], - }, - })) + good.write_text( + yaml.safe_dump( + { + "dependencies": { + "identity-provider": [ + {"service": "all_internal", "impact": "x", "severity": "high"}, + ], + }, + } + ) + ) # Should not raise even though "all_internal" isn't in the id set - deps = load_dependencies(path=good, known_service_ids={"okta"}) - assert deps["okta"][0].service == 
"all_internal" + deps = load_dependencies(path=good, known_service_ids={"identity-provider"}) + assert deps["identity-provider"][0].service == "all_internal" class TestSeedDatabase: async def test_seed_services(self, db): - services = load_services() + services = load_services(path=_SERVICES_YAML) count = await seed_services_with_db(db, services) assert count == len(services) @@ -161,7 +184,7 @@ async def test_seed_services(self, db): assert row[0] == len(services) async def test_seed_services_idempotent(self, db): - services = load_services() + services = load_services(path=_SERVICES_YAML) await seed_services_with_db(db, services) await seed_services_with_db(db, services) @@ -170,31 +193,32 @@ async def test_seed_services_idempotent(self, db): assert row[0] == len(services) # Same count, not doubled async def test_seed_dependencies(self, db): - services = load_services() + services = load_services(path=_SERVICES_YAML) await seed_services_with_db(db, services) - deps = load_dependencies() + deps = load_dependencies(path=_DEPENDENCIES_YAML) all_ids = [s.id for s in services] count = await seed_deps_with_db(db, deps, all_ids) - assert count >= 14 + assert count == 10 # identity-provider (8 edges) + cloudflare (2 edges) cursor = await db.execute("SELECT count(*) FROM service_dependencies") row = await cursor.fetchone() - assert row[0] >= 14 + assert row[0] == 10 - async def test_okta_deps_seeded(self, db): - services = load_services() + async def test_sso_deps_seeded(self, db): + services = load_services(path=_SERVICES_YAML) await seed_services_with_db(db, services) - deps = load_dependencies() + deps = load_dependencies(path=_DEPENDENCIES_YAML) all_ids = [s.id for s in services] await seed_deps_with_db(db, deps, all_ids) cursor = await db.execute( - "SELECT count(*) FROM service_dependencies WHERE upstream_service_id='okta'" + "SELECT count(*) FROM service_dependencies " + "WHERE upstream_service_id='identity-provider'" ) row = await cursor.fetchone() - assert row[0] 
== 12 + assert row[0] == 8 # Helper functions that operate on a given db connection instead of the global one @@ -206,17 +230,20 @@ async def seed_services_with_db(db, services: list[ServiceConfig]) -> int: statuspage_component_name, status_page_url, current_status) VALUES (?, ?, ?, ?, ?, ?, ?, 'unknown')""", ( - svc.id, svc.display_name, svc.category, svc.poll_type, - svc.poll_url, svc.statuspage_component_name, svc.status_page_url, + svc.id, + svc.display_name, + svc.category, + svc.poll_type, + svc.poll_url, + svc.statuspage_component_name, + svc.status_page_url, ), ) await db.commit() return len(services) -async def seed_deps_with_db( - db, deps: dict[str, list[DependencyTarget]], all_ids: list[str] -) -> int: +async def seed_deps_with_db(db, deps: dict[str, list[DependencyTarget]], all_ids: list[str]) -> int: await db.execute("DELETE FROM service_dependencies") count = 0 for upstream, targets in deps.items(): diff --git a/backend/tests/test_services_api.py b/backend/tests/test_services_api.py index 9bdd86e..a6b08ca 100644 --- a/backend/tests/test_services_api.py +++ b/backend/tests/test_services_api.py @@ -17,11 +17,16 @@ async def seeded_app(tmp_path): db_path = str(tmp_path / "test.db") conn = await init_db(db_path) - services = load_services() - from tests.test_seeder import seed_deps_with_db, seed_services_with_db - + from tests.test_seeder import ( + _DEPENDENCIES_YAML, + _SERVICES_YAML, + seed_deps_with_db, + seed_services_with_db, + ) + + services = load_services(path=_SERVICES_YAML) await seed_services_with_db(conn, services) - deps = load_dependencies(known_service_ids={s.id for s in services}) + deps = load_dependencies(path=_DEPENDENCIES_YAML, known_service_ids={s.id for s in services}) await seed_deps_with_db(conn, deps, [s.id for s in services]) from app.main import app @@ -79,7 +84,7 @@ async def test_list_services_pending_status_null_by_default(self, client): class TestServiceDetailShape: async def 
test_detail_includes_pending_status_fields(self, client): - resp = await client.get("/api/services/okta") + resp = await client.get("/api/services/identity-provider") assert resp.status_code == 200 body = resp.json() svc = body["data"]["service"] diff --git a/backend/tests/test_slack_ack.py b/backend/tests/test_slack_ack.py index e4cc499..438dd59 100644 --- a/backend/tests/test_slack_ack.py +++ b/backend/tests/test_slack_ack.py @@ -120,6 +120,7 @@ async def ack_app(tmp_path, monkeypatch): monkeypatch.setattr(settings, "slack_signing_secret", SecretStr(SIGNING_SECRET)) from app.main import app + yield app, conn await close_db() @@ -174,7 +175,8 @@ async def test_valid_ack_updates_db_and_calls_response_url(ack_client): blocks = posted_body.get("blocks", []) context_texts = [ elem.get("text", "") - for b in blocks if b.get("type") == "context" + for b in blocks + if b.get("type") == "context" for elem in b.get("elements", []) if isinstance(elem.get("text"), str) ] @@ -244,7 +246,8 @@ async def test_disabled_returns_404(tmp_path, monkeypatch): sig = _slack_sign(body, SIGNING_SECRET, ts) async with AsyncClient( - transport=ASGITransport(app=app), base_url="http://test", + transport=ASGITransport(app=app), + base_url="http://test", ) as client: resp = await client.post( "/api/slack/interactivity", @@ -273,7 +276,8 @@ async def test_signing_secret_not_configured_returns_503(tmp_path, monkeypatch): sig = _slack_sign(body, SIGNING_SECRET, ts) async with AsyncClient( - transport=ASGITransport(app=app), base_url="http://test", + transport=ASGITransport(app=app), + base_url="http://test", ) as client: resp = await client.post( "/api/slack/interactivity", @@ -320,7 +324,8 @@ def test_ack_button_present_when_ack_enabled(monkeypatch): ) ack_actions = [ - b for b in payload["blocks"] + b + for b in payload["blocks"] if b.get("type") == "actions" and any(e.get("action_id") == "ack_alert" for e in b.get("elements", [])) ] @@ -346,7 +351,8 @@ def 
test_ack_button_absent_when_ack_disabled(monkeypatch): ) ack_actions = [ - b for b in payload["blocks"] + b + for b in payload["blocks"] if b.get("type") == "actions" and any(e.get("action_id") == "ack_alert" for e in b.get("elements", [])) ] @@ -369,7 +375,8 @@ def test_ack_button_absent_when_no_dedup_key(monkeypatch): ) ack_actions = [ - b for b in payload["blocks"] + b + for b in payload["blocks"] if b.get("type") == "actions" and any(e.get("action_id") == "ack_alert" for e in b.get("elements", [])) ] @@ -384,8 +391,8 @@ def test_aggregated_alert_has_ack_button_when_enabled(monkeypatch): monkeypatch.setattr(settings, "slack_ack_enabled", True) upstream = StatusChange( - service_id="okta", - service_display_name="Okta", + service_id="identity-provider", + service_display_name="Identity Provider", previous_status="operational", new_status="major_outage", status_detail=None, @@ -393,8 +400,8 @@ def test_aggregated_alert_has_ack_button_when_enabled(monkeypatch): status_page_url=None, ) dependent = StatusChange( - service_id="box", - service_display_name="Box", + service_id="content-platform", + service_display_name="Content Platform", previous_status="operational", new_status="major_outage", status_detail=None, @@ -405,12 +412,13 @@ def test_aggregated_alert_has_ack_button_when_enabled(monkeypatch): payload = build_aggregated_upstream_alert( upstream_change=upstream, dependents=[dependent], - impact_statement="Okta is down", - dedup_key="vendor:okta:inc-999", + impact_statement="Identity Provider is down", + dedup_key="vendor:identity-provider:inc-999", ) ack_actions = [ - b for b in payload["blocks"] + b + for b in payload["blocks"] if b.get("type") == "actions" and any(e.get("action_id") == "ack_alert" for e in b.get("elements", [])) ] @@ -425,8 +433,8 @@ def test_aggregated_alert_no_ack_button_when_disabled(monkeypatch): monkeypatch.setattr(settings, "slack_ack_enabled", False) upstream = StatusChange( - service_id="okta", - service_display_name="Okta", + 
service_id="identity-provider", + service_display_name="Identity Provider", previous_status="operational", new_status="major_outage", status_detail=None, @@ -437,12 +445,13 @@ def test_aggregated_alert_no_ack_button_when_disabled(monkeypatch): payload = build_aggregated_upstream_alert( upstream_change=upstream, dependents=[], - impact_statement="Okta is down", - dedup_key="vendor:okta:inc-999", + impact_statement="Identity Provider is down", + dedup_key="vendor:identity-provider:inc-999", ) ack_actions = [ - b for b in payload["blocks"] + b + for b in payload["blocks"] if b.get("type") == "actions" and any(e.get("action_id") == "ack_alert" for e in b.get("elements", [])) ] diff --git a/backend/tests/test_slack_slash.py b/backend/tests/test_slack_slash.py index 492a5f2..9a8a9a0 100644 --- a/backend/tests/test_slack_slash.py +++ b/backend/tests/test_slack_slash.py @@ -80,11 +80,25 @@ async def slash_app(tmp_path, monkeypatch): conn = await init_db(db_path) services = [ - ("okta", "Okta", "identity", "critical", "operational", "healthy"), - ("zoom", "Zoom", "collaboration", "important", "degraded", "healthy"), - ("jira_sm", "Jira Service Management", "itsm", "important", "operational", "healthy"), - ("slack_api", "Slack API", "collaboration", "critical", "operational", "healthy"), - ("slack_bot", "Slack Bot", "collaboration", "important", "operational", "healthy"), + ( + "identity-provider", + "Identity Provider", + "identity", + "critical", + "operational", + "healthy", + ), + ( + "video-conferencing", + "Video Conferencing", + "collaboration", + "important", + "degraded", + "healthy", + ), + ("ticketing", "Ticketing", "itsm", "important", "operational", "healthy"), + ("chat-platform", "Chat Platform", "collaboration", "critical", "operational", "healthy"), + ("chat-bot", "Chat Bot", "collaboration", "important", "operational", "healthy"), ("broken_svc", "Broken Service", "other", "low", "operational", "broken"), ] @@ -125,7 +139,7 @@ async def 
slash_client(slash_app): async def test_missing_signature_header(slash_client): client, _ = slash_client - body = _form_body_for_slash({"command": "/itstatus", "text": "okta"}) + body = _form_body_for_slash({"command": "/itstatus", "text": "identity-provider"}) ts = _ts_now() resp = await client.post( "/api/slack/slash", @@ -144,7 +158,7 @@ async def test_missing_signature_header(slash_client): async def test_bad_signature(slash_client): client, _ = slash_client - body = _form_body_for_slash({"command": "/itstatus", "text": "okta"}) + body = _form_body_for_slash({"command": "/itstatus", "text": "identity-provider"}) resp = await client.post( "/api/slack/slash", content=body, @@ -158,7 +172,7 @@ async def test_bad_signature(slash_client): async def test_stale_timestamp(slash_client): client, _ = slash_client - body = _form_body_for_slash({"command": "/itstatus", "text": "okta"}) + body = _form_body_for_slash({"command": "/itstatus", "text": "identity-provider"}) resp = await client.post( "/api/slack/slash", content=body, @@ -179,11 +193,9 @@ async def test_feature_disabled(tmp_path, monkeypatch): from app.main import app - body = _form_body_for_slash({"command": "/itstatus", "text": "okta"}) + body = _form_body_for_slash({"command": "/itstatus", "text": "identity-provider"}) - async with AsyncClient( - transport=ASGITransport(app=app), base_url="http://test" - ) as client: + async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client: resp = await client.post( "/api/slack/slash", content=body, @@ -207,13 +219,11 @@ async def test_signing_secret_unset(tmp_path, monkeypatch): from app.main import app - body = _form_body_for_slash({"command": "/itstatus", "text": "okta"}) + body = _form_body_for_slash({"command": "/itstatus", "text": "identity-provider"}) ts = _ts_now() sig = _slack_sign(body, SIGNING_SECRET, ts) - async with AsyncClient( - transport=ASGITransport(app=app), base_url="http://test" - ) as client: + async with 
AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client: resp = await client.post( "/api/slack/slash", content=body, @@ -234,20 +244,14 @@ async def test_signing_secret_unset(tmp_path, monkeypatch): async def test_exact_id_match(slash_client): client, _ = slash_client - body = _form_body_for_slash({"command": "/itstatus", "text": "okta"}) - resp = await client.post( - "/api/slack/slash", content=body, headers=_headers(body) - ) + body = _form_body_for_slash({"command": "/itstatus", "text": "identity-provider"}) + resp = await client.post("/api/slack/slash", content=body, headers=_headers(body)) assert resp.status_code == 200 data = resp.json() assert data["response_type"] == "ephemeral" # Header block should contain the display name - header_texts = [ - b["text"]["text"] - for b in data["blocks"] - if b.get("type") == "header" - ] - assert any("Okta" in t for t in header_texts) + header_texts = [b["text"]["text"] for b in data["blocks"] if b.get("type") == "header"] + assert any("Identity Provider" in t for t in header_texts) # Status is operational → green check emoji assert "✅" in data["text"] or "Operational" in data["text"] @@ -257,55 +261,41 @@ async def test_exact_id_match(slash_client): async def test_case_insensitive_display_name_match(slash_client): client, _ = slash_client - body = _form_body_for_slash({"command": "/itstatus", "text": "Okta"}) - resp = await client.post( - "/api/slack/slash", content=body, headers=_headers(body) - ) + body = _form_body_for_slash({"command": "/itstatus", "text": "Identity Provider"}) + resp = await client.post("/api/slack/slash", content=body, headers=_headers(body)) assert resp.status_code == 200 data = resp.json() - header_texts = [ - b["text"]["text"] - for b in data["blocks"] - if b.get("type") == "header" - ] - assert any("Okta" in t for t in header_texts) + header_texts = [b["text"]["text"] for b in data["blocks"] if b.get("type") == "header"] + assert any("Identity Provider" in t for t in 
header_texts) # ── 8. Unique substring match → found ───────────────────────────────────────── async def test_substring_match_unique(slash_client): - """'jir' matches only 'Jira Service Management' — unique → found.""" + """'ticket' matches only 'Ticketing' — unique → found.""" client, _ = slash_client - body = _form_body_for_slash({"command": "/itstatus", "text": "jir"}) - resp = await client.post( - "/api/slack/slash", content=body, headers=_headers(body) - ) + body = _form_body_for_slash({"command": "/itstatus", "text": "ticket"}) + resp = await client.post("/api/slack/slash", content=body, headers=_headers(body)) assert resp.status_code == 200 data = resp.json() - header_texts = [ - b["text"]["text"] - for b in data["blocks"] - if b.get("type") == "header" - ] - assert any("Jira" in t for t in header_texts) + header_texts = [b["text"]["text"] for b in data["blocks"] if b.get("type") == "header"] + assert any("Ticketing" in t for t in header_texts) # ── 9. Ambiguous substring → disambiguation text ────────────────────────────── async def test_substring_match_ambiguous(slash_client): - """'slac' matches both 'Slack API' and 'Slack Bot' but neither id exactly.""" + """'chat' matches both 'Chat Platform' and 'Chat Bot' but neither id exactly.""" client, _ = slash_client - body = _form_body_for_slash({"command": "/itstatus", "text": "slac"}) - resp = await client.post( - "/api/slack/slash", content=body, headers=_headers(body) - ) + body = _form_body_for_slash({"command": "/itstatus", "text": "chat"}) + resp = await client.post("/api/slack/slash", content=body, headers=_headers(body)) assert resp.status_code == 200 data = resp.json() text = data["text"] # Must mention both candidates - assert "Slack API" in text or "Slack Bot" in text + assert "Chat Platform" in text or "Chat Bot" in text assert "more specific" in text or "Multiple" in text @@ -315,9 +305,7 @@ async def test_substring_match_ambiguous(slash_client): async def test_no_match(slash_client): client, _ = 
slash_client body = _form_body_for_slash({"command": "/itstatus", "text": "notaservice"}) - resp = await client.post( - "/api/slack/slash", content=body, headers=_headers(body) - ) + resp = await client.post("/api/slack/slash", content=body, headers=_headers(body)) assert resp.status_code == 200 data = resp.json() assert "No service matches" in data["text"] @@ -330,9 +318,7 @@ async def test_no_match(slash_client): async def test_empty_text(slash_client): client, _ = slash_client body = _form_body_for_slash({"command": "/itstatus", "text": ""}) - resp = await client.post( - "/api/slack/slash", content=body, headers=_headers(body) - ) + resp = await client.post("/api/slack/slash", content=body, headers=_headers(body)) assert resp.status_code == 200 data = resp.json() assert "Usage" in data["text"] @@ -346,9 +332,7 @@ async def test_poller_broken_surfaces_as_unknown(slash_client): client, _ = slash_client # broken_svc has current_status=operational but poller_health=broken body = _form_body_for_slash({"command": "/itstatus", "text": "broken_svc"}) - resp = await client.post( - "/api/slack/slash", content=body, headers=_headers(body) - ) + resp = await client.post("/api/slack/slash", content=body, headers=_headers(body)) assert resp.status_code == 200 data = resp.json() # Should show Unknown, not Operational @@ -365,9 +349,7 @@ async def test_wrong_command(slash_client): """Slack expects 200 even for unrecognised command names.""" client, _ = slash_client body = _form_body_for_slash({"command": "/something-else", "text": ""}) - resp = await client.post( - "/api/slack/slash", content=body, headers=_headers(body) - ) + resp = await client.post("/api/slack/slash", content=body, headers=_headers(body)) assert resp.status_code == 200 data = resp.json() assert "Unknown slash command" in data["text"] diff --git a/backend/tests/test_templates.py b/backend/tests/test_templates.py index e5777b9..415ff13 100644 --- a/backend/tests/test_templates.py +++ 
b/backend/tests/test_templates.py @@ -1,12 +1,19 @@ """Tests for impact statement templates.""" from app.alerting.templates import generate_impact_statement, generate_summary_text +from app.config import settings from app.poller.change_detector import StatusChange -def _make_change(service_id="test-svc", display_name="Test Service", - previous="operational", new="degraded", detail=None, - poll_type="statuspage_json", url=None): +def _make_change( + service_id="test-svc", + display_name="Test Service", + previous="operational", + new="degraded", + detail=None, + poll_type="statuspage_json", + url=None, +): return StatusChange( service_id=service_id, service_display_name=display_name, @@ -19,9 +26,16 @@ def _make_change(service_id="test-svc", display_name="Test Service", def _make_downstream(names): - return [{"service_name": n, "service_id": n.lower(), "severity": "high", - "impact_description": f"{n} impacted", "current_status": "operational"} - for n in names] + return [ + { + "service_name": n, + "service_id": n.lower(), + "severity": "high", + "impact_description": f"{n} impacted", + "current_status": "operational", + } + for n in names + ] class TestGenerateImpactStatement: @@ -43,11 +57,11 @@ def test_generic_major_outage(self): def test_with_downstream(self): change = _make_change(new="degraded", detail="Slow") - downstream = _make_downstream(["Jira", "Confluence"]) + downstream = _make_downstream(["Ticketing", "Team Wiki"]) result = generate_impact_statement(change, downstream) assert "may impact" in result - assert "Jira" in result - assert "Confluence" in result + assert "Ticketing" in result + assert "Team Wiki" in result def test_recovery(self): change = _make_change(new="operational", previous="degraded") @@ -55,29 +69,51 @@ def test_recovery(self): assert "recovered" in result assert "operational" in result - def test_okta_outage(self): - change = _make_change(service_id="okta", display_name="Okta", new="major_outage") - downstream = 
_make_downstream(["Box", "Slack", "Zoom"]) + def test_sso_broker_outage(self, monkeypatch): + monkeypatch.setattr(settings, "sso_broker_service_id", "identity-provider") + change = _make_change( + service_id="identity-provider", display_name="Identity Provider", new="major_outage" + ) + downstream = _make_downstream(["Service A", "Service B", "Service C"]) result = generate_impact_statement(change, downstream) assert "SSO authentication is unavailable" in result - assert "Box" in result + assert "Service A" in result assert "avoid logging out" in result - def test_okta_degraded(self): - change = _make_change(service_id="okta", display_name="Okta", new="degraded") - downstream = _make_downstream(["Box", "Slack"]) + def test_sso_broker_degraded(self, monkeypatch): + monkeypatch.setattr(settings, "sso_broker_service_id", "identity-provider") + change = _make_change( + service_id="identity-provider", display_name="Identity Provider", new="degraded" + ) + downstream = _make_downstream(["Service A", "Service B"]) result = generate_impact_statement(change, downstream) assert "SSO authentication" in result assert "may be affected" in result - def test_okta_partial_uses_outage_template(self): - change = _make_change(service_id="okta", display_name="Okta", new="partial_outage") - downstream = _make_downstream(["Box"]) + def test_sso_broker_partial_uses_outage_template(self, monkeypatch): + monkeypatch.setattr(settings, "sso_broker_service_id", "identity-provider") + change = _make_change( + service_id="identity-provider", display_name="Identity Provider", new="partial_outage" + ) + downstream = _make_downstream(["Service A"]) result = generate_impact_statement(change, downstream) assert "SSO authentication is unavailable" in result + def test_sso_broker_unset_uses_generic_template(self, monkeypatch): + # With no broker configured, even the identity service gets the + # generic severity template — no hardcoded vendor special-casing. 
+ monkeypatch.setattr(settings, "sso_broker_service_id", None) + change = _make_change( + service_id="identity-provider", display_name="Identity Provider", new="major_outage" + ) + result = generate_impact_statement(change, []) + assert "MAJOR OUTAGE" in result + assert "SSO authentication" not in result + def test_generic_major_outage_service(self): - change = _make_change(service_id="some-service", display_name="Some Service", new="major_outage") + change = _make_change( + service_id="some-service", display_name="Some Service", new="major_outage" + ) result = generate_impact_statement(change, []) assert "Some Service" in result assert "MAJOR OUTAGE" in result @@ -101,12 +137,12 @@ def test_all_healthy(self): assert "operational" in result def test_with_incidents(self): - result = generate_summary_text(29, 2, ["Okta", "Slack"]) + result = generate_summary_text(29, 2, ["Identity Provider", "Chat Platform"]) assert "2 active incident" in result assert "29" in result - assert "Okta" in result - assert "Slack" in result + assert "Identity Provider" in result + assert "Chat Platform" in result def test_single_incident(self): - result = generate_summary_text(29, 1, ["Box"]) + result = generate_summary_text(29, 1, ["Content Platform"]) assert "1 active incident" in result diff --git a/com.box.it-health-dashboard.plist b/com.company.it-health-dashboard.plist similarity index 94% rename from com.box.it-health-dashboard.plist rename to com.company.it-health-dashboard.plist index 131565f..141f2ba 100644 --- a/com.box.it-health-dashboard.plist +++ b/com.company.it-health-dashboard.plist @@ -4,10 +4,10 @@ IT Service Health Dashboard — launchd daemon (Phase 6 hardened). Install: - cp com.box.it-health-dashboard.plist \ - /Library/LaunchDaemons/com.box.it-health-dashboard.plist + cp com.company.it-health-dashboard.plist \ + /Library/LaunchDaemons/com.company.it-health-dashboard.plist # Edit the /path/to/ placeholders below first. 
- sudo launchctl bootstrap system /Library/LaunchDaemons/com.box.it-health-dashboard.plist + sudo launchctl bootstrap system /Library/LaunchDaemons/com.company.it-health-dashboard.plist Key hardening decisions (see PRODUCTION-ROADMAP.md Phase 6): - KeepAlive is dict-form: restart on crash, NOT on clean exit. @@ -23,7 +23,7 @@ Label - com.box.it-health-dashboard + com.company.it-health-dashboard ProgramArguments diff --git a/deploy/Caddyfile.example b/deploy/Caddyfile.example index 68bf68b..8f18933 100644 --- a/deploy/Caddyfile.example +++ b/deploy/Caddyfile.example @@ -14,12 +14,12 @@ # sudo caddy validate --config /opt/it-health/deploy/Caddyfile # sudo caddy run --config /opt/it-health/deploy/Caddyfile # or via brew services # -# For production: set up a com.box.it-health-caddy.plist daemon +# For production: set up a com.company.it-health-caddy.plist daemon # mirroring the Litestream sidecar plist in this directory. { # Global options - email ops@box.example # ACME registration — change to your ops alias + email ops@example.com # ACME registration — change to your ops alias admin off # don't expose the Caddy admin API } @@ -27,7 +27,7 @@ # Using `tls internal` issues a cert from Caddy's local CA, which your # clients will need to trust. For a publicly-routable host, swap for a # real domain and ACME handles cert issuance automatically. -health.box.corp { +health.example.internal { tls internal # Hardening — keep these even for VPN-internal deploys. Attackers @@ -78,7 +78,7 @@ health.box.corp { health_status 2xx } - # Structured access log — JSON so it ships into whatever Box uses. + # Structured access log — JSON so it ships into whatever log aggregator you use. 
log { output file /var/log/it-health/caddy-access.log { roll_size 10mb diff --git a/deploy/com.box.it-health-dashboard-litestream.plist.example b/deploy/com.company.it-health-dashboard-litestream.plist.example similarity index 90% rename from deploy/com.box.it-health-dashboard-litestream.plist.example rename to deploy/com.company.it-health-dashboard-litestream.plist.example index 9d25a86..50c6b62 100644 --- a/deploy/com.box.it-health-dashboard-litestream.plist.example +++ b/deploy/com.company.it-health-dashboard-litestream.plist.example @@ -1,10 +1,10 @@