From 62b73f19e3317eff34885a7df498733a941ad7a8 Mon Sep 17 00:00:00 2001 From: saagpatel Date: Sun, 31 May 2026 00:07:25 -0700 Subject: [PATCH] docs: lean CLAUDE.md (claude-md-lint) --- .gitignore | 1 + CLAUDE.md | 147 ++++++++++++------------- docs/executive-view-redesign/CLAUDE.md | 31 +++--- 3 files changed, 83 insertions(+), 96 deletions(-) diff --git a/.gitignore b/.gitignore index e975bf5..0029c8d 100644 --- a/.gitignore +++ b/.gitignore @@ -34,3 +34,4 @@ frontend/dist/ # OS .DS_Store +CLAUDE.md.bak diff --git a/CLAUDE.md b/CLAUDE.md index 9fd8f82..202b01a 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -1,92 +1,83 @@ # IT Service Health Dashboard -## Overview -Internal web dashboard that aggregates real-time health status of ~30 SaaS services IT supports at Box. Polls vendor status pages via Statuspage.io JSON API, Google Workspace JSON feed, Slack's native status API, and RSS/Atom feeds. Enriches with dependency mapping and templated impact statements. Displays a unified status board with timeline view and posts alerts to Slack. Deployed on a Mac Mini behind corporate VPN. Designed for IT engineers (deep triage) and IT leadership / company-wide visibility (situational awareness). +Internal web dashboard aggregating real-time health of ~30 SaaS services at Box. Polls Statuspage.io JSON API, Google Workspace JSON feed, Slack native status API, and RSS/Atom feeds. Enriches with dependency mapping and templated impact statements. Displays a unified status board with timeline view, posts alerts to Slack. Deployed on a Mac Mini behind corporate VPN. -## Roadmap — READ FIRST +## Roadmap -- **Active roadmap:** [PRODUCTION-ROADMAP.md](./PRODUCTION-ROADMAP.md) — this is the source of truth for current and upcoming work. -- **Historical v1 spec:** [IMPLEMENTATION-ROADMAP.md](./IMPLEMENTATION-ROADMAP.md) — v1 is **shipped**; this doc is archived for reference only. Do not start new work against it. +Active source of truth: [PRODUCTION-ROADMAP.md](./PRODUCTION-ROADMAP.md) — read before proposing any work. +Historical v1 spec: [IMPLEMENTATION-ROADMAP.md](./IMPLEMENTATION-ROADMAP.md) — archived, v1 shipped. -All new sessions must read PRODUCTION-ROADMAP.md before proposing work. +## Stack -## Current Phase -**v2 SHIPPED — Phases 0–6 complete; Phase 2B + Phase 7 (Statuspage inbound webhook + Slack ack) in tree, gated off by default.** Auth, vendor resilience, alert quality, observability, data lifecycle, UX productionization, and platform polish all landed. **356 tests passing.** The dashboard is production-grade; a mature IT team can rely on it. See PRODUCTION-ROADMAP.md for the exit-criteria detail on each phase. +- **Python** 3.12+ / FastAPI 0.115+ / httpx 0.28+ / aiosqlite 0.21+ / APScheduler 3.10+ +- **Config / validation:** PyYAML 6.0+ / Pydantic 2.10+ / feedparser 6.0+ +- **Frontend:** React 19 (Vite 8+) + Tailwind CSS 4+; FastAPI serves the built static files +- **Observability:** structlog (JSON), prometheus-client, sentry-sdk[fastapi], Healthchecks.io +- **Resilience:** stamina (retries) + purgatory (per-host circuit breakers) +- **Production process manager:** launchd (macOS); Caddy in front for HTTPS + header auth -Main also includes a parallel UX sprint that shipped alongside Phase 5: -- **Executive / Engineer view toggle** — `ViewContext` gates the grid vs category summary and engineer-only affordances (graph, timeline, shortcuts). -- **PWA** — `vite-plugin-pwa` registers a service worker with 55-entry precache; `ReloadPrompt` surfaces updates. -- **`recharts` SLA trend** — the service-detail drawer renders 7/30-day uptime history. -- **Daily `VACUUM INTO` backup** — `app/backup.py` writes a snapshot at `settings.backup_time_hour`, independent of Litestream. +## Build / Test / Run -**Phase 7 partially landed:** -- **Statuspage inbound webhook** (`POST /api/webhooks/statuspage/{service_id}`, HMAC-SHA256, optional replay protection) — code in `backend/app/router_webhooks.py`, gated by `WEBHOOKS_ENABLED` (default false). Writes directly through the alerting pipeline, bypassing flap suppression. -- **Slack ack flow** (`POST /api/slack/interactivity`, v0 signing-secret) — code in `backend/app/router_slack.py`, gated by `SLACK_ACK_ENABLED` (default false). Block Kit messages only include the Acknowledge button when the flag is true. -- Both features require a public endpoint (Cloudflare Tunnel / Caddy allowlist / ngrok) before flipping the flag. They ship off-by-default so the main app is unaffected. +```bash +# Set up backend +python3.13 -m venv .venv && source .venv/bin/activate +pip install -r backend/requirements.txt -**Phase 7 further landed** — postmortem automation (`POSTMORTEMS_ENABLED`), SLO fuel-gauge view + multi-burn-rate alerting (`SLO_BURN_RATE_ENABLED`), and Slack `/itstatus` slash command (`SLACK_SLASH_ENABLED`) all shipped, feature-gated off by default. **Still open:** LLM-layer impact statements, Splunk/JSM/ThousandEyes integration. +# Build frontend +cd frontend && npm install && npm run build && cd .. -## Tech Stack -- **Python:** 3.12+ -- **Backend framework:** FastAPI 0.115+ -- **Async HTTP:** httpx 0.28+ -- **Database:** SQLite via aiosqlite 0.21+ -- **Task scheduling:** APScheduler 3.10+ -- **RSS parsing:** feedparser 6.0+ -- **Slack alerting:** httpx (raw webhook POST — no SDK needed) -- **Config:** PyYAML 6.0+ -- **Data validation:** Pydantic 2.10+ -- **Frontend:** React 19 (Vite 8+) + Tailwind CSS 4+ -- **Process manager:** launchd (macOS) for production +# (Optional) seed demo data +cd backend && python -m scripts.seed_demo_data && cd .. -## Shipped production additions (Phases 0–6) - -All of these are in `requirements.txt` and active: -- **Resilience:** `stamina` (retries) + `purgatory` (per-host circuit breakers) -- **Observability:** `structlog`, `prometheus-client`, `sentry-sdk[fastapi]`, Healthchecks.io dead-man's switch, `QueueListener` offloads file I/O -- **DB:** `aiosqlitepool` in requirements (pool migration deferred); Litestream config template in `deploy/` -- **Frontend:** Lucide icons, `recharts` SLA trend, IBM Plex fonts -- **CI:** GitHub Actions — `uv`, `ruff`, `mypy --strict`, `pytest`; CodeQL analysis - -## Deferred UX additions (optional) -- TanStack Query v5, shadcn/ui `Sheet` + `Command`, Dagre hierarchical dep graph - -## Development Conventions -- Python: type hints on all functions, async/await for all I/O, no blocking calls -- File naming: snake_case for Python, kebab-case for React files, PascalCase for React components -- Git commits: conventional commits — feat:, fix:, chore: -- Testing: pytest + pytest-asyncio for backend; **`respx`** for httpx mocking in poller tests -- Config: all service definitions and dependency mappings in YAML, never hardcoded -- Logging: structlog JSON (Phase 3 complete — structlog is active) -- Error handling: all HTTP calls wrapped in try/except with timeout, retry, graceful degradation - -## Key Decisions -| Decision | Choice | Why | -|----------|--------|-----| -| Primary data source | Statuspage.io JSON API (`/api/v2/summary.json`) | Most vendors use Statuspage.io; JSON is structured, no auth, not rate-limited | -| Google Workspace | Google's custom JSON feed + RSS | Google uses its own status dashboard, not Statuspage.io | -| Slack status | Slack native Status API (`slack-status.com/api/v2.0.0/current`) | Slack has a dedicated JSON status API | -| Database | SQLite + Litestream (Phase 4) | Demo-scale + ~1s RPO backup; Postgres deferred to >100 writes/s | -| LLM layer | None (v1) | Template-based summaries; LLM deferred to post-Phase-7 | -| Auth | Bearer token on admin endpoints (Phase 0); VPN-only for reads | Demo had no auth; production requires it for writes | -| Hosting | Mac Mini on corporate network, Caddy in front (Phase 6) | Always-on, VPN-accessible; Caddy adds HTTPS + header auth | -| Slack alerting | Raw httpx POST with Block Kit; ack buttons (Phase 2) | No SDK needed for webhooks; ack flow requires interactivity endpoint | -| Frontend serving | FastAPI serves built React static files | Single process, no nginx | -| Scheduled maintenance | First-class DB table (Phase 2) | Vendor windows auto-populated; suppresses alerts | - -## Do NOT -- Do not start work that isn't in a PRODUCTION-ROADMAP.md phase. If it doesn't fit, discuss first. -- Do not integrate Splunk, ThousandEyes, Datadog, or JSM — those are Phase 7+. -- Do not build an LLM integration yet — post-Phase-7. -- Do not remove the bearer-token auth on admin endpoints once added. VPN is not sufficient for write endpoints. -- Do not use synchronous I/O — all network calls must be async. -- Do not hardcode service definitions in Python — they live in services.yaml. -- Do not use slack-sdk — use raw httpx POST for webhook simplicity. -- Do not parse RSS when a JSON API is available — JSON is always preferred. -- Do not poll more frequently than every 60 seconds — courtesy limit with vendor APIs. -- Do not render a service as `operational` when its poller has failed. Use `unknown`. -- Do not dedup alerts on message text. Use `vendor_incident_id` when available. -- Do not use force-directed layout as default for the dependency graph. Use hierarchical (Dagre). +# Run — serves dashboard + API on port 8000 +cd backend && python run.py +``` + +Open `http://localhost:8000`. + +**CI:** GitHub Actions — `uv`, `ruff`, `mypy --strict`, `pytest`; CodeQL analysis. 356 tests passing. + +## Conventions + +- **I/O:** async/await throughout; no blocking calls in async context. +- **Config:** service definitions and dependency mappings live in `services.yaml`, never hardcoded in Python. +- **HTTP calls:** all wrapped in try/except with timeout, retry (stamina), and graceful degradation; return `unknown` status when a poller fails — not `operational`. +- **Alerting dedup:** use `vendor_incident_id` when available, not message text. +- **Slack integration:** raw httpx POST for webhooks (no slack-sdk); `POST /api/slack/interactivity` for ack flow. +- **Feed priority:** prefer JSON API over RSS when both exist. +- **Poll interval:** minimum 60 seconds (vendor courtesy limit). +- **File naming:** snake_case Python, kebab-case React files, PascalCase React components. +- **Commits:** conventional commits — feat:, fix:, chore:. +- **Testing:** pytest + pytest-asyncio; `respx` for httpx mocking in poller tests. + +## Key Architecture Decisions + +| Decision | Choice | Rationale | +|----------|--------|-----------| +| Primary data source | Statuspage.io `/api/v2/summary.json` | Most vendors use Statuspage.io; JSON, no auth, not rate-limited | +| Google Workspace | Google custom JSON feed + RSS | Google has its own status dashboard, not Statuspage.io | +| Slack status | `slack-status.com/api/v2.0.0/current` | Dedicated JSON status API | +| Database | SQLite + Litestream | Demo-scale + ~1s RPO; Postgres deferred to >100 writes/s | +| Auth | Bearer token on admin endpoints; VPN-only for reads | Bearer token required for write endpoints — VPN alone is insufficient | +| Hosting | Mac Mini + Caddy | Always-on, VPN-accessible; Caddy adds HTTPS + header auth | +| Dep graph layout | Hierarchical (Dagre), not force-directed | Force-directed is default-off; Dagre is the production default | +| LLM layer | Deferred (post-Phase-7) | Template-based summaries sufficient for v2 | + +## Feature Gates (off by default) + +Phase 7 code is in-tree but gated. Flip only when a public endpoint (Cloudflare Tunnel / Caddy allowlist / ngrok) is available: + +- `WEBHOOKS_ENABLED` — `POST /api/webhooks/statuspage/{service_id}` (HMAC-SHA256; `backend/app/router_webhooks.py`). Bypasses flap suppression; writes directly through the alerting pipeline. +- `SLACK_ACK_ENABLED` — `POST /api/slack/interactivity` (v0 signing-secret; `backend/app/router_slack.py`). +- `POSTMORTEMS_ENABLED` — postmortem automation. +- `SLO_BURN_RATE_ENABLED` — SLO fuel-gauge view + multi-burn-rate alerting. +- `SLACK_SLASH_ENABLED` — Slack `/itstatus` slash command. + +**Still open (Phase 7+):** LLM-layer impact statements, Splunk/JSM/ThousandEyes integration. + +## Scope Gate + +All new work must map to an active phase in PRODUCTION-ROADMAP.md. Splunk, ThousandEyes, Datadog, JSM, and LLM integrations are Phase 7+ — discuss before starting. # Portfolio Context diff --git a/docs/executive-view-redesign/CLAUDE.md b/docs/executive-view-redesign/CLAUDE.md index 4075492..90bded8 100644 --- a/docs/executive-view-redesign/CLAUDE.md +++ b/docs/executive-view-redesign/CLAUDE.md @@ -1,7 +1,7 @@ # Executive-View Redesign — Scoped Addendum ## Overview -This bundle redesigns the Executive-view codepath of the Pulse / IT Service Health dashboard so it reads from three meters in a conference room. It is an additive feature inside the existing React frontend — the Engineer view, backend, and polling pipeline are out of scope. The root `CLAUDE.md` at the repo root remains authoritative for everything outside this feature. +Redesigns the Executive-view codepath of the IT Service Health dashboard so it reads from three meters in a conference room. Additive feature inside the existing React frontend — Engineer view, backend, and polling pipeline are out of scope. The root `CLAUDE.md` at the repo root remains authoritative for everything outside this feature. ## Tech Stack (this feature only) - React: 19.2 (existing) @@ -9,19 +9,14 @@ This bundle redesigns the Executive-view codepath of the Pulse / IT Service Heal - Tailwind CSS: 4.2 — `@theme` token block in `frontend/src/styles/index.css` - recharts: 3.8 — already in tree for `SlaChart`; reused by the 30-day trend strip - lucide-react: 1.8 — icon set already in tree -- No new runtime deps. If a task appears to need one, stop and escalate. +- No new runtime deps. Stop and escalate if a task appears to need one. -## Development Conventions -- React components: PascalCase files, one component per file, under `frontend/src/components/executive/` for new surfaces and `frontend/src/views/` for the top-level view shell +## File Conventions +- New components: `frontend/src/components/executive/` (surfaces), `frontend/src/views/` (view shell) - Hooks: kebab-case `use-*.js` under `frontend/src/hooks/` -- Design tokens: declared only in `@theme` inside `frontend/src/styles/index.css`; never hardcode hex in components -- No `any`-equivalent escapes — components accept typed props via JSDoc or default-shape destructuring +- Design tokens: declare only in `@theme` inside `frontend/src/styles/index.css` — no hex values in components +- Props: type via JSDoc or default-shape destructuring — no `any`-equivalent escapes - Commits: conventional commits, e.g. `feat(exec-view): ...`, `style(tokens): ...` -- After every file change, re-read the diff before committing; follow the root repo's Done = Verify → Commit rule - -## Current Phase -**Phase 0: Foundation (tokens, data hook, view shell)** -See IMPLEMENTATION-ROADMAP.md for full phase details. ## Key Decisions | Decision | Choice | Why | @@ -32,11 +27,11 @@ See IMPLEMENTATION-ROADMAP.md for full phase details. | View integration | New `ExecutiveView.jsx` replaces the `CategorySummary` render branch inside `App.jsx` when `view === "executive"` | Matches existing `ViewContext` gating; no routing changes | | Trend strip library | `recharts` `` or `` reused from `SlaChart` pattern | No new bundle weight | -## Phase-Boundary Review -At the end of every phase, run `/ultrareview` before committing the phase-final code. Do not skip on phases that "feel small." +## Scope Gates +- **Phase scope:** Implement only what IMPLEMENTATION-ROADMAP.md defines for the current phase. Work outside the active phase requires a discussion first. +- **Engineer render path:** `ServiceGrid`, `ServiceDetail`, `DependencyGraph`, `Timeline` are out of scope — this feature runs only when `view === "executive"`. +- **Accent colors:** `--color-accent-alarm` is the sole eye-magnet. Keep all other surfaces neutral; no second accent color, no gradients. +- **Runtime dependencies:** recharts, lucide-react, tailwind 4, and date-fns are the complete palette. Escalate before adding anything. -## Do NOT -- Do not add features not in the current phase of IMPLEMENTATION-ROADMAP.md. -- Do not touch the Engineer render path (`ServiceGrid`, `ServiceDetail`, `DependencyGraph`, `Timeline`). This feature only runs when `view === "executive"`. -- Do not introduce a second accent color or a gradient. `--color-accent-alarm` is the only eye-magnet; everything else is neutral surface + typography. -- Do not add runtime dependencies. recharts, lucide-react, tailwind 4, and date-fns are the complete palette. +## Phase-Boundary Review +At the end of every phase, run `/code-review` on the branch diff before committing the phase-final code. Do not skip on phases that "feel small." (Note: `/ultrareview` referenced in earlier drafts does not exist; use `/code-review` instead.)