saagpatel/ITServiceHealth

IT Service Health Dashboard

Real-time status monitoring dashboard for ~30 SaaS services used across an enterprise IT environment. Polls vendor status pages every 60 seconds, detects changes, generates impact statements using a service dependency graph, posts Slack alerts, and displays a unified dark-themed operations dashboard.

Project status

  • v1 (demo-ready) — SHIPPED. All original spec delivered: polling, normalization, change detection, Slack alerting, React UI, dependency graph, timeline, SLA tracking, incident clustering, auto reports.
  • v2 (production-ready) — SHIPPED. Phases 0–6 of the production roadmap complete: bearer-token auth, vendor resilience (stamina + purgatory), alert quality (flap suppression, dedup, tier routing, dependency correlation, maintenance windows, flapping-badge UI), observability (structlog, Prometheus /metrics, Sentry, Healthchecks.io dead-man's switch), data lifecycle (production pragmas, retention, Litestream streaming + daily VACUUM INTO snapshot), UX productionization (severity-sorted grid, distinct poller-broken state, a11y + keyboard nav, Executive/Engineer view toggle, PWA, recharts SLA trend), and platform polish (CI, pre-commit, hardened launchd plist, Caddy, Keychain secrets). 378 tests passing.
  • v2 Phase 2B + Phase 7 — in tree, gated off. Statuspage inbound webhook receiver (WEBHOOKS_ENABLED), Slack ack flow (SLACK_ACK_ENABLED), postmortem drafts (POSTMORTEMS_ENABLED), SLO fuel-gauge + multi-burn-rate alerting (SLO_BURN_RATE_ENABLED), and Slack /itstatus slash command (SLACK_SLASH_ENABLED) all shipped with tests but default off. Flip each flag once its prerequisites are in place (public endpoint for Slack features; postmortems need only a writable POSTMORTEMS_DIR).
  • v2 Phase 7 remainder — optional. LLM-layer impact statements; log-aggregation / ITSM / synthetic-monitoring integrations. Not on a fixed schedule; add as demand emerges.

Active roadmap: PRODUCTION-ROADMAP.md — exit-criteria detail for every phase. Historical spec: IMPLEMENTATION-ROADMAP.md — archived; v1 is complete.

Architecture

[Vendor Status Pages]
    |-- Statuspage.io JSON API (15 services)
    |-- Chat vendor status API (1 service)
    |-- Productivity suite JSON feed (2 services)
    |-- Manual updates via POST /api/admin/status (11 services)
              | (async poll every 60s)
       [Poll Orchestrator]
              |
       [Status Normalizer] --> 5-state enum: operational|degraded|partial_outage|major_outage|unknown
              |
       [Change Detector] --> diff against DB, write status_events
              |
       [Impact Statement Engine] --> dependency graph + templates
       [Slack Alerter] --> Block Kit message to the ops-alert channel
       [SQLite Writer] --> update services, insert events
              |
       [FastAPI REST API] --> /api/services, /api/timeline, /api/summary
              |
       [React Dashboard] <-- auto-refresh 30s

Quick Start

# 1. Clone and enter project
git clone <repo-url> && cd ITServiceHealth

# 2. Set up Python environment
python3.13 -m venv .venv && source .venv/bin/activate
pip install -r backend/requirements.txt

# 3. Build frontend
cd frontend && npm install && npm run build && cd ..

# 4. (Optional) Seed demo data for a populated timeline
cd backend && python -m scripts.seed_demo_data && cd ..

# 5. Run (serves dashboard + API on port 8000)
cd backend && python run.py

Open http://localhost:8000 in your browser.

Accessing the Dashboard

The dashboard runs on a Mac Mini on the internal network. Access it at:

http://<host>:8000

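Viewing the dashboard requires no authentication; internal-network access is the security boundary. The /api/admin/* endpoints are the exception and require a bearer token (ADMIN_API_TOKEN).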

Service Categories

Services are organized into ten categories. The committed example registry (backend/config/services.yaml) ships a generic, runnable set that monitors public developer-tool status pages, so the dashboard works immediately after clone:

| Category | Example services |
| --- | --- |
| Identity & Access | Identity provider (SSO) |
| Engineering | GitHub, npm, PyPI, Sentry |
| Productivity | Dropbox |
| Collaboration | Discord |
| Network & VPN | Cloudflare |
| Support | Ticketing / ITSM |
| Other | Datadog |

To monitor your own organization's services, copy the example to a gitignored backend/config/services.local.yaml (the loader prefers it when present) and list your real registry there — see that file's header for the schema and the full category list (identity, productivity, collaboration, engineering, HR, finance, sales, marketing, networking, support).
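
The "prefers the local file when present" rule can be sketched in a few lines. The function name `resolve_registry_path` is hypothetical; only the two file names come from the text above, and the actual loader may apply additional checks.

```python
from pathlib import Path


def resolve_registry_path(config_dir: Path) -> Path:
    """Return services.local.yaml if it exists, else the committed services.yaml."""
    local = config_dir / "services.local.yaml"
    return local if local.is_file() else config_dir / "services.yaml"
```

Called as `resolve_registry_path(Path("backend/config"))`, this returns the gitignored override when it exists, so a private registry never has to be committed.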

Manual Status Updates

For services without automated polling (e.g. an identity provider, an HR system, or any service with no public status API), update status via curl. Admin endpoints require a bearer token (set ADMIN_API_TOKEN in your env).

export TOKEN="your-admin-token"

# Set a service to degraded
curl -X POST http://localhost:8000/api/admin/status \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"service_id": "hr-system", "new_status": "degraded", "detail": "Slow login page", "reason": "Reported by user in the help channel"}'

# Set to major outage
curl -X POST http://localhost:8000/api/admin/status \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"service_id": "identity-provider", "new_status": "major_outage", "detail": "SSO completely unavailable", "reason": "Confirmed with vendor"}'

# Resolve (set back to operational)
curl -X POST http://localhost:8000/api/admin/status \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"service_id": "identity-provider", "new_status": "operational", "reason": "Vendor posted recovery"}'

Valid statuses: operational, degraded, partial_outage, major_outage, unknown. The reason field is required for audit trail.

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| SLACK_WEBHOOK_URL | (none) | Slack incoming webhook URL for ops-alert channel notifications |
| DATABASE_PATH | data.db | SQLite database file path |
| POLL_INTERVAL_SECONDS | 60 | How often to poll vendor status pages (1–3600) |
| HOST | 127.0.0.1 | Server bind address (0.0.0.0 for network access) |
| PORT | 8000 | Server port |
| LOG_LEVEL | INFO | Logging level |
| ADMIN_API_TOKEN | (none) | Bearer token required for /api/admin/* endpoints. If unset, admin endpoints refuse all requests. |
| CORS_ORIGINS | http://localhost:5173,http://127.0.0.1:5173 | Comma-separated list of allowed CORS origins |
| SEED_DEMO_DATA | false | Dev-only: auto-populate the DB with synthetic data on boot |
| POLLER_HEALTH_SLACK_WEBHOOK_URL | (none) | Separate webhook for poller-health alerts. Falls back to SLACK_WEBHOOK_URL when unset. |
| ALERT_CONFIRM_THRESHOLD_POLLS | 3 | Consecutive polls required before firing a worsening alert (flap suppression) |
| ALERT_RECOVERY_THRESHOLD_POLLS | 2 | Consecutive successes required before firing a recovery alert |
| ALERT_MIN_STATE_DURATION_SECONDS | 600 | Minimum dwell time (seconds) for worsening transitions |
| ALERT_DEDUP_WINDOW_SECONDS | 86400 | Dedup window for repeat alerts on the same dedup key |
| DEPENDENCY_CORRELATION_THRESHOLD | 3 | Min affected dependents before emitting one aggregated upstream alert |
| BREAKER_THRESHOLD | 3 | Consecutive failures before the per-host circuit breaker opens |
| BREAKER_TTL_SECONDS | 300 | How long an open breaker stays open before half-opening |
| POLLER_FAILURE_THRESHOLD | 3 | Consecutive failures before a service's poller_health flips to broken |
| LOG_JSON | true | JSON structured logging vs pretty console |
| LOG_FILE | (none) | Optional path for Python-side file logging (uses WatchedFileHandler). Default: stderr |
| SENTRY_DSN | (none) | Enable Sentry error tracking when set |
| SENTRY_ENVIRONMENT | production | Environment tag reported to Sentry |
| SENTRY_TRACES_SAMPLE_RATE | 0.0 | 0.0–1.0 sample rate for Sentry performance traces |
| HEALTHCHECK_PING_URL | (none) | Healthchecks.io (or similar) URL pinged by the heartbeat job |
| HEARTBEAT_INTERVAL_SECONDS | 30 | How often the heartbeat job marks itself alive |
| HEARTBEAT_STALE_AFTER_SECONDS | 120 | /healthz returns 503 past this threshold |
| RETENTION_DAYS_STATUS_EVENTS | 90 | Auto-purge status_events rows older than this (0 = disable) |
| RETENTION_DAYS_ALERT_SENT_LOG | 90 | Auto-purge alert_sent_log rows older than this (0 = disable) |
| RETENTION_INTERVAL_HOURS | 168 | How often the retention job runs |
| WAL_CHECKPOINT_INTERVAL_HOURS | 24 | How often the truncating WAL checkpoint runs |
| BACKUP_DIR | backups | Directory for the daily VACUUM INTO snapshot |
| BACKUP_TIME_HOUR | 2 | UTC hour for the daily snapshot (independent of Litestream) |
| BACKUP_RETENTION_DAYS | 7 | How many daily snapshots to keep |
| WEBHOOKS_ENABLED | false | Enable inbound Statuspage subscriber webhooks. Requires public reachability and STATUSPAGE_WEBHOOK_SECRET. |
| STATUSPAGE_WEBHOOK_SECRET | (none) | HMAC-SHA256 shared secret configured in Statuspage → Subscribers → Webhook settings. Required when WEBHOOKS_ENABLED=true. |
| SLACK_ACK_ENABLED | false | Enable the Slack ack-button flow. Requires public reachability and SLACK_SIGNING_SECRET. |
| SLACK_SIGNING_SECRET | (none) | Signing secret from your Slack app's "Basic Information → App Credentials" page. Required when SLACK_ACK_ENABLED=true. |
| SLACK_SLASH_ENABLED | false | Enable the /itstatus slash-command endpoint. Requires public reachability and SLACK_SIGNING_SECRET. |
| POSTMORTEMS_ENABLED | false | Write Google-SRE-style Markdown postmortem drafts on service recovery. |
| POSTMORTEMS_DIR | docs/postmortems | Directory where postmortem drafts are written (created if absent). |
| SLO_BURN_RATE_ENABLED | false | Enable the multi-burn-rate SLO alerting scheduler job. |
| SLO_TARGET_PERCENT | 99.9 | SLO uptime target used for error-budget calculations (90.0–99.99). |
| SLO_BURN_RATE_CHECK_INTERVAL_SECONDS | 300 | How often the burn-rate cycle runs (1–3600). |
| SLO_BURN_RATE_FAST_THRESHOLD | 14.4 | Fast-burn multiplier; triggers a page-worthy alert (e.g. 14.4× SLO error rate). |
| SLO_BURN_RATE_SLOW_THRESHOLD | 6.0 | Slow-burn multiplier; triggers a warning-level alert. |
| SLO_BURN_RATE_TICKET_THRESHOLD | 1.0 | Ticket-severity burn multiplier; low-urgency notification only. |

Copy .env.example to .env and configure:

cp .env.example .env
# Edit .env with your values

Development Mode

Run frontend and backend separately with hot reload:

# Terminal 1: Backend (auto-reload on Python changes)
cd backend && python run.py --dev

# Terminal 2: Frontend (Vite dev server with HMR)
cd frontend && npm run dev

Frontend dev server at localhost:5173 proxies /api/* to localhost:8000.

Production Deployment (Mac Mini)

# 1. Clone and set up (same as Quick Start steps 1-3; skip the optional demo-data seed in step 4)

# 2. Configure environment
cp .env.example backend/.env
# Edit backend/.env: set HOST=0.0.0.0, SLACK_WEBHOOK_URL=<your-url>

# 3. Update plist paths
# Edit com.company.it-health-dashboard.plist:
#   - Replace /path/to/ with actual project path
#   - Add SLACK_WEBHOOK_URL

# 4. Install launchd service
sudo cp com.company.it-health-dashboard.plist /Library/LaunchDaemons/
sudo launchctl bootstrap system /Library/LaunchDaemons/com.company.it-health-dashboard.plist

# 5. Verify
curl http://localhost:8000/api/health

# 6. Open firewall (if needed)
sudo /usr/libexec/ApplicationFirewall/socketfilterfw --add $(which python3)

Manage the service:

# Stop
sudo launchctl bootout system/com.company.it-health-dashboard

# Start
sudo launchctl bootstrap system /Library/LaunchDaemons/com.company.it-health-dashboard.plist

# View logs
tail -f /var/log/it-health-dashboard.log

Backup & Disaster Recovery (Litestream)

SQLite is the primary store; Litestream streams WAL frames to an external replica (S3, SFTP, or a second disk) so the dashboard survives a Mac Mini failure.

Setup

# 1. Install the binary
brew install benbjohnson/litestream/litestream

# 2. Customize the config template (pick one replica destination)
cp deploy/litestream.yml.example /opt/it-health/deploy/litestream.yml
$EDITOR /opt/it-health/deploy/litestream.yml

# 3. Validate the config before loading it
litestream validate -config /opt/it-health/deploy/litestream.yml

# 4. Install the sidecar launchd daemon
sudo cp deploy/com.company.it-health-dashboard-litestream.plist.example \
   /Library/LaunchDaemons/com.company.it-health-dashboard-litestream.plist
sudo launchctl bootstrap system /Library/LaunchDaemons/com.company.it-health-dashboard-litestream.plist

# 5. Confirm replication is working
litestream snapshots -config /opt/it-health/deploy/litestream.yml /opt/it-health/data.db

Litestream RPO is ~1 second — after the initial snapshot, every WAL frame ships as it's written.

Restore

# 1. Stop the main app so the DB isn't being written to
sudo launchctl bootout system/com.company.it-health-dashboard

# 2. Restore from replica (picks up the latest snapshot + WAL frames)
litestream restore -config /opt/it-health/deploy/litestream.yml \
                   -o /opt/it-health/data.db \
                   /opt/it-health/data.db

# 3. Start the app — it applies pending migrations on boot and resumes polling
sudo launchctl bootstrap system /Library/LaunchDaemons/com.company.it-health-dashboard.plist

Data retention

The dashboard auto-prunes old rows to keep the DB from growing without bound:

| Table | Default retention | Env var |
| --- | --- | --- |
| status_events | 90 days | RETENTION_DAYS_STATUS_EVENTS |
| alert_sent_log | 90 days | RETENTION_DAYS_ALERT_SENT_LOG |

The retention job runs every RETENTION_INTERVAL_HOURS (default 168 = weekly). A truncating WAL checkpoint runs every WAL_CHECKPOINT_INTERVAL_HOURS (default 24) so the write-ahead log file is truncated and does not grow without bound. Set any retention window to 0 to keep data forever.

API Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| /api/health | GET | Backend health check |
| /api/services | GET | All services with status counts |
| /api/services/{id} | GET | Service detail with dependencies |
| /api/timeline | GET | Recent status change events |
| /api/summary | GET | Overall health + active incidents |
| /api/maintenance | GET | Upcoming scheduled maintenances |
| /api/services/uptime | GET | Per-service per-day worst status over the past 7 days |
| /api/services/sla | GET | Per-service uptime % for 24h, 7d, and 30d windows |
| /api/services/sla/history | GET | Daily uptime % per service (1–90 days, default 30d) |
| /api/services/graph | GET | Service dependency graph (nodes + links) for visualization |
| /api/services/slo | GET | Per-service SLO snapshot: error-budget remaining + active burn-rate breaches |
| /api/admin/status | POST | Manual status update (requires Authorization: Bearer $ADMIN_API_TOKEN) |
| /healthz | GET | Dead-man's switch: 200 fresh / 503 stale. Hit by launchd + Healthchecks.io. |
| /metrics | GET | Prometheus text exposition. |
| /api/webhooks/statuspage/{id} | POST | Inbound Statuspage subscriber webhook, HMAC-verified. 404 unless WEBHOOKS_ENABLED=true. |
| /api/slack/interactivity | POST | Slack block-actions receiver (ack button). 404 unless SLACK_ACK_ENABLED=true. |
| /api/slack/slash | POST | Slack /itstatus slash-command handler. 503 unless SLACK_SLASH_ENABLED=true. |

What's Next

All production phases (0–6) and the primary Phase 7 reach features are complete. Full exit-criteria history is in PRODUCTION-ROADMAP.md. Remaining optional work:

  • Phase 7 — LLM layer: Natural-language impact statements; deferred post-Phase-7.
  • Phase 7 — Integrations: log aggregation, synthetic monitoring, metrics, and ITSM platforms — deferred to demand.

About

Internal IT dashboard aggregating SaaS vendor health — Python/FastAPI + React, with production hardening (resilience, observability, alert hygiene)
