Skip to content

feat(state): Phase 9 — per-node telemetry, cluster aggregator, routing feedback#83

Merged
transfix merged 5 commits into
masterfrom
feature/phase9-telemetry
May 23, 2026
Merged

feat(state): Phase 9 — per-node telemetry, cluster aggregator, routing feedback#83
transfix merged 5 commits into
masterfrom
feature/phase9-telemetry

Conversation

@transfix
Copy link
Copy Markdown
Owner

@transfix transfix commented May 22, 2026

Summary

Implements the core of Phase 9: Network Analytics & Live Telemetry from the distributed state roadmap.

New classes

  • state_node_telemetry — per-node sampler with:

    • EWMA (configurable half-life, mutex-guarded)
    • Power-of-2 latency histogram (22 buckets, O(1) record via __builtin_clzll, p50/p90/p99 queries)
    • telemetry_snapshot POD struct (35+ fields: counters, rates, latency percentiles, tree/cluster shape)
    • sample() reads attached shard, transport, and bus counters
    • publish_snapshot() serializes to JSON and publishes via OOB bus on __telemetry.<cluster>.<node> topics
    • Full JSON round-trip serialize/deserialize (no external JSON lib required)
  • state_telemetry_aggregator — cluster-level rollup:

    • Subscribes to __telemetry.<cluster_id> on the OOB bus via attach_bus()
    • Ingests peer snapshots, computes cluster_telemetry_summary (aggregated counters, max latencies, summed rates)
    • Stale-peer detection (configurable threshold, default 5 s)
    • evaluate_routing_feedback(policy) returns isolate/release node lists based on p99 latency and outbox drop thresholds
    • to_text() for human-readable cluster telemetry dumps

Modified classes

  • state_distributed_adminattach_telemetry(), telemetry() accessor, to_text() now appends a [telemetry] section when an aggregator is attached

Tests

  • state_node_telemetry_test — 18 test cases (EWMA convergence, histogram percentiles, JSON round-trip, sampling with shard/transport/bus, latency recording, rate computation, publish/subscribe via bus)
  • state_telemetry_aggregator_test — 13 test cases (aggregation, stale detection, routing feedback isolation/release, admin integration)
  • All 31 tests pass. No regressions in existing test suites.

C-identifier compliance (67ae6f8)

  • Replaced hyphenated node identifiers (node-1, node-2, node-3) with valid C identifiers (node1, node2, node3) in both test files to comply with the isValidStateName() validation from Phase 7.

Remaining Phase 9 work (future PRs)

  • 100-node simulated cluster benchmark
  • CBOR encoding for production telemetry messages
  • Periodic auto-publish timer integration
  • Automatic slow-peer isolation wired into transport

@transfix transfix force-pushed the feature/phase9-telemetry branch 3 times, most recently from abb91db to 742da4a Compare May 22, 2026 20:07
transfix added 3 commits May 22, 2026 15:35
…g feedback

Add state_node_telemetry with EWMA (configurable half-life), power-of-2
latency histogram (22 buckets, O(1) record, p50/p90/p99), telemetry_snapshot
POD struct (35+ fields), JSON serialize/deserialize, and OOB bus publishing
on __telemetry.<cluster>.<node> topics.

Add state_telemetry_aggregator: subscribes to __telemetry.<cluster_id> on the
OOB bus, ingests peer snapshots, computes cluster_telemetry_summary (aggregated
counters, max latencies, summed rates), stale-peer detection (configurable
threshold, default 5 s), evaluate_routing_feedback(policy) returning
isolate/release node lists based on p99 latency and outbox drop thresholds.

Extend state_distributed_admin with attach_telemetry(), telemetry() accessor,
and to_text() telemetry section.

Tests: 18 cases in state_node_telemetry_test (EWMA, histogram, JSON round-trip,
sampling, latency recording, rate computation, publish/subscribe), 13 cases in
state_telemetry_aggregator_test (aggregation, stale detection, routing feedback,
admin integration). All pass.
Replace hyphenated node identifiers (node-1, node-2, node-3) with
valid C identifiers (node1, node2, node3) to comply with the
isValidStateName() validation added in Phase 7.
@transfix transfix force-pushed the feature/phase9-telemetry branch from 742da4a to 05470ba Compare May 22, 2026 20:36
@transfix transfix merged commit 350d07b into master May 23, 2026
14 checks passed
@transfix transfix deleted the feature/phase9-telemetry branch May 23, 2026 04:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant