Skip to content

feat: distributed state gap closure — 14 items#89

Merged
transfix merged 4 commits into
masterfrom
feature/distributed-state-gaps
May 23, 2026
Merged

feat: distributed state gap closure — 14 items#89
transfix merged 4 commits into
masterfrom
feature/distributed-state-gaps

Conversation

@transfix
Copy link
Copy Markdown
Owner

Summary

Implements 14 of 16 post-gap-analysis roadmap items for the distributed state synchronization feature. Item #3 (TLS/auth) was already implemented; item #14 (Admin CLI/HTTP) is deferred to upcoming CLI/dashboard work.

Changes

Proto & Wire Format

  • New protobuf messages: ChunkRequest/Response, SnapshotRequest/Entry/Response, Heartbeat
  • hlc_time field added to Mutation (field 12)
  • Frame oneof extended with 5 new fields
  • IPC wire types extended: kMsgSnapReq (6), kMsgSnapRsp (7)

Transport Chunk Fetch (IPC + gRPC)

  • set_blob_store(), fetch_chunk() on both transports
  • Condition-variable waiter pattern with 10s timeout
  • Serves chunks from local blob store on request

Session Improvements

  • state_data_hydrator wired into distributed_state_session::join()
  • TLS/auth config passthrough (gRPC only)
  • Blob store wired to transports
  • wait_for_data(path, timeout) API
  • replica_status status() with peer_count, local_sequence, pump_cycles, pending_hydrations

Compression

  • New state_zstd_compression_codec (production zstd via libzstd)
  • Registered as built-in in state_compression_registry::shared()
  • CMake zstd discovery (pkg-config + fallback)

Conflict Visibility

  • conflict_entry ring buffer (128 entries) on shard
  • recent_conflicts() accessor (most-recent-first)
  • publish_conflicts() on state_distributed_metrics

Hybrid Logical Clocks

  • New state_hybrid_time.h: hybrid_time struct + hybrid_clock class
  • hlc_time field on state_mutation (packed 48-bit wall + 16-bit logical)
  • should_replace() prefers HLC ordering when both mutations carry timestamps
  • Backward-compatible IPC wire encoding

Hash-Range Sharding

  • New state_hash_partition.h: FNV-1a hash ring with assign_uniform()

Selective Brick Fetch

  • bricks_in_frustum() on state_brick_manifest (AABB-vs-frustum conservative test)

Initial-Sync Snapshot

  • snapshot(path_prefix) on state_cluster_shard (walks state tree)
  • request_snapshot() virtual on state_transport
  • Full IPC implementation (kMsgSnapReq/Rsp wire protocol)

New Tests

  • state_reconnect_resilience_test.cpp: StopRestartReconnect, PeerDisconnectNoHang, ConnectionCountTracking (3 tests)
  • state_large_tree_bench_test.cpp: Ingest10k, Ingest100k, DrainAndPublish10k, Snapshot10k (4 bench tests, gated on CVC_DISTRIBUTED_STATE_BENCH=1)

Files Changed

25 files, +1423/-6 lines

Test Results

All affected test suites pass:

  • state_transport_ipc_test: 12 passed, 2 skipped
  • state_reconnect_resilience_test: 3 passed
  • state_compression_registry_test: 16 passed
  • state_cluster_shard_test: 23 passed, 2 skipped
  • state_distributed_metrics_test: 3 passed
  • state_replica_test: 10 passed, 2 skipped
  • state_brick_manifest_test: 10 passed
  • distributed_state_session_test: 12 passed
  • state_blob_transport_integration_test: 4 passed
  • state_large_tree_bench_test: 4 skipped (by design)

transfix added 4 commits May 22, 2026 17:30
Implements the post-gap-analysis roadmap items (all except Admin CLI,
deferred to separate work):

Proto & wire format:
- Add ChunkRequest/Response, SnapshotRequest/Entry/Response, Heartbeat
  messages to state_transport.proto
- Add hlc_time field to Mutation (proto field 12)
- Extend Frame oneof with 5 new fields
- Add snapshot & chunk wire types to IPC transport (kMsgSnapReq/Rsp)

Transport chunk fetch:
- IPC transport: set_blob_store, fetch_chunk, chunk request/response
- gRPC transport: set_blob_store, fetch_chunk, chunk request/response
- Both use condition_variable waiter pattern with 10s timeout

Session improvements:
- Wire state_data_hydrator into distributed_state_session::join()
- Add TLS/auth config passthrough (gRPC only)
- Add blob store wiring on both IPC and gRPC transports
- Add wait_for_data() delegating to hydrator
- Add replica_status status() with peer_count, local_sequence, etc.

Compression:
- Add state_zstd_compression_codec (production zstd via libzstd)
- Register in shared() singleton; update tests for 3 built-in codecs
- Add CMake zstd discovery (pkg-config + fallback find_library)

Conflict visibility:
- Add conflict_entry ring buffer to state_cluster_shard (128 entries)
- Add recent_conflicts() accessor (most-recent-first)
- Add publish_conflicts() to state_distributed_metrics

Hybrid logical clocks:
- New state_hybrid_time.h: hybrid_time struct + hybrid_clock class
- Add hlc_time field to state_mutation (packed 48-bit wall + 16-bit logical)
- Update should_replace() to prefer HLC ordering when both carry timestamps
- Serialize hlc_time on IPC wire (backward-compatible) and gRPC proto

Hash-range sharding:
- New state_hash_partition.h: FNV-1a hash ring with uniform partitioning

Selective brick fetch:
- Add plane struct and bricks_in_frustum() to state_brick_manifest
  (AABB-vs-frustum conservative test)

Initial-sync snapshot:
- Add snapshot() to state_cluster_shard (walks state tree via traverse)
- Add request_snapshot() virtual to state_transport base class
- Implement on IPC transport (kMsgSnapReq/Rsp wire protocol)

Tests:
- state_reconnect_resilience_test.cpp: 3 tests (stop/restart/reconnect,
  peer disconnect survival, connection count tracking)
- state_large_tree_bench_test.cpp: 4 benchmarks (10k/100k ingest,
  drain+publish, snapshot) gated on CVC_DISTRIBUTED_STATE_BENCH=1
P0 — Production Blockers:
  - gRPC snapshot + heartbeat frame dispatch in both server and client
  - Wire hybrid_clock into shard (stamp outbound, merge inbound)
  - IPC wire version negotiation (kMaxVersion, skip unknown msg types)

P1 — Functional Gaps:
  - Wire hash_partition into shard with enforce_partition gate
  - gRPC heartbeat send/receive (background heartbeat_loop thread)
  - Session config: blob_store_path, snapshot_on_join wired

P2 — Test Coverage & Docs:
  - 7 new test suites (47 tests): hash_partition, hybrid_time,
    brick_frustum, conflict_ring, snapshot_protocol, zstd_round_trip,
    session_status
  - gRPC reconnect/resilience tests (3 tests)
  - USAGE.md: distributed state section (quick start, config
    reference, transports, state API, interest filters, snapshots,
    conflict resolution, session lifecycle)
  - CI: gRPC build matrix entry (libcvc-debug-grpc) for Linux

P3 — Nice to Have:
  - require_tls safety boolean on distributed_state_config

Bug fixes:
  - hash_partition owner_of() uint32 overflow (UINT32_MAX + 1 wraps)

19 files changed, +1560/-4 lines
…icts

1. Wire max_inline_payload_bytes into state_cluster_shard::drain_local():
   - Add set_blob_store() and set_max_inline_payload_bytes() to shard
   - In drain_local(), values exceeding the threshold are offloaded
     to the blob store and replaced with a blob_ref payload
   - Session::join() passes config.max_inline_payload_bytes to shard

2. Auto-publish conflict metrics in session pump loop:
   - Store app* context in distributed_state_session for metrics
   - Every 100 pump cycles, call publish_conflicts() to write
     recent conflict entries under __system.distributed.<cluster>.conflicts.*
   - Only active when config.resolve_conflicts is true

Tests:
   - InlinePayloadOffloadToBlobStore: verifies large values go to blob store
   - InlinePayloadOffloadDisabledByDefault: verifies no offload without blob store
   - ConflictAutoPublish: verifies conflicts appear in state tree automatically

6 files changed, +169/-1 lines
…-format

- Add zstd to macOS brew install lines in CI workflow
- Use IMPORTED_TARGET in pkg_check_modules for zstd to get full library
  paths (fixes macOS linker error: 'ld: library zstd not found')
- Pass steady_clock timestamp to note_seen() in state_transport_grpc.cpp
  (fixes grpc build: note_seen now requires 2 arguments)
- Apply clang-format to all changed lines vs base c297c8a
@transfix transfix merged commit 43894f9 into master May 23, 2026
14 checks passed
@transfix transfix deleted the feature/distributed-state-gaps branch May 23, 2026 03:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant