Skip to content

docs: high availability sequencer guide#3293

Open
auricom wants to merge 7 commits intomainfrom
auricom/ha-docs
Open

docs: high availability sequencer guide#3293
auricom wants to merge 7 commits intomainfrom
auricom/ha-docs

Conversation

@auricom
Copy link
Copy Markdown
Contributor

@auricom auricom commented Apr 28, 2026

Summary

  • Add Raft HA concept guide, full configuration reference backed by 60 h load-test results
  • Add step-by-step tutorial for bootstrapping a 5-node cluster from scratch, including RTT measurement, peer ID collection via evm net-info, systemd unit, rolling restart, and failover testing
  • Add migration guide for converting a live single sequencer to an HA cluster with a planned cutover window and rollback plan
  • Wire all three guides into the sidebar under a new High Availability section

Test plan

  • yarn dev — navigate to each of the three HA guides, verify sidebar links and internal cross-references work
  • just testlink — no broken links

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Documentation
    • Added a new "High Availability" guides section in the documentation sidebar.
    • Added an HA overview covering Raft-based cluster behavior, sizing, networking, and tuning guidance.
    • Added a step-by-step 5-node cluster setup guide with deployment, validation, failover, and maintenance instructions.
    • Added a migration guide for converting a single-sequencer deployment to a 5-node HA cluster.
    • Removed the legacy Raft production guide (replaced by the new HA docs).

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 28, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed⏩ skipped✅ passed✅ passedApr 28, 2026, 3:19 PM

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 28, 2026

📝 Walkthrough

Walkthrough

Adds a new "High Availability" section to the docs sidebar and introduces three new HA guides: an overview of Raft-based operation, a 5-node cluster setup guide, and a single-to-HA migration runbook.

Changes

Cohort / File(s) Summary
Sidebar config
docs/.vitepress/config.ts
Inserted a collapsed "High Availability" group under "How To Guides" with links to three new HA guides.
HA guides
docs/guides/ha/overview.md, docs/guides/ha/cluster-setup.md, docs/guides/ha/single-to-ha.md
Added three comprehensive documentation pages covering Raft HA concepts, production/configuration parameters, 5-node cluster provisioning and validation, migration steps from single sequencer to HA, systemd examples, and troubleshooting guidance.
Removed doc
docs/guides/raft_production.md
Removed the previous raft_production.md guide (no replacement file in this diff).

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 I hopped from one to five with glee,
Raft whispers kept our cluster free,
Docs laid out the steps to try,
Nodes align beneath the sky,
Little rabbit cheers: HA, we! 🥕✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: adding comprehensive high availability sequencer documentation with guides for overview, cluster setup, and migration.
Description check ✅ Passed The description covers the main additions (three HA guides), sidebar integration, technical details, and test plan, though it lacks explicit mention of the raft_production.md removal.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch auricom/ha-docs

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 28, 2026

Claude encountered an error —— View job


I'll analyze this and get back to you.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 28, 2026

PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://evstack.github.io/docs-preview/pr-3293/

Built to branch main at 2026-04-28 15:20 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 62.48%. Comparing base (49ef5c9) to head (516b533).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3293      +/-   ##
==========================================
+ Coverage   62.45%   62.48%   +0.03%     
==========================================
  Files         122      122              
  Lines       13047    13047              
==========================================
+ Hits         8149     8153       +4     
+ Misses       4012     4009       -3     
+ Partials      886      885       -1     
Flag Coverage Δ
combined 62.48% <ø> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/.vitepress/config.ts (1)

345-355: ⚠️ Potential issue | 🟡 Minor

Fix MD040: add a language to the RTT adaptation equations code fence.

The fenced block under “Adapting for different RTT values” starts at Line 349 without a language specifier. Add something like text (or math) so markdownlint MD040 is satisfied.

Suggested change
-```
+```text
 heartbeat_timeout    = RTT_MAX × 4
 election_timeout     = heartbeat_timeout × 4
 leader_lease_timeout = heartbeat_timeout / 2
 send_timeout         = RTT_MAX × 3

</details>


Based on learnings, markdownlint-cli2 (MD040) reports: “Fenced code blocks should have a language specified (MD040, fenced-code-language)”.

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @docs/.vitepress/config.ts around lines 345 - 355, The fenced code block
under the "Adapting for different RTT values" heading is missing a language tag
which triggers markdownlint MD040; edit the markdown string containing that
heading and change the opening triple backticks from plain to a language-tagged fence (for exampletext or ```math) so the RTT equations
block is annotated (preserve the existing lines for heartbeat_timeout,
election_timeout, leader_lease_timeout, send_timeout). Ensure only the opening
fence is updated and no other content is altered.


</details>

</blockquote></details>

</blockquote></details>
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/guides/ha/cluster-setup.md`:
- Around line 262-266: The current failover step uses a blunt kill command `kill
-SIGTERM $(pgrep evm)` which may match multiple processes; replace it with a
targeted stop by either: 1) using the evm systemd unit (e.g. `systemctl stop
evm` on the leader host) if available, or 2) resolving the leader's exact PID
from the leader instance (use its PID file or parse the leader-specific log
entry to get one PID), or 3) using `pgrep -f` with a stricter pattern that
uniquely matches the leader process and verify it returns exactly one PID before
calling `kill -SIGTERM <pid>`; ensure the docs show verifying a single PID and
prefer systemd/PID-file methods over a plain `pgrep evm` call.
- Around line 29-32: The fenced multiaddr code block containing
"/ip4/<ip>/tcp/<port>/p2p/<peer-id>" is missing a language tag; update the fence
for that snippet from ``` to ```text so the block reads as a text code fence
(i.e., change the fence surrounding the multiaddr example to use the "text"
language tag).
- Around line 230-233: The fenced log snippet in cluster-setup.md is missing a
language tag (MD040); edit the fenced block containing "INF raft: entering
follower state  leader=node-1" / "INF block applied from raft log..." and add
the language identifier `text` after the opening backticks so the fence reads
```text; update the block with that tag around the existing log lines in the
file.
- Around line 221-226: The fenced log block containing "INF raft: entering
candidate state  node=node-1" and the line "INF raft: election won              
tally=3" is missing a language tag; update the opening fence from ``` to ```text
so the snippet is marked as plain text (i.e., change the code fence that
precedes the lines starting with "INF raft:" to ```text).

In `@docs/guides/ha/overview.md`:
- Around line 358-366: The P2P port in the YAML example under the "Interaction
with P2P" section is inconsistent (listen_address uses 7676) and should match
the rest of the docs; update the YAML snippet's listen_address and peers example
to use 26656 (or replace the numeric port with a <P2P_PORT> placeholder) so the
listen_address and peers entries are consistent with cluster-setup.md;
specifically edit the "listen_address" line and any sample "peers" entries in
the overview.md snippet to use 26656 or the placeholder.
- Around line 349-354: The fenced code block containing the RTT scaling rules
(lines showing heartbeat_timeout, election_timeout, leader_lease_timeout,
send_timeout) is missing a language tag which triggers markdownlint MD040;
update that fenced block by adding a language identifier (e.g., text or bash)
after the opening triple backticks so it becomes ```text (or ```bash) to satisfy
the linter and improve readability.
- Around line 144-159: Decide whether raft.peers is meant to include the local
node, then update both docs to be explicit and consistent: if raft.peers must
include self, remove the word "remote" from the overview.md description of
raft.peers and replace the example list with one that includes the local node
(use the same nodeID@host:port entries shown in cluster-setup.md); if raft.peers
must exclude self, update the cluster-setup.md examples so each node’s
raft.peers omits its own nodeID and add a clear note in overview.md explaining
“raft.peers excludes the local node (do not list node-X in node-X’s config)”; in
either case, ensure the examples and the explanatory sentence referencing
raft.peers are identical across overview.md and cluster-setup.md and add a short
clarifying comment adjacent to the existing example to avoid future
inconsistencies.

In `@docs/guides/ha/single-to-ha.md`:
- Around line 75-90: Update the signer-copy step in single-to-ha.md to avoid
hardcoding priv_validator_key.json: either (A) use the canonical filename
produced by the tool (the signer file created by running evm init, e.g.,
signer.json) to match cluster-setup.md, or (B) change the text/commands to a
generic instruction that tells users to “copy the signer key file generated by
evm init” (reference evm init and the signer filename like
signer.json/priv_validator_key.json) so the guide is robust and consistent with
cluster-setup.md.
- Around line 279-298: Update the two fenced code blocks in the "Check leader
election" section so they include a language specifier (e.g., text or log);
specifically, change the block containing the lines starting with "INF raft:
election won  tally=3  leader=node-1" and the block starting with "INF raft:
entering follower state  leader=node-1" to use ```text (or ```log) instead of
plain ``` to satisfy MD040.

---

Outside diff comments:
In `@docs/.vitepress/config.ts`:
- Around line 345-355: The fenced code block under the "Adapting for different
RTT values" heading is missing a language tag which triggers markdownlint MD040;
edit the markdown string containing that heading and change the opening triple
backticks from plain ``` to a language-tagged fence (for example ```text or
```math) so the RTT equations block is annotated (preserve the existing lines
for heartbeat_timeout, election_timeout, leader_lease_timeout, send_timeout).
Ensure only the opening fence is updated and no other content is altered.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3932bf12-f4e0-4e30-85b7-dab97b9d351c

📥 Commits

Reviewing files that changed from the base of the PR and between 14b82da and 9e05891.

📒 Files selected for processing (4)
  • docs/.vitepress/config.ts
  • docs/guides/ha/cluster-setup.md
  • docs/guides/ha/overview.md
  • docs/guides/ha/single-to-ha.md

Comment thread docs/guides/ha/cluster-setup.md Outdated
Comment thread docs/guides/ha/cluster-setup.md Outdated
Comment thread docs/guides/ha/cluster-setup.md Outdated
Comment thread docs/guides/ha/cluster-setup.md
Comment thread docs/guides/ha/overview.md
Comment thread docs/guides/ha/overview.md Outdated
Comment thread docs/guides/ha/overview.md
Comment thread docs/guides/ha/single-to-ha.md
Comment thread docs/guides/ha/single-to-ha.md
@auricom auricom marked this pull request as draft April 28, 2026 11:29
auricom and others added 5 commits April 28, 2026 14:15
Critical fixes:
- Fix snapshot_threshold math: 5000 ÷ 10 = 500s ≈ 8.3 min (not 83s)
- Fix trailing_logs math: 18000 ÷ 10 = 1800s = 30 min (not 5 min)

Medium fixes:
- Fix heartbeat_timeout description: it is a follower-side election trigger,
  not the interval at which the leader sends heartbeats
- Add explicit restart instruction after Step 5 data copy in single-to-ha.md
  so the chain keeps producing blocks during preparation (Steps 6-8)
- Replace priv_validator_key.json with signer.json in single-to-ha.md
  to match cluster-setup.md and the E2E tests

Minor fixes:
- Exclude self from raft.peers in all examples (cluster-setup.md node-1
  yaml/CLI/systemd, single-to-ha.md node-1 and node-2)
- Add "exclude local node" note to raft.peers description in overview.md
- Fix P2P port in overview.md Interaction with P2P section (7676 → 26656)
- Add text language tag to all bare fenced blocks (MD040): multiaddr
  example, RTT equations, and all log snippets

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
raft_production.md had no sidebar entry and its content was fully
superseded by the new ha/ guides. Extract the three pieces that were
unique to it — bootstrap flag docs, auto-detection startup mode
explanation, and static-membership limitation note — into
ha/overview.md, then delete the file.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Passing --evnode.signer.passphrase inline exposes the secret in
ps aux, journalctl, and shell history.

- Add EnvironmentFile=/etc/ev-node/env (chmod 600) to the systemd
  unit in cluster-setup.md with setup instructions
- Replace all inline <YOUR_PASSPHRASE> occurrences with
  $EV_SIGNER_PASSPHRASE sourced from /etc/ev-node/env in every
  evm start / evm init snippet across both guides

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace "peers list is identical" stub in node-2 config with an
  explicit peers list that excludes node-2 itself, and add a note
  that each node must omit itself from raft.peers
- Replace "Wait ~30 seconds" in rolling restart with journalctl
  one-liners that exit as soon as the node logs follower/leader state,
  giving a deterministic signal instead of an arbitrary timeout

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The abbreviated node-2 snippet with "# peers list is identical" caused
a startup failure: with raft_addr=0.0.0.0:5001 the bootstrap code's
literal address comparison does not recognise node-2@10.0.0.2:5001 as
self, so node-2 is appended twice and deduplicateServers returns
"duplicate peers found in config".

- Fix intro text: "only raft.node_id and raft_addr differ" →
  "raft.node_id is unique; raft.peers and p2p.peers must exclude self"
- Expand node-2 snippet to a full evnode.yaml with the correct peers
  list (node-1, node-3, node-4, node-5 — no node-2) and an inline
  explanation of the wildcard address pitfall
- Align overview.md trailing_logs example to 1 block/s (matching
  block_time: "1s" used throughout) and note the 10 block/s rate too

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@auricom auricom marked this pull request as ready for review April 28, 2026 15:33
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
docs/guides/ha/cluster-setup.md (1)

290-294: ⚠️ Potential issue | 🟡 Minor

Fallback kill path still doesn’t enforce single-PID targeting.

Line 290 says to verify exactly one PID, but Line 291–293 do not enforce that before sending SIGTERM. Add an explicit cardinality check to prevent stopping the wrong process.

Proposed doc-safe command patch
-# Fallback: stop the process directly (verify exactly one PID before killing)
-PID=$(pgrep -f "evm start")
-echo "Stopping PID $PID"
-kill -SIGTERM "$PID"
+# Fallback: stop the process directly (must resolve exactly one PID)
+mapfile -t PIDS < <(pgrep -f "evm start")
+if [ "${`#PIDS`[@]}" -ne 1 ]; then
+  echo "Expected exactly 1 evm PID, found ${`#PIDS`[@]}: ${PIDS[*]}"
+  exit 1
+fi
+echo "Stopping PID ${PIDS[0]}"
+kill -SIGTERM "${PIDS[0]}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/guides/ha/cluster-setup.md` around lines 290 - 294, The fallback kill
path uses PID=$(pgrep -f "evm start") and then kill -SIGTERM "$PID" without
verifying cardinality; change the instructions to first capture all matching
PIDs (via pgrep -f "evm start"), check that exactly one PID is present
(fail/exit with an error message if zero or more than one), and only then
assign/use that single PID variable and run kill -SIGTERM on it; reference the
PID variable, the pgrep -f "evm start" invocation, and the kill -SIGTERM command
so the doc update enforces the single-PID check before killing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/guides/ha/cluster-setup.md`:
- Around line 77-80: The guide's systemd examples incorrectly recommend passing
the passphrase via the CLI flag --evnode.signer.passphrase (which expands
$EV_SIGNER_PASSPHRASE into the command line and exposes the secret); update the
systemd unit examples and any CLI examples that use --evnode.signer.passphrase
(search for occurrences of --evnode.signer.passphrase and EnvironmentFile) to
instead use --evnode.signer.passphrase_file pointing to a protected file (and
explain creating the file and setting permissions), and update the explanatory
text at the cited examples (including the instances around the earlier example
and the later one) to recommend passphrase_file for production rather than
embedding the expanded env var on ExecStart.

---

Duplicate comments:
In `@docs/guides/ha/cluster-setup.md`:
- Around line 290-294: The fallback kill path uses PID=$(pgrep -f "evm start")
and then kill -SIGTERM "$PID" without verifying cardinality; change the
instructions to first capture all matching PIDs (via pgrep -f "evm start"),
check that exactly one PID is present (fail/exit with an error message if zero
or more than one), and only then assign/use that single PID variable and run
kill -SIGTERM on it; reference the PID variable, the pgrep -f "evm start"
invocation, and the kill -SIGTERM command so the doc update enforces the
single-PID check before killing.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e2ebbdf5-5ca6-4b02-aaf8-c61c1722965d

📥 Commits

Reviewing files that changed from the base of the PR and between 9e05891 and 516b533.

📒 Files selected for processing (4)
  • docs/guides/ha/cluster-setup.md
  • docs/guides/ha/overview.md
  • docs/guides/ha/single-to-ha.md
  • docs/guides/raft_production.md
💤 Files with no reviewable changes (1)
  • docs/guides/raft_production.md
✅ Files skipped from review due to trivial changes (2)
  • docs/guides/ha/single-to-ha.md
  • docs/guides/ha/overview.md

Comment on lines +77 to +80
./evm init --evnode.node.aggregator=true --evnode.signer.passphrase "$EV_SIGNER_PASSPHRASE"
```

> Set `EV_SIGNER_PASSPHRASE` in your shell session before running this command so the passphrase does not appear in `ps aux` or your shell history. The [EnvironmentFile setup](#running-as-a-systemd-service) later in this guide shows how to store it securely.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify whether ev-node/evm supports a safer non-CLI passphrase input mechanism.
# Expected: presence of flags/config such as passphrase-file, stdin prompt, or dedicated secret provider.
# If none are found, docs should not claim CLI args are hidden from process inspection.

rg -n -C3 'signer\.passphrase|passphrase-file|passphrase_file|stdin|prompt|EnvironmentFile' docs
rg -n -C3 'signer\.passphrase|passphrase-file|passphrase_file|stdin|prompt' .

Repository: evstack/ev-node

Length of output: 50373


🏁 Script executed:

sed -n '305,365p' docs/guides/ha/cluster-setup.md

Repository: evstack/ev-node

Length of output: 1968


Update systemd service configuration to use passphrase file instead of CLI argument.

The security claim at line 313-314 is incorrect. When systemd loads EnvironmentFile=/etc/ev-node/env, the environment variable $EV_SIGNER_PASSPHRASE gets expanded into the actual secret value. Passing it to the CLI via --evnode.signer.passphrase=$EV_SIGNER_PASSPHRASE (line 350) makes the expanded passphrase visible in ps aux, /proc/<pid>/cmdline, and process inspection tools—directly contradicting the guide's security guarantee.

Use the supported --evnode.signer.passphrase_file flag instead:

Corrected unit file snippet
[Service]
EnvironmentFile=/etc/ev-node/env
ExecStart=/usr/local/bin/evm start \
  --evnode.node.aggregator=true \
  ...other flags... \
  --evnode.signer.passphrase_file=/etc/ev-node/passphrase

Then store the passphrase in a separate file:

echo -n "<YOUR_PASSPHRASE>" | sudo tee /etc/ev-node/passphrase > /dev/null
sudo chmod 600 /etc/ev-node/passphrase

Also applies to lines 77 and 229 where the same false security claim appears—those examples should also be updated to recommend --evnode.signer.passphrase_file for production use.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/guides/ha/cluster-setup.md` around lines 77 - 80, The guide's systemd
examples incorrectly recommend passing the passphrase via the CLI flag
--evnode.signer.passphrase (which expands $EV_SIGNER_PASSPHRASE into the command
line and exposes the secret); update the systemd unit examples and any CLI
examples that use --evnode.signer.passphrase (search for occurrences of
--evnode.signer.passphrase and EnvironmentFile) to instead use
--evnode.signer.passphrase_file pointing to a protected file (and explain
creating the file and setting permissions), and update the explanatory text at
the cited examples (including the instances around the earlier example and the
later one) to recommend passphrase_file for production rather than embedding the
expanded env var on ExecStart.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant