
fix(tools/talis): wait-for-chain + atomic keyring + one-command driver #3303

Draft
walldiss wants to merge 11 commits into evstack:julien/fiber from walldiss:feat/talis-fibre-fixes

Conversation

walldiss commented May 1, 2026

Summary

Three race conditions surfaced repeatedly on a fresh AWS bring-up of the Fibre throughput experiment. Each had the same shape: a talis step "succeeded" at the CLI level (or returned a txhash with --yes) before the chain had actually applied the work, leaving downstream steps to fail in confusing ways. This PR makes each step verify *outcome*, not just *invocation*, so the experiment goes from a fresh `talis up` to a running loadgen without manual intervention.

Symptom matrix from the broken run:

  • setup-fibre "complete" → `query valaddr providers` empty → uploads cascade to `voting power: collected 0`
  • evnode crashes at startup with `keyring entry "fibre-0" not found` despite the file existing on disk
  • `escrow account not found for signer celestia17ftrx…` even though setup-fibre claimed to fund 100× fibre-N accounts

Fixes

  • setup-fibre script (fibre_setup.go) now:
    - polls `celestia-appd status` for `latest_block_height>0`
      before submitting any tx — fixes the silent no-op where
      set-host + 100× deposit-to-escrow all bounced with
      "celestia-app is not ready; please wait for first block";
    - retries `set-host` in a loop until the validator's host
      shows up in `query valaddr providers` — fixes the case
      where --yes returns the txhash before block inclusion and
      the tx silently lands in the mempool but never confirms;
    - verifies fibre-0's escrow account is funded on-chain before
      the tmux session exits — same silent-failure mode as
      set-host, but on the deposit side.
    The talis-CLI step also now cross-checks that all validators are
    registered from a single vantage point before returning, so a
    concurrent set-host race surfaces as an error instead of a
    half-empty provider list that start-fibre would cache forever
    (see the poll-and-verify sketch after this list).

  • fibre-bootstrap-evnode (fibre_bootstrap_evnode.go) now stages
    the keyring scp into a tmp directory and `mv`s it atomically
    into place. The previous direct `scp -r` to
    /root/keyring-fibre/keyring-test created the directory before
    transferring its contents — the evnode init script's
    `[ -d keyring-test ]` poll passed mid-transfer, the daemon
    launched with no fibre-0.info, and crashed with `keyring entry
    "fibre-0" not found` (staging sketch below).

  • evnode_init.sh (genesis.go) now waits for the specific
    keyring-test/fibre-0.info file rather than just the
    keyring-test directory. Belt-and-braces: the bootstrap mv is
    already atomic on the same filesystem, but the file-level
    guard means a hand-pushed keyring (not via talis) can't trip
    the same race.

  • New `talis fibre-experiment` umbrella command runs
    up → genesis → deploy → setup-fibre → start-fibre →
    fibre-bootstrap-evnode in order. Each step uses the same
    binary as a subprocess; failures in any step abort the chain.
    The operator goes from a prepared root dir to a running loadgen
    with one command, instead of remembering the sequence (driver
    sketch below).
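
A minimal sketch of the poll-and-verify pattern the setup-fibre fixes share (the real logic lives in the shell script fibre_setup.go generates; the `queryProviders` helper, package name, and 2 s interval here are illustrative):

```go
package talis // illustrative package name

import (
	"context"
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// queryProviders shells out to the chain CLI; flags and transport
// (direct vs over SSH) are simplified relative to the real script.
func queryProviders(ctx context.Context) (string, error) {
	out, err := exec.CommandContext(ctx, "celestia-appd",
		"query", "valaddr", "providers").Output()
	return string(out), err
}

// waitForProvider is the "verify outcome" half of the set-host retry:
// keep polling until the host is visible on-chain or the context expires.
func waitForProvider(ctx context.Context, host string) error {
	for {
		if out, err := queryProviders(ctx); err == nil && strings.Contains(out, host) {
			return nil // outcome verified, not just tx submitted
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("provider %s never registered: %w", host, ctx.Err())
		case <-time.After(2 * time.Second):
		}
	}
}
```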
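
A sketch of the stage-then-rename keyring push; the paths, staging dir name, and helper shape are illustrative, not the real fibre_bootstrap_evnode.go code:

```go
package talis

import (
	"fmt"
	"os/exec"
)

// pushKeyring stages the copy into a dot-dir the init script never
// polls, then renames it into place. rename(2) is atomic within one
// filesystem, so `[ -d keyring-test ]` can only ever observe a
// complete keyring, never a half-transferred one.
func pushKeyring(remote, localKeyring string) error {
	const staging = "/root/keyring-fibre/.staging"
	const final = "/root/keyring-fibre/keyring-test"
	if err := exec.Command("scp", "-r", localKeyring, remote+":"+staging).Run(); err != nil {
		return fmt.Errorf("stage keyring on %s: %w", remote, err)
	}
	return exec.Command("ssh", remote, "mv", staging, final).Run()
}
```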
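
A sketch of the umbrella driver's step chain; the step order comes from the command description above, while the flag plumbing and error wording are illustrative:

```go
package talis

import (
	"fmt"
	"os"
	"os/exec"
)

// runExperiment re-invokes the current binary once per step and stops
// the chain at the first failure.
func runExperiment() error {
	steps := []string{"up", "genesis", "deploy",
		"setup-fibre", "start-fibre", "fibre-bootstrap-evnode"}
	for _, step := range steps {
		cmd := exec.Command(os.Args[0], step)
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Run(); err != nil {
			return fmt.Errorf("step %q failed, aborting remaining steps: %w", step, err)
		}
	}
	return nil
}
```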

Verified by 5-min sustained loadgen against julien/fiber HEAD with
PR evstack#3287 (concurrent submitter) merged: 47.65 MB/s @ 99.999 % ok,
up from the prior 24.57 MB/s baseline (the gap is PR evstack#3287's
overlapping uploads — these talis fixes just stop the deploy from
silently breaking before throughput matters).

walldiss and others added 10 commits May 1, 2026 17:21
Three follow-up bugs surfaced from the PR evstack#3303
verification run on a 3-validator AWS Fibre cluster:

- aws.go: CreateAWSInstances exited 0 even when individual
  instance launches failed, so `talis up` lied about success
  and downstream steps proceeded against a partial cluster.
  Returns a joined error now so failure cascades stop early.

- download.go: sshExec used cmd.CombinedOutput, mixing SSH
  warnings (the "Warning: Permanently added '...'..." chatter
  on stderr) into bytes the caller hands to fmt.Sscanf("%d").
  The CLI-side providers cross-check parsed those warnings
  as 0 and looped until its 5-min deadline even though a
  direct SSH query showed all 3 providers registered. Switch
  to cmd.Output() (stdout only) and add `-q -o LogLevel=ERROR`
  to silence the chatter for any caller that does combine
  streams.

- fibre_setup.go: the per-validator escrow verification used
  `celestia-appd query fibre escrow` which doesn't exist —
  the actual subcommand is `escrow-account`. The query
  errored on every retry, the grep for "amount" never
  matched, and the script wedged on the 3-min deadline
  reporting `FATAL: fibre-0 escrow not present`. Switch to
  `escrow-account` and key on `"found":true` (the explicit
  existence flag in the response). Also wrap the fibre-0
  deposit-to-escrow itself in a retry loop matching set-host
  — the same `--yes`-returns-before-inclusion silent-failure
  mode that bit set-host. fibre-1..N stay best-effort.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
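
A minimal sketch of the sshExec change from the commit above, with the argument handling simplified relative to the real helper:

```go
package talis

import "os/exec"

// sshExec per the fix: Output() returns stdout only, so host-key
// warnings on stderr can never reach a fmt.Sscanf("%d") caller, and
// -q / LogLevel=ERROR silences the chatter even for callers that
// still merge the two streams.
func sshExec(host, remoteCmd string) ([]byte, error) {
	cmd := exec.Command("ssh", "-q", "-o", "LogLevel=ERROR", host, remoteCmd)
	return cmd.Output()
}
```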
Reproduced an OOM at 63.8 GiB on c6in.8xlarge (matching the
prior 63.9 GiB note in evnode-fibre's main.go that warned
against raising UploadMemoryBudget). On a 3-validator cluster
running evnode-txsim at 80 MiB/s, evnode-fibre's submitter
queue grew to 50 pending data + 26 pending headers, each blob
≈32 MiB raw. With 3 fan-out targets and per-attempt retry
buffers in flight, total in-memory upload state crossed 64 GiB
and the kernel killed evnode 30 s into the load test. The
loadgen-side TXSIM number collapsed from the prior journal's
47.65 MB/s to 0.66 MB/s as evnode died.

Two changes, one root cause (a Fibre upload stall snowballs):

- pkg/config: ApplyFiberDefaults sets MaxPendingHeadersAndData
  to 10 (was 50). The cap is what bounds in-flight blob copies
  + retry buffers; 50 was sized when evnode hadn't yet seen the
  pathological case where Fibre uploads time out, retries
  accumulate, and GC pressure pushes uploads even slower in a
  positive-feedback loop. 10 keeps the in-flight footprint
  bounded while still letting healthy uploads pipeline.

- evnode-fibre: override SubmitConfig.Fibre to set RPCTimeout
  to 30 s (upstream default 15 s). Verified live: with the
  pending cap at 10, a 17-blob 115 MiB upload completes in
  ~1.5 s — well below the 15 s default. The 30 s margin only
  matters for transient slow paths during signature collection
  across 3 FSPs; the cap fix is the load-bearing change.

Drop the stale main.go comment claiming we don't override
SubmitConfig.Fibre.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Fibre Submit path was opaque: failures showed up as
DeadlineExceeded with no signal of how long the upload
actually took, and successes only logged at debug level
inside the upstream library. During load-test debugging
this turned into a guessing game — was the cluster slow,
the deadline too tight, or something stuck mid-RPC?

Add a single info-level (warn-on-failure) log line in
fiberDAClient.Submit covering the Upload call: duration,
flat blob bytes, blob count. Cheap (one time.Since) and
gives the operator concrete numbers — e.g. "17 blobs / 115
MiB / 1.5 s" — to reason about whether RPCTimeout, pending
cap, or batch sizing is the right knob to turn next.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
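
A sketch of the Submit instrumentation described above; the struct is a pared-down stand-in for the real fiberDAClient, and the field names and log keys are illustrative:

```go
package fibre

import (
	"context"
	"log/slog"
	"time"
)

// fiberDAClient stand-in: upload() represents the upstream Upload call.
type fiberDAClient struct {
	log    *slog.Logger
	upload func(context.Context, [][]byte) error
}

// Submit wraps the upload in a single time.Since so every attempt
// reports duration, flat bytes, and blob count — info on success,
// warn on failure.
func (c *fiberDAClient) Submit(ctx context.Context, blobs [][]byte) error {
	var total int
	for _, b := range blobs {
		total += len(b)
	}
	start := time.Now()
	err := c.upload(ctx, blobs)
	dur := time.Since(start)
	if err != nil {
		c.log.Warn("fibre upload failed", "blobs", len(blobs), "bytes", total, "took", dur, "err", err)
		return err
	}
	c.log.Info("fibre upload ok", "blobs", len(blobs), "bytes", total, "took", dur)
	return nil
}
```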
Under sustained txsim load (~50 MiB/s) the DA submitter
batched 10 block_data items into one Upload(), producing a
flat payload of 144 MiB. Fibre's per-upload cap is hard at
~128 MiB ("blob size exceeds maximum allowed size: data
size 144366912 exceeds maximum 134217723") and rejected
every batched upload. With MaxPendingHeadersAndData=10
that took down 170 consecutive submissions before the
node halted itself with "Data exceeds DA blob size limit".

Wrap the Upload call in a chunker that groups input blobs
into ≤120 MiB chunks (8 MiB headroom under Fibre's cap for
the per-blob length-prefix overhead added by flattenBlobs)
and uploads each chunk separately. Aggregates submitted
counts and BlobIDs across chunks; on first chunk failure,
returns the error with the partially-submitted count so
the submitter's retry/backoff logic sees a coherent state
instead of all-or-nothing.

Single oversized blobs (already validated against
DefaultMaxBlobSize earlier in Submit) still land alone and
fail server-side, but at least don't drag healthy peers
into the same rejected batch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
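
A sketch of the chunking rule from the commit above. The 120 MiB cap is the commit's number; the 8-byte per-blob prefix allowance is an assumption standing in for the exact flattenBlobs overhead:

```go
package fibre

// maxChunkBytes leaves headroom under Fibre's ~128 MiB reject threshold.
const maxChunkBytes = 120 << 20

// chunkBlobs greedily packs blobs into groups whose flat size stays
// under the cap. A single oversized blob still lands alone in its own
// chunk (the len(cur) > 0 guard), so it fails by itself server-side
// instead of dragging healthy peers into the same rejected batch.
func chunkBlobs(blobs [][]byte) [][][]byte {
	var chunks [][][]byte
	var cur [][]byte
	var curSize int
	for _, b := range blobs {
		sz := len(b) + 8 // assumed per-blob length-prefix allowance
		if curSize+sz > maxChunkBytes && len(cur) > 0 {
			chunks = append(chunks, cur)
			cur, curSize = nil, 0
		}
		cur = append(cur, b)
		curSize += sz
	}
	if len(cur) > 0 {
		chunks = append(chunks, cur)
	}
	return chunks
}
```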
Companion to the submitter chunking fix. The submitter can
split a multi-blob batch into ≤120 MiB Fibre uploads, but
a *single* block_data item that exceeds 128 MiB still ends
up alone in its own chunk and fails server-side ("blob size
exceeds maximum allowed size"). Lower the per-block cap to
100 MiB so under high-throughput txsim a single block can't
grow past Fibre's hard limit, and update the comment to
explain the relationship between this cap and Fibre's
~128 MiB upload reject threshold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The stub executor used by the runner returned FilterOK for
every transaction unconditionally, ignoring the maxBytes
budget plumbed through SoloSequencer.GetNextBatch. Under
sustained txsim load (~50 MiB/s, 8 concurrent senders) the
mempool would accumulate ~50K txs while a 100 MiB upload
was in flight; on the next batch the sequencer drained
ALL of them into one block (~369 MiB raw), the submitter
saw a single item exceeding the per-blob cap, and halted
the node with `single item exceeds DA blob size limit`.

Walk the input txs in arrival order, accumulate sizes
against maxBytes, and return FilterPostpone past the
budget so the sequencer puts the overflow back on its
queue. Verified live: blocks now cap at ~10K txs / ~100
MiB and evnode sustains 58.77 MB/s DA upload throughput
through a 5-min txsim run with zero crashes (was 0 →
crash within 30 s before this fix).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
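
A sketch of the budget-aware filter from the commit above; FilterOK/FilterPostpone are modeled here as a local enum rather than the executor interface's real types:

```go
package runner

type FilterStatus int

const (
	FilterOK FilterStatus = iota
	FilterPostpone
)

// filterTxs accepts txs in arrival order until maxBytes is spent and
// postpones the rest, so the sequencer puts the overflow back on its
// queue instead of draining the whole mempool into one block.
func filterTxs(txs [][]byte, maxBytes int) []FilterStatus {
	statuses := make([]FilterStatus, len(txs))
	var used int
	for i, tx := range txs {
		if used+len(tx) > maxBytes {
			statuses[i] = FilterPostpone // overflow: re-queue for next block
			continue
		}
		used += len(tx)
		statuses[i] = FilterOK
	}
	return statuses
}
```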
Headers and data submissions used a Mutex.TryLock single-flight
guard, so even though up to 10 blobs could be pending, only one
upload of each kind ran at a time. That tied DA upload throughput
to the latency of one gRPC round-trip; under load with 1-7 s per
98 MiB upload, achievable DA bandwidth capped at ~20 MB/s even
though the FSPs and the network had abundant headroom.

Replace each Mutex with a buffered-channel semaphore sized to
config.Node.MaxPendingHeadersAndData (the same cap that bounds
in-flight blob memory). Acquisition is non-blocking (`select`
with `default:` skip on full), so a busy tick reuses the existing
"try and skip" semantics — but instead of blocking the next
batch, multiple batches can pipeline through Fibre concurrently.

Pool size = pending-cap is the right relationship: each pending
blob can have at most one in-flight submission, and we never
exceed the memory the cap was set against. A zero pending-cap
falls back to single-flight (cap 1) so callers that opted out
of pending-bounding don't accidentally get unbounded fan-out.

The TestSubmitter_daSubmissionLoop test constructs Submitter{}
directly bypassing NewSubmitter, so it now also initializes the
two semaphores — without them the channels are nil and `select`
always hits the default branch, causing the test to time out
waiting for SubmitHeaders/SubmitData calls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
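
A sketch of the TryLock-to-semaphore swap described above; the type and method names are illustrative:

```go
package submitter

// trySem is a buffered-channel semaphore sized to the pending cap.
type trySem chan struct{}

func newTrySem(n int) trySem {
	if n <= 0 {
		n = 1 // zero pending-cap falls back to single-flight
	}
	return make(trySem, n)
}

// TryAcquire preserves the old "try and skip" semantics: a full pool
// skips this tick without blocking, but up to cap uploads can now
// pipeline concurrently instead of serializing behind one Mutex.
func (s trySem) TryAcquire() bool {
	select {
	case s <- struct{}{}:
		return true
	default:
		return false
	}
}

func (s trySem) Release() { <-s }
```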
Two diagnostic improvements for the load generator:

1. http.Transport.MaxIdleConnsPerHost defaults to 2 in stdlib.
   With --concurrency=8 (or higher), 6+ goroutines per cycle had
   to open fresh TCP+TLS sockets per request because the pool
   couldn't hold their idle conns between requests. Bump
   MaxIdleConns / MaxIdleConnsPerHost / MaxConnsPerHost to
   2*concurrency so every active sender has a reusable keep-alive
   socket, eliminating handshake churn from the hot path.

2. Always-on net/http/pprof on 127.0.0.1:6060. evnode-txsim is a
   load tester, not a production daemon, so cost of always serving
   profiling is acceptable; the payoff is being able to grab CPU
   profiles under live load without re-deploying the binary —
   `ssh -L 6060:127.0.0.1:6060 root@loadgen \
     go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30`.

A profile captured this way under c=8 traced the per-request hot
path: 25.5% in kernel write(2), 25% in net/http body marshaling.
That diagnostic surfaced that the c6in.2xlarge loadgen was the
binding constraint for the experiment at ~22 MB/s, not evnode or
DA — a finding we'd have spent another debug round chasing
without the in-process profiler.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
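
A sketch of both loadgen changes using only stdlib calls (the blank import is the standard way net/http/pprof registers its handlers); the surrounding wiring is illustrative:

```go
package txsim

import (
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

// newClient sizes the connection pool to 2*concurrency so every active
// sender keeps a reusable keep-alive socket (the stdlib default for
// MaxIdleConnsPerHost is 2, which starves concurrency > 2).
func newClient(concurrency int) *http.Client {
	t := http.DefaultTransport.(*http.Transport).Clone()
	t.MaxIdleConns = 2 * concurrency
	t.MaxIdleConnsPerHost = 2 * concurrency
	t.MaxConnsPerHost = 2 * concurrency
	return &http.Client{Transport: t}
}

// servePprof exposes the always-on profiler, bound to loopback only.
func servePprof() {
	go func() { _ = http.ListenAndServe("127.0.0.1:6060", nil) }()
}
```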
Under sustained ingest above the block-production drain rate,
SoloSequencer.queue grew monotonically. A 32-vCPU loadgen pushing
>100 MB/s into a runner whose executor drains ~100 MB/s per block
filled the queue at ~150 MB/s of net-positive growth — heap
profiles showed 24 GB of retained io.ReadAll bytes in the queue
within ~30 s, then anon-rss:63GB OOM-kill at the box's 64 GiB
ceiling. Reproducible twice with identical signature.

Two changes, one feature:

- SoloSequencer.SetMaxQueueBytes(n) caps the queue's total
  retained tx bytes. SubmitBatchTxs uses all-or-nothing admission
  against the cap: if the incoming batch would push us over, the
  whole batch is rejected with ErrQueueFull and the queue keeps
  its current contents untouched. Partial admission would force
  the caller to track which prefix succeeded and only re-feed the
  suffix on retry; the reaper currently doesn't do that, so the
  whole-batch rule lets the reaper just retry the same batch
  later when the queue has drained. queueBytes is decremented
  on drain (queue := nil) and re-counted for postponed txs that
  the executor's FilterTxs returns to the queue. Zero cap = the
  legacy unbounded path, preserved for tests and small
  deployments.

- The reaper bridging executor mempool → sequencer matches
  ErrQueueFull via errors.Is and treats it as transient
  backpressure: marks the rejected hashes as "seen" so the
  next reaper tick doesn't re-hash + re-submit the same already-
  rejected txs forever, logs a warn line with the dropped count,
  and continues running. Without this match every queue-full
  event would tear the daemon down via the existing fatal-on-
  submit-error path.

Loadgen sees the backpressure indirectly: with the sequencer
queue full, the executor's txChan stops draining, /tx blocks on
its bounded channel send, and txsim observes 5xx / timeouts —
cleanly applied at the application layer instead of via the
kernel OOM-killer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
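
A sketch of the all-or-nothing admission rule from the commit above, with locking and the real SoloSequencer fields omitted:

```go
package sequencer

import "errors"

var ErrQueueFull = errors.New("sequencer queue full")

type boundedQueue struct {
	queue      [][]byte
	queueBytes int
	maxBytes   int // 0 = legacy unbounded path
}

// SubmitBatchTxs admits the batch only if every byte fits under the
// cap; otherwise the queue is left untouched and the caller (the
// reaper) can retry the identical batch once the queue has drained.
func (q *boundedQueue) SubmitBatchTxs(txs [][]byte) error {
	var batch int
	for _, tx := range txs {
		batch += len(tx)
	}
	if q.maxBytes > 0 && q.queueBytes+batch > q.maxBytes {
		return ErrQueueFull // whole-batch rejection, no partial admission
	}
	q.queue = append(q.queue, txs...)
	q.queueBytes += batch
	return nil
}
```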
Two runner-side changes paired with the SoloSequencer bound:

- After constructing the SoloSequencer, call SetMaxQueueBytes
  with 10× the per-block tx budget (= 1 GiB at the current 100
  MiB MaxBlobSize). 10× is the sweet spot: large enough that a
  short burst above steady-state ingest doesn't trigger
  backpressure (we want to absorb that), small enough that the
  worst-case retained bytes fit comfortably under the box's
  RAM budget alongside the pending cache + DA in-flight buffers.

- Lift the inMemExecutor's hardcoded ingest caps. txChan and
  maxBlockTxs were sized at 500 (5 MB / 5K txs per reaper poll)
  back when those were the only memory bound on the runner. With
  the SetMaxQueueBytes cap and the FilterTxs-enforced per-block
  budget now actually doing the bounding, the ingest queue can
  hold a full 100 MiB block's worth of txs (10K slots at 10 KB)
  without straining memory — and a single reaper poll can
  drain that whole batch in one GetTxs call instead of
  needing 20× cycles. This was the binding constraint at
  ~5,000 tx/s = 50 MB/s in earlier runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
julienrbrt pushed a commit that referenced this pull request May 3, 2026