fix(tools/talis): fibre-experiment race fixes & loadgen tooling #3305
julienrbrt merged 3 commits into evstack:julien/fiber
Three race conditions surfaced repeatedly on a fresh AWS bring-up of
the Fibre throughput experiment. Each one had the same shape: a
talis subcommand "succeeded" at the CLI level (or returned the txhash
with --yes) before the chain had actually applied the work, leaving
downstream steps to fail in confusing ways. This commit makes each
step verify *outcome*, not just *invocation*, so the experiment can
go from a fresh `talis up` to a running loadgen without manual
intervention.
• setup-fibre script (fibre_setup.go) now:
- polls `celestia-appd status` for `latest_block_height>0`
before submitting any tx — fixes the silent-noop where
set-host + 100× deposit-to-escrow all bounced with
"celestia-app is not ready; please wait for first block";
- retries `set-host` in a loop until the validator's host
shows up in `query valaddr providers` — fixes the case
where --yes returns the txhash before block inclusion and
the tx silently lands in the mempool but never confirms;
- verifies fibre-0's escrow account is funded on-chain before
the tmux session exits — same silent-failure mode as
set-host, but on the deposit side.
The talis-CLI step also now cross-checks all validators are
registered from a single vantage point before returning, so a
concurrent set-host race surfaces as an error instead of a
half-empty provider list start-fibre would cache forever.
• fibre-bootstrap-evnode (fibre_bootstrap_evnode.go) now stages
the keyring scp into a tmp directory and `mv`s it atomically
into place. The previous direct `scp -r` to
/root/keyring-fibre/keyring-test created the directory before
transferring its contents — the evnode init script's
`[ -d keyring-test ]` poll passed mid-transfer, the daemon
launched with no fibre-0.info, and crashed with `keyring entry
"fibre-0" not found`.
• evnode_init.sh (genesis.go) now waits for the specific
keyring-test/fibre-0.info file rather than just the
keyring-test directory. Belt-and-braces: the bootstrap mv is
already atomic on the same filesystem, but the file-level
guard means a hand-pushed keyring (not via talis) can't trip
the same race.
• New `talis fibre-experiment` umbrella command runs
up → genesis → deploy → setup-fibre → start-fibre →
fibre-bootstrap-evnode in order. Each step uses the same
binary as a subprocess; failures in any step abort the chain.
Operator goes from a prepared root dir to a running loadgen
with one command, instead of remembering the sequence.
Verified by 5-min sustained loadgen against julien/fiber HEAD with
PR evstack#3287 (concurrent submitter) merged: 47.65 MB/s @ 99.999 % ok,
up from the prior 24.57 MB/s baseline (the gap is PR evstack#3287's
overlapping uploads — these talis fixes just stop the deploy from
silently breaking before throughput matters).
Three follow-up bugs surfaced from the PR evstack#3303 follow-up
verification run on a 3-validator AWS Fibre cluster:
- aws.go: CreateAWSInstances exited 0 even when individual instance
launches failed, so `talis up` lied about success and downstream
steps proceeded against a partial cluster. Returns a joined error
now so failure cascades stop early.
- download.go: sshExec used cmd.CombinedOutput, mixing SSH warnings
(the "Warning: Permanently added '...'..." chatter on stderr) into
bytes the caller hands to fmt.Sscanf("%d"). The CLI-side providers
cross-check parsed those warnings as 0 and looped until its 5-min
deadline even though a direct SSH query showed all 3 providers
registered. Switch to cmd.Output() (stdout only) and add
`-q -o LogLevel=ERROR` to silence the chatter for any caller that
does combine streams.
- fibre_setup.go: the per-validator escrow verification used
`celestia-appd query fibre escrow`, which doesn't exist; the actual
subcommand is `escrow-account`. The query errored on every retry,
the grep for "amount" never matched, and the script wedged on the
3-min deadline reporting `FATAL: fibre-0 escrow not present`.
Switch to `escrow-account` and key on `"found":true` (the explicit
existence flag in the response). Also wrap the fibre-0
deposit-to-escrow itself in a retry loop matching set-host: the
same `--yes`-returns-before-inclusion silent-failure mode bit it.
fibre-1..N stay best-effort.
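The download.go failure is easy to reproduce in isolation. The sketch below is a hypothetical stand-in for the providers cross-check, not the talis code: `parseCount` is an invented helper, the warning text is a sample, and it assumes the stderr line arrives ahead of the count in the combined output.

```go
package main

import (
	"fmt"
	"strings"
)

// parseCount (hypothetical) mimics a caller that reads an integer from
// command output with fmt.Sscanf.
func parseCount(out string) (int, error) {
	var n int
	_, err := fmt.Sscanf(strings.TrimSpace(out), "%d", &n)
	return n, err
}

func main() {
	stdout := "3\n"
	// Sample SSH warning; the address is made up for illustration.
	stderr := "Warning: Permanently added '10.0.0.1' (ED25519) to the list of known hosts.\n"

	// Combined streams: the scan hits "Warning" first, fails, and n stays 0.
	n, err := parseCount(stderr + stdout)
	fmt.Println(n, err != nil) // 0 true

	// Stdout only: the count parses cleanly.
	n, err = parseCount(stdout)
	fmt.Println(n, err) // 3 <nil>
}
```

A caller that ignores the scan error sees `0` in the first case, which matches the looping behaviour described above.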
Two diagnostic improvements for the load generator:
1. http.Transport.MaxIdleConnsPerHost defaults to 2 in stdlib.
With --concurrency=8 (or higher), 6+ goroutines per cycle had
to open fresh TCP+TLS sockets per request because the pool
couldn't hold their idle conns between requests. Bump
MaxIdleConns / MaxIdleConnsPerHost / MaxConnsPerHost to
2*concurrency so every active sender has a reusable keep-alive
socket, eliminating handshake churn from the hot path.
2. Always-on net/http/pprof on 127.0.0.1:6060. evnode-txsim is a
load tester, not a production daemon, so cost of always serving
profiling is acceptable; the payoff is being able to grab CPU
profiles under live load without re-deploying the binary —
`ssh -L 6060:127.0.0.1:6060 root@loadgen \
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30`.
A profile captured this way under c=8 traced the per-request hot
path: 25.5% in kernel write(2), 25% in net/http body marshaling.
That diagnostic surfaced that the c6in.2xlarge loadgen was the
binding constraint for the experiment at ~22 MB/s, not evnode or
DA — a finding we'd have spent another debug round chasing
without the in-process profiler.
Issues
Operating the talis-driven Fibre throughput experiment surfaced three classes of bug that consistently broke fresh cluster bring-up and made one set of test runs unrepeatable:
1. `set-host` and `deposit-to-escrow` were one-shot calls: `--yes` returns the txhash before block inclusion, so the operator saw "success" while the tx silently bounced (`celestia-app is not ready; please wait for first block`). All subsequent steps then ran against an unregistered host with no escrow.
2. `scp -r` creates `keyring-test/` on the target before any of its files transfer; the evnode init script's `[ -d keyring-test ]` poll then triggered mid-transfer and evnode launched with a partial keyring (`fibre-0.info: key not found`).
3. `talis up` lied about partial failures. When some EC2 instances failed to launch, `CreateAWSInstances` returned `nil` error and `talis up` printed "deployment complete" against a partial cluster, breaking later steps with confusing errors.
A few smaller bugs also fixed here:
- `fibre_setup.go` queried `celestia-appd query fibre escrow` (which doesn't exist; it's `escrow-account`), so the verification loop wedged for 3 minutes regardless of actual state.
- `sshExec` used `cmd.CombinedOutput`, mixing SSH `Warning: Permanently added` chatter into bytes the CLI handed to `Sscanf("%d")`; the providers cross-check parsed those warnings as `0` and looped to its 5-minute deadline even when SSH-direct returned `3`.
- `evnode-txsim` used the stdlib default `MaxIdleConnsPerHost=2`, so any `--concurrency > 2` opened fresh TCP/TLS sockets per request and turned the loadgen itself into the binding constraint.
Solution
- `fibre_setup.go`: chain-ready preamble, retry-until-confirmed loops for `set-host` and `fibre-0` `deposit-to-escrow`, switched verification query to `escrow-account` and key on `"found":true`, CLI cross-check that all validators registered.
- `fibre_bootstrap_evnode.go`: stage keyring under `/root/.keyring-fibre.staging/` then `mv` into place atomically.
- `fibre_experiment.go` umbrella command driving `up → genesis → deploy → setup-fibre → start-fibre → fibre-bootstrap-evnode` so the operator can't skip a verification step.
- `aws.go`: `CreateAWSInstances` now returns a joined error on any partial failure.
- `download.go`: `sshExec` returns stdout only via `cmd.Output()` and adds `-q -o LogLevel=ERROR` to silence SSH warning chatter.
- `evnode-txsim`: bump per-host idle conns to `2 × concurrency`; always-on pprof on `127.0.0.1:6060` for in-process profiling under live load.
Test plan
- `talis fibre-experiment` runs unattended end-to-end against a fresh cluster
- `talis up` partial-fails
- `evnode-txsim` saturates with `--concurrency 32` (verified via pprof: request hot path is now `write(2)`/body copy, not TCP handshake)