fix: exit 0 on graceful SIGTERM/deadline shutdown #30
Merged
Background tasks return ctx.Err() on shutdown, which bubbles to runLoadTest as context.Canceled and is then log.Fatal()'d, so every graceful shutdown exits 1.

Treat context.Canceled (and DeadlineExceeded for future-proofing) as the expected end-of-run signal at the process boundary.

Fixes KubeJobFailed alerts for nightly/seiload-* on harbor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
amir-deris approved these changes Apr 30, 2026
bdchatham added a commit that referenced this pull request Apr 30, 2026
## Summary

Add `--duration` flag so seiload self-terminates cleanly inside K8s Job `activeDeadlineSeconds`, producing pod exit 0 → Job condition `Complete` instead of K8s-mandated `Failed/DeadlineExceeded`.

## Why this exists

Followup to #30. The exit-code fix in #30 is correct in isolation (graceful SIGTERM → exit 0), but on Kubernetes Jobs with **Job-level `activeDeadlineSeconds`**, the K8s Job controller sets `condition=Failed, reason=DeadlineExceeded` *regardless of the container's exit code*:

> Once a Job reaches activeDeadlineSeconds, all of its running Pods are terminated and the Job status will become type: Failed with reason: DeadlineExceeded.
>
> Source: [Kubernetes Job docs](https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup)

A run on harbor's `nightly` namespace today (workflow [25189102396](https://github.com/sei-protocol/platform/actions/runs/25189102396)) confirmed this: seiload's pod was SIGTERMed at the deadline and likely exited 0 (post-#30), but the Job condition was set:

```
"reason": "DeadlineExceeded",
"message": "Job was active longer than specified deadline"
```

This means `kube_job_failed=1` and `KubeJobFailed` keeps firing, which is the original symptom that motivated this whole investigation. The actual fix has to make seiload self-terminate *before* K8s decides "DeadlineExceeded."

## What changes

```go
rootCmd.Flags().Duration("duration", 0,
	"Run duration; the load test ctx is canceled after this elapses, the existing graceful-shutdown path runs, and the process exits 0. 0 means run until SIGTERM/SIGINT.")
```

```go
if duration, _ := cmd.Flags().GetDuration("duration"); duration > 0 {
	log.Printf("⏰ Run duration: %s", duration)
	var cancel context.CancelFunc
	ctx, cancel = context.WithTimeout(ctx, duration)
	defer cancel()
}
```

When `--duration` is set:

1. The load-test context is wrapped with `WithTimeout`.
2. After `duration` elapses, ctx is canceled with `context.DeadlineExceeded`.
3. Background tasks (dispatcher, logger, block_collector) unwind via `ctx.Done()`.
4. `service.Run` returns the wrapped DeadlineExceeded error.
5. Final stats emit, `EmitRunSummary` runs, post-summary flush sleeps for 45s.
6. Existing boundary check from #30 (`errors.Is(err, context.DeadlineExceeded)`) clears the error.
7. Process exits 0.

The existing post-summary flush delay still runs by design: it sits *after* `service.Run` returns, in the cleanup pipeline.

## What doesn't change

- Default is `0` (unlimited) so existing callers without the flag are unaffected.
- SIGTERM/SIGINT handling in `main.go:347-352` is untouched and still works.
- Exit-code semantics from #30 (Canceled OR DeadlineExceeded → exit 0) already cover both internal-timeout and external-SIGTERM paths.

## Test plan

- [x] `GOWORK=off go build .` passes
- [x] `GOWORK=off go vet ./...` clean
- [ ] Companion platform PR will pass `--duration=${DURATION_MINUTES}m` to seiload args; tomorrow's nightly will verify Job condition flips to `Complete`

## Companion change

Will follow with a small PR on `sei-protocol/platform` that:

1. Bumps the seiload image to the new SHA after this merges.
2. Adds `--duration=${DURATION_MINUTES}m` to seiload args in `clusters/harbor/nightly/templates/seiload-job.yaml`.
3. Optionally raises `JOB_DEADLINE_SECONDS` slightly to keep `activeDeadlineSeconds` as a backstop only.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
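The two shutdown paths described in this commit message (internal timeout from `--duration`, external cancellation from SIGTERM) can be reproduced in isolation. The sketch below is not seiload code: `runUntilDone`, `runWithDuration`, and `demoPaths` are hypothetical names, `runUntilDone` only stands in for `service.Run`, and the 20ms timings are arbitrary. It assumes the background task returns `ctx.Err()` wrapped with `%w`, as the PR text describes.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runUntilDone is a hypothetical stand-in for service.Run: it blocks until
// ctx ends, then returns the context error wrapped with a task name.
func runUntilDone(ctx context.Context) error {
	<-ctx.Done()
	return fmt.Errorf("dispatcher: %w", ctx.Err())
}

// runWithDuration mirrors the --duration wiring: a positive duration wraps
// the context in WithTimeout; zero leaves the parent context untouched.
func runWithDuration(parent context.Context, duration time.Duration) error {
	ctx := parent
	if duration > 0 {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, duration)
		defer cancel()
	}
	return runUntilDone(ctx)
}

// demoPaths exercises both shutdown paths and reports which sentinel error
// each one yields after wrapping.
func demoPaths() (timeoutIsDeadline, cancelIsCanceled bool) {
	// Internal-timeout path: equivalent to passing --duration=20ms.
	err := runWithDuration(context.Background(), 20*time.Millisecond)
	timeoutIsDeadline = errors.Is(err, context.DeadlineExceeded)

	// External path: duration 0, parent canceled the way a signal handler
	// would cancel it on SIGTERM.
	parent, cancel := context.WithCancel(context.Background())
	time.AfterFunc(20*time.Millisecond, cancel)
	err = runWithDuration(parent, 0)
	cancelIsCanceled = errors.Is(err, context.Canceled)
	return timeoutIsDeadline, cancelIsCanceled
}

func main() {
	deadline, canceled := demoPaths()
	fmt.Printf("timeout path yields DeadlineExceeded: %v\n", deadline)
	fmt.Printf("cancel path yields Canceled: %v\n", canceled)
}
```

Because the two paths surface different sentinel errors, a boundary check that accepts only `context.Canceled` would still exit non-zero once `--duration` is in use; accepting both is what lets either path end in exit 0.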
## Summary

seiload exits non-zero on every graceful shutdown. Treat `context.Canceled` (and `context.DeadlineExceeded` for future-proofing) as the expected end-of-run signal at the top-level boundary of `runLoadTest`.

## Root cause

Tracing the chain when Kubernetes SIGTERMs the container at `activeDeadlineSeconds`:

1. `runLoadTest` (`main.go:347-352`) catches SIGTERM cleanly and the main task closure passed to `service.Run` returns nil.
2. `service.Run` (`utils/service/start.go:124-132`) cancels its internal context once the main task finishes.
3. Background tasks observe `ctx.Done()` and return `ctx.Err()` (= `context.Canceled`):
   - `sender/dispatcher.go:106`
   - `stats/logger.go:253`
   - `stats/block_collector.go:87`
4. `SpawnBgNamed` (`utils/service/start.go:101-113`) wraps with `fmt.Errorf("%s: %w", name, err)`, which preserves `errors.Is` traversal.
5. `runLoadTest` returns the wrapped error (`main.go:365`) → `rootCmd.Run` calls `log.Fatal(err)` (`main.go:49`) → exit 1.

Net effect: every successful nightly run exits 1, the K8s Job goes to Failed, and `kube_job_failed=1` fires `KubeJobFailed` alerts on the harbor cluster (8 distinct run IDs in the last 14d).

## Fix
Conventional Go pattern at the process boundary (parallels `http.ErrServerClosed`). The signal handler at `main.go:351-352` already declares "SIGTERM is graceful" by returning nil; this honors that intent. `DeadlineExceeded` is included defensively: the internal ctx in `service.Run` is `WithCancel` today so it can't appear, but if anyone wraps the parent in `WithTimeout` later we'd regress.

## Why at the boundary, not at task sites
Considered pushing `return nil` into each background task, but `return ctx.Err()` at task sites is honest signal: if a sibling task fails for a real reason (sender error, RPC down) and that propagates ctx-cancellation, you want each background task to surface that fact, not silently swallow it. The boundary check is the right level: a single point where the process owner decides "this counts as success."

## Test plan
- `go build .` passes
- `condition=Complete`, `kube_job_failed` stays at 0, and `KubeJobFailed` alert clears for new runs

## Production impact
Fixes the upstream cause of `KubeJobFailed` warnings on `nightly/seiload-*` Jobs (`harbor` cluster, `nightly` namespace). Companion PR in `sei-protocol/platform` adds (a) EKS auth refresh in the workflow's teardown step and (b) a defense-in-depth GC CronJob, which together address resource leaks discovered while diagnosing this exit-code issue.

🤖 Generated with Claude Code
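The boundary check described under "Fix" and "Why at the boundary, not at task sites" can be sketched as follows. This is an illustration under assumptions, not the merged diff: `wrapTask`, `isGracefulShutdown`, `exitCode`, and `demoExitCodes` are hypothetical names, and `wrapTask` only imitates the `fmt.Errorf("%s: %w", name, err)` wrapping that the description attributes to `SpawnBgNamed`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// wrapTask imitates the wrapping described for SpawnBgNamed: the task name is
// prefixed, and %w keeps the original error reachable through errors.Is.
func wrapTask(name string, err error) error {
	if err == nil {
		return nil
	}
	return fmt.Errorf("%s: %w", name, err)
}

// isGracefulShutdown is the single decision point: context cancellation and
// deadline expiry are treated as the expected end of a run.
func isGracefulShutdown(err error) bool {
	return errors.Is(err, context.Canceled) ||
		errors.Is(err, context.DeadlineExceeded)
}

// exitCode maps the error returned at the process boundary to an exit status.
// Task sites keep returning ctx.Err(); only this function reinterprets it.
func exitCode(err error) int {
	if err == nil || isGracefulShutdown(err) {
		return 0
	}
	return 1
}

// demoExitCodes returns the exit status for a graceful shutdown and for a
// real task failure, both passed through the same wrapping.
func demoExitCodes() (graceful, failed int) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate the internal context being canceled on shutdown

	graceful = exitCode(wrapTask("dispatcher", ctx.Err()))
	failed = exitCode(wrapTask("sender", errors.New("rpc unavailable")))
	return graceful, failed
}

func main() {
	graceful, failed := demoExitCodes()
	fmt.Printf("graceful shutdown exit code: %d\n", graceful)
	fmt.Printf("real failure exit code: %d\n", failed)
}
```

In this sketch a genuine failure such as the hypothetical `rpc unavailable` error still produces exit 1, which is the property the description wants to keep by leaving `return ctx.Err()` in place at the task sites.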