fix(systemd): bound Restart=always with a start-rate limit#216
Conversation
c0d2cf8 to
0bd5cbf
Compare
Independent reliability reviewReviewed with a fresh-context reliability persona. Section placement ( P2 (addressed in On the related concern — a transient bursty-at-boot failure (e.g. docker slow to come up) now also lands in sticky P3 (no change): test robustness nit. The invariant test trusts |
Generated units set Restart=always / RestartSec=10 with no StartLimitIntervalSec/StartLimitBurst, so a deterministically-failing ExecStartPre (prestart, configure-container-routing, or an app-prestart.sh hook sourced under set -e) made systemd retry start every 10s forever instead of entering `failed`. Add StartLimitIntervalSec=600 / StartLimitBurst=5 to the [Unit] section. With RestartSec=10 a failing start trips the burst in ~40s, terminating in `failed`, while a legitimate in-app restart spaced further apart than interval/burst (>2 min) never accumulates toward the limit. A start-limit `failed` unit does not self-heal and an upgrade does not clear it, so document the recovery (systemctl reset-failed) in the unit comment and docs/DESIGN.md. Closes #213 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
0bd5cbf to
077bc32
Compare
Motivation
Generated systemd units set
Restart=always/RestartSec=10(templates/systemd/service.j2) with noStartLimitIntervalSec/StartLimitBurst. A deterministically-failingExecStartPre— the frameworkprestart.sh,configure-container-routing, or (since the composable-prestart work) a failing app-prestart.shhook sourced underset -e— makes systemd retry start every 10s forever instead of enteringfailed. On a device this is quiet log-spam/churn rather than an observable failure.Restart=alwaysitself is intentional: PR #199 set it specifically so an app can request its own restart (e.g. Signal K restarting itself). So the fix must bound the restart rate without breaking legitimate in-app restarts.Approach
Add to the
[Unit]section of the generated unit:(
StartLimit*are[Unit]-section directives — placing them under[Service]makes systemd ignore them; a test asserts they land in[Unit].)Value justification
The two requirements — terminate a deterministic prestart failure in
failed, yet survive legitimate infrequent in-app restarts — do not conflict, because they live on very different timescales. The chosen values separate them cleanly:StartLimitIntervalSec=10s,StartLimitBurst=5. WithRestartSec=10, five starts span ~40s, which exceeds the 10s default window — so the limit never trips and the unit loops forever. Widening the interval is what actually makes the limit effective.RestartSec=10. Five starts accumulate in ~40s, well inside the 600s (10 min) window, so the unit hits the burst and entersfailedafter ~40s — observable viasystemctl status/is-failedinstead of silent churn.Net:
burst * RestartSec(50s)< interval(600s) guarantees a same-cause loop is caught, while the per-restart budget of ~120s stays generous for intentional, infrequent restarts. A test asserts theburst * RestartSec < intervalinvariant directly so the values can't silently drift into ineffectiveness.Tests
TDD: added
TestStartLimitintests/test_renderer.py(asserts both directives are emitted, that they sit in the[Unit]section, and theburst * RestartSec < intervalinvariant). Confirmed red before the template change, green after. Full unit + integration suites, ruff lint/format, andtyall pass locally.Caveat
Per the workspace version-bump policy, version-cycle coordination is handled at merge by the orchestrator — this PR does not bump
VERSIONor editdebian/changelog. CIversion-bump-checkmay therefore be red; that is expected and will be resolved at merge time.Closes #213
🤖 Generated with Claude Code