fix(systemd): bound Restart=always with a start-rate limit#216

Merged: mairas merged 1 commit into main from fix/systemd-startlimit on Jun 15, 2026

@mairas mairas commented Jun 15, 2026

Motivation

Generated systemd units set Restart=always / RestartSec=10 (templates/systemd/service.j2) with no StartLimitIntervalSec/StartLimitBurst. A deterministically-failing ExecStartPre — the framework prestart.sh, configure-container-routing, or (since the composable-prestart work) a failing app-prestart.sh hook sourced under set -e — makes systemd retry start every 10s forever instead of entering failed. On a device this is quiet log-spam/churn rather than an observable failure.

Restart=always itself is intentional: PR #199 set it specifically so an app can request its own restart (e.g. Signal K restarting itself). So the fix must bound the restart rate without breaking legitimate in-app restarts.

Approach

Add to the [Unit] section of the generated unit:

StartLimitIntervalSec=600
StartLimitBurst=5

(StartLimit* are [Unit]-section directives — placing them under [Service] makes systemd ignore them; a test asserts they land in [Unit].)
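A minimal sketch of the kind of section-placement check described above. It is not the project's actual test: the `section_of` helper and the `SAMPLE_UNIT` text are hypothetical and only illustrate how one might confirm which section a directive lands in.

```python
from typing import Optional


def section_of(unit_text: str, key: str) -> Optional[str]:
    """Return the section name in which `key` is first assigned, or None."""
    section = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip blank lines and unit-file comments.
        if not line or line.startswith(("#", ";")):
            continue
        # Track the current [Section] header.
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        name, sep, _ = line.partition("=")
        if sep and name.strip() == key:
            return section
    return None


# Hypothetical unit text for illustration only, not the rendered template.
SAMPLE_UNIT = """\
[Unit]
Description=Example app (hypothetical)
StartLimitIntervalSec=600
StartLimitBurst=5

[Service]
Restart=always
RestartSec=10
"""
```

With this sample, `section_of(SAMPLE_UNIT, "StartLimitBurst")` returns `"Unit"` and `section_of(SAMPLE_UNIT, "RestartSec")` returns `"Service"`.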

Value justification

The two requirements — terminate a deterministic prestart failure in failed, yet survive legitimate infrequent in-app restarts — do not conflict, because they live on very different timescales. The chosen values separate them cleanly:

  • Why the systemd defaults don't work here. systemd's defaults are StartLimitIntervalSec=10s, StartLimitBurst=5. With RestartSec=10, five starts span ~40s, which exceeds the 10s default window — so the limit never trips and the unit loops forever. Widening the interval is what actually makes the limit effective.
  • Deterministic prestart failure terminates fast. A prestart that fails on every start restarts every RestartSec=10. Five starts accumulate in ~40s, well inside the 600s (10 min) window, so the burst is exhausted, the next restart attempt (at about 50s) is refused, and the unit enters failed, observable via systemctl status / is-failed instead of silent churn.
  • Legitimate in-app restarts are safe. The burst (5) over the interval (600s) tolerates a sustained restart rate of one per ~120s. An app deciding to restart itself does so on the order of minutes to hours apart, far below that threshold, so such restarts never accumulate toward the limit. A genuine flapping crash-loop (sub-2-min cadence) should trip the limit — that is the intended safety behavior, not a regression.
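The three cases above can be replayed with a small simulation. This is a simplified model that assumes a fixed-window counter (a new window opens once the interval has elapsed since the window began) and treats each failing start as instantaneous. It is an approximation for reasoning about the numbers, not systemd's implementation, and `first_refused_start` is a hypothetical name.

```python
from typing import Optional


def first_refused_start(
    interval: float, burst: int, restart_sec: float, horizon: float = 3600.0
) -> Optional[float]:
    """Time (s) of the first start attempt the limiter refuses, or None.

    Start attempts happen at t = 0, restart_sec, 2 * restart_sec, ...
    up to `horizon` seconds.
    """
    window_begin: Optional[float] = None
    count = 0
    t = 0.0
    while t <= horizon:
        if window_begin is None or t - window_begin > interval:
            # Window expired (or first start): open a new window.
            window_begin, count = t, 1
        elif count < burst:
            # Still within the burst allowance for this window.
            count += 1
        else:
            # Burst exhausted inside the window: this start is refused.
            return t
        t += restart_sec
    return None


# Chosen values: the fifth permitted start lands at 40s, the refusal at 50s.
print(first_refused_start(interval=600, burst=5, restart_sec=10))
# systemd defaults with RestartSec=10: the window expires before the burst fills.
print(first_refused_start(interval=10, burst=5, restart_sec=10))
# A self-restart every 5 minutes never accumulates toward the limit.
print(first_refused_start(interval=600, burst=5, restart_sec=300))
```

Under these assumptions the first call yields 50.0 and the other two yield None, which matches the reasoning in the bullets: the default window never fills, the widened window does, and infrequent restarts stay clear of it.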

Net: burst * RestartSec (50s) < interval (600s) guarantees a same-cause loop is caught, while the per-restart budget of ~120s stays generous for intentional, infrequent restarts. A test asserts the burst * RestartSec < interval invariant directly so the values can't silently drift into ineffectiveness.
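The invariant and the per-restart budget are plain arithmetic. A hedged sketch, with hypothetical function names that are not taken from the repository:

```python
def limit_is_effective(interval_sec: int, burst: int, restart_sec: int) -> bool:
    """True when a same-cause restart loop fills the burst inside the window."""
    return burst * restart_sec < interval_sec


def tolerated_restart_spacing(interval_sec: int, burst: int) -> float:
    """Average spacing (s) between restarts that the limit tolerates indefinitely."""
    return interval_sec / burst


# Chosen values: 5 * 10 = 50 < 600, so a deterministic loop is caught.
print(limit_is_effective(600, 5, 10))      # True
# systemd defaults with RestartSec=10: 5 * 10 = 50 is not < 10, so it never trips.
print(limit_is_effective(10, 5, 10))       # False
# Budget per restart with the chosen values.
print(tolerated_restart_spacing(600, 5))   # 120.0
```

Asserting `limit_is_effective` against the rendered values is one way to keep a later change to RestartSec or the interval from silently disabling the limit.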

Tests

TDD: added TestStartLimit in tests/test_renderer.py (asserts both directives are emitted, that they sit in the [Unit] section, and the burst * RestartSec < interval invariant). Confirmed red before the template change, green after. Full unit + integration suites, ruff lint/format, and ty all pass locally.

Caveat

Per the workspace version-bump policy, version-cycle coordination is handled at merge by the orchestrator — this PR does not bump VERSION or edit debian/changelog. CI version-bump-check may therefore be red; that is expected and will be resolved at merge time.

Closes #213

🤖 Generated with Claude Code

@mairas mairas force-pushed the fix/systemd-startlimit branch from c0d2cf8 to 0bd5cbf Compare June 15, 2026 21:42

mairas commented Jun 15, 2026

Independent reliability review

Reviewed with a fresh-context reliability persona. Section placement ([Unit]), the Restart=always / RestartSec=10 interaction, the 600/5 values, the StartLimitAction default (none → just failed, no reboot), and the tests (presence, section, and the burst × RestartSec < interval invariant) all check out.

P2 (addressed in 0bd5cbf): recovery was undocumented, and the failed state is sticky. A unit that trips the start limit enters failed and does not self-heal — it needs systemctl reset-failed + systemctl start, and a package upgrade does not clear it. That recovery procedure is now documented in both the generated unit comment (visible via systemctl cat) and docs/DESIGN.md.
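For reference, the recovery sequence could be scripted roughly as follows. This is a sketch only: the `recovery_commands` and `recover` helpers and the example unit name are hypothetical, and the injectable `run` parameter exists so the sequence can be inspected without touching systemd.

```python
import subprocess
from typing import Callable, List, Optional, Sequence


def recovery_commands(unit: str) -> List[List[str]]:
    """Commands that clear a start-limit failure and start the unit again."""
    return [
        ["systemctl", "reset-failed", unit],  # clear the sticky failed state
        ["systemctl", "start", unit],         # then request a fresh start
    ]


def recover(
    unit: str, run: Optional[Callable[[Sequence[str]], object]] = None
) -> None:
    """Run the recovery commands in order; `run` defaults to subprocess.run."""
    if run is None:
        run = lambda cmd: subprocess.run(cmd, check=True)
    for cmd in recovery_commands(unit):
        run(cmd)


# Dry run: collect the commands instead of executing them.
issued: List[Sequence[str]] = []
recover("example-app.service", run=issued.append)  # hypothetical unit name
print(issued)
```

Running `recover` without a `run` argument executes the commands for real and typically needs root privileges. If the underlying prestart failure has not been fixed first, the unit would be expected to trip the limit again.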

On the related concern — a transient bursty-at-boot failure (e.g. docker slow to come up) now also lands in sticky failed instead of retrying indefinitely: I'm accepting this trade rather than retuning. The unit already orders After=/Requires=docker.service (and After= the routing/authelia deps), so a dependency being down for 5 starts over ~40s at boot is already a broken boot, not a brief flap; and indefinite silent retry was the bug being fixed. If real-world boot flapping proves otherwise we can widen StartLimitIntervalSec.

P3 (no change): test robustness nit. The invariant test trusts RestartSec exists; acceptable — it's strictly better than a bare string match.

Generated units set Restart=always / RestartSec=10 with no
StartLimitIntervalSec/StartLimitBurst, so a deterministically-failing
ExecStartPre (prestart, configure-container-routing, or an
app-prestart.sh hook sourced under set -e) made systemd retry start
every 10s forever instead of entering `failed`.

Add StartLimitIntervalSec=600 / StartLimitBurst=5 to the [Unit] section.
With RestartSec=10 a failing start trips the burst in ~40s, terminating
in `failed`, while a legitimate in-app restart spaced further apart than
interval/burst (>2 min) never accumulates toward the limit.

A start-limit `failed` unit does not self-heal and an upgrade does not
clear it, so document the recovery (systemctl reset-failed) in the unit
comment and docs/DESIGN.md.

Closes #213

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mairas mairas force-pushed the fix/systemd-startlimit branch from 0bd5cbf to 077bc32 Compare June 15, 2026 21:55
@mairas mairas merged commit 9353ce7 into main Jun 15, 2026
4 checks passed
@mairas mairas deleted the fix/systemd-startlimit branch June 15, 2026 21:57

Development

Successfully merging this pull request may close these issues.

fix: generated units lack StartLimit, so a failing prestart loops forever
