Skip to content

feat: add degraded boot mode for unavailable bootloader env#9

Open
JoergZeidler wants to merge 37 commits into
omnect:mainfrom
JoergZeidler:feat/degraded-boot-mode
Open

feat: add degraded boot mode for unavailable bootloader env#9
JoergZeidler wants to merge 37 commits into
omnect:mainfrom
JoergZeidler:feat/degraded-boot-mode

Conversation

@JoergZeidler
Copy link
Copy Markdown
Contributor

@JoergZeidler JoergZeidler commented May 20, 2026

Summary

  • Introduce BootEnvState enum replacing Option<Box<dyn BootEnv>> throughout,
    with variants Available(Box<dyn BootEnv>) and Degraded(BootEnvError)
  • Add pure classify_boot_env() as the single decision point for degraded boot:
    • debug-image + unavailable boot env → Abort → debug shell via handle_fatal_error()
    • release-image + unavailable boot env → Continue(Degraded) → boot proceeds
  • Add degraded_boot: bool to OdsStatus (written to
    /run/omnect-device-service/omnect-os-initramfs.json), signalling the running
    OS that it booted degraded
  • resize-data preflight now runs without a boot env guard in degraded mode —
    first-boot resize is no longer silently skipped when the boot env is unavailable
  • Add InitramfsError::DegradedBoot error variant
  • Add test-utils feature exposing MockBootEnv for integration tests
  • Rename all Bootloader* types to BootEnv* throughout
    (BootEnv, BootEnvState, BootEnvDecision, BootEnvKey, BootEnvError,
    GrubBootEnv, UBootBootEnv, MockBootEnv, open_boot_env(),
    classify_boot_env(), apply_boot_env_decision())

Reason

On first-boot, open_boot_env() can fail (e.g. GRUB boot partition not yet
mounted). Previously this silently skipped all boot-env-guarded operations,
including resize-data, with no signal to the OS. This change makes the failure
explicit and handles it correctly per image type.

The Bootloader* naming described the hardware component rather than the
operation (environment access), causing confusion. BootEnv* names the
abstraction by what it does.

Verification

All 14 feature combinations pass:
{grub, uboot} × {gpt, dos} × {∅, resize-data} × {∅, release-image, test-utils}

  • cargo clippy --tests -- -D warnings — clean
  • cargo fmt -- --check — clean
  • cargo audit — no vulnerabilities

Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
…tion

Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
…d OdsStatus

Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
…f_needed

Changes the resize_if_needed function signature to accept an optional
bootloader reference instead of a required one. This allows the degraded
boot path to call resize without access to a bootloader instance.

The guard write that persists the resize state to the bootloader
environment is now conditional on the bootloader being available.

Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
…ed resize

Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
… on degraded debug-image

Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
@JoergZeidler JoergZeidler requested a review from JanZachmann May 20, 2026 12:59
Copy link
Copy Markdown

@JanZachmann JanZachmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated multi-agent review (correctness, comments, integration tests, silent failures). PR builds clean and all tests pass under grub,gpt,resize-data,release-image,test-utils; spec §3.4 precedence invariant verified in all four (image × open_result) cells.

Findings below — 4 critical, 6 important, 9 minor. The critical ones cluster around the integration tests being orphaned from the spec's CI matrix and the underlying BootloaderError being dropped on the release degraded path (which is the exact silent-failure class the PR set out to eliminate).

Comment thread Cargo.toml
Comment thread src/lib.rs Outdated
Comment thread src/filesystem/resize_data.rs
Comment thread src/preflight/resize_data.rs
Comment thread src/mode/normal.rs Outdated
Comment thread src/bootloader/mod.rs Outdated
Comment thread src/mode/mod.rs Outdated
Comment thread src/runtime/omnect_device_service.rs
Comment thread src/filesystem/resize_data.rs Outdated
Comment thread tests/degraded_boot.rs
Integration tests in tests/degraded_boot.rs require MockBootloader via the
test-utils feature. The spec §6.5 matrix lacked test-utils on all combos,
so the degraded_boot integration test binary was silently skipped.

Update spec §6.5 to add test-utils to all 14 combinations.

Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
cause. Operators had no way to distinguish env-file-missing from
grub-editenv crash or IO error.

Match directly on BootloaderEnv::Degraded to surface the error, and drop
the now-redundant bool from the Continue pattern.

Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
…hen None

Extract the bootloader guard write from resize_if_needed into a separate
pub(crate) function write_resize_guard. This makes the guard-write dispatch
testable independently of the resize commands, which require real block
devices.

Add three unit tests:
- write_guard_none_bootloader_is_noop: Ok() with no side effects
- write_guard_some_bootloader_sets_env: guard written when available
- write_guard_none_does_not_call_set_env: verifies the degraded-mode
  skip behaviour — the exact silent-failure the PR exists to make explicit

Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
Replace degraded_env_with_empty_layout_returns_ok (the empty_layout()
anti-pattern called out in spec §6.3) with two tests:

- degraded_env_skips_guard_write: empty_layout() + Degraded, verifies the
  Degraded arm is dispatched without panic. The empty_layout() concession
  is intentional — real devices are CI/Concourse-only; the guard-write skip
  is separately verified in write_resize_guard unit tests (C3).

- degraded_env_with_data_layout_guard_not_written: layout_with_data() +
  Degraded, asserts the error on a non-existent device is ResizeDataError
  (not BootloaderError), confirming the guard-write path was not reached.

Update spec §6.3 to document the concession and the two-test approach.

Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
Change persist_fsck_results to take Option<&mut dyn Bootloader>.
When None (degraded mode), the bootloader env write is skipped but
on-disk log files (/data/var/log/fsck/) are still written when the
data partition is mounted — keeping fsck diagnostics accessible to
operators and ODS even when the bootloader environment is unavailable.

Update both call sites in lib.rs and mode/normal.rs accordingly.

Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
so the test would pass even if DegradedBoot stopped wrapping a typed
cause. Destructure the variant and assert the inner error is the
expected CommandFailed { command: "grub-editenv" }, proving the
#[source] wiring required by spec §6.1 is in place.

Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
Extract the ordering-critical dispatch into apply_bootloader_decision()
so the FsckRequiresReboot-wins invariant can be unit-tested directly.
The function is small (~10 lines) and pure; persist_fsck_results stays
in run_init where it has access to the rootfs path.

Four unit tests cover the precedence matrix:
- Available env + Ok core       → Ok(Available)
- Degraded env + Ok core        → Ok(Degraded) + degraded_boot flag set
- Degraded env + FsckReboot     → Err(FsckRequiresReboot), flag NOT set
- Abort(DegradedBoot) + FsckReboot → Err(FsckRequiresReboot)

Also fix unused variable warning in tests/degraded_boot.rs (I3).

Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
…tloader

Documents the degraded-mode contract: passing None bootloader must not
attempt any save_fsck_status call. The None type statically prevents
any implicit fallback; this test catches a future refactor that
re-introduces a null-object or silent absorber pattern.

Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
When the bootloader is unavailable, handle_update_validation is
silently skipped. An in-flight A/B update will roll back on timer
expiry with no trace in the initramfs logs. Add an explicit warn!
so operators and CI have a grep target when diagnosing OTA failures
after a degraded boot.

Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
M1/M9: wire is_degraded() into the release-image integration test,
  giving the method a regression test and making the assertion explicit.

M2: document why fsck.clear() is not done in degraded mode.

M3: update lib.rs comment to reflect apply_bootloader_decision refactor.

M4: s/normal boot/successful open/ in classify_bootloader doc — avoids
  collision with the BootMode::Normal enum variant.

M5: fix BootloaderEnv doc — describes the type, not classify_bootloader.

M6: note that BootMode::detect returns Normal for both available and
  degraded bootloader states.

M7: document skip_serializing_if rationale on degraded_boot field.

Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
If bl.set_env(ResizedData) fails after resize2fs + sync succeeded,
propagating a BootloaderError would make it impossible to distinguish
'resize did not run' from 'resize ran but guard failed'. The resize
is idempotent, so a failed guard write is non-fatal: log a warn and
let the boot continue. Resize will repeat on next boot.

Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
- Add degraded boot mode to implemented functionality overview
- Add test-utils to feature table with 'never in production' note
- Fix release-image description to include degraded-boot behaviour
- Expand testing section to all 14 matrix combinations with test-utils

Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
…Continue

BootloaderEnv::is_degraded() is already the single source of truth for
whether the env is degraded. The second bool field in Continue(env, bool)
was always equal to env.is_degraded() and was discarded with _ in every
production call site, enabling logically contradictory constructions.

Remove the bool field entirely. All call sites updated.

Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
@JoergZeidler JoergZeidler requested a review from JanZachmann May 21, 2026 11:38
Copy link
Copy Markdown

@JanZachmann JanZachmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review of PR #9 (head 5aff395)

Verified all 19 prior findings against the new code at 5aff395. PR builds clean and 145 unit tests + 4 integration tests + 12 device-detection tests all pass under grub,gpt,resize-data,release-image,test-utils.

Status of prior findings

Original Status Verification
C1 — Cargo.toml test-utils gating ✅ Fixed Spec §6.5 amended (4f83ae4); cargo test --features grub,gpt,test-utils now runs 4 integration tests
C2 — BootloaderError silently dropped ✅ Fixed apply_bootloader_decision logs warn!("… {e} …") when env is Degraded (aefc2e2)
C3 — silent-skip bug untested 🟡 Mostly fixed write_resize_guard extracted with 3 tests (3d7fe98); write_guard_some_bootloader_sets_env strongly verifies the happy path. See new finding N3 below.
C4 — spec §6.3 violation ✅ Fixed degraded_env_with_data_layout_guard_not_written added (5654c16); spec §6.3 amended
I5 — fsck log files lost ✅ Fixed persist_fsck_results signature is now Option<&mut dyn Bootloader> (7c88cc5); log-file write proceeds regardless of bootloader availability
I6 — reboot loop risk 🟡 Deferred to follow-up issue (acknowledged)
I7 — #[source] cause unverified ✅ Fixed err_debug_image_is_abort_with_cause now destructures and asserts CommandFailed { command: "grub-editenv", .. } (fff0eb4)
I8 — precedence not unit-testable ✅ Fixed apply_bootloader_decision extracted with 4 precedence tests (6f0a00d)
I9 — persist_fsck_results skip not asserted ✅ Fixed test_persist_none_bootloader_skips_save_fsck_status added (61d9b60)
I10 — update-validation cascade unlogged ✅ Fixed warn!("update validation skipped — bootloader unavailable; any in-flight A/B update will roll back on timer expiry") (66cb59a)
M1–M9 (all minor) ✅ Fixed 0a2e458, 2b2d859

Bonus refactor

5aff395 removes the redundant bool from BootloaderDecision::Continue(env, bool). Good simplification — eliminates the possibility of Continue(Available(_), true) (logically contradictory).

New findings introduced in the rework

N1 (Important) — persist-before-propagate ordering regression on uboot. Reproduced with a failing test. See inline comment on src/lib.rs:103.

N2 (Minor) — spec §3.3 and §3.4 are stale w.r.t. the bool removal and apply_bootloader_decision extraction. Inline comments on the spec.

N3 (Minor) — one of the new resize-guard tests is tautological. Inline comment on src/filesystem/resize_data.rs:325.

Overall: a large net improvement. N1 is the only finding worth fixing before merge (observability of boot fsck diagnostics on uboot+FsckRequiresReboot); the others are spec/test cleanup.

Comment thread src/lib.rs Outdated
Comment thread docs/superpowers/specs/2026-05-19-degraded-boot-mode-design.md
Comment thread docs/superpowers/specs/2026-05-19-degraded-boot-mode-design.md
Comment thread src/filesystem/resize_data.rs Outdated
Copy link
Copy Markdown

@JanZachmann JanZachmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review of PR #9 (head 5aff395)

Verified all 19 prior findings against the new code at 5aff395. PR builds clean and 145 unit tests + 4 integration tests + 12 device-detection tests all pass under grub,gpt,resize-data,release-image,test-utils.

Status of prior findings

Original Status Verification
C1 — Cargo.toml test-utils gating ✅ Fixed Spec §6.5 amended (4f83ae4); cargo test --features grub,gpt,test-utils now runs 4 integration tests
C2 — BootloaderError silently dropped ✅ Fixed apply_bootloader_decision logs warn!("… {e} …") when env is Degraded (aefc2e2)
C3 — silent-skip bug untested 🟡 Mostly fixed write_resize_guard extracted with 3 tests (3d7fe98); write_guard_some_bootloader_sets_env strongly verifies the happy path. See new finding N3 below.
C4 — spec §6.3 violation ✅ Fixed degraded_env_with_data_layout_guard_not_written added (5654c16); spec §6.3 amended
I5 — fsck log files lost ✅ Fixed persist_fsck_results signature is now Option<&mut dyn Bootloader> (7c88cc5); log-file write proceeds regardless of bootloader availability
I6 — reboot loop risk 🟡 Deferred to follow-up issue (acknowledged)
I7 — #[source] cause unverified ✅ Fixed err_debug_image_is_abort_with_cause now destructures and asserts CommandFailed { command: "grub-editenv", .. } (fff0eb4)
I8 — precedence not unit-testable ✅ Fixed apply_bootloader_decision extracted with 4 precedence tests (6f0a00d)
I9 — persist_fsck_results skip not asserted ✅ Fixed test_persist_none_bootloader_skips_save_fsck_status added (61d9b60)
I10 — update-validation cascade unlogged ✅ Fixed warn!("update validation skipped — bootloader unavailable; any in-flight A/B update will roll back on timer expiry") (66cb59a)
M1–M9 (all minor) ✅ Fixed 0a2e458, 2b2d859

Bonus refactor

5aff395 removes the redundant bool from BootloaderDecision::Continue(env, bool). Good simplification — eliminates the possibility of Continue(Available(_), true) (logically contradictory).

New findings introduced in the rework

N1 (Important) — persist-before-propagate ordering regression on uboot. Reproduced with a failing test. See inline comment on src/lib.rs:103.

N2 (Minor) — spec §3.3 and §3.4 are stale w.r.t. the bool removal and apply_bootloader_decision extraction. Inline comments on the spec.

N3 (Minor) — one of the new resize-guard tests is tautological. Inline comment on src/filesystem/resize_data.rs:325.

Overall: a large net improvement. N1 is the only finding worth fixing before merge (observability of boot fsck diagnostics on uboot+FsckRequiresReboot); the others are spec/test cleanup.

Comment thread src/lib.rs Outdated
Comment thread docs/superpowers/specs/2026-05-19-degraded-boot-mode-design.md
Comment thread docs/superpowers/specs/2026-05-19-degraded-boot-mode-design.md
Comment thread src/filesystem/resize_data.rs Outdated
@JanZachmann
Copy link
Copy Markdown

Note: the previous re-review was posted twice by accident (reviews 4336722419 and 4336723341 are identical). Please read either one; GitHub does not allow deleting submitted reviews. Apologies for the noise.

On uboot, UBootBootloader::new() is infallible so open_bootloader_env()
always succeeds and env is Available even when boot fsck fails. The I4
refactor (apply_bootloader_decision) introduced a regression: core_result?
propagated inside the function before persist_fsck_results was called,
so the boot fsck diagnostic was silently lost across the reboot on uboot.

Fix: move persist_fsck_results and ods_status.fsck.clear() into the
Continue arm of apply_bootloader_decision, before core_result?. This
satisfies the contract documented in mount_core_partitions (boot_sequence.rs)
for both GRUB (persist is a no-op because env is Degraded) and uboot
(env is Available, diagnostic is now saved before propagation).

Add persist_runs_before_fsck_reboot_propagates regression test asserting
ods.fsck is empty after the dispatcher returns with FsckRequiresReboot,
proving the persist ran before propagation. Update stale boot_sequence.rs
docstring to point at apply_bootloader_decision as the contract owner.

Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
…_set_env

The test created bl but passed None to write_resize_guard, so bl was
never reachable from inside the function. The assertion bl.get_env(...)
is_none() was trivially true regardless of any bug. The contract is
fully covered by write_guard_none_bootloader_is_noop (None returns Ok)
and write_guard_some_bootloader_sets_env (Some sets the env key).

Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
§3.3: Remove redundant bool field from BootloaderDecision::Continue
(removed in 5aff395). Update behaviour table to reference is_degraded()
as the single source of truth.

§3.4: Update boot flow diagrams to show apply_bootloader_decision with
persist-before-propagate ordering (fixed in 9b656e9). Add note on the
uboot-specific reason persist must precede core_result? propagation.

Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
@JoergZeidler JoergZeidler requested a review from JanZachmann May 21, 2026 13:25
@JanZachmann
Copy link
Copy Markdown

Re-checked at 21661e0. All four findings from the previous re-review are fixed:

  • N1 (persist regression, 9b656e9) — apply_bootloader_decision now takes rootfs: &Path and persists fsck records before propagating core_result. The new persist_runs_before_fsck_reboot_propagates test asserts ods.fsck is empty after the dispatcher returns with FsckRequiresReboot, which would fail under the prior ordering. The stale boot_sequence.rs docstring is also updated to point at apply_bootloader_decision as the contract owner.
  • N2a / N2b (stale spec, 21661e0) — §3.3 drops the bool and uses env.is_degraded(); §3.4 shows the apply_bootloader_decision flow with the explicit persist-before-core_result? ordering and an inline note explaining why this matters on uboot.
  • N3 (tautological test, 490bce8) — removed; the two sibling write_guard_* tests cover the contract.

Verified: cargo clippy --features grub,gpt,resize-data,release-image,test-utils --tests -- -D warnings clean; 145 unit + 4 integration + 12 device-detection tests pass.

Threads N1, N2a, N2b, N3 resolved. Only open thread is I6 (degraded-resize reboot loop, deferred by author).

@JoergZeidler JoergZeidler requested a review from mlilien May 22, 2026 05:30
The Bootloader trait and related types were named after the hardware
component (GRUB/U-Boot) rather than what they actually represent:
access to the bootloader environment variables. This caused ambiguity
between the bootloader software and the env-access abstraction.

Rename map:
  Bootloader         -> BootEnv
  BootloaderEnv      -> BootEnvState
  BootloaderDecision -> BootEnvDecision
  BootloaderEnvKey   -> BootEnvKey
  BootloaderError    -> BootEnvError
  GrubBootloader     -> GrubBootEnv
  UBootBootloader    -> UBootBootEnv
  MockBootloader     -> MockBootEnv
  open_bootloader_env() -> open_boot_env()
  classify_bootloader() -> classify_boot_env()
  variable bootloader   -> boot_env

The src/bootloader/ module path is unchanged.

Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
Update struct/module doc comments in grub.rs, uboot.rs and mod.rs that
still referred to GRUB/U-Boot implementations as "bootloader" rather
than "boot environment". Keeps industry-standard terms ("bootloader
environment variables", "bootloader env") unchanged.

Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
Update README.md, project-context.md, docs/error-handling.md and
docs/first-boot-detection-concept.md to reflect the renamed types:
  Bootloader        -> BootEnv
  BootloaderEnv     -> BootEnvState
  BootloaderError   -> BootEnvError
  MockBootloader    -> MockBootEnv
  UBootBootloader   -> UBootBootEnv
  open_bootloader_env() -> open_boot_env()
  classify_bootloader() -> classify_boot_env()

Historical docs in docs/superpowers/ (specs, plans, reviews) are not
updated — they are point-in-time design artefacts.

Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
…ision

Consistent with the BootEnv naming scheme introduced in the previous
commits. Historical docs in docs/superpowers/ are not updated.

Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
'BootEnv environment' is tautological (Env = environment). Replace with
'Boot environment' in comments, doc strings, and log messages.

Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
… doc

Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
Replace references to 'bootloader' that describe BootEnv trait objects
or BootEnvState with 'boot env'. Keeps legitimate hardware references
(GRUB, U-Boot, 'bootloader environment' as industry term) unchanged.

Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants