feat: add degraded boot mode for unavailable bootloader env#9
feat: add degraded boot mode for unavailable bootloader env#9JoergZeidler wants to merge 37 commits into
Conversation
Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
…tion Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
…d OdsStatus Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
…f_needed Changes the resize_if_needed function signature to accept an optional bootloader reference instead of a required one. This allows the degraded boot path to call resize without access to a bootloader instance. The guard write that persists the resize state to the bootloader environment is now conditional on the bootloader being available. Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
…ed resize Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
… on degraded debug-image Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
JanZachmann
left a comment
There was a problem hiding this comment.
Automated multi-agent review (correctness, comments, integration tests, silent failures). PR builds clean and all tests pass under grub,gpt,resize-data,release-image,test-utils; spec §3.4 precedence invariant verified in all four (image × open_result) cells.
Findings below — 4 critical, 6 important, 9 minor. The critical ones cluster around the integration tests being orphaned from the spec's CI matrix and the underlying BootloaderError being dropped on the release degraded path (which is the exact silent-failure class the PR set out to eliminate).
Integration tests in tests/degraded_boot.rs require MockBootloader via the test-utils feature. The spec §6.5 matrix lacked test-utils on all combos, so the degraded_boot integration test binary was silently skipped. Update spec §6.5 to add test-utils to all 14 combinations. Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
cause. Operators had no way to distinguish env-file-missing from grub-editenv crash or IO error. Match directly on BootloaderEnv::Degraded to surface the error, and drop the now-redundant bool from the Continue pattern. Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
…hen None Extract the bootloader guard write from resize_if_needed into a separate pub(crate) function write_resize_guard. This makes the guard-write dispatch testable independently of the resize commands, which require real block devices. Add three unit tests: - write_guard_none_bootloader_is_noop: Ok() with no side effects - write_guard_some_bootloader_sets_env: guard written when available - write_guard_none_does_not_call_set_env: verifies the degraded-mode skip behaviour — the exact silent-failure the PR exists to make explicit Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
Replace degraded_env_with_empty_layout_returns_ok (the empty_layout() anti-pattern called out in spec §6.3) with two tests: - degraded_env_skips_guard_write: empty_layout() + Degraded, verifies the Degraded arm is dispatched without panic. The empty_layout() concession is intentional — real devices are CI/Concourse-only; the guard-write skip is separately verified in write_resize_guard unit tests (C3). - degraded_env_with_data_layout_guard_not_written: layout_with_data() + Degraded, asserts the error on a non-existent device is ResizeDataError (not BootloaderError), confirming the guard-write path was not reached. Update spec §6.3 to document the concession and the two-test approach. Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
Change persist_fsck_results to take Option<&mut dyn Bootloader>. When None (degraded mode), the bootloader env write is skipped but on-disk log files (/data/var/log/fsck/) are still written when the data partition is mounted — keeping fsck diagnostics accessible to operators and ODS even when the bootloader environment is unavailable. Update both call sites in lib.rs and mode/normal.rs accordingly. Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
so the test would pass even if DegradedBoot stopped wrapping a typed
cause. Destructure the variant and assert the inner error is the
expected CommandFailed { command: "grub-editenv" }, proving the
#[source] wiring required by spec §6.1 is in place.
Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
Extract the ordering-critical dispatch into apply_bootloader_decision() so the FsckRequiresReboot-wins invariant can be unit-tested directly. The function is small (~10 lines) and pure; persist_fsck_results stays in run_init where it has access to the rootfs path. Four unit tests cover the precedence matrix: - Available env + Ok core → Ok(Available) - Degraded env + Ok core → Ok(Degraded) + degraded_boot flag set - Degraded env + FsckReboot → Err(FsckRequiresReboot), flag NOT set - Abort(DegradedBoot) + FsckReboot → Err(FsckRequiresReboot) Also fix unused variable warning in tests/degraded_boot.rs (I3). Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
…tloader Documents the degraded-mode contract: passing None bootloader must not attempt any save_fsck_status call. The None type statically prevents any implicit fallback; this test catches a future refactor that re-introduces a null-object or silent absorber pattern. Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
When the bootloader is unavailable, handle_update_validation is silently skipped. An in-flight A/B update will roll back on timer expiry with no trace in the initramfs logs. Add an explicit warn! so operators and CI have a grep target when diagnosing OTA failures after a degraded boot. Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
M1/M9: wire is_degraded() into the release-image integration test, giving the method a regression test and making the assertion explicit. M2: document why fsck.clear() is not done in degraded mode. M3: update lib.rs comment to reflect apply_bootloader_decision refactor. M4: s/normal boot/successful open/ in classify_bootloader doc — avoids collision with the BootMode::Normal enum variant. M5: fix BootloaderEnv doc — describes the type, not classify_bootloader. M6: note that BootMode::detect returns Normal for both available and degraded bootloader states. M7: document skip_serializing_if rationale on degraded_boot field. Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
If bl.set_env(ResizedData) fails after resize2fs + sync succeeded, propagating a BootloaderError would make it impossible to distinguish 'resize did not run' from 'resize ran but guard failed'. The resize is idempotent, so a failed guard write is non-fatal: log a warn and let the boot continue. Resize will repeat on next boot. Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
- Add degraded boot mode to implemented functionality overview - Add test-utils to feature table with 'never in production' note - Fix release-image description to include degraded-boot behaviour - Expand testing section to all 14 matrix combinations with test-utils Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
…Continue BootloaderEnv::is_degraded() is already the single source of truth for whether the env is degraded. The second bool field in Continue(env, bool) was always equal to env.is_degraded() and was discarded with _ in every production call site, enabling logically contradictory constructions. Remove the bool field entirely. All call sites updated. Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
JanZachmann
left a comment
There was a problem hiding this comment.
Re-review of PR #9 (head 5aff395)
Verified all 19 prior findings against the new code at 5aff395. PR builds clean and 145 unit tests + 4 integration tests + 12 device-detection tests all pass under grub,gpt,resize-data,release-image,test-utils.
Status of prior findings
| Original | Status | Verification |
|---|---|---|
| C1 — Cargo.toml test-utils gating | ✅ Fixed | Spec §6.5 amended (4f83ae4); cargo test --features grub,gpt,test-utils now runs 4 integration tests |
C2 — BootloaderError silently dropped |
✅ Fixed | apply_bootloader_decision logs warn!("… {e} …") when env is Degraded (aefc2e2) |
| C3 — silent-skip bug untested | 🟡 Mostly fixed | write_resize_guard extracted with 3 tests (3d7fe98); write_guard_some_bootloader_sets_env strongly verifies the happy path. See new finding N3 below. |
| C4 — spec §6.3 violation | ✅ Fixed | degraded_env_with_data_layout_guard_not_written added (5654c16); spec §6.3 amended |
| I5 — fsck log files lost | ✅ Fixed | persist_fsck_results signature is now Option<&mut dyn Bootloader> (7c88cc5); log-file write proceeds regardless of bootloader availability |
| I6 — reboot loop risk | 🟡 Deferred to follow-up issue (acknowledged) | |
I7 — #[source] cause unverified |
✅ Fixed | err_debug_image_is_abort_with_cause now destructures and asserts CommandFailed { command: "grub-editenv", .. } (fff0eb4) |
| I8 — precedence not unit-testable | ✅ Fixed | apply_bootloader_decision extracted with 4 precedence tests (6f0a00d) |
I9 — persist_fsck_results skip not asserted |
✅ Fixed | test_persist_none_bootloader_skips_save_fsck_status added (61d9b60) |
| I10 — update-validation cascade unlogged | ✅ Fixed | warn!("update validation skipped — bootloader unavailable; any in-flight A/B update will roll back on timer expiry") (66cb59a) |
| M1–M9 (all minor) | ✅ Fixed | 0a2e458, 2b2d859 |
Bonus refactor
5aff395 removes the redundant bool from BootloaderDecision::Continue(env, bool). Good simplification — eliminates the possibility of Continue(Available(_), true) (logically contradictory).
New findings introduced in the rework
N1 (Important) — persist-before-propagate ordering regression on uboot. Reproduced with a failing test. See inline comment on src/lib.rs:103.
N2 (Minor) — spec §3.3 and §3.4 are stale w.r.t. the bool removal and apply_bootloader_decision extraction. Inline comments on the spec.
N3 (Minor) — one of the new resize-guard tests is tautological. Inline comment on src/filesystem/resize_data.rs:325.
Overall: a large net improvement. N1 is the only finding worth fixing before merge (observability of boot fsck diagnostics on uboot+FsckRequiresReboot); the others are spec/test cleanup.
JanZachmann
left a comment
There was a problem hiding this comment.
Re-review of PR #9 (head 5aff395)
Verified all 19 prior findings against the new code at 5aff395. PR builds clean and 145 unit tests + 4 integration tests + 12 device-detection tests all pass under grub,gpt,resize-data,release-image,test-utils.
Status of prior findings
| Original | Status | Verification |
|---|---|---|
| C1 — Cargo.toml test-utils gating | ✅ Fixed | Spec §6.5 amended (4f83ae4); cargo test --features grub,gpt,test-utils now runs 4 integration tests |
C2 — BootloaderError silently dropped |
✅ Fixed | apply_bootloader_decision logs warn!("… {e} …") when env is Degraded (aefc2e2) |
| C3 — silent-skip bug untested | 🟡 Mostly fixed | write_resize_guard extracted with 3 tests (3d7fe98); write_guard_some_bootloader_sets_env strongly verifies the happy path. See new finding N3 below. |
| C4 — spec §6.3 violation | ✅ Fixed | degraded_env_with_data_layout_guard_not_written added (5654c16); spec §6.3 amended |
| I5 — fsck log files lost | ✅ Fixed | persist_fsck_results signature is now Option<&mut dyn Bootloader> (7c88cc5); log-file write proceeds regardless of bootloader availability |
| I6 — reboot loop risk | 🟡 Deferred to follow-up issue (acknowledged) | |
I7 — #[source] cause unverified |
✅ Fixed | err_debug_image_is_abort_with_cause now destructures and asserts CommandFailed { command: "grub-editenv", .. } (fff0eb4) |
| I8 — precedence not unit-testable | ✅ Fixed | apply_bootloader_decision extracted with 4 precedence tests (6f0a00d) |
I9 — persist_fsck_results skip not asserted |
✅ Fixed | test_persist_none_bootloader_skips_save_fsck_status added (61d9b60) |
| I10 — update-validation cascade unlogged | ✅ Fixed | warn!("update validation skipped — bootloader unavailable; any in-flight A/B update will roll back on timer expiry") (66cb59a) |
| M1–M9 (all minor) | ✅ Fixed | 0a2e458, 2b2d859 |
Bonus refactor
5aff395 removes the redundant bool from BootloaderDecision::Continue(env, bool). Good simplification — eliminates the possibility of Continue(Available(_), true) (logically contradictory).
New findings introduced in the rework
N1 (Important) — persist-before-propagate ordering regression on uboot. Reproduced with a failing test. See inline comment on src/lib.rs:103.
N2 (Minor) — spec §3.3 and §3.4 are stale w.r.t. the bool removal and apply_bootloader_decision extraction. Inline comments on the spec.
N3 (Minor) — one of the new resize-guard tests is tautological. Inline comment on src/filesystem/resize_data.rs:325.
Overall: a large net improvement. N1 is the only finding worth fixing before merge (observability of boot fsck diagnostics on uboot+FsckRequiresReboot); the others are spec/test cleanup.
|
Note: the previous re-review was posted twice by accident (reviews 4336722419 and 4336723341 are identical). Please read either one; GitHub does not allow deleting submitted reviews. Apologies for the noise. |
On uboot, UBootBootloader::new() is infallible so open_bootloader_env() always succeeds and env is Available even when boot fsck fails. The I4 refactor (apply_bootloader_decision) introduced a regression: core_result? propagated inside the function before persist_fsck_results was called, so the boot fsck diagnostic was silently lost across the reboot on uboot. Fix: move persist_fsck_results and ods_status.fsck.clear() into the Continue arm of apply_bootloader_decision, before core_result?. This satisfies the contract documented in mount_core_partitions (boot_sequence.rs) for both GRUB (persist is a no-op because env is Degraded) and uboot (env is Available, diagnostic is now saved before propagation). Add persist_runs_before_fsck_reboot_propagates regression test asserting ods.fsck is empty after the dispatcher returns with FsckRequiresReboot, proving the persist ran before propagation. Update stale boot_sequence.rs docstring to point at apply_bootloader_decision as the contract owner. Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
…_set_env The test created bl but passed None to write_resize_guard, so bl was never reachable from inside the function. The assertion bl.get_env(...) is_none() was trivially true regardless of any bug. The contract is fully covered by write_guard_none_bootloader_is_noop (None returns Ok) and write_guard_some_bootloader_sets_env (Some sets the env key). Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
§3.3: Remove redundant bool field from BootloaderDecision::Continue (removed in 5aff395). Update behaviour table to reference is_degraded() as the single source of truth. §3.4: Update boot flow diagrams to show apply_bootloader_decision with persist-before-propagate ordering (fixed in 9b656e9). Add note on the uboot-specific reason persist must precede core_result? propagation. Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
|
Re-checked at
Verified: Threads N1, N2a, N2b, N3 resolved. Only open thread is I6 (degraded-resize reboot loop, deferred by author). |
The Bootloader trait and related types were named after the hardware component (GRUB/U-Boot) rather than what they actually represent: access to the bootloader environment variables. This caused ambiguity between the bootloader software and the env-access abstraction. Rename map: Bootloader -> BootEnv BootloaderEnv -> BootEnvState BootloaderDecision -> BootEnvDecision BootloaderEnvKey -> BootEnvKey BootloaderError -> BootEnvError GrubBootloader -> GrubBootEnv UBootBootloader -> UBootBootEnv MockBootloader -> MockBootEnv open_bootloader_env() -> open_boot_env() classify_bootloader() -> classify_boot_env() variable bootloader -> boot_env The src/bootloader/ module path is unchanged. Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
Update struct/module doc comments in grub.rs, uboot.rs and mod.rs that
still referred to GRUB/U-Boot implementations as "bootloader" rather
than "boot environment". Keeps industry-standard terms ("bootloader
environment variables", "bootloader env") unchanged.
Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
Update README.md, project-context.md, docs/error-handling.md and docs/first-boot-detection-concept.md to reflect the renamed types: Bootloader -> BootEnv BootloaderEnv -> BootEnvState BootloaderError -> BootEnvError MockBootloader -> MockBootEnv UBootBootloader -> UBootBootEnv open_bootloader_env() -> open_boot_env() classify_bootloader() -> classify_boot_env() Historical docs in docs/superpowers/ (specs, plans, reviews) are not updated — they are point-in-time design artefacts. Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
…ision Consistent with the BootEnv naming scheme introduced in the previous commits. Historical docs in docs/superpowers/ are not updated. Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
'BootEnv environment' is tautological (Env = environment). Replace with 'Boot environment' in comments, doc strings, and log messages. Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
… doc Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
Replace references to 'bootloader' that describe BootEnv trait objects or BootEnvState with 'boot env'. Keeps legitimate hardware references (GRUB, U-Boot, 'bootloader environment' as industry term) unchanged. Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
Signed-off-by: Joerg Zeidler <62105035+JoergZeidler@users.noreply.github.com>
Summary
BootEnvStateenum replacingOption<Box<dyn BootEnv>>throughout,with variants
Available(Box<dyn BootEnv>)andDegraded(BootEnvError)classify_boot_env()as the single decision point for degraded boot:Abort→ debug shell viahandle_fatal_error()Continue(Degraded)→ boot proceedsdegraded_boot: booltoOdsStatus(written to/run/omnect-device-service/omnect-os-initramfs.json), signalling the runningOS that it booted degraded
resize-datapreflight now runs without a boot env guard in degraded mode —first-boot resize is no longer silently skipped when the boot env is unavailable
InitramfsError::DegradedBooterror varianttest-utilsfeature exposingMockBootEnvfor integration testsBootloader*types toBootEnv*throughout(
BootEnv,BootEnvState,BootEnvDecision,BootEnvKey,BootEnvError,GrubBootEnv,UBootBootEnv,MockBootEnv,open_boot_env(),classify_boot_env(),apply_boot_env_decision())Reason
On first-boot,
open_boot_env()can fail (e.g. GRUB boot partition not yetmounted). Previously this silently skipped all boot-env-guarded operations,
including
resize-data, with no signal to the OS. This change makes the failureexplicit and handles it correctly per image type.
The
Bootloader*naming described the hardware component rather than theoperation (environment access), causing confusion.
BootEnv*names theabstraction by what it does.
Verification
All 14 feature combinations pass:
{grub, uboot} × {gpt, dos} × {∅, resize-data} × {∅, release-image, test-utils}cargo clippy --tests -- -D warnings— cleancargo fmt -- --check— cleancargo audit— no vulnerabilities