libpcp: recover from corrupted archives in multi-archive contexts#2586
libpcp: recover from corrupted archives in multi-archive contexts#2586kurik wants to merge 1 commit into
Conversation
|
@natoscott , @kmcdonell May I ask you for a review, please? |
| "__pmLogRead: corrupted archive \"%s\" vol %d, attempting to skip", | ||
| acp->ac_log->name, acp->ac_curvol); | ||
| if (acp->ac_cur_log >= 0 && acp->ac_cur_log < acp->ac_num_logs) | ||
| acp->ac_log_list[acp->ac_cur_log]->corrupted = 1; |
There was a problem hiding this comment.
@kurik I wonder if it would be simpler (and safer) to remove the corrupt archive from the list, instead of keeping it around flagging it as bad? If we keep it around there's always a chance it might be used somewhere accidentally when it should not be, whereas if we drop it completely from the list we cannot really go wrong can we?
|
[babble from slack] If they are seeing corruption that is not of the "truncation" form I'd be very interested to know the details of the corruption. |
ebae07d to
989ad25
Compare
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Plus Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (4)
🚧 Files skipped from review as they are similar to previous changes (4)
📝 WalkthroughWalkthroughThis PR makes multi-archive log handling resilient: init skips unreadable archives with warnings and can rebuild shared state, read removes corrupted archives and retries, fetch treats corruption non-fatally when multiple archives exist, and QA normalizes stderr output for stable tests. ChangesMulti-archive corruption recovery
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/libpcp/src/context.c`:
- Around line 947-957: The current code in context.c that calls
__pmFindOrOpenArchive(ctxp, current, multi_arch) treats every negative sts as a
recoverable "skip and continue" when multi_arch is true; instead, only treat
truly corrupt/unreadable errors as recoverable and propagate
semantic/incompatibility errors. Change the multi_arch branch to inspect sts and
only call pmNotifyErr(... "skipping corrupt/unreadable archive ...") and
continue when sts matches the specific recoverable error codes (e.g., corrupt or
I/O related codes returned by __pmFindOrOpenArchive); for all other negative sts
(label/host/schema mismatches or other semantic incompatibilities) log an error
and return sts (or otherwise fail the open) so the caller sees the failure.
Update the block around __pmFindOrOpenArchive, sts, multi_arch, current, end,
pmNotifyErr and pmErrStr to implement this conditional handling and propagate
non-recoverable errors instead of silently skipping them.
In `@src/libpcp/src/logutil.c`:
- Around line 1994-2003: The retry path inside the PM_MODE_FORW branch fails to
refresh the archive file offset after swapping archives: when
__pmLogChangeArchive(ctxp, corrupt_idx) succeeds we update lcp and f but leave
acp->ac_offset pointing to the removed archive; modify the branch that sets lcp
= acp->ac_log and f = acp->ac_mfp to also reset acp->ac_offset =
__pmLogLabelSize(lcp) (matching the other archive-switch paths) before setting
acp->ac_vol = acp->ac_curvol and goto again so subsequent seeks/read recovery
use the correct offset for the new file.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: a19c6ec9-8e44-41f7-8a44-15f2e38fe877
⛔ Files ignored due to path filters (2)
qa/1671.outis excluded by!**/*.outqa/722.outis excluded by!**/*.out
📒 Files selected for processing (5)
qa/1671qa/722src/libpcp/src/context.csrc/libpcp/src/interp.csrc/libpcp/src/logutil.c
ad73707 to
9e214bd
Compare
When reading multi-archive PCP logs a corrupted volume no longer terminates the entire replay. Instead, corrupted archives are removed from ac_log_list, so that data from healthy volumes can still be read. Tests qa/1671 and qa/722 are modified accordingly to reflect this change.
9e214bd to
20605cc
Compare
When reading multi-archive PCP logs (e.g. Cockpit metrics history), a corrupted volume no longer terminates the entire replay. Instead, corrupted archives are flagged and skipped so that data from healthy volumes can still be read.
Resolves: RHEL-169855
Summary by CodeRabbit
Bug Fixes
Tests