Skip to content

libpcp: recover from corrupted archives in multi-archive contexts#2586

Open
kurik wants to merge 1 commit into
performancecopilot:mainfrom
kurik:corrupted-archive-recovery
Open

libpcp: recover from corrupted archives in multi-archive contexts#2586
kurik wants to merge 1 commit into
performancecopilot:mainfrom
kurik:corrupted-archive-recovery

Conversation

@kurik
Copy link
Copy Markdown
Contributor

@kurik kurik commented May 7, 2026

When reading multi-archive PCP logs (e.g. Cockpit metrics history), a corrupted volume no longer terminates the entire replay. Instead, corrupted archives are flagged and skipped so that data from healthy volumes can still be read.

  • Add 'corrupted' field to __pmMultiLogCtl to track damaged archives
  • Skip corrupt/unreadable archives during multi-archive context init with a warning rather than failing outright
  • Rebuild shared __pmLogCtl when lost due to failed archive opens
  • On PM_ERR_LOGREC in __pmLogRead, advance to the next (or previous) archive instead of returning an error in multi-archive mode
  • Skip known-corrupted archives in LogChangeToNextArchive() and LogChangeToPreviousArchive()
  • Update QA tests 722 and 1671 for new recovery behaviour

Resolves: RHEL-169855

Summary by CodeRabbit

  • Bug Fixes

    • Improved multi-archive resilience: corrupted or unreadable archives are now skipped with a warning so processing continues with remaining archives.
    • Archive-corruption diagnostics in multi-archive contexts downgraded from errors to warnings.
  • Tests

    • Normalized test/stderr output: added filtering to replace timestamps and process IDs for consistent test logs.

Review Change Stack

@kurik kurik requested review from kmcdonell and natoscott May 7, 2026 08:54
@kurik
Copy link
Copy Markdown
Contributor Author

kurik commented May 7, 2026

@natoscott , @kmcdonell May I ask you for a review, please?

Comment thread src/libpcp/src/logutil.c Outdated
"__pmLogRead: corrupted archive \"%s\" vol %d, attempting to skip",
acp->ac_log->name, acp->ac_curvol);
if (acp->ac_cur_log >= 0 && acp->ac_cur_log < acp->ac_num_logs)
acp->ac_log_list[acp->ac_cur_log]->corrupted = 1;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kurik I wonder if it would be simpler (and safer) to remove the corrupt archive from the list, instead of keeping it around flagging it as bad? If we keep it around there's always a chance it might be used somewhere accidentally when it should not be, whereas if we drop it completely from the list we cannot really go wrong can we?

@kmcdonell
Copy link
Copy Markdown
Member

[babble from slack]
@kurik More or less orthogonal to the fixes for libpcp, it may be worth (I'm not sure how, but particularly for RH Support) publicising that pmlogcheck -r or -R if you're brave, repairs the most common forms of archive corruption (the last record in one of the files is incomplete). Since I took my shell script hack and turned it into C code in pmlogcheck I've had 100% success in repairing corrupted archives and preserving as much data as is possible. This is something that happens infrequently, but regularly, in the QA Farm and allows pmlogger-daily to recover from corrupted archives (something that cannot be done in libpcp and something where we don't want to skip the corrupted archive if there is useful information up to the point of corruption). On reflection, I wonder if the pmlogcheck logic should be somehow incorporated into libpcp (some PCP_FOO_OPTION?) and this makes it not-so-orthogonal, so I'll add this comment to the PR so the thought bubble does not get lost.

If they are seeing corruption that is not of the "truncation" form I'd be very interested to know the details of the corruption.

@kurik kurik force-pushed the corrupted-archive-recovery branch from ebae07d to 989ad25 Compare May 19, 2026 06:57
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: df4d343d-fb8c-4fb3-b33e-cf30af2df500

📥 Commits

Reviewing files that changed from the base of the PR and between 9e214bd and 20605cc.

⛔ Files ignored due to path filters (1)
  • qa/1671.out is excluded by !**/*.out
📒 Files selected for processing (4)
  • qa/1671
  • src/libpcp/src/context.c
  • src/libpcp/src/interp.c
  • src/libpcp/src/logutil.c
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/libpcp/src/interp.c
  • qa/1671
  • src/libpcp/src/context.c
  • src/libpcp/src/logutil.c

📝 Walkthrough

Walkthrough

This PR makes multi-archive log handling resilient: init skips unreadable archives with warnings and can rebuild shared state, read removes corrupted archives and retries, fetch treats corruption non-fatally when multiple archives exist, and QA normalizes stderr output for stable tests.

Changes

Multi-archive corruption recovery

Layer / File(s) Summary
Multi-archive context initialization with resilience
src/libpcp/src/context.c
initarchive() detects and warns about individual archive open failures in multi-archive mode, skips failed archives by advancing past them in the comma-delimited list, errors only if no archives succeeded, and reconstructs shared log control if lost.
Corrupted archive removal and retry mechanism
src/libpcp/src/logutil.c
__pmLogRead_ctx() detects PM_ERR_LOGREC on corrupted archives when multiple archives exist, removes the corrupted entry (freeing its strings and control struct), compacts the list, updates indices/offsets, and retries reading by switching to the next or previous archive using a new read_retry target.
Fetch error handling for multi-archive contexts
src/libpcp/src/interp.c
__pmLogFetchInterp() downgrades corrupted-archive diagnostics from Error to Warning and only returns the error when the context has one or fewer logs, allowing continuation when multiple archives exist.
QA test infrastructure and stderr normalization
qa/1671
Adds _filter_warnings() shell helper to mask bracketed timestamps and numeric sequences in test stderr and updates the per-interval replay loop to emit filtered stderr (_filter_warnings <$tmp.err) instead of raw cat $tmp.err.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

A rabbit hops through archives worn,
Skips the bad, keeps the rest reborn,
Reads retry, the logs align,
Warnings masked, the tests refine,
Hooray—resilient tails and time! 🐰

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'libpcp: recover from corrupted archives in multi-archive contexts' directly and clearly describes the main objective of the pull request—enabling recovery from corrupted archives in multi-archive scenarios.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/libpcp/src/context.c`:
- Around line 947-957: The current code in context.c that calls
__pmFindOrOpenArchive(ctxp, current, multi_arch) treats every negative sts as a
recoverable "skip and continue" when multi_arch is true; instead, only treat
truly corrupt/unreadable errors as recoverable and propagate
semantic/incompatibility errors. Change the multi_arch branch to inspect sts and
only call pmNotifyErr(... "skipping corrupt/unreadable archive ...") and
continue when sts matches the specific recoverable error codes (e.g., corrupt or
I/O related codes returned by __pmFindOrOpenArchive); for all other negative sts
(label/host/schema mismatches or other semantic incompatibilities) log an error
and return sts (or otherwise fail the open) so the caller sees the failure.
Update the block around __pmFindOrOpenArchive, sts, multi_arch, current, end,
pmNotifyErr and pmErrStr to implement this conditional handling and propagate
non-recoverable errors instead of silently skipping them.

In `@src/libpcp/src/logutil.c`:
- Around line 1994-2003: The retry path inside the PM_MODE_FORW branch fails to
refresh the archive file offset after swapping archives: when
__pmLogChangeArchive(ctxp, corrupt_idx) succeeds we update lcp and f but leave
acp->ac_offset pointing to the removed archive; modify the branch that sets lcp
= acp->ac_log and f = acp->ac_mfp to also reset acp->ac_offset =
__pmLogLabelSize(lcp) (matching the other archive-switch paths) before setting
acp->ac_vol = acp->ac_curvol and goto again so subsequent seeks/read recovery
use the correct offset for the new file.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: a19c6ec9-8e44-41f7-8a44-15f2e38fe877

📥 Commits

Reviewing files that changed from the base of the PR and between 1f4bcc3 and 989ad25.

⛔ Files ignored due to path filters (2)
  • qa/1671.out is excluded by !**/*.out
  • qa/722.out is excluded by !**/*.out
📒 Files selected for processing (5)
  • qa/1671
  • qa/722
  • src/libpcp/src/context.c
  • src/libpcp/src/interp.c
  • src/libpcp/src/logutil.c

Comment thread src/libpcp/src/context.c
Comment thread src/libpcp/src/logutil.c
@kurik kurik force-pushed the corrupted-archive-recovery branch from ad73707 to 9e214bd Compare May 19, 2026 09:57
When reading multi-archive PCP logs a corrupted volume no longer
terminates the entire replay. Instead, corrupted archives are removed
from ac_log_list, so that data from healthy volumes can still be read.

Tests qa/1671 and qa/722 are modified accordingly to reflect this change.
@kurik kurik force-pushed the corrupted-archive-recovery branch from 9e214bd to 20605cc Compare May 19, 2026 15:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants