Skip to content

Centralize cache delete-and-push mechanism to one place#1645

Open
coreyjadams wants to merge 4 commits into
mainfrom
fix-testmon-db-cache
Open

Centralize cache delete-and-push mechanism to one place#1645
coreyjadams wants to merge 4 commits into
mainfrom
fix-testmon-db-cache

Conversation

@coreyjadams
Copy link
Copy Markdown
Collaborator

PhysicsNeMo Pull Request

Description

Checklist

Dependencies

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 13, 2026

Greptile Summary

This PR centralises the GitHub Actions delete-before-save cache pattern into a new reusable replace-cache composite action, eliminating ~100 lines of duplicated shell logic across the nightly workflow. It simultaneously fixes a long-standing stale-cache bug: testmon and coverage caches previously used hashFiles('uv.lock', 'pyproject.toml') keys that collided on consecutive nightlies with an unchanged lockfile, silently leaving stale data in place for days.

  • New replace-cache action: encapsulates delete → save → verify for any mutable -latest slot; callers supply their own if: gate and github-token.
  • Nightly workflow: four mutable-slot saves (uv, JIT, testmon, coverage) now all go through replace-cache; testmon and coverage keys migrated from hash-suffix to -latest.
  • PR workflow: restore steps updated to the new -latest key; restore-keys prefix fallback removed intentionally (fail-open semantics preserved, testmon handles stale DBs gracefully).
  • Dependency pinning: CI-only test deps in setup-uv-env tightened from >= to == to stabilise the testmon DB environment fingerprint and prevent spurious full-suite re-runs on PRs.

Important Files Changed

Filename Overview
.github/actions/replace-cache/action.yml New composite action encapsulating delete-before-save for mutable -latest cache slots; verify step retries 5×5 s which may be tight under heavy GitHub API load
.github/workflows/github-nightly-uv.yml Replaces three separate inline delete/save/verify blocks with single replace-cache invocations; testmon and coverage keys migrated from hash-suffix to -latest; no logic regressions
.github/workflows/github-pr.yml PR restore steps updated to match new -latest keys; restore-keys prefix fallback removed intentionally (fail-open design)
.github/actions/setup-uv-env/action.yml CI-only test deps pinned with == to stabilise testmon DB environment fingerprint; transitive churn acknowledged in comments
.github/CACHE_CONTRACT.md Documentation updated to cover testmon and coverage cache contracts, -latest mutable-slot rationale, and the replace-cache building block

Reviews (1): Last reviewed commit: "Merge branch 'main' into fix-testmon-db-..." | Re-trigger Greptile

Comment on lines +69 to +88
run: |
set -euo pipefail
if ! command -v gh >/dev/null 2>&1; then
echo "::error::gh CLI not on PATH; cannot manage ${CACHE_DESC} slot."
exit 1
fi
# Use --json key + --jq for robust matching (no false positives
# on prefix overlap from sibling cache keys).
existing="$(gh cache list \
--repo "$REPO" \
--key "$CACHE_KEY" \
--json key \
--jq '.[].key' \
| grep -Fx "$CACHE_KEY" || true)"
if [ -n "$existing" ]; then
gh cache delete "$CACHE_KEY" --repo "$REPO"
echo "deleted stale ${CACHE_DESC}: $CACHE_KEY"
else
echo "no existing ${CACHE_DESC} to delete: $CACHE_KEY"
fi
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 TOCTOU: gh cache delete can fail if the slot disappears between list and delete

set -euo pipefail is active, so if another process deletes the entry between the gh cache list check and the gh cache delete call, the delete returns a non-zero exit code and the entire step fails. In practice this is prevented by the concurrency: nightly-github-uv group, but a manual re-run triggered while a nightly is still in its save window could hit this. Adding || true to the gh cache delete line would make the step idempotent without masking real gh CLI errors (the save + verify steps still catch the important failure class).

Comment on lines +112 to +120
for attempt in 1 2 3 4 5; do
if gh cache list --repo "$REPO" --key "$CACHE_KEY" --json key --jq '.[].key' \
| grep -Fxq "$CACHE_KEY"; then
echo "${CACHE_DESC} present: $CACHE_KEY"
exit 0
fi
echo "attempt $attempt: ${CACHE_DESC} not yet visible, sleeping..."
sleep 5
done
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Verify retry window may be too short under load

The loop retries 5 times with 5-second sleeps (25 s total). GitHub's cache index is eventually-consistent and under heavy repository load the propagation delay can exceed 25 s, causing a spurious hard failure immediately after a successful save. Consider increasing the sleep interval (e.g. 10 s per attempt → 50 s total) or the attempt count to give the index more time to settle, especially since the action is intended to fail loudly on genuine misses.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, recursive retry with exponential backoff is probably mildly better here

Comment thread .github/actions/setup-uv-env/action.yml Outdated
Comment on lines +237 to +243
"moto[s3]==5.2.1" \
"numpy-stl==3.2.0" \
"scikit-image==0.26.0" \
"shapely==2.1.2" \
"multi-storage-client[boto3]==0.48.0" \
"tensorstore==0.1.83" \
"pyarrow==24.0.0"
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NickGeneva @laserkelvin and also @peterdsharpe there has been a little discussion about pinning here, vs. pinning in pyproject.toml. Summarizing some pros and cons.

Why pin? If we don't pin, and one of these updates, the nightly build will get out of sync with the PR venv and it will trigger a rebuild of the environment (slow on the PR on the GPU nodes) and trigger ALL tests to run (also slow) because the testmon DB requires the venv to match. So, pinning is a good idea IMO.

We can pin here, and that is nice because it's not disruptive to pyproject.toml, and can control our CI system independently. We already have that in blossom since we run in a container, and the installed packages are not necessarily aligned with what's in uv.lock. I contend that is OK. On the other hand, we might want to be able to control the CI env tightly against the uv.lock file for some reason?

We can pin in pyproject.toml by creating all of these deps in ci-deps development group with specified numbers. That's an update to pyproject.toml (no big deal) and extra lock resolution (no big deal) but any changes to CI env will have to go through that instead. And changes to pyproject.toml are meant, deliberately, to invalidate the testmon db and trigger all tests to rerun, for what it's worth (and I like that design, updating pyproject.toml in physicsnemo should be painful).

A middle ground might be to put these in a ci-requirements.txt or similar that is contained?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good analysis!

I agree that pinning overall is a good idea.

As for how to implement it, I think all three strategies are viable (pyproject.toml, here, or pulled out into a ci-requirements.txt) and I'd approve of any of them. I've been mulling over all three options for the past ~5 mins in my head and struggle to come up with any truly airtight ideas for why one is better than the others.

@coreyjadams coreyjadams requested a review from peterdsharpe May 14, 2026 14:41
@coreyjadams
Copy link
Copy Markdown
Collaborator Author

This PR effectively is doing two things:

  1. Consolidate the logic of delete-then-upload to refresh immutable caches for the various caching mechanisms I've set up to accelerate our CI. Since there are now several, it made sense to turn it into a custom action that can be reused.

  2. Pin CI dependencies to specific versions.

If needed I'll split these up.

Copy link
Copy Markdown
Collaborator

@peterdsharpe peterdsharpe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Interesting discussion about how to implement version-pinning; I think any of the three presented options are perfectly fine (including as-is).

set -euo pipefail
if ! command -v gh >/dev/null 2>&1; then
echo "::error::gh CLI not on PATH; cannot manage ${CACHE_DESC} slot."
exit 1
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The guard is probably not necessary if you have -euo pipefail already, but this also doesn't hurt

Comment on lines +112 to +120
for attempt in 1 2 3 4 5; do
if gh cache list --repo "$REPO" --key "$CACHE_KEY" --json key --jq '.[].key' \
| grep -Fxq "$CACHE_KEY"; then
echo "${CACHE_DESC} present: $CACHE_KEY"
exit 0
fi
echo "attempt $attempt: ${CACHE_DESC} not yet visible, sleeping..."
sleep 5
done
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, recursive retry with exponential backoff is probably mildly better here

Comment thread .github/actions/setup-uv-env/action.yml Outdated
Comment on lines +237 to +243
"moto[s3]==5.2.1" \
"numpy-stl==3.2.0" \
"scikit-image==0.26.0" \
"shapely==2.1.2" \
"multi-storage-client[boto3]==0.48.0" \
"tensorstore==0.1.83" \
"pyarrow==24.0.0"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good analysis!

I agree that pinning overall is a good idea.

As for how to implement it, I think all three strategies are viable (pyproject.toml, here, or pulled out into a ci-requirements.txt) and I'd approve of any of them. I've been mulling over all three options for the past ~5 mins in my head and struggle to come up with any truly airtight ideas for why one is better than the others.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants