Centralize cache delete-and-push mechanism to one place by coreyjadams · Pull Request #1645 · NVIDIA/physicsnemo

coreyjadams · 2026-05-13T23:29:35Z

PhysicsNeMo Pull Request

Description

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
The CHANGELOG.md is up to date with these changes.
An issue is linked to this pull request.
If I am implementing a new model or modifying any existing model, I have followed the Models Implementation Coding Standards.

Dependencies

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

copy-pr-bot · 2026-05-13T23:29:38Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-05-13T23:33:04Z

Greptile Summary

This PR centralises the GitHub Actions delete-before-save cache pattern into a new reusable replace-cache composite action, eliminating ~100 lines of duplicated shell logic across the nightly workflow. It simultaneously fixes a long-standing stale-cache bug: testmon and coverage caches previously used hashFiles('uv.lock', 'pyproject.toml') keys that collided on consecutive nightlies with an unchanged lockfile, silently leaving stale data in place for days.

New replace-cache action: encapsulates delete → save → verify for any mutable -latest slot; callers supply their own if: gate and github-token.
Nightly workflow: four mutable-slot saves (uv, JIT, testmon, coverage) now all go through replace-cache; testmon and coverage keys migrated from hash-suffix to -latest.
PR workflow: restore steps updated to the new -latest key; restore-keys prefix fallback removed intentionally (fail-open semantics preserved, testmon handles stale DBs gracefully).
Dependency pinning: CI-only test deps in setup-uv-env tightened from >= to == to stabilise the testmon DB environment fingerprint and prevent spurious full-suite re-runs on PRs.
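A toy model may help show why delete-before-save matters for a fixed key. The sketch below is not the action's code: a temporary directory stands in for the cache service, and it assumes, as the summary describes, that saving to a key that already exists leaves the old entry in place. The names `cache_save` and `replace_cache` and the key `testmon-latest` are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy cache store: one file per key in a directory (an assumption, not GitHub's API).
# cache_save mimics an immutable slot: an existing key is never overwritten.
cache_save() {
  local store="$1" key="$2" value="$3"
  if [ -e "$store/$key" ]; then
    echo "save skipped, key exists: $key"
    return 0
  fi
  printf '%s\n' "$value" > "$store/$key"
  echo "saved: $key"
}

# replace_cache runs delete -> save -> verify against the toy store.
replace_cache() {
  local store="$1" key="$2" value="$3"
  rm -f "$store/$key"
  cache_save "$store" "$key" "$value"
  [ "$(cat "$store/$key")" = "$value" ]
}

store="$(mktemp -d)"
cache_save "$store" "testmon-latest" "night-1"
cache_save "$store" "testmon-latest" "night-2"    # skipped: slot still holds night-1
cat "$store/testmon-latest"                       # prints night-1
replace_cache "$store" "testmon-latest" "night-2"
cat "$store/testmon-latest"                       # prints night-2
rm -rf "$store"
```

The second plain save is the stale-cache case: it reports success but the slot keeps the first night's data until something deletes it.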

Important Files Changed

Filename	Overview
.github/actions/replace-cache/action.yml	New composite action encapsulating delete-before-save for mutable `-latest` cache slots; verify step retries 5×5 s which may be tight under heavy GitHub API load
.github/workflows/github-nightly-uv.yml	Replaces three separate inline delete/save/verify blocks with single replace-cache invocations; testmon and coverage keys migrated from hash-suffix to -latest; no logic regressions
.github/workflows/github-pr.yml	PR restore steps updated to match new -latest keys; restore-keys prefix fallback removed intentionally (fail-open design)
.github/actions/setup-uv-env/action.yml	CI-only test deps pinned with == to stabilise testmon DB environment fingerprint; transitive churn acknowledged in comments
.github/CACHE_CONTRACT.md	Documentation updated to cover testmon and coverage cache contracts, -latest mutable-slot rationale, and the replace-cache building block

Reviews (1): Last reviewed commit: "Merge branch 'main' into fix-testmon-db-..."

greptile-apps · 2026-05-13T23:33:08Z

+      run: |
+        set -euo pipefail
+        if ! command -v gh >/dev/null 2>&1; then
+          echo "::error::gh CLI not on PATH; cannot manage ${CACHE_DESC} slot."
+          exit 1
+        fi
+        # Use --json key + --jq for robust matching (no false positives
+        # on prefix overlap from sibling cache keys).
+        existing="$(gh cache list \
+          --repo "$REPO" \
+          --key "$CACHE_KEY" \
+          --json key \
+          --jq '.[].key' \
+          | grep -Fx "$CACHE_KEY" || true)"
+        if [ -n "$existing" ]; then
+          gh cache delete "$CACHE_KEY" --repo "$REPO"
+          echo "deleted stale ${CACHE_DESC}: $CACHE_KEY"
+        else
+          echo "no existing ${CACHE_DESC} to delete: $CACHE_KEY"
+        fi


TOCTOU: gh cache delete can fail if the slot disappears between list and delete

set -euo pipefail is active, so if another process deletes the entry between the gh cache list check and the gh cache delete call, the delete returns a non-zero exit code and the entire step fails. In practice this is prevented by the concurrency: nightly-github-uv group, but a manual re-run triggered while a nightly is still in its save window could hit this. Adding || true to the gh cache delete line would make the step idempotent. It would also swallow any other gh cache delete failure, but the save + verify steps still catch the important failure class.
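A slightly stricter alternative to a bare `|| true` is to re-list after a failed delete and only tolerate the failure when the key is really gone. This is a minimal sketch under that assumption, not the code in the PR; `delete_slot` is a hypothetical name, and the `gh` invocations reuse only the flags already shown in the hunk above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# delete_slot REPO KEY: delete a cache slot, tolerating "already gone".
delete_slot() {
  local repo="$1" key="$2" listing
  if gh cache delete "$key" --repo "$repo"; then
    echo "deleted: $key"
    return 0
  fi
  # Delete failed: re-list to tell "already gone" apart from a real failure.
  # If the listing itself fails, give up rather than guess.
  listing="$(gh cache list --repo "$repo" --key "$key" --json key --jq '.[].key')" \
    || return 1
  if printf '%s\n' "$listing" | grep -Fxq "$key"; then
    echo "::error::delete failed and $key is still present" >&2
    return 1
  fi
  echo "already gone: $key"
}
```

The cost is one extra API call on the failure path only; the happy path is unchanged.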

greptile-apps · 2026-05-13T23:33:08Z

+        for attempt in 1 2 3 4 5; do
+          if gh cache list --repo "$REPO" --key "$CACHE_KEY" --json key --jq '.[].key' \
+              | grep -Fxq "$CACHE_KEY"; then
+            echo "${CACHE_DESC} present: $CACHE_KEY"
+            exit 0
+          fi
+          echo "attempt $attempt: ${CACHE_DESC} not yet visible, sleeping..."
+          sleep 5
+        done


Verify retry window may be too short under load

The loop retries 5 times with 5-second sleeps (25 s total). GitHub's cache index is eventually-consistent and under heavy repository load the propagation delay can exceed 25 s, causing a spurious hard failure immediately after a successful save. Consider increasing the sleep interval (e.g. 10 s per attempt → 50 s total) or the attempt count to give the index more time to settle, especially since the action is intended to fail loudly on genuine misses.

coreyjadams · 2026-05-14T14:40:47Z

+          "moto[s3]==5.2.1" \
+          "numpy-stl==3.2.0" \
+          "scikit-image==0.26.0" \
+          "shapely==2.1.2" \
+          "multi-storage-client[boto3]==0.48.0" \
+          "tensorstore==0.1.83" \
+          "pyarrow==24.0.0"


@NickGeneva @laserkelvin and also @peterdsharpe there has been a little discussion about pinning here, vs. pinning in pyproject.toml. Summarizing some pros and cons.

Why pin? If we don't pin, and one of these updates, the nightly build will get out of sync with the PR venv and it will trigger a rebuild of the environment (slow on the PR on the GPU nodes) and trigger ALL tests to run (also slow) because the testmon DB requires the venv to match. So, pinning is a good idea IMO.

We can pin here, and that is nice because it's not disruptive to pyproject.toml, and we can control our CI system independently. We already have that in blossom since we run in a container, and the installed packages are not necessarily aligned with what's in uv.lock. I contend that is OK. On the other hand, we might want to be able to control the CI env tightly against the uv.lock file for some reason?

We can pin in pyproject.toml by putting all of these deps in a ci-deps development group with exact versions. That's an update to pyproject.toml (no big deal) and an extra lock resolution (no big deal), but any change to the CI env would then have to go through pyproject.toml. And changes to pyproject.toml are meant, deliberately, to invalidate the testmon DB and trigger all tests to rerun, for what it's worth (and I like that design: updating pyproject.toml in physicsnemo should be painful).

A middle ground might be to put these in a ci-requirements.txt or similar that is contained?
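For the file-based middle ground, a small check could keep a loose specifier from creeping back in. The sketch below assumes a file named ci-requirements.txt (the name floated above) holding the same pins as the diff; `unpinned_requirements` is a hypothetical helper, not something in the repository.

```shell
#!/usr/bin/env bash
set -euo pipefail

# unpinned_requirements FILE: print requirement lines that lack an exact `==` pin.
unpinned_requirements() {
  local file="$1"
  # Drop comments and blank lines, then keep anything without `==`.
  # `|| true` keeps "nothing found" from tripping `set -e`.
  grep -Ev '^[[:space:]]*(#|$)' "$file" | grep -v '==' || true
}

workdir="$(mktemp -d)"
cat > "$workdir/ci-requirements.txt" <<'EOF'
# CI-only test dependencies (versions copied from the diff above)
moto[s3]==5.2.1
numpy-stl==3.2.0
scikit-image==0.26.0
shapely==2.1.2
multi-storage-client[boto3]==0.48.0
tensorstore==0.1.83
pyarrow==24.0.0
EOF

loose="$(unpinned_requirements "$workdir/ci-requirements.txt")"
if [ -n "$loose" ]; then
  echo "::error::unpinned CI deps: $loose"
  exit 1
fi
echo "all CI deps pinned"
rm -rf "$workdir"
```

If the environment is built through uv's pip interface, such a file could be consumed with `uv pip install -r ci-requirements.txt`, which keeps the pins out of both pyproject.toml and the action definition.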

coreyjadams · 2026-05-14T14:43:32Z

This PR effectively is doing two things:

Consolidate the logic of delete-then-upload to refresh immutable caches for the various caching mechanisms I've set up to accelerate our CI. Since there are now several, it made sense to turn it into a custom action that can be reused.
Pin CI dependencies to specific versions.

If needed I'll split these up.

peterdsharpe

Looks good! Interesting discussion about how to implement version-pinning; I think any of the three presented options are perfectly fine (including as-is).

peterdsharpe · 2026-05-14T23:26:45Z

+        set -euo pipefail
+        if ! command -v gh >/dev/null 2>&1; then
+          echo "::error::gh CLI not on PATH; cannot manage ${CACHE_DESC} slot."
+          exit 1


The guard is probably not necessary if you have -euo pipefail already, but this also doesn't hurt
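One detail may still favour keeping the guard. In the list step shown earlier in the thread, the `gh` call sits inside a pipeline that ends in `|| true`, so a missing binary there would not trip `set -e`. The experiment below is only an illustration of that shell behaviour: `no_such_cli_for_demo` is a made-up command name standing in for an absent `gh`, and `lookup_without_guard` is hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# lookup_without_guard mirrors the shape of the list step, with a command
# that is assumed not to exist standing in for a missing gh.
lookup_without_guard() {
  local existing
  existing="$(no_such_cli_for_demo cache list 2>/dev/null \
    | grep -Fx "some-key" || true)"
  echo "existing='${existing}' status=$?"
}

lookup_without_guard   # prints: existing='' status=0
```

Without the explicit `command -v` check, that path would report "no existing ... to delete" and carry on, leaving the failure to surface later in the verify step.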

peterdsharpe · 2026-05-15T01:37:55Z

+        for attempt in 1 2 3 4 5; do
+          if gh cache list --repo "$REPO" --key "$CACHE_KEY" --json key --jq '.[].key' \
+              | grep -Fxq "$CACHE_KEY"; then
+            echo "${CACHE_DESC} present: $CACHE_KEY"
+            exit 0
+          fi
+          echo "attempt $attempt: ${CACHE_DESC} not yet visible, sleeping..."
+          sleep 5
+        done


Agree, recursive retry with exponential backoff is probably mildly better here
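A minimal sketch of what exponential backoff could look like here, written as a loop rather than recursion. `retry_with_backoff` is a hypothetical helper, not code from the PR; it doubles the delay after each failed attempt and sleeps only between attempts.

```shell
#!/usr/bin/env bash
set -euo pipefail

# retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS CMD [ARGS...]
# Runs CMD until it succeeds or MAX_ATTEMPTS is reached, doubling the delay each time.
retry_with_backoff() {
  local max_attempts="$1" delay="$2" attempt
  shift 2
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    if "$@"; then
      return 0
    fi
    if (( attempt < max_attempts )); then
      echo "attempt $attempt failed, sleeping ${delay}s..." >&2
      sleep "$delay"
      delay=$(( delay * 2 ))
    fi
  done
  return 1
}

# Hypothetical usage, where key_visible wraps the gh cache list | grep check:
#   retry_with_backoff 5 5 key_visible
```

With 5 attempts and a 5 s base delay the sleeps are 5, 10, 20 and 40 s, so the window grows from 25 s to 75 s while the common fast case still returns after the first check.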

peterdsharpe · 2026-05-15T01:42:48Z

+          "moto[s3]==5.2.1" \
+          "numpy-stl==3.2.0" \
+          "scikit-image==0.26.0" \
+          "shapely==2.1.2" \
+          "multi-storage-client[boto3]==0.48.0" \
+          "tensorstore==0.1.83" \
+          "pyarrow==24.0.0"


This is a good analysis!

I agree that pinning overall is a good idea.

As for how to implement it, I think all three strategies are viable (pyproject.toml, here, or pulled out into a ci-requirements.txt) and I'd approve of any of them. I've been mulling over all three options for the past ~5 mins in my head and struggle to come up with any truly airtight ideas for why one is better than the others.

Centralize cache delete-and-push mechanism to one place

945f471

coreyjadams requested review from NickGeneva and ktangsali as code owners May 13, 2026 23:29

Merge branch 'main' into fix-testmon-db-cache

a2f32e0

greptile-apps Bot reviewed May 13, 2026

coreyjadams commented May 14, 2026

coreyjadams requested a review from peterdsharpe May 14, 2026 14:41

peterdsharpe approved these changes May 15, 2026

coreyjadams added 2 commits May 15, 2026 14:23

Pin CI deps in a file, instead of the github action.

e510814

Use a localized reinstally for pyg.

fbed815

coreyjadams wants to merge 4 commits into main from fix-testmon-db-cache
