Centralize cache delete-and-push mechanism to one place #1645
coreyjadams wants to merge 4 commits into
Conversation
Greptile Summary: This PR centralizes the GitHub Actions delete-before-save cache pattern into a new reusable
Important Files Changed
Reviews (1): Last reviewed commit: "Merge branch 'main' into fix-testmon-db-..."
```yaml
run: |
  set -euo pipefail
  if ! command -v gh >/dev/null 2>&1; then
    echo "::error::gh CLI not on PATH; cannot manage ${CACHE_DESC} slot."
    exit 1
  fi
  # Use --json key + --jq for robust matching (no false positives
  # on prefix overlap from sibling cache keys).
  existing="$(gh cache list \
    --repo "$REPO" \
    --key "$CACHE_KEY" \
    --json key \
    --jq '.[].key' \
    | grep -Fx "$CACHE_KEY" || true)"
  if [ -n "$existing" ]; then
    gh cache delete "$CACHE_KEY" --repo "$REPO"
    echo "deleted stale ${CACHE_DESC}: $CACHE_KEY"
  else
    echo "no existing ${CACHE_DESC} to delete: $CACHE_KEY"
  fi
```
TOCTOU: `gh cache delete` can fail if the slot disappears between list and delete
set -euo pipefail is active, so if another process deletes the entry between the gh cache list check and the gh cache delete call, the delete returns a non-zero exit code and the entire step fails. In practice this is prevented by the concurrency: nightly-github-uv group, but a manual re-run triggered while a nightly is still in its save window could hit this. Adding || true to the gh cache delete line would make the step idempotent without masking real gh CLI errors (the save + verify steps still catch the important failure class).
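A minimal sketch of that fix, reusing the step's existing variables (`CACHE_KEY`, `REPO`, `CACHE_DESC`) and its existing `existing` check:

```shell
# Sketch only: the delete made idempotent, so a concurrent deletion between
# `gh cache list` and `gh cache delete` no longer fails the step under
# `set -euo pipefail`. The save + verify steps still catch real failures.
if [ -n "$existing" ]; then
  gh cache delete "$CACHE_KEY" --repo "$REPO" || true
  echo "deleted stale ${CACHE_DESC} (if still present): $CACHE_KEY"
else
  echo "no existing ${CACHE_DESC} to delete: $CACHE_KEY"
fi
```

The `|| true` only swallows the delete's exit code; a genuine `gh` misconfiguration still surfaces on the subsequent save and verify steps.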
```shell
for attempt in 1 2 3 4 5; do
  if gh cache list --repo "$REPO" --key "$CACHE_KEY" --json key --jq '.[].key' \
    | grep -Fxq "$CACHE_KEY"; then
    echo "${CACHE_DESC} present: $CACHE_KEY"
    exit 0
  fi
  echo "attempt $attempt: ${CACHE_DESC} not yet visible, sleeping..."
  sleep 5
done
```
Verify retry window may be too short under load
The loop retries 5 times with 5-second sleeps (25 s total). GitHub's cache index is eventually-consistent and under heavy repository load the propagation delay can exceed 25 s, causing a spurious hard failure immediately after a successful save. Consider increasing the sleep interval (e.g. 10 s per attempt → 50 s total) or the attempt count to give the index more time to settle, especially since the action is intended to fail loudly on genuine misses.
Agree, recursive retry with exponential backoff is probably mildly better here
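A sketch of the suggested backoff variant (same loop as above, assuming the same `REPO`/`CACHE_KEY`/`CACHE_DESC` environment; the doubling schedule is illustrative):

```shell
# Sketch only: verify loop with exponential backoff instead of a fixed 5 s
# sleep. Sleeps of 5+10+20+40+80 s give a ~155 s total window for the
# eventually-consistent cache index to settle, vs. 25 s in the original.
delay=5
for attempt in 1 2 3 4 5; do
  if gh cache list --repo "$REPO" --key "$CACHE_KEY" --json key --jq '.[].key' \
      | grep -Fxq "$CACHE_KEY"; then
    echo "${CACHE_DESC} present: $CACHE_KEY"
    exit 0
  fi
  echo "attempt $attempt: ${CACHE_DESC} not yet visible, sleeping ${delay}s..."
  sleep "$delay"
  delay=$((delay * 2))
done
echo "::error::${CACHE_DESC} did not appear after retries: $CACHE_KEY"
exit 1
```

This keeps the action's fail-loudly behavior on genuine misses while tolerating longer propagation delays.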
```shell
  "moto[s3]==5.2.1" \
  "numpy-stl==3.2.0" \
  "scikit-image==0.26.0" \
  "shapely==2.1.2" \
  "multi-storage-client[boto3]==0.48.0" \
  "tensorstore==0.1.83" \
  "pyarrow==24.0.0"
```
@NickGeneva @laserkelvin and also @peterdsharpe: there has been a little discussion about pinning here vs. pinning in pyproject.toml. Summarizing some pros and cons.
Why pin? If we don't pin and one of these packages updates, the nightly build will get out of sync with the PR venv; that triggers a rebuild of the environment (slow on the GPU nodes for PRs) and forces ALL tests to run (also slow), because the testmon DB requires the venv to match. So pinning is a good idea IMO.
We can pin here, which is nice because it isn't disruptive to pyproject.toml and lets us control our CI system independently. We already have that in blossom, since we run in a container and the installed packages are not necessarily aligned with what's in uv.lock. I contend that is OK. On the other hand, we might want to be able to control the CI env tightly against the uv.lock file for some reason?
We can pin in pyproject.toml by putting all of these deps in a ci-deps development group with pinned versions. That's an update to pyproject.toml (no big deal) and extra lock resolution (no big deal), but any changes to the CI env would then have to go through it. And changes to pyproject.toml are deliberately meant to invalidate the testmon DB and trigger all tests to rerun, for what it's worth (and I like that design; updating pyproject.toml in physicsnemo should be painful).
A middle ground might be to put these in a ci-requirements.txt or similar that is self-contained?
This is a good analysis!
I agree that pinning overall is a good idea.
As for how to implement it, I think all three strategies are viable (pyproject.toml, here, or pulled out into a ci-requirements.txt) and I'd approve of any of them. I've been mulling over all three options for the past ~5 mins in my head and struggle to come up with any truly airtight ideas for why one is better than the others.
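For concreteness, the pyproject.toml variant could look roughly like this sketch (a PEP 735 dependency group, which uv resolves into uv.lock; the group name `ci-deps` is illustrative, and the versions are copied from the workflow above):

```toml
# Sketch only: illustrative dependency group holding the CI-only pins.
[dependency-groups]
ci-deps = [
    "moto[s3]==5.2.1",
    "numpy-stl==3.2.0",
    "scikit-image==0.26.0",
    "shapely==2.1.2",
    "multi-storage-client[boto3]==0.48.0",
    "tensorstore==0.1.83",
    "pyarrow==24.0.0",
]
```

The CI job would then install with something like `uv sync --group ci-deps`, so any change to these pins routes through uv.lock and (deliberately) invalidates the testmon DB.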
This PR is effectively doing two things:
If needed I'll split these up.
peterdsharpe left a comment
Looks good! Interesting discussion about how to implement version-pinning; I think any of the three presented options are perfectly fine (including as-is).
```shell
set -euo pipefail
if ! command -v gh >/dev/null 2>&1; then
  echo "::error::gh CLI not on PATH; cannot manage ${CACHE_DESC} slot."
  exit 1
```
The guard is probably not necessary if you have -euo pipefail already, but this also doesn't hurt
PhysicsNeMo Pull Request
Description
Checklist
Dependencies
Review Process
All PRs are reviewed by the PhysicsNeMo team before merging.
Depending on which files are changed, GitHub may automatically assign a maintainer for review.
We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI's assessment of merge readiness; it is not a qualitative judgment of your work, nor an indication that the PR will be accepted or rejected.
AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.