fix: use created_at as canonical anchor for stale-claim detection#14
fix: use created_at as canonical anchor for stale-claim detection#14jolovicdev wants to merge 1 commit into
Conversation
Tasks that sit in PENDING for a long time before being claimed could previously hide behind a fresh claimed_at heartbeat indefinitely, because _is_stale_claim only looked at claimed_at. A worker that crashes after creating a commit but before claiming it would leave the commit with created_at old and claimed_at recent (if another worker reclaimed it), making it appear fresh when it is not. Changes: - _is_stale_claim now uses created_at, which never changes. - find_running_by_fingerprint orders by created_at DESC for consistency. - Tests verify that an old created_at triggers reclaim even when claimed_at was updated recently.
|
@ds-review |
There was a problem hiding this comment.
PR Review
PR: fix: use created_at as canonical anchor for stale-claim detection
Important
Verdict: Request changes - 3 actionable findings, highest severity P0.
Findings
P0 Critical - src/cashet/async_executor.py:50
The change from commit.claimed_at to commit.created_at correctly ensures that stale-claim detection uses the task's immutable creation time. However, consider adding a safeguard for edge cases where created_at m...
Next step: Fix this inline, then reply @ds-review recheck in the thread after pushing.
P2 Medium - tests/test_store.py:302
The test test_old_created_at_trumps_recent_claimed_at correctly validates the new logic. It uses a synthetic commit with created_at 400s ago and claimed_at 5s ago, then expects the task to be reclaimed and retur...
Next step: Fix this inline, then reply @ds-review recheck in the thread after pushing.
P3 Low - src/cashet/models.py:67
The comment on created_at clarifies its role as the canonical anchor. Consider also adding a note that this field is set once at commit creation and should never be updated, to prevent accidental misuse.
Next step: Fix this inline, then reply @ds-review recheck in the thread after pushing.
What To Do Next
- Address the highest-priority findings first.
- Reply
@ds-review recheckunder a finding when you want this bot to verify the fix.
| def _is_stale_claim(commit: Commit, ttl: timedelta) -> bool: | ||
| return datetime.now(UTC) - commit.claimed_at > ttl | ||
| # Use created_at so that long-pending tasks (created but never successfully | ||
| # claimed) are still eligible for reclaim, rather than being stuck forever. |
There was a problem hiding this comment.
P0 Critical The change from commit.claimed_at to commit.created_at correctly ensures that stale-claim detection uses the task's immutable creation time. However, consider adding a safeguard for edge cases where created_at might be in the future (e.g., clock skew). While not present in current code, defensive checking could prevent unexpected behavior. This is a minor concern; the logic is otherwise sound.
| with pytest.raises(TaskError, match="TimeoutError"): | ||
| client.submit(slow, _timeout=0.01) | ||
|
|
||
| def test_old_created_at_trumps_recent_claimed_at(self, store_dir: Path) -> None: |
There was a problem hiding this comment.
P2 Medium The test test_old_created_at_trumps_recent_claimed_at correctly validates the new logic. It uses a synthetic commit with created_at 400s ago and claimed_at 5s ago, then expects the task to be reclaimed and return 42. This covers the critical scenario described in the PR. The async counterpart (test_async_client.py) mirrors it well.
| status: TaskStatus = TaskStatus.PENDING | ||
| # created_at is the canonical anchor for task lifetime; it never changes | ||
| # and is used for stale-claim detection so that pending tasks cannot hide | ||
| # behind a recent heartbeat. |
There was a problem hiding this comment.
P3 Low The comment on created_at clarifies its role as the canonical anchor. Consider also adding a note that this field is set once at commit creation and should never be updated, to prevent accidental misuse.
|
Closing in favor of a cleaner approach. |
Summary
Switches stale-claim detection from
claimed_attocreated_atso that long-pending tasks cannot hide behind a recent heartbeat.Problem
_is_stale_claimpreviously checkeddatetime.now(UTC) - commit.claimed_at > ttl. This meant that a task which was created a long time ago but recently reclaimed (updatingclaimed_at) would appear fresh, even though the overall task lifetime already exceeded the reclaim window.In production this can happen when:
running_ttl, updatingclaimed_at.claimed_atas recent and waits up torunning_ttlagain, even though the task was originally created much earlier.Changes
src/cashet/async_executor.py—_is_stale_claimnow usescreated_atas the canonical anchor.src/cashet/store.py—find_running_by_fingerprintorders bycreated_at DESCto stay consistent with the new stale-detection logic.src/cashet/models.py— Documentedcreated_atas the immutable lifetime anchor.tests/test_store.py— Addedtest_old_created_at_trumps_recent_claimed_at.tests/test_async_client.py— Added async counterparttest_old_created_at_causes_reclaim_despite_fresh_claim.Testing
All 296 existing tests pass (45 skipped for Redis). The two new tests verify that a commit with
created_at=400s agoandclaimed_at=5s agois correctly reclaimed.