fix: use created_at as canonical anchor for stale-claim detection #14

Closed
jolovicdev wants to merge 1 commit into master from test/normalize-stale-detection

@jolovicdev
Owner

Summary

Switches stale-claim detection from claimed_at to created_at so that long-pending tasks cannot hide behind a recent heartbeat.

Problem

_is_stale_claim previously checked datetime.now(UTC) - commit.claimed_at > ttl. This meant that a task which was created a long time ago but recently reclaimed (updating claimed_at) would appear fresh, even though the overall task lifetime already exceeded the reclaim window.

In production this can happen when:

  1. Worker A creates a commit but crashes before finishing.
  2. Worker B reclaims the stale claim after running_ttl, updating claimed_at.
  3. Worker B also crashes.
  4. Worker C now sees claimed_at as recent and waits up to running_ttl again, even though the task was originally created much earlier.
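
The sequence above can be replayed with plain timestamps. This is a minimal sketch, not the project's code: the `Commit` dataclass is a hypothetical stand-in for the real model, the two helper functions are illustrative names, and the 300-second `running_ttl` is an assumed value chosen only to make the arithmetic concrete.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

UTC = timezone.utc


@dataclass
class Commit:
    # Hypothetical stand-in for the real model; only the two timestamps matter here.
    created_at: datetime
    claimed_at: datetime


def stale_by_claimed_at(commit: Commit, ttl: timedelta, now: datetime) -> bool:
    # Previous behavior described in this PR: anchor on the most recent claim.
    return now - commit.claimed_at > ttl


def stale_by_created_at(commit: Commit, ttl: timedelta, now: datetime) -> bool:
    # Behavior proposed in this PR: anchor on the creation time.
    return now - commit.created_at > ttl


running_ttl = timedelta(seconds=300)  # assumed value, not taken from the project
t0 = datetime(2026, 1, 1, tzinfo=UTC)

# Step 1: worker A creates the commit, then crashes.
commit = Commit(created_at=t0, claimed_at=t0)
# Step 2: worker B reclaims after running_ttl, which updates claimed_at.
commit.claimed_at = t0 + timedelta(seconds=301)
# Steps 3 and 4: worker B crashes; worker C inspects the commit 5 seconds later.
now = t0 + timedelta(seconds=306)

claimed_anchor_result = stale_by_claimed_at(commit, running_ttl, now)
created_anchor_result = stale_by_created_at(commit, running_ttl, now)
```

With these numbers the claim is 5 seconds old but the commit is 306 seconds old, so the `claimed_at` anchor reports the claim as fresh while the `created_at` anchor reports it as stale.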

Changes

  • src/cashet/async_executor.py: _is_stale_claim now uses created_at as the canonical anchor.
  • src/cashet/store.py: find_running_by_fingerprint orders by created_at DESC to stay consistent with the new stale-detection logic.
  • src/cashet/models.py: Documented created_at as the immutable lifetime anchor.
  • tests/test_store.py: Added test_old_created_at_trumps_recent_claimed_at.
  • tests/test_async_client.py: Added async counterpart test_old_created_at_causes_reclaim_despite_fresh_claim.
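
The effect of ordering by created_at DESC can be illustrated with an in-memory SQLite table. Everything here is an assumption made for the example: the table layout, the column names, the `order_column` parameter, and the use of SQLite itself are hypothetical and are not taken from src/cashet/store.py. The `claimed_at` ordering is included only for contrast, not as a statement about what the query did before this PR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE commits ("
    " commit_hash TEXT, fingerprint TEXT, status TEXT,"
    " created_at TEXT, claimed_at TEXT)"
)
# Two running commits share a fingerprint. "old" was created first but
# reclaimed later, so its claimed_at is the more recent of the two.
conn.executemany(
    "INSERT INTO commits VALUES (?, ?, ?, ?, ?)",
    [
        ("old", "fp1", "running",
         "2026-01-01T00:00:00+00:00", "2026-01-01T00:10:00+00:00"),
        ("new", "fp1", "running",
         "2026-01-01T00:05:00+00:00", "2026-01-01T00:05:00+00:00"),
    ],
)


def find_running_by_fingerprint(conn, fingerprint, order_column):
    # Hypothetical helper. order_column is checked against an allow-list
    # before it is interpolated into the SQL text.
    if order_column not in ("created_at", "claimed_at"):
        raise ValueError(order_column)
    row = conn.execute(
        "SELECT commit_hash FROM commits "
        "WHERE fingerprint = ? AND status = 'running' "
        f"ORDER BY {order_column} DESC LIMIT 1",
        (fingerprint,),
    ).fetchone()
    return row[0] if row else None


by_created = find_running_by_fingerprint(conn, "fp1", "created_at")
by_claimed = find_running_by_fingerprint(conn, "fp1", "claimed_at")
```

In this sketch the two orderings pick different rows, which is why the lookup and the staleness check are easier to reason about when they share one anchor.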

Testing

All 296 existing tests pass (45 skipped for Redis). The two new tests verify that a commit with created_at=400s ago and claimed_at=5s ago is correctly reclaimed.

Tasks that sit in PENDING for a long time before being claimed could
previously hide behind a fresh claimed_at heartbeat indefinitely,
because _is_stale_claim only looked at claimed_at. A worker that
crashes after creating a commit but before claiming it would leave
the commit with created_at old and claimed_at recent (if another
worker reclaimed it), making it appear fresh when it is not.

Changes:
- _is_stale_claim now uses created_at, which never changes.
- find_running_by_fingerprint orders by created_at DESC for consistency.
- Tests verify that an old created_at triggers reclaim even when
  claimed_at was updated recently.
@jolovicdev
Owner Author

@ds-review

@ds-review ds-review Bot left a comment

PR Review

PR: fix: use created_at as canonical anchor for stale-claim detection

Important

Verdict: Request changes - 3 actionable findings, highest severity P0.

Findings

P0 Critical - src/cashet/async_executor.py:50

The change from commit.claimed_at to commit.created_at correctly ensures that stale-claim detection uses the task's immutable creation time. However, consider adding a safeguard for edge cases where created_at m...

Next step: Fix this inline, then reply @ds-review recheck in the thread after pushing.

P2 Medium - tests/test_store.py:302

The test test_old_created_at_trumps_recent_claimed_at correctly validates the new logic. It uses a synthetic commit with created_at 400s ago and claimed_at 5s ago, then expects the task to be reclaimed and retur...

Next step: Fix this inline, then reply @ds-review recheck in the thread after pushing.

P3 Low - src/cashet/models.py:67

The comment on created_at clarifies its role as the canonical anchor. Consider also adding a note that this field is set once at commit creation and should never be updated, to prevent accidental misuse.

Next step: Fix this inline, then reply @ds-review recheck in the thread after pushing.

What To Do Next

  • Address the highest-priority findings first.
  • Reply @ds-review recheck under a finding when you want this bot to verify the fix.

 def _is_stale_claim(commit: Commit, ttl: timedelta) -> bool:
-    return datetime.now(UTC) - commit.claimed_at > ttl
+    # Use created_at so that long-pending tasks (created but never successfully
+    # claimed) are still eligible for reclaim, rather than being stuck forever.

P0 Critical The change from commit.claimed_at to commit.created_at correctly ensures that stale-claim detection uses the task's immutable creation time. However, consider adding a safeguard for edge cases where created_at might be in the future (e.g., clock skew). While not present in current code, defensive checking could prevent unexpected behavior. This is a minor concern; the logic is otherwise sound.
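
One way such a safeguard might look is sketched below. Without any guard, a `created_at` in the future yields a negative age, which compares as not stale, so the commit stays protected for the TTL plus the amount of skew. The policy shown here, treating a timestamp that lies further in the future than a tolerance as stale, is an assumption made for illustration; neither the PR nor the review specifies it, and `is_stale_claim_guarded`, `max_skew`, and the 30-second default are hypothetical names and values.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

UTC = timezone.utc
MAX_SKEW = timedelta(seconds=30)  # hypothetical tolerance for clock skew


def is_stale_claim_guarded(
    created_at: datetime,
    ttl: timedelta,
    now: Optional[datetime] = None,
    max_skew: timedelta = MAX_SKEW,
) -> bool:
    # now is injectable so the check can be tested without the wall clock.
    if now is None:
        now = datetime.now(UTC)
    age = now - created_at
    if age < -max_skew:
        # created_at is implausibly far in the future. Under the policy
        # assumed here, reclaim instead of waiting for ttl plus the skew.
        return True
    return age > ttl


now = datetime(2026, 1, 1, 12, 0, tzinfo=UTC)
ttl = timedelta(seconds=300)

old_commit = is_stale_claim_guarded(now - timedelta(seconds=400), ttl, now)
small_skew = is_stale_claim_guarded(now + timedelta(seconds=10), ttl, now)
large_skew = is_stale_claim_guarded(now + timedelta(hours=1), ttl, now)
```

A small positive skew is tolerated and treated as fresh, while a timestamp an hour ahead is treated as stale. Logging and leaving the commit alone would be an equally defensible policy.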

Comment thread tests/test_store.py
        with pytest.raises(TaskError, match="TimeoutError"):
            client.submit(slow, _timeout=0.01)

    def test_old_created_at_trumps_recent_claimed_at(self, store_dir: Path) -> None:

P2 Medium The test test_old_created_at_trumps_recent_claimed_at correctly validates the new logic. It uses a synthetic commit with created_at 400s ago and claimed_at 5s ago, then expects the task to be reclaimed and return 42. This covers the critical scenario described in the PR. The async counterpart (test_async_client.py) mirrors it well.

Comment thread src/cashet/models.py
    status: TaskStatus = TaskStatus.PENDING
    # created_at is the canonical anchor for task lifetime; it never changes
    # and is used for stale-claim detection so that pending tasks cannot hide
    # behind a recent heartbeat.

P3 Low The comment on created_at clarifies its role as the canonical anchor. Consider also adding a note that this field is set once at commit creation and should never be updated, to prevent accidental misuse.
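
Beyond a comment, the "set once, never updated" rule could also be enforced at runtime. The following is a sketch under stated assumptions: it uses a plain dataclass named `Commit` with only two fields, which is a hypothetical simplification and may not match how src/cashet/models.py defines the model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Commit:
    # Hypothetical, reduced model: set once at commit creation, never updated.
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    # Updated on every claim or reclaim.
    claimed_at: Optional[datetime] = None

    def __setattr__(self, name, value):
        # Allow the first assignment made by __init__, reject later ones.
        if name == "created_at" and "created_at" in self.__dict__:
            raise AttributeError("created_at is set once at commit creation")
        super().__setattr__(name, value)


commit = Commit()
original_created_at = commit.created_at
commit.claimed_at = datetime.now(timezone.utc)  # allowed

try:
    commit.created_at = datetime.now(timezone.utc)
    rejected = False
except AttributeError:
    rejected = True
```

A fully frozen dataclass would not work in this sketch, because `claimed_at` still has to change on reclaim; guarding the single field keeps the rest of the model mutable.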

@jolovicdev
Owner Author

Closing in favor of a cleaner approach.

@jolovicdev jolovicdev closed this May 7, 2026
@jolovicdev jolovicdev deleted the test/normalize-stale-detection branch May 7, 2026 04:21