fix(agent-server): don't bump explicit_interrupt_generation on no-op pause/interrupt#3557
fix(agent-server): don't bump explicit_interrupt_generation on no-op pause/interrupt#3557shanemort1982 wants to merge 3 commits into
Conversation
…pause/interrupt Fixes the stuck-conversation regression reported in OpenHands/OpenHands#14698. EventService.pause() and EventService.interrupt() previously bumped self._explicit_interrupt_generation unconditionally before delegating to LocalConversation. On a FINISHED conversation (or any status that is not RUNNING/IDLE), LocalConversation.pause() is a no-op, and LocalConversation.interrupt() falls back to self.pause() which is also a no-op. The bump left a phantom stop-intent that the next send_message(run=True) honoured at its post-persist generation guard (event_service.py L457), silently early-returning without calling run(). From the user's view, the follow-up message was persisted but the agent loop never restarted. The fix only bumps the counter when the call will actually change execution state. Same precedent as the existing internal_acp_rerun gate at L1034: an interrupt that does nothing should not bump the stop-intent counter. Intent-clearing (_rerun_requested, _acp_internal_rerun_requested) remains unconditional on external calls: the user explicitly asked to stop, even if there is nothing to stop, so any pending re-run intent is cleared. This preserves the 'explicit-stop-wins-over-pending-rerun' invariant covered by test_explicit_interrupt_clears_internal_acp_rerun_request. Regression coverage added in TestEventServiceNoOpPauseInterruptDoesNotStrandSendMessage: - pause/interrupt on FINISHED/PAUSED do NOT bump - pause/interrupt on RUNNING/IDLE still bump (no regression) - internal_acp_rerun=True still skips the bump (existing precedent) - end-to-end concurrent-race test: deterministically reproduces the strand without the fix (run() call count = 0) and confirms it is resolved with the fix (run() call count = 1) Co-authored-by: openhands <openhands@all-hands.dev>
Status alone is not a complete predicate for 'interrupt will do something'. During the wait_for_pending drain tail at the end of _run_and_publish, arun() has set FINISHED and returned, but _run_task is still alive draining the callback queue. An interrupt landing then DOES cancel the live task - a real effect that the status-only gate would miss. The disjunct (status in RUNNING/IDLE OR live _run_task) closes that gap. The drain tail grows with the conversation callback queue, so this is the same length-correlated signature as #14698 itself - a reviewer would have caught it. pause() does not need the disjunct: LocalConversation.pause() only ever transitions status on RUNNING/IDLE and never cancels a task, so its status-only gate is already a complete predicate. The fail-safe direction also matters: a missed bump lets a racing send_message proceed to run() (the opposite of #14698) rather than strand. So the prior status-only gate was a 'stop-wins completeness' gap, not a correctness regression. Still fix it. Adds test_interrupt_on_finished_with_live_run_task_still_bumps which verifies the disjunct: status=FINISHED + live _run_task -> bump must fire. Test fails on the status-only patch, passes with the disjunct restored. Co-authored-by: openhands <openhands@all-hands.dev>
|
Pushed The original sketch for The drain tail grows with the callback queue, i.e. with conversation length, which is exactly the length-correlated signature of #14698. Naming it explicitly here so it doesn't get found in review. Worth flagging that the gap fails safe in the opposite direction to the bug being fixed: a missed bump lets a racing
New regression test
Full suite: 171 passed ( Inline comments tightened per review style preference. Honest confidence split for reviewers: medium-high on the server-side correctness (this is what the static code path says will happen, backed by deterministic test coverage of both fail and pass cases). Medium on "this is the production trigger of #14698 in the wild" until field data lands - the diagnostic fleet watcher described in the issue can correlate a live |
|
I checked out this PR because CI is red and reproduced both failures locally. Findings:
Root cause I found: the new stop-intent gate calls I cannot push to the source branch directly, so I pushed a small follow-up patch here if useful: Local validation on that patch:
Happy to adapt the patch if you prefer a different shape; the important bit is avoiding the persisted-state context before the actual interrupt while a run is active. |
|
CI is green on the latest head (626a51a). Both earlier failures are resolved and the #14698 fix is unchanged behaviourally:
Full cross suite passes (336 passed, 1 skipped). Ready for review. |
|
[Automatic Post]: I have assigned @enyst as a reviewer based on git blame information. Thanks in advance for the help! This comment was created by an AI agent (OpenHands) on behalf of the user. |
H:
AGENT:
Why
Fixes the stuck-conversation regression reported in All-Hands-AI/OpenHands#14698.
EventService.pause()andEventService.interrupt()previously bumpedself._explicit_interrupt_generationunconditionally before delegating toLocalConversation. On any conversation whose status is notRUNNINGorIDLE(typicallyFINISHED, but alsoPAUSEDandERROR):LocalConversation.pause()is a no-op (early-returns at the status check around L1648-1660)LocalConversation.interrupt()falls back toself.pause(), which is also a no-opThe bump left a "phantom" stop-intent on a conversation that was never actually paused or interrupted. When the next user
send_message(run=True)lands, the silent early-return guard atevent_service.py:457:honours the phantom bump and skips
self.run()entirely. From the user's view the message is persisted,last_user_message_idupdates, but norunningevent ever fires and the agent loop stays dormant — the exact symptom described in #14698.The fix only bumps the counter when the call will actually change execution state. This is the same precedent already used at L1034 for
internal_acp_rerun: an interrupt that does nothing should not bump the stop-intent counter.Intent-clearing of
_rerun_requestedand_acp_internal_rerun_requestedis kept unconditional on external calls — the user explicitly asked to stop, so any pending re-run intent is cleared even if the call is a no-op for execution-state purposes. This preserves thetest_explicit_interrupt_clears_internal_acp_rerun_requestinvariant.Summary
EventService.pause(): only bump_explicit_interrupt_generationwhen status isRUNNINGorIDLE(status range whereLocalConversation.pause()is not a no-op).EventService.interrupt(): same guard applied to the bump inside thenot internal_acp_rerunbranch._rerun_requested/_acp_internal_rerun_requestedremains unconditional on every external call so explicit stop continues to win over pending re-runs.TestEventServiceNoOpPauseInterruptDoesNotStrandSendMessage, including a deterministic end-to-end race reproduction that proves the strand without the patch (run() call count = 0) and the fix with it (run() call count = 1).Issue Number
Closes OpenHands/OpenHands#14698.
How to Test
Code-walk verification first. The defect was identified by reading
event_service.pyandlocal_conversation.pyagainst the v1.25.0 image (sha256:1fa7632287f9f48b6ceedfc67da0813459d5c7c8c9e2cb0b6ef63c0d762c360e) reported in the issue and confirmed still present onmainat6fdc84f.Empirical verification ran:
The end-to-end test
test_concurrent_no_op_pause_during_send_message_does_not_stranddeterministically reproduces the original symptom: it fires an externalPOST /pause(mocked as aFINISHED-status no-op) during the in-flightawait loop.run_in_executor(None, self._conversation.send_message, ...)window, then asserts thatself.run()is called afterwards. Without the patch this assertion fails because the phantom bump silently strands the conversation.Live-infrastructure verification on the affected fleet has not yet been done; the diagnostic fleet watcher described in #14698 catches
POST /pause/POST /interruptagainst the agent-server, so future production incidents can be correlated with the fix being in or out.Video/Screenshots
N/A. Server-side logic fix, fully covered by the new regression tests.
Type
Notes
POST /pause/POST /interrupttraffic — typically from the OpenHands GUI when a user clicks pause on a sandbox, or from any fleet-side automation that lifecycles sandboxes. The fix is trigger-independent: it removes the defect, not the trigger.event_service.py:457is NOT the patch target. That guard exists to enforce a real user-stop-intent invariant when the bump was legitimate, and patching it to fall through torun()would break that invariant. The defect is the unconditional bump on a no-op call; fixing it at source meansL457no longer fires for the wrong reason.sockets.py:312-313already hardcodesevent_service.send_message(message, True), so that PR is inert against this bug. See the comment thread on #14701 for the code-walk.