fix(agent-server): don't bump explicit_interrupt_generation on no-op pause/interrupt#3557

Open
shanemort1982 wants to merge 3 commits into OpenHands:main from shanemort1982:fix/no-op-pause-interrupt-strand

Conversation

@shanemort1982

Contributor

H:

  • A human has tested these changes.

AGENT:


Why

Fixes the stuck-conversation regression reported in All-Hands-AI/OpenHands#14698.

EventService.pause() and EventService.interrupt() previously bumped self._explicit_interrupt_generation unconditionally before delegating to LocalConversation. On any conversation whose status is not RUNNING or IDLE (typically FINISHED, but also PAUSED and ERROR):

  • LocalConversation.pause() is a no-op (early-returns at the status check around L1648-1660)
  • LocalConversation.interrupt() falls back to self.pause(), which is also a no-op

The bump left a "phantom" stop-intent on a conversation that was never actually paused or interrupted. When the next user send_message(run=True) lands, the silent early-return guard at event_service.py:457:

explicit_interrupt_generation = self._explicit_interrupt_generation
await loop.run_in_executor(None, self._conversation.send_message, message)
if run:
    if self._explicit_interrupt_generation != explicit_interrupt_generation:
        return    # ← silent strand, no logging

honours the phantom bump and skips self.run() entirely. From the user's view the message is persisted, last_user_message_id updates, but no running event ever fires and the agent loop stays dormant — the exact symptom described in #14698.
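To make the window concrete, here is a minimal sketch of the pre-fix sequence. `ToyEventService`, its string statuses, and the `asyncio.sleep` standing in for the executor call are illustrative assumptions, not the real `EventService`; only the snapshot-then-compare shape is taken from the excerpt above.

```python
import asyncio


class ToyEventService:
    """Hypothetical stand-in for EventService (pre-fix behaviour)."""

    def __init__(self, status: str) -> None:
        self.status = status
        self.generation = 0  # plays the role of _explicit_interrupt_generation
        self.run_calls = 0
        self.persisted = []

    async def pause(self) -> None:
        # Pre-fix: the counter is bumped before the status is even looked at.
        self.generation += 1
        if self.status in ("running", "idle"):
            self.status = "paused"
        # Any other status: nothing changes, but the bump has already happened.

    async def send_message(self, message: str, run: bool = True) -> None:
        seen = self.generation
        # Stand-in for the run_in_executor persist call; this is the race window.
        await asyncio.sleep(0.01)
        self.persisted.append(message)
        if run:
            if self.generation != seen:
                return  # silent strand
            self.run_calls += 1


async def reproduce_strand() -> ToyEventService:
    service = ToyEventService(status="finished")
    send = asyncio.create_task(service.send_message("follow-up"))
    await asyncio.sleep(0)  # let send_message snapshot the counter and start persisting
    await service.pause()   # no-op for execution state, but still bumps
    await send
    return service
```

Running `asyncio.run(reproduce_strand())` leaves the message persisted while the toy `run_calls` counter stays at zero, which mirrors the dormant agent loop.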

The fix only bumps the counter when the call will actually change execution state. This is the same precedent already used at L1034 for internal_acp_rerun: an interrupt that does nothing should not bump the stop-intent counter.

Intent-clearing of _rerun_requested and _acp_internal_rerun_requested is kept unconditional on external calls — the user explicitly asked to stop, so any pending re-run intent is cleared even if the call is a no-op for execution-state purposes. This preserves the test_explicit_interrupt_clears_internal_acp_rerun_request invariant.

Summary

  • EventService.pause(): only bump _explicit_interrupt_generation when status is RUNNING or IDLE (the only statuses for which LocalConversation.pause() is not a no-op).
  • EventService.interrupt(): same guard applied to the bump inside the not internal_acp_rerun branch.
  • Intent-clearing of _rerun_requested / _acp_internal_rerun_requested remains unconditional on every external call so explicit stop continues to win over pending re-runs.
  • Adds 8 regression tests in TestEventServiceNoOpPauseInterruptDoesNotStrandSendMessage, including a deterministic end-to-end race reproduction that proves the strand without the patch (run() call count = 0) and the fix with it (run() call count = 1).
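The first three bullets can be sketched as a small model. This is an assumption-laden illustration of the status-only gate described in this Summary: `GuardedService`, the `Status` enum, and the attribute names without leading underscores are hypothetical, and the real methods delegate to `LocalConversation` rather than mutating status directly.

```python
from enum import Enum


class Status(Enum):
    IDLE = "idle"
    RUNNING = "running"
    PAUSED = "paused"
    FINISHED = "finished"
    ERROR = "error"


# Statuses for which a stop request changes execution state in this toy model.
ACTIVE = frozenset({Status.RUNNING, Status.IDLE})


class GuardedService:
    """Hypothetical model of the guarded bump; not the real EventService."""

    def __init__(self, status: Status) -> None:
        self.status = status
        self.generation = 0
        self.rerun_requested = True
        self.acp_internal_rerun_requested = True

    def _clear_rerun_intent(self) -> None:
        self.rerun_requested = False
        self.acp_internal_rerun_requested = False

    def pause(self) -> None:
        self._clear_rerun_intent()  # unconditional: explicit stop wins over re-runs
        if self.status in ACTIVE:   # guarded: bump only when pause has an effect
            self.generation += 1
            self.status = Status.PAUSED

    def interrupt(self, internal_acp_rerun: bool = False) -> None:
        will_stop = self.status in ACTIVE
        if not internal_acp_rerun:
            self._clear_rerun_intent()  # external call: always clear pending intent
            if will_stop:
                self.generation += 1    # guarded bump
        if will_stop:
            self.status = Status.PAUSED
```

In this model a `pause()` on a FINISHED service clears both re-run flags but leaves `generation` at 0, while an internal ACP re-run interrupt touches neither the flags nor the counter.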

Issue Number

Closes OpenHands/OpenHands#14698.

How to Test

Code-walk verification first. The defect was identified by reading event_service.py and local_conversation.py against the v1.25.0 image (sha256:1fa7632287f9f48b6ceedfc67da0813459d5c7c8c9e2cb0b6ef63c0d762c360e) reported in the issue and confirmed still present on main at 6fdc84f.

Empirical verification ran:

# (a) Patch applied: full event_service test suite passes
uv run pytest tests/agent_server/test_event_service.py -q
# 101 passed in 17.41s

# (b) Patch applied + adjacent HTTP layer regression check
uv run pytest tests/agent_server/test_conversation_router.py -q
# 69 passed in 4.14s

# (c) New tests fail deterministically WITHOUT the patch
#     (reverted event_service.py to upstream/main, kept new tests)
uv run pytest tests/agent_server/test_event_service.py::TestEventServiceNoOpPauseInterruptDoesNotStrandSendMessage -q
# 3 failed (FINISHED + PAUSED no-bump tests) + the end-to-end race test
# FAILED with: 'AssertionError: ... Actual run() call count: 0'

# (d) New tests pass deterministically WITH the patch restored
uv run pytest tests/agent_server/test_event_service.py::TestEventServiceNoOpPauseInterruptDoesNotStrandSendMessage -q
# 8 passed

The end-to-end test test_concurrent_no_op_pause_during_send_message_does_not_strand deterministically reproduces the original symptom: it fires an external POST /pause (mocked as a FINISHED-status no-op) during the in-flight await loop.run_in_executor(None, self._conversation.send_message, ...) window, then asserts that self.run() is called afterwards. Without the patch this assertion fails because the phantom bump silently strands the conversation.
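The shape of that race test can be sketched independently of the real suite. Everything below is a hypothetical reconstruction: `PatchedToyService` models the guarded bump, and two `threading.Event` objects hold the executor call open so the pause lands inside the window deterministically. The actual test mocks the conversation and asserts on `self.run()`.

```python
import asyncio
import threading


class PatchedToyService:
    """Hypothetical stand-in with the guarded bump applied."""

    def __init__(self, status: str) -> None:
        self.status = status
        self.generation = 0
        self.run_calls = 0
        self.in_persist = threading.Event()
        self.release_persist = threading.Event()

    def _persist(self, message: str) -> None:
        # Runs in the executor; blocks until the test releases it.
        self.in_persist.set()
        self.release_persist.wait(timeout=5)

    def pause(self) -> None:
        if self.status in ("running", "idle"):
            self.generation += 1
            self.status = "paused"

    async def send_message(self, message: str, run: bool = True) -> None:
        seen = self.generation
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(None, self._persist, message)
        if run and self.generation == seen:
            self.run_calls += 1


async def race_pause_against_send(status: str) -> int:
    service = PatchedToyService(status)
    send = asyncio.create_task(service.send_message("follow-up"))
    await asyncio.sleep(0)  # let send_message submit its executor job
    loop = asyncio.get_running_loop()
    # Wait off the event loop until the persist call is definitely in flight.
    await loop.run_in_executor(None, service.in_persist.wait, 5)
    service.pause()                # lands inside the window
    service.release_persist.set()  # let the persist call finish
    await send
    return service.run_calls
```

With a "finished" status the pause is a true no-op and the toy run count is 1. With "idle" the pause is real, the counter moves, and the run is correctly skipped, which is the stop-wins behaviour the guard is meant to keep.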

Live-infrastructure verification on the affected fleet has not yet been done; the diagnostic fleet watcher described in #14698 catches POST /pause / POST /interrupt against the agent-server, so future production incidents can be correlated with the fix being in or out.

Video/Screenshots

N/A. Server-side logic fix, fully covered by the new regression tests.

Type

  • Bug fix
  • Feature
  • Refactor
  • Breaking change
  • Docs / chore

Notes

  • The bumper that triggers the race in production is external POST /pause / POST /interrupt traffic — typically from the OpenHands GUI when a user clicks pause on a sandbox, or from any fleet-side automation that lifecycles sandboxes. The fix is trigger-independent: it removes the defect, not the trigger.
  • The post-persist generation guard at event_service.py:457 is NOT the patch target. That guard exists to enforce a real user-stop-intent invariant when the bump was legitimate, and patching it to fall through to run() would break that invariant. The defect is the unconditional bump on a no-op call; fixing it at source means L457 no longer fires for the wrong reason.
  • Related upstream PR All-Hands-AI/OpenHands#14701 targets the frontend WebSocket payload instead, but the agent-server WebSocket handler at sockets.py:312-313 already hardcodes event_service.send_message(message, True), so that PR is inert against this bug. See the comment thread on #14701 for the code-walk.

…pause/interrupt

Fixes the stuck-conversation regression reported in
OpenHands/OpenHands#14698.

EventService.pause() and EventService.interrupt() previously bumped
self._explicit_interrupt_generation unconditionally before delegating
to LocalConversation. On a FINISHED conversation (or any status that
is not RUNNING/IDLE), LocalConversation.pause() is a no-op, and
LocalConversation.interrupt() falls back to self.pause() which is
also a no-op. The bump left a phantom stop-intent that the next
send_message(run=True) honoured at its post-persist generation
guard (event_service.py L457), silently early-returning without
calling run(). From the user's view, the follow-up message was
persisted but the agent loop never restarted.

The fix only bumps the counter when the call will actually change
execution state. Same precedent as the existing internal_acp_rerun
gate at L1034: an interrupt that does nothing should not bump the
stop-intent counter.

Intent-clearing (_rerun_requested, _acp_internal_rerun_requested)
remains unconditional on external calls: the user explicitly asked
to stop, even if there is nothing to stop, so any pending re-run
intent is cleared. This preserves the
'explicit-stop-wins-over-pending-rerun' invariant covered by
test_explicit_interrupt_clears_internal_acp_rerun_request.

Regression coverage added in
TestEventServiceNoOpPauseInterruptDoesNotStrandSendMessage:

- pause/interrupt on FINISHED/PAUSED do NOT bump
- pause/interrupt on RUNNING/IDLE still bump (no regression)
- internal_acp_rerun=True still skips the bump (existing precedent)
- end-to-end concurrent-race test: deterministically reproduces the
  strand without the fix (run() call count = 0) and confirms it is
  resolved with the fix (run() call count = 1)

Co-authored-by: openhands <openhands@all-hands.dev>

Status alone is not a complete predicate for 'interrupt will do
something'. During the wait_for_pending drain tail at the end of
_run_and_publish, arun() has set FINISHED and returned, but
_run_task is still alive draining the callback queue. An interrupt
landing then DOES cancel the live task - a real effect that the
status-only gate would miss.

The disjunct (status in RUNNING/IDLE OR live _run_task) closes that
gap. The drain tail grows with the conversation callback queue, so
this is the same length-correlated signature as #14698 itself - a
reviewer would have caught it.

pause() does not need the disjunct: LocalConversation.pause() only
ever transitions status on RUNNING/IDLE and never cancels a task,
so its status-only gate is already a complete predicate.

The fail-safe direction also matters: a missed bump lets a racing
send_message proceed to run() (the opposite of #14698) rather than
strand. So the prior status-only gate was a 'stop-wins completeness'
gap, not a correctness regression. Still fix it.

Adds test_interrupt_on_finished_with_live_run_task_still_bumps which
verifies the disjunct: status=FINISHED + live _run_task -> bump
must fire. Test fails on the status-only patch, passes with the
disjunct restored.

Co-authored-by: openhands <openhands@all-hands.dev>

@shanemort1982

Contributor Author

Pushed 4449d72 to address a gap in the first pass:

The original sketch for interrupt() had a live-task OR status-in-{RUNNING,IDLE} disjunct that got simplified to status-only when I shipped. Status alone is not a complete predicate, because during the wait_for_pending drain tail at the end of _run_and_publish, arun() has set FINISHED and returned, but _run_task is still alive draining the callback queue. An /interrupt landing in that tail DOES cancel the live task - a real effect that the status-only gate would miss.

The drain tail grows with the callback queue, i.e. with conversation length, which is exactly the length-correlated signature of #14698. Naming it explicitly here rather than leaving it for review to find.

Worth flagging that the gap fails safe in the opposite direction to the bug being fixed: a missed bump lets a racing send_message proceed to run() (a follow-up runs) rather than strand. So the prior status-only patch was a "stop-wins completeness" gap, not a correctness regression - calling that out so it isn't conflated with #14698 itself.

pause() is intentionally unchanged: LocalConversation.pause() only transitions status on RUNNING/IDLE and never cancels a task, so its status-only gate is already a complete predicate. The disjunct only belongs in interrupt().
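A sketch of the two predicates, under the assumption that the run handle behaves like an `asyncio.Task`. The function names and string statuses are hypothetical; the real gate lives inside `EventService.interrupt()` and reads `_run_task`.

```python
import asyncio
from typing import Optional


def pause_should_bump(status: str) -> bool:
    # Status-only gate: the text above argues this is complete for pause().
    return status in ("running", "idle")


def interrupt_should_bump(status: str, run_task: Optional[asyncio.Task]) -> bool:
    # Disjunct: an active status OR a still-live run task.
    live_task = run_task is not None and not run_task.done()
    return status in ("running", "idle") or live_task


async def drain_tail_demo() -> tuple:
    async def drain() -> None:
        await asyncio.sleep(3600)  # stands in for the wait_for_pending drain tail

    run_task = asyncio.create_task(drain())
    try:
        # Status already says "finished" while the task is still alive.
        during = interrupt_should_bump("finished", run_task)
        status_only = pause_should_bump("finished")
    finally:
        run_task.cancel()
        await asyncio.gather(run_task, return_exceptions=True)
    after = interrupt_should_bump("finished", run_task)
    return during, status_only, after
```

The demo returns `(True, False, False)`: the disjunct fires during the drain tail, a status-only check does not, and once the task is done a FINISHED interrupt is again a no-op.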

New regression test test_interrupt_on_finished_with_live_run_task_still_bumps verifies the disjunct fires: status=FINISHED + live _run_task -> bump must happen. Confirmed:

  • Against the previous (status-only) commit: test fails with assert 0 == 1
  • Against the disjunct commit: test passes

Full suite: 171 passed (test_event_service.py 102 + test_conversation_router.py 69).

Inline comments tightened per review style preference.

Honest confidence split for reviewers: medium-high on the server-side correctness (this is what the static code path says will happen, backed by deterministic test coverage of both fail and pass cases). Medium on "this is the production trigger of #14698 in the wild" until field data lands - the diagnostic fleet watcher described in the issue can correlate a live POST /pause or POST /interrupt hit with a stuck repro on the affected host. That second number only moves with empirical data, not more code-reading. The fix itself is trigger-independent.

@fengjikui


I checked out this PR because CI is red and reproduced both failures locally.

Findings:

  • pre-commit is just ruff-format on event_service.py and test_event_service.py.
  • cross-tests is PR-introduced, not a base flake: on main, tests/cross/test_remote_conversation_live_server.py::test_interrupt_endpoint_cancels_running_conversation passes; on this PR head it times out on POST /interrupt.

Root cause I found: the new stop-intent gate calls _get_execution_status() on the /interrupt hot path before cancelling the run. That enters ConversationState's context via a thread-pool read. During a live server run, this adds a persistence/lease-aware state context right in the interrupt race window and can strand the HTTP interrupt long enough for the client to time out. The reliable version is to use in-memory task state first (_run_task / _arun_task) and only peek the in-memory execution status when there is no live task, so FINISHED/PAUSED no-op calls still avoid the phantom generation bump.
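The ordering suggested here can be sketched as a short-circuiting check. `FakeTask`, `stop_will_take_effect`, and the `peek_status` callable are hypothetical names for illustration; the point is only that the status read is skipped entirely while a task handle is live.

```python
from typing import Callable, Optional


class FakeTask:
    """Minimal task-like handle exposing done(), as asyncio.Task does."""

    def __init__(self, finished: bool) -> None:
        self._finished = finished

    def done(self) -> bool:
        return self._finished


def _is_live(task: Optional[FakeTask]) -> bool:
    return task is not None and not task.done()


def stop_will_take_effect(
    run_task: Optional[FakeTask],
    arun_task: Optional[FakeTask],
    peek_status: Callable[[], str],
) -> bool:
    # 1. In-memory task handles first: no state context is entered.
    if _is_live(run_task) or _is_live(arun_task):
        return True
    # 2. Only when nothing is live, peek the in-memory status.
    return peek_status() in ("running", "idle")
```

Passing a `peek_status` that records its calls shows the read never happens on the hot path while a run is active, yet FINISHED and PAUSED calls with no live task still come back as no-ops.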

I cannot push to the source branch directly, so I pushed a small follow-up patch here if useful:
fengjikui@84e61550

Local validation on that patch:

  • uv run ruff format --check openhands-agent-server/openhands/agent_server/event_service.py tests/agent_server/test_event_service.py -> formatted
  • uv run python -m pytest tests/agent_server/test_event_service.py -q -> 102 passed
  • CI=true uv run python -m pytest -q tests/cross/test_remote_conversation_live_server.py::test_interrupt_endpoint_cancels_running_conversation -> passed
  • CI=true uv run python -m pytest -q --basetemp=/tmp/pytest-oh-sdk-cross -o tmp_path_retention=none -o tmp_path_retention_count=0 tests/cross -> 336 passed, 1 skipped

Happy to adapt the patch if you prefer a different shape; the important bit is avoiding the persisted-state context before the actual interrupt while a run is active.

@shanemort1982

Contributor Author

CI is green on the latest head (626a51a). Both earlier failures are resolved and the #14698 fix is unchanged behaviourally:

  • cross-tests: the interrupt gate was calling _get_execution_status() on the /interrupt hot path, which enters ConversationState via its fair FIFOLock. The running loop already holds that lock for the duration of run(), so the interrupt queued behind the very run it was meant to cancel and the live-server test timed out. It now reads the status lock-free (_peek_execution_status) and checks the live _run_task/_arun_task handles first, so the no-op gate keeps the FINISHED/PAUSED/ERROR semantics without the blocking read. Thanks @fengjikui for the diagnosis and patch, incorporated here with authorship preserved.
  • pre-commit: ruff-format on event_service.py and test_event_service.py.
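The blocking behaviour in the first bullet can be reproduced with a plain lock. This is a toy under stated assumptions: `threading.Lock` stands in for the FIFO lock, `ToyState` for `ConversationState`, and the `peek_status` method for `_peek_execution_status`, whose real implementation is not shown in this thread.

```python
import threading
from typing import Optional


class ToyState:
    """Hypothetical state object guarded by a lock the run loop holds."""

    def __init__(self) -> None:
        self._lock = threading.Lock()  # stand-in for the FIFO lock
        self._status = "running"

    def locked_status(self, timeout: float) -> Optional[str]:
        # Locked read: returns None if the lock cannot be taken in time.
        if not self._lock.acquire(timeout=timeout):
            return None
        try:
            return self._status
        finally:
            self._lock.release()

    def peek_status(self) -> str:
        # Lock-free read of a single attribute.
        return self._status


def demo() -> tuple:
    state = ToyState()
    holding = threading.Event()
    stop = threading.Event()

    def run_loop() -> None:
        with state._lock:  # the run holds the lock for its whole duration
            holding.set()
            stop.wait(timeout=5)

    worker = threading.Thread(target=run_loop)
    worker.start()
    holding.wait(timeout=5)
    blocked = state.locked_status(timeout=0.05)  # queues behind the run
    peeked = state.peek_status()                 # returns immediately
    stop.set()
    worker.join()
    after = state.locked_status(timeout=1)       # lock is free again
    return blocked, peeked, after
```

While the toy run holds the lock, the locked read times out and the peek still returns "running". Once the run ends, the locked read succeeds again.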

Full cross suite passes (336 passed, 1 skipped). Ready for review.

all-hands-bot requested a review from enyst June 8, 2026 14:00

@all-hands-bot

Collaborator

[Automatic Post]: I have assigned @enyst as a reviewer based on git blame information. Thanks in advance for the help!

This comment was created by an AI agent (OpenHands) on behalf of the user.


Development

Successfully merging this pull request may close these issues.

[Bug]: User message on finished/idle conversation never triggers /run, agent loop stays dormant (regression on main)

3 participants