DRAFT: feat(stuck-detector): emit typed StuckPatternDetected event with reason codes#3565
DRAFT: feat(stuck-detector): emit typed StuckPatternDetected event with reason codes#3565juanmichelini wants to merge 3 commits into
Conversation
Adds a structured signal so eval harnesses, retry loops, and dashboards can distinguish a 'definitive failure' stuck pattern from transient errors without parsing free-form log strings. Changes: * StuckDetector.detect_stuck_reason() -> str | None returns a stable code identifying which pattern fired (repeating_action_observation, repeating_action_error, monologue, alternating_action_observation, context_window_error_loop). is_stuck() is now a thin wrapper around it, so all existing callers keep working unchanged. * LocalConversation now emits a ConversationErrorEvent with code='StuckPatternDetected' and the reason embedded as 'stuck_pattern=<reason>' in the detail. The conversation still transitions to STUCK status (no behavior change there). * The detail string says explicitly 'definitive failure ... avoid retrying without intervention' so harnesses that grep the detail field still get the right signal. Motivated by eval runs where harnesses retried genuinely-stuck instances over and over, burning quota on hopeless runs that the SDK had already identified as unrecoverable. Co-authored-by: openhands <openhands@all-hands.dev>
Python API breakage checks — ✅ PASSEDResult: ✅ PASSED |
REST API breakage checks (OpenAPI) — ✅ PASSEDResult: ✅ PASSED |
Coverage Report •
|
||||||||||||||||||||||||||||||
|
🔍 Review in progress… We are performing the review through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review
🟡 Acceptable - Clean, well-designed enhancement with backward compatibility.
Summary
This PR adds typed reason codes to the stuck detector so eval harnesses can tell which pattern fired without scraping log lines. The implementation is sound: is_stuck() becomes a thin wrapper around the new detect_stuck_reason(), and LocalConversation emits a ConversationErrorEvent with the reason embedded in the detail field.
[IMPROVEMENT OPPORTUNITIES]
-
🟡 Test count discrepancy —
tests/cross/test_stuck_detector_typed_reason.pyThe PR description claims "7 new tests covering each reason code," but the test file has 6 tests covering only 3 of the 5 reason codes:
- ✅
repeating_action_observation - ✅
repeating_action_error - ✅
monologue - ❌
alternating_action_observation— not directly tested - ❌
context_window_error_loop— not directly tested
Either add tests for the missing reason codes or update the description to reflect the actual coverage (6 tests, 3/5 reason codes directly tested).
- ✅
[RISK ASSESSMENT]
-
⚠️ Risk Assessment: 🟢 LOWAdditive change only —
is_stuck()still returnsbooland works identically. The newdetect_stuck_reason()is purely additive, and the event emission preserves the existing log line. No breaking changes.
VERDICT:
✅ Worth merging — Core logic is solid, backward compatibility is maintained, and tests pass (6 new + 10 existing = 16/16).
KEY INSIGHT:
The reason codes are stable, machine-readable identifiers that will let eval harnesses make smarter retry decisions without parsing free-form text.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
🟡 QA Report: PARTIAL
Core typed stuck-detection behavior works in SDK execution: the PR branch returns machine-readable reasons and emits ConversationErrorEvent(code="StuckPatternDetected") during Conversation.run(), but I could not realistically exercise the context-window-loop reason without forcing a real LLM context-window failure.
Does this PR achieve its stated goal?
Partially verified, leaning yes for the main user-facing goal. On main, the same stuck conversation only exposed is_stuck=True, transitioned to stuck, and emitted no error event; on the PR branch, the SDK returned repeating_action_observation, preserved is_stuck=True, transitioned to stuck, and emitted one typed StuckPatternDetected event whose detail included both stuck_pattern=repeating_action_observation and the “definitive failure” retry guidance. I also exercised additional reason-code paths for repeating_action_error, monologue, and alternating_action_observation; I could not verify context_window_error_loop with synthetic SDK event construction, so that specific reason remains unverified by this QA pass.
| Phase | Result |
|---|---|
| Environment Setup | ✅ make build completed successfully |
| CI Status | ✅ Existing PR checks were green except this qa-changes check in progress; several optional workflows were skipped |
| Functional Verification | 🟡 Main behavior verified; real context-window-loop scenario not verified |
Functional Verification
Test 1: Before/after stuck conversation behavior
Step 1 — Establish baseline on origin/main:
Checked out origin/main and ran an SDK script that created a Conversation, sent a user message, appended four identical terminal action/observation pairs, then called StuckDetector and conv.run().
Command:
git switch --detach origin/main
OPENHANDS_SUPPRESS_BANNER=1 uv run python - <<'PY'
# Script constructed a Conversation with stuck_detection=True, added repeated
# terminal action/observation events, then printed StuckDetector and run-loop output.
PYObserved output:
has_detect_stuck_reason= False
detected_reason_error= AttributeError
is_stuck= True
status_after_run= stuck
error_event_count= 0
This confirms the pre-PR behavior: the SDK can classify the run as stuck, but consumers get no stable reason API and no typed conversation error event to branch on.
Step 2 — Apply the PR's changes:
Checked out openhands/sdk-2-typed-stuck-error at 57154b3ddfe0588b3aeecbfaf45fda31f61cd906.
Step 3 — Re-run with the fix in place:
Ran the same SDK scenario on the PR branch.
Observed output:
has_detect_stuck_reason= True
detected_reason= repeating_action_observation
is_stuck= True
status_after_run= stuck
error_event_count= 1
event_code= StuckPatternDetected
event_detail_contains_reason= True
event_detail_contains_definitive_failure= True
This shows the PR delivers the main claimed behavior for a real Conversation.run() path: the status transition is preserved, is_stuck() remains backward-compatible, and downstream consumers can read a typed event plus stable reason text instead of scraping the old generic log line.
Test 2: Additional reason-code API behavior
On the PR branch, I ran SDK scripts that constructed conversation histories for several stuck patterns and called StuckDetector.detect_stuck_reason() directly.
Command:
OPENHANDS_SUPPRESS_BANNER=1 uv run python - <<'PY'
# Script created separate Conversation states for repeated action/observation,
# repeated action/error, repeated assistant monologue, and alternating actions.
PYObserved output:
repeating_action_observation: repeating_action_observation
repeating_action_error: repeating_action_error
monologue: monologue
pairs 4 => None
pairs 5 => None
pairs 6 => alternating_action_observation
pairs 8 => alternating_action_observation
pairs 10 => alternating_action_observation
This confirms the new API exposes stable string reasons for multiple stuck-pattern categories and that the alternating pattern begins firing once the history is long enough for the detector threshold.
Unable to Verify
I could not fully verify context_window_error_loop as a real end-user path in this environment without intentionally driving an LLM into repeated context-window failures, which would require external model behavior, credentials, and potentially high token usage. I tried several synthetic SDK event histories using AgentErrorEvent, ConversationErrorEvent, and LLMCompletionLogEvent carrying context-window-style messages; those did not trigger that reason code, so I am not treating them as evidence against the PR because they may not match the SDK's real context-window failure event shape.
Suggested future QA guidance: add a lightweight documented script or fixture that demonstrates the real event sequence produced by an LLM context-window error loop so QA can exercise context_window_error_loop without relying on expensive external LLM calls.
Issues Found
None from the behavior I could exercise.
This QA review was created by an AI agent (OpenHands) on behalf of the user.
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review: PR #3565
Taste Rating: 🟡 Acceptable — Solid implementation with minor concerns
[CRITICAL ISSUES]
None identified. The implementation is backward-compatible and introduces no breaking changes.
[IMPROVEMENT OPPORTUNITIES]
-
[openhands-sdk/openhands/sdk/conversation/stuck_detector.py, Line 93] Error Handling Gap: If
detect_stuck_reason()is called and an exception occurs during event analysis, the exception propagates instead of returningNone. This could cause unhandled failures in production. Consider wrapping the detection logic in a try/except. -
[openhands-sdk/openhands/sdk/conversation/impl/local_conversation.py, Line 372] Missing Type Annotation:
_emit_stuck_pattern_detectedis private but has no_prefix convention (it starts with_emit). The method signaturereason: strlacks validation — passing an unknown reason code (e.g.,"unknown_pattern") will embed it in the event detail without warning. Consider either validating against known codes or documenting that callers must use documented codes only.
[STYLE NOTES]
- Generally well-structured. The docstring on
detect_stuck_reason()(lines 72-95) is thorough and clearly documents the stable reason codes. The code is clean.
[TESTING GAPS]
-
The test file
tests/cross/test_stuck_detector_typed_reason.pyprovides good coverage of the new functionality. Tests are real (no mocks) and verify both the new API and backward compatibility. -
Edge Case Missing: No test for when
_is_stuck_context_window_error()returnsTrue(thecontext_window_error_loopreason). This scenario should have at least one test to ensure the reason code is returned correctly.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟢 LOW
This is a low-risk change. It adds a new API method while preserving the existing is_stuck() for backward compatibility. The PR introduces typed reason codes that enable better telemetry and debugging for stuck agents. No breaking changes to existing APIs.
VERDICT:
✅ Worth merging: Core logic is sound. The backward compatibility approach is good. Minor improvements suggested around error handling and testing the context window error loop scenario.
KEY INSIGHT:
The design elegantly solves a real problem (distinguishing between different stuck patterns) without breaking the existing API, making it easy for consumers to adopt incrementally.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
| Note: To avoid materializing potentially large file-backed event | ||
| histories, only the last MAX_EVENTS_TO_SCAN_FOR_STUCK_DETECTION | ||
| events are analyzed. If a user message exists within this window, | ||
| only events after it are checked. |
There was a problem hiding this comment.
🟡 Suggestion: Consider adding a try/except around the detection logic to ensure None is returned if an unexpected error occurs during stuck pattern analysis. This prevents unhandled exceptions from propagating to callers.
| """Get the stuck detector instance if enabled.""" | ||
| return self._stuck_detector | ||
|
|
||
| def _emit_stuck_pattern_detected(self, reason: str) -> None: |
There was a problem hiding this comment.
🟡 Suggestion: Consider validating reason against the documented codes, or adding a note in the docstring that callers must use only the documented reason codes to avoid silently embedding invalid patterns in event details.
all-hands-bot
left a comment
There was a problem hiding this comment.
🟡 QA Report: PARTIAL
Verified the main typed stuck-detector behavior through SDK-level probes, but could not reproduce/verify the promised context_window_error_loop reason code locally.
Does this PR achieve its stated goal?
Partially. The PR does deliver stable typed reason codes for the repeating action/observation, repeating action/error, monologue, and alternating action/observation stuck patterns, while preserving is_stuck() boolean behavior. It also changes a real Conversation.run() stuck transition from silent STUCK to a typed ConversationErrorEvent with code="StuckPatternDetected", stuck_pattern=repeating_action_observation, and the “definitive failure / avoid retrying” guidance. I could not verify the context_window_error_loop reason: deterministic seeded context-window error events returned None, and a real SDK conversation with an LLM that raises LLMContextWindowExceedError transitions to ERROR with LLMContextWindowExceedError rather than a stuck-loop event.
| Phase | Result |
|---|---|
| Environment Setup | ✅ Dependencies bootstrapped with uv sync --dev; SDK probes ran successfully. |
| CI Status | 🟡 At QA time: 26 checks successful, 3 in progress, 3 skipped; no CI jobs were rerun. |
| Functional Verification | 🟡 Four reason codes + typed conversation event verified; context-window loop reason not verified. |
Functional Verification
Probe 1: Baseline on main establishes old behavior
Step 1 — Reproduce / establish baseline (without the fix):
Checked out origin/main at fd88a5ce0495e6c254bb95509b6e8ec91cce08fb and ran:
OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_stuck_all_reasons.pyRelevant output:
{
"conversation_run": {
"error_events": [],
"execution_status": "ConversationExecutionStatus.STUCK",
"run_error": null
},
"detector_results": {
"alternating_action_observation": {
"detect_stuck_reason_error": "AttributeError: 'StuckDetector' object has no attribute 'detect_stuck_reason'",
"is_stuck": true
},
"monologue": {
"detect_stuck_reason_error": "AttributeError: 'StuckDetector' object has no attribute 'detect_stuck_reason'",
"is_stuck": true
},
"repeating_action_error": {
"detect_stuck_reason_error": "AttributeError: 'StuckDetector' object has no attribute 'detect_stuck_reason'",
"is_stuck": true
},
"repeating_action_observation": {
"detect_stuck_reason_error": "AttributeError: 'StuckDetector' object has no attribute 'detect_stuck_reason'",
"is_stuck": true
}
}
}This confirms the baseline problem: stuck patterns are detected only as booleans, and the conversation reaches STUCK without a typed error event for downstream harnesses.
Step 2 — Apply the PR's changes:
Checked out openhands/sdk-2-typed-stuck-error at 17fff196a6115984ced62291159de10445f904e8.
Step 3 — Re-run with the fix in place:
Ran the same SDK probe:
OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_stuck_all_reasons.pyRelevant output:
{
"conversation_run": {
"error_events": [
{
"code": "StuckPatternDetected",
"detail": "Agent appears to be stuck in a repetitive or unproductive pattern (stuck_pattern=repeating_action_observation). The conversation has been moved to the STUCK state; callers should treat this as a definitive failure rather than a transient one and avoid retrying without intervention."
}
],
"execution_status": "ConversationExecutionStatus.STUCK",
"run_error": null
},
"detector_results": {
"alternating_action_observation": {
"detect_stuck_reason": "alternating_action_observation",
"is_stuck": true
},
"monologue": {
"detect_stuck_reason": "monologue",
"is_stuck": true
},
"repeating_action_error": {
"detect_stuck_reason": "repeating_action_error",
"is_stuck": true
},
"repeating_action_observation": {
"detect_stuck_reason": "repeating_action_observation",
"is_stuck": true
}
}
}This verifies the main new SDK-facing behavior: downstream code can now call detect_stuck_reason() for stable string codes, existing is_stuck() callers still receive true, and a real conversation stuck transition emits a typed ConversationErrorEvent while preserving STUCK status.
Probe 2: Context-window stuck-loop reason could not be reproduced
Step 1 — Attempt seeded context-window error pattern:
The same PR probe seeded repeated context-window-style agent errors and called detect_stuck_reason().
Relevant output:
{
"context_window_error_loop": {
"detect_stuck_reason": null,
"is_stuck": false
}
}This did not verify the promised context_window_error_loop code.
Step 2 — Attempt real SDK conversation context-window failure:
Ran a deterministic SDK conversation using an LLM subclass whose completion() raises LLMContextWindowExceedError:
OPENHANDS_SUPPRESS_BANNER=1 timeout 30s uv run python /tmp/qa_context_loop_conversation.pyRelevant output:
{
"conversation_error_events": [
{
"code": "LLMContextWindowExceedError",
"detail": "Conversation history longer than LLM context window limit. Consider enabling a condenser or shortening inputs."
}
],
"event_count": 3,
"event_types_tail": [
"SystemPromptEvent",
"MessageEvent",
"ConversationErrorEvent"
],
"run_error": "ConversationRunError: Conversation run failed ...",
"status": "ConversationExecutionStatus.ERROR"
}This shows the realistic no-condenser context-window path fails once as ERROR, not as a stuck-loop. I did not find a local deterministic way to exercise the “context-window error loop despite condensation” path without external LLM calls.
Unable to Verify
context_window_error_loopreason code: seeded context-window-like errors returnedNone, and a realLLMContextWindowExceedErrorconversation path emittedLLMContextWindowExceedErrorand ended inERROR. Suggested future QA guidance: add a documented deterministic SDK harness for the condensation-retry loop so QA can trigger the context-window stuck condition without external LLM credentials.
Issues Found
None in the behavior that was successfully exercised. The only gap is the unverified context_window_error_loop path noted above.
This review was created by an AI agent (OpenHands) on behalf of the user.
Why
When the SDK detects a stuck pattern today, it transitions to
STUCKstatus and logs"Stuck pattern detected."— that's it. Consumers can't tell which pattern fired (loop vs monologue vs context-window error loop), and harnesses that want to skip retrying a definitively-stuck run have to scrape a log line.Trajectory analysis on a SWE-bench eval showed harnesses re-running instances that the SDK had already correctly classified as unrecoverable, burning quota.
What
StuckDetector.detect_stuck_reason() -> str | Nonereturns a stable code:repeating_action_observation,repeating_action_error,monologue,alternating_action_observation,context_window_error_loop.is_stuck()is now a thin wrapper, so every existing caller keeps working.LocalConversationemits aConversationErrorEventwithcode="StuckPatternDetected"and the reason embedded indetailasstuck_pattern=<reason>. Detail also says explicitly "definitive failure ... avoid retrying without intervention" so harnesses that match free-form text still get the right signal.Tests
tests/cross/test_stuck_detector_typed_reason.py— 7 new tests covering each reason code, the backwards-compat boolean, and the conversation-level event emission. All 18 stuck-detector tests pass.Risk / blast radius
This PR was created by an AI agent (OpenHands) on behalf of the user investigating low throughput in eval run
swebench/litellm_proxy-nemotron-3-ultra-550b-a55b-or-paid/27052068762.@juanmichelini can click here to continue refining the PR
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.13-nodejs22-slimgolang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:95ef0cb-pythonRun
All tags pushed for this build
About Multi-Architecture Support
95ef0cb-python) is a multi-arch manifest supporting both amd64 and arm6495ef0cb-python-amd64) are also available if needed