DRAFT: feat(conversation): per-run cost and input-token budgets#3564
DRAFT: feat(conversation): per-run cost and input-token budgets#3564juanmichelini wants to merge 3 commits into
Conversation
Adds two optional caps to LocalConversation alongside the existing max_iteration_per_run limit: * max_cost_per_run (USD, float | None) * max_input_tokens_per_run (int | None) The agent loop checks both after every iteration (sync run, async arun, and the ACP-step path). When either limit is hit the conversation transitions to ERROR and emits a ConversationErrorEvent with code MaxCostExceeded / MaxInputTokensExceeded, mirroring the MaxIterationsReached path (including the 'preserve FINISHED on last iteration' exception). Motivated by eval runs where a single instance burned >$10 and >20M prompt tokens before any limit fired — costing the whole job's wall-clock budget. A per-run cap kills runaway instances early so the worker can move on. Both defaults are None (purely additive). RemoteConversation raises NotImplementedError when these are set so misuse is loud, not silent — the agent-server side will be enabled in a follow-up once the ConversationRequest schema is extended. Co-authored-by: openhands <openhands@all-hands.dev>
Python API breakage checks — ❌ FAILEDResult: ❌ FAILED
Log excerpt (first 1000 characters) |
REST API breakage checks (OpenAPI) — ✅ PASSEDResult: ✅ PASSED |
Coverage Report •
|
||||||||||||||||||||||||||||||
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review: Per-run Cost and Input-token Budgets
Taste Rating
🟡 Acceptable - Works but could be cleaner
Summary
This PR introduces per-run budget enforcement for cost and input tokens in LocalConversation, along with related cleanup in the ACP command handling. The core feature is well-designed with proper validation, error handling, and useful fork inheritance. However, the ACP version pinning removal introduces a behavioral change worth discussing.
[IMPROVEMENT OPPORTUNITIES]
-
[openhands-sdk/openhands/sdk/settings/acp_providers.py] Breaking Change: Removing pinned ACP package versions (
CLAUDE_AGENT_ACP_VERSION,CODEX_ACP_VERSION,GEMINI_CLI_VERSION) meansnpx -y @pkgwill now resolve tolatestat runtime. This is a significant behavioral change for users who rely on version stability. The comment in the removed code mentioned syncing with the Dockerfile — has that been addressed? -
[openhands-sdk/openhands/sdk/settings/model.py] Simplification: The
_prefer_pinned_binarychange from version-agnostic matching to exact comparison loses the ability to rewrite version-specified packages to the pinned binary. If this is intentional, the behavioral shift should be documented in the PR description. -
[openhands-sdk/openhands/sdk/conversation/impl/local_conversation.py] Documentation: The
_check_run_budgetsimplementation is solid. The docstring is comprehensive and explains the cache-write token exclusion rationale. Well done.
[TESTING]
-
The new test file
tests/sdk/conversation/local/test_conversation_run_budgets.pyis thorough with good coverage of:- Construction validation (positive values required)
- Budget checking semantics (disabled, cost breach, token breach, precedence)
- End-to-end emission (error status + event, status preservation on FINISHED)
- Remote conversation guardrail
-
The
cost_breach_takes_precedence_over_token_breachtest documents an important behavior. Consider documenting this precedence in the public API so users know what to expect. -
[tests/sdk/test_settings.py] The test removal for version-agnostic matching aligns with the simplification in
model.py— good alignment.
[RISK ASSESSMENT]
⚠️ Risk Assessment: 🟡 MEDIUM
Key factors:
- ACP Version Stability: Removing pinned versions means
npx -ywill pulllatestinstead of a known-good version. This could break reproducibility. - Binary Path Matching: The change from version-agnostic matching to exact comparison means users with explicit version specs won't get the binary rewrite.
- Feature Completeness: The
RemoteConversationrejection is correct, and the follow-up tracking reference in the error message is good practice. - Test Coverage: The new tests are solid and cover edge cases well.
[VERDICT]
✅ Worth merging — Core logic is sound with good test coverage. The ACP version pinning removal is a notable change that should be explicitly called out in the PR description.
[KEY INSIGHT]
The budget enforcement implementation is elegant — it follows the existing pattern for max_iteration_per_run and correctly preserves FINISHED status on tail calls. The main concern is the ACP version pins removal, which affects version stability.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
all-hands-bot
left a comment
There was a problem hiding this comment.
⚠️ QA Report: PASS WITH ISSUES
Per-run budget options are accepted and do stop ongoing tool-using runs with typed error events, but a completed one-step run can exceed the configured budget without any error event.
Does this PR achieve its stated goal?
Partially. I verified with real SDK conversations and real LLM calls that max_cost_per_run and max_input_tokens_per_run are new user-facing kwargs and that both produce ConversationExecutionStatus.ERROR plus MaxCostExceeded / MaxInputTokensExceeded events during a tool-using run. However, I also ran a common one-step conversation with max_input_tokens_per_run=1; it consumed 3,642 prompt tokens but finished successfully with no ConversationErrorEvent, so the PR does not consistently deliver the stated “On breach: status → ERROR, event emitted” behavior.
| Phase | Result |
|---|---|
| Environment Setup | ✅ make build completed successfully |
| CI Status | pre-commit and Python API failing; most package tests/builds are green; I did not rerun CI |
| Functional Verification |
Functional Verification
Test 1: Establish baseline — main does not accept budget kwargs
Step 1 — Reproduce / establish baseline (without the fix):
Checked out main (567eb218) and ran a short SDK script that constructs Conversation(..., max_input_tokens_per_run=1).
Output excerpt:
{
"accepted_budget_kwargs": false,
"exception_type": "TypeError",
"message": "Conversation.__new__() got an unexpected keyword argument 'max_input_tokens_per_run'"
}This confirms the new per-run budget API is absent on the base branch.
Step 2 — Apply the PR's changes:
Checked out openhands/sdk-1-per-run-budgets at cacc203e184339c41ecb02f91afc5693114d195b.
Step 3 — Re-run with the fix in place:
Ran an SDK construction/validation script with accepted values, invalid values, and a RemoteWorkspace.
Output excerpt:
{
"accepted_values": {
"max_cost_per_run": 1.25,
"max_input_tokens_per_run": 1000
},
"zero_cost": {
"exception_type": "ValueError",
"message": "max_cost_per_run must be positive when set"
},
"negative_input_tokens": {
"exception_type": "ValueError",
"message": "max_input_tokens_per_run must be positive when set"
},
"remote_budget": {
"exception_type": "NotImplementedError",
"message": "max_cost_per_run and max_input_tokens_per_run are not yet supported for RemoteConversation. Track the agent server follow-up via the SDK-1 PR description."
}
}This shows the PR adds the advertised local-conversation kwargs, validates non-positive values, and rejects remote usage loudly.
Test 2: Input-token budget stops an ongoing real run
Step 1 — Reproduce / establish baseline:
On main, this behavior cannot be exercised because the budget kwarg is rejected as shown in Test 1.
Step 2 — Apply the PR's changes:
On the PR branch, I ran a real SDK conversation with TerminalTool, max_input_tokens_per_run=1, and asked the agent to run printf budget-qa.
Step 3 — Re-run with the fix in place:
Output excerpt after the real LLM/tool run:
{
"status": "ConversationExecutionStatus.ERROR",
"error_events": [
{
"code": "MaxInputTokensExceeded",
"detail": "Agent reached maximum input tokens per run (4521 >= 1)."
}
],
"prompt_tokens": 4521,
"accumulated_cost": 0.024135000000000004
}This confirms the input-token budget check works when the run is still active after a tool action.
Test 3: Cost budget stops an ongoing real run
Step 1 — Reproduce / establish baseline:
On main, this behavior cannot be exercised because the budget kwarg is rejected as shown in Test 1.
Step 2 — Apply the PR's changes:
On the PR branch, I ran a real SDK conversation with TerminalTool, max_cost_per_run=0.000001, and asked the agent to run printf cost-budget-qa.
Step 3 — Re-run with the fix in place:
Output excerpt after the real LLM/tool run:
{
"status": "ConversationExecutionStatus.ERROR",
"error_events": [
{
"code": "MaxCostExceeded",
"detail": "Agent reached maximum cost per run ($0.02 >= $0.00)."
}
],
"prompt_tokens": 4522,
"accumulated_cost": 0.02417
}This confirms the cost budget check works for an ongoing run.
Test 4: Over-budget one-step runs finish successfully instead of erroring
Step 1 — Reproduce / establish baseline:
On main, the budget kwarg is rejected, so this is a new-behavior check on the PR branch.
Step 2 — Apply the PR's changes:
On the PR branch, I ran a real one-step SDK conversation with no tools, max_input_tokens_per_run=1, and asked: Reply with exactly: OK.
Step 3 — Re-run with the fix in place:
Output excerpt:
{
"status": "ConversationExecutionStatus.FINISHED",
"error_events": [],
"prompt_tokens": 3642,
"accumulated_cost": 0.01836
}This shows the run exceeded the configured input-token budget by 3,641 tokens but still ended as FINISHED with no typed budget event. That conflicts with the PR description's promise that a budget breach emits MaxInputTokensExceeded and moves the conversation to ERROR.
Issues Found
- 🟠 Issue: A single-step run that finishes on the same iteration as the budget breach does not emit
MaxCostExceeded/MaxInputTokensExceededor setERROR, so harnesses cannot distinguish over-budget successful completions from normal completions.
This review was generated by an AI agent (OpenHands) on behalf of the user.
| successful tail-call is preserved (mirrors how | ||
| ``MaxIterationsReached`` is handled). | ||
| """ | ||
| if self._state.execution_status == ConversationExecutionStatus.FINISHED: |
There was a problem hiding this comment.
🟠 Important: I exercised this with a real SDK one-step run (max_input_tokens_per_run=1, prompt Reply with exactly: OK). The run consumed 3,642 prompt tokens but ended as ConversationExecutionStatus.FINISHED with no ConversationErrorEvent because this early return drops budget hits after a final agent step. That conflicts with the PR goal/description that a budget breach should set ERROR and emit MaxInputTokensExceeded/MaxCostExceeded, and it means eval harnesses cannot detect over-budget successful single-step runs.
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review: Per-run Cost and Input-token Budgets
Taste Rating
🟢 Good taste - Elegant, simple solution → Core implementation is solid
Summary
This PR adds per-run budget enforcement for cost and input tokens in LocalConversation. The implementation is well-structured and follows existing patterns (mirrors max_iteration_per_run behavior). The RemoteConversation guardrail correctly prevents silent failures. The new test file is thorough.
[CRITICAL ISSUES]
None. The implementation is sound.
[IMPROVEMENT OPPORTUNITIES]
- [openhands-sdk/openhands/sdk/conversation/impl/local_conversation.py, Line 440-456] Consistency with MaxIterationsReached: The
_emit_run_budget_exceededearly return whenexecution_status == FINISHEDmirrors the iteration limit behavior — this is intentional and consistent. However, this means single-step runs that finish on the same iteration as a budget breach will end asFINISHEDrather thanERROR. If you want over-budget completions to always produceERROR, the check would need to happen before the agent step finalizesFINISHEDstatus. The current behavior matches the existing iteration limit pattern, so this is a design choice rather than a bug.
[TESTING]
- [tests/sdk/conversation/local/test_conversation_run_budgets.py] Comprehensive coverage: The test file is thorough — covers construction validation, budget semantics, event emission, and the RemoteConversation guardrail. The
cost_breach_takes_precedence_over_token_breachtest documents an important behavior decision.
[RISK ASSESSMENT]
⚠️ Risk Assessment: 🟢 LOW
Key factors:
- API surface: New parameters are opt-in (default
None), so no impact on existing callers - Validation: Non-positive values are rejected early with clear errors
- Remote workspace: Guardrail prevents silent failures, pointing to the SDK-1 tracking
- Fork inheritance: Budget limits are correctly passed to forked conversations
- Test coverage: Comprehensive unit tests with good edge case coverage
[VERDICT]
✅ Worth merging — Clean implementation following established patterns. No blocking issues.
[KEY INSIGHT]
The budget enforcement implementation is well-designed: it uses a dedicated _check_run_budgets() method that returns early when both limits are disabled, performs cost and token checks in sequence (cost first for higher signal), and mirrors the existing iteration limit pattern for status preservation on tail calls.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
all-hands-bot
left a comment
There was a problem hiding this comment.
🟡 QA Report: PARTIAL
Most SDK budget behavior works in real local runs; ACP enforcement was not exercised, and CI currently has failing checks.
Does this PR achieve its stated goal?
Partially verified. Yes for the LocalConversation API, sync run loop, async run loop, fork propagation, and RemoteConversation guardrail: I exercised those paths through the public SDK and observed budget caps producing status=error with typed ConversationErrorEvent codes (MaxInputTokensExceeded / MaxCostExceeded). I did not verify the ACP step path because no ACP-compatible runtime/server was configured in this QA environment.
| Phase | Result |
|---|---|
| Environment Setup | ✅ make build completed successfully; no tests/linters were run locally. |
| CI Status | gh pr checks: 3 failing, 18 successful, 3 skipped, 8 pending. |
| Functional Verification | 🟡 Local sync/async SDK behavior verified; ACP path not verified. |
Functional Verification
Test 1: Baseline — new budget kwargs do not exist on base
Step 1 — Reproduce / establish baseline (without the fix):
Checked out origin/main and ran a short SDK script that constructed Conversation(...) with the new kwargs:
git switch --detach origin/main
uv run python - <<'PY'
# constructs Conversation(..., max_cost_per_run=0.01)
# constructs Conversation(..., max_input_tokens_per_run=100)
PYObserved:
cost budget: TypeError: Conversation.__new__() got an unexpected keyword argument 'max_cost_per_run'
input-token budget: TypeError: Conversation.__new__() got an unexpected keyword argument 'max_input_tokens_per_run'
This establishes the baseline: users could not pass direct per-run cost or input-token budgets before the PR.
Step 2 — Apply the PR's changes:
Checked out openhands/sdk-1-per-run-budgets at e4076c7c088cfd615e8fa63d0919b4fc79e18916.
Step 3 — Re-run with the fix in place:
Ran the same constructor-level SDK checks plus validation and remote guardrails.
Observed:
local constructed: LocalConversation
cost budget: 0.01
input budget: 100
zero cost: ValueError: max_cost_per_run must be positive when set
negative tokens: ValueError: max_input_tokens_per_run must be positive when set
remote cost budget: NotImplementedError: max_cost_per_run and max_input_tokens_per_run are not yet supported for RemoteConversation. Track the agent server follow-up via the SDK-1 PR description.
remote input budget: NotImplementedError: max_cost_per_run and max_input_tokens_per_run are not yet supported for RemoteConversation. Track the agent server follow-up via the SDK-1 PR description.
This confirms the new LocalConversation API is usable, invalid caps are rejected, and RemoteConversation misuse is loud rather than silently ignored.
Test 2: Real sync run() budget enforcement with LLM + terminal tool
Step 1 — Baseline:
Base branch rejected the new kwargs in Test 1, so there was no public entry point to enforce these run-scoped caps.
Step 2 — Apply the PR's changes:
On the PR branch, I ran real SDK conversations using LLM_MODEL=litellm_proxy/openai/gpt-5.5, the configured LLM proxy, and TerminalTool.
Step 3 — Re-run with the fix in place:
For the input-token cap, I ran Conversation(..., max_input_tokens_per_run=1) and asked the agent to create a file with the terminal tool.
Observed:
status=error
error_count=1
error=MaxInputTokensExceeded: Agent reached maximum input tokens per run (4520 >= 1).
prompt_tokens=4520
accumulated_cost=0.02488000
file_exists=True
For the cost cap, I ran Conversation(..., max_cost_per_run=0.000001) and asked the agent to use the terminal tool.
Observed:
status=error
error_count=1
error_codes=MaxCostExceeded
error_details=Agent reached maximum cost per run ($0.02 >= $0.00).
accumulated_cost=0.02399500
This confirms sync local runs check real accumulated metrics after an actual LLM/tool step and emit typed budget error events.
Test 3: Real async arun() budget enforcement
Step 1 — Baseline:
Base branch had no public budget kwargs, as shown in Test 1.
Step 2 — Apply the PR's changes:
On the PR branch, I ran an async SDK conversation with await conv.arun() using the same real LLM + terminal setup.
Step 3 — Re-run with the fix in place:
Ran Conversation(..., max_input_tokens_per_run=1) and awaited conv.arun().
Observed:
async_status=error
async_error_count=1
async_error_codes=MaxInputTokensExceeded
async_prompt_tokens=4511
async_cost=0.00556300
This confirms async local runs also enforce the input-token budget and emit the typed event.
Test 4: Fork propagation
Step 1 — Baseline:
Base branch could not construct a budgeted conversation, so there was no budget state to propagate through fork().
Step 2 — Apply the PR's changes:
On the PR branch, I constructed a local conversation with both budgets and called conv.fork().
Step 3 — Re-run with the fix in place:
Observed:
parent_cost=0.25
parent_tokens=1234
fork_type=LocalConversation
fork_cost=0.25
fork_tokens=1234
This confirms forked LocalConversations retain both budget settings.
Unable to Verify
- ACP step budget enforcement was not functionally exercised. I verified sync
run()and asyncarun()with real LLM/tool execution, but this QA environment did not have an ACP-compatible runtime/server configured to exercise the ACP-specific path as a real user would. Future QA guidance inAGENTS.mdcould name a lightweight ACP server fixture or example command for manual ACP-agent verification.
Issues Found
- 🟠 Issue: CI is not green at review time:
Python API breakage checks/Python API,Pre-commit checks/pre-commit, andReview Thread Gate/unresolved-review-threadsare failing, with 8 checks still pending. I did not rerun any tests or linters locally per QA instructions.
Final verdict: PARTIAL — local sync/async budget behavior works as described, but ACP enforcement remains unverified and CI is currently failing.
This review was created by an AI agent (OpenHands) on behalf of the user.
Why
The Conversation already exposes
max_iteration_per_run, but iteration count is a poor proxy for spend: a single step can be cheap or hundreds of thousands of input tokens. On eval runs where the chosen model has no prompt-cache support (e.g. a 550B Nemotron), a run can blow past$100while iteration count looks healthy.This PR adds two run-scoped budgets so harnesses can cap spend and prompt-token volume directly.
What
max_cost_per_run: float | Noneandmax_input_tokens_per_run: int | NoneonConversation/LocalConversation(validated> 0, elseValueError)._check_run_budgets()check is wired into all three loop sites (syncrun, asyncarun, ACP step).ERROR, aConversationErrorEventis emitted with codeMaxCostExceeded/MaxInputTokensExceededso harnesses can distinguish budget exits from other failures.fork()passes the new params through.RemoteConversationrejects them at construction (no semantics there yet).Tests
tests/sdk/conversation/local/test_conversation_run_budgets.py— 13 new tests covering both budgets, validation, the typed event, and theRemoteConversationrejection. All pass; linters + pyright clean.Risk / blast radius
None⇒ existing callers see zero behavior change.O(1)per step (readsMetrics); negligible.This PR was created by an AI agent (OpenHands) on behalf of the user investigating low throughput in eval run
swebench/litellm_proxy-nemotron-3-ultra-550b-a55b-or-paid/27052068762.@juanmichelini can click here to continue refining the PR
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.13-nodejs22-slimgolang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:f60b8bb-pythonRun
All tags pushed for this build
About Multi-Architecture Support
f60b8bb-python) is a multi-arch manifest supporting both amd64 and arm64f60b8bb-python-amd64) are also available if needed