DRAFT: feat(stats): ConversationStats.cache_efficiency_summary() probe#3567
DRAFT: feat(stats): ConversationStats.cache_efficiency_summary() probe#3567juanmichelini wants to merge 3 commits into
Conversation
A single structured probe that aggregates prompt-cache health across
every LLM used in a conversation, plus a per-LLM breakdown.
Dashboards and eval harnesses can call this directly instead of
iterating ConversationStats internals or scraping log lines.
Output shape (stable):
{
'prompt_tokens', 'cache_read_tokens', 'cache_write_tokens',
'completion_tokens',
'cache_hit_rate': float | None, # weighted by prompt tokens
'cache_write_rate': float | None,
'per_usage': {usage_id: {model, prompt_tokens, cache_read_tokens,
cache_write_tokens, cache_hit_rate}},
}
Two design choices worth flagging:
* Hit rate is weighted by prompt_tokens across LLMs, not a naive mean
of per-LLM rates. A loud uncached side-car LLM (1k prompt tokens)
doesn't drag a healthy 100k-token main run down to 50%.
* Aggregate rates are clamped to 1.0 so dashboards stay sane when a
provider double-counts cache reads outside prompt_tokens.
None is returned for hit/write rate when no prompt tokens have been
billed yet, so 'no data' is distinct from '0% hit'.
Motivated by an eval where the full prompt-cache picture was only
visible after summing per-LLM TokenUsage by hand.
Co-authored-by: openhands <openhands@all-hands.dev>
Python API breakage checks — ✅ PASSEDResult: ✅ PASSED |
REST API breakage checks (OpenAPI) — ✅ PASSEDResult: ✅ PASSED |
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review: ConversationStats.cache_efficiency_summary()
🟢 Good taste — Clean, well-scoped addition with solid test coverage.
Summary
This PR adds a structured probe for cache efficiency metrics that dashboards and eval harnesses can call directly without scraping log lines. The implementation is straightforward: aggregate token counts across all LLMs, compute hit/write rates, and provide a per-LLM breakdown.
What works well:
- Edge cases handled correctly: empty stats →
Nonerates, providers over-reporting cache tokens → clamped to 1.0 - The
_rate()helper keeps the main loop clean - Tests cover the important cases including the "weird provider" scenario
- Docstring clearly explains the return shape and semantics
Suggestion (non-blocking):
- [conversation_stats.py, line 130] The return type is
dict[str, Any]. ATypedDictordataclasswould give callers type-safe access and make the shape self-documenting in IDEs. Not blocking — the current dict works fine for a probe, and retrofitting might be out of scope for this PR.
Testing:
- All 6 tests pass. Coverage is good — includes weighted aggregation, clamping, fallback behavior, and empty state.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟢 LOW
No breaking changes. Pure additive feature with comprehensive tests. No external dependencies or security implications.
VERDICT:
✅ Worth merging — Solid addition, no blocking issues.
KEY INSIGHT:
The design decision to return a dict (vs TypedDict) keeps the probe lightweight and flexible for dashboards, while the comprehensive tests ensure edge cases like provider double-counting are handled correctly.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
| hit_rate = None | ||
| write_rate = None | ||
|
|
||
| return { |
There was a problem hiding this comment.
🟡 Suggestion: Consider using a TypedDict or dataclass for the return type to give callers type-safe access. Not blocking for this PR, but would improve IDE support and self-document the shape.
from typing import TypedDict
class CacheEfficiencySummary(TypedDict):
prompt_tokens: int
cache_read_tokens: int
cache_write_tokens: int
completion_tokens: int
cache_hit_rate: float | None
cache_write_rate: float | None
per_usage: dict[str, dict[str, Any]]
all-hands-bot
left a comment
There was a problem hiding this comment.
✅ QA Report: PASS
I verified the new SDK probe by importing ConversationStats as a user would and calling cache_efficiency_summary() with realistic multi-LLM token usage; it returns the claimed stable aggregate and per-usage shape.
Does this PR achieve its stated goal?
Yes. On origin/main, ConversationStats has no cache_efficiency_summary() method; at PR commit 82e53b5, the same user script exposes the method and returns aggregate prompt/cache/completion totals, token-weighted hit rate (9000 / 11000 = 0.818...), clamped provider over-count rates, None rates for no prompt tokens, and model-name fallback. This delivers the promised dashboard/eval-harness probe without callers hand-iterating usage_to_metrics.
| Phase | Result |
|---|---|
| Environment Setup | ✅ make build completed successfully; no tests/linters were run. |
| CI Status | 🟡 GitHub reports 35 passing, 1 pending (qa-changes), 15 skipped. |
| Functional Verification | ✅ Before/after SDK probe confirms the feature works as described. |
Functional Verification
Test 1: New ConversationStats.cache_efficiency_summary() SDK probe
Step 1 — Reproduce / establish baseline (without the fix):
Ran git switch --detach origin/main --quiet && uv run python /tmp/cache_summary_user_probe.py:
has_cache_efficiency_summary False
call_result AttributeError 'ConversationStats' object has no attribute 'cache_efficiency_summary'
This confirms the baseline problem: a user/dashboard cannot call a structured cache-efficiency probe and would need to compute the summary manually from usage_to_metrics.
Step 2 — Apply the PR's changes:
Checked out the PR branch/commit with git switch openhands/p1-cache-efficiency-summary --quiet && git checkout 82e53b5e1a65a13e0e962939fd1c751c5007cbed --quiet.
Step 3 — Re-run with the fix in place:
Ran uv run python /tmp/cache_summary_user_probe.py:
has_cache_efficiency_summary True
empty {"cache_hit_rate": null, "cache_read_tokens": 0, "cache_write_rate": null, "cache_write_tokens": 0, "completion_tokens": 0, "per_usage": {}, "prompt_tokens": 0}
weighted_multi_llm {"cache_hit_rate": 0.8181818181818182, "cache_read_tokens": 9000, "cache_write_rate": 0.045454545454545456, "cache_write_tokens": 500, "completion_tokens": 225, "per_usage": {"cached-llm": {"cache_hit_rate": 0.9, "cache_read_tokens": 9000, "cache_write_tokens": 500, "model": "claude-sonnet-4-5", "prompt_tokens": 10000}, "uncached-sidecar": {"cache_hit_rate": 0.0, "cache_read_tokens": 0, "cache_write_tokens": 0, "model": "nemotron-3-ultra-550b", "prompt_tokens": 1000}}, "prompt_tokens": 11000}
provider_double_count {"cache_hit_rate": 1.0, "cache_read_tokens": 1500, "cache_write_rate": 1.0, "cache_write_tokens": 1200, "completion_tokens": 0, "per_usage": {"double-count-provider": {"cache_hit_rate": 1.0, "cache_read_tokens": 1500, "cache_write_tokens": 1200, "model": "weird-model", "prompt_tokens": 1000}}, "prompt_tokens": 1000}
model_fallback {"cache_hit_rate": 0.5, "cache_read_tokens": 50, "cache_write_rate": 0.0, "cache_write_tokens": 0, "completion_tokens": 0, "per_usage": {"fallback-model": {"cache_hit_rate": 0.5, "cache_read_tokens": 50, "cache_write_tokens": 0, "model": "metrics-model-name", "prompt_tokens": 100}}, "prompt_tokens": 100}
registered_empty {"cache_hit_rate": null, "cache_read_tokens": 0, "cache_write_rate": null, "cache_write_tokens": 0, "completion_tokens": 0, "per_usage": {"empty-registered": {"cache_hit_rate": null, "cache_read_tokens": 0, "cache_write_tokens": 0, "model": "registered-but-no-completion", "prompt_tokens": 0}}, "prompt_tokens": 0}
This confirms the new method is available through the SDK export and returns the promised structure. The multi-LLM aggregate is token-weighted rather than a naive average, over-counted cache reads/writes are clamped to 1.0, empty prompt-token cases preserve None rates, and the per-usage breakdown surfaces individual usage IDs and model names.
Setup and CI observation
Ran make build:
Dependencies installed successfully.
Pre-commit hooks installed successfully.
Build complete! Development environment is ready.
Ran gh pr checks 3567 --repo OpenHands/software-agent-sdk ...:
pass: 35
pending: 1
skipping: 15
pending IN_PROGRESS qa-changes
No local test suite, linter, formatter, type checker, or pre-commit hook was run.
Issues Found
None.
This QA review was created by an AI agent (OpenHands) on behalf of the user.
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review: cache_efficiency_summary()
🟡 Acceptable — Core logic is sound, but there are a few opportunities to improve clarity and reduce complexity.
[IMPROVEMENT OPPORTUNITIES]
-
[openhands-sdk/openhands/sdk/conversation/conversation_stats.py, Line 96] Complexity: The local
_ratehelper is a one-liner used twice. Consider inliningmin(1.0, num / den) if den > 0 else Nonedirectly or extracting it as a module-level helper if reused elsewhere. -
[openhands-sdk/openhands/sdk/conversation/conversation_stats.py, Line 116] Naming: The
per_usagedict key is the internalusage_id. If dashboards need to aggregate by model, they'll need to iterate through the dict. Consider whether the return structure should include aper_modelaggregation as well, or document that consumers should key by usage_id and accessper_usage[id]['model']for the model name.
[STYLE NOTES]
-
The comment on line 83-84 explaining provider double-counting is appropriate — it describes a non-obvious invariant that callers need to understand. ✅
-
The docstring is thorough and the return shape is well-documented. ✅
[TESTING]
Tests are solid:
- Edge cases (empty stats, no completions) are covered
- Weighted aggregate calculation is verified correctly (9k/11k ≠ mean of 0.9 and 0.0)
- Clamping behavior for >100% is tested
- Model fallback from
metrics.model_nameis covered
No testing gaps identified. Tests exercise real code paths and assert on actual outputs.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟢 LOW
This is a pure additive feature — new method, new tests. No breaking changes to existing APIs. No external dependencies added. The risk of this PR is minimal.
VERDICT:
✅ Worth merging: The implementation is correct, well-tested, and solves a real problem (dashboards need a structured probe instead of scraping log lines).
KEY INSIGHT:
The weighted aggregate (total_cache_read / total_prompt) is correct — this avoids the trap of naive per-LLM averaging that would hide a noisy participant behind healthy averages.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
| } | ||
|
|
||
| ``cache_hit_rate`` is ``None`` when no prompt tokens were | ||
| billed (lets callers distinguish "no data" from "0% hit"). |
There was a problem hiding this comment.
🟡 Suggestion: The local _rate helper is a one-liner used twice. Could be inlined directly:
hit_rate = min(1.0, total_cache_read / total_prompt) if total_prompt > 0 else None
write_rate = min(1.0, total_cache_write / total_prompt) if total_prompt > 0 else NoneOr if you prefer keeping the helper, rename it to _clamp_rate to better describe what it does.
all-hands-bot
left a comment
There was a problem hiding this comment.
✅ QA Report: PASS
The new SDK probe was exercised through real uv run python scripts and returned the advertised aggregate/per-usage cache metrics, including edge cases.
Does this PR achieve its stated goal?
Yes. The PR set out to add a structured ConversationStats.cache_efficiency_summary() probe so dashboards/eval harnesses no longer have to hand-roll prompt-cache aggregation. On the base branch the method is absent; on the PR branch, the same SDK usage returns the promised shape with totals, weighted rates, per-usage details, model fallback, None rates for no prompt tokens, and clamped >100% provider reports.
| Phase | Result |
|---|---|
| Environment Setup | ✅ uv run python created/used the project environment and imported the SDK successfully. |
| CI Status | unresolved-review-threads was failing and several build/QA checks were still pending. |
| Functional Verification | ✅ SDK users can call the new method and get the claimed cache summary behavior. |
Functional Verification
Test 1: New SDK method exists and returns the advertised single-LLM summary
Step 1 — Reproduce / establish baseline without the fix:
Ran on origin/main:
git switch --detach origin/main && OPENHANDS_SUPPRESS_BANNER=1 uv run python - <<'PY'
from openhands.sdk import ConversationStats
from openhands.sdk.llm.utils.metrics import Metrics, TokenUsage
stats = ConversationStats()
metrics = Metrics(model_name='claude-sonnet-4-5')
metrics.accumulated_token_usage = TokenUsage(
prompt_tokens=10000,
completion_tokens=200,
cache_read_tokens=8000,
cache_write_tokens=500,
)
stats.usage_to_metrics['main'] = metrics
print(stats.cache_efficiency_summary())
PYOutput:
AttributeError: 'ConversationStats' object has no attribute 'cache_efficiency_summary'
This confirms the prior SDK had no single structured cache-efficiency probe; a user/dashboard would need to aggregate metrics manually.
Step 2 — Apply the PR's changes:
Checked out openhands/p1-cache-efficiency-summary at 7000b744b31890ec05f7438f227fe2a1d905695d.
Step 3 — Re-run with the fix in place:
Ran the same SDK script on the PR branch:
{'prompt_tokens': 10000, 'cache_read_tokens': 8000, 'cache_write_tokens': 500, 'completion_tokens': 200, 'cache_hit_rate': 0.8, 'cache_write_rate': 0.05, 'per_usage': {'main': {'model': 'claude-sonnet-4-5', 'prompt_tokens': 10000, 'cache_read_tokens': 8000, 'cache_write_tokens': 500, 'cache_hit_rate': 0.8}}}
This shows the method is now callable and returns the promised stable shape with correct totals and rates for a realistic SDK consumer.
Test 2: Multi-LLM weighted aggregation, per-usage detail, empty usage, and model fallback
Ran a user-style SDK script with a cached main LLM, uncached sidecar, registered-not-started LLM, and TokenUsage without a model:
--- multi-usage summary ---
{
"cache_hit_rate": 0.8153153153153153,
"cache_read_tokens": 9050,
"cache_write_rate": 0.0472972972972973,
"cache_write_tokens": 525,
"completion_tokens": 260,
"per_usage": {
"fallback": {
"cache_hit_rate": 0.5,
"cache_read_tokens": 50,
"cache_write_tokens": 25,
"model": "fallback-model-name",
"prompt_tokens": 100
},
"main-agent": {
"cache_hit_rate": 0.9,
"cache_read_tokens": 9000,
"cache_write_tokens": 500,
"model": "claude-sonnet-4-5",
"prompt_tokens": 10000
},
"not-started": {
"cache_hit_rate": null,
"cache_read_tokens": 0,
"cache_write_tokens": 0,
"model": "registered-not-started",
"prompt_tokens": 0
},
"sidecar": {
"cache_hit_rate": 0.0,
"cache_read_tokens": 0,
"cache_write_tokens": 0,
"model": "nemotron-3-ultra-550b",
"prompt_tokens": 1000
}
},
"prompt_tokens": 11100
}
This confirms aggregate hit/write rates are token-weighted from totals, not a naive mean; the uncached sidecar is still visible in per_usage; a not-yet-started LLM preserves null rate semantics; and the model-name fallback works.
Test 3: Empty stats and provider double-count clamp
Ran edge-case SDK scripts for empty stats and cache counters larger than prompt tokens:
--- clamp scenario ---
{
"cache_hit_rate": 1.0,
"cache_read_tokens": 1500,
"cache_write_rate": 1.0,
"cache_write_tokens": 1250,
"completion_tokens": 0,
"per_usage": {
"weird-provider": {
"cache_hit_rate": 1.0,
"cache_read_tokens": 1500,
"cache_write_tokens": 1250,
"model": "provider-double-counts",
"prompt_tokens": 1000
}
},
"prompt_tokens": 1000
}
--- empty stats scenario ---
{
"cache_hit_rate": null,
"cache_read_tokens": 0,
"cache_write_rate": null,
"cache_write_tokens": 0,
"completion_tokens": 0,
"per_usage": {},
"prompt_tokens": 0
}
This confirms dashboards will not see >100% aggregate rates and can distinguish “no data yet” from a real 0% cache hit rate.
Issues Found
None.
This QA review was generated by an AI agent (OpenHands) on behalf of the user.
Why
Reconstructing prompt-cache health for a multi-LLM conversation today means iterating
ConversationStats.usage_to_metricsand summingTokenUsagefields by hand. Dashboards and eval harnesses keep reinventing the same loop, often picking the wrong aggregation (e.g. mean of per-LLM rates instead of token-weighted).What
A single structured probe with a stable output shape:
{ "prompt_tokens", "cache_read_tokens", "cache_write_tokens", "completion_tokens", "cache_hit_rate": float | None, # weighted by prompt tokens "cache_write_rate": float | None, "per_usage": { "<usage_id>": {"model", "prompt_tokens", "cache_read_tokens", "cache_write_tokens", "cache_hit_rate"}, ... }, }Two design choices worth flagging in review:
prompt_tokens, not a mean of per-LLM rates. A loud uncached sidecar LLM (e.g. 1k prompt tokens) shouldn't drag a healthy 100k-token main run down to 50%.Nonefor rates when no prompt tokens have been recorded — preserves the "no data" vs "0% hit" distinction. Per-LLM breakdown still surfaces the bad citizen even when the aggregate is healthy.Aggregate rates are also clamped to
1.0so dashboards stay sane when a provider double-counts cache reads outsideprompt_tokens.Tests
tests/sdk/conversation/test_conversation_stats_cache_summary.py— 6 new tests covering empty stats, the multi-LLM weighted aggregation (the 9k/11k ≈ 0.818 case), the >100% clamp, and the model-name fallback. All 13 stats tests pass; linters + pyright clean.Risk / blast radius
Purely additive — one new method on
ConversationStats. No schema, persistence, or wire-format change.Stacking
Independent of #3566 (SDK-3); both can land in either order.
This PR was created by an AI agent (OpenHands) on behalf of the user investigating low throughput in eval run
swebench/litellm_proxy-nemotron-3-ultra-550b-a55b-or-paid/27052068762.@juanmichelini can click here to continue refining the PR
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.13-nodejs22-slimgolang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:91df995-pythonRun
All tags pushed for this build
About Multi-Architecture Support
91df995-python) is a multi-arch manifest supporting both amd64 and arm6491df995-python-amd64) are also available if needed