DRAFT: feat(observability): cache hit/write rate + non-caching-model warning#3566
DRAFT: feat(observability): cache hit/write rate + non-caching-model warning#3566juanmichelini wants to merge 3 commits into
Conversation
Surfaces 'this run will pay full prompt-token cost on every step' as a WARN at conversation construction, and exposes hit/write rate as first-class derived metrics so dashboards and eval harnesses can flag non-cached runs without per-provider arithmetic. Changes: * TokenUsage.cache_hit_rate / cache_write_rate properties returning float in [0, 1] or None when no prompt tokens were recorded. Defensive min(1.0, ...) clamp keeps dashboards sane when providers double-count cache reads (some bill them inside prompt_tokens, some don't). * LocalConversation._warn_if_prompt_cache_inactive() fires once at __init__ when llm.is_caching_prompt_active() is False, naming the model so log aggregation can group by it. Wrapped in try/except so a logging failure can never break conversation construction. Motivated by an eval run on a 550B model with no prompt-cache support where 95%+ of the bill was repeated prompt tokens — the SDK already knew the model didn't support caching but nothing surfaced it. Co-authored-by: openhands <openhands@all-hands.dev>
Python API breakage checks — ✅ PASSEDResult: ✅ PASSED |
REST API breakage checks (OpenAPI) — ✅ PASSEDResult: ✅ PASSED |
Coverage Report •
|
||||||||||||||||||||||||||||||
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review: PR #3566 — DRAFT: feat(observability): cache hit/write rate + non-caching-model warning
Taste Rating
🟢 Good taste — Elegant, minimal additions that solve a real operational problem.
Summary
Two clean, additive improvements to the SDK's observability surface:
-
TokenUsage.cache_hit_rate/cache_write_rate— Derived properties that expose cache efficiency as normalized floats[0, 1], clamping to 1.0 to handle provider quirks. Thefloat | Nonereturn type correctly distinguishes "no data yet" from "0% hit rate". -
LocalConversationnon-caching warning — One-timelogger.warningat conversation construction whenis_caching_prompt_active() == False. Thetry/exceptguard is the right defensive posture — a logging failure should never break conversation construction.
The PR description is exemplary: it names the real incident (Nemotron 550B eval run, 95% repeated prompt tokens), explains the blast radius clearly, and documents the design rationale for the clamp. This is how PRs should be written.
Inline Comments
| Priority | File | Line | Comment |
|---|---|---|---|
| 🟡 Suggestion | local_conversation.py |
~766-796 | The 20-line docstring on _warn_if_prompt_cache_inactive() is more PR-description than implementation doc. Consider trimming to 3-4 lines explaining what the method does; the why belongs in the PR body or a single-line comment at the call site. |
What Was Done Well
metrics.py: Clean property implementations. No branching beyond the guard clause.min(1.0, ...)clamp is the right call — it future-proofs against any provider that double-counts cache reads.- Warning fires for
caching=Falseon a supported model — Correct behavior. If the user explicitly disabled caching, they're making a trade-off and should be reminded loudly. - Tests: Edge-case coverage is solid — zero prompt tokens, >100% clamp, parametric sweep, once-per-conversation contract.
- No breaking changes: Read-only properties on an existing model, no schema changes.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟢 LOW - No breaking changes. Read-only properties. No new dependencies. The one new log path is guarded by
try/except. The PR author correctly notes it can be promoted to aConversationErrorEventlater if desired.
VERDICT:
✅ Worth merging — Core logic is sound, tests are thorough, and the PR solves a real observability gap.
KEY INSIGHT:
The min(1.0, ...) clamp is a pragmatic guard against provider inconsistency that makes dashboards safe without adding complexity.
Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:
- Add a
.agents/skills/custom-codereview-guide.mdfile to your branch (or edit it if one already exists) with the/codereviewtrigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.- Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
- When your PR is merged, the guideline file goes through normal code review by repository maintainers.
Resolve with AI? Install the iterate skill in your agent and run
/iterateto automatically drive this PR through CI, review, and QA until it's merge-ready.Was this review helpful? React with 👍 or 👎 to give feedback.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
| @@ -763,6 +764,38 @@ def _pin_prompt_cache_key(self) -> None: | |||
| if self.agent.llm._prompt_cache_key is None: | |||
| self.agent.llm._prompt_cache_key = str(self._state.id) | |||
|
|
|||
There was a problem hiding this comment.
🟡 Suggestion: The 20-line docstring on _warn_if_prompt_cache_inactive() is more PR-description than implementation doc. Consider trimming to 3-4 lines explaining what the method does; the why belongs in the PR body or a single-line comment at the call site.
all-hands-bot
left a comment
There was a problem hiding this comment.
⚠️ QA Report: PASS WITH ISSUES
Functional verification passed: the SDK now exposes cache hit/write rates and emits the promised inactive prompt-cache warning, but the PR currently has a failing CI pre-commit check.
Does this PR achieve its stated goal?
Yes. Exercising the SDK as a user would showed that TokenUsage now reports None when there are no prompt tokens, computes 0.75 / 0.2 for normal cache read/write samples, and clamps provider over-counting to 1.0. Constructing real Conversation objects also now emits warnings only when prompt caching is inactive: once for the non-caching Nemotron model and once when caching is explicitly disabled on a supported Claude model, with no warning for the supported active-cache case.
| Phase | Result |
|---|---|
| Environment Setup | ✅ make build completed successfully with uv sync --dev and local packages installed |
| CI Status | pre-commit failing and qa-changes in progress; other SDK/package checks shown by gh pr checks were passing or skipped |
| Functional Verification | ✅ Before/after SDK scripts confirmed both claimed observability behaviors |
Functional Verification
Test 1: TokenUsage.cache_hit_rate / cache_write_rate
Step 1 — Establish baseline without the fix:
Checked out origin/main and ran uv run python /tmp/qa_prompt_cache_observability.py against real SDK imports. Relevant output:
{
"rates": {
"no_prompt": {
"cache_hit_rate": "MISSING: AttributeError",
"cache_write_rate": "MISSING: AttributeError"
},
"overcount": {
"cache_hit_rate": "MISSING: AttributeError",
"cache_write_rate": "MISSING: AttributeError"
},
"partial": {
"cache_hit_rate": "MISSING: AttributeError",
"cache_write_rate": "MISSING: AttributeError"
}
},
"warnings": []
}This confirms the baseline SDK had no user-accessible cache-rate properties.
Step 2 — Apply the PR changes:
Checked out cddcca7e8e3ffbd5cfaefff363fb13b63de38b19.
Step 3 — Re-run with the fix in place:
Ran the same uv run python /tmp/qa_prompt_cache_observability.py. Relevant output:
{
"rates": {
"no_prompt": {
"cache_hit_rate": null,
"cache_write_rate": null
},
"overcount": {
"cache_hit_rate": 1.0,
"cache_write_rate": 1.0
},
"partial": {
"cache_hit_rate": 0.75,
"cache_write_rate": 0.2
}
}
}This confirms the new properties behave as described, including the no-data None case and >100% clamp.
Test 2: Inactive prompt-cache warning at conversation construction
Step 1 — Establish baseline without the fix:
On origin/main, the same script constructed real Conversation objects for:
nemotron-3-ultra-550bclaude-sonnet-4-5claude-sonnet-4-5withcaching_prompt=False
Relevant output:
{
"warnings": []
}This confirms the old behavior did not surface inactive prompt caching at construction time.
Step 2 — Apply the PR changes:
Checked out cddcca7e8e3ffbd5cfaefff363fb13b63de38b19.
Step 3 — Re-run with the fix in place:
Ran the same script. Relevant output:
{
"warnings": [
"Prompt caching is not active for model 'nemotron-3-ultra-550b' on this conversation. Long agent loops will pay full prompt-token cost on every step. Consider tightening `max_iteration_per_run` or switching to a model with prompt-cache support for long-running tasks.",
"Prompt caching is not active for model 'claude-sonnet-4-5' on this conversation. Long agent loops will pay full prompt-token cost on every step. Consider tightening `max_iteration_per_run` or switching to a model with prompt-cache support for long-running tasks."
]
}This confirms the warning appears for an unsupported model and for explicitly disabled caching, while the supported active-cache Claude construction did not add a warning.
Test 3: Warning logging failure does not block construction
On the PR commit, I attached a logging handler that raises on WARNING and constructed a non-caching Nemotron conversation via uv run python /tmp/qa_prompt_cache_logging_failure.py. Relevant output:
conversation_created_despite_logging_failure
This confirms conversation construction still succeeds even if the warning path encounters a logging backend failure.
Issues Found
- 🟠 Issue: Current GitHub checks show
pre-commitfailing andqa-changesstill in progress. I did not rerun linters/pre-commit locally per the QA instructions, but this CI status should be resolved before merge.
This review was created by an AI agent (OpenHands) on behalf of the user.
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review: cache hit/write rate + non-caching-model warning
Taste Rating: 🟢 Good taste — elegant, simple solution
Summary
This PR adds observability tooling for prompt caching: cache hit/write rate derived properties on TokenUsage, and a one-time warning at Conversation construction when the primary LLM doesn't support prompt caching.
Both additions are clean, well-scoped, and come with good test coverage. No blocking issues found.
Evaluation Checklist
| Category | Status | Notes |
|---|---|---|
| Data structures | ✅ | TokenUsage properties are stateless derived values — no mutation risk, no ownership complexity. |
| Breaking changes | ✅ | No public API changes. TokenUsage.cache_hit_rate and cache_write_rate are new properties, existing code is unaffected. |
| Complexity | ✅ | Single-purpose methods, clear control flow, no nesting > 2 levels. |
| Pragmatism | ✅ | Addresses a real cost visibility problem. The warning in particular solves something teams discover too late (on their invoice). |
| Testing | ✅ | Both test files cover happy paths, edge cases, and idempotency. The cache-rate tests use parametric cases which is good. |
| Defensive coding | ✅ | _warn_if_prompt_cache_inactive has a broad but appropriate try/except guard that never propagates. |
| Documentation | ✅ | Docstrings are thorough and explain the "why" (e.g., why clamp, when None is returned). |
Minor Suggestions (non-blocking)
-
local_conversation.pyline 789: Theexcept Exceptionin_warn_if_prompt_cache_inactivecatches everything. Consider documenting why the broad catch is intentional (e.g., "never fail init on a log"). Theexc_info=Truein the debug log is good. | -
metrics.pyline 94: Themin(1.0, ...)clamping is correct. Consider a brief inline comment explaining the defensive clamp (some providers double-count cache reads inside prompt_tokens), though the docstring and tests already cover this. |
Risk Assessment
⚠️ Overall PR Risk: 🟢 LOW
No breaking changes, no new external dependencies, no security-sensitive code paths. The additions are purely additive observability features with their own tests.
Verdict
✅ Worth merging: Core logic is sound, tests are solid, no action required.
Key insight: The cache rate properties use a clean stateless derivation pattern that composes well — TokenUsage objects can be added together and the rates recompute correctly from accumulated totals, as the test suite verifies.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
| @property | ||
| def cache_write_rate(self) -> float | None: | ||
| """Fraction of prompt tokens that produced a new cache write. | ||
|
|
There was a problem hiding this comment.
🟡 Suggestion: The min(1.0, ...) clamping is correct but worth a brief inline comment explaining why it's defensive (some providers double-count cache reads inside prompt_tokens). The test covers it, so the docstring could also note this edge case briefly.
| "Prompt caching is not active for model %r on this " | ||
| "conversation. Long agent loops will pay full prompt-token " | ||
| "cost on every step. Consider tightening `max_iteration_per_run` " | ||
| "or switching to a model with prompt-cache support for " |
There was a problem hiding this comment.
🟡 Suggestion: Consider documenting why the broad except Exception is intentional (e.g., "never fail init on a log"). The exc_info=True in the debug branch is good practice.
all-hands-bot
left a comment
There was a problem hiding this comment.
⚠️ QA Report: PASS WITH ISSUES
Functional SDK verification passed for the new cache-rate observability and prompt-cache inactive warning, but the PR is not currently CI-green.
Does this PR achieve its stated goal?
Yes. As a real SDK user, I imported and used TokenUsage and constructed Conversation objects with supported and unsupported LLM configurations. On the PR commit, cache hit/write rates returned the expected partial, None, and clamped values, and conversation construction emitted exactly the intended operator-facing warning when prompt caching was inactive while staying silent for an active cached model.
| Phase | Result |
|---|---|
| Environment Setup | ✅ make build completed successfully and installed the SDK workspace dependencies. |
| CI Status | pre-commit and unresolved-review-threads failing; several build/publish/coverage checks still in progress; sdk-tests completed successfully in CI. |
| Functional Verification | ✅ Exercised the changed SDK APIs directly with real Python usage; behavior matched the PR description. |
Functional Verification
Test 1: TokenUsage cache rate properties
Step 1 — Reproduce / establish baseline without the fix:
Checked out origin/main and ran a short SDK script that constructs TokenUsage(prompt_tokens=1000, cache_read_tokens=750, cache_write_tokens=200) and reads cache_hit_rate / cache_write_rate.
Relevant output:
commit fd88a5ce
cache_hit_rate AttributeError: 'TokenUsage' object has no attribute 'cache_hit_rate'
cache_write_rate AttributeError: 'TokenUsage' object has no attribute 'cache_write_rate'
This establishes the prior state: operators had no cache hit/write rate properties available on TokenUsage.
Step 2 — Apply the PR's changes:
Checked out PR commit 1643f000d7df3b45ae8b0d2cd8b080722ed415fa.
Step 3 — Re-run with the fix in place:
Ran the equivalent SDK script with realistic and edge-case usage totals.
Relevant output:
commit 1643f000
partial_read hit= 0.75 write= 0.0
partial_write hit= 0.0 write= 0.2
no_prompt_tokens hit= None write= None
clamped_read_and_write hit= 1.0 write= 1.0
This confirms the new properties work for normal reads/writes, return None when no prompt tokens are recorded, and clamp provider over-counting to 1.0.
Test 2: Conversation construction warning for inactive prompt caching
Step 1 — Reproduce / establish baseline without the fix:
On origin/main, constructed a real Conversation with LLM(model='nemotron-3-ultra-550b', api_key=SecretStr('k')) and captured warning logs from openhands.sdk.conversation.impl.local_conversation.
Relevant output:
commit fd88a5ce
prompt_cache_warning_count 0
prompt_cache_warning_messages []
This confirms the operational visibility gap described in the PR: a non-caching model can be used without any construction-time warning.
Step 2 — Apply the PR's changes:
Checked out PR commit 1643f000d7df3b45ae8b0d2cd8b080722ed415fa.
Step 3 — Re-run with the fix in place:
Constructed conversations for a non-caching model, a supported caching model, a supported model with caching explicitly disabled, and two separate non-caching conversations.
Relevant output:
non_caching_model warning_count= 1
non_caching_model warning_message= Prompt caching is not active for model 'nemotron-3-ultra-550b' on this conversation. Long agent loops will pay full prompt-token cost on every step. Consider tightening `max_iteration_per_run` or switching to a model with prompt-cache support for long-running tasks.
supported_model warning_count= 0
supported_model_disabled warning_count= 1
two_new_non_caching_conversations warning_count= 2
This confirms the warning is visible to operators when caching is inactive, includes the model name for grouping, remains silent for a supported model with active caching, and fires once per new conversation construction.
Issues Found
- 🟠 Issue: CI is not currently green: GitHub reports
pre-commitandunresolved-review-threadsas failed, with several build/publish/coverage checks still in progress. I did not rerun tests, linters, or pre-commit locally per QA instructions.
This review was created by an AI agent (OpenHands) on behalf of the user.
Final verdict: PASS WITH ISSUES
Why
On an eval run with a 550B Nemotron (no prompt-cache support), ~95% of the bill was repeated prompt tokens — the SDK already knew the model wasn't in
PROMPT_CACHE_MODELS(viaLLM.is_caching_prompt_active()), but nothing surfaced it. Operators only found out from the invoice.What
Two complementary, additive improvements:
1.
TokenUsage.cache_hit_rate/cache_write_rateDerived properties returning
float | None:cache_hit_rate = cache_read_tokens / prompt_tokens(clamped to[0, 1])cache_write_rate = cache_write_tokens / prompt_tokens(clamped to[0, 1])Nonewhen no prompt tokens have been recorded so callers can distinguish "no data yet" from "0% hit rate".The clamp matters: some providers double-count cache reads outside
prompt_tokens. Without the clamp, dashboards see hit rates >100%.2. One-time non-caching-model warning at conversation construction
LocalConversation.__init__now fires alogger.warningonce whenagent.llm.is_caching_prompt_active() == False, naming the model so log aggregation can group by it. Wrapped intry/exceptso a logging failure can never break conversation construction.Tests
tests/sdk/llm/test_token_usage_cache_rates.py— 9 tests including the parametric hit-rate sweep and the >100% clamp.tests/sdk/conversation/local/test_conversation_prompt_cache_warning.py— 4 tests covering the warning firing for a non-cached model, staying silent for a cached one, firing for a cached model with caching explicitly disabled, and the once-per-conversation contract.All 14 new tests pass; existing
test_llm_metrics.py(51 tests) regression-clean.Risk / blast radius
TokenUsage— no schema change, no existing call sites affected.LocalConversation()construction (only when caching is inactive). No structured event yet; can be promoted to aConversationErrorEventseparately if there's appetite.This PR was created by an AI agent (OpenHands) on behalf of the user investigating low throughput in eval run
swebench/litellm_proxy-nemotron-3-ultra-550b-a55b-or-paid/27052068762.@juanmichelini can click here to continue refining the PR
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.13-nodejs22-slimgolang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:46a5bff-pythonRun
All tags pushed for this build
About Multi-Architecture Support
46a5bff-python) is a multi-arch manifest supporting both amd64 and arm6446a5bff-python-amd64) are also available if needed