DRAFT: feat(stats): ConversationStats.cache_efficiency_summary() probe by juanmichelini · Pull Request #3567 · OpenHands/software-agent-sdk

juanmichelini · 2026-06-08T06:38:18Z

Why

Reconstructing prompt-cache health for a multi-LLM conversation today means iterating ConversationStats.usage_to_metrics and summing TokenUsage fields by hand. Dashboards and eval harnesses keep reinventing the same loop, often picking the wrong aggregation (e.g. mean of per-LLM rates instead of token-weighted).

What

A single structured probe with a stable output shape:

{
  "prompt_tokens", "cache_read_tokens", "cache_write_tokens", "completion_tokens",
  "cache_hit_rate":  float | None,    # weighted by prompt tokens
  "cache_write_rate": float | None,
  "per_usage": {
    "<usage_id>": {"model", "prompt_tokens", "cache_read_tokens",
                   "cache_write_tokens", "cache_hit_rate"},
    ...
  },
}

Two design choices worth flagging in review:

Aggregate hit rate is weighted by prompt_tokens, not a mean of per-LLM rates. A loud uncached sidecar LLM (e.g. 1k prompt tokens) shouldn't drag a healthy 100k-token main run down to 50%.
None for rates when no prompt tokens have been recorded — preserves the "no data" vs "0% hit" distinction. Per-LLM breakdown still surfaces the bad citizen even when the aggregate is healthy.

Aggregate rates are also clamped to 1.0 so dashboards stay sane when a provider double-counts cache reads outside prompt_tokens.

Tests

tests/sdk/conversation/test_conversation_stats_cache_summary.py — 6 new tests covering empty stats, the multi-LLM weighted aggregation (the 9k/11k ≈ 0.818 case), the >100% clamp, and the model-name fallback. All 13 stats tests pass; linters + pyright clean.

Risk / blast radius

Purely additive — one new method on ConversationStats. No schema, persistence, or wire-format change.

Stacking

Independent of #3566 (SDK-3); both can land in either order.

This PR was created by an AI agent (OpenHands) on behalf of the user investigating low throughput in eval run swebench/litellm_proxy-nemotron-3-ultra-550b-a55b-or-paid/27052068762.

@juanmichelini can click here to continue refining the PR

Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant	Architectures	Base Image	Docs / Tags
java	amd64, arm64	`eclipse-temurin:17-jdk`	Link
python	amd64, arm64	`nikolaik/python-nodejs:python3.13-nodejs22-slim`	Link
golang	amd64, arm64	`golang:1.21-bookworm`	Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:91df995-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-91df995-python \
  ghcr.io/openhands/agent-server:91df995-python

All tags pushed for this build

ghcr.io/openhands/agent-server:91df995-golang-amd64
ghcr.io/openhands/agent-server:91df9952e0ea25a462398fe7fce53484091506d5-golang-amd64
ghcr.io/openhands/agent-server:openhands-p1-cache-efficiency-summary-golang-amd64
ghcr.io/openhands/agent-server:91df995-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:91df995-golang-arm64
ghcr.io/openhands/agent-server:91df9952e0ea25a462398fe7fce53484091506d5-golang-arm64
ghcr.io/openhands/agent-server:openhands-p1-cache-efficiency-summary-golang-arm64
ghcr.io/openhands/agent-server:91df995-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:91df995-java-amd64
ghcr.io/openhands/agent-server:91df9952e0ea25a462398fe7fce53484091506d5-java-amd64
ghcr.io/openhands/agent-server:openhands-p1-cache-efficiency-summary-java-amd64
ghcr.io/openhands/agent-server:91df995-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:91df995-java-arm64
ghcr.io/openhands/agent-server:91df9952e0ea25a462398fe7fce53484091506d5-java-arm64
ghcr.io/openhands/agent-server:openhands-p1-cache-efficiency-summary-java-arm64
ghcr.io/openhands/agent-server:91df995-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:91df995-python-amd64
ghcr.io/openhands/agent-server:91df9952e0ea25a462398fe7fce53484091506d5-python-amd64
ghcr.io/openhands/agent-server:openhands-p1-cache-efficiency-summary-python-amd64
ghcr.io/openhands/agent-server:91df995-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:91df995-python-arm64
ghcr.io/openhands/agent-server:91df9952e0ea25a462398fe7fce53484091506d5-python-arm64
ghcr.io/openhands/agent-server:openhands-p1-cache-efficiency-summary-python-arm64
ghcr.io/openhands/agent-server:91df995-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:91df995-golang
ghcr.io/openhands/agent-server:91df9952e0ea25a462398fe7fce53484091506d5-golang
ghcr.io/openhands/agent-server:openhands-p1-cache-efficiency-summary-golang
ghcr.io/openhands/agent-server:91df995-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:91df995-java
ghcr.io/openhands/agent-server:91df9952e0ea25a462398fe7fce53484091506d5-java
ghcr.io/openhands/agent-server:openhands-p1-cache-efficiency-summary-java
ghcr.io/openhands/agent-server:91df995-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:91df995-python
ghcr.io/openhands/agent-server:91df9952e0ea25a462398fe7fce53484091506d5-python
ghcr.io/openhands/agent-server:openhands-p1-cache-efficiency-summary-python
ghcr.io/openhands/agent-server:91df995-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

Each variant tag (e.g., 91df995-python) is a multi-arch manifest supporting both amd64 and arm64
Docker automatically pulls the correct architecture for your platform
Individual architecture tags (e.g., 91df995-python-amd64) are also available if needed

A single structured probe that aggregates prompt-cache health across every LLM used in a conversation, plus a per-LLM breakdown. Dashboards and eval harnesses can call this directly instead of iterating ConversationStats internals or scraping log lines. Output shape (stable): { 'prompt_tokens', 'cache_read_tokens', 'cache_write_tokens', 'completion_tokens', 'cache_hit_rate': float | None, # weighted by prompt tokens 'cache_write_rate': float | None, 'per_usage': {usage_id: {model, prompt_tokens, cache_read_tokens, cache_write_tokens, cache_hit_rate}}, } Two design choices worth flagging: * Hit rate is weighted by prompt_tokens across LLMs, not a naive mean of per-LLM rates. A loud uncached side-car LLM (1k prompt tokens) doesn't drag a healthy 100k-token main run down to 50%. * Aggregate rates are clamped to 1.0 so dashboards stay sane when a provider double-counts cache reads outside prompt_tokens. None is returned for hit/write rate when no prompt tokens have been billed yet, so 'no data' is distinct from '0% hit'. Motivated by an eval where the full prompt-cache picture was only visible after summing per-LLM TokenUsage by hand. Co-authored-by: openhands <openhands@all-hands.dev>

github-actions · 2026-06-08T06:39:07Z

Python API breakage checks — ✅ PASSED

Result: ✅ PASSED

Action log

github-actions · 2026-06-08T06:39:27Z

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: ✅ PASSED

Action log

github-actions · 2026-06-08T06:41:16Z

Coverage Report •

File	Stmts	Miss	Cover	Missing
openhands-sdk/openhands/sdk/conversation
conversation_stats.py	61	1	98%	110
TOTAL	30208	8457	72%

all-hands-bot · 2026-06-08T14:42:56Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

Code Review: `ConversationStats.cache_efficiency_summary()`

🟢 Good taste — Clean, well-scoped addition with solid test coverage.

Summary

This PR adds a structured probe for cache efficiency metrics that dashboards and eval harnesses can call directly without scraping log lines. The implementation is straightforward: aggregate token counts across all LLMs, compute hit/write rates, and provide a per-LLM breakdown.

What works well:

Edge cases handled correctly: empty stats → None rates, providers over-reporting cache tokens → clamped to 1.0
The _rate() helper keeps the main loop clean
Tests cover the important cases including the "weird provider" scenario
Docstring clearly explains the return shape and semantics

Suggestion (non-blocking):

[conversation_stats.py, line 130] The return type is dict[str, Any]. A TypedDict or dataclass would give callers type-safe access and make the shape self-documenting in IDEs. Not blocking — the current dict works fine for a probe, and retrofitting might be out of scope for this PR.

Testing:

All 6 tests pass. Coverage is good — includes weighted aggregation, clamping, fallback behavior, and empty state.

[RISK ASSESSMENT]

[Overall PR] ⚠️ Risk Assessment: 🟢 LOW
No breaking changes. Pure additive feature with comprehensive tests. No external dependencies or security implications.

VERDICT:
✅ Worth merging — Solid addition, no blocking issues.

KEY INSIGHT:
The design decision to return a dict (vs TypedDict) keeps the probe lightweight and flexible for dashboards, while the comprehensive tests ensure edge cases like provider double-counting are handled correctly.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

all-hands-bot · 2026-06-08T14:46:13Z

+            hit_rate = None
+            write_rate = None
+
+        return {


🟡 Suggestion: Consider using a TypedDict or dataclass for the return type to give callers type-safe access. Not blocking for this PR, but would improve IDE support and self-document the shape.

from typing import TypedDict class CacheEfficiencySummary(TypedDict): prompt_tokens: int cache_read_tokens: int cache_write_tokens: int completion_tokens: int cache_hit_rate: float | None cache_write_rate: float | None per_usage: dict[str, dict[str, Any]]

all-hands-bot

✅ QA Report: PASS

I verified the new SDK probe by importing ConversationStats as a user would and calling cache_efficiency_summary() with realistic multi-LLM token usage; it returns the claimed stable aggregate and per-usage shape.

Does this PR achieve its stated goal?

Yes. On origin/main, ConversationStats has no cache_efficiency_summary() method; at PR commit 82e53b5, the same user script exposes the method and returns aggregate prompt/cache/completion totals, token-weighted hit rate (9000 / 11000 = 0.818...), clamped provider over-count rates, None rates for no prompt tokens, and model-name fallback. This delivers the promised dashboard/eval-harness probe without callers hand-iterating usage_to_metrics.

Phase	Result
Environment Setup	✅ `make build` completed successfully; no tests/linters were run.
CI Status	🟡 GitHub reports 35 passing, 1 pending (`qa-changes`), 15 skipped.
Functional Verification	✅ Before/after SDK probe confirms the feature works as described.

Functional Verification

Test 1: New `ConversationStats.cache_efficiency_summary()` SDK probe

Step 1 — Reproduce / establish baseline (without the fix):

Ran git switch --detach origin/main --quiet && uv run python /tmp/cache_summary_user_probe.py:

has_cache_efficiency_summary False
call_result AttributeError 'ConversationStats' object has no attribute 'cache_efficiency_summary'

This confirms the baseline problem: a user/dashboard cannot call a structured cache-efficiency probe and would need to compute the summary manually from usage_to_metrics.

Step 2 — Apply the PR's changes:

Checked out the PR branch/commit with git switch openhands/p1-cache-efficiency-summary --quiet && git checkout 82e53b5e1a65a13e0e962939fd1c751c5007cbed --quiet.

Step 3 — Re-run with the fix in place:

Ran uv run python /tmp/cache_summary_user_probe.py:

has_cache_efficiency_summary True
empty {"cache_hit_rate": null, "cache_read_tokens": 0, "cache_write_rate": null, "cache_write_tokens": 0, "completion_tokens": 0, "per_usage": {}, "prompt_tokens": 0}
weighted_multi_llm {"cache_hit_rate": 0.8181818181818182, "cache_read_tokens": 9000, "cache_write_rate": 0.045454545454545456, "cache_write_tokens": 500, "completion_tokens": 225, "per_usage": {"cached-llm": {"cache_hit_rate": 0.9, "cache_read_tokens": 9000, "cache_write_tokens": 500, "model": "claude-sonnet-4-5", "prompt_tokens": 10000}, "uncached-sidecar": {"cache_hit_rate": 0.0, "cache_read_tokens": 0, "cache_write_tokens": 0, "model": "nemotron-3-ultra-550b", "prompt_tokens": 1000}}, "prompt_tokens": 11000}
provider_double_count {"cache_hit_rate": 1.0, "cache_read_tokens": 1500, "cache_write_rate": 1.0, "cache_write_tokens": 1200, "completion_tokens": 0, "per_usage": {"double-count-provider": {"cache_hit_rate": 1.0, "cache_read_tokens": 1500, "cache_write_tokens": 1200, "model": "weird-model", "prompt_tokens": 1000}}, "prompt_tokens": 1000}
model_fallback {"cache_hit_rate": 0.5, "cache_read_tokens": 50, "cache_write_rate": 0.0, "cache_write_tokens": 0, "completion_tokens": 0, "per_usage": {"fallback-model": {"cache_hit_rate": 0.5, "cache_read_tokens": 50, "cache_write_tokens": 0, "model": "metrics-model-name", "prompt_tokens": 100}}, "prompt_tokens": 100}
registered_empty {"cache_hit_rate": null, "cache_read_tokens": 0, "cache_write_rate": null, "cache_write_tokens": 0, "completion_tokens": 0, "per_usage": {"empty-registered": {"cache_hit_rate": null, "cache_read_tokens": 0, "cache_write_tokens": 0, "model": "registered-but-no-completion", "prompt_tokens": 0}}, "prompt_tokens": 0}

This confirms the new method is available through the SDK export and returns the promised structure. The multi-LLM aggregate is token-weighted rather than a naive average, over-counted cache reads/writes are clamped to 1.0, empty prompt-token cases preserve None rates, and the per-usage breakdown surfaces individual usage IDs and model names.

Setup and CI observation

Ran make build:

Dependencies installed successfully.
Pre-commit hooks installed successfully.
Build complete! Development environment is ready.

Ran gh pr checks 3567 --repo OpenHands/software-agent-sdk ...:

pass: 35
pending: 1
skipping: 15
pending IN_PROGRESS qa-changes

No local test suite, linter, formatter, type checker, or pre-commit hook was run.

Issues Found

None.

This QA review was created by an AI agent (OpenHands) on behalf of the user.

all-hands-bot · 2026-06-09T00:29:40Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

Code Review: `cache_efficiency_summary()`

🟡 Acceptable — Core logic is sound, but there are a few opportunities to improve clarity and reduce complexity.

[IMPROVEMENT OPPORTUNITIES]

[openhands-sdk/openhands/sdk/conversation/conversation_stats.py, Line 96] Complexity: The local _rate helper is a one-liner used twice. Consider inlining min(1.0, num / den) if den > 0 else None directly or extracting it as a module-level helper if reused elsewhere.
[openhands-sdk/openhands/sdk/conversation/conversation_stats.py, Line 116] Naming: The per_usage dict key is the internal usage_id. If dashboards need to aggregate by model, they'll need to iterate through the dict. Consider whether the return structure should include a per_model aggregation as well, or document that consumers should key by usage_id and access per_usage[id]['model'] for the model name.

[STYLE NOTES]

The comment on line 83-84 explaining provider double-counting is appropriate — it describes a non-obvious invariant that callers need to understand. ✅
The docstring is thorough and the return shape is well-documented. ✅

[TESTING]

Tests are solid:

Edge cases (empty stats, no completions) are covered
Weighted aggregate calculation is verified correctly (9k/11k ≠ mean of 0.9 and 0.0)
Clamping behavior for >100% is tested
Model fallback from metrics.model_name is covered

No testing gaps identified. Tests exercise real code paths and assert on actual outputs.

[RISK ASSESSMENT]

[Overall PR] ⚠️ Risk Assessment: 🟢 LOW
This is a pure additive feature — new method, new tests. No breaking changes to existing APIs. No external dependencies added. The risk of this PR is minimal.

VERDICT:
✅ Worth merging: The implementation is correct, well-tested, and solves a real problem (dashboards need a structured probe instead of scraping log lines).

KEY INSIGHT:
The weighted aggregate (total_cache_read / total_prompt) is correct — this avoids the trap of naive per-LLM averaging that would hide a noisy participant behind healthy averages.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

all-hands-bot · 2026-06-09T00:31:29Z

+                }
+
+            ``cache_hit_rate`` is ``None`` when no prompt tokens were
+            billed (lets callers distinguish "no data" from "0% hit").


🟡 Suggestion: The local _rate helper is a one-liner used twice. Could be inlined directly:

hit_rate = min(1.0, total_cache_read / total_prompt) if total_prompt > 0 else None write_rate = min(1.0, total_cache_write / total_prompt) if total_prompt > 0 else None

Or if you prefer keeping the helper, rename it to _clamp_rate to better describe what it does.

all-hands-bot

✅ QA Report: PASS

The new SDK probe was exercised through real uv run python scripts and returned the advertised aggregate/per-usage cache metrics, including edge cases.

Does this PR achieve its stated goal?

Yes. The PR set out to add a structured ConversationStats.cache_efficiency_summary() probe so dashboards/eval harnesses no longer have to hand-roll prompt-cache aggregation. On the base branch the method is absent; on the PR branch, the same SDK usage returns the promised shape with totals, weighted rates, per-usage details, model fallback, None rates for no prompt tokens, and clamped >100% provider reports.

Phase	Result
Environment Setup	✅ `uv run python` created/used the project environment and imported the SDK successfully.
CI Status	⚠️ Most checks were green at refresh; `unresolved-review-threads` was failing and several build/QA checks were still pending.
Functional Verification	✅ SDK users can call the new method and get the claimed cache summary behavior.

Functional Verification

Test 1: New SDK method exists and returns the advertised single-LLM summary

Step 1 — Reproduce / establish baseline without the fix:
Ran on origin/main:

git switch --detach origin/main && OPENHANDS_SUPPRESS_BANNER=1 uv run python - <<'PY'
from openhands.sdk import ConversationStats
from openhands.sdk.llm.utils.metrics import Metrics, TokenUsage

stats = ConversationStats()
metrics = Metrics(model_name='claude-sonnet-4-5')
metrics.accumulated_token_usage = TokenUsage(
    prompt_tokens=10000,
    completion_tokens=200,
    cache_read_tokens=8000,
    cache_write_tokens=500,
)
stats.usage_to_metrics['main'] = metrics
print(stats.cache_efficiency_summary())
PY

Output:

AttributeError: 'ConversationStats' object has no attribute 'cache_efficiency_summary'

This confirms the prior SDK had no single structured cache-efficiency probe; a user/dashboard would need to aggregate metrics manually.

Step 2 — Apply the PR's changes:
Checked out openhands/p1-cache-efficiency-summary at 7000b744b31890ec05f7438f227fe2a1d905695d.

Step 3 — Re-run with the fix in place:
Ran the same SDK script on the PR branch:

{'prompt_tokens': 10000, 'cache_read_tokens': 8000, 'cache_write_tokens': 500, 'completion_tokens': 200, 'cache_hit_rate': 0.8, 'cache_write_rate': 0.05, 'per_usage': {'main': {'model': 'claude-sonnet-4-5', 'prompt_tokens': 10000, 'cache_read_tokens': 8000, 'cache_write_tokens': 500, 'cache_hit_rate': 0.8}}}

This shows the method is now callable and returns the promised stable shape with correct totals and rates for a realistic SDK consumer.

Test 2: Multi-LLM weighted aggregation, per-usage detail, empty usage, and model fallback

Ran a user-style SDK script with a cached main LLM, uncached sidecar, registered-not-started LLM, and TokenUsage without a model:

--- multi-usage summary ---
{
  "cache_hit_rate": 0.8153153153153153,
  "cache_read_tokens": 9050,
  "cache_write_rate": 0.0472972972972973,
  "cache_write_tokens": 525,
  "completion_tokens": 260,
  "per_usage": {
    "fallback": {
      "cache_hit_rate": 0.5,
      "cache_read_tokens": 50,
      "cache_write_tokens": 25,
      "model": "fallback-model-name",
      "prompt_tokens": 100
    },
    "main-agent": {
      "cache_hit_rate": 0.9,
      "cache_read_tokens": 9000,
      "cache_write_tokens": 500,
      "model": "claude-sonnet-4-5",
      "prompt_tokens": 10000
    },
    "not-started": {
      "cache_hit_rate": null,
      "cache_read_tokens": 0,
      "cache_write_tokens": 0,
      "model": "registered-not-started",
      "prompt_tokens": 0
    },
    "sidecar": {
      "cache_hit_rate": 0.0,
      "cache_read_tokens": 0,
      "cache_write_tokens": 0,
      "model": "nemotron-3-ultra-550b",
      "prompt_tokens": 1000
    }
  },
  "prompt_tokens": 11100
}

This confirms aggregate hit/write rates are token-weighted from totals, not a naive mean; the uncached sidecar is still visible in per_usage; a not-yet-started LLM preserves null rate semantics; and the model-name fallback works.

Test 3: Empty stats and provider double-count clamp

Ran edge-case SDK scripts for empty stats and cache counters larger than prompt tokens:

--- clamp scenario ---
{
  "cache_hit_rate": 1.0,
  "cache_read_tokens": 1500,
  "cache_write_rate": 1.0,
  "cache_write_tokens": 1250,
  "completion_tokens": 0,
  "per_usage": {
    "weird-provider": {
      "cache_hit_rate": 1.0,
      "cache_read_tokens": 1500,
      "cache_write_tokens": 1250,
      "model": "provider-double-counts",
      "prompt_tokens": 1000
    }
  },
  "prompt_tokens": 1000
}
--- empty stats scenario ---
{
  "cache_hit_rate": null,
  "cache_read_tokens": 0,
  "cache_write_rate": null,
  "cache_write_tokens": 0,
  "completion_tokens": 0,
  "per_usage": {},
  "prompt_tokens": 0
}

This confirms dashboards will not see >100% aggregate rates and can distinguish “no data yet” from a real 0% cache hit rate.

Issues Found

None.

This QA review was generated by an AI agent (OpenHands) on behalf of the user.

juanmichelini added the enhancement New feature or request label Jun 8, 2026 — with OpenHands AI

juanmichelini requested a review from all-hands-bot June 8, 2026 14:41

all-hands-bot reviewed Jun 8, 2026

View reviewed changes

juanmichelini marked this pull request as ready for review June 9, 2026 00:28

Merge branch 'main' into openhands/p1-cache-efficiency-summary

7000b74

juanmichelini requested a review from all-hands-bot June 9, 2026 00:28

all-hands-bot reviewed Jun 9, 2026

View reviewed changes

Merge branch 'main' into openhands/p1-cache-efficiency-summary

91df995

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DRAFT: feat(stats): ConversationStats.cache_efficiency_summary() probe#3567

DRAFT: feat(stats): ConversationStats.cache_efficiency_summary() probe#3567
juanmichelini wants to merge 3 commits into
mainfrom
openhands/p1-cache-efficiency-summary

juanmichelini commented Jun 8, 2026 •

edited by github-actions Bot

Loading

Uh oh!

github-actions Bot commented Jun 8, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 8, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 8, 2026 •

edited

Loading

Uh oh!

all-hands-bot commented Jun 8, 2026 •

edited

Loading

Uh oh!

all-hands-bot left a comment

Uh oh!

all-hands-bot Jun 8, 2026

Uh oh!

all-hands-bot left a comment

Uh oh!

all-hands-bot commented Jun 9, 2026 •

edited

Loading

Uh oh!

all-hands-bot left a comment

Uh oh!

all-hands-bot Jun 9, 2026

Uh oh!

all-hands-bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

juanmichelini commented Jun 8, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why

What

Tests

Risk / blast radius

Stacking

Uh oh!

github-actions Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Python API breakage checks — ✅ PASSED

Uh oh!

github-actions Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

REST API breakage checks (OpenAPI) — ✅ PASSED

Uh oh!

github-actions Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

all-hands-bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Code Review: ConversationStats.cache_efficiency_summary()

Summary

Uh oh!

all-hands-bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

✅ QA Report: PASS

Does this PR achieve its stated goal?

Test 1: New ConversationStats.cache_efficiency_summary() SDK probe

Setup and CI observation

Issues Found

Uh oh!

all-hands-bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Code Review: cache_efficiency_summary()

[IMPROVEMENT OPPORTUNITIES]

[STYLE NOTES]

[TESTING]

Uh oh!

all-hands-bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

✅ QA Report: PASS

Does this PR achieve its stated goal?

Test 1: New SDK method exists and returns the advertised single-LLM summary

Test 2: Multi-LLM weighted aggregation, per-usage detail, empty usage, and model fallback

Test 3: Empty stats and provider double-count clamp

Issues Found

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

juanmichelini commented Jun 8, 2026 •

edited by github-actions Bot

Loading

github-actions Bot commented Jun 8, 2026 •

edited

Loading

github-actions Bot commented Jun 8, 2026 •

edited

Loading

github-actions Bot commented Jun 8, 2026 •

edited

Loading

all-hands-bot commented Jun 8, 2026 •

edited

Loading

Code Review: `ConversationStats.cache_efficiency_summary()`

Test 1: New `ConversationStats.cache_efficiency_summary()` SDK probe

all-hands-bot commented Jun 9, 2026 •

edited

Loading

Code Review: `cache_efficiency_summary()`