DRAFT: feat(llm): allowlist `nemotron-3-ultra` for prompt caching markers by juanmichelini · Pull Request #3671 · OpenHands/software-agent-sdk

juanmichelini · 2026-06-11T19:39:06Z

TL;DR

Add nemotron-3-ultra to PROMPT_CACHE_MODELS so the SDK starts attaching cache_control: {"type": "ephemeral"} markers to the long stable prefix on Nemotron requests. Half of two pieces — this PR is the SDK-side change; the infra-side change (routing Nemotron to a provider that actually honors the markers) is tracked separately.

What this PR does

Single addition to openhands-sdk/openhands/sdk/llm/utils/model_features.py:

PROMPT_CACHE_MODELS: list[str] = [
    ...existing Claude variants...
    "nemotron-3-ultra",   # ← new
]

model_matches does case-insensitive substring matching, so the single token covers every routing form we deploy:

Routing form	Matched?
`litellm_proxy/nemotron-3-ultra-550b-a55b`	✅
`openrouter/nvidia/nemotron-3-ultra-550b-a55b`	✅
`deepinfra/nvidia/Nemotron-3-Ultra-550B-A55B` (planned)	✅
`nvidia/llama-nemotron-rerank-vl-1b`	❌
`nvidia/nemotron-3.5-content-safety`	❌

The last two are deliberate negatives — they're a different Nemotron family with a different caching story, and the test suite pins this so a future broadening of the substring fails loudly.

Why this is safe to ship before infra changes

OpenRouter's prompt-caching docs list 8 supported providers: OpenAI · Grok · Moonshot AI · Groq · Alibaba Qwen · Anthropic · DeepSeek · Google Gemini. NVIDIA and DeepInfra are not on this list. And the Nemotron-3 Ultra model page shows a single provider (DeepInfra) with a 0.2% global cache hit rate.

On a route that doesn't honor cache_control, sending the markers is a silent no-op:

The marker is just an extra JSON field on a content block. Providers that don't recognize it ignore it (per OpenAI Chat Completions spec on unknown fields).
No 400s, no error path changes, no behavior diff for the current eval.
The SDK was already gated by self.caching_prompt and supports_prompt_cache — only the second half flips here. The first half defaults to True.

Bench check: I ran the full test_model_features.py after the change: 154 passed, including the 5 new parametrized Nemotron cases. No other test in the SDK depends on this particular substring.

Why ship now (the actual reason)

The companion infra-side change routes Nemotron through a provider that DOES honor caching — DeepInfra direct, NVIDIA NIM direct, or self-hosted vLLM ≥0.6.5. On that route, the SDK must already be sending cache_control markers for caching to register; otherwise the route switch produces no measurable cost win and we'd be wondering why.

Shipping the SDK side first means the infra switch is a one-config-edit that takes effect immediately. Shipping the infra side first means the SDK release becomes a blocker on every eval.

Why not just match `"nemotron"`

Each Nemotron-family model has its own caching story:

Nemotron-3 Ultra (this PR): allowlisted speculatively, will work once infra route changes.
Nemotron 3.5 Content Safety: free tier, content-safety classifier — caching not yet evaluated.
Llama Nemotron Rerank VL 1B: small reranker, different architecture entirely.

A bare "nemotron" token would silently enable markers for all three. The two negative test cases (nvidia/llama-nemotron-rerank-vl-1b, nvidia/nemotron-3.5-content-safety) document this intent and prevent a future "broaden the match" PR from regressing it without re-verification.

Expected impact (honest)

When	Impact on Nemotron cost
Today (route via OpenRouter→DeepInfra)	0% — markers ignored, this PR is a no-op
After infra route change to a cache-honoring provider	~3× cost reduction on long agent conversations (same math as Sonnet)

The 3× number comes from the cost decomposition I did on real trajectories: at 50K-token avg context × 200-400 turns, without caching every turn re-pays the entire history (quadratic). With 80%+ cache hit on the stable prefix billed at ~$0.15/M instead of $0.50/M, the effective per-turn cost drops by ~70%, and conversation-total cost by ~65–70%.

Tests

tests/sdk/llm/test_model_features.py::test_prompt_cache_support — 5 new parametrized cases added to the existing parametrize block:

3 positive: every routing-form variant we currently or plan to deploy
2 negative: Nemotron-family variants that must NOT be enabled by this allowlist entry

All 154 existing test_model_features cases continue to pass. Lint + pyright clean.

Companion work (NOT in this PR)

What	Where	Status
Route Nemotron to a cache-honoring provider	`all-hands-ai/infra` `k8s/evaluation/litellm.yaml`	needs upstream curl test first
Validate DeepInfra-direct exposes `prompt_tokens_details.cached_tokens`	manual curl test, ~30 s	pending
If DeepInfra doesn't, test NVIDIA NIM direct as fallback	manual curl test	pending
Failing both, file OpenRouter support request	https://openrouter.ai/docs	optional

Risk

Behavior change on the current OpenRouter route: none. The markers are silently ignored.
Behavior change on Anthropic / other already-allowlisted models: none. This entry only adds a new case; it doesn't touch existing logic.
Memory / latency: the markers are a few bytes per content block. Imperceptible.
Rollback: revert this commit. Single-line removal.

This PR was created by an AI agent (OpenHands) on behalf of the user investigating per-instance cost on the Nemotron 550B SWE-Bench Verified eval. The change pairs with a planned infra-side change to route Nemotron away from the OpenRouter→DeepInfra path, which doesn't honor cache_control per OpenRouter's official docs.

@juanmichelini can click here to continue refining the PR

Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant	Architectures	Base Image	Docs / Tags
java	amd64, arm64	`eclipse-temurin:17-jdk`	Link
python	amd64, arm64	`nikolaik/python-nodejs:python3.13-nodejs22-slim`	Link
golang	amd64, arm64	`golang:1.21-bookworm`	Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:a53f053-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-a53f053-python \
  ghcr.io/openhands/agent-server:a53f053-python

All tags pushed for this build

ghcr.io/openhands/agent-server:a53f053-golang-amd64
ghcr.io/openhands/agent-server:a53f05306b32b5aa177c9f25e7e8a2b4debf64ce-golang-amd64
ghcr.io/openhands/agent-server:openhands-sdk-10-nemotron-prompt-cache-allowlist-golang-amd64
ghcr.io/openhands/agent-server:a53f053-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:a53f053-golang-arm64
ghcr.io/openhands/agent-server:a53f05306b32b5aa177c9f25e7e8a2b4debf64ce-golang-arm64
ghcr.io/openhands/agent-server:openhands-sdk-10-nemotron-prompt-cache-allowlist-golang-arm64
ghcr.io/openhands/agent-server:a53f053-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:a53f053-java-amd64
ghcr.io/openhands/agent-server:a53f05306b32b5aa177c9f25e7e8a2b4debf64ce-java-amd64
ghcr.io/openhands/agent-server:openhands-sdk-10-nemotron-prompt-cache-allowlist-java-amd64
ghcr.io/openhands/agent-server:a53f053-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:a53f053-java-arm64
ghcr.io/openhands/agent-server:a53f05306b32b5aa177c9f25e7e8a2b4debf64ce-java-arm64
ghcr.io/openhands/agent-server:openhands-sdk-10-nemotron-prompt-cache-allowlist-java-arm64
ghcr.io/openhands/agent-server:a53f053-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:a53f053-python-amd64
ghcr.io/openhands/agent-server:a53f05306b32b5aa177c9f25e7e8a2b4debf64ce-python-amd64
ghcr.io/openhands/agent-server:openhands-sdk-10-nemotron-prompt-cache-allowlist-python-amd64
ghcr.io/openhands/agent-server:a53f053-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:a53f053-python-arm64
ghcr.io/openhands/agent-server:a53f05306b32b5aa177c9f25e7e8a2b4debf64ce-python-arm64
ghcr.io/openhands/agent-server:openhands-sdk-10-nemotron-prompt-cache-allowlist-python-arm64
ghcr.io/openhands/agent-server:a53f053-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:a53f053-golang
ghcr.io/openhands/agent-server:a53f05306b32b5aa177c9f25e7e8a2b4debf64ce-golang
ghcr.io/openhands/agent-server:openhands-sdk-10-nemotron-prompt-cache-allowlist-golang
ghcr.io/openhands/agent-server:a53f053-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:a53f053-java
ghcr.io/openhands/agent-server:a53f05306b32b5aa177c9f25e7e8a2b4debf64ce-java
ghcr.io/openhands/agent-server:openhands-sdk-10-nemotron-prompt-cache-allowlist-java
ghcr.io/openhands/agent-server:a53f053-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:a53f053-python
ghcr.io/openhands/agent-server:a53f05306b32b5aa177c9f25e7e8a2b4debf64ce-python
ghcr.io/openhands/agent-server:openhands-sdk-10-nemotron-prompt-cache-allowlist-python
ghcr.io/openhands/agent-server:a53f053-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

Each variant tag (e.g., a53f053-python) is a multi-arch manifest supporting both amd64 and arm64
Docker automatically pulls the correct architecture for your platform
Individual architecture tags (e.g., a53f053-python-amd64) are also available if needed

Add NVIDIA Nemotron-3 Ultra (550B MoE) to PROMPT_CACHE_MODELS so that `get_features(model).supports_prompt_cache` returns True for every deployed routing form: - litellm_proxy/nemotron-3-ultra-550b-a55b (current eval route) - openrouter/nvidia/nemotron-3-ultra-550b-a55b - deepinfra/nvidia/Nemotron-3-Ultra-550B-A55B (planned) When this gate is True, the SDK starts attaching `cache_control: {"type": "ephemeral"}` markers to the long stable prefix (system prompt + tool definitions + last user/tool turn), the same scheme that gives Sonnet its ~3x cost reduction on long agent conversations. ## Why this is safe to ship before the infra route changes The current OpenRouter route to Nemotron goes through DeepInfra, which does NOT honor cache_control markers (verified against OpenRouter's official prompt-caching docs at https://openrouter.ai/docs/guides/best-practices/prompt-caching — NVIDIA / DeepInfra are not in the supported-provider list; the model dashboard shows 0.2% global cache hit rate). On that route, sending the markers is a silent no-op: providers that don't recognize the field ignore it; no 400s, no behavior change. This PR ships SDK-side now so that when the companion infra change lands (routing Nemotron through a provider that DOES honor caching — DeepInfra direct, NVIDIA NIM direct, or self-hosted vLLM ≥0.6.5), caching activates immediately without requiring a coordinated two-repo release. ## Why not just match "nemotron" Each Nemotron-family model has its own caching story and deployment path. Bulk-matching "nemotron" would silently enable markers for Nemotron 3.5 Content Safety, Llama Nemotron Rerank, and any future NVIDIA-family entries — none of which have been verified. The test suite pins this: two negative cases (`nvidia/llama-nemotron-rerank-vl-1b`, `nvidia/nemotron-3.5-content-safety`) fail loudly if anyone broadens the substring without re-verifying. ## Tests `tests/sdk/llm/test_model_features.py::test_prompt_cache_support` — 5 new parametrized cases (3 positive routing-form variants, 2 negative Nemotron-family false-positive guards). All 154 existing test_model_features cases still pass. Lint + pyright clean. Co-authored-by: openhands <openhands@all-hands.dev>

github-actions · 2026-06-11T19:39:39Z

Python API breakage checks — ✅ PASSED

Result: ✅ PASSED

Action log

github-actions · 2026-06-11T19:39:47Z

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: ✅ PASSED

Action log

github-actions · 2026-06-11T19:42:41Z

Coverage Report •

File	Stmts	Miss	Cover	Missing
openhands-sdk/openhands/sdk/llm/utils
model_features.py	67	1	98%	38
TOTAL	30500	8438	72%

juanmichelini added the enhancement New feature or request label Jun 11, 2026 — with OpenHands AI

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DRAFT: feat(llm): allowlist `nemotron-3-ultra` for prompt caching markers#3671

DRAFT: feat(llm): allowlist `nemotron-3-ultra` for prompt caching markers#3671
juanmichelini wants to merge 1 commit into
mainfrom
openhands/sdk-10-nemotron-prompt-cache-allowlist

juanmichelini commented Jun 11, 2026 •

edited by github-actions Bot

Loading

Uh oh!

github-actions Bot commented Jun 11, 2026

Uh oh!

github-actions Bot commented Jun 11, 2026

Uh oh!

github-actions Bot commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

juanmichelini commented Jun 11, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

TL;DR

What this PR does

Why this is safe to ship before infra changes

Why ship now (the actual reason)

Why not just match "nemotron"

Expected impact (honest)

Tests

Companion work (NOT in this PR)

Risk

Uh oh!

github-actions Bot commented Jun 11, 2026

Python API breakage checks — ✅ PASSED

Uh oh!

github-actions Bot commented Jun 11, 2026

REST API breakage checks (OpenAPI) — ✅ PASSED

Uh oh!

github-actions Bot commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

juanmichelini commented Jun 11, 2026 •

edited by github-actions Bot

Loading

Why not just match `"nemotron"`