Skip to content

DRAFT: feat(llm): allowlist nemotron-3-ultra for prompt caching markers#3671

Draft
juanmichelini wants to merge 1 commit into
mainfrom
openhands/sdk-10-nemotron-prompt-cache-allowlist
Draft

DRAFT: feat(llm): allowlist nemotron-3-ultra for prompt caching markers#3671
juanmichelini wants to merge 1 commit into
mainfrom
openhands/sdk-10-nemotron-prompt-cache-allowlist

Conversation

@juanmichelini

@juanmichelini juanmichelini commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

TL;DR

Add nemotron-3-ultra to PROMPT_CACHE_MODELS so the SDK starts attaching cache_control: {"type": "ephemeral"} markers to the long stable prefix on Nemotron requests. Half of two pieces — this PR is the SDK-side change; the infra-side change (routing Nemotron to a provider that actually honors the markers) is tracked separately.

What this PR does

Single addition to openhands-sdk/openhands/sdk/llm/utils/model_features.py:

PROMPT_CACHE_MODELS: list[str] = [
    ...existing Claude variants...
    "nemotron-3-ultra",   # ← new
]

model_matches does case-insensitive substring matching, so the single token covers every routing form we deploy:

Routing form Matched?
litellm_proxy/nemotron-3-ultra-550b-a55b
openrouter/nvidia/nemotron-3-ultra-550b-a55b
deepinfra/nvidia/Nemotron-3-Ultra-550B-A55B (planned)
nvidia/llama-nemotron-rerank-vl-1b
nvidia/nemotron-3.5-content-safety

The last two are deliberate negatives — they're a different Nemotron family with a different caching story, and the test suite pins this so a future broadening of the substring fails loudly.

Why this is safe to ship before infra changes

OpenRouter's prompt-caching docs list 8 supported providers: OpenAI · Grok · Moonshot AI · Groq · Alibaba Qwen · Anthropic · DeepSeek · Google Gemini. NVIDIA and DeepInfra are not on this list. And the Nemotron-3 Ultra model page shows a single provider (DeepInfra) with a 0.2% global cache hit rate.

On a route that doesn't honor cache_control, sending the markers is a silent no-op:

  • The marker is just an extra JSON field on a content block. Providers that don't recognize it ignore it (per OpenAI Chat Completions spec on unknown fields).
  • No 400s, no error path changes, no behavior diff for the current eval.
  • The SDK was already gated by self.caching_prompt and supports_prompt_cache — only the second half flips here. The first half defaults to True.

Bench check: I ran the full test_model_features.py after the change: 154 passed, including the 5 new parametrized Nemotron cases. No other test in the SDK depends on this particular substring.

Why ship now (the actual reason)

The companion infra-side change routes Nemotron through a provider that DOES honor caching — DeepInfra direct, NVIDIA NIM direct, or self-hosted vLLM ≥0.6.5. On that route, the SDK must already be sending cache_control markers for caching to register; otherwise the route switch produces no measurable cost win and we'd be wondering why.

Shipping the SDK side first means the infra switch is a one-config-edit that takes effect immediately. Shipping the infra side first means the SDK release becomes a blocker on every eval.

Why not just match "nemotron"

Each Nemotron-family model has its own caching story:

  • Nemotron-3 Ultra (this PR): allowlisted speculatively, will work once infra route changes.
  • Nemotron 3.5 Content Safety: free tier, content-safety classifier — caching not yet evaluated.
  • Llama Nemotron Rerank VL 1B: small reranker, different architecture entirely.

A bare "nemotron" token would silently enable markers for all three. The two negative test cases (nvidia/llama-nemotron-rerank-vl-1b, nvidia/nemotron-3.5-content-safety) document this intent and prevent a future "broaden the match" PR from regressing it without re-verification.

Expected impact (honest)

When Impact on Nemotron cost
Today (route via OpenRouter→DeepInfra) 0% — markers ignored, this PR is a no-op
After infra route change to a cache-honoring provider ~3× cost reduction on long agent conversations (same math as Sonnet)

The 3× number comes from the cost decomposition I did on real trajectories: at 50K-token avg context × 200-400 turns, without caching every turn re-pays the entire history (quadratic). With 80%+ cache hit on the stable prefix billed at ~$0.15/M instead of $0.50/M, the effective per-turn cost drops by ~70%, and conversation-total cost by ~65–70%.

Tests

tests/sdk/llm/test_model_features.py::test_prompt_cache_support — 5 new parametrized cases added to the existing parametrize block:

  • 3 positive: every routing-form variant we currently or plan to deploy
  • 2 negative: Nemotron-family variants that must NOT be enabled by this allowlist entry

All 154 existing test_model_features cases continue to pass. Lint + pyright clean.

Companion work (NOT in this PR)

What Where Status
Route Nemotron to a cache-honoring provider all-hands-ai/infra k8s/evaluation/litellm.yaml needs upstream curl test first
Validate DeepInfra-direct exposes prompt_tokens_details.cached_tokens manual curl test, ~30 s pending
If DeepInfra doesn't, test NVIDIA NIM direct as fallback manual curl test pending
Failing both, file OpenRouter support request https://openrouter.ai/docs optional

Risk

  • Behavior change on the current OpenRouter route: none. The markers are silently ignored.
  • Behavior change on Anthropic / other already-allowlisted models: none. This entry only adds a new case; it doesn't touch existing logic.
  • Memory / latency: the markers are a few bytes per content block. Imperceptible.
  • Rollback: revert this commit. Single-line removal.

This PR was created by an AI agent (OpenHands) on behalf of the user investigating per-instance cost on the Nemotron 550B SWE-Bench Verified eval. The change pairs with a planned infra-side change to route Nemotron away from the OpenRouter→DeepInfra path, which doesn't honor cache_control per OpenRouter's official docs.

@juanmichelini can click here to continue refining the PR


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image Docs / Tags
java amd64, arm64 eclipse-temurin:17-jdk Link
python amd64, arm64 nikolaik/python-nodejs:python3.13-nodejs22-slim Link
golang amd64, arm64 golang:1.21-bookworm Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:a53f053-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-a53f053-python \
  ghcr.io/openhands/agent-server:a53f053-python

All tags pushed for this build

ghcr.io/openhands/agent-server:a53f053-golang-amd64
ghcr.io/openhands/agent-server:a53f05306b32b5aa177c9f25e7e8a2b4debf64ce-golang-amd64
ghcr.io/openhands/agent-server:openhands-sdk-10-nemotron-prompt-cache-allowlist-golang-amd64
ghcr.io/openhands/agent-server:a53f053-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:a53f053-golang-arm64
ghcr.io/openhands/agent-server:a53f05306b32b5aa177c9f25e7e8a2b4debf64ce-golang-arm64
ghcr.io/openhands/agent-server:openhands-sdk-10-nemotron-prompt-cache-allowlist-golang-arm64
ghcr.io/openhands/agent-server:a53f053-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:a53f053-java-amd64
ghcr.io/openhands/agent-server:a53f05306b32b5aa177c9f25e7e8a2b4debf64ce-java-amd64
ghcr.io/openhands/agent-server:openhands-sdk-10-nemotron-prompt-cache-allowlist-java-amd64
ghcr.io/openhands/agent-server:a53f053-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:a53f053-java-arm64
ghcr.io/openhands/agent-server:a53f05306b32b5aa177c9f25e7e8a2b4debf64ce-java-arm64
ghcr.io/openhands/agent-server:openhands-sdk-10-nemotron-prompt-cache-allowlist-java-arm64
ghcr.io/openhands/agent-server:a53f053-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:a53f053-python-amd64
ghcr.io/openhands/agent-server:a53f05306b32b5aa177c9f25e7e8a2b4debf64ce-python-amd64
ghcr.io/openhands/agent-server:openhands-sdk-10-nemotron-prompt-cache-allowlist-python-amd64
ghcr.io/openhands/agent-server:a53f053-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:a53f053-python-arm64
ghcr.io/openhands/agent-server:a53f05306b32b5aa177c9f25e7e8a2b4debf64ce-python-arm64
ghcr.io/openhands/agent-server:openhands-sdk-10-nemotron-prompt-cache-allowlist-python-arm64
ghcr.io/openhands/agent-server:a53f053-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:a53f053-golang
ghcr.io/openhands/agent-server:a53f05306b32b5aa177c9f25e7e8a2b4debf64ce-golang
ghcr.io/openhands/agent-server:openhands-sdk-10-nemotron-prompt-cache-allowlist-golang
ghcr.io/openhands/agent-server:a53f053-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:a53f053-java
ghcr.io/openhands/agent-server:a53f05306b32b5aa177c9f25e7e8a2b4debf64ce-java
ghcr.io/openhands/agent-server:openhands-sdk-10-nemotron-prompt-cache-allowlist-java
ghcr.io/openhands/agent-server:a53f053-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:a53f053-python
ghcr.io/openhands/agent-server:a53f05306b32b5aa177c9f25e7e8a2b4debf64ce-python
ghcr.io/openhands/agent-server:openhands-sdk-10-nemotron-prompt-cache-allowlist-python
ghcr.io/openhands/agent-server:a53f053-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

  • Each variant tag (e.g., a53f053-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., a53f053-python-amd64) are also available if needed

Add NVIDIA Nemotron-3 Ultra (550B MoE) to PROMPT_CACHE_MODELS so that
`get_features(model).supports_prompt_cache` returns True for every
deployed routing form:

  - litellm_proxy/nemotron-3-ultra-550b-a55b   (current eval route)
  - openrouter/nvidia/nemotron-3-ultra-550b-a55b
  - deepinfra/nvidia/Nemotron-3-Ultra-550B-A55B  (planned)

When this gate is True, the SDK starts attaching
`cache_control: {"type": "ephemeral"}` markers to the long stable
prefix (system prompt + tool definitions + last user/tool turn), the
same scheme that gives Sonnet its ~3x cost reduction on long agent
conversations.

## Why this is safe to ship before the infra route changes

The current OpenRouter route to Nemotron goes through DeepInfra, which
does NOT honor cache_control markers (verified against OpenRouter's
official prompt-caching docs at
https://openrouter.ai/docs/guides/best-practices/prompt-caching —
NVIDIA / DeepInfra are not in the supported-provider list; the model
dashboard shows 0.2% global cache hit rate). On that route, sending
the markers is a silent no-op: providers that don't recognize the
field ignore it; no 400s, no behavior change.

This PR ships SDK-side now so that when the companion infra change
lands (routing Nemotron through a provider that DOES honor caching —
DeepInfra direct, NVIDIA NIM direct, or self-hosted vLLM ≥0.6.5),
caching activates immediately without requiring a coordinated
two-repo release.

## Why not just match "nemotron"

Each Nemotron-family model has its own caching story and deployment
path. Bulk-matching "nemotron" would silently enable markers for
Nemotron 3.5 Content Safety, Llama Nemotron Rerank, and any future
NVIDIA-family entries — none of which have been verified. The test
suite pins this: two negative cases (`nvidia/llama-nemotron-rerank-vl-1b`,
`nvidia/nemotron-3.5-content-safety`) fail loudly if anyone broadens
the substring without re-verifying.

## Tests

`tests/sdk/llm/test_model_features.py::test_prompt_cache_support` —
5 new parametrized cases (3 positive routing-form variants, 2 negative
Nemotron-family false-positive guards). All 154 existing
test_model_features cases still pass.

Lint + pyright clean.

Co-authored-by: openhands <openhands@all-hands.dev>
@juanmichelini juanmichelini added the enhancement New feature or request label Jun 11, 2026 — with OpenHands AI
@github-actions

Copy link
Copy Markdown
Contributor

Python API breakage checks — ✅ PASSED

Result:PASSED

Action log

@github-actions

Copy link
Copy Markdown
Contributor

REST API breakage checks (OpenAPI) — ✅ PASSED

Result:PASSED

Action log

@github-actions

Copy link
Copy Markdown
Contributor

Coverage

Coverage Report •
FileStmtsMissCoverMissing
openhands-sdk/openhands/sdk/llm/utils
   model_features.py67198%38
TOTAL30500843872% 

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants