DRAFT: feat(llm): allowlist nemotron-3-ultra for prompt caching markers#3671
Draft
juanmichelini wants to merge 1 commit into
Draft
DRAFT: feat(llm): allowlist nemotron-3-ultra for prompt caching markers#3671juanmichelini wants to merge 1 commit into
nemotron-3-ultra for prompt caching markers#3671juanmichelini wants to merge 1 commit into
Conversation
Add NVIDIA Nemotron-3 Ultra (550B MoE) to PROMPT_CACHE_MODELS so that
`get_features(model).supports_prompt_cache` returns True for every
deployed routing form:
- litellm_proxy/nemotron-3-ultra-550b-a55b (current eval route)
- openrouter/nvidia/nemotron-3-ultra-550b-a55b
- deepinfra/nvidia/Nemotron-3-Ultra-550B-A55B (planned)
When this gate is True, the SDK starts attaching
`cache_control: {"type": "ephemeral"}` markers to the long stable
prefix (system prompt + tool definitions + last user/tool turn), the
same scheme that gives Sonnet its ~3x cost reduction on long agent
conversations.
## Why this is safe to ship before the infra route changes
The current OpenRouter route to Nemotron goes through DeepInfra, which
does NOT honor cache_control markers (verified against OpenRouter's
official prompt-caching docs at
https://openrouter.ai/docs/guides/best-practices/prompt-caching —
NVIDIA / DeepInfra are not in the supported-provider list; the model
dashboard shows 0.2% global cache hit rate). On that route, sending
the markers is a silent no-op: providers that don't recognize the
field ignore it; no 400s, no behavior change.
This PR ships SDK-side now so that when the companion infra change
lands (routing Nemotron through a provider that DOES honor caching —
DeepInfra direct, NVIDIA NIM direct, or self-hosted vLLM ≥0.6.5),
caching activates immediately without requiring a coordinated
two-repo release.
## Why not just match "nemotron"
Each Nemotron-family model has its own caching story and deployment
path. Bulk-matching "nemotron" would silently enable markers for
Nemotron 3.5 Content Safety, Llama Nemotron Rerank, and any future
NVIDIA-family entries — none of which have been verified. The test
suite pins this: two negative cases (`nvidia/llama-nemotron-rerank-vl-1b`,
`nvidia/nemotron-3.5-content-safety`) fail loudly if anyone broadens
the substring without re-verifying.
## Tests
`tests/sdk/llm/test_model_features.py::test_prompt_cache_support` —
5 new parametrized cases (3 positive routing-form variants, 2 negative
Nemotron-family false-positive guards). All 154 existing
test_model_features cases still pass.
Lint + pyright clean.
Co-authored-by: openhands <openhands@all-hands.dev>
Contributor
Python API breakage checks — ✅ PASSEDResult: ✅ PASSED |
Contributor
REST API breakage checks (OpenAPI) — ✅ PASSEDResult: ✅ PASSED |
Contributor
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
TL;DR
Add
nemotron-3-ultratoPROMPT_CACHE_MODELSso the SDK starts attachingcache_control: {"type": "ephemeral"}markers to the long stable prefix on Nemotron requests. Half of two pieces — this PR is the SDK-side change; the infra-side change (routing Nemotron to a provider that actually honors the markers) is tracked separately.What this PR does
Single addition to
openhands-sdk/openhands/sdk/llm/utils/model_features.py:model_matchesdoes case-insensitive substring matching, so the single token covers every routing form we deploy:litellm_proxy/nemotron-3-ultra-550b-a55bopenrouter/nvidia/nemotron-3-ultra-550b-a55bdeepinfra/nvidia/Nemotron-3-Ultra-550B-A55B(planned)nvidia/llama-nemotron-rerank-vl-1bnvidia/nemotron-3.5-content-safetyThe last two are deliberate negatives — they're a different Nemotron family with a different caching story, and the test suite pins this so a future broadening of the substring fails loudly.
Why this is safe to ship before infra changes
OpenRouter's prompt-caching docs list 8 supported providers: OpenAI · Grok · Moonshot AI · Groq · Alibaba Qwen · Anthropic · DeepSeek · Google Gemini. NVIDIA and DeepInfra are not on this list. And the Nemotron-3 Ultra model page shows a single provider (DeepInfra) with a 0.2% global cache hit rate.
On a route that doesn't honor
cache_control, sending the markers is a silent no-op:self.caching_prompt and supports_prompt_cache— only the second half flips here. The first half defaults toTrue.Bench check: I ran the full
test_model_features.pyafter the change: 154 passed, including the 5 new parametrized Nemotron cases. No other test in the SDK depends on this particular substring.Why ship now (the actual reason)
The companion infra-side change routes Nemotron through a provider that DOES honor caching — DeepInfra direct, NVIDIA NIM direct, or self-hosted vLLM ≥0.6.5. On that route, the SDK must already be sending
cache_controlmarkers for caching to register; otherwise the route switch produces no measurable cost win and we'd be wondering why.Shipping the SDK side first means the infra switch is a one-config-edit that takes effect immediately. Shipping the infra side first means the SDK release becomes a blocker on every eval.
Why not just match
"nemotron"Each Nemotron-family model has its own caching story:
A bare
"nemotron"token would silently enable markers for all three. The two negative test cases (nvidia/llama-nemotron-rerank-vl-1b,nvidia/nemotron-3.5-content-safety) document this intent and prevent a future "broaden the match" PR from regressing it without re-verification.Expected impact (honest)
The 3× number comes from the cost decomposition I did on real trajectories: at 50K-token avg context × 200-400 turns, without caching every turn re-pays the entire history (quadratic). With 80%+ cache hit on the stable prefix billed at ~$0.15/M instead of $0.50/M, the effective per-turn cost drops by ~70%, and conversation-total cost by ~65–70%.
Tests
tests/sdk/llm/test_model_features.py::test_prompt_cache_support— 5 new parametrized cases added to the existing parametrize block:All 154 existing
test_model_featurescases continue to pass. Lint + pyright clean.Companion work (NOT in this PR)
all-hands-ai/infrak8s/evaluation/litellm.yamlprompt_tokens_details.cached_tokensRisk
This PR was created by an AI agent (OpenHands) on behalf of the user investigating per-instance cost on the Nemotron 550B SWE-Bench Verified eval. The change pairs with a planned infra-side change to route Nemotron away from the OpenRouter→DeepInfra path, which doesn't honor
cache_controlper OpenRouter's official docs.@juanmichelini can click here to continue refining the PR
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.13-nodejs22-slimgolang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:a53f053-pythonRun
All tags pushed for this build
About Multi-Architecture Support
a53f053-python) is a multi-arch manifest supporting both amd64 and arm64a53f053-python-amd64) are also available if needed