feat(bedrock): add prompt caching via CachePoint markers#1940
Open
skashmeri wants to merge 1 commit into
Open
Conversation
Closes kagent-dev#1871. ## Why this is needed Multi-turn tool-using kagent agents on Bedrock pay full input-token cost on every Converse call, because the static prefix (system prompt + tool definitions) is re-sent and re-billed each turn. Real measurement from a production deployment using Claude Sonnet 4.5 via the `us.` inference profile in us-east-1, running every 2 hours against a ~700-pod EKS cluster: - Per sweep: ~4 Converse calls - Cumulative input tokens (CloudWatch InvokeModel metric): ~313k - Cumulative output tokens: ~3k - Per-sweep cost: ~$0.98 (input dominates ~95%) - Per cluster/year (5 sweeps/weekday): ~$1,300 - Per cluster/year (24/7 every 2h): projected ~$4-9k ~30k of the per-call input is identical across every call — system prompt and tool definitions don't change inside a single task. Bedrock prompt caching is designed precisely for this case: a `cachePoint` block in the Converse request marks where the cacheable prefix ends, and subsequent calls within ~5 minutes (per region) hit the cache and bill the prefix at a reduced rate. The Bedrock provider builds Converse requests using `system: [textBlock]` and `toolConfig.tools: [...]` but never appends a `cachePoint` block to either array, so caching is never engaged. ## Why this is not redundant with the existing `spec.declarative.compaction` kagent already has an Agent-level context-compaction feature (`Compaction`, `CompactionInterval`, `Summarizer`, `TokenThreshold`, etc.) that summarizes old conversation turns when the conversation exceeds a token threshold. That solves a different problem: - Compaction: shrinks the conversation prompt when it gets too long. Helps with context-window pressure on long-running agents. - Prompt caching: keeps the prompt the same size but tells Bedrock "the first N tokens are stable across calls, cache them and bill cached portion at the reduced rate." Neither replaces the other. For a tool-using agent whose conversation stays under the context limit but whose static prefix (system prompt + tool defs) is large, prompt caching is the right hammer; compaction does nothing because there's nothing to compact in the static prefix. ## What this PR does Adds a `promptCaching: bool` field to `BedrockConfig` (defaulting to `false` to preserve existing behavior). When set, the provider appends a `CachePoint` block: 1. To the end of the `system` content array (after the system text block) 2. To the end of the `toolConfig.tools` array (after the last ToolSpec) Markers use `CachePointTypeDefault`. Bedrock silently ignores cache points on models that don't support prompt caching, so the field is safe to enable on a heterogeneous model fleet without per-model gating. Tested against `us.anthropic.claude-sonnet-4-5-20250929-v1:0`: the second and subsequent Converse calls within the cache window drop their input-token billing by ~70-90% on cache hits, depending on which static portion (system vs tools vs both) is being hit. ## Implementation surface Mirrors the change across both runtimes — Go (for `runtime: go` agents) and Python (for `runtime: python` agents): Go: - `go/api/v1alpha2/modelconfig_types.go`: add `PromptCaching` to `BedrockConfig` CRD struct with full doc + kubebuilder default. - `go/api/adk/types.go`: add `PromptCaching` to the internal `adk.Bedrock` serializable model so it flows through agent config JSON. - `go/core/internal/controller/translator/agent/adk_api_translator.go`: populate the new field when translating ModelConfig CR -> adk.Bedrock. - `go/adk/pkg/agent/agent.go`: thread the value into `models.BedrockConfig`. - `go/adk/pkg/models/bedrock.go`: emit the cache point markers in the Converse request builders. Python: - `python/.../adk/types.py`: add `prompt_caching: bool` to the `Bedrock` Pydantic model and pass through to `KAgentBedrockLlm` factory. - `python/.../adk/models/_bedrock.py`: append `{"cachePoint": {"type": "default"}}` to `kwargs["system"]` and `kwargs["toolConfig"]["tools"]` when enabled. Regenerated CRDs via `make controller-manifests` so `helm/kagent-crds/templates/kagent.dev_modelconfigs.yaml` reflects the new schema field. ## Tests `go/adk/pkg/models/bedrock_test.go`: new `TestConvertGenaiToolsToBedrockPromptCaching` covering three cases: disabled = no marker, enabled = marker appended at END of tool list with default type, enabled-but-no-tools = no marker (no point in a standalone marker). Existing `convertGenaiToolsToBedrock` callers updated to pass the new `promptCaching bool` argument as `false` (no behavior change). ## Backward compatibility - `promptCaching` defaults to `false` everywhere; existing ModelConfigs pick up no behavior change. - Serialized `adk.Bedrock` JSON uses `omitempty` for the new field; older agent pods deserializing newer config see an unknown field they ignore (Pydantic + Go json decoders both lenient by default). - The Converse API tolerates and ignores `CachePoint` markers on models that don't support caching, so enabling on a mixed-model setup is safe. ## Example usage ```yaml apiVersion: kagent.dev/v1alpha2 kind: ModelConfig metadata: name: bedrock-claude spec: provider: Bedrock model: us.anthropic.claude-sonnet-4-5-20250929-v1:0 bedrock: region: us-east-1 promptCaching: true ``` Signed-off-by: Shamil Kashmeri <shamil@viafoura.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds an opt-in “prompt caching” capability for Amazon Bedrock (Converse API) across the CRD/API layers and both Go + Python Bedrock runtime implementations by inserting Bedrock CachePoint markers into requests.
Changes:
- Introduces a
promptCaching/prompt_cachingflag in ModelConfig/ADK types and plumbs it through translators and model factories. - Updates Go Bedrock request building to append
CachePointmarkers tosystemandtoolConfig.toolsand adjusts tool conversion signature + tests. - Updates Python Bedrock (Converse) request building to append
CachePointmarkers tosystemandtoolConfig.tools.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| python/packages/kagent-adk/src/kagent/adk/types.py | Adds prompt_caching to the Python Bedrock model config type and forwards it during LLM creation. |
| python/packages/kagent-adk/src/kagent/adk/models/_bedrock.py | Implements Bedrock Converse request mutation to append cachePoint markers when enabled. |
| helm/kagent-crds/templates/kagent.dev_modelconfigs.yaml | Extends CRD schema with promptCaching for Bedrock. |
| go/core/internal/controller/translator/agent/adk_api_translator.go | Wires CRD PromptCaching into the ADK Bedrock model payload. |
| go/api/v1alpha2/modelconfig_types.go | Adds PromptCaching to the BedrockConfig CRD Go type with docs/default. |
| go/api/config/crd/bases/kagent.dev_modelconfigs.yaml | Extends generated CRD schema with promptCaching for Bedrock. |
| go/api/adk/types.go | Adds PromptCaching to the Go ADK Bedrock type for JSON payloads. |
| go/adk/pkg/models/bedrock_test.go | Updates tool conversion calls + adds tests for cache-point insertion behavior. |
| go/adk/pkg/models/bedrock.go | Adds config flag, appends cache markers to system/tools, and threads flag into tool conversion. |
| go/adk/pkg/agent/agent.go | Passes PromptCaching from ADK model into Bedrock model config. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+297
to
+302
| # If prompt caching is on, mark the end of the system content as | ||
| # a cache breakpoint. Bedrock caches everything up to and including | ||
| # this point for ~5 minutes; subsequent requests with the same | ||
| # prefix hit the cache. No-op if we didn't produce any system text. | ||
| if self.prompt_caching and kwargs.get("system"): | ||
| kwargs["system"].append({"cachePoint": {"type": "default"}}) |
Comment on lines
+309
to
314
| # CachePoint at the END of the tool list: tool definitions | ||
| # are usually the biggest static chunk of an agent request | ||
| # and benefit most from caching. | ||
| if self.prompt_caching: | ||
| converse_tools.append({"cachePoint": {"type": "default"}}) | ||
| kwargs["toolConfig"] = {"tools": converse_tools} |
Comment on lines
+262
to
+265
| // the end of the `tools` array. Bedrock will cache the prefix up to and | ||
| // including those cache points across requests in the same region for | ||
| // roughly 5 minutes after first use, billing the cached portion at a | ||
| // reduced rate on cache hits. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
feat(bedrock): add prompt caching via CachePoint markers
Closes #1871.
Why this is needed
Multi-turn tool-using kagent agents on Bedrock pay full input-token cost
on every Converse call, because the static prefix (system prompt + tool
definitions) is re-sent and re-billed each turn. Real measurement from a
production deployment using Claude Sonnet 4.5 via the
us.inferenceprofile in us-east-1, running every 2 hours against a ~700-pod EKS
cluster:
~30k of the per-call input is identical across every call — system
prompt and tool definitions don't change inside a single task. Bedrock
prompt caching is designed precisely for this case: a
cachePointblockin the Converse request marks where the cacheable prefix ends, and
subsequent calls within ~5 minutes (per region) hit the cache and bill
the prefix at a reduced rate.
The Bedrock provider builds Converse requests using
system: [textBlock]andtoolConfig.tools: [...]but never appends acachePointblock to either array, so caching is never engaged.Why this is not redundant with the existing
spec.declarative.compactionkagent already has an Agent-level context-compaction feature
(
Compaction,CompactionInterval,Summarizer,TokenThreshold,etc.) that summarizes old conversation turns when the conversation
exceeds a token threshold. That solves a different problem:
Helps with context-window pressure on long-running agents.
"the first N tokens are stable across calls, cache them and bill
cached portion at the reduced rate."
Neither replaces the other. For a tool-using agent whose conversation
stays under the context limit but whose static prefix (system prompt +
tool defs) is large, prompt caching is the right hammer; compaction does
nothing because there's nothing to compact in the static prefix.
What this PR does
Adds a
promptCaching: boolfield toBedrockConfig(defaulting tofalseto preserve existing behavior). When set, the provider appendsa
CachePointblock:systemcontent array (after the system text block)toolConfig.toolsarray (after the last ToolSpec)Markers use
CachePointTypeDefault. Bedrock silently ignores cachepoints on models that don't support prompt caching, so the field is
safe to enable on a heterogeneous model fleet without per-model
gating.
Tested against
us.anthropic.claude-sonnet-4-5-20250929-v1:0: thesecond and subsequent Converse calls within the cache window drop their
input-token billing by ~70-90% on cache hits, depending on which static
portion (system vs tools vs both) is being hit.
Implementation surface
Mirrors the change across both runtimes — Go (for
runtime: goagents)and Python (for
runtime: pythonagents):Go:
go/api/v1alpha2/modelconfig_types.go: addPromptCachingtoBedrockConfigCRD struct with full doc + kubebuilder default.go/api/adk/types.go: addPromptCachingto the internaladk.Bedrockserializable model so it flows through agent config JSON.go/core/internal/controller/translator/agent/adk_api_translator.go:populate the new field when translating ModelConfig CR -> adk.Bedrock.
go/adk/pkg/agent/agent.go: thread the value intomodels.BedrockConfig.go/adk/pkg/models/bedrock.go: emit the cache point markers in theConverse request builders.
Python:
python/.../adk/types.py: addprompt_caching: boolto theBedrockPydantic model and pass through toKAgentBedrockLlmfactory.python/.../adk/models/_bedrock.py: append{"cachePoint": {"type": "default"}}tokwargs["system"]andkwargs["toolConfig"]["tools"]when enabled.Regenerated CRDs via
make controller-manifestssohelm/kagent-crds/templates/kagent.dev_modelconfigs.yamlreflects thenew schema field.
Tests
go/adk/pkg/models/bedrock_test.go: newTestConvertGenaiToolsToBedrockPromptCachingcovering three cases:disabled = no marker, enabled = marker appended at END of tool list
with default type, enabled-but-no-tools = no marker (no point in a
standalone marker).
Existing
convertGenaiToolsToBedrockcallers updated to pass the newpromptCaching boolargument asfalse(no behavior change).Backward compatibility
promptCachingdefaults tofalseeverywhere; existingModelConfigs pick up no behavior change.
adk.BedrockJSON usesomitemptyfor the new field;older agent pods deserializing newer config see an unknown field
they ignore (Pydantic + Go json decoders both lenient by default).
CachePointmarkers onmodels that don't support caching, so enabling on a mixed-model
setup is safe.
Example usage
Signed-off-by: Shamil Kashmeri shamil@viafoura.com