
# feat(bedrock): add prompt caching via CachePoint markers (#1940)

Open: skashmeri wants to merge 1 commit into `kagent-dev:main` from `skashmeri:feat/bedrock-prompt-caching`.

Closes kagent-dev#1871.

## Why this is needed

Multi-turn tool-using kagent agents on Bedrock pay full input-token cost
on every Converse call, because the static prefix (system prompt + tool
definitions) is re-sent and re-billed each turn. Measurements from a
production deployment using Claude Sonnet 4.5 via the `us.` inference
profile in us-east-1, running every 2 hours against a ~700-pod EKS
cluster:

  - Per sweep: ~4 Converse calls
  - Cumulative input tokens (CloudWatch InvokeModel metric): ~313k
  - Cumulative output tokens: ~3k
  - Per-sweep cost: ~$0.98 (input dominates ~95%)
  - Per cluster/year (5 sweeps/weekday): ~$1,300
  - Per cluster/year (24/7 every 2h): projected ~$4-9k
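The per-sweep and annual numbers can be reproduced with a few lines of arithmetic. The sketch below assumes on-demand Claude Sonnet 4.5 pricing of $3 per million input tokens and $15 per million output tokens, plus ~260 weekdays per year. These are illustrative assumptions, not figures from the deployment, and `sweepCost` is a hypothetical helper.

```go
package main

import "fmt"

// Assumed on-demand prices in USD per million tokens (not from the measurement).
const (
	inputPricePerM  = 3.0
	outputPricePerM = 15.0
)

// sweepCost returns the USD cost of one sweep from its cumulative token counts.
func sweepCost(inputTokens, outputTokens float64) float64 {
	return inputTokens/1e6*inputPricePerM + outputTokens/1e6*outputPricePerM
}

func main() {
	perSweep := sweepCost(313_000, 3_000)
	inputShare := (313_000 / 1e6 * inputPricePerM) / perSweep

	weekdaySweeps := 5.0 * 260     // 5 sweeps per weekday, ~260 weekdays (assumed)
	continuousSweeps := 12.0 * 365 // every 2h, 24/7

	fmt.Printf("per sweep: $%.2f (input %.0f%%)\n", perSweep, inputShare*100)
	fmt.Printf("weekday schedule: $%.0f/year\n", perSweep*weekdaySweeps)
	fmt.Printf("24/7 schedule: $%.0f/year\n", perSweep*continuousSweeps)
}
```

Under those assumptions it prints about $0.98 per sweep, with input at ~95%. It gives about $1.3k/year for the weekday schedule and about $4.3k/year for 12 sweeps a day year-round, which is the low end of the projected range.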

About 30k tokens of each call's input (~78k on average: ~313k across ~4
calls) are identical from call to call, because the system prompt and tool
definitions don't change inside a single task. Bedrock prompt caching is
designed for exactly this case. A `cachePoint` block in the Converse request
marks where the cacheable prefix ends, and subsequent calls within ~5 minutes
(per region) hit the cache and bill the prefix at a reduced rate.

The Bedrock provider builds Converse requests using
`system: [textBlock]` and `toolConfig.tools: [...]` but never appends a
`cachePoint` block to either array, so caching is never engaged.
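The following rough per-sweep estimate shows what caching could save on the ~30k static prefix. It assumes a $3/M base input rate and Anthropic's Claude cache multipliers: cache writes billed at 1.25x the base input rate and cache reads at 0.1x. Check current Bedrock pricing before relying on these figures. It also assumes that all ~4 calls of a sweep fall inside one ~5-minute cache window. `prefixCost` is a hypothetical helper.

```go
package main

import "fmt"

// Assumptions (not measured): base input price and Claude-style cache multipliers.
const (
	baseInputPerM  = 3.0  // USD per million uncached input tokens
	cacheWriteMult = 1.25 // cache writes billed at 125% of base
	cacheReadMult  = 0.10 // cache reads billed at 10% of base
)

// prefixCost returns the USD cost of the static prefix across `calls` Converse
// calls in one sweep. Without caching, every call pays full price. With caching,
// the first call writes the cache and the rest read it (all assumed to land
// inside a single cache window).
func prefixCost(prefixTokens float64, calls int, caching bool) float64 {
	full := prefixTokens / 1e6 * baseInputPerM // cost of one uncached prefix
	if !caching || calls == 0 {
		return full * float64(calls)
	}
	return full*cacheWriteMult + full*cacheReadMult*float64(calls-1)
}

func main() {
	without := prefixCost(30_000, 4, false)
	cached := prefixCost(30_000, 4, true)
	fmt.Printf("prefix without caching: $%.3f\n", without)
	fmt.Printf("prefix with caching:    $%.3f (%.0f%% less)\n", cached, (1-cached/without)*100)
}
```

Under these assumptions, the prefix cost per sweep falls from $0.36 to about $0.14, roughly 61% less. Sweeps run 2 hours apart, well beyond the cache window. Each sweep therefore starts cold, and its first call pays the write premium.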

## Why this is not redundant with the existing `spec.declarative.compaction`

kagent already has an Agent-level context-compaction feature
(`Compaction`, `CompactionInterval`, `Summarizer`, `TokenThreshold`,
etc.) that summarizes old conversation turns when the conversation
exceeds a token threshold. That solves a different problem:

  - Compaction: shrinks the conversation prompt when it gets too long.
    Helps with context-window pressure on long-running agents.
  - Prompt caching: keeps the prompt the same size but tells Bedrock
    "the first N tokens are stable across calls; cache them and bill the
    cached portion at the reduced rate."

Neither replaces the other. For a tool-using agent whose conversation
stays under the context limit but whose static prefix (system prompt +
tool defs) is large, prompt caching is the right tool; compaction does
nothing because there's nothing to compact in the static prefix.

## What this PR does

Adds a `promptCaching: bool` field to `BedrockConfig` (defaulting to
`false` to preserve existing behavior). When set, the provider appends
a `CachePoint` block:

  1. To the end of the `system` content array (after the system text block)
  2. To the end of the `toolConfig.tools` array (after the last ToolSpec)
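The placement rule above can be sketched with plain `encoding/json` maps in the Converse wire format, using the same `{"cachePoint": {"type": "default"}}` shape the Python runtime appends. This is not the Go runtime's code, which uses the AWS SDK's typed blocks. `buildConverseBody` and the sample tool are hypothetical, and a real `toolSpec` would also carry an `inputSchema`, omitted here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cachePoint returns a cache marker in Converse JSON shape.
func cachePoint() map[string]any {
	return map[string]any{"cachePoint": map[string]any{"type": "default"}}
}

// buildConverseBody (hypothetical) assembles the system and toolConfig parts of
// a Converse request. When promptCaching is set, it appends a cache marker after
// the last system block and after the last tool. Empty arrays get no marker.
func buildConverseBody(systemText string, toolSpecs []map[string]any, promptCaching bool) map[string]any {
	body := map[string]any{}

	if systemText != "" {
		system := []map[string]any{{"text": systemText}}
		if promptCaching {
			system = append(system, cachePoint())
		}
		body["system"] = system
	}

	if len(toolSpecs) > 0 {
		tools := make([]map[string]any, 0, len(toolSpecs)+1)
		for _, spec := range toolSpecs {
			tools = append(tools, map[string]any{"toolSpec": spec})
		}
		if promptCaching {
			tools = append(tools, cachePoint())
		}
		body["toolConfig"] = map[string]any{"tools": tools}
	}
	return body
}

func main() {
	spec := map[string]any{"name": "get_pods", "description": "List pods"} // hypothetical tool
	body := buildConverseBody("You are a Kubernetes assistant.", []map[string]any{spec}, true)
	out, _ := json.MarshalIndent(body, "", "  ")
	fmt.Println(string(out))
}
```

Keeping the marker last in each array puts everything before it (all system text, all tool specs) inside the cached prefix. Skipping the marker on an empty array matches the enabled-but-no-tools case in the tests.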

Markers use `CachePointTypeDefault`. Bedrock silently ignores cache
points on models that don't support prompt caching, so the field is
safe to enable on a heterogeneous model fleet without per-model
gating.

Tested against `us.anthropic.claude-sonnet-4-5-20250929-v1:0`: the
second and subsequent Converse calls within the cache window drop their
input-token billing by ~70-90% on cache hits, depending on which static
portion (system vs tools vs both) is being hit.

## Implementation surface

Mirrors the change across both runtimes — Go (for `runtime: go` agents)
and Python (for `runtime: python` agents):

Go:
  - `go/api/v1alpha2/modelconfig_types.go`: add `PromptCaching` to
    `BedrockConfig` CRD struct with full doc + kubebuilder default.
  - `go/api/adk/types.go`: add `PromptCaching` to the internal
    `adk.Bedrock` serializable model so it flows through agent config JSON.
  - `go/core/internal/controller/translator/agent/adk_api_translator.go`:
    populate the new field when translating ModelConfig CR -> adk.Bedrock.
  - `go/adk/pkg/agent/agent.go`: thread the value into `models.BedrockConfig`.
  - `go/adk/pkg/models/bedrock.go`: emit the cache point markers in the
    Converse request builders.

Python:
  - `python/.../adk/types.py`: add `prompt_caching: bool` to the
    `Bedrock` Pydantic model and pass through to `KAgentBedrockLlm` factory.
  - `python/.../adk/models/_bedrock.py`: append
    `{"cachePoint": {"type": "default"}}` to `kwargs["system"]` and
    `kwargs["toolConfig"]["tools"]` when enabled.

Regenerated CRDs via `make controller-manifests` so
`helm/kagent-crds/templates/kagent.dev_modelconfigs.yaml` reflects the
new schema field.
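The Go plumbing listed above amounts to copying one boolean through three types. The following is a minimal sketch using trimmed-down stand-ins for the CRD type, `adk.Bedrock`, and `models.BedrockConfig`. The struct shapes, the `adk` JSON key, and the `translateBedrock` and `runtimeConfig` helpers are hypothetical.

```go
package main

import "fmt"

// BedrockConfig stands in for the CRD type in go/api/v1alpha2 (fields trimmed).
type BedrockConfig struct {
	Region string `json:"region"`
	// PromptCaching appends CachePoint markers to Converse requests.
	// +kubebuilder:default=false
	PromptCaching bool `json:"promptCaching,omitempty"`
}

// adkBedrock stands in for adk.Bedrock, the JSON payload sent to agent pods.
type adkBedrock struct {
	Region        string `json:"region"`
	PromptCaching bool   `json:"promptCaching,omitempty"` // key name assumed
}

// modelsBedrockConfig stands in for models.BedrockConfig in the Go runtime.
type modelsBedrockConfig struct {
	Region        string
	PromptCaching bool
}

// translateBedrock mirrors the translator step: ModelConfig CR -> adk.Bedrock.
func translateBedrock(cr BedrockConfig) adkBedrock {
	return adkBedrock{Region: cr.Region, PromptCaching: cr.PromptCaching}
}

// runtimeConfig mirrors agent.go: adk.Bedrock -> models.BedrockConfig.
func runtimeConfig(b adkBedrock) modelsBedrockConfig {
	return modelsBedrockConfig{Region: b.Region, PromptCaching: b.PromptCaching}
}

func main() {
	cfg := runtimeConfig(translateBedrock(BedrockConfig{Region: "us-east-1", PromptCaching: true}))
	fmt.Printf("%+v\n", cfg)
}
```

Because the zero value of `bool` is `false`, any hop that forgets to copy the field silently disables caching rather than failing. That is why the translator and `agent.go` changes matter as much as the request builder.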

## Tests

`go/adk/pkg/models/bedrock_test.go`: new
`TestConvertGenaiToolsToBedrockPromptCaching` covering three cases:

  - disabled: no marker;
  - enabled: marker appended at the end of the tool list, with the default type;
  - enabled but no tools: no marker (a standalone marker would be pointless).

Existing `convertGenaiToolsToBedrock` callers updated to pass the new
`promptCaching bool` argument as `false` (no behavior change).

## Backward compatibility

  - `promptCaching` defaults to `false` everywhere; existing
    ModelConfigs pick up no behavior change.
  - Serialized `adk.Bedrock` JSON uses `omitempty` for the new field;
    older agent pods deserializing newer config see an unknown field
    they ignore (Pydantic + Go json decoders both lenient by default).
  - The Converse API tolerates and ignores `CachePoint` markers on
    models that don't support caching, so enabling on a mixed-model
    setup is safe.

## Example usage

```yaml
apiVersion: kagent.dev/v1alpha2
kind: ModelConfig
metadata:
  name: bedrock-claude
spec:
  provider: Bedrock
  model: us.anthropic.claude-sonnet-4-5-20250929-v1:0
  bedrock:
    region: us-east-1
    promptCaching: true
```

Signed-off-by: Shamil Kashmeri <shamil@viafoura.com>
Copilot AI review requested due to automatic review settings (May 27, 2026 19:28).

github-actions (bot) added the `enhancement` (New feature or request) label (May 27, 2026).

Copilot AI left a comment


Pull request overview

Note: Copilot was unable to run its full agentic suite in this review.

Adds an opt-in “prompt caching” capability for Amazon Bedrock (Converse API) across the CRD/API layers and both Go + Python Bedrock runtime implementations by inserting Bedrock CachePoint markers into requests.

Changes:

  - Introduces a `promptCaching`/`prompt_caching` flag in ModelConfig/ADK types and plumbs it through translators and model factories.
  - Updates Go Bedrock request building to append CachePoint markers to `system` and `toolConfig.tools`, and adjusts the tool-conversion signature and tests.
  - Updates Python Bedrock (Converse) request building to append CachePoint markers to `system` and `toolConfig.tools`.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| `python/packages/kagent-adk/src/kagent/adk/types.py` | Adds `prompt_caching` to the Python Bedrock model config type and forwards it during LLM creation. |
| `python/packages/kagent-adk/src/kagent/adk/models/_bedrock.py` | Implements Bedrock Converse request mutation to append `cachePoint` markers when enabled. |
| `helm/kagent-crds/templates/kagent.dev_modelconfigs.yaml` | Extends CRD schema with `promptCaching` for Bedrock. |
| `go/core/internal/controller/translator/agent/adk_api_translator.go` | Wires CRD `PromptCaching` into the ADK Bedrock model payload. |
| `go/api/v1alpha2/modelconfig_types.go` | Adds `PromptCaching` to the `BedrockConfig` CRD Go type with docs/default. |
| `go/api/config/crd/bases/kagent.dev_modelconfigs.yaml` | Extends generated CRD schema with `promptCaching` for Bedrock. |
| `go/api/adk/types.go` | Adds `PromptCaching` to the Go ADK Bedrock type for JSON payloads. |
| `go/adk/pkg/models/bedrock_test.go` | Updates tool conversion calls and adds tests for cache-point insertion behavior. |
| `go/adk/pkg/models/bedrock.go` | Adds config flag, appends cache markers to system/tools, and threads flag into tool conversion. |
| `go/adk/pkg/agent/agent.go` | Passes `PromptCaching` from ADK model into Bedrock model config. |


Comment on lines +297 to +302

```python
# If prompt caching is on, mark the end of the system content as
# a cache breakpoint. Bedrock caches everything up to and including
# this point for ~5 minutes; subsequent requests with the same
# prefix hit the cache. No-op if we didn't produce any system text.
if self.prompt_caching and kwargs.get("system"):
    kwargs["system"].append({"cachePoint": {"type": "default"}})
```

Comment on lines +309 to 314

```python
# CachePoint at the END of the tool list: tool definitions
# are usually the biggest static chunk of an agent request
# and benefit most from caching.
if self.prompt_caching:
    converse_tools.append({"cachePoint": {"type": "default"}})
kwargs["toolConfig"] = {"tools": converse_tools}
```

Comment on lines +262 to +265

```go
// the end of the `tools` array. Bedrock will cache the prefix up to and
// including those cache points across requests in the same region for
// roughly 5 minutes after first use, billing the cached portion at a
// reduced rate on cache hits.
```

Labels: enhancement (New feature or request)


Development: successfully merging this pull request may close the issue "[FEATURE] Bedrock prompt caching".
