Turn on explicit prompt caching for your Azure OpenAI endpoints with a single click.
Launch region: Central US · Preview
Modern LLM applications resend the same large prefix on every call: a system prompt, a tool catalog, a multi-page document, a few-shot rubric, a long conversation history. The model re-tokenizes and re-attends to that prefix every single time. You pay full input-token price for content that never changes.
Explicit context caching lets you tell the service "this prefix is stable, keep it warm." The provider stores the tokenized, pre-attended representation of that prefix and, on subsequent requests that begin with the same content, reuses it. The result:
- Lower latency: the cached prefix skips re-tokenization and prefill.
- Lower cost: cached input tokens are billed at a steep discount.
- Higher throughput: freed compute means more concurrent requests at the same capacity.
Unlike implicit (best-effort) caching that some endpoints do opportunistically, explicit caching is contractual: you create a named cache container, you tell the deployment to use it, and your application controls the lifetime.
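To get a feel for the cost effect, here is a minimal sketch of a linear pricing model. The price and the discount are hypothetical placeholders, not published rates; substitute the figures from your own price sheet.

```python
def estimated_input_cost(prefix_tokens, suffix_tokens, price_per_1k,
                         cached_discount, cache_hit):
    """Input-token cost of one request under a simple linear pricing model.

    cached_discount is the fraction taken off the price of cached tokens
    (0.75 means cached tokens cost 25% of the normal price). Both the price
    and the discount are assumptions supplied by the caller.
    """
    if cache_hit:
        prefix_rate = price_per_1k * (1 - cached_discount)
    else:
        prefix_rate = price_per_1k
    return (prefix_tokens * prefix_rate + suffix_tokens * price_per_1k) / 1000


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
cold = estimated_input_cost(8000, 200, price_per_1k=1.0,
                            cached_discount=0.75, cache_hit=False)
warm = estimated_input_cost(8000, 200, price_per_1k=1.0,
                            cached_discount=0.75, cache_hit=True)
print(f"cold: {cold:.2f}  warm: {warm:.2f}")  # cold: 8.20  warm: 2.20
```

The larger the stable prefix is relative to the per-turn suffix, the closer the warm cost gets to the discounted rate.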
Azure exposes explicit caching through a dedicated resource provider, Microsoft.AzureContextCache, that lives in your subscription, in your region, under your RBAC. An Azure OpenAI deployment opts in by setting a single property, properties.contextCacheContainerId, on the deployment resource. Once linked, every chat/completion request sent to that deployment automatically benefits from the cache: no SDK changes, no extra headers.
| Concept | Azure resource |
|---|---|
| Cache namespace for an org/team | Microsoft.AzureContextCache/accounts |
| Cache storage unit for a specific model | Microsoft.AzureContextCache/accounts/containers |
| AOAI deployment that uses the cache | Microsoft.CognitiveServices/accounts/deployments with properties.contextCacheContainerId |
This quickstart packages all three (plus the AOAI account itself) into one ARM template so you can be sending cache-aware requests in about two minutes.
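If you wire the link yourself instead of using the template, the value of properties.contextCacheContainerId is the container's ARM resource id. The sketch below builds that id from its parts using the standard ARM id layout and the resource types in the table above. The subscription id, resource group, and account name are made-up examples, and the fragment shows only the one property named in this README, not a complete deployment body.

```python
def container_resource_id(subscription_id, resource_group, cache_account, container):
    """Build the ARM resource id of a Context Cache container."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        "/providers/Microsoft.AzureContextCache"
        f"/accounts/{cache_account}/containers/{container}"
    )


def link_fragment(container_id):
    """Return only the linking property; sku and model are intentionally omitted."""
    return {"properties": {"contextCacheContainerId": container_id}}


# Example values (hypothetical subscription, resource group, and account name).
cid = container_resource_id(
    "00000000-0000-0000-0000-000000000000",
    "rg-cc-demo",
    "contoso-cache",
    "default-container",
)
print(link_fragment(cid))
```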
flowchart LR
classDef app fill:#0b3d91,stroke:#0b3d91,color:#ffffff
classDef aoai fill:#107c10,stroke:#0b5a0b,color:#ffffff
classDef cache fill:#5c2d91,stroke:#3b1c5c,color:#ffffff
classDef rg fill:#f5f5f5,stroke:#888,color:#222,stroke-dasharray: 4 3
User["🧑‍💻 Your application<br/>(OpenAI SDK · Responses API)"]:::app
subgraph SUB["Your Azure subscription"]
direction TB
subgraph RG["Resource group · Central US"]
direction LR
subgraph AOAIBOX["☁️ Azure OpenAI account"]
AOAI["Microsoft.CognitiveServices/accounts<br/>kind = OpenAI · SKU S0"]:::aoai
DEP["Deployment<br/><b>context-cache-deployment</b><br/>model gpt-5.4 · 2026-03-05-contextcache"]:::aoai
AOAI --- DEP
end
subgraph CACHEBOX["⚡ Azure Context Cache"]
ACC["Microsoft.AzureContextCache/accounts<br/>kind = Regional"]:::cache
CONT["Container<br/><b>default-container</b><br/>model gpt-5.4 · TTL 7d"]:::cache
ACC --- CONT
end
DEP -. "properties.<br/>contextCacheContainerId" .-> CONT
end
end
User -- "POST /responses<br/>(unchanged API)" --> DEP
DEP == "cached prefix<br/>hit / miss" ==> CONT
class RG rg
class SUB rg
Request path under the covers
- Your app calls the AOAI deployment endpoint exactly as it does today (Responses API).
- The deployment, because properties.contextCacheContainerId is set, consults the linked Context Cache container for a matching prefix.
- On a hit, the cached tokenized/pre-attended state is reused; only the suffix (your new turn) is processed end-to-end. You are billed for cached input tokens at the discounted rate.
- On a miss, the deployment processes the full prompt normally and writes the prefix into the container for future requests, respecting the container's timeToLive.
The cache container lives in your subscription so you control isolation, region residency, TTL, and lifecycle, and you can swap models or rotate cache contents without touching the AOAI account.
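The hit/miss/TTL behavior described above can be illustrated with a toy in-memory model. This is only a sketch for reasoning about the flow, not how the service is implemented. It keys on a hash of the whole prefix string and stores no model state. The class name, the injected clock, and the choice not to refresh the TTL on a hit are all assumptions.

```python
import hashlib
import time


class ToyPrefixCache:
    """Maps a hash of the prefix bytes to an expiry time (illustration only)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be demonstrated without sleeping
        self._expiry = {}

    def lookup(self, prefix: str) -> str:
        """Return "hit" if this exact prefix is cached and unexpired;
        otherwise store it with a fresh TTL and return "miss"."""
        key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        now = self.clock()
        expires_at = self._expiry.get(key)
        if expires_at is not None and now < expires_at:
            return "hit"
        self._expiry[key] = now + self.ttl_seconds
        return "miss"


now = [0.0]  # fake clock, advanced by hand
cache = ToyPrefixCache(ttl_seconds=60, clock=lambda: now[0])
results = [
    cache.lookup("system prompt v1"),  # first sight: miss, then stored
    cache.lookup("system prompt v1"),  # identical prefix: hit
    cache.lookup("system prompt v2"),  # one changed character: miss
]
now[0] += 61  # move the fake clock past the TTL
results.append(cache.lookup("system prompt v1"))  # expired: miss again
print(results)  # ['miss', 'hit', 'miss', 'miss']
```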
The button opens the Azure Portal Custom deployment blade pre-loaded with azuredeploy.json. You only need to pick:
| Field | Notes |
|---|---|
| Subscription | Any subscription where the preview features below are registered. |
| Resource group | New or existing; the four resources will be created here. |
| Region | The resource group's region. Pick Central US (centralus, the launch region) or Sweden Central (swedencentral). All four resources inherit this location automatically. |
| Name prefix | 3–12 lowercase letters/digits. Used to derive <prefix>-aoai and <prefix>-cache. A unique value is suggested for you. |
| Existing AOAI account name | Optional. Leave empty to create a new AOAI account (requires S0 account quota in the chosen region). Set to the name of an existing AOAI account in the same resource group to reuse it; in that case only the cache + linked deployment will be created. |
Click Review + create → Create. When it finishes, the deployment Outputs tab gives you the AOAI endpoint, deployment name, and the cache container resource id.
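If you script deployments, a local pre-check of the name prefix can save a failed round trip. The sketch below encodes only the rule stated in the table (3–12 lowercase letters/digits) and the two derived names. The function name is hypothetical, and the template may enforce rules beyond this check.

```python
import re

# The rule from the Name prefix row: 3-12 lowercase letters or digits.
_NAME_PREFIX = re.compile(r"[a-z0-9]{3,12}")


def derive_names(name_prefix):
    """Validate the prefix and return the account names derived from it."""
    if not _NAME_PREFIX.fullmatch(name_prefix):
        raise ValueError("namePrefix must be 3-12 lowercase letters/digits")
    return {
        "aoai_account": f"{name_prefix}-aoai",
        "cache_account": f"{name_prefix}-cache",
    }


print(derive_names("contoso1"))
# {'aoai_account': 'contoso1-aoai', 'cache_account': 'contoso1-cache'}
```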
The preview features below must be Registered before the deployment can succeed. You only need to do this one time per subscription:
az provider register --namespace Microsoft.AzureContextCache
az feature register --namespace Microsoft.AzureContextCache --name EnablePreview
az feature register --namespace Microsoft.CognitiveServices --name OpenAI.ContextCacheAllowed
Both features are gated. If a status stays Pending for more than a few minutes, email azurecontextcacherp@microsoft.com for approval. A convenience script is included:
./scripts/register-providers.ps1 -SubscriptionId <your-sub-id>
| # | Resource | Type | Defaults |
|---|---|---|---|
| 1 | Azure OpenAI account | Microsoft.CognitiveServices/accounts | kind OpenAI, SKU S0, public access on |
| 2 | Context Cache account | Microsoft.AzureContextCache/accounts | accountKind = Regional |
| 3 | Cache container | Microsoft.AzureContextCache/accounts/containers | model gpt-5.4, provider OpenAI, timeToLive = 7d |
| 4 | AOAI deployment linked to (3) | Microsoft.CognitiveServices/accounts/deployments (api 2026-03-15-preview) | Standard / capacity 100, model gpt-5.4 v 2026-03-05-contextcache, contextCacheContainerId pre-wired |
All four are created in a single ARM deployment, in the same region, in the resource group you pick. No portal click-through, no follow-up CLI.
Nothing changes in your client code. Point the Azure OpenAI Responses API at the deployment created above and the cache is consulted transparently: every request whose input begins with the same stable prefix gets a hit.
from openai import AzureOpenAI
client = AzureOpenAI(
    azure_endpoint = "<azureOpenAIEndpoint from outputs>",
    api_key = "<your AOAI key>",
    api_version = "2026-03-15-preview",
)
LONG_STABLE_SYSTEM_PROMPT = """You are an expert support agent for Contoso Cloud...
<thousands of tokens of stable instructions, tool catalog, few-shot examples>"""
user_turn = "<the user's message for this request>"  # placeholder: varies per call
resp = client.responses.create(
    model = "context-cache-deployment",  # the AOAI deployment name
    instructions = LONG_STABLE_SYSTEM_PROMPT,  # cached prefix: keep byte-identical across calls
    input = [
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": user_turn},  # the only part that varies
            ],
        },
    ],
)
print(resp.output_text)
print("cached input tokens:", resp.usage.input_tokens_details.cached_tokens)
Equivalent raw REST call:
POST {azureOpenAIEndpoint}/openai/v1/responses?api-version=2026-03-15-preview
Authorization: Bearer <token>
Content-Type: application/json

{
  "model": "context-cache-deployment",
  "instructions": "<LONG_STABLE_SYSTEM_PROMPT>",
  "input": [
    { "role": "user", "content": [{ "type": "input_text", "text": "<user turn>" }] }
  ]
}
The longer and more stable your prefix, the larger the savings.
Tip: put the stable content in instructions (or as the leading items of input) and the volatile per-turn content at the end. Caching matches on the request prefix, so any byte change near the front invalidates the hit. Watch usage.input_tokens_details.cached_tokens on the response to confirm you are getting hits.
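When hits do not appear, a quick way to find the culprit is to compare two prompts that you expected to share a prefix. The helper below is a rough diagnostic, not part of the service or the SDK. It counts shared leading UTF-8 bytes. The service presumably matches on tokens, so treat the byte count as a proxy. The function name and sample strings are made up for illustration.

```python
def shared_prefix_bytes(a: str, b: str) -> int:
    """Return how many leading UTF-8 bytes two prompts have in common."""
    count = 0
    for x, y in zip(a.encode("utf-8"), b.encode("utf-8")):
        if x != y:
            break
        count += 1
    return count


stable = "You are a support agent.\n"

# Volatile content at the end: the whole stable block is shared.
good = shared_prefix_bytes(stable + "Q1", stable + "Q2")

# Volatile content (a date stamp) at the front: the prefix diverges almost immediately.
bad = shared_prefix_bytes("2026-01-01 " + stable, "2026-01-02 " + stable)

print(good, bad)
```

In the second pair only the first 9 bytes ("2026-01-0") match, so nothing useful can be reused.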
A runnable Python sample lives under demo/. It is the same AI Code Reviewer workload used in the contextCacheDemo reference (Step 5: remote prompt cache, warm). It sends 6 PR-review requests through the Responses API and prints per-call cached_tokens + latency so you can see the cache kick in on call #2.
cd demo
python -m venv .venv ; .\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
$env:AOAI_ENDPOINT = "<azureOpenAIEndpoint from deployment outputs>"
$env:AOAI_DEPLOYMENT = "context-cache-deployment"
$env:AOAI_API_KEY = "<your aoai key>"
python code_reviewer_demo.py --runs 6
Expected: call #1 cold (cached_tokens ≈ 0, ~8 s), calls #2..6 warm (cached_tokens ≈ 2.4K, ~2 s). See demo/README.md for the full sample output.
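To summarize a run like this in your own code, one option is to track what fraction of input tokens was served from the cache. The sketch below is not taken from the demo script. It assumes you collect, for each call, the pair (resp.usage.input_tokens, resp.usage.input_tokens_details.cached_tokens). The token counts in the example are hypothetical.

```python
def cached_fraction(usages):
    """Fraction of all input tokens that were served from the cache.

    usages: iterable of (input_tokens, cached_tokens) pairs, one per call.
    """
    pairs = list(usages)  # materialize so generators can be summed twice
    total_input = sum(inp for inp, _ in pairs)
    total_cached = sum(cached for _, cached in pairs)
    return total_cached / total_input if total_input else 0.0


# Hypothetical run: one cold call followed by five warm calls.
run = [(3000, 0)] + [(3000, 2400)] * 5
print(f"{cached_fraction(run):.2f}")  # 0.67
```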
| To change | Edit |
|---|---|
| Region | Pick a different Resource group region when deploying; resources use [resourceGroup().location] (centralus is the launch region; swedencentral is also supported) |
| Model / version | modelName, modelVersion variables in azuredeploy.json or bicep/main.bicep |
| TTL, SKU, capacity | Same variables block |
| Use an existing AOAI account instead of creating a new one | Delete the AOAI account resource and reference an existing one as the deployment's parent β see bicep/main.bicep for the pattern |
A pure CLI flow is also provided:
./scripts/deploy.ps1 -ResourceGroup rg-cc-demo # ARM JSON
./scripts/deploy.ps1 -ResourceGroup rg-cc-demo -UseBicep # Bicep
.
├── azuredeploy.json # Single all-in-one ARM template (Deploy-to-Azure target)
├── azuredeploy.parameters.json # Example parameter file
├── bicep/
│   ├── main.bicep # Bicep equivalent
│   └── main.bicepparam
├── prereqs/
│   ├── assign-reader-role.json # Optional: Reader for CSRP at sub scope (advanced)
│   └── assign-reader-role.bicep
├── demo/
│   ├── code_reviewer_demo.py # End-to-end Responses-API validation sample
│   ├── system_prompt.md # ~2.4K-token stable prefix
│   └── diffs/ # PR diffs used as the variable tail
├── scripts/
│   ├── register-providers.ps1 # One-time feature registration
│   └── deploy.ps1 # Convenience CLI wrapper
└── .github/workflows/validate.yml
| Symptom | Resolution |
|---|---|
| FeatureNotRegistered on deploy | Run the registration commands above (one az provider register, two az feature register) and wait until each reports Registered. Email azurecontextcacherp@microsoft.com if the state stays Pending. |
| LocationNotAvailableForResourceType | Only centralus (launch) and swedencentral are supported today. |
| InvalidResourceName | namePrefix must be 3–12 lowercase letters/digits. |
| Cache appears not to be used (no latency / cost improvement) | Confirm your prefix is byte-identical across requests, longer than a few hundred tokens, and that traffic is hitting the deployment created by this template (not a sibling deployment without contextCacheContainerId). |
| Need to unlink the cache later | PUT the same deployment with properties.contextCacheContainerId omitted (keep sku and model identical). |
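For the unlink case, the PUT body is the current deployment body minus the one property. The sketch below shows that transformation on a plain dictionary. The body shape (sku plus properties.model) follows the usual Cognitive Services deployment layout but is an assumption here, as are the function name and the placeholder id. Sending the PUT is left to whatever ARM client you already use.

```python
import copy


def without_cache_link(deployment_body):
    """Return a copy of the deployment body with the cache link removed.

    Everything else (sku, model) is left untouched, as the unlink row requires.
    """
    body = copy.deepcopy(deployment_body)
    body.get("properties", {}).pop("contextCacheContainerId", None)
    return body


# Assumed body shape, using the defaults listed in this README.
current = {
    "sku": {"name": "Standard", "capacity": 100},
    "properties": {
        "model": {"format": "OpenAI", "name": "gpt-5.4",
                  "version": "2026-03-05-contextcache"},
        "contextCacheContainerId": "<cache container resource id>",
    },
}
unlinked = without_cache_link(current)
print(unlinked)
```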