feat: caché-aware single file RAG by mBerasategui-ehu · Pull Request #421 · Lamb-Project/lamb

mBerasategui-ehu · 2026-06-16T08:31:40Z

Implementation Log: Cache-Aware RAG & Grep RAG

Tracking all changes made per IMPLEMENTATION_PLAN_CACHED_GREP_RAG.md.

Cache-Aware Single File RAG ✅

Implementation choice: Option B — Extended simple_augment.py with automatic cache-aware mode detection. When an assistant has rag_processor: "single_file_rag" in its metadata, simple_augment automatically splits the RAG context into a separate user message (before the question), enabling LLM provider prompt caching. No new prompt processor file needed. No auto-selection in main.py needed. Fully backward compatible.

Files Changed

File	Operation	Description
`backend/lamb/completions/pps/simple_augment.py`	Edited	Added `_is_single_file_rag()` helper + cache-aware logic. When assistant metadata has `rag_processor: "single_file_rag"`, context is emitted as a separate user message before conversation history. When not, existing template-based behavior is unchanged.
`backend/tests/test_cache_aware_augment.py`	Created	10 unit tests: cache mode structure, no-RAG fallback, standard mode unchanged, no-metadata backward compat, conversation history, cache stability, cache-vs-standard comparison, cost comparison (86% estimated savings)
`backend/tests/test_integration_cache.py`	Created	Real integration test against OpenAI API. Proves cache mode gets actual `cached_tokens` while standard mode gets zero. Run with `--real` flag.
`docker-compose-example.yaml`	Edited	Changed `GLOBAL_LOG_LEVEL=WARNING` to `${GLOBAL_LOG_LEVEL:-WARNING}` so debug logs can be enabled from the project `.env`

Verified with real API call (OpenAI gpt-4o-mini)

Integration test passed. Ran python tests/test_integration_cache.py --real --size 12000 against the live OpenAI API. Results:

Call	Mode	prompt_tokens	cached_tokens	Result
1	Cache (warm-up)	2,060	0	(expected)
2	Cache (test)	2,058	1,920	✅ CACHE HIT
3	Standard (control)	2,035	0	✅ No cache

Real cost savings: 46% with gpt-4o-mini. With gpt-4o the savings would be ~88%.

How it works

simple_augment checks assistant.metadata for rag_processor == "single_file_rag"
If true + RAG context exists: emits context as separate cached user message
Messages: [system] → [user: file context] → [prev msgs] → [user: question]
System + file context are byte-identical across requests → LLM provider caches them
No new files, no config changes, no UI changes — just works

Example: messages sent to the LLM

With prompt template: "Responde la pregunta del usuario: --- {user_input} ---\n\nEste es el contexto:\n--- {context} ---"

Message 1 — System (CACHED)
─────────────────────────────────────────────────────────────────────────┐
│ Eres un asistente de aprendizaje que ayuda a los estudiantes a        │
│ aprender sobre un tema específico. Utiliza el contexto para responder │
│ las preguntas del usuario.                                             │
└────────────────────────────────────────────────────────────────────────┘

Message 2 — User: file context (CACHED)
┌─────────────────────────────────────────────────────────────────────────┐
│ The user may ask you questions about the following document. Use this  │
│ content to answer their questions accurately.                          │
│                                                                         │
│ El programa de Dbizi es un sistema de préstamo de bicicletas públicas…│
│ [full file content — thousands of characters]                          │
└─────────────────────────────────────────────────────────────────────────┘

Message 3 — User: question + template (INPUT — only this changes)
┌─────────────────────────────────────────────────────────────────────────┐
│ Responde la pregunta del usuario: --- ¿Cuál es el tiempo máximo? ---   │
│                                                                         │
│ Este es el contexto:                                                    │
│                                                                         │
│ ---  ---                                                                │
└─────────────────────────────────────────────────────────────────────────┘

Placeholder handling in cache mode

In cache mode, {context} is replaced with empty string (the actual file content is already in Message 2). The --- --- structure remains as a template artifact. This avoids duplicating the file in the prompt while keeping the template structure intact. {user_input} is replaced with the actual question as usual.

The replacement is language-agnostic — works regardless of what language the template is written in.

Verified in production (Docker + backend logs)

Confirmed working via docker compose -f docker-compose-example.yaml up -d --build with GLOBAL_LOG_LEVEL=DEBUG. Backend log output:

DEBUG:lamb.completions:Processed messages: [
  {'role': 'system', 'content': 'Eres un asistente de aprendizaje...'},
  {'role': 'user',   'content': 'The user may ask you questions about the following document...[full file content]'},
  {'role': 'user',   'content': 'Este es el contexto:\n ---  --- \n\nAhora responde la pregunta del usuario: --- Cuáles son los precios? ---'}
]

Message [0] and [1] are identical across requests → cached. Only message [2] changes.

feat: caché-aware single file RAG

4c026e3

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: caché-aware single file RAG#421

feat: caché-aware single file RAG#421
mBerasategui-ehu wants to merge 1 commit into
Lamb-Project:devfrom
mBerasategui-ehu:RAG_caching_and_grep_RAG

mBerasategui-ehu commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mBerasategui-ehu commented Jun 16, 2026

Implementation Log: Cache-Aware RAG & Grep RAG

Cache-Aware Single File RAG ✅

Files Changed

Verified with real API call (OpenAI gpt-4o-mini)

How it works

Example: messages sent to the LLM

Placeholder handling in cache mode

Verified in production (Docker + backend logs)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant