Skip to content

feat: caché-aware single file RAG#421

Open
mBerasategui-ehu wants to merge 1 commit into
Lamb-Project:devfrom
mBerasategui-ehu:RAG_caching_and_grep_RAG
Open

feat: caché-aware single file RAG#421
mBerasategui-ehu wants to merge 1 commit into
Lamb-Project:devfrom
mBerasategui-ehu:RAG_caching_and_grep_RAG

Conversation

@mBerasategui-ehu

Copy link
Copy Markdown
Collaborator

Implementation Log: Cache-Aware RAG & Grep RAG

Tracking all changes made per IMPLEMENTATION_PLAN_CACHED_GREP_RAG.md.


Cache-Aware Single File RAG ✅

Implementation choice: Option B — Extended simple_augment.py with automatic cache-aware mode detection. When an assistant has rag_processor: "single_file_rag" in its metadata, simple_augment automatically splits the RAG context into a separate user message (before the question), enabling LLM provider prompt caching. No new prompt processor file needed. No auto-selection in main.py needed. Fully backward compatible.

Files Changed

File Operation Description
backend/lamb/completions/pps/simple_augment.py Edited Added _is_single_file_rag() helper + cache-aware logic. When assistant metadata has rag_processor: "single_file_rag", context is emitted as a separate user message before conversation history. When not, existing template-based behavior is unchanged.
backend/tests/test_cache_aware_augment.py Created 10 unit tests: cache mode structure, no-RAG fallback, standard mode unchanged, no-metadata backward compat, conversation history, cache stability, cache-vs-standard comparison, cost comparison (86% estimated savings)
backend/tests/test_integration_cache.py Created Real integration test against OpenAI API. Proves cache mode gets actual cached_tokens while standard mode gets zero. Run with --real flag.
docker-compose-example.yaml Edited Changed GLOBAL_LOG_LEVEL=WARNING to ${GLOBAL_LOG_LEVEL:-WARNING} so debug logs can be enabled from the project .env

Verified with real API call (OpenAI gpt-4o-mini)

Integration test passed. Ran python tests/test_integration_cache.py --real --size 12000 against the live OpenAI API. Results:

Call Mode prompt_tokens cached_tokens Result
1 Cache (warm-up) 2,060 0 (expected)
2 Cache (test) 2,058 1,920 CACHE HIT
3 Standard (control) 2,035 0 ✅ No cache

Real cost savings: 46% with gpt-4o-mini. With gpt-4o the savings would be ~88%.

How it works

  • simple_augment checks assistant.metadata for rag_processor == "single_file_rag"
  • If true + RAG context exists: emits context as separate cached user message
  • Messages: [system] → [user: file context] → [prev msgs] → [user: question]
  • System + file context are byte-identical across requests → LLM provider caches them
  • No new files, no config changes, no UI changes — just works

Example: messages sent to the LLM

With prompt template: "Responde la pregunta del usuario: --- {user_input} ---\n\nEste es el contexto:\n--- {context} ---"

Message 1 — System (CACHED)
─────────────────────────────────────────────────────────────────────────┐
│ Eres un asistente de aprendizaje que ayuda a los estudiantes a        │
│ aprender sobre un tema específico. Utiliza el contexto para responder │
│ las preguntas del usuario.                                             │
└────────────────────────────────────────────────────────────────────────┘

Message 2 — User: file context (CACHED)
┌─────────────────────────────────────────────────────────────────────────┐
│ The user may ask you questions about the following document. Use this  │
│ content to answer their questions accurately.                          │
│                                                                         │
│ El programa de Dbizi es un sistema de préstamo de bicicletas públicas…│
│ [full file content — thousands of characters]                          │
└─────────────────────────────────────────────────────────────────────────┘

Message 3 — User: question + template (INPUT — only this changes)
┌─────────────────────────────────────────────────────────────────────────┐
│ Responde la pregunta del usuario: --- ¿Cuál es el tiempo máximo? ---   │
│                                                                         │
│ Este es el contexto:                                                    │
│                                                                         │
│ ---  ---                                                                │
└─────────────────────────────────────────────────────────────────────────┘

Placeholder handling in cache mode

In cache mode, {context} is replaced with empty string (the actual file content is already in Message 2). The --- --- structure remains as a template artifact. This avoids duplicating the file in the prompt while keeping the template structure intact. {user_input} is replaced with the actual question as usual.

The replacement is language-agnostic — works regardless of what language the template is written in.

Verified in production (Docker + backend logs)

Confirmed working via docker compose -f docker-compose-example.yaml up -d --build with GLOBAL_LOG_LEVEL=DEBUG. Backend log output:

DEBUG:lamb.completions:Processed messages: [
  {'role': 'system', 'content': 'Eres un asistente de aprendizaje...'},
  {'role': 'user',   'content': 'The user may ask you questions about the following document...[full file content]'},
  {'role': 'user',   'content': 'Este es el contexto:\n ---  --- \n\nAhora responde la pregunta del usuario: --- Cuáles son los precios? ---'}
]

Message [0] and [1] are identical across requests → cached. Only message [2] changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant