Project 7 of the GenAI Developer Roadmap 2026. A multimodal Retrieval-Augmented Generation system that processes both text and images from documents (PDFs, images, markdown), retrieves relevant content via hybrid search, and generates cited answers.
Document → Load → Extract Text/Images → Chunk → Embed → Vector Store
↓
Query → Embed → Hybrid Retrieval → Rerank → Context Build → Generate Answer
(text + image search) (token/image limits) (with citations)
Multi-format document ingestion with text and image extraction.
| Module | Purpose |
|---|---|
| `src/document/models.py` | Core models: Document, DocumentChunk, ContentBlock, ImageData |
| `src/document/loader.py` | Load PDF (PyMuPDF), images (Pillow), markdown, text |
| `src/document/text_extractor.py` | Extract text from PDFs, images (OCR), markdown |
| `src/document/image_extractor.py` | Extract embedded images from PDFs and markdown |
| `src/document/vision_describer.py` | Generate image descriptions via LLM vision API |
| `src/document/chunker.py` | Sentence-aware text chunking with overlap, image chunks |
| `src/storage/document_store.py` | JSON file-based document and chunk storage |
| `src/llm/client.py` | Multi-provider LLM client (OpenAI/Anthropic) with vision |
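
The sentence-aware chunking with overlap listed for `src/document/chunker.py` can be illustrated roughly as follows. This is a minimal sketch, not the project's implementation: the function name `chunk_text`, the regex-based sentence split, and the rule for carrying sentences into the next chunk are all assumptions. The defaults mirror the documented `CHUNK_SIZE` (500) and `CHUNK_OVERLAP` (100) values, measured in characters.

```python
import re


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Greedy sentence-aware chunking (hypothetical helper); sizes are in characters."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > chunk_size:
            chunks.append(" ".join(current))
            # Carry whole trailing sentences forward while they fit the overlap budget.
            carried: list[str] = []
            for prev in reversed(current):
                if len(" ".join([prev] + carried)) > overlap:
                    break
                carried.insert(0, prev)
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("A one. B two. C three. D four.", chunk_size=15, overlap=7):
        print(repr(chunk))
```

In this sketch a single sentence longer than `chunk_size` is kept whole rather than cut mid-sentence, so chunks can exceed the limit; the real chunker may handle that case differently.
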
Vector embeddings and hybrid text/image retrieval with reranking.
| Module | Purpose |
|---|---|
| `src/embedding/models.py` | EmbeddingConfig, EmbeddingVector dataclasses |
| `src/embedding/embedder.py` | TextEmbedder with OpenAI API + deterministic hash fallback |
| `src/embedding/vector_store.py` | In-memory vector store with cosine similarity search |
| `src/retrieval/models.py` | QueryType, RetrievalResult, QueryResult |
| `src/retrieval/text_retriever.py` | Text-only chunk retrieval |
| `src/retrieval/image_retriever.py` | Image chunk retrieval with description search |
| `src/retrieval/hybrid_retriever.py` | Reciprocal Rank Fusion (RRF) merging |
| `src/retrieval/reranker.py` | Heuristic reranking with visual query boosting |
| `src/retrieval/query_engine.py` | Orchestrates hybrid retrieval + reranking |
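
To make the RRF merging and visual query boosting concrete, here is a small sketch under stated assumptions. The function names, the smoothing constant `k=60` (a commonly used value for RRF), the boost factor of 1.5, and the exact term list are illustrative choices and are not taken from `hybrid_retriever.py` or `reranker.py`. Only the 0.6/0.4 weights and the example visual terms come from this README.

```python
import re
from collections import defaultdict

# The README names these terms and then says "etc."; the full list is not known here.
VISUAL_TERMS = {"image", "chart", "diagram"}


def weighted_rrf(text_ranked, image_ranked, text_weight=0.6, image_weight=0.4, k=60):
    """Fuse two ranked lists of chunk ids with weighted Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for weight, ranking in ((text_weight, text_ranked), (image_weight, image_ranked)):
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes weight / (k + rank) for every id it contains.
            scores[chunk_id] += weight / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def is_visual_query(query: str) -> bool:
    """True when the query contains one of the visual terms as a whole word."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    return bool(words & VISUAL_TERMS)


def rerank(query, fused, image_ids, boost=1.5, top_k=3):
    """Multiply image scores by an assumed boost when the query looks visual."""
    if is_visual_query(query):
        fused = [(cid, s * boost if cid in image_ids else s) for cid, s in fused]
    return sorted(fused, key=lambda item: item[1], reverse=True)[:top_k]


if __name__ == "__main__":
    fused = weighted_rrf(["t1", "t2", "img1"], ["img1", "img2"])
    print(rerank("show me the chart", fused, image_ids={"img1", "img2"}))
```

Because RRF works on ranks rather than raw scores, the text and image retrievers do not need comparable similarity scales, which is one reason it suits hybrid search.
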
End-to-end RAG pipeline with context building, generation, and evaluation metrics.
| Module | Purpose |
|---|---|
| `src/rag/models.py` | ContextItem, RAGContext, Citation, RAGResponse |
| `src/rag/context_builder.py` | Build context with token/image limits, format for prompt |
| `src/rag/generator.py` | Response generation with LLM client or fallback |
| `src/rag/pipeline.py` | Full pipeline: retrieve → context → generate |
| `src/rag/config.py` | PipelineConfig with from_settings/to_dict/from_dict |
| `src/evaluation/metrics.py` | Precision@K, Recall@K, MRR, NDCG, context relevance, faithfulness, image coverage |
| `src/evaluation/evaluator.py` | RAGEvaluator: retrieval metrics, response quality, batch aggregation |
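
For reference, the ranking metrics named above have standard textbook definitions, sketched below with binary relevance. The function names and signatures are illustrative and may not match `src/evaluation/metrics.py`; in particular, whether the project divides Precision@K by `k` or by the number of results returned is an assumption.

```python
import math


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k slots filled by relevant ids (divides by k)."""
    if k <= 0:
        return 0.0
    return sum(1 for cid in retrieved[:k] if cid in relevant) / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant ids that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for cid in retrieved[:k] if cid in relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id, or 0.0 when none is retrieved."""
    for rank, cid in enumerate(retrieved, start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(cases):
    """MRR over an iterable of (retrieved, relevant) pairs."""
    cases = list(cases)
    if not cases:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in cases) / len(cases)


def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, cid in enumerate(retrieved[:k], start=1)
        if cid in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


if __name__ == "__main__":
    retrieved, relevant = ["a", "b", "c", "d"], {"b", "d", "x"}
    print(precision_at_k(retrieved, relevant, 2), recall_at_k(retrieved, relevant, 4))
    print(reciprocal_rank(retrieved, relevant), ndcg_at_k(retrieved, relevant, 4))
```
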
- No GPU required: All tests run on CPU. Vision/embedding use API calls; local fallback uses deterministic hash-based vectors (SHA256 → position-varied cos() → normalized)
- Hybrid retrieval: Reciprocal Rank Fusion combines text and image search with configurable weights (default 0.6/0.4)
- Visual query detection: Checks for visual terms ("image", "chart", "diagram", etc.) to boost image results during reranking
- Token-aware context: Estimates tokens as character count / 4 (roughly four characters per token), respects configurable max_context_tokens and max_images limits
- Heuristic evaluation: Keyword-overlap metrics for context relevance and answer faithfulness, no LLM calls needed for basic evaluation
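
The deterministic hash fallback mentioned in the first bullet (SHA256 → position-varied cos() → normalized) could look roughly like this. The exact mixing formula used by `TextEmbedder` is not given in this README, so the `cos(byte * (i + 1))` step and the function names below are assumptions made for illustration.

```python
import hashlib
import math


def hash_embedding(text: str, dimension: int = 256) -> list[float]:
    """Deterministic pseudo-embedding: same text in, same unit vector out."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()  # always 32 bytes
    # Assumed mixing step: digest bytes are reused cyclically, and the position
    # index i varies each component even where the underlying byte repeats.
    raw = [math.cos(digest[i % len(digest)] * (i + 1)) for i in range(dimension)]
    norm = math.sqrt(sum(v * v for v in raw))
    if norm == 0.0:
        return raw
    return [v / norm for v in raw]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    a = hash_embedding("neural network architecture")
    b = hash_embedding("neural network architecture")
    print(len(a), cosine_similarity(a, b))
```

Vectors built this way are reproducible without an API key, which is what makes them useful for CPU-only tests, but they carry no semantic meaning: two paraphrases of the same sentence will not score as similar.
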
# Ingest a document
python main.py ingest report.pdf
# Ask a question
python main.py ask "What are the key findings?"
# Search without generation
python main.py search "neural network architecture"
# Run evaluation
python main.py evaluate --input test_cases.json --output results.json
# List ingested documents
python main.py list

Week 1 - Document Processing: 96 tests passed
Week 2 - Embedding & Retrieval: 72 tests passed
Week 3 - RAG Pipeline & Eval: 125 tests passed
─────────────────────────────────────────────────
Total: 293 tests passed
All settings are read from environment variables (see src/config/settings.py):
| Variable | Default | Purpose |
|---|---|---|
| `CHUNK_SIZE` | 500 | Characters per text chunk |
| `CHUNK_OVERLAP` | 100 | Overlap between chunks |
| `EMBEDDING_DIMENSION` | 256 | Vector dimension |
| `RETRIEVAL_TOP_K` | 5 | Results to retrieve |
| `RERANK_TOP_K` | 3 | Results after reranking |
| `TEXT_WEIGHT` | 0.6 | Text search weight in hybrid |
| `IMAGE_WEIGHT` | 0.4 | Image search weight in hybrid |
| `RAG_MAX_CONTEXT_TOKENS` | 4000 | Max tokens in context |
| `RAG_MAX_IMAGES_IN_CONTEXT` | 3 | Max images in context |
| `RAG_TEMPERATURE` | 0.7 | LLM generation temperature |
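
As an illustration of how a few of these variables might be read and applied, the sketch below loads settings with the defaults from the table and then selects context items under the token and image limits. `SketchSettings`, `load_settings`, and `select_context` are hypothetical names; the actual code in `src/config/settings.py` and `src/rag/context_builder.py` may be structured differently, and the use of integer division for the token estimate is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchSettings:
    """Hypothetical subset of the settings; defaults come from the table above."""
    chunk_size: int
    text_weight: float
    max_context_tokens: int
    max_images: int


def load_settings(env=None) -> SketchSettings:
    """Read settings from a mapping (os.environ by default), falling back to defaults."""
    env = os.environ if env is None else env
    return SketchSettings(
        chunk_size=int(env.get("CHUNK_SIZE", "500")),
        text_weight=float(env.get("TEXT_WEIGHT", "0.6")),
        max_context_tokens=int(env.get("RAG_MAX_CONTEXT_TOKENS", "4000")),
        max_images=int(env.get("RAG_MAX_IMAGES_IN_CONTEXT", "3")),
    )


def estimate_tokens(text: str) -> int:
    """Rough token estimate: about four characters per token."""
    return len(text) // 4


def select_context(items, settings: SketchSettings):
    """Keep ranked (kind, text) items until the token or image limit is hit."""
    selected, used_tokens, used_images = [], 0, 0
    for kind, text in items:
        cost = estimate_tokens(text)
        if used_tokens + cost > settings.max_context_tokens:
            continue  # would overflow the budget; a smaller later item may still fit
        if kind == "image":
            if used_images >= settings.max_images:
                continue
            used_images += 1
        selected.append((kind, text))
        used_tokens += cost
    return selected


if __name__ == "__main__":
    settings = load_settings({"RAG_MAX_CONTEXT_TOKENS": "120", "RAG_MAX_IMAGES_IN_CONTEXT": "1"})
    items = [("text", "a" * 400), ("image", "b" * 40), ("image", "c" * 40)]
    print(len(select_context(items, settings)))
```

Passing a plain dictionary instead of `os.environ` keeps the loader easy to test without touching the real environment.
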