aditya2425/VisionRAG

VisionRAG — Multimodal RAG Pipeline

Project 7 of the GenAI Developer Roadmap 2026: a multimodal Retrieval-Augmented Generation (RAG) system that processes both text and images from documents (PDFs, images, markdown), retrieves relevant content via hybrid search, and generates answers with citations.

Architecture

Document → Load → Extract Text/Images → Chunk → Embed → Vector Store
                                                              ↓
Query → Embed → Hybrid Retrieval → Rerank → Context Build → Generate Answer
           (text + image search)              (token/image limits)  (with citations)

Weekly Breakdown

Week 1 — Document Processing (96 tests)

Multi-format document ingestion with text and image extraction.

Module                            Purpose
src/document/models.py            Core models: Document, DocumentChunk, ContentBlock, ImageData
src/document/loader.py            Load PDF (PyMuPDF), images (Pillow), markdown, text
src/document/text_extractor.py    Extract text from PDFs, images (OCR), markdown
src/document/image_extractor.py   Extract embedded images from PDFs and markdown
src/document/vision_describer.py  Generate image descriptions via LLM vision API
src/document/chunker.py           Sentence-aware text chunking with overlap, image chunks
src/storage/document_store.py     JSON file-based document and chunk storage
src/llm/client.py                 Multi-provider LLM client (OpenAI/Anthropic) with vision
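
To illustrate what "sentence-aware text chunking with overlap" can look like, here is a minimal sketch. It is not the code in src/document/chunker.py: the function names `split_sentences` and `chunk_text`, the regex-based sentence splitter, and the carry-over rule are all assumptions made for illustration. The defaults mirror the CHUNK_SIZE and CHUNK_OVERLAP settings listed under Configuration.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split on whitespace that follows '.', '!' or '?' (a deliberately naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Pack whole sentences into chunks of roughly chunk_size characters.

    Each new chunk starts with the trailing sentences of the previous chunk,
    up to `overlap` characters, so context carries across chunk boundaries.
    A single sentence longer than chunk_size is kept whole rather than cut.
    """
    chunks: list[str] = []
    current: list[str] = []
    for sentence in split_sentences(text):
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > chunk_size:
            chunks.append(" ".join(current))
            # Carry over trailing sentences that fit in the overlap budget.
            carried: list[str] = []
            for prev in reversed(current):
                if len(" ".join([prev] + carried)) > overlap:
                    break
                carried.insert(0, prev)
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With a small budget, for example `chunk_text("Alpha one. Beta two. Gamma three. Delta four.", chunk_size=25, overlap=12)`, each chunk repeats the last sentence of the one before it, which is the behaviour the overlap setting is meant to produce.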

Week 2 — Embedding & Retrieval (72 tests)

Vector embeddings and hybrid text/image retrieval with reranking.

Module                             Purpose
src/embedding/models.py            EmbeddingConfig, EmbeddingVector dataclasses
src/embedding/embedder.py          TextEmbedder with OpenAI API + deterministic hash fallback
src/embedding/vector_store.py      In-memory vector store with cosine similarity search
src/retrieval/models.py            QueryType, RetrievalResult, QueryResult
src/retrieval/text_retriever.py    Text-only chunk retrieval
src/retrieval/image_retriever.py   Image chunk retrieval with description search
src/retrieval/hybrid_retriever.py  Reciprocal Rank Fusion (RRF) merging
src/retrieval/reranker.py          Heuristic reranking with visual query boosting
src/retrieval/query_engine.py      Orchestrates hybrid retrieval + reranking
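
The embedder's hash fallback and the in-memory store can be pictured with the sketch below. It assumes a simple scheme (SHA256 digest bytes fed through cos() with a position-dependent factor, then L2-normalized) and a brute-force cosine search. The names `hash_embed`, `cosine_similarity` and `InMemoryVectorStore`, and the exact formula, are illustrative assumptions and may differ from the project's implementation.

```python
import hashlib
import math


def hash_embed(text: str, dimension: int = 256) -> list[float]:
    """Deterministic pseudo-embedding: same text in, same unit vector out."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Vary each component by position so 32 digest bytes fill any dimension.
    raw = [math.cos(digest[i % len(digest)] * (i + 1)) for i in range(dimension)]
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryVectorStore:
    """Brute-force store: keeps every vector and scans them all per query."""

    def __init__(self) -> None:
        self._vectors: dict[str, list[float]] = {}

    def add(self, chunk_id: str, vector: list[float]) -> None:
        self._vectors[chunk_id] = vector

    def search(self, query: list[float], top_k: int = 5) -> list[tuple[str, float]]:
        scored = [
            (chunk_id, cosine_similarity(query, vector))
            for chunk_id, vector in self._vectors.items()
        ]
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:top_k]
```

Hash-based vectors like these carry no semantic meaning: only identical text maps to an identical vector. That makes them useful for repeatable CPU-only tests, not for judging retrieval quality.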

Week 3 — RAG Pipeline & Evaluation (125 tests)

End-to-end RAG pipeline with context building, generation, and evaluation metrics.

Module                       Purpose
src/rag/models.py            ContextItem, RAGContext, Citation, RAGResponse
src/rag/context_builder.py   Build context with token/image limits, format for prompt
src/rag/generator.py         Response generation with LLM client or fallback
src/rag/pipeline.py          Full pipeline: retrieve → context → generate
src/rag/config.py            PipelineConfig with from_settings/to_dict/from_dict
src/evaluation/metrics.py    Precision@K, Recall@K, MRR, NDCG, context relevance, faithfulness, image coverage
src/evaluation/evaluator.py  RAGEvaluator: retrieval metrics, response quality, batch aggregation
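
For readers unfamiliar with the retrieval metrics named above, the following sketch shows their textbook definitions with binary relevance (a chunk is either relevant or not). The function names and signatures are assumptions for illustration and are not taken from src/evaluation/metrics.py.

```python
import math


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """DCG of the top-k results divided by the DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc in enumerate(retrieved[:k])
        if doc in relevant
    )
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0
```

For example, with `retrieved = ["c1", "c4", "c2", "c9"]` and `relevant = {"c1", "c2", "c3"}`, both Precision@3 and Recall@3 come out at 2/3, because two of the top three results are relevant and two of the three relevant chunks were found.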

Key Design Decisions

  • No GPU required: All tests run on CPU. Vision and embedding calls go to remote APIs; the local fallback produces deterministic hash-based vectors (SHA256 → position-varied cos() → normalized).
  • Hybrid retrieval: Reciprocal Rank Fusion combines text and image search results with configurable weights (default 0.6 text / 0.4 image).
  • Visual query detection: Checks the query for visual terms ("image", "chart", "diagram", etc.) and boosts image results during reranking when any are found.
  • Token-aware context: Estimates tokens as character count / 4 and respects the configurable max_context_tokens and max_images limits.
  • Heuristic evaluation: Context relevance and answer faithfulness use keyword-overlap metrics, so basic evaluation needs no LLM calls.
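
The weighted fusion described in the hybrid retrieval bullet can be sketched as follows. This is one plausible reading of weighted Reciprocal Rank Fusion, not the code in src/retrieval/hybrid_retriever.py: the function name `weighted_rrf` is hypothetical, and the smoothing constant k=60 is the value commonly used with RRF rather than a value confirmed for this project.

```python
def weighted_rrf(
    text_ranking: list[str],
    image_ranking: list[str],
    text_weight: float = 0.6,
    image_weight: float = 0.4,
    k: int = 60,  # assumed smoothing constant, not confirmed by the project
) -> list[tuple[str, float]]:
    """Merge two ranked lists of chunk IDs into one scored ranking.

    Each list contributes weight / (k + rank) per item, with rank starting
    at 1. Items that appear in both lists accumulate both contributions.
    """
    scores: dict[str, float] = {}
    for weight, ranking in ((text_weight, text_ranking), (image_weight, image_ranking)):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + weight / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# A chunk found by both searches outranks chunks found by only one.
fused = weighted_rrf(["t1", "t2", "shared"], ["shared", "i1"])
fused_ids = [chunk_id for chunk_id, _ in fused]
```

Because RRF works on ranks rather than raw similarity scores, the text and image retrievers do not need comparable score scales. The weights only decide how much each list's ordering counts.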

Usage

# Ingest a document
python main.py ingest report.pdf

# Ask a question
python main.py ask "What are the key findings?"

# Search without generation
python main.py search "neural network architecture"

# Run evaluation
python main.py evaluate --input test_cases.json --output results.json

# List ingested documents
python main.py list

Test Results

Week 1 — Document Processing:    96 tests passed
Week 2 — Embedding & Retrieval:  72 tests passed
Week 3 — RAG Pipeline & Eval:   125 tests passed
─────────────────────────────────────────────────
Total:                           293 tests passed

Configuration

All settings are read from environment variables (see src/config/settings.py):

Variable                   Default  Purpose
CHUNK_SIZE                 500      Characters per text chunk
CHUNK_OVERLAP              100      Overlap between chunks
EMBEDDING_DIMENSION        256      Vector dimension
RETRIEVAL_TOP_K            5        Results to retrieve
RERANK_TOP_K               3        Results after reranking
TEXT_WEIGHT                0.6      Text search weight in hybrid
IMAGE_WEIGHT               0.4      Image search weight in hybrid
RAG_MAX_CONTEXT_TOKENS     4000     Max tokens in context
RAG_MAX_IMAGES_IN_CONTEXT  3        Max images in context
RAG_TEMPERATURE            0.7      LLM generation temperature
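
As a rough picture of how environment-driven settings with these defaults might be loaded, consider the sketch below. The `Settings` dataclass and its `from_env` method are hypothetical and only mirror the table above; the actual contents of src/config/settings.py may be organized differently.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container mirroring the variables in the table above."""

    chunk_size: int
    chunk_overlap: int
    embedding_dimension: int
    retrieval_top_k: int
    rerank_top_k: int
    text_weight: float
    image_weight: float
    rag_max_context_tokens: int
    rag_max_images_in_context: int
    rag_temperature: float

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "Settings":
        # Fall back to the process environment when no mapping is passed.
        env = os.environ if env is None else env
        return cls(
            chunk_size=int(env.get("CHUNK_SIZE", "500")),
            chunk_overlap=int(env.get("CHUNK_OVERLAP", "100")),
            embedding_dimension=int(env.get("EMBEDDING_DIMENSION", "256")),
            retrieval_top_k=int(env.get("RETRIEVAL_TOP_K", "5")),
            rerank_top_k=int(env.get("RERANK_TOP_K", "3")),
            text_weight=float(env.get("TEXT_WEIGHT", "0.6")),
            image_weight=float(env.get("IMAGE_WEIGHT", "0.4")),
            rag_max_context_tokens=int(env.get("RAG_MAX_CONTEXT_TOKENS", "4000")),
            rag_max_images_in_context=int(env.get("RAG_MAX_IMAGES_IN_CONTEXT", "3")),
            rag_temperature=float(env.get("RAG_TEMPERATURE", "0.7")),
        )
```

Passing an explicit mapping, such as `Settings.from_env({"TEXT_WEIGHT": "0.7", "IMAGE_WEIGHT": "0.3"})`, makes it easy to try overrides in tests without touching the real environment.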

About

Multimodal RAG: PDF + image ingestion, OCR, vision embeddings, hybrid retrieval & multimodal answer generation
