This repository contains a Retrieval-Augmented Generation (RAG) chatbot that I built as a hands-on learning exercise.
The main goal was not to ship a product but to understand the full RAG pipeline end to end.
I created this project to explore and validate the core ideas behind RAG systems in practice:
- how raw text/PDF data is ingested
- how different chunking strategies affect retrieval quality
- how embeddings and vector databases interact
- how prompt design constrains answer behavior
- how retrieval confidence and evaluation influence response quality
This repository therefore represents an experimentation-focused implementation rather than a polished production system.
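
To make the point about prompt design concrete, here is a minimal sketch of a context-constrained prompt. The template text and the `build_prompt` helper are illustrative assumptions; the actual templates live in `config.py` and may be worded differently.

```python
# Hypothetical template: the real prompt in config.py may differ.
PROMPT_TEMPLATE = """Answer the question using only the context below.
If the context does not contain the answer, reply exactly: "I don't know."

Context:
{context}

Question: {question}
"""


def build_prompt(question: str, chunks: list[str]) -> str:
    """Join retrieved chunks with a visible separator and fill the template."""
    context = "\n\n---\n\n".join(chunk.strip() for chunk in chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question.strip())


if __name__ == "__main__":
    demo_chunks = ["ChromaDB stores the embeddings.", "Chunks are also saved as JSON."]
    print(build_prompt("Where are the embeddings stored?", demo_chunks))
```

The explicit fallback instruction is what makes negative test cases possible: when retrieval returns nothing relevant, the model has a defined way to decline instead of guessing.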
The strongest emphasis is on content processing and retrieval behavior:
- comparing multiple chunking approaches
- storing chunk outputs as JSON for inspection
- persisting embeddings in ChromaDB for semantic search
- tracing retrieved sources and similarity scores
- evaluating response correctness with simple automated checks
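
The idea of tracing sources and similarity scores can be sketched without any vector database. The snippet below is a pure-Python stand-in, not the ChromaDB query path used in this repository: it ranks pre-embedded chunks by cosine similarity and drops low-confidence hits. The record layout, `top_k`, and `min_score` are assumptions chosen for illustration, and the score scale of a real vector store depends on its configured distance metric.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, records, top_k=3, min_score=0.5):
    """Return up to top_k records scoring at least min_score, best first.

    Each record is a dict with 'source', 'text', and 'embedding' keys
    (a hypothetical layout used only for this sketch).
    """
    scored = [
        {
            "source": r["source"],
            "text": r["text"],
            "score": cosine_similarity(query_vec, r["embedding"]),
        }
        for r in records
    ]
    scored.sort(key=lambda r: r["score"], reverse=True)
    return [r for r in scored[:top_k] if r["score"] >= min_score]


if __name__ == "__main__":
    toy_records = [
        {"source": "a.txt", "text": "about cats", "embedding": [1.0, 0.0]},
        {"source": "b.txt", "text": "about cars", "embedding": [0.0, 1.0]},
    ]
    for hit in retrieve([1.0, 0.1], toy_records):
        print(f"{hit['score']:.3f}  {hit['source']}  {hit['text']}")
```

Returning an empty list when nothing clears `min_score` gives the answer-generation step a clear signal to decline rather than answer from weak context.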
- `create_database.py`: data ingestion, chunk generation, and persistence to Chroma + JSON
- `query_data.py`: retrieval and answer generation from the vector database
- `chunking_strategies.py`: alternative chunking implementations:
  - character splitter
  - statistical chunking
  - LLM-based semantic chunking
- `config.py`: central configuration for data paths, model choices, and prompt templates
- `ai_setup.py`: Gemini client + embedding wrapper setup
- `test_rag.py`: simple response-validation tests for positive/negative cases
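
As an illustration of the simplest of the three strategies, the following is a minimal sketch of a fixed-size character splitter with overlap. It is not the code in `chunking_strategies.py`, which may split on separators or delegate to a library; the function name and default sizes are assumptions.

```python
def split_by_characters(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Cut text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "Retrieval-Augmented Generation combines search with text generation."
    for i, chunk in enumerate(split_by_characters(sample, chunk_size=30, overlap=10)):
        print(i, repr(chunk))
```

The overlap keeps a sentence that straddles a boundary at least partly present in both neighbouring chunks, at the cost of storing some text twice.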
- `Data/`: datasets used for ingestion
- `Chunks_*/`: chunk outputs stored as JSON for manual review
- `ChromaDB_*/`: persisted vector stores for different chunking variants
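
Writing chunks to a JSON file for manual review might look like the sketch below. The record fields (`id`, `source`, `index`, `text`, `length`) and the one-file-per-source naming are assumptions for illustration, not the exact schema this repository writes.

```python
import json
import tempfile
from pathlib import Path


def save_chunks(chunks: list[str], source: str, out_dir: str) -> Path:
    """Write one JSON file per source document and return its path."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    stem = Path(source).stem  # e.g. "notes" for "Data/notes.txt"
    records = [
        {"id": f"{stem}-{i}", "source": source, "index": i, "text": chunk, "length": len(chunk)}
        for i, chunk in enumerate(chunks)
    ]
    target = out_path / f"{stem}.json"
    # ensure_ascii=False keeps non-ASCII text readable when inspecting the file
    target.write_text(json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8")
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = save_chunks(["first chunk", "second chunk"], "Data/notes.txt", tmp)
        print(path.read_text(encoding="utf-8"))
```

Keeping the chunk index and source in each record makes it easy to match a retrieved chunk back to the place it came from.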
- Load source documents (`.txt` and `.pdf`)
- Split content into chunks (different chunking strategies)
- Embed chunks and store them in ChromaDB
- Retrieve top-matching chunks for a query
- Generate an answer constrained to retrieved context
- Optionally evaluate expected vs. actual response alignment
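
The optional evaluation step can be approximated with a plain string check. The sketch below is a stand-in under stated assumptions: it treats a response as correct when the normalized expected answer appears inside the normalized actual answer. The checks in `test_rag.py` may work differently, for example by asking an LLM to judge equivalence; the helper names here are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    without_punct = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", without_punct).strip()


def response_matches(expected: str, actual: str) -> bool:
    """True if the expected answer occurs inside the actual response."""
    return normalize(expected) in normalize(actual)


def run_case(expected: str, actual: str, should_match: bool = True) -> bool:
    """Positive cases pass on a match; negative cases pass on a mismatch."""
    return response_matches(expected, actual) == should_match


if __name__ == "__main__":
    answer = "Each player starts with $1,500 in total."
    print("positive case passed:", run_case("$1,500", answer))
    print("negative case passed:", run_case("$2,000", answer, should_match=False))
```

Substring matching is deliberately crude: it cannot tell a paraphrase from a wrong answer, so it suits short factual expectations such as numbers or names.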
- a full local RAG workflow from ingestion to answer generation
- practical comparison of chunking methods in one codebase
- transparent artifact generation (JSON + vector DB) for debugging and analysis