RAG Chatbot (Learning Project)

This repository contains a Retrieval-Augmented Generation (RAG) chatbot that I built as a hands-on learning exercise.
The goal was to understand the full RAG pipeline end to end, not to develop a product.

Why this project exists

I created this project to explore and validate the core ideas behind RAG systems in practice:

- how raw text/PDF data is ingested
- how different chunking strategies affect retrieval quality
- how embeddings and vector databases interact
- how prompt design constrains answer behavior
- how retrieval confidence and evaluation influence response quality

This repository therefore represents an experimentation-focused implementation rather than a polished production system.

Project focus

The strongest emphasis is on content processing and retrieval behavior:

- comparing multiple chunking approaches
- storing chunk outputs as JSON for inspection
- persisting embeddings in ChromaDB for semantic search
- tracing retrieved sources and similarity scores
- evaluating response correctness with simple automated checks
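
To make "tracing retrieved sources and similarity scores" concrete, here is a minimal sketch in plain Python. It does not use ChromaDB or the Gemini embeddings from this repository: the vectors are hand-written toy values, and `cosine_similarity`, `retrieve`, and the `source`/`embedding` record fields are hypothetical names chosen for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, treat it as unrelated
    return dot / (norm_a * norm_b)


def retrieve(query_vec, records, top_k=2):
    """Score every record against the query and return the best top_k."""
    scored = [
        {
            "source": r["source"],
            "text": r["text"],
            "score": cosine_similarity(query_vec, r["embedding"]),
        }
        for r in records
    ]
    scored.sort(key=lambda r: r["score"], reverse=True)
    return scored[:top_k]


# Toy records standing in for embedded chunks (values are made up).
records = [
    {"source": "doc_a.txt", "text": "Chunk about cats", "embedding": [1.0, 0.0, 0.0]},
    {"source": "doc_b.pdf", "text": "Chunk about dogs", "embedding": [0.0, 1.0, 0.0]},
    {"source": "doc_c.txt", "text": "Chunk about cats and dogs", "embedding": [1.0, 1.0, 0.0]},
]

hits = retrieve([1.0, 0.1, 0.0], records)
for hit in hits:
    # Printing score and source side by side makes weak matches easy to spot.
    print(f"{hit['score']:.3f}  {hit['source']}  {hit['text']}")
```

Keeping the source and score next to each retrieved chunk is what makes it possible to see why an answer was (or was not) grounded in the right document.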

Repository structure and context

Core pipeline files

- create_database.py: data ingestion, chunk generation, and persistence to ChromaDB and JSON
- query_data.py: retrieval and answer generation from the vector database
- chunking_strategies.py: alternative chunking implementations
  - character splitter
  - statistical chunking
  - LLM-based semantic chunking
- config.py: central configuration for data paths, model choices, and prompt templates
- ai_setup.py: Gemini client and embedding wrapper setup
- test_rag.py: simple response-validation tests for positive and negative cases
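
As an illustration of the simplest of the three strategies, the following sketch splits text into fixed-size character windows with overlap. It is not the code from chunking_strategies.py; the function name `split_by_characters` and the `chunk_size`/`overlap` parameters are assumptions made for this example.

```python
def split_by_characters(text, chunk_size=200, overlap=50):
    """Split text into windows of chunk_size characters.

    Consecutive windows share `overlap` characters so that a sentence cut
    at a boundary still appears intact in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text, otherwise
        # the loop would emit a redundant tail that is pure overlap.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_by_characters(sample, chunk_size=10, overlap=3)
print(chunks)
```

A splitter like this ignores sentence and paragraph boundaries entirely, which is what makes it a useful baseline when comparing against the statistical and LLM-based variants.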

Data and generated artifacts

- Data/: datasets used for ingestion
- Chunks_*/: chunk outputs stored as JSON for manual review
- ChromaDB_*/: persisted vector stores for different chunking variants
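
The sketch below shows one way chunk outputs could be written as JSON for manual review. The record layout (`id`, `source`, `text`, `length`), the `save_chunks` helper, and the `Chunks_Demo` folder name are hypothetical; the actual files under Chunks_*/ may be structured differently.

```python
import json
import tempfile
from pathlib import Path


def save_chunks(chunks, source, out_dir):
    """Write one JSON file per source document and return its path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    stem = Path(source).stem
    records = [
        {"id": f"{stem}-{i}", "source": source, "text": chunk, "length": len(chunk)}
        for i, chunk in enumerate(chunks)
    ]

    out_path = out_dir / f"{stem}.json"
    # indent=2 keeps the file readable when opened in an editor.
    out_path.write_text(
        json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return out_path


# Demo in a temporary directory so nothing is left behind on disk.
with tempfile.TemporaryDirectory() as tmp:
    path = save_chunks(
        ["first chunk", "second chunk"], "example.txt", Path(tmp) / "Chunks_Demo"
    )
    loaded = json.loads(path.read_text(encoding="utf-8"))

print(loaded[0])
```

Storing the chunk length alongside the text makes it quick to spot very short or very long chunks when comparing strategies.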

Technical idea in one view

1. Load source documents (.txt and .pdf)
2. Split content into chunks (different chunking strategies)
3. Embed chunks and store them in ChromaDB
4. Retrieve top-matching chunks for a query
5. Generate an answer constrained to retrieved context
6. Optionally evaluate expected vs. actual response alignment
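
The last two steps can be sketched without calling any model. The example below builds a prompt that restricts the answer to the retrieved chunks and applies a simple keyword check for positive and negative cases. The template wording, the `FALLBACK` sentence, and the `build_prompt` and `check_response` helpers are assumptions for illustration; the real templates live in config.py and the real checks in test_rag.py, and they may work differently.

```python
PROMPT_TEMPLATE = """Answer the question using only the context below.
If the context does not contain the answer, reply exactly: {fallback}

Context:
{context}

Question: {question}
"""

# Hypothetical refusal sentence the model is asked to return verbatim.
FALLBACK = "I don't know based on the provided documents."


def build_prompt(question, chunks):
    """Join retrieved chunks into one context block and fill the template."""
    context = "\n\n---\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(
        fallback=FALLBACK, context=context, question=question
    )


def check_response(actual, expected_keywords, expect_answer=True):
    """Positive case: every keyword appears in the answer.
    Negative case: the answer contains the fallback sentence."""
    text = actual.lower()
    if not expect_answer:
        return FALLBACK.lower() in text
    return all(keyword.lower() in text for keyword in expected_keywords)


prompt = build_prompt(
    "Where are embeddings stored?",
    ["Embeddings are persisted in ChromaDB.", "Chunks are also saved as JSON."],
)
print(prompt)

# The strings below stand in for model output; no model is called here.
print(check_response("They are persisted in ChromaDB.", ["chromadb"]))
print(check_response(FALLBACK, [], expect_answer=False))
```

Asking for one exact fallback sentence keeps the negative case easy to test: a question outside the documents should produce that sentence and nothing invented.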

What this repository demonstrates

- a full local RAG workflow from ingestion to answer generation
- practical comparison of chunking methods in one codebase
- transparent artifact generation (JSON + vector DB) for debugging and analysis

