IIICodeChrisIII/rag-chatbot

RAG Chatbot (Learning Project)

This repository contains a Retrieval-Augmented Generation (RAG) chatbot that I built as a hands-on learning exercise.
The main goal was not to ship a product but to understand the full RAG pipeline end to end.

Why this project exists

I created this project to explore and validate the core ideas behind RAG systems in practice:

  • how raw text/PDF data is ingested
  • how different chunking strategies affect retrieval quality
  • how embeddings and vector databases interact
  • how prompt design constrains answer behavior
  • how retrieval confidence and evaluation influence response quality

This repository is therefore an experimental implementation rather than a polished production system.

Project focus

The strongest emphasis is on content processing and retrieval behavior:

  • comparing multiple chunking approaches
  • storing chunk outputs as JSON for inspection
  • persisting embeddings in ChromaDB for semantic search
  • tracing retrieved sources and similarity scores
  • evaluating response correctness with simple automated checks
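
To make the idea of tracing sources and similarity scores concrete, here is a minimal sketch in plain Python. It is not the code from query_data.py: the `Hit` record, the `top_k` helper, and the three-dimensional toy vectors are all assumptions for illustration, and the ranking is done with hand-written cosine similarity rather than through ChromaDB.

```python
import math
from dataclasses import dataclass


@dataclass
class Hit:
    source: str   # file the chunk came from
    text: str     # chunk content
    score: float  # cosine similarity to the query vector


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=2):
    """Score every (source, text, vector) entry and return the k best hits."""
    hits = [Hit(src, text, cosine(query_vec, vec)) for src, text, vec in index]
    return sorted(hits, key=lambda h: h.score, reverse=True)[:k]


# Toy "embeddings" (hypothetical); real ones would come from an embedding model.
index = [
    ("guide.txt", "Chunks are stored in a vector database.", [1.0, 0.0, 0.0]),
    ("guide.txt", "Prompts constrain the answer.", [0.0, 1.0, 0.0]),
    ("notes.pdf", "Vector stores support semantic search.", [0.9, 0.1, 0.0]),
]

hits = top_k([1.0, 0.0, 0.0], index, k=2)
for hit in hits:
    # Printing score and source next to the text makes retrieval easy to audit.
    print(f"{hit.score:.3f}  {hit.source}  {hit.text}")
```

Keeping the source and score attached to each retrieved chunk is what makes it possible to see why a given answer was produced.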

Repository structure and context

Core pipeline files

  • create_database.py — data ingestion, chunk generation, and persistence to Chroma + JSON
  • query_data.py — retrieval and answer generation from the vector database
  • chunking_strategies.py — alternative chunking implementations:
    • character splitter
    • statistical chunking
    • LLM-based semantic chunking
  • config.py — central configuration for data paths, model choices, and prompt templates
  • ai_setup.py — Gemini client + embedding wrapper setup
  • test_rag.py — simple response-validation tests for positive/negative cases
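
As an illustration of the simplest of the three strategies, a character splitter can be sketched as follows. This is an assumed, minimal version and not the implementation in chunking_strategies.py; the function name `split_by_characters` and the `chunk_size` and `overlap` parameters are hypothetical.

```python
def split_by_characters(text, chunk_size=200, overlap=20):
    """Split text into fixed-size character windows that overlap slightly.

    The overlap keeps a sentence that straddles a boundary visible in both
    neighbouring chunks, at the cost of some duplicated text.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


demo_chunks = split_by_characters("abcdefghij", chunk_size=4, overlap=1)
print(demo_chunks)
```

A splitter like this ignores sentence and topic boundaries entirely, which is what the statistical and LLM-based strategies try to improve on.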

Data and generated artifacts

  • Data/ — datasets used for ingestion
  • Chunks_*/ — chunk outputs stored as JSON for manual review
  • ChromaDB_*/ — persisted vector stores for different chunking variants
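
The JSON artifacts could be produced along these lines. This is only a sketch using the standard library: the record fields (`id`, `source`, `text`, `length`), the `save_chunks` helper, and the `Chunks_demo` directory name are assumptions and may not match what create_database.py actually writes.

```python
import json
import tempfile
from pathlib import Path


def save_chunks(chunks, source, out_dir):
    """Write one JSON file per source document and return its path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(source).stem
    records = [
        {"id": f"{stem}-{i}", "source": source, "text": chunk, "length": len(chunk)}
        for i, chunk in enumerate(chunks)
    ]
    out_path = out_dir / f"{stem}.json"
    # indent=2 keeps the file readable for manual review.
    out_path.write_text(
        json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return out_path


# Demo in a temporary directory so nothing is left behind on disk.
with tempfile.TemporaryDirectory() as tmp:
    path = save_chunks(
        ["First chunk.", "Second chunk."], "Data/guide.txt", Path(tmp) / "Chunks_demo"
    )
    saved_name = path.name
    loaded = json.loads(path.read_text(encoding="utf-8"))

print(saved_name, [record["id"] for record in loaded])
```

Storing the chunk length and source alongside the text makes it easier to compare the output of different chunking variants side by side.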

Technical idea in one view

  1. Load source documents (.txt and .pdf)
  2. Split content into chunks using one of the chunking strategies
  3. Embed chunks and store them in ChromaDB
  4. Retrieve top-matching chunks for a query
  5. Generate an answer constrained to retrieved context
  6. Optionally evaluate expected vs. actual response alignment
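
Steps 5 and 6 can be sketched without any model call. The prompt wording, the `min_score` threshold of 0.5, and the keyword-based check below are assumptions for illustration; the actual templates live in config.py and the actual checks in test_rag.py, and both may work differently.

```python
FALLBACK = "I don't know."

PROMPT_TEMPLATE = """Answer the question using only the context below.
If the context does not contain the answer, reply exactly: "{fallback}"

Context:
{context}

Question: {question}
"""


def build_prompt(question, hits, min_score=0.5):
    """Build a context-constrained prompt from (text, score) pairs.

    Returns None when no chunk reaches min_score, so the caller can answer
    with the fallback instead of querying the model on weak context.
    """
    relevant = [text for text, score in hits if score >= min_score]
    if not relevant:
        return None
    return PROMPT_TEMPLATE.format(
        fallback=FALLBACK, context="\n---\n".join(relevant), question=question
    )


def passes_check(actual, expected_keywords, should_answer=True):
    """Simple check: positive cases must contain every keyword,
    negative cases must be refused with the fallback phrase."""
    answer = actual.lower()
    refused = FALLBACK.lower() in answer
    if not should_answer:
        return refused
    return not refused and all(k.lower() in answer for k in expected_keywords)


prompt = build_prompt(
    "Where are embeddings stored?",
    [("Embeddings are stored in ChromaDB.", 0.82), ("Unrelated text.", 0.12)],
)
print(prompt)
print(passes_check("They are stored in ChromaDB.", ["chromadb"]))
```

Keyword matching is a deliberately crude stand-in for evaluation; it is cheap to run but will miss correct answers that are phrased differently.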

What this repository demonstrates

  • a full local RAG workflow from ingestion to answer generation
  • practical comparison of chunking methods in one codebase
  • transparent artifact generation (JSON + vector DB) for debugging and analysis
