This repository was created as an independent personal learning tool and technical evaluation. It documents a comparison phase used to determine the most effective web scraping strategy for a RAG-based (Retrieval-Augmented Generation) chatbot.
The goal of this sandbox repository is to explore how to expand a chatbot's knowledge base with live web data. It compares two approaches to ingesting web content into a RAG pipeline.
| Feature | BeautifulSoup & HTTPX | Scrapy Framework |
|---|---|---|
| Approach | Library-based (Scripting) | Full-featured Framework |
| Performance | Sequential / Manual Async | High-speed Native Asynchronicity |
| Scalability | Good for targeted tasks | Built for large-scale crawling |
| RAG Utility | Quick prototyping | Robust production-grade ingestion |
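To make the "Manual Async" entry in the Performance row concrete, here is a minimal sketch of what hand-written concurrency can look like in the library-based approach. It is not taken from this repository: `fake_fetch`, `fetch_all`, and `max_concurrency` are hypothetical names, and `fake_fetch` only simulates latency in place of a real HTTP call (for example via `httpx.AsyncClient`), so the sketch runs offline.

```python
import asyncio


async def fake_fetch(url: str) -> str:
    # Hypothetical stand-in for a real HTTP request; it only simulates
    # network latency and returns a tiny HTML string.
    await asyncio.sleep(0.01)
    return f"<html><title>{url}</title></html>"


async def fetch_all(urls: list[str], max_concurrency: int = 5) -> list[str]:
    # The semaphore caps how many requests are in flight at once.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url: str) -> str:
        async with semaphore:
            return await fake_fetch(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded_fetch(u) for u in urls))


if __name__ == "__main__":
    demo_urls = [f"https://example.com/page/{i}" for i in range(10)]
    pages = asyncio.run(fetch_all(demo_urls))
    print(len(pages))
```

With a script-style scraper, this kind of scheduling and throttling has to be written and maintained by hand, whereas a framework such as Scrapy ships request scheduling and concurrency settings out of the box.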
This monorepo allows for a side-by-side comparison of the two implementations:
- `BeautifulSoup/`: Focuses on a lightweight approach using `BeautifulSoup4` and `httpx`. Ideal for specific, high-precision extraction.
- `Scrapy/`: A complete Scrapy project architecture. It demonstrates how to handle complex site structures and large-scale data pipelines.
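As an illustration of the targeted extraction style described for the lightweight approach, the following sketch parses a small inline HTML sample with BeautifulSoup and keeps only the title and paragraph text. It is an assumption-laden example, not the code in `BeautifulSoup/`: `SAMPLE_HTML`, `extract_main_text`, and the choice of tags to discard are hypothetical, and the page is hard-coded so no network access is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, inlined so the sketch runs offline.
SAMPLE_HTML = """
<html>
  <head><title>Example Course Page</title></head>
  <body>
    <nav>Home | About</nav>
    <main>
      <h1>Admissions</h1>
      <p>Applications open in   May.</p>
      <p>Contact the student
         office for details.</p>
    </main>
    <footer>Imprint</footer>
  </body>
</html>
"""


def extract_main_text(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements that rarely carry useful knowledge-base content.
    for tag in soup.find_all(["nav", "footer", "script", "style"]):
        tag.decompose()
    title = soup.title.get_text(strip=True) if soup.title else ""
    # Collapse runs of whitespace inside each paragraph.
    paragraphs = [" ".join(p.get_text().split()) for p in soup.find_all("p")]
    return {"title": title, "text": "\n".join(paragraphs)}


if __name__ == "__main__":
    print(extract_main_text(SAMPLE_HTML))
```

Returning a plain dictionary with a title and cleaned text keeps the output easy to pass on to whatever chunking or embedding step the RAG pipeline uses.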
- Clone the Repository: `git clone URL`
- Environment Configuration: `python -m venv .venv` (activate on Windows: `.\.venv\Scripts\activate`)
- Install Requirements: `pip install -r requirements.txt`
Option A: BeautifulSoup Version

    python BeautifulSoup/Webscraper.py

Option B: Scrapy Version

    cd Scrapy
    scrapy crawl tum_sitemap