Web Scraping Methodology Evaluation: RAG-Chatbot

This repository was created as an independent personal learning tool and technical evaluation. It documents a comparison phase used to determine the most effective web scraping strategy for a RAG-based (Retrieval-Augmented Generation) chatbot.

Context & Objective

The goal of this sandbox repository is to explore how to expand a chatbot's knowledge base with live web data. It compares two approaches, BeautifulSoup with httpx and the Scrapy framework, to determine which is better suited to ingesting web content into a RAG pipeline.

Comparison Matrix

| Feature | BeautifulSoup & httpx | Scrapy Framework |
| --- | --- | --- |
| Approach | Library-based (Scripting) | Full-featured Framework |
| Performance | Sequential / Manual Async | High-speed Native Asynchronicity |
| Scalability | Good for targeted tasks | Built for large-scale crawling |
| RAG Utility | Quick prototyping | Robust production-grade ingestion |
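
The Performance row above notes that the httpx-based approach needs concurrency to be wired up by hand. The following is a minimal sketch of what that can look like, not the code in this repository. The names `fetch_all`, `crawl_with_httpx`, and `max_concurrency` are illustrative assumptions; `httpx` is imported lazily so the concurrency helper can be exercised without it.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, Tuple

# Any coroutine that maps a URL to its response body.
Fetcher = Callable[[str], Awaitable[str]]


async def fetch_all(
    urls: Iterable[str], fetch: Fetcher, max_concurrency: int = 5
) -> Dict[str, str]:
    """Fetch all URLs concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url: str) -> Tuple[str, str]:
        # The semaphore caps how many requests run at the same time.
        async with semaphore:
            return url, await fetch(url)

    pairs = await asyncio.gather(*(bounded_fetch(url) for url in urls))
    return dict(pairs)


async def crawl_with_httpx(urls: Iterable[str]) -> Dict[str, str]:
    """Hypothetical wrapper: run fetch_all over one shared httpx client."""
    import httpx  # third-party dependency, imported lazily (assumption)

    async with httpx.AsyncClient(timeout=10.0, follow_redirects=True) as client:

        async def fetch(url: str) -> str:
            response = await client.get(url)
            response.raise_for_status()
            return response.text

        return await fetch_all(urls, fetch)
```

A call such as `asyncio.run(crawl_with_httpx(["https://www.example.org/"]))` would return a dictionary mapping each URL to its HTML. Scrapy provides comparable scheduling and throttling through its settings, which is the trade-off the table summarizes.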

Repository Structure

This monorepo allows for a side-by-side comparison of the two implementations:

BeautifulSoup/: A lightweight, script-based approach using BeautifulSoup4 and httpx. Suited to targeted extraction from a known set of pages.
Scrapy/: A complete Scrapy project. It demonstrates how to handle complex site structures and large-scale data pipelines.
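
As an illustration of the targeted extraction that the BeautifulSoup/ approach is suited to, here is a minimal sketch. It is not the contents of `Webscraper.py`; the function name `extract_page` and the choice of tags to discard are assumptions made for the example.

```python
from bs4 import BeautifulSoup


def extract_page(html: str) -> dict:
    """Return the page title and paragraph text from an HTML string."""
    soup = BeautifulSoup(html, "html.parser")

    # Remove elements that rarely carry content useful for retrieval.
    for tag in soup.find_all(["script", "style", "nav", "footer"]):
        tag.decompose()

    title = soup.title.get_text(strip=True) if soup.title else ""

    # Collect the text of each paragraph, skipping empty ones.
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    text = "\n".join(p for p in paragraphs if p)

    return {"title": title, "text": text}
```

The returned dictionary could then be passed to whatever chunking and embedding step the RAG pipeline uses.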

Setup & Installation

Clone the Repository:
```
git clone URL
```

Environment Configuration:

```
python -m venv .venv
# Activate on Windows:
.\.venv\Scripts\activate
# Activate on macOS/Linux:
source .venv/bin/activate
```

Install Requirements:
```
pip install -r requirements.txt
```

Execution

Option A: BeautifulSoup Version

```
python BeautifulSoup/Webscraper.py
```

Option B: Scrapy Version

```
cd Scrapy
scrapy crawl tum_sitemap
```
