IIICodeChrisIII/webscraper_comparison

Web Scraping Methodology Evaluation: RAG-Chatbot

This repository was created as an independent personal learning tool and technical evaluation. It documents a comparison phase used to determine the most effective web scraping strategy for a RAG-based (Retrieval-Augmented Generation) chatbot.

Context & Objective

The goal of this sandbox repository is to explore how to expand a chatbot's knowledge base by integrating live web data. It compares two methodologies, BeautifulSoup with httpx and the Scrapy framework, to find the better fit for ingesting web content into a RAG pipeline.


Comparison Matrix

| Feature     | BeautifulSoup & HTTPX     | Scrapy Framework                   |
|-------------|---------------------------|------------------------------------|
| Approach    | Library-based (Scripting) | Full-featured Framework            |
| Performance | Sequential / Manual Async | High-speed Native Asynchronicity   |
| Scalability | Good for targeted tasks   | Built for large-scale crawling     |
| RAG Utility | Quick prototyping         | Robust production-grade ingestion  |

Repository Structure

This monorepo allows for a side-by-side comparison of the two implementations:

  • BeautifulSoup/: Focuses on a lightweight approach using BeautifulSoup4 and httpx. Ideal for specific, high-precision extraction.
  • Scrapy/: A complete Scrapy project architecture. It demonstrates how to handle complex site structures and large-scale data pipelines.
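As a rough idea of what the lightweight extraction step could look like, the sketch below parses one HTML page with BeautifulSoup and returns its title and visible text. The function name `extract_text` and the choice of tags to discard are assumptions for illustration and may differ from what `BeautifulSoup/Webscraper.py` actually does.

```python
from bs4 import BeautifulSoup


def extract_text(html):
    """Return (title, visible_text) for a single HTML page.

    Hypothetical helper, not the repository's actual implementation.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Drop elements that rarely carry content useful for retrieval.
    for tag in soup.find_all(["script", "style", "nav", "footer"]):
        tag.decompose()

    title = soup.title.get_text(strip=True) if soup.title else ""

    # Prefer <main> when the page has one; otherwise fall back to <body>.
    root = soup.find("main") or soup.body or soup
    text = root.get_text(separator=" ", strip=True)
    return title, text


if __name__ == "__main__":
    sample = (
        "<html><head><title>Demo</title></head>"
        "<body><nav>Menu</nav><main><h1>Hello</h1><p>World</p></main></body></html>"
    )
    print(extract_text(sample))
```

Stripping navigation and script content before extracting text keeps boilerplate out of the documents that later get embedded for retrieval.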

Setup & Installation

  1. Clone the Repository:

    git clone URL
  2. Environment Configuration:

    python -m venv .venv
    # Activate on Windows:
    .\.venv\Scripts\activate
    # Activate on macOS/Linux:
    source .venv/bin/activate
  3. Install Requirements:

    pip install -r requirements.txt
    

Execution

Option A: BeautifulSoup Version

python BeautifulSoup/Webscraper.py

Option B: Scrapy Version

cd Scrapy
scrapy crawl tum_sitemap
