RabbaniHacker/Adaptive-RAG

Adaptive RAG

Adaptive RAG is a query-aware Retrieval-Augmented Generation pipeline that dynamically selects its retrieval strategy (Sparse, Dense, or Iterative Multi-Hop) based on the complexity of the incoming query.

Instead of forcing all queries through an expensive hybrid or multi-hop retriever (which hurts latency), or a single-pass dense retriever (which fails on complex questions), this architecture routes queries using a lightweight, fine-tuned DistilBERT classification model.

Adaptive RAG Demo

The $0 Constraint & Architecture Choices

This project was designed and built under a strict $0/free-tier budget constraint. This limitation heavily shaped our engineering decisions:

  • Router Scope: We trimmed the original target scope down to 3 classes (factual, abstractive, multi-hop), intentionally dropping time-sensitive (recency-biased) and no-retrieval-needed classes to reduce data and annotation scale.
  • Silver Labels: The 10,000 training queries were annotated via free-tier Gemini API usage (locally orchestrated), rather than paid human annotation.
  • Hardware Limitations: Our baseline evaluations and modeling were split between free Colab GPUs and consumer-grade local hardware, while generation services fall back to free-tier cloud environments (such as HuggingFace Serverless CPU Inference or Oracle Cloud Always-Free ARM instances).
  • Dataset Scope: The evaluation focused on a 7-dataset core drawn from BEIR rather than the full suite, skipping some massive datasets purely due to free-tier storage and compute quota limits.

Core Architecture

  1. Query Router: A DistilBERT model fine-tuned on 10,000 silver-labeled queries to classify user input into factual, abstractive, or multi-hop.
  2. Dynamic Retrieval:
    • factual: Routes to BM25 (Sparse) for exact entity lookups.
    • abstractive: Routes to FAISS (Dense) for semantic similarity.
    • multi-hop: Routes to an iterative LLM-driven retriever that chains multiple search hops.
  3. Confidence Fallback: A post-hoc cross-check that evaluates the quality of the retrieved context. If confidence falls below dataset-calibrated thresholds, it triggers a fallback strategy before passing context to the Generator.
  4. Generator: Produces the final answer from the retrieved context, using either a local model via Ollama or the HuggingFace Serverless Inference API. The Streamlit UI supports both.
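
The routing and fallback flow in steps 1 to 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's code: `classify`, the three retriever stubs, the `confidence` signal (top retrieval score), the 0.5 threshold, and the use of dense retrieval as the fallback are all hypothetical placeholders. Only the label-to-strategy mapping comes from the description above.

```python
from typing import Callable, Dict, List, Tuple

Hits = List[Tuple[str, float]]          # (document, score) pairs
Retriever = Callable[[str], Hits]

# Hypothetical stand-ins for BM25, FAISS, and the iterative LLM-driven retriever.
def sparse_retrieve(query: str) -> Hits:
    return [("sparse-doc", 0.9)]

def dense_retrieve(query: str) -> Hits:
    return [("dense-doc", 0.8)]

def multi_hop_retrieve(query: str) -> Hits:
    return [("hop-1-doc", 0.7), ("hop-2-doc", 0.6)]

# Label-to-strategy mapping as described in the Core Architecture list.
ROUTES: Dict[str, Retriever] = {
    "factual": sparse_retrieve,
    "abstractive": dense_retrieve,
    "multi-hop": multi_hop_retrieve,
}

def confidence(hits: Hits) -> float:
    """Assumed confidence signal: the top retrieval score (0.0 if nothing was retrieved)."""
    return max((score for _, score in hits), default=0.0)

def retrieve(
    query: str,
    classify: Callable[[str], str],
    threshold: float = 0.5,              # assumed value; the project calibrates per dataset
    fallback: Retriever = dense_retrieve,  # assumed fallback strategy
) -> Tuple[str, Hits]:
    """Route the query by its predicted class, then fall back if confidence is low."""
    label = classify(query)
    hits = ROUTES[label](query)
    if confidence(hits) < threshold:
        return "fallback", fallback(query)
    return label, hits
```

In a real deployment `classify` would wrap the fine-tuned DistilBERT router, and the threshold would be looked up per dataset rather than passed as a single constant.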

Evaluation Results

We evaluated Adaptive RAG across 7 datasets spanning multiple domains (Multi-hop QA, Medical IR, Financial QA, Fact-checking, Argument Retrieval).

Where Adaptive RAG Wins and Where It Doesn't

| Domain | Outcome | Root Cause |
| --- | --- | --- |
| Multi-hop QA (HotpotQA) | Beats dense | Router correctly escalates 47% of eligible factual/abstractive queries to multi-hop |
| Medical IR (NFCorpus) | Near-parity | Small NDCG gap vs dense; beats hybrid |
| Fact-checking (SciFact) | ⚠️ Beats dense, loses to sparse | Lexical matching advantage in exact scientific terminology |
| Argument Retrieval (ArguAna) | Matches dense, 3× faster | Router correctly uses single-pass dense; avoids hybrid overhead |
| Financial QA (FiQA) | Loses to dense/hybrid | Router under-escalates; financial multi-step reasoning misclassified |
| COVID Medical (TREC-COVID) | Largest gap | Insufficient calibration data (40 train queries); only 5 test queries |

Honest Framing: Adaptive RAG's wins concentrate heavily on domains where the router's training distribution aligns well (e.g., multi-hop factual, general web-style queries). Penalties appear on out-of-domain corpora (medical, financial) where the router systematically misfires on escalation decisions (under-escalating complex financial reasoning, or struggling with highly specialized medical vocabulary).

Critical Evaluation Caveats

When reading the performance and latency metrics in evaluation_report.md, please note:

  1. Hardware Inconsistency: Runs for scifact and arguana were executed on Colab GPUs, while the others (hotpotqa, nfcorpus, etc.) were executed locally. Latency figures are therefore not comparable across datasets.
  2. Downsampled HotpotQA: To fit memory constraints, the HotpotQA corpus was downsampled to gold documents + ~5,000 random distractors. This dramatically inflates absolute NDCG metrics to ~0.86-0.90 across all strategies on that dataset.
  3. TREC-COVID Scope: The evaluation used a tiny 5-query test set and 40 train queries, making calibration difficult and evaluation metrics statistically noisy.
  4. Missing-Field Defaults: The Multi-Hop Route % and Fallback Trigger % telemetry fields were added late in the evaluation phase. Checkpoints for scifact and arguana predate this addition and show 0.0% as a missing-field default, not an empirical measurement.
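
Caveat 4 is easy to trip over when post-processing results. The sketch below shows one way to keep "not recorded" distinct from a measured 0.0%. It assumes each checkpoint is a dict of metrics, and the key names `multi_hop_route_pct` and `fallback_trigger_pct` are hypothetical; the actual field names in the checkpoints may differ.

```python
from typing import Dict, Optional

# Hypothetical key names for the two late-added telemetry fields.
TELEMETRY_KEYS = ("multi_hop_route_pct", "fallback_trigger_pct")

# Datasets whose checkpoints predate the telemetry fields (per caveat 4).
PRE_TELEMETRY_DATASETS = {"scifact", "arguana"}

def read_telemetry(dataset: str, checkpoint: dict) -> Dict[str, Optional[float]]:
    """Return telemetry values, using None where the field was never recorded."""
    out: Dict[str, Optional[float]] = {}
    for key in TELEMETRY_KEYS:
        if dataset in PRE_TELEMETRY_DATASETS:
            # A stored 0.0 here is a missing-field default, not a measurement.
            out[key] = None
        else:
            out[key] = checkpoint.get(key)  # None if the key is absent
    return out

def fmt(value: Optional[float]) -> str:
    """Render a percentage, showing 'n/a' for unrecorded values."""
    return "n/a" if value is None else f"{value:.1f}%"
```

Reporting `n/a` for these two datasets avoids the misreading that the router never escalated or never fell back on them.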

Latency Savings

By intelligently avoiding expensive hybrid retrieval when it isn't needed, Adaptive RAG achieves significant latency reductions on single-hop datasets:

  • ArguAna: 65% faster than fixed-hybrid with identical NDCG.
  • SciFact: 34% faster than fixed-hybrid.
  • FiQA: 29% faster than fixed-hybrid.

(Note: Multi-hop queries predictably increase latency because of the iterative LLM chain.)
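
To relate these percentages to speedup factors, here is a small worked calculation. It assumes "X% faster" means a relative reduction in latency, (hybrid - adaptive) / hybrid. That reading is an assumption; evaluation_report.md is the authority on how the figures were actually computed. The function names are illustrative.

```python
def latency_reduction(baseline_ms: float, adaptive_ms: float) -> float:
    """Fractional latency reduction relative to the baseline (assumed definition)."""
    if baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    return (baseline_ms - adaptive_ms) / baseline_ms

def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional reduction (e.g. 0.65) into a speedup factor."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)

# Percentages as reported in the list above.
for name, pct in [("ArguAna", 65), ("SciFact", 34), ("FiQA", 29)]:
    factor = speedup_from_reduction(pct / 100)
    print(f"{name}: {pct}% reduction ~ {factor:.2f}x speedup")
```

Under this reading, a 65% reduction corresponds to roughly a 2.9× speedup, which is in line with the "3× faster" entry for ArguAna if that entry is also measured against fixed-hybrid.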

HuggingFace Releases

How to Run Locally

  1. Install dependencies:

    python -m venv venv
    venv\Scripts\activate        # Windows; on macOS/Linux use: source venv/bin/activate
    pip install -r requirements.txt
  2. Run the interactive Streamlit demo:

    streamlit run app.py

    Note: If you export an HF_TOKEN in your environment, the app will automatically use the HuggingFace Serverless Inference API instead of a local Ollama generator.
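
The backend selection described in the note can be sketched as below. This only illustrates the stated behavior; the function name `select_generator_backend` and the returned labels are hypothetical, treating a blank token as unset is an assumption, and app.py may implement the check differently.

```python
import os
from typing import Mapping, Optional

def select_generator_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the generator backend based on whether HF_TOKEN is set."""
    env = os.environ if env is None else env
    # Assumption: an empty or whitespace-only token counts as "not set".
    token = env.get("HF_TOKEN", "").strip()
    return "huggingface-serverless" if token else "ollama-local"
```

Passing the environment as a parameter keeps the check easy to test without mutating `os.environ`.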

License

The codebase and architectural design (including this README and associated pipeline code) are open-sourced under the Apache License 2.0.

Note: The datasets utilized for training and evaluation are governed by their respective owners' licenses (e.g., CC BY-SA 4.0, academic-use only). Please refer to the dataset card on HuggingFace for specific attribution and usage limitations.
