Adaptive RAG

Adaptive RAG is a query-aware Retrieval-Augmented Generation pipeline that dynamically selects its retrieval strategy (Sparse, Dense, or Iterative Multi-Hop) based on the complexity of the incoming query.

Instead of forcing all queries through an expensive hybrid or multi-hop retriever (which hurts latency), or a single-pass dense retriever (which fails on complex questions), this architecture routes queries using a lightweight, fine-tuned DistilBERT classification model.

The $0 Constraint & Architecture Choices

This project was designed and built under a strict $0/free-tier budget constraint. This limitation heavily shaped our engineering decisions:

Router Scope: We trimmed the original target scope down to 3 classes (factual, abstractive, multi-hop), intentionally dropping time-sensitive (recency-biased) and no-retrieval-needed classes to reduce data and annotation scale.
Silver Labels: The 10,000 training queries were annotated via free-tier Gemini API usage (locally orchestrated), rather than paid human annotation.
Hardware Limitations: Our baseline evaluations and modeling were split between free Colab GPUs and consumer-grade local hardware, while generation services fall back to free-tier cloud environments (such as HuggingFace Serverless CPU Inference or Oracle Cloud Always-Free ARM instances).
Dataset Scope: The evaluation focused on a 7-dataset core drawn from BEIR rather than the full suite, skipping some massive datasets purely due to free-tier storage and compute quota limits.

Core Architecture

Query Router: A DistilBERT model fine-tuned on 10,000 silver-labeled queries to classify user input into factual, abstractive, or multi-hop.
Dynamic Retrieval:
- factual: Routes to BM25 (Sparse) for exact entity lookups.
- abstractive: Routes to FAISS (Dense) for semantic similarity.
- multi-hop: Routes to an iterative LLM-driven retriever that chains multiple search hops.
Confidence Fallback: A post-hoc cross-check that evaluates the quality of the retrieved context. If confidence falls below dataset-calibrated thresholds, it triggers a fallback strategy before passing context to the Generator.
Generator: The Streamlit UI generates final answers using either local models via Ollama or the HuggingFace Serverless Inference API.
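
The routing and fallback steps above can be sketched as follows. This is a minimal illustration rather than the project's actual `pipeline.py` or `confidence.py`: the names `ROUTES`, `context_confidence`, and `retrieve` are hypothetical, the confidence proxy (top retrieval score, assumed normalized to [0, 1]) is an assumption, and the choice of dense retrieval as the fallback strategy is also an assumption, since the fallback target is not specified here.

```python
from typing import Callable, Dict, List, Tuple

# A retriever maps a query to a ranked list of (doc_id, score) pairs.
Hits = List[Tuple[str, float]]
Retriever = Callable[[str], Hits]

# Router label -> retrieval strategy, mirroring the mapping described above.
ROUTES: Dict[str, str] = {
    "factual": "sparse",       # BM25
    "abstractive": "dense",    # FAISS
    "multi-hop": "multi_hop",  # iterative LLM-driven retriever
}


def context_confidence(hits: Hits) -> float:
    """Hypothetical confidence proxy: the best score, or 0.0 if nothing was retrieved."""
    return max((score for _, score in hits), default=0.0)


def retrieve(
    query: str,
    classify: Callable[[str], str],
    retrievers: Dict[str, Retriever],
    threshold: float,
    fallback: str = "dense",  # assumption: the real fallback strategy may differ
) -> Tuple[str, Hits]:
    """Route the query by predicted label, then fall back if confidence is low."""
    strategy = ROUTES[classify(query)]
    hits = retrievers[strategy](query)
    # Post-hoc check: re-retrieve with the fallback strategy when the
    # retrieved context scores below the dataset-calibrated threshold.
    if strategy != fallback and context_confidence(hits) < threshold:
        strategy = fallback
        hits = retrievers[fallback](query)
    return strategy, hits
```

In this sketch `classify` stands in for the fine-tuned DistilBERT router and `threshold` for a per-dataset calibrated value; both are passed in so the control flow can be exercised with stubs.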

Evaluation Results

We evaluated Adaptive RAG across 7 datasets spanning multiple domains (Multi-hop QA, Medical IR, Financial QA, Fact-checking, Argument Retrieval).

Where Adaptive RAG Wins and Where It Doesn't

| Domain | Outcome | Root Cause |
| --- | --- | --- |
| Multi-hop QA (HotpotQA) | ✅ Beats dense | Router correctly escalates 47% of eligible factual/abstractive queries to multi-hop |
| Medical IR (NFCorpus) | ✅ Near-parity | Small NDCG gap vs dense; beats hybrid |
| Fact-checking (SciFact) | ⚠️ Beats dense, loses to sparse | Lexical matching advantage in exact scientific terminology |
| Argument Retrieval (ArguAna) | ✅ Matches dense, 3× faster | Router correctly uses single-pass dense; avoids hybrid overhead |
| Financial QA (FiQA) | ❌ Loses to dense/hybrid | Router under-escalates; financial multi-step reasoning misclassified |
| COVID Medical (TREC-COVID) | ❌ Largest gap | Insufficient calibration data (40 train queries); only 5 test queries |

Honest Framing: Adaptive RAG's wins concentrate heavily on domains where the router's training distribution aligns well (e.g., multi-hop factual, general web-style queries). Penalties appear on out-of-domain corpora (medical, financial) where the router systematically misfires on escalation decisions (under-escalating complex financial reasoning, or struggling with highly specialized medical vocabulary).
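
The comparisons above are stated in terms of NDCG. For readers unfamiliar with the metric, here is a minimal sketch of NDCG@k using linear gain and a log2 rank discount. The cutoff `k=10` and the helper names `dcg` and `ndcg_at_k` are assumptions for illustration; the project's own evaluation code may use a different cutoff or library.

```python
import math
from typing import Dict, List


def dcg(relevances: List[float]) -> float:
    """Discounted cumulative gain: each gain is divided by log2(position + 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg_at_k(ranked_doc_ids: List[str], qrels: Dict[str, int], k: int = 10) -> float:
    """NDCG@k for one query, given its ranked results and relevance judgments."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    # The ideal ranking places the most relevant judged documents first.
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A ranking that places every relevant document at the top scores 1.0; pushing a relevant document down the list lowers the score through the log discount.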

Critical Evaluation Caveats

When reading the performance and latency metrics in evaluation_report.md, please note:

Hardware Inconsistency: Runs for scifact and arguana were executed on Colab GPUs, while the others (hotpotqa, nfcorpus, etc.) were executed locally. Latency figures are therefore not comparable across datasets; only compare strategies within the same dataset.
Downsampled HotpotQA: To fit memory constraints, the HotpotQA corpus was downsampled to gold documents + ~5,000 random distractors. This dramatically inflates absolute NDCG metrics to ~0.86-0.90 across all strategies on that dataset.
TREC-COVID Scope: The evaluation used a tiny 5-query test set and 40 train queries, making calibration difficult and evaluation metrics statistically noisy.
Missing-Field Defaults: The Multi-Hop Route % and Fallback Trigger % telemetry fields were added late in the evaluation phase. Checkpoints for scifact and arguana predate this addition and show 0.0% as a missing-field default, not an empirical measurement.
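
To make the HotpotQA caveat concrete, the downsampling described above can be sketched as follows. This is an illustrative reconstruction, not the project's actual script: the function name `downsample_corpus`, its parameters, and the fixed seed are assumptions.

```python
import random
from typing import Dict, Iterable


def downsample_corpus(
    corpus: Dict[str, str],
    gold_ids: Iterable[str],
    n_distractors: int = 5000,
    seed: int = 0,
) -> Dict[str, str]:
    """Keep every gold document plus a random sample of non-gold distractors."""
    gold = {doc_id for doc_id in gold_ids if doc_id in corpus}
    # Sort the pool so the sample is reproducible for a given seed.
    pool = sorted(doc_id for doc_id in corpus if doc_id not in gold)
    rng = random.Random(seed)
    distractors = set(rng.sample(pool, min(n_distractors, len(pool))))
    keep = gold | distractors
    return {doc_id: text for doc_id, text in corpus.items() if doc_id in keep}
```

Because every gold document survives while most of the competing documents are dropped, any retriever has far fewer ways to rank a gold document low, which is why absolute NDCG on this dataset is inflated for all strategies.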

Latency Savings

By avoiding expensive hybrid retrieval when it isn't needed, Adaptive RAG reduces latency by 29-65% relative to fixed-hybrid on single-hop datasets:

ArguAna: 65% faster than fixed-hybrid with identical NDCG.
SciFact: 34% faster than fixed-hybrid.
FiQA: 29% faster than fixed-hybrid.

(Note: Multi-hop queries predictably increase latency because of the iterative LLM chain.)
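
As a reading aid for the percentages above, here is a small sketch that assumes "X% faster" means the relative reduction in mean per-query latency versus the fixed-hybrid baseline. That interpretation, and the function name `latency_reduction_pct`, are assumptions; the numbers in the usage note are illustrative, not measurements from this project.

```python
def latency_reduction_pct(baseline_ms: float, adaptive_ms: float) -> float:
    """Relative latency reduction of the adaptive pipeline versus a baseline, in percent."""
    if baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    return 100.0 * (baseline_ms - adaptive_ms) / baseline_ms
```

For example, with a hypothetical baseline of 200 ms and an adaptive latency of 70 ms, `latency_reduction_pct(200.0, 70.0)` returns `65.0`. Both latencies must come from the same hardware for the figure to be meaningful.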

HuggingFace Releases

Adaptive RAG Router Model: The fine-tuned distilbert-base-uncased classification model.
Adaptive RAG Labeled Queries: The 10k silver-labeled dataset used to train the router.

How to Run Locally

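Install dependencies (the activation command shown is for Windows; on macOS/Linux use `source venv/bin/activate` instead):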

```
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
```

Run the interactive Streamlit demo:
```
streamlit run app.py
```
Note: If you export an HF_TOKEN in your environment, the app will automatically use the HuggingFace Serverless Inference API instead of a local Ollama generator.
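
The backend selection described in the note can be sketched as below. This is a hedged illustration, not the code in `app.py`: the function name `select_generator_backend`, the returned labels, and the choice to treat an empty `HF_TOKEN` as unset are all assumptions.

```python
import os
from typing import Mapping, Optional


def select_generator_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "huggingface" when HF_TOKEN is set and non-empty, else "ollama"."""
    env = os.environ if env is None else env
    # Assumption: an empty token is treated the same as a missing one.
    return "huggingface" if env.get("HF_TOKEN") else "ollama"
```

Passing the environment as a parameter keeps the function easy to test without modifying the real process environment.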

License

The codebase and architectural design (including this README and associated pipeline code) are open-sourced under the Apache License 2.0.

Note: The datasets utilized for training and evaluation are governed by their respective owners' licenses (e.g., CC BY-SA 4.0, academic-use only). Please refer to the dataset card on HuggingFace for specific attribution and usage limitations.
