Adaptive RAG is a query-aware Retrieval-Augmented Generation pipeline that dynamically selects its retrieval strategy (Sparse, Dense, or Iterative Multi-Hop) based on the complexity of the incoming query.
Instead of forcing all queries through an expensive hybrid or multi-hop retriever (which hurts latency), or a single-pass dense retriever (which fails on complex questions), this architecture routes queries using a lightweight, fine-tuned DistilBERT classification model.
This project was designed and built under a strict $0/free-tier budget constraint. This limitation heavily shaped our engineering decisions:
- Router Scope: We trimmed the original target scope down to 3 classes (`factual`, `abstractive`, `multi-hop`), intentionally dropping the `time-sensitive` (recency-biased) and `no-retrieval-needed` classes to reduce data and annotation scale.
- Silver Labels: The 10,000 training queries were annotated via free-tier Gemini API usage (locally orchestrated), rather than paid human annotation.
- Hardware Limitations: Our baseline evaluations and modeling were split between free Colab GPUs and consumer-grade local hardware, while generation services fall back to free-tier cloud environments (like HuggingFace Serverless CPU Inference or Oracle Cloud Always-Free ARM instances).
- Dataset Scope: The evaluation focused on a 7-dataset core drawn from BEIR rather than the full suite, skipping some massive datasets purely due to free-tier storage and compute quota limits.
- Query Router: A DistilBERT model fine-tuned on 10,000 silver-labeled queries to classify user input into `factual`, `abstractive`, or `multi-hop`.
- Dynamic Retrieval:
  - `factual`: Routes to BM25 (Sparse) for exact entity lookups.
  - `abstractive`: Routes to FAISS (Dense) for semantic similarity.
  - `multi-hop`: Routes to an iterative LLM-driven retriever that chains multiple search hops.
- Confidence Fallback: A post-hoc cross-check that evaluates the quality of the retrieved context. If confidence falls below dataset-calibrated thresholds, it triggers a fallback strategy before passing context to the Generator.
- Generator: The Streamlit UI generates final answers using either a local model via Ollama or the HuggingFace Serverless Inference API.
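
The routing and fallback flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the keyword-based `toy_classify` stand-in for the DistilBERT router, the stub retrievers, the use of the top retrieval score as the confidence signal, and the choice of the dense route as the fallback are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

# A retriever takes a query and returns (document, score) pairs.
Hits = List[Tuple[str, float]]
Retriever = Callable[[str], Hits]


def route_and_retrieve(
    query: str,
    classify: Callable[[str], str],
    retrievers: Dict[str, Retriever],
    threshold: float,
    fallback_label: str = "abstractive",  # assumption: fall back to dense
) -> Tuple[str, Hits]:
    """Pick a retriever from the predicted class, then cross-check confidence."""
    label = classify(query)
    hits = retrievers[label](query)

    # Assumption: confidence is approximated by the best retrieval score.
    confidence = max((score for _, score in hits), default=0.0)
    if confidence < threshold and label != fallback_label:
        label = fallback_label
        hits = retrievers[label](query)
    return label, hits


def toy_classify(query: str) -> str:
    """Hypothetical keyword rules standing in for the fine-tuned router."""
    q = query.lower()
    if " and " in q or "both" in q:
        return "multi-hop"
    if q.startswith(("why", "how")):
        return "abstractive"
    return "factual"


# Stub retrievers with fixed scores, in place of BM25, FAISS, and the
# iterative multi-hop retriever.
stub_retrievers: Dict[str, Retriever] = {
    "factual": lambda q: [("bm25-doc", 0.2)],
    "abstractive": lambda q: [("dense-doc", 0.8)],
    "multi-hop": lambda q: [("hop1-doc", 0.7), ("hop2-doc", 0.6)],
}

if __name__ == "__main__":
    print(route_and_retrieve("Who wrote Dune?", toy_classify, stub_retrievers, 0.5))
```

In a real deployment the threshold would be the dataset-calibrated value mentioned above rather than a constant passed by hand.
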
We evaluated Adaptive RAG across 7 datasets spanning multiple domains (Multi-hop QA, Medical IR, Financial QA, Fact-checking, Argument Retrieval).
| Domain | Outcome | Root Cause |
|---|---|---|
| Multi-hop QA (HotpotQA) | ✅ Beats dense | Router correctly escalates 47% of eligible factual/abstractive queries to multi-hop |
| Medical IR (NFCorpus) | ✅ Near-parity | Small NDCG gap vs dense; beats hybrid |
| Fact-checking (SciFact) | | Lexical matching advantage in exact scientific terminology |
| Argument Retrieval (ArguAna) | ✅ Matches dense, 3× faster | Router correctly uses single-pass dense; avoids hybrid overhead |
| Financial QA (FiQA) | ❌ Loses to dense/hybrid | Router under-escalates; financial multi-step reasoning misclassified |
| COVID Medical (TREC-COVID) | ❌ Largest gap | Insufficient calibration data (40 train queries); only 5 test queries |
Honest Framing: Adaptive RAG's wins concentrate heavily on domains where the router's training distribution aligns well (e.g., multi-hop factual, general web-style queries). Penalties appear on out-of-domain corpora (medical, financial) where the router systematically misfires on escalation decisions (under-escalating complex financial reasoning, or struggling with highly specialized medical vocabulary).
When reading the performance and latency metrics in `evaluation_report.md`, please note:
- Hardware Inconsistency: Runs for `scifact` and `arguana` were executed on Colab GPUs, while others (`hotpotqa`, `nfcorpus`, etc.) were executed locally. Cross-dataset latency comparisons are completely invalid.
- Downsampled HotpotQA: To fit memory constraints, the HotpotQA corpus was downsampled to gold documents + ~5,000 random distractors. This dramatically inflates absolute NDCG metrics to ~0.86-0.90 across all strategies on that dataset.
- TREC-COVID Scope: The evaluation used a tiny 5-query test set and 40 train queries, making calibration difficult and evaluation metrics statistically noisy.
- Missing-Field Defaults: The `Multi-Hop Route %` and `Fallback Trigger %` telemetry fields were added late in the evaluation phase. Checkpoints for `scifact` and `arguana` predate this addition and show 0.0% as a missing-field default, not an empirical measurement.
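
When post-processing the report, one way to avoid mistaking those defaults for measurements is to blank them out before aggregating. The sketch below assumes each dataset's metrics are available as a plain dictionary keyed by column name; that layout, the function name, and the `NDCG@10` key in the usage example are hypothetical, and only the two telemetry field names and the two affected datasets come from the notes above.

```python
from typing import Dict, Optional

# Telemetry fields that were added late in the evaluation phase.
LATE_FIELDS = ("Multi-Hop Route %", "Fallback Trigger %")

# Datasets whose checkpoints predate those fields.
PRE_TELEMETRY_DATASETS = {"scifact", "arguana"}


def mask_missing_telemetry(
    dataset: str, metrics: Dict[str, float]
) -> Dict[str, Optional[float]]:
    """Return a copy of `metrics` with unmeasured telemetry set to None."""
    cleaned: Dict[str, Optional[float]] = dict(metrics)
    if dataset.lower() in PRE_TELEMETRY_DATASETS:
        for field in LATE_FIELDS:
            if field in cleaned:
                # The stored 0.0 is a default, not an observation.
                cleaned[field] = None
    return cleaned


if __name__ == "__main__":
    row = {"NDCG@10": 0.7, "Multi-Hop Route %": 0.0, "Fallback Trigger %": 0.0}
    print(mask_missing_telemetry("scifact", row))
```

Returning a copy keeps the original row untouched, so the raw checkpoint values remain available for auditing.
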
By intelligently avoiding expensive hybrid retrieval when it isn't needed, Adaptive RAG achieves significant latency reductions on single-hop datasets:
- ArguAna: 65% faster than fixed-hybrid with identical NDCG.
- SciFact: 34% faster than fixed-hybrid.
- FiQA: 29% faster than fixed-hybrid.
(Note: Multi-hop queries predictably increase latency due to the iterative LLM chain.)
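
To relate these percentages to the "3× faster" wording used in the results table, the following sketch converts between a relative latency reduction and a speedup factor. It assumes "X% faster" means a reduction in latency relative to fixed-hybrid, which the README does not state explicitly; the function names are illustrative.

```python
def latency_reduction(baseline_s: float, adaptive_s: float) -> float:
    """Fraction of baseline latency saved, e.g. 0.65 for a 65% reduction."""
    if baseline_s <= 0:
        raise ValueError("baseline latency must be positive")
    return (baseline_s - adaptive_s) / baseline_s


def speedup_from_reduction(reduction: float) -> float:
    """Convert a latency reduction (0 <= r < 1) into a speedup factor."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    for name, r in [("ArguAna", 0.65), ("SciFact", 0.34), ("FiQA", 0.29)]:
        print(f"{name}: {r:.0%} reduction = {speedup_from_reduction(r):.2f}x speedup")
```

Under that reading, a 65% reduction works out to roughly a 2.9× speedup, which is consistent with the table's rounded "3× faster" for ArguAna; 34% and 29% correspond to about 1.5× and 1.4×.
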
- Adaptive RAG Router Model: The fine-tuned `distilbert-base-uncased` classification model.
- Adaptive RAG Labeled Queries: The 10k silver-labeled dataset used to train the router.
- Install dependencies (the activation command shown is for Windows; on macOS/Linux use `source venv/bin/activate`):

      python -m venv venv
      venv\Scripts\activate
      pip install -r requirements.txt

- Run the interactive Streamlit demo:

      streamlit run app.py

Note: If you export an `HF_TOKEN` in your environment, the app will automatically use the HuggingFace Serverless Inference API instead of a local Ollama generator.
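
The backend switch described in the note can be pictured with a small sketch like the one below. The logic inside `app.py` is not shown in this README, so the function name and the returned labels are hypothetical; only the `HF_TOKEN` variable and the two backends come from the note.

```python
import os
from typing import Mapping, Optional


def select_generator_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "huggingface" when HF_TOKEN is set and non-empty, else "ollama"."""
    env = os.environ if env is None else env
    return "huggingface" if env.get("HF_TOKEN") else "ollama"


if __name__ == "__main__":
    print(select_generator_backend())
```

Passing the environment as a parameter keeps the check easy to test without modifying the real process environment.
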
The codebase and architectural design (including this README and associated pipeline code) are open-sourced under the Apache License 2.0.
Note: The datasets utilized for training and evaluation are governed by their respective owners' licenses (e.g., CC BY-SA 4.0, academic-use only). Please refer to the dataset card on HuggingFace for specific attribution and usage limitations.
