AgentSQL is a production-grade, asymmetric multi-agent framework designed to solve the central Text-to-SQL dilemma: balancing high Execution Accuracy (EX) against cost-efficiency.
By decoupling the high-volume Generation task from the complex Correction/Reasoning task, AgentSQL achieves state-of-the-art results on the BIRD benchmark while maintaining a significantly lower inference cost compared to monolithic frontier model approaches.
AgentSQL utilizes an Asymmetric Multi-Agent Architecture (MasterPipeline). The workflow strictly isolates offline pre-processing from online inference, allowing for specialized model selection and optimized token usage at each step.
> [!TIP]
> A professional TikZ version of this workflow is available in `agentsql_workflow.tex`, suitable for academic publications and high-resolution reports.
- Phase 1: CHESS Pruning (`tools/chess_linker.py`): Offline semantic filtering using lightweight embedding models (e.g., `bge-small`) to isolate only the most relevant tables and eliminate schema noise.
- Phase 2: MCI-SQL Enrichment (`tools/mci_sql_pipeline.py`): Extracts precise metadata (cardinalities, min/max values, exact row samples) from the pruned schema to build a high-fidelity context.
- Phase 4a/b: Generator & Reflector (`tools/master_pipeline.py`): The core generation loop. An optimized open-source model (e.g., `gpt-oss-120b` or `llama-4-scout-17b`) generates the SQL, which is immediately evaluated by a Reflector for logical self-consistency via back-translation.
- Phase 4c: Resilient Critic (`nodes/corrector.py`): Activated only if the Execution Sandbox detects a syntax error or the Reflector detects a logical mismatch. Powered by a high-reasoning model (e.g., `gemini-2.5-flash`), it performs targeted patching using the MAGIC checklist.
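The Generator → Reflector → Critic loop above can be sketched as follows. This is an illustrative control-flow outline only: `run_pipeline` and its callable parameters are hypothetical names, not the framework's actual API.

```python
from typing import Callable, Tuple

def run_pipeline(
    question: str,
    schema: str,
    generate: Callable[[str, str], str],
    reflect: Callable[[str, str, str], Tuple[bool, str]],
    correct: Callable[[str, str], str],
    max_repairs: int = 2,
) -> str:
    """Sketch of the Generator -> Reflector -> Critic control flow."""
    sql = generate(question, schema)  # Phase 4a: draft SQL from the cheap model
    for _ in range(max_repairs):
        # Phase 4b: check logical self-consistency (e.g., via back-translation)
        ok, feedback = reflect(question, schema, sql)
        if ok:
            break  # accepted; the Critic is never invoked
        # Phase 4c: targeted patch by the high-reasoning Critic
        sql = correct(sql, feedback)
    return sql
```

The key cost property is that the expensive `correct` step runs only when the Reflector rejects the draft, so most queries are served by the cheap generator alone.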
- 🛡️ Ephemeral Sandboxing: Native support for SQLite, MySQL, and PostgreSQL with automatic state reset and set-based result comparison.
- 🔄 Round-Robin Key Rotation: The `KeyRotator` abstraction supports multiple API keys per provider to prevent rate-limiting during large-scale evaluations.
- 🔌 Resilient LLM Factory: Automatic fallback to local Ollama instances if all cloud API keys are exhausted or unavailable.
- 📊 Unified Research Suite: A centralized evaluation engine that calculates EX, VES, and Soft F1 metrics in a single pass.
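The round-robin rotation idea can be sketched in a few lines. This is a minimal illustration of the pattern, not the framework's actual `KeyRotator` implementation:

```python
from itertools import cycle

class KeyRotator:
    """Round-robin rotation over a pool of API keys (illustrative sketch)."""

    def __init__(self, keys: list[str]):
        if not keys:
            raise ValueError("at least one API key is required")
        # cycle() wraps around indefinitely, spreading requests evenly
        self._keys = cycle(keys)

    def next_key(self) -> str:
        # Each call hands out the next key in the pool, so concurrent
        # evaluation workers never hammer a single key's rate limit.
        return next(self._keys)
```

In practice the real abstraction would also need to handle exhausted or revoked keys (e.g., skipping a key after repeated 429 responses), which this sketch omits.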
We support the full evaluation suite required for the BIRD-SQL benchmark:
| Metric | Definition | Importance |
|---|---|---|
| EX | Execution Accuracy | Measures if the predicted SQL returns the exact same data as the ground truth. |
| VES | Valid Efficiency Score | Measures the runtime efficiency of the SQL (Speed vs. Ground Truth). |
| Soft F1 | Semantic F1 Score | Measures partial correctness by comparing row-level data matches (Precision/Recall). |
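As a rough sketch of how the first and third metrics can be computed: EX reduces to an order-insensitive set comparison of result rows, and Soft F1 gives partial credit via row-level precision/recall. The functions below are illustrative only; the official BIRD scorer differs in details (e.g., handling of duplicate rows).

```python
def execution_accuracy(pred_rows, gold_rows) -> int:
    """EX sketch: 1 if predicted and gold result sets match exactly, else 0."""
    # Set-based comparison ignores row order, as in sandbox result checking.
    return int(set(map(tuple, pred_rows)) == set(map(tuple, gold_rows)))

def soft_f1(pred_rows, gold_rows) -> float:
    """Soft F1 sketch: harmonic mean of row-level precision and recall."""
    pred = set(map(tuple, pred_rows))
    gold = set(map(tuple, gold_rows))
    if not pred or not gold:
        return float(pred == gold)  # both empty -> perfect; one empty -> 0
    tp = len(pred & gold)           # rows present in both results
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, a prediction that returns one extra spurious row alongside the single correct one scores EX = 0 but Soft F1 = 2/3, which is exactly the partial credit the metric is designed to capture.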
> [!NOTE]
> Recent evaluations of the MasterPipeline on the BIRD Mini-Dev dataset demonstrate highly competitive Execution Accuracy (EX) while significantly reducing API costs compared to monolithic GPT-4/Claude-3 setups.
Populate your `.env` file with multiple keys for high-concurrency evaluation:
```bash
cp .env.example .env
# Fill GEMINI_API_KEY_1, GEMINI_API_KEY_2, GROQ_API_KEY_1, etc.
```

The framework is fully containerized for reproducibility:
```bash
make build
make up
make shell
```

Execute the AgentSQL MasterPipeline on the Mini-Dev dataset:

```bash
make eval-master NUM_SAMPLES=20
```

```
.
├── research/                 # Unified evaluation suite & SOTA comparison
├── llm/src/text2sql_agent/   # Core Framework (LangGraph Nodes, Tools, State)
├── evaluation/               # Legacy baseline evaluation scripts
├── data_minidev/             # BIRD-SQL dataset and SQLite databases
├── Makefile                  # High-level orchestration commands
└── docker-compose.yml        # Isolated execution environment
```
Implemented with ❤️ by the HCMUS Underdogs team. Dedicated to scaling agentic AI workflows with rigor and resilience.
