NL2SQL Data Analyst

A production-grade Natural Language → SQL agent with strict structured output.

📋 Table of Contents

The Problem
The Solution: Data Sandwich Architecture
How It Works
Quick Start
Project Structure
Security
Evaluation
Tech Stack
License

The Problem

Most NL2SQL agents return freeform prose:

"Based on the data, it seems like customers in São Paulo are quite active, and you might want to consider..."

This is unusable for automation. You can't pipe it into a dashboard, trigger a webhook, or validate it programmatically.

The Solution: Data Sandwich Architecture

Every response is forced into a rigid Pydantic schema: no freeform prose, no fluff, no markdown violations.

┌─────────────────────────────────────────┐
│  🪝 THE HOOK                            │  ← Executive headline (10-15 words)
│  "São Paulo drives 42% of all orders"   │
├─────────────────────────────────────────┤
│  📊 THE TRUTH                           │  ← Raw Markdown data table
│  | state | orders | pct |              │
│  | SP    | 41,746 | 42% |              │
│  | RJ    | 12,853 | 13% |              │
├─────────────────────────────────────────┤
│  🎯 THE STRATEGY                        │  ← Exactly 2 actionable takeaways
│  • Expand warehouse capacity in SP      │
│  • Launch targeted ads in RJ            │
└─────────────────────────────────────────┘

Why this matters:

✅ Machine-readable by default
✅ Schema enforcement rejects malformed or off-format output
✅ Audit trail (sql_query_used is always included)
✅ Works with Slack, email, BI dashboards, and downstream agents

How It Works

Dual-Engine Architecture

We use two specialized LLM instances instead of one generalist:

| Engine | Role | Mode | Why |
|--------|------|------|-----|
| Reasoning Engine | Generates SQL from natural language | Tool-calling (`bind_tools`) | Needs to "see" the database schema and emit `run_sql_query` calls |
| Synthesis Engine | Converts SQL + results into structured JSON | JSON mode (`response_format: json_object`) | Must output valid JSON that validates against `AnalystResponse` |

This separation prevents the model from confusing SQL syntax with JSON formatting.

LangGraph Workflow

┌─────────────┐     ┌─────────────────┐     ┌──────────┐     ┌─────────────────┐
│   START     │────▶│ groq_reasoning  │────▶│  tools   │────▶│ groq_synthesis  │
└─────────────┘     │  (SQL generation)│     │(execute) │     │ (JSON output)   │
                    └─────────────────┘     └──────────┘     └─────────────────┘
                           │                                          │
                           └──────────────────────────────────────────┘
                           (bypass tools if no SQL needed)
                                          │
                                          ▼
                                    ┌──────────┐
                                    │    END   │
                                    └──────────┘
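
The branch in the diagram (run the tools node, or skip straight to synthesis) comes down to a small routing decision. The sketch below shows that decision in plain Python. It is not LangGraph API code: the `state` layout, the `tool_calls` key, and the function name `route_after_reasoning` are assumptions for illustration, while the node names are taken from the diagram.

```python
from typing import Literal


def route_after_reasoning(state: dict) -> Literal["tools", "groq_synthesis"]:
    """Pick the next node: execute SQL only if the reasoning step asked for it."""
    last_message = state["messages"][-1]
    # Assumed shape: a message dict with an optional list of requested tool calls.
    tool_calls = last_message.get("tool_calls") or []
    return "tools" if tool_calls else "groq_synthesis"
```

In a real graph a function like this would typically be registered as the conditional edge leaving the reasoning node.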

Security-First Design

# OS-level read-only enforcement, not just an application-side check
import sqlite3

db_uri = f"file:{DB_PATH}?mode=ro"  # DB_PATH: path to the SQLite file
conn = sqlite3.connect(db_uri, uri=True)

AST-level validation via guardrails.py (rejects DROP, INSERT, UPDATE, DELETE before execution)
Read-only SQLite URI mode: the database file is opened read-only, so writes fail even if a query slips past validation
Pandas read_sql_query — results are sanitized into Markdown tables before reaching the LLM
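
To see the read-only layer in action, the following self-contained sketch builds a throwaway database, reopens it with `?mode=ro`, and attempts a write. The table name `orders` and both helper functions are made up for the demonstration and are not part of the project.

```python
import os
import sqlite3
import tempfile


def make_demo_db() -> str:
    """Create a throwaway SQLite file with one table and one row."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE orders (id INTEGER)")
    conn.execute("INSERT INTO orders VALUES (1)")
    conn.commit()
    conn.close()
    return path


def try_write(db_path: str) -> str:
    """Attempt an INSERT through a read-only connection and report the outcome."""
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        conn.execute("INSERT INTO orders VALUES (2)")
        conn.commit()
        return "write succeeded"
    except sqlite3.OperationalError as exc:
        return f"write blocked: {exc}"
    finally:
        conn.close()
```

Calling `try_write(make_demo_db())` should report a blocked write, while `SELECT` statements on the same connection keep working.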

Quick Start

Prerequisites

Python 3.10+
A Groq API key (free tier available)

Installation

git clone https://github.com/Zimal-Fatemah/NL2SQL-data-analyst.git
cd NL2SQL-data-analyst

python -m venv venv
source venv/bin/activate  # Windows: .\venv\Scripts\activate

pip install -r requirements.txt

Configuration

cp .env.example .env
# Edit .env and add your GROQ_API_KEY
groq_api_key=gsk_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Run the Agent

python -m src.agent

Example session:

👤 User: Which 5 cities have the highest number of customers?

🪝 SÃO PAULO LEADS WITH 15,540 CUSTOMERS, FOLLOWED BY RIO DE JANEIRO

| customer_city | customer_count |
|---------------|--------------|
| sao paulo     | 15540        |
| rio de janeiro| 6882         |
| belo horizonte| 2773         |
| brasilia      | 2131         |
| curitiba      | 1521         |

📈 STRATEGIC TAKEAWAYS:
 • Prioritize logistics partnerships in São Paulo and Rio to reduce last-mile delivery costs.
 • Launch localized marketing campaigns in Belo Horizonte and Brasilia to close the gap with top-tier cities.

Run the Evaluation Suite

python -m eval.run_eval

Validates the structural correctness of responses to 20 regression questions covering aggregations, joins, time filtering, and comparative analysis.

Project Structure

NL2SQL-data-analyst/
├── src/
│   ├── agent.py          # LangGraph workflow, Pydantic schemas, CLI
│   ├── tools.py          # DB connection, schema introspection, query execution
│   └── guardrails.py     # AST-based SQL validation (whitelist + DML blocking)
├── eval/
│   ├── qa_set.json       # 20 regression test questions
│   └── run_eval.py       # Automated validation runner
├── db/
│   └── olist.db          # SQLite Olist e-commerce dataset
├── requirements.txt
└── .env.example

Security

| Layer | Implementation |
|-------|----------------|
| Input Validation | `sqlglot` AST parsing: rejects non-`SELECT` statements |
| OS Enforcement | SQLite `?mode=ro` URI flag |
| Output Sanitization | Pandas `to_markdown()` renders results as plain Markdown tables rather than HTML |
| Schema Enforcement | Pydantic `AnalystResponse`: invalid JSON is discarded |

Evaluation

The eval/ suite checks structural integrity (Pydantic validation) across 20 representative queries:

COUNT, SUM, AVG aggregations
GROUP BY + ORDER BY + LIMIT
Date filtering (2017, 2018)
Multi-table implicit joins
Comparative metrics (on time vs late)

Note: The current suite validates that the agent returns well-formed JSON. Semantic correctness ("did the SQL actually answer the question?") requires human review or a gold-standard result set.
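
A structural check of this kind can be sketched with the standard library alone. The code below is an illustration under assumptions, not the logic of `run_eval.py`: the `ask` callable stands in for the agent, and `REQUIRED_KEYS` lists only the one field this README names.

```python
import json
from typing import Callable

# Only sql_query_used is named in this README; a real check would list every field.
REQUIRED_KEYS = {"sql_query_used"}


def is_well_formed(raw: str) -> bool:
    """True if raw is a JSON object containing every required key."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and REQUIRED_KEYS <= payload.keys()


def run_eval(questions: list[str], ask: Callable[[str], str]) -> float:
    """Return the fraction of questions whose answer passes the structural check."""
    if not questions:
        return 0.0
    passed = sum(is_well_formed(ask(question)) for question in questions)
    return passed / len(questions)
```

A pass rate of 1.0 from a runner like this says that every answer had the right shape, not that every SQL query was correct.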

Tech Stack

Orchestration: LangGraph 1.2+
LLM: Groq API (llama-3.3-70b-versatile)
Validation: Pydantic 2.x, sqlglot
Database: SQLite (read-only URI mode)
Data Processing: Pandas 3.x

License

MIT

Built with 🥪 by Zimal Fatemah
