Skip to content

Zimal-Fatemah/NL2SQL-data-analyst

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

26 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

NL2SQL Data Analyst

A production-grade Natural Language β†’ SQL agent with strict structured output.

Python LangGraph Groq License


πŸ“‹ Table of Contents


The Problem

Most NL2SQL agents return freeform prose:

"Based on the data, it seems like customers in SΓ£o Paulo are quite active, and you might want to consider..."

This is unusable for automation. You can't pipe it into a dashboard, trigger a webhook, or validate it programmatically.


The Solution: Data Sandwich Architecture

Every response is forced into a rigid Pydantic schema β€” no hallucinations, no fluff, no markdown violations.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  πŸͺ THE HOOK                            β”‚  ← Executive headline (10-15 words)
β”‚  "SΓ£o Paulo drives 42% of all orders"   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  πŸ“Š THE TRUTH                           β”‚  ← Raw Markdown data table
β”‚  | state | orders | pct |              β”‚
β”‚  | SP    | 41,746 | 42% |              β”‚
β”‚  | RJ    | 12,853 | 13% |              β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  🎯 THE STRATEGY                        β”‚  ← Exactly 2 actionable takeaways
β”‚  β€’ Expand warehouse capacity in SP      β”‚
β”‚  β€’ Launch targeted ads in RJ            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Why this matters:

  • βœ… Machine-readable by default
  • βœ… Prevents LLM hallucination via schema enforcement
  • βœ… Audit trail (sql_query_used is always included)
  • βœ… Works with Slack, email, BI dashboards, and downstream agents
image

How It Works

Dual-Engine Architecture

We use two specialized LLM instances instead of one generalist:

Engine Role Mode Why
Reasoning Engine Generates SQL from natural language Tool-calling (bind_tools) Needs to "see" the database schema and emit run_sql_query calls
Synthesis Engine Converts SQL + results into structured JSON JSON mode (response_format: json_object) Must output valid JSON that validates against AnalystResponse

This separation prevents the model from confusing SQL syntax with JSON formatting.

LangGraph Workflow

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   START     │────▢│ groq_reasoning  │────▢│  tools   │────▢│ groq_synthesis  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚  (SQL generation)β”‚     β”‚(execute) β”‚     β”‚ (JSON output)   β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           β”‚                                          β”‚
                           β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           (bypass tools if no SQL needed)
                                          β”‚
                                          β–Ό
                                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                    β”‚    END   β”‚
                                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Security-First Design

# OS-level read-only enforcement β€” not just a flag
db_uri = f"file:{DB_PATH}?mode=ro"
conn = sqlite3.connect(db_uri, uri=True)
  • AST-level validation via guardrails.py (rejects DROP, INSERT, UPDATE, DELETE before execution)
  • Read-only SQLite URI mode β€” the OS blocks writes even if the LLM tries to bypass validation
  • Pandas read_sql_query β€” results are sanitized into Markdown tables before reaching the LLM

Quick Start

Prerequisites

  • Python 3.10+
  • A Groq API key (free tier available)

Installation

git clone https://github.com/Zimal-Fatemah/NL2SQL-data-analyst.git
cd NL2SQL-data-analyst

python -m venv venv
source venv/bin/activate  # Windows: .\venv\Scripts\activate

pip install -r requirements.txt

Configuration

cp .env.example .env
# Edit .env and add your GROQ_API_KEY
groq_api_key=gsk_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Run the Agent

python -m src.agent

Example session:

πŸ‘€ User: Which 5 cities have the highest number of customers?

πŸͺ SΓƒO PAULO LEADS WITH 15,540 CUSTOMERS, FOLLOWED BY RIO DE JANEIRO

| customer_city | customer_count |
|---------------|--------------|
| sao paulo     | 15540        |
| rio de janeiro| 6882         |
| belo horizonte| 2773         |
| brasilia      | 2131         |
| curitiba      | 1521         |

πŸ“ˆ STRATEGIC TAKEAWAYS:
 β€’ Prioritize logistics partnerships in SΓ£o Paulo and Rio to reduce last-mile delivery costs.
 β€’ Launch localized marketing campaigns in Belo Horizonte and Brasilia to close the gap with top-tier cities.

Run the Evaluation Suite

python -m eval.run_eval

Validates structural correctness against 20 gold-standard questions covering aggregations, joins, time filtering, and comparative analysis. image


Project Structure

NL2SQL-data-analyst/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ agent.py          # LangGraph workflow, Pydantic schemas, CLI
β”‚   β”œβ”€β”€ tools.py          # DB connection, schema introspection, query execution
β”‚   └── guardrails.py     # AST-based SQL validation (whitelist + DML blocking)
β”œβ”€β”€ eval/
β”‚   β”œβ”€β”€ qa_set.json       # 20 regression test questions
β”‚   └── run_eval.py       # Automated validation runner
β”œβ”€β”€ db/
β”‚   └── olist.db          # SQLite Olist e-commerce dataset
β”œβ”€β”€ requirements.txt
└── .env.example

Security

Layer Implementation
Input Validation sqlglot AST parsing β€” rejects non-SELECT statements
OS Enforcement SQLite ?mode=ro URI flag
Output Sanitization Pandas to_markdown() prevents HTML/JS injection
Schema Enforcement Pydantic AnalystResponse β€” invalid JSON is discarded

Evaluation

The eval/ suite checks structural integrity (Pydantic validation) across 20 representative queries:

  • COUNT, SUM, AVG aggregations
  • GROUP BY + ORDER BY + LIMIT
  • Date filtering (2017, 2018)
  • Multi-table implicit joins
  • Comparative metrics (on time vs late)

Note: The current suite validates that the agent returns well-formed JSON. Semantic correctness ("did the SQL actually answer the question?") requires human review or a gold-standard result set.


Tech Stack

  • Orchestration: LangGraph 1.2+
  • LLM: Groq API (llama-3.3-70b-versatile)
  • Validation: Pydantic 2.x, sqlglot
  • Database: SQLite (read-only URI mode)
  • Data Processing: Pandas 3.x

License

MIT


Built with πŸ₯ͺ by Zimal Fatemah

About

A high-performance NL2SQL agent powered by LangChain and LangGraph. Features a structured 'Data Sandwich' output pattern, Pydantic-enforced schemas, and a custom automated regression suite for enterprise-grade reliability.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages