A context graph system that unifies fragmented SAP Order-to-Cash data into an interactive knowledge graph with a natural language query interface powered by Groq's LLaMA 3.3 70B.
Live Demo: https://query-graph-llm-powered-data-intell.vercel.app/
Backend API: https://querygraph-llm-powered-data-intelligence.onrender.com
GitHub: https://github.com/SahilArate/QueryGraph-LLM-Powered-Data-Intelligence
In real SAP environments, business data is fragmented across dozens of tables — orders, deliveries, invoices, payments — with no unified way to trace how they connect. This system solves that by:
- Unifying the SAP O2C dataset into a graph of interconnected entities
- Visualizing that graph interactively — hover for metadata, click to pin, edges show relationships
- Querying the graph with natural language — the LLM generates SQL dynamically, executes it, and returns data-backed answers
- Analyzing customer clusters, broken flows, and product performance automatically
┌─────────────────────────────────────────────────────────────────┐
│ FRONTEND (Next.js 14) │
│ │
│ ┌─────────────────────┐ ┌──────────────────────────────────┐ │
│ │ Graph Panel │ │ Chat + Analytics Panel │ │
│ │ Cytoscape.js │ │ Streaming SSE responses │ │
│ │ Hover tooltips │ │ Graph clustering view │ │
│ │ Node highlighting │ │ Top products analysis │ │
│ └─────────────────────┘ └──────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│ REST + SSE
┌─────────────────────────────────────────────────────────────────┐
│ BACKEND (FastAPI) │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
│ │ /api/query │ │ /api/graph │ │ /api/query/stream │ │
│ │ NL → SQL │ │ nodes │ │ SSE streaming │ │
│ │ Guardrails │ │ clusters │ │ Token by token │ │
│ │ Trace flow │ │ analytics │ │ response delivery │ │
│ └──────────────┘ └──────────────┘ └──────────────────────┘ │
│ │
│ ┌──────────────────────────┐ ┌──────────────────────────────┐ │
│ │ Groq LLM Client │ │ Guardrail Engine │ │
│ │ llama-3.3-70b-versatile │ │ LLM-based domain classifier │ │
│ │ SQL generation │ │ Blocks off-topic queries │ │
│ │ Answer synthesis │ │ Zero hallucination policy │ │
│ └──────────────────────────┘ └──────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│ │
┌───────────┘ └───────────┐
▼ ▼
┌──────────────────┐ ┌─────────────────────┐
│ PostgreSQL │ │ Neo4j AuraDB │
│ (Neon.tech) │ │ (Free Tier) │
│ │ │ │
│ Raw data store │ │ Graph relationships│
│ SQL analytics │ │ Node traversal │
│ Aggregations │ │ Cypher queries │
│ Broken flows │ │ Path finding │
└──────────────────┘ └─────────────────────┘
This is the most important architectural decision in the project. I chose a dual-database architecture — PostgreSQL for analytical queries and Neo4j for graph relationships — because they solve fundamentally different problems.
Used for: Raw data storage, SQL analytics, aggregations, broken flow detection.
PostgreSQL excels at the analytical queries this system needs most:
-- Broken flow detection — sales orders delivered but never billed
SELECT DISTINCT s.sales_order, s.sold_to_party, s.total_net_amount
FROM sales_order_headers s
INNER JOIN outbound_delivery_items odi ON odi.reference_sd_document = s.sales_order
WHERE s.sales_order NOT IN (
SELECT DISTINCT bdi.reference_sd_document
FROM billing_document_items bdi
INNER JOIN outbound_delivery_headers odh
ON odh.delivery_document = bdi.reference_sd_document
)This kind of multi-table aggregation is exactly what relational databases are built for. Trying to do this in a graph database would be more complex with worse performance.
Why Neon over local PostgreSQL: Neon is serverless PostgreSQL — it scales to zero when idle (critical for a demo deployment), has a generous free tier, and supports standard psycopg2 connections without configuration overhead.
Used for: Storing and traversing the graph structure — nodes and their relationships.
Neo4j's Cypher language makes relationship traversal natural:
-- Trace complete O2C flow for a business partner
MATCH (bp:BusinessPartner)-[:PLACED]->(so:SalesOrder)
-[:HAS_DELIVERY]->(del:Delivery)
-[:HAS_BILLING]->(bill:BillingDocument)
-[:HAS_PAYMENT]->(pay:Payment)
WHERE bp.id = '310000108'
RETURN bp, so, del, bill, payThe same query in PostgreSQL would require 4 JOINs across 5 tables. For the graph visualization layer, Neo4j is the right tool.
The Tradeoff: Running two databases adds operational complexity. The decision to accept this complexity is justified because each database is used for what it does best — PostgreSQL for "give me counts and aggregations", Neo4j for "show me how these entities connect".
| Node Type | Source Table | Key Properties |
|---|---|---|
BusinessPartner |
business_partners |
id, name, is_blocked |
SalesOrder |
sales_order_headers |
id, amount, delivery_status, billing_status |
SalesOrderItem |
sales_order_items |
id, amount |
Product |
products |
id, old_id, group, type |
Delivery |
outbound_delivery_headers |
id, goods_status, picking_status |
BillingDocument |
billing_document_headers |
id, amount, is_cancelled, acct_doc |
Payment |
payments_accounts_receivable |
id, amount, customer |
BusinessPartner ──[PLACED]──────────────► SalesOrder
SalesOrder ──[HAS_ITEM]────────────► SalesOrderItem
SalesOrderItem ──[IS_PRODUCT]──────────► Product
SalesOrder ──[HAS_DELIVERY]────────► Delivery
Delivery ──[HAS_BILLING]─────────► BillingDocument
BillingDocument ──[HAS_PAYMENT]─────────► Payment
This models the complete SAP Order-to-Cash flow: a customer places an order → items are picked from products → goods are delivered → an invoice is raised → payment is collected.
The relationship chain BusinessPartner → SalesOrder → Delivery → BillingDocument → Payment directly maps to the O2C business process. A "broken flow" is simply a node in this chain that has no outgoing edge where one is expected — for example, a SalesOrder node with a HAS_DELIVERY edge but no HAS_BILLING edge downstream.
User Question
│
▼
┌─────────────────┐
│ Guardrail Check │ ← LLM classifies: ALLOWED or BLOCKED
└─────────────────┘
│ ALLOWED
▼
┌─────────────────┐
│ Trace Detector │ ← Regex: does question mention "trace/flow/track"?
└─────────────────┘
│ Standard query
▼
┌─────────────────┐
│ SQL Generator │ ← LLM generates SELECT query with schema context
└─────────────────┘
│
▼
┌─────────────────┐
│ Query Executor │ ← psycopg2 executes against PostgreSQL
└─────────────────┘
│
▼
┌─────────────────┐
│ Answer Synthesizer│ ← LLM converts raw results to natural language
└─────────────────┘
│
▼
Streamed response to UI (token by token via SSE)
The SQL generation prompt includes:
- Full schema context — all 10 table definitions with column names
- Explicit relationship mappings —
outbound_delivery_items.reference_sd_document → sales_order_headers.sales_order - Example queries — for broken flows, traces, and aggregations
- Hard rules —
SELECT only,LIMIT 100,no markdown in output
SCHEMA_CONTEXT = """
You are an expert SQL assistant for a SAP Order-to-Cash (O2C) database.
...
RULES:
- Only generate SELECT queries, never INSERT/UPDATE/DELETE/DROP
- Return ONLY the SQL query, no explanation, no markdown, no backticks
"""The key insight here is few-shot prompting via examples — embedding example SQL for common query patterns (especially broken flows) dramatically reduces LLM errors on complex multi-table queries.
- Speed: Groq's LPU architecture delivers ~10x faster inference than OpenAI — critical for streaming UX
- Free tier: No cost for this assignment
- llama-3.3-70b-versatile: Strong SQL generation capability, better than smaller models for complex JOIN logic
Responses stream token-by-token using FastAPI's StreamingResponse with Server-Sent Events:
async def stream_response():
stream = client.chat.completions.create(
model="llama-3.3-70b-versatile",
messages=[...],
stream=True # ← Groq streams tokens
)
for chunk in stream:
token = chunk.choices[0].delta.content
if token:
yield f"data: {json.dumps({'token': token, 'done': False})}\n\n"
yield f"data: {json.dumps({'done': True, 'sql': sql, 'results': results})}\n\n"The frontend reads the SSE stream and appends each token to the message in real time — identical to ChatGPT's typing effect.
Guardrails are an explicit evaluation criterion. Here's how they work:
Every query goes through an LLM-based classifier before any SQL is generated:
GUARDRAIL_PROMPT = """
You are a strict classifier. Determine if the user's question is related to
SAP Order-to-Cash business data (sales orders, deliveries, billing, payments, products, customers).
Reply with only one word: ALLOWED or BLOCKED
Question: {question}
"""This approach is more robust than keyword filtering — it understands intent, not just surface words. "Show me the matrix" → BLOCKED. "Show me the billing matrix for customer X" → ALLOWED.
The SQL generation prompt explicitly restricts output:
SELECTonly — noINSERT,UPDATE,DELETE,DROP- Results capped at 100 rows
- All queries executed in read-only mode
Trace queries ("trace the full flow of billing document X") are routed to a dedicated endpoint that runs hand-crafted SQL — bypassing LLM SQL generation entirely for the most complex query pattern. This eliminates a class of LLM errors on multi-hop JOIN queries.
| Query | Response |
|---|---|
| "Who is the president of India?" | BLOCKED |
| "Write me a poem" | BLOCKED |
| "What is 2+2?" | BLOCKED |
| "Show me all orders" | ALLOWED |
| "Which products have most billings?" | ALLOWED |
Customers are automatically segmented into three tiers based on total order value and order count:
| Tier | Criteria | Color |
|---|---|---|
| HIGH | Revenue > ₹50K or Orders > 20 | Red |
| MEDIUM | Revenue > ₹20K or Orders > 10 | Amber |
| LOW | Below medium thresholds | Green |
Broken flow detection is overlaid — customers with orders but zero billing documents are flagged automatically.
GET /api/graph/clusters— customer segmentation with summary statsGET /api/graph/analytics— top products by billing count, broken flow count, total revenue, cancelled billings
Which products are associated with the highest number of billing documents?
Generated SQL:
SELECT bi.material, p.product_old_id,
COUNT(DISTINCT bi.billing_document) as billing_count
FROM billing_document_items bi
LEFT JOIN products p ON p.product = bi.material
GROUP BY bi.material, p.product_old_id
ORDER BY billing_count DESC LIMIT 10;Trace the full flow of billing document 90504204
Routes to dedicated /api/graph/trace/{id} endpoint — traverses:
BillingDocument → Delivery → SalesOrder → BusinessPartner → JournalEntry → Payment
Show sales orders delivered but not billed
Generated SQL:
SELECT DISTINCT s.sales_order, s.sold_to_party, s.total_net_amount
FROM sales_order_headers s
INNER JOIN outbound_delivery_items odi ON odi.reference_sd_document = s.sales_order
WHERE s.sales_order NOT IN (
SELECT DISTINCT bdi.reference_sd_document
FROM billing_document_items bdi
INNER JOIN outbound_delivery_headers odh
ON odh.delivery_document = bdi.reference_sd_document
WHERE bdi.reference_sd_document IS NOT NULL
)| Layer | Technology | Reason |
|---|---|---|
| Frontend | Next.js 14 + TypeScript | React server components, easy Vercel deployment |
| Graph Viz | Cytoscape.js | Industry-standard for enterprise graph visualization |
| Styling | Tailwind CSS + inline styles | Rapid dark theme UI development |
| Backend | Python FastAPI | Best ecosystem for AI/LLM tooling, async support |
| Primary DB | PostgreSQL (Neon.tech) | Relational analytics, serverless, free tier |
| Graph DB | Neo4j AuraDB | Native graph storage and traversal |
| LLM | Groq (llama-3.3-70b) | Fastest inference, free tier, strong SQL generation |
| Streaming | FastAPI SSE | Native streaming without WebSocket overhead |
| Deployment | Vercel + Render | Zero-config deployment, free tiers |
QueryGraph-LLM-Powered-Data-Intelligence/
├── frontend/ # Next.js 14 application
│ ├── app/
│ │ └── page.tsx # Main 2-panel layout
│ └── components/
│ ├── GraphPanel.tsx # Cytoscape graph + hover tooltips
│ ├── ChatPanel.tsx # Streaming chat interface
│ └── AnalyticsPanel.tsx # Clustering + analytics view
│
├── backend/ # FastAPI application
│ ├── main.py # App entry point + CORS
│ ├── routes/
│ │ ├── query.py # NL→SQL pipeline + streaming
│ │ └── graph.py # Graph data, trace, clusters, analytics
│ ├── db/
│ │ ├── postgres.py # PostgreSQL connection + query executor
│ │ ├── neo4j_client.py # Neo4j driver wrapper
│ │ ├── schema.sql # PostgreSQL table definitions
│ │ ├── ingest.py # JSONL → PostgreSQL ingestion
│ │ └── neo4j_ingest.py # PostgreSQL → Neo4j graph population
│ └── llm/
│ └── groq_client.py # Groq client, prompts, guardrails
│
├── data/ # SAP O2C dataset (not committed)
│ └── sap-o2c-data/
│
└── .env.example # Environment variable template
- Node.js 18+
- Python 3.11+
- PostgreSQL connection string (Neon.tech free tier)
- Neo4j AuraDB connection (free tier)
- Groq API key (free at console.groq.com)
cd backend
python -m venv venv
venv\Scripts\activate # Windows
source venv/bin/activate # Mac/Linux
pip install -r requirements.txt
# Copy and fill in your credentials
cp .env.example .env
# Ingest data into PostgreSQL
python db/ingest.py
# Populate Neo4j graph
python db/neo4j_ingest.py
# Start the API server
uvicorn main:app --reload --port 8000cd frontend
npm install
# Create environment file
echo "NEXT_PUBLIC_API_URL=http://localhost:8000" > .env.local
npm run dev# PostgreSQL (Neon.tech)
DATABASE_URL=postgresql://username:password@host/dbname?sslmode=require
# Neo4j AuraDB
NEO4J_URI=neo4j+s://xxxxxxxx.databases.neo4j.io
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=your-password
# Groq
GROQ_API_KEY=gsk_your_groq_api_key_hereFrontend → Vercel
- Root directory:
frontend - Environment variable:
NEXT_PUBLIC_API_URL=https://your-render-url.onrender.com
Backend → Render
- Root directory:
backend - Build command:
pip install -r requirements.txt - Start command:
uvicorn main:app --host 0.0.0.0 --port $PORT - Add all 5 environment variables in Render dashboard
This project was built using Claude (Anthropic) as the primary AI coding assistant. The complete session transcript covering architecture decisions, debugging workflow, and iteration patterns is included in the repository as AI_SESSION_LOGS.md.
Key areas where AI assistance was used:
- Schema design and relationship modelling decisions
- LLM prompt engineering for SQL generation and guardrails
- Debugging PostgreSQL connection issues and indentation errors
- Frontend component architecture and streaming SSE implementation
- Neo4j graph ingestion strategy
| Criteria | Implementation |
|---|---|
| Code quality & architecture | Modular FastAPI routes, typed TypeScript components, clear separation of concerns |
| Graph modelling | 7 node types, 6 relationship types, dual-DB architecture matching query patterns to DB strengths |
| Database / storage choice | PostgreSQL for analytics + Neo4j for graph traversal — documented tradeoffs above |
| LLM integration & prompting | Schema-aware system prompt, few-shot examples, streaming SSE, conversation history |
| Guardrails | LLM domain classifier + SQL injection prevention + dedicated trace endpoint |
- ✅ Natural language to SQL translation — shown in query pipeline
- ✅ Node highlighting — chat responses highlight referenced nodes on graph
- ✅ Streaming responses — token-by-token SSE streaming via Groq
- ✅ Conversation memory — full history passed on each request
- ✅ Graph clustering — customer segmentation by value tier with broken flow detection
- ✅ Graph analytics — top products, revenue stats, cancelled billing count
Built for Dodge AI Forward Deployed Engineer Assignment — March 2026