Lang2SQL

Together we share code and ideas on an open-source journey toward a better data environment. 🌍💡

A document-learning, read-only SQL analytics agent. Feed it your company's docs → it learns your business context → it keeps a separate set of definitions per team → it answers questions over an incomplete database → it remembers every definition and conversation.

👉 Project overview (single source of truth): docs/PROJECT.md · Contributor quick guide: docs/ARCHITECTURE.md

This is the v4.1 rebuild (background and design intent: docs/discord_first_redesign_v4_1.md). Where most text-to-SQL projects compete on "generate better SQL," Lang2SQL competes on everything around the query: business-context learning, per-team semantics, robustness to messy databases, and memory. Discord is the Phase 1 interface, not the identity; Slack/Web are adapters on the same core.

The four pillars

Pillar	What it is
① Business-context learning	Documents are the source of truth. Drop in a doc → the agent extracts metric/dimension/rule candidates → you confirm → they land in the semantic layer.
② Two-axis robustness	(2a) DB robustness — works even when columns lack descriptions (auto-enrichment, v1.5). (2b) Semantic robustness — teams hold different definitions of the same term without conflict. This axis is the product/research identity.
③ Hermes memory	Conversations, facts, and preferences persist instead of resetting each session.
④ Multi-interface	Phase 1 Discord today; Slack/Web are future adapters. No platform lock-in.

Extensibility: outlets and appliances

V1 ships the simplest single implementation of each extension point, but the abstraction (port) is already in place, so v1.5/v2 add a new implementation without touching existing code. Like a wall outlet: the V1 socket has one LED bulb plugged in, but because the socket is standard, you later plug in a fan or a smart light without rewiring the wall.

Four ★ extension patterns sit behind core/ports/:

★	Pattern	Port	Grows by
①	Safety pipeline	`ports/safety.py`	adding one layer class to the line (zero `run_sql` changes)
②	Memory service	`ports/memory.py`	swapping any of 3 axes — Store / Recall / Extractor — independently
③	Ingestion pipeline	`ports/ingestion.py`	a Source × Extractor matrix
④	Semantic federation	`ports/semantic_scope.py`	git-like per-team scope branches

Everything outside tenancy/concierge.py depends only on these Protocols, so the concrete classes (OpenAI, Postgres, SQLite) are swappable at the seams.
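To make the "swappable at the seams" idea concrete, here is a minimal sketch of the memory service pattern, covering two of its three axes (Store and Recall; the Extractor axis is omitted for brevity). Every class and method name below is hypothetical and chosen for illustration; the actual Protocols in `ports/memory.py` may look different.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol


class Store(Protocol):
    """Hypothetical port: where facts live."""

    def add(self, fact: str) -> None: ...
    def all(self) -> list[str]: ...


class Recall(Protocol):
    """Hypothetical port: which facts are surfaced for a question."""

    def select(self, store: Store, question: str) -> list[str]: ...


@dataclass
class InMemoryStore:
    facts: list[str] = field(default_factory=list)

    def add(self, fact: str) -> None:
        self.facts.append(fact)

    def all(self) -> list[str]:
        return list(self.facts)


class InjectAllRecall:
    """Simplest strategy: hand every stored fact to the agent."""

    def select(self, store: Store, question: str) -> list[str]:
        return store.all()


class KeywordRecall:
    """A later strategy: keep only facts sharing a word with the question."""

    def select(self, store: Store, question: str) -> list[str]:
        words = set(question.lower().split())
        return [f for f in store.all() if words & set(f.lower().split())]


@dataclass
class MemoryService:
    store: Store    # depends on the Protocols only,
    recall: Recall  # never on a concrete class

    def remember(self, fact: str) -> None:
        self.store.add(fact)

    def context_for(self, question: str) -> list[str]:
        return self.recall.select(self.store, question)
```

Because `MemoryService` only sees the two Protocols, replacing `InjectAllRecall()` with `KeywordRecall()` is a one-argument change at construction time, and the store is untouched.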

Quickstart

Requires Python ≥ 3.10 and uv.

uv sync                       # create .venv and install deps

1. Run the offline demo (no token, no database)

.venv/bin/python bench/ecommerce_demo.py

Shows the federation money-shot (one term, two team definitions, no conflict) and the safety gate (DROP/INSERT blocked, SELECT passes). See bench/README.md.
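For readers who want a feel for the safety gate before running the demo, the following is a rough sketch of a layered gate. It is an assumption-laden illustration, not the code in `ports/safety.py`: the names `SafetyLayer`, `StatementWhitelist`, `SingleStatement`, and `run_gate` are hypothetical, and the keyword check is deliberately naive (no SQL parsing).

```python
from __future__ import annotations

from typing import Protocol


class SafetyLayer(Protocol):
    """Hypothetical layer interface for this sketch."""

    def check(self, sql: str) -> str | None:
        """Return a rejection reason, or None to let the query pass."""
        ...


class StatementWhitelist:
    """Allow only statements whose first keyword is whitelisted."""

    ALLOWED = {"SELECT"}

    def check(self, sql: str) -> str | None:
        stripped = sql.strip()
        first = stripped.split(None, 1)[0].upper() if stripped else ""
        if first not in self.ALLOWED:
            return f"statement type {first or '<empty>'} is not allowed"
        return None


class SingleStatement:
    """An extra layer added to the line; naive, ignores ';' inside strings."""

    def check(self, sql: str) -> str | None:
        body = sql.strip().rstrip(";")
        return "multiple statements are not allowed" if ";" in body else None


def run_gate(sql: str, layers: list[SafetyLayer]) -> tuple[bool, str]:
    """Run the layers in order; the first rejection wins."""
    for layer in layers:
        reason = layer.check(sql)
        if reason is not None:
            return False, reason
    return True, "ok"


layers: list[SafetyLayer] = [StatementWhitelist(), SingleStatement()]
for query in ("SELECT * FROM orders", "DROP TABLE orders", "INSERT INTO orders VALUES (1)"):
    print(query, "->", run_gate(query, layers))
```

Adding `SingleStatement()` to the list is the only change needed to introduce a new check, which mirrors the "one layer class added to the line" idea.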

2. Run the CLI (developer driver)

.venv/bin/lang2sql "list the tables"

The CLI assembles a real HarnessContext and runs one turn through the agent loop. With OPENAI_API_KEY set it calls gpt-4.1-mini; otherwise it uses the offline FakeLLM.

3. Run the Discord bot

export DISCORD_BOT_TOKEN=...        # required
export OPENAI_API_KEY=...           # optional; offline FakeLLM if unset
export LANG2SQL_SECRET_KEY=...      # optional; Fernet key for secret encryption
.venv/bin/lang2sql-bot

The bot exits loudly if DISCORD_BOT_TOKEN is unset. Full setup and hosting: docs/DEPLOY.md. Copy .env.example to start.
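The environment contract above (one required variable, two optional ones) can be sketched as a small startup check. This is an illustrative assumption about how such a check might look, not the bot's actual startup code; `BotSettings` and `load_settings` are hypothetical names.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class BotSettings:
    discord_token: str
    openai_api_key: str | None
    secret_key: str | None

    @property
    def llm_backend(self) -> str:
        # Without an API key the offline FakeLLM is used.
        return "openai" if self.openai_api_key else "fake"


def load_settings(env: Mapping[str, str] | None = None) -> BotSettings:
    """Read settings from the environment, failing fast on a missing token."""
    env = os.environ if env is None else env
    token = env.get("DISCORD_BOT_TOKEN", "").strip()
    if not token:
        raise SystemExit("DISCORD_BOT_TOKEN is not set; refusing to start.")
    return BotSettings(
        discord_token=token,
        openai_api_key=env.get("OPENAI_API_KEY", "").strip() or None,
        secret_key=env.get("LANG2SQL_SECRET_KEY", "").strip() or None,
    )
```

Passing a plain dict instead of `os.environ` keeps the check easy to exercise in tests.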

What V1 does / does NOT do yet (honesty section)

Does:

3-scope semantic federation (guild / channel / thread) with most-specific-wins resolution; define_metric writes to the current scope.
Safety pipeline with the V1 layers (whitelist + timeout), gating every query.
Agent loop with six tools: run_sql, explore_schema, define_metric, ingest_doc, remember, ask_user.
Memory service (in-memory store + inject-all recall + manual /remember).
Discord frontend (bot, commands, session router, render).
Encrypted-at-rest secrets (Fernet) and SQLite-backed persistence.
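The most-specific-wins rule for the three scopes can be sketched as follows. This is a minimal model under stated assumptions: `ScopedMetrics`, its method signatures, and the example metric text are hypothetical, and the real resolution logic behind `ports/semantic_scope.py` may differ.

```python
from __future__ import annotations

# Most specific scope first; resolution walks this order.
SCOPE_ORDER = ("thread", "channel", "guild")


class ScopedMetrics:
    """Hypothetical store of metric definitions keyed by (scope, scope id)."""

    def __init__(self) -> None:
        self._defs: dict[tuple[str, str], dict[str, str]] = {}

    def define_metric(self, scope: str, scope_id: str, name: str, definition: str) -> None:
        if scope not in SCOPE_ORDER:
            raise ValueError(f"unknown scope: {scope}")
        self._defs.setdefault((scope, scope_id), {})[name] = definition

    def resolve(
        self,
        name: str,
        *,
        guild: str,
        channel: str | None = None,
        thread: str | None = None,
    ) -> tuple[str, str] | None:
        """Return (scope, definition) from the most specific scope that defines name."""
        ids = {"thread": thread, "channel": channel, "guild": guild}
        for scope in SCOPE_ORDER:
            scope_id = ids[scope]
            if scope_id is None:
                continue
            definition = self._defs.get((scope, scope_id), {}).get(name)
            if definition is not None:
                return scope, definition
        return None


metrics = ScopedMetrics()
metrics.define_metric("guild", "g1", "active_user", "logged in within 30 days")
metrics.define_metric("channel", "c-marketing", "active_user", "logged in within 7 days")

# The marketing channel sees its own definition; other channels fall back to the guild.
print(metrics.resolve("active_user", guild="g1", channel="c-marketing"))
print(metrics.resolve("active_user", guild="g1", channel="c-finance"))
```

Both definitions coexist in the store; which one applies depends only on where the question is asked.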

Does NOT yet:

Execute against a real database. PostgresExplorer is a V1 stub with canned orders/users schema and sample rows; real psycopg execution is v1.5.
Reason without a key. Without OPENAI_API_KEY, the FakeLLM returns deterministic canned tool cycles — useful for wiring tests, not for answers.
DB metadata auto-enrichment, AST-precise SQL validation, function blocklists, cost gating, /semantic diff / /semantic promote, keyword/vector recall, automatic fact extraction, URL/Notion ingestion — all scoped to v1.5+.
Persist across restarts by default: the V1 SqliteStore defaults to in-memory; point it at a file for durability.
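The last point comes down to standard SQLite behaviour, which the sketch below demonstrates with Python's built-in `sqlite3` module. It does not use Lang2SQL's `SqliteStore` (how that class accepts a path is not shown here); `save_and_reopen` and the `facts` table are hypothetical names for illustration.

```python
import os
import sqlite3
import tempfile


def save_and_reopen(path: str) -> int:
    """Write one row, close the connection, reopen, and count surviving rows."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS facts (body TEXT)")
    conn.execute("INSERT INTO facts VALUES ('revenue excludes refunds')")
    conn.commit()
    conn.close()

    # A second connection stands in for a process restart.
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS facts (body TEXT)")
    count = conn.execute("SELECT COUNT(*) FROM facts").fetchone()[0]
    conn.close()
    return count


# ":memory:" databases disappear when their connection closes.
print("in-memory rows after reopen:", save_and_reopen(":memory:"))

# A file-backed database keeps its rows.
with tempfile.TemporaryDirectory() as tmp:
    print("file rows after reopen:", save_and_reopen(os.path.join(tmp, "lang2sql.db")))
```

The in-memory run reports zero rows after reopening, while the file-backed run keeps the row it wrote.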

Roadmap at a glance

Area	V1	V1.5	V2	V2.5
Safety	whitelist + timeout	+ AST validation, function blocklist, auto LIMIT, metadata enrichment, rate limit	+ cost gate (EXPLAIN), per-engine pipelines	—
Memory	in-memory + inject-all + manual	SQLite store + keyword recall + auto-extract	+ vector recall + conflict resolution	PostgreSQL + hybrid recall + confidence
Ingestion	file upload + LLM extract	+ URL fetch + DDL parsing	+ Notion/Confluence + hybrid	+ GitHub/Drive + chunked RAG
Federation	3-scope resolution, `/semantic show`	`/semantic diff`, `/semantic promote`, conflict alerts	git sync (semantic-as-code)	branch fork/merge UI, per-scope audit
Interface	Discord	(Anthropic/NIM eval)	Slack	Web

See docs/discord_first_redesign_v4_1.md for the full architecture write-up.

🤝 Contributing

If you are new here, start with docs/ARCHITECTURE.md: it covers directory and layer responsibilities, the lifecycle of a single message, and where to make changes, all in one place.

git clone https://github.com/CausalInferenceLab/lang2sql.git
cd lang2sql
uv sync
.venv/bin/pytest -q          # 12 safety regressions + full suite must pass

Write tests for new features (tests/test_<layer>.py).
Open PRs against the master branch, and prefix commit messages with feat: / fix: / docs:.
File bugs and feature requests as issues.

🙏 Acknowledgements / License

Lang2SQL is developed by the Causal Inference team at Pseudo Lab (가짜연구소). Licensed under the MIT License. Community: Discord.

🏆 Our Team

Role	Name	Interests
Project Manager	이동욱	LLM, Open Source, Causal Inference
AI Engineer	문찬국	LLM, Agentic RAG, Open Source
Data Engineer	박경태	LLM-driven Data Engineering
AI Engineer	손봉균	LLM, RAG, AI Planning
Data Scientist	안재일	LLM, Data Analysis, RAG
ML Engineer	이호민	Multi-Agent Systems
AI Engineer	최세영	LLM, RAG, Multi-Agent
Full-Stack Developer	황윤진	LLM Orchestration
AI Engineer	김경서	LLM, FinNLP, FDS, RAG
Data Engineer	홍지영	LLM, Data Engineering
Data Operator	이화림	LLM, Data Engineering
AI Engineer	남경혜	LLM, RAG, Multi-Agent
AI Engineer	심세원	LLM, RAG, Multi-Agent
Business Analyst	서희진	LLM, Data Analysis

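🌍 About Pseudo Lab (가짜연구소)

Pseudo Lab is a non-profit organization focused on advancing machine learning and AI. Guided by its core values of sharing, motivation, and the joy of collaboration, it builds impactful open-source projects.

Together with more than 5,000 researchers worldwide, we are committed to democratizing AI knowledge and fostering innovation through open collaboration.

Community: 💬 Discord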
