
Lang2SQL


Together we share code and ideas on an open-source journey toward a better data environment. 🌍💡


A document-learning, read-only SQL analytics agent. Feed it your company's docs → it learns your business context → it keeps a separate set of definitions per team → it answers questions over an incomplete database → it remembers every definition and conversation.

👉 Full project picture (single source of truth): docs/PROJECT.md · Contributor quick guide: docs/ARCHITECTURE.md

This is the v4.1 rebuild (background and design intent: docs/discord_first_redesign_v4_1.md). Where most text-to-SQL projects compete on "generate better SQL," Lang2SQL competes on everything around the query: business-context learning, per-team semantics, robustness to messy databases, and memory. Discord is the Phase 1 interface, not the identity; Slack and Web are adapters on the same core.


The four pillars

| Pillar | What it is |
| --- | --- |
| ① Business-context learning | Documents are the source of truth. Drop in a doc → the agent extracts metric/dimension/rule candidates → you confirm → they land in the semantic layer. |
| ② Two-axis robustness | (2a) DB robustness: works even when columns lack descriptions (auto-enrichment, v1.5). (2b) Semantic robustness: teams hold different definitions of the same term without conflict. This axis is the product/research identity. |
| ③ Hermes memory | Conversations, facts, and preferences persist instead of resetting each session. |
| ④ Multi-interface | Phase 1 is Discord today; Slack/Web are future adapters. No platform lock-in. |

Extensibility: outlets and appliances

V1 ships the simplest single implementation of each extension point, but the abstraction (port) is already in place, so v1.5/v2 add a new implementation without touching existing code. Like a wall outlet: the V1 socket has one LED bulb plugged in, but because the socket is standard, you later plug in a fan or a smart light without rewiring the wall.

Four ★ extension patterns sit behind core/ports/:

| Pattern | Port | Grows by |
| --- | --- | --- |
| Safety pipeline | ports/safety.py | adding one layer class to the line (zero run_sql changes) |
| Memory service | ports/memory.py | swapping any of 3 axes (Store / Recall / Extractor) independently |
| Ingestion pipeline | ports/ingestion.py | a Source × Extractor matrix |
| Semantic federation | ports/semantic_scope.py | git-like per-team scope branches |

Everything outside tenancy/concierge.py depends only on these Protocols, so the concrete classes (OpenAI, Postgres, SQLite) are swappable at the seams.
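
To make the "swappable at the seams" idea concrete, here is a minimal sketch of the memory service's three axes expressed as `typing.Protocol` classes. Everything in it is an assumption for illustration: the method names (`add`, `all`, `select`, `extract`), the `MemoryService` wiring, and the `KeywordRecall` class are hypothetical and are not taken from ports/memory.py.

```python
from dataclasses import dataclass, field
from typing import List, Protocol


class Store(Protocol):
    def add(self, fact: str) -> None: ...
    def all(self) -> List[str]: ...


class Recall(Protocol):
    def select(self, store: Store, query: str) -> List[str]: ...


class Extractor(Protocol):
    def extract(self, message: str) -> List[str]: ...


@dataclass
class InMemoryStore:
    """Hypothetical V1-style store: a plain list held in process memory."""
    facts: List[str] = field(default_factory=list)

    def add(self, fact: str) -> None:
        self.facts.append(fact)

    def all(self) -> List[str]:
        return list(self.facts)


class InjectAllRecall:
    """Hypothetical V1-style recall: hand every stored fact to the prompt."""
    def select(self, store: Store, query: str) -> List[str]:
        return store.all()


class KeywordRecall:
    """Hypothetical later recall: keep facts sharing a word with the query."""
    def select(self, store: Store, query: str) -> List[str]:
        words = set(query.lower().split())
        return [f for f in store.all() if words & set(f.lower().split())]


class ManualExtractor:
    """Hypothetical manual extractor: only explicit /remember messages count."""
    PREFIX = "/remember "

    def extract(self, message: str) -> List[str]:
        if message.startswith(self.PREFIX):
            return [message[len(self.PREFIX):].strip()]
        return []


@dataclass
class MemoryService:
    """Depends only on the three Protocols, never on a concrete class."""
    store: Store
    recall: Recall
    extractor: Extractor

    def observe(self, message: str) -> None:
        for fact in self.extractor.extract(message):
            self.store.add(fact)

    def context_for(self, query: str) -> List[str]:
        return self.recall.select(self.store, query)
```

In this sketch, moving from inject-all to keyword recall means passing `KeywordRecall()` instead of `InjectAllRecall()`; the store and extractor are untouched, which is the independence the table above describes.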


Quickstart

Requires Python ≥ 3.10 and uv.

uv sync                       # create .venv and install deps

1. Run the offline demo (no token, no database)

.venv/bin/python bench/ecommerce_demo.py

Shows the federation money-shot (one term, two team definitions, no conflict) and the safety gate (DROP/INSERT blocked, SELECT passes). See bench/README.md.
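
The safety gate in the demo can be pictured roughly as follows. This is a sketch under stated assumptions, not the code in bench/ or ports/safety.py: the `SafetyLayer` protocol, the `check` method, the `SafetyPipeline` class, and this `run_sql` signature are all hypothetical, and the keyword check is deliberately naive (AST-precise validation is listed as v1.5 work).

```python
from typing import List, Optional, Protocol


class SafetyLayer(Protocol):
    def check(self, sql: str) -> Optional[str]:
        """Return a rejection reason, or None to let the query through."""
        ...


class WhitelistLayer:
    """Hypothetical layer: only statements starting with SELECT pass."""
    ALLOWED = {"SELECT"}

    def check(self, sql: str) -> Optional[str]:
        statement = sql.strip().rstrip(";")
        # Naive guard: this also rejects a ';' inside a string literal.
        if ";" in statement:
            return "multiple statements are not allowed"
        first = statement.split(None, 1)[0].upper() if statement else ""
        if first not in self.ALLOWED:
            return f"statement type {first or '<empty>'!r} is not whitelisted"
        return None


class MaxLengthLayer:
    """Hypothetical extra layer, added without touching run_sql."""
    def __init__(self, limit: int = 10_000) -> None:
        self.limit = limit

    def check(self, sql: str) -> Optional[str]:
        if len(sql) > self.limit:
            return f"query longer than {self.limit} characters"
        return None


class SafetyPipeline:
    """Runs layers in order and stops at the first rejection."""
    def __init__(self, layers: List[SafetyLayer]) -> None:
        self.layers = layers

    def check(self, sql: str) -> Optional[str]:
        for layer in self.layers:
            reason = layer.check(sql)
            if reason is not None:
                return reason
        return None


def run_sql(sql: str, pipeline: SafetyPipeline) -> str:
    """Gate every query; real execution would replace the 'OK' return."""
    reason = pipeline.check(sql)
    if reason is not None:
        return f"BLOCKED: {reason}"
    return "OK"
```

Adding `MaxLengthLayer()` to the list passed to `SafetyPipeline` changes what gets blocked without any edit to `run_sql`, which is the growth pattern the README claims for the safety port.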

2. Run the CLI (developer driver)

.venv/bin/lang2sql "list the tables"

The CLI assembles a real HarnessContext and runs one turn through the agent loop. With OPENAI_API_KEY set it calls gpt-4.1-mini; otherwise it uses the offline FakeLLM.
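
The key-or-fallback behaviour could look something like the sketch below. It is illustrative only: `choose_llm`, the `LLM` protocol, and `OpenAIBackedLLM` are hypothetical names, the `FakeLLM` here is a stand-in rather than the project's class, and no network call is made.

```python
import os
from typing import Mapping, Optional, Protocol


class LLM(Protocol):
    def complete(self, prompt: str) -> str: ...


class FakeLLM:
    """Stand-in offline model: always returns the same canned tool call."""
    def complete(self, prompt: str) -> str:
        return "explore_schema()"


class OpenAIBackedLLM:
    """Placeholder for a real client; it only records its settings here."""
    def __init__(self, api_key: str, model: str = "gpt-4.1-mini") -> None:
        self.api_key = api_key
        self.model = model

    def complete(self, prompt: str) -> str:
        raise NotImplementedError("network call intentionally omitted")


def choose_llm(env: Optional[Mapping[str, str]] = None) -> LLM:
    """Pick the hosted model when a key is present, else the offline fake."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    return OpenAIBackedLLM(key) if key else FakeLLM()
```

Passing the environment as a mapping keeps the selection testable without mutating `os.environ`.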

3. Run the Discord bot

export DISCORD_BOT_TOKEN=...        # required
export OPENAI_API_KEY=...           # optional; offline FakeLLM if unset
export LANG2SQL_SECRET_KEY=...      # optional; Fernet key for secret encryption
.venv/bin/lang2sql-bot

The bot exits loudly if DISCORD_BOT_TOKEN is unset. Full setup and hosting: docs/DEPLOY.md. Copy .env.example to start.
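
As a rough illustration of the environment handling above, the sketch below fails fast on a missing token and shows one way to produce a value for LANG2SQL_SECRET_KEY. `BotConfig`, `load_config`, and `new_fernet_key` are hypothetical names, not the bot's actual code. The key helper relies on the fact that a Fernet key is 32 random bytes in URL-safe base64, so it needs only the standard library.

```python
import base64
import os
import sys
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BotConfig:
    discord_token: str
    openai_api_key: Optional[str]
    secret_key: Optional[str]


def load_config(env: Optional[Mapping[str, str]] = None) -> BotConfig:
    """Read settings; exit with a clear message if the token is missing."""
    env = os.environ if env is None else env
    token = env.get("DISCORD_BOT_TOKEN", "").strip()
    if not token:
        sys.exit("DISCORD_BOT_TOKEN is not set; refusing to start.")
    return BotConfig(
        discord_token=token,
        openai_api_key=env.get("OPENAI_API_KEY") or None,
        secret_key=env.get("LANG2SQL_SECRET_KEY") or None,
    )


def new_fernet_key() -> str:
    """Generate a Fernet-compatible key: 32 random bytes, URL-safe base64."""
    return base64.urlsafe_b64encode(os.urandom(32)).decode("ascii")


if __name__ == "__main__":
    # Print a fresh key to paste into .env as LANG2SQL_SECRET_KEY.
    print(new_fernet_key())
```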


What V1 does / does NOT do yet (honesty section)

Does:

  • 3-scope semantic federation (guild / channel / thread) with most-specific-wins resolution; define_metric writes to the current scope.
  • Safety pipeline with the V1 layers (whitelist + timeout), gating every query.
  • Agent loop with six tools: run_sql, explore_schema, define_metric, ingest_doc, remember, ask_user.
  • Memory service (in-memory store + inject-all recall + manual /remember).
  • Discord frontend (bot, commands, session router, render).
  • Encrypted-at-rest secrets (Fernet) and SQLite-backed persistence.
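
Most-specific-wins resolution across the three scopes can be sketched as a lookup that walks thread, then channel, then guild. The `SemanticLayer` class, its method signatures, and the example definitions are assumptions made for illustration; only the scope names and the resolution rule come from the list above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Most specific scope first; resolution stops at the first hit.
SCOPE_ORDER = ("thread", "channel", "guild")


@dataclass
class SemanticLayer:
    # Maps (scope kind, scope id) -> {term: definition}.
    definitions: Dict[Tuple[str, str], Dict[str, str]] = field(default_factory=dict)

    def define_metric(self, kind: str, scope_id: str, term: str, definition: str) -> None:
        """Write a definition into exactly one scope."""
        self.definitions.setdefault((kind, scope_id), {})[term] = definition

    def resolve(
        self,
        term: str,
        *,
        guild: str,
        channel: str,
        thread: Optional[str] = None,
    ) -> Optional[Tuple[str, str]]:
        """Return (winning scope kind, definition), or None if undefined."""
        ids = {"thread": thread, "channel": channel, "guild": guild}
        for kind in SCOPE_ORDER:
            scope_id = ids[kind]
            if scope_id is None:
                continue
            found = self.definitions.get((kind, scope_id), {}).get(term)
            if found is not None:
                return kind, found
        return None
```

With a guild-wide definition of a term and a different one in a single channel, a question asked in that channel resolves to the channel definition while every other channel still sees the guild one, so the two definitions coexist without conflict.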

Does NOT yet:

  • Execute against a real database. PostgresExplorer is a V1 stub with a canned orders/users schema and sample rows; real psycopg execution is v1.5.
  • Reason without a key. Without OPENAI_API_KEY, the FakeLLM returns deterministic canned tool cycles — useful for wiring tests, not for answers.
  • DB metadata auto-enrichment, AST-precise SQL validation, function blocklists, cost gating, /semantic diff / /semantic promote, keyword/vector recall, automatic fact extraction, URL/Notion ingestion — all scoped to v1.5+.
  • Persist across restarts by default: the V1 SqliteStore defaults to in-memory; point it at a file for durability.
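
The in-memory versus file distinction is the standard `sqlite3` one, shown below with the standard library only. The `facts` table and the helper functions are hypothetical and do not reflect SqliteStore's real schema or constructor; the point is only that `":memory:"` data disappears when the connection closes while a file path does not.

```python
import os
import sqlite3
import tempfile
from typing import List


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (and initialise if needed) a tiny fact store at the given path."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS facts (body TEXT NOT NULL)")
    return conn


def remember(conn: sqlite3.Connection, body: str) -> None:
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO facts (body) VALUES (?)", (body,))


def recall_all(conn: sqlite3.Connection) -> List[str]:
    rows = conn.execute("SELECT body FROM facts ORDER BY rowid")
    return [row[0] for row in rows]


def survives_restart(path: str) -> bool:
    """Write a fact, simulate a restart by reconnecting, and look for it."""
    conn = open_store(path)
    remember(conn, "revenue excludes refunds")
    conn.close()
    conn = open_store(path)
    try:
        return recall_all(conn) == ["revenue excludes refunds"]
    finally:
        conn.close()


if __name__ == "__main__":
    print("in-memory:", survives_restart(":memory:"))
    with tempfile.TemporaryDirectory() as tmp:
        print("file:", survives_restart(os.path.join(tmp, "lang2sql.db")))
```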

Roadmap at a glance

Area V1 V1.5 V2 V2.5
Safety whitelist + timeout + AST validation, function blocklist, auto LIMIT, metadata enrichment, rate limit + cost gate (EXPLAIN), per-engine pipelines
Memory in-memory + inject-all + manual SQLite store + keyword recall + auto-extract + vector recall + conflict resolution PostgreSQL + hybrid recall + confidence
Ingestion file upload + LLM extract + URL fetch + DDL parsing + Notion/Confluence + hybrid + GitHub/Drive + chunked RAG
Federation 3-scope resolution, /semantic show /semantic diff, /semantic promote, conflict alerts git sync (semantic-as-code) branch fork/merge UI, per-scope audit
Interface Discord (Anthropic/NIM eval) Slack Web

See docs/discord_first_redesign_v4_1.md for the full architecture write-up.


🤝 Contributing

New here? docs/ARCHITECTURE.md collects directory and layer responsibilities, the lifecycle of a single message, and where best to make changes, all in one place.

git clone https://github.com/CausalInferenceLab/lang2sql.git
cd lang2sql
uv sync
.venv/bin/pytest -q          # 12 safety regressions + full suite must pass
  • Write tests for new features (tests/test_<layer>.py)
  • PRs target the master branch; use a feat: / fix: / docs: prefix in commit messages
  • File bugs and feature requests as issues

🙏 Acknowledgements / License

Lang2SQL is under development by the Causal Inference team at PseudoLab. Licensed under the MIT License. Community: Discord.


🏆 Our Team

Role Name Skills Interests
Project Manager 이동욱 Python LLM, Open Source, Causal Inference
AI Engineer 문찬국 Python LLM, Agentic RAG, Open Source
Data Engineer 박경태 Python LLM-driven Data Engineering
AI Engineer 손봉균 Python LLM, RAG, AI Planning
Data Scientist 안재일 Python LLM, Data Analysis, RAG
ML Engineer 이호민 Python Multi-Agent Systems
AI Engineer 최세영 Python LLM, RAG, Multi-Agent
Full-Stack Developer 황윤진 NextJs React LLM Orchestration
AI Engineer 김경서 Python LLM, FinNLP, FDS, RAG
Data Engineer 홍지영 Python LLM, Data Engineering
Data Operator 이화림 Python LLM, Data Engineering
AI Engineer 남경혜 Python LLM, RAG, Multi-Agent
AI Engineer 심세원 Python LLM, RAG, Multi-Agent
Business Analyst 서희진 Python LLM, Data Analysis

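🌍 About PseudoLab

PseudoLab is a non-profit organization focused on advancing machine learning and AI. Guided by its core values of sharing, motivation, and the joy of collaboration, it builds impactful open-source projects.

Together with more than 5,000 researchers worldwide, we are dedicated to democratizing AI knowledge and fostering innovation through open collaboration.

Community: 💬 Discord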


🎯 Contributors
