SQL overview #573
Open
kbatuigas wants to merge 11 commits into rp-sql from DOC-2049-redpanda-sql-introduction-and-overview
Commits
b089835 Draft SQL overview rewrite (kbatuigas)
090b046 DOC-2049: Apply v1 Iceberg scope and Postgres positioning to overview… (kbatuigas)
908c8b1 Add TODO to flesh out sql v pg (kbatuigas)
df49fe9 Move why RP SQL up (kbatuigas)
0ef72ae Minor edits (kbatuigas)
80c2034 Review pass (kbatuigas)
a9c03ae Change to default_redpanda_catalog (kbatuigas)
73acf11 Tweak overview learning objectives (kbatuigas)
8710a86 Review pass (kbatuigas)
28331a0 Intro rephrase (kbatuigas)
c3dd240 Remove tables not describing meaningful differences with Postgres (kbatuigas)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,106 @@ | ||
| = Redpanda SQL Overview | ||
| :description: Redpanda SQL is a column-oriented OLAP query engine in Redpanda Cloud BYOC for querying live and Iceberg-translated Redpanda topics with PostgreSQL syntax. | ||
| :page-topic-type: overview | ||
| :page-aliases: sql:get-started/what-is-redpanda-sql.adoc | ||
| :personas: app_developer, data_engineer, evaluator | ||
| :learning-objective-1: Identify scenarios where Redpanda SQL fits your analytical needs | ||
| :learning-objective-2: Identify the query patterns Redpanda SQL supports | ||
| :learning-objective-3: Describe the architectural characteristics that enable those patterns | ||
|
|
||
| Redpanda SQL turns your Redpanda glossterm:topic[,topics], including their Iceberg-translated history, into queryable SQL surfaces inside your Redpanda Bring Your Own Cloud (BYOC) glossterm:cluster[]. Built as a column-oriented online analytical processing (OLAP) engine, Redpanda SQL runs analytical queries over streaming and historical data without moving or duplicating data. It is a PostgreSQL-compatible query engine that implements the PostgreSQL wire protocol and a PostgreSQL-based SQL dialect, so you can connect with any PostgreSQL client, including `psql`, JDBC, DBeaver, and DataGrip. | ||
|
|
||
| Redpanda SQL handles a wide range of analytical workloads in a single system. You can power real-time business intelligence (BI) dashboards, process log data, run time-series analytics, and perform exploratory queries over large datasets without switching tools or maintaining separate systems. | ||
|
|
||
| After reading this page, you will be able to: | ||
|
|
||
| * [ ] {learning-objective-1} | ||
| * [ ] {learning-objective-2} | ||
| * [ ] {learning-objective-3} | ||
|
|
||
| == Why use Redpanda SQL | ||
|
|
||
| Querying real-time streaming data alongside historical lakehouse data typically means building ETL pipelines, copying data between systems, and running multiple analytical engines. Redpanda SQL eliminates this overhead by querying both live and historical data in place. | ||
|
|
||
| Redpanda SQL scales horizontally across multiple nodes within a cluster (up to 9 nodes) and uses hardware efficiently within each node, so analytical workloads can grow without proportional infrastructure cost. | ||
|
|
||
| == Primary use cases | ||
|
|
||
| * *Real-time analytics on data streams*: Query Redpanda topics directly with SQL. No ETL pipelines required. Useful for analyst-driven investigations in the streaming layer, debugging streaming applications, and prototyping consumers. | ||
| * *Hybrid streaming and historical analytics*: Query Iceberg-enabled topics in a single SQL query that spans live records and historical Iceberg-committed records. | ||
| * *Application-embedded operational analytics*: Run high-concurrency OLAP queries for dashboards and operational tools from any PostgreSQL client. | ||
|
|
||
| == What you can do with Redpanda SQL | ||
|
|
||
| Redpanda SQL exposes data through xref:sql:query-data/redpanda-catalogs.adoc[catalogs], which are named collections of source data exposed as queryable SQL tables. You can work with that data using two primary query patterns. | ||
|
|
||
| === Query streaming topics | ||
|
|
||
| Each Redpanda topic in your cluster appears as a SQL table inside a Redpanda catalog. Redpanda SQL reads the topic's glossterm:schema[] from glossterm:Schema Registry[] to map fields to SQL columns. The following example maps the `orders` topic to a table in `default_redpanda_catalog`, then queries it with `SELECT`: | ||
|
|
||
| [,sql] | ||
| ---- | ||
| CREATE TABLE default_redpanda_catalog=>orders WITH ( | ||
| topic = 'orders', | ||
| schema_subject = 'orders-value' | ||
| ); | ||
|
|
||
| SELECT customer_id, SUM(amount) AS total | ||
| FROM default_redpanda_catalog=>orders | ||
| GROUP BY customer_id | ||
| ORDER BY total DESC | ||
| LIMIT 10; | ||
| ---- | ||
|
|
||
| Analysts and developers can run these queries directly from any PostgreSQL client without moving data into a separate analytics store. | ||
|
|
||
| === Query Iceberg topics | ||
|
|
||
| When a Redpanda topic is configured for Iceberg translation, Redpanda SQL queries its Iceberg-committed data through the same SQL surface as live streaming topics, reading Parquet data and Iceberg metadata directly from cloud storage. | ||
|
|
||
| // "Bridge query" is a tentative internal name; final naming TBC for v1 publication. | ||
| On Iceberg-enabled topics, you can also run a single SQL query that returns a non-overlapping continuum of data across both: live records that haven't been translated to Iceberg yet, plus historical records already in Iceberg. You don't write a `UNION ALL` because Redpanda SQL plans the union for you, and rows aren't duplicated at the boundary between live and historical data. | ||
|
|
||
| == Read-only query engine | ||
|
|
||
| Redpanda SQL operates as a read-only query engine. It doesn't accept standard SQL data manipulation statements such as `INSERT`, `UPDATE`, or `DELETE`, and it rejects most `CREATE TABLE` operations that materialize new data. Upstream systems write data into Redpanda topics (with optional Iceberg translation), and you expose that data to Redpanda SQL through catalog mappings. This architecture lets you run analytical queries over streaming and historical data without duplicating or moving it. | ||
|
|
||
| == Architecture characteristics | ||
|
|
||
| Redpanda SQL is built from the ground up in C++ for analytical workloads, with a focus on resource efficiency. The following sections describe the core architectural decisions that shape its performance and scalability. | ||
|
|
||
| === Vectorized query execution | ||
|
|
||
| Redpanda SQL uses a massively parallel processing (MPP) architecture at the core of its compute engine. MPP has been the standard in analytics systems for over a decade, and Redpanda SQL implements it as a clean-slate C++ system, without JVM overhead or third-party engine components. The fresh codebase focuses on <<optimized-data-transfer-between-cpu-and-ram,low-level optimizations that improve resource efficiency>> in the query engine and across the system. | ||
|
|
||
| === Columnar storage optimization | ||
|
|
||
| Transactional (OLTP) databases such as PostgreSQL or Microsoft SQL Server use a row-oriented design, optimized for high-frequency writes. As a column-oriented engine, Redpanda SQL instead targets analytical workloads: a query reads only the columns it references, which allows for faster scans and more efficient aggregations. | ||
|
|
||
| === Decoupled storage and compute | ||
|
|
||
| Redpanda SQL uses a decoupled storage and compute architecture. Compute resources can be scaled independently of storage, allowing for more efficient resource allocation, easier deployment, and better cost control. | ||
|
|
||
| === Distributed, multi-node architecture | ||
|
|
||
| Redpanda SQL is distributed, running across multiple nodes in parallel for horizontal scaling. Adaptive query pipelines handle different operations efficiently across nodes, and execution strategies are selected at runtime based on workload characteristics for optimal performance in both single-node and multi-node setups. | ||
|
|
||
| === PostgreSQL wire protocol and SQL dialect | ||
|
|
||
| Redpanda SQL uses its own declarative query language under the hood but exposes a xref:reference:sql/index.adoc[PostgreSQL-compatible SQL surface] to users, including the PostgreSQL wire protocol. This means you can connect with `psql`, JDBC, ODBC, or any other PostgreSQL client and write SQL using familiar syntax. | ||
|
|
||
| === Optimized data transfer between CPU and RAM | ||
|
|
||
| Redpanda SQL applies low-level memory access and caching optimizations to keep analytical workloads CPU-cache efficient rather than memory-bandwidth-bound: | ||
|
|
||
| * User-space storage caches minimize overhead from kernel-level memory operations. | ||
| * A custom data format enhances data locality. | ||
| * Hybrid row/column formats allow better alignment with CPU cache lines and vectorized execution. | ||
| * Temporal access patterns help retain frequently used data in memory longer, reducing cache misses. | ||
|
|
||
| == Next steps | ||
|
|
||
| * xref:sql:get-started/sql-quickstart.adoc[Quickstart]: enable Redpanda SQL on a BYOC cluster and run your first query. | ||
| * xref:sql:connect-to-sql/index.adoc[Connect to Redpanda SQL]: connect from `psql`, JDBC, PHP PDO, or .NET Dapper. | ||
| * xref:reference:sql/index.adoc[Redpanda SQL Reference]: supported SQL statements, clauses, data types, functions, and operators. | ||
| * xref:sql:get-started/oltp-vs-olap.adoc[OLTP vs OLAP]: understand why Redpanda SQL uses an analytical (OLAP) model. | ||
| * xref:sql:get-started/redpanda-sql-vs-postgresql.adoc[Redpanda SQL vs PostgreSQL]: supported functions, operators, and behavioral differences. |
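Illustrative note on the "Query Iceberg topics" section of the diff: the page says a single query returns a non-overlapping continuum of live and Iceberg-committed records, with no hand-written `UNION ALL` and no duplicates at the boundary. The sketch below shows what that guarantee means, using Python's built-in `sqlite3` as a stand-in engine. Everything in it is an assumption made for illustration: the table names `orders_iceberg` and `orders_live`, the `record_offset` column, and the "last translated offset" boundary rule are hypothetical, and the diff does not describe how Redpanda SQL actually plans the union.

```python
import sqlite3


def build_demo() -> sqlite3.Connection:
    """Create two hypothetical tables that overlap at offset 2."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders_iceberg (record_offset INTEGER, customer_id TEXT, amount REAL);
        CREATE TABLE orders_live    (record_offset INTEGER, customer_id TEXT, amount REAL);
        """
    )
    # Historical records already committed to Iceberg (offsets 0-2).
    conn.executemany(
        "INSERT INTO orders_iceberg VALUES (?, ?, ?)",
        [(0, "a", 10.0), (1, "b", 5.0), (2, "a", 7.5)],
    )
    # Records still retained in the live topic (offsets 2-4); offset 2 is in both.
    conn.executemany(
        "INSERT INTO orders_live VALUES (?, ?, ?)",
        [(2, "a", 7.5), (3, "c", 2.0), (4, "b", 1.0)],
    )
    return conn


def naive_union(conn: sqlite3.Connection) -> list:
    """Hand-written UNION ALL: the record at the boundary appears twice."""
    return conn.execute(
        """
        SELECT record_offset, customer_id, amount FROM orders_iceberg
        UNION ALL
        SELECT record_offset, customer_id, amount FROM orders_live
        ORDER BY record_offset
        """
    ).fetchall()


def bridged(conn: sqlite3.Connection) -> list:
    """Take live rows only past the last translated offset (assumed boundary rule)."""
    boundary = conn.execute(
        "SELECT COALESCE(MAX(record_offset), -1) FROM orders_iceberg"
    ).fetchone()[0]
    return conn.execute(
        """
        SELECT record_offset, customer_id, amount FROM orders_iceberg
        UNION ALL
        SELECT record_offset, customer_id, amount FROM orders_live
        WHERE record_offset > ?
        ORDER BY record_offset
        """,
        (boundary,),
    ).fetchall()


if __name__ == "__main__":
    conn = build_demo()
    print(len(naive_union(conn)), "rows from the naive union")
    print(len(bridged(conn)), "rows from the bridged query")
```

In this toy setup the naive union double-counts the boundary record, while the bridged version returns each offset exactly once. That is the behavior the overview attributes to Redpanda SQL, which handles the boundary for you so the query text stays a plain `SELECT`.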
This file was deleted.
Removed the tables under Functions and Mathematical operators as they didn't seem to describe any actual differences from PostgreSQL, and so may not be worth keeping. Are there any actual known differences w.r.t. functions and operators (other than the one with JSON)?