doc: add storing data with memcs chapter #5683
Compared commits: 9355c21 to 4719c89
| * Apache Arrow support — data can be exported in Arrow format without conversion, enabling zero-copy interoperability. | ||
| * Dictionary encoding — reduces memory usage for string columns with repeated values. | ||
| * `LZ4 <https://en.wikipedia.org/wiki/LZ4_(compression_algorithm)>`_ compression — compresses column data to reduce memory footprint. | ||
| * SQL integration — supports querying via Tarantool SQL engine. |
SQL is not supported by memcs.
| - Boolean: ``boolean`` | ||
| - Temporal types: ``datetime`` | ||
| - UUID: ``uuid`` | ||
| - Decimal: ``decimal`` |
decimal, decimal32, decimal64, decimal128, and decimal256
| - Integer types: ``uint64``, ``int64``, ``uint32``, ``int32`` | ||
| - Floating-point types: ``double``, ``float`` | ||
| - Strings: ``string`` | ||
| - Boolean: ``boolean`` |
Booleans aren't supported correctly, let's remove them from here.
|
|
||
| MemCS supports a wide range of data types, including: | ||
|
|
||
| - Integer types: ``uint64``, ``int64``, ``uint32``, ``int32`` |
int8, uint8, int16, uint16, int32, uint32, int64, uint64
| MemCS supports a wide range of data types, including: | ||
|
|
||
| - Integer types: ``uint64``, ``int64``, ``uint32``, ``int32`` | ||
| - Floating-point types: ``double``, ``float`` |
| - Strings: ``string`` | ||
| - Boolean: ``boolean`` | ||
| - Temporal types: ``datetime`` | ||
| - UUID: ``uuid`` |
|
|
||
| - ``plain`` — default layout, no encoding | ||
| - ``null_rle`` — RLE encoding for nullable fields | ||
|
|
|
|
||
| .. _memcs-memory: | ||
|
|
||
| Memory Consumption |
Let's remove everything from here to the end; it reads too much like AI-generated text.
| - Dictionaries only grow — previously produced batches remain compatible | ||
|
|
||
| .. _memcs-column: | ||
|
|
Please add a chapter "RLE encoding of NULLs":
By default, NULL values are stored explicitly and take up the same space as any other valid column value (1, 2, 4, 8, 16, or 32 bytes, depending on the exact field type). RLE encoding of NULLs (null_rle) is also supported. For reference, RLE encoding of a column in which 90% of the values are evenly distributed NULLs reduces the memory consumption of that column by about 5 times.
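To illustrate the idea behind run-length encoding of NULLs, here is a minimal Python sketch. It is not the memcs implementation: the function names and the `(is_null, length)` run representation are assumptions made for illustration only. It stores runs of NULL and non-NULL positions separately from a dense list of the non-NULL values, so NULLs no longer occupy a full value slot each.

```python
from itertools import groupby, islice


def rle_encode_nulls(column):
    """Split a column into (is_null, run_length) runs plus dense non-NULL values."""
    runs = [
        (is_null, sum(1 for _ in group))
        for is_null, group in groupby(column, key=lambda v: v is None)
    ]
    values = [v for v in column if v is not None]
    return runs, values


def rle_decode_nulls(runs, values):
    """Rebuild the original column from the runs and the dense values."""
    it = iter(values)
    out = []
    for is_null, length in runs:
        if is_null:
            out.extend([None] * length)
        else:
            out.extend(islice(it, length))
    return out


column = [1, None, None, None, 2, 3, None]
runs, values = rle_encode_nulls(column)
restored = rle_decode_nulls(runs, values)
```

How much space such an encoding saves depends on how the NULLs are distributed: a few long runs cost less to describe than many short ones.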
| .. admonition:: Enterprise Edition | ||
| :class: fact | ||
|
|
||
| The `memcs` engine uses a single-threaded transaction processor (TX thread), similar to `memtx`. However, unlike `memtx`, |
Please change to something like:
The memcs engine uses a single-threaded transaction processor (TX thread), similar to memtx, and stores data in the memtx arena. In contrast to memtx, it doesn't organize data in tuples; instead, it stores data in columns. Each format field is assigned its own BPS tree-like structure (BPS vector), which stores only the values of that field. If the field type fits in 32 bytes, raw field values are stored directly in tree leaves without any encoding. Strings are stored in a format similar to the Arrow "Variable-size Binary View Layout", also called "German Strings".
The main benefit of this data organization is a significant performance boost for sequential scans of columnar data compared to memtx, thanks to CPU cache locality. That's why memcs supports a special C API for such columnar scans: see box_index_arrow_stream() and box_raw_read_view_arrow_stream(). Peak performance is achieved when scanning embedded field types.
Querying full tuples, as in memtx, is also supported, but performance is worse than in memtx, because a tuple has to be constructed on the runtime arena from individual field values gathered from each column tree.
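The trade-off between column scans and full-tuple reads can be sketched with a toy column store in Python. This is only an illustration under loud assumptions: plain Python lists stand in for the per-field BPS vectors, and the class and method names (`ColumnStore`, `scan_column`, `get_tuple`) are hypothetical, not part of any Tarantool API.

```python
class ColumnStore:
    """Toy columnar container: one list per format field."""

    def __init__(self, field_names):
        self.field_names = list(field_names)
        # Each field gets its own storage (a stand-in for a per-field BPS vector).
        self.columns = {name: [] for name in self.field_names}

    def insert(self, row):
        if len(row) != len(self.field_names):
            raise ValueError("row does not match the format")
        for name, value in zip(self.field_names, row):
            self.columns[name].append(value)

    def scan_column(self, name):
        # A sequential scan touches only the storage of one field.
        return iter(self.columns[name])

    def get_tuple(self, pos):
        # A full tuple has to be gathered from every column.
        return tuple(self.columns[name][pos] for name in self.field_names)


store = ColumnStore(["id", "name", "price"])
store.insert((1, "apple", 10.0))
store.insert((2, "pear", 2.5))

total_price = sum(store.scan_column("price"))
second_row = store.get_tuple(1)
```

The scan reads one contiguous column, while `get_tuple` performs one lookup per field, which mirrors why full-tuple queries are the slower path in a columnar layout.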
Other features include:
- Point lookup.
- Stable iterators.
- Insert / replace / delete / update.
- Batch insertion in the Arrow format.
- Transactions, including cross-engine transactions with memtx (with memtx_use_mvcc_engine = false).
- Read view support.
- Secondary indexes with an ability to specify covered columns and sequentially scan indexed + covered columns.
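For readers unfamiliar with the "German Strings" format mentioned in the suggested text above, the following Python sketch packs the 16-byte view that the Apache Arrow columnar format defines for its Variable-size Binary View Layout. The suggested text only says that memcs uses a similar format, so this shows the Arrow layout, not the exact memcs one; the helper names are hypothetical.

```python
import struct

INLINE_LIMIT = 12  # bytes of string data that fit inside a 16-byte view


def make_view(data, buffer_index=0, offset=0):
    """Pack a 16-byte view for a byte string.

    Short strings are stored inline (length + up to 12 bytes, zero-padded).
    Long strings keep length, a 4-byte prefix, a buffer index, and an offset
    into that buffer, where the full data is assumed to live.
    """
    if len(data) <= INLINE_LIMIT:
        return struct.pack("<i12s", len(data), data)
    return struct.pack("<i4sii", len(data), data[:4], buffer_index, offset)


def view_length(view):
    """Read the string length from the first 4 bytes of a view."""
    return struct.unpack_from("<i", view)[0]


def is_inline(view):
    return view_length(view) <= INLINE_LIMIT


short_view = make_view(b"memcs")
long_view = make_view(b"a string longer than twelve bytes", buffer_index=0, offset=128)
```

Both views have the same fixed size, so a column of strings can be scanned as an array of fixed-width entries, and short strings never require following a pointer.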
Fixes: #5630
Deployment: https://docs.d.tarantool.io/en/doc/doc-memcs-engine/platform/engines/memcs/
done with AI help