diff --git a/core/reference/engines/table-engines/mergetree-family/invertedindexes.mdx b/core/reference/engines/table-engines/mergetree-family/invertedindexes.mdx deleted file mode 100644 index 24fe942a..00000000 --- a/core/reference/engines/table-engines/mergetree-family/invertedindexes.mdx +++ /dev/null @@ -1,801 +0,0 @@ ---- -description: 'Quickly find search terms in text.' -keywords: ['full-text search', 'text index', 'index', 'indices'] -sidebarTitle: 'Full-text Search using Text Indexes' -slug: /engines/table-engines/mergetree-family/invertedindexes -title: 'Full-text Search using Text Indexes' -doc_type: 'reference' ---- - -import {PrivatePreviewBadge} from '/snippets/components/PrivatePreviewBadge/PrivatePreviewBadge.jsx' - - - -Text indexes in ClickHouse (also known as ["inverted indexes"](https://en.wikipedia.org/wiki/Inverted_index)) provide fast full-text capabilities on string data. -The index maps each token in the column to the rows which contain the token. -The tokens are generated by a process called tokenization. -For example, ClickHouse tokenizes the English sentence "All cat like mice." by default as ["All", "cat", "like", "mice"] (note that the trailing dot is ignored). -More advanced tokenizers are available, for example for log data. 
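The token-to-rows mapping described above can be sketched in a few lines of Python. This is only a conceptual model, not ClickHouse's implementation: the helper names `split_by_non_alpha` and `build_inverted_index` are hypothetical, and the tokenizer is simplified to ASCII input.

```python
import re
from collections import defaultdict


def split_by_non_alpha(text):
    """Split on runs of non-alphanumeric ASCII characters and drop empty tokens.

    Simplified stand-in for the default tokenizer; assumes ASCII input.
    """
    return [token for token in re.split(r"[^0-9A-Za-z]+", text) if token]


def build_inverted_index(rows):
    """Map each token to the set of row numbers whose text contains it."""
    index = defaultdict(set)
    for row_number, text in enumerate(rows):
        for token in split_by_non_alpha(text):
            index[token].add(row_number)
    return index


rows = ["All cat like mice.", "mice like cheese"]
index = build_inverted_index(rows)
print(split_by_non_alpha(rows[0]))
print(sorted(index["mice"]))
```

A lookup for a token then only touches the rows listed for that token instead of scanning every string.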
- -## Creating a Text Index - -To create a text index, first enable the corresponding experimental setting: - -```sql -SET allow_experimental_full_text_index = true; -``` - -A text index can be defined on a [String](/core/reference/data-types/string), [FixedString](/core/reference/data-types/fixedstring), [Array(String)](/core/reference/data-types/array), [Array(FixedString)](/core/reference/data-types/array), and [Map](/core/reference/data-types/map) (via [mapKeys](/core/reference/functions/regular-functions/tuple-map-functions#mapkeys) and [mapValues](/core/reference/functions/regular-functions/tuple-map-functions#mapvalues) map functions) column using the following syntax: - -```sql -CREATE TABLE tab -( - `key` UInt64, - `str` String, - INDEX text_idx(str) TYPE text( - -- Mandatory parameters: - tokenizer = splitByNonAlpha|splitByString(S)|ngrams(N)|array - -- Optional parameters: - [, preprocessor = expression(str)] - -- Optional advanced parameters: - [, dictionary_block_size = D] - [, dictionary_block_frontcoding_compression = B] - [, max_cardinality_for_embedded_postings = M] - [, bloom_filter_false_positive_rate = R] - ) [GRANULARITY 64] -) -ENGINE = MergeTree -ORDER BY key -``` - -**Tokenizer argument**. The `tokenizer` argument specifies the tokenizer: - -- `splitByNonAlpha` splits strings along non-alphanumeric ASCII characters (also see function [splitByNonAlpha](/core/reference/functions/regular-functions/splitting-merging-functions#splitByNonAlpha)). -- `splitByString(S)` splits strings along certain user-defined separator strings `S` (also see function [splitByString](/core/reference/functions/regular-functions/splitting-merging-functions#splitByString)). - The separators can be specified using an optional parameter, for example, `tokenizer = splitByString([', ', '; ', '\n', '\\'])`. - Note that each string can consist of multiple characters (`', '` in the example). 
- The default separator list, if not specified explicitly (for example, `tokenizer = splitByString`), is a single whitespace `[' ']`. -- `ngrams(N)` splits strings into equally large `N`-grams (also see function [ngrams](/core/reference/functions/regular-functions/splitting-merging-functions#ngrams)). - The ngram length can be specified using an optional integer parameter between 2 and 8, for example, `tokenizer = ngrams(3)`. - The default ngram size, if not specified explicitly (for example, `tokenizer = ngrams`), is 3. -- `array` performs no tokenization, i.e. every row value is a token (also see function [array](/core/reference/functions/regular-functions/array-functions#array)). -- `sparseGrams(min_length, max_length, min_cutoff_length)` uses the same algorithm as the [sparseGrams](/core/reference/functions/regular-functions/string-functions#sparseGrams) function to split a string into all ngrams of `min_length` and several ngrams of larger size up to `max_length`, inclusive. If `min_cutoff_length` is specified, only N-grams with length greater than or equal to `min_cutoff_length` are saved in the index. Unlike `ngrams(N)`, which generates only fixed-length N-grams, `sparseGrams` produces a set of variable-length N-grams within the specified range, allowing for a more flexible representation of text context. For example, `tokenizer = sparseGrams(3, 5, 4)` will generate 3-, 4-, 5-grams from the input string and save only the 4- and 5-grams in the index. - - -The `splitByString` tokenizer applies the split separators left-to-right. -This can create ambiguities. -For example, the separator strings `['%21', '%']` will cause `%21abc` to be tokenized as `['abc']`, whereas swapping the two separator strings (`['%', '%21']`) will output `['21abc']`. -In most cases, you want matching to prefer longer separators first. -This can generally be done by passing the separator strings in order of descending length. 
-If the separator strings happen to form a [prefix code](https://en.wikipedia.org/wiki/Prefix_code), they can be passed in arbitrary order. - - - -It is currently not recommended to build text indexes on top of text in non-western languages, e.g. Chinese. -The currently supported tokenizers may lead to huge index sizes and long query times. -We plan to add specialized language-specific tokenizers in the future which will handle these cases better. - - -To test how the tokenizers split the input string, you can use ClickHouse's [tokens](/core/reference/functions/regular-functions/splitting-merging-functions#tokens) function: - -As an example, - -```sql -SELECT tokens('abc def', 'ngrams', 3) AS tokens; -``` - -returns - -```result -+-tokens--------------------------+ -| ['abc','bc ','c d',' de','def'] | -+---------------------------------+ -``` - -**Preprocessor argument**. The optional argument `preprocessor` is an expression which transforms the input string before tokenization. - -Typical use cases for the preprocessor argument include -1. Lower-casing (or upper-casing) the input strings to enable case-insensitive matching, e.g., [lower](/core/reference/functions/regular-functions/string-functions#lower), [lowerUTF8](/core/reference/functions/regular-functions/string-functions#lowerUTF8), see the first example below. -2. UTF-8 normalization, e.g. [normalizeUTF8NFC](/core/reference/functions/regular-functions/string-functions#normalizeUTF8NFC), [normalizeUTF8NFD](/core/reference/functions/regular-functions/string-functions#normalizeUTF8NFD), [normalizeUTF8NFKC](/core/reference/functions/regular-functions/string-functions#normalizeUTF8NFKC), [normalizeUTF8NFKD](/core/reference/functions/regular-functions/string-functions#normalizeUTF8NFKD), [toValidUTF8](/core/reference/functions/regular-functions/string-functions#toValidUTF8). -3. Removing or transforming unwanted characters or substrings, e.g. 
[extractTextFromHTML](/core/reference/functions/regular-functions/string-functions#extractTextFromHTML), [substring](/core/reference/functions/regular-functions/string-functions#substring), [idnaEncode](/core/reference/functions/regular-functions/string-functions#idnaEncode). - -The preprocessor expression must transform an input value of type [String](/core/reference/data-types/string) or [FixedString](/core/reference/data-types/fixedstring) to a value of the same type. - -Examples: -- `INDEX idx(col) TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = lower(col))` -- `INDEX idx(col) TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = substringIndex(col, '\n', 1))` -- `INDEX idx(col) TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = lower(extractTextFromHTML(col)))` - -Also, the preprocessor expression must only reference the column on top of which the text index is defined. -Using non-deterministic functions is not allowed. - -Functions [hasToken](/core/reference/functions/regular-functions/string-search-functions#hasToken), [hasAllTokens](/core/reference/functions/regular-functions/string-search-functions#hasAllTokens) and [hasAnyTokens](/core/reference/functions/regular-functions/string-search-functions#hasAnyTokens) use the preprocessor to first transform the search term before tokenizing it. - -For example: - -```sql -CREATE TABLE tab -( - key UInt64, - str String, - INDEX idx(str) TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = lower(str)) -) -ENGINE = MergeTree -ORDER BY tuple(); - -SELECT count() FROM tab WHERE hasToken(str, 'Foo'); -``` - -is equivalent to: - -```sql -CREATE TABLE tab -( - key UInt64, - str String, - INDEX idx(lower(str)) TYPE text(tokenizer = 'splitByNonAlpha') -) -ENGINE = MergeTree -ORDER BY tuple(); - -SELECT count() FROM tab WHERE hasToken(str, lower('Foo')); -``` - -**Other arguments**. 
Text indexes in ClickHouse are implemented as [secondary indexes](/core/reference/engines/table-engines/mergetree-family/mergetree#skip-index-types). -However, unlike other skipping indexes, text indexes have a default index GRANULARITY of 64. -This value has been chosen empirically and it provides a good trade-off between speed and index size for most use cases. -Advanced users can specify a different index granularity (we do not recommend this). - - - -The default values of the following advanced parameters will work well in virtually all situations. -We do not recommend changing them. - -Optional parameter `dictionary_block_size` (default: 128) specifies the size of dictionary blocks in rows. - -Optional parameter `dictionary_block_frontcoding_compression` (default: 1) specifies if the dictionary blocks use front coding as compression. - -Optional parameter `max_cardinality_for_embedded_postings` (default: 16) specifies the cardinality threshold below which posting lists should be embedded into dictionary blocks. - -Optional parameter `bloom_filter_false_positive_rate` (default: 0.1) specifies the false-positive rate of the dictionary bloom filter. - - -Text indexes can be added to or removed from a column after the table has been created: - -```sql -ALTER TABLE tab DROP INDEX text_idx; -ALTER TABLE tab ADD INDEX text_idx(str) TYPE text(tokenizer = splitByNonAlpha); -``` - -## Using a Text Index - -Using a text index in SELECT queries is straightforward as common string search functions will leverage the index automatically. -If no index exists, the string search functions below will fall back to slow brute-force scans. - -### Supported functions - -The text index can be used if text functions are used in the `WHERE` clause of a SELECT query: - -```sql -SELECT [...] -FROM [...] 
-WHERE string_search_function(column_with_text_index) -``` - -#### `=` and `!=` - -`=` ([equals](/core/reference/functions/regular-functions/comparison-functions#equals)) and `!=` ([notEquals](/core/reference/functions/regular-functions/comparison-functions#notEquals)) match the entire given search term. - -Example: - -```sql -SELECT * FROM tab WHERE str = 'Hello'; -``` - -The text index supports `=` and `!=`, yet equality and inequality search only make sense with the `array` tokenizer (which causes the index to store entire row values). - -#### `IN` and `NOT IN` - -`IN` ([in](/core/reference/functions/regular-functions/in-functions)) and `NOT IN` ([notIn](/core/reference/functions/regular-functions/in-functions)) are similar to functions `equals` and `notEquals` but they match any (`IN`) or none (`NOT IN`) of the search terms. - -Example: - -```sql -SELECT * FROM tab WHERE str IN ('Hello', 'World'); -``` - -The same restrictions as for `=` and `!=` apply, i.e. `IN` and `NOT IN` only make sense in conjunction with the `array` tokenizer. - -#### `LIKE`, `NOT LIKE` and `match` - - -These functions currently use the text index for filtering only if the index tokenizer is either `splitByNonAlpha` or `ngrams`. - - -In order to use `LIKE` ([like](/core/reference/functions/regular-functions/string-search-functions#like)), `NOT LIKE` ([notLike](/core/reference/functions/regular-functions/string-search-functions#notLike)), and the [match](/core/reference/functions/regular-functions/string-search-functions#match) function with text indexes, ClickHouse must be able to extract complete tokens from the search term. - -Example: - -```sql -SELECT count() FROM tab WHERE comment LIKE 'support%'; -``` - -`support` in the example could match `support`, `supports`, `supporting` etc. -This kind of query is a substring query and it cannot be sped up by a text index. 
- -To leverage a text index for LIKE queries, the LIKE pattern must be rewritten in the following way: - -```sql -SELECT count() FROM tab WHERE comment LIKE ' support %'; -- or `% support %` -``` - -The spaces left and right of `support` make sure that the term can be extracted as a token. - -#### `startsWith` and `endsWith` - -Similar to `LIKE`, functions [startsWith](/core/reference/functions/regular-functions/string-functions#startsWith) and [endsWith](/core/reference/functions/regular-functions/string-functions#endsWith) can only use a text index if complete tokens can be extracted from the search term. - -Example: - -```sql -SELECT count() FROM tab WHERE startsWith(comment, 'clickhouse support'); -``` - -In the example, only `clickhouse` is considered a token. -`support` is not a token because it can match `support`, `supports`, `supporting` etc. - -To find all rows that start with `clickhouse supports`, end the search pattern with a trailing space: - -```sql -startsWith(comment, 'clickhouse supports ') -``` - -Similarly, `endsWith` should be used with a leading space: - -```sql -SELECT count() FROM tab WHERE endsWith(comment, ' olap engine'); -``` - -#### `hasToken` and `hasTokenOrNull` - -Functions [hasToken](/core/reference/functions/regular-functions/string-search-functions#hasToken) and [hasTokenOrNull](/core/reference/functions/regular-functions/string-search-functions#hasTokenOrNull) match against a single given token. - -Unlike the previously mentioned functions, they do not tokenize the search term (they assume the input is a single token). - -Example: - -```sql -SELECT count() FROM tab WHERE hasToken(comment, 'clickhouse'); -``` - -Functions `hasToken` and `hasTokenOrNull` are the most performant functions to use with the `text` index. 
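The "complete token" rule from the `LIKE`, `startsWith` and `endsWith` sections above can be approximated with a small Python model. It is a simplified mental model under stated assumptions, not ClickHouse's actual pattern analysis: the function name `complete_tokens_from_like` is hypothetical, tokens are ASCII alphanumeric runs, escape sequences are ignored, and `startsWith(x)` / `endsWith(x)` are treated like `LIKE 'x%'` / `LIKE '%x'`.

```python
import re


def complete_tokens_from_like(pattern):
    """Return alphanumeric runs of a LIKE pattern that do not touch a wildcard.

    A run next to '%' or '_' could be a fragment of a longer token,
    so it cannot be looked up in the index as a whole token.
    """
    tokens = []
    for match in re.finditer(r"[0-9A-Za-z]+", pattern):
        before = pattern[match.start() - 1] if match.start() > 0 else ""
        after = pattern[match.end()] if match.end() < len(pattern) else ""
        if before not in ("%", "_") and after not in ("%", "_"):
            tokens.append(match.group())
    return tokens


print(complete_tokens_from_like("support%"))             # nothing usable
print(complete_tokens_from_like(" support %"))           # 'support' is complete
print(complete_tokens_from_like("clickhouse support%"))  # models startsWith(...)
```

In this model, a pattern that yields no complete tokens gives the index nothing to look up, which matches the advice above to surround search terms with separators.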
- -#### `hasAnyTokens` and `hasAllTokens` - -Functions [hasAnyTokens](/core/reference/functions/regular-functions/string-search-functions#hasAnyTokens) and [hasAllTokens](/core/reference/functions/regular-functions/string-search-functions#hasAllTokens) match against one or all of the given tokens. - -These two functions accept the search tokens as either a string which will be tokenized using the same tokenizer used for the index column, or as an array of already processed tokens to which no tokenization will be applied prior to searching. -See the function documentation for more info. - -Example: - -```sql --- Search tokens passed as string argument -SELECT count() FROM tab WHERE hasAnyTokens(comment, 'clickhouse olap'); -SELECT count() FROM tab WHERE hasAllTokens(comment, 'clickhouse olap'); - --- Search tokens passed as Array(String) -SELECT count() FROM tab WHERE hasAnyTokens(comment, ['clickhouse', 'olap']); -SELECT count() FROM tab WHERE hasAllTokens(comment, ['clickhouse', 'olap']); -``` - -#### `has` - -Array function [has](/core/reference/functions/regular-functions/array-functions#has) matches against a single token in the array of strings. - -Example: - -```sql -SELECT count() FROM tab WHERE has(array, 'clickhouse'); -``` - -#### `mapContains` - -Function [mapContains](/core/reference/functions/regular-functions/tuple-map-functions#mapcontains)(alias of: `mapContainsKey`) matches against a single token in the keys of a map. - -Example: - -```sql -SELECT count() FROM tab WHERE mapContainsKey(map, 'clickhouse'); --- OR -SELECT count() FROM tab WHERE mapContains(map, 'clickhouse'); -``` - -#### `operator[]` - -Access [operator[]](/core/reference/operators#access-operators) can be used with the text index to filter out keys and values. - -Example: - -```sql -SELECT count() FROM tab WHERE map['engine'] = 'clickhouse'; -- will use the text index if defined -``` - -See the following examples for the usage of `Array(T)` and `Map(K, V)` with the text index. 
- -### Examples for the text index `Array` and `Map` support. - -#### Indexing Array(String) - -In a simple blogging platform, authors assign keywords to their posts to categorize content. -A common feature allows users to discover related content by clicking on keywords or searching for topics. - -Consider this table definition: - -```sql -CREATE TABLE posts ( - post_id UInt64, - title String, - content String, - keywords Array(String) COMMENT 'Author-defined keywords' -) -ENGINE = MergeTree -ORDER BY (post_id); -``` - -Without a text index, finding posts with a specific keyword (e.g. `clickhouse`) requires scanning all entries: - -```sql -SELECT count() FROM posts WHERE has(keywords, 'clickhouse'); -- slow full-table scan - checks every keyword in every post -``` - -As the platform grows, this becomes increasingly slow because the query must examine every keywords array in every row. - -To overcome this performance issue, we can define a text index for the `keywords` that creates a search-optimized structure that pre-processes all keywords, enabling instant lookups: - -```sql -ALTER TABLE posts ADD INDEX keywords_idx(keywords) TYPE text(tokenizer = splitByNonAlpha); -``` - - -Important: After adding the text index, you must rebuild it for existing data: - -```sql -ALTER TABLE posts MATERIALIZE INDEX keywords_idx; -``` - - -#### Indexing Map - -In a logging system, server requests often store metadata in key-value pairs. Operations teams need to efficiently search through logs for debugging, security incidents, and monitoring. - -Consider this logs table: - -```sql -CREATE TABLE logs ( - id UInt64, - timestamp DateTime, - message String, - attributes Map(String, String) -) -ENGINE = MergeTree -ORDER BY (timestamp); -``` - -Without a text index, searching through [Map](/core/reference/data-types/map) data requires full table scans: - -1. 
Find all logs with rate limiting: - -```sql -SELECT count() FROM logs WHERE has(mapKeys(attributes), 'rate_limit'); -- slow full-table scan -``` - -2. Find all logs from a specific IP: - -```sql -SELECT count() FROM logs WHERE has(mapValues(attributes), '192.168.1.1'); -- slow full-table scan -``` - -As log volume grows, these queries become slow. - -The solution is creating a text index for the [Map](/core/reference/data-types/map) keys and values. - -Use [mapKeys](/core/reference/functions/regular-functions/tuple-map-functions#mapkeys) to create a text index when you need to find logs by field names or attribute types: - -```sql -ALTER TABLE logs ADD INDEX attributes_keys_idx mapKeys(attributes) TYPE text(tokenizer = array); -``` - -Use [mapValues](/core/reference/functions/regular-functions/tuple-map-functions#mapvalues) to create a text index when you need to search within the actual content of attributes: - -```sql -ALTER TABLE logs ADD INDEX attributes_vals_idx mapValues(attributes) TYPE text(tokenizer = array); -``` - - -Important: After adding the text index, you must rebuild it for existing data: - -```sql -ALTER TABLE logs MATERIALIZE INDEX attributes_keys_idx; -ALTER TABLE logs MATERIALIZE INDEX attributes_vals_idx; -``` - - -1. Find all rate-limited requests: - -```sql -SELECT * FROM logs WHERE mapContainsKey(attributes, 'rate_limit'); -- fast -``` - -2. Find all logs from a specific IP: - -```sql -SELECT * FROM logs WHERE has(mapValues(attributes), '192.168.1.1'); -- fast -``` - -## Implementation - -### Index layout - -Each text index consists of two (abstract) data structures: -- a dictionary which maps each token to a postings list, and -- a set of postings lists, each representing a set of row numbers. - -Since a text index is a skip index, these data structures exist logically per index granule. 
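The two abstract structures can be modelled in Python as one dictionary per granule whose values are postings lists. This is a hedged sketch only: the names `build_text_index`, `has_all_tokens` and `has_any_tokens`, the granule size of two rows, and whitespace tokenization are illustrative assumptions, and plain Python sets stand in for the on-disk postings format.

```python
def build_text_index(rows, granule_rows=2):
    """Return one {token: set(row_numbers)} dictionary per index granule."""
    granules = []
    for start in range(0, len(rows), granule_rows):
        dictionary = {}
        for offset, text in enumerate(rows[start:start + granule_rows]):
            for token in text.split():
                dictionary.setdefault(token, set()).add(start + offset)
        granules.append(dictionary)
    return granules


def has_all_tokens(granules, tokens):
    """Rows containing every token: intersect the postings lists per granule."""
    matches = set()
    for dictionary in granules:
        postings = [dictionary.get(token, set()) for token in tokens]
        matches |= set.intersection(*postings) if postings else set()
    return matches


def has_any_tokens(granules, tokens):
    """Rows containing at least one token: union the postings lists."""
    matches = set()
    for dictionary in granules:
        for token in tokens:
            matches |= dictionary.get(token, set())
    return matches


rows = ["love clickhouse", "love olap", "clickhouse olap engine", "plain text"]
granules = build_text_index(rows)
print(sorted(has_all_tokens(granules, ["love", "clickhouse"])))
print(sorted(has_any_tokens(granules, ["love", "clickhouse"])))
```

Keeping a separate dictionary per granule means a token that is absent from a granule rules out that whole granule without looking at any row data.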
- -During index creation, three files are created (per part): - -**Dictionary blocks file (.dct)** - -The tokens in an index granule are sorted and stored in dictionary blocks of 128 tokens each (the block size is configurable by parameter `dictionary_block_size`). -A dictionary blocks file (.dct) consists of all the dictionary blocks of all index granules in a part. - -**Index granules file (.idx)** - -The index granules file contains for each dictionary block the block's first token, its relative offset in the dictionary blocks file, and a bloom filter for all tokens in the block. -This sparse index structure is similar to ClickHouse's [sparse primary key index](/core/guides/clickhouse/data-modelling/sparse-primary-indexes). -The bloom filter makes it possible to skip dictionary blocks early if the searched token is not contained in a dictionary block. - -**Postings lists file (.pst)** - -The posting lists for all tokens are laid out sequentially in the postings list file. -To save space while still allowing fast intersection and union operations, the posting lists are stored as [roaring bitmaps](https://roaringbitmap.org/). -If the cardinality of a posting list is less than 16 (configurable by parameter `max_cardinality_for_embedded_postings`), it is embedded into the dictionary. - -### Direct read - -Certain types of text queries can be sped up significantly by an optimization called "direct read". -More specifically, the optimization can be applied if the SELECT query does _not_ project from the text column. - -Example: - -```sql -SELECT column_a, column_b, ... -- not: column_with_text_index -FROM [...] -WHERE string_search_function(column_with_text_index) -``` - -The direct read optimization in ClickHouse answers the query exclusively using the text index (i.e., text index lookups) without accessing the underlying text column. 
-Text index lookups read relatively little data and are therefore much faster than usual skip indexes in ClickHouse (which do a skip index lookup, followed by loading and filtering surviving granules). - -Direct read is controlled by two settings: -- Setting [query_plan_direct_read_from_text_index](/core/reference/settings/session-settings#query_plan_direct_read_from_text_index) (default: 1) which specifies if direct read is generally enabled. -- Setting [use_skip_indexes_on_data_read](/core/reference/settings/session-settings#use_skip_indexes_on_data_read) (default: 1) which is another prerequisite for direct read. Note that on ClickHouse databases with [compatibility](/core/reference/settings/session-settings#compatibility) < 25.10, `use_skip_indexes_on_data_read` is disabled, so you either need to raise the compatibility setting value or `SET use_skip_indexes_on_data_read = 1` explicitly. - -Also, the text index must be fully materialized to use direct reading (use `ALTER TABLE ... MATERIALIZE INDEX` for that). - -**Supported functions** -The direct read optimization supports functions `hasToken`, `hasAllTokens`, and `hasAnyTokens`. -These functions can also be combined by AND, OR, and NOT operators. -The WHERE clause can also contain additional filters that are not text search functions (for text columns or other columns) - in that case, the direct read optimization will still be used but will be less effective (it only applies to the supported text search functions). - -To check whether a query uses direct read, run the query with `EXPLAIN PLAN actions = 1`. -As an example, a query with disabled direct read - -```sql -EXPLAIN PLAN actions = 1 -SELECT count() -FROM tab -WHERE hasToken(col, 'some_token') -SETTINGS query_plan_direct_read_from_text_index = 0; -``` - -returns - -```text -[...] 
-Filter ((WHERE + Change column names to column identifiers)) -Filter column: hasToken(__table1.col, 'some_token'_String) (removed) -Actions: INPUT : 0 -> col String : 0 - COLUMN Const(String) -> 'some_token'_String String : 1 - FUNCTION hasToken(col :: 0, 'some_token'_String :: 1) -> hasToken(__table1.col, 'some_token'_String) UInt8 : 2 -[...] -``` - -whereas the same query run with `query_plan_direct_read_from_text_index = 1` - -```sql -EXPLAIN PLAN actions = 1 -SELECT count() -FROM tab -WHERE hasToken(col, 'some_token') -SETTINGS query_plan_direct_read_from_text_index = 1; -``` - -returns - -```text -[...] -Expression (Before GROUP BY) -Positions: - Filter - Filter column: __text_index_idx_hasToken_94cc2a813036b453d84b6fb344a63ad3 (removed) - Actions: INPUT :: 0 -> __text_index_idx_hasToken_94cc2a813036b453d84b6fb344a63ad3 UInt8 : 0 -[...] -``` - -The second EXPLAIN PLAN output contains a virtual column `__text_index___`. -If this column is present, then direct read is used. - -## Example: Hackernews dataset - -Let's look at the performance improvements of text indexes on a large dataset with lots of text. -We will use 28.7M rows of comments on the popular Hacker News website. 
-Here is the table without text index: - -```sql -CREATE TABLE hackernews ( - id UInt64, - deleted UInt8, - type String, - author String, - timestamp DateTime, - comment String, - dead UInt8, - parent UInt64, - poll UInt64, - children Array(UInt32), - url String, - score UInt32, - title String, - parts Array(UInt32), - descendants UInt32 -) -ENGINE = MergeTree -ORDER BY (type, author); -``` - -The 28.7M rows are in a Parquet file in S3 - let's insert them into the `hackernews` table: - -```sql -INSERT INTO hackernews - SELECT * FROM s3Cluster( - 'default', - 'https://datasets-documentation.s3.eu-west-3.amazonaws.com/hackernews/hacknernews.parquet', - 'Parquet', - ' - id UInt64, - deleted UInt8, - type String, - by String, - time DateTime, - text String, - dead UInt8, - parent UInt64, - poll UInt64, - kids Array(UInt32), - url String, - score UInt32, - title String, - parts Array(UInt32), - descendants UInt32'); -``` - -We will use `ALTER TABLE` and add a text index on comment column, then materialize it: - -```sql --- Add the index -ALTER TABLE hackernews ADD INDEX comment_idx(comment) TYPE text(tokenizer = splitByNonAlpha); - --- Materialize the index for existing data -ALTER TABLE hackernews MATERIALIZE INDEX comment_idx SETTINGS mutations_sync = 2; -``` - -Now, let's run queries using `hasToken`, `hasAnyTokens`, and `hasAllTokens` functions. -The following examples will show the dramatic performance difference between a standard index scan and the direct read optimization. - -### 1. Using `hasToken` - -`hasToken` checks if the text contains a specific single token. -We'll search for the case-sensitive token 'ClickHouse'. - -**Direct read disabled (Standard scan)** -By default, ClickHouse uses the skip index to filter granules and then reads the column data for those granules. -We can simulate this behavior by disabling direct read. 
- -```sql -SELECT count() -FROM hackernews -WHERE hasToken(comment, 'ClickHouse') -SETTINGS query_plan_direct_read_from_text_index = 0, use_skip_indexes_on_data_read = 0; - -┌─count()─┐ -│ 516 │ -└─────────┘ - -1 row in set. Elapsed: 0.362 sec. Processed 24.90 million rows, 9.51 GB -``` - -**Direct read enabled (Fast index read)** -Now we run the same query with direct read enabled (the default). - -```sql -SELECT count() -FROM hackernews -WHERE hasToken(comment, 'ClickHouse') -SETTINGS query_plan_direct_read_from_text_index = 1, use_skip_indexes_on_data_read = 1; - -┌─count()─┐ -│ 516 │ -└─────────┘ - -1 row in set. Elapsed: 0.008 sec. Processed 3.15 million rows, 3.15 MB -``` -The direct read query is over 45 times faster (0.362s vs 0.008s) and processes significantly less data (9.51 GB vs 3.15 MB) by reading from the index alone. - -### 2. Using `hasAnyTokens` - -`hasAnyTokens` checks if the text contains at least one of the given tokens. -We'll search for comments containing either 'love' or 'ClickHouse'. - -**Direct read disabled (Standard scan)** - -```sql -SELECT count() -FROM hackernews -WHERE hasAnyTokens(comment, 'love ClickHouse') -SETTINGS query_plan_direct_read_from_text_index = 0, use_skip_indexes_on_data_read = 0; - -┌─count()─┐ -│ 408426 │ -└─────────┘ - -1 row in set. Elapsed: 1.329 sec. Processed 28.74 million rows, 9.72 GB -``` - -**Direct read enabled (Fast index read)** - -```sql -SELECT count() -FROM hackernews -WHERE hasAnyTokens(comment, 'love ClickHouse') -SETTINGS query_plan_direct_read_from_text_index = 1, use_skip_indexes_on_data_read = 1; - -┌─count()─┐ -│ 408426 │ -└─────────┘ - -1 row in set. Elapsed: 0.015 sec. Processed 27.99 million rows, 27.99 MB -``` -The speedup is even more dramatic for this common "OR" search. -The query is nearly 89 times faster (1.329s vs 0.015s) by avoiding the full column scan. - -### 3. Using `hasAllTokens` - -`hasAllTokens` checks if the text contains all of the given tokens. 
-We'll search for comments containing both 'love' and 'ClickHouse'. - -**Direct read disabled (Standard scan)** -Even with direct read disabled, the standard skip index is still effective. -It filters down the 28.7M rows to just 147.46K rows, but it still must read 57.03 MB from the column. - -```sql -SELECT count() -FROM hackernews -WHERE hasAllTokens(comment, 'love ClickHouse') -SETTINGS query_plan_direct_read_from_text_index = 0, use_skip_indexes_on_data_read = 0; - -┌─count()─┐ -│ 11 │ -└─────────┘ - -1 row in set. Elapsed: 0.184 sec. Processed 147.46 thousand rows, 57.03 MB -``` - -**Direct read enabled (Fast index read)** -Direct read answers the query by operating on the index data, reading only 147.46 KB. - -```sql -SELECT count() -FROM hackernews -WHERE hasAllTokens(comment, 'love ClickHouse') -SETTINGS query_plan_direct_read_from_text_index = 1, use_skip_indexes_on_data_read = 1; - -┌─count()─┐ -│ 11 │ -└─────────┘ - -1 row in set. Elapsed: 0.007 sec. Processed 147.46 thousand rows, 147.46 KB -``` - -For this "AND" search, the direct read optimization is over 26 times faster (0.184s vs 0.007s) than the standard skip index scan. - -### 4. Compound search: OR, AND, NOT, ... -The direct read optimization also applies to compound boolean expressions. -Here, we'll perform a case-insensitive search for 'ClickHouse' OR 'clickhouse'. - -**Direct read disabled (Standard scan)** - -```sql -SELECT count() -FROM hackernews -WHERE hasToken(comment, 'ClickHouse') OR hasToken(comment, 'clickhouse') -SETTINGS query_plan_direct_read_from_text_index = 0, use_skip_indexes_on_data_read = 0; - -┌─count()─┐ -│ 769 │ -└─────────┘ - -1 row in set. Elapsed: 0.450 sec. 
Processed 25.87 million rows, 9.58 GB -``` - -**Direct read enabled (Fast index read)** - -```sql -SELECT count() -FROM hackernews -WHERE hasToken(comment, 'ClickHouse') OR hasToken(comment, 'clickhouse') -SETTINGS query_plan_direct_read_from_text_index = 1, use_skip_indexes_on_data_read = 1; - -┌─count()─┐ -│ 769 │ -└─────────┘ - -1 row in set. Elapsed: 0.013 sec. Processed 25.87 million rows, 51.73 MB -``` - -By combining the results from the index, the direct read query is 34 times faster (0.450s vs 0.013s) and avoids reading the 9.58 GB of column data. -For this specific case, `hasAnyTokens(comment, ['ClickHouse', 'clickhouse'])` would be the preferred, more efficient syntax. - -## Tuning the text index - -Currently, there are caches for the deserialized dictionary blocks, headers and posting lists of the text index to reduce I/O. - -They can be enabled via settings [use_text_index_dictionary_cache](/core/reference/settings/session-settings#use_text_index_dictionary_cache), [use_text_index_header_cache](/core/reference/settings/session-settings#use_text_index_header_cache) and [use_text_index_postings_cache](/core/reference/settings/session-settings#use_text_index_postings_cache) respectively. By default, they are disabled. - -Refer to the following server settings to configure the caches. - -### Server Settings - -#### Dictionary blocks cache settings - -| Setting | Description | Default | -|----------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------|--------------| -| [text_index_dictionary_block_cache_policy](/core/reference/settings/server-settings/settings#text_index_dictionary_block_cache_policy) | Text index dictionary block cache policy name. 
| `SLRU` | -| [text_index_dictionary_block_cache_size](/core/reference/settings/server-settings/settings#text_index_dictionary_block_cache_size) | Maximum cache size in bytes. | `1073741824` | -| [text_index_dictionary_block_cache_max_entries](/core/reference/settings/server-settings/settings#text_index_dictionary_block_cache_max_entries) | Maximum number of deserialized dictionary blocks in cache. | `1'000'000` | -| [text_index_dictionary_block_cache_size_ratio](/core/reference/settings/server-settings/settings#text_index_dictionary_block_cache_size_ratio) | The size of the protected queue in the text index dictionary block cache relative to the cache\'s total size. | `0.5` | - -#### Header cache settings - -| Setting | Description | Default | -|--------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|--------------| -| [text_index_header_cache_policy](/core/reference/settings/server-settings/settings#text_index_header_cache_policy) | Text index header cache policy name. | `SLRU` | -| [text_index_header_cache_size](/core/reference/settings/server-settings/settings#text_index_header_cache_size) | Maximum cache size in bytes. | `1073741824` | -| [text_index_header_cache_max_entries](/core/reference/settings/server-settings/settings#text_index_header_cache_max_entries) | Maximum number of deserialized headers in cache. | `100'000` | -| [text_index_header_cache_size_ratio](/core/reference/settings/server-settings/settings#text_index_header_cache_size_ratio) | The size of the protected queue in the text index header cache relative to the cache\'s total size. 
| `0.5` | - -#### Posting lists cache settings - -| Setting | Description | Default | -|---------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|--------------| -| [text_index_postings_cache_policy](/core/reference/settings/server-settings/settings#text_index_postings_cache_policy) | Text index postings cache policy name. | `SLRU` | -| [text_index_postings_cache_size](/core/reference/settings/server-settings/settings#text_index_postings_cache_size) | Maximum cache size in bytes. | `2147483648` | -| [text_index_postings_cache_max_entries](/core/reference/settings/server-settings/settings#text_index_postings_cache_max_entries) | Maximum number of deserialized postings in cache. | `1'000'000` | -| [text_index_postings_cache_size_ratio](/core/reference/settings/server-settings/settings#text_index_postings_cache_size_ratio) | The size of the protected queue in the text index postings cache relative to the cache\'s total size. 
| `0.5` | - -## Related content - -- Blog: [Introducing Inverted Indices in ClickHouse](https://clickhouse.com/blog/clickhouse-search-with-inverted-indices) -- Blog: [Inside ClickHouse full-text search: fast, native, and columnar](https://clickhouse.com/blog/clickhouse-full-text-search) -- Video: [Full-Text Indices: Design and Experiments](https://www.youtube.com/watch?v=O_MnyUkrIq8) diff --git a/core/reference/navigation.json b/core/reference/navigation.json index 9547afaa..7f38f495 100644 --- a/core/reference/navigation.json +++ b/core/reference/navigation.json @@ -363,7 +363,6 @@ "core/reference/engines/table-engines/mergetree-family/collapsingmergetree", "core/reference/engines/table-engines/mergetree-family/custom-partitioning-key", "core/reference/engines/table-engines/mergetree-family/graphitemergetree", - "core/reference/engines/table-engines/mergetree-family/invertedindexes", "core/reference/engines/table-engines/mergetree-family/replacingmergetree", "core/reference/engines/table-engines/mergetree-family/replication", "core/reference/engines/table-engines/mergetree-family/summingmergetree",