Skip to content

[BUG/PERF] LIKE and relevance-function pushdown gaps on Calcite engine #5491

@penghuo

Description

@penghuo

Test data

5000 docs, each with three identical fields holding a string like:

logtype=ws:access http_status=200 uri="..." user_agent="Go-http-client/1.1" ...
  • body_text (text)
  • body_keyword (keyword)
  • body_multi (text + .keyword subfield)
  • lt = rex-extracted from body (VARCHAR)

logtype is randomly drawn from {ws:access, ws:error, app:debug, app:info}.

Ground truth

  • 1271 docs contain ws:access as a substring
  • 993 docs contain Go-http-client as a substring

Correctness legend

  • OK — result matches the operator's defined behavior given the field type.
  • WRONG — result deviates from the operator's defined behavior.
  • FAIL — query errors out at runtime.

Note: rows that return 0 hits can still be OK when the operator is doing what it's defined to do. For example, LIKE(body_text, "ws:access") (no wildcards) means strict equality — long body string ≠ short literal → 0 hits is correct. match(body_keyword, "ws:access") does whole-value term matching against the keyword field — body value ≠ literal → 0 hits is correct.

Probe 1: ws:access

query hits correctness dsl
LIKE(body_text, "ws:access") 0 OK SCRIPT[ILIKE]
LIKE(body_text, "%ws:access%") 1271 OK SCRIPT[ILIKE]
match(body_text, "ws:access") 1271 OK match
match_phrase(body_text, "ws:access") 1271 OK match_phrase
LIKE(body_keyword, "ws:access") 0 OK wildcard case_insensitive=true
LIKE(body_keyword, "%ws:access%") 1271 OK wildcard case_insensitive=true
match(body_keyword, "ws:access") 0 OK match
match_phrase(body_keyword, "ws:access") 0 OK match_phrase
body_keyword = "ws:access" 0 OK term
LIKE(body_multi, "ws:access") 0 OK wildcard(.keyword) case_insensitive=true
LIKE(body_multi, "%ws:access%") 1271 OK wildcard(.keyword) case_insensitive=true
match(body_multi, "ws:access") 1271 OK match
match_phrase(body_multi, "ws:access") 1271 OK match_phrase
LIKE(lt, "ws:access") 1271 OK SCRIPT[ILIKE]
LIKE(lt, "%ws:access%") 1271 OK SCRIPT[ILIKE]
match(lt, "ws:access") ERR FAIL SCRIPT[match] (rejected at runtime)
match_phrase(lt, "ws:access") ERR FAIL SCRIPT[match_phrase] (rejected at runtime)
lt = "ws:access" 1271 OK SCRIPT[=]

16 of 18 cells produce a result. Two cells hard-fail at runtime.

Issues identified from the matrix

Issue 1 — match / match_phrase behave inconsistently across field types

The same match / match_phrase operator produces three different outcomes depending on the field's type:

field type behavior example
text works as expected — token / phrase match against the inverted index match(body_text, "ws:access") → 1271
text + .keyword works as expected on the text path match(body_multi, "ws:access") → 1271
keyword whole-value matching — typically returns 0 when the keyword value is a long string match(body_keyword, "ws:access") → 0
derived (rex/eval) hard error: Relevance search query functions are only supported when they are pushed down match(lt, "ws:access") → 500

The keyword case is Lucene-correct but easy to mistake for "no matches found." The derived-column case is a hard runtime failure: the planner emits a SCRIPT[match] clause; the script engine rejects relevance functions at compile time; the query never executes.

This is especially painful because "use match_phrase instead of LIKE" is the natural workaround for the perf issue below — but that workaround fails the moment a user touches a rex-extracted column.

Cross-reference: #3668 — relevance functions need pushdown plumbing in Calcite that doesn't exist yet for derived columns.

Proposal — Issue 1

  1. Reject early, not at runtime. When the planner sees match / match_phrase against a non-pushdownable column (Calcite-derived, no underlying field), fail fast at planning time with a clear error message naming the column and suggesting alternatives. Don't generate a SCRIPT[match] clause that's guaranteed to fail server-side.
  2. Document the keyword-field case in PPL relevance-function docs: match / match_phrase on a keyword field perform whole-value term matching, not substring/phrase. Common alternatives: use a text analyzer, query the .keyword subfield only with =/LIKE %X%, or restructure the data.
  3. Track in #3668 — when relevance-function pushdown extends to derived columns, the hard-fail case becomes a script-path execution.

Issue 2 — LIKE on a pure text field is a script query (perf)

Two text-field rows emit SCRIPT[ILIKE] instead of a native Lucene query:

  • LIKE(body_text, "ws:access")
  • LIKE(body_text, "%ws:access%")

Both produce correct results, but as a Painless ScriptQuery that scans _source per document. The same intent expressed as match_phrase(body_text, "ws:access") returns the same 1271 hits via a native Lucene query. Adding a .keyword subfield to the mapping (the body_multi rows) automatically routes LIKE through wildcard_query against the subfield — fast.

(The rex-extracted lt rows also emit SCRIPT[ILIKE], but those are inherently script because lt is a Calcite-derived column not backed by an index — same root cause as Issue 1.)

Root causeopensearch/src/main/java/org/opensearch/sql/opensearch/request/PredicateAnalyzer.java:1348-1361:

public QueryExpression like(LiteralExpression literal, boolean caseSensitive) {
  String fieldName = getFieldReference();
  String keywordField = OpenSearchTextType.toKeywordSubField(fieldName, this.rel.getExprType());
  boolean isKeywordField = keywordField != null;
  if (isKeywordField) {
    builder = wildcardQuery(keywordField, ...).caseInsensitive(!caseSensitive);
    return this;
  }
  throw new UnsupportedOperationException("Like query is not supported for text field");
}

The throw is caught at PredicateAnalyzer.java:239-249 (broad catch (Throwable)) and falls back to ScriptQueryExpression, which reads _source per doc.

The comment at :1343-1346 explains why: emitting wildcard_query against analyzed text would change LIKE semantics, since Lucene wildcards run per-token. This was an intentional correctness-over-performance choice — see Related Issues.

Proposal — Issue 2

Both an optimization and a doc update are needed; they reinforce each other.

Optimization — rewrite LIKE(text_field, '%LITERAL%') to match_phrase

Add a planner-level rewrite that fires when all of the following hold:

  • Field is text without a .keyword subfield (the keyword path already pushes down via wildcard_query).
  • 2-arg LIKE. 3-arg LIKE with explicit case-sensitive flag falls through to the current script path.
  • Pattern shape is exactly '%LITERAL%' — leading and trailing % only, no interior %/_, no \-escapes. All other shapes (no wildcards, prefix-only, suffix-only, interior wildcards) fall through to the current script path.
  • The field's analyzer applied to LITERAL produces ≥1 token and those tokens survive without stopword/stemming loss. Cache the analyzer check per (field, literal).
  • Cluster setting plugins.calcite.like_to_match_phrase.enabled is on. Default off because the rewrite is analyzer-driven (case-insensitive under standard analyzer) and LIKE is contractually case-sensitive per [BUG] The like function is case insensitive in PPL but it should be case sensitive #4835.

Target is match_phrase, not match. Phrase semantics preserve token adjacency — closer to substring intent on analyzed text. Multi-token literals work; single-token degenerates to match correctly.

Other LIKE patterns on text fields ('LITERAL', 'LITERAL%', '%LITERAL', patterns with interior wildcards) cannot be safely rewritten with current Lucene primitives — they require character-anchored or character-position-aware matching that token-based queries can't express. They remain on the script path.

Blocker: #3668match_phrase doesn't push down on Calcite yet. The rewrite is moot until that lands, otherwise it just exchanges one script path for another.

Documentation — call out which LIKE patterns get optimized

Add a "performance notes" section to docs/user/ppl/cmd/where.md (or a dedicated page). The user-facing message must be specific about what is and isn't optimized:

LIKE pattern on a text field engine behavior
'%LITERAL%' (substring) Optimized to match_phrase when the opt-in setting is enabled — fast.
'LITERAL' (no wildcards, exact match) Per-doc source scan — slow. Use body_field.keyword = 'LITERAL' or restructure data.
'LITERAL%' (prefix) Per-doc source scan — slow. Use a .keyword subfield + LIKE on the subfield.
'%LITERAL' (suffix) Per-doc source scan — slow. No fast alternative on text; use .keyword.
any pattern with interior %/_ Per-doc source scan — slow. No optimization.

Also document:

  • Adding a .keyword subfield to the mapping is the most general fix — it routes all LIKE patterns through wildcard_query against the subfield, which is fast.
  • match_phrase(text_field, "literal") is always available as a direct alternative — the optimization above is just an automatic version of this rewrite for the substring case.
  • Cross-reference #4835 for the case-sensitivity gotcha (any LIKE is implicitly ILIKE today).

Not proposing:

Reproduction

Index template:

PUT logs-test
{
  "mappings": {
    "properties": {
      "body_text":    {"type": "text"},
      "body_keyword": {"type": "keyword"},
      "body_multi":   {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 32766}}}
    }
  }
}

Bulk-load 5000 docs where each body_* field holds a string like logtype=<lt> http_status=<n> uri="..." user_agent="..." with <lt> randomly drawn from the four logtypes.

Each cell ran via:

  • POST /_plugins/_ppl/_explain to extract DSL pushdown shape
  • POST /_plugins/_ppl to capture hit count

Calcite engine enabled (plugins.calcite.enabled: true).

Related issues

  • #3669 — Original feature request to push down LIKE on Calcite.
  • #3915 (merged 2025-08-13) — Introduced today's behavior. PR description: "we should only support pushdown like query for keyword field".
  • #3950 (closed 2025-10-22) — Asked for a _source script fallback to fill the text-field gap. Closed by:
  • #4010 (merged 2025-08-19) — Accepted the script path itself. That's exactly what we observe.
  • #3428 — Provides the semantic counterexamples (e.g., Amber Johnny vs Ambe_) cited as the reason NOT to emit wildcard_query for LIKE on tokenized text.
  • #4835LIKE should be case-sensitive in PPL but currently isn't (every LIKE cell in the matrix shows ILIKE / case_insensitive=true). Constrains any rewrite target.
  • #3668 — Open, unassigned. match_phrase doesn't push down on Calcite yet. Hard prerequisite for any LIKE → match_phrase rewrite to be useful, and the underlying cause of the rex-derived-column hard-fail in Issue 1.

Acceptance criteria

Issue 1 — early failure / docs:

  • Planner detects relevance functions on non-pushdownable columns and fails at planning time with a clear, user-facing error.
  • PPL relevance-function docs describe whole-value matching on keyword fields and the derived-column restriction.

Issue 2 — optimization (follow-up issue once #3668 ships):

  • Cluster setting plugins.calcite.like_to_match_phrase.enabled added, default off.
  • Rewrite emits match_phrase only for '%LITERAL%' on a text field with no .keyword subfield, 2-arg LIKE, and analyzer-token-preserving literal.
  • All other LIKE patterns fall through to the current script path unchanged.
  • Test coverage: standard analyzer, multi-token literals, hyphenated literals, prefix-only / suffix-only / no-wildcard / interior-wildcard patterns (must fall through), escape character LIKE, case-sensitive 3-arg LIKE (must fall through), missing field, NULL handling.

Issue 2 — documentation:

  • Performance-notes page lists each LIKE pattern shape and whether it's optimized or scans source.
  • where.md and the relevance-functions page cross-link to it.
  • Documentation describes the opt-in setting and the conditions under which the rewrite fires.

Reproduction artifacts

Available on request:

  • gen_logs_multi.py — sample data generator.
  • matrix.py — runs the full matrix against localhost:9200.
  • Saved EXPLAIN responses showing the verbatim DSL for each cell.

Metadata

Metadata

Assignees

No one assigned

    Labels

    PPLPiped processing languagecalcitecalcite migration releatedperformanceMake it fast!

    Type

    No type
    No fields configured for issues without a type.

    Projects

    Status
    Not Started

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions