logtype is randomly drawn from {ws:access, ws:error, app:debug, app:info}.
Ground truth
1271 docs contain ws:access as a substring
993 docs contain Go-http-client as a substring
Correctness legend
OK — result matches the operator's defined behavior given the field type.
WRONG — result deviates from the operator's defined behavior.
FAIL — query errors out at runtime.
Note: rows that return 0 hits can still be OK when the operator is doing what it's defined to do. For example, LIKE(body_text, "ws:access") (no wildcards) means strict equality — long body string ≠ short literal → 0 hits is correct. match(body_keyword, "ws:access") does whole-value term matching against the keyword field — body value ≠ literal → 0 hits is correct.
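The distinction in this note can be sketched with a toy model. This is not the plugin's implementation: like_matches and term_matches are hypothetical helpers, escape characters are ignored, matching is case-insensitive to mirror the ILIKE behavior in the matrix, and the sample body values (status, uri, user agent) are made up.

```python
import re

def like_to_regex(pattern: str) -> str:
    """Translate a SQL-style LIKE pattern into a regex body (no escape handling)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # % matches any run of characters
        elif ch == "_":
            parts.append(".")    # _ matches exactly one character
        else:
            parts.append(re.escape(ch))
    return "".join(parts)

def like_matches(value: str, pattern: str) -> bool:
    # fullmatch anchors both ends, so a pattern without wildcards is strict equality.
    flags = re.IGNORECASE | re.DOTALL
    return re.fullmatch(like_to_regex(pattern), value, flags) is not None

def term_matches(keyword_value: str, literal: str) -> bool:
    # A keyword field holds one unanalyzed term, so only the whole value can match.
    return keyword_value == literal

# Made-up sample in the shape described under Reproduction.
body = 'logtype=ws:access http_status=200 uri="/x" user_agent="Go-http-client/1.1"'

print(like_matches(body, "ws:access"))      # no wildcards: equality against the long body
print(like_matches(body, "%ws:access%"))    # substring
print(term_matches(body, "ws:access"))      # whole-value term match
print(like_matches("ws:access", "ws:access"))  # the short derived lt value
```

Under this model the no-wildcard LIKE and the keyword term match both miss the long body value, while the same literal matches the short lt value, which is the pattern the matrix shows.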
Probe 1: ws:access
| query | hits | correctness | dsl |
| --- | --- | --- | --- |
| LIKE(body_text, "ws:access") | 0 | OK | SCRIPT[ILIKE] |
| LIKE(body_text, "%ws:access%") | 1271 | OK | SCRIPT[ILIKE] |
| match(body_text, "ws:access") | 1271 | OK | match |
| match_phrase(body_text, "ws:access") | 1271 | OK | match_phrase |
| LIKE(body_keyword, "ws:access") | 0 | OK | wildcard case_insensitive=true |
| LIKE(body_keyword, "%ws:access%") | 1271 | OK | wildcard case_insensitive=true |
| match(body_keyword, "ws:access") | 0 | OK | match |
| match_phrase(body_keyword, "ws:access") | 0 | OK | match_phrase |
| body_keyword = "ws:access" | 0 | OK | term |
| LIKE(body_multi, "ws:access") | 0 | OK | wildcard(.keyword) case_insensitive=true |
| LIKE(body_multi, "%ws:access%") | 1271 | OK | wildcard(.keyword) case_insensitive=true |
| match(body_multi, "ws:access") | 1271 | OK | match |
| match_phrase(body_multi, "ws:access") | 1271 | OK | match_phrase |
| LIKE(lt, "ws:access") | 1271 | OK | SCRIPT[ILIKE] |
| LIKE(lt, "%ws:access%") | 1271 | OK | SCRIPT[ILIKE] |
| match(lt, "ws:access") | ERR | FAIL | SCRIPT[match] (rejected at runtime) |
| match_phrase(lt, "ws:access") | ERR | FAIL | SCRIPT[match_phrase] (rejected at runtime) |
| lt = "ws:access" | 1271 | OK | SCRIPT[=] |
16 of 18 cells produce a result. Two cells hard-fail at runtime.
Issues identified from the matrix
Issue 1 — match / match_phrase behave inconsistently across field types
The same match / match_phrase operator produces three different outcomes depending on the field's type:
| field type | behavior | example |
| --- | --- | --- |
| text | works as expected: token / phrase match against the inverted index | match(body_text, "ws:access") → 1271 |
| text + .keyword | works as expected on the text path | match(body_multi, "ws:access") → 1271 |
| keyword | whole-value matching: typically returns 0 when the keyword value is a long string | match(body_keyword, "ws:access") → 0 |
| derived (rex/eval) | hard error: Relevance search query functions are only supported when they are pushed down | match(lt, "ws:access") → 500 |
The keyword case is Lucene-correct but easy to mistake for "no matches found." The derived-column case is a hard runtime failure: the planner emits a SCRIPT[match] clause; the script engine rejects relevance functions at compile time; the query never executes.
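A toy model of the text and keyword rows may make the difference concrete. Everything here is an assumption for illustration: toy_tokens is a stand-in that splits on non-alphanumeric characters and is not claimed to reproduce the standard analyzer's output, and the sample body is made up.

```python
import re

def toy_tokens(text: str) -> list:
    """Stand-in analyzer: lowercase, split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def match_phrase_text(field_value: str, query: str) -> bool:
    """Phrase match on an analyzed field: query tokens must appear adjacent and in order."""
    doc, q = toy_tokens(field_value), toy_tokens(query)
    if not q:
        return False
    return any(doc[i:i + len(q)] == q for i in range(len(doc) - len(q) + 1))

def match_keyword(field_value: str, query: str) -> bool:
    """match on a keyword field: the whole stored value is the only term."""
    return field_value == query

body = 'logtype=ws:access http_status=200 uri="/x" user_agent="demo"'

print(match_phrase_text(body, "ws:access"))  # analyzed text path finds the phrase
print(match_keyword(body, "ws:access"))      # keyword path compares the whole value
```

The keyword result is not a bug in this model; the query simply never equals the full stored string, which is why a 0-hit result is easy to misread as "no matches found."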
This is especially painful because "use match_phrase instead of LIKE" is the natural workaround for the perf issue below — but that workaround fails the moment a user touches a rex-extracted column.
Cross-reference: #3668 — relevance functions need pushdown plumbing in Calcite that doesn't exist yet for derived columns.
Proposal — Issue 1
Reject early, not at runtime. When the planner sees match / match_phrase against a non-pushdownable column (Calcite-derived, no underlying field), fail fast at planning time with a clear error message naming the column and suggesting alternatives. Don't generate a SCRIPT[match] clause that's guaranteed to fail server-side.
Document the keyword-field case in PPL relevance-function docs: match / match_phrase on a keyword field perform whole-value term matching, not substring/phrase. Common alternatives: use a text analyzer, query the .keyword subfield only with =/LIKE %X%, or restructure the data.
Track in #3668 — when relevance-function pushdown extends to derived columns, the hard-fail case becomes a script-path execution.
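The fail-fast check could look roughly like the sketch below. It is written in Python for brevity, although the real planner is Java; PlanningError, Column, and check_relevance_call are hypothetical names, and the error wording and suggested alternatives are only one possibility drawn from the matrix (LIKE and = do work on lt).

```python
from dataclasses import dataclass

RELEVANCE_FUNCTIONS = {"match", "match_phrase"}

class PlanningError(Exception):
    """Raised at planning time, before any request reaches the cluster."""

@dataclass(frozen=True)
class Column:
    name: str
    index_backed: bool  # False for rex/eval-derived columns

def check_relevance_call(func: str, column: Column) -> None:
    """Reject relevance functions on columns that cannot be pushed down."""
    if func in RELEVANCE_FUNCTIONS and not column.index_backed:
        raise PlanningError(
            f"{func}() cannot be applied to derived column '{column.name}': "
            "relevance functions require an indexed field. "
            "Consider LIKE or = on the derived column, "
            "or apply the function to the underlying source field."
        )

check_relevance_call("match", Column("body_text", index_backed=True))  # passes silently
try:
    check_relevance_call("match", Column("lt", index_backed=False))
except PlanningError as err:
    print(err)
```

The point of the design is that the user sees the column name and a next step instead of a 500 from the script engine.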
Issue 2 — LIKE on a pure text field is a script query (perf)
Two text-field rows emit SCRIPT[ILIKE] instead of a native Lucene query:
LIKE(body_text, "ws:access")
LIKE(body_text, "%ws:access%")
Both produce correct results, but each runs as a Painless ScriptQuery that scans _source per document. The substring intent of the second row, expressed as match_phrase(body_text, "ws:access"), returns the same 1271 hits via a native Lucene query. Adding a .keyword subfield to the mapping (the body_multi rows) automatically routes LIKE through wildcard_query against the subfield, which is fast.
(The rex-extracted lt rows also emit SCRIPT[ILIKE], but those are inherently script because lt is a Calcite-derived column not backed by an index — same root cause as Issue 1.)
Root cause — opensearch/src/main/java/org/opensearch/sql/opensearch/request/PredicateAnalyzer.java:1348-1361:
public QueryExpression like(LiteralExpression literal, boolean caseSensitive) {
  String fieldName = getFieldReference();
  String keywordField = OpenSearchTextType.toKeywordSubField(fieldName, this.rel.getExprType());
  boolean isKeywordField = keywordField != null;
  if (isKeywordField) {
    builder = wildcardQuery(keywordField, ...).caseInsensitive(!caseSensitive);
    return this;
  }
  throw new UnsupportedOperationException("Like query is not supported for text field");
}
The throw is caught at PredicateAnalyzer.java:239-249 (broad catch (Throwable)) and falls back to ScriptQueryExpression, which reads _source per doc.
The comment at :1343-1346 explains why: emitting wildcard_query against analyzed text would change LIKE semantics, since Lucene wildcards run per-token. This was an intentional correctness-over-performance choice — see Related Issues.
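The throw-then-fallback flow can be modeled in a few lines. This is a simplified Python model of the behavior described above, not a port of PredicateAnalyzer: the function names are hypothetical, the exception is narrower than the real catch (Throwable), and the returned strings are just the dsl labels from the matrix.

```python
class UnsupportedPushdown(Exception):
    """Stands in for the UnsupportedOperationException thrown by like()."""

def native_like(field_type: str, has_keyword_subfield: bool) -> str:
    """Only keyword-backed fields get a native wildcard query."""
    if field_type == "keyword":
        return "wildcard case_insensitive=true"
    if field_type == "text" and has_keyword_subfield:
        return "wildcard(.keyword) case_insensitive=true"
    raise UnsupportedPushdown("Like query is not supported for text field")

def plan_like(field_type: str, has_keyword_subfield: bool = False) -> str:
    try:
        return native_like(field_type, has_keyword_subfield)
    except UnsupportedPushdown:
        # Fallback path: the predicate is evaluated per document in a script.
        return "SCRIPT[ILIKE]"

for field_type, sub in [("text", False), ("keyword", False), ("text", True)]:
    print(field_type, sub, "->", plan_like(field_type, sub))
```

The model reproduces the three LIKE shapes in the matrix for indexed fields: script for pure text, wildcard for keyword, and wildcard against the subfield for text + .keyword.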
Proposal — Issue 2
Both an optimization and a doc update are needed; they reinforce each other.
Optimization — rewrite LIKE(text_field, '%LITERAL%') to match_phrase
Add a planner-level rewrite that fires when all of the following hold:
Field is text without a .keyword subfield (the keyword path already pushes down via wildcard_query).
2-arg LIKE. 3-arg LIKE with explicit case-sensitive flag falls through to the current script path.
Pattern shape is exactly '%LITERAL%' — leading and trailing % only, no interior %/_, no \-escapes. All other shapes (no wildcards, prefix-only, suffix-only, interior wildcards) fall through to the current script path.
The field's analyzer applied to LITERAL produces ≥1 token and those tokens survive without stopword/stemming loss. Cache the analyzer check per (field, literal).
Target is match_phrase, not match. Phrase semantics preserve token adjacency — closer to substring intent on analyzed text. Multi-token literals work; single-token degenerates to match correctly.
Other LIKE patterns on text fields ('LITERAL', 'LITERAL%', '%LITERAL', patterns with interior wildcards) cannot be safely rewritten with current Lucene primitives — they require character-anchored or character-position-aware matching that token-based queries can't express. They remain on the script path.
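The pattern-shape and field conditions above can be sketched as a small classifier. This is an illustration in Python with hypothetical names (substring_literal, rewrite_like); it deliberately leaves out the analyzer token check, which would need the field's real analyzer, and it returns descriptive strings rather than real query objects.

```python
from typing import Optional

def substring_literal(pattern: str) -> Optional[str]:
    """Return LITERAL if pattern is exactly '%LITERAL%', else None (fall through)."""
    if len(pattern) < 3 or not (pattern.startswith("%") and pattern.endswith("%")):
        return None
    literal = pattern[1:-1]
    # Interior wildcards or backslash escapes disqualify the rewrite.
    if any(ch in literal for ch in "%_\\"):
        return None
    return literal

def rewrite_like(field_type: str, has_keyword_subfield: bool,
                 pattern: str, arg_count: int = 2) -> str:
    literal = substring_literal(pattern)
    eligible = (
        field_type == "text"
        and not has_keyword_subfield   # keyword path already pushes down
        and arg_count == 2             # 3-arg LIKE stays on the script path
        and literal is not None
    )
    if eligible:
        return f'match_phrase(field, "{literal}")'
    return "script"

for p in ["%ws:access%", "ws:access", "ws:access%", "%ws:access", "%ws%access%"]:
    print(p, "->", rewrite_like("text", False, p))
```

Note that under this rule a literal containing an underscore, such as %http_status%, also falls through, because _ is a LIKE wildcard.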
Blocker: #3668. match_phrase doesn't push down on Calcite yet. The rewrite is moot until that lands, otherwise it just exchanges one script path for another.
Documentation — call out which LIKE patterns get optimized
Add a "performance notes" section to docs/user/ppl/cmd/where.md (or a dedicated page). The user-facing message must be specific about what is and isn't optimized:
| LIKE pattern on a text field | engine behavior |
| --- | --- |
| '%LITERAL%' (substring) | Optimized to match_phrase when the opt-in setting is enabled. Fast. |
| 'LITERAL' (no wildcards, exact match) | Per-doc source scan, slow. Use body_field.keyword = 'LITERAL' or restructure data. |
| 'LITERAL%' (prefix) | Per-doc source scan, slow. Use a .keyword subfield + LIKE on the subfield. |
| '%LITERAL' (suffix) | Per-doc source scan, slow. No fast alternative on text; use .keyword. |
| any pattern with interior %/_ | Per-doc source scan, slow. No optimization. |
Also document:
Adding a .keyword subfield to the mapping is the most general fix — it routes all LIKE patterns through wildcard_query against the subfield, which is fast.
match_phrase(text_field, "literal") is always available as a direct alternative — the optimization above is just an automatic version of this rewrite for the substring case.
Cross-reference #4835 for the case-sensitivity gotcha (any LIKE is implicitly ILIKE today).
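For the .keyword recommendation, a mapping along the following lines would give the three field shapes used in the matrix. This is a hedged sketch, not the index template used for the measurements; only the field names and types come from the test-data description, and the surrounding structure is the usual index-creation body.

```python
import json

# Assumed index-creation body: one pure text field, one pure keyword field,
# and one text field with a .keyword subfield.
index_body = {
    "mappings": {
        "properties": {
            "body_text": {"type": "text"},
            "body_keyword": {"type": "keyword"},
            "body_multi": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword"}},
            },
        }
    }
}

print(json.dumps(index_body, indent=2))
```

Only body_multi carries the subfield, which is what lets LIKE on that field route to a wildcard query instead of a script.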
Reproduction
Bulk-load 5000 docs where each body_* field holds a string like logtype=<lt> http_status=<n> uri="..." user_agent="..." with <lt> randomly drawn from the four logtypes.
Each cell ran via:
POST /_plugins/_ppl/_explain to extract DSL pushdown shape
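A data generator in the spirit of this setup might look like the sketch below. It is not gen_logs_multi.py: the index name, status codes, uri, and user agent are placeholders, the seed is arbitrary, and the counts it produces will not match the ground-truth numbers above. It builds the request bodies only and sends nothing.

```python
import json
import random

LOGTYPES = ["ws:access", "ws:error", "app:debug", "app:info"]

def make_docs(n: int = 5000, seed: int = 0) -> list:
    """Generate n docs whose three body_* fields hold the same log string."""
    rng = random.Random(seed)
    docs = []
    for _ in range(n):
        lt = rng.choice(LOGTYPES)
        status = rng.choice([200, 404, 500])  # placeholder values
        body = f'logtype={lt} http_status={status} uri="/demo" user_agent="demo-agent"'
        docs.append({"body_text": body, "body_keyword": body, "body_multi": body})
    return docs

def to_bulk(docs: list, index: str) -> str:
    """Serialize docs as an NDJSON _bulk payload (action line, then source line)."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

def ppl_payload(index: str, predicate: str) -> str:
    """Request body for POST /_plugins/_ppl or POST /_plugins/_ppl/_explain."""
    return json.dumps({"query": f"source={index} | where {predicate}"})

docs = make_docs()
ground_truth = sum("ws:access" in d["body_text"] for d in docs)
print(len(docs), "docs;", ground_truth, "contain ws:access")
print(ppl_payload("like_probe", 'LIKE(body_text, "%ws:access%")'))
```

Computing the substring count on the client side, as ground_truth does here, is what gives each matrix cell something independent to be checked against.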
Related issues
#3669: Original feature request to push down LIKE on Calcite.
#3915 (merged 2025-08-13): Introduced today's behavior. PR description: "we should only support pushdown like query for keyword field".
#3950 (closed 2025-10-22): Asked for a _source script fallback to fill the text-field gap. Closed by:
#4010 (merged 2025-08-19): Accepted the script path itself. That's exactly what we observe.
#3428: Provides the semantic counterexamples (e.g., Amber Johnny vs Ambe_) cited as the reason NOT to emit wildcard_query for LIKE on tokenized text.
#4835: LIKE should be case-sensitive in PPL but currently isn't (every LIKE cell in the matrix shows ILIKE / case_insensitive=true). Constrains any rewrite target.
#3668: Open, unassigned. match_phrase doesn't push down on Calcite yet. Hard prerequisite for any LIKE → match_phrase rewrite to be useful, and the underlying cause of the rex-derived-column hard-fail in Issue 1.
Acceptance criteria
Issue 1 — early failure / docs:
Planner detects relevance functions on non-pushdownable columns and fails at planning time with a clear, user-facing error.
PPL relevance-function docs describe whole-value matching on keyword fields and the derived-column restriction.
Issue 2 — optimization (follow-up issue once #3668 ships):
Rewrite emits match_phrase only for '%LITERAL%' on a text field with no .keyword subfield, 2-arg LIKE, and analyzer-token-preserving literal.
All other LIKE patterns fall through to the current script path unchanged.
Test coverage: standard analyzer, multi-token literals, hyphenated literals, prefix-only / suffix-only / no-wildcard / interior-wildcard patterns (must fall through), escape character LIKE, case-sensitive 3-arg LIKE (must fall through), missing field, NULL handling.
Issue 2 — documentation:
Performance-notes page lists each LIKE pattern shape and whether it's optimized or scans source.
where.md and the relevance-functions page cross-link to it.
Documentation describes the opt-in setting and the conditions under which the rewrite fires.
Reproduction artifacts
Available on request:
gen_logs_multi.py — sample data generator.
matrix.py — runs the full matrix against localhost:9200.
Saved EXPLAIN responses showing the verbatim DSL for each cell.
Test data
5000 docs, each with three identical fields holding a string like logtype=<lt> http_status=<n> uri="..." user_agent="...", plus one derived column:
body_text (text)
body_keyword (keyword)
body_multi (text + .keyword subfield)
lt = rex-extracted from body (VARCHAR)
Hit counts were captured via POST /_plugins/_ppl, with the Calcite engine enabled (plugins.calcite.enabled: true).
Opt-in setting
The LIKE → match_phrase rewrite proposed for Issue 2 additionally requires plugins.calcite.like_to_match_phrase.enabled to be on. Default off because the rewrite is analyzer-driven (case-insensitive under the standard analyzer) and LIKE is contractually case-sensitive per #4835 ([BUG] The like function is case insensitive in PPL but it should be case sensitive). The acceptance criteria for the Issue 2 optimization also include: plugins.calcite.like_to_match_phrase.enabled added, default off.
Not proposing:
… LIKE semantics: conflicts with #3428 ([BUG] like function behaviour in V2 is not correct) and #4835.
… LIKE → match rewrite: match defaults to OR over tokens, would over-match on multi-token literals.
… LIKE patterns other than '%LITERAL%': doc must clearly state these still scan the source.