[BUG/PERF] `LIKE` and relevance-function pushdown gaps on Calcite engine

## Test data

5000 docs, each with three identical fields holding a string like:
```
logtype=ws:access http_status=200 uri="..." user_agent="Go-http-client/1.1" ...
```
- `body_text` (text)
- `body_keyword` (keyword)
- `body_multi` (text + `.keyword` subfield)
- `lt` = rex-extracted from `body` (VARCHAR)

`logtype` is randomly drawn from `{ws:access, ws:error, app:debug, app:info}`.

## Ground truth

- 1271 docs contain `ws:access` as a substring
- 993 docs contain `Go-http-client` as a substring

## Correctness legend

- **OK** — result matches the operator's defined behavior given the field type.
- **WRONG** — result deviates from the operator's defined behavior.
- **FAIL** — query errors out at runtime.

Note: rows that return 0 hits can still be **OK** when the operator is doing what it's defined to do. For example, `LIKE(body_text, "ws:access")` (no wildcards) means strict equality — long body string ≠ short literal → 0 hits is correct. `match(body_keyword, "ws:access")` does whole-value term matching against the keyword field — body value ≠ literal → 0 hits is correct.

## Probe 1: `ws:access`

| query                                     | hits | correctness | dsl                                        |
|-------------------------------------------|------|-------------|--------------------------------------------|
| `LIKE(body_text, "ws:access")`            |    0 | OK          | `SCRIPT[ILIKE]`                            |
| `LIKE(body_text, "%ws:access%")`          | 1271 | OK          | `SCRIPT[ILIKE]`                            |
| `match(body_text, "ws:access")`           | 1271 | OK          | `match`                                    |
| `match_phrase(body_text, "ws:access")`    | 1271 | OK          | `match_phrase`                             |
| `LIKE(body_keyword, "ws:access")`         |    0 | OK          | `wildcard case_insensitive=true`           |
| `LIKE(body_keyword, "%ws:access%")`       | 1271 | OK          | `wildcard case_insensitive=true`           |
| `match(body_keyword, "ws:access")`        |    0 | OK          | `match`                                    |
| `match_phrase(body_keyword, "ws:access")` |    0 | OK          | `match_phrase`                             |
| `body_keyword = "ws:access"`              |    0 | OK          | `term`                                     |
| `LIKE(body_multi, "ws:access")`           |    0 | OK          | `wildcard(.keyword) case_insensitive=true` |
| `LIKE(body_multi, "%ws:access%")`         | 1271 | OK          | `wildcard(.keyword) case_insensitive=true` |
| `match(body_multi, "ws:access")`          | 1271 | OK          | `match`                                    |
| `match_phrase(body_multi, "ws:access")`   | 1271 | OK          | `match_phrase`                             |
| `LIKE(lt, "ws:access")`                   | 1271 | OK          | `SCRIPT[ILIKE]`                            |
| `LIKE(lt, "%ws:access%")`                 | 1271 | OK          | `SCRIPT[ILIKE]`                            |
| `match(lt, "ws:access")`                  | ERR  | **FAIL**    | `SCRIPT[match]` (rejected at runtime)      |
| `match_phrase(lt, "ws:access")`           | ERR  | **FAIL**    | `SCRIPT[match_phrase]` (rejected at runtime) |
| `lt = "ws:access"`                        | 1271 | OK          | `SCRIPT[=]`                                |

**16 of 18 cells produce a result. Two cells hard-fail at runtime.**

## Issues identified from the matrix

### Issue 1 — `match` / `match_phrase` behave inconsistently across field types

The same `match` / `match_phrase` operator produces three different outcomes depending on the field's type:

| field type             | behavior                                                                            | example                                            |
|------------------------|-------------------------------------------------------------------------------------|----------------------------------------------------|
| `text`                 | works as expected — token / phrase match against the inverted index                 | `match(body_text, "ws:access")` → 1271             |
| `text + .keyword`      | works as expected on the text path                                                  | `match(body_multi, "ws:access")` → 1271            |
| `keyword`              | whole-value matching — typically returns 0 when the keyword value is a long string  | `match(body_keyword, "ws:access")` → 0             |
| derived (`rex`/`eval`) | **hard error**: `Relevance search query functions are only supported when they are pushed down` | `match(lt, "ws:access")` → 500   |

The keyword case is Lucene-correct but easy to mistake for "no matches found." The derived-column case is a hard runtime failure: the planner emits a `SCRIPT[match]` clause; the script engine rejects relevance functions at compile time; the query never executes.

This is especially painful because "use `match_phrase` instead of `LIKE`" is the natural workaround for the perf issue below — but that workaround fails the moment a user touches a `rex`-extracted column.

Cross-reference: [#3668](https://github.com/opensearch-project/sql/issues/3668) — relevance functions need pushdown plumbing in Calcite that doesn't exist yet for derived columns.

**Proposal — Issue 1**

1. **Reject early, not at runtime.** When the planner sees `match` / `match_phrase` against a non-pushdownable column (Calcite-derived, no underlying field), fail fast at planning time with a clear error message naming the column and suggesting alternatives. Don't generate a `SCRIPT[match]` clause that's guaranteed to fail server-side.
2. **Document the keyword-field case** in PPL relevance-function docs: `match` / `match_phrase` on a `keyword` field perform whole-value term matching, not substring/phrase. Common alternatives: use a `text` analyzer, query the `.keyword` subfield only with `=`/`LIKE %X%`, or restructure the data.
3. **Track in [#3668](https://github.com/opensearch-project/sql/issues/3668)** — when relevance-function pushdown extends to derived columns, the hard-fail case becomes a `script`-path execution.

---

### Issue 2 — `LIKE` on a pure `text` field is a `script` query (perf)

Two text-field rows emit `SCRIPT[ILIKE]` instead of a native Lucene query:

- `LIKE(body_text, "ws:access")`
- `LIKE(body_text, "%ws:access%")`

Both produce **correct** results, but as a Painless `ScriptQuery` that scans `_source` per document. The same intent expressed as `match_phrase(body_text, "ws:access")` returns the same 1271 hits via a native Lucene query. Adding a `.keyword` subfield to the mapping (the `body_multi` rows) automatically routes `LIKE` through `wildcard_query` against the subfield — fast.

(The rex-extracted `lt` rows also emit `SCRIPT[ILIKE]`, but those are inherently script because `lt` is a Calcite-derived column not backed by an index — same root cause as Issue 1.)

**Root cause** — `opensearch/src/main/java/org/opensearch/sql/opensearch/request/PredicateAnalyzer.java:1348-1361`:

```java
public QueryExpression like(LiteralExpression literal, boolean caseSensitive) {
  String fieldName = getFieldReference();
  String keywordField = OpenSearchTextType.toKeywordSubField(fieldName, this.rel.getExprType());
  boolean isKeywordField = keywordField != null;
  if (isKeywordField) {
    builder = wildcardQuery(keywordField, ...).caseInsensitive(!caseSensitive);
    return this;
  }
  throw new UnsupportedOperationException("Like query is not supported for text field");
}
```

The throw is caught at `PredicateAnalyzer.java:239-249` (broad `catch (Throwable)`) and falls back to `ScriptQueryExpression`, which reads `_source` per doc.

The comment at `:1343-1346` explains why: emitting `wildcard_query` against analyzed text would change `LIKE` semantics, since Lucene wildcards run per-token. This was an intentional correctness-over-performance choice — see Related Issues.

**Proposal — Issue 2**

Both an optimization and a doc update are needed; they reinforce each other.

**Optimization — rewrite `LIKE(text_field, '%LITERAL%')` to `match_phrase`**

Add a planner-level rewrite that fires when all of the following hold:

- Field is `text` without a `.keyword` subfield (the keyword path already pushes down via `wildcard_query`).
- 2-arg `LIKE`. 3-arg `LIKE` with explicit case-sensitive flag falls through to the current script path.
- Pattern shape is exactly `'%LITERAL%'` — leading and trailing `%` only, no interior `%`/`_`, no `\`-escapes. All other shapes (no wildcards, prefix-only, suffix-only, interior wildcards) fall through to the current script path.
- The field's analyzer applied to `LITERAL` produces ≥1 token and those tokens survive without stopword/stemming loss. Cache the analyzer check per `(field, literal)`.
- Cluster setting `plugins.calcite.like_to_match_phrase.enabled` is on. **Default off** because the rewrite is analyzer-driven (case-insensitive under standard analyzer) and `LIKE` is contractually case-sensitive per #4835.

Target is **`match_phrase`**, not `match`. Phrase semantics preserve token adjacency — closer to substring intent on analyzed text. Multi-token literals work; single-token degenerates to `match` correctly.

Other `LIKE` patterns on text fields (`'LITERAL'`, `'LITERAL%'`, `'%LITERAL'`, patterns with interior wildcards) cannot be safely rewritten with current Lucene primitives — they require character-anchored or character-position-aware matching that token-based queries can't express. They remain on the script path.

**Blocker:** [#3668](https://github.com/opensearch-project/sql/issues/3668) — `match_phrase` doesn't push down on Calcite yet. The rewrite is moot until that lands, otherwise it just exchanges one script path for another.

**Documentation — call out which `LIKE` patterns get optimized**

Add a "performance notes" section to `docs/user/ppl/cmd/where.md` (or a dedicated page). The user-facing message must be specific about what is and isn't optimized:

| `LIKE` pattern on a `text` field         | engine behavior                                                         |
|------------------------------------------|-------------------------------------------------------------------------|
| `'%LITERAL%'` (substring)                | Optimized to `match_phrase` when the opt-in setting is enabled — fast. |
| `'LITERAL'` (no wildcards, exact match)  | Per-doc source scan — slow. Use `body_field.keyword = 'LITERAL'` or restructure data. |
| `'LITERAL%'` (prefix)                    | Per-doc source scan — slow. Use a `.keyword` subfield + `LIKE` on the subfield. |
| `'%LITERAL'` (suffix)                    | Per-doc source scan — slow. No fast alternative on `text`; use `.keyword`. |
| any pattern with interior `%`/`_`        | Per-doc source scan — slow. No optimization. |

Also document:
- Adding a `.keyword` subfield to the mapping is the most general fix — it routes all `LIKE` patterns through `wildcard_query` against the subfield, which is fast.
- `match_phrase(text_field, "literal")` is always available as a direct alternative — the optimization above is just an automatic version of this rewrite for the substring case.
- Cross-reference [#4835](https://github.com/opensearch-project/sql/issues/4835) for the case-sensitivity gotcha (any `LIKE` is implicitly `ILIKE` today).

**Not proposing:**
- A change to default `LIKE` semantics — conflicts with #3428 and #4835.
- A blanket `LIKE → match` rewrite — `match` defaults to OR over tokens, would over-match on multi-token literals.
- Optimization for `LIKE` patterns other than `'%LITERAL%'` — doc must clearly state these still scan the source.

## Reproduction

Index template:

```json
PUT logs-test
{
  "mappings": {
    "properties": {
      "body_text":    {"type": "text"},
      "body_keyword": {"type": "keyword"},
      "body_multi":   {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 32766}}}
    }
  }
}
```

Bulk-load 5000 docs where each `body_*` field holds a string like `logtype=<lt> http_status=<n> uri="..." user_agent="..."` with `<lt>` randomly drawn from the four logtypes.

Each cell ran via:
- `POST /_plugins/_ppl/_explain` to extract DSL pushdown shape
- `POST /_plugins/_ppl` to capture hit count

Calcite engine enabled (`plugins.calcite.enabled: true`).

## Related issues

- **[#3669](https://github.com/opensearch-project/sql/issues/3669)** — Original feature request to push down `LIKE` on Calcite.
- **[#3915](https://github.com/opensearch-project/sql/pull/3915)** (merged 2025-08-13) — Introduced today's behavior. PR description: *"we should only support pushdown like query for keyword field"*.
- **[#3950](https://github.com/opensearch-project/sql/issues/3950)** (closed 2025-10-22) — Asked for a `_source` script fallback to fill the text-field gap. Closed by:
- **[#4010](https://github.com/opensearch-project/sql/pull/4010)** (merged 2025-08-19) — Accepted the script path itself. That's exactly what we observe.
- **[#3428](https://github.com/opensearch-project/sql/issues/3428)** — Provides the semantic counterexamples (e.g., `Amber Johnny` vs `Ambe_`) cited as the reason NOT to emit `wildcard_query` for `LIKE` on tokenized text.
- **[#4835](https://github.com/opensearch-project/sql/issues/4835)** — `LIKE` should be case-sensitive in PPL but currently isn't (every `LIKE` cell in the matrix shows `ILIKE` / `case_insensitive=true`). Constrains any rewrite target.
- **[#3668](https://github.com/opensearch-project/sql/issues/3668)** — Open, unassigned. `match_phrase` doesn't push down on Calcite yet. **Hard prerequisite** for any `LIKE → match_phrase` rewrite to be useful, and **the underlying cause** of the rex-derived-column hard-fail in Issue 1.

## Acceptance criteria

**Issue 1 — early failure / docs:**
- [ ] Planner detects relevance functions on non-pushdownable columns and fails at planning time with a clear, user-facing error.
- [ ] PPL relevance-function docs describe whole-value matching on `keyword` fields and the derived-column restriction.

**Issue 2 — optimization (follow-up issue once #3668 ships):**
- [ ] Cluster setting `plugins.calcite.like_to_match_phrase.enabled` added, default off.
- [ ] Rewrite emits `match_phrase` only for `'%LITERAL%'` on a `text` field with no `.keyword` subfield, 2-arg `LIKE`, and analyzer-token-preserving literal.
- [ ] All other `LIKE` patterns fall through to the current script path unchanged.
- [ ] Test coverage: standard analyzer, multi-token literals, hyphenated literals, prefix-only / suffix-only / no-wildcard / interior-wildcard patterns (must fall through), escape character `LIKE`, case-sensitive 3-arg `LIKE` (must fall through), missing field, NULL handling.

**Issue 2 — documentation:**
- [ ] Performance-notes page lists each `LIKE` pattern shape and whether it's optimized or scans source.
- [ ] `where.md` and the relevance-functions page cross-link to it.
- [ ] Documentation describes the opt-in setting and the conditions under which the rewrite fires.

## Reproduction artifacts

Available on request:
- `gen_logs_multi.py` — sample data generator.
- `matrix.py` — runs the full matrix against `localhost:9200`.
- Saved EXPLAIN responses showing the verbatim DSL for each cell.


query	hits	correctness	dsl
`LIKE(body_text, "ws:access")`	0	OK	`SCRIPT[ILIKE]`
`LIKE(body_text, "%ws:access%")`	1271	OK	`SCRIPT[ILIKE]`
`match(body_text, "ws:access")`	1271	OK	`match`
`match_phrase(body_text, "ws:access")`	1271	OK	`match_phrase`
`LIKE(body_keyword, "ws:access")`	0	OK	`wildcard case_insensitive=true`
`LIKE(body_keyword, "%ws:access%")`	1271	OK	`wildcard case_insensitive=true`
`match(body_keyword, "ws:access")`	0	OK	`match`
`match_phrase(body_keyword, "ws:access")`	0	OK	`match_phrase`
`body_keyword = "ws:access"`	0	OK	`term`
`LIKE(body_multi, "ws:access")`	0	OK	`wildcard(.keyword) case_insensitive=true`
`LIKE(body_multi, "%ws:access%")`	1271	OK	`wildcard(.keyword) case_insensitive=true`
`match(body_multi, "ws:access")`	1271	OK	`match`
`match_phrase(body_multi, "ws:access")`	1271	OK	`match_phrase`
`LIKE(lt, "ws:access")`	1271	OK	`SCRIPT[ILIKE]`
`LIKE(lt, "%ws:access%")`	1271	OK	`SCRIPT[ILIKE]`
`match(lt, "ws:access")`	ERR	FAIL	`SCRIPT[match]` (rejected at runtime)
`match_phrase(lt, "ws:access")`	ERR	FAIL	`SCRIPT[match_phrase]` (rejected at runtime)
`lt = "ws:access"`	1271	OK	`SCRIPT[=]`

field type	behavior	example
`text`	works as expected — token / phrase match against the inverted index	`match(body_text, "ws:access")` → 1271
`text + .keyword`	works as expected on the text path	`match(body_multi, "ws:access")` → 1271
`keyword`	whole-value matching — typically returns 0 when the keyword value is a long string	`match(body_keyword, "ws:access")` → 0
derived (`rex`/`eval`)	hard error: `Relevance search query functions are only supported when they are pushed down`	`match(lt, "ws:access")` → 500

`LIKE` pattern on a `text` field	engine behavior
`'%LITERAL%'` (substring)	Optimized to `match_phrase` when the opt-in setting is enabled — fast.
`'LITERAL'` (no wildcards, exact match)	Per-doc source scan — slow. Use `body_field.keyword = 'LITERAL'` or restructure data.
`'LITERAL%'` (prefix)	Per-doc source scan — slow. Use a `.keyword` subfield + `LIKE` on the subfield.
`'%LITERAL'` (suffix)	Per-doc source scan — slow. No fast alternative on `text`; use `.keyword`.
any pattern with interior `%`/`_`	Per-doc source scan — slow. No optimization.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG/PERF] `LIKE` and relevance-function pushdown gaps on Calcite engine #5491

Test data

Ground truth

Correctness legend

Probe 1: `ws:access`

Issues identified from the matrix

Issue 1 — `match` / `match_phrase` behave inconsistently across field types

Issue 2 — `LIKE` on a pure `text` field is a `script` query (perf)

Reproduction

Related issues

Acceptance criteria

Reproduction artifacts

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[BUG/PERF] LIKE and relevance-function pushdown gaps on Calcite engine #5491

Description

Test data

Ground truth

Correctness legend

Probe 1: ws:access

Issues identified from the matrix

Issue 1 — match / match_phrase behave inconsistently across field types

Issue 2 — LIKE on a pure text field is a script query (perf)

Reproduction

Related issues

Acceptance criteria

Reproduction artifacts

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

[BUG/PERF] `LIKE` and relevance-function pushdown gaps on Calcite engine #5491

Probe 1: `ws:access`

Issue 1 — `match` / `match_phrase` behave inconsistently across field types

Issue 2 — `LIKE` on a pure `text` field is a `script` query (perf)