Problem
PPL rex with N named capture groups runs the regex matcher N times per row, even though all N groups can be filled from a single Matcher.find() result. The cost is structural to the current lowering, not a bug.
CalciteRelNodeVisitor.innerRex (core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java:378-413) parses the pattern, finds all (?<name>...) groups, and emits one REX_EXTRACT(field, pattern, name_i) UDF call per group. Each call independently runs the matcher:
// RexExtractFunction.executeExtraction, core/.../udf/RexExtractFunction.java:121-138
Pattern compiledPattern = RegexCommonUtils.getCompiledPattern(pattern); // cached
Matcher matcher = compiledPattern.matcher(text);
if (matcher.find()) { return extractor.apply(matcher); }
Pattern compilation is cached globally (RegexCommonUtils.getCompiledPattern at lines 48-55), so compilation isn't the cost. The cost is matcher.find() running N times over the same text per row. Common-subexpression elimination (CSE) doesn't apply: each REX_EXTRACT call differs in the third argument (groupName), so Calcite treats them as independent expressions.
The same applies to multiple sequential rex commands on the same field: each rex emits its own REX_EXTRACT calls; no fusion happens across commands either.
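The per-group cost described above can be illustrated with plain java.util.regex, outside the plugin. This is a minimal sketch: the class name, method names, the findCalls counter, and the sample pattern and text are all hypothetical and only serve to make the number of find() invocations visible.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: names here are hypothetical, not taken from the plugin.
final class RexMatcherCostSketch {
  // Counts find() invocations so the two strategies can be compared.
  static int findCalls = 0;

  // Mirrors the current shape: one independent matcher run per requested group.
  static String extractOne(Pattern p, String text, String group) {
    Matcher m = p.matcher(text);
    findCalls++;
    return m.find() ? m.group(group) : null;
  }

  // Single pass: one find(), then every group is read from the same match.
  static Map<String, String> extractAll(Pattern p, String text, List<String> groups) {
    Map<String, String> out = new LinkedHashMap<>();
    Matcher m = p.matcher(text);
    findCalls++;
    if (m.find()) {
      for (String g : groups) {
        out.put(g, m.group(g));
      }
    }
    return out;
  }

  public static void main(String[] args) {
    Pattern p = Pattern.compile("user=(?<user>\\w+) status=(?<status>\\d+) ms=(?<ms>\\d+)");
    String text = "GET /x user=alice status=200 ms=17";
    List<String> groups = List.of("user", "status", "ms");

    findCalls = 0;
    Map<String, String> perGroup = new LinkedHashMap<>();
    for (String g : groups) {
      perGroup.put(g, extractOne(p, text, g));
    }
    int perGroupCalls = findCalls;

    findCalls = 0;
    Map<String, String> singlePass = extractAll(p, text, groups);
    int singlePassCalls = findCalls;

    System.out.println(perGroup.equals(singlePass));              // prints true
    System.out.println(perGroupCalls + " vs " + singlePassCalls); // prints 3 vs 1
  }
}
```

Both strategies produce the same values; only the number of matcher runs differs, which is the multiplier the current lowering pays per row.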
Concrete impact — a typical access-log analytics query with four sequential rex commands on the same field:
source=<index> | where match(body, '<keyword>')
| rex field=body "field_a=(?<fieldA>[^\s]+)"
| rex field=body "field_b=(?<fieldB>\d+)"
| rex field=body "field_c=\"(?<fieldC>[^\"]+)\""
| rex field=body "field_d=\"(?<fieldD>[^\"]+)\""
| ...
…runs matcher.find() 4× per row. Combining them into a single multi-group rex doesn't help (still 4 UDF calls, just with a more expensive combined pattern). On high-volume log indices this is a meaningful per-row cost multiplier.
Proposed fix
Add a fused UDF REX_EXTRACT_ALL(field, pattern) returning a MAP<VARCHAR, VARCHAR> (or a struct row) containing all named groups from one matcher invocation. Modify innerRex so that, when the pattern has ≥2 named groups, it emits a single REX_EXTRACT_ALL call and projects each named group as MAP_GET(rex_result, "name_i"). For single-group patterns, keep the current direct call (no map overhead).
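A rough sketch of what the core of such a UDF could look like, in plain Java without the Calcite UDF plumbing. Everything here is an assumption for illustration: the class name, the namedGroups helper, the simple scan used to discover group names, and the choice to return an empty map when there is no match or the input is null (the actual null and no-match semantics would need to match what REX_EXTRACT does today).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the core of REX_EXTRACT_ALL; not the plugin's code.
final class RexExtractAllSketch {
  // Finds "(?<name>" openers. Assumption: this simple scan is enough for
  // illustration; it ignores escaping and character classes in the pattern.
  private static final Pattern GROUP_NAME =
      Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");

  // Returns the named groups of a regex in declaration order.
  static List<String> namedGroups(String regex) {
    List<String> names = new ArrayList<>();
    Matcher m = GROUP_NAME.matcher(regex);
    while (m.find()) {
      names.add(m.group(1));
    }
    return names;
  }

  // One matcher run per row; every named group is read from that one match.
  static Map<String, String> extractAll(String text, String regex) {
    Map<String, String> result = new LinkedHashMap<>();
    if (text == null) {
      return result; // assumption: null input yields an empty map
    }
    // Real code would go through the cached pattern lookup instead of compile().
    Matcher m = Pattern.compile(regex).matcher(text);
    if (m.find()) {
      for (String name : namedGroups(regex)) {
        result.put(name, m.group(name)); // null if the group did not participate
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> r =
        extractAll("GET /x user=alice status=200", "user=(?<user>\\w+) status=(?<status>\\d+)");
    System.out.println(r); // prints {user=alice, status=200}
  }
}
```

In the real UDF the group-name list would be computed once per pattern rather than once per row, for example alongside the cached compiled pattern.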
For the multi-rex-on-same-field case (the query above), add a Calcite HEP rule that fuses adjacent REX_EXTRACT / REX_EXTRACT_ALL calls with identical (field, pattern) operands across consecutive projections. That handles the case where the user wrote four separate rex commands rather than one combined one.
The visitor change alone fixes single-rex multi-group; the HEP rule extends the fix to the more common multi-rex pattern in real queries.
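The keying the fusion rule relies on can be sketched independently of Calcite. The following is not a RelOptRule; it is plain Java over operand tuples, with hypothetical class and field names, and it only shows how calls sharing an identical (field, pattern) pair collapse into one fused call that serves several group names.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the grouping step only; a real rule would operate on RexCall nodes.
final class RexFusionSketch {
  // One REX_EXTRACT(field, pattern, group) call, reduced to its operands.
  static final class Call {
    final String field;
    final String pattern;
    final String group;

    Call(String field, String pattern, String group) {
      this.field = field;
      this.pattern = pattern;
      this.group = group;
    }
  }

  // Groups calls by (field, pattern). Each key stands for one fused call;
  // the value lists the group names that call must serve.
  static Map<List<String>, Set<String>> fuse(List<Call> calls) {
    Map<List<String>, Set<String>> fused = new LinkedHashMap<>();
    for (Call c : calls) {
      fused
          .computeIfAbsent(List.of(c.field, c.pattern), k -> new LinkedHashSet<>())
          .add(c.group);
    }
    return fused;
  }

  public static void main(String[] args) {
    String p1 = "user=(?<user>\\w+) status=(?<status>\\d+)";
    String p2 = "ms=(?<ms>\\d+)";
    List<Call> calls = List.of(
        new Call("body", p1, "user"),
        new Call("body", p1, "status"),
        new Call("body", p2, "ms"));

    Map<List<String>, Set<String>> fused = fuse(calls);
    System.out.println(fused.size()); // prints 2
    fused.forEach((key, groups) -> System.out.println(key + " -> " + groups));
  }
}
```

Under this keying, calls on the same field whose patterns differ remain separate fused calls; only calls with the same field and the same pattern string are merged.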
Files to touch
- New UDF: core/src/main/java/org/opensearch/sql/expression/function/udf/RexExtractAllFunction.java
- Registration: core/src/main/java/org/opensearch/sql/expression/function/PPLFuncImpTable.java and BuiltinFunctionName.java
- Visitor: core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java:391-413 (multi-group branch emits one call + map projections)
- Optional HEP rule: core/src/main/java/org/opensearch/sql/calcite/plan/rule/RexExtractFusionRule.java, register in HEP_PROGRAM
Verification
- CalcitePPLRexTest cases covering 1-group, 2-group, and 3-group patterns; verify the lowered plan contains 1 REX_EXTRACT_ALL (not N REX_EXTRACTs) for ≥2 groups.
- Integration test on TEST_INDEX_BANK (or similar) with a multi-group pattern, asserting result equivalence with the current behavior.
- Microbenchmark (or rough timing) on a match-filtered index showing per-row cost flat in N (the number of named groups) instead of linear.
Out of scope
- The change doesn't alter public rex syntax or semantics: same input, same output, fewer matcher invocations.
- Ingest-time extraction via grok/dissect is the broader perf recommendation for users but orthogonal to this code change.