Problem Statement
PPL/SQL users have no way to understand why a query is slow or how to optimize it. Customers report slow queries, browser crashes, and stuck searches with no self-service diagnostics. The raw building blocks exist independently (estimated plan costs via explain cost, actual execution metrics via the profile API), but they are separate API calls returning different formats that only engine developers can interpret. There is no unified view comparing estimated vs. actual metrics, no human-readable interpretation, and no optimization suggestions.
This RFC directly addresses #4343 (Support analyze alongside explain).
Current State
explain cost provides estimated row counts and plan costs via Calcite's cost model, but only as raw plan text.
- The
profile API (#4983, #5044) provides actual execution time and row counts per operator as JSON.
- These two data sources are completely disconnected -- users must call them separately and mentally correlate operator nodes between different formats.
- No UI integration exists in Discover to surface this information.
- Competing tools (PostgreSQL
EXPLAIN ANALYZE, BigQuery plan explanations, Kusto best-practices analyzer) provide unified estimated-vs-actual views natively.
Long-Term Goals
- Unified query diagnostics: A single API call that returns both estimated and actual metrics per operator, enabling users to identify bottlenecks without needing engine-level expertise.
- Self-service optimization: Rule-based heuristic recommendations that surface actionable suggestions (e.g., "add a time filter", "this aggregation runs in-memory on 50M rows").
- Extensible rule framework: An
InsightRule interface that allows contributors to add new optimization heuristics without modifying core analyzer logic.
- UI integration: A visual panel in Discover that renders the operator tree and recommendations, making query profiling accessible to all users.
Proposal
Add an EXPLAIN ANALYZE endpoint that merges the existing explain cost and profile code paths into a unified operator tree, augmented by a rule-based QueryInsightsAnalyzer that produces human-readable optimization recommendations.
Response Schema
{
"query": "source=my_index | where age > 30 | stats count() by city | sort count",
"total_time_ms": 342,
"operator_tree": [
{
"id": 1,
"node_name": "OpenSearchIndexScan",
"description": "source=my_index | where age > 30",
"estimated_rows": 5000,
"actual_rows": 4823,
"estimated_cost": 12.5,
"actual_time_ms": 210,
"is_pushed_down": true,
"children": []
},
{
"id": 2,
"node_name": "Aggregation",
"description": "stats count() by city",
"estimated_rows": 50,
"actual_rows": 127,
"estimated_cost": 3.2,
"actual_time_ms": 95,
"is_pushed_down": false,
"children": [1]
},
{
"id": 3,
"node_name": "Sort",
"description": "sort count",
"estimated_rows": 50,
"actual_rows": 127,
"estimated_cost": 1.1,
"actual_time_ms": 37,
"is_pushed_down": false,
"children": [2]
}
],
"recommendations": [
{
"severity": "warning",
"rule": "cardinality_misestimation",
"message": "Aggregation produced 127 rows but estimated 50 (2.5x). Consider updating index statistics.",
"affected_node": 2,
"suggestion": null
}
]
}
Core Heuristic Rules
| Rule |
Severity |
Trigger |
cardinality_misestimation |
WARNING |
actual_rows differs from estimated_rows by >10x |
missing_time_filter |
WARNING |
Full index scan on time-series index without time-range predicate |
pushdown_visibility |
INFO |
Reports which operators executed in OpenSearch vs. in-memory |
expensive_in_memory_ops |
CRITICAL |
Joins/sorts/aggregations in-memory above configurable threshold |
full_scan_percentage |
WARNING |
Query scans >80% of total index documents |
Approach
Backend
-
ExplainAnalyzeExecutor - Runs both the explain cost path (Calcite RelNode tree with RelMetadataQuery) and the profile execution path, then merges into a unified tree where each node contains: node_name, estimated_rows, actual_rows, estimated_cost, actual_time_ms, is_pushed_down.
-
REST endpoint - POST _plugins/_ppl/_analyze (and future _plugins/_sql/_analyze) accepting { "query": "..." }.
-
QueryInsightsAnalyzer - Walks the merged tree, evaluates rules implementing an InsightRule interface, enriches with index metadata from _stats API, and produces Recommendation objects.
public interface InsightRule {
String ruleId();
List<Recommendation> evaluate(OperatorTree tree, IndexMetadata metadata);
}
Frontend
- QueryInsightsPanel (in
dashboards-query-workbench) - A collapsible panel in Discover rendering:
- Vertical operator tree with estimated vs. actual metrics per node
- Color coding: green (pushed down), orange (in-memory), red (bottleneck)
- Recommendation cards with severity icons and optional "Apply" buttons
Implementation Discussion
- Should
_analyze be a standalone endpoint or a parameter on the existing _ppl endpoint (e.g., "explain": "analyze")?
- What is the maximum plan depth before truncating the response?
- Should recommendations be cacheable for repeated queries against the same index?
Related Issues
Problem Statement
PPL/SQL users have no way to understand why a query is slow or how to optimize it. Customers report slow queries, browser crashes, and stuck searches with no self-service diagnostics. The raw building blocks exist independently (estimated plan costs via
explain cost, actual execution metrics via theprofileAPI), but they are separate API calls returning different formats that only engine developers can interpret. There is no unified view comparing estimated vs. actual metrics, no human-readable interpretation, and no optimization suggestions.This RFC directly addresses #4343 (Support analyze alongside explain).
Current State
explain costprovides estimated row counts and plan costs via Calcite's cost model, but only as raw plan text.profileAPI (#4983, #5044) provides actual execution time and row counts per operator as JSON.EXPLAIN ANALYZE, BigQuery plan explanations, Kusto best-practices analyzer) provide unified estimated-vs-actual views natively.Long-Term Goals
InsightRuleinterface that allows contributors to add new optimization heuristics without modifying core analyzer logic.Proposal
Add an EXPLAIN ANALYZE endpoint that merges the existing
explain costandprofilecode paths into a unified operator tree, augmented by a rule-based QueryInsightsAnalyzer that produces human-readable optimization recommendations.Response Schema
{ "query": "source=my_index | where age > 30 | stats count() by city | sort count", "total_time_ms": 342, "operator_tree": [ { "id": 1, "node_name": "OpenSearchIndexScan", "description": "source=my_index | where age > 30", "estimated_rows": 5000, "actual_rows": 4823, "estimated_cost": 12.5, "actual_time_ms": 210, "is_pushed_down": true, "children": [] }, { "id": 2, "node_name": "Aggregation", "description": "stats count() by city", "estimated_rows": 50, "actual_rows": 127, "estimated_cost": 3.2, "actual_time_ms": 95, "is_pushed_down": false, "children": [1] }, { "id": 3, "node_name": "Sort", "description": "sort count", "estimated_rows": 50, "actual_rows": 127, "estimated_cost": 1.1, "actual_time_ms": 37, "is_pushed_down": false, "children": [2] } ], "recommendations": [ { "severity": "warning", "rule": "cardinality_misestimation", "message": "Aggregation produced 127 rows but estimated 50 (2.5x). Consider updating index statistics.", "affected_node": 2, "suggestion": null } ] }Core Heuristic Rules
cardinality_misestimationactual_rowsdiffers fromestimated_rowsby >10xmissing_time_filterpushdown_visibilityexpensive_in_memory_opsfull_scan_percentageApproach
Backend
ExplainAnalyzeExecutor - Runs both the
explain costpath (CalciteRelNodetree withRelMetadataQuery) and theprofileexecution path, then merges into a unified tree where each node contains:node_name,estimated_rows,actual_rows,estimated_cost,actual_time_ms,is_pushed_down.REST endpoint -
POST _plugins/_ppl/_analyze(and future_plugins/_sql/_analyze) accepting{ "query": "..." }.QueryInsightsAnalyzer - Walks the merged tree, evaluates rules implementing an
InsightRuleinterface, enriches with index metadata from_statsAPI, and producesRecommendationobjects.Frontend
dashboards-query-workbench) - A collapsible panel in Discover rendering:Implementation Discussion
_analyzebe a standalone endpoint or a parameter on the existing_pplendpoint (e.g.,"explain": "analyze")?Related Issues