[RFC] Query Insights: Profile and Analyze for PPL/SQL

## Problem Statement

PPL/SQL users have no way to understand **why** a query is slow or how to optimize it. Customers report slow queries, browser crashes, and stuck searches with no self-service diagnostics. The raw building blocks exist independently (estimated plan costs via `explain cost`, actual execution metrics via the `profile` API), but they are separate API calls returning different formats that only engine developers can interpret. There is no unified view comparing estimated vs. actual metrics, no human-readable interpretation, and no optimization suggestions.

This RFC directly addresses [#4343 (Support analyze alongside explain)](https://github.com/opensearch-project/sql/issues/4343).

## Current State

- **`explain cost`** provides estimated row counts and plan costs via Calcite's cost model, but only as raw plan text.
- **The `profile` API** ([#4983](https://github.com/opensearch-project/sql/pull/4983), [#5044](https://github.com/opensearch-project/sql/pull/5044)) provides actual execution time and row counts per operator as JSON.
- These two data sources are completely disconnected -- users must call them separately and mentally correlate operator nodes between different formats.
- No UI integration exists in Discover to surface this information.
- Competing tools (PostgreSQL `EXPLAIN ANALYZE`, BigQuery plan explanations, Kusto best-practices analyzer) provide unified estimated-vs-actual views natively.

## Long-Term Goals

- **Unified query diagnostics**: A single API call that returns both estimated and actual metrics per operator, enabling users to identify bottlenecks without needing engine-level expertise.
- **Self-service optimization**: Rule-based heuristic recommendations that surface actionable suggestions (e.g., "add a time filter", "this aggregation runs in-memory on 50M rows").
- **Extensible rule framework**: An `InsightRule` interface that allows contributors to add new optimization heuristics without modifying core analyzer logic.
- **UI integration**: A visual panel in Discover that renders the operator tree and recommendations, making query profiling accessible to all users.

## Proposal

Add an **EXPLAIN ANALYZE** endpoint that merges the existing `explain cost` and `profile` code paths into a unified operator tree, augmented by a rule-based **QueryInsightsAnalyzer** that produces human-readable optimization recommendations.

### Response Schema

```json
{
  "query": "source=my_index | where age > 30 | stats count() by city | sort count",
  "total_time_ms": 342,
  "operator_tree": [
    {
      "id": 1,
      "node_name": "OpenSearchIndexScan",
      "description": "source=my_index | where age > 30",
      "estimated_rows": 5000,
      "actual_rows": 4823,
      "estimated_cost": 12.5,
      "actual_time_ms": 210,
      "is_pushed_down": true,
      "children": []
    },
    {
      "id": 2,
      "node_name": "Aggregation",
      "description": "stats count() by city",
      "estimated_rows": 50,
      "actual_rows": 127,
      "estimated_cost": 3.2,
      "actual_time_ms": 95,
      "is_pushed_down": false,
      "children": [1]
    },
    {
      "id": 3,
      "node_name": "Sort",
      "description": "sort count",
      "estimated_rows": 50,
      "actual_rows": 127,
      "estimated_cost": 1.1,
      "actual_time_ms": 37,
      "is_pushed_down": false,
      "children": [2]
    }
  ],
  "recommendations": [
    {
      "severity": "warning",
      "rule": "cardinality_misestimation",
      "message": "Aggregation produced 127 rows but estimated 50 (2.5x). Consider updating index statistics.",
      "affected_node": 2,
      "suggestion": null
    }
  ]
}
```

### Core Heuristic Rules

| Rule | Severity | Trigger |
|------|----------|---------|
| `cardinality_misestimation` | WARNING | `actual_rows` differs from `estimated_rows` by >10x |
| `missing_time_filter` | WARNING | Full index scan on time-series index without time-range predicate |
| `pushdown_visibility` | INFO | Reports which operators executed in OpenSearch vs. in-memory |
| `expensive_in_memory_ops` | CRITICAL | Joins/sorts/aggregations in-memory above configurable threshold |
| `full_scan_percentage` | WARNING | Query scans >80% of total index documents |

## Approach

### Backend

1. **ExplainAnalyzeExecutor** - Runs both the `explain cost` path (Calcite `RelNode` tree with `RelMetadataQuery`) and the `profile` execution path, then merges into a unified tree where each node contains: `node_name`, `estimated_rows`, `actual_rows`, `estimated_cost`, `actual_time_ms`, `is_pushed_down`.

2. **REST endpoint** - `POST _plugins/_ppl/_analyze` (and future `_plugins/_sql/_analyze`) accepting `{ "query": "..." }`.

3. **QueryInsightsAnalyzer** - Walks the merged tree, evaluates rules implementing an `InsightRule` interface, enriches with index metadata from `_stats` API, and produces `Recommendation` objects.

```java
public interface InsightRule {
    String ruleId();
    List<Recommendation> evaluate(OperatorTree tree, IndexMetadata metadata);
}
```

### Frontend

4. **QueryInsightsPanel** (in `dashboards-query-workbench`) - A collapsible panel in Discover rendering:
   - Vertical operator tree with estimated vs. actual metrics per node
   - Color coding: green (pushed down), orange (in-memory), red (bottleneck)
   - Recommendation cards with severity icons and optional "Apply" buttons


## Implementation Discussion

- Should `_analyze` be a standalone endpoint or a parameter on the existing `_ppl` endpoint (e.g., `"explain": "analyze"`)?
- What is the maximum plan depth before truncating the response?
- Should recommendations be cacheable for repeated queries against the same index?

### Related Issues

- [#4343 - Support analyze alongside explain](https://github.com/opensearch-project/sql/issues/4343)
- [#4983 - Profile API](https://github.com/opensearch-project/sql/pull/4983)
- [#5044 - Profile API](https://github.com/opensearch-project/sql/pull/5044)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Query Insights: Profile and Analyze for PPL/SQL #5500

Problem Statement

Current State

Long-Term Goals

Proposal

Response Schema

Core Heuristic Rules

Approach

Backend

Frontend

Implementation Discussion

Related Issues

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Rule	Severity	Trigger
`cardinality_misestimation`	WARNING	`actual_rows` differs from `estimated_rows` by >10x
`missing_time_filter`	WARNING	Full index scan on time-series index without time-range predicate
`pushdown_visibility`	INFO	Reports which operators executed in OpenSearch vs. in-memory
`expensive_in_memory_ops`	CRITICAL	Joins/sorts/aggregations in-memory above configurable threshold
`full_scan_percentage`	WARNING	Query scans >80% of total index documents

[RFC] Query Insights: Profile and Analyze for PPL/SQL #5500

Description

Problem Statement

Current State

Long-Term Goals

Proposal

Response Schema

Core Heuristic Rules

Approach

Backend

Frontend

Implementation Discussion

Related Issues

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions