Skip to content

[RFC] Query Insights: Profile and Analyze for PPL/SQL #5500

@anasalkouz

Description

@anasalkouz

Problem Statement

PPL/SQL users have no way to understand why a query is slow or how to optimize it. Customers report slow queries, browser crashes, and stuck searches with no self-service diagnostics. The raw building blocks exist independently (estimated plan costs via explain cost, actual execution metrics via the profile API), but they are separate API calls returning different formats that only engine developers can interpret. There is no unified view comparing estimated vs. actual metrics, no human-readable interpretation, and no optimization suggestions.

This RFC directly addresses #4343 (Support analyze alongside explain).

Current State

  • explain cost provides estimated row counts and plan costs via Calcite's cost model, but only as raw plan text.
  • The profile API (#4983, #5044) provides actual execution time and row counts per operator as JSON.
  • These two data sources are completely disconnected -- users must call them separately and mentally correlate operator nodes between different formats.
  • No UI integration exists in Discover to surface this information.
  • Competing tools (PostgreSQL EXPLAIN ANALYZE, BigQuery plan explanations, Kusto best-practices analyzer) provide unified estimated-vs-actual views natively.

Long-Term Goals

  • Unified query diagnostics: A single API call that returns both estimated and actual metrics per operator, enabling users to identify bottlenecks without needing engine-level expertise.
  • Self-service optimization: Rule-based heuristic recommendations that surface actionable suggestions (e.g., "add a time filter", "this aggregation runs in-memory on 50M rows").
  • Extensible rule framework: An InsightRule interface that allows contributors to add new optimization heuristics without modifying core analyzer logic.
  • UI integration: A visual panel in Discover that renders the operator tree and recommendations, making query profiling accessible to all users.

Proposal

Add an EXPLAIN ANALYZE endpoint that merges the existing explain cost and profile code paths into a unified operator tree, augmented by a rule-based QueryInsightsAnalyzer that produces human-readable optimization recommendations.

Response Schema

{
  "query": "source=my_index | where age > 30 | stats count() by city | sort count",
  "total_time_ms": 342,
  "operator_tree": [
    {
      "id": 1,
      "node_name": "OpenSearchIndexScan",
      "description": "source=my_index | where age > 30",
      "estimated_rows": 5000,
      "actual_rows": 4823,
      "estimated_cost": 12.5,
      "actual_time_ms": 210,
      "is_pushed_down": true,
      "children": []
    },
    {
      "id": 2,
      "node_name": "Aggregation",
      "description": "stats count() by city",
      "estimated_rows": 50,
      "actual_rows": 127,
      "estimated_cost": 3.2,
      "actual_time_ms": 95,
      "is_pushed_down": false,
      "children": [1]
    },
    {
      "id": 3,
      "node_name": "Sort",
      "description": "sort count",
      "estimated_rows": 50,
      "actual_rows": 127,
      "estimated_cost": 1.1,
      "actual_time_ms": 37,
      "is_pushed_down": false,
      "children": [2]
    }
  ],
  "recommendations": [
    {
      "severity": "warning",
      "rule": "cardinality_misestimation",
      "message": "Aggregation produced 127 rows but estimated 50 (2.5x). Consider updating index statistics.",
      "affected_node": 2,
      "suggestion": null
    }
  ]
}

Core Heuristic Rules

Rule Severity Trigger
cardinality_misestimation WARNING actual_rows differs from estimated_rows by >10x
missing_time_filter WARNING Full index scan on time-series index without time-range predicate
pushdown_visibility INFO Reports which operators executed in OpenSearch vs. in-memory
expensive_in_memory_ops CRITICAL Joins/sorts/aggregations in-memory above configurable threshold
full_scan_percentage WARNING Query scans >80% of total index documents

Approach

Backend

  1. ExplainAnalyzeExecutor - Runs both the explain cost path (Calcite RelNode tree with RelMetadataQuery) and the profile execution path, then merges into a unified tree where each node contains: node_name, estimated_rows, actual_rows, estimated_cost, actual_time_ms, is_pushed_down.

  2. REST endpoint - POST _plugins/_ppl/_analyze (and future _plugins/_sql/_analyze) accepting { "query": "..." }.

  3. QueryInsightsAnalyzer - Walks the merged tree, evaluates rules implementing an InsightRule interface, enriches with index metadata from _stats API, and produces Recommendation objects.

public interface InsightRule {
    String ruleId();
    List<Recommendation> evaluate(OperatorTree tree, IndexMetadata metadata);
}

Frontend

  1. QueryInsightsPanel (in dashboards-query-workbench) - A collapsible panel in Discover rendering:
    • Vertical operator tree with estimated vs. actual metrics per node
    • Color coding: green (pushed down), orange (in-memory), red (bottleneck)
    • Recommendation cards with severity icons and optional "Apply" buttons

Implementation Discussion

  • Should _analyze be a standalone endpoint or a parameter on the existing _ppl endpoint (e.g., "explain": "analyze")?
  • What is the maximum plan depth before truncating the response?
  • Should recommendations be cacheable for repeated queries against the same index?

Related Issues

Metadata

Metadata

Assignees

No one assigned

    Labels

    RFCRequest For Commentscalcitecalcite migration releatedfeatureperformanceMake it fast!

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions