
HYPERFLEET-856 - feat: add deletion observability metrics and alerts #115

Open
rafabene wants to merge 1 commit into openshift-hyperfleet:main from
rafabene:HYPERFLEET-856-add-deletion-observability-metrics

Conversation

@rafabene
Contributor

@rafabene rafabene commented Apr 28, 2026

Summary

  • Add three API-side Prometheus metrics for deletion observability:
    • hyperfleet_api_resource_terminating_total (counter) — tracks resources entering Terminating state
    • hyperfleet_api_resource_terminating_duration_seconds (histogram) — measures soft-delete to hard-delete duration (populated when hard-delete flow lands)
    • hyperfleet_api_resource_terminating_stuck (gauge via collector) — queries DB on each scrape for resources stuck beyond configurable threshold
  • Add PrometheusRule template with two alerts:
    • HyperFleetResourceDeletionStuck (warning) — resources stuck >30min for 5m
    • HyperFleetResourceDeletionTimeout (critical) — resources stuck >30min for 30m
  • Add --metrics-deletion-stuck-threshold config flag (default 30m)
  • Instrument SoftDelete in cluster and nodepool services with RecordTerminating()
  • Add unit tests for all metrics (counter, histogram, collector, labels, buckets, reset)
  • Add integration test for TerminatingCollector against real PostgreSQL via testcontainers
  • Document metrics, alerts, and PromQL queries in docs/metrics.md
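
Roughly how the counter, histogram, and soft-delete hook fit together (an illustrative sketch only; bucket boundaries and registration wiring are assumptions, not the final code):

package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
    // Counter: incremented each time a resource enters the Terminating state.
    terminatingTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "hyperfleet_api_resource_terminating_total",
            Help: "Total number of resources that entered the Terminating state.",
        },
        []string{"resource_type"},
    )

    // Histogram: observed once the hard-delete flow lands, measuring the
    // soft-delete to hard-delete duration. Buckets here are illustrative.
    terminatingDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "hyperfleet_api_resource_terminating_duration_seconds",
            Help:    "Seconds between soft delete and hard delete.",
            Buckets: prometheus.ExponentialBuckets(60, 2, 10), // ~1m up to ~8.5h
        },
        []string{"resource_type"},
    )
)

func init() {
    prometheus.MustRegister(terminatingTotal, terminatingDuration)
}

// RecordTerminating is what the cluster and nodepool SoftDelete paths call.
func RecordTerminating(resourceType string) {
    terminatingTotal.WithLabelValues(resourceType).Inc()
}

// ObserveTerminatingDuration gets wired up once hard deletes exist.
func ObserveTerminatingDuration(resourceType string, seconds float64) {
    terminatingDuration.WithLabelValues(resourceType).Observe(seconds)
}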

Test plan

  • make test — 796 unit tests passing
  • make lint — 0 issues
  • make test-helm — all chart templates OK (including new PrometheusRule)
  • make test-integration — 93 integration tests passing (2 new for TerminatingCollector)

HYPERFLEET-856

Summary by CodeRabbit

  • New Features

    • Adds Prometheus alerts for stuck pending-deletion (warning default 30m, critical default 2h) and runtime metrics: stuck gauge, pending-deletion counter, duration histogram; services emit these metrics.
  • Configuration

    • Chart/CLI/env options to enable alerts, set deletion_stuck_threshold (default 30m), namespace/labels, and per-rule runbook URLs.
  • Documentation

    • Expanded metrics docs with PromQL examples.
  • Tests

    • Unit and integration tests for deletion metrics and collector.
  • Chores

    • DB migration adding indexes on deleted_time.
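
A sketch of what that migration could look like, assuming a gormigrate-style setup (the index names and the exact migration mechanism are assumptions, not the repository's actual file):

package migrations

import (
    gormigrate "github.com/go-gormigrate/gormigrate/v2"
    "gorm.io/gorm"
)

func addDeletedTimeIndexes() *gormigrate.Migration {
    return &gormigrate.Migration{
        ID: "202604290001",
        Migrate: func(tx *gorm.DB) error {
            // Partial indexes: only soft-deleted rows are indexed, which is
            // exactly what the scrape-time COUNT(*) queries filter on.
            if err := tx.Exec("CREATE INDEX IF NOT EXISTS idx_clusters_deleted_time ON clusters (deleted_time) WHERE deleted_time IS NOT NULL").Error; err != nil {
                return err
            }
            return tx.Exec("CREATE INDEX IF NOT EXISTS idx_node_pools_deleted_time ON node_pools (deleted_time) WHERE deleted_time IS NOT NULL").Error
        },
        Rollback: func(tx *gorm.DB) error {
            if err := tx.Exec("DROP INDEX IF EXISTS idx_clusters_deleted_time").Error; err != nil {
                return err
            }
            return tx.Exec("DROP INDEX IF EXISTS idx_node_pools_deleted_time").Error
        },
    }
}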

@openshift-ci openshift-ci Bot requested review from Mischulee and jsell-rh April 28, 2026 14:10
@openshift-ci

openshift-ci Bot commented Apr 28, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jsell-rh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented Apr 28, 2026

Walkthrough

Adds deletion-focused observability:

  • a new metrics package that defines counters and histograms for Pending Deletion entries and their durations;
  • a DB-backed Prometheus collector that exposes per-resource hyperfleet_api_resource_pending_deletion_stuck gauges computed against a configurable deletion_stuck_threshold;
  • instrumentation on cluster and nodepool soft-deletes to record metrics;
  • CLI/env/config wiring and validation for the threshold;
  • a Helm PrometheusRule template and values for warning/critical alerts;
  • corresponding docs;
  • a DB migration adding partial indexes on deleted_time;
  • unit/integration tests for the metrics and collector.
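
For reference, a typical shape for such a DB-backed collector, including the bounded query timeout that comes up later in review (illustrative only; the table names, labels, timeout value, and type name are assumptions rather than the PR's exact code):

package metrics

import (
    "context"
    "database/sql"
    "log"
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

// PendingDeletionCollector recomputes the "stuck" gauge from the database on every scrape.
type PendingDeletionCollector struct {
    db             *sql.DB
    stuckThreshold time.Duration // deletion_stuck_threshold, e.g. 30m
    queryTimeout   time.Duration // bounds each scrape-time query
    stuckDesc      *prometheus.Desc
}

func NewPendingDeletionCollector(db *sql.DB, stuckThreshold time.Duration) *PendingDeletionCollector {
    return &PendingDeletionCollector{
        db:             db,
        stuckThreshold: stuckThreshold,
        queryTimeout:   5 * time.Second,
        stuckDesc: prometheus.NewDesc(
            "hyperfleet_api_resource_pending_deletion_stuck",
            "Resources stuck in Pending Deletion beyond the configured threshold.",
            []string{"resource_type"}, nil,
        ),
    }
}

func (c *PendingDeletionCollector) Describe(ch chan<- *prometheus.Desc) {
    ch <- c.stuckDesc
}

func (c *PendingDeletionCollector) Collect(ch chan<- prometheus.Metric) {
    cutoff := time.Now().Add(-c.stuckThreshold)
    queries := map[string]string{
        "cluster":  "SELECT COUNT(*) FROM clusters WHERE deleted_time IS NOT NULL AND deleted_time < $1",
        "nodepool": "SELECT COUNT(*) FROM node_pools WHERE deleted_time IS NOT NULL AND deleted_time < $1",
    }
    for resourceType, query := range queries {
        count, err := c.countStuck(query, cutoff)
        if err != nil {
            log.Printf("pending deletion collector: %s query failed: %v", resourceType, err)
            continue // skip this series on a slow or failing scrape
        }
        ch <- prometheus.MustNewConstMetric(c.stuckDesc, prometheus.GaugeValue, float64(count), resourceType)
    }
}

func (c *PendingDeletionCollector) countStuck(query string, cutoff time.Time) (int64, error) {
    ctx, cancel := context.WithTimeout(context.Background(), c.queryTimeout)
    defer cancel()
    var count int64
    err := c.db.QueryRowContext(ctx, query, cutoff).Scan(&count)
    return count, err
}

Registering the collector once at startup, for example prometheus.MustRegister(NewPendingDeletionCollector(db, cfg.DeletionStuckThreshold)), is enough; Prometheus then drives the queries on every scrape.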

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant API as API Service
    participant Metrics as Metrics Package
    participant DB as Database
    participant Prometheus as Prometheus Scraper
    participant Alertmanager as Alertmanager

    User->>API: Soft-delete cluster/nodepool
    API->>DB: Persist deleted_time
    API->>Metrics: RecordPendingDeletion(resource_type)
    Metrics->>Metrics: Increment counter & observe histogram

    Prometheus->>Metrics: Scrape (Collect)
    Metrics->>DB: Query counts WHERE deleted_time IS NOT NULL AND deleted_time < now - deletion_stuck_threshold
    DB-->>Metrics: Return counts per resource_type
    Metrics-->>Prometheus: Emit hyperfleet_api_resource_pending_deletion_stuck gauge(s)

    Prometheus->>Alertmanager: Evaluate PrometheusRule
    alt stuck sustained > deletionStuck.for
        Alertmanager->>Alertmanager: Fire HyperFleetResourceDeletionStuckWarning
    end
    alt stuck sustained > deletionTimeout.for
        Alertmanager->>Alertmanager: Fire HyperFleetResourceDeletionStuckCritical
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 11.54%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Description Check: ✅ Passed. Check skipped: CodeRabbit’s high-level summary is enabled.
  • Title check: ✅ Passed. The title accurately describes the main objective of the PR: adding deletion observability metrics and alerts. It directly relates to the primary changes across all modified files.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.




Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/config/metrics.go (1)

16-30: ⚠️ Potential issue | 🟠 Major

Reject non-positive deletion thresholds.

DeletionStuckThreshold is only marked required, so a negative duration still gets through. That would invert the cutoff used by the stuck-resource collector and make the new deletion alerts/metrics meaningless. Please add an explicit positive-value check in config validation before this value is consumed.
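
A minimal sketch of the kind of check intended (the Validate method, struct layout, and error wording here are assumptions based on this comment, not the file's actual contents):

package config

import (
    "fmt"
    "time"
)

// MetricsConfig fields shown here are assumptions based on the review comment.
type MetricsConfig struct {
    DeletionStuckThreshold time.Duration
}

// Validate rejects thresholds the stuck-resource collector cannot use:
// a zero or negative duration would invert the deleted_time cutoff.
func (c *MetricsConfig) Validate() error {
    if c.DeletionStuckThreshold <= 0 {
        return fmt.Errorf("metrics: DeletionStuckThreshold must be positive, got %s", c.DeletionStuckThreshold)
    }
    return nil
}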

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/config/metrics.go` around lines 16 - 30, Add explicit validation to
reject non-positive DeletionStuckThreshold values: in the MetricsConfig
validation path (e.g., the method that validates or finalizes MetricsConfig
before use) check MetricsConfig.DeletionStuckThreshold > 0 and return an error
if it is <= 0; update any constructor/factory logic (NewMetricsConfig remains
fine) to rely on this validation so the stuck-resource collector and alerting
never receive a zero or negative duration. Ensure the error message references
DeletionStuckThreshold so callers know which config field is invalid.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@charts/templates/prometheusrule.yaml`:
- Around line 16-37: The alerts HyperFleetResourceDeletionStuck and
HyperFleetResourceDeletionTimeout currently use the raw metric
hyperfleet_api_resource_terminating_stuck > 0 which will fire per pod; change
the expr to aggregate across pods (e.g., use max by
(resource_type)(hyperfleet_api_resource_terminating_stuck) > 0) to deduplicate,
and replace the hard-coded durations in the annotations' description ("30
minutes"/"1 hour") with the templated for values (use the same {{
.Values.prometheusRule.rules.deletionStuck.for }} and {{
.Values.prometheusRule.rules.deletionTimeout.for }} references used for the for
fields) so the text matches configured alert durations for the alerts named
HyperFleetResourceDeletionStuck and HyperFleetResourceDeletionTimeout.

In `@pkg/metrics/deletion.go`:
- Around line 134-149: Collect currently uses blocking DB calls via QueryRow;
change TerminatingCollector.Collect to use QueryRowContext with a bounded
context deadline so scrapes fail fast on slow DBs: create a context with timeout
(e.g., ctx, cancel := context.WithTimeout(context.Background(), c.queryTimeout)
and defer cancel()) and call c.db.QueryRowContext(ctx, q.query, threshold)
instead of c.db.QueryRow(...); ensure TerminatingCollector has a configurable
query timeout field (e.g., queryTimeout time.Duration) or use a sensible
constant, and preserve the existing error handling (log on Scan error) while
returning/continuing promptly when the context times out.

In `@pkg/services/cluster.go`:
- Around line 132-136: The cluster cascade path bypasses the node pool
soft-delete hook so nodepool termination metrics are not recorded; update the
cluster deletion flow (around the call to s.nodePoolDao.SoftDeleteByOwner) to
record nodepool terminations by invoking metrics.RecordTerminating("nodepool")
for the affected node pools (or ensure SoftDeleteByOwner emits that metric), so
the existing node_pool.go metric counts are preserved when deletions are
triggered by cluster cascade.

---

Outside diff comments:
In `@pkg/config/metrics.go`:
- Around line 16-30: Add explicit validation to reject non-positive
DeletionStuckThreshold values: in the MetricsConfig validation path (e.g., the
method that validates or finalizes MetricsConfig before use) check
MetricsConfig.DeletionStuckThreshold > 0 and return an error if it is <= 0;
update any constructor/factory logic (NewMetricsConfig remains fine) to rely on
this validation so the stuck-resource collector and alerting never receive a
zero or negative duration. Ensure the error message references
DeletionStuckThreshold so callers know which config field is invalid.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 2ad75a9a-1f66-4293-8c1f-68b67c9c0ec6

📥 Commits

Reviewing files that changed from the base of the PR and between 12333a0 and e63fdd4.

📒 Files selected for processing (13)
  • charts/templates/prometheusrule.yaml
  • charts/values.yaml
  • cmd/hyperfleet-api/servecmd/cmd.go
  • cmd/hyperfleet-api/server/metrics_middleware.go
  • docs/metrics.md
  • pkg/config/flags.go
  • pkg/config/loader.go
  • pkg/config/metrics.go
  • pkg/metrics/deletion.go
  • pkg/metrics/deletion_test.go
  • pkg/services/cluster.go
  • pkg/services/node_pool.go
  • test/integration/deletion_metrics_test.go

Comment thread pkg/metrics/deletion.go Outdated
Comment thread pkg/services/cluster.go Outdated
@rafabene rafabene force-pushed the HYPERFLEET-856-add-deletion-observability-metrics branch from 6487c64 to d429144 on April 28, 2026 at 14:28

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@charts/templates/prometheusrule.yaml`:
- Around line 16-37: The alert uses aggregation "max by (resource_type)" over
hyperfleet_api_resource_terminating_stuck which merges across releases because
the metric lacks a namespace label; don't attempt to add namespace to the
aggregation without the label. Fix by one of the suggested approaches: (a) add a
relabeling rule in the ServiceMonitor (or ServiceMonitor template) to preserve
__meta_kubernetes_namespace as a metric label and change the PrometheusRule
alerts HyperFleetResourceDeletionStuck and HyperFleetResourceDeletionTimeout to
aggregate with "max by (namespace, resource_type)"; or (b) emit "namespace" from
the application so hyperfleet_api_resource_terminating_stuck includes namespace
and then aggregate by namespace; or (c) require and validate that
serviceMonitor.namespace is set (documenting multi-release limitations) so
scraping is release-scoped. Reference symbols to change: the PrometheusRule
alerts HyperFleetResourceDeletionStuck and HyperFleetResourceDeletionTimeout,
the metric hyperfleet_api_resource_terminating_stuck, and the
ServiceMonitor/serviceMonitor.namespace/relabel_configs to implement option (a).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: a59c3d7c-dcbd-4ed1-90ac-ec5a21b50b8b

📥 Commits

Reviewing files that changed from the base of the PR and between 6019453 and d429144.

📒 Files selected for processing (13)
  • charts/templates/prometheusrule.yaml
  • charts/values.yaml
  • cmd/hyperfleet-api/servecmd/cmd.go
  • cmd/hyperfleet-api/server/metrics_middleware.go
  • docs/metrics.md
  • pkg/config/flags.go
  • pkg/config/loader.go
  • pkg/config/metrics.go
  • pkg/metrics/deletion.go
  • pkg/metrics/deletion_test.go
  • pkg/services/cluster.go
  • pkg/services/node_pool.go
  • test/integration/deletion_metrics_test.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/metrics/deletion_test.go
🚧 Files skipped from review as they are similar to previous changes (9)
  • pkg/config/flags.go
  • pkg/services/cluster.go
  • cmd/hyperfleet-api/server/metrics_middleware.go
  • pkg/services/node_pool.go
  • test/integration/deletion_metrics_test.go
  • docs/metrics.md
  • pkg/config/metrics.go
  • charts/values.yaml
  • pkg/metrics/deletion.go

Comment thread charts/templates/prometheusrule.yaml Outdated
@rafabene rafabene force-pushed the HYPERFLEET-856-add-deletion-observability-metrics branch from d429144 to 79ebbb0 on April 28, 2026 at 14:43

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
charts/templates/prometheusrule.yaml (1)

16-37: ⚠️ Potential issue | 🟠 Major

Align the alert aggregation with the metric labels.

hyperfleet_api_resource_terminating_stuck only exports resource_type, component, and version, so max by (namespace, resource_type) collapses releases that share Prometheus and does not scope the alert by namespace. Either add a namespace label at scrape/metric time or remove namespace from the aggregation.

Please verify the label mismatch with a read-only search:

#!/bin/bash
set -euo pipefail
rg -n 'labelResourceType|labelComponent|labelVersion|stuckDesc|max by \(namespace, resource_type\)' pkg/metrics/deletion.go charts/templates/prometheusrule.yaml

Expected result: the metric definition has no namespace label while the alert rule groups by namespace, confirming the cross-release aggregation risk.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@charts/templates/prometheusrule.yaml` around lines 16 - 37, The alert
aggregation uses "max by (namespace, resource_type)" but the metric
hyperfleet_api_resource_terminating_stuck (defined in pkg/metrics/deletion.go
via labelResourceType/labelComponent/labelVersion) does not include a namespace
label, causing cross-release aggregation; fix the two alerts
HyperFleetResourceDeletionStuck and HyperFleetResourceDeletionTimeout in
charts/templates/prometheusrule.yaml by removing "namespace" from the
aggregation (change "max by (namespace, resource_type)" to "max by
(resource_type)") or alternatively ensure the metric is exported with a
namespace label at scrape time—pick one approach and apply it consistently to
both alert rules and their descriptions/runbook handling.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/metrics.md`:
- Around line 321-339: The PromQL aggregations are malformed and need to be
rewritten to the proper "sum by (...) (...)" form and the first rate() converted
to per-minute: change sum(rate(hyperfleet_api_resource_terminating_total[5m]))
by (resource_type) to sum by (resource_type)
(rate(hyperfleet_api_resource_terminating_total[5m])) * 60; change
sum(hyperfleet_api_resource_terminating_stuck) by (resource_type) to sum by
(resource_type) (hyperfleet_api_resource_terminating_stuck); for the average
terminating duration replace both numerator and denominator forms sum(... ) by
(resource_type) with sum by (resource_type) (...): sum by (resource_type)
(rate(hyperfleet_api_resource_terminating_duration_seconds_sum[5m])) / sum by
(resource_type)
(rate(hyperfleet_api_resource_terminating_duration_seconds_count[5m])); and
update the histogram_quantile call to histogram_quantile(0.99, sum by (le,
resource_type)
(rate(hyperfleet_api_resource_terminating_duration_seconds_bucket[5m]))).

In `@pkg/services/cluster.go`:
- Around line 143-150: The metric emission for nodepool termination is happening
after the follow-up read (FindSoftDeletedByOwner) which can fail and skip
emitting metrics; update the deletion flow to emit the metrics earlier or change
the SoftDeleteByOwner API to return the affected count/list. Concretely: either
modify SoftDeleteByOwner(...) to return (affectedCount int, affectedPools
[]NodePool, err error) and call metrics.RecordTerminating("nodepool") based on
that return value inside the deletion path, or move the loop that calls
metrics.RecordTerminating("nodepool") to immediately after the
SoftDeleteByOwner(...) call (before calling FindSoftDeletedByOwner), ensuring
metrics.RecordTerminating is invoked even if subsequent reads fail; adjust
references to SoftDeleteByOwner, FindSoftDeletedByOwner, and
metrics.RecordTerminating("nodepool") accordingly.

---

Duplicate comments:
In `@charts/templates/prometheusrule.yaml`:
- Around line 16-37: The alert aggregation uses "max by (namespace,
resource_type)" but the metric hyperfleet_api_resource_terminating_stuck
(defined in pkg/metrics/deletion.go via
labelResourceType/labelComponent/labelVersion) does not include a namespace
label, causing cross-release aggregation; fix the two alerts
HyperFleetResourceDeletionStuck and HyperFleetResourceDeletionTimeout in
charts/templates/prometheusrule.yaml by removing "namespace" from the
aggregation (change "max by (namespace, resource_type)" to "max by
(resource_type)") or alternatively ensure the metric is exported with a
namespace label at scrape time—pick one approach and apply it consistently to
both alert rules and their descriptions/runbook handling.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 3252c284-6c38-4779-9ca9-a7fe3e9a0937

📥 Commits

Reviewing files that changed from the base of the PR and between d429144 and 79ebbb0.

📒 Files selected for processing (13)
  • charts/templates/prometheusrule.yaml
  • charts/values.yaml
  • cmd/hyperfleet-api/servecmd/cmd.go
  • cmd/hyperfleet-api/server/metrics_middleware.go
  • docs/metrics.md
  • pkg/config/flags.go
  • pkg/config/loader.go
  • pkg/config/metrics.go
  • pkg/metrics/deletion.go
  • pkg/metrics/deletion_test.go
  • pkg/services/cluster.go
  • pkg/services/node_pool.go
  • test/integration/deletion_metrics_test.go
🚧 Files skipped from review as they are similar to previous changes (5)
  • cmd/hyperfleet-api/server/metrics_middleware.go
  • pkg/config/flags.go
  • pkg/services/node_pool.go
  • pkg/config/loader.go
  • pkg/config/metrics.go

Comment thread docs/metrics.md
Comment thread pkg/services/cluster.go
@rafabene rafabene force-pushed the HYPERFLEET-856-add-deletion-observability-metrics branch from 79ebbb0 to a29272c on April 28, 2026 at 15:04

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (2)
pkg/services/cluster.go (1)

138-145: ⚠️ Potential issue | 🟠 Major

Nodepool terminating events can be permanently missed on read failure.

If FindSoftDeletedByOwner fails at Line 138 after SoftDeleteByOwner succeeded at Line 134, the function exits before Lines 143-145 emit nodepool metrics. Later retries short-circuit at Line 118, so those missed events are never recovered. Emit nodepool metric counts from the cascade write result (e.g., affected rows/returned IDs) instead of depending on a follow-up read.
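
One possible shape for that write path (hypothetical DAO signature and SQL, not the service's actual code):

package dao

import (
    "context"
    "database/sql"
    "time"
)

type nodePoolDao struct {
    db *sql.DB
}

// SoftDeleteByOwner marks every node pool owned by clusterID as deleted and
// returns the IDs it actually touched, so the caller can emit one metric per
// affected row from the write result instead of a follow-up read.
func (d *nodePoolDao) SoftDeleteByOwner(ctx context.Context, clusterID string, deletedAt time.Time, deletedBy string) ([]string, error) {
    rows, err := d.db.QueryContext(ctx, `
        UPDATE node_pools
        SET deleted_time = $1, deleted_by = $2
        WHERE cluster_id = $3 AND deleted_time IS NULL
        RETURNING id`, deletedAt, deletedBy, clusterID)
    if err != nil {
        return nil, err
    }
    defer rows.Close()

    var ids []string
    for rows.Next() {
        var id string
        if err := rows.Scan(&id); err != nil {
            return nil, err
        }
        ids = append(ids, id)
    }
    return ids, rows.Err()
}

The cluster service can then range over the returned IDs and bump the nodepool counter once per ID; because of the deleted_time IS NULL guard, a retry cannot double-count pools that were already soft-deleted.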

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/services/cluster.go` around lines 138 - 145, The current implementation
records nodepool terminating metrics by reading soft-deleted nodepools via
FindSoftDeletedByOwner (nodePools) after SoftDeleteByOwner, so if that read
fails the metrics are never emitted; update the cascade delete flow to emit
metrics from the cascade write result instead of the follow-up read: change
SoftDeleteByOwner (or the DAO method it calls) to return the affected row count
or the list of deleted IDs, have the caller capture that return value (instead
of relying on nodePools from FindSoftDeletedByOwner), and call
metrics.RecordTerminating the appropriate number of times (or once with the
count) based on that returned count/IDs; retain or remove the
FindSoftDeletedByOwner call as needed but ensure metrics emission no longer
depends on it.
charts/templates/prometheusrule.yaml (1)

17-17: ⚠️ Potential issue | 🟠 Major

namespace scoping in the alert expression is currently ineffective.

hyperfleet_api_resource_terminating_stuck doesn’t expose a namespace label, so max by (namespace, resource_type) still collapses across releases/namespaces for each resource_type. This can mix signals across deployments.

Run this to verify label shape and whether ServiceMonitor relabeling injects namespace:

#!/bin/bash
set -euo pipefail

echo "== Collector metric descriptor/emit labels =="
rg -n -C3 'resource_terminating_stuck|NewDesc\(|MustNewConstMetric\(' pkg/metrics/deletion.go

echo
echo "== Alert expressions =="
rg -n -C2 'HyperFleetResourceDeletion(Stuck|Timeout)|max by|hyperfleet_api_resource_terminating_stuck' charts/templates/prometheusrule.yaml

echo
echo "== ServiceMonitor relabeling (namespace injection) =="
fd -i 'servicemonitor.yaml' charts/templates | while read -r f; do
  echo "--- ${f} ---"
  rg -n -C3 '__meta_kubernetes_namespace|metricRelabelings|relabelings|targetLabels|namespace' "$f" || true
done

Expected: no namespace label in collector metric unless explicitly injected by scrape relabeling; if absent, current by(namespace, resource_type) doesn’t isolate per namespace.

Also applies to: 28-28

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@charts/templates/prometheusrule.yaml` at line 17, The alert groups by a
non-existent namespace label so signals are mixed; either inject namespace at
scrape time or stop grouping by it: update the Prometheus relabeling in your
ServiceMonitor (servicemonitor.yaml) to add target label "namespace" from
__meta_kubernetes_namespace so the collector metric
hyperfleet_api_resource_terminating_stuck has a namespace label, or change the
rule in charts/templates/prometheusrule.yaml (expr referencing max by
(namespace, resource_type)) to group only by existing labels (e.g., max by
(resource_type)) and remove namespace from the by() clause to avoid incorrect
scoping.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@charts/templates/prometheusrule.yaml`:
- Line 17: The alert groups by a non-existent namespace label so signals are
mixed; either inject namespace at scrape time or stop grouping by it: update the
Prometheus relabeling in your ServiceMonitor (servicemonitor.yaml) to add target
label "namespace" from __meta_kubernetes_namespace so the collector metric
hyperfleet_api_resource_terminating_stuck has a namespace label, or change the
rule in charts/templates/prometheusrule.yaml (expr referencing max by
(namespace, resource_type)) to group only by existing labels (e.g., max by
(resource_type)) and remove namespace from the by() clause to avoid incorrect
scoping.

In `@pkg/services/cluster.go`:
- Around line 138-145: The current implementation records nodepool terminating
metrics by reading soft-deleted nodepools via FindSoftDeletedByOwner (nodePools)
after SoftDeleteByOwner, so if that read fails the metrics are never emitted;
update the cascade delete flow to emit metrics from the cascade write result
instead of the follow-up read: change SoftDeleteByOwner (or the DAO method it
calls) to return the affected row count or the list of deleted IDs, have the
caller capture that return value (instead of relying on nodePools from
FindSoftDeletedByOwner), and call metrics.RecordTerminating the appropriate
number of times (or once with the count) based on that returned count/IDs;
retain or remove the FindSoftDeletedByOwner call as needed but ensure metrics
emission no longer depends on it.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 2a707980-1e96-49d8-88f5-25adfb2a8e1f

📥 Commits

Reviewing files that changed from the base of the PR and between 79ebbb0 and a29272c.

📒 Files selected for processing (13)
  • charts/templates/prometheusrule.yaml
  • charts/values.yaml
  • cmd/hyperfleet-api/servecmd/cmd.go
  • cmd/hyperfleet-api/server/metrics_middleware.go
  • docs/metrics.md
  • pkg/config/flags.go
  • pkg/config/loader.go
  • pkg/config/metrics.go
  • pkg/metrics/deletion.go
  • pkg/metrics/deletion_test.go
  • pkg/services/cluster.go
  • pkg/services/node_pool.go
  • test/integration/deletion_metrics_test.go
✅ Files skipped from review due to trivial changes (3)
  • pkg/config/flags.go
  • pkg/services/node_pool.go
  • charts/values.yaml
🚧 Files skipped from review as they are similar to previous changes (4)
  • cmd/hyperfleet-api/servecmd/cmd.go
  • pkg/config/loader.go
  • pkg/config/metrics.go
  • docs/metrics.md

Comment thread charts/templates/prometheusrule.yaml Outdated
@rafabene rafabene force-pushed the HYPERFLEET-856-add-deletion-observability-metrics branch 2 times, most recently from 964d474 to 3f46062 on April 28, 2026 at 17:22
Comment thread charts/templates/prometheusrule.yaml Outdated
summary: "HyperFleet resources stuck in Pending Deletion state"
description: >-
{{ "{{ $value }}" }} {{ "{{ $labels.resource_type }}" }} resource(s) have been in
Pending Deletion state for more than {{ .Values.prometheusRule.rules.deletionStuck.for | default "5m" }}.
Contributor


Is this description correct? Isn't it actually hitting at 35m rather than 5m?

Contributor Author


Good catch! You're right — the pending_deletion_stuck metric already embeds a 30m threshold from the collector (config.metrics.deletion_stuck_threshold), so by the time the Warning alert fires after its 5m for delay, the resource has actually been stuck for ~35m, not 5m.

Fixed in 39d7989 — the description now shows both components: 30m (stuck threshold) + 5m (alert delay).

Comment thread charts/values.yaml Outdated
Comment on lines +274 to +277
for: "5m"
runbookUrl: ""
deletionTimeout:
for: "30m"
Contributor


Curious what was your reasoning behind these defaults?

Contributor Author


The deletionStuck.for: 5m is meant as an early warning — the collector already flags resources stuck beyond the 30m threshold (deletion_stuck_threshold), so 5m of sustained signal is just enough to filter out transient scrape noise before paging.

The deletionTimeout.for: 30m escalates to critical — at that point the resource has been stuck for ~60m total (30m threshold + 30m alert delay), which signals something is genuinely broken and needs immediate attention.

Happy to adjust if you think different values make more sense for the team's SLOs.

Contributor


I am leaning towards stupid defaults TBH, to avoid unwanted noise for our partner teams. Most will probably just roll with the defaults until they align on their SLOs. I would lean towards warning after an hour and timeout at 2 or 3 hours.

Contributor Author


Makes total sense — bumped the defaults to conservative values: warning at 30m (fires ~1h total) and critical at 2h (fires ~2.5h total). Better to start quiet and let GCP/ROSA teams tighten when they define their SLOs, rather than generating noise out of the box.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
pkg/services/cluster.go (1)

138-145: ⚠️ Potential issue | 🟠 Major

Nodepool metric emission still depends on a brittle follow-up read.

If SoftDeleteByOwner succeeds but FindSoftDeletedByOwner fails, nodepools are already soft-deleted but no nodepool pending-deletion metrics are emitted, and retries won’t replay because the cluster is already marked deleted.

Consider moving emission to data returned directly by SoftDeleteByOwner (e.g., affected count/list) so it doesn’t depend on a second query.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/services/cluster.go` around lines 138 - 145, The current flow calls
s.nodePoolDao.SoftDeleteByOwner(...) then separately calls
s.nodePoolDao.FindSoftDeletedByOwner(...) to emit metrics, which is brittle if
the second read fails; modify SoftDeleteByOwner on nodePoolDao to return the
affected nodepool count or list (e.g., return (count int, ids []string, error))
and update callers in cluster.go to call that result directly and invoke
metrics.RecordPendingDeletion("nodepool") using the returned count/list so
metric emission does not depend on FindSoftDeletedByOwner; ensure error handling
still returns on SoftDeleteByOwner failure and that the new return shape is
propagated where needed.
charts/templates/prometheusrule.yaml (1)

17-17: ⚠️ Potential issue | 🟠 Major

namespace grouping is still ineffective for this metric.

On Line 17 and Line 29, max by (namespace, resource_type) won’t scope alerts per namespace unless the metric actually has a namespace label. hyperfleet_api_resource_pending_deletion_stuck is emitted without namespace in pkg/metrics/deletion.go, so this can still merge signals across releases.

#!/bin/bash
# Verify metric labels for pending deletion stuck collector
rg -n -C3 'resource_pending_deletion_stuck|NewDesc\(|labelResourceType|labelComponent|labelVersion' pkg/metrics/deletion.go

# Verify whether ServiceMonitor injects namespace as a metric label
rg -n -C3 'ServiceMonitor|metricRelabelings|relabelings|__meta_kubernetes_namespace|namespace' charts/templates

Expected result:

  1. pkg/metrics/deletion.go shows only resource_type as variable label for this metric.
  2. No relabeling that preserves Kubernetes namespace into metric labels, confirming aggregation still cannot partition per namespace.

Also applies to: 29-29

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@charts/templates/prometheusrule.yaml` at line 17, The alert groups by
(namespace, resource_type) but the metric
hyperfleet_api_resource_pending_deletion_stuck is emitted without a namespace
label (see pkg/metrics/deletion.go), so change one of two things: either add a
"namespace" label to the metric at its declaration/recording in
pkg/metrics/deletion.go (update the NewDesc/label list and ensure the collector
sets the namespace value when observing), or if you cannot emit namespace,
remove namespace from the PrometheusRule expression in
charts/templates/prometheusrule.yaml and use max by (resource_type) (and the
identical change for the other occurrence at line 29) so alerts aren’t
misleadingly grouped by a non-existent label.
🧹 Nitpick comments (1)
test/integration/deletion_metrics_test.go (1)

33-35: Assert delete response status to prevent false-positive test results.

Right now these checks only assert transport success (err == nil). If delete returns non-2xx, the second subtest can still pass spuriously.

Proposed test hardening
-		_, err = client.DeleteClusterByIdWithResponse(ctx, cluster.ID, test.WithAuthToken(ctx))
+		delResp, err := client.DeleteClusterByIdWithResponse(ctx, cluster.ID, test.WithAuthToken(ctx))
 		Expect(err).NotTo(HaveOccurred())
+		Expect(delResp.StatusCode()).To(BeNumerically(">=", 200))
+		Expect(delResp.StatusCode()).To(BeNumerically("<", 300))
...
-		_, err = client.DeleteClusterByIdWithResponse(ctx, cluster.ID, test.WithAuthToken(ctx))
+		delResp, err := client.DeleteClusterByIdWithResponse(ctx, cluster.ID, test.WithAuthToken(ctx))
 		Expect(err).NotTo(HaveOccurred())
+		Expect(delResp.StatusCode()).To(BeNumerically(">=", 200))
+		Expect(delResp.StatusCode()).To(BeNumerically("<", 300))

Also applies to: 61-63

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/integration/deletion_metrics_test.go` around lines 33 - 35, The test
currently only checks that client.DeleteClusterByIdWithResponse returned no
transport error (err == nil) which can hide non-2xx delete responses; update the
two places calling DeleteClusterByIdWithResponse (the call using cluster.ID and
the second similar call around lines 61-63) to capture the response object and
assert its HTTP status is the expected success code (e.g., 200/204) using the
response's status/status code accessor before or in addition to
Expect(err).NotTo(HaveOccurred()) so the test fails on non-2xx responses;
reference the DeleteClusterByIdWithResponse call, the returned response
variable, and the Expect assertions when making this change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/metrics/deletion.go`:
- Around line 135-136: Metrics scrape COUNT queries for clusters and node_pools
(the entries with "SELECT COUNT(*) FROM clusters WHERE deleted_time IS NOT NULL
AND deleted_time < $1" and "SELECT COUNT(*) FROM node_pools WHERE deleted_time
IS NOT NULL AND deleted_time < $1" in pkg/metrics/deletion.go) need supporting
indexes; add a migration that creates either simple indexes on deleted_time
(CREATE INDEX ON clusters(deleted_time), CREATE INDEX ON
node_pools(deleted_time)) or partial indexes filtering NOT NULL (CREATE INDEX
... WHERE deleted_time IS NOT NULL) and also drop the stale indexes left on the
old deleted_at column (drop idx_clusters_deleted_at and
idx_node_pools_deleted_at) so the metrics queries use the new indexes and avoid
full-table scans.

---

Duplicate comments:
In `@charts/templates/prometheusrule.yaml`:
- Line 17: The alert groups by (namespace, resource_type) but the metric
hyperfleet_api_resource_pending_deletion_stuck is emitted without a namespace
label (see pkg/metrics/deletion.go), so change one of two things: either add a
"namespace" label to the metric at its declaration/recording in
pkg/metrics/deletion.go (update the NewDesc/label list and ensure the collector
sets the namespace value when observing), or if you cannot emit namespace,
remove namespace from the PrometheusRule expression in
charts/templates/prometheusrule.yaml and use max by (resource_type) (and the
identical change for the other occurrence at line 29) so alerts aren’t
misleadingly grouped by a non-existent label.

In `@pkg/services/cluster.go`:
- Around line 138-145: The current flow calls
s.nodePoolDao.SoftDeleteByOwner(...) then separately calls
s.nodePoolDao.FindSoftDeletedByOwner(...) to emit metrics, which is brittle if
the second read fails; modify SoftDeleteByOwner on nodePoolDao to return the
affected nodepool count or list (e.g., return (count int, ids []string, error))
and update callers in cluster.go to call that result directly and invoke
metrics.RecordPendingDeletion("nodepool") using the returned count/list so
metric emission does not depend on FindSoftDeletedByOwner; ensure error handling
still returns on SoftDeleteByOwner failure and that the new return shape is
propagated where needed.

---

Nitpick comments:
In `@test/integration/deletion_metrics_test.go`:
- Around line 33-35: The test currently only checks that
client.DeleteClusterByIdWithResponse returned no transport error (err == nil)
which can hide non-2xx delete responses; update the two places calling
DeleteClusterByIdWithResponse (the call using cluster.ID and the second similar
call around lines 61-63) to capture the response object and assert its HTTP
status is the expected success code (e.g., 200/204) using the response's
status/status code accessor before or in addition to
Expect(err).NotTo(HaveOccurred()) so the test fails on non-2xx responses;
reference the DeleteClusterByIdWithResponse call, the returned response
variable, and the Expect assertions when making this change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: ab847be3-0cd7-404c-b0c0-96a4da969b4e

📥 Commits

Reviewing files that changed from the base of the PR and between 79ebbb0 and 39d7989.

📒 Files selected for processing (13)
  • charts/templates/prometheusrule.yaml
  • charts/values.yaml
  • cmd/hyperfleet-api/servecmd/cmd.go
  • cmd/hyperfleet-api/server/metrics_middleware.go
  • docs/metrics.md
  • pkg/config/flags.go
  • pkg/config/loader.go
  • pkg/config/metrics.go
  • pkg/metrics/deletion.go
  • pkg/metrics/deletion_test.go
  • pkg/services/cluster.go
  • pkg/services/node_pool.go
  • test/integration/deletion_metrics_test.go
🚧 Files skipped from review as they are similar to previous changes (5)
  • pkg/config/flags.go
  • cmd/hyperfleet-api/server/metrics_middleware.go
  • charts/values.yaml
  • pkg/config/metrics.go
  • docs/metrics.md

Comment thread pkg/metrics/deletion.go
@rafabene rafabene force-pushed the HYPERFLEET-856-add-deletion-observability-metrics branch from 39d7989 to 7b9a35f on April 29, 2026 at 13:58
Add Prometheus metrics to track the lifecycle of resource deletion:
- pending_deletion_total counter: tracks resources entering Pending Deletion
- pending_deletion_duration_seconds histogram: measures soft-delete to hard-delete
- pending_deletion_stuck gauge (collector): reports resources stuck beyond threshold

Includes PrometheusRule alerts (warning at 1h, critical at 2.5h total),
partial indexes on deleted_time for efficient collector queries,
and integration tests for the full metrics pipeline.
@rafabene rafabene force-pushed the HYPERFLEET-856-add-deletion-observability-metrics branch from 7b9a35f to 429af79 on April 29, 2026 at 14:00

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
pkg/services/cluster.go (1)

134-160: ⚠️ Potential issue | 🟠 Major

Emit nodepool metrics from the write path, not the follow-up read.

This still derives the nodepool counter from FindSoftDeletedByOwner(), so a transient read failure drops the metric entirely, and any nodepools that were already soft-deleted before this cascade will be counted again. Please have SoftDeleteByOwner() return the affected rows/IDs, or move the metric emission into that write path.

♻️ Proposed fix
-	if cascadeErr := s.nodePoolDao.SoftDeleteByOwner(ctx, id, t, deletedBy); cascadeErr != nil {
-		return nil, handleSoftDeleteError("NodePool", cascadeErr)
-	}
-
-	nodePools, err := s.nodePoolDao.FindSoftDeletedByOwner(ctx, id)
-	if err != nil {
-		return nil, errors.GeneralError("Failed to fetch cascade-deleted nodepools: %s", err)
-	}
-
-	for range nodePools {
+	nodePools, cascadeErr := s.nodePoolDao.SoftDeleteByOwner(ctx, id, t, deletedBy)
+	if cascadeErr != nil {
+		return nil, handleSoftDeleteError("NodePool", cascadeErr)
+	}
+
+	for range nodePools {
 		metrics.RecordPendingDeletion("nodepool")
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/services/cluster.go` around lines 134 - 160, The current metric emission
uses FindSoftDeletedByOwner() after SoftDeleteByOwner(), which can lose metrics
on transient reads and double-count pre-existing soft-deletes; change
nodePoolDao.SoftDeleteByOwner(ctx, id, t, deletedBy) to return the list of
affected nodepool IDs (or count), have SoftDeleteByOwner itself call
metrics.RecordPendingDeletion("nodepool") for each affected ID (or record the
count), and update this caller (and any other callers) to stop iterating over
results from FindSoftDeletedByOwner(); keep UpdateClusterStatusFromAdapters and
batchUpdateNodePoolStatusesFromAdapters calls as-is but remove the post-read
metric loop so metrics are emitted reliably from the write path.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/services/cluster.go`:
- Around line 134-160: The current metric emission uses FindSoftDeletedByOwner()
after SoftDeleteByOwner(), which can lose metrics on transient reads and
double-count pre-existing soft-deletes; change
nodePoolDao.SoftDeleteByOwner(ctx, id, t, deletedBy) to return the list of
affected nodepool IDs (or count), have SoftDeleteByOwner itself call
metrics.RecordPendingDeletion("nodepool") for each affected ID (or record the
count), and update this caller (and any other callers) to stop iterating over
results from FindSoftDeletedByOwner(); keep UpdateClusterStatusFromAdapters and
batchUpdateNodePoolStatusesFromAdapters calls as-is but remove the post-read
metric loop so metrics are emitted reliably from the write path.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: b561f099-048e-47c0-8991-8c567303bafc

📥 Commits

Reviewing files that changed from the base of the PR and between 39d7989 and 429af79.

📒 Files selected for processing (15)
  • charts/templates/prometheusrule.yaml
  • charts/values.yaml
  • cmd/hyperfleet-api/servecmd/cmd.go
  • cmd/hyperfleet-api/server/metrics_middleware.go
  • docs/metrics.md
  • pkg/config/flags.go
  • pkg/config/loader.go
  • pkg/config/metrics.go
  • pkg/db/migrations/202604290001_add_deleted_time_indexes.go
  • pkg/db/migrations/migration_structs.go
  • pkg/metrics/deletion.go
  • pkg/metrics/deletion_test.go
  • pkg/services/cluster.go
  • pkg/services/node_pool.go
  • test/integration/deletion_metrics_test.go
✅ Files skipped from review due to trivial changes (3)
  • pkg/config/flags.go
  • pkg/services/node_pool.go
  • docs/metrics.md
🚧 Files skipped from review as they are similar to previous changes (6)
  • pkg/config/loader.go
  • cmd/hyperfleet-api/server/metrics_middleware.go
  • charts/values.yaml
  • test/integration/deletion_metrics_test.go
  • pkg/config/metrics.go
  • pkg/metrics/deletion_test.go

