HYPERFLEET-856 - feat: add deletion observability metrics and alerts#115
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Walkthrough
Adds deletion-focused observability: a new metrics package that defines counters and histograms for Pending Deletion entries and their durations; a DB-backed Prometheus collector that exposes per-resource-type counts of entries stuck beyond a configurable threshold; and Helm PrometheusRule alerts that escalate from warning to critical.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant API as API Service
    participant Metrics as Metrics Package
    participant DB as Database
    participant Prometheus as Prometheus Scraper
    participant Alertmanager as Alertmanager
    User->>API: Soft-delete cluster/nodepool
    API->>DB: Persist deleted_time
    API->>Metrics: RecordPendingDeletion(resource_type)
    Metrics->>Metrics: Increment counter & observe histogram
    Prometheus->>Metrics: Scrape (Collect)
    Metrics->>DB: Query counts WHERE deleted_time IS NOT NULL AND deleted_time < now - deletion_stuck_threshold
    DB-->>Metrics: Return counts per resource_type
    Metrics-->>Prometheus: Emit hyperfleet_api_resource_pending_deletion_stuck gauge(s)
    Prometheus->>Alertmanager: Evaluate PrometheusRule
    alt stuck sustained > deletionStuck.for
        Alertmanager->>Alertmanager: Fire HyperFleetResourceDeletionStuckWarning
    end
    alt stuck sustained > deletionTimeout.for
        Alertmanager->>Alertmanager: Fire HyperFleetResourceDeletionStuckCritical
    end
```
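For orientation, here is a minimal sketch of the counter/histogram half of that metrics package, using the `RecordPendingDeletion` name and `hyperfleet_api_resource_pending_deletion_*` metric names that appear later in this thread; the real `pkg/metrics/deletion.go` may differ in buckets, help text, and registration. The stuck gauge is deliberately absent here because, as the diagram shows, it is produced by the DB-backed collector at scrape time.

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Illustrative shapes only — buckets and registration are assumptions.
var (
	pendingDeletionTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "hyperfleet_api_resource_pending_deletion_total",
		Help: "Resources that entered the Pending Deletion state.",
	}, []string{"resource_type"})

	pendingDeletionDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "hyperfleet_api_resource_pending_deletion_duration_seconds",
		Help:    "Time from soft-delete to hard-delete.",
		Buckets: prometheus.ExponentialBuckets(60, 2, 10), // 1m .. ~8.5h
	}, []string{"resource_type"})
)

// RecordPendingDeletion is called from the soft-delete paths.
func RecordPendingDeletion(resourceType string) {
	pendingDeletionTotal.WithLabelValues(resourceType).Inc()
}

// ObservePendingDeletionDuration would be called once the hard-delete flow lands.
func ObservePendingDeletionDuration(resourceType string, seconds float64) {
	pendingDeletionDuration.WithLabelValues(resourceType).Observe(seconds)
}
```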
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/config/metrics.go (1)
Lines 16-30: ⚠️ Potential issue | 🟠 Major — Reject non-positive deletion thresholds.
`DeletionStuckThreshold` is only marked `required`, so a negative duration still gets through. That would invert the cutoff used by the stuck-resource collector and make the new deletion alerts/metrics meaningless. Please add an explicit positive-value check in config validation before this value is consumed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/config/metrics.go` around lines 16 - 30, Add explicit validation to reject non-positive DeletionStuckThreshold values: in the MetricsConfig validation path (e.g., the method that validates or finalizes MetricsConfig before use) check MetricsConfig.DeletionStuckThreshold > 0 and return an error if it is <= 0; update any constructor/factory logic (NewMetricsConfig remains fine) to rely on this validation so the stuck-resource collector and alerting never receive a zero or negative duration. Ensure the error message references DeletionStuckThreshold so callers know which config field is invalid.
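For illustration, a minimal sketch of such a check, assuming the `MetricsConfig` and `DeletionStuckThreshold` names from the comment; the real struct in `pkg/config/metrics.go` likely carries more fields and tags.

```go
package config

import (
	"fmt"
	"time"
)

// MetricsConfig mirrors only the field named in the review comment.
type MetricsConfig struct {
	DeletionStuckThreshold time.Duration
}

// Validate rejects zero or negative thresholds so the stuck-resource collector
// never receives an inverted cutoff.
func (c *MetricsConfig) Validate() error {
	if c.DeletionStuckThreshold <= 0 {
		return fmt.Errorf("metrics: DeletionStuckThreshold must be positive, got %v", c.DeletionStuckThreshold)
	}
	return nil
}
```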
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@charts/templates/prometheusrule.yaml`:
- Around line 16-37: The alerts HyperFleetResourceDeletionStuck and
HyperFleetResourceDeletionTimeout currently use the raw metric
hyperfleet_api_resource_terminating_stuck > 0 which will fire per pod; change
the expr to aggregate across pods (e.g., use max by
(resource_type)(hyperfleet_api_resource_terminating_stuck) > 0) to deduplicate,
and replace the hard-coded durations in the annotations' description ("30
minutes"/"1 hour") with the templated for values (use the same {{
.Values.prometheusRule.rules.deletionStuck.for }} and {{
.Values.prometheusRule.rules.deletionTimeout.for }} references used for the for
fields) so the text matches configured alert durations for the alerts named
HyperFleetResourceDeletionStuck and HyperFleetResourceDeletionTimeout.
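A sketch of what the corrected warning rule could look like, assuming the `.Values.prometheusRule.rules.deletionStuck.for` key referenced above; this is illustrative, not the chart's actual template:

```yaml
- alert: HyperFleetResourceDeletionStuck
  # Aggregate across pods so one stuck resource fires a single alert.
  expr: max by (resource_type) (hyperfleet_api_resource_terminating_stuck) > 0
  for: {{ .Values.prometheusRule.rules.deletionStuck.for }}
  labels:
    severity: warning
  annotations:
    summary: "HyperFleet resources stuck in Pending Deletion state"
    description: >-
      {{ "{{ $value }}" }} {{ "{{ $labels.resource_type }}" }} resource(s) stuck for more than
      {{ .Values.prometheusRule.rules.deletionStuck.for }}.
```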
In `@pkg/metrics/deletion.go`:
- Around line 134-149: Collect currently uses blocking DB calls via QueryRow;
change TerminatingCollector.Collect to use QueryRowContext with a bounded
context deadline so scrapes fail fast on slow DBs: create a context with timeout
(e.g., ctx, cancel := context.WithTimeout(context.Background(), c.queryTimeout)
and defer cancel()) and call c.db.QueryRowContext(ctx, q.query, threshold)
instead of c.db.QueryRow(...); ensure TerminatingCollector has a configurable
query timeout field (e.g., queryTimeout time.Duration) or use a sensible
constant, and preserve the existing error handling (log on Scan error) while
returning/continuing promptly when the context times out.
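A sketch of the suggested `Collect` shape; the field names (`queries`, `stuckThreshold`, `queryTimeout`) are illustrative, the constructor is omitted, and the real `TerminatingCollector` in `pkg/metrics/deletion.go` may be structured differently:

```go
package metrics

import (
	"context"
	"database/sql"
	"log"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// resourceQuery pairs a resource type label with its COUNT(*) query.
type resourceQuery struct {
	resourceType string
	query        string
}

type TerminatingCollector struct {
	db             *sql.DB
	stuckDesc      *prometheus.Desc // assumed to declare one variable label: resource_type
	stuckThreshold time.Duration
	queryTimeout   time.Duration // bound for each scrape-time DB query
	queries        []resourceQuery
}

func (c *TerminatingCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.stuckDesc }

// Collect runs each COUNT query with a deadline so a slow database cannot
// stall the whole Prometheus scrape.
func (c *TerminatingCollector) Collect(ch chan<- prometheus.Metric) {
	threshold := time.Now().Add(-c.stuckThreshold)
	for _, q := range c.queries {
		ctx, cancel := context.WithTimeout(context.Background(), c.queryTimeout)
		var count float64
		err := c.db.QueryRowContext(ctx, q.query, threshold).Scan(&count)
		cancel()
		if err != nil {
			log.Printf("terminating collector: %s query failed: %v", q.resourceType, err)
			continue
		}
		ch <- prometheus.MustNewConstMetric(c.stuckDesc, prometheus.GaugeValue, count, q.resourceType)
	}
}
```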
In `@pkg/services/cluster.go`:
- Around line 132-136: The cluster cascade path bypasses the node pool
soft-delete hook so nodepool termination metrics are not recorded; update the
cluster deletion flow (around the call to s.nodePoolDao.SoftDeleteByOwner) to
record nodepool terminations by invoking metrics.RecordTerminating("nodepool")
for the affected node pools (or ensure SoftDeleteByOwner emits that metric), so
the existing node_pool.go metric counts are preserved when deletions are
triggered by cluster cascade.
---
Outside diff comments:
In `@pkg/config/metrics.go`:
- Around line 16-30: Add explicit validation to reject non-positive
DeletionStuckThreshold values: in the MetricsConfig validation path (e.g., the
method that validates or finalizes MetricsConfig before use) check
MetricsConfig.DeletionStuckThreshold > 0 and return an error if it is <= 0;
update any constructor/factory logic (NewMetricsConfig remains fine) to rely on
this validation so the stuck-resource collector and alerting never receive a
zero or negative duration. Ensure the error message references
DeletionStuckThreshold so callers know which config field is invalid.
📒 Files selected for processing (13)
- charts/templates/prometheusrule.yaml
- charts/values.yaml
- cmd/hyperfleet-api/servecmd/cmd.go
- cmd/hyperfleet-api/server/metrics_middleware.go
- docs/metrics.md
- pkg/config/flags.go
- pkg/config/loader.go
- pkg/config/metrics.go
- pkg/metrics/deletion.go
- pkg/metrics/deletion_test.go
- pkg/services/cluster.go
- pkg/services/node_pool.go
- test/integration/deletion_metrics_test.go
Force-pushed from 6487c64 to d429144 (Compare).
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@charts/templates/prometheusrule.yaml`:
- Around line 16-37: The alert uses aggregation "max by (resource_type)" over
hyperfleet_api_resource_terminating_stuck which merges across releases because
the metric lacks a namespace label; don't attempt to add namespace to the
aggregation without the label. Fix by one of the suggested approaches: (a) add a
relabeling rule in the ServiceMonitor (or ServiceMonitor template) to preserve
__meta_kubernetes_namespace as a metric label and change the PrometheusRule
alerts HyperFleetResourceDeletionStuck and HyperFleetResourceDeletionTimeout to
aggregate with "max by (namespace, resource_type)"; or (b) emit "namespace" from
the application so hyperfleet_api_resource_terminating_stuck includes namespace
and then aggregate by namespace; or (c) require and validate that
serviceMonitor.namespace is set (documenting multi-release limitations) so
scraping is release-scoped. Reference symbols to change: the PrometheusRule
alerts HyperFleetResourceDeletionStuck and HyperFleetResourceDeletionTimeout,
the metric hyperfleet_api_resource_terminating_stuck, and the
ServiceMonitor/serviceMonitor.namespace/relabel_configs to implement option (a).
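For option (a), a sketch of the relabeling that copies the scrape target's namespace onto every metric, using standard Prometheus Operator `ServiceMonitor` fields; the chart's actual ServiceMonitor template, names, and selectors may differ:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hyperfleet-api            # illustrative name
spec:
  selector:
    matchLabels:
      app: hyperfleet-api         # illustrative selector
  endpoints:
    - port: metrics               # illustrative port name
      relabelings:
        - sourceLabels: [__meta_kubernetes_namespace]
          targetLabel: namespace
```

With that label in place, the alerts could keep `max by (namespace, resource_type)` without merging releases.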
✅ Files skipped from review due to trivial changes (1)
- pkg/metrics/deletion_test.go
🚧 Files skipped from review as they are similar to previous changes (9)
- pkg/config/flags.go
- pkg/services/cluster.go
- cmd/hyperfleet-api/server/metrics_middleware.go
- pkg/services/node_pool.go
- test/integration/deletion_metrics_test.go
- docs/metrics.md
- pkg/config/metrics.go
- charts/values.yaml
- pkg/metrics/deletion.go
Force-pushed from d429144 to 79ebbb0 (Compare).
Actionable comments posted: 2
♻️ Duplicate comments (1)
charts/templates/prometheusrule.yaml (1)
Lines 16-37: ⚠️ Potential issue | 🟠 Major — Align the alert aggregation with the metric labels.
`hyperfleet_api_resource_terminating_stuck` only exports `resource_type`, `component`, and `version`, so `max by (namespace, resource_type)` collapses releases that share Prometheus and does not scope the alert by namespace. Either add a namespace label at scrape/metric time or remove `namespace` from the aggregation. Please verify the label mismatch with a read-only search:

```bash
#!/bin/bash
set -euo pipefail
rg -n 'labelResourceType|labelComponent|labelVersion|stuckDesc|max by \(namespace, resource_type\)' pkg/metrics/deletion.go charts/templates/prometheusrule.yaml
```

Expected result: the metric definition has no namespace label while the alert rule groups by namespace, confirming the cross-release aggregation risk.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@charts/templates/prometheusrule.yaml` around lines 16 - 37, The alert aggregation uses "max by (namespace, resource_type)" but the metric hyperfleet_api_resource_terminating_stuck (defined in pkg/metrics/deletion.go via labelResourceType/labelComponent/labelVersion) does not include a namespace label, causing cross-release aggregation; fix the two alerts HyperFleetResourceDeletionStuck and HyperFleetResourceDeletionTimeout in charts/templates/prometheusrule.yaml by removing "namespace" from the aggregation (change "max by (namespace, resource_type)" to "max by (resource_type)") or alternatively ensure the metric is exported with a namespace label at scrape time—pick one approach and apply it consistently to both alert rules and their descriptions/runbook handling.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/metrics.md`:
- Around line 321-339: The PromQL aggregations are malformed and need to be
rewritten to the proper "sum by (...) (...)" form and the first rate() converted
to per-minute: change sum(rate(hyperfleet_api_resource_terminating_total[5m]))
by (resource_type) to sum by (resource_type)
(rate(hyperfleet_api_resource_terminating_total[5m])) * 60; change
sum(hyperfleet_api_resource_terminating_stuck) by (resource_type) to sum by
(resource_type) (hyperfleet_api_resource_terminating_stuck); for the average
terminating duration replace both numerator and denominator forms sum(... ) by
(resource_type) with sum by (resource_type) (...): sum by (resource_type)
(rate(hyperfleet_api_resource_terminating_duration_seconds_sum[5m])) / sum by
(resource_type)
(rate(hyperfleet_api_resource_terminating_duration_seconds_count[5m])); and
update the histogram_quantile call to histogram_quantile(0.99, sum by (le,
resource_type)
(rate(hyperfleet_api_resource_terminating_duration_seconds_bucket[5m]))).
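Collected for readability, the corrected queries spelled out in the comment above (a restatement, not additional changes):

```promql
# Deletion rate, per minute
sum by (resource_type) (rate(hyperfleet_api_resource_terminating_total[5m])) * 60

# Currently stuck resources
sum by (resource_type) (hyperfleet_api_resource_terminating_stuck)

# Average terminating duration
sum by (resource_type) (rate(hyperfleet_api_resource_terminating_duration_seconds_sum[5m]))
  / sum by (resource_type) (rate(hyperfleet_api_resource_terminating_duration_seconds_count[5m]))

# p99 terminating duration
histogram_quantile(0.99, sum by (le, resource_type) (rate(hyperfleet_api_resource_terminating_duration_seconds_bucket[5m])))
```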
In `@pkg/services/cluster.go`:
- Around line 143-150: The metric emission for nodepool termination is happening
after the follow-up read (FindSoftDeletedByOwner) which can fail and skip
emitting metrics; update the deletion flow to emit the metrics earlier or change
the SoftDeleteByOwner API to return the affected count/list. Concretely: either
modify SoftDeleteByOwner(...) to return (affectedCount int, affectedPools
[]NodePool, err error) and call metrics.RecordTerminating("nodepool") based on
that return value inside the deletion path, or move the loop that calls
metrics.RecordTerminating("nodepool") to immediately after the
SoftDeleteByOwner(...) call (before calling FindSoftDeletedByOwner), ensuring
metrics.RecordTerminating is invoked even if subsequent reads fail; adjust
references to SoftDeleteByOwner, FindSoftDeletedByOwner, and
metrics.RecordTerminating("nodepool") accordingly.
---
Duplicate comments:
In `@charts/templates/prometheusrule.yaml`:
- Around line 16-37: The alert aggregation uses "max by (namespace,
resource_type)" but the metric hyperfleet_api_resource_terminating_stuck
(defined in pkg/metrics/deletion.go via
labelResourceType/labelComponent/labelVersion) does not include a namespace
label, causing cross-release aggregation; fix the two alerts
HyperFleetResourceDeletionStuck and HyperFleetResourceDeletionTimeout in
charts/templates/prometheusrule.yaml by removing "namespace" from the
aggregation (change "max by (namespace, resource_type)" to "max by
(resource_type)") or alternatively ensure the metric is exported with a
namespace label at scrape time—pick one approach and apply it consistently to
both alert rules and their descriptions/runbook handling.
🚧 Files skipped from review as they are similar to previous changes (5)
- cmd/hyperfleet-api/server/metrics_middleware.go
- pkg/config/flags.go
- pkg/services/node_pool.go
- pkg/config/loader.go
- pkg/config/metrics.go
Force-pushed from 79ebbb0 to a29272c (Compare).
♻️ Duplicate comments (2)
pkg/services/cluster.go (1)
Lines 138-145: ⚠️ Potential issue | 🟠 Major — Nodepool terminating events can be permanently missed on read failure.
If `FindSoftDeletedByOwner` fails at line 138 after `SoftDeleteByOwner` succeeded at line 134, the function exits before lines 143-145 emit nodepool metrics. Later retries short-circuit at line 118, so those missed events are never recovered. Emit nodepool metric counts from the cascade write result (e.g., affected rows/returned IDs) instead of depending on a follow-up read.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/services/cluster.go` around lines 138 - 145, The current implementation records nodepool terminating metrics by reading soft-deleted nodepools via FindSoftDeletedByOwner (nodePools) after SoftDeleteByOwner, so if that read fails the metrics are never emitted; update the cascade delete flow to emit metrics from the cascade write result instead of the follow-up read: change SoftDeleteByOwner (or the DAO method it calls) to return the affected row count or the list of deleted IDs, have the caller capture that return value (instead of relying on nodePools from FindSoftDeletedByOwner), and call metrics.RecordTerminating the appropriate number of times (or once with the count) based on that returned count/IDs; retain or remove the FindSoftDeletedByOwner call as needed but ensure metrics emission no longer depends on it.
charts/templates/prometheusrule.yaml (1)
Line 17: ⚠️ Potential issue | 🟠 Major — `namespace` scoping in the alert expression is currently ineffective.
`hyperfleet_api_resource_terminating_stuck` doesn't expose a `namespace` label, so `max by (namespace, resource_type)` still collapses across releases/namespaces for each `resource_type`. This can mix signals across deployments. Run this to verify label shape and whether ServiceMonitor relabeling injects `namespace`:

```bash
#!/bin/bash
set -euo pipefail
echo "== Collector metric descriptor/emit labels =="
rg -n -C3 'resource_terminating_stuck|NewDesc\(|MustNewConstMetric\(' pkg/metrics/deletion.go
echo
echo "== Alert expressions =="
rg -n -C2 'HyperFleetResourceDeletion(Stuck|Timeout)|max by|hyperfleet_api_resource_terminating_stuck' charts/templates/prometheusrule.yaml
echo
echo "== ServiceMonitor relabeling (namespace injection) =="
fd -i 'servicemonitor.yaml' charts/templates | while read -r f; do
  echo "--- ${f} ---"
  rg -n -C3 '__meta_kubernetes_namespace|metricRelabelings|relabelings|targetLabels|namespace' "$f" || true
done
```

Expected: no `namespace` label in the collector metric unless explicitly injected by scrape relabeling; if absent, the current `by (namespace, resource_type)` doesn't isolate per namespace.
Also applies to: line 28
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@charts/templates/prometheusrule.yaml` at line 17, The alert groups by a non-existent namespace label so signals are mixed; either inject namespace at scrape time or stop grouping by it: update the Prometheus relabeling in your ServiceMonitor (servicemonitor.yaml) to add target label "namespace" from __meta_kubernetes_namespace so the collector metric hyperfleet_api_resource_terminating_stuck has a namespace label, or change the rule in charts/templates/prometheusrule.yaml (expr referencing max by (namespace, resource_type)) to group only by existing labels (e.g., max by (resource_type)) and remove namespace from the by() clause to avoid incorrect scoping.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@charts/templates/prometheusrule.yaml`:
- Line 17: The alert groups by a non-existent namespace label so signals are
mixed; either inject namespace at scrape time or stop grouping by it: update the
Prometheus relabeling in your ServiceMonitor (servicemonitor.yaml) to add target
label "namespace" from __meta_kubernetes_namespace so the collector metric
hyperfleet_api_resource_terminating_stuck has a namespace label, or change the
rule in charts/templates/prometheusrule.yaml (expr referencing max by
(namespace, resource_type)) to group only by existing labels (e.g., max by
(resource_type)) and remove namespace from the by() clause to avoid incorrect
scoping.
In `@pkg/services/cluster.go`:
- Around line 138-145: The current implementation records nodepool terminating
metrics by reading soft-deleted nodepools via FindSoftDeletedByOwner (nodePools)
after SoftDeleteByOwner, so if that read fails the metrics are never emitted;
update the cascade delete flow to emit metrics from the cascade write result
instead of the follow-up read: change SoftDeleteByOwner (or the DAO method it
calls) to return the affected row count or the list of deleted IDs, have the
caller capture that return value (instead of relying on nodePools from
FindSoftDeletedByOwner), and call metrics.RecordTerminating the appropriate
number of times (or once with the count) based on that returned count/IDs;
retain or remove the FindSoftDeletedByOwner call as needed but ensure metrics
emission no longer depends on it.
✅ Files skipped from review due to trivial changes (3)
- pkg/config/flags.go
- pkg/services/node_pool.go
- charts/values.yaml
🚧 Files skipped from review as they are similar to previous changes (4)
- cmd/hyperfleet-api/servecmd/cmd.go
- pkg/config/loader.go
- pkg/config/metrics.go
- docs/metrics.md
Force-pushed from 964d474 to 3f46062 (Compare).
```yaml
        summary: "HyperFleet resources stuck in Pending Deletion state"
        description: >-
          {{ "{{ $value }}" }} {{ "{{ $labels.resource_type }}" }} resource(s) have been in
          Pending Deletion state for more than {{ .Values.prometheusRule.rules.deletionStuck.for | default "5m" }}.
```
Is this description correct? Shouldn't it be hitting 35m rather than 5m?
Good catch! You're right — the pending_deletion_stuck metric already embeds a 30m threshold from the collector (config.metrics.deletion_stuck_threshold), so by the time the Warning alert fires after its 5m for delay, the resource has actually been stuck for ~35m, not 5m.
Fixed in 39d7989 — the description now shows both components: 30m (stuck threshold) + 5m (alert delay).
```yaml
      for: "5m"
      runbookUrl: ""
    deletionTimeout:
      for: "30m"
```
Curious what was your reasoning behind these defaults?
The deletionStuck.for: 5m is meant as an early warning — the collector already flags resources stuck beyond the 30m threshold (deletion_stuck_threshold), so 5m of sustained signal is just enough to filter out transient scrape noise before paging.
The deletionTimeout.for: 30m escalates to critical — at that point the resource has been stuck for ~60m total (30m threshold + 30m alert delay), which signals something is genuinely broken and needs immediate attention.
Happy to adjust if you think different values make more sense for the team's SLOs.
I am leaning towards stupid defaults TBH, to avoid unwanted noise for our partner teams. Most will probably just roll with the defaults until they align on their SLOs. I would lean on warning after an hour and timeout at 2 or 3 hours.
Makes total sense — bumped the defaults to conservative values: warning at 30m (fires ~1h total) and critical at 2h (fires ~2.5h total). Better to start quiet and let GCP/ROSA teams tighten when they define their SLOs, rather than generating noise out of the box.
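For reference, a sketch of how those adjusted defaults would look in `values.yaml`, assuming the `prometheusRule.rules` keys quoted earlier in this thread; exact key names may differ:

```yaml
prometheusRule:
  rules:
    deletionStuck:
      for: "30m"     # warning fires ~1h after soft-delete (30m stuck threshold + 30m alert delay)
      runbookUrl: ""
    deletionTimeout:
      for: "2h"      # critical fires ~2.5h after soft-delete (30m stuck threshold + 2h alert delay)
      runbookUrl: ""
```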
Actionable comments posted: 1
♻️ Duplicate comments (2)
pkg/services/cluster.go (1)
Lines 138-145: ⚠️ Potential issue | 🟠 Major — Nodepool metric emission still depends on a brittle follow-up read.
If `SoftDeleteByOwner` succeeds but `FindSoftDeletedByOwner` fails, nodepools are already soft-deleted but no `nodepool` pending-deletion metrics are emitted, and retries won't replay because the cluster is already marked deleted. Consider moving emission to data returned directly by `SoftDeleteByOwner` (e.g., affected count/list) so it doesn't depend on a second query.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/services/cluster.go` around lines 138 - 145, The current flow calls s.nodePoolDao.SoftDeleteByOwner(...) then separately calls s.nodePoolDao.FindSoftDeletedByOwner(...) to emit metrics, which is brittle if the second read fails; modify SoftDeleteByOwner on nodePoolDao to return the affected nodepool count or list (e.g., return (count int, ids []string, error)) and update callers in cluster.go to call that result directly and invoke metrics.RecordPendingDeletion("nodepool") using the returned count/list so metric emission does not depend on FindSoftDeletedByOwner; ensure error handling still returns on SoftDeleteByOwner failure and that the new return shape is propagated where needed.
charts/templates/prometheusrule.yaml (1)
Line 17: ⚠️ Potential issue | 🟠 Major — `namespace` grouping is still ineffective for this metric.
On line 17 and line 29, `max by (namespace, resource_type)` won't scope alerts per namespace unless the metric actually has a `namespace` label. `hyperfleet_api_resource_pending_deletion_stuck` is emitted without `namespace` in `pkg/metrics/deletion.go`, so this can still merge signals across releases.

```bash
#!/bin/bash
# Verify metric labels for pending deletion stuck collector
rg -n -C3 'resource_pending_deletion_stuck|NewDesc\(|labelResourceType|labelComponent|labelVersion' pkg/metrics/deletion.go
# Verify whether ServiceMonitor injects namespace as a metric label
rg -n -C3 'ServiceMonitor|metricRelabelings|relabelings|__meta_kubernetes_namespace|namespace' charts/templates
```

Expected result:
- `pkg/metrics/deletion.go` shows only `resource_type` as a variable label for this metric.
- No relabeling that preserves the Kubernetes namespace into metric labels, confirming the aggregation still cannot partition per namespace.

Also applies to: line 29
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@charts/templates/prometheusrule.yaml` at line 17, The alert groups by (namespace, resource_type) but the metric hyperfleet_api_resource_pending_deletion_stuck is emitted without a namespace label (see pkg/metrics/deletion.go), so change one of two things: either add a "namespace" label to the metric at its declaration/recording in pkg/metrics/deletion.go (update the NewDesc/label list and ensure the collector sets the namespace value when observing), or if you cannot emit namespace, remove namespace from the PrometheusRule expression in charts/templates/prometheusrule.yaml and use max by (resource_type) (and the identical change for the other occurrence at line 29) so alerts aren’t misleadingly grouped by a non-existent label.
🧹 Nitpick comments (1)
test/integration/deletion_metrics_test.go (1)
Lines 33-35: Assert delete response status to prevent false-positive test results.
Right now these checks only assert transport success (`err == nil`). If delete returns non-2xx, the second subtest can still pass spuriously.
Proposed test hardening:

```diff
- _, err = client.DeleteClusterByIdWithResponse(ctx, cluster.ID, test.WithAuthToken(ctx))
+ delResp, err := client.DeleteClusterByIdWithResponse(ctx, cluster.ID, test.WithAuthToken(ctx))
  Expect(err).NotTo(HaveOccurred())
+ Expect(delResp.StatusCode()).To(BeNumerically(">=", 200))
+ Expect(delResp.StatusCode()).To(BeNumerically("<", 300))
  ...
- _, err = client.DeleteClusterByIdWithResponse(ctx, cluster.ID, test.WithAuthToken(ctx))
+ delResp, err := client.DeleteClusterByIdWithResponse(ctx, cluster.ID, test.WithAuthToken(ctx))
  Expect(err).NotTo(HaveOccurred())
+ Expect(delResp.StatusCode()).To(BeNumerically(">=", 200))
+ Expect(delResp.StatusCode()).To(BeNumerically("<", 300))
```

Also applies to: lines 61-63
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/integration/deletion_metrics_test.go` around lines 33 - 35, The test currently only checks that client.DeleteClusterByIdWithResponse returned no transport error (err == nil) which can hide non-2xx delete responses; update the two places calling DeleteClusterByIdWithResponse (the call using cluster.ID and the second similar call around lines 61-63) to capture the response object and assert its HTTP status is the expected success code (e.g., 200/204) using the response's status/status code accessor before or in addition to Expect(err).NotTo(HaveOccurred()) so the test fails on non-2xx responses; reference the DeleteClusterByIdWithResponse call, the returned response variable, and the Expect assertions when making this change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/metrics/deletion.go`:
- Around line 135-136: Metrics scrape COUNT queries for clusters and node_pools
(the entries with "SELECT COUNT(*) FROM clusters WHERE deleted_time IS NOT NULL
AND deleted_time < $1" and "SELECT COUNT(*) FROM node_pools WHERE deleted_time
IS NOT NULL AND deleted_time < $1" in pkg/metrics/deletion.go) need supporting
indexes; add a migration that creates either simple indexes on deleted_time
(CREATE INDEX ON clusters(deleted_time), CREATE INDEX ON
node_pools(deleted_time)) or partial indexes filtering NOT NULL (CREATE INDEX
... WHERE deleted_time IS NOT NULL) and also drop the stale indexes left on the
old deleted_at column (drop idx_clusters_deleted_at and
idx_node_pools_deleted_at) so the metrics queries use the new indexes and avoid
full-table scans.
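As a sketch, the suggested indexes expressed as plain PostgreSQL DDL, assuming the table and column names from the comment; the actual migration is a Go file and the index names here are illustrative:

```sql
-- Partial indexes so the scrape-time COUNT queries avoid full-table scans.
CREATE INDEX IF NOT EXISTS idx_clusters_deleted_time
    ON clusters (deleted_time)
    WHERE deleted_time IS NOT NULL;

CREATE INDEX IF NOT EXISTS idx_node_pools_deleted_time
    ON node_pools (deleted_time)
    WHERE deleted_time IS NOT NULL;

-- Drop the stale indexes on the old deleted_at column if they still exist.
DROP INDEX IF EXISTS idx_clusters_deleted_at;
DROP INDEX IF EXISTS idx_node_pools_deleted_at;
```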
---
Duplicate comments:
In `@charts/templates/prometheusrule.yaml`:
- Line 17: The alert groups by (namespace, resource_type) but the metric
hyperfleet_api_resource_pending_deletion_stuck is emitted without a namespace
label (see pkg/metrics/deletion.go), so change one of two things: either add a
"namespace" label to the metric at its declaration/recording in
pkg/metrics/deletion.go (update the NewDesc/label list and ensure the collector
sets the namespace value when observing), or if you cannot emit namespace,
remove namespace from the PrometheusRule expression in
charts/templates/prometheusrule.yaml and use max by (resource_type) (and the
identical change for the other occurrence at line 29) so alerts aren’t
misleadingly grouped by a non-existent label.
In `@pkg/services/cluster.go`:
- Around line 138-145: The current flow calls
s.nodePoolDao.SoftDeleteByOwner(...) then separately calls
s.nodePoolDao.FindSoftDeletedByOwner(...) to emit metrics, which is brittle if
the second read fails; modify SoftDeleteByOwner on nodePoolDao to return the
affected nodepool count or list (e.g., return (count int, ids []string, error))
and update callers in cluster.go to call that result directly and invoke
metrics.RecordPendingDeletion("nodepool") using the returned count/list so
metric emission does not depend on FindSoftDeletedByOwner; ensure error handling
still returns on SoftDeleteByOwner failure and that the new return shape is
propagated where needed.
---
Nitpick comments:
In `@test/integration/deletion_metrics_test.go`:
- Around line 33-35: The test currently only checks that
client.DeleteClusterByIdWithResponse returned no transport error (err == nil)
which can hide non-2xx delete responses; update the two places calling
DeleteClusterByIdWithResponse (the call using cluster.ID and the second similar
call around lines 61-63) to capture the response object and assert its HTTP
status is the expected success code (e.g., 200/204) using the response's
status/status code accessor before or in addition to
Expect(err).NotTo(HaveOccurred()) so the test fails on non-2xx responses;
reference the DeleteClusterByIdWithResponse call, the returned response
variable, and the Expect assertions when making this change.
🚧 Files skipped from review as they are similar to previous changes (5)
- pkg/config/flags.go
- cmd/hyperfleet-api/server/metrics_middleware.go
- charts/values.yaml
- pkg/config/metrics.go
- docs/metrics.md
Force-pushed from 39d7989 to 7b9a35f (Compare).
Add Prometheus metrics to track the lifecycle of resource deletion:
- pending_deletion_total counter: tracks resources entering Pending Deletion
- pending_deletion_duration_seconds histogram: measures soft-delete to hard-delete
- pending_deletion_stuck gauge (collector): reports resources stuck beyond threshold

Includes PrometheusRule alerts (warning at 1h, critical at 2.5h total), partial indexes on deleted_time for efficient collector queries, and integration tests for the full metrics pipeline.
Force-pushed from 7b9a35f to 429af79 (Compare).
♻️ Duplicate comments (1)
pkg/services/cluster.go (1)
Lines 134-160: ⚠️ Potential issue | 🟠 Major — Emit nodepool metrics from the write path, not the follow-up read.
This still derives the `nodepool` counter from `FindSoftDeletedByOwner()`, so a transient read failure drops the metric entirely, and any nodepools that were already soft-deleted before this cascade will be counted again. Please have `SoftDeleteByOwner()` return the affected rows/IDs, or move the metric emission into that write path.
♻️ Proposed fix

```diff
- if cascadeErr := s.nodePoolDao.SoftDeleteByOwner(ctx, id, t, deletedBy); cascadeErr != nil {
-     return nil, handleSoftDeleteError("NodePool", cascadeErr)
- }
-
- nodePools, err := s.nodePoolDao.FindSoftDeletedByOwner(ctx, id)
- if err != nil {
-     return nil, errors.GeneralError("Failed to fetch cascade-deleted nodepools: %s", err)
- }
-
- for range nodePools {
+ nodePools, cascadeErr := s.nodePoolDao.SoftDeleteByOwner(ctx, id, t, deletedBy)
+ if cascadeErr != nil {
+     return nil, handleSoftDeleteError("NodePool", cascadeErr)
+ }
+
+ for range nodePools {
      metrics.RecordPendingDeletion("nodepool")
  }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/services/cluster.go` around lines 134 - 160, The current metric emission uses FindSoftDeletedByOwner() after SoftDeleteByOwner(), which can lose metrics on transient reads and double-count pre-existing soft-deletes; change nodePoolDao.SoftDeleteByOwner(ctx, id, t, deletedBy) to return the list of affected nodepool IDs (or count), have SoftDeleteByOwner itself call metrics.RecordPendingDeletion("nodepool") for each affected ID (or record the count), and update this caller (and any other callers) to stop iterating over results from FindSoftDeletedByOwner(); keep UpdateClusterStatusFromAdapters and batchUpdateNodePoolStatusesFromAdapters calls as-is but remove the post-read metric loop so metrics are emitted reliably from the write path.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/services/cluster.go`:
- Around line 134-160: The current metric emission uses FindSoftDeletedByOwner()
after SoftDeleteByOwner(), which can lose metrics on transient reads and
double-count pre-existing soft-deletes; change
nodePoolDao.SoftDeleteByOwner(ctx, id, t, deletedBy) to return the list of
affected nodepool IDs (or count), have SoftDeleteByOwner itself call
metrics.RecordPendingDeletion("nodepool") for each affected ID (or record the
count), and update this caller (and any other callers) to stop iterating over
results from FindSoftDeletedByOwner(); keep UpdateClusterStatusFromAdapters and
batchUpdateNodePoolStatusesFromAdapters calls as-is but remove the post-read
metric loop so metrics are emitted reliably from the write path.
📒 Files selected for processing (15)
- charts/templates/prometheusrule.yaml
- charts/values.yaml
- cmd/hyperfleet-api/servecmd/cmd.go
- cmd/hyperfleet-api/server/metrics_middleware.go
- docs/metrics.md
- pkg/config/flags.go
- pkg/config/loader.go
- pkg/config/metrics.go
- pkg/db/migrations/202604290001_add_deleted_time_indexes.go
- pkg/db/migrations/migration_structs.go
- pkg/metrics/deletion.go
- pkg/metrics/deletion_test.go
- pkg/services/cluster.go
- pkg/services/node_pool.go
- test/integration/deletion_metrics_test.go
✅ Files skipped from review due to trivial changes (3)
- pkg/config/flags.go
- pkg/services/node_pool.go
- docs/metrics.md
🚧 Files skipped from review as they are similar to previous changes (6)
- pkg/config/loader.go
- cmd/hyperfleet-api/server/metrics_middleware.go
- charts/values.yaml
- test/integration/deletion_metrics_test.go
- pkg/config/metrics.go
- pkg/metrics/deletion_test.go
Summary

- `hyperfleet_api_resource_terminating_total` (counter) — tracks resources entering Terminating state
- `hyperfleet_api_resource_terminating_duration_seconds` (histogram) — measures soft-delete to hard-delete duration (populated when hard-delete flow lands)
- `hyperfleet_api_resource_terminating_stuck` (gauge via collector) — queries DB on each scrape for resources stuck beyond configurable threshold
- `HyperFleetResourceDeletionStuck` (warning) — resources stuck >30min for 5m
- `HyperFleetResourceDeletionTimeout` (critical) — resources stuck >30min for 30m
- `--metrics-deletion-stuck-threshold` config flag (default 30m)
- `SoftDelete` in cluster and nodepool services instrumented with `RecordTerminating()`
- `TerminatingCollector` exercised against real PostgreSQL via testcontainers
- Documentation in `docs/metrics.md`

Test plan

- `make test` — 796 unit tests passing
- `make lint` — 0 issues
- `make test-helm` — all chart templates OK (including new PrometheusRule)
- `make test-integration` — 93 integration tests passing (2 new for TerminatingCollector)

HYPERFLEET-856