OCPBUGS-92837: test/router: wait for all per-route metrics before asserting#31344
OCPBUGS-92837: test/router: wait for all per-route metrics before asserting#31344mkowalski wants to merge 1 commit into
Conversation
The HAProxy router metrics test exits its retry loop as soon as haproxy_backend_connections_total reaches the expected count, but then immediately asserts on other per-route metrics like haproxy_server_http_responses_total. The HAProxy exporter has a scrape interval (typically 5s), so these metrics may not be populated in the same scrape that satisfied the connections check. This causes a 100% failure rate on 5.0 Azure micro-upgrade jobs (regression #42639 / OCPBUGS-92837) because the post-loop assertions find nil where they expect populated gauges. Fix by adding haproxy_server_http_responses_total 2xx to the loop exit condition so we only proceed when all per-route backend stats are confirmed present in the same metrics scrape. Signed-off-by: Mateusz Kowalski <mko@redhat.com> Generated-by: AI Signed-off-by: Mateusz Kowalski <mko@redhat.com>
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
WalkthroughThe HAProxy router metrics test now keeps polling until route backend connection counts and per-server 2xx response metrics are both present in the same scrape. It logs a retry message while waiting instead of succeeding as soon as backend connections appear. ChangesHAProxy route metrics polling
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 15✅ Passed checks (15 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mkowalski The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
test/extended/router/metrics.go (1)
168-186: 🩺 Stability & Availability | 🔵 TrivialUse a context-aware poll here.
This loop can run for up to 240 seconds, but
wait.PollImmediatewon’t stop early if the spec is canceled. If this test can takeg.SpecContext, switch towait.PollUntilContextTimeoutso the retry exits with the test context.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/extended/router/metrics.go` around lines 168 - 186, The polling loop in the metrics test uses a fixed timeout and will not exit early when the spec context is canceled. Update the retry logic in the metrics check to use a context-aware poll with g.SpecContext, replacing the current wait.PollImmediate call so it respects test cancellation. Keep the existing metric validation and retry conditions in the same callback logic, but wire them through the context-aware polling API.Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@test/extended/router/metrics.go`:
- Around line 168-186: The polling loop in the metrics test uses a fixed timeout
and will not exit early when the spec context is canceled. Update the retry
logic in the metrics check to use a context-aware poll with g.SpecContext,
replacing the current wait.PollImmediate call so it respects test cancellation.
Keep the existing metric validation and retry conditions in the same callback
logic, but wire them through the context-aware polling API.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: fdf94624-a83b-410c-a39c-ae02b928ab11
📒 Files selected for processing (1)
test/extended/router/metrics.go
|
@mkowalski: This pull request references Jira Issue OCPBUGS-92837, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@mkowalski: This pull request references Jira Issue OCPBUGS-92837, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Scheduling required tests: |
There was a problem hiding this comment.
Pull request overview
This PR improves stability of the HAProxy router metrics extended test by ensuring the retry loop doesn’t exit until required per-route metrics are available, accounting for the HAProxy exporter’s scrape interval.
Changes:
- Adds contextual comments explaining exporter scrape-interval lag for per-route metrics.
- Extends the
wait.PollImmediateexit condition to also wait forhaproxy_server_http_responses_total{code="2xx"}before proceeding to post-loop assertions.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| backendConns := findGaugesWithLabels(metrics["haproxy_backend_connections_total"], routeLabels) | ||
| if len(backendConns) > 0 && backendConns[0] >= float64(times) { | ||
| // Also verify that the HTTP response metrics have been | ||
| // populated for this route before exiting the loop. | ||
| // The exporter may not refresh all stats atomically, so |
There was a problem hiding this comment.
Shouldn't we do it step by step and handle whenever necessary? Like this we can say it about every single metric. No?
|
@mkowalski: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary
The HAProxy router metrics test exits its retry loop as soon as
haproxy_backend_connections_totalreaches the expected count, but thenimmediately asserts on other per-route metrics like
haproxy_server_http_responses_total. The HAProxy exporter has a scrapeinterval (typically 5s), so these metrics may not be populated in the same
scrape that satisfied the connections check.
This causes a 100% failure rate on 5.0 Azure micro-upgrade jobs
(Sippy regression #42639 /
OCPBUGS-92837) because
the post-loop assertions find
nilwhere they expect populated gauges.Root Cause
The retry loop at lines 164-186 waits for:
haproxy_server_upto have 2 non-zero entries (both backend servers UP)haproxy_backend_connections_total >= 10for the test routeOnce satisfied, the loop exits. But the post-loop assertions immediately check
haproxy_server_http_responses_totalwithcode=2xx— which may not yetbe populated because the HAProxy exporter scrapes stats on a 5-second interval.
The connections metric can appear in one exporter scrape cycle while the HTTP
responses metric only appears in the next one.
Fix
Add
haproxy_server_http_responses_totalwithcode=2xxto the loop exitcondition. The loop now only returns success when all per-route backend stats
are confirmed present in the same metrics scrape. This adds at most one extra
exporter scrape cycle (~5s) to the wait, well within the 240s timeout.
Verification
metrics.go:227: Expected <[]float64 | len:0, cap:0>: nilrequired per-route stats before proceeding
🤖 This PR was generated by AI on behalf of @mkowalski, who has reviewed it.
Summary by CodeRabbit