OCPBUGS-92837: test/router: wait for all per-route metrics before asserting by mkowalski · Pull Request #31344 · openshift/origin

mkowalski · 2026-06-26T13:52:35Z

Summary

The HAProxy router metrics test exits its retry loop as soon as
haproxy_backend_connections_total reaches the expected count, but then
immediately asserts on other per-route metrics like
haproxy_server_http_responses_total. The HAProxy exporter has a scrape
interval (typically 5s), so these metrics may not be populated in the same
scrape that satisfied the connections check.

This causes a 100% failure rate on 5.0 Azure micro-upgrade jobs
(Sippy regression #42639 / OCPBUGS-92837) because the post-loop assertions
find nil where they expect populated gauges.

Root Cause

The retry loop at lines 164-186 waits for:

- haproxy_server_up to have 2 non-zero entries (both backend servers UP)
- haproxy_backend_connections_total >= 10 for the test route

Once satisfied, the loop exits. But the post-loop assertions immediately check
haproxy_server_http_responses_total with code=2xx — which may not yet
be populated because the HAProxy exporter scrapes stats on a 5-second interval.
The connections metric can appear in one exporter scrape cycle while the HTTP
responses metric only appears in the next one.
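
The race described above can be sketched with a simulated sequence of scrapes. This is an illustration only, not the code in test/extended/router/metrics.go: the `snapshot` type, `oldReady`, `firstAccepted`, and the sample values are hypothetical, and each snapshot is assumed to be already filtered to the test route.

```go
package main

import "fmt"

// snapshot is a hypothetical stand-in for one parsed metrics scrape,
// already filtered to the test route: metric name -> sample values.
type snapshot map[string][]float64

// oldReady mirrors the original exit condition as described above:
// two servers up and enough backend connections. It never looks at
// haproxy_server_http_responses_total.
func oldReady(s snapshot, times int) bool {
	up := 0
	for _, v := range s["haproxy_server_up"] {
		if v != 0 {
			up++
		}
	}
	conns := s["haproxy_backend_connections_total"]
	return up == 2 && len(conns) > 0 && conns[0] >= float64(times)
}

// firstAccepted returns the index of the first scrape that satisfies
// ready, or -1 if none does.
func firstAccepted(scrapes []snapshot, ready func(snapshot) bool) int {
	for i, s := range scrapes {
		if ready(s) {
			return i
		}
	}
	return -1
}

func main() {
	const times = 10
	// Invented timeline: the responses metric lags the connections
	// metric by one exporter cycle.
	scrapes := []snapshot{
		{"haproxy_server_up": {1, 1}},
		{"haproxy_server_up": {1, 1}, "haproxy_backend_connections_total": {10}},
		{"haproxy_server_up": {1, 1}, "haproxy_backend_connections_total": {10},
			"haproxy_server_http_responses_total": {5, 5}},
	}
	i := firstAccepted(scrapes, func(s snapshot) bool { return oldReady(s, times) })
	responses := scrapes[i]["haproxy_server_http_responses_total"]
	fmt.Printf("loop exits on scrape %d with %d response samples\n", i, len(responses))
}
```

In this invented timeline the old condition accepts the second scrape, whose responses slice is still empty, which is the shape of the nil the post-loop assertion reports.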

Fix

Add haproxy_server_http_responses_total with code=2xx to the loop exit
condition. The loop now only returns success when all per-route backend stats
are confirmed present in the same metrics scrape. This adds at most one extra
exporter scrape cycle (~5s) to the wait, well within the 240s timeout.
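
A minimal sketch of the stricter exit condition, under stated assumptions: `snapshot` and `routeStatsReady` are hypothetical names, the map is assumed to hold samples already filtered to the test route and to code=2xx, and the haproxy_server_up check is left out for brevity. The real change lives inside the existing poll callback in the test.

```go
package main

import "fmt"

// snapshot is a hypothetical stand-in for one parsed metrics scrape,
// already filtered to the test route (and to code=2xx for responses).
type snapshot map[string][]float64

// routeStatsReady reports whether a single scrape contains both the
// expected connection count and at least one 2xx response sample.
// Checking both against the same snapshot is the point of the fix.
func routeStatsReady(s snapshot, times int) bool {
	conns := s["haproxy_backend_connections_total"]
	if len(conns) == 0 || conns[0] < float64(times) {
		return false // keep polling: not enough connections recorded yet
	}
	// Connections alone no longer end the loop; the response metric
	// must be present in this same scrape.
	return len(s["haproxy_server_http_responses_total"]) > 0
}

func main() {
	connsOnly := snapshot{"haproxy_backend_connections_total": {10}}
	complete := snapshot{
		"haproxy_backend_connections_total":   {10},
		"haproxy_server_http_responses_total": {5, 5},
	}
	fmt.Println("connections only:", routeStatsReady(connsOnly, 10))
	fmt.Println("connections and responses:", routeStatsReady(complete, 10))
}
```

Because the predicate is evaluated against one snapshot, the map handed to the post-loop assertions is the same one that satisfied every condition.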

Verification

- Failure signature: metrics.go:227: Expected <[]float64 | len:0, cap:0>: nil
- All 11/11 failing runs show this pattern
- The fix ensures the metrics map used for post-loop assertions contains all
  required per-route stats before proceeding

🤖 This PR was generated by AI on behalf of @mkowalski, who has reviewed it.

Summary by CodeRabbit

Tests
- Improved the reliability of router metrics validation by waiting for all expected per-route statistics to appear before considering the check successful.
- Added additional retry handling so metrics checks continue until connection and response data are fully available.

The HAProxy router metrics test exits its retry loop as soon as haproxy_backend_connections_total reaches the expected count, but then immediately asserts on other per-route metrics like haproxy_server_http_responses_total. The HAProxy exporter has a scrape interval (typically 5s), so these metrics may not be populated in the same scrape that satisfied the connections check.

This causes a 100% failure rate on 5.0 Azure micro-upgrade jobs (regression #42639 / OCPBUGS-92837) because the post-loop assertions find nil where they expect populated gauges.

Fix by adding haproxy_server_http_responses_total 2xx to the loop exit condition so we only proceed when all per-route backend stats are confirmed present in the same metrics scrape.

Signed-off-by: Mateusz Kowalski <mko@redhat.com>
Generated-by: AI
Signed-off-by: Mateusz Kowalski <mko@redhat.com>

openshift-merge-bot · 2026-06-26T13:52:37Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: automatic mode

coderabbitai · 2026-06-26T13:52:56Z

Walkthrough

The HAProxy router metrics test now keeps polling until route backend connection counts and per-server 2xx response metrics are both present in the same scrape. It logs a retry message while waiting instead of succeeding as soon as backend connections appear.

Changes

HAProxy route metrics polling

Layer / File(s)	Summary
Poll for route metrics `test/extended/router/metrics.go`	The poll loop now waits for `haproxy_backend_connections_total` and `haproxy_server_http_responses_total{code=2xx}` to be populated together before succeeding, and logs retries while either metric is still missing.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 15

✅ Passed checks (15 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	The modified file’s Ginkgo titles are static strings; the PR changes only test body polling logic, not any dynamic or unstable test names.
Test Structure And Quality	✅ Passed	The changed test keeps a single integration behavior, uses bounded PollImmediate timeouts, and follows existing router-test setup/cleanup patterns.
Microshift Test Compatibility	✅ Passed	No new Ginkgo test was added; the existing router metrics spec is already skipped on MicroShift and the route test has an [apigroup:route.openshift.io] tag.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	The modified router metrics test only waits for additional metrics; it has no multi-node/HA assumptions, and no node/topology checks were added.
Topology-Aware Scheduling Compatibility	✅ Passed	Only router test polling logic changed; no manifests/controllers or topology-based scheduling constraints were introduced.
Ote Binary Stdout Contract	✅ Passed	The PR only changes router metrics test logic; the modified file has no process-level stdout writes, init/TestMain, or klog/log stdout setup.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	The changed test uses net.JoinHostPort/IPUrl for host formatting and only talks to cluster-internal router metrics; no new IPv4-only or external connectivity assumptions were added.
No-Weak-Crypto	✅ Passed	The PR only updates a router metrics test retry condition; the edited code contains no weak-crypto, custom crypto, or secret-comparison logic.
Container-Privileges	✅ Passed	The PR only updates test/extended/router/metrics.go; no container/K8s manifests or security-context fields like privileged, hostNetwork, or allowPrivilegeEscalation were added.
No-Sensitive-Data-In-Logs	✅ Passed	The only new log message is a generic retry notice; no passwords, tokens, PII, hostnames, or customer data were added to logs.
Title check	✅ Passed	The title is concise and accurately summarizes the main test change.


openshift-ci · 2026-06-26T13:53:20Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mkowalski

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~test/extended/router/OWNERS~~ [mkowalski]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai

🧹 Nitpick comments (1)

test/extended/router/metrics.go (1)
168-186: 🩺 Stability & Availability | 🔵 Trivial

Use a context-aware poll here.

This loop can run for up to 240 seconds, but wait.PollImmediate won’t stop early if the spec is canceled. If this test can take g.SpecContext, switch to wait.PollUntilContextTimeout so the retry exits with the test context.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/router/metrics.go` around lines 168 - 186, The polling loop in
the metrics test uses a fixed timeout and will not exit early when the spec
context is canceled. Update the retry logic in the metrics check to use a
context-aware poll with g.SpecContext, replacing the current wait.PollImmediate
call so it respects test cancellation. Keep the existing metric validation and
retry conditions in the same callback logic, but wire them through the
context-aware polling API.
Source: Path instructions

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: fdf94624-a83b-410c-a39c-ae02b928ab11

📥 Commits

Reviewing files that changed from the base of the PR and between 817fa8a and 465f2f7.

📒 Files selected for processing (1)

test/extended/router/metrics.go

openshift-ci-robot · 2026-06-26T14:11:20Z

@mkowalski: This pull request references Jira Issue OCPBUGS-92837, which is invalid:

expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

mkowalski · 2026-06-26T14:11:57Z

/jira refresh

openshift-ci-robot · 2026-06-26T14:12:05Z

@mkowalski: This pull request references Jira Issue OCPBUGS-92837, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @melvinjoseph86

openshift-merge-bot · 2026-06-26T14:16:26Z

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

Copilot

Pull request overview

This PR improves stability of the HAProxy router metrics extended test by ensuring the retry loop doesn’t exit until required per-route metrics are available, accounting for the HAProxy exporter’s scrape interval.

Changes:

Adds contextual comments explaining exporter scrape-interval lag for per-route metrics.
Extends the wait.PollImmediate exit condition to also wait for haproxy_server_http_responses_total{code="2xx"} before proceeding to post-loop assertions.

mkowalski · 2026-06-26T18:14:55Z

+					backendConns := findGaugesWithLabels(metrics["haproxy_backend_connections_total"], routeLabels)
+					if len(backendConns) > 0 && backendConns[0] >= float64(times) {
+						// Also verify that the HTTP response metrics have been
+						// populated for this route before exiting the loop.
+						// The exporter may not refresh all stats atomically, so


Shouldn't we do it step by step and handle whenever necessary? Like this we can say it about every single metric. No?

openshift-ci · 2026-06-26T18:52:55Z

@mkowalski: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-aws-ovn-fips	`465f2f7`	link	true	`/test e2e-aws-ovn-fips`
ci/prow/e2e-vsphere-ovn	`465f2f7`	link	true	`/test e2e-vsphere-ovn`

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 26, 2026

openshift-ci Bot requested review from bentito and knobunc June 26, 2026 13:53

coderabbitai Bot reviewed Jun 26, 2026

coderabbitai Bot approved these changes Jun 26, 2026

openshift-ci Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label Jun 26, 2026

mkowalski changed the title ~~test/router: wait for all per-route metrics before asserting~~ OCPBUGS-92837: test/router: wait for all per-route metrics before asserting Jun 26, 2026

openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 26, 2026

openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 26, 2026

openshift-ci Bot requested a review from melvinjoseph86 June 26, 2026 14:12

mkowalski requested a review from Copilot June 26, 2026 15:21

Copilot started reviewing on behalf of mkowalski June 26, 2026 15:21

Copilot AI reviewed Jun 26, 2026
