Use CAPI featuregate for CAPI aws worker machinesets#10659

Open
mdbooth wants to merge 1 commit into
openshift:mainfrom
mdbooth:capi-workers-featuregate-aws

@mdbooth

@mdbooth mdbooth commented Jun 26, 2026

Contributor

This removes use of:

  • FeatureGateClusterAPIControlPlaneInstall because it is not implemented in either installer or CPMS.
  • FeatureGateClusterAPIComputeInstall, which is replaced by FeatureGateClusterAPIMachineManagementAWS and scoped to AWS only, as that is the only platform where it is implemented.

The effect of this change is that installer-created worker machinesets will use CAPI as soon as platform support is promoted.

Also updates the tests to use the feature gates explicitly instead of assuming inclusion in a particular feature set.

Summary by CodeRabbit

  • Bug Fixes
    • Adjusted machine pool defaulting so management is auto-set to Cluster API only for AWS compute pools when the appropriate feature gate is enabled.
    • Prevents overwriting an existing management setting and avoids unintended changes to control-plane or non-AWS pools.
  • Tests
    • Expanded/refactored tests to validate behavior across feature gate enabled/disabled and preconfigured management scenarios.

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

📝 Walkthrough

Walkthrough

Machine pool management defaulting now applies only to AWS compute pools when the AWS machine-management feature gate is enabled. Validation and tests were updated to match the AWS-specific behavior.

Changes

AWS machine pool management

| Layer / File(s) | Summary |
|---|---|
| Feature-gate mapping (pkg/types/validation/featuregates.go) | The machine-pool feature-gate validation mapping now references FeatureGateClusterAPIMachineManagement and includes the AWS-specific comment. |
| AWS defaulting and tests (pkg/types/defaults/machinepools.go, pkg/types/defaults/machinepools_test.go) | SetMachinePoolDefaults now sets Management only for empty AWS compute pools when FeatureGateClusterAPIMachineManagementAWS is enabled, and the tests cover AWS compute, control-plane, non-AWS, and pre-set management cases. |
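
The defaulting rule summarized above can be sketched as follows. This is a minimal illustration, not the installer's code: the `MachinePool` struct, the plain-string platform, the `gates` map, and the pool name `"worker"` are all simplified stand-ins assumed for the example; only the gate name comes from the discussion.

```go
package main

import "fmt"

// MachinePool is a hypothetical stand-in for the installer's type.
type MachinePool struct {
	Name       string // "worker" is assumed to be the compute pool name
	Management string // empty means the user did not set it
}

// Gate name as used in the PR; the map-based lookup is a stand-in.
const gateAWS = "ClusterAPIMachineManagementAWS"

// setManagementDefault sets Management to "ClusterAPI" only when every
// condition described in the walkthrough holds.
func setManagementDefault(pool *MachinePool, platform string, gates map[string]bool) {
	switch {
	case pool.Management != "": // never overwrite an explicit setting
	case pool.Name != "worker": // control-plane and other pools are untouched
	case platform != "aws": // only AWS is in scope for this gate
	case !gates[gateAWS]: // gate disabled: leave the field empty
	default:
		pool.Management = "ClusterAPI"
	}
}

func main() {
	gates := map[string]bool{gateAWS: true}
	cases := []struct {
		pool     MachinePool
		platform string
	}{
		{MachinePool{Name: "worker"}, "aws"},
		{MachinePool{Name: "master"}, "aws"},
		{MachinePool{Name: "worker"}, "gcp"},
		{MachinePool{Name: "worker", Management: "MachineAPI"}, "aws"},
	}
	for _, tc := range cases {
		p := tc.pool
		setManagementDefault(&p, tc.platform, gates)
		fmt.Printf("%s on %s -> %q\n", p.Name, tc.platform, p.Management)
	}
}
```

Only the first case ends up with `ClusterAPI`; the other three are left exactly as they were passed in.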

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (14 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly matches the main change: switching AWS worker machine sets to the CAPI feature gate. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | No Ginkgo-style titles or dynamic values were added; the new test case names are static, descriptive strings. |
| Test Structure And Quality | ✅ Passed | PASS: The updated tests are table-driven, each subtest checks one behavior, and the new feature-gate assertions include diagnostic messages; no cluster waits or cleanup issues. |
| Microshift Test Compatibility | ✅ Passed | Only unit tests using testing/assert were changed; no new Ginkgo e2e tests or MicroShift-unsupported OpenShift APIs were added. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No new Ginkgo e2e tests were added; the changed test is a unit test and contains no SNO-sensitive multi-node assumptions. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | The PR only changes machine-pool defaulting/validation; no pod specs, affinity, node selectors, topology spread, or topology-derived replica logic were introduced. |
| Ote Binary Stdout Contract | ✅ Passed | The PR only changes defaults, validation, and tests; none of the touched files contain process-level stdout writes or logging setup. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No new Ginkgo e2e tests were added; the touched test is a unit test and contains no IPv4 or external connectivity assumptions. |
| No-Weak-Crypto | ✅ Passed | Touched files only adjust feature-gate/default logic; no MD5/SHA1/DES/RC4/ECB/custom crypto or secret comparisons appear in the changed code. |
| Container-Privileges | ✅ Passed | PR only changes machine-pool defaults/tests and validation; no container/K8s manifests or privilege-related securityContext settings were added. |
| No-Sensitive-Data-In-Logs | ✅ Passed | No new log/print calls or sensitive-data strings appear in the touched files; changes only adjust defaults, tests, and validation mappings. |
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


Comment @coderabbitai help to get the list of available commands.

@openshift-ci openshift-ci Bot requested review from bfournie and jhixson74 June 26, 2026 08:50
@openshift-ci

openshift-ci Bot commented Jun 26, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign stephenfin for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mdbooth

mdbooth commented Jun 26, 2026

Contributor Author

/test e2e-aws-ovn-techpreview

@mdbooth mdbooth force-pushed the capi-workers-featuregate-aws branch from d7d2bfc to a976a1f Compare June 26, 2026 10:34
@mdbooth mdbooth marked this pull request as draft June 26, 2026 10:34
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 26, 2026
@mdbooth

mdbooth commented Jun 26, 2026

Contributor Author

/test e2e-aws-ovn-techpreview

@mdbooth

mdbooth commented Jun 26, 2026

Contributor Author

Leaving this draft until I get a good run of e2e-aws-ovn-techpreview.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/types/validation/featuregates.go`:
- Around line 45-47: The validation mapping is using the generic
machine-management feature gate instead of the AWS-specific one, which broadens
install config acceptance in `installconfig.go` through `FeatureGateName`.
Update the `featuregates.go` entry to reference the AWS-specific gate used for
`compute.management`, and keep the platform-specific behavior aligned with the
AWS-only contract rather than falling back to
`FeatureGateClusterAPIMachineManagement`. If the helper cannot represent
platform-specific gating, fix that abstraction instead of changing this mapping.
- Around line 48-49: The validation error path in the feature gate check is
pointing at the list field instead of the actual compute pool entry, so update
the field path used in the feature validation logic to target the indexed pool.
In the validation code around the compute management check, adjust the
`field.NewPath(...)` call so it references `compute[0].management` (or the
matching index if the logic is generalized), keeping the check in sync with the
`c.Compute` slice access.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 15a6ab77-30eb-4de5-a0ee-ee5fe328f05f

📥 Commits

Reviewing files that changed from the base of the PR and between d7d2bfc and a976a1f.

📒 Files selected for processing (3)
  • pkg/types/defaults/machinepools.go
  • pkg/types/defaults/machinepools_test.go
  • pkg/types/validation/featuregates.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/types/defaults/machinepools.go
  • pkg/types/defaults/machinepools_test.go

Comment on lines +45 to +47
// Note that this should use a platform-specific feature gate, but
// there is no way to express that here.
FeatureGateName: features.FeatureGateClusterAPIMachineManagement,

🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy lift

Use the AWS-specific gate in validation, not the generic machine-management gate.

pkg/types/validation/installconfig.go consumes this mapping purely by FeatureGateName, so switching this entry to FeatureGateClusterAPIMachineManagement broadens which configs are admitted/rejected. That no longer matches the AWS-only behavior described for this change and can make compute.management validation diverge from the actual platform support. If the current helper cannot express platform-specific gating, that limitation needs fixing instead of weakening the contract here.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/types/validation/featuregates.go` around lines 45 - 47, The validation
mapping is using the generic machine-management feature gate instead of the
AWS-specific one, which broadens install config acceptance in `installconfig.go`
through `FeatureGateName`. Update the `featuregates.go` entry to reference the
AWS-specific gate used for `compute.management`, and keep the platform-specific
behavior aligned with the AWS-only contract rather than falling back to
`FeatureGateClusterAPIMachineManagement`. If the helper cannot represent
platform-specific gating, fix that abstraction instead of changing this mapping.
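
One possible shape for a platform-aware check is sketched below. It is purely illustrative and not taken from the PR: `validateComputeManagement`, the plain-string platform, and the `gates` map are hypothetical, and the real mapping in `featuregates.go` may not be able to express this, as the code comment under review notes.

```go
package main

import (
	"errors"
	"fmt"
)

// validateComputeManagement is a hypothetical helper showing how the
// platform check and the AWS-specific gate could be evaluated together.
func validateComputeManagement(platform, management string, gates map[string]bool) error {
	if management != "ClusterAPI" {
		return nil // nothing gated is being requested
	}
	if platform != "aws" {
		// Assumed behavior: reject outright where CAPI is not implemented.
		return fmt.Errorf("compute[0].management: ClusterAPI is not supported on platform %q", platform)
	}
	if !gates["ClusterAPIMachineManagementAWS"] {
		return errors.New("compute[0].management: requires feature gate ClusterAPIMachineManagementAWS")
	}
	return nil
}

func main() {
	on := map[string]bool{"ClusterAPIMachineManagementAWS": true}
	fmt.Println(validateComputeManagement("aws", "ClusterAPI", on))
	fmt.Println(validateComputeManagement("aws", "ClusterAPI", nil))
	fmt.Println(validateComputeManagement("gcp", "ClusterAPI", on))
}
```

Whether non-AWS platforms should be rejected or simply ignored is a design decision the thread does not settle; the sketch picks rejection only to make the difference from the generic gate visible.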

Comment on lines 48 to 49
Condition: len(c.Compute) > 0 && c.Compute[0].Management == types.ClusterAPI,
Field: field.NewPath("compute", "management"),

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Point the validation error at the indexed compute pool.

compute is a list in install-config, so field.NewPath("compute", "management") does not map to the structure users edit. If this check is for the first pool, the field path should be compute[0].management (or the matched index if you generalize the scan), otherwise the error is harder to act on. As per path instructions, "Verify error field paths match the YAML/JSON structure users provide."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/types/validation/featuregates.go` around lines 48 - 49, The validation
error path in the feature gate check is pointing at the list field instead of
the actual compute pool entry, so update the field path used in the feature
validation logic to target the indexed pool. In the validation code around the
compute management check, adjust the `field.NewPath(...)` call so it references
`compute[0].management` (or the matching index if the logic is generalized),
keeping the check in sync with the `c.Compute` slice access.

Source: Path instructions
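
The difference between the two field paths can be shown with a small dependency-free stand-in. The `path` type below is hypothetical and only mimics the string form of `field.Path` from `k8s.io/apimachinery`; with the real package the indexed form would be built as `field.NewPath("compute").Index(0).Child("management")`.

```go
package main

import (
	"fmt"
	"strconv"
)

// path is a tiny stand-in for field.Path, kept local so the sketch
// runs without external modules.
type path struct{ s string }

func newPath(name string) path { return path{name} }

// Child appends a named field, rendered as ".name".
func (p path) Child(name string) path { return path{p.s + "." + name} }

// Index appends a list index, rendered as "[i]".
func (p path) Index(i int) path { return path{p.s + "[" + strconv.Itoa(i) + "]"} }

func (p path) String() string { return p.s }

func main() {
	// Path as written in the reviewed code: no list index.
	fmt.Println(newPath("compute").Child("management"))
	// Path matching the install-config structure: first compute pool.
	fmt.Println(newPath("compute").Index(0).Child("management"))
}
```

The second form matches the `c.Compute[0]` access in the condition, so an error reported against it points at the entry the user actually edited.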

@mdbooth mdbooth marked this pull request as ready for review June 26, 2026 14:40
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 26, 2026
@openshift-ci openshift-ci Bot requested review from patrickdillon and rwsu June 26, 2026 14:40
@patrickdillon

Contributor

/test ?

@patrickdillon

Contributor

/test e2e-aws-ovn-dualstack-ipv6-primary-techpreview e2e-aws-ovn-dualstack-ipv4-primary-techpreview

@patrickdillon

Contributor

/hold
We don't want to set back AWS dualstack, so this depends on openshift/cluster-capi-operator#593.

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 26, 2026
Comment on lines -151 to +154
- featureSet configv1.FeatureSet
+ featureGates []string

Contributor

can we just stick with featureset?

Contributor Author

I don't think so: it's the wrong abstraction. We'd have to change all the tests every time we promoted anything.

Contributor

OK, this looks fine to me, although you might need the CustomNoUpgrade (CNU) feature set enabled?

Member

But we lose coverage for scenarios where users set featureSet: Default/TechPreviewNoUpgrade/DevPreviewNoUpgrade in the install-config, right?
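
The trade-off discussed in this thread can be illustrated with a table-driven sketch that names gates explicitly. Everything here is assumed for illustration: `parseGates`, `wantsClusterAPI`, and the `Name=bool` gate string format are hypothetical and are not the helpers used in `machinepools_test.go`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseGates turns "Name=true" strings into a lookup table.
// The "Name=bool" format is an assumption made for this sketch.
func parseGates(gates []string) map[string]bool {
	out := make(map[string]bool, len(gates))
	for _, g := range gates {
		name, value, ok := strings.Cut(g, "=")
		if !ok {
			continue // ignore malformed entries in this sketch
		}
		if enabled, err := strconv.ParseBool(value); err == nil {
			out[name] = enabled
		}
	}
	return out
}

// wantsClusterAPI is a hypothetical function under test.
func wantsClusterAPI(platform string, gates map[string]bool) bool {
	return platform == "aws" && gates["ClusterAPIMachineManagementAWS"]
}

func main() {
	cases := []struct {
		name         string
		featureGates []string
		platform     string
		want         bool
	}{
		{"gate enabled on AWS", []string{"ClusterAPIMachineManagementAWS=true"}, "aws", true},
		{"gate disabled on AWS", []string{"ClusterAPIMachineManagementAWS=false"}, "aws", false},
		{"gate enabled on GCP", []string{"ClusterAPIMachineManagementAWS=true"}, "gcp", false},
		{"no gates", nil, "aws", false},
	}
	for _, tc := range cases {
		got := wantsClusterAPI(tc.platform, parseGates(tc.featureGates))
		fmt.Printf("%-22s ok=%t\n", tc.name, got == tc.want)
	}
}
```

Cases written this way do not change when a gate is promoted between feature sets, which is the author's argument. They also do not exercise the feature-set-to-gate mapping, which is the coverage concern raised in the last comment.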

- FeatureGateName: features.FeatureGateClusterAPIComputeInstall,
+ // Note that this should use a platform-specific feature gate, but
+ // there is no way to express that here.
+ FeatureGateName: features.FeatureGateClusterAPIMachineManagement,

Contributor

@openshift-ci

openshift-ci Bot commented Jun 26, 2026

Contributor

@mdbooth: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/e2e-aws-ovn-dualstack-ipv6-primary-techpreview | a976a1f | link | false | /test e2e-aws-ovn-dualstack-ipv6-primary-techpreview |
| ci/prow/e2e-aws-ovn | a976a1f | link | true | /test e2e-aws-ovn |
| ci/prow/e2e-aws-ovn-dualstack-ipv4-primary-techpreview | a976a1f | link | false | /test e2e-aws-ovn-dualstack-ipv4-primary-techpreview |
| ci/prow/unit | a976a1f | link | true | /test unit |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@mdbooth

mdbooth commented Jun 26, 2026

Contributor Author

ipv6 failure:

   'failed to create InfraMachine: failed to apply AWSMachine: awsmachines.infrastructure.cluster.x-k8s.io
   "ci-op-nslj16m1-73c27-wf2r9-worker-us-east-1f-zrzgx" is forbidden: ValidatingAdmissionPolicy
   'openshift-cluster-api-unsupported-aws-spec-fields' with binding 'openshift-cluster-api-unsupported-aws-spec-fields'
   denied request: spec.privateDnsName is a forbidden field'

name: "compute with default feature set",
name: "non-AWS compute is unaffected by ClusterAPIMachineManagementAWS",
pool: &types.MachinePool{Name: types.MachinePoolComputeRoleName},
platform: &types.Platform{},

Member

nit: let's use a non-nil platform (e.g. GCP) to test the "non-AWS" case?

Field: field.NewPath("osImageStream"),
},
{
FeatureGateName: features.FeatureGateClusterAPIControlPlaneInstall,

Member

We should also remove the 2 unit tests, following this change:

name: "Control Plane CAPI machine management is allowed with DevPreviewNoUpgrade Feature Set",

name: "Control Plane CAPI machine management is not allowed with Default Feature Set",

},
{
FeatureGateName: features.FeatureGateClusterAPIControlPlaneInstall,
Condition: c.ControlPlane != nil && c.ControlPlane.Management == types.ClusterAPI,

Member

With this check removed, the controlPlane.management field can be freely set to ClusterAPI, but doing so results in a no-op.

We should add a validation to block it for the moment, WDYT?

case types.MachinePoolEdgeRoleName:
allErrs = append(allErrs, validateComputeEdge(platform, p.Name, poolFldPath, poolFldPath)...)
if p.Management == types.ClusterAPI {
allErrs = append(allErrs, field.Invalid(poolFldPath.Child("management"), p.Management, "edge compute pools cannot be managed by Cluster API"))
}
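
Following the pattern of the edge-pool check quoted above, a blocking validation for the control plane might look like the sketch below. The function name, the plain-string field, and the error text are hypothetical; the real code would append a `field.Invalid` error to `allErrs` as the edge case does.

```go
package main

import (
	"errors"
	"fmt"
)

// validateControlPlaneManagement is a hypothetical check that rejects
// ClusterAPI management on the control plane instead of silently ignoring it.
func validateControlPlaneManagement(management string) error {
	if management == "ClusterAPI" {
		return errors.New(`controlPlane.management: Invalid value: "ClusterAPI": control plane machines cannot be managed by Cluster API`)
	}
	return nil
}

func main() {
	fmt.Println(validateControlPlaneManagement("ClusterAPI"))
	fmt.Println(validateControlPlaneManagement(""))
}
```

Rejecting the value makes the unsupported setting visible at install-config validation time rather than leaving it as a no-op.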

@tthvo

tthvo commented Jun 26, 2026

Member

> ipv6 failure:

Let's check with the 2 CCAPIO PRs 👇 (see next comment)

Note: There is another "issue" that breaks oc exec/oc logs (i.e. pending kubelet serving CSR) due to missing IPv6 addresses on CAPI machine status (fixed in kubernetes-sigs/cluster-api-provider-aws#5993 and waiting for next downstream sync). This only applies to dualstack + CAPI machine management install.

@tthvo

tthvo commented Jun 26, 2026

Member

/testwith openshift/installer/main/e2e-aws-ovn-dualstack-ipv6-primary-techpreview openshift/cluster-capi-operator#592 openshift/cluster-capi-operator#593
/testwith openshift/installer/main/e2e-aws-ovn-dualstack-ipv4-primary-techpreview openshift/cluster-capi-operator#592 openshift/cluster-capi-operator#593

@tthvo

tthvo commented Jun 27, 2026

Member

Hmm, in the AWS ipv4-primary dualstack, the konnectivity agent pods are rejected (see events) 😕:

openshift-bootstrap-konnectivity   62m  Warning   FailedCreate  daemonset/konnectivity-agent
Error creating: pods "konnectivity-agent-" is forbidden: unable to validate against any security context constraint:
[provider "anyuid": Forbidden: not usable by user or serviceaccount, provider restricted-v2: .spec.hostNetwork: Invalid
value: true: Host network is not allowed to be used, provider restricted-v2: .containers[0].hostNetwork: Invalid value:
true: Host network is not allowed to be used, provider restricted-v3: .spec.hostNetwork: Invalid value: true: Host 
network is not allowed to be used, provider restricted-v3: .spec.hostUsers: Invalid value: null: Host Users must be set 
to false, provider restricted-v3: .containers[0].hostNetwork: Invalid value: true: Host network is not allowed to be 
used, provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nested-container": Forbidden: 
not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, 
provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not 
usable by user or serviceaccount, provider "hostmount-anyuid-v2": Forbidden: not usable by user or serviceaccount, 
provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not 
usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider 
"privileged": Forbidden: not usable by user or serviceaccount]

This may be some sort of timing/race problem? AWS ipv6-primary dualstack and ipv4-only TPNU installed just fine.

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.

3 participants