Skip to content

crio: conditionally enable min_injected_gomaxprocs for high-CPU MCPs#6225

Open
haircommander wants to merge 1 commit into
openshift:mainfrom
haircommander:cpu-based-gomaxprocs
Open

crio: conditionally enable min_injected_gomaxprocs for high-CPU MCPs#6225
haircommander wants to merge 1 commit into
openshift:mainfrom
haircommander:cpu-based-gomaxprocs

Conversation

@haircommander

@haircommander haircommander commented Jun 23, 2026

Copy link
Copy Markdown
Member

- What I did
Go processes run on nodes with many CPUs suffer a problem where the go runtime creates runtime threads per CPUs it can see, assuming there isn't a CPU limit on the process. However, the kubernetes scheduler is binpacking pods based on CPU request, which means go processes actually have access to less CPU time than the go runtime is expecting. This causes scheduling and GC latency.

min_injected_gomaxprocs is a feature in CRI-O that injects the GOMAXPROCS environment variable into all pods based on their CPU request. if the pod has a limit, GOMAXPROCS isn't set. Otherwise, it is set to 2requested CPUs. The 2 figure was found from perf testing, allowing the go runtime to burst up to capacity, while mitigating the latency.

Add code to inject min_injected_gomaxprocs if any node in the pool has more than 128 CPUs

alternative to #6177
- How to verify it

- Description for the changelog

Summary by CodeRabbit

Release Notes

  • New Features

    • The CRI-O runtime configuration now detects the maximum CPU capacity from each MachineConfigPool and automatically enables min_injected_gomaxprocs when pools have 128+ CPUs. If TOML drop-in generation fails, the update stops to avoid partial configuration.
  • Tests

    • Added unit tests covering max-CPU detection behavior for single/multiple matching nodes, no matches, invalid selectors, and Kubernetes API failures.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown

Walkthrough

The PR extends the container-runtime-config controller to automatically emit a CRI-O drop-in configuration that enables min_injected_gomaxprocs for any MachineConfigPool whose nodes contain at least 128 CPUs. A new helper getMaxCPUCountFromPool queries nodes via a Kubernetes client and returns the maximum CPU capacity, and the controller struct now holds that client.

Changes

CPU-aware CRI-O drop-in generation

Layer / File(s) Summary
CPU capacity query helper and TOML config
pkg/controller/container-runtime-config/helpers.go, pkg/controller/container-runtime-config/helpers_test.go
Adds clientset import, a new CRI-O drop-in file path constant, a tomlConfigCRIOMinInjectedGomaxprocs struct for TOML encoding, and getMaxCPUCountFromPool which converts a pool's node selector to a label selector and queries the Kubernetes API to find the maximum CPU capacity across matching nodes. Table-driven tests cover single-node, multi-node, no-match, and API failure scenarios using a fake client.
Controller kubeClient wiring and drop-in emission
pkg/controller/container-runtime-config/container_runtime_config_controller.go
Adds kubeClient clientset.Interface field to the Controller struct, assigns it in New(...), and extends syncContainerRuntimeConfig to call getMaxCPUCountFromPool and append the TOML-encoded CRI-O drop-in file to the rendered MachineConfig when the pool's max CPU count is ≥ 128.

Sequence Diagram(s)

sequenceDiagram
  participant Controller
  participant getMaxCPUCountFromPool
  participant KubeClient
  Controller->>getMaxCPUCountFromPool: pool.nodeSelector
  getMaxCPUCountFromPool->>KubeClient: list nodes with label selector
  KubeClient-->>getMaxCPUCountFromPool: nodes with CPU capacity
  getMaxCPUCountFromPool-->>Controller: max CPU value
  Controller->>Controller: maxCPU >= 128?
  alt CPU threshold met
    Controller->>Controller: generate TOML drop-in<br/>min_injected_gomaxprocs=1
    Controller->>Controller: append to configFileList
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 13 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ❓ Inconclusive Custom check targets Ginkgo tests (It blocks, BeforeEach/AfterEach, Eventually/Consistently), but codebase uses standard Go testing. The check cannot be meaningfully applied to non-Ginkgo tests. Clarify if check should apply to standard Go tests too. Note: TestGetMaxCPUCountFromPool defines expectedErrLog field but never validates it.
✅ Passed checks (13 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: conditionally enabling a CRI-O feature for machine config pools with high-CPU nodes, which aligns with the core functionality added across the modified files.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed Custom check targets Ginkgo test names (It(), Describe(), etc.); PR contains only standard Go table-driven tests with static, descriptive names and no Ginkgo tests, so check is not applicable.
Microshift Test Compatibility ✅ Passed No Ginkgo e2e tests are added in this PR. The two new test functions (TestGetMaxCPUCountFromPool and TestGetMaxCPUCountFromPool_APIFailure) are standard Go unit tests using *testing.T framework, no...
Single Node Openshift (Sno) Test Compatibility ✅ Passed No Ginkgo e2e tests were added in this PR. Only standard Go unit tests (TestGetMaxCPUCountFromPool, TestGetMaxCPUCountFromPool_APIFailure) were added to helpers_test.go. Check only applies to Ginkg...
Topology-Aware Scheduling Compatibility ✅ Passed This PR modifies only the container-runtime-config controller with infrastructure configuration logic and does not introduce any pod scheduling constraints, deployments, or topology assumptions.
Ote Binary Stdout Contract ✅ Passed Check not applicable: PR changes are in controller library code (machine-config-controller daemon), not in OTE test binary. No stdout contract violations detected in the controller code.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed This PR adds unit tests (TestGetMaxCPUCountFromPool, TestGetMaxCPUCountFromPool_APIFailure) using standard Go testing with mocked Kubernetes clients, not Ginkgo e2e tests. No IPv4 assumptions or ex...
No-Weak-Crypto ✅ Passed PR introduces no weak cryptography (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), no custom crypto implementations, and no insecure secret comparisons. Changes focus on CPU detection and CRI-O configu...
Container-Privileges ✅ Passed No container privilege escalation settings (privileged: true, hostPID/Network/IPC, SYS_ADMIN, allowPrivilegeEscalation, runAsRoot) found; PR only adds CRI-O performance configuration.
No-Sensitive-Data-In-Logs ✅ Passed All new logging statements only log non-sensitive infrastructure metadata: pool names, CPU counts, node counts, and error messages. No passwords, tokens, API keys, PII, session IDs, hostnames, or c...

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@haircommander haircommander changed the title crio: conditionally enable min_injected_gomaxprocs for high-CPU workers crio: conditionally enable min_injected_gomaxprocs for high-CPU MCPs Jun 23, 2026
@openshift-ci openshift-ci Bot requested review from hasan4791 and mtrmac June 23, 2026 18:42
@openshift-ci

openshift-ci Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haircommander

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 23, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/controller/container-runtime-config/helpers_test.go (1)

2974-3190: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add an explicit API list-error test case for this helper.

Current cases validate selector and empty-match behavior, but not the node-list failure branch. A fake client reactor that forces nodes/list to fail would lock in the expected reconcile behavior and prevent regressions.

🧪 Example test addition
+{
+  name: "node list API error",
+  pool: &mcfgv1.MachineConfigPool{
+    ObjectMeta: metav1.ObjectMeta{Name: "worker"},
+    Spec: mcfgv1.MachineConfigPoolSpec{
+      NodeSelector: &metav1.LabelSelector{
+        MatchLabels: map[string]string{"node-role.kubernetes.io/worker": ""},
+      },
+    },
+  },
+  nodes:       []corev1.Node{},
+  expectedCPU: 0,
+  setupClient: func(c *k8sfake.Clientset) {
+    c.PrependReactor("list", "nodes", func(action k8stesting.Action) (bool, runtime.Object, error) {
+      return true, nil, fmt.Errorf("simulated list error")
+    })
+  },
+},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controller/container-runtime-config/helpers_test.go` around lines 2974 -
3190, The TestGetMaxCPUCountFromPool test function is missing a test case that
covers the failure path when the Kubernetes API call to list nodes fails. Add a
new test case in the tests slice that configures the fake client to return an
error when listing nodes, then verify the function handles this gracefully by
either logging an error or returning an appropriate value. Use the fake client's
reactor mechanism to inject a failure on the nodes/list operation, and update
the test loop to check for expectedErrLog when present to validate error logging
behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@pkg/controller/container-runtime-config/container_runtime_config_controller.go`:
- Around line 825-828: The error handling for the addTOMLgeneratedConfigFile
call when creating the min_injected_gomaxprocs config currently logs the error
and continues with the else block, allowing reconciliation to proceed even when
drop-in file generation fails. Instead of logging and continuing, return an
error from the reconciliation function when addTOMLgeneratedConfigFile returns
an error for crioDropInFilePathMinInjectedGomaxprocs, ensuring that
reconciliation fails explicitly and the pool configuration is not left in an
incomplete state.

---

Nitpick comments:
In `@pkg/controller/container-runtime-config/helpers_test.go`:
- Around line 2974-3190: The TestGetMaxCPUCountFromPool test function is missing
a test case that covers the failure path when the Kubernetes API call to list
nodes fails. Add a new test case in the tests slice that configures the fake
client to return an error when listing nodes, then verify the function handles
this gracefully by either logging an error or returning an appropriate value.
Use the fake client's reactor mechanism to inject a failure on the nodes/list
operation, and update the test loop to check for expectedErrLog when present to
validate error logging behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 05775bb9-e077-432b-a8dd-69035fc07c07

📥 Commits

Reviewing files that changed from the base of the PR and between c41a042 and b40459b.

📒 Files selected for processing (3)
  • pkg/controller/container-runtime-config/container_runtime_config_controller.go
  • pkg/controller/container-runtime-config/helpers.go
  • pkg/controller/container-runtime-config/helpers_test.go

Comment thread pkg/controller/container-runtime-config/container_runtime_config_controller.go Outdated
Go processes on nodes with many CPUs suffer from excessive runtime
threads when GOMAXPROCS isn't set. The go runtime creates threads
per CPU it can see, but the kubernetes scheduler binpacks based on
CPU requests, leading to scheduling and GC latency issues.

CRI-O's min_injected_gomaxprocs feature injects the GOMAXPROCS
environment variable into all pods based on their CPU request (set
to 2*requested CPUs to allow bursting while mitigating latency).

This change automatically enables min_injected_gomaxprocs for any
pool (master, worker, arbiter, or custom) where any node has >= 128
CPUs. The controller:
- Queries nodes in each pool to find max CPU count
- Generates a CRIO drop-in config file when CPU >= 128
- Sets min_injected_gomaxprocs = 1 to enable the feature

This approach is pool-aware and only affects high-CPU nodes where
the GOMAXPROCS issue is most impactful.

Signed-off-by: Peter Hunt <pehunt@redhat.com>
@haircommander haircommander force-pushed the cpu-based-gomaxprocs branch from b40459b to 6e5fdb6 Compare June 23, 2026 19:12
@haircommander

Copy link
Copy Markdown
Member Author

@harche PTAL

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
pkg/controller/container-runtime-config/helpers_test.go (2)

2974-3191: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Verify or remove the unused expectedErrLog field.

The test table includes an expectedErrLog field (e.g., line 3170 sets it to true for the "invalid node selector" case), but the test body never asserts that the error was logged. This makes the field misleading and adds confusion to the test.

Either:

  • Remove the expectedErrLog field if error logging verification is not needed, or
  • Capture and verify klog output to ensure errors are logged as expected
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controller/container-runtime-config/helpers_test.go` around lines 2974 -
3191, The TestGetMaxCPUCountFromPool test defines an expectedErrLog field in the
test struct (set to true for the "invalid node selector" case) but never
verifies or asserts this field in the test body. Either remove the
expectedErrLog field from the test struct across all test cases since error
logging is not being validated, or add logic to the test body to capture and
verify klog output when expectedErrLog is true to ensure errors are properly
logged for invalid selectors.

2974-3191: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Consider adding a test case for nodes with missing CPU capacity.

The current test cases assume all nodes have Status.Capacity[corev1.ResourceCPU] populated. Consider adding a test case where a node is missing this field to verify that cpuQuantity.Value() handles it gracefully (likely returns 0).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controller/container-runtime-config/helpers_test.go` around lines 2974 -
3191, Add a new test case to the tests slice in the TestGetMaxCPUCountFromPool
function that covers a scenario where a node has an empty or missing CPU
capacity field. Create a test case similar to the existing ones (with name,
pool, nodes, and expectedCPU fields) where at least one node either has an empty
Status.Capacity ResourceList or lacks the corev1.ResourceCPU entry. Set the
expectedCPU to 0 or an appropriate value to verify that the
getMaxCPUCountFromPool function gracefully handles nodes without CPU capacity
information without crashing or returning unexpected values.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@pkg/controller/container-runtime-config/helpers_test.go`:
- Around line 2974-3191: The TestGetMaxCPUCountFromPool test defines an
expectedErrLog field in the test struct (set to true for the "invalid node
selector" case) but never verifies or asserts this field in the test body.
Either remove the expectedErrLog field from the test struct across all test
cases since error logging is not being validated, or add logic to the test body
to capture and verify klog output when expectedErrLog is true to ensure errors
are properly logged for invalid selectors.
- Around line 2974-3191: Add a new test case to the tests slice in the
TestGetMaxCPUCountFromPool function that covers a scenario where a node has an
empty or missing CPU capacity field. Create a test case similar to the existing
ones (with name, pool, nodes, and expectedCPU fields) where at least one node
either has an empty Status.Capacity ResourceList or lacks the corev1.ResourceCPU
entry. Set the expectedCPU to 0 or an appropriate value to verify that the
getMaxCPUCountFromPool function gracefully handles nodes without CPU capacity
information without crashing or returning unexpected values.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7b6f6075-c50f-4163-aa99-bffc5b45eb48

📥 Commits

Reviewing files that changed from the base of the PR and between b40459b and 6e5fdb6.

📒 Files selected for processing (3)
  • pkg/controller/container-runtime-config/container_runtime_config_controller.go
  • pkg/controller/container-runtime-config/helpers.go
  • pkg/controller/container-runtime-config/helpers_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/controller/container-runtime-config/helpers.go
  • pkg/controller/container-runtime-config/container_runtime_config_controller.go

@openshift-ci

openshift-ci Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

@haircommander: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@harche harche left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this.

})
if err != nil {
klog.V(2).Infof("error listing nodes for pool %s: %v", pool.Name, err)
return 0

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This collapses every failure into 0, which is the same value as "no big nodes," and 0 means the drop-in gets dropped from the rendered MachineConfig. So a transient node List error during a sync removes the config and reboots the whole pool, then the next successful sync adds it back and reboots again. A blip on the API shouldn't be able to roll a pool. Can we propagate the error and degrade through syncStatusOnly the same way the TOML path below does, instead of swallowing it?

}

// List nodes matching the pool's node selector
nodes, err := kubeClient.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does a full node List against the API server on every sync, once per pool. The controller already runs informers/listers for its other watched types, but there's no node informer wired in today, so this call is uncached. Wiring a node lister and reading from cache here would cut the per-sync API load and also make the transient failure above much less likely to flap the config. Worth weighing against the cost of adding a node watch to this controller.


// For pools with >= 128 CPUs, enable min_injected_gomaxprocs
maxCPU := getMaxCPUCountFromPool(ctrl.kubeClient, pool)
if maxCPU >= 128 {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick design question on this block. Since it reads the live node CPU count on every sync and the answer lands in a MachineConfig, the on/off state can change from things that are not config edits, and each change reboots the pool. For example, if an autoscaler takes a worker pool from 128 CPU nodes down to 64 and later back up, the file gets removed and then added back, so the pool reboots on each change. A fresh pool whose nodes have not registered yet also reads as 0 first and flips on once they join.

Would it be simpler to keep MCO out of the CPU counting and just turn the feature on here, then let CRI-O decide per node at container creation? CRI-O already runs on the node and sits in the injection path (gomaxprocs_hooks_linux.go), so it could skip injection on smaller nodes itself. Then the MachineConfig never changes based on cluster state, the reboot churn goes away, and getMaxCPUCountFromPool plus the kubeClient plumbing would not be needed. Is there a reason the 128 threshold needs to live in MCO, or would moving that decision into CRI-O work for your case?

pool *mcfgv1.MachineConfigPool
nodes []corev1.Node
expectedCPU int64
expectedErrLog bool

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expectedErrLog is set but never asserted on, so it's dead right now. Either assert against captured log output or drop the field. Also, no test covers the new branch in syncContainerRuntimeConfig itself (drop-in appended at >=128, and the degrade path on TOML failure), which is the riskier wiring.

@haircommander

Copy link
Copy Markdown
Member Author

/hold

approach may change actually openshift/enhancements#2047

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants