Bug Report
What did you do?
Ansible operator plugin should consider using runtime.GOMAXPROCS(0) to set --max-concurrent-reconciles instead of runtime.NumCPU() in https://github.com/operator-framework/ansible-operator-plugins/blob/main/internal/ansible/flags/flag.go#L109.
What did you expect to see?
--max-concurrent-reconciles should take into account the pod's CPU resource limits so as to not start too many simultaneous playbooks.
What did you see instead? Under which circumstances?
runtime.NumCPU() returns the number of logical CPUs usable by the current process and does not seem to be cgroup-aware. This means that --max-concurrent-reconciles can default to a large number on OpenShift nodes with many CPUs. The Ansible operator can then start too many playbooks simultaneously, eating up memory and leading to the pod being OOM killed.
On an OpenShift compute node with 128 CPUs, the operator set the default --max-concurrent-reconciles to 128. This particular environment had over 30 AnsibleAutomationPlatformBackup resources, which meant the operator tried to start over 30 simultaneous playbooks. This used up quite a bit of memory, causing the pod to exceed its 4000Mi memory limit, which led to it being OOM killed.
The pod that was OOM killed did have CPU resource limits set, but the operator did not seem to take those into account when determining the maximum number of concurrent reconciles.
resources:
  limits:
    cpu: "2"
    memory: 4000Mi
  requests:
    cpu: 10m
    memory: 256Mi
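To illustrate what "taking the CPU limit into account" could look like, here is a minimal sketch that derives a CPU count from the cgroup CPU quota. It assumes cgroup v2, where a limit of cpu: "2" typically appears in /sys/fs/cgroup/cpu.max as "200000 100000" (quota and period in microseconds). The helper name parseCPUMax is hypothetical and not part of the operator; clusters still on cgroup v1 expose the quota in different files, so this would not apply there as written.

```go
package main

import (
	"fmt"
	"math"
	"os"
	"strconv"
	"strings"
)

// parseCPUMax is a hypothetical helper (not part of the operator).
// It parses the contents of a cgroup v2 cpu.max file, which has the
// form "<quota> <period>" or "max <period>", and returns the CPU limit
// rounded up to a whole number of CPUs. It returns false when no limit
// is set or the content cannot be parsed.
func parseCPUMax(content string) (int, bool) {
	fields := strings.Fields(content)
	if len(fields) != 2 || fields[0] == "max" {
		return 0, false
	}
	quota, err := strconv.ParseFloat(fields[0], 64)
	if err != nil || quota <= 0 {
		return 0, false
	}
	period, err := strconv.ParseFloat(fields[1], 64)
	if err != nil || period <= 0 {
		return 0, false
	}
	return int(math.Ceil(quota / period)), true
}

func main() {
	// Assumption: cgroup v2 is mounted at /sys/fs/cgroup inside the container.
	data, err := os.ReadFile("/sys/fs/cgroup/cpu.max")
	if err != nil {
		fmt.Println("could not read cgroup v2 CPU limit:", err)
		return
	}
	if cpus, ok := parseCPUMax(string(data)); ok {
		fmt.Println("CPU limit from cgroup:", cpus)
	} else {
		fmt.Println("no CPU limit set")
	}
}
```

With the limits shown above, a value derived this way would be 2 rather than the 128 reported by runtime.NumCPU() on that node.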
Environment
Kubernetes cluster type:
OpenShift
This behavior was observed with Ansible Automation Platform operator 2.5 using Go 1.21.13.
ansible-operator version: "v1.31.0-ocp", commit: "731dca792e1343af155b82bc3c34a5800ee863af", kubernetes version: "v1.26.0", go version: "go1.21.13 (Red Hat 1.21.13-3.module+el8.10.0+22345+acdd8d0e) X:strictfipsruntime", GOOS: "linux", GOARCH: "amd64"
Possible Solution
Switch to runtime.GOMAXPROCS(0), which seems to be cgroup-aware since Go 1.25.
For Go versions earlier than 1.25, runtime.GOMAXPROCS(0) seems to return the same value as runtime.NumCPU(), so the change would only help once the operator is built with Go 1.25 or later.
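A rough sketch of the proposed change follows. It is not the actual code in flag.go: the helper name defaultMaxConcurrentReconciles is hypothetical, and the standard library flag package stands in for whatever flag library the plugin uses. The only point is that the default comes from runtime.GOMAXPROCS(0), which queries the current setting without changing it, rather than from runtime.NumCPU().

```go
package main

import (
	"flag"
	"fmt"
	"runtime"
)

// defaultMaxConcurrentReconciles is a hypothetical helper showing the
// proposed default. GOMAXPROCS(0) returns the current setting without
// modifying it; per the report, on Go 1.25+ this value seems to reflect
// the container's CPU limit, while on earlier versions it matches
// runtime.NumCPU() unless overridden.
func defaultMaxConcurrentReconciles() int {
	return runtime.GOMAXPROCS(0)
}

func main() {
	// Assumption: the stdlib flag package is used here only for illustration.
	maxReconciles := flag.Int(
		"max-concurrent-reconciles",
		defaultMaxConcurrentReconciles(),
		"Maximum number of concurrent reconciles",
	)
	flag.Parse()

	// Print both values so the difference is visible inside a limited pod.
	fmt.Printf("NumCPU=%d GOMAXPROCS=%d max-concurrent-reconciles=%d\n",
		runtime.NumCPU(), runtime.GOMAXPROCS(0), *maxReconciles)
}
```

Running this inside a pod with a CPU limit, once with a pre-1.25 toolchain and once with Go 1.25 or later, should show whether the default actually drops to match the limit.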