
KEP-6003: Add KEP for configurable HPA sync period#6004

Open
Fedosin wants to merge 1 commit into kubernetes:master from Fedosin:configurable-hpa-sync-period

Conversation


@Fedosin Fedosin commented Apr 8, 2026

  • One-line PR description: Introduce KEP proposing an optional syncPeriodSeconds field on HorizontalPodAutoscalerBehavior to allow per-HPA override of the global --horizontal-pod-autoscaler-sync-period.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Fedosin
Once this PR has been reviewed and has the lgtm label, please assign towca for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 8, 2026
per-item interval during reconciliation and cleans it up on HPA deletion.

The informer event handlers are updated so that:
- Newly created HPAs and spec changes (detected via `Generation` comparison)
Member

The HPA's `Generation` field is currently not set.
I have kubernetes/kubernetes#138228 open to get this fixed.

Author

Good catch, thank you! I've added a note in the Design Details section acknowledging the dependency on kubernetes/kubernetes#138228 for properly incrementing the HPA Generation field on spec changes.

We propose to add a new field to the existing `HorizontalPodAutoscalerBehavior` object:

- `syncPeriodSeconds`: *(int32)* the period in seconds between each
reconciliation of this HPA. Must be greater than 0 and less than or equal
to 3600.
Contributor

This will allow the HPA to query the metric source more frequently. That source is usually metrics-server, which caches its metrics, so in most situations reducing this sync period won't have much impact. Conceivably you could add a field that only targets custom/external metrics, although my suspicion is that this would result in a non-intuitive API.

Author

Great point. I've added a new "Interaction with metrics sources" paragraph to the Proposal section addressing this. It explains that for resource metrics served by metrics-server (which caches values at ~60s intervals), syncing the HPA faster than the metrics collection interval won't yield fresher data. Reducing the sync period is most impactful when used with custom or external metrics providers that expose rapidly updating values. Users are advised to consider their metrics pipeline latency when choosing a syncPeriodSeconds value.

I considered scoping the field to custom/external metrics only, but as you noted, that would result in a non-intuitive API -- users would need to reason about which metric types their HPA uses before setting the field. A single field that applies uniformly is simpler, and the documentation clarifies when it's actually beneficial.

periods (e.g. 1s) for many HPAs, increasing the rate of metrics queries
and scale sub-resource calls. This is mitigated by:
- Validation bounds: `syncPeriodSeconds` must be >= 1 and <= 3600.
- Cluster administrators can use admission webhooks or policies to enforce
Contributor

Wouldn't a meaningful webhook require calculating the aggregate sync frequency across all HPAs in the cluster? For users, this looks difficult to reason about.

Do you see another way such a webhook could be implemented? Perhaps it could be part of this KEP to ensure this feature is safe by default?

Author

You're right that a webhook calculating aggregate frequency would be hard to reason about. I've reworked the Risks and Mitigations section to present a multi-layered safety story instead:

  1. Validation bounds (>= 1, <= 3600) with a note that we may raise the lower bound before Beta based on real-world usage data.
  2. Feature gate -- in Alpha, cluster admins have explicit opt-in control.
  3. Best-effort semantics -- the field is a target interval, not a hard guarantee. The controller won't queue additional work if reconciliation takes longer than the configured period, preventing workqueue saturation.
  4. Policy enforcement -- I've replaced the vague "admission webhooks" suggestion with a concrete ValidatingAdmissionPolicy example using a simple CEL rule that enforces a per-HPA floor (e.g. syncPeriodSeconds >= 10). This doesn't require reasoning about aggregate frequency -- it's a straightforward per-object check.

This makes the feature safe by default while still giving cluster admins a simple knob if they want tighter control.
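
To illustrate point 4, the per-HPA floor could be enforced with a ValidatingAdmissionPolicy along these lines (a sketch only: the policy name, the floor of 10, and the exact CEL expression are assumptions, and the field path presumes `syncPeriodSeconds` lands under `spec.behavior` as proposed):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: hpa-sync-period-floor
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: ["autoscaling"]
      apiVersions: ["v2"]
      operations: ["CREATE", "UPDATE"]
      resources: ["horizontalpodautoscalers"]
  validations:
  - expression: >-
      !has(object.spec.behavior) ||
      !has(object.spec.behavior.syncPeriodSeconds) ||
      object.spec.behavior.syncPeriodSeconds >= 10
    message: "syncPeriodSeconds must be at least 10"
```

Because the rule is a per-object check, admins don't need to reason about aggregate cluster-wide sync frequency, and HPAs that leave the field unset are unaffected.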


The per-HPA sync frequency is implemented via a new `PerItemIntervalRateLimiter`
in the HPA controller's workqueue. This rate limiter supports per-key interval
Contributor

Wouldn't metrics-server and custom/external metrics API need to respond in less than 1s to ensure the workqueue doesn't saturate? This looks like a very short timeout. Do you see how to prevent the workqueue from blocking? Perhaps this new field should only be a hint (i.e. a best-effort goal)?

Author

Good concern. I've added a paragraph in Design Details clarifying that syncPeriodSeconds is a best-effort target interval, not a hard real-time guarantee. Specifically:

  • If a reconciliation cycle (including metrics queries and scale sub-resource calls) takes longer than the configured period, the controller will start the next cycle immediately after the current one completes rather than queuing up additional work.
  • This means the workqueue cannot saturate even if the metrics backend is slow or the configured period is shorter than the end-to-end reconciliation latency.
  • The rate limiter only re-enqueues a key after the current reconciliation for that key finishes, so there is at most one pending item per HPA in the queue at any time.

So effectively, syncPeriodSeconds is already a hint/best-effort goal by design. I've also reflected this in the Risks and Mitigations section.

Introduce KEP proposing an optional syncPeriodSeconds field on
HorizontalPodAutoscalerBehavior to allow per-HPA override of the
global --horizontal-pod-autoscaler-sync-period.
@Fedosin Fedosin force-pushed the configurable-hpa-sync-period branch from 68a5daa to df7747f on April 21, 2026 at 14:37

@wozniakjan wozniakjan left a comment


<3 from KEDA

3. Look for warnings and errors which might point to where the problem lies.

## Implementation History


Might be worth referencing the old tracking issue from before KEPs existed :)
kubernetes/kubernetes#110317
