[Bug] Reconciliation loop running too hot #415

Description

@gleason-m

Describe the bug

The reconciliation loop runs multiple times per second per TWD. The image below shows the count of "Running Reconcile loop" log lines during the upgrade, rollback, and roll-forward.

Image

I believe it is related to the non-deterministic map iteration in mapToStatus:

for buildID := range m.k8sState.Deployments {

My thinking is that iterating over this map makes the reconciler's final write emit the deprecatedVersions array in a non-deterministic order, which triggers reconciliation again even when no status actually changes (only the ordering does):

if err := r.Status().Update(ctx, &workerDeploy); err != nil {
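
To illustrate the suspected cause, here is a minimal sketch of iterating a map in a stable order by sorting its keys first. The `deployment` type and the `sortedBuildIDs` helper are hypothetical stand-ins, not the controller's actual types or code; the only assumption carried over from the issue is that the map is keyed by build ID.

```go
package main

import (
	"fmt"
	"sort"
)

// deployment is a hypothetical stand-in for the value type stored in
// k8sState.Deployments; the real type is not shown in this issue.
type deployment struct {
	Name string
}

// sortedBuildIDs returns the map's keys in ascending order, so callers can
// iterate deterministically instead of relying on Go's randomized map order.
func sortedBuildIDs(deployments map[string]deployment) []string {
	ids := make([]string, 0, len(deployments))
	for id := range deployments {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

func main() {
	deployments := map[string]deployment{
		"build-c": {Name: "c"},
		"build-a": {Name: "a"},
		"build-b": {Name: "b"},
	}
	// Ranging over the sorted keys yields the same order on every run.
	for _, id := range sortedBuildIDs(deployments) {
		fmt.Println(id, deployments[id].Name)
	}
}
```

If the status slice were built by ranging over sorted keys like this, two reconciles over the same state would produce byte-identical output, assuming nothing else in the status is order-dependent.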

Here you can see the TWD resourceVersion climbing several times per second while the generation stays at 264:

gleasonm@symbiote-dev:~$ kubectl get twd $NAME -n $NS -w -o custom-columns='RV:.metadata.resourceVersion,GEN:.metadata.generation,TARGET:.status.targetVersion.status' | while IFS= read -r line; do printf '%s  %s\n' "$(date '+%H:%M:%S.%3N')" "$line"; done
20:56:33.807  RV          GEN   TARGET
20:56:33.811  401336452   264   Current
20:56:33.830  401336457   264   Current
20:56:33.893  401336459   264   Current
20:56:33.950  401336461   264   Current
20:56:34.009  401336463   264   Current
20:56:34.077  401336466   264   Current
20:56:34.135  401336469   264   Current
20:56:34.188  401336472   264   Current
20:56:34.248  401336474   264   Current
20:56:34.310  401336479   264   Current
20:56:34.368  401336483   264   Current
20:56:34.431  401336486   264   Current
20:56:34.497  401336489   264   Current
20:56:34.550  401336492   264   Current
20:56:34.616  401336497   264   Current
20:56:34.671  401336499   264   Current
20:56:34.743  401336502   264   Current
20:56:34.826  401336506   264   Current
20:56:34.960  401336509   264   Current
20:56:35.016  401336512   264   Current

And here you can see there is no diff between two consecutive status snapshots once deprecatedVersions is sorted by buildID:

gleasonm@symbiote-dev:~$ kubectl get twd $NAME -n $NS -o json | jq '.status.deprecatedVersions |= sort_by(.buildID) | .status' > /tmp/s1.json
gleasonm@symbiote-dev:~$ kubectl get twd $NAME -n $NS -o json | jq '.status.deprecatedVersions |= sort_by(.buildID) | .status' > /tmp/s2.json
gleasonm@symbiote-dev:~$ diff /tmp/s1.json /tmp/s2.json
gleasonm@symbiote-dev:~$
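
The same order-insensitive comparison can be sketched in Go. This is only an illustration of the check done with jq above, not code from the controller: `versionStatus` is a hypothetical, simplified stand-in for a deprecatedVersions entry, the status strings are illustrative, and the sketch assumes build IDs are unique within the slice.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// versionStatus is a hypothetical, simplified stand-in for one entry of
// status.deprecatedVersions.
type versionStatus struct {
	BuildID string
	Status  string
}

// sortedByBuildID returns a copy sorted by BuildID, leaving the input untouched.
// It assumes BuildID values are unique within the slice.
func sortedByBuildID(in []versionStatus) []versionStatus {
	out := make([]versionStatus, len(in))
	copy(out, in)
	sort.Slice(out, func(i, j int) bool { return out[i].BuildID < out[j].BuildID })
	return out
}

// sameVersions reports whether two slices hold the same entries, ignoring order.
func sameVersions(a, b []versionStatus) bool {
	return reflect.DeepEqual(sortedByBuildID(a), sortedByBuildID(b))
}

func main() {
	a := []versionStatus{{BuildID: "b1", Status: "Drained"}, {BuildID: "b2", Status: "Draining"}}
	reordered := []versionStatus{a[1], a[0]}
	changed := []versionStatus{{BuildID: "b1", Status: "Drained"}, {BuildID: "b2", Status: "Drained"}}

	fmt.Println(sameVersions(a, reordered)) // true: only the order differs
	fmt.Println(sameVersions(a, changed))   // false: one entry's status differs
}
```

A comparison like this could be used in a test to confirm whether two consecutive reconciles differ only in ordering.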

However, I checked v1.1.1 and this code looks unchanged there, so I am not sure how to explain why I only observe this on v1.3.1.

Minimal Reproduction

I have not tested a repro, but I have a TWD with 21 deprecatedVersions and am running TWC v1.3.1.

Environment/Versions

  • OS: Linux
  • Temporal Server Version: 1.30.2
  • TWC version: 1.3.1
  • Helm Chart version: 0.20.0

Additional context

I was running the TWC on v1.1.1 and started seeing this issue after upgrading to v1.3.1. I couldn't see anything obvious in the changes between the two versions that would explain why the behavior started after the upgrade.
