Skip to content

fix(clustered): correctly reconcile JobSet phase during restarts#7517

Open
AdilFayyaz wants to merge 2 commits into
mainfrom
adil/jobsets-restart-reconcile-fix
Open

fix(clustered): correctly reconcile JobSet phase during restarts#7517
AdilFayyaz wants to merge 2 commits into
mainfrom
adil/jobsets-restart-reconcile-fix

Conversation

@AdilFayyaz

@AdilFayyaz AdilFayyaz commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Why are the changes needed?

The clustered (JobSet) plugin was prematurely surfacing task failures during JobSet restarts: a single worker failure
within the restart budget could fast-fail the task, and rank-0 pod lookups used an unsuffixed name that never matched the real, randomly-suffixed pod. This caused spurious failures and missed diagnostics while a JobSet was legitimately restarting.

What changes were proposed in this pull request?

  • Gate maybeFastFailWorker0 on restart budget: worker failures only surface as RetryableFailure once Status.Restarts >= MaxRestarts; otherwise the task stays Running. Pending-pod diagnostics (e.g. ImagePullBackOff) still fast-fail regardless of budget.
  • Handle the RestartingJobSet condition and the no-condition-yet case via a new hasJobSetStarted check (restarts, worker readiness/activity, or prior plugin state), reporting Running with a restart-attempt reason instead of falling back to Initializing/failure.
  • Replace unsuffixed rank0PodName matching with findRank0Pod, a suffix-tolerant lookup that deterministically prefer

How was this patch tested?

  • go test ./flyteplugins/go/tasks/plugins/k8s/clustered/...
  • New unit tests cover: suffixed/deterministic rank-0 selection, failure-with-budget-remaining → Running, budget-exhausted → RetryableFailure, pending ImagePullBackOff → fast-fail regardless of budget, RestartingJobSet condition → Running (attempt N), restarts with no true condition → Running, and plugin-state read error → fallback to status.
  • Tested a failing example from the sdk on dogfood-1

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Stack

If you do use git town to manage PR Stacks, the stack relevant to this PR
will show below. Otherwise, you can ignore this section.

Docs link

Signed-off-by: M. Adil Fayyaz <62440954+AdilFayyaz@users.noreply.github.com>
@AdilFayyaz AdilFayyaz self-assigned this Jun 12, 2026
Copilot AI review requested due to automatic review settings June 12, 2026 22:35

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves the clustered (JobSet) plugin’s phase reconciliation during JobSet restarts to avoid prematurely surfacing worker failures and to correctly identify rank-0 pods whose real names include controller-assigned suffixes.

Changes:

  • Add suffix-tolerant rank-0 pod identification/selection and reuse it across fast-fail and maintenance-retry paths.
  • Gate worker-failure fast-fail behavior on restart budget exhaustion, while still fast-failing pending-pod diagnostics.
  • Add/extend unit tests covering restart-condition handling, budget gating behavior, and deterministic rank-0 selection.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
flyteplugins/go/tasks/plugins/k8s/clustered/util.go Adds rank-0 pod matching and findRank0Pod selection logic used by phase/log inspection.
flyteplugins/go/tasks/plugins/k8s/clustered/phase.go Updates phase reconciliation to account for restarts, restart budget, and restarting conditions.
flyteplugins/go/tasks/plugins/k8s/clustered/logs.go Uses rank-0 prefix matching helper for primary pod selection in log context.
flyteplugins/go/tasks/plugins/k8s/clustered/clustered_test.go Adds tests for new restart/budget logic and suffixed rank-0 pod handling.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread flyteplugins/go/tasks/plugins/k8s/clustered/util.go
Comment thread flyteplugins/go/tasks/plugins/k8s/clustered/phase.go
Comment thread flyteplugins/go/tasks/plugins/k8s/clustered/util.go Outdated
Signed-off-by: M. Adil Fayyaz <62440954+AdilFayyaz@users.noreply.github.com>
@AdilFayyaz AdilFayyaz requested a review from pingsutw June 12, 2026 22:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants