Runtime K8s scheduler backend hard-errors when scheduler.kubernetes.service_url is unset (build stage auto-derives the same value) #179

@initializ-mk

Symptom

When forge runs in-cluster and `scheduler.kubernetes.service_url` is not set in `forge.yaml`, the agent fails to start with:

```
Error: kubernetes scheduler backend: scheduler.kubernetes.service_url is required
```

Operators hit this on any cluster deploy that uses the default `scheduler.backend: auto` (which picks `kubernetes` when `scheduler.InCluster()` reports true) and does not explicitly set `service_url`.
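
To make the trigger concrete, here is a minimal sketch of how `auto` could resolve to a backend. `resolveBackend` is a hypothetical name, not forge's actual function, and the fallback to `file` outside a cluster is an assumption; the issue only states that `auto` picks `kubernetes` when `scheduler.InCluster()` reports true.

```go
package main

import "fmt"

// resolveBackend is a hypothetical stand-in for forge's backend selection.
// inCluster represents the result of scheduler.InCluster().
func resolveBackend(configured string, inCluster bool) string {
	switch configured {
	case "", "auto":
		if inCluster {
			// Per the issue: auto picks kubernetes when running in-cluster.
			return "kubernetes"
		}
		// Assumption: outside a cluster, auto falls back to the file backend.
		return "file"
	default:
		// An explicit backend setting is used as-is.
		return configured
	}
}

func main() {
	fmt.Println(resolveBackend("auto", true)) // kubernetes
	fmt.Println(resolveBackend("file", true)) // file
}
```

Under this reading, an operator who never touches the `scheduler:` block still ends up on the Kubernetes backend in-cluster, which is why the missing `service_url` surfaces without any opt-in.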

Why it's a footgun

The build-time schedule-manifest stage already knows how to derive a sensible default:

```go
// forge-cli/build/schedule_manifest_stage.go:70-82
serviceURL := cfg.Scheduler.Kubernetes.ServiceURL
if serviceURL == "" {
	port := 8080
	if bc.Spec != nil && bc.Spec.Runtime != nil && bc.Spec.Runtime.Port != 0 {
		port = bc.Spec.Runtime.Port
	}
	serviceURL = fmt.Sprintf("http://%s.%s.svc:%d/", agentID, ns, port)
}
```

The runtime at `forge-cli/runtime/scheduler_k8s_backend.go:104-106` does not:

```go
if cfg.ServiceURL == "" {
	return nil, fmt.Errorf("kubernetes scheduler backend: scheduler.kubernetes.service_url is required")
}
```

So an operator who runs `forge package` once gets working CronJob manifests with the right Service DNS — but if their pod starts without an explicit `service_url` (e.g. they didn't re-render forge.yaml from the build output), the runtime refuses to come up. Two adjacent code paths, two different decisions for the same missing field.

Proposed fix

Mirror the build-stage default in the runtime:

  • `NewKubernetesBackend` (and the `-WithClient` test seam) derives `http://<agent_id>.<namespace>.svc:<port>/` when `cfg.ServiceURL` is empty.
  • `K8sBackendConfig` gains a `Port int` field; `selectScheduleBackend` plumbs `r.cfg.Port` into it.
  • The `service_url is required` error path is removed (there's no scenario where we can't derive a default in-cluster).
  • Operator override semantics unchanged: an explicit `scheduler.kubernetes.service_url` always wins.
  • Tests pin both branches (default-derivation + explicit override unchanged).
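
The derivation described in the list above could look roughly like the sketch below. It is not the actual patch: `resolveServiceURL` is a hypothetical helper, and the `AgentID` and `Namespace` fields on the config struct are assumptions made to keep the example self-contained (the issue only names `ServiceURL` and the proposed `Port`). The format string and the 8080 fallback are copied from the build stage.

```go
package main

import "fmt"

// defaultPort mirrors the build stage's fallback of 8080.
const defaultPort = 8080

// K8sBackendConfig is trimmed to what this sketch needs.
// AgentID and Namespace are assumed fields, not confirmed by the issue.
type K8sBackendConfig struct {
	ServiceURL string
	AgentID    string
	Namespace  string
	Port       int
}

// resolveServiceURL (hypothetical helper) returns the explicit URL when set,
// otherwise derives the in-cluster Service DNS URL like the build stage does.
func resolveServiceURL(cfg K8sBackendConfig) string {
	if cfg.ServiceURL != "" {
		return cfg.ServiceURL // operator override always wins
	}
	port := cfg.Port
	if port == 0 {
		port = defaultPort
	}
	return fmt.Sprintf("http://%s.%s.svc:%d/", cfg.AgentID, cfg.Namespace, port)
}

func main() {
	// Default derivation: no service_url, no port.
	fmt.Println(resolveServiceURL(K8sBackendConfig{AgentID: "my-agent", Namespace: "prod"}))
	// prints "http://my-agent.prod.svc:8080/"

	// Explicit override is returned unchanged.
	fmt.Println(resolveServiceURL(K8sBackendConfig{
		ServiceURL: "http://custom.example:8000/",
		AgentID:    "my-agent",
		Namespace:  "prod",
	}))
	// prints "http://custom.example:8000/"
}
```

The two calls in `main` correspond to the two branches the tests are meant to pin: default derivation and explicit override.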

Repro

`forge.yaml` without a `scheduler:` block at all; deploy the image to any K8s namespace. Pod logs:

```
{"level":"info","msg":"schedule tools registered"}
Error: kubernetes scheduler backend: scheduler.kubernetes.service_url is required
```

Workarounds today

Either set `scheduler.backend: file` (skip K8s entirely) or set `scheduler.kubernetes.service_url` to the same value the build stage would derive.

Scope

Code: `forge-cli/runtime/scheduler_k8s_backend.go`, `forge-cli/runtime/runner.go` (one-line plumbing).
Tests: `forge-cli/runtime/scheduler_k8s_backend_test.go`.
Docs: `docs/deployment/scheduler-kubernetes.md` and `docs/core-concepts/scheduling.md` to mention the new default.

Labels: bug
