feat(scheduler): Backend interface + FileBackend + CronJob manifest helpers (#162 part 2)#169

Merged
initializ-mk merged 1 commit into main from feat/issue-162-k8s-scheduler
Jun 15, 2026

@initializ-mk

Contributor

Summary

First half of part 2 of the #162 stack. Introduces the `ScheduleBackend` abstraction the runtime and `forge package` both consume, ships the `FileBackend` wrapping existing behavior with zero observable change, and lands the pure-Go CronJob manifest builders + in-cluster detection helper that parts 2b/3 build on.

Scope split inside part 2

`#162 part 2` as originally scoped was "backend interface + impls + auto-detection." The K8s runtime CRUD implementation requires `k8s.io/client-go`, a ~10 MB transitive dependency. To keep this PR reviewable, I split part 2 into sub-parts:

| Sub-part | Scope | Status |
| --- | --- | --- |
| 2 (this PR) | `ScheduleBackend` interface, `FileBackend` refactor, `CronJobYAML` + `CronJobName` + `InCluster` helpers, forge.yaml config block. Zero new deps. | this PR |
| 2b | `KubernetesBackend` with `k8s.io/client-go` for runtime CronJob CRUD (enables the LLM-driven `schedule_set` path) | next |
| 3 | `forge package` wires the CronJob manifests + Secret template + Role into the build output | last |

Once part 3 lands, this PR is enough to ship static-schedule K8s deploys end-to-end without part 2b: `forge.yaml` schedules become CronJobs at build time, the cluster's CronJob controller handles timing, and no runtime API calls are needed.

Files

| File | Change |
| --- | --- |
| `forge-core/scheduler/backend.go` (new) | `Backend` interface (Start, Stop, Reload, Sync, List, Get, Set, Delete, History, Store); `FileBackend` wrapping `Scheduler` + `ScheduleStore`; `SourceYAML` / `SourceLLM` constants |
| `forge-core/scheduler/k8s_detect.go` (new) | `InCluster()` + `FORGE_IN_CLUSTER` override |
| `forge-core/scheduler/k8s_manifest.go` (new) | `CronJobName` (deterministic, 63-char safe via hash-suffix) + `CronJobYAML` (pure-Go manifest text, shell-escaped task body) |
| `forge-core/types/config.go` | `SchedulerConfig` + `K8sSchedulerConfig` blocks |
| `forge-cli/runtime/runner.go` | Swap `r.sched *Scheduler` for `r.schedBackend Backend`; FileBackend constructed at the existing wiring point. Zero behavior change. |
| `docs/deployment/scheduler-kubernetes.md` (new) | Operator-facing reference: resolution ladder, manifest shape, token plumbing via `forge auth`, security model |
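
For readers who want to see the shape of the `k8s_detect.go` helper, here is a minimal sketch of in-cluster detection. It uses the ServiceAccount token path and the `FORGE_IN_CLUSTER` variable named in the commit message. The function name `inCluster`, the injected `getenv`/`exists` parameters, and the boolean parsing of the override are assumptions for illustration, not the shipped code.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Default in-cluster signal named in the commit message.
const saTokenPath = "/var/run/secrets/kubernetes.io/serviceaccount/token"

// inCluster is a hypothetical stand-in for scheduler.InCluster.
// getenv and exists are injected so the logic can be exercised without a cluster.
func inCluster(getenv func(string) string, exists func(string) bool) bool {
	// Assumption: the override is parsed as a Go boolean ("true", "false", "1", "0").
	if v := getenv("FORGE_IN_CLUSTER"); v != "" {
		if b, err := strconv.ParseBool(v); err == nil {
			return b // the override wins in both directions
		}
	}
	// No usable override: fall back to the projected token file.
	return exists(saTokenPath)
}

// fileExists reports whether path can be stat'ed.
func fileExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	fmt.Println("in cluster:", inCluster(os.Getenv, fileExists))
}
```

Injecting the environment and filesystem lookups keeps the override testable in both directions, which is what the env-override test in the commit message appears to cover.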

Behavior change

None. `FileBackend` is a structural wrapper around the existing `Scheduler` + `MemoryScheduleStore`. The 30s ticker, the SCHEDULES.md persistence, the overlap-skip behavior, and the audit events are all unchanged. The `lazyScheduleReloader` still works because `Backend.Reload` delegates through to the wrapped scheduler.
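
To illustrate what "structural wrapper" means here, the sketch below forwards every call to the wrapped pieces and adds no logic of its own. The PR text gives only method names, so the signatures, the `tickerScheduler` and `memoryStore` stand-ins, and the trimmed method set are all assumptions.

```go
package main

import "fmt"

// Schedule is a hypothetical, trimmed-down schedule record.
type Schedule struct{ ID, Cron, Task string }

// Backend lists a subset of the methods named in the Files table.
// The signatures are assumptions; only the names come from the PR.
type Backend interface {
	Start() error
	Stop() error
	Reload() error
	List() []Schedule
}

// tickerScheduler stands in for the existing Scheduler.
type tickerScheduler struct {
	running bool
	reloads int
}

func (s *tickerScheduler) Start() error  { s.running = true; return nil }
func (s *tickerScheduler) Stop() error   { s.running = false; return nil }
func (s *tickerScheduler) Reload() error { s.reloads++; return nil }

// memoryStore stands in for the schedule store.
type memoryStore struct{ items []Schedule }

func (m *memoryStore) List() []Schedule { return m.items }

// FileBackend forwards timing calls to the scheduler and reads to the store.
type FileBackend struct {
	sched *tickerScheduler
	store *memoryStore
}

func (b *FileBackend) Start() error     { return b.sched.Start() }
func (b *FileBackend) Stop() error      { return b.sched.Stop() }
func (b *FileBackend) Reload() error    { return b.sched.Reload() }
func (b *FileBackend) List() []Schedule { return b.store.List() }

// Compile-time check that the wrapper satisfies the interface.
var _ Backend = (*FileBackend)(nil)

func main() {
	sched := &tickerScheduler{}
	var b Backend = &FileBackend{sched: sched, store: &memoryStore{}}
	_ = b.Start()
	_ = b.Reload()
	fmt.Println(sched.running, sched.reloads) // prints "true 1"
}
```

Because the runner holds only the interface, a second implementation can be swapped in later without touching call sites.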

The `scheduler` block in forge.yaml is purely additive — the existing `schedules[]` array continues to work as before.

Reconciliation rules pinned by Sync

| Case | Behavior |
| --- | --- |
| New yaml-sourced entry | Insert |
| Existing yaml-sourced entry | Update task / cron / channel, preserve LastRun / RunCount / LastStatus |
| yaml-sourced entry no longer in manifest | Delete |
| LLM-sourced entry | Always preserved: the LLM owns the chat-created schedule, so the declarative reconcile path doesn't touch it |

Test: `TestFileBackend_SyncPreservesLLMSourced` pins the LLM-preservation invariant.
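
The four reconciliation rules can be sketched against an in-memory map. This is an illustration under stated assumptions, not the shipped `Sync`: the `Schedule` fields, the string values of the source constants, the `syncSchedules` name, and the handling of an ID collision between a declared entry and an LLM-sourced one are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Assumed values; the PR names the constants but not their contents.
const (
	SourceYAML = "yaml"
	SourceLLM  = "llm"
)

// Schedule is a hypothetical record with definition and per-run state.
type Schedule struct {
	ID, Cron, Task, Channel, Source string
	LastStatus                      string
	RunCount                        int
}

// syncSchedules reconciles store (keyed by schedule ID) against the
// yaml-declared entries.
func syncSchedules(store map[string]Schedule, declared []Schedule) {
	want := make(map[string]bool, len(declared))
	for _, d := range declared {
		want[d.ID] = true
		cur, exists := store[d.ID]
		if exists && cur.Source == SourceLLM {
			continue // assumption: an ID collision leaves the LLM entry untouched
		}
		d.Source = SourceYAML
		if exists {
			// Existing yaml entry: new definition, old per-run state.
			d.RunCount, d.LastStatus = cur.RunCount, cur.LastStatus
		}
		store[d.ID] = d
	}
	// Prune yaml entries dropped from the manifest; LLM entries are skipped.
	for id, cur := range store {
		if cur.Source == SourceYAML && !want[id] {
			delete(store, id)
		}
	}
}

func main() {
	store := map[string]Schedule{
		"report": {ID: "report", Cron: "0 9 * * *", Source: SourceYAML, RunCount: 7, LastStatus: "ok"},
		"stale":  {ID: "stale", Cron: "0 0 * * *", Source: SourceYAML},
		"chat-1": {ID: "chat-1", Cron: "*/5 * * * *", Source: SourceLLM},
	}
	syncSchedules(store, []Schedule{
		{ID: "report", Cron: "0 10 * * *"},
		{ID: "backup", Cron: "0 2 * * *"},
	})
	ids := make([]string, 0, len(store))
	for id := range store {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	fmt.Println(ids, store["report"].Cron, store["report"].RunCount)
}
```

Running `syncSchedules` twice with the same input leaves the map unchanged, which matches the idempotency the backend tests check.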

CronJob manifest shape (used by parts 2b + 3)

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: forge-<agent_id>-<schedule_id>   # 63-char safe, hash-suffix when truncating
  labels:
    forge.agent.id: <agent_id>
    forge.schedule.id: <schedule_id>
    forge.schedule.source: yaml | llm
spec:
  schedule: ""
  concurrencyPolicy: Forbid              # K8s-native overlap skip
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: trigger
              image: curlimages/curl:8.10.1
              env:
                - name: FORGE_AUTH_TOKEN
                  valueFrom: {secretKeyRef: {name: <agent_id>-internal-token, key: token}}
              command: ["sh", "-c"]
              args:
                # Folded scalar: the lines below join into one shell command.
                # The Authorization header is double-quoted so sh expands the token.
                - >-
                  curl -sfX POST
                  -H "Authorization: Bearer $FORGE_AUTH_TOKEN"
                  -H 'X-Forge-Schedule-Id: '
                  --data '...'
```

`X-Forge-Schedule-Id` is the hook part 3 uses for audit-event linkage (the agent recognizes scheduled fires and emits `schedule_fire` / `schedule_complete` around the normal A2A dispatch).
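
The `metadata.name` comment above mentions hash-suffix truncation. A minimal sketch of that idea follows, assuming SHA-256, an 8-hex-character suffix, and a simple lowercase-and-dash sanitizer; the real `CronJobName` may pick a different hash, suffix length, or sanitization rule.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"regexp"
	"strings"
)

// Kubernetes DNS-label limit.
const maxNameLen = 63

var invalidRun = regexp.MustCompile(`[^a-z0-9]+`)

// sanitize lowercases s and collapses each run of characters outside
// [a-z0-9] into a single '-', trimming dashes at both ends.
func sanitize(s string) string {
	return strings.Trim(invalidRun.ReplaceAllString(strings.ToLower(s), "-"), "-")
}

// cronJobName is a hypothetical take on CronJobName.
func cronJobName(agentID, scheduleID string) string {
	name := "forge-" + sanitize(agentID) + "-" + sanitize(scheduleID)
	if len(name) <= maxNameLen {
		return name
	}
	// Hash the raw pair so names that share a long prefix still differ.
	sum := sha256.Sum256([]byte(agentID + "/" + scheduleID))
	suffix := hex.EncodeToString(sum[:4]) // 8 hex chars; the length is an assumption
	// name is pure ASCII after sanitize, so byte slicing is safe.
	head := strings.TrimRight(name[:maxNameLen-len(suffix)-1], "-")
	return head + "-" + suffix
}

func main() {
	fmt.Println(cronJobName("Billing_Agent", "daily.report"))
	fmt.Println(len(cronJobName("agent", strings.Repeat("a", 80))))
}
```

Deriving the suffix from the untruncated inputs keeps the name deterministic across builds while separating pairs that would otherwise truncate to the same string.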

Test plan

- `go test -count=1 ./...` clean in forge-core and forge-cli
- `golangci-lint run ./...` → 0 issues in both modules
- `gofmt -w` applied
- All 10 new tests pass:
  - FileBackend (5): Sync upserts, Sync prunes, Sync preserves LLM-sourced, Sync preserves per-run state, Store() accessor
  - CronJob manifest (5): name fits K8s limits, name uniqueness via hash-suffix, manifest has required fields, defaults applied, single-quote shell escaping
- All existing scheduler tests pass (Scheduler_SingleFire, Reload, Overlap, StartStop), verifying that the FileBackend wrapper doesn't regress timing semantics
- Manual end-to-end deferred to part 3 (when `forge package` wires the manifest builder)
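
The single-quote shell escaping covered by the manifest tests can be sketched with the standard POSIX idiom: close the quote, emit an escaped quote, reopen. The helper name `shellSingleQuote` is hypothetical, and the JSON encoding of the task text that would precede this step is not shown.

```go
package main

import (
	"fmt"
	"strings"
)

// shellSingleQuote wraps s in single quotes for POSIX sh. Each embedded
// single quote becomes '\'' so the result is parsed as one literal word.
func shellSingleQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

func main() {
	// Everything except the single quote is literal inside single quotes,
	// so $USER stays unexpanded.
	fmt.Println(shellSingleQuote(`say 'hi' to $USER`))
	// prints: 'say '\''hi'\'' to $USER'
}
```

Single quoting leaves only one character to escape, which is why it suits a body that contains arbitrary task text.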

Refs #162

…elpers (#162 part 2)

First half of part 2 of the #162 stack. Introduces the ScheduleBackend
abstraction the runtime + 'forge package' both consume, ships the
FileBackend wrapping existing behavior with zero observable change,
and lands the pure-Go CronJob manifest builders + in-cluster detection
helper that parts 2b/3 build on.

forge-core/scheduler/backend.go
  ScheduleBackend interface (Start, Stop, Reload, Sync, List, Get,
  Set, Delete, History) bundling timing + persistence into one
  surface. FileBackend wraps Scheduler + ScheduleStore behind the
  interface. SourceYAML / SourceLLM constants pin the reconciliation
  rule: Sync upserts yaml-sourced entries, prunes yaml-sourced
  entries dropped from the manifest, preserves LLM-sourced ones,
  and carries LastRun / RunCount across re-Sync.

forge-core/scheduler/k8s_detect.go
  InCluster() helper. Default signal is the projected ServiceAccount
  token at /var/run/secrets/kubernetes.io/serviceaccount/token.
  FORGE_IN_CLUSTER env var overrides for dev/test.

forge-core/scheduler/k8s_manifest.go
  CronJobName + CronJobYAML — pure-Go manifest builders. K8s-name
  sanitization (RFC 1123 subset) with hash-suffix when truncating
  past 63 chars so distinct (agent, schedule) pairs that share a
  prefix don't collide. CronJob spec uses concurrencyPolicy: Forbid
  (K8s-native equivalent of FileBackend's overlap skip) and embeds
  X-Forge-Schedule-Id header (enables schedule_fire / _complete
  audit-event linkage in part 3). Task text is shell-escaped for
  the single-quoted JSON-RPC body.

forge-core/types/config.go
  SchedulerConfig + K8sSchedulerConfig blocks on ForgeConfig.
  Backend: "auto" (default) | "file" | "kubernetes". Kubernetes
  block fields: Namespace, ServiceURL, AllowDynamic (default false),
  TriggerImage, AuthSecretName.

forge-cli/runtime/runner.go
  Swap r.sched *Scheduler for r.schedBackend Backend. FileBackend
  constructed at the existing wiring point with the same Scheduler
  + store the pre-#162 path used. Zero behavior change. The
  lazyScheduleReloader delegates Reload through the Backend.

Tests:
- backend_test.go (5 tests): Sync idempotency, yaml-entry pruning,
  LLM-entry preservation, per-run state preservation across re-Sync,
  Store() accessor.
- k8s_manifest_test.go (5 tests): CronJob name fits 63 chars,
  hash-suffix uniqueness on truncation, manifest has required
  K8s-side fields, defaults applied for missing inputs, single-quote
  shell escaping, FORGE_IN_CLUSTER env override (both directions).

Docs:
- docs/deployment/scheduler-kubernetes.md — new operator-facing
  reference: backend resolution ladder, CronJob shape, token
  plumbing via forge auth, security model.

NOT in this PR (deliberate split):
- KubernetesBackend with client-go for runtime CronJob CRUD. The
  runtime backend interface and helper code shipped here are enough
  for static (forge.yaml) schedules deployed via 'forge package' in
  part 3 — the cluster's CronJob controller handles timing without
  any runtime API calls. The LLM-driven schedule_set path requires
  client-go and lands as part 2b.

Refs #162
initializ-mk merged commit 6cb7f55 into main on Jun 15, 2026
10 checks passed
initializ-mk added a commit that referenced this pull request Jun 15, 2026
… CRUD (#162 part 2b)

Second half of part 2 of the #162 stack. Builds on the
ScheduleBackend interface + FileBackend refactor + manifest
helpers shipped in part 2 (PR #169). Adds the real K8s
runtime backend, dependency on k8s.io/client-go, and the
forge.yaml-driven backend selection in the runner.

forge-cli/runtime/scheduler_k8s_backend.go
  KubernetesBackend implements scheduler.Backend by delegating
  persistence + timing to the cluster's CronJob controller:

  - Start/Stop/Reload are no-ops (cluster owns timing)
  - Sync reconciles cluster CronJobs against declared yaml
    entries: create, update on drift, prune dropped yaml
    entries, PRESERVE LLM-sourced entries unconditionally
  - Set / Delete are gated by AllowDynamic (default false);
    yaml-sourced CronJobs cannot be deleted via direct Delete
    (Sync is the only removal path for declarative entries)
  - List filters by forge.agent.id label so unrelated CronJobs
    in the namespace don't appear in schedule_list output
  - History returns empty + warns once; the audit stream's
    schedule_fire/complete events are the canonical source
  - CronJobs are constructed in-memory to match the YAML the
    forge-core scheduler.CronJobYAML emits byte-for-byte, so
    runtime reconcile doesn't churn against forge package
    manifests (#162 part 3)
  - Round-trips Schedule <-> CronJob via labels (agent.id,
    schedule.id, schedule.source) and annotations (task, skill,
    channel, channel_target, run_count, last_status)
  - LastRun read from CronJob.Status.LastScheduleTime
  - NewKubernetesBackendWithClient testing seam accepts an
    explicit kubernetes.Interface (fake.Clientset in tests)

forge-cli/runtime/runner.go
  Backend selection wired off forge.yaml scheduler.backend:
  - "kubernetes" — always K8s; errors at startup when not
    in-cluster (FORGE_IN_CLUSTER=true overrides for tests)
  - "file"       — always FileBackend
  - "auto"/""    — K8s when in-cluster, file otherwise
  Drops the old syncYAMLSchedules helper now that the runner
  calls Backend.Sync(declaredSchedules()) for both modes.

go.mod
  k8s.io/api, k8s.io/apimachinery, k8s.io/client-go @ v0.36.2.

Tests (9 cases against fake.Clientset):
  - Sync creates CronJobs for declared entries with the
    expected labels + ConcurrencyPolicy=Forbid
  - Sync is idempotent (no churn on no-op re-run)
  - Sync updates on cron drift
  - Sync prunes yaml entries removed from the manifest
  - Sync preserves LLM-sourced entries on yaml-only re-Sync
  - Dynamic Set is gated by AllowDynamic with an actionable
    error referencing the config flag
  - Dynamic Delete of a yaml-sourced schedule is refused with
    an error pointing operators at the manifest
  - List filters by forge.agent.id (unrelated CronJobs in the
    namespace are not returned)
  - History returns empty + does not error

Docs:
  - docs/deployment/scheduler-kubernetes.md gains the RBAC
    table, the annotation round-trip reference, and the
    "what's not in the K8s backend" section flagging
    schedule_history deferral + cross-namespace out-of-scope
    + token rotation as a follow-up.

Refs #162
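
The backend selection described for `runner.go` in the commit above can be sketched as a small pure function. The function name `resolveBackend`, the error wording, and the rejection of unknown values are assumptions; only the three documented settings and their outcomes come from the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// resolveBackend maps the forge.yaml scheduler.backend value plus the
// in-cluster signal to a concrete backend name.
func resolveBackend(configured string, inCluster bool) (string, error) {
	switch configured {
	case "file":
		return "file", nil
	case "kubernetes":
		if !inCluster {
			// Fail at startup rather than fall back silently.
			return "", errors.New(`scheduler.backend is "kubernetes" but the process is not in-cluster`)
		}
		return "kubernetes", nil
	case "", "auto":
		if inCluster {
			return "kubernetes", nil
		}
		return "file", nil
	default:
		// Assumption: unknown values are rejected; the source does not say.
		return "", fmt.Errorf("unknown scheduler.backend %q", configured)
	}
}

func main() {
	for _, c := range []string{"", "auto", "file", "kubernetes"} {
		name, err := resolveBackend(c, false)
		fmt.Printf("%q outside a cluster -> %q, err=%v\n", c, name, err)
	}
}
```

Keeping the decision separate from client construction lets the rules be tested without a cluster or a fake clientset.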