ctx.NewSession from OnAgentEnd always races against app.busy (v0.79.0 phase-handoff pattern is unreliable) #63

@ezynda3

Summary

ctx.NewSession(prompt) called from an OnAgentEnd handler fails with "cannot start new session while agent is busy" because AgentEnd is emitted from TurnEndEvent before app.drainQueue clears app.busy. Worse, any post-turn tooling (formatters, linters, on-save hooks, hidden tool calls) extends the busy window further. No event signals "agent fully idle, post-hooks complete", so the v0.79.0 phase-handoff pattern is racy by construction.

This affects the canonical example at examples/extensions/phase-handoff.go and any extension that wants to chain a fresh session at the end of a turn.

Reproduction

A minimal extension:

//go:build ignore

package main

import "kit/ext"

func Init(api ext.API) {
    api.OnAgentEnd(func(_ ext.AgentEndEvent, ctx ext.Context) {
        go func() {
            if err := ctx.NewSession("hello from the next session"); err != nil {
                ctx.PrintError("NewSession: " + err.Error())
            }
        }()
    })
}

Run any single-turn prompt. The TUI prints:

NewSession: cannot start new session while agent is busy

This is 100% reproducible. Adding a time.Sleep before NewSession narrows the race window but never closes it deterministically: any post-turn save hook (e.g. a project go-edit-lint step or goimports formatter) can hold busy for seconds to minutes.

Root cause (from reading the source)

  • AgentEnd is bridged from kit.TurnEndEvent in pkg/kit/extensions_bridge.go:130-155. That event fires inside the agent loop.
  • app.busy is only cleared at internal/app/app.go:853 after drainQueue exits, which is after the entire queued-prompt cycle (including any extension/hook side-effects) settles.
  • RequestNewSessionFromExtension at internal/app/app.go:1242 checks IsBusy() and fails fast.

So AgentEnd is necessarily emitted while the app is still busy, and extensions have no idle event to latch onto.
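
The ordering described above can be modeled in a few lines. This is only a toy sketch: the app type, runTurn, and requestNewSession are hypothetical stand-ins, not kit's actual code, and the handler runs synchronously here (rather than in a goroutine as in the reproduction) to keep the result deterministic.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// app is a hypothetical stand-in for the application state; it is not kit's real type.
type app struct {
	mu   sync.Mutex
	busy bool
}

func (a *app) setBusy(b bool) {
	a.mu.Lock()
	a.busy = b
	a.mu.Unlock()
}

func (a *app) isBusy() bool {
	a.mu.Lock()
	defer a.mu.Unlock()
	return a.busy
}

// requestNewSession mimics a fail-fast busy check.
func (a *app) requestNewSession(prompt string) error {
	if a.isBusy() {
		return errors.New("cannot start new session while agent is busy")
	}
	return nil
}

// runTurn models the assumed ordering: busy is set, the AgentEnd handler
// fires, post-turn hooks run, and only then is busy cleared.
func (a *app) runTurn(onAgentEnd func(), postTurnHooks func()) {
	a.setBusy(true)
	onAgentEnd()
	postTurnHooks()
	a.setBusy(false)
}

func main() {
	a := &app{}
	var fromHandler error
	a.runTurn(
		func() { fromHandler = a.requestNewSession("next") },
		func() { /* formatters, linters, etc. would run here */ },
	)
	fmt.Println("from handler:", fromHandler)
	fmt.Println("after turn:", a.requestNewSession("next"))
}
```

In this model the call made from the handler always sees busy set, and the same call succeeds once runTurn has returned, which is the point an idle event would mark.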

Workaround (what I'm doing today)

Wrap ctx.NewSession in a long retry loop that retries while the error message contains the busy substring, using exponential backoff capped at 2 seconds:

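import (
    "fmt"
    "strings"
    "time"

    "kit/ext"
)

// newSessionWithRetry retries ctx.NewSession while the agent reports busy,
// backing off exponentially (capped at 2s) until a 10-minute deadline.
func newSessionWithRetry(ctx ext.Context, prompt string) error {
    const maxWait = 10 * time.Minute
    deadline := time.Now().Add(maxWait)
    wait := 250 * time.Millisecond
    for {
        err := ctx.NewSession(prompt)
        if err == nil {
            return nil
        }
        // Any error other than the busy error is not retryable.
        if !strings.Contains(err.Error(), "agent is busy") {
            return err
        }
        if time.Now().After(deadline) {
            return fmt.Errorf("agent still busy after %s: %w", maxWait, err)
        }
        time.Sleep(wait)
        // Double the delay, capped at 2 seconds.
        wait *= 2
        if wait > 2*time.Second {
            wait = 2 * time.Second
        }
    }
}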

This works but is ugly, depends on string-matching an internal error message, and every extension author has to re-invent it.

Proposed fixes

  1. Emit a new AgentIdle event (or use a name that states the intent more directly, e.g. AgentTurnSettled) that fires after drainQueue clears app.busy and any post-turn hooks complete. Document AgentEnd as "LLM finished generating; app may still be busy" and AgentIdle as "safe to call NewSession, SendMultimodalMessage, etc."
  2. Or: make RequestNewSessionFromExtension wait for idle internally instead of failing fast. Block on a condition variable (or buffer the request and consume it when busy flips to false) up to a documented timeout, then return the busy error only if the timeout elapses. This keeps the API surface the same and silently fixes every existing extension.
  3. At minimum: export a sentinel error (e.g. var ErrAgentBusy = errors.New(...)) so extensions can detect the condition with errors.Is instead of substring-matching. Today pkg/kit/extension_api.go and internal/app/app.go:1250 return a raw fmt.Errorf, so the message is the only handle.

Option 2 would be my preferred fix: it makes the canonical phase-handoff example "just work" without extension authors needing to know about the busy window.

Impact

This blocks the v0.79.0 phase-handoff workflow whenever the project has any post-turn tooling. We hit it on a Flyt project that runs go fix + go-edit-lint on edited files after every turn; the busy window regularly exceeds 6 seconds and had no fixed upper bound across project sizes.

Environment

  • github.com/mark3labs/kit v0.79.0
  • Go 1.26.4
  • Linux, interactive TUI
