fix: avoid local completion budget truncation#85
Conversation
🦋 Changeset detectedLatest commit: c93400d The changes in this PR will be included in the next version bump. This PR includes changesets to release 2 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 77b64d91d6
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| capability: this.agent.config.modelCapabilities, | ||
| generate: this.agent.generate, | ||
| completionBudgetConfig, | ||
| inputTokenCount: () => this.agent.context.tokenCountWithPending, |
There was a problem hiding this comment.
Include prompt/tool tokens in completion-budget input count
TurnFlow now feeds inputTokenCount from context.tokenCountWithPending, but that counter only tracks context history and pending messages; it does not include the system prompt or tool schemas that KosongLLM.chatOnce still sends on every request. On the first step (before any provider usage has been recorded), this makes the completion cap too large by exactly those omitted tokens, so near-limit requests can exceed max_context_tokens and fail upstream instead of being safely clamped.
Useful? React with 👍 / 👎.
| messages, | ||
| systemPrompt: this.agent.config.systemPrompt, | ||
| tools: this.agent.tools.loopTools, | ||
| inputTokenCount: this.tokenCountWithPending, |
There was a problem hiding this comment.
Clamp compaction budget against compaction input, not full context
generateCompactionResponse now passes this.tokenCountWithPending into applyCompletionBudget, but the compaction request actually uses only the compacted prefix plus the compaction instruction; when overall context is near the limit, this overestimates request input and can force max_completion_tokens down to 1 even though the compaction prompt has plenty of headroom. In that state, the compaction model can return an empty/invalid summary and repeatedly fail retries, blocking recovery exactly when compaction is needed.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3319815f51
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| messages, | ||
| systemPrompt: this.agent.config.systemPrompt, | ||
| tools: this.agent.tools.loopTools, | ||
| inputTokenCount: 0, |
There was a problem hiding this comment.
Clamp compaction cap against actual prompt tokens
Setting inputTokenCount to 0 here makes compaction always request roughly max_context_tokens - safetyMargin output tokens, even when the compaction prompt itself is already near the context limit. In that common near-full-context path, the generated request budget can exceed the model window by thousands of tokens (input + requested output), which causes provider-side invalid-request/context-overflow failures because withMaxCompletionTokens only forwards the cap and does not clamp it to the serialized prompt size. This regresses compaction reliability exactly when compaction is most needed.
Useful? React with 👍 / 👎.
| capability: this.agent.config.modelCapabilities, | ||
| generate: this.agent.generate, | ||
| completionBudgetConfig, | ||
| inputTokenCount: () => this.agent.context.tokenCountWithPending, |
There was a problem hiding this comment.
Include system/tools in initial turn budget input count
This switches turn-time budgeting to context.tokenCountWithPending, which only tracks history-message tokens until provider usage has covered a step; it does not include system prompt or tool schema tokens in fresh/reset states. On the first request of a session (and immediately after compaction/clear), the cap is therefore computed from message tokens alone, so large system/tool payloads can make input + max_completion_tokens exceed max_context_tokens and trigger provider invalid_request_error near the context boundary. The previous path budgeted from the exact request components and did not undercount this way.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c93400d490
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| const cap = | ||
| args.budget.hardCap ?? | ||
| (maxCtx > 0 ? maxCtx : args.budget.fallback ?? DEFAULT_UNKNOWN_CONTEXT_FALLBACK); |
There was a problem hiding this comment.
Reintroduce remaining-context clamp for completion budget
computeCompletionBudgetCap now returns hardCap (or full max_context_tokens) without subtracting current input tokens, so near-full conversations can send max_completion_tokens values where prompt_tokens + max_completion_tokens exceeds the model window. In that case the provider can reject the request (e.g., invalid request/context overflow) before any summary is produced, which is a regression from the previous behavior that clamped to remaining context and is especially risky in compaction flows that run when context is already close to full.
Useful? React with 👍 / 👎.
| // serialized prompt. Locally using the largest cap avoids cutting off | ||
| // thinking before the model produces a summary. | ||
| const cap = | ||
| args.budget.hardCap ?? |
There was a problem hiding this comment.
Bound configured hard cap by model context limit
computeCompletionBudgetCap now prefers budget.hardCap verbatim, so if KIMI_MODEL_MAX_COMPLETION_TOKENS/KIMI_MODEL_MAX_TOKENS is set above the model’s context window, every request can carry an impossible completion limit and fail with provider-side invalid-request/context-limit errors. Before this commit, explicit caps were still clamped by available context, so this change introduces hard failures for oversized but previously tolerable operator configuration.
Useful? React with 👍 / 👎.
| // thinking before the model produces a summary. | ||
| const cap = | ||
| args.budget.hardCap ?? | ||
| (maxCtx > 0 ? maxCtx : args.budget.fallback ?? DEFAULT_UNKNOWN_CONTEXT_FALLBACK); |
There was a problem hiding this comment.
Avoid maxing completion cap and triggering TPM throttling
Setting the cap to full max_context_tokens for known models inflates max_completion_tokens on every request, even when the prompt is short. Kimi’s gateway rate-limit accounting uses prompt_tokens + max_completion_tokens (not actual generated tokens), so this change can sharply increase measured TPM/TPD and cause avoidable 429/quota errors in normal turns. The previous input-aware clamp kept this parameter closer to feasible remaining budget and avoided this artificial rate-limit pressure.
Useful? React with 👍 / 👎.
Related Issue
No linked issue. This fixes overly small local completion token caps in long-context conversations.
Problem
The local completion budget clamp subtracted the estimated input size and a safety margin before setting
max_completion_tokens. In near-limit paths, especially compaction, that could reduce the cap too aggressively and truncate model thinking before a summary was produced.The provider backend already computes a safe request-specific value from the serialized prompt. Keeping a smaller local estimate adds risk without improving the final server-side safety check.
What changed
inputTokenCountand safety-margin plumbing from ordinary turns, compaction, and the LLM adapter.Verification
pnpm exec vitest run packages/agent-core/test/utils/completion-budget.test.ts packages/agent-core/test/agent/kosong-llm.test.ts packages/agent-core/test/agent/compaction.test.ts packages/agent-core/test/agent/context.test.tspnpm run typecheckgit diff --checkChecklist
gen-changesetsskill, or this PR needs no changeset.gen-docsskill, or this PR needs no doc update.