Skip to content

fix: avoid local completion budget truncation#85

Merged
wbxl2000 merged 6 commits into
mainfrom
qer/use-context-token-budget
May 26, 2026
Merged

fix: avoid local completion budget truncation#85
wbxl2000 merged 6 commits into
mainfrom
qer/use-context-token-budget

Conversation

@wbxl2000
Copy link
Copy Markdown
Collaborator

@wbxl2000 wbxl2000 commented May 26, 2026

Related Issue

No linked issue. This fixes overly small local completion token caps in long-context conversations.

Problem

The local completion budget clamp subtracted the estimated input size and a safety margin before setting max_completion_tokens. In near-limit paths, especially compaction, that could reduce the cap too aggressively and truncate model thinking before a summary was produced.

The provider backend already computes a safe request-specific value from the serialized prompt. Keeping a smaller local estimate adds risk without improving the final server-side safety check.

What changed

  • Changed completion budget calculation to use the explicit hard cap when configured, otherwise the model context window when known.
  • Removed inputTokenCount and safety-margin plumbing from ordinary turns, compaction, and the LLM adapter.
  • Updated completion budget, LLM adapter, and compaction tests to cover the new cap behavior.
  • Updated the changeset wording for the final behavior.

Verification

  • pnpm exec vitest run packages/agent-core/test/utils/completion-budget.test.ts packages/agent-core/test/agent/kosong-llm.test.ts packages/agent-core/test/agent/compaction.test.ts packages/agent-core/test/agent/context.test.ts
  • pnpm run typecheck
  • git diff --check

Checklist

  • I have read the CONTRIBUTING document.
  • I have linked a related issue, or explained the problem above.
  • I have added tests that prove my feature works.
  • Ran gen-changesets skill, or this PR needs no changeset.
  • Ran gen-docs skill, or this PR needs no doc update.

@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented May 26, 2026

🦋 Changeset detected

Latest commit: c93400d

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 2 packages
Name Type
@moonshot-ai/agent-core Patch
@moonshot-ai/kimi-code Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 77b64d91d6

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

capability: this.agent.config.modelCapabilities,
generate: this.agent.generate,
completionBudgetConfig,
inputTokenCount: () => this.agent.context.tokenCountWithPending,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Include prompt/tool tokens in completion-budget input count

TurnFlow now feeds inputTokenCount from context.tokenCountWithPending, but that counter only tracks context history and pending messages; it does not include the system prompt or tool schemas that KosongLLM.chatOnce still sends on every request. On the first step (before any provider usage has been recorded), this makes the completion cap too large by exactly those omitted tokens, so near-limit requests can exceed max_context_tokens and fail upstream instead of being safely clamped.

Useful? React with 👍 / 👎.

messages,
systemPrompt: this.agent.config.systemPrompt,
tools: this.agent.tools.loopTools,
inputTokenCount: this.tokenCountWithPending,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Clamp compaction budget against compaction input, not full context

generateCompactionResponse now passes this.tokenCountWithPending into applyCompletionBudget, but the compaction request actually uses only the compacted prefix plus the compaction instruction; when overall context is near the limit, this overestimates request input and can force max_completion_tokens down to 1 even though the compaction prompt has plenty of headroom. In that state, the compaction model can return an empty/invalid summary and repeatedly fail retries, blocking recovery exactly when compaction is needed.

Useful? React with 👍 / 👎.

@wbxl2000
Copy link
Copy Markdown
Collaborator Author

@codex

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3319815f51

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

messages,
systemPrompt: this.agent.config.systemPrompt,
tools: this.agent.tools.loopTools,
inputTokenCount: 0,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Clamp compaction cap against actual prompt tokens

Setting inputTokenCount to 0 here makes compaction always request roughly max_context_tokens - safetyMargin output tokens, even when the compaction prompt itself is already near the context limit. In that common near-full-context path, the generated request budget can exceed the model window by thousands of tokens (input + requested output), which causes provider-side invalid-request/context-overflow failures because withMaxCompletionTokens only forwards the cap and does not clamp it to the serialized prompt size. This regresses compaction reliability exactly when compaction is most needed.

Useful? React with 👍 / 👎.

capability: this.agent.config.modelCapabilities,
generate: this.agent.generate,
completionBudgetConfig,
inputTokenCount: () => this.agent.context.tokenCountWithPending,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Include system/tools in initial turn budget input count

This switches turn-time budgeting to context.tokenCountWithPending, which only tracks history-message tokens until provider usage has covered a step; it does not include system prompt or tool schema tokens in fresh/reset states. On the first request of a session (and immediately after compaction/clear), the cap is therefore computed from message tokens alone, so large system/tool payloads can make input + max_completion_tokens exceed max_context_tokens and trigger provider invalid_request_error near the context boundary. The previous path budgeted from the exact request components and did not undercount this way.

Useful? React with 👍 / 👎.

@wbxl2000
Copy link
Copy Markdown
Collaborator Author

@codex

@wbxl2000 wbxl2000 changed the title fix: use context token count for completion budget fix: avoid local completion budget truncation May 26, 2026
Copy link
Copy Markdown
Collaborator

@kermanx kermanx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c93400d490

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +58 to +60
const cap =
args.budget.hardCap ??
(maxCtx > 0 ? maxCtx : args.budget.fallback ?? DEFAULT_UNKNOWN_CONTEXT_FALLBACK);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Reintroduce remaining-context clamp for completion budget

computeCompletionBudgetCap now returns hardCap (or full max_context_tokens) without subtracting current input tokens, so near-full conversations can send max_completion_tokens values where prompt_tokens + max_completion_tokens exceeds the model window. In that case the provider can reject the request (e.g., invalid request/context overflow) before any summary is produced, which is a regression from the previous behavior that clamped to remaining context and is especially risky in compaction flows that run when context is already close to full.

Useful? React with 👍 / 👎.

// serialized prompt. Locally using the largest cap avoids cutting off
// thinking before the model produces a summary.
const cap =
args.budget.hardCap ??
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Bound configured hard cap by model context limit

computeCompletionBudgetCap now prefers budget.hardCap verbatim, so if KIMI_MODEL_MAX_COMPLETION_TOKENS/KIMI_MODEL_MAX_TOKENS is set above the model’s context window, every request can carry an impossible completion limit and fail with provider-side invalid-request/context-limit errors. Before this commit, explicit caps were still clamped by available context, so this change introduces hard failures for oversized but previously tolerable operator configuration.

Useful? React with 👍 / 👎.

// thinking before the model produces a summary.
const cap =
args.budget.hardCap ??
(maxCtx > 0 ? maxCtx : args.budget.fallback ?? DEFAULT_UNKNOWN_CONTEXT_FALLBACK);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid maxing completion cap and triggering TPM throttling

Setting the cap to full max_context_tokens for known models inflates max_completion_tokens on every request, even when the prompt is short. Kimi’s gateway rate-limit accounting uses prompt_tokens + max_completion_tokens (not actual generated tokens), so this change can sharply increase measured TPM/TPD and cause avoidable 429/quota errors in normal turns. The previous input-aware clamp kept this parameter closer to feasible remaining budget and avoided this artificial rate-limit pressure.

Useful? React with 👍 / 👎.

@wbxl2000 wbxl2000 merged commit 2bb50a3 into main May 26, 2026
6 checks passed
@wbxl2000 wbxl2000 deleted the qer/use-context-token-budget branch May 26, 2026 14:25
@github-actions github-actions Bot mentioned this pull request May 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants