feat(subagent): add API key pool for parallel subagent execution#2369
feat(subagent): add API key pool for parallel subagent execution#2369Liewzheng wants to merge 12 commits into
Conversation
- Introduce APIKeyPool for round-robin key rotation across subagents - Add KeyPoolKimi wrapper to retry with fresh keys on rate limits - Inject subagent type into User-Agent for backend attribution - Add foreground concurrency limit (80% of key count or task slots) - Add configurable foreground timeout via KIMI_FOREGROUND_AGENT_TIMEOUT - Fix oauth.py assertion for KeyPoolKimi-wrapped providers
| api_key_override: str | None = None | ||
| if self._root_runtime.key_pool is not None: | ||
| api_key_override = self._root_runtime.key_pool.acquire() |
There was a problem hiding this comment.
🔴 SubagentBuilder injects KIMI pool keys into non-Kimi provider API calls
In SubagentBuilder.build_builtin_instance, when a key pool exists (runtime.key_pool is not None), a key is unconditionally acquired from the pool and passed as api_key_override regardless of the provider type. The pool only contains KIMI_API_KEY* values. If the subagent's effective model resolves to a non-Kimi provider (e.g., the user sets model to an OpenAI or Anthropic model alias in the Agent tool call), clone_llm_with_model_alias at src/kimi_cli/llm.py:417-425 passes this KIMI key to create_llm, where resolved_api_key = api_key_override is applied unconditionally at src/kimi_cli/llm.py:228. The resulting provider (OpenAI, Anthropic, etc.) would authenticate with an invalid KIMI API key, causing all API calls to fail with authentication errors.
Prompt for agents
The key pool acquisition in SubagentBuilder.build_builtin_instance should only happen when the effective provider is a Kimi provider. Currently, api_key_override is set from the pool unconditionally. The fix should check whether the provider that will be used for this subagent is of type 'kimi' before acquiring a key from the pool.
Approach: Before acquiring from the pool, determine the effective provider type. If effective_model is None, the subagent inherits the root runtime's provider — check self._root_runtime.llm.provider_config.type. If effective_model is not None, look up the model in config to get its provider type. Only acquire from the pool if the provider type is 'kimi'.
Relevant code paths:
- src/kimi_cli/subagents/builder.py:20-26 (key acquisition)
- src/kimi_cli/llm.py:228-240 (api_key_override applied unconditionally)
- src/kimi_cli/llm.py:252-253 (KeyPoolKimi wrapping only in kimi case)
Was this helpful? React with 👍 or 👎 to provide feedback.
There was a problem hiding this comment.
Fixed in 0546f6ce.
SubagentBuilder.build_builtin_instance now determines the effective provider type before acquiring from the key pool:
- If
effective_modelisNone, it checksruntime.llm.provider_config.type - If
effective_modelis set, it looks up the model inconfig.models, then resolves its provider fromconfig.providers, and readsprovider.type - Only when the resolved provider type is
"kimi"does it callkey_pool.acquire()and pass bothapi_key_overrideandkey_pooltoclone_llm_with_model_alias - For any non-kimi provider, both values stay
Noneand the subagent falls back to the provider's own API key
Added tests:
test_builder_skips_key_pool_for_non_kimi_provider— root runtime LLM hasopenai_legacyprovider → no key injectedtest_builder_skips_key_pool_when_effective_model_is_non_kimi—effective_modelresolves to a non-kimi model → no key injectedtest_builder_uses_key_pool_for_kimi_provider— kimi provider → key is acquired as before
| max_concurrent = _max_foreground_concurrency(self._runtime) | ||
| running = _count_running_foreground(self._runtime) | ||
| if running >= max_concurrent: | ||
| return ToolError( | ||
| message=( | ||
| f"Too many foreground subagents are already running " | ||
| f"({running}/{max_concurrent}). Please wait for one to finish " | ||
| f"before starting another." | ||
| ), | ||
| brief="Concurrency limit reached", | ||
| ) |
There was a problem hiding this comment.
🟡 TOCTOU race in foreground concurrency check allows exceeding the limit
The foreground concurrency check at src/kimi_cli/tools/agent/__init__.py:181-191 reads the count of running_foreground instances, but the subagent's status is not updated to running_foreground until inside runner.run(req) (after an await at line 204-205). When the LLM issues multiple concurrent Agent tool calls in a single step (dispatched via asyncio.gather), all coroutines can pass the concurrency check before any of them marks its status as running_foreground. For example, with a limit of 4, the model could issue 6 parallel Agent calls and all 6 would pass the check (seeing 0 running), exceeding the intended limit.
Prompt for agents
The concurrency check at lines 181-191 of src/kimi_cli/tools/agent/__init__.py suffers from a time-of-check-to-time-of-use (TOCTOU) race because the subagent status is only set to running_foreground inside runner.run(req), which is awaited after the check. Multiple concurrent Agent tool calls can all pass the check before any of them updates the store.
Possible fix: Eagerly create the subagent instance and set its status to running_foreground BEFORE the await, similar to how _run_in_background marks running_background synchronously before dispatching (lines 280-286). This would require creating the instance record in the foreground path as well, and rolling it back on failure. Alternatively, use a simple asyncio counter (an integer incremented before the await and decremented after) to track in-flight foreground subagents.
Was this helpful? React with 👍 or 👎 to provide feedback.
There was a problem hiding this comment.
Fixed in 0546f6ce.
The root cause was that _count_running_foreground only counted instances whose status was already "running_foreground", but that status was only set inside runner.run — after multiple await points. When the LLM issued concurrent Agent calls via asyncio.gather, every coroutine passed the check before any of them updated the store.
Fix: Eagerly create the instance and set its status to "running_foreground" before the await runner.run:
ForegroundSubagentRunner._prepare_instanceis now a synchronous public methodprepare_instanceAgentTool.__call__callsprepare_instanceandupdate_instance(status="running_foreground")right after the concurrency check, beforeawait runner.run(req, prepared)runner.runaccepts an optionalpreparedargument so it skips the internal prepare step
This mirrors the background-task pattern already used in _run_in_background, where the status is marked synchronously before dispatching the async work.
Added tests:
test_agent_tool_eagerly_sets_running_foreground_before_await— asserts that whenrunner.runis entered, the instance status is already"running_foreground"test_concurrent_agent_calls_respect_limit_after_toctou_fix— spawns a hanging foreground runner, yields control, then verifies the second concurrent Agent call is rejected by the limit
…U race in foreground concurrency check
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0546f6ce13
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if key_pool is not None: | ||
| chat_provider = KeyPoolKimi(chat_provider, key_pool) |
There was a problem hiding this comment.
Preserve Kimi provider identity in key-pool wrapper
Wrapping Kimi with KeyPoolKimi changes llm.chat_provider from a Kimi instance to a different type, which breaks downstream isinstance(..., Kimi) paths used outside the two patched callsites. For example, ReadMedia only uses Kimi's server-side video upload when that check passes (src/kimi_cli/tools/file/read_media.py), so keyed subagents now fall back to inlined data URLs for videos, which can bloat prompts and fail for larger media. Make the wrapper transparent to Kimi-specific checks (or update all Kimi type checks to unwrap) to avoid this regression.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 69deedd6.
Added KeyPoolKimi unwrapping in the two remaining isinstance(..., Kimi) checks that were missed in the initial PR:
src/kimi_cli/tools/file/read_media.py:121— Video upload now correctly uses the server-sidefiles.upload_videoAPI even when the provider is wrapped byKeyPoolKimisrc/kimi_cli/wire/server.py:625— Web UI User-Agent suffix injection now works correctly through the wrapper
Pattern matches the existing unwraps in llm.py (thinking_keep) and auth/oauth.py (token refresh).
| if runtime.key_pool is not None: | ||
| return max(1, int(runtime.key_pool.key_count * 0.8)) |
There was a problem hiding this comment.
Limit key-pool concurrency only for Kimi executions
_max_foreground_concurrency applies the key-pool cap whenever runtime.key_pool exists, regardless of the active subagent provider. Since runtime.key_pool is initialized from KIMI_API_KEY* env vars globally, non-Kimi sessions can be throttled incorrectly (e.g., OpenAI subagents being capped by unrelated Kimi key count). This should be conditioned on the effective provider for the pending foreground run, not just pool presence.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 69deedd6.
_max_foreground_concurrency now takes an optional provider_type parameter:
provider_type == "kimi"+key_poolpresent →max(1, int(key_pool.key_count * 0.8))- Any other provider type (or no key pool) →
max(1, int(max_running_tasks * 0.8))
AgentTool.__call__ resolves the effective provider type (from params.model if overridden, otherwise from runtime.llm.provider_config.type) and passes it into the concurrency check. Non-Kimi subagents are no longer throttled by an unrelated Kimi key count.
Updated tests:
test_key_pool_based_limit_for_kimi/test_key_pool_single_key_fallback_to_one_for_kimi— explicitprovider_type="kimi"test_key_pool_ignored_for_non_kimi— key pool exists but non-kimi provider usesmax_running_taskscap
| prepared = runner.prepare_instance(req) | ||
| agent_id = prepared.record.agent_id | ||
| store.update_instance(agent_id, status="running_foreground") |
There was a problem hiding this comment.
Handle foreground prepare failures inside AgentTool error path
runner.prepare_instance(req) is now executed before entering the try block, so validation failures (unknown subagent type, invalid resume id, resume of a running instance) bypass the tool's ToolError handling and surface as uncaught exceptions to the generic tool runtime wrapper. This regresses user-facing error quality and consistency compared with the prior path. Move preparation into the guarded block (or catch around it) so these expected failures return structured ToolErrors.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 69deedd6.
runner.prepare_instance(req) and the subsequent update_instance(status="running_foreground") are now back inside the try block, so validation failures (unknown subagent type, invalid resume id, resume of a running instance) are caught by the generic except Exception handler and converted to structured ToolErrors instead of surfacing as uncaught exceptions.
agent_id is pre-initialized to None before the try block; the except handlers only touch the store when agent_id is not None, ensuring a failed prepare_instance does not trigger a secondary NameError or UnboundLocalError.
The eager status update (TOCTOU fix from 0546f6ce) is preserved because prepare_instance + update_instance still execute synchronously before the await runner.run.
…l concurrency to kimi; guard prepare_instance failures
💡 Codex ReviewLines 237 to 239 in 69deedd This log line emits the first 16 characters of the active API key whenever a subagent LLM is created with kimi-cli/src/kimi_cli/tools/agent/__init__.py Line 194 in 69deedd The concurrency cap is derived from Line 21 in 69deedd This change rewrites locked package sources from ℹ️ About Codex in GitHubCodex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback". |
…ncy cap; restore uv.lock
💡 Remove API key prefix logging from LLM creationFixed in Removed the # Before
logger.info(
"Creating LLM with overridden API key {prefix}... for provider {provider_type}",
prefix=resolved_api_key[:16],
provider_type=provider.type,
)
# After
logger.info(
"Creating LLM with overridden API key for provider {provider_type}",
provider_type=provider.type,
)💡 Compute provider type from resume target before capping concurrencyFixed in
This ensures that resuming a non-Kimi subagent is not incorrectly throttled by the Kimi key-pool limit. 💡 Keep uv.lock registry URLs on the default PyPI indexFixed in Restored |
…imized # Conflicts: # CHANGELOG.md # docs/en/release-notes/changelog.md # docs/zh/release-notes/changelog.md
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7b026853e2
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if isinstance(exc.__cause__, asyncio.CancelledError): | ||
| logger.warning("Foreground agent timed out after {t}s", t=timeout) | ||
| if agent_id is not None: | ||
| store.update_instance(agent_id, status="idle") |
There was a problem hiding this comment.
Keep timed-out foreground agents marked as killed
When asyncio.wait_for times out, runner.run is cancelled and already records the instance as "killed", but this handler immediately overwrites it to "idle". That state mismatch makes interrupted runs look like normal idle instances in status views/metrics and can mislead follow-up automation that relies on subagent status semantics.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 4e8d6ac1.
Removed the store.update_instance(agent_id, status="idle") line from the TimeoutError handler in AgentTool.__call__.
When asyncio.wait_for cancels runner.run on timeout, the runner's internal except asyncio.CancelledError block already marks the instance as "killed" (and re-raises). Overwriting that to "idle" in the outer handler made interrupted runs appear as normal idle instances in status views. The instance now correctly stays "killed".
| logger.info( | ||
| "Subagent {agent_id} assigned API key {prefix}... (pool size {n})", | ||
| agent_id=agent_id, | ||
| prefix=api_key_override[:16], |
There was a problem hiding this comment.
Remove API key material from subagent assignment logs
This log line writes the first 16 characters of each assigned API key to plaintext logs. Even partial key disclosure is sensitive in shared CI/support log pipelines and creates avoidable credential leakage risk; key assignment should be logged without exposing any key bytes.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 4e8d6ac1.
Removed the prefix=api_key_override[:16] parameter from the logger.info call in SubagentBuilder.build_builtin_instance.
# Before
logger.info(
"Subagent {agent_id} assigned API key {prefix}... (pool size {n})",
agent_id=agent_id,
prefix=api_key_override[:16],
n=self._root_runtime.key_pool.key_count,
)
# After
logger.info(
"Subagent {agent_id} assigned API key from pool (pool size {n})",
agent_id=agent_id,
n=self._root_runtime.key_pool.key_count,
)There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4e8d6ac18e
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| @@ -119,7 +119,12 @@ async def _read_media(self, path: KaosPath, file_type: FileType) -> ToolReturnVa | |||
| case "video": | |||
| data = await path.read_bytes() | |||
| if (llm := self._runtime.llm) and isinstance(llm.chat_provider, Kimi): | |||
There was a problem hiding this comment.
Unwrap KeyPoolKimi before the Kimi video check
The current diff still checks llm.chat_provider for Kimi before the new unwrap runs, so when key pooling wraps the provider as KeyPoolKimi this branch is skipped and Kimi subagents fall back to embedding video data URLs instead of using files.upload_video. Fresh evidence after the earlier thread is that the unwrap was added inside the branch, but the outer guard remains on the wrapped object; move the unwrap before this isinstance check.
Useful? React with 👍 / 👎.
| async def generate(self, *args: Any, **kwargs: Any) -> Any: | ||
| return await self._provider.generate(*args, **kwargs) | ||
|
|
||
| def on_retryable_error(self, error: BaseException) -> bool: |
There was a problem hiding this comment.
Rotate pooled keys before retrying status errors
This new wrapper only rotates keys when on_retryable_error is called, but _run_with_connection_recovery currently invokes that hook only for APIConnectionError/APITimeoutError; 429/500/503 APIStatusErrors are retried by tenacity with the same client and same exhausted key. In the rate-limit scenario this feature targets, a subagent that receives HTTP 429 will keep retrying the same key even though the pool has other keys available, so status-error retries need to trigger the same rotation path.
Useful? React with 👍 / 👎.
- read_media.py: unwrap KeyPoolKimi BEFORE isinstance(Kimi) guard so video uploads correctly use server-side files.upload_video API. - llm.py: intercept APIStatusError (429/500/503) inside KeyPoolKimi.generate() to rotate the pooled key BEFORE re-raising for tenacity retry. - tests: add parametrized tests for retryable status codes and a negative test for non-retryable codes (400).
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1bddb1c3f7
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| # Enforce foreground concurrency limit (80% of system capacity). | ||
| # For resumed instances, derive the provider type from the stored launch spec | ||
| # so that non-Kimi resumed subagents are not capped by an unrelated key pool. | ||
| effective_model = params.model |
There was a problem hiding this comment.
Resolve default subagent model before applying Kimi cap
For a new foreground launch with no explicit model, this leaves effective_model as None, so _resolve_effective_provider_type falls back to the root provider even though prepare_instance later records req.model or type_def.default_model and SubagentBuilder.resolve_effective_model will run the subagent on its type default. In a Kimi-root session with a key pool, a subagent type whose default model is OpenAI/Anthropic is therefore treated as Kimi for the foreground key-pool cap and can be rejected at the much lower Kimi limit before it ever starts.
Useful? React with 👍 / 👎.
| effective_model = launch.model_override or launch.effective_model | ||
| provider_type = _resolve_effective_provider_type(self._runtime, effective_model) | ||
| max_concurrent = _max_foreground_concurrency(self._runtime, provider_type) | ||
| running = _count_running_foreground(self._runtime) |
There was a problem hiding this comment.
Count only matching-provider foreground runs for the Kimi cap
running counts every running_foreground instance, but max_concurrent may be the Kimi key-pool capacity for the pending request. When an OpenAI/default-model foreground subagent is already running and a Kimi subagent is launched with a small key pool (for example two keys gives a limit of one), the Kimi request is rejected as 1/1 even though no Kimi key is currently in use. The count needs to be scoped to the same effective provider/key-pool class as the cap being enforced.
Useful? React with 👍 / 👎.
… for kimi cap - Resolve default subagent model before applying Kimi cap: when no explicit model is given, look up the built-in type's default_model so that a kimi default correctly triggers the key-pool cap. - Count only matching-provider foreground runs for the Kimi cap: _count_running_foreground now optionally filters by provider_type so that openai/anthropic/etc. subagents do not consume the kimi key-pool quota. - Add tests for provider-filtered counting and default-model resolution.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 681ee768c4
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if timeout is not None: | ||
| return await asyncio.wait_for(runner.run(req), timeout=timeout) | ||
| return await runner.run(req) | ||
| return await asyncio.wait_for(runner.run(req, prepared), timeout=timeout) |
There was a problem hiding this comment.
Clear pre-run timeouts from running_foreground
When the foreground timeout fires while runner.run() is still in its pre-execution preparation (for example prepare_soul() awaiting explore-agent git context collection), the runner has not reached its except asyncio.CancelledError block that marks the instance killed. Because this code now marks the instance running_foreground before the wait_for, the timeout handler returns without changing that status, leaving the subagent counted as running and causing later Agent calls to hit the concurrency limit until restart. Mark the record killed/failed here when it is still running_foreground.
Useful? React with 👍 / 👎.
When asyncio.wait_for times out while runner.run() is still in its pre-execution preparation (e.g. prepare_soul awaiting explore-agent git context collection), the runner's except asyncio.CancelledError block that marks the instance 'killed' may not have run yet. Because we now eagerly mark the instance 'running_foreground' before wait_for, the TimeoutError handler must clean up the status when it is still 'running_foreground', otherwise the subagent remains counted as running and later Agent calls hit the concurrency limit. - In AgentTool.__call__, when TimeoutError.__cause__ is CancelledError, check if the instance is still 'running_foreground' and mark it 'killed'. - Add test_timeout_cleans_running_foreground_status to verify the cleanup.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a2b1f2c9a9
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| async def generate(self, *args: Any, **kwargs: Any) -> Any: | ||
| try: | ||
| return await self._provider.generate(*args, **kwargs) | ||
| except APIStatusError as exc: |
There was a problem hiding this comment.
Rotate keys for streaming status retries
When a retryable Kimi status error is raised while consuming the streaming response, this try has already returned the KimiStreamedMessage, so the wrapper never calls on_retryable_error; KimiStreamedMessage._convert_stream_response converts OpenAIError/httpx.HTTPError into APIStatusError during async for, and KimiSoul will retry 429/500/503/504 with the same exhausted client/key. Fresh evidence after the earlier status-retry fix is that only stream creation is covered here, not iteration, so pooled subagents can still hammer one key on streaming server/rate-limit failures.
Useful? React with 👍 / 👎.
Centralize key rotation for retryable APIStatusError (429/500/502/503/504) in KimiSoul._run_with_connection_recovery instead of KeyPoolKimi.generate. - KeyPoolKimi.generate no longer catches APIStatusError; this prevents double rotation when generate() fails and the error is also caught by the soul. - _run_with_connection_recovery now treats retryable status errors the same way as connection errors: calls chat_provider.on_retryable_error once per tenacity attempt, then re-enters with _status_retried=True. - This covers both initial generate() failures AND mid-stream failures during async iteration, which were previously missed by the generate() wrapper. Tests: - Update test_step_status_error to expect recovery_calls == 1. - Add test_step_mid_stream_status_error_triggers_recovery with a RetryableChatProvider that yields partial stream then errors, verifying on_retryable_error is called before the retry. - Remove obsolete KeyPoolKimi.generate APIStatusError tests.
Related Issue
Closes #2368
Description
1. API Key Pool for Parallel Subagent Execution
This PR introduces
APIKeyPool(src/kimi_cli/llm_key_pool.py), a round-robin API key allocator designed for parallel subagent execution.APIKeyPool.from_env("KIMI_API_KEY")collects keys fromKIMI_API_KEY,KIMI_API_KEY_1,KIMI_API_KEY_2, … up toKIMI_API_KEY_99. A pool is only created when ≥2 keys are found; otherwise it returnsNoneand subagents fall back to the root runtime's key.acquire()returns the next key in rotation, ensuring concurrent subagents are distributed across different API keys instead of hammering a single key's rate-limit quota.2. KeyPoolKimi Wrapper
Added
KeyPoolKimi(src/kimi_cli/llm.py), a thin wrapper around the KosongKimiprovider:generate,with_thinking,with_generation_kwargs,with_extra_body, etc.)on_retryable_error()swaps in the next key from the pool and recreates the HTTP clientproviderproperty for unwrapping when type checks are needed3. Subagent User-Agent Attribution
SubagentBuildernow injects a subagent-specificUser-Agentheader:This allows the Kimi backend to distinguish root agent calls from subagent calls for monitoring and attribution.
4. Foreground Concurrency Limit
Added
_max_foreground_concurrency()insrc/kimi_cli/tools/agent/__init__.py:max(1, int(key_count * 0.8))max(1, int(background.max_running_tasks * 0.8))ToolErrorinstead of blocking5. Configurable Foreground Timeout
Changed the default foreground subagent timeout from "no limit" to 300 seconds (5 minutes):
timeoutparameter still takes highest priorityKIMI_FOREGROUND_AGENT_TIMEOUTenvironment variable overrides the default (set to0to disable)6. OAuth Compatibility Fix
Fixed
oauth.py_apply_access_token()which had a hardisinstance(chat_provider, Kimi)assertion. WhenKeyPoolKimiwraps the provider, the assertion failed and crashed the subagent. The fix unwrapsKeyPoolKimibefore the type check:7.
KIMI_MODEL_THINKING_KEEPCompatibility FixThe
thinking_keepcheck also used a bareisinstance(chat_provider, Kimi)assertion. When key pooling is enabled, the provider is wrapped inKeyPoolKimi, so the check never matched and the extra body was never injected. Fixed by unwrapping before the type check.8.
clone_llm_with_model_aliasFixWhen
model_alias=Nonebutapi_key_override,key_pool, orextra_headerswere passed, the function incorrectly returned the original LLM unchanged. This meant subagents could not receive a distinct API key when no model override was requested. Fixed to recreate the LLM from storedprovider_config/model_configwhen override parameters are present.9. Logger UnboundLocalError Fix
Removed a local
from kimi_cli.utils.logging import loggerinsidecreate_llmthat shadowed the module-level import. Whenapi_key_overridewas set butoauthwasNone, the local import was skipped and the subsequentlogger.info(...)call raisedUnboundLocalError.10. Default HTTP Timeout for OpenAI Client
Added a default
httpx.Timeout(connect=10, read=300, write=30, pool=30)inopenai_common.pyto prevent indefinite API hangs when no timeout is configured.11. Subagent Live Visualization
Enhanced the shell UI to show subagent progress in real time:
Step 2 ⠋ 12s)_LiveViewhandlesTextPartandStepBeginevents from the subagent wire streamBreaking Changes
KIMI_FOREGROUND_AGENT_TIMEOUT=0or pass an explicittimeoutparameter.src/kimi_cli/llm.py,src/kimi_cli/llm_key_pool.py,src/kimi_cli/app.py,src/kimi_cli/soul/agent.pysrc/kimi_cli/subagents/builder.py,src/kimi_cli/tools/agent/__init__.pysrc/kimi_cli/auth/oauth.pysrc/kimi_cli/ui/shell/visualize/_blocks.py,src/kimi_cli/ui/shell/visualize/_live_view.pypackages/kosong/src/kosong/chat_provider/openai_common.pytests/core/test_llm_key_pool.py,tests/core/test_key_pool_kimi.py,tests/core/test_foreground_concurrency.py,tests/core/test_foreground_agent_timeout.py,tests/core/test_subagent_builder.py,tests/tools/test_tool_schemas.py,tests/core/test_default_agent.py,tests/ui/test_usage.pyCHANGELOG.md,docs/en/configuration/env-vars.md,docs/en/customization/agents.md,docs/en/release-notes/changelog.md,docs/zh/configuration/env-vars.md,docs/zh/customization/agents.md,docs/zh/release-notes/changelog.mdTesting
test_llm_key_pool.py: verifiesAPIKeyPool.from_envkey collection, round-robin ordering, and empty-pool behaviortest_key_pool_kimi.py: verifiesKeyPoolKimiattribute forwarding andon_retryable_errorkey rotationtest_foreground_concurrency.py: verifies concurrency cap enforcement andToolErrorrejection when limit is reachedtest_foreground_agent_timeout.py: verifies env var override, default 300s,0= no limit, and invalid env fallbacktest_subagent_builder.py: adjusted mock signature forclone_llm_with_model_aliastest_tool_schemas.py: refreshed inline snapshot for newtimeoutdefaulttest_default_agent.py: updated agent description assertiontest_usage.py: adjusted for rich 14.xgrey23empty-segment filteringChecklist
make gen-changelogto update the changelog.make gen-docsto update the user documentation.