Conversation
Drop the sync POST /rollout/task endpoint and PolarRolloutWorker; both the CLI and the Slime bridge now speak submit + poll only. Slime evaluation routes through a one-shot async batch that shares the training path's task-payload and adapter helpers. Closes checklist item #1. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
GatewayNodeConfig.max_eval_prewarm_workers sizes a dedicated prewarm worker pool and semaphore that no longer contend with max_run_workers. RuntimeSpec.eval_prepare provides a separate ordered prepare list for eval runtimes; when unset, prewarm falls back to ``prepare`` so existing specs keep working unchanged. ``max_eval_prewarm_workers`` defaults to ``max_run_workers`` to preserve current behavior. Closes checklist item #2. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
PrefixMergingBuilder now emits traces split at the first `user` turn of each chain: the prompt prefix is the tokenized [system*, first_user] from the first completion, and response_ids covers every token after — all intermediate assistant/tool turns through the final response. Logprobs are stitched from each completion in the chain; interstitial positions are zero-filled so TIS corrects off-policy ratios trainer-side. Falls back to the response-only trace when a chain has no user turn or the first completion's prompt extends past its first user message.
Drop the logprob_start_len=0 override — per-prompt-token logprobs are no longer computed on every proxied chat completion. Prompt token IDs are now injected into meta_info by a new tokenizer_manager patch (sourced from the already-tokenized request.input_ids) and read directly by serving_chat, removing the per-request prompt-logprob prefill cost. Also pin the supported SGLang version at 0.5.10 with an explicit guard so mismatch failures are actionable.
Eliminate trainer-side polling at the rollout->trainer boundary. AsyncPolarRolloutWorker binds a FastAPI listener on 127.0.0.1:<free_port> and includes its callback_url in submit payloads; the rollout server POSTs the terminal TaskResult once execute_task resolves. The worker awaits a per-task asyncio.Event instead of polling every 2s, with a defensive 60s fallback poll for dropped callbacks.
- gateway/node: drop redundant _dispatched_session_ids set (was leaking on the happy path; session_registry already rejects duplicates) - rollout/balancer: make list_nodes/stats/get_node pure reads; compute heartbeat-derived health at read time without mutating the stored node record, and expose refresh_health() for callers that do want to tick stale state - runtime/docker: skip the recursive session-dir chmod on start and stop when container and host UIDs match - agent/harnesses/codex: move CODEX_HOME out from under the log dir so credentials are not clobbered by log rotation - slime/rollout: cap output_queue at a few rollout batches instead of 2000, and stop re-resolving PolarSlimeConfig in the training drain path (use the cached worker config) - slime/adapter: document the lossy _messages_to_text flattening as a known limitation - swegym example: add --use-tis next to --use-rollout-logprobs with a README note on why it is required
There was a problem hiding this comment.
Pull request overview
This PR introduces a backend-agnostic inference layer in the gateway (SGLang + vLLM), updates topology/config/schema accordingly, and refactors both the training signal pipeline (logprobs/token-ids → trajectories) and the dashboard/examples to match. It also expands API transformers (reasoning/tooling/response_format) and adds/renames several agent harness “preset” shortcuts.
Changes:
- Add inference-engine abstraction (
sglang/vllm) with request/response normalization, and migrate topology fromsglang.base_url→inference.{engine,base_url}. - Standardize training traces to store per-token logprobs as
list[float]aligned withresponse_ids, with new cross-engine equivalence tests. - Update dashboard + platform APIs to show new inference/topology fields and add task-level averages (reward/traces/completions), plus refresh docs/examples/scripts.
Reviewed changes
Copilot reviewed 76 out of 85 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| web/src/routes/Dashboard.tsx | Add avg reward + trace/completion columns |
| web/src/components/TopologyGraph.tsx | Display engine + inference base URL |
| web/src/components/CompletionDiff.tsx | Rename panels to engine-agnostic labels |
| web/src/api/types.ts | Add mean_traces/mean_completions + inference fields |
| tests/trajectory/test_per_request_builder.py | Update logprobs expectations to floats |
| tests/trajectory/test_engine_trajectory_equivalence.py | New: pin SGLang vs vLLM trace equivalence |
| tests/slime_bridge/test_config.py | Add topology template inference block test |
| tests/slime_bridge/test_adapter.py | Update logprobs expectations to floats |
| tests/platform/test_api.py | Update topology YAML to inference: block |
| tests/gateway/test_transform_openai_chat.py | Add max_completion_tokens alias + role merge coverage |
| tests/gateway/test_transform_google.py | Expand request/response mapping + reasoning/tool cases |
| tests/gateway/test_engine.py | New: unit tests for engine strategies |
| tests/config/test_topology.py | New inference block tests + reject legacy sglang: |
| src/slime_bridge/rollout.py | Update admin pause/resume routes to /admin/inference/* |
| src/slime_bridge/README.md | Update bridge docs for inference terminology/flow |
| src/slime_bridge/config.py | Render inference: block in topology template |
| src/slime_bridge/adapter.py | Remove response_ids-from-logprobs fallback; simplify logprob extraction |
| src/polar/trajectory/README.md | Clarify builder/evaluator mental model + logprob alignment |
| src/polar/trajectory/models.py | Change response_logprobs to list[float] + length validation |
| src/polar/trajectory/evaluator/README.md | Expand evaluator docs and configs |
| src/polar/trajectory/builder/record_utils.py | Extract float logprobs aligned with response_ids |
| src/polar/trajectory/builder/README.md | Update builder docs + logprob alignment semantics |
| src/polar/trajectory/builder/prefix_merging.py | Refactor grouping to token-prefix chains; float logprobs placeholders |
| src/polar/runtime/README.md | Expand runtime backend documentation |
| src/polar/rollout/README.md | Expand rollout service documentation |
| src/polar/rollout/manager.py | Compute mean_traces/mean_completions in task list output |
| src/polar/platform/README.md | Expand dashboard architecture docs |
| src/polar/platform/fs_index.py | Add mean_traces/mean_completions in FS summaries |
| src/polar/platform/api/topology.py | Return engine + inference_base_url |
| src/polar/gateway/transform/reasoning.py | New: reasoning extraction/signature/encryption helpers |
| src/polar/gateway/transform/README.md | Document canonical Chat format + engine responsibilities |
| src/polar/gateway/transform/openai_chat.py | Alias max_completion_tokens → max_tokens; normalize request |
| src/polar/gateway/transform/images.py | Drop invalid image detail; support Anthropic documents; skip Gemini thought parts |
| src/polar/gateway/transform/google.py | Map more gen config fields; reasoning/tool/system support; response_format mapping |
| src/polar/gateway/transform/base.py | Rename/expand request normalization; Qwen3.5 thinking-off handling |
| src/polar/gateway/transform/anthropic.py | Add reasoning roundtrip, tool filtering, usage cache fields, stream thinking blocks |
| src/polar/gateway/server.py | Swap SGLangClient → InferenceClient; rename admin routes |
| src/polar/gateway/README.md | Update gateway architecture docs for inference engines |
| src/polar/gateway/proxy.py | Add engine strategy integration + engine name in status |
| src/polar/gateway/engine.py | New: InferenceEngine + SGLang/vLLM strategies |
| src/polar/config/topology.py | Replace sglang config with strict inference block |
| src/polar/config/README.md | Update topology schema documentation |
| src/polar/cli.py | Show engine label + inference base URL in topology output |
| src/polar/agent/README.md | Rework harness docs; add openclaw/hermes + preset guidance |
| src/polar/agent/presets/shell.py | New: shell preset harness |
| src/polar/agent/presets/qwen_code.py | New: Qwen Code preset harness |
| src/polar/agent/presets/pi.py | New: pi agent preset harness |
| src/polar/agent/presets/openhands_sdk.py | Update preset docstring |
| src/polar/agent/presets/opencode.py | New: OpenCode preset harness |
| src/polar/agent/presets/openclaw.py | New: OpenClaw preset harness |
| src/polar/agent/presets/hermes.py | New: Hermes preset harness |
| src/polar/agent/presets/gemini_cli.py | Document env var mapping comment |
| src/polar/agent/presets/codex.py | New: Codex preset harness |
| src/polar/agent/presets/claude_code.py | New: Claude Code preset harness |
| src/polar/agent/presets/init.py | New: presets package exports |
| src/polar/agent/harnesses/init.py | Remove harnesses package stub (presets replace it) |
| src/polar/agent/factory.py | Point builtin harness map to presets |
| README.md | Update install/docs for vLLM + inference server choices |
| examples/swegym_slime_grpo/topology.yaml | Migrate to inference: block + env var base_url |
| examples/swegym_slime_grpo/run.sh | Refactor: envsubst templates; GPU split knobs; Ray sizing cleanup |
| examples/swegym_slime_grpo/README.md | Rewrite quickstart + dashboard instructions |
| examples/swegym_slime_grpo/polar_config.yaml | Switch to envsubst placeholders for asset dirs |
| examples/swegym_slime_grpo/model_args.sh | New: shared Megatron model args |
| examples/swegym_slime_grpo/launch_e2e.sh | Prefer apptainer in PATH; require envsubst |
| examples/swegym_slime_grpo/convert_weights.sh | Source shared model_args.sh |
| examples/swebench_verified/topology.yaml | Switch served model + inference: {engine:vllm} |
| examples/swebench_verified/README.md | Update to vLLM-based serving instructions |
| examples/count_stars/topology.yaml | Switch to vLLM inference block |
| examples/count_stars/submit_count_stars_task.py | Remove old one-off submit script |
| examples/count_stars/submit_all.py | Remove old multi-submit script |
| examples/count_stars/run.py | New consolidated runner script |
| examples/count_stars/README.md | Update docs for vLLM-based quickstart |
| examples/calculator/topology.yaml | Switch to vLLM inference block |
| examples/calculator/submit_calculator_task.py | Remove old one-off submit script |
| examples/calculator/submit_all.py | Remove old multi-submit script |
| examples/calculator/run.py | New consolidated runner script |
| examples/calculator/README.md | Update docs and add dashboard screenshots |
| .gitignore | Ignore models/; adjust docs ignore entry |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| request_copy = request.copy() | ||
| request_copy.pop("stream", None) | ||
| request_copy["stream"] = False | ||
| request_copy = self.engine.prepare_request(request_copy) | ||
|
|
| record_count = (traj.get("metadata") or {}).get("record_count") | ||
| completion_counts.append(int(record_count) if isinstance(record_count, int) else len(traces)) |
yidong72
left a comment
There was a problem hiding this comment.
Code review — vLLM support, reasoning round-trip, trace-format changes
Reviewed the substantive changes (gateway/engine, transforms, trajectory builder, slime_bridge, presets). Overview, strengths, then issues.
Overview
- Backend-agnostic inference — new
gateway/engine.pystrategy (SGLangEngine/VLLMEngine) isolates per-backend request params + response canonicalization;SGLangClient→InferenceClient, configsglang:→inference: {engine, base_url}, admin routes/admin/sglang/*→/admin/inference/*. - Reasoning round-trip — new
transform/reasoning.py+ thinking handling across all four transformers. - Trace simplification —
response_logprobslist[dict]→list[float];prefix_merginggrouping rewritten to pure token-prefix routing. - New
hermes/openclawpresets,harnesses/→presets/rename, dashboardmean_traces/mean_completions.
Strengths
engine.pystrategy boundary is exactly right — only the two genuine backend differences are overridden; HTTP/streaming/pause stays backend-agnostic. Docstrings explain why. Strong unit tests.list[float]logprob change is a real win, and the newTracevalidator enforceslen(response_logprobs) == len(response_ids).- Extensive transform tests; nice defensive touches (
images.pydetail filtering, Anthropic server-tool dropping, empty-toolsguarding).
Issues / questions
1. prefix_merging._find_extendable_chain — chains are no longer consumed (possible mis-merge). The old code popped a matched waiting chain, so each tip joined at most one successor. Now every chain_tips[idx] stays permanently matchable. If two completions share an identical prompt_ids (retries, n>1 sampling siblings, or parallel sub-agents momentarily sharing a prompt), both append to the same chain (each matches the tip; the tip doesn't advance past equality), linearizing siblings that should be separate traces. The longest-tip tiebreak handles genuine divergence but not the equal-prefix collision. Can we confirm the equivalence / per-request tests cover the identical-prompt-sibling case? If not, it's a silent training-data risk — I'd block on this.
2. base.py._merge_developer_role — two bundled behavior changes for all models. The developer→system merge previously ran only for Qwen; _normalize_request now calls it unconditionally. Also the guard changed from len(system_parts) > 1 to if system_parts:, so a single system message originally mid-conversation is now always hoisted to the front. Intended for non-Qwen harnesses too?
3. Removed logprob safety net in the slime adapter (adapter.py). _extract_rollout_log_probs previously raised when a trainable (loss_mask=1) token was missing its logprob; it now trusts the builder. The Trace validator enforces length but not that trainable positions carry a real (non-0.0) logprob. A builder bug would now flow silently into training as logprob=0.0. Consider keeping a cheap assert on trainable positions.
4. VLLMEngine.prepare_request mutates shared message dicts in place. InferenceClient.completion does a shallow request.copy(), so nested messages dicts are shared with the original request; message["reasoning"] = message.pop("reasoning_content") mutates them. If the original request is captured into the completion record, it shows the mutated key. Low severity — a copy of mutated messages would be safer.
5. Minor — _find_extendable_chain is O(chains × tokens) per completion (~O(N²·L)/session) vs the old ~O(1) keyed lookup. Almost certainly fine at realistic sizes; just noting it.
Verdict
Solid, well-documented PR; engine abstraction and logprob simplification are clean. #1 is the one I'd resolve before merge (silent training-data impact); #2/#3 want explicit confirmation; #4/#5 are minor.
(Review generated with Claude Code.)
yidong72
left a comment
There was a problem hiding this comment.
Three inline findings from the review (see the summary comment for full context).
| def _pop_compatible_chain( | ||
| *, | ||
| prompt_key: str, | ||
| def _find_extendable_chain( |
There was a problem hiding this comment.
Issue #1 — chains are no longer consumed (possible mis-merge).
The old _pop_compatible_chain popped a matched waiting chain, so each tip joined at most one successor. Here every chain_tips[idx] stays permanently matchable. If two completions share an identical prompt_ids (retries, n>1 sampling siblings, or parallel sub-agents momentarily sharing a prompt), both append to the same chain — each matches the tip, and the tip doesn't advance past equality — linearizing siblings that should be separate traces. The longest-tip tiebreak handles genuine divergence but not the equal-prefix collision.
Can we confirm test_engine_trajectory_equivalence.py / test_per_request_builder.py cover the identical-prompt-sibling case? If not, this is a silent training-data risk — I'd block on it.
|
|
||
| if len(system_parts) > 1: | ||
| request["messages"] = [{"role": "system", "content": "\n\n".join(system_parts)}, *non_system] | ||
| if system_parts: |
There was a problem hiding this comment.
Issue #2 — two bundled behavior changes for all models.
_normalize_request now calls _merge_developer_role unconditionally; previously the developer→system merge ran only for Qwen. Also this guard changed from len(system_parts) > 1 to if system_parts:, so a single system message that originally appeared mid-conversation is now always hoisted to the front.
For typical leading-system prompts this is a no-op, but it changes message ordering for any harness that interleaves a system message. Is the universal merge + hoist intended for non-Qwen harnesses?
| # response_logprobs is one float per response token (interstitials are 0.0, | ||
| # masked out by loss_mask); the builder guarantees trainable tokens carry | ||
| # their real sampled logprob. | ||
| return [float(value) for value in logprobs] |
There was a problem hiding this comment.
Issue #3 — removed logprob safety net.
This previously raised RolloutLogprobError when a trainable (loss_mask=1) token was missing its logprob; it now trusts the builder. The Trace validator enforces length but not that trainable positions carry a real (non-placeholder-0.0) logprob. A builder bug that drops a sampled logprob would now flow silently into training as logprob=0.0 instead of failing loudly. Consider keeping a cheap assert on trainable positions.
openclawandhermes