Skip to content

vLLM support, more harness shortcuts and Readability Updates#37

Open
billxbf wants to merge 76 commits into
stablefrom
polar
Open

vLLM support, more harness shortcuts and Readability Updates#37
billxbf wants to merge 76 commits into
stablefrom
polar

Conversation

@billxbf
Copy link
Copy Markdown
Collaborator

@billxbf billxbf commented May 31, 2026

  • now supports vLLM / SGL dual inference backends
  • patching a few edge bugs in gateway proxy (eg. reasoning_content)
  • added harness shortcuts for openclaw and hermes
  • readability upgrades. refactored examples and documentations

billxbf and others added 30 commits March 18, 2026 09:03
Drop the sync POST /rollout/task endpoint and PolarRolloutWorker; both
the CLI and the Slime bridge now speak submit + poll only. Slime
evaluation routes through a one-shot async batch that shares the
training path's task-payload and adapter helpers.

Closes checklist item #1.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
GatewayNodeConfig.max_eval_prewarm_workers sizes a dedicated prewarm
worker pool and semaphore that no longer contend with max_run_workers.
RuntimeSpec.eval_prepare provides a separate ordered prepare list for
eval runtimes; when unset, prewarm falls back to ``prepare`` so existing
specs keep working unchanged.  ``max_eval_prewarm_workers`` defaults to
``max_run_workers`` to preserve current behavior.

Closes checklist item #2.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
PrefixMergingBuilder now emits traces split at the first `user` turn of
each chain: the prompt prefix is the tokenized [system*, first_user] from
the first completion, and response_ids covers every token after — all
intermediate assistant/tool turns through the final response.  Logprobs
are stitched from each completion in the chain; interstitial positions
are zero-filled so TIS corrects off-policy ratios trainer-side.  Falls
back to the response-only trace when a chain has no user turn or the
first completion's prompt extends past its first user message.
Drop the logprob_start_len=0 override — per-prompt-token logprobs are no
longer computed on every proxied chat completion.  Prompt token IDs are
now injected into meta_info by a new tokenizer_manager patch (sourced
from the already-tokenized request.input_ids) and read directly by
serving_chat, removing the per-request prompt-logprob prefill cost.

Also pin the supported SGLang version at 0.5.10 with an explicit guard
so mismatch failures are actionable.
Eliminate trainer-side polling at the rollout->trainer boundary.
AsyncPolarRolloutWorker binds a FastAPI listener on 127.0.0.1:<free_port>
and includes its callback_url in submit payloads; the rollout server
POSTs the terminal TaskResult once execute_task resolves. The worker
awaits a per-task asyncio.Event instead of polling every 2s, with a
defensive 60s fallback poll for dropped callbacks.
- gateway/node: drop redundant _dispatched_session_ids set (was leaking
  on the happy path; session_registry already rejects duplicates)
- rollout/balancer: make list_nodes/stats/get_node pure reads;
  compute heartbeat-derived health at read time without mutating the
  stored node record, and expose refresh_health() for callers that do
  want to tick stale state
- runtime/docker: skip the recursive session-dir chmod on start and
  stop when container and host UIDs match
- agent/harnesses/codex: move CODEX_HOME out from under the log dir
  so credentials are not clobbered by log rotation
- slime/rollout: cap output_queue at a few rollout batches instead of
  2000, and stop re-resolving PolarSlimeConfig in the training drain
  path (use the cached worker config)
- slime/adapter: document the lossy _messages_to_text flattening as a
  known limitation
- swegym example: add --use-tis next to --use-rollout-logprobs with a
  README note on why it is required
Copy link
Copy Markdown
Contributor

@skzhang1 skzhang1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a backend-agnostic inference layer in the gateway (SGLang + vLLM), updates topology/config/schema accordingly, and refactors both the training signal pipeline (logprobs/token-ids → trajectories) and the dashboard/examples to match. It also expands API transformers (reasoning/tooling/response_format) and adds/renames several agent harness “preset” shortcuts.

Changes:

  • Add inference-engine abstraction (sglang/vllm) with request/response normalization, and migrate topology from sglang.base_urlinference.{engine,base_url}.
  • Standardize training traces to store per-token logprobs as list[float] aligned with response_ids, with new cross-engine equivalence tests.
  • Update dashboard + platform APIs to show new inference/topology fields and add task-level averages (reward/traces/completions), plus refresh docs/examples/scripts.

Reviewed changes

Copilot reviewed 76 out of 85 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
web/src/routes/Dashboard.tsx Add avg reward + trace/completion columns
web/src/components/TopologyGraph.tsx Display engine + inference base URL
web/src/components/CompletionDiff.tsx Rename panels to engine-agnostic labels
web/src/api/types.ts Add mean_traces/mean_completions + inference fields
tests/trajectory/test_per_request_builder.py Update logprobs expectations to floats
tests/trajectory/test_engine_trajectory_equivalence.py New: pin SGLang vs vLLM trace equivalence
tests/slime_bridge/test_config.py Add topology template inference block test
tests/slime_bridge/test_adapter.py Update logprobs expectations to floats
tests/platform/test_api.py Update topology YAML to inference: block
tests/gateway/test_transform_openai_chat.py Add max_completion_tokens alias + role merge coverage
tests/gateway/test_transform_google.py Expand request/response mapping + reasoning/tool cases
tests/gateway/test_engine.py New: unit tests for engine strategies
tests/config/test_topology.py New inference block tests + reject legacy sglang:
src/slime_bridge/rollout.py Update admin pause/resume routes to /admin/inference/*
src/slime_bridge/README.md Update bridge docs for inference terminology/flow
src/slime_bridge/config.py Render inference: block in topology template
src/slime_bridge/adapter.py Remove response_ids-from-logprobs fallback; simplify logprob extraction
src/polar/trajectory/README.md Clarify builder/evaluator mental model + logprob alignment
src/polar/trajectory/models.py Change response_logprobs to list[float] + length validation
src/polar/trajectory/evaluator/README.md Expand evaluator docs and configs
src/polar/trajectory/builder/record_utils.py Extract float logprobs aligned with response_ids
src/polar/trajectory/builder/README.md Update builder docs + logprob alignment semantics
src/polar/trajectory/builder/prefix_merging.py Refactor grouping to token-prefix chains; float logprobs placeholders
src/polar/runtime/README.md Expand runtime backend documentation
src/polar/rollout/README.md Expand rollout service documentation
src/polar/rollout/manager.py Compute mean_traces/mean_completions in task list output
src/polar/platform/README.md Expand dashboard architecture docs
src/polar/platform/fs_index.py Add mean_traces/mean_completions in FS summaries
src/polar/platform/api/topology.py Return engine + inference_base_url
src/polar/gateway/transform/reasoning.py New: reasoning extraction/signature/encryption helpers
src/polar/gateway/transform/README.md Document canonical Chat format + engine responsibilities
src/polar/gateway/transform/openai_chat.py Alias max_completion_tokens → max_tokens; normalize request
src/polar/gateway/transform/images.py Drop invalid image detail; support Anthropic documents; skip Gemini thought parts
src/polar/gateway/transform/google.py Map more gen config fields; reasoning/tool/system support; response_format mapping
src/polar/gateway/transform/base.py Rename/expand request normalization; Qwen3.5 thinking-off handling
src/polar/gateway/transform/anthropic.py Add reasoning roundtrip, tool filtering, usage cache fields, stream thinking blocks
src/polar/gateway/server.py Swap SGLangClient → InferenceClient; rename admin routes
src/polar/gateway/README.md Update gateway architecture docs for inference engines
src/polar/gateway/proxy.py Add engine strategy integration + engine name in status
src/polar/gateway/engine.py New: InferenceEngine + SGLang/vLLM strategies
src/polar/config/topology.py Replace sglang config with strict inference block
src/polar/config/README.md Update topology schema documentation
src/polar/cli.py Show engine label + inference base URL in topology output
src/polar/agent/README.md Rework harness docs; add openclaw/hermes + preset guidance
src/polar/agent/presets/shell.py New: shell preset harness
src/polar/agent/presets/qwen_code.py New: Qwen Code preset harness
src/polar/agent/presets/pi.py New: pi agent preset harness
src/polar/agent/presets/openhands_sdk.py Update preset docstring
src/polar/agent/presets/opencode.py New: OpenCode preset harness
src/polar/agent/presets/openclaw.py New: OpenClaw preset harness
src/polar/agent/presets/hermes.py New: Hermes preset harness
src/polar/agent/presets/gemini_cli.py Document env var mapping comment
src/polar/agent/presets/codex.py New: Codex preset harness
src/polar/agent/presets/claude_code.py New: Claude Code preset harness
src/polar/agent/presets/init.py New: presets package exports
src/polar/agent/harnesses/init.py Remove harnesses package stub (presets replace it)
src/polar/agent/factory.py Point builtin harness map to presets
README.md Update install/docs for vLLM + inference server choices
examples/swegym_slime_grpo/topology.yaml Migrate to inference: block + env var base_url
examples/swegym_slime_grpo/run.sh Refactor: envsubst templates; GPU split knobs; Ray sizing cleanup
examples/swegym_slime_grpo/README.md Rewrite quickstart + dashboard instructions
examples/swegym_slime_grpo/polar_config.yaml Switch to envsubst placeholders for asset dirs
examples/swegym_slime_grpo/model_args.sh New: shared Megatron model args
examples/swegym_slime_grpo/launch_e2e.sh Prefer apptainer in PATH; require envsubst
examples/swegym_slime_grpo/convert_weights.sh Source shared model_args.sh
examples/swebench_verified/topology.yaml Switch served model + inference: {engine:vllm}
examples/swebench_verified/README.md Update to vLLM-based serving instructions
examples/count_stars/topology.yaml Switch to vLLM inference block
examples/count_stars/submit_count_stars_task.py Remove old one-off submit script
examples/count_stars/submit_all.py Remove old multi-submit script
examples/count_stars/run.py New consolidated runner script
examples/count_stars/README.md Update docs for vLLM-based quickstart
examples/calculator/topology.yaml Switch to vLLM inference block
examples/calculator/submit_calculator_task.py Remove old one-off submit script
examples/calculator/submit_all.py Remove old multi-submit script
examples/calculator/run.py New consolidated runner script
examples/calculator/README.md Update docs and add dashboard screenshots
.gitignore Ignore models/; adjust docs ignore entry

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 118 to 122
request_copy = request.copy()
request_copy.pop("stream", None)
request_copy["stream"] = False
request_copy = self.engine.prepare_request(request_copy)

Comment on lines +230 to +231
record_count = (traj.get("metadata") or {}).get("record_count")
completion_counts.append(int(record_count) if isinstance(record_count, int) else len(traces))
Copy link
Copy Markdown

@yidong72 yidong72 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review — vLLM support, reasoning round-trip, trace-format changes

Reviewed the substantive changes (gateway/engine, transforms, trajectory builder, slime_bridge, presets). Overview, strengths, then issues.

Overview

  • Backend-agnostic inference — new gateway/engine.py strategy (SGLangEngine/VLLMEngine) isolates per-backend request params + response canonicalization; SGLangClientInferenceClient, config sglang:inference: {engine, base_url}, admin routes /admin/sglang/*/admin/inference/*.
  • Reasoning round-trip — new transform/reasoning.py + thinking handling across all four transformers.
  • Trace simplificationresponse_logprobs list[dict]list[float]; prefix_merging grouping rewritten to pure token-prefix routing.
  • New hermes/openclaw presets, harnesses/presets/ rename, dashboard mean_traces/mean_completions.

Strengths

  • engine.py strategy boundary is exactly right — only the two genuine backend differences are overridden; HTTP/streaming/pause stays backend-agnostic. Docstrings explain why. Strong unit tests.
  • list[float] logprob change is a real win, and the new Trace validator enforces len(response_logprobs) == len(response_ids).
  • Extensive transform tests; nice defensive touches (images.py detail filtering, Anthropic server-tool dropping, empty-tools guarding).

Issues / questions

1. prefix_merging._find_extendable_chain — chains are no longer consumed (possible mis-merge). The old code popped a matched waiting chain, so each tip joined at most one successor. Now every chain_tips[idx] stays permanently matchable. If two completions share an identical prompt_ids (retries, n>1 sampling siblings, or parallel sub-agents momentarily sharing a prompt), both append to the same chain (each matches the tip; the tip doesn't advance past equality), linearizing siblings that should be separate traces. The longest-tip tiebreak handles genuine divergence but not the equal-prefix collision. Can we confirm the equivalence / per-request tests cover the identical-prompt-sibling case? If not, it's a silent training-data risk — I'd block on this.

2. base.py._merge_developer_role — two bundled behavior changes for all models. The developer→system merge previously ran only for Qwen; _normalize_request now calls it unconditionally. Also the guard changed from len(system_parts) > 1 to if system_parts:, so a single system message originally mid-conversation is now always hoisted to the front. Intended for non-Qwen harnesses too?

3. Removed logprob safety net in the slime adapter (adapter.py). _extract_rollout_log_probs previously raised when a trainable (loss_mask=1) token was missing its logprob; it now trusts the builder. The Trace validator enforces length but not that trainable positions carry a real (non-0.0) logprob. A builder bug would now flow silently into training as logprob=0.0. Consider keeping a cheap assert on trainable positions.

4. VLLMEngine.prepare_request mutates shared message dicts in place. InferenceClient.completion does a shallow request.copy(), so nested messages dicts are shared with the original request; message["reasoning"] = message.pop("reasoning_content") mutates them. If the original request is captured into the completion record, it shows the mutated key. Low severity — a copy of mutated messages would be safer.

5. Minor — _find_extendable_chain is O(chains × tokens) per completion (~O(N²·L)/session) vs the old ~O(1) keyed lookup. Almost certainly fine at realistic sizes; just noting it.

Verdict

Solid, well-documented PR; engine abstraction and logprob simplification are clean. #1 is the one I'd resolve before merge (silent training-data impact); #2/#3 want explicit confirmation; #4/#5 are minor.

(Review generated with Claude Code.)

Copy link
Copy Markdown

@yidong72 yidong72 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three inline findings from the review (see the summary comment for full context).

def _pop_compatible_chain(
*,
prompt_key: str,
def _find_extendable_chain(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue #1 — chains are no longer consumed (possible mis-merge).

The old _pop_compatible_chain popped a matched waiting chain, so each tip joined at most one successor. Here every chain_tips[idx] stays permanently matchable. If two completions share an identical prompt_ids (retries, n>1 sampling siblings, or parallel sub-agents momentarily sharing a prompt), both append to the same chain — each matches the tip, and the tip doesn't advance past equality — linearizing siblings that should be separate traces. The longest-tip tiebreak handles genuine divergence but not the equal-prefix collision.

Can we confirm test_engine_trajectory_equivalence.py / test_per_request_builder.py cover the identical-prompt-sibling case? If not, this is a silent training-data risk — I'd block on it.


if len(system_parts) > 1:
request["messages"] = [{"role": "system", "content": "\n\n".join(system_parts)}, *non_system]
if system_parts:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue #2 — two bundled behavior changes for all models.

_normalize_request now calls _merge_developer_role unconditionally; previously the developer→system merge ran only for Qwen. Also this guard changed from len(system_parts) > 1 to if system_parts:, so a single system message that originally appeared mid-conversation is now always hoisted to the front.

For typical leading-system prompts this is a no-op, but it changes message ordering for any harness that interleaves a system message. Is the universal merge + hoist intended for non-Qwen harnesses?

# response_logprobs is one float per response token (interstitials are 0.0,
# masked out by loss_mask); the builder guarantees trainable tokens carry
# their real sampled logprob.
return [float(value) for value in logprobs]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue #3 — removed logprob safety net.

This previously raised RolloutLogprobError when a trainable (loss_mask=1) token was missing its logprob; it now trusts the builder. The Trace validator enforces length but not that trainable positions carry a real (non-placeholder-0.0) logprob. A builder bug that drops a sampled logprob would now flow silently into training as logprob=0.0 instead of failing loudly. Consider keeping a cheap assert on trainable positions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants