vLLM support, more harness shortcuts and Readability Updates by billxbf · Pull Request #37 · NVIDIA-NeMo/ProRL-Agent-Server

billxbf · 2026-05-31T01:41:37Z

now supports vLLM / SGL dual inference backends
patching a few edge bugs in gateway proxy (eg. reasoning_content)
added harness shortcuts for openclaw and hermes
readability upgrades. refactored examples and documentations

Drop the sync POST /rollout/task endpoint and PolarRolloutWorker; both the CLI and the Slime bridge now speak submit + poll only. Slime evaluation routes through a one-shot async batch that shares the training path's task-payload and adapter helpers. Closes checklist item #1. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

GatewayNodeConfig.max_eval_prewarm_workers sizes a dedicated prewarm worker pool and semaphore that no longer contend with max_run_workers. RuntimeSpec.eval_prepare provides a separate ordered prepare list for eval runtimes; when unset, prewarm falls back to ``prepare`` so existing specs keep working unchanged. ``max_eval_prewarm_workers`` defaults to ``max_run_workers`` to preserve current behavior. Closes checklist item #2. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

PrefixMergingBuilder now emits traces split at the first `user` turn of each chain: the prompt prefix is the tokenized [system*, first_user] from the first completion, and response_ids covers every token after — all intermediate assistant/tool turns through the final response. Logprobs are stitched from each completion in the chain; interstitial positions are zero-filled so TIS corrects off-policy ratios trainer-side. Falls back to the response-only trace when a chain has no user turn or the first completion's prompt extends past its first user message.

Drop the logprob_start_len=0 override — per-prompt-token logprobs are no longer computed on every proxied chat completion. Prompt token IDs are now injected into meta_info by a new tokenizer_manager patch (sourced from the already-tokenized request.input_ids) and read directly by serving_chat, removing the per-request prompt-logprob prefill cost. Also pin the supported SGLang version at 0.5.10 with an explicit guard so mismatch failures are actionable.

Eliminate trainer-side polling at the rollout->trainer boundary. AsyncPolarRolloutWorker binds a FastAPI listener on 127.0.0.1:<free_port> and includes its callback_url in submit payloads; the rollout server POSTs the terminal TaskResult once execute_task resolves. The worker awaits a per-task asyncio.Event instead of polling every 2s, with a defensive 60s fallback poll for dropped callbacks.

- gateway/node: drop redundant _dispatched_session_ids set (was leaking on the happy path; session_registry already rejects duplicates) - rollout/balancer: make list_nodes/stats/get_node pure reads; compute heartbeat-derived health at read time without mutating the stored node record, and expose refresh_health() for callers that do want to tick stale state - runtime/docker: skip the recursive session-dir chmod on start and stop when container and host UIDs match - agent/harnesses/codex: move CODEX_HOME out from under the log dir so credentials are not clobbered by log rotation - slime/rollout: cap output_queue at a few rollout batches instead of 2000, and stop re-resolving PolarSlimeConfig in the training drain path (use the cached worker config) - slime/adapter: document the lossy _messages_to_text flattening as a known limitation - swegym example: add --use-tis next to --use-rollout-logprobs with a README note on why it is required

…o polar

skzhang1

LGTM

Copilot

Pull request overview

This PR introduces a backend-agnostic inference layer in the gateway (SGLang + vLLM), updates topology/config/schema accordingly, and refactors both the training signal pipeline (logprobs/token-ids → trajectories) and the dashboard/examples to match. It also expands API transformers (reasoning/tooling/response_format) and adds/renames several agent harness “preset” shortcuts.

Changes:

Add inference-engine abstraction (sglang/vllm) with request/response normalization, and migrate topology from sglang.base_url → inference.{engine,base_url}.
Standardize training traces to store per-token logprobs as list[float] aligned with response_ids, with new cross-engine equivalence tests.
Update dashboard + platform APIs to show new inference/topology fields and add task-level averages (reward/traces/completions), plus refresh docs/examples/scripts.

Reviewed changes

Copilot reviewed 76 out of 85 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
web/src/routes/Dashboard.tsx	Add avg reward + trace/completion columns
web/src/components/TopologyGraph.tsx	Display engine + inference base URL
web/src/components/CompletionDiff.tsx	Rename panels to engine-agnostic labels
web/src/api/types.ts	Add mean_traces/mean_completions + inference fields
tests/trajectory/test_per_request_builder.py	Update logprobs expectations to floats
tests/trajectory/test_engine_trajectory_equivalence.py	New: pin SGLang vs vLLM trace equivalence
tests/slime_bridge/test_config.py	Add topology template inference block test
tests/slime_bridge/test_adapter.py	Update logprobs expectations to floats
tests/platform/test_api.py	Update topology YAML to `inference:` block
tests/gateway/test_transform_openai_chat.py	Add max_completion_tokens alias + role merge coverage
tests/gateway/test_transform_google.py	Expand request/response mapping + reasoning/tool cases
tests/gateway/test_engine.py	New: unit tests for engine strategies
tests/config/test_topology.py	New inference block tests + reject legacy `sglang:`
src/slime_bridge/rollout.py	Update admin pause/resume routes to `/admin/inference/*`
src/slime_bridge/README.md	Update bridge docs for inference terminology/flow
src/slime_bridge/config.py	Render `inference:` block in topology template
src/slime_bridge/adapter.py	Remove response_ids-from-logprobs fallback; simplify logprob extraction
src/polar/trajectory/README.md	Clarify builder/evaluator mental model + logprob alignment
src/polar/trajectory/models.py	Change `response_logprobs` to `list[float]` + length validation
src/polar/trajectory/evaluator/README.md	Expand evaluator docs and configs
src/polar/trajectory/builder/record_utils.py	Extract float logprobs aligned with response_ids
src/polar/trajectory/builder/README.md	Update builder docs + logprob alignment semantics
src/polar/trajectory/builder/prefix_merging.py	Refactor grouping to token-prefix chains; float logprobs placeholders
src/polar/runtime/README.md	Expand runtime backend documentation
src/polar/rollout/README.md	Expand rollout service documentation
src/polar/rollout/manager.py	Compute mean_traces/mean_completions in task list output
src/polar/platform/README.md	Expand dashboard architecture docs
src/polar/platform/fs_index.py	Add mean_traces/mean_completions in FS summaries
src/polar/platform/api/topology.py	Return `engine` + `inference_base_url`
src/polar/gateway/transform/reasoning.py	New: reasoning extraction/signature/encryption helpers
src/polar/gateway/transform/README.md	Document canonical Chat format + engine responsibilities
src/polar/gateway/transform/openai_chat.py	Alias max_completion_tokens → max_tokens; normalize request
src/polar/gateway/transform/images.py	Drop invalid image detail; support Anthropic documents; skip Gemini thought parts
src/polar/gateway/transform/google.py	Map more gen config fields; reasoning/tool/system support; response_format mapping
src/polar/gateway/transform/base.py	Rename/expand request normalization; Qwen3.5 thinking-off handling
src/polar/gateway/transform/anthropic.py	Add reasoning roundtrip, tool filtering, usage cache fields, stream thinking blocks
src/polar/gateway/server.py	Swap SGLangClient → InferenceClient; rename admin routes
src/polar/gateway/README.md	Update gateway architecture docs for inference engines
src/polar/gateway/proxy.py	Add engine strategy integration + engine name in status
src/polar/gateway/engine.py	New: `InferenceEngine` + SGLang/vLLM strategies
src/polar/config/topology.py	Replace `sglang` config with strict `inference` block
src/polar/config/README.md	Update topology schema documentation
src/polar/cli.py	Show engine label + inference base URL in topology output
src/polar/agent/README.md	Rework harness docs; add openclaw/hermes + preset guidance
src/polar/agent/presets/shell.py	New: shell preset harness
src/polar/agent/presets/qwen_code.py	New: Qwen Code preset harness
src/polar/agent/presets/pi.py	New: pi agent preset harness
src/polar/agent/presets/openhands_sdk.py	Update preset docstring
src/polar/agent/presets/opencode.py	New: OpenCode preset harness
src/polar/agent/presets/openclaw.py	New: OpenClaw preset harness
src/polar/agent/presets/hermes.py	New: Hermes preset harness
src/polar/agent/presets/gemini_cli.py	Document env var mapping comment
src/polar/agent/presets/codex.py	New: Codex preset harness
src/polar/agent/presets/claude_code.py	New: Claude Code preset harness
src/polar/agent/presets/init.py	New: presets package exports
src/polar/agent/harnesses/init.py	Remove harnesses package stub (presets replace it)
src/polar/agent/factory.py	Point builtin harness map to presets
README.md	Update install/docs for vLLM + inference server choices
examples/swegym_slime_grpo/topology.yaml	Migrate to `inference:` block + env var base_url
examples/swegym_slime_grpo/run.sh	Refactor: envsubst templates; GPU split knobs; Ray sizing cleanup
examples/swegym_slime_grpo/README.md	Rewrite quickstart + dashboard instructions
examples/swegym_slime_grpo/polar_config.yaml	Switch to envsubst placeholders for asset dirs
examples/swegym_slime_grpo/model_args.sh	New: shared Megatron model args
examples/swegym_slime_grpo/launch_e2e.sh	Prefer apptainer in PATH; require envsubst
examples/swegym_slime_grpo/convert_weights.sh	Source shared model_args.sh
examples/swebench_verified/topology.yaml	Switch served model + `inference: {engine:vllm}`
examples/swebench_verified/README.md	Update to vLLM-based serving instructions
examples/count_stars/topology.yaml	Switch to vLLM inference block
examples/count_stars/submit_count_stars_task.py	Remove old one-off submit script
examples/count_stars/submit_all.py	Remove old multi-submit script
examples/count_stars/run.py	New consolidated runner script
examples/count_stars/README.md	Update docs for vLLM-based quickstart
examples/calculator/topology.yaml	Switch to vLLM inference block
examples/calculator/submit_calculator_task.py	Remove old one-off submit script
examples/calculator/submit_all.py	Remove old multi-submit script
examples/calculator/run.py	New consolidated runner script
examples/calculator/README.md	Update docs and add dashboard screenshots
.gitignore	Ignore `models/`; adjust docs ignore entry

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

        request_copy = request.copy()
        request_copy.pop("stream", None)
        request_copy["stream"] = False
+        request_copy = self.engine.prepare_request(request_copy)



+        record_count = (traj.get("metadata") or {}).get("record_count")
+        completion_counts.append(int(record_count) if isinstance(record_count, int) else len(traces))


yidong72

Code review — vLLM support, reasoning round-trip, trace-format changes

Reviewed the substantive changes (gateway/engine, transforms, trajectory builder, slime_bridge, presets). Overview, strengths, then issues.

Overview

Backend-agnostic inference — new gateway/engine.py strategy (SGLangEngine/VLLMEngine) isolates per-backend request params + response canonicalization; SGLangClient→InferenceClient, config sglang:→inference: {engine, base_url}, admin routes /admin/sglang/*→/admin/inference/*.
Reasoning round-trip — new transform/reasoning.py + thinking handling across all four transformers.
Trace simplification — response_logprobs list[dict]→list[float]; prefix_merging grouping rewritten to pure token-prefix routing.
New hermes/openclaw presets, harnesses/→presets/ rename, dashboard mean_traces/mean_completions.

Strengths

engine.py strategy boundary is exactly right — only the two genuine backend differences are overridden; HTTP/streaming/pause stays backend-agnostic. Docstrings explain why. Strong unit tests.
list[float] logprob change is a real win, and the new Trace validator enforces len(response_logprobs) == len(response_ids).
Extensive transform tests; nice defensive touches (images.py detail filtering, Anthropic server-tool dropping, empty-tools guarding).

Issues / questions

1. prefix_merging._find_extendable_chain — chains are no longer consumed (possible mis-merge). The old code popped a matched waiting chain, so each tip joined at most one successor. Now every chain_tips[idx] stays permanently matchable. If two completions share an identical prompt_ids (retries, n>1 sampling siblings, or parallel sub-agents momentarily sharing a prompt), both append to the same chain (each matches the tip; the tip doesn't advance past equality), linearizing siblings that should be separate traces. The longest-tip tiebreak handles genuine divergence but not the equal-prefix collision. Can we confirm the equivalence / per-request tests cover the identical-prompt-sibling case? If not, it's a silent training-data risk — I'd block on this.

2. base.py._merge_developer_role — two bundled behavior changes for all models. The developer→system merge previously ran only for Qwen; _normalize_request now calls it unconditionally. Also the guard changed from len(system_parts) > 1 to if system_parts:, so a single system message originally mid-conversation is now always hoisted to the front. Intended for non-Qwen harnesses too?

3. Removed logprob safety net in the slime adapter (adapter.py). _extract_rollout_log_probs previously raised when a trainable (loss_mask=1) token was missing its logprob; it now trusts the builder. The Trace validator enforces length but not that trainable positions carry a real (non-0.0) logprob. A builder bug would now flow silently into training as logprob=0.0. Consider keeping a cheap assert on trainable positions.

4. VLLMEngine.prepare_request mutates shared message dicts in place. InferenceClient.completion does a shallow request.copy(), so nested messages dicts are shared with the original request; message["reasoning"] = message.pop("reasoning_content") mutates them. If the original request is captured into the completion record, it shows the mutated key. Low severity — a copy of mutated messages would be safer.

5. Minor — _find_extendable_chain is O(chains × tokens) per completion (~O(N²·L)/session) vs the old ~O(1) keyed lookup. Almost certainly fine at realistic sizes; just noting it.

Verdict

Solid, well-documented PR; engine abstraction and logprob simplification are clean. #1 is the one I'd resolve before merge (silent training-data impact); #2/#3 want explicit confirmation; #4/#5 are minor.

(Review generated with Claude Code.)

yidong72

Three inline findings from the review (see the summary comment for full context).

yidong72 · 2026-05-31T03:12:07Z

-    def _pop_compatible_chain(
-        *,
-        prompt_key: str,
+    def _find_extendable_chain(


Issue #1 — chains are no longer consumed (possible mis-merge).

The old _pop_compatible_chain popped a matched waiting chain, so each tip joined at most one successor. Here every chain_tips[idx] stays permanently matchable. If two completions share an identical prompt_ids (retries, n>1 sampling siblings, or parallel sub-agents momentarily sharing a prompt), both append to the same chain — each matches the tip, and the tip doesn't advance past equality — linearizing siblings that should be separate traces. The longest-tip tiebreak handles genuine divergence but not the equal-prefix collision.

Can we confirm test_engine_trajectory_equivalence.py / test_per_request_builder.py cover the identical-prompt-sibling case? If not, this is a silent training-data risk — I'd block on it.

yidong72 · 2026-05-31T03:12:07Z


-        if len(system_parts) > 1:
-            request["messages"] = [{"role": "system", "content": "\n\n".join(system_parts)}, *non_system]
+        if system_parts:


Issue #2 — two bundled behavior changes for all models.

_normalize_request now calls _merge_developer_role unconditionally; previously the developer→system merge ran only for Qwen. Also this guard changed from len(system_parts) > 1 to if system_parts:, so a single system message that originally appeared mid-conversation is now always hoisted to the front.

For typical leading-system prompts this is a no-op, but it changes message ordering for any harness that interleaves a system message. Is the universal merge + hoist intended for non-Qwen harnesses?

yidong72 · 2026-05-31T03:12:07Z

+    # response_logprobs is one float per response token (interstitials are 0.0,
+    # masked out by loss_mask); the builder guarantees trainable tokens carry
+    # their real sampled logprob.
+    return [float(value) for value in logprobs]


Issue #3 — removed logprob safety net.

This previously raised RolloutLogprobError when a trainable (loss_mask=1) token was missing its logprob; it now trusts the builder. The Trace validator enforces length but not that trainable positions carry a real (non-placeholder-0.0) logprob. A builder bug that drops a sampled logprob would now flow silently into training as logprob=0.0 instead of failing loudly. Consider keeping a cheap assert on trainable positions.

billxbf and others added 30 commits March 18, 2026 09:03

init migrate

c6bde5d

nit

4ec8c9d

nit

3f1b513

nit

5b5147b

integration and runtime

8039701

bug fix

c856d70

packaging cleanup

4295ece

nit

d9df831

swebench, sglang+slime patches, runtime fix

aef25d6

nit

3414a56

simplify calculator example

dd2015e

add runtime timeout to prevent orphant processes

561ea17

nit

705f80c

pin harness versions for reproduction

c4e6408

cleanup tests

28af4a5

drop failed traces to slime

f3e5dc0

packaging cleanups

5a23b96

share group adv estimator

006edf0

nit

0b2b5ed

nit

470419f

nit

ac490fb

dynamic bs + tis, remove adv estimato

38dd467

expand instance

5818f87

preflight check

865cb8e

billxbf and others added 21 commits May 14, 2026 16:19

nit

bc21c80

prepare for refactor

118c0d4

add training curve

cea6a5d

update training curves

4945ec0

resolve conflict

f4bda0c

nit

f7d7741

VLM support and count_star example

14d859f

add polar dashboard

4203c7a

Merge branch 'stable' into polar

01830b1

nit

88bcad4

Merge branch 'polar' of github.com:NVIDIA-NeMo/ProRL-Agent-Server int…

c4f8607

…o polar

nit

692c8fa

Merge branch 'stable' into polar

92c2c75

fix reasoning content mapping

2f21e4d

patching transform edge cases

657e015

vllm dual inference

ce6c5f2

nit

4e132df

cleanup examples

0960180

clarify harness presets, add openclaw & hermes

5f7f370

update quickstart examples and docs, upgrade prefix merging robustness

5ab54a6

nit

08f6e5c

billxbf requested review from HaoZhang534, skzhang1 and yidong72 May 31, 2026 01:41

skzhang1 approved these changes May 31, 2026

View reviewed changes

yidong72 requested a review from Copilot May 31, 2026 02:38

Copilot started reviewing on behalf of yidong72 May 31, 2026 02:38 View session

Copilot AI reviewed May 31, 2026

View reviewed changes

yidong72 reviewed May 31, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vLLM support, more harness shortcuts and Readability Updates#37

vLLM support, more harness shortcuts and Readability Updates#37
billxbf wants to merge 76 commits into
stablefrom
polar

billxbf commented May 31, 2026

Uh oh!

skzhang1 left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

yidong72 left a comment

Uh oh!

yidong72 left a comment

Uh oh!

yidong72 May 31, 2026

Uh oh!

yidong72 May 31, 2026

Uh oh!

yidong72 May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

		record_count = (traj.get("metadata") or {}).get("record_count")
		completion_counts.append(int(record_count) if isinstance(record_count, int) else len(traces))

Conversation

billxbf commented May 31, 2026

Uh oh!

skzhang1 left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

yidong72 left a comment

Choose a reason for hiding this comment

Code review — vLLM support, reasoning round-trip, trace-format changes

Overview

Strengths

Issues / questions

Verdict

Uh oh!

yidong72 left a comment

Choose a reason for hiding this comment

Uh oh!

yidong72 May 31, 2026

Choose a reason for hiding this comment

Uh oh!

yidong72 May 31, 2026

Choose a reason for hiding this comment

Uh oh!

yidong72 May 31, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants