Open
109 commits
9744713
feat: add NPU device_memory_used and vllm support
UsernameFull Jan 28, 2026
3077bef
(feat): publish roll v0.2.0.
PanAndy Feb 3, 2026
c8f8029
(chore): append commiter for v0.2.0.
PanAndy Feb 5, 2026
f41a8f1
(chore): append commiter for v0.2.0.
PanAndy Feb 5, 2026
f3f13dc
Remove upload_to_mos call after checkpoint save
chocoded Feb 7, 2026
4a0ce56
fix: correct typo in async_parallel_rollout.md
WeiyaoLuo Feb 9, 2026
777dad6
feat: add katex to docs markdown
kkkky123 Feb 9, 2026
4a49bab
(docs): update readme.
PanAndy Feb 10, 2026
afc4d30
(fix): use default_factory for mutable SequencePackingConfig field.
hydrozhao Feb 12, 2026
ae69fd8
(fix): fix train_infer_is_weight KeyError for rlvr_vlm_pipeline and …
guoshengCS Feb 6, 2026
c70c473
(fix): handle is_last_step in DeepSpeedTrainStrategy.save_checkpoint
XucSh Feb 28, 2026
ce4e3a2
fix: address resource leaks and code quality issues
hobostay Feb 9, 2026
bec2a4b
(fix): set vllm VLLM_USE_FLASHINFER_SAMPLER=0 for torch 280.
PanAndy Feb 10, 2026
81e9c5c
(fix): set sglang port range to avoid conflicting.
HuangJoJo Feb 9, 2026
1054785
(fix): fix sglang multi-nodes fail when worker num > 1.
emiedon Feb 4, 2026
c72d283
(fix): optimize port allocation logic with atomic operation.
Feb 4, 2026
2e783fe
(chore): fix qwen3-vl-32B 80GB config.
HuangJoJo Feb 5, 2026
d6dad8f
(fix): hardcode default async concurrency limit to 1000 to remove dep…
hydrozhao Feb 26, 2026
526f7b5
(fix): fix reward metrics expo.
PanAndy Feb 26, 2026
53c6da3
(fix): fix batch num tokens.
PanAndy Mar 2, 2026
3b0398a
(fix): fix vllm process weights.
PanAndy Mar 3, 2026
cda8262
(fix): fix func download get_node_ip.
PanAndy Mar 4, 2026
ae0a39b
(fix): fix sglang process weights.
hydrozhao Mar 5, 2026
ca8e9e5
(fix): Make offload states configurable and Fix batch size setting in…
Schnabel-8 Feb 10, 2026
36d0064
(feat): support vllm 0.15.1.
hydrozhao Feb 11, 2026
c356bb9
(fix): FSDP2 DCP Saving when CPU Offload.
Feb 26, 2026
c601cd1
(feat): support sglang-router.
hydrozhao Feb 28, 2026
9087c02
(feat): add Dockerfile for torch2.10.0, support vllm 0.16.dev.
hydrozhao Mar 3, 2026
85100e8
(fix): pyarrow>15.0.0 jemalloc coredump, add torch2.10.0 deps, fix ro…
HuangJoJo Mar 3, 2026
f33540c
(feat): update mcore adapter.
chocoded Mar 3, 2026
4449a31
(feat): support training for qwen3.5-27B.
xuehuanran Mar 4, 2026
a35fbce
(fix): refactor sharded state dict metadata handling and integrate in…
chocoded Mar 5, 2026
b4facdd
(chore): move EnvAffinityRouter and PartialGPUManager to router.py.
hydrozhao Mar 5, 2026
5dfecbb
(fix): gracefully shutdown of Router.
hydrozhao Mar 5, 2026
31c99c9
(chore): release docker image for torch2.10.0.
hydrozhao Mar 5, 2026
16b3ca8
(feat): add example config for qwen3_5_35ba3.
xuehuanran Mar 5, 2026
50c8954
(fix): correct parameter name when constructing reward cluster.
hydrozhao Mar 5, 2026
82436a5
(feat): support onpolicy distillation.
Schnabel-8 Mar 6, 2026
5b488cc
(fix): fix version compare of torch for pg_options_param_name.
hydrozhao Mar 6, 2026
16d2113
(fix): separated the system role check from the skip_mock_system_prom…
hydrozhao Mar 6, 2026
0257ca3
(fix): prevent sync generate request execution during shutdown.
hydrozhao Mar 6, 2026
921dc28
(docs): update readme.
PanAndy Mar 6, 2026
d2dcd86
(fix): FSDP2 Model Initialization & Casting.
Mar 6, 2026
b63b3a4
fix bugs in strategy config and opd config
Schnabel-8 Mar 6, 2026
2eba7c3
(fix): add context parallel loss reduction in trainer.
chocoded Mar 9, 2026
5cd926f
fix: add sft support on npu
UsernameFull Feb 4, 2026
d133109
feat: add npu mindspeed
jiaqiw09 Feb 6, 2026
7f56229
feat: add NPU (Ascend) support for FSDP2, vLLM, model update, and pla…
UsernameFull Feb 10, 2026
9640566
Revert "adapt mindspeed"
UsernameFull Feb 11, 2026
91085a8
fix: rng_state on npu
UsernameFull Feb 11, 2026
079bd47
fix: DeepSpeedEngine.load_checkpoint method doesn't take an is_last_…
UsernameFull Mar 3, 2026
562c25a
platform add empty_cache get_rng_state set_rng_state
UsernameFull Mar 5, 2026
3210f2b
fix: support _set_allocator_settings in NPU
UsernameFull Mar 9, 2026
bd17364
feat: add reward model cluster mode for LLM-as-judge in RLVR pipeline
tanzelin430 Mar 10, 2026
53eddee
Add notable works section to README
taoluo Mar 16, 2026
2d6ab4f
Enhance model saving with PEFT support
chocoded Mar 17, 2026
13a5027
Import is_peft_available from transformers.utils
chocoded Mar 17, 2026
0ce37ff
Rename linear_attention_type to experimental_attention_variant
chocoded Mar 17, 2026
a26d90a
Rename linear_attention_type to experimental_attention_variant
chocoded Mar 17, 2026
f16628f
Rename linear_attention_type to experimental_attention_variant
chocoded Mar 17, 2026
2fd2606
feat: support rock native env and provide demo to run agent rollout a…
jingyushen Mar 17, 2026
c0f40a3
docs: Add Huawei Ascend hardware support doc
UsernameFull Mar 13, 2026
9de2784
fix: fix rlvr metrics update
UsernameFull Mar 17, 2026
9c6ce5c
fix: MultipleChoiceBoxedRuleRewardWorker returns a zero reward
luyouqi233 Mar 19, 2026
cb617db
Revise RLix description for clarity and detail
taoluo Mar 20, 2026
4fd4147
Add Qwen3.5 ROCK agentic SWE example
shamanez Mar 23, 2026
6190d06
minor comment.
shamanez Mar 24, 2026
52e0978
fix: disable reward normalization for SWE configs with group_size=1
shamanez Mar 24, 2026
345edea
Update config name in run_onpolicy_distill_pipeline.sh
joeyzyz Mar 24, 2026
f509efc
add reference to notable work
pUmpKin-Co Mar 26, 2026
53fce5a
added OpenRewards
Mar 25, 2026
bc9af12
added the openreward support.
Mar 25, 2026
6e1a5df
revert agent_native_env_manager.py to upstream version
shamanez Mar 25, 2026
b39681b
remove IPA config yaml not needed for OpenReward integration
shamanez Mar 25, 2026
a49a915
fix: use Cluster instead of WorkerConfig for dynamic batching dp_size
dubin555 Mar 14, 2026
7526682
add initial trackio integration for roll.
ParagEkbote Mar 26, 2026
3432271
(feat): tensorboard log in new executor.
PanAndy Apr 29, 2026
942703d
feat: add npu dockerfile and useage
UsernameFull Apr 27, 2026
034d38e
feat: add npu dockerfile and useage
UsernameFull Apr 29, 2026
19e740d
Optimize ROCm for send_recv and model_update
aaab8b Apr 15, 2026
4bb7d74
Update README.md
histmeisah May 4, 2026
2611542
Add support for ROCm 7.2 and PyTorch 2.10
aaab8b May 12, 2026
084f0ed
feat(agentic): integrate Atropos environment as gem.Env adapter
RUFFY-369 Apr 17, 2026
4960b49
docs: add atropos-gsm8k training demo configuration and launch script
RUFFY-369 Apr 17, 2026
6069cda
fix: move max_steps to yaml to avoid unrecognized cli args in start_p…
RUFFY-369 Apr 17, 2026
6d0df55
fix(yaml): remove duplicate max_steps key
RUFFY-369 Apr 17, 2026
190f124
fix(yaml): add explicit val_env_manager and RL params to avoid config…
RUFFY-369 Apr 17, 2026
e91f69e
fix(yaml): resolve ZeroDivisionError by providing valid val_env_config
RUFFY-369 Apr 17, 2026
5b45342
fix(scheduler): defensive Ray resource allocation for modern versions
RUFFY-369 Apr 17, 2026
8a53cb1
feat: integrate Atropos deep reasoning with GRPO and universal reward…
RUFFY-369 Apr 21, 2026
e5c91c7
fix: restore OpenReward demo config mistakenly pruned during cleanup
RUFFY-369 Apr 21, 2026
aa2052b
refactor: simplify resource allocation and restore original node sort…
RUFFY-369 May 11, 2026
9ec88d6
docs: Add Huawei Ascend hardware support doc
UsernameFull Mar 13, 2026
5d1e3c3
add npu doc
UsernameFull May 14, 2026
c09bc8b
[ascend adapt] qwen3-30b model vllm fsdp2
shun001 May 11, 2026
d724d6d
ci: add NPU CI coverage
UsernameFull May 25, 2026
a5be21d
test: add NPU abort smoke coverage
UsernameFull May 25, 2026
dbfad7b
test: skip SGLang abort smoke without NPU kernel
UsernameFull May 25, 2026
e8abdcd
ci: install SGLang NPU kernel for abort smoke
UsernameFull May 25, 2026
7403376
test: run SGLang abort smoke without memory saver
UsernameFull May 25, 2026
e2c124a
test: cover SGLang abort state without engine launch
UsernameFull May 25, 2026
8ba6675
add cpu test
UsernameFull May 26, 2026
5ad241c
test_rollout_scheduler.py
UsernameFull May 26, 2026
c73a686
fix
UsernameFull May 26, 2026
8403c40
xx
UsernameFull May 26, 2026
51052df
fix
UsernameFull May 26, 2026
1ac2d3d
fix
UsernameFull May 26, 2026
d1cac57
xx
UsernameFull May 26, 2026
d44dbb0
xx
UsernameFull May 26, 2026
299 changes: 299 additions & 0 deletions .github/workflows/ci-npu-test.yml
@@ -0,0 +1,299 @@
name: Tests

on:
  push:
    branches: [main, npu_ci]
    paths-ignore:
      - "docs_roll/**"
      - "**/*.md"
      - ".github/workflows/deploy.yml"
      - ".github/workflows/daily-stats.yml"
  pull_request:
    branches: [main, npu_ci]
    paths-ignore:
      - "docs_roll/**"
      - "**/*.md"
      - ".github/workflows/deploy.yml"
      - ".github/workflows/daily-stats.yml"

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  unit-test:
    name: Unit Tests (CPU)
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python 3.11
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: "pip"
          cache-dependency-path: |
            requirements_common.txt
            mcore_adapter/pyproject.toml
            mcore_adapter/requirements.txt
            setup.py
            pyproject.toml

      - name: Install dependencies
        run: |
          pip install --upgrade pip
          # Install PyTorch CPU-only to keep CI lightweight
          pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
          # Install core test dependencies (subset of requirements_common.txt)
          pip install pytest pytest-timeout pytest-asyncio numpy tensordict pydantic dacite \
            more_itertools hydra-core omegaconf peft==0.12.0 datasets==3.1.0 \
            trl==0.9.6 transformers ray[default] sympy deprecated codetiming pybase64 imageio \
            jsonschema mcp gem-llm==0.0.4 gym 'gymnasium[toy-text]' gym_sokoban
          # Install mcore_adapter and roll itself
          pip install -e ./mcore_adapter
          pip install -e .

      - name: Run CPU-compatible unit tests
        run: |
          pytest tests/utils/test_action_parser.py \
            tests/utils/test_functionals.py \
            tests/utils/test_dynamic_batching.py \
            tests/utils/test_sequence_packing.py \
            tests/utils/test_taskgroups.py \
            tests/utils/test_cp_rmpad_ulysses_utils.py \
            tests/datasets/test_collator.py \
            tests/datasets/test_sampler.py \
            tests/agentic \
            tests/test_ref_worker_type_consistency.py \
            tests/distributed/scheduler/test_protocol.py \
            tests/distributed/scheduler/test_protocol_padding.py \
            tests/distributed/scheduler/test_decorator.py \
            tests/distributed/scheduler/test_resource_manager.py \
            -v --timeout=300 -x
        env:
          PYTHONPATH: ${{ github.workspace }}
          ROLL_RUN_EXTERNAL_AGENTIC_TESTS: "0"
          ROLL_RUN_AGENTIC_SANDBOX_TESTS: "0"
          ROLL_RUN_AGENTIC_ENV_MANAGER_DEBUG_TESTS: "0"

  npu-test:
    name: NPU Integration Tests
    if: github.event_name != 'pull_request' || github.event.pull_request.head.repo.full_name == github.repository
    runs-on: linux-aarch64-a3-8
    timeout-minutes: 120
    container:
      # Pre-built NPU docker image (built from docker/Dockerfile.A3) with all deps pre-installed
      image: swr.cn-north-4.myhuaweicloud.com/ascend-cicd/roll:main-a3
    env:
      HF_ENDPOINT: https://hf-mirror.com
      PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
      TASK_QUEUE_ENABLE: "2"
      VLLM_USE_V1: "1"
      # The CI vLLM smoke uses TP=1; FlashComm sequence parallelism requires TP>1.
      VLLM_ASCEND_ENABLE_FLASHCOMM: "0"
      SGLANG_KERNEL_NPU_REPO: https://github.com/sgl-project/sgl-kernel-npu.git
      SGLANG_KERNEL_NPU_BRANCH: main
      SGLANG_KERNEL_NPU_CACHE_KEY: main
      SGLANG_REPO: https://github.com/sgl-project/sglang.git
      SGLANG_BRANCH: ifmn/eagle-dp-attn
      SGLANG_CACHE_KEY: ifmn-eagle-dp-attn

    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          submodules: recursive

      - name: Cache NPU pip packages
        uses: actions/cache@v4
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-npu-pip-${{ env.SGLANG_KERNEL_NPU_CACHE_KEY }}-${{ env.SGLANG_CACHE_KEY }}-${{ hashFiles('requirements_common.txt', 'mcore_adapter/pyproject.toml', 'mcore_adapter/requirements.txt', 'setup.py', 'pyproject.toml', '.github/workflows/ci-npu-test.yml') }}
          restore-keys: |
            ${{ runner.os }}-npu-pip-${{ env.SGLANG_KERNEL_NPU_CACHE_KEY }}-${{ env.SGLANG_CACHE_KEY }}-
            ${{ runner.os }}-npu-pip-${{ env.SGLANG_CACHE_KEY }}-
            ${{ runner.os }}-npu-pip-

      - name: Configure Ascend runtime
        shell: bash
        run: |
          set -eo pipefail
          if [ -f /usr/local/Ascend/ascend-toolkit/set_env.sh ]; then
            source /usr/local/Ascend/ascend-toolkit/set_env.sh
          fi
          if [ -f /usr/local/Ascend/nnal/atb/set_env.sh ]; then
            source /usr/local/Ascend/nnal/atb/set_env.sh
          fi

          export ASCEND_HOME_PATH="${ASCEND_HOME_PATH:-/usr/local/Ascend/ascend-toolkit/latest}"
          export ASCEND_TOOLKIT_HOME="${ASCEND_TOOLKIT_HOME:-${ASCEND_HOME_PATH}}"
          export ASCEND_OPP_PATH="${ASCEND_OPP_PATH:-${ASCEND_HOME_PATH}/opp}"
          export ASCEND_AICPU_PATH="${ASCEND_AICPU_PATH:-${ASCEND_HOME_PATH}}"
          export LD_LIBRARY_PATH="/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/runtime/lib64:/usr/local/Ascend/ascend-toolkit/latest/runtime/lib64/stub:/usr/local/Ascend/ascend-toolkit/latest/tools/hccl/lib64:/usr/local/Ascend/ascend-toolkit/latest/hccl/lib64:${LD_LIBRARY_PATH:-}"

          cann_python_paths=()
          for path in \
            "${ASCEND_HOME_PATH}/python/site-packages" \
            "${ASCEND_HOME_PATH}/opp/built-in/op_impl/ai_core/tbe"; do
            if [ -d "$path" ]; then
              cann_python_paths+=("$path")
            fi
          done
          if [ ${#cann_python_paths[@]} -gt 0 ]; then
            export PYTHONPATH="$(IFS=:; echo "${cann_python_paths[*]}"):${PYTHONPATH:-}"
          fi

          echo "ASCEND_HOME_PATH=${ASCEND_HOME_PATH}" >> "$GITHUB_ENV"
          echo "ASCEND_TOOLKIT_HOME=${ASCEND_TOOLKIT_HOME}" >> "$GITHUB_ENV"
          echo "ASCEND_OPP_PATH=${ASCEND_OPP_PATH}" >> "$GITHUB_ENV"
          echo "ASCEND_AICPU_PATH=${ASCEND_AICPU_PATH}" >> "$GITHUB_ENV"
          echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH}" >> "$GITHUB_ENV"
          echo "PYTHONPATH=${PYTHONPATH:-}" >> "$GITHUB_ENV"
          echo "${ASCEND_HOME_PATH}/bin" >> "$GITHUB_PATH"
          echo "${ASCEND_HOME_PATH}/compiler/ccec_compiler/bin" >> "$GITHUB_PATH"

      - name: Show environment info
        run: |
          echo "=== Python ==="
          python3 --version
          python3 -m pip --version
          echo "=== PyTorch ==="
          python3 -c "import torch; print(f'torch={torch.__version__}')"
          echo "=== NPU ==="
          python3 -c "
          import torch
          import torch_npu
          import importlib.util

          print(f'torch_npu={torch_npu.__version__}')
          tbe_spec = importlib.util.find_spec('tbe')
          print(f'tbe_module={tbe_spec is not None}')
          if tbe_spec is None:
              raise RuntimeError('CANN tbe Python module is not visible in PYTHONPATH')
          for module_name in ('decorator', 'attrs', 'psutil', 'scipy', 'cloudpickle', 'tornado', 'ml_dtypes'):
              module_spec = importlib.util.find_spec(module_name)
              print(f'{module_name}_module={module_spec is not None}')
          if not torch.npu.is_available():
              raise RuntimeError('torch.npu.is_available() is False')
          print(f'npu_device_count={torch.npu.device_count()}')
          "
          echo "=== Ascend ==="
          npu-smi info

      - name: Install pytest dependencies
        run: |
          pip install pytest-timeout

      - name: Install SGLang NPU kernel from source
        shell: bash
        run: |
          set -eo pipefail
          export SGLANG_KERNEL_NPU_SRC="/tmp/sgl-kernel-npu"
          rm -rf "${SGLANG_KERNEL_NPU_SRC}"
          git clone --depth 1 --branch "${SGLANG_KERNEL_NPU_BRANCH}" "${SGLANG_KERNEL_NPU_REPO}" "${SGLANG_KERNEL_NPU_SRC}"
          cd "${SGLANG_KERNEL_NPU_SRC}"
          python3 -m pip install pybind11 wheel
          bash build.sh -a kernels
          python3 -m pip install output/sgl_kernel_npu*.whl
          python3 - <<'PY'
          import sgl_kernel_npu

          print(f"sgl_kernel_npu={sgl_kernel_npu.__path__}")
          PY

      - name: Install SGLang from source
        shell: bash
        run: |
          set -eo pipefail
          export SGLANG_SRC="/tmp/sglang"
          rm -rf "${SGLANG_SRC}"
          git clone --depth 1 --branch "${SGLANG_BRANCH}" "${SGLANG_REPO}" "${SGLANG_SRC}"
          python3 - <<'PY' > "${SGLANG_SRC}/ci-requirements.txt"
          import importlib.metadata
          import os
          import re
          import tomllib
          from pathlib import Path

          skip_packages = {
              "cuda-python",
              "flashinfer-cubin",
              "flashinfer-python",
              "nvidia-cutlass-dsl",
              "nvidia-ml-py",
              "sgl-kernel",
              "torch",
              "torch-memory-saver",
              "torchaudio",
              "torchao",
              "torchcodec",
              "torchvision",
              "transformers",
          }

          pyproject = Path(os.environ["SGLANG_SRC"]) / "python" / "pyproject.toml"
          dependencies = tomllib.loads(pyproject.read_text())["project"]["dependencies"]
          for dependency in dependencies:
              package_name = re.split(r"[\[<>=!~; ]", dependency, maxsplit=1)[0]
              package_name = package_name.replace("_", "-").lower()
              if package_name in skip_packages:
                  continue
              try:
                  importlib.metadata.version(package_name)
              except importlib.metadata.PackageNotFoundError:
                  print(dependency)
          PY
          echo "Missing SGLang dependencies for CI:"
          cat "${SGLANG_SRC}/ci-requirements.txt"
          python3 -m pip install -r "${SGLANG_SRC}/ci-requirements.txt"
          python3 -m pip install --no-deps -e "${SGLANG_SRC}/python"
          python3 - <<'PY'
          import importlib.metadata

          print(f"sglang={importlib.metadata.version('sglang')}")
          PY

      - name: Install ROLL
        run: |
          pip install -e ./mcore_adapter
          pip install -e .

      - name: Show vLLM Ascend info
        run: |
          python3 - <<'PY'
          import importlib.metadata

          import vllm
          import vllm_ascend
          from roll.platforms import current_platform

          for package_name in ("transformers", "deepspeed", "triton-ascend"):
              try:
                  package_version = importlib.metadata.version(package_name)
              except importlib.metadata.PackageNotFoundError:
                  package_version = "not installed"
              print(f"{package_name}={package_version}")

          print(f"vllm={vllm.__version__}")
          print(f"platform={current_platform.device_type}")
          PY

      - name: Run remaining NPU-compatible unit tests
        run: |
          export PYTHONPATH="${GITHUB_WORKSPACE}:${PYTHONPATH:-}"
          python3 -m pytest tests/third_party/sglang \
            tests/third_party/vllm \
            tests/datasets \
            tests/distributed \
            tests/models \
            tests/pipeline \
            tests/third_party/deepspeed \
            tests/utils/ \
            tests/test_ref_worker_type_consistency.py \
            --ignore=tests/models/cuda_mem \
            -v --timeout=600 -x
        env:
          ROLL_NPU_CI: "1"
          DS_UNITTEST_TIMEOUT: "600"
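
Note on the "Install SGLang from source" step: the inline script writes `ci-requirements.txt` by reducing each dependency string to a normalized package name, then dropping anything in `skip_packages` or already installed. The sketch below reproduces that filtering logic in isolation so it can be checked without an SGLang checkout. It is an illustration only: `requirement_name`, `filter_requirements`, the shortened `SKIP_PACKAGES` set, and the `installed` argument (standing in for the `importlib.metadata.version` lookup) are hypothetical names, not part of the workflow.

```python
import re

# Subset of the workflow's skip_packages set, shortened for illustration.
SKIP_PACKAGES = {"torch", "sgl-kernel", "transformers", "flashinfer-python"}


def requirement_name(requirement: str) -> str:
    """Return the normalized package name at the start of a requirement string."""
    # Same split as the workflow: cut at the first extras bracket, version
    # operator, environment-marker separator, or space.
    name = re.split(r"[\[<>=!~; ]", requirement, maxsplit=1)[0]
    return name.replace("_", "-").lower()


def filter_requirements(requirements, installed):
    """Keep requirements that are neither skipped nor already installed.

    `installed` is a set of normalized names; it replaces the
    importlib.metadata.version() probe used in the workflow (assumption).
    """
    kept = []
    for requirement in requirements:
        name = requirement_name(requirement)
        if name in SKIP_PACKAGES or name in installed:
            continue
        kept.append(requirement)
    return kept


if __name__ == "__main__":
    sample = [
        "torch==2.8.0",
        "Flashinfer_Python>=0.2",
        "aiohttp",
        "uvicorn[standard]>=0.30",
        "requests; python_version >= '3.9'",
    ]
    for line in filter_requirements(sample, installed={"aiohttp"}):
        print(line)
```

The full requirement string, including extras and markers, is what gets kept, so pip still sees the original constraint. The normalization only maps underscores to hyphens and lowercases, so a name written with dots would not match a hyphenated entry in the skip set.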
4 changes: 0 additions & 4 deletions .gitignore
@@ -1,8 +1,4 @@
# Ignore all png files
*.png

# But allow png files in static/img directory
!docs_roll/static/img/*.png
*.pyc
*/checkpoint_dir
*/dataset