[megatron] Add seq packing support for qwen3.5#1769

Open
erictang000 wants to merge 11 commits into NovaSky-AI:main from erictang000:qwen3_5_seq_pack

@erictang000

Collaborator

No description provided.

erictang000 and others added 4 commits June 10, 2026 01:26
Qwen3.5 hybrid Gated-DeltaNet checkpoints report a ...ForConditionalGeneration
arch and auto-dispatch through megatron-bridge's VL bridge -> Qwen3VLModel, which
packs + CP-shards sequences inside its own forward. Under SkyRL sample packing,
this double-packs the batch and corrupts the cu_seqlens fed to the GDN varlen
kernel, which then aborts in the backward pass.
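
For readers unfamiliar with packed (thd) layouts, here is a minimal sketch of what cu_seqlens encodes and why packing an already packed batch loses information. The helper names are hypothetical and are not taken from SkyRL or megatron-core; the exact failure mode inside the VL bridge is not described in this PR, so the "double-packed" case below is only an illustration.

```python
from itertools import accumulate


def build_cu_seqlens(seq_lens):
    """Cumulative sequence boundaries for a packed batch: [0, l0, l0+l1, ...]."""
    return [0] + list(accumulate(seq_lens))


def is_valid_cu_seqlens(cu_seqlens, total_tokens):
    """Sanity check: starts at 0, strictly increasing, ends at the token count."""
    starts_at_zero = cu_seqlens[0] == 0
    increasing = all(a < b for a, b in zip(cu_seqlens, cu_seqlens[1:]))
    return starts_at_zero and increasing and cu_seqlens[-1] == total_tokens


# Three samples packed once: every boundary is preserved.
seq_lens = [3, 5, 2]
packed_once = build_cu_seqlens(seq_lens)

# Illustration only: packing the already packed stream again treats the
# 10 tokens as a single sequence, so the per-sample boundaries disappear.
packed_twice = build_cu_seqlens([sum(seq_lens)])
```

Here `packed_once` keeps the boundaries at 3 and 8, while `packed_twice` only records the start and end of the stream, so a varlen kernel would see one long sequence instead of three.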

When language_model_only=True, route these checkpoints to megatron-core's native
GPTModel + GDN thd path (vision tower dropped), which supports packed sequences
directly. Implemented as thin SkyRL subclasses of the stock Qwen35MoEBridge /
Qwen35Bridge that feed text_config into the inherited provider logic via a shim
and re-prefix the weight mappings to model.language_model.*; registered on
sentinel ...ForCausalLM source keys so the real VL-bridge registration is not
clobbered. maybe_force_qwen35_text_bridge rewrites the loaded architectures to
the sentinel before to_megatron_provider; the worker calls it gated on
policy/ref language_model_only.
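
The routing described above can be sketched roughly as follows. This is not the PR's implementation: the architecture strings are placeholders (the commit message elides the real names as "..."), `force_text_bridge` is a hypothetical stand-in for maybe_force_qwen35_text_bridge, and the config object is a plain namespace rather than a Hugging Face config.

```python
from types import SimpleNamespace

# Placeholder names: the real source keys are elided in the commit message.
SENTINEL_BY_ARCH = {
    "ExampleVLForConditionalGeneration": "ExampleTextOnlyForCausalLM",
}


def force_text_bridge(hf_config, language_model_only):
    """Rewrite architectures to a sentinel so a text-only bridge is selected.

    Returns True if the config was rewritten, False if it was left untouched.
    """
    if not language_model_only:
        return False
    archs = list(getattr(hf_config, "architectures", None) or [])
    rewritten = [SENTINEL_BY_ARCH.get(arch, arch) for arch in archs]
    if rewritten == archs:
        return False  # not an architecture this sketch knows how to reroute
    hf_config.architectures = rewritten
    return True


# Usage: only the language_model_only path changes the dispatch key.
cfg = SimpleNamespace(architectures=["ExampleVLForConditionalGeneration"])
changed = force_text_bridge(cfg, language_model_only=True)
```

Registering the text-only bridge under a separate sentinel key, instead of overriding the original key, keeps the default VL dispatch intact for callers that do not set language_model_only.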

Verified logprob parity vs vLLM with sample packing on:
- Qwen3.5-0.8B (dense), TP=2: diff mean 0.008
- Qwen3.5-35B-A3B (MoE), TP=4 EP=4: diff mean 0.010
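
The commit message does not say how "diff mean" is computed. One common reading, assumed here, is the mean absolute difference between per-token logprobs from the trainer and from the inference engine, averaged over unmasked response tokens. The function below is a hypothetical sketch of that metric, not code from this PR.

```python
def mean_abs_logprob_diff(train_logprobs, infer_logprobs, mask):
    """Mean |train - infer| over positions where mask is truthy."""
    if not (len(train_logprobs) == len(infer_logprobs) == len(mask)):
        raise ValueError("inputs must have the same length")
    diffs = [
        abs(t - i)
        for t, i, m in zip(train_logprobs, infer_logprobs, mask)
        if m
    ]
    if not diffs:
        raise ValueError("mask selects no tokens")
    return sum(diffs) / len(diffs)


# The last token is masked out (for example, padding), so it is ignored.
example = mean_abs_logprob_diff(
    train_logprobs=[-1.0, -2.0, -0.5],
    infer_logprobs=[-1.5, -2.0, -3.0],
    mask=[1, 1, 0],
)
```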

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@erictang000 erictang000 marked this pull request as ready for review June 15, 2026 19:07

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request enables sample packing for Qwen3.5 hybrid Gated-DeltaNet (GDN) models on the Megatron backend by routing them to the native GPTModel path when language_model_only=True. Feedback highlights a critical issue where calling get_text_config() on the Hugging Face configuration will raise an AttributeError at runtime, and suggests updating an outdated comment in the shell script.
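
Whether get_text_config() is available depends on the config class and library version in use, which this thread does not settle. A defensive accessor along the following lines would work either way; `resolve_text_config` is a hypothetical helper, and the namespace objects stand in for real Hugging Face configs.

```python
from types import SimpleNamespace


def resolve_text_config(hf_config):
    """Return the text sub-config, tolerating configs without get_text_config()."""
    getter = getattr(hf_config, "get_text_config", None)
    if callable(getter):
        return getter()
    # Fall back to a text_config attribute, then to the config itself
    # (the text-only case, where there is no nested sub-config).
    return getattr(hf_config, "text_config", hf_config)


# Stand-in configs covering the three cases.
text_cfg = SimpleNamespace(hidden_size=1024)
with_getter = SimpleNamespace(get_text_config=lambda: text_cfg)
with_attr = SimpleNamespace(text_config=text_cfg)
text_only = SimpleNamespace(hidden_size=2048)
```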

Comment thread skyrl/backends/skyrl_train/workers/megatron/model_bridges.py

# Qwen3.5 flags
-REMOVE_MICROBATCH_PADDING=false # sample packing is not yet supported for GDN layers in megatron - see: https://github.com/NVIDIA/Megatron-LM/pull/2644
+REMOVE_MICROBATCH_PADDING=True # sample packing is not yet supported for GDN layers in megatron - see: https://github.com/NVIDIA/Megatron-LM/pull/2644

Contributor

medium

The comment on this line is outdated and misleading because this pull request is specifically adding support for sample packing with GDN layers for Qwen3.5. Let's update the comment to reflect that sample packing is now supported and enabled.

Suggested change
-REMOVE_MICROBATCH_PADDING=True # sample packing is not yet supported for GDN layers in megatron - see: https://github.com/NVIDIA/Megatron-LM/pull/2644
+REMOVE_MICROBATCH_PADDING=True # Enable sample packing for GDN layers in megatron

@erictang000 erictang000 changed the title from "[WIP] Add seq packing support for qwen3.5" to "[megatron] Add seq packing support for qwen3.5" on Jun 15, 2026