[train] Save HF processor on checkpoint export for VLMs by dinhxuanvu · Pull Request #1785 · NovaSky-AI/SkyRL

dinhxuanvu · 2026-06-14T01:42:28Z

Summary

VLM checkpoints exported via DistributedStrategy.save_hf_configs are missing preprocessor_config.json (and the image/video processor configs). Only the model config, tokenizer, and generation config are written. As a result, AutoProcessor.from_pretrained() cannot reload the exported checkpoint and downstream vLLM serving crashes on load.

This adds processor saving to the HF export path:

Adds check_is_vlm() and get_processor() helpers to skyrl/utils/tok.py.
In save_hf_configs, when the model is a VLM, resolves the processor from model_config.name_or_path and writes it alongside the other configs. This mirrors the existing generation_config handling, which already resolves from name_or_path.
No-op for text-only models (no vision_config), and wrapped in try/except so a processor-resolution failure cannot abort a checkpoint save.

Both export paths funnel through save_hf_configs (FSDP passes the live model .config; Megatron passes hf_config loaded via AutoConfig.from_pretrained(model_path)), so the fix covers both backends.

Test plan

New CPU unit tests in tests/backends/skyrl_train/distributed/test_save_hf_configs_processor.py:
- processor saved for VLM
- processor not resolved for text-only
- processor-resolution failure does not raise
- skipped when name_or_path is empty
End-to-end: export a VLM checkpoint and confirm preprocessor_config.json is present and AutoProcessor.from_pretrained() reloads it.

VLM checkpoints exported by save_hf_configs were missing preprocessor_config.json (and image/video processor configs), so AutoProcessor.from_pretrained() and vLLM fail to load the exported checkpoint. Only the model config, tokenizer, and generation config were saved. Resolve and save the processor from model_config.name_or_path when the model is a VLM, mirroring the existing generation_config handling. No-op for text-only models (no vision_config). Adds check_is_vlm/get_processor helpers to skyrl/utils/tok.py. Signed-off-by: Vu Dinh <vudinh@outlook.com>

gemini-code-assist

Code Review

This pull request introduces support for saving Hugging Face processor configurations (such as preprocessor_config.json) when exporting vision-language model (VLM) checkpoints, preventing potential loading crashes in downstream environments like vLLM. It adds utility functions to check if a model is a VLM and to retrieve its processor, along with corresponding unit tests. Feedback suggests wrapping the VLM check in a try-except block and passing the already loaded model configuration directly to avoid redundant disk/network I/O. Additionally, defensive checks should be added to handle processors that lack a tokenizer, and the unit tests should be updated accordingly.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

- Move the VLM check + processor resolve/save into a single try/except so a processor-resolution failure (offline/transient node) cannot abort the checkpoint save. - check_is_vlm now accepts a PretrainedConfig or a path; save_hf_configs passes the already-loaded model_config, avoiding a redundant AutoConfig.from_pretrained. - Update tests to assert the object is passed, plus a test that a VLM-check failure is swallowed. Signed-off-by: Vu Dinh <vudinh@outlook.com>

gemini-code-assist Bot reviewed Jun 14, 2026

View reviewed changes

Comment thread skyrl/backends/skyrl_train/distributed/strategy.py Outdated

Comment thread skyrl/utils/tok.py Outdated

Comment thread skyrl/utils/tok.py

Comment thread tests/backends/skyrl_train/distributed/test_save_hf_configs_processor.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[train] Save HF processor on checkpoint export for VLMs#1785

[train] Save HF processor on checkpoint export for VLMs#1785
dinhxuanvu wants to merge 2 commits into
NovaSky-AI:mainfrom
dinhxuanvu:vdinh/save-hf-processor-on-export

dinhxuanvu commented Jun 14, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

dinhxuanvu commented Jun 14, 2026

Summary

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant