Skip to content

[train] Save HF processor on checkpoint export for VLMs#1785

Open
dinhxuanvu wants to merge 2 commits into
NovaSky-AI:mainfrom
dinhxuanvu:vdinh/save-hf-processor-on-export
Open

[train] Save HF processor on checkpoint export for VLMs#1785
dinhxuanvu wants to merge 2 commits into
NovaSky-AI:mainfrom
dinhxuanvu:vdinh/save-hf-processor-on-export

Conversation

@dinhxuanvu

Copy link
Copy Markdown
Contributor

Summary

VLM checkpoints exported via DistributedStrategy.save_hf_configs are missing preprocessor_config.json (and the image/video processor configs). Only the model config, tokenizer, and generation config are written. As a result, AutoProcessor.from_pretrained() cannot reload the exported checkpoint and downstream vLLM serving crashes on load.

This adds processor saving to the HF export path:

  • Adds check_is_vlm() and get_processor() helpers to skyrl/utils/tok.py.
  • In save_hf_configs, when the model is a VLM, resolves the processor from model_config.name_or_path and writes it alongside the other configs. This mirrors the existing generation_config handling, which already resolves from name_or_path.
  • No-op for text-only models (no vision_config), and wrapped in try/except so a processor-resolution failure cannot abort a checkpoint save.

Both export paths funnel through save_hf_configs (FSDP passes the live model .config; Megatron passes hf_config loaded via AutoConfig.from_pretrained(model_path)), so the fix covers both backends.

Test plan

  • New CPU unit tests in tests/backends/skyrl_train/distributed/test_save_hf_configs_processor.py:
    • processor saved for VLM
    • processor not resolved for text-only
    • processor-resolution failure does not raise
    • skipped when name_or_path is empty
  • End-to-end: export a VLM checkpoint and confirm preprocessor_config.json is present and AutoProcessor.from_pretrained() reloads it.

VLM checkpoints exported by save_hf_configs were missing
preprocessor_config.json (and image/video processor configs), so
AutoProcessor.from_pretrained() and vLLM fail to load the exported
checkpoint. Only the model config, tokenizer, and generation config
were saved.

Resolve and save the processor from model_config.name_or_path when the
model is a VLM, mirroring the existing generation_config handling. No-op
for text-only models (no vision_config). Adds check_is_vlm/get_processor
helpers to skyrl/utils/tok.py.

Signed-off-by: Vu Dinh <vudinh@outlook.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for saving Hugging Face processor configurations (such as preprocessor_config.json) when exporting vision-language model (VLM) checkpoints, preventing potential loading crashes in downstream environments like vLLM. It adds utility functions to check if a model is a VLM and to retrieve its processor, along with corresponding unit tests. Feedback suggests wrapping the VLM check in a try-except block and passing the already loaded model configuration directly to avoid redundant disk/network I/O. Additionally, defensive checks should be added to handle processors that lack a tokenizer, and the unit tests should be updated accordingly.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread skyrl/backends/skyrl_train/distributed/strategy.py Outdated
Comment thread skyrl/utils/tok.py Outdated
Comment thread skyrl/utils/tok.py
- Move the VLM check + processor resolve/save into a single try/except so a
  processor-resolution failure (offline/transient node) cannot abort the
  checkpoint save.
- check_is_vlm now accepts a PretrainedConfig or a path; save_hf_configs passes
  the already-loaded model_config, avoiding a redundant AutoConfig.from_pretrained.
- Update tests to assert the object is passed, plus a test that a VLM-check
  failure is swallowed.

Signed-off-by: Vu Dinh <vudinh@outlook.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant