[train] Save HF processor on checkpoint export for VLMs #1785
dinhxuanvu wants to merge 2 commits into
VLM checkpoints exported by save_hf_configs were missing preprocessor_config.json (and image/video processor configs), so AutoProcessor.from_pretrained() and vLLM fail to load the exported checkpoint. Only the model config, tokenizer, and generation config were saved.

Resolve and save the processor from model_config.name_or_path when the model is a VLM, mirroring the existing generation_config handling. No-op for text-only models (no vision_config). Adds check_is_vlm/get_processor helpers to skyrl/utils/tok.py.

Signed-off-by: Vu Dinh <vudinh@outlook.com>
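A minimal sketch of the VLM check described in this commit message, assuming a VLM is detected by a non-None vision_config attribute on the model config (the "no vision_config" wording suggests this, but the actual helper in skyrl/utils/tok.py may differ). The SimpleNamespace objects are stand-ins for real HF configs and their names are made up.

```python
from types import SimpleNamespace


def check_is_vlm(model_config) -> bool:
    """Return True when the config carries a vision sub-config.

    Assumption: "is a VLM" means a non-None `vision_config` attribute.
    The helper added by this PR may use a different or stricter rule.
    """
    return getattr(model_config, "vision_config", None) is not None


# Stand-in configs for illustration; real code would pass a PretrainedConfig.
text_only = SimpleNamespace(name_or_path="some/text-model")
vlm = SimpleNamespace(name_or_path="some/vlm", vision_config={"hidden_size": 1024})

print(check_is_vlm(text_only))  # False
print(check_is_vlm(vlm))        # True
```

With a check like this, text-only exports skip the processor step entirely, which is what makes the change a no-op for them.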
Code Review
This pull request introduces support for saving Hugging Face processor configurations (such as preprocessor_config.json) when exporting vision-language model (VLM) checkpoints, preventing potential loading crashes in downstream environments like vLLM. It adds utility functions to check if a model is a VLM and to retrieve its processor, along with corresponding unit tests. Feedback suggests wrapping the VLM check in a try-except block and passing the already loaded model configuration directly to avoid redundant disk/network I/O. Additionally, defensive checks should be added to handle processors that lack a tokenizer, and the unit tests should be updated accordingly.
- Move the VLM check + processor resolve/save into a single try/except so a processor-resolution failure (offline/transient node) cannot abort the checkpoint save.
- check_is_vlm now accepts a PretrainedConfig or a path; save_hf_configs passes the already-loaded model_config, avoiding a redundant AutoConfig.from_pretrained.
- Update tests to assert the object is passed, plus a test that a VLM-check failure is swallowed.

Signed-off-by: Vu Dinh <vudinh@outlook.com>
Summary
VLM checkpoints exported via DistributedStrategy.save_hf_configs are missing preprocessor_config.json (and the image/video processor configs). Only the model config, tokenizer, and generation config are written. As a result, AutoProcessor.from_pretrained() cannot reload the exported checkpoint and downstream vLLM serving crashes on load.

This adds processor saving to the HF export path:

- Adds check_is_vlm() and get_processor() helpers to skyrl/utils/tok.py.
- save_hf_configs, when the model is a VLM, resolves the processor from model_config.name_or_path and writes it alongside the other configs. This mirrors the existing generation_config handling, which already resolves from name_or_path.
- No-op for text-only models (no vision_config), and wrapped in try/except so a processor-resolution failure cannot abort a checkpoint save.

Both export paths funnel through save_hf_configs (FSDP passes the live model.config; Megatron passes hf_config loaded via AutoConfig.from_pretrained(model_path)), so the fix covers both backends.

Test plan

tests/backends/skyrl_train/distributed/test_save_hf_configs_processor.py:

- ... name_or_path is empty
- ... preprocessor_config.json is present and AutoProcessor.from_pretrained() reloads it.
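The last check in the test plan can be approximated by hand against any exported checkpoint directory. This is a rough sketch, not the PR's test code: processor_config_present and reload_processor are hypothetical helper names, and only the file name preprocessor_config.json and the AutoProcessor.from_pretrained call come from the description above.

```python
import os
import tempfile

PROCESSOR_FILE = "preprocessor_config.json"


def processor_config_present(ckpt_dir: str) -> bool:
    """True when the exported checkpoint directory contains the processor config."""
    return os.path.isfile(os.path.join(ckpt_dir, PROCESSOR_FILE))


def reload_processor(ckpt_dir: str):
    """Reload the processor from the export; requires transformers and a real checkpoint."""
    from transformers import AutoProcessor  # lazy import, not needed for the file check

    return AutoProcessor.from_pretrained(ckpt_dir)


# File-presence check demonstrated on a throwaway directory.
with tempfile.TemporaryDirectory() as ckpt_dir:
    print(processor_config_present(ckpt_dir))  # False
    with open(os.path.join(ckpt_dir, PROCESSOR_FILE), "w") as f:
        f.write("{}")
    print(processor_config_present(ckpt_dir))  # True
```

The file check is cheap enough to run after every export. reload_processor is the stronger check, but it needs a real VLM checkpoint on disk.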