Jiawei Fan, Shigeng Wang, Chao Li, Xiaolong Liu, and Anbang Yao
This repository contains the official PyTorch implementation of CoM-PT.
- [Coming Soon] ⏳ The evaluation code on clip-benchmark and additional pre-trained VFM model families will be released shortly. Watch/Star this repository to stay updated!
- [April 2026] 🎉 We release the training code and pre-trained VFM checkpoints on CC3M dataset.
- [Feb 2026] 🎉 Our paper has been accepted to CVPR 2026!
```bash
pip install -r requirements-training.txt
pip install -r requirements-test.txt
```

Note: We strongly recommend using `numpy<2.0` in this repository to avoid unnecessary issues during training.
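If your environment already ships NumPy 2.x, one way to pin it (assuming a pip-managed environment) is:

```bash
# Pin NumPy below 2.0 to avoid compatibility issues during training.
pip install "numpy<2.0"
```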
OpenCLIP reads a CSV file with two columns: a path to an image, and a text caption. The names of the columns are passed as arguments to main.py.
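For reference, a minimal sketch of the expected CSV layout and the corresponding arguments (the column names and the entry-point path below are illustrative; the flag names assume open_clip's standard CLI):

```bash
# Illustrative CSV layout (hypothetical train.csv):
#   filepath,title
#   /data/cc3m/train/0000001.jpg,a dog running on the beach
#
# Column names are passed to main.py via open_clip-style arguments:
python src/training/main.py \
    --train-data /path/to/train.csv \
    --csv-separator "," \
    --csv-img-key filepath \
    --csv-caption-key title
```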
The script src/data/gather_cc.py collects the Conceptual Captions 3M images. First, download the Conceptual Captions 3M URLs, and then run the script from our repository.
For easy notation, we rename Train_GCC-training to cc3m_train, and Validation_GCC-1.1.0-Validation to cc3m_val.
```bash
python src/data/gather_cc.py [path/to/cc3m/images/] [path/to/cc3m_train.tsv] [path/to/cc3m_val.tsv]
```

Our downloaded CC3M training set contains 2.89M images, and our CC3M validation set contains 13K images.
We also provide a URL where you can directly download the `.zip` file: Link to zip
The script src/data/gather_cc12m.py collects the Conceptual 12M images. First, download the Conceptual 12M URLs, and then run the script from our repository:
```bash
python src/data/gather_cc12m.py [path/to/cc12m/images/] [path/to/cc12m.tsv]
```

Since the CC12M dataset is extremely large, the `.zip` file is currently in preparation for release.
We do not directly use the generated cc3m_train.csv and cc12m_train.csv files in our training. Instead, we combine them with MLLM-generated long captions from DreamLIP. You can download cc3m_lc.csv and cc12m_lc.csv here.
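As a rough, hedged sketch, pointing a training run at the long-caption CSV instead of the raw cc3m_train.csv could look like the following; the caption column name "long_caption" is an assumption, so check the header of cc3m_lc.csv and the provided training scripts for the exact arguments:

```bash
# Hypothetical invocation: train on the DreamLIP long-caption CSV.
# The caption column name below is an assumption; inspect cc3m_lc.csv's header.
python src/training/main.py \
    --train-data /path/to/cc3m_lc.csv \
    --csv-img-key filepath \
    --csv-caption-key long_caption
```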
Training scripts are provided in the training_script folder. Please ensure that the path to the teacher's checkpoint is correctly modified before conducting CoM-PT.
To conduct baseline pre-training:
```bash
bash training_script/cc3m_vit/baseline/baseline_vit-b.sh
```

To conduct CoM-PT:

```bash
bash training_script/cc3m_vit/com-pt/com_vit-s_to_vit-b.sh
```

| Network | Method | Train Script | Google Drive |
|---|---|---|---|
| ViT-T/16 | Baseline | sh | baseline_vit-t_e128.pth |
| ViT-S/16 | Baseline | sh | baseline_vit-s_e128 |
| ViT-S/16 | CoM-PT | sh | com_vit-s_e24.pth |
| ViT-B/16 | Baseline | sh | baseline_vit-b_e128.pth |
| ViT-B/16 | CoM-PT | sh | com_vit-b_e18.pth |
| ViT-L/16 | Baseline | sh | baseline_vit-l_e128.pth |
| ViT-L/16 | CoM-PT | sh | com_vit-l_e15.pth |
More model families are currently being prepared for release.
Evaluation on the ImageNet-1K dataset can be performed directly by adding an --eval flag to the training scripts.
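One way to do this is to append the flag to the python command inside the launch script; the sketch below omits the script's existing training arguments, and the --imagenet-val flag follows open_clip's convention for the validation path (an assumption here):

```bash
# Sketch: append --eval to the python command inside a training script
# (e.g. training_script/cc3m_vit/baseline/baseline_vit-b.sh) and rerun it.
# Existing training arguments from the script are omitted for brevity;
# --imagenet-val is assumed to follow open_clip's convention.
python src/training/main.py \
    --eval \
    --imagenet-val /path/to/imagenet/val
```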
The evaluation on MS-COCO and VTAB+ is built upon clip-benchmark; this evaluation code is in preparation for release.
Our codebase is built upon open_clip and clip-kd. We sincerely thank the authors for releasing their amazing code.
If you find our paper and repository helpful, please consider citing our work:
```bibtex
@inproceedings{fan2026compt,
  title={Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models},
  author={Jiawei Fan and Shigeng Wang and Chao Li and Xiaolong Liu and Anbang Yao},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}
```