deep-optimization/CoM-PT

🚀 Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models

Jiawei Fan, Shigeng Wang, Chao Li, Xiaolong Liu, and Anbang Yao

CVPR 2026 arXiv


This repository contains the official PyTorch implementation of CoM-PT.

📢 News

  • [Coming Soon] ⏳ The evaluation code based on clip-benchmark and additional pre-trained VFM model families will be released shortly. Watch/Star this repository to stay updated!
  • [April 2026] 🎉 We release the training code and pre-trained VFM checkpoints on CC3M dataset.
  • [Feb 2026] 🎉 Our paper has been accepted to CVPR 2026!

🛠️ Installation

pip install -r requirements-training.txt
pip install -r requirements-test.txt

Note: We strongly recommend pinning numpy<2.0 in this repository to avoid compatibility issues during training.

🗂️ Dataset Preparation

Conceptual Captions 3M (CC3M)

OpenCLIP reads a CSV file with two columns: a path to an image, and a text caption. The names of the columns are passed as arguments to main.py.
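For reference, the expected layout can be sketched as below. The file name, image paths, and captions are placeholders; `--csv-img-key` and `--csv-caption-key` are the standard open_clip arguments for naming the two columns, and the exact entry point in this repository may differ:

```shell
# Minimal two-column CSV in the layout open_clip expects (tab-separated here).
# File name, image paths, and captions are placeholders.
printf 'filepath\ttitle\n' > cc3m_sample.csv
printf '/data/cc3m/train/0001.jpg\ta dog running on the beach\n' >> cc3m_sample.csv
printf '/data/cc3m/train/0002.jpg\ta red bicycle leaning against a wall\n' >> cc3m_sample.csv

# The column names are then passed to main.py, e.g.:
# python -m training.main \
#   --train-data cc3m_sample.csv \
#   --csv-separator $'\t' \
#   --csv-img-key filepath \
#   --csv-caption-key title
```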

The script src/data/gather_cc.py collects the Conceptual Captions 3M images. First, download the Conceptual Captions 3M URLs, and then run the script from our repository.

For easy notation, we rename Train_GCC-training to cc3m_train, and Validation_GCC-1.1.0-Validation to cc3m_val.

python src/data/gather_cc.py [path/to/cc3m/images/] [path/to/cc3m_train.tsv] [path/to/cc3m_val.tsv]

Our downloaded CC3M training set contains 2.89M images, and our CC3M validation set contains 13K images.

We also provide a direct download of the .zip file: Link to zip

Conceptual 12M (CC12M)

The script src/data/gather_cc12m.py collects the Conceptual 12M images. First, download the Conceptual 12M URLs, and then run the script from our repository:

python src/data/gather_cc12m.py [path/to/cc12m/images/] [path/to/cc12m.tsv]

Since the CC12M dataset is extremely large, the .zip file is currently in preparation for release.

Image Descriptions of CC3M and Merged-15M

We do not directly use the generated cc3m_train.csv and cc12m_train.csv files in our training. Instead, we combine them with MLLM-generated long captions from DreamLIP. You can download cc3m_lc.csv and cc12m_lc.csv here.
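As a sketch of how the short and MLLM-generated long captions could be combined, assuming (hypothetically) that both files are keyed on the image path in their first column — the actual layout of cc3m_lc.csv may differ:

```shell
# Toy short-caption and long-caption files, both keyed on the image path
# in column 1 (a hypothetical layout; check the downloaded files).
cat > short_captions.csv <<'EOF'
img/0001.jpg,a dog on the beach
img/0002.jpg,a red bicycle
EOF
cat > long_captions.csv <<'EOF'
img/0001.jpg,a long MLLM-generated description of a dog at the shore
img/0002.jpg,a long MLLM-generated description of a red bicycle
EOF

# Join the two caption sets on the shared image-path key (join needs sorted input).
join -t, <(sort short_captions.csv) <(sort long_captions.csv) > merged_captions.csv
```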

🚀 Model Training

Training scripts are provided in the training_script folder. Before running CoM-PT, please make sure the path to the teacher checkpoint in the script is set correctly.

To conduct baseline pre-training:

bash training_script/cc3m_vit/baseline/baseline_vit-b.sh

To conduct CoM-PT:

bash training_script/cc3m_vit/com-pt/com_vit-s_to_vit-b.sh
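Teacher-guided pre-training on top of open_clip is typically wired through its distillation arguments such as `--distill-model` and `--distill-pretrained`. The commented sketch below is illustrative only — all paths, model names, and hyperparameters are placeholders, and the flags actually used inside the CoM-PT scripts may differ:

```shell
# Illustrative launch sketch (not the repository's actual script contents).
# All paths and hyperparameters below are placeholders.
# torchrun --nproc_per_node=8 -m training.main \
#   --model ViT-B-16 \
#   --train-data /path/to/cc3m_lc.csv \
#   --csv-img-key filepath --csv-caption-key title \
#   --distill-model ViT-S-16 \
#   --distill-pretrained /path/to/teacher_checkpoint.pth \
#   --batch-size 256 --epochs 18
```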

📦 Model Zoo

ViT Family Pre-trained on the CC3M Dataset

Network    Method    Train Script    Google Drive
ViT-T/16   Baseline  sh              baseline_vit-t_e128.pth
ViT-S/16   Baseline  sh              baseline_vit-s_e128.pth
ViT-S/16   CoM-PT    sh              com_vit-s_e24.pth
ViT-B/16   Baseline  sh              baseline_vit-b_e128.pth
ViT-B/16   CoM-PT    sh              com_vit-b_e18.pth
ViT-L/16   Baseline  sh              baseline_vit-l_e128.pth
ViT-L/16   CoM-PT    sh              com_vit-l_e15.pth

More model families are currently being prepared for release.

📊 Model Evaluation

Evaluation on the ImageNet-1K dataset can be performed directly by adding an --eval flag to the training scripts.

The evaluation on MS-COCO and VTAB+ is built upon clip-benchmark, which is in preparation for release.

🙏 Acknowledgement

Our codebase is built upon open_clip and clip-kd. We sincerely thank the authors for releasing their amazing code.

📝 Citation

If you find our paper and repository helpful, please consider citing our work:

@inproceedings{fan2026compt,
  title={Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models},
  author={Fan, Jiawei and Wang, Shigeng and Li, Chao and Liu, Xiaolong and Yao, Anbang},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}
