Jiawei Fan, Shigeng Wang, Chao Li, Xiaolong Liu, and Anbang Yao
This repository contains the official PyTorch implementation of CoM-PT.
- [Coming Soon] ⏳ The evaluation code on clip-benchmark and additional pre-trained VFM model families will be released shortly. Watch/Star this repository to stay updated!
- [April 2026] 🎉 We release the training code and pre-trained VFM checkpoints on CC3M dataset.
- [Feb 2026] 🎉 Our paper has been accepted to CVPR 2026!
```bash
pip install -r requirements-training.txt
pip install -r requirements-test.txt
```

Note: We strongly recommend using `numpy<2.0` in this repository to avoid unnecessary issues during training.
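If your environment already ships NumPy 2.x, one way to pin it (assuming a pip-managed environment) is:

```bash
# Pin NumPy below 2.0 to avoid compatibility issues during training.
pip install "numpy<2.0"
```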
OpenCLIP reads a CSV file with two columns: a path to an image, and a text caption. The names of the columns are passed as arguments to main.py.
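For reference, a minimal sketch of the expected CSV layout and the corresponding arguments (the column names and the entry-point path below are illustrative; the flag names assume open_clip's standard CLI):

```bash
# Illustrative CSV layout (hypothetical train.csv):
#   filepath,title
#   /data/cc3m/train/0000001.jpg,a dog running on the beach
#
# Column names are passed to main.py via open_clip-style arguments:
python src/training/main.py \
    --train-data /path/to/train.csv \
    --csv-separator "," \
    --csv-img-key filepath \
    --csv-caption-key title
```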
The script src/data/gather_cc.py collects the Conceptual Captions 3M images. First, download the Conceptual Captions 3M URLs, and then run the script from our repository.
For easy notation, we rename Train_GCC-training to cc3m_train, and Validation_GCC-1.1.0-Validation to cc3m_val.
```bash
python src/data/gather_cc.py [path/to/cc3m/images/] [path/to/cc3m_train.tsv] [path/to/cc3m_val.tsv]
```

Our downloaded CC3M training set contains 2.89M images, and our CC3M validation set contains 13K images.
We also provide a URL where you can directly download the `.zip` file: Link to zip
The script src/data/gather_cc12m.py collects the Conceptual 12M images. First, download the Conceptual 12M URLs, and then run the script from our repository:
```bash
python src/data/gather_cc12m.py [path/to/cc12m/images/] [path/to/cc12m.tsv]
```

Since the CC12M dataset is extremely large, the `.zip` file is currently in preparation for release.
We do not directly use the generated cc3m_train.csv and cc12m_train.csv files in our training. Instead, we combine them with MLLM-generated long captions from DreamLIP. You can download cc3m_lc.csv and cc12m_lc.csv here.
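As a rough, hedged sketch, pointing a training run at the long-caption CSV instead of the raw cc3m_train.csv could look like the following; the caption column name "long_caption" is an assumption, so check the header of cc3m_lc.csv and the provided training scripts for the exact arguments:

```bash
# Hypothetical invocation: train on the DreamLIP long-caption CSV.
# The caption column name below is an assumption; inspect cc3m_lc.csv's header.
python src/training/main.py \
    --train-data /path/to/cc3m_lc.csv \
    --csv-img-key filepath \
    --csv-caption-key long_caption
```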
Training scripts are provided in the training_script folder. Please ensure that the path to the teacher's checkpoint is correctly modified before conducting CoM-PT.
To conduct baseline pre-training:
```bash
bash training_script/cc3m_vit/baseline/baseline_vit-b.sh
```

To conduct CoM-PT:

```bash
bash training_script/cc3m_vit/com-pt/com_vit-s_to_vit-b.sh
```

| Network | Method | Train Script | Google Drive |
|---|---|---|---|
| ViT-T/16 | Baseline | sh | baseline_vit-t_e128.pth |
| ViT-S/16 | Baseline | sh | baseline_vit-s_e128 |
| ViT-S/16 | CoM-PT | sh | com_vit-s_e24.pth |
| ViT-B/16 | Baseline | sh | baseline_vit-b_e128.pth |
| ViT-B/16 | CoM-PT | sh | com_vit-b_e18.pth |
| ViT-L/16 | Baseline | sh | baseline_vit-l_e128.pth |
| ViT-L/16 | CoM-PT | sh | com_vit-l_e15.pth |
More model families are currently being prepared for release.
Evaluation on the ImageNet-1K dataset can be performed directly by adding an --eval flag to the training scripts.
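One way to do this is to append the flag to the python command inside the launch script; the sketch below omits the script's existing training arguments, and the --imagenet-val flag follows open_clip's convention for the validation path (an assumption here):

```bash
# Sketch: append --eval to the python command inside a training script
# (e.g. training_script/cc3m_vit/baseline/baseline_vit-b.sh) and rerun it.
# Existing training arguments from the script are omitted for brevity;
# --imagenet-val is assumed to follow open_clip's convention.
python src/training/main.py \
    --eval \
    --imagenet-val /path/to/imagenet/val
```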
The evaluation on MS-COCO and VTAB+ is built upon clip-benchmark; this evaluation code is in preparation for release.
Our codebase is built upon open_clip and clip-kd. We sincerely thank the authors for releasing their amazing code.
If you find our paper and repository helpful, please consider citing our work:
```bibtex
@inproceedings{fan2026compt,
  title={Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models},
  author={Jiawei Fan and Shigeng Wang and Chao Li and Xiaolong Liu and Anbang Yao},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}
```