Feature request
Meta released V-JEPA 2.1 on 2026-03-16 with four pretrained video encoders at 384 resolution (ViT-B 80M, ViT-L 300M, ViT-g 1B, ViT-G 2B). The existing `vjepa2` model family in `transformers` supports V-JEPA 2.0 but not 2.1.
V-JEPA 2.1 introduces several architectural changes over 2.0: corrected RoPE implementation, learnable modality embeddings, hierarchical feature extraction with per-layer norms, separate image patch embedding, RoPE position interpolation, and a predictor context token projection. These require config and modeling extensions beyond what the current `VJEPA2Model` supports.
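Of the changes above, RoPE position interpolation is the easiest to illustrate in isolation. A minimal sketch of the idea, not the actual V-JEPA 2.1 code (the function names and the simple linear rescaling are my assumptions; the real implementation may interpolate differently):

```python
def interpolated_positions(num_patches: int, pretrain_patches: int) -> list[float]:
    # Rescale patch indices at inference resolution onto the pretraining
    # position range, so each RoPE frequency band sees the same span of
    # angles the encoder was trained on. (Illustrative linear scaling.)
    scale = pretrain_patches / num_patches
    return [i * scale for i in range(num_patches)]

def rope_angles(pos: float, dim: int, base: float = 10000.0) -> list[float]:
    # Rotary angles for one (possibly fractional) position across
    # dim // 2 frequency bands, as in standard RoPE.
    return [pos / base ** (2 * k / dim) for k in range(dim // 2)]

# E.g. running at 32 patches per side after pretraining on 24 keeps
# positions inside the trained range [0, 24):
positions = interpolated_positions(32, 24)
```

With interpolation, angles at the new resolution stay within the range seen during pretraining rather than extrapolating past it.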
Paper: https://huggingface.co/papers/2603.14482
Code: https://github.com/facebookresearch/vjepa2 (see `app/vjepa_2_1/`)
Motivation
V-JEPA 2.1 checkpoints are currently loadable only through Meta's `torch.hub` interface. Adding `transformers` support would let users load these models via `from_pretrained` with standard HF APIs, consistent with the existing V-JEPA 2.0 integration.
There is also an open request from the HF team to Meta to upload the 2.1 weights to the Hub: facebookresearch/vjepa2#137.
Your contribution
I have a working implementation on a branch that extends the existing `vjepa2` model family with backward-compatible config fields and modeling changes. It is verified end-to-end against Meta's reference implementation (ViT-B/384 checkpoint, encoder max diff 0.0001, predictor max diff 0.008). All existing tests pass. I will open a PR shortly.
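For clarity, the "max diff" figures refer to the largest elementwise absolute difference between the reference outputs and this branch's outputs for the same input. A minimal sketch of that check on flattened feature vectors (`check_parity` is a name introduced here for illustration, not the branch's actual harness):

```python
def check_parity(ref: list[float], ours: list[float], tol: float) -> float:
    # Largest elementwise absolute difference between two flattened
    # output tensors; fails if it exceeds the chosen tolerance.
    assert len(ref) == len(ours)
    diff = max(abs(r - o) for r, o in zip(ref, ours))
    assert diff <= tol, f"parity failed: {diff} > {tol}"
    return diff

# Tolerances corresponding to the figures above:
# encoder outputs within 1e-4, predictor outputs within 1e-2.
```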