
Add V-JEPA 2.1 inference support #45496

@davevanveen


Feature request

Meta released V-JEPA 2.1 on 2026-03-16 with four pretrained video encoders at 384 resolution (ViT-B 80M, ViT-L 300M, ViT-g 1B, ViT-G 2B). The existing vjepa2 model family in transformers supports V-JEPA 2.0 but not 2.1.

V-JEPA 2.1 introduces several architectural changes over 2.0:

- corrected RoPE implementation
- learnable modality embeddings
- hierarchical feature extraction with per-layer norms
- separate image patch embedding
- RoPE position interpolation
- a predictor context token projection

These require config and modeling extensions beyond what the current VJEPA2Model supports.
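To illustrate one of these changes: RoPE position interpolation typically rescales position indices so an encoder trained on a fixed patch grid can run on a larger one. A minimal sketch under that assumption (function and parameter names are mine, not taken from the reference code):

```python
def rope_angles(pos, head_dim, base=10000.0, scale=1.0):
    """Rotary-embedding angles for a single position index.

    `scale` < 1.0 compresses position indices, mapping a larger
    inference-time grid back into the position range seen during
    training -- the usual form of RoPE position interpolation.
    Hypothetical sketch; not the actual V-JEPA 2.1 implementation.
    """
    p = pos * scale
    return [p / base ** (2 * i / head_dim) for i in range(head_dim // 2)]

# Running at 2x the trained grid size with scale=0.5 reproduces the
# angles the model saw for the corresponding training position.
assert rope_angles(4, 8, scale=0.5) == rope_angles(2, 8)
```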

Paper: https://huggingface.co/papers/2603.14482
Code: https://github.com/facebookresearch/vjepa2 (see app/vjepa_2_1/)

Motivation

V-JEPA 2.1 checkpoints are currently only loadable through Meta's torch.hub interface. Adding transformers support would let users load these models via from_pretrained with standard HF APIs, consistent with the existing V-JEPA 2.0 integration.

There is also an open request from the HF team to Meta to upload the 2.1 weights to the Hub: facebookresearch/vjepa2#137.

Your contribution

I have a working implementation on a branch that extends the existing vjepa2 model family with backward-compatible config fields and modeling changes. I verified it end-to-end against Meta's reference implementation (ViT-B/384 checkpoint: encoder max output diff 0.0001, predictor max output diff 0.008). All existing tests pass. I will open a PR shortly.
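To illustrate what "backward-compatible config fields" could look like in practice, here is a minimal sketch (all field names are hypothetical, not from the branch): the 2.1 additions default to 2.0 behavior, so existing checkpoints and configs load unchanged.

```python
from dataclasses import dataclass


@dataclass
class VJEPA21ConfigSketch:
    """Hypothetical config sketch, not the actual VJEPA2Config."""

    # Existing 2.0-style field (subset, for illustration only).
    hidden_size: int = 1024
    # Hypothetical 2.1 additions; defaults reproduce 2.0 behavior.
    use_modality_embeddings: bool = False
    use_per_layer_norms: bool = False
    rope_interpolation_factor: float = 1.0  # 1.0 = no interpolation


# A config built with no arguments behaves like a 2.0 config,
# while 2.1 checkpoints opt in to the new fields explicitly.
legacy = VJEPA21ConfigSketch()
v21 = VJEPA21ConfigSketch(use_modality_embeddings=True,
                          rope_interpolation_factor=0.5)
```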
