
Add V-JEPA 2.1 inference support #45496

@davevanveen


Feature request

Meta released V-JEPA 2.1 on 2026-03-16 with four pretrained video encoders at 384 resolution (ViT-B 80M, ViT-L 300M, ViT-g 1B, ViT-G 2B). The existing vjepa2 model family in transformers supports V-JEPA 2.0 but not 2.1.

V-JEPA 2.1 introduces several architectural changes over 2.0:

- corrected RoPE implementation
- learnable modality embeddings
- hierarchical feature extraction with per-layer norms
- separate image patch embedding
- RoPE position interpolation
- a predictor context token projection

These require config and modeling extensions beyond what the current VJEPA2Model supports.
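To illustrate one of these changes: RoPE position interpolation typically rescales position indices so an encoder trained on a fixed patch grid can run on a larger one. A minimal sketch under that assumption (function and parameter names are mine, not taken from the reference code):

```python
def rope_angles(pos, head_dim, base=10000.0, scale=1.0):
    """Rotary-embedding angles for a single position index.

    `scale` < 1.0 compresses position indices, mapping a larger
    inference-time grid back into the position range seen during
    training -- the usual form of RoPE position interpolation.
    Hypothetical sketch; not the actual V-JEPA 2.1 implementation.
    """
    p = pos * scale
    return [p / base ** (2 * i / head_dim) for i in range(head_dim // 2)]

# Running at 2x the trained grid size with scale=0.5 reproduces the
# angles the model saw for the corresponding training position.
assert rope_angles(4, 8, scale=0.5) == rope_angles(2, 8)
```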

Paper: https://huggingface.co/papers/2603.14482
Code: https://github.com/facebookresearch/vjepa2 (see app/vjepa_2_1/)

Motivation

V-JEPA 2.1 checkpoints are currently only loadable through Meta's torch.hub interface. Adding transformers support would let users load these models via from_pretrained with standard HF APIs, consistent with the existing V-JEPA 2.0 integration.

There is also an open request from the HF team to Meta to upload the 2.1 weights to the Hub: facebookresearch/vjepa2#137.

Your contribution

I have a working implementation on a branch that extends the existing vjepa2 model family with backward-compatible config fields and modeling changes. I verified it end-to-end against Meta's reference implementation (ViT-B/384 checkpoint: encoder max output diff 0.0001, predictor max output diff 0.008). All existing tests pass. I will open a PR shortly.
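To illustrate what "backward-compatible config fields" could look like in practice, here is a minimal sketch (all field names are hypothetical, not from the branch): the 2.1 additions default to 2.0 behavior, so existing checkpoints and configs load unchanged.

```python
from dataclasses import dataclass


@dataclass
class VJEPA21ConfigSketch:
    """Hypothetical config sketch, not the actual VJEPA2Config."""

    # Existing 2.0-style field (subset, for illustration only).
    hidden_size: int = 1024
    # Hypothetical 2.1 additions; defaults reproduce 2.0 behavior.
    use_modality_embeddings: bool = False
    use_per_layer_norms: bool = False
    rope_interpolation_factor: float = 1.0  # 1.0 = no interpolation


# A config built with no arguments behaves like a 2.0 config,
# while 2.1 checkpoints opt in to the new fields explicitly.
legacy = VJEPA21ConfigSketch()
v21 = VJEPA21ConfigSketch(use_modality_embeddings=True,
                          rope_interpolation_factor=0.5)
```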
