Add ViT attention plugin FMHA cubins and packed cu_seqlens support #82

Open
micwill755 wants to merge 4 commits into NVIDIA:main from micwill755:feature/torch-tensorrt-python-runtime-vit

@micwill755 micwill755 commented May 11, 2026

Summary

This PR adds generated ViT FMHA cubins from TensorRT-LLM into TensorRT-Edge-LLM under vitAttentionKernels and wires them into ViTAttentionPlugin. The plugin now supports both dense additive masks and packed cu_seqlens attention regions, enabling Qwen-style windowed ViT attention without requiring dense [S, S] masks.
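
To illustrate why packed cu_seqlens can stand in for a dense [S, S] mask, here is a minimal pure-Python sketch. It assumes the usual convention that cu_seqlens holds cumulative region boundaries starting at 0, and that the equivalent dense additive mask is 0 inside a region and negative infinity across regions. The helper names and the window sizes are hypothetical and are not taken from the plugin code.

```python
import math

def cu_seqlens_from_lengths(lengths):
    """Prefix sums of per-region token counts, starting at 0."""
    cu = [0]
    for n in lengths:
        cu.append(cu[-1] + n)
    return cu

def dense_additive_mask(cu_seqlens):
    """Block-diagonal [S, S] mask: 0.0 inside a region, -inf across regions."""
    total = cu_seqlens[-1]
    mask = [[-math.inf] * total for _ in range(total)]
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        for i in range(start, end):
            for j in range(start, end):
                mask[i][j] = 0.0
    return mask

# Hypothetical example: three attention windows of 4, 4, and 2 tokens.
lengths = [4, 4, 2]
cu = cu_seqlens_from_lengths(lengths)   # region boundaries
mask = dense_additive_mask(cu)          # equivalent dense mask

packed_elems = len(cu)                  # grows with the number of regions
dense_elems = len(mask) * len(mask[0])  # grows with S * S
```

For this toy input the packed form stores one boundary per region plus one, while the dense form stores S * S entries, which is the saving the summary refers to.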

Changes

  • Added embedded ViT FMHA cubin assets and vit_fmha_cubin.h.
  • Added ViT FMHA runner support for packed cu_seqlens.
  • Added mask_type and max_seq_len plugin fields.
  • Added support for Qwen2.5-VL head_dim=80.
  • Added ONNX export support for trt::ViTAttentionPlugin.
  • Updated Qwen2.5-VL visual ONNX export to emit ViT attention plugin nodes.
  • Updated plugin format checks for mixed inputs:
    • QKV/cos/sin/output: FP16
    • cu_seqlens: INT32
    • Dense additive masks: same dtype as QKV
  • Validated end-to-end visual paths for Qwen2.5-VL, Llama 3.2 Vision/Mllama, and GR00T/Eagle/SigLIP-style models.
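
The mixed-input format rules listed above can be sketched as a small predicate. This is only an illustration in Python, not the plugin's actual format-check code: the tensor keys (`qkv`, `cos`, `sin`, `output`, `attn_region`), the dtype strings, and the `mask_type` values `"dense"` and `"cu_seqlens"` are all assumptions made for the example.

```python
def is_supported_combination(dtypes, mask_type):
    """Return True if the dtype assignment matches the rules in the list above.

    dtypes: dict mapping a (hypothetical) tensor name to a dtype string.
    mask_type: "cu_seqlens" or "dense" (hypothetical values).
    """
    # QKV, cos, sin, and output must all be FP16.
    for name in ("qkv", "cos", "sin", "output"):
        if dtypes.get(name) != "fp16":
            return False
    region_dtype = dtypes.get("attn_region")
    if mask_type == "cu_seqlens":
        # Packed region boundaries are integer offsets.
        return region_dtype == "int32"
    if mask_type == "dense":
        # A dense additive mask is added to the scores, so it follows QKV.
        return region_dtype == dtypes["qkv"]
    return False

base = {"qkv": "fp16", "cos": "fp16", "sin": "fp16", "output": "fp16"}
packed_ok = is_supported_combination({**base, "attn_region": "int32"}, "cu_seqlens")
dense_ok = is_supported_combination({**base, "attn_region": "fp16"}, "dense")
dense_bad = is_supported_combination({**base, "attn_region": "int32"}, "dense")
```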

Testing

  • Built NvInfer_edgellm_plugin.
  • Ran ViT attention plugin correctness example.
  • Ran end-to-end ViT attention plugin benchmark.
  • Confirmed output shapes match PyTorch references across tested visual models.
  • Exported Qwen2.5-VL visual ONNX with trt::ViTAttentionPlugin nodes in the graph.
  • Built and serialized a TensorRT visual engine from the exported ONNX graph.
  • Deserialized the saved TensorRT engine and ran inference with trtexec.
  • Ran a semantic correctness check comparing the deserialized TensorRT engine against PyTorch.
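
The description does not say which metric the semantic correctness check uses. As a hedged sketch, a comparison of this kind is often done with a maximum absolute difference and a cosine similarity over flattened outputs. The function names and the sample values below are hypothetical.

```python
import math

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two flat sequences."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same number of elements")
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two flat sequences of floats."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same number of elements")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical flattened outputs from the two runtimes.
reference_out = [1.0, 2.0, 3.0]
engine_out = [1.0, 2.0, 3.5]
diff = max_abs_diff(reference_out, engine_out)
similarity = cosine_similarity(reference_out, engine_out)
```

With FP16 kernels, small nonzero differences against an FP32 reference are expected, so any pass threshold would need to be chosen per model.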

…as trt::ViTAttentionPlugin custom nodes instead of plain ONNX attention subgraphs
…into TensorRT-Edge-LLM under vitAttentionKernels
@micwill755 micwill755 requested a review from a team May 11, 2026 23:18
@nvluxiaoz
Collaborator

packed cu_seqlens attention regions - I think this is something we already supported. @nvamberl can you review and confirm if this MR adds additional features for Edge-LLM? Thanks!
