Add ViT attention plugin FMHA cubins and packed cu_seqlens support by micwill755 · Pull Request #82 · NVIDIA/TensorRT-Edge-LLM

micwill755 · 2026-05-11T23:18:56Z

Summary

This PR adds generated ViT FMHA cubins from TensorRT-LLM into TensorRT-Edge-LLM under vitAttentionKernels and wires them into ViTAttentionPlugin. The plugin now supports both dense additive masks and packed cu_seqlens attention regions, enabling Qwen-style windowed ViT attention without requiring dense [S, S] masks.
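To illustrate why packed cu_seqlens can stand in for a dense [S, S] mask, the sketch below expands a cumulative sequence-length array into the equivalent block-diagonal additive mask. It is plain Python for illustration only: the helper name `dense_mask_from_cu_seqlens` and the use of `-inf` for masked positions are assumptions, not the plugin's API or its actual mask values.

```python
import math


def dense_mask_from_cu_seqlens(cu_seqlens):
    """Expand packed region boundaries into an [S, S] additive mask.

    cu_seqlens is a cumulative array [0, n0, n0 + n1, ...]. Tokens i and j
    may attend to each other only if they fall in the same region.
    Hypothetical helper for illustration, not part of the plugin.
    """
    total = cu_seqlens[-1]
    region_of = [0] * total
    # Label every token with the index of the region that contains it.
    for region, (start, end) in enumerate(zip(cu_seqlens, cu_seqlens[1:])):
        for tok in range(start, end):
            region_of[tok] = region
    # 0.0 keeps a score unchanged; -inf removes it from the softmax.
    return [
        [0.0 if region_of[i] == region_of[j] else -math.inf for j in range(total)]
        for i in range(total)
    ]


cu_seqlens = [0, 2, 5]  # two regions: tokens 0-1 and tokens 2-4
mask = dense_mask_from_cu_seqlens(cu_seqlens)
```

The packed form needs one integer per region boundary, while the dense form needs S × S values, which is the saving the summary refers to.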

Changes

Added embedded ViT FMHA cubin assets and vit_fmha_cubin.h.
Added ViT FMHA runner support for packed cu_seqlens.
Added mask_type and max_seq_len plugin fields.
Added support for Qwen2.5-VL head_dim=80.
Added ONNX export support for trt::ViTAttentionPlugin.
Updated Qwen2.5-VL visual ONNX export to emit ViT attention plugin nodes.
Updated plugin format checks for mixed inputs:
- QKV/cos/sin/output: FP16
- cu_seqlens: INT32
- Dense additive masks: same dtype as QKV
Validated end-to-end visual paths for Qwen2.5-VL, Llama 3.2 Vision/Mllama, and GR00T/Eagle/SigLIP-style models.
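The mixed-input format rule listed above can be sketched as a small predicate. This is a Python illustration of the rule only, not the plugin's C++ format check: the function name, the tensor names used as dictionary keys, and the `mask_type` values `"cu_seqlens"` and `"dense"` are all assumptions made for the example.

```python
# Tensors that the PR description says must be FP16 (names are illustrative).
FP16_TENSORS = ("qkv", "cos", "sin", "output")


def formats_supported(dtypes, mask_type):
    """Return True if the dtype combination satisfies the described rule.

    dtypes maps a tensor name to a dtype string such as "fp16" or "int32".
    mask_type values are hypothetical stand-ins for the plugin field.
    """
    if any(dtypes.get(name) != "fp16" for name in FP16_TENSORS):
        return False
    if mask_type == "cu_seqlens":
        # Packed region boundaries are integer offsets.
        return dtypes.get("cu_seqlens") == "int32"
    if mask_type == "dense":
        # A dense additive mask must share the QKV dtype.
        return dtypes.get("mask") == dtypes["qkv"]
    return False


base = {"qkv": "fp16", "cos": "fp16", "sin": "fp16", "output": "fp16"}
packed_ok = formats_supported({**base, "cu_seqlens": "int32"}, "cu_seqlens")
dense_ok = formats_supported({**base, "mask": "fp16"}, "dense")
dense_bad = formats_supported({**base, "mask": "fp32"}, "dense")
```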

Testing

Built NvInfer_edgellm_plugin.
Ran ViT attention plugin correctness example.
Ran end-to-end ViT attention plugin benchmark.
Confirmed output shapes match PyTorch references across tested visual models.
Exported Qwen2.5-VL visual ONNX with trt::ViTAttentionPlugin nodes in the graph.
Built and serialized a TensorRT visual engine from the exported ONNX graph.
Deserialized the saved TensorRT engine and ran inference with trtexec.
Ran semantic correctness check comparing the deserialized TensorRT engine against PyTorch.
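The PR does not state which metric the semantic check used. One common approach, sketched here purely as an assumption, is to flatten both outputs and compare them by cosine similarity and maximum absolute difference. The function name and the sample values are hypothetical.

```python
import math


def compare_outputs(reference, candidate):
    """Return (cosine similarity, max absolute difference) for two flat lists."""
    if len(reference) != len(candidate):
        raise ValueError("output sizes differ")
    dot = sum(r * c for r, c in zip(reference, candidate))
    norm_r = math.sqrt(sum(r * r for r in reference))
    norm_c = math.sqrt(sum(c * c for c in candidate))
    if norm_r == 0.0 or norm_c == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    cosine = dot / (norm_r * norm_c)
    max_abs_diff = max(abs(r - c) for r, c in zip(reference, candidate))
    return cosine, max_abs_diff


# Made-up values standing in for a PyTorch reference and an engine output.
reference = [0.5, -1.25, 2.0, 0.0]
candidate = [0.5, -1.25, 2.0, 0.001]
cosine, max_abs_diff = compare_outputs(reference, candidate)
```

Reporting both numbers is useful because cosine similarity alone can hide a single large outlier, and a maximum difference alone says nothing about overall agreement. Acceptable thresholds for FP16 engines depend on the model and are not given in the PR.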


nvluxiaoz · 2026-05-12T00:20:35Z

packed cu_seqlens attention regions - I think this is something we already supported. @nvamberl can you review and confirm if this MR adds additional features for Edge-LLM? Thanks!

micwill755 added 2 commits May 10, 2026 05:00

Qwen2.5-VL’s 32 visual transformer attention blocks are now exported as trt::ViTAttentionPlugin custom nodes instead of plain ONNX attention subgraphs

cb2fb7c

Tested Generated ViT FMHA cubins from TensorRT-LLM and embedded them into TensorRT-Edge-LLM under vitAttentionKernels

2101279

micwill755 requested a review from a team May 11, 2026 23:18

micwill755 added 2 commits May 12, 2026 16:42

Added sm90 for h100 testing.

84c0fcb

Add compact block mask support for ViT attention plugin

3901efb

micwill755 wants to merge 4 commits into NVIDIA:main from micwill755:feature/torch-tensorrt-python-runtime-vit