Problem
Hello, thank you for your work on this project.
I encountered a performance issue while deploying the Transformer model and would like to ask whether this is a known limitation or whether there is a recommended deployment approach.
I tested inference performance with a fixed input shape of (4, 3, 256, 256) and observed that ONNX Runtime GPU is significantly slower than PyTorch:
- PyTorch: 0.17 ~ 0.19 s / batch
- Python ONNX Runtime GPU: 0.35 ~ 0.40 s / batch
- C++ ONNX Runtime GPU: 0.35 ~ 0.40 s / batch
In this case, ONNX Runtime GPU is about 2x slower than PyTorch.
Also, Python ORT and C++ ORT show very similar latency, so this does not appear to be caused by Python wrapper overhead.
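For reference, a minimal sketch of how per-batch timings like these can be collected so that both runtimes are compared under the same conditions. This is not the exact script behind the numbers above; `benchmark`, `summarize`, `run_batch`, and `sync` are hypothetical names chosen for illustration. The warm-up iterations are excluded because the first runs may include one-time costs such as cuDNN algorithm selection.

```python
import statistics
import time


def benchmark(run_batch, warmup=10, iters=50, sync=None):
    """Return per-batch latencies in seconds, measured after warm-up.

    run_batch: zero-argument callable that runs one inference batch.
    sync: optional zero-argument callable that blocks until the GPU is idle
          (for example torch.cuda.synchronize when timing PyTorch).
    """
    # Warm-up runs are not timed: they may include lazy initialization.
    for _ in range(warmup):
        run_batch()
    if sync is not None:
        sync()

    latencies = []
    for _ in range(iters):
        start = time.perf_counter()
        run_batch()
        if sync is not None:
            sync()  # make sure asynchronous GPU work is included in the timing
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Reduce a list of latencies to min / median / max."""
    return {
        "min": min(latencies),
        "median": statistics.median(latencies),
        "max": max(latencies),
    }
```

For PyTorch, `sync=torch.cuda.synchronize` would be passed so that asynchronous kernel launches are not under-counted. For ONNX Runtime with NumPy inputs and outputs, `session.run` returns host arrays, so the call is assumed to already include the device-to-host copy.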
What I have checked
I have already tried the following:
- Removed the style output
  - This made almost no difference.
- Exported the model with static batch and static input shape
  - This also made almost no difference.
- Verified through profiling that the main computation is running on CUDAExecutionProvider.
- Checked the main hotspots in profiling:
  - the first Conv
  - later Gemm/MatMul
  - some Reshape/Transpose ops
- When setting cudnn_conv_algo_search=DEFAULT, the log shows that Conv runs in Fallback mode, and performance becomes even worse.
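The session configuration being varied in these checks can be sketched roughly as follows. This is an illustrative sketch, not the exact code used: `cuda_providers` and `make_session` are hypothetical helper names, and the model path is whatever the export step produced. The option keys (`cudnn_conv_algo_search`, `cudnn_conv_use_max_workspace`) are CUDAExecutionProvider options. Whether `EXHAUSTIVE` or `HEURISTIC` is the better choice for this model is an open question here, not a recommendation.

```python
VALID_ALGO_SEARCH = ("EXHAUSTIVE", "HEURISTIC", "DEFAULT")


def cuda_providers(algo_search="EXHAUSTIVE", device_id=0):
    """Build a providers list for ONNX Runtime with explicit CUDA options."""
    if algo_search not in VALID_ALGO_SEARCH:
        raise ValueError(f"algo_search must be one of {VALID_ALGO_SEARCH}")
    cuda_options = {
        "device_id": device_id,
        # How cuDNN picks a convolution algorithm.
        "cudnn_conv_algo_search": algo_search,
        # Allow cuDNN a larger workspace when searching for conv algorithms.
        "cudnn_conv_use_max_workspace": "1",
    }
    # CPU is listed last as a fallback for ops the CUDA provider cannot run.
    return [("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"]


def make_session(model_path, algo_search="EXHAUSTIVE", profile=False):
    """Create an InferenceSession; onnxruntime is imported lazily."""
    import onnxruntime as ort  # requires the onnxruntime-gpu package

    so = ort.SessionOptions()
    so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    so.enable_profiling = profile  # writes a JSON trace when True
    return ort.InferenceSession(
        model_path,
        sess_options=so,
        providers=cuda_providers(algo_search),
    )
```

Calling `session.get_providers()` on the result shows which providers were actually registered, and with `profile=True`, `session.end_profiling()` returns the path of the trace file used to identify the hotspots listed above.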
Questions
I would like to ask:
- Have you tested this Transformer model on ONNX Runtime GPU and compared its performance against PyTorch?
- Is there a recommended ONNX export method or deployment configuration for this model?
- In your experience, is this model better suited for TensorRT than for ONNX Runtime CUDAExecutionProvider?
Additional information
If needed, I can also provide:
- ONNX export code
- ONNX Runtime profiling results
- a minimal reproducible script
Thank you.