ONNX Runtime GPU is ~2x slower than PyTorch for cellpose model inference

## Problem

Hello, thank you for your work on this project.

I encountered a performance issue while deploying the Transformer model and would like to ask whether this is a known limitation or whether there is a recommended deployment approach.

I tested inference performance with a fixed input shape of **(4, 3, 256, 256)** and observed that ONNX Runtime GPU is significantly slower than PyTorch:

- **PyTorch:** 0.17 ~ 0.19 s / batch
- **Python ONNX Runtime GPU:** 0.35 ~ 0.40 s / batch
- **C++ ONNX Runtime GPU:** 0.35 ~ 0.40 s / batch

In this case, **ONNX Runtime GPU is about 2x slower than PyTorch** (roughly 1.8x to 2.4x, depending on which ends of the ranges above are compared).

Also, Python ORT and C++ ORT show very similar latency, so this does not appear to be caused by Python wrapper overhead.
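
For context, here is a minimal sketch of how per-batch timings like the ones above can be measured. It is not the exact script behind those numbers: the helper name `time_per_batch`, the warm-up and iteration counts, and the input name `"input"` in the usage comments are all assumptions made for illustration.

```python
import statistics
import time
from typing import Callable, Optional


def time_per_batch(
    run_batch: Callable[[], object],
    warmup: int = 10,
    iters: int = 50,
    sync: Optional[Callable[[], None]] = None,
) -> dict:
    """Return min/median/max wall-clock seconds per call of run_batch."""
    sync = sync or (lambda: None)

    # Warm-up runs absorb one-time costs such as cuDNN algorithm
    # selection and memory-arena growth, so they are not timed.
    for _ in range(warmup):
        run_batch()
    sync()

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run_batch()
        sync()  # wait for queued GPU work before stopping the clock
        samples.append(time.perf_counter() - start)

    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


# Hypothetical usage (model, session, x and x_np are placeholders):
#   PyTorch: time_per_batch(lambda: model(x), sync=torch.cuda.synchronize)
#   ORT:     time_per_batch(lambda: session.run(None, {"input": x_np}))
```

The `sync` hook matters on the PyTorch side because CUDA kernels are launched asynchronously; without `torch.cuda.synchronize()` the timer can stop before the GPU has finished. `session.run` with NumPy inputs returns host arrays, so no extra synchronization is passed in that case.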

---

## What I have checked

I have already tried the following:

- Removed the `style` output
  - This made almost no difference.
- Exported the model with **static batch** and **static input shape**
  - This also made almost no difference.
- Verified through profiling that the main computation is running on **CUDAExecutionProvider**.
- Checked the main hotspots in profiling:
  - the first `Conv`
  - later `Gemm/MatMul`
  - some `Reshape/Transpose` ops
- Set `cudnn_conv_algo_search=DEFAULT`
  - The log shows that Conv runs in **Fallback mode**, and performance becomes even worse.
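
The checks above can be reproduced with a setup along these lines. This is a sketch under stated assumptions, not my exact code: the helper names (`cuda_providers`, `create_profiled_session`, `load_profile`, `summarize_profile`) are hypothetical, and the profile field names (`cat`, `name`, `dur`, `args.op_name`, `args.provider`) reflect my understanding of the JSON trace that ONNX Runtime writes, which may differ between versions.

```python
import json
from collections import defaultdict

VALID_ALGO_SEARCH = {"EXHAUSTIVE", "HEURISTIC", "DEFAULT"}


def cuda_providers(algo_search="EXHAUSTIVE", device_id=0):
    """Build the providers list with an explicit cuDNN conv algorithm search."""
    if algo_search not in VALID_ALGO_SEARCH:
        raise ValueError(f"unknown cudnn_conv_algo_search: {algo_search!r}")
    cuda_options = {
        "device_id": device_id,
        "cudnn_conv_algo_search": algo_search,
    }
    # CPUExecutionProvider stays last as the fallback for unsupported ops.
    return [("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"]


def create_profiled_session(model_path, algo_search="EXHAUSTIVE"):
    """Create a session with profiling enabled (needs onnxruntime-gpu)."""
    import onnxruntime as ort  # imported lazily so the helpers work without it

    opts = ort.SessionOptions()
    opts.enable_profiling = True
    return ort.InferenceSession(
        model_path, sess_options=opts, providers=cuda_providers(algo_search)
    )


def load_profile(path):
    """Load the JSON trace whose path is returned by session.end_profiling()."""
    with open(path) as f:
        return json.load(f)


def summarize_profile(events, top=10):
    """Sum kernel time (microseconds) per (op type, provider), largest first."""
    totals = defaultdict(int)
    for ev in events:
        # Assumed layout: per-node kernel events have cat == "Node" and a
        # name ending in "_kernel_time"; other events are skipped.
        if ev.get("cat") != "Node":
            continue
        if not ev.get("name", "").endswith("_kernel_time"):
            continue
        args = ev.get("args", {})
        key = (args.get("op_name", "?"), args.get("provider", "?"))
        totals[key] += ev.get("dur", 0)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top]
```

After a few `session.run` calls, `summarize_profile(load_profile(session.end_profiling()))` gives a ranked list that shows both the hotspots and whether any op landed on `CPUExecutionProvider` instead of `CUDAExecutionProvider`.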

---

## Questions

I would like to ask:

1. Have you tested this Transformer model on **ONNX Runtime GPU** and compared its performance against **PyTorch**?
2. Is there a recommended **ONNX export method** or deployment configuration for this model?
3. In your experience, is this model better suited for **TensorRT** than for **ONNX Runtime CUDAExecutionProvider**?

---

## Additional information

If needed, I can also provide:

- ONNX export code
- ONNX Runtime profiling results
- a minimal reproducible script

Thank you.
