Merged
22 commits
8699786
docs(pymllm): document qwen3 projection alignment
jialilve Apr 30, 2026
ec95ddd
perf(qwen3-vl): fuse M-RoPE and reuse RoPE cache
jialilve May 23, 2026
d837252
feat(bench): add multimodal VIT timing benchmark mode
jialilve May 23, 2026
f9be384
perf(rmsnorm): patch Jetson FlashInfer device properties
jialilve May 23, 2026
0e93e6d
perf(sampling): avoid GPU sync for greedy decode
jialilve May 23, 2026
7494bc4
perf(bench): align CUDA graph capture batch size with sweep (#1)
Jun 4, 2026
fdec95c
perf(bench): tensorize decode KV-mapping write, drop per-request .item()
Jun 4, 2026
87cf2a3
feat(bench): skip settings exceeding KV pool capacity (#3)
Jun 4, 2026
7558903
feat(bench): align profiling methodology with SGLang (#4)
Jun 4, 2026
ac95cd6
fix(bench): do not profile the warmup run (#2)
Jun 4, 2026
78c906d
feat(bench): add single-stage correctness mode (#5)
Jun 4, 2026
e33a5ef
test(bench): expect gzipped profiler traces
jialilve Jun 4, 2026
8dcb2a1
test(bench): cover batched decode kv mapping
jialilve Jun 4, 2026
e7ed949
docs(bench): label correct mode as smoke check
jialilve Jun 4, 2026
75a67e5
feat(bench): sweep multimodal prefill input length
jialilve Jun 7, 2026
2dae7ce
perf(memory): refine static KV pool profiling
jialilve Jun 7, 2026
7eb9502
perf(memory): improve KV cache budget diagnostics
jialilve Jun 7, 2026
b56d811
docs(pymllm_runtime): rewrite and humanize runtime docs
jialilve Jun 8, 2026
d442aae
docs(pymllm): rewrite README to mirror runtime setup doc
jialilve Jun 8, 2026
f346552
docs(readme): highlight Jetson Orin pymllm performance
jialilve Jun 8, 2026
32662c0
docs(pymllm): fix benchmark prompt typo
jialilve Jun 8, 2026
9ed367e
Merge branch 'main' into feature/jetson-qwen3-family-bf16-w4a16-w8a8
chenghuaWang Jun 8, 2026
51 changes: 40 additions & 11 deletions README-ZH.md
@@ -17,6 +17,8 @@ mllm

## Latest News

- [2026 Jun 08] `pymllm` now covers W4A16 / W8A8 serving for Qwen3, Qwen3-VL, and Qwen3.5 on Jetson Orin; Qwen3-VL-2B W8A8 reaches up to 3.12x prefill speedup on AGX Orin 32GB, while decode throughput is broadly close to llama.cpp.
- [2026 Apr 30] `pymllm` adds Jetson-oriented Qwen3 / Qwen3-VL BF16, W4A16, and W8A8 serving support, covering the compressed-tensors AWQ and W8A8 INT8 paths.
- [2026 Mar 18] 🔥🔥🔥 `pymllm` now supports CUDA on Jetson Orin and Jetson Thor devices (experimental; still under active development).
- [2026 Feb 03] 🔥🔥🔥 MLLM Qnn AOT now supports full graph execution on NPU! [Quick Start](https://ubiquitouslearning.github.io/mllm/qnn_backend/aot_execute.html), [Technical Report](https://chenghuawang.github.io/News/2026-01-29-mllm-qnn-aot-support/)
- [2025 Nov 27] Android Demo update: stable streaming inference for Qwen3 and DeepSeek-OCR on Android via a new In-App Go server architecture.
@@ -29,6 +31,29 @@ mllm
- A more complete and refined engineering implementation
- [2025 Jul 30] Added a Rotation Quantization method for QNN backend models and support for Qwen-2-VL 2B (ViT profiling will be integrated in v2)

## Jetson Orin CUDA Runtime

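`pymllm` now supports running Qwen3, Qwen3-VL, and Qwen3.5 on Jetson Orin, covering BF16 serving as well as the W4A16 and W8A8 quantized serving paths. The W4A16 path uses AWQ compressed tensors and Marlin GEMM; the W8A8 path uses Triton per-token activation quantization and CUTLASS INT8 GEMM.

Under the `input_len=2048`, `output_len=128` benchmark setting, `pymllm` prefill performance on Jetson Orin improves markedly over llama.cpp. Qwen3-VL-2B W8A8 reaches up to **3.12x prefill speedup** on AGX Orin 32GB, with prefill throughput of about **12243 tok/s**. Decode throughput is broadly close to llama.cpp, with small leads or regressions depending on model, device, and quantization format.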

<div align="center">
<img src="./assets/jetson/pymllm-jetson-speedup-summary-2048.jpg" width="90%">
</div>

<div align="center">
<img src="./assets/jetson/pymllm-jetson-prefill-throughput-2048.jpg" width="90%">
</div>

For multimodal prefill, `bench_one_batch --image` measures the full path of vision encoding plus image/text token prefill. The table below uses `input_len=2048`; TPS is computed from the mean latency across repeated runs.

| Device | Model | FP16 | W4A16 | W8A8 |
|---|---|---:|---:|---:|
| AGX Orin 32GB | Qwen3-VL-2B | 4875.75 | 4700.28 | 6443.59 |
| AGX Orin 32GB | Qwen3-VL-4B | - | 2499.46 | 3837.07 |
| Orin NX 16GB | Qwen3-VL-2B | 2438.27 | 2494.89 | 3200.40 |
| Orin NX 16GB | Qwen3-VL-4B | - | 1231.21 | 1673.93 |

## Android Demo & Architecture

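We have refactored the Android implementation to use a robust **Client-Server** architecture that runs entirely on-device.
@@ -75,17 +100,21 @@ The mllm framework integrates seamlessly with model checkpoints from mainstream community frameworks. Through ml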

### mllm v2

| Model(v2) | CPU | Hexagon NPU <br> INT8 |
|-----------------------------------------------------------------------------|------|-----------------------|
| [Qwen3-0.6B](https://github.com/QwenLM/Qwen3) | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen3-0.6B-w4a32kai) | |
| [Qwen3-1.7B](https://github.com/QwenLM/Qwen3) | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen3-1.7B-w4a8-i8mm-kai) | [W4A16-SM8650](https://modelscope.cn/models/mllmTeam/Qwen3-1.7B-Qnn-AOT-SM8650/summary) |
| [Qwen3-4B](https://github.com/QwenLM/Qwen3) | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen3-4B-w4a8-i8mm-kai) | |
| [DeepSeek-OCR](https://github.com/deepseek-ai/DeepSeek-OCR) | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/DeepSeek-OCR-w4a8-i8mm-kai) | |
| [SmolLM3](https://huggingface.co/blog/smollm3)| [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/SmolLM3-3B-w4a8-i8mm-kai) | |
| [Qwen2-VL-2B-Instruct](https://qwenlm.github.io/zh/blog/qwen2-vl/)|[✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen2-VL-2B-Instruct-w4a32kai) ||
| [Qwen2-VL-7B-Instruct](https://qwenlm.github.io/zh/blog/qwen2-vl/)|[✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen2-VL-7B-Instruct-w4a32kai)||
| [Qwen2.5-VL-3B-Instruct](https://qwenlm.github.io/blog/qwen2.5-vl/)|[✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen2.5-VL-3B-Instruct-w4a32kai)||
| [Qwen2.5-VL-7B-Instruct](https://qwenlm.github.io/blog/qwen2.5-vl/)|[✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen2.5-VL-7B-Instruct-w4a32kai)||
| Model(v2) | CPU | Jetson Orin CUDA | Hexagon NPU <br> INT8 |
|-----------------------------------------------------------------------------|------|------------------|-----------------------|
| [Qwen3-0.6B](https://github.com/QwenLM/Qwen3) | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen3-0.6B-w4a32kai) | | |
| [Qwen3-1.7B](https://github.com/QwenLM/Qwen3) | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen3-1.7B-w4a8-i8mm-kai) | | [W4A16-SM8650](https://modelscope.cn/models/mllmTeam/Qwen3-1.7B-Qnn-AOT-SM8650/summary) |
| [Qwen3-4B](https://github.com/QwenLM/Qwen3) | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen3-4B-w4a8-i8mm-kai) | | |
| Qwen3.5-2B | | ✔️ W4A16 / W8A8 | |
| Qwen3.5-4B | | ✔️ W4A16 / W8A8 | |
| Qwen3-VL-2B-Instruct | | ✔️ W4A16 / W8A8 | |
| Qwen3-VL-4B-Instruct | | ✔️ W4A16 / W8A8 | |
| [DeepSeek-OCR](https://github.com/deepseek-ai/DeepSeek-OCR) | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/DeepSeek-OCR-w4a8-i8mm-kai) | | |
| [SmolLM3](https://huggingface.co/blog/smollm3)| [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/SmolLM3-3B-w4a8-i8mm-kai) | | |
| [Qwen2-VL-2B-Instruct](https://qwenlm.github.io/zh/blog/qwen2-vl/) | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen2-VL-2B-Instruct-w4a32kai) | | |
| [Qwen2-VL-7B-Instruct](https://qwenlm.github.io/zh/blog/qwen2-vl/) | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen2-VL-7B-Instruct-w4a32kai) | | |
| [Qwen2.5-VL-3B-Instruct](https://qwenlm.github.io/blog/qwen2.5-vl/) | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen2.5-VL-3B-Instruct-w4a32kai) | | |
| [Qwen2.5-VL-7B-Instruct](https://qwenlm.github.io/blog/qwen2.5-vl/) | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen2.5-VL-7B-Instruct-w4a32kai) | | |

### mllm v1

25 changes: 25 additions & 0 deletions README.md
@@ -17,6 +17,8 @@ mllm

## Latest News

- [2026 Jun 08] `pymllm` now covers Qwen3, Qwen3-VL, and Qwen3.5 on Jetson Orin with W4A16 / W8A8 serving; Qwen3-VL-2B W8A8 reaches up to 3.12x prefill speedup on AGX Orin 32GB, while decode throughput stays broadly close to llama.cpp.
- [2026 Apr 30] `pymllm` adds Jetson-oriented Qwen3 / Qwen3-VL BF16, W4A16, and W8A8 serving support, including compressed-tensors AWQ and W8A8 INT8 paths.
- [2026 Mar 18] 🔥🔥🔥 `pymllm` now supports CUDA on Jetson Orin and Jetson Thor devices (experimental; still under active development).
- [2026 Feb 03] 🔥🔥🔥 MLLM Qnn AOT Support for Full Graph Execution on NPU! [Quick Start](https://ubiquitouslearning.github.io/mllm/qnn_backend/aot_execute.html), [Technical Report](https://chenghuawang.github.io/News/2026-01-29-mllm-qnn-aot-support-en/)
- [2025 Nov 27] Android Demo Update: Enabled stable Qwen3 and DeepSeek-OCR streaming on Android via a novel In-App Go Server Architecture.
@@ -28,6 +30,29 @@ mllm
- A more refined engineering implementation
- [2025 Jul 30] Add Rotation Quantization method for QNN backend models and support Qwen-2-VL 2B (ViT profiling will be integrated in v2)

## Jetson Orin CUDA Runtime

`pymllm` now supports Qwen3, Qwen3-VL, and Qwen3.5 on Jetson Orin with BF16 serving plus W4A16 and W8A8 quantized serving. The W4A16 path uses AWQ compressed tensors and Marlin GEMM. The W8A8 path uses Triton per-token activation quantization and CUTLASS INT8 GEMM.
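
To make "per-token activation quantization" concrete, here is a minimal pure-Python sketch of the idea: each token (row) gets its own symmetric INT8 scale, the matmul runs on integers, and the result is rescaled afterwards. This is an illustration only, not the Triton or CUTLASS code in `pymllm`. The function names, the symmetric `absmax / 127` scale, and the per-output-channel weight scales are all assumptions made for the example.

```python
def quantize_per_token_int8(rows):
    """Symmetric INT8 quantization with one scale per token (row)."""
    quantized, scales = [], []
    for row in rows:
        absmax = max(abs(v) for v in row)
        # An all-zero row gets a dummy scale of 1.0 to avoid dividing by zero.
        scale = absmax / 127.0 if absmax > 0 else 1.0
        quantized.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return quantized, scales


def int8_linear(q_x, x_scales, q_w, w_scales):
    """Integer matmul, then rescale: y[t][o] = acc[t][o] * x_scale[t] * w_scale[o]."""
    out = []
    for q_row, sx in zip(q_x, x_scales):
        out.append([
            sum(a * b for a, b in zip(q_row, w_row)) * sx * sw
            for w_row, sw in zip(q_w, w_scales)
        ])
    return out


# Toy data: 2 tokens with hidden size 3, and a weight matrix with 2 output channels.
x = [[1.0, -2.0, 0.5], [0.1, 0.2, -0.4]]
w = [[0.5, 0.25, -1.0], [1.0, 1.0, 1.0]]

q_x, x_scales = quantize_per_token_int8(x)
# Assumption: weights are quantized per output channel with the same symmetric scheme.
q_w, w_scales = quantize_per_token_int8(w)
y = int8_linear(q_x, x_scales, q_w, w_scales)
```

Because the scale is chosen per row, a token with small activations keeps its resolution even when another token in the batch has large outliers, which is the usual motivation for per-token rather than per-tensor scaling.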

For `input_len=2048` and `output_len=128`, `pymllm` shows strong prefill gains over llama.cpp on Jetson Orin. Qwen3-VL-2B W8A8 reaches up to **3.12x prefill speedup** on AGX Orin 32GB and about **12243 tok/s** prefill throughput. Decode throughput is generally close to llama.cpp, with small wins or losses depending on model, device, and quantization.

<div align="center">
<img src="./assets/jetson/pymllm-jetson-speedup-summary-2048.jpg" width="90%">
</div>

<div align="center">
<img src="./assets/jetson/pymllm-jetson-prefill-throughput-2048.jpg" width="90%">
</div>

For multimodal prefill, `bench_one_batch --image` measures the full path of vision encoding plus image/text token prefill. The table below uses `input_len=2048` and reports TPS (tokens per second) computed from the mean latency across repeated runs.

| Device | Model | FP16 | W4A16 | W8A8 |
|---|---|---:|---:|---:|
| AGX Orin 32GB | Qwen3-VL-2B | 4875.75 | 4700.28 | 6443.59 |
| AGX Orin 32GB | Qwen3-VL-4B | - | 2499.46 | 3837.07 |
| Orin NX 16GB | Qwen3-VL-2B | 2438.27 | 2494.89 | 3200.40 |
| Orin NX 16GB | Qwen3-VL-4B | - | 1231.21 | 1673.93 |
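
As a rough sketch of how a throughput figure can be derived from repeated runs, the snippet below divides the token count by the mean latency. The helper name `prefill_tps` and the latency samples are hypothetical, and treating `input_len` as the exact token count is an assumption; the benchmark's real accounting may differ. The last line only restates a ratio between two cells of the table above.

```python
from statistics import mean


def prefill_tps(input_len: int, latencies_s: list[float]) -> float:
    """Tokens per second computed from the mean latency of repeated runs."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    return input_len / mean(latencies_s)


# Hypothetical latency samples in seconds, chosen only to show the arithmetic.
samples = [0.5, 0.25]

tps_from_mean_latency = prefill_tps(2048, samples)    # 2048 / 0.375
mean_of_per_run_tps = mean(2048 / s for s in samples)  # (4096 + 8192) / 2

# Ratio of two table cells: AGX Orin 32GB, Qwen3-VL-2B, W8A8 vs FP16 (about 1.32x).
w8a8_over_fp16 = 6443.59 / 4875.75
```

The two aggregates differ (about 5461 versus 6144 tok/s for these samples), so when comparing numbers across tools it helps to know which one was reported.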

## Android Demo & Architecture

We have refactored the Android implementation to use a robust **Client-Server** architecture entirely on-device.
Binary file added bench_assets/two_cats.jpg
Binary file added bench_assets/two_cats_480p.jpg