UbiquitousLearning · chenghuaWang · Jun 8, 2026 · Apr 30, 2026 · May 23, 2026 · May 23, 2026
diff --git a/README-ZH.md b/README-ZH.md
@@ -17,6 +17,8 @@ mllm
 
 ## 最新动态
 
+- [2026 年 6 月 8 日] `pymllm` 已覆盖 Qwen3、Qwen3-VL 与 Qwen3.5 在 Jetson Orin 上的 W4A16 / W8A8 serving；Qwen3-VL-2B W8A8 在 AGX Orin 32GB 上最高达到 3.12x prefill 加速比，decode 吞吐整体与 llama.cpp 接近。
+- [2026 年 4 月 30 日] `pymllm` 新增面向 Jetson 的 Qwen3 / Qwen3-VL BF16、W4A16 和 W8A8 serving 支持，覆盖 compressed-tensors AWQ 与 W8A8 INT8 路径。
 - [2026 年 3 月 18 日] 🔥🔥🔥 `pymllm` 已支持在 Jetson Orin 和 Jetson Thor 设备上使用 CUDA（实验特性，仍在持续开发中）。
 - [2026 年 2 月 3 日] 🔥🔥🔥 MLLM Qnn AOT 已支持在 NPU 上全图执行！[快速开始](https://ubiquitouslearning.github.io/mllm/qnn_backend/aot_execute.html), [技术报告](https://chenghuawang.github.io/News/2026-01-29-mllm-qnn-aot-support/)
 - [2025 年 11 月 27 日] Android Demo 更新：通过一种全新的 In-App Go 服务架构，在 Android 上实现了 Qwen3 和 DeepSeek-OCR 的稳定流式推理。
@@ -29,6 +31,29 @@ mllm
   - 更加完善、精细的工程实现
 - [2025 年 7 月 30 日] 为 QNN 后端模型新增旋转量化（Rotation Quantization）方法，并支持 Qwen-2-VL 2B（ViT 性能分析将在 v2 中集成）
 
+## Jetson Orin CUDA Runtime
+
+`pymllm` 现已支持 Qwen3、Qwen3-VL 与 Qwen3.5 在 Jetson Orin 上运行，覆盖 BF16 serving 以及 W4A16、W8A8 两种量化 serving 路径。其中，W4A16 使用 AWQ compressed tensors 与 Marlin GEMM，W8A8 使用 Triton per-token activation quantization 与 CUTLASS INT8 GEMM。
+
+在 `input_len=2048`、`output_len=128` 的测速口径下，`pymllm` 在 Jetson Orin 上的 prefill 性能相对 llama.cpp 有明显提升。Qwen3-VL-2B W8A8 在 AGX Orin 32GB 上最高达到 **3.12x prefill 加速比**，prefill 吞吐约 **12243 tok/s**。decode 吞吐整体与 llama.cpp 接近，不同模型、设备和量化格式下会有小幅领先或回落。
+
+<div align="center">
+  <img src="./assets/jetson/pymllm-jetson-speedup-summary-2048.jpg" width="90%">
+</div>
+
+<div align="center">
+  <img src="./assets/jetson/pymllm-jetson-prefill-throughput-2048.jpg" width="90%">
+</div>
+
+对于多模态 prefill，`bench_one_batch --image` 测量“视觉编码 + 图像/文本 token prefill”的完整路径。下表使用 `input_len=2048`，TPS 为多次运行的 mean latency 计算结果。
+
+| Device | Model | FP16 (tok/s) | W4A16 (tok/s) | W8A8 (tok/s) |
+|---|---|---:|---:|---:|
+| AGX Orin 32GB | Qwen3-VL-2B | 4875.75 | 4700.28 | 6443.59 |
+| AGX Orin 32GB | Qwen3-VL-4B | - | 2499.46 | 3837.07 |
+| Orin NX 16GB | Qwen3-VL-2B | 2438.27 | 2494.89 | 3200.40 |
+| Orin NX 16GB | Qwen3-VL-4B | - | 1231.21 | 1673.93 |
+
 ## Android Demo & Architecture
 
 我们已对 Android 端实现进行了重构，采用了一种稳健的、完全在设备端运行的 **Client-Server** 架构。
@@ -75,17 +100,21 @@ mllm 框架可以与主流社区框架的模型检查点无缝集成。通过 ml
 
 ### mllm v2
 
-| Model(v2)                                                                   | CPU  | Hexagon NPU <br> INT8 |
-|-----------------------------------------------------------------------------|------|-----------------------|
-| [Qwen3-0.6B](https://github.com/QwenLM/Qwen3)                     | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen3-0.6B-w4a32kai)  |  | 
-| [Qwen3-1.7B](https://github.com/QwenLM/Qwen3)                     | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen3-1.7B-w4a8-i8mm-kai)  | [W4A16-SM8650](https://modelscope.cn/models/mllmTeam/Qwen3-1.7B-Qnn-AOT-SM8650/summary) |
-| [Qwen3-4B](https://github.com/QwenLM/Qwen3)                      | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen3-4B-w4a8-i8mm-kai)  |  |
-| [DeepSeek-OCR](https://github.com/deepseek-ai/DeepSeek-OCR)       | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/DeepSeek-OCR-w4a8-i8mm-kai)  |  |
-| [SmolLM3](https://huggingface.co/blog/smollm3)| [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/SmolLM3-3B-w4a8-i8mm-kai)  |  |
-| [Qwen2-VL-2B-Instruct](https://qwenlm.github.io/zh/blog/qwen2-vl/)|[✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen2-VL-2B-Instruct-w4a32kai) ||
-| [Qwen2-VL-7B-Instruct](https://qwenlm.github.io/zh/blog/qwen2-vl/)|[✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen2-VL-7B-Instruct-w4a32kai)||
-| [Qwen2.5-VL-3B-Instruct](https://qwenlm.github.io/blog/qwen2.5-vl/)|[✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen2.5-VL-3B-Instruct-w4a32kai)||
-| [Qwen2.5-VL-7B-Instruct](https://qwenlm.github.io/blog/qwen2.5-vl/)|[✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen2.5-VL-7B-Instruct-w4a32kai)||
+| Model(v2)                                                                   | CPU  | Jetson Orin CUDA | Hexagon NPU <br> INT8 |
+|-----------------------------------------------------------------------------|------|------------------|-----------------------|
+| [Qwen3-0.6B](https://github.com/QwenLM/Qwen3)                     | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen3-0.6B-w4a32kai)  |  |  |
+| [Qwen3-1.7B](https://github.com/QwenLM/Qwen3)                     | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen3-1.7B-w4a8-i8mm-kai)  |  | [W4A16-SM8650](https://modelscope.cn/models/mllmTeam/Qwen3-1.7B-Qnn-AOT-SM8650/summary) |
+| [Qwen3-4B](https://github.com/QwenLM/Qwen3)                      | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen3-4B-w4a8-i8mm-kai)  |  |  |
+| Qwen3.5-2B                                                       |  | ✔️ W4A16 / W8A8 |  |
+| Qwen3.5-4B                                                       |  | ✔️ W4A16 / W8A8 |  |
+| Qwen3-VL-2B-Instruct                                            |  | ✔️ W4A16 / W8A8 |  |
+| Qwen3-VL-4B-Instruct                                            |  | ✔️ W4A16 / W8A8 |  |
+| [DeepSeek-OCR](https://github.com/deepseek-ai/DeepSeek-OCR)       | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/DeepSeek-OCR-w4a8-i8mm-kai)  |  |  |
+| [SmolLM3](https://huggingface.co/blog/smollm3)| [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/SmolLM3-3B-w4a8-i8mm-kai)  |  |  |
+| [Qwen2-VL-2B-Instruct](https://qwenlm.github.io/zh/blog/qwen2-vl/) | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen2-VL-2B-Instruct-w4a32kai) |  |  |
+| [Qwen2-VL-7B-Instruct](https://qwenlm.github.io/zh/blog/qwen2-vl/) | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen2-VL-7B-Instruct-w4a32kai) |  |  |
+| [Qwen2.5-VL-3B-Instruct](https://qwenlm.github.io/blog/qwen2.5-vl/) | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen2.5-VL-3B-Instruct-w4a32kai) |  |  |
+| [Qwen2.5-VL-7B-Instruct](https://qwenlm.github.io/blog/qwen2.5-vl/) | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen2.5-VL-7B-Instruct-w4a32kai) |  |  |
 
 ### mllm v1
 

diff --git a/README.md b/README.md
@@ -17,6 +17,8 @@ mllm
 
 ## Latest News
 
+- [2026 Jun 08] `pymllm` now covers Qwen3, Qwen3-VL, and Qwen3.5 on Jetson Orin with W4A16 / W8A8 serving; Qwen3-VL-2B W8A8 reaches up to 3.12x prefill speedup on AGX Orin 32GB, while decode throughput stays broadly close to llama.cpp.
+- [2026 Apr 30] `pymllm` adds Jetson-oriented Qwen3 / Qwen3-VL BF16, W4A16, and W8A8 serving support, including compressed-tensors AWQ and W8A8 INT8 paths.
 - [2026 Mar 18] 🔥🔥🔥 `pymllm` now supports CUDA on Jetson Orin and Jetson Thor devices (experimental; still under active development).
 - [2026 Feb 03] 🔥🔥🔥 MLLM Qnn AOT Support for Full Graph Execution on NPU! [Quick Start](https://ubiquitouslearning.github.io/mllm/qnn_backend/aot_execute.html), [Technical Report](https://chenghuawang.github.io/News/2026-01-29-mllm-qnn-aot-support-en/)
 - [2025 Nov 27] Android Demo Update: Enabled stable Qwen3 and DeepSeek-OCR streaming on Android via a novel In-App Go Server Architecture.
@@ -28,6 +30,29 @@ mllm
   - A more refined engineering implementation
 - [2025 Jul 30] Add Rotation Quantization method for QNN backend models and support Qwen-2-VL 2B（ViT profiling will integrate in v2）
 
+## Jetson Orin CUDA Runtime
+
+`pymllm` now supports Qwen3, Qwen3-VL, and Qwen3.5 on Jetson Orin with BF16 serving plus W4A16 and W8A8 quantized serving. The W4A16 path uses AWQ compressed tensors and Marlin GEMM. The W8A8 path uses Triton per-token activation quantization and CUTLASS INT8 GEMM.
+
+For `input_len=2048` and `output_len=128`, `pymllm` shows strong prefill gains over llama.cpp on Jetson Orin. Qwen3-VL-2B W8A8 reaches up to **3.12x prefill speedup** on AGX Orin 32GB and about **12243 tok/s** prefill throughput. Decode throughput is generally close to llama.cpp, with small wins or losses depending on model, device, and quantization.
+
+<div align="center">
+  <img src="./assets/jetson/pymllm-jetson-speedup-summary-2048.jpg" width="90%">
+</div>
+
+<div align="center">
+  <img src="./assets/jetson/pymllm-jetson-prefill-throughput-2048.jpg" width="90%">
+</div>
+
+For multimodal prefill, `bench_one_batch --image` measures the full path of vision encoding plus image/text token prefill. The table below uses `input_len=2048` and reports TPS (tokens per second) computed from the mean latency across repeated runs.
+
+| Device | Model | FP16 (tok/s) | W4A16 (tok/s) | W8A8 (tok/s) |
+|---|---|---:|---:|---:|
+| AGX Orin 32GB | Qwen3-VL-2B | 4875.75 | 4700.28 | 6443.59 |
+| AGX Orin 32GB | Qwen3-VL-4B | - | 2499.46 | 3837.07 |
+| Orin NX 16GB | Qwen3-VL-2B | 2438.27 | 2494.89 | 3200.40 |
+| Orin NX 16GB | Qwen3-VL-4B | - | 1231.21 | 1673.93 |
+
 ## Android Demo & Architecture
 
 We have refactored the Android implementation to use a robust **Client-Server** architecture entirely on-device.
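
Review note on the W8A8 path: the README.md hunk above says W8A8 uses Triton per-token activation quantization followed by a CUTLASS INT8 GEMM. The sketch below shows the arithmetic that description implies, in plain Python rather than Triton or CUTLASS. It is not the `pymllm` implementation. The function names `quantize_rows_int8` and `int8_gemm_dequant` are hypothetical, and both the symmetric absmax scaling and the per-output-channel weight scales are assumptions that the diff does not confirm.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def quantize_rows_int8(x: Matrix) -> Tuple[List[List[int]], List[float]]:
    """Symmetric INT8 quantization with one scale per row.

    For activations a row is one token, so this is "per-token" quantization.
    Assumption: scale = absmax / 127, values clamped to [-127, 127].
    """
    q_rows: List[List[int]] = []
    scales: List[float] = []
    for row in x:
        absmax = max((abs(v) for v in row), default=0.0)
        # An all-zero row gets scale 1.0 so the division below is safe.
        scale = absmax / 127.0 if absmax > 0.0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def int8_gemm_dequant(
    q_x: List[List[int]],
    x_scales: List[float],
    q_w: List[List[int]],
    w_scales: List[float],
) -> Matrix:
    """Compute x @ w.T from INT8 operands, then rescale to floating point.

    q_x is [tokens][k]; q_w is [out_features][k] with one scale per output
    channel (an assumption). Products are accumulated as integers, which is
    what an INT8 GEMM kernel does in an INT32 accumulator.
    """
    out: Matrix = []
    for x_row, sx in zip(q_x, x_scales):
        out_row: List[float] = []
        for w_row, sw in zip(q_w, w_scales):
            acc = sum(a * b for a, b in zip(x_row, w_row))
            out_row.append(acc * sx * sw)
        out.append(out_row)
    return out


if __name__ == "__main__":
    activations = [[1.0, -2.0, 0.5]]
    weights = [[0.5, 0.25, -1.0], [1.0, 1.0, 1.0]]
    q_x, sx = quantize_rows_int8(activations)
    q_w, sw = quantize_rows_int8(weights)
    print(int8_gemm_dequant(q_x, sx, q_w, sw))
```

Because each token carries its own scale, one token with large activations does not reduce the resolution available to the other tokens in the batch. The rescale is a single multiply per output element after the integer accumulation.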

diff --git a/assets/jetson/pymllm-jetson-prefill-throughput-2048.jpg b/assets/jetson/pymllm-jetson-prefill-throughput-2048.jpg
diff --git a/assets/jetson/pymllm-jetson-speedup-summary-2048.jpg b/assets/jetson/pymllm-jetson-speedup-summary-2048.jpg
diff --git a/bench_assets/two_cats.jpg b/bench_assets/two_cats.jpg
diff --git a/bench_assets/two_cats_480p.jpg b/bench_assets/two_cats_480p.jpg
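
Review note on the throughput numbers: README-ZH.md states that TPS in the multimodal table is computed from the mean latency over repeated runs. That is not the same as averaging per-run TPS, so the sketch below shows both, then derives the W8A8 and W4A16 ratios over FP16 from the Qwen3-VL-2B rows of the table. The helper names and the example latencies in the `__main__` block are hypothetical. Only the table values come from the diff.

```python
from statistics import mean
from typing import Dict, List, Tuple


def tps_from_mean_latency(num_tokens: int, latencies_s: List[float]) -> float:
    """Tokens per second, computed from the mean latency of repeated runs."""
    return num_tokens / mean(latencies_s)


def mean_of_per_run_tps(num_tokens: int, latencies_s: List[float]) -> float:
    """Average of per-run TPS; never lower than tps_from_mean_latency."""
    return mean(num_tokens / t for t in latencies_s)


# Multimodal prefill TPS copied from the README table (input_len=2048).
# Only rows that report an FP16 column are included.
MULTIMODAL_PREFILL_TPS: Dict[Tuple[str, str], Dict[str, float]] = {
    ("AGX Orin 32GB", "Qwen3-VL-2B"): {"FP16": 4875.75, "W4A16": 4700.28, "W8A8": 6443.59},
    ("Orin NX 16GB", "Qwen3-VL-2B"): {"FP16": 2438.27, "W4A16": 2494.89, "W8A8": 3200.40},
}


def ratio_vs_fp16(device: str, model: str, fmt: str) -> float:
    """Throughput of a quantized format relative to FP16 on the same row."""
    row = MULTIMODAL_PREFILL_TPS[(device, model)]
    return row[fmt] / row["FP16"]


if __name__ == "__main__":
    # Hypothetical latencies, only to show that the two averages differ.
    runs = [0.4, 0.6]
    print(tps_from_mean_latency(2048, runs), mean_of_per_run_tps(2048, runs))
    for (device, model) in MULTIMODAL_PREFILL_TPS:
        for fmt in ("W4A16", "W8A8"):
            print(device, model, fmt, round(ratio_vs_fp16(device, model, fmt), 2))
```

On these rows W8A8 comes out at roughly 1.3x the FP16 multimodal prefill throughput on both devices, while W4A16 stays within a few percent of FP16. These ratios are against `pymllm` FP16 and are a different comparison from the 3.12x figure, which is measured against llama.cpp.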