Model Download
PP-DocLayoutV3 ONNX model: https://huggingface.co/AlexTransformer/PP-DocLayoutV3-onnx
Customer Impact
I purchased an AMD Ryzen AI MAX+ 395 AI PC laptop specifically for on-device document parsing with ONNX models on the NPU. After extensive model preparation and successful AIE compilation, the VAIML runtime crashes at inference time, making the NPU unusable for real-world document AI workloads. This severely impacts the value proposition of the Ryzen AI platform.
Hardware
| Component | Detail |
| --- | --- |
| APU | AMD Ryzen AI MAX+ 395 |
| NPU Type | STX (detected by quicktest.py) |
| Memory | 128 GB unified (LPDDR5X) |
| OS | Windows 11 |
| Driver | NPU driver 32.0.22032.6002 |
Software Environment
| Component | Version |
| --- | --- |
| Ryzen AI Software | 1.7.1 |
| onnxruntime-vitisai | 1.23.3.dev20260320 |
| VAIML (vaip) commit | 9ce31169da2a09a217ab2e1492b3fc9cd39d425c |
| AIE Compiler | 2026.1 (windows64-bit) |
| Python | 3.12 |
| Conda env | ryzen-ai-1.7.1 (official) |
Problem
AIE compilation succeeds with ZERO errors, but runtime inference crashes with Access Violation (0xC0000005) in xir_deserialize_cif().
The crash occurs inside onnxruntime_vitisai_ep.dll when the VAIML runtime attempts to deserialize/load the compiled AIE binary. This is NOT a compilation error — the compiler reports success. The runtime simply cannot load its own output.
Root Cause Analysis
What we proved works:
- NPU hardware is functional: quicktest.py passes (test_model.onnx runs on NPU)
- VitisAIExecutionProvider is available: ['VitisAIExecutionProvider', 'DmlExecutionProvider', 'CPUExecutionProvider']
- VAIML frontend partitioning works: 95.7% of operators (1157/1209) are supported, covering 99.1% of GOPs
- Unified memory works: the AIE compiler successfully allocates external buffers from the 128 GB shared memory pool
- AIE compilation completes: Compilation Complete (WARNING:47, CRITICAL-WARNING:0, ERROR:0)
- Session creation succeeds: VitisAIExecutionProvider is selected, and the session loads in ~760s (first compile) or ~1.4s (cache hit)
- Static INT8 quantized model runs: 1.07x speedup over CPU (partial NPU offload, 28% VAIML coverage)
What crashes:
The dynamic INT8 quantized model (QUInt8, per-tensor) passes full AIE compilation but crashes at the very first inference call:
Exception Code: 0xC0000005 (ACCESS_VIOLATION)
onnxruntime_vitisai_ep.dll!xir_deserialize_cif() + 0x147926
onnxruntime_vitisai_ep.dll!xir_deserialize_cif() + 0x14E8C1
onnxruntime_providers_vitisai.dll + 0x21614
The crash happens in xir_deserialize_cif() — the function responsible for loading the compiled AIE graph at runtime. The compiled binary was produced by the SAME toolchain, yet the runtime cannot deserialize it.
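To make the trace easier to triage, the frames can be parsed and their offsets inspected. The sketch below is illustrative only: the `Frame` type, the regular expression, and the `LARGE_OFFSET` threshold are assumptions introduced here, not part of any AMD tooling. One thing it highlights is that the offsets from `xir_deserialize_cif()` exceed 1.3 MB. When a debugger has only exported symbols and no PDB, it attributes an address to the nearest preceding export, so the faulting code may actually belong to a different, unexported function in onnxruntime_vitisai_ep.dll.

```python
import re
from typing import NamedTuple, Optional


class Frame(NamedTuple):
    module: str
    symbol: Optional[str]  # None when the frame is relative to the module base
    offset: int


# Matches lines such as "foo.dll!bar() + 0x1234" and "foo.dll + 0x1234".
FRAME_RE = re.compile(
    r"^(?P<module>[\w.\-]+\.dll)"
    r"(?:!(?P<symbol>\w+)\(\))?"
    r"\s*\+\s*0x(?P<offset>[0-9A-Fa-f]+)$"
)

# Hypothetical threshold: offsets above 64 KiB are treated as suspicious.
LARGE_OFFSET = 0x10000

TRACE = """\
onnxruntime_vitisai_ep.dll!xir_deserialize_cif() + 0x147926
onnxruntime_vitisai_ep.dll!xir_deserialize_cif() + 0x14E8C1
onnxruntime_providers_vitisai.dll + 0x21614
"""


def parse_frames(trace: str) -> list[Frame]:
    """Extract (module, symbol, offset) from each recognizable frame line."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append(
                Frame(match["module"], match["symbol"], int(match["offset"], 16))
            )
    return frames


def is_suspicious(frame: Frame) -> bool:
    """A very large symbol-relative offset hints at nearest-export attribution."""
    return frame.symbol is not None and frame.offset > LARGE_OFFSET


for frame in parse_frames(TRACE):
    note = "  <- offset too large for one function?" if is_suspicious(frame) else ""
    print(f"{frame.module} {frame.symbol or '(module base)'} +{frame.offset:,} bytes{note}")
```

A trace resolved against symbol files for onnxruntime_vitisai_ep.dll would identify the faulting function precisely.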
Model details:
- Model: PP-DocLayoutV3 (PaddleOCR document layout detection, widely used document AI model)
- Input: 1x3x800x800 (standard document image)
- Original format: FP32, 66 MB
- After graph surgery + INT8 quantization: 104 MB (QUInt8 dynamic quantization)
- Graph surgery performed: 86 Cast nodes reduced to 12 (removed BOOL to FLOAT pairs, fixed DOUBLE to FLOAT, changed Cast.82 INT32 to FLOAT output); 3 Where(bool,A,B) rewritten as float-mask arithmetic; onnxsim simplification; 2519 Constants converted to Initializers; static shape fixing
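The Where(bool,A,B) rewrite mentioned above relies on a simple identity: with a mask of 0.0 and 1.0 values, `mask * A + (1 - mask) * B` selects the same elements as a boolean Where. The following is a minimal pure-Python sketch of that identity on flat lists. It is not the graph-editing code used for the model, and the function names are made up for illustration.

```python
import math


def where_bool(cond, a, b):
    """Reference semantics of ONNX Where: a[i] when cond[i] is true, else b[i]."""
    return [x if c else y for c, x, y in zip(cond, a, b)]


def where_float_mask(mask, a, b):
    """Same selection using a 0.0/1.0 float mask and arithmetic only."""
    return [m * x + (1.0 - m) * y for m, x, y in zip(mask, a, b)]


cond = [True, False, True, False]
a = [1.5, 2.5, -3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
mask = [1.0 if c else 0.0 for c in cond]

# For finite inputs both forms agree exactly.
assert where_float_mask(mask, a, b) == where_bool(cond, a, b)

# Caveat: the arithmetic form is not equivalent when the unselected branch
# holds inf or NaN, because 0.0 * inf is NaN.
leaky = where_float_mask([0.0], [float("inf")], [1.0])
assert math.isnan(leaky[0])
```

The caveat matters for layout models that use large sentinel values for masking: if any unselected branch can contain inf, the rewritten graph can produce NaN where the original did not.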
Reproduction Steps
Prerequisites
# 1. Install Ryzen AI Software 1.7.1 (official installer)
# 2. Create conda environment
conda activate ryzen-ai-1.7.1
set RYZEN_AI_INSTALLATION_PATH=C:\Program Files\RyzenAI\1.7.1
# 3. Install packages
pip install onnx onnxsim numpy
# 4. Verify NPU
cd %RYZEN_AI_INSTALLATION_PATH%\quicktest
python quicktest.py
# Expected: "Detected NPU type: STX" + "Test Finished"
Step 1: Prepare the model
# Download PP-DocLayoutV3 ONNX from PaddlePaddle (or any ONNX model with similar structure)
# Perform graph surgery: fix static shapes, remove Where(bool,A,B), simplify with onnxsim
# Quantize with onnxruntime dynamic quantization (QUInt8, per-tensor)
from onnxruntime.quantization import quantize_dynamic, QuantType
quantize_dynamic("model_fp32.onnx", "model_int8.onnx", weight_type=QuantType.QUInt8, per_channel=False)
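For readers unfamiliar with what "QUInt8, per-tensor" means numerically, the sketch below shows the usual asymmetric uint8 scheme: one scale and one zero point for the whole tensor. It is illustrative arithmetic only, written under the assumption that the range is widened to include zero. It is not ONNX Runtime's implementation, and the function names are hypothetical.

```python
def quantize_per_tensor_uint8(values):
    """Quantize a flat list of floats with a single scale and zero point."""
    lo = min(min(values), 0.0)  # widen the range so 0.0 is representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = max(0, min(255, int(round(-lo / scale))))
    quantized = [
        max(0, min(255, int(round(v / scale)) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Map uint8 codes back to approximate float values."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-1.0, -0.25, 0.0, 0.5, 2.0]
codes, scale, zero_point = quantize_per_tensor_uint8(weights)
restored = dequantize(codes, scale, zero_point)

for original, code, approx in zip(weights, codes, restored):
    print(f"{original:+.4f} -> {code:3d} -> {approx:+.4f}")
```

Because one scale covers the entire tensor, a few large-magnitude weights reduce the resolution available to all the others, which is the usual accuracy trade-off of per-tensor compared with per-channel quantization.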
Step 2: Create VAIML config
{
  "passes": [
    {"name": "init", "plugin": "vaip-pass_init"},
    {"name": "vaiml_partition", "plugin": "vaip-pass_vaiml_partition",
     "vaiml_config": {
       "device": "stx",
       "optimize_level": 1,
       "preferred_data_storage": "vectorized",
       "enable_f32_to_bf16_conversion": true
     }}
  ],
  "target": "VAIML",
  "targets": [{"name": "VAIML", "pass": ["init", "vaiml_partition"]}]
}
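To rule out a typo in the config file, the same JSON can be generated from Python and given a basic consistency check before the session is created. This is a sketch under assumptions: the rule that every entry in a target's `pass` list must name a defined pass, and that `target` must name a defined target, is inferred from the structure of the file and is not a documented schema. `check_pass_references` and `write_config` are hypothetical helpers.

```python
import json
from pathlib import Path

VAIML_CONFIG = {
    "passes": [
        {"name": "init", "plugin": "vaip-pass_init"},
        {
            "name": "vaiml_partition",
            "plugin": "vaip-pass_vaiml_partition",
            "vaiml_config": {
                "device": "stx",
                "optimize_level": 1,
                "preferred_data_storage": "vectorized",
                "enable_f32_to_bf16_conversion": True,
            },
        },
    ],
    "target": "VAIML",
    "targets": [{"name": "VAIML", "pass": ["init", "vaiml_partition"]}],
}


def check_pass_references(config: dict) -> list[str]:
    """Return a list of human-readable problems (empty when consistent)."""
    problems = []
    defined_passes = {p["name"] for p in config["passes"]}
    target_names = {t["name"] for t in config["targets"]}
    if config["target"] not in target_names:
        problems.append(f"target {config['target']!r} is not listed in targets")
    for target in config["targets"]:
        for pass_name in target["pass"]:
            if pass_name not in defined_passes:
                problems.append(
                    f"target {target['name']!r} references undefined pass {pass_name!r}"
                )
    return problems


def write_config(path, config: dict = VAIML_CONFIG) -> Path:
    """Validate the config, then write it as indented JSON."""
    problems = check_pass_references(config)
    if problems:
        raise ValueError("; ".join(problems))
    path = Path(path)
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return path
```

Calling `write_config("vaiml_config.json")` produces the file used in Step 3.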
Step 3: Create session and run inference (CRASHES)
import os
os.environ['RYZEN_AI_INSTALLATION_PATH'] = r'C:\Program Files\RyzenAI\1.7.1'
import numpy as np
import onnxruntime as ort
model = "model_int8.onnx"
config = "vaiml_config.json"
inputs = {
    'image': np.random.rand(1, 3, 800, 800).astype(np.float32),
    'im_shape': np.array([[800, 800]], dtype=np.float32),
    'scale_factor': np.array([[1.0, 1.0]], dtype=np.float32),
}
# This succeeds (AIE compilation, ~760s first time, ~1.4s cached)
so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess = ort.InferenceSession(
    model,
    sess_options=so,
    providers=['VitisAIExecutionProvider', 'CPUExecutionProvider'],
    provider_options=[{'config_file': config}, {}],
)
# This CRASHES (0xC0000005 Access Violation)
out = sess.run(None, inputs)
Expected output:
Output shapes: [(300, 7), (1,), (300, 200, 200)]
Actual output:
Exception Code: 0xC0000005
onnxruntime_vitisai_ep.dll!xir_deserialize_cif() + 0x147926
(Process terminates immediately)
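Because the access violation kills the Python process outright, no exception can be caught in-process. One way to keep a test harness alive and record the failure is to run the inference script in a child interpreter and inspect its exit code. The sketch below assumes the Step 3 code has been saved to a file (the name `run_inference.py` in the usage note is hypothetical). The `-X faulthandler` option is a standard CPython flag that may add a Python-level traceback to stderr when the child dies.

```python
import subprocess
import sys

ACCESS_VIOLATION = 0xC0000005  # Windows STATUS_ACCESS_VIOLATION


def describe_exit(returncode: int) -> str:
    """Translate a child exit code into a short status string."""
    if returncode == 0:
        return "ok"
    code = returncode & 0xFFFFFFFF  # normalize signed and unsigned forms
    if code == ACCESS_VIOLATION:
        return "access violation (0xC0000005)"
    return f"exit code 0x{code:08X}"


def run_isolated(interpreter_args, timeout_s=None):
    """Run a child Python interpreter and return (status, stderr text)."""
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", *interpreter_args],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return describe_exit(proc.returncode), proc.stderr
```

Usage: `status, stderr = run_isolated(["run_inference.py"])`. On the failing model, `status` would be expected to report the access violation while the parent process keeps running and can log `stderr`.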
AIE Compilation Log (showing SUCCESS)
INFO: [aiecompiler 77-6272] Completing Scheduler pass
INFO: [aiecompiler 77-6497] runtime-opt stats: Avoided compilation of 75 cores out of 80 cores.
INFO: [aiecompiler 77-23486] ### Exiting Peano ElfGen
INFO: [aiecompiler 77-23810] Completing SchedulerControlPackets pass
Compilation Complete (WARNING:47, CRITICAL-WARNING:0, ERROR:0)
External buffer allocation from unified memory (proving 128 GB shared memory works):
coalesed_weights: 278 KB
coalesed_spills: 803 KB
compute_graph.ofm_ddr: 24 MB (allocated from shared DDR)
VAIML Partition Summary
Number of operators in the model: 1209
Number of operators supported by VAIML: 1157 (95.699%)
GOPs supported by VAIML: 180.261 (99.100%)
Number of subgraphs supported by VAIML: 11
fail_safe_summary.json:
"offload_map": {"AIE": 100, "CPU": 0}
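The partition figures can be cross-checked with a few lines of arithmetic. The sketch below parses the summary text quoted above and derives the number of unsupported operators and the approximate total GOPs. The regular expressions assume the exact wording shown in this report, and `parse_summary` is a hypothetical helper.

```python
import re

SUMMARY = """\
Number of operators in the model: 1209
Number of operators supported by VAIML: 1157 (95.699%)
GOPs supported by VAIML: 180.261 (99.100%)
Number of subgraphs supported by VAIML: 11
"""


def parse_summary(text: str) -> dict:
    """Pull the raw counts out of the summary and derive a few totals."""
    total_ops = int(re.search(r"operators in the model:\s*(\d+)", text).group(1))
    supported_ops = int(
        re.search(r"operators supported by VAIML:\s*(\d+)", text).group(1)
    )
    gops, gops_pct = re.search(
        r"GOPs supported by VAIML:\s*([\d.]+)\s*\(([\d.]+)%\)", text
    ).groups()
    supported_gops = float(gops)
    total_gops = supported_gops / (float(gops_pct) / 100.0)
    return {
        "total_ops": total_ops,
        "supported_ops": supported_ops,
        "unsupported_ops": total_ops - supported_ops,
        "op_coverage_pct": 100.0 * supported_ops / total_ops,
        "supported_gops": supported_gops,
        "estimated_total_gops": total_gops,
        "estimated_cpu_gops": total_gops - supported_gops,
    }


stats = parse_summary(SUMMARY)
for key, value in stats.items():
    print(f"{key}: {value:.3f}" if isinstance(value, float) else f"{key}: {value}")
```

The recomputed operator coverage matches the reported 95.699%, and the 52 operators left on the CPU account for roughly 1.6 GOPs out of an estimated total of about 181.9.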
What We Already Tried
| Approach | VAIML Frontend | AIE Compile | Runtime | Result |
| --- | --- | --- | --- | --- |
| FP32 original | 95.7% | Fail (L1 placement L-93) | CPU fallback | No NPU acceleration |
| FP32 + graph surgery | 95.7% | Fail (L1 placement L-93) | CPU fallback | No NPU acceleration |
| INT8 dynamic (QUInt8) + O1 config | 95.7% | Fail (SCC L-115) | CPU fallback | No NPU acceleration |
| INT8 dynamic (QUInt8) + O1 + vectorized | 95.7% | Pass (0 errors) | CRASH | xir_deserialize_cif |
| INT8 static (QOperator) | 28% | Pass | Pass | 1.07x speedup |
Why This Matters for Customers
- PP-DocLayoutV3 is a mainstream model: PaddleOCR has millions of users. When it does not work on the NPU, customers conclude that the AMD NPU does not support real AI workloads.
- Compilation succeeds but the runtime crashes, which is the most confusing failure mode. A customer would invest hours in model preparation, see "Compilation Complete (0 errors)", then get a cryptic crash with no actionable error message.
- The Windows platform is critical: Ryzen AI laptops ship with Windows. The NPU value proposition is "run AI faster on your laptop". If models compile but crash at runtime, the NPU is dead weight.
- The unified memory architecture is proven to work: our tests show the AIE compiler can allocate 24 MB external buffers from the 128 GB shared pool. The hardware is capable. The software stack has a bug.
Request
- Fix the xir_deserialize_cif crash: the runtime should be able to load any binary that the AIE compiler produces
- Report a clear error message instead of crashing with an access violation: if the runtime cannot handle a particular graph, it should say why
- Test with real-world ONNX models (not just the quicktest reference model): PP-DocLayoutV3, YOLOv8, ResNet-50, and BERT are the models customers actually want to run
Environment Files Available
I can provide upon request:
- The prepared ONNX model (inference_int8.onnx, 104 MB)
- The VAIML config JSON
- Full AIE compiler log (thousands of lines)
- VAIML cache directory contents (partition info, graph nodes, tensor shapes)
- Crash stack trace (full 0xC0000005 dump)
System: AMD Ryzen AI MAX+ 395 (STX) | 128 GB Unified Memory | Windows 11 | Ryzen AI Software 1.7.1