[VAIML Runtime Bug] xir_deserialize_cif crash (0xC0000005) after successful AIE compilation — PP-DocLayoutV3 on Ryzen AI MAX+ 395 (STX) #378

@AIwork4me

Model Download

PP-DocLayoutV3 ONNX model: https://huggingface.co/AlexTransformer/PP-DocLayoutV3-onnx

Customer Impact

I purchased an AMD Ryzen AI MAX+ 395 AI PC laptop specifically for on-device document parsing with ONNX models on the NPU. After extensive model preparation and successful AIE compilation, the VAIML runtime crashes at inference time, making the NPU unusable for real-world document AI workloads. This severely impacts the value proposition of the Ryzen AI platform.

Hardware

| Component | Detail |
|---|---|
| APU | AMD Ryzen AI MAX+ 395 |
| NPU Type | STX (detected by quicktest.py) |
| Memory | 128 GB unified (LPDDR5X) |
| OS | Windows 11 |
| Driver | NPU driver 32.0.22032.6002 |

Software Environment

| Component | Version |
|---|---|
| Ryzen AI Software | 1.7.1 |
| onnxruntime-vitisai | 1.23.3.dev20260320 |
| VAIML (vaip) | commit 9ce31169da2a09a217ab2e1492b3fc9cd39d425c |
| AIE Compiler | 2026.1 (windows64-bit) |
| Python | 3.12 |
| Conda env | ryzen-ai-1.7.1 (official) |

Problem

AIE compilation succeeds with ZERO errors, but runtime inference crashes with Access Violation (0xC0000005) in xir_deserialize_cif().

The crash occurs inside onnxruntime_vitisai_ep.dll when the VAIML runtime attempts to deserialize/load the compiled AIE binary. This is NOT a compilation error — the compiler reports success. The runtime simply cannot load its own output.

Root Cause Analysis

What we proved works:

  1. NPU hardware is functional: quicktest.py passes (test_model.onnx runs on NPU)
  2. VitisAIExecutionProvider is available: ['VitisAIExecutionProvider', 'DmlExecutionProvider', 'CPUExecutionProvider']
  3. VAIML frontend partitioning works: 95.7% of operators (1157/1209) are supported, covering 99.1% of GOPs
  4. Unified memory works: AIE compiler successfully allocates external buffers from the 128 GB shared memory pool
  5. AIE compilation completes: Compilation Complete (WARNING:47, CRITICAL-WARNING:0, ERROR:0)
  6. Session creation succeeds: VitisAIExecutionProvider is selected, session loads in ~760s (first compile) or ~1.4s (cache hit)
  7. Static INT8 quantized model runs: 1.07x speedup over CPU (partial NPU offload, 28% VAIML coverage)

What crashes:

The dynamic INT8 quantized model (QUInt8, per-tensor) passes full AIE compilation but crashes at the very first inference call:

Exception Code: 0xC0000005  (ACCESS_VIOLATION)
  onnxruntime_vitisai_ep.dll!xir_deserialize_cif() + 0x147926
  onnxruntime_vitisai_ep.dll!xir_deserialize_cif() + 0x14E8C1
  onnxruntime_providers_vitisai.dll + 0x21614

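The debugger attributes the faulting frames to xir_deserialize_cif(), which appears to be the function responsible for loading the compiled AIE graph at runtime. The offsets from that symbol are large (+0x147926, +0x14E8C1), so without symbols for the DLL the fault may actually sit in unexported code that follows it. Either way, the compiled binary was produced by the SAME toolchain, yet the runtime cannot load it.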
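
To make the trace above easier to triage, here is a minimal sketch that splits each frame into module, symbol, and offset, and flags frames whose offset from the named symbol is unusually large. The `LARGE_OFFSET` threshold is an arbitrary assumption for illustration, and a large offset is only a heuristic hint that the name is the nearest exported symbol rather than the function that actually faulted.

```python
import re

# Matches frames such as
#   onnxruntime_vitisai_ep.dll!xir_deserialize_cif() + 0x147926
# and symbol-less frames such as
#   onnxruntime_providers_vitisai.dll + 0x21614
FRAME_RE = re.compile(
    r"^\s*(?P<module>[\w.\-]+?)(?:!(?P<symbol>\w+)\(\))?"
    r"\s*\+\s*0x(?P<offset>[0-9A-Fa-f]+)\s*$"
)

# Assumption: 64 KiB is used here only as an illustrative cut-off.
LARGE_OFFSET = 0x10000

TRACE = """\
Exception Code: 0xC0000005  (ACCESS_VIOLATION)
  onnxruntime_vitisai_ep.dll!xir_deserialize_cif() + 0x147926
  onnxruntime_vitisai_ep.dll!xir_deserialize_cif() + 0x14E8C1
  onnxruntime_providers_vitisai.dll + 0x21614
"""


def parse_frame(line):
    """Return a dict for a stack-frame line, or None for any other line."""
    m = FRAME_RE.match(line)
    if m is None:
        return None
    return {
        "module": m["module"],
        "symbol": m["symbol"],  # None when the frame has no symbol name
        "offset": int(m["offset"], 16),
    }


def suspicious_frames(trace):
    """Frames that name a symbol but sit far beyond its start address."""
    frames = [f for f in map(parse_frame, trace.splitlines()) if f]
    return [f for f in frames if f["symbol"] and f["offset"] >= LARGE_OFFSET]


for frame in suspicious_frames(TRACE):
    print(f'{frame["module"]}!{frame["symbol"]} + {frame["offset"]:#x}')
```

Resolving the frames against the DLL's debug symbols, if AMD can provide them, would give the actual function names.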

Model details:

  • Model: PP-DocLayoutV3 (PaddleOCR document layout detection, widely used document AI model)
  • Input: 1x3x800x800 (standard document image)
  • Original format: FP32, 66 MB
  • After graph surgery + INT8 quantization: 104 MB (QUInt8 dynamic quantization)
  • Graph surgery performed: 86 Cast nodes reduced to 12 (removed BOOL to FLOAT pairs, fixed DOUBLE to FLOAT, changed Cast.82 INT32 to FLOAT output); 3 Where(bool,A,B) rewritten as float-mask arithmetic; onnxsim simplification; 2519 Constants converted to Initializers; static shape fixing
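
For readers unfamiliar with the Where rewrite mentioned in the graph surgery, the following is a minimal NumPy sketch of the idea, not the actual ONNX edit applied to the model. The function name `where_as_mask` is hypothetical. It assumes the condition has already been cast to a 0/1 float mask and that both branches are finite.

```python
import numpy as np


def where_as_mask(cond, a, b):
    """Emulate Where(cond, a, b) with float arithmetic only.

    mask is 1.0 where cond is true and 0.0 elsewhere, so the result
    picks a in the first case and b in the second.
    """
    mask = cond.astype(np.float32)
    return mask * a + (1.0 - mask) * b


rng = np.random.default_rng(0)
cond = rng.random((2, 3)) > 0.5
a = rng.standard_normal((2, 3)).astype(np.float32)
b = rng.standard_normal((2, 3)).astype(np.float32)

# For finite inputs the two formulations agree element for element.
same = np.array_equal(np.where(cond, a, b), where_as_mask(cond, a, b))
print("matches np.where:", same)
```

The equivalence breaks when the unselected branch holds inf or NaN, because 0 * inf is NaN under IEEE 754. A model that relies on Where to hide such values needs a different rewrite.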

Reproduction Steps

Prerequisites

REM Run these commands in a Windows Command Prompt (cmd.exe).
REM 1. Install Ryzen AI Software 1.7.1 (official installer)
REM 2. Activate the official conda environment
conda activate ryzen-ai-1.7.1
set RYZEN_AI_INSTALLATION_PATH=C:\Program Files\RyzenAI\1.7.1

REM 3. Install packages
pip install onnx onnxsim numpy

REM 4. Verify NPU
cd /d "%RYZEN_AI_INSTALLATION_PATH%\quicktest"
python quicktest.py
REM Expected: "Detected NPU type: STX" + "Test Finished"

Step 1: Prepare the model

# Download PP-DocLayoutV3 ONNX from PaddlePaddle (or any ONNX model with similar structure)
# Perform graph surgery: fix static shapes, remove Where(bool,A,B), simplify with onnxsim
# Quantize with onnxruntime dynamic quantization (QUInt8, per-tensor)

from onnxruntime.quantization import quantize_dynamic, QuantType
quantize_dynamic("model_fp32.onnx", "model_int8.onnx", weight_type=QuantType.QUInt8, per_channel=False)
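
As background on what "QUInt8, per-tensor" means, here is a minimal NumPy sketch of textbook asymmetric 8-bit quantization with a single scale and zero point for the whole tensor. It is an illustration of the general scheme under that assumption, not ONNX Runtime's exact implementation, and the function names are hypothetical. In dynamic quantization the weights are quantized ahead of time, while activation scales are computed at inference time.

```python
import numpy as np


def quantize_per_tensor_uint8(x):
    """Map a float tensor to uint8 with one (scale, zero_point) pair."""
    # Include 0.0 in the range so that zero is exactly representable.
    lo = min(float(x.min()), 0.0)
    hi = max(float(x.max()), 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # avoid a zero scale for all-zero input
    zero_point = int(round(-lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover an approximation of the original float tensor."""
    return (q.astype(np.float32) - zero_point) * scale


rng = np.random.default_rng(0)
weights = rng.standard_normal((64, 64)).astype(np.float32)

q, scale, zero_point = quantize_per_tensor_uint8(weights)
max_error = float(np.abs(dequantize(q, scale, zero_point) - weights).max())
print(f"scale={scale:.6f} zero_point={zero_point} max_error={max_error:.6f}")
```

The round-trip error is bounded by roughly half the scale, which is why a tensor with a few large outliers loses precision everywhere under a per-tensor scheme.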

Step 2: Create VAIML config

{
  "passes": [
    {"name": "init", "plugin": "vaip-pass_init"},
    {"name": "vaiml_partition", "plugin": "vaip-pass_vaiml_partition",
     "vaiml_config": {
       "device": "stx",
       "optimize_level": 1,
       "preferred_data_storage": "vectorized",
       "enable_f32_to_bf16_conversion": true
     }}
  ],
  "target": "VAIML",
  "targets": [{"name": "VAIML", "pass": ["init", "vaiml_partition"]}]
}
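
Before spending ~760s on a compile, it can help to check that the config is valid JSON and internally consistent. The sketch below uses only the standard library. The two rules it checks (every pass named by a target is defined under "passes", and "target" names an entry under "targets") are inferred from the config above, not taken from VAIML documentation, and `check_pass_references` is a hypothetical helper.

```python
import json

CONFIG_TEXT = """
{
  "passes": [
    {"name": "init", "plugin": "vaip-pass_init"},
    {"name": "vaiml_partition", "plugin": "vaip-pass_vaiml_partition",
     "vaiml_config": {
       "device": "stx",
       "optimize_level": 1,
       "preferred_data_storage": "vectorized",
       "enable_f32_to_bf16_conversion": true
     }}
  ],
  "target": "VAIML",
  "targets": [{"name": "VAIML", "pass": ["init", "vaiml_partition"]}]
}
"""


def check_pass_references(config):
    """Return a list of human-readable problems; empty means consistent."""
    problems = []
    defined = {p["name"] for p in config.get("passes", [])}
    targets = config.get("targets", [])

    # Assumed rule: the selected target must be declared under "targets".
    if config.get("target") not in {t["name"] for t in targets}:
        problems.append(f'target {config.get("target")!r} is not declared')

    # Assumed rule: a target may only reference passes that are defined.
    for target in targets:
        for name in target.get("pass", []):
            if name not in defined:
                problems.append(f'{target["name"]}: unknown pass {name!r}')
    return problems


config = json.loads(CONFIG_TEXT)  # raises json.JSONDecodeError on bad syntax
print(check_pass_references(config) or "config is consistent")
```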

Step 3: Create session and run inference (CRASHES)

import os
os.environ['RYZEN_AI_INSTALLATION_PATH'] = r'C:\Program Files\RyzenAI\1.7.1'

import numpy as np
import onnxruntime as ort

model = "model_int8.onnx"
config = "vaiml_config.json"

inputs = {
    'image': np.random.rand(1, 3, 800, 800).astype(np.float32),
    'im_shape': np.array([[800, 800]], dtype=np.float32),
    'scale_factor': np.array([[1.0, 1.0]], dtype=np.float32),
}

# This succeeds (AIE compilation, ~760s first time, ~1.4s cached)
so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess = ort.InferenceSession(model, sess_options=so,
    providers=['VitisAIExecutionProvider', 'CPUExecutionProvider'],
    provider_options=[{'config_file': config}, {}])

# This CRASHES (0xC0000005 Access Violation)
out = sess.run(None, inputs)

# Only reached when inference succeeds
print("Output shapes:", [o.shape for o in out])

Expected output:

Output shapes: [(300, 7), (1,), (300, 200, 200)]

Actual output:

Exception Code: 0xC0000005
onnxruntime_vitisai_ep.dll!xir_deserialize_cif() + 0x147926

(Process terminates immediately)
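
Because the access violation kills the whole Python process, no exception handler in the script can catch it. One workaround while investigating is to run the inference script in a child process and inspect its exit code. This is a minimal sketch using only the standard library. `run_inference.py` in the usage note is a hypothetical file name for the Step 3 script, and the helper names are illustrative.

```python
import subprocess
import sys

# NTSTATUS value Windows reports for an access violation.
ACCESS_VIOLATION = 0xC0000005


def describe_exit(returncode):
    """Translate a process return code into a short description."""
    # Mask to 32 bits so a signed report (-1073741819) is also recognised.
    code = returncode & 0xFFFFFFFF
    if code == 0:
        return "ok"
    if code == ACCESS_VIOLATION:
        return "access violation (0xC0000005)"
    return f"exit code 0x{code:08X}"


def run_isolated(argv, timeout_s=1800):
    """Run a command in a child process so a crash cannot kill the caller."""
    proc = subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout_s
    )
    return describe_exit(proc.returncode), proc.stdout, proc.stderr
```

Usage would look like `status, out, err = run_isolated([sys.executable, "run_inference.py"])`. The parent survives the crash and can log the status, retry with a different config, or fall back to the CPU provider.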

AIE Compilation Log (showing SUCCESS)

INFO: [aiecompiler 77-6272] Completing Scheduler pass
INFO: [aiecompiler 77-6497] runtime-opt stats: Avoided compilation of 75 cores out of 80 cores.
INFO: [aiecompiler 77-23486] ### Exiting Peano ElfGen
INFO: [aiecompiler 77-23810] Completing SchedulerControlPackets pass

Compilation Complete
(WARNING:47, CRITICAL-WARNING:0, ERROR:0)
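
When sifting through thousands of log lines, the summary counts can be extracted automatically. The sketch below parses the summary line shown above. The pattern is based only on this excerpt, so other aiecompiler versions may format the line differently, and `parse_summary` is a hypothetical helper.

```python
import re

# Pattern taken from the summary line in this log excerpt.
SUMMARY_RE = re.compile(
    r"\(WARNING:(\d+),\s*CRITICAL-WARNING:(\d+),\s*ERROR:(\d+)\)"
)

LOG_EXCERPT = """\
INFO: [aiecompiler 77-23810] Completing SchedulerControlPackets pass

Compilation Complete
(WARNING:47, CRITICAL-WARNING:0, ERROR:0)
"""


def parse_summary(log_text):
    """Return the warning/error counts, or None if no summary is found."""
    m = SUMMARY_RE.search(log_text)
    if m is None:
        return None
    warnings, critical, errors = map(int, m.groups())
    return {
        "warning": warnings,
        "critical_warning": critical,
        "error": errors,
    }


summary = parse_summary(LOG_EXCERPT)
print(summary)
```

A clean summary only shows that the compiler finished. As this issue demonstrates, it does not guarantee that the runtime can load the result.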

External buffer allocation from unified memory (proving 128 GB shared memory works):

coalesed_weights:         278 KB
coalesed_spills:          803 KB
compute_graph.ofm_ddr:     24 MB  (allocated from shared DDR)

VAIML Partition Summary

Number of operators in the model: 1209
Number of operators supported by VAIML: 1157 (95.699%)
GOPs supported by VAIML: 180.261 (99.100%)
Number of subgraphs supported by VAIML: 11

fail_safe_summary.json:
  "offload_map": {"AIE": 100, "CPU": 0}
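
The percentages in the partition summary can be cross-checked with a few lines of arithmetic. The operator counts and GOPs figures below are copied from the summary. The total GOPs value is derived here from the reported 99.100% and is therefore an estimate, not a number from the log.

```python
# Values copied from the VAIML partition summary above.
ops_total = 1209
ops_supported = 1157
gops_supported = 180.261
gops_fraction = 0.99100  # reported share of GOPs supported by VAIML

# Operator coverage, which should reproduce the reported 95.699%.
op_coverage_pct = 100.0 * ops_supported / ops_total
ops_unsupported = ops_total - ops_supported

# Derived estimate of the model's total compute and the part left over.
gops_total = gops_supported / gops_fraction
gops_unsupported = gops_total - gops_supported

print(f"operator coverage: {op_coverage_pct:.3f}%")
print(f"operators outside VAIML: {ops_unsupported}")
print(f"estimated total GOPs: {gops_total:.3f}")
print(f"estimated GOPs outside VAIML: {gops_unsupported:.3f}")
```

The operators that VAIML does not support account for under 1% of the compute, so nearly all of the model's work lands in the 11 VAIML subgraphs.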

What We Already Tried

| Approach | VAIML Frontend | AIE Compile | Runtime | Result |
|---|---|---|---|---|
| FP32 original | 95.7% | Fail (L1 placement L-93) | CPU fallback | No NPU acceleration |
| FP32 + graph surgery | 95.7% | Fail (L1 placement L-93) | CPU fallback | No NPU acceleration |
| INT8 dynamic (QUInt8) + O1 config | 95.7% | Fail (SCC L-115) | CPU fallback | No NPU acceleration |
| INT8 dynamic (QUInt8) + O1 + vectorized | 95.7% | Pass (0 errors) | CRASH | xir_deserialize_cif |
| INT8 static (QOperator) | 28% | Pass | Pass | 1.07x speedup |

Why This Matters for Customers

  1. PP-DocLayoutV3 is a mainstream model — PaddleOCR has millions of users. When it does not work on NPU, customers conclude AMD NPU does not support real AI workloads.

  2. The compilation succeeds but runtime crashes — this is the most confusing failure mode. A customer would invest hours in model preparation, see "Compilation Complete (0 errors)", then get a cryptic crash with no actionable error message.

  3. Windows platform is critical: Ryzen AI laptops ship with Windows. The NPU value proposition is "run AI faster on your laptop". If models compile but crash at runtime, the NPU is dead weight.

  4. The unified memory architecture is proven to work — our tests show AIE can allocate 24 MB external buffers from the 128 GB shared pool. The hardware is capable. The software stack has a bug.

Request

  1. Fix the xir_deserialize_cif crash — the runtime should be able to load any binary that the AIE compiler produces
  2. Add a clear error message instead of an access violation: if the runtime cannot handle a particular graph, report why
  3. Test with real-world ONNX models (not just the quicktest reference model) — PP-DocLayoutV3, YOLOv8, ResNet-50, BERT are the models customers actually want to run

Environment Files Available

I can provide upon request:

  • The prepared ONNX model (inference_int8.onnx, 104 MB)
  • The VAIML config JSON
  • Full AIE compiler log (thousands of lines)
  • VAIML cache directory contents (partition info, graph nodes, tensor shapes)
  • Crash stack trace (full 0xC0000005 dump)

System: AMD Ryzen AI MAX+ 395 (STX) | 128 GB Unified Memory | Windows 11 | Ryzen AI Software 1.7.1
