[VAIML Runtime Bug] xir_deserialize_cif crash (0xC0000005) after successful AIE compilation — PP-DocLayoutV3 on Ryzen AI MAX+ 395 (STX)

## Model Download

**PP-DocLayoutV3 ONNX model**: https://huggingface.co/AlexTransformer/PP-DocLayoutV3-onnx

## Customer Impact

I purchased an AMD Ryzen AI MAX+ 395 AI PC laptop specifically for on-device document parsing with ONNX models on the NPU. After extensive model preparation and successful AIE compilation, the VAIML runtime crashes at inference time, making the NPU unusable for real-world document AI workloads. This severely impacts the value proposition of the Ryzen AI platform.

## Hardware

| Component | Detail |
|-----------|--------|
| APU | AMD Ryzen AI MAX+ 395 |
| NPU Type | STX (detected by `quicktest.py`) |
| Memory | 128 GB unified (LPDDR5X) |
| OS | Windows 11 |
| Driver | NPU driver 32.0.22032.6002 |

## Software Environment

| Component | Version |
|-----------|---------|
| Ryzen AI Software | 1.7.1 |
| onnxruntime-vitisai | 1.23.3.dev20260320 |
| VAIML (vaip) commit | `9ce31169da2a09a217ab2e1492b3fc9cd39d425c` |
| AIE Compiler | 2026.1 (windows64-bit) |
| Python | 3.12 |
| Conda env | `ryzen-ai-1.7.1` (official) |

## Problem

**AIE compilation succeeds with ZERO errors, but runtime inference crashes with Access Violation (0xC0000005) in `xir_deserialize_cif()`.**

The crash occurs inside `onnxruntime_vitisai_ep.dll` when the VAIML runtime attempts to deserialize/load the compiled AIE binary. This is NOT a compilation error — the compiler reports success. The runtime simply cannot load its own output.

## Root Cause Analysis

### What we proved works:

1. **NPU hardware is functional** — `quicktest.py` passes (test_model.onnx runs on NPU)
2. **VitisAIExecutionProvider is available** — `['VitisAIExecutionProvider', 'DmlExecutionProvider', 'CPUExecutionProvider']`
3. **VAIML frontend partitioning works** — 95.7% of operators (1157/1209) are supported, covering 99.1% of GOPs
4. **Unified memory works** — AIE compiler successfully allocates external buffers from the 128 GB shared memory pool
5. **AIE compilation completes** — `Compilation Complete (WARNING:47, CRITICAL-WARNING:0, ERROR:0)`
6. **Session creation succeeds** — VitisAIExecutionProvider is selected, session loads in ~760s (first compile) or ~1.4s (cache hit)
7. **Static INT8 quantized model runs** — 1.07x speedup over CPU (partial NPU offload, 28% VAIML coverage)

### What crashes:

The **dynamic INT8 quantized model** (QUInt8, per-tensor) passes full AIE compilation but crashes at the very first inference call:

```
Exception Code: 0xC0000005  (ACCESS_VIOLATION)
  onnxruntime_vitisai_ep.dll!xir_deserialize_cif() + 0x147926
  onnxruntime_vitisai_ep.dll!xir_deserialize_cif() + 0x14E8C1
  onnxruntime_providers_vitisai.dll + 0x21614
```

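Both top frames resolve to `xir_deserialize_cif()`, which suggests the fault is in the code path that loads the compiled AIE graph at runtime. The offsets are large (about 1.3 MB past the symbol), so without PDB symbols the faulting code may be an unexported internal function near that export rather than the deserializer itself. Either way, the compiled binary was produced by the same toolchain, yet the runtime cannot load it.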

### Model details:

- **Model**: PP-DocLayoutV3 (PaddleOCR document layout detection, widely used document AI model)
- **Input**: 1x3x800x800 (standard document image)
- **Original format**: FP32, 66 MB
- **After graph surgery + INT8 quantization**: 104 MB (QUInt8 dynamic quantization)
- **Graph surgery performed**: 86 Cast nodes reduced to 12 (removed BOOL to FLOAT pairs, fixed DOUBLE to FLOAT, changed Cast.82 INT32 to FLOAT output); 3 Where(bool,A,B) rewritten as float-mask arithmetic; onnxsim simplification; 2519 Constants converted to Initializers; static shape fixing
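
The node counts quoted above (for example, 86 Cast nodes reduced to 12) can be re-checked with a small op-type histogram. This is a minimal sketch, not the script used for the original surgery: `op_histogram`, `load_graph_nodes`, and `surgery_report` are hypothetical helper names, and only top-level graph nodes are counted (nodes inside subgraphs of `If`/`Loop` are ignored).

```python
from collections import Counter
from types import SimpleNamespace


def op_histogram(nodes):
    """Count nodes by op_type; accepts any iterable of objects with .op_type."""
    return Counter(node.op_type for node in nodes)


def load_graph_nodes(path):
    """Return the top-level graph nodes of an ONNX file (requires the onnx package)."""
    import onnx  # imported lazily so the counting helpers have no hard dependency
    return onnx.load(path).graph.node


def surgery_report(before, after, ops=("Cast", "Where", "Constant")):
    """Return {op_type: (count_before, count_after)} for the op types of interest."""
    hist_before, hist_after = op_histogram(before), op_histogram(after)
    return {op: (hist_before[op], hist_after[op]) for op in ops}


if __name__ == "__main__":
    # Stand-in nodes for illustration only; with real files, use e.g.
    # load_graph_nodes("model_fp32.onnx") for both arguments.
    before = [SimpleNamespace(op_type=t) for t in ["Cast", "Cast", "Where", "Conv"]]
    after = [SimpleNamespace(op_type=t) for t in ["Cast", "Conv", "Mul"]]
    print(surgery_report(before, after))
```

Running this against the FP32 model and the post-surgery model gives a quick before/after comparison that can be attached to the report.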

## Reproduction Steps

### Prerequisites

```powershell
# 1. Install Ryzen AI Software 1.7.1 (official installer)
# 2. Activate the official conda environment
conda activate ryzen-ai-1.7.1
$env:RYZEN_AI_INSTALLATION_PATH = "C:\Program Files\RyzenAI\1.7.1"

# 3. Install packages
pip install onnx onnxsim numpy

# 4. Verify NPU
cd "$env:RYZEN_AI_INSTALLATION_PATH\quicktest"
python quicktest.py
# Expected: "Detected NPU type: STX" + "Test Finished"
```

### Step 1: Prepare the model

```python
# Download PP-DocLayoutV3 ONNX from PaddlePaddle (or any ONNX model with similar structure)
# Perform graph surgery: fix static shapes, remove Where(bool,A,B), simplify with onnxsim
# Quantize with onnxruntime dynamic quantization (QUInt8, per-tensor)

from onnxruntime.quantization import quantize_dynamic, QuantType
quantize_dynamic("model_fp32.onnx", "model_int8.onnx", weight_type=QuantType.QUInt8, per_channel=False)
```

### Step 2: Create VAIML config

```json
{
  "passes": [
    {"name": "init", "plugin": "vaip-pass_init"},
    {"name": "vaiml_partition", "plugin": "vaip-pass_vaiml_partition",
     "vaiml_config": {
       "device": "stx",
       "optimize_level": 1,
       "preferred_data_storage": "vectorized",
       "enable_f32_to_bf16_conversion": true
     }}
  ],
  "target": "VAIML",
  "targets": [{"name": "VAIML", "pass": ["init", "vaiml_partition"]}]
}
```
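
To rule out a malformed config as a factor, the JSON above can be generated and sanity-checked from Python before the session is created. The sketch below only verifies internal name consistency (the selected target exists, and every pass a target lists is defined). It is not a check against any official VAIML schema, and `check_vaip_config` and `write_config` are hypothetical helper names.

```python
import json

# Same content as the JSON config above.
CONFIG = {
    "passes": [
        {"name": "init", "plugin": "vaip-pass_init"},
        {
            "name": "vaiml_partition",
            "plugin": "vaip-pass_vaiml_partition",
            "vaiml_config": {
                "device": "stx",
                "optimize_level": 1,
                "preferred_data_storage": "vectorized",
                "enable_f32_to_bf16_conversion": True,
            },
        },
    ],
    "target": "VAIML",
    "targets": [{"name": "VAIML", "pass": ["init", "vaiml_partition"]}],
}


def check_vaip_config(cfg):
    """Return a list of consistency problems; an empty list means none were found."""
    problems = []
    defined = {p["name"] for p in cfg.get("passes", [])}
    target_names = {t["name"] for t in cfg.get("targets", [])}
    if cfg.get("target") not in target_names:
        problems.append(f"target {cfg.get('target')!r} not listed in targets")
    for target in cfg.get("targets", []):
        for name in target.get("pass", []):
            if name not in defined:
                problems.append(
                    f"target {target['name']!r} uses undefined pass {name!r}"
                )
    return problems


def write_config(path="vaiml_config.json"):
    """Write CONFIG to disk after checking it; raises ValueError on problems."""
    problems = check_vaip_config(CONFIG)
    if problems:
        raise ValueError("; ".join(problems))
    with open(path, "w", encoding="utf-8") as f:
        json.dump(CONFIG, f, indent=2)
```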

### Step 3: Create session and run inference (CRASHES)

```python
import os
os.environ['RYZEN_AI_INSTALLATION_PATH'] = r'C:\Program Files\RyzenAI\1.7.1'

import numpy as np
import onnxruntime as ort

model = "model_int8.onnx"
config = "vaiml_config.json"

inputs = {
    'image': np.random.rand(1, 3, 800, 800).astype(np.float32),
    'im_shape': np.array([[800, 800]], dtype=np.float32),
    'scale_factor': np.array([[1.0, 1.0]], dtype=np.float32),
}

# This succeeds (AIE compilation, ~760s first time, ~1.4s cached)
so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess = ort.InferenceSession(model, sess_options=so,
    providers=['VitisAIExecutionProvider', 'CPUExecutionProvider'],
    provider_options=[{'config_file': config}, {}])

# This CRASHES (0xC0000005 Access Violation)
out = sess.run(None, inputs)

# Never reached on the failing configuration
print("Output shapes:", [o.shape for o in out])
```

### Expected output:
```
Output shapes: [(300, 7), (1,), (300, 200, 200)]
```

### Actual output:
```
Exception Code: 0xC0000005
onnxruntime_vitisai_ep.dll!xir_deserialize_cif() + 0x147926
```
(Process terminates immediately)
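
Because the access violation terminates the interpreter, one way to keep a test harness alive and still get a readable status is to run the failing call in a child process. The sketch below is an assumption about how that could be done, not part of the Ryzen AI tooling: `run_isolated` and `describe_exit` are hypothetical helper names, and the NTSTATUS table lists only a few well-known codes. The child is started with `-X faulthandler` so that Python dumps its own traceback to stderr on a fatal error.

```python
import subprocess
import sys

# A small, illustrative subset of Windows NTSTATUS codes.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
    0xC0000409: "STATUS_STACK_BUFFER_OVERRUN",
}


def describe_exit(returncode):
    """Turn a process return code into a short human-readable status string."""
    if returncode == 0:
        return "ok"
    # The same crash can be reported as signed or unsigned 32-bit; normalize it.
    code = returncode & 0xFFFFFFFF
    name = NTSTATUS_NAMES.get(code)
    if name:
        return f"0x{code:08X} ({name})"
    return f"exit code {returncode}"


def run_isolated(code, timeout=None):
    """Run Python source text in a child interpreter and report how it exited.

    Returns (returncode, description, stderr_text).
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, describe_exit(proc.returncode), proc.stderr


if __name__ == "__main__":
    # Trivial child for illustration; pass the Step 3 script text here instead.
    print(run_isolated("print('child ran')")[:2])
```

With the Step 3 script passed as `code`, the parent process should survive the crash and can log the decoded status together with the child's stderr.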

## AIE Compilation Log (showing SUCCESS)

```
INFO: [aiecompiler 77-6272] Completing Scheduler pass
INFO: [aiecompiler 77-6497] runtime-opt stats: Avoided compilation of 75 cores out of 80 cores.
INFO: [aiecompiler 77-23486] ### Exiting Peano ElfGen
INFO: [aiecompiler 77-23810] Completing SchedulerControlPackets pass

Compilation Complete
(WARNING:47, CRITICAL-WARNING:0, ERROR:0)
```

External buffer allocation from unified memory (proving 128 GB shared memory works):
```
coalesed_weights:         278 KB
coalesed_spills:          803 KB
compute_graph.ofm_ddr:     24 MB  (allocated from shared DDR)
```

## VAIML Partition Summary

```
Number of operators in the model: 1209
Number of operators supported by VAIML: 1157 (95.699%)
GOPs supported by VAIML: 180.261 (99.100%)
Number of subgraphs supported by VAIML: 11

fail_safe_summary.json:
  "offload_map": {"AIE": 100, "CPU": 0}
```
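
The percentages in the summary can be cross-checked with a few lines of arithmetic. This sketch uses only the numbers printed above; the total GOPs figure is derived from them and does not appear in the log.

```python
def pct(part, whole, digits=3):
    """Percentage of part in whole, rounded to the given number of decimals."""
    return round(100.0 * part / whole, digits)


# Values copied from the partition summary above.
ops_total, ops_supported = 1209, 1157
gops_supported, gops_pct = 180.261, 99.100

ops_pct = pct(ops_supported, ops_total)           # share of operators on VAIML
ops_on_cpu = ops_total - ops_supported            # operators left to the CPU
gops_total = gops_supported / (gops_pct / 100.0)  # derived, not in the log

if __name__ == "__main__":
    print(f"operators supported: {ops_pct}%")
    print(f"operators not supported: {ops_on_cpu}")
    print(f"derived total GOPs: {gops_total:.3f}")
```

The 52 unsupported operators account for less than 1% of the compute, which is consistent with the 99.1% GOPs coverage reported by the frontend.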

## What We Already Tried

| Approach | VAIML Frontend | AIE Compile | Runtime | Result |
|----------|:---:|:---:|:---:|--------|
| FP32 original | 95.7% | Fail (L1 placement L-93) | CPU fallback | No NPU acceleration |
| FP32 + graph surgery | 95.7% | Fail (L1 placement L-93) | CPU fallback | No NPU acceleration |
| INT8 dynamic (QUInt8) + O1 config | 95.7% | Fail (SCC L-115) | CPU fallback | No NPU acceleration |
| INT8 dynamic (QUInt8) + O1 + vectorized | 95.7% | Pass (0 errors) | **CRASH** | `xir_deserialize_cif` |
| INT8 static (QOperator) | 28% | Pass | Pass | 1.07x speedup |

## Why This Matters for Customers

1. **PP-DocLayoutV3 is a mainstream model** — PaddleOCR has millions of users. When it does not work on NPU, customers conclude AMD NPU does not support real AI workloads.

2. **The compilation succeeds but runtime crashes** — this is the most confusing failure mode. A customer would invest hours in model preparation, see "Compilation Complete (0 errors)", then get a cryptic crash with no actionable error message.

3. **Windows platform is critical**: Ryzen AI laptops ship with Windows. The NPU value proposition is "run AI faster on your laptop". If models compile but crash at runtime, the NPU is dead weight.

4. **The unified memory architecture is proven to work** — our tests show AIE can allocate 24 MB external buffers from the 128 GB shared pool. The hardware is capable. The software stack has a bug.

## Request

1. **Fix the `xir_deserialize_cif` crash** — the runtime should be able to load any binary that the AIE compiler produces
2. **Add a clear error message** instead of an access violation: if the runtime cannot handle a particular graph, report why
3. **Test with real-world ONNX models** (not just the quicktest reference model) — PP-DocLayoutV3, YOLOv8, ResNet-50, BERT are the models customers actually want to run

## Environment Files Available

I can provide upon request:
- The prepared ONNX model (`inference_int8.onnx`, 104 MB; called `model_int8.onnx` in the reproduction steps)
- The VAIML config JSON
- Full AIE compiler log (thousands of lines)
- VAIML cache directory contents (partition info, graph nodes, tensor shapes)
- Crash stack trace (full `0xC0000005` dump)

---

**System:** AMD Ryzen AI MAX+ 395 (STX) | 128 GB Unified Memory | Windows 11 | Ryzen AI Software 1.7.1

