[Cadence: Vision] ResNet18 & ResNet50: Optimized, DMA-enabled, functional #19111
cad-rlc wants to merge 48 commits into pytorch:main from
Conversation
Differential Revision: [D60101911](https://our.internmc.facebook.com/intern/diff/D60101911)
This requires moving from create_runtime API v2 to v4. The change should be backwards compatible (i.e., old PTE files should still load), and it should also work with slightly older versions of the XNNPACK third-party library, since v4 was introduced two years ago. This patch adds a new workspace pointer member to the XnnpackBackend instance. The same should eventually be done for the weight cache, which is left as a TODO here for now.
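For context, a minimal sketch of the v4 call with a workspace. It assumes the public XNNPACK C API (`xnn_create_workspace` / `xnn_create_runtime_v4`); the wrapper function and the idea of storing the workspace on the backend instance follow the description above, but the exact names in the patch may differ.

```c
#include <stddef.h>
#include <xnnpack.h>

/* Sketch only: create one shared workspace per backend instance and pass
 * it to xnn_create_runtime_v4. The weights-cache argument stays NULL,
 * matching the TODO above; `subgraph` is an already-built subgraph. */
enum xnn_status create_runtime_with_workspace(
    xnn_subgraph_t subgraph,
    xnn_workspace_t* workspace_out, /* to be stored on XnnpackBackend */
    xnn_runtime_t* runtime_out) {
  enum xnn_status st = xnn_create_workspace(workspace_out);
  if (st != xnn_status_success) {
    return st;
  }
  return xnn_create_runtime_v4(
      subgraph,
      /*weights_cache=*/NULL, /* TODO: share the weight cache as well */
      *workspace_out,
      /*threadpool=*/NULL,
      /*flags=*/0,
      runtime_out);
}
```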
Resolving functional errors
Minor code modification in the ping-pong process
Correcting the MIN_FLT32 value and adding MIN_ABS_FLT32.
…into stable-branch
# Conflicts:
# Makefile
# backends/cadence/aot/ref_implementations.py
# backends/cadence/generic/operators/CMakeLists.txt
# backends/cadence/generic/operators/op_dequantize_per_tensor.cpp
# backends/cadence/generic/operators/op_im2row.cpp
# backends/cadence/generic/operators/op_quantize_per_tensor.cpp
# backends/cadence/generic/operators/op_quantized_layer_norm.cpp
# backends/cadence/generic/operators/op_requantize.cpp
# backends/cadence/generic/operators/quantized_add_out.cpp
# backends/cadence/generic/operators/quantized_conv2d_nchw_out.cpp
# backends/cadence/generic/operators/quantized_conv2d_nhwc_out.cpp
# backends/cadence/generic/operators/quantized_fully_connected_out.cpp
# backends/cadence/generic/operators/quantized_linear_out.cpp
# backends/cadence/generic/operators/quantized_matmul_out.cpp
# backends/cadence/generic/operators/quantized_relu_out.cpp
# backends/cadence/runtime/TARGETS
# backends/cadence/utils/runtime/BUCK
# backends/cadence/utils/runtime/TARGETS
# backends/cadence/vision/kernels/kernels.cpp
# backends/cadence/vision/kernels/targets.bzl
# backends/cadence/vision/operators/operators.h
# backends/cadence/vision/operators/targets.bzl
# backends/cadence/vision/third-party/targets.bzl
# install_requirements.py
…rlap fix
Summary:
All 20 conv layers + 1 maxpool layer now run via DMA-tiled kernels.
ResNet18 int8 quantized 64x64: 47.2M cycles, 57.6x speedup over generic.
Config generator:
- generate_combined_configs.py: extracts conv2d + maxpool from PTE files
into a single combined header layer_configs.h
- generate_layer_configs.py: kernel names use _dma/_no_dma suffixes
- resnet18_layers.json: extracted layer params for ResNet18
Operators:
- layer_configs.h: combined header with 29 conv + 1 maxpool configs (an illustrative entry format is sketched after this list)
- conv_kernel_dispatcher.c: _dma/_no_dma kernel name suffixes,
CONV_DISPATCH printf for all branches
- All includes migrated from separate conv/maxpool headers to
combined operators/layer_configs.h
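An illustrative entry format for such a combined header is sketched below; the field names and example values are assumptions for illustration, not the PR's actual layout.

```c
/* Hypothetical shape of a combined layer_configs.h entry. Each record
 * carries the shape parameters the dispatcher needs to choose between
 * the _dma and _no_dma kernel variants. */
typedef struct {
  const char* kernel; /* e.g. "conv2d_asym8s_dma" or "maxpool_mxnj2_dma" */
  int in_c, in_h, in_w; /* input dims, NCHW with N = 1 */
  int out_c;            /* output channels (== in_c for maxpool) */
  int k_h, k_w;         /* kernel size */
  int stride_h, stride_w;
  int pad_h, pad_w;
} layer_config_t;

static const layer_config_t kLayerConfigs[] = {
    /* kernel                Cin  H   W  Cout kh kw sh sw ph pw */
    {"conv2d_asym8s_dma",     3, 64, 64,  64,  7, 7, 2, 2, 3, 3},
    {"maxpool_mxnj2_dma",    64, 32, 32,  64,  3, 3, 2, 2, 1, 1},
    /* ... remaining conv configs ... */
};
```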
Maxpool DMA executor:
- maxpool_exec_2x2j2.c: DMA-tiled executor with ping-pong buffers
- Supports arbitrary kernel sizes with overlap handling:
per-tile source row computed from output_rows * stride_h - pad_h,
MIN_FLT32 fill provides top/left/bottom/right padding (see the sketch after this list)
- op_max_pool2d_with_indices.cpp: DMA path via config lookup
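A minimal sketch of that overlap and padding logic follows. The helper names (`fill_row_f32`, `dma_load_row`) and the `MIN_FLT32` definition are illustrative assumptions; only the row arithmetic comes from the description above.

```c
#include <float.h>

#define MIN_FLT32 (-FLT_MAX) /* assumed definition: smallest finite float */

/* Illustrative helpers standing in for the executor's real routines. */
void fill_row_f32(float* dst, int count, float value);
void dma_load_row(float* dst, const float* input, int src_row);

/* Load the source rows one tile needs, including the kernel overlap.
 * out_row0 is the tile's first output row; rows falling outside
 * [0, input_h) are filled with MIN_FLT32 so padded positions can never
 * win the max reduction. */
void load_tile_rows(float* tile_buf, const float* input, int input_h,
                    int row_floats, int out_row0, int tile_out_rows,
                    int kernel_h, int stride_h, int pad_h) {
  int src_row0 = out_row0 * stride_h - pad_h;               /* may be < 0 */
  int src_rows = (tile_out_rows - 1) * stride_h + kernel_h; /* with overlap */

  for (int r = 0; r < src_rows; ++r) {
    int src_r = src_row0 + r;
    if (src_r < 0 || src_r >= input_h) {
      fill_row_f32(&tile_buf[r * row_floats], row_floats, MIN_FLT32);
    } else {
      dma_load_row(&tile_buf[r * row_floats], input, src_r);
    }
  }
}
```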
Logs:
- resnet18_all_dma.log: inference log, 57.6x speedup, Top-1 class 111
- resnet18_all_dma_vs_generic.txt: per-op performance comparison
- Rename maxpool executor: maxpool_exec_2x2j2 -> maxpool_exec_mxnj2 (arbitrary kernel size, stride-2)
- Add mean_exec_dma.c and mean_executors.h for SIMD-optimized mean operator
- Remove CADENCE_CONV2D_GENERIC macro and all debug printf from vision/operators
- Add DMA buffer config headers for multiple DRAM sizes (4k/8k/16k/24k/32k/61k); a sketch of such a header follows below
- Reorganize logs: remove old scattered logs, add structured per-model DRAM-sweep logs
- Add layerwise performance reports for ResNet18 (cache + no-cache cores)
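As a rough illustration, one of those per-DRAM-size headers might look like the following; the macro names are assumptions, and only the size sweep itself comes from the change above.

```c
/* Hypothetical 16k variant of a DMA buffer config header. */
#ifndef DMA_BUF_CONFIG_H
#define DMA_BUF_CONFIG_H

#define DMA_DRAM_BYTES     (16 * 1024) /* local data-RAM budget */
#define DMA_PING_PONG_BUFS 2           /* double buffering */
#define DMA_TILE_BYTES     (DMA_DRAM_BYTES / DMA_PING_PONG_BUFS)

#endif /* DMA_BUF_CONFIG_H */
```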
# Conflicts:
# backends/cadence/vision/operators/op_quantized_conv_out.cpp
…ntized_conv_out.cpp
- Merge quantized_conv2d_nchw_out_per_tensor.cpp into op_quantized_conv_out.cpp
- Add DMA-optimized NCHW conv path with XAI kernel dispatch
- Add 6 specialized typed variants (asym8s, asym8u, dilated, depthwise); a dispatch sketch follows below
- Add 4 conv1d variants (ncl, nlc) in generic::native namespace
- Remove old quantized_conv2d_nchw_out_per_tensor.cpp
- Update CMakeLists.txt to remove old file reference
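The variant selection could look roughly like the sketch below; the function names are hypothetical stand-ins for the six typed variants, and the selection criteria (signedness, dilation, depthwise grouping) are inferred from the variant names above.

```c
/* Hypothetical kernel entry points for the six typed conv2d variants. */
void conv2d_nchw_asym8s(void);
void conv2d_nchw_asym8u(void);
void conv2d_nchw_asym8s_dilated(void);
void conv2d_nchw_asym8u_dilated(void);
void conv2d_nchw_asym8s_depthwise(void);
void conv2d_nchw_asym8u_depthwise(void);

typedef void (*conv_kernel_fn)(void);

/* Pick a specialized kernel from the layer's properties. */
static conv_kernel_fn pick_conv_kernel(int is_signed, int dilated,
                                       int depthwise) {
  if (depthwise) {
    return is_signed ? conv2d_nchw_asym8s_depthwise
                     : conv2d_nchw_asym8u_depthwise;
  }
  if (dilated) {
    return is_signed ? conv2d_nchw_asym8s_dilated
                     : conv2d_nchw_asym8u_dilated;
  }
  return is_signed ? conv2d_nchw_asym8s : conv2d_nchw_asym8u;
}
```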
Summary
Optimized Cadence Vision DSP operators for ResNet18 and ResNet50 inference. All operators are DMA-enabled with ping-pong tiling and functionally verified (int8 quantized, NCHW layout).
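For readers unfamiliar with the pattern, a minimal ping-pong tiling loop looks roughly like this; `dma_load_async`, `dma_wait`, and `compute_tile` are hypothetical stand-ins for the platform's DMA primitives and the per-tile kernel.

```c
#include <stdint.h>

#define TILE_BYTES 4096 /* assumed tile size for illustration */

/* Hypothetical platform primitives. */
void dma_load_async(void* dst, const void* src, int bytes);
void dma_wait(const void* buf);
void compute_tile(const int8_t* in, int8_t* out);

static int8_t tile_buf[2][TILE_BYTES]; /* ping-pong pair in local memory */

void process_tiled(const int8_t* src, int8_t* dst, int num_tiles) {
  dma_load_async(tile_buf[0], src, TILE_BYTES); /* prefetch first tile */
  for (int t = 0; t < num_tiles; ++t) {
    int cur = t & 1;
    if (t + 1 < num_tiles) { /* start next transfer before computing */
      dma_load_async(tile_buf[(t + 1) & 1],
                     src + (t + 1) * TILE_BYTES, TILE_BYTES);
    }
    dma_wait(tile_buf[cur]); /* block until the current tile has landed */
    compute_tile(tile_buf[cur], dst + t * TILE_BYTES); /* overlaps DMA */
  }
}
```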
Operators
- Conv2d (`quantized_conv2d_nchw`)
- MaxPool2d (`maxpool_exec_mxnj2`)
- Mean / AdaptiveAvgPool (`mean_exec_dma`)
- Quantize / Dequantize (`quantize_per_tensor`, `dequantize_per_tensor`); reference math sketched after this list
- Quantized ReLU (`quantized_relu`)
- Quantized Linear (`quantized_linear_out`)
- Add (`op_add`)
- Softmax (`op_softmax`)
Build Configuration
cc @mcremon-meta @hsharma35 @zonglinpengmeta