[Cadence: Vision] ResNet18 & ResNet50: Optimized, DMA-enabled, functional #19111

Open

cad-rlc wants to merge 48 commits into pytorch:main from cad-rlc:main

Conversation

@cad-rlc cad-rlc commented Apr 24, 2026

Summary

Optimized Cadence Vision DSP operators for ResNet18 and ResNet50 inference. All operators are DMA-enabled with ping-pong tiling and functionally verified (int8 quantized, NCHW layout).

Operators

Conv2d (quantized_conv2d_nchw)

  • Kernel variants: 7x7j2, 3x3j1, 3x3j2, 1x1j1, 1x1j2
  • Modes: DMA ping-pong tiling (with iDMA) and cache-only (no DMA)
  • Dispatch: Automatic kernel selection based on layer config (kernel size, stride, dilation); see the sketch after this list
  • Quantization: int8 asymmetric input × symmetric weights, per-tensor output scaling
  • Bias correction: 24-bit clamped kernel bias with post-kernel residual correction
  • Config generator: Python tool to generate per-DRAM-size layer config headers
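
A minimal sketch of the dispatch rule, using illustrative names (the real logic lives in conv_kernel_dispatcher.c, and the dispatched symbols additionally carry _dma/_no_dma suffixes):

```cpp
// Specialized Cadence Vision kernel variants, named after kernel size and stride ("j").
enum class ConvKernelVariant { k7x7j2, k3x3j1, k3x3j2, k1x1j1, k1x1j2, generic };

// Pick a specialized kernel from the layer config (kernel size, stride, dilation);
// anything outside the specialized set falls back to the generic implementation.
ConvKernelVariant select_conv_kernel(int kh, int kw, int stride, int dilation) {
  if (dilation != 1) return ConvKernelVariant::generic;
  if (kh == 7 && kw == 7 && stride == 2) return ConvKernelVariant::k7x7j2;
  if (kh == 3 && kw == 3 && stride == 1) return ConvKernelVariant::k3x3j1;
  if (kh == 3 && kw == 3 && stride == 2) return ConvKernelVariant::k3x3j2;
  if (kh == 1 && kw == 1 && stride == 1) return ConvKernelVariant::k1x1j1;
  if (kh == 1 && kw == 1 && stride == 2) return ConvKernelVariant::k1x1j2;
  return ConvKernelVariant::generic;
}
```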

MaxPool2d (maxpool_exec_mxnj2)

  • Kernel: Arbitrary MxN kernel size, stride-2
  • Modes: DMA tiled and cache-only (no DMA)
  • Layout: NCHW float32
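
A sketch of the per-tile row bookkeeping behind the DMA mode, under the assumption that each tile produces a contiguous band of output rows; function and field names are illustrative, not the real symbols in maxpool_exec_mxnj2:

```cpp
#include <algorithm>

struct TileRows {
  int src_start;  // first valid input row to DMA in for this tile
  int src_rows;   // number of valid input rows to fetch
  int pad_top;    // rows above the valid data to fill with MIN_FLT32
};

// For a tile producing output rows [out_start, out_start + out_rows) of an
// MxN, stride-2 pool, map back to input rows via out_row * stride_h - pad_h.
// Rows that fall outside [0, in_h) are synthesized with the MIN_FLT32 fill.
TileRows maxpool_tile_rows(int out_start, int out_rows, int kernel_h,
                           int stride_h, int pad_h, int in_h) {
  int first = out_start * stride_h - pad_h;                             // may be negative (top padding)
  int last = (out_start + out_rows - 1) * stride_h - pad_h + kernel_h;  // exclusive; may exceed in_h (bottom padding)
  TileRows t;
  t.pad_top = std::max(0, -first);
  t.src_start = std::max(0, first);
  t.src_rows = std::min(in_h, last) - t.src_start;
  return t;
}
```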

Mean / AdaptiveAvgPool (mean_exec_dma)

  • Kernel: SIMD-optimized channel-wise mean with DMA tiling
  • Layout: NCHW float32, reduces spatial dims to 1x1
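
For reference, the reduction the SIMD/DMA kernel performs is the plain channel-wise mean below (scalar sketch, illustrative names):

```cpp
// Average each HxW plane of an NCHW float32 tensor down to 1x1.
void mean_nchw_to_1x1(const float* in, float* out, int N, int C, int H, int W) {
  const int plane = H * W;
  for (int n = 0; n < N; ++n) {
    for (int c = 0; c < C; ++c) {
      const float* p = in + (n * C + c) * plane;
      float acc = 0.f;
      for (int i = 0; i < plane; ++i) acc += p[i];
      out[n * C + c] = acc / plane;  // output shape is N x C x 1 x 1
    }
  }
}
```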

Quantize / Dequantize (quantize_per_tensor, dequantize_per_tensor)

  • Modes: DMA ping-pong and HW-optimized (no DMA)
  • Types: int8 asymmetric (asym8s)
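
The per-tensor asym8s mapping these operators implement is the standard affine scheme (scalar sketch for clarity; the shipped kernels vectorize it and stream tiles over DMA):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Quantize: q = clamp(round(x / scale) + zero_point, -128, 127)
int8_t quantize_asym8s(float x, float scale, int32_t zero_point) {
  int32_t q = static_cast<int32_t>(std::nearbyint(x / scale)) + zero_point;
  return static_cast<int8_t>(std::min(127, std::max(-128, q)));
}

// Dequantize: x = (q - zero_point) * scale
float dequantize_asym8s(int8_t q, float scale, int32_t zero_point) {
  return (static_cast<int32_t>(q) - zero_point) * scale;
}
```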

Quantized ReLU (quantized_relu)

  • Modes: DMA ping-pong and HW-optimized (no DMA)
  • Type: int8 clamp
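
With matching input/output quantization parameters, ReLU on asym8s data reduces to clamping at the zero point (scalar sketch, assuming the same scale/zero_point on both sides):

```cpp
#include <algorithm>
#include <cstdint>

// Anything that would dequantize to a negative value is clamped to zero_point, i.e. to 0.0.
void quantized_relu_asym8s(const int8_t* in, int8_t* out, int n, int8_t zero_point) {
  for (int i = 0; i < n; ++i) {
    out[i] = std::max(in[i], zero_point);
  }
}
```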

Quantized Linear (quantized_linear_out)

  • Mode: SIMD with DMA tiling
  • Type: int8 input × int8 weights, int32 bias
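
A scalar reference for the math behind the SIMD/DMA kernel (names and the folded requant_scale = in_scale * weight_scale / out_scale are illustrative):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// int8 input x int8 weights accumulated in int32, int32 bias, then per-tensor
// requantization back to asym8s. Weights are symmetric, so no weight zero point.
void quantized_linear_ref(const int8_t* in, const int8_t* weight,  // weight laid out [out_f][in_f]
                          const int32_t* bias, int8_t* out,
                          int in_features, int out_features,
                          int32_t in_zero_point, float requant_scale,
                          int32_t out_zero_point) {
  for (int o = 0; o < out_features; ++o) {
    int32_t acc = bias[o];
    for (int i = 0; i < in_features; ++i) {
      acc += (static_cast<int32_t>(in[i]) - in_zero_point) *
             static_cast<int32_t>(weight[o * in_features + i]);
    }
    int32_t q = static_cast<int32_t>(std::nearbyint(acc * requant_scale)) + out_zero_point;
    out[o] = static_cast<int8_t>(std::min(127, std::max(-128, q)));
  }
}
```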

Add (op_add)

  • Mode: DMA ping-pong element-wise float32 add
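
The ping-pong structure shared by the DMA-tiled operators, illustrated on the float32 add. The dma_* helpers here are synchronous memcpy stand-ins rather than the real iDMA driver calls; on the DSP the transfers run asynchronously, so the next tile streams into one local buffer while the current tile is computed from the other:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>

static void dma_load(float* dst, const float* src, size_t count) {   // stand-in for an async iDMA load
  std::memcpy(dst, src, count * sizeof(float));
}
static void dma_store(float* dst, const float* src, size_t count) {  // stand-in for an async iDMA store
  std::memcpy(dst, src, count * sizeof(float));
}

// a, b, out live in system memory; loc_* are the two local tile buffers.
void add_f32_pingpong(const float* a, const float* b, float* out, size_t n,
                      float* loc_a[2], float* loc_b[2], float* loc_out[2], size_t tile) {
  size_t tiles = (n + tile - 1) / tile;
  for (size_t t = 0; t < tiles; ++t) {
    size_t buf = t % 2;                  // alternate ("ping-pong") between the two buffers
    size_t off = t * tile;
    size_t len = std::min(tile, n - off);
    dma_load(loc_a[buf], a + off, len);  // in the real kernel this overlaps the previous tile's compute
    dma_load(loc_b[buf], b + off, len);
    for (size_t i = 0; i < len; ++i) loc_out[buf][i] = loc_a[buf][i] + loc_b[buf][i];
    dma_store(out + off, loc_out[buf], len);
  }
}
```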

Softmax (op_softmax)

  • Mode: HW-optimized softmax

Build Configuration

  • Configurable DRAM buffer sizes (config headers for 4k/8k/16k/24k/32k/61k)
  • Automatic DMA vs cache-only dispatch based on DRAM availability
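
To make the DRAM-size dependence concrete, here is a hypothetical fragment of the kind of per-layer entry the config generator emits into layer_configs.h; field names and values are illustrative, not copied from the generated headers:

```cpp
// The generator picks the _dma or _no_dma kernel per layer depending on whether
// a ping-pong tile pair fits the configured DRAM buffer size.
typedef struct {
  const char* kernel_name;   // e.g. "conv2d_3x3j1_dma" or "conv2d_3x3j1_no_dma"
  int in_h, in_w, in_ch, out_ch;
  int kernel_h, kernel_w, stride, pad;
  int tile_rows;             // input rows per DMA tile for the configured DRAM size
} layer_config_t;

static const layer_config_t kLayerConfigs[] = {
    {"conv2d_7x7j2_dma", 64, 64, 3, 64, 7, 7, 2, 3, 8},    // ResNet stem (values illustrative)
    {"maxpool_mxnj2_dma", 32, 32, 64, 64, 3, 3, 2, 1, 8},
    // ... remaining layers, extracted from the PTE file by generate_combined_configs.py
};
```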

cc @mcremon-meta @hsharma35 @zonglinpengmeta

larryliu0820 and others added 30 commits July 22, 2024 21:29
[ghstack-poisoned]
This requires us to move from create_runtime API v2 to v4. This should be
backwards compatible (i.e. old PTEs should still be able to load), and should
also be supported on slightly older versions of the XNNPACK 3p library, given
that v4 was introduced 2 years ago.

This patch adds a new workspace pointer member to the XnnpackBackend instance.

The same should also be done for the weight cache, which is left as a TODO
here for now.
Resolving errors in functionality
Minor code modification in the ping-pong process
Correcting the MIN_FLT32 value and adding MIN_ABS_FLT32.
Suraj Raut added 15 commits April 7, 2026 06:37
This reverts commit fb32e93, reversing
changes made to fcccda3.
# Conflicts:
#	Makefile
#	backends/cadence/aot/ref_implementations.py
#	backends/cadence/generic/operators/CMakeLists.txt
#	backends/cadence/generic/operators/op_dequantize_per_tensor.cpp
#	backends/cadence/generic/operators/op_im2row.cpp
#	backends/cadence/generic/operators/op_quantize_per_tensor.cpp
#	backends/cadence/generic/operators/op_quantized_layer_norm.cpp
#	backends/cadence/generic/operators/op_requantize.cpp
#	backends/cadence/generic/operators/quantized_add_out.cpp
#	backends/cadence/generic/operators/quantized_conv2d_nchw_out.cpp
#	backends/cadence/generic/operators/quantized_conv2d_nhwc_out.cpp
#	backends/cadence/generic/operators/quantized_fully_connected_out.cpp
#	backends/cadence/generic/operators/quantized_linear_out.cpp
#	backends/cadence/generic/operators/quantized_matmul_out.cpp
#	backends/cadence/generic/operators/quantized_relu_out.cpp
#	backends/cadence/runtime/TARGETS
#	backends/cadence/utils/runtime/BUCK
#	backends/cadence/utils/runtime/TARGETS
#	backends/cadence/vision/kernels/kernels.cpp
#	backends/cadence/vision/kernels/targets.bzl
#	backends/cadence/vision/operators/operators.h
#	backends/cadence/vision/operators/targets.bzl
#	backends/cadence/vision/third-party/targets.bzl
#	install_requirements.py
…rlap fix

Summary:
  All 20 conv layers + 1 maxpool layer now run via DMA-tiled kernels.
  ResNet18 int8 quantized 64x64: 47.2M cycles, 57.6x speedup over generic.

Config generator:
  - generate_combined_configs.py: extracts conv2d + maxpool from PTE files
    into a single combined header layer_configs.h
  - generate_layer_configs.py: kernel names use _dma/_no_dma suffixes
  - resnet18_layers.json: extracted layer params for ResNet18

Operators:
  - layer_configs.h: combined header with 29 conv + 1 maxpool configs
  - conv_kernel_dispatcher.c: _dma/_no_dma kernel name suffixes,
    CONV_DISPATCH printf for all branches
  - All includes migrated from separate conv/maxpool headers to
    combined operators/layer_configs.h

Maxpool DMA executor:
  - maxpool_exec_2x2j2.c: DMA-tiled executor with ping-pong buffers
  - Supports arbitrary kernel sizes with overlap handling:
    per-tile source row computed from output_rows * stride_h - pad_h,
    MIN_FLT32 fill provides top/left/bottom/right padding
  - op_max_pool2d_with_indices.cpp: DMA path via config lookup

Logs:
  - resnet18_all_dma.log: inference log, 57.6x speedup, Top-1 class 111
  - resnet18_all_dma_vs_generic.txt: per-op performance comparison
- Rename maxpool executor: maxpool_exec_2x2j2 -> maxpool_exec_mxnj2 (arbitrary kernel size, stride-2)
- Add mean_exec_dma.c and mean_executors.h for SIMD-optimized mean operator
- Remove CADENCE_CONV2D_GENERIC macro and all debug printf from vision/operators
- Add DMA buffer config headers for multiple DRAM sizes (4k/8k/16k/24k/32k/61k)
- Reorganize logs: remove old scattered logs, add structured per-model DRAM-sweep logs
- Add layerwise performance reports for ResNet18 (cache + no-cache cores)
# Conflicts:
#	backends/cadence/vision/operators/op_quantized_conv_out.cpp
…ntized_conv_out.cpp

- Merge quantized_conv2d_nchw_out_per_tensor.cpp into op_quantized_conv_out.cpp
- Add DMA-optimized NCHW conv path with XAI kernel dispatch
- Add 6 specialized typed variants (asym8s, asym8u, dilated, depthwise)
- Add 4 conv1d variants (ncl, nlc) in generic::native namespace
- Remove old quantized_conv2d_nchw_out_per_tensor.cpp
- Update CMakeLists.txt to remove old file reference

pytorch-bot Bot commented Apr 24, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19111

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

⚠️ 11 Awaiting Approval

As of commit 0475f66 with merge base c5c5b3a:

AWAITING APPROVAL - The following workflows need approval before CI can run:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.) Apr 24, 2026
@github-actions github-actions Bot added the ciflow/trunk and module: arm (Issues related to arm backend) labels Apr 24, 2026

pytorch-bot Bot commented Apr 24, 2026

The following ciflow label(s) have been added but CI has not been triggered yet because the workflows are awaiting approval:

  • ciflow/trunk

Once a maintainer approves the workflows (scroll to the bottom of the PR page), the corresponding CI jobs will be triggered automatically. Please ping one of the reviewers if you do not have access to approve and run workflows.

@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.
