
[AMD GPU] Add Windows & Linux ROCm support and Linux MIGraphX support #1188

Open
Looong01 wants to merge 35 commits into lightvector:master from Looong01:AMD_GPU

Conversation

@Looong01

All tests passed; we can merge it into the main branch! @lightvector

Both the Windows and Linux binary releases have been published here: https://github.com/Looong01/KataGo-Multi-backends/releases

Background

This PR summarizes all commits by Looong01 on the AMD_GPU branch from 2025-07-28 to 2026-03-16 (23 commits total: 18 non-merge + 5 merge), focused on introducing and refining ROCm backend support in KataGo, plus the new MIGraphX backend added on the MIGraphX branch.

Key Changes — ROCm Backend

  • Added core ROCm backend implementation and utility files: rocmbackend.cpp, rocmhelpers.hip, rocmutils.*, rocmincludes.h, rocmerrorcheck.h.
  • Integrated USE_ROCM_BACKEND into startup and config flow (setup/benchmark/gtpconfig) for proper backend detection and config generation.
  • Expanded ROCm build logic in CMakeLists.txt: HIP compiler setup, architecture/FP16 detection, ROCm package lookup, and fallback linking.
  • Added Windows ROCm build support (HIP_PATH/ROCM_PATH, clang toolchain handling, Windows library search paths).
  • Iteratively improved rocmbackend.cpp with ConvLayer method updates, performance tuning, and bug fixes.
  • Introduced and then removed experimental rocmbackend_new.cpp after merging validated changes into the main backend path.
  • Updated docs and sample configs (README / Compiling / cpp/configs/*) with ROCm instructions and rocmDeviceToUse*, rocmUseFP16 examples.
  • Regularly merged upstream lightvector:master to reduce branch drift.

Critical Bug Fix: ConvLayer accumulate (residual skip connections)

  • Root cause: miopenConvolutionForwardImmediate does not support alpha/beta parameters (unlike cuDNN's cudnnConvolutionForward). The original code set beta = accumulate ? 1.0 : 0.0 but this value was never passed to the MIOpen API, causing all residual skip connections to be silently dropped — the neural network output was effectively garbage.
  • Fix: When accumulate=true, save the output buffer (trunk) to a pre-allocated accumBuf via hipMemcpyAsync (Device-to-Device), run the convolution (which overwrites the output buffer), then add the saved residual back using a new customCudaAddTensorsInplace GPU kernel. All operations stay in VRAM with zero CPU-side data transfer (see the sketch after this list).
  • New kernels added in rocmhelpers.hip / rocmhelpers.h:
    • customCudaAddTensorsInplace(float*, const float*, int) 
    • customCudaAddTensorsInplace(half*, const half*, int)
  • Performance optimization: accumBuf is pre-allocated once per ConvLayer at construction time (sized for maxBatchSize), avoiding per-inference hipMalloc/hipFree overhead.
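
A minimal sketch of the fix described above, with illustrative variable names and simplified stream handling (everything on the default stream, error checks omitted); the real code lives in rocmbackend.cpp and rocmhelpers.hip:

```cpp
#include <cstdint>
#include <hip/hip_runtime.h>
#include <miopen/miopen.h>

// Element-wise in-place add matching the customCudaAddTensorsInplace(float*, const float*, int)
// signature listed above: buf[i] += other[i].
__global__ void addTensorsInplaceKernel(float* buf, const float* other, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if(i < n)
    buf[i] += other[i];
}
void customCudaAddTensorsInplace(float* buf, const float* other, int n) {
  int blockSize = 256;
  int numBlocks = (n + blockSize - 1) / blockSize;
  addTensorsInplaceKernel<<<numBlocks, blockSize>>>(buf, other, n);
}

// Accumulate path of the convolution. MIOpen's immediate-mode forward has no alpha/beta,
// so the residual trunk is saved first and re-added after the convolution overwrites `output`.
void applyConvWithAccumulate(
  miopenHandle_t miopen,
  miopenTensorDescriptor_t filterDesc, const float* weights,
  miopenTensorDescriptor_t inputDesc, const float* input,
  miopenConvolutionDescriptor_t convDesc,
  miopenTensorDescriptor_t outputDesc, float* output,
  void* workspaceBuf, size_t workspaceBytes, uint64_t solutionId,
  float* accumBuf,            // pre-allocated once per ConvLayer, sized for maxBatchSize
  size_t numOutputElts, bool accumulate
) {
  if(accumulate) {
    // Save the current output buffer contents (the residual trunk), device-to-device, no host copy.
    hipMemcpyAsync(accumBuf, output, numOutputElts * sizeof(float), hipMemcpyDeviceToDevice, 0);
  }
  // Immediate-mode convolution overwrites `output`; there is no beta parameter to accumulate into it.
  miopenConvolutionForwardImmediate(
    miopen, filterDesc, weights, inputDesc, input, convDesc, outputDesc, output,
    workspaceBuf, workspaceBytes, solutionId);
  if(accumulate) {
    // output += accumBuf, entirely in VRAM.
    customCudaAddTensorsInplace(output, accumBuf, (int)numOutputElts);
  }
}
```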

Secondary Fix: Algorithm enumeration buffer overflow

  • miopenConvolutionForwardGetSolutionCount reports the number of available solutions by overwriting its output parameter. The original code then filled a fixed stack array miopenConvSolution_t solutions[2*requestedAlgoCount], which could overflow whenever the reported count exceeded the array size. Replaced with std::vector<miopenConvSolution_t> solutions(availableAlgoCount) for safe dynamic sizing (a sketch follows).
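
A sketch of the corrected enumeration, with illustrative handle/descriptor names and error checking omitted; the point is only the dynamically sized buffer:

```cpp
// Inside the ConvLayer algorithm-selection path (names are illustrative).
size_t availableAlgoCount = 0;
miopenConvolutionForwardGetSolutionCount(
  miopen, filterDesc, inputDesc, convDesc, outputDesc, &availableAlgoCount);

// Dynamically sized: no fixed stack array to overflow, regardless of how many
// solutions MIOpen reports as available.
std::vector<miopenConvSolution_t> solutions(availableAlgoCount);
size_t returnedAlgoCount = 0;
miopenConvolutionForwardGetSolution(
  miopen, filterDesc, inputDesc, convDesc, outputDesc,
  availableAlgoCount, &returnedAlgoCount, solutions.data());
// solutions[0 .. returnedAlgoCount) can now be inspected to pick the fastest solution id.
```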

Windows ROCm Build — CMakeLists.txt Self-Configuration

Added full Windows ROCm build support directly in `CMakeLists.txt`.

Key Changes — MIGraphX Backend (New)

Added a complete MIGraphX graph-compiler backend (migraphxbackend.cpp, 1886 lines) as an alternative to the ROCm (MIOpen) backend. MIGraphX compiles the entire neural network into a single fused GPU program, leveraging AMD's graph-level optimizations (operator fusion, memory planning, kernel scheduling).

Architecture

  • Graph-based inference: The full model (trunk → policy/value/ownership heads) is constructed as a migraphx::program at load time using MIGraphXGraphBuilder, compiled once, then cached as .mxr files under ~/.katago/migraphxcache/.
  • Multi-batch compilation: MIGraphX does not support dynamic batch dimensions. Instead, multiple fixed-batch programs are pre-compiled at sizes {4, 8, 16, 24, 32, 40, 64} (capped by maxBatchSize). At inference time, getBestBatchSize() selects the smallest compiled size ≥ the actual batch to minimize GPU waste (see the sketch after this list).
  • Cache system: Compiled programs are saved/loaded as .mxr files with naming format migraphx_{modelName}_{sha256}_{H}x{W}_batch{N}_fp{0|1}_nhwc{0|1}_{exact|max}.mxr. First launch compiles all batch sizes (slow); subsequent launches load from cache in seconds.
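
A minimal sketch of that selection logic, assuming a sorted list of compiled batch sizes; the function and variable names follow the description above, not necessarily the actual code:

```cpp
#include <vector>
#include <stdexcept>

// Pick the smallest precompiled batch size that can hold the requested batch.
// compiledBatchSizes is sorted ascending, e.g. {4, 8, 16, 24, 32, 40, 64} capped at maxBatchSize.
int getBestBatchSize(const std::vector<int>& compiledBatchSizes, int actualBatchSize) {
  for(int b : compiledBatchSizes) {
    if(b >= actualBatchSize)
      return b;  // e.g. an actual batch of 3 runs on the compiled batch-4 program
  }
  throw std::runtime_error("Batch size exceeds the largest precompiled MIGraphX program");
}
```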

Neural Network Components Implemented

| Component | Status |
| --- | --- |
| Convolution (with padding/dilation/stride) | ✅ |
| BatchNorm (merged scale/bias via unsqueeze+multibroadcast) | ✅ |
| Residual Block | ✅ |
| Global Pooling Residual Block (dual-branch with gpoolToBiasMul) | ✅ |
| Nested Bottleneck Residual Block | ✅ |
| Global Pooling (3 features: mean, mean×scale, max/scale2) | ✅ |
| Policy Head (p1Conv + g1Conv→GPool→bias, p2Conv spatial, pass branch) | ✅ |
| Value Head (v1Conv→GPool→v2Mul→v3Mul) | ✅ |
| Score Value Head (shared v2→sv3Mul) | ✅ |
| Ownership Head (vOwnershipConv→flatten) | ✅ |
| Activations: ReLU, Mish, MishScale8 | ✅ |
| MatMul / FC layers (with optional bias) | ✅ |
| SGF Metadata Encoder | ❌ Disabled |

FP16 Support

  • Enabled by default (when useFP16Mode is Auto or True).
  • Input parameters remain float_type on the host; a convert op inside the graph handles float→half on the GPU (see the sketch after this list).
  • All weights/computation use half_type; outputs are cast back to float via static_cast<float> in the visit() lambda.
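
A hedged sketch of that input path, written in the same style as the make_op calls quoted later in the review; the parameter name, shape dimensions, and useFP16 flag here are illustrative, not the exact code:

```cpp
// Host-side inputs stay float; a "convert" instruction inside the compiled graph
// lowers them to half on the GPU when FP16 is enabled.
migraphx::shape inputShape{migraphx::shape::float_type,
                           {(size_t)batchSize, (size_t)numSpatialFeatures, (size_t)nnYLen, (size_t)nnXLen}};
auto inputParam = main_module->add_parameter("input_spatial", inputShape);

migraphx::instruction_ref x = inputParam;
if(useFP16) {
  x = main_module->add_instruction(
    migraphx::make_op("convert", {{"target_type", migraphx::shape::half_type}}), x);
}
// Downstream convolutions, residual blocks, and heads consume `x` in half precision;
// final outputs are converted back to float when read out on the host.
```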

Build Integration

  • New USE_BACKEND=MIGRAPHX option in CMakeLists.txt (~60 lines of build logic).
  • Links against libmigraphx (and optionally libmigraphx_gpu) from ROCm.
  • Registered under the "mgx" backend prefix in setup.cpp, with NCHW format forced.
  • Backend identification in main.cpp: prints "Using MIGraphX backend".

Known Limitations

  • No dynamic batch: Fixed batch sizes require pre-compilation; small batches may waste GPU cycles (e.g., actual batch 3 uses compiled batch 4).
  • NCHW only: NHWC format is silently ignored.
  • SGF Metadata Encoder disabled: Potential weight shape mismatch, skipped for now.
  • No mask support in global pooling: Assumes full board at nnXLen×nnYLen.

Change Stats — ROCm

  • Commits: 23 (18 non-merge, 5 merge), plus post-PR bug fixes
  • Files touched: 21, plus 3 in the post-PR fixes (rocmbackend.cpp, rocmhelpers.hip, rocmhelpers.h)
  • Diffstat: +9372 / -4009, plus +59 / -28 from the post-PR fixes

Change Stats — MIGraphX

  • Commits: 3 (non-merge: 3)
  • Files touched: 4 (migraphxbackend.cpp, CMakeLists.txt, setup.cpp, main.cpp)
  • Diffstat: +1977 / -1 (1886 lines new backend + 91 lines build/setup integration)

Included Commits (Author: Looong01)

ROCm Backend (AMD_GPU branch, 2025-07-28 ~ 2026-03-16)

  • 1f2ae46e 2025-07-28 Add ROCm backend
  • b4555304 2025-07-28 Fix bugs
  • 8b30cb96 2025-07-31 Update
  • 570ced01 2025-08-01 Fix bugs
  • abb61240 2025-08-01 Fix bugs
  • bfb292e7 2025-08-01 All bug fixed
  • 4606424f 2025-08-01 Update
  • 1e8ea788 2025-08-02 test new method
  • c1a09cf3 2025-08-02 Update
  • 0957b88b 2025-08-02 Test finished
  • c70d841a 2025-08-02 Update docks
  • 1d05ca8d 2025-08-02 Update gitignore
  • 9d4662b7 2025-08-02 Update new method
  • d40bd509 2025-08-02 Optimize performance
  • 158d24df 2025-08-13 Update new Convlayer method
  • ec32eb19 2025-08-13 Merge branch 'master' of https://github.com/Looong01/KataGo-ROCm
  • 0bfe0a14 2025-10-04 Add new compile target
  • f5fbb336 2025-11-08 Merge branch 'lightvector:master' into master
  • 26d8c5bd 2025-11-08 Add ROCm for Windows support
  • 555d2f17 2025-12-01 Merge branch 'lightvector:master' into master
  • dbc7cfa4 2026-02-22 Merge branch 'lightvector:master' into master
  • ed396b72 2026-02-28 Fix bugs
  • ccec62c5 2026-03-16 Merge branch 'lightvector:master' into master
  • xxxxxxxx 2026-04-19 Fix critical ConvLayer accumulate bug & algorithm buffer overflow

MIGraphX Backend (MIGraphX branch, 2026-02-27 ~ 2026-04-19)

  • c511c338 2026-02-27 Add MIGraphX support
  • 00cb6881 2026-04-19 Fix bugs (MIGraphX: 5 structural bugs, GELU→MishScale8, NHWC→NCHW, dimension mismatches)
  • b1da0e06 2026-04-19 Optimize performance (FP16 default, multi-batch compilation, cache per batch size)

@Looong01 (Author)

Windows ROCm: [screenshot]

Linux ROCm: [screenshot]

Linux MIGraphX: [screenshot]

@Looong01 (Author)

[screenshot]

@lightvector (Owner)

Thanks, I'll look at this soon.

@Looong01 (Author)

> Thanks, I'll look at this soon.

Thanks!

@lightvector (Owner) left a comment

Thanks for all this work and sorry for the delay in reading over this! I've left a bunch of comments about MIGraphX. I haven't done a detailed review of ROCm yet, but I suspect a lot of the high-level comments about error handling I left about MIGraphX will apply to ROCm as well - can you take a look?

Comment thread cpp/command/benchmark.cpp Outdated
Comment thread cpp/neuralnet/migraphxbackend.cpp Outdated
Comment thread cpp/neuralnet/migraphxbackend.cpp Outdated
Comment thread cpp/neuralnet/migraphxbackend.cpp Outdated
Comment thread cpp/neuralnet/migraphxbackend.cpp Outdated
Comment thread cpp/command/benchmark.cpp
Comment on lines +449 to +450
auto mean = addReduceMean(input, {2, 3});
mean = addSqueeze(mean, {2, 3});
lightvector (Owner):

Does this need a keepdims or something?

Comment thread cpp/neuralnet/migraphxbackend.cpp
Comment thread cpp/neuralnet/migraphxbackend.cpp Outdated
Comment thread cpp/neuralnet/migraphxbackend.cpp Outdated
@Looong01 (Author)

Thank you so much for looking at this. I have fixed everything and answered some of your questions about the MIGraphX backend.
There was only one similar issue in the ROCm backend (`cout << "Found ROCm device " << i << ": " << prop.name << endl;`), which I have now fixed.

@Looong01 Looong01 requested a review from lightvector May 11, 2026 17:41
@lightvector (Owner) left a comment

Thanks for the fixes. I marked as resolved the comments that looked resolved to me but left open the ones that I didn't see an answer to, or maybe I just missed it. Please let me know what you think about them - you can reply back to the comments if you have an answer or if you think a resolution is not necessary, and I will take a look.

I also left some more comments in this pass.

I have a much higher-level question too - the MIGraphX backend looks quite reasonable and I think I'm okay accepting it once it's polished and once I can independently test it a little too.

However, the ROCm backend I'm much less sure about. I skimmed through it, and it seems like a massive copy-paste of the CUDA backend, but with lots of subtle differences, and that seems really awkward for long-term maintainability. Do you have thoughts about this? How necessary is this backend if we have both OpenCL and MIGraphX - do you have thoughts on if it would be okay to drop it, or split it to a separate PR, or keep it as an unofficial branch or something, or other options?

Additionally, very soon I'm going to be adding support for transformer blocks, since transformers are likely the future of strong models. I have a branch where I'm working on it for OpenCL and testing it now, and will be personally implementing it through Eigen/CUDA/TensorRT as well. What are your thoughts about implementing support for these on the AMD side here once it's ready as well? Is this something you're committed to maintaining or updating through more architecture changes?

Comment thread cpp/neuralnet/migraphxbackend.cpp Outdated
Comment on lines +249 to +252
if(biasDesc->weights.size() != (size_t)biasDesc->numChannels) {
cerr << "ERROR: MatMul bias " << biasDesc->name << " size mismatch: "
<< biasDesc->weights.size() << " vs expected " << biasDesc->numChannels << endl;
} else {
lightvector (Owner):

Missed spot in earlier error handling cleanup.

Comment on lines +680 to +681
// Initial MatMul for global features
{
lightvector (Owner):

It might be worth an error check that the initial matmul channels matches the numGlobalFeatures.

Comment on lines +942 to +946
#if defined(MIGRAPHX_VERSION_MAJOR) && defined(MIGRAPHX_VERSION_MINOR) && defined(MIGRAPHX_VERSION_PATCH)
string migraphxVersionStr = Global::strprintf("%d_%d_%d", MIGRAPHX_VERSION_MAJOR, MIGRAPHX_VERSION_MINOR, MIGRAPHX_VERSION_PATCH);
#else
string migraphxVersionStr = "unknown";
#endif
lightvector (Owner):

If they aren't defined, what are your thoughts about emitting a warning through the logger and not saving the cache file?

Comment thread cpp/neuralnet/migraphxbackend.cpp Outdated
string migraphxVersionStr = "unknown";
#endif
string cacheKey = Global::strprintf(
"migraphx%s_%s_%s_%dx%d_batch%d_fp%d_%s",
lightvector (Owner):

Should the cache also include something based on GPU architecture/name? ("gcnArchName"?)

Comment thread cpp/neuralnet/migraphxbackend.cpp Outdated
Comment on lines +1612 to +1620
// preBN + preActivation (simplified - just activation for now)
auto x = input;
if(desc->preActivation.activation == 1) { // GELU
// Simplified GELU
auto sigmoid = main_module->add_instruction(migraphx::make_op("sigmoid"), x);
x = main_module->add_instruction(migraphx::make_op("mul"), x, sigmoid);
} else {
x = main_module->add_instruction(migraphx::make_op("relu"), x);
}
lightvector (Owner):

To what degree is it possible to make these tests exercise the same classes or components of the graph logic as the actual residual blocks, conv layers, etc., by calling the same classes or functions to build them?

For example I notice here there is custom "gelu" code that doesn't appear anywhere else and definitely is not part of the actual residual block implementation.

Comment thread cpp/CMakeLists.txt
Comment on lines +881 to +897
# Link HIP runtime
find_library(AMDHIP64_LIBRARY amdhip64
HINTS /opt/rocm/lib
PATH_SUFFIXES lib lib64)
if(AMDHIP64_LIBRARY)
target_link_libraries(katago ${AMDHIP64_LIBRARY})
else()
target_link_libraries(katago amdhip64)
endif()

# Link other required libraries
find_library(HIPRTC_LIBRARY hiprtc
HINTS /opt/rocm/lib
PATH_SUFFIXES lib lib64)
if(HIPRTC_LIBRARY)
target_link_libraries(katago ${HIPRTC_LIBRARY})
endif()
lightvector (Owner):

If some of these libraries are required, would it be better to report an error if they're not found and/or fail here with a message that flags why, rather than waiting for a link-time error?

Comment thread cpp/CMakeLists.txt Outdated
endif()

# Add ROCm library directories
link_directories(/opt/rocm/lib)
lightvector (Owner):

Is this a no-op given line 428, or am I misreading the logic?

@Looong01 (Author)

In fact, ROCm is more mature than MIGraphX. This can be seen in benchmarks: the ROCm backend is much faster than MIGraphX and far faster than OpenCL.

ROCm itself is designed to be compatible with CUDA, so I don't think this is very awkward for long-term maintenance: code can be migrated from CUDA to ROCm very easily. Moreover, ROCm supports both Linux and Windows, while MIGraphX currently officially supports only Linux. According to AMD's official roadmap, Windows support for MIGraphX is planned but may arrive much later.

In addition, ROCm and MIGraphX support all CUDA features and operators, including but not limited to Transformer blocks. Whenever you add new features, I will update the AMD GPU backends in a timely manner to support Transformers or any other new features.

@Looong01 Looong01 requested a review from lightvector May 12, 2026 15:46