[AMD GPU] Add Windows & Linux ROCm support and Linux MIGraphX support #1188
Looong01 wants to merge 35 commits
Conversation
Add ROCm and MIGraphX support for AMD GPU
Thanks, I'll look at this soon.
Thanks!
lightvector left a comment
Thanks for all this work and sorry for the delay in reading over this! I've left a bunch of comments about MIGraphX. I haven't done a detailed review of ROCm yet, but I suspect a lot of the high-level comments about error handling I left on MIGraphX will apply to ROCm too - can you take a look?
auto mean = addReduceMean(input, {2, 3});
mean = addSqueeze(mean, {2, 3});
Does this need a keepdims or something?
Thank you so much for reviewing this. I have fixed everything and answered part of your questions about the MIGraphX backend.
lightvector left a comment
Thanks for the fixes. I marked as resolved the comments that looked resolved to me, but left open the ones that I didn't see an answer to (or maybe I just missed it). Please let me know what you think about them - you can reply back to the comments if you have an answer, or if you think a resolution is not necessary, and I will take a look.
I also left some more comments in this pass.
I have a much higher-level question too - the MIGraphX backend looks quite reasonable and I think I'm okay accepting it once it's polished and once I can independently test it a little too.
However, the ROCm backend I'm much less sure about. I skimmed through it, and it seems like a massive copy-paste of the CUDA backend, but with lots of subtle differences, and that seems really awkward for long-term maintainability. Do you have thoughts about this? How necessary is this backend if we have both OpenCL and MIGraphX - do you have thoughts on if it would be okay to drop it, or split it to a separate PR, or keep it as an unofficial branch or something, or other options?
Additionally, very soon I'm going to be adding support for transformer blocks, since transformers are likely the future of strong models. I have a branch where I'm working on it for OpenCL and testing it now, and I will be implementing it personally for Eigen/CUDA/TensorRT as well. What are your thoughts about implementing support for these on the AMD side here once it's ready? Is this something you're committed to maintaining and updating through further architecture changes?
if(biasDesc->weights.size() != (size_t)biasDesc->numChannels) {
  cerr << "ERROR: MatMul bias " << biasDesc->name << " size mismatch: "
       << biasDesc->weights.size() << " vs expected " << biasDesc->numChannels << endl;
} else {
Missed spot in earlier error handling cleanup.
// Initial MatMul for global features
{
It might be worth an error check that the initial matmul's input channel count matches numGlobalFeatures.
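For concreteness, a minimal sketch of such a check, in the same cerr style as the snippet above; the descriptor field names (`inChannels`, `name`) and the `numGlobalFeatures` binding are assumptions about the surrounding code, not the actual layout:

```cpp
// Hypothetical field names; mirrors the cerr-based checks used elsewhere.
if(matMulDesc->inChannels != numGlobalFeatures) {
  cerr << "ERROR: Initial global MatMul " << matMulDesc->name
       << " input channels " << matMulDesc->inChannels
       << " do not match numGlobalFeatures " << numGlobalFeatures << endl;
}
```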
#if defined(MIGRAPHX_VERSION_MAJOR) && defined(MIGRAPHX_VERSION_MINOR) && defined(MIGRAPHX_VERSION_PATCH)
string migraphxVersionStr = Global::strprintf("%d_%d_%d", MIGRAPHX_VERSION_MAJOR, MIGRAPHX_VERSION_MINOR, MIGRAPHX_VERSION_PATCH);
#else
string migraphxVersionStr = "unknown";
#endif
If they aren't defined, what are your thoughts about emitting a warning through the logger and not saving the cache file?
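For example, a sketch only; the logger variable and the cache-writing helper here are hypothetical, not the PR's actual code:

```cpp
#if defined(MIGRAPHX_VERSION_MAJOR) && defined(MIGRAPHX_VERSION_MINOR) && defined(MIGRAPHX_VERSION_PATCH)
string migraphxVersionStr = Global::strprintf("%d_%d_%d", MIGRAPHX_VERSION_MAJOR, MIGRAPHX_VERSION_MINOR, MIGRAPHX_VERSION_PATCH);
bool versionKnown = true;
#else
string migraphxVersionStr = "unknown";
bool versionKnown = false;
logger.write("WARNING: MIGraphX version macros not defined; compiled-program cache will be disabled for this run.");
#endif
// Later, when the compiled program would normally be persisted:
if(versionKnown)
  trySaveCacheFile(cacheKey, compiledProgram); // hypothetical helper
```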
string migraphxVersionStr = "unknown";
#endif
string cacheKey = Global::strprintf(
  "migraphx%s_%s_%s_%dx%d_batch%d_fp%d_%s",
Should the cache also include something based on GPU architecture/name? ("gcnArchName"?)
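Something like the following could produce an architecture component for the key. `hipGetDeviceProperties` and `gcnArchName` are standard HIP; the surrounding integration is hypothetical:

```cpp
#include <hip/hip_runtime.h>
#include <string>

// Returns e.g. "gfx1100" for use as an extra cache-key component, so .mxr
// programs compiled on one GPU model are not loaded on a different one.
static std::string gpuArchForCacheKey(int deviceIdx) {
  hipDeviceProp_t prop;
  if(hipGetDeviceProperties(&prop, deviceIdx) != hipSuccess)
    return "unknownarch";
  return std::string(prop.gcnArchName);
}
```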
// preBN + preActivation (simplified - just activation for now)
auto x = input;
if(desc->preActivation.activation == 1) { // GELU
  // Simplified GELU
  auto sigmoid = main_module->add_instruction(migraphx::make_op("sigmoid"), x);
  x = main_module->add_instruction(migraphx::make_op("mul"), x, sigmoid);
} else {
  x = main_module->add_instruction(migraphx::make_op("relu"), x);
}
To what degree is it possible to make these tests exercise the same classes or components of the graph logic as the actual residual blocks, conv layers, etc., by calling the same classes or functions to build them?
For example, I notice there is custom "gelu" code here that doesn't appear anywhere else and definitely is not part of the actual residual block implementation.
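As an aside on that snippet: `x * sigmoid(x)` is the SiLU/Swish function rather than GELU. A plain-C++ reference for comparison:

```cpp
#include <cmath>

// What the snippet above actually builds: SiLU/Swish.
float silu(float x) { return x / (1.0f + std::exp(-x)); }

// A common tanh approximation of GELU, for contrast.
float geluTanh(float x) {
  return 0.5f * x * (1.0f + std::tanh(0.7978845608f * (x + 0.044715f * x * x * x)));
}
```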
# Link HIP runtime
find_library(AMDHIP64_LIBRARY amdhip64
  HINTS /opt/rocm/lib
  PATH_SUFFIXES lib lib64)
if(AMDHIP64_LIBRARY)
  target_link_libraries(katago ${AMDHIP64_LIBRARY})
else()
  target_link_libraries(katago amdhip64)
endif()

# Link other required libraries
find_library(HIPRTC_LIBRARY hiprtc
  HINTS /opt/rocm/lib
  PATH_SUFFIXES lib lib64)
if(HIPRTC_LIBRARY)
  target_link_libraries(katago ${HIPRTC_LIBRARY})
endif()
If some of these libraries are required, would it be better to report an error if they're not found and/or fail here with a message that flags why, rather than waiting for a link-time error?
endif()

# Add ROCm library directories
link_directories(/opt/rocm/lib)
Is this a no-op given line 428, or am I misreading the logic?
In fact, ROCm is more mature than MIGraphX. This can be seen in benchmarks: the computing speed of ROCm is much greater than MIGraphX's and far greater than OpenCL's. ROCm itself is designed to be compatible with CUDA, so I don't think this is very awkward for long-term maintenance, because code can be migrated from CUDA to ROCm very easily and with little effort. Moreover, ROCm supports both Linux and Windows, while MIGraphX currently officially supports only Linux; according to AMD's official roadmap, Windows support for MIGraphX is planned but may come much later. In addition, ROCm and MIGraphX support all CUDA features and operators, including but not limited to transformer blocks. If you are ready to add any new features, I will update in a timely manner to add the corresponding AMD GPU support for transformers or any other new features in the future.




All tests passed, so we can merge this into the main branch! @lightvector
Both the Windows and Linux binary releases have been published here: https://github.com/Looong01/KataGo-Multi-backends/releases
Background
This PR summarizes all commits by Looong01 on the `AMD_GPU` branch from 2025-07-28 to 2026-03-16 (23 commits total: 18 non-merge + 5 merge), focused on introducing and refining ROCm backend support in KataGo, plus the new MIGraphX backend added on the `MIGraphX` branch.

Key Changes — ROCm Backend
- New backend sources: `rocmhelpers.hip`, `rocmutils.*`, `rocmincludes.h`, `rocmerrorcheck.h`.
- Wired `USE_ROCM_BACKEND` into the startup and config flow (setup/benchmark/gtpconfig) for proper backend detection and config generation.
- Build-system support (`HIP_PATH`/`ROCM_PATH`, clang toolchain handling, Windows library search paths).
- Removed `rocmbackend_new.cpp` after merging validated changes into the main backend path.
- Updated docs and configs (`cpp/configs/*`) with ROCm instructions and `rocmDeviceToUse*`, `rocmUseFP16` examples.
- Periodically merged `lightvector:master` to reduce branch drift.

Critical Bug Fix: ConvLayer accumulate (residual skip connections)
- Problem: `miopenConvolutionForwardImmediate` does not support `alpha`/`beta` parameters (unlike cuDNN's `cudnnConvolutionForward`). The original code set `beta = accumulate ? 1.0 : 0.0`, but this value was never passed to the MIOpen API, so all residual skip connections were silently dropped and the neural network output was effectively garbage.
- Fix: when `accumulate=true`, save the output buffer (the trunk) to a pre-allocated `accumBuf` via `hipMemcpyAsync` (device-to-device), run the convolution (which overwrites the output buffer), then add the saved residual back using a new `customCudaAddTensorsInplace` GPU kernel. All operations stay in VRAM with zero CPU-side data transfer (see the sketch below).
- New kernels in `rocmhelpers.hip`/`rocmhelpers.h`: `customCudaAddTensorsInplace(float*, const float*, int)` and `customCudaAddTensorsInplace(half*, const half*, int)`.
- `accumBuf` is pre-allocated once per `ConvLayer` at construction time (sized for `maxBatchSize`), avoiding per-inference `hipMalloc`/`hipFree` overhead.
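A minimal sketch of the save/convolve/add-back sequence, assuming illustrative buffer names and launch geometry; the kernel mirrors what the `customCudaAddTensorsInplace` pair above does, but is not the PR's exact code:

```cpp
#include <hip/hip_runtime.h>

// Element-wise in-place add: acc[i] += residual[i].
__global__ void addTensorsInplaceKernel(float* acc, const float* residual, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if(i < n)
    acc[i] += residual[i];
}

void convForwardAccumulate(float* outputBuf, float* accumBuf, int n, hipStream_t stream) {
  // 1. Save the residual trunk before MIOpen overwrites outputBuf.
  hipMemcpyAsync(accumBuf, outputBuf, n * sizeof(float), hipMemcpyDeviceToDevice, stream);
  // 2. Run the convolution into outputBuf here; since
  //    miopenConvolutionForwardImmediate ignores beta, outputBuf now holds
  //    only the fresh conv result.
  // 3. Add the saved residual back, entirely on-device.
  int blockSize = 256;
  int numBlocks = (n + blockSize - 1) / blockSize;
  hipLaunchKernelGGL(addTensorsInplaceKernel, dim3(numBlocks), dim3(blockSize), 0, stream,
                     outputBuf, accumBuf, n);
}
```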
Secondary Fix: Algorithm enumeration buffer overflow

`miopenConvolutionForwardGetSolutionCount` returns the available solution count by overwriting its output parameter. The original code used this count with a fixed stack array `miopenConvSolution_t solutions[2*requestedAlgoCount]`, which could overflow. Replaced with `std::vector<miopenConvSolution_t> solutions(availableAlgoCount)` for safe dynamic sizing, as sketched below.
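A sketch of the fix under the assumption of illustrative variable names: size the solution buffer from the count MIOpen actually reports rather than a fixed stack array.

```cpp
#include <vector>
#include <miopen/miopen.h>

static std::vector<miopenConvSolution_t> getForwardSolutions(
  miopenHandle_t handle,
  miopenTensorDescriptor_t wDesc,
  miopenTensorDescriptor_t xDesc,
  miopenConvolutionDescriptor_t convDesc,
  miopenTensorDescriptor_t yDesc
) {
  // Ask MIOpen how many solutions exist for this conv configuration.
  size_t availableAlgoCount = 0;
  miopenConvolutionForwardGetSolutionCount(handle, wDesc, xDesc, convDesc, yDesc, &availableAlgoCount);
  // Size the buffer dynamically from that count: no fixed-array overflow.
  std::vector<miopenConvSolution_t> solutions(availableAlgoCount);
  size_t returnedCount = 0;
  miopenConvolutionForwardGetSolution(handle, wDesc, xDesc, convDesc, yDesc,
                                      availableAlgoCount, &returnedCount, solutions.data());
  solutions.resize(returnedCount);
  return solutions;
}
```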
Windows ROCm Build — CMakeLists.txt Self-Configuration

Added full Windows ROCm build support directly into `CMakeLists.txt`.
Key Changes — MIGraphX Backend (New)
Added a complete MIGraphX graph-compiler backend (`migraphxbackend.cpp`, 1886 lines) as an alternative to the ROCm (MIOpen) backend. MIGraphX compiles the entire neural network into a single fused GPU program, leveraging AMD's graph-level optimizations (operator fusion, memory planning, kernel scheduling).
Architecture
- The whole network is built as a single `migraphx::program` at load time using `MIGraphXGraphBuilder`, compiled once, then cached as `.mxr` files under `~/.katago/migraphxcache/`.
- Multiple batch sizes are compiled: `{4, 8, 16, 24, 32, 40, 64}` (capped by `maxBatchSize`). At inference time, `getBestBatchSize()` selects the smallest compiled size ≥ the actual batch to minimize GPU waste (see the sketch below).
- Cache files are `.mxr` files named `migraphx_{modelName}_{sha256}_{H}x{W}_batch{N}_fp{0|1}_nhwc{0|1}_{exact|max}.mxr`. The first launch compiles all batch sizes (slow); subsequent launches load from the cache in seconds.
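A sketch of that selection rule, assuming an illustrative function signature (the PR's actual `getBestBatchSize()` may differ):

```cpp
#include <vector>

// Pick the smallest compiled batch size that can hold the actual batch.
// Assumes compiledSizes is the ascending list {4, 8, 16, 24, 32, 40, 64}
// truncated at maxBatchSize.
static int getBestBatchSize(int actualBatchSize, const std::vector<int>& compiledSizes) {
  for(int size : compiledSizes)
    if(size >= actualBatchSize)
      return size;
  // Larger than anything compiled: fall back to the largest size
  // (the caller would need to split the batch).
  return compiledSizes.back();
}
```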
Neural Network Components Implemented

FP16 Support
- FP16 inference is used when `useFP16Mode` is `Auto` or `True`.
- Inputs remain `float_type` on the host; a `convert` op inside the graph handles float→half on the GPU.
- The graph runs internally in `half_type`; outputs are cast back to `float` via `static_cast<float>` in the `visit()` lambda.
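A sketch of that host-side readout; the template parameter stands in for `migraphx::argument`, and the buffer handling is illustrative rather than the PR's exact code:

```cpp
#include <vector>

// The compiled program runs in half_type internally; each output element is
// widened back to float inside the visit() lambda, as described above.
template <typename MigraphxArgument>
std::vector<float> readOutputAsFloat(const MigraphxArgument& result) {
  std::vector<float> out;
  result.visit([&](auto view) {
    for(auto v : view)
      out.push_back(static_cast<float>(v));
  });
  return out;
}
```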
Build Integration

- New `USE_BACKEND=MIGRAPHX` option in CMakeLists.txt (~60 lines of build logic).
- Links `libmigraphx` (and optionally `libmigraphx_gpu`) from ROCm.
- `"mgx"` backend prefix in `setup.cpp`, forced NCHW format.
- `main.cpp` prints `"Using MIGraphX backend"`.

Known Limitations
- Each compiled program is fixed to a single board size, `nnXLen`×`nnYLen`.

Change Stats — ROCm
- Commits: 23 (non-merge: 18, merge: 5) + post-PR bug fixes
- Files changed: 21, +3 in the bug fixes (`rocmbackend.cpp`, `rocmhelpers.hip`, `rocmhelpers.h`)
- Lines: +9372 / -4009, plus +59 / -28 from the bug fixes

Change Stats — MIGraphX
- Commits: 3 (non-merge: 3)
- Files changed: 4 (`migraphxbackend.cpp`, CMakeLists.txt, setup.cpp, main.cpp)
- Lines: +1977 / -1 (1886 lines of new backend + 91 lines of build/setup integration)

Included Commits (Author: Looong01)
ROCm Backend (`AMD_GPU` branch, 2025-07-28 ~ 2026-03-16)

- `1f2ae46e` 2025-07-28 Add ROCm backend
- `b4555304` 2025-07-28 Fix bugs
- `8b30cb96` 2025-07-31 Update
- `570ced01` 2025-08-01 Fix bugs
- `abb61240` 2025-08-01 Fix bugs
- `bfb292e7` 2025-08-01 All bug fixed
- `4606424f` 2025-08-01 Update
- `1e8ea788` 2025-08-02 test new method
- `c1a09cf3` 2025-08-02 Update
- `0957b88b` 2025-08-02 Test finished
- `c70d841a` 2025-08-02 Update docks
- `1d05ca8d` 2025-08-02 Update gitignore
- `9d4662b7` 2025-08-02 Update new method
- `d40bd509` 2025-08-02 Optimize performance
- `158d24df` 2025-08-13 Update new Convlayer method
- `ec32eb19` 2025-08-13 Merge branch 'master' of https://github.com/Looong01/KataGo-ROCm
- `0bfe0a14` 2025-10-04 Add new compile target
- `f5fbb336` 2025-11-08 Merge branch 'lightvector:master' into master
- `26d8c5bd` 2025-11-08 Add ROCm for Windows support
- `555d2f17` 2025-12-01 Merge branch 'lightvector:master' into master
- `dbc7cfa4` 2026-02-22 Merge branch 'lightvector:master' into master
- `ed396b72` 2026-02-28 Fix bugs
- `ccec62c5` 2026-03-16 Merge branch 'lightvector:master' into master
- `xxxxxxxx` 2026-04-19 Fix critical ConvLayer accumulate bug & algorithm buffer overflow

MIGraphX Backend (`MIGraphX` branch, 2026-02-27 ~ 2026-04-19)

- `c511c338` 2026-02-27 Add MIGraphX support
- `00cb6881` 2026-04-19 Fix bugs (MIGraphX: 5 structural bugs, GELU→MishScale8, NHWC→NCHW, dimension mismatches)
- `b1da0e06` 2026-04-19 Optimize performance (FP16 default, multi-batch compilation, cache per batch size)