[AMD GPU] Add Windows & Linux ROCm support and Linux MIGraphX support #1188
Looong01 wants to merge 35 commits
Conversation
Add ROCm and MIGraphX support for AMD GPU
Thanks, I'll look at this soon.
Thanks!
lightvector left a comment
Thanks for all this work and sorry for the delay in reading over this! I've left a bunch of comments about MIGraphX. I haven't done a detailed review of ROCm yet, but I suspect a lot of the high-level comments about error handling I left on MIGraphX will apply to ROCm too - can you take a look?
auto mean = addReduceMean(input, {2, 3});
mean = addSqueeze(mean, {2, 3});
Does this need a keepdims or something?
Thank you so much for reviewing this. I have fixed everything and answered part of your questions about the MIGraphX backend.
lightvector left a comment
Thanks for the fixes. I marked as resolved the comments that looked resolved to me, but left open the ones that I didn't see an answer to (or maybe I just missed it). Please let me know what you think about them - you can reply back to the comments if you have an answer, or if you think a resolution is not necessary, and I will take a look.
I also left some more comments in this pass.
I have a much higher-level question too - the MIGraphX backend looks quite reasonable and I think I'm okay accepting it once it's polished and once I can independently test it a little too.
However, the ROCm backend I'm much less sure about. I skimmed through it, and it seems like a massive copy-paste of the CUDA backend, but with lots of subtle differences, and that seems really awkward for long-term maintainability. Do you have thoughts about this? How necessary is this backend if we have both OpenCL and MIGraphX - do you have thoughts on if it would be okay to drop it, or split it to a separate PR, or keep it as an unofficial branch or something, or other options?
Additionally, very soon I'm going to be adding support for transformer blocks, since transformers are likely the future of strong models. I have a branch where I'm working on it for OpenCL and testing it now, and I will be implementing it personally for Eigen/CUDA/TensorRT as well. What are your thoughts about implementing support for these on the AMD side here once it's ready? Is this something you're committed to maintaining and updating through further architecture changes?
if(biasDesc->weights.size() != (size_t)biasDesc->numChannels) {
  cerr << "ERROR: MatMul bias " << biasDesc->name << " size mismatch: "
       << biasDesc->weights.size() << " vs expected " << biasDesc->numChannels << endl;
} else {
Missed spot in earlier error handling cleanup.
// Initial MatMul for global features
{
It might be worth an error check that the initial matmul's input channel count matches numGlobalFeatures.
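For concreteness, a minimal sketch of such a check, in the same cerr style as the snippet above; the descriptor field names (`inChannels`, `name`) and the `numGlobalFeatures` binding are assumptions about the surrounding code, not the actual layout:

```cpp
// Hypothetical field names; mirrors the cerr-based checks used elsewhere.
if(matMulDesc->inChannels != numGlobalFeatures) {
  cerr << "ERROR: Initial global MatMul " << matMulDesc->name
       << " input channels " << matMulDesc->inChannels
       << " do not match numGlobalFeatures " << numGlobalFeatures << endl;
}
```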
#if defined(MIGRAPHX_VERSION_MAJOR) && defined(MIGRAPHX_VERSION_MINOR) && defined(MIGRAPHX_VERSION_PATCH)
string migraphxVersionStr = Global::strprintf("%d_%d_%d", MIGRAPHX_VERSION_MAJOR, MIGRAPHX_VERSION_MINOR, MIGRAPHX_VERSION_PATCH);
#else
string migraphxVersionStr = "unknown";
#endif
If they aren't defined, what are your thoughts about emitting a warning through the logger and not saving the cache file?
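For example, a sketch only; the logger variable and the cache-writing helper here are hypothetical, not the PR's actual code:

```cpp
#if defined(MIGRAPHX_VERSION_MAJOR) && defined(MIGRAPHX_VERSION_MINOR) && defined(MIGRAPHX_VERSION_PATCH)
string migraphxVersionStr = Global::strprintf("%d_%d_%d", MIGRAPHX_VERSION_MAJOR, MIGRAPHX_VERSION_MINOR, MIGRAPHX_VERSION_PATCH);
bool versionKnown = true;
#else
string migraphxVersionStr = "unknown";
bool versionKnown = false;
logger.write("WARNING: MIGraphX version macros not defined; compiled-program cache will be disabled for this run.");
#endif
// Later, when the compiled program would normally be persisted:
if(versionKnown)
  trySaveCacheFile(cacheKey, compiledProgram); // hypothetical helper
```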
string migraphxVersionStr = "unknown";
#endif
string cacheKey = Global::strprintf(
  "migraphx%s_%s_%s_%dx%d_batch%d_fp%d_%s",
Should the cache also include something based on GPU architecture/name? ("gcnArchName"?)
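Something like the following could produce an architecture component for the key. `hipGetDeviceProperties` and `gcnArchName` are standard HIP; the surrounding integration is hypothetical:

```cpp
#include <hip/hip_runtime.h>
#include <string>

// Returns e.g. "gfx1100" for use as an extra cache-key component, so .mxr
// programs compiled on one GPU model are not loaded on a different one.
static std::string gpuArchForCacheKey(int deviceIdx) {
  hipDeviceProp_t prop;
  if(hipGetDeviceProperties(&prop, deviceIdx) != hipSuccess)
    return "unknownarch";
  return std::string(prop.gcnArchName);
}
```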
// preBN + preActivation (simplified - just activation for now)
auto x = input;
if(desc->preActivation.activation == 1) { // GELU
  // Simplified GELU
  auto sigmoid = main_module->add_instruction(migraphx::make_op("sigmoid"), x);
  x = main_module->add_instruction(migraphx::make_op("mul"), x, sigmoid);
} else {
  x = main_module->add_instruction(migraphx::make_op("relu"), x);
}
To what degree is it possible to make these tests exercise the same classes or components of the graph logic as the actual residual blocks, conv layers, etc., by calling the same classes or functions to build them?
For example, I notice there is custom "gelu" code here that doesn't appear anywhere else and definitely is not part of the actual residual block implementation.
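As an aside on that snippet: `x * sigmoid(x)` is the SiLU/Swish function rather than GELU. A plain-C++ reference for comparison:

```cpp
#include <cmath>

// What the snippet above actually builds: SiLU/Swish.
float silu(float x) { return x / (1.0f + std::exp(-x)); }

// A common tanh approximation of GELU, for contrast.
float geluTanh(float x) {
  return 0.5f * x * (1.0f + std::tanh(0.7978845608f * (x + 0.044715f * x * x * x)));
}
```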
# Link HIP runtime
find_library(AMDHIP64_LIBRARY amdhip64
  HINTS /opt/rocm/lib
  PATH_SUFFIXES lib lib64)
if(AMDHIP64_LIBRARY)
  target_link_libraries(katago ${AMDHIP64_LIBRARY})
else()
  target_link_libraries(katago amdhip64)
endif()

# Link other required libraries
find_library(HIPRTC_LIBRARY hiprtc
  HINTS /opt/rocm/lib
  PATH_SUFFIXES lib lib64)
if(HIPRTC_LIBRARY)
  target_link_libraries(katago ${HIPRTC_LIBRARY})
endif()
If some of these libraries are required, would it be better to report an error if they're not found and/or fail here with a message that flags why, rather than waiting for a link-time error?
endif()

# Add ROCm library directories
link_directories(/opt/rocm/lib)
Is this a no-op given line 428, or am I misreading the logic?
In fact, ROCm is more mature than MIGraphX. This can be seen in benchmarks: the computing speed of ROCm is much greater than MIGraphX's and far greater than OpenCL's. ROCm itself is designed to be compatible with CUDA, so I don't think this is very awkward for long-term maintenance, because code can be migrated from CUDA to ROCm very easily and with little effort. Moreover, ROCm supports both Linux and Windows, while MIGraphX currently officially supports only Linux; according to AMD's official roadmap, Windows support for MIGraphX is planned but may come much later. In addition, ROCm and MIGraphX support all CUDA features and operators, including but not limited to transformer blocks. If you are ready to add any new features, I will update in a timely manner to add the corresponding AMD GPU support for transformers or any other new features in the future.




All tests passed, so we can merge this into the main branch! @lightvector
Both the Windows and Linux binary releases have been published here: https://github.com/Looong01/KataGo-Multi-backends/releases
Background
This PR summarizes all commits by Looong01 on the `AMD_GPU` branch from 2025-07-28 to 2026-03-16 (23 commits total: 18 non-merge + 5 merge), focused on introducing and refining ROCm backend support in KataGo, plus the new MIGraphX backend added on the `MIGraphX` branch.

Key Changes — ROCm Backend
- New backend sources: `rocmhelpers.hip`, `rocmutils.*`, `rocmincludes.h`, `rocmerrorcheck.h`.
- Wired `USE_ROCM_BACKEND` into the startup and config flow (setup/benchmark/gtpconfig) for proper backend detection and config generation.
- Build-system support (`HIP_PATH`/`ROCM_PATH`, clang toolchain handling, Windows library search paths).
- Removed `rocmbackend_new.cpp` after merging validated changes into the main backend path.
- Updated docs and configs (`cpp/configs/*`) with ROCm instructions and `rocmDeviceToUse*`, `rocmUseFP16` examples.
- Periodically merged `lightvector:master` to reduce branch drift.

Critical Bug Fix: ConvLayer accumulate (residual skip connections)
- Problem: `miopenConvolutionForwardImmediate` does not support `alpha`/`beta` parameters (unlike cuDNN's `cudnnConvolutionForward`). The original code set `beta = accumulate ? 1.0 : 0.0`, but this value was never passed to the MIOpen API, so all residual skip connections were silently dropped and the neural network output was effectively garbage.
- Fix: when `accumulate=true`, save the output buffer (the trunk) to a pre-allocated `accumBuf` via `hipMemcpyAsync` (device-to-device), run the convolution (which overwrites the output buffer), then add the saved residual back using a new `customCudaAddTensorsInplace` GPU kernel. All operations stay in VRAM with zero CPU-side data transfer (see the sketch below).
- New kernels in `rocmhelpers.hip`/`rocmhelpers.h`: `customCudaAddTensorsInplace(float*, const float*, int)` and `customCudaAddTensorsInplace(half*, const half*, int)`.
- `accumBuf` is pre-allocated once per `ConvLayer` at construction time (sized for `maxBatchSize`), avoiding per-inference `hipMalloc`/`hipFree` overhead.
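A minimal sketch of the save/convolve/add-back sequence, assuming illustrative buffer names and launch geometry; the kernel mirrors what the `customCudaAddTensorsInplace` pair above does, but is not the PR's exact code:

```cpp
#include <hip/hip_runtime.h>

// Element-wise in-place add: acc[i] += residual[i].
__global__ void addTensorsInplaceKernel(float* acc, const float* residual, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if(i < n)
    acc[i] += residual[i];
}

void convForwardAccumulate(float* outputBuf, float* accumBuf, int n, hipStream_t stream) {
  // 1. Save the residual trunk before MIOpen overwrites outputBuf.
  hipMemcpyAsync(accumBuf, outputBuf, n * sizeof(float), hipMemcpyDeviceToDevice, stream);
  // 2. Run the convolution into outputBuf here; since
  //    miopenConvolutionForwardImmediate ignores beta, outputBuf now holds
  //    only the fresh conv result.
  // 3. Add the saved residual back, entirely on-device.
  int blockSize = 256;
  int numBlocks = (n + blockSize - 1) / blockSize;
  hipLaunchKernelGGL(addTensorsInplaceKernel, dim3(numBlocks), dim3(blockSize), 0, stream,
                     outputBuf, accumBuf, n);
}
```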
Secondary Fix: Algorithm enumeration buffer overflow

`miopenConvolutionForwardGetSolutionCount` returns the available solution count by overwriting its output parameter. The original code used this count with a fixed stack array `miopenConvSolution_t solutions[2*requestedAlgoCount]`, which could overflow. Replaced with `std::vector<miopenConvSolution_t> solutions(availableAlgoCount)` for safe dynamic sizing, as sketched below.
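A sketch of the fix under the assumption of illustrative variable names: size the solution buffer from the count MIOpen actually reports rather than a fixed stack array.

```cpp
#include <vector>
#include <miopen/miopen.h>

static std::vector<miopenConvSolution_t> getForwardSolutions(
  miopenHandle_t handle,
  miopenTensorDescriptor_t wDesc,
  miopenTensorDescriptor_t xDesc,
  miopenConvolutionDescriptor_t convDesc,
  miopenTensorDescriptor_t yDesc
) {
  // Ask MIOpen how many solutions exist for this conv configuration.
  size_t availableAlgoCount = 0;
  miopenConvolutionForwardGetSolutionCount(handle, wDesc, xDesc, convDesc, yDesc, &availableAlgoCount);
  // Size the buffer dynamically from that count: no fixed-array overflow.
  std::vector<miopenConvSolution_t> solutions(availableAlgoCount);
  size_t returnedCount = 0;
  miopenConvolutionForwardGetSolution(handle, wDesc, xDesc, convDesc, yDesc,
                                      availableAlgoCount, &returnedCount, solutions.data());
  solutions.resize(returnedCount);
  return solutions;
}
```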
Windows ROCm Build — CMakeLists.txt Self-Configuration

Added full Windows ROCm build support directly into `CMakeLists.txt`.
Key Changes — MIGraphX Backend (New)
Added a complete MIGraphX graph-compiler backend (`migraphxbackend.cpp`, 1886 lines) as an alternative to the ROCm (MIOpen) backend. MIGraphX compiles the entire neural network into a single fused GPU program, leveraging AMD's graph-level optimizations (operator fusion, memory planning, kernel scheduling).
Architecture
- The whole network is built as a single `migraphx::program` at load time using `MIGraphXGraphBuilder`, compiled once, then cached as `.mxr` files under `~/.katago/migraphxcache/`.
- Multiple batch sizes are compiled: `{4, 8, 16, 24, 32, 40, 64}` (capped by `maxBatchSize`). At inference time, `getBestBatchSize()` selects the smallest compiled size ≥ the actual batch to minimize GPU waste (see the sketch below).
- Cache files are `.mxr` files named `migraphx_{modelName}_{sha256}_{H}x{W}_batch{N}_fp{0|1}_nhwc{0|1}_{exact|max}.mxr`. The first launch compiles all batch sizes (slow); subsequent launches load from the cache in seconds.
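A sketch of that selection rule, assuming an illustrative function signature (the PR's actual `getBestBatchSize()` may differ):

```cpp
#include <vector>

// Pick the smallest compiled batch size that can hold the actual batch.
// Assumes compiledSizes is the ascending list {4, 8, 16, 24, 32, 40, 64}
// truncated at maxBatchSize.
static int getBestBatchSize(int actualBatchSize, const std::vector<int>& compiledSizes) {
  for(int size : compiledSizes)
    if(size >= actualBatchSize)
      return size;
  // Larger than anything compiled: fall back to the largest size
  // (the caller would need to split the batch).
  return compiledSizes.back();
}
```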
Neural Network Components Implemented

FP16 Support
- FP16 inference is used when `useFP16Mode` is `Auto` or `True`.
- Inputs remain `float_type` on the host; a `convert` op inside the graph handles float→half on the GPU.
- The graph runs internally in `half_type`; outputs are cast back to `float` via `static_cast<float>` in the `visit()` lambda.
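A sketch of that host-side readout; the template parameter stands in for `migraphx::argument`, and the buffer handling is illustrative rather than the PR's exact code:

```cpp
#include <vector>

// The compiled program runs in half_type internally; each output element is
// widened back to float inside the visit() lambda, as described above.
template <typename MigraphxArgument>
std::vector<float> readOutputAsFloat(const MigraphxArgument& result) {
  std::vector<float> out;
  result.visit([&](auto view) {
    for(auto v : view)
      out.push_back(static_cast<float>(v));
  });
  return out;
}
```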
Build Integration

- New `USE_BACKEND=MIGRAPHX` option in CMakeLists.txt (~60 lines of build logic).
- Links `libmigraphx` (and optionally `libmigraphx_gpu`) from ROCm.
- `"mgx"` backend prefix in `setup.cpp`, forced NCHW format.
- `main.cpp` prints `"Using MIGraphX backend"`.

Known Limitations
- Each compiled program is fixed to a single board size, `nnXLen`×`nnYLen`.

Change Stats — ROCm
- Commits: 23 (non-merge: 18, merge: 5) + post-PR bug fixes
- Files changed: 21, +3 in the bug fixes (`rocmbackend.cpp`, `rocmhelpers.hip`, `rocmhelpers.h`)
- Lines: +9372 / -4009, plus +59 / -28 from the bug fixes

Change Stats — MIGraphX
- Commits: 3 (non-merge: 3)
- Files changed: 4 (`migraphxbackend.cpp`, CMakeLists.txt, setup.cpp, main.cpp)
- Lines: +1977 / -1 (1886 lines of new backend + 91 lines of build/setup integration)

Included Commits (Author: Looong01)
ROCm Backend (`AMD_GPU` branch, 2025-07-28 ~ 2026-03-16)

- `1f2ae46e` 2025-07-28 Add ROCm backend
- `b4555304` 2025-07-28 Fix bugs
- `8b30cb96` 2025-07-31 Update
- `570ced01` 2025-08-01 Fix bugs
- `abb61240` 2025-08-01 Fix bugs
- `bfb292e7` 2025-08-01 All bug fixed
- `4606424f` 2025-08-01 Update
- `1e8ea788` 2025-08-02 test new method
- `c1a09cf3` 2025-08-02 Update
- `0957b88b` 2025-08-02 Test finished
- `c70d841a` 2025-08-02 Update docks
- `1d05ca8d` 2025-08-02 Update gitignore
- `9d4662b7` 2025-08-02 Update new method
- `d40bd509` 2025-08-02 Optimize performance
- `158d24df` 2025-08-13 Update new Convlayer method
- `ec32eb19` 2025-08-13 Merge branch 'master' of https://github.com/Looong01/KataGo-ROCm
- `0bfe0a14` 2025-10-04 Add new compile target
- `f5fbb336` 2025-11-08 Merge branch 'lightvector:master' into master
- `26d8c5bd` 2025-11-08 Add ROCm for Windows support
- `555d2f17` 2025-12-01 Merge branch 'lightvector:master' into master
- `dbc7cfa4` 2026-02-22 Merge branch 'lightvector:master' into master
- `ed396b72` 2026-02-28 Fix bugs
- `ccec62c5` 2026-03-16 Merge branch 'lightvector:master' into master
- `xxxxxxxx` 2026-04-19 Fix critical ConvLayer accumulate bug & algorithm buffer overflow

MIGraphX Backend (`MIGraphX` branch, 2026-02-27 ~ 2026-04-19)

- `c511c338` 2026-02-27 Add MIGraphX support
- `00cb6881` 2026-04-19 Fix bugs (MIGraphX: 5 structural bugs, GELU→MishScale8, NHWC→NCHW, dimension mismatches)
- `b1da0e06` 2026-04-19 Optimize performance (FP16 default, multi-batch compilation, cache per batch size)