Skip to content

win: native Windows (AMD HIP SDK) ROCm + MinGW CPU build for gfx1151 + >2 GiB fix#362

Open
jamesburton wants to merge 4 commits into
antirez:mainfrom
jamesburton:pr-windows-rocm
Open

win: native Windows (AMD HIP SDK) ROCm + MinGW CPU build for gfx1151 + >2 GiB fix#362
jamesburton wants to merge 4 commits into
antirez:mainfrom
jamesburton:pr-windows-rocm

Conversation

@jamesburton

Copy link
Copy Markdown

Addresses #360.

Builds ds4's ROCm backend natively on Windows (AMD HIP SDK 7.1, hipcc/clang MSVC ABI — no WSL) producing a gfx1151 ds4-bench.exe that loads the model and runs.

Includes the >2 GiB staged-read fix (#360): cuda_pread_full cast the file offset through 32-bit MSVC off_t → staged read failed at exactly 2.00 GiB (Input/output error). Now passes the 64-bit offset un-truncated on Windows; POSIX path unchanged.

Windows support added (all behind #ifdef _WIN32; Linux/macOS/CUDA/Linux-ROCm builds byte-for-byte unchanged):

  • ds4_win.h — dependency-free POSIX shim (mmap/anon-mmap/mlock, madvise, sysconf, flock/fcntl/pread, fmemopen, clock_gettime, ftruncate, 64-bit _stat64 remap for >2 GB models).
  • win/ds4_pthread_win.h (Win32 pthreads), win/ds4_sockets_win.h (Winsock shim — ds4_distributed.c is in CORE_OBJS and pulls in BSD sockets).
  • win/build-rocm.sh + windows-rocm Makefile target; vendors the full rocWMMA header tree (SDK ships only the version header), links hipBLAS/hipBLASLt, -std=c++17.

Verified: builds clean for gfx1151 and runs on Strix Halo (Radeon 8060S) — ~114 tok/s prefill @2k, ~195 @16k (with DS4_CUDA_MANAGED, see #359 / #361). CLI (termios) and server (full sockets) frontends are deferred; the bench is the perf vehicle. (A separate smaller MinGW CPU build also exists.)

Your Name and others added 3 commits June 8, 2026 18:21
…151)

Re-port the native Windows AMD HIP SDK build (no WSL, MSVC ABI via hipcc's
bundled clang) onto main's refactored backend, where the GPU path is now
ds4_rocm.cu (pulling in rocm/*.cuh) and the shared CORE_OBJS gained
ds4_distributed.c and ds4_ssd.c — both of which now link into ds4-bench.

ROCm ds4-bench TU set: ds4_rocm.cu + host C ds4.c, ds4_bench.c, ds4_help.c,
ds4_distributed.c, ds4_ssd.c. All Windows portability behind #ifdef _WIN32
(sub-guarded __MINGW32__ / DS4_WIN_PTHREAD), so POSIX/macOS/CUDA/Linux-ROCm
builds are byte-for-byte unchanged.

New POSIX surface main requires (added to ds4_win.h, guarded !__MINGW32__):
- anonymous mmap (MAP_ANONYMOUS/MAP_ANON via VirtualAlloc) + mlock/munlock for
  ds4_ssd.c --simulate-used-memory; munmap handles both file views and VM.
- nanosleep, sleep, getpagesize, mkstemp, ftello/fseeko, PATH_MAX.
- WIN32_LEAN_AND_MEAN so <windows.h> does not drag in legacy <winsock.h>.

New win/ds4_sockets_win.h: minimal BSD-sockets-over-Winsock2 shim so the
distributed runtime (a TCP coordinator/worker transport) compiles and links
into the bench — poll->WSAPoll, close->closesocket (non-socket _close
fallback), errno translation, lazy WSAStartup, SIGPIPE no-op, if_nametoindex.
pthread shim gains pthread_detach.

rocWMMA is now mandatory for HIP (ds4_rocm_moe/q8/indexer use rocwmma::
fragment/mma_sync), and the Windows SDK ships none of it: vendor the full
rocm-7.1.0 header tree (MIT) under win/third_party/rocwmma, keeping the SDK's
generated rocwmma-version.hpp (2.2.1).

build-rocm.sh: compile ds4_rocm.cu with -std=c++17 (ROCm 7.1 hipcub/rocprim
need std::visit), -DDS4_ROCM_BUILD on host C, and link -lhipblas -lhipblaslt
-lws2_32; synthesize both hipblas.lib and hipblaslt.lib MSVC import libs from
the SDK DLLs via llvm-dlltool. Makefile gains a windows-rocm target.

Source edits (all _WIN32-guarded): ds4.c, ds4_bench.c, ds4_help.c, ds4_ssd.c,
ds4_distributed.c, ds4_rocm.cu include blocks; rocm/ds4_rocm_runtime.cuh skips
the Linux st_blksize O_DIRECT hint. Compile-verified: ds4-bench.exe links as a
PE32+ x86-64 binary. Not run (single-GPU host serialized).

Follow-ups: distributed *serving* on Windows needs WSADuplicateSocket (dup) and
DWORD-ms SO_RCVTIMEO/SNDTIMEO; CLI/server frontends still unported.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
main's ROCm staged loader (cuda_pread_full) cast the file offset through
off_t, which is 32-bit (long) under the MSVC ABI, so offsets >=2 GiB
truncate to negative and the staged read fails with EIO at exactly
2.00 GiB. Pass the 64-bit offset un-truncated on Windows (the pread shim
already takes long long). POSIX path unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 8, 2026 18:32

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

…get)

Add a CPU-only (no GPU) native-Windows build of ds4-bench.exe with MinGW-w64
GCC, mirroring the pre-refactor windows-cpu target on top of the new main. It
reuses the Windows portability shims added by the ROCm-build PR (ds4_win.h,
win/ds4_sockets_win.h) — all changes stay _WIN32/__MINGW32__-guarded so POSIX
and MSVC-ABI builds are byte-for-byte unchanged.

TU set (GPU-less core, -DDS4_NO_GPU): ds4.c, ds4_bench.c, ds4_help.c,
ds4_distributed.c, ds4_ssd.c. main moved ds4_distributed.c/ds4_ssd.c into the
shared core, so the bench links the Winsock surface too; the link names
-lws2_32 -liphlpapi explicitly (the sockets shim's #pragma comment(lib,…) is a
no-op under gcc).

MinGW-specific fixes vs the MSVC ROCm path:
- ds4_distributed.c uses POSIX sleep(); the ds4_win.h sleep() shim is
  !__MINGW32__-guarded (MinGW provides it via <unistd.h>), but the Windows
  branch never pulled <unistd.h>. Add an __MINGW32__-guarded <unistd.h> include.

Makefile: add self-contained windows-cpu target (own CC=gcc + flags), extend
clean for *.exe. win/README.md: correct the make/direct-gcc instructions for
the now-required distributed/ssd TUs and socket libs.

Builds, links, and runs to arg-validation (exit 2: "specify exactly one of
--prompt-file or --chat-prompt-file"). Model not run (q2 ~80 GB > 32 GB RAM).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@jamesburton jamesburton changed the title win: native Windows (AMD HIP SDK) ROCm build for gfx1151 + >2 GiB staged-read fix win: native Windows (AMD HIP SDK) ROCm + MinGW CPU build for gfx1151 + >2 GiB fix Jun 8, 2026
@jamesburton

Copy link
Copy Markdown
Author

Pushed an update: also adds a parallel windows-cpu Makefile target — a native MinGW-w64 CPU build of ds4-bench.exe (no GPU), sharing the same ds4_win.h / Winsock shims. Builds + links + runs to arg-validation with MinGW GCC 15.2; the only extra source change is a __MINGW32__-guarded <unistd.h> include in ds4_distributed.c (for sleep). Keeps the ROCm and CPU Windows targets coherent in one diff.

@jamesburton

Copy link
Copy Markdown
Author

Validation on gfx1151 / Strix Halo (Radeon 8060S), native Windows build of this PR. Clean run — GPU exclusively free, idle-sleep disabled. DeepSeek-V4-Flash q2-imatrix (80.76 GiB), gen=32, --ctx-start=--ctx-max:

ctx prefill tok/s gen tok/s
2048 120.7 14.9
8192 185.2 13.7
16384 186.9 12.8
32768 181.5 12.8

The Windows build loads the 80 GB model (the >2 GiB staged-read fix works) and runs correctly across contexts. Standalone q4k-dot-test: 4/4 pass. So #362 (native Windows ROCm/CPU build + >2 GiB fix) is validated on real hardware.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants