win: native Windows (AMD HIP SDK) ROCm + MinGW CPU build for gfx1151 + >2 GiB fix#362
win: native Windows (AMD HIP SDK) ROCm + MinGW CPU build for gfx1151 + >2 GiB fix#362jamesburton wants to merge 4 commits into
Conversation
…151) Re-port the native Windows AMD HIP SDK build (no WSL, MSVC ABI via hipcc's bundled clang) onto main's refactored backend, where the GPU path is now ds4_rocm.cu (pulling in rocm/*.cuh) and the shared CORE_OBJS gained ds4_distributed.c and ds4_ssd.c — both of which now link into ds4-bench. ROCm ds4-bench TU set: ds4_rocm.cu + host C ds4.c, ds4_bench.c, ds4_help.c, ds4_distributed.c, ds4_ssd.c. All Windows portability behind #ifdef _WIN32 (sub-guarded __MINGW32__ / DS4_WIN_PTHREAD), so POSIX/macOS/CUDA/Linux-ROCm builds are byte-for-byte unchanged. New POSIX surface main requires (added to ds4_win.h, guarded !__MINGW32__): - anonymous mmap (MAP_ANONYMOUS/MAP_ANON via VirtualAlloc) + mlock/munlock for ds4_ssd.c --simulate-used-memory; munmap handles both file views and VM. - nanosleep, sleep, getpagesize, mkstemp, ftello/fseeko, PATH_MAX. - WIN32_LEAN_AND_MEAN so <windows.h> does not drag in legacy <winsock.h>. New win/ds4_sockets_win.h: minimal BSD-sockets-over-Winsock2 shim so the distributed runtime (a TCP coordinator/worker transport) compiles and links into the bench — poll->WSAPoll, close->closesocket (non-socket _close fallback), errno translation, lazy WSAStartup, SIGPIPE no-op, if_nametoindex. pthread shim gains pthread_detach. rocWMMA is now mandatory for HIP (ds4_rocm_moe/q8/indexer use rocwmma:: fragment/mma_sync), and the Windows SDK ships none of it: vendor the full rocm-7.1.0 header tree (MIT) under win/third_party/rocwmma, keeping the SDK's generated rocwmma-version.hpp (2.2.1). build-rocm.sh: compile ds4_rocm.cu with -std=c++17 (ROCm 7.1 hipcub/rocprim need std::visit), -DDS4_ROCM_BUILD on host C, and link -lhipblas -lhipblaslt -lws2_32; synthesize both hipblas.lib and hipblaslt.lib MSVC import libs from the SDK DLLs via llvm-dlltool. Makefile gains a windows-rocm target. Source edits (all _WIN32-guarded): ds4.c, ds4_bench.c, ds4_help.c, ds4_ssd.c, ds4_distributed.c, ds4_rocm.cu include blocks; rocm/ds4_rocm_runtime.cuh skips the Linux st_blksize O_DIRECT hint. Compile-verified: ds4-bench.exe links as a PE32+ x86-64 binary. Not run (single-GPU host serialized). Follow-ups: distributed *serving* on Windows needs WSADuplicateSocket (dup) and DWORD-ms SO_RCVTIMEO/SNDTIMEO; CLI/server frontends still unported. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
main's ROCm staged loader (cuda_pread_full) cast the file offset through off_t, which is 32-bit (long) under the MSVC ABI, so offsets >=2 GiB truncate to negative and the staged read fails with EIO at exactly 2.00 GiB. Pass the 64-bit offset un-truncated on Windows (the pread shim already takes long long). POSIX path unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…get) Add a CPU-only (no GPU) native-Windows build of ds4-bench.exe with MinGW-w64 GCC, mirroring the pre-refactor windows-cpu target on top of the new main. It reuses the Windows portability shims added by the ROCm-build PR (ds4_win.h, win/ds4_sockets_win.h) — all changes stay _WIN32/__MINGW32__-guarded so POSIX and MSVC-ABI builds are byte-for-byte unchanged. TU set (GPU-less core, -DDS4_NO_GPU): ds4.c, ds4_bench.c, ds4_help.c, ds4_distributed.c, ds4_ssd.c. main moved ds4_distributed.c/ds4_ssd.c into the shared core, so the bench links the Winsock surface too; the link names -lws2_32 -liphlpapi explicitly (the sockets shim's #pragma comment(lib,…) is a no-op under gcc). MinGW-specific fixes vs the MSVC ROCm path: - ds4_distributed.c uses POSIX sleep(); the ds4_win.h sleep() shim is !__MINGW32__-guarded (MinGW provides it via <unistd.h>), but the Windows branch never pulled <unistd.h>. Add an __MINGW32__-guarded <unistd.h> include. Makefile: add self-contained windows-cpu target (own CC=gcc + flags), extend clean for *.exe. win/README.md: correct the make/direct-gcc instructions for the now-required distributed/ssd TUs and socket libs. Builds, links, and runs to arg-validation (exit 2: "specify exactly one of --prompt-file or --chat-prompt-file"). Model not run (q2 ~80 GB > 32 GB RAM). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Pushed an update: also adds a parallel |
|
Validation on gfx1151 / Strix Halo (Radeon 8060S), native Windows build of this PR. Clean run — GPU exclusively free, idle-sleep disabled. DeepSeek-V4-Flash q2-imatrix (80.76 GiB), gen=32, --ctx-start=--ctx-max:
The Windows build loads the 80 GB model (the >2 GiB staged-read fix works) and runs correctly across contexts. Standalone |
Addresses #360.
Builds ds4's ROCm backend natively on Windows (AMD HIP SDK 7.1, hipcc/clang MSVC ABI — no WSL) producing a gfx1151
ds4-bench.exethat loads the model and runs.Includes the >2 GiB staged-read fix (#360):
cuda_pread_fullcast the file offset through 32-bit MSVCoff_t→ staged read failed at exactly 2.00 GiB (Input/output error). Now passes the 64-bit offset un-truncated on Windows; POSIX path unchanged.Windows support added (all behind
#ifdef _WIN32; Linux/macOS/CUDA/Linux-ROCm builds byte-for-byte unchanged):ds4_win.h— dependency-free POSIX shim (mmap/anon-mmap/mlock, madvise, sysconf, flock/fcntl/pread, fmemopen, clock_gettime, ftruncate, 64-bit_stat64remap for >2 GB models).win/ds4_pthread_win.h(Win32 pthreads),win/ds4_sockets_win.h(Winsock shim —ds4_distributed.cis in CORE_OBJS and pulls in BSD sockets).win/build-rocm.sh+windows-rocmMakefile target; vendors the full rocWMMA header tree (SDK ships only the version header), links hipBLAS/hipBLASLt,-std=c++17.Verified: builds clean for gfx1151 and runs on Strix Halo (Radeon 8060S) — ~114 tok/s prefill @2k, ~195 @16k (with
DS4_CUDA_MANAGED, see #359 / #361). CLI (termios) and server (full sockets) frontends are deferred; the bench is the perf vehicle. (A separate smaller MinGW CPU build also exists.)