From 06783bbd965c9692381f9e2074d8c14328c890fb Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Niccol=C3=B2=20Belli?=
Date: Mon, 8 Jun 2026 15:29:22 +0200
Subject: [PATCH] docs: add zfs instructions

---
 README.md | 67 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/README.md b/README.md
index 785695284..05e2e8aed 100644
--- a/README.md
+++ b/README.md
@@ -1171,6 +1171,73 @@ The cache directory is disposable. If behavior looks suspicious, stop the server
 and remove it. You can investigate what is cached with hexdump as the kv cache
 files include the verbatim prompt cached.
 
+## ZFS filesystem tuning
+
+If your Linux machine stores model weights and the on-disk KV cache on ZFS,
+apply these settings to avoid caching the same data twice in RAM.
+
+### Disable the ZFS ARC for model and cache datasets
+
+Disable the ZFS Adaptive Replacement Cache (ARC) for the datasets holding models
+and the KV cache. Without this, ZFS caches the 80 GB+ model file in the ARC
+while the inference engine `mmap`s the same data, a wasteful double cache.
+
+```sh
+zfs set primarycache=none zroot/home/user/models
+zfs set primarycache=none zroot/home/user/kv-cache
+```
+
+### Limit the ZFS ARC size at boot
+
+By default, OpenZFS lets the ARC grow to as much as 50% of system RAM. On a
+128 GB system, ZFS could consume 64 GB in the background. Add these kernel boot
+parameters:
+
+```text
+zfs.zfs_arc_max=2147483648 zfs.zfs_arc_min=536870912
+```
+
+- `zfs.zfs_arc_max=2147483648`: hard-caps the ARC at 2 GiB, preventing ZFS from
+  silently consuming tens of GB of RAM.
+- `zfs.zfs_arc_min=536870912`: reserves a 512 MiB floor so ZFS can still cache
+  essential metadata (directory trees, file permissions) without being starved.
+
+**Why capping the ARC matters even though the kernel can reclaim it under memory
+pressure:** ARC memory is managed through a translation layer, the Solaris
+Porting Layer (SPL), that adds significant latency to reclamation. When the LLM
+engine suddenly maps 80+ GB of model weights, the OOM killer can fire before ZFS
+finishes shrinking its ARC. On unified-memory APUs, the GPU driver also demands
+large contiguous blocks immediately and can fail the allocation if the ARC is
+in the way. Without these caps, the system either kills the inference process or
+falls into a swap-thrashing death spiral.
+
+On Ubuntu with GRUB, edit `/etc/default/grub` and append the parameters to
+`GRUB_CMDLINE_LINUX_DEFAULT`:
+
+```sh
+sudo cp /etc/default/grub /etc/default/grub.bak
+sudoedit /etc/default/grub
+```
+
+Set:
+
+```text
+GRUB_CMDLINE_LINUX_DEFAULT="quiet splash zfs.zfs_arc_max=2147483648 zfs.zfs_arc_min=536870912"
+```
+
+Then:
+
+```sh
+sudo update-grub
+sudo reboot
+```
+
+After reboot, verify the ARC limits took effect:
+
+```sh
+cat /sys/module/zfs/parameters/zfs_arc_max /sys/module/zfs/parameters/zfs_arc_min
+```
+
 ## Backends
 
 The default graph backend is Metal on macOS and CUDA in CUDA builds: