How we fixed an OOM livelock on a ZFS + KVM host: ARC cap, swappiness and a Patroni balloon
A virtualization host froze on us — completely unresponsive, no SSH, no console — and we had to power-cycle it (hard reset). It’s a NixOS host with ZFS and about a dozen KVM/QEMU guests. This is the story of how we reconstructed what happened from the logs and the four changes we applied so it doesn’t recur.
The topic is universal for anyone running ZFS + KVM on the same server: the combination of an unbounded ZFS ARC, over-committed VM memory and swap on a ZFS zvol is a recipe for a freeze.
Symptom: an infinite loop instead of a clean OOM
The first rule after an incident like this — look at the previous boot:
journalctl --list-boots
The previous boot’s journal ends abruptly, and the last thing we see is 2039 identical lines in a single second:
kernel: Purging GPU memory, 0 pages freed, 0 pages still pinned, 1 pages left available.
…and then nothing. This message comes from the i915 GPU shrinker of the integrated Intel graphics. Under extreme memory pressure the kernel calls every “shrinker” to free RAM; on a headless server i915 has nothing to free, so it spins in place. That’s a classic out-of-memory livelock — the system didn’t fall to a clean OOM, it got stuck trying (in vain) to free memory.
An important detail: at the moment of the freeze the OOM-killer never engaged at all. Had it done so, it would have killed one process and the system would have survived. Instead — livelock and a total freeze.
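To confirm this pattern without scrolling through the journal, the previous boot's kernel log can be bucketed by timestamp. This is a minimal sketch, assuming journalctl's short-iso output, where the timestamp is the first space-separated field. The function names are illustrative and not part of any tool mentioned here.

```python
import subprocess
from collections import Counter

def busiest_second(journal_text, needle="Purging GPU memory"):
    """Return (timestamp, count) for the second with the most matching lines."""
    per_second = Counter(
        line.split(" ", 1)[0]            # short-iso puts the timestamp first
        for line in journal_text.splitlines()
        if needle in line
    )
    return per_second.most_common(1)[0] if per_second else (None, 0)

def previous_boot_kernel_log():
    """Kernel messages (-k) of the previous boot (-b -1), ISO timestamps."""
    result = subprocess.run(
        ["journalctl", "-k", "-b", "-1", "-o", "short-iso", "--no-pager"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

On the host, busiest_second(previous_boot_kernel_log()) would return the timestamp of the worst second and the number of repeats. A count in the thousands for a shrinker message points at a reclaim loop rather than ordinary memory pressure.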
Root cause: three compounding factors
# ARC with no upper bound
grep -E '^(size|c_max)' /proc/spl/kstat/zfs/arcstats
# c_max 61.7 GB <-- on a host with 62 GiB of RAM!
# swap is a ZFS zvol
swapon --show
# /dev/zd0 partition 8G
- Unbounded ZFS ARC. zfs_arc_max was at its default, so the ARC was allowed to grow to practically all of RAM. The ARC is “reclaimable” in theory, but under a sudden VM-allocation spike it can’t be evicted fast enough.
- VM memory over-commit. The sum of guest-allocated RAM was almost equal to the host’s physical RAM. After ~53 days of uptime, the guests’ RSS had “warmed up” toward those allocations (the page cache inside each guest fills over time, and KVM rarely hands pages back), so real usage crept toward the ceiling.
- Swap on a ZFS zvol. To write a page to swap on a zvol, ZFS must allocate memory for its write path, which is impossible when there’s no memory. Reclaim makes no progress → livelock instead of recovery. That’s why the 8 GB of swap was effectively useless at the critical moment.
Interesting: the nightly backup was not the culprit, even though it looked that way at first. Careful log reading showed the backup had finished cleanly hours before the freeze. The real trigger was a guest’s memory allocation on top of already-tight RAM.
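The three factors can be put into one number: how much memory the ARC and the guests are allowed to claim, relative to what the host has. This is a rough sketch, assuming the usual "name type data" layout of /proc/spl/kstat/zfs/arcstats and the "MemTotal: N kB" line in /proc/meminfo. The guest allocations are passed in by the caller, and the function names are hypothetical.

```python
def parse_arcstats(text):
    """Parse arcstats text into {name: value}; header lines are skipped."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats

def mem_total_bytes(meminfo_text):
    """Return MemTotal from /proc/meminfo text, converted from kB to bytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found")

def worst_case_ratio(arc_c_max, guest_alloc_bytes, host_total):
    """(ARC ceiling + sum of guest allocations) / physical RAM."""
    return (arc_c_max + sum(guest_alloc_bytes)) / host_total
```

A ratio well above 1.0 means the host only stays up as long as the guests and the ARC never ask for their full allowance at the same time. It ignores QEMU overhead and the host's own processes, so the real margin is smaller than the number suggests.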
The fix: four changes
1. Cap the ZFS ARC (the primary fix)
Because root is on ZFS, we set the limit via a kernel parameter so it applies as early as module load in the initrd:
boot.kernelParams = [ "zfs.zfs_arc_max=17179869184" ]; # 16 GiB
Live, without a reboot:
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max
This is the essential fix: it removes the “runaway consumer” and guarantees the OOM-killer can do its job cleanly (kill one process) instead of the system livelocking.
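The parameter takes bytes, so it is easy to slip by a factor of 1000 versus 1024. A tiny sketch that derives the value used above from a size in GiB; the helper names are made up for illustration.

```python
def gib_to_bytes(gib):
    """Convert GiB to bytes (1 GiB = 2**30 bytes)."""
    return gib * 2**30

def arc_max_kernel_param(gib):
    """Build the kernel command-line fragment for a given ARC cap in GiB."""
    return f"zfs.zfs_arc_max={gib_to_bytes(gib)}"
```

arc_max_kernel_param(16) yields the same string as the kernel parameter shown above.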
Worth knowing: on Linux the ZFS ARC shows up under “used” in the output of free, not under “buff/cache”, so “used” memory looks larger than it really is. ~16 GB of that is reclaimable cache.
2. Lower vm.swappiness
boot.kernel.sysctl."vm.swappiness" = 10; # default is 60
This biases the kernel to reclaim page cache / ARC first, and only then enter the dangerous zvol write path. It applies immediately (sysctl is runtime, no reboot needed).
Note: shrinking swap from 8 to 4 GB does not fix the deadlock — swap size isn’t the cause, the fact that the write path needs memory is. The real fix would be swap off ZFS; swappiness is what actually reduces the exposure.
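Whether a host is exposed at all can be checked from /proc/swaps. This sketch treats a swap device as zvol-backed if its path starts with /dev/zd or /dev/zvol/, which is an assumption about how the device was named when it was activated. The swappiness threshold of 10 simply mirrors the value chosen above and is not a general recommendation.

```python
def zvol_swap_devices(proc_swaps_text):
    """Return swap devices from /proc/swaps text that look like ZFS zvols."""
    devices = []
    for line in proc_swaps_text.splitlines()[1:]:    # first line is the header
        fields = line.split()
        if fields and fields[0].startswith(("/dev/zd", "/dev/zvol/")):
            devices.append(fields[0])
    return devices

def swap_exposure(proc_swaps_text, swappiness, threshold=10):
    """Summarise zvol swap usage and whether swappiness is above the threshold."""
    zvols = zvol_swap_devices(proc_swaps_text)
    return {
        "zvol_swap": zvols,
        "swappiness": swappiness,
        "at_risk": bool(zvols) and swappiness > threshold,
    }
```

The inputs would come from reading /proc/swaps and /proc/sys/vm/swappiness. The check only flags the configuration; it cannot tell whether the write path would actually stall.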
3. Right-size the over-committed VMs
One web-server VM had several times more RAM allocated than it ever used. We lowered its allocation and gave the host back precious headroom. A memory change requires a cold restart of the guest (max memory can’t change on a live domain).
4. A Patroni-aware balloon service (live, no reboot)
The most interesting part. We run a PostgreSQL HA cluster (Patroni) across several VM nodes. The leader does all the writes and needs full RAM; replicas only replay WAL and use less. Instead of a fixed allocation, we built a small service on the host that, every hour:
- reads the cluster topology via the Patroni REST API (/cluster),
- via the live libvirt balloon (virsh setmem … --live, no reboot) sets:
  - leader → full RAM,
  - healthy streaming replicas → less RAM.
Key safety decisions:
- Grow-before-shrink: grow first (toward the leader), then shrink — so a freshly-promoted leader is never starved even for a moment.
- Fail-open: if Patroni is unreachable or there’s no leader, the service touches nothing — VMs keep their full (boot-default) memory. The worst case is “no saving”, never “leader too small”.
- Safe because shared_buffers is fixed and sized for the lower bound; the balloon only reclaims free guest memory, never pages PostgreSQL is using.
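The fail-open rule can be expressed as a pure function over the /cluster response. This is a sketch under several assumptions: the response has a "members" list whose entries carry "name", "role" and "state"; a healthy replica reports role "replica" and state "streaming"; and Patroni member names match the libvirt domain names. The function names and the two size parameters are hypothetical.

```python
import json
import urllib.request

def fetch_cluster(base_url, timeout=5):
    """GET <base_url>/cluster from the Patroni REST API; None on any failure."""
    try:
        with urllib.request.urlopen(f"{base_url}/cluster", timeout=timeout) as resp:
            return json.load(resp)
    except (OSError, ValueError):        # network errors, timeouts, bad JSON
        return None

def compute_targets(cluster, full_kib, replica_kib):
    """Map member name -> balloon target in KiB. An empty dict means: touch nothing."""
    members = (cluster or {}).get("members", [])
    if sum(m.get("role") == "leader" for m in members) != 1:
        return {}                        # fail-open: no leader, or an ambiguous one
    targets = {}
    for m in members:
        healthy_replica = m.get("role") == "replica" and m.get("state") == "streaming"
        # Anything that is not a healthy streaming replica gets full RAM.
        targets[m["name"]] = replica_kib if healthy_replica else full_kib
    return targets
```

Returning an empty mapping for every doubtful case keeps the worst outcome at “no saving”: an unreachable API, a missing leader and a replica that is still starting all leave memory at its full size.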
The logic, condensed (Python, stdlib):
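import subprocess  # targets ({vm: target_kib}) and current_kib(vm) are defined elsewhere in the service
# Largest target first, so the leader grows before any replica shrinks.
for vm, target_kib in sorted(targets.items(), key=lambda kv: -kv[1]):
    if current_kib(vm) != target_kib:
        subprocess.check_call(["virsh", "setmem", vm, str(target_kib), "--live"])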
Result: the leader stays full, the replicas shrink, and the host gains several gigabytes of headroom — all live, without a single database reboot.
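Sorting by target size gives grow-before-shrink as long as there are only two tiers (full and reduced). A variant that orders by the actual change makes the guarantee independent of the tier sizes. This is an illustrative sketch, not the service's code; current and targets are plain dicts of KiB values supplied by the caller.

```python
def plan_changes(current, targets):
    """Return [(vm, target_kib)] with grows first, shrinks last, no-ops dropped."""
    changes = [
        (vm, target, target - current[vm])
        for vm, target in targets.items()
        if vm in current and target != current[vm]
    ]
    changes.sort(key=lambda item: -item[2])   # largest grow first, largest shrink last
    return [(vm, target) for vm, target, _ in changes]
```

Each returned pair would then be passed to the same virsh setmem call. VMs missing from current are skipped rather than guessed at.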
Lessons
- On a ZFS + KVM host, always cap the ARC (zfs_arc_max). The “up to almost everything” default fights directly with guest memory.
- Mind over-commit and “warm-up”. A fresh VM uses little; after weeks of uptime RSS creeps toward what’s allocated. The metric to watch is committed memory, not the current output of free.
- Swap on a ZFS zvol is deadlock-prone. If it must be that way, at least lower swappiness; ideally keep swap off ZFS.
- A livelock is not the same as an OOM. If you see an endless “Purging GPU memory” (or a similar shrinker in a loop), that’s a memory livelock; the goal is for the OOM-killer to be able to act cleanly.
- Read the logs to the end before concluding. The first suspect (the backup) wasn’t guilty; only the timeline revealed the real trigger.
- Balloon + orchestration (here: Patroni role → libvirt balloon) lets you redistribute RAM between VMs dynamically without reboots.
Note
Generated by Claude 🤖
Ernad Husremović, hernad@bring.out.ba