Migrating a production Odoo tenant from on-prem to Hetzner with NixOS — and the DNS/journal/NAT gotchas along the way
We recently relocated a production Odoo 16 (Bosnian edition) tenant from an on-premise NixOS KVM virtual machine to a Hetzner cloud VM — declaratively, with NixOS + colmena, zero data loss, and only a short cutover window. This post is the engineering write-up: the clean parts, and — more usefully — the three gotchas that bit us (a DNS wildcard subtlety, a BIND journal trap, and a NAT hairpin that broke PDF invoices).
Addresses below are anonymized: the on-prem LAN is 192.168.x.0/24, the cloud libvirt network
is 192.168.y.0/24, the public IP is <hetzner-public-ip>, and the tenant is reachable at
tenant.odoo.bringout.cloud.
The goal: resilience
The driving reason was availability. The on-prem site in Sarajevo is subject to occasional internet and power outages, and an ERP that goes dark whenever the local link or grid blinks is a real business problem. Moving the tenant to a Hetzner cloud VM — with its database on a cloud HA PostgreSQL cluster — makes the core application independent of local infrastructure: the ERP keeps serving even when the Sarajevo office is offline.
Each tenant historically ran on its own on-prem VM (192.168.x.124, 4 vCPU / 12 GB), and those
VMs were also heavily over-provisioned (single-digit GB actually used) — so right-sizing in the cloud
was a welcome side-benefit. We kept the tenant’s configuration byte-identical where it mattered
and changed only what the new environment forced us to. Everything is declarative: the VM, its
services, and its networking live in the NixOS/colmena repository.
1. Reproducing the engine exactly (and a GCC surprise)
The tenant runs a specific pinned Odoo build (a “bosnian-legacy” Nix package) plus a precise set of addons. To preserve behaviour, we mirrored that exact package and addon set into the cloud repo as new attributes — without touching the cloud fleet’s existing Odoo build — and gave the tenant’s host a small overlay so its service file was reused verbatim (only the database host changed).
Then the first colmena build failed. The culprit:
lxml 4.9.4 no longer compiles under nixpkgs-25.05, because its GCC 14 promotes -Wincompatible-pointer-types (lxml vs libxml2’s const xmlError *) into a hard error. GCC 13 (nixpkgs-24.11) only emits a warning, so the build succeeds.
The faithful fix was to pin this host to the same nixpkgs revision the on-prem box used (24.11 / GCC 13), while the rest of the fleet stays on newer channels:
# hive.nix
nodeNixpkgs = {
# ...
tenant-vm = (import ./nixpkgs-24.11); # GCC 13 — builds the legacy lxml 4.9.4
};
colmena build then produced the full closure cleanly — same engine, same addons, same behaviour.
2. Provisioning the cloud VM
We created the guest from a NixOS qcow2 template (hostname + IP substituted, nixos-install),
registered it via the libvirt-guests module, and let colmena deploy the real config. One practical
detail: the template ships a 20 GB root partition on a 50 GB disk, so the first big closure copy
filled it. Growing it online (no reboot) is three commands:
echo "/dev/vda1 : start=2048, type=83, bootable" | sfdisk --force /dev/vda # redefine vda1: same start sector, size defaults to the rest of the disk
partx -u /dev/vda # kernel re-reads the (mounted) partition size
resize2fs /dev/vda1 # grow ext4 live
3. Database migration: live dump → Patroni cluster
The on-prem PostgreSQL isn’t reachable from the cloud, so the database moves into the cloud’s
Patroni HA cluster (reached through its HAProxy VIP at 192.168.y.100:500X).
pg_dump takes an MVCC-consistent snapshot, so we could dump the live database with no
downtime to validate the whole pipeline first:
pg_dump -h <on-prem-pg> -U admin_user --format=custom --no-owner --no-privileges \
-f tenant.dump tenant_db
# restore into the Patroni leader via the VIP
pg_restore -h 192.168.y.100 -p 500X -U admin_user --no-owner --role=admin_user \
-d tenant_db --jobs=2 tenant.dump
We brought the tenant up on the cloud VM against this copy, confirmed it served HTTP 200 with real data, then did the actual cutover: stop on-prem Odoo → final consistent dump → drop/recreate → restore → re-sync the filestore. Because the DB is small, the cutover window was a couple of minutes.
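The cutover order above can be sketched as a small script. This is a minimal illustration, not the script we ran: every variable name (SRC_HOST, ODOO_UNIT, FILESTORE and so on) is hypothetical, dropdb/createdb and rsync are our stand-ins for "drop/recreate" and "re-sync the filestore", and it assumes the machine running it can reach both the on-prem and the cloud side. With DRY_RUN=1 it only prints the commands, which is a cheap way to review the sequence before the real window.

```shell
#!/bin/sh
# Cutover sketch. All hosts, ports, paths and unit names come from the
# environment; none of them are taken from the real deployment.
set -e

# run CMD...: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

cutover() {
    db=$1
    # Refuse to start with a half-configured environment.
    : "${SRC_HOST:?}" "${ODOO_UNIT:?}" "${SRC_PG_HOST:?}" "${PG_USER:?}"
    : "${DST_PG_HOST:?}" "${DST_PG_PORT:?}" "${FILESTORE:?}"

    # 1. Stop writes at the source so the final dump is complete.
    run ssh "$SRC_HOST" systemctl stop "$ODOO_UNIT"
    # 2. Final consistent dump, same flags as the rehearsal.
    run pg_dump -h "$SRC_PG_HOST" -U "$PG_USER" --format=custom \
        --no-owner --no-privileges -f "$db.dump" "$db"
    # 3. Drop the rehearsal copy and recreate an empty database.
    run dropdb -h "$DST_PG_HOST" -p "$DST_PG_PORT" -U "$PG_USER" --if-exists "$db"
    run createdb -h "$DST_PG_HOST" -p "$DST_PG_PORT" -U "$PG_USER" "$db"
    # 4. Restore through the VIP into the Patroni leader.
    run pg_restore -h "$DST_PG_HOST" -p "$DST_PG_PORT" -U "$PG_USER" \
        --no-owner --role="$PG_USER" -d "$db" --jobs=2 "$db.dump"
    # 5. Attachments live on disk, not in the dump, so copy them last.
    run rsync -a --delete "$SRC_HOST:$FILESTORE/$db/" "$FILESTORE/$db/"
}
```

Because of set -e, a failing step aborts the sequence instead of restoring a partial dump; calling cutover tenant_db with DRY_RUN=1 prints the six commands in order.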
4. Keeping mail alive over a VPN subnet route
The tenant ingests bank statements by IMAP from the on-prem mail server (192.168.x.40). Those
ports are firewalled off the public internet, and we explicitly did not want to change the
tenant’s mail configuration. So instead of repointing mail, we routed the traffic privately: an
on-prem node advertises a tailscale/headscale subnet route to the mail host, the cloud VM
accepts routes, and a single /etc/hosts override makes the mail hostname resolve to its internal
address. Result: the tenant’s existing SMTP/IMAP config keeps working untouched — over the tunnel.
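As a rough sketch of the two sides of that route: the helper below only assembles the `tailscale up` flags for each role, so they can be reviewed before anything is applied. The function name ts_args and the HEADSCALE_URL variable are hypothetical; the flags themselves (--login-server, --advertise-routes, --accept-routes) are standard tailscale options. It assumes the advertising node has IP forwarding enabled and that the route is approved on the headscale side, neither of which is shown here.

```shell
#!/bin/sh
# ts_args ROLE [CIDR]: print the `tailscale up` flags for one side of the
# subnet route. HEADSCALE_URL is an assumed variable holding the control URL.
ts_args() {
    : "${HEADSCALE_URL:?set HEADSCALE_URL to the headscale control URL}"
    case $1 in
        advertise)
            # On-prem node: offer a route to the mail host (a /32 is enough).
            printf '%s\n' "--login-server=$HEADSCALE_URL --advertise-routes=${2:?CIDR required}"
            ;;
        accept)
            # Cloud VM: install routes advertised by other nodes.
            printf '%s\n' "--login-server=$HEADSCALE_URL --accept-routes"
            ;;
        *)
            echo "usage: ts_args advertise CIDR | ts_args accept" >&2
            return 2
            ;;
    esac
}
```

On the on-prem node this would be used as `tailscale up $(ts_args advertise <mail-ip>/32)`, and on the cloud VM as `tailscale up $(ts_args accept)`; the unquoted substitution is intentional so the flags split into separate arguments.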
The database deliberately does not use the tunnel (latency-sensitive, and we don’t want it to depend on the on-prem link) — only the low-volume mail does.
This is, candidly, the one remaining on-prem dependency: if the Sarajevo link is down, the ERP keeps running but new bank-statement e-mails won’t be fetched until connectivity returns (they queue on the mail server and import once the tunnel is back). Relocating mail to the cloud is a natural follow-up; for now the core ERP — app and database — is fully cloud-resident and outage-independent.
5. The DNS carve-out — and an RFC 4592 trap
Public DNS pointed *.bringout.cloud at the old on-prem reverse proxy via a wildcard. To move just
one name to the cloud, you add an explicit record:
tenant.odoo.bringout.cloud. IN A <hetzner-public-ip>
…but that alone breaks the siblings. Per RFC 4592, adding tenant.odoo.bringout.cloud creates
an empty non-terminal node odoo.bringout.cloud, which becomes the closest encloser for
other.odoo.bringout.cloud. A wildcard is only synthesized from directly beneath the closest
encloser, and there is no *.odoo.bringout.cloud, so the siblings no longer match
*.bringout.cloud and start returning NXDOMAIN. The fix is to add a matching wildcard one level
down so the others keep resolving:
tenant.odoo.bringout.cloud. IN A <hetzner-public-ip> ; carve-out
*.odoo.bringout.cloud. IN CNAME old-proxy.example. ; preserve the siblings
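To make the closest-encloser rule concrete, here is a toy resolver over a plain list of owner names (one per line, no trailing dot). It is a simplified model of the RFC 4592 lookup for illustration only, not how BIND implements it; the function names are ours, and it assumes the queried name lies inside the zone.

```shell
#!/bin/sh
# exists NAME FILE: true if NAME owns records or is an ancestor of a name
# that does (which makes it an empty non-terminal).
exists() {
    awk -v n="$1" '
        $0 == n { found = 1; exit }
        length($0) > length(n) &&
            substr($0, length($0) - length(n)) == "." n { found = 1; exit }
        END { exit !found }
    ' "$2"
}

# lookup FILE QNAME: report how QNAME would be answered.
lookup() {
    zone=$1 q=$2
    if grep -qxF -- "$q" "$zone"; then echo "exact $q"; return; fi
    if exists "$q" "$zone"; then echo "NODATA (empty non-terminal)"; return; fi
    enc=${q#*.}                          # start one label up
    until exists "$enc" "$zone"; do      # climb to the closest encloser
        case $enc in
            *.*) enc=${enc#*.} ;;
            *)   echo "NXDOMAIN"; return ;;
        esac
    done
    # Only a wildcard directly beneath the closest encloser can match.
    if grep -qxF -- "*.$enc" "$zone"; then
        echo "wildcard *.$enc"
    else
        echo "NXDOMAIN"
    fi
}
```

With only bringout.cloud and *.bringout.cloud in the list, other.odoo.bringout.cloud is answered from *.bringout.cloud. Adding tenant.odoo.bringout.cloud turns the same query into NXDOMAIN, and adding *.odoo.bringout.cloud makes it resolve again through the new wildcard.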
6. The BIND .jnl journal trap (a brief self-inflicted outage)
Right after deploying the zone change, several zones stopped answering — SERVFAIL/empty, with
named logging “zone not loaded”. But named-checkconf -z was perfectly clean. The tell-tale
clue: zones without a journal kept working.
Cause: the zones use ixfr-from-differences, so each master zone keeps a <zone>.jnl journal. Our
zone files share a single SOA serial, so bumping it changed the serial of every zone — and
every existing journal was now out of sync with its (changed) zone file. named refuses to load a
master zone whose journal doesn’t match. The fix:
systemctl stop bind
rm -f /data/named/*.zone.jnl # leave dynamic-update zones' journals alone
systemctl start bind # zones reload fresh and rebuild journals
Lesson learned, and now written into our runbook: after any zone-content/serial change, clear the stale static-zone journals.
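One way to make the "leave dynamic-update zones alone" part explicit is to drive the cleanup from a list of dynamic zones instead of a bare glob. The sketch below assumes journals are named <zone>.zone.jnl, as in the command above, and that a hand-maintained file lists the dynamic zone names one per line; that file and the function name are hypothetical. It should only be run while named is stopped.

```shell
#!/bin/sh
# clear_static_journals DIR KEEP_FILE
# Remove DIR/<zone>.zone.jnl for every zone not listed in KEEP_FILE.
# Set DRY_RUN=1 to print the decisions without deleting anything.
clear_static_journals() {
    dir=$1 keep=$2
    for jnl in "$dir"/*.zone.jnl; do
        [ -e "$jnl" ] || continue              # the glob matched nothing
        zone=$(basename "$jnl" .zone.jnl)
        if grep -qxF -- "$zone" "$keep"; then
            echo "keep   $jnl"                 # dynamic zone: journal holds real updates
        else
            echo "remove $jnl"
            [ "${DRY_RUN:-0}" = "1" ] || rm -f -- "$jnl"
        fi
    done
}
```

A first pass with DRY_RUN=1 shows which journals would go, which is useful when the zone directory mixes static and dynamic zones.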
7. PDF invoices with no logo — a NAT hairpin
After cutover, generated PDF invoices rendered without the company logo. It looked like a
filestore problem; it wasn’t (the logo served fine locally). The real cause: wkhtmltopdf fetches
report assets from Odoo’s configured base URL, which is the public https://tenant.odoo.bringout.cloud
→ <hetzner-public-ip>. From inside the VM, a request to its own public IP has to hairpin through
the NAT, and on this network that path doesn’t loop back, so the asset fetch timed out and the logo was left blank.
Two complementary fixes (leaving web.base.url public, which e-mails need):
# Odoo system parameter — wkhtmltopdf fetches assets locally:
report.url = http://localhost:8069
# and resolve the public name internally via the reverse proxy, not the public IP:
networking.extraHosts = "192.168.y.2 tenant.odoo.bringout.cloud";
Invoices render correctly again.
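A small check along these lines can confirm after a deploy that the hosts override is really in place, so the public name no longer resolves to the public IP from inside the VM. It reads the hosts file directly and is only a sketch; the function names are ours, and on a live system `getent hosts <name>` is the more complete check because it goes through the full resolver configuration.

```shell
#!/bin/sh
# hosts_ip NAME [FILE]: print the first address FILE maps NAME to.
hosts_ip() {
    awk -v n="$1" '
        { sub(/#.*/, "") }                     # ignore comments
        { for (i = 2; i <= NF; i++) if ($i == n) { print $1; exit } }
    ' "${2:-/etc/hosts}"
}

# overrides_public NAME PUBLIC_IP [FILE]: succeed only if NAME is pinned to
# an address other than PUBLIC_IP, i.e. the hairpin path is avoided.
overrides_public() {
    ip=$(hosts_ip "$1" "${3:-/etc/hosts}")
    [ -n "$ip" ] && [ "$ip" != "$2" ]
}
```

For example, `overrides_public tenant.odoo.bringout.cloud <hetzner-public-ip>` returning non-zero after a rebuild would be an early hint that report rendering is about to lose its images again.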
8. Headroom for the move
Adding a VM to a busy hypervisor without swap is asking for the OOM killer, so the new guest got a small swapfile and the host got proper swap — cheap insurance now that one more tenant lives there.
Takeaways
- Pin the toolchain, not just the package — a newer GCC turning a warning into an error is a classic “it built last year” trap. Same nixpkgs rev → same build.
- MVCC pg_dump lets you rehearse the whole migration against the live database with zero downtime; keep the destructive cutover tiny.
- A VPN subnet route can preserve an integration (here, bank-statement IMAP) without touching app config.
- DNS wildcards are sharp: an explicit record under a wildcard needs a sibling wildcard, or you NXDOMAIN the neighbours.
- named-checkconf -z clean but zones “not loaded” almost always means stale .jnl journals.
- Blank report images after a host move → check the base URL / NAT hairpin, not the filestore.
Note
Generated by Claude 🤖
Ernad Husremović, hernad@bring.out.ba