The old platform was a small Talos Kubernetes cluster on Hetzner Cloud. It was not large by production standards: one control-plane node, three static workers, one ingress load balancer, and a zero-to-five-node autoscale pool. Before the price change, that shape was annoying but defensible. After the June 2026 pricing change, recreating the same Ashburn footprint became the wrong default.
The important word is recreating. Existing cloud servers can be grandfathered for a while, but Kubernetes platforms recycle hosts. Talos upgrades, node replacement, autoscaler churn, failed disks, accidental deletes, and region moves all eventually turn grandfathered infrastructure into current-price infrastructure. I do not want a platform whose cost model depends on never replacing a node.
The teardown took the Hetzner project to zero billable resources. That makes the next architecture decision clean: do not preserve the old shape just because it existed. Rebuild only the parts that the current project needs.
```text
Old talos-redux shape:
  control plane: 1 x cpx31, Ashburn
  static workers: 3 x cpx21, Ashburn
  autoscale pool: 0-5 x cpx21, Ashburn
  ingress: 1 x lb11
  persistent volumes: 260 GB
  normal rebuild cost: about 183 EUR/month after the June 2026 price change
  autoscale ceiling: about 345 EUR/month with five extra cpx21 nodes

After teardown:
  servers: 0
  load balancers: 0
  volumes: 0
  snapshots: 0
  recurring bill: 0 EUR, excluding already accrued prorated charges
```
Why Home Hosting Is Back on the Table
Home hosting usually loses when the requirement is production uptime. The house is not a datacenter. Power, cooling, WAN, switching, routing, and physical access are now part of the system. If a router firmware update breaks the edge at midnight, there is no provider SLA to hide behind.
That is acceptable here because this is still bootstrapping. The first requirement is low recurring cost with enough compute to keep building. Public availability matters, but a few hours of downtime is not a business-ending event. That changes the design target from cloud-like HA to cheap, understandable, recoverable infrastructure.
The useful home-hosted replacement is not an old rack server. A rack server can look cheap on eBay and then lose on power, noise, heat, and parts. The useful replacement is a stack of used business mini PCs: Lenovo ThinkCentre Tiny, Dell OptiPlex Micro, HP EliteDesk Mini, or similar. They are quiet, efficient, commodity x86 boxes with real RAM slots and NVMe storage.
```text
Three-node home shape:
  nodes: 3 x used business mini PC
  example CPU class: i5-8500T / i5-9500T / similar 35 W desktop part
  memory: 32 GB per node
  storage: 1-2 TB NVMe per node
  aggregate: 18 hardware threads / 96 GB RAM / 3-6 TB raw NVMe

Old cloud shape:
  nodes: 4 Hetzner cloud servers
  aggregate: 13 shared vCPU / 20 GB RAM / 260 GB network volumes

The home stack is compute-equivalent or better.
It is not datacenter-equivalent. Power, WAN, router, and building are now part of the platform.
```
On paper, three used mini PCs are already more machine than the old Hetzner cluster. The old cluster had 20 GB of RAM total. Three 32 GB mini PCs have 96 GB. The old cluster had 13 shared vCPUs. Three six-thread office desktops have 18 hardware threads. Shared cloud vCPU and local desktop CPU are not perfectly comparable, but for this workload class the home stack is not the weak side of the comparison.
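The aggregates are simple sums, and it is worth being able to redo them when a listing changes. A minimal sketch follows; the per-instance figures (cpx31 as 4 vCPU / 8 GB, cpx21 as 3 vCPU / 4 GB, six threads per mini PC) are assumptions chosen because they reproduce the totals quoted above, not values taken from a price sheet.

```shell
# Per-instance specs are assumptions that match the quoted totals:
# cpx31 = 4 vCPU / 8 GB, cpx21 = 3 vCPU / 4 GB, mini PC = 6 threads / 32 GB.
old_vcpu=$(( 1 * 4 + 3 * 3 ))       # one control plane + three static workers
old_ram_gb=$(( 1 * 8 + 3 * 4 ))
home_threads=$(( 3 * 6 ))           # three six-thread desktop parts
home_ram_gb=$(( 3 * 32 ))

echo "old cloud:  ${old_vcpu} shared vCPU, ${old_ram_gb} GB RAM"
echo "home stack: ${home_threads} hardware threads, ${home_ram_gb} GB RAM"
```

The autoscale pool is left out on purpose: it was burst capacity, not the baseline the home stack has to match.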
Target Hardware
The minimal buy is three identical or near-identical mini PCs, a UPS, and enough NVMe to stop treating storage as scarce. I would target this shape:
| Component | Target | Reason |
|---|---|---|
| Node count | 3 | Enough for embedded etcd quorum and simple maintenance. |
| CPU | i5-8500T / i5-9500T class or newer | Cheap, quiet, 35 W nominal parts with adequate single-core speed. |
| Memory | 32 GB per node | The old cluster was memory-constrained; RAM is the cheap fix. |
| Disk | 1-2 TB NVMe per node | Local database, registry cache, observability, and restore room. |
| Network | Wired gigabit Ethernet | Wi-Fi is not a server backplane. |
| Power | UPS for nodes, router, switch | The cluster is only reachable if the edge stays powered. |
This is a capital expense instead of a cloud rental. The practical range is roughly a few hundred dollars for used nodes, plus RAM and SSD upgrades if the listings are bare. Power is small: at 20-35 W per node, three nodes are usually in the single-digit to low-double-digit monthly dollar range depending on local electricity. A single old 1U or 2U server can burn that much by itself while making the room unpleasant.
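The power estimate can be checked with a few lines of arithmetic. This is a rough sketch that assumes constant draw over a 30-day month; the 0.15 per kWh price is a placeholder, not a quoted tariff, and `monthly_cost` is a helper name made up for this example.

```shell
# monthly_cost WATTS_PER_NODE NODES PRICE_PER_KWH
# Prints the electricity cost of a 30-day month at constant draw.
monthly_cost() {
  awk -v w="$1" -v n="$2" -v p="$3" \
    'BEGIN { printf "%.2f\n", w * n * 24 * 30 / 1000 * p }'
}

# 0.15 per kWh is a placeholder; substitute the local rate.
monthly_cost 20 3 0.15   # low end of the 20-35 W range
monthly_cost 35 3 0.15   # high end of the range
```

Real draw sits below the nominal figure at idle and above it under sustained load, so treat the two results as a bracket rather than a bill.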
Cluster Shape
I would not recreate the Talos + hcloud module locally. Talos is good, but the replacement problem is smaller now: take three machines, run Kubernetes, deploy workloads, expose HTTP, back up state. The shortest working tool is k3s with embedded etcd.
```mermaid
flowchart TD
    users["Users"] --> cf["Cloudflare DNS / Access"]
    cf --> tunnel["Cloudflare Tunnel"]
    admin["Admin laptop"] --> tailscale["Tailscale"]
    tailscale --> cluster
    subgraph home["Home network"]
        router["Router"]
        switch["Managed or boring unmanaged switch"]
        ups["UPS"]
        router --> switch
        ups --> router
        ups --> switch
        subgraph cluster["k3s embedded-etcd cluster"]
            n1["mini-1\ncontrol-plane + worker\nlocal NVMe"]
            n2["mini-2\ncontrol-plane + worker\nlocal NVMe"]
            n3["mini-3\ncontrol-plane + worker\nlocal NVMe"]
        end
        switch --> n1
        switch --> n2
        switch --> n3
        tunnel --> n1
    end
    n1 --> backups["R2 / B2 backups"]
    n2 --> backups
    n3 --> backups
```
All three nodes can run both control-plane and workload pods. That is not a pure production separation, but it is the right trade for a small bootstrap cluster. Control-plane isolation costs one third of the hardware before the applications get to run. If control-plane pressure becomes real later, the cluster can grow to five nodes and reserve three for control-plane duties.
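The node count follows from how etcd reaches agreement: a write needs a majority of members, so the useful sizes are odd. A small sketch of that arithmetic, with helper names that are mine rather than anything k3s ships:

```shell
# etcd needs a majority of members to accept writes: quorum = floor(n/2) + 1.
quorum() { echo $(( $1 / 2 + 1 )); }

# Members that can be lost while the remainder still forms a majority.
tolerated_failures() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 5; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

Two members tolerate no failures at all, which is why the step up from one node is three, and the step after that is five.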
```sh
# first node: start a new embedded-etcd cluster
curl -sfL https://get.k3s.io | sh -s - server \
  --cluster-init \
  --write-kubeconfig-mode=0644

# print the join token; the other nodes need this value as $TOKEN
sudo cat /var/lib/rancher/k3s/server/node-token

# second and third nodes (TOKEN holds the value printed on the first node)
curl -sfL https://get.k3s.io | K3S_URL=https://mini-1:6443 K3S_TOKEN="$TOKEN" sh -s - server

# verify that all three nodes joined and the default storage class exists
kubectl get nodes -o wide
kubectl get storageclass
```
This keeps the host lifecycle intentionally primitive. PXE boot, Ansible, Terraform providers for local machines, and GitOps-managed node images are all possible. They are not the first problem. The first problem is a working cluster whose failure modes fit in my head.
Storage Policy
Storage is where home clusters usually overbuild first. The cloud version used Hetzner CSI volumes. Those gave each persistent volume a network block device, independent of the worker node. At home, the equivalent feature is distributed storage, usually Longhorn, Rook/Ceph, or a NAS. All of those are valid. None should be day one.
```mermaid
flowchart LR
    workload["Workload"] --> state{"State type?"}
    state -->|"HTTP app / worker / queue consumer"| stateless["Stateless deployment\nrun anywhere"]
    state -->|"database"| db["Single Postgres primary\nlocal NVMe + nightly logical backup"]
    state -->|"object blobs / user files"| object["S3-compatible object store\nR2 / B2 preferred"]
    state -->|"cache / scratch"| scratch["local-path PVC\nrecreatable"]
    state -->|"must survive node loss automatically"| later["Add Longhorn later\nonly after backup restore is boring"]
```
The lazy version is local NVMe plus offsite backups:
- Stateless services use normal Deployments and can reschedule anywhere.
- PostgreSQL runs as one primary with a local-path PVC pinned to the node that owns the disk.
- Object-like state goes to R2 or B2 instead of a home disk when possible.
- Monitoring and registry storage are allowed to be lossy or restorable.
- The restore path is tested before any distributed storage is introduced.
This sounds less impressive than Longhorn, but it is the safer first version. Distributed storage moves the failure from "node died, restore from backup" to "storage control plane is degraded, replicas are rebuilding, and every workload is now coupled to that subsystem." That is worthwhile when the recovery objective demands it. It is not worthwhile just to preserve the shape of a cloud cluster.
Repo Changes
The application layer can mostly stay Kubernetes-shaped. The infrastructure layer should become much smaller. The old infra/prod directory managed a cloud network, cloud servers, a load balancer, firewall, Talos bootstrap, hcloud CSI, hcloud CCM, autoscaler, ArgoCD, and secrets. The home cluster does not need most of that.
```text
Infrastructure deletions:
  remove hcloud-k8s module
  remove hcloud cloud-controller-manager
  remove hcloud CSI
  remove Hetzner load balancer dependency
  remove cluster-autoscaler

Kubernetes deltas:
  hcloud-volumes -> local-path
  hcloud-volumes-encrypted -> local-path, or node-local encrypted disk
  CNPG instances: 2 -> 1
  ingress-nginx LoadBalancer -> ClusterIP or NodePort behind cloudflared
  external-dns: optional, because Cloudflare Tunnel owns hostname routing

Operational deltas:
  Terraform provisions nothing for the cluster
  GitOps still owns workloads
  host lifecycle is manual until it becomes painful
```
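The storage class delta is mechanical enough to script. The sketch below is one way to preview it, assuming the manifests name the class on a `storageClassName:` line (plain PVCs) or a `storageClass:` line (CNPG clusters); `rewrite_storage_class` is a hypothetical helper, not part of the repo.

```shell
# Reads a manifest on stdin and rewrites Hetzner storage classes to local-path.
# Only lines of the form "storageClass[Name]: hcloud-volumes[-encrypted]" change.
# Note: the encrypted class becomes plain local-path, so encryption at rest
# has to come from the node's disk instead.
rewrite_storage_class() {
  sed -E 's/^([[:space:]]*storageClass(Name)?:[[:space:]]*)"?hcloud-volumes(-encrypted)?"?[[:space:]]*$/\1local-path/'
}

# Demo on a single line; real use would pipe a manifest through and diff it.
printf 'storageClassName: hcloud-volumes\n' | rewrite_storage_class
```

Piping each manifest through the function and diffing against the original keeps the change reviewable before anything is committed.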
The important deletions are the cloud-specific controllers. The hcloud-cloud-controller-manager and hcloud-csi are useful only while Hetzner is the substrate. Once workloads run on home nodes, keeping those abstractions around is worse than useless: they make manifests look portable while encoding a provider that no longer exists.
The same applies to autoscaling. A home stack can autoscale pods, but it cannot autoscale nodes unless there is an actual pool of powered-off machines with automated provisioning. That is not day one. The scheduling policy should assume fixed capacity and fail visibly when the cluster is full.
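Failing visibly can start as a check for pods stuck in Pending. A minimal sketch, assuming the default column layout of `kubectl get pods -A --no-headers` (namespace, name, ready, status); `report_pending` is a name invented here.

```shell
# Reads `kubectl get pods -A --no-headers` output on stdin.
# Prints namespace/name for each Pending pod and returns 1 if any exist.
report_pending() {
  awk '$4 == "Pending" { print $1 "/" $2; found = 1 } END { exit (found ? 1 : 0) }'
}

# Usage on a cluster node or admin laptop:
#   kubectl get pods -A --no-headers | report_pending \
#     || echo "pods are Pending: check capacity before deploying more"
```

Pending has causes other than a full cluster, such as an unbound PVC, so this is a signal to look, not a diagnosis.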
Ingress and Admin
Home ingress should not require a static residential IP, router port forwards, or exposing the home address directly. Cloudflare Tunnel is the simplest edge. A small cloudflared deployment creates outbound connections from the cluster to Cloudflare, and public hostnames route over that tunnel.
```yaml
tunnel: talos-redux-home
credentials-file: /etc/cloudflared/talos-redux-home.json

ingress:
  - hostname: argocd.example.com
    service: http://argocd-server.argocd.svc.cluster.local:80
  - hostname: grafana.example.com
    service: http://grafana.monitoring.svc.cluster.local:80
  - hostname: api.example.com
    service: http://beyond-api.beyond-api.svc.cluster.local:80
  - service: http_status:404
```
Administrative access should be separate. Tailscale handles SSH, Kubernetes API access, emergency dashboards, and private services. Public HTTP goes through Cloudflare. Admin paths go through Tailscale. Nothing requires an open residential SSH port.
Backups
Backups are the replacement for day-one distributed storage. The backup contract should be boring:
- Database dumps are written offsite every night.
- Application object data is already offsite or copied offsite.
- Secrets are exportable and encrypted outside the cluster.
- A restore is tested on a different machine.
- Retention is short enough to be cheap and long enough to catch mistakes.
```sh
#!/usr/bin/env bash
set -euo pipefail

stamp="$(date -u +%Y%m%dT%H%M%SZ)"
tmp="/var/backups/postgres-main-$stamp.sql.zst"

# dump all databases and roles from the primary, compressing on the way out
kubectl -n database exec postgres-main-1 -- \
  pg_dumpall --clean --if-exists \
  | zstd -19 -T0 > "$tmp"

# copy offsite, then prune remote dumps older than 30 days
rclone copy "$tmp" r2:talos-redux-backups/postgres/
rclone delete --min-age 30d r2:talos-redux-backups/postgres/

# restore check belongs in CI or a weekly cron on a different machine
```
The restore drill matters more than the backup command. A backup that has never been restored is an artifact, not a recovery plan. For this project, I would rather have one boring nightly pg_dumpall restore-tested on a spare node than an impressive storage layer whose failure path I have not practiced.
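A restore drill along those lines could look like the sketch below. It assumes the dump naming and bucket path from the backup script, a throwaway Postgres reachable through a connection string in `SCRATCH_DSN`, and that at least one database besides `postgres` should exist after the restore. The function names and `SCRATCH_DSN` are hypothetical.

```shell
# Picks the newest dump from a list of file names on stdin.
# Relies on the UTC timestamp in the name sorting lexicographically.
latest_dump() {
  grep -E '^postgres-main-[0-9]{8}T[0-9]{6}Z\.sql\.zst$' | sort | tail -n 1
}

# Downloads the newest dump and replays it into a scratch Postgres.
restore_check() {
  dsn="${SCRATCH_DSN:?set SCRATCH_DSN to a throwaway instance}"
  remote="r2:talos-redux-backups/postgres"
  workdir="$(mktemp -d)"
  name="$(rclone lsf "$remote/" | latest_dump)"
  [ -n "$name" ] || { echo "no dump found" >&2; return 1; }
  rclone copy "$remote/$name" "$workdir/" || return 1
  # psql keeps going past individual errors by default; a replay into a fresh
  # instance can emit some (for example around pre-existing roles).
  zstd -dc "$workdir/$name" | psql "$dsn" >/dev/null || return 1
  # Sanity check: a restore that yields no user databases is a failed drill.
  count="$(psql "$dsn" -Atc "select count(*) from pg_database where not datistemplate and datname <> 'postgres'")"
  rm -rf "$workdir"
  [ "$count" -gt 0 ]
}
```

Running `restore_check` from a weekly cron on a machine outside the cluster, and alerting on a non-zero exit, is enough to turn the nightly dump into something that has actually been restored. A stronger check would query a known application table, which depends on the schema.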
Failure Model
The honest comparison is not "home hosting is as reliable as Hetzner." It is not. The honest comparison is "the failure modes are acceptable for this phase and much cheaper to hold idle."
| Failure | Immediate result | Minimal mitigation |
|---|---|---|
| Power blink | Cluster reboots | UPS + BIOS power-on-after-AC |
| ISP outage | Public apps unavailable | Accept downtime; optional LTE later |
| Router failure | Public and admin access unavailable | Spare router config export |
| One mini PC dies | Stateless pods reschedule; local PV down | Restore from backup or move PVC manually |
| NVMe dies | Node-local state lost | Nightly offsite backups |
| Cloudflare outage | Public ingress unavailable | Tailscale admin still works |
| Accidental delete | Data loss | Versioned offsite backup + restore drill |
This model deliberately accepts stateful downtime. If the Postgres node dies, the database is down until the PVC is moved or the dump is restored. That is a fine trade while the project is bootstrapping. When downtime becomes more expensive than complexity, add replicated storage or move the database to a managed provider.
What Not to Add Yet
The tempting path is to rebuild every cloud feature locally: distributed block storage, MetalLB, BGP, PXE, GitOps-managed host imaging, external secrets, HA Postgres, multi-WAN, Prometheus long-term storage, and automated bare-metal node replacement. Those are all plausible future steps. Most are not the next step.
The first home-hosted platform should skip:
- Longhorn until a manual restore is more painful than operating Longhorn.
- MetalLB until something on the LAN needs stable service IPs.
- PXE and image automation until rebuilding a node has happened enough times to be boring and annoying.
- Multi-WAN until public availability is worth another bill.
- Rack hardware unless noise, heat, and power are part of the hobby.
The correct initial complexity budget is three machines, one cluster, one tunnel, one backup path, and one restore drill.
Decision
A stack of mini PCs can replace the old Hetzner cluster for this phase. It is compute-equivalent, memory-superior, and dramatically cheaper while idle. It is not an uptime-equivalent replacement for a datacenter-backed platform, so the architecture has to admit that:
- Use k3s embedded etcd across three nodes.
- Use local NVMe first, not distributed storage first.
- Use Cloudflare Tunnel for public ingress.
- Use Tailscale for administration.
- Use offsite backups as the real durability boundary.
- Accept stateful downtime until the project earns HA complexity.
The old platform was designed like a small cloud platform. The home version should be designed like a recoverable workshop server cluster. Same Kubernetes deployment model where it helps. Much less provider-shaped ceremony where it does not.
References
The price and platform assumptions here come from current provider documentation and from the teardown inventory. Used mini PC prices are intentionally treated as ranges because listings move daily.