Replacing Cloud Kubernetes with Home Mini PCs

The old platform was a small Talos Kubernetes cluster on Hetzner Cloud. It was not large by production standards: one control-plane node, three static workers, one ingress load balancer, and a zero-to-five-node autoscale pool. Before the price change, that shape was annoying but defensible. After the June 2026 pricing change, recreating the same Ashburn footprint became the wrong default.

The important word is recreating. Existing cloud servers can be grandfathered for a while, but Kubernetes platforms recycle hosts. Talos upgrades, node replacement, autoscaler churn, failed disks, accidental deletes, and region moves all eventually turn grandfathered infrastructure into current-price infrastructure. I do not want a platform whose cost model depends on never replacing a node.

The teardown took the Hetzner project to zero billable resources. That makes the next architecture decision clean: do not preserve the old shape just because it existed. Rebuild only the parts that the current project needs.

hetzner-state.txt

Old talos-redux shape:
  control plane: 1 x cpx31, Ashburn
  static workers: 3 x cpx21, Ashburn
  autoscale pool: 0-5 x cpx21, Ashburn
  ingress: 1 x lb11
  persistent volumes: 260 GB
  normal rebuild cost: about 183 EUR/month after the June 2026 price change
  autoscale ceiling: about 345 EUR/month with five extra cpx21 nodes

After teardown:
  servers: 0
  load balancers: 0
  volumes: 0
  snapshots: 0
  recurring bill: 0 EUR, excluding already accrued prorated charges

Why Home Hosting Is Back on the Table

Home hosting usually loses when the requirement is production uptime. The house is not a datacenter. Power, cooling, WAN, switching, routing, and physical access are now part of the system. If a router firmware update breaks the edge at midnight, there is no provider SLA to hide behind.

That is acceptable here because this is still bootstrapping. The first requirement is low recurring cost with enough compute to keep building. Public availability matters, but a few hours of downtime is not a business-ending event. That changes the design target from cloud-like HA to cheap, understandable, recoverable infrastructure.

The useful home-hosted replacement is not an old rack server. A rack server can look cheap on eBay and then lose on power, noise, heat, and parts. The useful replacement is a stack of used business mini PCs: Lenovo ThinkCentre Tiny, Dell OptiPlex Micro, HP EliteDesk Mini, or similar. They are quiet, efficient, commodity x86 boxes with real RAM slots and NVMe storage.

capacity-comparison.txt

Three-node home shape:
  nodes: 3 x used business mini PC
  example CPU class: i5-8500T / i5-9500T / similar 35 W desktop part
  memory: 32 GB per node
  storage: 1-2 TB NVMe per node
  aggregate: 18 hardware threads / 96 GB RAM / 3-6 TB raw NVMe

Old cloud shape:
  nodes: 4 Hetzner cloud servers
  aggregate: 13 shared vCPU / 20 GB RAM / 260 GB network volumes

The home stack is compute-equivalent or better.
It is not datacenter-equivalent. Power, WAN, router, and building are now part of the platform.

On paper, three used mini PCs are already more machine than the old Hetzner cluster. The old cluster had 20 GB of RAM total. Three 32 GB mini PCs have 96 GB. The old cluster had 13 shared vCPUs. Three six-thread office desktops have 18 hardware threads. Shared cloud vCPU and local desktop CPU are not perfectly comparable, but for this workload class the home stack is not the weak side of the comparison.
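
The aggregate figures can be reproduced with a few lines of shell. This is only a sketch: the per-instance sizes (4 vCPU and 8 GB for cpx31, 3 vCPU and 4 GB for cpx21) are assumed from Hetzner's published instance types rather than stated above, and the 6-thread, 32 GB home node is the target shape, not a measured machine. The helper name sum_capacity is made up for the example.

```shell
#!/usr/bin/env bash
set -euo pipefail

# sum_capacity: reads "count cpu_threads ram_gb" rows on stdin
# and prints the totals as "threads ram_gb".
sum_capacity() {
  awk '{ cpu += $1 * $2; ram += $1 * $3 } END { print cpu, ram }'
}

# Assumed instance sizes: cpx31 = 4 vCPU / 8 GB, cpx21 = 3 vCPU / 4 GB.
old_cloud="$(printf '1 4 8\n3 3 4\n' | sum_capacity)"

# Target home shape: three 6-thread nodes with 32 GB each.
home="$(printf '3 6 32\n' | sum_capacity)"

echo "old cloud (vCPU, GB RAM):  $old_cloud"
echo "home stack (threads, GB):  $home"
```

The totals match the 13 vCPU / 20 GB and 18 threads / 96 GB figures quoted above. The script says nothing about how a shared vCPU compares with a local core.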

Target Hardware

The minimal buy is three identical or near-identical mini PCs, a UPS, and enough NVMe to stop treating storage as scarce. I would target this shape:

Component	Target	Reason
Node count	3	Enough for embedded etcd quorum and simple maintenance.
CPU	i5-8500T / i5-9500T class or newer	Cheap, quiet, 35 W nominal parts with adequate single-core speed.
Memory	32 GB per node	The old cluster was memory-constrained; RAM is the cheap fix.
Disk	1-2 TB NVMe per node	Local database, registry cache, observability, and restore room.
Network	Wired gigabit Ethernet	Wi-Fi is not a server backplane.
Power	UPS for nodes, router, switch	The cluster is only reachable if the edge stays powered.

This is a capital expense instead of a cloud rental. The practical range is roughly a few hundred dollars for used nodes, plus RAM and SSD upgrades if the listings are bare. Power is small: at 20-35 W per node, three nodes are usually in the single-digit to low-double-digit monthly dollar range depending on local electricity. A single old 1U or 2U server can burn that much by itself while making the room unpleasant.
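
The power estimate is simple arithmetic and can be checked against a local tariff. A minimal sketch, assuming about 730 hours per month (8760 / 12) and a placeholder price of 0.15 per kWh. Neither number comes from a bill, and the function name is invented for the example.

```shell
#!/usr/bin/env bash
set -euo pipefail

# monthly_power_cost NODES WATTS_PER_NODE PRICE_PER_KWH
# Prints the estimated monthly cost, assuming ~730 hours per month.
monthly_power_cost() {
  awk -v n="$1" -v w="$2" -v p="$3" \
    'BEGIN { printf "%.2f\n", n * w * 730 / 1000 * p }'
}

# 0.15 per kWh is a placeholder tariff; substitute the local rate.
echo "three nodes at 20 W: $(monthly_power_cost 3 20 0.15)"
echo "three nodes at 35 W: $(monthly_power_cost 3 35 0.15)"
```

At that placeholder tariff the range works out to roughly 6.57 to 11.50 per month, which is where the single-digit to low-double-digit claim comes from. The UPS, router, and switch draw extra power that this sketch ignores.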

Cluster Shape

I would not recreate the Talos + hcloud module locally. Talos is good, but the replacement problem is smaller now: take three machines, run Kubernetes, deploy workloads, expose HTTP, back up state. The shortest working tool is k3s with embedded etcd.

Home-hosted replacement topology

flowchart TD
  users["Users"] --> cf["Cloudflare DNS / Access"]
  cf --> tunnel["Cloudflare Tunnel"]

  admin["Admin laptop"] --> tailscale["Tailscale"]
  tailscale --> cluster

  subgraph home["Home network"]
    router["Router"]
    switch["Managed or boring unmanaged switch"]
    ups["UPS"]
    router --> switch
    ups --> router
    ups --> switch

    subgraph cluster["k3s embedded-etcd cluster"]
      n1["mini-1\ncontrol-plane + worker\nlocal NVMe"]
      n2["mini-2\ncontrol-plane + worker\nlocal NVMe"]
      n3["mini-3\ncontrol-plane + worker\nlocal NVMe"]
    end

    switch --> n1
    switch --> n2
    switch --> n3
    tunnel --> n1
  end

  n1 --> backups["R2 / B2 backups"]
  n2 --> backups
  n3 --> backups

The home stack replaces provider networking with Cloudflare Tunnel for public ingress and Tailscale for administrative access.

All three nodes run both control-plane and workload pods. That is not a pure production separation, but it is the right trade for a small bootstrap cluster: dedicating even one node to the control plane would take a third of the hardware away from the applications. If control-plane pressure becomes real later, the cluster can grow to five nodes and reserve three for control-plane duties.
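
The reason for three nodes rather than two is etcd's majority rule: the cluster accepts writes only while more than half of its voting members are reachable. A small sketch of that arithmetic follows. The function names are illustrative and not part of k3s or etcd.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Smallest majority of N voting members.
quorum() { echo $(( $1 / 2 + 1 )); }

# Members that can fail while the remaining ones still form a majority.
tolerated_failures() { echo $(( $1 - ($1 / 2 + 1) )); }

for members in 1 2 3 5; do
  echo "members=$members quorum=$(quorum "$members") tolerated_failures=$(tolerated_failures "$members")"
done
```

Two members tolerate no failures, three tolerate one, and five tolerate two. So three is the smallest cluster that can keep its control plane up while one node is rebooted or rebuilt.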

k3s-bootstrap.sh

    # First node: initialize the embedded etcd cluster.
    curl -sfL https://get.k3s.io | sh -s - server \
      --cluster-init \
      --write-kubeconfig-mode=0644

    # Print the join token; set it as TOKEN on the other two nodes.
    sudo cat /var/lib/rancher/k3s/server/node-token

    # Second and third nodes: join as additional servers.
    curl -sfL https://get.k3s.io | K3S_URL=https://mini-1:6443 K3S_TOKEN="$TOKEN" sh -s - server

    # Verify from any server node.
    kubectl get nodes -o wide
    kubectl get storageclass

This keeps the host lifecycle intentionally primitive. PXE boot, Ansible, Terraform providers for local machines, and GitOps-managed node images are all possible. They are not the first problem. The first problem is a working cluster whose failure modes fit in my head.

Storage Policy

Storage is where home clusters usually overbuild first. The cloud version used Hetzner CSI volumes. Those gave each persistent volume a network block device, independent of the worker node. At home, the equivalent feature is distributed storage, usually Longhorn, Rook/Ceph, or a NAS. All of those are valid. None should be day one.

Day-one state policy

flowchart LR
  workload["Workload"] --> state{"State type?"}
  state -->|"HTTP app / worker / queue consumer"| stateless["Stateless deployment\nrun anywhere"]
  state -->|"database"| db["Single Postgres primary\nlocal NVMe + nightly logical backup"]
  state -->|"object blobs / user files"| object["S3-compatible object store\nR2 / B2 preferred"]
  state -->|"cache / scratch"| scratch["local-path PVC\nrecreatable"]
  state -->|"must survive node loss automatically"| later["Add Longhorn later\nonly after backup restore is boring"]

The first version optimizes for understandable recovery, not transparent failover of every persistent volume.

The lazy version is local NVMe plus offsite backups:

Stateless services use normal Deployments and can reschedule anywhere.
PostgreSQL runs as one primary with a local-path PVC pinned to the node that owns the disk.
Object-like state goes to R2 or B2 instead of a home disk when possible.
Monitoring and registry storage are allowed to be lossy or restorable.
The restore path is tested before any distributed storage is introduced.
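
The day-one policy is small enough to write down as a lookup, which can help when reviewing a new workload. The sketch below mirrors the flowchart and the list above. The state-type keywords and the function name are hypothetical labels for this example, not names from the repository.

```shell
#!/usr/bin/env bash
set -euo pipefail

# storage_policy STATE_TYPE
# Maps a workload's state type to the day-one storage choice.
storage_policy() {
  case "$1" in
    stateless)  echo "Deployment, no PVC, schedule anywhere" ;;
    database)   echo "single Postgres primary, local-path PVC, nightly offsite dump" ;;
    object)     echo "S3-compatible bucket (R2 or B2), not a home disk" ;;
    scratch)    echo "local-path PVC, recreatable" ;;
    replicated) echo "deferred: add Longhorn after restore drills are routine" ;;
    *)          echo "unknown state type: $1" >&2; return 1 ;;
  esac
}

for kind in stateless database object scratch replicated; do
  printf '%-11s %s\n' "$kind" "$(storage_policy "$kind")"
done
```

An unknown state type fails on purpose. A workload that fits none of the rows should prompt a decision rather than quietly get a default.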

This sounds less impressive than Longhorn, but it is the safer first version. Distributed storage moves the failure from "node died, restore from backup" to "storage control plane is degraded, replicas are rebuilding, and every workload is now coupled to that subsystem." That is worthwhile when the recovery objective demands it. It is not worthwhile just to preserve the shape of a cloud cluster.

Repo Changes

The application layer can mostly stay Kubernetes-shaped. The infrastructure layer should become much smaller. The old infra/prod directory managed a cloud network, cloud servers, a load balancer, firewall, Talos bootstrap, hcloud CSI, hcloud CCM, autoscaler, ArgoCD, and secrets. The home cluster does not need most of that.

repo-delta.txt

Infrastructure deletions:
  remove hcloud-k8s module
  remove hcloud cloud-controller-manager
  remove hcloud CSI
  remove Hetzner load balancer dependency
  remove cluster-autoscaler

Kubernetes deltas:
  hcloud-volumes -> local-path
  hcloud-volumes-encrypted -> local-path, or node-local encrypted disk
  CNPG instances: 2 -> 1
  ingress-nginx LoadBalancer -> ClusterIP or NodePort behind cloudflared
  external-dns: optional, because Cloudflare Tunnel owns hostname routing

Operational deltas:
  Terraform provisions nothing for the cluster
  GitOps still owns workloads
  host lifecycle is manual until it becomes painful

The important deletions are the cloud-specific controllers. The hcloud-cloud-controller-manager and hcloud-csi are useful only while Hetzner is the substrate. Once workloads run on home nodes, keeping those abstractions around is worse than useless: they make manifests look portable while encoding a provider that no longer exists.
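
The storage-class delta from repo-delta.txt is mechanical enough to script. A hedged sketch: it assumes the manifests name the class on a line of their own under a storageClassName key (plain PVCs) or a storageClass key (as CNPG Cluster resources do). It prints the rewritten manifest instead of editing in place so the change can be diffed first. The function name and file paths are illustrative.

```shell
#!/usr/bin/env bash
set -euo pipefail

# rewrite_storage_class FILE
# Prints FILE with hcloud-volumes and hcloud-volumes-encrypted replaced by
# local-path. The input file is left untouched.
rewrite_storage_class() {
  local manifest="$1"
  sed -E 's/^([[:space:]]*storageClass(Name)?:[[:space:]]*)"?hcloud-volumes(-encrypted)?"?[[:space:]]*$/\1local-path/' "$manifest"
}

# Small demonstration on a throwaway manifest fragment.
demo="$(mktemp)"
printf 'spec:\n  storageClassName: hcloud-volumes-encrypted\n  accessModes: [ReadWriteOnce]\n' > "$demo"
rewrite_storage_class "$demo"
rm -f "$demo"
```

Against a real manifest, something like `rewrite_storage_class cluster.yaml | diff cluster.yaml -` shows the change before anything is committed. Note that the sketch maps the encrypted class to plain local-path, so the node-local encrypted disk option still needs its own decision.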

The same applies to autoscaling. A home stack can autoscale pods, but it cannot autoscale nodes unless there is an actual pool of powered-off machines with automated provisioning. That is not day one. The scheduling policy should assume fixed capacity and fail visibly when the cluster is full.

Ingress and Admin

Home ingress should not require a static residential IP, router port forwards, or exposing the home address directly. Cloudflare Tunnel is the simplest edge. A small cloudflared deployment creates outbound connections from the cluster to Cloudflare, and public hostnames route over that tunnel.

cloudflared-config.yaml

    tunnel: talos-redux-home
    credentials-file: /etc/cloudflared/talos-redux-home.json

    ingress:
      - hostname: argocd.example.com
        service: http://argocd-server.argocd.svc.cluster.local:80
      - hostname: grafana.example.com
        service: http://grafana.monitoring.svc.cluster.local:80
      - hostname: api.example.com
        service: http://beyond-api.beyond-api.svc.cluster.local:80
      - service: http_status:404
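
The config file only routes traffic once the tunnel exists and each hostname has a DNS record pointing at it. A minimal sketch of that one-time setup, using `cloudflared tunnel create` and `cloudflared tunnel route dns`. It assumes `cloudflared tunnel login` has already been run on the machine. The `run` wrapper and the DRY_RUN switch are additions for this example so the commands can be reviewed before they touch the account.

```shell
#!/usr/bin/env bash
set -euo pipefail

TUNNEL="talos-redux-home"
HOSTNAMES=(argocd.example.com grafana.example.com api.example.com)

# run: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Create the tunnel once, then point each public hostname at it.
setup_tunnel() {
  run cloudflared tunnel create "$TUNNEL"
  local host
  for host in "${HOSTNAMES[@]}"; do
    run cloudflared tunnel route dns "$TUNNEL" "$host"
  done
}

# Print the plan only; drop DRY_RUN=1 to apply it.
DRY_RUN=1 setup_tunnel
```

The hostnames are the placeholders from the config above. Where the generated credentials file ends up, and how it reaches the path the config expects, is left out of this sketch.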
  

Administrative access should be separate. Tailscale handles SSH, Kubernetes API access, emergency dashboards, and private services. Public HTTP goes through Cloudflare. Admin paths go through Tailscale. Nothing requires an open residential SSH port.

Backups

Backups are the replacement for day-one distributed storage. The backup contract should be boring:

Database dumps are written offsite every night.
Application object data is already offsite or copied offsite.
Secrets are exportable and encrypted outside the cluster.
A restore is tested on a different machine.
Retention is short enough to be cheap and long enough to catch mistakes.

postgres-backup.sh

    #!/usr/bin/env bash
    set -euo pipefail

    # UTC timestamp keeps dump names sortable.
    stamp="$(date -u +%Y%m%dT%H%M%SZ)"
    tmp="/var/backups/postgres-main-$stamp.sql.zst"

    # Logical dump of every database and role, compressed on the host.
    kubectl -n database exec postgres-main-1 -- \
      pg_dumpall --clean --if-exists \
      | zstd -19 -T0 > "$tmp"

    # Copy offsite, then prune remote dumps older than 30 days.
    rclone copy "$tmp" r2:talos-redux-backups/postgres/
    rclone delete --min-age 30d r2:talos-redux-backups/postgres/

    # restore check belongs in CI or a weekly cron on a different machine

The restore drill matters more than the backup command. A backup that has never been restored is an artifact, not a recovery plan. For this project, I would rather have one boring nightly pg_dumpall restore-tested on a spare node than an impressive storage layer whose failure path I have not practiced.
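
A restore drill can start as a short script. The sketch below is one possible shape, not a tested procedure. It picks the newest dump by name, checks the archive, and replays it into a scratch PostgreSQL instance. SCRATCH_PGHOST is a hypothetical throwaway server on a different machine, and authentication is assumed to be handled by the environment. The bucket path and the filename pattern follow the backup script.

```shell
#!/usr/bin/env bash
set -euo pipefail

REMOTE="r2:talos-redux-backups/postgres"

# latest_dump: read object names on stdin, print the newest dump.
# The UTC stamp in the name sorts lexicographically, so plain sort is enough.
# Fails (non-zero) if no name matches, which aborts the drill under set -e.
latest_dump() {
  grep -E '^postgres-main-[0-9]{8}T[0-9]{6}Z\.sql\.zst$' | sort | tail -n 1
}

# restore_check: fetch the newest dump and replay it into a scratch instance.
restore_check() {
  local name workdir
  name="$(rclone lsf "$REMOTE" | latest_dump)"
  workdir="$(mktemp -d)"
  rclone copy "$REMOTE/$name" "$workdir"

  # Verify the archive before spending time on the replay.
  zstd -t "$workdir/$name"

  zstd -dc "$workdir/$name" | psql -h "$SCRATCH_PGHOST" -U postgres -d postgres --quiet

  # Placeholder sanity query; a real drill should query application tables.
  psql -h "$SCRATCH_PGHOST" -U postgres -d postgres -Atc \
    'SELECT count(*) FROM pg_database WHERE NOT datistemplate'
  rm -rf "$workdir"
}
```

Calling `restore_check` from a weekly cron job or CI run on a machine other than the database node matches the note at the end of the backup script. What counts as a passing restore is specific to the workload, so the final query here is only a stand-in.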

Failure Model

The honest comparison is not "home hosting is as reliable as Hetzner." It is not. The honest comparison is "the failure modes are acceptable for this phase and much cheaper to hold idle."

home-failure-model.txt

Failure	Immediate result	Minimal mitigation
Power blink	Cluster reboots	UPS + BIOS power-on-after-AC
ISP outage	Public apps unavailable	Accept downtime; optional LTE later
Router failure	Public and admin access unavailable	Spare router config export
One mini PC dies	Stateless pods reschedule; local PV down	Restore from backup or move PVC manually
NVMe dies	Node-local state lost	Nightly offsite backups
Cloudflare outage	Public ingress unavailable	Tailscale admin still works
Accidental delete	Data loss	Versioned offsite backup + restore drill

This model deliberately accepts stateful downtime. If the Postgres node dies, the database is down until the PVC is moved or the dump is restored. That is a fine trade while the project is bootstrapping. When downtime becomes more expensive than complexity, add replicated storage or move the database to a managed provider.

What Not to Add Yet

The tempting path is to rebuild every cloud feature locally: distributed block storage, MetalLB, BGP, PXE, GitOps-managed host imaging, external secrets, HA Postgres, multi-WAN, Prometheus long-term storage, and automated bare-metal node replacement. Those are all plausible future steps. Most are not the next step.

The first home-hosted platform should skip:

Longhorn until a manual restore is more painful than operating Longhorn.
MetalLB until something on the LAN needs stable service IPs.
PXE and image automation until rebuilding a node has happened enough times to be boring and annoying.
Multi-WAN until public availability is worth another bill.
Rack hardware unless noise, heat, and power are part of the hobby.

The correct initial complexity budget is three machines, one cluster, one tunnel, one backup path, and one restore drill.

Decision

A stack of mini PCs can replace the old Hetzner cluster for this phase. It is compute-equivalent, memory-superior, and dramatically cheaper while idle. It is not an uptime-equivalent replacement for a datacenter-backed platform, so the architecture has to admit that:

Use k3s embedded etcd across three nodes.
Use local NVMe first, not distributed storage first.
Use Cloudflare Tunnel for public ingress.
Use Tailscale for administration.
Use offsite backups as the real durability boundary.
Accept stateful downtime until the project earns HA complexity.

The old platform was designed like a small cloud platform. The home version should be designed like a recoverable workshop server cluster. Same Kubernetes deployment model where it helps. Much less provider-shaped ceremony where it does not.

References

The price and platform assumptions here come from current provider documentation and from the teardown inventory. Used mini PC prices are intentionally treated as ranges because listings move daily.
