Help:Multi-node Kubernetes

From Canasta Wiki

This page describes the conceptual model for running Canasta on a multi-node Kubernetes cluster: when it's the right shape, what storage and ingress requirements apply, and what limits the default chart still has.

For an end-to-end worked example, see Help:User journeys/Canasta multi-node on AWS EC2 with k3s.

When you need this

A multi-node, multi-replica deployment is appropriate if your wiki gets enough traffic that a single web pod is a bottleneck, if you need the wiki to keep serving during node maintenance, or if you're on a managed Kubernetes service (EKS, GKE, AKS) where multi-node is routine.

It is not a substitute for full HA. The default deployment keeps several single points of failure that multi-replica web alone does not fix:

Capability                      | Multi-node multi-replica web
Page-view load balancing        | Yes
Survives a single web pod crash | Yes
Survives a node failure         | Only if pods are spread across nodes
Database HA                     | No — still one DB pod on node-local storage
Caddy HA                        | No — still single-replica

For true HA, add an external database (managed relational service or a Galera cluster — see Help:External database) and either put an external load balancer in front with --skip-tls, or accept Caddy as a single-replica TLS terminator.

Prerequisites

  • A controller — the machine where you run the canasta CLI. This is typically your laptop or a dedicated admin host. It is not part of the cluster.
  • SSH access from the controller to every node (for self-managed clusters), or a working kubectl + helm on the controller pointed at the cluster (for managed Kubernetes).
  • Canasta CLI installed on the controller. See Help:Installation.
  • A domain name with DNS control.
  • For self-managed clusters: at least two Linux hosts that meet Canasta's hardware requirements (4 GiB RAM each is the practical minimum; 8 GiB+ if Elasticsearch is enabled).
  • An RWM-capable StorageClass available on the cluster — see "Storage requirements" below.

Storage requirements

The four content PVCs (extensions, skins, images, public_assets) must be readable and writable from every node that may schedule a web pod. With multi-replica web that can mean any node in the cluster. Two implications:

  • Access mode must be ReadWriteMany (RWM) so the same PVC can mount on multiple nodes simultaneously. Canasta's chart ships with RWO defaults; pass --access-mode ReadWriteMany to canasta create on multi-node setups.
  • The StorageClass must be one whose CSI driver supports RWM. Common options:
    • Network filesystems exported from a shared host or appliance (NFS, SMB, CephFS).
    • Managed RWM services from cloud providers (e.g., AWS EFS, Azure Files, Google Cloud Filestore).
    • Vendor CSI drivers that explicitly advertise RWM (NetApp Trident, Portworx, Longhorn-with-RWM-enabled).

Single-node defaults like k3s's local-path and EKS's gp2 are RWO-only and will not work for multi-replica web.
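A quick way to check whether a StorageClass can actually satisfy multi-node mounts is to create a probe PVC that requests ReadWriteMany; it will only bind if the class's CSI driver supports that mode. A minimal sketch — the class name nfs-csi is a placeholder for whatever RWM-capable class your cluster provides:

```yaml
# rwm-probe.yaml — test claim; binds only if the CSI driver supports RWM.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rwm-probe
spec:
  accessModes:
    - ReadWriteMany          # the mode Canasta's content PVCs need on multi-node
  storageClassName: nfs-csi  # placeholder: substitute your RWM-capable class
  resources:
    requests:
      storage: 1Gi
```

Apply it with kubectl apply -f rwm-probe.yaml and watch whether it reaches Bound; delete it afterwards.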

For NFS specifically, Canasta provides a turnkey setup helper (canasta storage setup nfs) that installs the server, exports a share, and registers the cluster-side CSI driver and StorageClass in one step. See the journeys for worked examples.
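In practice the NFS path is two steps — run the helper, then confirm the cluster sees the new class. A sketch (the subcommand is as named above; any additional flags vary by CLI version, so check its help output):

```shell
# One step: installs the NFS server, exports a share, and registers the
# cluster-side CSI driver and StorageClass.
canasta storage setup nfs

# Verify the new StorageClass is visible to the cluster:
kubectl get storageclass
```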

Topology

  • Cluster scope is the host — one cluster per host. The control plane and worker components are systemd singletons and a node belongs to exactly one cluster at a time.
  • The control plane runs pods by default — a single-node cluster (control-plane only, no workers) is a supported deployment. Workers are added when you need scale-out, not when you need correctness.
  • Multiple Canasta instances per cluster are supported — each lands in its own canasta-<id> namespace with its own Helm release and (optionally) its own Argo CD Application. Cluster-scoped infrastructure (Argo CD, cert-manager, ingress controller, ClusterIssuer) is shared across instances on the same cluster.
  • Multi-cluster means multi-host — separate dev / staging / prod environments are separate sets of hosts.

Caveats and known limitations

Database

The default deployment runs one MariaDB pod on node-local (local-path) storage. A node failure on the DB's node takes the wiki down regardless of web replica count. For production HA, point Canasta at an external managed database by passing -e/--envfile with USE_EXTERNAL_DB=true plus MYSQL_HOST, MYSQL_PORT, and MYSQL_USER at canasta create time. Canasta then skips the bundled db service and points MediaWiki at your external host. See Help:External database for the full envfile reference, prerequisites, and known limitations.
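A minimal envfile sketch for the external-database path, using only the variables named above (host and user values are placeholders; see Help:External database for the full variable reference, including how credentials are supplied):

```ini
# external-db.env — passed at create time:
#   canasta create -e external-db.env ...
USE_EXTERNAL_DB=true
MYSQL_HOST=db.example.internal   # placeholder: your managed DB endpoint
MYSQL_PORT=3306
MYSQL_USER=canasta               # placeholder user
```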

Caddy

Caddy is single-replica. The caddy-data PVC holds Let's Encrypt certificates and is RWO; multiple replicas would also independently provision certs and hit ACME rate limits. For an HA TLS layer, put an external load balancer in front of Canasta and disable Canasta's TLS with --skip-tls.
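When TLS terminates at an external load balancer, the create invocation disables Caddy's ACME flow with --skip-tls. A sketch using only flags documented on this page (instance id is a placeholder):

```shell
# TLS is handled upstream by the load balancer; Canasta serves plain HTTP
# behind it, so no certificates are provisioned in-cluster.
canasta create --id mywiki --orchestrator k8s \
  --access-mode ReadWriteMany --skip-tls
```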

Elasticsearch / OpenSearch

Both run single-node by default (discovery.type=single-node hardcoded). Clustered Elasticsearch needs out-of-chart configuration.

Pod scheduling

The chart doesn't declare pod anti-affinity or topology spread constraints — multi-replica spread is best-effort. With replicaCount: 3 on two nodes the K8s scheduler may produce 2-1 or 3-0 placement depending on resource availability. If you need guaranteed spread, cordon the crowded node temporarily and re-run canasta restart to force reschedule.
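The cordon workaround sketched as commands — <node> is whichever node the pods piled up on, and the canasta restart invocation may need instance-selection flags depending on your CLI version:

```shell
# Make the crowded node temporarily unschedulable, force a reschedule,
# then reopen the node for scheduling.
kubectl cordon <node>
canasta restart          # rescheduled pods must land on other nodes
kubectl uncordon <node>
```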

No canasta scale command yet

Replica counts are set by editing per-instance values.yaml on the target host and running canasta restart. The restart flow reads web.replicaCount and applies it via helm upgrade. New web pods attach to the same RWM-backed PVCs and can land on any node.
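Scaling the web tier is therefore a values edit plus a restart. A sketch of the relevant fragment, following the web.replicaCount key named above (surrounding keys omitted):

```yaml
# per-instance values.yaml (fragment)
web:
  replicaCount: 3   # read by canasta restart and applied via helm upgrade
```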

Re-running canasta create against an existing instance

If a previous canasta create failed mid-flight and left the database PVC behind, re-running with the same --id regenerates the database root password but does not reset the DB's on-disk grants — producing an unrecoverable auth mismatch (Access denied for user 'root'). Workaround: delete the namespace before retrying.
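The cleanup amounts to removing the instance namespace, which deletes its PVCs — including the stale database volume — assuming the default Delete reclaim policy. <id> is your instance id:

```shell
# Removes the half-created instance, including the DB PVC with stale grants.
kubectl delete namespace canasta-<id>

# Then retry:
# canasta create --id <id> --orchestrator k8s ...
```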

Troubleshooting

Pods on the worker node can't resolve DNS or reach ClusterIP services. Cross-node networking is blocked. For k3s on cloud VMs the usual cause is a security group that allows only TCP between nodes — k3s uses VXLAN over UDP 8472 for pod networking. Allow UDP 8472 (or all UDP) between cluster nodes in addition to TCP. Smoking gun: the web pod's wait-for-db init container hangs on its DNS lookup.
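On AWS, the fix is an extra ingress rule permitting UDP 8472 between the nodes. A sketch, assuming both nodes are members of the same security group (the group id is a placeholder):

```shell
# Allow flannel VXLAN (UDP 8472) between members of the same security group.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol udp --port 8472 \
  --source-group sg-0123456789abcdef0
```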

PVCs stuck Pending. kubectl describe pvc <name> shows the cause. Usually: StorageClass missing, StorageClass doesn't support RWM, or CSI driver pod not running.

Web replicas Pending after scale-up. RWO PVC already attached to a different node. Recreate the instance with --access-mode ReadWriteMany against an RWM-capable StorageClass.

Pods all on one node. Scheduler doesn't guarantee spread (see "Pod scheduling" above). Use the cordon workaround.

Certificate stuck Ready=False. Usually DNS hasn't propagated, or port 80 isn't open from the internet (Let's Encrypt HTTP-01 challenge requires reachable port 80).

Error scenarios and recovery

Common misuse patterns when first setting up a multi-node cluster, and what to do about them.

canasta install k8s-worker --cp-host <name> fails with "Host '<name>' is not registered." The --cp-host value must refer to a host in the controller's saved-host inventory. Run canasta host add --name <name> --ssh user@<host> first, then retry. The hosts.yml file lives in the controller's config directory ($CANASTA_CONFIG_DIR/hosts.yml; see Help:Multi-host management).
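The recovery sequence, with placeholder host names and SSH target (any additional worker-side flags are per your CLI's help):

```shell
# Register the control-plane host in the controller's inventory, then retry.
canasta host add --name cp1 --ssh admin@203.0.113.10
canasta install k8s-worker --cp-host cp1
```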

canasta install k8s-worker --cp-host <name> exits silently or with a generic "Module failed: non-zero return code". The cp-host is registered but doesn't have a k3s control plane running. Common cause: forgetting canasta install k8s-cp --host <name> before the worker step, or pointing --cp-host at a host that's a worker rather than a cp. Confirm with ssh <cp-host> 'sudo systemctl is-active k3s' — if the answer is anything other than active, install or repair the cp first. (Tracked: silent-failure ergonomics is being improved.)

canasta create --orchestrator k8s fails with "kubectl is not installed". The target host has no Canasta-managed k3s install. Run canasta install k8s-cp --host <host> first. If the host already runs Kubernetes from another distribution (kubeadm, kops, RKE2, EKS-bootstrapped, etc.), just ensure kubectl and helm are available in the controller's PATH pointed at that cluster — Canasta does not require the cluster to be canasta-installed, only the tooling to be present.

canasta install k8s-cp on a host that already has k3s. Re-running canasta install k8s-cp on a host where the same command previously succeeded is idempotent — it re-applies the install steps and exits cleanly. If the host has k3s from a non-Canasta install (e.g., bare curl https://get.k3s.io | sh), the install may collide with existing service definitions. To start clean, first run canasta uninstall k8s --host <host> (or /usr/local/bin/k3s-uninstall.sh for a non-Canasta install), then re-run.

Recovery from a failed cp install. If canasta install k8s-cp partially succeeds and leaves the host in a half-installed state, run canasta uninstall k8s --host <host> to clean up the systemd unit, the data dir, and the kubectl/helm binaries; then retry. The uninstall is safe to run on a host that's already clean.
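Putting the recovery together (the host name is a placeholder):

```shell
# Clean out the half-installed control plane, then reinstall from scratch.
canasta uninstall k8s --host cp1
canasta install k8s-cp --host cp1
```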

See also