Why does my whole Galera cluster show non-Primary status?

A network partition or simultaneous node loss has left no component holding a majority, so no group has quorum and writes stop cluster-wide to avoid split-brain. Identify the node with the highest committed seqno and reinstate it as the primary component by running SET GLOBAL wsrep_provider_options='pc.bootstrap=YES' on that node only, then let the other nodes rejoin. Running an odd number of nodes across separate availability zones prevents most quorum losses.

Why does a rejoining node keep doing a full SST instead of a fast IST?

Incremental State Transfer replays only the write-sets a node missed while it was down, streamed from a donor's GCache ring buffer. If gcache.size is smaller than the write volume produced during the outage, the required range ages out and the provider falls back to a full State Snapshot Transfer. Increase gcache.size to exceed peak write volume across your longest maintenance window.

How do I recover a Galera node that won't start with seqno -1?

An unclean shutdown left grastate.dat with safe_to_bootstrap: 0 and seqno: -1. Run mariadbd --wsrep-recover on each node to reconstruct the last committed seqno from InnoDB, identify the node with the highest recovered seqno, set safe_to_bootstrap: 1 in that node's grastate.dat, and bootstrap it as the new primary component. The remaining nodes then rejoin, using IST when the donor's GCache still covers their gap and falling back to SST otherwise.

Galera Cluster Setup & Node Management

MariaDB Galera Cluster delivers synchronous multi-master replication through the Write-Set Replication (wsrep) API, giving every node an identical, certified copy of committed data and a recovery point objective of zero. That guarantee is only as strong as the operational discipline behind it: a single mismatched provider option, an undersized write-set cache, or an uncoordinated shutdown can force a full State Snapshot Transfer, split the membership into competing components, or leave the group non-primary and unable to accept writes. This guide is the operational reference for standing up and running a production Galera deployment end to end — infrastructure baselines, wsrep.cnf alignment, bootstrap sequencing, node lifecycle, automation, failure remediation, and telemetry — across MariaDB 10.6 through 11.8 with Galera 4. It sits alongside the MariaDB Galera Core Architecture & Fundamentals reference, which explains the replication theory this page puts into practice.

Architecture Overview

A Galera deployment is a set of identical MariaDB nodes, each running the Galera provider library (libgalera_smm.so) loaded through wsrep_provider. Every node is a full read/write master. When a transaction commits, its write-set is broadcast to all members, ordered by the Group Communication System (GCS), and certified independently on each node before it is applied. The subsystem has four cooperating layers, and understanding which layer owns which failure mode is what makes incident response fast.

Every node is a full read/write master over an identical stack; the GCS bus orders and fans out each write-set, and joiners catch up via IST or SST.

Layer	Component	Responsibility	Primary failure signal
Server	`mariadbd`	SQL parsing, InnoDB storage, transaction execution	Slow queries, lock waits
Replication API	wsrep	Hooks commit path, produces/consumes write-sets	`wsrep_ready=OFF`
Provider	Galera (`libgalera_smm.so`)	Certification, GCache, state transfers	Certification conflicts, IST fallback
Communication	GCS / EVS	Membership, ordering, flow control	Non-primary views, node eviction

The mandatory baseline every node must share before it can join a primary component is small but unforgiving. These directives must be byte-identical across the fleet except for the node-specific address and name:

[mysqld]
# --- Provider & cluster identity (must match cluster-wide) ---
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_name=prod-galera-primary
wsrep_cluster_address=gcomm://10.0.1.10,10.0.1.11,10.0.1.12

# --- Per-node identity (unique on every host) ---
wsrep_node_name=db-node-01
wsrep_node_address=10.0.1.10

# --- Storage & format requirements for Galera ---
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2

binlog_format=ROW, innodb_autoinc_lock_mode=2, and InnoDB tables are hard requirements: Galera replicates row-based events only, and writes to MyISAM tables are not replicated reliably. A mismatched wsrep_cluster_name prevents a node from ever reaching a primary view, and the only clue is usually a handshake failure in the error log. The full annotated parameter set, including deprecated flags across versions, lives in the wsrep.cnf Configuration Deep Dive; the identity and addressing rules build directly on the topology models in Designing Multi-Master Topologies.
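The hard requirements above lend themselves to a pre-flight check before a node is allowed to start. The following is a minimal Python sketch under stated assumptions, not part of any MariaDB tooling: the names `parse_cnf` and `check_baseline` and the sample text are hypothetical, and the parser only understands simple key=value lines (no includes, no numeric aliases such as wsrep_on=1).

```python
# Hypothetical pre-flight check for the Galera hard requirements named above.
REQUIRED = {
    "wsrep_on": "ON",
    "binlog_format": "ROW",
    "default_storage_engine": "InnoDB",
    "innodb_autoinc_lock_mode": "2",
}


def parse_cnf(text: str) -> dict:
    """Parse simple key=value lines, skipping comments and section headers."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";", "[")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Option names may be written with dashes or underscores.
        settings[key.strip().replace("-", "_")] = value.strip().strip('"')
    return settings


def check_baseline(text: str) -> list:
    """Return human-readable violations; an empty list means compliant."""
    settings = parse_cnf(text)
    problems = []
    for key, expected in REQUIRED.items():
        actual = settings.get(key)
        if actual is None:
            problems.append(f"{key} is not set (expected {expected})")
        elif actual.upper() != expected.upper():  # case-insensitive compare
            problems.append(f"{key}={actual} (expected {expected})")
    return problems


sample = """
[mysqld]
wsrep_on=ON
binlog_format=MIXED
default_storage_engine=InnoDB
"""
for problem in check_baseline(sample):
    print(problem)
```

Run against the sample, the check reports the wrong binlog_format and the missing innodb_autoinc_lock_mode. A real gate would read the rendered file from disk and fail the deployment when the list is non-empty.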

Core Mechanics: Replication, Certification, and State Transfer

Galera is often called synchronous, but it is more precisely virtually synchronous with certification-based conflict resolution. A committing transaction does not wait for remote nodes to apply its changes; it waits only for the write-set to be replicated and globally ordered. Each node then certifies the write-set against its own pending queue. The Write-Set Certification Process Explained covers the deterministic conflict logic in depth, but the operational summary is: because every node certifies the same ordered stream with the same rules, they all reach the same accept/reject decision, and the transaction is guaranteed durable on commit.

A commit waits only for ordered replication, not remote apply; the globally-ordered stream makes every node reach the same certification verdict, so the losing conflicting write is deterministically aborted.

The lifecycle of a node’s membership follows a strict state machine. A joining node moves through OPEN → PRIMARY → JOINER → JOINED → SYNCED, and only a SYNCED node is a viable donor or a fully caught-up writer. When a node joins, the provider first attempts an Incremental State Transfer (IST) — replaying only the write-sets missing since the node left, streamed from a donor’s GCache. If the donor’s GCache no longer holds the required range (the gcache.size ring buffer has wrapped), the node falls back to a State Snapshot Transfer (SST): a full physical copy of the dataset. IST is measured in seconds; SST on a large dataset can saturate the network for minutes to hours and desyncs the donor. Sizing GCache to survive your longest expected maintenance window is therefore the single highest-leverage tuning decision, and the trade-offs are worked through in Initial Data Synchronization Methods.

Only a SYNCED node is a viable donor and a fully caught-up read/write member; serving an SST temporarily desyncs the donor until the transfer completes.
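The IST-versus-SST decision described above can be modeled as a simple range check. This is a simplified sketch, assuming the donor's lowest cached seqno is read from wsrep_local_cached_downto and its newest from wsrep_last_committed; the function name `can_use_ist` is hypothetical, and the provider itself makes the real decision using more state than is shown here.

```python
def can_use_ist(joiner_uuid: str, joiner_seqno: int,
                donor_uuid: str, donor_cached_downto: int,
                donor_last_committed: int) -> bool:
    """Simplified model: IST is possible when the donor still caches every
    write-set the joiner is missing."""
    if joiner_uuid != donor_uuid:
        return False  # different cluster history, only a full copy makes sense
    if joiner_seqno < 0:
        return False  # seqno -1 means the joiner's position is unknown
    if joiner_seqno > donor_last_committed:
        return False  # joiner claims to be ahead of the donor; do not trust it
    first_needed = joiner_seqno + 1
    # The ring buffer must reach back at least as far as the first missing seqno.
    return donor_cached_downto <= first_needed


uuid = "5ee99582-bb8d-11e2-b8e3-23de375c1d30"  # illustrative value
print(can_use_ist(uuid, 100, uuid, 50, 200))   # gap 101..200 is still cached
print(can_use_ist(uuid, 100, uuid, 150, 200))  # 101..149 already aged out
```

The second call is the situation that an undersized gcache.size produces: the donor has already overwritten part of the range the joiner needs.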

Flow control is the pressure-relief valve that keeps a slow node from letting the group diverge. When any node’s receive queue exceeds gcs.fc_limit, it broadcasts an FC_PAUSE and replication of new write-sets pauses cluster-wide until the backlog drains below gcs.fc_limit × gcs.fc_factor. A group that pauses frequently is telling you its slowest applier cannot keep up. The fix is more wsrep_slave_threads, faster storage, or a lower write rate, not a limit raised to hide the backlog.
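The pause and resume points follow directly from the two settings. A small sketch of that arithmetic, assuming the static formula given above and ignoring any dynamic scaling the provider may apply; the function name is hypothetical.

```python
def flow_control_thresholds(fc_limit: int, fc_factor: float) -> tuple:
    """Return (pause_above, resume_below) receive-queue depths."""
    if fc_limit <= 0 or not 0.0 <= fc_factor <= 1.0:
        raise ValueError("fc_limit must be positive and fc_factor within [0, 1]")
    # Pause when the queue exceeds fc_limit; resume once it drains below
    # fc_limit * fc_factor.
    return fc_limit, fc_limit * fc_factor


pause_above, resume_below = flow_control_thresholds(256, 0.8)
print(f"pause when queue > {pause_above}, resume when queue < {resume_below:.1f}")
```

With gcs.fc_limit=256 and gcs.fc_factor=0.8, replication pauses above 256 queued write-sets and resumes below roughly 205. The gap between the two numbers is what keeps the group from flapping between paused and running.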

Configuration Reference

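Galera parameters divide cleanly into four operational domains. Aligning them by domain, rather than editing individual keys ad hoc, is what prevents the drift that causes certification failures and SST storms. The definitive per-parameter reference is the wsrep.cnf Configuration Deep Dive; the matrix below is the operational orientation. The per-domain snippets that follow each set wsrep_provider_options for readability, but it is a single directive: when the same option appears more than once, only the last occurrence takes effect, so combine the fragments into one semicolon-separated value in the real file.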

Domain	Key directives	What it governs	Common mistake
Identity & topology	`wsrep_cluster_name`, `wsrep_cluster_address`, `wsrep_node_name`, `wsrep_node_address`	Consensus namespace and addressing	Missing `gcomm://` prefix; non-deterministic node names
State transfer	`wsrep_sst_method`, `wsrep_sst_auth`, `gcache.size`, `ist.recv_addr`	How joiners synchronize	GCache too small → SST fallback
Flow control & certification	`wsrep_slave_threads`, `gcs.fc_limit`, `gcs.fc_factor`, `wsrep_certify_nonPK`	Backpressure and apply parallelism	`fc_limit` raised instead of fixing slow applier
Network hardening	`evs.suspect_timeout`, `evs.inactive_timeout`, `evs.send_window`, TLS `socket.ssl_*`	Membership stability across latency	Timeouts too aggressive for cross-AZ links

Identity and topology

[mysqld]
wsrep_cluster_name=prod-galera-primary
wsrep_cluster_address=gcomm://10.0.1.10,10.0.1.11,10.0.1.12
wsrep_node_name=db-node-01
wsrep_node_address=10.0.1.10

wsrep_cluster_address must carry the gcomm:// scheme and should list all stable members so a restarting node can find a running quorum. Keep wsrep_node_name mapped one-to-one to your infrastructure inventory (Ansible hostname, cloud instance ID) so log lines and status output are traceable back to a host.
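Both rules in the paragraph above (the gcomm:// scheme and a complete member list) are easy to check mechanically. A minimal sketch, assuming IPv4 addresses or hostnames with an optional :port suffix and no URL options; `validate_cluster_address` is a hypothetical helper, not a MariaDB or Galera API.

```python
def validate_cluster_address(address: str, expected_members: list) -> list:
    """Return a list of problems found in a wsrep_cluster_address value."""
    scheme = "gcomm://"
    if not address.startswith(scheme):
        return [f"missing {scheme} scheme"]
    listed = [m.strip() for m in address[len(scheme):].split(",") if m.strip()]
    if not listed:
        # An empty list is only meaningful for a one-off bootstrap.
        return ["empty member list"]
    hosts = {member.split(":")[0] for member in listed}  # drop optional :port
    return [f"{member} is not listed"
            for member in expected_members if member not in hosts]


inventory = ["10.0.1.10", "10.0.1.11", "10.0.1.12"]
print(validate_cluster_address("gcomm://10.0.1.10,10.0.1.11,10.0.1.12", inventory))
print(validate_cluster_address("10.0.1.10,10.0.1.11,10.0.1.12", inventory))
```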

State transfer and GCache

[mysqld]
wsrep_sst_method=mariabackup
wsrep_sst_auth=sstuser:REPLACE_WITH_VAULT_SECRET
wsrep_provider_options="gcache.size=4G; gcache.page_size=256M; ist.recv_addr=10.0.1.10:4568"

mariabackup is the recommended method on 10.6+ (the server's built-in default for wsrep_sst_method is rsync, so set it explicitly): it streams a hot physical copy without holding a global lock for the duration of the transfer, so the donor keeps serving reads. Size gcache.size to exceed the write volume produced during your longest planned node outage. For example, if a node is down for a 30-minute patch and the workload writes 3 GB in that window, a 4 GB GCache keeps the rejoin on the fast IST path.
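The sizing rule above is plain arithmetic once a write rate is known. The sketch below assumes the rate is estimated by sampling the growth of wsrep_replicated_bytes plus wsrep_received_bytes over an interval, which is a common approach rather than something this guide prescribes. The function name and the 25% headroom factor are illustrative assumptions.

```python
import math

GIB = 1024 ** 3


def required_gcache_bytes(bytes_written: int, sample_seconds: float,
                          outage_seconds: float, headroom: float = 1.25) -> int:
    """Estimate the GCache size needed to cover an outage of outage_seconds.

    bytes_written is the write-set volume observed over sample_seconds.
    headroom pads the estimate for bursts (1.25 is an arbitrary example).
    """
    if sample_seconds <= 0 or outage_seconds < 0:
        raise ValueError("sample_seconds must be > 0 and outage_seconds >= 0")
    return math.ceil(bytes_written * outage_seconds * headroom / sample_seconds)


# The example from the text: 3 GB written during a 30-minute window.
needed = required_gcache_bytes(3 * GIB, 30 * 60, 30 * 60)
print(f"{needed / GIB:.2f} GiB needed; fits in 4 GiB: {needed <= 4 * GIB}")
```

Measure the rate at peak load, not at the daily average, since the ring buffer has to survive the busiest window in which a node might be down.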

Flow control and certification

[mysqld]
wsrep_slave_threads=8
wsrep_certify_nonPK=ON
wsrep_provider_options="gcs.fc_limit=256; gcs.fc_factor=0.8"

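Match wsrep_slave_threads to available cores for parallel apply, and use wsrep_cert_deps_distance as the ceiling on useful parallelism. The older rule of staying below innodb_thread_concurrency does not apply to 10.6 and later, where that variable no longer exists. The default gcs.fc_limit of 16 is far too low for OLTP; 128–512 is typical as a one-time baseline, not something to keep raising. Keep wsrep_certify_nonPK=ON so tables without an explicit primary key still get a synthetic one for certification; without it, PK-less writes may not certify predictably. Explicit primary keys on every table remain the safer design.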

Network hardening

[mysqld]
wsrep_provider_options="evs.suspect_timeout=PT5S; evs.inactive_timeout=PT15S; evs.send_window=1024; evs.user_send_window=512; socket.ssl=YES; socket.ssl_cert=/etc/mysql/certs/node-cert.pem; socket.ssl_key=/etc/mysql/certs/node-key.pem; socket.ssl_ca=/etc/mysql/certs/galera-ca.pem"

Galera's default EVS timeouts (evs.suspect_timeout=PT5S and evs.inactive_timeout=PT15S, the same values shown above) assume a low-latency LAN; on cross-AZ or cloud networks they can cause false-positive evictions. Raise them deliberately after measuring real inter-node latency and jitter. Encrypting the replication channel is mandatory on any shared network; the end-to-end certificate workflow is documented in Network Security & Firewall Rules for Galera.
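Since wsrep_provider_options is one semicolon-separated string, the per-domain fragments in this section have to be combined before they go into a single file. A minimal sketch of that merge, assuming each fragment is a well-formed `key=value; key=value` string; `merge_provider_options` is a hypothetical helper for a templating step, not a Galera function.

```python
def merge_provider_options(*fragments: str) -> str:
    """Merge provider-option fragments into one string; later keys win."""
    merged = {}
    for fragment in fragments:
        for item in fragment.split(";"):
            item = item.strip()
            if not item:
                continue  # tolerate trailing semicolons
            key, sep, value = item.partition("=")
            if not sep:
                raise ValueError(f"malformed provider option: {item!r}")
            merged[key.strip()] = value.strip()
    return "; ".join(f"{key}={value}" for key, value in merged.items())


state_transfer = "gcache.size=4G; gcache.page_size=256M"
flow_control = "gcs.fc_limit=256; gcs.fc_factor=0.8"
options = merge_provider_options(state_transfer, flow_control)
print(f'wsrep_provider_options="{options}"')
```

Keeping the fragments separate in version control and merging them at render time preserves the per-domain layout while still producing the single directive the server reads.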

Bootstrap Sequencing

Exactly one node initializes a new primary component; every other node then joins it. Getting this order wrong is the classic cause of a split cluster.

Prepare all nodes. Install identical package versions, distribute the aligned wsrep.cnf, and open ports 4567 (replication), 4568 (IST), and 4444 (SST) between members.
Bootstrap the first node with the systemd-aware wrapper, which starts the service with --wsrep-new-cluster for this one invocation only:
```
galera_new_cluster
systemctl status mariadb --no-pager
```

Confirm the primary component formed before touching any other node:
```
mariadb -N -e "SHOW STATUS LIKE 'wsrep_cluster_status'"   # Primary
mariadb -N -e "SHOW STATUS LIKE 'wsrep_local_state_comment'"  # Synced
```

Start remaining nodes normally. They discover the running member, request a transfer (IST if GCache permits, else SST), and reach SYNCED:
```
systemctl start mariadb
```

Full primary-component election, the safe_to_bootstrap flag, and platform-specific packaging are covered in Bootstrapping Your First Galera Cluster. Never run galera_new_cluster on more than one node, and never on a node that should be joining an existing cluster — doing so creates a second, isolated primary component.
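A script can refuse to bootstrap a node whose grastate.dat does not mark it as safe. The sketch below is illustrative only: it assumes the usual `key: value` layout of grastate.dat, the sample content is made up, and `parse_grastate` and `may_bootstrap` are hypothetical names.

```python
def parse_grastate(text: str) -> dict:
    """Parse the key: value lines of a grastate.dat file into a dict."""
    state = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment header
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def may_bootstrap(text: str) -> bool:
    """True only when the file explicitly says safe_to_bootstrap: 1."""
    return parse_grastate(text).get("safe_to_bootstrap") == "1"


sample = """# GALERA saved state
version: 2.1
uuid:    5ee99582-bb8d-11e2-b8e3-23de375c1d30
seqno:   -1
safe_to_bootstrap: 0
"""
state = parse_grastate(sample)
print(f"seqno={state['seqno']} may_bootstrap={may_bootstrap(sample)}")
```

Wrapping galera_new_cluster behind a guard like this makes the "never bootstrap a joiner" rule something the tooling enforces rather than something an operator has to remember.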

Automation Patterns

Treat Galera state as a programmable surface. The provider exposes dozens of status variables via SHOW GLOBAL STATUS LIKE 'wsrep_%', and every lifecycle action has an idempotent scriptable form. Two patterns cover most operational automation: a readiness probe and a config-drift guard.

A production readiness probe must distinguish “the server answers” from “the node is a synced member of a primary component”. This Python 3.9+ probe using mysql-connector-python reads the three wsrep status variables that decide write-readiness and treats lock-wait and deadlock errors (1205, 1213) as transient rather than fatal:

import sys
import mysql.connector
from mysql.connector import errorcode

def galera_health(host: str = "127.0.0.1") -> int:
    try:
        conn = mysql.connector.connect(
            host=host, user="monitor", password="REPLACE_WITH_VAULT_SECRET",
            database="information_schema", connection_timeout=3,
        )
    except mysql.connector.Error as err:
        print(f"CRITICAL: cannot connect to {host}: {err}")
        return 2

    wanted = ("wsrep_cluster_status", "wsrep_local_state_comment", "wsrep_ready")
    state = {}
    try:
        cur = conn.cursor()
        cur.execute(
            "SELECT variable_name, variable_value FROM GLOBAL_STATUS "
            "WHERE variable_name IN (%s, %s, %s)", wanted,
        )
        state = {name.lower(): val for name, val in cur.fetchall()}
    except mysql.connector.Error as err:
        # 1205 lock wait timeout, 1213 deadlock — transient under load, retry
        if err.errno in (errorcode.ER_LOCK_WAIT_TIMEOUT, errorcode.ER_LOCK_DEADLOCK):
            print(f"WARNING: transient wsrep contention (errno {err.errno}); retry")
            return 1
        print(f"CRITICAL: status query failed: {err}")
        return 2
    finally:
        conn.close()

    if (state.get("wsrep_cluster_status") == "Primary"
            and state.get("wsrep_ready") == "ON"
            and state.get("wsrep_local_state_comment") == "Synced"):
        print("OK: node is Synced in a Primary component")
        return 0
    print(f"CRITICAL: node not write-ready: {state}")
    return 2

if __name__ == "__main__":
    sys.exit(galera_health())

Wire this into load-balancer health checks, systemd ExecStartPost gates, and CI smoke tests so a node is never routed traffic before it is genuinely synced. The full telemetry and alerting build-out — Prometheus exporters, threshold runbooks, and multi-node polling — is documented in Automated Node Health Monitoring.

The second pattern is drift detection: render wsrep.cnf from version-controlled templates and fail the pipeline if a live node’s variables diverge from the intended state.

#!/usr/bin/env bash
# Compare live wsrep_ variables against the committed baseline.
set -euo pipefail
BASELINE="/opt/galera/baseline-wsrep.txt"   # sorted key=value pairs from Git
LIVE="$(mktemp)"                            # private temp file, not a fixed /tmp path
trap 'rm -f "$LIVE"' EXIT                   # remove it however the script exits
mariadb -N -e "SHOW GLOBAL VARIABLES LIKE 'wsrep_%'" \
  | tr '\t' '=' | sort > "$LIVE"
if ! diff -u "$BASELINE" "$LIVE"; then
  echo "DRIFT DETECTED: live configuration differs from baseline" >&2
  exit 1
fi
echo "Configuration matches baseline."

Rendering wsrep.cnf per host from inventory and enforcing idempotency is covered in Automating Node Provisioning with Ansible.

Infrastructure and Topology Considerations

Synchronous replication amplifies every hardware and network weakness. A commit is gated by the slowest network round trip and the slowest applier, so provisioning decisions are latency-first, not throughput-first.

Quorum favors an odd count. A Galera cluster stays primary only while more than half of its members remain connected. Three nodes tolerate one failure; five tolerate two. A two-node cluster has no majority after one failure and goes non-primary, so use three nodes or add a lightweight garbd arbitrator as the tie-breaker.
Keep replication latency low. Same-region, low-RTT interconnects are strongly preferred. Stretching a deployment across regions adds at least one inter-region round trip to every commit; if you must span distance, use an async replica downstream rather than a stretched synchronous cluster, as discussed in Fallback Routing & Read-Only Nodes.
Spread across failure domains. Place the three nodes in three availability zones so no single AZ outage removes quorum — but keep them within one region to hold RTT down. The garbd arbitrator belongs in a third AZ when you run an even node count.
Provision for SST bursts. NVMe-backed data directories, vm.swappiness=1, transparent huge pages disabled, and a raised fs.aio-max-nr prevent OOM and I/O stalls during a full SST. Sizing matrices are in Galera Cluster Hardware Requirements.
Reserve headroom for the certification index. Memory must hold the active working set plus Galera’s certification index and GCache. Undersizing here turns a write spike into flow-control thrashing.
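The quorum arithmetic behind the node-count guidance above can be written out directly. This sketch assumes every member carries equal weight and that the failures are simultaneous and unannounced; both function names are hypothetical.

```python
def has_quorum(remaining: int, total: int) -> bool:
    """A component stays primary only with a strict majority of members."""
    return remaining * 2 > total


def tolerated_failures(nodes: int) -> int:
    """Largest number of simultaneous failures that still leaves a majority."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return (nodes - 1) // 2


for n in (2, 3, 4, 5):
    print(f"{n} nodes tolerate {tolerated_failures(n)} failure(s)")
```

Four nodes tolerate no more failures than three, which is why an even count buys nothing unless an arbitrator is added as the tie-breaker.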

Adding and removing nodes safely follows from these constraints: provision a new node with the identical baseline, let it transfer and reach SYNCED, and only then add it to the load balancer; drain and confirm wsrep_local_state_comment=Synced before a controlled shutdown so grastate.dat records a clean seqno. The full join/leave choreography is in Graceful Node Join and Leave Procedures.

Failure Modes and Remediation

The six incidents below account for the large majority of Galera pages. Each maps a symptom to its root cause and the first corrective action.

Symptom	Root cause	Immediate action
`wsrep_cluster_status = non-Primary` on all nodes	Network partition lost quorum; no majority component	Identify the node with the highest committed seqno and set it primary with `SET GLOBAL wsrep_provider_options='pc.bootstrap=YES'`
Joiner falls back to full SST every restart	`gcache.size` too small; write-set range aged out of the donor’s ring buffer	Increase `gcache.size` to exceed peak write volume during a maintenance window; verify with `wsrep_local_cached_downto`
`WSREP: Failed to prepare for SST` / joiner hangs	`wsrep_sst_auth` mismatch or `mariabackup` not installed on donor	Fix SST credentials, install `mariabackup` on all nodes, confirm port 4444 is open
Frequent `wsrep_flow_control_paused` spikes	Slowest applier cannot keep pace; `wsrep_slave_threads` undersized or slow disk	Raise `wsrep_slave_threads`, move to faster storage; do not simply raise `gcs.fc_limit`
Node evicted with `view(NON_PRIM)` on healthy hardware	`evs.suspect_timeout` / `evs.inactive_timeout` too aggressive for the network	Loosen EVS timeouts; check for packet loss and firewall drops on 4567
Node refuses to start: `safe_to_bootstrap: 0`, seqno `-1`	Unclean shutdown left `grastate.dat` uncertain	Run `mariadbd --wsrep-recover` to recover seqno from InnoDB, then bootstrap the highest-seqno node

For total outages, recovery hinges on finding the most-advanced node. mariadbd --wsrep-recover reconstructs the last committed seqno from the InnoDB redo logs even when grastate.dat shows seqno: -1; bootstrap that node first, then let the rest rejoin. They use IST when the donor’s GCache still covers their gap and fall back to SST otherwise. Log-driven diagnosis, such as grepping WSREP lines from journalctl -u mariadb and the error log, is detailed in Handling Galera Startup Errors & Logs.
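Choosing the most-advanced node can be scripted once the recovery output from each node has been collected. The sketch below assumes each node's log contains a line of the form `WSREP: Recovered position: <uuid>:<seqno>`; the node names, log snippets, and function names are illustrative.

```python
import re

RECOVERED = re.compile(r"WSREP: Recovered position:\s*([0-9a-fA-F-]+):(-?\d+)")


def recovered_position(log_text: str):
    """Return (uuid, seqno) from the last recovery line, or None if absent."""
    matches = RECOVERED.findall(log_text)
    if not matches:
        return None
    uuid, seqno = matches[-1]
    return uuid, int(seqno)


def pick_bootstrap_node(logs: dict) -> str:
    """Return the node name with the highest recovered seqno."""
    positions = {}
    for node, text in logs.items():
        position = recovered_position(text)
        if position is None:
            raise ValueError(f"no recovered position found for {node}")
        positions[node] = position
    if len({uuid for uuid, _ in positions.values()}) != 1:
        raise ValueError("nodes disagree on the cluster UUID; investigate first")
    return max(positions, key=lambda node: positions[node][1])


uuid = "5ee99582-bb8d-11e2-b8e3-23de375c1d30"  # made-up sample value
sample_logs = {
    "db-node-01": f"[Note] WSREP: Recovered position: {uuid}:1352215",
    "db-node-02": f"[Note] WSREP: Recovered position: {uuid}:1352230",
    "db-node-03": f"[Note] WSREP: Recovered position: {uuid}:1352198",
}
print(pick_bootstrap_node(sample_logs))
```

The UUID check matters: comparing seqnos across nodes with different state UUIDs is meaningless, so the sketch stops rather than guessing.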

Monitoring and Telemetry

Reactive log-reading does not scale; scrape the wsrep status variables continuously and alert on leading indicators before writes stall. These are the variables worth a dashboard row and an alert threshold:

Variable	Meaning	Alert threshold
`wsrep_cluster_status`	Primary vs non-Primary	Page immediately if not `Primary`
`wsrep_cluster_size`	Active member count	Alert if below the expected node count
`wsrep_local_state_comment`	Node state (`Synced`, `Donor/Desynced`, `Joining`)	Warn if not `Synced` for > 60s outside maintenance
`wsrep_flow_control_paused`	Fraction of time paused by flow control (0–1)	Warn > 0.1, page > 0.3
`wsrep_local_recv_queue_avg`	Mean inbound apply-queue depth	Warn if rising trend > 1.0
`wsrep_cert_deps_distance`	Available apply parallelism	Use to right-size `wsrep_slave_threads`
`wsrep_local_cert_failures`	Certification conflicts (deadlock aborts)	Investigate rate spikes

Expose these through a Prometheus exporter or an OpenTelemetry collector that runs the readiness probe above and emits gauges per variable. Build threshold-based runbooks so an alert links straight to the remediation table — a rising wsrep_flow_control_paused should route the on-call to the slow-applier row, not a blank dashboard. The Python polling and exporter patterns are built out in Monitoring Galera Cluster State with Python.
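The instantaneous thresholds from the table above translate into a small evaluation function that an exporter or runbook hook could call. This is a sketch under stated assumptions: the status values arrive as strings in a dict, time-based rules (such as "not Synced for more than 60s") are left out, and the function names and severity labels are hypothetical.

```python
def flow_control_severity(paused_fraction: float) -> str:
    """Map wsrep_flow_control_paused to the warn/page thresholds in the table."""
    if paused_fraction > 0.3:
        return "page"
    if paused_fraction > 0.1:
        return "warn"
    return "ok"


def evaluate(status: dict, expected_size: int) -> dict:
    """Return {variable: severity} for every threshold currently breached."""
    alerts = {}
    if status.get("wsrep_cluster_status") != "Primary":
        alerts["wsrep_cluster_status"] = "page"
    if int(status.get("wsrep_cluster_size", 0)) < expected_size:
        alerts["wsrep_cluster_size"] = "alert"
    severity = flow_control_severity(float(status.get("wsrep_flow_control_paused", 0)))
    if severity != "ok":
        alerts["wsrep_flow_control_paused"] = severity
    return alerts


degraded = {
    "wsrep_cluster_status": "non-Primary",
    "wsrep_cluster_size": "1",
    "wsrep_flow_control_paused": "0.35",
}
print(evaluate(degraded, expected_size=3))
```

Returning the variable name alongside the severity lets the alert link to the matching row of the remediation table.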

Operational discipline, deterministic configuration, and automated state validation are what turn Galera’s zero-RPO promise into a dependable production reality. Align infrastructure, wsrep.cnf baselines, lifecycle automation, and telemetry with the certification semantics described here, and the group stays primary through the failures that would otherwise page you at 3 a.m.

Related

wsrep.cnf Configuration Deep Dive — the full annotated parameter reference behind the domain matrix above
Bootstrapping Your First Galera Cluster — primary-component election and the safe_to_bootstrap flag
Graceful Node Join and Leave Procedures — controlled SST/IST sequencing and clean shutdown
Initial Data Synchronization Methods — mariabackup vs rsync and GCache-driven IST/SST selection
Automated Node Health Monitoring — telemetry pipelines and threshold runbooks for the variables above
Galera Cluster Hardware Requirements — sizing, kernel tuning, and SST-burst capacity planning
MariaDB Galera Core Architecture & Fundamentals — the replication and certification theory this operational guide applies
