Graceful Node Join and Leave Procedures for MariaDB Galera

This procedure builds on the node lifecycle model described in Galera Cluster Setup & Node Management, and turns it into a repeatable runbook for taking a node out of a running cluster and putting it back without disrupting writes. Deterministic node lifecycle management is the operational backbone of any production MariaDB Galera deployment. Unlike primary-replica architectures where topology changes are largely unidirectional, multi-master synchronous replication requires strict state coordination during both ingress and egress events. Platform teams and database administrators must treat join and leave operations as transactional workflows rather than simple service restarts: improper sequencing triggers a full State Snapshot Transfer (SST), exhausts donor node I/O, or fractures cluster quorum. This guide details production-grade procedures for gracefully managing node state transitions, the parameters that govern them, validated automation patterns, and the exact log lines to check when a transition stalls.

Concept: What “Graceful” Means in a wsrep State Machine

A Galera node is always in exactly one wsrep state, and every join or leave is really a walk through a defined state machine. The value reported by wsrep_local_state_comment is the ground truth for where a node sits in that walk. A graceful transition is one where the node moves through the expected states in order — never skipping a step, never being killed mid-transfer — so the rest of the membership never has to guess whether its data is authoritative.

A leave is graceful when the node reaches Synced, flushes its commit path, broadcasts a NODE_LEAVE message over the Group Communication System, and shuts down with a clean grastate.dat (a seqno other than -1). A join is graceful when the node re-enters the group, negotiates the cheapest possible state transfer, using Incremental State Transfer (IST) where possible and SST only when it must, and climbs to Synced before any traffic is routed to it. The mechanics of synchronous certification that make this ordering matter are covered in how Galera synchronous replication works.

The distinction between graceful and abrupt is not academic. An abrupt exit (kill -9, power loss, an OOM kill, or a forced container stop) leaves seqno at -1 in grastate.dat. Unless the position can be recovered from InnoDB at the next start (for example with mysqld --wsrep-recover), the node falls back to a full SST, and if it was the last writable member the group can be left non-primary. Everything below exists to keep both sides of that state machine deterministic.
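As a rough illustration of the clean-versus-abrupt distinction, the sketch below parses the text of a grastate.dat file and reports whether the saved seqno is a real position. The function names and the sample UUID are hypothetical; only the `key: value` layout of the file is taken from Galera itself.

```python
from typing import Dict


def parse_grastate(text: str) -> Dict[str, str]:
    """Parse the 'key: value' lines of grastate.dat, skipping comment lines."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def left_cleanly(text: str) -> bool:
    """True when the saved seqno is a real position rather than -1."""
    try:
        return int(parse_grastate(text).get("seqno", "-1")) >= 0
    except ValueError:
        return False


# Hypothetical file contents; the UUID is a placeholder, not a real cluster ID.
SAMPLE = """# GALERA saved state
version: 2.1
uuid:    6f1c2a9e-0000-0000-0000-000000000000
seqno:   48213
safe_to_bootstrap: 0
"""

if __name__ == "__main__":
    print(left_cleanly(SAMPLE))
```

In practice the text would come from reading /var/lib/mysql/grastate.dat after the service has stopped; while the node is running the file normally shows -1.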

Prerequisites & Environment Requirements

Before scheduling any node lifecycle event, confirm the environment is ready. These requirements apply to MariaDB 10.6 through 11.8 with Galera 4.

Uniform configuration. Every node’s wsrep.cnf must agree on wsrep_cluster_name, wsrep_cluster_address, and provider options except for the node-specific wsrep_node_address / wsrep_node_name. Drift here is the most common cause of a rejoining node forming its own partition. The full parameter matrix lives in the wsrep.cnf configuration deep dive.
Open transfer ports. TCP 4567 (replication), 4568 (IST), and 4444 (SST) must be reachable between every pair of nodes. If a rejoin hangs before any data moves, a dropped port is the first suspect — verify against the network security and firewall rules for Galera.
Synchronized SST credentials. The wsrep_sst_auth user must exist with identical credentials on donor and joiner; an SST that authenticates as the wrong user fails silently after the port handshake.
A connection router in front of the node. ProxySQL or HAProxy must be able to drain the target node so applications never send writes to a member that is leaving or still Joining.
An odd, primary membership. Run these procedures against a Galera cluster that is currently Primary with an odd node count (3, 5, 7). Taking a node out of an already-degraded cluster is a recovery scenario, not a graceful one.

Validate the baseline that governs transfer behavior across all members before you touch a single node:

[mysqld]
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_cluster_address="gcomm://10.0.1.10,10.0.1.11,10.0.1.12"
wsrep_node_address=10.0.1.12
wsrep_node_name=db-node-03
wsrep_sst_method=mariabackup
wsrep_sst_auth="sstuser:SecurePassphrase!"
wsrep_provider_options="gcache.size=8G; gcache.page_size=256M; evs.keepalive_period=PT1S; evs.inactive_timeout=PT15S; pc.wait_prim=TRUE; pc.wait_prim_timeout=PT30S"

The gcache.size value directly controls IST viability: if a node leaves and rejoins within the cached write-set window, Galera bypasses full SST and replays only the delta, cutting rejoin time from minutes to seconds. Size it to exceed the write volume you expect to accumulate during your longest planned maintenance window.
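One way to turn that sizing rule into a number is to sample the node's replication byte counters twice and extrapolate over the maintenance window. The sketch below assumes you have summed wsrep_replicated_bytes and wsrep_received_bytes at two points in time; the function name and the 1.5x headroom default are illustrative assumptions, not a Galera recommendation.

```python
import math


def estimate_gcache_bytes(bytes_at_start: int, bytes_at_end: int,
                          sample_seconds: float, window_seconds: float,
                          headroom: float = 1.5) -> int:
    """Estimate the gcache size needed to cover a maintenance window.

    bytes_at_start / bytes_at_end: wsrep_replicated_bytes + wsrep_received_bytes
    read at the beginning and end of the sampling interval.
    headroom: safety multiplier (an assumption; tune for your workload).
    """
    if sample_seconds <= 0:
        raise ValueError("sample_seconds must be positive")
    write_rate = (bytes_at_end - bytes_at_start) / sample_seconds  # bytes/second
    return math.ceil(write_rate * window_seconds * headroom)


if __name__ == "__main__":
    # 600 MB replicated in one minute, planning for a one-hour window.
    needed = estimate_gcache_bytes(0, 600_000_000, 60, 3600)
    print(f"{needed / 1024 ** 3:.1f} GiB")
```

Sample during peak write load rather than a quiet period, since an average taken off-peak will understate the window the cache has to cover.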

Step-by-Step Procedure

Node lifecycle work splits into two ordered sequences — a graceful leave and a graceful join. Never interleave them across two nodes at once; serialize so only one member is ever transitioning.

The graceful leave sequence

Execute these steps on the node you are removing. Each step exists to protect a specific invariant, noted in the why.

Drain application connections. Redirect traffic at the router (ProxySQL/HAProxy), then wait for SHOW PROCESSLIST to show zero application queries. Why: a write still in flight when the node goes down is rolled back and surfaces to the application as an error it must retry.
Verify synced state. Confirm the node is fully caught up before you touch it:
```
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';
```
The value must be Synced. Stopping the node while it reads Joining, Joined, or Donor/Desynced interrupts a state transfer in progress and can leave grastate.dat without a usable seqno. If it is stuck, resolve it first using fixing wsrep_local_state_comment issues.
Quiesce writes. Hold a brief read lock to block new writes and flush table-level caches:
```
FLUSH TABLES WITH READ LOCK;
-- hold ~2-3 seconds
UNLOCK TABLES;
```
Why: this quiets the commit path so the clean shutdown has nothing racing it. It does not flush InnoDB dirty pages — the shutdown in the next step does that.
Stop the service gracefully. Let systemd send SIGTERM so the provider can announce departure:
```
systemctl stop mariadb
```
Why: SIGTERM gives the wsrep provider time to broadcast NODE_LEAVE over the Group Communication System and write a clean seqno to grastate.dat. A SIGKILL skips both, which is what forces the expensive rejoin later.

Validate the exit. Confirm the group saw a clean departure:

grep -E "left the group|Shifting SYNCED" /var/log/mysql/error.log | tail
cat /var/lib/mysql/grastate.dat   # seqno must be a real number, not -1

You should see WSREP: Member 2.0 (db-node-03) left the group and a non-negative seqno.
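The drain check in the first step of the leave sequence can be automated by filtering SHOW PROCESSLIST rows. The sketch below is one possible filter, assuming rows arrive as dictionaries keyed by the processlist column names (as a PyMySQL DictCursor would return them). The ignored-user set is an assumption: "monitor" mirrors the account used by the orchestrator on this page, and you would adjust it for your own service accounts.

```python
from typing import Dict, Iterable, List, Optional, Set

# Accounts that are not application traffic (assumed names; adjust as needed).
IGNORED_USERS: Set[str] = {"system user", "event_scheduler", "monitor"}


def active_app_queries(rows: Iterable[Dict[str, Optional[str]]],
                       ignored_users: Set[str] = IGNORED_USERS) -> List[dict]:
    """Return processlist rows that represent application work still running."""
    active = []
    for row in rows:
        if row.get("User") in ignored_users:
            continue  # internal threads and the monitoring session itself
        if row.get("Command") == "Sleep":
            continue  # idle pooled connection, nothing in flight
        active.append(row)
    return active


if __name__ == "__main__":
    sample = [
        {"User": "system user", "Command": "Sleep", "Info": None},
        {"User": "app", "Command": "Sleep", "Info": None},
        {"User": "app", "Command": "Query", "Info": "UPDATE t SET x = 1"},
        {"User": "monitor", "Command": "Query", "Info": "SHOW PROCESSLIST"},
    ]
    for row in active_app_queries(sample):
        print(row["User"], row["Info"])
```

A drain loop would poll this until the returned list is empty, then move on to the Synced check.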

The graceful join sequence

Rejoining requires precise state negotiation. Galera weighs three inputs before it moves data: the joiner’s last committed seqno, the donor’s remaining gcache retention, and network reachability.

Pre-join validation. Confirm wsrep_cluster_address still matches the live membership exactly. Why: a mismatched group address makes the node form a partitioned, standalone cluster instead of joining — the classic accidental split.
Start the service. Bring MariaDB up as a normal join (never galera_new_cluster, which would bootstrap a competing component — see bootstrapping your first Galera cluster for why that distinction matters):
```
systemctl start mariadb
```
Watch IST vs SST selection. Galera attempts IST first. If the joiner’s missing seqno range is still inside the donor’s gcache, only the delta replays. If it has fallen out, Galera falls back to SST via mariabackup, which streams the full dataset. For large datasets this choice dominates rejoin time — tune it with choosing the right SST method for large datasets.
Monitor the state transition. The joiner cycles Joining → Joined → Synced. If it stalls at Joining, or the donor stays at Donor/Desynced, suspect an MTU mismatch or a firewall drop on 4568/4444 and follow troubleshooting node desync during join.
Verify and re-add to the pool. Only once the node is Synced and quorum is intact should the router send traffic back to it (see the verification section below).
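The IST-versus-SST choice described in this sequence can be approximated before you start the node. The sketch below is a simplified model of that decision, not Galera's actual donor logic: it assumes you have read the joiner's uuid and seqno from grastate.dat, and the donor's wsrep_cluster_state_uuid and wsrep_local_cached_downto status values. The function name is hypothetical.

```python
def ist_possible(joiner_uuid: str, joiner_seqno: int,
                 cluster_uuid: str, donor_cached_downto: int) -> bool:
    """Rough pre-check: can the donor's gcache cover the joiner's gap?

    joiner_uuid / joiner_seqno: values from the joiner's grastate.dat.
    cluster_uuid: wsrep_cluster_state_uuid on the donor.
    donor_cached_downto: wsrep_local_cached_downto on the donor, the lowest
    seqno still held in its gcache.
    """
    if joiner_uuid != cluster_uuid:
        return False  # different history; only a full SST can reconcile it
    if joiner_seqno < 0:
        return False  # no usable position recorded (seqno: -1)
    # The joiner needs every write-set from joiner_seqno + 1 onward.
    return donor_cached_downto <= joiner_seqno + 1


if __name__ == "__main__":
    print(ist_possible("abc", 48213, "abc", 40000))
```

Treat a True result as "IST is plausible" rather than a guarantee, since the donor keeps accepting writes and purging its cache between your check and the actual join.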

Parameter Deep-Dive

These are the knobs that most directly change how a leave and a join behave. Production-tuned values assume a 3-node cluster on NVMe-backed storage with a moderate OLTP write rate.

Parameter	Recommended value	Why it matters for lifecycle events
`gcache.size`	`8G` (≥ peak write volume during a maintenance window)	Determines whether a rejoin qualifies for cheap IST or falls back to full SST. Undersized gcache is the #1 cause of surprise SSTs.
`wsrep_sst_method`	`mariabackup`	Physical, near-lock-free streaming SST. Keeps the donor writable during transfer, unlike `rsync` which holds a global read lock.
`evs.inactive_timeout`	`PT15S`	How long the group waits before evicting a silent member. Too low and a brief network blip during a leave triggers a false eviction.
`pc.wait_prim`	`TRUE`	Forces a starting node to wait for a Primary Component instead of racing ahead and forming its own partition.
`pc.wait_prim_timeout`	`PT30S`	How long a starting node waits for a Primary Component when `pc.wait_prim` is on; gives a slow node time to find the group before giving up.
`wsrep_sst_donor`	`db-node-01,db-node-02` (ordered preference)	Pins which member serves an SST so a joiner never desyncs your busiest node.

Group all of these under wsrep_provider_options except wsrep_sst_method, wsrep_sst_auth, and wsrep_sst_donor, which are top-level [mysqld] directives. For the low-latency EVS and flow-control tuning that pairs with these, see configuring wsrep_provider_options for low latency.
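Because these values live in a single semicolon-separated string, drift between nodes is easy to miss. The sketch below shows one way to parse a wsrep_provider_options string and compare it against an expected baseline; the function names are hypothetical, and the baseline contains only values taken from the table above.

```python
from typing import Dict, Optional, Tuple


def parse_provider_options(raw: str) -> Dict[str, str]:
    """Split 'key=value; key=value' into a dict, ignoring empty fragments."""
    options: Dict[str, str] = {}
    for item in raw.split(";"):
        if "=" not in item:
            continue
        key, _, value = item.partition("=")
        options[key.strip()] = value.strip()
    return options


def option_drift(actual: Dict[str, str],
                 expected: Dict[str, str]) -> Dict[str, Tuple[Optional[str], str]]:
    """Return {option: (actual_value, expected_value)} for every mismatch."""
    return {key: (actual.get(key), want)
            for key, want in expected.items()
            if actual.get(key) != want}


BASELINE = {"gcache.size": "8G", "evs.inactive_timeout": "PT15S",
            "pc.wait_prim": "TRUE"}

if __name__ == "__main__":
    node = parse_provider_options(
        "gcache.size = 2G; evs.inactive_timeout = PT15S; pc.wait_prim = TRUE;")
    print(option_drift(node, BASELINE))
```

The string to parse can be read from each node with SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options'; note that the live value lists every provider option, with sizes often expanded to a different unit than the one you configured, so compare only the keys you manage and normalize units first.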

Verification & Health Checks

After a join, confirm the node is genuinely integrated — not merely running — before routing traffic:

-- Must read: Synced
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';
-- Must read: Primary
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';
-- Must equal your expected node count (e.g. 3)
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';
-- Must read: ON
SHOW GLOBAL STATUS LIKE 'wsrep_ready';
-- Fraction of time paused by flow control (0.0 to 1.0); should stay near 0, sustained higher values mean the node is throttling the cluster
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused';
-- Should drain toward 0 shortly after Synced
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';

Every node in the group must report the same wsrep_cluster_conf_id; a divergent value means a member is looking at a stale view. At the systemd layer, a clean rejoin looks like:

systemctl is-active mariadb          # active
systemctl show -p ExecMainStatus mariadb   # ExecMainStatus=0

Wire these same variables into your alerting so a silent degradation is caught before the next lifecycle event; the thresholds are defined in automated node health monitoring.
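The checks in this section reduce to a small gate function that a router hook or alerting job could call. The sketch below assumes the status values have already been collected into a dictionary of strings, as SHOW GLOBAL STATUS returns them; the function names and the 0.1 flow-control threshold are illustrative assumptions rather than documented limits.

```python
from typing import Dict, List


def join_failures(status: Dict[str, str], expected_size: int,
                  max_paused: float = 0.1) -> List[str]:
    """Return the reasons a node is not yet safe to receive traffic."""
    failures = []
    if status.get("wsrep_local_state_comment") != "Synced":
        failures.append("node is not Synced")
    if status.get("wsrep_cluster_status") != "Primary":
        failures.append("cluster is not Primary")
    if status.get("wsrep_cluster_size") != str(expected_size):
        failures.append("unexpected cluster size")
    if status.get("wsrep_ready") != "ON":
        failures.append("wsrep_ready is not ON")
    if float(status.get("wsrep_flow_control_paused", "0")) > max_paused:
        failures.append("flow control is pausing the cluster")
    return failures


def views_agree(conf_ids: Dict[str, str]) -> bool:
    """True when every node reports the same wsrep_cluster_conf_id."""
    return len(set(conf_ids.values())) == 1


if __name__ == "__main__":
    healthy = {"wsrep_local_state_comment": "Synced",
               "wsrep_cluster_status": "Primary",
               "wsrep_cluster_size": "3",
               "wsrep_ready": "ON",
               "wsrep_flow_control_paused": "0.002"}
    print(join_failures(healthy, expected_size=3))
```

Returning the list of reasons, rather than a bare boolean, makes the same function usable for both gating and alert text.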

Automation Integration

Lifecycle operations should be idempotent, state-aware, and safe to re-run. The orchestrator below polls wsrep_local_state_comment over a real database connection using PyMySQL, with explicit handling for the two server error codes an automation layer hits most, 1213 (deadlock / certification conflict) and 1205 (lock wait timeout), so a transient conflict retries instead of aborting the whole join. It targets Python 3.9+.

import subprocess
import sys
import time
import logging
import pymysql

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
log = logging.getLogger("galera-lifecycle")

DB = dict(unix_socket="/run/mysqld/mysqld.sock", user="monitor",
          password="MonitorPass!", connect_timeout=5)


def wsrep_state() -> str:
    """Return wsrep_local_state_comment, retrying past transient wsrep errors."""
    for _ in range(3):
        try:
            conn = pymysql.connect(**DB)
            try:
                with conn.cursor() as cur:
                    cur.execute("SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'")
                    row = cur.fetchone()
                    return row[1] if row else "Unknown"
            finally:
                conn.close()
        except pymysql.err.OperationalError as exc:
            code = exc.args[0]
            if code in (1213, 1205):  # deadlock / lock wait timeout — retry
                log.warning("Transient wsrep error %s, retrying: %s", code, exc)
                time.sleep(2)
                continue
            raise
    return "Unknown"


def wait_for_synced(timeout_s: int = 600, interval_s: float = 2.0) -> bool:
    """Poll until the node reaches Synced or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            state = wsrep_state()
        except pymysql.err.MySQLError as exc:
            # During SST the socket may be unavailable — treat as "still joining".
            log.info("DB not reachable yet (%s); continuing to poll", exc)
            time.sleep(interval_s)
            continue
        log.info("wsrep_local_state_comment=%s", state)
        if state == "Synced":
            return True
        time.sleep(interval_s)
    return False


def graceful_join() -> None:
    log.info("Starting MariaDB for a normal (non-bootstrap) join")
    result = subprocess.run(["systemctl", "start", "mariadb"],
                            capture_output=True, text=True)
    if result.returncode != 0:
        log.error("systemctl start failed: %s", result.stderr.strip())
        sys.exit(1)

    if wait_for_synced():
        log.info("Node reached Synced — safe to re-add to the router pool")
    else:
        log.error("Node did not reach Synced within timeout; capturing logs")
        subprocess.run("journalctl -u mariadb --since '10 minutes ago' "
                       "> /tmp/galera_join_failure.log", shell=True)
        sys.exit(2)


if __name__ == "__main__":
    graceful_join()

Drive this from Ansible so a rolling maintenance run drains, stops, patches, restarts, and health-gates one node at a time with serial: 1 and a max_fail_percentage: 0, keeping the SST/IST sequence strictly serialized. The Ansible role that renders the matching wsrep.cnf per host is documented in automating node provisioning with Ansible. In a Terraform-managed fleet, gate the provisioner or downstream health check on the same Synced probe before the instance is added back to the target group.
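The one-node-at-a-time rule is the part most worth encoding explicitly, whatever tool drives it. The sketch below expresses the same serialization in plain Python, with the leave, maintenance, and join steps injected as callables; all names are hypothetical, and the node names reuse the examples from this page.

```python
from typing import Callable, Iterable, List


def rolling_maintenance(nodes: Iterable[str],
                        leave: Callable[[str], None],
                        maintain: Callable[[str], None],
                        join_and_wait: Callable[[str], bool]) -> List[str]:
    """Cycle nodes through leave, maintain, join strictly one at a time.

    join_and_wait must return True only once the node reports Synced.
    The run halts on the first node that fails to rejoin, so a second
    member is never taken out of an already degraded cluster.
    """
    completed: List[str] = []
    for node in nodes:
        leave(node)
        maintain(node)
        if not join_and_wait(node):
            raise RuntimeError(
                f"{node} did not reach Synced; halting (completed: {completed})")
        completed.append(node)
    return completed


if __name__ == "__main__":
    done = rolling_maintenance(
        ["db-node-01", "db-node-02", "db-node-03"],
        leave=lambda n: print("leave", n),
        maintain=lambda n: print("patch", n),
        join_and_wait=lambda n: True,
    )
    print(done)
```

Halting on the first failure plays the same role as max_fail_percentage: 0 in the Ansible run described above.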

Troubleshooting

These are the failure signatures specific to join and leave operations, with the exact remediation for each.

Rejoin always triggers a full SST instead of IST. The log shows WSREP: Failed to prepare for incremental state transfer followed by a Requesting state transfer line that names mariabackup. Root cause: the joiner’s missing seqno range has aged out of the donor’s gcache. Fix by increasing gcache.size so it spans your maintenance windows, and keep leaves short. A seqno of -1 in grastate.dat (from an abrupt stop) can also force SST, so always leave gracefully.
Node hangs at Joining and never advances. No SST/IST bytes move. Root cause: TCP 4568 (IST) or 4444 (SST) is blocked, or the donor and joiner have mismatched MTU. Confirm the ports with ss -tlnp | grep -E '4444|4568' on the donor and re-check firewall rules; then follow troubleshooting node desync during join.
Log shows a Reversing history error from gcs/src/gcs_group.cpp on start. Root cause: the node is trying to join with a higher seqno than the current Primary Component, meaning it holds writes the group no longer has, usually after an incorrect bootstrap elsewhere. Do not force it in; recover by wiping this node’s state and letting it SST from the authoritative member.
Cluster goes non-Primary the moment a node leaves. wsrep_cluster_status reads non-Primary and writes are rejected. Root cause: the leave dropped the membership below floor(N/2)+1. In a healthy 3-node cluster that threshold is 2, so a single leave still leaves a quorum; losing Primary on one leave means the group was already degraded beforehand. Restore quorum first; the quorum math and safe maintenance ordering are covered under designing multi-master topologies.
Service won’t stop cleanly (systemctl stop times out). Root cause: a long-running transaction or an in-progress SST as donor blocks shutdown. Check wsrep_local_state_comment — if it reads Donor/Desynced, wait for the SST it is serving to finish rather than killing it. Raw startup and shutdown log decoding is covered in handling Galera startup errors and logs.
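For a first pass over a failed join, the log signatures listed above can be matched mechanically. The sketch below maps the two message fragments quoted in this section to a short hint; the function name and hint wording are hypothetical, and exact log text can differ between Galera versions, so treat the substrings as a starting point.

```python
from typing import Iterable, List, Tuple

# (substring to look for, hint to report); fragments are the ones quoted above.
SIGNATURES: List[Tuple[str, str]] = [
    ("Failed to prepare for incremental state transfer",
     "IST unavailable: check gcache.size and the seqno in grastate.dat"),
    ("Reversing history",
     "Joiner is ahead of the Primary Component: do not force it in"),
]


def diagnose(log_lines: Iterable[str]) -> List[str]:
    """Return one hint per known signature found, in order of first appearance."""
    hints: List[str] = []
    for line in log_lines:
        for needle, hint in SIGNATURES:
            if needle in line and hint not in hints:
                hints.append(hint)
    return hints


if __name__ == "__main__":
    sample = [
        "[Note] WSREP: Shifting CLOSED -> OPEN",
        "[Warning] WSREP: Failed to prepare for incremental state transfer",
    ]
    for hint in diagnose(sample):
        print(hint)
```

Feeding it the output of journalctl -u mariadb, split into lines, gives a quick triage before reading the full log.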

Frequently Asked Questions

Do I have to drain and lock every time, or can I just systemctl stop? A clean systemctl stop alone will announce the leave and preserve seqno, which is enough to avoid a forced SST on the next start. Draining connections and the brief FLUSH TABLES WITH READ LOCK are what protect in-flight application writes from being aborted at the moment of departure. For a truly graceful, zero-impact leave, do all three.

Why did my rejoin trigger a full SST when the node was only down for five minutes? Because rejoin cost is governed by write volume, not wall-clock time. If those five minutes produced more write-sets than fit in the donor’s gcache, the missing range aged out and Galera must fall back to SST. Size gcache.size to exceed the writes accumulated during your longest expected outage, and the same five-minute window will replay as a fast IST instead.

Can I take two nodes out at once to speed up maintenance? No. Serialize every lifecycle event. Concurrent leaves can drop a 3-node cluster below its floor(N/2)+1 quorum threshold and turn it non-Primary, and concurrent joins let two SSTs desync your donors at the same time. Run one node through the full leave-maintain-join cycle, confirm it is Synced, then start the next.
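The floor(N/2)+1 threshold is easy to check by hand, but a worked version makes the 3-node case concrete. The sketch below is a simplified model that assumes every member carries equal weight and that the missing members disappear abruptly at the same moment; the function names are hypothetical.

```python
def quorum_threshold(cluster_size: int) -> int:
    """Members needed for a majority: floor(N/2) + 1."""
    return cluster_size // 2 + 1


def keeps_quorum(cluster_size: int, members_lost: int) -> bool:
    """True if the survivors still form a majority of the original group."""
    return cluster_size - members_lost >= quorum_threshold(cluster_size)


if __name__ == "__main__":
    for size, lost in [(3, 1), (3, 2), (5, 2)]:
        print(size, lost, keeps_quorum(size, lost))
```

For a 3-node cluster the threshold is 2, so one missing member is tolerated and two are not, which is why a second concurrent leave is where the risk starts.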

Related

Galera Cluster Setup & Node Management — the parent guide covering the full node lifecycle end to end
Fixing wsrep_local_state_comment Issues — resolving a node stuck outside the Synced state
Troubleshooting Node Desync During Join — diagnosing stalls at Joining and Donor/Desynced
Bootstrapping Your First Galera Cluster — why a join must never be a bootstrap
Choosing the Right SST Method for Large Datasets — controlling the cost of the SST fallback
Automated Node Health Monitoring — the wsrep telemetry that gates safe transitions
