Why is my Galera node stuck in Joining and never reaching Synced?

The state transfer has not completed. Check that SST port 4444 and group-communication port 4567 are reachable on the donor, that wsrep_sst_method matches on donor and joiner, and that the donor's gcache.size covers the write volume accumulated while the node was offline. If the donor's gcache no longer holds the missing seqno range, IST is impossible and Galera falls back to a full SST; a failing SST leaves the node pinned in Joining.

My node reads Donor/Desynced and will not return to Synced — is it broken?

Usually not. A node that is serving an SST to a joiner is supposed to report Donor/Desynced until that transfer finishes. Do not restart it, because that aborts the joiner as well. Confirm the joiner is still progressing and only intervene if the donor's mariabackup stream has actually errored in the log.

Is it safe to run galera_new_cluster when a node shows Closed?

Only after verifying quorum and that no surviving member holds a higher wsrep_last_committed value. galera_new_cluster forces a new Primary Component; running it on stale data causes irreversible divergence. If other nodes are still Primary, rejoin the Closed node with a normal systemctl start instead of bootstrapping.

Fixing `wsrep_local_state_comment` Issues in MariaDB Galera

This page extends the lifecycle rules laid out in Graceful Node Join and Leave Procedures and answers one focused operational question: a Galera node is reporting a wsrep_local_state_comment value other than Synced — stuck in Joining, pinned at Donor/Desynced, sitting in Initialized, or flapping into Closed — and you need to move it back to a healthy, writable state without corrupting data or breaking quorum. The wsrep_local_state_comment status variable is the human-readable snapshot of where a node sits in the replication state machine, and its value dictates the exact recovery path you must take.

Context: Why This Variable Is the Ground Truth

In a multi-master topology, every node advertises its synchronization posture through two coupled variables. wsrep_local_state is the integer the internal state machine uses (1 Joining, 2 Donor/Desynced, 3 Joined, 4 Synced, per the Galera node-state documentation), and wsrep_local_state_comment is the string projection of that same value. Because connection routers such as ProxySQL and HAProxy are commonly set up to gate traffic on the comment field, an anomalous value is not cosmetic: it is the difference between a node that receives writes and one that silently drops out of the write path.
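
To make the routing consequence concrete, here is a minimal sketch of the decision a router health check might take from the comment field. The function name and the 200/503 convention are illustrative assumptions, not the behavior of any specific ProxySQL or HAProxy integration.

```python
def health_check_status(comment: str, ready: str) -> int:
    """Return an HTTP-style code a router health check could act on.

    Hypothetical rule: only a node that reports Synced and wsrep_ready = ON
    stays in the write path; every other combination is reported as unavailable.
    """
    return 200 if comment == "Synced" and ready == "ON" else 503


if __name__ == "__main__":
    for comment, ready in (("Synced", "ON"), ("Donor/Desynced", "ON"), ("Joining", "OFF")):
        print(comment, ready, health_check_status(comment, ready))
```

Under this rule a donor that is still answering queries is also taken out of rotation, which is a policy choice; some deployments deliberately keep donors readable.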

Anomalous or stagnant values rarely indicate transient network jitter. They almost always signal an underlying State Snapshot Transfer (SST) or Incremental State Transfer (IST) failure, quorum loss, a misaligned provider configuration, or corrupted on-disk state metadata. The certification mechanics that make ordered state transitions matter are covered in how Galera synchronous replication works; this page is strictly about reading the comment field, isolating the root cause, and applying a production-safe fix.

Every wsrep_local_state_comment value maps to a point in the replication state machine. A healthy node ends at Synced (4); the other states (Joining, Donor/Desynced, Initialized, Closed) are the ones this page returns to Synced.

Solution: Diagnose the State Vector, Then Apply the Matching Fix

Step 1 — Capture the full state vector

Never remediate on wsrep_local_state_comment alone. Correlate it with provider-level indicators so you distinguish a genuine state fault from a downstream symptom:

SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';  -- human-readable phase
SHOW GLOBAL STATUS LIKE 'wsrep_local_state';          -- integer state (target: 4)
SHOW GLOBAL STATUS LIKE 'wsrep_ready';                -- ON only when node accepts queries
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';         -- members in this component
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';       -- must read Primary
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused';  -- >0.1 means the apply queue is saturated

A healthy node reports Synced (state 4) with wsrep_ready = ON and wsrep_cluster_status = Primary. Any deviation routes you to one of the recovery paths below. For baseline topology validation before you begin, work from the parent Galera Cluster Setup & Node Management reference.
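
The routing from state vector to recovery path can be sketched as a small pure function. This is an illustration under stated assumptions: the function name, the returned labels, and the catch-all "investigate" branch are hypothetical, while the 0.1 threshold and the state names come from this page.

```python
def triage(vector: dict) -> str:
    """Map a wsrep state vector (status name -> string value) to a recovery path."""
    comment = vector.get("wsrep_local_state_comment", "")
    ready = vector.get("wsrep_ready", "")
    status = vector.get("wsrep_cluster_status", "")
    # Status values arrive as strings such as "0.000000"; treat missing as 0.
    paused = float(vector.get("wsrep_flow_control_paused", 0) or 0)

    if comment in ("Joining", "Donor/Desynced"):
        return "step 2: state transfer"
    if comment in ("Initialized", "Closed"):
        return "step 3: membership or bootstrap"
    if comment == "Synced" and ready == "ON" and status == "Primary":
        return "step 4: flow control" if paused > 0.1 else "healthy"
    # Anything else (for example Synced but non-Primary) needs a human look.
    return "investigate: inconsistent vector"


if __name__ == "__main__":
    print(triage({
        "wsrep_local_state_comment": "Synced",
        "wsrep_ready": "ON",
        "wsrep_cluster_status": "Primary",
        "wsrep_flow_control_paused": "0.350000",
    }))
```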

Step 2 — Node stuck in `Joining` or `Donor/Desynced`

When a node stays in Joining (joiner side) or Donor/Desynced (donor side), the state transfer has failed to complete or was interrupted mid-stream. The usual causes are donor I/O exhaustion, tables locked during mariabackup execution, or a network timeout on port 4444 (SST transport) or 4567 (group communication).

Confirm connectivity and check whether the donor is throttling under flow control:

nc -zv <donor_ip> 4444 && nc -zv <donor_ip> 4567

SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_sent';
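
If nc is not available on the host, the same reachability check can be approximated from Python's standard library. This is a sketch; the function names and the 3-second timeout are assumptions, and the ports are the ones named above. Keep in mind that the SST port is typically only listening while a transfer is being received, so a refused connection on 4444 outside a transfer is not by itself a fault.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_galera_ports(host: str, ports=(4444, 4567)) -> dict:
    """Probe the SST (4444) and group-communication (4567) ports on one host."""
    return {port: port_open(host, port) for port in ports}


if __name__ == "__main__":
    # 127.0.0.1 is a placeholder; substitute the donor's address.
    print(check_galera_ports("127.0.0.1"))
```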

IST is only possible when a donor still retains the joiner’s last applied seqno in its gcache. Ensure the donor’s gcache.size (tuned via wsrep_provider_options) covers the write volume accumulated while the joiner was offline, and that both sides agree on wsrep_sst_method. Otherwise Galera falls back to a full SST. If SST fails repeatedly, abort the join, clear the datadir, and let the node request a fresh transfer — but never delete data without verifying backups and confirming quorum still exists on the remaining members:

systemctl stop mariadb
# Verify quorum exists on remaining nodes (wsrep_cluster_status = Primary) before proceeding
rm -rf /var/lib/mysql/*
systemctl start mariadb

The cost of that SST fallback is controlled by choosing the right SST method for large datasets, and the ordered sequencing lives in Graceful Node Join and Leave Procedures so you do not starve donors.
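
To judge whether gcache.size covers an outage, one common approach is to sample the replication byte counters twice and extrapolate. The sketch below assumes you have read wsrep_replicated_bytes + wsrep_received_bytes at two points in time; the function name and the 1.5x headroom factor are illustrative assumptions, not values taken from this page.

```python
def required_gcache_bytes(bytes_t0: int, bytes_t1: int, interval_s: float,
                          outage_s: float, headroom: float = 1.5) -> int:
    """Estimate the gcache size needed to serve IST after an outage.

    bytes_t0 / bytes_t1: sum of wsrep_replicated_bytes and wsrep_received_bytes
    sampled interval_s seconds apart on the prospective donor.
    outage_s: longest outage you want to recover from via IST.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    write_rate = (bytes_t1 - bytes_t0) / interval_s  # bytes per second
    return int(write_rate * outage_s * headroom)


if __name__ == "__main__":
    # 60 MB replicated in 60 s, sized for a one-hour outage.
    print(required_gcache_bytes(0, 60_000_000, 60, 3600))
```

Sample during peak write load rather than a quiet period, otherwise the estimate will undersize the cache.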

Step 3 — Node stuck in `Initialized` or `Closed`

Initialized means the wsrep provider loaded but the node has not joined the group. Closed typically follows a forced shutdown, split-brain resolution, or an explicit SET GLOBAL wsrep_on=OFF. The common root causes are a malformed wsrep_cluster_address, a lingering wsrep-new-cluster flag on a non-bootstrap node, SELinux/AppArmor blocking socket or provider loading, or a pc.wait_prim timeout after quorum loss.

grep -E 'wsrep_cluster_address|wsrep_provider' /etc/my.cnf.d/server.cnf

SHOW GLOBAL STATUS LIKE 'wsrep_cluster_conf_id';
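
A malformed wsrep_cluster_address can be caught before a restart with a small validator. This is a minimal sketch with a hypothetical function name; it only checks the gcomm:// scheme and splits the member list, and it does not resolve hostnames or validate the option string after "?".

```python
def parse_cluster_address(value: str) -> list:
    """Return the member list from a wsrep_cluster_address value.

    An empty list means the address is a bare gcomm://, which tells the node
    to form a new cluster rather than join an existing one.
    """
    value = value.strip().strip("\"'")
    prefix = "gcomm://"
    if not value.startswith(prefix):
        raise ValueError("wsrep_cluster_address must start with gcomm://")
    hosts_part = value[len(prefix):].split("?", 1)[0]  # drop any ?option=... suffix
    return [host.strip() for host in hosts_part.split(",") if host.strip()]


if __name__ == "__main__":
    print(parse_cluster_address("gcomm://10.0.0.1,10.0.0.2,10.0.0.3"))
    print(parse_cluster_address("gcomm://"))
```

An empty result on a node that is supposed to join an existing cluster is exactly the lingering-bootstrap misconfiguration described above.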

If the node is isolated and genuinely must form a new Primary Component, use the safe bootstrap sequence — but only after confirming no surviving node holds a higher wsrep_last_committed:

systemctl stop mariadb
galera_new_cluster

Forcing a new primary on stale data causes irreversible divergence. If instead the node simply refuses to start and never reaches Initialized, the fault is upstream of state — see handling Galera startup errors and logs.
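
When every node is down, the "highest position wins" check can be sketched by comparing the seqno recorded in each node's grastate.dat. The helper names below are hypothetical, and the sketch assumes you have already copied each file's text to one place. A seqno of -1 means the node did not shut down cleanly, so its real position has to be recovered first (for example with mysqld --wsrep-recover) before any comparison is meaningful.

```python
def parse_grastate(text: str) -> dict:
    """Parse the key: value lines of a grastate.dat file into a dict of strings."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        state[key.strip()] = value.strip()
    return state


def bootstrap_candidate(states: dict):
    """Pick the node with the highest seqno from {node_name: parsed grastate}.

    Returns None if there are no nodes or any node has an unknown position
    (seqno < 0), because comparing against an unknown position is unsafe.
    """
    seqnos = {name: int(s.get("seqno", "-1")) for name, s in states.items()}
    if not seqnos or any(value < 0 for value in seqnos.values()):
        return None
    return max(seqnos, key=seqnos.get)


if __name__ == "__main__":
    sample = "# GALERA saved state\nversion: 2.1\nuuid: 9acf4d34-acdb-11e6-bcc3-d3e36276629f\nseqno: 1234\nsafe_to_bootstrap: 0\n"
    print(parse_grastate(sample))
```

The result is advisory only; a human should still confirm it before running galera_new_cluster.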

Step 4 — `Synced` but degraded by flow control

A node that reads Synced yet shows wsrep_flow_control_paused > 0.1 is not in a state fault — its apply queue is saturated and it will eventually be pushed into Desynced. Widen the apply path and relax flow control:

Raise wsrep_slave_threads to match CPU cores without overcommitting disk I/O.
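Tune gcs.fc_limit and gcs.fc_factor in wsrep_provider_options, for example gcs.fc_limit=16 and gcs.fc_factor=0.8. Note that 16 is also the documented default for gcs.fc_limit, so relaxing flow control means raising it above that baseline.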
Confirm disk latency with iostat -x 1 and decide whether innodb_flush_log_at_trx_commit=2 is acceptable for your RPO/RTO.
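
Because wsrep_provider_options is a single semicolon-separated string, changing one flow-control option without clobbering the others takes a little care. The helper below is a sketch with a hypothetical name; it merges overrides into an existing option string and leaves it to you to place the result in the server configuration.

```python
def merge_provider_options(current: str, overrides: dict) -> str:
    """Merge overrides into a 'key=value; key=value' provider option string."""
    options = {}
    for item in current.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            options[key.strip()] = value.strip()
    options.update(overrides)  # existing keys keep their position, new keys append
    return "; ".join(f"{key}={value}" for key, value in options.items())


if __name__ == "__main__":
    print(merge_provider_options(
        "gcache.size=128M; gcs.fc_factor=1.0",
        {"gcs.fc_limit": "16", "gcs.fc_factor": "0.8"},
    ))
```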

Step 5 — Automate the triage safely

Manual triage does not scale. This Python 3.9+ probe reads the state vector and only acts when preconditions are met. It targets mysql-connector-python and singles out the deadlock and lock-wait-timeout error codes (1213, 1205). These are standard server errors rather than Galera-specific ones, but they can surface when querying a node under flow-control pressure:

import subprocess
import logging
import mysql.connector
from mysql.connector import errorcode

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")

STATE_VARS = ("wsrep_local_state_comment", "wsrep_ready", "wsrep_cluster_status")


def check_galera_state(host, user, password):
    """Return a dict of the wsrep state vector, or None if unreachable."""
    try:
        conn = mysql.connector.connect(
            host=host, user=user, password=password, database="information_schema"
        )
    except mysql.connector.Error as exc:
        logging.error("Database connection failed: %s", exc)
        return None
    try:
        cursor = conn.cursor(dictionary=True)
        cursor.execute(
            "SELECT VARIABLE_NAME, VARIABLE_VALUE FROM information_schema.GLOBAL_STATUS "
            "WHERE VARIABLE_NAME IN (%s, %s, %s)",
            STATE_VARS,
        )
        return {r["VARIABLE_NAME"].lower(): r["VARIABLE_VALUE"] for r in cursor.fetchall()}
    except mysql.connector.Error as exc:
        # 1213 = deadlock, 1205 = lock wait timeout — expected under heavy flow control
        if exc.errno in (errorcode.ER_LOCK_DEADLOCK, errorcode.ER_LOCK_WAIT_TIMEOUT):
            logging.warning("Node contended (errno %s); retry after backoff.", exc.errno)
        else:
            logging.error("Status query failed: %s", exc)
        return None
    finally:
        conn.close()


def safe_restart_service():
    """Idempotent restart; caller must have already validated quorum."""
    logging.info("Initiating controlled MariaDB restart...")
    subprocess.run(["systemctl", "stop", "mariadb"], check=True)
    subprocess.run(["systemctl", "start", "mariadb"], check=True)
    logging.info("Service restarted.")


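def remediate_node(state):
    if not state:
        return
    comment = state.get("wsrep_local_state_comment", "")
    ready = state.get("wsrep_ready", "")
    cluster_status = state.get("wsrep_cluster_status", "")

    if comment == "Donor/Desynced":
        # A donor serving an SST reports this state by design; restarting it aborts the joiner.
        logging.warning("Node is Donor/Desynced; not restarting. Check joiner progress and donor logs.")
    elif comment == "Joining" and ready == "OFF" and cluster_status == "Primary":
        # Restart only a stuck joiner, and only while its component is still Primary.
        logging.warning("Node stuck in Joining inside a Primary component; restarting to retry the transfer.")
        safe_restart_service()
    elif comment == "Closed" and cluster_status != "Primary":
        logging.critical("Node Closed and non-Primary; require quorum check before any bootstrap.")
    else:
        logging.info("Node state is %s; no intervention required.", comment)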


if __name__ == "__main__":
    remediate_node(check_galera_state("localhost", "monitor", "secure_password"))

Never automate rm -rf /var/lib/mysql/* without a human approval gate or verified immutable backup. For a full metrics loop, build on monitoring Galera cluster state with Python.
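
The probe above only logs "retry after backoff" for errors 1213 and 1205. One way to act on that hint is a small retry wrapper, sketched below with hypothetical names. It assumes the wrapped callable raises an exception carrying an errno attribute (as mysql.connector.Error does) instead of returning None, so the query function would need a variant that re-raises.

```python
import time

RETRYABLE_ERRNOS = {1213, 1205}  # deadlock, lock wait timeout


def retry_on_contention(fn, attempts: int = 4, base_delay: float = 0.5, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff on retryable errnos only."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            errno = getattr(exc, "errno", None)
            # Re-raise anything non-retryable, and give up after the last attempt.
            if errno not in RETRYABLE_ERRNOS or attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


if __name__ == "__main__":
    print(retry_on_contention(lambda: "state vector placeholder"))
```

The sleep parameter is injectable so the backoff schedule can be tested without waiting.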

Parameter Reference

Variable / option	Type	Default	Recommended	Role in `wsrep_local_state_comment` recovery
`wsrep_local_state_comment`	status (read-only)	—	`Synced`	The human-readable phase you are diagnosing; target value after any fix
`wsrep_local_state`	status (read-only)	—	`4`	Integer form of the comment; `4` = Synced, `2` = Donor/Desynced
`wsrep_ready`	status (read-only)	—	`ON`	`OFF` means the node rejects queries regardless of comment string
`wsrep_cluster_status`	status (read-only)	—	`Primary`	`non-Primary` blocks writes; must be `Primary` before bootstrap decisions
`wsrep_flow_control_paused`	status (read-only)	`0`	`< 0.1`	Fraction of time paused; high values precede a drop into `Desynced`
`gcache.size`	provider option	`128M`	`≥ write volume of longest outage`	Determines whether a stuck joiner recovers via IST or falls back to SST
`wsrep_sst_method`	server (`[mysqld]`)	`rsync`	`mariabackup`	Must match on donor and joiner or the transfer aborts, pinning `Joining`
`wsrep_slave_threads`	server (`[mysqld]`)	`1`	`= CPU cores`	Drains the apply queue so a `Synced` node stays out of flow control

Verification

After any recovery workflow, confirm convergence before routing production traffic back to the node:

SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';  -- expect: Synced
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';         -- expect: full member count
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';       -- expect: Primary

Then prove replication is actually flowing: insert a sentinel row on one node and read it back from every other member within a few seconds. Finally, watch wsrep_flow_control_paused for roughly ten minutes to confirm the node is not immediately re-entering flow control because of I/O saturation.
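
Waiting for convergence can be scripted as a bounded poll. The sketch below uses hypothetical names and takes the state lookup as a callable, so it does not assume any particular database driver; the 600-second timeout and 5-second interval are illustrative defaults.

```python
import time


def wait_until_synced(fetch_comment, timeout_s: float = 600, interval_s: float = 5,
                      sleep=time.sleep, clock=time.monotonic) -> bool:
    """Poll fetch_comment() until it returns 'Synced' or the timeout elapses.

    fetch_comment: zero-argument callable returning the current
    wsrep_local_state_comment string for the node being verified.
    """
    deadline = clock() + timeout_s
    while True:
        if fetch_comment() == "Synced":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    states = iter(["Joining", "Joined", "Synced"])
    print(wait_until_synced(lambda: next(states), interval_s=0))
```

Route traffic back only on a True result, and still run the sentinel-row check, since Synced alone does not prove that writes are flowing.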

Edge Cases & Gotchas

Docker and Kubernetes images stall in Joining on a stale grastate.dat. A container restarted from a snapshot can carry a seqno and uuid that no longer match the live component. Mount the datadir on a persistent volume and, when the UUID diverges, delete grastate.dat and gvwstate.dat so the node requests a clean transfer instead of arguing over a phantom position.
A Closed node after systemd killed it too fast. If TimeoutStopSec in the unit is shorter than the time MariaDB needs to flush and broadcast NODE_LEAVE, systemd sends SIGKILL, leaving seqno: -1 and a Closed/forced-SST state on next boot. Raise TimeoutStopSec in a drop-in unit so graceful shutdown can complete.
Donor/Desynced that never returns to Synced. The comment is behaving correctly — a node serving an SST is supposed to read Donor/Desynced until the transfer finishes. Do not restart it; that aborts the joiner too. Confirm the joiner is progressing first, and only intervene if the donor’s mariabackup stream has actually errored.
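
For the systemd timeout case above, the drop-in itself is only two lines. The sketch below renders it from Python so automation can manage it; the function name, the 900-second example value, and the drop-in file name are assumptions to adapt to your shutdown times.

```python
def render_stop_timeout_dropin(seconds: int) -> str:
    """Render a systemd drop-in that raises TimeoutStopSec for the MariaDB unit."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return f"[Service]\nTimeoutStopSec={seconds}\n"


if __name__ == "__main__":
    # Hypothetical target: /etc/systemd/system/mariadb.service.d/timeout.conf
    print(render_stop_timeout_dropin(900), end="")
```

After writing the file, run systemctl daemon-reload so systemd picks up the new limit.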

Related

Graceful Node Join and Leave Procedures: the parent runbook for ordered node lifecycle transitions
Troubleshooting Node Desync During Join: deeper diagnosis of stalls at Joining and Donor/Desynced
Handling Galera Startup Errors & Logs: when a node never even reaches Initialized
Monitoring Galera Cluster State with Python: scraping wsrep_ telemetry to catch state faults early
