Why is rsync a bad SST method for large Galera datasets?

The rsync method acquires a global read lock (FLUSH TABLES WITH READ LOCK) on the donor for the entire transfer and copies raw, uncompressed files single-threaded. On datasets above 100GB this drives the donor to Donor/Desynced, saturates the replication network, and starves InnoDB buffer-pool flushes, throttling the whole group. Use mariabackup instead, which streams physical pages without a full table lock.

How do I fix 'No space left on device' (Errcode 28) during mariabackup SST?

The joiner ran out of space during the apply phase, usually because datadir, innodb_tmpdir, or tmpdir sits on an undersized partition or a small /tmp tmpfs. Stop mariadb, clear the partial datadir, point innodb_tmpdir at a dedicated high-capacity volume, confirm free space with df -h, and restart the node so SST can retry cleanly.

How do I stop Galera from forcing a full SST on every rejoin?

A full SST is forced when the joiner's last sequence number has aged out of the donor's GCache ring buffer. Raise gcache.size beyond the write volume produced during your longest expected outage or maintenance window. On a busy group, the 128M default shipped on stock cloud images can be overwritten within seconds, which makes a full SST all but certain on every rejoin; sizing it correctly is the difference between a seconds-long incremental catch-up and an hours-long snapshot.

Choosing the Right SST Method for Large Datasets

This decision builds on the synchronization model described in Initial Data Synchronization Methods, and answers one focused question: when a joiner must receive a multi-terabyte snapshot, which State Snapshot Transfer (SST) method keeps the donor writable, finishes inside your recovery-time objective, and does not saturate the replication network? For small datasets the default barely matters — any method completes in seconds. Past roughly 100GB the choice stops being cosmetic and starts dictating cluster availability: the wrong backend holds a global read lock on the donor for the entire transfer, starves live replication traffic, and turns a routine node replacement into a multi-hour incident. This page gives database administrators, DevOps engineers, and platform teams the single authoritative answer — wsrep_sst_method=mariabackup with parallel, compressed streaming — plus the tuning, verification, and failure paths that make it reliable at scale.

Context: Why the SST Backend Decides Availability at Scale

Galera synchronizes a joiner either incrementally or with a full snapshot, and the provider chooses automatically based on whether the joiner’s last committed sequence number still lives in the donor’s GCache ring buffer. For a fresh node, a rebuilt node, or one whose position has aged out of gcache.size, an Incremental State Transfer is impossible and a full SST is unavoidable. That full copy is where large datasets punish a bad default. The mechanics of that choice — and the ordered write-set stream both transfers replay — are covered in Understanding Galera Synchronous Replication.
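The provider's choice between IST and a full SST can be modeled roughly as follows. This is a simplified sketch, not Galera's actual code: it assumes you already know the joiner's saved group UUID and seqno (as recorded in grastate.dat) and the donor's lowest cached seqno (the wsrep_local_cached_downto status variable). The function name is hypothetical.

```python
def needs_full_sst(joiner_uuid: str, joiner_seqno: int,
                   group_uuid: str, donor_cached_downto: int) -> bool:
    """Simplified model: True when IST cannot cover the joiner's gap."""
    # A fresh or wiped node has no matching group identity, and seqno -1
    # means no saved position; in this model both force a full SST.
    if joiner_uuid != group_uuid or joiner_seqno < 0:
        return True
    # IST must replay every write-set after joiner_seqno, so the donor's
    # GCache still has to hold seqno joiner_seqno + 1.
    return donor_cached_downto > joiner_seqno + 1


# Hypothetical values: the donor still caches everything from seqno 400 on.
print(needs_full_sst("group-uuid", 500, "group-uuid", donor_cached_downto=400))  # False: IST is enough
```

The second check is the one gcache.size controls: the larger the cache, the lower wsrep_local_cached_downto stays relative to the group's current position.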

The three SST backends differ in exactly the dimensions that matter at terabyte scale: whether the donor stays writable, how much CPU and network the transfer burns, and whether the copy is physical or logical.

mysqldump replays logical INSERT statements and forces a FLUSH TABLES WITH READ LOCK on the donor for the entire dump; on a terabyte it can lock the donor for hours and is only defensible for narrow cross-version edge cases. rsync copies files directly but still blocks donor writes for the duration and transfers raw, uncompressed pages that exhaust the interface. mariabackup streams physical InnoDB tablespaces through mbstream without a full table lock, so the donor’s write path resumes almost immediately. For any production dataset above a few tens of gigabytes, mariabackup is the answer; the sections below tune it.

Why `rsync` fails past 100GB

The default rsync method performs a synchronous, file-level copy that holds a global read lock (FLUSH TABLES WITH READ LOCK) on the donor for the entire transfer. For that whole period the donor reports wsrep_local_state_comment as Donor/Desynced, and a large copy produces three compounding problems:

Lock contention: the lock stops the donor from applying incoming write-sets, so they pile up in its receive queue, and writes routed to the donor stall until the copy ends and the backlog drains.
Network saturation: uncompressed raw file transfer consumes the full interface and starves the replication traffic on port 4567 that keeps surviving nodes in sync.
I/O starvation: sequential donor disk reads compete with active InnoDB buffer-pool flushes, degrading throughput cluster-wide.

Treat rsync as a fallback for sub-50GB datasets or isolated lab reprovisioning only. Production-scale synchronization needs a streaming, page-level backend.

Solution: Parallel, Compressed `mariabackup` Streaming

mariabackup (the MariaDB-native fork of Percona XtraBackup) copies InnoDB tablespaces page by page while recording the redo log generated during the copy, and streams the result instead of reconstructing rows. The donor holds only a brief lock near the end of the backup, so it keeps processing write-sets for almost the entire transfer. Two options make it viable at scale: parallel read threads sized to storage queue depth, and on-the-fly compression that, depending on how compressible the data is, can cut the network payload by roughly 60–80% without pinning the donor CPU.
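A back-of-the-envelope calculation shows why payload reduction matters on a large copy. This is a sketch under simple assumptions: the network link is the only bottleneck, a fixed share of line rate is usable (the 0.9 default is an assumption), and the compression saving is whatever you measure for your own data. The function name is hypothetical.

```python
def sst_transfer_seconds(dataset_bytes: float, link_gbps: float,
                         payload_reduction: float = 0.0,
                         link_efficiency: float = 0.9) -> float:
    """Estimate network time for one SST, assuming the link is the bottleneck.

    payload_reduction: fraction of bytes removed by compression (0.7 = 70%).
    link_efficiency:   usable share of line rate (assumed, not measured).
    """
    if not 0.0 <= payload_reduction < 1.0:
        raise ValueError("payload_reduction must be in [0, 1)")
    wire_bytes = dataset_bytes * (1.0 - payload_reduction)
    usable_bytes_per_s = link_gbps * 1e9 / 8 * link_efficiency
    return wire_bytes / usable_bytes_per_s


TB = 1e12
raw = sst_transfer_seconds(2 * TB, link_gbps=1)
compressed = sst_transfer_seconds(2 * TB, link_gbps=1, payload_reduction=0.7)
print(f"uncompressed: {raw / 3600:.1f} h, 70% smaller payload: {compressed / 3600:.1f} h")
```

On a 1 Gbps link this prints roughly 4.9 h versus 1.5 h. The estimate ignores donor read speed and the joiner's prepare phase, so treat it as a lower bound.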

Configure it in the merged wsrep.cnf tree. SST-specific flags belong in a [mariabackup] section, which the SST script reads when it builds the backup command — the layering rules for this file are detailed in the wsrep.cnf Configuration Deep Dive:

[mysqld]
# --- Core SST routing ---
wsrep_sst_method=mariabackup
wsrep_sst_auth="sst_user:your_strong_password_here"

# --- Galera cache retention during donor desync ---
wsrep_provider_options="gcache.size=8G;gcs.fc_limit=1024;gcs.fc_single_primary=YES"

# mariabackup SST options are passed via the [mariabackup] section.
# The SST script reads this section when building the mariabackup command.
[mariabackup]
parallel=12
compress
compress-threads=6
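wsrep_provider_options is a single semicolon-separated string, so a typo in one key is easy to miss. The sketch below splits it into a dictionary and converts size suffixes to bytes so values such as gcache.size can be compared numerically. The function names are hypothetical, and it assumes the simple key=value;key=value form used above with 1024-based K/M/G/T multipliers.

```python
SIZE_SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_provider_options(raw: str) -> dict:
    """Split 'a=1;b=2' into {'a': '1', 'b': '2'}, ignoring empty segments."""
    options = {}
    for segment in raw.split(";"):
        segment = segment.strip()
        if not segment:
            continue
        key, sep, value = segment.partition("=")
        if not sep:
            raise ValueError(f"malformed provider option: {segment!r}")
        options[key.strip()] = value.strip()
    return options


def size_to_bytes(value: str) -> int:
    """Convert '8G' or '134217728' to bytes (1024-based suffixes assumed)."""
    value = value.strip().upper()
    if value and value[-1] in SIZE_SUFFIXES:
        return int(value[:-1]) * SIZE_SUFFIXES[value[-1]]
    return int(value)


opts = parse_provider_options("gcache.size=8G;gcs.fc_limit=1024;gcs.fc_single_primary=YES")
print(size_to_bytes(opts["gcache.size"]), int(opts["gcs.fc_limit"]))
```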

Each significant line, and why it is set this way for a 500GB–2TB dataset on NVMe-backed storage:

wsrep_sst_method=mariabackup routes state transfer through the physical streaming backend instead of the blocking rsync default.
wsrep_sst_auth supplies the replicated SST account credentials the donor uses to open the backup; inject it from a secrets manager, never commit it in plaintext.
gcache.size=8G sizes the write-set ring buffer so the joiner’s required position stays in cache; if the outage write volume exceeds this window the required range wraps and the provider is forced into a full SST even when an incremental catch-up should have been possible.
gcs.fc_limit=1024 raises the apply-queue depth before flow control throttles the group, so a busy donor mid-transfer does not stall write-heavy OLTP.
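parallel=12 runs twelve file-copy threads, kept below the queue depth typical of enterprise NVMe (16–32). Over-provisioning causes thread contention; under-provisioning leaves I/O bandwidth idle.
compress turns on mariabackup's built-in compression, which uses the QuickLZ (qpress) format; MariaDB has deprecated this option, and it does not let you choose lz4 or zstd. If you want one of those codecs on the wire, check whether your MariaDB version's SST script supports an external compressor through the compressor and decompressor options of the [sst] section.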
compress-threads=6 dedicates CPU to compression so it does not become the transfer bottleneck; keep it below half the donor’s core count to leave headroom for write-set apply.
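To check what the [mariabackup] group actually contains, Python's standard configparser can read a MariaDB-style option file, provided allow_no_value=True so bare flags such as compress parse. This is a sketch: real servers merge several files and !include/!includedir directives, which configparser does not follow, and the function name is hypothetical.

```python
import configparser


def read_mariabackup_options(text: str) -> dict:
    """Return the [mariabackup] group of an option file as a dict.

    Bare flags such as 'compress' come back with the value None.
    """
    parser = configparser.ConfigParser(
        allow_no_value=True,  # accept flags written without '=value'
        interpolation=None,   # a '%' in a password must not be expanded
        strict=False,         # option files may repeat keys; last one wins
    )
    parser.read_string(text)
    if not parser.has_section("mariabackup"):
        return {}
    return dict(parser.items("mariabackup"))


SAMPLE = """
[mysqld]
wsrep_sst_method=mariabackup

[mariabackup]
parallel=12
compress
compress-threads=6
"""
opts = read_mariabackup_options(SAMPLE)
print(opts)
```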

Parameter Reference

The options that move large-dataset SST behavior, their scope, defaults, and recommended values:

Parameter	Section	Type	Default	Recommended (0.5–2TB, NVMe)	Purpose
`wsrep_sst_method`	`[mysqld]`	enum	`rsync`	`mariabackup`	Selects the SST backend; physical streaming vs. blocking file copy
`wsrep_sst_auth`	`[mysqld]`	string	(unset)	`sst_user:<secret>`	Donor-side credentials for the backup session
`gcache.size`	provider opts	size	`128M`	`8G`+	Write-set ring buffer; too small forces full SST on rejoin
`gcs.fc_limit`	provider opts	integer	`16`	`1024`	Apply-queue depth before flow control throttles the group
`gcs.fc_single_primary`	provider opts	bool	`NO`	`YES`	Relaxes flow control when one node drives writes
`parallel`	`[mariabackup]`	integer	`1`	`8–16`	Concurrent reader threads; match NVMe queue depth
`compress`	`[mariabackup]`	flag	off	on	Streaming compression; cuts payload 60–80%
`compress-threads`	`[mariabackup]`	integer	`1`	`4–8`	CPU threads for compression; keep below half core count
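The numeric rows of the table can double as a pre-flight audit. This sketch assumes the settings were already collected into a dict of strings (for example from SHOW GLOBAL VARIABLES and the parsed provider options); the minimums encode this page's recommendations for 0.5–2TB NVMe hosts rather than hard limits, and the function name is hypothetical.

```python
from typing import Dict, List

GIB = 1024**3
# Minimums taken from the "Recommended" column; tune for your hardware.
RECOMMENDED_MIN = {"gcache.size": 8 * GIB, "gcs.fc_limit": 1024,
                   "parallel": 8, "compress-threads": 4}
SUFFIX = {"K": 1024, "M": 1024**2, "G": GIB, "T": 1024**4}


def _to_number(value: str) -> int:
    """Parse '8G', '128M' or '1024' (1024-based suffixes assumed)."""
    value = value.strip().upper()
    if value[-1:] in SUFFIX:
        return int(value[:-1]) * SUFFIX[value[-1]]
    return int(value)


def audit(settings: Dict[str, str]) -> List[str]:
    """List settings that are missing, below the minimums, or on the wrong SST method."""
    findings = []
    method = settings.get("wsrep_sst_method")
    if method != "mariabackup":
        findings.append(f"wsrep_sst_method={method!r}, expected 'mariabackup'")
    for key, minimum in RECOMMENDED_MIN.items():
        if key not in settings:
            findings.append(f"{key} not set")
        elif _to_number(settings[key]) < minimum:
            findings.append(f"{key}={settings[key]} below recommended {minimum}")
    return findings
```

For a stock node (rsync, 128M cache, fc_limit 16) every row is flagged.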

Verification Step

After the joiner completes SST, confirm it reached a writable, synced state and that the transfer used the method you configured. Query the live status variables on the joiner:

SHOW GLOBAL STATUS WHERE Variable_name IN (
  'wsrep_cluster_status',        -- must be 'Primary'
  'wsrep_local_state_comment',   -- must be 'Synced'
  'wsrep_cluster_size'           -- must equal the full node count
);
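When this check runs from automation rather than an interactive client, batch output is easier to parse. The sketch below assumes the query above was run through `mariadb -N -B -e` (no column names, tab-separated rows) and that the caller supplies the expected node count; both function names are hypothetical.

```python
def parse_status_rows(output: str) -> dict:
    """Turn tab-separated 'name<TAB>value' rows into a dict."""
    status = {}
    for line in output.splitlines():
        if "\t" in line:
            name, value = line.split("\t", 1)
            status[name.strip()] = value.strip()
    return status


def node_is_ready(status: dict, expected_size: int) -> bool:
    """Apply the three checks from the verification query."""
    return (
        status.get("wsrep_cluster_status") == "Primary"
        and status.get("wsrep_local_state_comment") == "Synced"
        and status.get("wsrep_cluster_size") == str(expected_size)
    )


sample = "wsrep_cluster_size\t3\nwsrep_cluster_status\tPrimary\nwsrep_local_state_comment\tSynced\n"
print(node_is_ready(parse_status_rows(sample), expected_size=3))
```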

Then confirm the SST script actually invoked mariabackup rather than falling back, and that it finished cleanly:

# Confirm the configured method and a successful completion in the log
journalctl -u mariadb --no-pager | grep -E "WSREP_SST|Prepared|SST complete"

# Verify the running method matches the config
mariadb -e "SHOW GLOBAL VARIABLES LIKE 'wsrep_sst_method';"

A wsrep_local_state_comment of Synced with wsrep_cluster_status=Primary, the expected wsrep_cluster_size, and an SST complete log line is the definitive signal that the large-dataset transfer succeeded and the node is safe to route traffic to.

Edge Cases & Gotchas

Joiner disk exhaustion during apply. The most common large-dataset failure is the joiner running out of space while mariabackup prepares the copy:

WSREP_SST: [ERROR] mariabackup: Error writing file 'UNOPENED' (Errcode: 28 "No space left on device")

This happens when the joiner’s datadir or innodb_tmpdir sits on an undersized partition, or when tmpdir defaults to a small /tmp tmpfs. Remediate by stopping the node, clearing the partial datadir, pointing innodb_tmpdir at a dedicated high-capacity volume, and restarting:

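systemctl stop mariadb
# Clear the partial datadir, including hidden entries that a * glob would skip
find /var/lib/mysql -mindepth 1 -delete
# Point innodb_tmpdir at a high-capacity volume in the config before restarting
df -h /var/lib/mysql /tmp          # confirm free space before retry
systemctl start mariadb
journalctl -u mariadb -f | grep WSREP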
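The free-space check can also be scripted so a retry is refused when a volume is clearly too small. This sketch uses shutil.disk_usage from the standard library; comparing against the donor's dataset size and the 1.2 headroom factor are assumptions, so set both from your own measurements (a compressed transfer may need extra room while the stream is decompressed). The function name is hypothetical.

```python
import os
import shutil


def has_room_for_sst(path: str, dataset_bytes: int, headroom: float = 1.2) -> bool:
    """True if `path` has at least dataset_bytes * headroom bytes free.

    headroom is an assumed safety factor, not a MariaDB requirement.
    """
    free = shutil.disk_usage(path).free
    needed = int(dataset_bytes * headroom)
    print(f"{path}: {free / 1024**3:.1f} GiB free, {needed / 1024**3:.1f} GiB needed")
    return free >= needed


if __name__ == "__main__":
    dataset = 900 * 1024**3  # hypothetical donor dataset size
    for mount in ("/var/lib/mysql", "/tmp"):
        if os.path.isdir(mount):
            print(mount, "OK" if has_room_for_sst(mount, dataset) else "TOO SMALL")
```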

Compression outrunning the CPU. On donors with modest core counts, setting compress-threads too high starves write-set apply and paradoxically slows both the transfer and the live group. Keep parallel and compress-threads combined below the donor’s physical core count, and prefer lz4 over zstd when CPU, not network, is the constraint.
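Both thread-sizing rules (compress-threads below half the donor's cores, and parallel plus compress-threads below the physical core count) can be checked before a transfer starts. A sketch: the function name is hypothetical, and os.cpu_count() reports logical CPUs, which overstates physical cores on hyperthreaded hosts, so pass the physical count explicitly when you know it.

```python
import os
from typing import List, Optional


def thread_budget_problems(parallel: int, compress_threads: int,
                           cores: Optional[int] = None) -> List[str]:
    """Return human-readable violations of the SST thread-sizing rules."""
    # os.cpu_count() counts logical CPUs; prefer an explicit physical count.
    cores = cores or os.cpu_count() or 1
    problems = []
    if compress_threads >= cores / 2:
        problems.append(f"compress-threads={compress_threads} is not below half of {cores} cores")
    if parallel + compress_threads >= cores:
        problems.append(f"parallel+compress-threads={parallel + compress_threads} is not below {cores} cores")
    return problems


print(thread_budget_problems(parallel=12, compress_threads=6, cores=16))
# ['parallel+compress-threads=18 is not below 16 cores']
```

With parallel=12 and compress-threads=6, a 16-core donor fails the combined rule, so those values assume a larger host; scale both down on smaller donors.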

GCache too small on cloud images. Stock cloud MariaDB images ship the gcache.size=128M default. On a busy group that guarantees a full SST on every rejoin because the joiner’s position ages out of cache within seconds. Size gcache.size to exceed the write volume of your longest expected maintenance window — this single value is the difference between a seconds-long incremental catch-up and an hours-long snapshot. Provider-option layering that affects this is documented in the wsrep.cnf Configuration Deep Dive.
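One common way to size gcache.size is to measure the replication write rate and multiply it by the longest window a node may be away. The sketch below assumes you sampled the sum of the wsrep_replicated_bytes and wsrep_received_bytes status counters twice, some seconds apart, during a representative busy period; the 1.5 headroom factor and the function names are assumptions, not Galera defaults.

```python
import math


def recommend_gcache_bytes(bytes_at_start: int, bytes_at_end: int,
                           sample_seconds: float, window_seconds: float,
                           headroom: float = 1.5) -> int:
    """Size gcache.size to hold `window_seconds` worth of write-sets.

    bytes_at_*: wsrep_replicated_bytes + wsrep_received_bytes at each sample.
    headroom:   assumed safety factor for bursts and write-set overhead.
    """
    if sample_seconds <= 0 or bytes_at_end < bytes_at_start:
        raise ValueError("need two increasing samples taken some seconds apart")
    rate = (bytes_at_end - bytes_at_start) / sample_seconds  # bytes per second
    return math.ceil(rate * window_seconds * headroom)


def to_gcache_setting(size_bytes: int) -> str:
    """Round up to whole GiB for a wsrep_provider_options value."""
    return f"gcache.size={math.ceil(size_bytes / 1024**3)}G"


# Hypothetical samples: 300 MB of write-sets in 60 s, 2-hour maintenance window.
size = recommend_gcache_bytes(10_000_000_000, 10_300_000_000, sample_seconds=60, window_seconds=2 * 3600)
print(to_gcache_setting(size))  # gcache.size=51G
```

The cache is preallocated on disk (galera.cache in the datadir by default), so budget that space on every node.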

Automating Failure Detection and Rejoin

At fleet scale, manual log tailing does not hold. A deterministic probe that watches the joiner’s SST progress and drives a safe, bounded re-provision keeps node replacement hands-off. This snippet targets Python 3.9+, parses the MariaDB log for SST completion or a fatal error, and caps retries so a failing node cannot thrash the group:

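import os
import re
import shutil
import subprocess
import sys
import time
from typing import Optional

LOG_PATH = "/var/log/mariadb/mariadb.log"  # match your log_error setting
DATADIR = "/var/lib/mysql"
MAX_ATTEMPTS = 5
SST_TIMEOUT = 6 * 3600  # seconds to wait for one attempt; size to your dataset
POLL_INTERVAL = 30      # seconds between log scans

DONE = re.compile(r"WSREP_SST:.*SST complete")
ERROR = re.compile(r"WSREP_SST:.*\[ERROR\]")


def log_offset(log_path: str = LOG_PATH) -> int:
    """Current end of the log, so each attempt only scans lines it produced."""
    try:
        return os.path.getsize(log_path)
    except FileNotFoundError:
        return 0


def monitor_sst(offset: int, log_path: str = LOG_PATH) -> Optional[bool]:
    """Return True on completion, False on fatal error, None if inconclusive."""
    try:
        with open(log_path, "rb") as log:
            log.seek(offset)  # skip results left behind by earlier attempts
            for raw in log:
                line = raw.decode("utf-8", errors="replace")
                if ERROR.search(line):
                    print(f"[FATAL] SST failed: {line.strip()}", file=sys.stderr)
                    return False
                if DONE.search(line):
                    print("[INFO] SST completed successfully.")
                    return True
    except FileNotFoundError:
        print("[WARN] Log not found yet; still waiting.", file=sys.stderr)
    return None


def wait_for_sst(offset: int) -> Optional[bool]:
    """Poll the log until SST succeeds, fails, or SST_TIMEOUT elapses."""
    deadline = time.monotonic() + SST_TIMEOUT
    while time.monotonic() < deadline:
        result = monitor_sst(offset)
        if result is not None:
            return result
        time.sleep(POLL_INTERVAL)
    return None  # timed out: treated as a failed attempt by the caller


def clear_datadir(path: str = DATADIR) -> None:
    """Delete every entry, including hidden ones a shell '*' glob would skip."""
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                shutil.rmtree(entry.path)
            else:
                os.unlink(entry.path)


def safe_rejoin() -> bool:
    """Idempotent node rebuild with bounded, backed-off SST retries."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        subprocess.run(["systemctl", "stop", "mariadb"], check=True)
        clear_datadir()
        offset = log_offset()
        # No check=True: if the SST fails, the server exits during startup and
        # "systemctl start" returns non-zero; the log scan decides what happened.
        subprocess.run(["systemctl", "start", "mariadb"], check=False)
        if wait_for_sst(offset):
            return True
        if attempt < MAX_ATTEMPTS:
            time.sleep(min(2 ** attempt, 300))  # exponential backoff, capped at 5 min
    print(f"[FATAL] SST still failing after {MAX_ATTEMPTS} attempts.", file=sys.stderr)
    return False


if __name__ == "__main__":
    sys.exit(0 if safe_rejoin() else 1)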

Before routing traffic to the rebuilt node, gate on wsrep_cluster_status as shown in the verification step, and wire this probe into an Ansible play or Kubernetes init container so large-scale provisioning stays deterministic. Reusable versions of the status probe are covered in Monitoring Galera Cluster State with Python.
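The gate before routing traffic can be written as a bounded polling loop. This sketch takes the status-fetching function as a parameter (for example, one that runs the verification query through the client and returns a dict of status values), which keeps the gate itself testable; every name here is hypothetical, and the timeout and interval defaults are placeholders.

```python
import time
from typing import Callable, Dict


def wait_until_routable(fetch_status: Callable[[], Dict[str, str]],
                        expected_size: int,
                        timeout_s: float = 600.0,
                        interval_s: float = 5.0) -> bool:
    """Poll until the node is Primary, Synced and sees the full cluster."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if (status.get("wsrep_cluster_status") == "Primary"
                and status.get("wsrep_local_state_comment") == "Synced"
                and status.get("wsrep_cluster_size") == str(expected_size)):
            return True
        if time.monotonic() >= deadline:
            return False  # give up; leave the node out of the load balancer
        time.sleep(interval_s)
```

A node that is still Joined (applying its post-SST backlog) keeps the loop waiting until it reports Synced.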

By matching SST mechanics to physical storage, retaining enough donor cache to favor incremental catch-up, and embedding bounded recovery into automation, platform teams get predictable synchronization windows and preserve multi-master availability even at multi-terabyte scale.

Related

Initial Data Synchronization Methods: the parent SST/IST model and the three backends compared
wsrep.cnf Configuration Deep Dive: provider-option layering, GCache sizing, and validation
Graceful Node Join and Leave Procedures: clean shutdowns that keep rejoins on the fast IST path
Understanding Galera Synchronous Replication: the write-set stream both transfer types replay
