How large should gcache.size be for a Galera node?

Size gcache.size from measured load: peak write throughput in MB/s multiplied by worst-case apply latency in seconds, multiplied by a 1.5 safety factor. For example 250 MB/s of writes with a 3-second apply latency needs at least 1125 MB, rounded up to 2G. It must exceed the write volume produced across your longest maintenance window; otherwise a rejoining node's IST range ages out of the ring buffer and it falls back to a full State Snapshot Transfer that desyncs the donor.

Should I disable SMT (hyperthreading) on Galera nodes?

For latency-sensitive write workloads, yes. SMT sibling threads share execution units and add unpredictable context-switch latency during high-concurrency write bursts, which shows up as flow-control pauses. Size wsrep_slave_threads to the physical core count, not the logical thread count, and disable SMT in BIOS or via /sys/devices/system/cpu/smt/control when sub-millisecond commit latency matters.

What network latency can a synchronous Galera cluster tolerate?

Keep intra-region round-trip time under 1 ms. Every commit is gated on the group-communication round trip, so latency is added to every write. Above roughly 50 ms the group suffers constant flow control and false node evictions; at that distance use an asynchronous replica downstream rather than stretching a synchronous cluster across regions.

Galera Cluster Hardware Requirements: Production Sizing & Validation

This sizing guide builds on the node lifecycle and provisioning model described in Galera Cluster Setup & Node Management, and solves one specific operational problem: choosing and validating node hardware before cluster formation so that synchronous commit latency stays bounded under production write load. Because Galera’s synchronous replication gates every commit on the slowest network round trip and the slowest applier, a single under-provisioned node throttles the entire cluster through flow control. Hardware sizing here is therefore not a capacity exercise; it is deterministic latency engineering. Get the CPU, memory, network, and storage baselines right up front and you prevent certification backlogs, SST-induced donor stalls, and the split-brain recovery scenarios that undersized clusters manufacture.

How Hardware Shapes Synchronous Commit Latency

In asynchronous replication a primary commits locally and ships changes later, so local disk speed dominates. Galera inverts this: a transaction must be replicated to the group, pass write-set certification on every node, and be queued for apply before the originating client sees COMMIT succeed. The critical path of a single write therefore touches four hardware subsystems in series: CPU (certification), network (group communication round trip), memory (GCache and certification index), and storage (redo/binlog durability). The subsystem with the worst tail latency sets the ceiling for the whole cluster.

Two consequences follow directly. First, the group runs at the speed of its weakest member, so heterogeneous node hardware is an anti-pattern — provision identical machines. Second, oversizing one subsystem cannot compensate for undersizing another; a node with 128 cores and slow cross-AZ networking still stalls on flow control. The sizing workflow below walks each subsystem in the order it appears on the commit critical path.

Figure: one commit is gated in series by four hardware subsystems; total latency is their sum, and the slowest node paces the whole cluster.
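As a rough illustration of the figure, the sketch below models one commit as the sum of four per-subsystem latencies and the cluster pace as the slowest node's total. The node names and microsecond values are invented for the example; they are not measurements.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class NodeLatency:
    """Hypothetical per-commit latency contributions, in microseconds."""
    cpu_us: int      # certification
    network_us: int  # group-communication round trip
    memory_us: int   # GCache and certification-index access
    storage_us: int  # redo/binlog durability

    def commit_us(self) -> int:
        # The four subsystems sit in series, so their latencies add.
        return self.cpu_us + self.network_us + self.memory_us + self.storage_us


def cluster_commit_us(nodes: Dict[str, NodeLatency]) -> Tuple[str, int]:
    """Return the slowest node and the commit latency it imposes on the group."""
    slowest = max(nodes, key=lambda name: nodes[name].commit_us())
    return slowest, nodes[slowest].commit_us()


NODES = {
    "node-a": NodeLatency(40, 300, 10, 150),
    "node-b": NodeLatency(40, 300, 10, 150),
    "node-c": NodeLatency(40, 900, 10, 150),  # illustrative slow network link
}
print(cluster_commit_us(NODES))
```

In this toy model, improving node-a or node-b changes nothing: only node-c's network term moves the cluster-wide figure.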

Prerequisites & Environment Requirements

Confirm the following baseline before running any of the validation steps. These are minimums for a production three-node cluster; scale CPU and memory with your working-set size and write concurrency.

Subsystem	Production minimum	Notes
MariaDB	10.6 LTS or newer	Redo log is a single file sized by `innodb_log_file_size` (10.5+)
Galera provider	wsrep API 26 / Galera 4	Ships with MariaDB 10.4+
CPU	8 physical cores per node	Dedicated cores for certification + parallel apply
Memory	32 GB per node	Working set + certification index + GCache + 20% OS headroom
Storage	NVMe, ≥ 50k random-write IOPS	XFS or ext4, battery-backed write cache if using RAID
Network	≥ 10 GbE, RTT < 1 ms intra-region	Ports 4567 (group comm), 4568 (IST), 4444 (SST) open
OS	Linux with `numactl`, `fio`, `ethtool` installed	Kernel 5.4+ for `bbr` congestion control

Network reachability on 4567/4568/4444 is a hard precondition. Confirm firewall rules before you begin, following Network Security & Firewall Rules for Galera — bootstrap and SST failures frequently trace back to a blocked port rather than a hardware fault.

Step-by-Step Hardware Validation

Run these steps on every candidate node during provisioning, before installing MariaDB. Each step explains what the check protects against, not just how to run it.

Step 1 — Validate CPU topology and NUMA alignment

Galera’s certification thread and the parallel-apply threads governed by wsrep_slave_threads (tuned in the wsrep.cnf Configuration Deep Dive) scale with physical cores. Simultaneous multithreading (SMT) adds sibling threads that contend for the same execution units, injecting unpredictable context-switch latency during high-concurrency write bursts. NUMA misalignment is worse: when the InnoDB buffer pool spans multiple NUMA nodes, remote memory access penalties compound on every certification lookup. Verify both before provisioning.

#!/usr/bin/env bash
# validate_cpu_topology.sh — physical cores, SMT status, NUMA distribution
set -euo pipefail

# Newer lscpu releases indent these fields, so tolerate leading whitespace.
PHYSICAL_CORES=$(lscpu | awk -F: '/^[ \t]*Core\(s\) per socket/ {print $2}' | tr -d ' ')
SOCKETS=$(lscpu | awk -F: '/^[ \t]*Socket\(s\)/ {print $2}' | tr -d ' ')
SMT_ACTIVE=$(lscpu | awk -F: '/^[ \t]*Thread\(s\) per core/ {print $2}' | tr -d ' ')

echo "Physical Cores: $((PHYSICAL_CORES * SOCKETS))"
echo "SMT Threads per Core: $SMT_ACTIVE"

if [ "$SMT_ACTIVE" -gt 1 ]; then
  echo "WARNING: SMT active. For latency-sensitive workloads disable via BIOS"
  echo "         or: echo off > /sys/devices/system/cpu/smt/control"
fi

NUMA_NODES=$(numactl --hardware | awk '/available:/ {print $2}')
echo "NUMA Nodes Available: $NUMA_NODES"

if [ "$NUMA_NODES" -gt 1 ]; then
  echo "ACTION: Pin mysqld with numactl --cpunodebind --membind in the systemd unit."
fi

If more than one NUMA node is present, bind mysqld to a single node in a systemd drop-in so the buffer pool and Galera threads share local memory:

# /etc/systemd/system/mariadb.service.d/numa.conf
[Service]
ExecStart=
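# Keep the stock unit's variables so galera_new_cluster and position recovery still work.
ExecStart=/usr/bin/numactl --cpunodebind=0 --membind=0 /usr/sbin/mariadbd $MYSQLD_OPTS $_WSREP_NEW_CLUSTER $_WSREP_START_POSITION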
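For tooling that already runs Python, the physical-core arithmetic from the Step 1 script can be sketched as a small parser over `lscpu` text. The function name and the sample output are illustrative; the parser assumes the three field labels appear as shown, with or without leading indentation.

```python
import re


def physical_cores(lscpu_text: str) -> dict:
    """Derive physical core count and SMT status from `lscpu` output."""
    def field(label: str) -> int:
        # Allow leading whitespace: newer lscpu releases indent these lines.
        pattern = rf"^\s*{re.escape(label)}:\s*(\d+)"
        match = re.search(pattern, lscpu_text, re.MULTILINE)
        if match is None:
            raise ValueError(f"lscpu output has no '{label}' line")
        return int(match.group(1))

    return {
        "physical_cores": field("Core(s) per socket") * field("Socket(s)"),
        "smt_active": field("Thread(s) per core") > 1,
    }


# Illustrative excerpt, not captured from a real host.
SAMPLE = """\
CPU(s):                  32
    Thread(s) per core:  2
    Core(s) per socket:  8
    Socket(s):           2
"""
print(physical_cores(SAMPLE))
```

Raising on a missing field is deliberate: a gate that silently reports zero cores is harder to debug than one that fails loudly.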

Step 2 — Size memory, buffer pool, and GCache

Memory serves three competing consumers: the InnoDB buffer pool, the OS page cache, and Galera's write-set cache (GCache), a memory-mapped ring-buffer file whose hot pages occupy the same RAM. The gcache.size parameter bounds how many recent write-sets a donor retains; a rejoining node uses a fast Incremental State Transfer only while its last position is still inside that ring buffer. Once the required range ages out, the node falls back to a full State Snapshot Transfer, which desyncs the donor and saturates I/O and network. The initial data synchronization methods guide covers that IST-versus-SST decision in depth.

Size GCache from measured peak write throughput and worst-case apply latency, with a safety factor:

gcache.size (MB) = Peak_Write_MB_per_s × Max_Apply_Latency_s × 1.5

A deployment sustaining 250 MB/s of writes with a 3 s worst-case apply latency needs at least 1125 MB; round up to 2G to absorb bursts. Then allocate the buffer pool only after reserving 20% of RAM for the OS page cache, GCache, and the certification index — never hand 100% of system memory to InnoDB.

[mysqld]
innodb_buffer_pool_size = 24G          # ~70-75% of 32G, leaving OS + Galera headroom
wsrep_provider_options  = "gcache.size=2G; gcache.page_size=256M"
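The Step 2 arithmetic can be sketched in a few lines. The function names are illustrative, and rounding up to whole gigabytes is just one reasonable policy; the 256M page size is copied from the configuration above.

```python
import math


def gcache_size_mb(peak_write_mb_s: float, max_apply_latency_s: float,
                   safety_factor: float = 1.5) -> float:
    """gcache.size (MB) = peak write MB/s x worst apply latency (s) x safety factor."""
    return peak_write_mb_s * max_apply_latency_s * safety_factor


def round_up_to_gib(size_mb: float) -> int:
    """Round a size in MB up to the next whole GiB."""
    return math.ceil(size_mb / 1024)


def provider_options(gcache_gib: int, page_size: str = "256M") -> str:
    """Render the value for wsrep_provider_options."""
    return f"gcache.size={gcache_gib}G; gcache.page_size={page_size}"


required_mb = gcache_size_mb(250, 3)
print(required_mb, provider_options(round_up_to_gib(required_mb)))
```

Feeding the function measured peaks rather than averages matters, because the ring buffer has to survive the worst burst, not the typical one.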

Step 3 — Validate the network fabric and MTU

Synchronous replication demands a low-latency, high-bandwidth fabric. Keep intra-region round-trip time under 1 ms; synchronous replication across links above roughly 50 ms triggers constant flow control and eviction timeouts, at which point an async replica downstream is the correct pattern rather than a stretched cluster. Jumbo frames (MTU 9000) cut per-packet overhead and interrupt load for large write-sets, but only if every hop agrees: a single mismatched hop fragments or silently drops large packets. Validate end-to-end with a do-not-fragment probe (payload 8972 = 9000 MTU − 20-byte IP header − 8-byte ICMP header):

ping -M do -s 8972 10.0.1.11
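The payload arithmetic generalizes to any MTU. The sketch below computes it and builds the matching probe command; the helper names are illustrative, and the 20-byte figure assumes an IPv4 header without options.

```python
from typing import List

IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8


def icmp_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    overhead = IPV4_HEADER_BYTES + ICMP_HEADER_BYTES
    if mtu <= overhead:
        raise ValueError(f"MTU {mtu} leaves no room for a payload")
    return mtu - overhead


def df_ping_command(peer: str, mtu: int = 9000, count: int = 3) -> List[str]:
    """Argument list for a do-not-fragment probe sized to the given MTU."""
    payload = icmp_payload_for_mtu(mtu)
    return ["ping", "-M", "do", "-c", str(count), "-s", str(payload), peer]


print(icmp_payload_for_mtu(9000))
print(" ".join(df_ping_command("10.0.1.11")))
```

Returning an argument list rather than a shell string lets a caller pass it straight to `subprocess.run` without quoting concerns.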

Tune the TCP stack to prevent head-of-line blocking during certification bursts. Apply identical kernel parameters on every node via /etc/sysctl.d/99-galera-net.conf:

net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
net.ipv4.tcp_congestion_control = bbr

Load them with sysctl --system and confirm with sysctl net.ipv4.tcp_congestion_control.
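To confirm that every node really carries identical values, a provisioning job could compare the file against the live kernel settings. The sketch below is one way to do that, assuming plain key = value lines and keys whose dots map directly to /proc/sys path separators; the function names are illustrative.

```python
from pathlib import Path
from typing import Callable, Dict, Tuple


def parse_sysctl_conf(text: str) -> Dict[str, str]:
    """Parse key = value lines, skipping blanks and comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = " ".join(value.split())  # normalize whitespace
    return settings


def read_live(key: str) -> str:
    """Read the running value from /proc/sys (dots become path separators)."""
    return " ".join(Path("/proc/sys", *key.split(".")).read_text().split())


def find_drift(expected: Dict[str, str],
               reader: Callable[[str], str] = read_live) -> Dict[str, Tuple[str, str]]:
    """Map each mismatched key to (expected, live)."""
    drift = {}
    for key, want in expected.items():
        have = reader(key)
        if have != want:
            drift[key] = (want, have)
    return drift


SAMPLE_CONF = """\
# /etc/sysctl.d/99-galera-net.conf
net.core.rmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_congestion_control = bbr
"""
print(parse_sysctl_conf(SAMPLE_CONF))
```

The `reader` parameter exists so the comparison can be exercised without touching the real /proc/sys; on a node, calling `find_drift(parse_sysctl_conf(text))` uses the live values.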

Step 4 — Benchmark storage I/O and redo alignment

Disk I/O drives SST throughput, local commit durability, and crash-recovery speed. NVMe is the production baseline; choose enterprise-grade drives, and if they sit behind hardware RAID 10, require a battery-backed write cache. Mount data volumes with noatime (which implies nodiratime on Linux) to eliminate needless metadata writes. Align InnoDB redo capacity with GCache so the redo log never becomes the bottleneck under sustained load:

innodb_log_file_size >= gcache.size / 4

On MariaDB 10.5 and later the redo log is a single file sized by innodb_log_file_size (innodb_redo_log_capacity is a MySQL 8.0.30+ variable that MariaDB does not recognize); on older releases the total is innodb_log_file_size × innodb_log_files_in_group. Keep innodb_flush_log_at_trx_commit=1 and sync_binlog=1: bypassing durability for throughput defeats the zero-data-loss purpose of synchronous multi-master. Benchmark before deployment:

fio --name=galera_validation --ioengine=libaio --rw=randwrite --bs=16k \
    --direct=1 --size=10G --numjobs=4 --runtime=60 --time_based \
    --group_reporting --output-format=json

Target sustained 16k random-write IOPS (the block size the fio job issues, matching InnoDB's default page size) above 50,000 with p99 latency under 0.5 ms. A node that misses this budget will pace the whole cluster.
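Because the fio job emits JSON, the 50,000 IOPS and 0.5 ms p99 targets can be checked mechanically. The sketch below assumes the fio 3.x report layout (a `jobs` list whose `write` section carries `iops` and `clat_ns.percentile`); verify the key names against the fio version you run. The function name and sample report are illustrative.

```python
def fio_verdict(report: dict, min_iops: float = 50_000,
                max_p99_ms: float = 0.5) -> dict:
    """Compare a fio JSON report against the storage budget."""
    # With --group_reporting the jobs are merged into a single entry.
    write = report["jobs"][0]["write"]
    iops = write["iops"]
    # fio reports completion-latency percentiles in nanoseconds.
    p99_ms = write["clat_ns"]["percentile"]["99.000000"] / 1_000_000
    return {
        "iops": iops,
        "p99_ms": p99_ms,
        "pass": iops > min_iops and p99_ms < max_p99_ms,
    }


# Synthetic report for illustration, not a real benchmark result.
SAMPLE_REPORT = {
    "jobs": [{
        "jobname": "galera_validation",
        "write": {
            "iops": 62000.0,
            "clat_ns": {"percentile": {"99.000000": 410000}},
        },
    }]
}
print(fio_verdict(SAMPLE_REPORT))
```

On a real node the dictionary would come from `json.load` over the file written with fio's `--output` option.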

Parameter Deep-Dive

These are the configuration knobs whose correct values depend directly on the hardware you just validated. Distribute them identically across nodes except where noted.

Parameter	Hardware it maps to	Production guidance
`wsrep_slave_threads`	Physical cores	Match to physical (not SMT) core count. Right-size from observed `wsrep_cert_deps_distance`
`gcache.size`	RAM headroom	Peak write MB/s × worst apply latency × 1.5; must exceed write volume across your longest maintenance window
`innodb_buffer_pool_size`	RAM	~70-75% of total RAM after reserving OS + GCache + certification-index headroom
`innodb_log_file_size`	NVMe write bandwidth	`>= gcache.size / 4`; prevents redo flushing from throttling apply
`evs.suspect_timeout` / `evs.inactive_timeout`	Network RTT/jitter	Loosen from defaults (5s/15s) on higher-latency fabrics to avoid false evictions

For the group-communication and EVS timeout values specifically, the low-latency tuning rationale lives in Configuring wsrep_provider_options for Low Latency. Set wsrep_slave_threads conservatively at first — over-provisioning apply threads raises InnoDB lock contention rather than throughput, and the correct value is a function of wsrep_cert_deps_distance under your real workload, not a guess.
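One possible reading of that guidance, offered as a heuristic rather than an official formula: treat wsrep_cert_deps_distance as the parallelism the workload can actually use, and cap the applier count at both that value and the physical core count. The function name is hypothetical.

```python
import math


def suggest_applier_threads(physical_cores: int, cert_deps_distance: float) -> int:
    """Heuristic starting point for wsrep_slave_threads (an assumption, not a rule)."""
    if physical_cores < 1:
        raise ValueError("physical_cores must be at least 1")
    # Observed dependency distance approximates how many write-sets
    # could be applied in parallel under the real workload.
    usable_parallelism = max(1, math.floor(cert_deps_distance))
    return min(physical_cores, usable_parallelism)


print(suggest_applier_threads(8, 23.4))
print(suggest_applier_threads(8, 3.7))
```

Treat the result as an initial setting to validate under load, watching wsrep_local_recv_queue_avg and flow-control counters after each change.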

Verification & Health Checks

After the node is provisioned and MariaDB is running, confirm that the hardware baseline actually translates into a healthy cluster member. Start with the node state and the flow-control counters that expose an under-provisioned member:

SHOW GLOBAL STATUS WHERE Variable_name IN (
  'wsrep_local_state_comment',      -- expect: Synced
  'wsrep_cluster_size',             -- expect: full node count
  'wsrep_flow_control_paused',      -- expect: near 0.0
  'wsrep_local_recv_queue_avg',     -- rising trend => slow applier / weak disk
  'wsrep_cert_deps_distance'        -- use to right-size wsrep_slave_threads
);

A node that reaches Synced but shows a climbing wsrep_flow_control_paused or wsrep_local_recv_queue_avg under load is telling you its CPU or storage cannot keep pace: the hardware, not the configuration, is the constraint. Wrap the check in a Python readiness probe you can run from provisioning tooling. It uses PyMySQL and treats error codes 1213 (deadlock) and 1205 (lock-wait timeout) as transient. These are ordinary MariaDB codes rather than Galera-specific ones, but 1213 is how a certification conflict surfaces to a client, so the branch matters most if you extend the probe with a canary write; the read-only status query shown here is unlikely to raise either:

#!/usr/bin/env python3
"""Confirm a freshly provisioned node is Synced and not flow-control bound."""
import sys
import pymysql

FC_PAUSE_WARN = 0.1  # fraction of time paused by flow control

def check_node(host: str) -> int:
    try:
        conn = pymysql.connect(host=host, user="monitor",
                               password="monitor_pw", connect_timeout=5)
    except pymysql.err.OperationalError as exc:
        print(f"[FAIL] cannot reach {host}: {exc}", file=sys.stderr)
        return 1

    try:
        with conn.cursor() as cur:
            cur.execute(
                "SHOW GLOBAL STATUS WHERE Variable_name IN "
                "('wsrep_local_state_comment','wsrep_flow_control_paused')"
            )
            status = {name: val for name, val in cur.fetchall()}
    except pymysql.err.OperationalError as exc:
        code = exc.args[0]
        if code in (1213, 1205):  # deadlock / lock-wait under certification
            print(f"[WARN] transient contention ({code}); retry probe")
            return 2
        raise
    finally:
        conn.close()

    state = status.get("wsrep_local_state_comment")
    pause = float(status.get("wsrep_flow_control_paused", 0.0))
    if state != "Synced":
        print(f"[FAIL] {host} state is {state}, expected Synced")
        return 1
    if pause > FC_PAUSE_WARN:
        print(f"[WARN] {host} flow-control paused {pause:.2%} — undersized node")
        return 2
    print(f"[PASS] {host} Synced, flow-control paused {pause:.2%}")
    return 0

if __name__ == "__main__":
    sys.exit(check_node(sys.argv[1] if len(sys.argv) > 1 else "127.0.0.1"))
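Since the probe returns 2 for conditions that may clear on their own, provisioning tooling could wrap it in a bounded retry. The wrapper below is a sketch: `probe_with_retry` and its parameters are hypothetical, and it accepts any zero-argument callable that returns the probe's 0/1/2 codes.

```python
import time
from typing import Callable


def probe_with_retry(probe: Callable[[], int], attempts: int = 3,
                     delay_s: float = 2.0,
                     sleep: Callable[[float], None] = time.sleep) -> int:
    """Re-run a probe while it returns 2 (warning), with linear backoff."""
    code = 2
    for attempt in range(1, attempts + 1):
        code = probe()
        if code != 2:
            return code  # 0 = pass, 1 = hard failure: no point retrying
        if attempt < attempts:
            sleep(delay_s * attempt)
    return code  # still warning after the final attempt


# Stand-in probe that warns twice and then passes.
_results = iter([2, 2, 0])
print(probe_with_retry(lambda: next(_results), sleep=lambda _s: None))
```

With the probe above, the call would look like `probe_with_retry(lambda: check_node("10.0.1.11"))`. A node that still returns 2 after the last attempt deserves investigation rather than another retry.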

Automation Integration

Hardware validation belongs in the provisioning pipeline, gating a node before MariaDB is ever installed. The script below returns structured JSON and a non-zero exit code so an orchestrator (Ansible, Terraform provisioners, or a CI job) can halt a deployment whose hardware misses the baseline.

#!/usr/bin/env python3
"""Pre-install hardware gate for Galera nodes. Emits JSON, exits non-zero on FAIL."""
import subprocess
import json
import sys

def run_cmd(cmd: str) -> str:
    return subprocess.check_output(cmd, shell=True).decode().strip()

def validate_galera_hardware() -> None:
    report = {}

    # nproc reports logical CPUs (SMT threads included); derive physical
    # cores from lscpu when a stricter gate is required.
    cpus = int(run_cmd("nproc"))
    report["logical_cpus"] = cpus
    report["cpu_status"] = "PASS" if cpus >= 8 else "FAIL: need >= 8 logical CPUs"

    mem_kb = int(run_cmd("awk '/MemTotal/ {print $2}' /proc/meminfo"))
    mem_gb = round(mem_kb / 1024 / 1024, 2)
    report["total_memory_gb"] = mem_gb
    report["memory_status"] = "PASS" if mem_gb >= 32 else "FAIL: need >= 32 GB RAM"

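    peer = sys.argv[1] if len(sys.argv) > 1 else "127.0.0.1"
    try:
        # Summary line: "rtt min/avg/max/mdev = 0.04/0.05/0.06/0.01 ms"
        ping_out = run_cmd(f"ping -c 3 -W 1 {peer} | grep rtt")
        avg_rtt = float(ping_out.split("/")[4])
    except (subprocess.CalledProcessError, IndexError, ValueError):
        # Peer unreachable or unexpected ping output: report it, don't crash.
        avg_rtt = None
    report["avg_rtt_ms"] = avg_rtt
    if avg_rtt is None:
        report["network_status"] = f"WARN: no RTT measurement for {peer}"
    else:
        report["network_status"] = (
            "PASS" if avg_rtt <= 1.0 else f"WARN: RTT {avg_rtt} ms > 1 ms"
        )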

    print(json.dumps(report, indent=2))
    hard_gates = [report["cpu_status"], report["memory_status"]]
    sys.exit(0 if all(g == "PASS" for g in hard_gates) else 1)

if __name__ == "__main__":
    validate_galera_hardware()

An Ansible pre-task runs the same gate and fails the play before the MariaDB role executes, so an unfit node never joins the group:

- name: Gate node hardware before installing MariaDB
  ansible.builtin.command:
    cmd: /usr/local/bin/validate_galera_hardware.py "{{ galera_peer_ip }}"
  register: hw_gate
  changed_when: false
  failed_when: hw_gate.rc != 0

The idempotent playbook patterns that render this into a full node role are covered in Automating Node Provisioning with Ansible.
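An orchestrator that wants more than the exit code can read the gate's JSON and list exactly which checks failed. The sketch below assumes the report keys end in `_status`, as in the gate script above; the function name and sample report are illustrative.

```python
import json
from typing import Dict


def failing_gates(report_json: str) -> Dict[str, str]:
    """Return every *_status entry in the gate report that is not PASS."""
    report = json.loads(report_json)
    return {
        key: value
        for key, value in report.items()
        if key.endswith("_status") and value != "PASS"
    }


# Illustrative report in the shape the gate script prints.
SAMPLE = json.dumps({
    "logical_cpus": 4,
    "cpu_status": "FAIL: need >= 8 logical CPUs",
    "total_memory_gb": 62.5,
    "memory_status": "PASS",
    "avg_rtt_ms": 0.21,
    "network_status": "PASS",
})
print(failing_gates(SAMPLE))
```

Surfacing the failing keys in the pipeline log saves an operator from re-running the gate by hand to learn why a node was rejected.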

Troubleshooting

The symptoms below are the ones that trace back to a hardware or kernel-tuning gap rather than a configuration typo.

WSREP: (…) cannot get message from recv queue with rising wsrep_local_recv_queue_avg. The applier cannot drain the receive queue fast enough. Root cause is almost always undersized wsrep_slave_threads relative to physical cores, or storage that misses the 50k-IOPS budget. Confirm the fio result from Step 4, raise wsrep_slave_threads toward the physical core count, and only then reconsider gcs.fc_limit.
Joiner falls back to full SST on every restart. GCache is smaller than the write volume produced during the outage, so the IST range aged out of the ring buffer. Recompute gcache.size with the Step 2 formula against your longest maintenance window, and verify a rejoin now uses IST via the graceful node join and leave procedures.
ping -M do -s 8972 fails or hangs while a smaller payload succeeds. At least one hop has an MTU below 9000, so a jumbo frame with the do-not-fragment bit set cannot pass. Set a consistent MTU across every NIC and switch port, or drop back to MTU 1500 cluster-wide until the fabric is fixed; never run mixed MTUs.
Intermittent WSREP: view(NON_PRIM) evictions on healthy hardware. EVS timeouts are too aggressive for the network’s jitter, often after moving to a higher-latency or cross-AZ fabric. Loosen evs.suspect_timeout/evs.inactive_timeout and check ethtool -S for packet drops before blaming the node.
OOM kill of mysqld during a certification storm. Total of buffer pool + GCache + certification index exceeded RAM. Reduce innodb_buffer_pool_size to restore the 20% headroom, set vm.swappiness=1, and confirm memory math against Step 2. Log-level diagnosis of the crash is covered in Handling Galera Startup Errors & Logs.
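The memory math behind the OOM case can be checked before deployment. The sketch below follows the 20% headroom rule from Step 2; the function name is illustrative, and the certification-index size is an estimate you must supply, since this page gives no formula for it.

```python
def memory_budget(total_ram_gb: float, buffer_pool_gb: float,
                  gcache_gb: float, cert_index_gb: float,
                  reserve_fraction: float = 0.20) -> dict:
    """Check a node's memory plan against the 20% headroom rule."""
    # The buffer pool may use at most what remains after the reserve.
    ceiling = total_ram_gb * (1 - reserve_fraction)
    committed = buffer_pool_gb + gcache_gb + cert_index_gb
    return {
        "buffer_pool_ceiling_gb": round(ceiling, 2),
        "committed_gb": committed,
        "buffer_pool_ok": buffer_pool_gb <= ceiling,
        "fits_in_ram": committed < total_ram_gb,
    }


# 32 GB node from the Step 2 example; the 1 GB certification index is a guess.
print(memory_budget(32, 24, 2, 1))
```

A plan where either flag is false is a candidate for the OOM scenario above and should be resized before the node joins the cluster.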

Related

wsrep.cnf Configuration Deep Dive — the annotated parameter reference behind the knobs sized here
Initial Data Synchronization Methods — how GCache sizing drives the IST-versus-SST decision
Bootstrapping Your First Galera Cluster — the next step once every node passes the hardware gate
Automated Node Health Monitoring — turning the verification probes above into continuous telemetry
Network Security & Firewall Rules for Galera — opening 4567/4568/4444 that this page assumes are reachable
