Why should a Galera cluster have an odd number of nodes?

An odd node count lets the survivors of any single-domain failure hold a strict majority of quorum weight, so they form a Primary Component and keep accepting writes. An even count can split exactly 50/50 during a partition, and Galera correctly takes both halves non-Primary to prevent divergence, so the whole group stops serving writes.

Can I stretch one synchronous Galera group across two data centers?

You can, but it is fragile: a two-site synchronous group is prone to even-split loss of quorum, and every commit pays the WAN round-trip. The durable pattern is a synchronous group inside one low-latency site plus an asynchronous replica in the DR site. If a synchronous two-site design is mandatory, add a garbd arbitrator in a third location or bias pc.weight toward the site that must survive.

Why does a rejoining node trigger a full SST instead of a fast IST?

IST replays only the write-sets a node missed while offline, but those write-sets must still exist in a donor's gcache. If the node was down longer than the gcache window covers, the donor has purged them and the join escalates to a full SST. Size gcache.size to peak write bytes per second multiplied by your worst-case outage so rejoins favor IST.

Designing Multi-Master Topologies

This design guide builds on the architecture described in MariaDB Galera Core Architecture & Fundamentals, and solves one specific problem: how to lay out Galera nodes across hosts, availability zones, and regions so the group stays writable under failure without silently splitting into two halves that both accept writes. In an active-active group every node commits synchronously, so a topology decision that looks harmless on a whiteboard — an even number of nodes, a stretched WAN link, two data centers with no tie-breaker — becomes a production outage the first time a switch reboots. This page gives database administrators, DevOps engineers, and platform teams a repeatable method for sizing the group, placing nodes into failure domains, weighting quorum, and validating the layout before any application traffic depends on it.

Concept: What a Topology Decision Actually Controls

A Galera topology is the mapping of database nodes onto failure domains plus the quorum rules that decide which surviving partition is allowed to keep serving writes. Because commits are synchronous, the write latency of the whole group is bounded by the round-trip time to the slowest voting member, and the availability of the group is governed by whether the survivors of a partition can still form a Primary Component — a majority that holds more than half the total quorum weight. The mechanics of that synchronous commit path are unpacked in Understanding Galera Synchronous Replication; the topology sits one level above it and decides how far apart those synchronous peers are allowed to be.

Three properties fall directly out of the topology and nothing else:

Write latency floor — set by the slowest inter-node round trip inside the synchronous group. You cannot tune it away with wsrep_slave_threads; you can only shorten the physical path.
Partition survivability — set by node count and quorum weighting. An odd count with equal weights survives the loss of any single failure domain; an even count risks a 50/50 split where neither side is Primary.
Blast radius of a hot row — set by how many nodes accept writes to the same keys. Spreading writes for one hot table across every node maximizes the Write-Set Certification Process conflict rate; pinning that workload to one node all but eliminates it.

For anything spanning more than one data center, the durable pattern is a synchronous group inside one low-latency site with an asynchronous replica in the disaster-recovery site, rather than stretching one synchronous group across the WAN. That trade-off is examined in depth under when to use async replicas with Galera.

Figure: a three-node synchronous cluster in the primary site, with an asynchronous replica absorbing WAN latency in the secondary site.

Prerequisites & Environment Requirements

Before committing a topology to configuration management, confirm the following are in place on every prospective node:

Software: MariaDB 10.6 LTS through 11.x with the bundled Galera 4 provider (libgalera_smm.so). Mixing provider protocol versions across nodes prevents a Primary Component from forming, so pin one server minor version per group. Node sizing and disk/RAM baselines are covered in Galera cluster hardware requirements.
Network ports open bidirectionally between all members: TCP 3306 (SQL), 4567 (Galera group communication, also UDP for multicast if used), 4568 (IST), and 4444 (SST). Getting these firewall rules right is a prerequisite, not an afterthought — see Network Security & Firewall Rules for Galera.
Measured latency, not assumed latency: capture the real round-trip time between every pair of nodes with ping -c 100 or mtr before you decide same-AZ, cross-AZ, or cross-region. A synchronous group wants sub-millisecond to low-single-digit-millisecond RTT between voting members.
Consistent clocks: run chrony or systemd-timesyncd on every node. Clock skew above a few seconds corrupts eviction timers and SSL validity windows.
A per-node identity plan: each node needs a stable wsrep_node_name, a routable wsrep_node_address, and a segment assignment. These belong in a version-controlled drop-in as described in the wsrep.cnf configuration deep dive.
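Where ICMP is filtered between hosts, TCP connect time gives a rough stand-in for ping. The sketch below is one way to collect it, not a replacement for mtr. `tcp_rtt_ms` is a hypothetical helper, and it assumes some TCP listener (for example SSH) is already reachable on the target, since Galera's own port 4567 is not open before the node starts.

```python
import socket
import statistics
import time


def tcp_rtt_ms(host: str, port: int, samples: int = 20, timeout: float = 2.0) -> dict:
    """Time TCP connects to host:port and summarize them in milliseconds.

    A completed connect costs roughly one network round trip, so the
    minimum is a reasonable estimate of the path RTT. Raises OSError if
    the port is unreachable.
    """
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=timeout):
            pass  # connect only; no payload is sent
        timings.append((time.perf_counter() - start) * 1000.0)
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }
```

Run it from every node toward every other node and keep the worst median: that pair, not the average, is what the synchronous group will feel.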

Step-by-Step: Designing and Bringing Up the Topology

Each step below explains the reasoning, not just the command, because a topology mistake does not fail at edit time — it fails hours later as a certification storm or a non-Primary stall.

1. Size the voting group to an odd number

Start from three synchronous nodes and grow in odd increments (3, 5, 7). An odd count means any single failure domain can disappear and the survivors still hold a strict majority of quorum weight, so they stay Primary and keep accepting writes. An even count admits a partition in which each half holds exactly 50%, and Galera, correctly refusing to guess, takes both halves non-Primary to prevent divergence. The extra node also buys no additional tolerance: four nodes survive one failure, the same as three. Five nodes tolerate two simultaneous failures but pay for it with a wider certification fan-out; three is the right default for most workloads.
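The arithmetic behind the odd-count rule is small enough to sketch. The helper names below are illustrative, and the model assumes every node carries the default weight of 1.

```python
def failures_tolerated(nodes: int) -> int:
    """Most equal-weight nodes that can be lost while the survivors
    still hold strictly more than half of the votes."""
    return (nodes - 1) // 2


def can_split_evenly(nodes: int) -> bool:
    """True when a partition can leave exactly half the votes on each side."""
    return nodes % 2 == 0


for n in range(2, 8):
    print(n, failures_tolerated(n), can_split_evenly(n))
# 2 0 True
# 3 1 False
# 4 1 True
# 5 2 False
# 6 2 True
# 7 3 False
```

Each even count tolerates no more failures than the odd count below it, and adds the 50/50 split as a new failure mode.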

2. Map nodes onto failure domains

Assign nodes to availability zones so that no single zone holds a majority. With three nodes across three AZs, losing any one AZ leaves a 2-of-3 majority. With three nodes across two AZs, the AZ holding two nodes is a single point of failure for write availability — losing it drops you to one node, which is a minority and goes non-Primary. Write the domain map down explicitly:

# node1 -> az-a, node2 -> az-b, node3 -> az-c
[mysqld]
wsrep_on                = ON
wsrep_cluster_name      = orders-prod
wsrep_node_name         = node1
wsrep_node_address      = 10.0.1.10
wsrep_cluster_address   = gcomm://10.0.1.10,10.0.2.10,10.0.3.10
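The "no zone holds a majority" rule can be checked mechanically against the domain map before it reaches configuration management. This is a minimal sketch assuming equal weights; `majority_domains` and the node and zone names are hypothetical.

```python
from collections import Counter


def majority_domains(domain_map: dict[str, str]) -> list[str]:
    """Return the failure domains whose loss would leave the survivors
    without a strict majority of an equal-weight group."""
    total = len(domain_map)
    per_domain = Counter(domain_map.values())
    # Survivors stay Primary only if they hold more than half the votes.
    return sorted(
        domain
        for domain, count in per_domain.items()
        if (total - count) * 2 <= total
    )


three_az = {"node1": "az-a", "node2": "az-b", "node3": "az-c"}
two_az = {"node1": "az-a", "node2": "az-a", "node3": "az-b"}

print(majority_domains(three_az))  # []
print(majority_domains(two_az))    # ['az-a']
```

An empty list means any single domain can fail without costing write availability; anything else names the domains that are single points of failure.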

3. Tag data-center segments for multi-region groups

When nodes genuinely sit in different sites, tag each with a gmcast.segment so Galera relays each write-set across the expensive link only once per segment instead of once per remote node. Nodes sharing a segment number gossip locally and elect one relay for cross-segment traffic:

[mysqld]
# Primary site nodes share segment 1; DR-adjacent nodes share segment 2
wsrep_provider_options  = "gmcast.segment=1; evs.suspect_timeout=PT10S; evs.inactive_timeout=PT30S"

Segments do not change quorum math — they only reduce WAN chatter — so they complement, never replace, the odd-count rule.
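To see what segment tagging saves, count how many copies of one write-set cross the WAN. The sketch below is a simplified model of the relay behavior described above, not Galera's actual implementation; the function name and the five-node layout are assumptions for illustration.

```python
def wan_copies_per_writeset(site_of: dict[str, str], origin: str, segmented: bool) -> int:
    """Copies of one write-set that leave the origin's site.

    Without segments the origin sends to every remote node directly.
    With segments it sends once per remote site, and a relay in that
    site fans the write-set out locally.
    """
    origin_site = site_of[origin]
    remote_nodes = [n for n, site in site_of.items() if site != origin_site]
    if segmented:
        return len({site_of[n] for n in remote_nodes})
    return len(remote_nodes)


layout = {"n1": "dc1", "n2": "dc1", "n3": "dc1", "n4": "dc2", "n5": "dc2"}
print(wan_copies_per_writeset(layout, "n1", segmented=False))  # 2
print(wan_copies_per_writeset(layout, "n1", segmented=True))   # 1
```

The saving grows with the number of nodes in the remote site, which is why segments matter most for larger multi-site groups.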

4. Add a tie-breaker instead of a fourth full node

If business rules force a two-site synchronous layout, do not add a fourth data node to “balance” the sites — that reintroduces the even-count split. Add a Galera Arbitrator (garbd) in a third location instead. It joins group communication and votes on quorum but stores no data and applies no write-sets, so it breaks ties cheaply:

garbd --address "gcomm://10.0.1.10,10.0.2.10" \
      --group "orders-prod" \
      --sst "none" \
      --log /var/log/garbd.log

5. Weight quorum when one site must win

For an active/standby two-site design, give the primary site more quorum weight so that a WAN cut leaves the primary side Primary and the DR side non-Primary, rather than taking both sides offline. A non-Primary node rejects writes, and it rejects reads as well unless wsrep_dirty_reads is enabled. Set pc.weight higher on the site you want to survive:

[mysqld]
# Primary-site node: heavier vote
wsrep_provider_options  = "pc.weight=2"

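A partition then compares total weight on each side; the heavier side stays writable. Because wsrep_provider_options is a single string, merge pc.weight into the same semicolon-separated value as gmcast.segment and the evs timers: a second assignment in the option file replaces the first instead of extending it. Pair the weighting with health-check routing so applications never write to the losing side, a pattern detailed in Fallback Routing to Read-Only Nodes.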
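A weighted partition can be rehearsed on paper before it happens in production. This sketch assumes every listed node belonged to the last Primary Component and that none left gracefully; `primary_side` and the node names are hypothetical.

```python
from typing import Iterable, Optional


def primary_side(weights: dict[str, int], sides: Iterable[set[str]]) -> Optional[set[str]]:
    """Return the side of a partition that holds more than half of the
    total quorum weight, or None if no side does."""
    total = sum(weights.values())
    for side in sides:
        if 2 * sum(weights[node] for node in side) > total:
            return side
    return None


wan_cut = [{"a1", "a2"}, {"b1", "b2"}]
equal = {"a1": 1, "a2": 1, "b1": 1, "b2": 1}
biased = {"a1": 2, "a2": 2, "b1": 1, "b2": 1}

print(primary_side(equal, wan_cut))   # None: both halves go non-Primary
print(primary_side(biased, wan_cut))  # the site-A nodes, holding 4 of 6
```

Running the same function against every partition you can imagine (single AZ loss, WAN cut, one node plus the link) shows whether the chosen weights always produce a winner.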

6. Bootstrap once, then join the rest

Form the group by starting exactly one node with galera_new_cluster (which sets --wsrep-new-cluster and creates the initial Primary Component), then start the remaining nodes normally so they discover the group through wsrep_cluster_address and pull state via IST or SST. Never run the bootstrap command on more than one node — two bootstraps create two independent groups with the same name. The safe first-boot sequence is walked through in bootstrapping your first Galera cluster.

Parameter Deep-Dive

These are the knobs that turn an abstract topology into a group that behaves the way the design intends. Values below are production starting points, not universal constants — tune against your measured latency.

Parameter	Type	Default	Recommended for topology	Why it matters
`gmcast.segment`	integer	`0`	one distinct value per data center	Elects a single cross-segment relay so each write-set crosses the WAN once, not once per remote node.
`pc.weight`	integer	`1`	`2` on the site that must survive a split	Biases quorum so a WAN partition leaves the intended side Primary instead of taking both non-Primary.
`evs.suspect_timeout`	period	`PT5S`	`PT10S` across AZ/region	How long before a silent peer is suspected; too low causes spurious evictions on jittery links.
`evs.inactive_timeout`	period	`PT15S`	`PT30S` across AZ/region	Hard deadline before a peer is declared dead; must exceed `suspect_timeout` and cover worst-case WAN stalls.
`gcache.size`	bytes	`128M`	`2G`–`4G`, sized to peak outage write volume	Determines whether a rejoining node qualifies for fast IST or falls back to a full SST that ties up a donor.
`wsrep_slave_threads`	integer	`1`	match vCPU count, cap ~16–32	Parallel apply keeps remote write-sets from backing up and tripping flow control; it does not lower the latency floor.

Size gcache.size from workload rather than folklore: gcache.size ≈ peak_write_bytes_per_sec × expected_max_outage_seconds, with headroom. If a node is offline longer than the gcache window can cover, the donor has already purged those write-sets and the rejoin escalates to a full SST — the exact trade-off explored in Initial Data Synchronization Methods. The full precedence and syntax rules for the provider-options string live in configuring wsrep_provider_options for low latency.
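The sizing rule translates directly into a small calculation. In this sketch the 1.5x headroom factor and the function names are assumptions, not recommendations from Galera; the write rate should be write-set bytes as replicated, not raw table growth.

```python
import math


def gcache_size_bytes(peak_write_bytes_per_sec: float,
                      max_outage_seconds: float,
                      headroom: float = 1.5) -> int:
    """Bytes of gcache needed to keep every write-set produced during
    the worst expected outage, with a safety margin."""
    return math.ceil(peak_write_bytes_per_sec * max_outage_seconds * headroom)


def as_provider_option(size_bytes: int) -> str:
    """Render the size as a provider option, rounded up to whole MiB."""
    return f"gcache.size={math.ceil(size_bytes / 2**20)}M"


# 2 MiB/s of write-sets at peak, 30-minute worst-case outage
needed = gcache_size_bytes(2 * 2**20, 30 * 60)
print(as_provider_option(needed))  # gcache.size=5400M
```

One way to estimate the peak rate is to sample wsrep_replicated_bytes and wsrep_received_bytes at a fixed interval during the busiest period and take the largest delta.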

Figure: four placement choices under a single-domain failure, showing which survivors keep more than half the quorum weight and stay Primary, and which fall to a minority and go non-Primary.

Verification & Health Checks

A topology is only correct once the group agrees on its own membership and every node reports Synced and Primary. Check the whole-group view first:

SHOW GLOBAL STATUS WHERE Variable_name IN (
  'wsrep_cluster_size',        -- must equal the intended node count
  'wsrep_cluster_status',      -- must be 'Primary' on every node
  'wsrep_local_state_comment', -- must be 'Synced' before routing traffic
  'wsrep_evs_state',           -- must be 'OPERATIONAL'
  'wsrep_cluster_conf_id'      -- must match across all nodes (same view)
);

If wsrep_cluster_conf_id differs between two nodes, they are in different membership views and you are looking at a split. If wsrep_cluster_status reads non-Primary, the node is in a minority partition and is refusing writes by design.

A minimal Python probe makes the same check enforceable in CI and in orchestration. It targets Python 3.9+ and PyMySQL, and handles the two wsrep conflict codes an automation caller must expect — 1213 (deadlock / certification failure) and 1205 (lock wait timeout):

import os
import sys
import pymysql
from pymysql.constants import ER

EXPECTED_SIZE = 3  # intended node count for this group

def check_node(host: str, user: str, password: str) -> bool:
    """Return True only if the node is Primary, Synced, and sees the full group."""
    try:
        conn = pymysql.connect(host=host, user=user, password=password,
                               connect_timeout=5, read_timeout=5)
    except pymysql.err.OperationalError as exc:
        print(f"[FAIL] {host}: cannot connect: {exc}", file=sys.stderr)
        return False

    wanted = ("wsrep_cluster_size", "wsrep_cluster_status", "wsrep_local_state_comment")
    try:
        with conn.cursor() as cur:
            status = {}
            for var in wanted:
                # Each row is (Variable_name, Value); Value arrives as a string
                cur.execute("SHOW GLOBAL STATUS LIKE %s", (var,))
                row = cur.fetchone()
                status[var] = row[1] if row else None
    except pymysql.err.OperationalError as exc:
        code = exc.args[0]
        if code in (ER.LOCK_DEADLOCK, ER.LOCK_WAIT_TIMEOUT):  # 1213, 1205
            print(f"[RETRY] {host}: transient wsrep conflict {code}", file=sys.stderr)
        else:
            print(f"[FAIL] {host}: {exc}", file=sys.stderr)
        return False
    finally:
        conn.close()

    # Healthy means: majority partition, fully caught up, and no missing members
    ok = (status["wsrep_cluster_status"] == "Primary"
          and status["wsrep_local_state_comment"] == "Synced"
          and status["wsrep_cluster_size"] == str(EXPECTED_SIZE))
    print(f"[{'OK' if ok else 'FAIL'}] {host}: {status}")
    return ok

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit(f"usage: {sys.argv[0]} <host>")
    # MONITOR_USER and MONITOR_PASSWORD are illustrative variable names;
    # the fallbacks keep the original hard-coded monitor account.
    healthy = check_node(sys.argv[1],
                         os.environ.get("MONITOR_USER", "monitor"),
                         os.environ.get("MONITOR_PASSWORD", "monitor-pass"))
    sys.exit(0 if healthy else 1)
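The probe checks one node at a time, while a split only shows up when views are compared across nodes. The sketch below does that comparison on status values that have already been collected; `diagnose` and the input shape (node name mapped to its status dict) are assumptions for illustration.

```python
from collections import defaultdict


def group_by_view(statuses: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Group node names by the wsrep_cluster_conf_id each one reports."""
    views = defaultdict(list)
    for node, status in statuses.items():
        views[status["wsrep_cluster_conf_id"]].append(node)
    return {conf_id: sorted(nodes) for conf_id, nodes in views.items()}


def diagnose(statuses: dict[str, dict[str, str]]) -> str:
    """Return 'split', 'non-primary', or 'ok' for the collected statuses."""
    if len(group_by_view(statuses)) > 1:
        return "split"  # nodes disagree about membership
    if any(s["wsrep_cluster_status"] != "Primary" for s in statuses.values()):
        return "non-primary"  # one shared view, but it lacks quorum
    return "ok"


sample = {
    "node1": {"wsrep_cluster_conf_id": "12", "wsrep_cluster_status": "Primary"},
    "node2": {"wsrep_cluster_conf_id": "12", "wsrep_cluster_status": "Primary"},
    "node3": {"wsrep_cluster_conf_id": "9", "wsrep_cluster_status": "non-Primary"},
}
print(diagnose(sample))  # split
```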

Automation Integration

Treat the topology as declarative infrastructure so that a node’s failure domain, segment, and weight are properties of code, not of whoever last edited my.cnf.

Ansible: template the identity block from inventory, where az, gmcast.segment, and pc.weight are host variables. Keeping the rendered drop-in idempotent — so a re-run never rewrites and needlessly restarts a healthy node — is the discipline covered in Automating Node Provisioning with Ansible.
Terraform: encode the failure-domain map directly in placement — spread instances with availability_zone and anti-affinity so the plan itself guarantees no zone holds a majority, then hand node addresses to the config layer via cloud-init.
CI/CD: run the Python probe above as a gate after every rolling change; block the pipeline unless every node reports Synced and Primary and wsrep_cluster_size matches the intended count.
Orchestrator loop: on scale-out, add nodes in odd increments and wait for wsrep_local_state_comment=Synced before advertising the new node to the load balancer, so traffic never lands on a node still receiving state. Node add/remove sequencing is detailed in graceful node join and leave procedures, and continuous state scraping in automated node health monitoring.
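The wait-for-Synced gate in the orchestrator loop can be sketched as a small polling function. `wait_until_synced` is a hypothetical helper; `get_state` is assumed to be any callable that returns the node's current wsrep_local_state_comment, for instance a wrapper around a status query. The timeout and interval defaults are placeholders.

```python
import time
from typing import Callable


def wait_until_synced(get_state: Callable[[], str],
                      timeout_s: float = 600.0,
                      interval_s: float = 5.0,
                      sleep: Callable[[float], None] = time.sleep,
                      clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll until the node reports 'Synced' or the timeout expires.

    Returns True if the node reached Synced, False on timeout. The
    caller should only register the node with the load balancer on True.
    """
    deadline = clock() + timeout_s
    while True:
        if get_state() == "Synced":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Injecting `sleep` and `clock` keeps the function testable without waiting in real time; a joiner that needs a full SST may require a much longer timeout than one that qualifies for IST.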

Troubleshooting

Symptoms below are specific to topology and quorum, with the exact next action for each.

The error log shows "WSREP: no nodes coming from prim view, prim not possible". The node cannot see a majority and refuses to form a group. The cause is almost always a partition or an even split. Confirm reachability on port 4567 between segments, and verify that the node count is odd. If the whole group genuinely lost quorum, identify the most-advanced node by inspecting grastate.dat on every node for the highest seqno, and bootstrap from that node only.
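Comparing grastate.dat across nodes is easy to get wrong by hand under pressure. The sketch below parses the file contents and picks the node with the highest seqno; the function names are hypothetical, and it deliberately refuses to choose when any node shows seqno -1 (an unclean shutdown, where the real position has to be recovered with --wsrep-recover first) or when the state UUIDs differ.

```python
from typing import Optional


def parse_grastate(text: str) -> dict[str, str]:
    """Parse the 'key: value' lines of a grastate.dat file."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def pick_bootstrap_candidate(files: dict[str, str]) -> Optional[str]:
    """Return the node with the highest seqno, or None if the files
    do not give a safe, unambiguous answer."""
    parsed = {node: parse_grastate(text) for node, text in files.items()}
    if not parsed:
        return None
    if len({f.get("uuid") for f in parsed.values()}) != 1:
        return None  # nodes disagree about which cluster state this is
    seqnos = {node: int(f.get("seqno", "-1")) for node, f in parsed.items()}
    if any(seqno < 0 for seqno in seqnos.values()):
        return None  # at least one position is unknown; recover it first
    return max(seqnos, key=seqnos.get)
```

Treat the result as input to a human decision, not as an instruction to bootstrap automatically.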

Both sides report wsrep_cluster_status = non-Primary after a link flap. This is the even-split failure mode. You have a two-way partition where neither half holds more than 50% of quorum weight. Do not force both sides Primary — that is how you get divergent data. Restore the network path so the group re-merges, or set pc.weight asymmetrically in the design so a future partition has a defined winner, then rebuild.

A newly added node degrades write throughput across the whole group. You likely added it across a slower link, so it became the latency floor for synchronous commit, or its apply queue is backing up and engaging flow control. Check wsrep_flow_control_paused (a value above 0.1 means the group is throttling to let this node catch up) and confirm the node shares a low-latency path with its peers; move it into the correct segment or out of the voting group.

A rejoining node always triggers a full SST instead of IST. The node was offline longer than the gcache window on every eligible donor, so the needed write-sets were purged. Increase gcache.size to cover your real worst-case outage, or shorten the outage; verify a candidate donor still holds the joiner’s last seqno before forcing a join.

Cross-region node keeps getting evicted as inactive. WAN jitter is exceeding the eviction timers. Raise evs.suspect_timeout and evs.inactive_timeout to cover worst-case latency, and consider moving that node out of the synchronous group entirely in favor of an asynchronous DR replica.
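Before raising the eviction timers, it helps to lint the provider-options string so the two values stay consistent with each other and with the measured worst-case stall. This is a minimal sketch: the function names are hypothetical, the defaults are the ones listed in the parameter table, and only the seconds form of a period (such as PT10S) is handled.

```python
import re

_PERIOD = re.compile(r"^PT(\d+(?:\.\d+)?)S$")


def parse_provider_options(raw: str) -> dict[str, str]:
    """Split a 'key=value; key=value' provider-options string."""
    options = {}
    for item in raw.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition("=")
        options[key.strip()] = value.strip()
    return options


def period_seconds(value: str) -> float:
    """Convert a period written as PT<n>S into seconds."""
    match = _PERIOD.match(value)
    if not match:
        raise ValueError(f"unsupported period format: {value!r}")
    return float(match.group(1))


def timer_problems(raw: str, worst_stall_s: float = 0.0) -> list[str]:
    """Return human-readable problems with the evs timers, if any."""
    options = parse_provider_options(raw)
    suspect = period_seconds(options.get("evs.suspect_timeout", "PT5S"))
    inactive = period_seconds(options.get("evs.inactive_timeout", "PT15S"))
    problems = []
    if inactive <= suspect:
        problems.append("evs.inactive_timeout must exceed evs.suspect_timeout")
    if inactive <= worst_stall_s:
        problems.append("evs.inactive_timeout does not cover the worst-case stall")
    return problems
```

For example, `timer_problems("evs.suspect_timeout=PT20S")` reports a problem because the inactive timeout is still at its 15-second default.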

Related

MariaDB Galera Core Architecture & Fundamentals: the parent guide to components, the write-set lifecycle, and baseline parameters
Understanding Galera Synchronous Replication: why the slowest voting node sets the write-latency floor
Write-Set Certification Process Explained: how concurrent-write conflicts are resolved and why hot rows raise the conflict rate
Fallback Routing to Read-Only Nodes: routing writes only to the Primary side after a partition
Network Security & Firewall Rules for Galera: the ports every topology depends on
Initial Data Synchronization Methods: the IST/SST model behind gcache sizing
