Fallback Routing & Read-Only Nodes in MariaDB Galera: Production Implementation Guide

This procedure builds on the synchronous replication model described in MariaDB Galera Core Architecture & Fundamentals, and solves a specific operational problem: how to keep read traffic flowing when the synchronous write path degrades. Galera’s multi-master design guarantees strict consistency, but production workloads regularly hit certification bottlenecks, maintenance windows, or partial network partitions that throttle write throughput. Fallback routing paired with dedicated read-only nodes isolates heavy analytical queries, reporting pipelines, and batch jobs from the primary synchronous path, so a saturated certification queue never brings reads down with it. This guide gives you the validated configuration, proxy routing logic, verification commands, and Python automation needed to deploy resilient fallback behaviour on MariaDB 10.6–11.x with Galera 4.

Concept: What Fallback Routing and Read-Only Nodes Actually Do

Fallback routing is a proxy-enforced policy that redirects read queries away from the synchronous write pool the moment that pool drops below an operational threshold, while continuing to accept writes on whatever primary members remain healthy. A read-only node is any endpoint that serves SELECT traffic but is prevented from accepting writes — either a Galera member locked with read_only=1, or an asynchronous replica consuming binary logs from a donor node outside the write-set path.

The two concepts combine into a tiered routing model. Under normal operation the proxy splits traffic between a writer hostgroup and a synchronous reader hostgroup, both backed by Synced Galera members. When degradation is detected the proxy evicts unhealthy members to an offline hostgroup and cascades reads to a fallback hostgroup of asynchronous replicas. Because write-set propagation determines how far a fallback replica can lag, understanding the exact consistency trade-off from Understanding Galera Synchronous Replication is a prerequisite for defining sane fallback SLAs.

Fallback activation triggers when the primary synchronous pool falls below operational thresholds. Common triggers include:

Fewer than two nodes reporting wsrep_ready=ON
wsrep_flow_control_paused exceeding 0.5 for more than 30 seconds
Replication backlog (wsrep_local_recv_queue_avg) crossing a defined limit, or rising certification failures (wsrep_local_cert_failures, a cumulative counter)
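
The triggers above can be combined into a single decision function. This is a minimal sketch, not a reference implementation: the NodeSample type, the recv_queue_limit default of 50, and the paused_seconds argument (how long the caller has observed the pause ratio above its limit) are illustrative assumptions rather than values from this guide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeSample:
    """One status sample from one Galera member (hypothetical container type)."""
    wsrep_ready: str             # "ON" or "OFF"
    flow_control_paused: float   # wsrep_flow_control_paused, 0.0 to 1.0
    recv_queue_avg: float        # wsrep_local_recv_queue_avg


def should_activate_fallback(
    samples: List[NodeSample],
    paused_seconds: float,
    min_ready: int = 2,
    paused_limit: float = 0.5,
    paused_window: float = 30.0,
    recv_queue_limit: float = 50.0,  # assumed limit; tune to your workload
) -> bool:
    """Return True if any of the three degradation triggers fires."""
    # Trigger 1: fewer than min_ready nodes report wsrep_ready=ON.
    ready = sum(1 for s in samples if s.wsrep_ready == "ON")
    if ready < min_ready:
        return True
    # Trigger 2: flow control paused above the limit for longer than the window.
    worst_pause = max(s.flow_control_paused for s in samples)
    if worst_pause > paused_limit and paused_seconds > paused_window:
        return True
    # Trigger 3: replication backlog over the limit on any node.
    return any(s.recv_queue_avg > recv_queue_limit for s in samples)
```

Keeping the decision pure (no database calls) makes it easy to unit-test the thresholds separately from the probe that collects the samples.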

ProxySQL fans one application connection into tiered hostgroups: writes and synchronous reads stay on Synced Galera members, reads spill to the async fallback tier only when hostgroup 20 drains, and any desynced or non-Primary member is evicted to the offline hostgroup.

Prerequisites & Environment Requirements

Before configuring fallback routing, confirm the following are in place across every node and the proxy tier:

MariaDB 10.6+ with Galera 4 (wsrep_provider = /usr/lib/galera/libgalera_smm.so) on all cluster members. Mixed minor versions are tolerated for rolling upgrades but not for steady-state operation.
ProxySQL 2.4+ recommended. Native mysql_galera_hostgroups support arrived in ProxySQL 2.0; only 1.4.x and earlier need an external scheduler script to move nodes between hostgroups.
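A monitor MySQL account on every backend with USAGE and REPLICATION CLIENT, plus REPLICA MONITOR on MariaDB 10.5.9 and later, which SHOW REPLICA STATUS requires there. ProxySQL and the Python probe both authenticate with this account; reading status through information_schema needs no extra grant.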
Network reachability on 3306 (SQL), 4567/tcp+udp (group communication), 4568/tcp (IST), and 4444/tcp (SST) between Galera members, plus 6032 (ProxySQL admin, MySQL protocol) reachable from your automation host. Validate these against your Network Security & Firewall Rules for Galera baseline before proceeding.
Asynchronous replicas primed from a donor using GTID-based replication if you intend to serve fallback reads from outside the certification path.

Choosing a fallback endpoint is a trade of certification cost against consistency: a synchronous read-only member gives zero lag but taxes every node's certification path, while an async replica offloads that work entirely at the price of bounded staleness.

Step-by-Step Procedure

1. Provision the read-only endpoints

Decide whether each fallback endpoint is a synchronous Galera member or an asynchronous replica, then apply the matching configuration. A Galera member with read_only=1 stays in the group, applies write-sets through its applier threads, and lags by no more than its apply queue, at the cost of continuous certification CPU; reads that must observe the latest commit also need wsrep_sync_wait. An async replica consumes binlogs via CHANGE MASTER TO (the MariaDB releases covered here use that statement rather than MySQL's CHANGE REPLICATION SOURCE TO), bypassing certification entirely but introducing measurable lag. The operational decision matrix is covered in depth in When to Use Async Replicas with Galera.

# Synchronous read-only node (Galera member)
[mysqld]
wsrep_on=1
read_only=1
wsrep_slave_threads=8
wsrep_provider_options="gcache.size=4G;gcs.fc_limit=128"

# Asynchronous replica (binlog consumer, outside the write-set path)
[mysqld]
read_only=1
relay_log_recovery=1
slave_parallel_threads=4
slave_parallel_mode=optimistic

MariaDB has no super_read_only variable (it is a MySQL and Percona Server option), and an unknown option in the configuration file stops mysqld from starting, so only read_only=1 is set here. Plain read_only still lets accounts holding the READ ONLY ADMIN privilege write (before MariaDB 10.5.2 that bypass belonged to SUPER), which is a real risk when automation or migration tooling connects as an admin. Close the bypass by making sure no account that can reach a fallback endpoint holds that privilege, so the endpoint can never silently diverge.
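
One way to audit for accounts that can still write on a read_only=1 endpoint is to scan SHOW GRANTS output for global privileges that bypass the flag. The sketch below is an assumption-laden helper, not part of any MariaDB tooling: the function name is hypothetical, it expects the grant statements as plain strings, and it conservatively flags SUPER as well as READ ONLY ADMIN (either spelling) and ALL PRIVILEGES, because which of these bypasses read_only depends on the server version.

```python
import re
from typing import Iterable, List

# Privileges treated as able to bypass read_only=1. Flagging SUPER is deliberately
# conservative: on newer releases the bypass is READ ONLY ADMIN alone.
_BYPASS = re.compile(r"\b(SUPER|READ[_ ]ONLY ADMIN|ALL PRIVILEGES)\b", re.IGNORECASE)


def grants_bypassing_read_only(grant_lines: Iterable[str]) -> List[str]:
    """Return the global (ON *.*) GRANT statements that may allow writes on a read-only node."""
    return [
        g for g in grant_lines
        if " ON *.* " in g.upper() and _BYPASS.search(g)
    ]
```

Feed it the rows returned by SHOW GRANTS FOR each account that connects through the proxy; an empty result means no account in that set carries a global bypass privilege.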

2. Register the backends in ProxySQL hostgroups

Map every backend to a hostgroup: writers (10), synchronous readers (20), and fallback async reads (30). The max_replication_lag column matters only for hostgroup 30 — it lets ProxySQL evict a lagging replica automatically, whereas synchronous members always report zero lag.

-- Hostgroup mapping
INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight, max_connections, max_replication_lag, comment) VALUES
(10, 'galera-node-01', 3306, 100, 500, 0, 'Primary Writes'),
(10, 'galera-node-02', 3306, 100, 500, 0, 'Primary Writes'),
(10, 'galera-node-03', 3306, 100, 500, 0, 'Primary Writes'),
(20, 'galera-node-01', 3306, 100, 1000, 0, 'Sync Reads'),
(20, 'galera-node-02', 3306, 100, 1000, 0, 'Sync Reads'),
(30, 'async-replica-01', 3306, 100, 2000, 10, 'Fallback Async Reads');

3. Enable native Galera health checking

ProxySQL’s Galera checker reads each node’s wsrep state directly and moves non-Primary or desynced nodes into the offline hostgroup without any external script. Galera health cannot be tested with an ad-hoc SELECT ... WHERE wsrep_ready=1 against application tables, because wsrep_ready and wsrep_cluster_status are status variables (exposed through SHOW GLOBAL STATUS and information_schema.GLOBAL_STATUS), not table columns, so the native checker is the correct mechanism.

-- Monitor cadence: these variables take numeric values in milliseconds.
UPDATE global_variables SET variable_value=2000 WHERE variable_name='mysql-monitor_galera_healthcheck_interval';
UPDATE global_variables SET variable_value=2000 WHERE variable_name='mysql-monitor_query_interval';
UPDATE global_variables SET variable_value=2000 WHERE variable_name='mysql-monitor_ping_interval';
UPDATE global_variables SET variable_value=600 WHERE variable_name='mysql-monitor_connect_timeout';

-- Native Galera monitoring: writers = 10, backup writers = 11, synchronous
-- readers = 20, offline = 9. Non-Primary or desynced nodes are moved to the
-- offline hostgroup; a node whose wsrep_local_recv_queue exceeds
-- max_transactions_behind is taken out of rotation.
-- writer_is_also_reader = 1 keeps the writers in hostgroup 20 as well. With 2,
-- only backup writers serve reads, and max_writers = 3 leaves none on three nodes.
INSERT INTO mysql_galera_hostgroups
  (writer_hostgroup, backup_writer_hostgroup, reader_hostgroup, offline_hostgroup,
   active, max_writers, writer_is_also_reader, max_transactions_behind)
VALUES (10, 11, 20, 9, 1, 3, 1, 100);

4. Define the routing rules

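Route write-intent statements (including SELECT ... FOR UPDATE) to the writer hostgroup and plain reads to the synchronous readers. ProxySQL does not fail a query rule over to another hostgroup on its own: if hostgroup 20 empties out during degradation, reads sent there wait for a backend and eventually time out. The cascade to the fallback tier therefore happens when rule 2's destination_hostgroup is switched to 30, which is the job of the automation layer.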

-- Routing rules
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, destination_hostgroup, apply) VALUES
(1, 1, '^SELECT.*FOR UPDATE', 10, 1),
(2, 1, '^SELECT', 20, 1),
(3, 1, '^(INSERT|UPDATE|DELETE|REPLACE)', 10, 1);

LOAD MYSQL SERVERS TO RUNTIME;
LOAD MYSQL QUERY RULES TO RUNTIME;
LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
SAVE MYSQL QUERY RULES TO DISK;
SAVE MYSQL VARIABLES TO DISK;

When a node stops reporting wsrep_cluster_status=Primary or leaves the Synced state, ProxySQL removes it from hostgroups 10 and 20. Read traffic reaches hostgroup 30 only after rule 2 is re-pointed there, because ProxySQL has no built-in hostgroup-to-hostgroup failover. For scheduler syntax and query-rule precedence, consult the ProxySQL documentation.
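
To reason about which hostgroup a statement lands in, the three rules can be modelled as a first-match lookup in rule_id order. This is a simplified sketch of the matching behaviour, not ProxySQL's implementation: it assumes case-insensitive matching, assumes the application user's default hostgroup is 10, and ignores digests, flags, and rule chaining.

```python
import re
from typing import List, Tuple

# (rule_id, match_pattern, destination_hostgroup), copied from the rules in this step.
RULES: List[Tuple[int, str, int]] = [
    (1, r"^SELECT.*FOR UPDATE", 10),
    (2, r"^SELECT", 20),
    (3, r"^(INSERT|UPDATE|DELETE|REPLACE)", 10),
]


def route(sql: str, rules: List[Tuple[int, str, int]] = RULES,
          default_hostgroup: int = 10) -> int:
    """Return the hostgroup of the first matching rule, evaluated in rule_id order."""
    for _rule_id, pattern, hostgroup in sorted(rules):
        if re.search(pattern, sql, re.IGNORECASE | re.DOTALL):
            return hostgroup
    # No rule matched: the statement goes to the user's default hostgroup (assumed 10).
    return default_hostgroup


# Fallback is modelled by re-pointing rule 2 at the async tier.
FALLBACK_RULES = [(rid, pat, 30 if rid == 2 else hg) for rid, pat, hg in RULES]
```

Note that a statement starting with a comment or whitespace does not match the anchored ^SELECT pattern in this model and falls through to the default hostgroup, so rule order and anchoring are worth testing against real application queries.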

Parameter Deep-Dive

These are the knobs that most directly govern fallback behaviour. Values shown are production baselines for a three-node cluster on 10 GbE with sub-1 ms RTT.

Parameter	Scope	Baseline	Why it matters for fallback
`max_transactions_behind`	ProxySQL galera hostgroup	`100`	Write-set queue depth at which a reader is evicted to offline. Too high serves stale reads; too low flaps nodes during normal flow control.
`max_replication_lag`	ProxySQL server (hg 30)	`10` (s)	Async fallback eviction threshold, compared against `Seconds_Behind_Master`. Set to your read-consistency SLA.
`wsrep_slave_threads`	Galera member	`8`	Parallel apply threads on read-only members; higher values drain the recv queue faster so a read-only node stays `Synced` under write bursts.
`wsrep_provider_options` gcache.size	Galera member	`4G`	Sizes the write-set cache (tune it in wsrep.cnf); a larger gcache keeps IST viable so a returning read-only node avoids a full SST.
`gcs.fc_limit`	Galera member	`128`	Flow-control queue limit; raising it on read-only members reduces cluster-wide `wsrep_flow_control_paused` spikes that would otherwise trip fallback prematurely.
`mysql-monitor_ping_interval`	ProxySQL global	`2000` (ms)	Detection latency for a dead backend. Lower reacts faster but adds monitor connection load.

Disabling wsrep_certify_nonPK on synchronous read-only members reduces certification overhead, but only do so if every replicated table has a primary key — otherwise you risk silent divergence on unindexed UPDATE/DELETE.

Verification & Health Checks

After loading the configuration, confirm each tier is behaving before you rely on it during an incident.

Check that Galera members are Synced and forming a primary component:

SHOW GLOBAL STATUS WHERE Variable_name IN
  ('wsrep_ready','wsrep_cluster_status','wsrep_cluster_size','wsrep_local_state_comment');

Confirm ProxySQL agrees on backend state and that no writer has been wrongly evicted:

-- Run against the ProxySQL admin interface on 6032
SELECT hostgroup_id, hostname, status, max_replication_lag
FROM runtime_mysql_servers ORDER BY hostgroup_id;

SELECT hostname, primary_partition, read_only, wsrep_local_state, wsrep_local_recv_queue, error
FROM monitor.mysql_server_galera_log
ORDER BY time_start_us DESC LIMIT 10;

Validate that an async fallback replica is inside its lag budget before it can receive traffic:

SHOW REPLICA STATUS\G
-- Inspect: Slave_IO_Running=Yes, Slave_SQL_Running=Yes, Seconds_Behind_Master < 10
-- (MariaDB keeps the Slave_* / *_Master field names even under SHOW REPLICA STATUS)
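
The same inspection can be expressed as a small predicate over one row of SHOW REPLICA STATUS fetched with a dictionary cursor. This is a sketch under assumptions: the function name is hypothetical, the field names are the ones MariaDB reports (Slave_IO_Running, Slave_SQL_Running, Seconds_Behind_Master), and the default budget of 10 seconds mirrors the max_replication_lag value used in this guide.

```python
from typing import Mapping, Optional


def replica_within_budget(
    status: Optional[Mapping[str, object]],
    max_lag_seconds: int = 10,
) -> bool:
    """Return True only if both replication threads run and lag is under the budget."""
    if not status:
        # No row at all means replication is not configured on this host.
        return False
    io_ok = status.get("Slave_IO_Running") == "Yes"
    sql_ok = status.get("Slave_SQL_Running") == "Yes"
    lag = status.get("Seconds_Behind_Master")
    # Lag is NULL (None) whenever the SQL thread is stopped or the link is broken.
    if not (io_ok and sql_ok) or lag is None:
        return False
    return int(lag) < max_lag_seconds
```

Treating a NULL lag as a failure matters: a replica with a stopped SQL thread reports no lag value at all, which must not be mistaken for zero lag.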

A lightweight Python probe you can drop into a monitoring loop:

import mysql.connector

def is_synced(host: str) -> bool:
    """Return True only if the node is Synced and in a Primary component."""
    try:
        conn = mysql.connector.connect(
            host=host, port=3306, user="monitor", password="monitor_pass",
            connection_timeout=3,
        )
        cur = conn.cursor(dictionary=True)
        cur.execute(
            "SHOW GLOBAL STATUS WHERE Variable_name IN "
            "('wsrep_ready','wsrep_cluster_status','wsrep_local_state_comment')"
        )
        s = {r["Variable_name"].lower(): r["Value"] for r in cur.fetchall()}
        cur.close()
        conn.close()
        return (
            s.get("wsrep_ready") == "ON"
            and s.get("wsrep_cluster_status") == "Primary"
            and s.get("wsrep_local_state_comment") == "Synced"
        )
    except mysql.connector.Error:
        return False

Automation Integration

Platform teams need programmatic control over fallback transitions rather than manual LOAD ... TO RUNTIME commands during an incident. The script below validates cluster health, decides whether fallback should activate, and reconfigures ProxySQL through its admin interface (MySQL protocol on port 6032). The ProxySQL update is safe to re-run, and any connection or query failure on a node, including errors 1213 (deadlock) and 1205 (lock wait timeout), is treated as an unhealthy reading.

import mysql.connector
import logging
from typing import Dict, List

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")

PROXY_ADMIN_HOST = "proxysql-admin"
PROXY_ADMIN_PORT = 6032  # ProxySQL admin interface (MySQL protocol, not HTTP/REST)
PROXY_CREDS = ("admin", "admin")


def check_galera_health(host: str, port: int = 3306) -> Dict:
    """Validate wsrep state and return operational metrics."""
    try:
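        # No pool_name here: a shared named pool is created for the first host
        # only, so every later check would silently reuse that first node.
        conn = mysql.connector.connect(
            host=host, port=port, user="monitor", password="monitor_pass",
            connection_timeout=3,
        )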
        cursor = conn.cursor(dictionary=True)
        cursor.execute(
            """
            SELECT VARIABLE_NAME, VARIABLE_VALUE
            FROM information_schema.GLOBAL_STATUS
            WHERE VARIABLE_NAME IN ('wsrep_ready', 'wsrep_cluster_status', 'wsrep_local_recv_queue')
            """
        )
        metrics = {row["VARIABLE_NAME"].lower(): row["VARIABLE_VALUE"] for row in cursor.fetchall()}
        cursor.close()
        conn.close()
        return metrics
    except mysql.connector.Error as e:
        # 1213 / 1205 surface here under contention; treat any failure as unhealthy.
        logging.error(f"Connection failed to {host}:{port} - {e}")
        return {"wsrep_ready": "OFF", "wsrep_cluster_status": "Disconnected"}


def evaluate_fallback_readiness(nodes: List[str]) -> bool:
    """Return True when fewer than two nodes are Primary+ready, i.e. activate fallback."""
    healthy_count = 0
    for node in nodes:
        state = check_galera_health(node)
        if state.get("wsrep_ready") == "ON" and state.get("wsrep_cluster_status") == "Primary":
            healthy_count += 1
    return healthy_count < 2


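def push_fallback_routing(admin_host: str, admin_port: int, creds: tuple):
    """Point the read rule at the async tier via the ProxySQL admin interface (port 6032).

    Writers in hostgroup 10 are left untouched so healthy members keep accepting writes.
    """
    conn = mysql.connector.connect(
        host=admin_host, port=admin_port, user=creds[0], password=creds[1],
        connection_timeout=5,
    )
    try:
        cursor = conn.cursor()
        # Idempotency check: skip the reload if rule 2 already targets hostgroup 30.
        cursor.execute("SELECT destination_hostgroup FROM mysql_query_rules WHERE rule_id=2")
        rows = cursor.fetchall()
        if rows and int(rows[0][0]) == 30:
            logging.info("Fallback routing already active.")
        else:
            cursor.execute("UPDATE mysql_query_rules SET destination_hostgroup=30 WHERE rule_id=2")
            cursor.execute("LOAD MYSQL QUERY RULES TO RUNTIME")
            cursor.execute("SAVE MYSQL QUERY RULES TO DISK")
            logging.info("Fallback routing activated successfully.")
        cursor.close()
    finally:
        conn.close()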


if __name__ == "__main__":
    # Same hostnames as registered in mysql_servers.
    GALERA_NODES = ["galera-node-01", "galera-node-02", "galera-node-03"]
    if evaluate_fallback_readiness(GALERA_NODES):
        logging.warning("Cluster degraded. Activating fallback routing.")
        push_fallback_routing(PROXY_ADMIN_HOST, PROXY_ADMIN_PORT, PROXY_CREDS)
    else:
        logging.info("Cluster healthy. Standard routing maintained.")

For an Ansible or CI/CD deployment, template the ProxySQL admin credentials from a vault rather than hardcoding them, run the health evaluation as a scheduled job (systemd timer or Kubernetes CronJob), and add exponential backoff around the admin connection so a transient proxy restart does not flap routing. The certification-queue logic here aligns with the Write-Set Certification Process Explained, so fallback fires only when certification contention actually threatens write-latency SLAs. Reuse the same probe from the broader automated node health monitoring patterns to avoid maintaining two divergent health definitions.
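
The exponential backoff mentioned above could look like the following wrapper. It is a generic sketch, not tied to any ProxySQL API: the function name, attempt count, and delay values are illustrative assumptions, and production code would catch mysql.connector.Error rather than every Exception.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(
    action: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action(), retrying with a doubling delay; re-raise after the last attempt."""
    for attempt in range(attempts):
        try:
            return action()
        except Exception:  # narrow this to mysql.connector.Error in real use
            if attempt == attempts - 1:
                raise
            # Delay doubles each retry (0.5 s, 1 s, 2 s, ...) and is capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise ValueError("attempts must be at least 1")
```

Wrapping the admin call, for example with_backoff(lambda: push_fallback_routing(PROXY_ADMIN_HOST, PROXY_ADMIN_PORT, PROXY_CREDS)), lets a proxy restart of a few seconds pass without the job failing outright. The injectable sleep argument exists only so the delay schedule can be tested without waiting.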

Troubleshooting

Reads are being served stale after failover. The async fallback replica exceeded its lag budget but ProxySQL never evicted it. Confirm max_replication_lag is non-zero on the hostgroup-30 server and that mysql-monitor_replication_lag_interval is short enough to catch drift. Check what the monitor actually measured with SELECT hostname, repl_lag, error FROM monitor.mysql_server_replication_lag_log ORDER BY time_start_us DESC LIMIT 10 on the admin interface; stats_mysql_connection_pool carries no lag column.

All nodes evicted to the offline hostgroup during a brief flow-control pause. max_transactions_behind is too low for your write burst profile, so normal flow control looks like degradation. Raise it toward 200 and correlate with wsrep_flow_control_paused; a pause ratio under 0.1 is healthy and should not trip eviction.
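
Because wsrep_flow_control_paused is averaged since the last FLUSH STATUS, a per-interval ratio is easier to correlate with evictions. A minimal sketch, assuming you sample the cumulative wsrep_flow_control_paused_ns counter at the start and end of each interval (the function name is hypothetical):

```python
def pause_ratio(paused_ns_start: int, paused_ns_end: int, interval_seconds: float) -> float:
    """Fraction of the interval spent paused by flow control, between 0.0 and 1.0."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # The counter restarts from zero after a node restart; clamp negative deltas to 0.
    delta_ns = max(0, paused_ns_end - paused_ns_start)
    return min(1.0, delta_ns / (interval_seconds * 1e9))
```

For example, 3 seconds of accumulated pause over a 60-second interval gives a ratio of 0.05, comfortably under the 0.1 level described above as healthy.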

WSREP has not yet prepared node for application use on a returning read-only node. The member is in Joiner or Donor state and must not receive traffic. Monitor wsrep_local_state_comment and only route once it reads Synced. If it is stuck in Joiner, the gcache window on the donor was too small and a full SST is running — size gcache.size to cover your longest maintenance window.
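
A rough way to size gcache.size for a maintenance window is to multiply the observed replication rate by the window length and add headroom. The sketch below is back-of-envelope arithmetic under assumptions: the rate is taken from deltas of wsrep_replicated_bytes plus wsrep_received_bytes, the 1.5 headroom factor is a hypothetical safety margin, and per-write-set overhead inside the cache is ignored.

```python
import math


def gcache_bytes_needed(
    replicated_bytes_per_second: float,
    window_seconds: float,
    headroom: float = 1.5,  # assumed safety margin, not a Galera default
) -> int:
    """Estimate the gcache size that keeps IST possible for an outage of window_seconds."""
    if replicated_bytes_per_second < 0 or window_seconds < 0:
        raise ValueError("rate and window must be non-negative")
    return math.ceil(replicated_bytes_per_second * window_seconds * headroom)
```

At 500 KB/s of write-set traffic, a two-hour window needs about 5.4 GB with this margin, which would already exceed the 4G baseline used earlier in this guide.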

Fallback never activates during a real partition because Galera blocks first. pc.wait_prim=TRUE (the safe default that prevents split-brain) can hold a node before proxy-level checks fire. Keep the default, and rely on ProxySQL’s independent health checks to evict the isolated node; only set pc.wait_prim=false during controlled, single-node maintenance, never as a standing configuration.

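Async replica breaks with duplicate-key errors after promotion. A write reached the replica outside the binlog stream, usually because a tool connected with a privilege that bypasses read_only (READ ONLY ADMIN, or SUPER on older releases). Re-clone from a donor, keep read_only=1, remove the bypassing privilege from every account that can reach the replica, and gate promotions behind a Seconds_Behind_Master check.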

Runbook checklist for fallback activation:

Verify wsrep_ready=ON on at least two primary nodes before disabling fallback
Confirm ProxySQL runtime mysql_servers status matches the intended hostgroup layout
Validate async replica Seconds_Behind_Master is under the configured threshold
Execute the LOAD MYSQL ... TO RUNTIME and SAVE MYSQL ... TO DISK pair that matches what you changed (SERVERS or QUERY RULES) after routing changes
Monitor wsrep_flow_control_paused and wsrep_local_cert_failures post-failover
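
The first checklist item can be hardened against flapping by requiring several consecutive healthy probes before fallback is disabled. A minimal sketch, assuming the caller keeps a history of how many nodes were ready and Primary on each probe; the function name and the default of three passes are illustrative.

```python
from typing import Sequence


def safe_to_disable_fallback(
    ready_counts: Sequence[int],
    min_ready: int = 2,
    required_passes: int = 3,
) -> bool:
    """ready_counts holds the ready-node count from successive probes, oldest first."""
    if len(ready_counts) < required_passes:
        # Not enough observations yet to trust the recovery.
        return False
    # Every one of the most recent probes must meet the quorum.
    return all(n >= min_ready for n in ready_counts[-required_passes:])
```

With probes every few seconds, three passes delay fail-back only slightly while preventing a single lucky reading during a partition from restoring synchronous routing too early.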

Implementing fallback routing and read-only nodes turns Galera from a rigid synchronous cluster into a workload-aware platform that stays available through certification storms, maintenance windows, and partial partitions — without compromising data consistency or operational control.

Related

Understanding Galera Synchronous Replication — the zero-lag consistency model that fallback SLAs are measured against
When to Use Async Replicas with Galera — choosing synchronous read-only members versus binlog replicas
Write-Set Certification Process Explained — the certification contention that drives fallback triggers
Network Security & Firewall Rules for Galera — opening 3306/4567/4568/6032 for the proxy and fallback tiers
Automated Node Health Monitoring — reusing one health definition across probes and routing logic
