Fallback Routing & Read-Only Nodes in MariaDB Galera: Production Implementation Guide
Synchronous multi-master replication guarantees strict consistency, but production workloads frequently encounter certification bottlenecks, maintenance windows, or partial network partitions that degrade write throughput. Implementing fallback routing alongside dedicated read-only nodes isolates heavy analytical queries, reporting pipelines, and batch operations from the primary synchronous path. This guide provides validated configuration patterns, proxy routing logic, and Python automation workflows required to deploy resilient fallback architectures in enterprise Galera environments.
Architectural Positioning & Routing Strategy
Galera’s deterministic state transfer and quorum-based decision making form the baseline for cluster stability, as documented in MariaDB Galera Core Architecture & Fundamentals. Routing all client traffic directly to synchronous nodes, however, forces read-heavy workloads to compete for certification resources and network bandwidth. Production deployments require a stateless proxy layer that evaluates node health, replication state, and queue depth before routing queries.
Fallback routing triggers when the primary synchronous pool falls below operational thresholds. Common triggers include:
- Fewer than two nodes reporting
wsrep_ready=ON wsrep_flow_control_pausedexceeding 0.5 for >30 seconds- Replication backlog (
wsrep_local_recv_queue_avg) crossing defined limits, or rising certification failures (wsrep_local_cert_failures, a cumulative counter)
The proxy must maintain separate hostgroups for primary writes, synchronous reads, and fallback endpoints. When degradation is detected, writes are restricted to the remaining healthy Galera nodes while read traffic shifts to designated fallback endpoints. These endpoints typically consist of either Galera nodes explicitly locked to read_only=1 or asynchronous replicas consuming binary logs from a designated donor. Understanding how write-set propagation impacts fallback latency is essential when defining consistency SLAs; the Understanding Galera Synchronous Replication reference details the exact trade-offs between zero-lag synchronous consumption and eventual consistency models.
Figure: ProxySQL hostgroup routing — writes and synchronous reads stay on Galera, degraded nodes are evicted to the offline group, and reads can spill to async replicas.
flowchart TD
App["Application"] --> PX["ProxySQL"]
PX -->|"writes"| W["Writer hostgroup 10 (Galera nodes)"]
PX -->|"reads"| RD["Reader hostgroup 20 (synced nodes)"]
PX -. "fallback reads" .-> FB["Async replicas, hostgroup 30"]
W -. "desynced / non-Primary" .-> OFF["Offline hostgroup 9"]
RD -. "desynced / non-Primary" .-> OFF
Read-Only Node Implementation & State Synchronization
Deploying read-only nodes requires explicit state synchronization strategy selection. A Galera node configured with wsrep_on=1 and read_only=1 remains fully synchronized with the cluster. It participates in group communication, applies write-sets via the replication applier threads, and guarantees zero replication lag. The trade-off is CPU and network overhead from continuous certification processing, which can be mitigated by tuning wsrep_slave_threads and disabling wsrep_certify_nonPK for non-primary workloads.
For reporting, ETL pipelines, or cross-region reads, asynchronous replicas are generally preferred. They consume binlogs via standard MariaDB replication (CHANGE REPLICATION SOURCE TO), bypassing the Galera certification layer entirely. This reduces primary node CPU pressure but introduces measurable replication lag. The operational decision matrix for selecting between synchronous read-only nodes and async replicas is thoroughly covered in When to Use Async Replicas with Galera.
Critical Configuration Parameters:
# Synchronous Read-Only Node (Galera Member)
[mysqld]
wsrep_on=1
read_only=1
super_read_only=1
wsrep_slave_threads=8
wsrep_provider_options="gcache.size=4G;gcs.fc_limit=128"
# Asynchronous Replica (Binlog Consumer)
[mysqld]
read_only=1
super_read_only=1
relay_log_recovery=1
slave_parallel_threads=4
slave_parallel_mode=optimistic
Enforce super_read_only=1 to prevent accidental privilege escalation bypasses. Validate replication topology using SHOW REPLICA STATUS\G and monitor Seconds_Behind_Source before promoting fallback endpoints to active routing pools.
Proxy Configuration & Fallback Logic
ProxySQL is the industry standard for dynamic Galera routing due to its native hostgroup management, query rewriting, and health check integration. The following configuration establishes a three-tier routing topology with automatic fallback activation.
-- Hostgroup Mapping
INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight, max_connections, max_replication_lag, comment) VALUES
(10, 'galera-node-01', 3306, 100, 500, 0, 'Primary Writes'),
(10, 'galera-node-02', 3306, 100, 500, 0, 'Primary Writes'),
(10, 'galera-node-03', 3306, 100, 500, 0, 'Primary Writes'),
(20, 'galera-node-01', 3306, 100, 1000, 0, 'Sync Reads'),
(20, 'galera-node-02', 3306, 100, 1000, 0, 'Sync Reads'),
(30, 'async-replica-01', 3306, 100, 2000, 10, 'Fallback Async Reads');
-- Monitor cadence — these variables take numeric values in milliseconds.
-- Galera health cannot be tested with an ad-hoc "SELECT ... WHERE wsrep_ready=1"
-- query; wsrep_ready / wsrep_cluster_status are status variables, not selectable
-- columns. Use ProxySQL's native Galera checker instead (below).
UPDATE global_variables SET variable_value=2000 WHERE variable_name='mysql-monitor_query_interval';
UPDATE global_variables SET variable_value=2000 WHERE variable_name='mysql-monitor_ping_interval';
UPDATE global_variables SET variable_value=600 WHERE variable_name='mysql-monitor_connect_timeout';
-- Native Galera monitoring: ProxySQL reads each node's wsrep state and moves
-- non-Primary / desynced nodes into the offline hostgroup (9) automatically.
-- Writers = 10, synchronous readers = 20.
INSERT INTO mysql_galera_hostgroups
(writer_hostgroup, backup_writer_hostgroup, reader_hostgroup, offline_hostgroup,
active, max_writers, writer_is_also_reader, max_transactions_behind)
VALUES (10, 11, 20, 9, 1, 3, 2, 100);
-- Routing Rules
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, destination_hostgroup, apply) VALUES
(1, 1, '^SELECT.*FOR UPDATE', 10, 1),
(2, 1, '^SELECT', 20, 1),
(3, 1, '^(INSERT|UPDATE|DELETE|REPLACE)', 10, 1);
Fallback activation relies on ProxySQL’s scheduler or external monitoring hooks. When wsrep_ready=0 or wsrep_cluster_status deviates from Primary, the proxy automatically removes degraded nodes from hostgroup 10 and 20. Read traffic then cascades to hostgroup 30. For detailed ProxySQL configuration syntax and scheduler integration, consult the ProxySQL Documentation.
Python Automation & Operational Workflows
Platform teams and DevOps engineers require programmatic control over fallback routing. The following Python automation script validates cluster health, calculates fallback readiness, and pushes hostgroup reconfiguration to ProxySQL through its admin interface (MySQL protocol on port 6032). It uses connection pooling, idempotent state checks, and explicit error handling for production resilience.
import mysql.connector
import logging
from typing import Dict, List
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
PROXY_ADMIN_HOST = "proxysql-admin"
PROXY_ADMIN_PORT = 6032 # ProxySQL admin interface (MySQL protocol, not HTTP/REST)
PROXY_CREDS = ("admin", "admin")
def check_galera_health(host: str, port: int = 3306) -> Dict:
"""Validate wsrep state and return operational metrics."""
try:
conn = mysql.connector.connect(
host=host, port=port, user='monitor', password='monitor_pass',
connection_timeout=3, pool_name='galera_monitor'
)
cursor = conn.cursor(dictionary=True)
cursor.execute("""
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM information_schema.GLOBAL_STATUS
WHERE VARIABLE_NAME IN ('wsrep_ready', 'wsrep_cluster_status', 'wsrep_local_recv_queue')
""")
metrics = {row['VARIABLE_NAME'].lower(): row['VARIABLE_VALUE'] for row in cursor.fetchall()}
cursor.close()
conn.close()
return metrics
except mysql.connector.Error as e:
logging.error(f"Connection failed to {host}:{port} - {e}")
return {"wsrep_ready": "OFF", "wsrep_cluster_status": "Disconnected"}
def evaluate_fallback_readiness(nodes: List[str]) -> bool:
"""Determine if fallback routing should activate."""
healthy_count = 0
for node in nodes:
state = check_galera_health(node)
if state.get("wsrep_ready") == "ON" and state.get("wsrep_cluster_status") == "Primary":
healthy_count += 1
return healthy_count < 2
def push_fallback_routing(admin_host: str, admin_port: int, creds: tuple):
"""Reconfigure ProxySQL via its admin interface (MySQL protocol on port 6032)."""
conn = mysql.connector.connect(
host=admin_host, port=admin_port, user=creds[0], password=creds[1],
connection_timeout=5
)
try:
cursor = conn.cursor()
cursor.execute("UPDATE mysql_servers SET status='OFFLINE_SOFT' WHERE hostgroup_id IN (10, 20)")
cursor.execute("LOAD MYSQL SERVERS TO RUNTIME")
cursor.execute("SAVE MYSQL SERVERS TO DISK")
conn.commit()
cursor.close()
logging.info("Fallback routing activated successfully.")
finally:
conn.close()
if __name__ == "__main__":
GALERA_NODES = ["galera-01", "galera-02", "galera-03"]
if evaluate_fallback_readiness(GALERA_NODES):
logging.warning("Cluster degraded. Activating fallback routing.")
push_fallback_routing(PROXY_ADMIN_HOST, PROXY_ADMIN_PORT, PROXY_CREDS)
else:
logging.info("Cluster healthy. Standard routing maintained.")
This script integrates cleanly into CI/CD pipelines or Kubernetes operators. For production deployment, replace hardcoded credentials with environment variables or vault injection, and implement exponential backoff for API retries. The certification queue monitoring logic aligns with the Write-Set Certification Process Explained documentation, ensuring fallback triggers only when certification contention threatens write latency SLAs.
Operational Dependencies & Failure Scenarios
Fallback routing does not eliminate the need for strict operational boundaries. The following dependencies must be validated before deployment:
- Network Partition Handling: Galera’s
pc.wait_prim=TRUEprevents split-brain but can block fallback activation if not paired with proxy-level health checks. Ensurewsrep_provider_options="pc.wait_prim=false"is only used during controlled maintenance. - Flow Control Interaction: When
wsrep_flow_control_pausedexceeds thresholds, the proxy must throttle write routing. Configuremysql-monitor_read_only_timeoutto prevent stale connections from draining into fallback pools. - State Transfer Overhead: SST (State Snapshot Transfer) or IST (Incremental State Transfer) during node recovery consumes significant I/O. Schedule fallback routing tests during low-traffic windows and monitor
wsrep_local_state_commentto avoid routing to nodes inJoinerorDonorstates. - Replication Lag Monitoring: Async fallback nodes require strict lag thresholds. Use
max_replication_lagin ProxySQL to automatically evict replicas that exceed acceptable drift. Reference official MariaDB replication tuning guidelines at MariaDB Knowledge Base: Replication Overview for parameter baselines.
Runbook Checklist for Fallback Activation:
Implementing fallback routing and read-only nodes transforms Galera from a rigid synchronous cluster into a resilient, workload-aware database platform. By decoupling read traffic, enforcing strict health checks, and automating routing transitions, platform teams maintain availability during certification storms, maintenance windows, and partial network partitions without compromising data consistency or operational control.