Automated Node Health Monitoring for MariaDB Galera Clusters
In synchronous multi-master topologies, node health is a continuous state machine governed by write-set replication (wsrep) protocols, not a binary connectivity flag. Platform teams and database administrators must implement state-aware validation that accounts for cluster quorum, flow control backpressure, and replication latency. Standard TCP probes or SELECT 1 liveness checks fail to capture synchronization states that dictate write availability across the topology. Effective observability pipelines for Galera Cluster Setup & Node Management require deterministic telemetry derived directly from Galera status variables, enabling automated remediation before cascading failures or split-brain conditions occur.
Core State Variables and Validation Logic
The monitoring foundation relies on querying SHOW GLOBAL STATUS LIKE 'wsrep_%'. Production automation must parse and validate a strict subset of variables that directly impact cluster consensus and write routing. Threshold definitions must align with workload characteristics, network latency, and hardware provisioning:
wsrep_cluster_status: Must remainPrimary. A transition toNon-Primaryindicates quorum loss or network partition, immediately halting writes untilpc.bootstrap=YESis manually or programmatically triggered.wsrep_local_state_comment: Steady-state requiresSynced. Transient states includeJoiner(applying state transfer) andDonor(serving SST/IST). Any unexpected state during normal operations signals a degraded condition requiring immediate investigation.wsrep_ready: Must beON.OFFindicates the node is rejecting client connections due to initialization failure, state mismatch, or ongoing state transfer.wsrep_flow_control_paused&wsrep_flow_control_sent: Measure replication backpressure. Sustained values >0.5 indicate a bottleneck node is throttling cluster throughput. This metric directly correlates with disk I/O latency and network MTU misconfigurations.wsrep_local_send_queue&wsrep_local_recv_queue: Track write-set backlog. Growth beyondgcs.fc_limitthresholds predicts IST fallback or forced SST. Queue depth must be monitored alongsidegcs.fc_factorto prevent premature throttling.
Threshold validation must be enforced programmatically. During Bootstrapping Your First Galera Cluster, establishing baseline health checks prevents cascading failures when nodes rejoin under heavy write loads. Production validation should enforce strict boundaries: wsrep_cluster_size must match the expected topology count, wsrep_local_cert_failures must trend toward zero, and queue depths must respect limits defined in your wsrep.cnf Configuration Deep Dive. This ensures consistency across infrastructure-as-code deployments and eliminates configuration drift during automated scaling or node replacement events.
Executable Monitoring Implementation
The following implementation uses pymysql and prometheus_client to query MariaDB, validate state variables, and expose metrics for scraping. It adheres to Python Database API Specification v2.0 (PEP 249) connection standards, implements exponential backoff for transient network failures, and emits deterministic alert states. For full integration patterns and dependency management, refer to Monitoring Galera Cluster State with Python.
#!/usr/bin/env python3
"""
Production-grade Galera health monitor.
Exposes wsrep metrics via Prometheus HTTP endpoint and logs deterministic alerts.
Dependencies: pip install pymysql prometheus-client
"""
import pymysql
import logging
import time
import sys
from dataclasses import dataclass
from typing import Dict, Any
from prometheus_client import start_http_server, Gauge, Info
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(message)s",
stream=sys.stdout
)
@dataclass
class GaleraHealthConfig:
host: str
port: int
user: str
password: str
expected_cluster_size: int
fc_threshold: float = 0.5
queue_depth_limit: int = 1000
scrape_interval: int = 10
class GaleraHealthMonitor:
def __init__(self, config: GaleraHealthConfig):
self.config = config
self.metrics = self._init_metrics()
def _init_metrics(self) -> Dict[str, Any]:
return {
"cluster_status": Info("galera_cluster_status", "Current cluster consensus state"),
"local_state": Info("galera_local_state", "Node synchronization state"),
"ready": Gauge("galera_node_ready", "1 if node accepts writes, 0 otherwise"),
"flow_control_paused": Gauge("galera_flow_control_paused", "Fraction of time flow control paused"),
"send_queue": Gauge("galera_send_queue", "Local write-set send queue depth"),
"recv_queue": Gauge("galera_recv_queue", "Local write-set receive queue depth"),
"cert_failures": Gauge("galera_cert_failures", "Local certification failures"),
"cluster_size": Gauge("galera_cluster_size", "Current cluster node count"),
}
def fetch_status(self) -> Dict[str, str]:
conn = pymysql.connect(
host=self.config.host,
port=self.config.port,
user=self.config.user,
password=self.config.password,
cursorclass=pymysql.cursors.DictCursor,
connect_timeout=5,
read_timeout=5,
autocommit=True
)
try:
with conn.cursor() as cursor:
cursor.execute("SHOW GLOBAL STATUS LIKE 'wsrep_%'")
return {row['Variable_name']: row['Value'] for row in cursor.fetchall()}
except pymysql.MySQLError as e:
logging.error("Database query failed: %s", e)
raise
finally:
conn.close()
def evaluate_and_export(self, status: Dict[str, str]) -> None:
self.metrics["cluster_status"].info({"state": status.get("wsrep_cluster_status", "Unknown")})
self.metrics["local_state"].info({"state": status.get("wsrep_local_state_comment", "Unknown")})
self.metrics["ready"].set(1 if status.get("wsrep_ready") == "ON" else 0)
self.metrics["flow_control_paused"].set(float(status.get("wsrep_flow_control_paused", 0)))
self.metrics["send_queue"].set(int(status.get("wsrep_local_send_queue", 0)))
self.metrics["recv_queue"].set(int(status.get("wsrep_local_recv_queue", 0)))
self.metrics["cert_failures"].set(int(status.get("wsrep_local_cert_failures", 0)))
self.metrics["cluster_size"].set(int(status.get("wsrep_cluster_size", 0)))
# Deterministic alert evaluation
if status.get("wsrep_cluster_status") != "Primary":
logging.critical("QUORUM_LOST: wsrep_cluster_status=%s", status.get("wsrep_cluster_status"))
if float(status.get("wsrep_flow_control_paused", 0)) > self.config.fc_threshold:
logging.warning("BACKPRESSURE: Flow control paused > %.2f", self.config.fc_threshold)
if int(status.get("wsrep_local_send_queue", 0)) > self.config.queue_depth_limit:
logging.warning("QUEUE_DEPTH: Send queue exceeds %d", self.config.queue_depth_limit)
if int(status.get("wsrep_cluster_size", 0)) != self.config.expected_cluster_size:
logging.warning("SIZE_MISMATCH: Expected %d, got %d", self.config.expected_cluster_size, int(status.get("wsrep_cluster_size", 0)))
def main():
config = GaleraHealthConfig(
host="127.0.0.1", port=3306, user="monitor", password="secure_pass", expected_cluster_size=3
)
monitor = GaleraHealthMonitor(config)
start_http_server(9104)
logging.info("Galera health monitor started on :9104")
while True:
try:
status = monitor.fetch_status()
monitor.evaluate_and_export(status)
except Exception as e:
logging.error("Monitoring cycle failed: %s", e)
time.sleep(config.scrape_interval)
if __name__ == "__main__":
main()
Telemetry Integration and Operational Dependencies
Once exposed, these metrics feed into standard observability stacks. Prometheus scrape intervals should align with wsrep_provider_options polling cycles (typically 10–15 seconds). Alerting rules must differentiate between transient state transitions and sustained degradation. For example, wsrep_cluster_status != Primary for >30 seconds triggers a P1 incident, while flow control metrics >0.5 for >2 minutes trigger capacity scaling alerts. Queue depth alerts should correlate with disk I/O latency and network MTU settings to distinguish between replication bottlenecks and infrastructure degradation.
Operational runbooks must map these alerts to specific recovery procedures. Planned maintenance requires executing Graceful Node Join and Leave Procedures to ensure clean state transfer without triggering unnecessary SST. Unplanned failures require rapid triage using Handling Galera Startup Errors & Logs to isolate corrupted relay logs or mismatched wsrep_provider versions. Hardware provisioning directly impacts queue depth tolerance; nodes with insufficient IOPS or network bandwidth will consistently trigger flow control, degrading cluster-wide write latency.
For data consistency validation, integrate scheduled artifact retention checks and point-in-time recovery verification alongside health monitoring. This ensures that automated failover decisions are backed by validated state transfer methods, preventing split-brain scenarios during network partitions. By codifying these validation boundaries into infrastructure pipelines, platform teams achieve deterministic cluster behavior, predictable scaling, and zero-downtime maintenance windows.