Automated Node Health Monitoring for MariaDB Galera Clusters

In synchronous multi-master topologies, node health is a continuous state machine governed by write-set replication (wsrep) protocols, not a binary connectivity flag. Platform teams and database administrators must implement state-aware validation that accounts for cluster quorum, flow control backpressure, and replication latency. Standard TCP probes or SELECT 1 liveness checks fail to capture synchronization states that dictate write availability across the topology. Effective observability pipelines for Galera Cluster Setup & Node Management require deterministic telemetry derived directly from Galera status variables, enabling automated remediation before cascading failures or split-brain conditions occur.

Core State Variables and Validation Logic

The monitoring foundation relies on querying SHOW GLOBAL STATUS LIKE 'wsrep_%'. Production automation must parse and validate a strict subset of variables that directly impact cluster consensus and write routing. Threshold definitions must align with workload characteristics, network latency, and hardware provisioning:

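  • wsrep_cluster_status: Must remain Primary. A transition to Non-Primary indicates quorum loss or a network partition, and the node stops accepting writes. It recovers on its own once it can rejoin the primary component; if no primary component survives, an operator or automation must deliberately re-establish one by setting pc.bootstrap=YES through wsrep_provider_options.
  • wsrep_local_state_comment: Steady-state requires Synced. Transient states include Joining and Joined (receiving a state transfer and catching up afterwards) and Donor/Desynced (serving SST or IST to another node). Any of these outside a planned join or maintenance window signals a degraded condition that needs investigation.
  • wsrep_ready: Must be ON. OFF means the node still accepts connections but rejects queries, typically because of initialization failure, loss of the primary component, or an ongoing state transfer.
  • wsrep_flow_control_paused & wsrep_flow_control_sent: Measure replication backpressure. wsrep_flow_control_paused is the fraction of time (0.0 to 1.0) that replication was paused since the last FLUSH STATUS; sustained values >0.5 mean the cluster is stalled for more than half the sampling window. wsrep_flow_control_sent is a counter of pause requests this node has issued, so a node whose counter keeps rising is the one throttling cluster throughput. Common causes are disk I/O latency on the slow node and network problems such as MTU misconfiguration.
  • wsrep_local_send_queue & wsrep_local_recv_queue: Track write-set backlog. When the receive queue grows past gcs.fc_limit, the node requests a flow control pause; replication resumes once the queue drains below gcs.fc_limit × gcs.fc_factor. Interpret queue depth alerts against both settings: a low gcs.fc_limit throttles the cluster early, while a high one lets a slow node fall further behind, which raises the chance that it later needs IST or a full SST to catch up.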
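
The three state variables at the top of the list can be combined into a single write-availability check. The following is a minimal sketch, not part of the monitor in this article; the function name classify_node and the wording of the problem strings are illustrative assumptions.

```python
from typing import Dict, List


def classify_node(status: Dict[str, str]) -> List[str]:
    """Return the reasons a node should not receive writes (empty list = healthy)."""
    problems = []
    # Quorum: only a node in the primary component may accept writes.
    if status.get("wsrep_cluster_status") != "Primary":
        problems.append("not in primary component")
    # Synchronization: anything other than Synced is transient or degraded.
    if status.get("wsrep_local_state_comment") != "Synced":
        problems.append("not synced")
    # Readiness: the node rejects queries unless wsrep_ready is ON.
    if status.get("wsrep_ready") != "ON":
        problems.append("wsrep_ready is not ON")
    return problems
```

A missing variable is treated the same as a bad value, so a node where the wsrep provider is not loaded at all reports all three problems rather than passing silently.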

Threshold validation must be enforced programmatically. During Bootstrapping Your First Galera Cluster, establishing baseline health checks prevents cascading failures when nodes rejoin under heavy write loads. Production validation should enforce strict boundaries: wsrep_cluster_size must match the expected topology count, the rate of increase of wsrep_local_cert_failures (a cumulative counter) must trend toward zero, and queue depths must respect limits defined in your wsrep.cnf Configuration Deep Dive. This keeps checks consistent across infrastructure-as-code deployments and surfaces configuration drift during automated scaling or node replacement events.
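
Because wsrep_local_cert_failures only ever increases, a boundary check needs two consecutive samples rather than one. The sketch below illustrates that idea under stated assumptions: check_thresholds and its parameters are hypothetical names, and flagging any new certification failure is deliberately strict (multi-writer workloads usually tolerate a small nonzero rate).

```python
from typing import Dict, List


def check_thresholds(
    prev: Dict[str, str],
    curr: Dict[str, str],
    expected_size: int,
    queue_limit: int,
) -> List[str]:
    """Compare two consecutive wsrep status samples and list violated boundaries."""
    violations = []

    size = int(curr.get("wsrep_cluster_size", 0))
    if size != expected_size:
        violations.append(f"cluster_size={size}, expected {expected_size}")

    # The counter is cumulative, so evaluate the change between samples.
    before = int(prev.get("wsrep_local_cert_failures", 0))
    after = int(curr.get("wsrep_local_cert_failures", 0))
    # A value lower than the previous sample means the counter was reset
    # (for example by a node restart), so count from zero again.
    new_failures = after - before if after >= before else after
    if new_failures > 0:
        violations.append(f"{new_failures} new certification failures")

    for name in ("wsrep_local_send_queue", "wsrep_local_recv_queue"):
        depth = int(curr.get(name, 0))
        if depth > queue_limit:
            violations.append(f"{name}={depth} exceeds {queue_limit}")

    return violations
```

An empty return value means every boundary held for that sampling interval.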

Executable Monitoring Implementation

The following implementation uses pymysql and prometheus_client to query MariaDB, validate state variables, and expose metrics for scraping. It relies on a driver that follows the Python Database API Specification v2.0 (PEP 249), applies connect and read timeouts so a hung node cannot stall the loop, retries at the next polling interval after a failed query, and emits deterministic alert states. For full integration patterns and dependency management, refer to Monitoring Galera Cluster State with Python.

#!/usr/bin/env python3
"""
Production-grade Galera health monitor.
Exposes wsrep metrics via Prometheus HTTP endpoint and logs deterministic alerts.
Dependencies: pip install pymysql prometheus-client
"""

import pymysql
import logging
import time
import sys
from dataclasses import dataclass
from typing import Dict, Any
from prometheus_client import start_http_server, Gauge, Info

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    stream=sys.stdout
)

@dataclass
class GaleraHealthConfig:
    host: str
    port: int
    user: str
    password: str
    expected_cluster_size: int
    fc_threshold: float = 0.5
    queue_depth_limit: int = 1000
    scrape_interval: int = 10

class GaleraHealthMonitor:
    def __init__(self, config: GaleraHealthConfig):
        self.config = config
        self.metrics = self._init_metrics()

    def _init_metrics(self) -> Dict[str, Any]:
        return {
            "cluster_status": Info("galera_cluster_status", "Current cluster consensus state"),
            "local_state": Info("galera_local_state", "Node synchronization state"),
            "ready": Gauge("galera_node_ready", "1 if node accepts writes, 0 otherwise"),
            "flow_control_paused": Gauge("galera_flow_control_paused", "Fraction of time flow control paused"),
            "send_queue": Gauge("galera_send_queue", "Local write-set send queue depth"),
            "recv_queue": Gauge("galera_recv_queue", "Local write-set receive queue depth"),
            "cert_failures": Gauge("galera_cert_failures", "Local certification failures"),
            "cluster_size": Gauge("galera_cluster_size", "Current cluster node count"),
        }

    def fetch_status(self) -> Dict[str, str]:
        conn = pymysql.connect(
            host=self.config.host,
            port=self.config.port,
            user=self.config.user,
            password=self.config.password,
            cursorclass=pymysql.cursors.DictCursor,
            connect_timeout=5,
            read_timeout=5,
            autocommit=True
        )
        try:
            with conn.cursor() as cursor:
                cursor.execute("SHOW GLOBAL STATUS LIKE 'wsrep_%'")
                return {row['Variable_name']: row['Value'] for row in cursor.fetchall()}
        except pymysql.MySQLError as e:
            logging.error("Database query failed: %s", e)
            raise
        finally:
            conn.close()

    def evaluate_and_export(self, status: Dict[str, str]) -> None:
        self.metrics["cluster_status"].info({"state": status.get("wsrep_cluster_status", "Unknown")})
        self.metrics["local_state"].info({"state": status.get("wsrep_local_state_comment", "Unknown")})
        self.metrics["ready"].set(1 if status.get("wsrep_ready") == "ON" else 0)
        self.metrics["flow_control_paused"].set(float(status.get("wsrep_flow_control_paused", 0)))
        self.metrics["send_queue"].set(int(status.get("wsrep_local_send_queue", 0)))
        self.metrics["recv_queue"].set(int(status.get("wsrep_local_recv_queue", 0)))
        self.metrics["cert_failures"].set(int(status.get("wsrep_local_cert_failures", 0)))
        self.metrics["cluster_size"].set(int(status.get("wsrep_cluster_size", 0)))

        # Deterministic alert evaluation
        if status.get("wsrep_cluster_status") != "Primary":
            logging.critical("QUORUM_LOST: wsrep_cluster_status=%s", status.get("wsrep_cluster_status"))
        if float(status.get("wsrep_flow_control_paused", 0)) > self.config.fc_threshold:
            logging.warning("BACKPRESSURE: Flow control paused > %.2f", self.config.fc_threshold)
        if int(status.get("wsrep_local_send_queue", 0)) > self.config.queue_depth_limit:
            logging.warning("QUEUE_DEPTH: Send queue exceeds %d", self.config.queue_depth_limit)
        if int(status.get("wsrep_cluster_size", 0)) != self.config.expected_cluster_size:
            logging.warning("SIZE_MISMATCH: Expected %d, got %d", self.config.expected_cluster_size, int(status.get("wsrep_cluster_size", 0)))

def main():
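    # Placeholder connection settings; load real credentials from the
    # environment or a secrets store rather than hardcoding them.
    config = GaleraHealthConfig(
        host="127.0.0.1", port=3306, user="monitor", password="secure_pass", expected_cluster_size=3
    )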
    monitor = GaleraHealthMonitor(config)
    start_http_server(9104)
    logging.info("Galera health monitor started on :9104")
    
    while True:
        try:
            status = monitor.fetch_status()
            monitor.evaluate_and_export(status)
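        except Exception as e:
            # Mark the node not ready so a failed poll never leaves a stale "1" exported.
            monitor.metrics["ready"].set(0)
            logging.error("Monitoring cycle failed: %s", e)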
        time.sleep(config.scrape_interval)

if __name__ == "__main__":
    main()
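
The loop above retries at a fixed interval after a failure. Where exponential backoff for transient network failures is wanted, one option is to wrap the query in a small retry helper. This is a sketch under assumptions: call_with_backoff and its parameters are hypothetical and not part of pymysql or prometheus_client.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, doubling the wait after each failure, up to `attempts` tries."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # Out of retries: surface the last error to the caller.
            # 0.5s, 1s, 2s, ... capped at max_delay.
            sleep(min(max_delay, base_delay * (2 ** attempt)))
    raise ValueError("attempts must be at least 1")
```

In the monitor this could be used as call_with_backoff(monitor.fetch_status, retry_on=(pymysql.MySQLError,)). Keep the total retry time below the scrape interval so a flapping node does not delay the next cycle.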

Telemetry Integration and Operational Dependencies

Once exposed, these metrics feed into standard observability stacks. The Prometheus scrape interval should be no shorter than the monitor's own polling interval (scrape_interval, 10 seconds in the script above); 10–15 seconds is typical. Alerting rules must differentiate between transient state transitions and sustained degradation. For example, wsrep_cluster_status != Primary for >30 seconds triggers a P1 incident, while wsrep_flow_control_paused >0.5 for >2 minutes triggers a capacity scaling alert. Queue depth alerts should be correlated with disk I/O latency and network metrics (including MTU settings) to distinguish replication bottlenecks from infrastructure degradation.
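
In Prometheus the "for >30 seconds" part is normally expressed with the for clause of an alerting rule. For the log-based alerts emitted by the monitor itself, the same hold-time semantics can be sketched in Python. The class name SustainedCondition is an illustrative assumption.

```python
from typing import Optional


class SustainedCondition:
    """Fires only after a condition has held continuously for more than hold_seconds."""

    def __init__(self, hold_seconds: float):
        self.hold_seconds = hold_seconds
        self._since: Optional[float] = None  # When the condition first became true.

    def update(self, active: bool, now: float) -> bool:
        if not active:
            # Any healthy sample resets the timer, so brief blips never fire.
            self._since = None
            return False
        if self._since is None:
            self._since = now
        return (now - self._since) > self.hold_seconds
```

A monitor loop could keep one instance per alert, for example SustainedCondition(30) for quorum loss and SustainedCondition(120) for flow control, and pass time.monotonic() as now on each cycle.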

Operational runbooks must map these alerts to specific recovery procedures. Planned maintenance requires executing Graceful Node Join and Leave Procedures to ensure clean state transfer without triggering unnecessary SST. Unplanned failures require rapid triage using Handling Galera Startup Errors & Logs to isolate a damaged grastate.dat or gcache file, or mismatched wsrep_provider versions. Hardware provisioning directly impacts queue depth tolerance; nodes with insufficient IOPS or network bandwidth will consistently trigger flow control, degrading cluster-wide write latency.

For data consistency validation, schedule backup retention checks and point-in-time recovery verification alongside health monitoring. This ensures that automated failover and node replacement decisions rest on state transfer and restore paths that have actually been tested. Protection against split-brain during network partitions still comes from quorum, which is why wsrep_cluster_status and wsrep_cluster_size remain the primary alert signals. By codifying these validation boundaries into infrastructure pipelines, platform teams get more predictable cluster behavior, scaling, and maintenance windows.