Monitoring Galera Cluster State with Python

Telemetry Architecture & Diagnostic Intent

Multi-master synchronization in MariaDB Galera introduces distributed state transitions that traditional SNMP or agent-based monitoring stacks routinely miss. The cluster does not operate on simple up/down semantics; instead, it relies on a deterministic state machine governed by the wsrep (Write-Set Replication) provider layer. Programmatic evaluation of this state requires direct interrogation of the provider API, precise threshold mapping, and deterministic remediation routing. When architecting Galera Cluster Setup & Node Management pipelines, the monitoring layer must function as a first-class control plane component rather than a passive telemetry sink.

Python provides the optimal execution environment for this workload due to its mature database connector ecosystem, structured exception handling, and native JSON serialization. A production-grade evaluator must establish resilient connections, parse SHOW GLOBAL STATUS LIKE 'wsrep_%' output into a deterministic dictionary, apply threshold logic against known failure modes, and emit actionable exit codes or structured payloads for downstream orchestration. This approach eliminates polling drift and ensures that alerting systems receive consistent, machine-readable state snapshots aligned with Automated Node Health Monitoring standards.

Critical wsrep Metrics & Threshold Mapping

Galera exposes over thirty wsrep_ status variables, but only a subset dictates immediate operational health. The following matrix forms the core diagnostic baseline for automated evaluation:

Variable Healthy State Warning Threshold Critical State Root-Cause Analysis
wsrep_cluster_status Primary N/A non-Primary Quorum loss, network partition, or bootstrap misconfiguration
wsrep_local_state_comment Synced Joined, Donor/Desynced Initialized, Unknown Node joining, SST/IST in progress, or provider crash
wsrep_ready ON N/A OFF Node rejecting writes during sync or after fatal provider error
wsrep_flow_control_paused 0.0 > 0.1 > 0.25 Slow applier threads, disk I/O saturation, or network jitter
wsrep_local_recv_queue 0 > 50 > 200 Write burst exceeding apply rate, causing replication backlog
wsrep_local_cert_failures 0 > 0 > 5 (per interval) Write conflicts from retry storms, missing unique constraints, or overlapping transactions
wsrep_cluster_size 3 (expected) < expected 1 Node drop, firewall blocking 4567/tcp, or gcomm timeout

The wsrep_cluster_status = non-Primary condition indicates a partitioned cluster where the node has lost quorum and will reject writes to prevent split-brain. wsrep_ready = OFF confirms the node is temporarily or permanently isolated from the replication group. Flow control metrics expose replication lag caused by slow appliers or storage bottlenecks, while certification failures signal application-level write conflicts. Accurate threshold routing prevents alert fatigue by distinguishing transient synchronization states from genuine degradation.

Production-Grade Python Evaluator

The following implementation uses mysql-connector-python to establish a connection, extract the wsrep_ namespace, and evaluate cluster health against production thresholds. Connection resilience is enforced via exponential backoff, and parsing is strictly dictionary-based to avoid fragile regex matching. The script adheres to Python DB-API 2.0 conventions and outputs structured JSON compatible with modern observability stacks.

#!/usr/bin/env python3
"""
Galera Cluster State Evaluator
Production-safe diagnostic script for automated wsrep telemetry parsing.
"""

import sys
import json
import time
import logging
from typing import Dict, Any, Tuple
import mysql.connector
from mysql.connector import Error

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

# Exit codes aligned with standard monitoring conventions (Nagios/Prometheus)
EXIT_OK = 0
EXIT_WARNING = 1
EXIT_CRITICAL = 2
EXIT_UNKNOWN = 3

THRESHOLDS: Dict[str, Dict[str, Any]] = {
    "wsrep_cluster_status": {"healthy": "Primary", "critical": "non-Primary"},
    "wsrep_local_state_comment": {"healthy": "Synced", "warning": ("Joined", "Donor/Desynced"), "critical": ("Initialized", "Unknown")},
    "wsrep_ready": {"healthy": "ON", "critical": "OFF"},
    "wsrep_flow_control_paused": {"healthy": 0.0, "warning": 0.1, "critical": 0.25},
    "wsrep_local_recv_queue": {"healthy": 0, "warning": 50, "critical": 200},
    "wsrep_local_cert_failures": {"healthy": 0, "warning": 0, "critical": 5},
    "wsrep_cluster_size": {"healthy": 3, "warning": 2, "critical": 1}
}

def fetch_wsrep_status(host: str, user: str, password: str, port: int = 3306, retries: int = 3) -> Dict[str, str]:
    """Establish resilient connection and extract wsrep status variables."""
    config = {
        "user": user,
        "password": password,
        "host": host,
        "port": port,
        "connect_timeout": 5,
        "read_timeout": 10
    }
    
    for attempt in range(1, retries + 1):
        try:
            conn = mysql.connector.connect(**config)
            cursor = conn.cursor(dictionary=True)
            cursor.execute("SHOW GLOBAL STATUS LIKE 'wsrep_%'")
            status = {row["Variable_name"]: row["Value"] for row in cursor.fetchall()}
            cursor.close()
            conn.close()
            return status
        except Error as e:
            logging.warning(f"Connection attempt {attempt}/{retries} failed: {e}")
            if attempt < retries:
                time.sleep(min(2 ** attempt, 10))
            else:
                raise RuntimeError(f"Failed to connect to MySQL after {retries} attempts: {e}")

def evaluate_state(wsrep_data: Dict[str, str]) -> Tuple[int, Dict[str, Any]]:
    """Apply threshold logic and return exit code + structured payload."""
    findings = {"metrics": {}, "alerts": [], "status": "UNKNOWN"}
    exit_code = EXIT_OK

    for var, rules in THRESHOLDS.items():
        raw_val = wsrep_data.get(var)
        if raw_val is None:
            findings["alerts"].append(f"Missing metric: {var}")
            exit_code = max(exit_code, EXIT_UNKNOWN)
            continue

        # Type normalization
        try:
            val = float(raw_val) if "." in raw_val else int(raw_val)
        except ValueError:
            val = raw_val

        findings["metrics"][var] = val

        # Threshold evaluation. Numeric metrics must use magnitude comparisons,
        # not equality — a recv_queue of 500 should trip the critical band, not
        # silently pass because it does not equal the threshold exactly.
        HIGHER_IS_WORSE = {"wsrep_flow_control_paused", "wsrep_local_recv_queue", "wsrep_local_cert_failures"}
        LOWER_IS_WORSE = {"wsrep_cluster_size"}

        if var in HIGHER_IS_WORSE and isinstance(val, (int, float)):
            if "critical" in rules and val >= rules["critical"]:
                findings["alerts"].append(f"CRITICAL: {var} = {val}")
                exit_code = max(exit_code, EXIT_CRITICAL)
            elif "warning" in rules and val >= rules["warning"] and val > rules.get("healthy", 0):
                findings["alerts"].append(f"WARNING: {var} = {val}")
                exit_code = max(exit_code, EXIT_WARNING)
        elif var in LOWER_IS_WORSE and isinstance(val, (int, float)):
            if "critical" in rules and val <= rules["critical"]:
                findings["alerts"].append(f"CRITICAL: {var} = {val}")
                exit_code = max(exit_code, EXIT_CRITICAL)
            elif "warning" in rules and val <= rules["warning"]:
                findings["alerts"].append(f"WARNING: {var} = {val}")
                exit_code = max(exit_code, EXIT_WARNING)
        elif "healthy" in rules and val != rules["healthy"]:
            # String/enum metrics: match the warning/critical value bands.
            if isinstance(rules.get("critical"), tuple) and val in rules["critical"]:
                findings["alerts"].append(f"CRITICAL: {var} = {val}")
                exit_code = max(exit_code, EXIT_CRITICAL)
            elif isinstance(rules.get("warning"), tuple) and val in rules["warning"]:
                findings["alerts"].append(f"WARNING: {var} = {val}")
                exit_code = max(exit_code, EXIT_WARNING)
            elif not isinstance(rules.get("critical"), tuple) and val == rules.get("critical"):
                findings["alerts"].append(f"CRITICAL: {var} = {val}")
                exit_code = max(exit_code, EXIT_CRITICAL)
            else:
                findings["alerts"].append(f"WARNING: {var} = {val} (expected {rules['healthy']})")
                exit_code = max(exit_code, EXIT_WARNING)

    findings["status"] = "OK" if exit_code == EXIT_OK else ("WARNING" if exit_code == EXIT_WARNING else "CRITICAL")
    return exit_code, findings

def main():
    import argparse
    parser = argparse.ArgumentParser(description="Galera wsrep state evaluator")
    parser.add_argument("--host", required=True, help="Database host")
    parser.add_argument("--user", required=True, help="Monitoring user")
    parser.add_argument("--password", required=True, help="Monitoring password")
    parser.add_argument("--port", type=int, default=3306, help="MySQL port")
    args = parser.parse_args()

    try:
        wsrep_data = fetch_wsrep_status(args.host, args.user, args.password, args.port)
        exit_code, payload = evaluate_state(wsrep_data)
        print(json.dumps(payload, indent=2))
        sys.exit(exit_code)
    except Exception as e:
        logging.error(f"Evaluation failed: {e}")
        sys.exit(EXIT_UNKNOWN)

if __name__ == "__main__":
    main()

Integration & Safe Recovery Routing

Deploy this evaluator as a systemd timer, cron job, or Prometheus textfile collector. When integrated with alerting pipelines, route CRITICAL payloads to incident response channels while suppressing WARNING states during known maintenance windows. For safe recovery routing, never automate node restarts on wsrep_cluster_status = non-Primary without verifying quorum across surviving nodes. Instead, trigger a diagnostic playbook that isolates the partitioned node, validates network connectivity on 4567/tcp and 4568/tcp, and confirms gcomm configuration alignment.

When wsrep_flow_control_paused exceeds 0.25, investigate storage I/O latency and innodb_flush_log_at_trx_commit settings. High wsrep_local_recv_queue values typically indicate applier thread starvation; increasing wsrep_slave_threads or upgrading storage throughput resolves the backlog without risking data divergence. Certification failures (wsrep_local_cert_failures) require application-level remediation: enforce strict unique constraints, implement idempotent write patterns, and review retry logic to prevent transaction storms.

For comprehensive telemetry ingestion and historical trend analysis, consult the official Galera Cluster Status Variables documentation to align custom thresholds with provider version updates. When exporting metrics to time-series databases, ensure consistent label cardinality and avoid high-frequency polling that competes with replication traffic.