How should the probe handle Galera error codes 1213 and 1205?

Error 1213 (certification deadlock) and 1205 (lock wait timeout) are transient write-conflict codes that a node routinely raises under heavy write load or flow control. Treat them as retryable rather than as hard failures: log them, back off, and try again. Reporting them as CRITICAL would generate false pages every time the group is simply busy.

What does a rising wsrep_flow_control_paused value mean for the health verdict?

wsrep_flow_control_paused is the fraction of time replication was throttled to let slow appliers catch up. A single sample above 0.1 during a write burst is normal; it only indicates real degradation when it stays elevated across consecutive checks. Alert on a sustained condition rather than one spike to avoid false criticals during batch workloads.

Monitoring Galera Cluster State with Python

This probe pattern builds on the health-signal model in Automated Node Health Monitoring and answers one focused question: how do you write a Python script that reads a MariaDB Galera node’s wsrep_ state and returns a deterministic, machine-readable health verdict (an exit code and a JSON payload) instead of a fragile SELECT 1 liveness check? The answer on this page is a self-contained evaluator. It connects with mysql-connector-python, parses SHOW GLOBAL STATUS LIKE 'wsrep_%' into a dictionary, applies threshold logic tuned for a synchronous multi-master group, and emits Nagios-style plugin exit codes alongside a JSON payload that a Prometheus pipeline can ingest, so an orchestrator can act before a partial failure cascades into split-brain.

Context: Why a wsrep-Aware Probe, Not a TCP Check

Multi-master synchronization in MariaDB Galera introduces distributed state transitions that a TCP connect, a ping, or a bare SELECT 1 cannot see. A member can complete a socket handshake and answer a trivial query while it is a Joiner applying a state transfer, a Donor/Desynced node streaming to a peer, or a member of a non-Primary partition that silently rejects every write. Health here is a state machine governed by the write-set replication (wsrep) provider, described in Understanding Galera Synchronous Replication, not a binary connectivity flag.
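To make the gap between "reachable" and "writable" concrete, here is a minimal sketch. The helper name `is_write_safe` and the sample dictionaries are illustrative assumptions, not part of the evaluator; the sketch assumes the wsrep status has already been read into a plain dictionary of strings.

```python
from typing import Dict

# Illustrative helper (hypothetical name): a node that answers SELECT 1 is
# treated as write-safe only when all three provider signals agree.
REQUIRED = {
    "wsrep_cluster_status": "Primary",
    "wsrep_local_state_comment": "Synced",
    "wsrep_ready": "ON",
}


def is_write_safe(status: Dict[str, str]) -> bool:
    """Return True only if every required wsrep signal has its healthy value."""
    return all(status.get(name) == want for name, want in REQUIRED.items())


# A donor streaming to a peer accepts connections but is not fully in sync.
donor = {"wsrep_cluster_status": "Primary",
         "wsrep_local_state_comment": "Donor/Desynced",
         "wsrep_ready": "ON"}
synced = {"wsrep_cluster_status": "Primary",
          "wsrep_local_state_comment": "Synced",
          "wsrep_ready": "ON"}

print(is_write_safe(donor))   # False
print(is_write_safe(synced))  # True
```

Both sample nodes would pass a TCP check or a SELECT 1; only the second passes the state check. A missing variable also counts as unsafe, because `status.get` returns None.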

Programmatic evaluation of that state machine requires direct interrogation of the provider’s status variables, precise threshold mapping, and deterministic routing of the result. Python is a strong fit for this workload: a mature database-connector ecosystem, structured exception handling for the transient write-conflict codes Galera raises under load, and native JSON serialization for downstream observability tools. A production-grade evaluator must open a resilient connection, parse the wsrep_ namespace into a deterministic dictionary, apply threshold logic against known failure modes, and hand a clean verdict to whatever consumes it — a systemd timer, a Prometheus textfile collector, or an automated recovery playbook.

The Python Evaluator

The evaluator below is the complete solution. It targets Python 3.9+, uses mysql-connector-python, enforces connection resilience with capped exponential backoff, and — critically — treats the two write-conflict error codes Galera surfaces under contention as transient rather than fatal: 1213 (deadlock, i.e. a certification conflict) and 1205 (lock wait timeout). A busy node under flow control must not read as a hard outage. Each significant block is explained inline.

#!/usr/bin/env python3
"""
Galera Cluster State Evaluator
Deterministic wsrep telemetry parser for automated node health checks.
Targets Python 3.9+ and mysql-connector-python.
"""

import sys
import json
import time
import logging
import argparse
from typing import Dict, Any, Tuple

import mysql.connector
from mysql.connector import Error

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

# Exit codes follow the Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).
EXIT_OK = 0
EXIT_WARNING = 1
EXIT_CRITICAL = 2
EXIT_UNKNOWN = 3

# Transient Galera write-conflict codes: a contended node is not a config fault.
TRANSIENT_CODES = {1213, 1205}  # 1213 = certification deadlock, 1205 = lock wait timeout

# Numeric metrics where a rising value is worse, and where a falling value is worse.
HIGHER_IS_WORSE = {"wsrep_flow_control_paused", "wsrep_local_recv_queue", "wsrep_local_cert_failures"}
LOWER_IS_WORSE = {"wsrep_cluster_size"}

THRESHOLDS: Dict[str, Dict[str, Any]] = {
    "wsrep_cluster_status": {"healthy": "Primary", "critical": "non-Primary"},
    "wsrep_local_state_comment": {"healthy": "Synced",
                                  "warning": ("Joined", "Donor/Desynced"),
                                  "critical": ("Initialized", "Unknown")},
    "wsrep_ready": {"healthy": "ON", "critical": "OFF"},
    "wsrep_flow_control_paused": {"healthy": 0.0, "warning": 0.1, "critical": 0.25},
    "wsrep_local_recv_queue": {"healthy": 0, "warning": 50, "critical": 200},
    "wsrep_local_cert_failures": {"healthy": 0, "warning": 1, "critical": 5},
    "wsrep_cluster_size": {"healthy": 3, "warning": 2, "critical": 1},  # bands assume a 3-node cluster
}


def fetch_wsrep_status(host: str, user: str, password: str,
                       port: int = 3306, retries: int = 3) -> Dict[str, str]:
    """Open a resilient connection and return the wsrep_ status namespace."""
    config = {
        "user": user, "password": password, "host": host, "port": port,
        "connection_timeout": 5,
    }
    for attempt in range(1, retries + 1):
        try:
            conn = mysql.connector.connect(**config)
            cursor = conn.cursor(dictionary=True)
            cursor.execute("SHOW GLOBAL STATUS LIKE 'wsrep_%'")
            status = {row["Variable_name"]: row["Value"] for row in cursor.fetchall()}
            cursor.close()
            conn.close()
            return status
        except Error as exc:
            # Contention codes are transient; surface them but keep retrying.
            code = getattr(exc, "errno", None)
            level = "info" if code in TRANSIENT_CODES else "warning"
            getattr(logging, level)(f"Attempt {attempt}/{retries} failed ({code}): {exc}")
            if attempt < retries:
                time.sleep(min(2 ** attempt, 10))  # capped exponential backoff
            else:
                raise RuntimeError(f"Could not read wsrep status after {retries} attempts: {exc}")


def evaluate_state(wsrep_data: Dict[str, str]) -> Tuple[int, Dict[str, Any]]:
    """Apply threshold logic and return (exit_code, structured payload)."""
    findings: Dict[str, Any] = {"metrics": {}, "alerts": [], "status": "UNKNOWN"}
    exit_code = EXIT_OK

    for var, rules in THRESHOLDS.items():
        raw = wsrep_data.get(var)
        if raw is None:
            findings["alerts"].append(f"Missing metric: {var}")
            exit_code = max(exit_code, EXIT_UNKNOWN)
            continue

        # Normalize numeric strings; leave enum/string values as-is.
        try:
            val: Any = float(raw) if "." in raw else int(raw)
        except ValueError:
            val = raw
        findings["metrics"][var] = val

        # Numeric metrics use magnitude comparisons, not equality — a recv_queue
        # of 500 must trip the critical band, not slip through because 500 != 200.
        if var in HIGHER_IS_WORSE and isinstance(val, (int, float)):
            if val >= rules["critical"]:
                findings["alerts"].append(f"CRITICAL: {var} = {val}")
                exit_code = max(exit_code, EXIT_CRITICAL)
            elif val >= rules["warning"]:
                findings["alerts"].append(f"WARNING: {var} = {val}")
                exit_code = max(exit_code, EXIT_WARNING)
        elif var in LOWER_IS_WORSE and isinstance(val, (int, float)):
            if val <= rules["critical"]:
                findings["alerts"].append(f"CRITICAL: {var} = {val}")
                exit_code = max(exit_code, EXIT_CRITICAL)
            elif val <= rules["warning"]:
                findings["alerts"].append(f"WARNING: {var} = {val}")
                exit_code = max(exit_code, EXIT_WARNING)
        elif val != rules["healthy"]:
            # Enum metrics: match against the warning/critical value bands.
            crit, warn = rules.get("critical"), rules.get("warning")
            if (isinstance(crit, tuple) and val in crit) or val == crit:
                findings["alerts"].append(f"CRITICAL: {var} = {val}")
                exit_code = max(exit_code, EXIT_CRITICAL)
            elif isinstance(warn, tuple) and val in warn:
                findings["alerts"].append(f"WARNING: {var} = {val}")
                exit_code = max(exit_code, EXIT_WARNING)
            else:
                findings["alerts"].append(f"WARNING: {var} = {val} (expected {rules['healthy']})")
                exit_code = max(exit_code, EXIT_WARNING)

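    # A confirmed CRITICAL finding outranks UNKNOWN (3) from a missing metric;
    # max() alone would let the numerically larger UNKNOWN mask it.
    if any(alert.startswith("CRITICAL") for alert in findings["alerts"]):
        exit_code = EXIT_CRITICAL

    findings["status"] = {EXIT_OK: "OK", EXIT_WARNING: "WARNING",
                          EXIT_CRITICAL: "CRITICAL"}.get(exit_code, "UNKNOWN")
    return exit_code, findings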


def main() -> None:
    parser = argparse.ArgumentParser(description="Galera wsrep state evaluator")
    parser.add_argument("--host", required=True, help="Database host or node address")
    parser.add_argument("--user", required=True, help="Read-only monitoring user")
    parser.add_argument("--password", required=True, help="Monitoring user password")
    parser.add_argument("--port", type=int, default=3306, help="MySQL/MariaDB port")
    args = parser.parse_args()

    try:
        wsrep_data = fetch_wsrep_status(args.host, args.user, args.password, args.port)
        exit_code, payload = evaluate_state(wsrep_data)
        print(json.dumps(payload, indent=2))
        sys.exit(exit_code)
    except Exception as exc:  # noqa: BLE001 — probe must always exit cleanly
        logging.error(f"Evaluation failed: {exc}")
        sys.exit(EXIT_UNKNOWN)


if __name__ == "__main__":
    main()

Two design choices carry most of the weight. First, parsing is strictly dictionary-based (row["Variable_name"]: row["Value"]) rather than regex over raw text, so a provider version bump that reorders variables cannot break the parser. Second, numeric bands use magnitude comparisons: wsrep_flow_control_paused and wsrep_local_recv_queue are compared with >=, and wsrep_cluster_size with <=, so a value far past the threshold still lands in the correct band instead of matching only on exact equality. Enum variables such as wsrep_cluster_status and wsrep_local_state_comment are matched against explicit healthy / warning / critical value sets.
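One of the consumers mentioned earlier is a Prometheus textfile collector. The sketch below shows one way the JSON payload could be rendered into exposition text. It assumes the payload shape the evaluator prints (`metrics`, `alerts`, `status`); the function name `to_prometheus`, the `galera_` metric prefix, and the `node` label are hypothetical choices for illustration, not an established naming scheme.

```python
from typing import Any, Dict

# Numeric encoding of the verdict, mirroring the probe's exit codes.
STATUS_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def to_prometheus(payload: Dict[str, Any], node: str) -> str:
    """Render numeric metrics plus the overall verdict as exposition text."""
    lines = []
    for name, value in sorted(payload.get("metrics", {}).items()):
        # Skip enum strings such as "Synced"; bool is excluded explicitly
        # because it is a subclass of int.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            lines.append(f'galera_{name}{{node="{node}"}} {value}')
    verdict = STATUS_CODES.get(payload.get("status", "UNKNOWN"), 3)
    lines.append(f'galera_probe_status{{node="{node}"}} {verdict}')
    return "\n".join(lines) + "\n"


sample = {"metrics": {"wsrep_cluster_size": 3,
                      "wsrep_flow_control_paused": 0.02,
                      "wsrep_local_state_comment": "Synced"},
          "alerts": [], "status": "OK"}
print(to_prometheus(sample, "db1"), end="")
# galera_wsrep_cluster_size{node="db1"} 3
# galera_wsrep_flow_control_paused{node="db1"} 0.02
# galera_probe_status{node="db1"} 0
```

In practice the text would be written to a `.prom` file in the directory the node_exporter textfile collector watches, ideally to a temporary file that is then renamed so the collector never reads a half-written file.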

wsrep Status Variables This Probe Reads

These are the specific variables the evaluator interrogates, with the type it normalizes them to and the bands it applies. They form the minimal diagnostic baseline for write availability; Galera exposes dozens more, but this subset is what determines whether a node is safe to receive writes.

Variable	Type	Healthy	Warning band	Critical band	What a breach means
`wsrep_cluster_status`	enum	`Primary`	—	`non-Primary`	Quorum lost or network partition; node rejects writes to avoid split-brain
`wsrep_local_state_comment`	enum	`Synced`	`Joined`, `Donor/Desynced`	`Initialized`, `Unknown`	Node joining, serving an SST as donor, or provider not initialized
`wsrep_ready`	enum (ON/OFF)	`ON`	—	`OFF`	Node refusing writes during sync or after a fatal provider error
`wsrep_flow_control_paused`	float 0–1	`0.0`	`≥ 0.1`	`≥ 0.25`	Fraction of time replication was throttled; slow appliers or disk I/O saturation
`wsrep_local_recv_queue`	integer	`0`	`≥ 50`	`≥ 200`	Received write-sets waiting to apply; write burst outpacing apply rate
`wsrep_local_cert_failures`	integer	`0`	`≥ 1`	`≥ 5`	Certification (write) conflicts; retry storms or missing unique constraints
`wsrep_cluster_size`	integer	node count	`< expected`	`1`	A member dropped; check `4567/tcp` reachability and `gcomm` timeouts

The wsrep_local_state_comment enum, covered in Fixing wsrep_local_state_comment Issues, is the single most informative field: Synced is the only value at which a node is fully in the group and writable. Flow-control and receive-queue pressure both trace back to the certification pipeline described in the Write-Set Certification Process Explained reference, and the throttle thresholds themselves are governed by the GCS provider options tuned in Configuring wsrep_provider_options for Low Latency.
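One caveat about the bands above: wsrep_local_cert_failures is a running total since the provider started, so comparing the raw value against a fixed band can leave a long-lived node in permanent alert. A possible refinement, sketched below under assumptions, is to threshold the change since the previous check. The helper names and the state-file location are hypothetical; a real deployment would pick its own path.

```python
import json
import os
import tempfile
from typing import Optional

KEY = "wsrep_local_cert_failures"


def counter_delta(current: int, previous: Optional[int]) -> int:
    """Failures since the last check, tolerating first runs and restarts."""
    if previous is None:
        return 0          # first run: no baseline to compare against
    if current < previous:
        return current    # counter was reset, most likely by a node restart
    return current - previous


def load_previous(path: str) -> Optional[int]:
    """Return the last saved counter, or None if no usable state exists."""
    try:
        with open(path, encoding="utf-8") as fh:
            return int(json.load(fh)[KEY])
    except (OSError, ValueError, KeyError, TypeError):
        return None


def save_current(path: str, value: int) -> None:
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({KEY: value}, fh)


# Hypothetical state file in a throwaway directory, for demonstration only.
state_file = os.path.join(tempfile.mkdtemp(), "galera_probe_state.json")
for total in (120, 121, 127, 3):
    delta = counter_delta(total, load_previous(state_file))
    save_current(state_file, total)
    print(total, delta)
# 120 0
# 121 1
# 127 6
# 3 3
```

With this approach the warning and critical bands would apply to the per-interval delta rather than to the lifetime total.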

Verification

Confirm the probe reports correctly against a live node and that its exit code matches its verdict. First run it directly and inspect the JSON:

# Run the probe against a node; capture the exit code separately.
python3 galera_state.py --host 10.0.1.10 --user monitor --password 's3cret'
echo "exit code: $?"   # 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN

Cross-check the raw values the probe read against the server itself, so a wrong verdict is immediately attributable to either the thresholds or the source data:

SHOW GLOBAL STATUS WHERE Variable_name IN (
  'wsrep_cluster_status',        -- must be 'Primary'
  'wsrep_local_state_comment',   -- must be 'Synced'
  'wsrep_ready',                 -- must be 'ON'
  'wsrep_flow_control_paused',   -- near 0.0
  'wsrep_local_recv_queue',      -- near 0
  'wsrep_cluster_size'           -- equals live member count
);

To prove the WARNING and CRITICAL paths without waiting for a real incident, point the probe at a node that is serving or finishing a state transfer (Donor/Desynced or Joined; expect wsrep_local_state_comment to trip a WARNING) or at a node you have isolated on 4567/tcp (expect wsrep_cluster_status = non-Primary and a CRITICAL exit code of 2). The monitoring account needs nothing beyond the USAGE privilege, because SHOW GLOBAL STATUS requires only the ability to connect; never reuse the SST account or a superuser for this probe.
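Because a password passed with --password is visible in the process list, a dedicated monitoring account is better served by credentials read from the environment. The following is a minimal sketch of that idea, not a change to the evaluator itself; the variable names GALERA_PROBE_USER and GALERA_PROBE_PASSWORD and the function `load_credentials` are assumptions made for illustration.

```python
import os
from typing import Dict, Mapping

# Hypothetical variable names; choose whatever your deployment standardizes on.
REQUIRED_VARS = ("GALERA_PROBE_USER", "GALERA_PROBE_PASSWORD")


def load_credentials(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Read monitoring credentials from the environment, failing loudly if absent."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {"user": env["GALERA_PROBE_USER"],
            "password": env["GALERA_PROBE_PASSWORD"]}


# Demonstration with an explicit mapping instead of the real environment.
creds = load_credentials({"GALERA_PROBE_USER": "monitor",
                          "GALERA_PROBE_PASSWORD": "s3cret"})
print(creds["user"])  # monitor
```

The returned dictionary can be merged into the connection configuration, and a missing variable surfaces as an exception that the probe's top-level handler would report as UNKNOWN.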

Edge Cases & Gotchas

The node is up but rejects the connection during SST. A Joiner receiving a full State Snapshot Transfer often refuses new client connections until it reaches Synced. The probe will exit UNKNOWN (3) on the connection failure, not CRITICAL. Treat a short burst of UNKNOWN during a known rejoin as expected — correlate it with the transfer flow in Initial Data Synchronization Methods rather than paging on the first miss.
Flow control is transient, not an outage. A single sample of wsrep_flow_control_paused above 0.1 during a write burst is normal; it only signals real degradation when it stays elevated across consecutive intervals. Alert on a sustained condition (several checks in a row), not one spike, or your pipeline will drown in false criticals during batch jobs.
Docker and systemd connect to the wrong endpoint. Inside a container, --host 127.0.0.1 points at the container’s own loopback interface, so it can hit a stale mapped port or nothing at all instead of the intended node; always target the node’s real service address (the one in wsrep_node_address) and pin --port. When shipping the probe as a systemd timer, run it as an unprivileged user with credentials from an EnvironmentFile, not baked into the unit. The timer’s network namespace must also reach the database port, which for firewalled hosts means the rules covered in Network Security & Firewall Rules for Galera.
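The "sustained condition, not one spike" rule for flow control can be sketched as a small debouncer. This is one possible approach under stated assumptions: the class name `SustainedAlert` and the choice of three consecutive checks are illustrative, and the 0.1 threshold is the warning band used by the evaluator.

```python
from collections import deque
from typing import Deque


class SustainedAlert:
    """Fire only when the last `required` samples all breach the threshold."""

    def __init__(self, threshold: float, required: int = 3) -> None:
        self.threshold = threshold
        self.required = required
        # Bounded window: old samples fall out automatically.
        self.window: Deque[bool] = deque(maxlen=required)

    def observe(self, value: float) -> bool:
        """Record one sample and report whether the breach is sustained."""
        self.window.append(value >= self.threshold)
        return len(self.window) == self.required and all(self.window)


alert = SustainedAlert(threshold=0.1, required=3)
samples = [0.02, 0.30, 0.04, 0.15, 0.22, 0.18]  # one spike, then a real run
print([alert.observe(s) for s in samples])
# [False, False, False, False, False, True]
```

The isolated spike of 0.30 never fires, while three consecutive samples above 0.1 do. The window here lives in memory, which suits a long-running loop; a probe launched fresh by a timer on each check would need to persist the window between runs.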

Related

Automated Node Health Monitoring — the parent guide defining which wsrep states mean a node is safe to receive writes
wsrep.cnf Configuration Deep Dive — the provider options and identity keys behind the values this probe reads
Fixing wsrep_local_state_comment Issues — recovering a node stuck outside the Synced state
Write-Set Certification Process Explained — why certification failures and flow control appear in the metrics above
