Automated Node Health Monitoring for MariaDB Galera Clusters

This monitoring model builds on the node lifecycle described in Galera Cluster Setup & Node Management, and turns it into a continuous, machine-readable signal of whether each member is safe to receive writes. In a synchronous multi-master deployment, node health is a state machine governed by write-set replication (wsrep), not a binary connectivity flag. A member can answer a TCP handshake, accept a socket, and return SELECT 1 while it is Joiner applying a state transfer, Donor/Desynced streaming to another node, or sitting in a non-Primary partition that silently rejects every write. The operational problem this page solves is precise: define the exact wsrep status variables that determine write availability, expose them as deterministic telemetry, set alert thresholds that separate transient state changes from real degradation, and wire the whole thing into an automation loop that can act before a partial failure cascades into split-brain or a stalled cluster.

Concept: Health Is a wsrep State, Not a Ping

Every Galera node is always in exactly one wsrep state, and only one of those states — Synced inside a Primary component — means the node is genuinely healthy for both reads and writes. Automated health monitoring is the practice of reading that state directly from the provider through SHOW GLOBAL STATUS LIKE 'wsrep_%' and mapping it onto three decisions: keep the node in the router pool, drain it, or page an operator. The distinction from a generic liveness probe is the whole point. A load balancer that only checks port 3306 will happily route writes to a Joiner, which rejects them with WSREP has not yet prepared node for application use; a probe that only runs SELECT 1 cannot see that flow control has paused commits cluster-wide because one node’s apply queue is saturated.

The mechanics of why a node’s state is authoritative for the whole group come from the certification protocol — see how Galera synchronous replication works and the write-set certification process for the underlying model. What matters for monitoring is that a small set of variables fully describes a node’s position in that protocol at any instant, and that healthy operation is a specific combination of them, not any single value.
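
As a rough illustration of the three decisions described above, the status snapshot can be reduced to a single verdict. This is a minimal sketch, not the exporter's own logic: the function name `verdict` and the label `KEEP` are hypothetical, and treating every non-Primary state as a page is an assumed policy.

```python
from typing import Mapping


def verdict(status: Mapping[str, str]) -> str:
    """Map a wsrep status snapshot onto KEEP, DRAIN, or PAGE (assumed labels)."""
    # A non-Primary component rejects writes, so this sketch escalates it.
    if status.get("wsrep_cluster_status") != "Primary":
        return "PAGE"
    synced = status.get("wsrep_local_state_comment") == "Synced"
    ready = status.get("wsrep_ready") == "ON"
    if synced and ready:
        return "KEEP"
    # Joiner, Donor/Desynced, or not ready: take the node out of rotation.
    return "DRAIN"
```

A missing variable counts as unhealthy here, so a partial status read never keeps a node in the pool.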

Prerequisites & Environment Requirements

The monitor is intentionally low-privilege and read-only. Put the following in place before deploying it.

MariaDB 10.6–11.8 with Galera 4. The status-variable names below are stable across this range. On older 10.3/10.4 builds a handful of variables (notably wsrep_flow_control_paused_ns) may be absent — probe for presence rather than assuming.

A dedicated monitoring user with least privilege. Health probes only need USAGE, which is enough to read the global status counters; never point the monitor at an application or root account.

CREATE USER 'monitor'@'127.0.0.1' IDENTIFIED BY 'MonitorPass!';
GRANT USAGE ON *.* TO 'monitor'@'127.0.0.1';
-- USAGE is enough to run SHOW GLOBAL STATUS; no table grants required.
-- CREATE USER and GRANT take effect immediately, so FLUSH PRIVILEGES is not needed.

A local connection path per node. Run one monitor instance per node against 127.0.0.1 or the unix socket so a network partition never hides a node’s own view of itself. Cross-node polling is a secondary signal, not the primary one.
A scrape target and time-series backend. The reference implementation exposes a Prometheus endpoint on :9104; any OpenMetrics-compatible collector works. Scrape intervals of 10–15 seconds align with the provider’s own group-communication cadence.
Known expected topology. The monitor needs the intended member count (3, 5, 7) so it can flag wsrep_cluster_size drift. Derive it from the same inventory that renders wsrep.cnf in the wsrep.cnf configuration deep dive, so monitoring and configuration never disagree about how many nodes should exist.

Core State Variables and Validation Logic

Production automation must parse and validate a strict subset of the wsrep_% variables — the ones that directly gate write availability and cluster consensus. Threshold definitions have to be tuned to workload, network latency, and hardware provisioning, but the pass/fail meaning of each variable is fixed.

wsrep_cluster_status must remain Primary. A transition to non-Primary indicates quorum loss or a network partition and immediately halts writes until a Primary component is re-established. Force one manually with SET GLOBAL wsrep_provider_options='pc.bootstrap=YES'; only after you have positively confirmed that the surviving set is authoritative.
wsrep_local_state_comment must read Synced in steady state. Joiner (applying a state transfer) and Donor/Desynced (serving SST/IST) are expected only during a lifecycle event; seeing them at any other time signals a degraded node.
wsrep_ready must be ON. OFF means the node is rejecting client statements because of an initialization failure, a state mismatch, or an in-progress transfer.
wsrep_flow_control_paused measures replication backpressure as the fraction of time replication was paused by flow control since the last FLUSH STATUS (or since server start), so it is a running average rather than a per-scrape value. Sustained values above 0.5 mean one slow node is throttling the entire cluster’s write throughput; common causes are disk I/O latency on the slow node or an MTU misconfiguration on the replication path.
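wsrep_local_send_queue and wsrep_local_recv_queue track write-set backlog. A receive queue that grows past gcs.fc_limit triggers flow control, so queue growth correlates with the flow-control pause metric. A node that falls far enough behind also risks an IST-to-SST fallback on its next rejoin, once the donor’s gcache no longer holds the write-sets it is missing.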
wsrep_cluster_size must equal the expected member count, and wsrep_local_cert_failures should trend flat — a climbing certification-failure counter points at tables lacking a primary key or an application issuing conflicting concurrent writes.
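
The Galera documentation describes wsrep_flow_control_paused as an average since the last FLUSH STATUS, which can hide a recent pause behind hours of healthy history. One way to get a per-window figure is to difference the cumulative wsrep_flow_control_paused_ns counter between two scrapes. The sketch below assumes that counter is present on your build; the function name `paused_fraction` is hypothetical.

```python
def paused_fraction(prev_ns: int, curr_ns: int, elapsed_s: float) -> float:
    """Fraction of the elapsed wall-clock window spent paused by flow control.

    prev_ns and curr_ns are two readings of wsrep_flow_control_paused_ns
    taken elapsed_s seconds apart.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    delta_ns = curr_ns - prev_ns
    if delta_ns < 0:
        # Counter went backwards: the node restarted or the counter was reset.
        # Assumption: treat the current reading as the pause time in this window.
        delta_ns = curr_ns
    # Clamp to 1.0 to absorb small timing skew between the two readings.
    return min(1.0, delta_ns / (elapsed_s * 1e9))
```

With a 10-second scrape interval, a counter that advanced by 5,000,000,000 ns yields 0.5, which is directly comparable to the 0.5 threshold used on this page.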

Threshold validation must be enforced programmatically rather than eyeballed on a dashboard. A first baseline is best established right after bootstrapping your first Galera cluster, so the health check exists before the first node ever rejoins under load. Anchor the queue-depth ceilings to the same limits you set for gcs.fc_limit and gcs.fc_factor in wsrep.cnf — monitoring thresholds that disagree with the provider’s own flow-control settings produce alerts that neither predict nor explain real throttling.
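
To anchor the monitor's ceilings to what the provider is actually running, the flow-control settings can be read from the wsrep_provider_options system variable (SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options'), which is a semicolon-separated list of key = value pairs. The parser below is a sketch; the helper names are hypothetical and the sample string in the usage note is illustrative, not output from a real node.

```python
from typing import Dict, Tuple


def parse_provider_options(raw: str) -> Dict[str, str]:
    """Split a wsrep_provider_options string into a key/value map."""
    options: Dict[str, str] = {}
    for item in raw.split(";"):
        key, sep, value = item.partition("=")
        if sep:  # skip empty fragments such as a trailing ';'
            options[key.strip()] = value.strip()
    return options


def fc_settings(raw: str) -> Tuple[int, float]:
    """Return (gcs.fc_limit, gcs.fc_factor); raises KeyError if either is absent."""
    options = parse_provider_options(raw)
    return int(options["gcs.fc_limit"]), float(options["gcs.fc_factor"])
```

For example, `fc_settings("gcs.fc_factor = 0.8; gcs.fc_limit = 256")` returns `(256, 0.8)`, and the first element can be passed to the monitor as its queue-depth ceiling.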

Step-by-Step Procedure

Stand the monitor up in the order below. Each step exists to protect a specific property of the signal, noted in the why.

Create the least-privilege monitoring user (shown in prerequisites). Why: a compromised or buggy monitor must never be able to write to or read application data; USAGE is sufficient for every probe on this page.

Confirm the variables are readable over a plain connection. Run the raw query the monitor will run, so a permission or socket problem surfaces before you deploy code:

mariadb -umonitor -p -h127.0.0.1 -N -e \
  "SHOW GLOBAL STATUS WHERE Variable_name IN \
   ('wsrep_cluster_status','wsrep_local_state_comment','wsrep_ready', \
    'wsrep_flow_control_paused','wsrep_local_send_queue', \
    'wsrep_local_recv_queue','wsrep_cluster_size','wsrep_local_cert_failures')"

Why: validating the exact query out-of-band isolates connectivity and grant issues from bugs in the exporter.

Deploy the exporter as a per-node service. Run one instance on each member, bound to that node’s loopback. Why: a node’s own loopback view survives a partition that would make it unreachable from a central poller, so you never lose sight of the node that most needs watching.
Define alert rules that require duration, not a single sample. Encode “wsrep_cluster_status != Primary for > 30s” and “wsrep_flow_control_paused > 0.5 for > 2m” as sustained conditions. Why: a graceful join legitimately passes through Joiner and a momentary pause is normal under a write spike — alerting on a single bad sample produces noise that trains operators to ignore the page.
Wire the verdict back into the router and the automation loop. A DRAIN verdict removes the node from the ProxySQL/HAProxy pool; a PAGE verdict opens an incident. Why: telemetry that no system acts on is a dashboard, not monitoring — the value is in the automated response.
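
The duration requirement in the alert-rule step is usually expressed in the alerting layer (in Prometheus, the `for:` clause of an alerting rule). Where a verdict has to be debounced in-process instead, for example before draining a node from the router, a small hold timer is enough. The class below is a sketch with a hypothetical name; the clock is injectable only so the behaviour can be tested without waiting.

```python
import time
from typing import Callable, Optional


class SustainedCondition:
    """Reports True only after a condition has held continuously for hold_s seconds."""

    def __init__(self, hold_s: float, clock: Callable[[], float] = time.monotonic):
        self.hold_s = hold_s
        self.clock = clock
        self._since: Optional[float] = None  # when the condition first became bad

    def update(self, is_bad: bool) -> bool:
        now = self.clock()
        if not is_bad:
            self._since = None  # any good sample resets the window
            return False
        if self._since is None:
            self._since = now
        return now - self._since >= self.hold_s
```

A typical use is `quorum_lost = SustainedCondition(30)` followed by `quorum_lost.update(cluster_status != "Primary")` on every scrape, acting only when it returns True.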

Executable Monitoring Implementation

The exporter below queries MariaDB with PyMySQL, validates the state variables, and exposes them for scraping. It targets Python 3.9+, follows the Python Database API v2.0 (PEP 249) connection model, and retries on the two server error codes an automation layer hits most, 1213 (deadlock, which is also how Galera reports a certification conflict) and 1205 (lock wait timeout), rather than crashing the scrape loop. For deeper integration patterns and packaging, see monitoring Galera cluster state with Python.

#!/usr/bin/env python3
"""
Production-grade Galera health monitor.
Exposes wsrep metrics via a Prometheus HTTP endpoint and logs deterministic alerts.
Dependencies: pip install pymysql prometheus-client
"""

import logging
import sys
import time
from dataclasses import dataclass
from typing import Dict

import pymysql
from prometheus_client import Gauge, Info, start_http_server

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    stream=sys.stdout,
)

TRANSIENT_WSREP_ERRORS = (1213, 1205)  # deadlock / lock wait timeout


@dataclass
class GaleraHealthConfig:
    host: str = "127.0.0.1"
    port: int = 3306
    user: str = "monitor"
    password: str = "MonitorPass!"
    expected_cluster_size: int = 3
    fc_threshold: float = 0.5
    queue_depth_limit: int = 256  # keep at or below gcs.fc_limit
    scrape_interval: int = 10


class GaleraHealthMonitor:
    def __init__(self, config: GaleraHealthConfig):
        self.config = config
        self.metrics = {
            "cluster_status": Info("galera_cluster_status", "Current cluster consensus state"),
            "local_state": Info("galera_local_state", "Node synchronization state"),
            "ready": Gauge("galera_node_ready", "1 if node accepts writes, 0 otherwise"),
            "flow_control_paused": Gauge("galera_flow_control_paused", "Fraction of interval flow control paused"),
            "send_queue": Gauge("galera_send_queue", "Local write-set send queue depth"),
            "recv_queue": Gauge("galera_recv_queue", "Local write-set receive queue depth"),
            "cert_failures": Gauge("galera_cert_failures", "Local certification failures"),
            "cluster_size": Gauge("galera_cluster_size", "Current cluster node count"),
            "healthy": Gauge("galera_node_healthy", "1 if Synced in a Primary component, else 0"),
        }

    def fetch_status(self) -> Dict[str, str]:
        """Return the wsrep_% status map, retrying past transient wsrep errors."""
        last_exc = None
        for _ in range(3):
            try:
                conn = pymysql.connect(
                    host=self.config.host,
                    port=self.config.port,
                    user=self.config.user,
                    password=self.config.password,
                    cursorclass=pymysql.cursors.DictCursor,
                    connect_timeout=5,
                    read_timeout=5,
                    autocommit=True,
                )
                try:
                    with conn.cursor() as cursor:
                        cursor.execute("SHOW GLOBAL STATUS LIKE 'wsrep_%'")
                        return {r["Variable_name"]: r["Value"] for r in cursor.fetchall()}
                finally:
                    conn.close()
            except pymysql.err.OperationalError as exc:
                last_exc = exc
                if exc.args and exc.args[0] in TRANSIENT_WSREP_ERRORS:
                    logging.warning("Transient wsrep error %s, retrying: %s", exc.args[0], exc)
                    time.sleep(2)
                    continue
                raise
        raise last_exc if last_exc else RuntimeError("status fetch failed")

    def evaluate_and_export(self, status: Dict[str, str]) -> None:
        cluster_status = status.get("wsrep_cluster_status", "Unknown")
        local_state = status.get("wsrep_local_state_comment", "Unknown")
        ready = status.get("wsrep_ready") == "ON"
        fc_paused = float(status.get("wsrep_flow_control_paused", 0) or 0)
        send_q = int(status.get("wsrep_local_send_queue", 0) or 0)
        recv_q = int(status.get("wsrep_local_recv_queue", 0) or 0)
        size = int(status.get("wsrep_cluster_size", 0) or 0)

        self.metrics["cluster_status"].info({"state": cluster_status})
        self.metrics["local_state"].info({"state": local_state})
        self.metrics["ready"].set(1 if ready else 0)
        self.metrics["flow_control_paused"].set(fc_paused)
        self.metrics["send_queue"].set(send_q)
        self.metrics["recv_queue"].set(recv_q)
        self.metrics["cert_failures"].set(int(status.get("wsrep_local_cert_failures", 0) or 0))
        self.metrics["cluster_size"].set(size)

        healthy = cluster_status == "Primary" and local_state == "Synced" and ready
        self.metrics["healthy"].set(1 if healthy else 0)

        # Deterministic alert evaluation (duration handled by the alerting layer).
        if cluster_status != "Primary":
            logging.critical("QUORUM_LOST: wsrep_cluster_status=%s", cluster_status)
        if fc_paused > self.config.fc_threshold:
            logging.warning("BACKPRESSURE: flow control paused %.2f > %.2f", fc_paused, self.config.fc_threshold)
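        if send_q > self.config.queue_depth_limit:
            logging.warning("QUEUE_DEPTH: send queue %d exceeds %d", send_q, self.config.queue_depth_limit)
        if recv_q > self.config.queue_depth_limit:
            logging.warning("QUEUE_DEPTH: recv queue %d exceeds %d", recv_q, self.config.queue_depth_limit)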
        if size != self.config.expected_cluster_size:
            logging.warning("SIZE_MISMATCH: expected %d, got %d", self.config.expected_cluster_size, size)


def main() -> None:
    config = GaleraHealthConfig()
    monitor = GaleraHealthMonitor(config)
    start_http_server(9104)
    logging.info("Galera health monitor started on :9104")

    while True:
        try:
            monitor.evaluate_and_export(monitor.fetch_status())
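        except pymysql.err.MySQLError as exc:
            # During SST the socket may be unavailable: surface the error but keep polling.
            # Clear the health gauge so a failed scrape never leaves a stale 1 behind.
            monitor.metrics["healthy"].set(0)
            logging.error("Monitoring cycle failed: %s", exc)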
        time.sleep(config.scrape_interval)


if __name__ == "__main__":
    main()

Parameter Deep-Dive

These are the knobs that most change how the health signal behaves. The provider-side flow-control values must match the ceilings your monitor alerts on; the scrape-side values control how quickly a real fault becomes a page without generating noise.

Parameter	Recommended value	Why it matters for monitoring
`gcs.fc_limit`	`256` (128–512 for busy OLTP)	The write-set queue depth that triggers throttling. Your `queue_depth_limit` alert must sit at or below this — alerting above it means you page only after the group is already paused.
`gcs.fc_factor`	`0.8`	Flow control releases when the queue drains to this fraction of `fc_limit`. It sets how long a pause lasts, which is why the pause alert needs a duration window, not a single sample.
`wsrep_flow_control_paused`	alert > `0.5` for > 2m	Fraction of time spent paused since the last `FLUSH STATUS` (a running average, not a per-scrape value). A sustained value here is the single clearest sign that one node is throttling cluster-wide writes.
`evs.inactive_timeout`	`PT15S`	How long the group waits before evicting a silent member. Set your “node missing” alert longer than this so a normal eviction-and-rejoin does not double-fire.
`scrape_interval`	`10s`	Poll cadence. Align it with the provider’s group-communication period; faster adds load without new information, slower blinds you during a fast SST.
`expected_cluster_size`	your member count (3/5/7)	Ground truth for `wsrep_cluster_size` drift detection. Drive it from inventory so a planned scale-out updates monitoring and `wsrep.cnf` together.

Group gcs.* and evs.* values under wsrep_provider_options; the deep tuning for low-latency EVS behavior pairs with configuring wsrep_provider_options for low latency.
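
The two ordering constraints in the table (queue alert at or below gcs.fc_limit, node-missing alert longer than evs.inactive_timeout) are easy to check when the configuration is rendered. The sketch below assumes the timeout is written in the ISO 8601 form shown in the table, such as PT15S; the function and parameter names are hypothetical.

```python
import re
from typing import List

# Matches simple ISO 8601 time durations such as PT15S, PT1M30S, or PT0.5S.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?$")


def duration_seconds(value: str) -> float:
    """Convert a PT...-style duration into seconds."""
    match = _DURATION.match(value)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes or 0) * 60 + float(seconds or 0)


def check_thresholds(queue_depth_limit: int, fc_limit: int,
                     node_missing_alert_s: float, inactive_timeout: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems: List[str] = []
    if queue_depth_limit > fc_limit:
        problems.append("queue_depth_limit is above gcs.fc_limit")
    if node_missing_alert_s <= duration_seconds(inactive_timeout):
        problems.append("node-missing alert is not longer than evs.inactive_timeout")
    return problems
```

Running this in the same templating step that renders wsrep.cnf would fail the rollout early rather than shipping thresholds that cannot fire before the cluster has already paused.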

Verification & Health Checks

Confirm the monitor and the node it watches agree. First, read the ground-truth state directly:

-- Must read: Primary
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';
-- Must read: Synced
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';
-- Must read: ON
SHOW GLOBAL STATUS LIKE 'wsrep_ready';
-- Must equal your expected member count
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';
-- Should be 0 in steady state
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused';

Then confirm the exporter is up and the derived health gauge matches reality:

systemctl is-active galera-health-monitor        # active
curl -s localhost:9104/metrics | grep -E '^galera_(node_healthy|cluster_size|flow_control_paused) '

A healthy Synced node in a full Primary component must expose galera_node_healthy 1.0 and a galera_cluster_size equal to your member count. If the SQL says Synced but the gauge reads 0, the exporter cannot reach the node — check the loopback grant and socket path from the prerequisites before trusting any alert.
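
The comparison between the SQL ground truth and the exporter can also be scripted. The sketch below parses unlabelled samples out of the Prometheus text format using only the standard library and checks them against a status map; the function names are hypothetical, and the metric names are the ones the exporter on this page defines.

```python
from typing import Dict, Mapping


def parse_gauges(text: str) -> Dict[str, float]:
    """Collect unlabelled samples from Prometheus text exposition output."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip comments, blanks, and labelled series such as the Info metrics.
        if not line or line.startswith("#") or "{" in line:
            continue
        parts = line.split()
        if len(parts) >= 2:
            samples[parts[0]] = float(parts[1])
    return samples


def exporter_agrees(status: Mapping[str, str], samples: Mapping[str, float]) -> bool:
    """True when the exported gauges match what the node itself reports."""
    sql_healthy = (
        status.get("wsrep_cluster_status") == "Primary"
        and status.get("wsrep_local_state_comment") == "Synced"
        and status.get("wsrep_ready") == "ON"
    )
    return (
        samples.get("galera_node_healthy") == (1.0 if sql_healthy else 0.0)
        and samples.get("galera_cluster_size") == float(status.get("wsrep_cluster_size", 0))
    )
```

Feed `parse_gauges` the body returned by the curl command above and `exporter_agrees` the rows from the SQL checks; a False result means the exporter and the node disagree and alerts should not be trusted yet.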

Automation Integration

Deploy the exporter as a systemd unit rendered from the same inventory that provisions the fleet, so a new node is monitored the moment it exists:

[Unit]
Description=Galera wsrep health exporter
After=network-online.target mariadb.service
Wants=network-online.target

[Service]
ExecStart=/usr/bin/python3 /opt/galera/health_monitor.py
Restart=always
RestartSec=5
User=galera-monitor
NoNewPrivileges=true

[Install]
WantedBy=multi-user.target

Drive rollout from the Ansible role that already renders wsrep.cnf per host — see automating node provisioning with Ansible — so expected_cluster_size and the flow-control ceilings are templated from the same group variables the provider uses. In a Terraform-managed fleet, gate the load-balancer target-group attachment on galera_node_healthy == 1 rather than on a raw TCP check, so an instance is only added to rotation once it is genuinely Synced. The same gauge is the signal a rolling maintenance run should wait on between nodes during a graceful node join and leave sequence, keeping the SST/IST ordering strictly serialized.
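
A rolling maintenance run can wait on the gauge with a loop like the one below. It is a sketch under assumptions: `read_gauge` stands for any callable that returns the current galera_node_healthy value (for example by scraping :9104), and the requirement of three consecutive healthy samples and the 30-minute timeout are illustrative defaults, not values from this guide.

```python
import time
from typing import Callable


def wait_until_healthy(
    read_gauge: Callable[[], float],
    required: int = 3,
    interval_s: float = 10.0,
    timeout_s: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Block until the gauge reads 1.0 for `required` consecutive polls."""
    deadline = clock() + timeout_s
    streak = 0
    while clock() < deadline:
        # Any unhealthy sample resets the streak, so a flapping node never passes.
        streak = streak + 1 if read_gauge() == 1.0 else 0
        if streak >= required:
            return True
        sleep(interval_s)
    return False
```

Returning False on timeout lets the caller stop the rollout instead of moving on to the next node while the previous one is still a Joiner.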

Alert routing should map each verdict to a specific runbook. A non-Primary page routes to the quorum-recovery procedure; a startup-failure page routes to handling Galera startup errors and logs to decode the raw provider output; a sustained-backpressure alert routes to capacity review, because a node that constantly triggers flow control is usually starved of the I/O bandwidth described in the Galera cluster hardware requirements.
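
The routing described above can be kept as data next to the exporter so that every alert tag has an explicit destination. This is a sketch: QUORUM_LOST and BACKPRESSURE are the tags the exporter logs, STARTUP_FAILURE is a hypothetical tag, and the fallback string is an assumed policy.

```python
# Alert tag -> runbook, mirroring the routing described in the text.
RUNBOOKS = {
    "QUORUM_LOST": "quorum-recovery procedure",
    "STARTUP_FAILURE": "handling Galera startup errors and logs",  # hypothetical tag
    "BACKPRESSURE": "capacity review",
}


def route(alert: str) -> str:
    """Return the runbook for an alert tag, or an explicit fallback."""
    return RUNBOOKS.get(alert, "unrouted: escalate to on-call")
```

An explicit fallback makes an unmapped alert visible instead of silently dropping it.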

Troubleshooting

These are the failure signatures specific to health monitoring, with the exact remediation for each.

The monitor reports the node healthy, but applications get write errors. The exporter is reading Synced while clients see WSREP has not yet prepared node for application use. Root cause: the monitor is probing a different endpoint than the one clients use (a second socket or another instance’s port), or wsrep_ready flipped to OFF between scrapes. Confirm by running the raw SHOW GLOBAL STATUS LIKE 'wsrep_ready' yourself over the client path; if it reads OFF, the node is mid-transfer and must be drained, not served.
galera_flow_control_paused sits high on one node only. That member is the bottleneck throttling the whole group. Root cause: its disk cannot keep up with the apply rate, or gcs.fc_limit is set too low for the workload. Correlate the pause metric with wsrep_local_recv_queue growth and node I/O; raise gcs.fc_limit toward 256+ and verify the node’s storage meets the hardware requirements.
wsrep_cluster_size disagrees between nodes. Two members report different counts at the same instant. Root cause: a partition has split the group into components that each think they are the surviving cluster. Do not pc.bootstrap blindly — identify which component holds quorum (floor(N/2)+1) first, then recover the minority side by letting it rejoin the authoritative component.
Scrape fails intermittently with OperationalError 1205. The monitor query times out waiting on a lock. Root cause: the node is a busy SST donor or is under a long-running transaction. The exporter already retries 1205/1213; if it persists, the node is likely Donor/Desynced — treat that as an expected drain state, not an outage.
Prometheus shows the target down while the node is fine. The exporter process died but MariaDB is healthy. Root cause: an unhandled exception or a killed unit. systemctl status galera-health-monitor and the unit’s Restart=always should recover it; if it crash-loops, run health_monitor.py in the foreground to capture the traceback rather than trusting the metric’s absence as a node fault.
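
For the wsrep_cluster_size disagreement case above, the quorum rule can be written down so the decision is not made by eye during an incident. The sketch assumes every node carries equal weight and that quorum is judged against the intended member count; the function names and the component labels in the usage note are hypothetical.

```python
from typing import List, Mapping


def quorum_size(expected_members: int) -> int:
    """Smallest component size that holds a majority: floor(N/2) + 1."""
    return expected_members // 2 + 1


def authoritative_components(component_sizes: Mapping[str, int],
                             expected_members: int) -> List[str]:
    """Return the components large enough to hold quorum."""
    need = quorum_size(expected_members)
    return [name for name, size in component_sizes.items() if size >= need]
```

For a three-node cluster split two and one, `authoritative_components({"dc1": 2, "dc2": 1}, 3)` returns `["dc1"]`. An empty result, as in an even two-and-two split of four nodes, means no side holds a majority and a manual decision is required.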

Frequently Asked Questions

Why isn’t a TCP check or SELECT 1 enough to gate the load balancer? Because both pass on a node that cannot serve writes. A Joiner applying an SST answers the socket and, once far enough along, even runs SELECT 1, yet it rejects every write with WSREP has not yet prepared node for application use. Only the combination of wsrep_cluster_status = Primary, wsrep_local_state_comment = Synced, and wsrep_ready = ON proves the node is safe for traffic, which is exactly the galera_node_healthy gauge the exporter derives.

What threshold should trigger a page versus just a warning? Gate on duration, not a single sample. wsrep_cluster_status != Primary sustained for more than about 30 seconds is a P1 page — writes are being rejected cluster-wide. wsrep_flow_control_paused > 0.5 for more than two minutes is a warning that points at a capacity or I/O problem on one node. A momentary Joiner state or a brief pause during a write spike is normal and should never page on its own.

Should I run one central monitor or one per node? Run one exporter per node, bound to that node’s loopback. A single central poller loses sight of exactly the member you most need to watch the instant a partition isolates it, and it cannot distinguish “node down” from “node unreachable from the poller.” Per-node loopback probes plus a central view of wsrep_cluster_size across all members gives you both the local truth and the group-level consensus picture.

Related

Galera Cluster Setup & Node Management — the parent guide covering the full node lifecycle
Monitoring Galera Cluster State with Python — packaging and extending the wsrep exporter
Graceful Node Join and Leave Procedures — the health gauge that gates safe lifecycle transitions
Handling Galera Startup Errors & Logs — decoding the provider output an alert routes you to
wsrep.cnf Configuration Deep Dive — the flow-control and EVS parameters your thresholds must match
