Automating Node Provisioning with Ansible for MariaDB Galera Clusters
Manual provisioning of MariaDB Galera nodes introduces configuration drift, state transfer race conditions, and inconsistent wsrep topology tracking. For platform teams managing multi-master synchronization at scale, Ansible delivers deterministic, idempotent orchestration that aligns with infrastructure-as-code standards. This guide walks through production-grade automation workflows, including specific error codes, edge-case recovery, and Python/CLI integrations. We will cover inventory topology mapping, templated configuration deployment, bootstrap state management, and automated SST validation.
Topology Mapping & Deterministic Inventory Design
Galera provisioning requires strict sequencing. The initial node must bootstrap with wsrep_cluster_address=gcomm://, while subsequent nodes join via gcomm://<seed_ip>:4567. Ansible inventory must reflect this topology using explicit host variables and group ordering so that parallel execution never bootstraps more than one node, which would leave two independent clusters instead of one.
A static YAML inventory with ansible_host, galera_role, and wsrep_node_name ensures deterministic execution order. Platform teams should structure groups to isolate bootstrap candidates from joiners:
# inventory/galera_nodes.yml
all:
  children:
    galera_cluster:
      hosts:
        galera-node-01:
          ansible_host: 10.0.1.10
          galera_role: bootstrap
          wsrep_node_name: node01
          wsrep_node_address: 10.0.1.10
        galera-node-02:
          ansible_host: 10.0.1.11
          galera_role: joiner
          wsrep_node_name: node02
          wsrep_node_address: 10.0.1.11
        galera-node-03:
          ansible_host: 10.0.1.12
          galera_role: joiner
          wsrep_node_name: node03
          wsrep_node_address: 10.0.1.12
When designing the configuration layer, parameter mapping and state machine behavior must align with the wsrep.cnf Configuration Deep Dive framework before committing Jinja2 templates to version control. Dynamic inventory generators written in Python can parse CMDB or cloud provider APIs to inject galera_role dynamically, but must enforce a single bootstrap candidate per cluster lifecycle.
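The single-bootstrap rule for dynamic inventories can be sketched in Python as follows. This is a minimal illustration, not a complete inventory plugin: the node records (with the hypothetical keys name, ip, and role) stand in for whatever a CMDB or cloud API would return, and the node-name numbering is an assumption chosen to match the static inventory above.

```python
"""Sketch of a dynamic inventory builder that enforces one bootstrap node."""
import json


def build_inventory(nodes):
    """Build an Ansible dynamic-inventory dict from a list of node records.

    Each record is a dict with the hypothetical keys: name, ip, role.
    Raises ValueError unless exactly one node has the role 'bootstrap'.
    """
    bootstrap = [n["name"] for n in nodes if n["role"] == "bootstrap"]
    if len(bootstrap) != 1:
        raise ValueError(
            f"expected exactly one bootstrap node, found {len(bootstrap)}: {bootstrap}"
        )
    hostvars = {}
    for index, node in enumerate(nodes, start=1):
        hostvars[node["name"]] = {
            "ansible_host": node["ip"],
            "galera_role": node["role"],
            "wsrep_node_name": f"node{index:02d}",
            "wsrep_node_address": node["ip"],
        }
    # Group plus _meta.hostvars is the JSON layout Ansible expects
    # from an inventory script.
    return {
        "galera_cluster": {"hosts": list(hostvars)},
        "_meta": {"hostvars": hostvars},
    }


if __name__ == "__main__":
    # Placeholder records; a real generator would query a CMDB or cloud API.
    sample = [
        {"name": "galera-node-01", "ip": "10.0.1.10", "role": "bootstrap"},
        {"name": "galera-node-02", "ip": "10.0.1.11", "role": "joiner"},
        {"name": "galera-node-03", "ip": "10.0.1.12", "role": "joiner"},
    ]
    print(json.dumps(build_inventory(sample), indent=2))
```

Failing fast inside the generator means a misconfigured source of truth stops the run before any node is touched.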
Pre-Flight Validation & Parameter Templating
A modular role structure isolates OS hardening, package installation, configuration templating, and service orchestration. The galera_node role should expose variables for galera_cluster_name, wsrep_sst_method, and wsrep_provider_options. Templating must enforce strict validation of wsrep.cnf parameters before any service restart.
The playbook must execute a pre-flight validation that checks firewalld/iptables rules for ports 3306, 4444, 4567, and 4568, verifies socat and rsync binaries, and confirms mariabackup version compatibility across all targets:
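- name: Validate Galera prerequisites
  block:
    - name: Ensure required ports are open
      ansible.builtin.iptables:
        chain: INPUT
        protocol: tcp
        destination_port: "{{ item }}"
        jump: ACCEPT
      loop: [3306, 4444, 4567, 4568]
      register: firewall_rules
    - name: Verify SST binaries and versions
      ansible.builtin.command: "{{ item }} --version"
      loop: [mariabackup, socat, rsync]
      changed_when: false
      register: sst_binaries
    - name: Fail if mariabackup version mismatch detected
      ansible.builtin.fail:
        msg: "mariabackup version incompatible across cluster nodes"
      vars:
        # Compare this host's mariabackup banner with the first play host's banner.
        local_result: "{{ sst_binaries.results | selectattr('item', 'equalto', 'mariabackup') | first }}"
        reference_result: "{{ hostvars[ansible_play_hosts[0]].sst_binaries.results | selectattr('item', 'equalto', 'mariabackup') | first }}"
      # The banner may land on stdout or stderr, so compare both streams together.
      when: >
        (local_result.stdout ~ local_result.stderr)
        != (reference_result.stdout ~ reference_result.stderr)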
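For teams that prefer to evaluate version compatibility outside the playbook, the comparison can be sketched in Python. This is a sketch under stated assumptions: it treats two mariabackup builds as compatible when their major.minor series match (a policy choice, not a rule taken from the MariaDB documentation), and it assumes the version banner contains a dotted three-part version number.

```python
import re
from collections import Counter

# Matches the first dotted three-part version number, e.g. 10.6.12.
VERSION_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)")


def major_minor(version_output):
    """Extract (major, minor) from a `--version` banner string."""
    match = VERSION_RE.search(version_output)
    if match is None:
        raise ValueError(f"no version number found in: {version_output!r}")
    return int(match.group(1)), int(match.group(2))


def find_mismatches(outputs_by_host):
    """Return {host: (major, minor)} for hosts that differ from the majority."""
    series = {host: major_minor(out) for host, out in outputs_by_host.items()}
    reference = Counter(series.values()).most_common(1)[0][0]
    return {host: value for host, value in series.items() if value != reference}


if __name__ == "__main__":
    # Illustrative banners only; collect real ones from each target host.
    banners = {
        "galera-node-01": "mariabackup based on MariaDB server 10.6.12-MariaDB Linux (x86_64)",
        "galera-node-02": "mariabackup based on MariaDB server 10.6.14-MariaDB Linux (x86_64)",
        "galera-node-03": "mariabackup based on MariaDB server 10.11.2-MariaDB Linux (x86_64)",
    }
    print(find_mismatches(banners))
```

Comparing against the majority series rather than a hard-coded version keeps the check valid across upgrades.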
Configuration templating should leverage ansible.builtin.template with validate directives to catch syntax errors before deployment:
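- name: Deploy wsrep.cnf
  ansible.builtin.template:
    src: templates/wsrep.cnf.j2
    dest: /etc/my.cnf.d/wsrep.cnf
    owner: root
    group: root
    mode: '0644'
    # --defaults-file must be the first option. MariaDB has no --validate-config
    # (that flag is MySQL-only), so parse the file with --verbose --help instead.
    validate: 'mysqld --defaults-file=%s --verbose --help'
  notify: Restart MariaDB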
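The wsrep_cluster_address value that the template renders can be derived from the inventory rather than typed by hand. The following Python sketch shows one way to build it; the function name and the choice to list every known node address are assumptions for illustration, while the gcomm:// forms and port 4567 come from the sequencing rules described earlier.

```python
def cluster_address(node_addresses, port=4567, bootstrap=False):
    """Return the wsrep_cluster_address value for a node.

    A bootstrapping node uses the empty gcomm:// address; every other node
    lists the addresses it can contact to join the cluster.
    """
    if bootstrap:
        return "gcomm://"
    if not node_addresses:
        raise ValueError("a joiner needs at least one address to contact")
    return "gcomm://" + ",".join(f"{addr}:{port}" for addr in node_addresses)


if __name__ == "__main__":
    print(cluster_address([], bootstrap=True))
    print(cluster_address(["10.0.1.10", "10.0.1.11", "10.0.1.12"]))
```

Listing more than one address lets a joiner reach the cluster even when the original seed node is down.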
Bootstrap State Machine & Sequencing
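Bootstrap logic requires explicit state detection. The playbook must parse /var/lib/mysql/grastate.dat for safe_to_bootstrap: 1. On a freshly installed node the file does not exist yet, and after a clean full-cluster shutdown only the last node to stop carries the flag. In both cases the first node can execute galera_new_cluster or temporarily override wsrep_cluster_address='gcomm://' in a systemd drop-in. If the flag is 0, the node must not bootstrap automatically: galera_new_cluster refuses to start, and forcing it risks discarding newer transactions held by another node. Subsequent nodes join with the full cluster address.
- name: Detect bootstrap state
  ansible.builtin.command: "grep -q 'safe_to_bootstrap: 1' /var/lib/mysql/grastate.dat"
  register: grastate_check
  changed_when: false
  failed_when: false
- name: Bootstrap initial node
  ansible.builtin.command: galera_new_cluster
  when:
    - galera_role == 'bootstrap'
    # grep exits 0 when the flag is 1 and 2 when grastate.dat is missing (fresh node).
    - grastate_check.rc in [0, 2]
  notify: Wait for cluster sync
- name: Create systemd drop-in directory for joiners
  ansible.builtin.file:
    path: /etc/systemd/system/mariadb.service.d
    state: directory
    owner: root
    group: root
    mode: '0755'
  when: galera_role == 'joiner'
- name: Configure systemd drop-in for joiners
  ansible.builtin.copy:
    dest: /etc/systemd/system/mariadb.service.d/galera-join.conf
    content: |
      [Service]
      ExecStart=
      ExecStart=/usr/sbin/mariadbd --wsrep-cluster-address=gcomm://10.0.1.10:4567
    owner: root
    group: root
    mode: '0644'
  when: galera_role == 'joiner'
  notify: Reload systemd & Start MariaDB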
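When a grep exit code is too coarse, the state file can be parsed directly. The Python sketch below reads the key: value lines of grastate.dat and returns a decision. It relies on Galera's documented behaviour of setting safe_to_bootstrap: 1 only on the node that left the cluster last; the function names and the "bootstrap"/"investigate" labels are hypothetical.

```python
def parse_grastate(text):
    """Parse the contents of grastate.dat into a dict of strings."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header comment
        key, sep, value = line.partition(":")
        if sep:
            state[key.strip()] = value.strip()
    return state


def bootstrap_decision(text):
    """Decide what a bootstrap candidate should do.

    `text` is the file content, or None when grastate.dat does not exist
    (a freshly installed node with no cluster history).
    """
    if text is None:
        return "bootstrap"
    if parse_grastate(text).get("safe_to_bootstrap") == "1":
        return "bootstrap"
    # Flag is 0 or missing: another node may hold newer transactions.
    return "investigate"


if __name__ == "__main__":
    sample = (
        "# GALERA saved state\n"
        "version: 2.1\n"
        "uuid:    6b1b8a5e-0000-0000-0000-000000000000\n"
        "seqno:   -1\n"
        "safe_to_bootstrap: 0\n"
    )
    print(parse_grastate(sample))
    print(bootstrap_decision(sample))
```

Returning "investigate" instead of forcing the flag keeps the automation from promoting a node that may be behind its peers.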
Service orchestration should use systemd handlers with state=started and enabled=yes, paired with ansible.builtin.wait_for tasks polling port 4567 and a follow-up check of wsrep_cluster_status via the MySQL CLI, since an open port alone does not prove the node has joined the primary component. Refer to the official MariaDB Galera system variables documentation for parameter precedence during state transitions.
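The status half of that wait can be sketched in Python. The example polls wsrep_cluster_status through the mysql CLI until it reports Primary or a deadline passes. It assumes the mysql client can authenticate without extra flags (for example through a client option file); the timeout and interval defaults are illustrative.

```python
import subprocess
import time


def query_cluster_status():
    """Return wsrep_cluster_status via the mysql CLI, or "UNKNOWN" on error."""
    try:
        result = subprocess.run(
            ["mysql", "-N", "-B", "-e",
             "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';"],
            capture_output=True, text=True, timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return "UNKNOWN"
    fields = result.stdout.split()
    if result.returncode == 0 and len(fields) == 2:
        return fields[1]
    return "UNKNOWN"


def wait_for_primary(query=query_cluster_status, timeout=120.0, interval=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until the node reports Primary; return False on timeout."""
    deadline = clock() + timeout
    while True:
        if query() == "Primary":
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    raise SystemExit(0 if wait_for_primary() else 1)
```

The query, sleep, and clock parameters are injectable so the loop can be exercised without a running database.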
SST Interception & Edge-Case Recovery
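Automated provisioning frequently encounters SST failures and state mismatch errors. Ansible must intercept these via retry and failure conditions and execute targeted recovery playbooks. A WSREP log entry carrying error 113 corresponds on Linux to errno 113 (EHOSTUNREACH, "No route to host") and points to blocked ports or routing problems between joiner and donor, whereas SST timeouts and stalls more often stem from network MTU mismatches, donor throttling, or flow-control settings such as gcs.fc_limit.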
Set wsrep_provider_options="socket.ssl=NO; socket.checksum=1; gcs.fc_limit=256; gcs.fc_factor=0.8" in the template and wrap the join in a bounded retry loop. Ansible's retries, delay, and until keywords wait a constant interval between attempts rather than backing off exponentially:
- name: Join cluster with SST retry logic
  ansible.builtin.systemd:
    name: mariadb
    state: started
  register: join_result
  retries: 3
  delay: 15
  # Retry on any start failure. Overriding failed_when here would mark a failed
  # attempt as successful, satisfy the until condition, and end the loop early.
  until: join_result is not failed
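The fixed-delay behaviour can be made explicit with a small Python sketch. It loosely mirrors a bounded retry loop and is not a reimplementation of Ansible's task executor: the function name, the max_attempts parameter, and the return convention are assumptions for illustration.

```python
import time


def retry_fixed(action, max_attempts=3, delay=15.0, sleep=time.sleep):
    """Call action() until it returns True, at most max_attempts times.

    Waits the same `delay` between attempts (no exponential backoff).
    Returns the 1-based attempt number that succeeded, or 0 if none did.
    """
    for attempt in range(1, max_attempts + 1):
        if action():
            return attempt
        if attempt < max_attempts:
            sleep(delay)  # constant wait before the next attempt
    return 0


if __name__ == "__main__":
    # Stand-in for "start mariadb and check that the node joined".
    outcomes = iter([False, True])
    print(retry_fixed(lambda: next(outcomes), max_attempts=3, delay=0.1))
```

A constant delay keeps the worst-case run time easy to reason about: at most (max_attempts - 1) * delay seconds of waiting.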
For Python automation builders, integrating a custom health-check script via ansible.builtin.script provides granular diagnostics. The script can parse SHOW GLOBAL STATUS LIKE 'wsrep_%' and return structured JSON for Ansible’s changed_when evaluation:
#!/usr/bin/env python3
import json
import subprocess
import sys


def check_wsrep_sync():
    # -N suppresses the header row; the CLI prints "name<TAB>value".
    result = subprocess.run(
        ["mysql", "-N", "-e", "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"],
        capture_output=True, text=True
    )
    fields = result.stdout.strip().split('\t')
    # Fall back to UNKNOWN when the query fails or returns no usable row.
    state = fields[1] if result.returncode == 0 and len(fields) == 2 else "UNKNOWN"
    print(json.dumps({"wsrep_state": state, "synced": state == "Synced"}))
    sys.exit(0 if state == "Synced" else 1)


if __name__ == "__main__":
    check_wsrep_sync()
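To report on every wsrep_% variable rather than a single one, the tab-separated CLI output can be turned into a dictionary first. The following Python sketch shows the parsing step only; it assumes the output was produced with mysql -N (no header row), and the function name is hypothetical.

```python
import json


def parse_status(output):
    """Parse `mysql -N -e "SHOW GLOBAL STATUS LIKE 'wsrep_%'"` output.

    Each line is expected to be "name<TAB>value"; other lines are skipped.
    """
    status = {}
    for line in output.splitlines():
        name, sep, value = line.partition("\t")
        if sep:
            status[name.strip()] = value.strip()
    return status


if __name__ == "__main__":
    # Illustrative output; real values come from the running node.
    sample = "wsrep_cluster_size\t3\nwsrep_local_state_comment\tSynced\n"
    print(json.dumps(parse_status(sample)))
```

Keeping the parser separate from the subprocess call makes it easy to unit-test against captured output.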
Production-Safe Execution & Idempotency
Production deployments require strict idempotency and safe rollback paths. Use changed_when to suppress false positives during configuration drift checks, and pair ansible.builtin.command with creates or removes flags where applicable. Always validate cluster quorum before marking a provisioning run as successful:
- name: Verify cluster quorum
  ansible.builtin.shell: |
    mysql -N -e "SELECT VARIABLE_VALUE FROM information_schema.GLOBAL_STATUS WHERE VARIABLE_NAME='wsrep_cluster_size';"
  register: cluster_size
  changed_when: false
  failed_when: cluster_size.stdout | int < 3
- name: Mark provisioning complete
  ansible.builtin.debug:
    msg: "Galera cluster provisioned successfully. Quorum established."
  when: cluster_size.stdout | int >= 3
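Cluster size alone does not show that the node sits in the primary component, so a stricter gate can combine several status variables. The Python sketch below is one possible policy, not a rule from the Galera documentation: it requires wsrep_cluster_status to be Primary, wsrep_local_state_comment to be Synced, and wsrep_cluster_size to reach a caller-supplied expected size. The input is a plain dict of status names to string values.

```python
def quorum_ok(status, expected_size):
    """Return True when the node looks healthy enough to finish the run."""
    try:
        size = int(status.get("wsrep_cluster_size", "0"))
    except ValueError:
        return False  # non-numeric size means the status is unusable
    return (
        status.get("wsrep_cluster_status") == "Primary"
        and status.get("wsrep_local_state_comment") == "Synced"
        and size >= expected_size
    )


if __name__ == "__main__":
    healthy = {
        "wsrep_cluster_status": "Primary",
        "wsrep_local_state_comment": "Synced",
        "wsrep_cluster_size": "3",
    }
    print(quorum_ok(healthy, expected_size=3))
```

Passing the expected size as a parameter avoids hard-coding the node count when clusters grow beyond three members.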
For comprehensive operational guidance on node lifecycle management, consult the Galera Cluster Setup & Node Management framework. By enforcing deterministic inventory mapping, strict pre-flight validation, state-aware bootstrap sequencing, and automated SST recovery, platform teams can eliminate manual provisioning drift and maintain multi-master synchronization at enterprise scale.