Recover from a Master Aggregator Failure

Prerequisites

Before proceeding, determine whether your cluster is running with Master Aggregator High Availability (MA HA) enabled, because the recovery steps differ significantly. MA HA provides automatic failover of the Master Aggregator for mission-critical workloads. Refer to Multi-Datacenter Failover for more information about MA HA.

Change Master Aggregator Disk with MA HA Enabled

If your cluster is running with MA HA enabled, the recovery process is simplified. You only need to change the storage class:

Update the cluster custom resource with the new storage class in masterAggregatorSpec.storageClass.
Update the Master Aggregator StatefulSet's PersistentVolumeClaim (PVC) template with the desired storage class.
Delete the PVC for the master-0 pod.
Delete the master-0 pod itself.
The StatefulSet controller recreates the disk with the new storage class. The promoted child aggregator continues to serve as the active Master Aggregator during the operation.

Note

SingleStore recommends enabling MA HA to support automatic storage class changes with minimal manual intervention.

Recover from MA Failure without MA HA

If MA HA is not enabled, perform the following steps to recover from a Master Aggregator failure.

To prevent the Operator from taking disruptive actions while you repair the cluster, turn it off by setting the replicas field to 0 in the sdb-operator.yaml file and applying the change to the cluster.
Determine which of the child aggregators is furthest ahead in data replication by running the following SQL command on each child aggregator.
```sql
SHOW DATABASES EXTENDED;
```
From the output, review the position column for the cluster database on every node and select the child aggregator where:
- The position of the cluster database is the highest
- No reference database is in an unrecoverable state
- No database has a position of 0:0
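The selection criteria above can be sketched in Python. This is only an illustrative sketch under stated assumptions, not part of the SingleStore tooling: it assumes you have already collected the command output from each child aggregator into dictionaries with the hypothetical keys `node`, `database`, `position`, and `state`, that a position consists of two decimal integers separated by a colon and compared left to right, and that an unrecoverable database is reported with the state string `unrecoverable`. The node and database names in the sample data are made up.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def parse_position(position: str) -> Tuple[int, int]:
    """Split a position such as '0:4120' into a comparable tuple (assumed format)."""
    major, minor = position.split(":")
    return int(major), int(minor)


def pick_candidate(rows: Iterable[Dict[str, str]]) -> Optional[str]:
    """Return the node with the highest cluster position that passes all checks."""
    by_node: Dict[str, List[Dict[str, str]]] = {}
    for row in rows:
        by_node.setdefault(row["node"], []).append(row)

    best_node: Optional[str] = None
    best_pos: Optional[Tuple[int, int]] = None
    for node, dbs in by_node.items():
        # Skip nodes that report any database as unrecoverable.
        if any(db["state"] == "unrecoverable" for db in dbs):
            continue
        # Skip nodes where any database is still at position 0:0.
        if any(parse_position(db["position"]) == (0, 0) for db in dbs):
            continue
        cluster = [db for db in dbs if db["database"] == "cluster"]
        if not cluster:
            continue
        pos = parse_position(cluster[0]["position"])
        if best_pos is None or pos > best_pos:
            best_node, best_pos = node, pos
    return best_node


# Hypothetical sample data: aggregator-1 is furthest ahead but has a 0:0 database.
rows = [
    {"node": "aggregator-0", "database": "cluster", "position": "0:4120", "state": "online"},
    {"node": "aggregator-0", "database": "app_ref", "position": "0:900", "state": "online"},
    {"node": "aggregator-1", "database": "cluster", "position": "0:4385", "state": "online"},
    {"node": "aggregator-1", "database": "app_ref", "position": "0:0", "state": "online"},
    {"node": "aggregator-2", "database": "cluster", "position": "0:3990", "state": "online"},
    {"node": "aggregator-2", "database": "app_ref", "position": "0:870", "state": "online"},
]
print(pick_candidate(rows))  # prints "aggregator-0"
```

In this sample, aggregator-1 has the highest cluster position but is excluded because one of its databases is at 0:0, so aggregator-0 is selected. Verify the result against the actual output before promoting any node.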
Promote the selected child aggregator to Master Aggregator by running the following SQL command on that node.
```sql
AGGREGATOR SET AS MASTER;
```
Ensure that the node on master-0 is empty. Depending on what occurred when the disk became corrupted, this may require emptying the master-0 volume again.
- If you attempted to start master-0 with an empty volume while the Operator was running, the Operator would have bootstrapped it as a Master Aggregator. This results in a new single-node cluster that must be emptied out.
- If you did not start master-0 while the Operator was running, then only its storage must be emptied out. When master-0 is started as a new empty node, it is not bootstrapped as the Master Aggregator because the Operator is not running.
Remove the former Master Aggregator and clear its metadata by running the following SQL command on the temporary Master Aggregator.
```sql
REMOVE AGGREGATOR '...-master-0';
```
Optional: If you need to change the Master Aggregator storage class after promoting a child aggregator to master, refer to Change Master Aggregator Disk with MA HA Enabled.
Re-add the emptied master-0 node. This adds the empty node to the cluster as a child aggregator.
```sql
ADD AGGREGATOR '...-master-0';
```
Promote this child aggregator so that the node in the master StatefulSet becomes the new Master Aggregator.
```sql
PROMOTE AGGREGATOR '...-master-0' TO MASTER;
```
Turn the Operator back on by setting the replicas field to 1 in the sdb-operator.yaml file and applying the change to the cluster.
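As an alternative to editing the file for the two Operator steps (off at the start, on at the end), the replica count can also be changed with `kubectl scale`. The sketch below only builds and prints the command so that you can review it before running it. It assumes that the Operator runs as a Deployment named `sdb-operator` in the `default` namespace. Both names are assumptions, so adjust them to match your sdb-operator.yaml file.

```python
import shlex
from typing import List


def operator_scale_command(
    replicas: int,
    deployment: str = "sdb-operator",  # assumed Deployment name
    namespace: str = "default",        # assumed namespace
) -> List[str]:
    """Build a kubectl command that sets the Operator replica count to 0 or 1."""
    if replicas not in (0, 1):
        raise ValueError("replicas must be 0 (off) or 1 (on)")
    return [
        "kubectl", "scale", "deployment", deployment,
        f"--replicas={replicas}", "--namespace", namespace,
    ]


# Print the commands for review instead of executing them.
print(shlex.join(operator_scale_command(0)))  # turn the Operator off
print(shlex.join(operator_scale_command(1)))  # turn the Operator back on
```

Scaling this way does not update the sdb-operator.yaml file, so a later apply of that file resets the replica count to whatever the file specifies.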
