Recover from a Master Aggregator Failure

  1. To keep the Operator from taking a potentially disruptive action while you fix the cluster, turn the Operator off: set the replicas field to 0 in the sdb-operator.yaml file and apply the change to the cluster.

  2. Determine which child aggregator is furthest ahead in data replication using the following SQL command.

    SHOW DATABASES EXTENDED;

    From the output, review the position column for the cluster database on every node and select the child aggregator where:

    • The position is highest

    • No reference database is in an unrecoverable state

    • No database has a position of 0:0

    These checks are important because the next step will fail if any of these conditions is not met.
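    The selection rules above can be sketched in code. The following Python is a minimal illustration, not part of the product. It assumes you have already collected the SHOW DATABASES EXTENDED rows from each child aggregator into dictionaries with the hypothetical keys database, state, and position, that a position is two colon-separated decimal integers (as in 0:0) compared left to right, that an unrecoverable database reports the state string unrecoverable, and that the cluster database is named cluster. The function names and the sample node names are also hypothetical.

```python
from typing import AbstractSet, Dict, List, Optional, Tuple

# One row of SHOW DATABASES EXTENDED output; the key names are assumptions.
Row = Dict[str, str]


def parse_position(text: str) -> Tuple[int, int]:
    """Split a position such as '0:250' into a comparable pair of integers."""
    left, right = text.strip().split(":")
    return int(left), int(right)


def is_eligible(rows: List[Row], reference_dbs: AbstractSet[str]) -> bool:
    """Reject a node with any 0:0 position or an unrecoverable reference database."""
    for row in rows:
        if parse_position(row["position"]) == (0, 0):
            return False
        if row["database"] in reference_dbs and row["state"].lower() == "unrecoverable":
            return False
    return True


def pick_candidate(
    nodes: Dict[str, List[Row]],
    reference_dbs: AbstractSet[str],
    cluster_db: str = "cluster",  # assumed name of the cluster database
) -> Optional[str]:
    """Return the eligible node with the highest cluster database position, or None."""
    best_node: Optional[str] = None
    best_pos: Optional[Tuple[int, int]] = None
    for node, rows in nodes.items():
        if not is_eligible(rows, reference_dbs):
            continue
        cluster_rows = [r for r in rows if r["database"] == cluster_db]
        if not cluster_rows:
            continue
        pos = parse_position(cluster_rows[0]["position"])
        if best_pos is None or pos > best_pos:
            best_node, best_pos = node, pos
    return best_node


# Hypothetical sample data: child-1 is furthest ahead but has a database at 0:0.
nodes = {
    "child-0": [
        {"database": "cluster", "state": "online", "position": "0:100"},
        {"database": "app", "state": "online", "position": "0:50"},
    ],
    "child-1": [
        {"database": "cluster", "state": "online", "position": "0:250"},
        {"database": "app", "state": "online", "position": "0:0"},
    ],
    "child-2": [
        {"database": "cluster", "state": "online", "position": "0:180"},
        {"database": "app", "state": "online", "position": "0:60"},
    ],
}
print(pick_candidate(nodes, reference_dbs={"cluster"}))  # prints child-2
```

    A result of None means no child aggregator passed all three checks, in which case the next step should not be attempted.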

  3. Set the identified child aggregator as the Master Aggregator using the following SQL command.

    AGGREGATOR SET AS MASTER;

  4. Ensure that the node on master-0 is emptied out.

    Depending on what happened when the disk was corrupted, this may require emptying the master-0 volume again.

    • If you attempted to start master-0 with an empty volume while the Operator was on, the Operator would have bootstrapped it as a Master Aggregator. This results in a new single-node cluster that must be emptied out.

    • If you never restarted master-0 while the Operator was on, only its storage must be emptied out. When master-0 is started as a new empty node, it will not be bootstrapped as the Master Aggregator because the Operator is not running.

  5. Remove the former Master Aggregator and clear its metadata by running the following SQL command on the temporary Master Aggregator. The SQL commands in steps 6 and 7 must also be run on the temporary Master Aggregator.

    REMOVE AGGREGATOR '...-master-0';

  6. Re-add the node that you just removed. This adds the empty node to the cluster as a child aggregator.

    ADD AGGREGATOR '...-master-0';

  7. Promote this child aggregator so that the node in the master StatefulSet becomes the new Master Aggregator.

    PROMOTE AGGREGATOR '...-master-0' TO MASTER;

  8. Turn the Operator back on by setting the replicas field to 1 in the sdb-operator.yaml file and applying the change to the cluster.
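
Steps 5 through 7 run three statements against the same node, in a fixed order, on the temporary Master Aggregator. As a small sketch of that ordering, the Python helper below builds the three statements from a single argument. The function name and its target parameter are hypothetical, and the statement forms are copied from the steps above. Because those steps elide the full name of the master-0 node, the helper passes whatever value you supply through unchanged and does not try to guess it.

```python
from typing import List


def master_recovery_statements(target: str) -> List[str]:
    """Build the statements for steps 5, 6, and 7 in the order they must run.

    `target` is the full quoted argument that the steps above show as
    '...-master-0'; it is a hypothetical parameter and is not validated
    against any cluster.
    """
    if not target or "'" in target:
        # Refuse values that are empty or would break out of the SQL string literal.
        raise ValueError("target must be non-empty and must not contain a single quote")
    return [
        f"REMOVE AGGREGATOR '{target}';",            # step 5: drop the former Master Aggregator
        f"ADD AGGREGATOR '{target}';",               # step 6: re-add the empty node as a child
        f"PROMOTE AGGREGATOR '{target}' TO MASTER;",  # step 7: promote it to Master Aggregator
    ]
```

Generating all three statements from one value keeps the same node name in each step, which is the property the procedure depends on.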

Last modified: August 31, 2022
