Tools: Leaf Node Failures
Replace a Failed Leaf Node in a Redundancy-1 Cluster
Warning
Review the size of your SingleStore data directory and confirm that at least twice the amount of disk space used by the leaf node's data directory is available.
A leaf node's data typically resides in /var/lib/memsql/<node-directory>/data by default in SingleStore v7 and later. A node's MemSQL ID is stored in the memsql-id file within the node's data directory (e.g., cat /var/lib/memsql/<node-directory>/data/memsql-id).
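The comparison described in this warning can be scripted as a quick pre-check. The following is a minimal sketch; the data_dir value is a placeholder (it defaults to /tmp purely so the snippet is runnable) and should point at the leaf node's actual data directory.

```shell
# Minimal sketch of the disk-space pre-check described above.
# data_dir is a placeholder -- point it at the leaf node's actual data
# directory (e.g. /var/lib/memsql/<node-directory>/data).
data_dir="${DATA_DIR:-/tmp}"

# Disk space used by the data directory, in KB.
used_kb=$(du -sk "$data_dir" 2>/dev/null | awk '{print $1}')

# Free space, in KB, on the filesystem holding the data directory.
free_kb=$(df -Pk "$data_dir" | awk 'NR==2 {print $4}')

# The warning above asks for at least twice the used space to be free.
need_kb=$((used_kb * 2))

if [ "$free_kb" -ge "$need_kb" ]; then
    echo "OK: ${free_kb} KB free, ${need_kb} KB needed"
else
    echo "WARNING: only ${free_kb} KB free, ${need_kb} KB needed"
fi
```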
This section covers the replacement of a failed leaf node within a redundancy-1 cluster.
- Determine if the leaf node in question is still running within the cluster. If it is, first stop it, and then remove the leaf role from the node.

  sdb-admin stop-node --memsql-id <node-ID>
  sdb-admin remove-leaf --force --memsql-id <node-ID>
- Assuming that the data is recoverable from the failed leaf node, preserve it by changing to the node directory (on the corresponding host) and compressing it to another location, such as the /tmp directory.

  cd /path/to/memsql/<node-directory>
  sudo tar -zcvf /tmp/data.tgz ./data/
- As the leaf node's data has been preserved, delete the leaf node from the cluster.

  sdb-admin delete-node --memsql-id <node-ID> --skip-remove
- Create a new node to store this data in. After the node's been created, stop it so that you can manipulate the node's data directory.

  sdb-admin create-node --host <IP-or-hostname-of-node-host> --port <port-of-node-host> --password <password-of-node>
  sdb-admin stop-node --memsql-id <node-ID>
- Navigate to the newly created node's directory and remove the data directory within. Then, extract the data from the /tmp directory to this node's data directory.

  cd /var/lib/memsql/<new-node-directory>
  sudo rm -r ./data
  sudo tar -zxvf /tmp/data.tgz
- Once the extraction is finished, update the ownership and permissions on the directory's files.

  sudo chown memsql:memsql -R ./data
- Start the new node. The MemSQL ID will now be that of the former node, which you can confirm by running sdb-admin list-nodes.

  sdb-admin start-node --all
- Run the following SQL command on the Master Aggregator host to add this leaf node to the cluster.

  ADD LEAF <user>:'<password>'@'<node's-host-IP or hostname>':<port>;
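Taken together, the preceding steps can be collected into a single script. The following is a sketch only, wrapped in a function so that nothing executes on paste: every ID, path, port, and credential is a placeholder you must replace, and the sdb-admin commands assume SingleStore Toolbox is installed on the host.

```shell
# Sketch of the full redundancy-1 replacement flow above. Wrapped in a
# function so it is reviewed rather than executed on paste; all values
# in angle brackets are placeholders.
replace_failed_leaf() {
    local old_id="<node-ID>"                       # MemSQL ID of the failed leaf
    local old_dir="/var/lib/memsql/<node-directory>"
    local new_host="<IP-or-hostname>" new_port="<port>" password="<password>"

    # 1. Stop the failed leaf and remove its leaf role.
    sdb-admin stop-node --memsql-id "$old_id"
    sdb-admin remove-leaf --force --memsql-id "$old_id"

    # 2. Preserve the old node's data under /tmp.
    (cd "$old_dir" && sudo tar -zcvf /tmp/data.tgz ./data/)

    # 3. Delete the node from the cluster, keeping its files on disk.
    sdb-admin delete-node --memsql-id "$old_id" --skip-remove

    # 4. Create a replacement node, then stop it so its data directory
    #    can be swapped out.
    sdb-admin create-node --host "$new_host" --port "$new_port" --password "$password"
    sdb-admin stop-node --memsql-id "<new-node-ID>"

    # 5. Restore the preserved data into the new node's directory and
    #    fix ownership.
    cd "/var/lib/memsql/<new-node-directory>"
    sudo rm -r ./data
    sudo tar -zxvf /tmp/data.tgz
    sudo chown memsql:memsql -R ./data

    # 6. Start the node; it comes back with the former node's MemSQL ID.
    sdb-admin start-node --all
}
```

The final ADD LEAF step remains a SQL statement run on the Master Aggregator, as shown in the step above.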
Replace a Failed Leaf Node in a Redundancy-2 Cluster
Warning
Review the size of your SingleStore data directory and determine if you have enough disk space remaining when you proceed with the recovery process.
This section details how to recover a failed leaf node in a high availability cluster.
Clusters using Async or Sync replication can present different errors, so knowing which replication option is being used is key to recovering a failed cluster.
By default, MemSQL v7.0 and later clusters use sync replication.
Async replication treats data differently and, as its first priority, allows the cluster to continue to run.
Sometimes data can become out of sync when one or more sizable queries are run.
As a result, the cluster will stop performing transactions and will throw an error: Asynchronous replication with a FailoverLeaf: Couldn't failover from node.
To acknowledge the potential data loss and manually move the data to the secondary leaf node that is now out of sync, use a SQL editor to run the REBALANCE PARTITIONS command on each database.
Alert: If REBALANCE PARTITIONS is run before attempting to retain or fix the leaf/partition, the data will be lost and will no longer be recoverable.

  REBALANCE PARTITIONS ON db-name;
You may also use the bash script in the One Leaf Node Fails section below to rebalance the partitions on all nodes and restore redundancy on all databases.
One Leaf Node Fails
Reintroduce the Leaf Node
This section details how to reintroduce a failed leaf node in a Redundancy-2 cluster.
If the failed leaf node was the partition master, those partitions will be promoted on its pair.
In scenarios where the filesystem in the leaf node hosting the SingleStore data-directory is lost or corrupted, you can introduce a new host with a leaf node and then rebalance your data.
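The rebalance script referenced earlier is not reproduced on this page. The following is a minimal sketch of what such a script could look like, assuming a MySQL-compatible client (shown here as mysql) can reach the Master Aggregator; the host, port, user, and the list of system schemas to skip are all assumptions to adjust for your cluster.

```shell
# Hypothetical rebalance-all sketch: run REBALANCE PARTITIONS on every
# user database on the Master Aggregator. Connection details and the
# skipped system schemas are assumptions.
rebalance_all() {
    local host="${1:-127.0.0.1}" port="${2:-3306}" user="${3:-root}"

    # List databases, skip system schemas, and rebalance each one.
    mysql -h "$host" -P "$port" -u "$user" -N -e "SHOW DATABASES" |
    grep -Ev '^(information_schema|memsql|cluster)$' |
    while read -r db; do
        echo "Rebalancing partitions on ${db}"
        mysql -h "$host" -P "$port" -u "$user" -e "REBALANCE PARTITIONS ON \`${db}\`"
    done
}
```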
Replace the Leaf Node from a Different Host
This section details how to replace a failed leaf node in a Redundancy-2 cluster with a replacement leaf node from a different host.
In case of hardware failure of your Linux/Ubuntu machines hosting the leaf node, if the failure does not cause any data corruption, the node can be reintroduced to the cluster after it is recovered.
If the failed leaf node was the partition master, those partitions will be promoted on its pair.
- If the host of the failed leaf node is still available, note either its availability group, or the availability group of its pair.
- If the host that the failed leaf node resides on has also failed, the leaf node must be removed from the cluster. Determine if the failed leaf node is still shown.

  sdb-admin list-nodes

  If so, remove it from the cluster.

  sdb-admin remove-leaf --memsql-id <node-ID>

  If not, skip removing it from the cluster and just remove the failed host.

  sdb-admin delete-node --memsql-id <node-ID> --skip-remove
  sdb-toolbox-config unregister-host --host <host-IP or hostname>
- The leaf node may still be visible to the cluster, but Toolbox will no longer recognize it. To confirm, run the following SQL command on the Master Aggregator host and look for the IP and port of the failed leaf node.

  SHOW LEAVES;
- If the leaf node persists within the cluster, remove it from the Master Aggregator.

  REMOVE LEAF '<IP-of-leaf>':<port>;
- Using Toolbox, add a new replacement host to the cluster.

  sdb-toolbox-config register-host --host <host-IP or hostname> -i <SSH-identity-file>
- Deploy SingleStore to this host.

  sdb-deploy install --host <host-IP or hostname>
- Create a replacement node, assign it a leaf role, and add it to the availability group you noted earlier.

  sdb-admin create-node --host <host-IP or hostname> --password <secure-password> --port <port>
  sdb-admin add-leaf --memsql-id <ID-of-new-node> --availability-group <1 or 2>
- Rebalance the partitions on all nodes and restore redundancy on all databases.

  REBALANCE ALL DATABASES;
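As with the redundancy-1 procedure, the replacement steps can be sketched as one function for review before anything is run. Every host, ID, port, and file path below is a placeholder, and the commands assume SingleStore Toolbox is installed on the machine you run them from.

```shell
# Sketch of the replace-from-a-different-host flow above. Wrapped in a
# function so nothing runs on paste; all angle-bracket values are
# placeholders.
replace_leaf_from_new_host() {
    local old_id="<node-ID>" old_host="<failed-host-IP>"
    local new_host="<new-host-IP>" new_port="<port>"

    # Remove the failed leaf and its host from Toolbox's view.
    sdb-admin remove-leaf --memsql-id "$old_id"
    sdb-admin delete-node --memsql-id "$old_id" --skip-remove
    sdb-toolbox-config unregister-host --host "$old_host"

    # Register the replacement host and deploy SingleStore to it.
    sdb-toolbox-config register-host --host "$new_host" -i "<SSH-identity-file>"
    sdb-deploy install --host "$new_host"

    # Create the replacement node and add it as a leaf in the
    # availability group noted earlier (1 or 2).
    sdb-admin create-node --host "$new_host" --port "$new_port" --password "<secure-password>"
    sdb-admin add-leaf --memsql-id "<ID-of-new-node>" --availability-group 1
}
```

The SHOW LEAVES, REMOVE LEAF, and REBALANCE ALL DATABASES statements remain SQL steps run on the Master Aggregator, as described in the steps above.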
Last modified: November 24, 2023