Ops: Leaf Node Failures
Leaf Failures in a Redundancy-1 Cluster
When a leaf dies in a redundancy-1 cluster, all partitions hosted on that leaf will be offline to reads and writes.
If the leaf machine is recoverable, SingleStore will automatically reattach the leaf as soon as it is back online.
If the leaf machine is unrecoverable but you can still access its data, then you can introduce a replacement following the guide on how to Replace a Dead Leaf in a Redundancy-1 Cluster.
If the leaf machine and its data are both unrecoverable, and you wish to introduce a replacement, follow steps 1 to 3 of the guide on how to Replace a Dead Leaf in a Redundancy-1 Cluster.
Finally, if you wish to recreate the lost partitions on the remaining leaves, run REBALANCE PARTITIONS on each affected database.
Warning
REBALANCE PARTITIONS will create new, empty partitions in place of the ones that were lost; it does not recover their data.
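The recovery options above can be summarized as a small decision helper. This is purely illustrative Python: the function name and return labels are invented shorthand for the three paths described above, not part of any SingleStore tooling.

```python
def redundancy1_recovery_action(machine_recoverable: bool, data_recoverable: bool) -> str:
    """Map a redundancy-1 leaf failure to the recovery path described above."""
    if machine_recoverable:
        # SingleStore reattaches the leaf automatically once it is back online.
        return "reattach automatically"
    if data_recoverable:
        # Follow the full "Replace a Dead Leaf in a Redundancy-1 Cluster" guide.
        return "replace leaf, restore data"
    # Machine and data both lost: steps 1-3 of the guide, then REBALANCE
    # PARTITIONS recreates the lost partitions empty (data loss).
    return "replace leaf, rebalance with data loss"

print(redundancy1_recovery_action(False, True))  # -> replace leaf, restore data
```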
Replace a Dead Leaf in a Redundancy-1 Cluster
This guide shows how to replace a dead leaf in a redundancy-1 cluster.
- Unmonitor the dead SingleStore Leaf and uninstall the dead MemSQL Ops Agent by executing the following commands:

  memsql-ops memsql-unmonitor <DEAD-LEAF-ID>
  memsql-ops agent-uninstall --force --agent-id <DEAD-AGENT-ID>
  Note
  You can get Leaf IDs and Agent IDs by executing the memsql-ops memsql-list and memsql-ops agent-list commands.
- Remove the dead leaf from the cluster by running REMOVE LEAF on the dead leaf to remove it from SHOW LEAVES and free up its pairing slot:

  REMOVE LEAF "<DEAD-LEAF>"[:<PORT>];

- Deploy a new leaf by deploying a new MemSQL Ops Agent and leaf via the web UI, or via the CLI. Once the agent is deployed, you can replace the settings.conf file from the dead agent:

  memsql-ops agent-deploy --host <HOST-IP> [--user <USER> --identity-file /path/to/id_rsa]
  memsql-ops agent-stop <NEW-AGENT-ID>

  Edit your settings.conf file at /var/lib/memsql-ops/settings.conf, then run:

  memsql-ops agent-start <NEW-AGENT-ID>
  memsql-ops memsql-deploy --agent-id <NEW-AGENT-ID> --role leaf
- Stop the new leaf to make sure the new SingleStore Leaf is NOT running before copying the recovered data:

  memsql-ops memsql-stop <NEW-LEAF-ID>

- Copy the recovered data into the new leaf data directory. Make sure to preserve the memsql_id file:

  sudo mv /var/lib/memsql/leaf-3306/data/memsql_id /tmp
  sudo cp -r /path/to/recovered/leaf-3306/data/* /var/lib/memsql/leaf-3306/data/
  sudo mv /tmp/memsql_id /var/lib/memsql/leaf-3306/data/
  sudo chown -R memsql.memsql /var/lib/memsql/leaf-3306/data/

- Restart the new leaf:

  memsql-ops memsql-start <NEW-LEAF-ID>

- The partitions are now present on the new leaf, but the SingleStore distributed system is still unaware of them, so you must reattach the partitions by triggering partition detection. Temporarily remove the new leaf from the cluster: MemSQL Ops will automatically re-attach the leaf, and this action will trigger detecting (and attaching) all partitions. On the Master Aggregator, run:

  REMOVE LEAF "<NEW-LEAF>";

  In the examples below, 10.0.0.101 is the new SingleStore Leaf, and 10.0.2.128 is an existing leaf. Before reattaching partitions:

  SHOW PARTITIONS ON `memsql_demo`;
  +---------+------------+------+--------+--------+
  | Ordinal | Host       | Port | Role   | Locked |
  +---------+------------+------+--------+--------+
  |       0 | 10.0.2.128 | 3306 | Master |      0 |
  |       1 | 10.0.2.128 | 3306 | Master |      0 |
  |       2 | NULL       | NULL | NULL   |      0 |
  |       3 | NULL       | NULL | NULL   |      0 |
  ...

  After reattaching partitions:

  SHOW PARTITIONS ON `memsql_demo`;
  +---------+------------+------+--------+--------+
  | Ordinal | Host       | Port | Role   | Locked |
  +---------+------------+------+--------+--------+
  |       0 | 10.0.2.128 | 3306 | Master |      0 |
  |       1 | 10.0.2.128 | 3306 | Master |      0 |
  |       2 | 10.0.0.101 | 3306 | Master |      0 |
  |       3 | 10.0.0.101 | 3306 | Master |      0 |
  ...
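As the output above shows, an offline partition appears in SHOW PARTITIONS with a NULL host. If you want to flag such partitions programmatically, a minimal Python sketch follows; it assumes rows fetched with any MySQL-protocol client in the column order shown above, and the function name is illustrative, not a SingleStore API.

```python
# Rows shaped like the SHOW PARTITIONS output above:
# (ordinal, host, port, role, locked). A NULL (None) host means the
# partition currently has no live instance, i.e. it is offline.
def offline_partitions(rows):
    return [ordinal for (ordinal, host, port, role, locked) in rows if host is None]

before = [
    (0, "10.0.2.128", 3306, "Master", 0),
    (1, "10.0.2.128", 3306, "Master", 0),
    (2, None, None, None, 0),
    (3, None, None, None, 0),
]
print(offline_partitions(before))  # -> [2, 3]
```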
Leaf Failures in a Redundancy-2 Cluster
One Leaf Dies
Any partitions for which the dead leaf was the partition master will be promoted on the dead leaf’s pair.
You can reintroduce the dead leaf, or add a new leaf to replace it.
Reintroducing the leaf is the simplest solution: SingleStore will automatically reattach the leaf as soon as it is back online.
If you decide to add a replacement leaf on a different host, follow this guide on how to replace a dead leaf in a redundancy-2 cluster.
Warning
If the machine for the dead agent is still accessible, ensure all memsqld processes are killed and their data directories are emptied before attempting to uninstall the agent.
- Unmonitor the dead SingleStore Leaf and uninstall the dead MemSQL Ops Agent to remove the dead leaf from MemSQL Ops:

  memsql-ops memsql-unmonitor <DEAD-LEAF-ID>
  memsql-ops agent-uninstall --force --agent-id <DEAD-AGENT-ID>

  Note that Leaf IDs and Agent IDs can be retrieved with memsql-ops memsql-list and memsql-ops agent-list.
- Figure out which availability group the failed leaf was in by running SHOW LEAVES on the master aggregator to identify the dead leaf and its availability group:

  show leaves;
  +----------------+------+--------------------+----------------+-----------+---------+--------------------+------------------------------+
  | Host           | Port | Availability_Group | Pair_Host      | Pair_Port | State   | Opened_Connections | Average_Roundtrip_Latency_ms |
  +----------------+------+--------------------+----------------+-----------+---------+--------------------+------------------------------+
  | 54.242.219.243 | 3306 |                  1 | 54.196.216.103 |      3306 | online  |                  1 |                        0.640 |
  | 54.160.224.3   | 3306 |                  1 | 54.234.29.206  |      3306 | online  |                  1 |                        0.623 |
  | 54.196.216.103 | 3306 |                  2 | 54.242.219.243 |      3306 | online  |                  1 |                        0.583 |
  | 54.234.29.206  | 3306 |                  2 | 54.160.224.3   |      3306 | offline |                  0 |                         NULL |
  +----------------+------+--------------------+----------------+-----------+---------+--------------------+------------------------------+

  In this example, the offline leaf was in availability group 2.
- Remove the dead leaf from the cluster by running REMOVE LEAF on the master aggregator to remove the dead leaf from SHOW LEAVES and free up its pairing slot:

  REMOVE LEAF "<DEAD-LEAF>"[:<PORT>] FORCE;

- Deploy a new leaf by deploying a new MemSQL Ops Agent and leaf via the web UI, or via the CLI. Once the agent is deployed, you can replace the settings.conf file from the dead agent:

  memsql-ops agent-deploy --host <HOST-IP> [--user <USER> --identity-file /path/to/id_rsa]
  memsql-ops agent-stop <NEW-AGENT-ID>

  Edit your settings.conf file at /var/lib/memsql-ops/settings.conf, then run:

  memsql-ops agent-start <NEW-AGENT-ID>
  memsql-ops memsql-deploy --agent-id <NEW-AGENT-ID> --role leaf --availability-group <GROUP>

  After this has completed, SHOW LEAVES on the master aggregator should indicate that all leaves are online and paired:

  +----------------+------+--------------------+----------------+-----------+--------+--------------------+------------------------------+
  | Host           | Port | Availability_Group | Pair_Host      | Pair_Port | State  | Opened_Connections | Average_Roundtrip_Latency_ms |
  +----------------+------+--------------------+----------------+-----------+--------+--------------------+------------------------------+
  | 54.242.219.243 | 3306 |                  1 | 54.196.216.103 |      3306 | online |                  2 |                        0.578 |
  | 54.160.224.3   | 3306 |                  1 | 54.145.52.142  |      3306 | online |                  2 |                        0.624 |
  | 54.196.216.103 | 3306 |                  2 | 54.242.219.243 |      3306 | online |                  2 |                        0.612 |
  | 54.145.52.142  | 3306 |                  2 | 54.160.224.3   |      3306 | online |                  1 |                        0.568 |
  +----------------+------+--------------------+----------------+-----------+--------+--------------------+------------------------------+
- Finally, you need to rebalance the cluster by running REBALANCE PARTITIONS for all your databases. On the Master Aggregator, you can run:

  REBALANCE ALL DATABASES;
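The SHOW LEAVES check in step 2 above can also be scripted. A minimal Python sketch follows, assuming rows fetched with any MySQL-protocol client in the column order shown above (Host, Port, Availability_Group, Pair_Host, Pair_Port, State, ...); the function name is illustrative, not a SingleStore API.

```python
# Rows shaped like the SHOW LEAVES output above:
# (host, port, availability_group, pair_host, pair_port, state, ...)
def offline_leaves(rows):
    """Return (host, availability_group) for every leaf not in the 'online' state."""
    return [(row[0], row[2]) for row in rows if row[5] != "online"]

rows = [
    ("54.242.219.243", 3306, 1, "54.196.216.103", 3306, "online"),
    ("54.160.224.3",   3306, 1, "54.234.29.206",  3306, "online"),
    ("54.196.216.103", 3306, 2, "54.242.219.243", 3306, "online"),
    ("54.234.29.206",  3306, 2, "54.160.224.3",   3306, "offline"),
]
print(offline_leaves(rows))  # -> [('54.234.29.206', 2)]
```

This tells you both which leaf to replace and which availability group to pass to memsql-deploy's --availability-group option.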
A Pair of Leaves Die
In this case, the partitions that were hosted on the dead leaves have no remaining instances.
If either of the leaf machines (or at least its data) is recoverable, you can reintroduce it and reattach its partitions, as in the Leaf Failures In a Redundancy-1 Cluster section.
If neither leaf machine is recoverable, then data loss has occurred. Run REBALANCE PARTITIONS on each affected database to create new (empty) replacement partitions.
Many Leaves Die, None of Them Paired
As long as no pair of leaves has lost both of its members, all partitions are still available for reads and writes.
As a special case of this scenario, all leaves in one availability group can be down.
Many Leaves Die, Some of Them Paired
Every partition for which both leaves hosting it died is now offline to reads and writes.
Offline partitions should be handled as they are in the scenario A pair of leaves die.
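The rule in these scenarios reduces to a simple set check: a partition goes offline only when both leaves of its hosting pair are dead. A hypothetical Python illustration follows; the pairing data and names are invented for the example and are not drawn from any SingleStore API.

```python
# Hypothetical pairings: in a redundancy-2 cluster each partition is hosted
# on a pair of leaves. A partition is offline only when BOTH leaves in its
# pair are dead.
def offline_after_failures(partition_pairs, dead_leaves):
    dead = set(dead_leaves)
    return sorted(
        partition
        for partition, (leaf_a, leaf_b) in partition_pairs.items()
        if leaf_a in dead and leaf_b in dead
    )

pairs = {0: ("leaf1", "leaf3"), 1: ("leaf1", "leaf3"), 2: ("leaf2", "leaf4")}

# leaf1 and leaf3 are a pair and both died, so partitions 0 and 1 go offline;
# partition 2 still has a live instance on leaf2 or leaf4.
print(offline_after_failures(pairs, ["leaf1", "leaf3"]))  # -> [0, 1]

# Two unpaired leaves dying leaves every partition with a surviving instance.
print(offline_after_failures(pairs, ["leaf1", "leaf2"]))  # -> []
```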
Last modified: April 27, 2023