Leaf Node Recovery Failed Scenario
Scenario: One of the leaves cannot replicate data, its status is RECOVERY_FAILED, and the replica partitions are in an unrecoverable state.
If the node cannot recover, the most common cause is the memory configuration, although this is not always the case. To find the exact reason, check the tracelog (memsql.log) on that node for the error logged at the time the database became unrecoverable.
If the node does not have enough memory to replay data back into memory, increase maximum_memory and/or maximum_table_memory to allow recovery to complete. For other configuration or bug-related issues, contact Support.
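A quick way to narrow down the tracelog is to filter it for lines that mention errors, memory, or recovery. The sketch below is only a starting point: the helper name scan_tracelog, the TRACELOG placeholder, and the keyword pattern are assumptions for illustration, not actual SingleStore message formats or paths.

```shell
#!/bin/sh
# Hypothetical helper: print the last N lines of a log file that mention
# errors, memory, or recovery, prefixed with their line numbers.
# The keyword pattern is a guess, not a list of real SingleStore messages.
scan_tracelog() {
    logfile=$1
    count=${2:-50}   # how many matching lines to keep (default 50)
    grep -inE 'error|memory|recover' "$logfile" | tail -n "$count"
}

# Example call. TRACELOG is a placeholder for the memsql.log path
# of the affected leaf on your host.
# scan_tracelog "$TRACELOG" 100
```

The line numbers in the output make it easier to open the full log at the matching position and read the surrounding entries.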
Troubleshooting Steps:
Try to restart the leaf that is still in the RECOVERY_FAILED state.
To investigate the possible causes, also collect and review the following:
Cluster report
Output of SHOW CLUSTER STATUS
Output of sysctl -a from the host that has the leaf node with the error.
Other Recommendations:
vm.max_map_count should be 100000000 on all nodes.
Ensure the open files ulimit is set to >= 1000000 on all nodes.
If using NUMA nodes, the total size of the nodes should be less than the physical memory available on the server.
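The two kernel and limit recommendations above can be checked on each host with a short script. This is a minimal sketch: check_min is a hypothetical helper, and treating the vm.max_map_count value as a minimum is an assumption. Note that ulimit -n reports the limit of the current shell, which may differ from the limit applied to the user that runs the database process.

```shell
#!/bin/sh
# Hypothetical helper: report whether a numeric setting meets a minimum.
check_min() {
    name=$1 actual=$2 required=$3
    case $actual in
        unlimited)
            echo "OK $name=unlimited" ;;
        ''|*[!0-9]*)
            # Empty or non-numeric: the value could not be read.
            echo "UNKNOWN $name" ;;
        *)
            if [ "$actual" -ge "$required" ]; then
                echo "OK $name=$actual"
            else
                echo "LOW $name=$actual (want >= $required)"
            fi ;;
    esac
}

# Thresholds are the values recommended in this article.
check_min vm.max_map_count "$(sysctl -n vm.max_map_count 2>/dev/null)" 100000000
check_min open_files "$(ulimit -n)" 1000000
```

Run it on every node, since the recommendations apply cluster-wide.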
After making any adjustments to the memory settings or the variables, try again to restart the leaf that is still in the RECOVERY_FAILED state. You can do this by running the command sdb-admin restart-node and then selecting the appropriate leaf.
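If you prefer to target the leaf directly instead of selecting it interactively, the restart can be wrapped as sketched below. The restart_leaf function and the DRY_RUN variable are hypothetical, and the sketch assumes your version of sdb-admin accepts the --memsql-id option on restart-node; confirm this with sdb-admin restart-node --help before relying on it.

```shell
#!/bin/sh
# Hypothetical wrapper: restart one node by its ID.
# With DRY_RUN=1 the command is printed instead of executed.
restart_leaf() {
    node_id=$1
    if [ -z "$node_id" ]; then
        echo "usage: restart_leaf <memsql-id>" >&2
        return 1
    fi
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "sdb-admin restart-node --memsql-id $node_id"
    else
        # Assumes --memsql-id is supported by your sdb-admin version.
        sdb-admin restart-node --memsql-id "$node_id"
    fi
}

# Look up the ID of the affected leaf first, for example with:
#   sdb-admin list-nodes
# Then preview the command before running it for real:
#   DRY_RUN=1 restart_leaf <memsql-id>
```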