Leaf Node Recovery Failed Scenario

Scenario: One of the leaves cannot replicate data, its status is RECOVERY_FAILED, and its replica partitions are in an unrecoverable state.

If the node cannot recover, the most common cause is memory configuration, though not always. To find the exact reason, check that node's tracelog (memsql.log) for the error logged at the time the database became unrecoverable.
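As a starting point for the log review described above, here is a minimal Python sketch that pulls likely relevant lines out of memsql.log. The keyword list is an assumption for illustration, not an official list of error strings, and the log location (commonly tracelogs/memsql.log under the node's data directory) should be confirmed for your deployment. The path is passed on the command line.

```python
import re
import sys
from typing import Iterable, List

# Assumed keywords; adjust them to the messages your log actually contains.
KEYWORDS = re.compile(r"error|fail|out of memory|recover", re.IGNORECASE)


def suspicious_lines(lines: Iterable[str], pattern: re.Pattern = KEYWORDS) -> List[str]:
    """Return the lines that match the pattern, with trailing newlines removed."""
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_log.py /path/to/memsql.log
    with open(sys.argv[1], errors="replace") as log:
        for hit in suspicious_lines(log):
            print(hit)
```

Narrow the output to the time the database became unrecoverable, since a broad keyword match will also return unrelated messages.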

If the node does not have enough memory to replay data back into memory, increase maximum_memory and/or maximum_table_memory to allow recovery to complete. For other configuration issues or suspected bugs, contact Support.

Troubleshooting Steps:

Try to restart the leaf that is still in the RECOVERY_FAILED state.

To investigate possible causes, also check the following:

  • Cluster report

  • Output of SHOW CLUSTER STATUS

  • Output of sysctl -a from the host that has the leaf node with the error.

Other Recommendations:

vm.max_map_count should be 100000000 on all nodes.

Ensure open files ulimit is set to >= 1000000 on all nodes.

If using NUMA nodes, the total size of the nodes should be less than the physical memory available on the server.
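The two kernel-level recommendations above can be checked with a short script. The following Python sketch parses sysctl -a style output and compares vm.max_map_count and the open files limit against the values listed in this article. The function names are hypothetical, and treating the vm.max_map_count value as a minimum rather than an exact match is an assumption.

```python
import resource
import subprocess
from typing import Dict, List

# Thresholds copied from the recommendations in this article.
MIN_MAX_MAP_COUNT = 100_000_000
MIN_OPEN_FILES = 1_000_000


def parse_sysctl(text: str) -> Dict[str, str]:
    """Parse "key = value" lines, as printed by `sysctl -a`, into a dict."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def check_limits(sysctl_text: str, open_files_limit: int) -> List[str]:
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    raw = parse_sysctl(sysctl_text).get("vm.max_map_count")
    if raw is None or not raw.isdigit():
        problems.append("vm.max_map_count not found in sysctl output")
    elif int(raw) < MIN_MAX_MAP_COUNT:
        problems.append(f"vm.max_map_count is {raw}, expected {MIN_MAX_MAP_COUNT}")
    unlimited = open_files_limit == resource.RLIM_INFINITY
    if not unlimited and open_files_limit < MIN_OPEN_FILES:
        problems.append(f"open files limit is {open_files_limit}, expected >= {MIN_OPEN_FILES}")
    return problems


def check_local_host() -> List[str]:
    """Run the checks on this host (requires the sysctl binary on PATH)."""
    text = subprocess.run(["sysctl", "-a"], capture_output=True, text=True).stdout
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    return check_limits(text, soft_limit)
```

Note that check_local_host() reports the limit of the process running the script, so run it as the same user that runs the node. Otherwise the reported open files limit may differ from the one the node actually gets.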

After adjusting the memory settings or system variables, restart the leaf that is still in the RECOVERY_FAILED state. To do this, run sdb-admin restart-node and select the appropriate leaf.
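If the restart needs to be scripted rather than run interactively, a small Python wrapper along these lines may help. It assumes that sdb-admin restart-node accepts --memsql-id to select the node and --yes to skip the confirmation prompt; verify both with sdb-admin restart-node --help on your version before relying on them. The function names are hypothetical.

```python
import subprocess
from typing import List


def restart_node_command(memsql_id: str, assume_yes: bool = False) -> List[str]:
    """Build the argument list for restarting a single node with sdb-admin."""
    if not memsql_id:
        raise ValueError("memsql_id must not be empty")
    # --memsql-id and --yes are assumed flags; check `sdb-admin restart-node --help`.
    argv = ["sdb-admin", "restart-node", "--memsql-id", memsql_id]
    if assume_yes:
        argv.append("--yes")
    return argv


def restart_node(memsql_id: str) -> int:
    """Run the restart non-interactively and return the exit code."""
    return subprocess.run(restart_node_command(memsql_id, assume_yes=True)).returncode
```

Building the argument list separately from running it makes it possible to print and review the exact command before anything is restarted.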