How Failover is Triggered in HA

Case 1: The node is experiencing a grey failure: a live connection to it exists, but it is not answering heartbeats.

In this case, a heartbeat (ping) is sent every 150 ms, and failover is triggered after 200 consecutive heartbeats receive no response. Failover therefore takes 30 seconds to trigger (150 ms × 200). This case requires a thorough investigation because the cause of the missing responses is unknown: since the node is not returning an error, it could simply be slow or under heavy load.
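
The counting rule in Case 1 can be sketched as follows. This is only an illustrative model, not the product's implementation: the function names and the list-of-booleans input are hypothetical, and only the 150 ms interval and the limit of 200 come from the text above.

```python
HEARTBEAT_INTERVAL_MS = 150    # from the text: one heartbeat every 150 ms
MISSED_HEARTBEAT_LIMIT = 200   # from the text: 200 consecutive misses trigger failover


def failover_delay_seconds(interval_ms=HEARTBEAT_INTERVAL_MS,
                           limit=MISSED_HEARTBEAT_LIMIT):
    """Time needed for `limit` consecutive unanswered heartbeats, in seconds."""
    return interval_ms * limit / 1000


def heartbeat_that_triggers_failover(responses, limit=MISSED_HEARTBEAT_LIMIT):
    """Return the 1-based index of the heartbeat at which failover triggers.

    `responses` is a hypothetical sequence of booleans, True meaning the
    heartbeat was answered. Returns None if the limit is never reached.
    """
    missed = 0
    for index, answered in enumerate(responses, start=1):
        # "Consecutive" means any answered heartbeat resets the count.
        missed = 0 if answered else missed + 1
        if missed >= limit:
            return index
    return None
```

In this model, a single answered heartbeat anywhere in a run of misses restarts the 30-second window, which is one reason a slow or overloaded node can stay in this state for a long time without failing over.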

Case 2: The node is refusing the heartbeat connection entirely.

In this case, connection attempts either return an error or time out. Instead of waiting for 200 missed heartbeats, each failed connection attempt adds an extra penalty to the heartbeat counter, so that failover happens in 3 seconds if the heartbeat fails to connect 3 consecutive times. Each connection attempt also has a 10-second timeout by default, and the heartbeat counter from Case 1 continues to rise every 150 ms while the attempt is in progress.
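
To make the interaction between the regular counter and the penalty concrete, here is a small arithmetic sketch. It assumes a simplified additive model that the text does not confirm: the counter gains 1 per 150 ms of elapsed time plus a fixed penalty per failed connection attempt, and failover triggers at 200. The penalty size is not stated in this article, so `penalty` is a hypothetical parameter, and both function names are illustrative.

```python
import math

HEARTBEAT_INTERVAL_MS = 150   # from the text
FAILOVER_THRESHOLD = 200      # from the text (same limit as Case 1)


def counter_value(elapsed_ms, failed_attempts, penalty):
    """Counter under the assumed additive model.

    One increment per elapsed heartbeat interval, plus a hypothetical
    fixed `penalty` for each failed connection attempt.
    """
    return elapsed_ms // HEARTBEAT_INTERVAL_MS + failed_attempts * penalty


def min_penalty(elapsed_ms, failed_attempts):
    """Smallest per-failure penalty that reaches the threshold in time.

    This is a property of the assumed model, not a documented setting.
    """
    remaining = FAILOVER_THRESHOLD - elapsed_ms // HEARTBEAT_INTERVAL_MS
    return max(0, math.ceil(remaining / failed_attempts))
```

For example, `counter_value(10_000, 0, 0)` shows how far the counter advances from elapsed time alone during one 10-second connection timeout, and `min_penalty(3000, 3)` gives the penalty this model would need to match the 3-second, 3-failure figure quoted above. Treat both as illustrations of the mechanism rather than actual configuration values.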

See Leaf States for more details.

Last modified: May 5, 2022
