Overview of Leaf Node Failover
This is a general overview of what happens when a leaf node failover occurs:
- Connections to the failed node are flushed.
- Slave partitions on the failed node are detached.
- The slave partitions on the paired node are promoted and repointed as master partitions.
- The failed node is detached from the cluster.
- The node is flagged as "offline" in information_schema.mv_events. Offline in this case means either that the node was literally offline, or that the master aggregator could not communicate with the node and failover was initiated.
- Once the master aggregator can establish heartbeat connections, complete recovery of the node(s) typically takes a few minutes, depending on factors such as partition count and data size.
While leaves are offline in the cluster, the paired node serves content and responds to connections for all of its partitions. If failover succeeded, all of those partitions are now master partitions.
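To make the order of operations concrete, the following is a minimal toy model of the steps above, not the product's implementation. The `Leaf` class, the `fail_over` function, the node names, and the partition ids are all hypothetical and exist only for illustration. The `events` list is a stand-in for the record that the real cluster writes to information_schema.mv_events.

```python
from dataclasses import dataclass, field


@dataclass
class Leaf:
    """Hypothetical stand-in for a leaf node in a paired availability group."""
    name: str
    masters: set = field(default_factory=set)  # partition ids served as master
    slaves: set = field(default_factory=set)   # partition ids held as slave copies
    online: bool = True
    connections: int = 0


def fail_over(failed: Leaf, pair: Leaf, events: list) -> None:
    """Apply the failover steps, in order, to a toy pair of leaves."""
    failed.connections = 0                    # flush connections to the failed node
    failed.slaves.clear()                     # detach slave partitions on the failed node
    pair.masters |= pair.slaves               # promote the pair's slave partitions to master
    pair.slaves.clear()
    failed.masters.clear()                    # detach the failed node from the cluster
    failed.online = False
    events.append((failed.name, "offline"))   # flag the node as offline


# Two paired leaves: each holds slave copies of the other's master partitions.
leaf_a = Leaf("leaf-a", masters={0, 1}, slaves={2, 3}, connections=5)
leaf_b = Leaf("leaf-b", masters={2, 3}, slaves={0, 1})
events = []

fail_over(leaf_a, leaf_b, events)
print(sorted(leaf_b.masters))  # [0, 1, 2, 3]
```

After the call, the paired leaf holds every partition as a master and has no slave copies left, which matches the state described above for a successful failover.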
During this entire process, the tracelog on each node reflects the view of the node that is logging the messages.
System health is extremely context-dependent, and in a distributed system there is a lot of self-healing going on.
For a listing of some of the typical error and warning messages, refer to Failover Related Messages in Tracelog.
Last modified: June 24, 2024