Failover Related Messages in Tracelog

The following tracelog messages are related to a node disconnecting, reconnecting, and being attached back to the cluster.

Most of these messages reflect either a transient condition that the cluster's self-recovery processes resolve automatically, or a more serious problem with networking or other infrastructure. It is also important to consider the cluster's workload, because heavy load is often associated with these types of messages.

Ultimately, a full analysis of a particular issue can be done only with a detailed cluster report and full logging.

  • AttachManyLeaves: AttachManyLeaves: failed to attach leaf 'psfbdlf2':3306: Unable to ATTACH leaf 'psfbdlf2':3306. database `POSBP_1` on 'psfbdlf2':3306 is currently recovering (ReplayLSN 0:18446744073709551615, CommitLSN 0:3308). Try running ATTACH LEAF again when the leaf finishes recovering. Use SHOW DATABASES EXTENDED on the leaf to examine its progress.

    This message means that the specified leaf node has failed to attach because it is currently recovering. Depending on the context, this can be the expected normal behaviour or it can potentially indicate some kind of problem with that leaf node.

  • Thread 115039 (ntid 3624, conn id -1): AttachManyLeaves: AttachManyLeaves: failed to attach leaf 'psfbdlf2':3306: Unable to ATTACH leaf 'psfbdlf2':3306. database `POSBIZP_1` on 'psfbdlf2':3306 is currently recovering (ReplayLSN 0:18446744073709551615, CommitLSN 0:3308). Try running ATTACH LEAF again when the leaf finishes recovering. Use SHOW DATABASES EXTENDED on the leaf to examine its progress.

    Same as the above message. It additionally identifies the thread.

  • Thread 115046 (ntid 3617, conn id -1): HeartbeatConnectionFailure: Failure during heartbeat connecting to tlpsfbdlf2:3306

    Assuming this logging is from the Master Aggregator, this message indicates that the regular heartbeat process failed to connect to the specified node. Depending on the context, this can either be transient behaviour related to a heavy workload, or it can indicate some kind of networking or other infrastructure issues related to that node.

  • ProcessARD ProcessARD Worker (POSM2L): Thread 115044 (ntid 3619, conn id -1): operator(): Failed to transition database `POSM2L` to master, reason: Partial Progress.

    This message indicates that an attempt to transition a slave database into a master database had not completed when the message was logged. Depending on the context, this message can be transient and the cluster will self-recover, or it can indicate more serious issues.

  • Thread 115045 (ntid 441695, conn id -1): operator(): `possfma` log: Disconnecting replica at node 3 because of connection error.

    This message indicates a connectivity problem with node 3, so the cluster is disconnecting that replica as part of the automatic self-recovery process.

  • ProcessARD Thread 115064 (ntid 373595, conn id -1): AdjustReferenceDatabases: Failure updating reference databases at LSN 2f. Succeeded updating 1, needed 2.

    This message can be either transient (possibly indicative of a heavy workload) or indicative of more serious issues, depending on the context.

  • IsOK: Heartbeat connection error reading RPC (2 = End of file)

    This message can indicate a transient heartbeat connection issue that self-recovers (often related to a heavy workload) or a more serious problem with network connectivity or other infrastructure.

  • IsOK: Heartbeat connection error reading heartbeat response header to tlpsfbdlf1:3306 (2 = End of file)

    This is similar to the above error, except that it identifies the node involved and the specific problem of reading the response header.

  • HeartbeatConnectionFailure: Failure during heartbeat connecting to tlpsfbdlf1:3306

    This is similar to the above error except it specifies a problem connecting to that node (instead of reading the response header).

  • IsOK: Heartbeat connection error connecting to tlpsfbdlf1:3306 (111 = Connection refused)

    This is similar to the above error except that it specifies a particular connection error (Connection refused).

  • MoveOneStep: `cluster` log: Slave packet read (86) failed with error 104 (Connection reset by peer) while in state x_streaming.

    This message indicates a connection reset that interrupted internal activity and can either be transient or indicative of more serious issues.

  • GetClusterSyncLSNTermFromLeader: Failed to connect to node believed to be leader: distributed@'tlpsfbdag1':3306

    How to read this message depends on which node logged it and on your cluster topology. It indicates a connection failure from a leaf node to the specified aggregator. Again, depending on the context, this can be transient or potentially indicative of more serious issues.

  • AdjustReferenceDatabases: Failure updating reference databases at LSN 2f. Succeeded updating 1, needed 2

    This message can be either transient (possibly indicative of a heavy workload) or indicative of more serious issues, depending on the context.

  • AttachLeafSetErrorInternal: Unable to ATTACH LEAF 'tlpsfbdlf1':3306. database `POSBIZP_0` on 'tlpsfbdlf1':3306 is currently recovering(ReplayLSN 0:18446744073709551615, CommitLSN 0:3206). Try running ATTACH LEAF again when the leaf finishes recovering. Use SHOW DATABASES EXTENDED on the leaf to examine its progress.

    Similar to the first message, the specified leaf node has failed to attach because it is currently recovering. Depending on the context, this can be expected normal behaviour or it can potentially indicate some kind of problem with that leaf node. The prefix (AttachLeafSetErrorInternal versus AttachManyLeaves) identifies the origin of the process that led to this error.
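
Because the same message families tend to repeat during a failover, it can help to tally them before reading the log line by line. The following Python sketch groups tracelog lines by the message fragments listed above and counts the error codes reported in the "(111 = Connection refused)" form. The category labels and the function names classify and summarize are hypothetical, chosen only for this illustration; they are not product terminology, and the substring list covers only the examples in this article.

```python
import re
from collections import Counter

# Substrings copied from the example messages in this article. The category
# names on the right are labels invented for this sketch.
CATEGORIES = [
    ("Unable to ATTACH", "leaf attach blocked by recovery"),
    ("Heartbeat connection error", "heartbeat"),
    ("HeartbeatConnectionFailure", "heartbeat"),
    ("Failed to transition database", "database promotion"),
    ("Disconnecting replica", "replication"),
    ("Slave packet read", "replication"),
    ("Failure updating reference databases", "reference databases"),
    ("Failed to connect to node believed to be leader", "leader connection"),
]

# Matches only the "(<code> = <text>)" form seen in the heartbeat messages,
# for example "(111 = Connection refused)".
ERROR_RE = re.compile(r"\((\d+) = ([^)]+)\)")


def classify(line):
    """Return the category for a tracelog line, or None if it is unrelated."""
    for needle, category in CATEGORIES:
        if needle in line:
            return category
    return None


def summarize(lines):
    """Count failover-related lines per category and per reported error."""
    by_category = Counter()
    by_error = Counter()
    for line in lines:
        category = classify(line)
        if category is None:
            continue
        by_category[category] += 1
        match = ERROR_RE.search(line)
        if match:
            by_error[(int(match.group(1)), match.group(2))] += 1
    return by_category, by_error


if __name__ == "__main__":
    sample = [
        "IsOK: Heartbeat connection error reading RPC (2 = End of file)",
        "HeartbeatConnectionFailure: Failure during heartbeat connecting to tlpsfbdlf1:3306",
        "IsOK: Heartbeat connection error connecting to tlpsfbdlf1:3306 (111 = Connection refused)",
    ]
    categories, errors = summarize(sample)
    for name, count in categories.most_common():
        print(name, count)
    for (code, text), count in errors.most_common():
        print(code, text, count)
```

A burst of one category confined to a short time window fits the transient pattern described above, while the same category recurring over a long period is a reason to collect a cluster report. The counts alone do not distinguish the two cases; they only show where to look first.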

Last modified: June 24, 2024
