Startup Sequence and Process in a Cluster

Recommended shut down order (if the cluster is up and running):

  1. First shut down the master aggregator.

  2. Next, shut down the child aggregators and then the leaf nodes (a command sketch follows this list).
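
As a rough sketch of that order using the cluster management tool; the node IDs below are hypothetical placeholders, and the exact memsql-admin stop-node flags may vary by Tools version:

    # Find each node's MemSQL ID and role
    memsql-admin list-nodes

    # Stop the master aggregator first, then each child aggregator, then each leaf node
    memsql-admin stop-node --memsql-id <master-aggregator-id>
    memsql-admin stop-node --memsql-id <child-aggregator-id>
    memsql-admin stop-node --memsql-id <leaf-node-id>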

Recommended startup/restart order:

  1. First start up the leaf nodes, then the child aggregators.

  2. Wait until the partitions of all databases on the leaf nodes are fully recovered.

  3. Start up the master aggregator (a command sketch follows this list).
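
A matching startup sketch, under the same assumptions about memsql-admin flags and with hypothetical node IDs:

    # Start every leaf node, then every child aggregator
    memsql-admin start-node --memsql-id <leaf-node-id>
    memsql-admin start-node --memsql-id <child-aggregator-id>

    # After the leaf partitions have recovered, start the master aggregator
    memsql-admin start-node --memsql-id <master-aggregator-id>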

When cluster shutdown and startup take place in the recommended order, no metadata changes occur, partitions remain in the same places, and no failover is triggered. The cluster returns to a fully online state.

The startup sequence proceeds as follows:

  1. Leaf nodes start up and their state becomes "RECOVERING". This can be verified by running SHOW CLUSTER STATUS (see the example queries after this list).

  2. Leaf nodes recover keys and rowstore data into memory. This is the longest part of the process. This can be seen in logs with messages like "replaying snapshot" and "replaying log". As keys are recovered, a progress percentage is included in log messages.

  3. Once all data and keys are recovered into memory, the master aggregator attaches the leaf nodes.

  4. Next, the master aggregator attaches the partitions and auto-rebalances them, promoting (also called "repointing") partitions as needed.

  5. As partitions are attached, their state becomes "PENDING". This can be verified by running SHOW CLUSTER STATUS.

  6. Next, the partitions show the "TRANSITIONING" state.

  7. Finally, the partitions finish attaching and the leaf nodes show the "ONLINE" state.

  8. After the leaf nodes have finished recovering, the aggregators will start up.

  9. Aggregators also recover data and keys into memory, but only for internal databases and reference tables, both of which are replicated to all nodes in the cluster from the master aggregator. This is much faster than leaf node recovery.

  10. Child aggregators register their IDs with the master aggregator.

  11. Aggregators are synced, including syncing auto_increment values.

  12. Aggregators show the state as "ONLINE".

  13. Once the leaf nodes and then the aggregators are online, the cluster becomes available and responsive to queries.
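
As a minimal sketch of observing these state transitions, the following statements can be run from the master aggregator while the cluster starts up; the exact output columns depend on your engine version:

    -- Partition and node states ("RECOVERING", "PENDING", "TRANSITIONING", "ONLINE"),
    -- along with replay positions
    SHOW CLUSTER STATUS;

    -- Node-level states for leaves and aggregators
    SHOW LEAVES;
    SHOW AGGREGATORS;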

This process can be tracked in the node list using the memsql-admin list-nodes command, in each node's memsql.log tracelog messages, and in the state and replay position of partitions reported by the SHOW CLUSTER STATUS command.
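
A sketch of the same monitoring from the command line; the tracelog path shown is a typical default and is an assumption for your installation:

    # Node IDs, roles, and states as the cluster comes up
    memsql-admin list-nodes

    # Follow a node's tracelog to watch "replaying snapshot" / "replaying log" progress
    tail -f /var/lib/memsql/<node-directory>/tracelogs/memsql.log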

Important

No user interaction is necessary to bring a cluster online, other than starting the nodes. It is neither necessary nor recommended to manually rebalance partitions while a node is offline or recovering.

If recovery fails for any reason, the node logs the failure, shows the "RECOVERY_FAILED" state in the node list, and may restart. If you see a node in the "RECOVERY_FAILED" state, a node that repeatedly restarts, or a node that takes much longer than normal to recover, contact support.

If you need to restart nodes, SingleStore recommends taking a snapshot of all user databases beforehand to make recovery faster. Taking a snapshot truncates the transaction logs into the most recent snapshot, so there are fewer log entries to replay.
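
A minimal sketch of that recommendation; db_name is a placeholder for each of your user databases, and the statement is run on the master aggregator before shutting down:

    -- Truncate the transaction log into a fresh snapshot to shorten replay on restart
    SNAPSHOT DATABASE db_name;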

How Auto Rebalance Works on Restart of a Leaf Node

  • Once a node comes online from the offline/attaching state, the engine waits two minutes (the default value of the attach_rebalance_delay_seconds engine variable) before the rebalance operation is initiated. This ensures the node is stable before the partitions are balanced across the nodes (see the sketch after this list).

  • Within this timeframe, the stability of the cluster is checked; if any node fails or transitions between online and offline states, the two-minute counter is reset.
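
A sketch of inspecting the delay, plus an optional adjustment; whether attach_rebalance_delay_seconds can be changed with SET GLOBAL in your version is an assumption, so verify against your release's documentation first:

    -- Check the current delay (defaults to 120 seconds)
    SHOW VARIABLES LIKE 'attach_rebalance_delay_seconds';

    -- If the variable is settable in your version, extend the delay to five minutes
    SET GLOBAL attach_rebalance_delay_seconds = 300;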

Last modified: March 8, 2024
