Multi-Datacenter Failover

Note

This is a Preview feature.

This feature is available to all SingleStore Enterprise customers.

This feature provides enhanced reliability and failover capabilities, allowing your mission-critical workloads to remain highly available.

The Master Aggregator (MA) node plays a pivotal role in the management and operational coordination of a SingleStore cluster. The MA is crucial not only for executing DDL operations but also for tasks such as ingestion, managing reference tables, and managing partitions during node failures.

To enable this feature, you must explicitly deploy an MA and two or more child aggregator (CA) nodes with the "voting member" role assigned at the time of cluster creation. Thus, there is one primary MA and at least two additional voting members (CAs). Data across these instances is synchronously replicated. If the primary (leader) MA fails, the remaining voting members (CAs) elect a new primary MA from amongst themselves.

If the MA becomes unavailable, one of the voting members is automatically elected as the new MA through a consensus mechanism. SingleStore ensures that your transactions maintain atomicity by guaranteeing that all changes are either fully committed or rolled back while the new MA is being elected. SingleStore recommends implementing a load balancer or a proxy so that, if the MA goes offline, your application can automatically retry and connect to the newly elected MA.
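
As a minimal sketch of this retry pattern (the hostnames, port, user, and password variable below are hypothetical placeholders, not part of any SingleStore tooling), a client-side wrapper can probe each voting-member endpoint in turn and use the first one that accepts a connection:

# Hypothetical endpoints for the MA and the two voting members; replace with your own.
ENDPOINTS="agg1.dc1.example.com agg2.dc2.example.com agg3.dc3.example.com"

for host in $ENDPOINTS; do
  # Any MySQL-protocol client works; here the mysql CLI issues a trivial probe query.
  if mysql -h "$host" -P 3306 -u admin -p"$SINGLESTORE_PASSWORD" -e "SELECT 1" >/dev/null 2>&1; then
    echo "Reachable aggregator endpoint: $host"
    break
  fi
done

A production setup would typically also confirm which endpoint is the current leader (for example, with SHOW AGGREGATORS) and add a backoff between attempts.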

Recovering a failed MA requires an automation process to be in place. Creating a new node is not always necessary; whether to create one depends on the type of failure encountered. Once the recovered or new node joins the voting membership, SingleStore's consensus algorithm ensures that it is fully synchronized and eligible to be elected as a potential MA.

Configure Multi-Datacenter Failover

To take advantage of this feature, you need SingleStore v8.9 or later and Toolbox v1.18.1 or later. With this combination, you can create a configuration of one MA and a minimum of two CAs set as voting members.

You can set up this feature in either of the following ways:

  • Using a YAML-based configuration file

  • Using Toolbox commands

Using a YAML-based Configuration File

If you set up a cluster from a YAML-based configuration file with the sdb-deploy setup-cluster --cluster-file ... command, the configuration requires:

  • consensus_enabled must be set to ON.

  • aggregator_role must be set to voting_member for each CA that will act as a voting member.

The configuration file will resemble the following format:

...
sync_variables:
 ...
 consensus_enabled: ON
...
hosts:
...
- hostname: ...
  ...
  nodes:
  ...
  - role: Aggregator
    aggregator_role: voting_member
    …

Refer to Deploy for more information.
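
For example, assuming the configuration above is saved as cluster.yaml (a hypothetical filename), the cluster would be deployed with:

sdb-deploy setup-cluster --cluster-file cluster.yaml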

Using Toolbox Commands

  1. Set the consensus_enabled global variable to ON.

    You can manually update the consensus_enabled variable using the command:

    sdb-admin update-config --key=consensus-enabled --value=ON --set-global --role master -y
  2. Add aggregators with a voting member aggregator role.

    A new voting member can be added to the cluster using either the sdb-admin create-node or sdb-admin add-aggregator commands:

    • For a new node:

      sdb-admin create-node --host <host> --port <port> --role aggregator --aggregator-role voting_member
    • For an existing node:

      sdb-admin add-aggregator --memsql-id <MemsqlID of the node> --role voting_member
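
After completing both steps, you can confirm that the setting took effect by querying the variable from any aggregator (a minimal check; the exact output formatting may differ):

SHOW VARIABLES LIKE 'consensus_enabled';

The Value column should report ON.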

You can change the role of an existing CA by combining two commands.

  • First, remove the aggregator:

    sdb-admin remove-aggregator --memsql-id <MemsqlID of the node>
  • Then, add it back to the cluster with a new role:

    sdb-admin add-aggregator --memsql-id <MemsqlID of the node> --role voting_member

To get the full list of aggregators and roles, use the sdb-admin show-aggregators command.

sdb-admin show-aggregators
✓ Successfully ran 'memsqlctl show-aggregators'
+-----------+------+--------+---------------+--------------------------------+------------------+---------------------+
|   Host    | Port | State  | Opened        | Average                        | Master           | Role                |
|           |      |        | Connections   | Roundtrip Latency (ms).        | Aggregator       |                     |
+-----------+------+--------+---------------+--------------------------------+------------------+---------------------+
| 127.0.0.1 | 3306 | online | 1             | null                           | 1                | Leader              |
| 127.0.0.1 | 3308 | online | 2             | 0.377                          | 0                | Voting Member       |
| 127.0.0.1 | 3309 | online | 2             | 0.313                          | 0                | Voting Member       |
| 127.0.0.1 | 3310 | online | 1             | 6.023                          | 0                | Voting Member       |
+-----------+------+--------+---------------+--------------------------------+------------------+---------------------+

Troubleshooting Multi-Datacenter Failover

If an MA is down, a new MA is elected from the set of voting members and automatically becomes available to Toolbox within a few moments. While Toolbox may still display the stopped MA as the primary MA, this does not affect any commands. Toolbox continues to work as long as at least one MA is up and running.

If Toolbox shows two or more running MAs (via sdb-admin list-nodes), some commands may become unavailable. In this case, you should manually stop one of the MAs using the sdb-admin stop-node --memsql-id command.
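
For example (the MemsqlID below is a placeholder), list the nodes to identify the extra MA, then stop it:

sdb-admin list-nodes
sdb-admin stop-node --memsql-id <MemsqlID of the extra MA>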

Note

SingleStore recommends disabling consensus before upgrading and then re-enabling consensus once the upgrade is complete.
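
A minimal sketch of that sequence, assuming the same update-config invocation shown earlier also accepts OFF as a value:

# Before the upgrade: disable consensus.
sdb-admin update-config --key=consensus-enabled --value=OFF --set-global --role master -y
# ... perform the upgrade ...
# After the upgrade completes: re-enable consensus.
sdb-admin update-config --key=consensus-enabled --value=ON --set-global --role master -y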

FAQs

  1. Which global engine variables are used in configuring this feature?

    The consensus_enabled and consensus_election_timeout variables are used.

    • consensus_enabled must be set to ON to add aggregators as voting members.

    • consensus_election_timeout controls the time, in milliseconds, for which a voting member waits before conducting an election if it does not hear from the MA. You can adjust the value if required.

      SHOW VARIABLES LIKE '%election%';
      +----------------------------+-------+
      | Variable_name              | Value |
      +----------------------------+-------+
      | consensus_election_timeout | 30000 |
      +----------------------------+-------+
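
      If you need to change the timeout, and assuming the variable can be updated at runtime like other sync variables (an assumption, not confirmed here), one possible adjustment from the MA is:

      SET GLOBAL consensus_election_timeout = 10000; -- hypothetical value, in milliseconds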
  2. When an MA goes offline and a new voting member becomes the MA, how should the cluster be restored to the three-voting-member configuration?

    You have to set up a process to restart the failed node. If restarting is not possible, you should remove the old aggregator, deploy a new CA, and add it as a voting member. Once this third voting member is provisioned, SingleStore’s consensus algorithm will ensure it is caught up and eligible to be elected as a potential MA.

  3. Are there any specific steps needed to catch up the new MA?

    No, the new voting member automatically catches up once it is back online.

  4. How can you find out whether an MA is down and a new MA has been successfully promoted?

    You can check the output of either of these commands from all the voting members. The current MA will be the one that the majority of the voting members report.

    • SHOW AGGREGATORS EXTENDED;
    • SELECT * FROM INFORMATION_SCHEMA.AGGREGATORS;
  5. What does a user or application need to do when a new MA is promoted?

    SingleStore recommends implementing application-level retry logic combined with a load balancer or proxy that manages the endpoints for all three MAs. This setup will enable your application to seamlessly connect to the newly promoted MA and continue its operations without encountering errors due to the former MA endpoint being offline.

    The load balancer or proxy should be configured to automatically detect and route traffic to the active MA, ensuring uninterrupted service during failover events. Additionally, the application's retry logic should be designed to handle transient connection failures and transparently reconnect to the newly promoted MA.

  6. If a DDL or DML operation is hitting the MA endpoint and, at the same time, the original MA goes offline and a new MA is being elected, what will happen to the operation? What are the various failure scenarios?

    SingleStore ensures your transactions maintain the atomicity property by guaranteeing that all changes are either fully committed or rolled back during the process of electing the new MA.

    In the following scenarios, SingleStore recommends implementing application-level retry logic (similar to Q#5) to ensure a DDL or DML request is re-established.

    The following scenarios describe the expected behavior:

    Scenario: A user sends a query to the CA, but the MA is offline.
    Behavior: The CA internally retries the query until a new MA is elected and the CA can connect to the new MA.

    Scenario: A user sends a query to the MA, but the MA is offline.
    Behavior: SingleStore recommends building retry logic into the application, along with a load balancer or a proxy, so that connections can be managed and redirected to the newly promoted MA.

    Scenario: A user sends a query or a multi-statement transaction to the CA, which forwards it to the MA, but the MA goes offline while executing the query.
    Behavior: Similar to the previous scenario, SingleStore recommends building retry logic into the application to allow connecting to the newly promoted MA. Without this retry logic, the user receives an error stating that the connection to the server was lost.

    Scenario: A user sends a query to the CA, which forwards it to the MA, but a new MA is elected while the previous MA is running the query.
    Behavior: Depending on how far the execution of the query has progressed, the CA either internally retries and forwards the query to the new MA, or the user may receive an error. To eliminate this error, SingleStore recommends building retry logic into your applications to connect to the new MA's endpoint.

    Scenario: A user sends a query to the MA, but another node is elected as the MA without the previous MA going offline.
    Behavior: The query fails with a "Can't write to replica" or "Not the master" error. Retrying the query forwards it to the new MA unless it is a non-forwardable query, in which case it fails again. Users executing non-forwardable queries need to drop the connection and reconnect on these errors.

  7. How should the CAs that are allocated as voting members be placed to best optimize resiliency?

    SingleStore recommends placing the CA nodes marked as voting members across different failure domains or datacenters. This placement improves resiliency.

  8. How does SingleStore resolve "split-brain" issues?

    Split-brain refers to a network partition that divides the cluster into two groups of nodes, where the nodes in one group can communicate with each other but cannot communicate with any node in the other group.

    To eliminate the possibility of a split-brain scenario, SingleStore ensures that only the primary MA can write data to the reference and cluster databases and execute DDL operations. Only this MA is responsible for managing cluster metadata, running cluster operations, and detecting failures on CAs and data nodes.

  9. Is it possible that after the split-brain resolution, the master aggregator is elected in Zone A, and the leaf survives in Zone B?

    If there is a majority of aggregators with the voting member role in Zone A, then one of them becomes the master aggregator. The leaf does not shut itself down, so it survives in the sense that it remains reachable from the other nodes in Zone B.

    However, the leaf is not reachable from the master aggregator in Zone A, so the master aggregator will eventually fail over the leaf. Assuming all databases use synchronous replication, what happens is:

    Case 1: If the cluster metadata states that the leaf in Zone B has the master instance M of a database partition and another leaf in Zone A has a replica R of that partition replicating synchronously, the master aggregator updates the cluster metadata to state that R is the new master instance and M is now an offline async replica.

    Case 2: If the cluster metadata states that the leaf in Zone B has the master instance M of a database partition and another leaf in Zone A has a replica R of that partition replicating asynchronously, the master aggregator only updates the cluster metadata to state that M is now offline.

  10. How are client queries and transactions processed while the cluster is recognizing that the aggregator is split-brained and the resolution is in progress? How does SingleStore ensure that DML sent through a CA is not processed on the leaf while the network is split?

    In case 1 from Q#9, clients/aggregators in Zone B will still be able to read from the database partition via M, but writes (to M) will block because R, which is in Zone A, cannot acknowledge those writes. These blocked writes do not commit, and R will not acknowledge them after the network partition heals, because R is now the new master instance. This is because M continues to replicate synchronously to R, even though the metadata change in case 1 might make it seem as though M would begin replicating asynchronously to R.

    M does not unilaterally decide to start replicating asynchronously to R. M sends a request to the master aggregator to update the cluster metadata to state that R must be in the async state, and only after that metadata update succeeds does M start replicating asynchronously to R. In this scenario, however, the master aggregator in Zone A is unreachable from M, so M continues to replicate synchronously to R.

    In case 1 from Q#9, clients/aggregators in Zone A will be able to read and write to the database partition via R.

    In case 2 from Q#9, clients/aggregators in Zone B will still be able to read and write to the database partition via M.

    In case 2 from Q#9, clients/aggregators in Zone A will not be able to read or write to the database partition.

  11. What happens in a network partition where the aggregators are evenly split, for example, 2 aggregators on one side and 2 on the other, and neither side has a quorum?

    Like any distributed system that relies on a consensus algorithm, SingleStore cannot resolve an even split across an even number of voting nodes. SingleStore recommends using an odd number of voting nodes (3, 5, 7) in consensus algorithms, for two main reasons:

    • Optimal fault tolerance: with 3 nodes, you need 2 votes to make a decision and can tolerate the failure of one node. With 4 nodes, you need 3 votes (one more), so you can still tolerate only one failure. Having 4 nodes instead of 3 (or 6 instead of 5, and so on) therefore provides no additional fault tolerance, only the downside of needing one extra vote to make decisions, as illustrated in the table below.

    • Avoiding split-brain: if you use an odd number of nodes, a network partition that splits the nodes into two groups always results in one of the sides having a quorum.
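
    In general, a majority quorum requires floor(N/2) + 1 votes out of N voting nodes, so the number of simultaneous failures that can be tolerated is N minus that quorum. The following illustrative table shows why even node counts add no fault tolerance:

    +-------+--------------+--------------------+
    | Nodes | Quorum votes | Tolerated failures |
    +-------+--------------+--------------------+
    |   3   |      2       |         1          |
    |   4   |      3       |         1          |
    |   5   |      3       |         2          |
    |   6   |      4       |         2          |
    |   7   |      4       |         3          |
    +-------+--------------+--------------------+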

Last modified: August 1, 2025
