High Availability for the Master Aggregator
Note: This is a Preview feature and will be part of an upcoming add-on pricing package.
This feature introduces High Availability for the Master Aggregator (HA for MA).
The Master Aggregator (MA) plays a pivotal role in the management and operational coordination of a SingleStore cluster.
To enable HA for MA, you must explicitly deploy an MA and two or more Child Aggregators (CAs) with the voting member role assigned at the time of cluster creation.
In the event of the MA becoming unavailable, one of the voting members will automatically be elected as the new MA through a consensus mechanism.
Recovering a failed MA node requires an automation process to be in place.
Configuring HA for MA for a Cluster
You should have at least the 8.x version of SingleStore.
You can set up a cluster with HA for MA in either of these two ways:
- Using the YAML configuration file, or
- Using a combination of the Toolbox commands
Using the YAML Configuration File
If you set up a cluster from a YAML configuration file with the sdb-deploy setup-cluster --cluster-file command:
- consensus_enabled should be ON, and
- the aggregator_role should be set as voting_member for all the CAs that will act as voting members.
The configuration file should be similar to the following format:

...
sync_variables:
  ...
  consensus_enabled: ON
...
hosts:
  - hostname: ...
    ...
    nodes:
      ...
      - role: Aggregator
        aggregator_role: voting_member
...
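
As an illustration only, these settings might fit together in a cluster file similar to the following sketch. The hostnames are hypothetical, the role: Master value for the MA node is an assumption, and a real file also needs the other settings (license, passwords, ports, and so on) that sdb-deploy setup-cluster requires; only consensus_enabled, role: Aggregator, and aggregator_role: voting_member are taken from the fragment above.

sync_variables:
  consensus_enabled: ON                    # required for HA for MA
hosts:
  - hostname: host-ma.example.internal     # hypothetical host for the MA
    nodes:
      - role: Master                       # assumed role name for the MA node
  - hostname: host-ca1.example.internal    # hypothetical hosts for the voting CAs
    nodes:
      - role: Aggregator
        aggregator_role: voting_member
  - hostname: host-ca2.example.internal
    nodes:
      - role: Aggregator
        aggregator_role: voting_member

The cluster would then be deployed with a command similar to:

sdb-deploy setup-cluster --cluster-file /path/to/cluster-file.yaml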
Refer to Deploy for more information.
Using the Toolbox sdb-admin Commands
- Set the consensus_enabled global variable to ON.
- Add aggregators with a voting member aggregator role.
You can manually update the consensus_enabled variable using the following command:
sdb-admin update-config --key=consensus-enabled --value=ON --set-global --role master -y
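
Optionally, verify the change from a SQL client connected to the MA; the variable should now report ON:

SHOW VARIABLES LIKE 'consensus_enabled';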
A new voting member can be added to the cluster using either the sdb-admin create-node or the sdb-admin add-aggregator command:
- For a new node:
  sdb-admin create-node --host ... --port ... --role aggregator --aggregator-role voting_member ...
- For an existing node:
  sdb-admin add-aggregator --memsql-id ... --role voting_member
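
As a concrete illustration of the existing-node path, with a hypothetical MemSQL ID (look up the real ID first), you might run:

sdb-admin list-nodes                                                    # note the MemSQL ID of the target CA
sdb-admin add-aggregator --memsql-id 0A1B2C3D... --role voting_member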
You can change the role of an existing CA by combining two commands:
- First, remove the aggregator:
  sdb-admin remove-aggregator --memsql-id ...
- Then, add it again with a new role:
  sdb-admin add-aggregator --memsql-id ... --role voting_member
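
Putting the two commands together, an end-to-end sketch with a hypothetical MemSQL ID might look like this:

sdb-admin remove-aggregator --memsql-id 0A1B2C3D...                     # remove the existing aggregator
sdb-admin add-aggregator --memsql-id 0A1B2C3D... --role voting_member   # re-add it as a voting member
sdb-admin show-aggregators                                              # confirm the Role column shows Voting Member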
You can get the full list of aggregators and their roles using the sdb-admin show-aggregators command:
sdb-admin show-aggregators
✓ Successfully ran 'memsqlctl show-aggregators'
+-----------+------+--------+--------------------+--------------------------------+-------------------+---------------+
| Host      | Port | State  | Opened Connections | Average Roundtrip Latency (ms) | Master Aggregator | Role          |
+-----------+------+--------+--------------------+--------------------------------+-------------------+---------------+
| 127.0.0.1 | 3306 | online | 1                  | null                           | 1                 | Leader        |
| 127.0.0.1 | 3308 | online | 2                  | 0.377                          | 0                 | Voting Member |
| 127.0.0.1 | 3309 | online | 2                  | 0.313                          | 0                 | Voting Member |
| 127.0.0.1 | 3310 | online | 1                  | 6.023                          | 0                 | Voting Member |
+-----------+------+--------+--------------------+--------------------------------+-------------------+---------------+
Troubleshooting HA for MA using Toolbox
If an MA is down, a new MA will be elected from the set of voting members and automatically become available to the Toolbox within a few seconds.
If the Toolbox shows two or more running MAs (you can check using the sdb-admin list-nodes command), some commands may become unavailable. In that case, stop the stale MA node with the sdb-admin stop-node --memsql-id command.
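
For example, with a hypothetical MemSQL ID for the stale MA:

sdb-admin list-nodes                            # identify the extra node reported as a Master Aggregator
sdb-admin stop-node --memsql-id 0A1B2C3D...     # stop the stale MA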
HA for MA FAQs
- Which global engine variables are used in configuring HA for MA?
  The consensus_enabled and consensus_election_timeout variables are used.
  - consensus_enabled must be set to ON to add aggregators as voting members.
  - consensus_election_timeout controls the time, in milliseconds, for which a voting member waits before conducting an election if it does not hear from the MA. You can adjust the value if required.
    SHOW VARIABLES LIKE '%election%';
    +----------------------------+-------+
    | Variable_name              | Value |
    +----------------------------+-------+
    | consensus_election_timeout | 30000 |
    +----------------------------+-------+
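    If you need a different timeout, it can presumably be updated the same way as consensus_enabled above; the --key name below is an assumption that follows the naming pattern of the earlier update-config command, and the value is only illustrative:
    sdb-admin update-config --key=consensus-election-timeout --value=10000 --set-global --role master -y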
- When an MA goes offline and a new voting member becomes the MA, how should the cluster be reconfigured to the three-MA configuration?
  You have to set up a process to restart the failed node or spin up a new CA and add it as a voting member. Once this third voting member is provisioned, SingleStore’s consensus algorithm will ensure it is caught up and eligible to be elected as a potential MA.
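  A minimal sketch of the two options, with hypothetical MemSQL ID, host, and port values; sdb-admin start-node is assumed to be available as the counterpart of stop-node, and the exact steps depend on your automation:
  # Option 1: restart the failed node; it rejoins and catches up automatically.
  sdb-admin start-node --memsql-id 0A1B2C3D...
  # Option 2: provision a replacement CA and register it as a voting member.
  sdb-admin create-node --host 10.0.0.5 --port 3307 --role aggregator --aggregator-role voting_member ...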
- Are there any specific steps needed to catch up the new MA?
  No, the new voting member automatically catches up once it is back online.
- How can you find out if an MA is down and a new MA has been successfully promoted?
  You can check the output of the SHOW AGGREGATORS EXTENDED command and/or the INFORMATION_SCHEMA.AGGREGATORS table from all the voting members. The current MA will be the one that the majority of the voting members report.
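  For example, from a SQL client connected to each voting member (both names are used above), compare which node is reported as the Master Aggregator:
  SHOW AGGREGATORS EXTENDED;
  SELECT * FROM INFORMATION_SCHEMA.AGGREGATORS;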
- What does a user or application need to do when a new MA is promoted?
  SingleStore recommends implementing application-level retry logic combined with a load balancer or proxy that manages the endpoints for all three MAs.
  This setup will enable your application to seamlessly connect to the newly promoted MA and continue its operations without encountering errors due to the old MA endpoint being offline. The load balancer or proxy should be configured to automatically detect and route traffic to the active MA, ensuring uninterrupted service during failover events.
  Additionally, the application's retry logic should be designed to handle transient connection failures and transparently reconnect to the newly promoted MA.
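  A minimal, hedged sketch of such retry logic (not a prescribed implementation): it assumes a load balancer or proxy exposes the hypothetical endpoint ma-lb.example.internal that always routes to the current MA, and uses the MySQL-compatible client with hypothetical credentials and statement:
  # Retry a statement against the load-balanced MA endpoint during a failover window.
  for attempt in 1 2 3 4 5; do
    if mysql -h ma-lb.example.internal -P 3306 -u app_user -p"$APP_PASSWORD" \
         -e "CREATE DATABASE IF NOT EXISTS app_db;"; then
      break
    fi
    echo "Attempt $attempt failed; retrying in 5 seconds..." >&2
    sleep 5
  done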
- How does SingleStore resolve split-brain issues?
  To eliminate the possibility of a split-brain scenario, SingleStore ensures that only the primary MA can write data to the reference and cluster databases and execute DDL operations.
  Only this MA is responsible for managing cluster metadata, executing cluster operations, and detecting failures on CAs and data nodes.
- If a DDL or DML operation is hitting the MA endpoint and, at the same time, the original MA goes offline and a new MA is being elected, what will happen to the operation? What are the various failure scenarios?
  SingleStore ensures your transactions maintain the atomicity property by guaranteeing that all changes are either fully committed or rolled back during the process of electing the new MA.
  In the following scenarios, SingleStore recommends implementing application-level retry logic (similar to FAQ #5) to ensure a DDL or DML request is re-established.
  Scenarios and descriptions:
  - Scenario: A user sends a query to the CA, but the MA is offline.
    Description: The CA will internally retry the query until a new MA is elected and the CA can connect to the new MA.
  - Scenario: A user sends a query to the MA, but the MA is offline.
    Description: SingleStore recommends building retry logic in the application along with a load balancer or a proxy, so they can manage the connections and connect to the newly promoted MA.
  - Scenario: A user sends a query or a multi-statement transaction to the CA, which forwards it to the MA, but the MA goes offline while executing the query.
    Description: Similar to the above scenario, SingleStore recommends building retry logic in the application to allow connecting to the newly promoted MA. Without this retry logic, the user will get a typical error stating that the connection to the server was lost.
  - Scenario: A user sends a query to the CA, which forwards it to the MA, but a new MA is elected while the previous MA is executing the query.
    Description: Depending on how far the execution of the query has progressed, the CA will internally retry and forward the query to the new MA, or the user may get an error. To eliminate this error, SingleStore recommends building retry logic in your applications to connect to the new MA’s endpoint.
- How should the CAs that are allocated as voting members be placed to best optimize resiliency?
  SingleStore recommends placing the CA nodes marked as voting members across different failure domains or availability groups. This helps improve resiliency.
Last modified: November 25, 2024