High Availability

SingleStore is highly available by default. It ensures high availability by storing data redundantly across sets of nodes called availability groups. SingleStore supports two availability groups. Each availability group contains a copy of every partition in the system, some as primaries and some as replicas. As a result, SingleStore keeps two copies of your data in the system, protecting it against single-node failure.

The primary partitions are distributed evenly on nodes across the workspace. The primary partitions on every node in an availability group have their replicas spread evenly among a set of nodes in the opposite availability group. The even distribution of replicas ensures that a failover distributes the additional load from the node failure uniformly across the workspace. As read queries are sent to the primary partitions, balanced load distribution prevents the overloading of a single node with the newly promoted primary partitions.
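You can inspect this layout yourself for a given database. The following is a minimal sketch, not an official tool: it assumes a MySQL-wire-compatible Python driver (pymysql), placeholder connection details, and that the SHOW PARTITIONS output includes Ordinal, Host, Port, and Role columns (exact column names can vary by version).

    import pymysql

    # Placeholder connection details; point these at an aggregator in your workspace.
    conn = pymysql.connect(host="svc-example.aws.singlestore.com", port=3306,
                           user="admin", password="...", database="db_0")

    with conn.cursor(pymysql.cursors.DictCursor) as cur:
        # List every partition of db_0, the node hosting each copy, and whether
        # that copy is currently acting as a primary or a replica.
        cur.execute("SHOW PARTITIONS ON db_0")
        for row in cur.fetchall():
            print(row["Ordinal"], row["Host"], row["Port"], row["Role"])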

In the event of a node failure, SingleStore automatically promotes the appropriate replica partitions on the node’s pair into primary partitions, so that the databases remain online. The additional workload from the node failure is spread evenly among multiple other nodes, which contain the replica copies of the primary partitions on the failed node. However, if all of the nodes fail, data will be unavailable until enough nodes are recovered or until the workspace is recreated from scratch.

The following diagrams illustrate the partition distribution before and after a workspace failover. In the first diagram, the primary partitions are distributed evenly across nodes. Replica copies of the primary partitions in an availability group are placed evenly across the nodes in the opposite availability group. For example, db_0 has a replica on Node 2, while db_1 has a replica on Node 4.

If Node 1 fails in this setup, SingleStore promotes the replica of db_0 on Node 2 to primary and the replica of db_1 on Node 4 to primary.

When a node comes back online, it is automatically reintroduced to the workspace. Its partitions are then either caught up to the current state or rebuilt from scratch.

In SingleStore Helios, one load balancer is set up for the Master Aggregator (MA) and a second load balancer for the Child Aggregators (CAs). The load balancer for the CAs distributes traffic evenly across the CAs.

Note: In AWS, the Network Load Balancer (NLB) works better than the classic Elastic Load Balancer (ELB). You can use a single NLB with different ports for the MA and CA targets.

Availability Zone (AZ)

An availability zone (AZ) is a data center with its own power, cooling, and network connections. An AZ is physically separated from other AZs, so local failures, like a fire that destroys a whole AZ, will not affect other AZs. AWS, Azure, and Google Cloud Platform (GCP) all provide multiple AZs in each region where they operate.

Single and Multi AZ High Availability

High availability is offered in 1-AZ and 2-AZ configurations, depending on the edition of SingleStore purchased. 2-AZ failover is only supported in SingleStore’s cloud offering.

SingleStore Standard: 1-AZ

SingleStore Standard is deployed with high availability within a single cloud availability zone. This ensures data redundancy within the workspace, allowing automatic recovery from the failure of cloud instances within the availability zone.

SingleStore Enterprise: 2-AZ

SingleStore Enterprise is deployed with high availability across two cloud availability zones. Each availability group is located in a separate cloud availability zone, ensuring that data is resilient to both cloud instance failure and the failure of an entire cloud AZ.

High Availability for the Master Aggregator

Note

This is a Preview feature.

This feature is available to all SingleStore Helios customers who have enabled multi-AZ failover.

When an availability zone (AZ) containing a Master Aggregator (MA) fails, the MA is automatically replaced by a node in a different AZ.

SingleStore Dedicated is deployed with high availability across cloud AZs. There is a single MA, and the remaining aggregators are called Child Aggregators (CAs). The aggregators are deployed across three different cloud AZs in order to survive the failure of a single AZ, and the system maintains availability of the MA automatically. Each availability group is located in a separate cloud availability zone, ensuring that data is resilient to both cloud instance failure and the failure of an entire cloud AZ.

When an MA fails, an election process automatically takes place in which a designated subset of the aggregators, called voting members, determine among themselves the node that will become the new MA. Any group containing a majority of the voting members is guaranteed to have a full copy of all relevant cluster metadata. Elections follow the Raft protocol in order to ensure that agreement is reached on which node will be the new MA, even in the event of additional failures of hardware, software, or communications.

An Example to Illustrate Auto MA Failover

Graphic showing the interaction between availability zones and master aggregators

In the above illustration:

  • MA failure in AZ1 is detected.

  • Election takes place between the voting members in AZ2 and AZ3 (quorum).

  • The node in AZ3 is elected as the new MA.

  • Automatic failover to the new MA in AZ3 takes place.

Elections can interrupt ongoing DDL and reference table DML until they complete. Hence backups, ALTER TABLE, CREATE TABLE, CREATE DATABASE, DROP DATABASE, ATTACH/DETACH DATABASE, PITR, and reference table write activities will be affected. Reference table reads will continue working. The impact to the workload is the same as during upgrades.

Disruptions are typically resolved within 1-2 minutes. If only one availability zone (AZ) fails, then no data will be lost, and availability is minimally affected. If more than one AZ is lost simultaneously (a very unlikely event), there is a possibility of data loss, and manual intervention may be required both to restore data to the most current available state, and, if necessary, establish a new MA.

If an application loses its connection because the MA was down, reconnect to the same address, following the same procedure as when the MA is down during an upgrade. SingleStore recommends that applications have retry logic built in, as in the sketch below.
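The exact retry policy depends on the client and driver, but the idea is simply to retry the connection with a short backoff while failover completes. Below is a minimal sketch under assumed conditions: a MySQL-wire-compatible Python driver (pymysql) and placeholder connection details; tune the timeouts and attempt counts for your own workload.

    import time
    import pymysql

    def connect_with_retry(attempts=12, delay=10):
        """Reconnect to the same workspace address while an MA failover completes."""
        for attempt in range(1, attempts + 1):
            try:
                return pymysql.connect(host="svc-example.aws.singlestore.com",
                                       port=3306, user="admin", password="...",
                                       database="db_0", connect_timeout=10)
            except pymysql.err.OperationalError:
                if attempt == attempts:
                    raise
                time.sleep(delay)  # disruptions typically resolve within 1-2 minutes

    conn = connect_with_retry()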

To help users monitor MA, HA, and failover activity, the following roles are shown in the information_schema.AGGREGATORS view and in the SHOW AGGREGATORS EXTENDED output (a query sketch follows the list):

  • Master Aggregator is a special voting member that is the only node in the cluster that can write to the reference databases and the cluster database. This node is responsible for managing cluster metadata, executing cluster operations, and detecting failures.

  • Voting Member is a node in the SingleStore cluster that participates in the election process in the event of MA failure.

  • Demoted Voting Member is a voting member that experienced a communication failure with the MA and has stopped participating in replication and elections. Once this node is reachable by the MA again and has synchronized its copies of the reference databases, it is automatically transitioned back to a voting member.
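As a rough illustration of how these roles can be inspected, the sketch below runs both statements through pymysql. The same hedged assumptions as above apply: the connection details are placeholders, and the columns returned may differ across versions, so rows are printed as-is.

    import pymysql

    # Placeholder connection details for any aggregator in the workspace.
    conn = pymysql.connect(host="svc-example.aws.singlestore.com", port=3306,
                           user="admin", password="...")

    with conn.cursor(pymysql.cursors.DictCursor) as cur:
        # Each row describes one aggregator, including its role
        # (Master Aggregator, Voting Member, or Demoted Voting Member).
        cur.execute("SHOW AGGREGATORS EXTENDED")
        for row in cur.fetchall():
            print(row)

        cur.execute("SELECT * FROM information_schema.AGGREGATORS")
        for row in cur.fetchall():
            print(row)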

