SingleStore is highly available by default. It ensures high availability by storing data redundantly across sets of nodes called availability groups; SingleStore supports two availability groups. Each availability group contains a copy of every partition in the system, some as primaries and some as replicas. As a result, SingleStore keeps two copies of your data in the system to protect the data against single-node failure.
The primary partitions are distributed evenly on nodes across the workspace. The primary partitions on every node in an availability group have their replicas spread evenly among a set of nodes in the opposite availability group. The even distribution of replicas ensures that a failover distributes the additional load from the node failure uniformly across the workspace. As read queries are sent to the primary partitions, balanced load distribution prevents the overloading of a single node with the newly promoted primary partitions.
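The placement described above can be sketched as a toy model. The node names, partition counts, and round-robin scheme below are purely illustrative; they are not SingleStore's actual placement algorithm, only a demonstration of how one node's primaries can fan their replicas out across several nodes in the opposite availability group:

```python
# Toy model of two availability groups: primaries are spread round-robin
# across one group, and each node's replicas fan out across the opposite
# group. Node names and partition counts are illustrative only.

def place_partitions(group_a, group_b, num_partitions):
    """Place primaries on group_a nodes and replicas on group_b nodes."""
    placement = {}
    for p in range(num_partitions):
        primary = group_a[p % len(group_a)]
        # Offset the replica index so a single node's primaries do not all
        # replicate to the same node in the opposite group.
        replica = group_b[(p // len(group_a) + p) % len(group_b)]
        placement[f"db_{p}"] = {"primary": primary, "replica": replica}
    return placement

placement = place_partitions(["node1", "node2"], ["node3", "node4"], 4)
for name, nodes in placement.items():
    print(name, nodes)
```

In this sketch, node1's two primaries (db_0 and db_2) end up with replicas on node3 and node4 respectively, so if node1 fails the promoted load is split between two nodes rather than landing on one.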
In the event of a node failure, SingleStore automatically promotes the appropriate replica partitions on the node’s pair into primary partitions, so that the databases remain online. The additional workload from the node failure is spread evenly among multiple other nodes, which contain the replica copies of the primary partitions in the failed node. However, if all of the machines fail, then data will be unavailable until enough machines are recovered or until the workspace is recreated from scratch.
The following diagrams illustrate the partition distribution before and after a workspace failover. In the first diagram, the primary partitions are distributed evenly across nodes. Replica copies of the primary partitions in an availability group are placed evenly across the nodes in the opposite availability group. For example, db_0 has a replica on Node 2, while db_1 has a replica on Node 4.
If Node 1 fails in this setup, SingleStore promotes the replica of db_0 on Node 2 to primary and the replica of db_1 on Node 4 to primary.
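The promotion step can be sketched as follows. This is a toy model mirroring the example above (db_0's replica on Node 2, db_1's replica on Node 4), not SingleStore's actual failover implementation:

```python
# Toy failover model: when a node fails, every partition whose primary
# lived on that node promotes its replica to primary. Illustrative only;
# not SingleStore's failover logic.

partitions = {
    "db_0": {"primary": "node1", "replica": "node2"},
    "db_1": {"primary": "node1", "replica": "node4"},
    "db_2": {"primary": "node3", "replica": "node2"},
}

def fail_over(partitions, failed_node):
    """Promote the replica of every partition whose primary was on failed_node."""
    for p in partitions.values():
        if p["primary"] == failed_node:
            # The replica becomes the new primary; the slot it vacates is
            # rebuilt later, once a replacement node is available.
            p["primary"], p["replica"] = p["replica"], None
    return partitions

fail_over(partitions, "node1")
print(partitions["db_0"]["primary"])  # node2 now serves db_0 as primary
```

Note that db_2, whose primary was on a healthy node, is untouched: only the partitions that lost their primary are promoted, which is why the extra load spreads across the nodes holding the failed node's replicas.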
When a node comes back online, it is automatically reintroduced to the workspace. Its partitions are either caught up with the current primaries or rebuilt from scratch.
In SingleStore Helios, one load balancer is set up for the Master Aggregator (MA) and a second load balancer for the Child Aggregators (CAs). The CA load balancer distributes traffic evenly across the CAs.
Note: In AWS, a Network Load Balancer (NLB) works better than a Classic Load Balancer (ELB). You can use a single NLB with different ports for the MA and CA targets.
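The even traffic distribution across CAs amounts to round-robin selection, which can be sketched in a few lines. The hostnames and port below are hypothetical, and an actual NLB performs this at the TCP layer rather than in application code:

```python
# Minimal round-robin sketch of spreading connections evenly across
# Child Aggregators. Endpoints are hypothetical examples.
from itertools import cycle

child_aggregators = cycle(["ca-1:3306", "ca-2:3306", "ca-3:3306"])

def next_endpoint():
    """Return the next CA endpoint in round-robin order."""
    return next(child_aggregators)

print([next_endpoint() for _ in range(6)])
# ['ca-1:3306', 'ca-2:3306', 'ca-3:3306', 'ca-1:3306', 'ca-2:3306', 'ca-3:3306']
```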
Availability Zone (AZ)
An availability zone (AZ) is a data center with its own power, cooling, and network connections. An AZ is physically separated from other AZs, so local failures, like a fire that destroys a whole AZ, will not affect other AZs. AWS, Azure, and Google Cloud Platform (GCP) all provide multiple AZs in each region where they operate.
Single and Multi AZ High Availability
High availability is offered in 1-AZ and 2-AZ configurations, depending on the SingleStore edition purchased. 2-AZ failover is supported only in SingleStore's cloud offering.
SingleStore Standard: 1-AZ
SingleStore Standard is deployed with high availability within a single cloud availability zone. This provides data redundancy within the workspace so that it can automatically recover from the failure of cloud instances within that availability zone.
SingleStore Enterprise: 2-AZ
SingleStore Enterprise is deployed with high availability across two cloud availability zones. Each availability group is located in a separate cloud availability zone, ensuring that data is resilient to both cloud instance failure and the failure of an entire cloud AZ.