SingleStore Managed Service

High Availability

SingleStore is highly available by default. It ensures high availability by storing data redundantly across sets of nodes called availability groups. SingleStore supports two availability groups. Each availability group contains a copy of every partition in the system, some as masters and some as replicas. As a result, SingleStore holds two copies of your data, protecting it against single-node failure.

The master partitions are distributed evenly across the nodes in the cluster. The master partitions on each node in an availability group have their replicas spread evenly among the nodes in the opposite availability group. This even distribution of replicas ensures that a failover spreads the additional load from the failed node uniformly across the cluster. Because read queries are served by the master partitions, balanced load distribution prevents any single node from being overloaded with newly promoted master partitions.

Diagram: load-balanced partition placement
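You can see this placement for yourself by inspecting partition metadata over the cluster's MySQL-compatible protocol. The following Python sketch is illustrative only: the endpoint, credentials, and database name (db) are placeholders, it assumes the pymysql driver, and it assumes the SHOW PARTITIONS output reports ordinal, host, port, and role columns in that order, which may vary by version.

# Illustrative sketch: group partitions by the node that hosts them.
# Endpoint, credentials, and database name are placeholders.
import pymysql

conn = pymysql.connect(
    host="svc-example.singlestore.com",  # hypothetical endpoint
    port=3306,
    user="admin",
    password="...",
    database="db",
)

placement = {}
with conn.cursor() as cur:
    cur.execute("SHOW PARTITIONS ON db")
    # Assumed column order: ordinal, host, port, role, ...
    for ordinal, host, port, role, *rest in cur.fetchall():
        placement.setdefault((host, port), []).append((ordinal, role))

for (host, port), parts in sorted(placement.items()):
    masters = sum(1 for _, role in parts if (role or "").lower() == "master")
    print(f"{host}:{port} -> {len(parts)} partitions, {masters} masters")

In a healthy cluster, the master count printed for each node should be roughly equal, which is the balance described above.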

In the event of a node failure, SingleStore automatically promotes the appropriate replica partitions in the other availability group to masters, so that the databases remain online. The additional workload is spread evenly across the multiple nodes that hold replica copies of the failed node's master partitions. However, if all of the nodes fail, the data is unavailable until enough nodes are recovered or until the cluster is recreated from scratch.

The following diagrams illustrate the partition distribution before and after a failover. In the first diagram, the master partitions are distributed evenly across the nodes, and the replica copies of each availability group's master partitions are placed evenly across the nodes in the opposite availability group. For example, db_0 has a replica on Node 2, while db_1 has a replica on Node 4.

Diagram: load-balanced partition placement before the failover

If Node 1 fails in this setup, SingleStore promotes the replica of db_0 on Node 2 to master and the replica of db_1 on Node 4 to master.

Diagram: partition distribution after Node 1 fails
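The promotion step can be pictured with a small model. The snippet below is a toy illustration of the example above, not SingleStore's actual failover logic; the partition and node names mirror the diagrams, and a real cluster has many more partitions per node.

# Toy model of the failover described above; not SingleStore's implementation.
# Only the two partitions from the example are shown.
placement = {
    # partition: current master node and its replica node
    "db_0": {"master": "Node 1", "replica": "Node 2"},
    "db_1": {"master": "Node 1", "replica": "Node 4"},
}

def fail_node(placement, failed):
    """Promote the replica of every partition whose master lived on the failed node."""
    for name, copies in placement.items():
        if copies["master"] == failed:
            copies["master"], copies["replica"] = copies["replica"], None
            print(f"{name}: replica on {copies['master']} promoted to master")

fail_node(placement, "Node 1")
# db_0: replica on Node 2 promoted to master
# db_1: replica on Node 4 promoted to master

Because the replicas of Node 1's masters live on different nodes, the promoted masters and their query load end up on Node 2 and Node 4 rather than piling onto a single node.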

When a node comes back online, it is automatically reintroduced into the cluster. The partitions on the node are then either caught up with the current masters or rebuilt from scratch.
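To confirm that redundancy has been restored after a node rejoins, one approach is to poll partition metadata until every partition ordinal again has both a master and a replica. As before, this is a sketch rather than a supported recipe: the connection details are placeholders, and the exact role strings reported (for example, older versions use a different term for replicas) may differ.

# Sketch: wait until every partition has both a master and a replica copy.
# Connection details are placeholders; role strings may vary by version.
import time
import pymysql

def redundancy_restored(conn, db="db"):
    roles_by_ordinal = {}
    with conn.cursor() as cur:
        cur.execute(f"SHOW PARTITIONS ON {db}")
        for ordinal, host, port, role, *rest in cur.fetchall():
            roles_by_ordinal.setdefault(ordinal, set()).add((role or "").lower())
    # "slave" covers older versions that use that term for replicas
    return all(
        "master" in roles and ({"replica", "slave"} & roles)
        for roles in roles_by_ordinal.values()
    )

conn = pymysql.connect(host="svc-example.singlestore.com", user="admin",
                       password="...", database="db")
while not redundancy_restored(conn):
    time.sleep(10)
print("Every partition has a master and a replica again.")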