Recommended Configurations to Tolerate Failure of a Cloud AZ or Nearby Data Center

If you self-host SingleStore and want your databases to tolerate the failure of an entire cloud availability zone (AZ) within a region, SingleStore recommends creating separate clusters in two different AZs in that region and replicating data from one to the other. You can replicate using REPLICATE DATABASE or with application logic.

Important

SingleStore does not support spanning a self-hosted cluster across multiple AZs, a configuration typically referred to as a "cross-AZ" or "multi-AZ" deployment.

While it is technically possible to span a cluster across multiple AZs, doing so is strongly discouraged for the following reasons:

  • The SingleStore architecture expects low, consistent latency between nodes; that is, it expects them to be on the same local area network. Cross-AZ latencies are usually 1 to 2 milliseconds (ms), but can occasionally, and at random, be much higher (up to around 100 ms), leading to dropped heartbeat messages or other failures in the SingleStore software. In addition, experiments show that cross-AZ query processing can incur up to roughly 100 ms of extra latency, at random and with low probability. By contrast, single-AZ deployments can deliver consistent single-digit-millisecond latency. The extra latency in multi-AZ clusters can cause some applications to miss their SLAs.

  • A cross-AZ SingleStore setup is easy to misconfigure in ways that would negate any ability to recover from an AZ failure.

  • If an AZ fails in a cross-AZ cluster, the cluster runs in a degraded state with reduced redundancy and reduced compute capacity, potentially for an extended period. While the cluster is degraded, this may lead to (1) a higher probability of an additional failure that takes the cluster offline and (2) poor response times.

  • Traffic across AZs occurs naturally during query processing (e.g. to process a shuffle operation), and cross-AZ traffic is potentially expensive. On AWS it costs $0.01 per GB in each direction. A heavy load that shuffles an average of 100 MB/sec of cross-AZ traffic would cost the following amount (shuffling requires sending and receiving the same amount of data, so the $0.01 is multiplied by two):

    2 * 0.01 * 0.1 * 60 * 60 * 24 = $172.80 per day (100 MB/sec = 0.1 GB/sec)

    172.80 * 365 = $63,072 per year
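The cost arithmetic above can be reproduced for other traffic rates with a small calculation. The following is a minimal sketch, not a pricing tool: the function name `cross_az_cost_usd` and its parameters are hypothetical, the default price is the $0.01 per GB per direction figure quoted above, and 1 GB is taken as 1,000 MB. Note that the traffic rate must be converted to GB/sec before it is multiplied by the per-GB price.

```python
def cross_az_cost_usd(mb_per_sec, usd_per_gb_each_direction=0.01, days=1):
    """Estimate the cross-AZ transfer cost of sustained shuffle traffic.

    Hypothetical helper for illustration only. Assumes decimal units
    (1 GB = 1000 MB) and a flat per-GB price in each direction.
    """
    gb_per_sec = mb_per_sec / 1000
    seconds = days * 24 * 60 * 60
    # Shuffled data is charged once when sent and once when received,
    # so the per-GB price is multiplied by two.
    return 2 * usd_per_gb_each_direction * gb_per_sec * seconds


if __name__ == "__main__":
    for rate in (100, 1000):  # average MB/sec of cross-AZ shuffle traffic
        daily = cross_az_cost_usd(rate)
        yearly = cross_az_cost_usd(rate, days=365)
        print(f"{rate} MB/sec: ${daily:,.2f} per day, ${yearly:,.2f} per year")
```

The estimate scales linearly with the traffic rate, so a load ten times heavier costs ten times as much.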

With Unlimited Storage databases and self-hosted SingleStore, you can also recover from the loss of an AZ in a region, with at most about 2 minutes of data loss, by attaching the S3 storage for a database to a new cluster in a second AZ in the same region. This works because S3 storage in a region is highly available across the AZs in that region.

If you use REPLICATE DATABASE to get cross-AZ redundancy, the databases must use local storage, because Unlimited Storage databases do not support REPLICATE DATABASE.
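As a rough illustration of the two-cluster approach, the sketch below builds the REPLICATE DATABASE statement that would be run on the secondary cluster, pointing at the primary cluster in the other AZ. It is a minimal sketch under stated assumptions: the helper `replicate_database_sql`, the database name, the user, and the address are all hypothetical, and the statement follows the basic `REPLICATE DATABASE db FROM user@host:port/db` form without a password clause. Check the exact syntax and options against the REPLICATE DATABASE reference for your SingleStore version before using it.

```python
def replicate_database_sql(db_name, primary_user, primary_host, primary_port=3306):
    """Build a REPLICATE DATABASE statement for the secondary cluster.

    Hypothetical helper for illustration only. It emits the basic
    user@host:port/db form and omits the optional password clause.
    """
    # Reject anything other than simple identifiers to avoid building
    # a malformed or unsafe statement from untrusted input.
    for ident in (db_name, primary_user):
        if not ident.replace("_", "").isalnum():
            raise ValueError(f"unexpected characters in {ident!r}")
    return (
        f"REPLICATE DATABASE {db_name} "
        f"FROM {primary_user}@{primary_host}:{int(primary_port)}/{db_name};"
    )


if __name__ == "__main__":
    # 192.0.2.10 is a documentation-only address standing in for the
    # primary cluster's master aggregator in the first AZ (an assumption).
    print(replicate_database_sql("orders", "repl_user", "192.0.2.10"))
```

The resulting statement would be executed on the master aggregator of the secondary cluster with whatever SQL client you already use.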

The recommendations given above also apply to SingleStore customers running their workloads in multiple nearby private data centers. Nearby data centers with fast network connections between them, in different buildings with independent power and cooling and less than 60 miles apart, are analogous to the cloud availability zones in the above discussion. For private data centers, the issue of charges for cross-data-center network traffic may not apply.

SingleStore Helios does use cross-AZ clusters for Enterprise Edition. However, it has special handling for timeout errors and other issues that can result from high, unpredictable cross-AZ message latency, plus dedicated monitoring and planned recovery procedures. These are not available for self-hosted SingleStore.

Last modified: November 18, 2024
