Clustering

How does SingleStore shard tables?

Every distributed table (except reference tables, which are replicated in whole on each leaf node) has a SHARD KEY that specifies which columns of a row to hash to determine what partition a row should reside in. When rows are inserted into a sharded table, they are hashed by the table’s shard key and sent to the leaf carrying the corresponding partition. This technique is commonly referred to as hash-based partitioning. You can choose how to shard each table by specifying its SHARD KEY as part of the CREATE TABLE statement. See Sharding for more details.

What are aggregator and leaf nodes?

SingleStore stores and computes data on leaf nodes. You can linearly scale both storage and computational power by adding more leaf nodes. Clients query an aggregator node, which in turn queries one or more leaf nodes to collect the rows required to execute the query. Multiple aggregators nodes perform the same functions with respect to executing Data Manipulation Language (DML) queries and allow clients to load-balance queries across the aggregators. Leaf nodes should not be queried directly except for maintenance purposes in exceptional situations.

What is a Master Aggregator?

The Master Aggregator is a specialized aggregator responsible for cluster monitoring and failover.

What happens if the Master Aggregator crashes?

If the Master Aggregator becomes unresponsive, clients can continue to execute DML queries (e.g. INSERT and SELECT) against the other aggregators, but DDL and clustering operations can not be performed until the master aggregator is revived or another aggregator is promoted to be the Master Aggregator.

How do I add nodes to SingleStore?

Using the new SingleStore management tools, you can add new nodes and assign them roles in your cluster. To see how to do that in a deployment, visit our SingleStore Tools reference.

How many aggregator and leaf nodes do I need?

SingleStore stores data in leaf nodes, so you need enough leaf nodes to store all your data in memory. If you are replicating data (redundancy level 2), you need twice as many leaf nodes.

The recommended number of aggregators depends on your use case. If, for instance, your cluster is being used for more than one type of workload (for example, it is the backend for a web application and also being queried by analysts), it is probably best to have multiple aggregators, or pools of aggregators, for these separate workloads. Aside from distribution of workload, the most significant factor to consider is network bandwidth. As a rule of thumb, clusters with 50 nodes or fewer should have about a 5:1 leaf to aggregator ratio. Clusters with more than 50 nodes can have closer to a 10:1 leaf to aggregator ratio. Note that you can also add nodes to a cluster to tune performance after it is up and running.

The appropriate ratio of aggregators to leaves also depends on the type of workload running. Transactional workloads that run many small queries or queries that involve only a single partition require more aggregators, since those queries interact with one aggregator and one leaf. Analytical workloads, especially those involving distributed joins, require fewer aggregators because almost all the work is performed on the leaves.

Can I query the leaf nodes individually?

Yes, but this is not recommended. It only should be done in troubleshooting scenarios.

After I ran a single host install, why are there 2 SingleStore nodes (or 4 SingleStore processes) running?

This is the single host cluster setup. Your machine is running both a master aggregator process and a leaf process, and hence 2 SingleStore Nodes is shown (each SingleStore instance or node contains two processes). Make sure to send your queries to the aggregator.