Shard Keys

The second consideration when putting data into SingleStore is choosing the shard key for a table. Data is distributed across the SingleStore cluster into a number of partitions on the leaf nodes. The shard key is a collection of the columns in a table that are used to control how the rows of that table are distributed. To determine the partition responsible for a given row, SingleStore computes a hash from all the columns in the shard key to the partition ID. Therefore, rows with the same shard key will reside on the same partition.

For example, the table below has the shard key that contains only the first column. All people with the same first name will be stored on the same partition.

Check Your Understanding

Q: Why are there restrictions on ALTERing or UPDATEing the shard key of a table?

A: Unlike adding a new index to a rowstore table (which only requires adding information to existing data without moving it), ALTERing the shard key requires moving the table to different partitions. Similarly, UPDATEing a column in the shard key could require a row to be moved to a different partition.

Q: Why is the shard key required to be a subset of the primary key for a rowstore table?

A: The primary key of a rowstore table requires that there are no duplicate rows. We can only enforce uniqueness by ensuring that all potentially duplicate data is on the same partition.

Choosing a Shard Key

Shard keys are specified at the time of CREATE TABLE. Once created, the shard key for a table cannot be modified. There are two competing considerations when sharding data:

  • Distributing data evenly across partitions

  • Partitioning data on columns you frequently filter or join

First, distributing data evenly makes capacity planning much easier since the system will use its resources uniformly. Query performance can also suffer when the data is skewed or unevenly distributed, because there can be slow partitions. Multi-partition queries cannot run faster than the slowest partition involved.

Second, queries perform significantly better and use fewer resources when the optimizer can precisely understand what partitions are used. SingleStore only has to involve a single partition if the query filter matches the shard key, which greatly reduces cluster resource usage for high concurrency workloads. Similarly, joins of multiple tables that have the same shard key will be performed locally on the leaf partition rather than redistributing the data around the cluster, resulting in much faster joins that can be done at significantly higher concurrency. Joins of tables with different shard keys, or distributed joins, are substantially slower and more resource-intensive. Here, the order of the shard keys between one or more tables being joined must be the same in order to see this increase in performance. The only other affect of shard order is that rowstore skiplist indexes are created on the same columns chosen for the shard key. However, this order can always be redefined.

For example, the following query is considered single partition because the filter clause (where first = ‘john') includes the columns of the shard key. The aggregator only needs to talk to one partition to get the data.

For more about identifying and understanding the performance implications of single partition queries (and more), see the Query Tuning guide.

Skewed Partitions

A shard key should ensure that the rows are (almost) evenly distributed across partitions. The two concerns specified above, typically, run into conflict when the data is skewed. For example, first is probably a poor shard key in the table above, because some first names are much more common than others. In cases like these, it is generally more important that the data be evenly distributed so that the storage and query processing requirements are balanced across the cluster. If the partitions are unbalanced, some partitions may face significantly higher workload than others for distributed queries. These partitions may become the limiting factor on parallelized operations. Hence, you should choose shard keys carefully to avoid problematic skewed partitions. In the example above, user is probably a better choice of shard key. It will be more evenly distributed, and it is likely to be a filter for queries against the table. If you want a table to get truly uniform sharding, you may also shard on an auto-increment column.

Cardinality

A shard key reduces the cardinality across the set of columns it contains. GROUP BYs execute more efficiently when they match the shard keys, because it reduces the amount of data transfer that is required, which subsequently allows more efficient computations.

Joins

If the join conditions match the shard keys, colocated data reduces the computational needs for join operations. Hence, ensure that the join conditions match shard keys as much as possible.

Single-Partition Queries

Queries which perform an equality filter on a shard key value can be executed against a single partition, instead of fanning out to all the partitions.

More Ways to Distribute Data

In addition to specifying a key the system will use to shard the table, you may also distribute data as follows:

  • Keyless sharding is the default for tables with no primary key and no explicit shard key. You can explicitly declare a table as keyless sharded by specifying a shard key with an empty list of columns: shard key () ... With keyless sharding, data is distributed uniformly across partitions in most cases. Cases when the data becomes non uniform typically occur because INSERT … SELECT statements are optimized to insert locally into the same partition. This makes these operations substantially faster, but it does mean that they can cause skew where there wasn’t before or amplify existing skew. To remove skew in these cases, you can force the INSERT … SELECT to redistribute the data with the force_random_reshuffle query hint. See more about this in the INSERT reference topic. Keyless sharding also makes it impossible to do single partition queries or local (collocated) joins, because rows are not associated with specific partitions via an explicit key value. The following are examples of CREATE TABLE statements which will cause keyless sharding to be used.

CREATE TABLE t1(a INT, b INT);
CREATE TABLE t1(a INT, b INT, SHARD KEY());
  • Data in reference tables is duplicated on every node in the cluster, including aggregators. This makes it possible to do local joins between reference tables and any other tables; however, the table consumes a large amount of cluster storage because a copy of the entire table is stored on every node. Furthermore, reference tables replicate synchronously to the aggregators and asynchronously to the leaves, dramatically limiting the performance of writes. For star schema workloads, reference tables are ideal for small, slowly changing dimension tables.

Questions to Ask When Choosing a Shard Key

  • Does this table have a primary key? Make sure there is a shard key and that it is a subset of the primary key.

  • Do you frequently join on a specific set of columns (e.g. WHERE users.id = action.user_id and users.country = action.country)? Try to make the shard key a subset of the joined columns. Shard order should be the same between tables to accomplish this.

  • Do you frequently filter on a specific set of columns (e.g. WHERE user_id = 17 and date = '2007-06-14')? Try to make the shard key a subset of the filtered columns.

    Note

    Filtering on a range of the shard key (e.g. WHERE user_id BETWEEN 2 and 12) can negatively affect the performance of your query. Multiple partitions may need to be scanned, thereby slowing the retrieval of your result set.

  • Do you have high concurrency queries? Choose a shard key that allows these queries to be single partition.

  • Is your data skewed for your current choice of shard key? Try to add additional columns to the shard key to ensure even distribution.

  • Do you need to UPDATE or ALTER any fields in the shard key? Remove those fields from the shard key.

  • Is this a small, infrequently changing table that needs to be on every node in the cluster to guarantee local joins? Use a reference table instead of sharding the table.

In addition to these questions, see the Detecting and Resolving Data Skew topic for more information.

Check Your Understanding

Q: Why can the SingleStore Spark Connector load data directly into the leaf partitions for keyless sharding only?

A: If there is keyless sharding, data can be placed anywhere in the cluster. Since the leaf partitions are fully functional databases, the spark connector can connect directly to load the data in parallel. If a shard key is used, the connector must use an aggregator to ensure that data is inserted into the correct partition.

Q: Why does SingleStore use the primary key as a shard key if there is no explicit shard key?

A: It is very common to query tables with a filter on the primary key; in addition, the uniqueness constraint of primary keys guarantees that data will not be skewed.

Q: Why is it generally better to use a shard key with the fewest number of columns, as long as it’s not skewed?

A: We can only optimize queries to run against a single partition when they filter or join on the entire shard key. A smaller shard key is likely to match more queries.

Q: Why is it faster to run queries that group by the shard key?

A: When the grouping matches the shard key, queries are faster because the GROUP BY can be executed locally on each partition. This effect is much more pronounced when the groups are high-cardinality.

Last modified: April 3, 2023

Was this article helpful?