The second consideration when putting data into SingleStore is choosing the shard key for a table.
For example, the table below has a shard key that contains only the first column.
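The example table itself is not reproduced in this extract. Based on the column names used later in this section (user, first), it likely resembled the following sketch (the table name and column types are assumptions):

```sql
-- Illustrative reconstruction only: name and types are assumed.
-- Every row with the same first name hashes to the same partition.
CREATE TABLE people (
  user BIGINT,
  first VARCHAR(100),
  last VARCHAR(100),
  SHARD KEY (first)
);
```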
Q: Why are there restrictions on updating the shard key of a table?
A: Unlike adding a new index to a rowstore table (which only requires adding information about existing data without moving it), altering the shard key requires moving rows to different partitions. Likewise, updating a column in the shard key could require a row to be moved to a different partition, because a row's partition is determined by a hash of its shard key values.
Q: Why is the shard key required to be a subset of the primary key for a rowstore table?
A: The primary key of a rowstore table requires that there are no duplicate rows. If the shard key is a subset of the primary key, any two rows with the same primary key values are guaranteed to hash to the same partition, so uniqueness can be enforced locally within a single partition instead of by checking every partition on each write.
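As an illustrative sketch (table and column names are not from the original page), a shard key that satisfies this rule:

```sql
-- Valid: the shard key (user) is a subset of the primary key (user, id),
-- so the uniqueness check for (user, id) only ever involves one partition.
CREATE TABLE clicks (
  user BIGINT,
  id BIGINT,
  ts DATETIME,
  PRIMARY KEY (user, id),
  SHARD KEY (user)
);
```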
Shard keys are specified at the time of CREATE TABLE. When choosing a shard key, there are two primary goals:
Distributing data evenly across partitions
Partitioning data on columns you frequently filter or join
First, distributing data evenly makes capacity planning much easier since the system will use its resources uniformly.
Second, queries perform significantly better and use fewer resources when the optimizer can precisely understand what partitions are used.
For example, the following query is considered single partition because the filter clause (WHERE first = 'john') includes the columns of the shard key.
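The query itself is missing from this extract; assuming a table sharded on a first column, it would look like:

```sql
-- Single-partition query: the equality filter covers the entire shard
-- key (first), so SingleStore routes it to exactly one partition
-- instead of fanning out to all of them.
SELECT * FROM people WHERE first = 'john';
```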
For more about identifying and understanding the performance implications of single partition queries (and more), see the Query Tuning guide.
A shard key should ensure that the rows are (almost) evenly distributed across partitions.
first is probably a poor shard key in the table above, because some first names are much more common than others.
user is probably a better choice of shard key.
The cardinality of the set of columns in the shard key also matters: a low-cardinality shard key distributes data across fewer distinct hash values, which can cause skew.
GROUP BY queries execute more efficiently when the grouping columns match the shard key, because each partition can compute its groups locally, reducing the amount of data transfer required.
If the join conditions match the shard keys, colocated data reduces the computational needs for join operations.
Queries which perform an equality filter on a shard key value can be executed against a single partition, instead of fanning out to all the partitions.
In addition to specifying a key the system will use to shard the table, you may also distribute data as follows:
Keyless sharding is the default for tables with no primary key and no explicit shard key. You can explicitly declare a table as keyless sharded by specifying a shard key with an empty list of columns: SHARD KEY (). With keyless sharding, data is distributed uniformly across partitions in most cases. Cases where the data becomes non-uniform typically occur because INSERT ... SELECT statements are optimized to insert locally into the same partition. This makes these operations substantially faster, but it does mean that they can cause skew where there wasn't any before, or amplify existing skew. To remove skew in these cases, you can force the INSERT ... SELECT to redistribute the data with the random_reshuffle query hint; see the INSERT reference topic for details. Keyless sharding also makes it impossible to do single-partition queries or local (colocated) joins, because rows are not associated with specific partitions via an explicit key value.
The following CREATE TABLE statements will both cause keyless sharding to be used:
CREATE TABLE t1(a INT, b INT);
CREATE TABLE t1(a INT, b INT, SHARD KEY());
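As a sketch of forcing redistribution on an INSERT ... SELECT, assuming the hint spelling and placement shown here (confirm both against the INSERT reference topic):

```sql
-- Without a hint, each leaf may insert rows locally into its own
-- partition, preserving or amplifying any skew in source_table.
-- The hint below (assumed syntax) forces rows to be redistributed
-- randomly across all partitions of target_table instead.
INSERT INTO target_table
SELECT WITH (force_random_reshuffle = 1) * FROM source_table;
```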
Data in reference tables is duplicated on every node in the cluster, including aggregators.
This makes it possible to do local joins between reference tables and any other tables; however, the table consumes a large amount of cluster storage because a copy of the entire table is stored on every node. Furthermore, reference tables replicate synchronously to the aggregators and asynchronously to the leaves, dramatically limiting the performance of writes. For star schema workloads, reference tables are ideal for small, slowly changing dimension tables.
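A reference table for a small, slowly changing dimension might look like this (illustrative table and column names):

```sql
-- Replicated to every node, so joins against it are always local.
-- Best kept small and rarely updated, since every write is
-- replicated cluster-wide.
CREATE REFERENCE TABLE countries (
  country_code CHAR(2) PRIMARY KEY,
  country_name VARCHAR(100)
);
```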
Does this table have a primary key? Make sure there is a shard key and that it is a subset of the primary key.
Do you frequently join on a specific set of columns (e.g., WHERE users.id = action.user_id AND users.country = action.country)? Try to make the shard key a subset of the joined columns. Shard key column order should be the same between tables to accomplish this.
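For instance (an illustrative sketch using the column names above), two tables sharded on the same join columns in the same order allow the join to run locally on each partition:

```sql
-- Both tables shard on (id/user_id, country) in the same order, so
-- matching rows live on the same partition and the join is colocated.
CREATE TABLE users (
  id BIGINT,
  country CHAR(2),
  name VARCHAR(100),
  SHARD KEY (id, country)
);

CREATE TABLE action (
  action_id BIGINT,
  user_id BIGINT,
  country CHAR(2),
  SHARD KEY (user_id, country)
);

SELECT u.name, COUNT(*)
FROM users u
JOIN action a ON u.id = a.user_id AND u.country = a.country
GROUP BY u.name;
```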
Do you frequently filter on a specific set of columns (e.g., WHERE user_id = 17 AND date = '2007-06-14')? Try to make the shard key a subset of the filtered columns.
Filtering on a range of the shard key (e.g., WHERE user_id BETWEEN 2 AND 12) can negatively affect the performance of your query. Multiple partitions may need to be scanned, thereby slowing the retrieval of your result set.
Do you have high concurrency queries? Choose a shard key that allows these queries to be single partition.
Is your data skewed for your current choice of shard key? Try to add additional columns to the shard key to ensure even distribution.
Do you need to ALTER any fields in the shard key? Remove those fields from the shard key.
Is this a small, infrequently changing table that needs to be on every node in the cluster to guarantee local joins? Use a reference table instead of sharding the table.
In addition to these questions, see the Detecting and Resolving Data Skew topic for more information.
Q: Why can the SingleStore Spark Connector load data directly into the leaf partitions for keyless sharding only?
A: If there is keyless sharding, data can be placed anywhere in the cluster. With an explicit shard key, each row must be routed to the specific partition determined by hashing its shard key values, so rows cannot be written to arbitrary leaf partitions.
Q: Why does SingleStore use the primary key as a shard key if there is no explicit shard key?
A: It is very common to query tables with a filter on the primary key; in addition, the uniqueness constraint of primary keys guarantees that data will not be skewed.
Q: Why is it generally better to use a shard key with the fewest number of columns, as long as it’s not skewed?
A: We can only optimize queries to run against a single partition when they filter or join on the entire shard key.
Q: Why is it faster to run queries that group by the shard key?
A: When the grouping matches the shard key, queries are faster because the
GROUP BY can be executed locally on each partition.
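As a sketch (illustrative table and column names), a GROUP BY that matches the shard key:

```sql
-- Because the table is sharded on user, all rows for a given user sit
-- on one partition: each partition computes its groups independently,
-- with no cross-partition regrouping required.
CREATE TABLE events (
  user BIGINT,
  ts DATETIME,
  SHARD KEY (user)
);

SELECT user, COUNT(*) FROM events GROUP BY user;
```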
Last modified: April 3, 2023