Understanding Shard Key Selection
On this page
The shard key is a table column or multiple columns used to control how the rows of that table are distributed.
To determine the partition responsible for a given row, SingleStore computes a hash from all the columns in the shard key and uses this hash to assign the partition ID.
Data is distributed amongst the partitions in the leaf nodes based on the shard key selected when a table is created.
For example, the table below has the shard key that contains only the
CREATE ROWSTORE TABLE people (user VARCHAR(24),first VARCHAR(24),last VARCHAR(24),SHARD KEY(first));
Inside the node, values within the
first column attempt to distribute evenly across all partitions.
But using a first name column as the shard key can create data skew because some first names are more common than others.
first is used as the shard key, the data could be distributed amongst the partitions as follows.
In the first partition, three of the rows have the same shard key value ("alex") so the rows would be assigned to the same partition.
people table is distributed unevenly across partitions resulting in data skew.
When data is skewed, some partitions will store more data than others, throwing off the balance of the cluster.
Choosing an appropriate shard key for a table is important for minimizing data skew.
CREATE TABLE example, the
user column would be a better choice for the shard key, in terms of even data distribution.
JOIN operations are generally performed on the shard key, using a varchar datatype as the shard key can impact performance.
id column as the shard key.
CREATE ROWSTORE TABLE people (id INT AUTO_INCREMENT PRIMARY KEY,user VARCHAR(24),first VARCHAR(24),last VARCHAR(24),SHARD KEY(id));
On a rowstore table, if a primary key is specified without a shard key, then the primary key will be used as the shard key.
SingleStore does not support the use of
ALTER TABLE to add or change shard keys on existing tables.
One method to reduce skew in a table is by using the
INSERT INTO… SELECT to redistribute the data with the
force_ query hint.
Create a table without a specified shard key:
CREATE ROWSTORE TABLE people (id INT,user VARCHAR(24),first VARCHAR(24),last VARCHAR(24));
Create another table with one shard key:
CREATE ROWSTORE TABLE people_1 (id INT AUTO_INCREMENT,user VARCHAR(24),first VARCHAR(24),last VARCHAR(24),SHARD KEY(id));
Create another table with multiple shard keys:
CREATE ROWSTORE TABLE people (id INT AUTO_INCREMENT,user VARCHAR(24),first VARCHAR(24),last VARCHAR(24),SHARD KEY(id,user));
INSERT INTO people_1 (id, user, first, last)SELECT WITH(force_random_reshuffle=1) *FROM people;
Run the following query to confirm the data is no longer skewed:
SELECTDATABASE_NAME,TABLE_NAME,ORDINAL AS PARTITION_ID,ROWS,MEMORY_USEFROM INFORMATION_SCHEMA.TABLE_STATISTICSWHERE TABLE_NAME = 'people_1';
Does this table have a primary key? Make sure there is a shard key and that it is a subset of the primary key.
Do you frequently join on a specific set of columns? Try to make the shard key a subset of the joined columns.
Shard order should be the same between tables to accomplish this. For example, if you frequently join an
user_column, one of those columns should be the shard key.
Do you frequently filter on a specific set of columns? Try to make the shard key a subset of the filtered columns.
Filtering on a range of the shard key can negatively affect the performance of your query.
Multiple partitions may need to be scanned, thereby slowing the retrieval of your result set.
Do you have high concurrency queries? Choose a shard key that allows these queries to be single partition.
Is your data skewed for your current choice of shard key? Try to add additional columns to the shard key to ensure even distribution.
Do you need to
ALTERany fields in the shard key? Remove those fields from the shard key.
Is this a small, infrequently changing table that needs to be on every node in the cluster to guarantee local joins? Use a reference table instead of sharding the table.
A local join is a join that can be executed in the partition where the data is currently stored.
people table with
id as the shard key, say we are joining to an
address table within the same database.
id is a column and a shard key in the
A distributed join is a join where the shard key that matches the join condition is stored on different partitions.
Each partition opens connections to every node.
Because the join requires the table to be sharded on the join condition, the query optimizer will create an internal, temporary results table in memory.
This table will contain the results set at that stage of the query with a shard key that matches the join condition.
Occurs when one of the joining tables is small.
Therefore, no cost is incurred by sending a copy of the table to each node.
After the broadcast of the table, the join is performed on the partition where the larger table is stored.
Queries that require distributed joins, i.
In most scenarios, the default settings for workload management will schedule your workload appropriately to utilize cluster resources without exhausting connection and thread limits.
If you encounter performance problems with distributed join queries run at high concurrency, see the Setting Resource Limits page for information about how to configure resource limit settings.
Last modified: June 29, 2023