Understanding Shard Key Selection
On this page
Shard Key
The shard key is a table column or multiple columns used to control how the rows of that table are distributed.
To determine the partition responsible for a given row, SingleStore computes a hash from all the columns in the shard key and uses this hash to assign the partition ID.
Data Distribution into Partitions
Data is distributed amongst the partitions in the leaf nodes based on the shard key selected when a table is created.
For example, the table below has the shard key that contains only the first
column.
CREATE ROWSTORE TABLE people (user VARCHAR(24),first VARCHAR(24),last VARCHAR(24),SHARD KEY(first));
Inside the node, values within the first
column attempt to distribute evenly across all partitions.
But using a first name column as the shard key can create data skew because some first names are more common than others.
If first
is used as the shard key, the data could be distributed amongst the partitions as follows.
In the first partition, three of the rows have the same shard key value ("alex") so the rows would be assigned to the same partition.people
table is distributed unevenly across partitions resulting in data skew.
When data is skewed, some partitions will store more data than others, throwing off the balance of the cluster.SELECT
, UPDATE
, and DELETE
queries.
Choosing an appropriate shard key for a table is important for minimizing data skew.CREATE TABLE
example, the user
column would be a better choice for the shard key, in terms of even data distribution.JOIN
operations are generally performed on the shard key, using a varchar data type as the shard key can impact performance.id
column as the shard key.
CREATE ROWSTORE TABLE people (id INT AUTO_INCREMENT PRIMARY KEY,user VARCHAR(24),first VARCHAR(24),last VARCHAR(24),SHARD KEY(id));
On a rowstore table, if a primary key is specified without a shard key, then the primary key will be used as the shard key.
Resolving Data Skew
SingleStore does not support the use of ALTER TABLE
to add or change shard keys on existing tables.
One method to reduce skew in a table is by using the INSERT INTO… SELECT
to redistribute the data with the force_
query hint.
Create a table without a specified shard key:
CREATE ROWSTORE TABLE people (id INT,user VARCHAR(24),first VARCHAR(24),last VARCHAR(24));
Create another table with one shard key:
CREATE ROWSTORE TABLE people_1 (id INT AUTO_INCREMENT,user VARCHAR(24),first VARCHAR(24),last VARCHAR(24),SHARD KEY(id));
Create another table with multiple shard keys:
CREATE ROWSTORE TABLE people (id INT AUTO_INCREMENT,user VARCHAR(24),first VARCHAR(24),last VARCHAR(24),SHARD KEY(id,user));
INSERT INTO.
INSERT INTO people_1 (id, user, first, last)SELECT WITH(force_random_reshuffle=1) *FROM people;
Run the following query to confirm the data is no longer skewed:
SELECTDATABASE_NAME,TABLE_NAME,ORDINAL AS PARTITION_ID,ROWS,MEMORY_USEFROM INFORMATION_SCHEMA.TABLE_STATISTICSWHERE TABLE_NAME = 'people_1';
Questions to Ask When Choosing Shard Keys
-
Does this table have a primary key? Make sure there is a shard key and that it is a subset of the primary key.
-
Do you frequently join on a specific set of columns? Try to make the shard key a subset of the joined columns.
Shard order should be the same between tables to accomplish this. For example, if you frequently join an id
anduser_
column, one of those columns should be the shard key.id -
Do you frequently filter on a specific set of columns? Try to make the shard key a subset of the filtered columns.
Note
Filtering on a range of the shard key can negatively affect the performance of your query.
Multiple partitions may need to be scanned, thereby slowing the retrieval of your result set. -
Do you have high concurrency queries? Choose a shard key that allows these queries to be single partition.
-
Is your data skewed for your current choice of shard key? Try to add additional columns to the shard key to ensure even distribution.
-
Do you need to
UPDATE
orALTER
any fields in the shard key? Remove those fields from the shard key. -
Is this a small, infrequently changing table that needs to be on every node in the cluster to guarantee local joins? Use a reference table instead of sharding the table.
Local Joins and Distributed Joins
A local join is a join that can be executed in the partition where the data is currently stored.people
table with id
as the shard key, say we are joining to an address
table within the same database.id
is a column and a shard key in the address
table.
A distributed join is a join where the shard key that matches the join condition is stored on different partitions.
1) Repartition:
-
Each partition opens connections to every node.
-
Because the join requires the table to be sharded on the join condition, the query optimizer will create an internal, temporary results table in memory.
This table will contain the results set at that stage of the query with a shard key that matches the join condition.
2) Broadcast:
-
Occurs when one of the joining tables is small.
Therefore, no cost is incurred by sending a copy of the table to each node. -
After the broadcast of the table, the join is performed on the partition where the larger table is stored.
Performance differences between repartition and broadcast operations
Queries that require distributed joins, i.
The concurrency of such queries is automatically managed by Workload Manager.
In most scenarios, the default settings for workload management will schedule your workload appropriately to utilize cluster resources without exhausting connection and thread limits.
If you encounter performance problems with distributed join queries run at high concurrency, see the Setting Resource Limits page for information about how to configure resource limit settings.
Last modified: March 8, 2024