Understanding Shard Key Selection

Shard Key

The shard key is one or more columns of a table that control how that table's rows are distributed. Shard keys are vital in a distributed database like SingleStore because they determine how data is spread across partitions. Shard key columns should be as unique as possible.

To determine the partition responsible for a given row, SingleStore computes a hash from all the columns in the shard key and uses this hash to assign the partition ID. Therefore, rows with the same value for the shard key column will reside on the same partition.

Data Distribution into Partitions

Data is distributed amongst the partitions in the leaf nodes based on the shard key selected when a table is created. Rows with the same shard key value are stored in the same partition, even if they belong to different tables.

For example, the shard key of the table below contains only the first column, which holds first names. All people with the same first name will be stored on the same partition.

CREATE ROWSTORE TABLE people (
    user VARCHAR(24),
    first VARCHAR(24),
    last VARCHAR(24),
    SHARD KEY(first)
);

Inside the node, the hashed values of the first column are spread across all partitions, as evenly as the data allows.

But using a first name column as the shard key can create data skew because some first names are more common than others. Data skew occurs when data is unevenly distributed across partitions.

If first is used as the shard key, the data could be distributed amongst the partitions as follows. Recall that SingleStore assigns a partition by hashing the shard key columns, so rows with the same first value always land on the same partition.

In the first partition, three of the rows have the same shard key value ("alex"), so those rows are assigned to the same partition. The row with the shard key value "tom" simply happened to hash to the first partition as well. The same pattern holds for the other three partitions. The data in the people table is distributed unevenly across partitions, resulting in data skew. If this table had millions of rows, the skew could be even more pronounced.
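One way to see this distribution for yourself is to count rows per partition. A minimal sketch, assuming the built-in PARTITION_ID() function, which returns the partition that stores each row:

-- Count how many rows of people each partition holds.
-- A skewed table shows a few partitions with far higher counts.
SELECT PARTITION_ID() AS partition_id, COUNT(*) AS row_count
FROM people
GROUP BY 1
ORDER BY row_count DESC;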

When data is skewed, some partitions store more data than others, throwing off the balance of the cluster. Partitions holding more data than the average will likely be asked to do more work during SELECT, UPDATE, and DELETE queries. Since a query executes only as fast as its slowest partition, data skew affects overall performance.

Choosing an appropriate shard key for a table is important for minimizing data skew. In the CREATE TABLE example, the user column would be a better choice for the shard key: its values are more evenly distributed, and it is likely to be a filter for queries against the table. However, it is best not to use a varchar column as the shard key at all, both because of the inherent potential for data skew and because varchar data types take longer to process during query execution. Since JOIN operations are generally performed on the shard key, using a varchar data type as the shard key can impact performance. In our example, you would get optimal data distribution by creating an auto-increment id column and using it as the shard key.

CREATE ROWSTORE TABLE people (
    id INT AUTO_INCREMENT PRIMARY KEY,
    user VARCHAR(24),
    first VARCHAR(24),
    last VARCHAR(24),
    SHARD KEY(id)
);

On a rowstore table, if a primary key is specified without a shard key, then the primary key will be used as the shard key. If the table doesn't have a primary key or a shard key, the engine will perform keyless sharding on that table. This means the engine generates a random number using a hash function, then maps the resulting random number to a partition.
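For example, the following table declares a primary key but no SHARD KEY clause, so the engine shards it on id (the table name people_pk is just illustrative):

-- No SHARD KEY clause: the primary key (id) is used as the shard key.
CREATE ROWSTORE TABLE people_pk (
    id INT AUTO_INCREMENT PRIMARY KEY,
    user VARCHAR(24),
    first VARCHAR(24),
    last VARCHAR(24)
);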

Resolving Data Skew

SingleStore does not support the use of ALTER TABLE to add or change shard keys on existing tables. If your data becomes skewed, there are a couple of workarounds to reduce the skew in a table.

One method to reduce skew in a table is to redistribute the data with an INSERT INTO ... SELECT statement that uses the force_random_reshuffle query hint. A common use case for this hint is on keyless sharded tables, as it forces redistribution of data among partitions rather than inserting rows locally into the same partition.

Create a table without a specified shard key:

CREATE ROWSTORE TABLE people (
    id INT,
    user VARCHAR(24),
    first VARCHAR(24),
    last VARCHAR(24)
);

Create another table with a single-column shard key:

CREATE ROWSTORE TABLE people_1 (
    id INT AUTO_INCREMENT,
    user VARCHAR(24),
    first VARCHAR(24),
    last VARCHAR(24),
    SHARD KEY(id)
);

Create another table with a multi-column shard key:

CREATE ROWSTORE TABLE people_2 (
    id INT AUTO_INCREMENT,
    user VARCHAR(24),
    first VARCHAR(24),
    last VARCHAR(24),
    SHARD KEY(id, user)
);

INSERT INTO...SELECT using force_random_reshuffle:

INSERT INTO people_1 (id, user, first, last)
SELECT WITH(force_random_reshuffle=1) *
FROM people;

Run the following query to confirm the data is no longer skewed; the ROWS counts should be roughly equal across partitions:

SELECT
    DATABASE_NAME,
    TABLE_NAME,
    ORDINAL AS PARTITION_ID,
    ROWS,
    MEMORY_USE
FROM INFORMATION_SCHEMA.TABLE_STATISTICS
WHERE TABLE_NAME = 'people_1';

Questions to Ask When Choosing Shard Keys

  • Does this table have a primary key? Make sure there is a shard key and that it is a subset of the primary key.

  • Do you frequently join on a specific set of columns? Try to make the shard key a subset of the joined columns, and define the shard key columns in the same order in each table. For example, if you frequently join an id and user_id column, one of those columns should be the shard key.

  • Do you frequently filter on a specific set of columns? Try to make the shard key a subset of the filtered columns.

    Note

    Filtering on a range of the shard key can negatively affect the performance of your query. Multiple partitions may need to be scanned, thereby slowing the retrieval of your result set.

  • Do you have high concurrency queries? Choose a shard key that allows these queries to be single partition.

  • Is your data skewed for your current choice of shard key? Try to add additional columns to the shard key to ensure even distribution.

  • Do you need to UPDATE or ALTER any fields in the shard key? Remove those fields from the shard key.

  • Is this a small, infrequently changing table that needs to be on every node in the cluster to guarantee local joins? Use a reference table instead of sharding the table, as sketched below.
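Reference tables are replicated in full to every node rather than sharded. A minimal sketch, assuming a small lookup table (the table name and columns are illustrative); note that reference tables require a primary key:

-- Replicated to every node, so joins against it are always local.
CREATE ROWSTORE REFERENCE TABLE countries (
    code CHAR(2) PRIMARY KEY,
    name VARCHAR(64)
);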

Local Joins and Distributed Joins

A local join is a join that can be executed in the partition where the data is currently stored. The tables that are joined must have a shard key that matches the join condition. The best way to accomplish this is to set the shard key on the column that you will be frequently joining on. In our first example of the people table with id as the shard key, say we are joining to an address table within the same database. To ensure your joins execute locally, make sure id is a column and a shard key in the address table.
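A minimal sketch of that setup, assuming the id-sharded people table from above (the address table and its columns are illustrative):

-- address is sharded on id, matching the shard key of people,
-- so a join on id can run locally within each partition.
CREATE ROWSTORE TABLE address (
    id INT,
    street VARCHAR(48),
    city VARCHAR(24),
    SHARD KEY(id)
);

-- Matching rows are guaranteed to be on the same partition:
SELECT p.first, p.last, a.city
FROM people p
JOIN address a ON p.id = a.id;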

A distributed join is a join where matching rows may reside on different partitions because the join condition does not line up with the tables' shard keys. Distributed joins occur in two ways:

1) Repartition:

  • Each partition opens connections to every node.

  • Because the join requires the table to be sharded on the join condition, the query optimizer will create an internal, temporary results table in memory. This table will contain the results set at that stage of the query with a shard key that matches the join condition.

2) Broadcast:

  • Occurs when one of the joined tables is small, so the cost of sending a full copy of that table to every node is low.

  • After the broadcast of the table, the join is performed on the partition where the larger table is stored.
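To see which strategy the optimizer chooses, you can run EXPLAIN on the join and look for repartition or broadcast operations in the plan. A sketch, reusing the tables from the reshuffle walkthrough above; the join is on user, which is not the shard key of either table, so it cannot run as a local join:

-- The plan should show a repartition (or a broadcast if one side is small).
EXPLAIN
SELECT p1.id, p2.id
FROM people_1 p1
JOIN people_2 p2 ON p1.user = p2.user;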

Performance differences between repartition and broadcast operations

Queries that require distributed joins, i.e., queries that involve broadcasts or repartitions, generally require far more connections and threads than other queries.

The concurrency of such queries is automatically managed by Workload Manager. If you have any questions about this functionality, please contact us via the Cloud Portal.

In most scenarios, the default settings for workload management will schedule your workload appropriately to utilize cluster resources without exhausting connection and thread limits.

If you encounter performance problems with distributed join queries run at high concurrency, see the Setting Resource Limits page for information about how to configure resource limit settings.
