Choosing a Shard Key

There are two major factors to take into account when defining a shard key:

  • Using a column or set of columns unique enough to minimize skew.

  • Sharding on columns which you expect to filter or join on often. This allows the optimizer to minimize network traffic during the execution of the query (see Distributed DML Query Execution for more details).

Let’s say we have the following table:

CREATE TABLE events(event_id INT, user_id INT, time TIMESTAMP);

If we aren’t sure yet what our workload will look like, just using the first create table statement without specifying a shard key will shard rows uniformly. This will allow you to get started. It will ensure that there is no skew, but may not yield optimal query performance.

Let’s say we have this query which we would like to be extremely performant:

FROM events AS first JOIN events AS second
WHERE first.user_id = second.user_id
AND TIMESTAMPDIFF(MINUTE, first.time, second.time) < 20;

In this case, we’ll want to modify our table schema to shard on user_id. When we run our query, we’ll be able to take advantage of locality when performing the join and when computing the distinct aggregate value: The advantage with the join is that it will happen on the leaves without having to repartition or move data between leaves.

CREATE TABLE events(event_id INT, user_id INT, time TIMESTAMP, SHARD KEY(user_id));

Last modified: January 17, 2023

Was this article helpful?