SingleStore DB

Sharding

Every distributed table has exactly one shard key. A shard key can contain any number of columns. When you run CREATE TABLE to create a table, you can specify a shard key for the table. If you don’t specify a shard key:

  • If the table has a primary key, then the shard key will be whatever the primary key is.

  • If the table has neither a shard key nor a primary key, then the engine will perform keyless sharding for that table, which means the engine randomly distributes rows of the table across database partitions.

A table’s shard key determines in which partition a given row in the table is stored. When you run anINSERT query, the aggregator node computes a hash function of the values in the column or columns that make up the shard key, which produces the partition number where the row should be stored. The aggregator node then directs the INSERT operation to the appropriate leaf node machine and partition.

sharding.png

DML queries that fully match the shard key can be routed directly to a single partition on a single leaf node server. Group-by queries where the set of keys are guaranteed to not overlap between partitions can be executed in parallel on the leaf node, with the results streamed back without any additional processing on the aggregator node.

Any two rows with the same shard key value are guaranteed to be on the same partition.

SingleStore requires that the set of columns in the PRIMARY or UNIQUE index needs to be a superset of the shard key columns, meaning that the primary / unique key contains all columns in the shard key. Rows that might have a primary key conflict thus map to the same partition so uniqueness can be enforced locally.

Choosing an appropriate shard key is important for minimizing data skew.

Types of Shard Keys
Default Shard Key

If no shard key is specified, the primary key is used as the shard key. If no primary key or shard key is specified for a given table, the shard key defaults to being keylessly sharded. To accomplish this, you can either specify no keys, or use shard key() with nothing between the parenthesis as follows.

CREATE TABLE t1(a INT, b INT);
CREATE TABLE t1(a INT, b INT, SHARD KEY());

In most cases, keyless sharding will result in uniform distribution of rows across partitions. The exception to this is INSERT … SELECT statement involving sharded tables, in which case, data is inserted locally into the same partition as the source in order to avoid data redistribution.

Primary Key as the Shard Key

If you create a table with a primary key and no explicit shard key, the PK will be used as the shard key by default.

    CREATE TABLE clicks (
        click_id BIGINT AUTO_INCREMENT PRIMARY KEY,
        user_id INT,
        page_id INT,
        ts TIMESTAMP
    );
Non-Unique Shard Key

The explicit syntax for a shard key is SHARD KEY (column_list), where column_list is zero or more columns separated by commas. For example:

    CREATE TABLE clicks (
        click_id BIGINT AUTO_INCREMENT,
        user_id INT,
        page_id INT,
        ts TIMESTAMP,
        SHARD KEY (user_id),
        PRIMARY KEY (click_id, user_id)
    );

In this example, any two clicks by the same user will be guaranteed to be on the same partition. You can take advantage of this property in query execution for efficient COUNT(DISTINCT user_id) queries, which knows that any two equal user_id values will never be on different partitions.

Note that even though click_id will be unique, we still have to include user_id in the primary key.

Choosing a Shard Key

See the Optimizing Table Data Structures guide for information on how to choose a shard key.

Data Skew

Data skew occurs when data is unevenly distributed across partitions.

It's important to choose a shard key that minimizes data skew in order to get the fastest query performance.

Refer to Detecting and Resolving Data Skew for more information.