Understanding Sort Key Selection
Sort Key
The sort key is an index that groups rows of columnstore tables into logical segments, where each segment contains data for many rows. It can be set on a column or multiple columns of columnstore tables to create a sort order of values within those columns. Each group can contain a maximum of 1 million rows in a compressed format on disk. These groups are called segment files or blobs. (Columnstore blobs have no connection to the BLOB
datatype. See BLOB Types for information on the BLOB
datatype.) These terms are used interchangeably. The segment files contain metadata that holds the minimum and maximum values for each group.
To set a sort key, you use SORT KEY()
syntax. If you want to create an unsorted columnstore table, you can specify an empty key using SORT KEY()
. An columnstore sort key cannot be altered once the table has been created. See the following example:
CREATE TABLE t1(col1 int, SORT KEY( ));
EXPLAIN SELECT * FROM t1;
****
+---------------------------------------------------------------------------------------------------------+
| EXPLAIN |
+---------------------------------------------------------------------------------------------------------+
| Gather partitions:all alias:remote_0 |
| Project [t1.col1] |
| ColumnStoreScan test8.t1, KEY __UNORDERED () USING CLUSTERED COLUMNSTORE table_type:sharded_columnstore |
+---------------------------------------------------------------------------------------------------------+
The sort key order can be specified as ascending (SORT KEY(index_column_name)
) or descending (SORT KEY(index_column_name DESC)
). SingleStoreDB does not support scanning a SORT KEY()
in reverse order to its sort order. See the following example:
CREATE TABLE t1(col1 int, SORT KEY(col1 DESC)); EXPLAIN SELECT * FROM t1 ORDER BY col1 DESC; **** +------------------------------------------------------------------------------------------------------------------+ | EXPLAIN | +------------------------------------------------------------------------------------------------------------------+ | GatherMerge [remote_0.col1 DESC] partitions:all alias:remote_0 | | Project [t1.col1] | | OrderedColumnStoreScan test1.t1, KEY col1 (col1 DESC) USING CLUSTERED COLUMNSTORE table_type:sharded_columnstore | +------------------------------------------------------------------------------------------------------------------+ 3 rows in set (0.00 sec)
EXPLAIN SELECT * FROM t1 ORDER BY col1; **** +-----------------------------------------------------------------------------------------------------------+ | EXPLAIN | +-----------------------------------------------------------------------------------------------------------+ | GatherMerge [remote_0.col1] partitions:all alias:remote_0 | | Project [t1.col1] | | Sort [t1.col1] | | ColumnStoreScan test1.t1, KEY col1 (col1 DESC) USING CLUSTERED COLUMNSTORE table_type:sharded_columnstore | +-----------------------------------------------------------------------------------------------------------+ 4 rows in set (0.01 sec)
Note
KEY() USING CLUSTERED COLUMNSTORE
is a legacy syntax that is equivalent to SORT KEY()
. We recommend using SORT KEY()
.
As with any other SingleStoreDB table, we suggest you define a shard key to explicitly control the data distribution within the cluster. In the following example, we define the shard key on id
since sharding on a high cardinality identifier column generally allows for more even distribution and prevents skew.
CREATE TABLE people ( id INT AUTO_INCREMENT, user VARCHAR(24), first VARCHAR(24), last VARCHAR(24), SORT KEY (user), SHARD KEY (id) );
Furthermore, setting the shard key and the sort key on the same column improves data compression. In this case, both keys could be set on the id
column.
Segment Elimination
The single most important consideration for columnstore tables is setting a sort key. When filtering on the sort key, the amount of scanning the engine must perform is decreased. The minimum/maximum value metadata for each segment is used at query execution time to determine whether a segment can match a filter; if not, the segment is skipped entirely and no data is examined. This is called segment elimination because the segment is eliminated from the scan..
Indexing cuts down on the number of rows scanned when a select query with a where clause is executed. If the table has an index for the columns in question, the SingleStoreDB engine can quickly determine the position to seek without having to look at all the data. This is much faster than reading every row sequentially.
The segment size also impacts query execution with respect to the sort key. A smaller segment size means that a small number of rows are read from the segments that pass segment elimination.
Column segments typically contain on the order of tens of thousands of rows. Using the people
table (the segment size is 3 rows for readability):
user
values aa - jl
user
values jm - rl
user
values rm - zz
If a query is searching for a person whose first name starts with an "a", then only the first segment is scanned. The other two segments are eliminated from the scan since those user
values are "jm - zz".
Using Multiple Sort Keys on a Table
Using a table that only stores integers, this is how the data is stored when only one column is set as the sort key.
CREATE TABLE INTEGERS ( A int, B int, SORT KEY (A) );
If only integers 1 - 10 are inserted into both columns, the segmentation would look like this if we review only values 1 - 3 for column A:

The values in column A are segmented in order. Each distinct value of column A is stored on the same location on disk (unless the count of each distinct value exceeds 1 million, then the excess is stored on a different segment) The values in column B are stored randomly with the segmented column A.
If this same INTEGERS
table was created with both columns as sort keys, here's the syntax:
CREATE TABLE INTEGERS ( A int, B int, SORT KEY (A,B) );
Again, if only integers 1 - 10 are inserted into both columns, the segmentation would look like this if we review only values 1 - 3 for column A:

The values in column A are segmented in order. Each distinct value of column A is stored on the same location on disk (unless the count of each distinct value exceeds 1 million, then the excess is stored on a different segment). The values in column B are stored in order with the segmented column A.
Creating multiple sort keys on a table can decrease the amount of scanning needed when querying a table. However, you must understand your data. If you expect low cardinality in your tables one of the sort keys, there could be decreased performance since both columns A and B will need to be scanned.
A shard key can also be set on columnstore tables to further improve performance.