Understanding Sort Key Selection

Sort Key

The sort key is an index that groups rows of columnstore tables into logical segments, where each segment contains data for many rows. The sort key can be defined on a single column or multiple columns of a columnstore table.

To specify a sort key, use SORT KEY. The sort key order can be specified as ascending (SORT KEY(<index_column_name>)) or descending (SORT KEY(<index_column_name> DESC)). A columnstore sort key cannot be altered once the table has been created. To create an unsorted columnstore table, specify an empty sort key using SORT KEY().

A table scan using a sort key scans the table in the order of the sort key. A sort key cannot be used to scan a table in the reverse order of the sort key order.

The following example creates a table, t1, with a sort key on col1 in descending order. In the first query in the example below, the sort order in the query (DESC) matches the sort order of the SORT KEY, and the table is scanned using the SORT KEY. In the second query, the sort order in the query (ASC) does not match the sort order of the SORT KEY, and the SORT KEY is not used in the table scan.

CREATE TABLE t1(col1 int, SORT KEY(col1 DESC));
EXPLAIN SELECT * FROM t1 ORDER BY col1 DESC;
+------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                          |
+------------------------------------------------------------------------------------------------------------------+
| GatherMerge [remote_0.col1 DESC] partitions:all alias:remote_0                                                   |
| Project [t1.col1]                                                                                                |
| OrderedColumnStoreScan test1.t1, KEY col1 (col1 DESC) USING CLUSTERED COLUMNSTORE table_type:sharded_columnstore |
+------------------------------------------------------------------------------------------------------------------+
3 rows in set (0.00 sec)
EXPLAIN SELECT * FROM t1 ORDER BY col1;
+-----------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                   |
+-----------------------------------------------------------------------------------------------------------+
| GatherMerge [remote_0.col1] partitions:all alias:remote_0                                                 |
| Project [t1.col1]                                                                                         |
| Sort [t1.col1]                                                                                            |
| ColumnStoreScan test1.t1, KEY col1 (col1 DESC) USING CLUSTERED COLUMNSTORE table_type:sharded_columnstore |
+-----------------------------------------------------------------------------------------------------------+
4 rows in set (0.01 sec)

The following example creates a table with an empty sort key.

CREATE TABLE t1(col1 int, SORT KEY( ));
EXPLAIN SELECT * FROM t1;
+---------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                 |
+---------------------------------------------------------------------------------------------------------+
| Gather partitions:all alias:remote_0                                                                    |
| Project [t1.col1]                                                                                       |
| ColumnStoreScan test8.t1, KEY __UNORDERED () USING CLUSTERED COLUMNSTORE table_type:sharded_columnstore |
+---------------------------------------------------------------------------------------------------------+

Note

KEY() USING CLUSTERED COLUMNSTORE is a legacy syntax that is equivalent to SORT KEY(). SingleStore recommends using SORT KEY().

SingleStore recommends defining a shard key to explicitly control the data distribution. The following example defines a shard key on id. Sharding on a high cardinality identifier column generally allows for more even distribution and prevents skew. Refer to Understanding Shard Key Selection for more information.

CREATE TABLE people (
id INT AUTO_INCREMENT,
user VARCHAR(24),
first VARCHAR(24),
last VARCHAR(24),
SORT KEY (user),
SHARD KEY (id)
);

Setting the shard key and the sort key on the same column improves data compression.

Refer to Managing Columnstore Segments for additional information about columnstore segments.

Segment Elimination

The single most important consideration for columnstore tables is setting a sort key. Each segment file contains metadata that holds the minimum and maximum values for the rows in the segment. The minimum/maximum value metadata is used at query execution time to determine whether a segment can possibly match a filter; if not, the segment is skipped entirely and no data is examined. This functionality is called segment elimination because the segment is eliminated from the scan.

For queries that filter on the sort key column of the columnstore index, segment elimination is typically very efficient as row segments within each row segment group will not have overlapping segments.

The segment size (number of rows per segment) also impacts query execution with respect to the sort key. A smaller segment size means that a smaller number of rows are read from the segments that pass segment elimination. Column segments typically contain on the order of tens of thousands of rows.

Indexing further cuts down on the number of rows scanned when a select query with a WHERE clause is executed. If the table has an index for the columns in the WHERE clause, the engine can quickly determine the location of the matching rows position to seek without having to look at all the data. This is much faster than reading every row sequentially.

The figures below show three segments of the people table. The segment size is 3 rows for readability.

Segment #1 of 3 - user values aa - jl

Segment #2 or 3 - user values jm - rl

Segment #3 of 3 - user values rm - zz

If a query is searching for a person whose first name starts with an "a", then only the first segment is scanned. The other two segments are eliminated from the scan since those user values are "jm - zz".

Ordered Scans

Ordered scan over a sort key is faster than sorting a table. A query that sorts with a sort key can benefit from running an ordered columnstore scan instead of performing a normal columnstore scan.

Multi-Column Sort Keys

Sort keys can be defined on multiple columns, which can improve performance in certain cases.

Storage

The following examples show how a table is stored with a single-column sort key followed by how the table is stored with a two-column sort key.

The data in a partition is broken into segments based on the sort key order and the segment size. (Refer to Configuring the Columnstore to Work Effectively for more information.)

Create a table named INTEGERS with a single-column sort key.

CREATE TABLE INTEGERS (
A int,
B int,
SORT KEY (A)
);

The data in the INTEGERS table will be sorted on column A, and then broken into segments, so that segments contain contiguous values of column A.

The figure below shows how the data is stored with the single-column sort key. If the integers 1 - 10 are inserted into columns A and B, and the segment size is 8, the first three columnstore segments would look as follows. The segmentation is defined by the sort key order.

For a multi-column sort key, the data in a partition is also broken into segments based on the sort key order and the segment size.

The command below creates a table named INTEGERS with a multi-column sort key on columns A and B.

CREATE TABLE INTEGERS (
A int,
B int,
SORT KEY (A,B)
);

The data in this table will be sorted on column A, and secondarily on column B, and then broken into segments so each segment contains a contiguous segment of data in that (multi-column) sort order.

The figure below shows how the data is stored with the multi-column sort key. If the integers 1 - 10 are inserted into columns A and B, and the segment size is 10, the first three columnstore segments would look as follows. The segmentation is defined by the sort key order - sorting the data first on the values in A, and then secondarily on the values in column B.

Example

Creating multi-column sort keys on a table can improve query performance. For example, queries on a table with date and region columns would be expected to benefit from a sort key defined on both columns (SORT KEY (date, region)). This sort key definition clusters the table on date and region columns, so that segments contain values with similar date and region values. This in turn makes encoding more effective so the data is better compressed. For queries on this table that filter on date and region columns, the engine eliminates segments and sub-segments that do not meet the filter criteria, reducing the amount of scanning needed for the query. Efficient encoding also helps reduce I/O during the scan.

Questions to Ask When Choosing Sort Keys

  • Is the data always filtered by some column (e.g. insert timestamp or event type)? Ensure that the common columns for all queries are in the sort key to improve segment elimination. 

  • Is the data generally inserted in order by some column (e.g. insert timestamp)? It’s best to put that column first in the sort key to minimize the amount of work required by the background columnstore segment merger.

  • Does one column in your key have a higher cardinality than the other? It’s best to put the lowest cardinality columns first to increase the likelihood that segment elimination will be able to affect later columns.

Last modified: January 28, 2025

Was this article helpful?

Verification instructions

Note: You must install cosign to verify the authenticity of the SingleStore file.

Use the following steps to verify the authenticity of singlestoredb-server, singlestoredb-toolbox, singlestoredb-studio, and singlestore-client SingleStore files that have been downloaded.

You may perform the following steps on any computer that can run cosign, such as the main deployment host of the cluster.

  1. (Optional) Run the following command to view the associated signature files.

    curl undefined
  2. Download the signature file from the SingleStore release server.

    • Option 1: Click the Download Signature button next to the SingleStore file.

    • Option 2: Copy and paste the following URL into the address bar of your browser and save the signature file.

    • Option 3: Run the following command to download the signature file.

      curl -O undefined
  3. After the signature file has been downloaded, run the following command to verify the authenticity of the SingleStore file.

    echo -n undefined |
    cosign verify-blob --certificate-oidc-issuer https://oidc.eks.us-east-1.amazonaws.com/id/CCDCDBA1379A5596AB5B2E46DCA385BC \
    --certificate-identity https://kubernetes.io/namespaces/freya-production/serviceaccounts/job-worker \
    --bundle undefined \
    --new-bundle-format -
    Verified OK