Understanding Sort Key Selection

Sort Key

The sort key is an index that groups rows of columnstore tables into logical segments, where each segment contains data for many rows. The sort key can be defined on a single column or multiple columns of a columnstore table.

To specify a sort key, use SORT KEY. The sort key order can be specified as ascending (SORT KEY(<index_column_name>)) or descending (SORT KEY(<index_column_name> DESC)). A columnstore sort key cannot be altered once the table has been created. To create an unsorted columnstore table, specify an empty sort key using SORT KEY().

A table scan using a sort key scans the table in the order of the sort key. A sort key cannot be used to scan a table in the reverse order of the sort key order.

The following example creates a table, t1, with a sort key on col1 in descending order. In the first query in the example below, the sort order in the query (DESC) matches the sort order of the SORT KEY, and the table is scanned using the SORT KEY. In the second query, the sort order in the query (ASC) does not match the sort order of the SORT KEY, and the SORT KEY is not used in the table scan.

SQL

CREATE TABLE t1(col1 int, SORT KEY(col1 DESC));

EXPLAIN SELECT * FROM t1 ORDER BY col1 DESC;

+------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                          |
+------------------------------------------------------------------------------------------------------------------+
| GatherMerge [remote_0.col1 DESC] partitions:all alias:remote_0                                                   |
| Project [t1.col1]                                                                                                |
| OrderedColumnStoreScan test1.t1, KEY col1 (col1 DESC) USING CLUSTERED COLUMNSTORE table_type:sharded_columnstore |
+------------------------------------------------------------------------------------------------------------------+
3 rows in set (0.00 sec)

SQL

EXPLAIN SELECT * FROM t1 ORDER BY col1;

+-----------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                   |
+-----------------------------------------------------------------------------------------------------------+
| GatherMerge [remote_0.col1] partitions:all alias:remote_0                                                 |
| Project [t1.col1]                                                                                         |
| Sort [t1.col1]                                                                                            |
| ColumnStoreScan test1.t1, KEY col1 (col1 DESC) USING CLUSTERED COLUMNSTORE table_type:sharded_columnstore |
+-----------------------------------------------------------------------------------------------------------+
4 rows in set (0.01 sec)

The following example creates a table with an empty sort key.

SQL

CREATE TABLE t1(col1 int, SORT KEY( ));

EXPLAIN SELECT * FROM t1;

+---------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                 |
+---------------------------------------------------------------------------------------------------------+
| Gather partitions:all alias:remote_0                                                                    |
| Project [t1.col1]                                                                                       |
| ColumnStoreScan test8.t1, KEY __UNORDERED () USING CLUSTERED COLUMNSTORE table_type:sharded_columnstore |
+---------------------------------------------------------------------------------------------------------+

Note

KEY() USING CLUSTERED COLUMNSTORE is a legacy syntax that is equivalent to SORT KEY(). SingleStore recommends using SORT KEY().

SingleStore recommends defining a shard key to explicitly control the data distribution. The following example defines a shard key on id. Sharding on a high cardinality identifier column generally allows for more even distribution and prevents skew. Refer to Understanding Shard Key Selection for more information.

SQL

CREATE TABLE people (
    id INT AUTO_INCREMENT,               
    user VARCHAR(24),  
    first VARCHAR(24),  
    last VARCHAR(24),
    SORT KEY (user),   
    SHARD KEY (id)
);

Setting the shard key and the sort key on the same column improves data compression.

Refer to Managing Columnstore Segments for additional information about columnstore segments.

Segment Elimination

The single most important consideration for columnstore tables is setting a sort key. Each segment file contains metadata that holds the minimum and maximum values for the rows in the segment. The minimum/maximum value metadata is used at query execution time to determine whether a segment can possibly match a filter; if not, the segment is skipped entirely and no data is examined. This functionality is called segment elimination because the segment is eliminated from the scan.

For queries that filter on the sort key column of the columnstore index, segment elimination is typically very efficient as row segments within each row segment group will not have overlapping segments.

The segment size (number of rows per segment) also impacts query execution with respect to the sort key. A smaller segment size means that a smaller number of rows are read from the segments that pass segment elimination. Column segments typically contain on the order of tens of thousands of rows.

Indexing further cuts down on the number of rows scanned when a select query with a WHERE clause is executed. If the table has an index for the columns in the WHERE clause, the engine can quickly determine the location of the matching rows position to seek without having to look at all the data. This is much faster than reading every row sequentially.

The figures below show three segments of the people table. The segment size is 3 rows for readability.

If a query is searching for a person whose first name starts with an "a", then only the first segment is scanned. The other two segments are eliminated from the scan since those user values are "jm - zz".

Ordered Scans

Ordered scan over a sort key is faster than sorting a table. A query that sorts with a sort key can benefit from running an ordered columnstore scan instead of performing a normal columnstore scan.

Sub-segment Elimination

Sub-segment elimination improves performance by eliminating data at the sub-segment level during query processing and thus extends the performance improvements provided by segment elimination.

Segments consist of blocks of 4096 rows each. With sub-segment elimination, statistics are collected for blocks and are used during query processing to eliminate blocks. Sub-segment elimination is restricted to sort key columns.

A segment with the default size (columnstore_segment_rows = 1,024,000) would have 1,024,000 / 4,096 = 250 sub-segments. Queries with highly selective predicates using flexible parallelism are expected to see performance improvements due to sub-segment elimination.

Consider a table and a query as follows:

SQL

CREATE TABLE sales(id INT, dt DATETIME, itemnum INT, qty FLOAT, price numeric(18,4), SORT KEY(dt));

SELECT itemnum, SUM(qty*price)
FROM sales 
WHERE dt BETWEEN "2024-09-20 17:00:00" AND "2024-09-20 17:01:00"
GROUP BY all;

This query has a very narrow time range (a highly selective predicate) on the dt column, and the dt column is the sort key. Sub-segment elimination will be applied to this query and is expected to improve the performance of this query.

The following figure shows how sub-segment elimination can greatly reduce the number of rows read and significantly improve performance.

Segments that were created before sub-segment elimination was introduced, or which were created when the enable_block_stats_use_in_query engine variable was set to OFF, will not have block-level statistics and will not benefit from sub-segment elimination. Such segments will need to be rebuilt to benefit from sub-segment elimination.

The engine variable enable_block_stats_use_in_query is set to ON by default.

Multi-Column Sort Keys

Sort keys can be defined on multiple columns, which can improve performance in certain cases.

Storage

The following examples show how a table is stored with a single-column sort key followed by how the table is stored with a two-column sort key.

The data in a partition is broken into segments based on the sort key order and the segment size. (Refer to Configuring the Columnstore to Work Effectively for more information.)

Create a table named INTEGERS with a single-column sort key.

SQL

CREATE TABLE INTEGERS (
    A int,
    B int,
    SORT KEY (A)
);

The data in the INTEGERS table will be sorted on column A, and then broken into segments, so that segments contain contiguous values of column A.

The figure below shows how the data is stored with the single-column sort key. If the integers 1 - 10 are inserted into columns A and B, and the segment size is 8, the first three columnstore segments would look as follows. The segmentation is defined by the sort key order.

For a multi-column sort key, the data in a partition is also broken into segments based on the sort key order and the segment size.

The command below creates a table named INTEGERS with a multi-column sort key on columns A and B.

SQL

CREATE TABLE INTEGERS (
    A int,
    B int,
    SORT KEY (A,B)
);

The data in this table will be sorted on column A, and secondarily on column B, and then broken into segments so each segment contains a contiguous segment of data in that (multi-column) sort order.

The figure below shows how the data is stored with the multi-column sort key. If the integers 1 - 10 are inserted into columns A and B, and the segment size is 10, the first three columnstore segments would look as follows. The segmentation is defined by the sort key order - sorting the data first on the values in A, and then secondarily on the values in column B.

Example

Creating multi-column sort keys on a table can improve query performance. For example, queries on a table with date and region columns would be expected to benefit from a sort key defined on both columns (SORT KEY (date, region)). This sort key definition clusters the table on date and region columns, so that segments contain values with similar date and region values. This in turn makes encoding more effective so the data is better compressed. For queries on this table that filter on date and region columns, the engine eliminates segments and sub-segments that do not meet the filter criteria, reducing the amount of scanning needed for the query. Efficient encoding also helps reduce I/O during the scan.

Questions to Ask When Choosing Sort Keys

Is the data always filtered by some column (e.g. insert timestamp or event type)? Ensure that the common columns for all queries are in the sort key to improve segment elimination.
Is the data generally inserted in order by some column (e.g. insert timestamp)? It’s best to put that column first in the sort key to minimize the amount of work required by the background columnstore segment merger.
Does one column in your key have a higher cardinality than the other? It’s best to put the lowest cardinality columns first to increase the likelihood that segment elimination will be able to affect later columns.

Understanding Sort Key Selection

On this page

Sort Key

Segment Elimination

Ordered Scans

Sub-segment Elimination

Multi-Column Sort Keys

Storage

Example

Questions to Ask When Choosing Sort Keys

Was this article helpful?

On this page

Was this article helpful?

Understanding Sort Key Selection

On this page

Sort Key

Segment Elimination

Ordered Scans

Sub-segment Elimination

Multi-Column Sort Keys

Storage

Example

Questions to Ask When Choosing Sort Keys

Related Topics

Was this article helpful?

On this page

Was this article helpful?