SingleStore Managed Service

Understanding Clustered Columnstore Key Selection
Clustered Columnstore Key

The clustered columnstore key (also referred to as the columnstore sort key) is an index that groups rows of columnstore tables into logical segments, where each segment contains data for many rows. It can be set on a column or multiple columns of columnstore tables to create a sort order of values within those columns. Each group can contain a maximum of 1 million rows in a compressed format on disk. These groups are called segment files or blobs. These terms are used interchangeably. The segment files contain metadata that holds the minimum and maximum values for each group.

To set a clustered columnstore key, you can use KEY() USING CLUSTERED COLUMNSTORE or SORT KEY() syntax. Both designations are equivalent. If you do not want to define a key (e.g. if records are already inserted in the right order or you do not care about the sort order), you can specify an empty key using KEY() USING CLUSTERED COLUMNSTORE or SORT KEY().

CREATE TABLE people (
    id INT AUTO_INCREMENT,               
    user VARCHAR(24),  
    first VARCHAR(24),  
    last VARCHAR(24),
    KEY (user) USING CLUSTERED COLUMNSTORE,   
    SHARD KEY (id)
);


CREATE TABLE people (
    id INT AUTO_INCREMENT,               
    user VARCHAR(24),  
    first VARCHAR(24),  
    last VARCHAR(24),
    SORT KEY (user),   
    SHARD KEY (id)
);

As with any other SingleStore table, we suggest you define a shard key to explicitly control the data distribution within the cluster. In this case, we define it on user (which is a concatenation of first name initial and last name) since sharding on a high cardinality identifier column generally allows for more even distribution and prevents skew.

Segment Elimination

The single most important consideration for columnstore tables is setting a clustered columnstore key(s). When filtering on the clustered columnstore key, the amount of scanning the engine must perform is decreased. The minimum/maximum value metadata for each segment is used at query execution time to determine whether a segment can match a filter; if not, the segment is skipped entirely and no data is examined. This is called segment elimination because the segment is eliminated from the scan..

Indexing cuts down on the number of rows scanned when a select query with a where clause is executed. If the table has an index for the columns in question, the SingleStore engine can quickly determine the position to seek without having to look at all the data. This is much faster than reading every row sequentially.

The segment size also impacts query execution with respect to the clustered columnstore key. A smaller segment size means that a small number of rows are read from the segments that pass segment elimination.

Column segments typically contain on the order of tens of thousands of rows. Using the people table (the segment size is 3 rows for readability):

Figure 4. Segment #1 of 3 - user values aa - jl
Segment #1 of 3 - user values aa - jl


Figure 5. Segment #2 or 3 - user values jm - rl
Segment #2 or 3 - user values jm - rl


Figure 6. Segment #3 of 3 - user values rm - zz
Segment #3 of 3 - user values rm - zz


If a query is searching for a person whose first name starts with an "a", then only the first segment is scanned. The other two segments are eliminated from the scan since those user values are "jm - zz".

Using Multiple Clustered Columnstore Keys on a Table

Using a table that only stores integers, this is how the data is stored when only one column is set as the clustered columnstore key. Remember, you can use either the KEY()USING CLUSTERED COLUMNSTORE or SORT KEY () syntax.

CREATE TABLE INTEGERS (
    A int,
    B int,
    KEY (A) USING CLUSTERED COLUMNSTORE
);


CREATE TABLE INTEGERS (
    A int,
    B int,
    SORT KEY (A)
);

If only integers 1 - 10 are inserted into both columns, the segmentation would look like this if we review only values 1 - 3 for column A:

Clustered_Columnstore_Key__A_.png

The values in column A are segmented in order.  Each distinct value of column A is stored on the same location on disk (unless the count of each distinct value exceeds 1 million, then the excess is stored on a different segment)  The values in column B are stored randomly with the segmented column A.

If this same INTEGERS table was created with both columns as clustered columnstore keys, here's the syntax:

CREATE TABLE INTEGERS (
    A int,
    B int,
    KEY (A,B) USING CLUSTERED COLUMNSTORE
);


CREATE TABLE INTEGERS (
    A int,
    B int,
    SORT KEY (A,B)
);

Again, if only integers 1 - 10 are inserted into both columns, the segmentation would look like this if we review only values 1 - 3 for column A:

Clustered_Columnstore_Key__A_B_.png

The values in column A are segmented in order. Each distinct value of column A is stored on the same location on disk (unless the count of each distinct value exceeds 1 million, then the excess is stored on a different segment). The values in column B are stored in order with the segmented column A.

Creating multiple clustered columnstore keys on a table can decrease the amount of scanning needed when querying a table. However, you must understand your data. If you expect low cardinality in your tables one of the clustered columnstore keys, there could be decreased performance since both columns A and B will need to be scanned.

A shard key can also be set on columnstore tables to further improve performance.

Questions to Ask When Choosing Clustered Columnstore Keys
Managing Clustered Columnstore Segments