Choosing a Columnstore Key

Columnstore tables have exactly one index, the sort key. Using this sort key, columnstores group the rows into logical segments, where each segment contains data for many rows. The sort key controls the ordering of data in the blobs that compose the segment’s data. Data within a segment is stored on disk into segment files containing the same field for many rows. This enables two important capabilities. One is to scan each column individually; in essence, being able to scan only the columns that are needed to execute a query with a high degree of locality. The other capability is that columnstores lend themselves well to compression; for example, repeating and similar values can be easily compressed together.

In addition, SingleStore stores metadata for each row segment in memory, which includes the minimum and maximum values for each column contained within the segment. This metadata is used at query execution time to determine whether a segment can possibly match a filter, a process known as segment elimination.

For example, we will use this columnstore table:

CREATE TABLE products (
ProductId INT,
Color VARCHAR(10),
Price INT,
Quantity INT,
SORT KEY (Price)
);

The following table represents a logical arrangement of the data in a single partition of the database:

ProductId

Color

Price

Quantity

1

Red

10

2

2

Red

20

2

3

Black

20

2

4

White

30

2

5

Red

20

2

6

Black

10

2

7

White

25

2

8

Red

30

2

9

Black

50

2

10

White

15

2

11

Red

5

2

12

Red

20

2

13

Black

35

2

14

White

30

2

15

Red

4

2

Since this is a columnstore table, the data is split into segments. Column segments typically contain on the order of tens of thousands of rows; in this example, the segment size is 5 rows for readability. In addition, we will use the convention x N to indicate that a value is repeated N times.

Row segment #1 of 3:

Row segment #2 of 3:

Row segment #3 of 3:

In this example, we can see there are 3 segments containing different ranges of the Price column (4-15, 20-25, and 30-50). Each segment has the same number of rows and is organized data by column. For each column, we store the minimum and maximum values in the segment as metadata.

Segment Elimination

The single most important consideration for columnstore keys is increasing the amount of segment elimination. The minimum/maximum value metadata is used at query execution time to determine whether a segment can possibly match a filter; if not, the segment is skipped entirely and no data is examined.

For queries that filter on the key column of the columnstore index, segment elimination is typically very efficient as row segments within each row segment group will not have overlapping segments. For example, in the above table the query SELECT AVG(Price), AVG(Qty) FROM Products WHERE Price BETWEEN 1 AND 10; will eliminate all segments except row segment #1 Price 4-15.

The segment size (number of rows per segment) also impacts query execution with respect to sort keys. A smaller segment size means that a small number of rows are read from the segments that pass segment elimination.

Ordered Scans

Ordered scan over a sort key is faster than sorting a table. A query that sorts with a sort key can benefit from running an ordered columnstore scan instead of performing a normal columnstore scan.

Questions to Ask When Choosing a Columnstore Key:

  • Is the data always filtered by some column (e.g. insert timestamp or event type)? Ensure that the common columns for all queries are in the columnstore key to improve segment elimination.

  • Is the data generally inserted in order by some column (e.g. insert timestamp)? It’s best to put that column first in the columnstore key to minimize the amount of work required by the background columnstore segment merger.

  • Does one column in your key have a higher cardinality than the other? It’s best to put the lowest cardinality columns first to increase the likelihood that segment elimination will be able to affect later columns.

Check Your Understanding

Q: Can the columnstore key and the shard key be different?

A: Absolutely – they are not related at all. It is very common to pick a shard key that matches other rowstore tables (e.g. event_id) to improve join performance and to pick an unrelated columnstore key that matches the common filters on the table (e.g. event_timestamp or event_type).

Q: Is (insert_datetime6, region_id) a good columnstore key?

A: Not necessarily, because the number of distinct microsecond precision timestamps is likely very high. The partitioning of data into segments will likely be entirely controlled by insert_datetime6 and not use region_id. This means that queries scanning microseconds of data will get good segment elimination, because they’ll match fewer segments. However, queries that scan days or months of data would not benefit from segment elimination on region_id. If the data is frequently accessed days or months at a time, it would be better to truncate the precision of the timestamp. For example, use a datetime type or a computed column of DATE_TRUNC('hour', insert_datetime6). In general, your columnstore key should not be more precise than your query filters.

For more information, see How the Columnstore Works.

Last modified: August 4, 2022

Was this article helpful?