Choosing a Columnstore Key

Columnstore tables have exactly one index, the sort key. Using this sort key, columnstores group the rows into logical segments, where each segment contains data for many rows. The sort key controls the ordering of data in the blobs that compose the segment’s data. Data within a segment is stored on disk into segment files containing the same field for many rows. This enables two important capabilities. One is to scan each column individually; in essence, being able to scan only the columns that are needed to execute a query with a high degree of locality. The other capability is that columnstores lend themselves well to compression; for example, repeating and similar values can be easily compressed together.

In addition, SingleStore Helios stores metadata for each row segment in memory, which includes the minimum and maximum values for each column contained within the segment. This metadata is used at query execution time to determine whether a segment can possibly match a filter, a process known as segment elimination.

For example, we will use this columnstore table:

SQL

CREATE TABLE products (
     ProductId INT,
     Color VARCHAR(10),
     Price INT,
     Quantity INT, 
     SORT KEY (Price)
);

The following table represents a logical arrangement of the data in a single partition of the database:

ProductId	Color	Price	Quantity
1	Red	10	2
2	Red	20	2
3	Black	20	2
4	White	30	2
5	Red	20	2
6	Black	10	2
7	White	25	2
8	Red	30	2
9	Black	50	2
10	White	15	2
11	Red	5	2
12	Red	20	2
13	Black	35	2
14	White	30	2
15	Red	4	2

Since this is a columnstore table, the data is split into segments. Column segments typically contain on the order of tens of thousands of rows; in this example, the segment size is 5 rows for readability. In addition, we will use the convention x N to indicate that a value is repeated N times.

Row segment #1 of 3:

Row segment #2 of 3:

Row segment #3 of 3:

In this example, we can see there are 3 segments containing different ranges of the Price column (4-15, 20-25, and 30-50). Each segment has the same number of rows and is organized data by column. For each column, we store the minimum and maximum values in the segment as metadata.

Segment Elimination

The single most important consideration for columnstore tables is setting a sort key. Each segment file contains metadata that holds the minimum and maximum values for the rows in the segment. The minimum/maximum value metadata is used at query execution time to determine whether a segment can possibly match a filter; if not, the segment is skipped entirely and no data is examined. This functionality is called segment elimination because the segment is eliminated from the scan.

For queries that filter on the sort key column of the columnstore index, segment elimination is typically very efficient as row segments within each row segment group will not have overlapping segments.

For example, in the above table the query SELECT AVG(Price), AVG(Qty) FROM Products WHERE Price BETWEEN 1 AND 10; will eliminate all segments except row segment #1 Price 4-15.

The segment size (number of rows per segment) also impacts query execution with respect to the sort key. A smaller segment size means that a smaller number of rows are read from the segments that pass segment elimination. Column segments typically contain on the order of tens of thousands of rows.

Indexing further cuts down on the number of rows scanned when a select query with a WHERE clause is executed. If the table has an index for the columns in the WHERE clause, the engine can quickly determine the location of the matching rows position to seek without having to look at all the data. This is much faster than reading every row sequentially.

Sub-segment elimination is also supported. Refer to Understanding Sort Key Selection for more information.

Ordered Scans

Ordered scan over a sort key is faster than sorting a table. A query that sorts with a sort key can benefit from running an ordered columnstore scan instead of performing a normal columnstore scan.

Questions to Ask When Choosing a Columnstore Key:

Is the data always filtered by some column (e.g. insert timestamp or event type)? Ensure that the common columns for all queries are in the sort key to improve segment elimination.
Is the data generally inserted in order by some column (e.g. insert timestamp)? It’s best to put that column first in the sort key to minimize the amount of work required by the background columnstore segment merger.
Does one column in your key have a higher cardinality than the other? It’s best to put the lowest cardinality columns first to increase the likelihood that segment elimination will be able to affect later columns.

Check Your Understanding

Q: Can the columnstore key and the shard key be different?

A: Absolutely – they are not related at all. It is very common to pick a shard key that matches other rowstore tables (e.g. event_id) to improve join performance and to pick an unrelated columnstore key that matches the common filters on the table (e.g. event_timestamp or event_type).

Q: Is (insert_datetime6, region_id) a good columnstore key?

A: Not necessarily, because the number of distinct microsecond precision timestamps is likely very high. The partitioning of data into segments will likely be entirely controlled by insert_datetime6 and not use region_id. This means that queries scanning microseconds of data will get good segment elimination, because they’ll match fewer segments. However, queries that scan days or months of data would not benefit from segment elimination on region_id. If the data is frequently accessed days or months at a time, it would be better to truncate the precision of the timestamp. For example, use a datetime type or a computed column of DATE_TRUNC('hour', insert_datetime6). In general, your columnstore key should not be more precise than your query filters.

For more information, see How the Columnstore Works.