SingleStore DB

Columnstore
How the Columnstore Works

In SingleStore DB there are a few concepts that are important to understand in order to make the best possible use of the columnstore:

  • Clustered columnstore index - A clustered columnstore index represents the full table structure and is its primary storage.

  • Columnstore key column(s) - When creating a columnstore index, one or more columns need to be defined as the key column(s) for the columnstore index. The data in the columnstore is stored in key column order. Selecting a good columnstore key can significantly improve performance as covered later in this section.

  • Row segment - A row segment is a set of rows within a columnstore index that are stored together, each of which is made up of column segments. SingleStore DB stores the metadata for each row segment, which includes the total row count for the segment as well as a bitmask tracking which rows have been deleted.

  • Column segment - Each row segment contains a column segment for every column in a table. The column segment is the unit of storage for a columnstore table and contains all values for a specific column within the row segment. Values in column segments are always stored in the same logical order across column segments within the same row segment. SingleStore stores in memory the metadata for each column segment, which includes the minimum and maximum values contained within the segment. This metadata is used at query execution time to determine whether a segment can possibly match a filter, a process known as segment elimination.

  • Sorted row segment group - A sorted row segment group represents a set of row segments that are sorted together on the columnstore key column(s). This means that within a sorted row segment group there will be no row segments with overlapping value ranges for the column(s) that make up the key for the columnstore index. New segment groups are formed when more segments are created after running INSERT, LOAD, or UPDATE queries on the table. Information on how this affects query performance and how to minimize the number of sorted row segment groups is covered later in this section.

Below is an example of a SingleStore DB columnstore index on the Products table using the Price column as its key. See Choosing a Columnstore Key for an example of how to create the table.
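As a rough sketch of what such a table definition could look like (the exact definition is covered in Choosing a Columnstore Key; the column types and shard key shown here are assumptions for illustration only):

CREATE TABLE Products (
  ProductId INT,
  Color VARCHAR(10),
  Price INT,
  Qty INT,
  KEY (Price) USING CLUSTERED COLUMNSTORE,  -- Price is the columnstore key
  SHARD KEY (ProductId)                     -- assumed shard key, for illustration only
);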

Important

Column segments typically contain on the order of hundreds of thousands of rows. In this example, the segment size is 5 rows for readability.

Each row segment contains one column segment per column in the table. For simplicity, this example contains the initial load of the table and has a single sorted row segment group.

Logical table and rowstore representation

+-----------+-------+-------+-----+
| ProductId | Color | Price | Qty |
+-----------+-------+-------+-----+
| 1         | Red   | 10    | 2   |
| 2         | Red   | 20    | 2   |
| 3         | Black | 20    | 2   |
| 4         | White | 30    | 2   |
| 5         | Red   | 20    | 2   |
| 6         | Black | 10    | 2   |
| 7         | White | 25    | 2   |
| 8         | Red   | 30    | 2   |
| 9         | Black | 50    | 2   |
| 10        | White | 15    | 2   |
| 11        | Red   | 5     | 2   |
| 12        | Red   | 20    | 2   |
| 13        | Black | 35    | 2   |
| 14        | White | 30    | 2   |
| 15        | Red   | 4     | 2   |
+-----------+-------+-------+-----+

Sorted row segment group #1 of 1

Figure 2. Row segment #1 of 3

Figure 3. Row segment #2 of 3

Figure 4. Row segment #3 of 3


x N indicates that the value is repeated N times.

Creating Efficient Columnstore Queries

Queries against tables with a columnstore index in SingleStore can take advantage of the following characteristics of a columnstore index:

  1. All queries benefit from the fact that only the column segments for columns referenced in the query need to be scanned, and from the columnstore’s compression, which reduces the amount of data that must be scanned. Taking the above table as an example, the query SELECT SUM(Qty) FROM Products; only needs to scan the three Qty column segments, each of which contains only a single value due to the compression.

  2. Some queries can be performed simply by reading the in-memory metadata for column segments referenced in the query. As an example, the query SELECT COUNT(*) FROM Products; will only need to read the row count and delete bitmask for all row segments to produce its result, completely eliminating the need to read column segments from disk. Queries that use the MIN or MAX aggregates can eliminate the need to read column segments from disk if no delete has been performed against the column segment.

  3. Some queries can reduce the number of segments that need to be read from disk by eliminating these segments based on the segment metadata (min and max values for the segment). The efficiency of this method depends on what percentage of segments can actually be eliminated.

    • For queries that filter on the key columns of the columnstore index, segment elimination is typically very efficient as segments within each row segment will not cover overlapping value ranges. For example, in the above table the query SELECT AVG(Price), AVG(Qty) FROM Products WHERE Price BETWEEN 1 AND 10; will eliminate all segments except row segment: #1 column segment: Price 4-15 and row segment: #1 column segment: Qty 2-2 which will be scanned.

    • For queries that filter on the non-key columns that don’t correlate with the key columns, segment elimination can be much less valuable as segments within each row segment can cover overlapping value ranges. For example, in the above table, the query SELECT AVG(Price) FROM Products WHERE Color = 'Red'; will be forced to scan all segments for the Price and Color columns as no segment can be eliminated when the value Red is contained in all segments of the Color column.

  4. Queries that join tables on columns that are the index columns of a columnstore index can be performed very efficiently through the use of a merge join algorithm allowing the join to be performed by simply scanning two segments that need to be joined in lock-step.

  5. Certain types of data allow filters and group-by operations to be performed without decompressing data from its serialized-for-disk format. This greatly improves performance by reducing the amount of data that needs to be processed, especially when the cardinalities of the involved columns are low. This optimization is only performed in cases where execution run time would be improved. See Encoded Data in Columnstores for more information.

  6. Queries with selective filters use subsegment access. These queries seek into column segments to read the needed values, rather than scanning whole segments. For a columnstore table MyTable, the query SELECT Field1, Field2 FROM MyTable WHERE Field1 > 50 uses subsegment access. Selective filters with multiple conditions, such as SELECT Field1, Field2 FROM MyTable WHERE Field1 > 50 AND Field2 > 100, also use subsegment access.

  7. Queries using equality filters can take advantage of hash indexes, as illustrated in the sketch below. Since equality filters are often selective filters, the previous point about subsegment access applies to these queries as well.
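The following is a hedged sketch of point 7 (the table and column names are hypothetical and not part of the example above): a columnstore table with a secondary hash index, and an equality-filter query that can use it.

CREATE TABLE events (
  id INT,
  user_id INT,
  payload TEXT,
  KEY (id) USING CLUSTERED COLUMNSTORE,
  SHARD KEY (id),
  KEY (user_id) USING HASH
);

-- The equality filter on user_id can be served by the hash index on user_id.
SELECT * FROM events WHERE user_id = 42;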

Writing Columnstore Data

Unlike other columnstore implementations, SingleStore DB supports very fast, small-batch writes (such as single row inserts and updates) directly into columnstore tables.

In some cases, writes (inserts and updates) are implemented by storing newly written rows in a rowstore-oriented skiplist before flushing them to the column-oriented format. Rows are visible to reads as soon as they are committed to the rowstore-oriented skiplist.

In other cases, the rowstore-skiplist is bypassed and writes are made directly to the column-oriented format on disk.

The following explains the behavior of different operations that can be performed against a columnstore.

  1. Insert - Inserts into a columnstore will either go into the rowstore-backed segment or a new columnstore-backed row segment. In the latter case, the inserts are written directly to disk.

    If the engine variable columnstore_disk_insert_threshold has a low value, INSERT...SELECT and INSERT statements write directly to disk instead of writing first to rowstore-backed memory. Otherwise, INSERT writes to rowstore-backed memory first (see the sketch after this list for how to inspect and adjust this variable).

  2. Delete - Deleting a row in a columnstore causes the row to be marked as deleted in the segment metadata, leaving the data in place within the row segment. Segments which only contain deleted rows are removed, and the optimization process covered below will compact segments that require optimization.

  3. Update - An update in a columnstore is internally performed as a delete followed by an insert within a transaction.

  4. Replace

    REPLACE updates rows in a rowstore-oriented skiplist before flushing them to the column-oriented format.

  5. Optimization - SingleStore DB has optimization routines that run automatically and can also be started manually. These routines attempt to automatically merge row segments together in order to improve query efficiency. DML statements can be performed while columnstore optimizations take place. For more information, see the OPTIMIZE TABLE topic.
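Returning to item 1 above, the following is a minimal sketch for inspecting and adjusting the columnstore_disk_insert_threshold engine variable. The value shown is purely illustrative; consult the engine variables reference before changing it on a real cluster.

-- Check the current value of the threshold.
SHOW VARIABLES LIKE 'columnstore_disk_insert_threshold';

-- Illustrative only: lower the threshold so that more inserts go directly to disk.
SET GLOBAL columnstore_disk_insert_threshold = 0.2;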

Locking in Columnstores

By default, UPDATE and DELETE queries lock columnstore tables at the row level.

The following example demonstrates how multi-table filters may lock rows that do not match the filters. Consider the following query,

UPDATE stock JOIN product ON stock.qty = 10 AND stock.id = product.id SET ...

This query locks all the rows of the table stock where stock.qty = 10, including the rows where stock.id is not equal to product.id. Alternatively, use a single table filter to trim the number of rows locked.

Suppose that a database errors_db containing eight partitions has the app_errors table that is defined as follows.

CREATE TABLE app_errors (
  error_id INT,
  app_name TEXT,
  error_code TEXT,
  error_date DATE,
  KEY (error_date) USING CLUSTERED COLUMNSTORE,
  SHARD KEY (error_id)
  );

The following UPDATE example demonstrates row-level locking. Assume the app_errors table on partition one contains 4500 records having an app_name of App1.

UPDATE app_errors SET error_code = 'ERR-2000'
  WHERE app_name = 'App1';

While this query is running:

  • In the app_errors table on partition one, other queries can UPDATE and DELETE the rows not having an app_name value of App1.

  • In the app_errors table on partition one, other queries may not UPDATE and DELETE the rows having an app_name value of App1.

Overriding Default Locking

By default, UPDATE and DELETE queries use row-level locking when they operate on fewer than 5000 rows in a columnstore table and use partition-level locking when they operate on 5000 or more rows in a columnstore table. In the latter case, all of the table’s rows in the partition are locked.

For an UPDATE or a DELETE query, you can override the default threshold of 5000 by specifying the OPTION (columnstore_table_lock_threshold = <value>) hint. The <value> indicates the row count threshold for which partition level locking takes effect.

The following UPDATE example specifies that columnstore_table_lock_threshold is 4000.

UPDATE app_errors SET error_code = 'ERR-2000'
  WHERE app_name = 'App1' OPTION (columnstore_table_lock_threshold = 4000);

When you specify a higher columnstore_table_lock_threshold value, you can get higher concurrency, but more memory may be used for locking.

Managing Columnstore Segments

A columnstore table will have the best performance if the rows in the table are in global sorted order across all the row segments. In reality, maintaining such an order is not feasible in the presence of continuous writes.

SingleStore DB uses an advanced algorithm that allows it to maintain the order as close to sorted as possible, while data is being ingested or updated. Such a process is called a background merger and is constantly running in the background if the order of the row segments can be improved.

Background merger is optimistic, in that if at any point it tries to move around data which is also being changed by a concurrent UPDATE or DELETE query, it will discard all the work it has done so far and start over. It works on a small chunk of data at a time, so it is always a relatively small amount of work that is being discarded. However, in the presence of a very heavy update workload it can be a significant slowdown compared to a pessimistic merger, which locks the row segments it is currently processing. A user can manually trigger a pessimistic merger by running an OPTIMIZE TABLE command. We will explain below how to decide whether such a command is necessary, and how to run it.

SingleStore DB uses a concept of a sorted row segment group to describe a set of row segments that are sorted together. Row segments form a sorted row segment group if and only if there is an order on the row segments such that for each row segment the smallest row in it is no smaller than the largest row in any row segment before it. Here and below when we say that one row is smaller than another row, we mean that the values of columns of the CLUSTERED COLUMNSTORE key of that row are smaller than those of the other row.

If the data had a perfect global order, it would consist of a single sorted row segment group. If the data is in a completely random order, it is likely to comprise as many sorted row segment groups as there are row segments. The goal of the background merger is to reorganize the rows among row segments in such a way that the number of sorted row segment groups is as small as possible.

To inspect the current state of the sorted row segment groups of a particular table, run the SHOW COLUMNAR MERGE STATUS FOR <table_name> command:

SHOW COLUMNAR MERGE STATUS FOR groups;
****
+----------------------------------------------------------------------------+
| Merger           | State | Plan                    | Progress | Partition  |
+----------------------------------------------------------------------------+
| (Current groups) | NULL  | 741,16,1                | NULL     |   0        |
| (Current groups) | NULL  | 782,20                  | NULL     |   1        |
| (Current groups) | NULL  | 701,40,5                | NULL     |   2        |
| (Current groups) | NULL  | 326,207,123,37,21,19,17 | NULL     |   3        |
+----------------------------------------------------------------------------+

Let’s look closely at the first row of the result. According to it, the slice of the table that is stored on partition 0 has three sorted row segment groups, one consists of 741 row segments, one consists of 16 row segments, and one consists of a single row segment - a total of 758 row segments. Consider the impact of such a split into sorted row segment groups on a very simple query like

SELECT * FROM groups WHERE user_group = 15;

By the definition of the sorted row segment group, the very first sorted row segment group will have at most one row segment that contains rows with user_group equal to 15, unless user_group = 15 is on the boundary of two row segments, or if there is a large data skew and several row segments consist only of rows with user_group = 15. Similarly, at most one row segment in the second sorted row segment group contains relevant rows, and the only segment of the third sorted row segment group might also contain relevant rows. This way, only three row segments out of the total of 758 will be opened and materialized. While the query in this example is very simple, similar reasoning works for significantly more complex queries.

Now take a look at the sorted row segment groups on partition 3. Clearly, it is significantly less optimized than the remaining three, and a select query like the one shown above will result in materializing 8 row segments. If the background merger is enabled, and no workload is running concurrently, within several seconds this partition would get optimized. However, in the presence of a heavy workload, the optimistic background merger might fall behind. In this case, it might be reasonable to manually trigger a pessimistic merger by calling:

OPTIMIZE TABLE groups;

If we run SHOW COLUMNAR MERGE STATUS while OPTIMIZE TABLE is being executed, we might see the manual merger in action:

SHOW COLUMNAR MERGE STATUS FOR groups;
****
+--------------------------------------------------------------------------------+
| Merger           | State    | Plan                    | Progress | Partition   |
+--------------------------------------------------------------------------------+
| (Current groups) | NULL     | 741,16,1                | NULL     |         0   |
| (Current groups) | NULL     | 782,20                  | NULL     |         1   |
| (Current groups) | NULL     | 701,40,5                | NULL     |         2   |
| (Current groups) | NULL     | 326,207,123,37,21,19,17 | NULL     |         3   |
| Manual Merger    | Working  | 326+207+123+37+21+19+17 | 53.12%   |         3   |
+--------------------------------------------------------------------------------+

This new row indicates that there is a manual merger running on partition 3 and that, at this time, it has completed 53.12% of the work.

When the merger is done, the table is in better shape:

SHOW COLUMNAR MERGE STATUS FOR groups;
****
+------------------------------------------------------------+
| Merger           | State | Plan     | Progress | Partition |
+------------------------------------------------------------+
| (Current groups) | NULL  | 741,16,1 | NULL     |         0 |
| (Current groups) | NULL  | 782,20   | NULL     |         1 |
| (Current groups) | NULL  | 701,40,5 | NULL     |         2 |
| (Current groups) | NULL  | 730,20   | NULL     |         3 |
+------------------------------------------------------------+

Note that at no point were any of the partitions merged into a single sorted row segment group in this example. The reason is that both the optimistic and the pessimistic merger use an advanced algorithm that is optimized to do small amortized chunks of work in the presence of concurrent writes and to maintain the data in a few sorted row segment groups, rather than to attempt to merge all the data into a single sorted row segment group. In cases when it is acceptable to sacrifice some time on data ingestion to achieve even higher SELECT performance, it is possible to run a manual command that merges the data on each partition into a single sorted row segment group:

OPTIMIZE TABLE groups FULL;

SHOW COLUMNAR MERGE STATUS FOR groups;
****
+---------------------------------------------------------+
| Merger           | State | Plan | Progress | Partition  |
+---------------------------------------------------------+
| (Current groups) | NULL  | 758  | NULL     |         0  |
| (Current groups) | NULL  | 802  | NULL     |         1  |
| (Current groups) | NULL  | 746  | NULL     |         2  |
| (Current groups) | NULL  | 750  | NULL     |         3  |
+---------------------------------------------------------+

At this time any highly selective select will materialize only one row segment per partition.

Important

Unlike OPTIMIZE TABLE <name>, which takes amortized time proportional to the size of recently loaded data, OPTIMIZE TABLE <name> FULL always takes time in the order of magnitude of the size of the entire table, unless data in that table is already sorted.

When inserting a small number of rows into a columnstore table, an in-memory rowstore-backed segment is used to store the rows. As this rowstore-backed segment fills, the background flusher will periodically flush these rows to disk. A rowstore-backed segment can be flushed to disk manually by running OPTIMIZE TABLE <table_name> FLUSH.

OPTIMIZE TABLE t FLUSH;

Encoded Data in Columnstores

SingleStore DB performs some query processing operations directly on encoded data in columnstore tables. Columnstore data is stored encoded in a number of different forms, including dictionary encoding, run-length encoding, and value encoding. For these encodings, certain operations, including filtering and aggregation, can be run very efficiently, through the use of special coding techniques, as well as Single-Instruction, Multiple-Data (SIMD) instructions on processors that support the Intel AVX2 instruction set.

SIMD support on your hardware is not required to benefit from operations on encoded data; even without SIMD, these operations may speed up anywhere from a few times to 30 times. Use of SIMD may give an additional increase in performance, and a gain of another 2 to 3 times due to SIMD is not unusual. Your results will depend on your data and queries, and portions of larger queries may be executed without operating on encoded data, so the end-to-end speedup you experience may vary.

Operations on encoded data are performed automatically, by default; you don’t need to change any settings to benefit from them. The decision to use operations on encoded data is made by the query execution system at runtime, not by the query optimizer.

What Is Encoded Data and What Does It Mean to Operate on It?

SingleStore DB supports several different kinds of data encodings to help compress data. All columnstore data is said to be encoded in one way or another. Common encodings are run-length, dictionary, and LZ4. Only certain encodings can be processed directly, i.e. operated on directly. These include dictionary encoding, run-length encoding, and integer value encoding. For example, with dictionary encoding, for a segment of a string column with only 3 distinct values, a 2-bit ID number is stored for each string. The ID numbers are used as references into the dictionary. These IDs are packed tightly together when stored in the columnstore.

For example, consider this dictionary:

+----+-------+
| ID | value |
+----+-------+
| 0  | red   |
| 1  | green |
| 2  | blue  |
+----+-------+

This can be represented as a bit vector of packed string IDs when it is stored into the column segment:

  • Strings: red, blue, green, green, red

  • String IDs: 0, 2, 1, 1, 0

  • Bit vector: 00 10 01 01 00 (2 bits per string ID)

As an example of operating directly on encoded data, SingleStore DB can perform a filter operation, say t.a = xyz, on a string dictionary-encoded column segment by first finding the result of the filter for every entry in the dictionary. Then, while scanning the segment, the query execution system simply takes the encoded ID number for each value in t.a and uses it to look up the result of the string comparison for that ID that was computed in the initial scan of the dictionary. That tends to be much quicker than doing an actual string comparison. Other kinds of operations can also be done directly on encoded data ID values, including the comparison of ID values needed to do group-by operations. The details of these are beyond the scope of this documentation.

The encoding chosen for a column segment depends on the column’s type and data; for example, numeric delta encoding depends on the ordering of values, while dictionary encoding depends on their cardinality. The chosen encoding affects execution speed and determines whether SIMD operations on encoded data can be used.

Most query processing in SingleStore DB outside of columnstore scan is done row-at-a-time. Columnstore processing on encoded data is done in a vectorized fashion, where large batches of data from a column are processed in one or more relatively simple loops. These loops tend to be more friendly to modern CPUs than row-at-a-time processing, resulting in lower instruction count, better cache usage, and improved efficiency of the processor’s instruction pipeline.

Queries That Can Benefit from Operations on Encoded Data

Queries that can benefit from operations on encoded data are those that run over columnstore tables, and have one or more of these components:

  • Filters

  • Group-by

  • Aggregates

  • Expressions in aggregates

  • Expressions in group-by

  • Star joins on integer columns

Operations on encoded data can be done regardless of the number of distinct values of the encoded columns being processed. Performance tends to be better when the number of distinct values is smaller. This is because when dictionaries are smaller the columnstore data is compressed to a smaller size, and intermediate working lookup tables fit more easily into the cache memory of the processor.

Here’s an example of a query that can benefit from operations on encoded data in several ways.

select
  dayofweek(shipdate), location,
  sum(price * (1 - discount)), avg(quantity)
from lineitem
where
  status = 'S'
group by 1, 2;

Here’s another example of a query that can benefit from join operations on encoded data, if the table f is a columnstore.

select d1.x, d2.y, sum(f.m)
from f, d1, d2
where f.d1_key = d1.key
and f.d2_key = d2.key
and d1.v > 100
and d2.v in (1, 2, 3)
group by d1.x, d2.y

This query is a simple example of a star join. If you are using a star schema (with a fact table linked to multiple dimension tables), and the joins are on integer columns, you can benefit from operations on encoded data. Use of a group by clause and aggregate functions is required to benefit from join operations on encoded data.

Examples

The following example shows encoded group-by in action:

create database db;
use db;
drop table if exists r;
drop table if exists is_tables;

/* Extract rows for 50 different tables into a scratch table, is_tables. */
create table is_tables as
select * from information_schema.tables
limit 50;

/* Create a table with a million rows, with every table name
   appearing in most or all segments. */
create table r (index using clustered columnstore(n)) as
select s.*, (row_number () over (order by s.table_name) % 1000) as n
from (select t1.*
      from is_tables t1, is_tables t2, is_tables t3, is_tables t4
      limit 1000000) as s;

The table r created above is a columnstore with one million rows. You can see the encoding of the TABLE_NAME column using this query:

select count(*), encoding, column_name
from information_schema.columnar_segments
where table_name  = 'r' and database_name  = 'db'
and column_name = 'TABLE_NAME'
group by column_name, encoding

The result indicates that StringDictionary encoding is used for the column.

Now, run the following group-by/aggregate query that groups on the table_name column, in profile mode:

profile select table_name, count(*) from r group by table_name;

Finally, output the JSON profile for the query:

show profile json;

The result includes the following text:

inputs":[
    {
        "executor":"ColumnStoreScan",
        "db":"db",
        "table":"r",
        ...
        "segments_scanned":{ "value":16, "avg":2.000000, "stddev":0.000000, "max":2, "maxPartition":0 },
        "segments_skipped":{ "value":0 },
        "segments_fully_contained":{ "value":0 },
        "segments_encoded_group_by":{ "value":16, "avg":2.000000, "stddev":0.000000, "max":2, "maxPartition":0 },
        "inputs":[]
    }

Notice the portion labeled segments_encoded_group_by. This part shows that 16 segments were processed in total and all of them were processed with encoded group-by operations. These operations are taking place directly on encoded data.

Now, here is another query (run in profile mode) that is similar to the first, but also has a filter on column table_name:

profile select table_name, count(*) from r
where table_name like '%COLUMN%'
group by table_name;

show profile json;

The JSON profile for this query contains the text below:

"inputs":[
    {
        "executor":"ColumnStoreScan",
        "db":"db",
        "table":"r",
        ...
        "segments_scanned":{ "value":16, "avg":2.000000, "stddev":0.000000, "max":2, "maxPartition":0 },
        "segments_skipped":{ "value":0 },
        "segments_fully_contained":{ "value":0 },
        "segments_ops_compressed_data":{ "value":16, "avg":2.000000, "stddev":0.000000, "max":2, "maxPartition":0 },
        "segments_encoded_group_by":{ "value":16, "avg":2.000000, "stddev":0.000000, "max":2, "maxPartition":0 },
        "inputs":[]
    }

Notice that it includes sections for both segments_ops_compressed_data and segments_encoded_group_by. The segments_ops_compressed_data section indicates that 16 segments had filters applied on encoded data.

The final decision about use of operations on encoded data is made at runtime and depends on column encodings for each segment and (for group-by) the number of distinct values in the segment. So the segments_ops_compressed_data and segments_encoded_group_by values may be less than the number of segments scanned minus the number skipped. In this case, the encoded_group_by_bailout section will indicate the reasons that encoded group by was disabled as well as the number of segments disabled for each reason.

Encoded joins

This example shows encoded joins in action.

/* Create a simulated dimension table with 1000 rows with unique
   values for n, and around 100 categories. */
create table d as
select n, max(concat('cat', n % 100)) as category
from r
group by n;

/* Join dimension table d to "fact" table r on an integer column
   and group by category. This is a very simple star join. */
select d.category, count(*)
from r, d
where r.n = d.n
and d.category like 'cat1%'
group by d.category;

The above join query takes only 0.02 seconds on a two-core laptop. Yet it is fully scanning the million-row table r and joining it with 110 rows from d.

You can see that encoded joins are being used in the profile query execution plan for the above statement by using SingleStore DB Studio’s graphical plan display or show profile json output. On the HashJoin operator you will see the property "encoded_join_enabled":"yes". In addition, you can see that operations on encoded data are pushed to the ColumnStoreScan operator since it has the property segments_encoded_group_by with value set to 8. The HashJoin, ColumnStoreScan, and various GroupBy operators can work together for a star join with a sequence of one or more hash joins over the scan.

Example Performance Results

Operations on encoded data can yield some astonishingly short query execution times. Below are example queries and runtimes, on tables with 25 to 50 million rows of data, running on a single core.

+---------------------------------------------------------------------------+--------+
| query                                                                     | time   |
+---------------------------------------------------------------------------+--------+
| select count(*) from y where b in (thomas1,thomas3,thomas5)               | 0.424s |
| select count(*) from col320 group by b {320 groups}                       | 0.154s |
| select sum(j) from col10 group by b {10 groups}                           | 0.053s |
| select count(*) from col10 where c < 900000 group by b {90% selectivity}  | 0.114s |
| select count(*) from col10 where c < 100000 group by b {10% selectivity}  | 0.097s |
+---------------------------------------------------------------------------+--------+

Best-case performance in other internal tests has shown a processing rate of over one billion rows per second per core for queries like the following, when column a has eight or fewer distinct values, using a recent high-performance Intel processor.

select a, count(*) from t group by a

These results can be extrapolated to much larger data sets on a large SingleStore DB cluster, although memory bandwidth constraints may keep total throughput in rows per second per core across the cluster below the best-case numbers shown here. Your results will of course depend on your data, hardware, and queries.

Performance Considerations

Because operations on encoded data can process data so fast, their performance can become limited by the bandwidth of main memory (RAM). And the bandwidth of RAM is normally far higher than the bandwidth of your I/O system. For example, RAM bandwidth could be 50 GB/sec and the bandwidth of an SSD could be only 600 MB/sec.

So, it is recommended that you configure your system so that little or no I/O is being done while processing columnstore data, under a steady query workload. You can do this by ensuring that the operating system’s file system buffer cache has enough main memory to hold the working set of your columnstore data, i.e. the segments of columns that are frequently accessed. A good rule of thumb is to have enough RAM in the file system buffer cache to hold at least 20% of your compressed columnstore data. If your workload does a lot of full table scans that touch all the columns of the table, you may want to consider adding more than that.
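For example, one way to estimate a working-set target is to measure the total compressed columnstore size and apply the 20% rule of thumb. The query below is a sketch that assumes this guideline; adjust the fraction upward for scan-heavy workloads.

-- Total compressed columnstore size and a 20% rule-of-thumb cache target, in GB.
SELECT SUM(compressed_size) / POW(1024, 3) AS compressed_gb,
       SUM(compressed_size) * 0.20 / POW(1024, 3) AS suggested_min_cache_gb
FROM information_schema.columnar_segments;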

The Linux operating system will allocate available system memory to the file system buffer cache in an on-demand way, without the need to set any configuration knobs. Simply ensure that there is sufficient memory remaining after all the memory dedicated to SingleStore DB, the operating system, and any other applications running on the hardware.

Relevant Hints

SingleStore DB can perform group-by on encoded data using a HashGroupBy operation, and can also do a group-by on columnstore data using a StreamingGroupBy operation on the sort key of the columnstore. Due to the improvement of HashGroupBy with the introduction of operations on encoded data, it may be the case that HashGroupBy is faster than StreamingGroupBy, yet the query optimizer may choose StreamingGroupBy. If this occurs and it is important for your workload to tune the query to get better performance, you can use the disable_ordered_scan query hint. For example:

select sum(a) from t with (disable_ordered_scan=true) group by b

You can use EXPLAIN to see the plan for your query to check the type of group-by operation chosen.
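For example, the following sketch compares the plan with and without the hint, using the same hypothetical table t from the query above:

explain select sum(a) from t group by b;
explain select sum(a) from t with (disable_ordered_scan=true) group by b;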

Data Encodings Supported

Operations on encoded data are supported only for the following:

  • String columns with string dictionary and string run-length encodings

  • Integer columns with value and run-length encodings

Group-by/aggregate operations on encoded data are supported only for integers. Filter acceleration on encoded data is supported for both strings and integers.

It is recommended that you allow SingleStore DB to choose automatically how to encode data for columnstores. However, in the rare event that it chooses an encoding that does not support operations on encoded data, and it is important to you for your application that operations on encoded data be performed, you may wish to force the encoding chosen. You can do this with the option '<encoding>' notation. For example:

create table t (a int, b varchar(50) option 'StringDictionary',
  key(a) using clustered columnstore);

Operations Supported, and Limitations

The operations supported on encoded data, and the limitations that apply to them, are summarized below:

  • Scan

    • Faster decoding for integer encoding

  • Filter

    • Filters on the following:

      • Strings for string dictionary and string run-length encoding

      • Integers for run-length encoding

    • Or-of-ands of filter expressions; filter expressions must involve a single string column for faster processing

    • Support for Bloom filters on string columns (elimination of rows with no matches for joins on a single string column)

  • Aggregates

    • Aggregates supported: sum, min, max, count, any

    • Aggregate input data types supported: all numeric

    • Aggregate expressions supported: multiple table columns allowed in a single expression

    • Aggregates encodings supported: integer, integer run-length

  • Group-by:

    • Operations on encoded data are not done for aggregates without a group-by (a.k.a. scalar aggregates)

    • Group-by count star is very fast

    • Group-by:

      • Multiple group-by columns are allowed

      • A mix of columns and expressions is allowed

      • Group-by columns must only use these encodings: integer, integer run-length

      • There is a limit of a few thousand distinct values per group-by column; beyond it, the system reverts to row-at-a-time processing, and the local aggregation of data for a row segment outputs rows to the parent global aggregate operator

      • Similarly, there is a limit of a few thousand on the total number of groups, beyond which the optimized group-by processing is not used

      • Group-by on the sort key may not be as efficient as group-by on other columns, because an ordered (streaming) group-by may be chosen in cases where a hash group-by would be faster

    • Group-by on expressions:

      • Only one input column is allowed in a single expression

      • The expression result must be integer

      • Each table column can be used only once in the set of group-by columns and expressions

  • Joins:

    • Joins must be on some type of integer column, or another column type represented internally as an integer, such as datetime

    • The join or joins selected by the query optimizer must be of type HashJoin

    • The join must be a many-to-one relationship

    • The HashJoin (or a sequence of HashJoins) must appear over a ColumnStoreScan

    • A Group By operation and aggregate must be present in the query over the result of the join

Encoded group by bailout reasons

The final decision on whether to use encoded group by is made at runtime. The possible bailout reasons are:

  • Join condition is not a many-to-one relationship

  • Unsupported column encoding

  • Aggregate not guaranteed to fit in aggregate type (the result may overflow the internal integer or decimal type)

  • Reached dynamic dictionary size limit (too many unique group by values)

  • Group by column cannot be placeholder blob (column does not yet exist in columnstore)

Additional Tuning Considerations

If you have upgraded from a MemSQL release below 6.0, you can benefit immediately from operations on encoded data; there is no need to rebuild your columnstore tables. However, columnstore encoding choices changed slightly in the 6.0 release, with dictionary encoding now preferred more often, so queries may run faster on newly loaded data. Encodings are chosen separately for each segment, so even if you don’t use OPTIMIZE on your columnstore tables, as new data is inserted and older data is removed, the encodings chosen will naturally evolve to those preferred by the latest version of SingleStore DB.

If you find that operations on encoded data are not occurring as much as you expect them to, you can rebuild the columnstore table involved. Before doing this, query information_schema.columnar_segments as described earlier to see whether dictionary, run-length, or integer encodings are being used. If they are not, consider rebuilding the columnstore table.

One way to rebuild it is to use

create table newTable (index using clustered columnstore(<columns>)) as
select *
from oldTable;

Then drop oldTable and rename newTable to oldTable. Alternatively, you can use OPTIMIZE TABLE … FULL. However, be aware that this can take significantly longer than loading the data in the first place, because it causes a single sorted run to be created.

Scalar count(*)

Because encoded group-by is so fast, and scalar aggregates are not done on encoded data, you may find the following unexpected behavior. A query like this:

select a, count(*) from t group by a;

may be faster than this:

select count(*) from t;

Although it is usually not necessary because count(*) scalar aggregates are so fast anyway, if you want the fastest possible count(*), consider this workaround:

select sum(t2.c)
from (select a, count(*) as c from t group by a) as t2;

The subquery will be done with encoded group-by, so the containing query may run faster than a scalar count(*).

Aggregation of Decimal Types

When creating decimal type columns that are aggregated in queries, it is recommended, for best performance, to use a precision of 18 digits or less if your application can accommodate this (i.e., the extra precision is not needed to represent meaningful information). Decimal values with 18 or fewer digits of precision can be processed more efficiently because they can be handled internally as 64-bit integers during aggregation. Processing these 64-bit values takes much less time than interpreting decimal values of more than 18 digits, which must be handled in a more general way.
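As a hedged illustration (the table and column names are hypothetical), prefer a precision of 18 digits or less for columns that are aggregated heavily:

create table sales (
  id int,
  amount decimal(18,4),       -- 18 digits of precision or less: can be aggregated internally as a 64-bit integer
  amount_wide decimal(28,4),  -- more than 18 digits: aggregation falls back to a more general, slower path
  key (id) using clustered columnstore
);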

Highly Selective Joins
Working with Highly Selective Joins

Starting in version 7.0, SingleStore introduced support for columnstore hash indexes to broaden the support for OLTP-type queries. However, a common join pattern in OLTP is to have a very selective filter on one table, which produces a few rows from the source table, and then join those rows with another table. Databases for OLTP normally use a nested loop join for this. For each row from the outer table, an index seek is done on the inner table.

SingleStore DB supports these highly selective joins using an adaptive hash join algorithm. As a result, there’s no need to run a full columnstore scan when there’s a matching hash index. First, it builds a hash table for the table with the highly-selective filter. Then, depending on the number of rows in the hash table, the adaptive algorithm will switch strategies internally.

  • If there are only a few rows in the hash table, it switches to use a nested loop join strategy, seeking into the larger table (on the probe side) via the index on the join column of the table on the probe side.

  • If the hash build side produces a lot of rows, then it performs a normal hash join.

The following example takes advantage of this strategy for selective joins.

CREATE TABLE orders(
  oid INT,
  d DATETIME,
  KEY(d) USING CLUSTERED COLUMNSTORE,
  SHARD(oid),
  KEY(oid) USING hash);

CREATE TABLE lineitems(
  id INT,
  oid INT,
  item INT,
  KEY(oid) USING CLUSTERED COLUMNSTORE,
  SHARD(oid),
  KEY(oid) USING HASH);

Now, add some sample data to orders:

INSERT INTO orders VALUES(1, NOW());

Execute the following command repeatedly, until the table has around 33.5 million rows in it.

INSERT INTO orders
SELECT oid+(SELECT MAX(oid) FROM orders), NOW()
FROM orders;

Add 67.1 million rows of data to the lineitems table, such that each line item belongs to an order, and each order has exactly two line items.

INSERT INTO lineitems SELECT oid, oid, 1 FROM orders;
INSERT INTO lineitems SELECT oid + 1000*1000*1000, oid, 2 FROM orders;

Find a selective DATETIME value for d to search on:

SELECT d, COUNT(*)
FROM orders
GROUP BY d;

The result shows that a number of DATETIME values only appear in one row in orders. For this example, let’s say that one such value is 2020-03-30 16:47:05.

The following query uses this date to produce a join result with exactly two rows:

SELECT *
FROM orders o JOIN lineitems l ON o.oid = l.oid
WHERE o.d = "2020-03-30 16:47:05";

This query filters rows on o.d to find a single row of orders, then joins to lineitems via the hash index on lineitems.oid, using the new selective join algorithm.

In the JSON profile plan, you can see if a plan may be able to perform a join via a hash index. If this optimization strategy is available in your plan, you will see join index in descriptions of columnstore filter operators. For example:

"executor":"ColumnStoreFilter",
"keyId":4294968023,
"condition":[
        "o.oid = l.oid bloom AND l.oid = o.oid join index"
],

In the SingleStore DB Studio, you may see a condition that mentions join index in the ColumnStoreFilter operator in the properties pane:

o.oid = l.oid bloom AND l.oid = o.oid join index

The join index filter can also be viewed in the EXPLAIN output.

Performance of Highly Selective Joins

A higher number of join columns with a matching columnstore hash index, and fewer hash values, result in better join performance. For example, if inventory and product are two columnstore tables with hash indexes on the columns id and code, then the following query has better join performance than a query that joins on a single column:

SELECT * FROM 
inventory JOIN product
ON inventory.id = product.id AND inventory.code = product.code;
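A hedged sketch of table definitions that would match this description (the column types and the extra columns are assumptions):

CREATE TABLE product(
  id INT,
  code INT,
  name TEXT,
  KEY(id) USING CLUSTERED COLUMNSTORE,
  SHARD KEY(id),
  KEY(id) USING HASH,
  KEY(code) USING HASH);

CREATE TABLE inventory(
  id INT,
  code INT,
  qty INT,
  KEY(id) USING CLUSTERED COLUMNSTORE,
  SHARD KEY(id),
  KEY(id) USING HASH,
  KEY(code) USING HASH);
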
Columnstore Sizing Estimations

There are two possible methods for estimating the total size of a columnstore table. The first method is brief, while the second is more precise. For more information about how the columnstore works, see Columnstore and the other topics nested under it.

Simple Method

For the simple method, start with calculating the average row size for your table by adding together the size of each column's data type from Data Types.  Multiply the average row size by the expected number of rows in your table to get the estimated uncompressed size.

Next, estimate the compression ratio for your dataset. Columnstore compression typically shrinks the data size by 3-7x. Based on these estimations, it is possible to get a very rough estimate of storage requirements for columnstore tables.
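For instance, a back-of-the-envelope calculation under assumed numbers (1 billion rows, roughly 40 bytes per row, and an assumed 5x compression ratio) can be run as a quick query:

-- Estimated uncompressed and compressed sizes, in GB, under the assumptions above.
SELECT 1000000000 * 40 / POW(1024, 3) AS uncompressed_gb,
       1000000000 * 40 / 5 / POW(1024, 3) AS estimated_compressed_gb;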

Complex Method

In the same way as the simple method, start with calculating the average row size for your table by adding together the size of each column's data type from Data Types.  Multiply the average row size by the expected number of rows in your table to get the estimated uncompressed size.

Then, to get an estimate that is more accurate than the base calculation demonstrated above, a more precise compression ratio is needed. Use the steps below to accomplish this by loading a large amount of representative data, then measuring the actual compression ratios achieved (by table, and by column).

  1. Load a representative sample of data into a columnstore table. More data leads to more accurate measurements.

    1. For information about creating a columnstore table, see Creating a Columnstore Table.

    2. To see options and procedures for loading data, see Load Data and the topics nested below it.

  2. Run OPTIMIZE TABLE on the table in order to ensure the data is optimized and on disk. Use caution when running this command, as it can impact the performance of all other workloads running in the cluster. For more information, see the linked topic.

  3. Measure the compression ratios using either SingleStore Studio, or SQL queries as outlined below.

    1. Using SingleStore Studio:

      1. Navigate to the Databases page.

      2. Select the database containing the table in question.

      3. Compression ratios are displayed at the table level.

      4. Select a table to see compression ratios per column.

    2. By querying information_schema.COLUMNAR_SEGMENTS.

      1. Compression ratio per table, computed as 1 - (compressed size / uncompressed size):

        SELECT
        database_name,
        table_name, 
        (1-SUM(compressed_size)/SUM(uncompressed_size)) AS ratio 
        FROM information_schema.columnar_segments 
        GROUP BY database_name, table_name 
        ORDER BY ratio DESC;
      2. Compression ratio per column, computed as 1 - (compressed size / uncompressed size):

        SELECT
        database_name,
        table_name,
        column_name,
        (1-SUM(compressed_size)/SUM(uncompressed_size)) AS ratio
        FROM information_schema.columnar_segments
        GROUP BY database_name, table_name, column_name
        ORDER BY ratio DESC;

Additional Sizing Estimations for Columnstore Table Overhead

Beyond the variable size estimates explained above, columnstore tables also require space for a constant metadata overhead of:

  • ~100 bytes of memory per segment per column.

  • ~100 KB for each segment that has at least one deleted row.

  • ~7 KB per column per partition for incremental cardinality statistics.
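As a worked example under assumed numbers (a table with 50 columns spread across 16 partitions, with roughly 1,000 segments per partition), the constant overhead can be estimated as follows:

-- ~100 bytes per segment per column, and ~7 KB per column per partition.
SELECT 100 * 50 * 1000 * 16 / POW(1024, 2) AS segment_metadata_mb,
       7 * 50 * 16 / 1024.0 AS cardinality_stats_mb;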

Universal Storage

Universal storage allows you to support large-scale Online Transaction Processing (OLTP) and Hybrid Transactional and Analytical Processing (HTAP) at a lower total cost of ownership (TCO). Universal storage is a continuing evolution of the columnstore, supporting transactional workloads that would have traditionally used the rowstore.

Although rowstores are well-suited for transaction processing, it can be costly to store large datasets in rowstores, as they store all data in RAM. Because columnstores are disk-based, it is more economical to store data in columnstores as opposed to rowstores.

Current Features in Universal Storage

Currently, universal storage has five features that allow columnstores to process transactional workloads more efficiently.

Additionally, sparse compression for rowstores enables wide tables with a large percentage of NULL values to be stored in as little as half of the RAM they would otherwise require.