Configuring the Columnstore to Work Effectively

Warning

Configuration options described in this section can have a significant performance impact on your workload. Test any changes in a staging environment before deploying them to production.

Configuring Segment Size in Columnstore Tables

In SingleStore, data in a columnstore table is organized into multiple row segments. For certain workloads, the size of the row segments significantly affects performance. Considerations include:

  • In general, larger segments compress better than smaller ones, which leads to less disk usage and faster scans across the table.

  • On the other hand, smaller segments benefit more from segment elimination. As a result, queries with highly selective filters on the columnstore index run faster with smaller segments (see the example after this list).
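
For instance, segment elimination lets a query skip every segment whose min/max metadata for the filtered column falls entirely outside the filter range; with smaller segments, a selective filter can skip a larger fraction of the table. A minimal sketch, using a hypothetical columnstore table events with SORT KEY (ts):

SELECT COUNT(*) FROM events WHERE ts BETWEEN '2024-01-01' AND '2024-01-02';

Because ts is the sort key, only the segments whose ts ranges overlap that day are scanned; the rest are skipped using segment metadata alone.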

In SingleStore, the default size of the row segments is controlled by the global variable columnstore_segment_rows, which has a default value of 1024000, meaning that each segment stores up to 1024000 rows.
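
For example, you can inspect the current value and change the cluster-wide default as follows. This is a sketch: the new value of 512000 is illustrative, and depending on your deployment, engine variables may need to be set through your cluster management tooling instead of SET GLOBAL.

SHOW VARIABLES LIKE 'columnstore_segment_rows';
SET GLOBAL columnstore_segment_rows = 512000;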

Note

The variable columnar_segment_rows from previous SingleStore versions has been deprecated in favor of columnstore_segment_rows; columnar_segment_rows remains available as an alias for columnstore_segment_rows.

A cluster upgraded from a version before 6.0 will have the previous default value of 102400 for columnstore_segment_rows.

Refer to Upgrade to SingleStore 7.6 for more information.

In addition to the global variable, it is possible to override the global setting and set the segment size for a specific columnstore table. This can be done when the table is created, or by altering an existing table. For example:

CREATE TABLE t (id INT, SORT KEY (id) WITH (columnstore_segment_rows=100000));
SHOW CREATE TABLE t;
+-------+-------------------------------------------------------------------------+
| Table | Create Table                                                            |
+-------+-------------------------------------------------------------------------+
| t     | CREATE TABLE `t` (                                                      |
|       |   `id` int(11) DEFAULT NULL,                                            |
|       |   SORT KEY `id` (`id`)                                                  |
|       |   /*!90619 */ /*!90621 WITH(COLUMNSTORE_SEGMENT_ROWS=100000) */         |
|       | )                                                                       |
+-------+-------------------------------------------------------------------------+

Now, modify the value of columnstore_segment_rows for this columnstore. The MODIFY KEY keyName clause identifies the index whose settings are to be changed (in this case, the columnstore index).

ALTER TABLE t MODIFY KEY id SET (columnstore_segment_rows=20000);
SHOW CREATE TABLE t;
+-------+-----------------------------------------------------------------------------------+
| Table | Create Table                                                                      |
+-------+-----------------------------------------------------------------------------------+
| t     | CREATE TABLE `t` (                                                                |
|       |   `id` int(11) DEFAULT NULL,                                                      |
|       |   SORT KEY `id` (`id`)                                                            |
|       |   /*!90619 */ /*!90621 WITH(COLUMNSTORE_SEGMENT_ROWS=20000) */                    |
|       | )                                                                                 |
+-------+-----------------------------------------------------------------------------------+

Configuring the Rowstore-backed Segment Size in Columnstore Tables

In addition to the on-disk segments, each SingleStore columnstore also has an in-memory rowstore-backed segment.

In some cases, inserts and updates write to the rowstore-backed segment first; the background flusher process then periodically compresses those recently inserted rows and creates on-disk segments. In other cases, inserts and updates bypass the rowstore-backed segment and write directly to the column-oriented format on disk. See Writing Columnstore Data for more information.

The size of the rowstore-backed segment is controlled by the global variable columnstore_flush_bytes, which defaults to 32 MB. The background flusher process starts to create on-disk segments when the amount of data in the rowstore-backed segment exceeds columnstore_flush_bytes. Additionally, insert and load operations are considered small-batch if they write less than columnstore_flush_bytes * columnstore_disk_insert_threshold (16 MB by default) to each partition at a time. The amount of time (in seconds) the background flusher waits before trying to flush a table is set via internal_columnstore_idle_flush_wait_seconds. If the table's row count has not changed in that period, the rows are flushed even if the threshold has not been reached, so idle tables do not sit unflushed indefinitely.
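
For example, the small-batch threshold can be computed directly from the two variables. A quick sanity check, assuming both variables are readable via @@ in your version; with the defaults of 33554432 bytes (32 MB) and 0.5, this returns 16777216 bytes (16 MB):

SELECT @@columnstore_flush_bytes * @@columnstore_disk_insert_threshold AS small_batch_threshold_bytes;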

The minimum size of the disk-backed row segment created by insert and load operations is controlled by the engine variable columnstore_disk_insert_threshold. It is a fractional value with a default of 0.5. Note that if OPTIMIZE TABLE ... FLUSH is run manually, the minimum segment size can be much smaller.
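
For example, to manually flush the rowstore-backed segment of the table t from the earlier examples to disk, regardless of how many rows it currently holds:

OPTIMIZE TABLE t FLUSH;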

Considerations about tuning the rowstore-backed segment size include:

  • The rowstore-backed segment is stored in-memory. Therefore, the table consumes less memory when the rowstore-backed segment is smaller.

  • The background flusher process can write more rows to disk at once if the rowstore-backed segment is larger, reducing the number of disk write operations during data ingestion.

Similar to columnstore_segment_rows, the columnstore_flush_bytes value can also be configured per table with the following syntax:

CREATE TABLE t2 (id INT, SORT KEY (id) WITH (columnstore_flush_bytes=4194304, columnstore_segment_rows=100000));
ALTER TABLE t2 MODIFY KEY id SET (columnstore_flush_bytes=8388608);
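
As with the earlier examples, running SHOW CREATE TABLE t2 displays both settings in the WITH clause of the sort key, which is a quick way to confirm the per-table values currently in effect:

SHOW CREATE TABLE t2;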

Advanced Columnstore Configuration

If your database uses unlimited storage, it might be beneficial to set the columnstore_small_blob_combination_threshold engine variable, which controls the maximum size of a combined columnstore blob file. The minimum value is 4,096 bytes and the maximum is 33,554,432 bytes (32 MB).

SingleStore combines multiple columns from a segment into a single blob file. The size of this combined blob must be at most the columnstore_small_blob_combination_threshold value.

The columnstore_small_blob_combination_threshold engine variable limits the combined file size of two or more column blobs; it does not limit the size of an individual column blob. For example, suppose columnstore_small_blob_combination_threshold is set to 64 MB and there are three column blobs sized 31 MB, 31 MB, and 92 MB. The first 31 MB column blob goes into a file. Because the combined size of the first two blobs (62 MB) is below the threshold, the second 31 MB column blob is added to the same file. When the third column blob (92 MB) is processed, it cannot be added to the existing file because the combined size would exceed the threshold, so it gets its own file, even though it is itself larger than the threshold value.

Note

SingleStore recommends that only advanced users set this engine variable on their own; otherwise, change it only under the direction of our support team.

The following is a use case where tuning the columnstore_small_blob_combination_threshold variable is beneficial:

  • A columnstore table contains a large number of columns, but only a few columns are queried.

  • The local blob cache is cold, or the data must be read from the remote object store (e.g., the table being read is larger than the cache).

  • Blob files have been combined into files totaling nearly 32 MB each. If a column read by a query is small in comparison, the entire 32 MB file must still be read from the remote object store to access that small column's data.

Setting the columnstore_small_blob_combination_threshold engine variable to a smaller value can make such a query run faster by reducing the amount of data that must be read from remote object storage.
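
For example, to lower the threshold to 4 MB. This is a sketch assuming your deployment allows setting this engine variable with SET GLOBAL; some deployments manage engine variables through cluster management tooling instead.

SET GLOBAL columnstore_small_blob_combination_threshold = 4194304;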
