Managing Columnstore Segments
A columnstore table will have the best performance if the rows in the table are in global sorted order across all the row segments.
SingleStore uses an advanced algorithm that allows it to maintain the order as close to sorted as possible, while data is being ingested or updated.background merger
and is constantly running in the background if the order of the row segments can be improved.
Background merger
is optimistic, in that if at any point it tries to move around data which is also being changed by a concurrent UPDATE or DELETE query, it will discard all the work it has done so far and start over.pessimistic
merger, which locks the row segments it is currently processing.pessimistic
merger by running an OPTIMIZE TABLE
command.
SingleStore uses a concept of a sorted row segment group
to describe a set of row segments that are sorted together.SORT KEY
of that row are smaller than those of the other row.
Note
KEY() USING CLUSTERED COLUMNSTORE
is a legacy syntax that is equivalent to SORT KEY()
.SORT KEY()
.
If the data had a perfect global order, it would consist of a single sorted row segment group.background merger
is to reorganize the rows among row segments in such a way that the number of sorted row segment groups is as small as possible.
To inspect the current state of the sorted row segment groups of a particular table, run the SHOW COLUMNAR MERGE STATUS FOR <table_
command:
SHOW COLUMNAR MERGE STATUS FOR groups;
+----------------------------------------------------------------------------+
| Merger | State | Plan | Progress | Partition |
+----------------------------------------------------------------------------+
| (Current groups) | NULL | 741,16,1 | NULL | 0 |
| (Current groups) | NULL | 782,20 | NULL | 1 |
| (Current groups) | NULL | 701,40,5 | NULL | 2 |
| (Current groups) | NULL | 326,207,123,37,21,19,17 | NULL | 3 |
+----------------------------------------------------------------------------+
Let’s look closely at the first row of the result.0
has three sorted row segment groups, one consists of 741 row segments, one consists of 16 row segments, and one consists of a single row segment - a total of 758 row segments.
SELECT * FROM groups WHERE user_group = 15;
The very first sorted row segment group will have at most one row segment that contains rows with user_
equal to 15, unless user_
is on the boundary of two row segments, or if there is a large data skew and several row segments consist only of rows with user_
.
Now take a look at the sorted row segment groups on partition 3
.SELECT
query like the one shown above will result in materializing eight row segments.background merger
is enabled, and no workload is running concurrently, within several seconds this partition would get optimized.background merger
might fall behind.
OPTIMIZE TABLE groups
If we run SHOW COLUMNAR MERGE STATUS
while OPTIMIZE TABLE
is being executed, we might see the manual merger in action:
SHOW COLUMNAR MERGE STATUS FOR groups;
+--------------------------------------------------------------------------------+
| Merger | State | Plan | Progress | Partition |
+--------------------------------------------------------------------------------+
| (Current groups) | NULL | 741,16,1 | NULL | 0 |
| (Current groups) | NULL | 782,20 | NULL | 1 |
| (Current groups) | NULL | 701,40,5 | NULL | 2 |
| (Current groups) | NULL | 326,207,123,37,21,19,17 | NULL | 3 |
| Manual Merger | Working | 326+207+123+37+21+19+17 | 53.12% | 3 |
+--------------------------------------------------------------------------------+
What this new row indicates is that there is a manual merger running on partition 3 and that at this time it has done 53.
of the work.
When the merger is done, the table now is in a better shape:
SHOW COLUMNAR MERGE STATUS FOR groups;
+------------------------------------------------------------+
| Merger | State | Plan | Progress | Partition |
+------------------------------------------------------------+
| (Current groups) | NULL | 741,16,1 | NULL | 0 |
| (Current groups) | NULL | 782,20 | NULL | 1 |
| (Current groups) | NULL | 701,40,5 | NULL | 2 |
| (Current groups) | NULL | 730,20 | NULL | 3 |
+------------------------------------------------------------+
Note that at no point were any of the partitions merged into a single sorted row segment group in this example.SELECT
performance, it is possible to run a manual command that merges data on each partition into a single sorted row segment group:
OPTIMIZE TABLE groups FULL;SHOW COLUMNAR MERGE STATUS FOR groups;
+---------------------------------------------------------+
| Merger | State | Plan | Progress | Partition |
+---------------------------------------------------------+
| (Current groups) | NULL | 758 | NULL | 0 |
| (Current groups) | NULL | 802 | NULL | 1 |
| (Current groups) | NULL | 746 | NULL | 2 |
| (Current groups) | NULL | 750 | NULL | 3 |
+---------------------------------------------------------+
At this time, any highly selective SELECT
will materialize only one row segment per partition.
Important
Unlike OPTIMIZE TABLE <name>
, which takes the amount of time proportional to the size of recently loaded data, OPTIMIZE TABLE <name> FULL
always takes the amount of time in the order of magnitude of the size of the entire table, unless data in that table is already sorted.
When inserting a small number of rows into the columnstore table, an in-memory rowstore-backed segment is used to store the rows.background flusher
periodically will flush these rows to disk.OPTIMIZE TABLE <table_
.
OPTIMIZE TABLE t FLUSH;
MV_
shows the details about the merger processes.
Unneeded columnstore segments stored in memory are eliminated in a process called garbage collection.
The garbage collection process automatically runs in the background, so it's normally not necessary to execute the TRIGGER GC [FLUSH]
command which manually activates the garbage collector.TRIGGER GC [FLUSH]
might be appropriate.
Last modified: August 1, 2024