Encoding String Data in Columnstore Tables
On this page
SingleStore columnstore tables encode string data using customized, seekable versions of basic encoding schemes.
The four types of encoding for String Types in SingleStore columnstore tables are:
-
String Dictionary Encoding
-
Seekable String Run Length Encoding (RLE)
-
Seekable LZ4 Encoding
-
Seekable String Encoding
To support these encoding schemes, the query engine maintains a small amount of specialized metadata and deliberately stores that metadata with the data.
The encoding scheme choice is per segment, on-the-fly, dynamic, and adaptive.
String Data Encoding Schemes In Depth
String Dictionary Encoding
Dictionary encoding is a data compression technique that can be applied to individual columns including columns of type String.
Dictionary encoding is most effective on columns with low cardinality; the fewer the number of unique values within a column, the greater the reduction in memory and storage usage.
For example, with dictionary encoding, for a segment of a string column with only 3 distinct values, a 2-bit key value (or encoded ID) is stored for each string.
Consider this dictionary with 3 entries:
ID |
Value |
---|---|
0 |
|
1 |
|
2 |
|
A string column using this dictionary can be represented as a bit vector of packed string IDs when it is stored as shown in the following example:
For the dictionary encoding, the encoded data is seekable while the dictionary part is not.
String Run Length Encoding (RLE)
In general, Run-Length Encoding is a compression technique where consecutive repetitions of the same item within a sequence are replaced with a single instance of that item and a count of how many times the item repeats, essentially reducing the sequence size by identifying and encoding "runs" of identical items.
Original data of a column Department
in table student
:
…’physics’, ‘math’, ‘math’, ‘math’, ‘math’, ‘physics’, ‘physics’, ‘chemistry’, …
The same data run length encoded:
… [1’physics’], [4‘math’], [2’physics’], [1’chemistry’]...
String Run Length Encoding yields a great compression ratio when a string column has a large number of repeated values in consecutive rows.
Seekable String Run Length Encoding (RLE): Seekable String Run Length Encoding is an encoding scheme that supports point-access on top of String Run-Length Encoding.
LZ4 Encoding
LZ4 is a compression scheme which is a byte-oriented encoding scheme that achieves compression by encoding input bytes that have occurred recently in the input stream with smaller symbols.
The downside of LZ4 compression compared to Dictionary encoding and RLE encoding is that it is not easy to support query operations directly on LZ4 encoded data during query processing.
When selecting an encoding scheme, unless the LZ4 compression ratio is significantly better than the compression ratio of dictionary encoding or RLE encoding, SingleStore generally recommends dictionary encoding and RLE encoding because those encodings have the benefit of supporting encoded data processing in query execution.
Seekable LZ4 Encoding: Seekable LZ4 is an encoding scheme that supports point-access on top of LZ4 encoding.
Seekable String Encoding
Out of the four string encoding schemes, string encoding or string value encoding is the only scheme that doesn’t compress the data in size.
View the Encoding Scheme in information_ schema. COLUMNAR_ SEGMENTS
The information_
view stores information about each segment of a columnar table.
SELECT column_name, segment_id, encoding, uncompressed_size, compressed_sizeFROM information_schema.COLUMNAR_SEGMENTSWHERE table_name = ‘Sales’ and segment_id = 3;
+-------------------+------------+--------------------------+--------------------------------+--------------------------------+
| column_name | segment_id | encoding | uncompressed_size | compressed_size |
+-------------------+------------+--------------------------+--------------------------------+--------------------------------+
| country | 3 | StringDictionary | 2867200 | 1200 |
| item_name | 3 | StringDictionary | 6392648 | 3128 |
| tx_id | 3 | Integer | 16384 | 420 |
...
Encoding Scheme Selection in the Columnstore Storage Engine
The choice of encoding scheme in the columnstore storage engine is per segment.
When a segment needs to be written to files on disk, the columnstore storage engine checks the properties of the data, iterates through the set of desired encoding schemes and does trial compressions, compares their compression ratios taking into account the following rules and heuristics, and eventually picks one and writes the encoded data to the blob files.
-
String Dictionary or seekable string RLE are favored over other encoding schemes because query execution can directly operate on the encoded data to boost performance in addition to the storage savings.
-
Dictionary encoding will not be picked when data uniqueness is larger than 70%.
-
Seekable RLE will not be chosen when the average run length is less than 1.
1. -
The choice between string dictionary encoding and string RLE is based on their compression ratios if they both pass Rules 2 and 3.
-
Seekable LZ4 will be considered with an additional 20% penalty, i.
e. only when the LZ4 compression ratio leads to a compressed size less than the best compressed size by another encoding scheme minus 20% uncompressed size. This heuristic correction to make LZ4 less favorite is due to the consideration in Rule 1. -
Seekable string encoding will be considered if all other encoding schemes cannot meet the requirements.
As mentioned earlier, some metadata is also collected and the data organization in files is structured to support efficient point-access of rows in the file.
SingleStore offers users the option to specify an encoding scheme for a column if they have a valid reason to override the storage engine’s internal choice.<encoding>
provided in the CREATE TABLE statement: For example:
CREATE TABLE t (a int, b varchar(50) option 'StringDictionary', Sort key(a));
Operating on Encoded String Data During Query Execution
Filtering directly on string data encoded with dictionary or RLE encoding is supported during query execution.
Encoded data can be directly processed to boost query performance with a filtering on an encoded column.T.
, on a string dictionary encoded column segment first finds the result of the filter for every entry in the dictionary.
That execution pattern tends to be much quicker than doing an actual string comparison.
The encoding for a blob depends on its type and data.
Table of String Data Encoding Schemes
Encoding Scheme |
Description |
Advantages and Disadvantages |
Considerations for Usage |
---|---|---|---|
String Dictionary Encoding |
An encoding scheme that uses dictionary encoding for strings. |
Query execution can operate directly on the encoded data. |
Not used when data uniqueness is larger than 70%. |
Seekable String Run Length Encoding (RLE) |
An encoding scheme based on String Run-Length Encoding. |
Query execution can operate directly on the encoded data. |
Not used when average run length is less than 1. |
Seekable LZ4 Encoding |
An encoding scheme based on LZ4 encoding. |
The compression ratio is significantly better than the compression ratios of dictionary encoding and RLE encoding. Query execution cannot directly operate on the encoded data. |
Considered only when the LZ4 compression leads to a compressed data size less than the best compressed size of any other encoding scheme minus 20% uncompressed size. |
Seekable String Encoding |
Encodes string values without compression. |
Does not compress the data in size. Keeps the data seekable in cases where the data is not readily compressed. |
Considered only when other encoding schemes do not meet the requirements. |
Summary
SingleStore columnstore tables store string data in a number of different encoded forms.
Last modified: February 12, 2025