Encoding String Data in Columnstore Tables

SingleStore columnstore tables encode string data using customized, seekable versions of basic encoding schemes. The compression schemes Dictionary Encoding, Run Length Encoding (RLE), and LZ4 Encoding have been extended to be seekable and support point-access. In addition, a seekable encoding without compression is provided for use when data compression is not effective. These enhanced encoding schemes provide low query latency for point-access queries and storage savings from compression.

The four types of encoding for String Types in SingleStore columnstore tables are:

String Dictionary Encoding
Seekable String Run Length Encoding (RLE)
Seekable LZ4 Encoding
Seekable String Encoding

To support these encoding schemes, the query engine maintains a small amount of specialized metadata and deliberately stores that metadata with the data. This structure makes seekable encoding schemes that efficiently read data at a specific row offset without the need to decode all rows in a segment. This point-access ability is critical for the query engine to run low latency point queries in an OLTP workload.

The encoding scheme choice is per segment, on-the-fly, dynamic, and adaptive. The decision to pick one encoding scheme over another scheme takes into account the compression ratio, the operationability of the encoded data during query execution, and heuristics.

String Data Encoding Schemes In Depth

String Dictionary Encoding

Dictionary encoding is a data compression technique that can be applied to individual columns including columns of type String. Dictionary encoding stores each unique value of a column in a dictionary and associates each record with its corresponding unique value. This eliminates the storage of duplicate values in a column and reduces the overall memory and disk space required to store the data.

Dictionary encoding is most effective on columns with low cardinality; the fewer the number of unique values within a column, the greater the reduction in memory and storage usage. Queries against the encoded column will generally be faster for low cardinality columns.

For example, with dictionary encoding, for a segment of a string column with only 3 distinct values, a 2-bit key value (or encoded ID) is stored for each string. The encoded IDs are used as a reference into the dictionary. These IDs are packed tightly together when stored in the columnstore on disk.

Consider this dictionary with 3 entries:

ID	Value
0	`red`
1	`green`
2	`blue`

A string column using this dictionary can be represented as a bit vector of packed string IDs when it is stored as shown in the following example:

For the dictionary encoding, the encoded data is seekable while the dictionary part is not. A large dictionary presents potential issues as the entire dictionary needs to be loaded in memory before any values in the column are decoded.

String Run Length Encoding (RLE)

In general, Run-Length Encoding is a compression technique where consecutive repetitions of the same item within a sequence are replaced with a single instance of that item and a count of how many times the item repeats, essentially reducing the sequence size by identifying and encoding "runs" of identical items. The string RLE implementation in columnstore tables applies the basic RLE encoding concept to determine if an entire string is repeated across multiple consecutive rows and if so, stores that sequence as a run, i.e. the unit of string RLE in this implementation is a string value. For example,

Original data of a column Department in table student:

Shell

…’physics’, ‘math’, ‘math’, ‘math’, ‘math’, ‘physics’, ‘physics’, ‘chemistry’, …

The same data run length encoded:

Shell

… [1’physics’], [4‘math’], [2’physics’], [1’chemistry’]...

String Run Length Encoding yields a great compression ratio when a string column has a large number of repeated values in consecutive rows.

Seekable String Run Length Encoding (RLE): Seekable String Run Length Encoding is an encoding scheme that supports point-access on top of String Run-Length Encoding. The columnstore storage engine implements this encoding scheme, so that string data encoded with RLE can be accessed by the row offset without the need to decode the other rows.

LZ4 Encoding

LZ4 is a compression scheme which is a byte-oriented encoding scheme that achieves compression by encoding input bytes that have occurred recently in the input stream with smaller symbols. LZ4 is a very fast compression algorithm with a good compression ratio in general.

The downside of LZ4 compression compared to Dictionary encoding and RLE encoding is that it is not easy to support query operations directly on LZ4 encoded data during query processing.

When selecting an encoding scheme, unless the LZ4 compression ratio is significantly better than the compression ratio of dictionary encoding or RLE encoding, SingleStore generally recommends dictionary encoding and RLE encoding because those encodings have the benefit of supporting encoded data processing in query execution.

Seekable LZ4 Encoding: Seekable LZ4 is an encoding scheme that supports point-access on top of LZ4 encoding. The point-access support of this scheme can greatly benefit query performance by allowing direct access to specific rows instead of scanning and decoding all the rows in the blocks or segments.

Seekable String Encoding

Out of the four string encoding schemes, string encoding or string value encoding is the only scheme that doesn’t compress the data in size. However, it provides great value in applicable scenarios where keeping the data seekable is critical for low latency point-access queries even though storage savings may not be easily achievable. The point-access ability is beneficial to query performance in general and critical for low latency point-access OLTP queries.

View the Encoding Scheme in `information_schema.COLUMNAR_SEGMENTS`

The information_schema.COLUMNAR_SEGMENTS view stores information about each segment of a columnar table. Users can query the encoding column of a columnar segment to find out what encoding scheme is used for that segment. For example,

SQL

SELECT column_name, segment_id, encoding, uncompressed_size, compressed_size
FROM information_schema.COLUMNAR_SEGMENTS
     WHERE table_name = ‘Sales’ and segment_id = 3;

+-------------------+------------+--------------------------+--------------------------------+--------------------------------+
| column_name       | segment_id | encoding                 | uncompressed_size              | compressed_size                |
+-------------------+------------+--------------------------+--------------------------------+--------------------------------+
| country           |          3 | StringDictionary         | 2867200                        | 1200                           |
| item_name         |          3 | StringDictionary         | 6392648                        | 3128                           |
| tx_id             |          3 | Integer                  | 16384                          | 420                            |
...

Encoding Scheme Selection in the Columnstore Storage Engine

The choice of encoding scheme in the columnstore storage engine is per segment. The goal is to both maximize storage savings and lower the query latency.

When a segment needs to be written to files on disk, the columnstore storage engine checks the properties of the data, iterates through the set of desired encoding schemes and does trial compressions, compares their compression ratios taking into account the following rules and heuristics, and eventually picks one and writes the encoded data to the blob files. The rules and heuristics include:

String Dictionary or seekable string RLE are favored over other encoding schemes because query execution can directly operate on the encoded data to boost performance in addition to the storage savings.
Dictionary encoding will not be picked when data uniqueness is larger than 70%.
Seekable RLE will not be chosen when the average run length is less than 1.1.
The choice between string dictionary encoding and string RLE is based on their compression ratios if they both pass Rules 2 and 3.
Seekable LZ4 will be considered with an additional 20% penalty, i.e. only when the LZ4 compression ratio leads to a compressed size less than the best compressed size by another encoding scheme minus 20% uncompressed size. This heuristic correction to make LZ4 less favorite is due to the consideration in Rule 1.
Seekable string encoding will be considered if all other encoding schemes cannot meet the requirements.

As mentioned earlier, some metadata is also collected and the data organization in files is structured to support efficient point-access of rows in the file.

SingleStore offers users the option to specify an encoding scheme for a column if they have a valid reason to override the storage engine’s internal choice. To specify an encoding scheme, use option <encoding> provided in the CREATE TABLE statement: For example:

SQL

CREATE TABLE t (a int, b varchar(50) option 'StringDictionary', Sort key(a));

Operating on Encoded String Data During Query Execution

Filtering directly on string data encoded with dictionary or RLE encoding is supported during query execution.

Encoded data can be directly processed to boost query performance with a filtering on an encoded column. For example, the filter T.a = xyz, on a string dictionary encoded column segment first finds the result of the filter for every entry in the dictionary. Then, while scanning the segment, the query execution system simply takes the encoded ID number for each value in T.a and uses it to look up the result of the string comparison for that ID that was computed in the initial scan of the dictionary.

That execution pattern tends to be much quicker than doing an actual string comparison. Other kinds of operations can also be done directly on encoded data ID values, including the comparison of ID values needed to do group-by operations.

The encoding for a blob depends on its type and data. The encoding may hinge both on the ordering and cardinality for numeric delta encoding and dictionary encoding, respectively. It impacts the execution speed, as well as enables SIMD operations on encoded data.

Table of String Data Encoding Schemes

Encoding Scheme	Description	Advantages and Disadvantages	Considerations for Usage
String Dictionary Encoding	An encoding scheme that uses dictionary encoding for strings. Supports seekable point-access.	Query execution can operate directly on the encoded data.	Not used when data uniqueness is larger than 70%.
Seekable String Run Length Encoding (RLE)	An encoding scheme based on String Run-Length Encoding. Supports seekable point-access.	Query execution can operate directly on the encoded data.	Not used when average run length is less than 1.1.
Seekable LZ4 Encoding	An encoding scheme based on LZ4 encoding. Supports seekable point-access.	The compression ratio is significantly better than the compression ratios of dictionary encoding and RLE encoding. Query execution cannot directly operate on the encoded data.	Considered only when the LZ4 compression leads to a compressed data size less than the best compressed size of any other encoding scheme minus 20% uncompressed size.
Seekable String Encoding	Encodes string values without compression.	Does not compress the data in size. Keeps the data seekable in cases where the data is not readily compressed.	Considered only when other encoding schemes do not meet the requirements.

Summary

SingleStore columnstore tables store string data in a number of different encoded forms. The encoding scheme decision is per segment, on-the-fly, dynamic, and adaptive. The decision process takes the compression ratio, the operationability of the encoded data during query execution, and heuristics into account. The algorithms are considered in the following order of preference: String Dictionary Encoding and Seekable String RLE, Seekable LZ4 Encoding, Seekable String Encoding. By enhancing the basic encoding schemes with seekable implementations, SingleStore gets the best outcome of storage savings and low query latency for point-access queries.

Encoding String Data in Columnstore Tables

On this page

String Data Encoding Schemes In Depth

String Dictionary Encoding

String Run Length Encoding (RLE)

LZ4 Encoding

Seekable String Encoding

View the Encoding Scheme in `information_schema.COLUMNAR_SEGMENTS`

Encoding Scheme Selection in the Columnstore Storage Engine

Operating on Encoded String Data During Query Execution

Table of String Data Encoding Schemes

Summary

Was this article helpful?

On this page

Was this article helpful?

Encoding String Data in Columnstore Tables

On this page

String Data Encoding Schemes In Depth

String Dictionary Encoding

String Run Length Encoding (RLE)

LZ4 Encoding

Seekable String Encoding

View the Encoding Scheme in information_schema.COLUMNAR_SEGMENTS

Encoding Scheme Selection in the Columnstore Storage Engine

Operating on Encoded String Data During Query Execution

Table of String Data Encoding Schemes

Summary

Was this article helpful?

On this page

Was this article helpful?

View the Encoding Scheme in `information_schema.COLUMNAR_SEGMENTS`