Columnstore Seekability using JSON

In SingleStore versions before 8.0, seeking into JSON data columns meant searching the entire segment which could be up to one million rows. Beginning in version 8.0, sub-segment seeking is available and enabled by default. Instead of searching a large chunk of data, a subset of rows with the same features will be grouped together. This results in much better performance.

When using the seekable JSON encoding for columnstore tables, data is stored as follows:

  • Top-level JSON properties are stored column-wise and use the same encodings as other data types (i.e., numbers, strings, and dates).

  • Nested properties (inside sub-documents) are stored in sequential column format, using Singlestore’s standard encoding.

  • Arrays are pivoted and stored in column format. For example, a numeric array that is a property of an enclosing document would be stored as a logical column, with multiple values in the column belonging to the same row. The array length information is kept to enable SingleStore to identify what values belong to what row.

  • Columns that do not regularly appear in the documents (e.g., less than 1 in 10,000) are stored in a catch-all area.

A (segment_id, row_number) pair functions as a row ID for a row in a partition of a columnstore table. Given this pair, you can directly seek into the specified segment to find the data for row_number. It is not necessary to scan the whole segment to find the data for the row.

Seekability has been improved for the following string encodings:

  • LZ4 (SeekableLZ4)

  • Run_length encoding (RLE)

The option SeekableLZ4 on JSON columns improves overall resource consumption for string data types for universal storage for LZ4 and run-length encoded (RLE) data. The option can be applied in a table against a column as follows:

column_name JSON COLLATE utf8_bin NOT NULL option 'SeekableLZ4'

Using JSON Row-Level Decoding

Row-Level Decoding is enabled via the use_seekable_json engine variable. By default in version 8.0, use_seekable_json is set to ON (enabled). In pre-8.0 versions the engine variable use_seekable_json (and therefore the Row-Level Decoding feature) is not available. To check if the engine variable is enabled:

SELECT @@use_seekable_json;
+---------------------+
| @@use_seekable_json |
+---------------------+
|                   1 |
+---------------------+

Applying Columnstore Seekability to Existing Data

The easiest way, which almost always works and keeps the table data online while being done, is to change the setting to use_seekable_json = ON, then run OPTIMIZE TABLE FULL.

If there is no consistent improvement in seek performance; and the table can be taken offline to be reorganized; this procedure is guaranteed to rewrite all the segments in the new, seekable format.

To apply this behavior to existing tables, do the following:

  1. Enable or check if the the engine variable is enabled for your database: SET GLOBAL use_seekable_json = ON;

  2. Create a new table to copy the existing data to: CREATE TABLE <new_table> LIKE <old_table>;

  3. Insert the existing data from the old table to the new table: INSERT INTO <new_table> LIKE <old_table>;

  4. Run the optimize table statement with the full argument: OPTIMIZE TABLE <new_table> FULL;

  5. Rename the old table with a temporary name: ALTER TABLE <old_table> RENAME TO <tmp_old_table>;

  6. Rename the new table to the original (i.e., old table name): ALTER TABLE <new_table> TO <old_table>;

  7. Drop the temporary table: DROP TABLE <tmp_old_table>;

If the table needs to remain online but the guarantee that every segment is reorganized is needed, update every row in the table gradually, batch by batch. The background merger will regorganize the segments.

To avoid touching the other data, add a new dummy column to the table and update it. The updates must be true updates; setting a value to itself will not be considered an update.

Last modified: October 9, 2024

Was this article helpful?