Columnstore Seekability using JSON

SingleStore exposes a JavaScript Object Notation (JSON) column type that implements the JSON standard.

The primary encoding for JSON columns in columnstore tables is SeekableJson. With this encoding, JSON data is automatically columnarized. Analytics on columns with the SeekableJson encoding are very efficient.

JSON columns may also be encoded with the seekable string encodings SeekableLZ4 and SeekableStringRunLength. These seekable string encodings compress data and save space, but do not support efficient search or extraction on JSON values. SingleStore recommends using SeekableLZ4 and SeekableStringRunLength only for columns on which JSON values will be extracted as a whole.

The non-seekable, columnarized format, JSONParquet, is provided for backwards compatibility.

The following table summarizes the properties of the encodings.

| Encoding | Seekable | Columnarized | Supports Efficient Search and Extraction | Default |
| --- | --- | --- | --- | --- |
| SeekableJson | Yes | Yes | Yes | Yes |
| SeekableLZ4 / SeekableStringRunLength | Yes | No | No | No |
| JSONParquet | No | Yes | Yes, but less efficient than SeekableJson. | No |

Choose an Encoding

SingleStore recommends using SeekableJson for any JSON columns on which analytics, search, or extraction will be done. That is, if you use functions like JSON_EXTRACT_<type>, the :: operators, or JSON_MATCH_ANY on the JSON column, use the SeekableJson encoding for good performance. SeekableJson is the default encoding.
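As a sketch, assuming a hypothetical table orders with a SeekableJson-encoded JSON column properties, queries like the following benefit from the columnarized encoding:

```sql
-- Typed extraction from a keypath (reads only the relevant leaf column)
SELECT JSON_EXTRACT_STRING(properties, 'customer', 'name') FROM orders;

-- The same extraction using the :: operator; ::$ converts the result to a string
SELECT properties::customer::$name FROM orders;
```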

SingleStore recommends using SeekableLZ4 (or SeekableStringRunLength) if you are concerned about memory usage and will not be searching on or extracting from the JSON column. Using the SeekableLZ4 (or SeekableStringRunLength) encoding may reduce memory usage but will hurt performance if you are extracting keys and values from the JSON.

Check which Encoding is Used

The information_schema.COLUMNAR_SEGMENTS view stores information about each segment of a columnstore table. Query the encoding column of this view to find out which encoding is used.

The following query returns the encoding and sizes of the segments for column <column_name> in table <table_name>.

SELECT column_name, segment_id, encoding, uncompressed_size, compressed_size
FROM information_schema.COLUMNAR_SEGMENTS
WHERE table_name = '<table_name>' AND column_name = '<column_name>';

SeekableJson

In the SeekableJson encoding, a schema is inferred from JSON keys and the data is split into columns by keypath (columnarized) and stored in an encoded Parquet-like format. The JSON data is stored as if you had created a schema with separate columns for every field. As a result, queries on JSON columns read only the parts of a JSON object that are relevant to a query and therefore have excellent performance.

The SeekableJson encoding format is inspired by Apache Parquet but does not produce a Parquet file.

SeekableJson is a seekable encoding. That is, during queries, the engine does not scan entire segments; instead, it searches only a subset of rows in each segment. This practice, called sub-segment seeking or row-level decoding, significantly improves performance.

When using the SeekableJson encoding for JSON columns in columnstore tables, the data is stored as follows:

  • Top-level JSON properties are stored columnwise and use the same encodings as other data types (i.e., numbers, strings, and dates).

  • Nested properties (inside sub-documents) are stored in sequential column format, using SingleStore's standard encoding.

  • Arrays are pivoted and stored in column format. For example, a numeric array that is a property of an enclosing JSON document would be stored as a logical column, with multiple values in the column belonging to the same row. The array length information is kept to enable SingleStore to identify what values belong to what row.

The engine variable use_seekable_json controls if the default encoding for JSON columns is SeekableJson or JSONParquet. When use_seekable_json is set to ON, the default encoding for JSON columns is SeekableJson.

Check if use_seekable_json is enabled:

SELECT @@use_seekable_json;
+---------------------+
| @@use_seekable_json |
+---------------------+
|                   1 |
+---------------------+

Limit Number of Inferred Keypaths

The engine variable json_document_max_leaves limits the number of JSON key paths inferred within a segment and can be used to tune schema inference. Refer to Columnstore JSON Internals for more details.
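For example, to inspect the current limit, or to raise it for documents with many distinct keypaths (the value below is illustrative, not a recommendation):

```sql
SELECT @@json_document_max_leaves;

SET GLOBAL json_document_max_leaves = 200;
```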

SeekableLZ4 and SeekableStringRunLength

JSON columns may also be encoded with the seekable string encodings SeekableLZ4 and SeekableStringRunLength.

These encodings store the JSON as a string. Search or extraction on columns stored with these encodings requires decoding and parsing the entire JSON value. Using these encodings may reduce memory use but significantly reduces performance for search and extraction on the JSON column.

SingleStore recommends using SeekableLZ4 and SeekableStringRunLength only for columns on which JSON values are extracted as a whole.

Encoding String Data in Columnstore Tables provides a general description of encoding schemes and encoding scheme selection for string (including JSON) columns. String encoding schemes are per-segment and are automatically selected with the goal of maximizing storage savings and lowering query latency.

To specify a string encoding for a JSON column, use a collation specification as follows:

CREATE TABLE json_lz4_table (json_lz4_col JSON COLLATE utf8_bin OPTION 'SeekableLZ4');
CREATE TABLE json_rle_table (json_rle_col JSON COLLATE utf8_bin OPTION 'SeekableStringRunLength');

JSON Parquet

Like SeekableJson, the JSONParquet format is columnarized and inspired by the Apache Parquet format but does not produce a Parquet file.

SingleStore recommends that you use SeekableJson instead of JSONParquet, as SeekableJson is more performant.

JSONParquet is a precursor to SeekableJson and is not a seekable format, meaning that entire segments must be extracted and decoded for them to be processed.

Apply SeekableJson Encoding to Existing Data

To change the encoding of a JSON column from JSONParquet to SeekableJson, use one of the following approaches.

Online Approach

To keep the table online when use_seekable_json is not set to ON, you can upgrade existing JSON columns to SeekableJson by setting use_seekable_json = ON and then running OPTIMIZE TABLE FULL. This approach provides a best-effort upgrade: as new segments are written, they use the SeekableJson encoding, while segments that do not need to be rewritten continue to use the JSONParquet encoding.
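As a sketch, with events standing in for a hypothetical table name, the online upgrade is:

```sql
-- New segments written after this point use SeekableJson
SET GLOBAL use_seekable_json = ON;

-- Best-effort rewrite: only segments that are actually rewritten
-- pick up the new encoding
OPTIMIZE TABLE events FULL;
```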

Offline Approach

If there is not a consistent improvement in seek performance with the Online Approach, query the information_schema.COLUMNAR_SEGMENTS view using the following query to see if the JSONParquet encoding is still used for many segments.

SELECT column_name, segment_id, encoding, uncompressed_size, compressed_size
FROM information_schema.COLUMNAR_SEGMENTS
WHERE table_name = '<table_name>' and column_name = '<column_name>';

If many segments are still using JSONParquet and it is acceptable to take the table offline, use the following procedure to rewrite all the segments in the new seekable format.

  1. Enable the use_seekable_json engine variable (or verify that it is already enabled): SET GLOBAL use_seekable_json = ON;

  2. Create a new table to copy the existing data to: CREATE TABLE <new_table> LIKE <old_table>;

  3. Insert the existing data from the old table to the new table: INSERT INTO <new_table> SELECT * FROM <old_table>;

  4. Run the optimize table statement with the full argument: OPTIMIZE TABLE <new_table> FULL;

  5. Rename the old table with a temporary name: ALTER TABLE <old_table> RENAME TO <tmp_old_table>;

  6. Rename the new table to the original (i.e., old) table name: ALTER TABLE <new_table> RENAME TO <old_table>;

  7. Drop the temporary table: DROP TABLE <tmp_old_table>;
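The steps above can be combined into a single script; events and events_new are hypothetical names standing in for <old_table> and <new_table>:

```sql
SET GLOBAL use_seekable_json = ON;

CREATE TABLE events_new LIKE events;
INSERT INTO events_new SELECT * FROM events;
OPTIMIZE TABLE events_new FULL;

ALTER TABLE events RENAME TO events_old;
ALTER TABLE events_new RENAME TO events;
DROP TABLE events_old;
```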

If the table needs to remain online but you need to ensure every segment is reorganized, update every row in the table gradually. When the merger is run after these updates, the merger reorganizes the segments and applies the seekable encodings.

To avoid modifying existing data in the table, add a new dummy column to the table and update it. The updates must be true updates; setting a value to itself is not considered an update.
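As a sketch of this approach, with hypothetical table name events, key column id, and dummy column rewrite_flag:

```sql
ALTER TABLE events ADD COLUMN rewrite_flag TINYINT DEFAULT 0;

-- A true update: the value actually changes, so the rows are rewritten.
-- Run in batches over key ranges to spread out the load.
UPDATE events SET rewrite_flag = 1 WHERE id BETWEEN 1 AND 1000000;
```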

Columnstore JSON Internals

This section provides details on the SeekableJson encoding.

A Parquet schema has a JSON-like syntax and a nested structure. The schema consists of a group of fields called a message. Each field has three attributes: repetition, type, and name. The type attribute is either a group or a primitive (e.g., int, float, boolean, or string). The repetition attribute can only be one of the following:

  • Required: exactly one occurrence

  • Optional: 0 or 1 occurrence

  • Repeated: 0 or more occurrences

The schema inference process works as follows:

  1. The inference loops through the list of JSON objects.

    1. The keypaths present in each object are merged into a schema tree object.

    2. If there is a type conflict, the node in the schema tree is marked as un-inferable.

    3. If any node in the schema tree contains more children than the value of the json_document_max_children engine variable, the node is marked as un-inferable.

    4. If a node has more children than the value of the json_document_sparse_children_check_threshold engine variable, and the average number of occurrences of its children relative to the parent is less than 1/<json_document_sparse_children_check_ratio>, the node is marked as un-inferable.

  2. Once the schema tree object is constructed, analyze the schema tree and prune it until the number of keypaths (distinct root-to-leaf paths) is less than the value of the json_document_max_leaves engine variable. Also, prune any node where the average number of occurrences of its children relative to the number of rows is less than 1/<json_document_absolute_sparse_key_check_ratio>.

The following examples use the engine variable settings and the JSON objects shown below:

  • json_document_max_children = 4

  • json_document_sparse_children_check_threshold = 3

  • json_document_sparse_children_check_ratio = 2

JSON Objects

{"a1": 1, "a2": {"d1": 1}, "a3": {"c1": 1}, "a4": {"b1": 1}}

{"a1": 2, "a2": 1, "a4": {"b2": 1}}

{"a1": 3, "a2": 1, "a4": {"b3": 1}}

{"a1": 4, "a2": 1, "a3": {"c2": 1}, "a4": {"b4": 1}}

{"a1": 5, "a3": {"c3": 1}, "a4": {"b5": 1}}

The first pass at merging the keypaths yields a schema tree containing a1, a2.d1, a3.c1, and a4.b1.

The second pass at merging the keypaths encounters a type mismatch on a2 (an object in the first row, a scalar in the second), so a2 is marked as un-inferable. The pass also adds a4.b2.

The third and fourth passes at merging add a4.b3 (third pass) and a3.c2 and a4.b4 (fourth pass).

In the final merge, the number of children for a4 (5) exceeds the limit (4) set by json_document_max_children, so a4 is marked as un-inferable. The number of children for a3 (3) reaches the threshold (3) set by json_document_sparse_children_check_threshold, and the average number of occurrences of the children of a3 (1) relative to the number of occurrences of a3 (3) is 1/3 < 1/2, so a3 is also marked as un-inferable.

After the merging process, the leaves of the schema tree are the inferred columns (also referred to as leaves, leaf columns, leaf paths, or keypaths).

In this example, the keypaths are a1, a2, a3, and a4.

Encoding Nested Columns

Parquet uses the Dremel encoding for nested columns with definition and repetition levels.

  • Definition levels specify how many optional fields in the path for the column are defined.

  • Repetition levels specify, for repeated fields, at which level in the path a repetition occurred.

  • SingleStore stores the leaf columns for the JSON schema along with their respective repetition and definition levels.

Each of these internal columns corresponds to a value, definition level, and repetition level column, which are encoded using standard SingleStore encodings (SeekableString, etc.).

In this example, the internal columns have the following values (the a1 values follow the a1 keys in the JSON objects above):

a1

| Value | Definition Level | Repetition Level |
| --- | --- | --- |
| 1 | 1 | 0 |
| 2 | 1 | 0 |
| 3 | 1 | 0 |
| 4 | 1 | 0 |
| 5 | 1 | 0 |

a2

| Value | Definition Level | Repetition Level |
| --- | --- | --- |
| '{"d1": 1}' | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| - | 0 | 0 |

a3

| Value | Definition Level | Repetition Level |
| --- | --- | --- |
| '{"c1": 1}' | 1 | 0 |
| - | 0 | 0 |
| - | 0 | 0 |
| '{"c2": 1}' | 1 | 0 |
| '{"c3": 1}' | 1 | 0 |

a4

| Value | Definition Level | Repetition Level |
| --- | --- | --- |
| '{"b1": 1}' | 1 | 0 |
| '{"b2": 1}' | 1 | 0 |
| '{"b3": 1}' | 1 | 0 |
| '{"b4": 1}' | 1 | 0 |
| '{"b5": 1}' | 1 | 0 |

Last modified: September 11, 2025
