# Columnstore Seekability using JSON

SingleStore exposes a Javascript Object Notation (JSON) (An open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays.) column type that implements the [JSON standard](http://http://www.ietf.org/rfc/rfc4627.txt).&#x20;

The primary encoding for `JSON` columns in columnstore tables is `SeekableJson`. With this encoding, JSON data is automatically columnarized. Analytics on columns with the `SeekableJson` encoding are very efficient.

`JSON` columns may also be encoded with the [seekable string encodings](https://docs.singlestore.com/cloud/reference/sql-reference/data-types/encoding-string-data-in-columnstore-tables.md) `SeekableLZ4` and `SeekableStringRunLength`. These seekable string encodings compress data and save space, but do not support efficient search or extraction on JSON values. SingleStore recommends using `SeekableLZ4` and `SeekableStringRunLength` only for columns on which JSON values will be extracted as a whole.

The non-seekable, columnarized format, `JSONParquet`, is provided for backwards compatibility.

The following table summarizes the properties of the encodings.

| Encoding                               | Seekable | Columnarized | Supports Efficient Search and Extraction    | Default |
| -------------------------------------- | -------- | ------------ | ------------------------------------------- | ------- |
| `SeekableJson`                         | Yes      | Yes          | Yes                                         | Yes     |
| `SeekableLZ4``SeekableStringRunLength` | Yes      | No           | No                                          | No      |
| `JSONParquet`                          | No       | Yes          | Yes, but less efficient than`SeekableJson`. | No      |

## Choose an Encoding

SingleStore recommends using `SeekableJson` for any `JSON` columns on which analytics, search, or extraction will be done. That is, if you use functions like `JSON_EXTRACT_<type>`, the `::` operators, or `JSON_MATCH_ANY` on the `JSON` column, use the `SeekableJson` encoding for good performance. `SeekableJson` is the default encoding.

SingleStore recommends using `SeekableLZ4` (or `SeekableStringRunLength`) if you are concerned about memory usage and will not be searching on or extracting from the `JSON` column. Using the `SeekableLZ4` (or `SeekableStringRunLength`) encoding may reduce memory usage but will hurt performance if you are extracting keys and values from the JSON.

## Check which Encoding is Used

The `information_schema.COLUMNAR_SEGMENTS` view stores information about each segment of a columnstore table. Query the `encoding` column of this view to find out which encoding is used.

The following query returns the encoding and sizes of the segments for column `<column_name>`; in table `<table_name>`.

```sql
SELECT column_name, segment_id, encoding, uncompressed_size, compressed_size
FROM information_schema.COLUMNAR_SEGMENTS
WHERE table_name = ‘<table_name>’ and column_name = '<column_name>';
```

## SeekableJson

In the `SeekableJson` encoding, a schema is inferred from JSON keys and the data is split into columns by keypath (columnarized) and stored in an encoded Parquet-like format. The JSON data is stored as if you had created a schema with separate columns for every field. As a result, queries on `JSON` columns read only the parts of a JSON object that are relevant to a query and therefore have excellent performance.

The `SeekableJson` encoding format is inspired by Apache Parquet but does not produce a Parquet file.

`SeekableJson` is a seekable encoding. That is, during queries, the engine does not search entire segments, instead only a subset of rows in each segment is searched. This practice, called sub-segment seeking or row-level decoding, significantly improves performance.

When using the `SeekableJson` encoding for JSON columns in columnstore tables, the data is stored as follows:

* Top-level JSON properties are stored columnwise and use the same encodings as other data types (i.e., numbers, strings, and dates).
* Nested properties (inside sub-documents) are stored in sequential column format, using SingleStore's standard encoding.
* Arrays are pivoted and stored in column format. For example, a numeric array that is a property of an enclosing JSON document would be stored as a logical column, with multiple values in the column belonging to the same row. The array length information is kept to enable SingleStore to identify what values belong to what row.

The engine variable `use_seekable_json` controls if the default encoding for `JSON` columns is `SeekableJson` or `JSONParquet`. When `use_seekable_json` is set to `ON`, the default encoding for `JSON` columns is `SeekableJson`.

Check if `use_seekable_json` is enabled:

```sql
SELECT @@use_seekable_json;

```

```output

+---------------------+
| @@use_seekable_json |
+---------------------+
|                   1 |
+---------------------+
```

## Limit Number of Inferred Keypaths

The engine variable `json_document_max_leaves` limits the number of JSON key paths inferred within a segment and can be used to tune schema inference. Refer to [Columnstore JSON Internals](https://docs.singlestore.com/#section-idm413515222379322.md) for more details.

## SeekableLZ4 and SeekableStringRunLength

`JSON` columns may also be encoded with the seekable string encodings `SeekableLZ4` and `SeekableStringRunLength`.

These encodings store the JSON as a string. Search or extraction on columns stored with these encoding decodes and parses the entire JSON value. Using these encodings may reduce memory usage and improve ingestion time, but they may significantly reduce performance for search and extraction on the JSON column, for example, when extracting keys and values from JSON data.

SingleStore recommends using `SeekableLZ4` and `SeekableStringRunLength` only for columns on which JSON values are extracted as a whole.

[Encoding String Data in Columnstore Tables](https://docs.singlestore.com/cloud/reference/sql-reference/data-types/encoding-string-data-in-columnstore-tables.md)  provides a general description of encoding schemes and encoding scheme selection for string (including JSON) columns. String encoding schemes are per-segment and are automatically selected with the goal of maximizing storage savings and lowering query latency.

To specify a string encoding for a `JSON` column, use a collation specification as follows:

```sql
CREATE TABLE json_table (json_lz4_col JSON COLLATE utf8_bin OPTION 'SeekableLZ4');

CREATE TABLE json_table (json_rle_col JSON COLLATE utf8_bin OPTION 'SeekableStringRunLength');
```

## JSON Parquet

Like `SeekableJson`, the `JSONParquet` format is columnarized and inspired by the Apache Parquet format but does not produce a Parquet file.

SingleStore recommends that you use `SeekableJson` instead of `JSONParquet`, as `SeekableJson` is more performant.

`JSONParquet` is a precursor to `SeekableJson` and is not a seekable format, meaning that entire segments must be extracted and decoded for them to be processed.

## Apply SeekableJson Encoding to Existing Data

To change the encoding of a `JSON` column from `ParquetJSON` to `SeekableJson` encoding, use the following options.

## Online Approach

To keep the table online when `use_seekable_json` is not set to `ON`, you can upgrade existing `JSON` columns to use `SeekableJson` by setting `use_seekable_json = ON`, and then running `OPTIMIZE TABLE FULL`. This approach provides a best effort upgrade, when new segments are written, the new segments have the `SeekableJson` encoding. Segments that do not need to be re-written continue to use `JSONParquet` encoding.

## Offline Approach

If there is not a consistent improvement in seek performance with the Online Approach, query the `information_schema.COLUMNAR_SEGMENTS` view using the following query to see if the `JSONParquet` encoding is still used for many segments.

```sql
SELECT column_name, segment_id, encoding, uncompressed_size, compressed_size
FROM information_schema.COLUMNAR_SEGMENTS
WHERE table_name = '<table_name>' and column_name = '<column_name>';
```

If many segments are still using `JSONParquet` and it is acceptable to take the table offline; use the following procedure to rewrite all the segments in the new seekable format.

1. Enable or check if the `use_seekable_json` engine variable is enabled for your database: `SET GLOBAL use_seekable_json = ON;`

2. Create a new table to copy the existing data to: `CREATE TABLE <new_table> LIKE <old_table>;`

3. Insert the existing data from the old table to the new table: `INSERT INTO <new_table> SELECT * FROM <old_table>;`

4. Run the optimize table statement with the full argument: `OPTIMIZE TABLE <new_table> FULL;`

5. Rename the old table with a temporary name: `ALTER TABLE <old_table> RENAME TO <tmp_old_table>;`

6. Rename the new table to the original (i.e., old table name): `ALTER TABLE <new_table> TO <old_table>`;

7. Drop the temporary table: `DROP TABLE <tmp_old_table>;`

If the table needs to remain online but you need to ensure every segment is reorganized, update every row in the table gradually. When the merger is run after these updates, the merger reorganizes the segments and applies the seekable encodings.

To avoid modifying data in the table, add a new dummy column to the table and update it. The updates must be true updates; setting a value to itself is considered an update.

## Columnstore JSON Internals

This section provides details on the `SeekableJson` encoding.

A Parquet schema has a JSON-like syntax and a nested structure. The schema consists of a group of fields called a message. Each field has three attributes: repetition, type, and name. The type attribute is either a group or a primitive (e.g., int, float, boolean, or string). The repetition attribute can only be one of the following:

* Required: exactly one occurrence
* Optional: 0 or 1 occurrence
* Repeated: 0 or more occurrences

The infer schema process works as follows:

1. The schema loops through the list of JSON objects.

   1. The present keypaths are merged into a schema tree object.

   2. If there is a type conflict, the node in the schema tree is marked as un-inferable.

   3. If any node in the schema tree contains more children than indicated in the `json_document_max_children` engine variable, the node is marked as un-inferable.

   4. If a node has a greater number of children than indicated in the engine variable `json_document_sparse_children_check_threshold` and the average occurrence of all children related to the parent is less than 1/ as indicated in the engine variable `json_document_sparse_children_check_ratio`, the node will be marked as un-inferable.

2. Once the schema tree object is constructed, analyze the schema tree and prune the tree until the number of key paths (distinct root to leaf paths) is less than the setting for the engine variable `json_document_max_leaves`. Also, prune any node where the average occurrence of all the children in relation to the number of rows is less than 1/ as indicated in the engine variable `json_document_absolute_sparse_key_check_ratio`.

The examples will use the engine variables settings and the JSON object table shown below:

* `json_document_max_children = 4`
* `json_document_sparse_children_threshold = 3`
* `json_document_sparse_children_ratio = 2`

| JSON Objects                                                 |
| ------------------------------------------------------------ |
| {“a1”: 1, “a2”: {“d1”: 1}, “a3”: {“c1”: 1}, “a4”: {“b1”: 1}} |
| {“a1”: 2, “a2”: 1, “a4”: {“b2”: 1}}                          |
| {“a1”: 3, “a2”: 1, “a4”: {“b3”: 1}}                          |
| {“a1”: 4, “a2”: 1, ”a3”: {“c2”: 1}, “a4”: {“b4”: 1}}         |
| {“a1”: 5,, “a3”: {“c3”: 1} “a4”: {“b5”: 1}}                  |

The first pass at merging the keypaths will yield:

![](https://images.contentstack.io/v3/assets/bltac01ee6daa3a1e14/blte8aa18724a22bced/6a2fe903fbb715046d1e7bde/diagram_first-pass-lLgPpK.png)

The second pass at merging the keypaths contains a type mismatch on a2. The results would be:

![](https://images.contentstack.io/v3/assets/bltac01ee6daa3a1e14/blt7ae24845148dc71f/6a2fe95d18d64feb4e6a85db/diagram_second-pass-YVSa0Y.png)

The results of the third and fourth passes at merging yield:

![](https://images.contentstack.io/v3/assets/bltac01ee6daa3a1e14/blt7eb1655cd208468d/6a2fe8c339f2373bc828a28a/diagram_last-pass-I40897.png)

In the final merge, the number of children for a4 exceeds the limit (4) set for `json_document_max_children` so it is marked as un-inferable. The number of children for `a3` exceeds the limit (3) set for `json_document_sparse_children_threshold`. The average occurrences of children of a3 (1) relative to the number of occurrences of a3 (3) is calculated as **⅓ < ½**, so a3 would also be un-inferable.

After the merging process, the schema tree with leaves are the inferred columns (also may be referred to as leaves, leaf columns, leaf paths, or key paths).

The keypaths will be  a1, a2, a3, a4.

## Encoding Nested Columns

Parquet uses the `Dremel encoding` for nested columns with definition and repetition levels.

* Definition levels specify how many optional fields in the path for the column are defined.
* Repetition levels specify the repeated value of the field in the path.
* SingleStore stores the leaf columns for the JSON schema along with their respective repetition and definition levels.

Each of these internal columns will correspond to the value, definition level and repetition level columns which are encoded in SingleStore encodings (SeekableString, etc).

The example tables will have these values for the internal columns:

| a1    |                  |                  |
| ----- | ---------------- | ---------------- |
| Value | Definition Level | Repetition Level |
| 1     | 1                | 0                |
| 1     | 1                | 0                |
| 1     | 1                | 0                |
| 1     | 1                | 0                |
| 1     | 1                | 0                |

| a2          |                  |                  |
| ----------- | ---------------- | ---------------- |
| Value       | Definition Level | Repetition Level |
| ‘{“d1”: 1}’ | 1                | 0                |
| 1           | 1                | 0                |
| 1           | 1                | 0                |
| 1           | 1                | 0                |
| -           | 0                | 0                |

| a3          |                  |                  |
| ----------- | ---------------- | ---------------- |
| Value       | Definition Level | Repetition Level |
| ‘{“c1”: 1}’ | 1                | 0                |
| -           | 0                | 0                |
| -           | 0                | 0                |
| ‘{“c2”: 1}’ | 1                | 0                |
| ‘{“c3”: 1}’ | 1                | 0                |

| a4          |                  |                  |
| ----------- | ---------------- | ---------------- |
| Value       | Definition Level | Repetition Level |
| ‘{“b1”: 1}’ | 1                | 0                |
| ‘{“b2”: 1}’ | 1                | 0                |
| ‘{“b3”: 1}’ | 1                | 0                |
| ‘{“b4”: 1}’ | 1                | 0                |
| ‘{“b5”: 1}’ | 1                | 0                |

## Related Topics

* [OPTIMIZE TABLE](https://docs.singlestore.com/cloud/reference/sql-reference/data-definition-language-ddl/optimize-table/#UUID-95582480-6981-8db3-0ad8-c559a8cfe64e.md)
* [Using JSON](https://docs.singlestore.com/cloud/create-a-database/using-json/#section-idm4634652054806433297983139685.md)

***

Modified at: September 16, 2025

Source: [/cloud/create-a-database/columnstore/columnstore-seekability-using-json/](https://docs.singlestore.com/cloud/create-a-database/columnstore/columnstore-seekability-using-json/)

(An index of the documentation is available at /llms.txt)
