Columnstore Seekability using JSON
On this page
SingleStore exposes a Javascript Object Notation (JSON) column type that implements the JSON standard.
The primary encoding for JSON columns in columnstore tables is SeekableJson.SeekableJson encoding are very efficient.
JSON columns may also be encoded with the seekable string encodings SeekableLZ4 and SeekableStringRunLength.SeekableLZ4 and SeekableStringRunLength only for columns on which JSON values will be extracted as a whole.
The non-seekable, columnarized format, JSONParquet, is provided for backwards compatibility.
The following table summarizes the properties of the encodings.
|
Encoding |
Seekable |
Columnarized |
Supports Efficient Search and Extraction |
Default |
|---|---|---|---|---|
|
|
Yes |
Yes |
Yes |
Yes |
|
|
Yes |
No |
No |
No |
|
|
No |
Yes |
Yes, but less efficient than |
No |
Choose an Encoding
SingleStore recommends using SeekableJson for any JSON columns on which analytics, search, or extraction will be done.JSON_, the :: operators, or JSON_ on the JSON column, use the SeekableJson encoding for good performance.SeekableJson is the default encoding.
SingleStore recommends using SeekableLZ4 (or SeekableStringRunLength) if you are concerned about memory usage and will not be searching on or extracting from the JSON column.SeekableLZ4 (or SeekableStringRunLength) encoding may reduce memory usage but will hurt performance if you are extracting keys and values from the JSON.
Check which Encoding is Used
The information_ view stores information about each segment of a columnstore table.encoding column of this view to find out which encoding is used.
The following query returns the encoding and sizes of the segments for column <column_; in table <table_.
SELECT column_name, segment_id, encoding, uncompressed_size, compressed_sizeFROM information_schema.COLUMNAR_SEGMENTSWHERE table_name = ‘<table_name>’ and column_name = '<column_name>';
SeekableJson
In the SeekableJson encoding, a schema is inferred from JSON keys and the data is split into columns by keypath (columnarized) and stored in an encoded Parquet-like format.JSON columns read only the parts of a JSON object that are relevant to a query and therefore have excellent performance.
The SeekableJson encoding format is inspired by Apache Parquet but does not produce a Parquet file.
SeekableJson is a seekable encoding.
When using the SeekableJson encoding for JSON columns in columnstore tables, the data is stored as follows:
-
Top-level JSON properties are stored columnwise and use the same encodings as other data types (i.
e. , numbers, strings, and dates). -
Nested properties (inside sub-documents) are stored in sequential column format, using SingleStore's standard encoding.
-
Arrays are pivoted and stored in column format.
For example, a numeric array that is a property of an enclosing JSON document would be stored as a logical column, with multiple values in the column belonging to the same row. The array length information is kept to enable SingleStore to identify what values belong to what row.
The engine variable use_ controls if the default encoding for JSON columns is SeekableJson or JSONParquet.use_ is set to ON, the default encoding for JSON columns is SeekableJson.
Check if use_ is enabled:
SELECT @@use_seekable_json;
+---------------------+
| @@use_seekable_json |
+---------------------+
| 1 |
+---------------------+Limit Number of Inferred Keypaths
The engine variable json_ limits the number of JSON key paths inferred within a segment and can be used to tune schema inference.
SeekableLZ4 and SeekableStringRunLength
JSON columns may also be encoded with the seekable string encodings SeekableLZ4 and SeekableStringRunLength.
These encodings store the JSON as a string.
SingleStore recommends using SeekableLZ4 and SeekableStringRunLength only for columns on which JSON values are extracted as a whole.
Encoding String Data in Columnstore Tables provides a general description of encoding schemes and encoding scheme selection for string (including JSON) columns.
To specify a string encoding for a JSON column, use a collation specification as follows:
CREATE TABLE json_table (json_lz4_col JSON COLLATE utf8_bin OPTION 'SeekableLZ4');CREATE TABLE json_table (json_rle_col JSON COLLATE utf8_bin OPTION 'SeekableStringRunLength');
JSON Parquet
Like SeekableJson, the JSONParquet format is columnarized and inspired by the Apache Parquet format but does not produce a Parquet file.
SingleStore recommends that you use SeekableJson instead of JSONParquet, as SeekableJson is more performant.
JSONParquet is a precursor to SeekableJson and is not a seekable format, meaning that entire segments must be extracted and decoded for them to be processed.
Apply SeekableJson Encoding to Existing Data
To change the encoding of a JSON column from ParquetJSON to SeekableJson encoding, use the following options.
Online Approach
To keep the table online when use_ is not set to ON, you can upgrade existing JSON columns to use SeekableJson by setting use_, and then running OPTIMIZE TABLE FULL.SeekableJson encoding.JSONParquet encoding.
Offline Approach
If there is not a consistent improvement in seek performance with the Online Approach, query the information_ view using the following query to see if the JSONParquet encoding is still used for many segments.
SELECT column_name, segment_id, encoding, uncompressed_size, compressed_sizeFROM information_schema.COLUMNAR_SEGMENTSWHERE table_name = '<table_name>' and column_name = '<column_name>';
If many segments are still using JSONParquet and it is acceptable to take the table offline; use the following procedure to rewrite all the segments in the new seekable format.
-
Enable or check if the
use_engine variable is enabled for your database:seekable_ json SET GLOBAL use_seekable_ json = ON; -
Create a new table to copy the existing data to:
CREATE TABLE <new_table> LIKE <old_ table>; -
Insert the existing data from the old table to the new table:
INSERT INTO <new_table> SELECT * FROM <old_ table>; -
Run the optimize table statement with the full argument:
OPTIMIZE TABLE <new_table> FULL; -
Rename the old table with a temporary name:
ALTER TABLE <old_table> RENAME TO <tmp_ old_ table>; -
Rename the new table to the original (i.
e. , old table name): ALTER TABLE <new_;table> TO <old_ table> -
Drop the temporary table:
DROP TABLE <tmp_old_ table>;
If the table needs to remain online but you need to ensure every segment is reorganized, update every row in the table gradually.
To avoid modifying data in the table, add a new dummy column to the table and update it.
Columnstore JSON Internals
This section provides details on the SeekableJson encoding.
A Parquet schema has a JSON-like syntax and a nested structure.
-
Required: exactly one occurrence
-
Optional: 0 or 1 occurrence
-
Repeated: 0 or more occurrences
The infer schema process works as follows:
-
The schema loops through the list of JSON objects.
-
The present keypaths are merged into a schema tree object.
-
If there is a type conflict, the node in the schema tree is marked as un-inferable.
-
If any node in the schema tree contains more children than indicated in the
json_engine variable, the node is marked as un-inferable.document_ max_ children -
If a node has a greater number of children than indicated in the engine variable
json_and the average occurrence of all children related to the parent is less than 1/ as indicated in the engine variabledocument_ sparse_ children_ check_ threshold json_, the node will be marked as un-inferable.document_ sparse_ children_ check_ ratio
-
-
Once the schema tree object is constructed, analyze the schema tree and prune the tree until the number of key paths (distinct root to leaf paths) is less than the setting for the engine variable
json_.document_ max_ leaves Also, prune any node where the average occurrence of all the children in relation to the number of rows is less than 1/ as indicated in the engine variable json_.document_ absolute_ sparse_ key_ check_ ratio
The examples will use the engine variables settings and the JSON object table shown below:
-
json_document_ max_ children = 4 -
json_document_ sparse_ children_ threshold = 3 -
json_document_ sparse_ children_ ratio = 2
|
JSON Objects |
|---|
|
{“a1”: 1, “a2”: {“d1”: 1}, “a3”: {“c1”: 1}, “a4”: {“b1”: 1}} |
|
{“a1”: 2, “a2”: 1, “a4”: {“b2”: 1}} |
|
{“a1”: 3, “a2”: 1, “a4”: {“b3”: 1}} |
|
{“a1”: 4, “a2”: 1, ”a3”: {“c2”: 1}, “a4”: {“b4”: 1}} |
|
{“a1”: 5,, “a3”: {“c3”: 1} “a4”: {“b5”: 1}} |
The first pass at merging the keypaths will yield:
The second pass at merging the keypaths contains a type mismatch on a2.
The results of the third and fourth passes at merging yield:
In the final merge, the number of children for a4 exceeds the limit (4) set for json_ so it is marked as un-inferable.a3 exceeds the limit (3) set for json_.
After the merging process, the schema tree with leaves are the inferred columns (also may be referred to as leaves, leaf columns, leaf paths, or key paths).
The keypaths will be a1, a2, a3, a4.
Encoding Nested Columns
Parquet uses the Dremel encoding for nested columns with definition and repetition levels.
-
Definition levels specify how many optional fields in the path for the column are defined.
-
Repetition levels specify the repeated value of the field in the path.
-
SingleStore stores the leaf columns for the JSON schema along with their respective repetition and definition levels.
Each of these internal columns will correspond to the value, definition level and repetition level columns which are encoded in SingleStore encodings (SeekableString, etc).
The example tables will have these values for the internal columns:
|
a1 | ||
|---|---|---|
|
Value |
Definition Level |
Repetition Level |
|
1 |
1 |
0 |
|
1 |
1 |
0 |
|
1 |
1 |
0 |
|
1 |
1 |
0 |
|
1 |
1 |
0 |
|
a2 | ||
|---|---|---|
|
Value |
Definition Level |
Repetition Level |
|
‘{“d1”: 1}’ |
1 |
0 |
|
1 |
1 |
0 |
|
1 |
1 |
0 |
|
1 |
1 |
0 |
|
- |
0 |
0 |
|
a3 | ||
|---|---|---|
|
Value |
Definition Level |
Repetition Level |
|
‘{“c1”: 1}’ |
1 |
0 |
|
- |
0 |
0 |
|
- |
0 |
0 |
|
‘{“c2”: 1}’ |
1 |
0 |
|
‘{“c3”: 1}’ |
1 |
0 |
|
a4 | ||
|---|---|---|
|
Value |
Definition Level |
Repetition Level |
|
‘{“b1”: 1}’ |
1 |
0 |
|
‘{“b2”: 1}’ |
1 |
0 |
|
‘{“b3”: 1}’ |
1 |
0 |
|
‘{“b4”: 1}’ |
1 |
0 |
|
‘{“b5”: 1}’ |
1 |
0 |
Related Topics
Last modified: September 16, 2025