Using JSON

About JSON in SingleStore

SingleStore exposes a Javascript Object Notation (JSON) column-type that implements the JSON standard. You can define columns in SingleStore tables as the JSON data type to efficiently store and query JSON values. A JSON value is a JavaScript object, which means it is represented as a series of name/value pairs that are written according to a structured format and sequence. For more information, see http://www.ietf.org/rfc/rfc4627.txt.

Managing Collections of Metadata

JSON is useful for managing a collection of diverse data, represented as name-value pairs, that might otherwise be cumbersome to refactor into a formalized key-value table, or that might be stored in a table that is sparsely populated. For example, suppose an organization had an asset management application using SingleStore to track all the information about its physical assets. The asset data is diverse—what’s relevant for a desk differs from what’s important for a server machine or a company car. All assets might have common attributes, such as asset tag ID, asset type, asset name, and description. Each type of asset might have unique attributes, such as size and weight dimensions, hostname and IP address, or gas mileage.

Instead of creating a highly granular table to manage all data as key-value pairs, this organization could simply create a SingleStore table using a JSON column to efficiently manage the unique attribute data. With this design:

  • Each asset gets a row in the table.

  • Attributes that are common to all assets have their own column in the table. These columns allow you to query on common features and quickly narrow down the final result set as much as possible (for example, filtering by asset type).

  • The various, remaining attributes associated with each asset are stored in a JSON column (which might be named something like property_bag). For example, the JSON column for an office desk asset could include JSON data such as size, weight, and number of drawers. The JSON data for a server machine could include rack location, number of cores, and MAC address.

DDL: Defining Tables with JSON Columns

Any SingleStore table can contain one or more columns of data type JSON. A JSON column can optionally be suffixed with NOT NULL.

Comparing JSON and LONGTEXT Columns

A JSON column is analogous to a LONGTEXT column in the following ways:

  • JSON columns can store arbitrarily large JSON values in a normalized text representation.

  • JSON columns have the same storage requirement, as if the JSON value were stored in a text column.

The key difference is that JSON data is stored in a normalized format, which makes many operations faster than if the data were stored manually in a text column. The following is an example of non-normalized data, which is valid JSON but is relatively difficult to parse:

'{ "b\u000a": 1,"a": 2 ,"a":3 } '

Normalized data, on the other hand, is easier to parse because duplicate keys are merged, keys are sorted by value, and extraneous whitespace is removed.

'{"a":3,"b\n":1}'

Defining JSON Columns

Defining a JSON column in a SingleStore table is as simple as specifying the JSON data type in the CREATE TABLE command:

CREATE TABLE assets (
tag_id BIGINT PRIMARY KEY,
name TEXT NOT NULL,
description TEXT,
properties JSON NOT NULL);

Indexing Data in JSON Columns

JSON columns are not indexed directly - they are indexed using computed columns. For the fastest performance, you should not use JSON built-ins or :: notation in your filters. Instead, create a computed column that includes the JSON column in the computation, and then use the computed column for the index. In this way, the index gets updated only when the relevant JSON data is updated in a row.

CREATE TABLE assets (
tag_id BIGINT PRIMARY KEY,
name TEXT NOT NULL,
description TEXT,
properties JSON NOT NULL,
weight AS properties::%weight PERSISTED DOUBLE,
license_plate AS properties::$license_plate PERSISTED LONGTEXT,
KEY(license_plate), KEY(weight));

JSON computed columns that are indexed will be utilized by the optimizer more efficiently. Queries that use indexed computed columns as filters or sorts, will perform faster by avoiding expression evaluation, and by seeking into or searching the indexes rather than scanning tables. In the following examples, we will use the :: notation in the filtering and sorting clauses to illustrate how the optimizer matches the computed columns.

EXPLAIN SELECT * FROM assets WHERE properties::$license_plate = "VGB116";
+------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                          |
+------------------------------------------------------------------------------------------------------------------+
| Gather partitions:all alias:remote_0                                                                             |
| Project [assets.tag_id, assets.name, assets.description, assets.properties, assets.weight, assets.license_plate] |
| ColumnStoreFilter [assets.license_plate = 'VGB116' index]                                                       |
| ColumnStoreScan test1.assets, KEY __UNORDERED () USING CLUSTERED COLUMNSTORE table_type:sharded_columnstore      |
+------------------------------------------------------------------------------------------------------------------+
4 rows in set (0.00 sec)
EXPLAIN SELECT * FROM assets ORDER BY properties::%weight;
+------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                          |
+------------------------------------------------------------------------------------------------------------------+
| GatherMerge [remote_0.weight] partitions:all alias:remote_0                                                      |
| Project [assets.tag_id, assets.name, assets.description, assets.properties, assets.weight, assets.license_plate] |
| Sort [assets.weight]                                                                                             |
| ColumnStoreScan test1.assets, KEY __UNORDERED () USING CLUSTERED COLUMNSTORE table_type:sharded_columnstore      |
+------------------------------------------------------------------------------------------------------------------+
4 rows in set (0.00 sec)

DML: Accessing Data in JSON Columns

This section describes how to insert and update data in a SingleStore table with one or more JSON columns.

Inserting Data into a JSON Column

When inserting a row in a table, specifying JSON data is straightforward. For example: Given a table t defined as CREATE TABLE test_table(col_a TEXT, col_b JSON);, you can insert a row into test_table as follows:

INSERT INTO test_table(col_a,col_b) VALUES ('hello','{"x":"goodbye","y":"goodnight"}');

Columnstore Tables Having JSON Columns with Null Values or Empty Arrays

By default, SingleStore preserves columnstore JSON NULL values and empty arrays. To change this behavior and disable this setting, set the preserve_original_colstore_json global variable to OFF. This flag preserves the original columnstore JSON for any new data that is loaded; it does not update the existing data. By default, this variable is set to AUTO (same as ON).

Behavior When preserve_original_colstore_json is Set to OFF

When you store a JSON column in a columnstore table (and preserve_original_colstore_json is set to OFF), then null values and empty arrays in the JSON object are handled as follows when the object is written to the columnstore:

  • Name/value pairs with the value NULL are normally removed from the JSON object.

  • Name/value pairs containing empty arrays are normally removed from the JSON object.

  • If the JSON object has only the value NULL or [], the value is replaced with NULL.

Example commands you can use to store data are INSERT, UPDATE, and LOAD DATA.

An example INSERT scenario follows. Consider a table that is defined as:

CREATE TABLE json_empty_values_table(a INT, b JSON, SORT KEY (a));

Insert five rows into the table:

INSERT INTO json_empty_values_table VALUES (1, '{"v":null}');
INSERT INTO json_empty_values_table VALUES (2, '{"w":[]}');
INSERT INTO json_empty_values_table VALUES (3, '{"x":"foo","y":null,"z":[]}');
INSERT INTO json_empty_values_table VALUES (4, 'null');
INSERT INTO json_empty_values_table VALUES (5, '[]');

Manually flush the inserted data to the columnstore:

OPTIMIZE TABLE json_empty_values_table FLUSH;

Query the table:

SELECT * FROM json_empty_values_table ORDER BY a;
+------+-----------------------------+
| a    | b                           |
+------+-----------------------------+
|    1 | {"v":null}                  |
|    2 | {"w":[]}                    |
|    3 | {"x":"foo","y":null,"z":[]} |
|    4 | null                        |
|    5 | []                          |
+------+-----------------------------+

Accessing Fields in a JSON Object

To access a field of a JSON object stored in a column, use the name of the column suffixed with ::keyName. For example, if column data contains {"bits":[true,false]}, then the expression data::bits evaluates to the JSON value [true,false].

Converting a JSON String Value into a SQL String

The :: operator, when applied to a string column of a JSON object, returns the value of the string enclosed in quotes. Use the ::$keyname operator to return the value of the string without the enclosing quotes.

For example, consider a table TestJSON that contains a JSON column data.

INSERT INTO TestJSON VALUES ('{"first":"hello"}');

Retrieve the value of the data column using :: and ::$ operators:

SELECT data::first, data::$first FROM TestJSON;
+-------------+--------------+
| data::first | data::$first |
+-------------+--------------+
| "hello"     | hello        |
+-------------+--------------+

In this example, the data JSON column contains two string values.

INSERT INTO TestJSON VALUES ('{"first":"hello", "second":"world"}');

To retrieve the concatenated SQL string, use the `::$` operator:

SELECT CONCAT(data::$first, ' ', data::$second) FROM TestJSON;
+------------------------------------------+
| CONCAT(data::$first, ' ', data::$second) |
+------------------------------------------+
| hello world                              |
+------------------------------------------+

Converting a JSON Number or Boolean Value into a SQL DOUBLE

To transparently convert a JSON number or Boolean value into a SQL DOUBLE, use the name of the column suffixed with ::%keyname. For example, if column data contains {"valid":true,"value":3.14}, then data::%valid is 1 and data::%value is 3.14.

Using the ::$ and ::% Operators

If the ::$ or ::% operator is used on a JSON field that is not of the appropriate type, the JSON result is quietly converted to string or double, following the rules described in JSON_EXTRACT_<type>. In fact, these access operators are simply convenient aliases for the JSON_EXTRACT_<type> built-in function, and they follow all the same rules.

Note that the :: syntax only produces the expected output when keys are given as literals like a, or 5. The :: syntax returns NULL for variables or expressions, hence use JSON_EXTRACT_<type> instead.

For example, the following SELECT statement returns the expected output if the value 2 is provided in the JSON field along with the :: operator:

SELECT json, json::a::`2` FROM (SELECT '{"a":[1,2,3,4]}' AS json) sub;
+-----------------+--------------+
| json            | json::a::`2` |
+-----------------+--------------+
| {"a":[1,2,3,4]} | 3            |
+-----------------+--------------+

While for an expression like 1+1, use the JSON_EXTRACT_<type> function:

SELECT json, json_extract_json(json, 'a', 1+1) FROM (SELECT '{"a":[1,2,3,4]}' AS json) sub;
+-----------------+-----------------------------------+
| json            | json_extract_json(json, 'a', 1+1) |
+-----------------+-----------------------------------+
| {"a":[1,2,3,4]} | 3                                 |
+-----------------+-----------------------------------+

Accessing Nested JSON Objects

To access nested JSON objects, chain the colon-colon operator to form a keypath. For example, data::address::street means the street field of the address field of the data column.

Note

If one of the keys in the keypath is not present in the nested object, then the entire colon-colon expression yields SQL NULL.

Working with Nested Arrays in a JSON Column

Consider a table defined as:

CREATE TABLE json_tab (`id` INT(11) DEFAULT NULL,`jsondata` JSON COLLATE utf8_bin);

Insert values as follows:

INSERT INTO json_tab VALUES
( 8765 ,' {"city":"SFO","sports_teams":[{"sport_name":"football","teams": [{"club_name":"Raiders"},{"club_name":"49ers"}]},
{"sport_name":"baseball","teams" : [{"club_name":"As"},{"club_name":"SF Giants"}]}]}') ;
INSERT INTO json_tab VALUES
( 9876,'{"city":"NY","sports_teams" : [{ "sport_name":"football","teams" : [{ "club_name":"Jets"},{"club_name":"Giants"}]},
{"sport_name":"baseball","teams" : [ {"club_name":"Mets"},{"club_name":"Yankees"}]},
{"sport_name":"basketball","teams" : [{"club_name":"Nets"},{"club_name":"Knicks"}]}]}');

Query the table:

WITH t AS(
SELECT id, jsondata::city city , table_col AS sports_clubs FROM json_tab JOIN TABLE(JSON_TO_ARRAY(jsondata::sports_teams))),
t1 AS(
SELECT t.id, t.city, t.sports_clubs::sport_name sport, table_col AS clubs FROM t JOIN TABLE(JSON_TO_ARRAY(t.sports_clubs::teams)))
SELECT t1.id, t1.city,t1.sport,t1.clubs::club_name club_name FROM t1;
+------+-------+--------------+-------------+
| id   | city  | sport        | club_name   |
+------+-------+--------------+-------------+
| 9876 | "NY"  | "football"   | "Jets"      |
| 9876 | "NY"  | "football"   | "Giants"    |
| 9876 | "NY"  | "baseball"   | "Mets"      |
| 9876 | "NY"  | "baseball"   | "Yankees"   |
| 9876 | "NY"  | "basketball" | "Nets"      |
| 9876 | "NY"  | "basketball" | "Knicks"    |
| 8765 | "SFO" | "football"   | "Raiders"   |
| 8765 | "SFO" | "football"   | "49ers"     |
| 8765 | "SFO" | "baseball"   | "As"        |
| 8765 | "SFO" | "baseball"   | "SF Giants" |
+------+-------+--------------+-------------+

You can also further filter the results by applying conditions. For example, to find the city with the "Yankees" club, run the following query:

WITH t AS
(SELECT id, jsondata::city city , table_col AS sports_clubs FROM json_tab JOIN TABLE(JSON_TO_ARRAY(jsondata::sports_teams))),
t1 AS
(SELECT t.id, t.city, t.sports_clubs::sport_name sport, table_col AS clubs FROM t JOIN TABLE(JSON_TO_ARRAY(t.sports_clubs::teams)))
SELECT t1.id, t1.city,t1.sport,t1.clubs::club_name club_name FROM t1 WHERE t1.clubs::$club_name = 'Yankees';
+------+------+------------+-----------+
| id   | city | sport      | club_name |
+------+------+------------+-----------+
| 9876 | "NY" | "baseball" | "Yankees" |
+------+------+------------+-----------+

Nested JSON Ingest

Working with a nested JSON ingest requires an additional escape as the value being inserted is interpreted as a SQL string first. Therefore, the backslash (\) symbol needs an escape in addition to the JSON escape (‘\’):

CREATE TABLE test_json(col_a json);
INSERT INTO test_json VALUES ('{"addParams": "{\\"Emp_Id\\":\\"1487\\",
\\"Emp_LastName\\":\\"Stephens\\",\\"Emp_FirstName\\":\\"Mark\\",\\"Dept\\":\\"Support\\"}"}');
SELECT * FROM test_json;
+------------------------------------------------------------------------------------+
| col_a                                                                              |
+------------------------------------------------------------------------------------+
| {"addParams":"{\"Emp_Id\":\"1487\",                                                |
|   \"Emp_LastName\":\"Stephens\",\"Emp_FirstName\":\"Mark\",\"Dept\":\"Support\"}"} |
+------------------------------------------------------------------------------------+  

Using Colon-Colon Notation in UPDATE Queries

You can use the colon-colon notation for JSON keypaths in UPDATE queries. For example, the following two UPDATE queries perform the same operation:

UPDATE users SET userdata::name::$first = 'Alex';
UPDATE users SET userdata = JSON_SET_STRING(userdata, 'name', 'first', 'Alex');

In fact, these access operators are simply convenient aliases for the JSON_SET_<type> built-in function (see JSON_SET_<type>), and they follow all the same rules.

Field Name Syntax for JSON Access and UPDATE Queries

In both the JSON access and update contexts, each field name must either:

  • follow the syntax for a valid SQL identifier, or

  • be escaped with backticks in the same manner as a SQL identifier

For example, the following two SELECT queries perform the same operation:

SELECT ticker_symbol FROM stocks WHERE statistics::%`P/E` > 1.5;
SELECT ticker_symbol FROM stocks WHERE JSON_EXTRACT_DOUBLE(statistics, 'P/E') > 1.5;

When using the JSON_EXTRACT_<type> form of the query (see JSON_EXTRACT_<type> ), there is no constraint on the contents of the keystring. A JSON object can contain any string as a key, even "", or a string with "embedded\u0000nulls". For more information on extracting JSON data, see JSON LOAD DATA.

JSON Performance

The following example will demonstrate how to improve JSON seekability by utilizing sub-segments.

Begin by turning off or disabling the variable:

SET GLOBAL use_seekable_json = OFF;

Create a database and change to the newly created database..

CREATE DATABASE orders;
USE orders;

Create a table with id as an integer and set as the primary key

CREATE TABLE orders(id INT PRIMARY KEY not null, order_doc json, sort KEY(id));

Create a stored procedure to generate documents until it has reached a specific number nearest to the power of two > = n.

DELIMITER //
CREATE or REPLACE PROCEDURE gen_orders(n INT) AS
DECLARE
  i INT;
  c INT;
  m INT;
BEGIN
  DELETE FROM orders;
  INSERT orders VALUES (1,
     '{ "id" : 1,
        "time" : "2022-11-17 23:03:54",
        "lineitems" : [{ "line_num" : 1,
                         "SKU" : 688968,
                         "description" : "Apple iPhone 14 Plus",
                         "qty" : 1},
                        { "line_num" : 2,
                         "SKU" : 6514052,
                         "description" : "Energizer - MAX AA Batteries (8 Pack), Double A Alkaline Batteries",
                         "qty" : 3},
                        { "line_num" : 3,
                         "SKU" : 6457697,
                         "description" : "Star Wars - Galactic Snackin Grogu",
                         "qty" : 1}] }'
        );
  i = 1;
  WHILE i < n LOOP
    SELECT MAX(id) INTO m FROM orders;
    INSERT INTO orders
    SELECT id + m, new_order_doc(order_doc, id + m)
    FROM orders;
    SELECT COUNT(*) INTO c FROM orders;
    i += c;
 END LOOP;
END
//
DELIMITER ;

Create a function that will change the input order_doc into a new one with random data, and a new id.

DELIMITER //
CREATE or REPLACE FUNCTION new_order_doc(order_doc json, new_id INT) RETURNS json
AS
DECLARE
  j json;
  lineitems json;
  l1 json;
  l2 json;
  l3 json;
BEGIN
  j = json_set_double(order_doc, "id", new_id);
  j = json_set_string(j, "time", now());
  lineitems = json_extract_json(j, "lineitems");
  l1 = json_extract_json(lineitems, 0);
  l2 = json_extract_json(lineitems, 1);
  l3 = json_extract_json(lineitems, 2);
  lineitems = json_array_push_json('[]', mutate_lineitem(l1));
  lineitems = json_array_push_json(lineitems, mutate_lineitem(l2));
  lineitems = json_array_push_json(lineitems, mutate_lineitem(l3));
  j = json_set_json(j, "lineitems", lineitems);
  RETURN j;
END
//
DELIMITER ;

Create a function to use the mutate_lineitem.

DELIMITER //
CREATE OR REPLACE FUNCTION mutate_lineitem(item json) RETURNS json
AS
DECLARE
  j json;
BEGIN
  j = json_set_double(item, "SKU", floor(rand()*10*1000*1000));
  j = json_set_string(j, "description", uuid());
  j = json_set_double(j, "qty", ceil(3*rand()));
  RETURN j;
END
//
DELIMITER ;

Generate eight million rows of test data.

CALL gen_orders(8*1024*1024);

Create a stored procedure that will measure the average lookup time.

DELIMITER //
CREATE OR REPLACE PROCEDURE get_avg_time(n INT) AS
DECLARE
  _id INT;
  _od json;
  _oid INT;
  m INT;
  st DATETIME(6);
  et DATETIME(6);
BEGIN
  SELECT MAX(id) INTO m FROM orders;
  st = now(6);
  FOR i in 1..n LOOP
    _oid = CEILING(m*RAND());
    SELECT id, order_doc
    INTO _id, _od
    FROM orders
    WHERE id = _oid;
  END LOOP;
  et = now(6);
  echo SELECT (timestampdiff(MICROSECOND, st, et)/1000000.0)/n AS avg_time;
END
//
DELIMITER ;

Before obtaining the average query time make sure the table has been optimized and all the data is in columnstore format.

OPTIMIZE TABLE orders FULL;

Obtain average query time.

CALL get_avg_time(100);
+------------+
| avg_time   |
+------------+
| 1.73598285 |
+------------+
1 row in set (2 min 53.61 sec)

To measure the new query time, enable to variable.

SET GLOBAL use_seekable_json = ON

Create a second table using the same format as the first table.

CREATE TABLE orders2 LIKE orders;

Use the same data from the first table and make sure to optimize the new table.

INSERT INTO orders2 SELECT * FROM orders;
OPTIMIZE TABLE orders2 FULL;

Update the stored procedure get_avg_time by changing orders to orders2 in both places.

DELIMITER //
CREATE OR REPLACE PROCEDURE get_avg_time(n INT) AS
DECLARE
  _id INT;
  _od json;
  _oid INT;
  m INT;
  st DATETIME(6);
  et DATETIME(6);
BEGIN
  SELECT MAX(id) INTO m FROM orders2;
  st = now(6);
  FOR i in 1..n LOOP
    _oid = CEILING(m*RAND());
    SELECT id, order_doc
    INTO _id, _od
    FROM orders2
    WHERE id = _oid;
  END LOOP;
  et = now(6);
  echo SELECT (timestampdiff(MICROSECOND, st, et)/1000000.0)/n AS avg_time;
END
//
DELIMITER ;

Run the get_avg_time stored procedure again with the changed table name.

CALL get_avg_time(100);
+------------+
| avg_time   |
+------------+
| 0.03216373 |
+------------+
1 row in set (3.47 sec)

Implementation Considerations

This section describes some unique behaviors in SingleStore’s implementation of the JSON standard (RFC 4627).

Infinities and NaNs

SingleStore does not attempt to support entities such as the nan (not a number) entity, as in {"value":nan}. Although the JavaScript language supports nan and positive and negative infinities, neither the JSON standard nor SQL’s DOUBLE type provide any way to represent such non-finite values. If your application requires these special values, you might try using null, 0.0, or strings (such as "nan" and "inf") in their place.

Unicode Support

JSON strings can contain any of the 65536 Unicode characters in the Basic Multilingual Plane, either directly as themselves or escaped with JSON’s \uABCD notation. JSON normalization converts all such ASCII escape sequences into UTF-8 byte sequences, except for those characters that must remain \u-encoded to keep the string valid JSON.

Warning

Characters outside of the Basic Multilingual Plans in JSON strings are not supported with escaped notation and may result in incorrect results.

In SingleStore, a backslash (\) literal in a string must be escaped with a backslash. Therefore, pattern strings containing backslashes will have two backslash characters (\\).

Here is an example of how to use escaped notation for characters in the Basic Multilingual Plane:

SELECT '{"a":"\\u00F9"}' :> JSON;
+---------------------------+
| '{"a":"\\u00F9"}' :> JSON |
+---------------------------+
| {"a":"ù"}                 |
+---------------------------+

Like MySQL, SingleStore supports characters outside the Basic Multilingual Plane (characters whose codepoints are in the range 0x10000 to 0x10FFFF). This includes some uncommon Chinese characters and symbols such as emojis.

You must have the character_set_server engine variable set to utf8mb4 for these characters to work. Refer to List of Engine Variables for more information.

Character Encodings and Collation

Collation provides a set of rules to use when comparing characters in a character set. By default, JSON columns use the utf8_bin collation, which is a case-sensitive collation that sorts by Unicode codepoint value. The collation of a JSON column affects the following:

  • Output of SELECT DISTINCT, ORDER BY, and other queries that compare entire values.

  • Identification of duplicate keys inside a single JSON object during normalization. For example, whether the string {"Schlüssel":1,"Schluessel":2} is normalized to {"Schluessel":2}.

  • Sorting of keys inside a single JSON object. For example, whether the string {"Schlüssel":1,"Schluff":2} is normalized to {"Schluff":2,"Schlüssel":1}.

The default collation generally provides the desired behavior. However, you can override the default at the table or column level.

Note

SingleStore recommends that you use the utf8_bin collation for all JSON columns.

Table-level Override

In the following example, for the users table, both userdata and lastname use the table collation (which has been specified as utf8_unicode_ci).

CREATE TABLE users (
uid INT AUTO_INCREMENT PRIMARY KEY,
userdata JSON,
lastname AS data::name::$last PERSISTED TEXT) COLLATE=utf8_unicode_ci;

Column-level Override

In the following example, for the orders table:

  • The data column uses utf8_unicode_ci.

  • The productdetails column uses utf8_bin.

  • Both the productname column and the comments column use utf8_general_ci, which is SingleStore’s default database collation

CREATE TABLE orders (
oid INT AUTO_INCREMENT PRIMARY KEY,
data JSON COLLATE utf8_unicode_ci,
productname AS data::product::$name PERSISTED TEXT,
productdetails AS data::product::$details PERSISTED TEXT COLLATE utf8_bin,
comments VARCHAR(256));

Expression-level Override

The :> operator allows you to specify a collation for different expressions in a query.

:> text collate <colation_name>

The following example uses a binary collation (utf8_bin) for the first expression and a case-insensitive collation (utf8_general_ci) for the second expression.

SELECT *
FROM sets
WHERE sets.json_field::$x :> text collate utf8_bin = "string1"
AND sets.json_field::$y :> text collate utf8_general_ci = "string2";

Maximum JSON Value Size

Under the hood, JSON data is stored in LONGBLOB columns. While the DDL will allow you to specify a length of up to 4GB, there is an internal limit applied when assigning a value to a string or JSON field that caps the maximum size of a single value to max_allowed_packet. This is 100MB by default and can be set to up to 1GB.

Parquet Encoding for JSON

A Parquet schema has a JSON-like syntax and a nested structure. The schema consists of a group of fields called a message. Each field has three attributes: repetition, type, and name. The type attribute is either a group or a primitive (e.g., int, float, boolean, or string). The repetition attribute can only be one of the following:

  • Required: exactly one occurrence

  • Optional : 0 or 1 occurrence

  • Repeated: 0 or more occurrences

The infer schema process works as follows:

  1. The schema loops through the list of JSON objects.

    1. The present keypaths are merged into a schema tree object.

    2. If there is a type conflict, the node in the schema tree is marked as un-inferable.

    3. If any node in the schema tree contains more children than indicated in the json_document_max_children engine variable, the node is marked as un-inferable.

    4. If a node has a greater number of children than indicated in the engine variable json_document_sparse_children_check_threshold and the average occurrence of all children related to the parent is less than 1/ as indicated in the engine variable json_document_sparse_children_check_ratio, the node will be marked as un-inferable.

  2. Once the schema tree object is constructed, analyze the schema tree and prune the tree until the number of key paths (distinct root to leaf paths) is less than the setting for the engine variable json_document_max_leaves. Also, prune any node where the average occurrence of all the children in relation to the number of rows is less than 1/ as indicated in the engine variable json_document_absolute_sparse_key_check_ratio.

The examples will use the engine variables settings and the JSON object table shown below:

  • json_document_max_children = 4

  • json_document_sparse_children_threshold = 3

  • json_document_sparse_children_ratio = 2

JSON Objects

{“a1”: 1, “a2”: {“d1”: 1}, “a3”: {“c1”: 1}, “a4”: {“b1”: 1}}

{“a1”: 2, “a2”: 1, “a4”: {“b2”: 1}}

{“a1”: 3, “a2”: 1, “a4”: {“b3”: 1}}

{“a1”: 4, “a2”: 1, ”a3”: {“c2”: 1}, “a4”: {“b4”: 1}}

{“a1”: 5,, “a3”: {“c3”: 1} “a4”: {“b5”: 1}}

The first pass at merging the keypaths will yield:

The second pass at merging the keypaths contains a type mismatch on a2. The results would be:

The results of the third and fourth passes at merging yield:

In the final merge, the number of children for a4 exceeds the limit (4) set for json_document_max_children so it is marked as un-inferable. The number of children for a3 exceeds the limit (3) set for json_document_sparse_children_threshold. The average occurrences of children of a3 (1) relative to the number of occurrences of a3 (3) is calculated as ⅓ < ½, so a3 would also be un-inferable.

After the merging process, the schema tree with leaves are the inferred columns (also may be referred to as leaves, leaf columns, leaf paths, or key paths).

The keypaths will be  a1, a2, a3, a4.

Encoding Nested Columns

Parquet uses the Dremel encoding for nested columns with definition and repetition levels.

  • Definition levels specify how many optional fields in the path for the column are defined.

  • Repetition levels specify the repeated value of the field in the path.

  • SingleStore stores the leaf columns for the JSON schema along with their respective repetition and definition levels.

Each of these internal columns will correspond to the value, definition level and repetition level columns which are encoded in SingleStore encodings (SeekableString, etc).

The example tables will have these values for the internal columns:

a1

Value

Definition Level

Repetition Level

1

1

0

1

1

0

1

1

0

1

1

0

1

1

0

a2

Value

Definition Level

Repetition Level

‘{“d1”: 1}’

1

0

1

1

0

1

1

0

1

1

0

-

0

0

a3

Value

Definition Level

Repetition Level

‘{“c1”: 1}’

1

0

-

0

0

-

0

0

‘{“c2”: 1}’

1

0

‘{“c3”: 1}’

1

0

a4

Value

Definition Level

Repetition Level

‘{“b1”: 1}’

1

0

‘{“b2”: 1}’

1

0

‘{“b3”: 1}’

1

0

‘{“b4”: 1}’

1

0

‘{“b5”: 1}’

1

0

Performance Impact on Parquet Encoding for JSON

When using seekable JSON/JSON Parquet encoding a major impact on performance is if a schema is dense or sparse.

A node in a tree is considered to be dense if it occurs in most JSON rows. A schema is said to be dense if most of the nodes in the entire schema are dense. Otherwise, the schema is considered to be sparse.

Consider the following JSON data table:

JSON Data (dense)

{"a":1, "b":1}

{“a”:2, “c”:{“d”: 1}}

{“c”:{“d”:2, “e”:3}}

The schema will infer the JSON rows as follows:

Using the image above the leaves of the tree become internal columns in the JSON encoding. This is an example of a dense schema.

a

b

c::d

c::e

1

1

NULL

NULL

2

NULL

1

NULL

NULL

NULL

2

3

Using the following JSON data table:

JSON Data (sparse)

{“a”: 1}

{“b”: 1}

{“c”: 1}

{“d”: 1}

{“e”: 1}

The JSON rows will be encoded as follows:

a

b

c

d

e

1

NULL

NULL

NULL

NULL

NULL

1

NULL

NULL

NULL

NULL

NULL

1

NULL

NULL

NULL

NULL

NULL

1

NULL

NULL

NULL

NULL

NULL

1

The preceding table represents a sparse schema caused by the NULLs not being part of the original JSON strings. This results in poor performance since the NULLs will need to be counted which increases the execution time and memory usage.

To prevent an overly sparse schema from being inferred, SingleStore uses a method where if the average of a key’s children is too low (<1%) in relation to the key itself, the key is stored as a string without inferring its children.

Last modified: September 24, 2024

Was this article helpful?