Using JSON
About JSON in SingleStore Helios
SingleStore Helios exposes a JavaScript Object Notation (JSON) column type that implements the JSON standard.
Managing Collections of Metadata
JSON is useful for managing a collection of diverse data, represented as name-value pairs, that might otherwise be cumbersome to refactor into a formalized key-value table, or that might be stored in a table that is sparsely populated.
For example, consider an organization that tracks assets with widely varying attributes. Instead of creating a highly granular table to manage all data as key-value pairs, the organization could create a SingleStore Helios table with a JSON column to efficiently manage the unique attribute data:
- Each asset gets a row in the table.
- Attributes that are common to all assets have their own column in the table. These columns allow you to query on common features and quickly narrow down the final result set as much as possible (for example, filtering by asset type).
- The remaining attributes associated with each asset are stored in a JSON column (which might be named something like property_bag). For example, the JSON column for an office desk asset could include JSON data such as size, weight, and number of drawers. The JSON data for a server machine could include rack location, number of cores, and MAC address.
DDL: Defining Tables with JSON Columns
Any SingleStore Helios table can contain one or more columns of data type JSON. A JSON column can also be declared NOT NULL.
Comparing JSON and LONGTEXT Columns
A JSON column is analogous to a LONGTEXT column in the following ways:
- JSON columns can store arbitrarily large JSON values in a normalized text representation.
- JSON columns have the same storage requirement as if the JSON value were stored in a text column.
The key difference is that JSON data is stored in a normalized format, which makes many operations faster than if the data were stored manually in a text column.
A non-normalized value can contain duplicate keys, unsorted keys, and extraneous whitespace, for example:
'{ "b\u000a": 1,"a": 2 ,"a":3 } '
Normalized data, on the other hand, is easier to parse because duplicate keys are merged, keys are sorted, and extraneous whitespace is removed:
'{"a":3,"b\n":1}'
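One way to observe normalization is to cast a string literal to JSON with the :> operator, which this page also uses in the Unicode Support section. The following is a minimal sketch, not an authoritative test. The alias normalized is an illustrative name, and the backslash is doubled because the literal is parsed as a SQL string first.

```sql
-- Cast a non-normalized string to JSON to see its normalized form.
-- The doubled backslash yields the JSON escape \u000a after SQL string parsing.
SELECT '{ "b\\u000a": 1,"a": 2 ,"a":3 } ' :> JSON AS normalized;
```

Based on the description above, the result would be expected to match the normalized value {"a":3,"b\n":1}.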
Defining JSON Columns
Defining a JSON column in a SingleStore Helios table is as simple as specifying the JSON data type in the CREATE TABLE command:
CREATE TABLE assets (
  tag_id BIGINT PRIMARY KEY,
  name TEXT NOT NULL,
  description TEXT,
  properties JSON NOT NULL
);
JSON Collation
The collation for a JSON column can be set explicitly using a COLLATE clause. When no COLLATE clause is given, the collation is determined by the json_ engine variable.
Refer to Sync Variables Lists, Character Set and Collation Override, Unicode Support, and Character Encodings and Collation for more information.
Indexing Data in JSON Columns
JSON columns are not indexed directly; they are indexed through computed columns. For the best performance, define a persisted computed column over the JSON field and index that column, rather than relying only on the :: notation in your filters.
CREATE TABLE assets (
  tag_id BIGINT PRIMARY KEY,
  name TEXT NOT NULL,
  description TEXT,
  properties JSON NOT NULL,
  weight AS properties::%weight PERSISTED DOUBLE,
  license_plate AS properties::$license_plate PERSISTED LONGTEXT,
  KEY(license_plate),
  KEY(weight)
);
JSON computed columns that are indexed are used more efficiently by the optimizer. The following queries use the :: notation in the filtering and sorting clauses to illustrate how the optimizer matches the computed columns.
EXPLAIN SELECT * FROM assets WHERE properties::$license_plate = "VGB116";
+------------------------------------------------------------------------------------------------------------------+
| EXPLAIN |
+------------------------------------------------------------------------------------------------------------------+
| Gather partitions:all alias:remote_0 |
| Project [assets.tag_id, assets.name, assets.description, assets.properties, assets.weight, assets.license_plate] |
| ColumnStoreFilter [assets.license_plate = 'VGB116' index] |
| ColumnStoreScan test1.assets, KEY __UNORDERED () USING CLUSTERED COLUMNSTORE table_type:sharded_columnstore |
+------------------------------------------------------------------------------------------------------------------+
4 rows in set (0.00 sec)
EXPLAIN SELECT * FROM assets ORDER BY properties::%weight;
+------------------------------------------------------------------------------------------------------------------+
| EXPLAIN |
+------------------------------------------------------------------------------------------------------------------+
| GatherMerge [remote_0.weight] partitions:all alias:remote_0 |
| Project [assets.tag_id, assets.name, assets.description, assets.properties, assets.weight, assets.license_plate] |
| Sort [assets.weight] |
| ColumnStoreScan test1.assets, KEY __UNORDERED () USING CLUSTERED COLUMNSTORE table_type:sharded_columnstore |
+------------------------------------------------------------------------------------------------------------------+
4 rows in set (0.00 sec)
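The same lookups can also be written against the computed columns themselves. This sketch assumes the assets table defined above, with its license_plate and weight computed columns; no expected output is shown because it depends on the data loaded.

```sql
-- Filter on the persisted computed column directly; it holds the same value
-- as properties::$license_plate in the table definition above.
SELECT tag_id, name
FROM assets
WHERE license_plate = 'VGB116';

-- Sort on the computed DOUBLE column instead of properties::%weight.
SELECT tag_id, name, weight
FROM assets
ORDER BY weight;
```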
DML: Accessing Data in JSON Columns
This section describes how to insert and update data in a SingleStore Helios table with one or more JSON columns.
Inserting Data into a JSON Column
When inserting a row into a table, specify the JSON value as a string literal. For example, in a table test_table with a string column col_a and a JSON column col_b, you can insert a row as follows:
INSERT INTO test_table(col_a,col_b) VALUES ('hello','{"x":"goodbye","y":"goodnight"}');
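For a self-contained run, the table has to exist first. The definition below is a hypothetical sketch: the column names come from the INSERT statement above, but the column types (VARCHAR(20) and JSON) are assumptions.

```sql
-- Hypothetical definition; the column types are assumptions.
CREATE TABLE test_table (
  col_a VARCHAR(20),
  col_b JSON
);

INSERT INTO test_table (col_a, col_b)
VALUES ('hello', '{"x":"goodbye","y":"goodnight"}');

-- Read back one field of the stored JSON value as a SQL string.
SELECT col_a, col_b::$x AS x_value
FROM test_table;
```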
Columnstore Tables Having JSON Columns with Null Values or Empty Arrays
By default, SingleStore Helios preserves JSON NULL values and empty arrays in columnstore tables. To disable this behavior, set the preserve_original_colstore_json global variable to OFF. The default value is AUTO (same as ON).
Behavior When preserve_original_colstore_json is Set to OFF
When you store a JSON column in a columnstore table (and preserve_original_colstore_json is set to OFF), null values and empty arrays in the JSON object are handled as follows when the object is written to the columnstore:
- Name/value pairs with the value NULL are normally removed from the JSON object.
- Name/value pairs containing empty arrays are normally removed from the JSON object.
- If the JSON object has only the value NULL or [], the value is replaced with NULL.
Example commands you can use to store data are INSERT, UPDATE, and LOAD DATA.
An example INSERT scenario follows.
CREATE TABLE json_empty_values_table(a INT, b JSON, SORT KEY (a));
Insert five rows into the table:
INSERT INTO json_empty_values_table VALUES (1, '{"v":null}');
INSERT INTO json_empty_values_table VALUES (2, '{"w":[]}');
INSERT INTO json_empty_values_table VALUES (3, '{"x":"foo","y":null,"z":[]}');
INSERT INTO json_empty_values_table VALUES (4, 'null');
INSERT INTO json_empty_values_table VALUES (5, '[]');
Manually flush the inserted data to the columnstore:
OPTIMIZE TABLE json_empty_values_table FLUSH;
Query the table:
SELECT * FROM json_empty_values_table ORDER BY a;
+------+-----------------------------+
| a | b |
+------+-----------------------------+
| 1 | {"v":null} |
| 2 | {"w":[]} |
| 3 | {"x":"foo","y":null,"z":[]} |
| 4 | null |
| 5 | [] |
+------+-----------------------------+
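The output above retains the null values and empty arrays. To compare against the OFF behavior, a sketch along the following lines could be used. It assumes the session is allowed to change global variables, and the table name json_empty_values_off is hypothetical. No output is shown, since it has not been verified here.

```sql
-- Assumption: the session has permission to change global variables.
SET GLOBAL preserve_original_colstore_json = OFF;

-- Hypothetical second table, so the rows inserted earlier are not affected.
CREATE TABLE json_empty_values_off (a INT, b JSON, SORT KEY (a));
INSERT INTO json_empty_values_off VALUES (3, '{"x":"foo","y":null,"z":[]}');
OPTIMIZE TABLE json_empty_values_off FLUSH;

-- Under the rules listed above, the "y" and "z" pairs would be expected
-- to be removed once the row is written to the columnstore.
SELECT * FROM json_empty_values_off ORDER BY a;

-- Restore the default setting.
SET GLOBAL preserve_original_colstore_json = AUTO;
```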
Accessing Fields in a JSON Object
To access a field of a JSON object stored in a column, use the name of the column suffixed with ::keyName. For example, if the data column contains {"bits":[true,false]}, then the expression data::bits evaluates to the JSON value [true,false].
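The field access described above can be tried without creating a table. This sketch uses an inline string literal in place of a JSON column, the same pattern used by a later example on this page. The aliases data, bits, and sub are illustrative.

```sql
-- Inline JSON literal standing in for a JSON column named data.
SELECT data::bits AS bits
FROM (SELECT '{"bits":[true,false]}' AS data) sub;
```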
Converting a JSON String Value into a SQL String
The :: operator, when applied to a string field of a JSON object, returns the value of the string enclosed in quotes. Use the ::$keyname operator to return the value of the string without the enclosing quotes.
For example, consider a table TestJSON that contains a JSON column data.
INSERT INTO TestJSON VALUES ('{"first":"hello"}');
Retrieve the value of the data column using the :: and ::$ operators:
SELECT data::first, data::$first FROM TestJSON;
+-------------+--------------+
| data::first | data::$first |
+-------------+--------------+
| "hello" | hello |
+-------------+--------------+
In this example, the data JSON column contains two string values.
INSERT INTO TestJSON VALUES ('{"first":"hello", "second":"world"}');
To retrieve the concatenated SQL string, use the `::$` operator:
SELECT CONCAT(data::$first, ' ', data::$second) FROM TestJSON;
+------------------------------------------+
| CONCAT(data::$first, ' ', data::$second) |
+------------------------------------------+
| hello world |
+------------------------------------------+
Converting a JSON Number or Boolean Value into a SQL DOUBLE
To transparently convert a JSON number or Boolean value into a SQL DOUBLE, use the name of the column suffixed with ::%keyname. For example, if the data column contains {"valid":true,"value":3. , then data::%valid is 1 and data::%value is 3.
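A minimal sketch of the ::% operator follows. The value 2.5 is a hypothetical number chosen for illustration and does not come from the text above; the aliases are likewise illustrative.

```sql
-- ::% converts a JSON Boolean or number into a SQL DOUBLE.
-- Per the text above, a JSON true is returned as 1.
SELECT data::%valid AS valid_flag,
       data::%value AS value_num
FROM (SELECT '{"valid":true,"value":2.5}' AS data) sub;
```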
Using the ::$ and ::% Operators
If the ::$ or ::% operator is used on a JSON field that is not of the appropriate type, the JSON result is quietly converted to a string or double. These operators are aliases for the JSON_EXTRACT_ built-in functions (for example, JSON_EXTRACT_DOUBLE), and they follow all the same rules.
Note that the :: syntax only produces the expected output when keys are given as literals such as a or 5. The :: syntax returns NULL for variables or expressions, so use the JSON_EXTRACT_ functions in those cases.
For example, the following SELECT statement returns the expected output because the key 2 is given as a literal with the :: operator:
SELECT json, json::a::`2` FROM (SELECT '{"a":[1,2,3,4]}' AS json) sub;
+-----------------+--------------+
| json | json::a::`2` |
+-----------------+--------------+
| {"a":[1,2,3,4]} | 3 |
+-----------------+--------------+
For an expression such as 1+1, use the JSON_EXTRACT_JSON function instead:
SELECT json, json_extract_json(json, 'a', 1+1) FROM (SELECT '{"a":[1,2,3,4]}' AS json) sub;
+-----------------+-----------------------------------+
| json | json_extract_json(json, 'a', 1+1) |
+-----------------+-----------------------------------+
| {"a":[1,2,3,4]} | 3 |
+-----------------+-----------------------------------+
Accessing Nested JSON Objects
To access nested JSON objects, chain the colon-colon operator to form a keypath. For example, data::address::street means the street field of the address field of the data column.
Note
If one of the keys in the keypath is not present in the nested object, then the entire colon-colon expression yields SQL NULL.
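The following sketch shows a keypath that resolves and one that does not. The JSON document and the key zip are hypothetical values chosen for illustration.

```sql
SELECT
  data::address::$street AS street,  -- nested string value, returned without quotes
  data::address::zip     AS zip      -- key not present, so SQL NULL per the note above
FROM (SELECT '{"address":{"street":"1 Main St"}}' AS data) sub;
```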
Working with Nested Arrays in a JSON Column
Consider a table defined as:
CREATE TABLE json_tab (`id` INT(11) DEFAULT NULL,`jsondata` JSON COLLATE utf8_bin);
Insert values as follows:
INSERT INTO json_tab VALUES (8765, ' {"city":"SFO","sports_teams":[{"sport_name":"football","teams": [{"club_name":"Raiders"},{"club_name":"49ers"}]},{"sport_name":"baseball","teams" : [{"club_name":"As"},{"club_name":"SF Giants"}]}]}');
INSERT INTO json_tab VALUES (9876, '{"city":"NY","sports_teams" : [{ "sport_name":"football","teams" : [{ "club_name":"Jets"},{"club_name":"Giants"}]},{"sport_name":"baseball","teams" : [ {"club_name":"Mets"},{"club_name":"Yankees"}]},{"sport_name":"basketball","teams" : [{"club_name":"Nets"},{"club_name":"Knicks"}]}]}');
Query the table:
WITH t AS (
  SELECT id, jsondata::city city, table_col AS sports_clubs
  FROM json_tab JOIN TABLE(JSON_TO_ARRAY(jsondata::sports_teams))
),
t1 AS (
  SELECT t.id, t.city, t.sports_clubs::sport_name sport, table_col AS clubs
  FROM t JOIN TABLE(JSON_TO_ARRAY(t.sports_clubs::teams))
)
SELECT t1.id, t1.city, t1.sport, t1.clubs::club_name club_name
FROM t1;
+------+-------+--------------+-------------+
| id | city | sport | club_name |
+------+-------+--------------+-------------+
| 9876 | "NY" | "football" | "Jets" |
| 9876 | "NY" | "football" | "Giants" |
| 9876 | "NY" | "baseball" | "Mets" |
| 9876 | "NY" | "baseball" | "Yankees" |
| 9876 | "NY" | "basketball" | "Nets" |
| 9876 | "NY" | "basketball" | "Knicks" |
| 8765 | "SFO" | "football" | "Raiders" |
| 8765 | "SFO" | "football" | "49ers" |
| 8765 | "SFO" | "baseball" | "As" |
| 8765 | "SFO" | "baseball" | "SF Giants" |
+------+-------+--------------+-------------+
You can also further filter the results by applying conditions.
WITH t AS (
  SELECT id, jsondata::city city, table_col AS sports_clubs
  FROM json_tab JOIN TABLE(JSON_TO_ARRAY(jsondata::sports_teams))
),
t1 AS (
  SELECT t.id, t.city, t.sports_clubs::sport_name sport, table_col AS clubs
  FROM t JOIN TABLE(JSON_TO_ARRAY(t.sports_clubs::teams))
)
SELECT t1.id, t1.city, t1.sport, t1.clubs::club_name club_name
FROM t1
WHERE t1.clubs::$club_name = 'Yankees';
+------+------+------------+-----------+
| id | city | sport | club_name |
+------+------+------------+-----------+
| 9876 | "NY" | "baseball" | "Yankees" |
+------+------+------------+-----------+
Nested JSON Ingest
Working with a nested JSON ingest requires an additional escape as the value being inserted is interpreted as a SQL string first.
CREATE TABLE test_json(col_a json);
INSERT INTO test_json VALUES ('{"addParams": "{\\"Emp_Id\\":\\"1487\\",\\"Emp_LastName\\":\\"Stephens\\",\\"Emp_FirstName\\":\\"Mark\\",\\"Dept\\":\\"Support\\"}"}');
SELECT * FROM test_json;
+------------------------------------------------------------------------------------+
| col_a |
+------------------------------------------------------------------------------------+
| {"addParams":"{\"Emp_Id\":\"1487\", |
| \"Emp_LastName\":\"Stephens\",\"Emp_FirstName\":\"Mark\",\"Dept\":\"Support\"}"} |
+------------------------------------------------------------------------------------+
Using Colon-Colon Notation in UPDATE Queries
You can use the colon-colon notation for JSON keypaths in UPDATE queries. For example, the following two UPDATE queries perform the same operation:
UPDATE users SET userdata::name::$first = 'Alex';
UPDATE users SET userdata = JSON_SET_STRING(userdata, 'name', 'first', 'Alex');
In fact, in this context the access operators are simply convenient aliases for the JSON_SET_ built-in functions, such as JSON_SET_STRING.
Field Name Syntax for JSON Access and UPDATE Queries
In both the JSON access and update contexts, each field name must either:
- follow the syntax for a valid SQL identifier, or
- be escaped with backticks in the same manner as a SQL identifier.
For example, the following two SELECT queries perform the same operation:
SELECT ticker_symbol FROM stocks WHERE statistics::%`P/E` > 1.5;
SELECT ticker_symbol FROM stocks WHERE JSON_EXTRACT_DOUBLE(statistics, 'P/E') > 1.5;
When using the JSON_EXTRACT_ form of the query, the key is passed as an ordinary string, so any string can be used as a key, even the empty string "" or a string with "embedded\u0000nulls".
JSON Performance
The following example demonstrates how to improve JSON seekability by utilizing sub-segments.
Begin by turning off the use_seekable_json engine variable:
SET GLOBAL use_seekable_json = OFF;
Create a database and switch to it.
CREATE DATABASE orders;
USE orders;
Create a table with an integer id column as the primary key:
CREATE TABLE orders(id INT PRIMARY KEY not null, order_doc json, sort KEY(id));
Create a stored procedure that generates order documents by repeatedly doubling the rows in the table until the row count reaches a power of two close to n.
DELIMITER //
CREATE OR REPLACE PROCEDURE gen_orders(n INT) AS
DECLARE
  i INT;
  c INT;
  m INT;
BEGIN
  DELETE FROM orders;
  -- Seed the table with a single order document.
  INSERT orders VALUES (1,'{ "id" : 1,"time" : "2022-11-17 23:03:54","lineitems" : [{ "line_num" : 1,"SKU" : 688968,"description" : "Apple iPhone 14 Plus","qty" : 1},{ "line_num" : 2,"SKU" : 6514052,"description" : "Energizer - MAX AA Batteries (8 Pack), Double A Alkaline Batteries","qty" : 3},{ "line_num" : 3,"SKU" : 6457697,"description" : "Star Wars - Galactic Snackin Grogu","qty" : 1}] }');
  i = 1;
  WHILE i < n LOOP
    SELECT MAX(id) INTO m FROM orders;
    -- Double the table: copy every row with a new id and a mutated document.
    INSERT INTO orders
    SELECT id + m, new_order_doc(order_doc, id + m)
    FROM orders;
    SELECT COUNT(*) INTO c FROM orders;
    i += c;
  END LOOP;
END //
DELIMITER ;
Create a function that modifies the input order_doc by giving it a new id, the current time, and mutated line items.
DELIMITER //
CREATE OR REPLACE FUNCTION new_order_doc(order_doc json, new_id INT) RETURNS json AS
DECLARE
  j json;
  lineitems json;
  l1 json;
  l2 json;
  l3 json;
BEGIN
  j = json_set_double(order_doc, "id", new_id);
  j = json_set_string(j, "time", now());
  lineitems = json_extract_json(j, "lineitems");
  l1 = json_extract_json(lineitems, 0);
  l2 = json_extract_json(lineitems, 1);
  l3 = json_extract_json(lineitems, 2);
  -- Rebuild the lineitems array from mutated copies of the three items.
  lineitems = json_array_push_json('[]', mutate_lineitem(l1));
  lineitems = json_array_push_json(lineitems, mutate_lineitem(l2));
  lineitems = json_array_push_json(lineitems, mutate_lineitem(l3));
  j = json_set_json(j, "lineitems", lineitems);
  RETURN j;
END //
DELIMITER ;
Create the mutate_lineitem function, which randomizes the SKU, description, and quantity of a line item.
DELIMITER //
CREATE OR REPLACE FUNCTION mutate_lineitem(item json) RETURNS json AS
DECLARE
  j json;
BEGIN
  j = json_set_double(item, "SKU", floor(rand()*10*1000*1000));
  j = json_set_string(j, "description", uuid());
  j = json_set_double(j, "qty", ceil(3*rand()));
  RETURN j;
END //
DELIMITER ;
Generate eight million rows of test data.
CALL gen_orders(8*1024*1024);
Create a stored procedure that will measure the average lookup time.
DELIMITER //
CREATE OR REPLACE PROCEDURE get_avg_time(n INT) AS
DECLARE
  _id INT;
  _od json;
  _oid INT;
  m INT;
  st DATETIME(6);
  et DATETIME(6);
BEGIN
  SELECT MAX(id) INTO m FROM orders;
  st = now(6);
  -- Look up n randomly chosen rows by primary key.
  FOR i in 1..n LOOP
    _oid = CEILING(m*RAND());
    SELECT id, order_doc
    INTO _id, _od
    FROM orders
    WHERE id = _oid;
  END LOOP;
  et = now(6);
  -- Report the average time per lookup, in seconds.
  echo SELECT (timestampdiff(MICROSECOND, st, et)/1000000.0)/n AS avg_time;
END //
DELIMITER ;
Before obtaining the average query time make sure the table has been optimized and all the data is in columnstore format.
OPTIMIZE TABLE orders FULL;
Obtain average query time.
CALL get_avg_time(100);
+------------+
| avg_time |
+------------+
| 1.73598285 |
+------------+
1 row in set (2 min 53.61 sec)
To measure the new query time, enable the variable.
SET GLOBAL use_seekable_json = ON;
Create a second table using the same format as the first table.
CREATE TABLE orders2 LIKE orders;
Use the same data from the first table and make sure to optimize the new table.
INSERT INTO orders2 SELECT * FROM orders;
OPTIMIZE TABLE orders2 FULL;
Update the stored procedure get_avg_time by changing orders to orders2 in both places.
DELIMITER //
CREATE OR REPLACE PROCEDURE get_avg_time(n INT) AS
DECLARE
  _id INT;
  _od json;
  _oid INT;
  m INT;
  st DATETIME(6);
  et DATETIME(6);
BEGIN
  SELECT MAX(id) INTO m FROM orders2;
  st = now(6);
  FOR i in 1..n LOOP
    _oid = CEILING(m*RAND());
    SELECT id, order_doc
    INTO _id, _od
    FROM orders2
    WHERE id = _oid;
  END LOOP;
  et = now(6);
  echo SELECT (timestampdiff(MICROSECOND, st, et)/1000000.0)/n AS avg_time;
END //
DELIMITER ;
Run the get_avg_time stored procedure again with the changed table name.
CALL get_avg_time(100);
+------------+
| avg_time |
+------------+
| 0.03216373 |
+------------+
1 row in set (3.47 sec)
JSON Array Performance
Aggregations over array objects using the JOIN TABLE(JSON_TO_ARRAY()) operation are optimized in SingleStore to run efficiently on large batches of rows in parallel.
The following example uses the TPC-H data.
ALTER TABLE orders ADD COLUMN lineitems_json JSON;
CREATE TABLE new_table (
  l_orderkey bigint(11) NOT NULL,
  lineitems JSON
);
INSERT INTO new_table
SELECT l_orderkey,
  JSON_AGG(JSON_BUILD_OBJECT(
    'l_partkey', l_partkey,
    'l_suppkey', l_suppkey,
    'l_linenumber', l_linenumber,
    'l_quantity', l_quantity,
    'l_extendedprice', l_extendedprice,
    'l_discount', l_discount,
    'l_tax', l_tax,
    'l_returnflag', l_returnflag,
    'l_linestatus', l_linestatus,
    'l_shipdate', l_shipdate,
    'l_commitdate', l_commitdate,
    'l_receiptdate', l_receiptdate,
    'l_shipinstruct', l_shipinstruct,
    'l_shipmode', l_shipmode,
    'l_comment', l_comment)) AS lineitems
FROM lineitem
GROUP BY l_orderkey;
UPDATE orders o
JOIN new_table t ON o.o_orderkey = t.l_orderkey
SET o.lineitems_json = t.lineitems;
Here are a few sample query structures that benefit from the optimization.
- The following query calculates the total number of line items for each shipping mode by aggregating and grouping on the JSON array columns:
SELECT t.table_col::$l_shipmode, sum(t.table_col::%l_quantity) as quantity
FROM orders
JOIN TABLE(JSON_TO_ARRAY(lineitems_json)) t
GROUP BY t.table_col::$l_shipmode;
+--------------------------+-----------+
| t.table_col::$l_shipmode | quantity  |
+--------------------------+-----------+
| MAIL                     | 218541962 |
| FOB                      | 218469194 |
| RAIL                     | 218553230 |
| AIR                      | 218525124 |
| REG AIR                  | 218617001 |
| SHIP                     | 218576988 |
| TRUCK                    | 218454537 |
+--------------------------+-----------+
- The following query calculates the total number of line items for each order priority:
SELECT o_orderpriority as priority, sum(t.table_col::$l_quantity) as quantity
FROM orders
JOIN TABLE(JSON_TO_ARRAY(lineitems_json)) t
GROUP BY o_orderpriority;
+-----------------+-----------+
| priority        | quantity  |
+-----------------+-----------+
| 3-MEDIUM        | 305731841 |
| 5-LOW           | 305656541 |
| 2-HIGH          | 306112515 |
| 1-URGENT        | 306300507 |
| 4-NOT SPECIFIED | 305936632 |
+-----------------+-----------+
- The following query performs an intricate aggregation operation over multiple fields:
SELECT
  t.table_col::$l_returnflag as r,
  t.table_col::$l_linestatus as s,
  sum(t.table_col::%l_quantity) as sum_qty,
  sum(t.table_col::%l_extendedprice) as sum_base_price,
  sum(t.table_col::%l_extendedprice * (1 - t.table_col::%l_discount)) as sum_disc_price,
  avg(t.table_col::%l_quantity) as avg_qty
FROM orders
JOIN TABLE(JSON_TO_ARRAY(lineitems_json)) t
GROUP by r, s;
+---+---+-----------+--------------------+--------------------+--------------------+
| r | s | sum_qty   | sum_base_price     | sum_disc_price     | avg_qty            |
+---+---+-----------+--------------------+--------------------+--------------------+
| A | F | 377518399 | 566065727797.2633  | 537759104278.07465 | 25.500975103007097 |
| R | F | 377732830 | 566431054976.0034  | 538110922664.7574  | 25.50838478968014  |
| N | F |   9851614 | 14767438399.169971 | 14028805792.211407 | 25.522448302840946 |
| N | O | 764635193 | 1146548935600.9531 | 1089215873201.9531 | 25.498214140059787 |
+---+---+-----------+--------------------+--------------------+--------------------+
Limitation
- The optimization is only applied to the first dimension of an array.
Implementation Considerations
This section describes some unique behaviors in SingleStore Helios’s implementation of the JSON standard (RFC 4627).
Infinities and NaNs
SingleStore Helios does not attempt to support entities such as the nan (not a number) entity, as in {"value":nan}. As for nan and positive and negative infinities, neither the JSON standard nor SQL’s DOUBLE type provides any way to represent such non-finite values. Use null, 0. , or strings (such as "nan" and "inf") in their place.
Unicode Support
JSON strings can contain any of the 65536 Unicode characters in the Basic Multilingual Plane, either directly as themselves or escaped with JSON’s \uABCD notation. During normalization, escaped characters are stored as themselves, except for characters that must be \u-encoded to keep the string valid JSON.
Warning
Characters outside of the Basic Multilingual Plane in JSON strings are not supported with escaped notation and may produce incorrect results.
In SingleStore, a backslash (\) literal in a string must be escaped with a backslash.
Here is an example of how to use escaped notation for characters in the Basic Multilingual Plane:
SELECT '{"a":"\\u00F9"}' :> JSON;
+---------------------------+
| '{"a":"\\u00F9"}' :> JSON |
+---------------------------+
| {"a":"ù"} |
+---------------------------+
Like MySQL, SingleStore Helios supports characters outside the Basic Multilingual Plane (characters whose codepoints are in the range 0x10000 to 0x10FFFF).
The json_
engine variable must be set to utf8mb4_
for these characters to work.
Character Encodings and Collation
Collation provides a set of rules to use when comparing characters in a character set. The collation of JSON values is governed by the json_ engine variable (see Sync Variables Lists) and affects the following:
- Output of SELECT DISTINCT, ORDER BY, and other queries that compare entire values.
- Identification of duplicate keys inside a single JSON object during normalization. For example, whether the string {"Schlüssel":1,"Schluessel":2} is normalized to {"Schluessel":2}.
- Sorting of keys inside a single JSON object. For example, whether the string {"Schlüssel":1,"Schluff":2} is normalized to {"Schluff":2,"Schlüssel":1}.
The default collation generally provides the desired behavior. To change it, set the json_ engine variable.
Note
SingleStore recommends that you use the utf8mb4_
collation for all JSON columns.
Table-level Override
In the following example, for the users table, both userdata and lastname use the table collation (which has been specified as utf8_unicode_ci).
CREATE TABLE users (
  uid INT AUTO_INCREMENT PRIMARY KEY,
  userdata JSON,
  lastname AS userdata::name::$last PERSISTED TEXT
) COLLATE=utf8_unicode_ci;
Column-level Override
In the following example, for the orders table:
- The data column uses utf8_unicode_ci.
- The productdetails column uses utf8_bin.
- Both the productname column and the comments column use utf8_general_ci, which is SingleStore Helios’s default database collation.
CREATE TABLE orders (
  oid INT AUTO_INCREMENT PRIMARY KEY,
  data JSON COLLATE utf8_unicode_ci,
  productname AS data::product::$name PERSISTED TEXT,
  productdetails AS data::product::$details PERSISTED TEXT COLLATE utf8_bin,
  comments VARCHAR(256)
);
Expression-level Override
The :> operator allows you to specify a collation for different expressions in a query.
:> text collate <collation_name>
The following example uses a binary collation (utf8_bin) for the first expression and a case-insensitive collation (utf8_general_ci) for the second expression.
SELECT *
FROM sets
WHERE sets.json_field::$x :> text collate utf8_bin = "string1"
AND sets.json_field::$y :> text collate utf8_general_ci = "string2";
Maximum JSON Value Size
Under the hood, JSON data is stored in LONGBLOB
columns.max_
.
Parquet Encoding for JSON
A Parquet schema has a JSON-like syntax and a nested structure. Each field in the schema has one of three repetition types:
- Required: exactly one occurrence
- Optional: 0 or 1 occurrence
- Repeated: 0 or more occurrences
The infer schema process works as follows:
- The process loops through the list of JSON objects.
  - The present keypaths are merged into a schema tree object.
  - If there is a type conflict, the node in the schema tree is marked as un-inferable.
  - If any node in the schema tree contains more children than indicated in the json_document_max_children engine variable, the node is marked as un-inferable.
  - If a node has a greater number of children than indicated in the engine variable json_document_sparse_children_check_threshold, and the average occurrence of all children relative to the parent is less than 1/N, where N is the value of the engine variable json_document_sparse_children_check_ratio, the node is marked as un-inferable.
- Once the schema tree object is constructed, the schema tree is analyzed and pruned until the number of key paths (distinct root-to-leaf paths) is less than the setting for the engine variable json_document_max_leaves. Any node where the average occurrence of all the children relative to the number of rows is less than 1/N, where N is the value of the engine variable json_document_absolute_sparse_key_check_ratio, is also pruned.
The examples use the engine variable settings and the JSON object table shown below:
- json_document_max_children = 4
- json_document_sparse_children_threshold = 3
- json_document_sparse_children_ratio = 2
JSON Objects |
---|
{"a1": 1, "a2": {"d1": 1}, "a3": {"c1": 1}, "a4": {"b1": 1}} |
{"a1": 2, "a2": 1, "a4": {"b2": 1}} |
{"a1": 3, "a2": 1, "a4": {"b3": 1}} |
{"a1": 4, "a2": 1, "a3": {"c2": 1}, "a4": {"b4": 1}} |
{"a1": 5, "a3": {"c3": 1}, "a4": {"b5": 1}} |
The first pass at merging the keypaths will yield:
The second pass at merging the keypaths contains a type mismatch on a2.
The results of the third and fourth passes at merging yield:
In the final merge, the number of children for a4 exceeds the limit (4) set for json_document_max_children, so it is marked as un-inferable. Similarly, a3 exceeds the limit (3) set for json_document_sparse_children_threshold.
After the merging process, the leaves of the schema tree are the inferred columns (also referred to as leaves, leaf columns, leaf paths, or key paths).
The keypaths will be a1, a2, a3, a4.
Encoding Nested Columns
Parquet uses the Dremel encoding for nested columns with definition and repetition levels.
- Definition levels specify how many optional fields in the path for the column are defined.
- Repetition levels specify the repeated value of the field in the path.
- SingleStore stores the leaf columns for the JSON schema along with their respective repetition and definition levels. Each of these internal columns corresponds to the value, definition level, and repetition level columns, which are encoded in SingleStore encodings (SeekableString, etc.).
The example tables will have these values for the internal columns:
a1
Value | Definition Level | Repetition Level |
---|---|---|
1 | 1 | 0 |
1 | 1 | 0 |
1 | 1 | 0 |
1 | 1 | 0 |
1 | 1 | 0 |
a2
Value | Definition Level | Repetition Level |
---|---|---|
'{"d1": 1}' | 1 | 0 |
1 | 1 | 0 |
1 | 1 | 0 |
1 | 1 | 0 |
- | 0 | 0 |
a3
Value | Definition Level | Repetition Level |
---|---|---|
'{"c1": 1}' | 1 | 0 |
- | 0 | 0 |
- | 0 | 0 |
'{"c2": 1}' | 1 | 0 |
'{"c3": 1}' | 1 | 0 |
a4
Value | Definition Level | Repetition Level |
---|---|---|
'{"b1": 1}' | 1 | 0 |
'{"b2": 1}' | 1 | 0 |
'{"b3": 1}' | 1 | 0 |
'{"b4": 1}' | 1 | 0 |
'{"b5": 1}' | 1 | 0 |
Performance Impact on Parquet Encoding for JSON
When using seekable JSON (JSON Parquet encoding), whether a schema is dense or sparse has a major impact on performance.
A node in a tree is considered to be dense if it occurs in most JSON rows.
Consider the following JSON data table:
JSON Data (dense) |
---|
|
|
|
The schema will infer the JSON rows as follows:
Using the image above the leaves of the tree become internal columns in the JSON encoding.
a | b | c::d | c::e |
---|---|---|---|
1 | 1 | NULL | NULL |
2 | NULL | 1 | NULL |
NULL | NULL | 2 | 3 |
Using the following JSON data table:
JSON Data (sparse) |
---|
|
|
|
|
|
The JSON rows will be encoded as follows:
a | b | c | d | e |
---|---|---|---|---|
1 | NULL | NULL | NULL | NULL |
NULL | 1 | NULL | NULL | NULL |
NULL | NULL | 1 | NULL | NULL |
NULL | NULL | NULL | 1 | NULL |
NULL | NULL | NULL | NULL | 1 |
The preceding table represents a sparse schema caused by the NULLs not being part of the original JSON strings.
To prevent an overly sparse schema from being inferred, SingleStore uses a method where if the average of a key’s children is too low (<1%) in relation to the key itself, the key is stored as a string without inferring its children.
Related Topics
- See: JSON_AGG
- Training: Working with JSON
Last modified: November 11, 2024