Generalized Inverted Index (GIN) (BSON)
On this page
Generalized Inverted Index (GIN) is a multi-valued index that maps values to the row IDs containing the value.
GIN stores pairs consisting of a value and a list of row IDs in which the value occurs.
BSON type includes extending BSON_
with equality filters inside to be optimized using GIN.
-
BSON_
MATCH_ ANY(MATCH_ PARAM_ BSON() = 'value', 'path') -
BSON_
MATCH_ ANY(MATCH_ PARAM_ BSON() IN ( 'value1', 'value2', 'value3'), 'path')
Syntax
For SQL queries:
Adding GIN index to new tables:
CREATE TABLE <tablename>
(<col1> BSON NULL,
GIN INDEX(col1) INDEX_OPTIONS='<options>';
Adding GIN index to existing tables:
ALTER TABLE <tableName>
ADD GIN INDEX (col1) INDEX_OPTIONS='<options>';
where '<options>'
is a JSON object to specify the tokenizer and other metadata related to it.
'{"tokenizer":"MATCH_ANY", "path":[<path_array>]}'
where path_
is a comma-separated list of values that specify the path to the field that needs to be indexed.
For MongoDB queries:
db.<collection>.createIndex({"<path_name>":"gin"})
where <path_
is the path to a property in the BSON document.
Remarks
Currently, GIN is supported only on BSON type columns.
For SQL queries, GIN activates only when matched against a hexadecimal literal of type X'.
SELECT HEX('1234':>BSON);
+-------------------+
| HEX('1234':>BSON) |
+-------------------+
| D204000010 |
+-------------------+
SELECT ... WHERE BSON_MATCH_ANY(MATCH_PARAM_BSON() = X'D204000010', col1);
Example 1 - Using SQL
Consider the following table named orders.
CREATE TABLE orders(
id BIGINT PRIMARY KEY,
created DATETIME(6),
product_ids BSON,
GIN INDEX (product_ids) INDEX_OPTIONS='{"TOKENIZER":"MATCH ANY","PATH":[]}');
INSERT INTO orders VALUES
(1, '2025-03-03 12:34:56.000001', '100001':>BSON),
(2, '2025-03-03 12:34:56.000002', '100002':>BSON),
(3, '2025-03-03 12:34:56.000003', '100003':>BSON),
(4, '2025-03-03 12:34:56.000004', '100004':>BSON);
Optimize the table:
OPTIMIZE TABLE orders FULL;
Because the GIN index activates only when matched with hex values (in SQL queries), convert a value to hex, and then perform the lookup:
SELECT HEX(100001:>BSON);
+-------------------+
| HEX(100001:>BSON) |
+-------------------+
| A186010010 |
+-------------------+
SELECT id, created, product_ids:>JSON AS product_ids
FROM orders
WHERE BSON_MATCH_ANY(MATCH_PARAM_BSON() = X'A186010010',product_ids);
+----+----------------------------+-------------+
| id | created | product_ids |
+----+----------------------------+-------------+
| 1 | 2025-03-03 12:34:56.000001 | 100001 |
+----+----------------------------+-------------+
The PRODUCT_
The query execution benefits from the GIN index, the ColumnStoreFilter [orders.
EXPLAIN SELECT id, created, product_ids:>JSON
FROM orders
WHERE BSON_MATCH_ANY(MATCH_PARAM_BSON() = X'A186010010',product_ids);
+-----------------------------------------------------------------------------------------------------------------+
| EXPLAIN |
+-----------------------------------------------------------------------------------------------------------------+
| Gather partitions:all alias:remote_0 parallelism_level:segment |
| Project [orders.id, orders.created, (orders.product_ids:>JSON COLLATE utf8mb4_bin NULL) AS `product_ids:>JSON`] |
| ColumnStoreFilter [orders.product_ids = x'a186010010' index] |
| ColumnStoreScan dbTest.orders, SORT KEY __UNORDERED () table_type:sharded_columnstore |
+-----------------------------------------------------------------------------------------------------------------+
Example 2 - Using GIN at Scale
In the following example, you add a GIN index and then query the indexed column or path.
Create a table, BOOKINGS, and add GIN to the PRODUCT_
CREATE TABLE bookings(
id BIGINT PRIMARY KEY,
created DATETIME(6),
product_ids BSON,
GIN INDEX (product_ids) INDEX_OPTIONS='{"TOKENIZER":"MATCH ANY","PATH":[]}'
);
-- Add 1Million rows, each with 5 products sampled from 100000 product ids
DELIMITER //
DO DECLARE
arr ARRAY(RECORD(id BIGINT, created DATETIME(6), product_ids BSON)) =
CREATE_ARRAY(1000000);
product_ids BSON;
n BIGINT;
BEGIN
FOR i IN 0..999999 LOOP
product_ids = CONCAT('[',CONCAT_WS(',', (RAND()*100000):>INT,
(RAND()*100000):>INT, (RAND()*100000):>INT,
(RAND()*100000):>INT, (RAND()*100000):>INT),']');
arr[i] = ROW(i, NOW(), product_ids);
END LOOP;
n = INSERT_ALL('orders', arr);END //
DELIMITER ;
Optimize the table:
OPTIMIZE TABLE orders FULL;
Because GIN activates when matched with hex values, convert a value to hex, and then perform the lookup:
SELECT HEX(94567:>BSON);
+------------------+
| HEX(94567:>BSON) |
+------------------+
| 6771010010 |
+------------------+
SELECT id, created, product_ids:>JSON
FROM bookings
WHERE BSON_MATCH_ANY(MATCH_PARAM_BSON() = X'6771010010',product_ids);
+--------+----------------------------+---------------------------------+
| id | created | product_ids:>JSON |
+--------+----------------------------+---------------------------------+
| 719886 | 2025-03-03 12:50:49.000000 | [77807,55647,59828,9374,94567] |
| 500322 | 2025-03-03 12:50:49.000000 | [94567,58653,77465,55399,16804] |
| 207632 | 2025-03-03 12:50:48.000000 | [63421,19400,16959,65091,94567] |
| 943351 | 2025-03-03 12:50:50.000000 | [4616,41546,31047,94567,68472] |
| 363185 | 2025-03-03 12:50:48.000000 | [50258,88567,99311,94567,2009] |
| 803463 | 2025-03-03 12:50:49.000000 | [22582,64484,15540,44440,94567] |
| 698599 | 2025-03-03 12:50:49.000000 | [84654,32124,94567,26840,90261] |
| 105530 | 2025-03-03 12:50:47.000000 | [57273,21771,31151,4171,94567] |
| 781873 | 2025-03-03 12:50:49.000000 | [2862,12649,71063,90341,94567] |
--- Output truncated ---
+--------+----------------------------+---------------------------------+
The PRODUCT_
The query execution benefits from the GIN index (the ColumnStoreFilter [bookings.
EXPLAIN SELECT id, created, product_ids:>JSON
FROM bookings
WHERE BSON_MATCH_ANY(MATCH_PARAM_BSON() = X'6771010010',product_ids);
+-----------------------------------------------------------------------------------------------------------------------+
| EXPLAIN |
+-----------------------------------------------------------------------------------------------------------------------+
| Gather partitions:all alias:remote_0 parallelism_level:segment |
| Project [bookings.id, bookings.created, (bookings.product_ids:>JSON COLLATE utf8mb4_bin NULL) AS `product_ids:>JSON`] |
| ColumnStoreFilter [bookings.product_ids = x'6771010010' index] |
| ColumnStoreScan dbTest.bookings, SORT KEY __UNORDERED () table_type:sharded_columnstore |
+-----------------------------------------------------------------------------------------------------------------------+
4 rows in set (0.23 sec)
Example 3 - Using Kai
In this example, the following BSON data is inserted into the USERS collection using the SingleStore Kai endpoint (mongodb://).
{ "id": 1,"name": "Alice","addresses": [{"street": "123 Main St", "city": "New York", "zip": "10001"},{"street": "456 Oak Ave", "city": "Los Angeles", "zip": "90001"}]},{"id": 2,"name": "Bob","addresses": [{"street": "789 Pine Ln", "city": "Chicago", "zip": "60601"}]},{"id": 3,"name": "Charlie","addresses": [{"street": "101 Elm St", "city": "New York", "zip": "10002"},{"street": "222 Maple Dr", "city": "Houston", "zip": "77001"}]}
The ID field of the document is added to the _
Add the GIN to the CITY field in the BSON data.
db.users.createIndex({"addresses.city":"gin"})
To add a GIN index using a SQL query, run the following command:
ALTER TABLE usersADD GIN INDEX (_more) INDEX_OPTIONS='{"TOKENIZER":"MATCH ANY","PATH":["addresses","city"]}';
The following query returns customers from New York city:
db.users.find({"addresses.city": "New York"})
[ { _id: new ObjectId("67af386d8e3564880fdd4cbd"),
id: 3,
name: 'Charlie',
addresses:
[ { street: '101 Elm St', city: 'New York', zip: '10002' },
{ street: '222 Maple Dr', city: 'Houston', zip: '77001' } ] },
{ _id: new ObjectId("67af386d8e3564880fdd4cbb"),
id: 1,
name: 'Alice',
addresses:
[ { street: '123 Main St', city: 'New York', zip: '10001' },
{ street: '456 Oak Ave', city: 'Los Angeles', zip: '90001' } ] } ]
Last modified: March 7, 2025