Generalized Inverted Index (GIN) (BSON)
On this page
Generalized Inverted Index (GIN) is a multi-valued index that maps values to the row IDs containing the value.
GIN stores pairs consisting of a value and a list of row IDs in which the value occurs.
BSON type includes extending BSON_
with equality filters inside to be optimized using GIN.
-
BSON_
MATCH_ ANY(MATCH_ PARAM_ BSON() = 'value', 'path') -
BSON_
MATCH_ ANY(MATCH_ PARAM_ BSON() IN ( 'value1', 'value2', 'value3'), 'path')
Syntax
For SQL Queries
Adding GIN index to new tables:
CREATE TABLE <tablename>(<col1> BSON NULL,GIN INDEX(col1) INDEX_OPTIONS='<options>';
Adding GIN index to existing tables:
ALTER TABLE <tableName>ADD GIN INDEX (col1) INDEX_OPTIONS='<options>';
where '<options>'
is a JSON object to specify the tokenizer and other metadata related to it.
'{"tokenizer":"MATCH_ANY", "path":[<path_array>]}'
where path_
is a comma-separated list of values that specify the path to the field that needs to be indexed."PATH":[]
, is specified, then top-level elements in the array are indexed (such as [1,2,3]).
For MongoDB® Queries
db.<collection>.createIndex({"<path_name>":"gin"})
where <path_
is the path to a property in the BSON document.
Remarks
Currently, GIN is supported only on BSON type columns.
For SQL queries, GIN activates only when matched against a hexadecimal literal of type X'.
.
SELECT HEX('1234':>BSON);
+-------------------+
| HEX('1234':>BSON) |
+-------------------+
| D204000010 |
+-------------------+
SELECT ... WHERE BSON_MATCH_ANY(MATCH_PARAM_BSON() = X'D204000010', col1);
Example 1 - Using SQL
Consider the following table named orders.
CREATE TABLE orders(id BIGINT PRIMARY KEY,created DATETIME(6),product_ids BSON,GIN INDEX (product_ids) INDEX_OPTIONS='{"TOKENIZER":"MATCH ANY","PATH":[]}');INSERT INTO orders VALUES(1, '2025-03-03 12:34:56.000001', '100001':>BSON),(2, '2025-03-03 12:34:56.000002', '100002':>BSON),(3, '2025-03-03 12:34:56.000003', '100003':>BSON),(4, '2025-03-03 12:34:56.000004', '100004':>BSON);
Optimize the table:
OPTIMIZE TABLE orders FULL;
Because the GIN index activates only when matched with hex values (in SQL queries), convert a value to hex, and then perform the lookup:
SELECT HEX(100001:>BSON);
+-------------------+
| HEX(100001:>BSON) |
+-------------------+
| A186010010 |
+-------------------+
SELECT id, created, product_ids:>JSON AS product_idsFROM ordersWHERE BSON_MATCH_ANY(MATCH_PARAM_BSON() = X'A186010010',product_ids);
+----+----------------------------+-------------+
| id | created | product_ids |
+----+----------------------------+-------------+
| 1 | 2025-03-03 12:34:56.000001 | 100001 |
+----+----------------------------+-------------+
The product_
column is cast to JSON for clarity.
The query execution benefits from the GIN index, note the ColumnStoreFilter [orders.
in the output:
EXPLAIN SELECT id, created, product_ids:>JSONFROM ordersWHERE BSON_MATCH_ANY(MATCH_PARAM_BSON() = X'A186010010',product_ids);
+-----------------------------------------------------------------------------------------------------------------+
| EXPLAIN |
+-----------------------------------------------------------------------------------------------------------------+
| Gather partitions:all alias:remote_0 parallelism_level:segment |
| Project [orders.id, orders.created, (orders.product_ids:>JSON COLLATE utf8mb4_bin NULL) AS `product_ids:>JSON`] |
| ColumnStoreFilter [orders.product_ids = x'a186010010' index] |
| ColumnStoreScan dbTest.orders, SORT KEY __UNORDERED () table_type:sharded_columnstore |
+-----------------------------------------------------------------------------------------------------------------+
Example 2 - Using GIN at Scale
The following example shows how a large data set can benefit from GIN at scale.
Create a table bookings, and add GIN to the product_
CREATE TABLE bookings(id BIGINT PRIMARY KEY,created DATETIME(6),product_ids BSON,GIN INDEX (product_ids) INDEX_OPTIONS='{"TOKENIZER":"MATCH ANY","PATH":[]}');-- Add 1Million rows, each with 5 products sampled from 100000 product idsDELIMITER //DO DECLAREarr ARRAY(RECORD(id BIGINT, created DATETIME(6), product_ids BSON)) =CREATE_ARRAY(1000000);product_ids BSON;n BIGINT;BEGINFOR i IN 0..999999 LOOPproduct_ids = CONCAT('[',CONCAT_WS(',', (RAND()*100000):>INT,(RAND()*100000):>INT, (RAND()*100000):>INT,(RAND()*100000):>INT, (RAND()*100000):>INT),']');arr[i] = ROW(i, NOW(), product_ids);END LOOP;n = INSERT_ALL('orders', arr);END //DELIMITER ;
Optimize the table:
OPTIMIZE TABLE orders FULL;
Because GIN activates when matched with hex values, convert a value to hex, and then perform the lookup:
SELECT HEX(94567:>BSON);
+------------------+
| HEX(94567:>BSON) |
+------------------+
| 6771010010 |
+------------------+
SELECT id, created, product_ids:>JSONFROM bookingsWHERE BSON_MATCH_ANY(MATCH_PARAM_BSON() = X'6771010010',product_ids);
+--------+----------------------------+---------------------------------+
| id | created | product_ids:>JSON |
+--------+----------------------------+---------------------------------+
| 719886 | 2025-03-03 12:50:49.000000 | [77807,55647,59828,9374,94567] |
| 500322 | 2025-03-03 12:50:49.000000 | [94567,58653,77465,55399,16804] |
| 207632 | 2025-03-03 12:50:48.000000 | [63421,19400,16959,65091,94567] |
| 943351 | 2025-03-03 12:50:50.000000 | [4616,41546,31047,94567,68472] |
| 363185 | 2025-03-03 12:50:48.000000 | [50258,88567,99311,94567,2009] |
| 803463 | 2025-03-03 12:50:49.000000 | [22582,64484,15540,44440,94567] |
| 698599 | 2025-03-03 12:50:49.000000 | [84654,32124,94567,26840,90261] |
| 105530 | 2025-03-03 12:50:47.000000 | [57273,21771,31151,4171,94567] |
| 781873 | 2025-03-03 12:50:49.000000 | [2862,12649,71063,90341,94567] |
--- Output truncated ---
+--------+----------------------------+---------------------------------+
The product_
column is cast to JSON for clarity.
The query execution benefits from the GIN index (the ColumnStoreFilter [bookings.
in the output indicates that the query used an index):
EXPLAIN SELECT id, created, product_ids:>JSONFROM bookingsWHERE BSON_MATCH_ANY(MATCH_PARAM_BSON() = X'6771010010',product_ids);
+-----------------------------------------------------------------------------------------------------------------------+
| EXPLAIN |
+-----------------------------------------------------------------------------------------------------------------------+
| Gather partitions:all alias:remote_0 parallelism_level:segment |
| Project [bookings.id, bookings.created, (bookings.product_ids:>JSON COLLATE utf8mb4_bin NULL) AS `product_ids:>JSON`] |
| ColumnStoreFilter [bookings.product_ids = x'6771010010' index] |
| ColumnStoreScan dbTest.bookings, SORT KEY __UNORDERED () table_type:sharded_columnstore |
+-----------------------------------------------------------------------------------------------------------------------+
Last modified: March 11, 2025