Generalized Inverted Index (GIN) (BSON)

Generalized Inverted Index (GIN) is a multi-valued index that maps values to the row IDs containing the value. GIN is useful when you want to search for elements within composite data structures like arrays, BSON documents, etc.

GIN stores pairs consisting of a value and a list of row IDs in which the value occurs. Each value is stored only once, so a GIN index remains compact even when the same value appears multiple times. The performance of queries that search for specific words in documents benefits if those keywords are indexed using GIN.

BSON type includes extending BSON_MATCH_ANY with equality filters inside to be optimized using GIN. The following query shapes would benefit from GIN index creation:

  • BSON_MATCH_ANY(MATCH_PARAM_BSON() = 'value', 'path')

  • BSON_MATCH_ANY(MATCH_PARAM_BSON() IN ( 'value1', 'value2',  'value3'), 'path')

Syntax

For SQL queries:

Adding GIN index to new tables:

CREATE TABLE <tablename> 
  (<col1> BSON NULL, 
   GIN INDEX(col1) INDEX_OPTIONS='<options>';

Adding GIN index to existing tables:

ALTER TABLE <tableName> 
  ADD GIN INDEX (col1) INDEX_OPTIONS='<options>';

where '<options>' is a JSON object to specify the tokenizer and other metadata related to it.

'{"tokenizer":"MATCH_ANY", "path":[<path_array>]}'

where path_array is a comma-separated list of values that specify the path to the field that needs to be indexed. If an empty path array, for example, "PATH":[], is specified, then top-level elements in the array are indexed (such as [1,2,3]). Given a BSON column and an optional path, the GIN index extracts all values at that path.

For MongoDB queries:

db.<collection>.createIndex({"<path_name>":"gin"})

where <path_name> is the path to a property in the BSON document.

Remarks

Currently, GIN is supported only on BSON type columns.

For SQL queries, GIN activates only when matched against a hexadecimal literal of type X'...'. For example,

SELECT HEX('1234':>BSON);
+-------------------+
| HEX('1234':>BSON) |
+-------------------+
| D204000010        |
+-------------------+
SELECT ...  WHERE BSON_MATCH_ANY(MATCH_PARAM_BSON() = X'D204000010', col1);

Example 1 - Using SQL

Consider the following table named orders. Note that the GIN index is added to the product_ids column:

CREATE TABLE orders(  
  id BIGINT PRIMARY KEY,   
  created DATETIME(6),  
  product_ids BSON,  
  GIN INDEX (product_ids) INDEX_OPTIONS='{"TOKENIZER":"MATCH ANY","PATH":[]}');

INSERT INTO orders VALUES
  (1, '2025-03-03 12:34:56.000001', '100001':>BSON),
  (2, '2025-03-03 12:34:56.000002', '100002':>BSON),
  (3, '2025-03-03 12:34:56.000003', '100003':>BSON),
  (4, '2025-03-03 12:34:56.000004', '100004':>BSON);

Optimize the table:

OPTIMIZE TABLE orders FULL;

Because the GIN index activates only when matched with hex values (in SQL queries), convert a value to hex, and then perform the lookup:

SELECT HEX(100001:>BSON);
+-------------------+
| HEX(100001:>BSON) |
+-------------------+
| A186010010        |
+-------------------+
SELECT id, created, product_ids:>JSON AS product_ids
  FROM orders 
  WHERE BSON_MATCH_ANY(MATCH_PARAM_BSON() = X'A186010010',product_ids);
+----+----------------------------+-------------+
| id | created                    | product_ids |
+----+----------------------------+-------------+
|  1 | 2025-03-03 12:34:56.000001 | 100001      |
+----+----------------------------+-------------+

The PRODUCT_IDs column is cast to JSON for clarity.

The query execution benefits from the GIN index, the ColumnStoreFilter [orders.product_ids = x'a186010010' index] in the output:

EXPLAIN SELECT id, created, product_ids:>JSON 
  FROM orders 
  WHERE BSON_MATCH_ANY(MATCH_PARAM_BSON() = X'A186010010',product_ids);
+-----------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                         |
+-----------------------------------------------------------------------------------------------------------------+
| Gather partitions:all alias:remote_0 parallelism_level:segment                                                  |
| Project [orders.id, orders.created, (orders.product_ids:>JSON COLLATE utf8mb4_bin NULL) AS `product_ids:>JSON`] |
| ColumnStoreFilter [orders.product_ids = x'a186010010' index]                                                    |
| ColumnStoreScan dbTest.orders, SORT KEY __UNORDERED () table_type:sharded_columnstore                           |
+-----------------------------------------------------------------------------------------------------------------+

Example 2 - Using GIN at Scale

In the following example, you add a GIN index and then query the indexed column or path.

Create a table, BOOKINGS, and add GIN to the PRODUCT_IDs column:

CREATE TABLE bookings(  
  id BIGINT PRIMARY KEY,   
  created DATETIME(6),  
  product_ids BSON,  
  GIN INDEX (product_ids) INDEX_OPTIONS='{"TOKENIZER":"MATCH ANY","PATH":[]}'
);

-- Add 1Million rows, each with 5 products sampled from 100000 product ids

DELIMITER //

DO DECLARE    
   arr ARRAY(RECORD(id BIGINT, created DATETIME(6), product_ids BSON)) = 
   CREATE_ARRAY(1000000);    
     product_ids BSON;    
     n BIGINT;
BEGIN    
   FOR i IN 0..999999 LOOP        
       product_ids = CONCAT('[',CONCAT_WS(',', (RAND()*100000):>INT, 
                     (RAND()*100000):>INT, (RAND()*100000):>INT, 
                     (RAND()*100000):>INT, (RAND()*100000):>INT),']');        
                        arr[i] = ROW(i, NOW(), product_ids);    
   END LOOP;    
   n = INSERT_ALL('orders', arr);END //

DELIMITER ;

Optimize the table:

OPTIMIZE TABLE orders FULL;

Because GIN activates when matched with hex values, convert a value to hex, and then perform the lookup:

SELECT HEX(94567:>BSON);
+------------------+
| HEX(94567:>BSON) |
+------------------+
| 6771010010       |
+------------------+
SELECT id, created, product_ids:>JSON 
  FROM bookings 
  WHERE BSON_MATCH_ANY(MATCH_PARAM_BSON() = X'6771010010',product_ids);
+--------+----------------------------+---------------------------------+
| id     | created                    | product_ids:>JSON               |
+--------+----------------------------+---------------------------------+
| 719886 | 2025-03-03 12:50:49.000000 | [77807,55647,59828,9374,94567]  |
| 500322 | 2025-03-03 12:50:49.000000 | [94567,58653,77465,55399,16804] |
| 207632 | 2025-03-03 12:50:48.000000 | [63421,19400,16959,65091,94567] |
| 943351 | 2025-03-03 12:50:50.000000 | [4616,41546,31047,94567,68472]  |
| 363185 | 2025-03-03 12:50:48.000000 | [50258,88567,99311,94567,2009]  |
| 803463 | 2025-03-03 12:50:49.000000 | [22582,64484,15540,44440,94567] |
| 698599 | 2025-03-03 12:50:49.000000 | [84654,32124,94567,26840,90261] |
| 105530 | 2025-03-03 12:50:47.000000 | [57273,21771,31151,4171,94567]  |
| 781873 | 2025-03-03 12:50:49.000000 | [2862,12649,71063,90341,94567]  |
--- Output truncated ---
+--------+----------------------------+---------------------------------+

The PRODUCT_IDs column is cast to JSON for clarity.

The query execution benefits from the GIN index (the ColumnStoreFilter [bookings.product_ids = x'6771010010' index] in the output indicates that the query used an index):

EXPLAIN SELECT id, created, product_ids:>JSON 
  FROM bookings 
  WHERE BSON_MATCH_ANY(MATCH_PARAM_BSON() = X'6771010010',product_ids);
+-----------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                               |
+-----------------------------------------------------------------------------------------------------------------------+
| Gather partitions:all alias:remote_0 parallelism_level:segment                                                        |
| Project [bookings.id, bookings.created, (bookings.product_ids:>JSON COLLATE utf8mb4_bin NULL) AS `product_ids:>JSON`] |
| ColumnStoreFilter [bookings.product_ids = x'6771010010' index]                                                        |
| ColumnStoreScan dbTest.bookings, SORT KEY __UNORDERED () table_type:sharded_columnstore                               |
+-----------------------------------------------------------------------------------------------------------------------+
4 rows in set (0.23 sec)

Example 3 - Using Kai

In this example, the following BSON data is inserted into the USERS collection using the SingleStore Kai endpoint (mongodb://).

 { "id": 1,    
"name": "Alice",    
"addresses": [      
{"street": "123 Main St", "city": "New York", "zip": "10001"},      
{"street": "456 Oak Ave", "city": "Los Angeles", "zip": "90001"}    
]  
},  
{    
"id": 2,    
"name": "Bob",    
"addresses": [      
{"street": "789 Pine Ln", "city": "Chicago", "zip": "60601"}    
]  
},  
{    
"id": 3,    
"name": "Charlie",    
"addresses": [      
{"street": "101 Elm St", "city": "New York", "zip": "10002"},      
{"street": "222 Maple Dr", "city": "Houston", "zip": "77001"}    
]  
}

The ID field of the document is added to the _ID column, and the rest of the fields in the document are added to the _MORE column of the table. Refer to the Kai Data Storage Model for more information on how tables created using the Kai endpoint are stored in SingleStore.

Add the GIN to the CITY field in the BSON data.

db.users.createIndex({"addresses.city":"gin"})

To add a GIN index using a SQL query, run the following command:

ALTER TABLE users
ADD GIN INDEX (_more) INDEX_OPTIONS='{"TOKENIZER":"MATCH ANY","PATH":["addresses","city"]}';

The following query returns customers from New York city:

db.users.find({  
"addresses.city": "New York"
})
[ { _id: new ObjectId("67af386d8e3564880fdd4cbd"),    
    id: 3,    
    name: 'Charlie',    
    addresses:      
      [ { street: '101 Elm St', city: 'New York', zip: '10002' },       
        { street: '222 Maple Dr', city: 'Houston', zip: '77001' } ] },
  { _id: new ObjectId("67af386d8e3564880fdd4cbb"),    
    id: 1,    
    name: 'Alice',    
    addresses:      
      [ { street: '123 Main St', city: 'New York', zip: '10001' },       
        { street: '456 Oak Ave', city: 'Los Angeles', zip: '90001' } ] } ]

Last modified: March 7, 2025

Was this article helpful?

Verification instructions

Note: You must install cosign to verify the authenticity of the SingleStore file.

Use the following steps to verify the authenticity of singlestoredb-server, singlestoredb-toolbox, singlestoredb-studio, and singlestore-client SingleStore files that have been downloaded.

You may perform the following steps on any computer that can run cosign, such as the main deployment host of the cluster.

  1. (Optional) Run the following command to view the associated signature files.

    curl undefined
  2. Download the signature file from the SingleStore release server.

    • Option 1: Click the Download Signature button next to the SingleStore file.

    • Option 2: Copy and paste the following URL into the address bar of your browser and save the signature file.

    • Option 3: Run the following command to download the signature file.

      curl -O undefined
  3. After the signature file has been downloaded, run the following command to verify the authenticity of the SingleStore file.

    echo -n undefined |
    cosign verify-blob --certificate-oidc-issuer https://oidc.eks.us-east-1.amazonaws.com/id/CCDCDBA1379A5596AB5B2E46DCA385BC \
    --certificate-identity https://kubernetes.io/namespaces/freya-production/serviceaccounts/job-worker \
    --bundle undefined \
    --new-bundle-format -
    Verified OK