Vector Indexing

SingleStore supports vector similarity scoring and you can use this to find the exact set of k nearest neighbors of a query vector, as described in Working with Vector Data. This is sometimes known as kNN search or exact kNN search.

If you have very large data sets and/or high concurrency requirements for a nearest-neighbor search, exact kNN search may not be efficient enough. For cases like this, SingleStore supports Approximate Nearest Neighbor (ANN) search based on a vector index. ANN search can support finding a set of k near neighbors very efficiently, potentially orders of magnitude faster than exact kNN search.

There is an accuracy vs. speed trade-off between exact kNN and ANN search. ANN will retrieve k near neighbors faster than kNN, but they may not be the exact set of k nearest neighbors. Hence the word "Approximate" in "Approximate Nearest Neighbor."

If you have nearest neighbor queries that do not have any selective filters associated with them, and large sets of vectors, use an ANN search. For example, if you need to find the top 10 matches out of one billion vectors with no filters on other columns in the data set and you want interactive response time, use an ANN search..

Some sample cases for using ANN search are:

In this topic, we will explain how to create a vector index to enable ANN search, how to write queries that will use the index, and how to determine if the index is being used.

Syntax

The syntax for a vector index has to adhere to the following rules:

  • Vector indices can only be built on columnstore tables.

  • Index must be built on a single column that stores the vector.

  • Column type restricted to Vector Type(dimensions[, F32]) NOT NULL where dimensions is the number of dimensions. Currently the only supported element type is F32.

If you have a text column and an associated varbinary column to hold the vector embedding for the text, and you do not immediately have available the vector for the text, consider making the vector all zeros or another dummy value that would never show up in top-K searches.

Similarity Metrics

Two similarity functions are available for creating and searching vector indexes in SingleStore: DOT_PRODUCT and EUCLICEAN_DISTANCE. 

DOT_PRODUCT: Calculates the cosine similarity metric when used with vectors normalized to length 1. Cosine similarity measures the similarity of vectors without taking into account the length of the vectors. The results of DOT_PRODUCT on vectors normalized to length 1 range from -1 to 1; values closer to 1 indicate that the vectors are similar, values closer to -1 indicate that the vectors are less similar.

EUCLIDEAN_DISTANCE: Calculates the Euclidean distance between two vectors. Euclidean distance measures the distance between vectors taking into account the length of the vectors. The results of EUCLIDEAN_DISTANCE range from 0 to infinity; values closer to 0 indicate that the vectors are similar, values closer to infinity indicate that vectors are less similar.

  • Search queries must use the same metric used in the creation of a vector index in order to use that index. That is, a query that uses DOT_PRODUCT can only use a vector index created with DOT_PRODUCT.

  • Vectors must be normalized to length 1 before using the DOT_PRODUCT function to obtain the cosine similarity metric. SingleStore recommends vectors be normalized to length 1 before they are saved to the database. Many models that produce vector embeddings produce vectors normalized to length one. In this case, it is not necessary to normalize the vectors again.

  • Using EUCLIDEAN_DISTANCE over vectors which have been normalized to length 1 may result in poor recall because normalization causes the magnitude information present in the original vectors to be lost.

Creating a Vector Index

A vector index can be created using the CREATE TABLE command:

CREATE TABLE <table_name> (<column_definitions>, VECTOR {INDEX|KEY } [<index_name>] (<column>) [INDEX_OPTIONS '<json>']);

Alternatively, a vector index may be added using the ALTER TABLE command:

ALTER TABLE <table_name> ADD VECTOR {INDEX|KEY } [<index_name>] (<column>) [INDEX_OPTIONS '<json>'];

Multiple vector indexes can be created on the same table or even on the same column.

A vector index may be dropped with the ALTER TABLE command:

ALTER TABLE <table_name> DROP INDEX <index_name>;
OR
DROP INDEX <index_name> ON <table_name>;

Searching a Vector Index

A vector search query to find k ANNs can be done with standard SQL:

SELECT <columns>,
DOT_PRODUCT | EUCLICEAN_DISTANCE (<table_name>.v, @query_vector) AS distance
FROM <table_name>
WHERE <pre-filters>
ORDER BY distance [USE {INDEX|KEY} ([<index_name>])]
[SEARCH_OPTIONS [=] '<json>'] [{DESC|ASC}]
LIMIT k;

Infix syntax is available.

  • <*> for DOT_PRODUCT

  • <-> for EUCLICEAN_DISTNCE

SELECT <columns>,
<table_name>.v {<*>|<->} @query_vector AS distance
FROM <table_name>
WHERE <pre-filters>
ORDER BY distance [USE {INDEX|KEY} ([<index_name>])]
[SEARCH_OPTIONS [=] '<json>'] [{DESC|ASC}]
LIMIT k;

Important:

  • For DOT_PRODUCT (<*>), higher values indicate higher vector similarity; use descending (DESC) order in the ORDER BY clause.

  • For EUCLIDEAN_DISTANCE(<->), lower values indicate smaller distance between values; use ascending (ASC) the ORDER BY clause, which is the default.

  • Search queries must use the same distance metric (DOT_PRODUCT or EUCLIDEAN_DISTANCE) as used in the vector index in order to utilize that vector index.

Index Options

A JSON config string can be used with INDEX_OPTIONS to specify the vector index algorithm and its parameters to use. Search options can also be used as a JSON config string to SEARCH_OPTIONS.

Below are the generic INDEX_OPTIONS and SEARCH_OPTIONS. These generic index building parameters (INDEX_OPTIONS) and index searching parameters (SEARCH_OPTIONS) can be used with all index types.

  • Index building parameters

    • index_type: index type to use. Must be one of AUTO, FLAT, IVF_FLAT, IVF_PQ, IVF_PQFS, HNSW_FLAT, HNSW_PQ. The default is AUTO. SingleStore recommends IVF_PQFS and HNSW_FLAT.

    • metric_type: distance metric. Either EUCLIDEAN_DISTANCE or DOT_PRODUCT. The default is DOT_PRODUCT.

  • Search parameters

    • k: number of rows outputted by vector index scan. k must be >= limit, where limit is from ORDER BY … LIMIT clause. The default is the limit.

Index Hints

When there are multiple vector indices available or when you want to disable any available vector index, the index hint [USE {INDEX|KEY} ([<index_name>])] can be used to specify a specific index to use or to disable the use of an index.

Specify the exact index you want to use as <index_name> or omit the <index_name> part, i.e., USE {INDEX|KEY} () to disable ANN search. See Index Hints Example for an example.

Index Types and Parameters

The following index types and parameters are supported.

All index search parameters can also be specified as index building parameters in INDEX_OPTIONS. When search parameters are specified in INDEX_OPTIONS, that value will be used as the default value for searches using that index so you do not need to specify those values each time you use that index.

Note

SingleStore recommends using the IVF_PQFS and HNSW_FLAT index types. IVF_PQFS creates a smaller index and has lower index build time, while HNSW_FLAT has lower search time and higher accuracy.

AUTO

AUTO automatically selects the vector index algorithm and parameters. Specifying index and search options for AUTO is not allowed.

AUTO currently behaves the same as IVF_PQFS. AUTO may be enhanced in the future with automatic index type selection, so the behavior of AUTO may change in future releases.

FLAT

FLAT does not use an index. It keeps all the vectors in RAM and uses a full scan to find the nearest vector matches. In general, it is not needed because SingleStore will do full scan matching without an index. But it may be useful for purposes of comparison and experimentation with recall and performance or guaranteeing all vectors are in RAM.

IVF_FLAT

When using basic Inverted File Index (IVF), vectors are clustered into nlist clusters and search is done by searching only nprobe nearest clusters. The nlist parameter controls the number of centroids generated by k-means clustering. nprobe cannot be greater than nlist.

  • Index building parameters:

    • nlist: number of inverted lists (number of clusters) created during index build. 1 <= nlist <= 65536. Default to 128.

    • nprobe: number of probes at query time. 1 <= nprobe <= 65536. Default to 8.

  • Search parameters:

    • nprobe

IVF_PQ

Inverted file index with residual vectors PQ-encoded. Vectors are clustered into nlist clusters and search is done by searching only nprobe nearest clusters. The nlist parameter controls the number of centroids generated by k-means clustering. nprobe cannot be greater than nlist.

  • Index building parameters:

    • nlist: number of inverted lists (number of clusters) created during index build. 1 <= nlist <= 65536. Default to 128.

    • m: number of subquantizers used in product quantization. Dimensions % m must equal 0. Default to 32.

    • nbits: number of bits per quantization index. 1 <= nbits <= 16. Default to 8.

    • nprobe: number of probes at query time. 1 <= nprobe <= 65536. Default to 8.

  • Search parameters:

    • nprobe

IVF_PQFS

IVF_PQ with 4-bit PQ fast scan. Vectors are clustered into nlist clusters and search is done by searching only nprobe nearest clusters. The nlist parameter controls the number of centroids generated by k-means clustering. nprobe cannot be greater than nlist.

  • Index building parameters:

    • nlist: number of inverted lists (number of clusters) created during index build. 1 <= nlist <= 65536. Default to 128.

    • m: number of subquantizers used in product quantization. Dimensions % m must equal 0. Default to 32.

    • nprobe: number of probes at query time. 1 <= nprobe <= 65536. Default to 8.

  • Search parameters:

    • nprobe

HNSW_FLAT

Basic Hierarchical Navigable Small World (HNSW) index. Builds a hierarchical proximity graph and search is done in a layered fashion.

  • Index building parameters:

    • M: number of neighbors. 1 <= M <= 2048. Default to 30.

    • efConstruction: expansion factor at construction time. 1 <= efConstruction. Default to 40.

    • ef: expansion factor at search time. 1<= ef. Default to 16.

  • Search parameters:

    • ef

HNSW_PQ

HNSW with vectors PQ-encoded.

  • Index building parameters:

    • M: number of neighbors. 1 <= M <= 2048. Default to 30.

    • efConstruction: expansion factor at construction time. 1 <= efConstruction. Default to 40.

    • m: number of sub-quantizers. dimensions % m must equal 0. Default to 32.

    • nbits: number of bits per quantization index. 1 <= nbits <= 16. Default to 8.

    • ef: expansion factor at search time. 1 <= ef. Default to 16.

  • Search parameters: ef: expansion factor at search time. 1 <= ef. Default to 16.

    • ef

Output Format for Examples

Vectors may be output in JSON or binary format. Use JSON format for examples and for output readability. For production, use the default binary for efficiency.

To get JSON output which will match the examples, use the following command to output vectors in JSON.

SET vector_type_project_format = JSON;

Use the following command to set the output format back to binary.

SET vector_type_project_format = BINARY;

Example 1

Create a table with a VECTOR attribute and then create an IVF_PQFS index on the vector column with index building parameter nlist set to 1024 and search parameter nprobe set to 20.

CREATE TABLE vect (k int, v VECTOR(2) NOT NULL);
INSERT INTO vect VALUES
(1, '[-10, 1]'),
(2, '[10, 2]'),
(3, '[20, 3]');
ALTER TABLE vect ADD VECTOR INDEX (v) INDEX_OPTIONS
'{"index_type":"IVF_PQFS", "nlist":1024, "nprobe":20}';

Note that the search parameter nprobe is used in this index creation command. All search parameters can be specified in INDEX_OPTIONS during index creation. The value provided for the search parameter (nprobe in this example) in index creation will be used as the default value for that search parameter for that index.

Searching the nearest neighbor can be done with the commands below.

The @query_vec variable is cast to a VECTOR to ensure that @query_vec is a valid VECTOR and to improve performance.

Click the Playground icon to the right of the SQL listing to try this query. Also, try experimenting and writing your own query.

SET vector_type_project_format = JSON;  /* to make vector output readable */
SET @query_vec = ('[9,0]'):> VECTOR(2);
/* run the SET commands before this query */
SELECT k, v, v <*> @query_vec AS score
FROM vect
ORDER BY score DESC LIMIT 1;
+------+--------+------------------+
| k    | v      | score            |
+------+--------+------------------+
|    3 | [20,3] |              180 |
+------+--------+------------------+

Optionally optimize the table to enhance performance .

OPTIMIZE TABLE vect FULL;

You can increase  k, as shown in the example below, to increase the number of rows output by the vector index scan. Increasing k will increase the search space and likely increase the recall, but will also increase the execution time of the query.

SET vector_type_project_format = JSON;  /* to make vector output readable */
SELECT k, v, v <*> '[9, 0]' AS score
FROM vect
ORDER BY score SEARCH_OPTIONS '{"k" : 30 }' DESC
LIMIT 3;
+------+---------+-------------------------------+
| k    | v       | score                         |
+------+---------+-------------------------------+
|    3 | [20,3]  |                           180 |
|    2 | [10,2]  |                            90 |
|    1 | [-10,1] |                           -90 |
+------+---------+-------------------------------+

Example 2

Below is a small, self-contained example. Real vectors for AI use cases such as retrieval-augmented generation (RAG) with LLMs would normally have a much higher dimensionality (e.g., 64 to a few thousand). Two-dimensional vectors are used to keep it simple.

Create and use a new database:

CREATE DATABASE ann_vectors; 
USE ann_vectors;

Set the SQL Mode:

SET sql_mode = pipes_as_concat;

Create a table:

CREATE TABLE ann_test(id INT, v VECTOR(2) NOT NULL, SHARD KEY(id), KEY (id));

Use the following SQL script to generate data:

DELIMITER //
DO
DECLARE s INT = 1*1024*1024;
DECLARE c BIGINT;
BEGIN
INSERT ann_test VALUES(1,"[0,0]");
LOOP
INSERT INTO ann_test
SELECT id+(SELECT MAX(id) FROM ann_test),
"[" || s*rand() || "," || s*rand() || "]"
FROM ann_test;
SELECT COUNT(*) INTO c FROM ann_test;
IF c >= s then
EXIT;
END IF;
END LOOP;
END
//
DELIMITER ;

Add a vector index to the table:

ALTER TABLE ann_test ADD VECTOR INDEX(v)
  INDEX_OPTIONS '{"index_type":"IVF_PQFS", "nlist":1024,
 "metric_type":"EUCLIDEAN_DISTANCE"}';

Find a query vector as the vector for row 1000:

SET @qv = (SELECT v FROM ann_test WHERE id = 1000);

Find top five closest matches to the query vector:

SELECT id, v, v <-> @qv AS score
FROM ann_test
ORDER BY score
LIMIT 5;

Verify a Vector Index is Being Used

It is important to verify that your query is using a vector index. A vector index can be used in the following conditions:

  • The metric in the query (dot product vs. euclidean distance) matches the metric used to create the index.

  • The order in the ORDER BY clause (DESC vs. ASC) in the search query matches the metric.

To get a detailed view of query plan for vector similarity queries, use EXPLAIN. The contents of the EXPLAIN plan for the previous example is shown below. The ColumnStoreFilter operator does the indexed ANN search.

EXPLAIN SELECT id, v, v <-> @qv AS score
FROM ann_test
ORDER BY score
LIMIT 5;
+----------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                      |
+----------------------------------------------------------------------------------------------+
| Project [remote_0.id, remote_0.v, remote_0.score]                                            |              
| TopSort limit:5 [remote_0.score]                                                             |
| Gather partitions:all alias:remote_0 parallelism_level:segment                               |
| Project [ann_test.id, ann_test.v, EUCLIDEAN_DISTANCE(ann_test.v,@qv) AS score]               |
| TopSort limit:5 [EUCLIDEAN_DISTANCE(ann_test.v,@qv)]                                         |
| ColumnStoreFilter [INTERNAL_VECTOR_SEARCH(0, @qv, 5, '') index]                              |
| ColumnStoreScan ann_vectors.ann_test, SORT KEY __UNORDERED () table_type:sharded_columnstore |
+----------------------------------------------------------------------------------------------+

The ColumnStoreFilter line of the EXPLAIN output above shows INTERNAL_VECTOR_SEARCH(...) which indicates that the vector index is being used. If INTERNAL_VECTOR_SEARCH is not in the ColumnStoreFilter line, the vector index is not being used.

Index Hints Example

The query below will specify the use of the index with key name v. If that index is not available, the query will return an error. See SHOW INDEXES for information on how to list the indexes for a table.

Use the vect table from Example 1 above for this query and the following one.

SELECT k, v <*> ('[9, 0]' :> VECTOR(2)) AS score
FROM vect
ORDER BY score USE INDEX (v) DESC
LIMIT 2;

The query below, which contains a USE INDEX clause without an index name, will disable the use of the vector (ANN) indexes.

SELECT k, v <*> ('[9, 0]' :> VECTOR(2)) AS score
FROM vect
ORDER BY score USE INDEX () DESC
LIMIT 2;

The query below specifies that the index v should be used and specifies a search option for the parameter k. The USE_INDEX clause and the SEARCH_OPTIONS clause must appear immediately after the ORDER BY clause and USE_INDEX must appear before SEARCH_OPTIONS. 

SELECT k, v <*> ('[9, 0]' :> VECTOR(2)) AS score
FROM vect
ORDER BY score
USE INDEX (v) SEARCH_OPTIONS '{"k":30}' DESC
LIMIT 2;

As described above, EXPLAIN can be used to verify if the query does or does not use an index.

Note

Recall that ORDER BY should be in descending order (DESC) when you are using the DOT PRODUCT metric and in ascending order (ASC) when you are using the EUCLIDEAN DISTANCE metric.

In general, you should use ANN search for large sets of vectors when you do not have selective filters on other columns in the query. In the example above, if there are billions of comments, having an index to find the similarity scores for the query vector is very useful and will provide a significant performance improvement.

On the other hand, if you have a smaller set of vectors, say less than a few million, and selective filters in your queries on other non-vector columns, you're often better off not using ANN indexes. This is true for both vector search and KNN search. You can instead rely on SingleStore's other query processing capabilities and you will get good performance. This will also be less complex because you won't need to think about what index to create, or the impact the index may have on recall, load time and disk and memory usage.

Last modified: December 12, 2024

Was this article helpful?