Vector Indexing
On this page
Note
This is a Preview feature.
SingleStore supports vector similarity scoring and you can use this to find the exact set of k nearest neighbors of a query vector, as described in Working with Vector Data.
If you have very large data sets and/or high concurrency requirements for a nearest-neighbor search, exact kNN search may not be efficient enough.
There is an accuracy vs.
If you have nearest neighbor queries that do not have any selective filters associated with them, and large sets of vectors, use an ANN search.
Some sample cases for using ANN search are:
-
Semantic Search of Text based on vectors from large language models (LLMs)
-
Retrieval-Augmented Generation (RAG) to enable focused, high-quality results for text generation based on LLMs within specific knowledge areas
In this topic, we will explain how to create a vector index to enable ANN search, how to write queries that will use the index, and how to determine if the index is being used.
Syntax
The syntax for a vector index has to adhere to the following rules:
-
Vector indices can only be built on columnstore tables.
-
Index must be built on a single column that stores the vector.
-
Column type restricted to Vector Type(dimensions[, F32]) NOT NULL where dimensions is the number of dimensions.
Currently the only supported element type is F32.
If you have a text column and an associated varbinary column to hold the vector embedding for the text, and you do not immediately have available the vector for the text, consider making the vector all zeros or another dummy value that would never show up in top-K searches.
Creating a Vector Index
A vector index can be created using the CREATE TABLE
command:
CREATE TABLE <table_name> (<column_definitions>, VECTOR {INDEX|KEY } [<index_name>] (<column>) [INDEX_OPTIONS '<json>']);
Alternatively, a vector index may be added using the ALTER TABLE
command:
ALTER TABLE <table_name> ADD VECTOR {INDEX|KEY } [<index_name>] (<column>) [INDEX_OPTIONS '<json>'];
Multiple vector indexes can be created on the same table or even on the same column.
A vector index may be dropped with the ALTER TABLE
command:
ALTER TABLE <table_name> DROP INDEX <index_name>;ORDROP INDEX <index_name> ON <table_name>;
Searching a Vector Index
A vector search query to find k ANNs can be done with standard SQL:
SELECT <columns>,<table_name>.v {<*>|<->} @query_vector AS distanceFROM <table_name>WHERE <pre-filters>ORDER BY distance [USE {INDEX|KEY} ([<index_name>])][SEARCH_OPTIONS [=] '<json>']LIMIT k;
Index Options
A JSON config string can be used with INDEX_
to specify the vector index algorithm and its parameters to use.SEARCH_
.
Below are the generic INDEX_
and SEARCH_
.INDEX_
) and index searching parameters (SEARCH_
) can be used with all index types.
-
Index building parameters
-
index_
type: index type to use. Must be one of AUTO, FLAT, IVF_ FLAT, IVF_ PQ, IVF_ PQFS, HNSW_ FLAT, HNSW_ PQ. The default is AUTO. SingleStore recommends IVF_ PQFS and HNSW_ FLAT. -
metric_
type: distance metric. Either EUCLIDEAN_
orDISTANCE DOT_
.PRODUCT The default is DOT_
.PRODUCT
-
-
Search parameters
-
k: number of rows outputted by vector index scan.
k must be >= limit, where limit is from ORDER BY … LIMIT
clause.The default is the limit.
-
Index Hints
When there are multiple vector indices available or when you want to disable any available vector index, the index hint [USE {INDEX|KEY} ([<index_
can be used to specify a specific index to use or to disable the use of an index.
Specify the exact index you want to use as <index_
or omit the <index_
part, i. USE {INDEX|KEY} ()
to disable ANN search.
Index Types and Parameters
The following index types and parameters are supported.INDEX_
and it will be used as the default value for that index so you do not need to specify them every time you use that vector index.
Note
SingleStore recommends using the IVF_
AUTO
AUTO automatically selects the vector index algorithm and parameters.
AUTO currently behaves the same as IVF_
FLAT
FLAT does not use an index.
IVF_ FLAT
When using basic Inverted File Index (IVF), vectors are clustered into nlist
clusters and search is done by searching only nprobe
nearest clusters.nlist
parameter controls the number of centroids generated by k-means clustering.nprobe
cannot be greater than nlist
.
-
Index building parameters:
-
nlist: number of inverted lists (number of clusters) created during index build.
1 <= nlist <= 65536. Default to 128.
-
-
Search parameters:
-
nprobe: number of probes at query time.
1 <= nprobe <= 65536. Default to 8.
-
IVF_ PQ
Inverted file index with residual vectors PQ-encoded.nlist
clusters and search is done by searching only nprobe
nearest clusters.nlist
parameter controls the number of centroids generated by k-means clustering.nprobe
cannot be greater than nlist
.
-
Index building parameters:
-
nlist: number of inverted lists (number of clusters) created during index build.
1 <= nlist <= 65536. Default to 128. -
m: number of subquantizers used in product quantization.
Dimensions % m must equal 0. Default to 32. -
nbits: number of bits per quantization index.
1 <= nbits <= 16. Default to 8.
-
-
Search parameters:
-
nprobe: number of probes at query time.
1 <= nprobe <= 65536. Default to 8.
-
IVF_ PQFS
IVF_nlist
clusters and search is done by searching only nprobe
nearest clusters.nlist
parameter controls the number of centroids generated by k-means clustering.nprobe
cannot be greater than nlist
.
-
Index building parameters:
-
nlist: number of inverted lists (number of clusters) created during index build.
1 <= nlist <= 65536. Default to 128. -
m: number of subquantizers used in product quantization.
Dimensions % m must equal 0. Default to 32.
-
HNSW_ FLAT
Basic Hierarchical Navigable Small World (HNSW) index.
-
Index building parameters:
-
M: number of neighbors.
1 <= M <= 2048. Default to 30. -
efConstruction: expansion factor at construction time.
1 <= efConstruction. Default to 40.
-
-
Search parameters:
-
ef: expansion factor at search time.
1<= ef. Default to 16.
-
HNSW_ PQ
HNSW with vectors PQ-encoded.
-
Index building parameters:
-
M: number of neighbors.
1 <= M <= 2048. Default to 30. -
efConstruction: expansion factor at construction time.
1 <= efConstruction. Default to 40. -
m: number of sub-quantizers.
dimensions % m must equal 0. Default to 32. -
nbits: number of bits per quantization index.
1 <= nbits <= 16. Default to 8.
-
-
Search parameters: ef: expansion factor at search time.
1 <= ef. Default to 16. -
ef: expansion factor at search time.
1 <= ef. Default to 16.
-
Example 1
Create a table with a VECTOR
attribute and then create an IVF_
index on the vector column with index building parameter nlist
set to 2 and search parameter nprobe
set to 1.
CREATE TABLE vect (k int, v VECTOR(2) NOT NULL);INSERT INTO vect VALUES(1, '[-10, 1]'),(2, '[10, 2]'),(3, '[20, 3]');ALTER TABLE vect ADD VECTOR INDEX (v) INDEX_OPTIONS'{"index_type":"IVF_PQFS", "nlist":2, "nprobe":1}';
Note that the search parameter nprobe
is used in this index creation command.INDEX_
during index creation.nprobe
in this example) in index creation will be used as the default value for that search parameter for that index.
Searching the nearest neighbor can be done with the commands below.
The @query_
variable is cast to a VECTOR
to ensure that @query_
is a valid VECTOR
and to improve performance.
Click the Playground icon to the right of the SQL listing to try this query.
SET @query_vec = ('[9,0]'):> VECTOR(2);SELECT k, v, v <*> @query_vec AS scoreFROM vectORDER BY score DESC LIMIT 1;
+------+--------+------------------+
| k | v | score |
+------+--------+------------------+
| 3 | [20,3] | 180 |
+------+--------+------------------+
Optionally optimize the table to enhance performance .
OPTIMIZE TABLE vect FULL;
You can increase k
, as shown in the example below, to increase the number of rows output by the vector index scan.k
will increase the search space and likely increase the recall, but will also increase the execution time of the query.
SELECT k, v, v <*> '[9, 0]' AS scoreFROM vectORDER BY score SEARCH_OPTIONS '{"k" : 30 }' DESCLIMIT 3;
+------+---------+-------------------------------+
| k | v | score |
+------+---------+-------------------------------+
| 3 | [20,3] | 180 |
| 2 | [10,2] | 90 |
| 1 | [-10,1] | -90 |
+------+---------+-------------------------------+
New Infix Operators for euclidean_ distance and dot_ product
The following infix operators are available for calculating vector similarity:
-
<*>
(DOT_PRODUCT) -
<->
(EUCLIDEAN_DISTANCE)
They are equivalent to the existing functions specified.
Important
Search queries must use the same distance metric as used in the vector index in order to utilize that vector index.DOT_
and EUCLIDEAN_
.
Example 2
Below is a small, self-contained example.
Create and use a new database:
CREATE DATABASE ann_vectors;USE ann_vectors;
Set the SQL Mode:
SET sql_mode = pipes_as_concat;
Create a table:
CREATE TABLE ann_test(id INT, v VECTOR(2) NOT NULL, SHARD KEY(id), KEY (id));
Use the following SQL script to generate data:
DELIMITER //DODECLARE s INT = 1*1024*1024;DECLARE c BIGINT;BEGININSERT ann_test VALUES(1,"[0,0]");LOOPINSERT INTO ann_testSELECT id+(SELECT MAX(id) FROM ann_test),"[" || s*rand() || "," || s*rand() || "]"FROM ann_test;SELECT COUNT(*) INTO c FROM ann_test;IF c >= s thenEXIT;END IF;END LOOP;END//DELIMITER ;
Add a vector index to the table:
ALTER TABLE ann_test ADD VECTOR INDEX(v)INDEX_OPTIONS '{"index_type":"IVF_PQFS", "nlist":1000,"metric_type":"EUCLIDEAN_DISTANCE"}';
Find a query vector as the vector for row 1000:
SET @qv = (SELECT v FROM ann_test WHERE id = 1000);
Find top five closest matches to the query vector:
SELECT id, v, v <-> @qv AS scoreFROM ann_testORDER BY scoreLIMIT 5;
Verify a Vector Index is Being Used
It is important to verify that your query is using a vector index.
-
The metric in the query (dot product vs.
euclidean distance) matches the metric used to create the index. -
The order in the
ORDER BY
clause (DESC
vs.ASC
) in the search query matches the metric.
To get a detailed view of query plan for vector similarity queries, use EXPLAIN
.EXPLAIN
plan for the previous example is shown below.
EXPLAIN SELECT id, v, v <-> @qv AS scoreFROM ann_testORDER BY scoreLIMIT 5;
+----------------------------------------------------------------------------------------------+
| EXPLAIN |
+----------------------------------------------------------------------------------------------+
| Project [remote_0.id, remote_0.v, remote_0.score] |
| TopSort limit:5 [remote_0.score] |
| Gather partitions:all alias:remote_0 parallelism_level:segment |
| Project [ann_test.id, ann_test.v, EUCLIDEAN_DISTANCE(ann_test.v,@qv) AS score] |
| TopSort limit:5 [EUCLIDEAN_DISTANCE(ann_test.v,@qv)] |
| ColumnStoreFilter [INTERNAL_VECTOR_SEARCH(0, @qv, 5, '') index] |
| ColumnStoreScan ann_vectors.ann_test, SORT KEY __UNORDERED () table_type:sharded_columnstore |
+----------------------------------------------------------------------------------------------+
The ColumnStoreFilter line of the EXPLAIN
output above shows INTERNAL_
which indicates that the vector index is being used.INTERNAL_
is not in the ColumnStoreFilter
line, the vector index is not being used.
Index Hints Example
The query below will specify the use of the index with key name v
.
Use the vect
table from Example 1 above for this query and the following one.
SELECT k, v <*> ('[9, 0]' :> VECTOR(2)) AS scoreFROM vectORDER BY score USE INDEX (v) DESCLIMIT 2;
The query below, which contains a USE INDEX
clause without an index name, will disable the use of the vector (ANN) indexes.
SELECT k, v <*> ('[9, 0]' :> VECTOR(2)) AS scoreFROM vectORDER BY score USE INDEX () DESCLIMIT 2;
The query below specifies that the index v
should be used and specifies a search option for the parameter k
.USE_
clause and the SEARCH_
clause must appear immediately after the ORDER BY
clause and USE_
must appear before SEARCH_
.
SELECT k, v <*> ('[9, 0]' :> VECTOR(2)) AS scoreFROM vectORDER BY scoreUSE INDEX (v) SEARCH_OPTIONS '{"k":30}' DESCLIMIT 2;
As described above, EXPLAIN
can be used to verify if the query does or does not use an index.
Note
Recall that ORDER BY
should be in descending order (DESC
) when you are using the DOT PRODUCT metric and in ascending order (ASC
) when you are using the EUCLIDEAN DISTANCE metric.
Understanding Behavior for DELETEs and UPDATEs
Rows that have been deleted and old versions of rows that have been updated will be filtered out from results for indexed ANN search.
You can work around this to ensure you actually get N
rows of output by using a larger value of the search option k – the number of rows output by a vector index scan.N
in a LIMIT N
query – you may not get exactly N but you can generate more output, which may be sufficient for many applications like RAG where the exact amount of rows retrieved is not critical.
When to use an ANN Index to Support Vector Search
In general, you should use ANN search for large sets of vectors when you do not have selective filters on other columns in the query.
On the other hand, if you have a smaller set of vectors, say less than a few million, and selective filters in your queries on other non-vector columns, you're often better off not using ANN indexes.
Related Topics
Last modified: September 3, 2024