Working with Vector Data
On this page
SingleStore supports vector database processing, which allows you to store and search vector data.
SingleStore supports a native vector data type and indexed approximate-nearest-neighbor (ANN) search that provide high-performance vector search and easier building of vector-based applications.
Some benefits of using SingleStore for vector database processing, as opposed to a specialized vector database system, are:
-
A broad array of standard modern database capabilities are available in SingleStore.
These include SQL, fast distributed and parallel query processing, full-text search, extensibility, ACID transactions, high availability, disaster recovery, point-in-time recovery, broad connectivity support, etc. -
Less data movement is needed between different data subsystems (e.
g. , caches, text search systems, SQL databases, and specialized vector databases) when all the data, including vector data, is stored in SingleStore. -
Operational costs may be reduced since fewer data management tools and fewer copies of the data are needed.
-
Less specialized skills and reduced labor may be needed to run an application environment.
Vector Data
Vector data consists of arrays of numbers.
Many large language models (LLMs) are now available, and can be used to provide vector embeddings for language to help implement semantic search, chatbots, and other applications.
-
GPT models from OpenAI
-
BERT by Google
-
LaMDA by Google
-
PaLM by Google
-
LLaMA by Meta AI
LLM technology is evolving quickly, and new sources of embeddings for language are rapidly becoming available.
Vector embeddings can be stored as vector data in SingleStore.
Vector Similarity Search
A similarity search is the most common vector data operation.SELECT…ORDER BY…LIMIT
… queries that use vector similarity functions, including DOT_DOT_
is the most commonly used similarity metric.
If the vectors are normalized to length one before saving to a database, and the query vector is also normalized to length one, then DOT_
gives the cosine of the angle between the two vectors.DOT_
will produce what is known as the cosine similarity metric for its two arguments.
The figure below illustrates when vectors are similar (cosine close to 1) and dissimilar (cosine close to 0):
Note
When using high-dimensional vectors the cosine similarity concept is still applicable.
Many vector models that produce vector embeddings will already have the vectors normalized to length one.DOT_
, it is not necessary to normalize the vectors again.
Vector indexes can be used to improve performance of search over vectors.
Loading, Inserting, and Updating Vectors
Vector data can be added to a database using the VECTOR
data type.
CREATE TABLE comments(id INT,comment TEXT,comment_embedding VECTOR(4) not null,category VARCHAR(256));
The default element type for VECTOR
is a 32-bit floating point number (F32
).I18
, I16
, I32
, I64
, F32
, and F64
.
Consider the following information, where @emb
is a JSON array of numbers representing the vector embedding for the phrase "The cafeteria in building 35 has a great salad bar.
SET @_id = 1;SET @cmt = "The cafeteria in building 35 has a great salad bar";SET @emb = '[0.45, 0.55, 0.495, 0.5]';SET @cat = "Food";
This data can be inserted into the comments table with a standard INSERT
statement:
INSERT INTO comments VALUES (@_id, @cmt, @emb, @cat);
In addition, if you have vector data that is already in binary format, that data should be inserted into the database in binary format.
Binary data may be inserted in hexadecimal string format and converted using UNHEX before inserting it in a table.
UNHEX can be used to convert a hex string that will work for insert into the database.
SET @_id = 2;SET @cmt = "I love the taco bar in the B16 cafeteria.";SET @hs = "B806363CE90A363CCDCCCC3D77BE7F3F";SET @cat = "Food";
INSERT INTO comments VALUES (@_id, @cmt, unhex(@hs), @cat);
Insert a third tuple into this table.
SET @_id = 3;SET @cmt = "The B24 restaurant salad bar is quite good.";SET @emb = '[0.1, 0.8, 0.2, 0.555]';SET @cat = "Food";
INSERT INTO comments VALUES (@_id, @cmt, @emb, @cat);
View the contents of the table using this query.
To try out SQL queries, click the Playground icon to the right of the SQL listing.
SELECT id, comment, comment_embedding, categoryFROM commentsORDER BY id;
*** 1. row ***
id: 1
comment: The cafeteria in building 35 has a great salad bar
comment_embedding: [0.449999988,0.550000012,0.495000005,0.5]
category: Food
*** 2. row ***
id: 2
comment: I love the taco bar in the B16 cafeteria.
comment_embedding: [0.0111100003,0.0111109996,0.100000001,0.999000013]
category: Food
*** 3. row ***
id: 3
comment: The B24 restaurant salad bar is quite good.
comment_embedding: [0.100000001,0.800000012,0.200000003,0.555000007]
category: Food
If you are using a command line tool, you can use \G
at the end of the query to get the results formatted as above.
Another way to insert binary vectors is to convert them to a packed binary format in the client application and submit them through a standard client API.
An alternate way to insert vector data is using pipelines.
Example Search Based on Vector Similarity
To find the most similar vectors in a query vector, use an ORDER BY… LIMIT
… query.ORDER BY
will arrange the vectors by their similarity score produced by a vector similarity function, with the closest matches at the top.
Suppose that the query is "restaurants with good salad," and for this query, the vector embedding API returned the vector '[0.
The infix operator <*>
represents DOT_
and the infix operator <->
represents EUCLIDEAN_
.
The @query_
variable is cast to a VECTOR
to ensure that @query_
is a valid VECTOR
and to improve performance.
SET @query_vec = ('[0.44, 0.554, 0.34, 0.62]'):>VECTOR(4);SELECT id, comment, category,comment_embedding <*> @query_vec AS scoreFROM commentsORDER BY score DESCLIMIT 2;
*** 1. row ***
id: 1
comment: The cafeteria in building 35 has a great salad bar
category: Food
score: 0.9810000061988831
*** 2. row ***
id: 3
comment: The B24 restaurant salad bar is quite good.
category: Food
score: 0.8993000388145447
Hybrid Filtering or Metadata Filtering
When building vector search applications, you may wish to filter on the fields of a record, with simple filters or via joins, in addition to applying vector similarity operations.
For example, given the comments
table, you can get the top three matches for a query vector that is in the category "Food" using this SQL:
SET @query_vec = ('[0.44, 0.554, 0.34, 0.62]'):>VECTOR(4);SELECT id, comment, category,comment_embedding <*> @query_vec AS scoreFROM commentsWHERE category = "Food"ORDER BY score DESCLIMIT 3;
*** 1. row ***
id: 1
comment: The cafeteria in building 35 has a great salad bar
category: Food
score: 0.9810000061988831
*** 2. row ***
id: 3
comment: The B24 restaurant salad bar is quite good.
category: Food
score: 0.8993000388145447
*** 3. row ***
id: 2
comment: I love the taco bar in the B16 cafeteria.
category: Food
score: 0.6644238829612732
Any SQL feature can be used along with vector similarity calculation using DOT_
and EUCLIDEAN_
.
Create and Use a Vector Index
The command below creates a vector index on the comment_
field of the comments
table.
ALTER TABLE comments ADD VECTOR INDEX ivf(comment_embedding)INDEX_OPTIONS '{"index_type":"IVF_PQFS"}';
Optionally optimize the table for best performance.
OPTIMIZE TABLE comments FULL;
The following query will use the vector index.
SET @query_vec = ('[0.44, 0.554, 0.34, 0.62]'):>VECTOR(4);SELECT id, comment, category,comment_embedding <*> @query_vec AS scoreFROM commentsORDER BY score DESCLIMIT 2;
*** 1. row ***
id: 1
comment: The cafeteria in building 35 has a great salad bar
category: Food
score: 0.9810000061988831
*** 2. row ***
id: 3
comment: The B24 restaurant salad bar is quite good.
category: Food
score: 0.8993000388145447
The EXPLAIN
command can be used to see the query plan and verify that the vector index is being used.INTERNAL_
in the ColumnStoreFilter
row.
SET @query_vec = ('[0.09, 0.14, 0.5, 0.05]'):>VECTOR(4);EXPLAINSELECT id, comment, category,comment_embedding <*> @query_vec AS scoreFROM commentsORDER BY score DESCLIMIT 2;
+--------------------------------------------------------------------------------------------------------+
| EXPLAIN |
+--------------------------------------------------------------------------------------------------------+
| Project [remote_0.id, remote_0.comment, remote_0.category, remote_0.score] |
| TopSort limit:2 [remote_0.score DESC] |
| Gather partitions:all alias:remote_0 parallelism_level:segment |
| Project [comments.id, comments.comment, comments.category,
DOT_PRODUCT(comments.comment_embedding,(@query_vec:>vector(4, F32) NOT NULL)) AS score] |
| TopSort limit:2 [DOT_PRODUCT(comments.comment_embedding,(@query_vec:>vector(4, F32) NOT NULL)) DESC] |
| ColumnStoreFilter [INTERNAL_VECTOR_SEARCH(0, (@query_vec:>vector(4, F32) NOT NULL), 2, '') index] |
| ColumnStoreScan test.comments, SORT KEY __UNORDERED () table_type:sharded_columnstore |
+--------------------------------------------------------------------------------------------------------+
Hybrid Search
Hybrid Search combines multiple search methods in one query and blends full-text search (which finds keyword matches) and vector search (which finds semantic matches) allowing search results to be (re-)ranked by a score that combines full-text and vector rankings.
Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) is a method for enhancing the quality of results for text-based Generative AI (GenAI) applications.
In advance:
-
split up relevant text into chunks
-
get a vector for each chunk from your chosen LLM and place the chunks in SingleStore
Then, when a question comes in for which you wish to generate an answer:
-
get a vector for the question from the same LLM
-
search the stored vectors to find the top k matching vectors compared with the query vector
-
for each of these top vectors, get the associated text chunks
-
pass these chunks to the LLM’s question-answering API as context, along with the original question
This method helps the LLM generate high-quality results for a specialized topic, beyond what it was trained for.
Tools such as Langchain, or a Natural Language Processing (NLP) library like spaCy or NLTK can be used to make splitting documents into chunks take less time for the application developer.
Bulk Loading Vectors
Loading large vector data sets requires use of LOAD DATA and PIPELINES and formats such as Apache Parquet or Apache Avro™ to ensure good performance.
Related Topics
Last modified: August 14, 2024