Working with Vector Data
SingleStoreDB supports vector database processing, which allows you to store and search vector data.
Some benefits of using SingleStoreDB for vector database processing, as opposed to a specialized vector database system, are:
- A broad array of standard modern database capabilities is available in SingleStoreDB. These include SQL, fast distributed and parallel query processing, full-text search, extensibility, ACID transactions, high availability, disaster recovery, point-in-time recovery, broad connectivity support, and more.
- Less data movement is needed between different data subsystems (e.g., caches, text search systems, and SQL databases) when all the data, including vector data, is stored in SingleStoreDB.
- Operational costs may be reduced since fewer data management tools and fewer copies of the data are needed.
- Less specialized skills and reduced labor may be needed to run an application environment.
Vector Data
Vector data consists of arrays of numbers.
Many large language models (LLMs) are now available, and can be used to provide vector embeddings for language to help implement semantic search, chatbots, and other applications.
- GPT models from OpenAI
- BERT by Google
- LaMDA by Google
- PaLM by Google
- LLaMA by Meta AI
LLM technology is evolving quickly, and new sources of embeddings for language are rapidly becoming available.
Vector embeddings can be stored as vector data in SingleStoreDB.
Vector Similarity Search
A similarity search is the most common vector data operation. It is done with SELECT … ORDER BY … LIMIT queries that use vector similarity functions such as DOT_PRODUCT and EUCLIDEAN_DISTANCE. DOT_PRODUCT is the most commonly used similarity metric.
If the vectors are normalized to length one before saving to a database, and the query vector is also normalized to length one, then DOT_PRODUCT gives the cosine of the angle between the two vectors. That is, DOT_PRODUCT will produce what is known as the cosine similarity metric for its two arguments.
Vectors are similar when the cosine of the angle between them is close to 1 and dissimilar when the cosine is close to 0.
Note
When using high-dimensional vectors, the cosine similarity concept is still applicable.
Many vector models that produce vector embeddings will already have the vectors normalized to length one. In that case, DOT_PRODUCT gives the cosine similarity directly, and it is not necessary to normalize the vectors again.
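As a quick check of this relationship, the following is a minimal sketch using made-up unit vectors (the literal values are illustrative only, not from any embedding model):
-- Both vectors below have length one, so DOT_PRODUCT returns the cosine of
-- the angle between them: approximately 0.7071, i.e., a 45-degree angle.
SELECT dot_product(json_array_pack('[1.0, 0.0]'),
                   json_array_pack('[0.7071, 0.7071]')) AS cosine_similarity;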
Loading, Inserting, and Updating Vectors
Vector data can be added to a database as blob data containing packed arrays of numbers.
CREATE TABLE comments(id INT NOT NULL PRIMARY KEY,
    comment TEXT,
    vector BLOB,
    category VARCHAR(256));
Consider the following information, where @jv is a JSON array of numbers representing the vector embedding for the phrase "The cafeteria in building 35 has a great salad bar".
SET @_id = 1;
SET @jv = '[0.2, 0.11, 0.37, 0.05]';
SET @cmt = "The cafeteria in building 35 has a great salad bar";
SET @cat = "Food";
This data can be inserted into the comments table with a standard INSERT statement, using JSON_ARRAY_PACK to convert the JSON array into a binary vector:
INSERT INTO comments VALUES (@_id, @cmt, json_array_pack(@jv), @cat);
It is important to convert vectors into an internal binary vector format before inserting them into a database. This is done with JSON_ARRAY_PACK in the INSERT statement.
Binary data may be provided in hexadecimal string format and converted using UNHEX before inserting it into a table. UNHEX can be used to convert a hexadecimal string into a blob that works with DOT_PRODUCT.
SET @_id = 2;
SET @hs = "00000000CDCC4C3F9A99193E00000000";
SET @cmt = "I love the taco bar in the B16 cafeteria.";
SET @cat = "Food";
Insert the values into the table.
INSERT INTO comments VALUES (@_id, @cmt, unhex(@hs), @cat);
View the contents of the comments table using this query:
SELECT id, comment, json_array_unpack(vector), category FROM comments\G
*** 1. row ***
id: 1
comment: The cafeteria in building 35 has a great salad bar
json_array_unpack(vector): [0.200000003,0.109999999,0.370000005,0.0500000007]
category: Food
*** 2. row ***
id: 2
comment: I love the taco bar in the B16 cafeteria.
json_array_unpack(vector): [0,0.800000012,0.150000006,0]
category: Food
Another way to insert binary vectors is to convert them to a packed binary format in the client application and submit them through a standard client API.
An alternate way to insert vector data is to use the SingleStoreDB Load Data with Pipelines and Load Data from a Data Source features. Use JSON_ARRAY_PACK, UNHEX, or convert the vector data to binary before inserting.
Example Search Based on Vector Similarity
To find the vectors most similar to a query vector, use an ORDER BY … LIMIT query. ORDER BY will arrange the vectors by the similarity score produced by a vector similarity function, with the highest scores at the top.
First, set up values for one more row of data:
SET @_id = 3;
SET @jv = '[0.1, 0.15, 0.37, 0.05]';
SET @cmt = "The B24 restaurant salad bar is quite good.";
SET @cat = "Food";
Insert the values into the table.
INSERT INTO comments VALUES (@_id, @cmt, json_array_pack(@jv), @cat);
Suppose that the query is "restaurants with good salad," and for this query, the vector embedding API returned this vector:
SET @query_vec = json_array_pack('[0.09, 0.14, 0.5, 0.05]');
To find the top two matches for this query vector, use this query:
SELECT id, comment, category, dot_product(vector, @query_vec) AS score
FROM comments
ORDER BY score DESC
LIMIT 2\G
*** 1. row ***
id: 1
comment: The cafeteria in building 35 has a great salad bar
category: Food
score: 0.22089999914169312
*** 2. row ***
id: 3
comment: The B24 restaurant salad bar is quite good.
category: Food
score: 0.2175000011920929
Hybrid Filtering or Metadata Filtering
When building vector search applications, you may wish to filter on the fields of a record, with simple filters or via joins, in addition to applying vector similarity operations.
For example, given the comments table, you can get the top three matches for a query vector that is in the category "Food" using this SQL:
SET @query_vec = json_array_pack('[ a query vector from your application ]');
SELECT id, comment, category, dot_product(vector, @query_vec) AS score
FROM comments
WHERE category = "Food"
ORDER BY score DESC
LIMIT 3;
Any SQL feature can be used along with vector similarity calculations using DOT_PRODUCT and EUCLIDEAN_DISTANCE.
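For example, here is a sketch of the same filtered top-three search written with EUCLIDEAN_DISTANCE instead of DOT_PRODUCT; smaller distances indicate closer matches, so the results are sorted in ascending order:
-- Sketch: the metadata filter combined with EUCLIDEAN_DISTANCE.
-- Lower distance means a closer match, so sort ascending.
SELECT id, comment, category, euclidean_distance(vector, @query_vec) AS distance
FROM comments
WHERE category = "Food"
ORDER BY distance ASC
LIMIT 3;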
Hybrid Search
Hybrid searching uses a combination of vector similarity search and full-text search.
This command adds a full-text index on the comment field:
ALTER TABLE comments ADD FULLTEXT(comment);
Now, create a new query vector:
SET @query_vec = json_array_pack('[0.1, 0.16, 0.36, 0.05]');
This command combines the full-text and DOT_PRODUCT similarity scores for the query with equal weight and returns the top two matches.
The row with id 3 has the word "restaurant" in it, so it gets a boost from the full-text search here and is delivered at the top of the ranking.
SELECT id, comment, dot_product(@query_vec, vector) AS vec_score,
    match(comment) against ("restaurant") AS ft_score,
    (vec_score + ft_score) / 2 AS combined_score
FROM comments
ORDER BY combined_score DESC
LIMIT 2\G
*** 1. row ***
id: 3
comment: The B24 restaurant salad bar is quite good.
vec_score: 0.1696999967098236
ft_score: 0.375
combined_score: 0.2723499983549118
*** 2. row ***
id: 2
comment: I love the taco bar in the B16 cafeteria.
vec_score: 0.18200001120567322
ft_score: 0
combined_score: 0.09100000560283661
To use the full-text search index to limit the search to just rows with specific keywords, add a WHERE clause with a MATCH … AGAINST expression.
SELECT id, comment, dot_product(@query_vec, vector) AS vec_score,
    match(comment) against ("restaurant") AS ft_score,
    (vec_score + ft_score) / 2 AS combined_score
FROM comments
WHERE match(comment) against ("restaurant")
ORDER BY combined_score DESC
LIMIT 2\G
*** 1. row ***
id: 3
comment: The B24 restaurant salad bar is quite good.
vec_score: 0.1696999967098236
ft_score: 0.375
combined_score: 0.2723499983549118
The WHERE
clause, in this case, limits the search to rows with "restaurant" in the comment column.
The availability of vector search, full-text search, and standard SQL operations in SingleStoreDB allows you to combine them to identify important data in powerful ways.
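As one illustration, the hybrid query above can be reweighted and combined with a metadata filter in a single statement; the 0.7 and 0.3 weights below are hypothetical values chosen only for this sketch:
-- Sketch: weight the vector score at 0.7 and the full-text score at 0.3
-- (hypothetical weights), and restrict the search to the "Food" category.
SELECT id, comment, dot_product(@query_vec, vector) AS vec_score,
    match(comment) against ("restaurant") AS ft_score,
    0.7 * vec_score + 0.3 * ft_score AS combined_score
FROM comments
WHERE category = "Food"
ORDER BY combined_score DESC
LIMIT 2\G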
How to Bulk Load Vectors into SingleStoreDB
Vector data can be inserted into SingleStoreDB from your application by using standard INSERT statements. To bulk load vectors from a file, use LOAD DATA or a pipeline, as shown below.
Use this format for the table:
CREATE TABLE vec_test(id INT, vec blob, shard(id));
Use a text file formatted with tabs (or some other character besides commas) between the columns (the file is /tmp/vec_data.txt in the examples below):
1	[0.18,0.36,0.54,0.73]
2	[0.62,0.70,0.15,0.31]
Loading Data Using a Local File
LOAD DATA INFILE '/tmp/vec_data.txt'
INTO TABLE vec_test
FIELDS TERMINATED BY '\t'
(id, @vec_text)
SET vec = json_array_pack(@vec_text);
Using JSON_ARRAY_PACK in the SET clause converts the JSON array text in the file into the packed binary vector format.
Validate the vector data using:
SELECT id, json_array_unpack(vec)
FROM vec_test;
+------+---------------------------------------------------+
| id | json_array_unpack(vec) |
+------+---------------------------------------------------+
| 1 | [0.180000007,0.360000014,0.540000021,0.730000019] |
| 2 | [0.620000005,0.699999988,0.150000006,0.310000002] |
+------+---------------------------------------------------+
The rounding errors compared with the input file are normal due to the inexact nature of floating-point conversions from decimal strings.
Loading Large Datasets
To load a large dataset or continuous streaming data, use a pipeline. First, truncate the table to remove the rows loaded in the previous example:
TRUNCATE TABLE vec_test;
Create and start the pipeline:
CREATE PIPELINE vec_pipeline AS
LOAD DATA fs '/tmp/vec_data.txt'
INTO TABLE vec_test
FIELDS TERMINATED BY '\t'
(id, @vec_text)
SET vec = json_array_pack(@vec_text);
START PIPELINE vec_pipeline;
Validate the vector data using:
SELECT id, json_array_unpack(vec)
FROM vec_test;
+------+---------------------------------------------------+
| id | json_array_unpack(vec) |
+------+---------------------------------------------------+
| 1 | [0.180000007,0.360000014,0.540000021,0.730000019] |
| 2 | [0.620000005,0.699999988,0.150000006,0.310000002] |
+------+---------------------------------------------------+
Stop or drop the pipeline when it is no longer needed.
STOP PIPELINE vec_pipeline;
DROP PIPELINE vec_pipeline;
If using SingleStoreDB Cloud, store your data in a cloud object store.
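For example, a pipeline that reads the same tab-separated file from S3 might look like the sketch below; the bucket name, region, and credentials are placeholders, not values from this guide:
-- Sketch only: load tab-separated vector data from a cloud object store.
-- The bucket, region, and credentials are hypothetical placeholders.
CREATE PIPELINE vec_s3_pipeline AS
LOAD DATA S3 'my-bucket/vec_data.txt'
CONFIG '{"region": "us-east-1"}'
CREDENTIALS '{"aws_access_key_id": "...", "aws_secret_access_key": "..."}'
INTO TABLE vec_test
FIELDS TERMINATED BY '\t'
(id, @vec_text)
SET vec = json_array_pack(@vec_text);

START PIPELINE vec_s3_pipeline;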
Last modified: July 11, 2023