Working with Vector Data
SingleStoreDB supports vector database processing, which allows you to store and search vector data.
Some benefits of using SingleStoreDB for vector database processing, as opposed to a specialized vector database system, are:
- A broad array of standard modern database capabilities is available in SingleStoreDB, including SQL, fast distributed and parallel query processing, full-text search, extensibility, ACID transactions, high availability, disaster recovery, point-in-time recovery, and broad connectivity support.
- Less data movement is needed between different data subsystems (e.g., caches, text search systems, and SQL databases) when all the data, including vector data, is stored in SingleStoreDB.
- Operational costs may be reduced since fewer data management tools and fewer copies of the data are needed.
- Less specialized skills and reduced labor may be needed to run an application environment.
Vector Data
Vector data consists of arrays of numbers.
Many large language models (LLMs) are now available and can be used to produce vector embeddings for language, which help implement semantic search, chatbots, and other applications. These models include:
- GPT models from OpenAI
- BERT by Google
- LaMDA by Google
- PaLM by Google
- LLaMA by Meta AI
LLM technology is evolving quickly, and new sources of embeddings for language are rapidly becoming available.
Vector embeddings can be stored as vector data in SingleStoreDB.
Vector Similarity Search
A similarity search is the most common vector data operation. It is performed in SingleStoreDB with SELECT ... ORDER BY ... LIMIT queries that use vector similarity functions, including DOT_PRODUCT and EUCLIDEAN_DISTANCE. DOT_PRODUCT is the most commonly used similarity metric.
If the vectors are normalized to length one before saving to a database, and the query vector is also normalized to length one, then DOT_PRODUCT gives the cosine of the angle between the two vectors. That is, DOT_PRODUCT will produce what is known as the cosine similarity metric for its two arguments.
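Since DOT_PRODUCT of two length-one vectors is their cosine similarity, it can be computed directly on packed vectors. A minimal sketch (the two unit vectors and the 0.6 result are illustrative values, not data from the examples below):
SELECT dot_product(
    json_array_pack('[1, 0, 0]'),
    json_array_pack('[0.6, 0.8, 0]')) AS cosine_similarity;
-- both inputs have length one, so the result, 0.6, is the cosine of the angle between them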
The figure below illustrates when vectors are similar (cosine close to 1) and dissimilar (cosine close to 0):

Note
When using high-dimensional vectors, the cosine similarity concept is still applicable.
Many vector models that produce vector embeddings will already have the vectors normalized to length one. In that case, DOT_PRODUCT gives the cosine similarity directly, and it is not necessary to normalize the vectors again.
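If you are unsure whether your embeddings are normalized, one way to check (a sketch, using the comments table defined in the next section) is to compare each vector's dot product with itself to 1:
SELECT id, dot_product(vector, vector) AS squared_length
FROM comments;
-- a squared_length of approximately 1 means the vector is already normalized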
Loading, Inserting, and Updating Vectors
Vector data can be added to a database as blob data containing packed arrays of numbers.
CREATE TABLE comments(id INT NOT NULL PRIMARY KEY,
    comment TEXT,
    vector BLOB,
    category VARCHAR(256));
Consider the following information, where @jv is a JSON array of numbers representing the vector embedding for the phrase "The cafeteria in building 35 has a great salad bar."
SET @_id = 1;
SET @jv = '[0.2, 0.11, 0.37, 0.05]';
SET @cmt = "The cafeteria in building 35 has a great salad bar";
SET @cat = "Food";
This data can be inserted into the comments table with a standard INSERT statement, using JSON_ARRAY_PACK to convert the JSON array of numbers to a binary vector:
INSERT INTO comments VALUES (@_id, @cmt, json_array_pack(@jv), @cat);
It is important to convert vectors into an internal binary vector format before inserting them into a database. This is accomplished with JSON_ARRAY_PACK in the INSERT statement.
Binary data may also be supplied in hexadecimal string format and converted using UNHEX before inserting it into a table. UNHEX converts a hexadecimal string into a blob that works with DOT_PRODUCT and EUCLIDEAN_DISTANCE.
SET @_id = 2;
SET @hs = "00000000CDCC4C3F9A99193E00000000";
SET @cmt = "I love the taco bar in the B16 cafeteria.";
SET @cat = "Food";
Insert the values into the table.
INSERT INTO comments VALUES (@_id, @cmt, unhex(@hs), @cat);
View the contents of the comments table using this query:
SELECT id, comment, json_array_unpack(vector), category FROM comments\G
*** 1. row ***
id: 1
comment: The cafeteria in building 35 has a great salad bar
json_array_unpack(vector): [0.200000003,0.109999999,0.370000005,0.0500000007]
category: Food
*** 2. row ***
id: 2
comment: I love the taco bar in the B16 cafeteria.
json_array_unpack(vector): [0,0.800000012,0.150000006,0]
category: Food
Another way to insert binary vectors is to convert them to a packed binary format in the client application and submit them through a standard client API.
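The packed format is an array of 32-bit floating-point numbers in little-endian byte order, as the hexadecimal example above shows (0.8 is stored as CDCC4C3F). A small sketch demonstrating that a hand-packed hex string round-trips correctly (the values 1 and 2 are illustrative):
SELECT json_array_unpack(unhex('0000803F00000040'));
-- returns [1,2]: 0000803F is 1.0 and 00000040 is 2.0 as little-endian 32-bit floats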
An alternate way to insert vector data is to use the SingleStoreDB Load Data with Pipelines and How to Load Data Using Pipelines features. Use JSON_ARRAY_PACK or UNHEX to convert the vector data to binary during the load, or convert it to binary before inserting.
Example Search Based on Vector Similarity
To find the vectors most similar to a query vector, use an ORDER BY ... LIMIT query. The ORDER BY clause will arrange the vectors by the similarity score produced by a vector similarity function, with the highest scores first.
SET @_id = 3;
SET @jv = '[0.1, 0.15, 0.37, 0.05]';
SET @cmt = "The B24 restaurant salad bar is quite good.";
SET @cat = "Food";
Insert the values into the table.
INSERT INTO comments VALUES (@_id, @cmt, json_array_pack(@jv), @cat);
Suppose that the query is "restaurants with good salad," and for this query, the vector embedding API returned this vector:
SET @query_vec = json_array_pack('[0.09, 0.14, 0.5, 0.05]');
To find the top two matches for this query vector, use this query:
SELECT id, comment, category, dot_product(vector, @query_vec) AS score
FROM comments
ORDER BY score DESC
LIMIT 2\G
*** 1. row ***
id: 1
comment: The cafeteria in building 35 has a great salad bar
category: Food
score: 0.22089999914169312
*** 2. row ***
id: 3
comment: The B24 restaurant salad bar is quite good.
category: Food
score: 0.2175000011920929
Hybrid Filtering or Metadata Filtering
When building vector search applications, you may wish to filter on the fields of a record, with simple filters or via joins, in addition to applying vector similarity operations.
For example, given the comments table, you can get the top three matches for a query vector that is in the category "Food" using this SQL:
SET @query_vec = json_array_pack('[ a query vector from your application ]');
SELECT id, comment, category, dot_product(vector, @query_vec) AS score
FROM comments
WHERE category = "Food"
ORDER BY score DESC
LIMIT 3;
Any SQL feature can be used along with vector similarity calculations using DOT_PRODUCT and EUCLIDEAN_DISTANCE.
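For example, EUCLIDEAN_DISTANCE can take the place of DOT_PRODUCT. Because it measures distance rather than similarity, smaller values mean closer matches, so sort in ascending order. A sketch against the same comments table:
SELECT id, comment, category,
    euclidean_distance(vector, @query_vec) AS distance
FROM comments
WHERE category = "Food"
ORDER BY distance ASC
LIMIT 3;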
Retrieval-Augmented Generation (RAG) with SingleStoreDB
Retrieval-Augmented Generation (RAG) is a method for enhancing the quality of results for text-based Generative AI (GenAI) applications.
In advance:
- split up relevant text into chunks
- get a vector for each chunk from your chosen LLM and place the chunks in SingleStoreDB
Then, when a question comes in for which you wish to generate an answer:
- get a vector for the question from the same LLM
- search the stored vectors to find the top k matching vectors compared with the query vector
- for each of these top vectors, get the associated text chunks
- pass these chunks to the LLM’s question-answering API as context, along with the original question
This method helps the LLM generate high-quality results for a specialized topic, beyond what it was trained for.
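The retrieval step maps directly onto the similarity query pattern shown earlier. A minimal sketch, assuming a hypothetical chunks table and a @question_vec embedding obtained from the same LLM as the stored chunks:
CREATE TABLE chunks(id INT NOT NULL PRIMARY KEY,
    doc_id INT,
    chunk TEXT,
    vector BLOB);
-- retrieve the top k = 5 chunks to pass to the LLM as context
SELECT chunk, dot_product(vector, @question_vec) AS score
FROM chunks
ORDER BY score DESC
LIMIT 5;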
Tools such as LangChain, or a Natural Language Processing (NLP) library like spaCy or NLTK, can reduce the effort needed to split documents into chunks.
How to Bulk Load Vectors into SingleStoreDB
Vector data can be inserted into SingleStoreDB from your application by using standard INSERT statements.
Use this format for the table:
CREATE TABLE vec_test(id INT, vec blob, shard(id));
Use a text file formatted with tabs (or some other character besides commas) between the columns. The file is named /tmp/vec_data.txt in the examples below:
1	[0.18,0.36,0.54,0.73]
2	[0.62,0.70,0.15,0.31]
Loading Data Using a Local File
LOAD DATA INFILE '/tmp/vec_data.txt'
INTO TABLE vec_test
FIELDS TERMINATED BY '\t'
(id, @vec_text)
SET vec = json_array_pack(@vec_text);
Using JSON_ARRAY_PACK in the SET clause converts each JSON array of numbers to the internal binary vector format as it is loaded.
Validate the vector data using:
SELECT id, json_array_unpack(vec)
FROM vec_test;
+------+---------------------------------------------------+
| id | json_array_unpack(vec) |
+------+---------------------------------------------------+
| 1 | [0.180000007,0.360000014,0.540000021,0.730000019] |
| 2 | [0.620000005,0.699999988,0.150000006,0.310000002] |
+------+---------------------------------------------------+
The rounding errors compared with the input file are normal due to the inexact nature of floating-point conversions from decimal strings.
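These differences appear because each decimal value is stored as the nearest 32-bit float. A quick illustration using a value from the data file above:
SELECT json_array_unpack(json_array_pack('[0.18]'));
-- returns [0.180000007], the nearest 32-bit float to 0.18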
Loading Large Datasets
To load a large dataset or continuous streaming data, use a pipeline. First, remove the rows loaded in the previous example:
TRUNCATE TABLE vec_test;
Create and start the pipeline:
CREATE PIPELINE vec_pipeline AS
LOAD DATA FS '/tmp/vec_data.txt'
INTO TABLE vec_test
FIELDS TERMINATED BY '\t'
(id, @vec_text)
SET vec = json_array_pack(@vec_text);
START PIPELINE vec_pipeline;
Validate the vector data using:
SELECT id, json_array_unpack(vec)
FROM vec_test;
+------+---------------------------------------------------+
| id | json_array_unpack(vec) |
+------+---------------------------------------------------+
| 1 | [0.180000007,0.360000014,0.540000021,0.730000019] |
| 2 | [0.620000005,0.699999988,0.150000006,0.310000002] |
+------+---------------------------------------------------+
Stop or drop the pipeline when it is no longer needed.
STOP PIPELINE vec_pipeline;
DROP PIPELINE vec_pipeline;
If using SingleStoreDB Cloud, store your data in a cloud object store.
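For example, a pipeline can read from an S3 bucket rather than the local filesystem. A sketch, assuming a hypothetical bucket name, region, and placeholder credentials:
CREATE PIPELINE vec_s3_pipeline AS
LOAD DATA S3 'your-bucket/vec_data.txt'
CONFIG '{"region": "us-east-1"}'
CREDENTIALS '{"aws_access_key_id": "<your key id>", "aws_secret_access_key": "<your secret>"}'
INTO TABLE vec_test
FIELDS TERMINATED BY '\t'
(id, @vec_text)
SET vec = json_array_pack(@vec_text);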
Last modified: November 28, 2023