Working with Vector Data

SingleStore supports vector database processing, which allows you to store and search vector data. A typical vector search locates the set of vectors that most closely match a query vector. Vectors usually come from objects: text, images, video, audio, etc. Vector database searches find data based on its content or meaning, even without exact matches. For example, vector search can allow a semantic search of text, where a query about "meals" could return information about "lunch" and "dinner" without using those words because they are similar in meaning.

Some benefits of using SingleStore for vector database processing, as opposed to a specialized vector database system, are:

  • A broad array of standard modern database capabilities are available in SingleStore. These include SQL, fast distributed and parallel query processing, full-text search, extensibility, ACID transactions, high availability, disaster recovery, point-in-time recovery, broad connectivity support, etc.

  • Less data movement is needed between different data subsystems (e.g., caches, text search systems, SQL databases, and specialized vector databases) when all the data, including vector data, is stored in SingleStore.

  • Operational costs may be reduced since fewer data management tools and fewer copies of the data are needed.

  • Less specialized skills and reduced labor may be needed to run an application environment.

Vector Data

Vector data consists of arrays of numbers. Vector data can be stored in SingleStore as blobs. Vector embeddings, which are vectors describing the meaning of objects, can be obtained from many sources. For example, OpenAI has APIs that will give vector embeddings for text, and facenet is an open-source software package with a pre-trained neural network that can provide embeddings for face images.

Many large language models (LLMs) are now available, and can be used to provide vector embeddings for language to help implement semantic search, chatbots, and other applications. Some of the more well-known LLMs are:

  • GPT models from OpenAI

  • BERT by Google

  • LaMDA by Google

  • PaLM by Google

  • LLaMA by Meta AI

LLM technology is evolving quickly, and new sources of embeddings for language are rapidly becoming available. Moreover, it is possible to train your own models that produce embeddings for database information. Regardless of the embedding source, these embeddings can be used for vector similarity search in SingleStore.

Vector embeddings can be stored as vector data in SingleStore. Vector embeddings are typically high-dimensional, with anywhere from a hundred to a few thousand dimensions.

A similarity search is the most common vector data operation. A vector similarity search in SingleStore uses SELECT…ORDER BY…LIMIT… queries that use vector similarity functions, including DOT_PRODUCT and EUCLIDEAN_DISTANCE. DOT_PRODUCT is the most commonly used similarity metric.

If the vectors are normalized to length one before saving to a database, and the query vector is also normalized to length one, then DOT_PRODUCT gives the cosine of the angle between the two vectors. DOT_PRODUCT will produce what is known as the cosine similarity metric for its two arguments. If the cosine is close to one, then the angle between the vectors is close to zero. A cosine value closer to one shows that the vectors are pointing in almost the same direction, so the objects they represent are similar.

The figure below illustrates when vectors are similar (cosine close to 1) and dissimilar (cosine close to 0):

Note

When using high-dimensional vectors the cosine similarity concept is still applicable.

Many vector models that produce vector embeddings will already have the vectors normalized to length one. In this case, when using DOT_PRODUCT, it is not necessary to normalize the vectors again.

Loading, Inserting, and Updating Vectors

Vector data can be added to a database as blob data containing packed arrays of numbers. For example using this table format:

CREATE TABLE comments_b(id INT,
comment TEXT,
vect blob,
category VARCHAR(256));

Consider the following information where @jv is a JSON array of numbers representing the vector embedding for the phrase "The cafeteria in building 35 has a great salad bar." Fictional vectors are used here to keep the examples small enough to read easily.

SET @_id = 1;
SET @cmt = "The cafeteria in building 35 has a great salad bar";
SET @jv = '[0.45, 0.55, 0.495, 0.5]';
SET @cat = "Food";

This data can be inserted into the comments table with a standard INSERT, using JSON_ARRAY_PACK:

INSERT INTO comments_b VALUES (@_id, @cmt, JSON_ARRAY_PACK(@jv), @cat);

It is important to convert vectors into an internal binary vector format before inserting them into a database. In this example, the conversion is done by calling JSON_ARRAY_PACK in the INSERT statement. Storing vectors as strings and converting them to binary during query will deliver much slower performance and is not recommended.

Binary data may be inserted in hexadecimal string format and converted using UNHEX before inserting it in a table. For example, suppose that client application software produces this hex string B806363CE90A363CCDCCCC3D77BE7F3F for the vector '[0.01111, 0.011111, 0.1, 0.999]'.

UNHEX can be used to convert a hex string that will work for insert into the database. For example:

SET @_id = 2;
SET @cmt = "I love the taco bar in the B16 cafeteria.";
SET @hs = "B806363CE90A363CCDCCCC3D77BE7F3F";
SET @cat = "Food";

Insert the values into the table.

INSERT INTO comments_b VALUES (@_id, @cmt, unhex(@hs), @cat);

View the contents of the table using this query:

SELECT id, comment, JSON_ARRAY_UNPACK(vect), category
FROM comments_b;
*** 1. row ***
                     id: 1
                comment: The cafeteria in building 35 has a great salad bar
JSON_ARRAY_UNPACK(vect): [0.449999988,0.550000012,0.495000005,0.5]
               category: Food
*** 2. row ***
                     id: 2                
                comment: I love the taco bar in the B16 cafeteria.
JSON_ARRAY_UNPACK(vect): [0.0111100003,0.0111109996,0.100000001,0.999000013]               category: Food

Note

If you are using a command line tool, you can use \G at the end of the query to get the results formatted as above.

Another way to insert binary vectors is to convert them to a packed binary format in the client application and submit them through a standard client API. Consult the corresponding API documentation for details on how to do this.

An alternate way to insert vector data is using pipelines. See Load Data with Pipelines and How to Load Data Using Pipelines. Vector data has to be converted into packed binary before inserting in a column in the target table. Use HEX or convert the vector data to binary before inserting.

Example Search Based on Vector Similarity

To find the most similar vectors in a query vector, use an ORDER BY… LIMIT… query. ORDER BY will arrange the vectors by their similarity score produced by a vector similarity function, with the closest matches at the top. To demonstrate, another vector will be added to the comments table.

SET @_id = 3;
SET @cmt = "The B24 restaurant salad bar is quite good.";
SET @jv = '[0.1, 0.8, 0.2, 0.555]';
SET @cat = "Food";

Insert the values into the table.

INSERT INTO comments_b VALUES (@_id, @cmt, JSON_ARRAY_PACK(@jv), @cat);

Suppose that the query is "restaurants with good salad," and for this query, the vector embedding API returned the vector '[0.44, 0.554, 0.34, 0.62]'. To find the top two matches for this query vector use the SQL below.

SET @query_vec = JSON_ARRAY_PACK('[0.44, 0.554, 0.34, 0.62]');
SELECT id, comment, category, DOT_PRODUCT(vect, @query_vec) AS score
    FROM comments_b
    ORDER BY score DESC
    LIMIT 2;
*** 1. row ***
      id: 1
 comment: The cafeteria in building 35 has a great salad bar
category: Food
   score: 0.9810000061988831
*** 2. row ***
      id: 3
 comment: The B24 restaurant salad bar is quite good.
category: Food
   score: 0.8993000388145447      

Hybrid Filtering or Metadata Filtering

When building vector search applications, you may wish to filter on the fields of a record, with simple filters or via joins, in addition to applying vector similarity operations. Filtering on fields of a record using simple filters or joins along with applying vector similarity operations is often referred to as hybrid filtering or metadata filtering. SingleStore can handle this filtering using standard SQL operations.

For example, given the comments table, you can get the top three matches for a query vector that is in the category "Food" using this SQL:

SET @query_vec = JSON_ARRAY_PACK('[ a query vector from your application ]');
SELECT id, comment, category, DOT_PRODUCT(vect, @query_vec) AS score
    FROM comments_b
    WHERE category = "Food"
    ORDER BY score DESC
    LIMIT 3;

Any SQL feature can be used along with vector similarity calculation using DOT_PRODUCT() and EUCLIDEAN_DISTANCE(). These include filtering, ordering, grouping, aggregation, window functions, full-text search, and more.

Hybrid searching uses a combination of vector similarity search and full-text search. For example, this may be useful when performing a semantic search with a high priority given to something with a specific keyword. An example follows.

This command adds a full-text index on the comment field:

ALTER TABLE comments_b ADD FULLTEXT(comment);

Run the following command to ensure the values above are indexed.

OPTIMIZE TABLE comments_b FULL;

The commands below first create a query vector and then combines the full-text and DOT_PRODUCT similarity scores for the query with equal weight and returns the top two matches. The query will use the full-text index when matching the comment against "restaurant."

The row with id 3 has the word "restaurant" in it, so it gets a boost from the full-text search here and is delivered at the top of the ranking.

SET @query_vec = JSON_ARRAY_PACK('[0.44, 0.554, 0.34, 0.62]');
SELECT id, comment, DOT_PRODUCT(@query_vec, vect) AS vec_score,
match(comment) against ("restaurant") AS ft_score,
(vec_score + ft_score) / 2 AS combined_score
FROM comments_b
ORDER BY combined_score DESC
LIMIT 2;
*** 1. row ***
            id: 3
       comment: The B24 restaurant salad bar is quite good.
     vec_score: 0.8993000388145447
      ft_score: 0.11506980657577515
combined_score: 0.5071849226951599
*** 2. row ***
            id: 1
       comment: The cafeteria in building 35 has a great salad bar
     vec_score: 0.9810000061988831
      ft_score: 0
combined_score: 0.49050000309944153

To use the full-text search index to limit the search to just rows with specific keywords, add a WHERE clause with a MATCH…AGAINST… expression. For example:

SET @query_vec = JSON_ARRAY_PACK('[0.44, 0.554, 0.34, 0.62]');
SELECT id, comment, dot_product(@query_vec, vect) AS vec_score,  
    match(comment) against ("restaurant") AS ft_score,  
    (vec_score + ft_score) / 2 AS combined_score
    FROM comments_b
    WHERE match(comment) against ("restaurant")
    ORDER BY combined_score DESC
    LIMIT 2;
   
*** 1. row ***
            id: 3
       comment: The B24 restaurant salad bar is quite good.
     vec_score: 0.8993000388145447
      ft_score: 0.11506980657577515
combined_score: 0.5071849226951599

The WHERE clause, in this case, limits the search to rows with "restaurant" in the comment column.

The availability of vector search, full-text search, and standard SQL operations in SingleStore allows you to combine them to identify important data in powerful ways.

Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is a method for enhancing the quality of results for text-based Generative AI (GenAI) applications. To do RAG with SingleStore:

In advance:

  • split up relevant text into chunks

  • get a vector for each chunk from your chosen LLM and place the chunks in SingleStore

Then, when a question comes in for which you wish to generate an answer:

  • get a vector for the question from the same LLM

  • search the stored vectors to find the top k matching vectors compared with the query vector

  • for each of these top vectors, get the associated text chunks

  • pass these chunks to the LLM’s question-answering API as context, along with the original question

This method helps the LLM generate high-quality results for a specialized topic, beyond what it was trained for. RAG is a recommended approach for use with SingleStore to generate high-quality answers for via your LLM for a specific situation or context.

Tools such as Langchain, or a Natural Language Processing (NLP) library like spaCy or NLTK can be used to make splitting documents into chunks take less time for the application developer.

How to Bulk Load Vectors

Vector data can be inserted into SingleStore from your application by using standard INSERT statements. For larger datasets, you can bulk load vectors and associated data with LOAD DATA or CREATE PIPELINE.

Use this format for the table:

CREATE TABLE vec_test(id INT, vec blob, shard(id));

Use a text file formatted using tabs (or some other character besides commas) between the columns. This is so the loader does not interpret the commas in the JSON array as column separators. Below is the sample data that has been cut and pasted into a text file (/tmp/vec_data.text in the examples below):

1 [0.18,0.36,0.54,0.73]
2 [0.62,0.70,0.15,0.31]

Loading Data Using a Local File

LOAD DATA infile '/tmp/vec_data.txt'
INTO TABLE vec_test
fields terminated by ' '
(id, @vec_text)
SET vec = JSON_ARRAY_PACK(@vec_text);

Using JSON_ARRAY_PACK is important because you need to convert the text form of the vector, a JSON array of numbers, to the internal packed binary format, an array of 32-bit floats. That is the format that the DOT_PRODUCT and EUCLIDEAN_DISTANCE functions require as input.

Validate the data in the table using the following SQL.

SELECT id, JSON_ARRAY_UNPACK(vec)
FROM vec_test;
+------+---------------------------------------------------+
| id   | JSON_ARRAY_UNPACK(vec)                            |
+------+---------------------------------------------------+
|    1 | [0.180000007,0.360000014,0.540000021,0.730000019] |
|    2 | [0.620000005,0.699999988,0.150000006,0.310000002] |
+------+---------------------------------------------------+

The rounding errors compared with the input file are normal due to the inexact nature of floating-point conversions from decimal strings.

Loading Large Datasets

To load a large dataset or continuous streaming data use a pipeline. For purposes of this example, begin by removing the data from the vectors table:

TRUNCATE TABLE vec_test;

Create and start the pipeline:

CREATE PIPELINE vec_pipeline AS
LOAD DATA fs '/tmp/vec_data.txt'
INTO TABLE vec_test fields terminated BY ' '
(id, @vec_text)
SET vec = JSON_ARRAY_PACK(@vec_text);
START PIPELINE vec_pipeline;

Validate the vector data using:

SELECT id, JSON_ARRAY_UNPACK(vec)
FROM vec_test;
+------+---------------------------------------------------+
| id   | JSON_ARRAY_UNPACK(vec)                            |
+------+---------------------------------------------------+
|    1 | [0.180000007,0.360000014,0.540000021,0.730000019] |
|    2 | [0.620000005,0.699999988,0.150000006,0.310000002] |
+------+---------------------------------------------------+

Stop or drop the pipeline when it is no longer needed. Stopping a pipeline will allow it to be used later, while dropping a pipeline will delete it permanently.

STOP PIPELINE vec_pipeline;
DROP PIPELINE vec_pipeline;

If using SingleStore Helios, store your data in a cloud object store. Consult the documentation on LOAD DATA and CREATE PIPELINE to adjust the file name to reference a file in S3, Azure Blob Store, or GCS.

Last modified: March 14, 2024

Was this article helpful?