Working with Vector Data

SingleStore supports vector database processing, which allows you to store and search vector data. A typical vector search locates the set of vectors that most closely match a query vector. Vectors usually come from objects: text, images, video, audio, etc. Vector database searches find data based on its content or meaning, even without exact matches. For example, vector search can allow a semantic search of text, where a query about "meals" could return information about "lunch" and "dinner" without using those words because they are similar in meaning.

Some benefits of using SingleStore for vector database processing, as opposed to a specialized vector database system, are:

A broad array of standard modern database capabilities are available in SingleStore. These include SQL, fast distributed and parallel query processing, full-text search, extensibility, ACID transactions, high availability, disaster recovery, point-in-time recovery, broad connectivity support, etc.
Less data movement is needed between different data subsystems (e.g., caches, text search systems, SQL databases, and specialized vector databases) when all the data, including vector data, is stored in SingleStore.
Operational costs may be reduced since fewer data management tools and fewer copies of the data are needed.
Less specialized skills and reduced labor may be needed to run an application environment.

Vector Data

Vector data consists of arrays of numbers. Vector data can be stored in SingleStore as blobs (Binary String Types). Vector embeddings, which are vectors describing the meaning of objects, can be obtained from many sources. For example, OpenAI has APIs that will give vector embeddings for text, and facenet is an open-source software package with a pre-trained neural network that can provide embeddings for face images.

Many large language models (LLMs) are now available and can be used to provide vector embeddings for language to help implement semantic search, chatbots, and other applications. Some of the more well-known LLMs are:

GPT models from OpenAI
BERT by Google
LaMDA by Google
PaLM by Google
LLaMA by Meta AI

LLM technology is evolving quickly, and new sources of embeddings for language are rapidly becoming available. Moreover, it is possible to train your own models that produce embeddings for database information. Sources for models include Ollama, voyageAI, and huggingface. Regardless of the embedding source, these embeddings can be used for vector similarity search in SingleStore.

Vector embeddings can be stored as vector data in SingleStore. Vector embeddings are typically high-dimensional, with anywhere from a hundred to a few thousand dimensions.

Vector Similarity Search

A similarity search is the most common vector data operation. A vector similarity search in SingleStore uses SELECT…ORDER BY…LIMIT… queries that use vector similarity functions, including DOT_PRODUCT and EUCLIDEAN_DISTANCE. DOT_PRODUCT is the most commonly used similarity metric.

If the vectors are normalized to length one before saving to a database, and the query vector is also normalized to length one, then DOT_PRODUCT gives the cosine of the angle between the two vectors. DOT_PRODUCT will produce what is known as the cosine similarity metric for its two arguments. If the cosine is close to one, then the angle between the vectors is close to zero. A cosine value closer to one shows that the vectors are pointing in almost the same direction, so the objects they represent are similar. Cosine similarity values range from 1 (vectors are similar) to -1 (vectors are dissimilar).

The figure below illustrates when vectors are similar (cosine close to 1) and dissimilar (cosine close to 0):

Note

When using high-dimensional vectors the cosine similarity concept is still applicable.

Many vector models that produce vector embeddings will already have the vectors normalized to length one. In this case, when using DOT_PRODUCT, it is not necessary to normalize the vectors again.

The EUCLIDEAN_DISTANCE function calculates the Euclidean distance between two vectors. Euclidean distance measures the distance between vectors taking into account the length of the vectors. The results of EUCLIDEAN_DISTANCE range from 0 to infinity; values closer to 0 indicate that the vectors are similar, values closer to infinity indicate that vectors are less similar.

Loading, Inserting, and Updating Vectors

Vector data can be added to a database as blob data containing packed arrays of numbers. For example, using this table format:

SQL

CREATE TABLE comments_b(id INT,
  comment TEXT,
  vect BLOB,
  category VARCHAR(256));

Consider the following information where @jv is a JSON array of numbers representing the vector embedding for the phrase "The cafeteria in building 35 has a great salad bar." Fictional vectors are used here to keep the examples small enough to read easily.

SQL

SET @_id = 1;
SET @cmt = "The cafeteria in building 35 has a great salad bar";
SET @jv = '[0.45, 0.55, 0.495, 0.5]';
SET @cat = "Food";

This data can be inserted into the comments table with a standard INSERT statement:

This data can be inserted into the comments table with a standard INSERT, using JSON_ARRAY_PACK:

SQL

INSERT INTO comments_b VALUES (@_id, @cmt, JSON_ARRAY_PACK(@jv), @cat);

It is important to convert vectors into an internal binary vector format before inserting them into a database. In this example, the conversion is done by calling JSON_ARRAY_PACK in the INSERT statement. Storing vectors as strings and converting them to binary during query will deliver much slower performance and is not recommended.

Binary data may be inserted in hexadecimal string format and converted using UNHEX before inserting it in a table. For example, suppose that client application software produces this hex string B806363CE90A363CCDCCCC3D77BE7F3F for the vector '[0.01111, 0.011111, 0.1, 0.999]'.

UNHEX can be used to convert a hex string that will work for insert into the database. For example:

SQL

SET @_id = 2;
SET @cmt = "I love the taco bar in the B16 cafeteria.";
SET @hs = "B806363CE90A363CCDCCCC3D77BE7F3F";
SET @cat = "Food";

SQL

INSERT INTO comments_b VALUES (@_id, @cmt, unhex(@hs), @cat);

Insert a third tuple into this table.

SQL

SET @_id = 3;
SET @cmt = "The B24 restaurant salad bar is quite good.";
SET @jv = '[0.1, 0.8, 0.2, 0.555]';
SET @cat = "Food";

SQL

INSERT INTO comments_b VALUES (@_id, @cmt, JSON_ARRAY_PACK(@jv), @cat);

View the contents of the table using this query.

SQL

SELECT id, comment, JSON_ARRAY_UNPACK(vect), category
FROM comments_b
ORDER BY id;

*** 1. row ***
                     id: 1
                comment: The cafeteria in building 35 has a great salad bar
JSON_ARRAY_UNPACK(vect): [0.449999988,0.550000012,0.495000005,0.5]
               category: Food
*** 2. row ***
                     id: 2
                comment: I love the taco bar in the B16 cafeteria.
JSON_ARRAY_UNPACK(vect): [0.0111100003,0.0111109996,0.100000001,0.999000013]
               category: Food
*** 3. row ***
                     id: 3
                comment: The B24 restaurant salad bar is quite good.
JSON_ARRAY_UNPACK(vect): [0.100000001,0.800000012,0.200000003,0.555000007]
               category: Food

If you are using a command line tool, you can use \G at the end of the query to get the results formatted as above.

Another way to insert binary vectors is to convert them to a packed binary format in the client application and submit them through a standard client API. Consult the corresponding API documentation for details on how to do this.

An alternate way to insert vector data is using pipelines. See About SingleStore Pipelines. Vector data has to be converted into packed binary before inserting in a column in the target table. Use HEX or convert the vector data to binary before inserting.

Example Search Based on Vector Similarity

To find the most similar vectors in a query vector, use an ORDER BY… LIMIT… query. ORDER BY will arrange the vectors by their similarity score produced by a vector similarity function, with the closest matches at the top.

Suppose that the query is "restaurants with good salad," and for this query, the vector embedding API returned the vector '[0.44, 0.554, 0.34, 0.62]'. To find the top two matches for this query vector use the following SQL.

SQL

SET @query_vec = JSON_ARRAY_PACK('[0.44, 0.554, 0.34, 0.62]');

SELECT id, comment, category, DOT_PRODUCT(vect, @query_vec) AS score
    FROM comments_b
    ORDER BY score DESC
    LIMIT 2;

*** 1. row ***
      id: 1
 comment: The cafeteria in building 35 has a great salad bar
category: Food
   score: 0.9810000061988831
*** 2. row ***
      id: 3
 comment: The B24 restaurant salad bar is quite good.
category: Food
   score: 0.8993000388145447

Search queries must use the same metric used in the creation of a vector index in order to use that index. That is, a query that uses DOT_PRODUCT can only use a vector index created with DOT_PRODUCT.
Vectors must be normalized to length 1 before using the DOT_PRODUCT function to obtain the cosine similarity metric. SingleStore recommends vectors be normalized to length 1 before they are saved to the database. Many models that produce vector embeddings produce vectors normalized to length one. In this case, it is not necessary to normalize the vectors again.
Using EUCLIDEAN_DISTANCE over vectors which have been normalized to length 1 may result in poor recall.

Hybrid Filtering or Metadata Filtering

When building vector search applications, you may wish to filter on the fields of a record, with simple filters or via joins, in addition to applying vector similarity operations. Filtering on fields of a record using simple filters or joins along with applying vector similarity operations is often referred to as hybrid filtering or metadata filtering. SingleStore can handle this filtering using standard SQL operations.

For example, given the comments table, you can get the top three matches for a query vector that is in the category "Food" using this SQL:

SQL

SET @query_vec = JSON_ARRAY_PACK('[ a query vector from your application ]');

SELECT id, comment, category, DOT_PRODUCT(vect, @query_vec) AS score
    FROM comments_b
    WHERE category = "Food"
    ORDER BY score DESC
    LIMIT 3;

Any SQL feature can be used along with vector similarity calculation using DOT_PRODUCT() and EUCLIDEAN_DISTANCE(). These include filtering, ordering, grouping, aggregation, window functions, full-text search, and more.

Hybrid Search

Hybrid searching uses a combination of vector similarity search and full-text search. For example, this may be useful when performing a semantic search with a high priority given to something with a specific keyword. An example follows.

This command adds a full-text index on the comment field:

SQL

ALTER TABLE comments_b ADD FULLTEXT(comment);

Run the following command to ensure the values above are indexed.

SQL

OPTIMIZE TABLE comments_b FLUSH;

The commands below first create a query vector and then combines the full-text and DOT_PRODUCT similarity scores for the query with equal weight and returns the top two matches. The query will use the full-text index when matching the comment against "restaurant."

The row with id 3 has the word "restaurant" in it, so it gets a boost from the full-text search here and is delivered at the top of the ranking.

SQL

SET @query_vec = JSON_ARRAY_PACK('[0.44, 0.554, 0.34, 0.62]');

SELECT id, comment, DOT_PRODUCT(@query_vec, vect) AS vec_score,
  MATCH(comment) AGAINST ("restaurant") AS ft_score,
  (vec_score + ft_score) / 2 AS combined_score
  FROM comments_b
  ORDER BY combined_score DESC
  LIMIT 2;

*** 1. row ***
            id: 3
       comment: The B24 restaurant salad bar is quite good.
     vec_score: 0.8993000388145447
      ft_score: 0.11506980657577515
combined_score: 0.5071849226951599
*** 2. row ***
            id: 1
       comment: The cafeteria in building 35 has a great salad bar
     vec_score: 0.9810000061988831
      ft_score: 0
combined_score: 0.49050000309944153

To use the full-text search index to limit the search to just rows with specific keywords, add a WHERE clause with a MATCH…AGAINST… expression. For example:

SQL

SET @query_vec = JSON_ARRAY_PACK('[0.44, 0.554, 0.34, 0.62]');

SELECT id, comment, DOT_PRODUCT(@query_vec, vect) AS vec_score,  
    MATCH(comment) AGAINST ("restaurant") AS ft_score,  
    (vec_score + ft_score) / 2 AS combined_score
    FROM comments_b
    WHERE MATCH(comment) AGAINST ("restaurant")
    ORDER BY combined_score DESC
    LIMIT 2;

*** 1. row ***
            id: 3
       comment: The B24 restaurant salad bar is quite good.
     vec_score: 0.8993000388145447
      ft_score: 0.11506980657577515
combined_score: 0.5071849226951599

The WHERE clause, in this case, limits the search to rows with "restaurant" in the comment column.

The availability of vector search, full-text search, and standard SQL operations in SingleStore allows you to combine them to identify important data in powerful ways.

Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is a method for enhancing the quality of results for text-based Generative AI (GenAI) applications. To do RAG with SingleStore:

In advance:

split up relevant text into chunks
get a vector for each chunk from your chosen LLM and place the chunks in SingleStore

Then, when a question comes in for which you wish to generate an answer:

get a vector for the question from the same LLM
search the stored vectors to find the top k matching vectors compared with the query vector
for each of these top vectors, get the associated text chunks
pass these chunks to the LLM’s question-answering API as context, along with the original question

This method helps the LLM generate high-quality results for a specialized topic, beyond what it was trained for. RAG is a recommended approach for use with SingleStore to generate high-quality answers via your LLM for a specific situation or context.

Tools such as Langchain, or a Natural Language Processing (NLP) library like spaCy or NLTK can be used to make splitting documents into chunks take less time for the application developer.

How to Bulk Load Vectors

Vector data can be inserted into SingleStore from your application by using standard INSERT statements. For larger datasets, you can bulk load vectors and associated data with LOAD DATA or CREATE PIPELINE.

Use this format for the table:

SQL

CREATE TABLE vec_test(id INT, vec blob, shard(id));

Use a text file formatted using tabs (or some other character besides commas) between the columns. This is so the loader does not interpret the commas in the JSON array as column separators. Below is the sample data that has been cut and pasted into a text file (/tmp/vec_data.text in the examples below):

1 [0.18,0.36,0.54,0.73]
2 [0.62,0.70,0.15,0.31]

Loading Data Using a Local File

SQL

LOAD DATA infile '/tmp/vec_data.txt'
INTO TABLE vec_test
fields terminated by ' '
(id, @vec_text)
SET vec = JSON_ARRAY_PACK(@vec_text);

Using JSON_ARRAY_PACK is important because you need to convert the text form of the vector, a JSON array of numbers, to the internal packed binary format, an array of 32-bit floats. That is the format that the DOT_PRODUCT and EUCLIDEAN_DISTANCE functions require as input.

Validate the data in the table using the following SQL.

SQL

SELECT id, JSON_ARRAY_UNPACK(vec)
FROM vec_test;

+------+---------------------------------------------------+
| id   | JSON_ARRAY_UNPACK(vec)                            |
+------+---------------------------------------------------+
|    1 | [0.180000007,0.360000014,0.540000021,0.730000019] |
|    2 | [0.620000005,0.699999988,0.150000006,0.310000002] |
+------+---------------------------------------------------+

The rounding errors compared with the input file are normal due to the inexact nature of floating-point conversions from decimal strings.

Loading Large Datasets

To load a large dataset or continuous streaming data use a pipeline. For purposes of this example, begin by removing the data from the vectors table:

SQL

TRUNCATE TABLE vec_test;

Create and start the pipeline:

SQL

CREATE PIPELINE vec_pipeline AS
LOAD DATA fs '/tmp/vec_data.txt'
INTO TABLE vec_test fields terminated BY ' '
(id, @vec_text)
SET vec = JSON_ARRAY_PACK(@vec_text);

SQL

START PIPELINE vec_pipeline;

Validate the vector data using:

SQL

SELECT id, JSON_ARRAY_UNPACK(vec)
FROM vec_test;

+------+---------------------------------------------------+
| id   | JSON_ARRAY_UNPACK(vec)                            |
+------+---------------------------------------------------+
|    1 | [0.180000007,0.360000014,0.540000021,0.730000019] |
|    2 | [0.620000005,0.699999988,0.150000006,0.310000002] |
+------+---------------------------------------------------+

Stop or drop the pipeline when it is no longer needed. Stopping a pipeline will allow it to be used later, while dropping a pipeline will delete it permanently.

SQL

STOP PIPELINE vec_pipeline;

DROP PIPELINE vec_pipeline;

If using SingleStore Helios, store your data in a cloud object store. Consult the documentation on LOAD DATA and CREATE PIPELINE to adjust the file name to reference a file in S3, Azure Blob Store, or GCS.

Blog Post: Why Your Vector Database Should Not Be a Vector Database

Working with Vector Data

On this page

Vector Data

Vector Similarity Search

Loading, Inserting, and Updating Vectors

Example Search Based on Vector Similarity

Hybrid Filtering or Metadata Filtering

Hybrid Search

Retrieval-Augmented Generation (RAG)

How to Bulk Load Vectors

Loading Data Using a Local File

Loading Large Datasets

Was this article helpful?

On this page

Was this article helpful?

Working with Vector Data

On this page

Vector Data

Vector Similarity Search

Loading, Inserting, and Updating Vectors

Example Search Based on Vector Similarity

Hybrid Filtering or Metadata Filtering

Hybrid Search

Retrieval-Augmented Generation (RAG)

How to Bulk Load Vectors

Loading Data Using a Local File

Loading Large Datasets

Related Topics

Was this article helpful?

On this page

Was this article helpful?