Working with Vector Data

SingleStore supports vector database processing, which allows you to store and search vector data. A typical vector search locates the set of vectors that most closely match a query vector. Vectors usually come from objects: text, images, video, audio, etc. Vector database searches find data based on its content or meaning, even without exact matches. For example, vector search can allow a semantic search of text, where a query about "meals" could return information about "lunch" and "dinner" without using those words because they are similar in meaning.

SingleStore supports a native vector data type and indexed approximate-nearest-neighbor (ANN) search that provide high-performance vector search and easier building of vector-based applications. See Vector Indexing.

Some benefits of using SingleStore for vector database processing, as opposed to a specialized vector database system, are:

  • A broad array of standard modern database capabilities is available in SingleStore. These include SQL, fast distributed and parallel query processing, full-text search, extensibility, ACID transactions, high availability, disaster recovery, point-in-time recovery, broad connectivity support, etc.

  • Less data movement is needed between different data subsystems (e.g., caches, text search systems, SQL databases, and specialized vector databases) when all the data, including vector data, is stored in SingleStore.

  • Operational costs may be reduced since fewer data management tools and fewer copies of the data are needed.

  • Less specialized skills and reduced labor may be needed to run an application environment.

Vector Data

Vector data consists of arrays of numbers. Vector data can be stored in SingleStore using the native Vector Type. Vector embeddings, which are vectors describing the meaning of objects, can be obtained from many sources. For example, OpenAI has APIs that will give vector embeddings for text, and facenet is an open-source software package with a pre-trained neural network that can provide embeddings for face images.

Many large language models (LLMs) are now available, and can be used to provide vector embeddings for language to help implement semantic search, chatbots, and other applications. Some of the more well-known LLMs are:

  • GPT models from OpenAI

  • BERT by Google

  • LaMDA by Google

  • PaLM by Google

  • LLaMA by Meta AI

LLM technology is evolving quickly, and new sources of embeddings for language are rapidly becoming available. Moreover, it is possible to train your own models that produce embeddings for database information. Sources for models include Ollama, Voyage AI, and Hugging Face. Regardless of the embedding source, these embeddings can be used for vector similarity search in SingleStore.

Vector embeddings can be stored as vector data in SingleStore. Vector embeddings are typically high-dimensional, with anywhere from a hundred to a few thousand dimensions.

A similarity search is the most common vector data operation. A vector similarity search in SingleStore uses SELECT…ORDER BY…LIMIT… queries that use vector similarity functions, including DOT_PRODUCT and EUCLIDEAN_DISTANCE. DOT_PRODUCT is the most commonly used similarity metric.

If the vectors are normalized to length one before they are saved to the database, and the query vector is also normalized to length one, then DOT_PRODUCT gives the cosine of the angle between the two vectors, known as the cosine similarity metric. A cosine close to one means the angle between the vectors is close to zero, so the vectors point in almost the same direction and the objects they represent are similar. Cosine similarity values range from 1 (vectors are similar) to -1 (vectors are dissimilar).
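To make this concrete, the following sketch in plain Python (a client-side illustration, not SingleStore code) shows that the dot product of two length-one vectors is their cosine similarity; the vectors are fictional, like those used in the examples below:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # Scale a vector to length one.
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

u = normalize([0.45, 0.55, 0.495, 0.5])
w = normalize([0.44, 0.554, 0.34, 0.62])

# For length-one vectors, the dot product is the cosine of the angle between them.
cosine = dot(u, w)
print(cosine)
```

Here the cosine is close to 1, so the two vectors point in nearly the same direction and the objects they represent are similar.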

The figure below illustrates when vectors are similar (cosine close to 1) and dissimilar (cosine close to 0):

[Figure: vector angles with cosine close to 1 and cosine close to 0]

Note

When using high-dimensional vectors, the cosine similarity concept still applies.

Many models that produce vector embeddings already normalize the vectors to length one. In this case, when using DOT_PRODUCT, it is not necessary to normalize the vectors again.
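As a client-side sketch (plain Python, with an arbitrary tolerance chosen for illustration), you can check whether your model's vectors are already length one before deciding to normalize:

```python
def is_normalized(v, tol=1e-3):
    # A vector has length one when the sum of its squared elements is (about) 1.
    return abs(sum(x * x for x in v) - 1.0) <= tol

print(is_normalized([0.45, 0.55, 0.495, 0.5]))  # True: length is ~1.0
print(is_normalized([1.0, 1.0, 0.0, 0.0]))      # False: length is sqrt(2)
```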

Vector indexes can be used to improve performance of searches over vectors. ANN searches give fast, approximate answers to nearest-neighbor queries. If exact answers are required, an exact k-Nearest Neighbor (KNN) search can be performed simply by not using a vector index. See Vector Indexing for information on ANN search.

The EUCLIDEAN_DISTANCE function calculates the Euclidean distance between two vectors. Euclidean distance measures the distance between vectors taking into account the length of the vectors. The results of EUCLIDEAN_DISTANCE range from 0 to infinity; values closer to 0 indicate that the vectors are similar, values closer to infinity indicate that vectors are less similar.
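The same calculation can be sketched in plain Python (an illustration, not a SingleStore API):

```python
import math

def euclidean_distance(a, b):
    # Square root of the sum of squared element-wise differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d_zero = euclidean_distance([0.1, 0.8, 0.2], [0.1, 0.8, 0.2])
d_far = euclidean_distance([0.1, 0.8, 0.2], [0.9, 0.1, 0.7])
print(d_zero, d_far)  # identical vectors give 0; the distance grows as vectors differ
```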

Output Format for Examples

Vectors may be output in JSON or binary format. The examples in this article use JSON format for readability; for production, use the default binary format for efficiency.

To match the output shown in the examples, use the following command to output vectors in JSON:

SET vector_type_project_format = JSON;

Use the following command to set the output format back to binary.

SET vector_type_project_format = BINARY;

Loading, Inserting, and Updating Vectors

Vector data can be added to a database using the VECTOR data type. For example, using this table format:

CREATE TABLE comments(id INT,
    comment TEXT,
    comment_embedding VECTOR(4),
    category VARCHAR(256));

The default element type for VECTOR is a 32-bit floating point number (F32). The options for element type are I8, I16, I32, I64, F32, and F64. Refer to Vector Type for more details.

Consider the following information, where @emb is a JSON array of numbers representing the vector embedding for the phrase "The cafeteria in building 35 has a great salad bar." Fictional vectors are used here to keep the examples small enough to read easily.

SET @_id = 1;
SET @cmt = "The cafeteria in building 35 has a great salad bar";
SET @emb = '[0.45, 0.55, 0.495, 0.5]';
SET @cat = "Food";

This data can be inserted into the comments table with a standard INSERT statement:

INSERT INTO comments VALUES (@_id, @cmt, @emb, @cat);

Vectors must be stored in SingleStore's internal binary vector format. In the example above, the JSON array string in @emb is converted to binary automatically when it is inserted into the VECTOR column. Storing vectors as strings in a non-VECTOR column and converting them to binary at query time delivers much slower performance and is not recommended.

In addition, if you have vector data that is already in binary format, insert it in binary format; converting it to a JSON array of numbers and inserting the JSON will significantly reduce performance.

Binary data may be inserted in hexadecimal string format and converted using UNHEX before inserting it in a table. For example, suppose that client application software produces this hex string B806363CE90A363CCDCCCC3D77BE7F3F for the vector '[0.01111, 0.011111, 0.1, 0.999]'.

UNHEX can be used to convert the hex string to binary for insertion into the database. For example:

SET @_id = 2;
SET @cmt = "I love the taco bar in the B16 cafeteria.";
SET @hs = "B806363CE90A363CCDCCCC3D77BE7F3F";
SET @cat = "Food";
INSERT INTO comments VALUES (@_id, @cmt, unhex(@hs), @cat);
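As an illustration of how client software might produce such a hex string, the following sketch packs the four values as little-endian 32-bit floats (the F32 layout used in the example above) with Python's struct module:

```python
import struct

vec = [0.01111, 0.011111, 0.1, 0.999]

# Pack as four little-endian 32-bit floats (F32), then render as an uppercase hex string.
hex_string = struct.pack('<4f', *vec).hex().upper()
print(hex_string)  # B806363CE90A363CCDCCCC3D77BE7F3F
```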

Insert a third row into this table.

SET @_id = 3;
SET @cmt = "The B24 restaurant salad bar is quite good.";
SET @emb = '[0.1, 0.8, 0.2, 0.555]';
SET @cat = "Food";
INSERT INTO comments VALUES (@_id, @cmt, @emb, @cat);

View the contents of the table using this query.

SET vector_type_project_format = JSON; /* to make vector output readable */
SELECT id, comment, comment_embedding, category
FROM comments
ORDER BY id;
*** 1. row ***
               id: 1
          comment: The cafeteria in building 35 has a great salad bar
comment_embedding: [0.449999988,0.550000012,0.495000005,0.5]
         category: Food
*** 2. row ***
               id: 2
          comment: I love the taco bar in the B16 cafeteria.
comment_embedding: [0.0111100003,0.0111109996,0.100000001,0.999000013]
         category: Food
*** 3. row ***
               id: 3
          comment: The B24 restaurant salad bar is quite good.
comment_embedding: [0.100000001,0.800000012,0.200000003,0.555000007]
         category: Food

If you are using a command line tool, you can use \G at the end of the query to get the results formatted as above.

Another way to insert binary vectors is to convert them to a packed binary format in the client application and submit them through a standard client API. Consult the corresponding API documentation for details on how to do this.
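For instance, a Python client could pack the vector into binary and pass it as a query parameter. The execute() call below is a hypothetical DB-API style sketch; consult your driver's documentation for the actual placeholder syntax:

```python
import struct

def pack_vector(values):
    # Convert a list of floats to little-endian 32-bit (F32) binary format.
    return struct.pack(f'<{len(values)}f', *values)

packed = pack_vector([0.1, 0.8, 0.2, 0.555])
print(len(packed))  # 16 bytes: 4 elements x 4 bytes each

# Hypothetical parameterized insert through a standard client API:
# cursor.execute(
#     "INSERT INTO comments VALUES (%s, %s, %s, %s)",
#     (4, "Another comment.", packed, "Food"),
# )
```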

An alternate way to insert vector data is using pipelines. See Load Data with Pipelines and How to Load Data Using Pipelines.

Example Search Based on Vector Similarity

To find the vectors most similar to a query vector, use an ORDER BY… LIMIT… query. ORDER BY arranges the vectors by the similarity score produced by a vector similarity function, with the closest matches at the top, and LIMIT restricts the results to the best matches.

Suppose that the query is "restaurants with good salad," and for this query, the vector embedding API returned the vector '[0.44, 0.554, 0.34, 0.62]'. To find the top two matches for this query vector use the SQL below.

In this query, the infix operator <*> represents DOT_PRODUCT. The infix operator <-> can be used for EUCLIDEAN_DISTANCE.

The @query_vec variable is cast to a VECTOR to ensure that @query_vec is a valid VECTOR and to improve performance.

SET @query_vec = ('[0.44, 0.554, 0.34, 0.62]'):>VECTOR(4);
SELECT id, comment, category,
comment_embedding <*> @query_vec AS score
FROM comments
ORDER BY score DESC
LIMIT 2;
*** 1. row ***
      id: 1
 comment: The cafeteria in building 35 has a great salad bar
category: Food
   score: 0.9810000061988831
*** 2. row ***
      id: 3
 comment: The B24 restaurant salad bar is quite good.
category: Food
   score: 0.8993000388145447

Note the following:

  • Search queries must use the same metric used in the creation of a vector index in order to use that index. That is, a query that uses DOT_PRODUCT can only use a vector index created with DOT_PRODUCT.

  • Vectors must be normalized to length 1 before using the DOT_PRODUCT function to obtain the cosine similarity metric. SingleStore recommends vectors be normalized to length 1 before they are saved to the database. Many models that produce vector embeddings produce vectors normalized to length one. In this case, it is not necessary to normalize the vectors again.

  • Using EUCLIDEAN_DISTANCE over vectors which have been normalized to length 1 may result in poor recall.

Hybrid Filtering or Metadata Filtering

When building vector search applications, you may wish to filter on the fields of a record, with simple filters or via joins, in addition to applying vector similarity operations. This combination is often referred to as hybrid filtering or metadata filtering. SingleStore handles this filtering using standard SQL operations.

For example, given the comments table, you can get the top three matches for a query vector that is in the category "Food" using this SQL:

SET @query_vec = ('[0.44, 0.554, 0.34, 0.62]'):>VECTOR(4);
SELECT id, comment, category,
comment_embedding <*> @query_vec AS score
FROM comments
WHERE category = "Food"
ORDER BY score DESC
LIMIT 3;
*** 1. row ***
      id: 1
 comment: The cafeteria in building 35 has a great salad bar
category: Food
   score: 0.9810000061988831
*** 2. row ***
      id: 3
 comment: The B24 restaurant salad bar is quite good.
category: Food
   score: 0.8993000388145447
*** 3. row ***
      id: 2
 comment: I love the taco bar in the B16 cafeteria.
category: Food
   score: 0.6644238829612732

Any SQL feature can be used along with vector similarity calculation using DOT_PRODUCT() and EUCLIDEAN_DISTANCE(). These include filtering, ordering, grouping, aggregation, window functions, full-text search, and more.

Create and Use a Vector Index

The command below creates a vector index on the comment_embedding field of the comments table. Vector indexes can be used to improve performance of queries over large vector data sets. Refer to Vector Indexing for information on creating and using vector indexes.

ALTER TABLE comments ADD VECTOR INDEX ivf(comment_embedding)
INDEX_OPTIONS '{"index_type":"IVF_PQFS"}';

Optionally optimize the table for best performance.

OPTIMIZE TABLE comments FULL;

The following query will use the vector index.

SET @query_vec = ('[0.44, 0.554, 0.34, 0.62]'):>VECTOR(4);
SELECT id, comment, category,
comment_embedding <*> @query_vec AS score
FROM comments
ORDER BY score DESC
LIMIT 2;
*** 1. row ***
      id: 1
 comment: The cafeteria in building 35 has a great salad bar
category: Food
   score: 0.9810000061988831
*** 2. row ***
      id: 3
 comment: The B24 restaurant salad bar is quite good.
category: Food
   score: 0.8993000388145447

The EXPLAIN command can be used to see the query plan and verify that the vector index is being used. In the example below, you can see INTERNAL_VECTOR_SEARCH in the ColumnStoreFilter row. This tells you that the vector index is being used.

SET @query_vec = ('[0.09, 0.14, 0.5, 0.05]'):>VECTOR(4);
EXPLAIN
SELECT id, comment, category,
comment_embedding <*> @query_vec AS score
FROM comments
ORDER BY score DESC
LIMIT 2;
+--------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                |
+--------------------------------------------------------------------------------------------------------+
| Project [remote_0.id, remote_0.comment, remote_0.category, remote_0.score]                             |
| TopSort limit:2 [remote_0.score DESC]                                                                  |
| Gather partitions:all alias:remote_0 parallelism_level:segment                                         |
| Project [comments.id, comments.comment, comments.category, 
    DOT_PRODUCT(comments.comment_embedding,(@query_vec:>vector(4, F32))) AS score]                       |
| TopSort limit:2 [DOT_PRODUCT(comments.comment_embedding,(@query_vec:>vector(4, F32))) DESC]            |
| ColumnStoreFilter [INTERNAL_VECTOR_SEARCH(0, (@query_vec:>vector(4, F32)), 2, '') index]               |
| ColumnStoreScan test.comments, SORT KEY __UNORDERED () table_type:sharded_columnstore                  |
+--------------------------------------------------------------------------------------------------------+

Hybrid Search combines multiple search methods in one query. It blends full-text search (which finds keyword matches) with vector search (which finds semantic matches), allowing search results to be (re-)ranked by a score that combines the full-text and vector rankings. Hybrid Search - Re-ranking and Blending Searches provides two examples of hybrid search, including an example of indexed hybrid search using SingleStore's Vector Indexing and full-text indexing (Working with Full-Text Search). See also this blog post on Hybrid Search.

Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is a method for enhancing the quality of results for text-based Generative AI (GenAI) applications. To do RAG with SingleStore:

In advance:

  • split up relevant text into chunks

  • get a vector for each chunk from your chosen LLM and place the chunks in SingleStore

Then, when a question comes in for which you wish to generate an answer:

  • get a vector for the question from the same LLM

  • search the stored vectors to find the top k matching vectors compared with the query vector

  • for each of these top vectors, get the associated text chunks

  • pass these chunks to the LLM’s question-answering API as context, along with the original question

This method helps the LLM generate high-quality results for a specialized topic, beyond what it was trained for. RAG is a recommended approach for generating high-quality answers for a specific situation or context with SingleStore and your LLM.
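The retrieval step in this flow can be sketched in plain Python. This in-memory ranking is only a stand-in for the SQL ORDER BY… LIMIT… query shown earlier, and the chunk embeddings are the fictional vectors from the comments examples:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Text chunks with fictional embeddings, as they might be stored in SingleStore.
chunks = [
    ("The cafeteria in building 35 has a great salad bar", [0.45, 0.55, 0.495, 0.5]),
    ("I love the taco bar in the B16 cafeteria.", [0.01111, 0.011111, 0.1, 0.999]),
    ("The B24 restaurant salad bar is quite good.", [0.1, 0.8, 0.2, 0.555]),
]

def top_k_chunks(query_vec, k=2):
    # Rank chunks by dot-product similarity, highest first (ORDER BY score DESC LIMIT k).
    ranked = sorted(chunks, key=lambda c: dot(c[1], query_vec), reverse=True)
    return [text for text, _ in ranked[:k]]

# The question's embedding comes from the same (hypothetical) LLM; the matching
# chunks would then be passed to the LLM's question-answering API as context.
context = top_k_chunks([0.44, 0.554, 0.34, 0.62])
print(context)
```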

Tools such as LangChain, or Natural Language Processing (NLP) libraries like spaCy and NLTK, can reduce the effort of splitting documents into chunks.

Bulk Loading Vectors

Loading large vector data sets requires the use of LOAD DATA and PIPELINES, along with formats such as Apache Parquet or Apache Avro™, to ensure good performance. See How to Bulk Load Vectors for information on loading larger vector data sets.

Tracking Vector Index Memory Use

Vector index memory use can be tracked using SHOW STATUS EXTENDED. Refer to Vector Indexing for more information.

Last modified: December 12, 2024
