Working with Vector Data

SingleStore supports vector database processing, which allows you to store and search vector data. A typical vector search locates the set of vectors that most closely match a query vector. Vectors usually come from objects: text, images, video, audio, etc. Vector database searches find data based on its content or meaning, even without exact matches. For example, vector search can allow a semantic search of text, where a query about "meals" could return information about "lunch" and "dinner" without using those words because they are similar in meaning.

SingleStore supports a native vector data type and indexed approximate-nearest-neighbor (ANN) search that provide high-performance vector search and easier building of vector-based applications. See Vector Indexing.

Some benefits of using SingleStore for vector database processing, as opposed to a specialized vector database system, are:

  • A broad array of standard modern database capabilities are available in SingleStore. These include SQL, fast distributed and parallel query processing, full-text search, extensibility, ACID transactions, high availability, disaster recovery, point-in-time recovery, broad connectivity support, etc.

  • Less data movement is needed between different data subsystems (e.g., caches, text search systems, SQL databases, and specialized vector databases) when all the data, including vector data, is stored in SingleStore.

  • Operational costs may be reduced since fewer data management tools and fewer copies of the data are needed.

  • Less specialized skills and reduced labor may be needed to run an application environment.

Vector Data

Vector data consists of arrays of numbers. Vector data can be stored in SingleStore using the native Vector Type. Vector embeddings, which are vectors describing the meaning of objects, can be obtained from many sources. For example, OpenAI has APIs that will give vector embeddings for text, and facenet is an open-source software package with a pre-trained neural network that can provide embeddings for face images.

Many large language models (LLMs) are now available, and can be used to provide vector embeddings for language to help implement semantic search, chatbots, and other applications. Some of the more well-known LLMs are:

  • GPT models from OpenAI

  • BERT by Google

  • LaMDA by Google

  • PaLM by Google

  • LLaMA by Meta AI

LLM technology is evolving quickly, and new sources of embeddings for language are rapidly becoming available. Moreover, it is possible to train your own models that produce embeddings for database information. Sources for models include Ollama, voyageAI, and huggingface. Regardless of the embedding source, these embeddings can be used for vector similarity search in SingleStore.

Vector embeddings can be stored as vector data in SingleStore. Vector embeddings are typically high-dimensional, with anywhere from a hundred to a few thousand dimensions.

A similarity search is the most common vector data operation. A vector similarity search in SingleStore uses SELECT…ORDER BY…LIMIT… queries that use vector similarity functions, including DOT_PRODUCT and EUCLIDEAN_DISTANCE. DOT_PRODUCT is the most commonly used similarity metric.

If the vectors are normalized to length one before saving to a database, and the query vector is also normalized to length one, then DOT_PRODUCT gives the cosine of the angle between the two vectors. DOT_PRODUCT will produce what is known as the cosine similarity metric for its two arguments. If the cosine is close to one, then the angle between the vectors is close to zero. A cosine value closer to one shows that the vectors are pointing in almost the same direction, so the objects they represent are similar.

The figure below illustrates when vectors are similar (cosine close to 1) and dissimilar (cosine close to 0):

Vector angles close to 1 and 0

Note

When using high-dimensional vectors the cosine similarity concept is still applicable.

Many vector models that produce vector embeddings will already have the vectors normalized to length one. In this case, when using DOT_PRODUCT, it is not necessary to normalize the vectors again.

Vector indexes can be used to improve performance of search over vectors. ANN searches give a fast, approximate answer to nearest neighbor queries. If exact answers are required an exact k-Nearest Neighbor (KNN) search can be performed simply by not using a vector index. See Vector Indexing for information on ANN search.

Output Format for Examples

Vectors may be output in JSON or binary format. Use JSON format for examples and for output readability. For production, use the default binary for efficiency.

To get JSON output which will match the examples, use the following command to output vectors in JSON.

SET vector_type_project_format = JSON;

Use the following command to set the output format back to binary.

SET vector_type_project_format = BINARY;

Loading, Inserting, and Updating Vectors

Vector data can be added to a database using the vector data type. For example using this table format:

CREATE TABLE comments(id INT,   
comment TEXT,   
comment_embedding VECTOR(4) not null,   
category VARCHAR(256));

The default element type for VECTOR is a 32-bit floating point number (F32). The options for element type are I18, I16, I32, I64, F32, and F64. Refer to Vector Type for more details.

Consider the following information, where @emb is a JSON array of numbers representing the vector embedding for the phrase "The cafeteria in building 35 has a great salad bar." Fictional vectors are used here to keep the examples small enough to read easily.

SET @_id = 1;
SET @cmt = "The cafeteria in building 35 has a great salad bar";
SET @emb = '[0.45, 0.55, 0.495, 0.5]';
SET @cat = "Food";

This data can be inserted into the comments table with a standard INSERT statement:

INSERT INTO comments VALUES (@_id, @cmt, @emb, @cat);

In addition, if you have vector data that is already in binary format, that data should be inserted into the database in binary format. Converting such data to JSON array of numbers format and inserting that data into the database will result in significantly reduced performance.

Binary data may be inserted in hexadecimal string format and converted using UNHEX before inserting it in a table. For example, suppose that client application software produces this hex string B806363CE90A363CCDCCCC3D77BE7F3F for the vector '[0.01111, 0.011111, 0.1, 0.999]'.

UNHEX can be used to convert a hex string that will work for insert into the database. For example:

SET @_id = 2;
SET @cmt = "I love the taco bar in the B16 cafeteria.";
SET @hs = "B806363CE90A363CCDCCCC3D77BE7F3F";
SET @cat = "Food";
INSERT INTO comments VALUES (@_id, @cmt, unhex(@hs), @cat);

Insert a third tuple into this table.

SET @_id = 3;
SET @cmt = "The B24 restaurant salad bar is quite good.";
SET @emb = '[0.1, 0.8, 0.2, 0.555]';
SET @cat = "Food";
INSERT INTO comments VALUES (@_id, @cmt, @emb, @cat);

View the contents of the table using this query.

To try out SQL queries, click the Playground icon to the right of the SQL listing.

If you are using a command line tool, you can use \G at the end of the query to get the results formatted as above.

Another way to insert binary vectors is to convert them to a packed binary format in the client application and submit them through a standard client API. Consult the corresponding API documentation for details on how to do this.

An alternate way to insert vector data is using pipelines. See Load Data with Pipelines and How to Load Data Using Pipelines.

Example Search Based on Vector Similarity

To find the most similar vectors in a query vector, use an ORDER BY… LIMIT… query. ORDER BY will arrange the vectors by their similarity score produced by a vector similarity function, with the closest matches at the top.

Suppose that the query is "restaurants with good salad," and for this query, the vector embedding API returned the vector '[0.44, 0.554, 0.34, 0.62]'. To find the top two matches for this query vector use the SQL below.

The infix operator <*> represents DOT_PRODUCT and the infix operator <-> represents EUCLIDEAN_DISTANCE.

The @query_vec variable is cast to a VECTOR and then to a BLOB to ensure that @query_vec is a valid VECTOR and to improve performance.

SET @query_vec = ('[0.44, 0.554, 0.34, 0.62]'):>VECTOR(4):>BLOB;
SELECT id, comment, category,
comment_embedding <*> @query_vec AS score
FROM comments
ORDER BY score DESC
LIMIT 2;
*** 1. row ***
      id: 1 
 comment: The cafeteria in building 35 has a great salad bar
category: Food   
    score: 0.9810000061988831
*** 2. row ***
      id: 3
 comment: The B24 restaurant salad bar is quite good.
category: Food
   score: 0.8993000388145447

Hybrid Filtering or Metadata Filtering

When building vector search applications, you may wish to filter on the fields of a record, with simple filters or via joins, in addition to applying vector similarity operations. Filtering on fields of a record using simple filters or joins along with applying vector similarity operations is often referred to as hybrid filtering or metadata filtering. SingleStore can handle this filtering using standard SQL operations.

For example, given the comments table, you can get the top three matches for a query vector that is in the category "Food" using this SQL:

SET @query_vec = ('[0.44, 0.554, 0.34, 0.62]'):>VECTOR(4):>BLOB;
SELECT id, comment, category,
comment_embedding <*> @query_vec AS score
FROM comments
WHERE category = "Food"
ORDER BY score DESC
LIMIT 3;
*** 1. row ***
      id: 1
 comment: The cafeteria in building 35 has a great salad bar
category: Food
   score: 0.9810000061988831
*** 2. row ***
      id: 3
 comment: The B24 restaurant salad bar is quite good.
category: Food
   score: 0.8993000388145447
*** 3. row ***
      id: 2
 comment: I love the taco bar in the B16 cafeteria.
category: Food
   score: 0.6644238829612732

Any SQL feature can be used along with vector similarity calculation using DOT_PRODUCT() and EUCLIDEAN_DISTANCE(). These include filtering, ordering, grouping, aggregation, window functions, full-text search, and more.

Create and Use a Vector Index

The command below creates a vector index on the comment_embedding field of the comments table. Vector indexes can be used to improve performance of queries over large vector data sets. Refer to Vector Indexing for information on creating and using vector indexes.

ALTER TABLE comments ADD VECTOR INDEX ivf(comment_embedding)
INDEX_OPTIONS '{"index_type":"IVF_PQFS"}';

Optionally optimize the table for best performance.

OPTIMIZE TABLE comments FULL;

The following query will use the vector index.

SET @query_vec = ('[0.44, 0.554, 0.34, 0.62]'):>VECTOR(4):>BLOB;
SELECT id, comment, category,
comment_embedding <*> @query_vec AS score
FROM comments
ORDER BY score DESC
LIMIT 2;
*** 1. row ***
      id: 1
 comment: The cafeteria in building 35 has a great salad bar
category: Food
   score: 0.9810000061988831
*** 2. row ***
      id: 3
 comment: The B24 restaurant salad bar is quite good.
category: Food
   score: 0.8993000388145447

The EXPLAIN command can be used to see the query plan and verify that the vector index is being used. In the example below, you can see INTERNAL_VECTOR_SEARCH in the ColumnStoreFilter row. This tells you that the vector index is being used.

SET @query_vec = ('[0.09, 0.14, 0.5, 0.05]'):>VECTOR(4):>BLOB;
EXPLAIN
SELECT id, comment, category,
comment_embedding <*> @query_vec AS score
FROM comments
ORDER BY score DESC
LIMIT 2;
+--------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                |
+--------------------------------------------------------------------------------------------------------+
| Project [remote_0.id, remote_0.comment, remote_0.category, remote_0.score]                             |
| TopSort limit:2 [remote_0.score DESC]                                                                  |
| Gather partitions:all alias:remote_0 parallelism_level:segment                                         |
| Project [comments.id, comments.comment, comments.category, 
    DOT_PRODUCT(comments.comment_embedding,(@query_vec:>vector(4, F32) NOT NULL)) AS score]              |
| TopSort limit:2 [DOT_PRODUCT(comments.comment_embedding,(@query_vec:>vector(4, F32) NOT NULL)) DESC]   |
| ColumnStoreFilter [INTERNAL_VECTOR_SEARCH(0, (@query_vec:>vector(4, F32) NOT NULL), 2, '') index]      |
| ColumnStoreScan test.comments, SORT KEY __UNORDERED () table_type:sharded_columnstore                  |
+--------------------------------------------------------------------------------------------------------+

Hybrid Search combines multiple search methods in one query and blends full-text search (which finds keyword matches) and vector search (which finds semantic matches) allowing search results to be (re-)ranked by a score that combines full-text and vector rankings. Hybrid Search - Re-ranking and Blending Searches provides two examples of hybrid search including an example of indexed hybrid search using SingleStore's Vector Indexing and full-text indexing (Working with Full-Text Search). See also this blog post on Hybrid Search.

Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is a method for enhancing the quality of results for text-based Generative AI (GenAI) applications. To do RAG with SingleStore:

In advance:

  • split up relevant text into chunks

  • get a vector for each chunk from your chosen LLM and place the chunks in SingleStore

Then, when a question comes in for which you wish to generate an answer:

  • get a vector for the question from the same LLM

  • search the stored vectors to find the top k matching vectors compared with the query vector

  • for each of these top vectors, get the associated text chunks

  • pass these chunks to the LLM’s question-answering API as context, along with the original question

This method helps the LLM generate high-quality results for a specialized topic, beyond what it was trained for. RAG is a recommended approach for use with SingleStore to generate high-quality answers for via your LLM for a specific situation or context.

Tools such as Langchain, or a Natural Language Processing (NLP) library like spaCy or NLTK can be used to make splitting documents into chunks take less time for the application developer.

Bulk Loading Vectors

Loading large vector data sets requires use of LOAD DATA and PIPELINES and formats such as Apache Parquet or Apache Avro™ to ensure good performance. See How to Bulk Load Vectors for information on loading larger vector data sets.

Last modified: May 31, 2024

Was this article helpful?