Cosine Similarity and Cosine Distance

Cosine similarity is a measure of the similarity of two vectors. Cosine similarity is commonly used in semantic text search, generative AI, searches of images and audio files, and other applications.

In these applications, images, blocks of text, audio files, and other content are converted to vectors. These vectors can be stored and then searched over using measures of vector similarity, including cosine similarity.

SingleStore provides a Vector Type for storing vector data and Vector Indexing to speed up vector search when searching over large sets of vector data.

Also see Working with Vector Data for more information about and examples of working with vector data in SingleStore.

Cosine Similarity Calculation: Cosine similarity can be calculated in two different ways in SingleStore. Cosine similarity is the dot product of two vectors divided by the product of their lengths, so when the input vectors all have length 1, it can be calculated with the DOT_PRODUCT function alone. SingleStore recommends normalizing all vectors before storing them in the database and then using DOT_PRODUCT to calculate cosine similarity. Vector Normalization provides a function for normalizing vectors in SingleStore.

Using DOT_PRODUCT to calculate cosine similarity gives the best performance since DOT_PRODUCT does not have to normalize its input every time it is called and because indexed vector search can be used. 
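
As a quick illustration, the infix DOT_PRODUCT operator (<*>) returns the cosine similarity of two unit-length vectors directly. The vectors and variable names below are made up for this sketch; both vectors have length exactly 1.

SET @a = ('[1, 0, 0, 0]'):>VECTOR(4);
SET @b = ('[0.6, 0.8, 0, 0]'):>VECTOR(4);
SELECT @a <*> @b AS score;

Because both vectors have length 1, the result, 0.6, is both their dot product and their cosine similarity.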

Important

When performing cosine similarity calculations with DOT_PRODUCT, the vectors should be normalized to length 1 before storing them in the database.

Example 1 below demonstrates calculating cosine similarity using DOT_PRODUCT over a set of vectors with length 1.

In some cases, it is necessary to calculate cosine similarity directly or to calculate cosine distance. Example 2 defines functions for cosine_similarity and cosine_distance. Cosine distance is defined as 1 minus cosine similarity (cosine_distance = 1 - cosine_similarity). These functions are more expensive than DOT_PRODUCT.

Euclidean distance is another common measure of vector similarity. See EUCLIDEAN_DISTANCE for information on using the Euclidean distance metric in SingleStore.
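
For comparison, a sketch of a top-K query using EUCLIDEAN_DISTANCE (assuming the comments table and @query_vec defined in the examples below) sorts in ascending order, since smaller distances mean closer matches:

SELECT id, comment,
EUCLIDEAN_DISTANCE(comment_embedding, @query_vec) AS distance
FROM comments
ORDER BY distance ASC
LIMIT 2;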

Important

When finding the top K closest matches, descending (DESC) sort order must be used with DOT_PRODUCT and cosine similarity. The sort order for cosine distance and Euclidean distance is the opposite of the sort order for cosine similarity; use ascending (ASC) sort order for cosine distance and EUCLIDEAN_DISTANCE.

Examples

Example 1 - Cosine Similarity using DOT_PRODUCT (<*>)

A common use case of cosine similarity is to calculate the similarity of a set of vectors to a query vector. The example below calculates the cosine similarity between a query vector (@query_vec) and a set of normalized vectors using the infix DOT_PRODUCT operator (<*>).

Note

This example assumes that the vectors are already normalized and treats them as having length 1. In fact, the vectors have length very close to 1, but not exactly 1; they have been simplified to improve readability. In practice, it is important to use fully normalized vectors.

This example uses a comments table that includes text comments and a vector representing an embedding for each comment. Vector embeddings are vectors that describe the meaning of objects and are a common product of large language models (LLMs). See Working with Vector Data for more information about embeddings and examples of using vector data in SingleStore.
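
The CREATE TABLE statement is not shown here. A minimal definition, with column names and types inferred from the INSERT statement and the queries in this example (a sketch rather than the exact DDL), could be:

CREATE TABLE comments(
    id INT PRIMARY KEY,
    comment TEXT,
    comment_embedding VECTOR(4),
    category VARCHAR(256));

The following INSERT statement adds three sample comments and their embeddings: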

INSERT INTO comments VALUES
    (1, "The cafeteria in building 35 has a great salad bar",
        '[0.45, 0.55, 0.495, 0.5]',
        "Food"),
    (2, "I love the taco bar in the B16 cafeteria.",
        '[0.01111, 0.01111, 0.1, 0.999]',
        "Food"),
    (3, "The B24 restaurant salad bar is quite good.",
        '[0.1, 0.8, 0.2, 0.555]',
        "Food");

This SQL creates a query vector and uses an ORDER BY … LIMIT query to find the two comment_embeddings from the comments table which are the most similar to the query vector. A more comprehensive version of this example is found in Working with Vector Data.

The @query_vec variable is cast to a VECTOR to ensure that @query_vec is a valid VECTOR and to improve performance.

SET @query_vec = ('[0.44, 0.554, 0.34, 0.62]'):>VECTOR(4);
SELECT id, comment, category,
comment_embedding <*> @query_vec AS score
FROM comments
ORDER BY score DESC
LIMIT 2;
*** 1. row ***
      id: 1
 comment: The cafeteria in building 35 has a great salad bar
category: Food
   score: 0.9810000061988831
*** 2. row ***
      id: 3
 comment: The B24 restaurant salad bar is quite good.
category: Food
   score: 0.8993000388145447

Example 2 - Cosine Similarity and Cosine Distance

Note

Using the cosine_similarity and cosine_distance functions provided below has a performance cost because the functions normalize their input vectors on every call and cannot use vector indexes.

The following user-defined function (UDF) (see CREATE FUNCTION (UDF)) calculates cosine similarity.

DELIMITER //
CREATE OR REPLACE FUNCTION cosine_similarity(v1 VECTOR(4), v2 VECTOR(4))
   RETURNS FLOAT
AS
BEGIN
  RETURN DOT_PRODUCT(normalize(v1), normalize(v2));
END //
DELIMITER ;

This function uses DOT_PRODUCT and the normalize function defined in Vector Normalization to compute the cosine similarity of two vectors of length 4.

You can adapt this code to suit your application by changing the vector length. If you work with different vector lengths, you can create a function with a different name for each desired length.
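
For reference, a minimal sketch of a normalize UDF along the lines of the one in Vector Normalization is shown below. It is written for vectors of length 4 and assumes SingleStore's vector built-ins VECTOR_MUL, VECTOR_ELEMENTS_SUM, and SCALAR_VECTOR_MUL; it is not necessarily identical to the function in that topic.

DELIMITER //
CREATE OR REPLACE FUNCTION normalize(v VECTOR(4))
   RETURNS VECTOR(4)
AS
DECLARE
  -- Euclidean length of v: the square root of the sum of the squared elements.
  length FLOAT = SQRT(VECTOR_ELEMENTS_SUM(VECTOR_MUL(v, v)));
BEGIN
  -- Scale v by 1/length so the result has length 1.
  RETURN SCALAR_VECTOR_MUL(1 / length, v);
END //
DELIMITER ;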

The UDF below calculates cosine_distance using the cosine_similarity UDF defined above.

DELIMITER //
CREATE OR REPLACE FUNCTION cosine_distance(v1 VECTOR(4), v2 VECTOR(4))
   RETURNS FLOAT
AS
BEGIN
  RETURN 1 - cosine_similarity(v1, v2);
END //
DELIMITER ;

With cosine similarity, a higher score means the vectors have greater similarity. The opposite is true of cosine distance, where a lower score (a smaller distance) means the vectors are more similar.

Important

For top K queries, you must use a descending (DESC) sort order when using cosine similarity and an ascending (ASC) sort order when using cosine distance.

The following query calculates the cosine similarity of the comment embedding vectors to the query vector, @query_vec, and orders the results in descending (DESC) order of the score.

SET @query_vec = ('[0.44, 0.554, 0.34, 0.62]'):>VECTOR(4);
SELECT id, comment,
cosine_similarity(comment_embedding, @query_vec) AS score
FROM comments 
ORDER BY score DESC;
+------+----------------------------------------------------+----------+
| id   | comment                                            | score    |
+------+----------------------------------------------------+----------+
|    1 | The cafeteria in building 35 has a great salad bar | 0.980735 |
|    3 | The B24 restaurant salad bar is quite good.        | 0.899957 |
|    2 | I love the taco bar in the B16 cafeteria.          | 0.66153  |
+------+----------------------------------------------------+----------+

The query below calculates the cosine_distance between the query vector and the vectors in the table. The results are ordered in ascending (ASC) order as is appropriate for cosine distance. Recall that cosine distance is 1 - cosine similarity.

Note

ASC is the default ordering for ORDER BY clauses.

SET @query_vec = ('[0.44, 0.554, 0.34, 0.62]'):>VECTOR(4);
SELECT id, comment,
cosine_distance(comment_embedding, @query_vec) AS score
FROM comments
ORDER BY score ASC;
+------+----------------------------------------------------+-----------+
| id   | comment                                            | score     |
+------+----------------------------------------------------+-----------+
|    1 | The cafeteria in building 35 has a great salad bar | 0.0192653 |
|    3 | The B24 restaurant salad bar is quite good.        | 0.100043  |
|    2 | I love the taco bar in the B16 cafeteria.          | 0.33847   |
+------+----------------------------------------------------+-----------+

Finally, observe that while the scores returned by the cosine_distance query differ from those returned by the cosine_similarity query, the ordering of the comments in the results is the same.
