# Cosine Similarity and Cosine Distance

Cosine similarity is a measure of the similarity of two vectors. Cosine similarity is commonly used in semantic text search, generative AI, searches of images and audio files, and other applications.

In these applications, images, blocks of text, audio files, and other content are converted to vectors. These vectors can be stored and then searched over using measures of vector similarity, including cosine similarity.

SingleStore provides a Vector Type for storing vector data and Vector Indexing to speed up vector search when searching over large sets of vector data.

Also see Working with Vector Data for more information about and examples of working with vector data in SingleStore.

## Cosine Similarity Calculation

Cosine similarity can be calculated in two ways in SingleStore. When all input vectors have length 1, cosine similarity can be calculated with the `DOT_PRODUCT` function. SingleStore recommends normalizing all vectors before storing them in the database and then using `DOT_PRODUCT` to calculate cosine similarity. Vector Normalization provides a function for normalizing vectors in SingleStore.

Using `DOT_PRODUCT` to calculate cosine similarity gives the best performance since `DOT_PRODUCT` does not have to normalize its input every time it is called and because indexed vector search can be used.
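The relationship between the two calculations can be illustrated outside the database. The following is a minimal Python sketch, not SingleStore code; the helper names `dot`, `normalize`, and `cosine_similarity` are hypothetical and chosen only for this illustration. It shows that the dot product of two vectors that have been scaled to length 1 equals the cosine similarity of the original vectors.

```python
import math

def dot(a, b):
    """Dot product of two equal-length sequences of numbers."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale v to length 1 (assumes v is not the zero vector)."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def cosine_similarity(a, b):
    """Cosine of the angle between a and b, for any nonzero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]

# Direct calculation: the lengths are divided out on every call.
direct = cosine_similarity(a, b)

# Normalize once up front, then a plain dot product is enough.
via_unit_vectors = dot(normalize(a), normalize(b))

print(direct, via_unit_vectors)
```

Both values agree up to floating-point rounding, which is why vectors normalized once at load time can be compared with a plain dot product at query time.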

Important

When performing cosine similarity calculations with `DOT_PRODUCT`, the vectors should be normalized to length 1 before storing them in the database.

Example 1 below demonstrates calculating cosine similarity using `DOT_PRODUCT` over a set of vectors with length 1.

In some cases, it is necessary to calculate cosine similarity directly or to calculate cosine distance. Example 2 defines the functions `cosine_similarity` and `cosine_distance`. Cosine distance is the complement of cosine similarity (`cosine_distance = 1 - cosine_similarity`). These functions are more expensive than `DOT_PRODUCT`.

Euclidean distance is another common measure of vector similarity. See EUCLIDEAN_DISTANCE for information on using the Euclidean distance metric in SingleStore.

Important

When finding the top K closest matches, use descending (`DESC`) sort order with `DOT_PRODUCT` and cosine similarity. The sort order for cosine distance and Euclidean distance is the opposite of the sort order for cosine similarity; use ascending (`ASC`) sort order with cosine distance and `EUCLIDEAN_DISTANCE`.

## Examples

### Example 1 - Cosine Similarity using DOT_PRODUCT (<*>)

A common use case of cosine similarity is to calculate the similarity of a set of vectors to a query vector. The example below calculates the cosine similarity between a query vector, `@query_vec`, and a set of normalized vectors using the infix `DOT_PRODUCT` operator (`<*>`).

Note

This example assumes that the vectors are already normalized. For the purposes of this example, the vectors are treated as having length 1. In fact, their lengths are very close to 1 but not exactly 1, because the values have been simplified to improve the readability of the example. In practice, it is important to use fully normalized vectors.
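To make "very close to 1" concrete, the lengths of the example embeddings used in this example can be checked with a few lines of Python. This is an illustrative sketch only; the `length` helper and the dictionary layout are assumptions made for this check and are not part of SingleStore.

```python
import math

def length(v):
    """Euclidean length (L2 norm) of a vector."""
    return math.sqrt(sum(x * x for x in v))

# The comment embeddings from this example, keyed by comment id.
embeddings = {
    1: [0.45, 0.55, 0.495, 0.5],
    2: [0.01111, 0.01111, 0.1, 0.999],
    3: [0.1, 0.8, 0.2, 0.555],
}

lengths = {comment_id: length(v) for comment_id, v in embeddings.items()}
for comment_id, value in lengths.items():
    print(comment_id, round(value, 4))
```

Each length differs from 1 by less than half a percent, which is why the `DOT_PRODUCT` scores in this example differ slightly from the cosine similarity scores in Example 2.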

This SQL creates a comments table which includes text comments, and a vector representing an embedding for each comment. Vector embeddings are vectors describing the meaning of objects and are a common product of large language models (LLMs). See Working with Vector Data for more information about embeddings and examples of using vector data in SingleStore.

```sql
CREATE TABLE comments(id INT,
      comment TEXT,
      comment_embedding VECTOR(4) NOT NULL,
      category VARCHAR(256));

INSERT INTO comments VALUES
      (1, "The cafeteria in building 35 has a great salad bar",
        '[0.45, 0.55, 0.495, 0.5]', "Food"),
      (2, "I love the taco bar in the B16 cafeteria.",
        '[0.01111, 0.01111, 0.1, 0.999]', "Food"),
      (3, "The B24 restaurant salad bar is quite good.",
        '[0.1, 0.8, 0.2, 0.555]', "Food");
```

This SQL creates a query vector and uses an `ORDER BY … LIMIT` query to find the two comment_embeddings from the comments table which are the most similar to the query vector. A more comprehensive version of this example is found in Working with Vector Data.

The `@query_vec` is cast from a `VECTOR` to a `BLOB` for performance reasons.

```sql
SET @query_vec = ('[0.44, 0.554, 0.34, 0.62]'):>VECTOR(4):>BLOB;

SELECT id, comment, category,
   comment_embedding <*> @query_vec AS score
FROM comments
ORDER BY score DESC
LIMIT 2;
```

```
*** 1. row ***
id: 1
comment: The cafeteria in building 35 has a great salad bar
category: Food
score: 0.9810000061988831
*** 2. row ***
id: 3
comment: The B24 restaurant salad bar is quite good.
category: Food
score: 0.8993000388145447
```

### Example 2 - Cosine Similarity and Cosine Distance

Note

Using the `cosine_similarity` and `cosine_distance` functions provided below will have a performance impact as the functions normalize the input vectors and are unable to use vector indexes.

The following is a user-defined function (UDF) for calculating cosine similarity. See CREATE FUNCTION (UDF) for more information about UDFs.

```sql
DELIMITER //
CREATE OR REPLACE FUNCTION cosine_similarity(v1 VECTOR(4), v2 VECTOR(4))
   RETURNS FLOAT
AS
BEGIN
  RETURN DOT_PRODUCT(normalize(v1), normalize(v2));
END //
DELIMITER ;
```

This function uses `DOT_PRODUCT` and the normalize function defined in Vector Normalization to compute the cosine similarity of two vectors of length 4.

You can adapt this code to suit your application by changing the vector length. If you work with different vector lengths, you can create a function with a different name for each desired length.

The UDF below calculates `cosine_distance` using the `cosine_similarity` UDF defined above.

```sql
DELIMITER //
CREATE OR REPLACE FUNCTION cosine_distance(v1 VECTOR(4), v2 VECTOR(4))
   RETURNS FLOAT
AS
BEGIN
  RETURN 1 - cosine_similarity(v1, v2);
END //
DELIMITER ;
```

With cosine similarity, a higher score means the vectors have greater similarity. The opposite is true of cosine distance, in which a lower score (lower distance) means the vectors are more similar.

Important

For top K queries, you must use a descending (`DESC`) sort order when using cosine similarity and an ascending (`ASC`) sort order when using cosine distance.

The following query calculates the cosine similarity of the comment embedding vectors to the query vector, `@query_vec`, and orders the results in descending (`DESC`) order of the score.

```sql
SET @query_vec = ('[0.44, 0.554, 0.34, 0.62]'):>VECTOR(4):>BLOB;

SELECT id, comment,
   cosine_similarity(comment_embedding, @query_vec) AS score
FROM comments
ORDER BY score DESC;
```

```
+------+----------------------------------------------------+----------+
| id   | comment                                            | score    |
+------+----------------------------------------------------+----------+
|    1 | The cafeteria in building 35 has a great salad bar | 0.980735 |
|    3 | The B24 restaurant salad bar is quite good.        | 0.899957 |
|    2 | I love the taco bar in the B16 cafeteria.          | 0.66153  |
+------+----------------------------------------------------+----------+
```

The query below calculates the `cosine_distance` between the query vector and the vectors in the table. The results are ordered in ascending (`ASC`) order as is appropriate for cosine distance. Recall that cosine distance is 1 - cosine similarity.

Note

`ASC` is the default ordering for `ORDER BY` clauses.

```sql
SET @query_vec = ('[0.44, 0.554, 0.34, 0.62]'):>VECTOR(4):>BLOB;

SELECT id, comment,
   cosine_distance(comment_embedding, @query_vec) AS score
FROM comments
ORDER BY score ASC;
```

```
+------+----------------------------------------------------+-----------+
| id   | comment                                            | score     |
+------+----------------------------------------------------+-----------+
|    1 | The cafeteria in building 35 has a great salad bar | 0.0192653 |
|    3 | The B24 restaurant salad bar is quite good.        | 0.100043  |
|    2 | I love the taco bar in the B16 cafeteria.          | 0.33847   |
+------+----------------------------------------------------+-----------+
```

Finally, observe that while the score is different in the query using `cosine_distance` versus the query using `cosine_similarity`, the ordering of the comments in the results is the same.
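The same observation can be reproduced outside the database. The following Python sketch is an illustration under stated assumptions, not SingleStore code: the helper functions and variable names are hypothetical. It recomputes cosine similarity and cosine distance for the three example embeddings and ranks them both ways.

```python
import math

def dot(a, b):
    """Dot product of two equal-length sequences of numbers."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

query = [0.44, 0.554, 0.34, 0.62]
embeddings = {
    1: [0.45, 0.55, 0.495, 0.5],
    2: [0.01111, 0.01111, 0.1, 0.999],
    3: [0.1, 0.8, 0.2, 0.555],
}

similarity = {i: cosine_similarity(v, query) for i, v in embeddings.items()}
distance = {i: 1 - s for i, s in similarity.items()}

# Most similar first: descending similarity, ascending distance.
by_similarity = sorted(similarity, key=similarity.get, reverse=True)
by_distance = sorted(distance, key=distance.get)

print(by_similarity, by_distance)
```

Because cosine distance is `1 - cosine_similarity`, sorting by distance in ascending order always yields the same ranking as sorting by similarity in descending order. In this sketch both rankings list the comment ids as 1, 3, 2, matching the query results above.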