Cosine Similarity and Cosine Distance
Cosine similarity is a measure of the similarity of two vectors.
In applications that use cosine similarity, images, blocks of text, audio files, and other content are converted to vectors.
SingleStore provides a Vector Type for storing vector data and Vector Indexing to speed up vector search when searching over large sets of vector data.
Also see Working with Vector Data for more information about and examples of working with vector data in SingleStore.
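For reference, the standard definition of cosine similarity for two non-zero vectors, together with the cosine distance used later on this page (1 minus cosine similarity), can be written as follows. The symbols a, b, and n are notation introduced here for illustration only.

```latex
\[
\text{cosine\_similarity}(a, b)
  = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}},
\qquad
\text{cosine\_distance}(a, b) = 1 - \text{cosine\_similarity}(a, b)
\]
```

When both vectors have length 1, the denominator equals 1 and cosine similarity reduces to the dot product.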
Cosine Similarity Calculation
Cosine similarity can be calculated in two different ways in SingleStore. The first is to use DOT_PRODUCT to calculate cosine similarity. Using DOT_PRODUCT gives the best performance because DOT_PRODUCT does not have to normalize its input every time it is called and because indexed vector search can be used.
Important
When performing cosine similarity calculations with DOT_PRODUCT, the vectors should be normalized to length 1 before storing them in the database.
Example 1 below demonstrates calculating cosine similarity using DOT_PRODUCT over a set of vectors with length 1.
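The reason normalization matters can be checked numerically: once two vectors are scaled to length 1, their dot product equals their cosine similarity. The following is a minimal Python sketch, not SingleStore code. The helper names normalize, dot, and cosine_similarity and the sample vectors are illustrative assumptions.

```python
import math

def dot(a, b):
    """Dot product of two equal-length sequences of numbers."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Return v scaled to length 1; a zero vector cannot be normalized."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical raw embeddings (lengths 5 and 3, so not unit length).
raw_a = [3.0, 4.0, 0.0, 0.0]
raw_b = [1.0, 2.0, 2.0, 0.0]

# Normalize once, before the vectors would be stored.
unit_a = normalize(raw_a)
unit_b = normalize(raw_b)

print(dot(unit_a, unit_b))              # dot product of unit vectors
print(cosine_similarity(raw_a, raw_b))  # same value, up to rounding
```

Both lines print approximately 0.7333 (11/15). Normalizing at write time means the division by the vector lengths is paid once per stored vector rather than once per comparison.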
In some cases, it is necessary to calculate cosine similarity directly or to calculate cosine distance. Example 2 below defines the cosine_similarity and cosine_distance functions for these cases; both are built on DOT_PRODUCT.
Euclidean distance is another common measure of vector similarity.
Important
When finding the top K closest matches, descending (DESC) sort order must be used with DOT_PRODUCT and cosine similarity. Use ascending (ASC) sort order for cosine distance and EUCLIDEAN_DISTANCE.
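The sort-order rule can be illustrated outside the database. The sketch below is plain Python with made-up scores (the item ids and values are hypothetical). It shows that taking the largest similarities and taking the smallest distances select the same items, while sorting similarity in ascending order returns the worst matches.

```python
import heapq

# Hypothetical similarity scores for four items, keyed by item id.
similarity = {"a": 0.98, "b": 0.66, "c": 0.90, "d": 0.12}

# Cosine distance is 1 minus cosine similarity.
distance = {item: 1.0 - score for item, score in similarity.items()}

k = 2

# Similarity: higher is better, analogous to ORDER BY score DESC LIMIT k.
top_by_similarity = heapq.nlargest(k, similarity, key=similarity.get)

# Distance: lower is better, analogous to ORDER BY score ASC LIMIT k.
top_by_distance = heapq.nsmallest(k, distance, key=distance.get)

# Pitfall: ascending order on a similarity score yields the least similar items.
wrong_order = heapq.nsmallest(k, similarity, key=similarity.get)

print(top_by_similarity, top_by_distance, wrong_order)
```

Here top_by_similarity and top_by_distance both contain items "a" and "c", while wrong_order contains "d" and "b".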
Examples
Example 1 - Cosine Similarity using DOT_PRODUCT (<*>)
A common use case of cosine similarity is to calculate the similarity of a set of vectors to a query vector. This example calculates the similarity between a query vector, @query_vec, and a set of normalized vectors using the infix DOT_PRODUCT operator (<*>).
Note
This example assumes that the vectors are already normalized.
The following SQL populates the comments table, which holds text comments, a vector representing an embedding for each comment, and a category.
INSERT INTO comments VALUES
  (1, "The cafeteria in building 35 has a great salad bar", '[0.45, 0.55, 0.495, 0.5]', "Food"),
  (2, "I love the taco bar in the B16 cafeteria.", '[0.01111, 0.01111, 0.1, 0.999]', "Food"),
  (3, "The B24 restaurant salad bar is quite good.", '[0.1, 0.8, 0.2, 0.555]', "Food");
This SQL creates a query vector and uses an ORDER BY … LIMIT query to find the two comment_embedding vectors that are most similar to the query vector. The @query_vec variable is cast to a VECTOR to ensure that @query_vec is a valid VECTOR and to improve performance.
SET @query_vec = ('[0.44, 0.554, 0.34, 0.62]'):>VECTOR(4);
SELECT id, comment, category,
  comment_embedding <*> @query_vec AS score
FROM comments
ORDER BY score DESC
LIMIT 2;
*** 1. row ***
id: 1
comment: The cafeteria in building 35 has a great salad bar
category: Food
score: 0.9810000061988831
*** 2. row ***
id: 3
comment: The B24 restaurant salad bar is quite good.
category: Food
score: 0.8993000388145447
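As a sanity check, the scores above can be reproduced outside the database. The sketch below is plain Python, not SingleStore code. The names dot, comments, and query_vec are illustrative, and the vector values are copied from the INSERT and SET statements in this example.

```python
# Embeddings copied from the INSERT statement, keyed by comment id.
comments = {
    1: [0.45, 0.55, 0.495, 0.5],
    2: [0.01111, 0.01111, 0.1, 0.999],
    3: [0.1, 0.8, 0.2, 0.555],
}
query_vec = [0.44, 0.554, 0.34, 0.62]

def dot(a, b):
    """Dot product of two equal-length sequences of numbers."""
    return sum(x * y for x, y in zip(a, b))

# Score every comment, then keep the two highest scores (descending order).
scores = {comment_id: dot(vec, query_vec) for comment_id, vec in comments.items()}
top_two = sorted(scores, key=scores.get, reverse=True)[:2]

for comment_id in top_two:
    print(comment_id, scores[comment_id])
```

The two ids returned are 1 and 3, with scores of approximately 0.981 and 0.8993. These agree with the query output up to floating-point rounding in the trailing digits.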
Example 2 - Cosine Similarity and Cosine Distance
Note
Using the cosine_similarity and cosine_distance functions provided below will have a performance impact as the functions normalize the input vectors and are unable to use vector indexes.
The following user-defined function (UDF), created with CREATE FUNCTION (UDF), calculates cosine similarity.
DELIMITER //
CREATE OR REPLACE FUNCTION cosine_similarity(v1 VECTOR(4), v2 VECTOR(4))
RETURNS FLOAT AS
BEGIN
  RETURN DOT_PRODUCT(normalize(v1), normalize(v2));
END //
DELIMITER ;
This function uses DOT_PRODUCT and the normalize function defined in Vector Normalization to compute the cosine similarity of two vectors with 4 elements each.
You can adapt this code to suit your application by changing the vector length (the 4 in VECTOR(4)).
The UDF below calculates cosine_distance using the cosine_similarity UDF defined above.
DELIMITER //
CREATE OR REPLACE FUNCTION cosine_distance(v1 VECTOR(4), v2 VECTOR(4))
RETURNS FLOAT AS
BEGIN
  RETURN 1 - cosine_similarity(v1, v2);
END //
DELIMITER ;
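With cosine similarity, a higher score means the vectors have greater similarity. With cosine distance, which is 1 minus cosine similarity, a lower score means the vectors have greater similarity.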
Important
For top K queries, you must use a descending (DESC) sort order when using cosine similarity and an ascending (ASC) sort order when using cosine distance.
The following query calculates the cosine similarity of the comment embedding vectors to the query vector, @query_vec, and orders the results in descending (DESC) order of the score.
SET @query_vec = ('[0.44, 0.554, 0.34, 0.62]'):>VECTOR(4);
SELECT id, comment,
  cosine_similarity(comment_embedding, @query_vec) AS score
FROM comments
ORDER BY score DESC;
+------+----------------------------------------------------+----------+
| id   | comment                                            | score    |
+------+----------------------------------------------------+----------+
|    1 | The cafeteria in building 35 has a great salad bar | 0.980735 |
|    3 | The B24 restaurant salad bar is quite good.        | 0.899957 |
|    2 | I love the taco bar in the B16 cafeteria.          |  0.66153 |
+------+----------------------------------------------------+----------+
The query below calculates the cosine_distance between the query vector and the vectors in the table. The results are ordered in ascending (ASC) order as is appropriate for cosine distance.
Note
ASC is the default ordering for ORDER BY clauses.
SET @query_vec = ('[0.44, 0.554, 0.34, 0.62]'):>VECTOR(4);
SELECT id, comment,
  cosine_distance(comment_embedding, @query_vec) AS score
FROM comments
ORDER BY score ASC;
+------+----------------------------------------------------+-----------+
| id   | comment                                            | score     |
+------+----------------------------------------------------+-----------+
|    1 | The cafeteria in building 35 has a great salad bar | 0.0192653 |
|    3 | The B24 restaurant salad bar is quite good.        |  0.100043 |
|    2 | I love the taco bar in the B16 cafeteria.          |   0.33847 |
+------+----------------------------------------------------+-----------+
Finally, observe that while the score is different in the query using cosine_similarity versus the query using cosine_distance, the ordering of the comments in the results is the same.
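The two result tables in this example can be cross-checked with a short Python sketch. It is not SingleStore code: the Python functions cosine_similarity and cosine_distance below only mirror the arithmetic of the UDFs of the same names, and the other identifiers are illustrative. The vector values are copied from the statements on this page.

```python
import math

query_vec = [0.44, 0.554, 0.34, 0.62]
comments = {
    1: [0.45, 0.55, 0.495, 0.5],
    2: [0.01111, 0.01111, 0.1, 0.999],
    3: [0.1, 0.8, 0.2, 0.555],
}

def dot(a, b):
    """Dot product of two equal-length sequences of numbers."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Divides by both vector lengths on every call, which is equivalent
    # to normalizing both inputs before taking the dot product.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def cosine_distance(a, b):
    return 1.0 - cosine_similarity(a, b)

similarity = {i: cosine_similarity(v, query_vec) for i, v in comments.items()}
distance = {i: cosine_distance(v, query_vec) for i, v in comments.items()}

# Descending order for similarity, ascending order for distance.
by_similarity = sorted(similarity, key=similarity.get, reverse=True)
by_distance = sorted(distance, key=distance.get)

print(by_similarity, by_distance)
```

Both orderings come out as ids 1, 3, 2, and the computed values agree with the tables above to within small floating-point differences. Note that these similarity scores differ slightly from the DOT_PRODUCT scores in Example 1 because the sample vectors are only approximately of length 1.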
Last modified: November 21, 2024