Cosine Similarity and Cosine Distance
Cosine similarity is a measure of the similarity of two vectors.
In applications that use cosine similarity, images, blocks of text, audio files, and other content are converted to vectors.
Also see Working with Vector Data for more information about and examples of working with vector data in SingleStore.
Cosine Similarity Calculation: Cosine similarity can be calculated in two different ways in SingleStore. One way is to use DOT_PRODUCT on normalized vectors to calculate cosine similarity.
Using DOT_PRODUCT to calculate cosine similarity gives the best performance since DOT_PRODUCT does not have to normalize its input every time it is called and because indexed vector search can be used.
Important
When performing cosine similarity calculations with DOT_PRODUCT, the vectors should be normalized to length 1 before storing them in the database.
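The reason normalization matters follows from the standard definition of cosine similarity, written here for two vectors a and b (the symbols are chosen for illustration and do not appear elsewhere on this page):

```latex
\[
\operatorname{cosine\_similarity}(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert},
\qquad
\lVert \mathbf{a} \rVert = \lVert \mathbf{b} \rVert = 1
\;\Longrightarrow\;
\operatorname{cosine\_similarity}(\mathbf{a}, \mathbf{b}) = \mathbf{a} \cdot \mathbf{b}
\]
```

When both vectors have length 1, the denominator is 1, so the dot product alone gives the cosine similarity.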
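Example 1 below shows a brief example of using DOT_PRODUCT.
In some cases, it is necessary to calculate cosine similarity directly or to calculate cosine distance. Example 2 below defines the cosine_similarity and cosine_distance functions for these cases; both are built on DOT_PRODUCT.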
Euclidean distance is another common measure of vector similarity.
Important
When finding the top K closest matches, descending (DESC) sort order must be used when you are using DOT_PRODUCT and cosine similarity. Use ascending (ASC) sort order for cosine distance and EUCLIDEAN_DISTANCE.
Examples
Example 1 - Cosine Similarity using DOT_PRODUCT
To calculate the cosine similarity between two vectors, make sure your vectors are all of length 1 before storing them in your database, and then use DOT_PRODUCT.
Note
This example assumes that the vectors are already normalized.
The example below demonstrates the use of DOT_PRODUCT to find the cosine similarity between a query vector and vectors in a table.
CREATE TABLE vectors_b (id INT, vec BLOB NOT NULL);
INSERT INTO vectors_b VALUES (1, JSON_ARRAY_PACK('[0.1, 0.8, 0.2, 0.555]'));
INSERT INTO vectors_b VALUES (2, JSON_ARRAY_PACK('[0.45, 0.55, 0.495, 0.5]'));
The SQL below creates a query vector and then uses DOT_PRODUCT to find the cosine similarity between the @query_vec variable and the vectors in the vectors_b table.
SET @query_vec = JSON_ARRAY_PACK('[0.44, 0.554, 0.34, 0.62]');
SELECT DOT_PRODUCT(vec, @query_vec) AS score
FROM vectors_b
ORDER BY score DESC;
+--------------------+
| score              |
+--------------------+
| 0.9810000061988831 |
| 0.8993000388145447 |
+--------------------+
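The same table and query vector can also illustrate the sort-order rule for top K queries. The following is a minimal sketch, assuming the vectors_b table and the @query_vec variable from this example are still in place. The LIMIT value of 1 and the column alias dist are arbitrary choices made for illustration.

```sql
-- Top K by cosine similarity (vectors assumed normalized):
-- a higher score is a closer match, so sort in descending order.
SELECT id, DOT_PRODUCT(vec, @query_vec) AS score
FROM vectors_b
ORDER BY score DESC
LIMIT 1;

-- Top K by Euclidean distance:
-- a lower distance is a closer match, so sort in ascending order.
SELECT id, EUCLIDEAN_DISTANCE(vec, @query_vec) AS dist
FROM vectors_b
ORDER BY dist ASC
LIMIT 1;
```

For this data, both orderings rank the row with id 2 as the closest match. Reversing either sort order would return the least similar row instead.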
Example 2 - Cosine Similarity and Cosine Distance
The following is a user-defined function (UDF), created with CREATE FUNCTION (UDF), for calculating cosine similarity.
DELIMITER //
CREATE OR REPLACE FUNCTION cosine_similarity(v1 BLOB, v2 BLOB)
RETURNS FLOAT AS
BEGIN
  RETURN DOT_PRODUCT(normalize(v1), normalize(v2));
END //
DELIMITER ;
This function uses DOT_PRODUCT and the normalize function defined in Vector Normalization to compute the cosine similarity of two vectors of length 4.
You can adapt this code to suit your application by changing the vector length.
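The normalize function is defined on the Vector Normalization page and is not reproduced here. For orientation only, the following is a sketch of how such a function could be written. It assumes the built-in functions VECTOR_MUL, VECTOR_ELEMENTS_SUM, and SCALAR_VECTOR_MUL operating on packed 32-bit float vectors; the definition on the Vector Normalization page may differ and takes precedence.

```sql
DELIMITER //
-- Hypothetical sketch, not the official definition:
-- scale v so that its Euclidean length is 1.
CREATE OR REPLACE FUNCTION normalize(v BLOB) RETURNS BLOB AS
DECLARE
  squares BLOB = VECTOR_MUL(v, v);                    -- element-wise squares of v
  length FLOAT = SQRT(VECTOR_ELEMENTS_SUM(squares));  -- Euclidean norm of v
BEGIN
  RETURN SCALAR_VECTOR_MUL(1 / length, v);            -- divide each element by the norm
END //
DELIMITER ;
```

This sketch does not guard against an all-zero vector, whose norm is 0 and for which cosine similarity is undefined.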
The UDF below calculates cosine_distance using the cosine_similarity UDF defined above.
DELIMITER //
CREATE OR REPLACE FUNCTION cosine_distance(v1 BLOB, v2 BLOB)
RETURNS FLOAT AS
BEGIN
  RETURN 1 - cosine_similarity(v1, v2);
END //
DELIMITER ;
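With cosine similarity, a higher score means the vectors have greater similarity. With cosine distance, which is 1 minus the cosine similarity, a lower score means the vectors have greater similarity.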
Important
For top K queries, you must use a descending (DESC) sort order when using cosine similarity and an ascending (ASC) sort order when using cosine distance.
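The opposite sort orders follow directly from how the two UDFs relate. Writing s for the value returned by cosine_similarity and d for the value returned by cosine_distance (both symbols are introduced here for illustration):

```latex
\[
d = 1 - s, \qquad -1 \le s \le 1 \;\Longrightarrow\; 0 \le d \le 2
\]
```

Identical directions give s = 1 and d = 0, while opposite directions give s = -1 and d = 2, so the best matches sit at the top of a descending sort on s and at the top of an ascending sort on d.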
The following examples use the functions above to compute similarity and distance for all pairs of a set of three vectors.
CREATE TABLE <table_name> (id INT, v BLOB);
INSERT INTO <table_name> VALUES
(1, JSON_ARRAY_PACK('[3, 4, 7]')),
(2, JSON_ARRAY_PACK('[-2, 0, 6]')),
(3, JSON_ARRAY_PACK('[1, 0, 0]'));
The SELECT clause below uses the id column from both instances of the table (note: a1 and a2 are aliases) and computes the cosine similarity and cosine distance between a1.v and a2.v.
The FROM clause lists the table twice, under the aliases a1 and a2, so that every row is paired with every other row. The WHERE clause is used so that a row is not compared with itself; <> means not equal. ORDER BY ALL indicates the results will be in default order.
SELECT a1.id, a2.id,
  FORMAT(cosine_similarity(a1.v, a2.v), 3),
  FORMAT(cosine_distance(a1.v, a2.v), 3)
FROM <table_name> a1, <table_name> a2
WHERE a1.id <> a2.id
ORDER BY ALL;
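As a check on what this query should report, the similarities can be worked out by hand from the three vectors inserted above. The labels v_1 = (3, 4, 7), v_2 = (-2, 0, 6), and v_3 = (1, 0, 0) are introduced here for illustration, and values are rounded to three decimal places to match FORMAT(..., 3):

```latex
\begin{align*}
\cos(v_1, v_2) &= \frac{(3)(-2) + (4)(0) + (7)(6)}{\sqrt{74}\,\sqrt{40}}
                = \frac{36}{\sqrt{2960}} \approx 0.662 \\
\cos(v_1, v_3) &= \frac{3}{\sqrt{74} \cdot 1} \approx 0.349 \\
\cos(v_2, v_3) &= \frac{-2}{\sqrt{40} \cdot 1} \approx -0.316
\end{align*}
```

The corresponding cosine distances are 1 minus these values: approximately 0.338, 0.651, and 1.316. Each pair appears twice in the query result, once in each order, because cosine similarity is symmetric.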
This statement sets the user-defined variable @qv to the value in the v column of the row where the id is equal to 1.
SET @qv = (SELECT v FROM <table_name> WHERE id = 1);
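The following query retrieves the rows from the table, unpacks the JSON array in the v column, calculates the cosine distance between each vector and @qv, and returns the two rows with the smallest distance.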
SELECT id, JSON_ARRAY_UNPACK(v), cosine_distance(v, @qv) AS <results>
FROM <table_name>
ORDER BY <results>
LIMIT 2;
The next query is similar to the previous one; the main differences are that cosine similarity is calculated instead of cosine distance and that the results are in descending order.
Cosine similarity is calculated between the vectors in the v column and the vector in the @qv variable, and the results are ordered by the calculated cosine similarity.
SELECT id, JSON_ARRAY_UNPACK(v), cosine_similarity(v, @qv) AS <results>
FROM <table_name>
ORDER BY <results> DESC
LIMIT 2;
Last modified: March 12, 2024