Cosine Similarity and Cosine Distance

Cosine similarity can be calculated in two ways in SingleStore. When all input vectors have length 1, cosine similarity can be calculated with the DOT_PRODUCT function. SingleStore recommends normalizing all vectors before storing them in the database and then using DOT_PRODUCT to calculate cosine similarity. See Vector Normalization for a function that normalizes vectors in SingleStore.

Using DOT_PRODUCT to calculate cosine similarity gives the best performance since DOT_PRODUCT does not have to normalize its input every time it is called and because indexed vector search can be used.
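The relationship behind this recommendation can be checked outside the database. The following is a minimal Python sketch, not SingleStore code; the helper names dot, normalize, and cosine_similarity and the two sample vectors are illustrative assumptions. It shows that the dot product of two vectors normalized to length 1 matches the cosine similarity of the original vectors.

```python
import math

def dot(a, b):
    """Dot product of two equal-length sequences of numbers."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale v so that its Euclidean length (magnitude) is 1."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def cosine_similarity(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Two arbitrary sample vectors (assumed for illustration).
a = [3.0, 4.0, 7.0]
b = [-2.0, 0.0, 6.0]

# Normalize once (as you would before storing), then use a plain dot product.
unit_a, unit_b = normalize(a), normalize(b)
print(dot(unit_a, unit_b))
print(cosine_similarity(a, b))
```

Both printed values should agree up to floating-point rounding. The normalization cost is paid once per vector rather than on every comparison.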

Important

When performing cosine similarity calculations with DOT_PRODUCT, the vectors should be normalized to length 1 before storing them in the database.

Example 1 below shows how to use DOT_PRODUCT.

In some cases, it is necessary to calculate cosine similarity directly or to calculate cosine distance. Example 2 defines the functions cosine_similarity and cosine_distance. Cosine distance is the complement of cosine similarity (cosine_distance = 1 - cosine_similarity). These functions are more expensive than DOT_PRODUCT because they normalize their inputs on every call.

Euclidean distance is another common measure of vector similarity. See EUCLIDEAN_DISTANCE for information on using the Euclidean distance metric in SingleStore.

Important

When finding the top K closest matches, use descending (DESC) sort order with DOT_PRODUCT and cosine similarity. The sort order for cosine distance and Euclidean distance is the opposite of the sort order for cosine similarity; use ascending (ASC) sort order for cosine distance and EUCLIDEAN_DISTANCE.
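To see why the two sort orders select the same rows, here is a small Python sketch, not SingleStore code. The similarity scores and row ids are hypothetical values chosen for illustration.

```python
# Hypothetical similarity scores for four stored vectors, keyed by row id.
similarity = {1: 0.98, 2: 0.35, 3: -0.32, 4: 0.66}

# Cosine distance is defined as 1 - cosine similarity.
distance = {row_id: 1 - score for row_id, score in similarity.items()}

k = 2
# Similarity: higher is closer, so sort descending (like ORDER BY ... DESC).
top_by_similarity = sorted(similarity, key=similarity.get, reverse=True)[:k]
# Distance: lower is closer, so sort ascending (like ORDER BY ... ASC).
top_by_distance = sorted(distance, key=distance.get)[:k]

print(top_by_similarity)  # [1, 4]
print(top_by_distance)    # [1, 4]
```

Using the wrong direction would return the K least similar rows instead of the K closest matches.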

Examples

Example 1 - Cosine Similarity using DOT_PRODUCT

To calculate the cosine similarity between two vectors, make sure your vectors are all of length 1 before storing them in your database, and then simply use DOT_PRODUCT.

Note

This example assumes that the vectors are already normalized. For readability, the vectors in this example have been simplified, so their lengths are very close to 1 but not exactly 1. In practice, it is important to use fully normalized vectors.

The example below demonstrates the use of DOT_PRODUCT to find the cosine similarity between a query vector and vectors in a table.

SQL

CREATE TABLE vectors_b (id int, vec BLOB not null);

INSERT INTO vectors_b VALUES (1, JSON_ARRAY_PACK('[0.1, 0.8, 0.2, 0.555]'));
INSERT INTO vectors_b VALUES (2, JSON_ARRAY_PACK('[0.45, 0.55, 0.495, 0.5]'));

The SQL below creates a query vector and then uses DOT_PRODUCT to find the cosine similarity between @query_vec and the vectors in the vectors_b table.

SQL

SET @query_vec = JSON_ARRAY_PACK('[0.44, 0.554, 0.34, 0.62]');

SELECT DOT_PRODUCT(vec, @query_vec) AS score 
FROM vectors_b
ORDER BY score DESC;

+---------------------+
| score               |
+---------------------+
| 0.9810000061988831  |
| 0.8993000388145447  |
+---------------------+
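The scores above can be reproduced outside the database. The following is a minimal Python sketch, not SingleStore code; the dot helper is an illustrative assumption. It recomputes the two dot products and also checks the lengths of the three vectors, which are close to 1 as the note in this example states.

```python
import math

def dot(a, b):
    """Dot product of two equal-length sequences of numbers."""
    return sum(x * y for x, y in zip(a, b))

# The same values that are inserted into the vectors_b table.
vectors_b = {
    1: [0.1, 0.8, 0.2, 0.555],
    2: [0.45, 0.55, 0.495, 0.5],
}
query_vec = [0.44, 0.554, 0.34, 0.62]

# Scores sorted descending, mirroring ORDER BY score DESC.
scores = sorted((dot(vec, query_vec) for vec in vectors_b.values()), reverse=True)
print([round(s, 4) for s in scores])

# Magnitude of each vector; all are close to, but not exactly, 1.
lengths = [math.sqrt(dot(v, v)) for v in (*vectors_b.values(), query_vec)]
print([round(n, 4) for n in lengths])
```

Python uses 64-bit floats, so the trailing digits will likely differ slightly from the SQL output, which appears to reflect 32-bit float storage.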

Example 2 - Cosine Similarity and Cosine Distance

The following user-defined function (UDF), created with CREATE FUNCTION (UDF), calculates cosine similarity.

SQL

DELIMITER //
CREATE OR REPLACE FUNCTION cosine_similarity(v1 BLOB, v2 BLOB)
   RETURNS FLOAT
AS
BEGIN
  RETURN DOT_PRODUCT(normalize(v1), normalize(v2));
END //
DELIMITER ;

This function uses DOT_PRODUCT and the normalize function defined in Vector Normalization to compute the cosine similarity of two vectors of length 4. Here, length refers to the number of elements in a vector, not its magnitude.

You can adapt this code to suit your application by changing the vector length. If you work with different vector lengths, you can create a function with a different name for each desired length.

The UDF below calculates cosine_distance using the cosine_similarity UDF defined above.

SQL

DELIMITER //
CREATE OR REPLACE FUNCTION cosine_distance(v1 BLOB, v2 BLOB)
   RETURNS FLOAT
AS
BEGIN
  RETURN 1 - cosine_similarity(v1, v2);
END //
DELIMITER ;

With cosine similarity, a higher score means the vectors have greater similarity. The opposite is true of cosine distance, in which a lower score (lower distance) means the vectors are more similar.

Important

For top K queries, you must use a descending (DESC) sort order when using cosine similarity and an ascending (ASC) sort order when using cosine distance.

The following examples use the functions above to compute similarity and distance for all pairs of a set of three vectors.

SQL

CREATE TABLE <table_name> (id INT, v BLOB);

INSERT INTO <table_name>
  VALUES  
    (1, JSON_ARRAY_PACK('[3, 4, 7]')),
    (2, JSON_ARRAY_PACK('[-2, 0, 6]')),
    (3, JSON_ARRAY_PACK('[1, 0, 0]'));

The SELECT clause below returns the id column from both instances of the table (a1 and a2 are aliases), along with the cosine similarity and the cosine distance between the two vectors a1.v and a2.v. The FROM clause joins the table to itself. The WHERE clause ensures that a row is not compared with itself; the <> operator means not equal. ORDER BY ALL sorts the results by all of the selected columns in the default order.

SQL

SELECT a1.id, a2.id, FORMAT(cosine_similarity(a1.v, a2.v), 3),
    FORMAT(cosine_distance(a1.v, a2.v), 3) 
FROM <table_name> a1, <table_name> a2 
WHERE a1.id <> a2.id 
ORDER BY ALL;
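As a cross-check on the values this query should produce, the following Python sketch (not SingleStore code) computes cosine similarity and cosine distance for every ordered pair of the three vectors. The function and variable names are illustrative assumptions, and the values are computed with 64-bit floats, so the last displayed digit may differ from the SQL result.

```python
import math
from itertools import permutations

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Cosine distance, defined as 1 - cosine similarity."""
    return 1 - cosine_similarity(a, b)

# The same three vectors that are inserted into the table.
rows = {1: [3, 4, 7], 2: [-2, 0, 6], 3: [1, 0, 0]}

# Every ordered pair of distinct ids, like the self-join with a1.id <> a2.id.
pairs = [
    (id1, id2, cosine_similarity(rows[id1], rows[id2]), cosine_distance(rows[id1], rows[id2]))
    for id1, id2 in permutations(rows, 2)
]
for id1, id2, sim, dist in sorted(pairs):
    print(id1, id2, f"{sim:.3f}", f"{dist:.3f}")
```

Each pair appears twice, once in each order, with the same similarity, because cosine similarity is symmetric.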

This statement sets the user-defined variable @qv to the value in the v column of the row whose id is 1.

SQL

SET @qv = (SELECT v FROM <table_name> WHERE id = 1);

The following query unpacks the vector in the v column into a JSON array with JSON_ARRAY_UNPACK and calculates the cosine distance between each vector in the v column and the vector in the @qv variable. The results are ordered by the calculated cosine distance in ascending order and limited to the top 2 rows.

SQL

SELECT id, JSON_ARRAY_UNPACK(v), cosine_distance(v, @qv) AS <results>
FROM <table_name>
ORDER BY <results> LIMIT 2;

The next query is similar to the previous one. The main differences are that it calculates cosine similarity and sorts the results in descending order.

SQL

SELECT id, JSON_ARRAY_UNPACK(v), cosine_similarity(v, @qv) AS <results>
FROM <table_name>
ORDER BY <results> DESC LIMIT 2;
