Vector Normalization

Vector normalization is the process of standardizing a vector. Vector normalization often involves normalizing feature vectors that describe data objects in applications of artificial intelligence for searching vector embeddings derived from images, text, and audio. The normalization process ensures that the vectors have a consistent scale or magnitude, which can be important in certain operations such as distance calculations, clustering, or classification. 

Specifically, cosine similarity, a common measure of the similarity of two vectors, can be calculated with the DOT_PRODUCT function when the input vectors are all of length 1. Thus, when performing cosine similarity calculations with DOT_PRODUCT, the vectors should be normalized to length 1 before storing them in the database.

Many APIs that produce vector embeddings, such as the OpenAI APIs, always return vectors of length 1, so check the documentation for the vectors' source to see if they are length 1. If so, they do not need to be normalized further.

See the following for more information about working with vector data in SingleStore.

Example

If your database contains vectors that are not normalized, you can normalize them with a SQL function. The following UDF (CREATE FUNCTION (UDF)) normalizes a vector of 32-bit floating point values with length 4:

DELIMITER //
CREATE or REPLACE FUNCTION normalize(v VECTOR(4)) RETURNS VECTOR(4) AS
DECLARE 
squares VECTOR(4) = vector_mul(v,v); 
length FLOAT = sqrt(vector_elements_sum(squares));
BEGIN 
RETURN scalar_vector_mul(1/length, v);
END //
DELIMITER ;

This function computes the length of the vector as the square root of the sum of the squares of the elements of the vector. The function then forms a new vector by dividing the original vector elements by the calculated length of the vector. The result is a new "normalized" vector which has length 1.

You can adapt this code to suit your application by changing the vector length. If you work with different vector lengths, you can create a function with a different name for each desired length.

The following example shows the use of this function to normalize a table of vectors. Vectors of length 4 are used to simplify the example and make it easier to see and understand the results. In a real-world use case, vectors are expected to have more elements.

CREATE TABLE vectors_not_normalized(id TEXT, v VECTOR(4));
INSERT INTO vectors_not_normalized VALUES
("A", '[1,2,3,4]'),
("B", '[5,4,3,2]'),
("C", '[1,0,0,0]');

Display the vectors and their lengths from the vectors_not_normalized table with the following query.

SELECT V.id, V.v, sqrt(vector_elements_sum(vector_mul(V.v,V.v))) AS length
FROM vectors_not_normalized V
ORDER BY V.id;
+------+-----------+--------------------+
| id   | v         | length             |
+------+-----------+--------------------+
| A    | [1,2,3,4] |  5.477225575051661 |
| B    | [5,4,3,2] | 7.3484692283495345 |
| C    | [1,0,0,0] |                  1 |
+------+-----------+--------------------+

Note that vectors with ids A and B have length longer than 1 and will be expected to change when normalized. Vector with id C has a length of 1 and is not expected to change when normalized.

Now, create a table to hold the normalized vectors.

CREATE TABLE vectors_normalized(id TEXT, v VECTOR(4));

The following SQL will normalize the vectors and insert the normalized vectors into the new vectors_normalized table.

INSERT INTO vectors_normalized
SELECT id, normalize(v)
FROM vectors_not_normalized;

Run the following SQL to see the vectors and their lengths after normalization.

SELECT V.id, V.v, sqrt(vector_elements_sum(vector_mul(V.v,V.v))) AS length
FROM vectors_normalized V
ORDER BY V.id;
+------+---------------------------------------------------+--------------------+
| id   | v                                                 | length             |
+------+---------------------------------------------------+--------------------+
| A    | [0.182574183,0.365148365,0.547722578,0.730296731] | 0.9999999850988387 |
| B    | [0.680413842,0.544331074,0.408248305,0.272165537] |  1.000000033527612 |
| C    | [1,0,0,0]                                         |                  1 |
+------+---------------------------------------------------+--------------------+

The elements of vectors A and B have changed significantly since and their original lengths were not very close to 1. Vector C has not changed as its original length was 1.

The lengths of vectors A and B are close to, but not exactly one. This is because of the finite precision of floating point representation.

Last modified: March 5, 2024

Was this article helpful?