How to Bulk Load Vectors

SingleStore recommends loading larger vector data sets using a binary format such as Apache Parquet or Apache Avro™ and using LOAD DATA and pipelines (CREATE PIPELINE).

The default input and output format for vectors in SingleStore is JSON array format. This format is convenient and human-readable, but is slower to parse and produce than binary formats. For high performance, SingleStore recommends using Parquet or Avro to load large vector data sets when possible. SingleStore also supports vectors encoded in hexadecimal; this encoding is more efficient than JSON array format, but less efficient than Avro or Parquet. Use hexadecimal vector representations if using Parquet or Avro would be inconvenient.

Examples 1 and 2 below demonstrate loading vector data using Parquet and hexadecimal encodings, respectively. Example 3 shows exporting vector data using Parquet and hexadecimal encoding. Using Avro is similar to using Parquet, except that FORMAT AVRO is specified instead of FORMAT PARQUET. Finally, Example 4 shows loading vector data using the default JSON array format.

Warning

Deprecation Notice

String functions, such as CONCAT, LENGTH, SUBSTR, and HEX, previously operated on the JSON representation of a vector, interpreted as a string. This behavior is deprecated. In the 8.7 release, string functions operate on the binary representation of a vector.

If you truly intend to use a string function on the JSON string representation of a vector, you can write code in the following way so that it will run the same way before and after this behavior change.

Suppose vec is a vector-type value, stringfunc is any string function, and you have the expression:

stringfunc(vec)

you can change it to:

stringfunc(vec :> json)

Output Format for Examples

Vectors may be output in JSON or binary format. The examples in this topic use JSON format for readability. For production, use the default binary format for efficiency.

To get JSON output that matches the examples, use the following command.

SET vector_type_project_format = JSON;

Use the following command to set the output format back to binary.

SET vector_type_project_format = BINARY;

Example 1 - Bulk Loading Vector Data using Parquet

SingleStore supports loading vector data from Parquet files. Parquet files can be generated using Python and pandas, which provides a function for exporting a pandas DataFrame to a Parquet file.

Alternatively, you can generate a small sample Parquet file to experiment with by loading the sample data below into SingleStore and following the instructions for exporting data to Parquet in the Example 3 - Exporting Vector Data in Parquet and Hexadecimal section.

Once you have a Parquet file with vector data, the example below demonstrates how to load that data into SingleStore.

Create a table with a column of type VECTOR with 4 elements (dimension 4) and an element type of 32-bit floating-point number (F32). Note that in practice vectors typically have significantly more than 4 dimensions; dimension 4 is used here for readability.

CREATE TABLE vectors(id INT, vec VECTOR(4, F32) NOT NULL);

This example uses the following data. The vectors are shown in JSON array format for readability.

1,'[0.45, 0.55, 0.495, 0.5]'
2,'[0.1, 0.8, 0.2, 0.555]'
3,'[-0.5, -0.03, -0.1, 0.86]'
4,'[0.5, 0.3, 0.807, 0.1]'

Example 1a - Using LOAD DATA to Load Parquet Data from S3

The following command loads a Parquet file stored in S3 into the vectors table. This example uses LOAD DATA, which is more efficient than loading data using INSERT statements.

To run the following command, replace the URI for the S3 location of the data, update the config, and add your credentials. Refer to SELECT for information on using AWS credentials with SingleStore.

LOAD DATA S3
's3://vec_data_folder/vec_f32_data.parquet'
CONFIG '{"region":"us-west-2"}'
CREDENTIALS '{
"aws_access_key_id":"your_access_key_id",
"aws_secret_access_key":"your_secret_access_key",
"aws_session_token":"your_session_token"}'
INTO TABLE vectors
FORMAT PARQUET
(id <- id, @v <- vec)
SET vec = @v:>BLOB:>VECTOR(4, F32);

Note the last line of this command which uses a set of casts to load Parquet data into the VECTOR attribute: SET vec = @v:>BLOB:>VECTOR(4, F32). By default, LOAD DATA expects vector data to be in textual JSON array notation. The casts in the SET clause override this, and allow more efficient assignment directly from binary data.

Run the following SQL statement to verify that the data was loaded.

SET vector_type_project_format = JSON; /* to make vector output readable */
SELECT id, vec 
FROM vectors
ORDER BY id;
+------+---------------------------------------------------+
| id   | vec                                               |
+------+---------------------------------------------------+
|    1 | [0.449999988,0.550000012,0.495000005,0.5]         |
|    2 | [0.100000001,0.800000012,0.200000003,0.555000007] |
|    3 | [-0.5,-0.0299999993,-0.100000001,0.860000014]     |
|    4 | [0.5,0.300000012,0.806999981,0.100000001]         |
+------+---------------------------------------------------+

The results do not exactly match the vector values listed above because elements in the VECTOR data type are stored as floating-point numbers and the vector values are not perfectly representable in floating-point representation.
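The rounding above can be reproduced in Python (assuming NumPy is available):

```python
import numpy as np

# 0.45 is not exactly representable as a 32-bit float; the value stored
# in the VECTOR column is the nearest float32, approximately 0.449999988.
stored = float(np.float32(0.45))
print(stored)
```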

Example 1b - Using a Pipeline to Load Parquet Data from S3

You can also use a pipeline to load data from S3. This example is the same as Example 1a, except that a pipeline is used to load the data.

Create a pipeline to load the Parquet data into the vectors table. To run this pipeline, replace the URI for the S3 location of the data, update the config, and add your credentials.

CREATE PIPELINE vec_pipeline_pqt AS
LOAD DATA S3
's3://vec_data_folder/vec_f32_data.parquet'
CONFIG '{"region":"us-west-2"}'
CREDENTIALS '{
"aws_access_key_id":"your_access_key_id",
"aws_secret_access_key":"your_secret_access_key",
"aws_session_token":"your_session_token"}'
INTO TABLE vectors
FORMAT PARQUET
(id <- id, @v <- vec)
SET vec = @v:>BLOB:>VECTOR(4, F32);

This command uses CREATE PIPELINE and will result in an error if the pipeline already exists. If you want to update a pipeline that exists and may be running, use the syntax CREATE OR REPLACE PIPELINE. Refer to CREATE PIPELINE for more information.

Start the pipeline.

START PIPELINE vec_pipeline_pqt FOREGROUND;

The command above starts a pipeline in the FOREGROUND so that errors will be displayed in the client. If you are using pipelines that will run continuously, consider starting pipelines in the background by omitting the FOREGROUND keyword. Refer to START PIPELINE for more information.

Once the pipeline has completed, verify that the data has been loaded.

SET vector_type_project_format = JSON; /* to make vector output readable */
SELECT id, vec 
FROM vectors
ORDER BY id;
+------+---------------------------------------------------+
| id   | vec                                               |
+------+---------------------------------------------------+
|    1 | [0.449999988,0.550000012,0.495000005,0.5]         |
|    2 | [0.100000001,0.800000012,0.200000003,0.555000007] |
|    3 | [-0.5,-0.0299999993,-0.100000001,0.860000014]     |
|    4 | [0.5,0.300000012,0.806999981,0.100000001]         |
+------+---------------------------------------------------+

Once a pipeline is no longer needed, it can be dropped.

DROP PIPELINE vec_pipeline_pqt;

If you have started a pipeline in the background (by omitting the FOREGROUND keyword), you can stop the pipeline instead of, or before, dropping it. Stopping a pipeline allows it to be used again later; dropping a pipeline removes it from the system. Stopping applies to background pipelines only. Refer to STOP PIPELINE and DROP PIPELINE for more information.

Example 2 - Bulk Loading Vector data using Hexadecimal

SingleStore supports loading vector data from CSV files containing the hexadecimal encodings of vectors. SingleStore recommends this option when writing vectors into a binary format like Parquet or Avro is inconvenient.

Hexadecimal data can be generated using Python as is shown below.

To encode a vector in hexadecimal format, convert the little-endian binary representation of each floating-point element of the vector to its 8-character hexadecimal encoding and concatenate the results. For example, the following Python code produces the hexadecimal encoding of a vector represented as a numpy array:

import numpy

v = numpy.array([0.45, 0.55, 0.495, 0.5], dtype=numpy.float32)
v.tobytes(order='C').hex()
'6666e63ecdcc0c3fa470fd3e0000003f'
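The reverse decoding, from a hexadecimal string back to a vector, can be sketched with numpy.frombuffer:

```python
import numpy as np

# Hexadecimal encoding of the vector [0.45, 0.55, 0.495, 0.5] from above
hex_str = "6666e63ecdcc0c3fa470fd3e0000003f"

# Interpret the bytes as little-endian 32-bit floats
vec = np.frombuffer(bytes.fromhex(hex_str), dtype=np.float32)
print(vec)  # approximately [0.45, 0.55, 0.495, 0.5]
```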

Data for the vectors from Example 1 in CSV format with the vectors encoded in hexadecimal is shown below. Copy this data into a text file that is accessible to your SingleStore instance. For this example, place the data in cloud storage such as S3.

1, 6666E63ECDCC0C3FA470FD3E0000003F
2, CDCCCC3DCDCC4C3FCDCC4C3E7B140E3F
3, 000000BF8FC2F5BCCDCCCCBDF6285C3F
4, 0000003F9A99993E8D974E3FCDCCCC3D
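As a sketch, a CSV file like the one above could be generated in Python; the local file name vec_f32_data_hex.csv is an assumption chosen to match the pipeline in this example (this version writes no space after the comma):

```python
import numpy as np

rows = [
    (1, [0.45, 0.55, 0.495, 0.5]),
    (2, [0.1, 0.8, 0.2, 0.555]),
    (3, [-0.5, -0.03, -0.1, 0.86]),
    (4, [0.5, 0.3, 0.807, 0.1]),
]

with open("vec_f32_data_hex.csv", "w") as f:
    for rid, vec in rows:
        # Encode each vector as the hex of its float32 byte representation.
        hex_vec = np.asarray(vec, dtype=np.float32).tobytes().hex().upper()
        f.write(f"{rid},{hex_vec}\n")
```

Upload the resulting file to your S3 bucket before creating the pipeline below.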

Once you have data with vectors encoded in hexadecimal in a CSV file, the example below demonstrates how to load that data into SingleStore.

Example 2a - Using a Pipeline to Load Data in Hexadecimal from S3

This example uses the vectors table created earlier, so before running this example, delete data from the vectors table to prevent duplicate records.

TRUNCATE vectors;

The following shows loading the data with vectors encoded in hexadecimal into the vectors table.

Note

For loading data in CSV format from S3, a pipeline must be used with LOAD DATA. LOAD DATA S3 works without a pipeline only for Parquet or Avro formats.

To use this pipeline, replace the URI for the S3 location of the data and substitute in your config and credentials. Refer to SELECT for additional details on AWS config and credentials in SingleStore.

CREATE PIPELINE vec_pipeline_hex AS
LOAD DATA S3
's3://vec_data_folder/vec_f32_data_hex.csv'
CONFIG '{"region":"us-west-2"}'
CREDENTIALS '{
"aws_access_key_id":"your_access_key_id",
"aws_secret_access_key":"your_secret_access_key",
"aws_session_token":"your_session_token"}'
INTO TABLE vectors
FORMAT CSV(id, @v)
SET vec = UNHEX(@v):>VECTOR(4,F32);

Note that the last line of this command uses the UNHEX function and a cast to load hexadecimal data into a VECTOR attribute.

FORMAT CSV is the default and is not required in this query. It is included for clarity and to differentiate this query from the query for loading Parquet data. In addition, note that the column labelling format for CSV, (id, @v) in this case, is different from the column labelling format for Parquet, (id <- id, @v <- vec), in the previous example. Refer to LOAD DATA for more detail.

Start the pipeline.

START PIPELINE vec_pipeline_hex FOREGROUND;

Once the data load has completed, verify that the data was loaded using the following SQL.

SET vector_type_project_format = JSON; /* to make vector output readable */
SELECT * 
FROM vectors
ORDER BY id;
+------+---------------------------------------------------+
| id   | vec                                               |
+------+---------------------------------------------------+
|    1 | [0.449999988,0.550000012,0.495000005,0.5]         |
|    2 | [0.100000001,0.800000012,0.200000003,0.555000007] |
|    3 | [-0.5,-0.0299999993,-0.100000001,0.860000014]     |
|    4 | [0.5,0.300000012,0.806999981,0.100000001]         |
+------+---------------------------------------------------+

The results do not exactly match the vector values listed, because elements in the VECTOR data type are stored as floating-point numbers and the vector values are not perfectly representable in floating-point representation.

Once a pipeline is no longer needed, it can be dropped.

DROP PIPELINE vec_pipeline_hex;

If you have started a pipeline in the background (by omitting the FOREGROUND keyword), you can stop the pipeline instead of, or before, dropping it. Stopping a pipeline allows it to be used again later; dropping a pipeline removes it from the system. Stopping applies to background pipelines only. Refer to STOP PIPELINE and DROP PIPELINE for more information.

Example 3 - Exporting Vector Data in Parquet and Hexadecimal

To generate vector data in Parquet or with hexadecimal encodings, first create the vectors table and insert data using the commands below. If the vectors table from the earlier examples still exists, drop it before running these commands.

CREATE TABLE vectors (id INT, vec VECTOR(4) NOT NULL);
INSERT INTO vectors VALUES (1, '[0.45, 0.55, 0.495, 0.5]');
INSERT INTO vectors VALUES (2, '[0.1, 0.8, 0.2, 0.555]');
INSERT INTO vectors VALUES (3, '[-0.5, -0.03, -0.1, 0.86]');
INSERT INTO vectors VALUES (4, '[0.5, 0.3, 0.807, 0.1]');

Example 3a - Exporting Vector Data in Parquet Format

When exporting vector data in Parquet or another binary format, set vector_type_project_format to BINARY.

To export into a Parquet file on S3, use the following command. You will need to replace the S3 URI and the configuration and credentials with your own values. Refer to SELECT for additional details on AWS configuration and credentials in SingleStore.

SET vector_type_project_format = BINARY;
SELECT id, vec
FROM vectors
INTO S3 's3://vec_data_folder/vec_f32_data.parquet'
CONFIG '{"region":"us-west-2"}'
CREDENTIALS '{
"aws_access_key_id":"your_access_key_id",
"aws_secret_access_key":"your_secret_access_key",
"aws_session_token":"your_session_token"}'
FORMAT PARQUET;

Example 3b - Exporting Vector Data in Hexadecimal Format

To export the vectors table with vectors encoded in hexadecimal on S3, use the command below. Again, you will need to replace the S3 URI, configuration, and credentials with your own values.

SELECT id, HEX(vec) AS vec
FROM vectors
INTO S3 's3://vec_data_folder/vec_f32_data_hex.csv'
CONFIG '{"region":"us-west-2"}'
CREDENTIALS '{
"aws_access_key_id":"your_access_key_id",
"aws_secret_access_key":"your_secret_access_key",
"aws_session_token":"your_session_token"}';

Example 4 - Loading Vector Data in JSON Array Format

Create a table with an attribute of type VECTOR with 3 elements and element type 16-bit integer (I16).

CREATE TABLE vectors_i16 (id INT, vec VECTOR(3, I16));

Sample data for the vectors_i16 table in JSON array format is below.

To test loading data from a local file, first copy this data into a text file.

1,'[1,2,3]'
2,'[4,5,6]'
3,'[1,4,8]'
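As a sketch, this CSV with single-quoted JSON arrays could be generated in Python; the local file name vec_i16_data.csv is an assumption chosen to match the pipeline below:

```python
import json

rows = [(1, [1, 2, 3]), (2, [4, 5, 6]), (3, [1, 4, 8])]

with open("vec_i16_data.csv", "w") as f:
    for rid, vec in rows:
        # Render the vector as a compact JSON array, e.g. [1,2,3],
        # and wrap it in single quotes to match the ENCLOSED BY "'" clause.
        arr = json.dumps(vec, separators=(",", ":"))
        f.write(f"{rid},'{arr}'\n")
```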

Example 4a - Using a PIPELINE to Load Vector Data in JSON Format from S3

To use this example, put the text file you created in an S3 bucket. The pipeline below will load that data into the vectors_i16 table.

Replace the URI for the S3 location of the data, update the config, and add your credentials. Refer to SELECT for information on using AWS credentials with SingleStore.

CREATE PIPELINE vec_i16_pipeline AS
LOAD DATA S3
's3://vec_data_folder/vec_i16_data.csv'
CONFIG '{"region":"us-west-2"}'
CREDENTIALS '{
"aws_access_key_id":"your_access_key_id",
"aws_secret_access_key":"your_secret_access_key",
"aws_session_token":"your_session_token"}'
INTO TABLE vectors_i16
FIELDS TERMINATED BY ','
ENCLOSED BY "'"
FORMAT CSV;

Once the pipeline has completed, verify that the data was loaded using the following SQL.

SET vector_type_project_format = JSON; /* to make vector output readable */
SELECT vec FROM vectors_i16;
+---------+
| vec     |
+---------+
| [1,2,3] |
| [4,5,6] |
| [1,4,8] |
+---------+

Once a pipeline is no longer needed, it can be dropped.

DROP PIPELINE vec_i16_pipeline;

If you have started a pipeline in the background (by omitting the FOREGROUND keyword), you can stop the pipeline instead of, or before, dropping it. Stopping a pipeline allows it to be used again later; dropping a pipeline removes it from the system. Stopping applies to background pipelines only. Refer to STOP PIPELINE and DROP PIPELINE for more information.


Last modified: July 16, 2024