How to Bulk Load Vectors
On this page
SingleStore recommends loading larger vector data sets using a binary format such as Apache Parquet or Apache Avro™ and using LOAD DATA and pipelines (CREATE PIPELINE).
The default input and output format for vectors in SingleStore is JSON array format.
Examples 1 and 2 below demonstrate loading vector data using Parquet and hexadecimal encodings, respectively.
Example 1 - Bulk Loading Vector Data using Parquet
SingleStore supports loading vector data from Parquet files.DataFrame
to a Parquet file.
Alternatively, you can generate a small sample Parquet file to experiment with by loading the sample data below into SingleStore and following the instructions for exporting data to Parquet in the Example 3 - Exporting Vector Data in Parquet and Hexadecimal section.
Once you have a Parquet file with vector data, the example below demonstrates how to load that data into SingleStore.
Create a table with a column of type VECTOR
with 4 elements (dimension 4) and an element type of 32-bit floating-point number (F32
).
CREATE TABLE vectors(id INT, vec VECTOR(4, F32) NOT NULL);
This example uses the following data.
1,'[0.45, 0.55, 0.495, 0.5]'2,'[0.1, 0.8, 0.2, 0.555]'3,'[-0.5, -0.03, -0.1, 0.86]'4,'[0.5, 0.3, 0.807, 0.1]'
Example 1a - Using LOAD DATA to Load Parquet Data from S3
The following command loads a parquet file stored in S3 to the vectors
table.
To run the following command, replace the URI for the S3 location of the data, update the config, and add your credentials.
LOAD DATA S3's3://vec_data_folder/vec_f32_data.parquet'CONFIG '{"region":"us-west-2"}'CREDENTIALS '{"aws_access_key_id":"your_access_key_id","aws_secret_access_key":"your_secret_access_key","aws_session_token":"your_session_token"}'INTO TABLE vectorsFORMAT PARQUET(id <- id, @v <-vec)SET vec = @v:>BLOB:>VECTOR(4, F32);
Note the last line of this command which uses a set of casts to load Parquet data into the VECTOR
attribute: SET vec = @v:>BLOB:>VECTOR(4, F32)
.LOAD DATA
expects vector data to be in textual JSON array notation.SET
clause override this, and allow more efficient assignment directly from binary data.
Run the following SQL statement to verify that the data was loaded.
SELECT id, vecFROM vectorsORDER BY id;
+------+---------------------------------------------------+
| id | vec |
+------+---------------------------------------------------+
| 1 | [0.449999988,0.550000012,0.495000005,0.5] |
| 2 | [0.100000001,0.800000012,0.200000003,0.555000007] |
| 3 | [-0.5,-0.0299999993,-0.100000001,0.860000014] |
| 4 | [0.5,0.300000012,0.806999981,0.100000001] |
+------+---------------------------------------------------+
The results do not exactly match the vector values listed above because elements in the VECTOR
data type are stored as floating-point numbers and the vector values are not perfectly representable in floating-point representation.
Example 1b - Using a Pipeline to Load Parquet Data from S3
You can also use a pipeline to load data from S3.
Create a pipeline to load the Parquet data into the vectors
table.
CREATE PIPELINE vec_pipeline_pqt ASLOAD DATA S3's3://vec_data_folder/vec_f32_data.parquet'CONFIG '{"region":"us-west-2"}'CREDENTIALS '{"aws_access_key_id":"your_access_key_id","aws_secret_access_key":"your_secret_access_key","aws_session_token":"your_session_token"}'INTO TABLE vectorsFORMAT PARQUET(id <- id, @v <-vec)SET vec = @v:>BLOB:>VECTOR(4, F32);
This command uses CREATE PIPELINE
and will result in an error if the pipeline already exists.CREATE OR REPLACE PIPELINE
.
Start the pipeline.
START PIPELINE vec_pipeline_pqt FOREGROUND;
The command above starts a pipeline in the FOREGROUND
so that errors will be displayed in the client.FOREGROUND
keyword.
Once the pipeline has completed, verify that the data has been loaded.
SELECT id, vecFROM vectorsORDER BY id;
+------+---------------------------------------------------+
| id | vec |
+------+---------------------------------------------------+
| 1 | [0.449999988,0.550000012,0.495000005,0.5] |
| 2 | [0.100000001,0.800000012,0.200000003,0.555000007] |
| 3 | [-0.5,-0.0299999993,-0.100000001,0.860000014] |
| 4 | [0.5,0.300000012,0.806999981,0.100000001] |
+------+---------------------------------------------------+
Once a pipeline is no longer needed, it can be dropped.
DROP PIPELINE vec_pipeline_pqt;
If you have started a pipeline in the background (by omitting the FOREGROUND
keyword), you can stop the pipeline instead of, or before, dropping it.
Example 1c - Loading Data in Parquet Format from a Local File
Use the following SQL statement to load Parquet data from a local file into the vectors
table.'/tmp/vec_
, with the path to your text file.
LOAD DATA LOCAL INFILE '/tmp/vec_f32_data.parquet'INTO TABLE vectorsFORMAT PARQUET(id <- id, @v <-vec)SET vec = @v:>BLOB:>VECTOR(4, F32);
Refer to LOAD DATA for more information on using LOAD DATA LOCAL
.
Example 2 - Bulk Loading Vector data using Hexadecimal
SingleStore supports loading vector data from CSV files containing the hexadecimal encodings of vectors.
Hexadecimal data can be generated using Python as is shown below.
To encode a vector in hexadecimal format, convert the little-endian binary representation of each floating point element of the vector to its 8 character hexadecimal encoding and concatenate the results.numpy
array:
v = numpy.array([0.45, 0.55, 0.495, 0.5], dtype=numpy.float32)v.tobytes(order='C').hex()'6666e63ecdcc0c3fa470fd3e0000003f'
Data for the vectors from Example 1 in CSV format with the vectors encoded in hexadecimal is shown below.
1, 6666E63ECDCC0C3FA470FD3E0000003F2, CDCCCC3DCDCC4C3FCDCC4C3E7B140E3F3, 000000BF8FC2F5BCCDCCCCBDF6285C3F4, 0000003F9A99993E8D974E3FCDCCCC3D
Once you have data with vectors encoded in hexadecimal in a CSV file, the example below demonstrates how to load that data into SingleStore.
Example 2a - Using a Pipeline to Load Data in Hexadecimal from S3
This example uses the vectors
table created earlier, so before running this example, delete data from the vectors
table to prevent duplicate records.
TRUNCATE vectors;
The following shows loading the data with vectors encoded in hexadecimal into the vectors
table.
Note
For loading data in CSV format from S3, a pipeline must be used with LOAD DATA
.LOAD DATA S3
works without a pipeline only for Parquet or Avro formats.
To use this pipeline, replace the URI for the S3 location of the data and substitute in your config and credentials.
CREATE PIPELINE vec_pipeline_hex ASLOAD DATA S3's3://vec_data_folder/vec_f32_data_hex.csv'CONFIG '{"region":"us-west-2"}'CREDENTIALS '{"aws_access_key_id":"your_access_key_id","aws_secret_access_key":"your_secret_access_key","aws_session_token":"your_session_token"}'INTO TABLE vectorsFORMAT CSV(id, @v)SET vec = UNHEX(@v):>VECTOR(4,F32);
Note the last line of this command uses the UNHEX function and a cast to load hexadecimal data into a VECTOR
attribute.
FORMAT CSV
is the default and is not required in this query.(id, @v)
in this case, is different than the column labelling format for Parquet (id ->- id, @v -> vec)
in the previous example.
Start the pipeline.
START PIPELINE vec_pipeline_hex FOREGROUND;
Once the data load has completed, verify that the data was loaded using the following SQL.
SELECT *FROM vectorsORDER BY id;
+------+---------------------------------------------------+
| id | vec |
+------+---------------------------------------------------+
| 1 | [0.449999988,0.550000012,0.495000005,0.5] |
| 2 | [0.100000001,0.800000012,0.200000003,0.555000007] |
| 3 | [-0.5,-0.0299999993,-0.100000001,0.860000014] |
| 4 | [0.5,0.300000012,0.806999981,0.100000001] |
+------+---------------------------------------------------+
The results do not exactly match the vector values listed, because elements in the VECTOR
data type are stored as floating-point numbers and the vector values are not perfectly representable in floating-point representation.
Once a pipeline is no longer needed it can be dropped.
DROP PIPELINE vec_pipeline_hex;
If you have started a pipeline in the background (by omitting the FOREGROUND
keyword), you can stop the pipeline instead of, or before, dropping it.
Example 2b - Loading Data in Hexadecimal from a Local File
Use the following SQL statement to load the data with vectors encoded in hexadecimal into the vectors
table.'/tmp/vec_
with the path to your text file.
As with the example above, the command below uses UNHEX and a cast to load hexadecimal data into a VECTOR
attribute.
LOAD DATA LOCAL INFILE '/tmp/vec_f32_data_hex.csv'INTO TABLE vectorsFORMAT CSV(id, @v)SET vec = UNHEX(@v):>VECTOR(4,F32);
Example 3 - Exporting Vector Data in Parquet and Hexadecimal
To generate vector data in Parquet or with hexadecimal encodings, first create the vectors
table and insert data using the commands below.
CREATE TABLE vectors (id int, vec VECTOR(4) not null);INSERT INTO vectors VALUES (1, '[0.45, 0.55, 0.495, 0.5]');INSERT INTO vectors VALUES (2, '[0.1, 0.8, 0.2, 0.555]');INSERT INTO vectors VALUES (3, '[-0.5, -0.03, -0.1, 0.86]');INSERT INTO vectors VALUES (4, '[0.5, 0.3, 0.807, 0.1]');
Example 3a - Exporting Vector Data in Parquet Format
To export into a Parquet file on S3, use the following command.
SELECT id, vec :>BLOB AS vecFROM vectorsINTO S3 's3://vec_data_folder/vec_f32_data.parquet'CONFIG '{"region":"us-west-2"}'CREDENTIALS '{"aws_access_key_id":"your_access_key_id","aws_secret_access_key":"your_secret_access_key","aws_session_token":"your_session_token"}'FORMAT PARQUET;
Note that the first line of this command uses a cast, vec:>BLOB AS vec
to obtain the binary representation of the VECTOR
instead of the JSON array representation and to ensure that the binary representation is encoded in the Parquet file.
You can output data in Parquet format to a local file using the following command.
SELECT id, vec :>BLOB as vecFROM vectorsINTO outfile "/tmp/vec_f32_data.parquet"FORMAT PARQUET;
Example 3b - Exporting Vector Data in Hexadecimal Format
To export the vectors
table with vectors encoded in hexadecimal on S3, use the command below.
SELECT id, hex(vec:>BLOB) as vecFROM vectorsINTO S3 's3://vec_data_folder/vec_f32_data_hex.csv'CONFIG '{"region":"us-west-2"}'CREDENTIALS '{"aws_access_key_id":"your_access_key_id","aws_secret_access_key":"your_secret_access_key","aws_session_token":"your_session_token"}';
Observe the syntax in the first line hex(vec:>BLOB)
which converts the vector to hexadecimal.vec:>BLOB
is required; HEX(vec)
will return the hexadecimal representation of the JSON representation of the vector, while HEX(vec:>BLOB)
returns the hexadecimal representation of the numerical vector data.
You can output the data with vectors encoded in hexadecimal to a local CSV file using the following command.
SELECT id, HEX(vec:>BLOB) as vecFROM vectorsINTO outfile "/tmp/vec_f32_data_hex.csv";
Example 4 - Loading Vector Data in JSON Array Format
Create a table with an attribute of type VECTOR
with 3 elements and element type 16-bit integer (I16
).
CREATE TABLE vectors_i16 (id INT,vec VECTOR(3, I16));
Sample data for the vectors_
table in JSON array format is below.
To test loading data from a local file, first copy this data into a text file.
1,'[1,2,3]'2,'[4,5,6]'3,'[1,4,8]'
Example 4a - Using a PIPELINE to Load Vector Data in JSON Format from S3
To use this example, put the text file you created in a S3 bucket.PIPELINE
below will load that data into the vectors_
table.
Replace the URI for the S3 location of the data, update the config, and add your credentials.
CREATE PIPELINE vec_i16_pipeline ASLOAD DATA S3's3://vec_data_folder/vec_i16_data.csv'CONFIG '{"region":"us-west-2"}'CREDENTIALS '{"aws_access_key_id":"your_access_key_id","aws_secret_access_key":"your_secret_access_key","aws_session_token":"your_session_token"}'INTO TABLE vectors_i16FIELDS TERMINATED BY ','ENCLOSED BY "'"FORMAT CSV;
Once the pipeline has completed, validate that the data loaded using the following SQL.
SELECT vec FROM vectors_i16;
+---------+
| vec |
+---------+
| [1,2,3] |
| [4,5,6] |
| [1,4,8] |
+---------+
Once a pipeline is no longer needed it can be dropped.
DROP PIPELINE vec_i16_pipeline;
If you have started a pipeline in the background (by omitting the FOREGROUND
keyword), you can stop the pipeline instead of, or before, dropping it.
Example 4b - Loading data in JSON Format from a Local File
Use the following SQL to load the data with vectors in JSON array format into the vectors_
table.'/tmp/vec_
with the path to your text file.
LOAD DATA LOCAL INFILE '/tmp/vec_i16_data.csv'INTO TABLE vectors_i16FIELDS TERMINATED BY ','ENCLOSED BY "'"FORMAT CSV;
Note that lines FIELDS TERMINATED BY ','
and ENCLOSED BY "''"
clause tells the database that the fields in the data file are separated by commas and that the fields may be enclosed with single quotes.
Related Topics
Last modified: April 15, 2024