Important
The SingleStore 9.0 release candidate (RC) gives you the opportunity to preview, evaluate, and provide feedback on new and upcoming features prior to their general availability. In the interim, SingleStore 8.9 is recommended for production workloads, which can later be upgraded to SingleStore 9.0.
How to Bulk Load Vectors
On this page
SingleStore recommends loading larger vector data sets using a binary format, such as Apache Parquet or Apache Avro™, and using LOAD DATA and PIPELINEs.
-
Faster load speed.
-
Loading from binary file formats uses less CPU than loading from JSON array of numbers format.
-
-
Smaller input file size.
-
Vectors stored in binary format are typically much smaller than vectors stored in JSON array of numbers format.
-
-
No loss of precision.
-
Using the JSON array of numbers format, you may lose precision due to rounding errors.
-
Thus, for high performance, SingleStore recommends using Parquet or Avro formats to load large vector data sets, if possible.
SingleStore also supports input and output of vectors in JSON array of numbers format.
Example 1 provides an end-to-end example of generating and loading binary vectors using Parquet file format and Python.
Warning
Deprecation Notice
String functions, such as CONCAT
, LENGTH
, SUBSTR
, and HEX
, previously operated on the JSON representation of a vector, interpreted as a string.
If you truly intend to use a string function on the JSON string representation of a vector, you can write code in the following way so that it will run the same way before and after this behavior change.
Suppose that vec
is a vector-type value and stringfunc
is any string function and you have expression:
stringfunc(vec)
you can change it to:
stringfunc(vec :> json)
Output Format for Examples
Vectors can be output in JSON or binary format.
Use the following command to output vectors in JSON format.
SET vector_type_project_format = JSON;
Use the following command to set the output format back to binary.
SET vector_type_project_format = BINARY;
Example 1 - Generate and Load Binary Vector Data in Parquet using Python
This example first generates a file containing binary vectors using Parquet file format.
Generate a Parquet File containing Binary Vector Data
Vector data in Parquet file format can be generated using Python and pyarrow.
Install the pyarrow
library, using pip3
or using an installation command appropriate to your system.
pip3 install -q pyarrow numpy
Start the Python interpreter.
python3
The Python code below will generate a Parquet file that matches the schema of the comments
table with the data shown in the table.
CREATE TABLE comments(id INT,comment TEXT,comment_embedding VECTOR(4),category VARCHAR(256));
id |
comment |
comment_ |
category |
---|---|---|---|
1 |
"The cafeteria in building 35 has a great salad bar" |
[0. |
"Food" |
2 |
"I love the taco bar in the B16 cafeteria. |
[0. |
"Food" |
3 |
"The B24 restaurant salad bar is quite good. |
[0. |
"Food" |
Note: In this table, the vector data is shown as a JSON array for readability; in the example, the data will be generated and loaded in binary format.
Run the following code in the python interpreter.comments.
._
are used for column names in the parquet file so that the parquet column names can be distinguished from SingleStore column names.
import pyarrow as paimport pyarrow.parquet as pqimport numpy as np# Define column names for Parquet filecolumns = ['Id_pqt', 'Comment_pqt', 'Comment_embedding_pqt', 'Category_pqt']# create some example binary vectorsemb1 = np.array([0.45, 0.55, 0.495, 0.5], dtype=np.float32).tobytes()emb2 = np.array([0.01, 0.01, 0.1, 0.99], dtype=np.float32).tobytes()emb3 = np.array([0.1, 0.8, 0.2, 0.555], dtype=np.float32).tobytes()# Create the data arraysids = pa.array([1,2,3], type=pa.int8())comments = pa.array(["The cafeteria in building 35 has a great salad bar", "I love the taco bar in the B16 cafeteria.", "The B24 restaurant salad bar is quite good."], type=pa.string())binary_vectors = pa.array([emb1, emb2, emb3], type=pa.binary())categories = pa.array(["Food","Food","Food"],type=pa.string())table = pa.Table.from_arrays([ids,comments,binary_vectors,categories],columns) pq.write_table(table, 'comments.parquet')
Load the Parquet File using a Pipeline
Create the comments table in SingleStore.CREATE TABLE
statement for the comments table is repeated below for convenience.
CREATE TABLE comments(id INT,comment TEXT,comment_embedding VECTOR(4),category VARCHAR(256));
Create a pipeline to load the comments.
file into the comments
table.
The following pipeline accesses a publicly available bucket which contains the comments.
file and can be run as is (without credentials).
CREATE PIPELINE parquet_comments_pipeAS LOAD DATA S3 's3://singlestore-docs-example-datasets/vecs/comments.parquet'CONFIG '{"region":"us-east-1"}'INTO TABLE comments(id <- Id_pqt,comment <- Comment_pqt,@var <- Comment_embedding_pqt,category <- Category_pqt)FORMAT PARQUETSET comment_embedding = @var:>BLOB:>VECTOR(4, F32);
The SET
statement converts the binary vector from the parquet file into a SingleStore VECTOR
type.SET
statement does not do data conversion; it casts the data to a VECTOR(4, F32)
, validates that the number of elements is as expected, and that the data does not contain any NAN/INF values.
Run the pipeline and verify the import succeeded.
START PIPELINE parquet_comments_pipe FOREGROUND;
View the values in the table to confirm the import succeeded.
SET vector_type_project_format = JSON; /* to make vector output human-readable */SELECT *FROM comments;
*** 1. row ***
id: 1
comment: The cafeteria in building 35 has a great salad bar
comment_embedding: [0.449999988,0.550000012,0.495000005,0.5]
category: Food
*** 2. row ***
id: 3
comment: The B24 restaurant salad bar is quite good.
comment_embedding: [0.100000001,0.800000012,0.200000003,0.555000007]
category: Food
*** 3. row ***
id: 2
comment: I love the taco bar in the B16 cafeteria.
comment_embedding: [0.00999999978,0.00999999978,0.100000001,0.99000001]
category: Food
Load the Parquet File using LOAD DATA
Load the comments.
file into the comments
table using a LOAD DATA command.LOAD DATA
may be used to load Parquet and Avro files from S3.comments.
file and can be run as is (without credentials).
LOAD DATA S3 's3://singlestore-docs-example-datasets/vecs/comments.parquet'CONFIG '{"region":"us-east-1"}'INTO TABLE comments(id <- Id_pqt,comment <- Comment_pqt,@emb <- Comment_embedding_pqt,category <- Category_pqt)FORMAT PARQUETSET comment_embedding = @emb:>BLOB:>VECTOR(4, F32);
View the values in the table to confirm the import succeeded.
SET vector_type_project_format = JSON; /* to make vector output human readable */SELECT *FROM comments;
*** 1. row ***
id: 2
comment: I love the taco bar in the B16 cafeteria.
comment_embedding: [0.00999999978,0.00999999978,0.100000001,0.99000001]
category: Food
*** 2. row ***
id: 1
comment: The cafeteria in building 35 has a great salad bar
comment_embedding: [0.449999988,0.550000012,0.495000005,0.5]
category: Food
*** 3. row ***
id: 3
comment: The B24 restaurant salad bar is quite good.
comment_embedding: [0.100000001,0.800000012,0.200000003,0.555000007]
category: Food
The LOAD DATA LOCAL command may be used to load from a local file.
Example 2 - Bulk Loading Vector Data using Parquet
SingleStore supports loading vector data from Parquet files.
Create a table with a column of type VECTOR
with 4 elements (dimension 4) and an element type of 32-bit floating-point number (F32
).
CREATE TABLE vectors(id INT, vec VECTOR(4, F32) NOT NULL);
This example uses the following data.
1,'[0.45, 0.55, 0.495, 0.5]'
2,'[0.1, 0.8, 0.2, 0.555]'
3,'[-0.5, -0.03, -0.1, 0.86]'
4,'[0.5, 0.3, 0.807, 0.1]'
Example 2a - Using LOAD DATA to Load Parquet Data from S3
The following command loads a parquet file stored in S3 to the vectors
table.s3://singlestore-docs-examples-datasets
is a public bucket.
LOAD DATA S3's3://singlestore-docs-example-datasets/vecs/vec_f32_data.parquet'CONFIG '{"region":"us-east-1"}'CREDENTIALS '{}'INTO TABLE vectorsFORMAT PARQUET(id <- id, @v <-vec)SET vec = @v:>BLOB:>VECTOR(4, F32);
Note the last line of this command which uses a set of casts to load Parquet data into the VECTOR
attribute: SET vec = @v:>BLOB:>VECTOR(4, F32)
.LOAD DATA
expects vector data to be in textual JSON array notation.SET
clause override this and allow more efficient assignment directly from binary data.
Run the following SQL statement to verify that the data was loaded.
SET vector_type_project_format = JSON; /* to make vector output readable */SELECT id, vecFROM vectorsORDER BY id;
+------+---------------------------------------------------+
| id | vec |
+------+---------------------------------------------------+
| 1 | [0.449999988,0.550000012,0.495000005,0.5] |
| 2 | [0.100000001,0.800000012,0.200000003,0.555000007] |
| 3 | [-0.5,-0.0299999993,-0.100000001,0.860000014] |
| 4 | [0.5,0.300000012,0.806999981,0.100000001] |
+------+---------------------------------------------------+
The results do not exactly match the vector values listed at the beginning of this example because elements in the VECTOR
data type are stored as floating-point numbers and the vector values are not perfectly representable in floating-point representation.
To experiment further and use your own data file, follow the instructions in the Example 5 - Export Vector Data in Parquet and Hexadecimal section.
Example 2b - Using a Pipeline to Load Parquet Data from S3
You can also use a pipeline to load data from S3.
Create a pipeline to load the Parquet data into the vectors
table.
CREATE PIPELINE vec_pipeline_pqt ASLOAD DATA S3's3://singlestore-docs-example-datasets/vecs/vec_f32_data.parquet'CONFIG '{"region":"us-east-1"}'CREDENTIALS '{}'INTO TABLE vectorsFORMAT PARQUET(id <- id, @v <-vec)SET vec = @v:>BLOB:>VECTOR(4, F32);
This command uses CREATE PIPELINE
and will result in an error if the pipeline already exists.CREATE OR REPLACE PIPELINE
.
Start the pipeline.
START PIPELINE vec_pipeline_pqt FOREGROUND;
The command above starts a pipeline in the FOREGROUND
so that errors will be displayed in the client.FOREGROUND
keyword.
Once the pipeline has completed, verify that the data has been loaded.
SET vector_type_project_format = JSON; /* to make vector output readable */SELECT id, vecFROM vectorsORDER BY id;
+------+---------------------------------------------------+
| id | vec |
+------+---------------------------------------------------+
| 1 | [0.449999988,0.550000012,0.495000005,0.5] |
| 2 | [0.100000001,0.800000012,0.200000003,0.555000007] |
| 3 | [-0.5,-0.0299999993,-0.100000001,0.860000014] |
| 4 | [0.5,0.300000012,0.806999981,0.100000001] |
+------+---------------------------------------------------+
Once a pipeline is no longer needed, it can be dropped.
DROP PIPELINE vec_pipeline_pqt;
If you have started a pipeline in the background (by omitting the FOREGROUND
keyword), you can stop the pipeline instead of, or before, dropping it.
Example 2c - Loading Data in Parquet Format from a Local File
Use the following SQL statement to load Parquet data from a local file into the vectors
table.'/tmp/vec_
, with the path to your text file.
LOAD DATA LOCAL INFILE '/tmp/vec_f32_data.parquet'INTO TABLE vectorsFORMAT PARQUET(id <- id, @v <-vec)SET vec = @v:>BLOB:>VECTOR(4, F32);
Refer to LOAD DATA for more information on using LOAD DATA LOCAL
.
To generate a Parquet file to use with this command, follow the instructions in the Example 5 - Export Vector Data in Parquet and Hexadecimal section.
Example 3 - Bulk Loading Vector data using Hexadecimal
SingleStore supports loading vector data from CSV files containing the hexadecimal encodings of vectors.
Data for the vectors from Example 1 in CSV format with the vectors encoded in hexadecimal is shown below.
1, 6666E63ECDCC0C3FA470FD3E0000003F
2, CDCCCC3DCDCC4C3FCDCC4C3E7B140E3F
3, 000000BF8FC2F5BCCDCCCCBDF6285C3F
4, 0000003F9A99993E8D974E3FCDCCCC3D
Example 3a - Using a Pipeline to Load Data in Hexadecimal from S3
This example uses the vectors
table created earlier, so before running this example, delete data from the vectors
table to prevent duplicate records.
TRUNCATE vectors;
The following shows loading the data with vectors encoded in hexadecimal into the vectors
table.
Note
For loading data in CSV format from S3, a pipeline must be used with LOAD DATA
.LOAD DATA S3
works without a pipeline only for Parquet or Avro formats.
The command below can be run as is because s3://singlestore-docs-example-datasets/
is a public bucket.
CREATE PIPELINE vec_pipeline_hex ASLOAD DATA S3's3://singlestore-docs-example-datasets/vecs/vec_f32_data_hex.csv'CONFIG '{"region":"us-east-1"}'CREDENTIALS '{}'INTO TABLE vectorsFORMAT CSV(id, @v)SET vec = UNHEX(@v):>VECTOR(4,F32);
Note the last line of this command uses the UNHEX function and a cast to load hexadecimal data into a VECTOR
attribute.
FORMAT CSV
is the default and is not required in this query.(id, @v)
in this case, is different than the column labelling format for Parquet (id ->- id, @v -> vec)
in the previous example.
Start the pipeline.
START PIPELINE vec_pipeline_hex FOREGROUND;
Once the data load has completed, verify that the data was loaded using the following SQL.
SET vector_type_project_format = JSON; /* to make vector output readable */SELECT *FROM vectorsORDER BY id;
+------+---------------------------------------------------+
| id | vec |
+------+---------------------------------------------------+
| 1 | [0.449999988,0.550000012,0.495000005,0.5] |
| 2 | [0.100000001,0.800000012,0.200000003,0.555000007] |
| 3 | [-0.5,-0.0299999993,-0.100000001,0.860000014] |
| 4 | [0.5,0.300000012,0.806999981,0.100000001] |
+------+---------------------------------------------------+
The results do not exactly match the vector values listed, because elements in the VECTOR
data type are stored as floating-point numbers and the vector values are not perfectly representable in floating-point representation.
Once a pipeline is no longer needed it can be dropped.
DROP PIPELINE vec_pipeline_hex;
If you have started a pipeline in the background (by omitting the FOREGROUND
keyword), you can stop the pipeline instead of, or before, dropping it.
To experiment with your own data files and bucket, refer to Example 5 - Export Vector Data in Parquet and Hexadecimal and Example 6 - Generate Vector Data.
Example 3b - Loading Data in Hexadecimal from a Local File
Use the following SQL statement to load the data with vectors encoded in hexadecimal into the vectors
table.'/tmp/vec_
with the path to your text file.
As with the example above, the command below uses UNHEX and a cast to load hexadecimal data into a VECTOR
attribute.
LOAD DATA LOCAL INFILE '/tmp/vec_f32_data_hex.csv'INTO TABLE vectorsFORMAT CSV(id, @v)SET vec = UNHEX(@v):>VECTOR(4,F32);
Example 4 - Loading Vector Data in JSON Array Format
Create a table with an attribute of type VECTOR
with 3 elements and element type 16-bit integer (I16
).
CREATE TABLE vectors_i16 (id INT,vec VECTOR(3, I16));
Sample data for the vectors_
table in JSON array format is below.
1,'[1,2,3]'
2,'[4,5,6]'
3,'[1,4,8]'
Example 4a - Using a PIPELINE to Load Vector Data in JSON Format from S3
The PIPELINE
below loads this data into the vectors_
table.s3://singlestore-docs-example-datasets/
is a public bucket.
CREATE PIPELINE vec_i16_pipeline ASLOAD DATA S3's3://singlestore-docs-example-datasets/vecs/vec_i16_data.csv'CONFIG '{"region":"us-east-1"}'CREDENTIALS '{}'INTO TABLE vectors_i16FIELDS TERMINATED BY ','ENCLOSED BY "'"FORMAT CSV;
Start the pipeline.
START PIPELINE vec_i16_pipeline FOREGROUND;
Once the pipeline has completed, validate that the data loaded using the following SQL.
SET vector_type_project_format = JSON; /* to make vector output readable */SELECT vec FROM vectors_i16;
+---------+
| vec |
+---------+
| [1,2,3] |
| [4,5,6] |
| [1,4,8] |
+---------+
Once a pipeline is no longer needed it can be dropped.
DROP PIPELINE vec_i16_pipeline;
If you have started a pipeline in the background (by omitting the FOREGROUND
keyword), you can stop the pipeline instead of, or before, dropping it.
Example 4b - Loading data in JSON Format from a Local File
Use the following SQL to load the data with vectors in JSON array format into the vectors_
table.'/tmp/vec_
with the path to your text file.
LOAD DATA LOCAL INFILE '/tmp/vec_i16_data.csv'INTO TABLE vectors_i16FIELDS TERMINATED BY ','ENCLOSED BY "'"FORMAT CSV;
Note that lines FIELDS TERMINATED BY ','
and ENCLOSED BY "''"
clause tells the database that the fields in the data file are separated by commas and that the fields may be enclosed with single quotes.
Example 5 - Export Vector Data in Parquet and Hexadecimal
To experiment further with loading data in Parquet and hexadecimal format, use these instructions in this section to generate a data file and place that file in cloud storage location such as S3.
First create the vectors
table and insert data using the commands below.
CREATE TABLE vectors (id int, vec VECTOR(4) not null);INSERT INTO vectors VALUES (1, '[0.45, 0.55, 0.495, 0.5]');INSERT INTO vectors VALUES (2, '[0.1, 0.8, 0.2, 0.555]');INSERT INTO vectors VALUES (3, '[-0.5, -0.03, -0.1, 0.86]');INSERT INTO vectors VALUES (4, '[0.5, 0.3, 0.807, 0.1]');
Example 5a - Export Vector Data in Parquet Format
When exporting vector data in Parquet, or other binary format, set vector_
to BINARY
.
To export into a Parquet file on S3, use the following command.
SET vector_type_project_format = BINARY;SELECT id, vecFROM vectorsINTO S3 's3://vec_data_folder/vec_f32_data.parquet'CONFIG '{"region":"us-west-2"}'CREDENTIALS '{"aws_access_key_id":"your_access_key_id","aws_secret_access_key":"your_secret_access_key","aws_session_token":"your_session_token"}'FORMAT PARQUET;
You can output data in Parquet format to a local file using the following command.
SET vector_type_project_format = BINARY;SELECT id, vecFROM vectorsINTO outfile "/tmp/vec_f32_data.parquet"FORMAT PARQUET;
Example 5b - Export Vector Data in Hexadecimal Format
To export the vectors
table with vectors encoded in hexadecimal on S3, use the command below.
SELECT id, HEX(vec) AS vecFROM vectorsINTO S3 's3://vec_data_folder/vec_f32_data_hex.csv'CONFIG '{"region":"us-west-2"}'CREDENTIALS '{"aws_access_key_id":"your_access_key_id","aws_secret_access_key":"your_secret_access_key","aws_session_token":"your_session_token"}';
You can output the data with vectors encoded in hexadecimal to a local CSV file using the following command.
SELECT id, HEX(vec) AS vecFROM vectorsINTO outfile "/tmp/vec_f32_data_hex.csv";
Example 6 - Generate Vector Data
Vector data in Parquet file format can be generated using Python and pandas.pandas
library provides a function for exporting a pandas
DataFrame
to a Parquet file.
To encode a vector in hexadecimal format, convert the little-endian binary representation of each floating-point element of the vector to its 8-character hexadecimal encoding and concatenate the results.numpy
array:
v = numpy.array([0.45, 0.55, 0.495, 0.5], dtype=numpy.float32)v.tobytes(order='C').hex()'6666e63ecdcc0c3fa470fd3e0000003f'
To experiment further with loading hexadecimal data, use this code to generate the hexadecimal data and place the data in cloud storage location such as S3.
Related Topics
Last modified: August 22, 2025