Important

The SingleStore 9.1 release candidate (RC) gives you the opportunity to preview, evaluate, and provide feedback on new and upcoming features prior to their general availability. In the interim, SingleStore 9.0 is recommended for production workloads. Deployments running 9.0 can later be upgraded to SingleStore 9.1.

Load Parquet Data using LOAD DATA

Syntax for LOAD DATA AWS S3 or Local File Source

Parquet-formatted data stored in an AWS S3 bucket or on the local filesystem can be loaded via a LOAD DATA query without a pipeline. This streamlines the process of loading cloud-stored data into tables. Other LOAD DATA clauses, such as SET and WHERE, are supported but omitted from the following syntax examples.

For S3:

SQL

LOAD DATA S3 '<bucket name>'
CONFIG '{"region":"<region_name>"}'
CREDENTIALS '{
    "aws_access_key_id":"<key_id>",
    "aws_secret_access_key":"<access_key>"
}'
INTO TABLE <table_name>
(
    `<col_a>` <- %,
    `<col_b>` <- % DEFAULT NULL
)
FORMAT PARQUET;

This data can also be loaded from S3 by using a connection link. Refer to CREATE LINK for more information on connection links.

SQL

LOAD DATA LINK <link_name> '<bucket name>/<path>'
INTO TABLE <table_name>
(
    `<col_a>` <- %,
    `<col_b>` <- % DEFAULT NULL
)
FORMAT PARQUET;

For a local file:

SQL

LOAD DATA INFILE '<path_to_file/file_name>'
INTO TABLE <table_name>
    (val1 <- source1,
     val2 <- source2
     [ ... ]
) [COMPRESSION { AUTO | NONE | LZ4 | GZIP }]
[ ... ]
FORMAT PARQUET;

Disk Usage for Parquet Staging

Note

This staging behavior applies only to Parquet ingestion. CSV, JSON, and Avro files are parsed sequentially and do not use the ingest_staging directory. Because Iceberg data files are stored as Parquet, Iceberg ingestion follows the same staging behavior described in this section.

When loading Parquet files from sources that do not support positional reads, such as S3, GCS, and Azure sources used in pipelines, SingleStore may create a temporary local copy of each file before parsing it. Parquet readers must seek within a file to read metadata, footers, and column chunks, while most object storage APIs provide only sequential downloads. To support these random-access operations, the engine downloads the entire file to a staging location and reads it from local disk.
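To illustrate why a Parquet reader needs random access, the following minimal Python sketch reads the trailer of a Parquet file. It relies on the Parquet file layout, in which a file with a plaintext footer ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`. The function name `read_footer_length` and the in-memory sample file are hypothetical and are not part of SingleStore. The sketch only shows the seek pattern and does not reflect how the engine itself reads files.

```python
import io
import struct

PARQUET_MAGIC = b"PAR1"

def read_footer_length(f):
    """Return the footer length of a Parquet file opened in binary mode.

    The reader must jump to the end of the file first, which a purely
    sequential download stream cannot do.
    """
    f.seek(-8, io.SEEK_END)          # position at the 8-byte trailer
    trailer = f.read(8)
    if len(trailer) != 8:
        raise ValueError("file too short to be Parquet")
    footer_len, magic = struct.unpack("<I4s", trailer)
    if magic != PARQUET_MAGIC:
        raise ValueError("missing PAR1 magic at end of file")
    return footer_len

# Hypothetical in-memory stand-in for a Parquet file (not valid Parquet data).
fake_footer = b"\x00" * 20
fake_file = io.BytesIO(
    PARQUET_MAGIC
    + b"column chunk bytes"
    + fake_footer
    + struct.pack("<I", len(fake_footer))
    + PARQUET_MAGIC
)
print(read_footer_length(fake_file))  # 20
```

A real reader would then seek backward by the footer length to parse the metadata, and seek again to each column chunk it needs. That access pattern is what local staging makes possible.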

Staging Location

Ingest Path	Staging Node
`LOAD DATA` and aggregator pipelines	Aggregator
Regular pipelines (`CREATE PIPELINE`)	Leaf nodes (one staging file per partition processing a Parquet file in the current batch)

For regular pipelines, the aggregator retrieves metadata and assigns source partitions to SingleStore partitions, while leaf nodes download and parse the file. For LOAD DATA and aggregator pipelines, parsing occurs on the aggregator, which forwards parsed rows to leaf nodes in an internal format. Raw Parquet files are not transferred to leaf nodes.

Temporary files are written to the ingest_staging directory under the node's data directory: <data-dir>/ingest_staging/.

Staging files are removed after processing is complete. The directory is also cleared during node startup to remove files left behind by crashes or failed cleanup operations.

Sources that support positional reads natively, such as the local filesystem and seekable HDFS paths, are read directly without staging.

Some LOAD DATA S3 operations can use range reads and avoid creating a full local copy. However, S3 pipelines and other non-seekable extractors always use staging.
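The staging rules above can be summarized as a small decision function. The following Python sketch is only a restatement of this section for planning purposes. The identifiers (`load_data`, `aggregator_pipeline`, `pipeline`, `local_fs`, `seekable_hdfs`, and so on) are hypothetical labels chosen for illustration, not SingleStore settings, and combinations this section does not describe raise an error rather than guess.

```python
from enum import Enum

# Hypothetical labels for the ingest paths named in this section.
PIPELINE_PATHS = ("pipeline", "aggregator_pipeline")

class Staging(Enum):
    NEVER = "read directly, no staging file"
    SOMETIMES = "may use range reads instead of a full local copy"
    ALWAYS = "full local copy under ingest_staging"

def staging_node(ingest_path):
    """Return the node type that holds staging files for an ingest path."""
    if ingest_path in ("load_data", "aggregator_pipeline"):
        return "aggregator"
    if ingest_path == "pipeline":
        return "leaf"
    raise ValueError(f"unknown ingest path: {ingest_path!r}")

def staging_behavior(source, ingest_path):
    """Return the staging behavior this section describes for a source."""
    if source in ("local_fs", "seekable_hdfs"):
        return Staging.NEVER          # positional reads are supported natively
    if source == "s3" and ingest_path == "load_data":
        return Staging.SOMETIMES      # some operations can use range reads
    if source in ("s3", "gcs", "azure") and ingest_path in PIPELINE_PATHS:
        return Staging.ALWAYS         # non-seekable extractors always stage
    raise ValueError("combination not described in this section")

print(staging_node("pipeline"), staging_behavior("s3", "pipeline").name)
```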

Disk Space Requirements

Ensure that each ingest node has sufficient free disk space to accommodate the largest Parquet file, or the largest set of Parquet files, processed concurrently.

For LOAD DATA and aggregator pipelines, plan capacity on the aggregator.
For regular pipelines, plan capacity per leaf node. A leaf can host multiple partitions, and each partition can stage its own Parquet file within a batch. Available disk space on the leaf must be sufficient for the combined size of all Parquet files being processed in parallel.
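As a rough planning aid, the guidance above can be turned into a worst-case estimate. The following Python sketch assumes that the worst case is the N largest files being staged on one node at the same time, where N is the number of files that node can process in parallel (for example, the number of partitions on a leaf). The function names and the sample sizes are hypothetical. The free-space check uses the standard library call `shutil.disk_usage` and should be run against the node's actual data directory.

```python
import shutil

def required_staging_bytes(file_sizes, max_concurrent):
    """Worst-case staging footprint: the `max_concurrent` largest files at once."""
    if max_concurrent < 0:
        raise ValueError("max_concurrent must be non-negative")
    largest_first = sorted(file_sizes, reverse=True)
    return sum(largest_first[:max_concurrent])

def has_enough_space(directory, required_bytes):
    """Check free space on the filesystem that contains `directory`."""
    return shutil.disk_usage(directory).free >= required_bytes

# Hypothetical Parquet file sizes in bytes, and a leaf hosting 2 partitions.
sizes = [4_000, 1_000, 3_000, 2_000]
needed = required_staging_bytes(sizes, max_concurrent=2)
print(needed)  # 7000
print(has_enough_space(".", needed))
```

For LOAD DATA and aggregator pipelines, the same estimate would be applied to the aggregator. For regular pipelines, it would be applied to each leaf. Leaving headroom above the estimate is prudent, since the data directory is shared with other database files.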