Important
The SingleStore 9.1 release candidate (RC) gives you the opportunity to preview, evaluate, and provide feedback on new and upcoming features prior to their general availability. In the interim, SingleStore 9.0 is recommended for production workloads, which can later be upgraded to SingleStore 9.1.
Load Parquet Data using LOAD DATA
The LOAD DATA command supports loading Parquet files from AWS S3 or from the local filesystem. Parquet data can also be loaded through the LOAD DATA clause in a CREATE PIPELINE statement.
Syntax for LOAD DATA AWS S3 or Local File Source
Parquet-formatted data stored in an AWS S3 bucket or on the local filesystem can be loaded via a LOAD DATA query without a pipeline. Other LOAD DATA clauses (SET, WHERE, etc.) can be used with this syntax.
For S3:
LOAD DATA S3 '<bucket name>'
CONFIG '{"region" : "<region_name>"}'
CREDENTIALS '{"aws_access_key_id" : "<key_id>",
              "aws_secret_access_key": "<access_key>"}'
INTO TABLE <table_name>
(`<col_a>` <- %,
 `<col_b>` <- % DEFAULT NULL)
FORMAT PARQUET;
This data can also be loaded from S3 by using a connection link.
LOAD DATA LINK <link_name> '<bucket name>/<path>'
INTO TABLE <table_name>
(`<col_a>` <- %,
 `<col_b>` <- % DEFAULT NULL)
FORMAT PARQUET;
For local file:
LOAD DATA INFILE '<path_to_file/file_name>'
INTO TABLE <table_name>
(val1 <- source1,
 val2 <- source2
 [ ... ])
[COMPRESSION { AUTO | NONE | LZ4 | GZIP }]
[ ... ]
FORMAT PARQUET;
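As a rough illustration of the local-file syntax with the placeholders filled in, the sketch below loads three fields from a Parquet file into a table. The table name `orders`, its columns, the Parquet field names, and the path `/tmp/orders.parquet` are all hypothetical and are not taken from this page; adjust them to match your own schema and file.

```sql
-- Hypothetical target table; names and types are assumptions for illustration.
CREATE TABLE orders (
  order_id BIGINT,
  customer VARCHAR(64),
  amount   DOUBLE
);

-- Each mapping has the form <table column> <- <Parquet field>.
-- The DEFAULT NULL clause mirrors the one shown in the S3 syntax above.
LOAD DATA INFILE '/tmp/orders.parquet'
INTO TABLE orders
(order_id <- order_id,
 customer <- customer,
 amount   <- amount DEFAULT NULL)
FORMAT PARQUET;
```

The example assumes the Parquet field names happen to match the column names, which keeps the mapping easy to read; the two sides of each `<-` do not need to match.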
Disk Usage for Parquet Staging
Note
This staging behavior applies only to Parquet ingestion.
When loading Parquet files from sources that do not support positional reads, such as S3, GCS, and Azure sources used in pipelines, SingleStore may create a temporary local copy of each file before parsing it.
Staging Location
| Ingest Path | Staging Node |
|---|---|
| LOAD DATA and aggregator pipelines | Aggregator |
| Regular pipelines | Leaf nodes (one staging file per partition processing a Parquet file in the current batch) |
For regular pipelines, the aggregator retrieves metadata and assigns source partitions to SingleStore partitions, while leaf nodes download and parse the file. For LOAD DATA and aggregator pipelines, parsing occurs on the aggregator, which forwards parsed rows to leaf nodes in an internal format.
Temporary files are written to the ingest_<data-dir>/ingest_.
Staging files are removed after processing is complete.
Sources that support positional reads natively, such as the local filesystem and seekable HDFS paths, are read directly without staging.
Some LOAD DATA S3 operations can use range reads and avoid creating a full local copy.
Disk Space Requirements
Ensure that each ingest node has enough free disk space to hold the largest Parquet file, or the largest set of Parquet files, that it processes concurrently.
- For LOAD DATA and aggregator pipelines, plan capacity on the aggregator.
- For regular pipelines, plan capacity per leaf node.
A leaf can host multiple partitions, and each partition can stage its own Parquet file within a batch. Available disk space on the leaf must be sufficient for the combined size of all Parquet files being processed in parallel.
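The sizing rule above can be worked through with simple arithmetic. The query below is only a sketch with made-up figures: it assumes a leaf hosting 4 partitions and a largest Parquet file of 2.5 GB, neither of which comes from this page. In the worst case every partition stages a file of that size in the same batch, so the leaf needs about 4 × 2.5 = 10 GB of free space for staging.

```sql
-- Hypothetical figures for illustration only; substitute your own values.
SELECT
  4       AS partitions_per_leaf,
  2.5     AS largest_parquet_file_gb,
  4 * 2.5 AS worst_case_staging_gb_per_leaf;
```

This estimate covers staging files only; any other disk usage on the node has to be budgeted separately.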