Important

The SingleStore 9.1 release candidate (RC) gives you the opportunity to preview, evaluate, and provide feedback on new and upcoming features prior to their general availability. In the interim, SingleStore 9.0 is recommended for production workloads, which can later be upgraded to SingleStore 9.1.

Load Parquet Data using LOAD DATA

The LOAD DATA command supports loading Parquet files from AWS S3 or from local files. You can also use the LOAD DATA clause in a CREATE PIPELINE ... FORMAT statement to create a pipeline that loads Parquet files.

Syntax for LOAD DATA AWS S3 or Local File Source

Parquet-formatted data stored in an AWS S3 bucket or the local filesystem can be loaded via a LOAD DATA query without a pipeline. This streamlines the process of loading cloud-stored data into tables. Other LOAD DATA clauses (SET, WHERE, etc.) are supported (but not shown) in the following syntax examples.

For S3:

LOAD DATA S3 '<bucket name>'
CONFIG '{"region" : "<region_name>"}'
CREDENTIALS '{"aws_access_key_id" : "<key_id>",
              "aws_secret_access_key": "<access_key>"}'
INTO TABLE <table_name>
(`<col_a>` <- %,
 `<col_b>` <- % DEFAULT NULL
) FORMAT PARQUET;

This data can also be loaded from S3 by using a connection link. Refer to CREATE LINK for more information on connection links.

LOAD DATA LINK <link_name> '<bucket name>/<path>'
INTO TABLE <table_name>
(`<col_a>` <- %,
 `<col_b>` <- % DEFAULT NULL
) FORMAT PARQUET;

For a local file:

LOAD DATA INFILE '<path_to_file/file_name>'
INTO TABLE <table_name>
(val1 <- source1,
val2 <- source2
[ ... ]
) [COMPRESSION { AUTO | NONE | LZ4 | GZIP }]
[ ... ]
FORMAT PARQUET;

Disk Usage for Parquet Staging

Note

This staging behavior applies only to Parquet ingestion. CSV, JSON, and Avro files are parsed sequentially and do not use the ingest_staging directory. Because Iceberg data files are stored as Parquet, Iceberg ingestion follows the same staging behavior described in this section.

When loading Parquet files from sources that do not support positional reads, such as S3, GCS, and Azure sources used in pipelines, SingleStore may create a temporary local copy of each file before parsing it. Parquet readers must seek within a file to read metadata, footers, and column chunks, while most object storage APIs provide only sequential downloads. To support these random-access operations, the engine downloads the entire file to a staging location and reads it from a local disk.
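To illustrate why a Parquet reader needs random access, the following minimal sketch locates the footer of an unencrypted Parquet file. It relies only on the public Parquet file layout, in which the file ends with the footer metadata, a 4-byte little-endian footer length, and the magic bytes PAR1. The function name read_footer_location is hypothetical, and this is not SingleStore's reader; it only shows that the first step of reading a Parquet file is a seek to the end of the stream.

```python
import io
import struct

PARQUET_MAGIC = b"PAR1"  # trailer magic of an unencrypted Parquet file


def read_footer_location(stream):
    """Return (footer_offset, footer_length) for a seekable Parquet stream.

    Hypothetical helper for illustration only.
    """
    # A reader must start at the END of the file, which a purely
    # sequential download cannot do without fetching everything first.
    stream.seek(0, io.SEEK_END)
    file_size = stream.tell()
    if file_size < 12:  # leading magic + footer length + trailing magic
        raise ValueError("too small to be a Parquet file")

    # Last 8 bytes: 4-byte little-endian footer length, then the magic.
    stream.seek(file_size - 8)
    tail = stream.read(8)
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError("missing PAR1 trailer")
    footer_length = struct.unpack("<I", tail[:4])[0]

    footer_offset = file_size - 8 - footer_length
    if footer_offset < 4:
        raise ValueError("footer length exceeds file size")
    return footer_offset, footer_length


# Synthetic stand-in for a Parquet file: magic, 100 bytes of "column data",
# a 20-byte "footer", the footer length, and the trailing magic.
fake_file = (
    PARQUET_MAGIC + b"x" * 100 + b"F" * 20 + struct.pack("<I", 20) + PARQUET_MAGIC
)
footer_offset, footer_length = read_footer_location(io.BytesIO(fake_file))
```

After the footer is parsed, a reader seeks again to each column chunk it needs, which is why a source without positional reads has to be staged locally first.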

Staging Location

Ingest Path | Staging Node
LOAD DATA and aggregator pipelines | Aggregator
Regular pipelines (CREATE PIPELINE) | Leaf nodes (one staging file per partition processing a Parquet file in the current batch)

For regular pipelines, the aggregator retrieves metadata and assigns source partitions to SingleStore partitions, while leaf nodes download and parse the file. For LOAD DATA and aggregator pipelines, parsing occurs on the aggregator, which forwards parsed rows to leaf nodes in an internal format. Raw Parquet files are not transferred to leaf nodes.

Temporary files are written to the ingest_staging directory under the node's data directory: <data-dir>/ingest_staging/.

Staging files are removed after processing is complete. The directory is also cleared during node startup to remove files left behind by crashes or failed cleanup operations.
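As a rough monitoring aid, the free space available to the staging directory can be checked from the host. The sketch below is an assumption-laden example rather than a SingleStore tool: the function name staging_free_bytes is hypothetical, and the data directory path must be supplied by the operator for each node.

```python
import os
import shutil


def staging_free_bytes(data_dir):
    """Free bytes on the filesystem that holds <data_dir>/ingest_staging.

    Hypothetical helper; data_dir is the node's data directory.
    """
    staging = os.path.join(data_dir, "ingest_staging")
    # Fall back to the data directory itself if the staging directory
    # is not present on this node.
    target = staging if os.path.isdir(staging) else data_dir
    return shutil.disk_usage(target).free
```

Run it on the node that performs staging for the ingest path in question, passing that node's data directory.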

Sources that support positional reads natively, such as the local filesystem and seekable HDFS paths, are read directly without staging.

Some LOAD DATA S3 operations can use range reads and avoid creating a full local copy. However, S3 pipelines and other non-seekable extractors always use staging.

Disk Space Requirements

Ensure that each ingest node has sufficient free disk space to accommodate the largest Parquet file, or the set of Parquet files processed concurrently.

  • For LOAD DATA and aggregator pipelines, plan capacity on the aggregator.

  • For regular pipelines, plan capacity per leaf node. A leaf can host multiple partitions, and each partition can stage its own Parquet file within a batch. Available disk space on the leaf must be sufficient for the combined size of all Parquet files being processed in parallel.
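The sizing guidance above can be sketched as a small worst-case estimate. This is a simplified model under stated assumptions, not an official formula: it assumes that at most concurrent_files files are staged on a node at once (for a leaf, the number of partitions it hosts) and that the largest files could land on the same node. The function name and the example sizes are hypothetical.

```python
def worst_case_staging_bytes(file_sizes, concurrent_files):
    """Upper bound on staging space one node might need at a time.

    file_sizes: sizes in bytes of the Parquet files to be ingested.
    concurrent_files: how many files the node may stage simultaneously
    (assumption: 1 for a single LOAD DATA file on the aggregator, or the
    number of partitions hosted on a leaf for a regular pipeline).
    """
    if concurrent_files < 1:
        raise ValueError("concurrent_files must be at least 1")
    # Assume the largest files are the ones staged together.
    largest_first = sorted(file_sizes, reverse=True)
    return sum(largest_first[:concurrent_files])


# Hypothetical example: four files, a leaf hosting two partitions.
sizes = [100, 400, 250, 50]
leaf_need = worst_case_staging_bytes(sizes, concurrent_files=2)
aggregator_need = worst_case_staging_bytes(sizes, concurrent_files=1)
```

Compare the result against the free space on the relevant node, and leave headroom for other disk activity on that node.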


