# Load Parquet Data using LOAD DATA

The `LOAD DATA` command supports loading Parquet files from AWS S3 or the local filesystem. You can also use the `LOAD DATA` clause in a [CREATE PIPELINE ... FORMAT](https://docs.singlestore.com/cloud/reference/sql-reference/pipelines-commands/create-pipeline.md) statement to create a pipeline that loads Parquet files.

## Syntax for LOAD DATA AWS S3 or Local File Source

Parquet-formatted data stored in an AWS S3 bucket or on the local filesystem can be loaded with a `LOAD DATA` query, without a pipeline. This streamlines loading cloud-stored data into tables. Other `LOAD DATA` clauses (`SET`, `WHERE`, etc.) are supported but are omitted from the following syntax examples.

For S3:

```sql
LOAD DATA S3 '<bucket name>'
CONFIG '{"region" : "<region_name>"}'
CREDENTIALS '{"aws_access_key_id" : "<key_id>",
              "aws_secret_access_key": "<access_key>"}'
INTO TABLE <table_name>
    (`<col_a>` <- %,
     `<col_b>` <- % DEFAULT NULL
    ) FORMAT PARQUET;
```

This data can also be loaded from S3 by using a connection link. Refer to [CREATE LINK](https://docs.singlestore.com/cloud/reference/sql-reference/security-management-commands/create-link.md) for more information on connection links.

```sql
LOAD DATA LINK <link_name> '<bucket name>/<path>'
INTO TABLE <table_name>
    (`<col_a>` <- %,
     `<col_b>` <- % DEFAULT NULL
    ) FORMAT PARQUET;
```

For a local file:

```sql
LOAD DATA INFILE '<path_to_file/file_name>'
INTO TABLE <table_name>
    (val1 <- source1, 
     val2 <- source2
     [ ... ]
) [COMPRESSION { AUTO | NONE | LZ4 | GZIP }]
[ ... ]
FORMAT PARQUET;
```

## Disk Usage for Parquet Staging

> **📝 Note**: This staging behavior applies only to Parquet ingestion. CSV, JSON, and Avro files are parsed sequentially and do not use the `ingest_staging` directory. Because Iceberg data files are stored as Parquet, Iceberg ingestion follows the same staging behavior described in this section.

When loading Parquet files from sources that do not support positional reads, such as S3, GCS, and Azure sources used in pipelines, SingleStore may create a temporary local copy of each file before parsing it. Parquet readers must seek within a file to read metadata, footers, and column chunks, while most object storage APIs provide only sequential downloads. To support these random-access operations, the engine downloads the entire file to a staging location and reads it from a local disk.

## Staging Location

| Ingest Path                           | Staging Node                                                                               |
| ------------------------------------- | ------------------------------------------------------------------------------------------ |
| `LOAD DATA` and aggregator pipelines  | Aggregator                                                                                 |
| Regular pipelines (`CREATE PIPELINE`) | Leaf nodes (one staging file per partition processing a Parquet file in the current batch) |

For regular pipelines, the aggregator retrieves metadata and assigns source partitions to SingleStore partitions, while leaf nodes download and parse the file. For `LOAD DATA` and aggregator pipelines, parsing occurs on the aggregator, which forwards parsed rows to leaf nodes in an internal format. Raw Parquet files are not transferred to leaf nodes.

Temporary files are written to the `ingest_staging` directory under the node's data directory: `<data-dir>/ingest_staging/`.

Staging files are removed after processing is complete. The directory is also cleared during node startup to remove files left behind by crashes or failed cleanup operations.
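To see how much space staging is consuming during a load, the files under the staging directory can be totaled. The following is a minimal sketch, not a SingleStore tool: it assumes the script runs on the node itself with read access to that node's data directory, and the function name `staging_bytes` and its `data_dir` parameter are hypothetical names chosen for illustration.

```python
import os


def staging_bytes(data_dir: str) -> int:
    """Return the total size, in bytes, of files under <data_dir>/ingest_staging/.

    `data_dir` is assumed to be the node's data directory (an assumption of
    this sketch; substitute the actual path for your deployment).
    """
    staging_dir = os.path.join(data_dir, "ingest_staging")
    total = 0
    # os.walk yields nothing if the directory does not exist.
    for root, _dirs, files in os.walk(staging_dir):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except FileNotFoundError:
                # Staging files are removed after processing, so a file may
                # disappear between listing and stat; skip it.
                continue
    return total
```

Calling `staging_bytes("<data-dir>")` periodically while a batch is running gives a rough picture of peak staging usage. A result of `0` after processing completes is consistent with the cleanup behavior described above.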

Sources that support positional reads natively, such as the local filesystem and seekable HDFS paths, are read directly without staging.

Some `LOAD DATA S3` operations can use range reads and avoid creating a full local copy. However, S3 pipelines and other non-seekable extractors always use staging.

## Disk Space Requirements

Ensure that each ingest node has enough free disk space to hold the largest Parquet file, or the largest set of Parquet files that is processed concurrently.

* For `LOAD DATA` and aggregator pipelines, plan capacity on the aggregator.
* For regular pipelines, plan capacity per leaf node. A leaf can host multiple partitions, and each partition can stage its own Parquet file within a batch. Available disk space on the leaf must be sufficient for the combined size of all Parquet files being processed in parallel.
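The capacity guidance above can be turned into a rough worst-case estimate. The sketch below is illustrative only: the function names, the `concurrent_files` parameter (for a leaf, the number of partitions that may each stage a file in the same batch), and the `safety_factor` margin are assumptions of this example, not SingleStore settings.

```python
def worst_case_staging_bytes(file_sizes, concurrent_files):
    """Upper bound on staging space: the sum of the N largest files,
    where N is the number of files that can be staged at the same time."""
    if concurrent_files < 0:
        raise ValueError("concurrent_files must be non-negative")
    largest = sorted(file_sizes, reverse=True)[:concurrent_files]
    return sum(largest)


def has_headroom(free_bytes, file_sizes, concurrent_files, safety_factor=1.2):
    """Return True if free_bytes covers the worst case plus a margin.

    safety_factor is an arbitrary margin chosen for this sketch.
    """
    needed = worst_case_staging_bytes(file_sizes, concurrent_files) * safety_factor
    return free_bytes >= needed
```

For example, with files of 400, 100, 300, and 200 bytes and two partitions on a leaf, the worst case is 700 bytes (the two largest files staged together). On a real node, the free space figure could come from `shutil.disk_usage(path).free` in the Python standard library. For `LOAD DATA` and aggregator pipelines, apply the same check to the aggregator instead of a leaf.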

***

Modified at: June 25, 2026

Source: [/cloud/load-data/load-data-from-files/load-data-from-parquet-files/load-parquet-data-using-load-data/](https://docs.singlestore.com/cloud/load-data/load-data-from-files/load-data-from-parquet-files/load-parquet-data-using-load-data/)

