Warning
SingleStore 9.0 gives you the opportunity to preview, evaluate, and provide feedback on new and upcoming features prior to their general availability. In the interim, SingleStore 8.9 is recommended for production workloads, which can later be upgraded to SingleStore 9.0.
Scanning and Loading Files in AWS S3
S3 Pipelines discover new files by periodically rescanning the S3 bucket or S3 directory associated with the pipeline. Discovered and processed files can be inspected through the pipeline-related information_schema views.
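As a sketch, assuming a pipeline named my_pipeline and that the information_schema.PIPELINES_FILES view is available (the view and column names here are assumptions and may differ by version), the files a pipeline has discovered can be inspected with a query like:

```sql
-- List the files the pipeline has discovered and their current state.
-- PIPELINES_FILES and its column names are assumptions; check the
-- information_schema of your SingleStore version for the exact names.
SELECT FILE_NAME, FILE_STATE
FROM information_schema.PIPELINES_FILES
WHERE PIPELINE_NAME = 'my_pipeline';
```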
S3 Pipelines can be used to scan and load data from an entire bucket, or a specified directory in the bucket.
All file names are stored in memory in metadata tables in the information_schema database. To keep these tables small and ingestion fast:

- Use larger files when possible; smaller files hurt the ingestion speed.
- Enable the OFFSETS METADATA GC clause when creating a pipeline. This setting ensures that old file names are removed from the metadata table.
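As a hedged sketch, a pipeline with offsets metadata GC enabled might be created as follows. The bucket, region, credentials, and table names are placeholders, and the exact placement of the clause may vary slightly by version:

```sql
-- Placeholder bucket, credentials, and table; adjust for your environment.
CREATE PIPELINE my_pipeline AS
LOAD DATA S3 's3://my-bucket/data/'
CONFIG '{"region": "us-east-1"}'
CREDENTIALS '{"aws_access_key_id": "<key>", "aws_secret_access_key": "<secret>"}'
ENABLE OFFSETS METADATA GC   -- garbage-collect metadata for old file names
INTO TABLE my_table
FIELDS TERMINATED BY ',';
```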
Note
Skipped files can be retried by using the ALTER PIPELINE DROP FILE command.
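For example, assuming a pipeline named my_pipeline and a skipped file at load/2025/07/part-0001.csv (both placeholders), dropping the file's metadata causes the pipeline to rediscover and reload it on a later scan:

```sql
-- Forget this file's metadata so the pipeline retries it.
ALTER PIPELINE my_pipeline DROP FILE 'load/2025/07/part-0001.csv';
```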
An S3 pipeline performs partial directory scanning in its steady state. Scanning frequency is controlled by the batch_interval engine variable, which is the interval between checks for new data. In its steady state, the pipeline makes 2-3 ListObjects requests per batch_interval. Out of the 2 or 3 ListObjects requests, one request looks for files at the end of the directory. The remaining ListObjects requests gradually scan the entire directory, making 1-2 requests for every 1000 files.
Note
Maximum latency can be calculated by using this formula:

Maximum Latency (ms) = (Number of files to scan in the directory) * (batch_interval / 1000)
By default, the batch_interval engine variable is set to 2500 milliseconds.
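Applying the formula to a hypothetical directory of 100,000 files with a 2500 ms batch_interval (both figures are illustrative) gives:

```sql
-- Maximum Latency (ms) = (files to scan) * (batch_interval / 1000)
-- 100000 * (2500 / 1000) = 250000 ms, i.e. a bit over 4 minutes.
SELECT 100000 * (2500 / 1000) AS max_latency_ms;
```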
To reduce latency:

- Reduce the batch_interval.
- Add new files in alphabetical order. If new files are always last in the directory in alphabetical order, they will be picked up in the first batch after they are added. Files that are not added in this order will still be ingested, with the usual latency.
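For instance, the interval can be lowered on an existing pipeline; the pipeline name and the 1000 ms value below are illustrative:

```sql
-- Check for new files every second instead of the default interval.
ALTER PIPELINE my_pipeline SET BATCH_INTERVAL 1000;
```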
If the pipeline has not discovered enough unprocessed files to load, it performs a partial scan of the directory in every batch using ListObjects calls.
Pipelines use prefixes; hence, a pipeline with an address like s3://bucket/some_dir/some_prefix will scan only the objects in some_dir whose names begin with some_prefix. An address like s3://bucket/some_dir/*.csv, however, will scan the entire contents of some_dir, leading to suboptimal performance if there are a lot of files in some_dir, but only a few of the files have the required suffix.
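As an illustration (all bucket, path, and table names are placeholders), a prefix-scoped address keeps the scan narrow, while a suffix wildcard forces a full-directory scan:

```sql
-- Scans only objects in some_dir whose names begin with 'daily_':
CREATE PIPELINE narrow_scan AS
LOAD DATA S3 's3://bucket/some_dir/daily_'
CONFIG '{"region": "us-east-1"}'
CREDENTIALS '{"aws_access_key_id": "<key>", "aws_secret_access_key": "<secret>"}'
INTO TABLE my_table;

-- Lists every object in some_dir and then filters by suffix; slower when
-- the directory is large but only a few files end in .csv:
CREATE PIPELINE wide_scan AS
LOAD DATA S3 's3://bucket/some_dir/*.csv'
CONFIG '{"region": "us-east-1"}'
CREDENTIALS '{"aws_access_key_id": "<key>", "aws_secret_access_key": "<secret>"}'
INTO TABLE my_table;
```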
Last modified: July 16, 2025