Scanning and Loading Files in AWS S3


S3 Pipelines discover new files by periodically rescanning the S3 bucket or S3 directory associated with the pipeline. The order in which the files are loaded from S3 is not defined. To view the state of files being loaded, refer to the information_schema.PIPELINES_FILES and information_schema.PIPELINES_OFFSETS views.
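As a quick check, these views can be queried directly. A minimal sketch (the pipeline name is a placeholder; consult the information schema reference for the full column list):

```sql
-- Show the load state of each file the pipeline has discovered.
SELECT file_name, file_state
FROM information_schema.PIPELINES_FILES
WHERE pipeline_name = 'my_pipeline';
```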

S3 Pipelines can be used to scan and load data from an entire bucket, or a specified directory in the bucket. In the rest of this section, references to loading a directory should be interpreted to mean either loading a directory or loading an entire bucket.

All file names are stored in memory in metadata tables, such as information_schema.PIPELINES_FILES and information_schema.PIPELINES_OFFSETS. When a directory contains a large number of files, these tables grow and ingestion slows. For such directories, SingleStore recommends the following:

  • Use larger files when possible; many small files slow ingestion.

  • Enable the OFFSETS METADATA GC clause when creating a pipeline. This setting ensures that old filenames are removed from the metadata table.
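The second recommendation might look like the following sketch; the table name, bucket, credentials, and exact clause placement are illustrative and should be checked against the CREATE PIPELINE reference:

```sql
CREATE PIPELINE books_pipe AS
  LOAD DATA S3 's3://my-bucket/some_dir'
  CONFIG '{"region": "us-east-1"}'
  CREDENTIALS '{"aws_access_key_id": "...", "aws_secret_access_key": "..."}'
  ENABLE OFFSETS METADATA GC
  INTO TABLE books
  FIELDS TERMINATED BY ',';
```

With the clause enabled, the names of already-loaded files are eventually garbage-collected from the metadata tables instead of accumulating in memory.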

Note

Skipped files can be retried by using the ALTER PIPELINE ... DROP FILE command. If a file fails with an error, fix the file and retry it without re-running the entire pipeline.
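For example (the pipeline and file names are placeholders; check the information schema reference for the exact file_state values):

```sql
-- List files the pipeline skipped due to errors.
SELECT file_name, file_state
FROM information_schema.PIPELINES_FILES
WHERE pipeline_name = 'my_pipeline' AND file_state = 'Skipped';

-- After fixing the file in S3, drop it from the pipeline's metadata so
-- the next scan picks it up and loads it again.
ALTER PIPELINE my_pipeline DROP FILE 'skipped_file.csv';
```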

An S3 pipeline performs partial directory scanning in its steady state. When a pipeline is first started, scanning is more aggressive; it slows after the first full scan of the S3 directory completes. The rate of scanning is controlled by the batch_interval engine variable, which is the interval between checks for new data. In steady state, a pipeline issues 2 or 3 ListObjects requests per batch_interval and, by default, scans 1000 files per batch_interval.

Of the 2 or 3 ListObjects requests, one looks for files at the end of the directory. The others gradually scan the entire directory, making 1-2 requests for every 1000 files. This gradual scan, spread across several pipeline batches, introduces latency between a file's upload and its discovery.

Note

Maximum Latency can be calculated by using this formula:

Maximum latency = (number of files to scan in the directory) × (batch_interval) / (number of files scanned per batch interval, ≈ 1000)

The batch_interval engine variable is set to 2.5 seconds by default.
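As a worked example of the formula, the following helper (illustrative only, not part of the product) estimates the worst-case discovery latency for a newly added file:

```python
# Worst-case discovery latency for a new S3 file, per the formula above:
# the entire directory may need to be rescanned before the file is found.
def max_latency_seconds(files_in_directory, batch_interval=2.5,
                        files_scanned_per_batch=1000):
    return files_in_directory * batch_interval / files_scanned_per_batch

# A directory of 1,000,000 files at the default 2.5 s batch interval:
print(max_latency_seconds(1_000_000))  # 2500.0 seconds (about 42 minutes)
```

Halving batch_interval halves the worst-case latency, at the cost of more frequent ListObjects requests.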

To reduce latency:

  1. Reduce the batch_interval.

  2. Add new files in alphabetical order. If new files always sort last in the directory alphabetically, they are picked up in the first batch after they are added. Files not added in this order are still ingested, with the usual latency.
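One common way to satisfy the second point is to prefix object keys with a zero-padded UTC timestamp, so later uploads always sort last lexicographically. A hypothetical naming helper (the key format is an assumption, not a product requirement):

```python
from datetime import datetime, timezone

# Hypothetical helper: a zero-padded UTC timestamp prefix keeps new object
# keys last in lexicographic (alphabetical) order, so the pipeline's
# end-of-directory ListObjects request finds them in the next batch.
def make_key(prefix, name, now=None):
    now = now or datetime.now(timezone.utc)
    return f"{prefix}/{now.strftime('%Y%m%d%H%M%S')}_{name}"

k1 = make_key("some_dir", "a.csv", datetime(2025, 7, 16, 12, 0, 0))
k2 = make_key("some_dir", "b.csv", datetime(2025, 7, 16, 12, 0, 1))
print(k1 < k2)  # True: later uploads sort after earlier ones
```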

If the pipeline has not discovered enough unprocessed files to load, it performs a partial scan of the directory in every batch using ListObjects calls.

Pipelines filter by prefix: a pipeline with an address such as s3://bucket/some_dir scans only the contents of some_dir. However, the AWS API cannot filter by patterns or regular expressions. For example, a pipeline with an address such as s3://bucket/some_dir/*_suffix scans the entire contents of some_dir, which is suboptimal when some_dir contains many files but only a few have the required suffix.
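As an illustration (bucket, table, and file names are placeholders), the following pipeline still lists all of some_dir during scanning, even though only the matching files are loaded:

```sql
-- The *_suffix pattern is applied by the pipeline, not by S3; every
-- ListObjects request targets the whole "some_dir/" prefix.
CREATE PIPELINE suffix_pipe AS
  LOAD DATA S3 's3://bucket/some_dir/*_suffix'
  CONFIG '{"region": "us-east-1"}'
  CREDENTIALS '{"aws_access_key_id": "...", "aws_secret_access_key": "..."}'
  INTO TABLE t
  FIELDS TERMINATED BY ',';
```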

Last modified: July 16, 2025
