Important

The SingleStore 9.1 release candidate (RC) gives you the opportunity to preview, evaluate, and provide feedback on new and upcoming features prior to their general availability. In the interim, SingleStore 9.0 is recommended for production workloads, which can later be upgraded to SingleStore 9.1.

Scanning and Loading Files in AWS S3

S3 Pipelines discover new files by periodically rescanning the S3 bucket or S3 directory associated with the pipeline. The order in which the files are loaded from S3 is not defined. To view the state of files being loaded, refer to the information_schema.PIPELINES_FILES and information_schema.PIPELINES_OFFSETS views.
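As a rough illustration of inspecting file state, the queries below read from information_schema.PIPELINES_FILES. The pipeline name s3_pipeline is hypothetical, and the column names (PIPELINE_NAME, FILE_NAME, FILE_SIZE, FILE_STATE) are assumptions about the view's layout; confirm them against your deployment before relying on them.

```sql
-- List each file the (hypothetical) pipeline s3_pipeline knows about.
-- Column names are assumed; verify them with:
--   DESCRIBE information_schema.PIPELINES_FILES;
SELECT FILE_NAME, FILE_SIZE, FILE_STATE
FROM information_schema.PIPELINES_FILES
WHERE PIPELINE_NAME = 's3_pipeline'
ORDER BY FILE_NAME;

-- Summarize how many files are in each state.
SELECT FILE_STATE, COUNT(*) AS file_count
FROM information_schema.PIPELINES_FILES
WHERE PIPELINE_NAME = 's3_pipeline'
GROUP BY FILE_STATE;
```

The second query is usually the quicker way to check on a directory with many files, because it returns one row per state instead of one row per file.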

S3 Pipelines can be used to scan and load data from an entire bucket, or a specified directory in the bucket. In the rest of this section, references to loading a directory should be interpreted to mean either loading a directory or loading an entire bucket.

All file names are stored in memory in metadata tables, such as information_schema.PIPELINES_FILES and information_schema.PIPELINES_OFFSETS. A large number of files therefore slows ingestion. For directories with a large number of files, SingleStore recommends the following:

Use larger files when possible; many small files reduce ingestion speed.
Enable the OFFSETS METADATA GC clause when creating a pipeline. This setting ensures that old filenames are removed from the metadata table.
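A minimal sketch of the second recommendation follows. The pipeline name s3_gc_pipeline, the bucket path, and the table name are placeholders, and the position of the ENABLE OFFSETS METADATA GC clause (after CREDENTIALS, before INTO TABLE) is an assumption; check the CREATE PIPELINE syntax reference for the exact clause order.

```sql
-- Hypothetical pipeline that garbage-collects old file names
-- from the pipeline metadata tables.
CREATE PIPELINE s3_gc_pipeline AS
LOAD DATA S3 'bucket-name/path'
CONFIG '{"region": "us-east-1"}'
CREDENTIALS '{
 "aws_access_key_id": "...",
 "aws_secret_access_key": "..."
}'
ENABLE OFFSETS METADATA GC  -- assumed clause position
INTO TABLE my_table
FORMAT JSON;
```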

Note

To retry a skipped file, use the ALTER PIPELINE DROP FILE command. If a file has an error, you can fix the file and retry it without retrying the entire pipeline.
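A sketch of that retry workflow is shown below. The pipeline name and file name are hypothetical, and the 'Skipped' state value and the column names are assumptions about information_schema.PIPELINES_FILES.

```sql
-- Find files that the (hypothetical) pipeline s3_pipeline skipped.
-- The FILE_STATE value 'Skipped' is an assumption.
SELECT FILE_NAME
FROM information_schema.PIPELINES_FILES
WHERE PIPELINE_NAME = 's3_pipeline'
  AND FILE_STATE = 'Skipped';

-- After fixing the file in S3, drop it from the pipeline metadata
-- so that it can be loaded again. The file name is a placeholder
-- and should match the name reported by the query above.
ALTER PIPELINE s3_pipeline DROP FILE 'path/fixed_file.json';
```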

An S3 pipeline performs partial directory scanning in its steady state. When a pipeline is started, it scans more aggressively, then slows down after the first scan of the S3 directory is complete. The rate of pipeline scanning is controlled by the batch_interval engine variable, which is the interval between checks for new data. In steady state, a pipeline issues 2 or 3 ListObjects requests per batch_interval. By default, 1000 files are scanned per batch_interval in the pipeline’s steady state.

Out of the 2 or 3 ListObjects requests, one of the requests will look for files at the end of the directory. The other ListObjects requests gradually scan the entire directory, making 1-2 requests for every 1000 files. This gradual scanning, spanning across several pipeline batches, causes latency.

Note

Maximum Latency can be calculated by using this formula:

Maximum Latency = (Number of files to scan in the directory) * (batch_interval)/(Number of files scanned per batch interval ≈ 1000).

The batch_interval engine variable is set to 2.5 seconds by default.
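As a worked example of the formula, assume a directory with 1,000,000 files (an illustrative number, not a measurement), the default batch_interval of 2.5 seconds, and roughly 1000 files scanned per batch interval. The arithmetic can be checked with a plain SELECT:

```sql
-- Maximum Latency = files * batch_interval / files scanned per interval
-- 1,000,000 files is an assumed directory size for illustration.
SELECT (1000000 * 2.5) / 1000 AS max_latency_seconds;
```

This works out to about 2,500 seconds, or roughly 42 minutes, before a file added in the middle of the directory is guaranteed to be discovered.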

To reduce latency:

Reduce the batch_interval.
Add new files in alphabetical order. If new files are always last in the directory in alphabetical order, they are picked up in the first batch after they are added. Files that are not added in this order are ingested normally, with the usual latency.
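The first option might look like the following sketch. It assumes the interval can be set per pipeline with a BATCH_INTERVAL clause specified in milliseconds; both the clause and the unit are assumptions to verify against the ALTER PIPELINE reference, and s3_pipeline is a hypothetical name.

```sql
-- Lower the batch interval from the 2.5 second default to 1 second.
-- Assumes BATCH_INTERVAL is given in milliseconds.
ALTER PIPELINE s3_pipeline SET BATCH_INTERVAL 1000;
```

A shorter interval means more frequent ListObjects requests against S3, so the latency gain comes with a higher request volume.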

If the pipeline has not discovered enough unprocessed files to load, it does a partial scan of the directory in every batch using ListObjects calls.

Pipelines use prefixes, so a pipeline with an address like s3://bucket/some_dir scans the contents of some_dir. However, the AWS call cannot filter with patterns or regexes. For example, a pipeline with an address like s3://bucket/some_dir/*_suffix scans the entire contents of some_dir, which leads to suboptimal performance if some_dir contains many files but only a few of them have the required suffix.

File Discovery with Kinesis Event Notifications

For S3 buckets that contain a large volume of files, SingleStore supports an optional Kinesis-based file discovery mechanism that reduces file discovery latency from minutes to approximately 1–2 seconds.

How It Works

Instead of periodically scanning the S3 bucket with ListObjectsV2 requests, the pipeline receives real-time notifications when new files are created in S3:

S3 publishes object creation events to AWS EventBridge.
EventBridge routes the events to an Amazon Kinesis Data Stream.
The S3 pipeline consumes events from the Kinesis stream.
The pipeline discovers and ingests new files with approximately 1.2–1.5 seconds of latency.

Prerequisites

Before enabling Kinesis-based file discovery, configure the following AWS resources:

Kinesis Data Stream: Receives S3 event notifications.
EventBridge Rule: Routes S3 object creation events to the Kinesis stream.
IAM Permissions: Grants the pipeline permission to read from the Kinesis stream.

Refer to Configure Kinesis Event Notifications for S3 Pipelines for detailed setup instructions.

Configuration

To enable Kinesis-based file discovery, add the file_notifications_kinesis_stream_arn parameter to the pipeline CONFIG JSON:

CREATE PIPELINE s3_kinesis_pipeline AS
LOAD DATA S3 'bucket-name/path'
CONFIG '{
 "region": "us-east-1",
 "file_notifications_kinesis_stream_arn": "arn:aws:kinesis:us-east-1:123456789:stream/s3-events"
}'
CREDENTIALS '{
 "aws_access_key_id": "...",
 "aws_secret_access_key": "..."
}'
INTO TABLE my_table
FORMAT JSON;

The Kinesis integration uses the same AWS credential chain as the S3 extractor and supports the following authentication methods:

Static AWS credentials (aws_access_key_id and aws_secret_access_key)
Amazon EKS IRSA (IAM Roles for Service Accounts)
Amazon EKS IRSA with cross-account role assumption
