Important
The SingleStore 9.1 release candidate (RC) gives you the opportunity to preview, evaluate, and provide feedback on new and upcoming features prior to their general availability. In the interim, SingleStore 9.0 is recommended for production workloads, which can later be upgraded to SingleStore 9.1.
Scanning and Loading Files in AWS S3
S3 Pipelines discover new files by periodically rescanning the S3 bucket or S3 directory associated with the pipeline.
S3 Pipelines can be used to scan and load data from an entire bucket, or a specified directory in the bucket.
All file names are stored in memory in metadata tables, such as information_ and information_.
- Use larger files when possible; smaller files reduce ingestion speed.
- Enable the OFFSETS METADATA GC clause when creating a pipeline. This setting ensures that old filenames are removed from the metadata table.
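As a minimal sketch of the second recommendation, the statement below creates an S3 pipeline with offsets metadata garbage collection enabled. The pipeline name `s3_gc_pipeline`, the bucket path, the region, and the table name are placeholders, and the position of the clause (after CREDENTIALS, before INTO TABLE) is an assumption based on the general CREATE PIPELINE syntax; verify it against the CREATE PIPELINE reference for your version.

```sql
-- Hypothetical pipeline; names, path, and region are placeholders.
CREATE PIPELINE s3_gc_pipeline AS
LOAD DATA S3 'bucket-name/path'
CONFIG '{"region": "us-east-1"}'
CREDENTIALS '{
"aws_access_key_id": "...",
"aws_secret_access_key": "..."
}'
-- Remove old filenames from the pipeline metadata table.
ENABLE OFFSETS METADATA GC
INTO TABLE my_table
FORMAT JSON;
```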
Note
Skipped files can be retried by pipelines by using the ALTER PIPELINE DROP FILE command.
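The retry described in this note might look like the sketch below. The pipeline name and file name are hypothetical, and the use of the `information_schema.PIPELINES_FILES` view with a `FILE_STATE` value of `'Skipped'` is an assumption about where skipped files are reported; check the view definition in your deployment before relying on it.

```sql
-- Assumption: skipped files are listed in PIPELINES_FILES with FILE_STATE = 'Skipped'.
SELECT FILE_NAME
FROM information_schema.PIPELINES_FILES
WHERE PIPELINE_NAME = 'my_s3_pipeline'
  AND FILE_STATE = 'Skipped';

-- Drop the file's metadata so the pipeline can pick the file up again.
-- 'path/to/skipped_file.json' is a placeholder for a name returned above.
ALTER PIPELINE my_s3_pipeline DROP FILE 'path/to/skipped_file.json';
```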
An S3 pipeline performs partial directory scanning in its steady state. Scanning is paced by the batch_interval engine variable, which is the interval between checks for new data. The pipeline makes 2 or 3 ListObjects requests per batch_interval in its steady state.
Of these 2 or 3 ListObjects requests, one looks for files at the end of the directory. The remaining ListObjects requests gradually scan the entire directory, making 1-2 requests for every 1000 files.
Note
Maximum Latency can be calculated by using this formula:
Maximum Latency = (Number of files to scan in the directory) * (batch_
The batch_ engine variable is set to 2.
To reduce latency:
- Reduce the batch_interval.
- Add new files in alphabetical order. If new files are always last in a directory, in alphabetical order, they will be picked up in the first batch after they are added. Files that are not added in this order are still ingested, with the usual latency.
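One way to lower the batch interval for a single pipeline is sketched below. It assumes that `ALTER PIPELINE ... SET BATCH_INTERVAL` is available in your version and that the value is given in milliseconds; the pipeline name and the value 1000 are illustrative only.

```sql
-- Hypothetical pipeline name; the interval is assumed to be in milliseconds.
ALTER PIPELINE my_s3_pipeline SET BATCH_INTERVAL 1000;
```

A shorter interval means more frequent checks for new data, so the pipeline also issues its ListObjects requests more often. Weigh the latency gain against the extra S3 request volume.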
If the pipeline has not discovered enough unprocessed files to load, it does a partial scan of the directory in every batch using ListObjects calls.
Pipelines use prefixes; hence, a pipeline with an address like s3://bucket/some_ will scan the contents of some_. A pipeline that selects files by suffix under s3://bucket/some_ will still scan the entire contents of some_, leading to suboptimal performance if there are a lot of files in some_ but only a few of them have the required suffix.
File Discovery with Kinesis Event Notifications
For S3 buckets that contain a large volume of files, SingleStore supports an optional Kinesis-based file discovery mechanism that reduces file discovery latency from minutes to approximately 1–2 seconds.
How It Works
Instead of periodically scanning the S3 bucket with ListObjectsV2 requests, the pipeline receives real-time notifications when new files are created in S3:
- S3 publishes object creation events to AWS EventBridge.
- EventBridge routes the events to an Amazon Kinesis Data Stream.
- The S3 pipeline consumes events from the Kinesis stream.
- The pipeline discovers and ingests new files with approximately 1.2–1.5 seconds of latency.
Prerequisites
Before enabling Kinesis-based file discovery, configure the following AWS resources:
- Kinesis Data Stream: Receives S3 event notifications.
- EventBridge Rule: Routes S3 object creation events to the Kinesis stream.
- IAM Permissions: Grants the pipeline permission to read from the Kinesis stream.
Refer to Configure Kinesis Event Notifications for S3 Pipelines for detailed setup instructions.
Configuration
To enable Kinesis-based file discovery, add the file_notifications_kinesis_stream_arn parameter to the pipeline CONFIG JSON:
CREATE PIPELINE s3_kinesis_pipeline AS
LOAD DATA S3 'bucket-name/path'
CONFIG '{
"region": "us-east-1",
"file_notifications_kinesis_stream_arn": "arn:aws:kinesis:us-east-1:123456789:stream/s3-events"
}'
CREDENTIALS '{
"aws_access_key_id": "...",
"aws_secret_access_key": "..."
}'
INTO TABLE my_table
FORMAT JSON;
The Kinesis integration uses the same AWS credential chain as the S3 extractor and supports the following authentication methods:
- Static AWS credentials (aws_access_key_id and aws_secret_access_key)
- Amazon EKS IRSA with cross-account role assumption