# Scanning and Loading Files in AWS S3

S3 Pipelines discover new files by periodically rescanning the S3 bucket or S3 directory associated with the pipeline. The order in which the files are loaded from S3 is not defined. To view the state of files being loaded, refer to the `information_schema.PIPELINES_FILES` and `information_schema.PIPELINES_OFFSETS` views.

S3 Pipelines can be used to scan and load data from an entire bucket, or a specified directory in the bucket. In the rest of this section, references to loading a directory should be interpreted to mean either loading a directory or loading an entire bucket.

All file names are stored in memory in metadata tables, such as `information_schema.PIPELINES_FILES` and `information_schema.PIPELINES_OFFSETS`. A large number of files therefore slows down ingestion. For directories with a large number of files, SingleStore recommends the following:

* Use larger files when possible; smaller files hurt the ingestion speed.
* Enable the [OFFSETS METADATA GC](https://docs.singlestore.com/db/v9.1/reference/sql-reference/pipelines-commands/alter-pipeline/#section-idm4592227490996833069756379227.md) clause when creating a pipeline. This setting ensures that old filenames are removed from the metadata table.

> **📝 Note**: Pipelines can retry skipped files with the `ALTER PIPELINE DROP FILE` command. If a file has an error, fix the file and retry it without retrying the entire pipeline.

An S3 pipeline performs partial directory scanning in its steady state. When a pipeline is started, it scans more aggressively, then slows down after the first scan of the S3 directory is complete. The rate of pipeline scanning is controlled by the `batch_interval` engine variable, which is the interval between checks for new data. In steady state, a pipeline issues 2 or 3 `ListObjects` requests per `batch_interval`. By default, 1000 files are scanned per `batch_interval` in the pipeline’s steady state.
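
As a rough illustration of the steady-state request rate described above, the following sketch estimates how many `ListObjects` requests one pipeline issues per day. The function name and parameters are hypothetical, and the figures come only from the 2 or 3 requests per `batch_interval` and the 2.5-second default quoted in this section.

```python
SECONDS_PER_DAY = 86_400


def list_requests_per_day(batch_interval_s=2.5, requests_per_batch=(2, 3)):
    """Estimate the (low, high) number of ListObjects requests per day.

    Assumes the pipeline stays in steady state for the whole day and
    issues between 2 and 3 requests in every batch interval.
    """
    batches_per_day = SECONDS_PER_DAY / batch_interval_s
    low, high = requests_per_batch
    return int(batches_per_day * low), int(batches_per_day * high)


low, high = list_requests_per_day()
print(f"between {low} and {high} ListObjects requests per day")
```

Lowering `batch_interval` raises this request rate proportionally, which is worth weighing against the latency benefit.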

Of these 2 or 3 `ListObjects` requests, one looks for files at the end of the directory. The others gradually scan the entire directory, making 1-2 requests for every 1000 files. Because this gradual scan spans several pipeline batches, it adds latency before newly added files are discovered.

> **📝 Note**: Maximum latency can be calculated with this formula: `Maximum Latency = (Number of files to scan in the directory) * (batch_interval) / (Number of files scanned per batch interval ≈ 1000)`. The `batch_interval` engine variable is set to 2.5 seconds by default.
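
To make the formula concrete, here is a minimal sketch that evaluates it for a few directory sizes. The function name is hypothetical, and the value of 1000 files per batch is the approximate figure quoted above, so treat the results as an upper-bound estimate rather than a guarantee.

```python
FILES_SCANNED_PER_BATCH = 1000   # approximate steady-state value from the note
DEFAULT_BATCH_INTERVAL_S = 2.5   # default batch_interval, in seconds


def max_discovery_latency_s(files_in_directory,
                            batch_interval_s=DEFAULT_BATCH_INTERVAL_S,
                            files_per_batch=FILES_SCANNED_PER_BATCH):
    """Apply the maximum-latency formula and return the result in seconds."""
    return files_in_directory * batch_interval_s / files_per_batch


for file_count in (10_000, 1_000_000):
    seconds = max_discovery_latency_s(file_count)
    print(f"{file_count} files: up to {seconds} s ({seconds / 60:.1f} min)")
```

With the defaults, a directory of 10,000 files gives a maximum latency of 25 seconds, while 1,000,000 files gives 2,500 seconds (about 42 minutes).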

To reduce latency:

1. Reduce the `batch_interval`.

2. Add new files in alphabetical order. If new files are always last in the directory in alphabetical order, they are picked up in the first batch after they are added. Files that are not added in this order are ingested normally, with the usual latency.
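
One way to keep new files last in alphabetical order is to start each file name with a fixed-width UTC timestamp. The sketch below is only an illustration: the helper name, the key layout, and the sample keys are assumptions, and it relies on plain string sorting matching the listing order for ASCII keys.

```python
from datetime import datetime, timezone


def sortable_key(prefix, name, when):
    """Build an object key whose fixed-width UTC timestamp sorts chronologically."""
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{prefix}/{stamp}_{name}"


# Hypothetical keys that already exist in the directory.
existing = [
    "some_dir/20260101T000000Z_a.csv",
    "some_dir/20260102T000000Z_b.csv",
]

new_key = sortable_key("some_dir", "c.csv", datetime(2026, 1, 3, tzinfo=timezone.utc))

# The new key sorts after every existing key, so it sits at the end of the directory.
assert sorted(existing + [new_key])[-1] == new_key
print(new_key)
```

Fixed-width, zero-padded timestamps matter here: a counter such as `file_9` followed by `file_10` would not sort in creation order.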

If the pipeline has not discovered enough unprocessed files to load, it does a partial scan of the directory in every batch using `ListObjects` calls.

Pipelines use prefixes, so a pipeline with an address like `s3://bucket/some_dir` scans the contents of `some_dir`. However, the AWS call cannot filter with patterns or regular expressions. For example, a pipeline with an address like `s3://bucket/some_dir/*_suffix` scans the entire contents of `some_dir`, which leads to suboptimal performance if `some_dir` contains many files but only a few of them have the required suffix.
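
The effect of prefix-only filtering can be seen in a toy model. The code below does not call AWS and does not represent how the pipeline is implemented; the key names and the `list_then_filter` helper are hypothetical. It only shows that a listing narrowed by a literal prefix still returns every key under that prefix, and the pattern can be applied only afterwards.

```python
from fnmatch import fnmatchcase

# Hypothetical object keys in a bucket.
keys = [
    "other_dir/x_suffix",
    "some_dir/a_suffix",
    "some_dir/b.csv",
    "some_dir/c.csv",
]


def list_then_filter(keys, address):
    """List by the literal prefix, then match the full pattern afterwards."""
    prefix = address.split("*", 1)[0]                        # literal part before the wildcard
    listed = [k for k in keys if k.startswith(prefix)]       # what a prefix listing returns
    matched = [k for k in listed if fnmatchcase(k, address)]  # pattern applied to the listing
    return listed, matched


listed, matched = list_then_filter(keys, "some_dir/*_suffix")
print(f"{len(listed)} keys listed, {len(matched)} matched")
```

In this model, moving the files with the required suffix under their own prefix (for example, a dedicated subdirectory) would shrink the listing itself rather than only the matched set.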

## File Discovery with Kinesis Event Notifications

For S3 buckets that contain a large volume of files, SingleStore supports an optional Kinesis-based file discovery mechanism that reduces file discovery latency from minutes to approximately 1–2 seconds.

### How It Works

Instead of periodically scanning the S3 bucket with `ListObjectsV2` requests, the pipeline receives real-time notifications when new files are created in S3:

1. S3 publishes object creation events to AWS EventBridge.

2. EventBridge routes the events to an Amazon Kinesis Data Stream.

3. The S3 pipeline consumes events from the Kinesis stream.

4. The pipeline discovers and ingests new files with approximately 1.2–1.5 seconds of latency.
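
For orientation, the sketch below extracts the bucket and object key from an S3 "Object Created" event in the shape that EventBridge delivers. The sample event is trimmed to the fields used here, the helper name is hypothetical, and how the pipeline itself processes these records is not described in this section.

```python
import json

# Sample event, trimmed to the fields this sketch reads.
sample_event = json.dumps({
    "detail-type": "Object Created",
    "source": "aws.s3",
    "detail": {
        "bucket": {"name": "bucket-name"},
        "object": {"key": "path/file-0001.json", "size": 1024},
    },
})


def new_object_from_event(raw):
    """Return (bucket, key) for an S3 'Object Created' event, else None."""
    event = json.loads(raw)
    if event.get("source") != "aws.s3" or event.get("detail-type") != "Object Created":
        return None
    detail = event["detail"]
    return detail["bucket"]["name"], detail["object"]["key"]


print(new_object_from_event(sample_event))
```

Events of other types, such as object deletions, return `None` in this sketch and would be ignored.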

### Prerequisites

Before enabling Kinesis-based file discovery, configure the following AWS resources:

1. Kinesis Data Stream: Receives S3 event notifications.

2. EventBridge Rule: Routes S3 object creation events to the Kinesis stream.

3. IAM Permissions: Grants the pipeline permission to read from the Kinesis stream.

Refer to Configure Kinesis Event Notifications for S3 Pipelines for detailed setup instructions.
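
As a hedged illustration of the EventBridge rule in step 2, the following sketch builds an event pattern that matches "Object Created" events for a single bucket. The helper name and the bucket name are placeholders, and the setup instructions referenced above remain the authoritative source for the exact rule and permissions.

```python
import json


def object_created_pattern(bucket_name):
    """Build an EventBridge event pattern for S3 'Object Created' events in one bucket."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": [bucket_name]}},
    }


pattern = object_created_pattern("bucket-name")
print(json.dumps(pattern, indent=2))
```

The printed JSON can be pasted into the event pattern field of an EventBridge rule whose target is the Kinesis data stream.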

### Configuration

To enable Kinesis-based file discovery, add the `file_notifications_kinesis_stream_arn` parameter to the pipeline's `CONFIG` JSON:

```
CREATE PIPELINE s3_kinesis_pipeline AS
LOAD DATA S3 'bucket-name/path'
CONFIG '{
 "region": "us-east-1",
 "file_notifications_kinesis_stream_arn": "arn:aws:kinesis:us-east-1:123456789012:stream/s3-events"
}'
CREDENTIALS '{
 "aws_access_key_id": "...",
 "aws_secret_access_key": "..."
}'
INTO TABLE my_table
FORMAT JSON;
```
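
When the statement above is generated by a script, serializing the `CONFIG` value with a JSON library avoids quoting mistakes. This is a minimal sketch: the helper name is hypothetical, the ARN uses a placeholder account ID, and only the two parameters shown in this section are included.

```python
import json


def pipeline_config(region, kinesis_stream_arn):
    """Serialize the CONFIG JSON for a pipeline that uses Kinesis file discovery."""
    return json.dumps({
        "region": region,
        "file_notifications_kinesis_stream_arn": kinesis_stream_arn,
    })


config = pipeline_config(
    "us-east-1",
    "arn:aws:kinesis:us-east-1:123456789012:stream/s3-events",  # placeholder ARN
)
print(f"CONFIG '{config}'")
```

Because the JSON is wrapped in single quotes in the SQL statement, this approach assumes that none of the values contains a single quote.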

The Kinesis integration uses the same AWS credential chain as the S3 extractor and supports the following authentication methods:

* Static AWS credentials (`aws_access_key_id` and `aws_secret_access_key`)
* [Amazon EKS IRSA (IAM Roles for Service Accounts)](https://docs.singlestore.com/db/v9.1/user-and-cluster-administration/#cloud-workload-identity-and-delegated-entities.md)
* Amazon EKS IRSA with cross-account role assumption

***

Modified at: May 18, 2026

Source: [/db/v9.1/load-data/data-sources/load-data-from-amazon-web-services-aws-s-3/scanning-and-loading-files-in-aws-s-3/](https://docs.singlestore.com/db/v9.1/load-data/data-sources/load-data-from-amazon-web-services-aws-s-3/scanning-and-loading-files-in-aws-s-3/)
