Load Data from Amazon S3 using a Pipeline
SingleStore Pipelines can extract objects from Amazon S3 buckets, optionally transform them, and insert them into a destination table. To understand Amazon S3’s core concepts and the terminology used in this topic, please read the Amazon S3 documentation.
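To make the shape of such a pipeline concrete, here is a minimal sketch of an S3 pipeline definition with the optional transform clause included. The pipeline name, table name, bucket, and transform URL here are hypothetical placeholders; the full, working statement used in this Quickstart appears in Part 2.

CREATE PIPELINE my_pipeline
AS LOAD DATA S3 'my-bucket-name'
CONFIG '{"region": "us-west-1"}'
CREDENTIALS '{"aws_access_key_id": "...", "aws_secret_access_key": "..."}'
-- optional: run each batch through a transform executable before insertion
WITH TRANSFORM ('http://example.com/my-transform.py', 'my-transform.py', '')
INTO TABLE my_table
FIELDS TERMINATED BY ',';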
Prerequisites
To complete this Quickstart, your environment must meet the following prerequisites:
AWS Account: This Quickstart uses Amazon S3 and requires an AWS account's access key ID and secret access key.
SingleStore DB installation –or– a SingleStore Managed Service cluster: You will connect to the database or cluster and create a pipeline to pull data from your Amazon S3 bucket.
Part 1: Creating an Amazon S3 Bucket and Adding a File
On your local machine, create a text file named books.txt with the following CSV contents:

The Catcher in the Rye, J.D. Salinger, 1945
Pride and Prejudice, Jane Austen, 1813
Of Mice and Men, John Steinbeck, 1937
Frankenstein, Mary Shelley, 1818
In S3, create a bucket and upload books.txt to the bucket. For information on working with S3, refer to the Amazon S3 documentation.

Note that the aws_access_key_id that your SingleStore pipeline will use (specified in the next section in CREATE PIPELINE library ... CREDENTIALS ...) must have read access to both the bucket and the file.

Once the books.txt file has been uploaded, you can proceed to the next part of the Quickstart.
Part 2: Creating a SingleStore Database and S3 Pipeline
Now that you have an S3 bucket that contains an object (file), you can use SingleStore Managed Service or SingleStore DB to create a new pipeline and ingest the data.
Create a new database and a table that adheres to the schema contained in the books.txt file. At the prompt, execute the following statements:
CREATE DATABASE books;
CREATE TABLE classic_books (
    title VARCHAR(255),
    author VARCHAR(255),
    date VARCHAR(255)
);
These statements create a new database named books and a new table named classic_books, which has three columns: title, author, and date.
Now that the destination database and table have been created, you can create an S3 pipeline. In Part 1 of this Quickstart, you uploaded the books.txt file to your bucket. To create the pipeline, you will need the following information:
The name of the bucket, such as: my-bucket-name
The name of the bucket's region, such as: us-west-1
Your AWS account's access keys, such as:
Access Key ID: your_access_key_id
Secret Access Key: your_secret_access_key
Your AWS account's session token, such as:
Session Token: your_session_token
Note that the aws_session_token is required only if your credentials in the CREDENTIALS clause are temporary.
Using these identifiers and keys, execute the following statement, replacing the placeholder values with your own.
CREATE PIPELINE library
AS LOAD DATA S3 'my-bucket-name'
CONFIG '{"region": "us-west-1"}'
CREDENTIALS '{"aws_access_key_id": "your_access_key_id",
              "aws_secret_access_key": "your_secret_access_key",
              "aws_session_token": "your_session_token"}'
INTO TABLE `classic_books`
FIELDS TERMINATED BY ',';
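Before starting the pipeline, you can optionally dry-run it with TEST PIPELINE, which pulls a small sample of the source data and returns it without writing to the destination table:

TEST PIPELINE library LIMIT 1;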
You can see what files the pipeline wants to load by running the following:
SELECT * FROM information_schema.PIPELINES_FILES;
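If your cluster has many pipelines or your bucket contains many objects, a narrower query may be easier to read; for example, assuming the view's standard FILE_NAME and FILE_STATE columns:

SELECT FILE_NAME, FILE_STATE
FROM information_schema.PIPELINES_FILES
WHERE PIPELINE_NAME = 'library';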
If everything is properly configured, you should see one row in the Unloaded state, corresponding to books.txt. The CREATE PIPELINE statement creates a new pipeline named library, but the pipeline has not yet been started, and no data has been loaded. A SingleStore pipeline can run either in the background or be triggered by a foreground query. Start it in the foreground first.
START PIPELINE library FOREGROUND;
When this command returns success, all files from your bucket will be loaded. If you check information_schema.PIPELINES_FILES again, you should see all files in the Loaded state. Now query the classic_books table to make sure the data has actually loaded.
SELECT * FROM classic_books;

+------------------------+-----------------+-------+
| title                  | author          | date  |
+------------------------+-----------------+-------+
| The Catcher in the Rye | J.D. Salinger   | 1945  |
| Pride and Prejudice    | Jane Austen     | 1813  |
| Of Mice and Men        | John Steinbeck  | 1937  |
| Frankenstein           | Mary Shelley    | 1818  |
+------------------------+-----------------+-------+
You can also have SingleStore run your pipeline in the background. In such a configuration, SingleStore will periodically poll S3 for new files and continuously load them as they are added to the bucket. Before running your pipeline in the background, you must reset the state of the pipeline and the table.
DELETE FROM classic_books;

ALTER PIPELINE library SET OFFSETS EARLIEST;
The first command deletes all rows from the target table. The second causes the pipeline to start from the beginning; in this case, it forgets that it already loaded books.txt, so you can load it again. You can also drop and recreate the pipeline, if you prefer, as sketched below.
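If you take the drop-and-recreate route, it looks like this, followed by re-running the CREATE PIPELINE statement from above:

DROP PIPELINE library;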
To start a pipeline in the background, run:
START PIPELINE library;
This statement starts the pipeline. To see whether the pipeline is running, run SHOW PIPELINES.
SHOW PIPELINES;

+----------------------+---------+
| Pipelines_in_books   | State   |
+----------------------+---------+
| library              | Running |
+----------------------+---------+
At this point, the pipeline is running, and the contents of the books.txt file should once again be present in the classic_books table.
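When you are finished experimenting, you can stop the background pipeline; if you no longer need it, you can also drop it:

STOP PIPELINE library;

DROP PIPELINE library;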
Notice
Foreground pipelines and background pipelines have different intended uses and behave differently. For more information, see the START PIPELINE topic.
Next Steps
See Pipeline Concepts to learn more about how pipelines work.