Load Data Using pipeline_source_file()

Pipelines can extract, transform, and insert objects from an Amazon S3 bucket into a destination table. Pipelines can persist the name of a file by using the pipeline_source_file() helper.

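This example loads three files (Horror.csv, Nautical.csv, and science_fiction.csv, stored under the Books/ prefix) into a table using an S3 pipeline. Each file contains a single numeric column with one numeric entry per line, and the pipeline stores the name of the source file alongside each entry. First, create the destination table.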

SQL

CREATE TABLE book_inventory(isbn NUMERIC(13),title VARCHAR(50));

Create an S3 pipeline to ingest the data.

SQL

CREATE PIPELINE books AS
LOAD DATA S3 's3://<bucket_name>/Books/'
CONFIG '{"region":"us-west-2"}'
CREDENTIALS '{"aws_access_key_id": "<access_key_id>",
              "aws_secret_access_key": "<secret_access_key>"}'
SKIP DUPLICATE KEY ERRORS
INTO TABLE book_inventory
(isbn)
SET title = pipeline_source_file();
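
If only the file name is wanted, without the Books/ prefix, the value can be trimmed before it is stored. The following is a minimal sketch, not a confirmed recipe: it assumes the SET clause accepts a scalar expression that wraps pipeline_source_file(), and the pipeline name books_basename is hypothetical.

```sql
-- Hypothetical variant of the pipeline above: keep only the text after the last '/'.
-- Assumption: SET accepts an expression around pipeline_source_file().
CREATE PIPELINE books_basename AS
LOAD DATA S3 's3://<bucket_name>/Books/'
CONFIG '{"region":"us-west-2"}'
CREDENTIALS '{"aws_access_key_id": "<access_key_id>",
              "aws_secret_access_key": "<secret_access_key>"}'
SKIP DUPLICATE KEY ERRORS
INTO TABLE book_inventory
(isbn)
-- SUBSTRING_INDEX with a count of -1 returns everything after the final delimiter.
SET title = SUBSTRING_INDEX(pipeline_source_file(), '/', -1);
```

Under these assumptions, a row that came from Books/Horror.csv would store Horror.csv in the title column.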

Test the pipeline:

SQL

TEST PIPELINE books LIMIT 5;

+---------------+------------------+
| isbn          | title            |
+---------------+------------------+
| 9780770437404 | Books/Horror.csv |
| 9780380977277 | Books/Horror.csv |
| 9780385319676 | Books/Horror.csv |
| 9781416552963 | Books/Horror.csv |
| 9780316362269 | Books/Horror.csv |
+---------------+------------------+

Start the pipeline.

SQL

START PIPELINE books;
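
Once the pipeline is running, it can help to confirm which files it has discovered and whether each one has been loaded. The query below is a sketch that assumes the information_schema.PIPELINES_FILES view is available in your deployment with the pipeline_name, file_name, and file_state columns.

```sql
-- List every file the 'books' pipeline has seen, with its load state.
SELECT file_name, file_state
FROM information_schema.PIPELINES_FILES
WHERE pipeline_name = 'books'
ORDER BY file_name;
```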

Query the table to verify that every row has a corresponding filename.

SQL

SELECT * FROM book_inventory;

+---------------+---------------------------+
| isbn          | title                     |
+---------------+---------------------------+
| 9780316137492 | Books/Nautical.csv        |
| 9780440117377 | Books/Horror.csv          |
| 9780297866374 | Books/Nautical.csv        |
| 9780006166269 | Books/Nautical.csv        |
| 9780721405971 | Books/Nautical.csv        |
| 9781416552963 | Books/Horror.csv          |
| 9780316362269 | Books/Horror.csv          |
| 9783104026886 | Books/Nautical.csv        |
| 9788496957879 | Books/Nautical.csv        |
| 9780380783601 | Books/Horror.csv          |
| 9780380973835 | Books/science_fiction.csv |
| 9780739462287 | Books/science_fiction.csv |
+---------------+---------------------------+
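
For larger tables, scanning every row by eye is impractical. A quick alternative, sketched here against the same book_inventory table, is to count rows per source file and separately check for rows that have no filename.

```sql
-- Count how many rows each source file contributed.
SELECT title AS source_file, COUNT(*) AS row_count
FROM book_inventory
GROUP BY title
ORDER BY title;

-- Any rows returned here are missing a filename.
SELECT isbn
FROM book_inventory
WHERE title IS NULL OR title = '';
```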

To load files from a specific folder in your S3 bucket while ignoring the files in its subfolders, end the path with the '**' wildcard pattern, as in 's3://<bucket_name>/<folder_name>/**'. For example:

SQL

CREATE PIPELINE <your_pipeline> AS
LOAD DATA S3 's3://<bucket_name>/<folder_name>/**'
CONFIG '{"region":"<your_region>"}'
CREDENTIALS '{"aws_access_key_id": "<access_key_id>",
              "aws_secret_access_key": "<secret_access_key>"}'
SKIP DUPLICATE KEY ERRORS
INTO TABLE <your_table>;

Placing two asterisks (**) after the folder name instructs the pipeline to load all of the files in the main folder and ignore the files in its subfolders. However, the files in the subfolders are still scanned when the pipeline lists the contents of the bucket.
