Load Data in CSV Format from Amazon S3 Using a Pipeline

SingleStore Pipelines can read data from a comma-separated value (CSV) formatted file stored in an S3 bucket and insert these values into a destination table.

Prerequisites

The following prerequisites need to be in place before creating a pipeline using a CSV file:

  • A CSV file is available for upload to an S3 bucket.

  • An S3 bucket has been configured and has the proper permissions for access.

See Load Data from Amazon Web Services (AWS) S3 for more details about Amazon S3 Pipelines. For more information on loading data from a CSV, JSON, or Avro file, refer to LOAD DATA.

Create a CSV Pipeline

The following examples use the same dataset and table structure. The main difference are the clauses used in the pipeline command.

Create a Table

Create a table using the following syntax:

CREATE TABLE nautical_books(
isbn text(13) CHARACTER SET utf8 COLLATE utf8_general_ci,
title text(50) CHARACTER SET utf8 COLLATE utf8_general_ci,
author text (50) CHARACTER SET utf8 COLLATE utf8_general_ci,
year_published text(4) CHARACTER SET utf8 COLLATE utf8_general_ci);

The dataset used in this example is below. Cut and paste the dataset into any program that allows files to be saved in a CSV file format and upload the files to an S3 bucket.

isbn, book_title,Book_author,Year
9780393972832,Moby Dick,Herman Melville,1851
9780006166269,Master and Commander,Patrick O’Brian,1969
9780192823182,The Riddle of the Sands,Erskine Childers,1903
9781421834016,Captain Blood,Rafael Sabatini,1922
9783104026886,Mr Midshipman Hornblower,CS Forester,1950
9780192829313,The Sea Wolf,Jack London,1904
9781857027204,The Perfect Storm,Sebastian Junger,1997
9780297866374,The Spanish Armada,Garrett Mattingly,1959
9788496957879,Mutiny on the Bounty,Charles Nordoff,1932
9780375757945,Two Years Before the Mast,Richard Henry Dana,1840
9780340203163,Shogun,James Clavell,1975
9787201046440,The Old Man and The Sea,Ernest Hemingway,1952
9780721405971,Treasure Island,Robert Louis Stevenson,1883
9780006172765,The Hunt for Red October,Tom Clancy,1984
9780316137492,The Terror,Dan Simmons,2007

Create a Pipeline with Minimal Information

Creating a pipeline using only the following information will result in the data not loading, and the table will be empty.

CREATE PIPELINE books AS
LOAD DATA S3 's3://test-bucket/nautical_books.csv'
CONFIG '{"region":"us-west-2"}'
CREDENTIALS '{"aws_access_key_id": "XXXXXXXXXXXXXXX",
"aws_secret_access_key": "XXXXXXXXXXXXXXX"}'
INTO TABLE nautical_books;
START PIPELINE books;
SELECT * FROM nautical_books;
Empty set (0.13 sec)

Troubleshooting Pipelines

The following command provides valuable information about why a running pipeline is not working as expected. The example below shows a warning and an error when loading the data. A warning message can show if the delimiters between fields and lines are correctly formatted for a CSV file.

SELECT * FROM information_schema.pipelines_errors
WHERE pipeline_name = 'books';
+--------------+--------------+------------------------+------------+-------------+--------------------------------------------------------------------+------------+
|DATABASE_NAME |PIPELINE_NAME | ERROR_UNIX_TIMESTAMP   | ERROR_TYPE | ERROR_CODE  | ERROR_MESSAGE                                                      | ERROR_KIND |
+--------------+---------------------------------------+------------+-------------+--------------------------------------------------------------------+------------+
| tickets      | books        |             1675962208 | Warning    |        2568 | Detected input line ending in '\r\n',                              | Load       |
|              |              |                        |            |             | but FORMAT CSV is configured with LINES                            |            |
|              |              |                        |            |             | TERMINATED BY '\n'. So '\r' will be included                       |            |
|              |              |                        |            |             | in loaded data. Consider adding LINES TERMINATED BY '\r\n' if '\r' |            |
|              |              |                        |            |             | should be part of the line terminator and stripped from the data.  |            |
| tickets      | books        |             1675962208 | Error      |        1261 | Row 1 doesn't contain data for all columns                         | Load       |
+--------------+--------------+------------------------+------------+-------------+--------------------------------------------------------------------+------------+

Note

The table above has been truncated for viewing purposes.

Create Pipeline with Escape Clauses

Creating a pipeline with the escape clauses will allow the data to load. The header line(s) from the CSV file is included as a line entry in the table.

CREATE PIPELINE books AS
LOAD DATA S3 's3://test-bucket/nautical_books.csv'
CONFIG '{"region":"us-west-2"}'
CREDENTIALS '{"aws_access_key_id": "XXXXXXXXXX",
"aws_secret_access_key": "XXXXXXXXXX"}'
INTO TABLE nautical_books
FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\'
LINES TERMINATED BY '\r\n' STARTING BY ''
START PIPELINE books;
SELECT * FROM nautical_books;
+---------------+---------------------------+------------------------+----------------+
| isbn          | title                     | author                 | year_published |
+---------------+---------------------------+------------------------+----------------+
| 9788496957879 | Mutiny on the Bounty      | Charles Nordoff        | 1932           |
| 9781421834016 | Captain Blood             | Rafael Sabatini        | 1922           |
| 9780006172765 | The Hunt for Red October  | Tom Clancy             | 1984           |
| 9787201046440 | The Old Man and The Sea   | Ernest Hemingway       | 1952           |
| 9780316137492 | The Terror                | Dan Simmons            | 2007           |
| 9780375757945 | Two Years Before the Mast | Richard Henry Dana     | 1840           |
| 9780393972832 | Moby Dick                 | Herman Melville        | 1851           |
| 9780006166269 | Master and Commander      | Patrick O’Brian        | 1969           |
| 9780192823182 | The Riddle of the Sands   | Erskine Childers       | 1903           |
| 9783104026886 | Mr Midshipman Hornblower  | CS Forester            | 1950           |
| 9780721405971 | Treasure Island           | Robert Louis Stevenson | 1883           |
| 9781857027204 | The Perfect Storm         | Sebastian Junger       | 1997           |
| 9780192829313 | The Sea Wolf              | Jack London            | 1904           |
| 9780340203163 | Shogun                    | James Clavell          | 1975           |
| 9780297866374 | The Spanish Armada        | Garrett Mattingly      | 1959           |
+---------------+---------------------------+------------------------+----------------+

Create Pipeline Using the Ignore Clause

Add an IGNORE <#> LINES clause in the pipeline creation to skip over the headers or a set number of lines at the beginning of the file.

CREATE PIPELINE books AS
LOAD DATA S3 's3://test-bucket/nautical_books.csv'
CONFIG '{"region":"us-west-2"}'
CREDENTIALS '{"aws_access_key_id": "XXXXXXXXXX",
"aws_secret_access_key": "XXXXXXXXXX"}'
INTO TABLE nautical_books
FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\'
LINES TERMINATED BY '\r\n' STARTING BY ''
IGNORE 1 LINES
+---------------+---------------------------+------------------------+----------------+
| isbn          | title                     | author                 | year_published |
+---------------+---------------------------+------------------------+----------------+
| 9780006172765 | The Hunt for Red October  | Tom Clancy             | 1984           |
| 9783104026886 | Mr Midshipman Hornblower  | CS Forester            | 1950           |
| 9780721405971 | Treasure Island           | Robert Louis Stevenson | 1883           |
| 9780316137492 | The Terror                | Dan Simmons            | 2007           |
| 9780375757945 | Two Years Before the Mast | Richard Henry Dana     | 1840           |
| 9780340203163 | Shogun                    | James Clavell          | 1975           |
| 9780192823182 | The Riddle of the Sands   | Erskine Childers       | 1903           |
| 9781421834016 | Captain Blood             | Rafael Sabatini        | 1922           |
| 9781857027204 | The Perfect Storm         | Sebastian Junger       | 1997           |
| 9788496957879 | Mutiny on the Bounty      | Charles Nordoff        | 1932           |
| 9787201046440 | The Old Man and The Sea   | Ernest Hemingway       | 1952           |
| 9780393972832 | Moby Dick                 | Herman Melville        | 1851           |
| 9780297866374 | The Spanish Armada        | Garrett Mattingly      | 1959           |
| 9780006166269 | Master and Commander      | Patrick O’Brian        | 1969           |
| 9780192829313 | The Sea Wolf              | Jack London            | 1904           |
+---------------+---------------------------+------------------------+----------------+

Last modified: August 29, 2024

Was this article helpful?