Load Data in CSV Format from Amazon S3 Using a Pipeline
SingleStore Pipelines can read data from a comma-separated value (CSV) formatted file stored in an S3 bucket and insert these values into a destination table.
Prerequisites
The following prerequisites need to be in place before creating a pipeline using a CSV file:
A CSV file is available for upload to an S3 bucket.
An S3 bucket has been configured and has the proper permissions for access.
See Load Data from Amazon S3 using a Pipeline for more details about Amazon S3 Pipelines.
Create a CSV Pipeline
The following examples use the same dataset and table structure. The main difference are the clauses used in the pipeline command.
Create a Table
Create a table using the following syntax:
CREATE TABLE nautical_books( isbn text(13) CHARACTER SET utf8 COLLATE utf8_general_ci, title text(50) CHARACTER SET utf8 COLLATE utf8_general_ci, author text (50) CHARACTER SET utf8 COLLATE utf8_general_ci, year_published text(4) CHARACTER SET utf8 COLLATE utf8_general_ci);
The dataset used in this example is below. Cut and paste the dataset into any program that allows files to be saved in A CSV file format and upload the file to an S3 bucket.
isbn, book_title,Book_author,Year 9780393972832,Moby Dick,Herman Melville,1851 9780006166269,Master and Commander,Patrick O’Brian,1969 9780192823182,The Riddle of the Sands,Erskine Childers,1903 9781421834016,Captain Blood,Rafael Sabatini,1922 9783104026886,Mr Midshipman Hornblower,CS Forester,1950 9780192829313,The Sea Wolf,Jack London,1904 9781857027204,The Perfect Storm,Sebastian Junger,1997 9780297866374,The Spanish Armada,Garrett Mattingly,1959 9788496957879,Mutiny on the Bounty,Charles Nordoff,1932 9780375757945,Two Years Before the Mast,Richard Henry Dana,1840 9780340203163,Shogun,James Clavell,1975 9787201046440,The Old Man and The Sea,Ernest Hemingway,1952 9780721405971,Treasure Island,Robert Louis Stevenson,1883 9780006172765,The Hunt for Red October,Tom Clancy,1984 9780316137492,The Terror,Dan Simmons,2007
Create a Pipeline with Minimal Information
Creating a pipeline using only the following information will result in the data not loading, and the table will be empty.
CREATE PIPELINE books AS LOAD DATA S3 's3://test-bucket/nautical_books.csv' CONFIG '{"region":"us-west-2"}' CREDENTIALS '{"aws_access_key_id": "XXXXXXXXXXXXXXX", "aws_secret_access_key": "XXXXXXXXXXXXXXX"}' INTO TABLE nautical_books; START PIPELINE books; SELECT * FROM nautical_books; **** Empty set (0.13 sec)
Troubleshooting Pipelines
The following command provides valuable information about why a running pipeline is not working as expected. The example below shows a warning and an error when loading the data. A warning message can show if the delimiters between fields and lines are correctly formatted for a CSV file.
SELECT * FROM information_schema.pipelines_errors WHERE pipeline_name = 'books'; **** +--------------+--------------+------------------------+------------+-------------+--------------------------------------------------------------------+------------+ |DATABASE_NAME |PIPELINE_NAME | ERROR_UNIX_TIMESTAMP | ERROR_TYPE | ERROR_CODE | ERROR_MESSAGE | ERROR_KIND | +--------------+---------------------------------------+------------+-------------+--------------------------------------------------------------------+------------+ | tickets | books | 1675962208 | Warning | 2568 | Detected input line ending in '\r\n', | Load | | | | | | | but FORMAT CSV is configured with LINES | | | | | | | | TERMINATED BY '\n'. So '\r' will be included | | | | | | | | in loaded data. Consider adding LINES TERMINATED BY '\r\n' if '\r' | | | | | | | | should be part of the line terminator and stripped from the data. | | | tickets | books | 1675962208 | Error | 1261 | Row 1 doesn't contain data for all columns | Load | +--------------+--------------+------------------------+------------+-------------+--------------------------------------------------------------------+------------+
Note
The table above has been truncated for viewing purposes.
Create Pipeline with Escape Clauses
Creating a pipeline with the escape clauses will allow the data to load. The header line(s) from the CSV file is included as a line entry in the table.
CREATE PIPELINE books AS
LOAD DATA S3 's3://test-bucket/nautical_books.csv'
CONFIG '{"region":"us-west-2"}'
CREDENTIALS '{"aws_access_key_id": "XXXXXXXXXX",
"aws_secret_access_key": "XXXXXXXXXX"}'
INTO TABLE nautical_books
FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\'
LINES TERMINATED BY '\r\n' STARTING BY ''
START PIPELINE books;
SELECT * FROM nautical_books;
****
+---------------+---------------------------+------------------------+----------------+
| isbn | title | author | year_published |
+---------------+---------------------------+------------------------+----------------+
| 9788496957879 | Mutiny on the Bounty | Charles Nordoff | 1932 |
| 9781421834016 | Captain Blood | Rafael Sabatini | 1922 |
| isbn | book_title | Book_author | Year |
| 9780006172765 | The Hunt for Red October | Tom Clancy | 1984 |
| 9787201046440 | The Old Man and The Sea | Ernest Hemingway | 1952 |
| 9780316137492 | The Terror | Dan Simmons | 2007 |
| 9780375757945 | Two Years Before the Mast | Richard Henry Dana | 1840 |
| 9780393972832 | Moby Dick | Herman Melville | 1851 |
| 9780006166269 | Master and Commander | Patrick O’Brian | 1969 |
| 9780192823182 | The Riddle of the Sands | Erskine Childers | 1903 |
| 9783104026886 | Mr Midshipman Hornblower | CS Forester | 1950 |
| 9780721405971 | Treasure Island | Robert Louis Stevenson | 1883 |
| 9781857027204 | The Perfect Storm | Sebastian Junger | 1997 |
| 9780192829313 | The Sea Wolf | Jack London | 1904 |
| 9780340203163 | Shogun | James Clavell | 1975 |
| 9780297866374 | The Spanish Armada | Garrett Mattingly | 1959 |
+---------------+---------------------------+------------------------+----------------+
Create Pipeline Using the Ignore Clause
Add an IGNORE <#> LINES
clause in the pipeline creation to skip over the headers or a set number of lines at the beginning of the file.
CREATE PIPELINE books AS LOAD DATA S3 's3://test-bucket/nautical_books.csv' CONFIG '{"region":"us-west-2"}' CREDENTIALS '{"aws_access_key_id": "XXXXXXXXXX", "aws_secret_access_key": "XXXXXXXXXX"}' INTO TABLE nautical_books FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\' LINES TERMINATED BY '\r\n' STARTING BY '' IGNORE 1 LINES **** +---------------+---------------------------+------------------------+----------------+ | isbn | title | author | year_published | +---------------+---------------------------+------------------------+----------------+ | 9780006172765 | The Hunt for Red October | Tom Clancy | 1984 | | 9783104026886 | Mr Midshipman Hornblower | CS Forester | 1950 | | 9780721405971 | Treasure Island | Robert Louis Stevenson | 1883 | | 9780316137492 | The Terror | Dan Simmons | 2007 | | 9780375757945 | Two Years Before the Mast | Richard Henry Dana | 1840 | | 9780340203163 | Shogun | James Clavell | 1975 | | 9780192823182 | The Riddle of the Sands | Erskine Childers | 1903 | | 9781421834016 | Captain Blood | Rafael Sabatini | 1922 | | 9781857027204 | The Perfect Storm | Sebastian Junger | 1997 | | 9788496957879 | Mutiny on the Bounty | Charles Nordoff | 1932 | | 9787201046440 | The Old Man and The Sea | Ernest Hemingway | 1952 | | 9780393972832 | Moby Dick | Herman Melville | 1851 | | 9780297866374 | The Spanish Armada | Garrett Mattingly | 1959 | | 9780006166269 | Master and Commander | Patrick O’Brian | 1969 | | 9780192829313 | The Sea Wolf | Jack London | 1904 | +---------------+---------------------------+------------------------+----------------+