Load Data in JSON Format from Amazon S3 Using a Wildcard
Warning
SingleStore 9.0 lets you preview, evaluate, and provide feedback on new and upcoming features before their general availability. In the interim, SingleStore 8.9 is recommended for production workloads; it can later be upgraded to SingleStore 9.0.
SingleStore Pipelines can read data from JSON formatted files stored in an S3 bucket by using a wildcard (*).
Prerequisites
The following prerequisites must be in place before creating a pipeline that uses a JSON wildcard:
- Two or more JSON files for upload to an S3 bucket.
- Access to the S3 bucket.
Create a Table
Create a table using the following syntax:
CREATE TABLE employees (
  lastname TEXT CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  firstname TEXT CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  age INT NOT NULL,
  DOB TEXT CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  partner TEXT CHARACTER SET utf8 COLLATE utf8_general_ci,
  hasChildren BOOLEAN,
  children JSON COLLATE utf8_bin,
  SHARD KEY ());
The JSON file contents are below.
Contents of emp1:
{"lastname": "Bateman","firstname": "Patrick","DOB" : "October 23, 1982","age": "40","partner": "Sharon Bateman","hasChildren": true,"children":["Isabell", "Jackson"]}{"lastname": "Denbrough","firstname": "Bill","DOB" : "January 04, 1990","age": "33","partner": "Audra Phillips","hasChildren": true,"children":["Stephen", "Maya", "Hunter"]}{"lastname": "Torance","firstname": "Jack","DOB" : "June 20, 1980","age": "43","partner": "Wendy Torrance","hasChildren": true,"children":["Danny", "Lucy"]}{"lastname": "Strode","firstname": "Laurie","DOB" : "February 22, 1987","age": "36","partner": null,"hasChildren": true,"children":["Jamie Lloyd", "John Tate", "Karen Nelson"]}{"lastname": "Whateley","firstname": "Wilbur","DOB" : "February 02, 1970","age": "53","partner": null,"hasChildren": true,"children":["Wilbur Jr.", "Eleanor"]}
Contents of emp2:
{"lastname": "Karras","firstname": "Damien","DOB" : "April 12, 1973","age": "50","partner": null,"hasChildren": false,"children": null}{"lastname": "Wilkes","firstname": "Annie","DOB" : "April 01, 1972","age": "51","partner":null,"hasChildren": false,"children": null}{"lastname": "Cady","firstname": "Max","DOB" : "September 13, 1989","age": "33","partner": null,"hasChildren": false,"children": null}{"lastname": "White","firstname": "Carrie","DOB" : "August 25, 2000","age": "22","partner": null,"hasChildren": false,"children": null}{"lastname": "MacNeil","firstname": "Regan","DOB" : "April 07, 1985","age": "38","partner": "Orlando Knotts","hasChildren": false,"children": null}{"lastname": "Eriksson","firstname": "Oskar","DOB" : "May 29, 2001","age": "22","partner": null,"hasChildren": false,"children": null}{"lastname": "Grevers","firstname": "Nick","DOB" : "November 21, 1995","age": "28","partner" : "Sam Avery","hasChildren": false,"children": null}
Create a Pipeline to the S3 Bucket
Make sure to add the correct pipeline name, bucket name, folder name (if applicable), and AWS credentials.
CREATE PIPELINE emp_data
AS LOAD DATA S3 '<bucket_name>/<folder_name>/*.json'
CONFIG '{"region":"us-west-2"}'
CREDENTIALS '{"aws_access_key_id": "<xxxxxxxxxxxxxxx>",
              "aws_secret_access_key": "<xxxxxxxxxxxxxxx>"}'
BATCH_INTERVAL 2500
MAX_PARTITIONS_PER_BATCH 1
DISABLE OUT_OF_ORDER OPTIMIZATION
DISABLE OFFSETS METADATA GC
SKIP DUPLICATE KEY ERRORS
INTO TABLE employees
FORMAT JSON
(`lastname` <- `lastname` default '',
 `firstname` <- `firstname` default '',
 `age` <- `age` default -1,
 `DOB` <- `DOB` default -1,
 `partner` <- `partner` default '',
 `hasChildren` <- `hasChildren` default '',
 `children` <- `children` default '');
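Optionally, test the pipeline before starting it. TEST PIPELINE pulls a sample of rows from the source without writing them to the table, which is a quick way to confirm that the bucket path, credentials, and column mapping are correct:

TEST PIPELINE emp_data LIMIT 5;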
Start the pipeline.
START PIPELINE emp_data;
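To confirm the pipeline is running, you can query the information_schema.pipelines view (a quick status check; the state column reports values such as Running or Stopped):

SELECT pipeline_name, state
FROM information_schema.pipelines
WHERE pipeline_name = 'emp_data';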
Verify the data has loaded.
SELECT * FROM employees;
+-----------+-----------+-----+--------------------+----------------+-------------+--------------------------------------------+
| lastname | firstname | age | DOB | partner | hasChildren | children |
+-----------+-----------+-----+--------------------+----------------+-------------+--------------------------------------------+
| Bateman | Patrick | 40 | October 23, 1982 | Sharon Bateman | 1 | ["Isabell","Jackson"] |
| Eriksson | Oskar | 22 | May 29, 2001 | NULL | 0 | NULL |
| MacNeil | Regan | 38 | April 07, 1985 | Orlando Knotts | 0 | NULL |
| Grevers | Nick | 28 | November 21, 1995 | Sam Avery | 0 | NULL |
| Karras | Damien | 50 | April 12, 1973 | NULL | 0 | NULL |
| Whateley | Wilbur | 53 | February 02, 1970 | NULL | 1 | ["Wilbur Jr.","Eleanor"] |
| Denbrough | Bill | 33 | January 04, 1990 | Audra Phillips | 1 | ["Stephen","Maya","Hunter"] |
| Torance | Jack | 43 | June 20, 1980 | Wendy Torrance | 1 | ["Danny","Lucy"] |
| Wilkes | Annie | 51 | April 01, 1972 | NULL | 0 | NULL |
| Cady | Max | 33 | September 13, 1989 | NULL | 0 | NULL |
| White | Carrie | 22 | August 25, 2000 | NULL | 0 | NULL |
| Strode | Laurie | 36 | February 22, 1987 | NULL | 1 | ["Jamie Lloyd","John Tate","Karen Nelson"] |
+-----------+-----------+-----+--------------------+----------------+-------------+--------------------------------------------+
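Because children is stored as a JSON column, it can be queried with SingleStore's JSON functions. As an illustrative example (not part of the load itself), JSON_EXTRACT_STRING with an array index returns a single element:

SELECT lastname, firstname, JSON_EXTRACT_STRING(children, 0) AS first_child
FROM employees
WHERE hasChildren;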
Other Useful Pipeline Commands
If the table is empty, run the command below to check for pipeline errors. Be sure to substitute the correct pipeline name.
SELECT * FROM information_schema.pipelines_errors
WHERE pipeline_name = '<pipeline_name>';
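It can also help to see which files the pipeline has registered and whether each one has loaded; the information_schema.pipelines_files view tracks this per file:

SELECT * FROM information_schema.pipelines_files
WHERE pipeline_name = '<pipeline_name>';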
Stopping a pipeline is useful when you want to stop additional data from loading but would like to use the pipeline again.
STOP PIPELINE <pipeline_name>;
Dropping a pipeline is useful when the pipeline was created for a one-off data load and it will not be used in the future.
DROP PIPELINE <pipeline_name>;
Note
A pipeline does not need to be stopped in order to be dropped.