Load Data in JSON Format from Amazon S3 Using a Wildcard

SingleStore Pipelines can read data from JSON-formatted files stored in an S3 bucket by using a wildcard (*) in the file path.

Prerequisites

The following prerequisites must be in place before creating a pipeline that uses a JSON wildcard.

  • Two or more JSON files for upload to an S3 bucket.

  • Access to the S3 bucket.

Create a Table

Create a table using the following syntax. The empty SHARD KEY () distributes rows randomly across partitions (keyless sharding):

CREATE TABLE employees (
    lastname TEXT CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
    firstname TEXT CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
    age INT NOT NULL,
    DOB TEXT CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
    partner TEXT CHARACTER SET utf8 COLLATE utf8_general_ci,
    hasChildren BOOLEAN,
    children JSON COLLATE utf8_bin,
    SHARD KEY ()
);
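
As a quick optional check, you can confirm that the table was created with the expected columns before loading any data:

DESCRIBE employees;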

The contents of both JSON files are shown below. To test, copy and paste the contents into any editor that can save files in JSON format, then upload both files to the S3 bucket. For this example, the files are named emp1.json and emp2.json; the .json extension is required so that the files match the wildcard used in the pipeline below. Each file contains a series of JSON objects, and the pipeline loads each top-level object as one row.

Contents of emp1.json:

{
  "lastname": "Bateman",
  "firstname": "Patrick",
  "DOB": "October 23, 1982",
  "age": "40",
  "partner": "Sharon Bateman",
  "hasChildren": true,
  "children": ["Isabell", "Jackson"]
}
{
  "lastname": "Denbrough",
  "firstname": "Bill",
  "DOB": "January 04, 1990",
  "age": "33",
  "partner": "Audra Phillips",
  "hasChildren": true,
  "children": ["Stephen", "Maya", "Hunter"]
}
{
  "lastname": "Torance",
  "firstname": "Jack",
  "DOB": "June 20, 1980",
  "age": "43",
  "partner": "Wendy Torrance",
  "hasChildren": true,
  "children": ["Danny", "Lucy"]
}
{
  "lastname": "Strode",
  "firstname": "Laurie",
  "DOB": "February 22, 1987",
  "age": "36",
  "partner": null,
  "hasChildren": true,
  "children": ["Jamie Lloyd", "John Tate", "Karen Nelson"]
}
{
  "lastname": "Whateley",
  "firstname": "Wilbur",
  "DOB": "February 02, 1970",
  "age": "53",
  "partner": null,
  "hasChildren": true,
  "children": ["Wilbur Jr.", "Eleanor"]
}

Contents of emp2.json:

{
  "lastname": "Karras",
  "firstname": "Damien",
  "DOB": "April 12, 1973",
  "age": "50",
  "partner": null,
  "hasChildren": false,
  "children": null
}
{
  "lastname": "Wilkes",
  "firstname": "Annie",
  "DOB": "April 01, 1972",
  "age": "51",
  "partner": null,
  "hasChildren": false,
  "children": null
}
{
  "lastname": "Cady",
  "firstname": "Max",
  "DOB": "September 13, 1989",
  "age": "33",
  "partner": null,
  "hasChildren": false,
  "children": null
}
{
  "lastname": "White",
  "firstname": "Carrie",
  "DOB": "August 25, 2000",
  "age": "22",
  "partner": null,
  "hasChildren": false,
  "children": null
}
{
  "lastname": "MacNeil",
  "firstname": "Regan",
  "DOB": "April 07, 1985",
  "age": "38",
  "partner": "Orlando Knotts",
  "hasChildren": false,
  "children": null
}
{
  "lastname": "Eriksson",
  "firstname": "Oskar",
  "DOB": "May 29, 2001",
  "age": "22",
  "partner": null,
  "hasChildren": false,
  "children": null
}
{
  "lastname": "Grevers",
  "firstname": "Nick",
  "DOB": "November 21, 1995",
  "age": "28",
  "partner": "Sam Avery",
  "hasChildren": false,
  "children": null
}

Create a Pipeline to the S3 Bucket

Make sure to add the correct pipeline name, bucket name, folder name (if applicable), region, and AWS credentials.

CREATE PIPELINE emp_data
AS LOAD DATA S3 '<bucket_name>/<folder_name>/*.json'
CONFIG '{"region":"us-west-2"}'
CREDENTIALS '{"aws_access_key_id": "<xxxxxxxxxxxxxxx>",
"aws_secret_access_key": "<xxxxxxxxxxxxxxx>"}'
BATCH_INTERVAL 2500
MAX_PARTITIONS_PER_BATCH 1
DISABLE OUT_OF_ORDER OPTIMIZATION
DISABLE OFFSETS METADATA GC
SKIP DUPLICATE KEY ERRORS
INTO TABLE employees
FORMAT JSON
(`lastname` <- `lastname` DEFAULT '',
`firstname` <- `firstname` DEFAULT '',
`age` <- `age` DEFAULT -1,
`DOB` <- `DOB` DEFAULT '',
`partner` <- `partner` DEFAULT '',
`hasChildren` <- `hasChildren` DEFAULT '',
`children` <- `children` DEFAULT '');
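
Optionally, before starting the pipeline, you can run a dry run to confirm the files are reachable and the field mapping is correct. TEST PIPELINE reads a sample of the source data and returns the rows it would insert, without writing them to the table:

TEST PIPELINE emp_data LIMIT 5;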

Start the pipeline.

START PIPELINE emp_data;
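
To confirm the pipeline is running, list the pipelines in the current database along with their state:

SHOW PIPELINES;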

Verify that the data has loaded; the contents of both JSON files should appear in the table.

SELECT * FROM employees;
+-----------+-----------+-----+--------------------+----------------+-------------+--------------------------------------------+
| lastname  | firstname | age | DOB                | partner        | hasChildren | children                                   |
+-----------+-----------+-----+--------------------+----------------+-------------+--------------------------------------------+
| Bateman   | Patrick   |  40 | October 23, 1982   | Sharon Bateman |           1 | ["Isabell","Jackson"]                      |
| Eriksson  | Oskar     |  22 | May 29, 2001       | NULL           |           0 | NULL                                       |
| MacNeil   | Regan     |  38 | April 07, 1985     | Orlando Knotts |           0 | NULL                                       |
| Grevers   | Nick      |  28 | November 21, 1995  | Sam Avery      |           0 | NULL                                       |
| Karras    | Damien    |  50 | April 12, 1973     | NULL           |           0 | NULL                                       |
| Whateley  | Wilbur    |  53 | February 02, 1970  | NULL           |           1 | ["Wilbur Jr.","Eleanor"]                   |
| Denbrough | Bill      |  33 | January 04, 1990   | Audra Phillips |           1 | ["Stephen","Maya","Hunter"]                |
| Torance   | Jack      |  43 | June 20, 1980      | Wendy Torrance |           1 | ["Danny","Lucy"]                           |
| Wilkes    | Annie     |  51 | April 01, 1972     | NULL           |           0 | NULL                                       |
| Cady      | Max       |  33 | September 13, 1989 | NULL           |           0 | NULL                                       |
| White     | Carrie    |  22 | August 25, 2000    | NULL           |           0 | NULL                                       |
| Strode    | Laurie    |  36 | February 22, 1987  | NULL           |           1 | ["Jamie Lloyd","John Tate","Karen Nelson"] |
+-----------+-----------+-----+--------------------+----------------+-------------+--------------------------------------------+
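
Because children is a JSON column, it can be queried with SingleStore's JSON functions. The following is a minimal sketch using the sample data above; JSON_LENGTH returns the number of elements in each array, and JSON_EXTRACT_STRING retrieves an element by zero-based index:

SELECT firstname, lastname,
JSON_LENGTH(children) AS num_children,
JSON_EXTRACT_STRING(children, 0) AS first_child
FROM employees
WHERE hasChildren = true;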

Other Useful Pipeline Commands

To troubleshoot an empty table, run the command below. Be sure to add the correct pipeline name.

SELECT * FROM information_schema.pipelines_errors
WHERE pipeline_name = '<pipeline_name>';
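
It can also be useful to check which files the pipeline has registered and whether each one has been loaded or skipped. The information_schema.PIPELINES_FILES view tracks per-file state:

SELECT file_name, file_state
FROM information_schema.PIPELINES_FILES
WHERE pipeline_name = '<pipeline_name>';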

Stopping a pipeline is useful when you want to prevent additional data from loading but plan to use the pipeline again. Make sure to include the correct pipeline name.

STOP PIPELINE <pipeline_name>;

Dropping a pipeline is useful when the pipeline was created for a one-off data load and will not be used in the future. Once a pipeline is dropped, it must be recreated in order to use it again.

DROP PIPELINE <pipeline_name>;
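
If the pipeline may have already been dropped, IF EXISTS prevents the statement from returning an error:

DROP PIPELINE IF EXISTS <pipeline_name>;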

Note

A pipeline does not need to be stopped in order to be dropped.
