Load Data in JSON Format from Amazon S3 Using a Wildcard
SingleStore Pipelines can read data from JSON-formatted files stored in an S3 bucket by using a wildcard (*).
Prerequisites
The following prerequisites need to be in place before creating a pipeline using a JSON wildcard.
- Two or more JSON files for upload to an S3 bucket.
- Access to the S3 bucket.
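The examples below also assume you are connected to a database that will hold the table. If needed, create one first (the database name emp_db here is illustrative):

CREATE DATABASE emp_db;
USE emp_db;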
Create a Table
Create a table using the following syntax:
CREATE TABLE employees (
  lastname TEXT CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  firstname TEXT CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  age INT NOT NULL,
  DOB TEXT CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  partner TEXT CHARACTER SET utf8 COLLATE utf8_general_ci,
  hasChildren BOOLEAN,
  children JSON COLLATE utf8_bin,
  SHARD KEY ()
);
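As a quick sanity check, you can confirm the table was created with the expected columns:

DESCRIBE employees;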
The contents of the two JSON files are shown below. Each file contains multiple JSON objects; the pipeline loads one row per object.
Contents of emp1:
{"lastname": "Bateman","firstname": "Patrick","DOB" : "October 23, 1982","age": "40","partner": "Sharon Bateman","hasChildren": true,"children":["Isabell", "Jackson"]}{"lastname": "Denbrough","firstname": "Bill","DOB" : "January 04, 1990","age": "33","partner": "Audra Phillips","hasChildren": true,"children":["Stephen", "Maya", "Hunter"]}{"lastname": "Torance","firstname": "Jack","DOB" : "June 20, 1980","age": "43","partner": "Wendy Torrance","hasChildren": true,"children":["Danny", "Lucy"]}{"lastname": "Strode","firstname": "Laurie","DOB" : "February 22, 1987","age": "36","partner": null,"hasChildren": true,"children":["Jamie Lloyd", "John Tate", "Karen Nelson"]}{"lastname": "Whateley","firstname": "Wilbur","DOB" : "February 02, 1970","age": "53","partner": null,"hasChildren": true,"children":["Wilbur Jr.", "Eleanor"]}
Contents of emp2:
{"lastname": "Karras","firstname": "Damien","DOB" : "April 12, 1973","age": "50","partner": null,"hasChildren": false,"children": null}{"lastname": "Wilkes","firstname": "Annie","DOB" : "April 01, 1972","age": "51","partner":null,"hasChildren": false,"children": null}{"lastname": "Cady","firstname": "Max","DOB" : "September 13, 1989","age": "33","partner": null,"hasChildren": false,"children": null}{"lastname": "White","firstname": "Carrie","DOB" : "August 25, 2000","age": "22","partner": null,"hasChildren": false,"children": null}{"lastname": "MacNeil","firstname": "Regan","DOB" : "April 07, 1985","age": "38","partner": "Orlando Knotts","hasChildren": false,"children": null}{"lastname": "Eriksson","firstname": "Oskar","DOB" : "May 29, 2001","age": "22","partner": null,"hasChildren": false,"children": null}{"lastname": "Grevers","firstname": "Nick","DOB" : "November 21, 1995","age": "28","partner" : "Sam Avery","hasChildren": false,"children": null}
Create a Pipeline to the S3 Bucket
Make sure to add the correct pipeline name, bucket name, folder name (if applicable), and AWS credentials.
CREATE PIPELINE emp_data AS
LOAD DATA S3 '<bucket_name>/<folder_name>/*.json'
CONFIG '{"region":"us-west-2"}'
CREDENTIALS '{"aws_access_key_id": "<xxxxxxxxxxxxxxx>",
              "aws_secret_access_key": "<xxxxxxxxxxxxxxx>"}'
BATCH_INTERVAL 2500
MAX_PARTITIONS_PER_BATCH 1
DISABLE OUT_OF_ORDER OPTIMIZATION
DISABLE OFFSETS METADATA GC
SKIP DUPLICATE KEY ERRORS
INTO TABLE employees
FORMAT JSON
(`lastname` <- `lastname` default '',
 `firstname` <- `firstname` default '',
 `age` <- `age` default -1,
 `DOB` <- `DOB` default -1,
 `partner` <- `partner` default '',
 `hasChildren` <- `hasChildren` default '',
 `children` <- `children` default '');
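Optionally, you can dry-run the pipeline before starting it. TEST PIPELINE reads from the source without writing to the destination table, which is a quick way to verify connectivity, credentials, and the column mappings:

TEST PIPELINE emp_data LIMIT 1;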
Start the pipeline.
START PIPELINE emp_data;
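To confirm the pipeline is running and to watch its batches complete, you can query the pipeline metadata views:

SHOW PIPELINES;

SELECT * FROM information_schema.pipelines_batches_summary
WHERE pipeline_name = 'emp_data';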
Verify the data has loaded.
SELECT * FROM employees;
+-----------+-----------+-----+--------------------+----------------+-------------+--------------------------------------------+
| lastname | firstname | age | DOB | partner | hasChildren | children |
+-----------+-----------+-----+--------------------+----------------+-------------+--------------------------------------------+
| Bateman | Patrick | 40 | October 23, 1982 | Sharon Bateman | 1 | ["Isabell","Jackson"] |
| Eriksson | Oskar | 22 | May 29, 2001 | NULL | 0 | NULL |
| MacNeil | Regan | 38 | April 07, 1985 | Orlando Knotts | 0 | NULL |
| Grevers | Nick | 28 | November 21, 1995 | Sam Avery | 0 | NULL |
| Karras | Damien | 50 | April 12, 1973 | NULL | 0 | NULL |
| Whateley | Wilbur | 53 | February 02, 1970 | NULL | 1 | ["Wilbur Jr.","Eleanor"] |
| Denbrough | Bill | 33 | January 04, 1990 | Audra Phillips | 1 | ["Stephen","Maya","Hunter"] |
| Torance | Jack | 43 | June 20, 1980 | Wendy Torrance | 1 | ["Danny","Lucy"] |
| Wilkes | Annie | 51 | April 01, 1972 | NULL | 0 | NULL |
| Cady | Max | 33 | September 13, 1989 | NULL | 0 | NULL |
| White | Carrie | 22 | August 25, 2000 | NULL | 0 | NULL |
| Strode | Laurie | 36 | February 22, 1987 | NULL | 1 | ["Jamie Lloyd","John Tate","Karen Nelson"] |
+-----------+-----------+-----+--------------------+----------------+-------------+--------------------------------------------+
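With the data loaded, the children JSON column can be queried directly. As an illustrative example, this query uses JSON_LENGTH to count the entries in each children array:

SELECT firstname, lastname, JSON_LENGTH(children) AS num_children
FROM employees
WHERE hasChildren = 1;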
Other Useful Pipeline Commands
To troubleshoot an empty table, run the command below. Be sure to add the correct pipeline name.
SELECT * FROM information_schema.pipelines_errors
WHERE pipeline_name = '<pipeline_name>';
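It can also help to confirm which files the pipeline has discovered and loaded; the pipelines_files metadata view tracks the state of each file:

SELECT * FROM information_schema.pipelines_files
WHERE pipeline_name = '<pipeline_name>';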
Stopping a pipeline is useful when you want to prevent additional data from loading but would like to use the pipeline again later.
STOP PIPELINE <pipeline_name>;
Dropping a pipeline is useful when the pipeline was created for a one-off data load and will not be used in the future.
DROP PIPELINE <pipeline_name>;
Note
A pipeline does not need to be stopped in order to be dropped.