Load Data from Parquet Files

Parquet formatted files can be loaded from a local file or by using a pipeline. The basic syntax is provided below:

Load Parquet Files from a Local Filesystem

Parquet formatted data stored on the local filesystem can be loaded using a LOAD DATA query. This streamlines the process of loading cloud-stored data into tables. Other LOAD DATA clauses (SET, WHERE, etc.) are supported (but not shown) in the following syntax examples.

Syntax for loading a Parquet file:

LOAD DATA INFILE '<path_to_file/file_name>'
INTO TABLE <table_name>
(val1 <- source1,
val2 <- source2
[ ... ]
) [COMPRESSION { AUTO | NONE | LZ4 | GZIP }]
[ ... ]
FORMAT PARQUET;

Example: Loading Parquet Files

Create a table:

CREATE TABLE assets(
Product_ID INT,
Category VARCHAR(50),
Model VARCHAR(50),
Price DECIMAL(6,2),
Employee_ID NUMERIC(5)
);

Create the LOAD DATA statement:

Verify the file's location and the column names are correct.

LOAD DATA INFILE '/tmp/assets.parquet'
INTO TABLE assets
(Product_ID <- Product_ID,
Category <- Category,
Model <- Model,
Price <- Price,
Employee_ID <- Employee_ID)
FORMAT PARQUET;

Verify the data loaded:

SELECT * FROM ASSETS;
+------------+------------+-----------------+---------+-------------+
| Product_ID | Category   | Model           | Price   | Employee_ID |
+------------+------------+-----------------+---------+-------------+
|         58 | cell_phone | iphone_12       |  879.00 |         410 |
|       1049 | laptop     | mac_book_pro    | 1999.00 |          14 |
|       1100 | laptop     | mac_book_pro    | 1999.00 |         298 |
|        210 | laptop     | mac_book_pro    | 2500.00 |         312 |
|        130 | cell_phone | iphone_13       |  699.00 |         110 |
|         99 | laptop     | mac_book_air_M1 |  999.00 |          75 |
|        140 | cell_phone | iphone_13_pro   |  999.00 |         263 |
|         49 | cell_phone | iphone_12       |  879.00 |         102 |
|       2037 | laptop     | mac_book_air_M2 | 1199.00 |         399 |
|        111 | laptop     | mac_book_pro    | 2100.00 |          89 |
+------------+------------+-----------------+---------+-------------+

Important

Column names must be specified and they are case-sensitive.

Load Parquet Files Using a Pipeline

Parquet formatted data stored in an AWS S3 bucket can be loaded using a LOAD DATA query with a pipeline.

Pipeline syntax:

LOAD DATA S3 '<bucket name>'
CONFIG '{"region" : "<region_name>"}' 
CREDENTIALS '{"aws_access_key_id" : "<key_id> ", 
             "aws_secret_access_key": "<access_key>"}' 
INTO TABLE <table_name>
       (`<col_a>` <- %, 
 `<col_b>` <- % DEFAULT NULL , 
  ) FORMAT PARQUET;

Example: Parquet Pipeline

This section will provide an example of creating a SQL table, a Parquet file for import, and a Parquet formatted pipeline.

Create a table with the following structure:

CREATE TABLE employees(
ID NUMERIC(5),
Last_Name VARCHAR(25),
First_Name VARCHAR(25),
Job_Title VARCHAR(25),
Department VARCHAR(25),
City VARCHAR(25),
State VARCHAR(2),
Email VARCHAR(50)
);

Install the pandas and pyarrow libraries for Parquet file support:

pip install pandas pyarrow

Create a file with data based on the table structure, be sure to save it as a Python file and use the correct path for the output:

import pandas as pd
# Provided data
data = [
('014', 'Bateman', 'Patrick', 'Prod_Mgr', 'prod_dev', 'NYC', 'NY', 'pbateman@worldtech.com'),
('102', 'Karras', 'Damien', 'Doctor', 'R&D', 'NYC', 'NY', 'dkarras@worldtech.com'),
('298', 'Denbrough', 'Bill', 'Salesperson', 'Sales', 'Bangor', 'ME', 'bdenbrough@worldtech.com'),
('399', 'Torrance', 'Jack', 'PR Dir', 'PR', 'Estes Park', 'CO', 'jtorrance@worldtech.com'),
('410', 'Wilkes', 'Annie', 'HR Mgr', 'HR', 'Silver Creek', 'CO', 'awilkes@worldtech.com'),
('110', 'Strode', 'Laurie', 'VP Sales', 'Sales', 'Haddonfield', 'IL', 'lsatrode@worldtech.com'),
('312', 'Cady', 'Max', 'IT Dir', 'IT', 'New Essex', 'FL', 'mcady@worldtech.com'),
('089', 'Whateley', 'Wilbur', 'CEO', 'Sen_Mgmt', 'Dunwich', 'MA', 'wwhateley@worldtech.com'),
('075', 'White', 'Carrie', 'Receptionist', 'HR', 'Chamberlain', 'ME', 'cwhite@worldtech.com'),
('263', 'MacNeil', 'Regan', 'R&D Mgr', 'R&D', 'Washington', 'DC', 'rmacneil@worldtech.com')
]
# Define column names
columns = ['ID', 'Last_Name', 'First_Name', 'Job_Title', 'Department', 'City', 'State', 'Email']
# Create a DataFrame
df = pd.DataFrame(data, columns=columns)
# Save DataFrame to Parquet file
df.to_parquet('/<path to file>/<file_name.parquet>', index=False)

Once the file is saved, change the directory to the location of the Python script and run the script to generate the Parquet file:

cd /<path/to/python-script/directory>
python <create_parquet_file>.py

Upload the newly created Parquet file up to an S3 bucket and create the pipeline:

CREATE PIPELINE emp_list AS LOAD DATA S3 's3://<path to file>/employee_data.parquet'
CONFIG '{"region":"us-west-2"}' CREDENTIALS '{"aws_access_key_id": "XXXXXXXXXX",
"aws_secret_access_key": "XXXXXXXXXX"}'
INTO TABLE employees
(`ID` <- ID,
`Last_Name` <- Last_Name,
`First_Name` <- First_Name,
`Job_Title` <- Job_Title,
`Department` <- Department,
`City` <- City,
`State` <- State,
`Email` <- Email)
FORMAT PARQUET;

Start the pipeline:

START PIPELINE emp_list;

Verify the data loaded correctly:

SELECT * FROM employees;
+------+-----------+------------+--------------+------------+--------------+-------+--------------------------+
| ID   | Last_Name | First_Name | Job_Title    | Department | City         | State | Email                    |
+------+-----------+------------+--------------+------------+--------------+-------+--------------------------+
|  263 | MacNeil   | Regan      | R&D Mgr      | R&D        | Washington   | DC    | rmacneil@worldtech.com   |
|  312 | Cady      | Max        | IT Dir       | IT         | New Essex    | FL    | mcady@worldtech.com      |
|   89 | Whateley  | Wilbur     | CEO          | Sen_Mgmt   | Dunwich      | MA    | wwhateley@worldtech.com  |
|  102 | Karras    | Damien     | Doctor       | R&D        | NYC          | NY    | dkarras@worldtech.com    |
|   14 | Bateman   | Patrick    | Prod_Mgr     | prod_dev   | NYC          | NY    | pbateman@worldtech.com   |
|  410 | Wilkes    | Annie      | HR Mgr       | HR         | Silver Creek | CO    | awilkes@worldtech.com    |
|   75 | White     | Carrie     | Receptionist | HR         | Chamberlain  | ME    | cwhite@worldtech.com     |
|  110 | Strode    | Laurie     | VP Sales     | Sales      | Haddonfield  | IL    | lsatrode@worldtech.com   |
|  298 | Denbrough | Bill       | Salesperson  | Sales      | Bangor       | ME    | bdenbrough@worldtech.com |
|  399 | Torrance  | Jack       | PR Dir       | PR         | Estes Park   | CO    | jtorrance@worldtech.com  |
+------+-----------+------------+--------------+------------+--------------+-------+--------------------------+

Important

Column names must be specified and they are case-sensitive.

Last modified: August 5, 2024

Was this article helpful?