Load Data from Parquet Files
On this page
Parquet formatted files can be loaded from a local file or by using a pipeline.
Load Parquet Files from a Local Filesystem
Parquet formatted data stored on the local filesystem can be loaded using a LOAD DATA
query.LOAD DATA
clauses (SET
, WHERE
, etc.
Syntax for loading a Parquet file:
LOAD DATA INFILE '<path_to_file/file_name>'INTO TABLE <table_name>(val1 <- source1,val2 <- source2[ ... ]) [COMPRESSION { AUTO | NONE | LZ4 | GZIP }][ ... ]FORMAT PARQUET;
Example: Loading Parquet Files
Create a table:
CREATE TABLE assets(Product_ID INT,Category VARCHAR(50),Model VARCHAR(50),Price DECIMAL(6,2),Employee_ID NUMERIC(5));
Create the LOAD DATA statement:
Verify the file's location and the column names are correct.
LOAD DATA INFILE '/tmp/assets.parquet'INTO TABLE assets(Product_ID <- Product_ID,Category <- Category,Model <- Model,Price <- Price,Employee_ID <- Employee_ID)FORMAT PARQUET;
Verify the data loaded:
SELECT * FROM ASSETS;
+------------+------------+-----------------+---------+-------------+
| Product_ID | Category | Model | Price | Employee_ID |
+------------+------------+-----------------+---------+-------------+
| 58 | cell_phone | iphone_12 | 879.00 | 410 |
| 1049 | laptop | mac_book_pro | 1999.00 | 14 |
| 1100 | laptop | mac_book_pro | 1999.00 | 298 |
| 210 | laptop | mac_book_pro | 2500.00 | 312 |
| 130 | cell_phone | iphone_13 | 699.00 | 110 |
| 99 | laptop | mac_book_air_M1 | 999.00 | 75 |
| 140 | cell_phone | iphone_13_pro | 999.00 | 263 |
| 49 | cell_phone | iphone_12 | 879.00 | 102 |
| 2037 | laptop | mac_book_air_M2 | 1199.00 | 399 |
| 111 | laptop | mac_book_pro | 2100.00 | 89 |
+------------+------------+-----------------+---------+-------------+
Important
Column names must be specified and they are case-sensitive.
Load Parquet Files Using a Pipeline
Parquet formatted data stored in an AWS S3 bucket can be loaded using a LOAD DATA
query with a pipeline.
Pipeline syntax:
LOAD DATA S3 '<bucket name>'CONFIG '{"region" : "<region_name>"}'CREDENTIALS '{"aws_access_key_id" : "<key_id> ","aws_secret_access_key": "<access_key>"}'INTO TABLE <table_name>(`<col_a>` <- %,`<col_b>` <- % DEFAULT NULL ,) FORMAT PARQUET;
Example: Parquet Pipeline
This section will provide an example of creating a SQL table, a Parquet file for import, and a Parquet formatted pipeline.
Create a table with the following structure:
CREATE TABLE employees(ID NUMERIC(5),Last_Name VARCHAR(25),First_Name VARCHAR(25),Job_Title VARCHAR(25),Department VARCHAR(25),City VARCHAR(25),State VARCHAR(2),Email VARCHAR(50));
Install the pandas
and pyarrow
libraries for Parquet file support:
pip install pandas pyarrow
Create a file with data based on the table structure, be sure to save it as a Python file and use the correct path for the output:
import pandas as pd# Provided datadata = [('014', 'Bateman', 'Patrick', 'Prod_Mgr', 'prod_dev', 'NYC', 'NY', 'pbateman@worldtech.com'),('102', 'Karras', 'Damien', 'Doctor', 'R&D', 'NYC', 'NY', 'dkarras@worldtech.com'),('298', 'Denbrough', 'Bill', 'Salesperson', 'Sales', 'Bangor', 'ME', 'bdenbrough@worldtech.com'),('399', 'Torrance', 'Jack', 'PR Dir', 'PR', 'Estes Park', 'CO', 'jtorrance@worldtech.com'),('410', 'Wilkes', 'Annie', 'HR Mgr', 'HR', 'Silver Creek', 'CO', 'awilkes@worldtech.com'),('110', 'Strode', 'Laurie', 'VP Sales', 'Sales', 'Haddonfield', 'IL', 'lsatrode@worldtech.com'),('312', 'Cady', 'Max', 'IT Dir', 'IT', 'New Essex', 'FL', 'mcady@worldtech.com'),('089', 'Whateley', 'Wilbur', 'CEO', 'Sen_Mgmt', 'Dunwich', 'MA', 'wwhateley@worldtech.com'),('075', 'White', 'Carrie', 'Receptionist', 'HR', 'Chamberlain', 'ME', 'cwhite@worldtech.com'),('263', 'MacNeil', 'Regan', 'R&D Mgr', 'R&D', 'Washington', 'DC', 'rmacneil@worldtech.com')]# Define column namescolumns = ['ID', 'Last_Name', 'First_Name', 'Job_Title', 'Department', 'City', 'State', 'Email']# Create a DataFramedf = pd.DataFrame(data, columns=columns)# Save DataFrame to Parquet filedf.to_parquet('/<path to file>/<file_name.parquet>', index=False)
Once the file is saved, change the directory to the location of the Python script and run the script to generate the Parquet file:
cd /<path/to/python-script/directory>
python <create_parquet_file>.py
Upload the newly created Parquet file up to an S3 bucket and create the pipeline:
CREATE PIPELINE emp_list AS LOAD DATA S3 's3://<path to file>/employee_data.parquet'CONFIG '{"region":"us-west-2"}' CREDENTIALS '{"aws_access_key_id": "XXXXXXXXXX","aws_secret_access_key": "XXXXXXXXXX"}'INTO TABLE employees(`ID` <- ID,`Last_Name` <- Last_Name,`First_Name` <- First_Name,`Job_Title` <- Job_Title,`Department` <- Department,`City` <- City,`State` <- State,`Email` <- Email)FORMAT PARQUET;
Start the pipeline:
START PIPELINE emp_list;
Verify the data loaded correctly:
SELECT * FROM employees;
+------+-----------+------------+--------------+------------+--------------+-------+--------------------------+
| ID | Last_Name | First_Name | Job_Title | Department | City | State | Email |
+------+-----------+------------+--------------+------------+--------------+-------+--------------------------+
| 263 | MacNeil | Regan | R&D Mgr | R&D | Washington | DC | rmacneil@worldtech.com |
| 312 | Cady | Max | IT Dir | IT | New Essex | FL | mcady@worldtech.com |
| 89 | Whateley | Wilbur | CEO | Sen_Mgmt | Dunwich | MA | wwhateley@worldtech.com |
| 102 | Karras | Damien | Doctor | R&D | NYC | NY | dkarras@worldtech.com |
| 14 | Bateman | Patrick | Prod_Mgr | prod_dev | NYC | NY | pbateman@worldtech.com |
| 410 | Wilkes | Annie | HR Mgr | HR | Silver Creek | CO | awilkes@worldtech.com |
| 75 | White | Carrie | Receptionist | HR | Chamberlain | ME | cwhite@worldtech.com |
| 110 | Strode | Laurie | VP Sales | Sales | Haddonfield | IL | lsatrode@worldtech.com |
| 298 | Denbrough | Bill | Salesperson | Sales | Bangor | ME | bdenbrough@worldtech.com |
| 399 | Torrance | Jack | PR Dir | PR | Estes Park | CO | jtorrance@worldtech.com |
+------+-----------+------------+--------------+------------+--------------+-------+--------------------------+
Important
Column names must be specified and they are case-sensitive.
Last modified: August 5, 2024