Load Data from Parquet Files

Load Parquet Files from a Local Filesystem

Parquet formatted data stored on the local filesystem can be loaded using a LOAD DATA query. This streamlines the process of loading cloud-stored data into tables. Other LOAD DATA clauses (SET, WHERE, etc.) are supported (but not shown) in the following syntax examples.

Syntax for loading a Parquet file:

SQL

LOAD DATA INFILE '<path_to_file/file_name>'
INTO TABLE <table_name>
    (val1 <- source1, 
     val2 <- source2
     [ ... ]
) [COMPRESSION { AUTO | NONE | LZ4 | GZIP }]
[ ... ]
FORMAT PARQUET;

Example: Loading Parquet Files

Create a table:

SQL

CREATE TABLE assets(
Product_ID INT,
Category VARCHAR(50),
Model VARCHAR(50),
Price DECIMAL(6,2),
Employee_ID NUMERIC(5)
);

Create the LOAD DATA statement:

Verify the file's location and the column names are correct.

SQL

LOAD DATA INFILE '/tmp/assets.parquet'
INTO TABLE assets
    (Product_ID <- Product_ID,
    Category <- Category,
    Model <- Model,
    Price <- Price,
    Employee_ID <- Employee_ID)
FORMAT PARQUET;

Verify the data loaded:

SQL

SELECT * FROM ASSETS;

+------------+------------+-----------------+---------+-------------+
| Product_ID | Category   | Model           | Price   | Employee_ID |
+------------+------------+-----------------+---------+-------------+
|         58 | cell_phone | iphone_12       |  879.00 |         410 |
|       1049 | laptop     | mac_book_pro    | 1999.00 |          14 |
|       1100 | laptop     | mac_book_pro    | 1999.00 |         298 |
|        210 | laptop     | mac_book_pro    | 2500.00 |         312 |
|        130 | cell_phone | iphone_13       |  699.00 |         110 |
|         99 | laptop     | mac_book_air_M1 |  999.00 |          75 |
|        140 | cell_phone | iphone_13_pro   |  999.00 |         263 |
|         49 | cell_phone | iphone_12       |  879.00 |         102 |
|       2037 | laptop     | mac_book_air_M2 | 1199.00 |         399 |
|        111 | laptop     | mac_book_pro    | 2100.00 |          89 |
+------------+------------+-----------------+---------+-------------+

Important

Column names must be specified and they are case-sensitive.

Load Parquet Files Using a Pipeline

Parquet formatted data stored in an AWS S3 bucket can be loaded using a LOAD DATA query with a pipeline.

Pipeline syntax:

SQL

LOAD DATA S3 '<bucket name>'
CONFIG '{"region" : "<region_name>"}' 
CREDENTIALS '{"aws_access_key_id" : "<key_id> ", 
             "aws_secret_access_key": "<access_key>"}' 
INTO TABLE <table_name>
       (`<col_a>` <- %, 
 `<col_b>` <- % DEFAULT NULL , 
  ) FORMAT PARQUET;

Example: Parquet Pipeline

This section will provide an example of creating a SQL table, a Parquet file for import, and a Parquet formatted pipeline.

Create a table with the following structure:

SQL

CREATE TABLE employees(
ID NUMERIC(5),
Last_Name VARCHAR(25),
First_Name VARCHAR(25),
Job_Title VARCHAR(25),
Department VARCHAR(25),
City VARCHAR(25),
State VARCHAR(2),
Email VARCHAR(50)
);

Install the pandas and pyarrow libraries for Parquet file support:

Shell

pip install pandas pyarrow

Create a file with data based on the table structure, be sure to save it as a Python file and use the correct path for the output:

import pandas as pd

# Provided data
data = [
    ('014', 'Bateman', 'Patrick', 'Prod_Mgr', 'prod_dev', 'NYC', 'NY', 'pbateman@worldtech.com'),
    ('102', 'Karras', 'Damien', 'Doctor', 'R&D', 'NYC', 'NY', 'dkarras@worldtech.com'),
    ('298', 'Denbrough', 'Bill', 'Salesperson', 'Sales', 'Bangor', 'ME', 'bdenbrough@worldtech.com'),
    ('399', 'Torrance', 'Jack', 'PR Dir', 'PR', 'Estes Park', 'CO', 'jtorrance@worldtech.com'),
    ('410', 'Wilkes', 'Annie', 'HR Mgr', 'HR', 'Silver Creek', 'CO', 'awilkes@worldtech.com'),
    ('110', 'Strode', 'Laurie', 'VP Sales', 'Sales', 'Haddonfield', 'IL', 'lsatrode@worldtech.com'),
    ('312', 'Cady', 'Max', 'IT Dir', 'IT', 'New Essex', 'FL', 'mcady@worldtech.com'),
    ('089', 'Whateley', 'Wilbur', 'CEO', 'Sen_Mgmt', 'Dunwich', 'MA', 'wwhateley@worldtech.com'),
    ('075', 'White', 'Carrie', 'Receptionist', 'HR', 'Chamberlain', 'ME', 'cwhite@worldtech.com'),
    ('263', 'MacNeil', 'Regan', 'R&D Mgr', 'R&D', 'Washington', 'DC', 'rmacneil@worldtech.com')
]

# Define column names
columns = ['ID', 'Last_Name', 'First_Name', 'Job_Title', 'Department', 'City', 'State', 'Email']

# Create a DataFrame
df = pd.DataFrame(data, columns=columns)

# Save DataFrame to Parquet file
df.to_parquet('/<path to file>/<file_name.parquet>', index=False)

Once the file is saved, change the directory to the location of the Python script and run the script to generate the Parquet file:

Shell

cd /<path/to/python-script/directory>

Shell

python <create_parquet_file>.py

Upload the newly created Parquet file up to an S3 bucket and create the pipeline:

SQL

 CREATE PIPELINE emp_list AS LOAD DATA S3 's3://<path to file>/employee_data.parquet' 
 CONFIG '{"region":"us-west-2"}' CREDENTIALS '{"aws_access_key_id": "XXXXXXXXXX", 
  "aws_secret_access_key": "XXXXXXXXXX"}' 
  INTO TABLE employees 
  (`ID` <- ID,  
  `Last_Name` <- Last_Name,  
  `First_Name` <- First_Name,  
  `Job_Title` <- Job_Title, 
  `Department` <- Department, 
  `City` <- City,
  `State` <- State,  
  `Email` <- Email) 
  FORMAT PARQUET;

Start the pipeline:

SQL

START PIPELINE emp_list;

Verify the data loaded correctly:

SQL

SELECT * FROM employees;

+------+-----------+------------+--------------+------------+--------------+-------+--------------------------+
| ID   | Last_Name | First_Name | Job_Title    | Department | City         | State | Email                    |
+------+-----------+------------+--------------+------------+--------------+-------+--------------------------+
|  263 | MacNeil   | Regan      | R&D Mgr      | R&D        | Washington   | DC    | rmacneil@worldtech.com   |
|  312 | Cady      | Max        | IT Dir       | IT         | New Essex    | FL    | mcady@worldtech.com      |
|   89 | Whateley  | Wilbur     | CEO          | Sen_Mgmt   | Dunwich      | MA    | wwhateley@worldtech.com  |
|  102 | Karras    | Damien     | Doctor       | R&D        | NYC          | NY    | dkarras@worldtech.com    |
|   14 | Bateman   | Patrick    | Prod_Mgr     | prod_dev   | NYC          | NY    | pbateman@worldtech.com   |
|  410 | Wilkes    | Annie      | HR Mgr       | HR         | Silver Creek | CO    | awilkes@worldtech.com    |
|   75 | White     | Carrie     | Receptionist | HR         | Chamberlain  | ME    | cwhite@worldtech.com     |
|  110 | Strode    | Laurie     | VP Sales     | Sales      | Haddonfield  | IL    | lsatrode@worldtech.com   |
|  298 | Denbrough | Bill       | Salesperson  | Sales      | Bangor       | ME    | bdenbrough@worldtech.com |
|  399 | Torrance  | Jack       | PR Dir       | PR         | Estes Park   | CO    | jtorrance@worldtech.com  |
+------+-----------+------------+--------------+------------+--------------+-------+--------------------------+

Important

Column names must be specified and they are case-sensitive.

Load Data from Parquet Files

On this page

Load Parquet Files from a Local Filesystem

Example: Loading Parquet Files

Load Parquet Files Using a Pipeline

Example: Parquet Pipeline

Was this article helpful?

On this page

Was this article helpful?