Load Data from Parquet Files
Warning
SingleStore 9.0 gives you the opportunity to preview, evaluate, and provide feedback on new and upcoming features prior to their general availability. In the interim, SingleStore 8.9 is recommended for production workloads; it can later be upgraded to SingleStore 9.0.
Parquet formatted files can be loaded from a local file or by using a pipeline.
Load Parquet Files from a Local Filesystem
Parquet formatted data stored on the local filesystem can be loaded using a LOAD DATA query. The standard LOAD DATA clauses (SET, WHERE, etc.) are also supported when loading Parquet files; a sketch that uses them follows the syntax below.
Syntax for loading a Parquet file:
LOAD DATA INFILE '<path_to_file/file_name>'
INTO TABLE <table_name>
(val1 <- source1,
 val2 <- source2
 [ ... ])
[COMPRESSION { AUTO | NONE | LZ4 | GZIP }]
[ ... ]
FORMAT PARQUET;
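For example, the optional clauses can transform or filter rows as they are loaded. The following is a minimal sketch with hypothetical file, table, and column names: the source price field is read into a user variable, rewritten with SET, and rows without an id are skipped with WHERE.

LOAD DATA INFILE '/tmp/example.parquet'
INTO TABLE example_table
(id <- id,
 @price <- price)
FORMAT PARQUET
SET price = @price * 0.9    -- transform the value while loading
WHERE id IS NOT NULL;       -- skip rows with a NULL id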
Example: Loading Parquet Files
Create a table:
CREATE TABLE assets(
  Product_ID INT,
  Category VARCHAR(50),
  Model VARCHAR(50),
  Price DECIMAL(6,2),
  Employee_ID NUMERIC(5));
Create the LOAD DATA statement:
Verify that the file location and column names are correct.
LOAD DATA INFILE '/tmp/assets.parquet'
INTO TABLE assets
(Product_ID <- Product_ID,
 Category <- Category,
 Model <- Model,
 Price <- Price,
 Employee_ID <- Employee_ID)
FORMAT PARQUET;
Verify the data loaded:
SELECT * FROM assets;
+------------+------------+-----------------+---------+-------------+
| Product_ID | Category | Model | Price | Employee_ID |
+------------+------------+-----------------+---------+-------------+
| 58 | cell_phone | iphone_12 | 879.00 | 410 |
| 1049 | laptop | mac_book_pro | 1999.00 | 14 |
| 1100 | laptop | mac_book_pro | 1999.00 | 298 |
| 210 | laptop | mac_book_pro | 2500.00 | 312 |
| 130 | cell_phone | iphone_13 | 699.00 | 110 |
| 99 | laptop | mac_book_air_M1 | 999.00 | 75 |
| 140 | cell_phone | iphone_13_pro | 999.00 | 263 |
| 49 | cell_phone | iphone_12 | 879.00 | 102 |
| 2037 | laptop | mac_book_air_M2 | 1199.00 | 399 |
| 111 | laptop | mac_book_pro | 2100.00 | 89 |
+------------+------------+-----------------+---------+-------------+
Important
Column names must be specified and they are case-sensitive.
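In the column mapping, the name to the left of <- is the destination table column and the name to the right is the field in the Parquet file, so the two can differ. The following is a minimal sketch assuming a hypothetical file whose fields use lowercase names while the assets table keeps the names shown above:

LOAD DATA INFILE '/tmp/assets_lowercase.parquet'
INTO TABLE assets
(Product_ID <- product_id,   -- table column <- Parquet field
 Category <- category,
 Model <- model,
 Price <- price,
 Employee_ID <- employee_id)
FORMAT PARQUET;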
Load Parquet Files Using a Pipeline
Parquet formatted data stored in an AWS S3 bucket can be loaded using a LOAD DATA query with a pipeline.
Pipeline syntax:
CREATE PIPELINE <pipeline_name> AS
LOAD DATA S3 '<bucket_name>'
CONFIG '{"region" : "<region_name>"}'
CREDENTIALS '{"aws_access_key_id" : "<key_id>",
              "aws_secret_access_key" : "<access_key>"}'
INTO TABLE <table_name>
(`<col_a>` <- %,
 `<col_b>` <- % DEFAULT NULL)
FORMAT PARQUET;
Example: Parquet Pipeline
This section provides an example of creating a SQL table, generating a Parquet file for import, and loading that file with a pipeline.
Create a table with the following structure:
CREATE TABLE employees(
  ID NUMERIC(5),
  Last_Name VARCHAR(25),
  First_Name VARCHAR(25),
  Job_Title VARCHAR(25),
  Department VARCHAR(25),
  City VARCHAR(25),
  State VARCHAR(2),
  Email VARCHAR(50));
Install the pandas and pyarrow libraries for Parquet file support:
pip install pandas pyarrow
Create a Python script that generates data based on the table structure. Be sure to save it with a .py extension and use the correct output path for the Parquet file:
import pandas as pd

# Provided data
data = [('014', 'Bateman', 'Patrick', 'Prod_Mgr', 'prod_dev', 'NYC', 'NY', 'pbateman@worldtech.com'),
        ('102', 'Karras', 'Damien', 'Doctor', 'R&D', 'NYC', 'NY', 'dkarras@worldtech.com'),
        ('298', 'Denbrough', 'Bill', 'Salesperson', 'Sales', 'Bangor', 'ME', 'bdenbrough@worldtech.com'),
        ('399', 'Torrance', 'Jack', 'PR Dir', 'PR', 'Estes Park', 'CO', 'jtorrance@worldtech.com'),
        ('410', 'Wilkes', 'Annie', 'HR Mgr', 'HR', 'Silver Creek', 'CO', 'awilkes@worldtech.com'),
        ('110', 'Strode', 'Laurie', 'VP Sales', 'Sales', 'Haddonfield', 'IL', 'lsatrode@worldtech.com'),
        ('312', 'Cady', 'Max', 'IT Dir', 'IT', 'New Essex', 'FL', 'mcady@worldtech.com'),
        ('089', 'Whateley', 'Wilbur', 'CEO', 'Sen_Mgmt', 'Dunwich', 'MA', 'wwhateley@worldtech.com'),
        ('075', 'White', 'Carrie', 'Receptionist', 'HR', 'Chamberlain', 'ME', 'cwhite@worldtech.com'),
        ('263', 'MacNeil', 'Regan', 'R&D Mgr', 'R&D', 'Washington', 'DC', 'rmacneil@worldtech.com')]

# Define column names
columns = ['ID', 'Last_Name', 'First_Name', 'Job_Title', 'Department', 'City', 'State', 'Email']

# Create a DataFrame
df = pd.DataFrame(data, columns=columns)

# Save DataFrame to Parquet file
df.to_parquet('/<path to file>/<file_name.parquet>', index=False)
Once the file is saved, change to the directory containing the Python script and run it to generate the Parquet file:
cd /<path/to/python-script/directory>
python <create_parquet_file>.py
Upload the newly created Parquet file to an S3 bucket and create the pipeline:
CREATE PIPELINE emp_list AS
LOAD DATA S3 's3://<path to file>/employee_data.parquet'
CONFIG '{"region":"us-west-2"}'
CREDENTIALS '{"aws_access_key_id": "XXXXXXXXXX",
              "aws_secret_access_key": "XXXXXXXXXX"}'
INTO TABLE employees
(`ID` <- ID,
 `Last_Name` <- Last_Name,
 `First_Name` <- First_Name,
 `Job_Title` <- Job_Title,
 `Department` <- Department,
 `City` <- City,
 `State` <- State,
 `Email` <- Email)
FORMAT PARQUET;
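Optionally, before starting the pipeline, run it against a small sample of the source data to confirm the column mapping and credentials work as expected. A minimal sketch using the emp_list pipeline created above:

TEST PIPELINE emp_list LIMIT 1;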
Start the pipeline:
START PIPELINE emp_list;
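Pipelines ingest data asynchronously, so it can be useful to check the pipeline's state and any ingest errors before querying the table. A short sketch using the pipeline metadata SingleStore exposes:

-- Show pipelines in the current database and their state
SHOW PIPELINES;

-- Check for errors recorded while ingesting files
SELECT * FROM information_schema.PIPELINES_ERRORS
WHERE PIPELINE_NAME = 'emp_list';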
Verify the data loaded correctly:
SELECT * FROM employees;
+------+-----------+------------+--------------+------------+--------------+-------+--------------------------+
| ID | Last_Name | First_Name | Job_Title | Department | City | State | Email |
+------+-----------+------------+--------------+------------+--------------+-------+--------------------------+
| 263 | MacNeil | Regan | R&D Mgr | R&D | Washington | DC | rmacneil@worldtech.com |
| 312 | Cady | Max | IT Dir | IT | New Essex | FL | mcady@worldtech.com |
| 89 | Whateley | Wilbur | CEO | Sen_Mgmt | Dunwich | MA | wwhateley@worldtech.com |
| 102 | Karras | Damien | Doctor | R&D | NYC | NY | dkarras@worldtech.com |
| 14 | Bateman | Patrick | Prod_Mgr | prod_dev | NYC | NY | pbateman@worldtech.com |
| 410 | Wilkes | Annie | HR Mgr | HR | Silver Creek | CO | awilkes@worldtech.com |
| 75 | White | Carrie | Receptionist | HR | Chamberlain | ME | cwhite@worldtech.com |
| 110 | Strode | Laurie | VP Sales | Sales | Haddonfield | IL | lsatrode@worldtech.com |
| 298 | Denbrough | Bill | Salesperson | Sales | Bangor | ME | bdenbrough@worldtech.com |
| 399 | Torrance | Jack | PR Dir | PR | Estes Park | CO | jtorrance@worldtech.com |
+------+-----------+------------+--------------+------------+--------------+-------+--------------------------+
Important
Column names must be specified and they are case-sensitive.
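When the example is complete, the pipeline can be stopped and removed; the employees table and its rows are not affected:

STOP PIPELINE emp_list;
DROP PIPELINE emp_list;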
Last modified: March 26, 2025