Load Data from Parquet Files

Parquet formatted files can be loaded from the local filesystem or by using a pipeline. The basic syntax for each method is provided below.

Load Parquet Files from a Local Filesystem

Parquet formatted data stored on the local filesystem can be loaded using a LOAD DATA query. This streamlines the process of loading locally stored data into tables. Other LOAD DATA clauses (SET, WHERE, etc.) are also supported; they are omitted from the syntax template below, but a brief sketch follows it.

Syntax for loading a Parquet file:

LOAD DATA INFILE '<path_to_file/file_name>'
INTO TABLE <table_name>
(val1 <- source1,
val2 <- source2
[ ... ]
) [COMPRESSION { AUTO | NONE | LZ4 | GZIP }]
[ ... ]
FORMAT PARQUET;
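
As mentioned above, clauses such as SET and WHERE can be combined with a Parquet load. The following is a minimal sketch rather than part of the worked example: it assumes a hypothetical products table and /tmp/products.parquet file, maps one field into a user-defined variable (assuming variable mappings are supported for Parquet as they are for other formats), and places the extra clauses in the [ ... ] position shown in the syntax template above.

LOAD DATA INFILE '/tmp/products.parquet'
INTO TABLE products
(Product_ID <- Product_ID,
 Category <- Category,
 @price <- Price)
SET Price = @price * 0.9      -- hypothetical: apply a 10% discount while loading
WHERE Category = 'laptop'     -- hypothetical: load only laptop rows
FORMAT PARQUET;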

Example: Loading Parquet Files

Create a table:

CREATE TABLE assets(
Product_ID INT,
Category VARCHAR(50),
Model VARCHAR(50),
Price DECIMAL(6,2),
Employee_ID NUMERIC(5)
);

Create the LOAD DATA statement:

Verify that the file's location and the column names are correct.

LOAD DATA INFILE '/tmp/assets.parquet'
INTO TABLE assets
(Product_ID <- Product_ID,
Category <- Category,
Model <- Model,
Price <- Price,
Employee_ID <- Employee_ID)
FORMAT PARQUET;

Verify the data loaded:

SELECT * FROM assets;
+------------+------------+-----------------+---------+-------------+
| Product_ID | Category   | Model           | Price   | Employee_ID |
+------------+------------+-----------------+---------+-------------+
|         58 | cell_phone | iphone_12       |  879.00 |         410 |
|       1049 | laptop     | mac_book_pro    | 1999.00 |          14 |
|       1100 | laptop     | mac_book_pro    | 1999.00 |         298 |
|        210 | laptop     | mac_book_pro    | 2500.00 |         312 |
|        130 | cell_phone | iphone_13       |  699.00 |         110 |
|         99 | laptop     | mac_book_air_M1 |  999.00 |          75 |
|        140 | cell_phone | iphone_13_pro   |  999.00 |         263 |
|         49 | cell_phone | iphone_12       |  879.00 |         102 |
|       2037 | laptop     | mac_book_air_M2 | 1199.00 |         399 |
|        111 | laptop     | mac_book_pro    | 2100.00 |          89 |
+------------+------------+-----------------+---------+-------------+

Important

Column names must be specified and they are case-sensitive.
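
For instance, if the fields inside the Parquet file were written in lowercase, the right-hand side of each mapping would need to use that exact lowercase spelling. The following is a hypothetical sketch (the example file above uses field names that already match the table columns):

LOAD DATA INFILE '/tmp/assets.parquet'
INTO TABLE assets
(Product_ID <- product_id,     -- hypothetical: Parquet field names are lowercase
 Category <- category,
 Model <- model,
 Price <- price,
 Employee_ID <- employee_id)
FORMAT PARQUET;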

Load Parquet Files Using a Pipeline

Parquet formatted data stored in an AWS S3 bucket can be loaded using a LOAD DATA query with a pipeline.

Pipeline syntax:

CREATE PIPELINE <pipeline_name> AS
LOAD DATA S3 '<bucket_name>'
CONFIG '{"region" : "<region_name>"}'
CREDENTIALS '{"aws_access_key_id" : "<key_id>",
              "aws_secret_access_key": "<access_key>"}'
INTO TABLE <table_name>
(`<col_a>` <- %,
 `<col_b>` <- % DEFAULT NULL
) FORMAT PARQUET;

Example: Parquet Pipeline

This section provides an example of creating a SQL table, a Parquet file for import, and a Parquet formatted pipeline.

Create a table with the following structure:

CREATE TABLE employees(
ID NUMERIC(5),
Last_Name VARCHAR(25),
First_Name VARCHAR(25),
Job_Title VARCHAR(25),
Department VARCHAR(25),
City VARCHAR(25),
State VARCHAR(2),
Email VARCHAR(50)
);

Install the pandas and pyarrow libraries for Parquet file support:

pip install pandas pyarrow

Create a script that generates data based on the table structure. Be sure to save it as a Python file and use the correct path for the Parquet output file:

import pandas as pd
# Provided data
data = [
('014', 'Bateman', 'Patrick', 'Prod_Mgr', 'prod_dev', 'NYC', 'NY', 'pbateman@worldtech.com'),
('102', 'Karras', 'Damien', 'Doctor', 'R&D', 'NYC', 'NY', 'dkarras@worldtech.com'),
('298', 'Denbrough', 'Bill', 'Salesperson', 'Sales', 'Bangor', 'ME', 'bdenbrough@worldtech.com'),
('399', 'Torrance', 'Jack', 'PR Dir', 'PR', 'Estes Park', 'CO', 'jtorrance@worldtech.com'),
('410', 'Wilkes', 'Annie', 'HR Mgr', 'HR', 'Silver Creek', 'CO', 'awilkes@worldtech.com'),
('110', 'Strode', 'Laurie', 'VP Sales', 'Sales', 'Haddonfield', 'IL', 'lsatrode@worldtech.com'),
('312', 'Cady', 'Max', 'IT Dir', 'IT', 'New Essex', 'FL', 'mcady@worldtech.com'),
('089', 'Whateley', 'Wilbur', 'CEO', 'Sen_Mgmt', 'Dunwich', 'MA', 'wwhateley@worldtech.com'),
('075', 'White', 'Carrie', 'Receptionist', 'HR', 'Chamberlain', 'ME', 'cwhite@worldtech.com'),
('263', 'MacNeil', 'Regan', 'R&D Mgr', 'R&D', 'Washington', 'DC', 'rmacneil@worldtech.com')
]
# Define column names
columns = ['ID', 'Last_Name', 'First_Name', 'Job_Title', 'Department', 'City', 'State', 'Email']
# Create a DataFrame
df = pd.DataFrame(data, columns=columns)
# Save DataFrame to Parquet file
df.to_parquet('/<path to file>/<file_name.parquet>', index=False)

Once the script is saved, change to the directory containing the Python script and run it to generate the Parquet file:

cd /<path/to/python-script/directory>
python <create_parquet_file>.py

Upload the newly created Parquet file to an S3 bucket and create the pipeline:

CREATE PIPELINE emp_list AS LOAD DATA S3 's3://<path to file>/employee_data.parquet'
CONFIG '{"region":"us-west-2"}' CREDENTIALS '{"aws_access_key_id": "XXXXXXXXXX",
"aws_secret_access_key": "XXXXXXXXXX"}'
INTO TABLE employees
(`ID` <- ID,
`Last_Name` <- Last_Name,
`First_Name` <- First_Name,
`Job_Title` <- Job_Title,
`Department` <- Department,
`City` <- City,
`State` <- State,
`Email` <- Email)
FORMAT PARQUET;
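
Optionally, preview what the pipeline would ingest before starting it. This is a minimal sketch assuming the TEST PIPELINE command is available in your SingleStore version; it reads from the source without writing rows to the table:

TEST PIPELINE emp_list LIMIT 5;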

Start the pipeline:

START PIPELINE emp_list;
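
To check that the pipeline is running and whether any rows failed to load, query the built-in pipeline metadata. A minimal sketch, assuming the standard SHOW PIPELINES command and the information_schema.PIPELINES_ERRORS view:

SHOW PIPELINES;
-- List any errors recorded for this pipeline during ingestion
SELECT * FROM information_schema.PIPELINES_ERRORS
WHERE PIPELINE_NAME = 'emp_list';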

Verify the data loaded correctly:

SELECT * FROM employees;
+------+-----------+------------+--------------+------------+--------------+-------+--------------------------+
| ID   | Last_Name | First_Name | Job_Title    | Department | City         | State | Email                    |
+------+-----------+------------+--------------+------------+--------------+-------+--------------------------+
|  263 | MacNeil   | Regan      | R&D Mgr      | R&D        | Washington   | DC    | rmacneil@worldtech.com   |
|  312 | Cady      | Max        | IT Dir       | IT         | New Essex    | FL    | mcady@worldtech.com      |
|   89 | Whateley  | Wilbur     | CEO          | Sen_Mgmt   | Dunwich      | MA    | wwhateley@worldtech.com  |
|  102 | Karras    | Damien     | Doctor       | R&D        | NYC          | NY    | dkarras@worldtech.com    |
|   14 | Bateman   | Patrick    | Prod_Mgr     | prod_dev   | NYC          | NY    | pbateman@worldtech.com   |
|  410 | Wilkes    | Annie      | HR Mgr       | HR         | Silver Creek | CO    | awilkes@worldtech.com    |
|   75 | White     | Carrie     | Receptionist | HR         | Chamberlain  | ME    | cwhite@worldtech.com     |
|  110 | Strode    | Laurie     | VP Sales     | Sales      | Haddonfield  | IL    | lsatrode@worldtech.com   |
|  298 | Denbrough | Bill       | Salesperson  | Sales      | Bangor       | ME    | bdenbrough@worldtech.com |
|  399 | Torrance  | Jack       | PR Dir       | PR         | Estes Park   | CO    | jtorrance@worldtech.com  |
+------+-----------+------------+--------------+------------+--------------+-------+--------------------------+
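
If the pipeline was created for a one-time load, it can be stopped and removed once the data is verified. A minimal sketch, assuming the standard pipeline management commands:

STOP PIPELINE emp_list;
DROP PIPELINE emp_list;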

Important

Column names must be specified and they are case-sensitive.
