Load TPC-H Data into SingleStore

This tutorial shows how to load the TPC-H dataset into your SingleStore deployment using Pipelines and then query it to obtain business-critical metrics. Pipelines enable you to extract, transform, and load external data without needing third-party tools or middleware.

What is TPC-H?

The TPC Benchmark™H (TPC-H) is a decision support benchmark that consists of a suite of business-oriented ad hoc queries and concurrent data modifications. The queries and data have been chosen for their broad industry-wide relevance. The benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and answer critical business questions.

Prerequisites

Connect to a SingleStore Helios deployment or a SingleStore instance deployed on a self-managed cluster running on AWS.

Use a Starter workspace for development environments.

Create Database and Tables

Connect to your SingleStore Helios deployment and run the following SQL commands to create a database named tpch and eight associated tables.

For SingleStore Helios deployments, you can also run these commands in the SQL Editor: in the Cloud Portal, select Develop > Data Studio > SQL Editor.

Note

The SQL Editor runs only the queries that you have selected, so make sure all of the queries are selected before you select Run.

DROP DATABASE IF EXISTS tpch;
CREATE DATABASE tpch;
USE tpch;
CREATE TABLE `customer` (
`c_custkey` int(11) NOT NULL,
`c_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`c_address` varchar(40) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`c_nationkey` int(11) NOT NULL,
`c_phone` char(15) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`c_acctbal` decimal(15,2) NOT NULL,
`c_mktsegment` char(10) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`c_comment` varchar(117) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
UNIQUE KEY pk (`c_custkey`) UNENFORCED RELY,
SHARD KEY (`c_custkey`) USING CLUSTERED COLUMNSTORE
);
CREATE TABLE `lineitem` (
`l_orderkey` bigint(11) NOT NULL,
`l_partkey` int(11) NOT NULL,
`l_suppkey` int(11) NOT NULL,
`l_linenumber` int(11) NOT NULL,
`l_quantity` decimal(15,2) NOT NULL,
`l_extendedprice` decimal(15,2) NOT NULL,
`l_discount` decimal(15,2) NOT NULL,
`l_tax` decimal(15,2) NOT NULL,
`l_returnflag` char(1) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`l_linestatus` char(1) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`l_shipdate` date NOT NULL,
`l_commitdate` date NOT NULL,
`l_receiptdate` date NOT NULL,
`l_shipinstruct` char(25) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`l_shipmode` char(10) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`l_comment` varchar(44) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
UNIQUE KEY pk (`l_orderkey`, `l_linenumber`) UNENFORCED RELY,
SHARD KEY (`l_orderkey`) USING CLUSTERED COLUMNSTORE
);
CREATE TABLE `nation` (
`n_nationkey` int(11) NOT NULL,
`n_name` char(25) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`n_regionkey` int(11) NOT NULL,
`n_comment` varchar(152) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
UNIQUE KEY pk (`n_nationkey`) UNENFORCED RELY,
SHARD KEY (`n_nationkey`) USING CLUSTERED COLUMNSTORE
);
CREATE TABLE `orders` (
`o_orderkey` bigint(11) NOT NULL,
`o_custkey` int(11) NOT NULL,
`o_orderstatus` char(1) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`o_totalprice` decimal(15,2) NOT NULL,
`o_orderdate` date NOT NULL,
`o_orderpriority` char(15) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`o_clerk` char(15) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`o_shippriority` int(11) NOT NULL,
`o_comment` varchar(79) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
UNIQUE KEY pk (`o_orderkey`) UNENFORCED RELY,
SHARD KEY (`o_orderkey`) USING CLUSTERED COLUMNSTORE
);
CREATE TABLE `part` (
`p_partkey` int(11) NOT NULL,
`p_name` varchar(55) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`p_mfgr` char(25) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`p_brand` char(10) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`p_type` varchar(25) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`p_size` int(11) NOT NULL,
`p_container` char(10) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`p_retailprice` decimal(15,2) NOT NULL,
`p_comment` varchar(23) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
UNIQUE KEY pk (`p_partkey`) UNENFORCED RELY,
SHARD KEY (`p_partkey`) USING CLUSTERED COLUMNSTORE
);
CREATE TABLE `partsupp` (
`ps_partkey` int(11) NOT NULL,
`ps_suppkey` int(11) NOT NULL,
`ps_availqty` int(11) NOT NULL,
`ps_supplycost` decimal(15,2) NOT NULL,
`ps_comment` varchar(199) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
UNIQUE KEY pk (`ps_partkey`,`ps_suppkey`) UNENFORCED RELY,
SHARD KEY(`ps_partkey`),
KEY (`ps_partkey`,`ps_suppkey`) USING CLUSTERED COLUMNSTORE
);
CREATE TABLE `region` (
`r_regionkey` int(11) NOT NULL,
`r_name` char(25) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`r_comment` varchar(152) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
UNIQUE KEY pk (`r_regionkey`) UNENFORCED RELY,
SHARD KEY (`r_regionkey`) USING CLUSTERED COLUMNSTORE
);
CREATE TABLE `supplier` (
`s_suppkey` int(11) NOT NULL,
`s_name` char(25) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`s_address` varchar(40) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`s_nationkey` int(11) NOT NULL,
`s_phone` char(15) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`s_acctbal` decimal(15,2) NOT NULL,
`s_comment` varchar(101) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
UNIQUE KEY pk (`s_suppkey`) UNENFORCED RELY,
SHARD KEY (`s_suppkey`) USING CLUSTERED COLUMNSTORE
);
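
To confirm that the schema was created as expected, you can list the tables in the new database. This optional check is not part of the original tutorial; the output should show the eight tables created above.

SHOW TABLES IN tpch;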

Load Data with Pipelines

This part of the tutorial shows how to ingest TPC-H data from a public AWS S3 bucket into the SingleStore database using pipelines.

Note

The SQL Editor runs only the queries that you have selected, so make sure all of the queries are selected before you select Run.

  1. Run the following SQL commands to create the pipelines:

    USE tpch;
    CREATE OR REPLACE PIPELINE tpch_100_lineitem
    AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/lineitem/'
    config '{"region":"us-east-1"}'
    SKIP DUPLICATE KEY ERRORS
    INTO TABLE lineitem
    FIELDS TERMINATED BY '|'
    LINES TERMINATED BY '|\n';
    CREATE OR REPLACE PIPELINE tpch_100_customer
    AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/customer/'
    config '{"region":"us-east-1"}'
    SKIP DUPLICATE KEY ERRORS
    INTO TABLE customer
    FIELDS TERMINATED BY '|'
    LINES TERMINATED BY '|\n';
    CREATE OR REPLACE PIPELINE tpch_100_nation
    AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/nation/'
    config '{"region":"us-east-1"}'
    SKIP DUPLICATE KEY ERRORS
    INTO TABLE nation
    FIELDS TERMINATED BY '|'
    LINES TERMINATED BY '|\n';
    CREATE OR REPLACE PIPELINE tpch_100_orders
    AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/orders/'
    config '{"region":"us-east-1"}'
    SKIP DUPLICATE KEY ERRORS
    INTO TABLE orders
    FIELDS TERMINATED BY '|'
    LINES TERMINATED BY '|\n';
    CREATE OR REPLACE PIPELINE tpch_100_part
    AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/part/'
    config '{"region":"us-east-1"}'
    SKIP DUPLICATE KEY ERRORS
    INTO TABLE part
    FIELDS TERMINATED BY '|'
    LINES TERMINATED BY '|\n';
    CREATE OR REPLACE PIPELINE tpch_100_partsupp
    AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/partsupp/'
    config '{"region":"us-east-1"}'
    SKIP DUPLICATE KEY ERRORS
    INTO TABLE partsupp
    FIELDS TERMINATED BY '|'
    LINES TERMINATED BY '|\n';
    CREATE OR REPLACE PIPELINE tpch_100_region
    AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/region/'
    config '{"region":"us-east-1"}'
    SKIP DUPLICATE KEY ERRORS
    INTO TABLE region
    FIELDS TERMINATED BY '|'
    LINES TERMINATED BY '|\n';
    CREATE OR REPLACE PIPELINE tpch_100_supplier
    AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/supplier/'
    config '{"region":"us-east-1"}'
    SKIP DUPLICATE KEY ERRORS
    INTO TABLE supplier
    FIELDS TERMINATED BY '|'
    LINES TERMINATED BY '|\n';
  2. Run the following SQL commands to start the pipelines:

    USE tpch;
    START ALL PIPELINES;

    Once a Success message is returned for each of the created pipelines, SingleStore starts ingesting data from the S3 bucket. You can confirm that ingestion is underway as shown below.
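
    To check the state of the pipelines, run SHOW PIPELINES; each pipeline should report a Running state while ingestion is in progress:

    SHOW PIPELINES;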

Verify Pipeline Success

Query the pipelines_files information schema view to inspect the progress of the pipelines. For example, run the following command to check whether the tpch_100_lineitem pipeline has finished ingesting data.

SELECT * FROM information_schema.pipelines_files
WHERE pipeline_name = "tpch_100_lineitem";
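
Each row in pipelines_files corresponds to one source file. As a rough progress summary, you can aggregate on the file state instead of scanning individual rows; this is a minimal sketch that assumes the FILE_STATE and DATABASE_NAME columns of the view:

SELECT pipeline_name, file_state, COUNT(*) AS files
FROM information_schema.pipelines_files
WHERE database_name = 'tpch'
GROUP BY pipeline_name, file_state
ORDER BY pipeline_name, file_state;

A pipeline has finished when none of its files remain in an unloaded state.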

You can also run a pipeline in the foreground to verify that all of the data has been ingested; the statement does not return until the pipeline has finished loading the files that are currently available.

START PIPELINE <pipeline_name> FOREGROUND;
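
Once a foreground pipeline finishes, a quick row count confirms that its target table is populated (exact counts depend on the scale factor of the loaded dataset):

SELECT COUNT(*) FROM lineitem;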

For SingleStore Helios deployments, you can also monitor the progress of your pipelines in the Cloud Portal: select Deployments > <your_workspace_group> > Databases > tpch > Pipelines.

Run Queries on Data

You can start running queries once the data is loaded into your SingleStore database. The following example queries can be run against the TPC-H dataset. For the complete workload of 22 queries that can be run against the TPC-H database, refer to the TPC-H standard specification.

Query 1: Pricing Summary Report

This query reports summed and average quantities, prices, and discounts for all line items shipped on or before a date that is 90 days before the most recent ship date in the database, grouped by return flag and line status.

USE tpch;
SELECT
l_returnflag,
l_linestatus,
SUM(l_quantity) AS sum_qty,
SUM(l_extendedprice) AS sum_base_price,
SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
AVG(l_quantity) AS avg_qty,
AVG(l_extendedprice) AS avg_price,
AVG(l_discount) AS avg_disc,
COUNT(*) AS count_order
FROM
lineitem
WHERE
l_shipdate <= date('1998-12-01' - interval '90' day)
GROUP BY
l_returnflag,
l_linestatus
ORDER BY
l_returnflag,
l_linestatus;
+--------------+--------------+--------------+-----------------+-------------------+---------------------+-----------+--------------+----------+-------------+
| l_returnflag | l_linestatus | sum_qty      | sum_base_price  | sum_disc_price    | sum_charge          | avg_qty   | avg_price    | avg_disc | count_order |
+--------------+--------------+--------------+-----------------+-------------------+---------------------+-----------+--------------+----------+-------------+
| A            | F            |  83289121.00 | 124912195605.43 | 118665316925.8745 | 123413447483.580151 | 25.491969 | 38231.377828 | 0.050007 |     3267269 |
| N            | F            |   2190539.00 |   3282604241.36 |   3118652485.5821 |   3243864574.415261 | 25.478790 | 38180.915863 | 0.049948 |       85975 |
| N            | O            | 164028717.00 | 245978182329.81 | 233676744963.8986 | 243022073269.855210 | 25.499480 | 38239.131169 | 0.050004 |     6432630 |
| R            | F            |  83339258.00 | 124970353864.21 | 118715983030.2238 | 123467940050.739462 | 25.517608 | 38264.613849 | 0.050047 |     3265951 |
+--------------+--------------+--------------+-----------------+-------------------+---------------------+-----------+--------------+----------+-------------+
4 rows in set (4.17 sec)

Note

Run the query again and note the difference in execution time; the second run is dramatically faster. The first execution includes query compilation, because SingleStore's code generation compiles the query plan to machine code; subsequent executions reuse the compiled plan.

Query 2: Shipping Priority

This query retrieves the unshipped orders with the highest value. The standard TPC-H query returns the top 10 orders; the example below is limited to the top 5.

USE tpch;
SELECT
l_orderkey,
SUM(l_extendedprice * (1 - l_discount)) AS revenue,
o_orderdate,
o_shippriority
FROM
customer,
orders,
lineitem
WHERE
c_mktsegment = 'BUILDING'
AND c_custkey = o_custkey
AND l_orderkey = o_orderkey
AND o_orderdate < DATE('1995-03-15')
AND l_shipdate > DATE('1995-03-15')
GROUP BY
l_orderkey,
o_orderdate,
o_shippriority
ORDER BY
revenue DESC,
o_orderdate
LIMIT 5;
+------------+-------------+-------------+----------------+
| l_orderkey | revenue     | o_orderdate | o_shippriority |
+------------+-------------+-------------+----------------+
|   81316036 | 459781.3856 | 1995-02-24  |              0 |
|   54427397 | 391221.7016 | 1995-02-28  |              0 |
|   82073798 | 390074.7002 | 1995-03-12  |              0 |
|   81872546 | 386608.3992 | 1995-02-28  |              0 |
|   45138020 | 386056.2188 | 1995-01-30  |              0 |
+------------+-------------+-------------+----------------+
5 rows in set (46.47 sec)

Query 3: Returned Item Reporting

This query identifies the customers who may be experiencing issues with the items shipped to them. The standard TPC-H query returns the top 20 customers; the example below is limited to the top 5.

USE tpch;
SELECT
c_custkey,
c_name,
SUM(l_extendedprice * (1 - l_discount)) AS revenue,
c_acctbal,
n_name,
c_address,
c_phone,
c_comment
FROM
customer,
orders,
lineitem,
nation
WHERE
c_custkey = o_custkey
AND l_orderkey = o_orderkey
AND o_orderdate >= DATE('1993-10-01')
AND o_orderdate < DATE('1993-10-01') + interval '3' month
AND l_returnflag = 'R'
AND c_nationkey = n_nationkey
GROUP BY
c_custkey,
c_name,
c_acctbal,
c_phone,
n_name,
c_address,
c_comment
ORDER BY
revenue desc
LIMIT 5;
+-----------+--------------------+-------------+-----------+-----------+---------------------------------------+-----------------+------------------------------------------------------------------------------------------------+
| c_custkey | c_name             | revenue     | c_acctbal | n_name    | c_address                             | c_phone         | c_comment                                                                                      |
+-----------+--------------------+-------------+-----------+-----------+---------------------------------------+-----------------+------------------------------------------------------------------------------------------------+
|   2490736 | Customer#002490736 | 473138.4854 |   4701.80 | RUSSIA    | rjaIP,XKH,v0R 0u7Qyb                  | 32-234-811-3202 | s sleep carefully against the ironic pinto beans. blithely unusual deposits haggle furiously a |
|   7658659 | Customer#007658659 | 440156.4690 |   1442.76 | MOROCCO   | ucA926oE6y1fOr,teHGthr6GXz6y qf1ySK3  | 25-418-150-7556 | nusual requests nag blithely according to the regular requests: pending, unusual               |
|  11381863 | Customer#011381863 | 437190.0882 |   4397.51 | JORDAN    | TFd7KvQQmf6XJjSQUBbvZ9rlxTUle1mwfqt9  | 23-407-202-8891 |  lose blithely along the furiously pending sentiments. quickly final theodo                    |
|   4804573 | Customer#004804573 | 432390.5681 |   4930.12 | INDONESIA | LgPei3cnZL19sW8lM,qM                  | 19-664-439-8190 | ial accounts. blithely final platelets breach. furious                                         |
|   2247622 | Customer#002247622 | 414271.9102 |   4292.76 | PERU      | c5eYVZ7PXycKnsgLmCtarE5aCcWYga5u0RIwa | 27-403-465-9280 | ly. sly packages haggle. even theodolites wa                                                   |
+-----------+--------------------+-------------+-----------+-----------+---------------------------------------+-----------------+------------------------------------------------------------------------------------------------+
5 rows in set (37.33 sec)

Query 4: Product Type Profit Measure

This query calculates the profit made on a given line of parts, broken down by supplier nation and year.

USE tpch;
SELECT
nation,
o_year,
SUM(amount) AS sum_profit
FROM
(
SELECT
n_name AS nation,
EXTRACT(year FROM o_orderdate) AS o_year,
l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity AS amount
FROM
part,
supplier,
lineitem,
partsupp,
orders,
nation
WHERE
s_suppkey = l_suppkey
AND ps_suppkey = l_suppkey
AND ps_partkey = l_partkey
AND p_partkey = l_partkey
AND o_orderkey = l_orderkey
AND s_nationkey = n_nationkey
AND p_name LIKE '%green%'
) AS profit
GROUP BY
nation,
o_year
ORDER BY
nation,
o_year DESC
LIMIT 5;
+---------+--------+---------------+
| nation  | o_year | sum_profit    |
+---------+--------+---------------+
| ALGERIA |   1998 | 20613196.5474 |
| ALGERIA |   1997 | 37498211.6758 |
| ALGERIA |   1996 | 35471708.2126 |
| ALGERIA |   1995 | 36720114.9923 |
| ALGERIA |   1994 | 34921048.6736 |
+---------+--------+---------------+
5 rows in set (49.48 sec)

Analyze Query Performance

The SQL Editor allows you to record and visualize workload profiles at either the query or node level. You can also collect a profile directly from SQL, as shown after the following steps.

  1. In the SQL Editor, select Visual Explain and then select Profile.

  2. The Visual Explain window opens, where you can explore the query plan and navigate between its operations.
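
If you prefer to collect a profile from SQL rather than the UI, SingleStore also provides the PROFILE and SHOW PROFILE commands. The following is a minimal sketch that uses a simplified aggregation over lineitem as the example query:

USE tpch;
-- Run the query and collect execution statistics for it
PROFILE SELECT l_returnflag, COUNT(*) FROM lineitem GROUP BY l_returnflag;
-- Display the profile of the most recently profiled query in this session
SHOW PROFILE;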

Last modified: October 28, 2024
