# Load TPC-H Data into SingleStore

This tutorial shows how to load the TPC-H dataset into your SingleStore deployment using [Pipelines](https://docs.singlestore.com/cloud/load-data/about-singlestore-pipelines.md) and then query it to obtain business-critical metrics. Pipelines enable you to extract, transform, and load external data without needing third-party tools or middleware.

## What is TPC-H?

The TPC Benchmark™H (TPC-H) is a decision support benchmark, which consists of a suite of business oriented ad-hoc queries and concurrent data modifications. These queries have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.

## Prerequisites

Connect to a SingleStore Helios deployment or a SingleStore instance deployed on a self-managed cluster running on AWS.

Use a Shared workspace for development environments.

## Create Database and Tables

Connect to your SingleStore Helios deployment and run the following SQL commands to create a database named **tpch** and eight associated tables.

For SingleStore Helios deployments, you can also run these commands in the **SQL Editor**, select **Cloud Portal > Editor > Open SQL Editor**.

> **📝 Note**: The **SQL Editor** only runs the queries that you select, so ensure you have them all selected before selecting **Run**.

```sql
DROP DATABASE IF EXISTS tpch;
CREATE DATABASE tpch;
USE tpch;

CREATE TABLE `customer` (
  `c_custkey` int(11) NOT NULL,
  `c_name` varchar(25) COLLATE utf8mb4_bin NOT NULL,
  `c_address` varchar(40) COLLATE utf8mb4_bin NOT NULL,
  `c_nationkey` int(11) NOT NULL,
  `c_phone` char(15) COLLATE utf8mb4_bin NOT NULL,
  `c_acctbal` decimal(15,2) NOT NULL,
  `c_mktsegment` char(10) COLLATE utf8mb4_bin NOT NULL,
  `c_comment` varchar(117) COLLATE utf8mb4_bin NOT NULL,
  UNIQUE KEY pk (`c_custkey`) UNENFORCED RELY,
  SHARD KEY (`c_custkey`) USING CLUSTERED COLUMNSTORE
);

CREATE TABLE `lineitem` (
  `l_orderkey` bigint(11) NOT NULL,
  `l_partkey` int(11) NOT NULL,
  `l_suppkey` int(11) NOT NULL,
  `l_linenumber` int(11) NOT NULL,
  `l_quantity` decimal(15,2) NOT NULL,
  `l_extendedprice` decimal(15,2) NOT NULL,
  `l_discount` decimal(15,2) NOT NULL,
  `l_tax` decimal(15,2) NOT NULL,
  `l_returnflag` char(1) COLLATE utf8mb4_bin NOT NULL,
  `l_linestatus` char(1) COLLATE utf8mb4_bin NOT NULL,
  `l_shipdate` date NOT NULL,
  `l_commitdate` date NOT NULL,
  `l_receiptdate` date NOT NULL,
  `l_shipinstruct` char(25) COLLATE utf8mb4_bin NOT NULL,
  `l_shipmode` char(10) COLLATE utf8mb4_bin NOT NULL,
  `l_comment` varchar(44) COLLATE utf8mb4_bin NOT NULL,
  UNIQUE KEY pk (`l_orderkey`, `l_linenumber`) UNENFORCED RELY,
  SHARD KEY (`l_orderkey`) USING CLUSTERED COLUMNSTORE
);

CREATE TABLE `nation` (
  `n_nationkey` int(11) NOT NULL,
  `n_name` char(25) COLLATE utf8mb4_bin NOT NULL,
  `n_regionkey` int(11) NOT NULL,
  `n_comment` varchar(152) COLLATE utf8mb4_bin NOT NULL,
  UNIQUE KEY pk (`n_nationkey`) UNENFORCED RELY,
  SHARD KEY (`n_nationkey`) USING CLUSTERED COLUMNSTORE
);

CREATE TABLE `orders` (
  `o_orderkey` bigint(11) NOT NULL,
  `o_custkey` int(11) NOT NULL,
  `o_orderstatus` char(1) COLLATE utf8mb4_bin NOT NULL,
  `o_totalprice` decimal(15,2) NOT NULL,
  `o_orderdate` date NOT NULL,
  `o_orderpriority` char(15) COLLATE utf8mb4_bin NOT NULL,
  `o_clerk` char(15) COLLATE utf8mb4_bin NOT NULL,
  `o_shippriority` int(11) NOT NULL,
  `o_comment` varchar(79) COLLATE utf8mb4_bin NOT NULL,
  UNIQUE KEY pk (`o_orderkey`) UNENFORCED RELY,
  SHARD KEY (`o_orderkey`) USING CLUSTERED COLUMNSTORE
);

CREATE TABLE `part` (
  `p_partkey` int(11) NOT NULL,
  `p_name` varchar(55) COLLATE utf8mb4_bin NOT NULL,
  `p_mfgr` char(25) COLLATE utf8mb4_bin NOT NULL,
  `p_brand` char(10) COLLATE utf8mb4_bin NOT NULL,
  `p_type` varchar(25) COLLATE utf8mb4_bin NOT NULL,
  `p_size` int(11) NOT NULL,
  `p_container` char(10) COLLATE utf8mb4_bin NOT NULL,
  `p_retailprice` decimal(15,2) NOT NULL,
  `p_comment` varchar(23) COLLATE utf8mb4_bin NOT NULL,
  UNIQUE KEY pk (`p_partkey`) UNENFORCED RELY,
  SHARD KEY (`p_partkey`) USING CLUSTERED COLUMNSTORE
);

CREATE TABLE `partsupp` (
  `ps_partkey` int(11) NOT NULL,
  `ps_suppkey` int(11) NOT NULL,
  `ps_availqty` int(11) NOT NULL,
  `ps_supplycost` decimal(15,2) NOT NULL,
  `ps_comment` varchar(199) COLLATE utf8mb4_bin NOT NULL,
  UNIQUE KEY pk (`ps_partkey`,`ps_suppkey`) UNENFORCED RELY,
  SHARD KEY(`ps_partkey`),
  KEY (`ps_partkey`,`ps_suppkey`) USING CLUSTERED COLUMNSTORE
);

CREATE TABLE `region` (
  `r_regionkey` int(11) NOT NULL,
  `r_name` char(25) COLLATE utf8mb4_bin NOT NULL,
  `r_comment` varchar(152) COLLATE utf8mb4_bin NOT NULL,
  UNIQUE KEY pk (`r_regionkey`) UNENFORCED RELY,
  SHARD KEY (`r_regionkey`) USING CLUSTERED COLUMNSTORE
);

CREATE TABLE `supplier` (
  `s_suppkey` int(11) NOT NULL,
  `s_name` char(25) COLLATE utf8mb4_bin NOT NULL,
  `s_address` varchar(40) COLLATE utf8mb4_bin NOT NULL,
  `s_nationkey` int(11) NOT NULL,
  `s_phone` char(15) COLLATE utf8mb4_bin NOT NULL,
  `s_acctbal` decimal(15,2) NOT NULL,
  `s_comment` varchar(101) COLLATE utf8mb4_bin NOT NULL,
  UNIQUE KEY pk (`s_suppkey`) UNENFORCED RELY,
  SHARD KEY (`s_suppkey`) USING CLUSTERED COLUMNSTORE
);

```

## Load Data with Pipelines

This part of the tutorial shows how to ingest TPC-H data from a public AWS S3 bucket into the SingleStore database using pipelines.

> **📝 Note**: The **SQL Editor** only runs the queries that you select, so ensure you have them all selected before selecting **Run**.

1. Run the following SQL commands to create the pipelines:
   ```sql
   USE tpch;

   CREATE OR REPLACE PIPELINE tpch_100_lineitem
       AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/lineitem/'
       config '{"region":"us-east-1"}'
       SKIP DUPLICATE KEY ERRORS
       INTO TABLE lineitem
       FIELDS TERMINATED BY '|'
       LINES TERMINATED BY '|\n';

   CREATE OR REPLACE PIPELINE tpch_100_customer
       AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/customer/'
       config '{"region":"us-east-1"}'
       SKIP DUPLICATE KEY ERRORS
       INTO TABLE customer
       FIELDS TERMINATED BY '|'
       LINES TERMINATED BY '|\n';

   CREATE OR REPLACE PIPELINE tpch_100_nation
       AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/nation/'
       config '{"region":"us-east-1"}'
       SKIP DUPLICATE KEY ERRORS
       INTO TABLE nation
       FIELDS TERMINATED BY '|'
       LINES TERMINATED BY '|\n';

   CREATE OR REPLACE PIPELINE tpch_100_orders
       AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/orders/'
       config '{"region":"us-east-1"}'
       SKIP DUPLICATE KEY ERRORS
       INTO TABLE orders
       FIELDS TERMINATED BY '|'
       LINES TERMINATED BY '|\n';

   CREATE OR REPLACE PIPELINE tpch_100_part
       AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/part/'
       config '{"region":"us-east-1"}'
       SKIP DUPLICATE KEY ERRORS
       INTO TABLE part
       FIELDS TERMINATED BY '|'
       LINES TERMINATED BY '|\n';

   CREATE OR REPLACE PIPELINE tpch_100_partsupp
       AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/partsupp/'
       config '{"region":"us-east-1"}'
       SKIP DUPLICATE KEY ERRORS
       INTO TABLE partsupp
       FIELDS TERMINATED BY '|'
       LINES TERMINATED BY '|\n';

   CREATE OR REPLACE PIPELINE tpch_100_region
       AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/region/'
       config '{"region":"us-east-1"}'
       SKIP DUPLICATE KEY ERRORS
       INTO TABLE region
       FIELDS TERMINATED BY '|'
       LINES TERMINATED BY '|\n';

   CREATE OR REPLACE PIPELINE tpch_100_supplier
       AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/supplier/'
       config '{"region":"us-east-1"}'
       SKIP DUPLICATE KEY ERRORS
       INTO TABLE supplier
       FIELDS TERMINATED BY '|'
       LINES TERMINATED BY '|\n';

   ```

2. Run the following SQL commands to start the pipelines:
   ```sql
   USE tpch;
   START ALL PIPELINES;

   ```
   Once the **Success** message is returned for all the created pipelines, SingleStore starts ingesting the data from the S3 bucket.

## Verify Pipeline Success

Query the `pipelines_files` information schema view to inspect the progress of the pipelines. For example, run the following command to check whether the `tpch_100_lineitem` pipeline has finished ingesting data.

```sql
SELECT * FROM information_schema.pipelines_files 
WHERE pipeline_name = "tpch_100_lineitem";
```

You can also run the pipelines in foreground to easily verify that all the data has been ingested.

```sql
START PIPELINE <pipeline_name> FOREGROUND;
```

For SingleStore Helios deployments, you can also monitor the progress of your pipelines on the Cloud Portal, select **Ingestion > Pipelines**, and then select a deployment from the list.

## Run Queries on Data

You can start running queries once the data is loaded into your SingleStore database. Here are some example queries to be used with the TPC-H dataset. For the complete workload of 22 queries that can be run against the TPC-H database, refer to [TPC-H standard specification](https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf).

## Query 1: Pricing Summary Report

This query reports the total number of items, their prices, and average quantities shipped within 90 days of the most recent ship date (specified date) in the database.

```sql
USE tpch;
SELECT
    l_returnflag,
    l_linestatus,
    SUM(l_quantity) AS sum_qty,
    SUM(l_extendedprice) AS sum_base_price,
    SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
    SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
    AVG(l_quantity) AS avg_qty,
    AVG(l_extendedprice) AS avg_price,
    AVG(l_discount) AS avg_disc,
    COUNT(*) AS count_order
FROM
    lineitem
WHERE
    l_shipdate <= date('1998-12-01' - interval '90' day)
GROUP BY
    l_returnflag,
    l_linestatus
ORDER BY
    l_returnflag,
    l_linestatus;

```

```output

+--------------+--------------+--------------+-----------------+-------------------+---------------------+-----------+--------------+----------+-------------+
| l_returnflag | l_linestatus | sum_qty      | sum_base_price  | sum_disc_price    | sum_charge          | avg_qty   | avg_price    | avg_disc | count_order |
+--------------+--------------+--------------+-----------------+-------------------+---------------------+-----------+--------------+----------+-------------+
| A            | F            |  83289121.00 | 124912195605.43 | 118665316925.8745 | 123413447483.580151 | 25.491969 | 38231.377828 | 0.050007 |     3267269 |
| N            | F            |   2190539.00 |   3282604241.36 |   3118652485.5821 |   3243864574.415261 | 25.478790 | 38180.915863 | 0.049948 |       85975 |
| N            | O            | 164028717.00 | 245978182329.81 | 233676744963.8986 | 243022073269.855210 | 25.499480 | 38239.131169 | 0.050004 |     6432630 |
| R            | F            |  83339258.00 | 124970353864.21 | 118715983030.2238 | 123467940050.739462 | 25.517608 | 38264.613849 | 0.050047 |     3265951 |
+--------------+--------------+--------------+-----------------+-------------------+---------------------+-----------+--------------+----------+-------------+

```

> **📝 Note**: Run the query **again** and observe the execution time difference. You will see a **dramatic improvement** in query execution time. This is due to SingleStore Helios’s [Code Generation](https://docs.singlestore.com/cloud/query-data/advanced-query-topics/code-generation.md) technology.

## Query 2: Shipping Priority

This query retrieves the 5 unshipped orders with the highest value.

```sql
USE tpch;
SELECT
    l_orderkey,
    SUM(l_extendedprice * (1 - l_discount)) AS revenue,
    o_orderdate,
    o_shippriority
FROM
    customer,
    orders,
    lineitem
WHERE
    c_mktsegment = 'BUILDING'
    AND c_custkey = o_custkey
    AND l_orderkey = o_orderkey
    AND o_orderdate < DATE('1995-03-15')
    AND l_shipdate > DATE('1995-03-15')
GROUP BY
    l_orderkey,
    o_orderdate,
    o_shippriority
ORDER BY
    revenue DESC,
    o_orderdate
LIMIT 5;


```

```output

+------------+-------------+-------------+----------------+
| l_orderkey | revenue     | o_orderdate | o_shippriority |
+------------+-------------+-------------+----------------+
|   81316036 | 459781.3856 | 1995-02-24  |              0 |
|   54427397 | 391221.7016 | 1995-02-28  |              0 |
|   82073798 | 390074.7002 | 1995-03-12  |              0 |
|   81872546 | 386608.3992 | 1995-02-28  |              0 |
|   45138020 | 386056.2188 | 1995-01-30  |              0 |
+------------+-------------+-------------+----------------+

```

## Query 3: Returned Item Reporting

This query identifies the top 5 customers that may be experiencing issues with the items shipped to them.

```sql

USE tpch;
SELECT
    c_custkey,
    c_name,
    SUM(l_extendedprice * (1 - l_discount)) AS revenue,
    c_acctbal,
    n_name,
    c_address,
    c_phone,
    c_comment
FROM
    customer,
    orders,
    lineitem,
    nation
WHERE
    c_custkey = o_custkey
    AND l_orderkey = o_orderkey
    AND o_orderdate >= DATE('1993-10-01')
    AND o_orderdate < DATE('1993-10-01') + interval '3' month
    AND l_returnflag = 'R'
    AND c_nationkey = n_nationkey
GROUP BY
    c_custkey,
    c_name,
    c_acctbal,
    c_phone,
    n_name,
    c_address,
    c_comment
ORDER BY
    revenue desc
LIMIT 5;


```

```output

+-----------+--------------------+-------------+-----------+-----------+---------------------------------------+-----------------+------------------------------------------------------------------------------------------------+
| c_custkey | c_name             | revenue     | c_acctbal | n_name    | c_address                             | c_phone         | c_comment                                                                                      |
+-----------+--------------------+-------------+-----------+-----------+---------------------------------------+-----------------+------------------------------------------------------------------------------------------------+
|   2490736 | Customer#002490736 | 473138.4854 |   4701.80 | RUSSIA    | rjaIP,XKH,v0R 0u7Qyb                  | 32-234-811-3202 | s sleep carefully against the ironic pinto beans. blithely unusual deposits haggle furiously a |
|   7658659 | Customer#007658659 | 440156.4690 |   1442.76 | MOROCCO   | ucA926oE6y1fOr,teHGthr6GXz6y qf1ySK3  | 25-418-150-7556 | nusual requests nag blithely according to the regular requests: pending, unusual               |
|  11381863 | Customer#011381863 | 437190.0882 |   4397.51 | JORDAN    | TFd7KvQQmf6XJjSQUBbvZ9rlxTUle1mwfqt9  | 23-407-202-8891 |  lose blithely along the furiously pending sentiments. quickly final theodo                    |
|   4804573 | Customer#004804573 | 432390.5681 |   4930.12 | INDONESIA | LgPei3cnZL19sW8lM,qM                  | 19-664-439-8190 | ial accounts. blithely final platelets breach. furious                                         |
|   2247622 | Customer#002247622 | 414271.9102 |   4292.76 | PERU      | c5eYVZ7PXycKnsgLmCtarE5aCcWYga5u0RIwa | 27-403-465-9280 | ly. sly packages haggle. even theodolites wa                                                   |
+-----------+--------------------+-------------+-----------+-----------+---------------------------------------+-----------------+------------------------------------------------------------------------------------------------+

```

## Query 4: Product Type Profit Measure

This query calculates the profit made on a given line of items, broken down by supplier nation and year.

```sql
USE tpch;
WITH profit AS (
    SELECT
        n_name AS nation,
        EXTRACT(YEAR FROM o_orderdate) AS o_year,
        l_extendedprice * (1 - l_discount)
            - ps_supplycost * l_quantity AS amount
    FROM
        part
        JOIN lineitem
            ON p_partkey = l_partkey
        JOIN supplier
            ON s_suppkey = l_suppkey
        JOIN partsupp
            ON ps_suppkey = l_suppkey
           AND ps_partkey = l_partkey
        JOIN orders
            ON o_orderkey = l_orderkey
        JOIN nation
            ON s_nationkey = n_nationkey
    WHERE
        p_name LIKE '%green%'
)

SELECT
    nation,
    o_year,
    SUM(amount) AS sum_profit
FROM
    profit
GROUP BY
    nation,
    o_year
ORDER BY
    nation,
    o_year DESC
LIMIT 5;

```

```output

+---------+--------+---------------+
| nation  | o_year | sum_profit    |
+---------+--------+---------------+
| ALGERIA |   1998 | 20613196.5474 |
| ALGERIA |   1997 | 37498211.6758 |
| ALGERIA |   1996 | 35471708.2126 |
| ALGERIA |   1995 | 36720114.9923 |
| ALGERIA |   1994 | 34921048.6736 |
+---------+--------+---------------+

```

## Analyze Query Performance

The SQL Editor allows you to record and visualize workload profiles at either the query or node level.

1. In the **SQL Editor**, select **Visual Explain** and select **Profile**.

2. The **Visual Explain** window opens. You can now explore the query and navigate around different operations.

***

Modified at: February 6, 2025

Source: [/cloud/getting-started-with-singlestore-helios/next-steps-and-examples/sample-data/load-tpc-h-data-into-singlestore/](https://docs.singlestore.com/cloud/getting-started-with-singlestore-helios/next-steps-and-examples/sample-data/load-tpc-h-data-into-singlestore/)

(An index of the documentation is available at /llms.txt)
