CREATE PROJECTION

Note

This is a Preview feature.

Creates a projection.

A projection is a copy of a table, which may include only a subset of the base table's columns. The projection can be sharded and/or sorted differently from the base table. Projections are dependent on the base table, similar to indexes.

All projections on the base table must be dropped before the base table can be dropped. If not, an error will be generated when the DROP TABLE operation is attempted. Attempting to modify the columns in the base table that are used in a projection will also result in an error being generated. The projection would have to be dropped before the ALTER TABLE operation can proceed.

Updates to projections are done in the same transaction as updates to the base table. The contents of projections are always kept in sync with the contents of the base table.

Projections consume storage space and require maintenance when updating your data. Physically, a projection is like a second table. So a typical projection that includes all the columns of the base table will double the storage usage and double the update time. Therefore, they should be utilized sparingly. Typically, you'll employ at most a few projections on a table. Often, no projections or only one may be needed on a table.

Projections can improve query performance in these scenarios:

Optimize range queries: The use of sort keys can optimize certain range filters. If your workload requires fast range filters on two different columns (for example, col1 and col2) in the same table, using projections can be helpful. You can use col1 as the sort key of the base table. Then create a projection of the base table with col2 as the sort key.

Optimize ORDER BY/LIMIT Queries: Certain ORDER BY/LIMIT queries can be made faster using a columnstore sort key. If you need the fastest possible query plans for ORDER BY/LIMIT on two different sort keys, you can make the base table be sorted by one key and a projection of the base table sorted on the second key.

Optimize grouping operations: In scenarios involving queries with different GROUP BY clauses within your workload, the goal is to optimize query execution by minimizing data transfer across nodes. You can adopt a strategy where one shard key is applied to the base table, while a different shard key is assigned to a projection. This approach ensures that data grouping remains localized, resulting in minimal cross-node data movement, and ultimately improving query efficiency. Similar to GROUP BY, this same approach can speed up COUNT(DISTINCT …) and SELECT DISTINCT... operations. Additionally, you can specify the grouping columns as the sort key, so that StreamingGroupBy query plan operator can be utilized by the engine to reduce the memory use and improve performance.

Optimize joins: Projections can also be used to optimize data transfer in JOIN clauses. If one or neither of your join predicate columns are sharded, shard keys can be set on those columns in a projection. This allows the utilization of collocated (local) joins where partitions of the table can be joined locally and no shuffle or broadcast is needed to complete the join, which is required if the joined tables are not both sharded on the join columns.

Optimize key lookups: If you have two different keys and want the fastest possible lookup speed on both of them, with high concurrency, you can shard the base table on one key and create a projection that shards on the other key. This can enable single-node lookups rather than broadcast lookups for both keys.

Syntax

CREATE PROJECTION [IF NOT EXISTS] <projection_name>
([SHARD KEY(<columns>),] [SORT KEY(<columns>),] [KEY(<columns>),...])
AS SELECT <project_list>
FROM <table_name>;

Arguments

projection_name

The name of the projection.

table_name

The projection must be defined on one columnstore table.

project_list

The project list may contain any column(s) of <table>.*. Use of * is allowed, but it is expanded at creation time (new columns are not automatically displayed or added to the projection).

Remarks

  • Projections are only allowed with two-phase commit enabled. If a user decides to disable two-phase commit, all projections should be dropped first. Not doing so will cause all write queries with projections to generate a warning.

  • CREATE PROJECTION is supported on columnstore tables only - both as the source table and projection object.

  • CREATE PROJECTION is not supported on temporary tables.

  • Secondary UNIQUE keys and filtered projections are not supported.

  • CREATE PROJECTION is an offline operation, i.e., any writes (or DDL operation) to the base table are blocked while it is running (but read queries to the base table are not blocked). The CREATE PROJECTION operation can use lots of resources and take some time to complete with large base tables.

  • BACKUP DATABASE ... WITH SPLIT PARTITIONS is blocked if there is a projection in a database. Projections must be dropped before running backup with partition split.

  • If a table has projections based on it, truncating and renaming the base table is blocked, as well as any ALTER TABLE operation that modifies a column that is referred to by a projection.

  • Existing projections cannot be altered. They can be dropped and re-created. Refer to DROP PROJECTION for more information.

  • CREATE PROJECTION requires ALTER permissions and any one of the SELECT, INSERT, UPDATE, and DELETE permissions on the base table.

  • The projection query hint (use_projection='projection_name') can be used in a query to force the optimizer to select the projection instead of the base table. This is required for range filter optimization. Collocated join and group by optimization can be identified automatically by the optimizer. The following example shows the usage of this hint in a simple query shape:

    SELECT col1, col2, coln...
    FROM <table_name> WITH (use_projection='<projection_name>');
  • If set to OFF, the consider_secondary_projection engine variable will turn off the use of all projections in queries except where the projection query hint use_projection='projection_name') is used. The default value is ON.

  • If there is a projection on a table that is being queried, the hint disable_projection_replacement = 1 can be used to force the optimizer not to use the projection but use the base table instead.

    SELECT col1, col2, coln...
    FROM <table_name> WITH (disable_projection_replacement = 1);
  • If a projection query hint is explicitly used, any columns queried that are not in the projection will result in an error. The following example returns an error because col3 is not in the projection proj_name but it's queried in the SELECT statement:

    CREATE PROJECTION IF NOT EXISTS proj_name
    (SHARD KEY(col1), SORT KEY(col2))
    AS SELECT col1, col2 FROM t1;
    SELECT col1, col2, col3
    FROM t1 WITH (use_projection='proj_name')
    WHERE col1 BETWEEN '' AND '';
    ERROR 1877 ER_BAD_HINT: projection 'proj_name' is not eligible for replacement for table 't1'

Examples

Using the TPC-H dataset, these examples use the lineitem and the orders tables from the tpch database to demonstrate four use cases for utilizing projections.

The TPC-H scale-factor 100 dataset can be used to load data into both tables. Refer to Example: Loading TPC-H Data into SingleStore for more information.

CREATE DATABASE tpch;
USE tpch;
CREATE TABLE lineitem (
l_orderkey BIGINT(11) NOT NULL,
l_partkey INT(11) NOT NULL,
l_suppkey INT(11) NOT NULL,
l_linenumber INT(11) NOT NULL,
l_quantity DECIMAL(15,2) NOT NULL,
l_extendedprice DECIMAL(15,2) NOT NULL,
l_discount DECIMAL(15,2) NOT NULL,
l_tax DECIMAL(15,2) NOT NULL,
l_returnflag CHAR(1) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
l_linestatus CHAR(1) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
l_shipdate DATE NOT NULL,
l_commitdate DATE NOT NULL,
l_receiptdate DATE NOT NULL,
l_shipinstruct CHAR(25) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
l_shipmode CHAR(10) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
l_comment VARCHAR(44) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
UNIQUE KEY pk ( l_orderkey , l_linenumber ) UNENFORCED RELY,
SHARD KEY ( l_orderkey )
);
CREATE TABLE orders (
o_orderkey BIGINT(11) NOT NULL,
o_custkey INT(11) NOT NULL,
o_orderstatus CHAR(1) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
o_totalprice DECIMAL(15,2) NOT NULL,
o_orderdate DATE NOT NULL,
o_orderpriority CHAR(15) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
o_clerk CHAR(15) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
o_shippriority INT(11) NOT NULL,
o_comment VARCHAR(79) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL
);

The tables can be loaded using pipelines. See Load Data with Pipelines for more information.

Optimizing Range Filters

If your application needs to filter on the l_shipdate column in the lineitem table, this query will not run as fast as it could because l_shipdate is not the sort key:

SELECT COUNT(*)
FROM lineitem
WHERE l_shipdate BETWEEN '1992-01-01' AND '1992-01-31';
+----------+
| COUNT(*) |
+----------+
| 861221   |
+----------+
77 ms

If a projection of the lineitem table is created with l_shipdate as the sort key, this will improve the execution speed of this query. The projection hint (use_projection='name_of_projection') must be used in the updated query for the optimizer to use the projection:

CREATE PROJECTION IF NOT EXISTS lineitem_sort_shipdate 
(SHARD KEY(l_orderkey), SORT KEY(l_shipdate)) 
AS SELECT l_shipdate, l_orderkey FROM lineitem;
SELECT COUNT(*)
FROM lineitem WITH (use_projection='lineitem_sort_shipdate')
WHERE l_shipdate BETWEEN '1992-01-01' AND '1992-01-31';
+----------+
| COUNT(*) |
+----------+
| 861221   |
+----------+
9 ms

Optimize GROUP BY on Shard Key Queries

If you need to know how many orders are grouped by l_shipdate with the same date range as above, the execution of this query would not run optimally since l_shipdate is not the shard key or the sort key:

SELECT COUNT(l_orderkey), l_shipdate
FROM lineitem
WHERE l_shipdate BETWEEN '1992-01-01' AND '1992-01-31'
GROUP BY l_shipdate;
+-------------------+------------+
| COUNT(l_orderkey) | l_shipdate |
+-------------------+------------+
| 1853              | 1992-01-02 |
| 3814              | 1992-01-03 |
| . . . . . . . . . | . . . . .  |
| 54004             | 1992-01-30 |
| 55737             | 1992-01-31 |
+-------------------+------------+
80 ms

Another projection can be created with l_shipdate set as both the shard key and the sort key:

CREATE PROJECTION IF NOT EXISTS lineitem_shard_sort_shipdate
(SHARD KEY(l_shipdate), SORT KEY (l_shipdate))
AS SELECT l_orderkey, l_shipdate FROM lineitem;

After creating the projection and using the projection query hint, the query runs faster:

SELECT COUNT(l_orderkey), l_shipdate
FROM lineitem WITH (use_projection='lineitem_shard_sort_shipdate')
WHERE l_shipdate BETWEEN '1992-01-01' AND '1992-01-31'
GROUP BY l_shipdate;
+-------------------+------------+
| COUNT(l_orderkey) | l_shipdate |
+-------------------+------------+
| 1853              | 1992-01-02 |
| 3814              | 1992-01-03 |
| . . . . . . . . . | . . . . .  |
| 54004             | 1992-01-30 |
| 55737             | 1992-01-31 |
+-------------------+------------+
4 ms

Creating a Collocated Join

The lineitem table needs to be joined to the orders table in order to pull the order dates into the query. The optimizer will need to reshuffle the partitions because there isn't an explicit shard key on the orders table:

SELECT COUNT(l_orderkey), o_orderdate, l_shipdate
FROM lineitem INNER JOIN orders ON lineitem.l_orderkey = orders.o_orderkey
WHERE l_shipdate BETWEEN '1992-01-01' AND '1992-01-31'
GROUP BY l_shipdate, o_orderdate;
+-------------------+-------------+-------------+
| COUNT(l_orderkey) | o_orderdate | l_shipdate  |
+-------------------+-------------+-------------+
| 1853              | 1992-01-01  | 1992-01-02  |
| 1974              | 1992-01-02  | 1992-01-03  | 
| . . . . . . . . . | . . . . . . | . . . . . . |
| 1833              | 1992-01-21  | 1992-01-31  |
| 1831              | 1992-01-30  | 1992-01-31  |
+-------------------+-------------+-------------+ 
250 ms

You can create a projection and set an explicit shard key on the o_orderkey column. Now, when the tables are joined, there is minimal data movement since all distinct state values are located on a single partition.

CREATE PROJECTION IF NOT EXISTS orders_shard_orderkey_sort_orderdate
(SHARD KEY(o_orderkey), SORT KEY(o_orderdate))
AS SELECT o_orderkey, o_orderdate FROM orders;

Note

The projection query hint may need to be positioned after the table that is projected in a JOIN clause. In the following example, the orders table is the projected table so the WITH (use_projection='') syntax is placed after "INNER JOIN orders".

SELECT COUNT(l_orderkey), o_orderdate, l_shipdate
FROM lineitem INNER JOIN orders WITH (use_projection='orders_shard_orderkey_sort_orderdate') ON lineitem.l_orderkey = orders.o_orderkey
WHERE l_shipdate BETWEEN '1992-01-01' AND '1992-01-31'
GROUP BY l_shipdate, o_orderdate;
+-------------------+-------------+-------------+
| COUNT(l_orderkey) | o_orderdate | l_shipdate  |
+-------------------+-------------+-------------+
| 1853              | 1992-01-01  | 1992-01-02  |
| 3814              | 1992-01-01  | 1992-01-03  | 
| . . . . . . . . . | . . . . . . | . . . . . . |
| 54004             | 1992-01-09  | 1992-01-30  |
| 55737             | 1992-01-26  | 1992-01-31  |
+-------------------+-------------+-------------+ 
139 ms

Run the SHOW PROFILE command to confirm that the optimizer used the projection. This is an abridged version of the plan:

+--------------------------------------------------------------------------------+
| PROFILE |
+--------------------------------------------------------------------------------+
| . . . . . . . . . . . . . . . . . . . . . . . . . . |
| ColumnStoreScan tpch.orders PROJECTION orders_shard_orderkey_sort_orderdate . .|
| . . . . . . . . . . . . . . . . . . . . . . . . . . |
+--------------------------------------------------------------------------------+

Optimize Key Lookups

Prior to the introduction of projections, any key lookup where the key didn't contain the shard key would require a broadcast to every leaf node to look up rows having that key on that node.

Consider this query.

SELECT * FROM lineitem WHERE l_partkey = 8752612;

If you run the EXPLAIN statement for this query, you will notice "Gather partitions:all" in the output, which means a fanout is used to send a message to every leaf node to look up rows having that key on that node. This is an abridged version of the plan:

EXPLAIN SELECT * FROM lineitem WHERE l_partkey = 8752612;
+------------------------------------------------------------------+
| EXPLAIN                                                          |
+------------------------------------------------------------------+
| . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  |    
| Gather partitions:all alias:remote_0 parallelism_level:segment |
| . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  | 
+------------------------------------------------------------------+

If the lookup key contains the shard key, then a single partition query can be used to query only the leaf node containing the key. A projection can be created where the lookup key (l_partkey) contains the shard key:

CREATE PROJECTION IF NOT EXISTS lineitem_shard_key_sort_key_partkey
(SHARD KEY(l_partkey), SORT KEY(l_partkey))
AS SELECT * FROM lineitem;

Now if you run the EXPLAIN statement, you notice "Gather partitions:single" in the output. This means that the aggregator node sends a message only to the node containing the partition that may contain the requested key. This is an abridged version of the plan:

EXPLAIN SELECT * FROM lineitem WITH (use_projection='lineitem_shard_key_sort_key_partkey')
WHERE l_partkey = 527508935;
+--------------------------------------------------------------------+
| EXPLAIN                                                            |
+--------------------------------------------------------------------+
| . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  |    
| Gather partitions:single alias:remote_0 parallelism_level:segment |
| . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  |  
+--------------------------------------------------------------------+

Note that the projection query hint is currently required in this case. Otherwise, the projection won't be used and you'll notice "Gather partitions:all" indicating a fanout Gather.

You can further optimize this example by adding hash index keys on l_orderkey and l_partkey on their respective projections.

Last modified: April 23, 2024

Was this article helpful?