CREATE PROJECTION
On this page
Note
This is a Preview feature.
Creates a projection.
A projection is a copy of a table, which may include only a subset of the base table's columns.
All projections on the base table must be dropped before the base table can be dropped.DROP TABLE
operation is attempted.ALTER TABLE
operation can proceed.
Updates to projections are done in the same transaction as updates to the base table.
Projections consume storage space and require maintenance when updating your data.
If you want to save space, you can create a projection with fewer columns than the original table – only the columns needed to speed up a certain query or class of queries.
Projections can improve query performance in these scenarios:
Optimize range queries: The use of sort keys can optimize certain range filters.col1
and col2
) in the same table, using projections can be helpful.col1
as the sort key of the base table.col2
as the sort key.
Optimize ORDER BY/LIMIT Queries: Certain ORDER BY/LIMIT
queries can be made faster using a columnstore sort key.ORDER BY/LIMIT
on two different sort keys, you can make the base table be sorted by one key and a projection of the base table sorted on the second key.
Optimize grouping operations: In scenarios involving queries with different GROUP BY
clauses within your workload, the goal is to optimize query execution by minimizing data transfer across nodes.GROUP BY
, this same approach can speed up COUNT(DISTINCT …)
and SELECT DISTINCT.
operations.StreamingGroupBy
query plan operator can be utilized by the engine to reduce the memory use and improve performance.
Optimize joins: Projections can also be used to optimize data transfer in JOIN
clauses.
Optimize key lookups: If you have two different keys and want the fastest possible lookup speed on both of them, with high concurrency, you can shard the base table on one key and create a projection that shards on the other key.
Syntax
CREATE PROJECTION [IF NOT EXISTS] <projection_name>([SHARD KEY(<columns>),] [SORT KEY(<columns>),] [KEY(<columns>),...])AS SELECT <project_list>FROM <table_name>;
Arguments
projection_
The name of the projection.
table_
The projection must be defined on one columnstore table.
project_
The project list may contain any column(s) of <table>.
.*
is allowed, but it is expanded at creation time (new columns are not automatically displayed or added to the projection).
Remarks
-
Projections are only allowed with two-phase commit enabled.
If a user decides to disable two-phase commit, all projections should be dropped first. Not doing so will cause all write queries with projections to generate a warning. -
CREATE PROJECTION
is supported on columnstore tables only - both as the source table and projection object. -
CREATE PROJECTION
is not supported on temporary tables. -
Secondary
UNIQUE
keys and filtered projections are not supported. -
CREATE PROJECTION
is an offline operation, i.e. , any writes (or DDL operation) to the base table are blocked while it is running (but read queries to the base table are not blocked). The CREATE PROJECTION
operation can use lots of resources and take some time to complete with large base tables. -
BACKUP DATABASE .
. . WITH SPLIT PARTITIONS is blocked if there is a projection in a database. Projections must be dropped before running backup with partition split. -
If a table has projections based on it, truncating and renaming the base table is blocked, as well as any
ALTER TABLE
operation that modifies a column that is referred to by a projection. -
Existing projections cannot be altered.
They can be dropped and re-created. Refer to DROP PROJECTION for more information. -
CREATE PROJECTION
requiresALTER
permissions and any one of theSELECT
,INSERT
,UPDATE,
andDELETE
permissions on the base table. -
The projection query hint (
use_
) can be used in a query to force the optimizer to select the projection instead of the base table.projection='projection_ name' This is required for range filter optimization. Collocated join and group by optimization can be identified automatically by the optimizer. The following example shows the usage of this hint in a simple query shape: SELECT col1, col2, coln...FROM <table_name> WITH (use_projection='<projection_name>'); -
If set to
OFF
, theconsider_
engine variable will turn off the use of all projections in queries except where the projection query hintsecondary_ projection use_
) is used.projection='projection_ name' The default value is ON
. -
If there is a projection on a table that is being queried, the hint
disable_
can be used to force the optimizer not to use the projection but use the base table instead.projection_ replacement = 1 SELECT col1, col2, coln...FROM <table_name> WITH (disable_projection_replacement = 1); -
If a projection query hint is explicitly used, any columns queried that are not in the projection will result in an error.
The following example returns an error because col3
is not in the projectionproj_
but it's queried in thename SELECT
statement:CREATE PROJECTION IF NOT EXISTS proj_name(SHARD KEY(col1), SORT KEY(col2))AS SELECT col1, col2 FROM t1;SELECT col1, col2, col3FROM t1 WITH (use_projection='proj_name')WHERE col1 BETWEEN '' AND '';ERROR 1877 ER_BAD_HINT: projection 'proj_name' is not eligible for replacement for table 't1'
Examples
Using the TPC-H dataset, these examples use the lineitem
and the orders
tables from the tpch
database to demonstrate four use cases for utilizing projections.
The TPC-H scale-factor 100 dataset can be used to load data into both tables.
CREATE DATABASE tpch;USE tpch;
CREATE TABLE lineitem (l_orderkey BIGINT(11) NOT NULL,l_partkey INT(11) NOT NULL,l_suppkey INT(11) NOT NULL,l_linenumber INT(11) NOT NULL,l_quantity DECIMAL(15,2) NOT NULL,l_extendedprice DECIMAL(15,2) NOT NULL,l_discount DECIMAL(15,2) NOT NULL,l_tax DECIMAL(15,2) NOT NULL,l_returnflag CHAR(1) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,l_linestatus CHAR(1) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,l_shipdate DATE NOT NULL,l_commitdate DATE NOT NULL,l_receiptdate DATE NOT NULL,l_shipinstruct CHAR(25) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,l_shipmode CHAR(10) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,l_comment VARCHAR(44) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,UNIQUE KEY pk ( l_orderkey , l_linenumber ) UNENFORCED RELY,SHARD KEY ( l_orderkey ));
CREATE TABLE orders (o_orderkey BIGINT(11) NOT NULL,o_custkey INT(11) NOT NULL,o_orderstatus CHAR(1) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,o_totalprice DECIMAL(15,2) NOT NULL,o_orderdate DATE NOT NULL,o_orderpriority CHAR(15) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,o_clerk CHAR(15) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,o_shippriority INT(11) NOT NULL,o_comment VARCHAR(79) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL);
The tables can be loaded using pipelines.
Optimizing Range Filters
If your application needs to filter on the l_
column in the lineitem
table, this query will not run as fast as it could because l_
is not the sort key:
SELECT COUNT(*)FROM lineitemWHERE l_shipdate BETWEEN '1992-01-01' AND '1992-01-31';
+----------+
| COUNT(*) |
+----------+
| 861221 |
+----------+
77 ms
If a projection of the lineitem
table is created with l_
as the sort key, this will improve the execution speed of this query.use_
) must be used in the updated query for the optimizer to use the projection:
CREATE PROJECTION IF NOT EXISTS lineitem_sort_shipdate(SHARD KEY(l_orderkey), SORT KEY(l_shipdate))AS SELECT l_shipdate, l_orderkey FROM lineitem;
SELECT COUNT(*)FROM lineitem WITH (use_projection='lineitem_sort_shipdate')WHERE l_shipdate BETWEEN '1992-01-01' AND '1992-01-31';
+----------+
| COUNT(*) |
+----------+
| 861221 |
+----------+
9 ms
Optimize GROUP BY
on Shard Key Queries
If you need to know how many orders are grouped by l_
with the same date range as above, the execution of this query would not run optimally since l_
is not the shard key or the sort key:
SELECT COUNT(l_orderkey), l_shipdateFROM lineitemWHERE l_shipdate BETWEEN '1992-01-01' AND '1992-01-31'GROUP BY l_shipdate;
+-------------------+------------+
| COUNT(l_orderkey) | l_shipdate |
+-------------------+------------+
| 1853 | 1992-01-02 |
| 3814 | 1992-01-03 |
| . . . . . . . . . | . . . . . |
| 54004 | 1992-01-30 |
| 55737 | 1992-01-31 |
+-------------------+------------+
80 ms
The image below illustrates the execution of this query.l_
is not the shard key for the lineitem
table, rows from lineitem
need to be transferred between nodes so that all rows from lineitem
with the same value of l_
(the GROUP BY
column) are on the same node to enable calculation of the COUNT
aggregate.
The bold lines in the image below represent network transfers, which consume time and resources.
A projection can be created with l_
set as both the shard key and the sort key:
CREATE PROJECTION IF NOT EXISTS lineitem_shard_sort_shipdate(SHARD KEY(l_shipdate), SORT KEY (l_shipdate))AS SELECT l_orderkey, l_shipdate FROM lineitem;
After creating the projection and using the projection query hint, the query will run faster:
SELECT COUNT(l_orderkey), l_shipdateFROM lineitem WITH (use_projection='lineitem_shard_sort_shipdate')WHERE l_shipdate BETWEEN '1992-01-01' AND '1992-01-31'GROUP BY l_shipdate;
+-------------------+------------+
| COUNT(l_orderkey) | l_shipdate |
+-------------------+------------+
| 1853 | 1992-01-02 |
| 3814 | 1992-01-03 |
| . . . . . . . . . | . . . . . |
| 54004 | 1992-01-30 |
| 55737 | 1992-01-31 |
+-------------------+------------+
4 ms
The image below shows the query using the new lineitem_
projection, labeled lineitem_
in the image.
Creating a Collocated Join
The lineitem
table needs to be joined to the orders
table in order to pull the order dates into the query.orders
table:
SELECT COUNT(l_orderkey), o_orderdate, l_shipdateFROM lineitem INNER JOIN orders ON lineitem.l_orderkey = orders.o_orderkeyWHERE l_shipdate BETWEEN '1992-01-01' AND '1992-01-31'GROUP BY l_shipdate, o_orderdate;
+-------------------+-------------+-------------+
| COUNT(l_orderkey) | o_orderdate | l_shipdate |
+-------------------+-------------+-------------+
| 1853 | 1992-01-01 | 1992-01-02 |
| 1974 | 1992-01-02 | 1992-01-03 |
| . . . . . . . . . | . . . . . . | . . . . . . |
| 1833 | 1992-01-21 | 1992-01-31 |
| 1831 | 1992-01-30 | 1992-01-31 |
+-------------------+-------------+-------------+
250 ms
The image below illustrates the execution of this query.order_
(the JOIN
column) is not the shard key for the orders
table, rows from orders
need to be transferred between nodes so that all rows from orders
with a particular value of order_
are on the same node as with rows from lineitem
that have that same value of order_
.
As above, the bold lines in this image represent network transfers, which consume time and resources and which can be eliminated by using a projection.
You can create a projection and set an explicit shard key on the o_
o_
:
CREATE PROJECTION IF NOT EXISTS orders_shard_orderkey_sort_orderdate(SHARD KEY(o_orderkey), SORT KEY(o_orderdate))AS SELECT o_orderkey, o_orderdate FROM orders;
Note
The projection query hint may need to be positioned after the table that is projected in a JOIN
clause.orders
table is the projected table so the WITH (use_
syntax is placed after "INNER JOIN orders
".
SELECT COUNT(l_orderkey), o_orderdate, l_shipdateFROM lineitem INNER JOIN orders WITH (use_projection='orders_shard_orderkey_sort_orderdate') ON lineitem.l_orderkey = orders.o_orderkeyWHERE l_shipdate BETWEEN '1992-01-01' AND '1992-01-31'GROUP BY l_shipdate, o_orderdate;
+-------------------+-------------+-------------+
| COUNT(l_orderkey) | o_orderdate | l_shipdate |
+-------------------+-------------+-------------+
| 1853 | 1992-01-01 | 1992-01-02 |
| 3814 | 1992-01-01 | 1992-01-03 |
| . . . . . . . . . | . . . . . . | . . . . . . |
| 54004 | 1992-01-09 | 1992-01-30 |
| 55737 | 1992-01-26 | 1992-01-31 |
+-------------------+-------------+-------------+
139 ms
Run the SHOW PROFILE
command to confirm that the optimizer used the projection.
+--------------------------------------------------------------------------------+| PROFILE |+--------------------------------------------------------------------------------+| . . . . . . . . . . . . . . . . . . . . . . . . . . || ColumnStoreScan tpch.orders PROJECTION orders_shard_orderkey_sort_orderdate . .|| . . . . . . . . . . . . . . . . . . . . . . . . . . |+--------------------------------------------------------------------------------+
The image below shows the same query using the orders_
projection, labeled orders_
in the image.
Optimize Key Lookups
Prior to the introduction of projections, any key lookup where the key didn't contain the shard key would require a broadcast to every leaf node to look up rows having that key on that node.
Consider this query.
SELECT * FROM lineitem WHERE l_partkey = 8752612;
If you run the EXPLAIN
statement for this query, you will notice "Gather partitions:all" in the output, which means a fanout is used to send a message to every leaf node to look up rows having that key on that node.
EXPLAIN SELECT * FROM lineitem WHERE l_partkey = 8752612;
+------------------------------------------------------------------+
| EXPLAIN |
+------------------------------------------------------------------+
| . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
| Gather partitions:all alias:remote_0 parallelism_level:segment |
| . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
+------------------------------------------------------------------+
If the lookup key contains the shard key, then a single partition query can be used to query only the leaf node containing the key.l_
) contains the shard key:
CREATE PROJECTION IF NOT EXISTS lineitem_shard_key_sort_key_partkey(SHARD KEY(l_partkey), SORT KEY(l_partkey))AS SELECT * FROM lineitem;
Now if you run the EXPLAIN
statement, you notice "Gather partitions:single" in the output.
EXPLAIN SELECT * FROM lineitem WITH (use_projection='lineitem_shard_key_sort_key_partkey')WHERE l_partkey = 527508935;
+--------------------------------------------------------------------+
| EXPLAIN |
+--------------------------------------------------------------------+
| . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
| Gather partitions:single alias:remote_0 parallelism_level:segment |
| . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . |
+--------------------------------------------------------------------+
Note that the projection query hint is currently required in this case.
You can further optimize this example by adding hash index keys on l_
and l_
on their respective projections.
Last modified: October 24, 2024