Highly Selective Joins

Working with Highly Selective Joins

Starting in version 7.0, SingleStore introduced support for columnstore hash indexes to broaden the support for OLTP-type queries. However, a common join pattern in OLTP is to have a very selective filter on one table, which produces a few rows from the source table, and then join those rows with another table. Databases for OLTP normally use a nested loop join for this. For each row from the outer table, an index seek is done on the inner table.

SingleStore Helios supports these highly selective joins using an adaptive hash join algorithm. As a result, there’s no need to run a full columnstore scan when there’s a matching hash index. First, it builds a hash table for the table with the highly-selective filter. Then, depending on the number of rows in the hash table, the adaptive algorithm will switch strategies internally.

  • If there are only a few rows in the hash table, it switches to use a nested loop join strategy, seeking into the larger table (on the probe side) via the index on the join column of the table on the probe side.

  • If the hash build side produces a lot of rows, then it performs a normal hash join.

The following example takes advantage of this strategy for selective joins.

CREATE TABLE orders(
oid INT,
d DATETIME,
SORT KEY(d),
SHARD(oid),
KEY(oid) USING hash);
CREATE TABLE lineitems(
id INT,
oid INT,
item INT,
SORT KEY(oid),
SHARD(oid));

Now, add some sample data to orders:

INSERT INTO orders VALUES(1, NOW());

Execute the following command repeatedly, until the table has around 33.5 million rows in it.

INSERT INTO orders
SELECT oid+(SELECT MAX(oid) FROM orders), NOW()
FROM orders;

Add 67.1 million rows of data to the lineitems table, such that each line item belongs to an order, and each order has exactly two line items.

INSERT INTO lineitems SELECT oid, oid, 1 FROM orders;
INSERT INTO lineitems SELECT oid + 1000*1000*1000, oid, 2 FROM orders;

Find a selective DATETIME value for d to search on:

SELECT d, COUNT(*)
FROM orders
GROUP BY d;

The result shows that a number of DATETIME values only appear in one row in orders. For this example, let’s say that one such value is 2020-03-30 16:47:05.

The following query uses this date to produce a join result with exactly two rows:

SELECT *
FROM orders o JOIN lineitems l ON o.oid = l.oid
WHERE o.d = "2020-03-30 16:47:05";

This query filters rows on o.d to find a single row of orders, then joins to lineitems via the hash index on lineitems.oid, using the new selective join algorithm.

In the JSON profile plan, you can see if a plan may be able to perform a join via a hash index. If this optimization strategy is available in your plan, you will see join index in descriptions of columnstore filter operators. For example:

"executor":"ColumnStoreFilter",
"keyId":4294968023,
"condition":[
        "o.oid = l.oid bloom AND l.oid = o.oid join index"
],

You may see a condition that mentions join index.

o.oid = l.oid bloom AND l.oid = o.oid join index

The join index filter can also be viewed in the EXPLAIN output.

Performance of Highly Selective Joins

A higher number of join columns having a matching columnstore hash index and fewer number of hash values result in better join performance. For example, if inventory and product are two columnstore tables with hash indexes on the columns id and code, then the following query has better join performance than query with single join column:

SELECT * FROM
inventory JOIN product
ON inventory.id = product.id AND inventory.code = product.code;

See SingleStore Universal Storage – And Then There Was One for additional information about (1) using columnstores with hash indexes, sub-segment access, and fine-grain locking to enable OLTP operations on data bigger than will fit in RAM, and (2) using SPARSE rowstore compression to reduce TCO for rowstore tables with many NULL values.

Last modified: March 2, 2023

Was this article helpful?