Distributed Joins
Reference Joins
As a general rule, SingleStore will efficiently execute any join query with a single sharded table and as many reference tables as you’d like. Since reference tables are fully replicated on every machine in the cluster, leaf nodes can join against their local copies of reference tables.
These queries leverage reference joins:
SELECT * FROM a INNER JOIN r ON a.a1 = r.r1; SELECT * FROM r LEFT JOIN a ON a.a1 = r.r1; SELECT * FROM a, r r1, r r2, r r3; SELECT * FROM a INNER JOIN (SELECT DISTINCT r1 FROM r) x ON a.a1 = x.c;
Aligning Shard Keys for Performance
Aligning the shard keys of large tables enables more efficient joining. It is possible to perform arbitrary distributed joins across any tables and along any column. However, if you join two tables with identical shard key signatures along that shard key, the joins will be performed local to the partitions, reducing network overhead.
CREATE TABLE users ( id BIGINT AUTO_INCREMENT, user_name VARCHAR(1000), account_id BIGINT, PRIMARY KEY (id) ); CREATE TABLE clicks ( click_id BIGINT AUTO_INCREMENT, account_id BIGINT, user_id BIGINT, page_id INT, ts TIMESTAMP, SHARD KEY (user_id), PRIMARY KEY (click_id, user_id) );
In this example, id
is the shard key of the users
table, and the shard key on the clicks
table has the same signature (a single BIGINT). These queries join locally without network overhead:
SELECT * FROM users INNER JOIN clicks ON users.id = clicks.user_id WHERE clicks.page_id = 10; SELECT avg(c1.ts - c2.ts) FROM clicks c1 INNER JOIN clicks c2 ON c1.user_id = c2.user_id WHERE c1.page_id > c2.page_id;
Whereas this query will stream rows between leaf nodes:
SELECT u.account_id, count(distinct user_id), count(1) FROM users u INNER JOIN clicks c ON u.account_id = c.account_id GROUP BY u.account_id;
If you identify your data layout and join patterns in advance, this technique can be an extremely effective way to run performant joins between distributed tables. For more information about how queries execute, see EXPLAIN.