Distributed Joins
On this page
Aligning Shard Keys for Performance
Aligning the shard keys of large tables enables more efficient joining.
CREATE TABLE users (id BIGINT AUTO_INCREMENT,user_name VARCHAR(1000),account_id BIGINT,PRIMARY KEY (id));CREATE TABLE clicks (click_id BIGINT AUTO_INCREMENT,account_id BIGINT,user_id BIGINT,page_id INT,ts TIMESTAMP,SHARD KEY (user_id),PRIMARY KEY (click_id, user_id));
In this example, id
is the shard key of the users
table, and the shard key on the clicks
table has the same signature (a single BIGINT).
SELECT * FROM users INNER JOIN clicks ON users.id = clicks.user_id WHERE clicks.page_id = 10;SELECT avg(c1.ts - c2.ts) FROM clicks c1 INNER JOIN clicks c2 ON c1.user_id = c2.user_id WHERE c1.page_id > c2.page_id;
Whereas this query will stream rows between leaf nodes:
SELECT u.account_id, count(distinct user_id), count(1)FROM users u INNER JOIN clicks c ON u.account_id = c.account_idGROUP BY u.account_id;
If you identify your data layout and join patterns in advance, this technique can be an extremely effective way to run performant joins between distributed tables.
Last modified: September 7, 2023