Distributed Joins

Aligning Shard Keys for Performance

Aligning the shard keys of large tables enables more efficient joins. You can perform arbitrary distributed joins across any tables and along any columns. However, if you join two tables with identical shard key signatures along that shard key, the join is performed locally on each partition, which reduces network overhead.

SQL

CREATE TABLE users (
    id BIGINT AUTO_INCREMENT,
    user_name VARCHAR(1000),
    account_id BIGINT,
    PRIMARY KEY (id)
);

CREATE TABLE clicks (
    click_id BIGINT AUTO_INCREMENT,
    account_id BIGINT,
    user_id BIGINT,
    page_id INT,
    ts TIMESTAMP,
    SHARD KEY (user_id),
    PRIMARY KEY (click_id, user_id)
);

In this example, the users table declares no explicit shard key, so its primary key id serves as the shard key. The shard key on the clicks table has the same signature (a single BIGINT). The following queries join along those shard keys, so they execute locally without network overhead:

SQL

SELECT * FROM users INNER JOIN clicks ON users.id = clicks.user_id WHERE clicks.page_id = 10;

SELECT avg(c1.ts - c2.ts) FROM clicks c1 INNER JOIN clicks c2 ON c1.user_id = c2.user_id WHERE c1.page_id > c2.page_id;

By contrast, the following query joins on account_id, which is not the shard key of either table, so it streams rows between leaf nodes:

SQL

SELECT u.account_id, count(distinct c.user_id), count(1)
FROM users u INNER JOIN clicks c ON u.account_id = c.account_id
GROUP BY u.account_id;
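
If the join on account_id were the dominant pattern, one option would be to shard both tables on that column instead. The following is a minimal sketch under that assumption, not a recommended layout: the table names users_by_account and clicks_by_account are hypothetical, and account_id is added to each primary key so that the shard key remains a subset of it.

```sql
-- Hypothetical variants of the tables above, both sharded on account_id.
CREATE TABLE users_by_account (
    id BIGINT AUTO_INCREMENT,
    user_name VARCHAR(1000),
    account_id BIGINT,
    SHARD KEY (account_id),
    PRIMARY KEY (id, account_id)
);

CREATE TABLE clicks_by_account (
    click_id BIGINT AUTO_INCREMENT,
    account_id BIGINT,
    user_id BIGINT,
    page_id INT,
    ts TIMESTAMP,
    SHARD KEY (account_id),
    PRIMARY KEY (click_id, account_id)
);

-- Both shard keys are a single BIGINT and the join runs along them,
-- so rows with matching account_id values sit on the same partition.
SELECT u.account_id, count(distinct c.user_id), count(1)
FROM users_by_account u INNER JOIN clicks_by_account c ON u.account_id = c.account_id
GROUP BY u.account_id;
```

The trade-off is that a join on id = user_id would no longer run along the shard keys in this layout, and a few very large accounts could leave data unevenly distributed across partitions.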

If you identify your data layout and join patterns in advance, this technique can be an extremely effective way to run performant joins between distributed tables. For more information about how queries execute, see EXPLAIN.
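
As a quick check, you can prefix a query with EXPLAIN to see its plan without running it. The sketch below compares the shard-key join with the account_id join from the examples above. Plan text varies by version and table size, so no output is shown here. As an assumption to verify against your own output, data-movement operators such as Repartition or Broadcast would be expected only in the second plan.

```sql
-- Join along the shard keys: expected to run locally on each partition.
EXPLAIN
SELECT * FROM users INNER JOIN clicks ON users.id = clicks.user_id WHERE clicks.page_id = 10;

-- Join on account_id, which is not a shard key of either table:
-- expected to move rows between leaf nodes.
EXPLAIN
SELECT u.account_id, count(distinct c.user_id), count(1)
FROM users u INNER JOIN clicks c ON u.account_id = c.account_id
GROUP BY u.account_id;
```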
