SingleStore DB

ERROR: Distributed Join Error. Leaf X cannot connect to Leaf Y.

Issue

When a distributed join runs, the leaves in the cluster must reshuffle data among themselves, which requires the leaves to connect to one another. If the leaves cannot communicate with one another and a distributed join touches those leaves, the query fails. The inter-leaf communication that is needed for distributed join queries relies on the DNS cache on each leaf. If this cache is out of sync with the current state of the leaves, the distributed join will fail.

Solution

Use the following steps to troubleshoot this scenario:

  1. Confirm that you can connect to SingleStore DB from one leaf to another in the cluster. This rules out network connectivity issues.

    Notice

    You are able to connect manually from one leaf to another because doing so does not utilize the DNS cache on the leaf.

  2. Run SHOW LEAVES on an affected leaf (e.g. leaf X) in the cluster. The Opened_Connections column should reveal which leaves the affected leaf has open connections with. Verify that leaf Y is not in this list.

  3. When leaves connect to each other, they cache connection information (leaf-1 is at IP 192.0.2.1, leaf-2 is at IP 192.0.2.2, etc.). If the IPs of these leaves ever change, the cache does not update automatically. Connection attempts then fail because the other leaves in the cluster are still using the old IP address information. The solution is to flush the DNS cache and connection pools on all affected nodes. You can do so by running the following:

        FLUSH HOSTS;
        FLUSH CONNECTION POOLS;

    FLUSH HOSTS clears the DNS cache on the node. This must be performed on all affected nodes in the cluster. FLUSH CONNECTION POOLS shuts down all existing connections and closes idle pooled connections.
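
Because the flush has to be repeated on every affected node, it can be convenient to script it. The following is a minimal sketch, not an official tool: the function name flush_nodes, the (host, port) node tuples, and the connect parameter are all hypothetical names introduced for illustration. It assumes connect is a DB-API style factory that accepts host and port keyword arguments and returns a connection with cursor() and close() methods. Only the two SQL statements come from the steps above.

```python
from typing import Callable, Iterable, List, Tuple

# The two statements from step 3 above.
FLUSH_STATEMENTS = ("FLUSH HOSTS", "FLUSH CONNECTION POOLS")

Node = Tuple[str, int]  # (host, port); a hypothetical representation of a node


def flush_nodes(nodes: Iterable[Node], connect: Callable[..., object]) -> List[Node]:
    """Run both flush statements on each node and return the nodes processed.

    `connect` is an assumed DB-API style factory, called as
    connect(host=..., port=...). Credentials are deliberately left out of
    this sketch; wrap your driver's connect function to supply them.
    """
    flushed: List[Node] = []
    for host, port in nodes:
        conn = connect(host=host, port=port)
        try:
            cur = conn.cursor()
            for stmt in FLUSH_STATEMENTS:
                cur.execute(stmt)
            cur.close()
        finally:
            # Always release the connection, even if a statement fails.
            conn.close()
        flushed.append((host, port))
    return flushed
```

A possible call, assuming a MySQL-compatible Python driver is installed and both leaves listen on port 3306 (an assumption about your cluster), would pass a small wrapper around that driver's connect function together with a list such as [("192.0.2.1", 3306), ("192.0.2.2", 3306)]. If a statement fails on one node, the exception propagates and the remaining nodes are not flushed, so rerun the script for those nodes after fixing the failure.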