Parallel Read Support

Enable parallel reads with the enableParallelRead option. A parallel read splits the work across multiple Spark tasks, which can substantially improve read performance in some cases.
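As a rough illustration, the option can be passed together with the other reader options. The sketch below is written for PySpark and only builds the options dictionary, so it runs without a cluster. The endpoint, user name, password, and table name are placeholders, and the value `"automatic"` is an assumption: the accepted values for enableParallelRead depend on the connector version, so check the option reference for the release you use.

```python
def singlestore_read_options(endpoint, user, password, parallel_read="automatic"):
    """Build a dictionary of reader options for the SingleStore Spark connector.

    The default `parallel_read` value is an assumption; accepted values
    differ between connector versions.
    """
    return {
        "ddlEndpoint": endpoint,
        "user": user,
        "password": password,
        "enableParallelRead": parallel_read,
    }


# Placeholder connection details, for illustration only.
opts = singlestore_read_options(
    "singlestore-master.example.internal:3306", "spark_reader", "change-me"
)

# With a live SparkSession `spark` and the connector on the classpath,
# the options would be applied like this (hypothetical table name):
# df = spark.read.format("singlestore").options(**opts).load("my_db.my_table")
```

Keeping the options in one place makes it easy to switch parallel reads on or off for a single job without touching the rest of the read logic.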

Note: Parallel reads are not consistent

Parallel reads read directly from partitions on the leaf nodes, bypassing the transaction layer. As a result, each individual read sees an independent version of the database's distributed state. If queries other than the read itself run on the database at the same time, they may affect the result of the read in progress. Take this into account before enabling parallel reads.

Note: Parallel reads transparently fall back to single-stream reads

Parallel reads currently work only for query shapes that do no work on the aggregator and can therefore be pushed down entirely to the leaf nodes. Any other query falls back to a single-stream read. To determine whether a particular query is being pushed down, check how many partitions the DataFrame has:

df.rdd.getNumPartitions

If this value is greater than 1, the read is running in parallel from the leaf nodes.
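The partition-count check can be wrapped in a small helper so a job can log or assert whether it actually got a parallel read. This is a minimal sketch for PySpark, where `getNumPartitions` is called as a method. The function name `reads_in_parallel` is hypothetical and not part of the connector.

```python
def reads_in_parallel(df):
    """Return True when the DataFrame is backed by more than one partition.

    `df` is expected to be a Spark DataFrame; only `df.rdd.getNumPartitions()`
    is used, so no Spark import is needed here.
    """
    return df.rdd.getNumPartitions() > 1


# Example use inside a job, given a DataFrame `df` loaded through the connector:
# if not reads_in_parallel(df):
#     print("Query fell back to a single-stream read")
```

Because the fallback is transparent, a check like this is the simplest way to notice that a change to the query has moved it out of the parallel path.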

Note: Parallel reads require consistent authentication and connectible leaf nodes

To use parallel reads, the username and password provided to the singlestore-spark-connector must be the same across all nodes in the cluster.

In addition, the hostnames and ports listed by SHOW LEAVES must be directly connectible from Spark.
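One way to sanity-check the connectivity requirement is to attempt a plain TCP connection to each host and port pair reported by SHOW LEAVES. The sketch below uses only the Python standard library. The function name `unreachable_leaves` and the example host names are hypothetical, and the list of leaves is assumed to have been copied from the SHOW LEAVES output by hand. A successful TCP connect shows only that the port is reachable from the machine running the check; it does not validate credentials, and the Spark executors need the same connectivity as the driver.

```python
import socket


def unreachable_leaves(leaves, timeout=3.0):
    """Return the (host, port) pairs that cannot be reached over TCP.

    `leaves` is an iterable of (host, port) tuples. A pair is reported when
    the connection is refused, times out, or the host name does not resolve.
    """
    failed = []
    for host, port in leaves:
        try:
            # Open and immediately close a TCP connection to the leaf.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            failed.append((host, port))
    return failed


# Hypothetical values copied from SHOW LEAVES:
# print(unreachable_leaves([("leaf-1.example.internal", 3306),
#                           ("leaf-2.example.internal", 3306)]))
```

An empty result means every listed leaf accepted a connection from the machine where the check ran.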

Last modified: February 23, 2024
