Parallel Read Support
You can enable parallel reads via the enableParallelRead option.
(spark.read.format("singlestore")
    .option("enableParallelRead", "automatic")
    .option("parallelRead.Features", "readFromAggregatorsMaterialized,readFromAggregators")
    .option("parallelRead.repartition", "true")
    .option("parallelRead.repartition.columns", "a, b")
    .option("parallelRead.TableCreationTimeout", "1000")
    .load("db.table"))
SingleStore Helios supports parallel reads from SingleStore Spark Connector versions 3.
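The same options can also be kept in a dictionary and applied in a loop, which makes it easier to switch parallel read settings per job. The following is a minimal Python sketch, not part of the connector: PARALLEL_READ_OPTIONS and apply_options are hypothetical helper names, and the reader is assumed to be any object with a chainable .option(key, value) method, such as the reader used in the example above. The option names and values are copied from that example.

```python
# Option names and values copied from the example above.
PARALLEL_READ_OPTIONS = {
    "enableParallelRead": "automatic",
    "parallelRead.Features": "readFromAggregatorsMaterialized,readFromAggregators",
    "parallelRead.repartition": "true",
    "parallelRead.repartition.columns": "a, b",
    "parallelRead.TableCreationTimeout": "1000",
}


def apply_options(reader, options):
    """Apply each key/value pair through reader.option() and return the result.

    `reader` is assumed to expose a chainable .option(key, value) method,
    as a Spark DataFrameReader does. (Hypothetical helper for illustration.)
    """
    for key, value in options.items():
        reader = reader.option(key, value)
    return reader
```

With a live Spark session, this would be used as apply_options(spark.read.format("singlestore"), PARALLEL_READ_OPTIONS).load("db.table").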
enableParallelRead Modes
The enableParallelRead option can have one of the following values:
- disabled: Disables parallel reads and performs non-parallel reads.
- automaticLite: Performs parallel reads if at least one parallel read feature specified in parallelRead.Features is supported. Otherwise, performs a non-parallel read. In automaticLite mode, a non-parallel read is used once an outer sorting operation (for example, a nested SELECT statement where sorting is done in the top-level SELECT) has been pushed down into SingleStore.
- automatic: Performs parallel reads if at least one parallel read feature specified in parallelRead.Features is supported. Otherwise, performs a non-parallel read. In automatic mode, the singlestore-spark-connector is unable to push down an outer sorting operation into SingleStore. Final sorting is done at the Spark end of the operation.
- forced: Performs parallel reads if at least one parallel read feature specified in parallelRead.Features is supported. Otherwise, it returns an error. In forced mode, the singlestore-spark-connector is unable to push down an outer sorting operation into SingleStore. Final sorting is done at the Spark end of the operation.
Note
By default, enableParallelRead is set to automaticLite.
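The mode rules above can be summarized as a small decision function. This is a Python sketch that only models the documented behavior; it is not the connector's implementation. The function name choose_read_strategy, its parameters, and the "parallel" / "non-parallel" return strings are all hypothetical.

```python
def choose_read_strategy(mode, requested_features, supported_features,
                         outer_sort_pushed_down=False):
    """Return "parallel" or "non-parallel" for an enableParallelRead mode.

    Illustrative model of the documented rules only (hypothetical helper).
    requested_features: features listed in parallelRead.Features, in order
    supported_features: set of features that are currently supported
    outer_sort_pushed_down: only meaningful for automaticLite
    """
    if mode == "disabled":
        return "non-parallel"
    if mode not in ("automaticLite", "automatic", "forced"):
        raise ValueError(f"unknown enableParallelRead mode: {mode!r}")

    any_supported = any(f in supported_features for f in requested_features)

    if mode == "forced":
        # forced never falls back: it reports an error instead.
        if not any_supported:
            raise RuntimeError("no requested parallel read feature is supported")
        return "parallel"

    if not any_supported:
        return "non-parallel"
    # automaticLite gives up parallelism once an outer sort was pushed down.
    if mode == "automaticLite" and outer_sort_pushed_down:
        return "non-parallel"
    return "parallel"
```

For example, automaticLite with a pushed-down outer sort yields a non-parallel read, while forced with no supported feature raises an error rather than falling back.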
Parallel Read Features
The SingleStore Spark Connector supports the following parallel read features:
- readFromAggregators
- readFromAggregatorsMaterialized
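The connector uses the first feature specified in parallelRead.Features that meets all the requirements. The requirements for each feature are listed below. By default, the connector uses the readFromAggregators feature. You can repartition the result set for the readFromAggregators and readFromAggregatorsMaterialized features.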
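The "first feature that meets all the requirements" rule amounts to an ordered search over the comma-separated parallelRead.Features value. A minimal Python sketch follows; select_feature and the meets_requirements callback are hypothetical stand-ins for whatever checks the connector performs internally.

```python
def select_feature(features_option, meets_requirements):
    """Return the first listed feature whose requirements are met, else None.

    features_option:    comma-separated string, as passed to parallelRead.Features
    meets_requirements: callable(feature_name) -> bool; a hypothetical stand-in
                        for the connector's own requirement checks
    """
    for name in (part.strip() for part in features_option.split(",")):
        if name and meets_requirements(name):
            return name
    return None
```

Because the list is tried in order, placing readFromAggregatorsMaterialized first (as in the example at the top of this page) makes it the preferred feature, with readFromAggregators as the fallback.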
readFromAggregators
When this feature is used, the number of partitions in the resulting DataFrame is the least of the number of partitions in the SingleStore database and Spark parallelism level (i.spark.
for all executors).parallelRead.
option.
Use the parallelRead.TableCreationTimeout option to specify a timeout for result table creation.
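As a worked example of the partition rule for readFromAggregators: the resulting DataFrame gets the smaller of the two counts. The Python sketch below takes the Spark parallelism level as a number supplied by the caller, since how it is derived is not reproduced here; the function name is hypothetical, and the sketch ignores any other option that may affect the partition count.

```python
def read_from_aggregators_partitions(db_partitions, spark_parallelism):
    """Partition count of the resulting DataFrame under the rule above.

    Both arguments are assumed to be positive integers supplied by the caller.
    (Hypothetical helper; it only restates the documented minimum rule.)
    """
    if db_partitions < 1 or spark_parallelism < 1:
        raise ValueError("both counts must be positive")
    return min(db_partitions, spark_parallelism)
```

With 16 database partitions and a Spark parallelism level of 8, the DataFrame would have 8 partitions; with 4 database partitions and the same parallelism, it would have 4.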
Requirements
To use this feature, the following requirements must be met:
- SingleStore Spark Connector version 3.2+
- Either the database option is set, or the database name is specified in the load option
- SingleStore parallel read functionality supports the generated query
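The first two requirements can be checked before a read is attempted; the third depends on the generated query and is not modeled here. The following Python sketch is illustrative only: meets_basic_requirements is a hypothetical helper, the version is assumed to be a string that starts with "major.minor", and a database name in load is assumed to appear as a "db.table" prefix.

```python
def meets_basic_requirements(connector_version, options, load_arg):
    """Check the two requirements that can be verified without running a query.

    connector_version: string such as "3.2.1" (assumed "major.minor[...]" format)
    options:           dict of reader options
    load_arg:          string passed to load(), e.g. "db.table" or "table"
    """
    major, minor = (int(part) for part in connector_version.split(".")[:2])
    version_ok = (major, minor) >= (3, 2)
    # Either the database option is set, or load() names the database.
    database_ok = bool(options.get("database")) or "." in load_arg
    return version_ok and database_ok
```

Table or database names that themselves contain a dot would need more careful parsing than this sketch does.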
readFromAggregatorsMaterialized
When using this feature, the number of partitions in the resulting DataFrame will be the same as the number of partitions in the SingleStore database. This feature is similar to the readFromAggregators feature, except that readFromAggregatorsMaterialized uses the MATERIALIZED option to create the result table. The MATERIALIZED option may cause a query to fail if SingleStore does not have enough memory to materialize the result set.
Use the parallelRead.
option to specify a timeout for materialized result table creation.
Requirements
To use this feature, the following requirements must be met:
- SingleStore Spark Connector version 3.2+
- Either the database option is set, or the database name is specified in the load option
- SingleStore parallel read functionality supports the generated query
Parallel Read Repartitioning
You can repartition the result using the parallelRead.repartition option for the readFromAggregators and readFromAggregatorsMaterialized features to ensure that each task reads approximately the same amount of data.
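The repartitioning options from the example at the top of this page can be generated from a list of column names. This is a small Python sketch: repartition_options is a hypothetical helper, and treating the columns option as optional is an assumption made here for illustration.

```python
def repartition_options(columns=None):
    """Build reader options that turn on parallel read repartitioning.

    columns: optional list of column names; joined with ", " to match the
             "a, b" form used for parallelRead.repartition.columns above.
             (Omitting the columns option is an assumption of this sketch.)
    """
    options = {"parallelRead.repartition": "true"}
    if columns:
        options["parallelRead.repartition.columns"] = ", ".join(columns)
    return options
```

Calling repartition_options(["a", "b"]) produces the same two repartition settings used in the example at the top of this page.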
Last modified: February 23, 2024