Configuration Settings

The SingleStore Spark Connector leverages Spark SQL’s Data Sources API.

The singlestore-spark-connector is configurable globally via Spark options and locally when constructing a DataFrame. The global and local options use the same names; however, the global options have the prefix spark.datasource.singlestore (see the sketch below). The connection to SingleStore relies on the Spark configuration settings listed in the following table.
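For example, here is a minimal sketch of the two styles, assuming the connector is on the classpath; the endpoint, credentials, and table names are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Global: options prefixed with spark.datasource.singlestore apply to
// every SingleStore read and write on this session. The endpoint and
// credentials below are placeholders.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("singlestore-config-example")
  .config("spark.datasource.singlestore.ddlEndpoint", "master-agg.foo.internal:3308")
  .config("spark.datasource.singlestore.user", "root")
  .config("spark.datasource.singlestore.password", "placeholder-password")
  .getOrCreate()

// Local: the same option names, without the prefix, set on a single read.
// They override the global values for this DataFrame only.
val df = spark.read
  .format("singlestore")
  .option("database", "demo")
  .option("dbtable", "events")
  .load()
```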

| Option | Description | Default Value |
|--------|-------------|---------------|
| ddlEndpoint (required) | The hostname or IP address of the SingleStore Master Aggregator, in the host[:port] format, where :port is optional. Example: master-agg.foo.internal:3308 or master-agg.foo.internal. | |
| dmlEndpoint | The hostname or IP address of the SingleStore aggregator nodes to run queries against, in the host[:port],host[:port],... format, where :port is optional and multiple hosts are separated by commas. Example: child-agg:3308,child-agg2. | ddlEndpoint |
| user | The SingleStore username. | root |
| password | The SingleStore password. | |
| query | The query to run (mutually exclusive with dbtable). | |
| dbtable | The table to query (mutually exclusive with query). | |
| database | If set, all connections use this database by default. | empty |
| overwriteBehavior | Specifies the behavior during Overwrite. It can have one of the following values: dropAndCreate, truncate, merge. | dropAndCreate |
| truncate | Deprecated: use overwriteBehavior instead. Truncates instead of dropping an existing table during Overwrite. | false |
| loadDataCompression | Compresses data on load. It can have one of the following values: GZip, LZ4, Skip. | GZip |
| disablePushdown | Disables SQL pushdown when running queries. | false |
| enableParallelRead | Enables reading data in parallel for some query shapes. It can have one of the following values: disabled, automaticLite, automatic, forced. For more information, see Parallel Read Support. | automaticLite |
| parallelRead.Features | Specifies a comma-separated list of parallel read features, tried in the order they are listed. The supported features are ReadFromLeaves, ReadFromAggregators, and ReadFromAggregatorsMaterialized. Example: ReadFromAggregators,ReadFromAggregatorsMaterialized. For more information, see Parallel Read Support. | ReadFromAggregators |
| parallelRead.tableCreationTimeoutMS | Specifies the amount of time (in milliseconds) the reader waits for the result table to be created when using the ReadFromAggregators feature. If set to 0, the timeout is disabled. | 0 |
| parallelRead.tableCreationTimeoutMaterializedMS | Specifies the amount of time (in milliseconds) the reader waits for the result table to be created when using the ReadFromAggregatorsMaterialized feature. If set to 0, the timeout is disabled. | 0 |
| parallelRead.repartition | Repartitions data before reading. | false |
| parallelRead.repartition.columns | Specifies a comma-separated list of columns used for repartitioning (when parallelRead.repartition is enabled). By default, an additional column with a RAND() value is used for repartitioning. | |
| tableKey | Specifies additional keys to add to tables created by the connector. See Load Data from Spark Examples for more information. | |
| onDuplicateKeySQL | If this option is specified and a row to be inserted duplicates an existing PRIMARY KEY or UNIQUE index value, SingleStore performs an UPDATE on the existing row instead. See Load Data from Spark Examples for more information. | |
| insertBatchSize | Specifies the batch size used for row insertion. | 10000 |
| loadDataFormat | Serializes data on load. It can have one of the following values: Avro, CSV. | CSV |
| maxErrors | The maximum number of errors allowed in a single LOAD DATA request. When this limit is reached, the load fails. If set to 0, there is no error limit. | 0 |
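As a combined sketch of several of the options above (the demo database, orders table, and total column are illustrative, and spark is the session configured earlier):

```scala
// Read: run an ad hoc query (mutually exclusive with dbtable) and allow
// parallel reads from the aggregators where the query shape supports it.
val orders = spark.read
  .format("singlestore")
  .option("query", "SELECT id, total FROM demo.orders WHERE total > 100")
  .option("enableParallelRead", "automatic")
  .option("parallelRead.Features", "ReadFromAggregators,ReadFromAggregatorsMaterialized")
  .load()

// Append: turn duplicate-key inserts into updates
// (INSERT ... ON DUPLICATE KEY UPDATE), batching 5,000 rows per statement.
orders.write
  .format("singlestore")
  .option("onDuplicateKeySQL", "total = VALUES(total)")
  .option("insertBatchSize", "5000")
  .mode("append")
  .save("demo.orders_upserts")

// Overwrite: merge into the existing table instead of dropping and
// recreating it (the dropAndCreate default).
orders.write
  .format("singlestore")
  .option("overwriteBehavior", "merge")
  .mode("overwrite")
  .save("demo.orders_copy")
```

Note that when onDuplicateKeySQL is set, the connector writes through batched INSERT statements (sized by insertBatchSize) rather than LOAD DATA, so loadDataFormat and loadDataCompression do not apply to that write.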