Configuration Options for Different Sources

SingleStore supports a number of configuration options for different sources. These options can be used with the CONFIG clause in the CREATE PIPELINE command.
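Because the CONFIG clause takes a JSON object embedded in a SQL string literal, it can be less error-prone to build that JSON programmatically. A minimal sketch in Python (the option names and values here are illustrative, not a recommended configuration):

```python
import json

# Illustrative pipeline options; any valid CONFIG keys can be used.
config = {
    "operation.timeout.ms": "30000",
    "socket.keepalive.enable": "true",
}

# json.dumps guarantees valid JSON (correct quoting, no trailing commas),
# which avoids common hand-editing mistakes in the CONFIG string.
config_clause = "CONFIG '" + json.dumps(config) + "'"
print(config_clause)
```

The resulting string can then be spliced into a CREATE PIPELINE statement.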

Kafka Configurations

The following table shows the SingleStore-specific configurations for a Kafka environment.

spoof.dns

Used while connecting to Kafka via a proxy, for example, when connecting across multiple cloud services. Use spoof.dns to re-route the connections to the proxy without modifying the Kafka broker configuration.

operation.timeout.ms

Specifies a timeout for operations such as metadata requests and message consumption/production. This value can be adjusted based on the size of the consumed/produced dataset.

Default: 10 seconds

CONFIG '{"operation.timeout.ms" : "10000"}'

sasl.kerberos.cache

Used with Kerberos authentication to specify where to cache Kerberos tickets. When this value is not specified, the cache defaults to the path formed by appending /pipeline_digest to the sasl.tmpdir directory.

Default sasl.tmpdir: /tmp

CONFIG '{"sasl.kerberos.cache" : "/tmp"}'

sasl.kerberos.disable.kinit

Use this parameter if the client does not support kinit and refresh tokens with SingleStore. Running kinit is not required if a background process keeps the Kerberos ticket cache up to date.

CONFIG '{"sasl.kerberos.disable.kinit" : true}'

The CONFIG clause of a Kafka pipeline can accept a spoof.dns element as an alternative to configuring Kafka brokers. The spoof.dns element must be a JSON object consisting of an arbitrary number of key-value pairs with URL string values. When the pipeline attempts to connect to a Kafka broker whose URL matches one of the keys, the pipeline will connect to the corresponding URL value, effectively remapping the broker URLs inside the pipeline Kafka client.
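Conceptually, spoof.dns behaves like a lookup table applied before each broker connection. A rough Python sketch of that remapping logic (the endpoints are illustrative; this models the documented behavior, not SingleStore's implementation):

```python
# Illustrative spoof.dns mapping: broker endpoint -> proxy endpoint.
spoof_dns = {
    "broker-1.example.com:9092": "proxy.example.com:9001",
    "broker-2.example.com:9092": "proxy.example.com:9002",
}

def resolve(broker: str) -> str:
    """Return the remapped endpoint if one exists, else the original."""
    return spoof_dns.get(broker, broker)

print(resolve("broker-1.example.com:9092"))  # remapped to the proxy
print(resolve("broker-9.example.com:9092"))  # unmapped brokers pass through
```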

The following CREATE PIPELINE command sets the AWS PrivateLink configuration for Kafka brokers with Amazon MSK.

CREATE PIPELINE <pipeline_name> AS LOAD DATA KAFKA '<Kafka bootstrap server endpoint>:<port>/<topic name>'
CONFIG '{
  "spoof.dns": {
    "<broker 1 endpoint>:<port>":"<SingleStore shared endpoint (outbound)>:<NLB listener port for broker 1>",
    "<broker 2 endpoint>:<port>":"<SingleStore shared endpoint (outbound)>:<NLB listener port for broker 2>",
    "<broker 3 endpoint>:<port>":"<SingleStore shared endpoint (outbound)>:<NLB listener port for broker 3>"
  }
}'
INTO TABLE <table_name>;

Kafka supports additional configuration options. For the full list, consult the CONFIGURATION.md file in the librdkafka project on GitHub.

Note

Some configuration options are not supported in SingleStore. The client receives a "Forbidden Key" error when attempting to use an unsupported configuration option.

The configuration below controls several aspects of the consumer's behavior (e.g., timeouts, fetching behavior, and message handling). Adjust these parameters to optimize the performance and reliability of the Kafka consumer for your environment and requirements.

CREATE PIPELINE p AS LOAD DATA KAFKA 'host.example.com:9092/whatever'
CONFIG '{"fetch.max.bytes": "52428800", "topic.metadata.refresh.interval.ms": "300000", "message.max.bytes": "1000000",
"fetch.wait.max.ms": "500", "session.timeout.ms": "45000", "topic.metadata.refresh.fast.interval.ms": "100",
"fetch.min.bytes": "1", "max.partition.fetch.bytes": "1048576", "fetch.message.max.bytes": "1048576",
"socket.keepalive.enable": "true", "fetch.error.backoff.ms": "500", "socket.timeout.ms": "60000"}'
INTO TABLE t FORMAT CSV;

The following configuration sets several of the communication options used with Kafka brokers (e.g., timeouts, batching behavior, and resource usage). Set these parameters based on your application requirements and your specific Kafka deployment environment.

CREATE PIPELINE p AS LOAD DATA KAFKA 'host.example.com:9092/whatever2'
CONFIG '{"connections.max.idle.ms": "230000", "client.id": "<client_id>", "fetch.max.bytes": "1000000",
"operation.timeout.ms": "30000", "batch.num.messages": "1000", "socket.keepalive.enable": "false",
"socket.timeout.ms": "60000"}'
INTO TABLE t FORMAT CSV;

S3 Configurations

The following table shows the SingleStore-specific configurations for S3.

disable_gunzip

When this parameter is set to true, files with the .gz extension are not decompressed.

When this parameter is disabled or missing, files with the .gz extension are decompressed.

CONFIG '{"disable_gunzip" : true}'

request_payer

Specifies who is responsible for paying for the data transfer and request costs associated with accessing an S3 bucket.

By default, the owner of an S3 bucket is responsible for paying these costs. However, when using the request_payer parameter, the requester will be responsible for covering the costs associated with the request. This can include costs such as GET, PUT, and LIST requests, as well as data transfer charges.

CONFIG '{"request_payer" : "name"}'

endpoint_url

Specifies the URL of the S3-compatible storage provider. This parameter can be used to direct requests to a non-standard endpoint, such as an S3-compatible service other than AWS. For example, MinIO, an S3-compatible storage provider, or a private cloud object store that exposes an S3-compatible interface.

CONFIG '{"endpoint_url" : "sample_url"}'

compatibility_mode

Instructs the downloader to use S3 API calls that are better supported by third parties.

CONFIG '{"compatibility_mode" : true}'

file_compression

Decompresses files with the specified extensions. It can have the following values: "gz", "lz4", "auto", and "disable". This parameter overrides disable_gunzip.

CONFIG '{"file_compression" : "gz"}'

file_time_threshold

If set, files last modified before the specified timestamp are not ingested. The timestamp must be specified as an integer Unix timestamp (seconds since the epoch).

CONFIG '{"file_time_threshold" : 10070010}'
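A Unix timestamp for a human-readable cutoff date can be computed as follows (the date here is illustrative):

```python
from datetime import datetime, timezone

# Illustrative cutoff: ignore files last modified before 2024-01-01 UTC.
cutoff = datetime(2024, 1, 1, tzinfo=timezone.utc)
file_time_threshold = int(cutoff.timestamp())
print(file_time_threshold)  # 1704067200
```

The resulting integer is the value to place in the file_time_threshold key of the CONFIG clause.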

No CONFIG clause is required to create an S3 pipeline. This clause is used to specify settings such as the Amazon S3 region where the source bucket is located or the endpoint of an S3-compatible object store. If no CONFIG clause is specified, SingleStore automatically uses the us-east-1 region, also known as US Standard in the Amazon S3 console. To specify a different region, such as us-west-1, include a CONFIG clause as shown in the example below.

The CONFIG clause can also be used to specify the suffixes of files to load. The suffixes are a JSON array of strings. When specified, CREATE PIPELINE only loads files that have one of the specified suffixes. Suffixes can be specified without a leading . before them, for example, CONFIG '{"suffixes": ["csv"]}'.

CREATE OR REPLACE PIPELINE <pipeline_name>
AS LOAD DATA S3 'data-test-bucket'
CONFIG '{"region": "us-east-1","request_payer": "requester", "endpoint_url": "https://storage.googleapis.com", "compatibility_mode": true}'
CREDENTIALS '{"aws_access_key_id": "ANIAVX7U2LM9QVJMK2ZT",
"aws_secret_access_key": "xxxxxxxxxxxxxxxxxxxxxxx"}'
INTO TABLE market_data
(ts, timestamp, event_type, ticker, price, quantity, exchange, conditions);
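Conceptually, suffix filtering accepts a file when its name ends with one of the configured suffixes, with a leading . implied when omitted. A rough Python sketch of that behavior (this models the documented behavior, not SingleStore's actual implementation):

```python
def matches_suffix(filename: str, suffixes: list[str]) -> bool:
    """Return True if the filename ends with any configured suffix.

    A suffix given without a leading "." is treated as an extension,
    i.e. "csv" matches files ending in ".csv".
    """
    for suffix in suffixes:
        if not suffix.startswith("."):
            suffix = "." + suffix
        if filename.endswith(suffix):
            return True
    return False

print(matches_suffix("trades.csv", ["csv"]))   # True
print(matches_suffix("trades.json", ["csv"]))  # False
```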

Azure Blob Configurations

The following table shows the SingleStore-specific configurations for Azure Blobs.

disable_gunzip

When this parameter is set to true, files with the .gz extension are not decompressed.

When this parameter is disabled or missing, files with the .gz extension are decompressed.

CONFIG '{"disable_gunzip" : true}'

Note that no CONFIG clause is required to create an Azure pipeline unless you need to specify the suffixes for files to load. These suffixes are a JSON array of strings. When specified, CREATE PIPELINE only loads files that have the specified suffix. Suffixes in the CONFIG clause can be specified without a . before them, for example, CONFIG '{"suffixes": ["csv"]}'.

GCS Configurations

The following table shows the SingleStore-specific configurations for GCS.

disable_gunzip

When this parameter is set to true, files with the .gz extension are not decompressed.

When this parameter is disabled or missing, files with the .gz extension are decompressed.

CONFIG '{"disable_gunzip" : true}'

HDFS Configurations

The following table shows the SingleStore-specific configurations for HDFS.

disable_partial_check

When this parameter is set to true, a pipeline is created that imports Hive output files. When the pipeline runs, the extractor imports files, but does not check for additional files in the directory.

CONFIG '{"disable_partial_check" : true}'

disable_gunzip

When this parameter is set to true, files with the .gz extension are not decompressed.

When this parameter is disabled or missing, files with the .gz extension are decompressed.

CONFIG '{"disable_gunzip" : true}'

Filesystem Configurations

The following table shows the SingleStore-specific configurations for the filesystem.

disable_gunzip

When this parameter is set to true, files with the .gz extension are not decompressed.

When this parameter is disabled or missing, files with the .gz extension are decompressed.

CONFIG '{"disable_gunzip" : true}'

process_zero_byte_files

When this parameter is set to true, zero-byte files are processed.

When this parameter is disabled or missing, zero-byte files are not processed.

CONFIG '{"process_zero_byte_files" : true}'
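Taken together, the two filesystem options above can be thought of as simple pre-checks applied to each file. A rough Python sketch of that decision logic (this models the documented behavior, not SingleStore's implementation):

```python
def should_process(filename: str, size: int,
                   process_zero_byte_files: bool = False) -> bool:
    """Skip empty files unless process_zero_byte_files is enabled."""
    return size > 0 or process_zero_byte_files

def should_gunzip(filename: str, disable_gunzip: bool = False) -> bool:
    """Decompress .gz files unless disable_gunzip is enabled."""
    return filename.endswith(".gz") and not disable_gunzip

print(should_process("empty.csv", 0))                      # False
print(should_process("empty.csv", 0, True))                # True
print(should_gunzip("data.csv.gz"))                        # True
print(should_gunzip("data.csv.gz", disable_gunzip=True))   # False
```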

Last modified: November 12, 2024
