Best Practices for Pipelines
The following best practices ensure reliable, scalable, and high-performance data ingestion from supported sources, including Kafka, Amazon S3, Azure Blob Storage, Google Cloud Storage (GCS), and HDFS.
General Pipeline Best Practices
- Resource management: Use resource pools to isolate workloads. Pipelines waiting in the queue consume slots reserved for background pipelines, as defined by MAX_CONCURRENCY and MAX_QUEUE_DEPTH. Control the maximum number of pipelines that can run concurrently using the pipelines_max_concurrent engine variable.
- Aggregator pipelines: For sources with low parallelism (such as single-file S3 loads or single-partition Kafka topics), use aggregator pipelines. Aggregator pipelines simplify load behavior and improve performance in these scenarios.
- Use the following engine variables to control pipeline limits and defaults (see the sketch after this list):
  - pipelines_max_concurrent: Specifies the maximum number of pipelines that can run concurrently.
  - pipelines_batch_interval: Controls the default batch frequency (unless overridden at the pipeline level).
  - pipelines_stop_on_error: Determines whether a pipeline stops or continues after encountering errors.
  - pipelines_max_retries_per_batch_partition: Configures retries for batch partitions to improve resilience.
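A minimal sketch of adjusting these variables, assuming they are settable with SET GLOBAL in your deployment; the values shown are illustrative, not recommendations.

```sql
-- Illustrative values only; tune for your workload.
SET GLOBAL pipelines_max_concurrent = 50;                  -- cap on concurrently running pipelines
SET GLOBAL pipelines_batch_interval = 2500;                -- default batch frequency, in milliseconds
SET GLOBAL pipelines_stop_on_error = OFF;                  -- keep running after row-level errors
SET GLOBAL pipelines_max_retries_per_batch_partition = 4;  -- retries per batch partition
```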
- Avoid the IGNORE and SKIP ALL ERRORS options unless necessary, because they can generate many warnings, which can slow down pipelines.
- Avoid the ENCLOSED BY and OPTIONALLY ENCLOSED BY clauses in CREATE PIPELINE … LOAD DATA statements when performance is critical, since these options add parsing overhead.
- If the transform is slow, consider using a CREATE PIPELINE … SET … WHERE statement as an alternative to a transform, as in the sketch below.
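A minimal sketch of replacing a transform with in-database SET and WHERE clauses; the pipeline name, bucket path, table, and columns are hypothetical, and a CREDENTIALS clause may be required for private buckets.

```sql
-- Hypothetical pipeline: derive a column and filter rows without an external transform.
CREATE PIPELINE clicks_pipeline AS
  LOAD DATA S3 'example-bucket/clicks/*.csv'
  CONFIG '{"region": "us-east-1"}'
  INTO TABLE clicks
  FIELDS TERMINATED BY ','
  (user_id, url, @raw_ts)
  SET event_time = FROM_UNIXTIME(@raw_ts)   -- computed during load instead of in a transform
  WHERE user_id IS NOT NULL;                -- filter rows during load
```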
Resource Utilization
- Use SHOW PROFILE PIPELINE to measure batch-level resource usage, including start and end times.
- Query information_schema.PIPELINES_BATCHES_SUMMARY for batch diagnostics such as database, pipeline name, batch duration, and batch state (see the example query below).
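A sketch of such a diagnostic query; the pipeline name is hypothetical, and column names should be verified against the view in your SingleStore version.

```sql
-- Recent batches for one pipeline, newest first.
SELECT DATABASE_NAME, PIPELINE_NAME, BATCH_ID, BATCH_STATE, BATCH_TIME
FROM information_schema.PIPELINES_BATCHES_SUMMARY
WHERE PIPELINE_NAME = 'clicks_pipeline'
ORDER BY BATCH_ID DESC
LIMIT 10;
```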
- A large number of Queued batches indicates that workspace resources are overcommitted. To reduce contention (see the sketch after this list):
  - Decrease MAX_PARTITIONS_PER_BATCH to lower parallelism and reduce overhead or resource contention.
  - Increase BATCH_INTERVAL to run pipelines less frequently and ease system load.
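A minimal sketch of applying both settings to an existing pipeline; the pipeline name and values are illustrative, and both options can also be specified in the CREATE PIPELINE statement.

```sql
STOP PIPELINE clicks_pipeline;                                   -- stop before altering
ALTER PIPELINE clicks_pipeline SET MAX_PARTITIONS_PER_BATCH 4;   -- lower per-batch parallelism
ALTER PIPELINE clicks_pipeline SET BATCH_INTERVAL 10000;         -- run a batch every 10 seconds
START PIPELINE clicks_pipeline;
```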
- Split large files into smaller chunks. Large files parsed on a single leaf node can cause slow performance.
- For Amazon S3, Azure, and GCS pipelines, enable offsets metadata garbage collection with ENABLE OFFSETS METADATA GC to prevent memory growth from uncollected offsets metadata, as shown below.
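A minimal sketch, assuming the clause can be applied to an existing pipeline through ALTER PIPELINE (it can also be specified at creation time); the pipeline name is hypothetical.

```sql
-- Verify the exact placement of this clause against the CREATE/ALTER PIPELINE reference.
ALTER PIPELINE clicks_pipeline ENABLE OFFSETS METADATA GC;
```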
Monitoring
- Use the SingleStore Monitoring Dashboards:
  - Pipeline Summary Dashboard: View the overall state of pipelines, including running, stopped, and errored pipelines.
  - Pipeline Performance Dashboard: Monitor performance metrics such as CPU time per execution, average elapsed time, and execution counts.
- Query the information schema views (see the example query after this list):
  - information_schema.PIPELINES: Provides high-level information about each pipeline.
  - information_schema.PIPELINES_BATCHES_SUMMARY: Displays batch durations and states for monitoring pipeline activity.
  - information_schema.PIPELINES_ERRORS: Lists errors by batch and partition.
  - information_schema.PIPELINES_FILES: Shows extracted or loaded files and their states. This view is especially useful for file-based pipelines.
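A sketch of a file-state summary for a file-based pipeline; the pipeline name is hypothetical, and column names should be verified against your version of the view.

```sql
-- Count files per state (for example, Loaded, Unloaded, Skipped) for one pipeline.
SELECT PIPELINE_NAME, FILE_STATE, COUNT(*) AS file_count
FROM information_schema.PIPELINES_FILES
WHERE PIPELINE_NAME = 'clicks_pipeline'
GROUP BY PIPELINE_NAME, FILE_STATE;
```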
Troubleshooting
- Debugging errors (see the sketch after this list):
  - When a CREATE PIPELINE statement fails, use SHOW WARNINGS to view detailed error messages.
  - Use CREATE PIPELINE … FORCE to create a pipeline even if connection or data source errors occur. Errors are recorded in the information_schema.PIPELINES_ERRORS view.
  - Enable debug logging by setting the engine variable pipelines_extractor_debug_logging=ON. For detailed error analysis, create a debug pipeline. Debug logs can generate up to 10 MB of data.
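A minimal sketch of that workflow; the pipeline name is hypothetical, and the PIPELINES_ERRORS column names should be verified against your version.

```sql
-- Inspect why the most recent CREATE PIPELINE statement failed.
SHOW WARNINGS;

-- Turn on extractor debug logging while diagnosing (turn it off afterwards).
SET GLOBAL pipelines_extractor_debug_logging = ON;

-- Review recorded pipeline errors, newest first.
SELECT PIPELINE_NAME, BATCH_ID, ERROR_KIND, ERROR_MESSAGE, ERROR_UNIX_TIMESTAMP
FROM information_schema.PIPELINES_ERRORS
WHERE PIPELINE_NAME = 'clicks_pipeline'
ORDER BY ERROR_UNIX_TIMESTAMP DESC
LIMIT 10;
```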
- Addressing specific error types:
  - If a table referenced by a pipeline must be renamed (see the sketch after this list):
    1. Stop the pipeline.
    2. Drop the pipeline.
    3. Rename the table.
    4. Recreate the pipeline with the updated table name.
    5. Start the pipeline.
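A sketch of that sequence; the pipeline and table names are hypothetical, the CREATE PIPELINE statement must match your original definition apart from the table name, and the table-rename syntax should be verified for your version.

```sql
STOP PIPELINE clicks_pipeline;
DROP PIPELINE clicks_pipeline;

ALTER TABLE clicks RENAME TO click_events;   -- verify rename support/syntax in your version

-- Recreate with the same definition, now pointing at the renamed table.
CREATE PIPELINE clicks_pipeline AS
  LOAD DATA S3 'example-bucket/clicks/*.csv'
  CONFIG '{"region": "us-east-1"}'
  INTO TABLE click_events
  FIELDS TERMINATED BY ',';

START PIPELINE clicks_pipeline;
```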
  - For file corruption or inaccessible files/objects, the pipeline skips the affected files. Use ALTER PIPELINE … DROP FILE <filename> to allow reprocessing, as shown below.
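A minimal sketch; the pipeline and file names are hypothetical, and the file name must match the entry recorded in information_schema.PIPELINES_FILES.

```sql
-- Forget the skipped file so the pipeline reprocesses it in a later batch.
ALTER PIPELINE clicks_pipeline DROP FILE 'clicks/2025-01-01.csv';
```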
- Error accumulation issues:
  - If pipeline error storage reaches the limit defined by ingest_errors_max_disk_space_mb, the pipeline may pause. To resolve this (see the sketch after this list):
    - Increase the value of ingest_errors_max_disk_space_mb, or
    - Run CLEAR PIPELINE ERRORS to reclaim space.
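A sketch of the two remedies, assuming the variable is settable with SET GLOBAL; the value is illustrative, and the exact scope of CLEAR PIPELINE ERRORS should be checked in the reference.

```sql
-- Option 1: raise the cap on disk space used for stored ingest errors (value in MB, illustrative).
SET GLOBAL ingest_errors_max_disk_space_mb = 2048;

-- Option 2: discard stored pipeline errors to reclaim space.
CLEAR PIPELINE ERRORS;
```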
- If pipeline metadata grows significantly (for example, the PIPELINES_FILES table contains many loaded file entries), manage metadata by:
  - Dropping file records with ALTER PIPELINE … DROP FILE, or
  - Dropping and recreating the pipeline.
  Note: Dropping and recreating a pipeline deletes all associated metadata.
Stored Procedures Best Practices
Avoid operations in stored procedures that force processing on the aggregator node.
- For example, using COLLECT() with large result sets places heavy load on the aggregator. A set-based alternative is sketched after this list.
- When inserting into reference tables, using auto-increment columns, or performing non-parallel operations, design the logic carefully to balance workload between leaf nodes and aggregators.
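A minimal sketch of a pipeline stored procedure that stays set-oriented instead of collecting rows on the aggregator; the procedure, table, and column names are hypothetical.

```sql
DELIMITER //
CREATE OR REPLACE PROCEDURE load_clicks(batch QUERY(user_id BIGINT, url TEXT, event_time DATETIME))
AS
BEGIN
  -- A set-based INSERT…SELECT runs in parallel on the leaves; avoid COLLECT(batch),
  -- which would pull the whole batch onto the aggregator.
  INSERT INTO clicks(user_id, url, event_time)
  SELECT user_id, url, event_time FROM batch;
END //
DELIMITER ;

-- Attach the procedure to a pipeline (definition abbreviated, names hypothetical):
-- CREATE PIPELINE clicks_pipeline AS LOAD DATA S3 '...' ... INTO PROCEDURE load_clicks ...;
```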
Refer to Writing Efficient Stored Procedures for Pipelines for more information.
Kafka-Specific Best Practices
- Configure secure connections with appropriate authentication options (see the sketch after this list):
  - security.protocol (for example, SASL_SSL)
  - sasl.mechanism (PLAIN, SCRAM, etc.)
  - CA certificates for SSL validation
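A sketch of a secured Kafka pipeline; the broker addresses, topic, table, column mapping, credentials, and certificate path are hypothetical.

```sql
-- Two brokers are listed for redundancy; the topic here is 'clicks'.
CREATE PIPELINE kafka_clicks AS
  LOAD DATA KAFKA 'broker1.example.com:9093,broker2.example.com:9093/clicks'
  CONFIG '{"security.protocol": "SASL_SSL",
           "sasl.mechanism": "SCRAM-SHA-512",
           "ssl.ca.location": "/etc/ssl/certs/ca.pem"}'
  CREDENTIALS '{"sasl.username": "pipeline_user",
                "sasl.password": "example-password"}'
  INTO TABLE clicks
  FORMAT JSON (user_id <- user_id, url <- url);
```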
- Use the CONFIG clause in CREATE PIPELINE … KAFKA to tune throughput and latency. Common options include fetch.max.bytes, fetch.min.bytes, and socket.timeout.ms.
- Use MAX_OFFSETS_PER_BATCH_PARTITION in the CREATE PIPELINE statement to limit the number of offsets processed per partition in each batch. This setting helps control batch size and manage resource utilization.
- If using a schema registry, install SSL certificates so the pipeline can validate and retrieve schema definitions.
- To ensure in-order ingestion (for example, with upserts), disable out-of-order optimization:
  - Add DISABLE OUT_OF_ORDER OPTIMIZATION in the CREATE PIPELINE statement, or
  - Use ALTER PIPELINE to apply the setting after creation.
- Define multiple Kafka brokers in the pipeline to increase availability and redundancy.
- Monitor Kafka pipeline errors, such as missing topics, missing partitions, or authentication failures.
- Verify that topics exist and the pipeline has the necessary permissions.
- Kafka pipelines do not check for the latest offsets while stopped. Messages added after the pipeline stops do not appear in its offsets, and running ALTER PIPELINE <pipeline_name> SET OFFSETS LATEST while the pipeline is stopped does not change the offset count.
S3-Specific Best Practices
- Ensure AWS credentials include the required permissions, such as s3:GetObject and s3:ListBucket.
- Use the CONFIG and CREDENTIALS clauses to specify parameters such as region, role_arn, or temporary credentials based on deployment requirements (see the sketch below).
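A sketch of an S3 pipeline that authenticates with temporary credentials; the bucket, table, column mapping, region, and credential values are hypothetical (an IAM role via role_arn is an alternative).

```sql
CREATE PIPELINE s3_clicks AS
  LOAD DATA S3 'example-bucket/clicks/*.json'
  CONFIG '{"region": "us-east-1"}'
  CREDENTIALS '{"aws_access_key_id": "AKIAEXAMPLE",
                "aws_secret_access_key": "example-secret-key",
                "aws_session_token": "example-session-token"}'
  INTO TABLE clicks
  FORMAT JSON (user_id <- user_id, url <- url);
```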
- Use wildcards or globbing in S3 paths to capture specific file sets.
  - Example: bucket/path/*.json for files in one directory, or bucket/folder/** to include files recursively.
- The suffixes parameter in the CONFIG clause provides an alternative way to specify the file extensions to load.
- By default, pipelines store metadata for all processed files, which can grow significantly over time. Enable the OFFSETS METADATA GC feature to automatically remove metadata for old files.
- Control pipeline parallelism with MAX_PARTITIONS_PER_BATCH to avoid overwhelming leaf nodes or consuming excessive memory.
- Adjust BATCH_INTERVAL to balance ingestion latency and resource usage. Short intervals increase overhead, while longer intervals reduce load.
- Use the file_time_threshold parameter in the pipeline configuration to load only recently added files instead of processing the entire archive. Refer to S3 Configurations for more information.
- Ensure that most files are approximately the same size and not too small. Files in the tens or hundreds of megabytes are acceptable. Large numbers of very small files create overhead because SingleStore tracks metadata for each file. A limited number of small files is acceptable, but a few files significantly larger than the rest can reduce loading performance by limiting parallelism.