Best Practices for Pipelines
The following best practices ensure reliable, scalable, and high-performance data ingestion from supported sources, including Kafka, Amazon S3, Azure Blob Storage, Google Cloud Storage (GCS), and HDFS.
General Pipeline Best Practices
- Resource management: Use resource pools to isolate workloads. Pipelines waiting in the queue consume slots reserved for background pipelines, as defined by MAX_CONCURRENCY and MAX_QUEUE_DEPTH. Control the maximum number of pipelines that can run concurrently using the pipelines_max_concurrent engine variable.
- Aggregator pipelines: For sources with low parallelism (such as single-file S3 loads or single-partition Kafka topics), use aggregator pipelines. Aggregator pipelines simplify load behavior and improve performance in these scenarios.
- Use the following engine variables to control pipeline limits and defaults (see the example after this list):
  - pipelines_max_concurrent: Specifies the maximum number of pipelines that can run concurrently.
  - pipelines_batch_interval: Controls the default batch frequency (unless overridden at the pipeline level).
  - pipelines_stop_on_error: Determines whether a pipeline stops or continues after encountering errors.
  - pipelines_max_retries_per_batch_partition: Configures retries for batch partitions to improve resilience.
- Avoid using the IGNORE or SKIP ALL ERRORS options unless necessary, because they can generate many warnings, which can slow down pipelines.
- Avoid using the ENCLOSED BY or OPTIONALLY ENCLOSED BY clauses in a CREATE PIPELINE … LOAD DATA statement when performance is critical, since these options add parsing overhead.
- If the transform is slow, consider using a CREATE PIPELINE … SET … WHERE statement as an alternative to a transform.
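The following sketch shows how these defaults and the SET … WHERE pattern fit together. The variable names and clauses are taken from this page; the pipeline name, bucket, table, columns, and values are hypothetical, and the exact clause placement and variable scopes should be checked against the CREATE PIPELINE and engine variable references for your deployment.

```sql
-- Illustrative values only; confirm the appropriate scope and defaults for your version.
SET GLOBAL pipelines_max_concurrent = 10;
SET GLOBAL pipelines_batch_interval = 2500;               -- milliseconds
SET GLOBAL pipelines_stop_on_error = OFF;
SET GLOBAL pipelines_max_retries_per_batch_partition = 4;

-- Filtering and light transformation with SET ... WHERE instead of a transform
-- (hypothetical bucket, table, and columns).
CREATE PIPELINE clicks_pipeline AS
LOAD DATA S3 'my-bucket/clicks/*.csv'
CONFIG '{"region": "us-east-1"}'
CREDENTIALS '{"aws_access_key_id": "<key>", "aws_secret_access_key": "<secret>"}'
INTO TABLE clicks
FIELDS TERMINATED BY ','
(click_id, user_id, @raw_ts)
SET click_ts = FROM_UNIXTIME(@raw_ts)
WHERE user_id IS NOT NULL;
```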
Resource Utilization
- Use SHOW PROFILE PIPELINE to measure batch-level resource usage, including start and end times.
- Query information_schema.PIPELINES_BATCHES_SUMMARY for batch diagnostics such as database, pipeline name, batch duration, and batch state.
- A large number of Queued batches indicates that workspace resources are overcommitted. In that case (see the example after this list):
  - Decrease MAX_PARTITIONS_PER_BATCH to lower parallelism and reduce overhead or resource contention.
  - Increase BATCH_INTERVAL to run pipelines less frequently and ease system load.
- Split large files into smaller chunks. Large files parsed on a single leaf node can cause slow performance.
- For Amazon S3, Azure, and GCS pipelines, enable offsets metadata garbage collection with ENABLE OFFSETS METADATA GC to prevent memory growth from uncollected offsets metadata.
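As a sketch of the workflow above: check batch states in information_schema.PIPELINES_BATCHES_SUMMARY, then lower parallelism or batch frequency for an overcommitted pipeline. The pipeline name is hypothetical, the column names are assumed from the view description above, and the ALTER PIPELINE clause syntax should be verified for your version.

```sql
-- Count batches per pipeline by state; many Queued batches suggest overcommitted resources.
SELECT DATABASE_NAME, PIPELINE_NAME, BATCH_STATE, COUNT(*) AS num_batches
FROM information_schema.PIPELINES_BATCHES_SUMMARY
GROUP BY DATABASE_NAME, PIPELINE_NAME, BATCH_STATE;

-- Pipelines generally must be stopped before running ALTER PIPELINE.
STOP PIPELINE clicks_pipeline;
ALTER PIPELINE clicks_pipeline SET MAX_PARTITIONS_PER_BATCH 4;   -- lower parallelism
ALTER PIPELINE clicks_pipeline SET BATCH_INTERVAL 5000;          -- run batches less often (ms)
START PIPELINE clicks_pipeline;
```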
Monitoring
- Use the SingleStore Monitoring Dashboards:
  - Pipeline Summary Dashboard: View the overall state of pipelines, including running, stopped, and errored pipelines.
  - Pipeline Performance Dashboard: Monitor performance metrics such as CPU time per execution, average elapsed time, and execution counts.
- Query the information schema views (see the example after this list):
  - information_schema.PIPELINES: Provides high-level information about each pipeline.
  - information_schema.PIPELINES_BATCHES_SUMMARY: Displays batch durations and states for monitoring pipeline activity.
  - information_schema.PIPELINES_ERRORS: Lists errors by batch and partition.
  - information_schema.PIPELINES_FILES: Shows extracted or loaded files and their states. This view is especially useful for file-based pipelines.
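Two monitoring queries as a sketch; the pipeline name is hypothetical and the column names are assumed from the view descriptions above, so verify them against your cluster's information_schema.

```sql
-- Recent pipeline errors by batch.
SELECT DATABASE_NAME, PIPELINE_NAME, BATCH_ID, ERROR_MESSAGE
FROM information_schema.PIPELINES_ERRORS
ORDER BY BATCH_ID DESC
LIMIT 20;

-- File states for a file-based pipeline (useful for S3, Azure, GCS, and HDFS sources).
SELECT PIPELINE_NAME, FILE_NAME, FILE_STATE
FROM information_schema.PIPELINES_FILES
WHERE PIPELINE_NAME = 'clicks_pipeline';
```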
Troubleshooting
- Debugging errors:
  - When a CREATE PIPELINE statement fails, use SHOW WARNINGS to view detailed error messages.
  - Use CREATE PIPELINE … FORCE to create a pipeline even if connection or data source errors occur. Errors are recorded in the information_schema.PIPELINES_ERRORS view.
  - Enable debug logging by setting the engine variable pipelines_extractor_debug_logging = ON. For detailed error analysis, create a debug pipeline. Debug logs can generate up to 10 MB of data.
- Addressing specific error types:
  - If a table referenced by a pipeline must be renamed, you need to (see the example after this list):
    1. Stop the pipeline.
    2. Drop the pipeline.
    3. Rename the table.
    4. Recreate the pipeline with the updated table name.
    5. Start the pipeline.
  - For file corruption or inaccessible files/objects, the pipeline skips the affected files. Use ALTER PIPELINE … DROP FILE <filename> to allow reprocessing.
- Error accumulation issues:
  - If pipeline error storage reaches the limit defined by ingest_errors_max_disk_space_mb, the pipeline may pause. To resolve this:
    - Increase the value of ingest_errors_max_disk_space_mb, or
    - Run CLEAR PIPELINE ERRORS to reclaim space.
  - If pipeline metadata grows significantly (for example, the PIPELINES_FILES view contains many loaded file entries), manage metadata by:
    - Dropping file records with ALTER PIPELINE … DROP FILE, or
    - Dropping and recreating the pipeline.
    Note: Dropping and recreating a pipeline deletes all associated metadata.
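The rename sequence and the cleanup commands above, sketched end to end. Pipeline, table, and file names are hypothetical; the ALTER TABLE … RENAME syntax and the requirement to stop a pipeline before ALTER PIPELINE should be confirmed for your version.

```sql
-- Rename a table that a pipeline loads into.
STOP PIPELINE clicks_pipeline;
DROP PIPELINE clicks_pipeline;
ALTER TABLE clicks RENAME TO click_events;   -- rename syntax assumed; verify for your version
CREATE PIPELINE clicks_pipeline AS           -- recreate against the new table name
LOAD DATA S3 'my-bucket/clicks/*.csv'
CONFIG '{"region": "us-east-1"}'
CREDENTIALS '{"aws_access_key_id": "<key>", "aws_secret_access_key": "<secret>"}'
INTO TABLE click_events
FIELDS TERMINATED BY ',';
START PIPELINE clicks_pipeline;

-- Reclaim error storage when it approaches ingest_errors_max_disk_space_mb.
CLEAR PIPELINE ERRORS;

-- Allow a skipped or corrupted file to be reprocessed.
STOP PIPELINE clicks_pipeline;
ALTER PIPELINE clicks_pipeline DROP FILE 'clicks/2024-01-01.csv';
START PIPELINE clicks_pipeline;
```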
Stored Procedures Best Practices
Avoid operations in stored procedures that force processing on the aggregator node.
- For example, using COLLECT() with large result sets places heavy load on the aggregator.
- When inserting into reference tables, using auto-increment columns, or performing non-parallel operations, design the logic carefully to balance workload between leaf nodes and aggregators (see the sketch at the end of this section).
Refer to Writing Efficient Stored Procedures for Pipelines for more information.
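A minimal sketch of the pattern, assuming a CSV source and hypothetical names: the QUERY-typed batch parameter lets the insert run as a distributed set operation instead of collecting rows onto the aggregator with COLLECT(). The DELIMITER directives are client-specific, and the clause order should be checked against the CREATE PIPELINE … INTO PROCEDURE reference.

```sql
DELIMITER //
CREATE OR REPLACE PROCEDURE load_clicks(batch QUERY(click_id BIGINT, user_id BIGINT, click_ts DATETIME))
AS
BEGIN
    -- Set-based insert; no COLLECT() and no per-row loop on the aggregator.
    INSERT INTO click_events (click_id, user_id, click_ts)
    SELECT click_id, user_id, click_ts FROM batch;
END //
DELIMITER ;

CREATE PIPELINE clicks_proc_pipeline AS
LOAD DATA S3 'my-bucket/clicks/*.csv'
CONFIG '{"region": "us-east-1"}'
CREDENTIALS '{"aws_access_key_id": "<key>", "aws_secret_access_key": "<secret>"}'
INTO PROCEDURE load_clicks
FIELDS TERMINATED BY ',';
```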
Kafka-Specific Best Practices
- Configure secure connections with appropriate authentication options:
  - security.protocol (for example, SASL_SSL)
  - sasl.mechanism (PLAIN, SCRAM, etc.)
  - CA certificates for SSL validation
- Use the CONFIG clause in CREATE PIPELINE … KAFKA to tune throughput and latency (see the example after this list).
  - Common options include fetch.max.bytes, fetch.min.bytes, and socket.timeout.ms.
- Use MAX_OFFSETS_PER_BATCH_PARTITION in the CREATE PIPELINE statement to limit the number of offsets processed per partition in each batch. This setting helps control batch size and manage resource utilization.
- If using a schema registry, install SSL certificates so the pipeline can validate and retrieve schema definitions.
- To ensure in-order ingestion (for example, with upserts), disable out-of-order optimization:
  - Add DISABLE OUT_OF_ORDER OPTIMIZATION in the CREATE PIPELINE statement, or
  - Use ALTER PIPELINE to apply the setting after creation.
- Define multiple Kafka brokers in the pipeline to increase availability and redundancy.
- Monitor Kafka pipeline errors, such as missing topics, missing partitions, or authentication failures.
- Verify that topics exist and the pipeline has the necessary permissions.
- Kafka pipelines do not check for the latest offsets while stopped. Messages added after the pipeline stops do not appear in offsets, and running ALTER PIPELINE <pipeline_name> SET OFFSETS LATEST while the pipeline is stopped does not change the offset count.
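A sketch combining the Kafka options above. Broker addresses, topic, credentials, and the target table are hypothetical; the CONFIG keys are standard Kafka client properties, and the clause order should be checked against the CREATE PIPELINE … KAFKA reference for your version.

```sql
CREATE PIPELINE kafka_orders AS
LOAD DATA KAFKA 'broker1.example.com:9092,broker2.example.com:9092/orders'  -- multiple brokers for redundancy
CONFIG '{"security.protocol": "SASL_SSL",
         "sasl.mechanism": "SCRAM-SHA-512",
         "ssl.ca.location": "/etc/ssl/certs/ca.pem",
         "fetch.max.bytes": "52428800",
         "fetch.min.bytes": "1"}'
CREDENTIALS '{"sasl.username": "<user>", "sasl.password": "<password>"}'
MAX_OFFSETS_PER_BATCH_PARTITION 100000    -- cap offsets per partition per batch
DISABLE OUT_OF_ORDER OPTIMIZATION         -- preserve in-order ingestion for upserts
INTO TABLE orders
FIELDS TERMINATED BY ',';
```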
S3-Specific Best Practices
- Ensure AWS credentials include the required permissions, such as s3:GetObject and s3:ListBucket.
- Use the CONFIG and CREDENTIALS clauses to specify parameters such as region, role_arn, or temporary credentials based on deployment requirements (see the example after this list).
- Use wildcards or globbing in S3 paths to capture specific file sets.
  - Example: bucket/path/*.json for files in one directory, or bucket/folder/** to include files recursively.
- The suffixes parameter in the CONFIG clause provides an alternative way to specify the file extensions to load.
- By default, pipelines store metadata for all processed files, which can grow significantly over time. Enable the OFFSETS METADATA GC feature to automatically remove metadata for old files.
- Control pipeline parallelism with MAX_PARTITIONS_PER_BATCH to avoid overwhelming leaf nodes or consuming excessive memory.
- Adjust BATCH_INTERVAL to balance ingestion latency and resource usage. Short intervals increase overhead, while longer intervals reduce load.
- Use the file_time_threshold parameter in the pipeline configuration to load only recently added files instead of processing the entire archive. Refer to S3 Configurations for more information.
- Ensure that most files are approximately the same size and not too small. Files in the tens or hundreds of megabytes work well. Large numbers of very small files create overhead because SingleStore tracks metadata for each file. A limited number of small files is acceptable, but a few files that are significantly larger than the rest can reduce loading performance by limiting parallelism.
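Putting the S3 options together as a sketch. The bucket, table, role ARN, and JSON column mappings are hypothetical; CONFIG keys such as suffixes (and file_time_threshold, which can be added the same way) are named on this page, but their value formats and the clause order should be verified against the S3 pipeline reference for your version.

```sql
CREATE PIPELINE s3_events AS
LOAD DATA S3 'my-bucket/events/**'
CONFIG '{"region": "us-east-1", "suffixes": ["json"]}'
CREDENTIALS '{"role_arn": "arn:aws:iam::<account-id>:role/<role-name>"}'
MAX_PARTITIONS_PER_BATCH 8       -- limit per-batch parallelism
BATCH_INTERVAL 2500              -- milliseconds between batches
ENABLE OFFSETS METADATA GC       -- drop metadata for old, already-loaded files
INTO TABLE events
FORMAT JSON
(event_id <- event_id, user_id <- user_id, payload <- payload);
```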