# Best Practices for Pipelines

The following best practices ensure reliable, scalable, and high-performance data ingestion from supported sources, including Kafka, Amazon S3, Azure Blob Storage, Google Cloud Storage (GCS), and HDFS.

## General Pipeline Best Practices

* Resource management: Use resource pools to isolate workloads. Pipelines waiting in the queue consume slots reserved for background pipelines, as defined by `MAX_CONCURRENCY` and `MAX_QUEUE_DEPTH`. Control the maximum number of pipelines that can run concurrently using the `pipelines_max_concurrent` engine variable.
* Aggregator pipelines: For sources with low parallelism (such as single file S3 loads or single-partition Kafka topics), use aggregator pipelines. Aggregator pipelines simplify load behavior and improve performance in these scenarios.
* Use the following engine variables to control pipeline limits and defaults:

  * `pipelines_max_concurrent`: Specifies the maximum number of pipelines that can run concurrently.
  * `pipelines_batch_interval`: Controls the default batch frequency (unless overridden at the pipeline level).
  * `pipelines_stop_on_error`: Determines whether a pipeline stops or continues after encountering errors.
  * `pipelines_max_retries_per_batch_partition`: Configures retries for batch partitions to improve resilience.
* Avoid using `IGNORE` or `SKIP ALL ERRORS` options unless necessary, because they can generate many warnings which can slow down pipelines.
* Avoid using `ENCLOSED BY` or `OPTIONALLY ENCLOSED BY` clauses in `CREATE PIPELINE … LOAD DATA` statement when performance is critical, since these options add parsing overhead.
* If the transform is slow, consider using a `CREATE PIPELINE … SET … WHERE` statement as an alternative to a transform.

## Resource Utilization

* Use `SHOW PROFILE PIPELINE` to measure batch-level resource usage, including start and end times.
* Query `information_schema.PIPELINES_BATCHES_SUMMARY` for batch diagnostics such as database, pipeline name, batch duration, and batch state.

  * A large number of `Queued` batches indicates that cluster resources are overcommitted.
* Decrease `MAX_PARTITIONS_PER_BATCH` to lower parallelism and reduce overhead or resource contention.
* Increase `BATCH_INTERVAL` to run pipelines less frequently and ease system load.
* Split large files into smaller chunks. Large files parsed on a single leaf node can cause slow performance.
* For Amazon S3, Azure, and GCS pipelines, enable offsets metadata garbage collection with `ENABLE OFFSETS METADATA GC` to prevent memory growth from uncollected offsets metadata.

## Monitoring

* Use the SingleStore Monitoring Dashboards:

  * Pipeline Summary Dashboard: View the overall state of pipelines, including running, stopped, and errored pipelines.
  * Pipeline Performance Dashboard: Monitor performance metrics such as CPU time per execution, average elapsed time, and execution counts.
* Query the information schema views:

  * `information_schema.PIPELINES`: Provides high-level information about each pipeline.
  * `information_schema.PIPELINES_BATCHES_SUMMARY`: Displays batch durations and states for monitoring pipeline activity.
  * `information_schema.PIPELINES_ERRORS`: Lists errors by batch and partition.
  * `information_schema.PIPELINES_FILES`: Shows extracted or loaded files and their states. This view is especially useful for file-based pipelines.

## Troubleshooting

* Debugging errors:

  * When a `CREATE PIPELINE` statement fails, use `SHOW WARNINGS` to view detailed error messages.
  * Use `CREATE PIPELINE … FORCE` to create a pipeline even if connection or data source errors occur. Errors are recorded in the `information_schema.PIPELINES_ERRORS` view.
  * Enable debug logging by setting the engine variable `pipelines_extractor_debug_logging=ON`. For detailed error analysis, create a debug pipeline. Debug logs can generate up to 10 MB of data.
* Addressing specific error types:

  * If a table referenced by a pipeline must be renamed, you need to:

    1. Stop the pipeline.

    2. Drop the pipeline.

    3. Rename the table.

    4. Recreate the pipeline with the updated table name.

    5. Start the pipeline.
  * For file corruption or inaccessible files/objects, the pipeline skips the affected files. Use `ALTER PIPELINE … DROP FILE <filename>` to allow reprocessing.
* Error accumulation issues:

  * If pipeline error storage reaches the limit defined by `ingest_errors_max_disk_space_mb`, the pipeline may pause. To resolve this:

    * Increase the value of `ingest_errors_max_disk_space_mb`, or
    * Run `CLEAR PIPELINE ERRORS` to reclaim space.
* If pipeline metadata grows significantly (for example, the `PIPELINES_FILES` table contains many loaded file entries), manage metadata by:

  * Dropping file records with `ALTER PIPELINE … DROP FILE`, or
  * Dropping and recreating the pipeline.
    > **📝 Note**: Dropping and recreating a pipeline deletes all associated metadata.

## Stored Procedures Best Practices

Avoid operations in stored procedures that force processing on the aggregator node. These operations increase memory usage and reduce performance.

* For Example, using `COLLECT()` with large result sets places heavy load on the aggregator.
* When inserting into reference tables, using auto-increment columns, or performing non-parallel operations, design the logic carefully to balance workload between leaf nodes and aggregators.

Refer to [Writing Efficient Stored Procedures for Pipelines](https://docs.singlestore.com/db/v9.1/load-data/about-singlestore-pipelines/pipeline-concepts/writing-efficient-stored-procedures-for-pipelines.md) for more information.

## Kafka-Specific Best Practices

* Configure secure connections with appropriate authentication options.

  * `security.protocol` (for example, `SASL_SSL`)
  * `sasl.mechanism` (`PLAIN`, `SCRAM`, etc.)
  * CA certificates for SSL validation
* Use the `CONFIG` clause in `CREATE PIPELINE … KAFKA` to tune throughput and latency.

  * Common options include `fetch.max.bytes`, `fetch.min.bytes`, and `socket.timeout.ms`.
* Use `MAX_OFFSETS_PER_BATCH_PARTITION` in the `CREATE PIPELINE` statement to limit the number of offsets processed per partition in each batch. This setting helps control batch size and manage resource utilization.
* If using a schema registry, install SSL certificates so the pipeline can validate and retrieve schema definitions.
* To ensure in-order ingestion (for example, with upserts), disable out-of-order optimization:

  * Add `DISABLE OUT_OF_ORDER OPTIMIZATION` in the `CREATE PIPELINE` statement, or
  * Use `ALTER PIPELINE` to apply the setting after creation.
* Define multiple Kafka brokers in the pipeline to increase availability and redundancy.
* Monitor Kafka pipeline errors, such as missing topics, missing partitions, or authentication failures.
* Verify that topics exist and the pipeline has the necessary permissions.
* Kafka pipelines do not check for the latest offsets when stopped. Files added after the pipeline stops do not appear in offsets. Running `ALTER PIPELINE <pipeline_name> SET OFFSETS LATEST` while the pipeline is stopped does not change the offset count.

## S3-Specific Best Practices

* Ensure AWS credentials include the required permissions, such as `s3:GetObject` and `s3:ListBucket`.
* Use the `CONFIG` and `CREDENTIALS` clauses to specify parameters such as `region`, `role_arn`, or temporary credentials based on deployment requirements.
* Use wildcards or globbing in S3 paths to capture specific file sets.

  * Examples:

    * `bucket/path/*.json` matches JSON files in `bucket/path` only. `bucket/folder/*` matches all files and subfolders under `bucket/folder` recursively.
    * `bucket/path/**` matches only the files directly under `bucket/path`, not the files in subfolders.
    * `bucket/file?.csv` matches a single character: `file1.csv`, `fileA.csv`
    * `bucket/log[0-9].txt` matches a character range: `log0.txt` through `log9.txt`
    * `bucket/data_202[0-4].csv` matches specific years: `2020-2024`
  * For best performance, place glob patterns as far in the path as possible (e.g., `bucket/2024/data_*.csv` rather than `bucket/*/data.csv`).
* The suffixes parameter in the `CONFIG` clause provides an alternative way to specify the file extensions to load.
* By default, pipelines store metadata for all processed files, which can grow significantly over time. Enable the `OFFSETS METADATA GC` feature to automatically remove metadata for old files.
* Control pipeline parallelism with `MAX_PARTITIONS_PER_BATCH` to avoid overwhelming leaf nodes or consuming excessive memory.
* Adjust `BATCH_INTERVAL` to balance ingestion latency and resource usage. Short intervals increase overhead, while longer intervals reduce load.
* Use the `file_time_threshold` parameter in the pipeline configuration to load only recently added files instead of processing the entire archive. Refer to [S3 Configurations](https://docs.singlestore.com/db/v9.1/load-data/data-sources/configuration-options-for-different-sources/#section-idm234488319463038.md) for more information.
* Ensure that most files are approximately the same size and not too small. Files in the tens or hundreds of megabytes are acceptable. Large numbers of very small files create overhead as SingleStore tracks metadata for each file. A limited number of small files is acceptable, but having a few files significantly larger than the rest can reduce loading performance by limiting parallelism.

***

Modified at: May 11, 2026

Source: [/db/v9.1/load-data/about-singlestore-pipelines/pipeline-concepts/best-practices-for-pipelines/](https://docs.singlestore.com/db/v9.1/load-data/about-singlestore-pipelines/pipeline-concepts/best-practices-for-pipelines/)

(An index of the documentation is available at /llms.txt)
