Best Practices for Pipelines

The following best practices ensure reliable, scalable, and high-performance data ingestion from supported sources, including Kafka, Amazon S3, Azure Blob Storage, Google Cloud Storage (GCS), and HDFS.

General Pipeline Best Practices

  • Resource management: Use resource pools to isolate pipeline workloads. Pipelines waiting in the queue consume the slots that the resource pool reserves for background pipelines, as defined by its MAX_CONCURRENCY and MAX_QUEUE_DEPTH settings. Use the pipelines_max_concurrent engine variable to control the maximum number of pipelines that can run concurrently.

  • Aggregator pipelines: For sources with low parallelism (such as single-file S3 loads or single-partition Kafka topics), use aggregator pipelines. Aggregator pipelines simplify load behavior and improve performance in these scenarios.

  • Use the following engine variables to control pipeline limits and defaults:

    • pipelines_max_concurrent: Specifies the maximum number of pipelines that can run concurrently.

    • pipelines_batch_interval: Controls the default batch frequency (unless overridden at the pipeline level).

    • pipelines_stop_on_error: Determines whether a pipeline stops or continues after encountering errors.

    • pipelines_max_retries_per_batch_partition: Configures retries for batch partitions to improve resilience.

  • Avoid using the IGNORE or SKIP ALL ERRORS options unless necessary, because they can generate many warnings, which can slow down pipelines.

  • Avoid using the ENCLOSED BY or OPTIONALLY ENCLOSED BY clauses in CREATE PIPELINE … LOAD DATA statements when performance is critical, because these options add parsing overhead.

  • If a transform is slow, consider using a CREATE PIPELINE … SET … WHERE statement instead of the transform, as shown in the sketch below.
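
For example, the following sketch applies these settings. The pipeline, table, bucket, and column names, as well as the variable values, are illustrative assumptions rather than recommendations:

    -- Engine variables that control pipeline limits and defaults (illustrative values).
    SET GLOBAL pipelines_max_concurrent = 50;
    SET GLOBAL pipelines_batch_interval = 2500;
    SET GLOBAL pipelines_stop_on_error = OFF;
    SET GLOBAL pipelines_max_retries_per_batch_partition = 4;

    -- Filter and reshape rows with SET ... WHERE instead of a transform.
    CREATE PIPELINE events_pipeline AS
      LOAD DATA S3 'example-bucket/events/*.csv'
      CONFIG '{"region": "us-east-1"}'
      CREDENTIALS '{"aws_access_key_id": "...", "aws_secret_access_key": "..."}'
      INTO TABLE events
      FIELDS TERMINATED BY ','
      (event_id, @raw_status, amount)
      SET status = UPPER(@raw_status)
      WHERE amount > 0;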

Resource Utilization

  • Use SHOW PROFILE PIPELINE to measure batch-level resource usage, including start and end times.

  • Query information_schema.PIPELINES_BATCHES_SUMMARY for batch diagnostics such as database, pipeline name, batch duration, and batch state (see the sketch after this list).

    • A large number of Queued batches indicates that workspace resources are overcommitted.

  • Decrease MAX_PARTITIONS_PER_BATCH to lower parallelism and reduce overhead or resource contention.

  • Increase BATCH_INTERVAL to run pipelines less frequently and ease system load.

  • Split large files into smaller chunks. Large files parsed on a single leaf node can cause slow performance.

  • For Amazon S3, Azure, and GCS pipelines, enable offsets metadata garbage collection with ENABLE OFFSETS METADATA GC to prevent memory growth from uncollected offsets metadata.
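
The following sketch shows this kind of check and tuning. The pipeline name and option values are assumptions:

    -- Look for batches piling up in the Queued state.
    SELECT DATABASE_NAME, PIPELINE_NAME, BATCH_STATE, COUNT(*) AS batches
    FROM information_schema.PIPELINES_BATCHES_SUMMARY
    GROUP BY DATABASE_NAME, PIPELINE_NAME, BATCH_STATE;

    -- Lower parallelism and run batches less frequently (illustrative values).
    ALTER PIPELINE events_pipeline SET MAX_PARTITIONS_PER_BATCH 4;
    ALTER PIPELINE events_pipeline SET BATCH_INTERVAL 10000;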

Monitoring

  • Use the SingleStore Monitoring Dashboards:

    • Pipeline Summary Dashboard: View the overall state of pipelines, including running, stopped, and errored pipelines.

    • Pipeline Performance Dashboard: Monitor performance metrics such as CPU time per execution, average elapsed time, and execution counts.

  • Query the information schema views:

    • information_schema.PIPELINES: Provides high-level information about each pipeline.

    • information_schema.PIPELINES_BATCHES_SUMMARY: Displays batch durations and states for monitoring pipeline activity.

    • information_schema.PIPELINES_ERRORS: Lists errors by batch and partition.

    • information_schema.PIPELINES_FILES: Shows extracted or loaded files and their states. This view is especially useful for file-based pipelines.
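
For example, the following monitoring queries use these views. The pipeline name is an assumption:

    -- High-level state of every pipeline.
    SELECT DATABASE_NAME, PIPELINE_NAME, STATE
    FROM information_schema.PIPELINES;

    -- Recent errors for one pipeline, by batch and partition.
    SELECT *
    FROM information_schema.PIPELINES_ERRORS
    WHERE PIPELINE_NAME = 'events_pipeline';

    -- File states for a file-based pipeline.
    SELECT FILE_NAME, FILE_STATE
    FROM information_schema.PIPELINES_FILES
    WHERE PIPELINE_NAME = 'events_pipeline';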

Troubleshooting

  • Debugging errors:

    • When a CREATE PIPELINE statement fails, use SHOW WARNINGS to view detailed error messages.

    • Use CREATE PIPELINE … FORCE to create a pipeline even if connection or data source errors occur. Errors are recorded in the information_schema.PIPELINES_ERRORS view.

    • Enable debug logging by setting the engine variable pipelines_extractor_debug_logging=ON. For detailed error analysis, create a debug pipeline. Debug logs can generate up to 10 MB of data.

  • Addressing specific error types:

    • If a table referenced by a pipeline must be renamed, follow these steps (a SQL sketch is shown at the end of this section):

      1. Stop the pipeline.

      2. Drop the pipeline.

      3. Rename the table.

      4. Recreate the pipeline with the updated table name.

      5. Start the pipeline.

    • If files or objects are corrupted or inaccessible, the pipeline skips them. Use ALTER PIPELINE … DROP FILE <filename> to allow a skipped file to be reprocessed.

  • Error accumulation issues:

    • If pipeline error storage reaches the limit defined by ingest_errors_max_disk_space_mb, the pipeline may pause. To resolve this:

      • Increase the value of ingest_errors_max_disk_space_mb, or

      • Run CLEAR PIPELINE ERRORS to reclaim space.

  • If pipeline metadata grows significantly (for example, the PIPELINES_FILES view contains many loaded file entries), manage the metadata by:

    • Dropping file records with ALTER PIPELINE … DROP FILE, or

    • Dropping and recreating the pipeline.

      Note

      Dropping and recreating a pipeline deletes all associated metadata.
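
The following SQL sketch walks through the rename procedure and the recovery commands above. The pipeline, table, bucket, and file names are assumptions, and the CREATE PIPELINE definition is abbreviated:

    -- Rename a table referenced by a pipeline.
    STOP PIPELINE events_pipeline;
    DROP PIPELINE events_pipeline;
    -- Rename the table here (for example, events to events_renamed) using the
    -- rename syntax supported by your version.
    CREATE PIPELINE events_pipeline AS
      LOAD DATA S3 'example-bucket/events/*.csv'
      CONFIG '{"region": "us-east-1"}'
      CREDENTIALS '{"aws_access_key_id": "...", "aws_secret_access_key": "..."}'
      INTO TABLE events_renamed
      FIELDS TERMINATED BY ',';
    START PIPELINE events_pipeline;

    -- Allow a skipped (corrupted or previously inaccessible) file to be reprocessed.
    ALTER PIPELINE events_pipeline DROP FILE 'events/2024-01-01.csv';

    -- Reclaim space consumed by accumulated pipeline errors.
    CLEAR PIPELINE ERRORS;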

Stored Procedures Best Practices

Avoid operations in stored procedures that force processing on the aggregator node. These operations increase memory usage and reduce performance.

  • For example, using COLLECT() with large result sets places a heavy load on the aggregator (see the sketch after this list).

  • When inserting into reference tables, using auto-increment columns, or performing non-parallel operations, design the logic carefully to balance workload between leaf nodes and aggregators.
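
A minimal sketch of the pipeline-into-stored-procedure pattern that keeps the work set-based instead of collecting rows onto the aggregator. The procedure, pipeline, broker, topic, table, and column names are assumptions:

    DELIMITER //
    CREATE OR REPLACE PROCEDURE load_events(batch QUERY(event_id BIGINT, status TEXT)) AS
    BEGIN
      -- Set-based insert: rows flow from the batch into the table without
      -- being materialized on the aggregator with COLLECT().
      INSERT INTO events(event_id, status)
      SELECT event_id, status FROM batch;
    END //
    DELIMITER ;

    CREATE PIPELINE events_proc_pipeline AS
      LOAD DATA KAFKA 'broker1.example.com:9092/events'
      INTO PROCEDURE load_events
      FIELDS TERMINATED BY ',';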

Refer to Writing Efficient Stored Procedures for Pipelines for more information.

Kafka-Specific Best Practices

  • Configure secure connections with appropriate authentication options:

    • security.protocol (for example, SASL_SSL)

    • sasl.mechanism (PLAIN, SCRAM, etc.)

    • CA certificates for SSL validation

  • Use the CONFIG clause in CREATE PIPELINE … KAFKA to tune throughput and latency (see the sketch after this list):

    • Common options include fetch.max.bytes, fetch.min.bytes, and socket.timeout.ms.

  • Use MAX_OFFSETS_PER_BATCH_PARTITION in the CREATE PIPELINE statement to limit the number of offsets processed per partition in each batch. This setting helps control batch size and manage resource utilization.

  • If using a schema registry, install SSL certificates so the pipeline can validate and retrieve schema definitions.

  • To ensure in-order ingestion (for example, with upserts), disable out-of-order optimization:

    • Add DISABLE OUT_OF_ORDER OPTIMIZATION in the CREATE PIPELINE statement, or

    • Use ALTER PIPELINE to apply the setting after creation.

  • Define multiple Kafka brokers in the pipeline to increase availability and redundancy.

  • Monitor Kafka pipeline errors, such as missing topics, missing partitions, or authentication failures.

  • Verify that topics exist and the pipeline has the necessary permissions.

  • Kafka pipelines do not check for the latest offsets while they are stopped, so messages produced after a pipeline stops are not reflected in its offsets. Running ALTER PIPELINE <pipeline_name> SET OFFSETS LATEST while the pipeline is stopped does not change the offset count.
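
The following sketch combines several of these recommendations. The brokers, topic, credentials, certificate path, table name, and option values are assumptions:

    CREATE PIPELINE kafka_orders AS
      LOAD DATA KAFKA 'broker1.example.com:9092,broker2.example.com:9092/orders'
      CONFIG '{"security.protocol": "SASL_SSL",
               "sasl.mechanism": "SCRAM-SHA-512",
               "ssl.ca.location": "/etc/ssl/certs/ca.pem",
               "fetch.max.bytes": "52428800",
               "fetch.min.bytes": "1048576"}'
      CREDENTIALS '{"sasl.username": "pipeline_user",
                    "sasl.password": "..."}'
      BATCH_INTERVAL 2500
      MAX_OFFSETS_PER_BATCH_PARTITION 500000
      DISABLE OUT_OF_ORDER OPTIMIZATION
      INTO TABLE orders
      FIELDS TERMINATED BY ',';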

S3-Specific Best Practices

  • Ensure AWS credentials include the required permissions, such as s3:GetObject and s3:ListBucket.

  • Use the CONFIG and CREDENTIALS clauses to specify parameters such as region, role_arn, or temporary credentials, depending on deployment requirements (a full example follows this list).

  • Use wildcards or globbing in S3 paths to capture specific file sets.

    • Example: bucket/path/*.json for files in one directory, or bucket/folder/** to include files recursively.

  • The suffixes parameter in the CONFIG clause provides an alternative way to specify the file extensions to load.

  • By default, pipelines store metadata for all processed files, which can grow significantly over time. Enable the OFFSETS METADATA GC feature to automatically remove metadata for old files.

  • Control pipeline parallelism with MAX_PARTITIONS_PER_BATCH to avoid overwhelming leaf nodes or consuming excessive memory.

  • Adjust BATCH_INTERVAL to balance ingestion latency and resource usage. Short intervals increase overhead, while longer intervals reduce load.

  • Use the file_time_threshold parameter in the pipeline configuration to load only recently added files instead of processing the entire archive. Refer to S3 Configurations for more information.

  • Keep most files approximately the same size and not too small. Files in the tens or hundreds of megabytes work well. Large numbers of very small files create overhead because SingleStore tracks metadata for each file, although a limited number of small files is acceptable. A few files that are significantly larger than the rest can also reduce loading performance by limiting parallelism.
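
The following sketch combines several of these recommendations. The bucket, prefix, credentials, table name, and option values are assumptions:

    CREATE PIPELINE s3_events AS
      LOAD DATA S3 'example-bucket/events/2024/**'
      CONFIG '{"region": "us-east-1",
               "suffixes": ["csv"]}'
      CREDENTIALS '{"aws_access_key_id": "...",
                    "aws_secret_access_key": "..."}'
      BATCH_INTERVAL 10000
      MAX_PARTITIONS_PER_BATCH 8
      ENABLE OFFSETS METADATA GC
      INTO TABLE events
      FIELDS TERMINATED BY ',';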
