Load Data with Pipelines

SingleStore Pipelines is a feature that continuously loads data as it arrives from external sources. As a built-in component of the database, Pipelines can extract, shape (modify), and load external data without the need for third-party tools or middleware. Pipelines are robust, scalable, and highly performant, and support fully distributed workloads.
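As a minimal sketch of the extract-and-load flow, the following creates and starts a pipeline that continuously ingests CSV messages from a Kafka topic into an existing table. All names here (`clicks_pipeline`, `clicks`, the broker host, and the `clickstream` topic) are hypothetical placeholders:

```sql
-- Hypothetical example: continuously load CSV-formatted Kafka messages
-- into an existing table named clicks.
CREATE PIPELINE clicks_pipeline AS
  LOAD DATA KAFKA 'kafka-host:9092/clickstream'
  INTO TABLE clicks
  FIELDS TERMINATED BY ',';

-- Pipelines do not run until explicitly started.
START PIPELINE clicks_pipeline;
```

Once started, the pipeline runs in the background and loads new messages as they arrive, with no external scheduler or ETL job required.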

Pipelines support the following data sources: Apache Kafka, Amazon S3, Azure Blob Storage, Google Cloud Storage, HDFS, and the local file system.

Pipelines support the JSON, Avro, Parquet, and CSV data formats.
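For non-delimited formats, the pipeline definition names the format and maps source fields to table columns. The sketch below assumes an S3 bucket of JSON files; the bucket path, credentials, table, and column names are placeholders, and the exact mapping clause may vary by version:

```sql
-- Hypothetical example: load JSON objects from S3, mapping JSON fields
-- to columns of an existing table named events.
CREATE PIPELINE events_pipeline AS
  LOAD DATA S3 'my-bucket/events/'
  CONFIG '{"region": "us-east-1"}'
  CREDENTIALS '{"aws_access_key_id": "...", "aws_secret_access_key": "..."}'
  INTO TABLE events
  FORMAT JSON
  (event_id <- event_id, payload <- payload);
```

The same `FORMAT` clause pattern applies to Avro and Parquet sources.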

A database backup preserves the state of all pipelines (offsets, etc.) in that database.

When a backup is restored, all pipelines in that database revert to the state (offsets, etc.) they were in when the target backup was taken.
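The backup and restore commands themselves are ordinary database-level statements; the database and path below are placeholders:

```sql
-- Back up the database; pipeline state (offsets, etc.) is captured as of
-- the time the backup is taken.
BACKUP DATABASE mydb TO '/backups/mydb';

-- Restoring reverts every pipeline in the database to that captured state,
-- so ingestion resumes from the backed-up offsets.
RESTORE DATABASE mydb FROM '/backups/mydb';
```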

Features

The features of SingleStore Pipelines make it a powerful alternative to third-party ETL middleware in many scenarios:

  • Easy continuous loading: Pipelines monitor their source folder or Kafka topic and automatically load new files or messages as they arrive, simplifying the job of the application developer.

  • Scalability: Pipelines scale naturally with SingleStoreDB clusters, as well as with distributed data sources such as Kafka and cloud object stores such as Amazon S3.

  • High Performance: In most situations, pipeline data is loaded in parallel from the data source directly onto the SingleStore leaf nodes, bypassing the aggregator and improving throughput. Additionally, Pipelines are optimized for low lock contention and high concurrency.

  • Exactly-once Semantics: The architecture of Pipelines ensures that transactions are processed exactly once, even in the event of failover.

  • Debugging: Pipelines make it easier to debug each step of the ETL process by storing exhaustive metadata about transactions, including stack traces and stderr messages.
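That debugging metadata is exposed through `information_schema` tables and can be queried like any other data. A sketch, assuming a placeholder pipeline name (the exact set of columns in `PIPELINES_ERRORS` may vary by version):

```sql
-- Inspect recent errors recorded for a given pipeline, including the
-- error messages captured during extraction and loading.
SELECT *
FROM information_schema.PIPELINES_ERRORS
WHERE PIPELINE_NAME = 'my_pipeline';
```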