Load Data with Pipelines
SingleStore Pipelines is a feature that continuously loads data as it arrives from external sources.
Pipelines support Apache Kafka, Amazon S3, Azure Blob Storage, file system, Google Cloud Storage, and HDFS data sources.
Pipelines support the JSON, Avro, Parquet, and CSV data formats.
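As a minimal sketch of the workflow (the table, pipeline name, broker endpoint, and topic below are placeholders, not taken from this page), a pipeline is created with CREATE PIPELINE and begins loading once started:

```sql
-- Hypothetical target table; adjust the schema to your data.
CREATE TABLE tweets (id BIGINT, body TEXT);

-- Continuously load JSON messages from a Kafka topic into the table.
CREATE PIPELINE tweets_pipeline AS
LOAD DATA KAFKA 'kafka-host.example.com:9092/tweets'
INTO TABLE tweets
FORMAT JSON (id <- id, body <- body);

-- Begin consuming; the pipeline runs in the background from here on.
START PIPELINE tweets_pipeline;
```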
A database backup preserves the state of all pipelines (offsets, etc.) in that database. When a backup is restored, all pipelines in that database revert to the state (offsets, etc.) they were in when the backup was taken.
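As a brief illustration (the database name and backup path here are assumptions), pipeline state travels with the backup:

```sql
-- Hypothetical database name and backup path.
BACKUP DATABASE mydb TO '/var/backups/mydb';

-- After a restore, every pipeline in mydb reverts to the offsets
-- recorded at backup time.
RESTORE DATABASE mydb FROM '/var/backups/mydb';
```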
Features
The features of SingleStore Pipelines make it a powerful alternative to third-party ETL middleware in many scenarios:
- Easy continuous loading: Pipelines monitor their source folder or Kafka queue and automatically load new files or messages as they arrive. This simplifies the job of the application developer.
- Scalability: Pipelines inherently scale with SingleStore clusters as well as with distributed data sources like Kafka and cloud data stores like Amazon S3.
- High Performance: In most situations, pipeline data is loaded in parallel from the data source directly to the SingleStore leaves, improving throughput by bypassing the aggregator. Additionally, Pipelines is optimized for low lock contention and high concurrency.
- Exactly-once Semantics: The architecture of Pipelines ensures that transactions are processed exactly once, even in the event of failover.
- Debugging: Pipelines makes it easier to debug each step in the ETL process by storing exhaustive metadata about transactions, including stack traces and stderr messages; a sketch of querying this metadata follows the list.
- Concurrency: Multiple pipelines can insert data into a single table, which is similar to using multiple write queries. See Sync Variables Lists for more information.
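As a minimal sketch of the last two points (the pipeline names, table, and endpoints are assumptions, not from this page), multiple pipelines can target one table, and the stored error metadata can be queried from information_schema.PIPELINES_ERRORS:

```sql
-- Hypothetical target table shared by both pipelines.
CREATE TABLE clicks (user_id BIGINT, url TEXT);

-- Two pipelines feeding the same table from different sources.
CREATE PIPELINE clicks_from_kafka AS
LOAD DATA KAFKA 'kafka-host.example.com:9092/clicks'
INTO TABLE clicks
FORMAT JSON (user_id <- user_id, url <- url);

CREATE PIPELINE clicks_from_s3 AS
LOAD DATA S3 'example-bucket/clicks/'
CONFIG '{"region": "us-east-1"}'
CREDENTIALS '{"aws_access_key_id": "...", "aws_secret_access_key": "..."}'
INTO TABLE clicks
FORMAT JSON (user_id <- user_id, url <- url);

START PIPELINE clicks_from_kafka;
START PIPELINE clicks_from_s3;

-- Inspect the stored error metadata (including stderr output) for one pipeline.
SELECT *
FROM information_schema.PIPELINES_ERRORS
WHERE PIPELINE_NAME = 'clicks_from_kafka';
```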
Last modified: October 8, 2024