SingleStore Managed Service

Parallelized Data Extraction with Pipelines

A pipeline extracts data from a source, in parallel, using these general rules:

  • The pipeline pairs n number of source partitions or objects with p number of SingleStore leaf node partitions.

  • Each leaf node partition runs its own extraction process independently of other leaf nodes and their partitions.

  • Extracted data is stored on the leaf node where a partition resides until it can be written to the destination table. Depending on the way your table is sharded, the extracted data may only temporarily be stored on this leaf node.

Data Loading for Azure Blob Pipelines

Similar to S3 pipelines, each leaf partition will process a single object from Azure Blob storage as part of a batch operation. When all partitions have finished extracting, transforming, and loading their object, the batch operation will be completed.

Data Loading for HDFS Pipelines

When the master aggregator reads an HDFS output directory’s contents, it schedules each file on a single SingleStore DB partition. After each leaf partition across the cluster has finished extracting, transforming, and loading its file, a batch operation has been completed.

Data Loading for Kafka Pipelines

For Kafka pipelines, there should be a 1:1 relationship between the number of leaf node partitions and the number of Kafka partitions. For example, if your database has two leaves with eight partitions each, your Kafka cluster should have 16 partitions. If the database or the data source’s partitions aren’t equal in number, leaf nodes will either sit idle or will process uneven amounts of data. However, even in scenarios when leaf nodes are processing an uneven amount of data, ingestion using Pipelines will generally be more performant than parallel loading through aggregator nodes.

Data Loading for S3 Pipelines

For S3 pipelines, each leaf node partition will process a single object from the source bucket in a batch. For example, if your cluster has 16 partitions, and the source bucket contains 16 objects, the 1:1 relationship between objects and partitions means that every object in the bucket will be ingested at the same time. Once each partition has finished processing its object, the batch will be complete.

If the source bucket contains objects that greatly differ in size, it’s important to understand how an S3 pipeline’s performance may be affected. Consider two partitions on a leaf: partition1 is processing an object that is 1KB in size, while partition2 is processing an object that is 10 MB in size. partition1 will finish processing its object sooner than partition2. In this case, partition1 will sit idle and will not extract the next object from the bucket until partition2 finishes processing its 10 MB object. New objects will be processed only when partition1 and partition2 are both finished processing their respective objects.

Pipelines Scheduling

SingleStore DB supports running multiple pipelines in parallel. Pipelines will be run in parallel until all SingleStore DB partitions have been saturated. For example, consider a SingleStore DB cluster with 10 partitions. With this architecture, it is possible to run 5 parallel pipelines using 2 partitions each, 2 pipelines using 5 partitions each, and so on.

If the partition requirements of any two pipelines exceed the total number of SingleStore partitions, each pipeline will be run serially in a round robin fashion. For example, consider a SingleStore DB cluster with 10 partitions, and 3 pipelines. Let's say the first batch of pipelines P1, P2, and P3 requires 4, 8, and 4 partitions, respectively. The pipelines are scheduled concurrently with the aim of saturating the partitions in a SingleStore DB cluster. Hence, the scheduler will run pipelines P1 and P3 in parallel to process their first batch. And then, it will run pipeline P2 serially, because the sum of number of partitions required by P2 and any other pipeline (P1 or P3) is greater than the number of partitions in the cluster (10 partitions). You can specify the maximum number of pipeline batch partitions that can run concurrently using the pipelines_max_concurrent_batch_partitions engine variable. Note that how many partitions a pipeline uses is dependent on the pipeline source.