Parallelized Data Extraction with Pipelines

A pipeline extracts data from a source, in parallel, using these general rules:

  • The pipeline pairs n number of source partitions or objects with p number of SingleStore leaf node partitions.

  • Each leaf node partition runs its own extraction process independently of other leaf nodes and their partitions.

  • Extracted data is stored on the leaf node where a partition resides until it can be written to the destination table. Depending on the way your table is sharded, the extracted data may only temporarily be stored on this leaf node.

Note

The term batch partition is used below and elsewhere in the documentation. A batch partition is a partition in the data source. If the data source does not contain partitions, then a batch partition refers to a single object in the data source.

Data Loading for Azure Blob Pipelines

Similar to S3 pipelines, each leaf partition will process a single object from Azure Blob storage as part of a batch operation. When all partitions have finished extracting, transforming, and loading their object, the batch operation will be completed.

Data Loading for HDFS Pipelines

When the master aggregator reads an HDFS output directory’s contents, it schedules each file on a single SingleStore partition. After each leaf partition across the workspace has finished extracting, transforming, and loading its file, a batch operation has been completed.

Data Loading for Kafka Pipelines

For Kafka pipelines to have optimal performance, there should be an equal number of partitions between SingleStore and Kafka (i.e., a 1-to-1 relationship). Leaf nodes will process data unevenly or sit idle if the number of partitions in the database and the data source are not equal.

In scenarios where the leaf nodes are processing unequal amounts of data, pipeline ingestion will generally outperform parallel loading through aggregator nodes.

The SingleStore Master Aggregator (MA) connects to Kafka’s lead broker and requests metadata about the Kafka cluster, including information about the brokers, topics, and partitions. For example, the MA examines the information and determines the Kafka cluster has four partitions spread across two brokers. The MA will assign leaf node partitions to each Kafka partition. The leaf partitions become Kafka consumers.

Kafka to SingleStore One-to-One Relationship

Kafka Cluster

SingleStore Cluster

BKR (P1) (P3)

MA = Master Aggregator

CA = Child aggregator

BKR (P2) (P4)

L1 & L2 = Leaf 1 and Leaf 2

P1 - P4 = partitions 1 - 4

Data Loading for S3 Pipelines

For S3 pipelines, each leaf node partition will process a single object from the source bucket in a batch.For example, if your workspace has 16 partitions, and the source bucket contains 16 objects, the 1:1 relationship between objects and partitions means that every object in the bucket will be ingested at the same time. Once each partition has finished processing its object, the batch will be complete.

If the source bucket contains objects that greatly differ in size, it’s important to understand how an S3 pipeline’s performance may be affected. Consider two partitions on a leaf: partition1 is processing an object that is 1KB in size, while partition2 is processing an object that is 10 MB in size. partition1will finish processing its object sooner than partition2. In this case, partition1 will sit idle and will not extract the next object from the bucket until partition2 finishes processing its 10 MB object. New objects will be processed only when partition1 and partition2 are both finished processing their respective objects.

Last modified: February 23, 2024

Was this article helpful?