Data Loading for S3 Pipelines

For S3 pipelines, each leaf node partition will process a single object from the source bucket in a batch. For example, if your cluster has 16 partitions, and the source bucket contains 16 objects, the 1:1 relationship between objects and partitions means that every object in the bucket will be ingested at the same time. Once each partition has finished processing its object, the batch will be complete.

If the source bucket contains objects that greatly differ in size, it’s important to understand how an S3 pipeline’s performance may be affected. Consider two partitions on a leaf: partition1 is processing an object that is 1KB in size, while partition2 is processing an object that is 10 MB in size. partition1 will finish processing its object sooner than partition2. In this case, partition1 will sit idle and will not extract the next object from the bucket until partition2 finishes processing its 10 MB object. New objects will be processed only when partition1 and partition2 are both finished processing their respective objects.

Last modified: June 22, 2022

Was this article helpful?