Data Loading for Kafka Pipelines

For optimal performance, a Kafka pipeline should have an equal number of partitions in SingleStore and Kafka (i.e., a 1-to-1 relationship). If the partition counts in the database and the data source differ, leaf nodes will process data unevenly or sit idle.
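To illustrate why mismatched partition counts cause uneven work, here is a small sketch (plain Python, not a SingleStore API) that counts how many Kafka partitions each database partition would consume under a simple round-robin pairing:

```python
def partitions_per_consumer(kafka_partitions: int, db_partitions: int) -> list[int]:
    """Return how many Kafka partitions each database partition consumes
    under a simple round-robin assignment."""
    counts = [0] * db_partitions
    for p in range(kafka_partitions):
        counts[p % db_partitions] += 1
    return counts

# 1-to-1: every database partition does equal work.
print(partitions_per_consumer(4, 4))   # [1, 1, 1, 1]

# Mismatch: some database partitions consume two Kafka partitions
# while others consume only one, so batches finish unevenly.
print(partitions_per_consumer(6, 4))   # [2, 2, 1, 1]
```

With a 1-to-1 mapping, every consumer finishes its share of a batch at roughly the same time; with a mismatch, the batch is gated by the most heavily loaded partitions.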

Even in scenarios where the leaf nodes process unequal amounts of data, pipeline ingestion will generally outperform parallel loading through aggregator nodes.

The SingleStore Master Aggregator (MA) connects to Kafka’s lead broker and requests metadata about the Kafka cluster, including information about the brokers, topics, and partitions. For example, suppose the MA examines this metadata and determines that the Kafka cluster has four partitions spread across two brokers. The MA then assigns a leaf node partition to each Kafka partition, and the leaf partitions become Kafka consumers.
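The assignment step in the example above can be sketched as follows. This is an illustrative model, not SingleStore internals; the broker and leaf-partition names are hypothetical:

```python
# Metadata the MA might see: four Kafka partitions across two brokers.
brokers = {"broker-1": ["P1", "P3"], "broker-2": ["P2", "P4"]}

# Flatten the topic's partitions out of the broker metadata.
kafka_partitions = sorted(p for parts in brokers.values() for p in parts)

# Four database partitions spread across two leaf nodes (L1, L2).
leaf_partitions = ["L1-part0", "L1-part1", "L2-part0", "L2-part1"]

# Pair each leaf partition with one Kafka partition (1-to-1),
# making each leaf partition a Kafka consumer.
assignment = dict(zip(leaf_partitions, kafka_partitions))
print(assignment)
# {'L1-part0': 'P1', 'L1-part1': 'P2', 'L2-part0': 'P3', 'L2-part1': 'P4'}
```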

SingleStoreDB processes data from Kafka in order, per partition. Note that if data is added to different Kafka partitions, the inserts may not be sequential across partitions.
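This ordering guarantee can be made concrete with a short sketch (assumed behavior, not SingleStore code): each partition's messages keep their relative order in the ingested stream, but messages from different partitions may interleave arbitrarily:

```python
# Messages produced to each Kafka partition, in offset order.
partition_logs = {
    "P1": ["a1", "a2", "a3"],   # offsets 0, 1, 2 within P1
    "P2": ["b1", "b2"],         # offsets 0, 1 within P2
}

# One possible interleaving the database could observe: P1's and P2's
# messages mix, but each partition's internal order is preserved.
observed = ["a1", "b1", "a2", "a3", "b2"]

def preserves_partition_order(observed: list[str], log: list[str]) -> bool:
    """True if `log` appears as a subsequence of `observed`,
    i.e. the partition's relative order survived the interleaving."""
    it = iter(observed)
    return all(msg in it for msg in log)

print(preserves_partition_order(observed, partition_logs["P1"]))  # True
print(preserves_partition_order(observed, partition_logs["P2"]))  # True
```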

Each leaf node will process different Kafka partitions per batch. In subsequent batches, different leaf nodes can end up processing the same Kafka partition; a leaf node is not guaranteed to process the same partition from batch to batch.
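A minimal sketch of this batch-to-batch reassignment, assuming a simple rotation for illustration (the actual scheduling policy is not specified here):

```python
kafka_parts = ["P1", "P2", "P3", "P4"]
leaf_parts = ["L1-part0", "L1-part1", "L2-part0", "L2-part1"]

def assignment_for_batch(batch: int) -> dict[str, str]:
    """Rotate the Kafka partition list by the batch number before pairing,
    so a leaf partition is not pinned to one Kafka partition."""
    n = len(kafka_parts)
    rotated = [kafka_parts[(i + batch) % n] for i in range(n)]
    return dict(zip(leaf_parts, rotated))

print(assignment_for_batch(0)["L1-part0"])  # P1
print(assignment_for_batch(1)["L1-part0"])  # P2 -- same leaf, new Kafka partition
```

The point is only that the pairing can change: correctness does not depend on a stable leaf-to-partition mapping, because ordering is tracked per Kafka partition, not per consumer.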

Kafka to SingleStore One-to-One Relationship

[Diagram: a Kafka cluster with two brokers (BKR), holding partitions P1/P3 and P2/P4 respectively, mapped one-to-one onto a SingleStore cluster consisting of a Master Aggregator, a Child Aggregator, and two leaf nodes.]

MA = Master Aggregator
CA = Child Aggregator
BKR = Kafka broker
L1 & L2 = Leaf 1 and Leaf 2
P1 - P4 = Partitions 1 - 4

Last modified: September 11, 2023
