Important
The SingleStore 9.1 release candidate (RC) gives you the opportunity to preview, evaluate, and provide feedback on new and upcoming features prior to their general availability. In the interim, SingleStore 9.0 is recommended for production workloads; these deployments can later be upgraded to SingleStore 9.1.
Kafka Connect Pipelines
Note
This is a Preview feature.
Overview
SingleStore Kafka Connect Pipelines can use Kafka Connect source connectors to stream data from external systems into SingleStore.
Key Concepts
The Kafka Connect framework is an open-source component of Apache Kafka that provides a scalable and reliable method to stream data between systems.
- Source connectors: Pull data from external systems into Kafka
- Sink connectors: Push data from Kafka to external systems
SingleStore uses Kafka Connect source connectors to ingest data directly into SingleStore tables without requiring an intermediate Kafka cluster.
Architecture
Key Architectural Features
- Leaf Node Processing: The extractor processes data on leaf nodes rather than the Master Aggregator, which reduces load on the aggregator and improves performance.
- Static Schema Table: Data is automatically loaded into a predefined table structure with three columns:
  - topic (TEXT): Source identifier
  - id (JSON): Unique record identifier
  - record (JSON): Complete record data
- JSON-Based Offset Management: Uses the information_schema.PIPELINES_SOURCE_OFFSETS table to track offsets in JSON format, which supports the complex offset structures required by different connectors.
- Multi-Task Support: Pipelines can spawn multiple tasks for parallel processing when the data source supports partitioning (for example, Kinesis shards).
How Kafka Connect Pipelines Work
When a Kafka Connect pipeline is created, SingleStore performs the following:
- Checks the connector class and configuration parameters
- If using CREATE INFERRED PIPELINE, automatically creates an inferred table with the static schema
- Launches the extractor process on leaf nodes
- Ingests data from the external data source and loads it into the table
- Stores offset information in PIPELINES_SOURCE_OFFSETS for exactly-once semantics
Deploy Kafka Connect Connectors
SingleStore provides full support for custom Kafka Connect connectors and enables deploying and configuring them with complete control over configuration and management.
Enable Kafka Connect Pipelines
Kafka Connect Pipelines is an experimental feature that must be explicitly enabled.
SET GLOBAL experimental_features_config = "kafka_connect_enabled=true"
Note
This setting must be configured before creating Kafka Connect Pipelines and requires the SUPER permission.
Run the following command to confirm whether this feature is enabled.
SHOW VARIABLES LIKE 'experimental_features_config'
+------------------------------+----------------------------+
| Variable_name | Value |
+------------------------------+----------------------------+
| experimental_features_config | kafka_connect_enabled=true |
+------------------------------+----------------------------+

After enabling Kafka Connect Pipelines, start the Kafka Connect source connector manually using the provided configuration.
Syntax
CREATE [OR REPLACE] INFERRED PIPELINE <pipeline_name>
AS LOAD DATA KAFKACONNECT '<kafka_connector>'
CONFIG '<connector_configuration_json>'
CREDENTIALS '<credentials_json>'
FORMAT AVRO;
CONFIG Parameter
The CONFIG parameter must contain a JSON object with the following:
- Required fields:
  - connector.class: The fully-qualified Java class name of the Kafka Connect source connector
- Common optional fields:
  - tasks.max: The maximum number of parallel tasks. The default value is 4.
- Connector-specific fields: Vary by connector type.
Refer to CREATE PIPELINE and CREATE INFERRED PIPELINE for more information.
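As a sketch, a minimal pipeline definition combining these fields might look like the following. The connector class name and pipeline name here are placeholders for illustration, not real values:

```sql
-- Minimal sketch: only connector.class is required; tasks.max defaults to 4
CREATE INFERRED PIPELINE example_pipeline
AS LOAD DATA KAFKACONNECT 'kafka-connector'
CONFIG '{
  "connector.class": "com.example.MySourceConnector",
  "tasks.max": 2
}'
CREDENTIALS '{}'
FORMAT AVRO;
```

Connector-specific fields (stream names, regions, credentials keys, and so on) would be added to the same CONFIG JSON object alongside these.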
Offset Management
Kafka Connect Pipelines use JSON-based offsets stored in information_schema.PIPELINES_SOURCE_OFFSETS, which differs from the traditional integer-based offsets used by native Kafka pipelines.
SELECT * FROM information_schema.PIPELINES_SOURCE_OFFSETS
WHERE PIPELINE_NAME = 'kafkaconnect-pipeline';
Offset Format Examples
The following is the format of KEY and VALUE of Amazon Kinesis offsets:
KEY: {"shardId":"shardId-XXXX"}
VALUE: {"sequenceNumber":"XXXX"}
Manage Kafka Connect Pipelines
Pipeline Lifecycle Operations
Refer to The Lifecycle of a Pipeline for more information.
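The standard pipeline lifecycle statements apply to Kafka Connect pipelines as well. A brief sketch, using an illustrative pipeline name:

```sql
-- Begin ingesting from the connector
START PIPELINE kafkaconnect_pipeline;

-- Pause ingestion; offsets are retained so ingestion resumes where it left off
STOP PIPELINE kafkaconnect_pipeline;

-- Remove the pipeline and its metadata (the target table is not dropped)
DROP PIPELINE kafkaconnect_pipeline;
```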
Check Pipeline Status
SELECT
  PIPELINE_NAME,
  STATE,
  CONFIG_JSON
FROM information_schema.PIPELINES
WHERE PIPELINE_NAME = '<kafkaconnect-pipeline>';
The following are the pipeline states:
- Running: Pipeline is actively ingesting data
- Stopped: Pipeline is stopped
- Error: Pipeline encountered an error
Configuration Best Practices
Task Configuration
Parallel processing with tasks.max:
- Configure tasks.max based on data source partitioning
- For Amazon Kinesis: Set tasks.max equal to the number of shards
- Monitor TASK_ID distribution in PIPELINES_SOURCE_OFFSETS
-- Check task distribution
SELECT
  TASK_ID,
  COUNT(*) AS offset_count
FROM information_schema.PIPELINES_SOURCE_OFFSETS
WHERE PIPELINE_NAME = '<pipeline_name>'
GROUP BY TASK_ID;
Security Best Practices
- Credential Storage: Always use the CREDENTIALS parameter for sensitive information; never include passwords in CONFIG
- Network Security: Ensure secure connections to data sources (use SSL/TLS when available)
- Access Control: Grant minimum required permissions to pipeline users
- Audit Logging: Enable logging for pipeline operations and monitor access
Performance Optimization
- Configure Extraction Parameters: Use SET statements to tune extraction performance:
  SET GLOBAL pipelines_extractor_batch_size = 10000;
  SET GLOBAL pipelines_extractor_max_batch_interval_ms = 1000;
- Computed Columns: Create computed columns for frequently accessed JSON fields:
  ALTER TABLE <pipeline_table>
  ADD COLUMN customer_id AS (JSON_EXTRACT_STRING(record, 'customer_id')) PERSISTED INT;
- Indexes: Add indexes on computed columns for better query performance:
  CREATE INDEX idx_customer_id ON <pipeline_table>(customer_id);
- Monitor Batch Times: Track batch processing time and adjust configuration if needed
- Offset Progress: Regularly check PIPELINES_SOURCE_OFFSETS to ensure offsets are advancing
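Batch times can be tracked through the standard pipelines metadata views. A hedged sketch — the column names here assume the stock information_schema.PIPELINES_BATCHES_SUMMARY view available for SingleStore pipelines generally:

```sql
-- Recent batch durations for one pipeline (most recent first)
SELECT BATCH_ID, BATCH_STATE, BATCH_TIME
FROM information_schema.PIPELINES_BATCHES_SUMMARY
WHERE PIPELINE_NAME = '<pipeline_name>'
ORDER BY BATCH_ID DESC
LIMIT 10;
```

If batch times trend upward, revisit the extraction parameters above; if offsets in PIPELINES_SOURCE_OFFSETS stop advancing between snapshots, the pipeline may be stalled.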
Examples
The following examples demonstrate how to ingest data from Amazon Kinesis.
Example: Amazon Kinesis Pipeline
The following example demonstrates how to create a basic Kafka Connect Pipeline that automatically creates a table with a static schema and then ingests data from Amazon Kinesis into the table.
-- Enable the experimental feature
SET GLOBAL experimental_features_config = "kafka_connect_enabled=true";

-- Create Kinesis pipeline
CREATE INFERRED PIPELINE kinesis_pipeline
AS LOAD DATA KAFKACONNECT 'kafka-connector'
CONFIG '{
  "connector.class": "com.singlestore.kafka.connect.kinesis.KinesisSourceConnector",
  "aws.access.key.id": "<aws_access_key>",
  "aws.secret.key.id": "<aws_secret_key>",
  "kafka.topic": "kinesis-topic",
  "kinesis.stream": "my-kinesis-stream",
  "kinesis.region": "us-east-1",
  "tasks.max": 4
}'
CREDENTIALS '{}' -- AWS credentials are in CONFIG for the Kinesis connector
FORMAT AVRO;

-- Start the pipeline
START PIPELINE kinesis_pipeline;
Static Schema Table
When an inferred Kafka Connect Pipeline is created, SingleStore automatically creates a table:
CREATE TABLE `<pipeline_name>` (
`topic` text CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
`id` JSON COLLATE utf8mb4_bin NOT NULL,
`record` JSON COLLATE utf8mb4_bin NOT NULL,
SORT KEY `__UNORDERED` (),
SHARD KEY ()
)

Querying Data
Extract data from the JSON record column using JSON functions:
-- Extract a specific field
SELECT
  topic,
  JSON_EXTRACT_STRING(record, 'fieldName') AS field_value
FROM kinesis_pipeline;

-- Filter based on JSON content
SELECT *
FROM kinesis_pipeline
WHERE JSON_EXTRACT_STRING(record, 'status') = 'active';

-- Extract multiple fields
SELECT
  topic,
  JSON_EXTRACT_STRING(record, 'customer_id') AS customer_id,
  JSON_EXTRACT_STRING(record, 'order_id') AS order_id,
  JSON_EXTRACT_STRING(record, 'timestamp') AS event_time
FROM kinesis_pipeline;

-- Create a computed column for better performance
ALTER TABLE kinesis_pipeline
ADD COLUMN status AS (JSON_EXTRACT_STRING(record, 'status')) PERSISTED TEXT;

CREATE INDEX idx_status ON kinesis_pipeline(status);

-- Query using the computed column
SELECT *
FROM kinesis_pipeline
WHERE status = 'active';
Example 2: Amazon Kinesis Pipeline with Stored Procedure
The following example demonstrates how to use the INTO PROCEDURE clause with a stored procedure to process and transform incoming Kinesis data before inserting it into a custom table schema.
-- Enable the experimental feature
SET GLOBAL experimental_features_config = "kafka_connect_enabled=true";

-- Create the target table for parsed records
CREATE TABLE parsed_kinesis_stream (
  partitionKey BIGINT,
  sequence_number VARCHAR(255),
  shardId VARCHAR(255),
  data_base64 VARCHAR(255),
  parsed_data AS FROM_BASE64(data_base64) PERSISTED VARCHAR(255)
);

-- Create the stored procedure to process incoming records
CREATE OR REPLACE PROCEDURE parse_kinesis_stream(batch QUERY(topic VARCHAR(255), id JSON, record JSON))
AS
BEGIN
  INSERT INTO parsed_kinesis_stream (
    partitionKey,
    sequence_number,
    shardId,
    data_base64
  )
  SELECT
    record::$partitionKey,
    record::$sequenceNumber,
    record::$shardId,
    record::$data
  FROM batch;
END;

-- Create Kinesis pipeline with INTO PROCEDURE
CREATE PIPELINE kinesis_pipeline_to_proc
AS LOAD DATA KAFKACONNECT 'kafka-connector'
CONFIG '{
  "connector.class": "com.singlestore.kafka.connect.kinesis.KinesisSourceConnector",
  "aws.access.key.id": "<aws_access_key>",
  "aws.secret.key.id": "<aws_secret_key>",
  "kafka.topic": "kinesis-topic",
  "kinesis.stream": "my-kinesis-stream",
  "kinesis.region": "us-east-1",
  "tasks.max": 4
}'
CREDENTIALS '{}'
BATCH_INTERVAL 2500
INTO PROCEDURE parse_kinesis_stream
FORMAT AVRO (
  `topic` <- `topic`,
  `id` <- `id`,
  `record` <- `record`
);

-- Start the pipeline
START PIPELINE kinesis_pipeline_to_proc;
Target Table Schema
The target table stores both the raw base64-encoded data and a computed column that automatically decodes it:
| Column | Type | Description |
|---|---|---|
| partitionKey | BIGINT | Kinesis partition key |
| sequence_number | VARCHAR(255) | Kinesis sequence number |
| shardId | VARCHAR(255) | Source shard identifier |
| data_base64 | VARCHAR(255) | Raw base64-encoded payload |
| parsed_data | VARCHAR(255) | Automatically decoded payload (computed) |
Querying Data
Query the decoded data directly using the computed column:
-- Query decoded data
SELECT
  partitionKey,
  sequence_number,
  parsed_data
FROM parsed_kinesis_stream;

-- Extract JSON fields from decoded data
SELECT
  partitionKey,
  JSON_EXTRACT_STRING(parsed_data, 'customer_id') AS customer_id,
  JSON_EXTRACT_STRING(parsed_data, 'event_type') AS event_type
FROM parsed_kinesis_stream;
Working with Base64-Encoded Data
Kinesis record payloads are delivered base64-encoded in the record::$data field. Use FROM_BASE64() to decode the data before applying JSON functions:
-- Decode and extract in a single query
SELECT
  JSON_EXTRACT_STRING(FROM_BASE64(data_base64), 'customer_id') AS customer_id,
  JSON_EXTRACT_STRING(FROM_BASE64(data_base64), 'status') AS status
FROM parsed_kinesis_stream
WHERE JSON_EXTRACT_STRING(FROM_BASE64(data_base64), 'status') = 'active';