The Lifecycle of a Pipeline

Use CREATE PIPELINE to create a new pipeline.
Use START PIPELINE to start the pipeline.

Note

Steps 3 to 8 refer to a batch, which is a subset of data that the pipeline extracts from its data source. Refer to Batches for related information.
The pipeline extracts the batch (files for S3; offset ranges for Kafka) and updates offsets to reflect progress in the data source.
The pipeline optionally transforms (modifies) the batch, using one of the data shaping methods. Refer to Methods for Data Shaping with Pipelines for more information.
If the pipeline is able to successfully process the batch, it loads the batch into one or more SingleStore tables. If the processing fails, the batch is fully rolled back.
If an error occurs while a batch b is running, the batch b fails and all its transactions are rolled back. Batch b is then retried up to pipelines_max_retries_per_batch_partition times. If pipelines_stop_on_error is set to ON and all retries are unsuccessful, the pipeline stops. Otherwise, the pipeline continues and processes a new batch nb, using the same files or objects from batch b, excluding the files or objects that may have caused the error. Refer to Pipeline Troubleshooting for more information.
The pipeline updates the FILE_STATE column in the information_schema.PIPELINES_FILES table as follows:
- Files and objects in the batch that the pipeline processed successfully are marked as Loaded.
- Files and objects in the batch that the pipeline did not process successfully, after all retries are unsuccessful (as described in step 6), are marked as Skipped.
A file or object that is marked as Loaded or Skipped is not processed again by the pipeline, unless ALTER PIPELINE ... DROP FILE ... is run. The pipeline does not delete files and objects from the data source.
The pipeline checks if the data source contains new data. If it does, the pipeline processes another batch immediately by running steps 3 to 7 again. If the data source does not contain more data, the pipeline waits for BATCH_INTERVAL milliseconds (which is specified in the CREATE PIPELINE statement) before checking the data source for new data. If the pipeline finds new data at this point, the pipeline runs steps 3 to 7 again.

Note

Use STOP PIPELINE to stop a running pipeline. If a batch is in progress when the command runs, the batch completes before the pipeline stops.
If a pipeline is stopped with the DETACH PIPELINE command, the data loading is stopped. However, it can be restarted with the START PIPELINE command and the data loading continues in the same way as before.

During a pipeline's lifecycle, the pipeline updates the pipelines tables in the information schema. Refer to Data Ingest for more information on pipelines tables. Except for the update of the information_schema.PIPELINES_FILES table mentioned in step 7, other updates are not discussed here.

Batches

A batch is the set of data processed in one pipeline cycle across all batch partitions. It succeeds or fails as a unit, and failures cause a rollback for the entire batch.

A batch partition is the work performed by a single database partition during a pipeline lifecycle. Each partition independently extracts data from the source (files or offsets) and contributes to the batch.

BATCH_INTERVAL is the time a running pipeline waits before checking for new data when the source temporarily has no data.

pipelines_max_retries_per_batch_partition defines the maximum number of retries for a failing batch partition before the pipeline either continues or stops, depending on the configuration.

S3 Pipelines

Each database partition (batch partition) picks exactly one source file per batch cycle, if available.
The union of all files across partitions in that cycle is the batch.
If fewer files are available than partitions, some partitions remain idle; the batch still consists of the files picked that cycle.
File paths appear in information_schema.PIPELINES_BATCHES as BATCH_SOURCE_PARTITION_ID.

For example,

SQL

SELECT * FROM information_schema.PIPELINES_BATCHES WHERE DATABASE_NAME = "trades";

| DATABASE_NAME | PIPELINE_NAME | BATCH_ID | BATCH_STATE | BATCH_ROWS_WRITTEN | BATCH_ROWS_INSERTED | BATCH_ROWS_UPDATED | BATCH_ROWS_DELETED | BATCH_TIME | BATCH_START_UNIX_TIMESTAMP | BATCH_PARTITION_STATE | BATCH_PARTITION_PARSED_ROWS | BATCH_SOURCE_PARTITION_ID | BATCH_EARLIEST_OFFSET | BATCH_LATEST_OFFSET | BATCH_PARTITION_TIME | BATCH_PARTITION_EXTRACTED_BYTES | BATCH_PARTITION_TRANSFORMED_BYTES | BATCH_PARTITION_EXTRACTOR_WAIT_TIME | BATCH_PARTITION_TRANSFORM_WAIT_TIME |                                                     HOST                                      | PORT | PARTITION |
| ------------- | ------------- | -------- | ----------- | ------------------ | ------------------- | ------------------ | ------------------ | ---------- | -------------------------- | --------------------- | --------------------------- | ------------------------- | --------------------- | ------------------- | -------------------- | ------------------------------- | --------------------------------- | ----------------------------------- | ----------------------------------- | --------------------------------------------------------------------------------------------- | ---- | --------- |
|     trades    |    company    |   40129  |  Succeeded  |        3,288       |        3,288        |          0         |           0        |  1.161775  |      1762498594.440545     |       Succeeded       |             3,288           |     trades/company.csv    |           0           |         1           |       0.740079       |              387,137            |                 NULL              |                 0.6682              |                   0                 | node-57eb03ab-7fba-445a-b4c8-31af2e0633ae-leaf-ag2-0.svc-57eb03ab-7fba-445a-9abb-23fe45e4c0ab | 3306 |     5     |
|     trades    |    trade      |   40130  |  Succeeded  |     5,810,328      |       5,810,328     |          0         |           0        |  17.830798 |      1762498594.440578     |       Succeeded       |           5,810,328         |     trades/trade.csv      |           0           |         1           |       15.383328      |            329,317,993          |                 NULL              |                 3.7094              |                   0                 | node-57eb03ab-7fba-445a-b4c8-31af2e0633ae-leaf-ag1-0.svc-57eb03ab-7fba-445a-9abb-23fe45e4c0ab | 3306 |     1     |

Kafka Pipelines

Each database partition claims a contiguous set of offsets from its assigned Kafka partition for the current batch cycle.
The union of all offsets across partitions constitutes the batch.
Earliest and latest offsets for each batch partition appear as BATCH_EARLIEST_OFFSET and BATCH_LATEST_OFFSET along with the Kafka partition in BATCH_SOURCE_PARTITION_ID.

For Example,

SQL

SELECT * FROM information_schema.PIPELINES_BATCHES WHERE DATABASE_NAME = "adtech";

| DATABASE_NAME | PIPELINE_NAME | BATCH_ID | BATCH_STATE | BATCH_ROWS_WRITTEN | BATCH_ROWS_INSERTED | BATCH_ROWS_UPDATED | BATCH_ROWS_DELETED | BATCH_TIME | BATCH_START_UNIX_TIMESTAMP | BATCH_PARTITION_STATE | BATCH_PARTITION_PARSED_ROWS | BATCH_SOURCE_PARTITION_ID | BATCH_EARLIEST_OFFSET | BATCH_LATEST_OFFSET | BATCH_PARTITION_TIME | BATCH_PARTITION_EXTRACTED_BYTES | BATCH_PARTITION_TRANSFORMED_BYTES | BATCH_PARTITION_EXTRACTOR_WAIT_TIME | BATCH_PARTITION_TRANSFORM_WAIT_TIME |                                              HOST                                             | PORT | PARTITION |
| ------------- | ------------- | -------- | ----------- | ------------------ | ------------------- | ------------------ | ------------------ | ---------- | -------------------------- | --------------------- | --------------------------- | ------------------------- | --------------------- | ------------------- | -------------------- | ------------------------------- | --------------------------------- | ----------------------------------- | ----------------------------------- | --------------------------------------------------------------------------------------------- | ---- | --------- |
|     adtech    |     events    |   40006  |  Succeeded  |      8,000,000     |       8,000,000     |           0        |          0         | 158.126145 |      1762497976.481443     |        Succeeded      |           1,000,000         |              0            |       37,021,295      |      38,021,295     |       145.240453     |             105,468,756         |                 NULL              |                 0.7554              |                   0                 | node-57eb03ab-7fba-445a-b4c8-31af2e0633ae-leaf-ag2-0.svc-57eb03ab-7fba-445a-9abb-23fe45e4c0ab | 3306 |     7     |

The Lifecycle of a Pipeline

On this page

Batches

S3 Pipelines

Kafka Pipelines

Was this article helpful?

On this page

Was this article helpful?