# Troubleshoot Pipeline Performance and Memory Usage

## Concepts

This topic requires an understanding of pipeline batches, which are explained in [The Lifecycle of a Pipeline](https://docs.singlestore.com/db/v9.1/load-data/about-singlestore-pipelines/pipeline-concepts/the-lifecycle-of-a-pipeline.md).

## Detect Slow Performance and High Memory Usage

## Run `PROFILE PIPELINE`

Run [PROFILE PIPELINE](https://docs.singlestore.com/db/v9.1/reference/sql-reference/pipelines-commands/profile-pipeline.md) to gather resource consumption metrics, such as starting and ending times, for operations that a batch processes. Partition specific metrics are not included; to retrieve those query the information schema view specified in the next section.

## Query the `information_schema.PIPELINE_BATCHES_SUMMARY` View

The following query returns diagnostics for each batch associated with a pipeline that writes to a database.

```sql
SELECT DATABASE_NAME, PIPELINE_NAME, BATCH_ID, BATCH_STATE, BATCH_TIME
FROM information_schema.PIPELINES_BATCHES_SUMMARY
ORDER BY BATCH_ID;
```

* `BATCH_STATE`: If the `BATCH_STATE` of a batch is `Queued`, it indicates that the batch cannot extract, shape, or load data until cluster resources are freed.
* A high `BATCH_TIME` value indicates a long amount of time for the batch to extract, shape, and to load the data (combined).

For complete reference and description of all the fields, refer to [information\_schema.PIPELINE\_BATCHES\_SUMMARY](https://docs.singlestore.com/db/v9.1/reference/information-schema-reference/data-ingest/pipelines-batches-summary.md).

## Query the `information_schema.MV_ACTIVITIES` and `information_schema.MV_QUERIES` Views

As an alternative to querying the `information_schema.PIPELINE_BATCHES_SUMMARY` table, as detailed in the previous section, you can run the following query, which provides diagnostic information per pipeline, such as `avg_cpu_time_ms`, `avg_network_time_ms`, `avg_disk_time_ms`, and `avg_memory_mb`.

```sql
SELECT substr(replace(replace(MVAC.activity_name,'\n',''), ' ',' '),1,30) AS ACTIVITY_n,
substr(replace(replace(query_text,'\n',''), ' ',' '),1,30) AS query_text,
database_name AS databaseName,last_finished_timestamp AS last_run,
(cpu_time_ms/(run_count+success_count+failure_count)) AS avg_cpu_time_ms,
(cpu_wait_time_ms/(run_count+success_count+failure_count)) AS avg_cpu_wait_time_ms,
(elapsed_time_ms/(run_count+success_count+failure_count)) AS avg_elapsed_time_ms,
(lock_time_ms/(run_count+success_count+failure_count)) AS avg_lock_time_ms,
(network_time_ms/(run_count+success_count+failure_count)) AS avg_network_time_ms,
(disk_time_ms/(run_count+success_count+failure_count)) AS avg_disk_time_ms,
(round((disk_b/1024/1024),2)/(run_count+success_count+failure_count)) AS avg_io_mb,
(round((network_b/1024/1024),2)/(run_count+success_count+failure_count)) AS avg_network_mb,
round((1000*(memory_bs/1024/1024)/(elapsed_time_ms)),2) AS avg_memory_mb,
(memory_major_faults/(run_count+success_count+failure_count)) AS avg_major_faults,
(run_count+success_count+failure_count) AS total_executions
FROM information_schema.mv_activities_cumulative MVAC
JOIN information_schema.mv_queries MVQ ON MVQ.activity_name = MVAC.activity_name
WHERE MVAC.activity_name like '%RunPipeline%' ORDER BY avg_elapsed_time_ms DESC LIMIT 10;
```

## Track Pipeline Resource Usage with `activity_tracking_id`

The `activity_tracking_id` is an identifier that is used to tag queries and activities in a session so a user can track the combined resource consumption of those activities.

The `activity_tracking_id` is automatically set for all pipeline activities and can be used to identify and track pipeline resource consumption by querying multiple information schema views. This ID enables tracking of:

* CPU time and CPU wait time
* Memory usage (byte-seconds)
* Elapsed time
* Disk I/O
* Network usage

Unless explicitly specified otherwise, the value of `activity_tracking_id` for pipeline queries are as follows:

* If the query is directly run by the pipeline, its activity\_tracking\_id is defined as `PI_<PIPELINE_ID>`.
* If the pipeline calls a stored procedure, the query run by the stored procedure has `activity_tracking_id` defined as `PR_PI_<PIPELINE_ID>`.

## Relevant Information Schema Views

The `activity_tracking_id` column is available in the following information schema views:

| **View**                            | **Description**                                  | **Primary Use Case**               |
| ----------------------------------- | ------------------------------------------------ | ---------------------------------- |
| `MV_ACTIVITIES`                     | Recent resource usage over a defined time period | Real-time monitoring               |
| `MV_ACTIVITIES_EXTENDED`            | Extended statistics with additional metrics      | Detailed performance analysis      |
| `MV_ACTIVITIES_CUMULATIVE`          | Cumulative statistics since system startup       | Historical trend analysis          |
| `MV_ACTIVITIES_EXTENDED_CUMULATIVE` | Extended cumulative statistics                   | Comprehensive historical reporting |
| `MV_TASKS`                          | Currently executing tasks                        | Active pipeline monitoring         |
| `MV_FINISHED_TASKS`                 | Statistics for completed tasks                   | Average resource usage calculation |

## Query Pipeline Resource Usage

**Get CPU and Memory Usage for Pipelines**

The following example shows how to track CPU time and memory usage for all pipelines.

```sql
SELECT                                                                                                                                                                                                                                             
      activity_tracking_id,                                                                                                                                                                                                                          
      activity_name,                                                                                                                                                                                                                                 
      database_name,                                                                                                                                                                                                                                 
      cpu_time_ms,                                                                                                                                                                                                                                   
      (cpu_time_ms / (success_count + failure_count)) AS avg_cpu_time_ms,                                                                                                                                                                            
      ROUND((memory_bs / 1024 / 1024 / 1024), 2) AS memory_gb_seconds,                                                                                                                                                                               
      ROUND((1000 * (memory_bs / 1024 / 1024) / elapsed_time_ms), 2) AS avg_memory_mb,                                                                                                                                                               
      success_count,                                                                                                                                                                                                                                 
      failure_count                                                                                                                                                                                                                                  
  FROM information_schema.MV_ACTIVITIES_CUMULATIVE                                                                                                                                                                                                   
  WHERE activity_tracking_id LIKE '%PI%'                                                                                                                                                                                                             
  ORDER BY cpu_time_ms DESC; 
```

**Track Specific Pipeline Resource Usage over Time**

The following example shows how to monitor a specific pipeline. Use the pipeline ID in the `activity_tracking_id` filter.

```sql
SELECT                                                                                                                                                                                                                                             
      node_id,                                                                                                                                                                                                                                       
      activity_name,                                                                                                                                                                                                                                 
      cpu_time_ms,                                                                                                                                                                                                                                   
      cpu_wait_time_ms,                                                                                                                                                                                                                              
      elapsed_time_ms,                                                                                                                                                                                                                               
      ROUND((memory_bs / 1024 / 1024), 2) AS memory_mb_seconds,                                                                                                                                                                                      
      disk_time_ms,                                                                                                                                                                                                                                  
      network_time_ms,                                                                                                                                                                                                                               
      success_count,                                                                                                                                                                                                                                 
      last_finished_timestamp                                                                                                                                                                                                                        
  FROM information_schema.MV_FINISHED_TASKS                                                                                                                                                                                                          
  WHERE activity_tracking_id = 'PI_12345';

```

Replace `12345` with actual pipeline ID.

**Monitor Currently Running Pipelines**

The following example shows how to monitor active pipelines and their current resource consumption.

```sql
SELECT                                                                                                                                                                                                                                             
      activity_tracking_id,                                                                                                                                                                                                                          
      activity_name,                                                                                                                                                                                                                                 
      database_name,                                                                                                                                                                                                                                 
      cpu_time_ms,                                                                                                                                                                                                                                   
      ROUND((memory_bs / 1024 / 1024), 2) AS memory_mb_seconds,                                                                                                                                                                                      
      elapsed_time_ms                                                                                                                                                                                                                                
  FROM information_schema.MV_TASKS                                                                                                                                                                                                                   
  WHERE activity_tracking_id LIKE '%PI%';
```

## Address Slow Performance and High Memory Usage

Performing the following tasks may improve the pipeline performance and reduce memory usage.

* Decrease the maximum number of partitions that a batch can utilize using the [ALTER PIPELINE SET MAX\_PARTITIONS\_PER\_BATCH](https://docs.singlestore.com/db/v9.1/reference/sql-reference/pipelines-commands/alter-pipeline/#UUID-2be7ea6a-a0ff-014f-ec65-1facc5e11999.md) command. Decreasing the maximum number of partitions will limit parallelism for large data sources.
* Increase the batch interval of an existing pipeline using the [ALTER PIPELINE SET BATCH\_INTERVAL](https://docs.singlestore.com/db/v9.1/reference/sql-reference/pipelines-commands/alter-pipeline/#UUID-659a9b16-fc5a-4c72-b515-692f56aa754c.md) command. If you increase the batch interval, data will be loaded from the data source less frequently.
* Break large files into smaller chunks to increase the ingestion speed. Large files get parsed on a single leaf, which increases the ingestion time. The recommended file size for smaller chunks is 100 MB.
* Add more leaves to your cluster, and then rebalance the cluster, which will result in fewer numbers of partitions per leaf.
* Are you using a stored procedure with your pipeline? If so, the stored procedure implementation can significantly affect the memory usage and pipeline performance. For details, refer to [Writing Efficient Stored Procedures for Pipelines](https://docs.singlestore.com/db/v9.1/load-data/about-singlestore-pipelines/pipeline-concepts/writing-efficient-stored-procedures-for-pipelines.md).
* For low parallelism pipelines, such as single file S3 loads and single partition Kafka topics, you can use an aggregator pipeline by running the `CREATE AGGREGATOR PIPELINE` command (an aggregator pipeline is not required for single-partition Kafka topics). If you are using an existing non-aggregator pipeline, you cannot convert it directly to an aggregator pipeline. Instead, drop your existing pipeline and recreate it as an aggregator pipeline.
  > **⚠️ Warning**: If a pipeline is dropped and recreated, the pipeline will start reading from the earliest offset in the data source, once the pipeline is started. This may cause duplicate records to be inserted into the destination table(s).
* Have queries or stored procedures that you normally run changed recently? If so, check these queries or stored procedures for changes that could be contributing to slow performance and high memory usage.
* Do you have any uncommitted transactions? These are transactions where `START TRANSACTION` is run, but neither `COMMIT` nor `ROLLBACK` was run afterwards. For information on resolving issues with uncommitted transactions, refer to Uncommitted Transactions section in [ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction](https://docs.singlestore.com/db/v9.1/reference/troubleshooting-reference/query-errors/error-1205-hy-000-lock-wait-timeout-exceeded-try-restarting-transaction.md).
* Using the `IGNORE` or `SKIP ALL ERRORS` options in the `CREATE PIPELINE ... LOAD DATA ...` statement may cause the pipeline to run slowly, because these options write large numbers of warnings to the logs. For `IGNORE` and `SKIP ALL ERRORS` alternatives, see [LOAD DATA](https://docs.singlestore.com/db/v9.1/reference/sql-reference/data-manipulation-language-dml/load-data.md).
* If possible, avoid using the `ENCLOSED BY` and `OPTIONALLY ENCLOSED BY` options in your `CREATE PIPELINE ... LOAD DATA ...` statement.
* If your pipeline uses a [transform](https://docs.singlestore.com/db/v9.1/load-data/about-singlestore-pipelines/pipeline-concepts/writing-a-transform-to-use-with-a-pipeline.md), and implements the transform using an interpreted language such as Python, consider using a compiled language to improve the performance.
* If your pipeline uses a transform, and it is slow, you may be able to use a `CREATE PIPELINE ... SET ... WHERE` statement instead of using a transform.
* Check if removing or moving files that were already processed by the pipeline improves pipeline performance. To perform this check, follow these steps:

  1. Run:
     ```sql
     SELECT database_name, pipeline_name, batch_start_unix_timestamp, batch_time - max(batch_partition_time) AS batch_overhead
     FROM information_schema.PIPELINES_BATCHES
     GROUP BY database_name, pipeline_name, batch_id
     ORDER BY database_name, pipeline_name, batch_id;
     ```

  2. If `batch_overhead` is increasing as `batch_start_unix_timestamp` increases, try the following steps:

     1. Run `SELECT * FROM information_schema.PIPELINES_FILES;`.

     2. For rows in the output that have the value `LOADED` in the `FILE_STATE` column, move or the delete the corresponding files in the `FILE_NAME` column.
* When using an FS, S3, Azure, or GCS pipeline, over time the `md_extractors_offsets` table starts utilising more memory, eventually leading to Out-of-Memory errors. To clear the data in this table, you can use the optional clause `ENABLE OFFSETS METADATA GC`.

  By default, the pipeline garbage collector (GC) for FS, S3, Azure, or GCS is not enabled. `ENABLE OFFSETS METADATA GC` must be added to the `CREATE PIPELINE` query block to enable pipeline garbage collection on new pipelines.
  > **📝 Note**: - Modified files get reloaded. If you update a file’s content in place using the same path, GC eventually removes its metadata, and the pipeline ingests the entire file again. If your workflow modifies existing files, do not enable GC.
  > - Old files may be skipped. Files with `LastModified` timestamps older than the GC threshold are filtered out and never ingested. This is expected behavior, as GC assumes older files have already been processed.
  > - Moved files may cause inconsistencies. Metadata GC uses file timestamps. If old files are moved into the directory instead of creating new ones, timestamps may not update correctly, which can cause files to be incorrectly ignored or loaded twice. Ensure your file management process accounts for this.
  > - Workflow queries may miss entries. If you rely on checking `file_state = 'Loaded'` for specific files in `pipelines_files`, a GC’d file will not have a corresponding row. The absence of a row for an old file indicates that it was loaded.
  To enable pipeline garbage collection on an existing pipeline, use the `ALTER PIPELINE` statement with the `ENABLE OFFSETS METADATA GC` clause. To check the memory usage, use the following query:
  ```sql
  SELECT * FROM information_schema.INTERNAL_TABLE_STATISTICS WHERE table_name LIKE "md_extractors_offsets" ORDER BY memory_use DESC;

  ```
  ```
  +---------------+-----------------------+---------+-----------+------+------------+----------------+------+------------+-------------------+----------------+
  | DATABASE_NAME | TABLE_NAME            | ORDINAL | HOST      | PORT | NODE_TYPE  | PARTITION_TYPE | ROWS | MEMORY_USE | STORAGE_TYPE      | ROWS_IN_MEMORY |
  +---------------+-----------------------+---------+-----------+------+------------+----------------+------+------------+-------------------+----------------+
  | ticket_test   | md_extractors_offsets |    NULL | 127.0.0.1 | 3306 | Aggregator | Reference      |    2 |     524544 | INTERNAL_METADATA |              2 |
  | ticket_test   | md_extractors_offsets |    NULL | 127.0.0.1 | 3307 | Leaf       | Reference      |    2 |     524544 | INTERNAL_METADATA |              2 |
  | ticket_test   | md_extractors_offsets |       1 | 127.0.0.1 | 3307 | Leaf       | Master         |    0 |          0 | INTERNAL_METADATA |              0 |
  | ticket_test   | md_extractors_offsets |       7 | 127.0.0.1 | 3307 | Leaf       | Master         |    0 |          0 | INTERNAL_METADATA |              0 |
  | ticket_test   | md_extractors_offsets |       6 | 127.0.0.1 | 3307 | Leaf       | Master         |    0 |          0 | INTERNAL_METADATA |              0 |
  | ticket_test   | md_extractors_offsets |       5 | 127.0.0.1 | 3307 | Leaf       | Master         |    0 |          0 | INTERNAL_METADATA |              0 |
  | ticket_test   | md_extractors_offsets |       4 | 127.0.0.1 | 3307 | Leaf       | Master         |    0 |          0 | INTERNAL_METADATA |              0 |
  | ticket_test   | md_extractors_offsets |       3 | 127.0.0.1 | 3307 | Leaf       | Master         |    0 |          0 | INTERNAL_METADATA |              0 |
  | ticket_test   | md_extractors_offsets |       2 | 127.0.0.1 | 3307 | Leaf       | Master         |    0 |          0 | INTERNAL_METADATA |              0 |
  | ticket_test   | md_extractors_offsets |       0 | 127.0.0.1 | 3307 | Leaf       | Master         |    0 |          0 | INTERNAL_METADATA |              0 |
  +---------------+-----------------------+---------+-----------+------+------------+----------------+------+------------+-------------------+----------------+
  ```

## Clear the `information_schema.PIPELINE_FILES` View

You can clear the data in the `information_schema.PIPELINE_FILES` table. It stores data for the duration of a pipeline’s lifetime, resulting in a large number of files being referenced in the view. If the data in this view is not required, you can:

1. To drop a single file and its file record from the `information_schema.PIPELINES_FILES` table, use the [ALTER PIPELINE ... DROP FILE](https://docs.singlestore.com/db/v9.1/reference/sql-reference/pipelines-commands/alter-pipeline/#UUID-41e7c9a8-7cb0-07b8-27ad-eacbd8855998.md) command. For example:
   ```sql
   ALTER PIPELINE <your_pipeline> DROP FILE 'pipeline_files';
   ```

2. To drop all of the files associated with a pipeline, you can drop and recreate the pipeline. This clears all the file records associated with the pipeline. However, the metadata related to the pipeline will also be deleted. If you do need the metadata, copy it to a separate table before recreating the pipeline. To recreate the pipeline:

   1. Inspect and note the pipeline settings to use when recreating the pipeline.
      ```sql
      SHOW CREATE PIPELINE <your_pipeline> EXTENDED;
      ```

   2. Stop the pipeline.
      ```sql
      STOP PIPELINE <your_pipeline>;
      ```

   3. Drop the pipeline.
      ```sql
      DROP PIPELINE <your_pipeline>;
      ```

   4. Recreate the pipeline using the required configuration options.

   5. Start the pipeline.
      ```sql
      START PIPELINE <your_pipeline>;
      ```

***

Modified at: April 21, 2026

Source: [/db/v9.1/load-data/about-singlestore-pipelines/pipeline-troubleshooting/troubleshoot-pipeline-performance-and-memory-usage/](https://docs.singlestore.com/db/v9.1/load-data/about-singlestore-pipelines/pipeline-troubleshooting/troubleshoot-pipeline-performance-and-memory-usage/)

(An index of the documentation is available at /llms.txt)
