View the Dashboards

When all cluster monitoring components are installed, configured, and running, the Grafana dashboards can be used to monitor SingleStoreDB cluster health over time.

Each dashboard provides insights that can be used to identify trends that may require intervention, including:

Cluster View

Chart Name

What it shows

When to use it

CPU Utilization

The percentage of the host’s CPU that is being used:

  • max single-core load: The maximum CPU load across all of the available CPU cores

  • avg core load: The average CPU load across all of the available CPU cores

  • min single-core load: The minimum CPU load across all of the available CPU cores

To understand CPU usage and host resource usage in general, or for a given workload.
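As a rough illustration of how the three CPU series relate to each other, the sketch below derives them from a list of per-core load percentages. The function name and the sample values are hypothetical and are not taken from the dashboard's implementation.

```python
from statistics import mean


def summarize_core_load(core_loads):
    """Return the max, average, and min load across per-core CPU percentages."""
    if not core_loads:
        raise ValueError("core_loads must contain at least one value")
    return {
        "max_single_core_load": max(core_loads),
        "avg_core_load": mean(core_loads),
        "min_single_core_load": min(core_loads),
    }


# Hypothetical per-core load percentages for a 4-core host.
summary = summarize_core_load([92.0, 40.0, 35.0, 33.0])
print(summary)
```

A wide gap between the max and avg values may suggest that work is concentrated on a small number of cores.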

Memory Utilization

The percentage of the host’s memory that is being used.

To understand host memory usage for a given workload over time.

Local Disk Utilization

The local disk utilization for the cluster.

Total storage can be managed by dropping obsolete tables and/or purging older data in large tables.

To identify the amount of warm data so preventive actions can be taken to better handle the load, thereby ensuring stability and optimum performance.

Read/Write Queries per Second

The number of read and write queries executed per second on the system

To understand typical (“normal”) cluster activity to benchmark workloads and their query rate and identify anomalies in the read/write workload.

If the number of rows read or written is very high or uneven, it could indicate that some queries or operations are taking longer to process than others. This can be due to poor indexing, inefficient queries, or database design issues.

Failed Read/Write Queries per Second

The number of read and write queries that failed per second on the system

Rows Read or Written

The number of rows read/written

Network Received or Sent Bytes

The network bytes sent and received

To understand network usage for a given workload, identify bottlenecks, and determine if any non-SingleStoreDB activity is affecting a host’s network.

Execution Time per Read/Write Query

The elapsed time of each read/write query

To identify changes in the pattern of execution time per read/write query from the historical norm. This may indicate an issue with an application or changes in your workload.

Threads - Connected

The number of open connections (threads_connected) to the database relative to the maximum limit (max_connections)

To identify if the database is approaching the maximum allowed connections, which is indicated by a utilization near 100%. This can potentially lead to performance issues, as queries may need to wait in a queue until threads become available to process them.

Threads - Running

The number of threads that are actively running queries (threads_running) relative to the maximum limit (max_connection_threads)

To identify if the system is approaching its capacity with regard to the number of queries that can be executed in parallel, which is indicated by a utilization near 100%. This can potentially lead to resource pressure, system unresponsiveness, latency spikes, and eventual failures.
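Both thread charts above express a current count relative to a configured limit. A minimal sketch of that ratio follows, assuming the values of threads_connected, max_connections, threads_running, and max_connection_threads have already been read from the cluster. The function name, the sample numbers, and the 90% warning threshold are hypothetical.

```python
def utilization_percent(current, limit):
    """Return current as a percentage of limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return 100.0 * current / limit


# Hypothetical readings; real values come from the cluster's status and variables.
connected_pct = utilization_percent(current=450, limit=500)  # threads_connected / max_connections
running_pct = utilization_percent(current=48, limit=192)     # threads_running / max_connection_threads

# Assumed warning threshold for illustration only, not a product default.
WARN_AT = 90.0
for name, pct in (("connected", connected_pct), ("running", running_pct)):
    status = "near limit" if pct >= WARN_AT else "ok"
    print(f"threads {name}: {pct:.1f}% ({status})")
```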

Detailed Cluster View by Node

Chart Name

What it shows

When to use it

All Metrics in Cluster View Reported per Node

All metrics in cluster view reported per node

To understand resource utilization of a node and to ensure the workload is evenly distributed across nodes for ideal performance.

Historical Workload Monitoring

Chart Name

What it shows

When to use it

CPU time by Database

The CPU time spent by each query activity, grouped by database

To identify which databases incur the most CPU usage. Note: A blank database indicates system activity that is not related to a user database.

Execution Count

The number of queries executed in a given time period

To perform capacity planning for workloads and identify if workloads in general, or workload spikes in particular, are putting the cluster at risk of running out of memory.

Metrics by Query Plan

The queries executed and their relative resource consumption

To identify which queries are expensive, including how long queries take to complete, their CPU times, and their failure rates.

Memory Usage

Chart Name

What it shows

When to use it

Total Memory Used vs. Total Limit

The memory in use compared to the total memory available (megabytes)

To perform capacity planning for memory and identify if the cluster is not performing optimally due to a shortage of memory.

Query Memory vs. Total Limit

The query memory in use compared to the total memory available (megabytes)

To perform capacity planning for workloads and identify if workloads in general, or workload spikes in particular, are putting the cluster at risk of running out of memory.

Data Memory Used vs. Total Limit

The data memory in use versus the total memory available (megabytes)

To perform capacity planning for data memory and identify if given write workloads are putting the cluster at risk of running out of memory.

Internal Memory Allocators vs. Limit

The memory used by SingleStoreDB memory allocators (megabytes)

To identify why memory allocations have increased, or are anomalously large, when there are no other indicators of increased memory use, such as workload or data, and to discover where memory is allocated (table, query, etc.).

Disk Monitoring

Note: Requires SingleStoreDB version 8.0.23 or later.

Chart Name

What it shows

When to use it

Disk Utilization

Disk utilization for the cluster.

Total storage can be managed by dropping obsolete tables and/or purging older data in large tables.

To identify the quantity of warm data so preventive action can be taken to better handle the load, thereby ensuring stability and optimum performance.

Distribution of Components Using Disk

Distribution of disk utilization by data, plancache, auditlogs, and tracelogs

To understand how the disk is being utilized.

Analyzing disk usage can reveal if certain artifacts (such as data, plancache, audit logs, or trace logs) are consuming an excessive amount of space. This can either cause performance issues, or require additional resources to maintain optimal operation.

Monitoring disk usage and activity can also help identify performance bottlenecks, which may require disk policies to be adjusted, additional resources to be allocated, and/or your workload to be optimized.

Breakdown of Disk Utilization by Data

Disk consumption breakdown by "Data" category.

The sum of the utilization across blobs, transaction logs, snapshots, temp blobs, etc. equals the total disk utilized by "Data."

Distribution of Databases Using Disk

Distribution of disk utilization by databases.

The sum of the utilization across all databases equals the total disk utilized by "Data."
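The additive relationship described for the two breakdown charts can be sketched as a simple sum with per-category shares. The category names follow the text above, while the function name and the sizes (in megabytes) are hypothetical. A real breakdown may contain further categories.

```python
def summarize_data_disk(breakdown_mb):
    """Return the total and each category's percentage share of "Data" disk usage."""
    total_mb = sum(breakdown_mb.values())
    if total_mb == 0:
        return 0, {}
    shares = {name: 100.0 * size / total_mb for name, size in breakdown_mb.items()}
    return total_mb, shares


# Hypothetical sizes in megabytes.
breakdown_mb = {
    "blobs": 5000,
    "transaction_logs": 1000,
    "snapshots": 3500,
    "temp_blobs": 500,
}
total_mb, shares = summarize_data_disk(breakdown_mb)
print(total_mb, shares)
```

The same function applies to a per-database breakdown, since the per-database values also add up to the "Data" total.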

Blob Cache Downloaded per Second (by Database)

Rate at which the blob cache is downloading files from remote storage.

To understand how SingleStoreDB's blob cache is performing.

By understanding and monitoring the rate at which the blob cache is downloading files from remote storage, potential performance bottlenecks and/or issues related to blob cache activity can be identified.

For example, if high download rates are observed relative to the size of your database and scale of your hardware, you may consider increasing the local cache size.

Regularly reviewing this metric can help you make well-informed decisions for optimizing the performance of SingleStoreDB.

Blob Cache Evicted per Second (by Database)

Rate at which the blob cache is evicting files.

To understand how SingleStoreDB’s blob cache is performing.

By understanding and monitoring the rate at which the blob cache is evicting files, system resource utilization can be optimized based on your data management needs.

A high eviction rate may indicate that the cache size is insufficient, or that your workload is imposing a high cache turnover. To improve overall cluster performance, consider reviewing data access patterns and adjusting the cache size.

Regularly reviewing this metric can help you identify potential performance bottlenecks and make well-informed decisions for optimizing the performance of SingleStoreDB.
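The per-second blob cache rates above can be approximated from two samples of a cumulative byte counter. This sketch assumes the underlying metric is a monotonically increasing counter that may reset to zero. That is an assumption about the data source, and the function name and sample values are hypothetical.

```python
def per_second_rate(previous_bytes, current_bytes, interval_seconds):
    """Return the average bytes per second between two cumulative counter samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = current_bytes - previous_bytes
    if delta < 0:
        # Assume the counter was reset between samples (for example, after a restart).
        delta = current_bytes
    return delta / interval_seconds


# Hypothetical samples taken 15 seconds apart.
download_rate = per_second_rate(1_000_000, 4_000_000, 15)
print(f"{download_rate:.0f} bytes/s")
```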

Pipeline Dashboards

Note: Requires SingleStoreDB version 8.0.25 or later.

Pipeline Summary

Chart Name

What it shows

When to use it

State Distribution

A high-level overview of all pipelines, including the number of pipelines in running, stopped, and error states, and the percentage of each.

To identify potential issues by comparing the number of running pipelines to those that have either stopped or produced an error.

Historical Pipeline State

The state of all pipelines over a period of time.

To identify potential issues by examining how a pipeline behaves over time.

Summary

The current state of all pipelines.

To identify which pipelines are currently running, stopped, or in an errored state along with their associated database.
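The counts and percentages shown in State Distribution can be reproduced from a list of pipeline states, as in the sketch below. The state labels mirror the ones named above, while the function name and the sample list are hypothetical.

```python
from collections import Counter


def state_distribution(states):
    """Map each state to a (count, percentage of all pipelines) pair."""
    total = len(states)
    if total == 0:
        return {}
    counts = Counter(states)
    return {state: (count, 100.0 * count / total) for state, count in counts.items()}


# Hypothetical states for ten pipelines.
states = ["running"] * 6 + ["stopped"] * 3 + ["error"] * 1
distribution = state_distribution(states)
for state, (count, pct) in distribution.items():
    print(f"{state}: {count} ({pct:.0f}%)")
```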

Pipeline Performance

Chart Name

What it shows

When to use it

Execution Count

The total number of executions that have run in a pipeline.

To estimate how frequently a query is executed in a pipeline and whether a query has failed. Useful for optimizing queries.

Avg CPU Time per Execution

The average CPU time for each execution in a pipeline.

To identify which pipelines are consuming excessive CPU cycles.

Avg Elapsed Time per Execution

The average elapsed time for each execution in a pipeline.

To identify which pipelines are experiencing degraded performance over time.

Avg I/O per Execution

The average disk I/O (number of bytes that SingleStoreDB read and wrote to the filesystem or the in-memory transaction log) per execution in a pipeline.

To identify if a pipeline is experiencing I/O-related performance issues (typically when this value is consistently high).

Avg Memory Use per Execution

The average memory usage per execution in a pipeline.

To identify which pipelines are exhibiting excessive memory use.

Avg Network Bytes per Execution

The average network bytes per execution in a pipeline.

To identify which pipelines are experiencing degraded performance due to network constraints.

Pipeline Errors

Which pipelines have produced an error, including the pipeline name, error ID, error code, error message, and the time the error occurred.

To identify and troubleshoot pipelines that have produced an error.
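The "Avg ... per Execution" charts above can be read as a total divided by an execution count. The sketch below assumes that interpretation and ranks pipelines by average CPU time. The pipeline names, the totals, and the function name are hypothetical.

```python
def avg_per_execution(total, execution_count):
    """Return the average of a total over executions, or 0.0 if nothing has run."""
    if execution_count == 0:
        return 0.0
    return total / execution_count


# Hypothetical (total CPU time in ms, execution count) per pipeline.
pipelines = {
    "orders_pipeline": (12_000, 400),
    "clicks_pipeline": (90_000, 1_000),
    "idle_pipeline": (0, 0),
}

averages = {name: avg_per_execution(cpu_ms, runs) for name, (cpu_ms, runs) in pipelines.items()}

# Sort so the pipeline with the highest average CPU time per execution comes first.
ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
for name, avg_ms in ranked:
    print(f"{name}: {avg_ms:.1f} ms per execution")
```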

Last modified: November 9, 2023
