# Components to Monitor

## SingleStore Components to Monitor

SingleStore is a distributed system. Because of this, all machines running SingleStore should be monitored to ensure smooth operation. This section demonstrates how to programmatically monitor a SingleStore cluster.

## Monitor OS resources

If you are using third-party monitoring tools, make sure to monitor the following resources on each machine in the SingleStore cluster:

* CPU Usage
* CPU Load Average
* Memory Usage
* Memory Paging (page ins, page outs)
* Disk Utilization
* Disk Queue Time
* Network Usage
* Dropped packets / TCP retransmits

> **📝 Note**: Paging refers to a technique that Linux and other operating systems use to deal with high memory usage. If your system is consistently paging, you should add more memory capacity or you will experience severe performance degradation.
>
> When the operating system predicts that it will require more memory than it has physically available, it will move infrequently accessed pages of memory out of RAM and onto the disk to make room for more frequently accessed memory. When this memory is used later by a process, the process must wait for the page to be read off disk and into RAM. If memory used by SingleStore is moved to disk, the latency of queries that access that memory will be substantially increased.
>
> You can measure paging on the command line by running `vmstat 1` and looking at the `swap` section (`si` and `so` refer to paging memory off the disk and into RAM and out of RAM and onto disk, respectively).
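
The paging check described in the note can be automated. The following is a minimal sketch, assuming the two-line header that `vmstat` prints on Linux. The function names and the sample text are illustrative and are not part of SingleStore.

```python
def parse_vmstat_swap(vmstat_output):
    """Return a list of (si, so) integer pairs, one per sample row of `vmstat` output."""
    si_idx = so_idx = None
    samples = []
    for line in vmstat_output.splitlines():
        fields = line.split()
        if "si" in fields and "so" in fields:
            # Column-name header row: remember where the swap columns are.
            si_idx, so_idx = fields.index("si"), fields.index("so")
            continue
        if si_idx is None or not fields or not fields[0].isdigit():
            # Skip the banner row and anything before the first header.
            continue
        samples.append((int(fields[si_idx]), int(fields[so_idx])))
    return samples


def is_paging(samples):
    """True if any sample shows pages moving into or out of swap."""
    return any(si > 0 or so > 0 for si, so in samples)


# Illustrative text in the layout produced by `vmstat 1` (values are made up).
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 812340  10240 402112    0    0     3     9  120  240  2  1 97  0  0
 2  0   5120 102400  10240 402112   64  128    70   140  310  520 12  4 80  4  0
"""

samples = parse_vmstat_swap(SAMPLE)
print(samples, is_paging(samples))
```

Because the first row that `vmstat` reports is an average since boot, a monitoring script may want to ignore it and alert only on sustained non-zero values in later rows.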

## Monitor the memsqld process

During deployment, SingleStore server binaries are installed on each host, where the server is then used to run multiple nodes.

Each node consists of two instances of the primary `memsqld` process, one of which handles network requests (typically listening on port `3306`), while the other listens on a local Unix socket.

If only one `memsqld` process is running for a given node, this could indicate that the command process has stopped running. Refer to [SingleStore Server Config Files](https://docs.singlestore.com/db/v9.1/reference/configuration-reference/cluster-config-files/singlestore-server-config-files.md) for more information.
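
One way to check the process count on a host is to count `memsqld` entries in the output of `ps -eo pid,comm`. The sketch below assumes two `memsqld` processes per node, as described above, and that the number of nodes expected on the host is known. The function names and the sample `ps` text are hypothetical.

```python
def count_memsqld(ps_output):
    """Count processes whose command name is exactly `memsqld`.

    `ps_output` is the text printed by `ps -eo pid,comm`.
    """
    count = 0
    for line in ps_output.splitlines()[1:]:  # skip the "PID COMMAND" header
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[1].strip() == "memsqld":
            count += 1
    return count


def missing_processes(ps_output, expected_nodes):
    """Return how many memsqld processes are missing on this host.

    Assumes two memsqld processes per node.
    """
    return max(0, 2 * expected_nodes - count_memsqld(ps_output))


# Made-up `ps -eo pid,comm` output for a host that should run two nodes.
SAMPLE_PS = """\
  PID COMMAND
 1201 memsqld
 1202 memsqld
 1310 memsqld
 2044 sshd
"""

print(count_memsqld(SAMPLE_PS), missing_processes(SAMPLE_PS, expected_nodes=2))
```

On a live host, the input text could be collected with `subprocess.run(["ps", "-eo", "pid,comm"], capture_output=True, text=True).stdout`.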

## Monitor cluster status through MV\_CLUSTER\_STATUS table

To know the status of the databases on your cluster, as well as information about the nodes in your cluster, query the `information_schema.MV_CLUSTER_STATUS` table from an aggregator. You can also access this table through `SHOW CLUSTER STATUS`; however, querying the table provides the advantage of being able to join against it.

## Table description

| Field                     | Data Type (Size) | Description                                               | Example Value                                    |
| ------------------------- | ---------------- | --------------------------------------------------------- | ------------------------------------------------ |
| `NODE_ID`                 | bigint(10)       | ID of node                                                | 1                                                |
| `HOST`                    | varchar(512)     | Host of the node                                          | 127.0.0.1                                        |
| `PORT`                    | bigint(10)       | The port of the node                                      | 10000                                            |
| `DATABASE_NAME`           | varchar(512)     | Name of database                                          | vigilantia\_0\_AUTO\_REPLICA                     |
| `ROLE`                    | varchar(512)     | Database’s role (e.g. orphan, master, replica, reference) | master                                           |
| `STATE`                   | varchar(256)     | Database state                                            | replicating                                      |
| `POSITION`                | varchar(256)     | Position in transaction log                               | 0:8832                                           |
| `MASTER_HOST`             | varchar(256)     | Host of this node’s aggregator                            | 127.0.0.1                                        |
| `MASTER_PORT`             | bigint(10)       | Port of this node’s aggregator                            | 3304                                             |
| `METADATA_MASTER_NODE_ID` | bigint(10)       | Master’s node ID expected by metadata                     | 1                                                |
| `METADATA_MASTER_HOST`    | varchar(256)     | Master’s host expected by metadata                        | 127.0.0.1                                        |
| `METADATA_MASTER_PORT`    | bigint(10)       | Master’s port expected by metadata                        | 3306                                             |
| `METADATA_ROLE`           | varchar(512)     | Database’s role based on metadata                         | Orphan                                           |
| `DETAILS`                 | varchar(512)     | Extra details                                             | stage: packet wait, state: x\_streaming, err: no |

## Sample output

```sql
SELECT * FROM information_schema.MV_CLUSTER_STATUS;

```

```output

+---------+-----------+-------+---------------------------+---------------+-------------+----------+-------------+-------------+-------------------------+----------------------+----------------------+---------------+-------------------------------------------------+
| NODE_ID | HOST      | PORT  | DATABASE_NAME             | ROLE          | STATE       | POSITION | MASTER_HOST | MASTER_PORT | METADATA_MASTER_NODE_ID | METADATA_MASTER_HOST | METADATA_MASTER_PORT | METADATA_ROLE | DETAILS                                         |
+---------+-----------+-------+---------------------------+---------------+-------------+----------+-------------+-------------+-------------------------+----------------------+----------------------+---------------+-------------------------------------------------+
|       1 | 127.0.0.1 | 10000 | cluster                   | master        | online      | 0:46     | NULL        |        NULL |                    NULL | NULL                 |                 NULL | Reference     |                                                 |
|       1 | 127.0.0.1 | 10000 | monitoring                | master        | online      | 0:8832   | NULL        |        NULL |                    NULL | NULL                 |                 NULL | Reference     |                                                 |
|       1 | 127.0.0.1 | 10000 | vigilantia                | master        | online      | 0:24616  | NULL        |        NULL |                    NULL | NULL                 |                 NULL | Reference     |                                                 |
|       3 | 127.0.0.1 | 10001 | cluster                   | async replica | replicating | 0:45     | 127.0.0.1   |       10000 |                       1 | 127.0.0.1            |                10000 | Reference     | stage: packet wait, state: x_streaming, err: no |
|       3 | 127.0.0.1 | 10001 | monitoring                | sync replica  | replicating | 0:8832   | 127.0.0.1   |       10000 |                       1 | 127.0.0.1            |                10000 | Reference     |                                                 |
|       3 | 127.0.0.1 | 10001 | monitoring_0              | master        | online      | 0:58893  | NULL        |        NULL |                    NULL | NULL                 |                 NULL | Master        |                                                 |
|       3 | 127.0.0.1 | 10001 | monitoring_0_AUTO_REPLICA | async replica | replicating | 0:58893  | 127.0.0.1   |       10001 |                    NULL | NULL                 |                 NULL | Orphan        |                                                 |
|       3 | 127.0.0.1 | 10001 | monitoring_1              | master        | online      | 0:57439  | NULL        |        NULL |                    NULL | NULL                 |                 NULL | Master        |                                                 |
|       3 | 127.0.0.1 | 10001 | monitoring_1_AUTO_REPLICA | async replica | replicating | 0:57439  | 127.0.0.1   |       10001 |                    NULL | NULL                 |                 NULL | Orphan        |                                                 |
|       3 | 127.0.0.1 | 10001 | monitoring_2              | master        | online      | 0:49952  | NULL        |        NULL |                    NULL | NULL                 |                 NULL | Master        |                                                 |
|       3 | 127.0.0.1 | 10001 | monitoring_2_AUTO_REPLICA | async replica | replicating | 0:49952  | 127.0.0.1   |       10001 |                    NULL | NULL                 |                 NULL | Orphan        |                                                 |
|       3 | 127.0.0.1 | 10001 | vigilantia                | sync replica  | replicating | 0:24616  | 127.0.0.1   |       10000 |                       1 | 127.0.0.1            |                10000 | Reference     |                                                 |
|       3 | 127.0.0.1 | 10001 | vigilantia_0              | master        | online      | 0:25874  | NULL        |        NULL |                    NULL | NULL                 |                 NULL | Master        |                                                 |
|       3 | 127.0.0.1 | 10001 | vigilantia_0_AUTO_REPLICA | async replica | replicating | 0:25874  | 127.0.0.1   |       10001 |                    NULL | NULL                 |                 NULL | Orphan        |                                                 |
|       3 | 127.0.0.1 | 10001 | vigilantia_1              | master        | online      | 0:25874  | NULL        |        NULL |                    NULL | NULL                 |                 NULL | Master        |                                                 |
|       3 | 127.0.0.1 | 10001 | vigilantia_1_AUTO_REPLICA | async replica | replicating | 0:25874  | 127.0.0.1   |       10001 |                    NULL | NULL                 |                 NULL | Orphan        |                                                 |
|       3 | 127.0.0.1 | 10001 | vigilantia_2              | master        | online      | 0:25874  | NULL        |        NULL |                    NULL | NULL                 |                 NULL | Master        |                                                 |
|       3 | 127.0.0.1 | 10001 | vigilantia_2_AUTO_REPLICA | async replica | replicating | 0:25874  | 127.0.0.1   |       10001 |                    NULL | NULL                 |                 NULL | Orphan        |                                                 |
+---------+-----------+-------+---------------------------+---------------+-------------+----------+-------------+-------------+-------------------------+----------------------+----------------------+---------------+-------------------------------------------------+

```
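
Rows returned from `MV_CLUSTER_STATUS` can be checked in a script. The sketch below assumes each row has already been fetched as a dictionary keyed by column name (for example, from `SELECT NODE_ID, HOST, PORT, DATABASE_NAME, ROLE, STATE FROM information_schema.MV_CLUSTER_STATUS`). Treating only `online` and `replicating` as healthy is an assumption based on the sample output above, and the last sample row uses an illustrative state to exercise the check.

```python
# States treated as healthy here are the two that appear in the sample
# output; this set is an assumption, adjust it for your cluster.
HEALTHY_STATES = {"online", "replicating"}


def unhealthy_databases(rows):
    """Return (host, port, database, state) for rows outside HEALTHY_STATES."""
    return [
        (r["HOST"], r["PORT"], r["DATABASE_NAME"], r["STATE"])
        for r in rows
        if r["STATE"] not in HEALTHY_STATES
    ]


def role_counts(rows):
    """Count databases per (node id, role), e.g. how many masters each node holds."""
    counts = {}
    for r in rows:
        key = (r["NODE_ID"], r["ROLE"])
        counts[key] = counts.get(key, 0) + 1
    return counts


ROWS = [
    {"NODE_ID": 1, "HOST": "127.0.0.1", "PORT": 10000,
     "DATABASE_NAME": "vigilantia", "ROLE": "master", "STATE": "online"},
    {"NODE_ID": 3, "HOST": "127.0.0.1", "PORT": 10001,
     "DATABASE_NAME": "vigilantia_0", "ROLE": "master", "STATE": "online"},
    {"NODE_ID": 3, "HOST": "127.0.0.1", "PORT": 10001,
     "DATABASE_NAME": "vigilantia_0_AUTO_REPLICA", "ROLE": "async replica",
     "STATE": "replicating"},
    # Hypothetical row, not from the sample output above.
    {"NODE_ID": 3, "HOST": "127.0.0.1", "PORT": 10001,
     "DATABASE_NAME": "vigilantia_1", "ROLE": "master", "STATE": "recovering"},
]

print(unhealthy_databases(ROWS))
print(role_counts(ROWS))
```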

## Monitor cluster events through MV\_EVENTS table

The `information_schema.MV_EVENTS` table provides cluster-level event reporting that you can query as another way to monitor the health of your cluster. It returns events from the entire cluster and can only be queried from an aggregator. To monitor events from an individual leaf, connect to that leaf and query the `information_schema.LMV_EVENTS` table, which has exactly the same structure.

## Table description

| Field            | Data Type (Size) | Description                                                                              | Example Value              |
| ---------------- | ---------------- | ---------------------------------------------------------------------------------------- | -------------------------- |
| `ORIGIN_NODE_ID` | bigint(4)        | ID of node where the event happened.                                                     | 3                          |
| `EVENT_TIME`     | timestamp        | Timestamp when event occurred.                                                           | 2018-04-25 18:08:13        |
| `SEVERITY`       | varchar(512)     | Severity of the event. Can be one of the following values: `NOTICE`, `WARNING`, or `ERROR`. | NOTICE                     |
| `EVENT_TYPE`     | varchar(512)     | Type of event that occurred. See the section below for more details.                     | NODE\_ONLINE               |
| `DETAILS`        | varchar(512)     | Additional information about the event, in JSON format.                                  | `{"node":"172.18.0.2:3306"}` |
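
Because `DETAILS` is JSON and `SEVERITY` takes one of three values, event rows are straightforward to filter in a script. The following sketch assumes rows have been fetched as dictionaries keyed by the column names above. Ranking the severities as `NOTICE` < `WARNING` < `ERROR` is an assumption, and `events_at_least` is a hypothetical helper name.

```python
import json

# Assumed ordering of the three documented severities, least to most severe.
SEVERITY_RANK = {"NOTICE": 0, "WARNING": 1, "ERROR": 2}


def events_at_least(rows, min_severity):
    """Return events whose severity is >= min_severity, with DETAILS decoded from JSON."""
    threshold = SEVERITY_RANK[min_severity]
    selected = []
    for row in rows:
        if SEVERITY_RANK[row["SEVERITY"]] >= threshold:
            decoded = dict(row)  # copy so the input rows are left untouched
            decoded["DETAILS"] = json.loads(row["DETAILS"])
            selected.append(decoded)
    return selected


# Illustrative rows in the shape of the MV_EVENTS columns.
EVENT_ROWS = [
    {"ORIGIN_NODE_ID": 2, "EVENT_TIME": "2018-05-15 13:21:03", "SEVERITY": "NOTICE",
     "EVENT_TYPE": "NODE_ONLINE", "DETAILS": '{"node":"127.0.0.1:10001"}'},
    {"ORIGIN_NODE_ID": 3, "EVENT_TIME": "2018-05-15 13:21:15", "SEVERITY": "WARNING",
     "EVENT_TYPE": "NODE_DETACHED", "DETAILS": '{"node":"127.0.0.1:10002"}'},
    {"ORIGIN_NODE_ID": 2, "EVENT_TIME": "2018-05-15 13:21:25", "SEVERITY": "WARNING",
     "EVENT_TYPE": "NODE_OFFLINE", "DETAILS": '{"node":"127.0.0.1:10001"}'},
]

for event in events_at_least(EVENT_ROWS, "WARNING"):
    print(event["EVENT_TIME"], event["EVENT_TYPE"], event["DETAILS"]["node"])
```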

## Event type definitions

## Node events

| Event type       | Description                           |
| ---------------- | ------------------------------------- |
| `NODE_ONLINE`    | A node has come online                |
| `NODE_OFFLINE`   | A node has gone offline               |
| `NODE_ATTACHING` | A node is in the process of attaching |
| `NODE_DETACHED`  | A node has become detached            |

**Details output**

| Variable | Value           | Description     |
| -------- | --------------- | --------------- |
| `node`   | “Hostname:port” | Address of node |
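
Since every node event carries the node address in `DETAILS`, the most recent event per node gives a rough view of each node's current status. The sketch below assumes rows are dictionaries with `EVENT_TIME`, `EVENT_TYPE`, and `DETAILS` keys and that `EVENT_TIME` is a `YYYY-MM-DD HH:MM:SS` string, which sorts chronologically as text. `latest_node_state` is a hypothetical helper.

```python
import json

NODE_EVENTS = {"NODE_ONLINE", "NODE_OFFLINE", "NODE_ATTACHING", "NODE_DETACHED"}


def latest_node_state(rows):
    """Map each node address to the most recent node event type seen for it."""
    state = {}
    for row in sorted(rows, key=lambda r: r["EVENT_TIME"]):
        if row["EVENT_TYPE"] in NODE_EVENTS:
            node = json.loads(row["DETAILS"])["node"]
            state[node] = row["EVENT_TYPE"]  # later events overwrite earlier ones
    return state


# Illustrative rows, deliberately out of order to show that sorting is applied.
NODE_ROWS = [
    {"EVENT_TIME": "2018-05-15 13:21:25", "EVENT_TYPE": "NODE_OFFLINE",
     "DETAILS": '{"node":"127.0.0.1:10001"}'},
    {"EVENT_TIME": "2018-05-15 13:21:03", "EVENT_TYPE": "NODE_ONLINE",
     "DETAILS": '{"node":"127.0.0.1:10001"}'},
    {"EVENT_TIME": "2018-05-15 13:21:15", "EVENT_TYPE": "NODE_DETACHED",
     "DETAILS": '{"node":"127.0.0.1:10002"}'},
    {"EVENT_TIME": "2018-05-15 13:21:16", "EVENT_TYPE": "NODE_ATTACHING",
     "DETAILS": '{"node":"127.0.0.1:10002"}'},
    {"EVENT_TIME": "2018-05-15 13:21:22", "EVENT_TYPE": "NODE_ONLINE",
     "DETAILS": '{"node":"127.0.0.1:10002"}'},
]

print(latest_node_state(NODE_ROWS))
```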

## Rebalance events

| Event type           | Description                       |
| -------------------- | --------------------------------- |
| `REBALANCE_STARTED`  | A partition rebalance has started |
| `REBALANCE_FINISHED` | A partition rebalance has ended   |

**Details output**

| Variable         | Value                      | Description                                          |
| ---------------- | -------------------------- | ---------------------------------------------------- |
| `database`       | “database\_name or (null)” | Database being rebalanced (truncated to 80 characters) |
| `user_initiated` | “true/false”               | Whether the rebalance was initiated by the user or by the cluster |
| `success`        | “true/false”               | If the rebalance succeeded or failed                 |
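
Pairing each `REBALANCE_STARTED` event with the following `REBALANCE_FINISHED` event for the same database gives a rough rebalance duration. This is a sketch under a few assumptions: rows are dictionaries keyed by column name, the `database` key is lowercase as it appears in the `MV_EVENTS` sample output on this page, and events with identical timestamps arrive in the order they were recorded. `rebalance_durations` is a hypothetical helper.

```python
import json
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def rebalance_durations(rows):
    """Pair each REBALANCE_STARTED with the next REBALANCE_FINISHED for the same database.

    Returns a list of (database, seconds) tuples.
    """
    started = {}
    durations = []
    # sorted() is stable, so rows with equal timestamps keep their input order.
    for row in sorted(rows, key=lambda r: r["EVENT_TIME"]):
        if row["EVENT_TYPE"] not in ("REBALANCE_STARTED", "REBALANCE_FINISHED"):
            continue
        database = json.loads(row["DETAILS"])["database"]
        when = datetime.strptime(row["EVENT_TIME"], TIME_FORMAT)
        if row["EVENT_TYPE"] == "REBALANCE_STARTED":
            started[database] = when
        elif database in started:
            durations.append((database, (when - started.pop(database)).total_seconds()))
    return durations


REBALANCE_ROWS = [
    {"EVENT_TIME": "2018-05-15 13:21:12", "EVENT_TYPE": "REBALANCE_STARTED",
     "DETAILS": '{"database":"db1", "user_initiated":"true"}'},
    {"EVENT_TIME": "2018-05-15 13:21:12", "EVENT_TYPE": "REBALANCE_FINISHED",
     "DETAILS": '{"database":"db1", "user_initiated":"true"}'},
    {"EVENT_TIME": "2018-05-15 13:23:48", "EVENT_TYPE": "REBALANCE_STARTED",
     "DETAILS": '{"database":"(null)", "user_initiated":"false"}'},
    {"EVENT_TIME": "2018-05-15 13:23:57", "EVENT_TYPE": "REBALANCE_FINISHED",
     "DETAILS": '{"database":"(null)", "user_initiated":"false"}'},
]

print(rebalance_durations(REBALANCE_ROWS))
```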

## Replication events

| Event type                   | Description                                  |
| ---------------------------- | -------------------------------------------- |
| `DATABASE_REPLICATION_START` | A database has started replication           |
| `DATABASE_REPLICATION_STOP`  | A database has stopped or paused replication |

**Details output**

| Variable          | Value                    | Description                                    |
| ----------------- | ------------------------ | ---------------------------------------------- |
| `local_database`  | “local\_database\_name”  | The name of the database being replicated to   |
| `remote_database` | “remote\_database\_name” | The name of the database being replicated from |

## Network status events

| Event type         | Description                                                                                         |
| ------------------ | --------------------------------------------------------------------------------------------------- |
| `NODE_UNREACHABLE` | A node is unreachable from the master aggregator, either starting the grace period or going offline |
| `NODE_REACHABLE`   | A node is now reachable from the master aggregator, recovering within the grace period              |

**Details output**

| Variable               | Value                 | Description                                                                |
| ---------------------- | --------------------- | -------------------------------------------------------------------------- |
| `node`                 | “Hostname:port”       | Address of node                                                            |
| `message`              | “message about event” | For unreachable: describing which stage of `unreachable_node` the node is in |
| `grace_period_in_secs` | “int”                 | The number of seconds the grace period is set to                           |

## Backup/Restore events

| Event type   | Description                                         |
| ------------ | --------------------------------------------------- |
| `BACKUP_DB`  | A database has completed a `BACKUP DATABASE` command  |
| `RESTORE_DB` | A database has completed a `RESTORE DATABASE` command |

**Details output**

| Variable    | Value                      | Description                                                |
| ----------- | -------------------------- | ---------------------------------------------------------- |
| `db`        | “database\_name”           | Name of the database being backed up                       |
| `type`      | “s3 or fs or azure or gcs” | Destination of the backup: S3, filesystem, Azure, or GCS   |
| `backup_id` | “unsigned int”             | ID of the backup (only for backup)                         |

## Out of memory events

| Event type         | Description                                |
| ------------------ | ------------------------------------------ |
| `MAX_MEMORY`       | Maximum server memory has been hit         |
| `MAX_TABLE_MEMORY` | A table has hit the max table memory value |

**Details output**

| Variable                       | Value                                 | Description                                                      |
| ------------------------------ | ------------------------------------- | ---------------------------------------------------------------- |
| `actual_memory_mb`             | “memory use in MB”                    | Current memory usage in MB                                       |
| `maximum_memory_mb`            | “maximum memory in MB”                | Value of variable `maximum_memory`                               |
| `actual_table_memory_mb`       | “memory use in MB”                    | Memory use of offending table                                    |
| `maximum_table_memory_mb`      | “maximum table memory variable value” | Value of variable `maximum_table_memory`                         |
| `memory_needed_for_redundancy` | “memory in MB needed”                 | Memory needed to allow the requested redundancy to fit in memory |
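
The memory fields in `DETAILS` can be turned into a usage ratio for alerting. The sketch below assumes that a `MAX_MEMORY` event carries both `actual_memory_mb` and `maximum_memory_mb`, and that the values may arrive as quoted strings. Neither assumption is confirmed by the table above, so the function returns `None` when a field is absent. The helper name and the sample values are hypothetical.

```python
import json


def memory_usage_ratio(details_json):
    """Return actual/maximum memory as a fraction from an event's DETAILS JSON.

    Returns None if either field is missing or the maximum is zero.
    Values are converted with float() because they may be quoted strings.
    """
    details = json.loads(details_json)
    try:
        actual = float(details["actual_memory_mb"])
        maximum = float(details["maximum_memory_mb"])
    except KeyError:
        return None
    if maximum == 0:
        return None
    return actual / maximum


# Made-up DETAILS payload for illustration only.
SAMPLE_DETAILS = '{"actual_memory_mb": "7900", "maximum_memory_mb": "8000"}'

ratio = memory_usage_ratio(SAMPLE_DETAILS)
if ratio is not None and ratio > 0.95:
    print("memory usage above 95% of maximum_memory")
```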

## Miscellaneous events

| Event type                   | Description                                                       |
| ---------------------------- | ----------------------------------------------------------------- |
| `NOTIFY_AGGREGATOR_PROMOTED` | An aggregator has been promoted to master                         |
| `SYSTEM_VAR_CHANGED`         | A sensitive engine variable has been changed                      |
| `PARTITION_UNRECOVERABLE`    | A partition is lost due to failure and can no longer be recovered |

**Sensitive variables**

* auto\_attach
* leaf\_failure\_detection
* columnstore\_window\_size
* internal\_columnstore\_window\_minimum\_blob\_size
* sync\_permissions
* max\_connection\_threads

See the [List of Engine Variables](https://docs.singlestore.com/db/v9.1/reference/configuration-reference/engine-variables/list-of-engine-variables.md) for more information on these variables.

**Details output**

For `NOTIFY_AGGREGATOR_PROMOTED`: `"{}"`

For `SYSTEM_VAR_CHANGED`:

| Variable    | Value                                                | Description                                                |
| ----------- | ---------------------------------------------------- | ---------------------------------------------------------- |
| `variable`  | “variable\_name” (such as “max\_connection\_threads”) | The name of the engine variable that has been changed      |
| `new_value` | “new\_value” (such as “256”)                          | The new value that the engine variable has been changed to |

For `PARTITION_UNRECOVERABLE`:

| Variable   | Value                                  | Description                                 |
| ---------- | -------------------------------------- | ------------------------------------------- |
| `database` | “db\_name”                             | Name of the partition that is unrecoverable |
| `reason`   | “Database couldn’t commit transaction” | Reason for partition going unrecoverable    |

## Examples

```sql
SELECT * FROM information_schema.MV_EVENTS;

```

```output

+----------------+---------------------+----------+----------------------------+------------------------------------------------------------+
| ORIGIN_NODE_ID | EVENT_TIME          | SEVERITY | EVENT_TYPE                 | DETAILS                                                    |
+----------------+---------------------+----------+----------------------------+------------------------------------------------------------+
|              2 | 2018-05-15 13:21:03 | NOTICE   | NODE_ONLINE                | {"node":"127.0.0.1:10001"}                                 |
|              3 | 2018-05-15 13:21:05 | NOTICE   | NODE_ONLINE                | {"node":"127.0.0.1:10002"}                                 |
|              1 | 2018-05-15 13:21:12 | NOTICE   | REBALANCE_STARTED          | {"database":"db1", "user_initiated":"true"}                |
|              1 | 2018-05-15 13:21:12 | NOTICE   | REBALANCE_FINISHED         | {"database":"db1", "user_initiated":"true"}                |
|              3 | 2018-05-15 13:21:15 | WARNING  | NODE_DETACHED              | {"node":"127.0.0.1:10002"}                                 |
|              3 | 2018-05-15 13:21:16 | NOTICE   | NODE_ATTACHING             | {"node":"127.0.0.1:10002"}                                 |
|              3 | 2018-05-15 13:21:22 | NOTICE   | NODE_ONLINE                | {"node":"127.0.0.1:10002"}                                 |
|              2 | 2018-05-15 13:21:25 | WARNING  | NODE_OFFLINE               | {"node":"127.0.0.1:10001"}                                 |
|              2 | 2018-05-15 13:21:29 | NOTICE   | NODE_ATTACHING             | {"node":"127.0.0.1:10001"}                                 |
|              2 | 2018-05-15 13:21:30 | NOTICE   | NODE_ONLINE                | {"node":"127.0.0.1:10001"}                                 |
|              1 | 2018-05-15 13:21:35 | NOTICE   | DATABASE_REPLICATION_START | {"local_database":"db2", "remote_database":"db1"}          |
|              1 | 2018-05-15 13:21:40 | NOTICE   | DATABASE_REPLICATION_STOP  | {"database":"db2"}                                         |
|              2 | 2018-05-15 13:21:42 | WARNING  | NODE_OFFLINE               | {"node":"127.0.0.1:10001"}                                 |
|              2 | 2018-05-15 13:21:47 | NOTICE   | NODE_ATTACHING             | {"node":"127.0.0.1:10001"}                                 |
|              2 | 2018-05-15 13:21:48 | NOTICE   | NODE_ONLINE                | {"node":"127.0.0.1:10001"}                                 |
|              3 | 2018-05-15 13:23:48 | NOTICE   | REBALANCE_STARTED          | {"database":"(null)", "user_initiated":"false"}            |
|              3 | 2018-05-15 13:23:57 | NOTICE   | REBALANCE_FINISHED         | {"database":"(null)", "user_initiated":"false"}            |
|              1 | 2018-05-15 13:23:57 | NOTICE   | SYSTEM_VAR_CHANGED         | {"variable": "leaf_failure_detection", "new_value": "off"} |
+----------------+---------------------+----------+----------------------------+------------------------------------------------------------+

```

```sql
SELECT * FROM information_schema.LMV_EVENTS;

```

```output

+----------------+---------------------+----------+--------------------+------------------------------------------------------------+
| ORIGIN_NODE_ID | EVENT_TIME          | SEVERITY | EVENT_TYPE         | DETAILS                                                    |
+----------------+---------------------+----------+--------------------+------------------------------------------------------------+
|              1 | 2018-06-28 11:56:09 | NOTICE   | SYSTEM_VAR_CHANGED | {"variable": "max_connection_threads", "new_value": "256"} |
|              1 | 2018-06-28 11:56:11 | NOTICE   | NODE_STARTING      | {}                                                         |
|              1 | 2018-06-28 11:56:47 | NOTICE   | NODE_ONLINE        | {"node":"127.0.0.1:10001"}                                 |
|              1 | 2018-06-28 11:56:47 | NOTICE   | LEAF_ADD           | {"node":"127.0.0.1:10001"}                                 |
|              1 | 2018-06-28 17:42:28 | NOTICE   | LEAF_REMOVE        | {"node":"127.0.0.1:10001"}                                 |
|              1 | 2018-06-28 17:42:37 | NOTICE   | NODE_ONLINE        | {"node":"127.0.0.1:10001"}                                 |
|              1 | 2018-06-28 17:42:37 | NOTICE   | LEAF_ADD           | {"node":"127.0.0.1:10001"}                                 |
+----------------+---------------------+----------+--------------------+------------------------------------------------------------+

```
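
If event output is captured as text from a command-line client rather than fetched through a database driver, the bordered table can be converted into dictionaries for further processing. This is a minimal sketch that assumes no cell value contains a `|` character. `parse_table` is a hypothetical helper, and the sample is a narrowed copy of the `LMV_EVENTS` output above.

```python
def parse_table(text):
    """Parse bordered client-style table text into a list of dicts keyed by column name."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +----+ border lines and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first data line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


SAMPLE_TABLE = """
+----------------+---------------------+----------+--------------------+----------------------------+
| ORIGIN_NODE_ID | EVENT_TIME          | SEVERITY | EVENT_TYPE         | DETAILS                    |
+----------------+---------------------+----------+--------------------+----------------------------+
|              1 | 2018-06-28 11:56:11 | NOTICE   | NODE_STARTING      | {}                         |
|              1 | 2018-06-28 11:56:47 | NOTICE   | NODE_ONLINE        | {"node":"127.0.0.1:10001"} |
+----------------+---------------------+----------+--------------------+----------------------------+
"""

for row in parse_table(SAMPLE_TABLE):
    print(row["EVENT_TIME"], row["EVENT_TYPE"], row["DETAILS"])
```

All values come back as strings, so numeric columns such as `ORIGIN_NODE_ID` need an explicit `int()` conversion before comparison.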

***

Modified at: April 25, 2025

Source: [/db/v9.1/user-and-cluster-administration/cluster-health-and-performance/components-to-monitor/](https://docs.singlestore.com/db/v9.1/user-and-cluster-administration/cluster-health-and-performance/components-to-monitor/)

