SingleStore Components to Monitor

SingleStore is a distributed system. Because of this, all machines running SingleStore should be monitored to ensure smooth operation. This section demonstrates how to programmatically monitor a SingleStore cluster.

Monitor OS resources

If you are using third-party monitoring tools, make sure to monitor the following resources on each machine in the SingleStore cluster:

  • CPU Usage

  • CPU Load Average

  • Memory Usage

  • Memory Paging (page ins, page outs)

  • Disk Utilization

  • Disk Queue Time

  • Network Usage

  • Dropped packets / TCP retransmits

Note

Paging refers to a technique that Linux and other operating systems use to deal with high memory usage. If your system is consistently paging, add more memory capacity or you will experience severe performance degradation.

When the operating system predicts that it will require more memory than it has physically available, it will move infrequently accessed pages of memory out of RAM and onto the disk to make room for more frequently accessed memory. When this memory is used later by a process, the process must wait for the page to be read off disk and into RAM. If memory used by SingleStore is moved to disk, the latency of queries that access that memory will be substantially increased.

You can measure paging on the command line with the Linux tool vmstat: run vmstat 1 and watch the swap columns (si reports memory paged in from disk to RAM; so reports memory paged out of RAM to disk).

Monitor the memsqld process

During deployment, the SingleStore server binaries are installed on each host, and the server is then used to run multiple nodes.

Each node consists of two instances of the primary memsqld process: one handles network requests (typically listening on port 3306), while the other listens on a local Unix socket.

If only one memsqld process is running for a given node, this could indicate that the command process has stopped running. Refer to SingleStore Server Config Files for more information.
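As a complement to checking processes at the OS level, you can cross-check node availability from the database itself. The following is a minimal sketch, assuming the information_schema.LEAVES table and the lowercase online state value reported by SHOW LEAVES; run it from the Master Aggregator:

SELECT HOST, PORT, STATE
FROM information_schema.LEAVES
WHERE STATE <> 'online';

Any rows returned identify leaves that the Master Aggregator does not currently consider online.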

Monitor cluster status through the MV_CLUSTER_STATUS table

To check the status of the databases on your cluster, as well as information about the nodes in your cluster, query the information_schema.MV_CLUSTER_STATUS table from an aggregator. You can also access this information through SHOW CLUSTER STATUS; however, querying the table has the advantage that you can join against it.

Table description

| Field | Data Type (Size) | Description | Example Value |
| --- | --- | --- | --- |
| NODE_ID | bigint(10) | ID of node | 1 |
| HOST | varchar(512) | Host of the node | 127.0.0.1 |
| PORT | bigint(10) | The port of the node | 10000 |
| DATABASE_NAME | varchar(512) | Name of database | vigilantia_0_AUTO_REPLICA |
| ROLE | varchar(512) | Database’s role (e.g. orphan, master, replica, reference) | master |
| STATE | varchar(256) | Database state | replicating |
| POSITION | varchar(256) | Position in transaction log | 0:8832 |
| MASTER_HOST | varchar(256) | Host of this node’s aggregator | 127.0.0.1 |
| MASTER_PORT | bigint(10) | Port of this node’s aggregator | 3304 |
| METADATA_MASTER_NODE_ID | bigint(10) | Master’s node ID expected by metadata | 1 |
| METADATA_MASTER_HOST | varchar(256) | Master’s host expected by metadata | 127.0.0.1 |
| METADATA_MASTER_PORT | bigint(10) | Master’s port expected by metadata | 3306 |
| METADATA_ROLE | varchar(512) | Database’s role based on metadata | Orphan |
| DETAILS | varchar(512) | Extra details | stage: packet wait, state: x_streaming, err: no |

Sample output

SELECT * FROM information_schema.MV_CLUSTER_STATUS;
+---------+-----------+-------+---------------------------+---------------+-------------+----------+-------------+-------------+-------------------------+----------------------+----------------------+---------------+-------------------------------------------------+
| NODE_ID | HOST      | PORT  | DATABASE_NAME             | ROLE          | STATE       | POSITION | MASTER_HOST | MASTER_PORT | METADATA_MASTER_NODE_ID | METADATA_MASTER_HOST | METADATA_MASTER_PORT | METADATA_ROLE | DETAILS                                         |
+---------+-----------+-------+---------------------------+---------------+-------------+----------+-------------+-------------+-------------------------+----------------------+----------------------+---------------+-------------------------------------------------+
|       1 | 127.0.0.1 | 10000 | cluster                   | master        | online      | 0:46     | NULL        |        NULL |                    NULL | NULL                 |                 NULL | Reference     |                                                 |
|       1 | 127.0.0.1 | 10000 | monitoring                | master        | online      | 0:8832   | NULL        |        NULL |                    NULL | NULL                 |                 NULL | Reference     |                                                 |
|       1 | 127.0.0.1 | 10000 | vigilantia                | master        | online      | 0:24616  | NULL        |        NULL |                    NULL | NULL                 |                 NULL | Reference     |                                                 |
|       3 | 127.0.0.1 | 10001 | cluster                   | async replica | replicating | 0:45     | 127.0.0.1   |       10000 |                       1 | 127.0.0.1            |                10000 | Reference     | stage: packet wait, state: x_streaming, err: no |
|       3 | 127.0.0.1 | 10001 | monitoring                | sync replica  | replicating | 0:8832   | 127.0.0.1   |       10000 |                       1 | 127.0.0.1            |                10000 | Reference     |                                                 |
|       3 | 127.0.0.1 | 10001 | monitoring_0              | master        | online      | 0:58893  | NULL        |        NULL |                    NULL | NULL                 |                 NULL | Master        |                                                 |
|       3 | 127.0.0.1 | 10001 | monitoring_0_AUTO_REPLICA | async replica | replicating | 0:58893  | 127.0.0.1   |       10001 |                    NULL | NULL                 |                 NULL | Orphan        |                                                 |
|       3 | 127.0.0.1 | 10001 | monitoring_1              | master        | online      | 0:57439  | NULL        |        NULL |                    NULL | NULL                 |                 NULL | Master        |                                                 |
|       3 | 127.0.0.1 | 10001 | monitoring_1_AUTO_REPLICA | async replica | replicating | 0:57439  | 127.0.0.1   |       10001 |                    NULL | NULL                 |                 NULL | Orphan        |                                                 |
|       3 | 127.0.0.1 | 10001 | monitoring_2              | master        | online      | 0:49952  | NULL        |        NULL |                    NULL | NULL                 |                 NULL | Master        |                                                 |
|       3 | 127.0.0.1 | 10001 | monitoring_2_AUTO_REPLICA | async replica | replicating | 0:49952  | 127.0.0.1   |       10001 |                    NULL | NULL                 |                 NULL | Orphan        |                                                 |
|       3 | 127.0.0.1 | 10001 | vigilantia                | sync replica  | replicating | 0:24616  | 127.0.0.1   |       10000 |                       1 | 127.0.0.1            |                10000 | Reference     |                                                 |
|       3 | 127.0.0.1 | 10001 | vigilantia_0              | master        | online      | 0:25874  | NULL        |        NULL |                    NULL | NULL                 |                 NULL | Master        |                                                 |
|       3 | 127.0.0.1 | 10001 | vigilantia_0_AUTO_REPLICA | async replica | replicating | 0:25874  | 127.0.0.1   |       10001 |                    NULL | NULL                 |                 NULL | Orphan        |                                                 |
|       3 | 127.0.0.1 | 10001 | vigilantia_1              | master        | online      | 0:25874  | NULL        |        NULL |                    NULL | NULL                 |                 NULL | Master        |                                                 |
|       3 | 127.0.0.1 | 10001 | vigilantia_1_AUTO_REPLICA | async replica | replicating | 0:25874  | 127.0.0.1   |       10001 |                    NULL | NULL                 |                 NULL | Orphan        |                                                 |
|       3 | 127.0.0.1 | 10001 | vigilantia_2              | master        | online      | 0:25874  | NULL        |        NULL |                    NULL | NULL                 |                 NULL | Master        |                                                 |
|       3 | 127.0.0.1 | 10001 | vigilantia_2_AUTO_REPLICA | async replica | replicating | 0:25874  | 127.0.0.1   |       10001 |                    NULL | NULL                 |                 NULL | Orphan        |                                                 |
+---------+-----------+-------+---------------------------+---------------+-------------+----------+-------------+-------------+-------------------------+----------------------+----------------------+---------------+-------------------------------------------------+
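Because MV_CLUSTER_STATUS is an ordinary table, you can filter or aggregate it instead of scanning the full output. A minimal health-check sketch, assuming (as in the sample above) that healthy databases report a STATE of online or replicating:

SELECT NODE_ID, HOST, PORT, DATABASE_NAME, ROLE, STATE
FROM information_schema.MV_CLUSTER_STATUS
WHERE STATE NOT IN ('online', 'replicating');

An empty result suggests that every database copy in the cluster is in an expected state.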

Monitor cluster events through the MV_EVENTS table

As another facet of monitoring the health of your cluster, the information_schema.MV_EVENTS table provides cluster-level event reporting that you can query against. Querying the information_schema.MV_EVENTS table returns events from the entire cluster and can only be done from an aggregator. To monitor events on an individual leaf, connect to that leaf and query the information_schema.LMV_EVENTS table, which has exactly the same structure.
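For routine monitoring, you will usually filter this table rather than read it in full. For example, the following sketch surfaces recent non-NOTICE events (the one-hour window is an arbitrary choice):

SELECT EVENT_TIME, ORIGIN_NODE_ID, EVENT_TYPE, DETAILS
FROM information_schema.MV_EVENTS
WHERE SEVERITY IN ('WARNING', 'ERROR')
  AND EVENT_TIME > NOW() - INTERVAL 1 HOUR
ORDER BY EVENT_TIME DESC;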

Table description

| Field | Data Type (Size) | Description | Example Value |
| --- | --- | --- | --- |
| ORIGIN_NODE_ID | bigint(4) | ID of node where the event happened. | 3 |
| EVENT_TIME | timestamp | Timestamp when event occurred. | 2018-04-25 18:08:13 |
| SEVERITY | varchar(512) | Severity of the event. Can be one of the following values: NOTICE, WARNING, or ERROR. | NOTICE |
| EVENT_TYPE | varchar(512) | Type of event that occurred. See the event type definitions below for more details. | NODE_ONLINE |
| DETAILS | varchar(512) | Additional information about the event, in JSON format. | {"node":"172.18.0.2:3306"} |
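Because DETAILS holds JSON, individual fields can be pulled out in SQL. This sketch uses SingleStore's JSON_EXTRACT_STRING built-in; since DETAILS is stored as a varchar, it assumes the value parses as valid JSON:

SELECT EVENT_TIME, EVENT_TYPE,
       JSON_EXTRACT_STRING(DETAILS, 'node') AS node
FROM information_schema.MV_EVENTS
WHERE EVENT_TYPE IN ('NODE_OFFLINE', 'NODE_DETACHED');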

Event type definitions

Node events

| Event type | Description |
| --- | --- |
| NODE_ONLINE | A node has come online |
| NODE_OFFLINE | A node has gone offline |
| NODE_ATTACHING | A node is in the process of attaching |
| NODE_DETACHED | A node has become detached |

Details output

| Variable | Value | Description |
| --- | --- | --- |
| node | Hostname:port | Address of node |

Rebalance events

| Event type | Description |
| --- | --- |
| REBALANCE_STARTED | A partition rebalance has started |
| REBALANCE_FINISHED | A partition rebalance has ended |

Details output

| Variable | Value | Description |
| --- | --- | --- |
| database | database_name or (null) | Database being rebalanced (truncated to 80 characters) |
| user_initiated | true/false | Whether the rebalance was initiated by the user or by the cluster |
| success | true/false | Whether the rebalance succeeded or failed |
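Failed rebalances can then be surfaced directly. A hedged sketch: the success key is documented above but does not appear in every sample row later in this article, so treat its presence as version-dependent:

SELECT EVENT_TIME, DETAILS
FROM information_schema.MV_EVENTS
WHERE EVENT_TYPE = 'REBALANCE_FINISHED'
  AND JSON_EXTRACT_STRING(DETAILS, 'success') = 'false';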

Replication events

| Event type | Description |
| --- | --- |
| DATABASE_REPLICATION_START | A database has started replication |
| DATABASE_REPLICATION_STOP | A database has stopped or paused replication |

Details output

| Variable | Value | Description |
| --- | --- | --- |
| local_database | local_database_name | The name of the database being replicated to |
| remote_database | remote_database_name | The name of the database being replicated from |

Network status events

| Event type | Description |
| --- | --- |
| NODE_UNREACHABLE | A node is unreachable from the master aggregator, either starting the grace period or going offline |
| NODE_REACHABLE | A node is reachable again from the master aggregator, recovering within the grace period |

Details output

| Variable | Value | Description |
| --- | --- | --- |
| node | Hostname:port | Address of node |
| message | message about event | For NODE_UNREACHABLE: describes which stage of unreachability the node is in |
| grace_period_in_secs | int | The number of seconds the grace period is set to |

Backup/Restore events

| Event type | Description |
| --- | --- |
| BACKUP_DB | A database has completed a BACKUP DATABASE command |
| RESTORE_DB | A database has completed a RESTORE DATABASE command |

Details output

| Variable | Value | Description |
| --- | --- | --- |
| db | database_name | Name of the database being backed up or restored |
| type | s3, fs, azure, or gcs | Destination of the backup: S3, filesystem, Azure, or GCS |
| backup_id | unsigned int | ID of the backup (BACKUP_DB only) |

Out of memory events

| Event type | Description |
| --- | --- |
| MAX_MEMORY | Maximum server memory has been hit |
| MAX_TABLE_MEMORY | A table has hit the maximum table memory value |

Details output

| Variable | Value | Description |
| --- | --- | --- |
| actual_memory_mb | memory use in MB | Current memory usage in MB |
| maximum_memory_mb | maximum memory in MB | Value of the maximum_memory variable |
| actual_table_memory_mb | memory use in MB | Memory use of the offending table |
| maximum_table_memory_mb | maximum table memory in MB | Value of the maximum_table_memory variable |
| memory_needed_for_redundancy | memory needed in MB | Memory needed to allow the requested redundancy to fit in memory |

Miscellaneous events

| Event type | Description |
| --- | --- |
| NOTIFY_AGGREGATOR_PROMOTED | An aggregator has been promoted to master |
| SYSTEM_VAR_CHANGED | A sensitive engine variable has been changed |
| PARTITION_UNRECOVERABLE | A partition has been lost due to failure and can no longer be recovered |

Sensitive variables

  • auto_attach

  • leaf_failure_detection

  • columnstore_window_size

  • internal_columnstore_window_minimum_blob_size

  • sync_permissions

  • max_connection_threads

See the List of Engine Variables for more information on these variables.
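To spot-check the current values of these variables (for example, when investigating a SYSTEM_VAR_CHANGED event), you can use the standard SHOW VARIABLES syntax:

SHOW VARIABLES LIKE 'leaf_failure_detection';
SHOW VARIABLES LIKE 'max_connection_threads';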

Details output

For NOTIFY_AGGREGATOR_PROMOTED: "{}"

For SYSTEM_VAR_CHANGED:

| Variable | Value | Description |
| --- | --- | --- |
| variable | variable_name (such as “max_connection_threads”) | The name of the engine variable that has been changed |
| new_value | new_value (such as “256”) | The new value that the engine variable has been changed to |

For PARTITION_UNRECOVERABLE:

| Variable | Value | Description |
| --- | --- | --- |
| database | db_name | Name of the partition database that is unrecoverable |
| reason | e.g. Database couldn’t commit transaction | Reason the partition became unrecoverable |
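As a usage example, SYSTEM_VAR_CHANGED events can be turned into a simple audit trail of variable changes; this sketch reuses the JSON_EXTRACT_STRING approach assumed above:

SELECT EVENT_TIME,
       JSON_EXTRACT_STRING(DETAILS, 'variable') AS variable_name,
       JSON_EXTRACT_STRING(DETAILS, 'new_value') AS new_value
FROM information_schema.MV_EVENTS
WHERE EVENT_TYPE = 'SYSTEM_VAR_CHANGED'
ORDER BY EVENT_TIME DESC;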

Examples
SELECT * FROM information_schema.MV_EVENTS;
+----------------+---------------------+----------+----------------------------+------------------------------------------------------------+
| ORIGIN_NODE_ID | EVENT_TIME          | SEVERITY | EVENT_TYPE                 | DETAILS                                                    |
+----------------+---------------------+----------+----------------------------+------------------------------------------------------------+
|              2 | 2018-05-15 13:21:03 | NOTICE   | NODE_ONLINE                | {"node":"127.0.0.1:10001"}                                 |
|              3 | 2018-05-15 13:21:05 | NOTICE   | NODE_ONLINE                | {"node":"127.0.0.1:10002"}                                 |
|              1 | 2018-05-15 13:21:12 | NOTICE   | REBALANCE_STARTED          | {"database":"db1", "user_initiated":"true"}                |
|              1 | 2018-05-15 13:21:12 | NOTICE   | REBALANCE_FINISHED         | {"database":"db1", "user_initiated":"true"}                |
|              3 | 2018-05-15 13:21:15 | WARNING  | NODE_DETACHED              | {"node":"127.0.0.1:10002"}                                 |
|              3 | 2018-05-15 13:21:16 | NOTICE   | NODE_ATTACHING             | {"node":"127.0.0.1:10002"}                                 |
|              3 | 2018-05-15 13:21:22 | NOTICE   | NODE_ONLINE                | {"node":"127.0.0.1:10002"}                                 |
|              2 | 2018-05-15 13:21:25 | WARNING  | NODE_OFFLINE               | {"node":"127.0.0.1:10001"}                                 |
|              2 | 2018-05-15 13:21:29 | NOTICE   | NODE_ATTACHING             | {"node":"127.0.0.1:10001"}                                 |
|              2 | 2018-05-15 13:21:30 | NOTICE   | NODE_ONLINE                | {"node":"127.0.0.1:10001"}                                 |
|              1 | 2018-05-15 13:21:35 | NOTICE   | DATABASE_REPLICATION_START | {"local_database":"db2", "remote_database":"db1"}          |
|              1 | 2018-05-15 13:21:40 | NOTICE   | DATABASE_REPLICATION_STOP  | {"database":"db2"}                                         |
|              2 | 2018-05-15 13:21:42 | WARNING  | NODE_OFFLINE               | {"node":"127.0.0.1:10001"}                                 |
|              2 | 2018-05-15 13:21:47 | NOTICE   | NODE_ATTACHING             | {"node":"127.0.0.1:10001"}                                 |
|              2 | 2018-05-15 13:21:48 | NOTICE   | NODE_ONLINE                | {"node":"127.0.0.1:10001"}                                 |
|              3 | 2018-05-15 13:23:48 | NOTICE   | REBALANCE_STARTED          | {"database":"(null)", "user_initiated":"false"}            |
|              3 | 2018-05-15 13:23:57 | NOTICE   | REBALANCE_FINISHED         | {"database":"(null)", "user_initiated":"false"}            |
|              1 | 2018-05-15 13:23:57 | NOTICE   | SYSTEM_VAR_CHANGED         | {"variable": "leaf_failure_detection", "new_value": "off"} |
+----------------+---------------------+----------+----------------------------+------------------------------------------------------------+
SELECT * FROM information_schema.LMV_EVENTS;
+----------------+---------------------+----------+--------------------+------------------------------------------------------------+
| ORIGIN_NODE_ID | EVENT_TIME          | SEVERITY | EVENT_TYPE         | DETAILS                                                    |
+----------------+---------------------+----------+--------------------+------------------------------------------------------------+
|              1 | 2018-06-28 11:56:09 | NOTICE   | SYSTEM_VAR_CHANGED | {"variable": "max_connection_threads", "new_value": "256"} |
|              1 | 2018-06-28 11:56:11 | NOTICE   | NODE_STARTING      | {}                                                         |
|              1 | 2018-06-28 11:56:47 | NOTICE   | NODE_ONLINE        | {"node":"127.0.0.1:10001"}                                 |
|              1 | 2018-06-28 11:56:47 | NOTICE   | LEAF_ADD           | {"node":"127.0.0.1:10001"}                                 |
|              1 | 2018-06-28 17:42:28 | NOTICE   | LEAF_REMOVE        | {"node":"127.0.0.1:10001"}                                 |
|              1 | 2018-06-28 17:42:37 | NOTICE   | NODE_ONLINE        | {"node":"127.0.0.1:10001"}                                 |
|              1 | 2018-06-28 17:42:37 | NOTICE   | LEAF_ADD           | {"node":"127.0.0.1:10001"}                                 |
+----------------+---------------------+----------+--------------------+------------------------------------------------------------+

Last modified: August 22, 2024
