Components to Monitor
SingleStore Components to Monitor
SingleStore is a distributed system.
Monitor OS resources
If you are using third-party monitoring tools, make sure to monitor the following resources on each machine in the SingleStore cluster:
- CPU Usage
- CPU Load Average
- Memory Usage
- Memory Paging (page ins, page outs)
- Disk Utilization
- Disk Queue Time
- Network Usage
- Dropped packets / TCP retransmits
Note
Paging refers to a technique that Linux and other operating systems use to deal with high memory usage.
When the operating system predicts that it will require more memory than it has physically available, it will move infrequently accessed pages of memory out of RAM and onto the disk to make room for more frequently accessed memory.
You can measure paging on the command line by running the Linux command `vmstat 1` and looking at the swap section (`si` and `so` refer to paging memory off the disk and into RAM, and out of RAM and onto disk, respectively).
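As a rough illustration of reading the swap columns programmatically, the sketch below parses captured `vmstat` text and reports whether any sample shows paging. The function names `parse_vmstat` and `is_paging`, the zero threshold, and the sample text are all assumptions made for this example, not part of any SingleStore tooling.

```python
def parse_vmstat(output: str) -> list[dict[str, int]]:
    """Parse `vmstat` text into one dict per sample row, keyed by column name."""
    lines = [line.split() for line in output.strip().splitlines()]
    # The column-name row is the first one that contains both swap columns.
    header_idx = next(i for i, cols in enumerate(lines) if "si" in cols and "so" in cols)
    header = lines[header_idx]
    samples = []
    for cols in lines[header_idx + 1:]:
        # Skip repeated header lines and anything that is not a full numeric row.
        if len(cols) == len(header) and all(c.isdigit() for c in cols):
            samples.append(dict(zip(header, map(int, cols))))
    return samples


def is_paging(samples: list[dict[str, int]], threshold: int = 0) -> bool:
    """Return True if any sample swapped more than `threshold` in or out."""
    return any(s["si"] > threshold or s["so"] > threshold for s in samples)


# Made-up sample text in the usual `vmstat` layout, for illustration only.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 812344  10240 402112    0    0     3     7   51   90  1  0 99  0  0
 2  0  20480  15232   2048  98304  128  512   640  2048  400  800 20 10 40 30  0
"""

samples = parse_vmstat(SAMPLE)
print("paging detected:", is_paging(samples))
```

In practice the text would come from running `vmstat` on each host; a sustained non-zero `si` or `so` is the signal worth alerting on, and the threshold should be tuned to the workload.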
Monitor the memsqld process
During deployment, SingleStore server binaries are installed on each host, where the server is then used to run multiple nodes.
Each node comprises two instances of the primary `memsqld` process, one of which handles network requests (typically listening on port `3306`), while the other listens on a local Unix socket. If only one `memsqld` process is running for a given node, this could indicate that the command process has stopped running.
Monitor cluster status through the MV_CLUSTER_STATUS table
To know the status of the databases on your cluster, as well as information about the nodes in your cluster, query the `information_schema.MV_CLUSTER_STATUS` table from an aggregator. The same information is available through `SHOW CLUSTER STATUS`; however, querying the table provides the advantage of being able to join against it.
Table description
| Field | Data Type (Size) | Description | Example Value |
|---|---|---|---|
| `NODE_ID` | bigint(10) | ID of node | 1 |
| `HOST` | varchar(512) | Host of the node | 127. |
| `PORT` | bigint(10) | The port of the node | 10000 |
| `DATABASE_NAME` | varchar(512) | Name of database | vigilantia_ |
| `ROLE` | varchar(512) | Database’s role (e.g. master, async replica, sync replica) | master |
| `STATE` | varchar(256) | Database state | replicating |
| `POSITION` | varchar(256) | Position in transaction log | 0:8832 |
| `MASTER_HOST` | varchar(256) | Host of this node’s aggregator | 127. |
| `MASTER_PORT` | bigint(10) | Port of this node’s aggregator | 3304 |
| `METADATA_MASTER_NODE_ID` | bigint(10) | Master’s node ID expected by metadata | 1 |
| `METADATA_MASTER_HOST` | varchar(256) | Master’s host expected by metadata | 127. |
| `METADATA_MASTER_PORT` | bigint(10) | Master’s port expected by metadata | 3306 |
| `METADATA_ROLE` | varchar(512) | Database’s role based on metadata | Orphan |
| `DETAILS` | varchar(512) | Extra details | stage: packet wait, state: x_ |
Sample output
SELECT * FROM information_schema.MV_CLUSTER_STATUS;
+---------+-----------+-------+---------------------------+---------------+-------------+----------+-------------+-------------+-------------------------+----------------------+----------------------+---------------+-------------------------------------------------+
| NODE_ID | HOST | PORT | DATABASE_NAME | ROLE | STATE | POSITION | MASTER_HOST | MASTER_PORT | METADATA_MASTER_NODE_ID | METADATA_MASTER_HOST | METADATA_MASTER_PORT | METADATA_ROLE | DETAILS |
+---------+-----------+-------+---------------------------+---------------+-------------+----------+-------------+-------------+-------------------------+----------------------+----------------------+---------------+-------------------------------------------------+
| 1 | 127.0.0.1 | 10000 | cluster | master | online | 0:46 | NULL | NULL | NULL | NULL | NULL | Reference | |
| 1 | 127.0.0.1 | 10000 | monitoring | master | online | 0:8832 | NULL | NULL | NULL | NULL | NULL | Reference | |
| 1 | 127.0.0.1 | 10000 | vigilantia | master | online | 0:24616 | NULL | NULL | NULL | NULL | NULL | Reference | |
| 3 | 127.0.0.1 | 10001 | cluster | async replica | replicating | 0:45 | 127.0.0.1 | 10000 | 1 | 127.0.0.1 | 10000 | Reference | stage: packet wait, state: x_streaming, err: no |
| 3 | 127.0.0.1 | 10001 | monitoring | sync replica | replicating | 0:8832 | 127.0.0.1 | 10000 | 1 | 127.0.0.1 | 10000 | Reference | |
| 3 | 127.0.0.1 | 10001 | monitoring_0 | master | online | 0:58893 | NULL | NULL | NULL | NULL | NULL | Master | |
| 3 | 127.0.0.1 | 10001 | monitoring_0_AUTO_REPLICA | async replica | replicating | 0:58893 | 127.0.0.1 | 10001 | NULL | NULL | NULL | Orphan | |
| 3 | 127.0.0.1 | 10001 | monitoring_1 | master | online | 0:57439 | NULL | NULL | NULL | NULL | NULL | Master | |
| 3 | 127.0.0.1 | 10001 | monitoring_1_AUTO_REPLICA | async replica | replicating | 0:57439 | 127.0.0.1 | 10001 | NULL | NULL | NULL | Orphan | |
| 3 | 127.0.0.1 | 10001 | monitoring_2 | master | online | 0:49952 | NULL | NULL | NULL | NULL | NULL | Master | |
| 3 | 127.0.0.1 | 10001 | monitoring_2_AUTO_REPLICA | async replica | replicating | 0:49952 | 127.0.0.1 | 10001 | NULL | NULL | NULL | Orphan | |
| 3 | 127.0.0.1 | 10001 | vigilantia | sync replica | replicating | 0:24616 | 127.0.0.1 | 10000 | 1 | 127.0.0.1 | 10000 | Reference | |
| 3 | 127.0.0.1 | 10001 | vigilantia_0 | master | online | 0:25874 | NULL | NULL | NULL | NULL | NULL | Master | |
| 3 | 127.0.0.1 | 10001 | vigilantia_0_AUTO_REPLICA | async replica | replicating | 0:25874 | 127.0.0.1 | 10001 | NULL | NULL | NULL | Orphan | |
| 3 | 127.0.0.1 | 10001 | vigilantia_1 | master | online | 0:25874 | NULL | NULL | NULL | NULL | NULL | Master | |
| 3 | 127.0.0.1 | 10001 | vigilantia_1_AUTO_REPLICA | async replica | replicating | 0:25874 | 127.0.0.1 | 10001 | NULL | NULL | NULL | Orphan | |
| 3 | 127.0.0.1 | 10001 | vigilantia_2 | master | online | 0:25874 | NULL | NULL | NULL | NULL | NULL | Master | |
| 3 | 127.0.0.1 | 10001 | vigilantia_2_AUTO_REPLICA | async replica | replicating | 0:25874 | 127.0.0.1 | 10001 | NULL | NULL | NULL | Orphan | |
+---------+-----------+-------+---------------------------+---------------+-------------+----------+-------------+-------------+-------------------------+----------------------+----------------------+---------------+-------------------------------------------------+
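One way to act on this output is to flag rows whose `STATE` is not one you expect. The sketch below assumes the rows have already been fetched as dictionaries keyed by column name (for example, from a dictionary-style database cursor) and treats only the two states visible in the sample output as healthy. That set, the function name, and the placeholder state are assumptions for illustration.

```python
# Assumption: only the two states seen in the sample output count as healthy.
# Extend this set to match the states that are normal for your cluster.
HEALTHY_STATES = {"online", "replicating"}


def unhealthy_databases(rows):
    """Return (HOST, PORT, DATABASE_NAME, STATE) for rows outside HEALTHY_STATES."""
    return [
        (r["HOST"], r["PORT"], r["DATABASE_NAME"], r["STATE"])
        for r in rows
        if r["STATE"] not in HEALTHY_STATES
    ]


rows = [
    {"HOST": "127.0.0.1", "PORT": 10000, "DATABASE_NAME": "vigilantia", "STATE": "online"},
    {"HOST": "127.0.0.1", "PORT": 10001, "DATABASE_NAME": "vigilantia", "STATE": "replicating"},
    # Made-up row: "some_other_state" is a placeholder, not a documented state.
    {"HOST": "127.0.0.1", "PORT": 10001, "DATABASE_NAME": "vigilantia_0", "STATE": "some_other_state"},
]

for host, port, database, state in unhealthy_databases(rows):
    print(f"{database} on {host}:{port} is in state {state}")
```

Keeping the allowed states in one set makes the check easy to adjust once you know which states are routine in your environment.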
Monitor cluster events through the MV_EVENTS table
As another facet in monitoring the health of your cluster, the `information_schema.MV_EVENTS` table provides cluster-level event reporting that you can query against. Querying the `information_schema.MV_EVENTS` table returns events from the entire cluster and can only be done from an aggregator. For events local to a single node, query the `information_schema.LMV_EVENTS` table, which is exactly the same in structure.
Table description
| Field | Data Type (Size) | Description | Example Value |
|---|---|---|---|
| `ORIGIN_NODE_ID` | bigint(4) | ID of node where the event happened. | 3 |
| `EVENT_TIME` | timestamp | Timestamp when event occurred. | 2018-04-25 18:08:13 |
| `SEVERITY` | varchar(512) | Severity of the event. | NOTICE |
| `EVENT_TYPE` | varchar(512) | Type of event that occurred. | NODE_ |
| `DETAILS` | varchar(512) | Additional information about the event, in JSON format. | { |
Event type definitions
Node events
| Event type | Description |
|---|---|
| `NODE_ONLINE` | A node has come online |
| `NODE_OFFLINE` | A node has gone offline |
| `NODE_ATTACHING` | A node is in the process of attaching |
| `NODE_DETACHED` | A node has become detached |
Details output
| Variable | Value | Description |
|---|---|---|
| `node` | | Address of node |
Rebalance events
| Event type | Description |
|---|---|
| `REBALANCE_STARTED` | A partition rebalance has started |
| `REBALANCE_FINISHED` | A partition rebalance has ended |
Details output
| Variable | Value | Description |
|---|---|---|
| `database` | | Database being rebalanced (80 characters truncated) |
| `user_initiated` | | Whether the rebalance was initiated by the user or the cluster |
Replication events
| Event type | Description |
|---|---|
| `DATABASE_REPLICATION_START` | A database has started replication |
| `DATABASE_REPLICATION_STOP` | A database has stopped or paused replication |
Details output
| Variable | Value | Description |
|---|---|---|
| `local_database` | | The name of the database being replicated to |
| `remote_database` | | The name of the database being replicated from |
Network status events
| Event type | Description |
|---|---|
| | A node is unreachable from the master aggregator, either starting the grace period or going offline |
| | A node is now reachable from the master aggregator, recovering within the grace period |
Details output
| Variable | Value | Description |
|---|---|---|
| | | Address of node |
| | | For unreachable: describing which stage of |
| | | The number of seconds the grace period is set to |
Backup/Restore events
| Event type | Description |
|---|---|
| | A database has completed a |
| | A database has completed a |
Details output
| Variable | Value | Description |
|---|---|---|
| | | Name of the database being backed up |
| | | Where the backup is going to: S3, filesystem, Azure, or GCS |
| | | ID of the backup (only for backup) |
Out of Memory Events
| Event type | Description |
|---|---|
| | Maximum server memory has been hit |
| | A table has hit the max table memory value |
Details output
| Variable | Value | Description |
|---|---|---|
| | | Current memory usage in MB |
| | | Value of variable |
| | | Memory use of offending table |
| | | Value of variable |
| | | Memory needed to allow the requested redundancy to fit in memory |
Miscellaneous events
| Event type | Description |
|---|---|
| | An aggregator has been promoted to master |
| `SYSTEM_VAR_CHANGED` | A sensitive engine variable has been changed |
| | A partition is lost due to failure and can no longer be recovered |
Sensitive variables
- `auto_attach`
- `leaf_failure_detection`
- `columnstore_window_size`
- `internal_columnstore_window_minimum_blob_size`
- `sync_permissions`
- `max_connection_threads`
See the List of Engine Variables for more information on these variables.
Details output
For NOTIFY_: "{}"
For `SYSTEM_VAR_CHANGED`:
| Variable | Value | Description |
|---|---|---|
| `variable` | | The name of the engine variable that has been changed |
| `new_value` | | The new value that the engine variable has been changed to |
For PARTITION_:
| Variable | Value | Description |
|---|---|---|
| | | Name of the partition that is unrecoverable |
| | | Reason for partition going unrecoverable |
Examples
SELECT * FROM information_schema.MV_EVENTS;
+----------------+---------------------+----------+----------------------------+------------------------------------------------------------+
| ORIGIN_NODE_ID | EVENT_TIME | SEVERITY | EVENT_TYPE | DETAILS |
+----------------+---------------------+----------+----------------------------+------------------------------------------------------------+
| 2 | 2018-05-15 13:21:03 | NOTICE | NODE_ONLINE | {"node":"127.0.0.1:10001"} |
| 3 | 2018-05-15 13:21:05 | NOTICE | NODE_ONLINE | {"node":"127.0.0.1:10002"} |
| 1 | 2018-05-15 13:21:12 | NOTICE | REBALANCE_STARTED | {"database":"db1", "user_initiated":"true"} |
| 1 | 2018-05-15 13:21:12 | NOTICE | REBALANCE_FINISHED | {"database":"db1", "user_initiated":"true"} |
| 3 | 2018-05-15 13:21:15 | WARNING | NODE_DETACHED | {"node":"127.0.0.1:10002"} |
| 3 | 2018-05-15 13:21:16 | NOTICE | NODE_ATTACHING | {"node":"127.0.0.1:10002"} |
| 3 | 2018-05-15 13:21:22 | NOTICE | NODE_ONLINE | {"node":"127.0.0.1:10002"} |
| 2 | 2018-05-15 13:21:25 | WARNING | NODE_OFFLINE | {"node":"127.0.0.1:10001"} |
| 2 | 2018-05-15 13:21:29 | NOTICE | NODE_ATTACHING | {"node":"127.0.0.1:10001"} |
| 2 | 2018-05-15 13:21:30 | NOTICE | NODE_ONLINE | {"node":"127.0.0.1:10001"} |
| 1 | 2018-05-15 13:21:35 | NOTICE | DATABASE_REPLICATION_START | {"local_database":"db2", "remote_database":"db1"} |
| 1 | 2018-05-15 13:21:40 | NOTICE | DATABASE_REPLICATION_STOP | {"database":"db2"} |
| 2 | 2018-05-15 13:21:42 | WARNING | NODE_OFFLINE | {"node":"127.0.0.1:10001"} |
| 2 | 2018-05-15 13:21:47 | NOTICE | NODE_ATTACHING | {"node":"127.0.0.1:10001"} |
| 2 | 2018-05-15 13:21:48 | NOTICE | NODE_ONLINE | {"node":"127.0.0.1:10001"} |
| 3 | 2018-05-15 13:23:48 | NOTICE | REBALANCE_STARTED | {"database":"(null)", "user_initiated":"false"} |
| 3 | 2018-05-15 13:23:57 | NOTICE | REBALANCE_FINISHED | {"database":"(null)", "user_initiated":"false"} |
| 1 | 2018-05-15 13:23:57 | NOTICE | SYSTEM_VAR_CHANGED | {"variable": "leaf_failure_detection", "new_value": "off"} |
+----------------+---------------------+----------+----------------------------+------------------------------------------------------------+
SELECT * FROM information_schema.LMV_EVENTS;
+----------------+---------------------+----------+--------------------+------------------------------------------------------------+
| ORIGIN_NODE_ID | EVENT_TIME | SEVERITY | EVENT_TYPE | DETAILS |
+----------------+---------------------+----------+--------------------+------------------------------------------------------------+
| 1 | 2018-06-28 11:56:09 | NOTICE | SYSTEM_VAR_CHANGED | {"variable": "max_connection_threads", "new_value": "256"} |
| 1 | 2018-06-28 11:56:11 | NOTICE | NODE_STARTING | {} |
| 1 | 2018-06-28 11:56:47 | NOTICE | NODE_ONLINE | {"node":"127.0.0.1:10001"} |
| 1 | 2018-06-28 11:56:47 | NOTICE | LEAF_ADD | {"node":"127.0.0.1:10001"} |
| 1 | 2018-06-28 17:42:28 | NOTICE | LEAF_REMOVE | {"node":"127.0.0.1:10001"} |
| 1 | 2018-06-28 17:42:37 | NOTICE | NODE_ONLINE | {"node":"127.0.0.1:10001"} |
| 1 | 2018-06-28 17:42:37 | NOTICE | LEAF_ADD | {"node":"127.0.0.1:10001"} |
+----------------+---------------------+----------+--------------------+------------------------------------------------------------+
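Because `DETAILS` is JSON, rows from either table can be post-processed in a few lines. The sketch below assumes the rows arrive as tuples in the column order shown above and picks out the `WARNING` events. The function name is hypothetical, and the input rows are copied from the `MV_EVENTS` example.

```python
import json


def warning_events(rows):
    """Return (EVENT_TIME, EVENT_TYPE, parsed DETAILS) for each WARNING row."""
    result = []
    for origin_node_id, event_time, severity, event_type, details in rows:
        if severity == "WARNING":
            # DETAILS is stored as a JSON string, so decode it into a dict.
            result.append((event_time, event_type, json.loads(details)))
    return result


# Rows in (ORIGIN_NODE_ID, EVENT_TIME, SEVERITY, EVENT_TYPE, DETAILS) order.
rows = [
    (3, "2018-05-15 13:21:15", "WARNING", "NODE_DETACHED", '{"node":"127.0.0.1:10002"}'),
    (3, "2018-05-15 13:21:16", "NOTICE", "NODE_ATTACHING", '{"node":"127.0.0.1:10002"}'),
    (2, "2018-05-15 13:21:25", "WARNING", "NODE_OFFLINE", '{"node":"127.0.0.1:10001"}'),
    (1, "2018-05-15 13:23:57", "NOTICE", "SYSTEM_VAR_CHANGED",
     '{"variable": "leaf_failure_detection", "new_value": "off"}'),
]

for event_time, event_type, details in warning_events(rows):
    print(event_time, event_type, details.get("node"))
```

Using `details.get("node")` rather than indexing keeps the loop safe for event types whose details carry other keys, such as `SYSTEM_VAR_CHANGED`.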
Last modified: August 22, 2024