Troubleshooting Poorly Performing Queries
1. Identifying Expensive Queries
- In Studio, navigate to the Workload Monitoring page.
- If you do not have monitoring data from SingleStore's monitoring solution, select Begin Profiling to capture current cluster activity for a period of time. Be sure to record during the workload or query of interest, if one has been identified while collecting data in the previous step.
- If a suspect database has been identified, switch the database selector to that database.
- Sort on Elapsed Time and Run Count by selecting the field name. Hover over the Time Spent bar and review which queries are consuming the most time.
- Sort on CPU, Memory, and Network to identify other resource-intensive queries.
- Distinguish query shapes that may be resource-intensive due to a high run count compared to others. For example, is a query using resources because of how many times it has been run, or is it using a large amount of resources across a low number of runs?
- Note the activity_name of queries that have a high elapsed time, a high run count, and/or that are using a large amount of a given resource.
- Select the Nodes tab to identify which node these queries are running on. The Nodes tab can also help confirm why a specific node is running slowly.
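If you prefer a client over Studio, the same ranking can be sketched in SQL. This is a hedged example: the view name mv_activities_cumulative and its elapsed_time_ms, run_count, and activity_type columns are assumptions about your SingleStore version, so verify them against the information_schema reference before relying on the output.

```sql
-- Rank query shapes by total elapsed time and run count (column names assumed).
SELECT activity_name,
       SUM(elapsed_time_ms) AS total_elapsed_ms,
       SUM(run_count)       AS total_runs
FROM information_schema.mv_activities_cumulative
WHERE activity_type = 'Query'
GROUP BY activity_name
ORDER BY total_elapsed_ms DESC
LIMIT 10;
```

The activity_name values returned here are what the next section uses to look up full query text in MV_QUERIES.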
Next Steps
2. Investigating Expensive Queries
- If a query has a high run count and is using a large amount of resources (on average), it is important to identify why the query is running at such high concurrency.
  - Is this normal for your application?
  - Can you change how often it is run?
- If you have identified a query with a low run count that uses a large amount of resources, or a frequent query that is a normal occurrence, copy the query from SingleStore Studio. To get the query text from what is displayed, query the information_schema table MV_QUERIES:
  SELECT * FROM information_schema.MV_QUERIES WHERE activity_name = '<identified_activity_name>';
- Check the PLAN_WARNINGS column for any hints about potential optimizations. If there are no hints, continue to the next step.
- If the query in question is an INSERT ... SELECT statement, copy only the SELECT clause of the query.
- Navigate to the SQL Editor in Studio for your database, and paste in the query.
- Run the query in the SQL Editor by selecting the Run button in the top right-hand corner of the page. After the first run is complete, highlight the query and select the Visual Explain icon to the left of the Run button to display the query plan in a graphical interface. Here, you will be given a choice between EXPLAIN and PROFILE, which display the query plan and further details for each query operation, respectively. SingleStore recommends running both of these options.
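The same EXPLAIN/PROFILE workflow can be run as plain statements in the SQL Editor or a client; the table name `orders` below is a hypothetical stand-in for your own query:

```sql
EXPLAIN SELECT COUNT(*) FROM orders;   -- show the query plan without executing
PROFILE SELECT COUNT(*) FROM orders;   -- execute and collect per-operator statistics
SHOW PROFILE;                          -- display details from the last profiled run
```

Comparing the EXPLAIN estimates against the PROFILE actuals is often the fastest way to spot a misestimated operator.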
- Check if there is anything unusual about the query plan. You may find a query plan operation that takes a large amount of time. If so, investigate that particular operator. You may also find that each component is similarly time-consuming. In either case, here are some questions to consider:
  - Are you reshuffling or repartitioning a large amount of data?
    - Common indicators of this include the Project operator combined with large SELECT costs, or Repartitioning being present in the query plan.
    - If so, can you add or change a shard key?
  - Are you doing a full scan of a table?
    - Look for the TableScan operator to indicate this information.
    - If so, can you add or change an index or sort key?
  - Is the query performing slowly or using more resources because of a schema change?
  - Is the query performing slowly because the data working set increased?
  - Are autostats enabled for the tables in the query?
    - Look for AUTOSTATS_ENABLED=FALSE, AUTOSTATS_CARDINALITY_MODE=OFF, or AUTOSTATS_HISTOGRAM_MODE=OFF in the CREATE TABLE statement to determine this.
  - Do you need to invalidate and refresh your current query plan?
    - Autostats do not invalidate the plan automatically. To invalidate the plan, run ANALYZE on the table manually.
    - Alternatively, to drop the current plan, run DROP … FROM PLANCACHE and re-run the query to see if it performs better.
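The manual statistics refresh mentioned above looks like the following; the table name `orders` is hypothetical:

```sql
-- Refresh statistics for a hypothetical table `orders`. Per the guidance
-- above, this also invalidates cached plans that depend on stale statistics.
ANALYZE TABLE orders;
```

After re-running the query, compare its new profile with the earlier run to confirm the refreshed plan actually helps.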
Refer to Optimizing Table Data Structures for additional details, as well as Query Tuning for further guidance on how to optimize specific query shapes and tables.
Next Steps
- If you did not identify any further useful information here: continue to Step 3.
3. Are any queries waiting on something?
Use the following steps to determine whether your queries are being blocked or waiting for another reason.
- Check the Active Queries page in Studio, or the PROCESSLIST.
  - On the Active Queries page in Studio, check whether there are many queued queries by sorting on the status column.
  - The default workload_management_max_queue_depth before queries fail is 100. Refer to the List of Engine Variables for more information on this variable. You can also check the current setting by running SHOW VARIABLES.
  - If you are using the database client, query MV_PROCESSLIST:
    SELECT * FROM information_schema.MV_PROCESSLIST WHERE state = 'queued';
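To check the queue-depth setting mentioned above directly from a client, a LIKE filter narrows SHOW VARIABLES to the single variable of interest:

```sql
-- Show only the workload manager's maximum queue depth (default: 100).
SHOW VARIABLES LIKE 'workload_management_max_queue_depth';
```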
- Are there queries lingering or in an unkillable state?
  - Check the PROCESSLIST or the Active Queries page as described above.
  - Sort by time. If there is a long runtime (relative to a normal run time for your queries, which could be minutes, hours, or days), the query is likely hung.
  - Identify those queries and run KILL QUERY on them. This can be done in Studio by checking the boxes shown above, or from the client.
  - You can also run KILLALL QUERIES to remove a large number of queries at once. Note: This will kill all queries, whether hanging or running.
  - If you see queries with both a long runtime and a state of KILLED on a given node, there may be queries stuck in an unkillable state. Restart the affected node.
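From the client, a hung query is killed by its connection ID from the PROCESSLIST; the ID below is hypothetical:

```sql
-- 12345 is a hypothetical value taken from the ID column of PROCESSLIST.
KILL QUERY 12345;
```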
- Are queries queued by the workload manager? Check this by querying MV_WORKLOAD_MANAGEMENT_STATUS and looking at Queued Queries (from global queue) and Queued Queries (from local queue) within the Stat and Value columns.
  SELECT * FROM information_schema.MV_WORKLOAD_MANAGEMENT_STATUS;
- Are queries blocked?
  - Check this by querying information_schema.MV_BLOCKED_QUERIES:
    SELECT * FROM information_schema.MV_BLOCKED_QUERIES;
  - Which query is blocked? Check the QUERY_TEXT column.
  - What is blocking it and where is it running? Check the BLOCKING_NODE_ID and BLOCKING_QUERY_TEXT columns.
  - Check the blocking reason by looking at the BLOCKING_TYPE column.
  - Kill the blocking query by using the ID column, which is the same as the ID in information_schema.PROCESSLIST, with KILL QUERY <id>.
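Selecting just the columns called out above gives a compact view of each blocked query and its blocker:

```sql
-- Blocked query, its blocker, where the blocker runs, and the block reason.
SELECT QUERY_TEXT, BLOCKING_QUERY_TEXT, BLOCKING_NODE_ID, BLOCKING_TYPE
FROM information_schema.MV_BLOCKED_QUERIES;
```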
- If your queries are failing:
  - Is the queue depth too large? Query MV_PROCESSLIST to check this:
    SELECT COUNT(*) FROM information_schema.MV_PROCESSLIST WHERE state = 'queued';
  - Are these queries getting killed by the resource governor? Check the status of resource pools by querying RESOURCE_POOLS:
    SELECT * FROM information_schema.RESOURCE_POOLS;
  - Examine the failure errors from failed queries.
Next Steps
4. Are key resources being overused?
The next step in troubleshooting your cluster is looking at key resources to see if they are being overused.
- For a quick sanity check, navigate to the Studio Dashboard and review resource usage against the respective totals available by hovering over each resource.
- Check the Studio Hosts and/or Nodes pages to look at CPU and memory usage.
  - Sort on CPU Usage, Memory Usage, and Disk Usage. Are any resources being overused?
  - If you see one or more resources approaching their limit (e.g., memory, disk, or CPU limits), consider the following:
    - Have you added a new workload to your cluster that is using more resources?
    - Have you loaded more data (on disk, or in memory)?
    - Have you increased concurrency?
    - Does your workload require more nodes to meet the scale?
    - Is there an expensive workload or query that is causing a spike and could be optimized?
Next Steps
5. Are there other ongoing operational activities?
The next step in troubleshooting your cluster is checking other ongoing operational activities that may be affecting your cluster performance.
- Navigate to the Active Queries page in Studio from the master aggregator, or query MV_PROCESSLIST:
  SELECT * FROM information_schema.mv_processlist;
- Search (Ctrl-F) in the Active Queries page, or look in the results of your query to MV_PROCESSLIST, for any of the following: ALTER, CREATE, DROP, ADD|REMOVE LEAF/AGGREGATOR, BACKUP/RESTORE, REBALANCE, RESTORE REDUNDANCY.
- For either the Active Queries page or MV_PROCESSLIST, check the status column to see if any of these activities are long running, or waiting on other operations.
  - If an operation is waiting on metadata lock, the query is likely waiting on a cluster-wide operation such as a backup.
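A hedged way to surface those operational statements from a client is to filter the processlist on the statement text. The INFO column name is an assumption carried over from MySQL-style PROCESSLIST views, so verify it against your version's schema:

```sql
-- Look for operational DDL/cluster statements currently in flight.
-- Assumes INFO holds the statement text (verify on your version).
SELECT * FROM information_schema.mv_processlist
WHERE INFO LIKE 'ALTER%' OR INFO LIKE 'BACKUP%'
   OR INFO LIKE 'REBALANCE%' OR INFO LIKE 'RESTORE%';
```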
Next Steps
- If no other ongoing operational activities were identified: continue to Step 6.
6. Checking Node, Partition, and Overall Database Health
The next step in troubleshooting your cluster is checking its overall health.
- Go to the Studio Hosts and/or Nodes page(s) to ensure nodes are online.
  - Run show-leaves to get the ID of any offline node:
    sdb-admin show-leaves
  - Then, bring the node(s) online with start-node:
    sdb-admin start-node <node-id>
- Go to the Databases page, and examine the suspect database.
  - Click on the database, check the Partitions tab, and make sure everything is online and attached.
- If there are offline or detached partitions, investigate the health of the nodes housing those partitions in Studio.
Next Steps
- If all nodes, partitions, and databases are healthy: continue to Step 7.
7. Checking Partition Number and Data Skew
This step checks whether there are enough partitions and if data is properly balanced across your cluster.
- Check the total core-to-partition ratio per leaf for the database that is having trouble:
  - Run SHOW PARTITIONS on the database in question and compare the partition count with the total number of cores in your cluster.
  - Is this ratio at least 1:1? If you have high concurrency, consider a 2:1, 3:1, or 4:1 core-to-partition ratio.
- Check that your data is balanced across the cluster properly by running EXPLAIN REBALANCE PARTITIONS on the database in question. If you see a non-empty set, run REBALANCE PARTITIONS on the database.
- Check that your data is redundant by running EXPLAIN RESTORE REDUNDANCY. If you see a non-empty set, run RESTORE REDUNDANCY.
- For data skew, refer to Detecting and Resolving Data Skew.
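The checks above map to the following statements; `mydb` is a hypothetical database name:

```sql
SHOW PARTITIONS ON mydb;                 -- compare partition count to core count
EXPLAIN REBALANCE PARTITIONS ON mydb;    -- non-empty result: rebalance needed
REBALANCE PARTITIONS ON mydb;            -- run only if the EXPLAIN returned rows
```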
Next Steps
- If there are enough partitions and no data skew is detected: continue to Step 8.
8. Checking for System Changes and Log Hints
The last step in this guide is considering any system changes and/or log hints.
- Were there any operating system and/or application changes?
- Check the Studio EVENTS page and memsql.log for any information about recent log events at a high level. Sort by event_level to see those that are critical.
- For more verbose logs, obtain the tracelog from the master aggregator data directory (the default for this is /var/lib/memsql/node-nodeHash/data/tracelogs).
- Find the memsql.log file and run tail -f, or cat and grep for error and/or warning.
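The last step above might look like the following in a shell. The node-nodeHash directory name varies per deployment, so the exact path is an assumption about your layout:

```shell
# Follow the tracelog live (substitute your node's hash in the path):
tail -f /var/lib/memsql/node-nodeHash/data/tracelogs/memsql.log

# Or scan the whole log for errors and warnings, case-insensitively:
grep -iE 'error|warning' /var/lib/memsql/node-nodeHash/data/tracelogs/memsql.log
```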
Final Recommendations
If nothing unusual arose in the investigation across all of this guide, but your cluster is still having performance issues, consider the following recommendations:
- Did you properly capture the workload of interest while following the steps of this guide? It is possible an expensive query, queuing, or a resource problem was missed because the data was not examined at the right time. Make sure that the workload of interest is running during the time of investigation.
- Does your system have a recommended configuration? Check the sdb-report output again and consider any configuration optimization suggestions in the check output.
- Is the system's hardware healthy overall? Check disk and network performance. It is possible that a hardware issue is leading to performance bottlenecks.
- Refer to the Troubleshooting Reference to see if any of the reference topics included there are applicable to your cluster.
Additional Things to Consider
SingleStore Queries Typically Run Faster the Second Time They Are Executed
Traditional relational database management systems interpret SQL queries the same way interpreters for languages like Python and Ruby run programs. SingleStore instead compiles queries the first time a given query shape is seen, which is why subsequent runs are typically faster. The process of code generation involves extracting parameters from the query and then transforming the normalized query into a SingleStore-specific intermediate representation tailored to the system.
Balancing Expensive Single Partition Queries Against Data Skew
In general, queries that operate on a single partition are great for overall performance because only a portion of the cluster's resources is being utilized. However, it is important to select a shard key that promotes lightweight single-partition queries. Consider the following PRODUCTS table for a university bookstore:

CREATE TABLE IF NOT EXISTS PRODUCTS (
  prod_id int,
  prod_name varchar(20),
  category varchar(10),
  color varchar(10)
);

The shard key could be set on the category column if the primary workload against the table is querying inventory and filtering on category. However, because all rows with the same category value (e.g., category = 'apparel') land on the same partition, the most queried data would be located in one partition. Since query processing is distributed at 1 CPU core per partition, using category as the shard key causes an expensive single-partition workload. In this example, prod_id would be the best choice as the shard key; it is also preferable to a varchar() column such as prod_name, since it is best practice to NOT shard on a varchar() data type. Instead, category can be set as the clustered columnstore key, so filtering on the category will benefit from segment elimination.
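A sketch of the recommended layout, assuming current SingleStore DDL where SORT KEY defines the columnstore sort order and SHARD KEY the distribution column:

```sql
-- prod_id spreads rows evenly across partitions; sorting on category lets
-- filters such as category = 'apparel' benefit from segment elimination.
CREATE TABLE IF NOT EXISTS PRODUCTS (
  prod_id int,
  prod_name varchar(20),
  category varchar(10),
  color varchar(10),
  SORT KEY (category),
  SHARD KEY (prod_id)
);
```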
Last modified: September 10, 2024