Troubleshooting Poorly Performing Queries

1. Identifying Expensive Queries

  1. In Studio, navigate to the Activity Resource Usage page.

  2. If you don’t have monitoring data from SingleStore’s monitoring solution, click Begin Profiling to capture current cluster activity over a window of time. Be sure to record while the workload or query of interest is running, if you have identified one from the report collected in the previous step or from another source.

  3. If you have identified a problematic database, switch to it using the database selector, which is located at the top left of the page.

  4. Sort on Elapsed Time and Run Count by clicking the field names. Hover over the Time Spent bar to see where each query is spending the most time.

  5. Sort on CPU, Memory, and Network in turn to identify other resource-intensive queries.

  6. Distinguish query shapes that may be resource-intensive due to having a high run count compared to others. Is a query using resources because of how many times it has been executed, or is it using a large number of resources across a low number of executions?

  7. Note the activity_name of queries that have high elapsed time, run count, and/or that are using a lot of a given resource.

  8. Click on the Nodes tab to identify which node these queries are running on. The Nodes tab can also help to confirm why a given node is running slowly.
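
If you prefer to work from a SQL client instead of Studio, the same activity data is exposed through the information_schema activity views. The following is a minimal sketch, assuming the mv_activities_cumulative view in your version exposes the run_count, elapsed_time_ms, and cpu_time_ms columns:

    -- List the query activities that have accumulated the most elapsed time.
    -- Column availability can vary slightly between engine versions.
    SELECT activity_name, database_name, run_count, elapsed_time_ms, cpu_time_ms
    FROM information_schema.mv_activities_cumulative
    WHERE activity_type = 'Query'
    ORDER BY elapsed_time_ms DESC
    LIMIT 10;

The activity_name values returned here are the same identifiers used in Step 2.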

Next Steps

  • If you found a query of interest that is long-running, utilizing a lot of resources, or has an unusual run count, move on to Step 2.

  • If you didn’t identify any useful information here: move on to Step 3.

2. Investigating Expensive Queries

  1. If a query with a high run count is using a lot of resources on average, it is important to identify why it is running at such high concurrency.

    • Is this normal for your application?

    • Can you change how often it's run?

  2. If you have identified a query with a low run count that uses a lot of resources, or a frequent query that is a normal occurrence, copy the query out of SingleStoreDB Studio. To get the full query text from what is displayed, query the information_schema table MV_QUERIES:

    SELECT * FROM information_schema.MV_QUERIES WHERE activity_name = '<identified_activity_name>';
  3. Check the PLAN_WARNINGS column for any hints about potential optimizations. If there are no hints, move onto the next step.

  4. If the query in question is an INSERT ... SELECT statement, copy only the SELECT clause of the query.

  5. Navigate to the Studio SQL Editor, and paste the query in.

  6. Run the query once in the editor by clicking the Run button in the top right-hand corner. After the first run is complete, highlight the query and click the Visual Explain icon to the left of the Run button to display the query plan in a graphical interface. Here, you will be given a choice between EXPLAIN and PROFILE, which display the query plan and further details for each query operation, respectively. SingleStore recommends running both of these options. (The same checks can also be run from a SQL client; see the sketch after this list.)

  7. Look to see if there is anything unusual about the query plan. You may find a query plan operation that takes a lot of time. If so, investigate that particular operator. You may also find that each component is similarly time-consuming. In either case, here are some questions to consider:

    • Are you reshuffling or repartitioning a lot of data?

      1. Common indicators of this include the Project operator combined with large SELECT costs, or Repartitioning being present in the query plan.

      2. If so, can you add or change a shard key?

    • Are you doing a full scan of a table?

      1. Look for the TableScan operator to indicate this information.

      2. If so, can you add/change an index or sort key?

    • Is the query performing slowly or using more resources because of a schema change?

    • Is the query performing slowly because the data working set increased?

    • Are autostats enabled for the tables in the query?

      1. Run SHOW CREATE TABLE on each table in the query.

      2. Look for AUTOSTATS_ENABLED=FALSE, AUTOSTATS_CARDINALITY_MODE=OFF, or AUTOSTATS_HISTOGRAM_MODE=OFF in the CREATE TABLE statement to determine this.

    • Do you need to invalidate and refresh your current query plan?

      1. Autostats doesn’t invalidate the plan automatically. To invalidate the plan, run ANALYZE on the table manually.

      2. Alternatively, to drop the current plan, run DROP … FROM PLANCACHE and re-run the query to see if it performs better.

  8. See Optimizing Table Data Structures for additional details, as well as Query Tuning for further guidance on how to optimize specific query shapes and tables.
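
If you prefer to run these checks from a SQL client rather than Visual Explain, the following is a minimal sketch. The table name orders and the filter column status are hypothetical stand-ins for a table referenced by your slow query; adjust them to your schema.

    -- Display the query plan, then execute the query and collect per-operator timings.
    EXPLAIN SELECT COUNT(*) FROM orders WHERE status = 'open';
    PROFILE SELECT COUNT(*) FROM orders WHERE status = 'open';
    SHOW PROFILE;  -- prints the detailed profile of the last profiled statement

    -- Confirm whether autostats are enabled for a table involved in the query.
    SHOW CREATE TABLE orders;

    -- Refresh statistics, then invalidate stale plans so the next run is re-optimized.
    ANALYZE TABLE orders;
    DROP ALL FROM PLANCACHE;  -- drops every cached plan; you can also drop a single plan by its plan ID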

Next Steps

  • If you didn’t identify any further useful information here: move on to Step 3.

3. Are any queries waiting on something?

Use the following steps to determine whether your queries are being blocked or waiting for another reason.

  1. Check the Active Queries page in Studio, or the PROCESSLIST.

    • On the Active Processes page in Studio, check whether there are many queued queries by sorting on the status column.

    • The default workload_management_max_queue_depth before queries fail is 100. For more information about this variable, see the List of Engine Variables. You can also check the current setting by running SHOW VARIABLES (see the sketch at the end of this step).

    • If you are using the database client, query MV_PROCESSLIST:

      SELECT * FROM information_schema.MV_PROCESSLIST WHERE state = 'queued';

  2. Are there queries lingering or in an unkillable state?

    • Check the PROCESSLIST or the Active Processes page as described above.

      1. Sort by time. If a query has been running far longer than is normal for your workload (which could be minutes, hours, or days), it is likely hung.

      2. Identify those queries and run KILL QUERY on them. This can be done in Studio by checking the boxes shown above, or from the client.

      3. You can also run KILLALL QUERIES to remove a large number of queries at once. Note: This will kill all queries, whether hanging or running.

      4. If you see queries with both a long runtime and a state of KILLED on a given node, there may be queries stuck in an unkillable state. Restart the affected node.

    • Are queries queued by the workload manager? Check this by querying MV_WORKLOAD_MANAGEMENT_STATUS and looking at Queued Queries (from global queue) and Queued Queries (from local queue) within the Stat and Value columns:

      SELECT * FROM information_schema.MV_WORKLOAD_MANAGEMENT_STATUS;

  3. Are queries blocked?

    • Check this by querying information_schema.MV_BLOCKED_QUERIES:

      SELECT * FROM information_schema.MV_BLOCKED_QUERIES;

    • Which query is blocked? Check the QUERY_TEXT column.

    • What is blocking it and where is it running? Check the BLOCKING_NODE_ID and BLOCKING_QUERY_TEXT columns.

    • Check the blocking reason by looking at the BLOCKING_TYPE column.

    • Kill the blocking query by using the ID column, which is the same as the ID in information_schema.PROCESSLIST, with KILL QUERY <id>.

  4. If your queries are failing:

    • Is the queue depth too large? Query MV_PROCESSLIST to check this:

      SELECT COUNT(*) FROM information_schema.MV_PROCESSLIST WHERE state = 'queued';

    • Are these queries getting killed by the resource governor? Check the status of resource pools by querying RESOURCE_POOLS:

      SELECT * FROM information_schema.RESOURCE_POOLS;

    • Examine the failure errors from failed queries.
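
If you are working from a SQL client, the checks in this step can be combined into a few statements. A minimal sketch; replace <id> with the ID of the query you want to kill:

    -- Count queued statements and compare against the configured queue depth limit.
    SELECT COUNT(*) FROM information_schema.MV_PROCESSLIST WHERE state = 'queued';
    SHOW VARIABLES LIKE 'workload_management_max_queue_depth';

    -- Find blocked queries, what is blocking them, and where the blocker is running.
    SELECT id, blocking_node_id, blocking_type, query_text, blocking_query_text
    FROM information_schema.MV_BLOCKED_QUERIES;

    -- Kill a specific blocking or hung query by its processlist ID.
    KILL QUERY <id>;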

Next Steps

  • My queries are queued and waiting on resources: move on to Step 4.

  • There are no queued, blocked, or hung queries: move on to Step 4.

4. Are key resources being overused?

The next step in troubleshooting your cluster is looking at key resources to see if they are being overused.

  1. For a quick sanity check, navigate to the Studio Dashboard and hover over each resource (such as disk usage on the main dashboard) to compare its usage against the total available.

  2. Check the Studio Hosts and/or Nodes pages to look at CPU and memory usage.

  3. Sort on CPU Usage, Memory Usage, and Disk Usage. Are there resources being overused?

  4. If you see one or more resources approaching their limit (e.g., memory, disk or CPU limits), consider the following:

    • Have you added a new workload to your cluster that’s using more resources?

    • Have you loaded more data (on disk, or in memory)?

    • Have you increased concurrency?

    • Does your workload require more nodes to meet the scale?

    • Is there an expensive workload or query that is causing a spike and could be optimized?
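
You can also cross-check Studio's numbers from a SQL client. The following is a rough sketch using SHOW STATUS EXTENDED; the exact set of memory counters returned (for example, Total_server_memory or Alloc_table_memory) depends on your engine version:

    -- List memory-related status counters on the node you are connected to.
    SHOW STATUS EXTENDED LIKE '%memory%';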

Next Steps

  • If you identify one or more key resources being overused: return to Step 1, to identify whether there are specific queries causing this.

  • If the cause is not specific queries, or there are no resource usage issues: move on to Step 5.

5. Are there other ongoing operational activities?

The next step in troubleshooting your cluster is checking on other ongoing operational activities that may be affecting your cluster performance.

  1. Navigate to the Active Processes page in Studio from the master aggregator, or query MV_PROCESSLIST:

    SELECT * FROM information_schema.mv_processlist;
  2. Search (Ctrl+F) the Active Processes page, or look through the results of your MV_PROCESSLIST query, for any of the following: ALTER, CREATE, DROP, ADD|REMOVE LEAF/AGGREGATOR, BACKUP/RESTORE, REBALANCE, RESTORE REDUNDANCY. (A sample filter query appears after this list.)

  3. In either the Active Processes page or MV_PROCESSLIST, check the status column to see whether any of these activities are long-running or waiting on other operations.

    • If an operation is waiting on a metadata lock, the query is likely waiting on a cluster-wide operation such as a backup.
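
To narrow the processlist down to these operational statements from a SQL client, a filter along the following lines can help. This is a sketch; it assumes the INFO column holds the statement text, as in the processlist query above:

    -- Look for long-running DDL, backup, restore, and rebalance operations.
    SELECT id, db, state, time, info
    FROM information_schema.mv_processlist
    WHERE info LIKE 'ALTER%'
       OR info LIKE 'BACKUP%'
       OR info LIKE 'RESTORE%'
       OR info LIKE 'REBALANCE%'
    ORDER BY time DESC;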

Next Steps

  • If no other ongoing operational activities were identified: move on to Step 6.

6. Checking Node, Partition, and Overall Database Health

The next step in troubleshooting your cluster is checking on its overall health.

  1. Go to the Studio Hosts and/or Nodes page(s) to ensure nodes are online.

    • Run show-leaves to get the ID of any offline node:

      sdb-admin show-leaves

    • Then, bring the node(s) online with start-node:

      sdb-admin start-node <node-id>

  2. Go to the Databases page, and examine the problematic database.

    • Click on the database and check the Partitions tab and make sure everything is online and attached.

  3. If there are offline or detached partitions, investigate the health of the nodes housing those partitions in Studio.
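
The same health information is available from a SQL client. A minimal sketch; replace <database_name> with the problematic database:

    -- Check database state (online, recovering, and so on) across the cluster.
    SHOW DATABASES EXTENDED;

    -- Confirm that all leaves are online.
    SHOW LEAVES;

    -- List the database's partitions along with their host, port, and role.
    SHOW PARTITIONS ON <database_name>;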

Next Steps

  • If all nodes, partitions, and databases are healthy: move on to Step 7.

7. Checking Partition Number and Data Skew

This step checks whether there are enough partitions and if data is properly balanced across your cluster.

  1. Check the total core to partition ratio per leaf for the database that is having trouble:

    • Run SHOW PARTITIONS on the database in question and compare that with the total number of cores in your cluster.

    • Is this ratio at least 1:1? If you have high concurrency, consider increasing it from 1:1 to a 2-, 3-, or 4-to-1 core-to-partition ratio.

  2. Check that your data is balanced across the cluster properly by running EXPLAIN REBALANCE PARTITIONS on the database in question. If you see a non-empty set, run REBALANCE PARTITIONS on the database.

  3. Check that your data is redundant by running EXPLAIN RESTORE REDUNDANCY. If you see a non-empty set, run RESTORE REDUNDANCY.

  4. Data skew: see Detecting and Resolving Data Skew, and the skew-check sketch after this list.
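
From a SQL client, the checks in this step look roughly like the following. This is a sketch; it assumes information_schema.TABLE_STATISTICS exposes per-partition row counts (ROWS) in your version, and <database_name> is the database in question:

    -- Count partitions, then compare with the number of cores per leaf.
    SHOW PARTITIONS ON <database_name>;

    -- Rebalance or restore redundancy only if the EXPLAIN variants return rows.
    EXPLAIN REBALANCE PARTITIONS ON <database_name>;
    EXPLAIN RESTORE REDUNDANCY ON <database_name>;

    -- Quick skew check: a high row_skew value means rows are unevenly distributed.
    SELECT database_name, table_name,
           FLOOR(AVG(rows)) AS avg_rows_per_partition,
           ROUND(STDDEV(rows) / AVG(rows), 3) AS row_skew
    FROM information_schema.table_statistics
    WHERE database_name = '<database_name>'
    GROUP BY database_name, table_name
    ORDER BY row_skew DESC;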

Next Steps

  • If there are enough partitions, and no data skew is detected: move on to Step 8.

8. Checking for System Changes and Log Hints

The last step in this guide is considering any system changes and/or log hints.

  1. Were there any operating system and/or application changes?

  2. Check the Studio EVENTS page and memsql.log for any information about recent log events at a high level. Sort by event_level to see those that are critical.

    • For more verbose logs, obtain the tracelog from the master aggregator data directory (the default for this is /var/lib/memsql/node-nodeHash/data/tracelogs).

    • Find the memsql.log file and run tail -f or cat and grep for error and/or warning.
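
The high-level events shown on the Studio EVENTS page can also be read from a SQL client; a sketch, assuming your version exposes the information_schema.MV_EVENTS view:

    -- Most recent cluster events first; column names such as event_time may vary by version.
    SELECT * FROM information_schema.mv_events
    ORDER BY event_time DESC
    LIMIT 50;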

Final Recommendations

If nothing unusual arose in the investigation across all of this guide, but your cluster is still having performance issues, consider the following recommendations:

  1. Did you properly capture the workload of interest while following the steps of this guide? It is possible that an expensive query, queuing, or a resource problem was missed because the data wasn’t examined at the right time. Make sure the workload of interest is running while you investigate.

  2. Does your system have a recommended configuration? Check the sdb-report output from Step 1 again and consider any configuration optimization suggestions in the check output.

  3. Is the system’s hardware healthy overall? Check disk and network performance. It is possible that a hardware issue might be leading to performance bottlenecks.

  4. Check out the Troubleshooting Reference to see if any of the reference topics included there are applicable to your cluster.

Additional Things to Consider

SingleStore Queries Typically Run Faster the Second Time They Are Executed

Traditional relational database management systems interpret SQL queries the same way interpreters for languages like Python and Ruby run programs. SingleStore instead compiles queries: the first time the server encounters a given query shape, it optimizes and compiles the query for future invocations. This incurs overhead that does not depend on the amount of data to be processed, but rather on the complexity of the query.

The process of code generation involves extracting parameters from the query and then transforming the normalized query into a SingleStore-specific intermediate representation tailored to the system. Subsequent requests with the same shape can reuse this plan to complete both quickly and consistently.
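
For example, running two queries with the same shape back to back makes this visible; only the first pays the code-generation cost. A sketch using the PRODUCTS table defined in the next section:

-- First run of this query shape: the plan is optimized and compiled before executing.
SELECT COUNT(*) FROM PRODUCTS WHERE category = 'apparel';

-- A second query with the same shape (only the parameter value differs) reuses the
-- compiled plan, so it typically completes faster and more consistently.
SELECT COUNT(*) FROM PRODUCTS WHERE category = 'books';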

Balancing Expensive Single Partition Queries Against Data Skew

In general, queries that operate on a single partition are great for overall performance because only a portion of the cluster's resources is being utilized. This leaves the resources dedicated to the other partitions free to run other queries. It also cuts out some of the overhead on the aggregator for coordinating a cluster-wide, distributed query gathering and merging data from all partitions.

However, it is important to select a shard key that promotes lightweight single partition queries. For example, if you have a PRODUCTS table for a university bookstore:

CREATE TABLE IF NOT EXISTS PRODUCTS
(prod_id int,
prod_name varchar(20),
category varchar(10),
color varchar(10)
);

The shard key could be set on the category column if the primary workload against the table is querying inventory and filtering on category. If this fictional university bookstore sells mostly t-shirts and sweatshirts (category = 'apparel'), the most queried data would be located in one partition. The data skew introduced by selecting category as the shard key therefore creates an expensive single-partition workload.

Query processing is distributed at one CPU core per partition. In our example, we have data skew on one partition and a single-partition query frequently runs against it, so the query will run single-threaded. This increases latency, or the query may be queued. Meanwhile, the other resources in the cluster sit idle, negating both the performance gains you were aiming for and the main benefit of using a distributed database.

In this example, prod_id would be the best choice for the shard key. Alternatively, you could create and shard on an auto-increment id column, since it is best practice not to shard on a varchar() column. Or, the table could be created as a columnstore with category as the clustered columnstore key, so that filtering on category benefits from segment elimination. Since the 'apparel' values will be stored in the same location on disk, the query will perform better.
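
For example, the two alternatives described above could look like the following. This is a sketch; the clustered columnstore key syntax shown here may instead be expressed as a SORT KEY in newer engine versions:

-- Alternative 1: shard on prod_id so rows are spread evenly across partitions.
CREATE TABLE IF NOT EXISTS PRODUCTS
(prod_id int,
prod_name varchar(20),
category varchar(10),
color varchar(10),
SHARD KEY (prod_id)
);

-- Alternative 2: a columnstore table with category as the clustered columnstore key,
-- so that filters on category benefit from segment elimination.
CREATE TABLE IF NOT EXISTS PRODUCTS_COLUMNSTORE
(prod_id int,
prod_name varchar(20),
category varchar(10),
color varchar(10),
SHARD KEY (prod_id),
KEY (category) USING CLUSTERED COLUMNSTORE
);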
