Query Errors

ISSUE: Inconsistent Query Run Times

If you receive very different response times when you run the same query (or groups of queries on the same leaf) repeatedly, Transparent Huge Pages (THP) may be the cause.

Linux organizes RAM into pages that are ordinarily 4KB. With THP, Linux can instead use 2MB pages or larger. But THP transparently re-organizes memory used by a process inside the kernel, merging some small pages to huge pages, and splitting some huge pages to small pages. This background process holds locks on the memory manager, and can prevent your process from accessing memory.

You can check for THP by running cat /sys/kernel/mm/*transparent_hugepage/enabled. If the [brackets] are around [always], you should disable THP on all nodes in your cluster. Fully remedying the issue may require restarting SingleStore, or fully restarting all of the servers. Cluster operators should disable THP before launching in production.

Note: Having THP enabled may also be the cause of high system CPU (sometimes called red CPU).

How to Disable THP

SingleStore recommends disabling transparent huge pages (THP) at boot time on all nodes (master aggregator, child aggregators, and leaves) in the cluster. Not doing so may result in inconsistent query run times.

To disable THP, add the following lines to the end of rc.local before the exit line (if present), and reboot the host machines.

For Debian-based distributions, add the following to /etc/rc.d/rc.local:

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/defrag
echo no > /sys/kernel/mm/transparent_hugepage/khugepaged/defrag

For Red Hat/CentOS distributions, add the following to /etc/rc.local:

echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
echo 0 > /sys/kernel/mm/redhat_transparent_hugepage/khugepaged/defrag
echo no > /sys/kernel/mm/redhat_transparent_hugepage/khugepaged/defrag

Note

The khugepaged/defrag option will be 1 or 0 on newer Linux versions (e.g. CentOS 7+), but yes or no on older versions (e.g. CentOS 6). You can check which your system uses by running cat /sys/kernel/mm/*transparent_hugepage/khugepaged/defrag and keep only the matching setting. For example, if you see 1 or 0 update the line with echo 0. If you see yes or no, update the line with echo no.

ISSUE: Long Running Queries Blocking DDL Operations and Workload

SingleStore supports execution of online DDL statements, which means that you can perform read and write operations while the DDL command is being executed on a table. The execution of a DDL command does not begin until all the in-progress DML queries running on the table finish. As soon as the in-progress read and write operations complete and the DDL command begins executing, new reads and writes proceed as normal. This blocking period usually lasts on the order of milliseconds. If there are long running DML queries in your workload, it can extend the blocking period because the DDL must wait until those queries complete. The DDL waits and blocks until the default_distributed_ddl_timeout is reached (default value is 180000 milliseconds or 3 minutes), after which it throws a timeout error. If you frequently run DDL statements and have a lot of long-running queries, then it negatively impacts the workload since the DDL operation blocks other queries from starting while it waits for completion of long-running queries.

Increasing the timeout duration for a DDL operation is not a recommended option since it delays the start of the operation. While the DDL operation is delayed from starting, it will continue to block new queries from executing either. Therefore, raising the timeout can cause longer delays in regular queries in your workload.

Solution

  • Optimize your workload. If you have long running queries, then you can optimize your current workload by running frequent maintenance and backups, or by using a different workload. For example, run the analytical queries in your workload during the lowest traffic time. Additionally, you can use cluster replication to replicate the data to a secondary read-only cluster, and then run all long running or analytical queries on the secondary cluster. This drastically reduces the workload on the primary cluster and improves workload performance.

In this section

Last modified: November 22, 2022

Was this article helpful?