Detecting and Resolving Data Skew

Skew is a condition in which a table’s data is unevenly balanced among partitions in the cluster. Probability being what it is, small amounts of skew are inevitable and harmless. By default, SingleStore distributes data based on a hash of the primary key. Since the hash values are unique and evenly distributed, this keeps skew to a minimum.

You can tell SingleStore to shard a table by a set of columns instead of the primary key. For example, consider a table of URLs from web domains. It might be defined as follows:

SQL

CREATE TABLE urls (
    id BIGINT,
    domain_id BIGINT,
    path VARCHAR(8192),
    first_seen INT UNSIGNED NOT NULL,
    crawl count INT UNSIGNED NOT NULL,
    SHARD KEY (domain_id)
);

As you can see, we are sharding on domain_id. Queries made against this table will compile and run just fine. However, it’s probable that some domains will have many more times the number of URLs. For example, the partition that contains links for youtube.com will almost certainly have more rows than average, throwing off the balance of the cluster. That unlucky partition containing youtube.com will not only have to store more data, it will probably be asked to do more work during SELECT, UPDATE, and DELETE queries. For best performance, you should choose a shard key that minimizes data skew.

Measuring and Detecting Skew

Note

While bulk-loading data, recovering from a restart, or after deleting lots of data, it’s possible that one partition may temporarily show more memory use and memory skew than others, as memory allocation and garbage collection operations may not take effect immediately across partitions. This is normal, and corrects itself over time as the system reaches a new steady state.

You can use the following methods to measure and detect data skew.

Method 1: Inspect memory settings

With any version of SingleStore, you can manually compare maximum_table_memory and Alloc_table_memory (memory used by tables) on each node:

SQL

SHOW VARIABLES LIKE "maximum_%";

+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| maximum_memory       | 48291 |
| maximum_table_memory | 43462 |
+----------------------+-------+

SQL

SHOW STATUS EXTENDED LIKE "Alloc_table_memory";

+--------------------+----------+
| Variable_name      | Value    |
+--------------------+----------+
| Alloc_table_memory | 43462 MB |
+--------------------+----------+
1 row in set (0.01 sec)

If memory consumption is significantly higher on specific nodes, skew is present. If memory consumption is fairly uniform across the leaf nodes, no skew is present.

Method 2: Find the number of rows and memory use per partition

The built-in INFORMATION_SCHEMA database contains metadata about tables, columns, indexes, and partitions. In particular, the TABLE_STATISTICS table contains one row per table partition, and can be used to see if a given partition has a lot more rows than average:

SQL

SELECT
    DATABASE_NAME,
    TABLE_NAME,
    ORDINAL AS PARTITION_ID,
    ROWS,
    MEMORY_USE
FROM INFORMATION_SCHEMA.TABLE_STATISTICS
WHERE TABLE_NAME = 'domains';

+---------------+------------+--------------+-------+------------+
| DATABASE_NAME | TABLE_NAME | PARTITION_ID | ROWS  | MEMORY_USE |
+---------------+------------+--------------+-------+------------+
| web           | domains    | 0            | 85916 |   11527936 |
| web           | domains    | 1            | 85925 |   11528512 |
| web           | domains    | 2            | 85427 |   11627712 |
| web           | domains    | 3            | 86185 |   11545152 |
| web           | domains    | 4            | 85592 |   11376128 |
| web           | domains    | 5            | 85549 |   11504448 |
...

Method 3: Calculate the memory skew for all tables, per database

Use the relative standard deviation (RSD) to calculate the memory skew across all partitions and tables in your database. The RSD is a decent way to measure skewness.

SQL

SELECT
    DATABASE_NAME,
    TABLE_NAME,
    FLOOR(AVG(ROWS)) AS avg_rows,
    ROUND(STDDEV(ROWS)/AVG(ROWS),3) * 100 AS row_skew,
    FLOOR(AVG(MEMORY_USE)) AS avg_memory,
    ROUND(STDDEV(MEMORY_USE)/AVG(MEMORY_USE),3) * 100 AS memory_skew
FROM INFORMATION_SCHEMA.TABLE_STATISTICS
GROUP BY 1, 2
HAVING SUM(ROWS) > 10000
ORDER BY row_skew DESC;

+---------------+----------------------+----------+----------+--------------+-------------+
| DATABASE_NAME | TABLE_NAME           | avg_rows | row_skew | avg_memory   | memory_skew |
+---------------+----------------------+----------+----------+--------------+-------------+
| web           | urls                 | 28881401 |    9.520 |  25733328291 |       3.200 |
| web           | domains              |   902817 |    0.100 |    111046491 |       0.200 |
...

A skew score of N means that there is about N% variation in the number of rows (or memory allocated) in a partition. As a general rule of thumb, skew less than 10% should not greatly affect system stability or performance and is not a cause for worry. If you do see persistent skew and observe other issues as well, such as a server running particularly hot, then you should further investigate the distribution of the values in your chosen shard keys.

Method 4: Calculate the skew per partition for the columns in a table

Again, use the relative standard deviation (RSD) to calculate the memory skew across all partitions in a specific tables in your database.

If the skew factor is high, that indicates that some partitions have very high cardinality and some have very low cardinality and can lead to long single-threaded operations.

Considering the urls table above, you can calculate the data skew per partition in the following example:

SQL

SELECT ROUND(STDDEV(id)/AVG(id),3)*100 AS group_skew,
            PARTITION_ID()
FROM  (SELECT id,
       domain_id,
       count(*)             
       FROM urls      
       GROUP BY 1, 2) sub
GROUP BY PARTITION_ID();

Method 5: Show the data distribution for a table if you were to reshard on another column or set of columns

Run this query to check for data skew prior to making shard key changes:

SQL

SELECT with(leaf_pushdown=true) 
    SUM(c) rows,     
    PARTITION_ID() 
FROM (SELECT count(*) c FROM <table_name> GROUP BY <potential_shard_key(s)>) reshuffle 
GROUP BY PARTITION_ID();

Considering the urls table above, you can show the potential distribution of data if the table is resharded on a different column or set of columns. See the following example:

SQL

SELECT with(leaf_pushdown=true) 
    SUM(c) rows,     
    PARTITION_ID() 
FROM (SELECT count(*) c FROM urls GROUP BY id) reshuffle 
GROUP BY PARTITION_ID();

+---------+----------------+
| rows    | PARTITION_ID() |
+------+-------------------+
|    2005 |              1 |
|    1999 |              3 |
|    2001 |              4 |
+---------+----------------+

This example shows if the urls table is resharded on the id column, there would be a more even distribution of data on each partition.

Resolving Data Skew

SingleStore distributes data based on the SHARD KEY specified in a table’s schema. If no SHARD KEY is explicitly specified, the PRIMARY KEY is used as the default SHARD KEY. If the SHARD KEY shards on a low cardinality column, data will accumulate on certain nodes. To resolve this, you should export your data, modify your schema, and then reload your data.

To export your data, see Export Data. After you backup your data, you can DROP TABLE low_cardinality_shard_key_table, and then CREATE TABLE with a higher cardinality SHARD KEY. When you reload the data, SingleStore will automatically shard your data based on the new SHARD KEY.

Example:

Shell

$ mysqldump -h 127.0.0.1 -u root my_db low_cardinality_shard_key_table --no-create-info > databackup.sql

SQL

DROP TABLE low_cardinality_shard_key_table;

source updated_shard_key_schema.sql;