Detecting and Resolving Data Skew
On this page
An important property of a distributed database is that the data gets distributed more or less evenly.skewed
out of balance.
Skew
is a condition in which a table’s data is unevenly balanced among partitions in the cluster.
You can tell SingleStore to shard a table by a set of columns instead of the primary key.
CREATE TABLE urls (id BIGINT,domain_id BIGINT,path VARCHAR(8192),first_seen INT UNSIGNED NOT NULL,crawl count INT UNSIGNED NOT NULL,SHARD KEY (domain_id));
As you can see, we are sharding on domain_
.youtube.
will almost certainly have more rows than average, throwing off the balance of the cluster.youtube.
will not only have to store more data, it will probably be asked to do more work during SELECT
, UPDATE
, and DELETE
queries.
Measuring and Detecting Skew
Note
While bulk-loading data, recovering from a restart, or after deleting lots of data, it’s possible that one partition may temporarily show more memory use and memory skew than others, as memory allocation and garbage collection operations may not take effect immediately across partitions.
You can use the following methods to measure and detect data skew.
Method 1: Inspect memory settings
With any version of SingleStore, you can manually compare maximum_
and Alloc_
(memory used by tables) on each node:
SHOW VARIABLES LIKE "maximum_%";
+----------------------+-------+
| Variable_name | Value |
+----------------------+-------+
| maximum_memory | 48291 |
| maximum_table_memory | 43462 |
+----------------------+-------+
SHOW STATUS EXTENDED LIKE "Alloc_table_memory";
+--------------------+----------+
| Variable_name | Value |
+--------------------+----------+
| Alloc_table_memory | 43462 MB |
+--------------------+----------+
1 row in set (0.01 sec)
If memory consumption is significantly higher on specific nodes, skew is present.
Method 2: Find the number of rows and memory use per partition
The built-in INFORMATION_
database contains metadata about tables, columns, indexes, and partitions.TABLE_
table contains one row per table partition, and can be used to see if a given partition has a lot more rows than average:
SELECTDATABASE_NAME,TABLE_NAME,ORDINAL AS PARTITION_ID,ROWS,MEMORY_USEFROM INFORMATION_SCHEMA.TABLE_STATISTICSWHERE TABLE_NAME = 'domains';
+---------------+------------+--------------+-------+------------+
| DATABASE_NAME | TABLE_NAME | PARTITION_ID | ROWS | MEMORY_USE |
+---------------+------------+--------------+-------+------------+
| web | domains | 0 | 85916 | 11527936 |
| web | domains | 1 | 85925 | 11528512 |
| web | domains | 2 | 85427 | 11627712 |
| web | domains | 3 | 86185 | 11545152 |
| web | domains | 4 | 85592 | 11376128 |
| web | domains | 5 | 85549 | 11504448 |
...
Method 3: Calculate the memory skew for all tables, per database
Use the relative standard deviation (RSD) to calculate the memory skew across all partitions and tables in your database.skewness
.
SELECTDATABASE_NAME,TABLE_NAME,FLOOR(AVG(ROWS)) AS avg_rows,ROUND(STDDEV(ROWS)/AVG(ROWS),3) * 100 AS row_skew,FLOOR(AVG(MEMORY_USE)) AS avg_memory,ROUND(STDDEV(MEMORY_USE)/AVG(MEMORY_USE),3) * 100 AS memory_skewFROM INFORMATION_SCHEMA.TABLE_STATISTICSGROUP BY 1, 2HAVING SUM(ROWS) > 10000ORDER BY row_skew DESC;
+---------------+----------------------+----------+----------+--------------+-------------+
| DATABASE_NAME | TABLE_NAME | avg_rows | row_skew | avg_memory | memory_skew |
+---------------+----------------------+----------+----------+--------------+-------------+
| web | urls | 28881401 | 9.520 | 25733328291 | 3.200 |
| web | domains | 902817 | 0.100 | 111046491 | 0.200 |
...
A skew score of N means that there is about N% variation in the number of rows (or memory allocated) in a partition.hot
, then you should further investigate the distribution of the values in your chosen shard keys.
Method 4: Calculate the skew per partition for the columns in a table
Again, use the relative standard deviation (RSD) to calculate the memory skew across all partitions in a specific tables in your database.
If the skew factor is high, that indicates that some partitions have very high cardinality and some have very low cardinality and can lead to long single-threaded operations.
Considering the urls
table above, you can calculate the data skew per partition in the following example:
SELECT ROUND(STDDEV(id)/AVG(id),3)*100 AS group_skew,PARTITION_ID()FROM (SELECT id,domain_id,count(*)FROM urlsGROUP BY 1, 2) subGROUP BY PARTITION_ID();
Method 5: Show the data distribution for a table if you were to reshard on another column or set of columns
Run this query to check for data skew prior to making shard key changes:
SELECT with(leaf_pushdown=true)SUM(c) rows,PARTITION_ID()FROM (SELECT count(*) c FROM <table_name> GROUP BY <potential_shard_key(s)>) reshuffleGROUP BY PARTITION_ID();
Considering the urls
table above, you can show the potential distribution of data if the table is resharded on a different column or set of columns.
SELECT with(leaf_pushdown=true)SUM(c) rows,PARTITION_ID()FROM (SELECT count(*) c FROM urls GROUP BY id) reshuffleGROUP BY PARTITION_ID();
+---------+----------------+
| rows | PARTITION_ID() |
+------+-------------------+
| 2005 | 1 |
| 1999 | 3 |
| 2001 | 4 |
+---------+----------------+
This example shows if the urls
table is resharded on the id
column, there would be a more even distribution of data on each partition.
Resolving Data Skew
SingleStore distributes data based on the SHARD KEY specified in a table’s schema.
To export your data, see Export Data.DROP TABLE low_
, and then CREATE TABLE
with a higher cardinality SHARD KEY.
Example:
$ mysqldump -h 127.0.0.1 -u root my_db low_cardinality_shard_key_table --no-create-info > databackup.sql
DROP TABLE low_cardinality_shard_key_table;source updated_shard_key_schema.sql;
Last modified: July 29, 2024