Statistics and Sampling Tasks

Disabling and Enabling Automatic Statistics

You can disable or enable autostats on each table. If you disable autostats on a table, statistics must be collected with the ANALYZE command. If autostats is re-enabled, automatic statistics collection will resume (see the next section for details). We recommend leaving autostats enabled in most situations.

You can disable autostats on an existing table by running:

ALTER TABLE table_name DISABLE AUTOSTATS;

or

ALTER TABLE table_name AUTOSTATS_ENABLED = OFF;

and enable autostats by running

ALTER TABLE table_name ENABLE AUTOSTATS;

or

ALTER TABLE table_name AUTOSTATS_ENABLED = ON;

You can choose whether autostats are enabled or disabled when creating a table by adding the option:

AUTOSTATS_ENABLED=<TRUE|FALSE>

at the end of the CREATE TABLE statement, such as:

CREATE TABLE t (a int, b int, key (a) using clustered columnstore) AUTOSTATS_ENABLED=FALSE;

When AUTOSTATS_ENABLED is set to TRUE, autostats categories are set to the following default settings

  • For rowstore tables,

    AUTOSTATS_CARDINALITY_MODE = PERIODIC
    AUTOSTATS_HISTOGRAM_MODE = CREATE
  • For columnstore tables,

    AUTOSTATS_CARDINALITY_MODE = INCREMENTAL
    AUTOSTATS_HISTOGRAM_MODE = CREATE
    AUTOSTATS_SAMPLING = ON

SingleStore allows you to independently control how each category of statistics is automatically gathered.

AUTOSTATS_CARDINALITY_MODE = [ INCREMENTAL | PERIODIC | OFF ]
AUTOSTATS_HISTOGRAM_MODE = [ CREATE | UPDATE | OFF ]
AUTOSTATS_SAMPLING = [ ON | OFF ]

These autostats settings can also be specified or modified through the ALTER TABLE and CREATE TABLE commands. Multiple options can be combined in a single CREATE TABLE statement. For example,

ALTER TABLE table_name AUTOSTATS_CARDINALITY_MODE = PERIODIC;
CREATE TABLE t1 (a INT, b INT, KEY(a) USING CLUSTERED COLUMNSTORE)
AUTOSTATS_CARDINALITY_MODE = INCREMENTAL,
AUTOSTATS_HISTOGRAM_MODE = CREATE,
AUTOSTATS_SAMPLING= ON;

Warning

Each type of autostats setting may not work for all types of tables. For example, you cannot have incremental autostats on a rowstore table.

Additionally, you can change whether autostats is enabled or disabled by default on future CREATE TABLE statements by setting the global variable default_autostats_enabled to TRUE or FALSE on the master aggregator. The default is TRUE.

You can also modify the default settings for autostats, individually, by setting the following global variables:

default_autostats_columnstore_cardinality_mode = [ INCREMENTAL | PERIODIC | OFF ]
default_autostats_rowstore_cardinality_mode = [ PERIODIC | OFF ]
default_autostats_histogram_mode = [ CREATE | UPDATE | OFF ]
default_autostats_columnstore_sampling = [ ON | OFF ]

Notice

Limitations associated with using autostats: * Enabling AUTOSTATS_HISTOGRAM_MODE (either mode besides OFF) requires enabling AUTOSTATS_CARDINALITY_MODE (either mode besides OFF). * INCREMENTAL cardinality mode is only supported on columnstore tables. * AUTOSTATS_SAMPLING is not allowed for rowstore tables. On columnstore tables, it controls whether the columnstore sample is built and used for estimation.

You can see whether autostats is enabled or disabled on a table by running SHOW CREATE TABLE.

Autostats Support upon Upgrading SingleStore

Any table that is upgraded to a newer version of SingleStore retains its autostats settings. If the previous SingleStore version did not have autostats support, then the upgraded tables will have autostats disabled upon upgrade. Also, if a table had autostats enabled prior to upgrade, the same autostats will be enabled on the table post upgrade. To enable autostats on existing columnstore tables created prior to the upgrade, which we recommend in most cases, run ALTER TABLE <table_name> ENABLE AUTOSTATS on each table. As discussed in Disabling and Enabling Autostats, this will begin the process of collecting automatic statistics in the background.

Below is a simple script you can use to enable autostats on all tables (written in bash to minimize dependencies).

Save this script in a file called enable-autostats.sh:

#!/bin/bash

# Enables autostats on all columnstore tables which currently have autostats disabled
#
# Run this bash script with the mysql client arguments you use to connect to the SingleStore DB master aggregator, like:
# bash enable-autostats.sh -h<host> -P<port> -u<user> -p<password>
#
# Use the --dry-run option to only print the list of tables which would be altered, without actually running the ALTER commands.

set -e

client_args=(mysql --batch --skip-column-names)
dry_run=false

while [[ $# -gt 0 ]]
do
    case "$1" in
        --dry-run)
            dry_run=true
            ;;
        *)
            client_args+=("$1")
            ;;
    esac
    shift
done

# List of table names (fully qualified and quoted)
tables=$(${client_args[@]} -e "select concat(table_schema, '.\`', replace(table_name, '\`', '\`\`'), '\`') from information_schema.tables where table_type = 'BASE TABLE'")

echo "$tables" | while read -r table
do
    show_create_table=$(${client_args[@]} -e "show create table $table")
    alter_stmt=""
    if echo "$show_create_table" | grep -q "AUTOSTATS_CARDINALITY_MODE=OFF"
    then
        if echo "$show_create_table" | grep -q "COLUMNSTORE"
        then
            alter_stmt="${alter_stmt}AUTOSTATS_CARDINALITY_MODE=INCREMENTAL"
        else
            alter_stmt="${alter_stmt}AUTOSTATS_CARDINALITY_MODE=PERIODIC"
        fi
    fi
    if echo "$show_create_table" | grep -q "AUTOSTATS_HISTOGRAM_MODE=OFF"
    then
        if [[ "$alter_stmt" != "" ]]
        then
            alter_stmt="${alter_stmt},"
        fi
        alter_stmt="${alter_stmt}AUTOSTATS_HISTOGRAM_MODE=CREATE"
    fi
    if echo "$show_create_table" | grep -q "AUTOSTATS_SAMPLING=OFF"
    then
        if echo "$show_create_table" | grep -q "COLUMNSTORE"
        then
            if [[ "$alter_stmt" != "" ]]
            then
                alter_stmt="${alter_stmt},"
            fi
        fi
        alter_stmt="${alter_stmt}AUTOSTATS_SAMPLING=ON"
    fi
    if [[ "$alter_stmt" != "" ]]
    then
        alter_stmt="ALTER TABLE $table $alter_stmt"
        echo $alter_stmt
        ${client_args[@]} -e "$alter_stmt"
    fi
done

Run this bash script with the mysql client arguments you use to connect to the SingleStore master aggregator, like:

bash enable-autostats.sh -h<host> -P<port> -u<user> -p<password>

Possible reasons to disable autostats include:

  • Incremental autostats require a small memory overhead per column to maintain the statistics. If you have a large number of columns across all the columnstore tables in your database, this memory overhead may be significant. In this case you can modify your table to use periodic autostats, as the memory overhead is significantly lower and it will still keep the statistics up to date.

  • In CREATE mode (when default_autostats_histogram_mode is set to CREATE), automatic histograms may end up creating histograms on unwanted columns. As each histogram has a small memory overhead, you may choose to switch the histogram mode to UPDATE so that all existing histograms will be kept up to date, but no new histograms will be automatically created. You can still create any histograms manually using the ANALYZE command.

  • Histograms cannot be deleted in CREATE mode. Deleting a histogram in this mode would have limited impact, as the same histogram may be created automatically in the future. To choose between specific columns for creating histograms, a table must be in UPDATE mode.

  • Incremental autostats require a small performance cost on writes to a table. You may wish to consider disabling autostats if you find this write performance cost to be significant on your workload, can achieve better overall performance with manual statistics, and are willing to handle the operational requirements of managing statistics manually.

  • Periodic autostats uses a small fixed amount of resources in the background. If the resources used negatively affect your workload, you may find it beneficial to disable autostats

  • If you want to “freeze” the statistics, so that the query optimizer continues to use a fixed set of statistics regardless of how the data is changing, you can do so by disabling automatic statistics and using ANALYZE to manage statistics.

Statistics Population after Enabling Autostats

Immediately after enabling autostats on a table where it was previously disabled, SingleStore will start collecting automatic statistics in the background until it catches up with the current table data. During this time, the query optimizer will use the previous manually-gathered statistics if available, until the autostats are sufficiently up-to-date - see the next section for details.

This background statistics collection uses limited resources to avoid large impacts on the performance of other workloads. Therefore, collecting automatic statistics may take a long time for large datasets, but this only affects the first time statistics are populated after autostats is enabled on a table where it was previously disabled

Troubleshooting Autostats

Viewing Cardinality Statistics

You can view the cardinality statistics last collected by querying the information_schema.OPTIMIZER_STATISTICS table.

DESCRIBE INFORMATION_SCHEMA.OPTIMIZER_STATISTICS;
+-----------------------+--------------+------+------+---------+-------+
| Field                 | Type         | Null | Key  | Default | Extra |
+-----------------------+--------------+------+------+---------+-------+
| DATABASE_NAME         | varchar(512) | NO   |      | NULL    |       |
| TABLE_NAME            | varchar(512) | NO   |      | NULL    |       |
| COLUMN_NAME           | varchar(512) | NO   |      | NULL    |       |
| ROW_COUNT             | bigint(21)   | YES  |      | NULL    |       |
| NULL_COUNT            | bigint(21)   | YES  |      | NULL    |       |
| CARDINALITY           | bigint(21)   | YES  |      | NULL    |       |
| AUTOSTATS_ENABLED     | tinyint(1)   | YES  |      | NULL    |       |
| AUTOSTATS_OUTDATED    | tinyint(1)   | YES  |      | NULL    |       |
| AUTOSTATS_ROW_COUNT   | bigint(21)   | YES  |      | NULL    |       |
| AUTOSTATS_NULL_COUNT  | bigint(21)   | YES  |      | NULL    |       |
| AUTOSTATS_CARDINALITY | bigint(21)   | YES  |      | NULL    |       |
| ADVANCED_HISTOGRAMS   | bigint(21)   | YES  |      | NULL    |       |
| LEGACY_HISTOGRAMS     | bigint(21)   | YES  |      | NULL    |       |
| RANGE_STATS           | bigint(21)   | YES  |      | NULL    |       |
| SAMPLE_SIZE           | bigint(21)   | YES  |      | NULL    |       |
| LAST_UPDATED          | datetime     | YES  |      | NULL    |       |
+-----------------------+--------------+------+------+---------+-------+

See the information_schema.OPTIMIZER_STATISTICS page for detailed explanation of how to interpret the information_schema.OPTIMIZER_STATISTICS table.

Viewing Histogram Statistics

When collected, histograms can be queried through the information_schema.ADVANCED_HISTOGRAMS table:

DESCRIBE INFORMATION_SCHEMA.ADVANCED_HISTOGRAMS;
+-------------------------+---------------+------+------+---------+-------+
| Field                   | Type          | Null | Key  | Default | Extra |
+-------------------------+---------------+------+------+---------+-------+
| DATABASE_NAME           | varchar(512)  | NO   |      | NULL    |       |
| TABLE_NAME              | varchar(512)  | NO   |      | NULL    |       |
| COLUMN_NAME             | varchar(512)  | NO   |      | NULL    |       |
| TYPE                    | varchar(255)  | NO   |      | NULL    |       |
| VERSION                 | bigint(21)    | NO   |      | NULL    |       |
| BUCKET_COUNT            | bigint(21)    | NO   |      | NULL    |       |
| BUCKET_INDEX            | bigint(21)    | NO   |      | NULL    |       |
| RANGE_MIN               | varchar(8192) | YES  |      | NULL    |       |
| RANGE_MAX               | varchar(8192) | YES  |      | NULL    |       |
| UNIQUE_COUNT            | double        | YES  |      | NULL    |       |
| CARDINALITY             | double        | YES  |      | NULL    |       |
| UNIQUE_COUNT_CUMULATIVE | double        | YES  |      | NULL    |       |
| CARDINALITY_CUMULATIVE  | double        | YES  |      | NULL    |       |
+-------------------------+---------------+------+------+---------+-------+

See the information_schema.ADVANCED_HISTOGRAMS page for detailed explanation of how to interpret the information_schema.ADVANCED_HISTOGRAMS table.

Out-of-Date Statistics

If there is too large a difference between the data which SingleStore has gathered statistics on and the actual table data, such as immediately after autostats is enabled, autostats will be considered out-of-date. As soon as the difference becomes small enough, autostats will be considered up-to-date. While autostats are out-of-date, the query optimizer uses the statistics saved the last time ANALYZE was run on the table, if any. EXPLAIN will show a warning that autostats is out of date. If autostats are missing, sampling will be used to get any needed estimates.

You can accelerate the collection of autostats by running OPTIMIZE TABLE:

OPTIMIZE TABLE <table_name> FLUSH;
OPTIMIZE TABLE <table_name>;

After statistics have been updated, you may run ANALYZE TABLE to trigger plan invalidation and re-optimization of any query plans that were based on stale statistics - see Plan Invalidation.

Identifying Missing Statistics Warnings in EXPLAIN

When you run EXPLAIN on a query, SingleStore DB will show information about any missing statistics that the optimizer could take advantage of, and the ANALYZE commands that you can run to collect those statistics. With autostats enabled, these messages will not appear or will go away on their own quickly, unless autostats is disabled for the needed columns. So if you see these messages, you typically should double check that autostats is enabled for all tables and columns.

When autostats are disabled, you should normally follow the recommendations in these warnings. The missing statistics listed here are statistics that are not present that the query optimizer would be able to use if collected.

Notice

As discussed above, collecting column statistics is strongly recommended, and collecting histograms is recommended in most situations. See Histograms vs. Sampling for more details.

For example, for the SELECT statement below, the EXPLAIN output shows that the optimizer is missing column statistics on both tables t and t2 as well as histograms on t.x.

CREATE TABLE t (id BIGINT PRIMARY KEY, x DOUBLE);
CREATE TABLE t2 (id BIGINT PRIMARY KEY, x DOUBLE);
EXPLAIN SELECT * FROM t, t2 WHERE t.id = t2.id AND t.x > 1;
+--------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                          |
+--------------------------------------------------------------------------------------------------+
| WARNING: Missing statistics on some tables, consider running the following commands:             |
|     ANALYZE TABLE db.`t2`;                                                                       |
|     ANALYZE TABLE db.`t` COLUMNS `x` ENABLE;                                                     |
| See http://docs.singlestore.com/v5.0/docs/analyze for more information on statistics collection. |
|                                                                                                  |
| Project [t.id, t.id2, t.x, t2.id, t2.x]                                                          |
| Gather partitions:all                                                                            |
| Project [t.id, t.id2, t.x, t2.id, t2.x]                                                          |
| NestedLoopJoin                                                                                   |
| |---IndexSeek db.t2, PRIMARY KEY (id) scan:[id = t.id]                                           |
| Filter [t.x > 1]                                                                                 |
| TableScan db.t, PRIMARY KEY (id)                                                                 |
+--------------------------------------------------------------------------------------------------+

Identifying Missing Histograms in the Plancache

In addition, you can view information about any recently run queries which were missing histograms, which you can use to help identify which columns you may need to enable autostats on, or collect histograms on. Again, it’s highly recommended to use autostats, and if you are, then you will either not see missing histogram information, or such information can be disregarded because soon autostats will gather histograms.

The view information_schema.MV_PROSPECTIVE_HISTOGRAMS shows columns which did not have histograms and were used in queries where a histogram could have been used. This view counts over all queries in the in-memory plancache (see Understanding the Plancache), i.e. those queries shown in information_schema.PLANCACHE and information_schema.MV_QUERIES.

SELECT * FROM information_schema.MV_PROSPECTIVE_HISTOGRAMS;
+---------------+------------+-------------+-------------+
| DATABASE_NAME | TABLE_NAME | COLUMN_NAME | USAGE_COUNT |
+---------------+------------+-------------+-------------+
| db            | t          | c           |           3 |
| db            | t          | b           |           3 |
| db            | t          | a           |           3 |
+---------------+------------+-------------+-------------+

The view information_schema.MV_QUERY_PROSPECTIVE_HISTOGRAMS shows precisely which queries in the plancache could have made use of the histograms. You may join this to views such as MV_QUERIES and MV_ACTIVITIES to learn more about the queries.

SELECT * FROM information_schema.MV_QUERY_PROSPECTIVE_HISTOGRAMS;
+-----------------------------------+---------------+------------+-------------+-------------+
| ACTIVITY_NAME                     | DATABASE_NAME | TABLE_NAME | COLUMN_NAME | USAGE_COUNT |
+-----------------------------------+---------------+------------+-------------+-------------+
| Select_symbols_t_349f04a075ae9982 | db            | t          | a           |           1 |
| Select_symbols_t_bf93eac63fe6a2e8 | db            | t          | a           |           1 |
| Select_symbols_t_bf93eac63fe6a2e8 | db            | t          | c           |           1 |
| Select_symbols_t_056d53b2265b58ce | db            | t          | a           |           1 |
| Select_symbols_t_4222228538392d61 | db            | t          | b           |           1 |
| Select_symbols_t_ecb971f633697055 | db            | t          | b           |           1 |
| Select_symbols_t_ecb971f633697055 | db            | t          | c           |           1 |
| Select_symbols_t_d0692ff8d63f22e8 | db            | t          | b           |           1 |
| Select_symbols_t_d0692ff8d63f22e8 | db            | t          | c           |           1 |
+-----------------------------------+---------------+------------+-------------+-------------+

Both of these views display columns which were missing histograms at the time the queries were compiled. If you later collect histograms, or histograms are collected automatically, the views are not updated unless/until the query is recompiled.

Exporting and Importing Statistics

It is also possible to export and import statistics. This can be useful for testing purposes - for example, you can copy statistics from one SingleStore DB cluster to another, in order to emulate the optimizer’s behavior on a test system that does not have the same data.

ANALYZE TABLE table_name INTO OUTFILE '</path/file.json>';

Save column and histograms recorded on table_name into the file /path/file.json in a JSON representation. You can use absolute or relative paths; relative paths are rooted in the data directory of the SingleStore DB instance.

Note that this exports the current statistics; it does not collect new statistics. It uses whatever set of statistics would currently be used by for query optimization - for column statistics on autostats-enabled tables, this would be the automatic statistics if they are enabled and up-to-date, and otherwise the statistics last saved by ANALYZE.

ANALYZE TABLE table_name INFILE '/path/file.json';

Import range and column statistics for table_name from the file /path/file.json. You can use absolute or relative paths; relative paths are rooted in the data directory of the SingleStore DB instance.

This command is disallowed if autostats is enabled on the table. If you wish to use imported column statistics, you should disable autostats and then import the statistics. If you wish to use imported histograms only on an autostats-enabled table, you can temporarily disable autostats, run ANALYZE ... INFILE to import the statistics, and re-enable autostats.

You can also export and import configuration parameters (including session and global variables) used by the optimizer via a .json file. Session variables are exported from or imported into the current session, while global variables are saved and set globally.

ANALYZE OPTIMIZER_STATE INTO OUTFILE '</path/file.json>';
ANALYZE OPTIMIZER_STATE INFILE '</path/file.json>';

INTO OUTFILE stores the configuration parameters into the specified file.

INFILE specifies a file from which to import configuration parameters.

Notice

Using INFILE to set the configuration will update global variables. This can impact other SingleStore sessions.

Statistics Lifetime

Statistics are tied to a base table, and are persisted through ALTER TABLE or by autostats. Statistics are also persisted through server restarts. Statistics are deleted upon running DROP TABLE or ANALYZE TABLE t DROP.