Statistics and Sampling Concepts

Introduction

SingleStore automatically gathers statistics to give the optimizer the information it needs. The system that gathers statistics automatically is known as autostats. The term autostats is also used to describe statistics that are gathered automatically.

It’s almost always best to leave autostats enabled. But if for some reason you don’t want to use autostats, you can disable it. If you turn off autostats, you should use the ANALYZE command to gather statistics manually. Most users can rely on autostats and will never need to run the ANALYZE command.

The ANALYZE command also can be used to trigger plan re-optimization when data statistics change, whether or not autostats is enabled.

Autostat Types

SingleStore Helios automatically collects and updates column statistics on columnstore tables, incrementally in the background. This greatly reduces the operational burden around maintaining these statistics with ANALYZE, ensures they are available without requiring user action, and keeps them up-to-date as table data changes with less performance impact.

Autostats provides automatic maintenance of column statistics. Statistics are collected and incrementally updated in the background as the table is modified with DML statements such as INSERT, UPDATE, DELETE, LOAD DATA, and TRUNCATE, as well as ALTER TABLE.

While statistics are updated automatically, to re-optimize query plans after statistics change, you must trigger plan invalidation by running ANALYZE.

Running ANALYZE also saves a copy of the table’s current automatic statistics. If you disable automatic statistics or they become out-of-date (see below), the query optimizer will use the last set of statistics saved by ANALYZE instead.

SingleStore uses three types of autostats:

	Column Statistics Methods	Histogram Methods	Sampling Methods
Columnstore	Incremental or Periodic	Create or Update	On or Off
Rowstore	Periodic only	Create or Update	On only

Column Statistics

	Memory Usage	Best Use Case	Can Disable
Columnstore	Incremental: 7kb per column per partition Periodic: ~150 bytes per column per node	`GROUP BY` against a table	Yes
Rowstore	~150 bytes per column per node	`GROUP BY` against a table	Yes

Memory Usage

Best Use Case

Can Disable

Columnstore

Incremental: 7kb per column per partition

Periodic: ~150 bytes per column per node

GROUP BY against a table

Yes

Rowstore

~150 bytes per column per node

GROUP BY against a table

Yes

Column statistics are collected on every column in a table when the table is processed by autostats or the ANALYZE command. These statistics include the number of distinct values in each column, the number of NULLs in each column, and the number of rows in the table.

You can view the column statistics last collected by querying the information_schema.OPTIMIZER_STATISTICS table or DESCRIBE INFORMATION_SCHEMA.OPTIMIZER_STATISTICS.

+-----------------------+--------------+------+------+---------+-------+
| Field                 | Type         | Null | Key  | Default | Extra |
+-----------------------+--------------+------+------+---------+-------+
| DATABASE_NAME         | varchar(512) | NO   |      | NULL    |       |
| TABLE_NAME            | varchar(512) | NO   |      | NULL    |       |
| COLUMN_NAME           | varchar(512) | NO   |      | NULL    |       |
| ROW_COUNT             | bigint(21)   | YES  |      | NULL    |       |
| NULL_COUNT            | bigint(21)   | YES  |      | NULL    |       |
| CARDINALITY           | bigint(21)   | YES  |      | NULL    |       |
| AUTOSTATS_ENABLED     | tinyint(1)   | YES  |      | NULL    |       |
| AUTOSTATS_OUTDATED    | tinyint(1)   | YES  |      | NULL    |       |
| AUTOSTATS_ROW_COUNT   | bigint(21)   | YES  |      | NULL    |       |
| AUTOSTATS_NULL_COUNT  | bigint(21)   | YES  |      | NULL    |       |
| AUTOSTATS_CARDINALITY | bigint(21)   | YES  |      | NULL    |       |
| ADVANCED_HISTOGRAMS   | bigint(21)   | YES  |      | NULL    |       |
| LEGACY_HISTOGRAMS     | bigint(21)   | YES  |      | NULL    |       |
| RANGE_STATS           | bigint(21)   | YES  |      | NULL    |       |
| SAMPLE_SIZE           | bigint(21)   | YES  |      | NULL    |       |
| LAST_UPDATED          | datetime     | YES  |      | NULL    |       |
+-----------------------+--------------+------+------+---------+-------+

AUTOSTATS_ENABLED indicates if cardinality autostats are enabled. The LAST_UPDATED column shows the time non-incremental statistics were saved. This occurs when ANALYZE runs, or on periodic autostats collection.

Note

information_schema.OPTIMIZER_STATISTICS.LAST_UPDATED is not accurate for columnstore tables using incremental autostats. This is because autostats are collected upon ingest and maintained per segment, meaning that they are not stored like other statistics collected by ANALYZE.

The best way to check that incremental autostats are up to date is to run EXPLAIN on the table in question. If no error is surfaced, autostats are up to date. See Identifying Missing Statistics for more information.

The RANGE_STATS column indicates whether any kind of histogram is stored for this column. Two types of histograms, advanced or legacy, may be present. The LEGACY_HISTOGRAMS and ADVANCED_HISTOGRAMS columns indicate presence of legacy and new histograms, respectively. If you are using legacy histograms (using cardinality_estimation_level set to 6.0), information about the histograms is shown in information_schema.RANGE_STATISTICS.

The cardinality_estimation_level engine variable controls the behavior of histograms and sampling used for selectivity estimation, which is an important part of query optimization. The variable can be set to different settings to dictate how the optimizer estimates the number of rows that operations like joins or selections will return.

AUTOSTATS_ROW_COUNT, AUTOSTATS_NULL_COUNT, and AUTOSTATS_CARDINALITY are deprecated. The columns remain in the table, but their value is always NULL.

Columns in information_schema.OPTIMIZER_STATISTICS will contain NULL values if autostats haven’t yet been collected, or autostats is disabled and ANALYZE has not yet been run.

Method of Cardinality Autostat Collection

Cardinality autostats can be collected in two ways: incremental or periodic. Incremental collection is the default.

Incremental - Incremental autostats are supported only on columnstore tables. The incremental autostats gathering process updates the column statistics of a table incrementally as the table’s data changes. Incremental autostats are enabled by default on all columnstore tables.
Periodic - Periodic autostats are supported on both columnstore and rowstore tables. Periodic autostats collects column statistics on a table over a fixed period, but only after the table’s data has changed significantly. To collect these statistics, the SingleStore Helios engine automatically runs ANALYZE on the table, in the background.

Histogram or Range Statistics

	Memory Usage	Best Use Case	Can Disable
Columnstore	For each column with histograms enabled: On each node, ~2KB for int/double types and ~8KB for string types	Range filter that does not increase; e.g. timestamp or auto increment	Yes
Rowstore	For each column with histograms enabled: On each node, ~2KB for int/double types and ~8KB for string types	Range filter that does not increase; e.g. timestamp or auto increment	Yes

The ANALYZE command can additionally collect histograms over designated columns. When histograms are enabled for a column, ANALYZE will sample the rows in that column and create a histogram from the samples, recording the endpoints of each bucket and statistical information about the distribution of values within each bucket.

When collected, histograms can be queried through the information_schema.ADVANCED_HISTOGRAMS table or DESCRIBE INFORMATION_SCHEMA.ADVANCED_HISTOGRAMS.

+-------------------------+---------------+------+------+---------+-------+
| Field                   | Type          | Null | Key  | Default | Extra |
+-------------------------+---------------+------+------+---------+-------+
| DATABASE_NAME           | varchar(512)  | NO   |      | NULL    |       |
| TABLE_NAME              | varchar(512)  | NO   |      | NULL    |       |
| COLUMN_NAME             | varchar(512)  | NO   |      | NULL    |       |
| TYPE                    | varchar(255)  | NO   |      | NULL    |       |
| VERSION                 | bigint(21)    | NO   |      | NULL    |       |
| BUCKET_COUNT            | bigint(21)    | NO   |      | NULL    |       |
| BUCKET_INDEX            | bigint(21)    | NO   |      | NULL    |       |
| RANGE_MIN               | varchar(8192) | YES  |      | NULL    |       |
| RANGE_MAX               | varchar(8192) | YES  |      | NULL    |       |
| UNIQUE_COUNT            | double        | YES  |      | NULL    |       |
| CARDINALITY             | double        | YES  |      | NULL    |       |
| UNIQUE_COUNT_CUMULATIVE | double        | YES  |      | NULL    |       |
| CARDINALITY_CUMULATIVE  | double        | YES  |      | NULL    |       |
+-------------------------+---------------+------+------+---------+-------+

The table will display, in human-readable form, the end points of the histogram buckets along with total and unique count for each bucket. Bucket index -1 corresponds to the number of nulls tracked by the histogram.

If you want to find a list of the columns that have histograms, you can use the following query:

SQL

SELECT DISTINCT DATABASE_NAME, TABLE_NAME, COLUMN_NAME
    FROM information_schema.ADVANCED_HISTOGRAMS;

Method of Histogram Autostat Collection

Histogram autostats can be collected in two ways: create or update. Create collection is the default.

Create - With AUTOSTATS_HISTOGRAM_MODE set to CREATE, if a column is identified at least twice in information_schema.mv_prospective_histograms that having a histogram on it could be beneficial, then the histogram will be automatically created, and updated at that time. Like any histograms you create, it is updated when you run ANALYZE, and also as part of the periodic autostats (same interval: 10 minutes, and actually collected as part of the same process).
Update - With AUTOSTATS_HISTOGRAM_MODE set to UPDATE, the existing histograms are kept current, but new histograms are not created.

Sampling

	Memory Usage	Best Use Case	Can Disable
Columnstore	~0.1% more disk space needed for samples stored on disk plus ~2% more memory for columnstore metadata	Not an issue since SingleStore separately stores a small percentage of data from tables in a separate internal table called the sampling table.	Yes
Rowstore	No extra cost	Sampling can't be turned off. Partitions have fairly equal distribution of data.	No

Columnstore and rowstore tables both can be sampled by the query optimizer. Columnstore tables automatically have a random row-level sample built and maintained when autostats is enabled. This random sample is kept up to date in the background. The stored sample is sometimes called the columnstore sample. If a columnstore must be sampled during query optimization, this row-level sample is used.

Rowstore tables are sampled directly. An actual row-level random sample is collected by scanning the table structure and skipping a large fraction of the rows. This is an efficient process that takes time proportional to the sample size.

Method of Sampling Autostat Collection

Sampling is either enabled or disabled for columnstore tables, via the default_autostats_columnstore_sampling engine variable. Enabled is the default. Rowstore tables only have the enabled option.

For columnstore tables, the method of sampling is configured via the sampling_estimate_approach engine variable, which has four possible modes:

auto - This is the default mode, and it is the behavior that SingleStore has used historically. A separate sample table is maintained at each leaf node, and these are queried by the aggregator node. Due to the shape of the query (filter predicates are inside an aggregator function) used to determine the results, this can sometimes be expensive and cause significant load on the leaf nodes.
aggregation - This is the same behavior as auto. (This behavior may change in future releases.)
lightweight - This option only dispatches a query to evaluate the numerator (the number of rows that passes through the filter predicate), and uses a "good enough" value for the denominator (the total number of rows in the table) that has been previously computed. In this case, the filter predicate can be pushed down to the operator tree as it is not contained within a aggregator function, and this should result in a faster query. A potential downside is that this query does not compute the denominator. Since SingleStore uses an estimate for the denominator, the selectivity value computed from this option can be inaccurate and can result in suboptimal plans.

Try this option if a significant performance issue is observed from the auto or aggregation sampling query.
filtered - If the lightweight option does not provide accurate selectivity values, try this option, which uses two table scans to evaluate both accurate numerator and denominator, and joins the results in one query. Similar to the lightweight option, the filter predicate is with the WHERE (not in an aggregator function) and should perform better in theory. The potential downside of this method is the overhead from the additional table scan, which is essentially running another query.

Histograms vs Sampling

If histograms are unavailable, the optimizer will use a sample to compute estimates. For rowstore tables, it will use dynamic sampling (i.e. sample the table data by scanning part of the table) to compute estimates. For columnstore tables, it will use the stored columnstore sample.

Histograms are much faster to use than sampling during query optimization, since the optimizer merely needs to read the previously-collected histograms, which are relatively small objects, while sampling may require scanning a large amount of table data.

Whether histogram-based estimates or sampling-based estimates are more accurate depends on the table and query. For example, histograms are most effective on columns where the overall distribution of data is not changing quickly (even if the data itself is changing quickly). When the data distribution is changing, if autostats is disabled, it is important to periodically update histograms to reflect changes in data by re-running ANALYZE TABLE.

Conversely, histograms are least effective on columns where the data distribution is quickly and continuously changing. For example, a typical case where histograms are generally inaccurate and not recommended is on columns where the range of data is continuously growing, such as a timestamp or ID column where new, ever-increasing values are continuously being added. Histograms only reflect the range of data at the time the histograms were last updated, so new values outside that range are not reflected in the histograms. This leads to inaccurate estimates for queries that filter for this recent data. For these types of columns, leaving histograms disabled, which means sampling is used instead, typically leads to the best results.

Also, since histograms are one-dimensional, they cannot capture correlation between columns. Sampling can estimate predicates that feature a high degree of correlation between columns more accurately than histograms.

Finally, sampling may be inaccurate on predicates that are highly selective, or that have a high degree of skew between partitions. Histograms are typically more accurate on these predicates.

Plan Invalidation

When data statistics change, an old query plan optimized based on the old data statistics may no longer be the best for the new data statistics. It is often desirable to re-optimize the query plan after the data statistics change significantly, as a new plan may be significantly better than the old plan. Running ANALYZE triggers re-optimization of any plans based on statistics for that table that are significantly different than the current statistics, by invalidating each such plan and recompiling the corresponding query the next time it is run.

Plans can be invalidated manually by running the ANALYZE command. To automatically invalidate the plan as autostats update, set the enable_background_plan_invalidation variable to ON. By default, this variable is set to OFF to prevent unexpected recompilation of plans when data changes.

SingleStore uses changes in the table rowcount as a heuristic for significant changes in statistics: a plan is eligible for invalidation if there is a table rowcount differing by a factor of 2 between the statistics it was compiled with and the current statistics, i.e. if an old rowcount (from the statistics at the time the query was compiled) is less than 50% or greater than 200% of the current rowcount. When you run ANALYZE, all plans eligible for invalidation (i.e. those where the current statistics and compile-time statistics differ in any table rowcount by at least 2x) are invalidated, i.e. they will be reoptimized and recompiled the next time they are run.

In addition, if autostats on a table were outdated when the plan was compiled, the plan will be eligible for invalidation after autostats become up-to-date (i.e. after autostats become up-to-date and you run ANALYZE, the plan will be invalidated).

Also, running SHOW PLANCACHE will display any existing plans. If there are any plans you wish to invalidate, running DROP <plan_id> FROM PLANCACHE will invalidate that plan. If you run DROP ALL FROM PLANCACHE that will drop all valid plans. See DROP … FROM PLANCACHE for more details.

Correlation Statistics

Correlated columns mean the value of one column depends on the value of another column to some degree. This dependency typically arises from an inherent hierarchical relationship. For example, the values in a "zip code" column would be potentially 100% correlated with a "state" column.

Optimizing queries containing correlated columns can be challenging. Filters applied to multiple columns exhibit varied combinations based on the correlation levels between those respective columns. Correlation statistics allow the optimizer to improve the query plan of certain queries that filter on multiple columns that are related to each other.

Consider a scenario where 30% of a company's customers reside in California, resulting in a 30% selectivity when filtering by state = "CA." However, if the query additionally filters by zip code within California, the selectivity may drastically drop. In cases of uncorrelated columns, the optimizer combines these estimates through multiplication, resulting in a significantly lower overall selectivity which can lead to less efficient query plans. In queries where the filtered columns have a strong correlation, such as states and zip codes, the state filter becomes redundant. It is prudent to adopt the more selective filter's selectivity for a more accurate estimation and efficient execution plan, leading to better overall performance.

Optimizing the query plan can be achieved by using statistics that account for correlated column relationships and dependencies within a query predicate. Correlation coefficients can be added between columns to fine-tune how the optimizer combines the selectivity of single-column filters when using histogram estimation. At a high level, the optimizer must create an estimate of what portion of each base table will be read during query execution (the selectivity, measured between 0 and 1).

The optimizer defaults the correlation coefficient to 0.5 between any two columns. This feature allows the user to change the correlation coefficient for any set of two columns in a table.

Consider the tables and query:

SQL

CREATE TABLE inventory (
    id INT
    , model_year INT
    , make VARCHAR(10)
    , model VARCHAR(20)
    , SHARD KEY(id)
    , SORT KEY(model));

SQL

CREATE TABLE sales (
    s_id INT PRIMARY KEY,
    i_id INT
    , sales_month FLOAT
    , monthly_sales FLOAT
    ,yearly_sales AS (monthly_sales * 12) PERSISTED FLOAT
    , SORT KEY(sales_month));

SQL

SELECT inventory.model_year, inventory.make, inventory.model, sales.sales_month, sales.montly_sales
    FROM inventory JOIN sales ON inventor.id = sales.i_id
    WHERE inventory.model = 'Honda' AND inventory.make = 'Accord';

In this case, the optimizer should disregard the inventory.model filter in estimations as it is functionally determined by inventory.make and use the correlation statistics functionality to correlate inventory.model and inventory.make with a coefficient of 1.

To set the correlation coefficient, run the following command. Refer to the ANALYZE page for more details:

SQL

ANALYZE TABLE inventory CORRELATE COLUMN inventory.model WITH COLUMN inventory.make USING COEFFICIENT 1.0;

To view the correlation coefficient set between two columns on a table, run the information_schema.correlated_column_statistics statement:

SQL

SELECT * FROM information_schema.correlated_column_statistics;

+---------------+------------+------------------------+-----------------------+-------------------|---------------------+
| DATABASE_NAME	| TABLE_NAME | CORRELATED_COLUMN_NAME | CORRELLEE_COLUMN_NAME |	CORRELATION_COEFF | CREATED_TIME        |
+---------------+------------+------------------------+-----------------------+-------------------+---------------------+
| test1         | inventory  | model                  | make                  | 1                 | 2023-12-07 22:02:31 |
+---------------+------------+------------------------+-----------------------+-------------------+---------------------+

On this page