# Statistics and Sampling Concepts

## Introduction

SingleStore automatically gathers statistics to give the optimizer the information it needs. The system that gathers statistics automatically is known as autostats. The term autostats is also used to describe statistics that are gathered automatically.

It’s almost always best to leave autostats enabled. But if for some reason you don’t want to use autostats, you can disable it. If you turn off autostats, you should use the [ANALYZE](https://docs.singlestore.com/db/v9.1/reference/sql-reference/operational-commands/analyze.md) command to gather statistics manually. Most users can rely on autostats and will never need to run the `ANALYZE` command.

The `ANALYZE` command also can be used to trigger plan re-optimization when data statistics change, whether or not autostats is enabled.

## Autostat Types

SingleStore automatically collects and updates column statistics on columnstore tables, incrementally in the background. This greatly reduces the operational burden around maintaining these statistics with `ANALYZE`, ensures they are available without requiring user action, and keeps them up-to-date as table data changes with less performance impact.

Autostats provides automatic maintenance of column statistics. Statistics are collected and incrementally updated in the background as the table is modified with DML statements such as `INSERT`, `UPDATE`, `DELETE`, `LOAD DATA`, and `TRUNCATE`, as well as `ALTER TABLE`.

While statistics are updated automatically, to re-optimize query plans after statistics change, you must trigger plan invalidation by running `ANALYZE`.

Running `ANALYZE` also saves a copy of the table’s current automatic statistics. If you disable automatic statistics or they become out-of-date (see below), the query optimizer will use the last set of statistics saved by `ANALYZE` instead.

SingleStore uses three types of autostats:

|             | Column Statistics Methods  | Histogram Methods   | Sampling Methods |
| ----------- | -------------------------- | ------------------- | ---------------- |
| Columnstore | **Incremental**or Periodic | **Create**or Update | **On**or Off     |
| Rowstore    | **Periodic**only           | **Create**or Update | **On**only       |

## Column Statistics

|             | Memory Usage                                                                       | Best Use Case             | Can Disable |
| ----------- | ---------------------------------------------------------------------------------- | ------------------------- | ----------- |
| Columnstore | Incremental: 7kb per column per partitionPeriodic: \~150 bytes per column per node | `GROUP BY`against a table | Yes         |
| Rowstore    | \~150 bytes per column per node                                                    | `GROUP BY`against a table | Yes         |

Column statistics are collected on every column in a table when the table is processed by autostats or the `ANALYZE` command. These statistics include the number of distinct values in each column, the number of NULLs in each column, and the number of rows in the table.

You can view the column statistics last collected by querying the `information_schema.OPTIMIZER_STATISTICS` table or `DESCRIBE INFORMATION_SCHEMA.OPTIMIZER_STATISTICS.`

```
+-----------------------+--------------+------+------+---------+-------+
| Field                 | Type         | Null | Key  | Default | Extra |
+-----------------------+--------------+------+------+---------+-------+
| DATABASE_NAME         | varchar(512) | NO   |      | NULL    |       |
| TABLE_NAME            | varchar(512) | NO   |      | NULL    |       |
| COLUMN_NAME           | varchar(512) | NO   |      | NULL    |       |
| ROW_COUNT             | bigint(21)   | YES  |      | NULL    |       |
| NULL_COUNT            | bigint(21)   | YES  |      | NULL    |       |
| CARDINALITY           | bigint(21)   | YES  |      | NULL    |       |
| AUTOSTATS_ENABLED     | tinyint(1)   | YES  |      | NULL    |       |
| AUTOSTATS_OUTDATED    | tinyint(1)   | YES  |      | NULL    |       |
| AUTOSTATS_ROW_COUNT   | bigint(21)   | YES  |      | NULL    |       |
| AUTOSTATS_NULL_COUNT  | bigint(21)   | YES  |      | NULL    |       |
| AUTOSTATS_CARDINALITY | bigint(21)   | YES  |      | NULL    |       |
| ADVANCED_HISTOGRAMS   | bigint(21)   | YES  |      | NULL    |       |
| LEGACY_HISTOGRAMS     | bigint(21)   | YES  |      | NULL    |       |
| RANGE_STATS           | bigint(21)   | YES  |      | NULL    |       |
| SAMPLE_SIZE           | bigint(21)   | YES  |      | NULL    |       |
| LAST_UPDATED          | datetime     | YES  |      | NULL    |       |
+-----------------------+--------------+------+------+---------+-------+
```

`AUTOSTATS_ENABLED` indicates if cardinality autostats are enabled. The `LAST_UPDATED` column shows the time non-incremental statistics were saved. This occurs when `ANALYZE` runs, or on periodic autostats collection.

> **📝 Note**: `information_schema.OPTIMIZER_STATISTICS.LAST_UPDATED` is not accurate for columnstore tables using incremental autostats. This is because autostats are collected upon ingest and maintained per segment, meaning that they are not stored like other statistics collected by `ANALYZE`.The best way to check that incremental autostats are up to date is to run [EXPLAIN](https://docs.singlestore.com/db/v9.1/reference/sql-reference/data-manipulation-language-dml/explain.md) on the table in question. If no error is surfaced, autostats are up to date. See [Identifying Missing Statistics](https://docs.singlestore.com/db/v9.1/query-data/query-tuning/statistics-and-sampling/statistics-and-sampling-tasks/#section-idm4609582140752032788527116465.md) for more information.

The `RANGE_STATS` column indicates whether any kind of histogram is stored for this column. Two types of histograms, advanced or legacy, may be present. The `LEGACY_HISTOGRAMS` and `ADVANCED_HISTOGRAMS` columns indicate presence of legacy and new histograms, respectively. If you are using legacy histograms (using `cardinality_estimation_level` set to 6.0), information about the histograms is shown in information\_schema.`RANGE_STATISTICS`. This table is deprecated as of v7.3 but will continue to show this information.

The `cardinality_estimation_level` engine variable controls the behavior of histograms and sampling used for selectivity estimation, which is an important part of query optimization. The variable can be set to different settings to dictate how the optimizer estimates the number of rows that operations like joins or selections will return. The behavior of each setting is as follows:

* `6.0` - Use legacy histograms.
* `6.5` - Use advanced histograms.
* `7.0` - Same behavior as `6.5`.
* `7.1` - Use histograms for join estimation.
* `7.3` - Use sampling and histograms together (when both are available), to improve selectivity estimation. When query processing distributed joins, the optimizer will consider initiating joining from the leftmost derived table.&#x20;

  Please note, that the behavior of the engine variable `verify_fields_in_transitivity` (which provides control over how join predicates are applied) is dependent on the `cardinality_estimation_level` setting. When the `cardinality_estimation_level` is set at 7.3, the behavior of the `verify_fields_in_transitivity` variable is interpreted as `ON`. When the `cardinality_estimation_level` is set below `7.3`, the behavior of the `verify_fields_in_transitivity` variable is interpreted as `OFF`.

`AUTOSTATS_ROW_COUNT`, `AUTOSTATS_NULL_COUNT`, and `AUTOSTATS_CARDINALITY` are deprecated. The columns remain in the table, but their value is always NULL.

Columns in `information_schema.OPTIMIZER_STATISTICS` will contain `NULL` values if autostats haven’t yet been collected, or autostats is disabled and `ANALYZE` has not yet been run.

## Method of Cardinality Autostat Collection

Cardinality autostats can be collected in two ways: incremental or periodic. Incremental collection is the default.

* Incremental - Incremental autostats are supported only on columnstore tables. The incremental autostats gathering process updates the column statistics of a table incrementally as the table’s data changes. Incremental autostats are enabled by default on all columnstore tables.
* Periodic - Periodic autostats are supported on both columnstore and rowstore tables. Periodic autostats collects column statistics on a table over a fixed period, but only after the table’s data has changed significantly. To collect these statistics, the SingleStore engine automatically runs `ANALYZE` on the table, in the background. As `ANALYZE` can be an expensive operation, total resource usage by periodic autostats is limited according to the `system_optimizer_pool` variable.

## Histogram or Range Statistics

|             | Memory Usage                                                                                                 | Best Use Case                                                         | Can Disable |
| ----------- | ------------------------------------------------------------------------------------------------------------ | --------------------------------------------------------------------- | ----------- |
| Columnstore | For each column with histograms enabled: On each node, \~2KB for int/double types and \~8KB for string types | Range filter that does not increase; e.g. timestamp or auto increment | Yes         |
| Rowstore    | For each column with histograms enabled: On each node, \~2KB for int/double types and \~8KB for string types | Range filter that does not increase; e.g. timestamp or auto increment | Yes         |

The `ANALYZE` command can additionally collect histograms over designated columns. When histograms are enabled for a column, `ANALYZE` will sample the rows in that column and create a histogram from the samples, recording the endpoints of each bucket and statistical information about the distribution of values within each bucket.

When collected, histograms can be queried through the `information_schema.ADVANCED_HISTOGRAMS` table or `DESCRIBE INFORMATION_SCHEMA.ADVANCED_HISTOGRAMS`.

```
+-------------------------+---------------+------+------+---------+-------+
| Field                   | Type          | Null | Key  | Default | Extra |
+-------------------------+---------------+------+------+---------+-------+
| DATABASE_NAME           | varchar(512)  | NO   |      | NULL    |       |
| TABLE_NAME              | varchar(512)  | NO   |      | NULL    |       |
| COLUMN_NAME             | varchar(512)  | NO   |      | NULL    |       |
| TYPE                    | varchar(255)  | NO   |      | NULL    |       |
| VERSION                 | bigint(21)    | NO   |      | NULL    |       |
| BUCKET_COUNT            | bigint(21)    | NO   |      | NULL    |       |
| BUCKET_INDEX            | bigint(21)    | NO   |      | NULL    |       |
| RANGE_MIN               | varchar(8192) | YES  |      | NULL    |       |
| RANGE_MAX               | varchar(8192) | YES  |      | NULL    |       |
| UNIQUE_COUNT            | double        | YES  |      | NULL    |       |
| CARDINALITY             | double        | YES  |      | NULL    |       |
| UNIQUE_COUNT_CUMULATIVE | double        | YES  |      | NULL    |       |
| CARDINALITY_CUMULATIVE  | double        | YES  |      | NULL    |       |
+-------------------------+---------------+------+------+---------+-------+
```

The table will display, in human-readable form, the end points of the histogram buckets along with total and unique count for each bucket. Bucket index -1 corresponds to the number of nulls tracked by the histogram.

If you want to find a list of the columns that have histograms, you can use the following query:

```sql
SELECT DISTINCT DATABASE_NAME, TABLE_NAME, COLUMN_NAME
    FROM information_schema.ADVANCED_HISTOGRAMS;
```

## Method of Histogram Autostat Collection

Histogram autostats can be collected in two ways: create or update. Create collection is the default.

* Create - With `AUTOSTATS_HISTOGRAM_MODE` set to `CREATE`, if a column is identified at least twice in `information_schema.mv_prospective_histograms` that having a histogram on it could be beneficial, then the histogram will be automatically created, and updated at that time. Like any histograms you create, it is updated when you run `ANALYZE`, and also as part of the periodic autostats (same interval: 10 minutes, and actually collected as part of the same process).
* Update - With `AUTOSTATS_HISTOGRAM_MODE` set to `UPDATE`, the existing histograms are kept current, but new histograms are not created.

## Sampling

|             | Memory Usage                                                                                            | Best Use Case                                                                                                                                 | Can Disable |
| ----------- | ------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- | ----------- |
| Columnstore | \~0.1% more disk space needed for samples stored on disk plus \~2% more memory for columnstore metadata | Not an issue sinceSingleStoreseparately stores a small percentage of data from tables in a separate internal table called the sampling table. | Yes         |
| Rowstore    | No extra cost                                                                                           | Sampling can't be turned off. Partitions have fairly equal distribution of data.                                                              | No          |

Columnstore and rowstore tables both can be sampled by the query optimizer. Columnstore tables automatically have a random row-level sample built and maintained when autostats is enabled. This random sample is kept up to date in the background. The stored sample is sometimes called the columnstore sample. If a columnstore must be sampled during query optimization, this row-level sample is used.

Rowstore tables are sampled directly. An actual row-level random sample is collected by scanning the table structure and skipping a large fraction of the rows. This is an efficient process that takes time proportional to the sample size.

## Method of Sampling Autostat Collection

Sampling is either enabled or disabled for columnstore tables, via the `default_autostats_columnstore_sampling` engine variable. Enabled is the default. Rowstore tables only have the enabled option.

For columnstore tables, the method of sampling is configured via the `sampling_estimate_approach` engine variable, which has four possible modes:

* `auto` - This is the default mode, and it is the behavior that SingleStore has used historically. A separate sample table is maintained at each leaf node, and these are queried by the aggregator node. Due to the shape of the query (filter predicates are inside an aggregator function) used to determine the results, this can sometimes be expensive and cause significant load on the leaf nodes.
* `aggregation` - This is the same behavior as `auto`. (This behavior may change in future releases.)
* `lightweight` - This option only dispatches a query to evaluate the numerator (the number of rows that passes through the filter predicate), and uses a "good enough" value for the denominator (the total number of rows in the table) that has been previously computed. In this case, the filter predicate can be pushed down to the operator tree as it is not contained within a aggregator function, and this should result in a faster query. A potential downside is that this query does not compute the denominator. Since SingleStore uses an estimate for the denominator, the selectivity value computed from this option can be inaccurate and can result in suboptimal plans.

  Try this option if a significant performance issue is observed from the `auto` or `aggregation` sampling query.
* `filtered` - If the `lightweight` option does not provide accurate selectivity values, try this option, which uses two table scans to evaluate both accurate numerator and denominator, and joins the results in one query. Similar to the `lightweight` option, the filter predicate is with the `WHERE` (not in an aggregator function) and should perform better in theory. The potential downside of this method is the overhead from the additional table scan, which is essentially running another query.

## Histograms vs Sampling

If histograms are unavailable, the optimizer will use a sample to compute estimates. For rowstore tables, it will use dynamic sampling (i.e. sample the table data by scanning part of the table) to compute estimates. For columnstore tables, it will use the stored columnstore sample.

Histograms are much faster to use than sampling during query optimization, since the optimizer merely needs to read the previously-collected histograms, which are relatively small objects, while sampling may require scanning a large amount of table data.

Whether histogram-based estimates or sampling-based estimates are more accurate depends on the table and query. For example, histograms are most effective on columns where the overall distribution of data is not changing quickly (even if the data itself is changing quickly). When the data distribution is changing, if autostats is disabled, it is important to periodically update histograms to reflect changes in data by re-running `ANALYZE TABLE`.

Conversely, histograms are least effective on columns where the data distribution is quickly and continuously changing. For example, a typical case where histograms are generally inaccurate and not recommended is on columns where the range of data is continuously growing, such as a timestamp or ID column where new, ever-increasing values are continuously being added. Histograms only reflect the range of data at the time the histograms were last updated, so new values outside that range are not reflected in the histograms. This leads to inaccurate estimates for queries that filter for this recent data. For these types of columns, leaving histograms disabled, which means sampling is used instead, typically leads to the best results.

Also, since histograms are one-dimensional, they cannot capture correlation between columns. Sampling can estimate predicates that feature a high degree of correlation between columns more accurately than histograms.

Finally, sampling may be inaccurate on predicates that are highly selective, or that have a high degree of skew between partitions. Histograms are typically more accurate on these predicates.

## Plan Invalidation

When data statistics change, an old query plan optimized based on the old data statistics may no longer be the best for the new data statistics. It is often desirable to re-optimize the query plan after the data statistics change significantly, as a new plan may be significantly better than the old plan. Running `ANALYZE` triggers re-optimization of any plans based on statistics for that table that are significantly different than the current statistics, by invalidating each such plan and recompiling the corresponding query the next time it is run.

Plans can be invalidated manually by running the `ANALYZE` command. To automatically invalidate the plan as autostats update, set the enable\_background\_plan\_invalidation variable to `ON`. By default, this variable is set to `OFF` to prevent unexpected recompilation of plans when data changes.

SingleStore uses changes in the table rowcount as a heuristic for significant changes in statistics: a plan is eligible for invalidation if there is a table rowcount differing by a factor of 2 between the statistics it was compiled with and the current statistics, i.e. if an old rowcount (from the statistics at the time the query was compiled) is less than 50% or greater than 200% of the current rowcount. When you run `ANALYZE`, all plans eligible for invalidation (i.e. those where the current statistics and compile-time statistics differ in any table rowcount by at least 2x) are invalidated, i.e. they will be reoptimized and recompiled the next time they are run.

In addition, if autostats on a table were outdated when the plan was compiled, the plan will be eligible for invalidation after autostats become up-to-date (i.e. after autostats become up-to-date and you run `ANALYZE`, the plan will be invalidated).

Also, running `SHOW PLANCACHE` will display any existing plans. If there are any plans you wish to invalidate, running `DROP <plan_id> FROM PLANCACHE` will invalidate that plan. If you run `DROP ALL FROM PLANCACHE` that will drop all valid plans. See [DROP … FROM PLANCACHE](https://docs.singlestore.com/db/v9.1/reference/sql-reference/data-definition-language-ddl/drop-from-plancache.md) for more details.

## Correlation Statistics

Correlated columns mean the value of one column depends on the value of another column to some degree. This dependency typically arises from an inherent hierarchical relationship. For example, the values in a "zip code" column would be potentially 100% correlated with a "state" column.

Optimizing queries containing correlated columns can be challenging. Filters applied to multiple columns exhibit varied combinations based on the correlation levels between those respective columns. Correlation statistics allow the optimizer to improve the query plan of certain queries that filter on multiple columns that are related to each other.

Consider a scenario where 30% of a company's customers reside in California, resulting in a 30% selectivity when filtering by state = "CA." However, if the query additionally filters by zip code within California, the selectivity may drastically drop. In cases of uncorrelated columns, the optimizer combines these estimates through multiplication, resulting in a significantly lower overall selectivity which can lead to less efficient query plans. In queries where the filtered columns have a strong correlation, such as states and zip codes, the state filter becomes redundant. It is prudent to adopt the more selective filter's selectivity for a more accurate estimation and efficient execution plan, leading to better overall performance.

Optimizing the query plan can be achieved by using statistics that account for correlated column relationships and dependencies within a query predicate. Correlation coefficients can be added between columns to fine-tune how the optimizer combines the selectivity of single-column filters when using histogram estimation. At a high level, the optimizer must create an estimate of what portion of each base table will be read during query execution (the selectivity, measured between 0 and 1).

The optimizer defaults the correlation coefficient to 0.5 between any two columns. This feature allows the user to change the correlation coefficient for any set of two columns in a table.

Consider the tables and query:

```sql
CREATE TABLE inventory (
    id INT
    , model_year INT
    , make VARCHAR(10)
    , model VARCHAR(20)
    , SHARD KEY(id) 
    , SORT KEY(model));
```

```sql
CREATE TABLE sales (
    s_id INT PRIMARY KEY,
    i_id INT
    , sales_month FLOAT
    , monthly_sales FLOAT
    ,yearly_sales AS (monthly_sales * 12) PERSISTED FLOAT
    , SORT KEY(sales_month));

```

```sql
SELECT inventory.model_year, inventory.make, inventory.model, sales.sales_month, sales.montly_sales
    FROM inventory JOIN sales ON inventor.id = sales.i_id
    WHERE inventory.model = 'Honda' AND inventory.make = 'Accord'; 
```

In this case, the optimizer should disregard the `inventory.model` filter in estimations as it is functionally determined by `inventory.make` and use the correlation statistics functionality to correlate `inventory.model` and `inventory.make` with a coefficient of 1.

To set the correlation coefficient, run the following command. Refer to the [ANALYZE](https://docs.singlestore.com/db/v9.1/reference/sql-reference/operational-commands/analyze.md) page for more details:

```sql
ANALYZE TABLE inventory CORRELATE COLUMN inventory.model WITH COLUMN inventory.make USING COEFFICIENT 1.0;
```

To view the correlation coefficient set between two columns on a table, run the `information_schema.correlated_column_statistics` statement:

```sql
SELECT * FROM information_schema.correlated_column_statistics;

```

```output

+---------------+------------+------------------------+-----------------------+-------------------|---------------------+
| DATABASE_NAME	| TABLE_NAME | CORRELATED_COLUMN_NAME | CORRELLEE_COLUMN_NAME |	CORRELATION_COEFF | CREATED_TIME        |
+---------------+------------+------------------------+-----------------------+-------------------+---------------------+
| test1         | inventory  | model                  | make                  | 1                 | 2023-12-07 22:02:31 |
+---------------+------------+------------------------+-----------------------+-------------------+---------------------+

```

***

Modified at: July 2, 2025

Source: [/db/v9.1/query-data/query-tuning/statistics-and-sampling/statistics-and-sampling-concepts/](https://docs.singlestore.com/db/v9.1/query-data/query-tuning/statistics-and-sampling/statistics-and-sampling-concepts/)

(An index of the documentation is available at /llms.txt)
